- Hey, everyone. Welcome to the Latent Space Podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners. And I'm joined by my co-host Swyx, founder of Smol.ai. - Hey, and today in the studio we have Suhail Doshi. Welcome. - Yeah, thanks for having me. - Among many things, you're a CEO and co-founder of Mixpanel.
- Yep. - And I think about three years ago, you left to start Mighty. - Mm-hmm. - And more recently, I think about a year ago, transitioned into Playground. And you've just announced your new round. I'd just like to start, touch on Mixpanel a little bit, 'cause it's obviously one of the more sort of successful analytics companies. We previously had Amplitude on.
And I'm curious if you had any sort of reflections on just that overall, the interaction of that amount of data that people would want to use for AI. Like, I don't know if there's still a part of you that stays in touch with that world. - Yeah, I mean, it's, I mean, you know, the short version is that maybe back in like 2015 or '16, I don't really remember exactly, 'cause it was a while ago, we had an ML team at Mixpanel.
And I think this is like when maybe deep learning or something like really just started getting kind of exciting. And we were thinking that maybe we, given that we had such vast amounts of data, perhaps we could predict things. So we built, you know, two or three different features.
I think we built a feature where we could predict whether users would churn from your product. We made a feature that could predict whether users would convert. We tried to build a feature that could do anomaly detection. Like if something occurred in your product, that was just very surprising, maybe a spike in traffic in a particular region.
Could we tell you in advance? Could we tell you that that happened? 'Cause it's really hard to like know everything that's going on with your data. Could we tell you something surprising about your data? And we tried all of these various features. Most of it boiled down to just like, you know, using logistic regression.
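As a rough illustration of the kind of churn predictor described here, the following is a minimal sketch of logistic regression trained with plain gradient descent. It is not Mixpanel's actual feature: the two input features (scaled session count and scaled days since last visit), the toy data, and every function name are assumptions made up for the example.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_churn_probability(weights, bias, x):
    """Probability that the user described by feature vector x churns."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic_regression(features, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = predict_churn_probability(weights, bias, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical toy data: [scaled sessions last week, scaled days since last visit].
features = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.1], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
labels = [0, 0, 0, 1, 1, 1]  # 1 means the user churned

weights, bias = train_logistic_regression(features, labels)
active_user = predict_churn_probability(weights, bias, [0.85, 0.1])
lapsed_user = predict_churn_probability(weights, bias, [0.1, 0.95])
print(f"active user churn probability: {active_user:.2f}")
print(f"lapsed user churn probability: {lapsed_user:.2f}")
```

On this toy data the active user should score well below 0.5 and the lapsed user well above it, since the model is little more than a weighted sum passed through a sigmoid.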
And it never quite seemed very groundbreaking in the end. And so I think, you know, we had a four or five person ML team and I think we never expanded it from there. And I did all these fast.ai courses trying to learn about ML. And that was the, that's the-- - That's the first time you did fast.ai.
- Yeah, that was the first time I did fast.ai. Yeah, I think I've done it now three times maybe. - Oh, okay. I didn't know it was a third. Okay. - No, no, just me reviewing it is maybe three times, but yeah. - Yeah, yeah, yeah. I mean, I think you mentioned prediction, but honestly like it's also just about the feedback, right?
The quality of feedback from users, I think it's useful for anyone building AI applications. - Yeah. - Yeah, self-evident. - Yeah, I think I haven't spent a lot of time thinking about Mixpanel 'cause it's been a long time, but yeah, I wonder now, given everything that's happened, like, you know, sometimes I'm like, oh, I wonder what we could do now.
And then I kind of like move on to whatever I'm working on, but things have changed significantly since. So yeah. - Yeah. Awesome. And then maybe we'll touch on Mighty a little bit. Mighty was very, very bold. It was basically, well, my framing of it was, you will run our browsers for us because everyone has too many tabs open.
I have too many tabs open and it's slowing down my machine, so you can do it better for us in a centralized data center. - Yeah, we were first trying to make a browser that we would stream from a data center to your computer at extremely low latency. But the real objective wasn't trying to make a browser or anything like that.
The real objective was to try to make a new kind of computer. And the thought was just that, like, you know, we have these computers in front of us today and we upgrade them or they run out of RAM or they don't have enough RAM or not enough disk, or, you know, there's some limitation with our computers, perhaps like data locality is a problem.
Could we, you know, why do I need to think about upgrading my computer ever? And so, you know, we just had to kind of observe that, like, well, actually, it seems like a lot of applications are just now in the browser. You know, it's like how many real desktop applications do we use relative to the number of applications we use in the browser?
So it was just this realization that actually, like, you know, the browser was effectively becoming more or less our operating system over time. And so then that's why we kind of decided to go, hmm, maybe we can stream the browser. Unfortunately, the idea did not work for a couple of different reasons, but the objective was to try to make a true new computer.
- Yeah, very bold, very bold. - Yeah, and I was there at YC Demo Day when you first announced it. - Oh, okay. - I think the last, or one of the last in-person ones, or like the PR 34 in Mission Bay. - Yeah, before COVID. - How do you think about that now when everybody wants to put some of these models in people's machines and some of them want to stream them in?
Do you think there's maybe another wave of the same problem before it was like browser apps too slow, and now it's like models too slow to run on device? - Yeah, I think, you know, we obviously pivoted away from Mighty, but a lot of what I somewhat believed at Mighty is like somewhat very true.
Maybe why I'm so excited about AI and what's happening. A lot of what Mighty was about was like moving compute somewhere else, right? Right now applications, they get limited quantities of memory, disk, networking, whatever your home network has, et cetera. You know, what if these applications could somehow, if we could shift compute, and then these applications have vastly more compute than they do today.
Right now it's just like client backend services, but you know, what if we could change the shape of how applications could interact with things? And it's changed my thinking. In some ways, AI is like a bit of a continuation of my belief that like, perhaps we can really shift compute somewhere else.
One of the problems with Mighty was that JavaScript is single-threaded in the browser. And what we learned, you know, the reason why we kind of abandoned Mighty was because I didn't believe we could make a new kind of computer. We could have made some kind of enterprise business, probably could have made maybe a lot of money, but it wasn't going to be what I hoped it was going to be.
And so once I realized that most of a web app is just going to be single-threaded JavaScript, then the only thing you could do, notwithstanding changing JavaScript, which is a fool's errand most likely, is make a better CPU, right? And there's like three CPU manufacturers, two of which sell, you know, big ones, you know, AMD, Intel, and then of course, like Apple made the M1.
And it's not like single-threaded CPU core performance, single-core performance, was increasing very fast. It's plateauing rapidly. And even these different like companies were not doing as good of a job, you know, sort of with the continuation of Moore's law. But what happened in AI was that you got like, like if you think of the AI model as like a computer program which is like a compiled computer program, it is literally built and designed to do massive parallel computations.
And so if you could take like the universal approximation theorem to its like kind of logical complete point, you know, you're like, wow, I can make computation happen really rapidly and in parallel somewhere else. You know, so you end up with these like really amazing models that can like do anything.
It just turned out like perhaps, perhaps the new kind of computer would just simply be shifted, you know, into these like really amazing AI models in reality. - Yeah. Like I think Andrej Karpathy has been making a lot of analogies with the LLM OS. - Yeah, I saw his, yeah, I saw his video and I watched that, you know, maybe two weeks ago or something like that.
And I was like, oh man, this, I very much resonate with this like idea. - Why didn't I see this three years ago? - Yeah, I think, I think there still will be, you know, local models and then there'll be these very large models that have to be run in data centers.
Yeah, I think it just depends on kind of like the right tool for the job, like any, like any engineer would probably care about. But I think that, you know, by and large, like if the models continue to kind of keep getting bigger, you know, it's gonna, it's, you're always going to be wondering whether you should use the big thing or the small, you know, the tiny little model.
And it might just depend on like, you know, do you need 30 FPS or 60 FPS? Maybe that would be hard to do, you know, over a network. - Yeah, you tackle the much harder problem latency-wise, you know, than the AI models actually require. - Yeah, yeah, you can do quite well.
You can do quite well. You know, we definitely did 30 FPS video streaming, did very crazy things to make that work. So I'm actually quite bullish on the kinds of things you can do with networking. - Yeah, right. Maybe someday you'll come back to that at some point. - But so for those that don't know, you're very transparent on Twitter.
Very good to follow you just to learn your insights. And you actually published a postmortem on Mighty that people can read up on if they're willing to. And so there was a bit of an overlap. You started exploring the AI stuff in June, 2022, which is when you started saying like, "I'm taking Fast.ai again." Maybe, was there more context around that?
- Yeah, I think I was kind of like waiting for the team at Mighty to finish up something. And I was like, "Okay, well, what can I do? I guess I will make some kind of like address bar predictor in the browser." So we had forked Chrome and Chromium.
And I was like, "You know, one thing that's kind of lame is that like this browser should be like a lot better at predicting what I might do, where I might wanna go." You know, it struck me as really odd that Chrome had very little AI actually, or ML inside this browser.
And for a company like Google, you'd think there's a lot, but it's actually just like the code is actually just very, you know, the address bar is more or less just a bunch of if-then statements. So it seemed like a pretty big opportunity. And that's also where a lot of people interact with the browser.
So long story short, I was like, "Hmm, I wonder what I could build here." So I started to, yeah, take some AI courses and try to review the material again and get back to figuring it out. But I think that was somewhat serendipitous because right around April was, I think, a very big watershed moment in AI 'cause that's when DALL-E 2 came out.
And I think that was the first like truly big viral moment for generative AI. - Because of the avocado chair. - Because of the avocado chair and yeah, exactly. Yeah, it was just so novel. - It wasn't as big for me as Stable Diffusion. - Really? - Yeah, I don't know.
People were like, "All right, that's cool." I don't know. (laughs) - Yeah. - I mean, they had some flashy videos, but I never really, it didn't really register with me as-- - But just that moment of images was just such a viral, novel moment. I think it just blew people's mind.
- Yeah, I mean, it was the first time I encountered Sam Altman 'cause they had this DALL-E 2 hackathon. They opened up the OpenAI office for developers to walk in back when it wasn't as, I guess, much of a security issue as it is today. Maybe take us through the journey to decide to pivot into this.
And also, choosing images. Obviously, you were inspired by DALL-E, but there could be any number of AI companies and businesses that you could start in the widest one, right? - Yeah. - So there must be an idea maze from June to September. - Yeah, yeah, there definitely was. So I think at that time, OpenAI was not quite as popular as it is all of a sudden now these days.
But back then, I think they were more than happy. They had a lot more bandwidth to help anybody. And so we had been talking with the team there around trying to see if we could do really fast, low-latency address bar prediction with GPT-3 and 3.5 and that kind of thing.
And so we were sort of figuring out how could we make that low-latency. I think that just being able to talk to them and kind of being involved gave me a bird's-eye view into a bunch of things that started to happen. Obviously, first was the DALL-E 2 moment, but then Stable Diffusion came out, and that was a big moment for me as well.
And I remember just kind of sitting up one night thinking, I was like, "What are the kinds of companies one could build? What matters right now?" One thing that I observed is that I find a lot of great, I find a lot of inspiration when I'm working in a field in something, and then I can identify a bunch of problems.
Like for Mixpanel, I was an intern at a company, and I just noticed that they were doing all this data analysis. And so I thought, "Hmm, I wonder if I could make a product, and then maybe they would use it." And in this case, the same thing kind of occurred.
It was like, okay, there are a bunch of infrastructure companies that are doing, they put a model up, and then you can use their API, like Replicate is a really good example of that. There are a bunch of companies that are helping you with training, model optimization, Mosaic at the time, and probably still was doing stuff like that.
So I just started listing out every category of everything, of every company that was doing something interesting. Obviously, Weights & Biases. I was like, "Oh man, Weights & Biases is this great company. Do I want to compete with that company? I might be really good at competing with that company." Because of Mixpanel, 'cause it's so much of analysis.
I was like, "No, I don't want to do anything related to that. I think that would be too boring now at this point." But, so I started to list out all these ideas, and one thing I observed was that at OpenAI, they have a playground for GPT-3, right? And all it was was just a text box, more or less.
And then there were some settings on the right, like temperature and whatever. - Top K, Top N. - Yeah, Top K. What's your end stop sequence? I mean, that was like their product before ChatGPT. You know, really difficult to use, but fun if you're like an engineer. And I just noticed that their product kind of was evolving a little bit, where the interface kind of was getting more and more, a little bit more complex.
They had like a way where you could like, generate something in the middle of a sentence, and all those kinds of things. And I just thought to myself, I was like, "You know, there's not, everything is just like this text box, and you generate something, and that's about it." And Stable Diffusion had kind of come out, and it was all like Hugging Face and code.
Nobody was really building any UI. And so I had this kind of thing where I wrote prompt dash, like question mark in my notes. And I didn't know what was like the product for that, at the time. I mean, it seems kind of trite now. But yeah, I just like wrote prompt.
What's the thing for that? - Manager, prompt. - Prompt manager. Do you organize them? Like, do you like have a UI that can like-- - Library. - Play with them? Yeah, like a library. What would you make? And so then of course, then you thought about, what would the modalities be, given that?
How would you build a UI for each kind of modality? And so there were a couple people working on some pretty cool things. And I basically chose graphics because it seemed like the most obvious place where you could build a really powerful, complex UI that's not just only typing in a box.
That it would very much evolve beyond that. Like, what would be the best thing for something that's visual? Probably something visual. So yeah, I think that just that progression kind of happened and it just seemed like there was a lot of effort going into language, but not a lot of effort going into graphics.
And then maybe the very last thing was, I think I was talking to Aditya Ramesh, who is the co-creator of DALL-E 2, and Sam. And I just kind of went to these guys and I was just like, hey, are you gonna make like a UI for this thing? Like a true UI, are you gonna go for this?
Are you gonna make a product? - For DALL-E, yeah. - For DALL-E, yeah. Are you gonna do anything here? 'Cause if you are gonna do it, just let me know and I will stop and I'll go do something else. But if you're not gonna do anything, I'll just do it.
And so we had a couple of conversations around what that would look like. And then I think ultimately they decided that they were gonna focus on language primarily. And yeah, I just felt like it was gonna be very underinvested in. - Yes, there's that sort of underinvestment from OpenAI, which I can see that.
But also it's a different type of customer than you're used to. Presumably, you and Mixpanel are very good at selling to B2B developers. With Playground, you're not. - Yeah. - Was that not a concern? - Well, not so much, because I think that right now I would say graphics is in this very nascent phase.
Like most of the customers are just like hobbyists, right? Like it's a little bit of like a novel toy as opposed to being this like very high utility thing. But I think ultimately if you believe that you could make it very high utility, then probably the next customers will end up being B2B.
It'll probably not be like consumer. Like there will certainly be a variation of this idea that's in consumer. If your quest is to kind of make like a super, something that surpasses human ability for graphics, like ultimately it will end up being used for business. So I think it's maybe more of a progression.
In fact, for me, it's maybe more like Mixpanel started out as SMB, and then very much like ended up starting to grow up towards enterprise. So for me, it's a little, I think it will be a very similar progression. - Yeah, yeah. - But yeah, I mean, the reason why I was excited about it is 'cause it was a creative tool.
I make music and it's AI. It's like something that I know I could stay up till three o'clock in the morning doing. Those are kind of like very simple bars for me. - Yeah. - Yeah. It's good decision criteria. - So you mentioned DALL-E, Stable Diffusion. You just had Playground V2 come out two days ago?
- Yeah, two days ago, yeah. - Two days ago. So this is a model you trained completely from scratch. So it's not a cheap fine-tune on something. You open sourced everything, including the weights. Why did you decide to do it? I know you supported Stable Diffusion XL in Playground before, right?
- Yep. - Yeah, what made you want to come up with V2 and maybe some of the interesting, technical research work you've done? - Yeah, so I think that we continue to feel like graphics and these foundation models for anything really related to pixels, but also definitely images, continues to be very under-invested.
It feels a little like graphics is in this GPT-2 moment, right, like even GPT-3. Even when GPT-3 came out, it was exciting. But it was like, what are you gonna use this for? You know, yeah, we'll do some text classification and some semantic analysis, and maybe it'll sometimes make a summary of something and it'll hallucinate.
But no one really had a very significant business application for GPT-3. And in images, we're kind of stuck in the same place. We're kind of like, okay, I write this thing in a box and I get some cool piece of artwork and the hands are kind of messed up and sometimes the eyes are a little weird.
Maybe I'll use it for a blog post, that kind of thing. The utility feels so limited. And so, you know, and then you sort of look at stable diffusion and we definitely use that model in our product and our users like it and use it and love it and enjoy it.
But it hasn't gone nearly far enough. So we were kind of faced with the choice of, you know, do we wait for progress to occur or do we make that progress happen? So, yeah, we kind of embarked on a plan to just decide to go train these things from scratch.
And I think the community has given us so much. The community for stable diffusion, I think, is one of the most vibrant communities on the internet. It's like amazing. It feels like, I hope this is what Homebrew Club felt like when computers showed up because it's like amazing what that community will do.
And it moves so fast. I've never seen anything in my life where so far, and heard other people's stories around this, where a research, an academic research paper comes out and then like two days later, someone has sample code for it and then two days later, there's a model and then two days later, it's like in nine products.
- Yeah. - Competing with each other. - Yeah. - It's incredible to see like math symbols on an academic paper go to features, well-designed features in a product. So I think the community has done so much. So I think we wanted to give back to the community kind of on our way.
We knew it wasn't going to be, we knew it was not ever going to be, certainly we would train a better model than what we gave out on Tuesday. But we definitely felt like there needs to be some kind of progress in these open source models. The last kind of milestone was in July when Stable Diffusion XL came out, but there hasn't been anything really since, right?
- And there's XL Turbo now. - Well, XL Turbo is like this distilled model, right? So it's like lower quality, but fast. You have to decide what your trade-off is there. - And it's also a consistency model? - It's not, I don't think it's a consistency model. It's like, they did like a different thing.
- Yeah. - Yeah, I think it's like, I don't want to get quoted for this, but it's like something called ADD, like adversarial something or another. - That's exactly right. - Yeah, I think it's, I've read something about that. Maybe it's like closer to GANs or something, but I didn't really read the full paper.
But yeah, there hasn't been quite enough progress in terms of, you know, there's no multitask image model. You know, the closest thing would be something called like Emu Edit, but there's no model for that. It's just a paper that's within Meta. So we did that and we also gave out pre-trained weights, which is very rare.
Usually you just get the aligned model and then you have to like, see if you can do anything with it. We actually gave out, there's like a 256 pixel pre-trained stage and a 512. And we did that for academic research, 'cause there's a whole bunch of, we come across people all the time in academia and they have like, they have access to like one A100 or eight at best.
And so if we can give them kind of like a 512 pre-trained model, it might, our hope is that there'll be interesting novel research that occurs from that. - What research do you want to happen? - I would love to see more research around, you know, things that users care about tend to be things like character consistency.
- Between frames? - More like if you have like a face. Yeah, yeah, basically between frames, but more just like, you know, you have your face and it's in, you know, one image and then you want it to be like in another. And users are very particular and sensitive to faces changing.
'Cause we know, we know what, you know, we're trained on faces as humans. And, you know, that's something I don't, I'm not seeing a lot of innovation, enough innovation around multitask editing. You know, there are two things like InstructPix2Pix and then the Emu Edit paper that are maybe very interesting, but we certainly are not pushing the fold on that in that regard.
It just, all kinds of things like around that rotation, you know, being able to keep coherence across images, style transfer is still very limited. Just even reasoning around images, you know, what's going on in an image, that kind of thing. Things are still very, very underpowered, very nascent. So therefore the utility is very, very limited.
- On the 1K Prompt Benchmark, you are 2.5x preferred to Stable Diffusion XL. How do you get there? Is it better images in the training corpus? Is it, yeah, can you maybe talk through the improvements in the model? - I think we're still very early on in the recipe, but I think it's a lot of like little things.
And, you know, every now and then there are some big important things. Like certainly your data quality is really, really important. So we spend a lot of time thinking about that. But I would say it's a lot of things that you kind of clean up along the way as you train your model.
Everything from captions to the data that you align with after pre-train to how you're picking your data sets, how you filter your data sets. There's a lot, I feel like there's a lot of work in AI that's like, doesn't really feel like AI. It just really feels like just data set filtering and systems engineering.
And just like, you know, and the recipe is all there, but it's like a lot of extra work to do that. So I think these models, I think whatever version, I think we plan to do a Playground V 2.1, maybe either by the end of the year or early next year.
And we're just like watching what the community does with the model. And then we're just gonna take a lot of the things that they're unhappy about and just like fix them. You know, so for example, like maybe the eyes of people in an image don't feel right. They feel like they're a little misshapen or they're kind of blurry feeling.
That's something that we already know we wanna fix. So I think in that case, it's gonna be about data quality. Or maybe you wanna improve the kind of the dynamic range of color. You know, we wanna make sure that that's like got a good range in any image. So what technique can we use there?
There's different things like offset noise, pyramid noise, zero terminal SNR. Like there are all these various interesting things that you can do. So I think it's like a lot of just like tricks. Some are tricks, some are data, and some is just like cleaning. Yeah. - Specifically for faces, it's very common to use a pipeline rather than just train the base model more.
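For readers unfamiliar with the first trick named above: offset noise, as it is commonly described in the diffusion fine-tuning community, adds one small random constant per image channel on top of the usual per-pixel Gaussian noise during training, so the model also learns to shift overall brightness and color. The sketch below is only an illustration in plain Python, not Playground's training code. The function name `offset_noise` and the `strength` parameter are hypothetical, and real training code would use a tensor library rather than nested lists.

```python
import random


def offset_noise(batch, channels, height, width, strength=0.1, rng=None):
    """Gaussian noise plus one shared random offset per (image, channel).

    Returns nested lists shaped [batch][channels][height][width].
    With strength=0.0 this reduces to ordinary per-pixel Gaussian noise.
    """
    if rng is None:
        rng = random.Random()
    noise = []
    for _ in range(batch):
        image = []
        for _ in range(channels):
            # One constant for the whole channel: this is the "offset".
            offset = strength * rng.gauss(0.0, 1.0)
            plane = [
                [rng.gauss(0.0, 1.0) + offset for _ in range(width)]
                for _ in range(height)
            ]
            image.append(plane)
        noise.append(image)
    return noise


# Usage: noise for two 3-channel 4x4 images, seeded for reproducibility.
sample = offset_noise(2, 3, 4, 4, strength=0.1, rng=random.Random(0))
print(len(sample), len(sample[0]), len(sample[0][0]), len(sample[0][0][0]))
```

Because the offset is constant across a channel, it changes the mean of the noise for that channel while leaving the per-pixel variation untouched, which is the property the technique relies on.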
Do you have a strong belief either way on like, oh, they should be separated out to different stages for like improving the eyes, improving the face or enhance or whatever? Or do you think like it can all be done in one model? - I think we will make a unified model.
- Okay. - Yeah, I think we'll certainly in the end, ultimately make a unified model. There's not enough research about this. Maybe there is something out there that we haven't read. There are some bottlenecks, like for example, in the VAE, like the VAEs are ultimately like compressing these things.
And so you don't know, and then you might have like a big information bottleneck. So maybe you would use a pixel based model, perhaps. You know, there's a lot of belief. I think we've talked to people, everyone from like Rombach to various people, Rombach trained stable diffusion. You know, I think there's like a big question around the architecture of these things.
It's still kind of unknown, right? Like we've got transformers and we've got like a GPT architecture model, but then there's this like weird thing that's also seemingly working with diffusion. And so, you know, are we going to use vision transformers? Are we going to move to pixel based models?
Is there a different kind of architecture? We don't really, I don't think there have been enough experiments in this regard. - Still? Oh my God. - Yeah. - That's surprising. - Yeah, I think it's very computationally expensive to do a pipeline model where you're like fixing the eyes and you're fixing the mouth and you're fixing the hands.
- That's what everyone does as far as I understand. - Well, I'm not sure, I'm not exactly sure what you mean, but if you mean like you get an image and then you will like make another model specifically to fix a face. Yeah, I think that's a very computationally, that's fairly computationally expensive.
And I think it's like not, probably not the right thing, right way. - Yeah. - Yeah. And it doesn't generalize very well. Now you have to pick all these different things. - Yeah, you're just kind of glomming things on together. Like when I look at AI artists, like that's what they do.
- Ah, yeah, yeah, yeah. They'll do things like, you know, I think a lot of artists will do, you know, ControlNet tiling to do kind of generative upscaling of all these different pieces of the image. Yeah, I mean, to me, these are all just like, they're all hacks, ultimately in the end.
I mean, it just, to me, it's like, let's go back to where we were just three years, four years ago with where deep learning was at and where language was at. You know, it's the same thing. It's like, we were like, okay, well, I'll just train these very narrow models to try to do these things and kind of ensemble them or pipeline them to try to get to a best-in-class result.
And here we are with like where the models are gigantic and like very capable of solving huge amounts of tasks when given like lots of great data. So, yeah. - Makes sense. You also released a new benchmark called MJHQ-30K for automatic evaluation of a model's aesthetic quality. I have one question.
The dataset that you use for the benchmark is from Midjourney. - Yes. - You have 10 categories. How do you think about the Playground model versus Midjourney? - You know, there are a lot of people, a lot of people in research like to come up with, they like to compare themselves to something they know they can beat, right?
But maybe this is the best reason why it can be helpful to not be a researcher also sometimes. Like I'm not like trained as a researcher. I don't have a PhD in anything AI related, for example. But I think if you care about products and you care about your users, then the most important thing that you wanna figure out is like everyone has to acknowledge that MidJourney is very good.
You know, they are the best at this thing. We would, I would happily, I'm happy to admit that. I have no problem admitting that. It's just easy. It's very visual to tell. So, you know, I think it's incumbent on us to try to compare ourselves to the thing that's best, even if we lose, even if we're not the best, right?
And, you know, at some point, if we are able to surpass MidJourney, then we only have ourselves to compare ourselves to. But on first blush, you know, I think it's worth comparing yourself to maybe the best thing and try to find like a really fair way of doing that.
So I think more people should try to do that. I definitely don't think you should be kind of comparing yourself on like some Google model or some old SD, you know, Stable Diffusion model and be like, look, we beat, you know, Stable Diffusion 1.5. I think users ultimately care, you know, how close are you getting to the thing that, like, people mostly agree is the best.
So we put out that benchmark for no other reason than to say like, this seems like a worthy thing for us to at least try, you know, for people to try to get to. And then if we surpass it, great, we'll come up with another one. - Yeah, no, that's awesome.
And you kill Stable Diffusion XL and everything. In the benchmark chart, it says Playground V2 1024 pixel dash aesthetic. - Yes. - You have kind of like, yeah, style fine-tunes or like, what's the dash aesthetic for? - We debated this, maybe we named it wrong or something, but we were like, how do we help people realize the model that's aligned versus the models that weren't.
So because we gave out pre-trained models, we didn't want people to like use those. So that's why they're called base. And then the aesthetic model, yeah, we wanted people to pick up the thing that we thought would be like the thing that makes things pretty. Who wouldn't want the thing that's aesthetic?
But if there's a better name, we definitely are open to feedback. - No, no, that's cool. I was using the product. You also have the style filter and you have all these different styles. And it seems like the styles are tied to the model. So there's some like SDXL styles, there's some Playground V2 styles.
Can you maybe give listeners an overview of how that works? Because in language, there's not this idea of like style, right, versus like in vision model there is, and you cannot get certain styles in different models. How do styles emerge and how do you categorize them and find them?
- Yeah, I mean, it's so fun having a community where people are just trying a model. Like it's only been two days for Playground V2 and we actually don't know what the model's capable of and not capable of. You know, we certainly see problems with it, but we have yet to see what emergent behavior is.
I mean, we've just sort of discovered that it takes about like a week before you start to see like new things. But I think like a lot of that style kind of emerges after that week where you start to see, you know, there's some styles that are very like well-known to us, like maybe like pixel art is a well-known style.
But then there's some style, photo realism is like another one that's like well-known to us. But there are some styles that cannot be easily named. You know, it's not as simple as like, okay, that's an anime style. It's very visual. And in the end, you end up making up the name for what that style represents.
And so the community kind of shapes itself around these different things. And so anyone that's into Stable Diffusion and into building anything with graphics and stuff with these models, you know, you might've heard of like ProtoVision or DreamShaper, some of these weird names. But they're just, you know, invented by these authors, but they have a sort of je ne sais quoi that, you know, appeals to users.
- Because it like roughly embeds to what you want. - I guess so. I mean, it's like, you know, there's this one of my favorite ones that's fine-tuned. It's not made by us. It's called like Starlight XL. It's just this beautiful model. It's got really great color contrast and visual elements.
And the users love it. I love it. And yeah, it's so hard. I think that's like a very big open question with graphics that I'm not totally sure how we'll solve. Yeah, I think a lot of styles are sort of, I don't know, it's like an evolving situation too, 'cause styles get boring, right?
They get fatigued. It's like listening to the same style of pop song. I kind of, I try to relate to graphics a little bit like with music, because I think it gives you a little bit of a different shape to things. Like in music, it's not just, it's not as if we just have pop music and, you know, rap music and country music.
Like they're all of these, like the EDM genre alone has like sub genres. And I think that's very true in graphics and painting and art and anything that we're doing. There's just these sub genres, even if we can't quite always name them. But I think they are emergent from the community, which is why we're so always happy to work with the community.
- Yeah, that is a struggle, you know, coming back to this, like B2B versus B2C thing. B2C, you're gonna have a huge amount of diversity and then it's gonna reduce as you get towards more sort of B2B type use cases. I'm making this up here, tell me if you disagree.
So like you might be optimizing for a thing that you may eventually not need. - Yeah, possibly. Yeah, possibly. Yeah, I try not to share, I think like a simple thing with startups is that I worry sometimes by trying to be overly ambitious and like really scrutinizing like what something is in its most nascent phase that you miss the most ambitious thing you could have done.
Like just having like very basic curiosity with something very small can like kind of lead you to something amazing. Like Einstein definitely did that. And then he, like, you know, basically won all the prizes and got everything he wanted, and then he kind of dismissed quantum and just kind of was still searching, you know, for the unifying theory.
And he like had this quest. I think that happens a lot with like Nobel prize people. I think there's like a term for it that I forget. I actually wanted to go after a toy almost intentionally. So long as that I could see, I could imagine that it would lead to something very, very large later.
And so, yeah, it's a very, like I said, it's very hobbyist, but you need to start somewhere. You need to start with something that has a big gravitational pull, even if these hobbyists aren't likely to be the people that, you know, have a way to monetize it or whatever, even if they're, but they're doing it for fun.
So there's something there that I think is really important. But I agree with you that, you know, in time, we're gonna have to focus, we will absolutely focus on more utilitarian things, like things that are more related to editing feats that are much harder. But, and so I think like a very simple use case is just, you know, I'm not a graphics designer.
I don't know if you guys are, but, you know, it seems very simple: if we could give you the ability to do really complex graphics without skill, wouldn't you want that? You know, like my wife the other day said, ah, I wish Playground was better, because, you know, when are you guys gonna have a feature where like we could make my son, his name's Devin, smile when he was not smiling in the picture for the holiday card, right?
You know, just being able to highlight his mouth and just say like, make him smile. Like, why can't we do that with like high fidelity and coherence? Little things like that, all the way to, you know, putting you in completely different scenarios. - Is that true? Can we not do that with inpainting?
- You can do inpainting, but the quality is just so bad. Yeah, it's just really terrible quality. You know, it's like, you'll do it five times and it'll still like kind of look crooked or have artifacts. Part of it's like, you know, the lips on the face are so, there's such little information there.
It's so small that the models really struggle with it. Yeah. - Make the picture smaller and you won't see it. - Wait, I think, I think that's my trick, I don't know. - Yeah, yeah, that's true. Or, you know, you could take that region and make it really big and then like say it's a mouth and then like shrink it.
It feels like you're wrestling with it more than it's doing something that kind of surprises you. Yeah. - It feels like you are very much the internal tastemaker. Like you carry in your head this vision for what a good art model should look like. Is it, do you find it hard to like communicate it to like your team and, you know, other people?
'Cause obviously it's hard to put into words like we just said. - Yeah, it's very hard to explain. Like images have such high bit rate compared to just words. And we don't have enough words to describe these things. It's difficult. I think everyone on the team, if they don't have good kind of like judgment, taste, or like an eye for some of these things, they're like steadily building it 'cause they have no choice, right?
So in that realm, I don't worry too much, actually. Like everyone is kind of like learning to get the eye is what I would call it. But I also have, you know, my own narrow taste. Like, you know, I don't represent the whole population either.
- True, true. - So. - When you benchmark models, you know, like this benchmark we're talking about, we use FID, for Fréchet Inception Distance. Okay, that's one measure, but like it doesn't capture anything you just said about smiles. - Yeah, FID is generally a bad metric. You know, it's good up to a point and then it kind of like is irrelevant.
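For reference, FID (Fréchet Inception Distance) is usually computed by fitting a Gaussian to Inception-network features of a set of real images and a set of generated images, then measuring the distance between the two Gaussians. The formula below is the standard definition, not something specific to Playground's evaluation; $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the feature mean and covariance of the real and generated sets.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Because the score only summarizes whole-dataset feature statistics, it is plausible that small per-image details, such as whether a mouth was edited cleanly, barely move it, which is consistent with the point made in the conversation.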
- Yeah. - Yeah. - And then, so are there any other metrics that you like apart from vibes? I'm always looking for alternatives to vibes. 'Cause vibes don't scale, you know? - You know, it might be fun to kind of talk about this because it's actually kind of fresh.
So up till now, we haven't needed to do a ton of like benchmarking because we hadn't trained our own model and then now we have. So now what? What does that mean? How do we evaluate it? You know, we're kind of like living with the last 48, 72 hours of going, did the way that we benchmark actually succeed?
Did it deliver? Right? You know, like I think Gemini just came out. They just put out a bunch of benchmarks, but all these benchmarks are just an approximation of how you think it's gonna end up with real world performance. And I think that's like very fascinating to me. So if you fake that benchmark, you'll still end up in a really bad scenario at the end of the day.
And so, you know, one of the benchmarks we did was we did a, we kind of curated like a thousand prompts. That's what we published in our blog post, you know, of all these tasks that we, a lot of them, some of them are curated by our team where we know the models all suck at it.
Like my favorite prompt that no model's really capable of is a horse riding an astronaut. - Yeah. - The inverse one. And it's really, really hard to do. - Not in data. - You know, another one is like a giraffe underneath a microwave. How does that work? (laughing) Right?
There's so many of these little funny ones. We do, we have prompts that are just like misspellings of things, right? Just to see if the models will figure it out. - So that's easy. That should embed to the same space. - Yeah. And just like all these very interesting, weird, weirdo things.
And so we have so many of these and then we kind of like evaluate whether the models are any good at it. And the reality is that they're all bad at it. And so then you're just picking the most aesthetic image. But I think, you know, we're just, we're still at the beginning of building like our, like the best benchmark we can that aligns most with just user happiness, I think.
'Cause we're not, we're not like putting these in papers and trying to like win, you know, I don't know, awards at ICCV or something if they have awards. Sorry if they don't. And you could. - Well, that's absolutely a valid strategy. - Yeah, you could. I don't think it could correlate necessarily with the impact we want to have on humanity.
I think we're still evolving whatever our benchmarks are. So the first benchmark was just like very difficult tasks that we know the models are bad at. Can we come up with a thousand of these? Whether they're hand-written and some of them are generated. And then can we ask the users, like, how do we do?
And then we wanted to use a benchmark like PartiPrompts, we mostly did that so people in academia could measure their models against ours versus others. And, but yeah, I mean, FID is pretty bad. And I think, yeah, in terms of vibes, it's like when you put out the model and then you try to see like what users make.
And I think my sense is that we're gonna take all the things that we noticed that the users kind of were failing at and try to find like new ways to measure that, whether that's like a smile or, you know, color contrast or lighting. One benefit of Playground is that we have users making millions of images every single day.
And so we can just ask them. - And they go for like a post-generation feedback. - Yeah, we can just ask them. We can just say like, how good was the lighting here? How was the subject? How was the background? - Oh, like a proper form of like. - It's just like, you make it, you come to our site, you make an image and then we say, and then maybe randomly you just say, hey, you know, like how was the color and contrast of this image?
And you say, it was not very good. And then you just tell us. So I think we can get like tens of thousands of these evaluations every single day to truly measure real world performance as opposed to just like benchmark performance. Hopefully next year, I think we will try to publish kind of like a benchmark that anyone could use, that we evaluate ourselves on and that other people can, that we think does a good job of approximating real world performance because we've tried it and done it and noticed that it did.
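The feedback loop described here (occasionally ask a user one question about an image they just made, then aggregate the answers) could be sketched roughly as follows. This is only an illustration under stated assumptions: the attribute names, the 1 to 5 rating scale, the sampling rate, and the function names `should_ask`, `pick_question`, and `summarize` are all hypothetical, not Playground's actual system.

```python
import random
from collections import defaultdict
from statistics import mean

# Hypothetical attributes a user might be asked to rate after a generation.
ATTRIBUTES = ["lighting", "color_contrast", "subject", "background"]


def should_ask(rng, sample_rate):
    """Decide whether this generation gets a feedback question at all."""
    return rng.random() < sample_rate


def pick_question(rng):
    """Pick one attribute at random so each survey stays a single question."""
    return rng.choice(ATTRIBUTES)


def summarize(responses):
    """Average ratings per (model, attribute).

    responses: iterable of (model_name, attribute, rating) tuples,
    where rating is assumed to be an integer from 1 to 5.
    """
    buckets = defaultdict(list)
    for model, attribute, rating in responses:
        buckets[(model, attribute)].append(rating)
    return {key: mean(values) for key, values in buckets.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    if should_ask(rng, sample_rate=0.05):
        print("ask about:", pick_question(rng))
    print(summarize([("v2", "lighting", 4), ("v2", "lighting", 2)]))
```

Asking only a random fraction of generations, and only one question each, keeps the survey cheap for the user while still yielding many ratings per day at the volumes mentioned.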
Yeah, I think we will do that. - Yeah. I think we're going to ask a few more like sort of product-y questions. I personally have a few like categories that I consider special among, you know, you have like animals, art, fashion, food. There are some categories which I consider like a different tier of image.
So the top among them is text in images. How do you think about that? So one of the big wild ones for me, something I've been looking out for the entire year is just the progress of text in images. Like, do you, can you write in an image? - Yeah.
- Ideogram, I think, came out recently, which had decent but not perfect text in images. DALL-E 3 had improved some and all they said in their paper was that they just included more text in the dataset and it just worked. I was like, that's just, that's just lazy.
(laughing) But anyway, do you care about that? 'Cause I don't see any of that in like your sample. - Yeah, yeah. Yeah, the V2 model was mostly focused on image quality versus like the feature of text synthesis. 'Cause I, well, as a business user, I care a lot about that.
- Yeah. - Right. - Yeah, I'm very excited about text synthesis and yeah, I think Ideogram has done a good job of maybe the best job. DALL-E kind of has like a, it has like a hit rate. You know, you don't want just text effects. I think where this has to go is it has to be like, you could like write little tiny pieces of text like on like a milk carton.
- Yeah. - That's maybe not even the focal point of a scene. - Yeah. - I think that's like a very hard task that if you could do something like that, then there's a lot of other possibilities. - Well, you don't have to zero shot it. You can just be like here and focus on this.
- Sure, yeah, yeah, definitely. Yeah, yeah. So I think text synthesis would be very exciting. - Yeah. And then also flag that Max Woolf, minimaxir, you must have come across his work. He's done a lot of stuff about using like logo masks that then map onto like food or vegetables and it looks like text, which can be pretty fun.
- Yeah, yeah. I mean, it's very interesting to, that's the wonderful thing about like the open source community is that you get things like ControlNet and then you see all these people do these just amazing things with ControlNet and then you wonder, I think from our point of view, we sort of go, that's really wonderful, but how do we end up with like a unified model that can do that?
What are the bottlenecks? What are the issues? Because the community ultimately has very limited resources. - Yeah. - And so they need these kinds of like workaround research ideas to get there, but yeah. - Are techniques like ControlNet portable to your architecture? - Definitely, yeah.
We kept the Playground V2 architecture exactly the same as SDXL, not out of laziness, but just because we knew that the community already had tools. - Yeah. - It's, you know, all you have to do is maybe change a string in your code and then, you know, retrain a ControlNet for it.
So it was very intentional to do that. We didn't want to fragment the community with different architectures. - Yeah. Yeah. I have more questions about that. I don't know. I don't want to DDoS you with topics, but okay. I was basically going to go over three more categories. One is UIs, like app UIs, like mock UIs.
Third is not safe for work, obviously. And then copyrighted stuff. I don't know if you care to comment on any of those. - The NSFW kind of like safety stuff is really important. Part of, I kind of think that one of the biggest risks kind of going into maybe the U.S.
election year will probably be very interrelated with like graphics, audio, video. I think it's going to be very hard to explain, you know, to a family relative who's not kind of in our world. And our world is like sometimes very, you know, we think it's very big, but it's very tiny compared to the rest of the world.
Some people are like, there's still lots of humanity who have no idea what ChatGPT is. And I think it's going to be very hard to explain, you know, to your uncle, aunt, whoever, you know, hey, I saw, you know, I saw President Biden say this thing on a video.
You know, I can't believe, you know, he said that. I think that's going to be a very troubling thing going into the world next year, the year after. - Oh, I didn't, that's more like a risk thing. - Yeah. - Or like deep fakes. Well, faking, political faking. But there's just, there's a lot of studies on how, yeah, for most businesses, you don't want to train on not safe for work images, except that it makes you really good at bodies.
- Yeah, I mean, yeah, I mean, we personally, we filter out NSFW type of images in our data set so that it's, you know, so our safety filter stuff doesn't have to work as hard. - But you've heard this argument that it gets, it makes you worse at, because obviously, not safe for work images are very good at human anatomy, which you do want to be good at.
- Yeah, it's not about like, it's not like necessarily a bad thing to train on that data. It's more about like how you go and use it. That's why I was kind of talking about safety. - Yeah, I see. - You know, in part, because there are very terrible things that can happen in the world.
If you have a sufficiently, you know, extremely powerful graphics model, you know, suddenly like you can kind of imagine, you know, now if you can like generate nudes and then there's like, you can do very character consistent things with faces, like what does that lead to? - Yeah. - I think it's like more what occurs after that, right?
Even if you train on, let's say, you know, nude data, if it does something to kind of help, there's nothing wrong with the human anatomy. It's very valid for a model to learn that, but then it's kind of like, how does that get used? And, you know, I won't bring up all of the very, very unsavory, terrible things that we see on a daily basis on the site.
I think it's more about what occurs. And so we, you know, we just recently did like a big sprint on safety internally around, and it's very difficult with graphics and art, right? Because there is tasteful art that has nudity, right? They're all over in museums, like, you know, it's very, very valid situations for that.
And then there's, you know, there's the things that are the gray line of that. You know, what I might not find tasteful, someone might be like, that is completely tasteful, right? And then there's things that are way over the line. And then there are things that are, you know, maybe you or, you know, maybe I would be okay with, but society isn't.
I think it's really hard with art. I think it's really, really hard. Sometimes even if you have like, even if you have things that are not nude, if a child goes to your site, scrolls down some images, you know, classrooms of kids, you know, using our product, it's a really difficult problem.
And it stretches mostly culture, society, politics, everything, yeah. - Okay. Another favorite topic of our listeners is UX and AI. And I think you're probably one of the best all-inclusive editors for these things. So you don't just have the, you know, prompt images come out, you pray, and now you do it again.
First, you let people pick a seed so they can kind of have semi-repeatable generation. You also have, yeah, you can pick how many images, and then you leave all of them in the canvas, and then you have kind of like this box, the generation box, and you can even cross between them and outpaint, there's all these things.
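The reason a seed gives "semi-repeatable" generation is that a diffusion model starts from random noise, and a fixed seed fixes that noise. The toy sketch below shows only that idea with Python's standard library; `initial_noise` and `generate` are hypothetical names, and a real pipeline would sample a latent tensor on a GPU, where some nondeterminism can remain even with a fixed seed.

```python
import random


def initial_noise(seed, size=4):
    """Draw the starting noise from a private generator seeded by `seed`."""
    rng = random.Random(seed)  # does not touch the global random state
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def generate(prompt, seed=None):
    """Stand-in for an image generation call; returns the noise it would start from."""
    if seed is None:
        # No seed chosen by the user: pick one and report it so the
        # result can be reproduced later.
        seed = random.randrange(2**32)
    return {"prompt": prompt, "seed": seed, "noise": initial_noise(seed)}


if __name__ == "__main__":
    first = generate("a lighthouse at dusk", seed=1234)
    second = generate("a lighthouse at dusk", seed=1234)
    print(first["noise"] == second["noise"])
```

Returning the seed alongside the result is the design choice that matters: it lets a user rerun the same prompt and change one setting at a time.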
How did you get here? You know, most people are kind of like, give me text, I give you image. You know, you're like, these are all the tools for you. - Even though we were trying to make a graphics foundation model, I think we think that we're also trying to like re-imagine like what a graphics editor might look like given the change in technology.
So, you know, I don't think we're trying to build Photoshop, but it's the only thing that we could say that people are, you know, largely familiar with. Oh, okay, there's Photoshop. I think, you know, I don't think you would think of Photoshop without like the, you know, you wouldn't think, what would Photoshop compare itself to pre-computer, I don't know, right?
It's like, or kind of like a canvas, but, you know, there's these menu options, and you can use your mouse, what's a mouse? So I think that we're trying to make like, we're trying to re-imagine what a graphics editor might look like. Not just for the fun of it, but because we kind of have no choice.
Like there's this idea in image generation where you can generate images. That's like a super weird thing. What is that in Photoshop, right? You have to wait right now for the time being, but the wait is worth it often for a lot of people because they can't make that with their own skills.
So I think it goes back to, you know, how we started the company, which was kind of looking at GPT-3's Playground, that the reason why we're named Playground is a homage to that, actually. And, you know, it's like, shouldn't these products be more visual? Shouldn't, you know, shouldn't they, these prompt boxes are like a terminal window, right?
We're kind of at this weird point where it's just like CLI. It's like MS-DOS. I remember my mom using MS-DOS, and I memorized the keywords, like DIR, LS, all those things, right? It feels a little like we're there, right? Prompt engineering is just like-- - The shirt I'm wearing, you know, it's a bug, not a feature.
- Yeah, exactly. Parentheses to say beautiful or whatever, which weights the word token more in the model or whatever. Yeah, it's, that's like super strange. I think that's not, I think everybody, I think a large portion of humanity would agree that that's not user-friendly, right? So how do we think about the products to be more user-friendly?
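The parentheses convention mentioned here comes from community Stable Diffusion front ends, where wrapping a word in parentheses increases its weight, commonly by a factor of about 1.1 per level of nesting. The sketch below is a deliberately simplified parser of that idea: `parse_emphasis` is a hypothetical name, and it ignores explicit weights such as `(word:1.5)`, square brackets, and escaping that real tools may support.

```python
def parse_emphasis(prompt, factor=1.1):
    """Split a prompt into (text, weight) chunks.

    Each enclosing '(' multiplies the weight of the text inside by `factor`.
    Unmatched ')' characters are kept as literal text.
    """
    chunks = []
    buffer = []
    depth = 0

    def flush():
        # Emit whatever text has accumulated at the current nesting depth.
        text = "".join(buffer).strip()
        if text:
            chunks.append((text, round(factor ** depth, 4)))
        buffer.clear()

    for ch in prompt:
        if ch == "(":
            flush()
            depth += 1
        elif ch == ")" and depth > 0:
            flush()
            depth -= 1
        else:
            buffer.append(ch)
    flush()
    return chunks


if __name__ == "__main__":
    print(parse_emphasis("a (beautiful) castle, ((dramatic lighting))"))
```

Seeing the syntax written out makes the usability complaint concrete: the user has to know that punctuation changes numeric weights that are never shown on screen.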
Well, sure, you know, sure it would be nice if I could like, you know, if I wanted to get rid of like the headphones on my head, you know, it'd be nice to mask it, and then say, you know, can you remove the headphones? You know, if I want to grow the, expand the image, it should, you know, how can we make that feel easier without typing lots of words and being really confused?
And by no stretch of the imagination do I think we've nailed the UI/UX yet. Part of that is because we're still experimenting. And part of that is because the model and the technology is going to get better. And whatever felt like the right UX six months ago is going to feel very broken now.
And so that's a little bit of how we got there, is kind of saying, does everything have to be like a prompt in a box? Or can we do, can we do things that make it very intuitive for users? - How do you decide what to give access to?
So you have things like Expand Prompt, which DALL-E 3 just does, it doesn't let you decide whether you should or not. - As in like, rewrites your prompts for you. - Yeah. - Yeah, for that feature, I think once we get it to be cheaper, we'll probably just give it away.
But we also decided something that might be a little bit different. We noticed that most of image generation is just like kind of casual. You know, it's in WhatsApp, it's, you know, it's in a Discord bot somewhere with Midjourney, it's in ChatGPT. One of the differentiators I think we provide, at the expense of just lots of users necessarily, mainstream consumers, is that we provide as much like power and tweakability and configurability as possible.
So the only reason why it's a toggle is because we know that users might want to use it and might not want to use it, right? There are some really powerful power user hobbyists that know what they're doing. And then there's a lot of people that, you know, just want something that looks cool, but they don't know how to prompt.
And so I think a lot of Playground is more about going after that core user base that like knows, has a little bit more savviness in how to use these tools, yeah. So these users probably, you know, the average DALL-E user is probably not going to use ControlNet.
They probably don't even know what that is. And so I think that like, as the models get more powerful, as there's more tooling, yeah, I think you could imagine it, hopefully you'll imagine a new sort of AI first graphics editor that's just as like powerful and configurable as Photoshop.
And you might have to master a new kind of tool. - Yeah, yeah, well. There's so many things I could bounce off of that. One, what you mentioned about waiting. We have to kind of somewhat address the elephant in the room. Consistency models have been blowing up the past month.
Is that, like, how do you think about integrating that? Obviously there's a lot of other companies also trying to beat you to that space as well. - I think we were the first company to integrate it. Well, we integrated it in a different way. There are like 10 companies right now that have kind of tried to do like interactive editing where you can like draw on the left side and then you get an image on the right side.
We decided to kind of like wait and see whether there's like true utility on that. We have a different feature that's like unique in our product that's called preview rendering. And so you go to the product and you say, we're like, what is the most common use case? The most common use case is you write a prompt and then you get an image.
But what's the most annoying thing about that? The most annoying thing is like, it feels like a slot machine, right? You're like, okay, I'm gonna put it in and maybe I'll get something cool. So we did something that seemed a lot simpler but a lot more relevant to how users already use this product, which is preview rendering.
You toggle it on and it will show you a render of the image. And then it's just like, graphics tools already have this. Like if you use Cinema 4D or After Effects or something, it's called viewport rendering. And so we try to take something that exists in the real world that has familiarity and say, okay, you're gonna get a rough sense of an early preview of this thing.
And then when you're ready to generate, we're gonna try to be as coherent about that image that you saw. That way you're not spending so much time just like pulling down the slot machine lever. So we were actually the first company, I think we were the first company to actually ship a quick LCM thing, yeah.
- Okay. (laughing) - We were very excited about it. So we shipped it very quick, yeah. - Yeah, I think like the other, well the demos I've been seeing it's also, I guess, it's not like a preview necessarily. They're almost using it to animate their generations, because you can kind of move shapes over.
- Yeah, yeah, they're like doing it. They're like animating it, but they're sort of showing like if I move a moon, you know, can I, yeah. - Yeah, I don't know. To me it unlocks video in a way. - Yeah. - That-- - But the video models are already so much better than that.
Yeah, so. (laughing) - There's another one which I think is, like how about the just general ecosystem of LoRAs, right? Civitai is obviously the most popular repository of LoRAs. How do you think about sort of interacting with that ecosystem? - Yeah, I mean, the guy that did LoRA, not the guy that invented LoRAs, but the person that brought LoRAs to Stable Diffusion actually works with us on some projects.
His name is Simo. Shout out to Simo. And I think LoRAs are wonderful. Obviously fine-tuning all these DreamBooth models and such, it's just so heavy. And it's obvious in our conversation around styles and vibes that it's very hard to evaluate the artistry of these things.
LoRAs give people this wonderful opportunity to create sub-genres of art. And I think they're amazing. And so any graphics tool, any kind of thing that's expressing art has to provide some level of customization to its user base that goes beyond just typing Greg Rutkowski in a prompt. Right, we have to give more than that.
It's not like users want to type these real artist names. It's that they don't know how else to get an image that looks interesting. They truly want originality and uniqueness. And I think LoRAs provide that. And they provide it in a very nice scalable way. I hope that we find something even better than LoRAs in the long term.
'Cause there are still weaknesses to LoRAs, but I think they do a good job for now. - Yeah, and so you don't want to be the, like you wouldn't ever compete with Civitai. You would just kind of-- - Civitai's a site where like all these things get kind of hosted by the community, right?
And so yeah, we'll often pull down like some of the best things there. I think when we have a significantly better model, we will certainly build something. - I see. - That gets closer to that. I still, again, I go back to saying just, I still think this is like very nascent.
Things are very underpowered, right? LoRAs are not easy for people to train. You know, they're easy for an engineer, but they're not easy, you know, it sure would be nicer if I could just pick, you know, five or six reference images, right? And then say, hey, you know, this is, and they might even be five or six different reference images that are, they're just very different, actually.
Like they communicate a style, but they're actually like, it's like a mood board, right? And you have to be kind of an engineer almost to train these LoRAs or go to some site and be technically savvy at least. It seems like it'd be much better if I could say, I love this style.
I love this style, here are five images. And you tell the model, like, this is what I want. And the model gives you something that's very aligned with what your style is, what you're talking about. And it's a style you couldn't even communicate, right? There's no word, you know, this is, you know, if you have a Tron image, it's not just Tron, it's like Tron plus like four or five different weird things.
- Cyberpunk, yeah. - Yeah, even cyberpunk can have its like sub-genre, right? But I just think training LoRAs and doing that is very heavy, so I hope we can do better than that. - Cool. - Yeah. - We had Sharif from Lexica on the podcast before. - Oh, nice.
- Both of you have like a landing page with just a bunch of images where you can like explore things. - Yeah, yeah, we have a feed. - Yeah, yeah, is that something you see more and more of in terms of like coming up with these styles? Is that why you have that as the starting point versus a lot of other products, you just go in, you have the generation prompt, you don't see a lot of examples?
- Our feed is a little different than their feed. Our feed is more about community. So we have kind of like a Reddit thing going on where it's a kind of a competition like every day, loose competition, mostly fun competition of like making things. And there's just this wonderful community of people where they're liking each other's images and just showing their genuine interest in each other.
And I think we definitely learn about styles that way. One of the funniest polls, if you go to the Midjourney polls, they'll sometimes put these polls out and they'll say, you know, what do you wish you could like learn more from? And like one of the things that people vote the most for is like learning how to prompt, right?
And so I think like, you know, if you put away your research hat for a minute and you just put on like your product hat for a second, you're kind of like, well, why do people want to learn how to prompt, right? It's because they want to get higher quality images.
Well, what's higher quality? Composition, lighting, aesthetics, so on and so forth. And I think that the community on our feed, I think we might have the biggest community and it gives all of the users a way to learn how to prompt because they're just seeing this huge rising tide of all these images that are super cool and interesting and they can kind of like take each other's prompts and like kind of learn how to do that.
I think that'll be short-lived because I think the complexity of these things is going to get higher, but that's more about why we have that feed is to help each other, help teach users and then also just celebrate people's art. - You run your own infra. - We do.
- Yeah, that's unusual. (laughs) - It's necessary. - It's necessary. What have you learned running DevOps for GPUs? You had a tweet about like how many A100s you have, but I feel like it's out of date probably. - I mean, it just comes down to cost. These things are very expensive.
So we just want to make it as affordable for everybody as possible. I find the DevOps for inference to be relatively easy. It doesn't feel that different. I think we had thousands and thousands of servers at Mixpanel just for dealing with the API, which had such huge quantities of volume, so I don't find it
particularly different. I do find model optimization and performance is very new to me. So I find that very difficult at the moment. So that's very interesting. But scaling inference is not terrible. Scaling a training cluster is much, much harder than I perhaps anticipated.
- Why is that? - Well, it's just a very large distributed system: if you have a node that goes down, then your training run crashes, and you have to somehow be resilient to that. And I would say training infra software is very early.
It feels very broken. I can tell in 10 years, it will be a lot better. - Like a Mosaic or whatever. - Yeah, we don't even know. I think we use very basic tools like Slurm for scheduling and just normal PyTorch, PyTorch Lightning, that kind of thing. I think the tooling is nascent.
I think I talked to a friend that's over at xAI. They just, they like built their own scheduler and are doing things with Kubernetes. When people are building out tools because the existing open source stuff doesn't work and everyone's doing their own bespoke thing, you know there's a valuable company to be formed.
- Yeah, I think it's Mosaic. I don't know. - Well, with Mosaic, yeah, it's tough with Mosaic 'cause anyway, I won't go into the details why, but yeah, we found it difficult to do it. It might be worth like wondering like why not everyone is going to Mosaic. Perhaps it's still, I just think it's nascent and perhaps Mosaic will come through.
- Cool, anything for you? - No, no, this was great. And just to wrap, we talked about some of the pivotal moments in your mind with like DALL-E and whatnot. If you were not doing this, what's the most interesting unsolved question in AI that you would try and build in?
- Oh man, coming up with startup ideas is very hard on the spot. - You should, you have to have them. I mean, you're a founder, you're a repeat founder. - I'm very picky about my startup ideas. So I don't have any great ones. The only thing that I, I don't have an idea per se as much as a curiosity.
And I suppose I'll pose it to you guys. Right now, we sort of think that a lot of the modalities just kind of feel like they're vision, language, audio, that's roughly it. And somehow all this will like turn into something, it'll be multimodal and then we'll end up with AGI perhaps.
And I just think that there are probably far more modalities than maybe we, than meets the eye. And it just seems hard for us to see it right now because it's sort of like we have tunnel vision on the moment. - We're just like code, image, audio, video. - Yeah, I think-- - Very, very broad categories.
- I think we are lacking imagination as a species in this regard. And I think like, you know, just like, you know, it's not, I don't know what company would form as a result of this, but you know, like there's some very difficult problems, like just like a true actual, like not a meta world model, but an actual world model that truly maps everything that's going in terms of like physics and fluids and all these various kinds of interactions.
And what does that kind of model, like a true physics foundation model of sorts that represents earth. And that in and of itself seems very difficult, you know, but we just think of, but we're kind of stuck on like thinking that we can approximate everything with like, you know, a word or a token, if you will.
And I went, you know, I had a dinner last night where we were kind of debating this philosophically. And I think someone, you know, said something that I also believe in, which is like, at the end of the day, it doesn't really matter that it's like a token or a byte.
At the end of the day, it's just like some, you know, unit of information that it emits. But, you know, I do wonder if there are more, far more modalities than meets the eye. And if you could create that, then what would that, what would that company become? What problems could you solve?
So I don't know yet, so I don't have a great company for it. - I don't know. Maybe you would just inspire somebody to try. - Yeah, hopefully. - My personal response to that is I'm less interested in physics and more interested in people. Like how do I mind upload?
Because that is teleportation, that is immortality, that is everything. - Yeah, yeah, can we model our own, rather than trying to create consciousness, could we model our own? Even if it was lossy to some extent, yeah. - Yeah. Well, we won't solve that here. If I were to take a Bill Gates book trip and had a week, what should I take with me to learn AI?
- Oh man, oh gosh. You shouldn't take a book, you should just go to YouTube and visit Karpathy's class and just do it, do it, grind through it. - That's actually the most useful thing for you? - I wish it had come out when I started back last year. I'm bummed that I didn't get to take it at the beginning, but I did do a few of his classes regardless.
I don't think books, every time I buy a programming book, I never read it. I always find that just writing code helps cement my internal understanding. - Yeah, so more generally, advice for founders who are not PhDs and are effectively self-taught like you are, what should they do, what should they avoid?
- Same thing that I would advise if you're programming. Pick a project that seems very exciting to you, but doesn't have to be too serious, and build it and learn every detail of it while you do it. - And it must be, should you train? Or can you go far enough not training, just fine-tuning?
- It depends, I would just follow your curiosity. If what you want to do is something that requires fundamental understanding of training models, then you should learn it. You don't have to be a PhD, you don't have to get to become a five-year, whatever, PhD, but if that's necessary, I would do it.
If it's not necessary, then go as far as you need to go, but I would learn, pick something that motivates. I think most people tap out on motivation, but they're deeply curious. - Cool. - Cool. - Thank you so much for coming out, man. - Thank you for having me, appreciate it.
(upbeat music)