
Why Compound AI + Open Source will beat Closed AI — with Lin Qiao, CEO of Fireworks AI



00:00:00.000 | (upbeat music)
00:00:02.580 | - Hey, everyone.
00:00:04.480 | Welcome to the Latent Space Podcast.
00:00:06.160 | This is Alessio, partner and CTO at Decibel Partners,
00:00:08.920 | and I'm joined by my co-host, Swyx, founder of Smol AI.
00:00:11.920 | - Hey, and today we're in a very special studio
00:00:15.600 | inside the Fireworks office with Lin Qiao,
00:00:17.660 | CEO of Fireworks.
00:00:18.500 | Welcome.
00:00:19.320 | - Yeah.
00:00:20.160 | - Oh, you should welcome us.
00:00:21.000 | - Yeah, welcome.
00:00:21.820 | (all laughing)
00:00:24.080 | - Thanks for having us.
00:00:25.080 | It's unusual to be in the home of a startup,
00:00:29.160 | but I think our relationship is a bit unusual
00:00:32.200 | compared to all our normal guests.
00:00:33.960 | - Definitely.
00:00:34.800 | Yeah, I'm super excited to talk about very interesting
00:00:38.800 | topics in that space with both of you.
00:00:41.200 | - You just celebrated your two-year anniversary yesterday.
00:00:43.320 | - Yeah, it's quite a crazy journey.
00:00:45.080 | We circle around and share all the crazy stories
00:00:47.480 | across these two years, and it has been super fun.
00:00:51.300 | All the way from when we experienced the Silicon Valley Bank run.
00:00:56.700 | - Right.
00:00:57.540 | - To when we deleted some data that shouldn't have been deleted.
00:01:02.540 | Operationally, we went through a massive scale
00:01:08.160 | where we actually are busy getting capacity to...
00:01:13.160 | Yeah, we learned to kind of work with it as a team
00:01:17.260 | with a lot of brilliant people across different places,
00:01:21.080 | joining a company.
00:01:22.760 | It has really been a fun journey.
00:01:24.640 | - When you started, did you think the technical stuff
00:01:27.280 | would be harder or the bank run and then the people side?
00:01:30.780 | I think there's a lot of amazing researchers
00:01:32.860 | that want to do companies, and it's like,
00:01:34.600 | the hardest thing is going to be building the product,
00:01:36.520 | and then you have all these different other things.
00:01:38.440 | So, were you surprised by what has been your experience?
00:01:42.680 | - Yeah, to be honest with you,
00:01:44.600 | my focus has always been on the product side,
00:01:47.480 | and then after product, go to market.
00:01:49.420 | And I didn't realize the rest has been so complicated,
00:01:52.520 | operating a company and so on.
00:01:54.400 | But because I don't think about it,
00:01:56.760 | I just kind of manage it.
00:01:58.160 | So, it's done.
00:01:59.280 | (laughs)
00:02:00.120 | So, I think I just somehow don't think about it too much
00:02:04.180 | and solve whatever problem coming our way, and it worked.
00:02:08.440 | - So, I guess let's start at the pre-history,
00:02:10.600 | like the initial history of Fireworks.
00:02:13.740 | You ran the PyTorch team at Meta for a number of years,
00:02:17.920 | and we previously had Soumith Chintala on,
00:02:21.440 | and I think we were just all very interested
00:02:23.680 | in the history of Gen-AI.
00:02:25.720 | Maybe not that many people know
00:02:27.480 | how deeply involved PyTorch and Meta were
00:02:32.500 | prior to the current Gen-AI revolution.
00:02:34.400 | - My background is deep in distributed system,
00:02:38.280 | database management system,
00:02:40.000 | and I joined Meta from the data side.
00:02:43.180 | And I saw this tremendous amount of data growth,
00:02:46.020 | which cost a lot of money,
00:02:48.180 | and we're analyzing what's going on.
00:02:50.440 | And it's clear that AI is driving all this data generation.
00:02:55.100 | So, it's a very interesting time,
00:02:57.080 | because when I joined Meta,
00:02:58.880 | Meta's going through ramping down mobile-first,
00:03:01.880 | finishing the mobile-first transition,
00:03:03.860 | and then starting AI-first.
00:03:05.040 | And there's a fundamental reason about that sequence,
00:03:07.880 | because mobile-first gave a full range of user engagement
00:03:12.160 | that has never existed before.
00:03:14.320 | And all this user engagement generated a lot of data,
00:03:17.240 | and this data power AI.
00:03:19.560 | So, then the whole entire industry is also going through,
00:03:22.000 | following through the same transition.
00:03:24.720 | When I see, oh, okay, this AI is powering
00:03:27.100 | all this data generation,
00:03:28.620 | and look at where's our AI stack,
00:03:30.940 | there's no software, there's no hardware,
00:03:32.260 | there's no people, there's no team.
00:03:34.180 | I'm like, I want to dive up there and help this movement.
00:03:39.180 | So, when I started,
00:03:40.740 | it's a very interesting industry landscape.
00:03:42.780 | There are a lot of AI frameworks.
00:03:44.940 | It's a kind of proliferation of AI frameworks
00:03:48.580 | happening in the industry.
00:03:49.940 | But all the AI frameworks focus on production,
00:03:53.700 | and they use a very certain way
00:03:56.360 | of defining the graph of neural network,
00:03:59.280 | and they use that to drive the model actuation
00:04:02.420 | and productionization.
00:04:04.640 | And PyTorch is completely different.
00:04:06.040 | So, Soumith could also assume that
00:04:07.840 | he was the user of his own product.
00:04:11.100 | As a researcher, he faced so much pain
00:04:13.480 | using the existing AI frameworks.
00:04:15.520 | This is really hard to use,
00:04:16.620 | and I'm gonna do something different for myself.
00:04:19.840 | And that's the origin story of PyTorch.
00:04:21.720 | PyTorch actually started as the framework for researchers.
00:04:25.100 | It didn't care about production at all.
00:04:26.980 | And as it grew in terms of adoption,
00:04:30.140 | so the interesting part of AI is research
00:04:32.380 | is the top of the funnel of production.
00:04:34.100 | There are so many researchers across academic,
00:04:37.680 | across industry, they innovate,
00:04:40.460 | and they put their results out there in open source.
00:04:43.620 | And that power the downstream productionization.
00:04:46.580 | So, it's brilliant for Meta
00:04:47.780 | to establish PyTorch as a strategy
00:04:50.620 | to drive massive adoption in open source,
00:04:53.700 | because Meta internally is a PyTorch shop.
00:04:56.020 | So, it creates a flywheel effect.
00:04:58.740 | So, that's kind of the strategy behind PyTorch.
00:05:00.740 | But when I took on PyTorch,
00:05:02.980 | Meta had established PyTorch
00:05:05.580 | as the framework for both research and production.
00:05:08.860 | So, no one has done that before.
00:05:10.540 | And we have to kind of rethink how to architect PyTorch
00:05:13.380 | so we can really sustain production workload,
00:05:15.820 | the stability, reliability, low latency,
00:05:18.100 | all this production concern was never a concern before,
00:05:21.100 | now it's a concern.
00:05:22.020 | And we actually have to adjust its design
00:05:24.860 | and make it work for both sides.
00:05:26.860 | And that took us five years,
00:05:31.220 | because Meta has so many AI use cases,
00:05:33.180 | all the way from ranking and recommendation
00:05:35.300 | powering the business top line,
00:05:37.500 | like newsfeed ranking, video ranking,
00:05:41.540 | to site integrity, detecting bad content automatically using AI,
00:05:44.340 | to all kinds of effects: translation,
00:05:47.140 | image classification, object detection, all this.
00:05:49.940 | And also across AI running on the server side,
00:05:54.580 | on mobile phones, on AR/VR devices, the wide spectrum.
00:05:57.780 | So by that time, we actually basically managed
00:06:02.540 | to support AI ubiquitously everywhere across Meta.
00:06:02.540 | But interestingly, through open source engagement,
00:06:04.700 | we work with a lot of companies.
00:06:06.540 | It is clear to us, like,
00:06:07.940 | this industry is starting to take on the AI-first transition.
00:06:11.900 | And of course, Meta's hyperscale
00:06:13.540 | always goes ahead of the industry.
00:06:16.100 | And we feel like, it feels like
00:06:17.780 | when we started this AI journey at Meta,
00:06:20.060 | there's no software, no hardware, no team.
00:06:22.460 | For many companies we engaged with through PyTorch,
00:06:26.020 | we could feel their pain.
00:06:27.180 | That's the genesis why we feel like,
00:06:28.980 | hey, if we create fireworks and support industry
00:06:33.060 | going through this transition,
00:06:33.900 | it will be a huge amount of impact.
00:06:35.540 | Of course, the problem that industry facing
00:06:37.740 | will not be the same as Meta's.
00:06:39.340 | Meta is so big, right?
00:06:41.100 | So it's kind of skewed towards extreme scale
00:06:44.220 | and extreme optimization, and the industry will be different.
00:06:47.420 | But we feel like we have the technical chops
00:06:51.340 | and we've seen a lot.
00:06:52.620 | We'll look to kind of drive that.
00:06:55.860 | So yeah, so that's how we started.
00:06:58.620 | - When you and I chatted about like the origins of fireworks,
00:07:01.780 | it was originally envisioned more as a PyTorch platform.
00:07:06.380 | And then later became much more focused on generative AI.
00:07:09.420 | Is that fair to say?
00:07:11.220 | - Right.
00:07:12.060 | - What was the customer discovery here?
00:07:13.300 | - Right, so I would say our initial blueprint
00:07:17.500 | is say we should be the PyTorch cloud
00:07:19.860 | because PyTorch is a library
00:07:22.020 | and there's no SaaS platform to enable AI workloads.
00:07:25.900 | - Even in 2022, it's interesting.
00:07:28.540 | - I would not say absolutely no,
00:07:30.020 | but like cloud providers have some of those,
00:07:32.260 | but it's not first class citizen, right?
00:07:34.340 | Because at 2022, there's still like TensorFlow
00:07:37.380 | is massively in production.
00:07:39.380 | And this is all pre-Gen-AI.
00:07:41.580 | And PyTorch is kind of getting more and more adoption,
00:07:45.140 | but there's no PyTorch first SaaS platform existing.
00:07:49.900 | At the same time,
00:07:50.780 | we are also a very pragmatic set of people.
00:07:53.500 | We really want to make sure from the get-go,
00:07:55.940 | we get really, really close to customers.
00:07:58.380 | We understand their use case.
00:07:59.660 | We understand their pain points.
00:08:01.020 | We understand the value we deliver to them.
00:08:03.260 | So we want to take a different approach.
00:08:04.940 | Instead of building a horizontal PyTorch cloud,
00:08:07.060 | we want to build a verticalized platform first.
00:08:11.380 | And then we talk with many customers.
00:08:13.140 | And interesting, we started a company September 2022,
00:08:16.980 | and in October, November, OpenAI announced ChatGPT.
00:08:21.700 | And then boom, then when we talk with many customers,
00:08:24.060 | they are like, "Can you help us
00:08:25.900 | working on the Gen-AI aspect?"
00:08:28.340 | So of course, there are some open-source models.
00:08:31.300 | It's not as good at that time,
00:08:32.620 | but people are already putting a lot of attention there.
00:08:35.700 | Then we decide that if we're going to pick a vertical,
00:08:38.220 | we're going to pick Gen-AI.
00:08:39.620 | The other reason is all Gen-AI models are PyTorch models.
00:08:42.740 | So that's another reason.
00:08:44.260 | We believe that because of the nature of Gen-AI,
00:08:47.020 | it's going to generate a lot of human consumable content.
00:08:50.740 | It will drive a lot of consumer,
00:08:52.660 | customer-developer-facing application
00:08:54.620 | and product innovation.
00:08:56.020 | Guaranteed, right?
00:08:56.940 | We're just at the beginning of this.
00:08:58.900 | Our prediction is for those kind of application,
00:09:01.700 | the inference is much more important than training
00:09:04.420 | because inference scale is proportional
00:09:06.940 | to the upper limit of the world population.
00:09:09.500 | And training scale is proportional
00:09:11.860 | to the number of researchers.
00:09:12.860 | Of course, each training round could be very expensive.
00:09:15.980 | Although PyTorch supports both inference and training,
00:09:18.340 | we decide to lay the focus on inference.
00:09:21.180 | So yeah, so that's how we got started.
00:09:23.100 | And we launched our public platform August last year.
00:09:27.340 | And when we launched, it's a single product.
00:09:29.500 | It's a distributed inference engine
00:09:31.980 | with a simple API, an OpenAI-compatible API,
00:09:34.620 | with many models.
00:09:35.860 | We started with LLMs, and later on, we added a lot of models.
00:09:38.660 | Fast forward to now, we are a full platform
00:09:41.780 | with multiple product lines.
00:09:43.220 | So we love to kind of dive deep into what we offer.
00:09:46.180 | So, but that's a very fun journey in the past two years.
00:09:49.780 | - What was the transition from you start focus on PyTorch
00:09:53.220 | and people want to understand the framework, get it live.
00:09:56.340 | And now I would say maybe most people that use you
00:09:58.500 | don't even really know much about PyTorch at all.
00:10:02.500 | They're just consuming the models.
00:10:02.500 | From a product perspective,
00:10:04.460 | what were some of the decisions early on?
00:10:06.900 | Right in October, November,
00:10:08.060 | you were just like, "Hey, most people just care
00:10:10.060 | "about the model, not about the framework.
00:10:11.540 | "We're going to make it super easy."
00:10:12.700 | Or was it more a gradual transition
00:10:15.060 | to the model library you have today?
00:10:16.900 | - Yeah, so our product decision
00:10:18.500 | is all based on who is our ICP.
00:10:20.940 | And one thing we want to acknowledge here
00:10:23.260 | is the Gen-AI technology is disruptive.
00:10:26.220 | It's very different from AI before Gen-AI.
00:10:28.780 | So it's a clear leap forward.
00:10:31.700 | Because before Gen-AI,
00:10:33.220 | the companies that want to invest in AI,
00:10:36.380 | they have to train from scratch.
00:10:38.260 | There's no other way.
00:10:39.220 | There's no foundation model.
00:10:40.180 | It doesn't exist.
00:10:41.300 | So that means they need to start a team,
00:10:43.580 | first hire a team, who is capable of crunch data.
00:10:46.540 | There's a lot of data to crunch, right?
00:10:48.380 | Because training from scratch,
00:10:49.900 | you have to prepare a lot of data.
00:10:51.540 | And then they need to have GPUs to train.
00:10:56.540 | And then you need to start to manage GPUs.
00:10:58.100 | So then it becomes a very complex project.
00:11:00.980 | It takes a long time
00:11:01.860 | and not many companies can afford it, actually.
00:11:05.300 | And Gen-AI is a very different game right now
00:11:09.220 | because it is a foundation model,
00:11:10.980 | so you don't have to train anymore.
00:11:12.620 | That makes AI much more accessible as a technology.
00:11:16.540 | As an app developer or product manager,
00:11:18.340 | even not a developer,
00:11:19.740 | they can interact with Gen-AI models directly.
00:11:23.900 | So, and our goal is to make AI accessible
00:11:27.380 | to all app developers and product engineers.
00:11:30.100 | That's our goal.
00:11:31.340 | So then getting them into building models
00:11:34.980 | doesn't make any sense anymore with this new technology.
00:11:38.620 | And then building easy, accessible APIs is the most important.
00:11:42.380 | Our, early on, when we got started,
00:11:44.540 | we decided we're going to be OpenAI compatible.
00:11:47.020 | It's just kind of very easy for developers
00:11:49.820 | to adopt this new technology.
00:11:51.900 | And we will manage the underlying complexity
00:11:54.700 | of serving all these models.
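For developers following along, here is a minimal sketch of what an OpenAI-compatible call looks like with the standard openai Python client. The base URL and model id are illustrative values to check against Fireworks' current docs, not guaranteed endpoints.

```python
# Minimal sketch: pointing the standard OpenAI client at an
# OpenAI-compatible endpoint. Base URL and model id are examples;
# consult the provider's docs for current values.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # assumed endpoint
    api_key=os.environ["FIREWORKS_API_KEY"],           # env var name is your choice
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # example model id
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```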
00:11:56.340 | - Yeah, OpenAI has become--
00:11:57.780 | - The standard. - The standard.
00:11:59.460 | Even as we're recording today,
00:12:01.180 | Gemini announced that they have OpenAI compatible APIs.
00:12:05.060 | - Interesting. - So then we just need
00:12:06.180 | to adopt that one API and then we have everyone.
00:12:08.340 | - Yeah, that's interesting because
00:12:10.940 | we are working very closely with Meta
00:12:12.740 | as one of the partners.
00:12:14.180 | And Meta announced,
00:12:16.100 | Meta, of course, is kind of very generous
00:12:17.900 | to donate many very, very strong open source models.
00:12:21.020 | Expecting more to come.
00:12:22.580 | But also they have announced Llama Stack.
00:12:25.180 | - Yeah.
00:12:26.020 | - Which is basically standardized,
00:12:29.740 | the upper-level stack, built on top of Llama models.
00:12:32.580 | So they don't just want to give out models
00:12:35.740 | and you figure out what the upper stack is.
00:12:37.180 | They instead want to build a community around the stack
00:12:39.940 | and build a kind of new standard.
00:12:42.540 | I think it's an interesting dynamics
00:12:44.460 | playing in the industry right now.
00:12:46.500 | Whether it's more standardized around OpenAI
00:12:49.980 | because they are kind of creating the top of the line,
00:12:52.660 | or standardized around Llama
00:12:54.540 | because this is the most used open source model.
00:12:57.340 | So I think it's really a lot of fun working at this time.
00:13:01.540 | - I've been a little bit more doubtful on Llama Stack.
00:13:05.060 | I think you've been more positive.
00:13:06.380 | Basically, it's just like the Meta version
00:13:08.500 | of whatever Hugging Face offers,
00:13:10.980 | or TensorRT, or vLLM,
00:13:13.020 | or whatever the open source opportunity is.
00:13:15.900 | But to me, it's not clear that
00:13:18.340 | just because Meta open-sources Llama,
00:13:21.580 | that the rest of Llama Stack will be adopted.
00:13:24.260 | And it's not clear why I should adopt it.
00:13:26.620 | So I don't know if you--
00:13:27.460 | - Yeah, it's very early right now.
00:13:28.980 | That's why I kind of will work very closely with them
00:13:32.060 | and give them feedback.
00:13:33.340 | The feedback to the Meta team is very important.
00:13:35.660 | So then they can use that to continue to improve the model
00:13:38.740 | and also improve the higher-level stack.
00:13:40.740 | I think the success of Llama Stack
00:13:42.660 | heavily depends on the community adoption,
00:13:44.820 | and there's no way around it.
00:13:46.420 | And I know Meta team would like to kind of work
00:13:49.340 | with a broader set of community, but it's very early.
00:13:51.980 | - One thing that, after your Series B,
00:13:53.980 | so you raised from Benchmark,
00:13:55.420 | and then I remember being close to you
00:13:58.140 | for at least your Series B announcements,
00:14:01.100 | you started betting heavily on this term of compound AI.
00:14:03.820 | It's not a term that we've covered very much in the podcast,
00:14:06.460 | but I think it's definitely getting a lot of adoption
00:14:09.340 | from Databricks and the Berkeley people and all that.
00:14:12.020 | What's your take on compound AI?
00:14:14.300 | Why is it resonating with people?
00:14:16.100 | - Right, so let me give a little bit of context
00:14:18.980 | why we even consider that space.
00:14:22.140 | - Yeah, because pre-Series B,
00:14:24.140 | there was no message, and now it's like on your landing page.
00:14:27.900 | - So it's kind of a very organic evolution
00:14:31.300 | from when we first launched our public platform.
00:14:34.540 | We are a single product, and we are a distributed
00:14:36.380 | inference engine, where we do a lot of innovation,
00:14:39.900 | customized CUDA kernels, our own kernels,
00:14:43.980 | running on different kinds of hardware,
00:14:45.860 | and build distributed disaggregated execution,
00:14:50.180 | inference execution, build all kind of caching.
00:14:52.820 | So that is one.
00:14:54.300 | So that's kind of one product line,
00:14:55.940 | is the fast, most cost-efficient inference platform.
00:14:59.420 | Because we wrote PyTorch code,
00:15:00.540 | we know we basically have a special PyTorch build for that,
00:15:03.900 | together with a custom kernel we wrote.
00:15:06.500 | And then we work with many more customers,
00:15:07.900 | we realized, oh, the distributed inference engine,
00:15:10.740 | our design is one size fits all, right?
00:15:12.980 | We want to have this inference endpoint,
00:15:14.940 | then everyone come in, and no matter what kind of
00:15:18.260 | form and shape or workload they have,
00:15:20.100 | it will just work for them, right?
00:15:21.180 | So that's great.
00:15:22.700 | But the reality is, we realized,
00:15:26.140 | all customers have different kind of use cases.
00:15:28.460 | The use cases come in all different form and shape.
00:15:31.260 | And the end result is, the data distribution
00:15:35.540 | in their inference workload doesn't align
00:15:37.900 | with the data distribution in the training data
00:15:40.700 | for the model, right?
00:15:41.780 | It's a given, actually.
00:15:42.620 | If you think about this, because researchers
00:15:44.620 | have to guesstimate what is important,
00:15:46.580 | what's not important, like in preparing data for training.
00:15:50.380 | So because of that misalignment,
00:15:52.180 | then we leave a lot of quality,
00:15:55.300 | latency, cost improvement on the table.
00:15:57.540 | So then we're saying, okay, we want to heavily invest
00:16:00.460 | in a customization engine.
00:16:02.740 | And we actually announced it called FireOptimizer.
00:16:04.980 | So FireOptimizer basically help user navigate
00:16:08.740 | a three-dimensional optimization space
00:16:10.700 | across quality, latency, and cost.
00:16:13.940 | So it's a three-dimensional curve.
00:16:16.180 | And even for one company, for different use case,
00:16:19.620 | they want to land in different spots.
00:16:22.100 | So we automate that process for our customer.
00:16:25.180 | It's very simple.
00:16:26.140 | You have your inference workload,
00:16:27.580 | and you inject into the optimizer,
00:16:30.700 | along with the objective function.
00:16:32.620 | And then we spit out inference deployment config
00:16:35.980 | and the model setup.
00:16:38.180 | So it's your customized setup.
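FireOptimizer's actual interface isn't spelled out in the conversation, so the toy sketch below only illustrates the underlying idea: score candidate deployment configs against a caller-supplied objective over quality, latency, and cost, and pick the best one. Every config name and number here is invented.

```python
# Toy illustration of the idea behind a quality/latency/cost optimizer:
# score candidate deployment configs against an objective and pick the best.
# All configs and metrics are invented for illustration only.
from dataclasses import dataclass

@dataclass
class Config:
    name: str
    quality: float        # e.g. eval score on the customer's workload, 0..1
    latency_ms: float
    cost_per_1m_tokens: float

CANDIDATES = [
    Config("fp16-large", quality=0.92, latency_ms=450, cost_per_1m_tokens=2.00),
    Config("fp8-large", quality=0.90, latency_ms=300, cost_per_1m_tokens=1.20),
    Config("int4-medium", quality=0.85, latency_ms=150, cost_per_1m_tokens=0.40),
]

def pick_config(candidates, objective):
    """Return the config maximizing the caller's objective function."""
    return max(candidates, key=objective)

# Example objective: weight quality heavily, penalize latency and cost.
best = pick_config(
    CANDIDATES,
    objective=lambda c: 10 * c.quality - 0.005 * c.latency_ms - c.cost_per_1m_tokens,
)
print(best.name)
```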
00:16:41.260 | So that is a completely different product.
00:16:43.740 | So that product thinking is one size fits one,
00:16:46.700 | different from one size fits all.
00:16:48.140 | And now on top of that,
00:16:49.740 | we provide a huge variety of state-of-the-art models,
00:16:53.580 | hundreds of them,
00:16:59.460 | starting from text, state-of-the-art LLMs.
00:16:59.460 | That's where we started.
00:17:00.980 | And as we talk with many customers,
00:17:02.820 | we realize, oh, audio and text are very, very close.
00:17:06.420 | Many of our customers start to build assistants,
00:17:08.420 | all kinds of assistants using text,
00:17:10.460 | and they immediately want to add audio,
00:17:12.300 | audio in, audio out.
00:17:13.620 | So we support transcription, translation,
00:17:16.740 | speech synthesis, text, audio alignment,
00:17:20.740 | all different kind of audio features.
00:17:22.180 | It's a big announcement we're gonna,
00:17:23.860 | you should have heard.
00:17:24.700 | - By the time this is out.
00:17:25.540 | - By the time this is out.
00:17:26.380 | - Yeah.
00:17:27.220 | - And the other area is vision,
00:17:28.860 | and vision and text are very close to each other,
00:17:31.020 | because a lot of information doesn't live in plain text.
00:17:34.420 | A lot of information live in multimedia format,
00:17:36.900 | live in images, PDFs, screenshots,
00:17:39.580 | and in many other different formats.
00:17:41.420 | So oftentimes, to solve a problem,
00:17:43.740 | we need to put the vision model first
00:17:45.020 | to extract information.
00:17:46.460 | And then use language model to process
00:17:48.740 | and then send out results.
00:17:50.060 | So vision is important, we also support vision model.
00:17:52.580 | Various different kind of vision models specialize
00:17:54.580 | in processing different kind of source and extraction.
00:17:58.580 | And we're also gonna have another announcement
00:18:01.140 | of a new API endpoint,
00:18:02.980 | which will support people uploading
00:18:05.580 | various different kinds of multimedia content
00:18:08.220 | and then getting very accurate information extracted out
00:18:11.420 | and feeding that into LLMs.
00:18:13.460 | And then of course we support embedding
00:18:15.540 | because embedding is very important
00:18:16.780 | for semantic search, for RAG and all this.
00:18:19.380 | And in addition to that, we also support text to image,
00:18:22.180 | image generation models, text to image, image to image,
00:18:25.100 | and we're adding text to video as well in our portfolio.
00:18:28.540 | So it's very comprehensive set of model catalog
00:18:31.220 | that builds on, runs on top of, FireOptimizer
00:18:34.220 | and the distributed inference engine.
00:18:36.060 | But then we talk with more customer,
00:18:37.460 | they solve business use case,
00:18:39.260 | and then we realize one model is not sufficient
00:18:42.060 | to solve their problem.
00:18:44.060 | And it's very clear because one is the model hallucinates,
00:18:47.860 | and many customers, when they onboard this Gen-AI journey,
00:18:50.820 | they thought this is magical.
00:18:52.340 | Gen-AI is gonna solve all my problems magically,
00:18:54.460 | but then they realize, oh, this model hallucinates.
00:18:57.100 | It hallucinates because it's not deterministic,
00:18:58.940 | it's probabilistic.
00:19:00.540 | So it's designed to always give you an answer,
00:19:03.260 | but based on probability, so it hallucinates.
00:19:05.900 | And that's actually sometimes the feature
00:19:08.300 | for creative writing, for example.
00:19:09.740 | Sometimes it's a bug because, hey,
00:19:11.460 | you don't want to give misinformation.
00:19:14.380 | And different model also have different specialties.
00:19:16.900 | To solve a problem, you want to ask different specialized models,
00:19:19.940 | to kind of decompose your task
00:19:22.140 | into multiple small tasks, narrow tasks,
00:19:25.060 | and have an expert model solve each task really well.
00:19:28.140 | And of course, the model doesn't have all the information.
00:19:31.140 | It has limited knowledge
00:19:32.140 | because the training data is finite, not infinite.
00:19:34.580 | So model oftentimes doesn't have real-time information.
00:19:37.340 | It doesn't know any proprietary information
00:19:39.420 | within enterprise.
00:19:40.740 | It's clear that in order to really build
00:19:47.860 | a compelling application on top of Gen-AI,
00:19:47.860 | we need a compound AI system.
00:19:49.660 | Compound AI system basically is gonna have multiple models
00:19:53.940 | across modalities along with APIs,
00:19:58.180 | whether it's public APIs, internal proprietary APIs,
00:20:01.700 | storage systems, database system,
00:20:04.020 | knowledge systems to work together
00:20:06.540 | to deliver the best answer.
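As a rough sketch of what "multiple models plus APIs working together" can mean in code, the pipeline below chains a vision-extraction step, a retrieval step, and an LLM call behind one function. The three callables are placeholders for whatever models or APIs you wire in, not a Fireworks API.

```python
# Conceptual compound-AI pipeline: vision extraction -> retrieval -> LLM answer.
# The three callables are placeholders for real models/APIs.
from typing import Callable, List

def answer_question(
    image_bytes: bytes,
    question: str,
    extract_text: Callable[[bytes], str],        # e.g. a vision/OCR model
    retrieve_docs: Callable[[str], List[str]],   # e.g. a vector-DB search
    generate: Callable[[str], str],              # e.g. an LLM chat call
) -> str:
    extracted = extract_text(image_bytes)
    context_docs = retrieve_docs(question + "\n" + extracted)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Extracted from image:\n{extracted}\n\n"
        "Retrieved context:\n" + "\n".join(context_docs) + "\n\n"
        f"Question: {question}"
    )
    return generate(prompt)

# Usage with trivial stand-ins, just to show the wiring:
print(answer_question(
    b"...", "What is the invoice total?",
    extract_text=lambda _: "Invoice total: $42.00",
    retrieve_docs=lambda q: ["Totals are listed in USD."],
    generate=lambda p: "The invoice total is $42.00.",
))
```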
00:20:07.860 | - Are you gonna offer a vector database?
00:20:09.660 | - We actually heavily partner
00:20:11.580 | with several big vector database providers.
00:20:15.020 | - Which is your favorite?
00:20:16.380 | (laughing)
00:20:18.260 | - They are all great in different ways,
00:20:20.260 | but it's public information, like MongoDB is our investor,
00:20:23.740 | and we have been working closely with them for a while.
00:20:26.780 | - When you say distributed inference engine,
00:20:29.580 | what do you mean exactly?
00:20:30.500 | Because when I hear your explanation,
00:20:32.180 | it's almost like you're centralizing a lot of the decisions
00:20:35.580 | through the Fireworks platform
00:20:36.700 | on like the quality and whatnot.
00:20:38.300 | What do you mean distributed?
00:20:39.140 | It's like you have GPUs in like a lot of different clusters,
00:20:41.740 | so like you're sharding the inference across.
00:20:44.060 | - Right, right, right.
00:20:45.460 | So first of all, we run across multiple GPUs.
00:20:49.620 | But the way we distribute across multiple GPUs is unique.
00:20:54.060 | We don't distribute the whole model monolithically
00:20:56.020 | across multiple GPUs.
00:20:57.060 | We chop them into pieces
00:20:58.340 | and scale them completely differently
00:20:59.820 | based on what's the bottleneck.
00:21:02.020 | We also are distributed across regions.
00:21:05.500 | We have been running in North America,
00:21:09.260 | EMEA, and Asia.
00:21:09.260 | We have regional affinity to applications
00:21:13.420 | because latency is extremely important.
00:21:15.940 | We are also like doing global load balancing
00:21:19.420 | because a lot of application there,
00:21:21.140 | they quickly scale to global population.
00:21:24.500 | And then at that scale,
00:21:26.700 | like different continent wakes up at a different time.
00:21:29.740 | And you want to kind of load balancing across.
00:21:32.580 | So, on top of all that, we also
00:21:35.300 | manage various different kinds of hardware SKUs
00:21:38.140 | from different hardware vendors.
00:21:39.660 | And different hardware designs are best
00:21:41.940 | for different types of workload,
00:21:44.580 | whether it's long context, short context, long generation.
00:21:47.820 | So all these different types of workload are best fitted
00:21:51.620 | for different kinds of hardware SKUs.
00:21:53.980 | And then we can even distribute it
00:21:55.140 | across different hardware for a workload.
00:21:57.620 | So yeah, so the distribution actually
00:21:59.260 | is all around in the full stack.
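None of the scheduler internals are public, so purely to make the idea concrete, here is a toy router that picks a regional pool for latency and a hardware class by workload shape (prefill-heavy versus decode-heavy), which is the kind of decision being described. Pool names and thresholds are invented.

```python
# Toy sketch of workload-aware routing: choose a region for latency,
# then a hardware class by workload shape. Entirely illustrative.
POOLS = {
    ("us", "long_context"): "us-pool-long-ctx",
    ("us", "long_generation"): "us-pool-decode-heavy",
    ("eu", "long_context"): "eu-pool-long-ctx",
    ("eu", "long_generation"): "eu-pool-decode-heavy",
}

def classify(prompt_tokens: int, max_new_tokens: int) -> str:
    # Prefill-heavy and decode-heavy requests stress hardware differently.
    return "long_context" if prompt_tokens > 4 * max_new_tokens else "long_generation"

def route(region: str, prompt_tokens: int, max_new_tokens: int) -> str:
    return POOLS[(region, classify(prompt_tokens, max_new_tokens))]

print(route("eu", prompt_tokens=32_000, max_new_tokens=512))   # eu-pool-long-ctx
print(route("us", prompt_tokens=200, max_new_tokens=2_000))    # us-pool-decode-heavy
```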
00:22:02.860 | - At some point, we'll show on the YouTube
00:22:05.020 | the image that Ray, I think, has been working on
00:22:07.700 | with like all the different modalities that you offer.
00:22:10.140 | Like to me, it's basically you offer the open source version
00:22:13.620 | of everything that OpenAI typically offers, right?
00:22:16.380 | I don't think there is.
00:22:17.500 | Actually, if you do text to video,
00:22:19.740 | you will be a superset of what OpenAI offers
00:22:22.500 | 'cause they don't have Sora.
00:22:23.860 | Is that Mochi, by the way?
00:22:25.340 | - Mochi.
00:22:26.180 | - Mochi, right?
00:22:27.020 | - Mochi, and there are a few others.
00:22:29.380 | I will say the interesting thing is,
00:22:31.940 | I think we're betting on the open source community
00:22:35.620 | is gonna grow, like proliferate.
00:22:38.860 | This is literally what I see.
00:22:40.300 | - Yeah.
00:22:41.140 | - And there's amazing video generation companies.
00:22:44.500 | - Yeah.
00:22:45.340 | - There is amazing audio companies.
00:22:48.140 | Like across the board, the innovation is off the charts
00:22:51.540 | and we are building on top of that.
00:22:53.100 | I think that's the advantage we have
00:22:55.900 | compared with a closed source company.
00:22:58.460 | - I think I want to restate the value proposition
00:23:00.420 | of Fireworks for people who are comparing you
00:23:02.940 | versus like a raw GPU provider, like a RunPod,
00:23:06.700 | or a Lambda, or anything like those,
00:23:08.820 | which is like you create the developer experience layer
00:23:12.380 | and you also make it easily scalable or serverless
00:23:15.700 | or as an end point.
00:23:18.220 | And then I think for some models,
00:23:20.460 | you have custom kernels, but not all models.
00:23:24.220 | - For almost for all models,
00:23:25.860 | for all large language models, all your models.
00:23:28.420 | - You just write kernels all day long?
00:23:29.260 | - In the VRS.
00:23:30.460 | (laughs)
00:23:31.900 | - Yeah.
00:23:32.740 | - Yeah, almost for all models we serve, we have.
00:23:34.980 | - And so that is called Fire Attention?
00:23:36.820 | - That's called Fire.
00:23:37.660 | - I don't remember the speed numbers,
00:23:39.620 | but apparently much better than vLLM,
00:23:41.540 | especially on a concurrency basis.
00:23:44.180 | - Right, so Fire Attention is specific for,
00:23:46.860 | mostly for language model,
00:23:48.580 | but for other modalities,
00:23:49.740 | we'll also have a customized kernel.
00:23:52.460 | - Yeah, I think the typical challenge for people
00:23:55.140 | is understanding like, that has value.
00:23:57.940 | And then like, there are other people
00:23:59.860 | who are also offering open source models, right?
00:24:01.580 | Like your mode is your ability to offer
00:24:05.100 | like a good experience for all these customers.
00:24:07.580 | But if your existence is entirely reliant on people
00:24:10.340 | releasing nice open source models,
00:24:12.180 | other people can also do the same thing.
00:24:13.580 | - Right, yeah.
00:24:14.540 | So I will say we build on top of
00:24:16.020 | open source model foundation.
00:24:17.660 | So that's the kind of foundation we build on top of.
00:24:20.180 | But we look at the value prop
00:24:22.780 | from the lens of application developers
00:24:24.820 | and product engineers.
00:24:26.100 | So they want to create new UX.
00:24:28.900 | So what's happening in the industry right now
00:24:31.420 | is people are thinking about
00:24:32.660 | completely new way of designing products.
00:24:35.900 | And I'm talking to so many founders,
00:24:37.940 | it's just mind blowing.
00:24:39.740 | They help me understand existing way of doing PowerPoint,
00:24:44.340 | existing way of coding,
00:24:46.260 | existing way of managing customer service.
00:24:50.500 | It's actually putting a box in our head.
00:24:52.420 | For example, PowerPoint, right?
00:24:53.580 | So PowerPoint generation is,
00:24:55.260 | we always need to think about
00:24:56.380 | how to fit into my storytelling into this format
00:24:59.700 | of slide one after another.
00:25:01.740 | And I'm gonna juggle through like design
00:25:05.260 | together with what story to tell.
00:25:07.460 | But the most important thing is
00:25:08.820 | what's your storytelling lines, right?
00:25:11.460 | And why don't we create a space
00:25:13.580 | that is not limited to any format?
00:25:16.660 | And those kind of new product UX design
00:25:19.460 | combined with automated content generation through Gen-AI
00:25:24.580 | is the new thing that many founders are doing.
00:25:27.940 | What are the challenges they're facing?
00:25:29.380 | All right, let's go from there.
00:25:30.780 | One is, again, because a lot of products
00:25:33.780 | built on top of Gen-AI,
00:25:34.620 | they are consumer, personal, and developer facing,
00:25:36.900 | and they require interactive experience.
00:25:40.180 | It's just a kind of product experience we all get used to.
00:25:42.740 | And our desire is to actually get
00:25:44.500 | faster and faster interaction.
00:25:46.340 | Otherwise, nobody wants to spend time, right?
00:25:48.740 | So again, and then that requires low latency.
00:25:51.420 | And the other thing is,
00:25:52.700 | the nature of consumer, personal, and developer facing
00:25:54.780 | is your audience is very big.
00:25:57.180 | You want to scale up to product market fit quickly.
00:26:00.020 | But if you lose money at a small scale,
00:26:01.860 | you're gonna bankrupt quickly.
00:26:03.340 | So it's actually a big contrast:
00:26:06.020 | I actually have product market fit,
00:26:07.740 | but when I scale, I scale myself out of business.
00:26:09.900 | So that's kind of very funny to think about it.
00:26:13.020 | So then having low latency and low cost is essential
00:26:18.020 | for those new applications and products to survive
00:26:20.940 | and really become a generational company.
00:26:23.020 | So that's the design point for
00:26:25.620 | our distributed inference engine and FireOptimizer.
00:26:29.700 | FireOptimizer, you can think about that
00:26:31.100 | as a feedback loop.
00:26:32.180 | The more you feed your inference workload
00:26:34.940 | to our inference engine,
00:26:36.740 | the more we help you improve quality,
00:26:39.740 | lower latency further, lower your cost.
00:26:42.460 | It basically becomes better.
00:26:43.940 | And we automate that because we don't want you
00:26:46.980 | as app developer or product engineer to think about
00:26:49.740 | how to figure out all these low-level details.
00:26:53.300 | It's impossible because you are not trained
00:26:54.940 | to do that at all.
00:26:56.180 | You should kind of keep your focus
00:26:57.660 | on the product innovation.
00:26:59.220 | And then the compound AI,
00:27:01.220 | we actually feel a lot of pain
00:27:02.740 | as the app developers, engineer,
00:27:05.420 | there are so many models.
00:27:07.380 | Every week, there's at least a new model coming out.
00:27:10.460 | - Tencent had a giant model this week.
00:27:12.340 | - Yeah, yeah, I saw that, I saw that.
00:27:18.500 | - Like 500 billion parameters.
00:27:18.500 | So they're like, should I keep chasing this
00:27:22.140 | or should I forget about it?
00:27:24.460 | And which model should I pick
00:27:26.140 | to solve what kind of sub-problem?
00:27:27.460 | How do I even decompose my problem
00:27:28.980 | into those smaller problems
00:27:30.220 | and fit the model into it?
00:27:31.660 | I have no idea.
00:27:33.020 | And then there are two ways
00:27:34.460 | to think about this design.
00:27:36.220 | I think I talked about that in the past.
00:27:37.700 | One is imperative, as in you tell,
00:27:41.100 | you figure out how to do it.
00:27:43.180 | You give developer tools to dictate how to do it.
00:27:46.660 | Or you build a declarative system
00:27:49.300 | where a developer tells what they want to do, not how.
00:27:52.660 | So these are completely two different designs.
00:27:55.380 | So the analogy I want to draw is in the data world,
00:27:59.740 | the database management system is a declarative system
00:28:02.700 | because people use database, use SQL.
00:28:05.100 | SQL is a way you say,
00:28:06.540 | what do you want to extract out of database?
00:28:08.740 | What kind of result do you want?
00:28:10.060 | But you don't figure out which node,
00:28:13.140 | how many nodes you're gonna run on top of,
00:28:14.980 | how you redefine your disk,
00:28:17.100 | which index you use, which project.
00:28:18.620 | You don't need to worry about any of those.
00:28:19.900 | And database management system will figure out,
00:28:22.340 | generate a new best plan and execute on that.
00:28:26.340 | So database is declarative.
00:28:28.580 | And it makes it super easy.
00:28:30.100 | You just learn SQL,
00:28:31.300 | which is learn a semantic meaning of SQL
00:28:33.500 | and you can use it.
00:28:34.660 | Imperative side is there are a lot of ETL pipelines
00:28:38.060 | and people design this DAG system
00:28:40.780 | with triggers, with actions,
00:28:42.340 | and you dictate exactly what to do.
00:28:44.580 | And if it fails, then you'll have to recover.
00:28:46.540 | So that's an imperative system.
00:28:48.780 | And we have seen a range of system
00:28:51.420 | in the ecosystem go different ways.
00:28:53.780 | I think there are value of both.
00:28:55.620 | There are value of both.
00:28:56.460 | I don't think one is gonna subsume the other,
00:28:58.860 | but we are leaning more into the philosophy
00:29:00.820 | of the declarative system
00:29:02.740 | because from the lens of app developer and product engineer,
00:29:06.020 | that would be easiest for them to integrate.
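To ground the database analogy, here is the same tiny aggregation done both ways in Python: an imperative loop that spells out how to compute the answer, and a declarative SQL query (via the standard sqlite3 module) that only states what is wanted and leaves the execution plan to the engine.

```python
import sqlite3

# Imperative: you spell out *how* to get the answer, step by step.
def total_spend_imperative(orders):
    total = 0.0
    for order in orders:
        if order["status"] == "paid":
            total += order["amount"]
    return total

# Declarative: you state *what* you want; the engine plans the execution.
def total_spend_declarative(orders):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (status TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(o["status"], o["amount"]) for o in orders],
    )
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE status = 'paid'"
    ).fetchone()
    return total

orders = [{"status": "paid", "amount": 30.0}, {"status": "refunded", "amount": 10.0}]
assert total_spend_imperative(orders) == total_spend_declarative(orders) == 30.0
```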
00:29:08.220 | - I understand that's also why PyTorch won as well, right?
00:29:12.180 | This is one of the reasons.
00:29:13.020 | - Ease of use.
00:29:13.860 | So yeah, focus on ease of use
00:29:15.820 | and then let the system take on
00:29:19.100 | the hard challenges and complexities.
00:29:21.260 | So we follow, we extend that thinking
00:29:23.780 | into current system design.
00:29:26.100 | So another announcement is we will also announce
00:29:29.420 | our next declarative system
00:29:32.940 | is gonna appear as a model
00:29:35.180 | that has extremely high quality.
00:29:38.020 | And this model is inspired by the o1 announcement
00:29:41.140 | from OpenAI.
00:29:42.460 | You should see that by the time we announce this or soon.
00:29:46.260 | - It's trained by you.
00:29:47.100 | - Yes.
00:29:47.940 | - Is this the first model that you train?
00:29:51.020 | Like this scale?
00:29:51.860 | - It's not the first.
00:29:52.860 | We actually have trained a model called FireFunction.
00:29:57.380 | It's a function calling model.
00:29:58.780 | It's our first step into compound AI system
00:30:01.380 | because function calling model
00:30:03.740 | can dispatch a request into multiple APIs.
00:30:08.740 | We have a pre-baked set of APIs the model learned.
00:30:12.060 | You can also add additional APIs
00:30:14.740 | through the configuration
00:30:16.220 | to let the model dispatch accordingly.
00:30:18.340 | So we have a very high quality function calling model
00:30:21.020 | that already released.
00:30:22.460 | We have actually three versions.
00:30:23.700 | The latest version is very high quality.
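For a sense of how a function-calling model is used through an OpenAI-compatible API, a hedged sketch follows: the tools schema is the standard chat-completions format, while the base URL and FireFunction model id are assumptions to verify against current docs.

```python
# Sketch: asking a function-calling model to dispatch to a declared tool.
# Base URL and model id are assumed; the tools schema is the standard
# OpenAI chat-completions format.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # assumed endpoint
    api_key=os.environ["FIREWORKS_API_KEY"],
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",  # assumed model id
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)
# The model responds with a structured tool call instead of free text.
print(resp.choices[0].message.tool_calls)
```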
00:30:25.900 | But now we take a further step
00:30:28.180 | that you don't even need to use function calling model.
00:30:30.420 | You use our new model we're gonna release.
00:30:33.500 | It will solve a lot of problem
00:30:35.060 | approaching very high, like OpenAI's quality.
00:30:38.860 | So I'm very excited about that.
00:30:41.420 | - Do you have any benchmarks yet or?
00:30:42.900 | - We have benchmark.
00:30:43.740 | We're gonna release it.
00:30:45.140 | Hopefully next week.
00:30:46.220 | We just put our model to LMSYS
00:30:49.220 | and people are guessing,
00:30:50.620 | is this a next Gemini model or a MADIS model?
00:30:55.620 | People are guessing.
00:30:56.500 | That's very interesting.
00:30:57.500 | We're like watching the Reddit discussion right now.
00:31:00.420 | - I mean, I have to ask more questions about this.
00:31:02.220 | When OpenAI released o1,
00:31:04.540 | a lot of people asked about whether or not
00:31:07.300 | it's a single model or whether it's like a chain of models.
00:31:10.420 | And basically everyone on the Strawberry team
00:31:14.140 | was very insistent that what they did
00:31:17.100 | for reinforcement learning, chain of thought,
00:31:19.380 | cannot be replicated by a whole bunch
00:31:20.900 | of open source model calls.
00:31:22.340 | Do you think that they are wrong?
00:31:24.500 | Have you done the same amount of work on RL as they have
00:31:27.660 | or was it a different direction?
00:31:29.740 | - I think they take a very specific approach
00:31:32.100 | where I do, the caliber of team is very high, right?
00:31:35.500 | So I do think they are the domain expert
00:31:37.980 | in doing the things they are doing.
00:31:39.100 | But I don't think there's only one way
00:31:41.900 | to achieve the same goal.
00:31:43.420 | We're on the same direction in the sense
00:31:46.220 | that the quality scaling law is shifting
00:31:49.620 | from training to inference.
00:31:51.300 | We are definitely on, for that I fully agree with them.
00:31:54.740 | But we're taking a completely different approach
00:31:57.180 | to the problem.
00:31:58.380 | All of that is because, of course,
00:32:00.580 | we didn't train the model from scratch.
00:32:02.140 | All of that is because we built on the shoulders of giants, right?
00:32:05.140 | So the current model available we have access to
00:32:07.820 | is getting better and better.
00:32:09.300 | The future trend is the gap between the open source model,
00:32:11.780 | closed source model, it's just gonna shrink
00:32:14.540 | to the point there's not much difference.
00:32:17.260 | And then we're on the same level field.
00:32:19.180 | That's why I think our early investment in inference
00:32:22.820 | and all the work we do around balancing across quality,
00:32:27.820 | latency and cost pay off
00:32:29.780 | because we have accumulated a lot of experience there
00:32:32.260 | and that empower us to release this new model
00:32:36.420 | that is approaching o1 quality.
00:32:39.380 | - I guess the question is,
00:32:40.340 | what do you think the gap to catch up will be?
00:32:43.100 | Because I think everybody agrees
00:32:44.700 | with open source models eventually will catch up.
00:32:47.340 | And I think with GPT-4, then with Llama 3.2, 3.1, 405B,
00:32:51.580 | we closed the gap.
00:32:52.420 | And then o1 just reopened the gap so much and it's unclear.
00:32:55.900 | Obviously you're saying your model will have-
00:32:57.580 | - We're closing that gap.
00:32:58.420 | - Yeah, but you think like in the future
00:33:00.140 | it's gonna be like months?
00:33:02.340 | - So here's the thing that's happened, right?
00:33:04.020 | There's public benchmark, it is what it is.
00:33:06.620 | But in reality, open source model in certain dimension
00:33:11.140 | already on par or beat closed source model, right?
00:33:15.100 | So for example, in the coding space,
00:33:18.260 | open source models are really, really good.
00:33:20.380 | And in function calling,
00:33:22.100 | like FireFunction is also really, really good.
00:33:24.220 | So it's all a matter of whether you build one model
00:33:27.140 | to solve all the problem
00:33:28.220 | and you want to be the best of solving all the problems
00:33:31.260 | or in the open source domain, it's gonna specialize, right?
00:33:34.580 | All these different model builders specialize
00:33:36.820 | in certain narrow area.
00:33:39.260 | And it's logical that they can be really, really good
00:33:42.900 | in that very narrow area.
00:33:44.500 | And that's our prediction is with specialization,
00:33:48.540 | there will be a lot of expert models really, really good
00:33:51.420 | and even better than like one size fits all
00:33:53.860 | open source, closed source models.
00:33:55.700 | - I think this is the core debates
00:33:59.540 | that I am still not 100% either way on
00:34:03.020 | in terms of compound AI versus normal AI,
00:34:07.140 | 'cause you're basically fighting the bitter lesson.
00:34:09.460 | - Look at the human society, right?
00:34:11.500 | We specialize and you feel really good
00:34:13.900 | about someone specializing doing something really well, right?
00:34:17.300 | And that's how we, like, when we evolved from ancient times,
00:34:19.980 | we were all generalists, we did everything in the tribe.
00:34:22.580 | Now we heavily specialize in different domains.
00:34:24.860 | So my prediction is in the AI model space,
00:34:27.460 | it will happen also.
00:34:28.660 | Except for the bitter lesson,
00:34:30.420 | you get short-term gains by having specialists,
00:34:33.700 | domain specialists, and then someone just needs to train
00:34:36.060 | like a 10X bigger model on 10X more inference,
00:34:38.900 | 10X more data, 10X more model perhaps,
00:34:41.380 | whatever the current scaling law is.
00:34:43.780 | And then it supersedes all the individual models
00:34:46.380 | because of some generalized intelligence/world knowledge.
00:34:50.220 | You know, I think that is the core insight of the GPTs,
00:34:54.500 | the GPT 1, 2, 3, that was.
00:34:56.420 | - Right, but the scaling law again,
00:34:58.260 | the training scaling law is because
00:35:00.180 | you have increasing amount of data to train from
00:35:02.820 | and you can do a lot of compute, right?
00:35:04.780 | So I think on the data side, we're approaching the limit
00:35:08.300 | and the only data to increase that
00:35:09.860 | is synthetic generated data.
00:35:11.300 | And then there's like, what is the secret sauce there, right?
00:35:14.780 | Because if you have a very good large model,
00:35:17.020 | you can generate very good synthetic data
00:35:19.340 | and then continue to improve quality.
00:35:21.820 | So that's why I think in OpenAI,
00:35:23.340 | they are shifting from the training scaling law
00:35:25.180 | into inference scaling law.
00:35:26.020 | And it's the test time and all this.
00:35:28.260 | So I definitely believe that's the future direction
00:35:31.660 | and that's where we are really good at and doing inference.
00:35:34.540 | - Couple of questions on that.
00:35:35.580 | Are you planning to share your reasoning traces?
00:35:39.260 | - That's a very good question.
00:35:40.940 | We are still debating.
00:35:43.140 | - Yeah.
00:35:43.980 | - We're still debating.
00:35:46.660 | - I would say, if you, for example,
00:35:48.660 | it's interesting that like, for example, SWE-bench,
00:35:51.220 | if you want to be considered for ranking,
00:35:53.820 | you have to submit your reasoning traces
00:35:55.500 | and that has actually disqualified
00:35:56.940 | some of our past guests.
00:35:57.780 | Like Cosine was doing well on SWE-bench,
00:35:59.900 | but they didn't want to leak those results.
00:36:01.980 | So that's why you don't see o1-preview on SWE-bench
00:36:05.300 | because they don't submit their reasoning traces.
00:36:08.100 | And obviously it's IP,
00:36:09.100 | but also if you're going to be more open,
00:36:11.660 | then that's one way to be more open.
00:36:13.620 | So your model is not going to be open source, right?
00:36:16.100 | Like it's going to be a endpoint that you provide.
00:36:18.220 | - Yes.
00:36:19.380 | - Okay, cool.
00:36:20.340 | And then pricing also the same as OpenAI,
00:36:24.180 | just kind of face-on.
00:36:25.740 | - This is, I don't have actually information.
00:36:28.420 | Everything is going so fast,
00:36:29.580 | we haven't even think about that yet.
00:36:31.300 | Yeah, I should be more prepared.
00:36:33.500 | - I mean, this is live.
00:36:35.540 | It's nice to just talk about it as it goes live.
00:36:38.220 | Any other things that you're like,
00:36:39.620 | you want feedback on or you're thinking through?
00:36:41.700 | It's kind of nice to just talk about something
00:36:43.980 | when it's not decided yet about this new model.
00:36:46.260 | Like, I mean, it's going to be exciting.
00:36:48.100 | It's going to generate a lot of buzz.
00:36:50.340 | - Right.
00:36:51.780 | I'm very excited about to see
00:36:54.620 | how people are going to use this model.
00:36:56.860 | So there's already a Reddit discussion about it
00:37:00.020 | and the people are asking very deep medical questions.
00:37:03.420 | And it seems the model got it right.
00:37:05.460 | Surprising.
00:37:06.300 | And internally, we're also asking models
00:37:09.020 | to generate what is AGI.
00:37:10.940 | And it generates a very complicated DAG.
00:37:13.220 | Thinking process.
00:37:15.740 | So we're having a lot of fun testing this internally.
00:37:19.740 | But I'm more curious, how will people use it?
00:37:22.780 | What kind of application they're going to try
00:37:24.740 | and test on it?
00:37:26.020 | And that's where we'll really like to hear feedback
00:37:29.660 | from the community.
00:37:30.940 | And also feedback to us, like what works out well,
00:37:33.500 | what doesn't work out well?
00:37:34.740 | What works out well but surprising them?
00:37:37.660 | And what kind of thing they think we should improve on?
00:37:41.620 | And those kind of feedback will be tremendously helpful.
00:37:44.500 | - Yeah, I mean, so I've been a production user
00:37:46.220 | of Preview and Mini since March.
00:37:49.220 | I would say they're like very, very obvious
00:37:51.940 | in terms of quality.
00:37:55.180 | So much so that they made Claude Sonnet
00:37:58.980 | and 4o, just like they made the previous
00:38:00.220 | state-of-the-art look bad.
00:38:00.220 | Like it's really that stark, that difference.
00:38:04.220 | The number one thing I actually, you know,
00:38:06.380 | just feedback or request, feature requests
00:38:08.700 | is people want control on the budget.
00:38:11.340 | Because right now in '01,
00:38:13.620 | it kind of decides its own thinking budget.
00:38:15.860 | But sometimes you know how hard the problem is
00:38:18.580 | and you want to actually tell the model,
00:38:20.620 | like spend two minutes on this,
00:38:22.900 | or spend some dollar amount.
00:38:23.980 | Maybe it's time, maybe it's dollars.
00:38:25.140 | I don't know what the budget is.
00:38:26.980 | - That makes a lot of sense.
00:38:27.980 | So we actually thought about that requirement
00:38:31.180 | and it should be at some point we need to support that.
00:38:35.540 | Not initially, but that makes a lot of sense.
00:38:38.460 | - Okay, so that was a fascinating overview
00:38:41.020 | of just like the things that you're working on.
00:38:42.940 | First of all, I realized that,
00:38:44.860 | I don't know if I've ever given you this feedback,
00:38:46.500 | but I think you guys are one of the reasons
00:38:48.820 | I agreed to advise you.
00:38:50.300 | Because like, you know, I think when you first met me,
00:38:52.020 | I was kind of dubious.
00:38:53.100 | I was like-
00:38:53.940 | - Who are you?
00:38:54.780 | - There's like Replicate, there's Together.
00:38:57.380 | There's like Lepton.
00:38:58.660 | There's like a whole bunch of other players.
00:38:59.940 | You're in very, very competitive fields.
00:39:01.900 | Like why will you win?
00:39:03.460 | And the reason I actually changed my mind
00:39:06.620 | was I saw you guys shipping.
00:39:08.540 | You know, I think your surface area is very big.
00:39:10.780 | The team is not that big.
00:39:11.980 | - No, we're only 40 people.
00:39:13.900 | - Yeah, and now here you are trying to compete
00:39:16.020 | with OpenAI and you know, everyone else.
00:39:18.020 | Like, what is the secret?
00:39:20.060 | - I think the team, the team is the secret.
00:39:23.380 | - Oh boy.
00:39:24.220 | So there's no, there's no thing I can just copy.
00:39:27.300 | You just-
00:39:28.700 | - No.
00:39:29.540 | - I think we all come in very aligned on the culture.
00:39:35.220 | 'Cause most of our team came from Meta.
00:39:38.140 | - Yeah.
00:39:38.980 | - And many startups.
00:39:40.300 | So we really believe in results.
00:39:42.460 | One is result.
00:39:43.900 | And second is customer.
00:39:45.700 | We're very customer obsessed.
00:39:47.220 | And we don't want to drive adoption
00:39:50.940 | for the sake of adoption.
00:39:52.100 | We really want to make sure we understand
00:39:55.220 | we are delivering a lot of business values to the customer.
00:39:58.380 | And we are, we really value their feedback.
00:40:02.340 | So we would wake up mid of night
00:40:05.660 | and deploy some model for them.
00:40:08.220 | Shuffle some capacity for them.
00:40:10.660 | And yeah, over the weekend, no brainer.
00:40:15.300 | So yeah, so that's just how we work as a team.
00:40:18.820 | And the caliber of the team is really, really high as well.
00:40:23.820 | So, as a plug, we're hiring.
00:40:27.300 | We're expanding very, very fast.
00:40:29.460 | So if you are passionate about working
00:40:32.260 | on the most cutting-edge technology
00:40:34.580 | in the Gen-AI space, come talk with us.
00:40:37.620 | - Yeah.
00:40:38.460 | Let's talk a little bit about that customer journey.
00:40:40.300 | I think one of your more famous customers is Cursor.
00:40:42.700 | We were the first podcast to have Cursor on
00:40:44.780 | and then obviously since then they have blown up.
00:40:46.460 | Cause and effect are not related.
00:40:48.180 | But you guys especially worked on their fast apply model,
00:40:52.900 | where you were one of the first people
00:40:54.940 | to work on speculative decoding in a production setting.
00:40:58.860 | Maybe just talk about like,
00:41:00.020 | what was the behind the scenes of working with Cursor?
00:41:03.220 | - I will say, Cursor is a very, very unique team.
00:41:06.460 | I think a unique part is the team
00:41:08.740 | has very high technical caliber, right?
00:41:11.060 | There's no question about it.
00:41:12.420 | But they have decided,
00:41:14.380 | although like many companies including Copala,
00:41:17.340 | they will say, I'm going to build a whole entire stack
00:41:19.580 | because I can.
00:41:20.700 | And they are unique in the sense they seek partnership.
00:41:24.980 | Not because they cannot, they're fully capable,
00:41:27.580 | but they know where to focus.
00:41:29.020 | That to me is amazing.
00:41:30.660 | And of course they want to find a bypass partner.
00:41:33.580 | So we spent some time working together.
00:41:36.260 | They are pushing us very aggressively
00:41:39.180 | because for them to deliver high caliber product experience
00:41:42.580 | they need the latency.
00:41:44.060 | They need the interactive,
00:41:45.060 | but also high quality at the same time.
00:41:47.540 | So actually we expanded our product features quite a lot
00:41:51.780 | as we supported Cursor.
00:41:53.420 | And they are growing so fast
00:41:55.220 | and we massively scaled quickly across multiple regions.
00:41:59.460 | And we developed a pretty intense inference stack,
00:42:04.460 | almost similar to what we did for Meta.
00:42:07.900 | I think that's a very, very interesting engagement.
00:42:10.700 | And through that, there's a lot of trust being built.
00:42:13.180 | As in they realize, hey,
00:42:15.060 | this is a team they can really partner with
00:42:16.740 | and they can go big with.
00:42:18.820 | That comes back to, hey, we're really customer obsessed.
00:42:21.500 | And all the engineers working with them,
00:42:23.780 | there's just enormous amount of time
00:42:25.660 | syncing together with them and discussing.
00:42:29.180 | And we're not big on meetings,
00:42:30.700 | but we're like, Slack channel always on.
00:42:32.700 | Yeah, so you almost feel like working as one team.
00:42:36.220 | So I think that's really highlighted.
00:42:38.700 | - Yeah, for those who don't know,
00:42:39.940 | so basically Cursor is a VSCode fork,
00:42:41.940 | but most of the time people will be using closed models.
00:42:45.060 | Like I actually use a lot of Sonnet.
00:42:47.340 | So you're not involved there, right?
00:42:49.140 | It's not like you host Sonnet
00:42:50.460 | or you have any partnership with it.
00:42:52.180 | You're involved where cursor-small
00:42:53.980 | or like their house brand models are concerned, right?
00:42:58.860 | - I don't know what I can say,
00:43:00.340 | but the things they haven't said.
00:43:02.060 | (laughing)
00:43:04.620 | - Very obviously the dropdown is GPT-4o and then Cursor, right?
00:43:08.380 | So like, I assume that the Cursor side is the Fireworks side
00:43:11.220 | and then the other side, they're calling out the other.
00:43:14.140 | Just kind of curious.
00:43:15.420 | And then like, do you see any more opportunity on like the,
00:43:18.500 | you know, I think you made a big splash
00:43:19.980 | with like 1000 tokens per second.
00:43:21.660 | That was because of speculative decoding.
00:43:23.500 | Is there more to push there?
00:43:25.540 | - We push a lot.
00:43:26.380 | Actually, when I mentioned FireOptimizer, right?
00:43:29.020 | So as in, we have a unique automation stack
00:43:33.020 | that is one size fits one.
00:43:35.020 | We actually deployed to Cursor early on.
00:43:36.780 | Basically optimized for their specific workload.
00:43:39.220 | And there's a lot of juice to extract out of there.
00:43:42.180 | And we see that the success of that product
00:43:44.500 | can actually be widely adopted.
00:43:46.380 | So that's why we started a separate product line
00:43:50.820 | called FireOptimizer.
00:43:50.820 | So speculative decoding is just one approach.
00:43:54.020 | And speculative decoding here is not static.
00:43:55.820 | We actually wrote a blog post about it.
00:43:58.020 | There's so many different ways to do speculative decoding.
00:43:59.940 | You can pair a small model with a large model
00:44:02.220 | in the same model family,
00:44:03.580 | or you can have Eagle heads and so on.
00:44:06.380 | So there are different trade-offs
00:44:08.180 | of which approach to take.
00:44:10.060 | It really depends on your workload.
00:44:11.540 | And then with your workload,
00:44:12.580 | we can align the Eagle heads or Medusa heads
00:44:15.260 | or, you know, small, big model pair much better
00:44:18.380 | to extract the best latency reduction.
00:44:20.900 | So all of that is part of the FireOptimizer offering.
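To make the draft-and-verify idea concrete, here is a minimal sketch of speculative decoding using Hugging Face transformers' assisted generation. The model names are placeholders, and this off-the-shelf small/large pairing is only one of the approaches mentioned above; a production setup tuned per workload would more likely use EAGLE or Medusa heads.

```python
# Minimal sketch of speculative (assisted) decoding with a small draft model
# and a larger target model from the same family. Model IDs are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

TARGET = "meta-llama/Llama-3.1-8B-Instruct"   # larger, higher-quality model (placeholder)
DRAFT = "meta-llama/Llama-3.2-1B-Instruct"    # small draft model, same tokenizer (placeholder)

tokenizer = AutoTokenizer.from_pretrained(TARGET)
target = AutoModelForCausalLM.from_pretrained(TARGET, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(DRAFT, device_map="auto")

inputs = tokenizer("Rewrite this function to be iterative:", return_tensors="pt").to(target.device)

# The draft model proposes several tokens per step; the target model verifies
# them in a single forward pass and keeps the accepted prefix. The output
# matches target-only greedy decoding, but latency drops when acceptance is high.
out = target.generate(
    **inputs,
    assistant_model=draft,
    max_new_tokens=128,
    do_sample=False,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```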
00:44:23.980 | - I know you mentioned
00:44:24.980 | some of the other inference providers.
00:44:27.020 | I think the other question that people always have
00:44:28.980 | is around benchmarks.
00:44:30.260 | So you get different performance on different platforms.
00:44:34.140 | How should people think about,
00:44:35.740 | you know, people are like,
00:44:36.580 | "Hey, Lama 3.2 is X on MMLU."
00:44:40.540 | But maybe, you know, using speculative decoding,
00:44:43.100 | you go down a different path.
00:44:44.620 | Maybe some providers run a quantized model.
00:44:47.740 | How should people think about how much they should care
00:44:50.420 | about how you're actually running the model?
00:44:52.380 | You know, like, what's the delta
00:44:53.620 | between all the magic that you do
00:44:55.700 | and a raw model?
00:44:58.020 | - Okay, so there are two big development cycles.
00:45:01.300 | One is experimentation, where they need fast iteration.
00:45:04.380 | They don't want to think about quality
00:45:05.700 | and they just kind of want to experiment
00:45:07.860 | with product experience and so on, right?
00:45:09.540 | So that's one.
00:45:10.860 | And then when it looks good,
00:45:12.460 | they kind of want to go post-product-market-fit,
00:45:14.420 | and scaling and quality are really important,
00:45:17.020 | and latency and all the other things become important.
00:45:20.020 | During the experimentation phase,
00:45:21.700 | just pick a good model.
00:45:23.540 | Don't worry about anything else.
00:45:24.740 | Make sure gen AI is even the right solution
00:45:26.740 | for your product, and that's the focus.
00:45:29.460 | And then post-product market fit,
00:45:31.260 | then that's when the three-dimensional optimization curve
00:45:34.660 | starts to kick in across quality, latency, and cost,
00:45:38.420 | and where you should land.
00:45:39.860 | And to me, it's purely a product decision.
00:45:42.980 | For many products, if you choose a lower quality
00:45:46.340 | but better speed and lower cost,
00:45:49.380 | and it doesn't make a difference to the product experience,
00:45:52.020 | then you should do it.
00:45:53.300 | So that's why I think inference is part of the validation.
00:45:58.300 | The validation doesn't stop at offline eval.
00:46:00.780 | The validation is kind of,
00:46:02.180 | we'll go through A/B testing through inference
00:46:04.940 | and that's where we kind of offer
00:46:06.660 | various different configurations
00:46:07.780 | for you to test which is the best setting.
00:46:09.780 | So this is like traditional product evaluation.
00:46:13.580 | So product evaluation should also include
00:46:16.180 | your new model versions
00:46:18.020 | and different model setups into consideration.
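A minimal sketch of the post-product-market-fit trade-off described here: given candidate serving configurations with measured quality, latency, and cost, filter by the product's quality and latency bars and pick the cheapest that clears them. The config names and numbers below are made up for illustration; in practice quality comes from your own evals and A/B tests.

```python
# Hypothetical selection of a serving configuration on the
# quality/latency/cost curve. All configs and numbers are illustrative.
from dataclasses import dataclass

@dataclass
class Config:
    name: str
    quality: float        # e.g. win rate vs. a reference setup, 0..1
    p50_latency_ms: float
    cost_per_mtok: float  # dollars per million tokens

candidates = [
    Config("fp16-large",        quality=0.82, p50_latency_ms=900, cost_per_mtok=3.00),
    Config("fp8-large",         quality=0.81, p50_latency_ms=600, cost_per_mtok=1.80),
    Config("fp8-large+specdec", quality=0.81, p50_latency_ms=350, cost_per_mtok=1.90),
    Config("int4-small",        quality=0.74, p50_latency_ms=250, cost_per_mtok=0.40),
]

QUALITY_FLOOR = 0.80      # below this, users notice the difference in A/B tests
LATENCY_BUDGET_MS = 700   # product requirement for an interactive experience

viable = [c for c in candidates
          if c.quality >= QUALITY_FLOOR and c.p50_latency_ms <= LATENCY_BUDGET_MS]

# Among configs that clear the product bars, cost breaks the tie.
best = min(viable, key=lambda c: c.cost_per_mtok)
print(f"Chosen config: {best.name}")
```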
00:46:22.300 | - I want to specifically talk about
00:46:24.180 | what happened a few months ago
00:46:26.140 | with some of your major competitors.
00:46:28.780 | I mean, all of this is public.
00:46:30.580 | What is your take on what happened?
00:46:33.100 | And maybe you want to set the record straight
00:46:34.540 | on how Fireworks does quantization
00:46:36.580 | because I think a lot of people
00:46:38.300 | may have outdated perceptions
00:46:40.380 | or they didn't read the clarification posts
00:46:43.260 | on your approach to quantization.
00:46:45.100 | - First of all, it's always a surprise to us
00:46:47.260 | that without any notice, we got called out.
00:46:51.460 | - Specifically by name, which is normally not what-
00:46:54.620 | - Yeah, in a public post
00:46:56.820 | and have certain interpretation of our quality.
00:47:01.100 | So I was really surprised
00:47:02.940 | and it's not a good way to compete, right?
00:47:07.460 | We want to compete fairly
00:47:09.180 | and oftentimes when one vendor
00:47:12.540 | gives out results about another vendor,
00:47:15.220 | it's always extremely biased.
00:47:16.900 | So we actually refrain from doing any of that,
00:47:20.340 | and we happily partner with third parties
00:47:22.620 | to do the most fair evaluation.
00:47:25.300 | So we are very surprised
00:47:26.620 | and we don't think that's a good way
00:47:29.540 | to figure out the competition landscape.
00:47:31.700 | So then we react.
00:47:33.260 | I think when it comes to quantization,
00:47:36.220 | the interpretation,
00:47:37.380 | we wrote out actually a very thorough blog post
00:47:39.700 | because again, no one size fits all.
00:47:42.580 | We have various different quantization schemes.
00:47:45.100 | We can quantize very different parts of the model,
00:47:47.940 | from weights to activations to cross-GPU communication,
00:47:50.540 | and we can use different quantization schemes
00:47:53.380 | or be consistent across the board.
00:47:55.300 | And again, it's a trade-off.
00:47:56.420 | It's a trade-off across these three dimensions:
00:47:58.820 | quality, latency, and cost.
00:48:00.740 | And for our customers,
00:48:01.740 | we actually let them find the best optimized point,
00:48:05.540 | and that's kind of how we work:
00:48:06.820 | we have a very thorough evaluation process
00:48:09.700 | to pick that point.
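A toy example of why quantization is a quality/cost trade-off: per-channel int8 weight quantization shrinks the weights at the price of a small reconstruction error. This is illustrative only; as described above, production schemes also cover activations and cross-GPU communication and are chosen per workload.

```python
# Toy per-output-channel int8 weight quantization, to make the trade-off concrete.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4096, 4096)).astype(np.float32)  # one dense layer's weights

# One scale per output channel (row): scale = max|w| / 127.
scales = np.abs(W).max(axis=1, keepdims=True) / 127.0
W_int8 = np.clip(np.round(W / scales), -127, 127).astype(np.int8)

# Dequantize to estimate the error the scheme introduces.
W_hat = W_int8.astype(np.float32) * scales
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)

print(f"memory: {W.nbytes / 1e6:.0f} MB fp32 -> {W_int8.nbytes / 1e6:.0f} MB int8")
print(f"relative reconstruction error: {rel_err:.4f}")
```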
00:48:11.460 | But for self-serve, there's only one point to pick.
00:48:14.300 | There's no customization available.
00:48:16.780 | So of course, depending on what we hear,
00:48:20.700 | we talk with many customers,
00:48:21.740 | and we have to pick one point.
00:48:23.420 | And I think in the end,
00:48:28.420 | Artificial Analysis later published a quality measure,
00:48:32.980 | and we actually look really good.
00:48:34.900 | So that's what I mean:
00:48:36.580 | I will leave the evaluation
00:48:39.980 | of quality or performance to third parties
00:48:42.700 | and work with them to find the most fair benchmarking
00:48:46.740 | methodology.
00:48:47.900 | But I'm not a fan of the approach of calling out specific names
00:48:52.900 | and critiquing other competitors in a very biased way.
00:48:57.380 | - Databricks happens as well.
00:48:59.580 | I think you're the more politically correct one.
00:49:01.580 | And then Dima is the more,
00:49:03.300 | it's you on Twitter.
00:49:06.780 | - We, yeah.
00:49:09.300 | - It's like the Russian.
00:49:10.780 | - We partner.
00:49:11.820 | No, actually all these directions we build together.
00:49:15.020 | - Wow.
00:49:15.860 | - Play different roles.
00:49:18.700 | Cut this.
00:49:19.540 | - Another one that I wanted to,
00:49:22.300 | on just the last one on the competition side,
00:49:24.660 | there's a perception of price wars
00:49:26.660 | in hosting open source models.
00:49:29.220 | You are, you're,
00:49:30.060 | and we talked about the competitiveness in the market.
00:49:32.660 | Do you aim to make margin on open source models?
00:49:37.140 | - Oh, absolutely yes.
00:49:39.140 | So, but I think when we think about pricing,
00:49:43.260 | it really needs to coordinate
00:49:45.860 | with the value we are delivering.
00:49:47.900 | If the value is limited,
00:49:49.620 | or there are a lot of people delivering the same value,
00:49:51.980 | there's no differentiation.
00:49:53.180 | There's only one way to go, which is down, right?
00:49:55.140 | So through competition.
00:49:56.460 | If I take a big step back,
00:49:58.220 | there is pricing where
00:50:00.140 | we're more compared with, like, closed model providers'
00:50:03.380 | APIs, right?
00:50:04.500 | The closed model provider,
00:50:05.980 | their cost structure is even more interesting
00:50:08.140 | because we don't have any,
00:50:09.140 | we don't bear any training costs.
00:50:10.900 | And we focus on inference optimization,
00:50:13.220 | and that's kind of where we continue
00:50:15.100 | to add a lot of product value.
00:50:16.740 | So that's how we think about product.
00:50:18.540 | But for the closed source API provider,
00:50:21.780 | model provider,
00:50:22.820 | they bear a lot of training costs.
00:50:24.820 | And they need to amortize the training costs
00:50:26.380 | into the inference.
00:50:27.780 | So that creates very interesting dynamics of,
00:50:30.700 | yeah, if we match pricing there,
00:50:32.980 | then how they are going to make money
00:50:35.380 | is very, very interesting.
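A hypothetical back-of-the-envelope version of the amortization dynamic described here: a provider that bears training cost has to fold it into the per-token price, while a pure inference provider does not. All numbers below are made up for illustration.

```python
# Hypothetical amortization of training cost into inference pricing.
training_cost = 500e6          # dollars spent training a model (assumed)
lifetime_tokens = 1e15         # tokens served before the model is retired (assumed)
inference_cost_per_mtok = 0.60 # pure serving cost per million tokens (assumed)

amortized_training_per_mtok = training_cost / lifetime_tokens * 1e6
breakeven_price_per_mtok = inference_cost_per_mtok + amortized_training_per_mtok

print(f"amortized training cost: ${amortized_training_per_mtok:.2f} per million tokens")
print(f"break-even price:        ${breakeven_price_per_mtok:.2f} per million tokens")
```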
00:50:37.700 | - So for listeners,
00:50:38.940 | OpenAI's 2024 numbers:
00:50:40.860 | 4 billion in revenue,
00:50:42.620 | 3 billion in compute training,
00:50:45.220 | 2 billion in compute inference,
00:50:47.100 | 1 billion in research compute amortization,
00:50:52.140 | and 700 million in salaries.
00:50:54.420 | So that is like,
00:50:56.420 | (laughing)
00:50:58.660 | I mean, a lot of R&D.
00:51:01.500 | - Yeah, so I think Meta is basically, like,
00:51:04.860 | making it zero, yeah.
00:51:06.820 | So that's a very, very interesting dynamics
00:51:09.340 | we're operating within.
00:51:10.500 | But coming back to inference, right,
00:51:11.660 | so we are, again, as I mentioned,
00:51:13.460 | our product is, we are a platform.
00:51:14.980 | We are not just a single-model-as-a-service provider,
00:51:18.020 | like many other inference providers
00:51:19.820 | that provide a single model.
00:51:21.460 | We have FireOptimizer to highly customize
00:51:24.420 | towards your inference workload.
00:51:26.140 | We have a compound AI system
00:51:27.700 | that significantly simplifies your path
00:51:30.500 | to high quality, low latency, and low cost.
00:51:34.020 | So those are all very different
00:51:36.180 | from other providers.
00:51:38.220 | - What do people not know about the work that you do?
00:51:41.100 | I guess like people are like,
00:51:41.940 | okay, Fireworks, you run models very quickly.
00:51:44.300 | You have the function-calling model.
00:51:45.860 | Is there any kind of like underrated part of Fireworks
00:51:49.260 | that more people should try?
00:51:51.060 | - Yeah, actually one user posted on x.com,
00:51:56.060 | he mentioned, oh, actually,
00:52:00.100 | Fireworks allows me to upload a LoRA adapter
00:52:04.220 | to the serverless model
00:52:07.540 | and use it at the same cost.
00:52:09.260 | Nobody else has provided that.
00:52:10.860 | That's because we have a very special,
00:52:13.580 | like we rolled out multi-LoRA last year, actually,
00:52:17.020 | and we've had this function for a long time,
00:52:19.740 | and many people have been using it,
00:52:21.060 | but it's not well known that,
00:52:23.060 | oh, if you fine-tune your model,
00:52:24.380 | you don't need to use on-demand.
00:52:26.060 | If you fine-tune your model with LoRA,
00:52:27.780 | you can upload your LoRA adapter,
00:52:30.420 | and we deploy it as if it's a new model,
00:52:33.740 | and then you get your endpoint,
00:52:36.220 | and you can use that directly,
00:52:37.500 | but at the same cost as the base model,
00:52:39.700 | so I'm happy that user is marketing it for us.
00:52:43.460 | He discovered that feature,
00:52:45.900 | but we have that for last year,
00:52:48.260 | so I think the feedback to me is,
00:52:53.060 | we have a lot of very, very good features,
00:52:55.180 | as Sean just mentioned.
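For readers curious what the LoRA feature above looks like in practice, here is a sketch of calling a deployed fine-tune through Fireworks' OpenAI-compatible endpoint. The base URL reflects the public docs at the time of writing and should be double-checked; the account and model path is a placeholder for your own deployed adapter.

```python
# Sketch of calling a deployed LoRA fine-tune via the OpenAI-compatible API.
# The model path is a placeholder; check your dashboard for the exact identifier.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

resp = client.chat.completions.create(
    model="accounts/your-account/models/your-lora-finetune",  # placeholder path
    messages=[{"role": "user", "content": "Summarize our Q3 support tickets."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```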
00:52:56.660 | - I'm the advisor to the company,
00:52:57.820 | and I didn't know that you had
00:52:58.860 | speculative decoding released, you know?
00:53:01.020 | (laughing)
00:53:02.860 | - We have prompt caching from way back last year also.
00:53:05.500 | We have many, yeah.
00:53:06.820 | So yeah, so I think that is one of the underrated feature,
00:53:10.140 | and if you're a developer
00:53:12.260 | using our self-serve platform,
00:53:14.500 | please try it out.
00:53:15.700 | - Yeah, yeah, yeah.
00:53:16.540 | The LoRa thing's interesting,
00:53:17.580 | because I think you also,
00:53:19.940 | the reason people add additional cost to it
00:53:22.620 | is not because they feel like charging people.
00:53:25.060 | Normally, in LoRA serving setups,
00:53:28.020 | there is a cost to
00:53:30.420 | loading those weights
00:53:31.580 | and dedicating a machine to that inference.
00:53:34.500 | How come you can avoid it?
00:53:36.100 | - Yeah, so this is kind of our technique called multi-LoRA.
00:53:39.820 | So we basically have many LoRA adapters
00:53:43.380 | share the same base model,
00:53:45.340 | and basically we significantly reduce
00:53:47.540 | the memory footprint of serving,
00:53:50.460 | and one base model can sustain
00:53:52.180 | a hundred to a thousand LoRA adapters,
00:53:54.340 | and then basically all these different LoRA adapters
00:53:57.180 | can share the same,
00:53:58.100 | like, direct their traffic to the same base model,
00:54:00.340 | where the base model dominates the cost.
00:54:02.660 | So that's why we can advertise it that way,
00:54:05.060 | and that's how we can keep
00:54:07.180 | the tokens per dollar,
00:54:10.460 | the per-million-token pricing,
00:54:12.020 | the same as the base model.
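A toy sketch of the multi-LoRA idea described above: one base weight matrix loaded once, many small low-rank adapters, and each request routed to its adapter while sharing the base compute. Real servers batch requests across adapters and fuse these matmuls; this only shows why the marginal memory per adapter is tiny.

```python
# Toy multi-LoRA serving: one shared base layer, many low-rank adapters,
# requests routed by adapter id. Sizes are illustrative.
import numpy as np

d, rank = 4096, 16
rng = np.random.default_rng(0)
W_base = rng.normal(size=(d, d)).astype(np.float32)  # shared base layer, loaded once

# Hundreds of adapters can share it: each one is just two skinny matrices.
adapters = {
    f"customer-{i}": (rng.normal(size=(d, rank)).astype(np.float32) * 0.01,
                      rng.normal(size=(rank, d)).astype(np.float32) * 0.01)
    for i in range(100)
}

def forward(x: np.ndarray, adapter_id: str) -> np.ndarray:
    A, B = adapters[adapter_id]
    # Base compute is shared; the adapter adds a cheap low-rank correction.
    return x @ W_base + (x @ A) @ B

x = rng.normal(size=(1, d)).astype(np.float32)
y = forward(x, "customer-42")

base_mb = W_base.nbytes / 1e6
A0, B0 = adapters["customer-0"]
per_adapter_mb = (A0.nbytes + B0.nbytes) / 1e6
print(f"base layer: {base_mb:.0f} MB, each adapter: {per_adapter_mb:.2f} MB")
```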
00:54:13.780 | - Is there anything that you think
00:54:15.860 | you want to request from the community,
00:54:17.500 | or you're looking for model-wise or tooling-wise
00:54:20.860 | that you think someone should be working on in this?
00:54:23.420 | - Yeah, so we really want to get a lot of feedback
00:54:27.060 | from the application developers
00:54:30.180 | who are starting to build on gen AI,
00:54:32.700 | or have already adopted it,
00:54:35.020 | or are starting to think about new use cases and so on,
00:54:38.660 | to try out Fireworks first,
00:54:41.740 | and let us know what works out really well for you,
00:54:44.180 | and what is your wish list,
00:54:46.020 | and what sucks, right?
00:54:47.980 | So what is not working out for you,
00:54:50.020 | and we would like to continue to improve,
00:54:53.060 | and for our new product launches,
00:54:54.820 | typically we want to launch to a small group of people.
00:54:58.020 | Usually we launch on our Discord first,
00:55:00.540 | to have a set of people use that first.
00:55:03.140 | So please join our Discord channel.
00:55:05.180 | We have a lot of communication going on there.
00:55:07.940 | Again, you can also give us feedback.
00:55:09.500 | We're also starting office hours
00:55:11.860 | for you to directly talk with our dev rel
00:55:14.300 | and engineers to exchange more notes.
00:55:17.180 | - And you're hiring across the board?
00:55:18.940 | - We're hiring across the board.
00:55:20.260 | We're hiring front-end engineers,
00:55:22.220 | cloud infrastructure engineers,
00:55:24.140 | back-end system optimization engineers,
00:55:26.380 | applied researchers,
00:55:28.020 | and researchers who have done post-training,
00:55:31.300 | who have done a lot of fine-tuning and so on.
00:55:33.540 | - That's it.
00:55:35.460 | Thank you.
00:55:36.300 | - Awesome.
00:55:37.140 | - Thanks for having us.
00:55:38.820 | (upbeat music)
00:55:41.460 | (upbeat music)
00:55:44.060 | (upbeat music)
00:55:46.640 | (upbeat music)