Why Compound AI + Open Source will beat Closed AI — with Lin Qiao, CEO of Fireworks AI
00:00:06.160 |
This is Alessio, partner and CTO at Decibel Partners, 00:00:08.920 |
and I'm joined by my co-host, Swyx, founder of Smol AI. 00:00:11.920 |
- Hey, and today we're in a very special studio 00:00:29.160 |
but I think our relationship is a bit unusual 00:00:34.800 |
Yeah, I'm super excited to talk about very interesting 00:00:41.200 |
- You just celebrated your two-year anniversary yesterday. 00:00:45.080 |
We circle around and share all the crazy stories 00:00:47.480 |
across these two years, and it has been super fun. 00:00:51.300 |
All the way from when we experienced the Silicon Valley Bank run. 00:00:57.540 |
- To when we deleted some data that shouldn't have been deleted. 00:01:02.540 |
Operationally, we went through a massive scale 00:01:08.160 |
where we actually are busy getting capacity to... 00:01:13.160 |
Yeah, we learned to kind of work with it as a team 00:01:17.260 |
with a lot of brilliant people across different places, 00:01:24.640 |
- When you started, did you think the technical stuff 00:01:27.280 |
would be harder or the bank run and then the people side? 00:01:34.600 |
the hardest thing is going to be building the product, 00:01:36.520 |
and then you have all these different other things. 00:01:38.440 |
So, were you surprised by what your experience has been? 00:01:44.600 |
my focus has always been on the product side, 00:01:49.420 |
And I didn't realize the rest would be so complicated. 00:02:00.120 |
So, I think I just somehow don't think about it too much 00:02:04.180 |
and solve whatever problems come our way, and it worked. 00:02:08.440 |
- So, I guess let's start at the pre-history, 00:02:13.740 |
You ran the PyTorch team at Meta for a number of years, 00:02:34.400 |
- My background is deep in distributed systems, 00:02:43.180 |
And I saw this tremendous amount of data growth, 00:02:50.440 |
And it's clear that AI is driving all this data generation. 00:02:58.880 |
Meta was going through the transition from mobile-first to AI-first. 00:03:05.040 |
And there's a fundamental reason behind that sequence, 00:03:07.880 |
because mobile-first gave a full range of user engagement 00:03:14.320 |
And all this user engagement generated a lot of data, 00:03:19.560 |
So, then the whole entire industry was also going through that shift, 00:03:34.180 |
and I'm like, I want to dive in there and help this movement. 00:03:44.940 |
There was a kind of proliferation of AI frameworks, 00:03:49.940 |
but all of those AI frameworks focused on production, 00:03:59.280 |
and they use that to drive the model actuation 00:04:16.620 |
and I'm gonna do something different for myself. 00:04:21.720 |
PyTorch actually started as the framework for researchers. 00:04:34.100 |
There are so many researchers across academia, 00:04:40.460 |
and they put their results out there in open source. 00:04:43.620 |
And that powers the downstream productionization. 00:04:58.740 |
So, that's kind of a strategy behind PyTorch. 00:05:02.980 |
it's kind of classic that Meta established PyTorch 00:05:05.580 |
as the framework for both research and production. 00:05:10.540 |
And we had to kind of rethink how to architect PyTorch 00:05:13.380 |
so we can really sustain production workloads, 00:05:18.100 |
since all these production concerns were never a concern before, 00:05:37.500 |
from site integrity detecting bad content automatically using AI, 00:05:44.340 |
to image classification, object detection, all of this. 00:05:47.140 |
And also across AI running on the server side, 00:05:49.940 |
on mobile phones, on AR/VR devices, the whole wide spectrum. 00:05:54.580 |
So by that time, we actually basically managed 00:05:57.780 |
to support AI ubiquitously, everywhere across Meta. 00:06:02.540 |
But interestingly, through open source engagement, 00:06:07.940 |
this industry started to take on the AI-first transition. 00:06:22.460 |
For many companies we engaged with through PyTorch, 00:06:28.980 |
hey, if we create Fireworks and support the industry 00:06:44.220 |
with extreme optimization, the industry will be different. 00:06:58.620 |
- When you and I chatted about like the origins of Fireworks, 00:07:01.780 |
it was originally envisioned more as a PyTorch platform. 00:07:06.380 |
And then later became much more focused on generative AI. 00:07:13.300 |
- Right, so I would say our initial blueprint 00:07:22.020 |
and there's no SaaS platform to enable AI workloads. 00:07:34.340 |
Because in 2022, there was still like TensorFlow, 00:07:41.580 |
and PyTorch was kind of getting more and more adoption, 00:07:45.140 |
but there was no PyTorch-first SaaS platform in existence. 00:08:04.940 |
Instead of building a horizontal PyTorch cloud, 00:08:07.060 |
we want to build a verticalized platform first. 00:08:13.140 |
And interestingly, we started the company in September 2022, 00:08:16.980 |
and in October, November, OpenAI announced ChatGPT. 00:08:21.700 |
And then boom, then when we talk with many customers, 00:08:28.340 |
So of course, there are some open-source models. 00:08:32.620 |
but people are already putting a lot of attention there. 00:08:35.700 |
Then we decide that if we're going to pick a vertical, 00:08:39.620 |
The other reason is all Gen-AI models are PyTorch models. 00:08:44.260 |
We believe that because of the nature of Gen-AI, 00:08:47.020 |
it's going to generate a lot of human consumable content. 00:08:58.900 |
Our prediction is for those kinds of applications, 00:09:01.700 |
the inference is much more important than training 00:09:12.860 |
Of course, each training round could be very expensive. 00:09:15.980 |
Although PyTorch supports both inference and training, 00:09:23.100 |
And we launched our public platform August last year. 00:09:35.860 |
We started with LLMs, and later on, we added a lot of models. 00:09:43.220 |
So we love to kind of dive deep into what we offer. 00:09:46.180 |
So, but that's a very fun journey in the past two years. 00:09:49.780 |
- What was the transition from when you started focused on PyTorch, 00:09:53.220 |
and people wanted to understand the framework and get it live. 00:09:56.340 |
And now I would say maybe most people that use you 00:09:58.500 |
don't even really know much about PyTorch at all. 00:10:08.060 |
you were just like, "Hey, most people just care 00:10:43.580 |
first hire a team who is capable of crunching data, 00:11:01.860 |
and not many companies can afford it, actually. 00:11:05.300 |
And Gen-AI is a very different game right now 00:11:12.620 |
That makes AI much more accessible as a technology. 00:11:19.740 |
they can interact with Gen-AI models directly. 00:11:34.980 |
doesn't make any sense anymore with this new technology. 00:11:38.620 |
And then building easy, accessible APIs is the most important. 00:11:44.540 |
we decided we're going to be OpenAI compatible. 00:12:01.180 |
Gemini announced that they have OpenAI compatible APIs. 00:12:06.180 |
to adopt it overnight, and then we have everyone. 00:12:17.900 |
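For readers who want a concrete picture of what "OpenAI compatible" means in practice, here is a minimal sketch using the standard `openai` Python client pointed at Fireworks' OpenAI-compatible endpoint; the model identifier is only an example, so substitute whatever model you actually want from the catalog.

```python
# Minimal sketch: the standard OpenAI client, repointed at Fireworks'
# OpenAI-compatible endpoint. The model id below is an example; check the
# Fireworks model catalog for the exact identifier you want.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="YOUR_FIREWORKS_API_KEY",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # example model id
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```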
And Meta has decided to donate many very, very strong open source models. 00:12:29.740 |
the upper-level stack, built on top of Llama models. 00:12:37.180 |
They instead want to build a community around the stack 00:12:49.980 |
because they are kind of creating the top-of-the-line 00:12:54.540 |
because this is the most used open source model. 00:12:57.340 |
So I think it's really a lot of fun working at this time. 00:13:01.540 |
- I've been a little bit more doubtful on Llama Stack. 00:13:28.980 |
That's why I kind of will work very closely with them 00:13:33.340 |
The feedback to the Meta team is very important. 00:13:35.660 |
So then they can use that to continue to improve the model 00:13:46.420 |
And I know Meta team would like to kind of work 00:13:49.340 |
with a broader set of community, but it's very early. 00:14:01.100 |
you started betting heavily on this term of compound AI. 00:14:03.820 |
It's not a term that we've covered very much in the podcast, 00:14:06.460 |
but I think it's definitely getting a lot of adoption 00:14:09.340 |
from Databricks and the Berkeley people and all that. 00:14:16.100 |
- Right, so let me give a little bit of context 00:14:24.140 |
there was no message, and now it's like on your landing page. 00:14:31.300 |
from when we first launched our public platform. 00:14:34.540 |
We are a single product, and we are a distributed 00:14:36.380 |
inference engine, where we do a lot of innovation, 00:14:45.860 |
and build distributed disaggregated execution, 00:14:50.180 |
inference execution, and build all kinds of caching. 00:14:55.940 |
So it is the fastest, most cost-efficient inference platform. 00:15:00.540 |
we know we basically have a special PyTorch build for that, 00:15:07.900 |
we realized, oh, the distributed inference engine, 00:15:14.940 |
then everyone come in, and no matter what kind of 00:15:26.140 |
all customers have different kind of use cases. 00:15:28.460 |
The use cases come in all different form and shape. 00:15:37.900 |
with the data distribution in the training data 00:15:46.580 |
what's not important, like in preparing data for training. 00:15:57.540 |
So then we're saying, okay, we want to heavily invest 00:16:02.740 |
And we actually announced it, called FireOptimizer. 00:16:04.980 |
So FireOptimizer basically helps users navigate 00:16:16.180 |
And even for one company, for different use case, 00:16:22.100 |
So we automate that process for our customer. 00:16:32.620 |
And then we spit out inference deployment config 00:16:43.740 |
So that product thinking is one size fits one. 00:16:43.740 |
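To make the "one size fits one" idea concrete, here is a purely hypothetical sketch of what a workload-specific deployment configuration could look like; every field name below is invented for illustration and is not the actual FireOptimizer output format.

```python
# Hypothetical illustration only: invented field names showing the *shape* of
# a per-workload deployment config (quality/latency/cost knobs), not the real
# FireOptimizer output.
example_deployment_config = {
    "base_model": "accounts/fireworks/models/llama-v3p1-70b-instruct",  # example id
    "hardware": {"gpu_sku": "H100", "count": 8, "layout": "disaggregated"},
    "quantization": "fp8",                      # trade a little quality for cost
    "speculative_decoding": {
        "enabled": True,
        "draft_model": "example-small-draft",   # placeholder name
    },
    "targets": {"p50_latency_ms": 300, "max_cost_per_1m_tokens_usd": 0.50},
}
print(example_deployment_config["targets"])
```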
We also provide a huge variety of state-of-the-art models, 00:16:49.740 |
ranging from text to state-of-the-art audio models. 00:17:02.820 |
we realize, oh, audio and text are very, very close. 00:17:06.420 |
Many of our customers start to build assistants, 00:17:31.020 |
because a lot of information doesn't live in plain text. 00:17:34.420 |
A lot of information live in multimedia format, 00:17:50.060 |
So vision is important, and we also support vision models. 00:17:52.580 |
Various different kinds of vision models specialize 00:17:54.580 |
in processing different kinds of sources and extraction. 00:17:58.580 |
And we're also gonna have another announcement 00:18:08.220 |
and then extract very accurate information out 00:18:19.380 |
And in addition to that, we also support text to image, 00:18:22.180 |
image generation models, text to image, image to image, 00:18:25.100 |
and we're adding text to video as well in our portfolio. 00:18:28.540 |
So it's a very comprehensive model catalog, 00:18:39.260 |
and then we realize one model is not sufficient 00:18:44.060 |
And it's very clear, because one is, the model hallucinates, 00:18:47.860 |
and many customers, when they onboard this Gen-AI journey, 00:18:52.340 |
think Gen-AI is gonna solve all my problems magically, 00:18:54.460 |
but then they realize, oh, this model hallucinates. 00:18:57.100 |
It hallucinates because it's not deterministic, 00:19:00.540 |
So it's designed to always give you an answer, 00:19:14.380 |
And different models also have different specialties. 00:19:16.900 |
To solve a problem, you want to ask a different specialized model 00:19:25.060 |
and have an expert model solve that task really well. 00:19:28.140 |
And of course, the model doesn't have all the information. 00:19:32.140 |
because the training data is finite, not infinite. 00:19:34.580 |
So model oftentimes doesn't have real-time information. 00:19:49.660 |
A compound AI system basically is gonna have multiple models, 00:19:58.180 |
plus access to APIs, whether it's public APIs, internal proprietary APIs, 00:20:20.260 |
but it's public information, like MongoDB is our investor, 00:20:23.740 |
and we have been working closely with them for a while. 00:20:32.180 |
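As a rough, self-contained sketch of the compound AI idea just described — a model call grounded by an external knowledge source — the snippet below stubs out retrieval with a placeholder function; in a real system that step would hit a vector store (for example MongoDB Atlas) or an internal API, and the model id is again only an example.

```python
# Rough sketch of a compound AI flow: retrieve context externally, then have
# the model answer grounded in it. The retrieval step is a stub; the model id
# is an example.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="YOUR_FIREWORKS_API_KEY",
)

def retrieve_context(query: str) -> str:
    # Placeholder for a vector-database or internal-API lookup.
    return "Fireworks AI provides a distributed inference engine for open models."

def answer_with_context(question: str) -> str:
    context = retrieve_context(question)            # step 1: external knowledge
    response = client.chat.completions.create(      # step 2: grounded model call
        model="accounts/fireworks/models/llama-v3p1-8b-instruct",
        messages=[
            {"role": "system", "content": f"Answer using this context: {context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(answer_with_context("What does Fireworks AI build?"))
```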
it's almost like you're centralizing a lot of the decisions 00:20:39.140 |
It's like you have GPUs in like a lot of different clusters, 00:20:41.740 |
so like you're sharding the inference across. 00:20:45.460 |
So first of all, we run across multiple GPUs. 00:20:49.620 |
But the way we distribute across multiple GPUs is unique. 00:20:54.060 |
We don't distribute the whole model monolithically 00:21:26.700 |
like different continent wakes up at a different time. 00:21:29.740 |
And you want to kind of load balancing across. 00:21:35.300 |
we manage various different kinds of hardware SKUs, 00:21:35.300 |
whether it's long context, short context, long generation. 00:21:44.580 |
So all these different types of workload are best fitted 00:21:47.820 |
the image that Ray, I think, has been working on 00:22:07.700 |
with like all the different modalities that you offer. 00:22:10.140 |
Like to me, it's basically you offer the open source version 00:22:13.620 |
of everything that OpenAI typically offers, right? 00:22:31.940 |
I think we're betting on the open source community 00:22:41.140 |
- And there's amazing video generation companies. 00:22:48.140 |
Like cross-border, the innovation is off the chart 00:22:58.460 |
- I think I want to restate the value proposition 00:23:00.420 |
of Fireworks for people who are comparing you 00:23:02.940 |
versus like a raw GPU provider, like a RunPod, 00:23:08.820 |
which is like you create the developer experience layer 00:23:12.380 |
and you also make it easily scalable or serverless 00:23:25.860 |
for all large language models, all your models. 00:23:32.740 |
- Yeah, almost for all models we serve, we have. 00:23:52.460 |
- Yeah, I think the typical challenge for people 00:23:59.860 |
who are also offering open source models, right? 00:24:05.100 |
like a good experience for all these customers. 00:24:07.580 |
But if your existence is entirely reliant on people 00:24:17.660 |
So that's the kind of foundation we build on top of. 00:24:28.900 |
So what's happening in the industry right now 00:24:39.740 |
They help me understand the existing way of doing PowerPoint, 00:24:56.380 |
how to fit my storytelling into this format, 00:25:19.460 |
combined with automated content generation through Gen-AI, 00:25:24.580 |
is the new thing that many founders are doing. 00:25:34.620 |
they are consumer, personal, and developer facing, 00:25:40.180 |
It's just a kind of product experience we all get used to. 00:25:46.340 |
Otherwise, nobody wants to spend time, right? 00:25:48.740 |
So again, and then that requires low latency. 00:25:52.700 |
the nature of consumer, personal, and developer facing 00:25:57.180 |
You want to scale up to product market fit quickly. 00:26:07.740 |
But when I scale, I scale out of my business. 00:26:09.900 |
So that's kind of very funny to think about it. 00:26:13.020 |
So then having low latency and low cost is essential 00:26:18.020 |
for those new applications and products to survive 00:26:25.620 |
our distributed inference engine and FireOptimizer. 00:26:43.940 |
And we automate that because we don't want you 00:26:46.980 |
as app developer or product engineer to think about 00:26:49.740 |
how to figure out all these low-level details. 00:27:07.380 |
Every week, there's at least a new model coming out. 00:27:43.180 |
You give developer tools to dictate how to do it. 00:27:49.300 |
where a developer tells what they want to do, not how. 00:27:52.660 |
So these are completely two different designs. 00:27:55.380 |
So the analogy I want to draw is in the data world, 00:27:59.740 |
the database management system is a declarative system 00:28:19.900 |
And the database management system will figure out, 00:28:22.340 |
generate a new best plan, and execute on that. 00:28:34.660 |
On the imperative side, there are a lot of ETL pipelines. 00:28:56.460 |
I don't think one is gonna subsume the other, 00:29:02.740 |
because from the lens of the app developer and product engineer, they care about the what, not the how. 00:29:08.220 |
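To ground the database analogy, here is a small self-contained example using Python's built-in sqlite3: the declarative query states what is wanted and lets the engine plan the how, while the imperative version spells out each step by hand.

```python
# Declarative vs. imperative, in miniature: same result, two styles.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (model TEXT, latency_ms REAL)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?)",
    [("llama-8b", 120.0), ("llama-70b", 450.0), ("llama-8b", 95.0)],
)

# Declarative: say *what* you want; the engine decides *how* to compute it.
declarative = conn.execute(
    "SELECT model, AVG(latency_ms) FROM requests GROUP BY model"
).fetchall()

# Imperative: spell out *how*, step by step.
totals = {}
for model, latency in conn.execute("SELECT model, latency_ms FROM requests"):
    totals.setdefault(model, []).append(latency)
imperative = {m: sum(v) / len(v) for m, v in totals.items()}

print(declarative, imperative)
```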
- I understand that's also why PyTorch won as well, right? 00:29:26.100 |
So another announcement is we will also announce 00:29:38.020 |
And this model is inspired by o1's announcement. 00:29:42.460 |
You should see that by the time we announce this or soon. 00:29:52.860 |
We actually have trained a model called FireFunction. 00:30:08.740 |
We have a pre-baked set of APIs the model has learned. 00:30:18.340 |
So we have a very high quality function calling model 00:30:28.180 |
that you don't even need to use function calling model. 00:30:35.060 |
approaching very high, like OpenAI's quality. 00:30:50.620 |
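Here is a hedged sketch of OpenAI-style function calling against a function-calling model such as FireFunction; the model id and the tool definition are illustrative, and a production caller should handle the case where the model chooses not to call a tool.

```python
# Sketch of OpenAI-style tool calling; model id and tool are illustrative.
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="YOUR_FIREWORKS_API_KEY",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",  # hypothetical tool
        "description": "Get the latest price for a stock ticker.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

resp = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",  # example model id
    messages=[{"role": "user", "content": "What is NVDA trading at?"}],
    tools=tools,
)

message = resp.choices[0].message
if message.tool_calls:  # the model decided a tool call is needed
    call = message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
else:
    print(message.content)
```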
is this a next Gemini model or a MADIS model? 00:30:57.500 |
We're like watching the Reddit discussion right now. 00:31:00.420 |
- I mean, I have to ask more questions about this. 00:31:07.300 |
it's a single model or whether it's like a chain of models. 00:31:10.420 |
And basically everyone on the Strawberry team 00:31:17.100 |
for reinforcement learning, chain of thought, 00:31:24.500 |
Have you done the same amount of work on RL as they have 00:31:32.100 |
Where I do agree is the caliber of the team is very high, right? 00:31:51.300 |
We are definitely, for that, I fully agree with them. 00:31:54.740 |
But we're taking a completely different approach 00:32:02.140 |
All of that is because we build on the shoulders of giants, right? 00:32:05.140 |
So the current model available we have access to 00:32:09.300 |
The future trend is the gap between the open source model, 00:32:19.180 |
That's why I think our early investment in inference 00:32:22.820 |
and all the work we do around balancing across quality, 00:32:29.780 |
because we have accumulated a lot of experience there 00:32:32.260 |
and that empower us to release this new model 00:32:40.340 |
what do you think the gap to catch up will be? 00:32:44.700 |
with open source models eventually will catch up. 00:32:47.340 |
And I think with GPT-4, then with Llama 3.2, 3.1, 405B, 00:32:52.420 |
And then o1 just reopened the gap so much, and it's unclear. 00:32:55.900 |
Obviously you're saying your model will have- 00:33:02.340 |
- So here's the thing that's happened, right? 00:33:06.620 |
But in reality, open source model in certain dimension 00:33:11.140 |
already on par or beat closed source model, right? 00:33:22.100 |
like FireFunction is also really, really good. 00:33:24.220 |
So it's all a matter of whether you build one model 00:33:28.220 |
and you want to be the best of solving all the problems 00:33:31.260 |
or in the open source domain, it's gonna specialize, right? 00:33:34.580 |
All these different model builders specialize 00:33:39.260 |
And it's logical that they can be really, really good 00:33:44.500 |
And that's our prediction is with specialization, 00:33:48.540 |
there will be a lot of expert models really, really good 00:34:07.140 |
'cause you're basically fighting the bitter lesson. 00:34:13.900 |
about someone specializing doing something really well, right? 00:34:17.300 |
And that's how we, like when we evolved from ancient times, 00:34:19.980 |
we were all generalists, we did everything in the tribe too. 00:34:22.580 |
Now we heavily specialize in different domain. 00:34:30.420 |
you get short-term gains by having specialists, 00:34:33.700 |
domain specialists, and then someone just needs to train 00:34:36.060 |
like a 10X bigger model on 10X more inference, 00:34:43.780 |
And then it supersedes all the individual models 00:34:46.380 |
because of some generalized intelligence/world knowledge. 00:34:50.220 |
You know, I think that is the core insight of the GPTs, 00:35:00.180 |
you have increasing amount of data to train from 00:35:04.780 |
So I think on the data side, we're approaching the limit 00:35:11.300 |
And then there's like, what is the secret sauce there, right? 00:35:23.340 |
they are shifting from the training scaling law 00:35:28.260 |
So I definitely believe that's the future direction 00:35:31.660 |
and that's what we're really good at, doing inference. 00:35:35.580 |
Are you planning to share your reasoning traces? 00:35:48.660 |
it's interesting that like, for example, SWE-bench, 00:36:01.980 |
So that's why you don't see o1-preview on SWE-bench, 00:36:05.300 |
because they don't submit their reasoning traces. 00:36:13.620 |
So your model is not going to be open source, right? 00:36:16.100 |
Like it's going to be an endpoint that you provide. 00:36:25.740 |
- This is, I don't have actually information. 00:36:35.540 |
It's nice to just talk about it as it goes live. 00:36:39.620 |
you want feedback on or you're thinking through? 00:36:41.700 |
It's kind of nice to just talk about something 00:36:43.980 |
when it's not decided yet about this new model. 00:36:56.860 |
So there's already a Reddit discussion about it 00:37:00.020 |
and the people are asking very deep medical questions. 00:37:15.740 |
So we're having a lot of fun testing this internally. 00:37:19.740 |
But I'm more curious, how will people use it? 00:37:22.780 |
What kind of application they're going to try 00:37:26.020 |
And that's where we'll really like to hear feedback 00:37:30.940 |
And also feedback to us, like what works out well, 00:37:37.660 |
And what kind of thing they think we should improve on? 00:37:41.620 |
And those kind of feedback will be tremendously helpful. 00:37:44.500 |
- Yeah, I mean, so I've been a production user 00:37:55.180 |
and for, oh, just like they made the previous 00:38:00.220 |
Like it's really that stark, that difference. 00:38:15.860 |
But sometimes you know how hard the problem is 00:38:27.980 |
So we actually thought about that requirement 00:38:31.180 |
and it should be at some point we need to support that. 00:38:35.540 |
Not initially, but that makes a lot of sense. 00:38:41.020 |
of just like the things that you're working on. 00:38:44.860 |
I don't know if I've ever given you this feedback, 00:38:50.300 |
Because like, you know, I think when you first met me, 00:39:08.540 |
You know, I think your surface area is very big. 00:39:13.900 |
- Yeah, and now here you are trying to compete 00:39:24.220 |
So there's no, there's no thing I can just copy. 00:39:29.540 |
- I think we are all very aligned on the culture. 00:39:55.220 |
we are delivering a lot of business values to the customer. 00:40:15.300 |
So yeah, so that's just how we work as a team. 00:40:18.820 |
And the caliber of the team is really, really high as well. 00:40:38.460 |
Let's talk a little bit about that customer journey. 00:40:40.300 |
I think one of your more famous customers is Cursor. 00:40:44.780 |
and then obviously since then they have blown up. 00:40:48.180 |
But you guys especially worked on a fast apply model 00:40:54.940 |
to work on speculative decoding in a production setting. 00:41:00.020 |
what was the behind the scenes of working with Cursor? 00:41:03.220 |
- I will say, Cursor is a very, very unique team. 00:41:14.380 |
although like many companies including Copala, 00:41:17.340 |
they will say, I'm going to build a whole entire stack 00:41:20.700 |
And they are unique in the sense they seek partnership. 00:41:24.980 |
Not because they cannot, they're fully capable, 00:41:30.660 |
And of course they want to find the best partner. 00:41:39.180 |
because for them to deliver high caliber product experience 00:41:47.540 |
So actually we expanded our product features quite a lot 00:41:55.220 |
and we massively scaled quickly across multiple regions. 00:41:59.460 |
And we developed a pretty intense inference stack, 00:42:07.900 |
I think that's a very, very interesting engagement. 00:42:10.700 |
And through that, there are a lot of trust being built. 00:42:18.820 |
That comes back to, hey, we're really customer obsessed. 00:42:32.700 |
Yeah, so you almost feel like working as one team. 00:42:41.940 |
but most of the time people will be using closed models. 00:42:53.980 |
or like their house brand models are concerned, right? 00:43:04.620 |
- Very obviously the dropdown is GPT-4o and then Cursor, right? 00:43:04.620 |
So like, I assume that the Cursor side is the Fireworks side 00:43:11.220 |
and then the other side, they're calling out the other. 00:43:15.420 |
And then like, do you see any more opportunity on like the, 00:43:26.380 |
Actually, when I mentioned FireOptimizer, right? 00:43:36.780 |
Basically optimized for their specific workload. 00:43:39.220 |
And that's a lot of juice to extract out of there. 00:43:46.380 |
So that's why we started a separate product line 00:43:50.820 |
So speculative decoding is just one approach. 00:43:58.020 |
There's so many different ways to do speculative decoding. 00:43:59.940 |
You can pair a small model with a large model 00:44:15.260 |
or, you know, small, big model pair much better 00:44:20.900 |
So all of that is part of the FireOptimizer offering. 00:44:27.020 |
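For intuition on the draft-and-verify pattern mentioned above, here is a toy, self-contained sketch of greedy speculative decoding over characters; real systems verify all draft tokens with a single large-model forward pass, and none of this reflects Fireworks' actual implementation.

```python
# Toy sketch of draft-and-verify speculative decoding (greedy case), using
# fake character-level "models" so the example runs on its own.
TARGET_TEXT = "the quick brown fox jumps over the lazy dog"

def target_next(prefix: str) -> str:
    # Toy "large model": always continues the target text correctly.
    return TARGET_TEXT[len(prefix)]

def draft_next(prefix: str) -> str:
    # Toy "small model": usually right, occasionally wrong.
    correct = TARGET_TEXT[len(prefix)]
    return "x" if len(prefix) % 7 == 3 else correct

def speculative_step(prefix: str, k: int = 4) -> str:
    accepted = ""
    for _ in range(k):
        if len(prefix) + len(accepted) >= len(TARGET_TEXT):
            break
        proposal = draft_next(prefix + accepted)   # cheap draft token
        expected = target_next(prefix + accepted)  # large model verifies it
        if proposal == expected:
            accepted += proposal                   # draft token accepted "for free"
        else:
            accepted += expected                   # mismatch: take the real token, stop
            break
    return accepted

output = ""
while len(output) < len(TARGET_TEXT):
    output += speculative_step(output)
print(output)
```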
I think the other question that people always have 00:44:30.260 |
So you get different performance on different platforms. 00:44:40.540 |
But maybe, you know, using speculative decoding, 00:44:47.740 |
How should people think about how much they should care 00:44:58.020 |
- Okay, so there are two big development cycles. 00:45:01.300 |
One is experimentation, where they need fast iteration. 00:45:14.420 |
but scaling and the quality is really important 00:45:17.020 |
and latency and all the other things are becoming important. 00:45:24.740 |
Make sure even like Gen-AI is the right solution 00:45:31.260 |
then that's kind of the three-dimensional optimization curve 00:45:34.660 |
start to kick in across quality, latency, cost, 00:45:42.980 |
For many products, if you choose a lower quality, 00:45:49.380 |
but it doesn't make a difference to the product experience, 00:45:53.300 |
So that's why I think inference is part of the validation. 00:46:02.180 |
we'll go through A/B testing through inference 00:46:09.780 |
So this is like traditional product evaluation. 00:46:18.020 |
and different model setup into the consideration. 00:46:33.100 |
And maybe you want to set the record straight 00:46:51.460 |
- Specifically by name, which is normally not what- 00:46:56.820 |
and have certain interpretation of our quality. 00:47:16.900 |
So we actually refrain from doing any of those, 00:47:37.380 |
we wrote out actually a very thorough blog post 00:47:42.580 |
We have various different quantization schemes. 00:47:45.100 |
We can quantize very different parts of the model, 00:47:47.940 |
from weights to activations to cross-GPU communication, 00:47:50.540 |
and they can use different quantization schemes 00:48:01.740 |
we actually let them find the best optimized point 00:48:11.460 |
But for self-serve, there's only one point to pick. 00:48:23.420 |
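As a toy illustration of why quantization choices move the quality/cost point, the snippet below applies symmetric int8 quantization to a random weight matrix and measures the rounding error; real deployments can mix different schemes for weights, activations, the KV cache, and communication, which is the design space described above.

```python
# Toy symmetric int8 weight quantization: smaller weights, measurable error.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0                        # one scale per tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
error = np.abs(weights - dequantize(q, scale)).max()
print(f"max abs rounding error: {error:.5f}")              # the quality cost, made visible
```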
And I think the end results like AA published, 00:48:36.580 |
that's why what I mean is I will leave the evaluation 00:48:42.700 |
and work with them to find the most fair benchmark approach 00:48:47.900 |
But I'm not a fan of the approach of calling out specific names 00:48:52.900 |
and critique other competitors in a very biased way. 00:48:59.580 |
I think you're the more politically correct one. 00:49:11.820 |
No, actually all these directions we build together. 00:49:22.300 |
on just the last one on the competition side, 00:49:30.060 |
and we talked about the competitiveness in the market. 00:49:32.660 |
Do you aim to make margin on open source models? 00:49:39.140 |
So, but I think it really, when we think about pricing, 00:49:49.620 |
or there are a lot of people delivering same value, 00:49:53.180 |
There's only one way to go is going down, right? 00:50:00.140 |
we're more compared with like closed model providers, 00:50:05.980 |
their cost structure is even more interesting 00:50:27.780 |
So that created very interesting dynamics of, 00:51:14.980 |
We are not just a single-model-as-a-service provider; 00:51:27.700 |
we significantly simplify your interaction 00:51:38.220 |
- What do people not know about the work that you do? 00:51:45.860 |
Is there any kind of like underrated part of Fireworks 00:52:00.100 |
Fireworks can allow me to upload the LoRA adapter. 00:52:13.580 |
Like, we rolled out multi-LoRA last year, actually, 00:52:17.020 |
and we actually have had this function for a long time, 00:52:39.700 |
so I'm happy that user is marketing it for us. 00:53:02.860 |
- We have had prompt caching since way back last year also. 00:53:06.820 |
So yeah, so I think that is one of the underrated feature, 00:53:22.620 |
is not because they feel like charging people. 00:53:36.100 |
- Yeah, so this is kind of our technique called multi-LoRA, 00:53:54.340 |
and then basically all these different LoRA adapters 00:53:58.100 |
direct their traffic to the same base model, 00:54:17.500 |
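As a conceptual sketch of multi-LoRA serving — one shared base weight matrix, many small low-rank adapters, and requests routed by adapter id — the snippet below uses NumPy; it illustrates the memory-sharing idea only and is not how Fireworks' serving stack is implemented.

```python
# Conceptual multi-LoRA sketch: one shared base matrix, per-customer low-rank
# adapters, traffic routed by adapter id.
import numpy as np

d, r = 16, 2                                   # hidden size, LoRA rank
base_W = np.random.randn(d, d)                 # loaded once, shared by everyone

adapters = {                                   # tiny per-customer deltas (A @ B)
    "customer-a": (np.random.randn(d, r), np.random.randn(r, d)),
    "customer-b": (np.random.randn(d, r), np.random.randn(r, d)),
}

def forward(x: np.ndarray, adapter_id: str) -> np.ndarray:
    A, B = adapters[adapter_id]
    return x @ base_W + x @ A @ B              # base projection + low-rank correction

x = np.random.randn(1, d)
print(forward(x, "customer-a").shape, forward(x, "customer-b").shape)
```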
or you're looking for model-wise or tooling-wise 00:54:20.860 |
that you think someone should be working on in this? 00:54:23.420 |
- Yeah, so we really want to get a lot of feedback 00:54:35.020 |
or starting to think about new use cases and so on, 00:54:41.740 |
and let us know what works out really well for you, 00:54:54.820 |
typically we want to launch to a small group of people. 00:55:05.180 |
We have a lot of communication going on there. 00:55:22.220 |
infrastructure cloud, infrastructure engineers, 00:55:31.300 |
who have done a lot of fine-tuning and so on.