Why Compound AI + Open Source will beat Closed AI — with Lin Qiao, CEO of Fireworks AI


Transcript

(upbeat music) - Hey, everyone. Welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host, Swyx, founder of Smol AI. - Hey, and today we're in a very special studio inside the Fireworks office with Lin Qiao, CEO of Fireworks. Welcome.

- Yeah. - Oh, you should welcome us. - Yeah, welcome. (all laughing) - Thanks for having us. It's unusual to be in the home of a startup, but I think our relationship is a bit unusual compared to all our normal guests. - Definitely. Yeah, I'm super excited to talk about very interesting topics in that space with both of you.

- You just celebrated your two-year anniversary yesterday. - Yeah, it's quite a crazy journey. We circle around and share all the crazy stories across these two years, and it has been super fun. All the way from we experienced Silicon Valley bank run. - Right. - To we delete some data that shouldn't be deleted.

Operationally, we went through massive scaling, where we were busy getting capacity to... Yeah, we learned to work as a team with a lot of brilliant people across different places joining the company. It has really been a fun journey. - When you started, did you think the technical stuff would be harder, or the bank run and then the people side?

I think there's a lot of amazing researchers that want to do companies, and it's like, the hardest thing is going to be building the product, and then you have all these different other things. So, were you surprised by what has been your experience? - Yeah, to be honest with you, my focus has always been on the product side, and then after product, go to market.

And I didn't realize the rest has been so complicated, operating a company and so on. But because I don't think about it, I just kind of manage it. So, it's done. (laughs) So, I think I just somehow don't think about it too much and solve whatever problem coming our way, and it worked.

- So, I guess let's start at the pre-history, the initial history of Fireworks. You ran the PyTorch team at Meta for a number of years, and we previously had Soumith Chintala on, and I think we were just all very interested in the history of GenAI. Maybe not that many people know how deeply involved PyTorch and Meta were prior to the current GenAI revolution.

- My background is deep in distributed systems and database management systems, and I joined Meta from the data side. I saw this tremendous amount of data growth, which cost a lot of money, and we were analyzing what was going on. And it was clear that AI was driving all this data generation.

So, it was a very interesting time, because when I joined, Meta was finishing the mobile-first transition and then starting AI-first. And there's a fundamental reason for that sequence: mobile-first gave a full range of user engagement that had never existed before. All this user engagement generated a lot of data, and this data powers AI.

So then the entire industry also followed through the same transition. When I saw, okay, this AI is powering all this data generation, and looked at where our AI stack was, there was no software, there was no hardware, there were no people, there was no team. I was like, I want to dive in there and help this movement.

So, when I started, it was a very interesting industry landscape. There were a lot of AI frameworks, a kind of proliferation of AI frameworks happening in the industry. But all those AI frameworks focused on production, and they used a very particular way of defining the graph of the neural network, and they used that to drive model execution and productionization.

And PyTorch is completely different. Soumith basically assumed that he was the user of his own product: as a researcher he faced so much pain using existing AI frameworks, thinking, this is really hard to use, and I'm going to do something different for myself. And that's the origin story of PyTorch.

PyTorch actually started as the framework for researchers. It didn't care about production at all. And as it grew in terms of adoption, the interesting part of AI is that research sits upstream of production. There are so many researchers across academia and industry; they innovate, and they put their results out there in open source.

And that powers the downstream productionization. So it was brilliant for Meta to establish PyTorch as a strategy to drive massive adoption in open source, because Meta internally is a PyTorch shop. It creates a flywheel effect. So that's kind of the strategy behind PyTorch. But when I took on PyTorch, the goal was to establish PyTorch as the framework for both research and production.

So, no one had done that before, and we had to rethink how to architect PyTorch so we could really sustain production workloads. Stability, reliability, low latency, all these production concerns that were never a concern before now became concerns, and we actually had to adjust its design and make it work for both sides.

And that took us five years, because Meta has so many AI use cases, all the way from ranking and recommendation powering the business top line, like ranking the news feed and video ranking, to site integrity that detects bad content automatically using AI, to all kinds of effects, translation, image classification, object detection, all of this.

And also across AI running on the server side, on mobile phones, on AR/VR devices, the whole wide spectrum. So by that time, we basically managed to support AI ubiquitously across Meta. But interestingly, through open source engagement we worked with a lot of companies, and it was clear to us that this industry was starting to take on the AI-first transition.

And of course, Meta's hyperscale always goes ahead of the industry. It felt like when we started this AI journey at Meta: there was no software, no hardware, no team. For many companies we engaged with through PyTorch, we felt the pain. That's the genesis of why we felt, hey, if we create Fireworks and support the industry going through this transition, it will have a huge amount of impact.

Of course, the problems the industry is facing will not be the same as Meta's. Meta is so big, right? So it's kind of skewed towards extreme scale and extreme optimization, and the industry will be different. But we felt we have the technical chops and we've seen a lot, and we'd love to drive that.

So yeah, so that's how we started. - When you and I chatted about like the origins of fireworks, it was originally envisioned more as a PyTorch platform. And then later became much more focused on generative AI. Is that fair to say? - Right. - What was the customer discovery here?

- Right, so I would say our initial blueprint was that we should be the PyTorch cloud, because PyTorch is a library and there's no SaaS platform to enable AI workloads. - Even in 2022, that's interesting. - I would not say absolutely none; cloud providers had some of that, but it was not a first-class citizen, right?

Because in 2022, TensorFlow was still massively in production, and this is all pre-GenAI. PyTorch was getting more and more adoption, but there was no PyTorch-first SaaS platform. At the same time, we are also a very pragmatic set of people. We really wanted to make sure that from the get-go we got really, really close to customers.

We understand their use cases. We understand their pain points. We understand the value we deliver to them. So we wanted to take a different approach: instead of building a horizontal PyTorch cloud, we wanted to build a verticalized platform first. And then we talked with many customers. And interestingly, we started the company in September 2022, and in October, November, OpenAI announced ChatGPT.

And then boom. When we talked with many customers, they were like, "Can you help us work on the GenAI aspect?" Of course, there were some open-source models. They were not as good at that time, but people were already putting a lot of attention there. Then we decided that if we're going to pick a vertical, we're going to pick GenAI.

The other reason is that all GenAI models are PyTorch models, so that's another reason. We believed that because of the nature of GenAI, it's going to generate a lot of human-consumable content. It will drive a lot of consumer- and developer-facing application and product innovation. Guaranteed, right? We're just at the beginning of this.

Our prediction is that for those kinds of applications, inference is much more important than training, because inference scale is proportional to, at the upper limit, the world population, while training scale is proportional to the number of researchers. Of course, each training run could be very expensive. Although PyTorch supports both inference and training, we decided to focus on inference.

So yeah, that's how we got started. We launched our public platform in August last year. When we launched, it was a single product: a distributed inference engine with a simple, OpenAI-compatible API and many models. We started with LLMs, and later on we added a lot more models.

Fast forward to now, we are a full platform with multiple product lines, and we'd love to dive deep into what we offer. But it's been a very fun journey over the past two years. - What was the transition from when you started focused on PyTorch, where people want to understand the framework and get it live?

And now I would say maybe most people that use you don't even really know much about PyTorch at all. They're just consuming the models. From a product perspective, what were some of the decisions early on? Right in October, November, were you just like, "Hey, most people just care about the model, not about the framework.

"We're going to make it super easy." Or was it more a gradual transition to the model library you have today? - Yeah, so our product decisions are all based on who our ICP is. And one thing we want to acknowledge here is that GenAI technology is disruptive. It's very different from AI before GenAI.

So it's a clear leap forward, because before GenAI, the companies that wanted to invest in AI had to train from scratch. There was no other way. There was no foundation model; it didn't exist. So that means they needed to first hire a team capable of crunching data.

There's a lot of data to crunch, right? Because training from scratch, you have to prepare a lot of data. And then they need to have GPUs to train. And then you need to start to manage GPUs. So then it becomes a very complex project. It takes a long time and not many companies can afford it, actually.

And GenAI is a very different game right now, because there are foundation models, so you don't have to train anymore. That makes AI much more accessible as a technology. As an app developer or product manager, even if you're not a developer, you can interact with GenAI models directly. And our goal is to make AI accessible to all app developers and product engineers.

That's our goal. So getting them into building models doesn't make any sense anymore with this new technology. Building easy, accessible APIs is the most important thing. Early on, when we got started, we decided we were going to be OpenAI compatible, just to make it very easy for developers to adopt this new technology.

And we will manage the underlying complexity of serving all these models. - Yeah, OpenAI has become-- - The standard. - The standard. Even as we're recording today, Gemini announced that they have OpenAI-compatible APIs. - Interesting. - So then everyone just needs to adopt it, and then we have everyone.
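For readers who have not used one of these endpoints, being OpenAI compatible means the standard OpenAI SDK works unchanged apart from the base URL. A minimal sketch, assuming a Fireworks-style endpoint and an illustrative model id (both should be checked against the current docs):

```python
# Minimal sketch: calling an OpenAI-compatible endpoint with the OpenAI SDK.
# The base_url and model id below are illustrative assumptions, not official
# values; set FIREWORKS_API_KEY in your environment before running.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # assumed endpoint
    api_key=os.environ["FIREWORKS_API_KEY"],
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # illustrative id
    messages=[{"role": "user", "content": "Why do OpenAI-compatible APIs help developers?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

The appeal is that swapping providers becomes a one-line change to base_url and model, which is exactly why this shape became the de facto standard.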

- Yeah, that's interesting, because we are working very closely with Meta as one of the partners. Meta, of course, has been very generous to donate many very, very strong open source models, and we're expecting more to come. But they have also announced Llama Stack. - Yeah. - Which basically standardizes the upper-level stack built on top of Llama models.

So they don't just want to give out models and have you figure out what the upper stack is. They instead want to build a community around the stack and build a kind of new standard. I think it's an interesting dynamic playing out in the industry right now: whether it's more standardized around OpenAI, because they are kind of creating the top of the line, or standardized around Llama, because it's the most used open source model.

So I think it's really a lot of fun working at this time. - I've been a little bit more doubtful on Llama Stack. I think you've been more positive. Basically, it's just like the Meta version of whatever Hugging Face offers, or TensorRT, or vLLM, or whatever the open source opportunity is.

But to me, it's not clear that just because Meta open sources Llama, the rest of Llama Stack will be adopted. And it's not clear why I should adopt it. So I don't know if you-- - Yeah, it's very early right now. That's why we work very closely with them and give them feedback.

The feedback to the Meta team is very important, so they can use that to continue to improve the model and also improve the higher-level stack. I think the success of Llama Stack heavily depends on community adoption, and there's no way around it. And I know the Meta team would like to work with a broader community, but it's very early.

- One thing that, after your Series B, so you raised from Benchmark, and then I remember being close to you for at least your Series B announcement, you started betting heavily on this term of compound AI. It's not a term that we've covered very much on the podcast, but I think it's definitely getting a lot of adoption from Databricks and the Berkeley people and all that.

What's your take on compound AI? Why is it resonating with people? - Right, so let me give a little bit of context why we even consider that space. - Yeah, because pre-Series B, there was no message, and now it's like on your landing page. - So it's kind of a very organic evolution from when we first launched our public platform.

We were a single product, a distributed inference engine, where we do a lot of innovation: customized CUDA kernels running on different kinds of hardware, distributed, disaggregated inference execution, and all kinds of caching. So that's one product line: the fastest, most cost-efficient inference platform.

Because we wrote PyTorch code, we basically have a special PyTorch build for that, together with custom kernels we wrote. And then, working with many more customers, we realized, oh, the distributed inference engine, our design, is one size fits all, right? We want to have this inference endpoint where everyone comes in, and no matter what form and shape of workload they have, it will just work for them, right?

So that's great. But the reality is, we realized, all customers have different kind of use cases. The use cases come in all different form and shape. And the end result is, the data distribution in their inference workload doesn't align with the data distribution in the training data for the model, right?

It's a given, actually. If you think about this, because researchers have to guesstimate what is important, what's not important, like in preparing data for training. So because of that misalignment, then we leave a lot of quality, latency, cost improvement on the table. So then we're saying, okay, we want to heavily invest in a customization engine.

And we actually announced it; it's called FireOptimizer. FireOptimizer basically helps users navigate a three-dimensional optimization space across quality, latency, and cost. It's a three-dimensional curve, and even within one company, different use cases want to land in different spots. So we automate that process for our customers.

It's very simple. You have your inference workload, and you inject it into the optimizer along with the objective function. Then we spit out an inference deployment config and the model setup. So it's your customized setup. That is a completely different product; that product thinking is one size fits one, as opposed to one size fits all.
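To make the three-dimensional trade-off concrete, here is a toy sketch, not the actual FireOptimizer, that scores a few hypothetical deployment configs against a caller-supplied weighting of quality, latency, and cost and picks the best one; every config name and number is made up.

```python
# Toy sketch of picking a deployment config on a quality/latency/cost surface.
# The candidate configs and their numbers are invented; a real optimizer would
# work from the customer's actual inference workload, not a hard-coded table.
from dataclasses import dataclass

@dataclass
class DeployConfig:
    name: str
    quality: float             # eval score in [0, 1], higher is better
    latency_ms: float          # p50 latency, lower is better
    cost_per_1m_tokens: float  # dollars, lower is better

CANDIDATES = [
    DeployConfig("fp16-large", quality=0.90, latency_ms=800, cost_per_1m_tokens=2.00),
    DeployConfig("fp8-large", quality=0.89, latency_ms=500, cost_per_1m_tokens=1.20),
    DeployConfig("int8-medium", quality=0.84, latency_ms=250, cost_per_1m_tokens=0.40),
]

def pick_config(candidates, w_quality, w_latency, w_cost):
    """Return the config maximizing a weighted objective.

    Latency and cost are subtracted so that 'higher score is better'
    holds across all three dimensions.
    """
    def score(c):
        return (w_quality * c.quality
                - w_latency * (c.latency_ms / 1000.0)
                - w_cost * c.cost_per_1m_tokens)
    return max(candidates, key=score)

# A latency-sensitive consumer app might weight latency heavily:
print(pick_config(CANDIDATES, w_quality=1.0, w_latency=2.0, w_cost=0.5).name)
```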

And now, on top of that, we provide a huge variety of state-of-the-art models, hundreds of them, starting with text and state-of-the-art large language models. That's where we started. And as we talked with many customers, we realized, oh, audio and text are very, very close. Many of our customers started to build assistants, all kinds of assistants, using text, and they immediately want to add audio, audio in, audio out.

So we support transcription, translation, speech synthesis, text-audio alignment, all different kinds of audio features. It's a big announcement we're going to make; you should have heard it. - By the time this is out. - By the time this is out. - Yeah. - And the other area is vision. Vision and text are very close to each other, because a lot of information doesn't live in plain text.

A lot of information lives in multimedia formats: in images, PDFs, screenshots, and many other different formats. So oftentimes, to solve a problem, we need to put a vision model first to extract information, then use a language model to process it and send out the results. So vision is important, and we also support vision models.

Various different kinds of vision models specialize in processing different kinds of sources and extraction. And we're also going to have another announcement of a new API endpoint that will let people upload various different kinds of multimedia content, get very accurate information extracted out, and feed that into the LLM.

And then of course we support embedding because embedding is very important for semantic search, for RAG and all this. And in addition to that, we also support text to image, image generation models, text to image, image to image, and we're adding text to video as well in our portfolio.

So it's a very comprehensive model catalog that runs on top of FireOptimizer and the distributed inference engine. But then, as we talked with more customers solving business use cases, we realized one model is not sufficient to solve their problem. And it's very clear, because one issue is that the model hallucinates. Many customers, when they onboard onto this GenAI journey, thought it was magical.

GenAI is going to solve all my problems magically, but then they realize, oh, this model hallucinates. It hallucinates because it's not deterministic, it's probabilistic. It's designed to always give you an answer, but based on probability, so it hallucinates. And that's actually sometimes a feature, for creative writing, for example.

Sometimes it's a bug, because, hey, you don't want to give out misinformation. And different models also have different specialties. To solve a problem, you want to decompose your task into multiple small, narrow tasks and have an expert model solve each task really well.

And of course, the model doesn't have all the information. It has limited knowledge, because the training data is finite, not infinite. So the model oftentimes doesn't have real-time information, and it doesn't know any proprietary information within the enterprise. It's clear that in order to really build a compelling application on top of GenAI, we need a compound AI system.

A compound AI system basically has multiple models across modalities, along with APIs, whether public APIs or internal proprietary APIs, storage systems, database systems, and knowledge systems, working together to deliver the best answer. - Are you gonna offer a vector database? - We actually heavily partner with several big vector database providers.
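In code, a compound AI system is mostly orchestration. The sketch below is hypothetical and provider-agnostic: `vector_search` and `call_model` are stand-ins for whatever vector database client and inference API you actually use, and the two model names are invented.

```python
# Hypothetical compound AI pipeline: retrieval + expert model + general model.
# `vector_search` and `call_model` are stand-ins for your vector database
# client and inference API; they are not real library calls.
from typing import Callable

def answer(question: str,
           vector_search: Callable[[str], list[str]],
           call_model: Callable[[str, str], str]) -> str:
    # 1. Ground the question in proprietary / real-time data the model lacks.
    context_docs = vector_search(question)

    # 2. Ask a narrow "expert" model to extract just the relevant facts.
    facts = call_model(
        "extraction-model",
        f"Extract facts relevant to: {question}\n\n" + "\n".join(context_docs),
    )

    # 3. Ask a general model to compose the final answer from those facts.
    return call_model(
        "general-model",
        f"Using only these facts:\n{facts}\n\nAnswer: {question}",
    )

# Wiring it up with trivial stand-ins, just to show the shape:
print(answer("What is our refund policy?",
             vector_search=lambda q: ["Refunds are issued within 30 days."],
             call_model=lambda model, prompt: f"[{model}] {prompt[:60]}..."))
```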

- Which is your favorite? (laughing) - They are all great in different ways, but it's public information, like MongoDB is our investor, and we have been working closely with them for a while. - When you say distributed inference engine, what do you mean exactly? Because when I hear your explanation, it's almost like you're centralizing a lot of the decisions through the Fireworks platform on like the quality and whatnot.

What do you mean distributed? It's like you have GPUs in like a lot of different clusters, so like you're sharding the inference across. - Right, right, right. So first of all, we run across multiple GPUs. But the way we distribute across multiple GPUs is unique. We don't distribute the whole model monolithically across multiple GPUs.

We chop them into pieces and scale each piece completely differently based on what the bottleneck is. We are also distributed across regions. We have been running in North America, EMEA, and Asia. We have regional affinity for applications, because latency is extremely important. We are also doing global load balancing, because a lot of applications quickly scale to a global population.

And at that scale, different continents wake up at different times, and you want to load balance across them. On top of that, we also manage various different kinds of hardware SKUs from different hardware vendors, and different hardware designs are best for different types of workloads, whether it's long context, short context, or long generation.
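As a simplified illustration of the kind of decision such an engine makes, here is a hypothetical routing function that picks a region by client location and a hardware pool by workload shape; real placement logic would also account for live load, capacity, and cost.

```python
# Simplified, hypothetical request routing: pick a region by client location,
# pick a hardware pool by workload shape. Not Fireworks' actual logic.
REGIONS = {"us": "north-america", "eu": "emea", "asia": "asia"}

def route(client_region: str, prompt_tokens: int, max_new_tokens: int) -> dict:
    region = REGIONS.get(client_region, "north-america")
    if prompt_tokens > 32_000:
        pool = "long-context-pool"      # SKUs with large memory for long prompts
    elif max_new_tokens > 2_000:
        pool = "long-generation-pool"   # SKUs tuned for sustained decode throughput
    else:
        pool = "low-latency-pool"       # SKUs tuned for short interactive requests
    return {"region": region, "hardware_pool": pool}

print(route("eu", prompt_tokens=50_000, max_new_tokens=256))
```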

So all these different types of workloads are best fitted to different hardware SKUs, and we can even distribute a single workload across different hardware. So the distribution really runs through the full stack. - At some point, we'll show on YouTube the image that Ray, I think, has been working on with all the different modalities that you offer.

Like to me, it's basically you offer the open source version of everything that OpenAI typically offers, right? I don't think there is. Actually, if you do text to video, you will be a superset of what OpenAI offers 'cause they don't have Sora. Is that Mochi, by the way? - Mochi.

- Mochi, right? - Mochi, and there are a few others. I will say the interesting thing is, I think we're betting on the open source community is gonna grow, like proliferate. This is literally what I see. - Yeah. - And there's amazing video generation companies. - Yeah. - There is amazing audio companies.

Across the board, the innovation is off the charts, and we are building on top of that. I think that's the advantage we have compared with a closed source company. - I think I want to restate the value proposition of Fireworks for people who are comparing you versus a raw GPU provider, like a RunPod or a Lambda or anything like those, which is that you create the developer experience layer, and you also make it easily scalable, or serverless, or as an endpoint.

And then I think for some models, you have custom kernels, but not all models. - For almost all models, for all large language models. - You just write kernels all day long? - In the VRS. (laughs) - Yeah. - Yeah, for almost all models we serve, we have them.

- And so that is called FireAttention? - That's called FireAttention. - I don't remember the speed numbers, but apparently much better than vLLM, especially on a concurrency basis. - Right, so FireAttention is mostly for language models, but for other modalities we'll also have customized kernels.

- Yeah, I think the typical challenge for people is understanding that that has value. And then there are other people who are also offering open source models, right? Your moat is your ability to offer a good experience for all these customers. But if your existence is entirely reliant on people releasing nice open source models, other people can also do the same thing.

- Right, yeah. So I will say we build on top of open source model foundation. So that's the kind of foundation we build on top of. But we look at the value prop from the lens of application developers and product engineers. So they want to create new UX. So what's happening in the industry right now is people are thinking about completely new way of designing products.

And I'm talking to so many founders; it's just mind-blowing. They help me understand that the existing way of doing PowerPoint, the existing way of coding, the existing way of managing customer service, is actually putting a box around our heads. For example, PowerPoint, right? With PowerPoint generation, we always need to think about how to fit my storytelling into this format of one slide after another.

And I'm going to juggle design together with what story to tell. But the most important thing is your storytelling line, right? So why don't we create a space that is not limited to any format? Those kinds of new product UX designs, combined with automated content generation through GenAI, are the new thing that many founders are doing.

What are the challenges they're facing? All right, let's go from there. One is, again, because a lot of products built on top of GenAI are consumer, personal, and developer facing, they require an interactive experience. It's just the kind of product experience we have all gotten used to, and our desire is to get faster and faster interaction.

Otherwise, nobody wants to spend the time, right? So again, that requires low latency. And the other thing is, the nature of consumer, personal, and developer facing products is that your audience is very big. You want to scale up to product market fit quickly, but if you lose money at a small scale, you're going to go bankrupt quickly.

So there's actually a big contrast: I actually have product market fit, but when I scale, I scale myself out of business. That's kind of funny to think about. So having low latency and low cost is essential for those new applications and products to survive and really become a generational company.

So that's the design point for our distributed inference engine and FireOptimizer. FireOptimizer you can think of as a feedback loop: the more you feed your inference workload to our inference engine, the more we help you improve quality, lower latency further, and lower your cost. It basically gets better and better.

And we automate that, because we don't want you as an app developer or product engineer to have to figure out all these low-level details. It's impossible, because you are not trained to do that at all. You should keep your focus on product innovation. And then there's compound AI. We actually feel a lot of the pain that app developers and engineers have: there are so many models.

Every week, there's at least one new model coming out. - Tencent had a giant model this week. - Yeah, yeah, I saw that, I saw that. - Like 500 billion parameters. - So they're like, should I keep chasing this or should I forget about it? And which model should I pick to solve which kind of sub-problem?

How do I even decompose my problem into those smaller problems and fit the model into it? I have no idea. And then there are two ways to think about this design. I think I talked about that in the past. One is imperative, as in you tell, you figure out how to do it.

You give developers tools to dictate how to do it, or you build a declarative system where the developer says what they want to do, not how. These are two completely different designs. The analogy I want to draw is that in the data world, the database management system is a declarative system, because people use databases through SQL.

SQL is a way of saying what you want to extract out of the database, what kind of result you want. But you don't figure out which node or how many nodes you're going to run on, how you lay out your disk, which index to use, which projection. You don't need to worry about any of that.

The database management system will figure it out, generate the best plan, and execute on that. So databases are declarative, and that makes it super easy: you just learn SQL, learn the semantic meaning of SQL, and you can use it. On the imperative side, there are a lot of ETL pipelines, where people design these DAG systems with triggers and actions, and you dictate exactly what to do.

And if it fails, then you have to recover. So that's an imperative system. We have seen a range of systems in the ecosystem go different ways. I think there's value in both; I don't think one is going to subsume the other. But we are leaning more into the philosophy of the declarative system, because from the lens of the app developer and product engineer, that would be the easiest for them to integrate.
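A toy contrast under the same database analogy, in illustrative Python rather than any real Fireworks API: the imperative version spells out every step and its fallback, while the declarative version only states the desired outcome and constraints and leaves planning to a hypothetical `planner` component.

```python
# Imperative style: the developer dictates each step and how to recover.
def summarize_imperative(doc: str, small_model, big_model) -> str:
    chunks = [doc[i:i + 4000] for i in range(0, len(doc), 4000)]
    partials = []
    for chunk in chunks:
        try:
            partials.append(small_model(f"Summarize: {chunk}"))
        except TimeoutError:
            partials.append(big_model(f"Summarize: {chunk}"))  # explicit fallback
    return big_model("Combine into one summary:\n" + "\n".join(partials))

# Declarative style: the developer states *what* they want; a planner decides
# which models to use, how to split the work, and how to retry.
SUMMARY_SPEC = {
    "task": "summarize",
    "input": "document",
    "constraints": {"max_latency_ms": 2000, "min_quality": 0.85},
}

def summarize_declarative(doc: str, planner) -> str:
    # `planner` is a hypothetical system component that compiles the spec
    # into an execution plan, much like a database compiles SQL.
    return planner.run(SUMMARY_SPEC, document=doc)
```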

- I understand that's also why PyTorch won as well, right? This is one of the reasons. - Ease of use. So yeah, focus on ease of use and then let the system take on the hard challenges and complexities. So we follow, we extend that thinking into current system design.

So another announcement: our next declarative system is going to appear as a model that has extremely high quality. And this model is inspired by OpenAI's o1 announcement. You should see it by the time we announce this, or soon after. - It's trained by you. - Yes.

- Is this the first model that you've trained, at this scale? - It's not the first. We actually trained a model called FireFunction. It's a function calling model. It's our first step into compound AI systems, because a function calling model can dispatch a request to multiple APIs. We have a pre-baked set of APIs the model has learned.

You can also add additional APIs through the configuration to let the model dispatch accordingly. So we have a very high quality function calling model that's already released. We actually have three versions, and the latest version is very high quality. But now we're taking a further step where you don't even need to use a function calling model.
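For reference, function calling through an OpenAI-compatible API generally looks like the sketch below; the FireFunction model id, the base URL, and the tool schema are illustrative assumptions rather than exact product values.

```python
# Sketch of function calling via an OpenAI-compatible chat API.
# The model id, base_url, and tool definition are illustrative examples.
import json
import os
from openai import OpenAI

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1",  # assumed endpoint
                api_key=os.environ["FIREWORKS_API_KEY"])

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the status of a customer order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

resp = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",  # illustrative id
    messages=[{"role": "user", "content": "Where is order 1234?"}],
    tools=tools,
)

# If the model chose to call the tool, inspect the structured call it produced.
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```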

You use the new model we're going to release. It will solve a lot of problems, approaching very high, like OpenAI's, quality. So I'm very excited about that. - Do you have any benchmarks yet? - We have benchmarks. We're going to release them, hopefully next week. We just put our model on LMSYS, and people are guessing: is this the next Gemini model or a MADIS model?

People are guessing. That's very interesting. We're watching the Reddit discussion right now. - I mean, I have to ask more questions about this. When OpenAI released o1, a lot of people asked about whether it's a single model or whether it's a chain of models.

And basically everyone on the Strawberry team was very insistent that what they did for reinforcement learning, chain of thought, cannot be replicated by a whole bunch of open source model calls. Do you think that they are wrong? Have you done the same amount of work on RL as they have or was it a different direction?

- I think they take a very specific approach, and the caliber of the team is very high, right? So I do think they are the domain experts in doing what they are doing. But I don't think there's only one way to achieve the same goal. We're in the same direction in the sense that the quality scaling law is shifting from training to inference.

On that, I fully agree with them. But we're taking a completely different approach to the problem. All of that is because, of course, we didn't train the model from scratch; we built on the shoulders of giants, right? The current models we have access to are getting better and better.

The future trend is that the gap between open source and closed source models is just going to shrink to the point where there's not much difference, and then we're on the same level playing field. That's why I think our early investment in inference, and all the work we do around balancing quality, latency, and cost, pays off: we have accumulated a lot of experience there, and that empowers us to release this new model that is approaching OpenAI's quality.

- I guess the question is, what do you think the gap to catch up will be? Because I think everybody agrees that open source models eventually will catch up. And I think with GPT-4, then with Llama 3.1 405B and 3.2, we closed the gap. And then o1 just reopened the gap so much, and it's unclear.

Obviously you're saying your model will have-- - We're closing that gap. - Yeah, but do you think in the future it's going to be like months? - So here's the thing that's happening, right? There are public benchmarks; it is what it is. But in reality, open source models in certain dimensions are already on par with or beat closed source models, right?

For example, in the coding space, open source models are really, really good. And in function calling, FireFunction is also really, really good. So it's all a matter of whether you build one model to solve all the problems and want to be the best at solving all the problems, or, as in the open source domain, you specialize, right?

All these different model builders specialize in certain narrow area. And it's logical that they can be really, really good in that very narrow area. And that's our prediction is with specialization, there will be a lot of expert models really, really good and even better than like one size fits all open source, closed source models.

- I think this is the core debates that I am still not 100% either way on in terms of compound AI versus normal AI, 'cause you're basically fighting the bitter lesson. - Look at the human society, right? We specialize and you feel really good about someone specializing doing something really well, right?

And that's how we evolved from ancient times: we were all generalists, we did everything in the tribe. Now we heavily specialize in different domains. So my prediction is that in the AI model space, it will happen also. - Except, per the bitter lesson, you get short-term gains by having specialists, domain specialists, and then someone just needs to train a 10x bigger model on 10x more data, with 10x more compute perhaps, whatever the current scaling law is.

And then it supersedes all the individual models because of some generalized intelligence/world knowledge. You know, I think that is the core insight of the GPTs, the GPT 1, 2, 3, that was. - Right, but the scaling law again, the training scaling law is because you have increasing amount of data to train from and you can do a lot of compute, right?

So I think on the data side, we're approaching the limit and the only data to increase that is synthetic generated data. And then there's like, what is the secret sauce there, right? Because if you have a very good large model, you can generate very good synthetic data and then continue to improve quality.

So that's why I think in OpenAI, they are shifting from the training scaling law into inference scaling law. And it's the test time and all this. So I definitely believe that's the future direction and that's where we are really good at and doing inference. - Couple of questions on that.

Are you planning to share your reasoning traces? - That's a very good question. We are still debating. - Yeah. - We're still debating. - I would say, for example, it's interesting that on SWE-bench, if you want to be considered for the ranking, you have to submit your reasoning traces, and that has actually disqualified some of our past guests.

Like Cosine was doing well on SWE-bench, but they didn't want to leak those traces. That's also why you don't see o1-preview on SWE-bench: they don't submit their reasoning traces. And obviously it's IP, but also, if you're going to be more open, then that's one way to be more open.

So your model is not going to be open source, right? It's going to be an endpoint that you provide. - Yes. - Okay, cool. And then pricing, also the same as OpenAI, just head-on? - I actually don't have that information. Everything is going so fast, we haven't even thought about that yet.

Yeah, I should be more prepared. - I mean, this is live. It's nice to just talk about it as it goes live. Any other things that you're like, you want feedback on or you're thinking through? It's kind of nice to just talk about something when it's not decided yet about this new model.

Like, I mean, it's going to be exciting. It's going to generate a lot of buzz. - Right. I'm very excited about to see how people are going to use this model. So there's already a Reddit discussion about it and the people are asking very deep medical questions. And it seems the model got it right.

Surprising. And internally, we're also asking models to generate what is AGI. And it generates a very complicated DAG. Thinking process. So we're having a lot of fun testing this internally. But I'm more curious, how will people use it? What kind of application they're going to try and test on it?

And that's where we'll really like to hear feedback from the community. And also feedback to us, like what works out well, what doesn't work out well? What works out well but surprising them? And what kind of thing they think we should improve on? And those kind of feedback will be tremendously helpful.

- Yeah, I mean, I've been a production user of o1-preview and o1-mini since they launched. I would say they're very, very obviously better in terms of quality, so much so that they make Sonnet and 4o, the previous state of the art, look bad. It's really that stark, that difference.

The number one piece of feedback, or feature request, is that people want control over the budget. Because right now in o1, it kind of decides its own thinking budget. But sometimes you know how hard the problem is, and you want to actually tell the model, spend two minutes on this, or spend some dollar amount.

Maybe it's time, maybe it's dollars. I don't know what the budget is. - That makes a lot of sense. So we actually thought about that requirement and it should be at some point we need to support that. Not initially, but that makes a lot of sense. - Okay, so that was a fascinating overview of just like the things that you're working on.

First of all, I realized that, I don't know if I've ever given you this feedback, but I think you guys are one of the reasons I agreed to advise you. Because like, you know, I think when you first met me, I was kind of dubious. I was like- - Who are you?

- There's like Replicate, there's like Lepton, there's a whole bunch of other players. You're in a very, very competitive field. Like, why will you win? And the reason I actually changed my mind was I saw you guys shipping. You know, I think your surface area is very big.

The team is not that big. - No, we're only 40 people. - Yeah, and now here you are trying to compete with OpenAI and you know, everyone else. Like, what is the secret? - I think the team, the team is the secret. - Oh boy. So there's no, there's no thing I can just copy.

You just- - No. - I think we all come from very aligned on the culture. 'Cause most of our team came from Meta. - Yeah. - And many startups. So we really believe in results. One is result. And second is customer. We're very customer obsessed. And we don't want to drive adoption for the sake of adoption.

We really want to make sure we understand we are delivering a lot of business values to the customer. And we are, we really value their feedback. So we would wake up mid of night and deploy some model for them. Shuffle some capacity for them. And yeah, over the weekend, no brainer.

So yeah, that's just how we work as a team, and the caliber of the team is really, really high as well. So, as a plug: we're hiring, and we're expanding very, very fast. If you are passionate about working on the most cutting-edge technology in the GenAI space, come talk with us.

- Yeah. Let's talk a little bit about that customer journey. I think one of your more famous customers is Cursor. We were the first podcast to have Cursor on, and then obviously since then they have blown up. Cause and effect are not related. But you guys especially worked on the fast apply model, where you were one of the first people to work on speculative decoding in a production setting.

Maybe just talk about what was behind the scenes of working with Cursor? - I will say Cursor is a very, very unique team. I think the unique part is the team has a very high technical caliber, right? There's no question about it. But many companies, including Copilot, will say, I'm going to build the whole entire stack because I can.

And they are unique in the sense that they seek partnership. Not because they cannot do it, they're fully capable, but they know where to focus. That to me is amazing. And of course, they wanted to find the best partner. So we spent some time working together. They push us very aggressively, because for them to deliver a high-caliber product experience, they need the latency.

They need the interactivity, but also high quality at the same time. So we actually expanded our product features quite a lot as we supported Cursor. They are growing so fast, and we massively scaled quickly across multiple regions. And we developed a pretty intense inference stack, almost similar to what we did for Meta.

I think that's a very, very interesting engagement. And through that, a lot of trust has been built, as in they realized, hey, this is a team they can really partner with and go big with. That comes back to, hey, we're really customer obsessed. All the engineers working with them spend an enormous amount of time syncing with them and discussing.

And we're not big on meetings, but the Slack channel is always on. So you almost feel like you're working as one team. I think that's really the highlight. - Yeah, for those who don't know, basically Cursor is a VSCode fork, but most of the time people will be using closed models.

Like, I actually use a lot of Sonnet. So you're not involved there, right? It's not like you host Sonnet or have any partnership with it. You're involved where cursor-small or their house-brand models are concerned, right? - I don't know what I can say about the things they haven't said.

(laughing) - Very obviously, the dropdown is 4o and then cursor-small, right? So I assume that the cursor-small side is the Fireworks side, and on the other side they're calling out to the others. Just kind of curious. And then, do you see any more opportunity there? I think you made a big splash with 1,000 tokens per second.

That was because of speculative decoding. Is there more to push there? - We push a lot. Actually, when I mentioned FireOptimizer, right? As in, we have a unique automation stack that is one size fits one. We actually deployed it to Cursor early on, basically optimized for their specific workload.

And there's a lot of juice to extract out of there. And we saw that the success of that approach can actually be widely adopted, so that's why we started a separate product line called FireOptimizer. So speculative decoding is just one approach, and speculative decoding here is not static.

We actually wrote a blog post about it. There's so many different ways to do speculative decoding. You can pair a small model with a large model in the same model family, or you can have Eagle heads and so on. So there are different trade-offs of which approach to take.
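The draft-and-verify idea common to these approaches can be sketched in a few lines. This toy greedy version pairs a small draft model with a large target model; it is not Fireworks' implementation, and real systems verify against the target's full distribution and batch the verification into a single forward pass.

```python
# Toy greedy speculative decoding: a small draft model proposes k tokens and a
# large target model checks them. In a real engine the k verification steps run
# as ONE batched forward pass of the big model, which is where the latency win
# comes from; here they are sequential calls purely for clarity.
# `draft_next` / `target_next` are stand-ins returning the argmax next token.
from typing import Callable, List

def speculative_decode(prompt: List[int],
                       draft_next: Callable[[List[int]], int],
                       target_next: Callable[[List[int]], int],
                       k: int = 4,
                       max_new_tokens: int = 32) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft k tokens cheaply with the small model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. Verify: keep drafted tokens while the target agrees; on the first
        #    mismatch, take the target's token instead and stop this round.
        new_tokens = []
        for i in range(k):
            target_tok = target_next(tokens + new_tokens)
            new_tokens.append(target_tok)
            if target_tok != draft[i]:
                break
        tokens.extend(new_tokens)
    return tokens[len(prompt):len(prompt) + max_new_tokens]
```

Eagle or Medusa heads replace the separate draft model with extra prediction heads on the large model itself; the accept-or-reject structure stays the same.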

It really depends on your workload. And then, with your workload, we can align the Eagle heads or Medusa heads or, you know, the small-big model pair much better to extract the best latency reduction. So all of that is part of the FireOptimizer offering. - I know you mentioned some of the other inference providers.

I think the other question that people always have is around benchmarks. So you get different performance on different platforms. How should people think about this? People are like, "Hey, Llama 3.2 is X on MMLU." But maybe, you know, using speculative decoding, you go down a different path. Maybe some providers run a quantized model.

How should people think about how much they should care about how you're actually running the model? You know, like, what's the delta between all the magic that you do and what a raw model? - Okay, so there are two big development cycle. One is experimentation, where they need fast iteration.

They don't want to think about quality; they just want to experiment with the product experience and so on, right? So that's one. And then, when it looks good and they reach product market fit, scaling and quality become really important, and latency and all the other things become important.

During the experimentation phase, just pick a good model. Don't worry about anything else. Make sure GenAI is even the right solution for your product; that's the focus. And then post-product-market-fit, that's when the three-dimensional optimization curve starts to kick in across quality, latency, and cost, and where you should land.

And to me, that's purely a product decision. For many products, if you choose lower quality but better speed and lower cost, and it doesn't make a difference to the product experience, then you should do it. So that's why I think inference is part of the validation. The validation doesn't stop at offline evals.

The validation is kind of, we'll go through A/B testing through inference and that's where we kind of offer various different configurations for you to test which is the best setting. So this is like traditional product evaluation. So product evaluation should also include your new model versions and different model setup into the consideration.

- I want to specifically talk about what happens a few months ago with some of your major competitors. I mean, all of this is public. What is your take on what happens? And maybe you want to set the record straight on how Fireworks does quantization because I think a lot of people may have outdated perceptions or they didn't read the clarification posts on your approach to quantization.

- First of all, it's always a surprise to us that without any notice, we got called out. - Specifically by name, which is normally not what- - Yeah, in a public post and have certain interpretation of our quality. So I was really surprised and it's not a good way to compete, right?

We want to compete fairly, and oftentimes, when one vendor gives out results about another vendor, it's extremely biased. So we refrain from doing any of that, and we happily partner with third parties to do the most fair evaluation. So we were very surprised, and we don't think that's a good way to figure out the competitive landscape.

So then we reacted. When it comes to quantization and how to interpret it, we actually wrote a very thorough blog post, because again, no one size fits all. We have various different quantization schemes. We can quantize very different parts of the model, from weights to activations to cross-GPU communication, and they can use different quantization schemes or stay consistent across the board.
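For readers unfamiliar with what quantizing weights means in practice, here is a minimal symmetric int8 round-trip for a single matrix; production schemes (per-channel scales, FP8, activation and KV-cache quantization) are far more involved, and choosing among them is exactly the quality/latency/cost trade-off being described.

```python
# Minimal symmetric int8 weight quantization of one matrix, for illustration.
# Production schemes use per-channel scales, FP8, activation/KV-cache
# quantization, etc.; this just shows the basic round-trip and its error.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0          # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("mean abs error:", float(np.abs(w - w_hat).mean()))
```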

And again, it's a trade-off. It's trade-off across this three-dimensional quality, latency, and cost. And for our customer, we actually let them find the best optimized point and that's kind of how, and we have very thorough evaluation process to pick that point. But for self-serve, there's only one point to pick.

There's no customization available. So of course, based on what we hear from talking with many customers, we have to pick one point. And in the end, when Artificial Analysis later published a quality measure, we actually looked really good. So that's what I mean: I will leave the evaluation of quality or performance to third parties, and work with them to find the most fair benchmarking methodology.

But I'm not a fan of the approach of calling out specific names and critiquing other competitors in a very biased way. - That happens with databases as well. I think you're the more politically correct one, and then Dima is the more... it's you on Twitter. - We, yeah. - It's like the Russian.

- We partner. No, actually, all these directions we build together. - Wow. - We play different roles. Cut this. - Another one that I wanted to ask, just the last one on the competition side: there's a perception of price wars in hosting open source models, and we talked about the competitiveness in the market.

Do you aim to make margin on open source models? - Oh, absolutely, yes. But when we think about pricing, it really needs to correlate with the value we are delivering. If the value is limited, or there are a lot of people delivering the same value, there's no differentiation.

Then there's only one way to go, which is down, right, through competition. If I take a big step back, on pricing we're more comparable with closed model providers' APIs, right? The closed model providers' cost structure is even more interesting, because we don't bear any training costs.

And we focus on inference optimization, and that's kind of where we continue to add a lot of product value. So that's how we think about product. But for the closed source API provider, model provider, they bear a lot of training costs. And they need to amortize the training costs into the inference.

So that creates very interesting dynamics: if we match pricing there, then how they are going to make money is very, very interesting. - So for listeners: OpenAI's 2024 numbers are roughly 4 billion in revenue, 3 billion in training compute, 2 billion in inference compute, 1 billion in research compute amortization, and 700 million in salaries.

So that is like, (laughing) I mean, a lot of R&D. - Yeah, so I think the margin is basically, like, net zero, yeah. So those are very, very interesting dynamics we're operating within. But coming back to inference: again, as I mentioned, our product is, we are a platform.

We are not just a single-model-as-a-service provider, like many other inference providers that serve a single model. We have FireOptimizer to highly customize towards your inference workload, and we have a compound AI system that significantly simplifies your path to high quality, low latency, and low cost.

So those are all very different from other providers. - What do people not know about the work that you do? I guess like people are like, okay, Fireworks, you run model very quickly. You have the function model. Is there any kind of like underrated part of Fireworks that more people should try?

- Yeah, actually one user posted on x.com. He mentioned, oh, actually, Fireworks lets me upload a LoRA adapter to the serverless model and use it at the same cost. Nobody else provides that. That's because we rolled out multi-LoRA last year, actually, and we have had this capability for a long time, and many people have been using it, but it's not well known that, oh, if you fine-tune your model, you don't need to use on-demand.

If you fine-tune your model with LoRA, you can upload your LoRA adapter, and we deploy it as if it's a new model. You get your endpoint, and you can use it directly, but at the same cost as the base model. So I'm happy that user is marketing it for us.

He discovered that feature, but we've had it since last year. So I think the feedback to me is that we have a lot of very, very good features, as Sean just mentioned. - I'm an advisor to the company, and I didn't know that you had speculative decoding released, you know?

(laughing) - We've had prompt caching since way back last year also. We have many, yeah. So I think that is one of the underrated features, and if you're a developer using our self-serve platform, please try it out. - Yeah, yeah, yeah. The LoRA thing's interesting, because the reason people add additional cost to it is not because they feel like charging people.

Normally, in normal LoRA serving setups, there is a cost to loading those weights and dedicating a machine to that inference. How come you can avoid it? - Yeah, so this is our technique called multi-LoRA. We basically have many LoRA adapters share the same base model, which significantly reduces the memory footprint of serving. One base model can sustain a hundred to a thousand LoRA adapters, and all these different LoRA adapters direct their traffic to the same base model, where the base model dominates the cost.
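The reason so many adapters can share one base model is that a LoRA delta is just a pair of small matrices applied on top of the shared weights. A toy version of the idea, ignoring the batching and custom kernels that make it fast in practice:

```python
# Toy multi-LoRA serving: one shared base weight, many small per-adapter deltas.
# y = x @ (W + B_i @ A_i) is computed as x @ W + (x @ B_i) @ A_i, so the big W
# is stored once and each adapter only adds two skinny matrices.
import numpy as np

d_in, d_out, rank = 1024, 1024, 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d_in, d_out)).astype(np.float32)  # shared base weights

adapters = {  # hundreds of these can share the same W
    "customer-a": (rng.normal(size=(d_in, rank)).astype(np.float32),
                   rng.normal(size=(rank, d_out)).astype(np.float32)),
    "customer-b": (rng.normal(size=(d_in, rank)).astype(np.float32),
                   rng.normal(size=(rank, d_out)).astype(np.float32)),
}

def forward(x: np.ndarray, adapter_id: str) -> np.ndarray:
    B, A = adapters[adapter_id]
    return x @ W + (x @ B) @ A   # base path plus low-rank correction

x = rng.normal(size=(1, d_in)).astype(np.float32)
print(forward(x, "customer-a").shape, forward(x, "customer-b").shape)
```

Each adapter here adds two 1024 x 8 matrices, a tiny fraction of the shared 1024 x 1024 base weight, which is why an adapter's traffic can be priced like the base model's.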

So that's why we can advertise it that way, and that's how we can keep the per-million-token pricing the same as the base model. - Is there anything that you want to request from the community, or anything you're looking for model-wise or tooling-wise that you think someone should be working on?

- Yeah, so we really want to get a lot of feedback from application developers who are starting to build on GenAI, or have already adopted it, or are starting to think about new use cases and so on: try out Fireworks first, and let us know what works out really well for you, what is on your wish list, and what sucks, right?

So what is not working out for you, and we would like to continue to improve, and for our new product launches, typically we want to launch to a small group of people. Usually we launch on our Discord first, to have a set of people use that first. So please join our Discord channel.

We have a lot of communication going on there. Again, you can also give us feedback there. We're starting office hours for you to talk directly with our DevRel and engineers and exchange more notes. - And you're hiring across the board? - We're hiring across the board.

We're hiring front-end engineers, cloud infrastructure engineers, back-end system optimization engineers, applied researchers, and researchers who have done post-training, who have done a lot of fine-tuning and so on. - That's it. Thank you. - Awesome. - Thanks for having us. (upbeat music)