Hello, everyone. So, my name is Dima. As mentioned, unfortunately, my co-founder, Lin, who was on the schedule, couldn't make it today because of a personal emergency. So you got me. And as you saw, we don't yet have AI to figure out video projection, but we have AI for a lot of other things.
So today I'm going to talk about Fireworks AI, and generally I'm going to continue the theme which Katrin started about open models, and how at Fireworks we focus on productionization, customization, and inference of open-source models. But first, as an introduction, what's our background? The founding team of Fireworks comes from the PyTorch leads at Meta and some veterans from Google AI.
Combined, we have probably a decade of experience productionizing AI in some of the biggest companies in the world. I myself have been a core maintainer of PyTorch for the past five years, so the topic of open source is really close to my heart. And since we kind of led this revolution of the open-source toolchain for deep learning through our work on PyTorch and some of the Google technologies, we really believe that open-source models are the future for gen AI applications as well, and our focus at Fireworks is precisely on that.
So how many people in the audience actually use GPT and deploy it in production? And how many folks use open models in production? Oh, okay. I was about to convince you that the share of open-source models is going to grow over time, but it looks like in this audience it's already sizable. Nevertheless.
So why this trade-off -- why go big or why go small? Currently, the bulk of production inference is still based on proprietary models. And those are really good models, often frontier in many domains. However, the catch is that it's one model which has to be good at many, many things.
And it's often served the same way regardless of the use case, which means that whether you have batch inference on some narrow domain or a super real-time use case, like a voice assistant, those are often served from the same infrastructure without customization.
In terms of model capabilities, it also means that, yes, GPT-4 is great, Claude is great, and they can handle a lot of things, but you are often paying a lot for additional capabilities which are not needed in your particular use case. You don't really need a customer support chatbot to know about 150 Pokémon or be able to write poetry, but you really want it to be really good in your particular narrow domain.
This kind of discrepancy with large models leads to several issues. One, as I mentioned, is high latency, because using a big model means longer response times, which is particularly important for real-time use cases like voice systems. And it gets more and more important with agent-like applications, where you need to do a lot of reasoning steps and call the model many times. So latency is really, really important. And often you see that you can pick smaller models, like Llama or Gemma, which you just heard about, and achieve the same or better quality for a narrow domain while being up to 10 times faster.
For example, for some of the function-calling use cases, in external benchmarks from Berkeley, you get similar performance from a fine-tuned Llama 3 at 10x the speed. Cost is also an issue if you're running a big model on a lot of traffic. Even if you have, say, a 5K-token prompt and 10,000 users, and each of them calls the LLM 20 times per day, on GPT-4, even on GPT-4o, it probably adds up to something like 10K dollars per day, or several million per year, which is a sizable cost for a startup.
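As a rough sanity check on that arithmetic, here is a minimal sketch; the per-token prices and output length below are assumptions for illustration, not a quote of actual GPT-4o pricing:

```python
# Back-of-the-envelope only: prices and output length are illustrative assumptions.
users = 10_000
calls_per_user_per_day = 20
prompt_tokens = 5_000      # prompt size per request, as in the example
output_tokens = 500        # assumed average completion length

# Assumed GPT-4o-class pricing in USD per 1M tokens (check current price lists).
price_in, price_out = 5.0, 15.0

daily_in = users * calls_per_user_per_day * prompt_tokens    # 1.0B tokens/day
daily_out = users * calls_per_user_per_day * output_tokens   # 0.1B tokens/day
daily_cost = daily_in / 1e6 * price_in + daily_out / 1e6 * price_out

print(f"~${daily_cost:,.0f}/day, ~${daily_cost * 365 / 1e6:.1f}M/year")
# -> roughly $6,500/day, about $2.4M/year: the "several million per year" ballpark
```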
You can easily cut that with much smaller models, and that's often what we see as the motivation for reaching for smaller and more customizable models. But where open models really shine is domain adaptability. And that comes in two aspects. First, there are so many different fine-tunes and customizations.
I think Katrin was mentioning the Gemma-based Indian language adaptations; there are models specialized for code or for medicine. If you go to Hugging Face, there are tens of thousands of different model variants. And because the weights are open, you can always customize to your particular use case and tune quality specifically for what you need.
So open-source models are great. So what are the challenges? The challenges really come from three areas. First, what we usually see when people try to use an open model, something like Gemma or Llama, is that you run into complicated setup and maintenance, right?
You need to go and find GPUs somewhere. You need to figure out which frameworks to run on them. You need to download your models, maybe do some performance optimization and tuning, and you kind of have to repeat this process end-to-end every time the model gets updated or a new version is released, et cetera.
On optimization itself -- especially for LLMs, but generally for gen AI models -- there are many attributes and settings which are really dependent on your use case and requirements. Somebody needs low latency, somebody needs high throughput; prompts can be short, prompts can be long, et cetera. And choosing the optimal settings across the stack is actually not trivial.
As I'll show you later, in many cases you can get multiple-x improvements from doing this efficiently. And finally, just getting production-ready is actually hard. As you go from experimentation to production, even just babysitting GPUs on public clouds is not easy, because GPUs are finicky and not always reliable.
And getting to enterprise scale requires all the scalability technology, telemetry, observability, et cetera. Those are the things we focus on solving at Fireworks. Starting with efficiency: we built our own custom serving stack, which we believe is one of the fastest, if not the fastest. We did it from the ground up, from writing our own CUDA kernels all the way to customizing how things get deployed and orchestrated at the service level.
And that brings multiple optimizations. But most importantly, we really focus on customizing the serving stack to your needs, which basically means that for your custom workload and your custom cost and latency requirements, we can tune it for those settings. What does customization mean in practice?
For example, many use cases use RAG and very long prompts. There are many settings you can tune at the runtime and deployment level to optimize for long prompts, which are often repeatable -- so caching is useful -- or just tuning settings so that throughput is higher while maintaining latency.
This is independently benchmarkable: if you go to Artificial Analysis and select long prompts, Fireworks is actually the fastest, even faster than some of the other providers who are over there at the Expo booth. And we don't only focus on LLM inference; we cover many modalities.
As an example, for image generation, we are the fastest provider serving SDXL. We are also the only provider serving SD3, Stability's new model, because their API actually routes to our servers. And finally, as I mentioned, especially for LLMs, customization matters a lot. One paradigm for thinking about LLM performance that is often useful for real use cases is minimizing cost under a particular latency constraint.
We often have customers come and say, hey, I have this interactive application, and I need to generate this many tokens in under two seconds. That's really where cross-stack optimizations shine. By tuning for a particular latency cutoff and changing many settings, you can deliver much higher throughput -- multiple times higher -- and higher throughput basically means fewer GPUs and lower cost.
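To make that relationship concrete, here is a minimal illustrative sketch; all of the numbers (traffic, per-GPU throughput, GPU price) are made-up assumptions, not Fireworks benchmarks:

```python
import math

# Illustrative numbers only, not Fireworks benchmarks.
peak_requests_per_sec = 200     # assumed peak traffic for the application
gpu_cost_per_hour = 3.0         # assumed $/GPU-hour

def gpus_needed(per_gpu_rps: float) -> int:
    """GPUs needed to absorb peak traffic at a given per-GPU throughput,
    where throughput is measured under the same latency cutoff (e.g. < 2s)."""
    return math.ceil(peak_requests_per_sec / per_gpu_rps)

for label, rps in [("default settings", 10.0), ("tuned for this workload", 30.0)]:
    n = gpus_needed(rps)
    print(f"{label}: {n} GPUs, ~${n * gpu_cost_per_hour * 24:,.0f}/day")
# default settings: 20 GPUs, ~$1,440/day
# tuned for this workload: 7 GPUs, ~$504/day
```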
In terms of model support, we support the best-quality open models: we heard about Gemma now, obviously the Llamas, and some ASR and text-to-speech models, pretty much from many providers. We also work with model developers directly; for example, one developer's new model also launched on Fireworks last week.
As for platform capabilities, as I mentioned, we have a lot of open models to get you started, or customized ones. We do some of the fine-tuning of those models in-house -- I'm going to talk a little bit about function-calling-specialized models later on -- and we also train some vision-language models ourselves, which we release as well.
And of course, the key to open model development is that you can tune for a particular use case. So we provide a platform for fine-tuning, whether you are bringing a dataset collected elsewhere or collecting it live from feedback while serving on our platform. Specifically on customization, one interesting question, which a lot of people starting to experiment with models run into, is: if you fine-tune and deploy the resulting model, how do you serve it efficiently?
It turns out that if you do LoRA fine-tuning, which a lot of folks do, you can do smart tricks and deploy multiple LoRA models on the same GPU -- actually, thousands of them -- which means we can still give you serverless, pay-per-token inference even if you have thousands of model variants sitting deployed there, without you having to pay any fixed cost.
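The reason this works is that a LoRA adapter is just a small low-rank update on top of base weights that all the variants share, so one GPU can hold the base model once plus many tiny adapters. A minimal sketch of the idea (illustrative shapes and names, not Fireworks' actual serving code):

```python
import torch

# One linear layer of the shared base model, loaded once per GPU.
d_out, d_in, rank = 4096, 4096, 16
W = torch.randn(d_out, d_in)

# Per-variant LoRA adapters: each is just two small matrices (A, B).
# B starts at zero (standard LoRA init), so a fresh adapter equals the base model.
adapters = {
    f"variant-{i}": (torch.randn(rank, d_in) * 0.01, torch.zeros(d_out, rank))
    for i in range(1000)  # each adapter is tiny relative to the base weights
}

def forward(x: torch.Tensor, adapter_id: str) -> torch.Tensor:
    A, B = adapters[adapter_id]
    # The expensive base projection is shared across all variants;
    # only the cheap low-rank correction is specific to this one.
    return x @ W.T + (x @ A.T) @ B.T

y = forward(torch.randn(1, d_in), "variant-42")
```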
Of course, a single model is all great. But what we see increasingly in applications is that the model is not the product by itself, right? You need a bigger system in order to solve the target application. One reason is that models by themselves tend to hallucinate, so you need some grounding, and that's where access to external knowledge bases comes in.
Also, we don't yet have an industry-wide magical multimodal AI across all modalities, so often you have to chain multiple types of models. And of course, there are all these external tools and external actions which end-to-end applications might want to take in agentic form.
So the term which I really like, popularized by Databricks, is "compound AI system": we're increasingly seeing the transition from just the model being the product to this combination of, say, RAG, function calling, external tools, et cetera, built together as the product.
And that's pretty much the direction we see this field moving over time. So what does that mean from our perspective -- what do we do in this case? We see a function-calling agent at the core of this emerging architecture, connected to domain-specialized models served on our platform directly, maybe tuned for different needs, and connected to external tools.
Maybe it's a code interpreter, or maybe it's external APIs somewhere, with this central agentic model coordinating and trying to triage the user's requirements -- if it's, for example, a chatbot or something. You've probably all heard about function calling, popularized by OpenAI initially.
That's basically the same idea. Function calling is really about how to connect an LLM to external tools and external systems. What does it mean in practice? We actually focus on fine-tuning models specifically for function calling, and we've released a series of such models; the latest one, FireFunction V2, was released two weeks ago.
And what you can do with that -- if I manage to click on this button -- is build applications which combine free-form general chat capabilities with function calling. So in this case, this FireFunction model has some chat capabilities.
You can ask it, what can you do? And it has some self-reflection to tell you what it can do. It's also connected in this demo app to a bunch of external tools, so it can query stock quotes and plot charts through external APIs.
It can also generate images. But what it really needs to figure out is how to translate a user query -- to do complex reasoning and turn it into function calls. So, for example, if we ask it to generate a bar chart with the stocks of the top three cloud providers, the big three, it actually needs to do several steps, right?
It needs to understand that the top three cloud providers means AWS, GCP, and Azure, and that Azure is under Microsoft. It then needs to make function calls querying their stock prices. And finally, it needs to combine that information and send it to the chart-plotting API, which is what just happened in the background.
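Under the hood, a turn like that is just an OpenAI-style chat completion with tool definitions attached. Here's a rough sketch of what the first step could look like against the OpenAI-compatible API; the tool schema is made up to mirror the demo, and the model ID is an assumption, so check the model catalog for the current name:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # OpenAI-compatible endpoint
    api_key="YOUR_FIREWORKS_API_KEY",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_quote",  # hypothetical tool, mirroring the demo
        "description": "Get the latest stock price for a ticker symbol",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

resp = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",  # assumed model ID
    messages=[{"role": "user",
               "content": "Plot a bar chart of the top 3 cloud providers' stock prices"}],
    tools=tools,
)
# Expect tool_calls for the three tickers; the app executes them and feeds the
# results back before the model finally calls the charting tool.
print(resp.choices[0].message.tool_calls)
```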
Another important aspect you need for combining function calling with chat is contextual awareness. If I ask it to add Oracle to this graph, it needs to understand what I'm referring to, keep the previous context, and regenerate the image.
And finally, if I switch to a different topic, it needs to drop the previous context and understand that, hey, this historical context is less important, I'm starting from scratch -- so there is no Oracle in that cat or whatever.
This particular demo is actually open source; you can go to our GitHub and try it out. It's built with FireFunction and a few other models, including SDXL, which run on our platform. The model itself for function calling is actually open source, too.
It's on Hugging Face. You can, of course, call it at Fireworks for optimal speed, but you can also run it locally if you want. It uses a bunch of functionality on our platform -- for example, structured generation with JSON mode and grammar mode, which I think is similar to some of the previous talks from the Outlines folks who were speaking here yesterday.
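As an illustration of that structured generation, here's a minimal sketch using the OpenAI-compatible client; treat the exact response_format shape and model ID as assumptions and check the current API docs for the supported JSON and grammar options:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1",
                api_key="YOUR_FIREWORKS_API_KEY")

resp = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",  # assumed model ID
    messages=[{"role": "user",
               "content": "Extract the company name and ticker from: "
                          "'Oracle (ORCL) closed higher today.' Answer in JSON."}],
    # JSON mode constrains decoding to valid JSON; schema/grammar options may
    # differ, so treat this exact shape as an assumption and check the docs.
    response_format={"type": "json_object"},
)
print(resp.choices[0].message.content)
```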
Yeah. So, finally, try it out. Generally, how do you get started with Fireworks? If you head over to the Fireworks models page, you'll find a lot of the open base models I mentioned, and they're available in the playground. In terms of product offering, we have a range which can take you from early prototyping all the way to enterprise scale.
You can start with serverless inference, which is not that different from going to the OpenAI playground or something similar: you pay per token at a listed price, and you don't need to worry about hardware settings or anything. As I mentioned, you can still do fine-tuning.
You can do hosted fine-tuning on our platform, or bring your own LoRA adapter and still serve it serverless. As you graduate -- maybe you're a startup graduating to more production scale -- you might want to go on-demand, which is more like dedicated hardware with more settings and modifications for your use case.
You can bring your own custom model fine-tuned from scratch or do it on our platform. And finally, if you scale up to a bigger volume and want to go to the enterprise level, there are discounted long-term contracts, and we'll also help you personalize the hardware setup and do some of that performance tuning which I talked about earlier.
In terms of use cases, we're running production for many, many companies, ranging from small startups to big enterprises. We're serving, last time I checked, more than 150 billion tokens per day. Companies like Quora built chatbots like Poe on us. Sourcegraph and Cursor -- I think Cursor had a talk here yesterday -- use us for some of their code assistant functionality, where, as you can imagine, latency is really important. Folks like Upstage and Liner are building different assistants and agents on top of that. So we are definitely production-ready. Go try it out. Finally, we care a lot about developers -- you guys.
Actually, these are external numbers from last year's LangChain State of AI report, where it turns out that, after Hugging Face, we are one of the most popular platforms people pull models from, which is great and was very nice to hear. And again, for getting started, just head over to our website.
You can go play in the playground right away. For example, you can run Llama or Gemma or whatever at top speed, and start building from there. I'm really excited to see what you can build with open models, or FireFunction, or something which you fine-tune on your own.
And, yeah, last point: we are OpenAI API compatible, so you can still use your favorite tools and the same clients, or you can use frameworks like LangChain or LlamaIndex, et cetera. So, yeah, really excited to be here and to tell you a little bit about open-source models and how we at Fireworks focus on productionizing and scaling them.
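For example, an existing OpenAI or LangChain setup typically only needs the base URL and model swapped; the model ID below is an assumption, so check the catalog for the exact name:

```python
# Assumes the langchain-openai package; point the standard OpenAI chat client
# at Fireworks' OpenAI-compatible endpoint instead of api.openai.com.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="YOUR_FIREWORKS_API_KEY",
    model="accounts/fireworks/models/llama-v3-8b-instruct",  # assumed model ID
)
print(llm.invoke("Say one sentence about open models.").content)
```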
Go try it out. And you can also find us at the booth at the Expo. Thank you. Thank you.