
Customized, production-ready inference with open-source models: Dmytro (Dima) Dzhulgakov


Transcript

Hello, everyone. My name is Dima. As mentioned, unfortunately my co-founder, Lin, who was on the schedule, couldn't make it today because of a personal emergency, so you got me. And as you saw, we don't yet have AI to figure out video projection, but we have AI for a lot of other things.

So today I'm going to talk about Fireworks AI, and I'm going to continue the theme Katrin started about open models and how we focus on productionizing and customizing open-source models for inference at Fireworks. But first, as an introduction, what's our background? The founding team of Fireworks comes from the PyTorch leads at Meta and some veterans from Google AI.

Combined, we have probably a decade of experience productionizing AI at some of the biggest companies in the world. I've personally been a core maintainer of PyTorch for the past five years, so the topic of open source is really close to my heart. And since we led this revolution of open-source toolchains for deep learning through our work on PyTorch and some of the Google technologies, we really believe open-source models are also the future for gen AI applications.

And our focus at Fireworks is precisely on that. Quick show of hands: how many people in the audience use GPT and deploy it in production? And how many folks use open models in production? Oh, okay. I was about to convince you that the share of open-source models is going to grow over time, but it looks like in this audience it's already sizable.

But nevertheless, why this tradeoff, why go big or why go small? Currently, the bulk of production inference is still based on proprietary models. Those are really good models, often frontier in many domains. However, the catch is that it's one model that has to be good at many, many things.

And it's often served the same way regardless of the use case. Whether you're doing batch inference in some narrow domain or have a super real-time use case like a voice assistant, those are often served from the same infrastructure without customization.

In terms of model capabilities, it also means that, yes, GPT-4 or Claude is great and can handle a lot of things, but you're often paying a lot for capabilities that aren't needed in your particular use case. You don't really need a customer support chatbot to know about 150 Pokémon or to be able to write poetry.

You really want it to be very good in your particular narrow domain. This discrepancy with large models leads to several issues. One, as I mentioned, is high latency: using a big model means longer response times, which matters a lot for real-time use cases like voice assistants.

It gets even more important with agentic applications, because an agent-like application needs to do many reasoning steps and call the model many times, so latency compounds. And often you can pick smaller models like Llama or Gemma, which we just heard about, and achieve the same or better quality for the narrow domain while being up to 10 times faster.

For example, on some function calling use cases, on the external benchmark from Berkeley, you get similar performance from a fine-tuned Llama 3 at 10x the speed. Cost is also an issue if you're running a big model on a lot of traffic: say you have 5K-token prompts and 10,000 users, each calling the LLM 20 times per day. On GPT-4, or even GPT-4o, that probably adds up to around $10K per day, or several million per year, which is a sizable cost for a startup.
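As a rough back-of-envelope check of that math, here's a small sketch; the per-token prices and the output length are illustrative assumptions, not figures from the talk.

```python
# Back-of-envelope LLM API cost for the scenario above.
# Assumed: 10,000 users, 20 calls/user/day, 5,000 prompt tokens and ~500
# output tokens per call, and an illustrative GPT-4o-class price of
# $5 per 1M input tokens and $15 per 1M output tokens.
users = 10_000
calls_per_user_per_day = 20
prompt_tokens = 5_000
output_tokens = 500
input_price_per_m = 5.0    # USD per 1M input tokens (assumed)
output_price_per_m = 15.0  # USD per 1M output tokens (assumed)

calls_per_day = users * calls_per_user_per_day
daily_cost = (calls_per_day * prompt_tokens / 1e6) * input_price_per_m \
           + (calls_per_day * output_tokens / 1e6) * output_price_per_m
print(f"~${daily_cost:,.0f}/day, ~${daily_cost * 365 / 1e6:.1f}M/year")
# -> ~$6,500/day, ~$2.4M/year: the same ballpark as the talk's figure.
```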

You can easily cut that with much smaller models, and we often see that as the motivation for reaching for smaller, more customizable models. But where open models really shine is domain adaptability, and that comes in two aspects. First, there are so many different fine-tunes and customizations available.

I think Katelyn mentioned the Gemma adaptations built for Indian languages. There are models specialized for code or for medicine; there are tens of thousands of different model variants. And because the weights are open, you can always customize for your particular use case and tune quality specifically for what you need.

So open-source models are great. What are the challenges? They really come from three areas. When people try to use an open model, something like Gemma or whatever it might be, the first thing they run into is complicated setup and maintenance. You need to go and find GPUs somewhere.

You need to figure out which frameworks to run on them, download your models, maybe do some performance tuning, and you have to repeat this process end-to-end every time the model gets updated or a new version is released. On optimization itself, especially for LLMs but really for gen AI models in general, there are many attributes and settings that depend heavily on your use case and requirements.

Somebody needs low latency, somebody needs high throughput, prompts can be short or long, et cetera. Choosing the optimal settings across the stack is not trivial, and as I'll show later, in many cases you can get multiple-x improvements from doing this well.

And finally, just getting production-ready is hard. As you go from experimentation to production, even babysitting GPUs on public clouds is not easy, because GPUs are finicky and not always reliable. And getting to enterprise scale requires all the scalability work, telemetry, observability, et cetera.

Those are the things we focus on solving at Fireworks. Starting with efficiency: we built our own custom serving stack, which we believe is one of the fastest, if not the fastest. We did it from the ground up, from writing our own CUDA kernels all the way to customizing how things get deployed and orchestrated at the service level.

That brings multiple optimizations. But most importantly, we focus on customizing the serving stack to your needs, which means that for your particular workload and your particular cost and latency requirements, we can tune it for those settings. What does that mean in practice?

For example, many use cases use RAG and have very long prompts. There are many settings you can tune at the runtime and deployment level to optimize for long prompts, which are often repetitive, so caching is useful, or simply to push throughput higher while maintaining latency.
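To illustrate why repeatable long prompts make caching pay off, here is a minimal conceptual sketch of prefix (KV) caching; this is not Fireworks' implementation, and `model_prefill` is just a stand-in for a real engine's prefill pass.

```python
from hashlib import sha256

def model_prefill(tokens, past_state=None):
    # Stand-in for the expensive attention prefill over `tokens`;
    # a real engine would return key/value tensors, not a counter.
    prev = past_state["tokens_covered"] if past_state else 0
    return {"tokens_covered": prev + len(tokens)}

kv_cache = {}  # hash of shared prefix -> cached KV state

def prefill_with_cache(prompt_tokens, shared_prefix_len):
    """Reuse the KV state of a repeated prefix (e.g. RAG system prompt + docs)."""
    prefix = tuple(prompt_tokens[:shared_prefix_len])
    key = sha256(repr(prefix).encode()).hexdigest()
    if key not in kv_cache:
        kv_cache[key] = model_prefill(prefix)  # paid once per unique prefix
    # Only the short per-request suffix is prefilled on each call.
    suffix = prompt_tokens[shared_prefix_len:]
    return model_prefill(suffix, past_state=kv_cache[key])

shared = list(range(8_000))                 # e.g. a long retrieved context
prefill_with_cache(shared + [101], 8_000)   # first request pays for 8,000 tokens
prefill_with_cache(shared + [202], 8_000)   # later requests prefill only 1 new token
```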

This is independently benchmarkable: if you go to Artificial Analysis and select the long-prompt setting, Fireworks is actually the fastest, even faster than some of the other providers who are over there at the expo booths. And we don't only focus on LLM inference; we cover many modalities.

As an example, for image generation we are the fastest provider serving SDXL, and we're also the only provider serving SD3, Stability's new model, because their API actually routes to our servers. And finally, as I mentioned, customization matters a lot, especially for LLMs. A useful way to think about LLM performance for a given use case is minimizing cost under a particular latency constraint.

We often have customers come and say: for my interactive application, I need to generate this many tokens in under two seconds. That's really where cross-stack optimization shines. If you tune for a particular latency cutoff and adjust many settings across the stack, you can deliver multiple times higher throughput, and higher throughput basically means fewer GPUs and lower cost.
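To make that framing concrete, here is a toy sketch of picking the configuration with the highest throughput that still meets a latency cutoff; the candidate configurations and their numbers are invented, and this is not Fireworks' actual tuner.

```python
from dataclasses import dataclass

@dataclass
class Config:
    name: str
    p95_latency_s: float           # measured end-to-end p95 latency
    tokens_per_sec_per_gpu: float  # measured generation throughput

# Made-up measurements for three hypothetical deployment settings.
candidates = [
    Config("batch=1,  fp16", 0.8,   900),
    Config("batch=8,  fp16", 1.6, 4_200),
    Config("batch=32, fp8",  2.7, 9_800),  # fastest, but misses the SLO below
]

def best_under_slo(configs, latency_slo_s):
    feasible = [c for c in configs if c.p95_latency_s <= latency_slo_s]
    return max(feasible, key=lambda c: c.tokens_per_sec_per_gpu)

print(best_under_slo(candidates, latency_slo_s=2.0).name)
# -> "batch=8,  fp16": highest throughput (fewest GPUs) within the 2 s cutoff
```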

In terms of model support, we support the best-quality open models: Gemma, which we just heard about, obviously the Llamas, and some ASR and text-to-speech models, from many providers. We also work directly with model developers; for example, a new model that launched last week is also served on Fireworks.

In terms of platform capabilities, as I mentioned, we have a lot of open models to get you started, plus customized ones. We do some fine-tuning of those models in-house: I'll talk a bit about function-calling-specialized models later on, and we also tune some vision-language models ourselves, which we release as well.

And of course, the key benefit of open models is that you can tune them for a particular use case. So we provide a platform for fine-tuning, whether you bring a dataset collected elsewhere or collect it live from feedback while serving on our platform. Specifically on customization, one question that people starting to experiment with fine-tuning run into is: once you fine-tune and deploy the resulting model, how do you serve it efficiently?

It turns out that if you do LoRA fine-tuning, which a lot of folks do, you can do smart tricks and deploy multiple models on the same GPU, actually thousands of them. That means we can still give you serverless, pay-per-token inference even if you have thousands of model variants deployed, without any fixed cost.
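A rough sketch of why that works, assuming standard LoRA math (this is conceptual, not Fireworks' serving code): adapters are tiny low-rank matrices applied on top of shared, frozen base weights, so many adapters can share one copy of the base model on a GPU.

```python
import torch

d_model, rank = 4096, 16
base_W = torch.randn(d_model, d_model)  # one frozen base weight, shared by everyone

# Per-tenant LoRA adapters: each is only 2 * d_model * rank parameters (~128K
# floats here) versus d_model**2 (~16M) for the base layer, so the marginal
# memory per fine-tuned variant is small.
adapters = {
    f"customer_{i}": (torch.randn(d_model, rank) * 0.01,   # A
                      torch.randn(rank, d_model) * 0.01)   # B
    for i in range(100)  # scale this toward thousands in a real multi-tenant setup
}

def lora_forward(x: torch.Tensor, adapter_id: str) -> torch.Tensor:
    """y = x @ W + (x @ A) @ B: base path is shared, adapter path is per-tenant."""
    A, B = adapters[adapter_id]
    return x @ base_W + (x @ A) @ B

y = lora_forward(torch.randn(1, d_model), "customer_42")
```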

Of course, a single model is great, but increasingly what we see in applications is that the model by itself is not the product. You need a bigger system to solve the target application. One reason is that models by themselves tend to hallucinate, and that's where RAG, or access to external knowledge bases, comes in.

Also, we don't yet have a magical multimodal model that covers all modalities, so you often have to chain multiple types of models. And of course there are all the external tools and external actions that end-to-end applications might want to take in agentic form.

A term I really like, popularized by Databricks, is "compound AI system": we're increasingly seeing a transition from the model being the product to a combination of RAG, function calling, external tools, et cetera, built together as the product.

And that's pretty much the direction we see this field moving over time. So what does that mean from our perspective? We see a function-calling agent at the core of this emerging architecture. It might be connected to domain-specialized models served directly on our platform, maybe tuned for different needs, and connected to external tools, whether that's a code interpreter or external APIs somewhere, with this central model coordinating and triaging the user's requests if it's, for example, a chatbot.

You've probably all heard about function calling, popularized initially by OpenAI. That's basically the same idea: function calling is really about how to connect an LLM to external tools and external environments. What does that mean in practice? We actually focus on fine-tuning models specifically for function calling.

We've released a series of such models; the latest, FireFunction V2, was released two weeks ago. What you can do with that, if I manage to click on this button, is build applications that combine freeform general chat capabilities with function calling.

In this case, the model has chat capabilities, so you can ask it "what can you do?" and it has enough self-reflection to tell you. It's also connected in this demo app to a bunch of external tools.

It can query stock quotes, it can plot charts through external APIs, and it can also generate images. But what it really needs to figure out is how to do the reasoning and translate the user query into function calls. For example, if we ask it to generate a bar chart with the stocks of the top three cloud providers, the big three, it actually needs to do several steps.

It needs to understand that the top three cloud providers means AWS, GCP, and Azure, and that Azure is under Microsoft. It then needs to make function calls to query their stock prices. And finally, it needs to combine that information and send it to the chart-plotting API, which is what just happened in the background.
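For reference, here is a minimal sketch of what calling a function-calling model through Fireworks' OpenAI-compatible API can look like; the endpoint, model id, and tool schema below are assumptions for illustration, so check the Fireworks docs for the exact values.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # assumed endpoint
    api_key="YOUR_FIREWORKS_API_KEY",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_quote",  # hypothetical tool from the demo app
        "description": "Get the latest stock price for a ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

response = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",  # assumed model id
    messages=[{"role": "user", "content": "Plot the stocks of the top 3 cloud providers"}],
    tools=tools,
)
# The model should answer with tool_calls such as get_stock_quote(ticker="AMZN"),
# get_stock_quote(ticker="MSFT"), get_stock_quote(ticker="GOOGL"), ...
print(response.choices[0].message.tool_calls)
```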

Another important aspect of combining function calling with chat effectively is contextual awareness. If I ask it to add Oracle to this graph, it needs to understand what I'm referring to, keep the previous context, and regenerate the image.

And finally, if I switch to a different topic, it needs to drop the previous context and understand that the historical context is less important now and it should start from scratch: there's no Oracle in that cat or whatever.

This particular demo is actually open source; you can go to our GitHub and try it out. It's built with FireFunction and a few other models, including SDXL, which run on our platform. The function-calling model itself is also open.

The weights are on Hugging Face. You can, of course, call it on Fireworks for optimal speed, but you can also run it locally if you want. It uses a bunch of functionality on our platform, for example structured generation with JSON mode and grammar mode, which is similar to what the Outlines folks were talking about here yesterday.
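For the structured-generation piece, here is a hedged sketch of an OpenAI-style JSON mode request; the exact `response_format` options Fireworks supports are an assumption here, so consult their docs.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1",  # assumed endpoint
                api_key="YOUR_FIREWORKS_API_KEY")

schema = {
    "type": "object",
    "properties": {"ticker": {"type": "string"}, "price_usd": {"type": "number"}},
    "required": ["ticker", "price_usd"],
}

response = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",  # assumed model id
    messages=[{"role": "user", "content": "Give me AMZN's latest price as JSON."}],
    response_format={"type": "json_object", "schema": schema},  # assumed option
)
# With constrained decoding, the content should parse cleanly against the schema.
parsed = json.loads(response.choices[0].message.content)
```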

So, finally, try it out. How do you get started with Fireworks in general? If you head to the Fireworks site and browse the models, you'll find a lot of the open base models I mentioned; they're available in the playground. In terms of product offering, we have a range that can take you from early prototyping all the way to enterprise scale.

You can start with serverless inference, which is not that different from going to the OpenAI playground or similar: you pay per token at a set price, and you don't need to worry about hardware settings or anything. As I mentioned, you can still do fine-tuning there.

You can do hosted fine-tuning on our platform, or bring your own LoRA adapter and still serve it serverless. As you graduate to a more production scale, say as a startup, you might want to move to on-demand, which is more like dedicated hardware with more settings and modifications for your use case.

You can bring your own custom model fine-tuned from scratch or do it on our platform. And finally, if you scale up to a bigger volume, you can move to the enterprise level, with discounted long-term contracts, where we'll also help you personalize the hardware setup and do some of the performance tuning I talked about earlier.

In terms of use cases, we are running production workloads for many, many companies, ranging from small startups to big enterprises. Last time I checked, we're serving more than 150 billion tokens per day. Companies like Quora build chatbots like Poe on top of us, and I think Cursor had a talk here yesterday.

They use us for some of their code-assistant functionality, where latency, as you can imagine, is really important. Folks like Upstage and Liner are building different assistants and agents on top of us. So we are definitely production-ready; go try it out. Finally, we care a lot about developers, you guys.

These are actually external numbers from last year's State of AI survey, where it turns out that, after Hugging Face, we are one of the most popular platforms people pull models from, which was very nice to hear. And again, to get started, just head to our website.

You can go play in the playground right away; for example, you can run Llama or Gemma at top speed. Then start building from there. We're really excited to see what you can build with open models, FireFunction, or whatever else you find on your own.

And one last point: we are, as I mentioned, OpenAI API compatible, so you can keep using your favorite tools and the same clients, or frameworks like LangChain or LlamaIndex. So, really excited to be here and tell you a bit about open-source models and how Fireworks focuses on productionizing and scaling them.
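As one example of that compatibility, here is a hedged sketch using the LangChain integration; it assumes the `langchain-fireworks` package and the model id shown, both of which you should verify against current docs.

```python
# Assumes `pip install langchain-fireworks` and FIREWORKS_API_KEY in the environment.
from langchain_fireworks import ChatFireworks

llm = ChatFireworks(model="accounts/fireworks/models/llama-v3-70b-instruct")  # assumed model id
print(llm.invoke("In one sentence, why can smaller open models be cheaper to serve?").content)
```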

Go try it out, and you can also find us at the booth at the expo. Thank you.