Customized, production-ready inference with open source models: Dmytro (Dima) Dzhulgakov

As mentioned, unfortunately my co-founder, Lin, who was on the schedule, couldn't make it.
And as you saw, we don't yet have AI to figure out video projection, but we do have AI for a lot of other things.
So today I'm going to talk about Fireworks AI. I'm going to continue the theme that Kathleen started about open models, and how at Fireworks we focus on the productionization and customization of open source models for inference.
But first, as an introduction, what's our background?
The founding team of Fireworks comes from the PyTorch leads at Meta and some veterans from Google.
Combined, we have probably a decade of experience productionizing AI at some of the biggest companies.
I myself have been a core maintainer of PyTorch for the past five years, so the topic of open source is really close to my heart.
And since we led this revolution of the open source tool chain for deep learning through our work on PyTorch and some of the Google technologies, we really believe that open source models are the future for gen AI applications as well.
Our focus at Fireworks is precisely on that.
So, how many people in the audience use GPT and deploy it in production?
And how many folks use open models in production?
I was about to convince you that the share of open source models is going to grow over time, but it looks like in this audience it's already sizable.
But nevertheless, why this tradeoff, why go big or why go small?
Currently, the bulk of production inference is still based on proprietary models.
And those are really good models, often frontier in many domains.
However, the catch is that it's one model which is good at many, many things, and it's often served in the same way regardless of the use case.
That means if you have batch inference on some narrow domain, or you have some super real-time use case where you need to build, say, a voice assistant, those are often served from the same infrastructure without customization.
In terms of model capabilities, it also means that, yes, GPT-4 is great, Claude is great, but you're often paying a lot for additional capabilities which are not needed in your particular use case.
You don't really need a customer support chatbot to know about 150 Pokémon or be able to write poetry, but you really want it to be really good in your particular narrow domain.
So this kind of mismatch with large models leads to several issues.
One, as I mentioned, is high latency: using a big model means longer response times, which matters particularly for real-time use cases like voice assistants.
It gets more and more important with agentic applications, because an agent-like application needs to do a lot of steps, reason, and call the model many times.
And often you see that you can pick smaller models like Llama or Gemma, which you just heard about, and achieve the same or better quality for the narrow domain while being much faster and cheaper.
For example, for some function calling use cases, on the external benchmark from Berkeley, you get similar performance from a fine-tuned Llama 3 at 10x the speed.
Cost is also an issue if you're running a big model on a lot of traffic.
Even if you have, say, a 5K-token prompt and 10,000 users, and each of them calls the LLM 20 times per day, on GPT-4, or even on GPT-4o, it probably adds up to something like $10K per day, or several million dollars per year, which is a sizable cost for a startup.
You can easily cut that with much smaller models, and that is often what we see as the motivation for reaching for smaller and more customizable models.
But where open models really shine is domain adaptability, and that comes in two forms.
First, there are so many different fine-tunes and customizations.
I think Kathleen mentioned the Gemma adaptations built for Indian languages.
There are models specialized for code or for medicine; there are tens of thousands of different model variants.
And second, because the weights are open, you can always customize to your particular use case and tune quality specifically for it.
But what we see when people try to use an open model, something like Gemma or whatever it might be, is that you run into complicated setup and maintenance, right?
You need to figure out which frameworks to run it on.
You need to download your models, maybe do some performance optimization and tuning, and you kind of have to repeat this process end-to-end every time the model gets updated or a new version comes out.
On optimization itself, especially for LLMs but generally for gen AI models, there are many attributes and settings which really depend on your use case and requirements.
Somebody needs low latency, somebody needs high throughput; prompts can be short, prompts can be long.
And choosing the optimal settings across the stack is actually not trivial.
As I'll show you later, in many cases you can get multiple-x improvements from doing that tuning.
And finally, just getting production ready is actually hard.
As you go from experimentation to production, even just babysitting GPUs on public clouds is not easy, because GPUs are finicky and not always reliable.
And getting to enterprise scale requires all the scalability technology, telemetry, and so on.
So those are the things we focus on solving at Fireworks.
Starting with efficiency: we built our own custom serving stack, which we believe is one of the fastest out there.
We did it from the ground up, from writing our own CUDA kernels all the way to customizing how things get deployed and orchestrated at the service level.
But most importantly, we really focus on customizing the serving stack to your needs, which basically means that for your custom workload and your custom cost and latency requirements, we can tune the stack accordingly.
And what does customization mean in practice?
For example, many use cases use RAG and very long prompts.
There are many settings you can tune at the runtime level and at the deployment level to optimize for long prompts, which are often repeatable, so caching is useful, or just tuning settings so that throughput is higher while maintaining latency.
If you go to Artificial Analysis and select long prompts, Fireworks is actually the fastest, even faster than some of the other providers who are over there at the expo booths.
And we don't only focus on LLM inference; we cover many modalities.
As an example, for image generation we are the fastest provider serving SDXL, and we're also the only provider serving SD3, Stability's new model, because their API actually routes to us.
And finally, as I mentioned, customization matters a lot, especially for LLMs.
One useful way to think about LLM performance for a given use case is minimizing cost under a particular latency constraint.
We often have customers come and say, hey, for my interactive application I need to generate this many tokens in under two seconds.
And that's really where cross-stack optimizations shine.
If you tune for that particular latency cutoff and change many settings, you can deliver much higher throughput, multiple times higher throughput, which with the same hardware means much lower cost.
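To make that concrete, here is a minimal sketch of the selection problem: benchmark a few serving configurations, drop the ones that miss the latency budget, and keep the one with the best throughput per GPU. The configurations and numbers are made up for illustration; they are not Fireworks' actual settings.

```python
from dataclasses import dataclass

@dataclass
class Config:
    name: str
    batch_size: int
    p95_latency_s: float            # measured latency for the target request shape
    tokens_per_sec_per_gpu: float   # measured aggregate throughput

# Hypothetical benchmark results for one workload (long prompt, short completion).
candidates = [
    Config("low-batch",  batch_size=1,  p95_latency_s=0.6, tokens_per_sec_per_gpu=900),
    Config("mid-batch",  batch_size=8,  p95_latency_s=1.4, tokens_per_sec_per_gpu=4200),
    Config("high-batch", batch_size=32, p95_latency_s=3.1, tokens_per_sec_per_gpu=7800),
]

LATENCY_BUDGET_S = 2.0  # e.g. "generate this many tokens in under two seconds"

feasible = [c for c in candidates if c.p95_latency_s <= LATENCY_BUDGET_S]
best = max(feasible, key=lambda c: c.tokens_per_sec_per_gpu)
print(f"pick {best.name}: {best.tokens_per_sec_per_gpu} tok/s/GPU under the SLA")
# More tokens/sec per GPU under the same SLA means fewer GPUs for the same
# traffic, i.e. lower cost.
```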
In terms of model support, we support the best-quality open source models.
We heard about Gemma; obviously the Llamas, some of the ASR and text-to-speech models, image generation models, and so on.
We also work directly with model developers, some of whose models are served on Fireworks.
And in terms of platform capabilities, as I mentioned, we have a lot of open source models available out of the box.
We do some fine-tuning of those models in-house, so I'm going to talk a little bit about function calling specialized models later on, and we also do some vision-language models.
And of course, the key to open model development is that you can tune for a particular use case.
So we provide a platform for fine-tuning, whether you're bringing a dataset collected elsewhere or collecting it live from feedback while the model is served on our platform.
Specifically on customization, one interesting feature that a lot of people starting to experiment with models find useful concerns what happens when you fine-tune and deploy the resulting model.
It turns out that if you do LoRA fine-tuning, which a lot of folks do, you can do smart tricks and deploy multiple models on the same GPU, actually thousands of them.
That means we can still give you serverless inference, where you pay per token, even if you have thousands of model variants deployed there, without you having to pay any fixed cost.
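The reason that works is that a LoRA fine-tune only adds a pair of small low-rank matrices on top of the shared base weights, so the expensive part of the model is loaded once and only the tiny adapter changes per request. A minimal numpy sketch of the idea, with illustrative shapes (this is not the actual serving code):

```python
import numpy as np

d_model, rank = 4096, 16   # base hidden size, LoRA rank
# Shared base weight for one linear layer, loaded once for all tenants.
W = np.random.randn(d_model, d_model).astype(np.float32)

# Each "fine-tune" is just a pair of small matrices A (r x d) and B (d x r):
# ~0.5 MB per adapter vs ~64 MB for this full matrix, so thousands of adapters
# can sit in memory next to a single copy of the base model.
adapters = {
    f"customer-{i}": (
        np.random.randn(rank, d_model).astype(np.float32) * 0.01,   # A
        np.random.randn(d_model, rank).astype(np.float32) * 0.01,   # B
    )
    for i in range(100)
}

def forward(x: np.ndarray, adapter_id: str) -> np.ndarray:
    """One linear layer with the requested LoRA adapter applied: y = xW + x A^T B^T."""
    A, B = adapters[adapter_id]
    return x @ W + (x @ A.T) @ B.T   # base path is shared; the delta is per-tenant

x = np.random.randn(1, d_model).astype(np.float32)
print(forward(x, "customer-42").shape)  # (1, 4096)
```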
Of course, a single model is all great, but what we see more and more in applications is that the model by itself is not the product.
You need a bigger system in order to solve the target application.
One reason is that models by themselves tend to hallucinate, and that's where RAG, or access to external knowledge, comes in.
Also, we don't yet have a magical multimodal AI across all the modalities, so often you have to chain multiple types of models.
And of course, there are all the external tools and external actions that end-to-end applications might want to take.
So the term I really like, popularized by Databricks, is "compound AI system".
We're increasingly seeing a transition from just the model being the product to this combination of maybe RAG, function calling, external tools, et cetera, built together as the product.
And that's pretty much the direction in which we see this field moving over time.
So what does that mean from our perspective, in terms of what we do?
We see a function calling agent at the core of this emerging architecture.
It might be connected to domain-specialized models served on our platform directly, maybe tuned for different needs, and connected to external tools, maybe a code interpreter or external APIs somewhere, with this central model coordinating and triaging the user's requirements if it's, for example, a chatbot.
You've probably all heard about function calling, popularized initially by OpenAI.
Function calling is really about how to connect an LLM to external tools.
We actually focus on fine-tuning models specifically for function calling; the latest one, FireFunction V2, was released two weeks ago.
And what you can do with that, if I manage to click on this button, is build applications which combine freeform general chat capabilities with function calling.
In this case, this FireFunction model has some chat capabilities, so you can ask it, what can you do? And it has some self-reflection to tell you what it can do.
It's also connected, in this demo app, to a bunch of external tools.
But what it really needs to figure out is how to do complex reasoning and translate the user query into function calls.
So for example, if we ask it to generate a bar chart with the stocks of the top three cloud providers, the big three, it actually needs to do several steps, right?
It needs to understand that the top three cloud providers means AWS, GCP, and Azure.
It then needs to make function calls to query their stock prices.
And finally, it needs to combine that information and send it to the chart plotting API, which is what you see here.
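As a rough idea of what an app like this does under the hood, here is a minimal sketch against the OpenAI-compatible chat completions endpoint that Fireworks exposes. The tool definition is a hypothetical stand-in for the demo's stock quote function, and the exact model path is an assumption to check against the docs; this is not the demo's actual source.

```python
# pip install openai
import json
from openai import OpenAI

# Fireworks exposes an OpenAI-compatible endpoint, so the standard client works.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="FIREWORKS_API_KEY",  # your Fireworks key
)

tools = [{  # hypothetical tool, analogous to the demo's stock quote function
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Get the latest stock price for a ticker symbol",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

resp = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",  # assumed model path
    messages=[{"role": "user",
               "content": "Bar chart of the big three cloud providers' stock prices"}],
    tools=tools,
)

# The model decides which functions to call and with what arguments;
# the app then executes them and feeds the results back as 'tool' messages.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```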
Another important aspect you need for efficient function calling combined with chat capabilities is contextual awareness.
So if I ask it to add Oracle to this graph, it needs to understand what I'm referring to, keep the previous context, and regenerate the chart.
And finally, if I switch to a different topic, it needs to drop the previous context and understand that, hey, this historical context is not relevant anymore.
There's no Oracle in that cat or whatever.
This particular demo is actually open source; you can go to our GitHub and try it out.
It's built with FireFunction and a few other models.
The function calling model itself is actually open source.
You can of course call it on Fireworks for optimal speed, but you can also run it yourself.
It uses a bunch of functionality on our platform, for example structured generation with JSON mode and grammar mode, which I think is similar to some of the previous talks here from the Outlines folks.
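For instance, a minimal sketch of JSON-mode structured output on the same OpenAI-compatible endpoint; the model choice is illustrative, and the schema- and grammar-constrained modes take extra request fields that are not shown here, so check the docs for those:

```python
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="FIREWORKS_API_KEY",
)

# JSON mode constrains the model to emit syntactically valid JSON.
resp = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",  # illustrative model choice
    messages=[{
        "role": "user",
        "content": "Extract the company and ticker from: 'Add Oracle (ORCL) to the chart'. "
                   "Reply as JSON with keys 'company' and 'ticker'.",
    }],
    response_format={"type": "json_object"},
)
print(json.loads(resp.choices[0].message.content))
```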
And generally, how do you get started with Fireworks?
If you head over to the models page on fireworks.ai, you'll find a lot of open source, open-weights models.
In terms of product offering, we have a range which can take you from early experimentation all the way to production scale.
You can start with serverless inference, which is not that different from going to the OpenAI playground or something similar, where you pay per token and you don't need to worry about hardware settings or anything.
As I mentioned, you can still do fine-tuning: you can do hosted fine-tuning on our platform, or you can bring your own LoRA adapter and still serve it serverless.
As you graduate to more production scale, maybe as a startup, you might want to go to on-demand, which is more like dedicated hardware with more settings and modifications for your use case.
You can bring your own custom model, fine-tuned from scratch or on our platform.
And finally, if you scale up to a bigger volume, you can go to the enterprise level, with discounted long-term contracts, where we also help you personalize the hardware setup and do some of that performance tuning I talked about earlier.
In terms of use cases, we are running production for many, many companies, ranging from startups to large enterprises.
Last time I checked, we are serving more than 150 billion tokens per day.
Companies like Quora build chatbots like Poe on us.
Cursor, which I think had a talk earlier, uses us for some of their code assistant functionality, where latency is really important.
And folks like Upstage and Liner are building different assistant products on top of us.
So we are definitely production ready, go try it out.
Finally, we care a lot about developers, you guys.
These are external numbers from last year's State of AI report, where it turns out we are one of the most popular platforms people pull models from, right after Hugging Face.
And again, to get started, just head over to our website.
You can go play in the playground right away; you can run Llama or Gemma or whatever at top speed, and start building from there.
We're really excited to see what you can build with open models, or FireFunction, or some of the stuff you can find on your own.
You can still use your favorite tools and the same clients, or you can use the frameworks you're used to.
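For example, a minimal sketch of calling the chat completions REST endpoint directly from Python; the model path below is just an illustrative choice, pick any model from the playground:

```python
import requests

# Chat completions REST endpoint (OpenAI-compatible request/response shape).
url = "https://api.fireworks.ai/inference/v1/chat/completions"
headers = {
    "Authorization": "Bearer FIREWORKS_API_KEY",
    "Content-Type": "application/json",
}
payload = {
    # Illustrative model path; any model from the models page works here.
    "model": "accounts/fireworks/models/llama-v3-8b-instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}
resp = requests.post(url, headers=headers, json=payload, timeout=60)
print(resp.json()["choices"][0]["message"]["content"])
```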
Really excited to be here and tell you a little bit about open source models and how we at Fireworks focus on productionizing them and scaling them up.
Check it out, and you can also find us at the booth at the expo. Thank you.