Making Open Models 10x faster and better for Modern Application Innovation: Dmytro (Dima) Dzhulgakov

00:00:17.040 |
As mentioned, unfortunately, my co-founder, Lin, who was on the schedule, couldn't make it. 00:00:25.840 |
And as you saw, we don't yet have AI to figure out video projection, but we have AI for a lot of other things. 00:00:32.960 |
So today I'm going to talk about Fireworks AI, and generally I'm going to continue the theme 00:00:36.900 |
which Katrin started about open models, and how we focus on the productionization 00:00:43.080 |
and customization of open source models for inference at Fireworks. 00:00:48.600 |
But first, as an introduction, what's our background? 00:00:52.140 |
So the founding team of Fireworks comes from the PyTorch leads at Meta and some veterans from Google. 00:00:59.640 |
Combined, we have probably a decade of experience productionizing AI. 00:01:06.300 |
And I myself have been a core maintainer of PyTorch for the past five years. 00:01:11.520 |
So the topic of open source is really close to my heart. 00:01:15.980 |
And since we kind of led the revolution of the open source toolchain for deep learning through 00:01:20.500 |
our work on PyTorch and some of the Google technologies, we really believe that open 00:01:25.680 |
source models are the future for gen AI applications as well, and that's our focus at Fireworks. 00:01:36.820 |
So how many people in the audience actually use GPT and deploy it in production? 00:01:45.420 |
And how many folks use open models in production? 00:01:51.640 |
So I was about to convince you that the share of open source models is going to grow over time, 00:01:56.300 |
but it looks like in this audience it's already sizable. Nevertheless. 00:02:02.020 |
So why this trade-off, why go big or why go small? 00:02:07.700 |
Currently, the bulk of production inference is still based on proprietary models. 00:02:12.080 |
To be fair, those are really good models, often frontier in many domains. 00:02:17.740 |
However, the catch is that it's one model which is good at many, many things. 00:02:22.520 |
And it's often served in the same way regardless of the use case, which means that whether 00:02:27.560 |
you have batch inference on some narrow domain or a super real-time use case where 00:02:33.400 |
you need to do, like, voice assistance or something, those are often served 00:02:38.640 |
from the same infrastructure without customization. 00:02:41.160 |
In terms of model capabilities, it also means that, yes, GPT-4 is great, Claude is great, 00:02:46.580 |
and they can handle a lot of things, but you are often paying a lot for additional capabilities you don't need. 00:02:53.480 |
You don't really need a customer support chatbot to know about 150 Pokémon or be able to write 00:02:59.700 |
poetry, but you really want it to be really good in your particular narrow domain. 00:03:06.380 |
So this kind of discrepancy for large models leads to several issues. 00:03:11.040 |
One, as I mentioned, is high latency, because using a big model means longer response times, 00:03:17.040 |
which is particularly important for real-time use cases like voice systems. 00:03:22.140 |
It gets more and more important with things like agents, right? 00:03:29.420 |
An agent-like application needs to do a lot of steps end to end, so the latency adds up. 00:03:38.220 |
And often you see that you can pick smaller models, like Llama or Gemma, which was just 00:03:43.780 |
talked about, and achieve, for the narrow domain, the same or better quality while being much faster and cheaper. 00:03:52.260 |
For example, for some of the function calling use cases, on the external benchmark from Berkeley, 00:03:57.380 |
you get similar performance from a fine-tuned Llama 3 at 10x the speed. 00:04:03.440 |
Cost is also an issue if you're running a big model on a lot of traffic. 00:04:10.000 |
Even if you have, say, a 5K-token prompt and 10,000 users, and each of them calls 00:04:15.920 |
the LLM 20 times per day, then on GPT-4, even on GPT-4o, it probably adds up to something like 10K per 00:04:21.500 |
day, or several million per year, which is a sizable bill. 00:04:27.500 |
You can easily cut that with much smaller models, and that's often what we see as the motivation 00:04:33.960 |
for reaching for smaller and more customizable models. 00:04:39.180 |
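To make that back-of-the-envelope estimate concrete, here is a quick sketch of the arithmetic; the per-token prices are illustrative assumptions, not quoted rates from any provider.

```python
# Back-of-the-envelope sketch of the numbers above. Prices are assumptions
# for illustration only, not actual provider pricing.
users = 10_000
calls_per_user_per_day = 20
prompt_tokens_per_call = 5_000

tokens_per_day = users * calls_per_user_per_day * prompt_tokens_per_call  # 1,000,000,000 tokens/day

large_model_usd_per_million = 5.00   # assumed price for a large proprietary model
small_model_usd_per_million = 0.20   # assumed price for a small open model

large_daily = tokens_per_day / 1_000_000 * large_model_usd_per_million   # ~$5,000/day
small_daily = tokens_per_day / 1_000_000 * small_model_usd_per_million   # ~$200/day

print(f"tokens/day: {tokens_per_day:,}")
print(f"large model: ~${large_daily:,.0f}/day, ~${large_daily * 365:,.0f}/year")
print(f"small model: ~${small_daily:,.0f}/day, ~${small_daily * 365:,.0f}/year")
```

Under these assumed prices, the large model lands in the "several million per year" range mentioned above, while a small model is roughly an order of magnitude cheaper.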
But where open models really shine is domain adaptability. 00:04:46.160 |
First, there are so many different fine-tunes and customizations. 00:04:49.760 |
I think Katrin was mentioning adaptations built for German or Indian languages, for example; 00:04:54.980 |
there are models specialized for code or for medicine. 00:04:57.480 |
If you head to Hugging Face, there are tens of thousands of different model variants. 00:05:02.180 |
And because the weights are open, you can always customize to your particular use case. 00:05:06.040 |
And tune quality specifically for what you need. 00:05:17.000 |
First, what we usually see when people try to use an open model, something like 00:05:21.420 |
Gemma or Llama or whatever it might be, is that you run into complicated setup and maintenance, right? 00:05:30.120 |
You need to figure out which framework to run them on. 00:05:33.680 |
You need to download your models, maybe do some performance optimization and tuning, and you 00:05:37.740 |
kind of have to repeat this process end-to-end every time the model gets updated or a new version comes out. 00:05:44.860 |
On optimization itself, especially for LLMs but generally for gen AI models, there 00:05:49.940 |
are many attributes and settings which are really dependent on your use case and requirements. 00:05:56.880 |
Somebody needs low latency, somebody needs high throughput; prompts can be short or long. 00:06:01.880 |
And choosing the optimal settings across the stack is actually not trivial. 00:06:06.300 |
As I'll show you later, in many cases you can get multiple-X improvements from doing this kind of tuning. 00:06:12.880 |
And finally, just getting production-ready is actually hard. 00:06:16.420 |
As you go from experimentation to production, even just babysitting GPUs on public clouds is 00:06:23.060 |
not that easy, because GPUs are finicky and not always reliable. 00:06:26.880 |
And getting to enterprise scale requires, you know, all the scalability technology, telemetry, and so on. 00:06:33.400 |
So those are things which we focus on solving at Fireworks. 00:06:38.100 |
So starting with efficiency, we built our own custom serving stack, which we believe is one of the best out there. 00:06:46.400 |
We did it from the ground up, meaning from writing our own CUDA kernels 00:06:51.440 |
all the way to customizing how things get deployed and orchestrated at the service level. 00:06:59.400 |
But most importantly, we really focus on customizing the serving stack to your needs, which basically 00:07:04.780 |
means that for your custom workload and for your custom cost and latency requirements, we can tune the whole stack. 00:07:15.240 |
And what does customization mean in practice? 00:07:17.700 |
For example, many use cases use RAG and have very long prompts. 00:07:22.260 |
So there are many settings you can tune, at the runtime level and at the deployment level, to 00:07:26.620 |
optimize for long prompts, which are often repeated across requests. 00:07:30.220 |
So caching is useful, or just tuning settings so the throughput is higher while maintaining latency. 00:07:37.680 |
If you go to Artificial Analysis and select the long prompt setting, you'll see that Fireworks actually 00:07:42.280 |
is the fastest, even faster than some of the other providers who are over there at the Expo booth. 00:07:48.580 |
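As a conceptual illustration of why long RAG prompts are cache-friendly: if the long shared context is kept as a stable prefix and only the question varies, a serving-side prefix (KV) cache can reuse the expensive prefill across requests. This sketch only shows prompt structure; it does not assume any Fireworks-specific caching API.

```python
# Conceptual sketch: structure RAG requests so the long, repeated context is a
# stable prefix and only the user question varies. Serving-side prefix (KV)
# caches can then reuse the expensive prefill work across requests.
LONG_SHARED_CONTEXT = open("knowledge_base_excerpt.txt").read()  # thousands of tokens, same for every request

def build_messages(question: str) -> list[dict]:
    return [
        # identical across requests -> cacheable prefix
        {"role": "system",
         "content": "Answer using only the provided context.\n\n" + LONG_SHARED_CONTEXT},
        # varies per request -> only this part needs fresh prefill
        {"role": "user", "content": question},
    ]
```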
And we don't only focus on LLM inference. 00:07:54.040 |
As an example, for image generation, we are the fastest provider serving SDXL. 00:07:58.160 |
We are also the only provider serving SD3, Stability's new model, because their API actually runs on Fireworks. 00:08:08.460 |
And finally, as I mentioned, especially for LLMs, customization matters a lot. 00:08:14.660 |
One paradigm of how to think about performance of LLMs that is often 00:08:18.400 |
useful for use cases is to think about minimizing cost under a particular latency constraint. 00:08:23.400 |
We often have customers come and say, hey, I have this interactive application, 00:08:28.240 |
and I need to generate that many tokens under two seconds. 00:08:31.420 |
And that's really where cross-stack optimizations shine. 00:08:36.000 |
By tuning to that particular latency cutoff and changing many settings, you can deliver 00:08:42.100 |
much higher throughput, multiple times higher throughput, which basically means fewer GPUs for the same traffic. 00:08:51.540 |
In terms of model support, we support the best quality open source models. 00:08:56.600 |
You know, we heard about Gemma just now, obviously the Llamas, and some of the ASR and text-to-speech models as well. 00:09:05.560 |
We also work directly with model developers whose models are served on Fireworks. 00:09:15.940 |
And in terms of platform capabilities, as I mentioned, we have a lot of open source models available. 00:09:24.660 |
We do some of the fine-tuning of those models in-house. 00:09:27.740 |
I'm going to talk a little bit about function-calling-specialized models later on, and we also tune 00:09:32.840 |
some of the vision language models ourselves, which we release as well. 00:09:38.480 |
And of course, the key benefit of open models is that you can tune them for a particular use case. 00:09:45.480 |
So we do provide a platform for fine-tuning, whether you are bringing a dataset collected 00:09:50.480 |
elsewhere or collecting it live from feedback while serving on our platform. 00:09:55.480 |
Specifically on customization, one interesting feature, which a lot of people starting to 00:10:02.120 |
experiment with models find useful, is what happens when you try to fine-tune and deploy the resulting model. 00:10:09.120 |
It turns out that if you do LoRA fine-tuning, which a lot of folks do, you can do smart tricks and 00:10:16.000 |
deploy multiple LoRA models on the same GPU. 00:10:18.120 |
Actually, thousands of them, which means we can still give you serverless inference with 00:10:23.760 |
pay-per-token pricing even if you have thousands of model variants sitting deployed there. 00:10:35.760 |
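As a rough sketch of what that looks like from the caller's side: a deployed LoRA fine-tune is queried like any other model through the OpenAI-compatible chat endpoint, just by its model id. The account and adapter names below are placeholders, and the base URL assumes Fireworks' OpenAI-compatible API.

```python
# Sketch: querying a LoRA fine-tune through an OpenAI-compatible endpoint.
# The model id (account and adapter names) is a hypothetical placeholder.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_FIREWORKS_API_KEY",
)

resp = client.chat.completions.create(
    # a LoRA adapter deployed under your account -- placeholder name
    model="accounts/your-account/models/your-lora-adapter",
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
)
print(resp.choices[0].message.content)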
But what we see increasingly in applications is that the model is not the product, right? 00:10:43.400 |
You need a bigger system in order to solve the target application. 00:10:47.720 |
And the reason for that is because models by themselves tend to hallucinate, so you need 00:10:51.760 |
some grounding, and that's where, like, access to external knowledge bases comes in. 00:10:58.280 |
Also, we don't yet have a magical multimodal AI across all the modalities, 00:11:04.040 |
so often you have to kind of chain multiple types of models. 00:11:06.760 |
And, of course, you have all these external tools and external actions which end-to-end 00:11:11.820 |
applications might want to take in agentic form. 00:11:16.420 |
So the term which I really like, popularized by Databricks, is 00:11:20.800 |
compound AI system: we're basically increasingly seeing the transition from just the model being 00:11:25.860 |
the product to this combination of maybe RAG and function calling and external 00:11:29.940 |
tools, et cetera, built together as the product. 00:11:32.980 |
And that's pretty much the direction in which we see this field moving over time. 00:11:39.700 |
So what does that mean from our perspective, in terms of what we do in this case? 00:11:45.160 |
We see a function calling agent at the core of this emerging architecture, 00:11:51.700 |
which might be connected to domain-specialized models served on our platform directly, maybe 00:11:58.920 |
tuned for different needs, and connected to external tools. 00:12:01.640 |
Maybe it's a code interpreter, or maybe it's external APIs somewhere, 00:12:05.640 |
with this kind of central agentic model coordinating 00:12:12.920 |
and trying to triage the user requirements if it's, for example, a chatbot or something. 00:12:17.640 |
You've probably all heard about function calling, popularized by OpenAI initially. 00:12:27.360 |
Function calling is really about how to connect an LLM to external tools and external APIs. 00:12:36.360 |
So we actually focus on fine-tuning models specifically for function calling, and we release 00:12:43.080 |
a series of models like that; the latest one, FireFunction V2, was released two weeks ago. 00:12:47.960 |
And what you can do with that is -- if I manage to click on this button. 00:12:58.520 |
What it means is you can build applications which combine free-form general chat with function calling. 00:13:08.200 |
So in this case, FireFunction has some chat capabilities. 00:13:13.240 |
You can see you can ask it: what can you do? 00:13:15.720 |
And it has some self-reflection to tell you what it can do. 00:13:18.600 |
It's also connected in this demo app to a bunch of external tools. 00:13:26.120 |
It can plot some charts and call those external APIs. 00:13:33.800 |
But what it really needs to figure out is how to translate a user query into potentially complex function calls. 00:13:40.520 |
So, for example, if we ask it to generate a bar chart with the stocks of the top three 00:13:47.320 |
cloud providers, like, the big three, it actually needs to do several steps, right? 00:13:51.480 |
It needs to understand that top three cloud providers means, you know, AWS, GCP, and 00:13:57.240 |
Azure, right, and that Azure is under Microsoft. 00:13:59.880 |
It then needs to go make function calls querying their stock prices. 00:14:04.600 |
And finally, it needs to combine that information and send it to the chart plotting API, 00:14:08.920 |
which is what just happened in the background. 00:14:11.880 |
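Here is a minimal sketch of the tool-calling loop being described, assuming the OpenAI-compatible tools interface; the tool names and schemas (get_stock_price, plot_bar_chart) are made up for illustration, and the FireFunction model id is an assumption.

```python
# Sketch of the multi-step tool-calling flow described above.
# Tool names and schemas are hypothetical; the model id is assumed.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1",
                api_key="YOUR_FIREWORKS_API_KEY")

tools = [
    {"type": "function", "function": {
        "name": "get_stock_price",
        "description": "Get the latest stock price for a ticker symbol",
        "parameters": {"type": "object",
                       "properties": {"ticker": {"type": "string"}},
                       "required": ["ticker"]}}},
    {"type": "function", "function": {
        "name": "plot_bar_chart",
        "description": "Plot a bar chart from labeled values",
        "parameters": {"type": "object",
                       "properties": {"labels": {"type": "array", "items": {"type": "string"}},
                                      "values": {"type": "array", "items": {"type": "number"}}},
                       "required": ["labels", "values"]}}},
]

messages = [{"role": "user",
             "content": "Generate a bar chart of the stock prices of the top three cloud providers."}]

resp = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",  # assumed model id
    messages=messages,
    tools=tools,
)

# The model is expected to resolve "top three cloud providers" to tickers,
# emit get_stock_price calls, and, once results are fed back, a plot_bar_chart call.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```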
Another important aspect you need for efficient function calling chat 00:14:18.200 |
capabilities is contextual awareness. 00:14:20.440 |
So if I ask it to add Oracle to this graph, it needs to understand 00:14:25.640 |
what I'm referring to, still keep the previous context, and regenerate the image. 00:14:29.880 |
And finally, if I switch to a different topic, it needs to drop the previous context 00:14:35.560 |
and understand that, hey, this historical context is less important. 00:14:39.960 |
It should start from scratch, so there is no Oracle in that chart or whatever. 00:14:43.880 |
So, you know, this particular demo is actually open source. 00:14:48.360 |
You can, like, go to our GitHub and try it out. 00:14:50.840 |
It's built with FireFunction and a few other models. 00:14:57.240 |
The model itself for function calling is actually open source. 00:15:02.760 |
I mean, you can, of course, call it on Fireworks for optimal speed, but you can also run it locally if you want. 00:15:09.560 |
It uses a bunch of, you know, functionality on our platform. 00:15:13.000 |
For example, structured generation with JSON mode and grammar mode, which I think is 00:15:19.000 |
similar to some of the previous talks from the Outlines folks, who were talking here yesterday. 00:15:26.680 |
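A small sketch of what JSON-mode structured generation looks like from the client side; the response_format type follows the OpenAI-compatible interface, while the attached schema field is an assumption about how a JSON/grammar mode might be exposed, so treat the whole thing as illustrative.

```python
# Sketch: constraining output to JSON via the OpenAI-compatible response_format
# option. The "schema" field is an assumption about constrained decoding support,
# not a confirmed parameter; model id and key are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1",
                api_key="YOUR_FIREWORKS_API_KEY")

resp = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",  # assumed model id
    messages=[{"role": "user",
               "content": "Extract the ticker and price from: 'MSFT closed at 420.72'"}],
    response_format={
        "type": "json_object",
        # assumed schema extension for constrained decoding
        "schema": {"type": "object",
                   "properties": {"ticker": {"type": "string"},
                                  "price": {"type": "number"}},
                   "required": ["ticker", "price"]},
    },
)
print(resp.choices[0].message.content)  # e.g. {"ticker": "MSFT", "price": 420.72}
```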
And generally, how do you get started with Fireworks? 00:15:30.200 |
If you head out to fireworks.ai/models, you'll find a lot of open source models. 00:15:39.880 |
In terms of product offering, we have a range which can take you from early prototyping to production scale. 00:15:47.400 |
You can start with serverless inference, which is not different from 00:15:51.720 |
using the OpenAI playground or API, where you pay per token. 00:15:57.400 |
You don't need to worry about, like, hardware settings or anything. 00:16:00.440 |
As I mentioned, you can still do fine-tuning. 00:16:02.440 |
So you can do hosted fine-tuning on our platform. 00:16:04.920 |
You can bring your own LoRA adapter and still serve it serverless. 00:16:08.120 |
As you graduate, maybe as a startup, to more production 00:16:12.680 |
scale, you might want to go to on-demand, where it's more like dedicated hardware with more 00:16:17.560 |
settings and modifications for your use case. 00:16:20.200 |
You can bring your own custom model fine-tuned from scratch or do it on our platform. 00:16:24.920 |
And finally, if you scale up to a bigger volume, you might want to go to the enterprise 00:16:31.320 |
level, with discounted long-term contracts, and we will also help you 00:16:36.600 |
personalize the hardware setup and do some of that tuning for performance which I talked about earlier. 00:16:42.440 |
And in terms of use cases, we're running production for many, many companies, 00:16:48.360 |
ranging from small startups to big enterprises. 00:16:50.760 |
We're serving, last time I checked, more than 150 billion tokens per day. 00:16:54.680 |
So, you know, companies like Quora built chatbots like Poe. 00:16:58.760 |
Sourcegraph and Cursor, and I think Cursor had a talk here yesterday, 00:17:03.560 |
use us for some of the code assistant functionality, 00:17:06.040 |
where latency is really important, as you can imagine. 00:17:07.720 |
And folks like Upstage and Liner are building different kinds of products as well. 00:17:19.320 |
Finally, we care a lot about developers, you guys. 00:17:24.040 |
So, actually, these are external numbers from last year's LangChain State of AI report, 00:17:29.240 |
where it turns out we are, after Hugging Face, one of the most popular platforms 00:17:34.280 |
that people pull models from, which is great. 00:17:38.600 |
And, again, for getting started, just head over to our website. 00:17:44.520 |
You can go play in the playground right away. 00:17:47.880 |
So, for example, you can run Llama or Gemma or whatever at top speeds. 00:17:55.720 |
I'm really excited to see what you can build with open models, or FireFunction, or stuff 00:18:00.280 |
which you can fine-tune on your own. 00:18:07.080 |
You can still use your favorite tools and the same clients, or you can use the frameworks you already know. 00:18:17.000 |
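For reference, "the same clients" can be as simple as pointing the standard OpenAI client at the Fireworks endpoint; here is a minimal streaming sketch, with the model id and API key as placeholders.

```python
# Minimal getting-started sketch: the standard OpenAI client pointed at an
# OpenAI-compatible Fireworks endpoint, streaming tokens as they arrive.
from openai import OpenAI

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1",  # assumed endpoint
                api_key="YOUR_FIREWORKS_API_KEY")

stream = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3-8b-instruct",  # assumed model id
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```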
Really excited to be here and tell you a little bit about open source 00:18:25.320 |
models and how we at Fireworks focus on productionizing them and scaling them up. 00:18:31.480 |
And you can also find us at our booth at the Expo.