
Making Open Models 10x faster and better for Modern Application Innovation: Dmytro (Dima) Dzhulgakov



00:00:00.640 | Hello, everyone.
00:00:14.920 | So, my name is Dima.
00:00:17.040 | As mentioned, unfortunately, my co-founder, Lin, who was on the schedule, couldn't make
00:00:21.960 | it today because of some personal emergency.
00:00:24.320 | So you got me.
00:00:25.840 | And as you saw, we don't yet have AI to figure out video projection, but we have AI for a
00:00:31.960 | lot of other things.
00:00:32.960 | So today I'm going to talk about Fireworks AI and generally I'm going to continue the theme
00:00:36.900 | which Katrin started about open models, and how we focus on productionization
00:00:43.080 | and customization of open-source models for inference at Fireworks.
00:00:48.600 | But first, as an introduction, what's our background?
00:00:52.140 | So the founding team of Fireworks comes from PyTorch leads at Meta and some veterans from
00:00:58.300 | Google AI.
00:00:59.640 | So combined, we have, like, probably a decade of experience in productionizing AI in some
00:01:03.640 | of the biggest companies in the world.
00:01:06.300 | And I personally have been a core maintainer of PyTorch for the past five years.
00:01:11.520 | So the topic of open source is really close to my heart.
00:01:15.980 | And since we kind of led this revolution of the open-source toolchain for deep learning through
00:01:20.500 | our work on PyTorch and some of the Google technologies, we really believe that open
00:01:25.680 | source models are also the future for GenAI applications, and our focus at Fireworks
00:01:33.420 | is precisely on that.
00:01:36.820 | So I mean, how many people in the audience actually, like, use GPT and deploy it in production?
00:01:45.420 | And how many folks use open models in production?
00:01:50.320 | Oh, okay.
00:01:51.640 | So I was about to convince you that the share of open-source models is going to grow over time,
00:01:56.300 | but it looks like in this audience it's already sizable. Nevertheless.
00:02:02.020 | So why basically this trade-off, why go big or why go small?
00:02:07.700 | Currently, the bulk of production inference is still based on proprietary models.
00:02:12.080 | And those are really good models, often frontier in many domains.
00:02:17.740 | However, the catch is that it's one model which is good at many, many things.
00:02:22.520 | And it's often served in the same way, regardless of the use case, which means that whether
00:02:27.560 | you have batch inference on some narrow domain or you have some super real-time use case where
00:02:33.400 | you need to do, like, voice assistants or something, those are often served
00:02:38.640 | from the same infrastructure without customization.
00:02:41.160 | In terms of model capabilities, it also means, yeah, GPT-4 is great or Claude is great
00:02:46.580 | and can handle a lot of things, but you are often paying a lot for additional capabilities
00:02:51.320 | which are not needed in your particular use case.
00:02:53.480 | You don't really need a customer support chatbot to know about 150 Pokémon or be able to write
00:02:59.700 | you poetry, but you really want it to be really good in your particular narrow domain.
00:03:06.380 | So this kind of discrepancy for large models leads to several issues.
00:03:11.040 | One, as I mentioned, is high latency, because using a big model means longer response times,
00:03:17.040 | which is particularly important for real-time use cases, like, voice systems.
00:03:22.140 | It gets more and more important, because for something like an agentic
00:03:29.420 | application, you need to do a lot of steps, do
00:03:34.580 | reasoning, and call the model many times.
00:03:36.160 | So latency is really, really important.
00:03:38.220 | And often you see that you can pick smaller models, like Llama or Gemma, which we just
00:03:43.780 | heard about, and achieve, for the narrow domain, the same or better quality while being
00:03:49.440 | up to 10 times faster.
00:03:52.260 | For example, for some of the function calling use cases, on the external benchmark from Berkeley,
00:03:57.380 | you get similar performance from a fine-tuned Llama 3 at 10x the speed.
00:04:03.440 | Cost is also an issue if you're running a big model on a lot of traffic.
00:04:10.000 | Even if you have, say, a 5K-token prompt and 10,000 users, and each of them calls
00:04:15.920 | the LLM 20 times per day, on GPT-4, even on GPT-4o, it probably adds up to, like, 10K per
00:04:21.500 | day, or something like several million per year, which is a sizable
00:04:26.500 | cost for a startup.
00:04:27.500 | You can easily cut that with much smaller models, and that we often see as a kind of motivation
00:04:33.960 | for reaching out for smaller and more customizable models.
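As a rough sanity check on that back-of-envelope math, here is a minimal sketch; the per-token prices and completion length are illustrative assumptions, not quoted rates.

```python
# Back-of-envelope cost estimate for the scenario above.
# All prices and the average completion length are illustrative assumptions.

users = 10_000
calls_per_user_per_day = 20
prompt_tokens = 5_000          # ~5K-token prompt per call
output_tokens = 500            # assumed average completion length

daily_input_tokens = users * calls_per_user_per_day * prompt_tokens    # 1e9
daily_output_tokens = users * calls_per_user_per_day * output_tokens   # 1e8

# Assumed prices per 1M tokens (illustrative):
big_model = {"input": 5.00, "output": 15.00}      # large proprietary model
small_model = {"input": 0.20, "output": 0.20}     # small open model

def daily_cost(price):
    return (daily_input_tokens / 1e6) * price["input"] + \
           (daily_output_tokens / 1e6) * price["output"]

for name, price in [("big proprietary model", big_model),
                    ("small open model", small_model)]:
    cost = daily_cost(price)
    print(f"{name}: ~${cost:,.0f}/day, ~${cost * 365 / 1e6:.1f}M/year")
```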
00:04:39.180 | But where open models really shine is domain adaptability.
00:04:44.180 | And that comes in two aspects.
00:04:46.160 | First, there are so many different fine-tunes and customizations.
00:04:49.760 | I think Katrin was mentioning, you know, adaptations for German or Indian languages, and, like,
00:04:54.980 | there are models specialized for code or for medicine.
00:04:57.480 | If you head to Hugging Face, there are, like, tens of thousands of different model variants.
00:05:02.180 | And because the weights are open, you can always customize to your particular use case.
00:05:06.040 | And tune quality specifically for what you need.
00:05:11.740 | So open-source models are great.
00:05:13.140 | So what are the challenges?
00:05:14.140 | The challenges really come from three areas.
00:05:17.000 | First, what we usually see when people try to use, you know, an open model, something like
00:05:21.420 | Gemma or whatever, or Llama maybe, is you run into complicated setup and maintenance, right?
00:05:28.120 | You need to go and find GPUs somewhere.
00:05:30.120 | You need to figure out which frameworks to run on those.
00:05:33.680 | You need to, like, download your models, maybe do some performance optimization tuning, and you
00:05:37.740 | kind of have to repeat this process end-to-end every time the model gets updated or a new version
00:05:42.420 | is released, et cetera.
00:05:44.860 | On optimization itself, especially for LLMs, but generally for GenAI models, there
00:05:49.940 | are many attributes and settings which are really dependent on your use case and requirements.
00:05:56.880 | Somebody needs low latency, somebody needs high throughput, prompts can be short, prompts
00:06:00.680 | can be long, et cetera.
00:06:01.880 | And choosing the optimal settings across the stack is actually not trivial.
00:06:06.300 | And as I'll show you later, in many cases, you can get multiple-x improvements from
00:06:10.940 | doing this efficiently.
00:06:12.880 | And finally, like, just getting the production ready is actually hard.
00:06:16.420 | As you kind of go from experimentation to production, even just babysitting GPUs on public clouds is
00:06:23.060 | not that easy, because GPUs are finicky and not always reliable.
00:06:26.880 | But getting to enterprise scale requires, you know, all the scalability technology, telemetry,
00:06:31.680 | observability, et cetera.
00:06:33.400 | So those are things which we focus on solving at Fireworks.
00:06:38.100 | So starting with efficiency, we built our own custom serving stack, which we believe is one
00:06:43.680 | of the fastest, if not the fastest.
00:06:46.400 | We did it from the ground up, meaning from writing our own, you know, CUDA kernels,
00:06:51.440 | all the way to customizing how things get deployed and orchestrated at the service level.
00:06:57.520 | And that brings multiple optimizations.
00:06:59.400 | But most importantly, we really focus on customizing the serving stack to your needs, which basically
00:07:04.780 | means for your custom workload and for your custom cost and latency requirements, we can tune
00:07:11.360 | it for those settings.
00:07:14.240 | What does it mean in practice?
00:07:15.240 | And what does customization mean in practice?
00:07:17.700 | For example, many use cases use RAG and very long prompts.
00:07:22.260 | So there are many settings you can actually tune at the runtime level and the deployment level to
00:07:26.620 | optimize for long prompts, which are often repeatable.
00:07:30.220 | So caching is useful or just tuning settings so the throughput is higher while maintaining latency.
00:07:35.580 | So this is independently benchmarkable.
00:07:37.680 | If you go to, you know, Artificial Analysis and select long prompts, Fireworks actually
00:07:42.280 | is the fastest, even faster than some of the other providers who are over there at the expo booth.
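To make the long-prompt point concrete, here is a minimal sketch of structuring requests so that the long, repeated prefix (system prompt plus retrieved context) stays identical across calls, which is what lets prefix caching help. The endpoint, model id, and caching behavior are assumptions for illustration.

```python
# Minimal sketch: keep the long, repeated part of the prompt identical across
# requests so a prefix cache on the serving side can reuse it.
# Endpoint and model id are assumptions for illustration.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # OpenAI-compatible endpoint
    api_key="FIREWORKS_API_KEY",
)

# Long, stable prefix: system prompt + RAG context shared by many requests.
SHARED_PREFIX = [
    {"role": "system", "content": "You answer questions about the ACME product docs."},
    {"role": "user", "content": "Context:\n" + "...long retrieved documents..."},
]

def answer(question: str) -> str:
    resp = client.chat.completions.create(
        model="accounts/fireworks/models/llama-v3-70b-instruct",  # example model id
        messages=SHARED_PREFIX + [{"role": "user", "content": question}],
        max_tokens=256,
    )
    return resp.choices[0].message.content

print(answer("How do I rotate an API key?"))
```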
00:07:48.580 | And we don't only focus on LLM inference.
00:07:51.800 | We focus on many modalities.
00:07:54.040 | As an example, for image generation, we are the fastest provider serving SDXL.
00:07:58.160 | We are also the only provider serving SD3, Stability's new model, because their API actually
00:08:04.800 | routes to our servers.
00:08:08.460 | And finally, as I mentioned, especially for LLMs, customization matters a lot.
00:08:14.660 | One paradigm of how to think about the performance of LLMs that is often
00:08:18.400 | useful is to think about minimizing cost under a particular
00:08:22.400 | latency constraint.
00:08:23.400 | We often have customers come and say, like, hey, I have this interactive
00:08:27.240 | application.
00:08:28.240 | I need to generate that many tokens under two seconds.
00:08:31.420 | And that's really where cross-stack optimizations shine.
00:08:36.000 | Where by tuning to a particular latency cutoff and changing many settings, you can deliver
00:08:42.100 | much higher throughput, multiple times higher throughput, which basically means fewer
00:08:46.380 | GPUs and lower cost.
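As a toy illustration of "minimize cost under a latency constraint", here is a sketch that picks the cheapest serving configuration meeting a latency budget. The throughput, latency, and price numbers are made up for the example.

```python
# Toy illustration of "minimize cost under a latency constraint":
# pick the cheapest serving configuration whose latency meets the budget.
# All numbers are made up for the example.

configs = [
    # (name, latency in seconds for the target request, requests/sec per GPU, $/GPU-hour)
    ("throughput-tuned, large batch", 3.5, 4.0, 3.0),
    ("balanced",                      1.9, 2.5, 3.0),
    ("latency-tuned, small batch",    1.2, 1.0, 3.0),
]

latency_budget_s = 2.0          # "that many tokens under two seconds"
peak_requests_per_sec = 50.0    # workload requirement

feasible = [c for c in configs if c[1] <= latency_budget_s]
best = min(
    feasible,
    key=lambda c: (peak_requests_per_sec / c[2]) * c[3],  # GPUs needed * $/GPU-hour
)
gpus = peak_requests_per_sec / best[2]
print(f"cheapest feasible config: {best[0]}, ~{gpus:.0f} GPUs, ~${gpus * best[3]:.0f}/hour")
```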
00:08:51.540 | In terms of model support, we support the best-quality open-source models.
00:08:56.600 | You know, we heard about Gemma now, obviously Llamas, some of the ASR and text-to-speech models, pretty
00:09:03.920 | much from many providers.
00:09:05.560 | We also work with model developers; for example, a model that launched
00:09:12.560 | last week is also served on Fireworks.
00:09:15.940 | And in terms of platform capabilities, as I mentioned, we have a lot of open-source models
00:09:21.360 | to get you started or customized ones.
00:09:24.660 | We do some of the fine-tuning of those models in-house.
00:09:27.740 | So I'm going to talk a little bit about function-calling-specialized models later on, and we tune
00:09:32.840 | some of the vision-language models ourselves, which we release as well.
00:09:38.480 | And of course, the key for open model development is that you can tune for a particular
00:09:44.480 | use case.
00:09:45.480 | So we do provide a platform for fine-tuning, whether you are bringing your dataset collected
00:09:50.480 | elsewhere or collecting it live from feedback while serving on our platform.
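As a rough sketch of what preparing such a fine-tuning dataset might look like, assuming a chat-style JSONL format; the exact schema a given platform expects may differ.

```python
# Rough sketch of preparing a chat-style fine-tuning dataset as JSONL.
# The exact schema a given platform expects may differ; this is illustrative.
import json

examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a support agent for ACME billing."},
            {"role": "user", "content": "Why was I charged twice this month?"},
            {"role": "assistant", "content": "It looks like a plan change mid-cycle; here's how to check..."},
        ]
    },
    # ... thousands more examples collected from logs or labeled by hand ...
]

with open("finetune_dataset.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```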
00:09:55.480 | Specifically on customization, one interesting question, which a lot of people starting to
00:10:02.120 | experiment with models run into, is if you fine-tune and deploy the resulting
00:10:07.000 | model, how do you serve it efficiently?
00:10:09.120 | It turns out if you do LoRA fine-tuning, which a lot of folks do, you can do smart tricks and
00:10:16.000 | deploy multiple LoRA models on the same GPU.
00:10:18.120 | Actually, thousands of them, which means we can still give you serverless inference with
00:10:23.760 | pay-per-token pricing, even if you have, like, thousands of model variants sitting deployed there
00:10:28.640 | without having to pay any fixed cost.
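From the caller's point of view, a hedged sketch of what that looks like: each fine-tuned LoRA variant gets its own model id, while requests still go through the same serverless, pay-per-token API. The account and model names here are hypothetical placeholders.

```python
# Sketch: many LoRA fine-tunes of the same base model, each addressed by its own
# model id, all served from shared (multi-LoRA) serverless capacity.
# The account/model names are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="FIREWORKS_API_KEY",
)

lora_variants = [
    "accounts/my-account/models/support-bot-lora-v1",   # hypothetical
    "accounts/my-account/models/support-bot-lora-v2",   # hypothetical
]

for model_id in lora_variants:
    resp = client.chat.completions.create(
        model=model_id,  # routes to a LoRA adapter on top of the shared base model
        messages=[{"role": "user", "content": "Summarize this ticket in one sentence."}],
        max_tokens=64,
    )
    print(model_id, "->", resp.choices[0].message.content)
```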
00:10:33.400 | Of course, a single model is all great.
00:10:35.760 | But what we see increasingly more and more in applications is that the model is not a product, right,
00:10:42.400 | by itself.
00:10:43.400 | You need a kind of bigger system in order to solve the target application.
00:10:47.720 | And the reason for that is because models by themselves tend to hallucinate, so you need
00:10:51.760 | some grounding, and that's where, like, access to external knowledge bases comes in.
00:10:58.280 | Also we don't yet have, you know, an industry-wide magical multimodal AI across all the modalities,
00:11:04.040 | so often you have to kind of chain multiple types of models.
00:11:06.760 | And, of course, you have all these, like, external tools and external actions which
00:11:11.820 | end-to-end applications might want to do in agentic form.
00:11:16.420 | So I think the term which I really like, which is, like, popularized by Databricks is, like,
00:11:20.800 | compound AI system; we're basically increasingly seeing the transition from just the model being
00:11:25.860 | the product to kind of this combination of maybe, like, RAG and function calling and external
00:11:29.940 | tools, et cetera, built together as the product.
00:11:32.980 | And that's pretty much the direction in which we see this field moving over time.
00:11:39.700 | So what does that mean from our perspective, in terms of what we do in this case?
00:11:45.160 | So we see a function-calling agent at the core of this emerging architecture,
00:11:51.700 | which might be connected to either domain-specialized models served on our platform directly or maybe
00:11:58.920 | tuned for different needs and connected to external tools.
00:12:01.640 | So maybe it's a code interpreter, or maybe it's, like, external APIs somewhere,
00:12:05.640 | with really this kind of central agentic view, a central model coordinating
00:12:12.920 | and trying to triage the user requirements, if it's, for example, a chatbot or something.
00:12:17.640 | You've probably all heard about, like, function calling, you know, popularized by OpenAI initially.
00:12:24.360 | That's basically the same idea.
00:12:26.360 | So, yeah.
00:12:27.360 | Function calling is really about how to connect the LLM to external tools and external
00:12:34.360 | elements.
00:12:35.360 | What does it mean in practice?
00:12:36.360 | So we actually focus on fine-tuning models specifically for function calling, so we release
00:12:43.080 | a series of models like that; the latest one, FireFunction V2, was released two weeks ago.
00:12:47.960 | And what you can do with that is -- if I manage to click on this button.
00:12:58.520 | What it means is, like, you can build applications which kind of combine free-form general chat
00:13:05.320 | capabilities with function calling.
00:13:08.200 | So in this case, you know, this FireFunction model has some chat capabilities.
00:13:13.240 | So you can see you can, like, ask it: what can you do?
00:13:15.720 | And it has, like, some self-reflection to tell you what it can do.
00:13:18.600 | It's also connected in this demo app to a bunch of external tools.
00:13:22.440 | So it can query, like, stock quotes.
00:13:26.120 | It can plot some charts, all those, like, external APIs.
00:13:29.000 | It can also generate images.
00:13:33.800 | But what it really needs to figure out is how to do complex
00:13:38.360 | reasoning and translate the user query into function calls.
00:13:40.520 | So, for example, if we ask it to generate a bar chart with the stocks of the top three
00:13:47.320 | cloud providers, like, the big three, it actually needs to do several steps, right?
00:13:51.480 | It needs to understand that, like, top three cloud providers means, you know, AWS, GCP, and
00:13:57.240 | Azure, right, and Azure is under Microsoft.
00:13:59.880 | It needs to then go do function calls querying their stock prices.
00:14:04.600 | And finally, it needs to combine that information and send it to the chart plotting API,
00:14:08.920 | which is what just happened in the background.
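A hedged sketch of the first step of that flow through an OpenAI-compatible client: declare a stock-quote tool, let the model decide to call it, and get structured arguments back. The endpoint, model id, and the tool schema are illustrative assumptions, not the demo's actual code.

```python
# Sketch of the function-calling flow: the model translates a user request into
# structured tool calls instead of free-form text.
# Endpoint, model id, and the tool schema are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="FIREWORKS_API_KEY",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_quote",
        "description": "Get the latest stock price for a ticker symbol",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

resp = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",  # assumed model id
    messages=[{"role": "user", "content": "Plot a bar chart of the top three cloud providers' stock prices."}],
    tools=tools,
)

# The model should answer with tool calls (e.g. for AMZN, GOOG, MSFT) rather than text;
# the application executes them and feeds the results back for the next step.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```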
00:14:11.880 | Another important thing you need for, like, efficient function-calling chat
00:14:18.200 | capabilities is contextual awareness.
00:14:20.440 | So if I ask it to add Oracle to this graph, it needs to understand
00:14:25.640 | what I'm referring to and, like, still keep the previous context and regenerate the image.
00:14:29.880 | And finally, you know, if I switch to a different topic, it kind of needs to drop the previous context
00:14:35.560 | and understand that, like, hey, this historical context is less important.
00:14:39.960 | I'm going to start from scratch, so there is no, like, Oracle in that chat or whatever.
00:14:43.880 | So, you know, this particular demo is actually open source.
00:14:48.360 | You can, like, go to our GitHub and try it out.
00:14:50.840 | It's built with FireFunction and, like, a few other models, including, like,
00:14:55.320 | SDXL, which are run on our platform.
00:14:57.240 | The model itself for function calling is actually open source.
00:15:01.400 | It's on Hugging Face.
00:15:02.760 | I mean, you can, of course, call it on Fireworks for optimal speed, but you can also run it locally
00:15:08.600 | if you want.
00:15:09.560 | It uses a bunch of, you know, functionality on our platform.
00:15:13.000 | For example, structured generation with, like, JSON mode and grammar mode, which I think is
00:15:19.000 | similar to some of the previous talks from, like, the Outlines folks, who were talking here yesterday.
00:15:24.040 | Yeah.
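For the structured-generation piece, a minimal sketch assuming a generic OpenAI-style JSON mode; the exact option names, and schema or grammar support, may differ per provider.

```python
# Minimal sketch of structured generation via a JSON response format.
# The exact option names (and schema/grammar support) may differ per provider;
# this follows the generic OpenAI-style "json_object" mode as an assumption.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="FIREWORKS_API_KEY",
)

resp = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",  # assumed model id
    messages=[
        {"role": "system", "content": "Reply only with JSON: {\"ticker\": string, \"action\": \"buy\"|\"hold\"|\"sell\"}"},
        {"role": "user", "content": "Oracle just beat earnings expectations."},
    ],
    response_format={"type": "json_object"},  # constrains output to valid JSON
)
print(resp.choices[0].message.content)
```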
00:15:25.480 | So, finally, try it out.
00:15:26.680 | And generally, how do you get started with Fireworks?
00:15:30.200 | So, if you head out to the Fireworks models page, you'll find a lot of the open-source,
00:15:36.760 | open base models which I mentioned.
00:15:38.280 | They're available in the playground.
00:15:39.880 | In terms of product offering, we have this kind of range which can take you from early prototyping
00:15:45.800 | all the way to enterprise scale.
00:15:47.400 | So, you can start with serverless inference, which is, you know, not different from
00:15:51.720 | going to the OpenAI playground or something, where you pay per token.
00:15:55.560 | It's a simple per-token price.
00:15:57.400 | You don't need to worry about, like, hardware settings or anything.
00:16:00.440 | As I mentioned, you can still do fine-tuning.
00:16:02.440 | So, you can do hosted fine-tuning on our platform.
00:16:04.920 | You can bring your own LoRA adapter and still serve it serverless.
00:16:08.120 | As you kind of graduate, like, maybe as a startup, to a more production
00:16:12.680 | scale, you might want to go to on-demand, where it's more like dedicated hardware with more
00:16:17.560 | settings and modifications for your use case.
00:16:20.200 | You can bring your own custom model fine-tuned from scratch or do it on our platform.
00:16:24.920 | And finally, if you scale up to a bigger volume and want to go to the enterprise
00:16:31.320 | level, where there are kind of discounted long-term contracts, we also will help you
00:16:36.600 | personalize the hardware setup and do some of that tuning for performance which I talked
00:16:41.800 | about earlier.
00:16:42.440 | And in terms of use cases, I mean, we're running production for many, many companies,
00:16:48.360 | ranging from small startups to big enterprises.
00:16:50.760 | We're serving, like -- last time I checked, like, more than 150 billion tokens per day.
00:16:54.680 | So, you know, companies like Quora built chatbots like Poe.
00:16:58.760 | Sourcegraph and Cursor -- I think Cursor had a talk here yesterday.
00:17:03.560 | They use us for, like, some of the code assistant functionality.
00:17:06.040 | And there, like, latency is really important.
00:17:07.720 | As you can imagine. You know, folks like Upstage and Liner are building, like, different
00:17:13.000 | assistants and agents on top of that.
00:17:15.400 | So, we are definitely production ready.
00:17:18.040 | Go try it out.
00:17:19.320 | Finally, we care a lot about developers, you guys.
00:17:24.040 | So, actually, these are external numbers from, like, last year's LangChain State of AI report,
00:17:29.240 | where it turns out we are, like, after Hugging Face, the most popular platform
00:17:34.280 | from which people pull models, which is great.
00:17:35.960 | It was very nice to hear.
00:17:38.600 | And, again, for getting started, just, you know, head out to our website.
00:17:44.520 | You can go play in the playground right away.
00:17:47.880 | So, for example, you can run, you know, Llama or Gemma or whatever at top speeds.
00:17:52.600 | And kind of start building from there.
00:17:55.720 | I'm really excited to see what you can build with open models or FireFunction or some stuff
00:18:00.280 | which you can fine-tune on your own.
00:18:03.320 | And, yeah, last point.
00:18:05.080 | We are, as I mentioned, OpenAI API compatible.
00:18:07.080 | So you can still use, you know, your favorite tools, the same clients, or you can use frameworks
00:18:13.880 | like LangChain or LlamaIndex, et cetera.
00:18:16.600 | So, yeah.
00:18:17.000 | Really excited to be here and tell you a little bit about open-source
00:18:25.320 | models and how we at Fireworks focus on productionizing that and scaling it up.
00:18:30.760 | Go try it out.
00:18:31.480 | And you can also find us at the booth at the Expo.
00:18:34.040 | Thank you.
00:18:37.880 | Thank you.