
Customized, production-ready inference with open source models: Dmytro (Dima) Dzhulgakov



00:00:00.640 | Hello, everyone.
00:00:14.880 | So, my name is Dima.
00:00:17.040 | As mentioned, unfortunately, my co-founder, Lin, who was on the schedule, couldn't make
00:00:21.960 | it today because of some personal emergency.
00:00:24.320 | So you got me.
00:00:25.840 | And as you saw, we don't yet have AI to figure out video projection, but we have AI for a
00:00:31.940 | lot of other things.
00:00:32.940 | So today I'm going to talk about Fireworks AI and generally I'm going to continue the theme
00:00:36.920 | which Katrin started about open models and how we basically focus on productionization
00:00:43.060 | and customization of open source models for inference at Fireworks.
00:00:48.600 | But first, as an introduction, what's our background?
00:00:52.140 | So the founding team of Fireworks comes from PyTorch leads at Meta and some veterans from
00:00:58.280 | Google AI.
00:00:59.640 | So combined, we have probably a decade of experience in productionizing AI in some
00:01:03.660 | of the biggest companies in the world.
00:01:06.300 | And I personally have been a core maintainer of PyTorch for the past five years.
00:01:11.520 | So the topic of open source is really close to my heart.
00:01:15.980 | And since we kind of led this revolution of the open source toolchain for deep learning through
00:01:20.620 | our work on PyTorch and some of the Google technologies, we really believe that open source models are
00:01:27.620 | the future for gen AI applications as well.
00:01:31.580 | And our focus at Fireworks is precisely on that.
00:01:36.800 | So I mean, how many people in the audience actually use GPT and deploy it in production?
00:01:45.420 | And how many folks use open models in production?
00:01:50.460 | Oh, okay.
00:01:51.640 | So I was about to convince you that the share of open source models is going to grow over time,
00:01:56.360 | but it looks like in this audience it's already sizable.
00:01:59.820 | But nevertheless, why this tradeoff, why go big or why go small?
00:02:07.640 | Currently, the bulk of production inference is still based on proprietary models.
00:02:12.040 | Those are really good models, and often frontier in many domains.
00:02:17.700 | However, the catch is that it's one model which is good at many, many things,
00:02:22.500 | and it's often served in the same way regardless of the use case.
00:02:26.220 | Which means that whether you have batch inference on some narrow domain or you have
00:02:30.980 | some super real-time use case where you need to do, like, a voice assistant
00:02:36.520 | or something, those are often served from the same infrastructure without customization.
00:02:41.140 | In terms of model capabilities, it also means, yes, GPT-4 is great or Claude is great
00:02:46.560 | and can handle a lot of things.
00:02:48.320 | But you're often paying a lot for additional capabilities which are not needed in your particular use case.
00:02:53.460 | You don't really need a customer support chatbot to know about 150 Pokémon or be able to write
00:02:59.640 | poetry.
00:03:00.640 | But you really want it to be really good in the particular narrow domain.
00:03:06.360 | So this kind of discrepancy for large models leads to several issues.
00:03:11.020 | One, as I mentioned, is high latency, because using a big model means longer response times, which
00:03:18.120 | is particularly important for real-time use cases like voice systems.
00:03:22.140 | It gets more and more important with agentic stuff, because for something like an
00:03:28.140 | agent-like application, you need to do a lot of steps of reasoning
00:03:34.380 | and call the model many times.
00:03:36.140 | So latency is really, really important.
00:03:38.220 | And often you see that you can pick smaller models like Llama or Gemma, which were just talked
00:03:44.400 | about, and achieve, for the narrow domain, the same or better quality while being, you know,
00:03:50.400 | up to 10 times faster.
00:03:52.240 | For example, for some of the function calling use cases, externally benchmarked by Berkeley,
00:03:57.400 | you get similar performance from a fine-tuned Llama 3 at 10x the speed.
00:04:03.520 | Cost is also an issue if you're running a big model on a lot of traffic. You know,
00:04:10.240 | if you have, say, a 5K-token prompt and 10,000 users, and each of them calls the
00:04:15.900 | LLM 20 times per day, on GPT-4, even on GPT-4o, it probably adds up to something like 10K dollars
00:04:21.420 | per day, or several million per year, which is a sizable cost for a startup.
00:04:27.580 | You can easily cut that with much smaller models, and that is often the motivation we see
00:04:33.980 | for reaching for smaller and more customizable models.
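To make that arithmetic concrete, here is a rough back-of-the-envelope sketch. The per-token prices and output length below are placeholder assumptions, not actual GPT-4 or GPT-4o pricing; plug in current provider numbers.

```python
# Back-of-the-envelope LLM cost estimate for the scenario above.
# Prices are placeholders in dollars per 1M tokens; check your provider's current pricing.
users = 10_000
calls_per_user_per_day = 20
prompt_tokens = 5_000          # per call, as in the example
output_tokens = 500            # assumed; not stated in the talk

price_in_per_million = 5.00    # hypothetical input price
price_out_per_million = 15.00  # hypothetical output price

calls_per_day = users * calls_per_user_per_day
cost_per_call = (prompt_tokens * price_in_per_million +
                 output_tokens * price_out_per_million) / 1_000_000
daily_cost = calls_per_day * cost_per_call

# With these placeholder prices: roughly $6,500/day, about $2.4M/year,
# the same ballpark as the "10K per day, several million per year" figure.
print(f"~${daily_cost:,.0f}/day, ~${daily_cost * 365 / 1e6:.1f}M/year")
```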
00:04:39.180 | But where open models really shine is domain adaptability, and that comes in two
00:04:45.360 | aspects.
00:04:46.360 | First, there are so many different fine-tunes and customizations.
00:04:49.440 | I think Katrin mentioned, you know, Gemma's Indian language adaptations.
00:04:54.780 | Like, there are models specialized for code or for medicine.
00:04:57.500 | There are, like, tens of thousands of different model variants, and because the
00:05:02.520 | weights are open, you can always customize to your particular use case and tune quality specifically
00:05:09.200 | for what you need.
00:05:11.740 | So open-source models are great.
00:05:13.160 | So what are the challenges?
00:05:14.500 | The challenges really come from three areas.
00:05:17.340 | What we see when people try to use, you know, an open model, something like Gemma or whatever
00:05:22.680 | it might be, is that you run into complicated setup and maintenance, right?
00:05:28.400 | You need to go and find GPUs somewhere.
00:05:30.240 | You need to figure out which frameworks to run on those.
00:05:33.680 | You need to, like, download your models, maybe do some performance optimization tuning, and you
00:05:37.740 | kind of have to repeat this process end-to-end every time the model gets updated or a new version
00:05:42.420 | is released, et cetera.
00:05:44.880 | On optimization itself, especially for LLMs, but generally for gen AI models, there
00:05:49.960 | are many attributes and settings which really depend on your use case and requirements.
00:05:56.860 | Somebody needs low latency, somebody needs high throughput, prompts can be short, prompts can
00:06:00.760 | be long, et cetera.
00:06:01.760 | And choosing the optimal settings across the stack is actually not trivial.
00:06:06.100 | And as I'll show you later, in many cases, you can get multiple-X improvements from
00:06:10.900 | doing this efficiently.
00:06:12.860 | And finally, just getting to production-ready is actually hard.
00:06:16.220 | As you kind of go from experimentation to production, even just babysitting GPUs on public clouds is
00:06:23.060 | not that easy, because GPUs are finicky and not always reliable.
00:06:26.860 | But getting to enterprise scale requires, you know, all the scalability technology, telemetry,
00:06:31.680 | observability, et cetera.
00:06:33.400 | So those are the things we focus on solving at Fireworks.
00:06:38.100 | So starting with efficiency, we built our own custom serving stack, which we believe is
00:06:43.560 | one of the fastest if not the fastest.
00:06:46.400 | We built it from the ground up, meaning from writing our own, you know, CUDA kernels,
00:06:51.440 | all the way to customizing how things get deployed and orchestrated at the service level.
00:06:57.520 | And that brings multiple optimizations.
00:06:59.400 | But most importantly, we really focus on customizing the serving stack to your needs, which basically
00:07:04.760 | means that for your custom workload and your custom cost and latency requirements, we
00:07:11.060 | can tune it for those settings.
00:07:14.220 | What does it mean in practice?
00:07:15.220 | And what does customization mean in practice?
00:07:17.680 | For example, many use cases use RAG and have very long prompts.
00:07:22.240 | So there are many settings you can tune, at the runtime level and the deployment level, to
00:07:26.600 | optimize for long prompts, which are often repeated, so caching is useful, or just tuning
00:07:32.640 | settings so the throughput is higher while maintaining latency.
00:07:35.580 | So this is independently benchmarkable.
00:07:37.680 | If you go to, you know, Artificial Analysis and select long prompts, Fireworks actually
00:07:42.280 | is the fastest, even faster than some of the other providers who are over there at the expo booths.
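As a minimal sketch of why caching helps for these workloads: RAG-style traffic typically repeats a long static prefix (system prompt plus retrieved documents) across many requests, with only a short user question changing at the end. The structure below is illustrative; whether and how the serving stack caches that prefix is a deployment-level setting.

```python
# Illustrative request pattern for long-prompt / RAG workloads:
# a long shared prefix followed by a short per-request suffix.
# Keeping the static part first and identical across requests is what
# makes prefix (prompt) caching effective on the serving side.
retrieved_docs = ["<doc 1 text>", "<doc 2 text>", "<doc 3 text>"]  # stand-in for retrieval results

LONG_SHARED_PREFIX = (
    "You are a support assistant. Answer only from the documents below.\n\n"
    + "\n\n".join(retrieved_docs)
)

def build_messages(question: str) -> list[dict]:
    return [
        {"role": "system", "content": LONG_SHARED_PREFIX},  # identical for every request
        {"role": "user", "content": question},              # the only part that changes
    ]

print(build_messages("How do I reset my password?")[1])
```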
00:07:48.580 | And we don't only focus on LLM inference; we focus on many modalities.
00:07:54.040 | As an example, for image generation, we are the fastest provider serving SDXL, and we're also
00:07:59.300 | the only provider serving SD3, Stability's new model, because their API actually routes
00:08:05.120 | to our servers.
00:08:08.480 | And finally, as I mentioned, especially for LLMs, customization
00:08:12.800 | matters a lot. One useful way to think about the performance of LLMs, for many
00:08:18.060 | use cases, is minimizing cost under a particular
00:08:22.400 | latency constraint.
00:08:23.400 | We often have customers come and say, hey, for my interactive
00:08:27.220 | application, I need to generate this many tokens in under two seconds.
00:08:31.060 | And that's really where cross-stack optimizations shine.
00:08:36.320 | And if you tune to a particular latency cutoff and change many settings, you can
00:08:41.020 | deliver much higher throughput, multiple times higher throughput, and higher throughput
00:08:45.580 | basically means fewer GPUs and lower cost.
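As a minimal sketch of that framing: first check whether a configuration meets the latency cutoff, then see how per-GPU throughput at that cutoff translates into GPU count. Every number below is a made-up placeholder; in practice they come from benchmarking your own deployment.

```python
import math

# "Minimize cost under a latency constraint" as a toy calculation.
latency_budget_s = 2.0        # e.g. "generate this many tokens in under two seconds"
output_tokens = 300

time_to_first_token_s = 0.3   # grows with prompt length and batch size (placeholder)
decode_tokens_per_s = 200.0   # per-request decode speed at the chosen batch size (placeholder)

request_latency_s = time_to_first_token_s + output_tokens / decode_tokens_per_s
assert request_latency_s <= latency_budget_s, "config misses the latency cutoff; retune settings"

# Higher throughput per GPU at the same latency cutoff means fewer GPUs for the same traffic.
target_requests_per_s = 50.0
throughput_per_gpu_rps = 4.0  # requests/s one GPU sustains at these settings (placeholder)
gpus_needed = math.ceil(target_requests_per_s / throughput_per_gpu_rps)

print(f"latency ~ {request_latency_s:.2f}s, GPUs needed ~ {gpus_needed}")
```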
00:08:51.500 | In terms of model support, we support the best quality open source models.
00:08:56.680 | You know, we heard about Gemma now; obviously the Llamas, some of the ASR and text-to-speech models,
00:09:03.800 | pretty much from many providers.
00:09:05.680 | We also work with model developers directly; for example, one developer's new model, which
00:09:12.580 | launched last week, is also served on Fireworks.
00:09:15.980 | And in terms of platform capabilities, as I mentioned, we have a lot of open source models
00:09:21.400 | to get you started or customized ones.
00:09:24.640 | We do some of the fine-tuning of those models in-house, so I'm going to talk a little bit
00:09:28.800 | about function-calling-specialized models later on, and we do some of the vision language model
00:09:35.240 | tuning ourselves, which we release as well.
00:09:38.480 | And of course, the key thing with open models is that you can tune them for a particular
00:09:44.080 | use case.
00:09:45.080 | So we do provide a platform for fine-tuning, whether you're bringing your data set collected elsewhere
00:09:50.720 | or collecting it live from feedback while the model is served on our platform.
00:09:56.480 | Specifically on customization, one interesting question which a lot of people starting
00:10:01.780 | to experiment with models run into is: if you fine-tune and deploy the resulting
00:10:06.960 | model, how do you serve it efficiently?
00:10:09.480 | It turns out if you do LoRA fine-tuning, which a lot of folks do, you can do smart tricks and
00:10:16.040 | deploy multiple models on the same GPU, actually thousands of them, which means that we can
00:10:20.800 | still give you serverless, pay-per-token inference even if you have, like, thousands
00:10:25.360 | of model variants sitting deployed there, without you having to pay any fixed cost.
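A minimal sketch of why that works: a LoRA fine-tune only adds two small low-rank matrices per adapted layer, so the large base weights can be loaded once and shared while many adapters are swapped in per request. This is a toy NumPy illustration of the idea, not Fireworks' actual serving code.

```python
import numpy as np

# Toy illustration of multi-LoRA serving: one shared base weight, many tiny adapters.
d, r = 4096, 16                                        # hidden size and LoRA rank (illustrative)
W = np.random.randn(d, d).astype(np.float32) * 0.01   # base weight, loaded once on the GPU

# Each "fine-tuned model" is just a pair of small matrices (A, B) per adapted layer.
adapters = {
    name: (np.random.randn(d, r).astype(np.float32) * 0.01,
           np.random.randn(r, d).astype(np.float32) * 0.01)
    for name in ["support-bot", "code-assistant", "medical-qa"]
}

def forward(x: np.ndarray, adapter_name: str) -> np.ndarray:
    A, B = adapters[adapter_name]
    return x @ W + (x @ A) @ B                         # base output plus low-rank correction

x = np.random.randn(1, d).astype(np.float32)
print(forward(x, "code-assistant").shape)              # (1, 4096)

per_adapter_params = sum(m.size for m in adapters["support-bot"])
print(f"base layer: {W.size:,} params; each adapter: {per_adapter_params:,} params "
      f"(~{100 * per_adapter_params / W.size:.1f}% of the base layer)")
```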
00:10:33.400 | Of course, a single model is all great, but what we see increasingly more and more in applications
00:10:38.620 | is that the model is not the product by itself, right?
00:10:43.300 | You need a kind of bigger system in order to solve the target application.
00:10:47.720 | And the reason for that is that models by themselves tend to hallucinate, and you need
00:10:52.160 | them to be able to solve a lot of things; that's where RAG or access to external knowledge
00:10:56.200 | bases comes in.
00:10:57.200 | Also, we don't yet have, you know, an industry-wide magical multimodal AI across all the modalities,
00:11:04.060 | so often you have to chain multiple types of models, and, of course, you have all
00:11:07.960 | these external tools and external actions which end-to-end applications might want
00:11:13.160 | to take in agentic form.
00:11:16.400 | So I think the term which I really like, popularized by Databricks, is
00:11:20.620 | compound AI system: we're basically increasingly seeing a transition from just the model
00:11:25.660 | being the product to this combination of maybe RAG and function calling and
00:11:29.560 | external tools, et cetera, built together as the product.
00:11:32.980 | And that's pretty much the direction we see this field moving along over time.
00:11:40.220 | So what does that mean from our perspective, and what do we do in this case?
00:11:45.140 | So we see a function calling agent at the core of this emerging architecture,
00:11:51.620 | which might be connected to domain-specialized models served on our platform directly, or maybe
00:11:58.940 | tuned for different needs, and connected to external tools, maybe a code interpreter or maybe
00:12:03.840 | external APIs somewhere, with really this kind of central
00:12:09.820 | model coordinating and trying to triage the user requests
00:12:16.740 | if it's, for example, a chatbot or something.
00:12:19.400 | You probably all heard about, like, function calling, you know, popularized by OpenAI initially.
00:12:24.880 | That's basically the same idea.
00:12:27.200 | So yeah, function calling is really about how to connect an LLM to external tools
00:12:32.760 | and external systems.
00:12:34.260 | What does it mean in practice?
00:12:36.680 | So we actually focus on fine-tuning models specifically for function calling.
00:12:42.180 | So we released a series of models like that.
00:12:44.340 | The latest one, FireFunction V2, was released two weeks ago.
00:12:48.680 | And what you can do with that (if I manage to click on this button) is
00:13:00.800 | build applications which combine freeform general chat capabilities
00:13:07.600 | with function calling.
00:13:08.600 | So in this case, you know, this FireFunction model has some chat capabilities.
00:13:13.440 | So you can see you can ask it: what can you do?
00:13:15.520 | And it has some self-reflection to tell you what it can do.
00:13:18.860 | It's also connected in this demo app to a bunch of external tools.
00:13:22.660 | So it can query stock quotes.
00:13:26.300 | It can plot some charts.
00:13:27.520 | All of those are external APIs.
00:13:29.520 | It can also generate images.
00:13:34.100 | But what it really needs to figure out is how to do complex
00:13:38.420 | reasoning and translate the user query into function calls.
00:13:40.780 | So for example, if we ask it to generate a bar chart with the stocks of the
00:13:47.180 | top three cloud providers, the big three, it actually needs to do several steps, right?
00:13:51.520 | It needs to understand that the top three cloud providers means, you know, AWS, GCP, and
00:13:57.480 | Azure, right?
00:13:59.480 | And Azure is under Microsoft.
00:14:00.480 | It then needs to go do function calls, querying their stock prices.
00:14:04.800 | And finally, it needs to combine that information and send it to the chart plotting API, which is what
00:14:10.160 | just happened in the background.
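For reference, here is a hedged sketch of what the first step of that flow looks like through the OpenAI-compatible chat API with tool definitions. The model id and tool schema are illustrative placeholders (check the Fireworks docs for the exact FireFunction model name), and the application is responsible for executing the returned calls and feeding results back in a follow-up turn.

```python
import json
from openai import OpenAI

# Fireworks exposes an OpenAI-compatible endpoint, so the standard client works
# with a different base_url. The model id below is an illustrative placeholder.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="YOUR_FIREWORKS_API_KEY",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Get the latest stock price for a ticker symbol",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

resp = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",   # illustrative model id
    messages=[{"role": "user",
               "content": "Plot a bar chart of the stock prices of the top 3 cloud providers."}],
    tools=tools,
)

# The model decides which functions to call and with what arguments; the app executes
# them and sends the results back for the next turn (charting, etc.).
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```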
00:14:13.640 | Another important aspect you need for efficient function calling
00:14:18.060 | plus chat capabilities is contextual awareness.
00:14:20.660 | So if I ask it to add Oracle to this graph, it needs to
00:14:25.440 | understand what I'm referring to and, like, still keep the previous context and regenerate
00:14:29.320 | the image.
00:14:30.320 | And finally, you know, if I switch to a different topic, it kind of needs to drop
00:14:34.980 | the previous context and understand that, hey, this historical context
00:14:39.440 | is less important.
00:14:40.440 | I'm going to start from scratch.
00:14:41.440 | There's no Oracle in that cat or whatever.
00:14:45.320 | So, you know, this particular demo is actually open source.
00:14:48.440 | You can go to our GitHub and try it out.
00:14:50.980 | It's built with FireFunction and with a few other models, including
00:14:55.320 | SDXL, which run on our platform.
00:14:58.660 | The model itself for function calling is actually open source.
00:15:01.660 | It's on Hugging Face.
00:15:02.660 | I mean, you can, of course, call it on Fireworks for optimal speed, but you can also run it
00:15:08.480 | locally if you want.
00:15:09.880 | It uses a bunch of, you know, functionality on our platform.
00:15:13.780 | For example, structured generation with JSON mode and grammar mode, which I think
00:15:18.780 | is similar to some of the previous talks from the Outlines folks who were speaking here
00:15:23.600 | yesterday.
00:15:24.600 | Yeah.
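As a minimal sketch of the JSON-mode piece through the same OpenAI-compatible API: response_format forces well-formed JSON output. The grammar and schema-constrained variants mentioned above take additional parameters (see the Fireworks docs), and the model id here is again an illustrative placeholder.

```python
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="YOUR_FIREWORKS_API_KEY",
)

resp = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",   # illustrative model id
    messages=[{
        "role": "user",
        "content": ("Extract the ticker and company name from: 'Oracle (ORCL) rose 2% today.' "
                    "Respond as JSON with keys 'ticker' and 'company'."),
    }],
    response_format={"type": "json_object"},  # constrain the output to valid JSON
)

print(json.loads(resp.choices[0].message.content))  # e.g. {"ticker": "ORCL", "company": "Oracle"}
```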
00:15:25.600 | So, finally, try it out.
00:15:27.720 | And generally, how to get started with Fireworks.
00:15:30.360 | So, if you head out to the Fireworks models page, you'll find a lot of the open
00:15:37.060 | base models which I mentioned.
00:15:38.540 | They're available in the playground.
00:15:40.720 | In terms of product offering, we have this kind of range which can take you from early
00:15:45.220 | prototyping all the way to enterprise scale.
00:15:47.620 | So you can start with serverless inference, which is, you know, not that different from
00:15:52.000 | going to the OpenAI playground or something, where you pay per token
00:15:56.900 | at a set price.
00:15:57.900 | You don't need to worry about, like, hardware settings or anything.
00:16:00.760 | As I mentioned, you can still do fine tuning.
00:16:02.660 | So you can do hosted fine-tuning on our platform.
00:16:05.160 | You can bring your own LoRA adapter and still serve it serverless.
00:16:08.420 | As you graduate, maybe as a startup, to a more production
00:16:12.920 | scale, you might want to go to on-demand, where it's more like dedicated hardware with more
00:16:17.660 | settings and modifications for your use case.
00:16:21.020 | You can bring your own custom model, fine-tuned from scratch, or do the tuning on our platform.
00:16:25.260 | And finally, if you scale up to a bigger volume, you may want to go to the enterprise
00:16:31.320 | level, with discounted long-term contracts, where we will also help you personalize the
00:16:37.500 | hardware setup and do some of that performance tuning which I talked about earlier.
00:16:43.680 | And in terms of use cases, I mean, we are running production for many, many companies, ranging
00:16:48.720 | from small startups to big enterprises.
00:16:50.820 | Last time I checked, we are serving more than 150 billion tokens per day.
00:16:55.320 | So, you know, companies like Quora build chatbots like Poe on it. I think Cursor had a talk
00:17:02.760 | here yesterday.
00:17:03.760 | They use us for some of the code assistant functionality.
00:17:06.320 | And their latency is really important.
00:17:08.620 | As you can imagine, you know, folks like Upstage and Liner are building different
00:17:13.120 | assistants and agents on top of that.
00:17:15.640 | So we are definitely production ready, go try it out.
00:17:19.920 | Finally, we care a lot about developers, you guys.
00:17:24.240 | So actually, these are external numbers from last year's State of AI report, where it turns
00:17:30.180 | out that, after Hugging Face, we are one of the most popular platforms where people pull models from,
00:17:35.640 | which is great.
00:17:36.640 | It was very nice to hear.
00:17:38.940 | And again, for getting started, just, you know, head out to our website.
00:17:44.760 | You can go play in the playground right away.
00:17:48.000 | So, for example, you can run, you know, Llama or Gemma or whatever at top speeds.
00:17:54.140 | And go start building from there. We're really excited to see what you can build with
00:17:58.020 | open models, or FireFunction, or other stuff which you can find on your own.
00:18:04.260 | And yeah.
00:18:05.260 | Last point.
00:18:06.260 | We are, as I mentioned, OpenAI API compatible.
00:18:07.900 | So you can still use, you know, your favorite tools, the same clients, or you can use frameworks
00:18:13.980 | like LangChain or LlamaIndex, et cetera.
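As a small sketch of the framework route, assuming the langchain-fireworks integration package (package and model names may differ; the plain openai client pointed at https://api.fireworks.ai/inference/v1 works the same way):

```python
import os

# Assumes `pip install langchain-fireworks` and a Fireworks API key in the environment.
os.environ.setdefault("FIREWORKS_API_KEY", "YOUR_FIREWORKS_API_KEY")

from langchain_fireworks import ChatFireworks

llm = ChatFireworks(model="accounts/fireworks/models/llama-v3p1-8b-instruct")  # illustrative model id
print(llm.invoke("Give me one reason to use open models in production.").content)
```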
00:18:16.220 | So, yeah.
00:18:17.220 | Really excited to be here and tell you a little bit about open
00:18:25.220 | source models and how we at Fireworks are focusing on productionizing that and scaling it up.
00:18:30.220 | Go try it out,
00:18:31.220 | and you can also find us at the booth at the expo. Thank you.