The Rise of Open Models in the Enterprise — Amir Haghighat, Baseten

I'm co-founder and CTO of Baseten, the inference company. I'm here to talk about the adoption of AI in the enterprise.

There's a question hanging over the market: hey, is there hype here? One data point that a lot of people cite as evidence of hype is that enterprise adoption has been slow. Enterprises are slow to adopt, and that has an implication for how large the impact of AI can really be. The reason is that enterprises are massive, and if they're slow in adopting, then the paradigm shift we're all talking about will be slow to materialize.
But I'm here to tell you what I've actually learned about adoption, because we happen to sit somewhere interesting. Our company is six years old, and over that time we've sold to everyone from public software companies to literally soft drink companies. And I've seen patterns that I want to share with you.
One bias that I have: I don't sell verticalized AI tooling. I sell very horizontal AI tooling, and this is important. Enterprises are adopting vertical solutions, you know, AI for sales, AI for marketing, AI for customer service. But I think for the true value to get unlocked, we need to see enterprises actually build with AI.

The analogy I use is this: if, in the 2000s, enterprises had not really been building tech themselves and had just bought Salesforce or products like that, then the tech industry would just not be as big. Companies like Snowflake and Databricks and Datadog would not exist, or would not exist in the shape that they do. So I really think the value ultimately unlocks once enterprises feel comfortable actually building with AI themselves.
Enterprises all start with OpenAI and Anthropic. They just do it differently from the rest of us, in that they have their own dedicated deployments of these models on Azure or AWS, for reasons around security and privacy and all that. A lot of times they retool their predictive ML teams to become AI teams and build on top of these models. And if they can continue doing that, they will, because there's a lot of inertia in sticking with something that actually works. Sticking with closed models is easy: they're API-based, you just build on top of them. So let me tell you what I've seen, going back in time, and how that's changed.
In 2023, I remember going out and trying to sell to enterprises, and the term "toying around" came up quite a bit. I heard this literally from the CIO of a massive insurance company back then: yeah, we put up a dedicated deployment of OpenAI's GPT-4 or GPT-3 so that our engineers can toy around with it. Almost dismissive about it. In 2024, we saw actual production use cases, again built on top of these closed models. I would say 40 or 50 out of the hundred had something in production that year.
Then in 2025, this year, something changed, and it's palpable, at least from where I'm sitting. The change is that there are cracks in that assumption: cracks in the assumption that we can build on top of these closed frontier models indefinitely.

People often guess it's because enterprises don't want vendor lock-in. It's not. For one, there are a few of these vendors now: you can go with OpenAI, with Anthropic, and Google has been coming up pretty well. And they're somewhat interoperable at a certain level. Building on top of them, yeah, you might have to redo your evals and do some prompt tuning, but generally you can go from one to the other. So vendor lock-in is not something I hear about.
Ballooning costs? I didn't hear that last year either, and I know why. When I asked, they said: look, the price per token is plummeting, so that problem will just take care of itself. (We talked about this right before this session, too.) Compliance, privacy, security: also not problems, because the frontier model companies take care of those with the help of the CSPs, the cloud providers, so that these models run in a dedicated way inside the enterprise's existing VPCs.
So if those aren't the cracks in the assumption of "just use closed models," then what are the cracks? I'll go through them one by one, with examples, and then at the end talk about what follows: if these are the cracks, then what changes for enterprises.
Look, none of these enterprises are under any misconception that they can build the next GPT-4 better than OpenAI can. That's just not the reality, at least not as a general model. But for specific use cases, for specific tasks, we are seeing that the frontier models are not necessarily the right tool. One example, which I've seen in a couple of big health plans, is medical document extraction. They have millions of medical documents, prior auths and medical claims, and they're trying to pull out CPT procedure codes, diagnosis codes, and prescriptions. Just handing that to Claude or GPT doesn't do it. But over the years they've collected a lot of labeled data, which is exactly what you need to fine-tune an open model for that one task.
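To make the task concrete, here's a sketch of what one labeled training record for this kind of extraction fine-tune might look like. Everything here is illustrative, not from any actual health plan: the field names, the document snippet, and the chat-style JSONL shape are all assumptions.

```python
import json

# Hypothetical training record for structured medical extraction.
# A fine-tune dataset is a pile of (document, expected codes) pairs
# collected from years of human-labeled claims and prior auths.
record = {
    "document": "Patient presented with type 2 diabetes. "
                "Administered office visit, level 3. Continued metformin.",
    "labels": {
        "cpt_codes": ["99213"],           # procedure: office visit, level 3
        "icd10_codes": ["E11.9"],         # diagnosis: type 2 diabetes
        "prescriptions": ["metformin 500mg"],
    },
}

# One JSONL line per labeled document, in the usual chat fine-tune shape.
print(json.dumps({
    "messages": [
        {"role": "user", "content": "Extract CPT, ICD-10, and prescriptions "
                                    "as JSON:\n" + record["document"]},
        {"role": "assistant", "content": json.dumps(record["labels"])},
    ],
}))
```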
Another example is on the voice side, in particular transcription. Staying in the healthcare space: getting transcription models to understand medical jargon. That has been another reason not to just use a generic API-based model, but to bring it in-house and do better than what the API-based models can do.
The next crack is latency. Look, these models, OpenAI, Anthropic, even the big players that serve open source models behind shared APIs, are inherently optimized for high throughput and high QPS at the expense of latency. But we're seeing more and more cases where latency is critical, especially with AI voice and AI phone calls. Time to first token and time to first sentence really start mattering, and you have to think about things differently. You can't just use the frontier models as-is because, again, they're optimized for something else.
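Since time to first token is the metric that starts mattering here, a minimal way to measure it is to stream from the endpoint and stamp the first content chunk. This is a sketch against any OpenAI-compatible API; the base URL and model name are placeholders.

```python
import time
from openai import OpenAI

# Works against any OpenAI-compatible endpoint; URL and model are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)

ttft = None
for chunk in stream:
    # The first chunk with actual content marks time to first token.
    if ttft is None and chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start
print(f"TTFT: {ttft:.3f}s")
```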
Then there are the unit economics. Like I said, on pricing they used to say the problem would take care of itself. But as you saw in the previous talk from Michael, the agentic use cases ballooned. Every single user action can result in literally 50 inference calls. So suddenly the thing you thought would take care of itself is not taking care of itself: costs really are ballooning, and enterprises think they can do better on cost. To show ROI, to show that the solutions they're pushing are economically viable, they need to reduce their costs somehow. And they're realizing they can run these models themselves, pay for the compute, and have that be a lot cheaper than paying per token and covering someone else's margins. It's going from being a price taker to being the maker of the price, in control of it.
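The economics argument is ultimately arithmetic, so here's the shape of it with entirely made-up numbers. The only point is how per-token pricing scales with agentic call volume versus paying for compute directly; every figure below is an assumption, not a quote.

```python
# Back-of-envelope only; every number here is a made-up assumption.
tokens_per_user_action = 50 * 2_000        # ~50 inference calls x ~2k tokens
actions_per_month = 1_000_000

# Option A: pay per token at a hypothetical blended API price.
api_price_per_1m_tokens = 5.00             # $/1M tokens (assumed)
api_cost = (tokens_per_user_action * actions_per_month / 1e6
            * api_price_per_1m_tokens)

# Option B: pay for compute on a hypothetical dedicated GPU deployment.
gpu_hour_price = 4.00                      # $/GPU-hour (assumed)
tokens_per_gpu_hour = 10_000_000           # throughput at high utilization (assumed)
gpu_cost = (tokens_per_user_action * actions_per_month / tokens_per_gpu_hour
            * gpu_hour_price)

print(f"API:  ${api_cost:,.0f}/mo")   # -> $500,000/mo
print(f"GPUs: ${gpu_cost:,.0f}/mo")   # -> $40,000/mo
```

The catch, and it matters, is that the compute number assumes high utilization; idle GPUs erase the gap, which is part of why the infrastructure discussed below is hard.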
And then, lastly, destiny. This one is a bit vibey, but I'm hearing it more recently. Some CIOs and CTOs are saying: if we, the enterprise, use just the frontier models, and so do our competitors, what is our advantage? Maybe we should bring some of this in-house, to be able to differentiate not just at the workflow and application level, but also at the AI level.
If those are the reasons they want to adopt open source models, and iterate on those and fine-tune those and distill those, then what changes? Well, what changes is that they go from a super simple world, just call an API and run with it, to a world where you need to build inference infrastructure. You need to make sure it scales well. You need to make sure you can move fast, that your engineers can actually deliver, instead of hiring a bunch of new kinds of people and waiting a long time for them to build this infrastructure in-house.
One thing I hear quite a bit at this point, from enterprises and actually from startups too, is: look, we've picked an open source model, we've heard of vLLM or SGLang or TRT-LLM, and we have some GPUs. For enterprises the GPUs are in a data center; for startups they're in some cloud. You put these together and you get production inference. I wish that were true, but I know for a fact that it's not. There's a lot more that goes into making inference, especially mission-critical inference, work well inside your company.
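For reference, the "formula" people have in mind really is about this short. Here's a minimal vLLM sketch (the model name is a placeholder) that works great in a notebook and is nowhere near production:

```python
from vllm import LLM, SamplingParams

# The naive formula: open model + vLLM + a GPU = "production inference".
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
params = SamplingParams(max_tokens=256, temperature=0.2)

outputs = llm.generate(["Summarize this claim document: ..."], params)
print(outputs[0].outputs[0].text)
```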
First, the performance layer. We talked about situations that are very latency sensitive. The way you optimize models for latency is actually quite involved, both at the model level and at the infrastructure level; you have to attack it at both. At the model level, as an example: do you use speculative decoding? There's a lot there, and new techniques come out all the time. The EAGLE-3 paper came out something like six months ago, and it's already running in production and being very meaningful. So as an enterprise, can you hire the right folks to stay on top of the research? Because these are not just switches you flip in SGLang or vLLM to get the result.
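As one concrete example of a knob that isn't just a switch: draft-model speculative decoding in vLLM looks roughly like the sketch below. The model names are placeholders, the exact argument names have shifted across vLLM versions, and whether it actually helps depends on the draft model's acceptance rate on your traffic.

```python
from vllm import LLM

# Draft-model speculative decoding: a small draft model proposes tokens,
# the large target model verifies them in one pass, cutting latency when
# the draft's guesses are accepted. Model names are placeholders, and the
# exact argument names vary by vLLM version.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",        # target (placeholder)
    speculative_config={
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # draft (placeholder)
        "num_speculative_tokens": 5,                  # tokens proposed per step
    },
)
```

Even with the flag set, you still have to pick a draft model, measure acceptance rates on your own prompts, and re-tune as traffic shifts, which is the "not just a switch" point.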
Some of these optimizations bleed out of the model level into the infrastructure level. For example: doing prefix caching really well, and doing disaggregated serving really well. That starts to really matter, especially in agentic use cases, where the prompts are massive but somewhat similar from one call to the next. It ends up mattering a lot in hitting your time to first token, and hitting your P99 on it, in a reliable way.
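Prefix caching is the more self-contained of the two; in vLLM it's a constructor flag, shown in this sketch with placeholder names. Disaggregated prefill/decode serving, by contrast, spans machines and doesn't reduce to a flag at all.

```python
from vllm import LLM, SamplingParams

# Prefix caching: the KV cache for a shared prompt prefix is computed once
# and reused across requests, which is what protects TTFT when agentic
# prompts are huge but mostly identical. Model name is a placeholder.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
)

shared_prefix = "You are an agent. Tools: ...\n" * 200  # big common prompt
params = SamplingParams(max_tokens=64)

# The second call reuses the cached KV blocks for the shared prefix.
llm.generate([shared_prefix + "Task A"], params)
llm.generate([shared_prefix + "Task B"], params)
```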
Another thing at the infrastructure level, especially for mission-critical inference, which more and more is what I see: how do you guarantee four nines? That naive formula does not get you more than two nines; I saw this firsthand. How do you make sure that when the hardware fails underneath you, you actually recover? How do you make sure that when vLLM crashes, which happens often, or when Triton crashes, which I also saw firsthand, your tail latencies don't go through the roof while you wait for these things to come back, with your users feeling it the whole time? How do you build against those failures and still guarantee four nines, without being super over-provisioned and messing up all the unit economics we just talked about?
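Here's a toy sketch of the kind of thing you end up building: aggressive timeouts plus failover across replicas, so one crashed engine doesn't drag your P99 while it restarts. Everything below is hypothetical (the endpoints, the routing policy); real systems add health checks, draining, and capacity-aware routing.

```python
import requests

# Hypothetical replica pool; in practice this comes from service discovery.
REPLICAS = ["http://replica-a:8000", "http://replica-b:8000"]

def generate(payload: dict, timeout_s: float = 2.0) -> dict:
    """Try replicas in order; fail over instead of waiting out a crash."""
    last_err = None
    for base in REPLICAS:
        try:
            r = requests.post(f"{base}/v1/completions", json=payload,
                              timeout=timeout_s)
            r.raise_for_status()
            return r.json()
        except requests.RequestException as err:
            last_err = err  # replica down or slow: move on, don't block P99
    raise RuntimeError(f"all replicas unhealthy: {last_err}")
```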
When a big burst of traffic comes in, how do you make sure you scale up fast? I was talking to a massive enterprise, the soft drink example, and they told me: yeah, when you want to bring up a new replica of the same model, it takes us eight minutes. And I believe that, because if you add up all the different things that go into doing it, that is how long it takes. But that's not okay. Again, your tail latencies go through the roof as soon as there's a big spike of traffic. How do you account for that?
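Eight minutes is believable if you itemize it. Here's a back-of-envelope breakdown where every number is an assumption, not a measurement:

```python
# Back-of-envelope for the eight minutes; every number is an assumption.
cold_start_s = {
    "provision_gpu_node": 120,   # get a machine with a free GPU
    "pull_container_image": 90,  # multi-GB image over the network
    "download_weights": 180,     # tens of GB of model weights
    "load_and_warm_engine": 90,  # load to GPU, compile, warm caches
}
print(sum(cold_start_s.values()) / 60, "minutes")  # -> 8.0 minutes
```

Each line item is attackable separately (pre-pulled images, weights on local NVMe, warm pools), which is why fast scale-up is an engineering project rather than a setting.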
And then there are other things around making sure your engineers move fast: the tooling, lifecycle management, observability. Observability is a massive iceberg; it starts as "I'll just put in some logs and metrics," and then you realize there's a lot more underneath, as Michael covered in the previous talk. And then there's a lot around controls and audits, things that enterprises actually care about.

So these are the dragons. And this is when enterprises have a decision to make. Once they get to this level, either I tell them about the dragons and they have to believe me, or they don't, and they go build it and run into some of these things themselves. Then they have a decision: build or buy. And it's my job to convince them that they should buy this layer of infrastructure and platform as opposed to building it. That's sometimes harder than it seems.
So I'm happy to talk more about these things; I'll be at our booth. Two topics I'd love to talk about if you're interested. One, self-servingly: if you're an enterprise and those problems resonate, I'd love to chat with you. And two, less self-servingly: if you're a startup trying to sell to enterprises, I'm happy to chat about the right decisions and the wrong decisions we made along the way, building something that, when it comes to selling to enterprises and deploying into their own clouds, is actually possible and not a massive other set of dragons. And one last thing: we have a happy hour, and we'd love to see you there.