
The Rise of Open Models in the Enterprise — Amir Haghighat, Baseten



00:00:14.720 | Yeah, hi everyone, my name is Amir.
00:00:16.360 | I'm co-founder and CTO of Baseten, the inference company.
00:00:20.640 | But I'm not here to talk about Baseten.
00:00:23.160 | I'm here to talk about the adoption of AI in the enterprise,
00:00:27.800 | why we should care about it, and how it's
00:00:31.080 | going based on what we've seen.
00:00:33.880 | So first, why we should care about it.
00:00:39.200 | Ultimately, we've heard this before.
00:00:42.400 | It's like, hey, is there hype in the market?
00:00:44.080 | Is AI hyped?
00:00:45.480 | And it probably is, but the evidence
00:00:49.280 | that a lot of people point to that there is hype here
00:00:52.480 | is that adoption in the enterprise has been slow.
00:00:56.640 | I've heard this so many times: enterprises
00:00:59.920 | are slow to adopt. And if that is true,
00:01:05.920 | it has implications
00:01:08.040 | for the impact of AI and how large it can be,
00:01:11.680 | and whether it is truly hype or real.
00:01:15.640 | And the reason is that enterprises are massive.
00:01:18.160 | Their reach is massive.
00:01:20.000 | They have all the money.
00:01:22.280 | And if they're slow in adopting, then that paradigm shift
00:01:26.480 | that we're talking about will be slow to materialize.
00:01:32.280 | But I'm here to tell you, based on what I've learned,
00:01:35.280 | how adoption of AI in the enterprise is actually going. Why me?
00:01:39.280 | Because we happen to sit somewhere interesting.
00:01:42.560 | We happen to sell to enterprises.
00:01:45.440 | And so -- our company is six years old,
00:01:48.560 | but over the past two years, in particular,
00:01:51.880 | I talked to, honestly, 100-plus enterprises
00:01:54.560 | from software companies that are public to literally soft drink companies
00:01:58.560 | that are Fortune 50.
00:02:03.240 | And I've seen patterns that I want to share with you.
00:02:07.960 | One bias that I have is that I don't sell verticalized AI tooling.
00:02:14.080 | I sell very horizontal AI tooling, and this is important.
00:02:17.360 | So enterprises are adopting vertical solutions.
00:02:20.280 | You know, AI for sales, AI for marketing, AI for customer service.
00:02:24.880 | You just heard from Clay from Sierra.
00:02:27.240 | That adoption is happening.
00:02:28.480 | But I think for the true value to get unlocked,
00:02:31.520 | we need to see enterprises actually build with AI.
00:02:34.680 | The analogy that I use is that if, in the 2000s,
00:02:38.640 | enterprises were not really building tech themselves,
00:02:43.080 | and were just buying Salesforce or products like that,
00:02:48.560 | verticalized products like that,
00:02:50.640 | then the tech industry would just not be as big.
00:02:54.080 | Companies like Snowflake and Databricks and Datadog
00:02:57.480 | would not exist or would not exist to the shape that they do.
00:03:00.280 | And so I really think that the value ultimately gets unlocked
00:03:03.080 | once enterprises feel comfortable actually building with AI themselves,
00:03:06.640 | as opposed to just buying verticalized tooling.
00:03:09.880 | So let's talk about the journey,
00:03:12.080 | the journey that they go through.
00:03:15.080 | They all start with OpenAI and Anthropic.
00:03:19.640 | Enterprises are like us in that way, for good reasons.
00:03:22.200 | It's just so easy to get started.
00:03:24.000 | It's just that they do it differently from the rest of us, in that they have their own
00:03:28.560 | dedicated deployments of these models on Azure or AWS
00:03:33.560 | for reasons around security and privacy and all that.
00:03:38.480 | And then they get their engineers,
00:03:40.240 | a lot of times predictive ML teams, to become AI teams and build on top of these.
00:03:47.600 | And they're happy with that.
00:03:50.040 | And if they can continue doing that, they will.
00:03:53.120 | because there's just a lot of inertia in sticking with that if it actually works --
00:03:59.560 | sticking with closed models: so easy to use, API-based, build on top of them.
00:04:04.440 | But we're seeing cracks in that assumption.
00:04:08.760 | And so let me tell you what I've seen going back in time and how that's changed over time.
00:04:14.000 | So in 2023, I remember going and trying to sell to enterprises, and the term "toying around" came up quite a bit.
00:04:21.640 | And I heard this actually literally from the CIO of a massive insurance company back then.
00:04:27.000 | It was like, yeah, we put up a dedicated deployment of OpenAI's GPT-4 or GPT-3
00:04:32.680 | so that our engineers can toy around with it.
00:04:35.640 | Kind of like almost dismissively talking about it.
00:04:37.560 | It's like, hey, go build something cute.
00:04:39.320 | That started to change.
00:04:41.400 | In 2024, we saw actual production use cases, again, built on top of these closed models.
00:04:49.040 | I would say like 40, 50 out of the 100 had something in production in that year.
00:04:56.080 | And then in 2025, this year, something changed, and it's palpable, at least from where I'm sitting.
00:05:03.320 | And the change is that there are cracks in that assumption.
00:05:06.480 | There are cracks in the assumption that we can actually build on top of these closed frontier models indefinitely.
00:05:14.600 | So what are those cracks?
00:05:16.200 | I'll tell you what the cracks are not.
00:05:17.840 | And there's some misconceptions over here.
00:05:20.800 | I'll tell you what they're not.
00:05:21.840 | So often people say, oh, it's because enterprises don't want vendor lock-in.
00:05:26.880 | I don't hear that, honestly.
00:05:29.200 | We go and talk to them.
00:05:30.200 | I think I know why.
00:05:31.520 | It's because, one, there are a few of them now.
00:05:33.520 | You know, you can go with OpenAI, Anthropic, Google has been coming up pretty well.
00:05:36.920 | They're somewhat interoperable at a certain level.
00:05:40.240 | Like, they all use OpenAI specs.
00:05:41.640 | So, like, building on top of them, yeah, you might have to, like, do your evals again and do some prompt tuning.
00:05:47.600 | But generally, you can go from one to the other.
00:05:49.760 | So, vendor lock-in is not something that I hear about.
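To make that interoperability point concrete, here's a minimal sketch of swapping providers behind an OpenAI-compatible client. The endpoint URL, keys, and model names are placeholders of mine, not anything from the talk:

```python
from openai import OpenAI

# Both clients speak the same chat.completions API; only base_url differs.
# URLs, keys, and model names are illustrative placeholders.
providers = {
    "openai": OpenAI(api_key="sk-..."),  # default api.openai.com endpoint
    "other": OpenAI(api_key="...", base_url="https://api.example.com/v1"),
}

def ask(provider: str, model: str, prompt: str) -> str:
    # Identical call shape regardless of which provider is behind it.
    resp = providers[provider].chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

As the talk notes, "going from one to the other" still means re-running evals and re-tuning prompts; the code-level switch is the easy part.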
00:05:51.960 | Ballooning costs, I didn't hear that last year, and I know why.
00:05:58.680 | It's because when I ask them, they say, look, the price per token is plummeting.
00:06:03.480 | And we just talked about this, like, right before this.
00:06:05.920 | They were saying that problem would just take care of itself.
00:06:10.360 | Compliance, privacy, security: also not problems, because the frontier model companies take care of those with the help of the CSPs, the cloud providers,
00:06:20.360 | so that these models run in a dedicated way inside of their existing VPCs.
00:06:27.040 | If these aren't the cracks in that assumption of just use closed models, then what are the cracks?
00:06:33.160 | These are the reasons that I have seen.
00:06:37.720 | And I'll go through them one by one and go through, like, examples of these as well.
00:06:43.120 | And then at the end, I'll also talk about, you know, if these are the cracks,
00:06:50.280 | how do you get around them?
00:06:52.240 | And, you know, there be dragons.
00:06:54.960 | So, one is around quality.
00:06:57.960 | Look, none of these enterprises are under any misconception that they can build the next GPT-4 better than OpenAI can.
00:07:04.960 | That's just not the reality, not as a general model, at least.
00:07:08.960 | But for specific use cases, for specific tasks, we are seeing cases where the frontier models are not necessarily the right tool.
00:07:16.960 | So, the example, I've seen this in a couple of big health plans, is that they want to do medical document extraction.
00:07:24.640 | So, they have millions of medical documents, you know, prior auths and medical claims.
00:07:30.640 | And they're trying to extract, you know, CPT procedure codes and diagnosis codes and prescriptions, and just giving that to Claude or GPT doesn't do it.
00:07:41.320 | But they have the data.
00:07:43.320 | Over the years, they've collected a lot of labeled data.
00:07:45.600 | And they're like, oh, we can do better.
00:07:47.400 | And they actually did.
00:07:50.280 | That's one example.
00:07:52.280 | Another example is on the voice side, in particular transcription: again, staying in the healthcare space, getting transcription models to understand medical jargon.
00:08:05.280 | That has been another reason to not just use a generic API-based model, but to bring it in-house and do better than what they can do with API-based models alone.
00:08:17.760 | Another one is around latency.
00:08:19.560 | Look, these models, OpenAI, Anthropic, even, you know, the big players that serve open source models behind shared APIs, inherently, they're optimized for high throughput and high QPS at the expense of latency.
00:08:36.840 | But a lot of times, we're seeing more and more now that latency is becoming very critical, especially with AI voice or AI phone calls.
00:08:47.160 | Latency starts really mattering.
00:08:49.360 | Time to first token, time to first sentence really starts mattering.
00:08:53.600 | And you have to just think about things differently.
00:08:58.040 | You can't just use the frontier models as is because, again, they're optimized for something else.
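For reference, time to first token is straightforward to measure against any OpenAI-compatible streaming endpoint. A minimal sketch, where the base_url and model name are placeholders:

```python
import time
from openai import OpenAI

# Point this at whatever endpoint you're benchmarking (placeholder values).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    # The first chunk carrying actual text marks time to first token.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {(time.perf_counter() - start) * 1000:.0f} ms")
        break
```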
00:09:05.440 | Then there's the unit economics. Again, like I said before, on pricing, they said this would take care of itself.
00:09:14.160 | Then came this year.
00:09:15.880 | And as you saw in the previous talk from Michael, the agentic use cases ballooned.
00:09:20.280 | And when they balloon, it's crazy.
00:09:23.320 | Like, I've seen this.
00:09:24.320 | Like, every single user action can result in literally 50 inference calls.
00:09:28.320 | And so, suddenly, the thing that you thought was going to take care of itself is not taking care of itself.
00:09:33.000 | Costs are really ballooning, and enterprises think that maybe they can do better on cost and unit economics.
00:09:42.280 | In order to show ROI, in order to show that the solutions they're pushing are economically viable, they need to reduce their costs somehow.
00:09:52.320 | And they're realizing that they can actually run these models and pay for the compute, and have that be a lot cheaper than paying per token and covering someone else's margins.
00:10:00.760 | Really going from being a price taker to being the price maker, and being in control of that.
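A back-of-envelope version of that cost argument, where every number is an illustrative assumption rather than a figure from the talk:

```python
# All numbers below are made-up, round figures for illustration only.
calls_per_action = 50        # agentic fan-out per user action (from the talk)
tokens_per_call = 4_000      # prompt + completion (assumed)
actions_per_day = 100_000    # assumed traffic

tokens_per_day = calls_per_action * tokens_per_call * actions_per_day  # 20B

price_per_mtok = 3.00        # $/1M tokens on a metered API (assumed)
api_cost = tokens_per_day / 1e6 * price_per_mtok            # ~$60,000/day

gpu_hour_cost = 4.00         # $/GPU-hour (assumed)
tokens_per_sec_per_gpu = 2_500  # tuned open-model throughput (assumed)
gpu_hours = tokens_per_day / tokens_per_sec_per_gpu / 3600  # ~2,200 hours
compute_cost = gpu_hours * gpu_hour_cost                    # ~$8,900/day

print(f"metered API: ${api_cost:,.0f}/day, own compute: ${compute_cost:,.0f}/day")
```

Whether the gap is that large depends entirely on utilization and throughput, which is exactly why the inference work described later matters.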
00:10:07.160 | And then, lastly, destiny. This one is a bit vibey, but I'm hearing it more recently: some CIOs and CTOs are saying, if we, the enterprise, use just the frontier models, and so do our competitors, what is our advantage?
00:10:25.280 | What is our alpha?
00:10:26.640 | And maybe we should bring some of these things in-house, to be able to differentiate not just at the workflow and application level, but also at the AI level.
00:10:37.720 | So, now what?
00:10:39.320 | If those are the reasons why they want to adopt open source models, and iterate on them, build on them, fine-tune and distill them, then what changes?
00:10:48.880 | Well, what changes is that they go from a super simple world where you just call an API and run with it,
00:10:54.400 | to now you need to build inference infra, you need to make sure that it scales well, you need to make sure that you can move fast,
00:11:02.800 | that your engineers can actually deliver, instead of having to hire a bunch of new types of people and then wait a long time for them to actually build this infrastructure in-house.
00:11:15.280 | So, one thing that I hear quite a bit at this point is, look, I hear this from enterprises, I hear this from startups, too, actually,
00:11:23.600 | which is that, look, we've picked a model, an open source model, we've heard of vLLM or SGLang or TensorRT-LLM, we have some GPUs,
00:11:32.000 | in the case of enterprises, it's in the data center, in the case of startups, it's in some cloud, and you put these together and you get production inference.
00:11:39.280 | And I know for a fact that this is not true.
00:11:43.680 | I wish it were true, but I know for a fact that it's not, and that there's a lot more that goes into making inference, especially mission-critical inference, work well inside of your company.
00:11:57.360 | So, what are those?
00:11:58.720 | These are the dragons.
00:12:00.880 | These are the dragons.
00:12:02.080 | So, one, at the performance layer: we talked about situations where things are very latency sensitive.
00:12:11.600 | The way that you optimize models for latency is actually quite involved, both at the model level and at the infrastructure level.
00:12:22.320 | You have to attack it at both levels.
00:12:24.800 | So, as an example, at the model level, it's like, hey, do you use speculative decoding?
00:12:30.960 | And if so, which route do you go?
00:12:33.280 | Do you go with a good draft model?
00:12:34.560 | Do you go with Medusa heads, with Eagle 3?
00:12:36.800 | Do you go with MTP?
00:12:39.360 | There's a lot, and new techniques are coming out all the time.
00:12:42.240 | The Eagle 3 paper came out like six months ago, and it's already running in production and being very meaningful.
00:12:48.240 | And so, as an enterprise, can you hire the right folks to be able to stay on top of the research?
00:12:55.040 | Because these are not just switches that you flip in SGLang or vLLM to get the result.
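To give a sense of why this isn't a flag you flip, here's a toy greedy draft-model speculative decoding loop. It shows only the core mechanic (draft proposes k tokens, target verifies them in one forward pass); it is not Eagle 3, Medusa, or MTP, and nothing close to a production implementation. Model names are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder names: any small/large pair sharing a tokenizer would do.
tok = AutoTokenizer.from_pretrained("your-target-model")
target = AutoModelForCausalLM.from_pretrained("your-target-model")
draft = AutoModelForCausalLM.from_pretrained("your-draft-model")

@torch.no_grad()
def speculative_step(ids: torch.Tensor, k: int = 4) -> torch.Tensor:
    """One round: draft proposes k tokens, target verifies them in one pass."""
    proposal = ids
    for _ in range(k):  # k cheap autoregressive steps on the small model
        logits = draft(proposal).logits[:, -1, :]
        proposal = torch.cat([proposal, logits.argmax(-1, keepdim=True)], -1)
    # One target forward pass scores every draft position at once.
    tgt = target(proposal).logits.argmax(-1)
    n = ids.shape[1]
    accepted = 0
    for i in range(k):
        # tgt[0, n+i-1] is the target's greedy choice for position n+i.
        if tgt[0, n + i - 1] != proposal[0, n + i]:
            break
        accepted += 1
    # Keep the accepted draft tokens, then append the target's own token,
    # so each round emits at least one token even if nothing is accepted.
    next_tok = tgt[:, n + accepted - 1].unsqueeze(-1)
    return torch.cat([proposal[:, : n + accepted], next_tok], -1)
```

The hard part in production is doing this with paged KV caches, continuous batching, and tree-structured drafts, which is where the research moves fast.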
00:13:01.760 | Some of these optimizations bleed out of the model level into the infrastructure level.
00:13:07.600 | So, as an example: if you want to be able to do prefix caching really well, or do
00:13:15.200 | disaggregated serving really well -- because that starts really mattering, especially in agentic use
00:13:22.640 | cases, where the prompts are massive but somewhat similar from one call to the next --
00:13:27.520 | it ends up mattering a lot in hitting your time to first token, and your P99 on it, reliably.
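For intuition on the first of those, here's a toy block-level prefix cache in the spirit of what vLLM-style engines do, with a dict standing in for real KV-cache tensors; the block size and structure are illustrative assumptions:

```python
import hashlib

BLOCK = 16  # tokens per cache block (illustrative)
cache: dict[str, object] = {}  # prefix hash -> cached KV block (stand-in)

def block_hashes(token_ids: list[int]) -> list[str]:
    """Hash each full block together with everything before it, so each
    hash identifies an exact prefix, not just one block's contents."""
    hashes, h = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % BLOCK
    for i in range(0, full, BLOCK):
        h.update(str(token_ids[i:i + BLOCK]).encode())
        hashes.append(h.copy().hexdigest())
    return hashes

def reusable_prefix_len(token_ids: list[int]) -> int:
    """Tokens of this prompt that can skip prefill via cached blocks."""
    n = 0
    for i, hh in enumerate(block_hashes(token_ids)):
        if hh not in cache:
            break
        n = (i + 1) * BLOCK
    return n

# After a real prefill you'd store each new block: cache[hh] = kv_tensors.
```

With agentic prompts that share a large common prefix, most of the prefill can be skipped this way, which is what protects time to first token.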
00:13:35.840 | Another thing on the infrastructure side, especially if it's mission-critical inference,
00:13:42.320 | which more and more I see is the case: how do you guarantee four nines?
00:13:46.800 | Because this formula does not guarantee you more than two nines. And I saw this firsthand.
00:13:56.080 | So, how do you make sure that when the hardware fails underneath that,
00:14:01.280 | you actually recover? When vLLM crashes, which happens often -- I saw
00:14:06.480 | this firsthand with Triton crashing often -- your tail latencies go through the roof
00:14:10.320 | while you wait for these things to come back. And during that time, your users are feeling it.
00:14:20.800 | How do you build against those and make sure that you can still guarantee four nines, and not
00:14:26.240 | be super over-provisioned and mess up all the unit economics that we talked about?
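One small piece of that, sketched at the client: failover across replicas, so a crashed or hung engine costs a bounded timeout rather than a tail-latency cliff. The URLs are placeholders, and a real deployment would put this in a router or gateway, alongside health checks and automatic restarts, not in the client:

```python
import requests

# Placeholder replica endpoints for an HTTP inference server.
REPLICAS = [
    "http://replica-a:8000/v1/completions",
    "http://replica-b:8000/v1/completions",
]

def infer(payload: dict, timeout_s: float = 2.0) -> dict:
    """Try each replica in turn; a dead or restarting replica costs at
    most timeout_s before the request fails over instead of queueing."""
    last_err: Exception | None = None
    for url in REPLICAS:
        try:
            r = requests.post(url, json=payload, timeout=timeout_s)
            r.raise_for_status()
            return r.json()
        except requests.RequestException as err:
            last_err = err  # replica down or too slow; try the next one
    raise RuntimeError(f"all replicas failed: {last_err}")
```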
00:14:32.640 | When a big burst of traffic comes in, how do you make sure you scale up fast? I was
00:14:37.840 | talking to this massive enterprise, the soft drink example, and they were like, yeah, it takes us eight
00:14:43.200 | minutes to bring up a new replica of the same model. And I believe that,
00:14:48.160 | because if you add up all the different things that go into doing that, that is how
00:14:51.360 | long it takes. But that's not okay. Again, your tail latencies go through the roof as soon as there's
00:14:55.840 | a big spike of traffic. How do you account for that?
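A quick back-of-envelope on why an eight-minute cold start forces over-provisioning; all the numbers are illustrative assumptions:

```python
import math

# Illustrative assumptions, not figures from the talk.
qps_now = 40.0            # steady-state traffic
per_replica_qps = 5.0     # sustainable load per replica
spike_factor = 2.0        # traffic can double faster than a replica boots

needed_now = math.ceil(qps_now / per_replica_qps)                     # 8
needed_at_peak = math.ceil(qps_now * spike_factor / per_replica_qps)  # 16

# With an 8-minute cold start, autoscaling can't catch the spike, so the
# whole gap has to sit warm (and idle) ahead of time:
print(f"warm spare replicas required: {needed_at_peak - needed_now}")  # 8
# Cutting cold start to seconds lets scaling absorb the spike instead.
```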
00:15:01.600 | And then there are other things around making sure that your engineers move fast: the tooling, lifecycle management, observability -- a massive
00:15:08.480 | iceberg, where it's like, oh yeah, I'll just put in some logs and metrics, and you realize there's a lot
00:15:15.440 | more to do underneath it. Michael talked about this in the previous talk. And then lots of stuff around
00:15:21.600 | controls and audits and things that enterprises actually care about. So these are the dragons.
00:15:29.600 | These are the dragons. And these are the times when enterprises have a decision to make.
00:15:33.920 | Once they get to this level, either I tell them, and they have to believe me, or they don't,
00:15:40.000 | and they go and build it and then see some of these things firsthand. Then they have a decision, which is
00:15:45.760 | build or buy. And it's my job then to try to convince them that they should buy this
00:15:51.200 | layer of infrastructure and platform as opposed to building it. And that's sometimes harder than it seems.
00:16:02.880 | So I'm happy to talk more about these things. I'll be at our booth. Two topics that I'd
00:16:07.440 | love to talk about if you're interested. One, self-servingly: if you're an enterprise and those
00:16:12.720 | problems resonate, I'd love to chat with you about them. And two, less self-servingly: if you're a startup and
00:16:18.800 | you're trying to sell to enterprises, I'm happy to chat about all the right decisions we made, and the wrong
00:16:24.160 | decisions we made along the way, to build something that, when it comes to selling to
00:16:29.200 | enterprises and deploying into their own clouds, is actually possible and not a
00:16:36.000 | massive other set of dragons over there. And then last thing, we have a happy hour. We'd love to see
00:16:42.000 | you there. Thank you.