Fun stories from building OpenRouter and where all this is going - Alex Atallah, OpenRouter

Chapters
0:00 The Genesis of OpenRouter
1:16 Initial Question. The story begins in early 2023 with the founder, Alex Atallah, pondering whether the AI inference market would be dominated by a single player. He noticed the emergence of new models beyond OpenAI and a growing desire from developers to understand the nuances of different models, including their moderation policies.
2:35 The Rise of Open Source. The video highlights the beginning of the open-source AI race, with early models like BLOOM 176B and OPT from Facebook. A pivotal moment was the release of Meta's Llama 1 in February 2023, which surprisingly outperformed GPT-3 on many benchmarks, signaling a shift in the landscape.
4:38 The Alpaca Moment. A major breakthrough occurred in March 2023 with the Alpaca distillation. Stanford researchers demonstrated that by fine-tuning Llama 1 with outputs from GPT-3, they could transfer the style and knowledge of a larger model to a smaller one for less than $600. This proved that creating powerful, specialized models no longer required massive budgets.
6:43 Window AI. Before OpenRouter, Atallah launched Window AI, an open-source Chrome extension that let users select their preferred LLM for any web application. This project laid the groundwork for what was to come.
7:18 The Launch of OpenRouter. OpenRouter was co-founded with Lewis, the creator of the framework that Window AI was built on. Initially, it was a simple aggregator to collect models in one place.
7:57 Growth and Evolution. OpenRouter quickly evolved into a marketplace, driven by the proliferation of model providers with varying prices, performance, and features. The platform has grown 10-100% month-over-month for two years and now offers a single API for over 400 models from more than 60 providers.
8:57 Marketplace Dynamics. The transition to a marketplace was a response to the complexity of the growing AI ecosystem. By aggregating providers, OpenRouter helps developers achieve better uptime for both open-source and closed-source models and provides valuable data on latency and throughput.
17:02 Expanding Modalities. The future vision for OpenRouter includes models that can generate images and "transfusion models" that allow for conversations with images.
17:51 Smarter Routing. The platform plans to implement more sophisticated routing, including geographical routing and enterprise-level optimizations for GPU allocation.
18:07 Enhanced Discovery. To help developers find the best models for their needs, OpenRouter aims to improve prompt observability, introduce more granular model categorization, and continue to offer competitive pricing.
I was looking at this new market that was coming online 00:00:46.240 |
And I decided to look into answering this question: would inference be dominated by a single player? 00:00:52.460 |
Inference might be the largest market ever in software, 00:00:58.460 |
and everybody was assuming the answer to that question would be yes. 00:01:02.240 |
OpenAI was just far and away the leading model provider. 00:01:06.540 |
There were a few others that were coming up on its tail. 00:01:10.300 |
And I built a couple prototypes to look into what they could be used for 00:01:22.180 |
I'm going to talk about the founding story of OpenRouter 00:01:25.980 |
and go through a little bit of the hoops that we jumped through 00:01:32.840 |
as we put together this product that started as an experiment 00:01:37.080 |
and kind of evolved into a marketplace over time. 00:01:39.880 |
In January, we saw the first signs of people wanting other types of 00:01:50.480 |
models, and the first evidence was moderation. 00:01:54.960 |
This was like a very clear interest from users in looking for models 00:02:00.240 |
where they could understand whether they'd be deplatformed 00:02:03.640 |
or what the moderation policy of the company was. 00:02:06.960 |
And we saw some people like generating novels 00:02:12.720 |
And in chapter four, the detective would find someone 00:02:17.680 |
who like commits a murder and shoots the victim. 00:02:20.200 |
And Open AI at the time sometimes refused to generate that output 00:02:24.920 |
or it was like questionably against the terms of service. 00:02:27.400 |
And of course, we saw role play and basically a big gray area emerge 00:02:32.920 |
around what models were willing to generate. 00:02:37.080 |
So in the next month, we saw the open source race begin. 00:02:44.080 |
And that -- I'm going to do a little bit of an OG test here. 00:02:55.640 |
Who remembers BLOOM 176B? There's like 10 hands raised. Or OPT by Facebook. 00:03:02.160 |
These were like some of the earliest open source language models, 00:03:09.720 |
and there were some very interesting projects built on them. 00:03:13.040 |
But in the early days, they weren't really useful for very much. 00:03:18.680 |
So we kept digging, and eventually the open source community 00:03:27.800 |
ran into Meta's first launch, which was Llama 1 in February. 00:03:33.960 |
And Llama 1, in their abstract, advertised that it outperformed GPT-3. 00:03:41.480 |
You can see the highlighted part here, which blew everyone away: 00:03:57.840 |
a model you could download outperforming one that was server-only and required tons of money to run. 00:04:18.320 |
It was like a text completion model, for the most part. 00:04:26.200 |
And people were struggling to figure out what to do with it. 00:04:28.640 |
Which is when we had the greatest moment of all, 00:04:33.760 |
I think, for the birth of the long tail of language models, 00:04:38.240 |
which was the first successful distillation in March of 2023. 00:04:44.040 |
It was the first time I saw the transference of both style and knowledge 00:04:52.600 |
from one model to another. 00:04:55.200 |
Stanford fine-tuned Llama 1 on outputs from GPT-3 for less than $600, 00:04:57.640 |
and it meant that you no longer needed a massive budget 00:05:00.360 |
to create a powerful, specialized model. 00:06:05.600 |
So it's like only double the number of people who used the, 00:06:09.600 |
like, almost unusable open source models on the previous slide. 00:06:13.600 |
So OpenRouter initially started as a place to collect all these things. 00:06:20.600 |
But before we got there, I wanted to check out people's willingness 00:06:25.600 |
to bring their own model to generic websites. 00:06:29.400 |
Like, what if the developer didn't even know which model a user wanted to use? 00:06:34.400 |
How would a user bring their choice of model to the software that they want? 00:06:39.400 |
And in April, I launched Window AI, which was an open source Chrome extension 00:06:47.400 |
that let a user choose their model and let a web app just kind of suck it in. 00:06:53.600 |
And so you can see from the Chrome extension here, if you look really closely, 00:06:59.400 |
this user is using Together's open source deployment of GPT-NeoXT. 00:07:08.400 |
It's an open source model that swaps out OpenAI directly inside the web page. 00:07:15.600 |
I can't read it from here. 00:07:22.400 |
And I co-founded it with Lewis, the creator of the framework that Window AI was built on. 00:07:31.400 |
And we started OpenRouter as, first, a place to collect all the models in one spot 00:07:36.400 |
and help people figure out what to do with them. 00:07:38.600 |
And it eventually grew into a place that gives you, like, better prices, 00:07:43.400 |
better uptime, no subscription, and the most choice for figuring out which model to use. 00:07:56.400 |
I'll give a quick overview of what OpenRouter is today, because not everyone here might be familiar with it. 00:07:59.600 |
We have been growing 10% to 100% month-over-month for the last two years. 00:08:08.400 |
It is an API that lets you access all language models, 00:08:12.400 |
and it's also become kind of the go-to place for data 00:08:17.400 |
about who's using which model and how that is changing over time, 00:08:22.400 |
which you can see on our public rankings page here. 00:08:27.600 |
With one API, you get near-zero switching costs to go from model to model. 00:08:32.400 |
And we have over 400 models, over 60 active providers, 00:08:37.400 |
and you can buy with lots of different payment methods, including crypto. 00:08:42.400 |
And we basically do all the tricky work of normalizing tool calls 00:08:47.400 |
and caching for you so that you get the best prices and the most features, 00:08:51.400 |
and you don't have to worry about what the provider supports. 00:08:58.400 |
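To make the single-API point concrete, here is a minimal sketch of a call; it assumes Node 18+ fetch, an OPENROUTER_API_KEY environment variable, and illustrative model IDs. Switching models is just changing one string:

```typescript
// Minimal sketch: one endpoint, any model. Swapping models is just a
// string change; tool calls, caching, and provider quirks are normalized
// behind the API. Assumes an OPENROUTER_API_KEY environment variable.
async function chat(model: string, prompt: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model, // e.g. "meta-llama/llama-3.3-70b-instruct" or "anthropic/claude-3.5-sonnet"
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```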
Initially, OpenRouter was not a marketplace, really. 00:09:02.400 |
It was just kind of a collection of all the models 00:09:04.400 |
and a way to explore data about who was using each one. 00:09:08.400 |
Initially, when the first open source models emerged, 00:09:13.400 |
we only had like one or two providers for each one, 00:09:17.400 |
and so we had like a primary provider and a fallback provider. 00:09:20.600 |
Initially, that was it, and we didn't even name the providers. 00:09:25.400 |
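That primary-plus-fallback pattern still exists in the API today. A hedged sketch, with field names based on OpenRouter's routing docs and illustrative model IDs:

```typescript
// Hedged sketch of model fallback: if the primary model fails or its
// providers are down, the request falls through to the next model listed.
// Field names follow OpenRouter's routing docs; treat them as illustrative.
const body = {
  model: "meta-llama/llama-3.3-70b-instruct", // primary
  models: ["mistralai/mixtral-8x7b-instruct"], // fallbacks, tried in order
  messages: [{ role: "user", content: "Hello!" }],
};
```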
But it became clear that there were going to be a bunch of companies 00:09:30.400 |
that wanted to host these models, at very different prices and performance levels. 00:09:39.400 |
There were companies that supported the min-p sampler while most didn't, 00:09:45.400 |
and some that supported tool calling and structured outputs. 00:09:48.400 |
And suddenly the ecosystem was just ballooning into this kind of out-of-control, 00:09:53.400 |
heterogeneous monster, and we wanted to tame the monster. 00:10:03.400 |
And as providers hosted the same models at different price points, it became a marketplace. 00:10:06.400 |
And you can see like this model, Llama 3.3 70B Instruct. 00:10:10.400 |
It's one of the models with the most providers on the platform. 00:10:18.400 |
Closed source models also had something interesting happen to them, 00:10:23.400 |
which is that they just couldn't keep up with the demand. 00:10:26.600 |
And so we helped developers basically get uptime boosting, 00:10:32.400 |
and you can see like the delta and how much we can boost uptime 00:10:37.400 |
just by aggregating lots of different providers for a model. 00:10:40.400 |
And this became really helpful for people using open source or closed source models. 00:10:44.400 |
And we became a marketplace for both, showing graphs about latency and throughput 00:10:49.400 |
and helping people figure out, using real-world data, 00:10:52.400 |
what the latency and throughput is on each model. 00:10:55.600 |
And that is how OpenRouter became a marketplace, 00:11:01.400 |
which I thought would be the proper structure for inference, 00:11:06.400 |
potentially the biggest market in software. 00:11:08.400 |
A couple of other things that we support: 00:11:13.400 |
comparing models using your own prompts with ease, 00:11:18.600 |
fine-grained privacy controls with API-level overrides, 00:11:23.400 |
and the ability to see your usage of all models in one place. 00:11:28.400 |
And back to the original question here of whether 00:11:34.400 |
inference would be dominated by a single player: we've come to the most likely bet that that is not the case. 00:11:46.400 |
How many tokens have been processed by each one? 00:11:49.400 |
And you can see Google Gemini started pretty low, 00:11:58.400 |
and just has grown to 34%, 35% pretty steadily over the last 12 months. 00:12:09.400 |
Anthropic is like one of the most popular model providers on our platform. 00:12:13.400 |
OpenAI is a little bit underrepresented in this data, 00:12:16.400 |
because a lot of developers use us to get OpenAI-like behavior from other models 00:12:30.400 |
after all of the, you know, back story that I just gave you. 00:12:42.400 |
They try switching models and realize they can unlock huge gains by doing so. 00:12:49.400 |
Take Claude from Bedrock: we want to make it look exactly the same as Claude from Vertex. 00:12:53.400 |
And we do that because, like, the two hyperscalers 00:12:56.400 |
fundamentally have the same commodity being delivered. 00:13:04.400 |
And as a developer, you just want to be able to select that commodity. 00:13:09.400 |
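As a rough sketch of what that selection looks like in a request body: the `provider` routing object follows OpenRouter's docs, but the provider slugs and model ID here are illustrative:

```typescript
// Same model, interchangeable hyperscalers: prefer Vertex, allow falling
// back to Bedrock. Provider slugs and model ID are illustrative.
const body = {
  model: "anthropic/claude-3.5-sonnet",
  provider: { order: ["Google Vertex", "Amazon Bedrock"], allow_fallbacks: true },
  messages: [{ role: "user", content: "Hello!" }],
};
```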
We think inference will be, like, a dominant operating expense. 00:13:17.400 |
You can see the number of active models on OpenRouter keeps growing. 00:13:22.600 |
It's not the case that people just hop from model to model. 00:13:29.400 |
And we're trying to just make this wild ecosystem 00:13:34.400 |
a lot more homogeneous and easier to work with as a developer. 00:13:39.600 |
So, to honor Swyx's title for this presentation, 00:13:51.400 |
And that was our own idea for how to do an MCP within OpenRouter. 00:14:02.600 |
But we did run into the need to expand inference with new abilities. 00:14:12.400 |
For example, searching the web for all models. 00:14:17.400 |
You know, other interesting things coming soon. 00:14:21.400 |
And what we really wanted to do was give these abilities to all models. 00:14:25.400 |
But that involves not just the pre-flight work that MCPs do today, 00:14:32.400 |
where you can kind of call another API, get a bunch of behaviors, 00:14:38.200 |
and then have the inference process access those behaviors as it goes. 00:14:42.200 |
We also needed the ability to transform the outputs on the way to the user. 00:14:47.200 |
And so, what we really, really needed was something more like middleware. 00:14:52.200 |
Middleware is kind of a common concept in web development. 00:14:57.000 |
You set up middleware when you're setting up authentication, for example. 00:15:04.000 |
And so, we came up with a type of middleware that's AI-native and optimized for inference. 00:15:11.000 |
And that looks not totally dissimilar from the way middleware looks in Next.js or web development. 00:15:20.000 |
This is a little bit about how our plug-in system looks: a plug-in can do the pre-flight work that an MCP does. 00:15:27.800 |
But importantly, it can also augment the results on the way back to the user. 00:15:32.600 |
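As a rough illustration of the shape of that idea, and not OpenRouter's actual internals, AI-native middleware might look something like this hypothetical sketch: a plugin gets a hook before inference and a hook that wraps the token stream on the way out:

```typescript
// Hypothetical sketch of AI-native middleware (not OpenRouter's actual
// internals): a plugin can do pre-flight work on the request AND transform
// tokens on the way back to the user, without waiting for the full output.
interface ChatRequest {
  model: string;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
}

interface InferencePlugin {
  // Pre-flight, like an MCP: e.g. fetch web results and inject them as context.
  preflight?(request: ChatRequest): Promise<ChatRequest>;
  // Post-flight: wrap the token stream, e.g. attach annotations mid-stream.
  transform?(tokens: AsyncIterable<string>): AsyncIterable<string>;
}

// Run inference with a chain of plugins wrapped around it.
async function* runWithPlugins(
  request: ChatRequest,
  plugins: InferencePlugin[],
  infer: (req: ChatRequest) => AsyncIterable<string>,
): AsyncGenerator<string> {
  for (const plugin of plugins) {
    if (plugin.preflight) request = await plugin.preflight(request);
  }
  let stream = infer(request);
  for (const plugin of plugins) {
    if (plugin.transform) stream = plugin.transform(stream);
  }
  yield* stream;
}
```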
So, here's an example of our web search plug-in, which augments every language model with the ability to search the web. 00:15:39.600 |
Every language model can just kind of tap into this plug-in and get web annotations as results are being fed back to users in real time. 00:15:52.600 |
So, there's no kind of, like, requirement that you get all of the tokens at once. 00:16:00.800 |
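On the API side, turning this on is meant to be a one-line change. Based on OpenRouter's web plugin docs, it looks roughly like the following; treat the exact field shapes, the ":online" shorthand, and the model ID as illustrative:

```typescript
// Hedged sketch of attaching web search to any model, per OpenRouter's
// web plugin docs; exact field shapes and model IDs are illustrative.
const viaPlugin = {
  model: "meta-llama/llama-3.3-70b-instruct",
  plugins: [{ id: "web" }],
  messages: [{ role: "user", content: "What happened in AI this week?" }],
};

// Shorthand: the ":online" model suffix enables the same plugin.
const viaSuffix = {
  model: "meta-llama/llama-3.3-70b-instruct:online",
  messages: [{ role: "user", content: "What happened in AI this week?" }],
};
```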
We solved a bunch of other tricky problems while building OpenRouter. 00:16:07.600 |
We really wanted to get extremely low latency. 00:16:11.600 |
And we got it down to about 30 milliseconds, the best in the industry, I believe, using a lot of custom cache work. 00:16:22.800 |
All these different providers have completely different stream cancellation policies. 00:16:27.600 |
Sometimes if you just drop a stream, the inference provider will bill you for the entire thing. 00:16:36.600 |
Sometimes it will bill you for the next 20 tokens that you never got. 00:16:40.800 |
And we work a lot to try to figure out these edge cases and understand when developers are going to care about them, too. 00:16:49.600 |
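As a generic client-side illustration of why this is messy (a sketch, not OpenRouter's internal handling): dropping a stream is just an abort signal from the client, and what happens upstream after that point depends entirely on the provider:

```typescript
// Generic sketch of mid-stream cancellation (not OpenRouter's internals).
// Aborting is easy for the client; what the upstream provider bills for
// after the abort varies: nothing, a few trailing tokens, or the whole
// generation. Assumes Node 18+ fetch and an OPENROUTER_API_KEY env var.
const controller = new AbortController();
setTimeout(() => controller.abort(), 1000); // stop reading after ~1s

const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "meta-llama/llama-3.3-70b-instruct", // illustrative model ID
    messages: [{ role: "user", content: "Write a long story." }],
    stream: true,
  }),
  signal: controller.signal,
});

// Consume server-sent events until the abort fires.
const reader = res.body!.getReader();
const decoder = new TextDecoder();
try {
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    process.stdout.write(decoder.decode(value));
  }
} catch {
  // AbortError lands here; the provider may keep generating (and billing).
}
```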
And standardizing all these providers and models became, like, a big tricky architecture problem that we spent a while working on. 00:16:59.600 |
We're going to add more modalities to OpenRouter. 00:17:02.600 |
And I think this is, like, a big change in the industry as well. 00:17:05.600 |
We're going to start seeing LLMs generate images. 00:17:09.800 |
We already have a few examples on the market. 00:17:13.600 |
Some people call these transfusion models: a transformer mixed with stable diffusion. 00:17:19.600 |
These are going to give images way more world knowledge and the ability to have a conversation with the image, 00:17:26.600 |
which we think is just critical for growing that industry and making it really work. 00:17:30.600 |
I just ran into somebody today who told me about their customer using a transfusion model 00:17:42.600 |
to generate a whole menu in a delivery app. 00:17:47.600 |
It's going to be really exciting and a big deal in the coming year. 00:17:53.600 |
We're also going to work on much more powerful routing. 00:17:57.800 |
We're doing geographical routing right now, but it's pretty minimal. 00:18:01.600 |
Routing people to the right GPU in the right place, plus enterprise-level optimizations, are coming. 00:18:06.600 |
Better prompt observability, better discovery of models. 00:18:14.600 |
Imagine being able to see, like, the best models that take Japanese and create Python code. 00:18:19.800 |
And, of course, even better prices coming soon. 00:18:23.600 |
So, you know, we believe in collaboration and building an ecosystem that's durable