
Fun stories from building OpenRouter and where all this is going - Alex Atallah, OpenRouter


Chapters

0:00 The Genesis of OpenRouter
1:16 Initial Question. The story begins in early 2023 with the founder, Alex Atallah, pondering if the AI inference market would be dominated by a single player. He noticed the emergence of new models beyond OpenAI and a growing desire from developers to understand the nuances of different models, including their moderation policies.
2:35 The Rise of Open Source. The video highlights the beginning of the open-source AI race, with early models like Bloom 176B and OPT from Facebook. A pivotal moment was the release of Meta's Llama 1 in February, which surprisingly outperformed GPT-3 on many benchmarks, signaling a shift in the landscape.
4:38 The Alpaca Moment. A major breakthrough occurred in March 2023 with the distillation of Alpaca. Stanford researchers demonstrated that by fine-tuning Llama 1 with outputs from GPT-3, they could transfer the style and knowledge of a larger model to a smaller one for less than $600. This proved that creating powerful, specialized models no longer required massive budgets.
6:43 Window AI. Before OpenRouter, Atallah launched Window AI, an open-source Chrome extension that empowered users to select their preferred LLM for any web application. This project laid the groundwork for what was to come.
7:18 The Launch of OpenRouter. OpenRouter was co-founded with Lewis, the creator of the framework that Window AI was built on. Initially, it was a simple aggregator to collect models in one place.
7:57 Growth and Evolution. OpenRouter quickly evolved into a marketplace, driven by the proliferation of model providers with varying prices, performance, and features. The platform has seen impressive growth, with a 10-100% month-over-month increase for two years. It now offers a single API for over 400 models from more than 60 providers.
8:57 Marketplace Dynamics. The transition to a marketplace was a response to the complexity of the growing AI ecosystem. By aggregating providers, OpenRouter helps developers achieve better uptime for both open-source and closed-source models and provides valuable data on latency and throughput.
17:02 Expanding Modalities. The future vision for OpenRouter includes incorporating models that can generate images and "transfusion models" that allow for conversations with images.
17:51 Smarter Routing. The platform plans to implement more sophisticated routing mechanisms, including geographical routing and enterprise-level optimizations for GPU allocation.
18:07 Enhanced Discovery. To help developers find the best models for their needs, OpenRouter aims to improve prompt observability, introduce more granular model categorization, and continue to offer competitive pricing.

Transcript

- All right, when I started Open Router at the beginning of 2023, I had one major question in mind. I was looking at this new market that was coming online and it was incredible. At the very end of 2022, we all saw ChatGPT and I got bitten by the AI bug.

And I decided to look into answering this question: will this market be winner-take-all? Inference might be the largest market ever in software. And this seemed like a critical question that everybody was assuming the answer to would be yes. OpenAI was just far and away the leading model.

There were a few others that were coming up on its tail. And I built a couple prototypes to look into what they could be used for and also wanted to investigate open source. So, in this talk, which Swix named, I'm going to talk about the founding story of Open Router and go through a little bit of the hoops that we jumped through and sort of the investigation that we did as we put together this product that started as an experiment and kind of evolved into a marketplace over time.

In January, we saw the first signs of people wanting other types of models, and the first evidence was moderation. This was like a very clear interest from users in looking for models where they could understand whether they'd be deplatformed or what the moderation policy of the company was.

And we saw some people like generating novels where like it would be a detective story. And in chapter four, the detective would find someone who like commits a murder and shoots the victim. And Open AI at the time sometimes refused to generate that output or it was like questionably against the terms of service.

And of course, we saw role play and basically a big gray area emerge around what models were willing to generate. So in the next month, we saw the open source race begin. And that -- I'm going to do a little bit of an OG test here. Raise your hand if you ever used Bloom 176B.

There's like 10 hands raised. Or OPT by Facebook? That was like one of the earliest open source language models; about five hands raised. There were a couple of these emerging, and there were some very interesting projects to help people access them. And early days, they weren't really useful for very much.

So we kept digging, and eventually like the open source community people like ran into Meta's first launch, which was Llama 1 in February. And Llama 1, in their abstract, advertised that it outperformed GPT-3 on most benchmarks. You can see the highlighted part here, which blew everyone away. This was huge.

An open weights model better than GPT-3. And especially a smaller model. This was the 13 billion parameter version, one that you could run on your laptop, outperforming a large, server-only model that took tons of money to run inference on. And it was beating it on some benchmarks.

Everyone lost their minds. And Llama kicked off a huge storm. It still was not very useful, I have to say. It was like a text completion model, for the most part. And it was very difficult to run locally. The infrastructure just wasn't there. And people were struggling to figure out what to do with it.

Which is when we had the greatest moment of all, I think, for the birth of the long tail of language models: the first successful distillation, in March of 2023. It was the first time I saw the transference of both style and knowledge from a large model onto a small one.

And this was a huge unlock, because it meant that you no longer needed a massive budget to create a capable, specialized model: Stanford's Alpaca team fine-tuned Llama 1 on outputs from GPT-3 and transferred the style and knowledge of the larger model onto the smaller one for less than $600.

Very few people used Alpaca, though. Raise your hands if you used Alpaca. I see about maybe 12. So it's like only double the number of people who used the, like, almost unusable open source models on the previous slide.

So Open Router initially started as a place to collect all these things. But before we got there, I wanted to check out people's willingness to bring their own model to generic websites. Like, what if the developer didn't even know which model a user wanted to use? How would a user bring their choice of model to the software that they want?

And in April, I launched Window AI, which was an open source Chrome extension that let a user choose their model and let a web app just kind of suck it in. And so you can see from the Chrome extension here, if you look really closely, this user is using Together's open source deployment of GPT-NEXT.

I can't read it from here. But like an open source model that swaps out OpenAI directly inside the web page. So the next month, OpenRouter launched. And I co-founded it with Lewis, the founder of Plasmo, the framework that Window AI was built on.

And we started Open Router as first a place to collect all the models in one spot and help people figure out what to do with them. And it eventually grew into a place that gives you, like, better prices, better uptime, no subscription, and the most choice for figuring out which intelligence your software should run.

So let's talk a little bit about what it is, because not everyone here might be familiar with it. We have been growing 10% to 100% month-over-month for the last two years. It is an API that lets you access all language models, and it's also become kind of the go-to place for data about who's using which model and how that is changing over time, which you can see on our public rankings page here.

It's a single API that you pay for once, and you get near-zero switching costs to go from model to model. And we have over 400 models, over 60 active providers, and you can buy with lots of different payment methods, including crypto. And we basically do all the tricky work of normalizing tool calls and caching for you so that you get the best prices and the most features, and you don't have to worry about what the provider supports.
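To make the "single API" idea concrete, here is a minimal sketch of what a call might look like through an OpenAI-compatible client pointed at OpenRouter; the model slug and prompt are illustrative examples, not anything specific from the talk.

```python
# Minimal sketch of calling OpenRouter through an OpenAI-compatible client.
# The base URL and request shape follow OpenRouter's OpenAI-compatible API;
# the model slug below is just an example and can be swapped freely.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",
)

completion = client.chat.completions.create(
    model="meta-llama/llama-3.3-70b-instruct",  # example slug
    messages=[{"role": "user", "content": "Give me a one-line summary of Llama 1."}],
)
print(completion.choices[0].message.content)
```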

Now, another story. Initially, OpenRouter was not a marketplace, really. It was just kind of a collection of all the models and a way to explore data about who was using each one. So how did we get here? Initially, when the first open source models emerged, we only had like one or two providers for each one, and so we had like a primary provider and a fallback provider.

Initially, that was it, and we didn't even name the providers. But it became clear that there were going to be a bunch of companies that wanted to host these models, at very different prices and performances. The number of features ballooned. There were companies that supported the Min-P sampler, and most didn't.

There were some that supported caching, some that supported tool calling and structured outputs, and others that didn't. And suddenly the ecosystem was just ballooning into this kind of out-of-control, heterogeneous monster, and we wanted to tame the monster. So we aggregated all providers in one spot, and at different price points, it became a marketplace.

And you can see like this model, Llama 3.3 70B Instruct: it's one of the models with the most providers on the platform, with like 23. Closed source models also had something interesting happen to them, which is that they just couldn't keep up with the demand. And so we helped developers basically get uptime boosting, and you can see like the delta and how much we can boost uptime just by aggregating lots of different providers for a model.
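As a hedged sketch of what this provider marketplace looks like from the developer's side, a request can roughly express provider preferences and fallbacks like the example below; the provider-routing fields and the provider names are illustrative assumptions rather than a definitive schema.

```python
# Hedged sketch: asking the router to prefer certain providers for a model
# and to fall back to others if they are down or overloaded. Field names
# follow OpenRouter's provider-routing options as I understand them; treat
# the exact schema and the provider names as illustrative.
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_OPENROUTER_API_KEY"},
    json={
        "model": "meta-llama/llama-3.3-70b-instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "provider": {
            "order": ["DeepInfra", "Together"],  # example provider names
            "allow_fallbacks": True,             # reroute on failure or overload
        },
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```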

And this became really helpful for people using open source or closed source. And we became a marketplace for both, showing graphs about latency and throughput and helping people figure out, using real-world data, what the latency and throughput is on each model. And that is how OpenRouter became a marketplace, and one optimized for language models, which I thought was fitting for inference, potentially the biggest market in software.

Obviously, there are a couple of other things that we support: comparing models using your own prompts with the ease of just sending an iMessage, fine-grained privacy controls with API-level overrides, and the ability to see your usage of all models in one place and have great observability. And back to the original question here: will intelligence be winner-take-all?

We've come to the bet that it most likely is not. Here's our data broken down by model author. How many tokens have been processed by each one? And you can see Google Gemini started pretty low, like roughly 2%, 3% in June of last year, and just has grown to 34%, 35% pretty steadily over the last 12 months.

So, we've got a lot of data. Anthropic's models are like some of the most popular on our platform. OpenAI is a little bit underrepresented in this data, because a lot of developers use us to get OpenAI-like behavior for all other models. But OpenAI has grown a lot here as well.

So, here's what we believe about the market, after all of the, you know, back story that I just gave you. The future is going to be multi-model. All of our customers, tons of customers, use different models for different purposes and realize they can unlock huge gains by doing so.

Inference is also a commodity. Claude from Bedrock, we want to make it look exactly the same as Claude from Vertex. And we do that because, like, the two hyperscalers have fundamentally, you know, the same commodity being delivered at different rates, different performances. And for a developer, you just want to be able to, like, select that without worrying about who is serving it.

We think inference will be, like, a dominant operating expense, and selecting and routing will be crucial. You can see the number of active models on OpenRouter has just steadily grown. It's not the case that people just hop from model to model. Like, it tends to be sticky. And we're trying to just make this wild ecosystem a lot more homogeneous and easier to work with as a developer.

So, to honor SWIX's title for this presentation, let's give a technical story. It's something that we've worked on in the process of building the company. And that was our own idea for how to do an MCP within OpenRouter. So, we don't have MCPs. We don't have an MCP marketplace.

But we did run into the need to expand inference with new features and new abilities. For example, searching the web for all models. PDF parsing for all models. You know, other interesting things coming soon. And what we really wanted to do was give these abilities to all models. But that involves not just the pre-flight work that MCPs do today, where you can kind of call another API, get a bunch of behaviors, and then have the inference process access those behaviors as it goes.

We also needed the ability to transform the outputs on the way to the user. And so, what we really, really needed was something more like middleware. Middleware is kind of a common concept in web development. You set up middleware when you're setting up authentication, for example, or caching for a web app.

And so, we came up with a type of middleware that's AI-native and optimized for inference. And that looks not totally dissimilar from the way middleware looks in Next.js or web development. So, pardon the code on the screen. But this is a little bit about how our plug-in system looks.
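The plug-in code shown on the slide isn't reproduced in this transcript, so here is a rough stand-in sketch of the middleware pattern being described: pre-flight work before inference, plus transformation of the output stream on the way back. This is purely illustrative and is not OpenRouter's actual plug-in code.

```python
# Illustrative sketch of AI-native middleware, NOT OpenRouter's actual code.
# A plugin can do pre-flight work (e.g. call an MCP or a search API) and can
# also transform output chunks as they stream back to the user.
from typing import AsyncIterator, Callable

class WebSearchPlugin:
    async def before_request(self, request: dict) -> dict:
        # Pre-flight: e.g. run a web search and attach the results to the prompt.
        request.setdefault("context", []).append("<search results would go here>")
        return request

    async def transform_stream(self, chunks: AsyncIterator[str]) -> AsyncIterator[str]:
        # Post-flight: annotate or rewrite tokens live, while still streaming.
        async for chunk in chunks:
            yield chunk  # e.g. attach web citations to relevant spans here

async def run_with_plugins(
    request: dict,
    infer: Callable[[dict], AsyncIterator[str]],
    plugins: list,
) -> AsyncIterator[str]:
    # Apply every plugin's pre-flight step, run inference, then wrap the
    # output stream with every plugin's transform step.
    for plugin in plugins:
        request = await plugin.before_request(request)
    stream = infer(request)
    for plugin in plugins:
        stream = plugin.transform_stream(stream)
    async for chunk in stream:
        yield chunk
```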

And it can call MCPs from inside a plug-in. But importantly, it can also augment the results on the way back to the user. So, here's an example of our web search plug-in, which augments every language model with the ability to search the web. Every language model can just kind of tap into this plug-in and get web annotations as results are being fed back to users in real time.

And this all happens in a stream. So, there's no kind of, like, requirement that you get all of the tokens at once. It can just happen live in the stream. We solved a bunch of other tricky problems while building Open Router. We really wanted to get extremely low latency.
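From the developer's side, tapping into that web-search plug-in and reading the augmented output as a stream might look roughly like the sketch below; the "plugins" field reflects OpenRouter's web plug-in as I understand it, and the exact schema, model slug, and SSE handling should be treated as illustrative assumptions.

```python
# Hedged usage sketch: enabling the web-search plugin on an arbitrary model
# and reading the OpenAI-style SSE stream chunk by chunk. Treat the request
# schema as illustrative rather than authoritative.
import json
import requests

with requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_OPENROUTER_API_KEY"},
    json={
        "model": "meta-llama/llama-3.3-70b-instruct",  # any model; the plugin augments it
        "plugins": [{"id": "web"}],                    # add web search to this request
        "stream": True,
        "messages": [{"role": "user", "content": "What happened in AI this week?"}],
    },
    stream=True,
    timeout=120,
) as resp:
    for line in resp.iter_lines():
        if line and line.startswith(b"data: ") and line != b"data: [DONE]":
            chunk = json.loads(line[len(b"data: "):])
            delta = chunk["choices"][0]["delta"].get("content")
            if delta:
                print(delta, end="", flush=True)
```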

And we got it down to about 30 milliseconds, the best in the industry, I believe, using a lot of custom cache work. And we also needed to make streams cancelable. All these different providers have completely different stream cancellation policies. Sometimes if you just drop a stream, the inference provider will bill you for the entire thing.

Sometimes it won't. Sometimes it will bill you for the next 20 tokens that you never got. And we work a lot to try to figure out these edge cases and understand when developers are going to care about them, too. And standardizing all these providers and models became, like, a big tricky architecture problem that we spent a while working on.
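As a small, hedged illustration of the client side of this problem, cancelling a stream usually just means closing the connection early, as in the sketch below; what the upstream provider then bills for the unread remainder is exactly the edge case described above, and it varies by provider.

```python
# Hedged sketch: cancelling a streamed completion client-side by closing the
# connection once enough output has arrived. How the upstream provider bills
# for the cancelled remainder differs from provider to provider.
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_OPENROUTER_API_KEY"},
    json={
        "model": "meta-llama/llama-3.3-70b-instruct",
        "stream": True,
        "messages": [{"role": "user", "content": "Write a very long story."}],
    },
    stream=True,
    timeout=120,
)
for i, line in enumerate(resp.iter_lines()):
    if i > 50:           # stop early once we have enough output
        resp.close()     # drop the stream; billing for the rest depends on the provider
        break
```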

So, here's where all this is going. We're going to add more modalities to OpenRouter. And I think this is, like, a big change in the industry as well. We're going to start seeing LLMs generate images. We already have a few examples on the market. But, like, some people call them transfusion models: a transformer mixed with stable diffusion.

These are going to give images way more world knowledge and the ability to have a conversation with the image, which we think is just critical for growing that industry and making it really work. I just ran into somebody today who told me about their customer using a transfusion model to generate menus.

Imagine doing that. Like, a whole menu, like, in a delivery app, generated by a transfusion model. It's going to be really exciting and a big deal in the coming year. We're also going to work on much more powerful routing. Like, routing is our bread and butter.

And so, we're doing geographical routing right now, but it's pretty minimal. Routing people to the right GPU in the right place, and enterprise-level optimizations, are coming. Better prompt observability, better discovery of models. Like, really fine-grained categorization. Imagine being able to see, like, the best models that take Japanese and create Python code.

And, of course, even better prices coming soon. So, you know, we believe in collaboration and building an ecosystem that's durable and has low vendor lock-in. So, you know, collaborate with us. Here's our email. And if you're interested, join us, too. Thank you. We'll see you next time.