Fun stories from building OpenRouter and where all this is going

00:00:00.000 | -

00:00:02.000 | All right, when I started Open Router

00:00:28.080 | at the beginning of 2023,

00:00:30.560 | I had one major question in mind.

00:00:34.040 | I was looking at this new market that was coming online

00:00:36.800 | and it was incredible.

00:00:38.700 | At the very end of 2022, we all saw ChatGPT

00:00:43.700 | and I got bitten by the AI bug.

00:00:46.240 | And I decided to look into answering this question,

00:00:50.460 | will this market be winner take all?

00:00:52.460 | Inference might be the largest market ever in software.

00:00:55.820 | And this seemed like a critical thing

00:00:58.460 | that everybody was assuming the answer to it would be yes.

00:01:02.240 | Open AI was just far and away the leading model.

00:01:06.540 | There were a few others that were coming up on its tail.

00:01:10.300 | And I built a couple prototypes to look into what they could be used for

00:01:16.900 | and also wanted to investigate open source.

00:01:19.060 | So, in this talk, which Swix named,

00:01:22.180 | I'm going to talk about the founding story of Open Router

00:01:25.980 | and go through a little bit of the hoops that we jumped through

00:01:30.660 | and sort of the investigation that we did

00:01:32.840 | as we put together this product that started as an experiment

00:01:37.080 | and kind of evolved into a marketplace over time.

00:01:39.880 | In January, we saw the first signs of people wanting other types of models.

00:01:50.480 | models, and the first evidence was moderation.

00:01:54.960 | This was like a very clear interest from users in looking for models

00:02:00.240 | where they could understand whether they'd be deplatformed

00:02:03.640 | or what the moderation policy of the company was.

00:02:06.960 | And we saw some people like generating novels

00:02:10.520 | where like it would be a detective story.

00:02:12.720 | And in chapter four, the detective would find someone

00:02:17.680 | who like commits a murder and shoots the victim.

00:02:20.200 | And Open AI at the time sometimes refused to generate that output

00:02:24.920 | or it was like questionably against the terms of service.

00:02:27.400 | And of course, we saw role play and basically a big gray area emerge

00:02:32.920 | around what models we're willing to generate.

00:02:37.080 | So in the next month, we saw the open source race begin.

00:02:44.080 | And that -- I'm going to do a little bit of an OG test here.

00:02:50.000 | Raise your hand if you ever used Bloom 176B.

00:02:55.640 | There's like 10 hands raised, or OPT by Facebook.

00:03:02.160 | It was like one of the earliest open source language models,

00:03:05.760 | about five hands raised.

00:03:07.000 | There were a couple of these emerging,

00:03:09.720 | and there were some very interesting projects

00:03:11.760 | to help people access them.

00:03:13.040 | And early days, they weren't really useful for very much.

00:03:18.680 | So we kept digging, and eventually like the open source community

00:03:27.800 | people like ran into Meta's first launch, which was Llama 1 in February.

00:03:33.960 | And Llama 1, in their abstract, advertised that it outperformed GPT-3

00:03:40.200 | on most benchmarks.

00:03:41.480 | You can see the highlighted part here, which blew everyone away.

00:03:44.760 | This was huge.

00:03:45.680 | An open weights model better than GPT-3.

00:03:50.800 | And especially a smaller model.

00:03:53.760 | This was the 13 billion parameter version,

00:03:56.120 | one that you could run on your laptop,

00:03:57.840 | outperforming a large server only, only like tons of money required

00:04:05.360 | to run inference companies model.

00:04:07.560 | And it was beating it on some benchmarks.

00:04:10.160 | Everyone lost their minds.

00:04:11.440 | And Llama kicked off a huge storm.

00:04:14.520 | It still was not very useful, I have to say.

00:04:18.320 | It was like a text completion model, for the most part.

00:04:21.640 | And it was very difficult to run locally.

00:04:24.480 | The infrastructure just wasn't there.

00:04:26.200 | And people were struggling to figure out what to do with it.

00:04:28.640 | Which is when we found, when we had the greatest moment of all,

00:04:33.760 | I think, for the birth of the long tail of language models,

00:04:38.240 | which was the first successful distillation in March of 2023.

00:04:44.040 | It was the first time I saw the transference of both style and knowledge

00:04:48.800 | from a large model onto a small one.

00:04:51.360 | And this was a huge unlock.

00:04:52.600 | Because it meant that not only did you not know how to do it,

00:04:55.200 | you know how to do it, you know how to do it,

00:04:56.440 | you know how to do it, you know how to do it.

00:04:57.640 | And then, you know, how to do it, you know, how to do it.

00:04:59.240 | And then, you know, how to do it.

00:05:00.360 | And then, you know, how to do it, you know, how to do it.

00:05:02.360 | And then, you know, how to do it.

00:05:03.520 | And then, you know, how to do it.

00:05:04.440 | And then, you know, how to do it.

00:05:05.840 | And then, you know, how to do it.

00:05:06.840 | And then, you know, how to do it.

00:05:07.840 | And then, you know, how to do it.

00:05:08.840 | And then, you know, how to do it.

00:05:09.840 | And then, you know, how to do it.

00:05:10.840 | And then, you know, how to do it.

00:05:11.840 | And then, you know, how to do it.

00:05:12.840 | And then, you know, how to do it.

00:05:13.840 | And then, you know, how to do it.

00:05:14.840 | And then, you know, how to do it.

00:05:15.840 | And then, you know, how to do it.

00:05:17.040 | And then, you know, how to do it.

00:05:18.240 | And then, you know, how to do it.

00:05:19.440 | And then, you know, how to do it.

00:05:20.440 | And then, you know, how to do it.

00:05:21.800 | And then, you know, how to do it.

00:05:22.800 | And then, you know, how to do it.

00:05:24.200 | And then, you know, how to do it.

00:05:25.400 | And then, you know, how to do it.

00:05:26.400 | And then, you know, how to do it.

00:05:27.400 | And then, you know, how to do it.

00:05:28.600 | And then, you know, how to do it.

00:05:29.600 | And then, you know, how to do it.

00:05:30.600 | And then, you know, how to do it.

00:05:31.600 | And then, you know, how to do it.

00:05:32.600 | And then, you know, how to do it.

00:05:33.600 | And then, you know, how to do it.

00:05:34.600 | And then, you know, how to do it.

00:05:35.600 | And then, you know, how to do it.

00:05:36.600 | And then, you know, how to do it.

00:05:37.600 | And then, you know, how to do it.

00:05:38.800 | And then, you know, how to do it.

00:05:40.600 | And then, you know, how to do it.

00:05:41.600 | And then, you know, how to do it.

00:05:42.600 | And then, you know, how to do it.

00:05:43.600 | And then, you know, how to do it.

00:05:44.600 | And then, you know, how to do it.

00:05:45.600 | And then, you know, how to do it.

00:05:46.600 | And then, you know, how to do it.

00:05:47.600 | And then, you know, how to do it.

00:05:48.600 | And then, you know, how to do it.

00:05:49.600 | And then, you know, how to do it.

00:05:50.600 | And then, you know, how to do it.

00:05:51.600 | And then, you know, how to do it.

00:05:52.600 | And then, you know, how to do it.

00:05:53.600 | And then, you know, how to do it.

00:05:54.600 | And then, you know, how to do it.

00:05:55.800 | Very few people used Alpaca.

00:05:57.600 | Raise your hands if you used Alpaca.

00:06:00.600 | I see about maybe 12.

00:06:05.600 | So it's like only double the number of people who used the,

00:06:09.600 | like, almost unusable open source models on the previous slide.

00:06:13.600 | So Open Router initially started as a place to collect all these things.

00:06:20.600 | But before we got there, I wanted to check out people's willingness

00:06:25.600 | to bring their own model to generic websites.

00:06:29.400 | Like, what if the developer didn't even know which model a user wanted to use?

00:06:34.400 | How would a user bring their choice of model to the software that they want?

00:06:39.400 | And in April, I launched Window AI, which was an open source Chrome extension

00:06:47.400 | that let a user choose their model and let a web app just kind of suck it in.

00:06:53.600 | And so you can see from the Chrome extension here, if you look really closely,

00:06:59.400 | this user is using Together's open source deployment of GPT-NEXT.

00:07:05.400 | I can't read it from here.

00:07:08.400 | But like an open source model that swaps out Open AI directly inside the web page.

00:07:15.600 | I can't read it from here, but I can't read it from here.

00:07:18.400 | So the next month, Open Router launched.

00:07:22.400 | And I co-founded it with the founder of the framework

00:07:26.400 | that Window AI was built on, Plasma, Lewis.

00:07:31.400 | And we started Open Router as first a place to collect all the models in one spot

00:07:36.400 | and help people figure out what to do with them.

00:07:38.600 | And it eventually grew into a place that gives you, like, better prices,

00:07:43.400 | better uptime, no subscription, and the most choice for figuring out

00:07:49.400 | which intelligence your software should run.

00:07:52.400 | So let's talk a little bit about what it is,

00:07:56.400 | because not everyone here might be familiar with it.

00:07:59.600 | We have been growing 10% to 100% month-over-month for the last two years.

00:08:08.400 | It is an API that lets you access all language models,

00:08:12.400 | and it's also become kind of the go-to place for data

00:08:17.400 | about who's using which model and how that is changing over time,

00:08:22.400 | which you can see on our public rankings page here.

00:08:24.400 | It's a single API that you pay for once,

00:08:27.600 | you get near zero switching costs to go from model to model.

00:08:32.400 | And we have over 400 models, over 60 active providers,

00:08:37.400 | and you can buy with lots of different payment methods, including crypto.

00:08:42.400 | And we basically do all the tricky work of normalizing tool calls

00:08:47.400 | and caching for you so that you get the best prices and the most features,

00:08:51.400 | and you don't have to worry about what the provider supports.

00:08:55.600 | And then we have to look at the results.

00:08:57.400 | Another story.

00:08:58.400 | Initially, Open Router was not a marketplace, really.

00:09:02.400 | It was just kind of a collection of all the models

00:09:04.400 | and a way to explore data about who was using each one.

00:09:07.400 | So how did we get here?

00:09:08.400 | Initially, when the first open source models emerged,

00:09:13.400 | we only had like one or two providers for each one,

00:09:17.400 | and so we had like a primary provider and a fallback provider.

00:09:20.600 | Initially, that was it, and we didn't even name the providers.

00:09:25.400 | But it became clear that there were going to be a bunch of companies

00:09:30.400 | that wanted to host these models and at very different prices and performances.

00:09:36.600 | the number of features ballooned.

00:09:39.400 | There were companies that supported the Min-P sampler and most didn't.

00:09:43.400 | There were some that supported caching,

00:09:45.400 | some that supported tool calling and structured outputs,

00:09:47.400 | and others that didn't.

00:09:48.400 | And suddenly the ecosystem was just ballooning into this kind of out-of-control,

00:09:53.400 | heterogeneous monster, and we wanted to tame the monster.

00:09:58.600 | So we aggregated all providers in one spot,

00:10:03.400 | and at different price points, it became a marketplace.

00:10:06.400 | And you can see like this model, Llama 3.370B Instruct,

00:10:10.400 | it has one of the models with the most providers on the platform,

00:10:14.400 | and it has like 23.

00:10:18.400 | Closed source models also had something interesting happen to them,

00:10:23.400 | which is that they just couldn't keep up with the demand.

00:10:26.600 | And so we helped developers basically get uptime boosting,

00:10:32.400 | and you can see like the delta and how much we can boost uptime

00:10:37.400 | just by aggregating lots of different providers for a model.

00:10:40.400 | And this became really helpful for people using open source, or closed source.

00:10:44.400 | And we became a marketplace for both, showing graphs about latency and throughput

00:10:49.400 | and helping people figure out, using real-world data,

00:10:52.400 | what the latency and throughput is on each model.

00:10:55.600 | And that is how OpenRouter became a marketplace,

00:10:59.400 | and one optimized for language models,

00:11:01.400 | which I thought would be proper for inference,

00:11:06.400 | but potentially the biggest market in software.

00:11:08.400 | You can obviously, a couple of other things that we support,

00:11:13.400 | comparing models using your own prompts with the ease

00:11:16.400 | of just texting an iMessage,

00:11:18.600 | fine-grained privacy controls with API-level overrides,

00:11:23.400 | the ability to see like your usage of all models in one place

00:11:27.400 | and have great observability.

00:11:28.400 | And back to the original question here of whether,

00:11:32.400 | will intelligence be winner-take-all?

00:11:34.400 | We've come to the most likely bet that that is not the case.

00:11:40.600 | Here's our data broken down by model author.

00:11:46.400 | How many tokens have been processed by each one?

00:11:49.400 | And you can see Google Gemini started pretty low,

00:11:54.400 | like roughly 2%, 3% in June of last year,

00:11:58.400 | and just has grown to 34%, 35% pretty steadily over the last 12 months.

00:12:07.600 | So, we've got a lot of data.

00:12:09.400 | Anthropic is like one of the most popular models on our platform.

00:12:13.400 | OpenAI is a little bit underrepresented in this data,

00:12:16.400 | because a lot of developers use us to get OpenAI-like behavior

00:12:20.400 | for all other models.

00:12:22.400 | But OpenAI has grown a lot here as well.

00:12:25.400 | So, here's what we believe about the market,

00:12:30.400 | after all of the, you know, back story that I just gave you.

00:12:35.600 | The future is going to be multi-model.

00:12:38.400 | All of our customers, tons of customers,

00:12:40.400 | use different models for different purposes

00:12:42.400 | and realize they can unlock huge gains by doing so.

00:12:45.400 | Inference is also a commodity.

00:12:47.400 | Claude from Bedrock,

00:12:49.400 | we want to make it look exactly the same as Claude from Vertex.

00:12:53.400 | And we do that because, like, the two hyperscalers

00:12:56.400 | have fundamentally, you know, the same commodity being delivered

00:13:01.600 | at different rates, different performances.

00:13:04.400 | And for a developer, you just want to be able to, like, select that

00:13:07.400 | without worrying about who is serving it.

00:13:09.400 | We think inference will be, like, a dominant operating expense,

00:13:14.400 | and selecting and routing will be crucial.

00:13:17.400 | You can see the number of active models on OpenRouter

00:13:21.400 | has just steadily grown.

00:13:22.600 | It's not the case that people just hop from model to model.

00:13:27.400 | Like, it tends to be sticky.

00:13:29.400 | And we're trying to just make this wild ecosystem

00:13:34.400 | a lot more homogeneous and easier to work with as a developer.

00:13:39.600 | So, to honor SWIX's title for this presentation,

00:13:45.400 | let's give a technical story.

00:13:47.400 | It's something that we've worked on

00:13:49.400 | in the process of building the company.

00:13:51.400 | And that was our own idea for how to do an MCP within OpenRouter.

00:13:58.400 | So, we don't have MCPs.

00:14:00.400 | We don't have an MCP marketplace.

00:14:02.600 | But we did run into the need to expand inference

00:14:09.400 | with new features and new abilities.

00:14:12.400 | For example, searching the web for all models.

00:14:15.400 | PDF parsing for all models.

00:14:17.400 | You know, other interesting things coming soon.

00:14:21.400 | And what we really wanted to do was give these abilities to all models.

00:14:25.400 | But that involves not just the pre-flight work that MCPs do today,

00:14:32.400 | where you can kind of call another API, get a bunch of behaviors,

00:14:38.200 | and then have the inference process access those behaviors as it goes.

00:14:42.200 | We also needed the ability to transform the outputs on the way to the user.

00:14:47.200 | And so, what we really, really needed was something more like middleware.

00:14:52.200 | Middleware is kind of a common concept in web development.

00:14:57.000 | You set up middleware when you're setting up authentication, for example,

00:15:01.200 | or caching for a web app.

00:15:04.000 | And so, we came up with a type of middleware that's AI-native and optimized for inference.

00:15:11.000 | And that looks not totally dissimilar from the way middleware looks in Next.js or web development.

00:15:18.000 | So, pardon the code on the screen.

00:15:20.000 | But this is a little bit about how our plug-in system looks.

00:15:24.000 | And it can call MCPs from inside a plug-in.

00:15:27.800 | But importantly, it can also augment the results on the way back to the user.

00:15:32.600 | So, here's an example of our web search plug-in, which augments every language model with the ability to search the web.

00:15:39.600 | Every language model can just kind of tap into this plug-in and get web annotations as results are being fed back to users in real time.

00:15:49.800 | And this all happens in a stream.

00:15:52.600 | So, there's no kind of, like, requirement that you get all of the tokens at once.

00:15:58.600 | It can just happen live in the stream.

00:16:00.800 | We solved a bunch of other tricky problems while building Open Router.

00:16:07.600 | We really wanted to get extremely low latency.

00:16:11.600 | And we got it down to about 30 milliseconds, the best in the industry, I believe, using a lot of custom cache work.

00:16:19.600 | And we also need to make streams cancelable.

00:16:22.800 | All these different providers have completely different stream cancellation policies.

00:16:27.600 | Sometimes if you just drop a stream, the inference provider will bill you for the entire thing.

00:16:34.600 | Sometimes it won't.

00:16:36.600 | Sometimes it will bill you for the next 20 tokens that you never got.

00:16:40.800 | And we work a lot to try to figure out these edge cases and understand when developers are going to care about them, too.

00:16:49.600 | And standardizing all these providers and models became, like, a big tricky architecture problem that we spent a while working on.

00:16:57.600 | So, here's where all this is going.

00:16:59.600 | We're going to add more modalities to Open Router.

00:17:02.600 | And I think this is, like, a big change in the industry as well.

00:17:05.600 | We're going to start seeing LLMs generate images.

00:17:09.800 | We already have a few examples on the market.

00:17:13.600 | But, like, some people call it transfusion models, a transformer mixed with stable diffusion.

00:17:19.600 | These are going to give images way more world knowledge and the ability to have a conversation with the image,

00:17:26.600 | which we think is just critical for growing that industry and making it really work.

00:17:30.600 | Imagine, I just ran into somebody today who is using a transfusion model, or who told me about their customer using a transfusion model,

00:17:38.800 | to generate menus.

00:17:40.600 | Imagine doing that.

00:17:41.600 | Imagine doing that.

00:17:42.600 | Like, a whole menu, like, in a delivery app, generated by a transfusion model.

00:17:47.600 | It's going to be really exciting and a big deal in the coming year.

00:17:53.600 | We're also going to work on much more powerful routing.

00:17:55.600 | Like, routing is our bread and butter.

00:17:57.800 | And so, doing geographical routing right now, it's pretty minimal.

00:18:01.600 | But routing people to the right GPU in the right place and doing enterprise-level optimizations coming.

00:18:06.600 | Better prompt observability, better discovery of models.

00:18:11.600 | Like, really fine-grained categorization.

00:18:14.600 | Imagine being able to see, like, the best models that take Japanese and create Python code.

00:18:19.800 | And, of course, even better prices coming soon.

00:18:23.600 | So, you know, we believe in collaboration and building an ecosystem that's durable

00:18:31.600 | and with low vendor lock-in.

00:18:33.600 | So, you know, collaborate with us.

00:18:35.800 | Here's our email.

00:18:37.600 | And if you're interested, join us, too.

00:18:39.600 | Thank you.

00:18:41.600 | We'll see you next time.

Fun stories from building OpenRouter and where all this is going - Alex Atallah, OpenRouter

Chapters