
Fun stories from building OpenRouter and where all this is going - Alex Atallah, OpenRouter


Chapters

0:00 The Genesis of OpenRouter
1:16 Initial Question. The story begins in early 2023 with the founder, Alex Atallah, pondering whether the AI inference market would be dominated by a single player. He noticed the emergence of new models beyond OpenAI and a growing desire from developers to understand the nuances of different models, including their moderation policies.
2:35 The Rise of Open Source. The video highlights the beginning of the open-source AI race, with early models like BLOOM 176B and OPT from Facebook. A pivotal moment was the release of Meta's Llama 1 in February 2023, which surprisingly outperformed GPT-3 on many benchmarks, signaling a shift in the landscape.
4:38 The Alpaca Moment. A major breakthrough occurred in March 2023 with the distillation of Alpaca. Stanford researchers demonstrated that by fine-tuning Llama 1 on outputs from GPT-3, they could transfer the style and knowledge of a larger model to a smaller one for less than $600. This proved that creating powerful, specialized models no longer required massive budgets.
6:43 Window AI. Before OpenRouter, Atallah launched Window AI, an open-source Chrome extension that empowered users to select their preferred LLM for any web application. This project laid the groundwork for what was to come.
7:18 The Launch of OpenRouter. OpenRouter was co-founded with Louis, the creator of Plasmo, the framework that Window AI was built on. Initially, it was a simple aggregator that collected models in one place.
7:57 Growth and Evolution. OpenRouter quickly evolved into a marketplace, driven by the proliferation of model providers with varying prices, performance, and features. The platform has seen impressive growth of 10-100% month-over-month for two years. It now offers a single API for over 400 models from more than 60 providers.
8:57 Marketplace Dynamics. The transition to a marketplace was a response to the complexity of the growing AI ecosystem. By aggregating providers, OpenRouter helps developers achieve better uptime for both open-source and closed-source models and provides valuable data on latency and throughput.
17:02 Expanding Modalities. The future vision for OpenRouter includes incorporating models that can generate images and "transfusion models" that allow for conversations with images.
17:51 Smarter Routing. The platform plans to implement more sophisticated routing mechanisms, including geographical routing and enterprise-level optimizations for GPU allocation.
18:07 Enhanced Discovery. To help developers find the best models for their needs, OpenRouter aims to improve prompt observability, introduce more granular model categorization, and continue to offer competitive pricing.

Whisper Transcript

00:00:02.000 | All right, when I started OpenRouter
00:00:28.080 | at the beginning of 2023,
00:00:30.560 | I had one major question in mind.
00:00:34.040 | I was looking at this new market that was coming online
00:00:36.800 | and it was incredible.
00:00:38.700 | At the very end of 2022, we all saw ChatGPT
00:00:43.700 | and I got bitten by the AI bug.
00:00:46.240 | And I decided to look into answering this question,
00:00:50.460 | will this market be winner take all?
00:00:52.460 | Inference might be the largest market ever in software.
00:00:55.820 | And this seemed like a critical thing
00:00:58.460 | that everybody was assuming the answer to it would be yes.
00:01:02.240 | OpenAI was just far and away the leading model.
00:01:06.540 | There were a few others that were coming up on its tail.
00:01:10.300 | And I built a couple prototypes to look into what they could be used for
00:01:16.900 | and also wanted to investigate open source.
00:01:19.060 | So, in this talk, which swyx named,
00:01:22.180 | I'm going to talk about the founding story of OpenRouter
00:01:25.980 | and go through a little bit of the hoops that we jumped through
00:01:30.660 | and sort of the investigation that we did
00:01:32.840 | as we put together this product that started as an experiment
00:01:37.080 | and kind of evolved into a marketplace over time.
00:01:39.880 | In January, we saw the first signs of people wanting other types of models.
00:01:50.480 | And the first evidence was moderation.
00:01:54.960 | This was like a very clear interest from users in looking for models
00:02:00.240 | where they could understand whether they'd be deplatformed
00:02:03.640 | or what the moderation policy of the company was.
00:02:06.960 | And we saw some people like generating novels
00:02:10.520 | where like it would be a detective story.
00:02:12.720 | And in chapter four, the detective would find someone
00:02:17.680 | who like commits a murder and shoots the victim.
00:02:20.200 | And OpenAI at the time sometimes refused to generate that output
00:02:24.920 | or it was like questionably against the terms of service.
00:02:27.400 | And of course, we saw role play and basically a big gray area emerge
00:02:32.920 | around what models were willing to generate.
00:02:37.080 | So in the next month, we saw the open source race begin.
00:02:44.080 | And that -- I'm going to do a little bit of an OG test here.
00:02:50.000 | Raise your hand if you ever used Bloom 176B.
00:02:55.640 | There's like 10 hands raised, or OPT by Facebook.
00:03:02.160 | It was like one of the earliest open source language models,
00:03:05.760 | about five hands raised.
00:03:07.000 | There were a couple of these emerging,
00:03:09.720 | and there were some very interesting projects
00:03:11.760 | to help people access them.
00:03:13.040 | And early days, they weren't really useful for very much.
00:03:18.680 | So we kept digging, and eventually the open source community
00:03:27.800 | ran into Meta's first launch, which was Llama 1 in February.
00:03:33.960 | And Llama 1, in their abstract, advertised that it outperformed GPT-3
00:03:40.200 | on most benchmarks.
00:03:41.480 | You can see the highlighted part here, which blew everyone away.
00:03:44.760 | This was huge.
00:03:45.680 | An open weights model better than GPT-3.
00:03:50.800 | And especially a smaller model.
00:03:53.760 | This was the 13 billion parameter version,
00:03:56.120 | one that you could run on your laptop,
00:03:57.840 | outperforming a large, server-only model from an inference company,
00:04:05.360 | the kind that required tons of money to run.
00:04:07.560 | And it was beating it on some benchmarks.
00:04:10.160 | Everyone lost their minds.
00:04:11.440 | And Llama kicked off a huge storm.
00:04:14.520 | It still was not very useful, I have to say.
00:04:18.320 | It was like a text completion model, for the most part.
00:04:21.640 | And it was very difficult to run locally.
00:04:24.480 | The infrastructure just wasn't there.
00:04:26.200 | And people were struggling to figure out what to do with it.
00:04:28.640 | Which is when we found, when we had the greatest moment of all,
00:04:33.760 | I think, for the birth of the long tail of language models,
00:04:38.240 | which was the first successful distillation in March of 2023.
00:04:44.040 | It was the first time I saw the transference of both style and knowledge
00:04:48.800 | from a large model onto a small one.
00:04:51.360 | And this was a huge unlock.
00:04:52.600 | Because it meant that you no longer needed a massive budget
00:04:55.200 | to transfer a big model's style and knowledge into a small one:
00:04:57.640 | Alpaca was fine-tuned from Llama 1 on GPT-3 outputs for under $600.
00:05:55.800 | Very few people used Alpaca.
00:05:57.600 | Raise your hands if you used Alpaca.
00:06:00.600 | I see about maybe 12.
00:06:05.600 | So it's like only double the number of people who used the,
00:06:09.600 | like, almost unusable open source models on the previous slide.
00:06:13.600 | So OpenRouter initially started as a place to collect all these things.
00:06:20.600 | But before we got there, I wanted to check out people's willingness
00:06:25.600 | to bring their own model to generic websites.
00:06:29.400 | Like, what if the developer didn't even know which model a user wanted to use?
00:06:34.400 | How would a user bring their choice of model to the software that they want?
00:06:39.400 | And in April, I launched Window AI, which was an open source Chrome extension
00:06:47.400 | that let a user choose their model and let a web app just kind of suck it in.
00:06:53.600 | And so you can see from the Chrome extension here, if you look really closely,
00:06:59.400 | this user is using Together's open source deployment of GPT-NeoXT.
00:07:05.400 | I can't read it from here.
00:07:08.400 | But like an open source model that swaps out OpenAI directly inside the web page.
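
For readers following along: the pattern was that the extension injects an object into the page, and the web app calls that object instead of hard-coding a provider, so the user's chosen model serves the request. Below is a minimal TypeScript sketch of the idea; the getCompletion method name and shapes are a loose reconstruction of the Window AI interface, not its actual source.

type Message = { role: "user" | "assistant" | "system"; content: string };

// Shape the extension is assumed to inject into every page.
interface WindowAI {
  getCompletion(input: { messages: Message[] }): Promise<{ message: Message }>;
}

async function complete(prompt: string): Promise<string> {
  const ai = (window as unknown as { ai?: WindowAI }).ai;
  if (!ai) throw new Error("Window AI extension not installed");
  // The extension, not the web app, decides which model serves this call.
  const result = await ai.getCompletion({
    messages: [{ role: "user", content: prompt }],
  });
  return result.message.content;
}
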
00:07:18.400 | So the next month, OpenRouter launched.
00:07:22.400 | And I co-founded it with the founder of the framework
00:07:26.400 | that Window AI was built on, Plasmo, Louis.
00:07:31.400 | And we started OpenRouter first as a place to collect all the models in one spot
00:07:36.400 | and help people figure out what to do with them.
00:07:38.600 | And it eventually grew into a place that gives you, like, better prices,
00:07:43.400 | better uptime, no subscription, and the most choice for figuring out
00:07:49.400 | which intelligence your software should run.
00:07:52.400 | So let's talk a little bit about what it is,
00:07:56.400 | because not everyone here might be familiar with it.
00:07:59.600 | We have been growing 10% to 100% month-over-month for the last two years.
00:08:08.400 | It is an API that lets you access all language models,
00:08:12.400 | and it's also become kind of the go-to place for data
00:08:17.400 | about who's using which model and how that is changing over time,
00:08:22.400 | which you can see on our public rankings page here.
00:08:24.400 | It's a single API that you pay for once,
00:08:27.600 | you get near zero switching costs to go from model to model.
00:08:32.400 | And we have over 400 models, over 60 active providers,
00:08:37.400 | and you can buy with lots of different payment methods, including crypto.
00:08:42.400 | And we basically do all the tricky work of normalizing tool calls
00:08:47.400 | and caching for you so that you get the best prices and the most features,
00:08:51.400 | and you don't have to worry about what the provider supports.
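
To make the "single API, near zero switching cost" point concrete: OpenRouter's endpoint is OpenAI-compatible, so the standard OpenAI SDK works with just a base-URL change, and swapping models is one string. A minimal sketch (the model slug here is only an example):

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1", // OpenAI-compatible endpoint
  apiKey: process.env.OPENROUTER_API_KEY,
});

async function main() {
  const completion = await client.chat.completions.create({
    // Switching models is just changing this one string.
    model: "meta-llama/llama-3.3-70b-instruct",
    messages: [{ role: "user", content: "Say hello." }],
  });
  console.log(completion.choices[0].message.content);
}

main();
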
00:08:55.600 | And then we have to look at the results.
00:08:57.400 | Another story.
00:08:58.400 | Initially, OpenRouter was not a marketplace, really.
00:09:02.400 | It was just kind of a collection of all the models
00:09:04.400 | and a way to explore data about who was using each one.
00:09:07.400 | So how did we get here?
00:09:08.400 | Initially, when the first open source models emerged,
00:09:13.400 | we only had like one or two providers for each one,
00:09:17.400 | and so we had like a primary provider and a fallback provider.
00:09:20.600 | Initially, that was it, and we didn't even name the providers.
00:09:25.400 | But it became clear that there were going to be a bunch of companies
00:09:30.400 | that wanted to host these models at very different prices and performances.
00:09:36.600 | The number of features ballooned.
00:09:39.400 | There were companies that supported the Min-P sampler and most didn't.
00:09:43.400 | There were some that supported caching,
00:09:45.400 | some that supported tool calling and structured outputs,
00:09:47.400 | and others that didn't.
00:09:48.400 | And suddenly the ecosystem was just ballooning into this kind of out-of-control,
00:09:53.400 | heterogeneous monster, and we wanted to tame the monster.
00:09:58.600 | So we aggregated all providers in one spot,
00:10:03.400 | and at different price points, it became a marketplace.
00:10:06.400 | And you can see like this model, Llama 3.3 70B Instruct,
00:10:10.400 | it's one of the models with the most providers on the platform,
00:10:14.400 | and it has like 23.
00:10:18.400 | Closed source models also had something interesting happen to them,
00:10:23.400 | which is that they just couldn't keep up with the demand.
00:10:26.600 | And so we helped developers basically get uptime boosting,
00:10:32.400 | and you can see like the delta and how much we can boost uptime
00:10:37.400 | just by aggregating lots of different providers for a model.
00:10:40.400 | And this became really helpful for people using open source, or closed source.
00:10:44.400 | And we became a marketplace for both, showing graphs about latency and throughput
00:10:49.400 | and helping people figure out, using real-world data,
00:10:52.400 | what the latency and throughput is on each model.
00:10:55.600 | And that is how OpenRouter became a marketplace,
00:10:59.400 | and one optimized for language models,
00:11:01.400 | which I thought would be proper for inference,
00:11:06.400 | potentially the biggest market in software.
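
As a flavor of what the marketplace looks like from the API side, here is a rough sketch of model and provider fallbacks against the raw HTTP endpoint. The models and provider fields follow OpenRouter's request options, but treat the exact values and shapes as illustrative rather than authoritative:

async function routedCompletion() {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "meta-llama/llama-3.3-70b-instruct",
      // Fall back to another model entirely if the first is down.
      models: ["meta-llama/llama-3.3-70b-instruct", "mistralai/mistral-large"],
      // Prefer high-throughput providers, but keep fallbacks for uptime.
      provider: { sort: "throughput", allow_fallbacks: true },
      messages: [{ role: "user", content: "Hello!" }],
    }),
  });
  return (await res.json()).choices[0].message.content;
}
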
00:11:08.400 | There are, obviously, a couple of other things that we support:
00:11:13.400 | comparing models using your own prompts with the ease
00:11:16.400 | of just texting an iMessage,
00:11:18.600 | fine-grained privacy controls with API-level overrides,
00:11:23.400 | the ability to see like your usage of all models in one place
00:11:27.400 | and have great observability.
00:11:28.400 | And back to the original question here:
00:11:32.400 | will intelligence be winner-take-all?
00:11:34.400 | We've come to believe that the most likely answer is no.
00:11:40.600 | Here's our data broken down by model author.
00:11:46.400 | How many tokens have been processed by each one?
00:11:49.400 | And you can see Google Gemini started pretty low,
00:11:54.400 | like roughly 2%, 3% in June of last year,
00:11:58.400 | and just has grown to 34%, 35% pretty steadily over the last 12 months.
00:12:07.600 | So, we've got a lot of data.
00:12:09.400 | Anthropic is like one of the most popular model authors on our platform.
00:12:13.400 | OpenAI is a little bit underrepresented in this data,
00:12:16.400 | because a lot of developers use us to get OpenAI-like behavior
00:12:20.400 | for all other models.
00:12:22.400 | But OpenAI has grown a lot here as well.
00:12:25.400 | So, here's what we believe about the market,
00:12:30.400 | after all of the, you know, back story that I just gave you.
00:12:35.600 | The future is going to be multi-model.
00:12:38.400 | All of our customers, tons of customers,
00:12:40.400 | use different models for different purposes
00:12:42.400 | and realize they can unlock huge gains by doing so.
00:12:45.400 | Inference is also a commodity.
00:12:47.400 | Claude from Bedrock,
00:12:49.400 | we want to make it look exactly the same as Claude from Vertex.
00:12:53.400 | And we do that because, like, the two hyperscalers
00:12:56.400 | have fundamentally, you know, the same commodity being delivered
00:13:01.600 | at different rates, different performances.
00:13:04.400 | And for a developer, you just want to be able to, like, select that
00:13:07.400 | without worrying about who is serving it.
00:13:09.400 | We think inference will be, like, a dominant operating expense,
00:13:14.400 | and selecting and routing will be crucial.
00:13:17.400 | You can see the number of active models on OpenRouter
00:13:21.400 | has just steadily grown.
00:13:22.600 | It's not the case that people just hop from model to model.
00:13:27.400 | Like, it tends to be sticky.
00:13:29.400 | And we're trying to just make this wild ecosystem
00:13:34.400 | a lot more homogeneous and easier to work with as a developer.
00:13:39.600 | So, to honor swyx's title for this presentation,
00:13:45.400 | let's give a technical story.
00:13:47.400 | It's something that we've worked on
00:13:49.400 | in the process of building the company.
00:13:51.400 | And that was our own idea for how to do an MCP within OpenRouter.
00:13:58.400 | So, we don't have MCPs.
00:14:00.400 | We don't have an MCP marketplace.
00:14:02.600 | But we did run into the need to expand inference
00:14:09.400 | with new features and new abilities.
00:14:12.400 | For example, searching the web for all models.
00:14:15.400 | PDF parsing for all models.
00:14:17.400 | You know, other interesting things coming soon.
00:14:21.400 | And what we really wanted to do was give these abilities to all models.
00:14:25.400 | But that involves not just the pre-flight work that MCPs do today,
00:14:32.400 | where you can kind of call another API, get a bunch of behaviors,
00:14:38.200 | and then have the inference process access those behaviors as it goes.
00:14:42.200 | We also needed the ability to transform the outputs on the way to the user.
00:14:47.200 | And so, what we really, really needed was something more like middleware.
00:14:52.200 | Middleware is kind of a common concept in web development.
00:14:57.000 | You set up middleware when you're setting up authentication, for example,
00:15:01.200 | or caching for a web app.
00:15:04.000 | And so, we came up with a type of middleware that's AI-native and optimized for inference.
00:15:11.000 | And that looks not totally dissimilar from the way middleware looks in Next.js or web development.
00:15:18.000 | So, pardon the code on the screen.
00:15:20.000 | But this is a little bit about how our plug-in system looks.
00:15:24.000 | And it can call MCPs from inside a plug-in.
00:15:27.800 | But importantly, it can also augment the results on the way back to the user.
00:15:32.600 | So, here's an example of our web search plug-in, which augments every language model with the ability to search the web.
00:15:39.600 | Every language model can just kind of tap into this plug-in and get web annotations as results are being fed back to users in real time.
00:15:49.800 | And this all happens in a stream.
00:15:52.600 | So, there's no kind of, like, requirement that you get all of the tokens at once.
00:15:58.600 | It can just happen live in the stream.
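
The slide code itself isn't captured in the transcript, but the shape described, pre-flight work plus transforming the output on the way back, can be sketched as middleware over a token stream. Everything below is a hypothetical illustration of the idea, not OpenRouter's actual plugin API:

// A streamed chunk; annotations might carry web-search citations.
type Chunk = { token: string; annotations?: { url: string; title: string }[] };

interface Plugin {
  // Runs before inference; may call tools/MCPs and rewrite the prompt.
  preflight?(prompt: string): Promise<string>;
  // Wraps the output stream; may inject or rewrite chunks in flight.
  transform?(stream: AsyncIterable<Chunk>): AsyncIterable<Chunk>;
}

async function* runWithPlugins(
  prompt: string,
  infer: (prompt: string) => AsyncIterable<Chunk>,
  plugins: Plugin[],
): AsyncIterable<Chunk> {
  for (const p of plugins) {
    if (p.preflight) prompt = await p.preflight(prompt); // e.g. prepend search results
  }
  let stream = infer(prompt);
  for (const p of plugins) {
    if (p.transform) stream = p.transform(stream); // e.g. attach web annotations
  }
  yield* stream; // stays a live stream end to end, no buffering required
}
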
00:16:00.800 | We solved a bunch of other tricky problems while building OpenRouter.
00:16:07.600 | We really wanted to get extremely low latency.
00:16:11.600 | And we got it down to about 30 milliseconds, the best in the industry, I believe, using a lot of custom cache work.
00:16:19.600 | And we also need to make streams cancelable.
00:16:22.800 | All these different providers have completely different stream cancellation policies.
00:16:27.600 | Sometimes if you just drop a stream, the inference provider will bill you for the entire thing.
00:16:34.600 | Sometimes it won't.
00:16:36.600 | Sometimes it will bill you for the next 20 tokens that you never got.
00:16:40.800 | And we work a lot to try to figure out these edge cases and understand when developers are going to care about them, too.
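
On the client side, the cancellation itself is the easy half and can be sketched with a standard AbortController; the hard half, as described above, is that what you are billed for after the abort varies by provider. A minimal Node sketch:

async function cancellableStream() {
  const controller = new AbortController();
  // Simulate the user closing the stream after two seconds.
  const timer = setTimeout(() => controller.abort(), 2000);
  try {
    const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "meta-llama/llama-3.3-70b-instruct",
        stream: true,
        messages: [{ role: "user", content: "Write a very long story." }],
      }),
      signal: controller.signal,
    });
    const reader = res.body!.getReader();
    while (true) {
      const { done, value } = await reader.read(); // throws once aborted
      if (done) break;
      process.stdout.write(value); // raw SSE bytes; parse in real code
    }
  } catch (err) {
    // AbortError lands here; tokens billed past this point depend on the provider.
  } finally {
    clearTimeout(timer);
  }
}
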
00:16:49.600 | And standardizing all these providers and models became, like, a big tricky architecture problem that we spent a while working on.
00:16:57.600 | So, here's where all this is going.
00:16:59.600 | We're going to add more modalities to OpenRouter.
00:17:02.600 | And I think this is, like, a big change in the industry as well.
00:17:05.600 | We're going to start seeing LLMs generate images.
00:17:09.800 | We already have a few examples on the market.
00:17:13.600 | But, like, some people call them transfusion models, a transformer mixed with stable diffusion.
00:17:19.600 | These are going to give images way more world knowledge and the ability to have a conversation with the image,
00:17:26.600 | which we think is just critical for growing that industry and making it really work.
00:17:30.600 | I just ran into somebody today who told me about their customer using a transfusion model
00:17:38.800 | to generate menus.
00:17:40.600 | Imagine doing that.
00:17:42.600 | Like, a whole menu, like, in a delivery app, generated by a transfusion model.
00:17:47.600 | It's going to be really exciting and a big deal in the coming year.
00:17:53.600 | We're also going to work on much more powerful routing.
00:17:55.600 | Like, routing is our bread and butter.
00:17:57.800 | We do geographical routing right now, but it's pretty minimal.
00:18:01.600 | Routing people to the right GPU in the right place, plus enterprise-level optimizations, are coming.
00:18:06.600 | Better prompt observability, better discovery of models.
00:18:11.600 | Like, really fine-grained categorization.
00:18:14.600 | Imagine being able to see, like, the best models that take Japanese and create Python code.
00:18:19.800 | And, of course, even better prices coming soon.
00:18:23.600 | So, you know, we believe in collaboration and building an ecosystem that's durable
00:18:31.600 | and with low vendor lock-in.
00:18:33.600 | So, you know, collaborate with us.
00:18:35.800 | Here's our email.
00:18:37.600 | And if you're interested, join us, too.
00:18:39.600 | Thank you.