
Supabase Vector: The Postgres Vector database: Paul Copplestone


Chapters

0:00 Introduction
0:36 Supabase overview
2:31 Postgres Vector
4:56 Supabase
8:10 Roboflow
13:29 What's next
15:43 Outro

Transcript

Hey everyone. So, yeah, I'm Copple, the CEO and co-founder of Supabase. Thank you for having me, especially to swyx and Ben. When swyx asks you to come to a conference, you don't say yes, you say definitely, and this is the first time we've ever sponsored a conference at all, so it's good to be here.

So, first of all, it's very apt that this section of talks is called "scale to millions in a weekend," because that's actually our tagline. So, what is Supabase? We are a backend as a service. What does that mean? Every time you launch a project within Supabase, you get a full Postgres database.

And we also provide you with authentication. When you use our auth service, all of the users are also stored inside that database. We give you edge functions for compute. These are powered by Deno, and you can also trigger them from the database. So, hopefully you see where this is going.

We give you large file storage. The files themselves don't get stored in your database, but the directory structure does, so you can write access rules, things like that. We have a real-time system. This is actually the genesis of Supabase. I won't talk about it here, but you can use it to listen to changes coming out of your Postgres database.

You can also use it to build live cursor movements, things like that. And then, most importantly for this talk, we have a vector offering for storing embeddings. This is powered by pgvector, and that's the topic of this talk. I want to make the case for pgvector.
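For context, a minimal pgvector setup looks roughly like this; the table name and the tiny 3-dimensional embeddings are illustrative only, not Supabase's actual schema:

```sql
-- Enable the extension and store embeddings next to relational data.
create extension if not exists vector;

create table documents (
  id bigserial primary key,
  content text,
  embedding vector(3)  -- real embeddings are much wider, e.g. 1536 dims
);

insert into documents (content, embedding)
values ('hello world', '[0.1, 0.2, 0.3]');

-- Nearest neighbours by cosine distance (the <=> operator).
select id, content
from documents
order by embedding <=> '[0.1, 0.2, 0.3]'
limit 5;
```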

And, yeah, finally, we're open source. We've been operating since 2020, and everything we do is MIT, Apache 2.0, or PostgreSQL licensed. We try to support existing communities wherever we can, and we try to coexist with them. And that's largely why we support pgvector.

It is an existing tool, and we contribute to it. So, I wanted to show a little bit about how the sausage is made in an open source company. For pgvector, this started with just an email from Greg. He said: I'm sending this email to see what it would take for your team to accept a Postgres extension called pgvector.

It's a simple yet powerful extension to support vector operations. I've already done the work. You can find my pull request on GitHub. So, I jumped on a call with Greg. And afterwards, I sent him an email the next day. Hey, Greg, the extension is merged. So, it should be landing in prod this week.

By the way, our docs search is currently a bit broken. Is this something you'd be interested in helping with? Fast forward two weeks, and we released Clippy, which is of course a throwback to Microsoft Clippy, the OG AI assistant. I think we were the first to do this within docs.

We certainly didn't know of anyone else doing this as a docs search interface. So, we built an example, a template around it where you can do this within your own docs. And others followed suit. Notably, Mozilla released this for MDN, one of the most popular dev docs on the internet.

Along with many other AI applications. So, this is a chart of all the new databases being launched on supabase.com, our platform. It doesn't include the open source databases. You can see where pgvector was added; it is one of the tailwinds that accelerated the growth of new databases on our platform.

And since then, we've become part of the AI stack for a lot of builders. We work very well with Vercel, Netlify, the Jamstack crowd. We're now launching around 12,000 databases a week, and around 10 to 15 percent of them are using pgvector in one way or another.

So, thousands of AI applications are being launched every week. Some of these apps really do fit that tagline: build in a weekend, scale to millions. We've literally had apps like that. We had one that scaled to a million users in 10 days, and I know they built it in three days. So, a lot of really bizarre things have happened since pgvector was launched.

Also, the app you're using today, if you're using it, is powered by Supabase. So, thank you, Simon, for using that inside the application. And then finally, just to wrap up that story arc: Greg, who emailed us at the start of the year, now works at Supabase. If you attended the workshop yesterday, he was actually the one leading it.

Nice. Thanks, Greg. He's also responsible for a lot of the growth in Supabase, so we owe him a lot. But every good story has a few speed bumps, and for pgvector, that started with a tweet. It says: why you should never use pgvector, Supabase's vector store, for production.

pgvector is 20 times slower than a decent vector database, Qdrant, and it's a full 18% worse at finding relevant docs for you. So, in this chart, higher is better; it's the queries per second. Just making sure you all know. And Postgres, with the IVFFlat index, is not doing well here.

First of all, we feel this is an unfair mischaracterization of Supabase, because pgvector is actually maintained by Andrew Kane, a sole contributor who developed it years before Supabase came along. But nonetheless, we are contributors. And so, when Andrew saw the tweet, he decided: well, HNSW, let's just add it.

And we got to work with the OrioleDB team and the AWS team, and it took about one month to build HNSW. What were the results? This is the same chart, but now using Postgres with an HNSW index. First of all, I'm not a big fan of benchmarks, because it seems like I'm ragging on Qdrant here.

I'm not. Unfortunately, they were used in the tweet, so we had to benchmark against them. Also, benchmarks are very isolated. But what you can see, most importantly, is that the queries per second increased and the accuracy increased as well: both Qdrant and HNSW are at 0.99. Now, you might be thinking: well, you can just throw compute at it.
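For reference, building an HNSW index with pgvector (0.5.0 or later) is a one-liner; the table name and tuning values below are illustrative:

```sql
-- HNSW index on the embedding column; m and ef_construction trade
-- build time and memory for recall.
create index on documents
  using hnsw (embedding vector_cosine_ops)
  with (m = 16, ef_construction = 64);

-- At query time, a larger ef_search raises recall at some cost in speed.
set hnsw.ef_search = 100;
```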

Maybe that's what they're doing. So this one is a blog post we released today. You can read it; that's the QR code for it. This is an apples-to-apples comparison between Pinecone and Postgres for the same compute. We basically take the same dollar value. It's very hard to benchmark Pinecone and to measure accuracy.

But we're measuring the queries per second for Pinecone using six replicas, which cost $480, versus one of our database systems, which is $410. So we give them a bit of extra compute, and the queries per second and accuracy are, as you can see on the chart, quite different. So why am I bullish about Postgres and pgvector for this particular thing?

I was chatting to Joseph, the CEO of Roboflow, a few months ago, and I like to tell this example. It's actually related to the Paint.wtf one, but a slightly different application. I like to tell it because it highlights the power of Postgres. He told me about an app where users could take photos of trash within San Francisco, upload them to an embedding store, and then measure the trends of trash throughout San Francisco.

You can think of this the same as Paint.wtf, the example that he just used. The problem, of course, with all of these is not-safe-for-work images. Why is that a problem? First of all, it fills up your embedding store. You have to store the data.

It's going to cost you more. Your indexes are going to slow down if you're indexing this content and users can see this data inside the app. So I thought about this for an hour and I did a little proof of concept for him just using Postgres. The solution that I thought of was partitions.

Now, trash is very boring, so I'm going to use cats in this example. We're going to segment good cats and bad cats. We'll start with a basic table where we're going to store all of our cats, with the embeddings inside. Then, when an embedding is uploaded, we're going to call a function called is_cat, and there I'm going to compare it to a canonical cat.

In this case, my space cat. If the similarity is greater than 0.8, I'll store it in a good cats partition, and everything else can just go into a bad cats partition. So to do this, I took my space cat, generated a vector of it, and then literally just stuffed it inside a Postgres function called is_cat.

The way this works: it takes in an embedding (that's line three on the slide) and returns a float, the similarity, basically. And all it does is compare the distance to this canonical cat. I'm going to create a table to store all of my embeddings.

That's line five: the embeddings and the URL of the image. And then, on line six, we're going to determine the similarity: is it a good cat or a bad cat? Then, finally, Postgres has this thing called triggers, which are very cool. What we can do is attach a trigger to a table.
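A sketch of the function and table just described; the canonical embedding values and the 3-dimensional vectors are placeholders, since the real embedding from the slide isn't reproduced here:

```sql
-- Compare an incoming embedding against a hard-coded canonical cat.
-- A real embedding would have hundreds of dimensions; 3 keeps this short.
create function is_cat(new_embedding vector(3))
returns float
language sql immutable
as $$
  -- cosine similarity = 1 - cosine distance
  select 1 - (new_embedding <=> '[0.11, 0.72, 0.68]'::vector(3));
$$;

-- One row per upload: the embedding, the image URL,
-- and the computed similarity.
create table cats (
  id bigserial primary key,
  url text,
  embedding vector(3),
  is_cat float
);
```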

So first of all, on line two, we create the trigger. Line three: we do it before the insert onto this table. And the most important one is line six. With this trigger, every time you upload a cat, we run the function we just saw, compare it, and store the similarity in the table.

NEW here is actually a special value in Postgres: inside a trigger, it holds the row you're about to insert. And then, finally, what does the data look like? After uploading a bunch of images, you can see that we're storing all of our embeddings, the URLs for them, and then, on the right-hand side, that similarity.
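The trigger might be sketched like this, assuming the cats table and is_cat function described above (the function and trigger names are made up for illustration):

```sql
-- Compute the similarity for each row before it is inserted.
create function set_is_cat()
returns trigger
language plpgsql
as $$
begin
  -- NEW is the row about to be inserted; stamp it with its similarity.
  new.is_cat := is_cat(new.embedding);
  return new;
end;
$$;

create trigger check_cat
  before insert on cats
  for each row
  execute function set_is_cat();
```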

And now we can use that to create a segment; we just need to split the data. The nice thing about partitions in Postgres is that each one individually has all the properties of a regular table. So we can create an index only on the good cats.

And then, to clean up as bad cats are uploaded, if we ever want to get rid of them, we just drop the partition and recreate it. And because of the way partitions work on disk, all the data in a partition is stored grouped together. So good cats will be kept fast, and bad cats can be dropped.

So what does that look like in code? In Postgres, it's really just 13 or 14 lines. On line 7, you can see the partition that I create, and I'm going to do it by range. is_cat is the column that I'm going to partition by.

And then on line 9, I create good cats. And line 11 is where I actually determine the values between 0.8 and 1. And then on line 13, everything else is going to fall into the default partition. So honestly, I don't even know if this is the right way to solve the problem.
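Putting the partitioning together, a sketch under the same assumptions as before (the table repeats the earlier definition with the partition clause added; the slide's range was 0.8 to 1, and since range upper bounds are exclusive, the bound here goes slightly past 1 so exact matches land in good_cats):

```sql
create table cats (
  id bigserial,
  url text,
  embedding vector(3),
  is_cat float not null
) partition by range (is_cat);

-- Good cats: similarity from 0.8 up to and including 1.
create table good_cats
  partition of cats
  for values from (0.8) to (1.01);

-- Everything else falls through to the default partition.
create table bad_cats
  partition of cats default;

-- Index only the partition we actually query.
create index on good_cats using hnsw (embedding vector_cosine_ops);

-- Cleaning up is just dropping and recreating the default partition.
drop table bad_cats;
create table bad_cats partition of cats default;
```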

But I just think it's cool that I could just do that, and it's all built into Postgres. So that's really why I'm bullish on Postgres. It's so extensible, and it's got 30 years of engineering. It's got pretty much all the primitives you might need, and it gets out of your way while you're building an AI application.

And it's extensible in that pgvector itself is not built into Postgres; it's just an extension. So for us to add it, we just scouted around the community, or Greg did in this case, merged it in as an extension, and it was running basically within two days. Some other things worth highlighting, if you're doing RAG especially: Postgres has row level security, which I think is very cool.

This allows you to write declarative rules on the tables inside your Postgres database. So if you're storing user data and you want to split it up by different users, you can write those rules. It's also defense in depth: even if someone gets past your API security and goes directly into your database,

the security is still there. Something that's often not captured in benchmarks: a single round trip versus multiple round trips. If you store your embeddings next to your operational data, you do a single fetch to your database. And finally, we're still early. pgvector is currently an extension.
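To make both points concrete, here's a sketch combining a row level security policy with a single query that fetches similar rows plus their operational columns. The table and column names are made up; auth.uid() is Supabase's helper for the current user's id:

```sql
create table notes (
  id bigserial primary key,
  owner uuid not null,
  body text,
  created_at timestamptz default now(),
  embedding vector(3)
);

alter table notes enable row level security;

-- Declarative rule: a user can only read their own rows, even if a
-- request reaches the database directly.
create policy "read own notes"
  on notes for select
  using (owner = auth.uid());

-- One round trip: similarity search and operational data together.
-- $1 is the query embedding supplied by the application.
prepare search_notes (vector(3)) as
  select id, body, created_at
  from notes
  order by embedding <=> $1
  limit 5;
```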

I can foresee it probably getting merged into Postgres core eventually; I'm not too sure. People often ask me: is there still space for a specialized vector database? Yes, I think there is, for the many things that databases don't do. Putting models closer to the database could be one of those things.

But for this particular use case, where you're just storing embeddings, indexing them, and fetching them out, I think Postgres is definitely going to keep moving in that direction. What's next for Supabase Vector? Pretty simply, we've been really focused on more enterprise use cases, or largely: how do you store billions of vectors?

This is an area that needs development. So we've been working on sharding with Citus, another Postgres extension, which allows you to split your data between different nodes. And we've found that the transactions scale in a linear fashion as you add nodes. So we're going to keep developing this.
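For reference, distributing a table with Citus is a single function call; the table name and shard key here are illustrative:

```sql
-- Citus adds sharding to Postgres as an extension.
create extension citus;

-- Hash-distribute the embeddings table across worker nodes by id;
-- reads and writes then fan out to the shards.
select create_distributed_table('documents', 'id');
```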

We've been chatting to the Citus team at Microsoft. If you want to be a design partner on this, we'd love to work with you, especially if you're already storing billions of embeddings. And if you want to get started, just go to database.new. Also, apparently, our swag has finally arrived.

So if you want some free credits and swag, come see us at the booth, and happy building. Thank you.