Supabase Vector: The Postgres Vector database: Paul Copplestone

Chapters
0:00 Introduction
0:36 Supabase overview
2:31 Postgres Vector
4:56 Supabase
8:10 Roboflow
13:29 What's next
15:43 Outro
00:00:00.000 |
Hey everyone. So, yeah, I'm Paul Copplestone, the CEO and co-founder of Supabase. Also, thank you for 00:00:22.220 |
having me, especially to swyx and Ben. When swyx asks you to come to a conference, you don't say 00:00:27.600 |
yes, you say definitely, and this is the first time we've ever sponsored a conference at 00:00:33.880 |
all, so it's good to be here. So, first of all, it's very apt that apparently this section 00:00:41.600 |
of talks is "scale to millions in a weekend." It's very apt because that's actually our tagline. 00:00:47.460 |
So, what is Supabase? We are a backend as a service. What does that mean? We give you 00:00:55.780 |
a full Postgres database. Every time you launch a project within Supabase, you 00:01:02.560 |
get that database. And we also provide you with authentication. All of the users, when you use 00:01:10.560 |
our auth service are also stored inside that database. We give you edge functions for compute. 00:01:17.200 |
These are powered by Deno. You can also trigger them from the database. So, hopefully you see 00:01:21.900 |
where this is going. We give you large file storage. These do not get stored in your database, but 00:01:27.960 |
the directory structure does get stored in your database. So, you can write access rules, things 00:01:33.080 |
like that. We have a real-time system. This is actually the genesis of Supabase. I won't 00:01:40.960 |
talk about it in here, but you can use this to listen to changes coming out of your database, 00:01:46.000 |
your Postgres database. You can also use it to build live, like, cursor movements, things 00:01:51.420 |
like this. And then, most importantly for this talk, we have a vector offering. This is for 00:01:57.540 |
storing embeddings. This is powered by pgvector. And that's the topic of this talk. I want to 00:02:04.460 |
sort of make the case for pgvector. So, first of all, I wanted to show -- and, yeah, finally, 00:02:11.940 |
we're open source. So, we've been operating since 2020. Everything we do is MIT licensed, Apache 00:02:18.080 |
2.0, or the Postgres license. We try to support existing communities wherever we can, and we try to coexist with them. 00:02:24.800 |
And that's largely why we support pgvector. It is an existing tool. We contribute to it. So, I wanted 00:02:32.640 |
to show a little bit about how the sausage is made in an open source company. And for 00:02:38.060 |
pgvector, this started with just an email from Greg. He said, I'm sending this email to see what 00:02:45.760 |
it would take for your team to accept a Postgres extension called pgvector. It's a simple yet 00:02:51.560 |
powerful extension to support vector operations. I've already done the work. You can find my pull 00:02:58.300 |
request on GitHub. So, I jumped on a call with Greg. And afterwards, I sent him an email the next day. 00:03:05.720 |
Hey, Greg, the extension is merged. So, it should be landing in prod this week. By the way, our docs search is 00:03:13.480 |
currently a bit broken. Is this something you'd be interested in helping with? Then, fast forward two weeks, and we released 00:03:21.420 |
Clippy, which is, of course, a throwback to Microsoft Clippy, the OG AI assistant. I think we were the first 00:03:31.840 |
to do this within docs. We certainly didn't know of anyone else doing this as a docs search interface. 00:03:36.680 |
So, we built an example, a template around it where you can do this within your own docs. And others followed suit. 00:03:43.100 |
Notably, Mozilla released this for MDN, one of the most popular dev docs on the internet. Along with many other 00:03:51.700 |
AI applications. So, this is a chart of all the new databases being launched on supabase.com, our platform. 00:04:00.100 |
It doesn't include the open source databases. So, you can see where pgvector was added. It is one of the 00:04:08.020 |
tailwinds that accelerated the growth of new databases on our platform. And since then, we've kind of become 00:04:15.380 |
part of the AI stack for a lot of builders. We work very well with Vercel, Netlify, the Jamstack 00:04:22.340 |
crowd. And now we're launching around 12,000 databases a week. And around maybe 10 to 15 percent of 00:04:30.980 |
them are using pgvector in one way or another. So, that's thousands of AI applications being launched every 00:04:36.740 |
week. Also, some of these apps kind of fit that tagline, build in a weekend, scale to millions. We've 00:04:43.940 |
literally had apps. We had one that scaled to a million users in 10 days. I know they built it in three 00:04:49.860 |
days. So, a lot of really bizarre things that we've seen since pgvector was launched. Also, the app you're 00:04:59.460 |
using today, if you're using it, is powered by Supabase. So, thank you, Simon, for using that inside 00:05:05.220 |
the application. And then finally, just to wrap up that story arc, Greg, who emailed us at the start of 00:05:12.740 |
the year, now works at Supabase. If you attended the workshop yesterday, he actually was the one who ran it. 00:05:23.780 |
He's also responsible for a lot of the growth in Supabase. So, we owe him a lot. 00:05:29.220 |
But every good story has a few speed bumps. And for pgvector, that started with a tweet. 00:05:37.940 |
This is the one. It says: why you should never use pgvector, Supabase's vector store, for production. 00:05:45.060 |
pgvector is 20 times slower than a decent vector database, Qdrant. And it's a full 18% worse in 00:05:52.100 |
finding relevant docs for you. So, in this chart, higher is better. It's the queries per second. Just 00:05:59.060 |
making sure you all know. And Postgres, with the IVFFlat index, is not doing well here. And first of all, 00:06:07.860 |
we feel this is an unfair characterization of Supabase, because pgvector is actually owned by 00:06:14.100 |
Andrew Kane, a sole contributor who developed it many years before Supabase came along. 00:06:21.860 |
But nonetheless, we are contributors. And so, when Andrew saw the tweet, he decided, well, 00:06:29.220 |
HNSW, let's just add it. And we got to work with the Aureol team and the AWS team. And it took about 00:06:36.980 |
one month to build in HNSW. What were the results? This is the same chart, but we just use Postgres HNSW. 00:06:53.380 |
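For reference, this is roughly what that looks like in pgvector. This is a minimal sketch, and the documents table, its columns, and the 1536 dimensions are assumptions rather than the actual benchmark schema:

    -- enable the extension, store embeddings, and index them with HNSW
    create extension if not exists vector;

    create table documents (
      id        bigint generated always as identity primary key,
      content   text,
      embedding vector(1536)
    );

    -- HNSW index using cosine distance (available in pgvector 0.5.0 and later)
    create index on documents using hnsw (embedding vector_cosine_ops);

    -- nearest neighbours to a query embedding supplied as a bind parameter
    select id, content
    from documents
    order by embedding <=> $1   -- $1: the query embedding from your application
    limit 5;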
First of all, I'm not a big fan of benchmarks, and it might seem like I'm ragging on Qdrant 00:07:00.100 |
here. I'm not. Unfortunately, they were used in the tweet, so we had to benchmark against them. Also, 00:07:08.100 |
they're very isolated. But what you can see most importantly is that the queries per second 00:07:13.700 |
increased and also the accuracy increased: it's 0.99 for both Qdrant and Postgres HNSW. 00:07:20.900 |
Also, you might be thinking, well, you can just throw compute at it. Maybe that's what they're doing. 00:07:27.220 |
This one actually is a blog post we released today. You can read it. That's the QR code for it. 00:07:33.380 |
This is an apples-to-apples comparison between Pinecone and Postgres for the same compute. 00:07:39.700 |
We basically take the same dollar value. It's actually very hard to benchmark Pinecone and to measure accuracy. 00:07:47.700 |
But we're measuring the queries per second for Pinecone using six replicas, which cost $480, versus one of our 00:07:57.940 |
database instances, which is $410. So we give them a bit of extra compute, and the queries per second and accuracy still come out ahead for Postgres. 00:08:06.980 |
So why am I bullish about Postgres and pgvector for this particular thing? 00:08:15.780 |
I was chatting to Joseph, actually the CEO of Roboflow, a few months ago, and I like to tell this example. 00:08:22.500 |
It's related actually to the Paint.wtf one, but a slightly different application. I like to tell it because 00:08:27.620 |
it highlights the power of Postgres. So he told me about this app where the users could take photos 00:08:34.980 |
of trash within San Francisco and then they would upload it to an embedding store and they would kind 00:08:40.980 |
of measure the trends of trash throughout San Francisco. You could think of this the same as 00:08:47.700 |
the Paint.wtf example that he just used. The problem, of course, with all of these ones is 00:08:56.580 |
not safe for work images. So why is that a problem? First of all, it fills up your embedding store. You 00:09:08.420 |
have to store the data. It's going to cost you more. Your indexes are going to slow down if you're indexing 00:09:14.260 |
this content and users can see this data inside the app. So I thought about this for an hour and I did a little 00:09:20.100 |
proof of concept for him just using Postgres. The solution that I thought of was partitions. Now trash 00:09:27.060 |
is very boring, so I'm going to use cats in this example. We're going to segment good cats and bad cats. 00:09:33.060 |
So we'll start with a basic table where we're going to store all of our cats. We're going to store the embeddings 00:09:39.620 |
inside them. Then when an embedding is uploaded, we're going to call a function called iscat, and here 00:09:47.300 |
I'm going to compare it to a canonical cat. In this case, my space cat. Then if the similarity is greater 00:09:56.820 |
than 0.8, I'll store it in a good cats partition and everything else can just go into a bad cats partition. 00:10:05.300 |
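As a rough sketch of that check (not the exact code from the slides), an iscat function could look something like this. It uses toy 3-dimensional vectors so the snippet runs as-is; a real embedding would have, say, 1536 dimensions, and the hard-coded vector would be the actual space-cat embedding:

    create extension if not exists vector;

    -- similarity of an incoming embedding to one canonical "space cat" embedding;
    -- <=> is pgvector's cosine distance, so 1 minus the distance gives a similarity
    create or replace function iscat(embedding vector(3))
    returns float
    language sql
    immutable
    as $$
      select 1 - (embedding <=> '[1, 0, 0]'::vector);
    $$;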
So to do this, I just took my space cat and I generated a vector of that and then I literally just 00:10:12.020 |
stuffed it inside a Postgres function called iscat. The way that this works, it takes in an embedding 00:10:18.100 |
that's line three, and then it's going to return a float, a similarity, basically. And all it's going to do is 00:10:27.540 |
compare the distance to this canonical cat. I'm going to create a table to store all of my embeddings. 00:10:34.980 |
That's line five, the embeddings, the URL of the image. And then finally on line six, we're going to 00:10:41.300 |
determine the similarity. Is it a good cat or a bad cat? Then finally, Postgres has this thing called 00:10:49.860 |
triggers which are very cool. What we can do is attach a trigger to a table. So first of all, line two, 00:10:56.580 |
we're going to create the trigger. Line three, we're going to do it before the insert onto this table. 00:11:02.340 |
And then the most important one is line six. With this trigger, every time you upload a cat, 00:11:08.420 |
we're going to run that function that we just saw, compare it, and then store in 00:11:14.660 |
the table the similarity. NEW here is actually kind of a special value in Postgres: inside the trigger, 00:11:21.380 |
it holds the values of the row that you're about to insert. And then finally, what does the data look like? 00:11:26.900 |
After uploading a bunch of images, you can see here that we're storing all of our embeddings, 00:11:32.100 |
the URLs for them, and then on the right-hand side, that similarity. And now we can use that 00:11:37.620 |
essentially to create a segment. So we just need to split the data. And the nice thing about 00:11:45.380 |
partitions in Postgres is that each one individually has kind of all the properties of a regular 00:11:51.060 |
table. So we can create an index only on the good cats. And then to clean up, as our bad cats 00:11:57.620 |
are getting uploaded, if we ever want to clean them up, we just drop the partition and recreate it. 00:12:02.340 |
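Putting those pieces together, here is one hedged way the table, partitions, and insert path could be wired up, again with toy 3-dimensional vectors and assuming the iscat function sketched above. One difference from the trigger approach just described: a BEFORE row trigger on a partitioned table isn't allowed to change which partition a row lands in, so this sketch computes the similarity in the insert itself.

    -- parent table partitioned by the similarity score
    create table cats (
      url       text,
      embedding vector(3),
      is_cat    float
    ) partition by range (is_cat);

    -- similarity from 0.8 upwards counts as a good cat
    create table good_cats partition of cats
      for values from (0.8) to (maxvalue);

    -- everything else (including not-safe-for-work uploads) lands here
    create table bad_cats partition of cats default;

    -- index only the good cats; bad cats can simply be dropped and recreated
    create index on good_cats using hnsw (embedding vector_cosine_ops);

    -- compute the similarity up front so Postgres routes the row to the right partition
    insert into cats (url, embedding, is_cat)
    values ('https://example.com/cat.png',
            '[0.9, 0.1, 0.0]',
            iscat('[0.9, 0.1, 0.0]'));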
And the way that they work on disk is all the data is stored grouped together. So good cats will be 00:12:10.100 |
kept fast, and bad cats will be dropped. So what does that look like in code? In Postgres code, 00:12:17.780 |
it's really just 13 or 14 lines of code. On line 7, you can see the 00:12:25.220 |
partition that I create. And I'm going to do it by a range. Here, is_cat is the column that I'm going to 00:12:32.020 |
partition by. And then on line 9, I create good cats. And line 11 is where I actually determine the 00:12:39.460 |
values between 0.8 and 1. And then on line 13, everything else is going to fall into the default 00:12:46.820 |
partition. So honestly, I don't even know if this is the right way to solve the problem. But I just think 00:12:52.580 |
it's cool that I could just do that and it's all built into Postgres. So that's really why I'm bullish 00:12:59.300 |
on Postgres. I mean, it's so extensible. It's got 30 years of engineering. It's got pretty much 00:13:05.460 |
all the primitives that you might need, and it gets out of your way while you are 00:13:10.420 |
building an AI application. It's also extensible. pgvector itself is not built into Postgres. It's just 00:13:17.300 |
an extension. So for us to add it, we just scouted around the community, or Greg did in this case, 00:13:23.140 |
and then we merged it in as an extension, and it was running basically within two days. 00:13:28.580 |
Some other things worth highlighting, if you're doing RAG especially, Postgres has row level security, 00:13:35.940 |
which I think is very cool. This allows you to write declarative rules on your tables inside your 00:13:42.100 |
Postgres database. And so if you're storing user data and you want to split it up by different 00:13:47.060 |
users, you can actually write those rules. It's also defense in depth. So even if something gets past 00:13:54.100 |
your API security and goes directly to your database, the security is still there. 00:13:59.380 |
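A minimal sketch of what that can look like, assuming the hypothetical documents table from earlier plus a user_id column, and Supabase's auth.uid() helper for the current user:

    alter table documents add column user_id uuid;

    alter table documents enable row level security;

    -- declarative rule: each user can only read their own rows, even if a
    -- request gets past the API layer and queries the database directly
    create policy "Users can read their own documents"
      on documents for select
      using (auth.uid() = user_id);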
Something that's often not captured in benchmarks, a single round trip versus multiple round trips. 00:14:06.900 |
So if you store your embeddings next to your operational data, then you do a single fetch to your database. 00:14:15.140 |
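For example, a hedged sketch of that single round trip, assuming the documents table above (with its user_id column) and a hypothetical profiles table holding the operational data:

    create table profiles (
      id       uuid primary key,
      username text
    );

    -- similarity search and the related operational data in one query,
    -- because the embeddings live next to the rest of your data
    select d.content, p.username
    from documents d
    join profiles p on p.id = d.user_id
    order by d.embedding <=> $1   -- $1: the query embedding
    limit 5;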
And then finally, we're still early. pgvector is currently an extension. I can foresee it's 00:14:24.500 |
probably going to get merged into Postgres core eventually. I'm not too sure. People often ask me, is there 00:14:32.180 |
still space for a specialized vector database? Yes, I think there is, for the many other things that databases 00:14:40.180 |
don't do; maybe putting models closer to the database could be one of those things. But for 00:14:47.460 |
this particular use case where you're actually just storing embeddings, indexing them, fetching them out, 00:14:52.180 |
I think Postgres is definitely going to keep moving in that direction. What's next for Supabase Vector? 00:15:02.020 |
Pretty simply, we have been really focused on more enterprise use cases, or largely how do you store 00:15:09.460 |
billions of vectors. This is another area that needs development. So we've been working on sharding 00:15:15.380 |
with Citus, another Postgres extension, and it allows you to split your data between different nodes. 00:15:23.140 |
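As a sketch of the direction, again assuming the hypothetical documents table from earlier, Citus distributes a table's rows across worker nodes by a chosen column:

    -- requires the Citus extension to be installed and preloaded on the cluster
    create extension if not exists citus;

    -- shard the documents table across worker nodes by id
    select create_distributed_table('documents', 'id');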
And we've found that the transactions scale in a linear fashion as you add nodes. So in this case, 00:15:31.380 |
we're going to keep developing this. We've been chatting to the Citus team at Microsoft. If you want to be a 00:15:36.180 |
design partner on this, then we'd love to work with you on it, and especially if you're already storing 00:15:40.740 |
billions of embeddings. And if you want to get started, just go to database.new. Also, 00:15:49.060 |
apparently our swag has now finally arrived. So if you want some free credits and swag, come see us.