Supabase Vector: The Postgres Vector database: Paul Copplestone

Chapters
0:00 Introduction
0:36 Supabase overview
2:31 Postgres Vector
4:56 Supabase
8:10 Roboflow
13:29 What's next
15:43 Outro
00:00:00.000 |
Hey everyone. So, yeah, I'm Paul Copplestone, the CEO and co-founder of Supabase. Also, thank you for 00:00:22.220 |
having me, especially to swyx and Ben. When swyx asks you to come to a conference, you don't say 00:00:27.600 |
yes, you say definitely, and this is the first time we've ever sponsored a conference at 00:00:33.880 |
all, so it's good to be here. So, first of all, it's very apt that apparently this section 00:00:41.600 |
of talks is "scale to millions in a weekend." It's very apt because that's actually our tagline. 00:00:47.460 |
So, what is Supabase? We are a backend as a service. What does that mean? We give you 00:00:55.780 |
a full Postgres database. Every time you launch a project within Supabase, you 00:01:02.560 |
get that database. And we also provide you with authentication. All of the users, when you use 00:01:10.560 |
our auth service are also stored inside that database. We give you edge functions for compute. 00:01:17.200 |
These are powered by Deno. You can also trigger them from the database. So, hopefully you see 00:01:21.900 |
where this is going. We give you large file storage. These do not get stored in your database, but 00:01:27.960 |
the directory structure does get stored in your database. So, you can write access rules, things 00:01:33.080 |
like that. We have a real-time system. This is actually the genesis of Supabase. I won't 00:01:40.960 |
talk about it in here, but you can use this to listen to changes coming out of your database, 00:01:46.000 |
your Postgres database. You can also use it to build live, like, cursor movements, things 00:01:51.420 |
like this. And then, most importantly for this talk, we have a vector offering. This is for 00:01:57.540 |
storing embeddings. This is powered by pgvector. And that's the topic of this talk. I want to 00:02:04.460 |
sort of make the case for pgvector. So, first of all, I wanted to show -- and, yeah, finally, 00:02:11.940 |
we're open source. So, we've been operating since 2020. Everything we do is MIT licensed, Apache 00:02:18.080 |
2.0, or the Postgres license. We try to support existing communities wherever we can, and we try to coexist with them. 00:02:24.800 |
And that's largely why we support pgvector. It is an existing tool. We contribute to it. So, I wanted 00:02:32.640 |
to show a little bit about how the sausage is made in an open source company. And for 00:02:38.060 |
pgvector, this started with just an email from Greg. He said, I'm sending this email to see what 00:02:45.760 |
it would take for your team to accept a Postgres extension called pgvector. It's a simple yet 00:02:51.560 |
powerful extension to support vector operations. I've already done the work. You can find my pull 00:02:58.300 |
request on GitHub. So, I jumped on a call with Greg. And afterwards, I sent him an email the next day. 00:03:05.720 |
Hey, Greg, the extension is merged. So, it should be landing in prod this week. By the way, our docs search is 00:03:13.480 |
currently a bit broken. Is this something you'd be interested in helping with? Then, fast forward two weeks, and we released 00:03:21.420 |
Clippy, which is, of course, a throwback to Microsoft Clippy, the OG AI assistant. I think we were the first 00:03:31.840 |
to do this within docs. We certainly didn't know of anyone else doing this as a docs search interface. 00:03:36.680 |
So, we built an example, a template around it where you can do this within your own docs. And others followed suit. 00:03:43.100 |
Notably, Mozilla released this for MDN, one of the most popular dev docs on the internet. Along with many other 00:03:51.700 |
AI applications. So, this is a chart of all the new databases being launched on supabase.com, our platform. 00:04:00.100 |
It doesn't include the open source databases. So, you can see where pgvector was added. It is one of the 00:04:08.020 |
tailwinds that accelerated the growth of new databases on our platform. And since then, we've kind of become 00:04:15.380 |
part of the AI stack for a lot of builders. We work very well with Vercel, Netlify, the Jamstack 00:04:22.340 |
crowd. And now we're launching around 12,000 databases a week. And around maybe 10 to 15 percent of 00:04:30.980 |
them are using pgvector in one way or another. So, that's thousands of AI applications being launched every 00:04:36.740 |
week. Also, some of these apps kind of fit that tagline, build in a weekend, scale to millions. We've 00:04:43.940 |
literally had apps. We had one that scaled to a million users in 10 days. I know they built it in three 00:04:49.860 |
days. So, a lot of really bizarre things that we've seen since pgvector was launched. Also, the app you're 00:04:59.460 |
using today, if you're using it, is powered by Supabase. So, thank you, Simon, for using that inside 00:05:05.220 |
the application. And then finally, just to wrap up that story arc, Greg, who emailed us at the start of 00:05:12.740 |
the year, now works at Supabase. If you attended the workshop yesterday, he actually was the one who ran it. 00:05:23.780 |
He's also responsible for a lot of the growth in Supabase. So, we owe him a lot. 00:05:29.220 |
But every good story has a few speed bumps. And for pgvector, that started with a tweet. 00:05:37.940 |
This is the one. It says: why you should never use pgvector, Supabase's vector store, for production. 00:05:45.060 |
pgvector is 20 times slower than a decent vector database, Qdrant. And it's a full 18% worse in 00:05:52.100 |
finding relevant docs for you. So, in this chart, higher is better. It's the queries per second. Just 00:05:59.060 |
making sure you all know. And Postgres, with the IVFFlat index, is not doing well here. And first of all, 00:06:07.860 |
we feel this is an unfair characterization of Supabase, because pgvector is actually owned by 00:06:14.100 |
Andrew Kane, a sole contributor who developed it many years before Supabase came along. 00:06:21.860 |
But nonetheless, we are contributors. And so, when Andrew saw the tweet, he decided, well, 00:06:29.220 |
HNSW, let's just add it. And we got to work with the Aureol team and the AWS team. And it took about 00:06:36.980 |
one month to build in HNSW. What were the results? This is the same chart, but we just use Postgres HNSW. 00:06:53.380 |
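For reference, this is roughly what that looks like in pgvector. This is a minimal sketch, and the documents table, its columns, and the 1536 dimensions are assumptions rather than the actual benchmark schema:

    -- enable the extension, store embeddings, and index them with HNSW
    create extension if not exists vector;

    create table documents (
      id        bigint generated always as identity primary key,
      content   text,
      embedding vector(1536)
    );

    -- HNSW index using cosine distance (available in pgvector 0.5.0 and later)
    create index on documents using hnsw (embedding vector_cosine_ops);

    -- nearest neighbours to a query embedding supplied as a bind parameter
    select id, content
    from documents
    order by embedding <=> $1   -- $1: the query embedding from your application
    limit 5;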
First of all, I'm not a big fan of benchmarks, and it might seem like I'm ragging on Qdrant 00:07:00.100 |
here. I'm not. Unfortunately, they were used in the tweet, so we had to benchmark against them. Also, 00:07:08.100 |
they're very isolated. But what you can see most importantly is that the queries per second 00:07:13.700 |
increased and also the accuracy increased: it's 0.99 for both Qdrant and Postgres HNSW. 00:07:20.900 |
Also, you might be thinking, well, you can just throw compute at it. Maybe that's what they're doing. 00:07:27.220 |
This one actually is a blog post we released today. You can read it. That's the QR code for it. 00:07:33.380 |
This is an apples-to-apples comparison between Pinecone and Postgres for the same compute. 00:07:39.700 |
We basically take the same dollar value. It's actually very hard to benchmark Pinecone and to measure accuracy. 00:07:47.700 |
But we're measuring the queries per second for Pinecone using six replicas, which cost $480, versus one of our 00:07:57.940 |
database instances, which is $410. So we give them a bit of extra compute, and the queries per second and accuracy still come out ahead for Postgres. 00:08:06.980 |
So why am I bullish about Postgres and pgvector for this particular thing? 00:08:15.780 |
I was chatting to Joseph, actually the CEO of Roboflow, a few months ago, and I like to tell this example. 00:08:22.500 |
It's related actually to the Paint.wtf one, but a slightly different application. I like to tell it because 00:08:27.620 |
it highlights the power of Postgres. So he told me about this app where the users could take photos 00:08:34.980 |
of trash within San Francisco and then they would upload it to an embedding store and they would kind 00:08:40.980 |
of measure the trends of trash throughout San Francisco. You could think of this the same as 00:08:47.700 |
the Paint.wtf example that he just used. The problem, of course, with all of these ones is 00:08:56.580 |
not safe for work images. So why is that a problem? First of all, it fills up your embedding store. You 00:09:08.420 |
have to store the data. It's going to cost you more. Your indexes are going to slow down if you're indexing 00:09:14.260 |
this content and users can see this data inside the app. So I thought about this for an hour and I did a little 00:09:20.100 |
proof of concept for him just using Postgres. The solution that I thought of was partitions. Now trash 00:09:27.060 |
is very boring, so I'm going to use cats in this example. We're going to segment good cats and bad cats. 00:09:33.060 |
So we'll start with a basic table where we're going to store all of our cats. We're going to store the embeddings 00:09:39.620 |
inside them. Then when an embedding is uploaded, we're going to call a function called iscat, and here 00:09:47.300 |
I'm going to compare it to a canonical cat. In this case, my space cat. Then if the similarity is greater 00:09:56.820 |
than 0.8, I'll store it in a good cats partition and everything else can just go into a bad cats partition. 00:10:05.300 |
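As a rough sketch of that check (not the exact code from the slides), an iscat function could look something like this. It uses toy 3-dimensional vectors so the snippet runs as-is; a real embedding would have, say, 1536 dimensions, and the hard-coded vector would be the actual space-cat embedding:

    create extension if not exists vector;

    -- similarity of an incoming embedding to one canonical "space cat" embedding;
    -- <=> is pgvector's cosine distance, so 1 minus the distance gives a similarity
    create or replace function iscat(embedding vector(3))
    returns float
    language sql
    immutable
    as $$
      select 1 - (embedding <=> '[1, 0, 0]'::vector);
    $$;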
So to do this, I just took my space cat and I generated a vector of that and then I literally just 00:10:12.020 |
stuffed it inside a Postgres function called iscat. The way that this works, it takes in an embedding 00:10:18.100 |
that's line three, and then it's going to return a float, a similarity, basically. And all it's going to do is 00:10:27.540 |
compare the distance to this canonical cat. I'm going to create a table to store all of my embeddings. 00:10:34.980 |
That's line five, the embeddings, the URL of the image. And then finally on line six, we're going to 00:10:41.300 |
determine the similarity. Is it a good cat or a bad cat? Then finally, Postgres has this thing called 00:10:49.860 |
triggers which are very cool. What we can do is attach a trigger to a table. So first of all, line two, 00:10:56.580 |
we're going to create the trigger. Line three, we're going to do it before the insert onto this table. 00:11:02.340 |
And then the most important one is line six. With this trigger, every time you upload a cat, 00:11:08.420 |
we're going to run that function that we just saw, compare it, and then store in 00:11:14.660 |
the table the similarity. NEW here is actually kind of a special value in Postgres: inside the trigger, 00:11:21.380 |
it holds the values of the row that you're about to insert. And then finally, what does the data look like? 00:11:26.900 |
After uploading a bunch of images, you can see here that we're storing all of our embeddings, 00:11:32.100 |
the URLs for them, and then on the right-hand side, that similarity. And now we can use that 00:11:37.620 |
essentially to create a segment. So we just need to split the data. And the nice thing about 00:11:45.380 |
partitions in Postgres is that each one individually has kind of all the properties of a regular 00:11:51.060 |
table. So we can create an index only on the good cats. And then to clean up, as our bad cats 00:11:57.620 |
are getting uploaded, if we ever want to clean them up, we just drop the partition and recreate it. 00:12:02.340 |
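Putting those pieces together, here is one hedged way the table, partitions, and insert path could be wired up, again with toy 3-dimensional vectors and assuming the iscat function sketched above. One difference from the trigger approach just described: a BEFORE row trigger on a partitioned table isn't allowed to change which partition a row lands in, so this sketch computes the similarity in the insert itself.

    -- parent table partitioned by the similarity score
    create table cats (
      url       text,
      embedding vector(3),
      is_cat    float
    ) partition by range (is_cat);

    -- similarity from 0.8 upwards counts as a good cat
    create table good_cats partition of cats
      for values from (0.8) to (maxvalue);

    -- everything else (including not-safe-for-work uploads) lands here
    create table bad_cats partition of cats default;

    -- index only the good cats; bad cats can simply be dropped and recreated
    create index on good_cats using hnsw (embedding vector_cosine_ops);

    -- compute the similarity up front so Postgres routes the row to the right partition
    insert into cats (url, embedding, is_cat)
    values ('https://example.com/cat.png',
            '[0.9, 0.1, 0.0]',
            iscat('[0.9, 0.1, 0.0]'));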
And the way that they work on disk is all the data is stored grouped together. So good cats will be 00:12:10.100 |
kept fast, and bad cats will be dropped. So what does that look like in code? In Postgres code, 00:12:17.780 |
it's really just 13 or 14 lines of code. On line 7, you can see the 00:12:25.220 |
partition that I create. And I'm going to do it by a range. Here, is_cat is the column that I'm going to 00:12:32.020 |
partition by. And then on line 9, I create good cats. And line 11 is where I actually determine the 00:12:39.460 |
values between 0.8 and 1. And then on line 13, everything else is going to fall into the default 00:12:46.820 |
partition. So honestly, I don't even know if this is the right way to solve the problem. But I just think 00:12:52.580 |
it's cool that I could just do that and it's all built into Postgres. So that's really why I'm bullish 00:12:59.300 |
on Postgres. I mean, it's so extensible. It's got 30 years of engineering. It's got pretty much 00:13:05.460 |
all the primitives that you might need, and it gets out of your way while you are 00:13:10.420 |
building an AI application. It's also extensible. pgvector itself is not built into Postgres. It's just 00:13:17.300 |
an extension. So for us to add it, we just scouted around the community, or Greg did in this case, 00:13:23.140 |
and then we merged it in as an extension, and it was running basically within two days. 00:13:28.580 |
Some other things worth highlighting, if you're doing RAG especially, Postgres has row level security, 00:13:35.940 |
which I think is very cool. This allows you to write declarative rules on your tables inside your 00:13:42.100 |
Postgres database. And so if you're storing user data and you want to split it up by different 00:13:47.060 |
users, you can actually write those rules. It's also defense in depth. So even if something gets past 00:13:54.100 |
your API security and goes directly to your database, the security is still there. 00:13:59.380 |
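A minimal sketch of what that can look like, assuming the hypothetical documents table from earlier plus a user_id column, and Supabase's auth.uid() helper for the current user:

    alter table documents add column user_id uuid;

    alter table documents enable row level security;

    -- declarative rule: each user can only read their own rows, even if a
    -- request gets past the API layer and queries the database directly
    create policy "Users can read their own documents"
      on documents for select
      using (auth.uid() = user_id);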
Something that's often not captured in benchmarks, a single round trip versus multiple round trips. 00:14:06.900 |
So if you store your embeddings next to your operational data, then you do a single fetch to your database. 00:14:15.140 |
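For example, a hedged sketch of that single round trip, assuming the documents table above (with its user_id column) and a hypothetical profiles table holding the operational data:

    create table profiles (
      id       uuid primary key,
      username text
    );

    -- similarity search and the related operational data in one query,
    -- because the embeddings live next to the rest of your data
    select d.content, p.username
    from documents d
    join profiles p on p.id = d.user_id
    order by d.embedding <=> $1   -- $1: the query embedding
    limit 5;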
And then finally, we're still early. pgvector is currently an extension. I can foresee it's 00:14:24.500 |
probably going to get merged into Postgres core eventually. I'm not too sure. People often ask me, is there 00:14:32.180 |
still space for a specialized vector database? Yes, I think there is, for the many other things that databases 00:14:40.180 |
don't do; maybe putting models closer to the database could be one of those things. But for 00:14:47.460 |
this particular use case where you're actually just storing embeddings, indexing them, fetching them out, 00:14:52.180 |
I think Postgres is definitely going to keep moving in that direction. What's next for Supabase Vector? 00:15:02.020 |
Pretty simply, we have been really focused on more enterprise use cases, or largely how do you store 00:15:09.460 |
billions of vectors. This is another area that needs development. So we've been working on sharding 00:15:15.380 |
with Citus, another Postgres extension, and it allows you to split your data between different nodes. 00:15:23.140 |
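As a sketch of the direction, again assuming the hypothetical documents table from earlier, Citus distributes a table's rows across worker nodes by a chosen column:

    -- requires the Citus extension to be installed and preloaded on the cluster
    create extension if not exists citus;

    -- shard the documents table across worker nodes by id
    select create_distributed_table('documents', 'id');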
And we've found that the transactions scale in a linear fashion as you add nodes. So in this case, 00:15:31.380 |
we're going to keep developing this. We've been chatting to the Citus team at Microsoft. If you want to be a 00:15:36.180 |
design partner on this, then we'd love to work with you on it, and especially if you're already storing 00:15:40.740 |
billions of embeddings. And if you want to get started, just go to database.new. Also, 00:15:49.060 |
apparently our swag has now finally arrived. So if you want some free credits and swag, come see us.