
OpenAI's New GPT 3.5 Embedding Model for Semantic Search


Chapters

0:30 Semantic search with OpenAI GPT architecture
3:43 Getting started with OpenAI embeddings in Python
4:12 Initializing connection to OpenAI API
5:49 Creating OpenAI embeddings with ada
7:24 Initializing the Pinecone vector index
9:04 Getting dataset from Hugging Face to embed and index
10:03 Populating vector index with embeddings
12:01 Semantic search querying
15:09 Deleting the environment
15:23 Final notes

Whisper Transcript

00:00:00.000 | Today, we're going to have a look at how we can use OpenAI's new text embedding model,
00:00:05.400 | creatively named text-embedding-ada-002,
00:00:09.200 | to essentially search through loads of documents and do it in a super easy way.
00:00:15.600 | So we really don't need to know that much about what is going on behind the scenes here.
00:00:20.200 | We can just kind of get going with it and get really impressive results super quickly.
00:00:25.900 | So to start, let's just have a quick look at how all this is going to look.
00:00:30.100 | It's very similar, if you follow any of these videos, very similar architecture to what we normally use.
00:00:35.400 | We start with our data source,
00:00:42.700 | which is going to be over here. And we're going to take that
00:00:47.900 | and we're going to use the new ada-002 model to embed these.
00:00:56.200 | Okay, so what we have in here are sentences, some text goes through like this.
00:01:04.400 | And what we're doing here is creating meaningful embeddings.
00:01:07.500 | So for example, two sentences that have a very similar meaning within a vector space,
00:01:14.400 | because that's what we're converting them into, vectors within that vector space,
00:01:17.900 | they will be located very closely together.
00:01:20.600 | And of course, we know that OpenAI, when they do something, they do it pretty well.
00:01:25.200 | So the expectation here is that the ada-002 model is going to be pretty good at creating these dense vector representations.
00:01:33.200 | So from that, we're going to get our embeddings.
00:01:37.000 | I'm going to just have them in this little square here.
00:01:40.000 | What we're going to do with those is we're going to take them over into Pinecone,
00:01:44.800 | which is going to be our vector database.
00:01:47.000 | So this is essentially where this vector space will live.
00:01:52.300 | So we have our vector database here and they're going to go into there like that.
00:02:00.500 | Okay, so this process here is what we would refer to as indexing.
00:02:06.300 | Okay, we're taking all of our data and we're indexing it within Pinecone using the ada-002 model.
00:02:11.900 | Now, there's another step to this whole pipeline that we haven't spoken about and that is querying.
00:02:19.900 | So querying is literally when we do a search.
00:02:22.800 | So let's say some random person comes along and they're like, I want to know about this.
00:02:28.200 | We don't know what they're asking about. It's a mystery, but they have this query.
00:02:32.900 | They've passed it to us.
00:02:34.200 | What we do with that query is we take it into ada-002.
00:02:38.400 | We embed it to create a query vector.
00:02:42.300 | So it's going to be a smaller box called xq.
00:02:47.100 | And we're going to take that over to Pinecone here, and we're going to say to Pinecone, return the top k.
00:02:55.500 | So top k is going to be a number.
00:02:57.400 | Let's say we say 3 or 5.
00:03:01.000 | Let's say 5, return the top k most relevant vectors that we have already indexed.
00:03:07.300 | So we return those. Now we have five of these vectors.
00:03:12.900 | They're all in here, 1, 2, 3, 4, 5 and we return them to the user.
00:03:19.300 | Okay, but when we return them to the user, we're actually not going to return the vectors because it's just numbers.
00:03:24.700 | It won't make any sense.
00:03:26.600 | We're going to return the text that those vectors were embedded with.
00:03:31.900 | Okay, and that is how we will build our system.
00:03:36.900 | Now it's actually super simple. This chart probably makes it look way more complicated than it actually is.
00:03:42.200 | Let's take a look at the code.
00:03:44.000 | So we're going to be working from this example here.
So that's docs.pinecone.io/docs/openai.
00:03:50.300 | We're going to open this in Colab and just work through.
00:03:56.200 | So we get started by just installing any prerequisites that we have.
00:04:00.600 | So we want to install the Pinecone Client, OpenAI and datasets.
00:04:05.900 | So we'll go ahead and run that.
00:04:10.800 | Okay, that will take a moment. Okay, great.
00:04:13.400 | So come down here and first thing we're going to need to do is create our embeddings.
00:04:17.200 | Now to do that, we need to initialize our connection to OpenAI and for that we need these two keys.
00:04:25.100 | So we need an organization key and we need our secret API key.
00:04:29.200 | So to get that, we'll head over here.
00:04:31.800 | We go to beta.openai.com, and you'll need to log in, so you can log in at the top right.
00:04:37.800 | I've already logged in so I can go over, click on my profile and I can click view API keys.
00:04:43.900 | Okay, and the first page you come to here is the secret key.
00:04:48.000 | Now here you can't copy this. It's already been created.
00:04:51.800 | So what you need to do is create a new secret key.
00:04:54.900 | So I will do that and then you just copy your key here.
00:04:58.100 | Then with that secret key, you need to paste it into here.
00:05:02.800 | I have mine stored in a variable called API key.
00:05:05.700 | Then we return to the OpenAI page.
00:05:08.300 | We go over to settings and then in here we'll also find our organization ID.
00:05:14.100 | So we need to copy that and that will go in here.
00:05:17.900 | And I have mine stored in another variable called org key.
00:05:22.700 | So I will copy that.
00:05:24.300 | Now I can run this and what we'll do is we'll get a list of all the models that are available
00:05:30.400 | as long as we've authenticated correctly.
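(A minimal sketch of that initialization, using the pre-1.0 openai Python library the video is based on; ORG_KEY and API_KEY are placeholder variables:)

```python
import openai

# both values are placeholders; use your own keys
openai.organization = ORG_KEY  # organization ID from the settings page
openai.api_key = API_KEY       # secret key from the API keys page

# list the available models to confirm we've authenticated correctly
models = openai.Engine.list()
```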
00:05:33.000 | So you can see we have this big list, which we retrieved with that OpenAI engine list call.
00:05:37.400 | So we're just seeing everything in there, and I don't know if maybe ada is at the bottom, maybe not.
00:05:45.300 | So I'm not going to search through it, but we'll see which model we're using here.
00:05:48.800 | So this is a new model from OpenAI and it's much cheaper to use
00:05:54.300 | and the performance is supposedly much greater.
00:05:57.700 | So we'll go ahead and we'll try this one out.
00:06:00.700 | So text-embedding-ada-002, and just as an example, this is how we would create our embedding.
00:06:07.300 | So OpenAI embedding create and then we can pass multiple things to embed here.
00:06:13.100 | So here we have two sentences, and that means we will end up outputting two vector embeddings.
00:06:18.900 | And then for the model, we just pass the model that we'd like to use.
00:06:21.300 | So this one. Okay, so we run that and if it worked correctly, you should see that we have these vectors in here.
00:06:28.400 | Okay, and some little bits of information in there.
00:06:31.100 | So it's pretty cool.
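(Roughly, that embedding call looks like this; the two input sentences are just placeholders:)

```python
res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch"
    ],
    engine="text-embedding-ada-002"
)
```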
00:06:32.800 | Now one thing that I would like to demonstrate here is, okay, are these vectors,
00:06:38.700 | do they have the same dimensionality and what is that dimensionality?
00:06:42.000 | Now they're output by the same model, so we would expect them to have the same dimensionality.
00:06:46.400 | So we're just checking the response. We have data, zero and embedding.
00:06:51.700 | So essentially what we have in here, if I scroll up a bit, you'll be able to see that.
00:06:58.500 | Okay, so we have data, we're going for the first item in the list and we're looking at embedding.
00:07:03.400 | Great. Now print those out and we should see that we get 1536,
00:07:08.200 | which is the embedding dimensionality of the new ada model.
00:07:11.300 | Now what I want to do is extract those into a list, which is what we're going to be doing later.
00:07:15.700 | So we can extract those and see that we do in fact have two of those.
00:07:19.400 | And again, we can check the dimensionality there as well.
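(Sketching those checks against the response from the previous call:)

```python
# each input sentence produced one 1536-dimensional vector
print(len(res['data'][0]['embedding']))  # 1536

# extract the embeddings into plain Python lists
embeds = [record['embedding'] for record in res['data']]
print(len(embeds))     # 2, one vector per input sentence
print(len(embeds[0]))  # 1536
```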
00:07:23.300 | So now what we need to do is initialize a Pinecone instance.
00:07:28.700 | And this is where we're going to store all of our vectors.
00:07:31.800 | So for that, we need to head over to app.pinecone.io.
00:07:36.600 | So let me open that over here. You will need to sign up if this is your first time.
00:07:42.200 | Again, you should come through to a page that looks kind of like this.
00:07:45.100 | So I have James's default project up here. You will have your name followed by default project.
00:07:50.400 | And what we're going to do is we don't want to create our first index.
00:07:53.400 | We're going to be doing that in Python. What we do need is the API keys.
00:07:57.400 | So I'm going to just take one of these. I have my default API key here.
00:08:00.800 | I'm going to copy it here and we're going to paste it into the notebook.
00:08:04.000 | So I've stored mine in a variable called Pinecone key.
00:08:09.000 | So I can run that. And what this will do is initialize our connection to Pinecone.
00:08:13.400 | It will check if there is an index called OpenAI within our project.
00:08:18.200 | So within this space here, we don't have any so it doesn't exist.
00:08:23.300 | If it doesn't exist, it will be created and it will use this dimension here.
00:08:27.800 | So this dimension is a 1536 that we saw earlier. And then we'll connect to that index.
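(A minimal sketch using the pre-3.0 pinecone-client API from the video; PINECONE_KEY is a placeholder, and the environment value is an assumption that depends on your project:)

```python
import pinecone

pinecone.init(
    api_key=PINECONE_KEY,
    environment="us-east1-gcp"  # assumption: check your project's environment in the console
)

# create the index if it doesn't already exist, then connect to it
if "openai" not in pinecone.list_indexes():
    pinecone.create_index("openai", dimension=len(embeds[0]))  # 1536; cosine metric is the default
index = pinecone.Index("openai")
```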
00:08:33.100 | So let's run that. And if we navigate back to the page here, the app.pinecone.io,
00:08:39.700 | we can refresh and we should see that we have an index here.
00:08:44.200 | It was initializing and now it's ready. So we can see all the details there.
00:08:49.000 | We see the dimensionality, the pod types we're using, metrics and so on.
00:08:53.800 | So these are just default values there. But yes, we do want to be using cosine, and then there's the pod type.
00:08:58.900 | You can change the pod type depending on what you're wanting to do.
00:09:02.400 | So back in our code, let's go ahead and begin populating that index.
00:09:07.100 | So to populate the index, we obviously need some data.
00:09:10.500 | We're just going to use a very small dataset, 1,000 questions from the TREC dataset.
00:09:15.800 | So let's load that. This we are getting from Hugging Face data sets.
00:09:21.500 | So if we actually go to huggingface.co/datasets/trec,
00:09:25.900 | we'll see the data set that we are downloading, which is this here.
00:09:31.500 | Okay. I think in total there's maybe 5,000-ish examples in there.
00:09:38.400 | We're just going to use the first 1,000 to make things really fast as we're walking through this example.
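(Loading that with Hugging Face datasets is one call:)

```python
from datasets import load_dataset

# first 1,000 questions from the TREC dataset
trec = load_dataset("trec", split="train[:1000]")
```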
00:09:43.200 | Okay. And yes, we can see we have text, coarse label,
00:09:46.200 | fine label. All we really care about here is actually the text.
00:09:49.900 | Okay. And we can have a look at the first one.
00:09:53.000 | How did serfdom develop in and then leave Russia?
00:09:56.600 | And we can also compare that over to here and we see that it's actually exactly the same.
00:10:01.300 | Okay, cool. So now what we want to do is we're going to create a vector embedding for each one of these samples.
00:10:08.700 | So well, let's walk through the logic of doing that.
00:10:12.000 | So we're going to be doing that in a loop. We're going to be doing it in batches of 32.
00:10:15.700 | And what we're going to do is extract the start position of the batch,
00:10:19.200 | which is I and the end position of that batch.
00:10:21.900 | And we're going to get all of the text within that batch.
00:10:26.500 | So this should actually be i_end. So we get all the text within the batch.
00:10:31.500 | We get all the IDs, which is just a count. You can use actual IDs if you want.
00:10:35.800 | For this example, it's not really needed.
00:10:37.600 | And then what we're going to do is we're going to create our embeddings using the OpenAI endpoint that we used before.
00:10:42.500 | So we have our inputs, which is our batch of text.
00:10:45.500 | We have the engine, which is the ada-002 model.
00:10:48.700 | And then here, we're just reformatting those embeddings into a format that we can then take and upsert into Pinecone.
00:10:56.800 | Also, later on when we're serving or querying,
00:11:01.500 | we don't want to just see these vectors, because they don't make any sense to us.
00:11:05.900 | We want to see the original text.
00:11:08.200 | So to make that easy, what we're going to do is pair our metadata.
00:11:12.500 | So the metadata is literally just that text that we want to see.
00:11:16.400 | That will basically just be some metadata attached to each one of our vectors.
00:11:20.600 | And it means that when we're querying, we can just return that and read the actual text rather than looking at the vectors.
00:11:26.900 | I mean, that's it. So we zip all this together.
00:11:29.000 | So each record is going to be a unique ID, the vector embedding, and the attached metadata.
00:11:35.400 | And then we upsert all that into Pinecone.
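(Putting that loop together, roughly; tqdm is just a progress bar, and the batch logic follows the description above:)

```python
from tqdm.auto import tqdm

batch_size = 32
for i in tqdm(range(0, len(trec['text']), batch_size)):
    i_end = min(i + batch_size, len(trec['text']))   # end position of this batch
    lines_batch = trec['text'][i:i_end]              # all the text within the batch
    ids_batch = [str(n) for n in range(i, i_end)]    # IDs are just a count
    # create the embeddings for the whole batch in one call
    res = openai.Embedding.create(input=lines_batch, engine="text-embedding-ada-002")
    embeds = [record['embedding'] for record in res['data']]
    # pair the original text with each vector as metadata
    meta = [{'text': line} for line in lines_batch]
    # each record is (unique ID, vector embedding, attached metadata)
    to_upsert = zip(ids_batch, embeds, meta)
    index.upsert(vectors=list(to_upsert))
```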
00:11:38.100 | So we can run that. It should be pretty quick.
00:11:41.200 | Okay. Yep. It's like 13 seconds, really fast.
00:11:44.600 | Okay. 14 seconds total. Really super fast for a thousand items.
00:11:49.600 | That's pretty insane. So with that, the indexing portion of our app is done.
00:11:55.600 | So all of this in green is now complete.
00:11:59.200 | So we can kind of cross that off. Now what we need to focus on is the querying.
00:12:03.200 | So how do we do querying? It's actually really easy.
00:12:05.800 | So we have a query. I'm going to say what caused the 1929 Great Depression?
00:12:10.700 | We're kind of limited in number of questions we can ask here because we do only have 1,000 examples indexed.
00:12:16.900 | In a realistic use case, you'd probably have millions or more indexed.
00:12:19.800 | So we're going to be limited on what we can actually ask here.
00:12:22.800 | But this is still good enough to demonstrate the workflow.
00:12:28.500 | So let's run this. Basically, we're doing the exact same thing for the query that we did with the lines of the TREC dataset before.
00:12:36.700 | So we're just embedding it using the ada-002 model.
00:12:40.300 | In this case, we just have one string input there.
00:12:43.900 | And then what we do is in that response, we're going to have data, we're going to retrieve the first item.
00:12:49.700 | There's just one item in there anyway. And we want the embedding from that.
00:12:53.600 | And, if I take a look at this, that will be a 1536-dimensional vector.
00:13:01.700 | And then what we can do is we pass that to index.query, like so.
00:13:08.900 | OK, so we can remove those square brackets there. Top k equals 5, include metadata.
00:13:16.300 | We do want to include this. This is going to return the text, the original text back to us.
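(The whole query step, sketched with the same era APIs:)

```python
query = "What caused the 1929 Great Depression?"

# embed the query to create the query vector xq
xq = openai.Embedding.create(input=query, engine="text-embedding-ada-002")['data'][0]['embedding']

# return the top 5 most similar indexed vectors, with their metadata
res = index.query(xq, top_k=5, include_metadata=True)
```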
00:13:21.400 | So let's see, are we returning questions that are similar to the question we asked?
00:13:26.700 | OK, so the top result is "Why did the world enter a global depression in 1929?", which is the Great Depression.
00:13:33.200 | I don't know what is with the weird formatting here.
00:13:36.800 | And then it's talking about some other things that are maybe somewhat related.
00:13:40.100 | I'm not really sure, or just things from around that sort of time era.
00:13:44.600 | But we can see that the score here, the similarity, does drop really quickly when we come down to these.
00:13:51.400 | Because they're actually not that relevant. They're just kind of within the same context, I suppose.
00:13:56.200 | So that's pretty cool.
00:13:58.200 | It's clearly returning the correct question that we would expect it to, based on the question we asked.
00:14:04.100 | OK, so we can also format that a little bit nicer.
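(For example, something like:)

```python
for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")
```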
00:14:06.900 | So here, just run that. We can see a little bit easier to read than in this sort of response format that we had up here.
00:14:14.000 | Now let's make it a little bit harder. We're just going to replace the correct term "depression" with the incorrect term "recession".
00:14:19.800 | And see if it still understands our query, because this is where a lexical search,
00:14:24.000 | so where you're searching by keywords, would fail.
00:14:26.100 | In this case, we should see, hopefully, that it does not fail.
00:14:28.500 | So we replicate the same logic again.
00:14:32.400 | And we can see that, yes, the similarity is slightly lower, because we're using a different word.
00:14:37.100 | But it's still returning the relevant question as our first example there.
00:14:42.100 | OK, that is pretty cool. Now let's make it even harder.
00:14:46.500 | Why was there a long-term economic downturn in the early 20th century?
00:14:50.600 | Is it going to figure out that this is what we're talking about?
00:14:53.400 | That we're talking about the global depression of 1929?
00:14:57.800 | And yes, it does. And the similarity is actually pretty good there.
00:15:01.300 | So despite not really sharing any of the same words,
00:15:03.900 | it actually manages to identify that this is talking about the same thing, which is pretty good.
00:15:09.700 | Now with that done, we can finish with this example.
00:15:12.700 | So one thing you might need to do here is head over to Pinecone console,
00:15:18.200 | and you can just go ahead and delete the index, or you can do it in code. Completely up to you.
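(In code, deleting the index is a single call:)

```python
pinecone.delete_index("openai")
```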
00:15:23.400 | Great. So that's it for this walkthrough and example.
00:15:27.500 | I hope this has been useful. It's really cool to see OpenAI's new embedding model.
00:15:32.600 | And from what I've heard, the performance, although not as clear from this example, is really good.
00:15:38.800 | And as you have seen from this example, it's super easy to use.
00:15:42.100 | So a few lines of code, and we have this really cool, really high-performance,
00:15:46.800 | semantic search example with OpenAI and Pinecone, and we don't really need to worry about anything.
00:15:53.000 | It's just super easy to do. So I hope this has all been interesting and useful.
00:15:57.800 | Thank you very much for watching, and I'll see you again in the next one. Bye.