
OpenAI's New GPT 3.5 Embedding Model for Semantic Search


Chapters

0:00
0:30 Semantic search with OpenAI GPT architecture
3:43 Getting started with OpenAI embeddings in Python
4:12 Initializing connection to OpenAI API
5:49 Creating OpenAI embeddings with ada
7:24 Initializing the Pinecone vector index
9:04 Getting dataset from Hugging Face to embed and index
10:03 Populating vector index with embeddings
12:01 Semantic search querying
15:09 Deleting the environment
15:23 Final notes

Transcript

Today, we're going to have a look at how we can use OpenAI's new text embedding model, creatively named text-embedding-ada-002, to essentially search through loads of documents and do it in a super easy way. So we really don't need to know that much about what is going on behind the scenes here.

We can just kind of get going with it and get really impressive results super quickly. So to start, let's just have a quick look at how all this is going to look. It's very similar, if you follow any of these videos, very similar architecture to what we normally use.

We start with our data source, which is going to be over here. And we're going to take that and we're going to use the new ada-002 model to embed these. Okay, so what we have in here are sentences, some text that goes through like this. And what we're doing here is creating meaningful embeddings.

So for example, two sentences that have a very similar meaning within a vector space, because that's what we're converting them into, vectors within that vector space, they will be located very closely together. And of course, we know that OpenAI, when they do something, they do it pretty well. So the expectation here is that the ada-002 model is going to be pretty good at creating these dense vector representations.

So from that, we're going to get our embeddings. I'm going to just have them in this little square here. What we're going to do with those is we're going to take them over into Pinecone, which is going to be our vector database. So where we essentially, where this will live, this vector space.

So we have our vector database here and they're going to go into there like that. Okay, so this process here is what we would refer to as indexing. Okay, we're taking all of our data and we're indexing it within Pinecone using the ada-002 model. Now, there's another step to this whole pipeline that we haven't spoken about and that is querying.

So querying is literally when we do a search. So let's say some random person comes along and they're like, I want to know about this. We don't know what they're asking about. It's a mystery, but they have this query. They've passed it to us. What we do with that query is we take it into ada-002.

We embed it to create a query vector. So it's going to be a smaller box called xq. And we're going to take that over to Pinecone here and we're going to ask Pinecone to return the top k. So top k is going to be a number. Let's say we say 3 or 5.

Let's say 5, return the top k most relevant vectors that we have already indexed. So we return those. Now we have five of these vectors. They're all in here, 1, 2, 3, 4, 5 and we return them to the user. Okay, but when we return them to the user, we're actually not going to return the vectors because it's just numbers.

It won't make any sense. We're going to return the text that those vectors were embedded with. Okay, and that is how we will build our system. Now it's actually super simple. This chart probably makes it look way more complicated than it actually is. Let's take a look at the code.

So we're going to be working from this example here. So we have docs.pinecone.io, docs, OpenAI. We're going to open this in Colab and just work through. So we get started by just installing any prerequisites that we have. So we want to install the Pinecone client, OpenAI and datasets.
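For reference, the install cell looks roughly like this (a sketch; these are the package names used by the notebook at the time, and exact pinning may differ):

```python
# install the Pinecone client, the OpenAI client, and Hugging Face datasets
!pip install -qU pinecone-client openai datasets
```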

So we'll go ahead and run that. Okay, that will take a moment. Okay, great. So come down here and first thing we're going to need to do is create our embeddings. Now to do that, we need to initialize our connection to OpenAI and for that we need these two keys.

So we need an organization key and we need our secret API key. So to get that, we'll head over here. We go to beta.openai.com and you'll need to log in so you can log in at the top right. I've already logged in so I can go over, click on my profile and I can click view API keys.

Okay, and the first page you come to here is the secret key. Now here you can't copy this. It's already been created. So what you need to do is create a new secret key. So I will do that and then you just copy your key here. Then with that secret key, you need to paste it into here.

I have mine stored in a variable called API key. Then we return to the OpenAI page. We go over to settings and then in here we'll also find our organization ID. So we need to copy that and that will go in here. And I have mine stored in another variable called org key.

So I will copy that. Now I can run this and what we'll do is we'll get a list of all the models that are available as long as we've authenticated correctly. So you can see we have this big list which we initialized with this OpenAI engine list. So we're just seeing everything in there and I don't know if maybe ada is at the bottom, maybe not.
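That cell is roughly the following sketch, assuming the pre-1.0 openai Python client; api_key and org_key are the variables I just mentioned:

```python
import openai

openai.organization = org_key  # organization ID from the OpenAI settings page
openai.api_key = api_key       # secret key created under "view API keys"

# list the available engines to confirm we've authenticated correctly
openai.Engine.list()
```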

So I'm not going to search through it, but we'll see which model we're using here. So this is a new model from OpenAI and it's much cheaper to use and the performance is supposedly much greater. So we'll go ahead and we'll try this one out. So text-embedding-ada-002, and just as an example, this is how we would create our embedding.

So OpenAI embedding create and then we can pass multiple things to embed here. So here we have two sentences and that means we will end up outputting two vector embeddings. And then for the model, we just pass the model that we'd like to use. So this one. Okay, so we run that and if it worked correctly, you should see that we have these vectors in here.
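That call looks something like this (a sketch with two placeholder sentences, again assuming the pre-1.0 openai client; MODEL is just an illustrative variable name):

```python
MODEL = "text-embedding-ada-002"

res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch",
    ],
    engine=MODEL,
)
```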

Okay, and some little bits of information in there. So it's pretty cool. Now one thing that I would like to demonstrate here is, okay, are these vectors, do they have the same dimensionality and what is that dimensionality? Now they're output by the same model, so we would expect them to have the same dimensionality.

So we're just checking the response. We have data, zero and embedding. So essentially what we have in here, if I scroll up a bit, you'll be able to see that. Okay, so we have data, we're going for the first item in the list and we're looking at embedding. Great.

Now print those out and we should see that we get 1536, which is the embedding dimensionality of the new ada model. Now what I want to do is extract those into a list, which is what we're going to be doing later. So we can extract those and see that we do in fact have two of those.
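Extracting and checking them might look like this (a sketch based on the response structure just described):

```python
# each item in res['data'] holds one embedding; ada-002 vectors have 1,536 dimensions
embeds = [record['embedding'] for record in res['data']]
print(len(embeds))     # 2 inputs -> 2 vectors
print(len(embeds[0]))  # 1536
```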

And again, we can check the dimensionality there as well. So now what we need to do is initialize a Pinecone instance. And this is where we're going to store all of our vectors. So for that, we need to head over to app.pinecone.io. So let me open that over here.

You will need to sign up if this is your first time. Again, you should come through to a page that looks kind of like this. So I have James's default project up here. You will have your name followed by default project. And what we're going to do is we don't want to create our first index.

We're going to be doing that in Python. What we do need is the API keys. So I'm going to just take one of these. I have my default API key here. I'm going to copy it here and we're going to paste it into the notebook. So I've stored mine in a variable called Pinecone key.

So I can run that. And what this will do is initialize our connection to Pinecone. It will check if there is an index called OpenAI within our project. So within this space here, we don't have any so it doesn't exist. If it doesn't exist, it will be created and it will use this dimension here.
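That cell is along these lines (a sketch assuming the v2 pinecone-client; pinecone_key is the variable I mentioned, and the environment value is an assumption that depends on your own project):

```python
import pinecone

index_name = 'openai'

pinecone.init(
    api_key=pinecone_key,        # API key copied from app.pinecone.io
    environment='us-west1-gcp',  # assumption: use your own project's environment
)

# create the index only if it doesn't already exist, matching ada-002's dimensionality
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=1536, metric='cosine')

# connect to the index
index = pinecone.Index(index_name)
```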

So this dimension is a 1536 that we saw earlier. And then we'll connect to that index. So let's run that. And if we navigate back to the page here, the app.pinecone.io, we can refresh and we should see that we have an index here. It was initializing and now it's ready.

So we can see all the details there. We see the dimensionality, the pod types we're using, metrics and so on. So these are just default variables there. But yes, we do want to be using cosine and pod type. You can change the pod type depending on what you're wanting to do.

So back in our code, let's go ahead and begin populating that index. So to populate the index, we obviously need some data. We're just going to use a very small dataset, 1,000 questions from the TREC dataset. So let's load that. This we are getting from Hugging Face datasets.

So if we actually go to huggingface.co, datasets, trec, we'll see the dataset that we are downloading, which is this here. Okay. I think in total there's maybe 5,000-ish examples in there. We're just going to use the first 1,000 to make things really fast as we're walking through this example.

Okay. And yes, we can see we have text, coarse label, fine label. All we really care about here is actually the text. Okay. And we can have a look at the first one. How did serfdom develop in and then leave Russia? And we can also compare that over to here and we see that it's actually exactly the same.
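Loading that slice with Hugging Face datasets looks something like this:

```python
from datasets import load_dataset

# first 1,000 questions of the TREC dataset
trec = load_dataset('trec', split='train[:1000]')

print(trec[0]['text'])  # "How did serfdom develop in and then leave Russia ?"
```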

Okay, cool. So now what we want to do is we're going to create a vector embedding for each one of these samples. So well, let's walk through the logic of doing that. So we're going to be doing that in a loop. We're going to be doing it in batches of 32.

And what we're going to do is extract the start position of the batch, which is i, and the end position of that batch. And we're going to get all of the text within that batch. So this should actually be i_end. So we get all the text within the batch.

We get all the IDs, which is just a count. You can use actual IDs if you want. For this example, it's not really needed. And then what we're going to do is we're going to create our embeddings using the OpenAI endpoint that we used before. So we have our inputs, which is our batch of text.

We have the engine, which is the ada-002 model. And then here, we're just reformatting those embeddings into a format that we can then take and upsert into Pinecone. We also, so later on when we're serving or when we're querying, what we're going to want to do is we don't want to see these vectors because it doesn't make sense to us.

We want to see the original text. So to make that easy, what we're going to do is pair our metadata. So the metadata is literally just that text that we want to see. That will basically just be some metadata attached to each one of our vectors. And it means that when we're querying, we can just return that and read the actual text rather than looking at the vectors.
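Putting those pieces together, the indexing loop is along these lines (a sketch of the logic just described, assuming the pre-1.0 openai client and the v2 pinecone-client):

```python
from tqdm.auto import tqdm

batch_size = 32  # embed and upsert in batches of 32

for i in tqdm(range(0, len(trec['text']), batch_size)):
    i_end = min(i + batch_size, len(trec['text']))  # end position of this batch
    lines_batch = trec['text'][i:i_end]             # the text in this batch
    ids_batch = [str(n) for n in range(i, i_end)]   # simple count-based IDs
    # create embeddings for the whole batch in one call
    res = openai.Embedding.create(input=lines_batch, engine=MODEL)
    embeds = [record['embedding'] for record in res['data']]
    # attach the original text as metadata so queries return readable results
    meta = [{'text': line} for line in lines_batch]
    # upsert (id, vector, metadata) records into Pinecone
    index.upsert(vectors=list(zip(ids_batch, embeds, meta)))
```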

I mean, that's it. So we zip all this together. So each record is going to be a unique ID, the vector embedding, and the attached metadata. And then we upsert all that into Pinecone. So we can run that. It should be pretty quick. Okay. Yep. It's like 13 seconds, really fast.

Okay. 14 seconds total. Really super fast for a thousand items. That's pretty insane. So now what we can do, that is the indexing portion of our app done. So all of this in green is now complete. So we can kind of cross that off. Now what we need to focus on is the querying.

So how do we do querying? It's actually really easy. So we have a query. I'm going to say, what caused the 1929 Great Depression? We're kind of limited in the number of questions we can ask here because we only have 1,000 examples indexed. Realistically, you'd probably have millions or more. So we're going to be limited on what we can actually ask here.

But this is good enough to demonstrate the workflow. So let's run this. Basically, we're doing the exact same thing for the query that we did with the lines of the TREC dataset from before. So we're just embedding it using the ada-002 model. In this case, we just have one string input there.

And then what we do is in that response, we're going to have data, we're going to retrieve the first item. There's just one item in there anyway. And we want the embedding from that. And that will be, if I take a look at this, so that will be a 1536 dimensional vector.

And then what we can do is we pass that to index.query, like so. OK, so we can remove those square brackets there. Top k equals 5, include metadata. We do want to include this. This is going to return the text, the original text back to us. So let's see, are we returning questions that are similar to the question we asked?
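As a sketch, the query cell is roughly:

```python
query = "What caused the 1929 Great Depression?"

# embed the query with the same ada-002 model used for indexing
xq = openai.Embedding.create(input=query, engine=MODEL)['data'][0]['embedding']

# retrieve the 5 most similar vectors along with their text metadata
res = index.query(xq, top_k=5, include_metadata=True)
```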

OK, so why did the world enter a global depression in 1929 when it was the Great Depression? I don't know what is with the weird formatting here. And then it's talking about some other things that are maybe somewhat related. I'm not really sure, or just things from around that sort of time era.

But we can see that the score here, the similarity, does drop really quickly when we come down to these. Because they're actually not that relevant. They're just kind of within the same context, I suppose. So that's pretty cool. It's clearly returning the correct question that we would expect it to, based on the question we asked.

OK, so we can also format that a little bit nicer. So here, just run that. We can see it's a little bit easier to read than the sort of response format that we had up here.
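A simple way to print the matches more readably (a sketch based on the response structure above):

```python
for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")
```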

Now let's make it a little bit harder. We're just going to replace the correct term depression with the incorrect term recession, and see if it still understands our query, because this is where a lexical search, where you're searching by keywords, would fail. In this case, we should see, hopefully, that it does not fail. So we replicate the same logic again. And we can see that, yes, the similarity is slightly lower, because we're using a different word.

But it's still returning the relevant question as our first example there. OK, that is pretty cool. Now let's make it even harder. Why was there a long-term economic downturn in the early 20th century? Is it going to figure out that this is what we're talking about? That we're talking about the global depression of 1929?

And yes, it does. And the similarity is actually pretty good there. So despite not really sharing any of the same words, it actually manages to identify that this is talking about the same thing, which is pretty good. Now with that done, we can finish with this example. So one thing you might need to do here is head over to the Pinecone console, and you can just go ahead and delete the index, or you can do it in code.
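In code, the cleanup is a one-liner with the v2 client:

```python
# delete the index to free up resources once you're done
pinecone.delete_index(index_name)
```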

Completely up to you. Great. So that's it for this walkthrough and example. I hope this has been useful. It's really cool to see OpenAI's new embedding model. And from what I've heard, the performance, although not as clear from this example, is really good. And as you have seen from this example, it's super easy to use.

So a few lines of code, and we have this really cool, really high-performance, semantic search example with OpenAI and Pinecone, and we don't really need to worry about anything. It's just super easy to do. So I hope this has all been interesting and useful. Thank you very much for watching, and I'll see you again in the next one.

Bye.