
OpenAI's NEW Embedding Models


Chapters

0:00 OpenAI Ada 002
1:25 New OpenAI Embedding Models
3:50 OpenAI Embedding Dimension Parameter
5:04 Using OpenAI Embedding 3
10:08 Comparing Ada 002 to Embed 3

Transcript

Way back in December 2022, we had the biggest shift in how we approach AI ever. That was thanks to OpenAI releasing ChatGPT at the very end of November. ChatGPT quickly caught a lot of people's attention and it was in the month of December that the interest in ChatGPT and AI really exploded.

But right in the middle of December, OpenAI released another model that also changed the entire landscape of AI, though it didn't get anywhere near as much notice as ChatGPT. That model was text-embedding-ada-002. Very creative naming, but behind that name is a model that completely changed the way we do information retrieval for natural language.

That covers RAG, semantic search, and basically any use case where you're retrieving text information. Now since then, despite a huge explosion in the number of people using RAG and the really cool things you can do with it, OpenAI remained pretty quiet on the embedding model front. And embedding models are what you need for RAG.

There had been no new models since December 2022 — until now. OpenAI has just released two new embedding models, along with a ton of other things. Those two embedding models are called text-embedding-3-small and text-embedding-3-large. And when we look at the results OpenAI is sharing right now, we can see a fairly decent improvement on English-language embeddings on the MTEB benchmark.

But perhaps more impressively, we see a massive improvement in the quality of multilingual embeddings, which are measured using the MIRACL benchmark. Now ada-002 — state of the art when it was released, for a very long time afterwards, and still a top-performing embedding model — had an average score of 31.4 on MIRACL.

The new text-embedding-3-large has an average score of 54.9 on MIRACL. That's a massive difference. Now, one of the other things you'll notice looking at these new models is that they have not increased the max context window — the maximum number of tokens that you can feed into the model.

That makes a lot of sense for embedding models, because what you're trying to do with an embedding is compress the meaning of some text into a single point. If you have a larger chunk of text, there are usually many meanings within that text, so going large and trying to compress it all into a single point doesn't really work — those two things don't go together, because that large text can carry many different meanings.

So it always makes sense to use smaller chunks, and clearly OpenAI are aware of that: they haven't increased the maximum number of tokens you can embed with these models. Now, the other thing, which is maybe not as clear to me, is that they have not trained on more recent data.

The knowledge cutoff is still September 2021, which is a fair while ago now. OK, for embedding models maybe that isn't quite as important as it is for LLMs, but it's still important — it's good to have some context of recent events when you're trying to embed meaning. So for something like COVID — if you ask a COVID question — I imagine these models are probably not going to perform as well as, say, Cohere's embedding models, which have been trained on more recent data.

Nonetheless, this is still very impressive. And the thing I think is probably the most impressive so far is that we're now able to decide how many dimensions we'd like in our vectors. Now, there is a tradeoff: if you reduce the number of dimensions, you're going to get lower-quality embeddings.

But what is incredibly interesting — and I almost don't quite believe it yet, I still need to test this — is that they're saying the large model, text-embedding-3-large, can be cut down from 3,072 dimensions, which is larger than the previous models, all the way to 256 dimensions and still outperform ada-002, which is a 1,536-dimensional embedding model.

Compressing all of that performance into 256 floating point numbers is insane. So I'm going to test that. Not right now, but I'm going to test that and just prove to myself that that is possible. I'm a little bit skeptical, but if so, incredible. OK, so with that out of the way, let's jump into how we might use this new model.
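For reference, that truncation is exposed as an extra dimensions argument on the embeddings endpoint. Here's a minimal sketch — assuming the v1 openai Python client and an API key set in the environment:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Full-size vector from the large model (3,072 dimensions by default).
full = client.embeddings.create(
    model="text-embedding-3-large",
    input="red teaming for large language models",
)
print(len(full.data[0].embedding))  # 3072

# Same call, but asking the API to return a 256-dimensional vector instead.
short = client.embeddings.create(
    model="text-embedding-3-large",
    input="red teaming for large language models",
    dimensions=256,
)
print(len(short.data[0].embedding))  # 256
```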

OK, so jumping right into it, we have this notebook. I'll share a link to it in the description, and I'll try to get a link added to the video as well. The first thing I'm going to do is download the dataset — well, pip install first, then download the dataset.

OK, so I'm using this AI arXiv dataset I've used a million times before, but it is a good dataset for testing. I'm going to remove all of the columns I don't care about and keep just ID, text, and metadata — the typical format.
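As a rough sketch of that setup — the dataset ID and field names here are assumptions (a pre-chunked AI arXiv dataset on Hugging Face), so adjust them to whatever the notebook actually uses:

```python
# !pip install datasets openai pinecone-client tqdm  # rough install line; package names may differ

from datasets import load_dataset

# Assumed dataset: a pre-chunked AI arXiv dataset on Hugging Face.
data = load_dataset("jamescalam/ai-arxiv-chunked", split="train")

# Keep just id, text, and metadata — the "typical format" mentioned above.
# The source field names ("doi", "chunk-id", "chunk", "title", "source") are assumptions.
data = data.map(lambda x: {
    "id": f"{x['doi']}-{x['chunk-id']}",
    "text": x["chunk"],
    "metadata": {"title": x["title"], "text": x["chunk"], "source": x["source"]},
})
data = data.remove_columns(
    [c for c in data.column_names if c not in ("id", "text", "metadata")]
)
```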

Then I'm going to grab my OpenAI API key — that's from platform.openai.com if you need one — and put it in here. Then this is how you create your new embeddings. It's exactly the same as what you did before; you just change the model name now. We'll see those in a moment as well.
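That embedding function might look roughly like this — a sketch using the v1 openai client, where the helper name embed is just my own placeholder:

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")  # key from platform.openai.com

def embed(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    """Embed a batch of texts, returning one vector per input text."""
    res = client.embeddings.create(input=texts, model=model)
    return [record.embedding for record in res.data]
```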

So that is our embedding function. Then we jump down and initialize the connection to Pinecone serverless. You get $100 of free credit, and you can create multiple indexes, which is what we need, because I want to test multiple models here with different dimensionalities. That's why I'm using serverless, alongside all the other benefits you get from it as well.

Now, taking a look at this, these are the models we're going to test, using the default dimensions for now — we'll try the others pretty soon. So we have the original model. Well, kind of original — the v2 of embeddings from OpenAI, the one they released in December 2022.

The dimensionality there is 1,536 — most of us will be very familiar with that number by now. The new small model uses the same dimensionality, and you can also decrease it down to 512, which is a nice little thing you can do. The other embedding model, the large one — the one with the insane performance gains — is this one.

So 3-large has a higher dimensionality, which means it can pack more meaning into that single vector, so it makes sense that it's more performant. But what is very cool is that you can compress this down to 256 dimensions and apparently still outperform this model here. And I mean, that is 100% unheard of within vector embeddings.

Like, 256 dimensions and getting this level of performance is insane. Let's see — I don't know, maybe. I mean, they say it's true. So then I'm going to go through and create three different indexes, one for each of the models. OK.
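Here's a sketch of that step — connecting to Pinecone serverless and creating one index per model at its default dimensionality. The cloud/region values and the choice to reuse the model names as index names are my own assumptions:

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")

# One index per model, each at the model's default dimensionality.
models = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}

for model_name, dim in models.items():
    if model_name not in pc.list_indexes().names():
        pc.create_index(
            name=model_name,  # reuse the model name as the index name
            dimension=dim,
            metric="cosine",
            spec=ServerlessSpec(cloud="aws", region="us-west-2"),  # assumed cloud/region
        )
```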

And then what I'm going to do is just index everything. Now, it takes a little bit of time to index everything, so while I'm waiting for that, we can have a quick look at how long this is taking, because speed is also something to consider when you're choosing embedding models.
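The indexing loop itself is roughly this — embed each batch with the relevant model and upsert it into that model's index. The batch size and the embed helper are the same hypothetical pieces sketched above:

```python
from tqdm.auto import tqdm

batch_size = 100  # arbitrary batch size for this sketch

for model_name in models:
    index = pc.Index(model_name)
    for i in tqdm(range(0, len(data), batch_size), desc=model_name):
        batch = data[i:i + batch_size]  # a Hugging Face Dataset slice is a dict of lists
        vectors = embed(batch["text"], model=model_name)
        index.upsert(vectors=list(zip(batch["id"], vectors, batch["metadata"])))
```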

Looking at these, straight away: the APIs right now are, I think, pretty slow, because everything has just been released, so I expect during normal times these numbers will be smaller. For ada-002, I'm getting 15 and a half minutes to embed everything — that's to embed and upsert everything into Pinecone.

It's slightly slower for the small model, which, OK, maybe hasn't been optimized as much as ada-002 yet, and maybe more people are using it right now. But generally it's a pretty comparable speed. As we might expect, embedding with the large model is definitely slower — right now we're on track for about 24 minutes for the whole thing to embed.

So, yeah, definitely slower. That also means your embedding latency is going to be higher. I mean, you can look at this: this is two seconds, and that includes your network latency and everything, and also going to Pinecone as well, so there are multiple things in there.

It's not a 100% fair comparison, but this one is almost two seconds slower — maybe more like 1.5 seconds slower for a single iteration. So this one is definitely slower, and if you're using RAG or something like that, it's clearly going to slow down that process a little bit.

Probably not that much compared to, you know, the LLM generation component, but still something to consider. So I'm going to wait for this to finish and skip ahead to when it has. OK, so we are done, and it ended up at just about 24 minutes for that final model.

So I've created this function — sketched below — which is just going to go through and return documents for us. Let's try it with ada-002 and see what we get. The question is about red teaming for Llama 2. What do we get? We get red teaming ChatGPT — no, not quite there.
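For reference, that helper might look something like this — a sketch that reuses the hypothetical embed function and per-model index names from above:

```python
def get_docs(query: str, model: str, top_k: int = 5) -> list[str]:
    """Embed the query with the given model and return the top matching chunks."""
    xq = embed([query], model=model)[0]
    index = pc.Index(model)  # indexes were named after the models above
    res = index.query(vector=xq, top_k=top_k, include_metadata=True)
    return [m["metadata"]["text"] for m in res["matches"]]

# e.g. get_docs("what was said about red teaming for Llama 2?", model="text-embedding-ada-002")
```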

Let's try with the new small model. OK, cool. And let's see, is Llama 2 mentioned in here? No, no Llama 2, so also not quite there. This was a pretty hard one — I haven't seen a model get this one yet. So we're starting with a hard question. OK, let's see.

Let's see what we have here. OK, so it's talking about red team exercises, this and this, but I don't see Llama 2 — no, nothing in there. So, OK, maybe that question is too hard for any model, apparently. So let's try something else. All right, let's just go with: can you tell me why I would want to use Llama 2?

Why would I want to use Llama 2? Now, the models usually can get relevant results here. And yeah, straight away with this one you can see: Llama 2 scales up to this size, its helpfulness and safety are pretty good, it performs better than existing open source models. OK, cool. Good — I would hope they can get this one.

OK, same result — I think it's probably the most relevant, or one of the most relevant. So let's see. "Why do I want to use..." — and then here, so this is the large model, excuse me — is it the same? Oh, same result again. OK, cool.

That's fine. Let's try another question. OK, so this one is comparing Llama to GPT-4 — let's just see how many of these results manage to get either GPT-4 or Llama in there. So, OK, this is harder. OK, four of the five results seem relevant — but are they actually talking about GPT-4 as well?

And yeah, you can see GPT-4 in here. I don't actually see GPT-4 in this one — I see GPT-J. Oh, OK, no — so it's about the effectiveness of instruction tuning using GPT-4, but not necessarily comparing to GPT-4. And this one, I don't see them talking about Llama at all. So, OK, those two are not relevant.

This one — "we compare our chatbot's instruction tuning with Llama", where Llama GPT-4 outperforms this one and this one, but there's still a gap — OK, so there's a comparison there. Fine. And here, OK, so that's a Llama fine-tuned on GPT-4 instructions or outputs, but there is a comparison. And again, OK, there's a comparison, right?

So there are like three results that include a comparison. Now for the small model, let's see. We compare these — OK, I would say this one is relevant. Interesting. Second one, not relevant. Third one: all chatbots are compared against GPT-4, with the comparisons run by a reward model. OK, yeah, that's relevant.

So that's two out of three. Here, I don't see anything where it's comparing to GPT-4, so I think that's a no — two out of four now. OK, and then here they're talking, kind of, about the comparisons, so three out of five. But then the other model was slightly... oh, it's the same.

OK. Now let's go with the best model. We would expect to see more Llama, and I think we do — this one has Llama in four of those answers. We compare — OK, we're comparing. This one, no. Look at this one — OK, they're comparing, so that's accurate. And this one?

OK, here, comparing again. And then this final one here — do we have GPT-4? Here, I think so. They have Bard, ChatGPT, GPT-4, and then they have some... I mean, this is a table, it's kind of hard to understand, but it seems like that is actually a comparison as well.

So that one counts too. OK, so this one got four out of five — that's the best performing one. OK, that's good, and it correlates with what we would expect. Cool. OK, so those are the new embedding models from OpenAI. I think it's kind of hard to see the performance difference there.

I mean, you can see it a little bit, maybe, with the large model. But given the performance differences we saw in that table — at least on multilingual there's a massive leap up, which is insane. I'm looking forward to trying the very small dimensionality and comparing that to ada-002.

I think that is very impressive, and I'll definitely try it soon. But for now, it looks pretty cool. I definitely want to try the other models OpenAI have released as well — there are a few. So for now, I'm going to leave it there. I hope all this has been interesting and useful.

So thank you very much for watching and I'll see you again in the next one. Bye.