
OpenAI Alternatives: Cohere Embed v3 and Open Source


Chapters

0:00
0:45 MTEB Leaderboards
1:46 Starting with OpenAI, Cohere, and e5
4:35 Inference Speeds
6:06 Querying with Different Models
8:05 Results between models
9:35 Ada 002 vs Cohere v3
10:29 Another test for OpenAI, Cohere, and E5
15:23 More Questions and Final Thoughts

Transcript

Today, we're going to be taking a look at a few of the best embedding models that we can use when we're building retrieval pipelines. At the moment, pretty much everyone uses OpenAI's Ada 002. But there are actually many other models out there, and a few of them are either competitive with or potentially even better than Ada 002.

And if you go by leaderboards, there are many models that are significantly better. But we'll see that not everything is about leaderboards, and when you're testing on real-life data, Ada 002 still works well, but it's comparable to many other models. So I'm going to start by taking a look at one of the most popular leaderboards for embedding models, which is the MTEB benchmark.

MTEB is the Massive Text Embedding Benchmark, and it is hosted on Hugging Face Spaces. Now, I think literally today there is a new model that has taken the number one spot. We are not going to be looking at that model in this video, although I will very soon.

But we will be covering this other model from Cohere, which is very close, with very little difference, at least going by the benchmark results here. And we're also going to take a look at one of the smaller embedding models that is open source. There are many open source models here, but the one that I found to work best that isn't huge is actually down here.

So E5 Base V1, and if model size isn't too much of an issue, you can actually upgrade to the E5 Large V2 model. And then we're also going to compare these to what is generally the most popular embedding model, which is Ada 002. Now, we're going to take a look at a few different things here, but I'm going to guide you through basically what you need to know to use each of these three models.

So to start with, we're obviously going to start with the installs. The pip install for each of these is pretty straightforward. We have OpenAI, Cohere, and Transformers over here. The datasets library that you see at the top here is for the dataset we're going to use for this walkthrough, and that dataset is this one here.

You will probably have seen this before if you watch a few of my recent videos. It's this AI arXiv chunked dataset. Now, we've installed this, as you can see here, and you'll be able to find this notebook in a link at the top of the video right now.
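As a rough sketch, the setup might look something like this. The dataset ID and column names here are assumptions based on the walkthrough, so check the linked notebook for the exact ones:

```python
# !pip install openai cohere transformers datasets

from datasets import load_dataset

# load the chunked AI arXiv dataset used in the walkthrough
# (the dataset ID here is an assumption; see the linked notebook for the exact one)
data = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
print(data)  # inspect the columns, e.g. the chunk text field
```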

And then we want to come down to our embedding functions. Now, the embedding functions are what vary the most between each of these models. Obviously, the two API embedding models, Cohere and OpenAI, are the most straightforward. For OpenAI in particular, there's not really anything you need to know: you just input your documents, and you have your model here.

With Cohere, you do need to be aware of using the correct input type here, which is going to be search_document when we're embedding our documents, and search_query when we're embedding a query. We also have the model name down here as well. Otherwise, it's pretty straightforward.
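To give a rough idea, the two API embedding functions might look something like the sketch below, assuming the current OpenAI Python client and the Cohere SDK; the function names are just illustrative, not the exact ones from the notebook:

```python
from openai import OpenAI
import cohere

openai_client = OpenAI()            # reads OPENAI_API_KEY from the environment
co = cohere.Client("YOUR_API_KEY")  # Cohere API key

def embed_openai(docs: list[str]) -> list[list[float]]:
    # Ada 002 embeds a whole batch of documents in a single request
    res = openai_client.embeddings.create(
        model="text-embedding-ada-002", input=docs
    )
    return [r.embedding for r in res.data]

def embed_cohere_docs(docs: list[str]) -> list[list[float]]:
    # Cohere Embed v3 needs input_type="search_document" when indexing documents
    res = co.embed(
        texts=docs,
        model="embed-english-v3.0",
        input_type="search_document",
    )
    return res.embeddings
```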

Now, things get a little more complicated when we start looking at how to use our open source model. This is normal; it's open source, so everything isn't hidden behind an API. But in any case, it's still not really that complicated. The only thing we do need to be aware of is that if you want fast speeds, you're probably going to want a CUDA-enabled GPU. I think you can also run this on MPS on a Mac.

So you just need to be aware of that. When you're running on MPS, rather than using CUDA here, you would switch across to MPS, and in the to-device call over here, you want to make sure you're moving the model to your MPS device instead. Now, we initialize the tokenizer and model, and then we do our embeddings.
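A minimal sketch of that device selection and setup, assuming an intfloat E5 checkpoint from Hugging Face (the exact checkpoint name is an assumption, so swap in the one from the notebook):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# prefer CUDA, fall back to Apple's MPS, then CPU
device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)

model_id = "intfloat/e5-base-v2"  # assumption: use the exact E5 checkpoint from the notebook
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to(device)
model.eval()
```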

So to create those embeddings, one thing that we do need to do with this model, a little bit of a formatting thing, is prefix every input document or passage with the text "passage: ". This just tells the embedding model that this is a passage of text and not a query.

Later on, you'll see that we replace this with "query: " rather than "passage: " when we're doing querying. Then we tokenize everything, process it through the model, extract the last hidden state, and turn that into a single embedding for each input document or passage that we've put in there.
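Putting that together, an E5 document-embedding function roughly follows the usual E5 recipe: prefix, tokenize, mean-pool the last hidden state, and optionally normalize. A sketch, building on the tokenizer, model, and device above (the function name is illustrative):

```python
import torch.nn.functional as F

def embed_e5_passages(docs: list[str]) -> torch.Tensor:
    # E5 expects a "passage: " prefix on documents ("query: " at query time)
    texts = [f"passage: {d}" for d in docs]
    tokens = tokenizer(
        texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
    ).to(device)
    with torch.no_grad():
        out = model(**tokens)
    # mean-pool the last hidden state, ignoring padding tokens
    mask = tokens["attention_mask"].unsqueeze(-1).float()
    pooled = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    # normalizing here means a plain dot product later equals cosine similarity
    return F.normalize(pooled, p=2, dim=1).cpu()
```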

So that's our embedding. Then we move on to adding everything into our index, which is where we're storing our vectors. Here, we're just using a local NumPy array. It's a very small dataset, and we're just doing this as a walkthrough. Obviously, if you want to do anything in production, don't do this.

Use a vector database, unless you're happy handling all the data management stuff yourself. Now, what I did here for the APIs is use a batch size of 128. In reality, I probably could have pushed this up to 256, and that would have sped things up a little more.
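As a sketch of that indexing loop, again assuming the chunk text column name, with any of the three embedding functions above slotted in:

```python
import numpy as np

batch_size = 128  # the batch size used for the API models in the walkthrough
index = []

chunks = data["chunk"]  # assumption: the text column is named "chunk"
for i in range(0, len(chunks), batch_size):
    batch = chunks[i:i + batch_size]
    # swap in embed_openai or embed_e5_passages to build the other indexes
    index.append(np.array(embed_cohere_docs(batch)))

index = np.concatenate(index, axis=0)  # shape: (num_chunks, embedding_dim)
```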

So OpenAI, it took like nine minutes to index all of these documents. With Cohere, it took five and a half minutes. So it seems like Cohere is a bit faster at ingestion and returning embeddings. And then if we look at our open source model, E5, it's a pretty small model.

So we can embed things pretty quickly. For this, I was using a V100 GPU on Google Colab. You can use a T4, but if you're embedding this whole index in memory, which you probably shouldn't do anyway, you may run out of memory: actual RAM, where you're storing your NumPy array, rather than GPU memory.

So, yeah. One thing to note is that we obviously have a higher batch size here, so if we decrease that, we'll probably see slower results. Now, after that is done, our index is ready and we are ready to query. So we move on to our query function. Now, the query function is basically the same as what we did before.

We are creating our embeddings. For OpenAI, I could have just reused the embedding function from before. Here, for Cohere, I could not reuse the embedding function, because I need to adjust the input type to search_query rather than search_document. And then for the E5 model, again, we need to modify this.

So here we have "query: " instead of "passage: ". Okay? Otherwise, there's not too much difference here. What we do after all of this is calculate the dot product similarity between our query vector and the index. And we do the exact same thing for the Cohere model. Both of these are normalized.

So we're just calculating the dot product. I believe with E5, the output was not normalized, so we could either normalize the vectors and then use the dot product, or just use cosine similarity, which is a normalized dot product. So, it's up to you.
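A sketch of what the query side might look like for the Cohere index, plus a small helper for the unnormalized case; the helper names here are illustrative, not the exact ones from the notebook:

```python
def query_cohere(text: str, top_k: int = 3):
    # at query time Cohere needs input_type="search_query"
    xq = np.array(
        co.embed(
            texts=[text],
            model="embed-english-v3.0",
            input_type="search_query",
        ).embeddings[0]
    )
    # dot product against every stored vector (both sides are normalized)
    scores = index @ xq
    top_ids = np.argsort(scores)[::-1][:top_k]
    return [(chunks[i], float(scores[i])) for i in top_ids]

def cosine_scores(index: np.ndarray, xq: np.ndarray) -> np.ndarray:
    # cosine similarity is just the dot product of L2-normalized vectors,
    # which is what you want if the embeddings were stored unnormalized
    index_norm = index / np.linalg.norm(index, axis=1, keepdims=True)
    return index_norm @ (xq / np.linalg.norm(xq))
```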

And then one thing we should be aware of here, which is important to take into consideration when you're storing these vectors, is that a lot of embedding models have different embedding dimensionalities. When using Ada 002, the vectors we output are 1536-dimensional. That means we're going to be using more storage than if we're using the Cohere embedding model, which outputs 1024 dimensions, and that is still more than the E5 embedding model, which outputs 768.
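A quick way to sanity-check those dimensionalities, using the functions sketched above:

```python
print(len(embed_openai(["hello world"])[0]))        # 1536 for text-embedding-ada-002
print(len(embed_cohere_docs(["hello world"])[0]))   # 1024 for embed-english-v3.0
print(embed_e5_passages(["hello world"]).shape[1])  # 768 for E5 base
```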

So that's important to consider, especially long-term: it's going to cost more to store the higher-dimensional vectors. So now, let's look at the results between each of these models, which we'll see are pretty similar in terms of performance, at least on the few queries I ran. Now, this is not an easy dataset for an embedding model to understand.

It's very messy, but that's more representative of the real world than clean benchmark data or anything like that. So I think this is a good example of what they can and can't do. So I asked, why should I use Llama 2? Pretty simple question. I know that the Llama 2 paper is within this dataset.

So I know we should be able to come back with stuff. Now, when you see this text here, this is actually Llama 2; it's just formatted weirdly. So we see this first one, it's talking about Llama 2, and I'm asking, why should I use it? It says it's intended for assistant-like chat and can be used for a variety of natural language generation tasks.

But I mean, that's pretty much it in the first document there. Here, again, we're talking about Llama 2. You see that it's optimized for dialogue use cases, outperforms open source chat models on most benchmarks, and, based on their human evaluations for helpfulness and safety, may be a substitute for closed source models. So we can see, you know, it's a good answer.

And then in the final one here, we get similar answers. So we can see: performs better, open source, and on par with some closed source. That's Llama 2. Let's see Cohere's model. We can see we get some different results here, and unfortunately, the first one is actually talking about the first Llama model.

So it's not quite right. Coming down to here, we do get one of the same results that we got with Ada 002. So optimized for dialogue, outperforms open source chat models, maybe a substitute for closed source models. Then we come back to here, and we get the same response that we got in the previous one as well.

So performs better than open source, and on par with closed source. Cool. Then we come to E5. The first one at the top here is kind of not relevant, so we can ignore that. But the two here that we get, again, are the same as what we saw with the previous two models.

Okay, cool. So, looking at another, more specific question about red teaming for Llama 2, which is like security testing or stress testing Llama 2. We can see that this first one here is talking about red teaming, not specific to Llama 2, although we'll see that none of the models actually managed to find that information within the same chunk, which makes me think we just don't have Llama 2 and red teaming within the same chunk in the dataset.

But we can see that this one is talking about jokes, insults based on physical characteristics, racist language, and so on. This is them testing the model with red teaming. So, yeah, it's relevant: obviously, a red team approach and results. On the second one, we can see we have red team members here.

Red team members enjoyed the task and did not experience significant negative emotions. This allows us to expedite the red team's ability to find vulnerabilities in our system, so on and so on. Okay. Kind of relevant, not great. And then we have red teaming via jailbreaking. I think this one's probably a bit more relevant, a bit more useful.

And all of this here is describing red teaming overall. And then they describe, at the end there, a qualitative approach called red teaming. So, okay, results, nothing special, in my opinion. Now, with Cohere, we can see aiding in disinformation campaigns, generating extremist text. So, this is them talking about what they did for testing with red teaming.

Spreading falsehoods and more. As AI systems improve, the scope of possible harm seems to grow. One potentially useful tool for addressing harm is red teaming, using manual or automated methods to adversarially probe a language model for harmful outputs. All right. Already this one, to me, is explaining more about red teaming than any of the ones we got from Ada 002.

And we have the other one on red teaming via jailbreaking. We already saw this one, so I'm not going to go through it again, but it was okay; not a bad response, or rather document, to retrieve. And then here we have including limitations and risks that might be exploited by malicious actors.

So, that's another part of red teaming, like testing it, see if people can use these things maliciously. Red teaming approaches are insufficient for addressing these in the AI context. Processes such as red teaming exercises help organizations to discover their own limitations and vulnerabilities as well as those of the AI systems they develop.

And to approach them holistically. A red team exercise is a structured effort to find flaws and vulnerabilities in a plan, organization, or technical system. Often performed by a dedicated red team that seeks to adopt an attacker's mindset and methods. Okay. And it goes on and on. There's a few, I think, good, insightful things in here.

Flaws. Allow organizations to improve security. Yeah. And so on. So, I think that's, in my opinion, better than the OpenAI responses. Then we come to E5. We get some good ones again. So, here we're talking about publicly available red team datasets. And red team attacks. It's a dataset that they're obviously talking about here.

Not too relevant, right? It mentions red teaming, but it's not really explaining it; I wouldn't know what red teaming is based on this. Then again, we're talking about red teaming here. A literature review on red teaming AI systems. Informational interviews with experts in the field of trust and safety.

Or incorporate their best practices. In general, we found that red teaming members enjoyed participating in our experiments and felt motivated by a mission to make AI systems less harmful. Okay. So, kind of relevant, but it could be better. And then this one at the bottom. I mean, it says red teaming here.

I have no idea what any of this means. Maybe it's talking about red teaming. Maybe it's a good response. But I don't know, so I'm going to assume it isn't. In any case, I think clearly here, E5's performance is not quite as good as Cohere's or OpenAI's.

And generally, I think the Cohere model outperformed both in this scenario. But we should also note that this here is the base model. There's also a large model. And generally, what you'll find with these models is that the large model will perform much better than the base model. So, we might even be able to get comparable results with that.

Now, I'm not going to go through all of these now. Instead, I'll just leave these notebooks for you to go and check out. But we asked a few questions, mainly about Llama 2 and other things that are within these papers. And generally speaking, OpenAI, Cohere, and E5 all got pretty good results.

E5 is probably the weakest of them. And between Cohere and OpenAI, for me, Cohere seemed to perform slightly better. But it's a pretty limited test set. So, I feel like a lot of this will be down to personal preference to some degree. But at some point, of course, I'll test these with more data and try and get a better feel for which one of these I prefer.

But for now, yeah, leaning towards Cohere. Now, that's it for this video. I hope seeing a couple of these alternative embedding models has been useful and interesting. So, thank you very much for watching. And I will see you again in the next one. Bye.