OpenAI Alternatives: Cohere Embed v3 and Open Source
Chapters
0:00 Intro
0:45 MTEB Leaderboards
1:46 Starting with OpenAI, Cohere, and E5
4:35 Inference Speeds
6:06 Querying with Different Models
8:05 Results between models
9:35 Ada 002 vs Cohere v3
10:29 Another test for OpenAI, Cohere, and E5
15:23 More Questions and Final Thoughts
00:00:00.000 |
Today, we're going to be taking a look at a few of the best embedding models that we can use when we're building retrieval pipelines. 00:00:07.360 |
At the moment, pretty much everyone uses OpenAI's Ada 002. 00:00:11.400 |
But there are actually many other models out there, and a few that are either competitive with or potentially even better than Ada 002. 00:00:22.000 |
And if you go by leaderboards, there are many models that are significantly better. 00:00:26.880 |
But we'll see that not everything is about leaderboards, and when you're testing on real-life data, Ada 002 still works well, but it's comparable to many other models. 00:00:36.680 |
So I'm going to start by taking a look at one of the most popular leaderboards for embedding models, which is the MTEB Benchmark. 00:00:45.600 |
MTEB is the Massive Text Embedding Benchmark, and this is hosted on Hugging Face Spaces. 00:00:52.000 |
Now, I think literally today there is this new model that is now in the number one spot. 00:00:59.120 |
We are not going to be looking at that model in this video, although I will very soon. 00:01:03.760 |
But we will be covering this other model from Cohere, which is very close, with very little difference, at least going by the benchmark results here. 00:01:12.200 |
And we're also going to be taking a look at one of the smaller embedding models, which is open source. 00:01:18.000 |
So there are many open source models here, but the one that I found to work best that isn't huge is actually down here. 00:01:27.280 |
So E5 Base V1, and if model size isn't too much of an issue, you can actually upgrade this model to the E5 Large V2 model. 00:01:38.080 |
And then we're also going to compare these to what is generally the most popular embedding model, which is Ada 002. 00:01:46.160 |
Now, we're going to be taking a look at a few different things here, but I'm going to guide you through basically what you need to know to use each one of these three models. 00:01:56.240 |
So to start with, we're going to obviously start with the installs. 00:01:59.200 |
So the pip install for each one of these is pretty straightforward. 00:02:02.240 |
We have OpenAI, Cohere, and Transformers over here. 00:02:05.920 |
The datasets library that you see at the top here is for the dataset that we're going to use for this walkthrough, and that dataset is this one here. 00:02:13.040 |
So you will have probably seen this before if you watch a few of my recent videos. 00:02:19.600 |
Now, we've installed this, as you can see here, and you'll be able to find this notebook in a link at the top of the video right now. 00:02:27.840 |
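As a rough sketch, the setup might look something like this. The exact dataset name and column are assumptions standing in for the chunked arXiv-paper dataset shown in the video:

```python
!pip install -qU datasets openai cohere transformers torch

from datasets import load_dataset

# hypothetical dataset name and field, standing in for the chunked
# arXiv-paper dataset used in the walkthrough
data = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
chunks = [row["chunk"] for row in data]  # the plain-text passages we will embed
```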
And then we want to come down to our embedding functions. 00:02:30.320 |
Now, the embedding functions are what vary the most between each of these models. 00:02:34.720 |
Obviously, the two API embedding models, Cohere and OpenAI, are the most straightforward. 00:02:42.240 |
OpenAI in particular, there's not really anything you need to know. 00:02:44.800 |
You just input your documents, and you have your model here. 00:02:48.240 |
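As a minimal sketch of what that OpenAI embedding function can look like (written against the current openai Python client; the notebook's exact code may differ):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_openai(docs: list[str]) -> list[list[float]]:
    # Ada 002 uses the same call for documents and queries
    res = client.embeddings.create(model="text-embedding-ada-002", input=docs)
    return [record.embedding for record in res.data]
```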
With Cohere, you do need to be aware of using the correct input type here, 00:02:53.840 |
which is going to be search_document when we're embedding our documents, 00:02:57.600 |
and when we're embedding a query, it is search_query. 00:03:01.200 |
And we also have the model name down here as well. 00:03:06.160 |
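A minimal sketch of the document-side Cohere function, using the Cohere Python SDK and the embed-english-v3.0 model (the helper name is just for illustration):

```python
import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

def embed_cohere(docs: list[str]) -> list[list[float]]:
    # documents are embedded with input_type="search_document";
    # queries will later use input_type="search_query"
    res = co.embed(
        texts=docs,
        model="embed-english-v3.0",
        input_type="search_document",
    )
    return res.embeddings
```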
Now, things get a little more complicated when we start looking at how to use our open source model. 00:03:17.200 |
But in any case, it's still not really that complicated. 00:03:21.440 |
The only thing that we do need to be aware of is that if you want fast speeds, 00:03:25.760 |
you're probably going to want a CUDA-enabled GPU. 00:03:33.840 |
When you're running on Apple silicon, rather than using CUDA here, you would switch across to MPS; 00:03:42.000 |
you want to make sure you're moving to your MPS device instead. 00:03:45.280 |
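In code, that device selection might look something like this:

```python
import torch

# prefer a CUDA GPU, fall back to Apple-silicon MPS, then CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
```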
Now, we initialize the tokenizer and model, and then we do our embeddings. 00:03:51.840 |
One thing that we do need to do with this model, a little bit of a formatting thing, 00:03:55.200 |
is we need to prefix every input document or passage with the text "passage: ". 00:04:02.960 |
This just tells the model, the embedding model, 00:04:06.800 |
that this is a passage of text and not a query of text. 00:04:10.080 |
Later on, you'll see that we replace this with "query: " rather than "passage: " when we embed queries. 00:04:18.160 |
And then we process everything through the model and pool the outputs, so we get a single embedding 00:04:29.280 |
for each input document or passage that we have put in there. 00:04:34.720 |
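Putting that together, a minimal sketch of the E5 document-embedding step could look like the following. The intfloat/e5-base-v2 checkpoint is my assumption; swap in whichever E5 variant you're using:

```python
from transformers import AutoTokenizer, AutoModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "intfloat/e5-base-v2"  # assumed checkpoint; e5-large-v2 also works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to(device)

def embed_e5(docs: list[str], prefix: str = "passage: "):
    # E5 expects "passage: " before documents and "query: " before queries
    tokens = tokenizer(
        [prefix + d for d in docs],
        padding=True, truncation=True, max_length=512, return_tensors="pt",
    ).to(device)
    with torch.no_grad():
        out = model(**tokens)
    # mean-pool the token embeddings, ignoring padding positions
    mask = tokens["attention_mask"].unsqueeze(-1).float()
    pooled = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.cpu().numpy()
```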
Then we move on to adding everything into our index. 00:04:47.680 |
And we're just doing this for like a walkthrough. 00:04:50.720 |
Obviously, if you want to do anything in production, don't do this, 00:04:56.720 |
unless you're happy handling all the data management stuff around it. 00:05:00.160 |
Now, what I did here is for our APIs, I used a batch size of 128. 00:05:06.720 |
In reality, I probably could have moved this up to 256. 00:05:10.160 |
And that would speed things up a little more. 00:05:12.720 |
So OpenAI, it took like nine minutes to index all of these documents. 00:05:16.880 |
With Cohere, it took five and a half minutes. 00:05:19.520 |
So it seems like Cohere is a bit faster at ingestion and returning embeddings. 00:05:25.600 |
And then if we look at our open source model, E5, it's a pretty small model. 00:05:32.240 |
For this, I was using a V100 GPU on Google Colab. 00:05:39.360 |
But if you're embedding this whole index in memory, which you probably shouldn't anyway, 00:05:46.800 |
you may run out of memory, 00:05:49.360 |
as in the actual RAM where you're storing your NumPy array of embeddings. 00:05:58.000 |
One thing to note is that we obviously have a higher batch size here, 00:06:02.000 |
so if we decrease that, we'll probably see slower results. 00:06:13.760 |
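For reference, the batched indexing loop might look roughly like this, building on the hypothetical embed functions above and storing everything in a plain NumPy array, which is fine for a walkthrough but not for production:

```python
import numpy as np

batch_size = 128  # the API models were run at 128; 256 may be a little faster
vectors = []
for i in range(0, len(chunks), batch_size):
    batch = chunks[i:i + batch_size]
    vectors.append(np.asarray(embed_cohere(batch)))  # or embed_openai / embed_e5
index = np.vstack(vectors)  # shape: (num_chunks, embedding_dim)
```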
Now, the query function is basically the same as what we did before. 00:06:21.120 |
And here, in the OpenAI notebook, I could have just reused the embedding function from before. 00:06:24.880 |
In the Cohere notebook, I could not reuse the embedding function, because I need 00:06:28.240 |
to adjust the input type to search_query rather than search_document. 00:06:34.160 |
And then for the E5 model, again, we would need to modify this, swapping the "passage: " prefix for "query: ". 00:06:43.440 |
Otherwise, there's not too much difference here. 00:06:45.920 |
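As a sketch, the query-side functions differ from the document-side ones only in the input type or prefix (again building on the hypothetical helpers above):

```python
def query_openai(q: str) -> list[float]:
    # same call as for documents; Ada 002 makes no query/document distinction
    return embed_openai([q])[0]

def query_cohere(q: str) -> list[float]:
    # queries must use input_type="search_query"
    res = co.embed(texts=[q], model="embed-english-v3.0", input_type="search_query")
    return res.embeddings[0]

def query_e5(q: str):
    # queries are prefixed with "query: " instead of "passage: "
    return embed_e5([q], prefix="query: ")[0]
```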
What we do after all of this is calculate the dot product similarity between the query vector and our indexed vectors. 00:06:53.360 |
And we do the exact same thing for the Cohere model. 00:06:59.120 |
I believe with E5, the output was not normalized. 00:07:04.080 |
So we could either normalize the vectors and then use dot product 00:07:07.440 |
or we just use cosine similarity, which is just normalized dot product. 00:07:14.720 |
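A minimal search sketch that covers both cases: normalising both sides turns the dot product into cosine similarity, which is required for the unnormalised E5 vectors and harmless for vectors that are already unit length:

```python
import numpy as np

def search(query_vec, index: np.ndarray, top_k: int = 3):
    # normalise query and index rows so the dot product is cosine similarity
    q = np.asarray(query_vec, dtype=np.float32)
    q = q / np.linalg.norm(q)
    norms = np.linalg.norm(index, axis=1, keepdims=True)
    scores = (index / norms) @ q
    top = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in top]
```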
And then one thing that we should be aware of here, 00:07:17.520 |
which is an important thing to take into consideration 00:07:22.880 |
when you're storing these vectors, is that every embedding model, 00:07:25.920 |
well, not every embedding model, but a lot of embedding models, output vectors of a different dimensionality. 00:07:31.440 |
So when using Ada 002, the dimensionality of the output is 1536. 00:07:41.760 |
That means we're going to be using more storage 00:07:44.800 |
than if we're using the Cohere embedding model, which is just 1024. 00:07:49.760 |
And that is still going to be more than if we use the E5 base embedding model, which is just 768. 00:07:54.960 |
So that's important to consider, especially sort of long-term. 00:08:00.720 |
It's going to cost more to store the higher dimensional vectors. 00:08:05.200 |
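To make that concrete, here is a quick back-of-the-envelope estimate for one million float32 vectors at each dimensionality (768 for E5 base is my assumption of the variant used here):

```python
for name, dim in [("ada-002", 1536), ("embed-english-v3.0", 1024), ("e5-base", 768)]:
    gb = 1_000_000 * dim * 4 / 1e9  # 4 bytes per float32 value
    print(f"{name}: {dim} dims ≈ {gb:.1f} GB per million vectors")
```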
So now let's look at the results for each one of these models, 00:08:09.680 |
which we'll see are pretty similar in terms of performance. 00:08:15.920 |
Now, this is not an easy dataset for an embedding model to understand. 00:08:19.200 |
It's very messy, but that's more representative of the real world 00:08:22.720 |
rather than like clean benchmark data or anything like that. 00:08:26.320 |
So I think this is a good example of what they can do and what they can't do. 00:08:34.160 |
I know that the Llama 2 paper is within this dataset. 00:08:37.280 |
So I know we should be able to get some relevant results back. 00:08:39.280 |
Now, when you see this text here, this is actually Llama 2. 00:08:48.160 |
So we see this first one, it's talking about Llama 2. 00:08:52.720 |
It says intended for assistant-like chat and used for a variety of natural language generation tasks, 00:09:01.280 |
But I mean, that's pretty much it in the first document there. 00:09:07.920 |
You see it's optimized for dialogue use cases, outperforms open source chat models 00:09:14.000 |
on most benchmarks, and based on human evaluations for helpfulness and safety, 00:09:19.200 |
may be a suitable substitute for closed source models. 00:09:25.600 |
And then in the final one here, we get similar answers. 00:09:28.160 |
So we can see perform better, open source, and on par with some closed source. 00:09:37.760 |
So we can see we get some different results here. 00:09:40.400 |
And unfortunately, the first one is actually talking about the first Llama model. 00:09:46.400 |
Coming down to here, we do get one of the same Llama 2 results that we got before. 00:09:51.280 |
So optimized for dialogue, outperforms open source chat models, and so on. 00:10:00.640 |
Then we come back to here, and we get the same response that we got in the previous one as well. 00:10:06.160 |
So perform better than open source, and on par with closed source. 00:10:15.440 |
The first one at the top here is kind of not relevant, so we can ignore that. 00:10:21.840 |
But the rest of the results are the same as what we saw with the previous two models. 00:10:29.520 |
So looking at another more specific question about red teaming for Llama 2. 00:10:34.800 |
So it's like security testing or stress testing Llama 2. 00:10:38.080 |
We can see, okay, this first one here is talking about red teaming, not specific to Llama 2, 00:10:45.280 |
although we'll see that none of the models actually managed to find that information 00:10:50.080 |
within the same chunk, which just makes me think, okay, 00:10:53.840 |
we don't have Llama 2 and red teaming within the same chunk within the dataset. 00:10:58.000 |
But we can see, okay, this one is talking about jokes, insults based on physical characteristics, 00:11:05.040 |
This is them testing the model with red teaming. 00:11:11.600 |
On the second one, we can see, okay, we have red team members here. 00:11:15.600 |
Red team members enjoyed the task and did not experience significant negative emotions. 00:11:21.120 |
This allows us to expedite the red team's ability to find vulnerabilities in our system, 00:11:30.640 |
And then we have red teaming via jailbreaking. 00:11:34.320 |
I think this one's probably a bit more relevant, a bit more useful. 00:11:37.920 |
And all of this here is describing red teaming overall. 00:11:42.400 |
And then they describe, okay, this is a qualitative approach called red teaming at the end there. 00:11:49.840 |
So, okay, results, nothing special, in my opinion. 00:11:54.720 |
Now, with Cohere, we can see aiding in disinformation campaigns, generating extremist text. 00:11:59.760 |
So, this is them talking about what they did for testing with red teaming. 00:12:07.760 |
As AI systems improve, the scope of possible harm seems to grow. 00:12:11.920 |
One potential useful tool for addressing harm is red teaming using manual or automated methods 00:12:19.040 |
to adversarially probe a language model for harmful outputs. 00:12:23.920 |
Already, this one to me is explaining more about red teaming than any of the results from Ada 002. 00:12:30.400 |
And we have the other one on red teaming via jailbreaking. 00:12:43.840 |
And then here we have including limitations and risks that might be exploited by malicious actors. 00:12:48.320 |
So, that's another part of red teaming, like testing it, see if people can use these things 00:12:53.200 |
Red teaming approaches are insufficient for addressing these in the AI context. 00:13:00.640 |
Processes such as red teaming exercises help organizations to discover their own 00:13:07.360 |
limitations and vulnerabilities as well as those of the AI systems they develop. 00:13:16.160 |
A red team exercise is a structured effort to find flaws and vulnerabilities in a plan, 00:13:25.600 |
Often performed by a dedicated red team that seeks to adopt an attacker's mindset and methods. 00:13:35.280 |
There are a few, I think, good, insightful things in here. 00:13:46.320 |
So, I think that's, in my opinion, better than the OpenAI responses. 00:13:54.640 |
Moving on to the E5 results: here we're talking about publicly available red team datasets. 00:14:00.800 |
It's a data set that they're obviously talking about here. 00:14:07.520 |
But it's not really explaining red teaming; I wouldn't know what red teaming is based on this. 00:14:13.440 |
Then again, we're talking about red teaming here. 00:14:15.280 |
A literature review on red teaming AI systems. 00:14:19.600 |
Informational interviews with experts in the field of trust and safety. 00:14:25.840 |
In general, we found that red team members enjoyed participating in our experiments and felt 00:14:31.760 |
motivated by a mission to make AI systems less harmful. 00:14:36.480 |
So, kind of relevant, but it could be better. 00:14:51.520 |
In any case, I think it's clear here that E5's performance is not quite as good as the other two models. 00:15:00.160 |
And generally, I think the Cohere model outperformed both in this scenario. 00:15:06.000 |
But we should also note that this here is the base model. 00:15:11.440 |
And generally, what you'll find with these models is that the large model will perform better. 00:15:17.680 |
So, we might even be able to get comparable results with that. 00:15:22.080 |
Now, I'm not going to go through all of these now. 00:15:25.760 |
Instead, I'll just leave these notebooks here so you can go and check them out. 00:15:32.800 |
The remaining questions are mainly, you know, about Llama 2 and other things that are within these papers. 00:15:37.200 |
And generally speaking, OpenAI, Cohere, and E5 all got pretty good results. 00:15:44.640 |
And between Cohere and OpenAI, for me, Cohere seemed to perform slightly better. 00:15:54.000 |
So, I feel like a lot of this will be down to personal preference to some degree. 00:16:00.240 |
But at some point, of course, I'll test these with more data and try and get a better feel for how they compare. 00:16:13.600 |
I hope seeing a couple of these alternative embedding models has been useful and interesting.