OpenAI Alternatives: Cohere Embed v3 and Open Source
Chapters
0:00 Intro
0:45 MTEB Leaderboards
1:46 Starting with OpenAI, Cohere, and E5
4:35 Inference Speeds
6:06 Querying with Different Models
8:05 Results between models
9:35 Ada 002 vs Cohere v3
10:29 Another test for OpenAI, Cohere, and E5
15:23 More Questions and Final Thoughts
00:00:00.000 |
Today, we're going to be taking a look at a few of the best embedding models that we can use when we're building retrieval pipelines. 00:00:07.360 |
At the moment, pretty much everyone uses OpenAI's Ada 002. 00:00:11.400 |
But there are actually many other models out there, and a few that are either competitive with or potentially even better than Ada 002. 00:00:22.000 |
And if you go by leaderboards, there are many models that are significantly better. 00:00:26.880 |
But we'll see that not everything is about leaderboards, and when you're testing on real-life data, Ada 002 still works well, but it's comparable to many other models. 00:00:36.680 |
So I'm going to start by taking a look at one of the most popular leaderboards for embedding models, which is the MTEB Benchmark. 00:00:45.600 |
MTEB is the Massive Text Embedding Benchmark, and this is hosted on Hugging Face Spaces. 00:00:52.000 |
Now, I think literally today there is this new model that is now in the number one spot. 00:00:59.120 |
We are not going to be looking at that model in this video, although I will very soon. 00:01:03.760 |
But we will be covering this other model from Cohere, which is very close, with very little difference, at least going by the benchmark results here. 00:01:12.200 |
And we're also going to be taking a look at one of the smaller embedding models, which is open source. 00:01:18.000 |
So there are many open source models here, but the one that I found to work best that isn't huge is actually down here. 00:01:27.280 |
So E5 Base V1, and if model size isn't too much of an issue, you can actually upgrade this model to the E5 Large V2 model. 00:01:38.080 |
And then we're also going to compare these to what is generally the most popular embedding model, which is Ada 002. 00:01:46.160 |
Now, we're going to be taking a look at a few different things here, but I'm going to guide you through basically what you need to know to use each one of these three models. 00:01:56.240 |
So to start with, we're going to obviously start with the installs. 00:01:59.200 |
So the pip install for each one of these is pretty straightforward. 00:02:02.240 |
We have OpenAI, Cohere, and Transformers over here. 00:02:05.920 |
The datasets library that you see at the top here is for the dataset that we're going to use for this walkthrough, and that dataset is this one here. 00:02:13.040 |
So you will have probably seen this before if you watch a few of my recent videos. 00:02:19.600 |
Now, we've installed this, as you can see here, and you'll be able to find this notebook in a link at the top of the video right now. 00:02:27.840 |
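As a rough sketch, the setup might look something like this. The exact dataset name and column are assumptions standing in for the chunked arXiv-paper dataset shown in the video:

```python
!pip install -qU datasets openai cohere transformers torch

from datasets import load_dataset

# hypothetical dataset name and field, standing in for the chunked
# arXiv-paper dataset used in the walkthrough
data = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
chunks = [row["chunk"] for row in data]  # the plain-text passages we will embed
```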
And then we want to come down to our embedding functions. 00:02:30.320 |
Now, the embedding functions are what vary the most between each of these models. 00:02:34.720 |
Obviously, the two API embedding models, Cohere and OpenAI, are the most straightforward. 00:02:42.240 |
OpenAI in particular, there's not really anything you need to know. 00:02:44.800 |
You just input your documents, and you have your model here. 00:02:48.240 |
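As a minimal sketch of what that OpenAI embedding function can look like (written against the current openai Python client; the notebook's exact code may differ):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_openai(docs: list[str]) -> list[list[float]]:
    # Ada 002 uses the same call for documents and queries
    res = client.embeddings.create(model="text-embedding-ada-002", input=docs)
    return [record.embedding for record in res.data]
```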
With Cohere, you do need to be aware of using the correct input type here, 00:02:53.840 |
which is going to be search_document when we're embedding our documents, 00:02:57.600 |
and when we're embedding a query, it is search_query. 00:03:01.200 |
And we also have the model name down here as well. 00:03:06.160 |
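A minimal sketch of the document-side Cohere function, using the Cohere Python SDK and the embed-english-v3.0 model (the helper name is just for illustration):

```python
import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

def embed_cohere(docs: list[str]) -> list[list[float]]:
    # documents are embedded with input_type="search_document";
    # queries will later use input_type="search_query"
    res = co.embed(
        texts=docs,
        model="embed-english-v3.0",
        input_type="search_document",
    )
    return res.embeddings
```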
Now, things get a little more complicated when we start looking at how to use our open source model. 00:03:17.200 |
But in any case, it's still not really that complicated. 00:03:21.440 |
The only thing that we do need to be aware of is that if you want fast speeds, 00:03:25.760 |
you're probably going to want a CUDA-enabled GPU. 00:03:33.840 |
When you're running on Apple silicon, rather than using CUDA here, you would switch across to MPS; 00:03:42.000 |
you want to make sure you're moving to your MPS device instead. 00:03:45.280 |
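In code, that device selection might look something like this:

```python
import torch

# prefer a CUDA GPU, fall back to Apple-silicon MPS, then CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
```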
Now, we initialize the tokenizer and model, and then we do our embeddings. 00:03:51.840 |
One thing that we do need to do with this model, a little bit of a formatting thing, 00:03:55.200 |
is we need to prefix every input document or passage with the text "passage: ". 00:04:02.960 |
This just tells the model, the embedding model, 00:04:06.800 |
that this is a passage of text and not a query of text. 00:04:10.080 |
Later on, you'll see that we replace this with "query: " rather than "passage: " when we embed queries. 00:04:18.160 |
And then we process everything through the model and pool the outputs, so we get a single embedding 00:04:29.280 |
for each input document or passage that we have put in there. 00:04:34.720 |
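Putting that together, a minimal sketch of the E5 document-embedding step could look like the following. The intfloat/e5-base-v2 checkpoint is my assumption; swap in whichever E5 variant you're using:

```python
from transformers import AutoTokenizer, AutoModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "intfloat/e5-base-v2"  # assumed checkpoint; e5-large-v2 also works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to(device)

def embed_e5(docs: list[str], prefix: str = "passage: "):
    # E5 expects "passage: " before documents and "query: " before queries
    tokens = tokenizer(
        [prefix + d for d in docs],
        padding=True, truncation=True, max_length=512, return_tensors="pt",
    ).to(device)
    with torch.no_grad():
        out = model(**tokens)
    # mean-pool the token embeddings, ignoring padding positions
    mask = tokens["attention_mask"].unsqueeze(-1).float()
    pooled = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.cpu().numpy()
```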
Then we move on to adding everything into our index. 00:04:47.680 |
And we're just doing this for like a walkthrough. 00:04:50.720 |
Obviously, if you want to do anything in production, don't do this, 00:04:56.720 |
unless you're happy handling all the data management stuff around it. 00:05:00.160 |
Now, what I did here is for our APIs, I used a batch size of 128. 00:05:06.720 |
In reality, I probably could have moved this up to 256. 00:05:10.160 |
And that would speed things up a little more. 00:05:12.720 |
So OpenAI, it took like nine minutes to index all of these documents. 00:05:16.880 |
With Cohere, it took five and a half minutes. 00:05:19.520 |
So it seems like Cohere is a bit faster at ingestion and returning embeddings. 00:05:25.600 |
And then if we look at our open source model, E5, it's a pretty small model. 00:05:32.240 |
For this, I was using a V100 GPU on Google Colab. 00:05:39.360 |
But if you're embedding this whole index in memory, which you probably shouldn't anyway, 00:05:46.800 |
you may run out of memory, 00:05:49.360 |
as in the actual RAM where you're storing your NumPy array of embeddings. 00:05:58.000 |
One thing to note is that we obviously have a higher batch size here, 00:06:02.000 |
so if we decrease that, we'll probably see slower results. 00:06:13.760 |
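For reference, the batched indexing loop might look roughly like this, building on the hypothetical embed functions above and storing everything in a plain NumPy array, which is fine for a walkthrough but not for production:

```python
import numpy as np

batch_size = 128  # the API models were run at 128; 256 may be a little faster
vectors = []
for i in range(0, len(chunks), batch_size):
    batch = chunks[i:i + batch_size]
    vectors.append(np.asarray(embed_cohere(batch)))  # or embed_openai / embed_e5
index = np.vstack(vectors)  # shape: (num_chunks, embedding_dim)
```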
Now, the query function is basically the same as what we did before. 00:06:21.120 |
And here, in the OpenAI notebook, I could have just reused the embedding function from before. 00:06:24.880 |
In the Cohere notebook, I could not reuse the embedding function, because I need 00:06:28.240 |
to adjust the input type to search_query rather than search_document. 00:06:34.160 |
And then for the E5 model, again, we would need to modify this, swapping the "passage: " prefix for "query: ". 00:06:43.440 |
Otherwise, there's not too much difference here. 00:06:45.920 |
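As a sketch, the query-side functions differ from the document-side ones only in the input type or prefix (again building on the hypothetical helpers above):

```python
def query_openai(q: str) -> list[float]:
    # same call as for documents; Ada 002 makes no query/document distinction
    return embed_openai([q])[0]

def query_cohere(q: str) -> list[float]:
    # queries must use input_type="search_query"
    res = co.embed(texts=[q], model="embed-english-v3.0", input_type="search_query")
    return res.embeddings[0]

def query_e5(q: str):
    # queries are prefixed with "query: " instead of "passage: "
    return embed_e5([q], prefix="query: ")[0]
```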
What we do after all of this is calculate the dot product similarity between the query vector and our indexed vectors. 00:06:53.360 |
And we do the exact same thing for the Cohere model. 00:06:59.120 |
I believe with E5, the output was not normalized. 00:07:04.080 |
So we could either normalize the vectors and then use dot product 00:07:07.440 |
or we just use cosine similarity, which is just normalized dot product. 00:07:14.720 |
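A minimal search sketch that covers both cases: normalising both sides turns the dot product into cosine similarity, which is required for the unnormalised E5 vectors and harmless for vectors that are already unit length:

```python
import numpy as np

def search(query_vec, index: np.ndarray, top_k: int = 3):
    # normalise query and index rows so the dot product is cosine similarity
    q = np.asarray(query_vec, dtype=np.float32)
    q = q / np.linalg.norm(q)
    norms = np.linalg.norm(index, axis=1, keepdims=True)
    scores = (index / norms) @ q
    top = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in top]
```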
And then one thing that we should be aware of here, 00:07:17.520 |
which is an important thing to take into consideration 00:07:22.880 |
when you're storing these vectors, is that every embedding model, 00:07:25.920 |
well, not every embedding model, but a lot of embedding models, output vectors of a different dimensionality. 00:07:31.440 |
So when using Ada 002, the dimensionality of the output is 1536. 00:07:41.760 |
That means we're going to be using more storage 00:07:44.800 |
than if we're using the Cohere embedding model, which is just 1024. 00:07:49.760 |
And that is still going to be more than if we use the E5 base embedding model, which is just 768. 00:07:54.960 |
So that's important to consider, especially sort of long-term. 00:08:00.720 |
It's going to cost more to store the higher dimensional vectors. 00:08:05.200 |
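To make that concrete, here is a quick back-of-the-envelope estimate for one million float32 vectors at each dimensionality (768 for E5 base is my assumption of the variant used here):

```python
for name, dim in [("ada-002", 1536), ("embed-english-v3.0", 1024), ("e5-base", 768)]:
    gb = 1_000_000 * dim * 4 / 1e9  # 4 bytes per float32 value
    print(f"{name}: {dim} dims ≈ {gb:.1f} GB per million vectors")
```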
So now let's look at the results for each one of these models, 00:08:09.680 |
which we'll see are pretty similar in terms of performance. 00:08:15.920 |
Now, this is not an easy dataset for an embedding model to understand. 00:08:19.200 |
It's very messy, but that's more representative of the real world 00:08:22.720 |
rather than like clean benchmark data or anything like that. 00:08:26.320 |
So I think this is a good example of what they can do and what they can't do. 00:08:34.160 |
I know that the Llama 2 paper is within this dataset. 00:08:37.280 |
So I know we should be able to get some relevant results back. 00:08:39.280 |
Now, when you see this text here, this is actually Llama 2. 00:08:48.160 |
So we see this first one, it's talking about Llama 2. 00:08:52.720 |
It says intended for assistant-like chat and used for a variety of natural language generation tasks, 00:09:01.280 |
But I mean, that's pretty much it in the first document there. 00:09:07.920 |
You see it's optimized for dialogue use cases, outperforms open source chat models 00:09:14.000 |
on most benchmarks, and based on human evaluations for helpfulness and safety, 00:09:19.200 |
may be a suitable substitute for closed source models. 00:09:25.600 |
And then in the final one here, we get similar answers. 00:09:28.160 |
So we can see perform better, open source, and on par with some closed source. 00:09:37.760 |
So we can see we get some different results here. 00:09:40.400 |
And unfortunately, the first one is actually talking about the first Llama model. 00:09:46.400 |
Coming down to here, we do get one of the same Llama 2 results that we got before. 00:09:51.280 |
So optimized for dialogue, outperforms open source chat models, and so on. 00:10:00.640 |
Then we come back to here, and we get the same response that we got in the previous one as well. 00:10:06.160 |
So perform better than open source, and on par with closed source. 00:10:15.440 |
The first one at the top here is kind of not relevant, so we can ignore that. 00:10:21.840 |
But the rest of the results are the same as what we saw with the previous two models. 00:10:29.520 |
So looking at another more specific question about red teaming for Llama 2. 00:10:34.800 |
So it's like security testing or stress testing Llama 2. 00:10:38.080 |
We can see, okay, this first one here is talking about red teaming, not specific to Llama 2, 00:10:45.280 |
although we'll see that none of the models actually managed to find that information 00:10:50.080 |
within the same chunk, which just makes me think, okay, 00:10:53.840 |
we don't have Llama 2 and red teaming within the same chunk within the dataset. 00:10:58.000 |
But we can see, okay, this one is talking about jokes, insults based on physical characteristics, 00:11:05.040 |
This is them testing the model with red teaming. 00:11:11.600 |
On the second one, we can see, okay, we have red team members here. 00:11:15.600 |
Red team members enjoyed the task and did not experience significant negative emotions. 00:11:21.120 |
This allows us to expedite the red team's ability to find vulnerabilities in our system, 00:11:30.640 |
And then we have red teaming via jailbreaking. 00:11:34.320 |
I think this one's probably a bit more relevant, a bit more useful. 00:11:37.920 |
And all of this here is describing red teaming overall. 00:11:42.400 |
And then they describe, okay, this is a qualitative approach called red teaming at the end there. 00:11:49.840 |
So, okay, results, nothing special, in my opinion. 00:11:54.720 |
Now, with Cohere, we can see aiding in disinformation campaigns, generating extremist text. 00:11:59.760 |
So, this is them talking about what they did for testing with red teaming. 00:12:07.760 |
As AI systems improve, the scope of possible harm seems to grow. 00:12:11.920 |
One potential useful tool for addressing harm is red teaming using manual or automated methods 00:12:19.040 |
to adversarially probe a language model for harmful outputs. 00:12:23.920 |
Already, this one to me is explaining more about red teaming than any of the results from Ada 002. 00:12:30.400 |
And we have the other one on red teaming via jailbreaking. 00:12:43.840 |
And then here we have including limitations and risks that might be exploited by malicious actors. 00:12:48.320 |
So, that's another part of red teaming, like testing it, see if people can use these things 00:12:53.200 |
Red teaming approaches are insufficient for addressing these in the AI context. 00:13:00.640 |
Processes such as red teaming exercises help organizations to discover their own 00:13:07.360 |
limitations and vulnerabilities as well as those of the AI systems they develop. 00:13:16.160 |
A red team exercise is a structured effort to find flaws and vulnerabilities in a plan, 00:13:25.600 |
Often performed by a dedicated red team that seeks to adopt an attacker's mindset and methods. 00:13:35.280 |
There are a few, I think, good, insightful things in here. 00:13:46.320 |
So, I think that's, in my opinion, better than the OpenAI responses. 00:13:54.640 |
Moving on to the E5 results: here we're talking about publicly available red team datasets. 00:14:00.800 |
It's a data set that they're obviously talking about here. 00:14:07.520 |
But it's not really explaining red teaming; I wouldn't know what red teaming is based on this. 00:14:13.440 |
Then again, we're talking about red teaming here. 00:14:15.280 |
A literature review on red teaming AI systems. 00:14:19.600 |
Informational interviews with experts in the field of trust and safety. 00:14:25.840 |
In general, we found that red team members enjoyed participating in our experiments and felt 00:14:31.760 |
motivated by a mission to make AI systems less harmful. 00:14:36.480 |
So, kind of relevant, but it could be better. 00:14:51.520 |
In any case, I think it's clear here that E5's performance is not quite as good as the other two models. 00:15:00.160 |
And generally, I think the Cohere model outperformed both in this scenario. 00:15:06.000 |
But we should also note that this here is the base model. 00:15:11.440 |
And generally, what you'll find with these models is that the large model will perform better. 00:15:17.680 |
So, we might even be able to get comparable results with that. 00:15:22.080 |
Now, I'm not going to go through all of these now. 00:15:25.760 |
Instead, I'll just leave these notebooks here so you can go and check them out. 00:15:32.800 |
The remaining questions are mainly, you know, about Llama 2 and other things that are within these papers. 00:15:37.200 |
And generally speaking, OpenAI, Cohere, and E5 all got pretty good results. 00:15:44.640 |
And between Cohere and OpenAI, for me, Cohere seemed to perform slightly better. 00:15:54.000 |
So, I feel like a lot of this will be down to personal preference to some degree. 00:16:00.240 |
But at some point, of course, I'll test these with more data and try and get a better feel for how they compare. 00:16:13.600 |
I hope seeing a couple of these alternative embedding models has been useful and interesting.