
Cohere vs. OpenAI embeddings — multilingual search


Chapters

0:00 What are Cohere embeddings
0:46 Cohere v OpenAI on cost
4:37 Cohere v OpenAI on performance
6:37 Implementing Cohere multilingual model
7:55 Data prep and embedding
10:45 Creating a vector index with Pinecone
14:07 Embedding and indexing everything
17:24 Making multilingual queries
21:55 Final thoughts on Cohere and OpenAI

Transcript

Today, we're going to take a look at Cohere's multilingual embedding model. For those of you who are not aware of Cohere, they are similar to OpenAI in that they are essentially a service provider of large language models and all of the services that come with that. Right now they are not as well known as OpenAI, which is understandable, since OpenAI has been around for a bit longer, but Cohere is a really good company that offers a lot of tooling that is very much comparable to what OpenAI offers.

And that's actually the first thing I want to look at here: a few comparison points between Cohere and OpenAI in terms of embedding models. So we're going to first take a look at cost. OpenAI's premier embedding model right now is Ada-002 (text-embedding-ada-002), which is priced per 1,000 tokens, at around $0.0004 per 1,000 tokens at the time of recording.

Cohere doesn't price per 1,000 tokens; instead it charges $1 per 1,000 embeddings. What does one embedding mean? Well, basically every call, every chunk of text that you ask Cohere to embed, is one embedding. And the maximum size of one embedding is just over 4,000 tokens.

So if you're maxing out every embedding, as in you are sending 4,000 tokens with every embed call, then the effective price works out to roughly half of OpenAI's, which is pretty good. Now, to translate this into something a bit more understandable: roughly 13 paragraphs is about 1,000 tokens.

In those terms, the prices look like this: with Ada, with OpenAI, it's $1 per 32,500 paragraphs. Cohere is $1 per 65,000 paragraphs, which is really good, but there is obviously a catch, which is that pricing of $1 per 1,000 embeddings. The chances are you're not going to pack 4,000 tokens into every call to Cohere.

Even 2,000 tokens is about 26 paragraphs, and if you're embedding 26 paragraphs at a time... realistically, you're probably going to send much less, right? So if, let's say, you're going for more like 1,000 tokens per call, which I think is more realistic, then Cohere actually works out more expensive than OpenAI in this instance, roughly double or more.
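
To make that concrete, here's a rough back-of-the-envelope comparison in Python. Both prices are assumptions based on the pricing pages at the time of recording, so treat this as a sketch, not a quote.

```python
# Back-of-the-envelope cost comparison. Both prices are assumptions
# taken from the pricing pages at the time of recording.
OPENAI_PER_1K_TOKENS = 0.0004      # Ada-002: ~$0.0004 per 1,000 tokens
COHERE_PER_EMBED_CALL = 1 / 1000   # Cohere: $1 per 1,000 embeddings

def openai_cost(total_tokens: int) -> float:
    return total_tokens / 1000 * OPENAI_PER_1K_TOKENS

def cohere_cost(total_tokens: int, tokens_per_call: int) -> float:
    # Cohere bills per call, so the effective rate depends on how
    # full each call is.
    calls = -(-total_tokens // tokens_per_call)  # ceiling division
    return calls * COHERE_PER_EMBED_CALL

total = 1_000_000  # embed one million tokens
for chunk in (4000, 2000, 1000, 500):
    print(f"{chunk} tokens/call: OpenAI ${openai_cost(total):.2f}, "
          f"Cohere ${cohere_cost(total, chunk):.2f}")
# Under these assumed prices the break-even point is around 2,500
# tokens per call: pack more in and Cohere is cheaper, less and it
# comes out more expensive.
```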

So it depends on whether you're packing a load of text into each embedding call or not. I think the costs are pretty comparable: Cohere can be cheaper, but it can also be more expensive, according to this logic anyway. Okay, one thing I missed very quickly is the on-prem solution that Cohere offers.

So we have it here. Essentially, you can run your own AWS instance, and, assuming you're running it at 100% utilization, in the time it would take you to encode 1 billion paragraphs with Cohere's on-prem solution, you would end up paying $2,500.

It's also a lot quicker, and there are other benefits as well, but I thought when we're talking about cost, we should definitely include that. So, you know, it depends, essentially. The next factor is embedding size, which is also a good indicator of how much it's going to cost you.

This one falls under cost as well: the higher the dimensionality of your embeddings, the more storage you need to keep them all after you've created them. Smaller embeddings are cheaper to store, and Cohere's 768 dimensions are half the size of OpenAI's 1,536. So long-term, you would probably be saving money with Cohere on storage if you're storing a lot of vectors.

So, you know, that's definitely something to consider. If you weigh this against the initial embedding cost, maybe you're actually saving money with Cohere even if you're only embedding 1,000 or even 500 tokens at a time. Long-term, you're probably going to end up saving money.
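
As a minimal sketch of that storage difference, assuming raw float32 vectors (4 bytes per dimension) and ignoring index overhead:

```python
# Approximate raw storage for N float32 vectors (4 bytes per dimension).
def storage_gb(num_vectors: int, dims: int) -> float:
    return num_vectors * dims * 4 / 1024**3

for dims, name in ((768, "Cohere multilingual"), (1536, "OpenAI Ada-002")):
    print(f"{name} ({dims}d): {storage_gb(10_000_000, dims):.1f} GB for 10M vectors")
# 768d -> ~28.6 GB, 1536d -> ~57.2 GB: half the dimensions, half the storage.
```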

Now, performance. This is kind of hard to judge, because this is a single benchmark that Nils Reimers has put together. And, okay, Cohere for sure is coming out on top here, but it's hard to say whether this is representative across the board or not.

But nonetheless, the two models that are comparable here are Cohere's multilingual model and OpenAI's Ada-002 model, which is an English model, and this is an English search task. So it's pretty interesting that OpenAI's best English-language model is only comparable to Cohere's multilingual model. Cohere's English model is better. And then there's the Cohere reranker.

This is not an embedding model. Imagine you retrieve all of your items, or you take two chunks of text, and you feed them into a transformer model together so it compares them directly. That is basically a lot slower, but generally speaking, it will be more accurate. So I think these are pretty interesting results.
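
To illustrate what that retrieve-then-rerank pattern looks like, here's a minimal sketch using a generic open-source cross-encoder from the sentence-transformers library. This is just to show the idea, not Cohere's actual reranker, and the model name is only an example.

```python
from sentence_transformers import CrossEncoder

# A cross-encoder scores (query, passage) pairs jointly in one forward
# pass instead of comparing precomputed embeddings: slower, but usually
# more accurate. The model name here is just an example.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is a vector database?"
candidates = [
    "A vector database stores embeddings and supports similarity search.",
    "Relational databases were first commercialized in the 1970s.",
]

# Score every candidate against the query, then sort best-first.
scores = model.predict([(query, passage) for passage in candidates])
reranked = sorted(zip(scores, candidates), reverse=True)
print(reranked[0][1])
```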

It seems like OpenAI and Cohere are very much on par, but from what I've seen here, Cohere is slightly ahead of OpenAI in terms of performance on that single benchmark (which is not the best comparison, in all fairness), and also slightly cheaper in the long run because of the smaller embedding size.

But again, everything here is so close that it's going to depend a lot on your particular use case. So it's not that Cohere is better than OpenAI. It's just that in some cases, they probably are better. And in some cases, they're probably cheaper as well. So that's definitely something to consider.

Now, how do we actually use Cohere for embeddings? We're going to be focusing on the Cohere multilingual model. The example we're going to run through is not really my example: I've taken it from Nils Reimers, based on a webinar that we are doing together. He put all of this together, and I've just reformatted it so that I can show you how it works, focusing on the multilingual search component of Cohere.

So let's just jump straight into it. Right, so the first thing we need to do is our pip installs. We have Hugging Face datasets here, the Cohere client, and the Pinecone client. We're using the gRPC version of the Pinecone client so that we can upsert things faster. We'll see how to use that soon.
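
In the notebook that looks something like this (package names as they were at the time; the grpc extra is what enables the faster upserts):

```
!pip install -qU datasets cohere "pinecone-client[grpc]"
```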

Now, I actually have a couple of notes here. A couple of things to point out with Cohere's multilingual model are that it supports more than 100 languages (I think the benchmarks they've tested it on cover 16 of those languages, or somewhere around there), and that you can create embeddings for longer chunks of text.

And this is the dataset we're going to be using. It's Wikipedia data that Nils put together, I believe, and it's hosted under the Cohere organization on Hugging Face datasets. So let's have a look. For now, we're just going to look at English and Italian, and we're going to see how we would embed those and build a search with them.

And then what I'm going to do is switch across to an example where we have way more data in the database, covering, I think, nine languages. But it is pretty interesting. So this is what the data looks like: we just have some text in the middle, and that's what we're going to be encoding.
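
Loading those two languages looks roughly like this. I'm assuming the dataset ID and the streaming setup from the notebook here:

```python
from datasets import load_dataset

# Stream the English and Italian splits so we don't download everything
# up front (dataset ID assumed from the notebook).
data = {
    lang: load_dataset("Cohere/wikipedia-22-12", lang,
                       split="train", streaming=True)
    for lang in ("en", "it")
}

# Peek at one record: the "text" field is what we'll be encoding.
print(next(iter(data["en"])))
```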

So if we're embedding these chunks one at a time, maybe it would be more expensive using Cohere. But in reality, we could put a lot more of these together, like five of these chunks or more, and it should work pretty well. So, okay, let's go down.

Here, you need a Cohere API key. To get that, you go to dashboard.cohere.ai. You'll probably have to log in if you haven't already, and then over on the left you will find your API keys.

From there, you take your API key and you just put it in here. I have my API key stored already in a variable. Cool. Then this is how you would embed something: we have a list of texts that we would like to embed, and we just pass them to co.embed.

Here, co is just the client that we initialized up above. So co.embed takes the texts, and then you have your model. This is the only multilingual model that Cohere offers at the moment, but if you compare that to OpenAI, right now they only offer English models, so I think Cohere has taken the lead there, which is pretty cool.

Then we pull the embeddings from the response. So we create our embeddings, and the response contains a lot of information, but all we need are the embeddings, so we're just pulling those out. And then we check the dimensionality of those embeddings, which is going to be 768.
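
Put together, that embedding step looks something like this (model name as shown in the notebook):

```python
import cohere

co = cohere.Client(COHERE_API_KEY)  # your key from dashboard.cohere.ai

texts = [
    "Hello, how are you?",
    "Ciao, come stai?",  # "Hello, how are you?" in Italian
]

# Embed both sentences with the multilingual model.
res = co.embed(texts=texts, model="multilingual-22-12")
embeds = res.embeddings

print(len(embeds), len(embeds[0]))  # 2 vectors, 768 dimensions each
```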

So that's the dimensionality, and we have two of those vector embeddings there: two 768-dimensional vectors, because we embedded two sentences. All right, that's how we use Cohere's embedding model. But before we can store all of those embeddings, we need to initialize an index.

So we're going to be using a vector database called Pinecone for this. For Pinecone, again, we need an API key, which we can get for free from app.pinecone.io. Okay, cool. So come over here; I can already see I have a couple of indexes in here.

If this is your first time using Pinecone, it will be empty, and that's fine because we're going to create the index in the code. But what you do need is your API key, right? So your API key is here. You copy that, take it over into your notebook, and you would paste it here.

Now, again, I've stored mine in a variable. Then you also have your environment, which is next to the API key in the console: here, us-east1-gcp. Your environment is not necessarily going to be the same as mine, so you should check that. Okay, great. So that has initialized, and then we come down here and initialize an index, which is where we're going to store all of these embeddings.
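
The client initialization we just walked through looks like this (your environment string may well differ from mine):

```python
import pinecone

# Both values come from the Pinecone console at app.pinecone.io;
# the environment is shown right next to your API key.
pinecone.init(
    api_key=PINECONE_API_KEY,
    environment="us-east1-gcp",  # check yours, it may differ
)
```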

Now, you give your index a name. It doesn't matter what you call it. Okay, you can call it whatever you want. But there are a few things that are important here that we should not change. So dimension. Dimension is the dimensionality of your embedding. So it's coming from Cohere, right?

This is where the price advantage of Cohere that I mentioned before comes in: when dimensionality is lower, like 768, it's cheaper to store all of your vectors if you're paying for that storage. So we need that value, and our index needs to know it too.

So it needs to know the expected dimensionality of the vectors we're putting into it. Then we have our metric, which is dot product. This is needed by Cohere's multilingual model. If you look on the, I think, the about page for the multilingual model, it will say you need to use dot product.

And then these here, you can actually leave them empty; the default values are also okay, but I thought I'd put them in there. So s1 is the storage-optimized pod type for Pinecone, which means you can put around 5 million vectors in here on a single pod.

And then there's also p1, which is the speed-optimized version, which fits around 1 million vectors per pod. And pods is the number of those pods you need: so if you needed 10 million vectors, we'd say, "Okay, we need two pods here." Cool, but we just need one.

We're not putting that much in there. So we run that, then we connect to the index. We use this GRPCIndex; we could also use the standard Index, but GRPCIndex is more stable, and it's also faster, so we're doing that. And then we're going to describe the index stats.
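
All together, that index setup looks roughly like this. The name is arbitrary, but the dimension and metric are the parts that must match Cohere's model:

```python
index_name = "cohere-multilingual"  # call it whatever you like

# Create the index once: 768 dimensions to match the Cohere embeddings,
# dot product as the metric the model requires, one storage-optimized pod.
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=768,
        metric="dotproduct",
        pod_type="s1",
        pods=1,
    )

# GRPCIndex behaves like the standard Index but is faster for upserts.
index = pinecone.GRPCIndex(index_name)
print(index.describe_index_stats())
```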

So we're going to see what is in there. Now, I already created the index before, so for you, when you're running through this the first time, this will actually say zero. For me, I've already added things in there, and that's why it's at 200,100. Now, with the embedding model and vector index set up, we can move on to actually indexing everything.

So basically, we're just going to loop through our dataset and do what we just did: we're going to embed things with Cohere, and then take those embeddings and add them into Pinecone. Actually, I don't think I showed you how we do that, but it's really simple.

It's actually just this one line here. But let me explain what we have here. The batch size is the number of items that we're going to send to Cohere and then upsert into Pinecone at any one time. Then the limit: this is the number of records from each language that we would like to embed and add to Pinecone.

We have our data here, and I'm just formatting it so that it's a bit easier later on when we get to this bit here. And errors: this is just so we can store a few errors, because every now and again we might hit one, and I'll explain why.

Handling them isn't strictly necessary; there are ways to avoid the errors, and they're not that hard, but for simplicity's sake I haven't included them here. So here I'm just saying don't go over the limit, and then we're going through English and Italian one at a time. We get the relevant batch from our data, which we've created here.

So it's actually just iterating over the data, first English and then Italian. We extract the text from the batch, and we create our embeddings using that text. Then we just create some IDs, using the ID field that was in the data up at the top here, and we keep the text alongside as well.

Then what we do is we create this metadata list of dictionaries. Now, each dictionary is going to contain some text, a title from the record, the URL of the record, and also the language, so English or Italian. Then what we do is we add everything like this. So it's pretty straightforward.
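
Condensed, that loop looks something like this: a sketch under the same assumptions as before, with the record fields (id, text, title, url) taken from what the dataset showed earlier. The try/except at the end relates to the metadata errors explained just below.

```python
batch_size = 100   # records sent to Cohere / Pinecone per round trip
limit = 100_000    # max records to embed per language
errors = []        # records we skip, explained below

for lang, dataset in data.items():
    batch = []
    for i, record in enumerate(dataset):
        if i >= limit:
            break
        batch.append(record)
        if len(batch) < batch_size:
            continue  # keep filling the batch
        # Embed the whole batch in one Cohere call.
        texts = [r["text"] for r in batch]
        embeds = co.embed(texts=texts, model="multilingual-22-12").embeddings
        # Build IDs and metadata for each record.
        ids = [f"{lang}-{r['id']}" for r in batch]
        meta = [{"text": r["text"], "title": r["title"],
                 "url": r["url"], "lang": lang} for r in batch]
        try:
            index.upsert(vectors=list(zip(ids, embeds, meta)))
        except Exception as err:
            errors.append(err)  # e.g. metadata over the size limit
        batch = []
    # (A fuller version would also flush the final partial batch.)
```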

There's nothing too complicated going on there. The one thing that I have added is some error handling, because occasionally... well, we saw the text earlier on, and they were pretty short chunks, but for some reason not all of them are like that. It's kind of a messy dataset.

So some of them are actually quite long, and they exceed the metadata limit in Pinecone, which is 10 kilobytes per vector. Basically, we can attach up to around 10 kilobytes of text as metadata per vector in Pinecone, but some records go over that, and they will throw an error.

So for now, I'm just skipping those. But in reality, what you would do is split those larger pieces of text into smaller chunks and add them individually, or just store your text somewhere else; it doesn't have to go into Pinecone. Right. Now, I've already run this.

I'm not going to run it again. I can just come down to here and run this: we have our describe index stats, and it looks the same as it did before for me. Okay, cool. Now, what we're going to do, and I think this is the more interesting part, is searching.

To search, we take a query and embed it; the embed call is exactly the same as what we did before with Cohere. Then we query with that embedding, xq here, and return the top three most similar items. We also want to include the metadata, which is going to contain our text, title, and a couple of other things.

The URL is pretty important. And then we return the results in this kind of format. We also include, and this is a pretty good idea from Nils, a translate URL. So when we're getting Italian results, or results in any other language, we just click on it, and it will take us to Google Translate, where we can see what the page actually says.
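
The search function, including that translate-link trick, looks something like this. The Google Translate URL format here is the standard page-translation pattern, which I'm assuming matches what the notebook builds:

```python
import urllib.parse

def search(query: str, top_k: int = 3):
    # Embed the query exactly as we embedded the documents.
    xq = co.embed(texts=[query], model="multilingual-22-12").embeddings[0]
    res = index.query(xq, top_k=top_k, include_metadata=True)
    for match in res.matches:
        meta = match.metadata
        # Link to a Google-translated version of the source page so we
        # can read results written in other languages.
        translate_url = (
            "https://translate.google.com/translate?sl=auto&tl=en&u="
            + urllib.parse.quote(meta["url"], safe="")
        )
        print(round(match.score, 3), meta["title"], translate_url)

search("What is the Mafia Capitale case?")
```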

So let's run this. We can try both of these queries. I'm not even sure if they work that well, because we don't have that much data in here, but we can try. Okay. I don't know any...

Okay, yeah, so number three here. So this is, you know, he's famous in Italy, but I think less famous outside of Italy. So if we go to here, you see translation, and you can see, okay, he's one of the most important and prestigious personalities in the fight against the mafia.

He was killed by Cosa Nostra together with his wife and so on and so on. Right, so he's super famous in Italy, but if you look on Wikipedia for him in English, I think it mentions a little bit about him, but there isn't really that much information there. So that's why we're getting, you know, we're just getting like Italian results here.

And then if we go for this one as well. This is another one where, I think, the English Wikipedia has about a paragraph on it, but if you go to the Italian Wikipedia, there is a ton of information. Now, in this index, I don't have enough data.

So let's switch across to the larger dataset, and I'll show you what the results look like there; they're much better. Okay, I can ask about this one here: what is the Mafia Capitale case? Okay, and we get Mafia Capitale here. And if you go to translate, you can see, yes, that is the thing that I was talking about.

And then if we go to Wikipedia here, I want to point out: okay, so you get all of this text, which is tons. If we go to the English version, so I'm searching Mafia Capitale in Google here, what do we get? We get this: literally three paragraphs.

So, you know, basically nothing. You can see why it would surface the Italian content here, and why being able to search the Italian content is useful even if you're searching in English. Now, for another one, we're going to ask what is "Arancino", but I'm going to spell it wrong, just to point out the fact that it can actually handle that.

Maybe "Arancino"... oh, did I spell it right? No, no, I did get it wrong, okay. So this spelling is wrong; the one I typed before was actually correct. I kind of half expected to get it wrong anyway. All right, so we can go on here and see what it says. So Arancino is a speciality of Sicilian cuisine.

Arancini di riso. It's very nice, if you ever have the chance to try it, you should, along with a pizza. Arancino, pizza, and fiori di zucca: it's amazing, like my favorite meal. Okay, so let's try one more. Who is Emma Maroni? Is that right? Yes. Okay, so go to here.

And I don't actually know who this is, so I hope this is correct; it's apparently this person. Okay, so that's it for this introduction to Cohere. I feel like it was a bit longer than I had intended it to be, but that's fine. I'm hoping it was at least useful, and we went through a lot of things there.

So, yeah, I just wanted to share this. It's an alternative to OpenAI. I'm not saying it's necessarily better, and I'm not saying it's necessarily cheaper; that is very much going to depend on your use case, what you're doing, and many other factors. You can train these models too, for example, and if you're able to do that, you're probably going to get some pretty good performance as well.

And I suppose one big factor here is the multilingual aspect of this model. At the moment, OpenAI doesn't offer any multilingual models; none of theirs are actually trained for that. Some of them, I think, can handle multilingual queries relatively well, but they haven't been trained for it, and that can be a real problem, especially when you're dealing with multinational companies, or just companies that are not American, British, or Australian.

The rest of the world speaks different languages, so having this multilingual model is pretty valuable. So, yeah, it's still very early days for Cohere. I'm pretty excited; I know they have a lot planned, and it will be really interesting to see.

But for now, I think we'll leave it there. I hope all this has been useful and interesting. So, thank you very much for watching and I will see you again in the next one. Bye.