Today, we are going to be taking a look at the different types of semantic chunkers that we can use in order to chunk our data for applications like RAG in a more intelligent and effective way. For now, we're going to be focusing on the text modality, which is generally going to be used for RAG, but we can apply this to video and also audio.
But for now, let's stick with text. So I'm going to take you through three different types of semantic chunkers. Everything we're working through today is going to be in the semantic chunkers library, and we're going to go to the chunkers intro notebook here, and I'm just going to go ahead and open this in Colab.
Okay, cool. So first thing I'm going to do is just install the prerequisites. So you have semantic chunkers, of course, and Hugging Face datasets. So we're going to be pulling in some data to just test these different methods for chunking and see what difference it makes, especially in terms of latency and also just quality of what we get out of it.
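For reference, the install cell looks something like this (just those two packages; version pins are optional):

```python
!pip install -qU semantic-chunkers datasets
```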
Okay, cool. So we can come down to here, and let's take a look at our dataset. Our dataset contains a set of AI arXiv papers. We can see one of them here - this is the Mamba paper - and you can see there are a few different sections in here already.
We have the title, we have the authors and where they're doing their research from, and then we have the abstract. So what we're going to do is - I mean, you can do this, or you can just use the full content of the paper if you want, it's up to you.
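Loading the data looks roughly like this - note that the exact dataset name and field here are my assumptions, so swap in whichever arXiv dataset you're working with:

```python
from datasets import load_dataset

# An arXiv papers dataset from the Hugging Face hub (name assumed)
dataset = load_dataset("jamescalam/ai-arxiv2", split="train")

# Take a single paper's text and trim it, since one of the chunkers
# we'll try below is slow on long documents
content = dataset[0]["content"][:20_000]
```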
But one of these chunkers in particular can be pretty slow and quite intensive, so I've limited the amount of text that we're using here. The other two are pretty fast, so it's mainly for that one. Now, we will need an embedding model to perform our semantic chunking. Semantic chunking, at least the versions of it that we show here, relies on embedding models and on finding the semantic similarity between embeddings in some way or another.
So in this example, we're going to be using OpenAI's text-embedding-3-small model, which means you will need an OpenAI API key - or, if you'd prefer not to get an API key, you can use an open-source model instead.
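As a sketch, the encoder setup might look like this - the encoder classes come from the semantic-router package that semantic-chunkers builds on, so treat the exact imports as an assumption:

```python
import os
from getpass import getpass
from semantic_router.encoders import OpenAIEncoder

# text-embedding-3-small needs an OpenAI API key
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass("OpenAI API key: ")
encoder = OpenAIEncoder(name="text-embedding-3-small")

# Open-source alternative, no API key needed:
# from semantic_router.encoders import HuggingFaceEncoder
# encoder = HuggingFaceEncoder()
```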
But I'm going to be sticking with OpenAI. Okay, so I've initialized my encoder, and now I'm going to come down to the statistical chunking method. This is probably the chunker I would recommend people use out of the box, because it handles a lot of the figuring out of different parameters for you.
It's pretty cost-effective, and it's pretty fast as well. So this is generally the one I recommend, but we'll have a look at the others too. We'll start with the statistical chunker. The way this works is that it identifies a good similarity threshold value for you based on the varying similarity throughout a document.
So the similarity threshold it uses for different documents, and for different parts of a document, may actually change - but it's all calculated for you, and it tends to work very well. Let's run it and have a look at a few of our chunks - you'll see that it runs very quickly.
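Running it is a couple of lines - a minimal sketch assuming the library's documented call pattern (the chunker is called on a list of documents, and chunker.print is a display helper):

```python
from semantic_chunkers import StatisticalChunker

# No score threshold needed: the statistical chunker calibrates its own
chunker = StatisticalChunker(encoder=encoder)

# Chunk our single document; the result is one list of chunks per doc
chunks = chunker(docs=[content])
chunker.print(chunks[0])
```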
The first chunk here includes our title, the authors, and the abstract - so it's kind of like the introduction to the paper. After that, we have what I assume is probably one of the first paragraphs in the paper. Then we go on to the second section title here, and so on and so on.
But generally speaking, these chunks look relatively good. Of course, you will probably want to look through them in a little more detail, but just looking at the start of each one, they seem pretty reasonable. So that is statistical chunking - it's pretty easy. The next one is consecutive chunking, and consecutive chunking is probably the second one I would recommend.
It is, again, cost-effective and relatively quick, but it does require a little more tweaking, or input from you. That's primarily down to the score threshold: different encoders require different score thresholds. For example, for text-embedding-ada-002, a good similarity threshold is anything roughly within the range of 0.73 to 0.8.
That's usually the sort of similarity threshold you'd need to use. With the newer text-embedding-3 models, like text-embedding-3-small here, you need to use something much smaller, which is why I've gone with 0.3. So there's a bit more to it - you actually need to input something here.
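So the usage sketch passes that threshold in explicitly (same assumed API shape as above):

```python
from semantic_chunkers import ConsecutiveChunker

# Similarities from text-embedding-3-small run low, hence 0.3 rather than ~0.75
chunker = ConsecutiveChunker(encoder=encoder, score_threshold=0.3)
chunks = chunker(docs=[content])
chunker.print(chunks[0])
```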
And to be honest, it depends - in some cases the performance can be better, I think, but a lot of the time it's actually harder to get very good performance with this one. For example, here I can see that it's probably splitting too frequently.
So I may even want to modify my threshold. I've decreased it to 0.2, and that seems a little more reasonable - I may even want to go a little bit lower, but that looks a bit better.
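Adjusting it is just a parameter change:

```python
# Lower threshold -> splits trigger less often -> fewer, larger chunks
chunker = ConsecutiveChunker(encoder=encoder, score_threshold=0.2)
chunks = chunker(docs=[content])
chunker.print(chunks[0])
```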
So that is the consecutive chunker, and it uses a completely different process: it essentially creates your embeddings one after the other. It first splits your text into sentences - the same is true for all of the chunkers here - then it merges those sentences into larger chunks, and this one in particular looks for where there is all of a sudden a drop in similarity between neighbouring sentences.
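Conceptually - and this is just an illustrative sketch, not the library's actual implementation - that split logic looks something like this, assuming we already have one NumPy embedding per sentence:

```python
import numpy as np

def consecutive_split_points(sentence_embeddings, score_threshold):
    """Illustrative only: split wherever cosine similarity between
    neighbouring sentence embeddings drops below the threshold."""
    splits = []
    for i in range(len(sentence_embeddings) - 1):
        a, b = sentence_embeddings[i], sentence_embeddings[i + 1]
        sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        if sim < score_threshold:
            splits.append(i + 1)  # sentence i+1 starts a new chunk
    return splits
```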
And that is how it decides, OK, this is a logical point to split our chunk. So that's what we're doing there. The final one I'll show you is the cumulative chunker. What this one does is take our sentences: we start with sentence one, then we add sentence two and create an embedding.
Then we add sentence three and create another embedding, and we compare those two embeddings - so we're comparing the embedding of sentences one and two against the embedding of sentences one, two, and three - and we see if there is a big change in similarity. If not, we continue, next comparing the group of three sentences against the group of four sentences.
We look at those two, see if there's a sudden similarity change, and if there is, that is where we make our split. So what you're doing here is cumulatively adding text, creating your embeddings, then continuing - cumulatively adding more text, and so on. The result is that this takes a lot longer to run.
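Again purely as an illustrative sketch (not the library's code), with `embed` standing in for a single-text embedding call that returns a NumPy array:

```python
import numpy as np

def cumulative_split_points(sentences, embed, score_threshold):
    """Illustrative only: embed a growing window of sentences and split
    where adding one more sentence causes a sudden similarity drop."""
    splits, start = [], 0
    prev = embed(sentences[0])
    for i in range(1, len(sentences)):
        curr = embed(" ".join(sentences[start:i + 1]))  # window grown by one
        sim = float(np.dot(prev, curr) / (np.linalg.norm(prev) * np.linalg.norm(curr)))
        if sim < score_threshold:
            splits.append(i)            # split before sentence i
            start = i                   # restart the window here
            curr = embed(sentences[i])
        prev = curr
    return splits
```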
It's also a lot more expensive, because you're creating far more embeddings. But you'll find that, compared to the consecutive chunker at least, it's a little more noise-resistant - it requires a bit more of a sustained change to actually trigger a split. So the results tend to be a bit better, though I'd say on par with, or maybe even a little worse than, the statistical chunker in a lot of cases. Still, it's worth trying just to see what gets the best performance for your particular use case.
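Usage follows the same assumed pattern as the other two chunkers:

```python
from semantic_chunkers import CumulativeChunker

# Many more embedding calls than the other chunkers, so slower and pricier
chunker = CumulativeChunker(encoder=encoder, score_threshold=0.3)
chunks = chunker(docs=[content])
chunker.print(chunks[0])
```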
Okay. So we can see that, yes, this chunker definitely took a bit longer than the others. Let's have a look at what chunks we got. Coming back up - again, I probably should have changed the threshold here - but let's start with these.
So, yeah, generally I think we see probably worse performance here than we got with the statistical chunker. But especially if I modify the threshold and tweak it a little, you can generally get better performance than the consecutive chunker. So those are the three chunking methods we currently have in the semantic-chunkers library.
I will also add that there are differences between these chunkers in which modalities they can handle. The statistical chunker, for now, is only able to handle the text modality - so it's great for RAG, but not so great if you're wanting to chunk video, for example. The consecutive chunker, on the other hand, is really good at chunking video.
We have an example of that, and I will walk through it in the near future, but it's something to consider: the consecutive chunker can handle basically any modality, whereas things like the statistical chunker cannot, and the cumulative chunker is more text-focused as well. So for now, that is it on semantic chunkers.
I hope this has been useful and interesting. So thank you very much for watching and I will see you again in the next one. Bye.