Semantic Chunking - 3 Methods for Better RAG
Chapters
0:00 3 Types of Semantic Chunking
0:42 Python Prerequisites
2:44 Statistical Semantic Chunking
4:38 Consecutive Semantic Chunking
6:45 Cumulative Semantic Chunking
8:58 Multi-modal Chunking
Today, we are going to be taking a look at the different types of semantic chunkers that we can use to chunk our data more intelligently for applications like RAG.
For now, we're going to be focusing on the text modality, which is generally what you'd use for RAG, but we can apply this to video and also audio.
So I'm going to take you through three different types of semantic chunkers.
Everything we're working through today is in the semantic-chunkers library, and we're going to use the chunkers intro notebook here, so I'm just going to go ahead and open that up.
The first thing I'm going to do is install the prerequisites: semantic-chunkers, of course, and Hugging Face datasets.
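As a rough sketch, the install from the notebook would look something like this (the PyPI package name semantic-chunkers is assumed from the library's name):

```python
!pip install -qU semantic-chunkers datasets
```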
So we're going to be pulling in some data to test these different chunking methods and see what difference each makes, especially in terms of latency and also the quality of the chunks we get out.
So we can come down to here, and let's take a look at our dataset.
Our dataset contains a set of AI arXiv papers.
This is the Mamba paper, and you can see there are a few different sections in here already: we have the authors, where the authors are studying or performing research from, and then we have the abstract, and so on.
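For reference, pulling in a dataset like this from Hugging Face looks roughly like the following; the exact dataset ID isn't named in the transcript, so jamescalam/ai-arxiv2 here is an assumption:

```python
from datasets import load_dataset

# Load a set of AI arXiv papers from Hugging Face.
# NOTE: the dataset ID is an assumption; use whichever dataset you prefer.
data = load_dataset("jamescalam/ai-arxiv2", split="train")

# Take the full text of the first paper (the Mamba paper in the video).
content = data[0]["content"]
print(content[:300])  # preview the opening of the paper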
Now, you can do this, or you can just use the full content of each paper. But one of these chunkers in particular can be pretty slow and quite compute-intensive, so I've limited the amount of text that we're using here. The other two are pretty fast, so the limit is mainly for that one.
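Trimming the input is just a string slice; the cut-off length below is an arbitrary example, not a value from the video:

```python
# Keep only the first ~20k characters so the slower chunker stays manageable.
content = content[:20_000]
```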
We will need an embedding model to perform our semantic chunking. Semantic chunking, at least the versions of it that we show here, relies on embedding models and on finding the semantic similarity between embeddings in some way or another.
In this example, we're going to be using OpenAI's text-embedding-3-small model, so you will need an OpenAI API key. Or, if you prefer not to get an API key, you can use an open-source model instead.
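Setting up the encoder looks roughly like this. The encoder classes come from the companion semantic-router package, and the open-source model named below is my own example, not one from the video:

```python
import os
from getpass import getpass
from semantic_router.encoders import OpenAIEncoder

# Use OpenAI's text-embedding-3-small, as in the video.
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass("OpenAI API key: ")
encoder = OpenAIEncoder(name="text-embedding-3-small")

# Open-source alternative, if you'd rather not use an API key
# (model choice is an assumption):
# from semantic_router.encoders import HuggingFaceEncoder
# encoder = HuggingFaceEncoder(name="sentence-transformers/all-MiniLM-L6-v2")
```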
Now I'm going to come down to the statistical chunking method. This is probably the chunker I would recommend people use out of the box, because it handles a lot of the figuring out of parameter values for you. So this is generally the one I recommend, but we'll have a look at the others as well.
The way this works is that it identifies a good similarity threshold value for you based on the varying similarity throughout a document. So the threshold it uses for different documents, and for different parts of a document, may actually change, but it's all calculated for you. It tends to work very well, actually.
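Running it takes just a few lines; this sketch assumes the published semantic-chunkers API, where a chunker is built from an encoder and called on a list of documents:

```python
from semantic_chunkers import StatisticalChunker

# No threshold needed: the statistical chunker calibrates it per document.
chunker = StatisticalChunker(encoder=encoder)
chunks = chunker(docs=[content])

# One list of chunks per input document; the library includes a
# helper for pretty-printing them.
chunker.print(chunks[0])
```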
So if we have a look here, we have a few of our chunks. The first one includes our title, the authors, and the abstract, so it's kind of like the introduction to the paper. After that, we have what I assume is probably one of the first paragraphs of the paper. Then we go on to the second heading, and so on and so on.
Generally speaking, these chunks look relatively good. Of course, you will probably need to look through them in a little more detail, but just looking at the start of these, it looks pretty reasonable.
The next one is consecutive chunking, and this is probably the second one I would recommend. Again, it's cost-effective and relatively quick, but it does require a little more tweaking. That's primarily due to the score threshold: most of the encoders here require different score thresholds. For example, with text-embedding-ada-002, the similarity threshold you would typically need is fairly high. With the newer text-embedding models, like text-embedding-3-small, you need to use something much smaller, which is why I've gone with 0.3 down here. So there's a little more to it; you need to actually input something here.
In some cases, the performance can be better, I think, but a lot of the time it's actually harder to get very good performance with this one. For example, here I can see that it's probably splitting too frequently for what I'd want, so I may even want to modify my threshold. I've decreased it to 0.2, and it seems a little more reasonable.
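The consecutive chunker follows the same pattern, with a score_threshold you tune per encoder; a minimal sketch:

```python
from semantic_chunkers import ConsecutiveChunker

# 0.3 suits text-embedding-3-small; older models like
# text-embedding-ada-002 need a much higher threshold.
chunker = ConsecutiveChunker(encoder=encoder, score_threshold=0.3)
chunks = chunker(docs=[content])
chunker.print(chunks[0])

# Splitting too frequently? Lower the threshold and re-run:
chunker = ConsecutiveChunker(encoder=encoder, score_threshold=0.2)
chunks = chunker(docs=[content])
```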
Again, this is using a completely different process. It essentially creates your embeddings one after the other. First it splits your text into sentences; actually, that's the same for all of the chunkers here: they split your text into sentences and then start merging those sentences into larger chunks. This one in particular looks for where there is a sudden drop in similarity between consecutive sentences, and that is how it decides, OK, this is a logical point to split our chunk.
And then the final one I'll show you is the cumulative chunker. What this one does is take our sentences and, starting with sentence one, add sentence two and create an embedding. Then we add sentence three, create another embedding, and compare those two embeddings. So we're comparing the embedding of sentences one and two against the embedding of sentences one, two, and three, and seeing if there is a big change in similarity. If not, we continue, and then we compare the group of three sentences against the group with the next sentence added. We look at those two and see if there's a sudden similarity change; if there is, then that is where we make our split. So what you're doing here is cumulatively adding text, creating your embeddings, then continuing, cumulatively adding more text, and continuing.
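Usage mirrors the other two chunkers; a minimal sketch, where the 0.3 threshold is an assumption carried over from the consecutive example rather than a value given in the video:

```python
from semantic_chunkers import CumulativeChunker

# Embeds a growing window, one new embedding per added sentence,
# which is why this chunker runs much slower than the others.
chunker = CumulativeChunker(encoder=encoder, score_threshold=0.3)
chunks = chunker(docs=[content])
chunker.print(chunks[0])
```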
And the result of that is that this takes a lot longer to run. But you'll find that, compared to at least the consecutive chunker, it is a little more noise-resistant: it requires a bit more of a change over time to actually trigger a split. So the results tend to be a bit better, but I would say either on par with or maybe even a little worse than the statistical chunker in a lot of cases. Still, it's worth trying just to see what gets the best performance for your particular use case.
And we can see that, yes, this chunker definitely took a bit longer than the others.
Coming back up, I probably should have changed the threshold here, but let's take a look anyway. Generally, I think we see probably worse performance than we got with the statistical chunker. But if I modify the threshold here and tweak it a little, generally you can get better performance than with the consecutive chunker.
So those are the three chunking methods that we currently have in the semantic-chunkers library. I will also add that there are differences between these chunkers in which modalities they can handle. The statistical chunker, for now, is only able to handle the text modality: great for RAG, not so great if you're wanting to process video, for example. The consecutive chunker, on the other hand, is really good at processing video; we have an example of that, and I will walk through it in the near future. So the consecutive chunker can basically handle any modality, whereas things like the statistical chunker cannot, and the cumulative chunker is more text-focused as well.
So thank you very much for watching, and I will see you again in the next one.