
Semantic Chunking - 3 Methods for Better RAG


Chapters

0:00 3 Types of Semantic Chunking
0:42 Python Prerequisites
2:44 Statistical Semantic Chunking
4:38 Consecutive Semantic Chunking
6:45 Cumulative Semantic Chunking
8:58 Multi-modal Chunking

Whisper Transcript

00:00:00.000 | Today, we are going to be taking a look at the different types of semantic chunkers that
00:00:05.760 | we can use in order to chunk our data for applications like RAG in a more intelligent
00:00:13.540 | and effective way.
00:00:15.540 | For now, we're going to be focusing on the text modality, which is generally going to
00:00:21.220 | be used for RAG, but we can apply this to video and also audio.
00:00:26.420 | But for now, let's stick with text.
00:00:28.440 | So I'm going to take you through three different types of semantic chunkers.
00:00:33.080 | Everything we're working through today is going to be in the semantic chunkers library,
00:00:37.440 | and we're going to go to the chunkers intro notebook here, and I'm just going to go ahead
00:00:41.440 | and open this in Colab.
00:00:42.760 | Okay, cool.
00:00:43.760 | So first thing I'm going to do is just install the prerequisites.
00:00:48.300 | So you have semantic chunkers, of course, and Hugging Face datasets.
00:00:52.080 | So we're going to be pulling in some data to just test these different methods for chunking
00:01:00.200 | and see what difference it makes, especially in terms of latency and also just quality
00:01:05.340 | of what we get out of it.
00:01:07.640 | Okay, cool.
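(For reference, the install step is a single pip command; the package names below are assumptions based on the two libraries named here, semantic-chunkers and Hugging Face datasets.)

```python
# Colab cell: install the semantic-chunkers library and Hugging Face datasets.
!pip install -qU semantic-chunkers datasets
```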
00:01:08.780 | So we can come down to here, and let's take a look at our dataset.
00:01:17.340 | So our dataset contains a set of AI ArXiv papers.
00:01:21.100 | We can see one of them here.
00:01:23.020 | So this is the Mamba paper, and you can see there's a few different sections in here already.
00:01:28.100 | We have the title.
00:01:29.100 | We have authors, where the authors are studying or performing research from, and then we have
00:01:36.300 | the abstract.
00:01:37.300 | So what we're going to do is take a snippet of the content; you can also just use the full content
00:01:43.420 | of the paper if you want, it's up to you.
00:01:46.620 | But one of these chunkers in particular can be pretty slow and quite compute-intensive.
00:01:54.660 | So I've limited the amount of text that we're using here.
00:02:00.180 | The other two are pretty fast, so it's mainly for that one.
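(A rough sketch of the data-loading step. The dataset id and the "content" field are assumptions; any Hugging Face dataset with a column of full paper text works the same way.)

```python
from datasets import load_dataset

# Load a dataset of AI ArXiv papers from Hugging Face (dataset id assumed).
data = load_dataset("jamescalam/ai-arxiv2", split="train")

# Take one paper and truncate it, since one of the chunkers below is slow
# and expensive on long inputs.
content = data[0]["content"][:20_000]
print(content[:300])
```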
00:02:03.720 | So we will need an embedding model to perform our semantic chunking.
00:02:09.700 | Semantic chunking, at least the versions of it that we show here, relies on embedding
00:02:15.560 | models and on finding the semantic similarity between embeddings in some way or another.
00:02:20.700 | So in this example, we're going to be using OpenAI's text-embedding-3-small model.
00:02:27.100 | So you will need an OpenAI API key, or if you prefer not to, you can use an open-source
00:02:33.660 | model as well.
00:02:35.160 | So if you do want to use an open-source model instead, so you don't need to get an API key,
00:02:39.940 | you can just do this here.
00:02:41.300 | But I'm going to be sticking with OpenAI.
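(A sketch of the encoder setup. The encoders here come from the companion semantic-router package; the exact import path and defaults are assumptions from its docs.)

```python
import os
from getpass import getpass

from semantic_router.encoders import OpenAIEncoder

# OpenAI's text-embedding-3-small model; prompts for an API key if unset.
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass("OpenAI API key: ")
encoder = OpenAIEncoder(name="text-embedding-3-small")

# Or, to skip the API key entirely, an open-source local encoder:
# from semantic_router.encoders import HuggingFaceEncoder
# encoder = HuggingFaceEncoder()
```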
00:02:44.420 | Okay, so I've initialized my encoder.
00:02:47.700 | And now I'm going to come down to the statistical chunking method.
00:02:51.900 | And this is probably the chunker that I would recommend people to use just out of the box.
00:02:57.980 | The reason for that is that it will handle a lot of the sort of figuring out of different
00:03:03.860 | parameters for you.
00:03:06.260 | It is pretty cost-effective.
00:03:10.100 | And it's pretty fast as well.
00:03:13.900 | So this is generally the one I recommend, but we'll have a look at the others as well.
00:03:18.820 | So we'll start with the statistical chunker.
00:03:22.080 | And the way that this is going to work is that it's going to identify a good similarity
00:03:27.540 | threshold value for you based on the varying similarity throughout a document.
00:03:34.540 | So the similarity threshold that it will use for different documents and different parts of documents
00:03:39.420 | may actually change, but it's all calculated for you.
00:03:42.840 | So it tends to work very well, actually.
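(A minimal sketch of the statistical chunker in use; the print helper and call signature are assumptions based on the library's intro notebook.)

```python
from semantic_chunkers import StatisticalChunker

# The statistical chunker calibrates its own similarity threshold per
# document, so there is no score_threshold to tune here.
chunker = StatisticalChunker(encoder=encoder)

# chunker() takes a list of documents and returns one list of chunks per doc.
chunks = chunker(docs=[content])
chunker.print(chunks[0][:3])
```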
00:03:46.740 | So if we have a look here, we have a few of our chunks.
00:03:51.780 | So you can see that it ran very quickly.
00:03:53.740 | The first one here includes our title, the authors, and the abstract.
00:03:59.620 | Okay, so it's kind of like the introduction to the paper.
00:04:06.020 | Then after that, we have what I assume is probably one of the first paragraphs in the
00:04:09.940 | paper, as far as I can see.
00:04:14.580 | Then we go on to the second section here, or the second title here, and so on and so forth.
00:04:21.460 | But generally speaking, these chunks look relatively good.
00:04:26.220 | Of course, you will probably need to look through them in a little more detail.
00:04:31.040 | Just looking at the start of these, it looks pretty reasonable.
00:04:35.460 | So that is statistical chunking.
00:04:37.380 | It's pretty easy.
00:04:38.740 | The next one is consecutive chunking, and consecutive chunking is probably the second one I would
00:04:44.340 | recommend.
00:04:45.340 | It is, again, cost-effective and relatively quick, but it does require a little more tweaking
00:04:51.340 | or input from outside.
00:04:53.380 | That is primarily due to the score threshold.
00:04:56.980 | With the score threshold, most of the encoders here require different values.
00:05:02.580 | So for example, with text-embedding-ada-002, the similarity threshold for that model is anything within
00:05:09.620 | the range of 0.73 to 0.8.
00:05:13.600 | That's usually the sort of similarity threshold that you would need to use.
00:05:17.780 | With the newer text-embedding-3 models, like text-embedding-3-small, you need to use something
00:05:23.760 | much smaller, which is why I've gone with 0.3 down here.
00:05:27.000 | So there's a little bit more work here; you need to actually input something yourself.
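(A sketch of the consecutive chunker with an explicit threshold; argument names are assumptions from the library's intro notebook.)

```python
from semantic_chunkers import ConsecutiveChunker

# 0.3 is a starting point for text-embedding-3-small; older models like
# text-embedding-ada-002 want something around 0.73-0.8 instead.
chunker = ConsecutiveChunker(encoder=encoder, score_threshold=0.3)
chunks = chunker(docs=[content])
chunker.print(chunks[0][:3])

# If the splits come too frequently, lower the threshold and re-run,
# e.g. score_threshold=0.2.
```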
00:05:32.280 | And to be honest, it depends.
00:05:34.860 | In some cases, the performance can be better, I think.
00:05:39.180 | But a lot of the time, it's actually harder to get very good performance with this one.
00:05:44.600 | So for example, here, it's probably splitting too frequently from what
00:05:50.120 | I can see.
00:05:51.160 | So I may even want to modify my threshold, OK?
00:05:56.800 | And so I've decreased it to 0.2, and it seems a little more reasonable.
00:06:01.960 | Cool.
00:06:02.960 | I may even want to go a little bit lower.
00:06:04.440 | But that looks a bit better.
00:06:05.640 | So that is the consecutive chunker.
00:06:09.200 | Again, using a completely different process here.
00:06:12.560 | This is essentially creating your embeddings one after the other.
00:06:17.720 | So it first splits your text into sentences, actually.
00:06:20.400 | The same is true for all of the chunkers here.
00:06:23.320 | They split your text into sentences, and then they start merging your text into larger chunks.
00:06:29.080 | And this one especially is looking at where there is all of a sudden
00:06:33.400 | a drop in similarity between those sentences.
00:06:37.160 | And that is how it defines, OK, this is a logical point to split our chunk.
00:06:44.180 | So that's what we're doing there.
00:06:45.400 | And then the final one I'll show you is the cumulative chunker.
00:06:49.520 | Now the cumulative chunker, what this one will do is it takes our sentences and, OK,
00:06:54.440 | we start with sentence one, and then we add sentence two and create an embedding.
00:06:59.040 | And then we add sentence three, create another embedding, and then we compare those two embeddings.
00:07:04.520 | So we're comparing the embedding of sentences one and two, and one, two, and three, and
00:07:10.000 | then seeing if there is a big change in similarity.
00:07:14.800 | If not, then we continue, and then we compare basically the group of three sentences followed
00:07:21.080 | by the group of four sentences.
00:07:22.680 | We look at those two, see if there's a sudden similarity change.
00:07:25.580 | If there is, then that is where we make our split.
00:07:28.940 | So what you're doing here is you're cumulatively adding text, creating your embeddings, and
00:07:35.760 | then continuing, cumulatively adding more text, continuing.
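(A sketch of the cumulative chunker; same assumed call signature as above. Expect this one to be noticeably slower and to make far more embedding calls.)

```python
from semantic_chunkers import CumulativeChunker

# At each step this compares the embedding of sentences 1..n against the
# embedding of sentences 1..n+1, splitting on a sudden similarity drop.
chunker = CumulativeChunker(encoder=encoder, score_threshold=0.3)
chunks = chunker(docs=[content])
chunker.print(chunks[0][:3])
```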
00:07:39.280 | And the result of that is that this takes a lot longer to run.
00:07:44.560 | It's also a lot more expensive.
00:07:45.560 | You're creating far more embeddings.
00:07:48.200 | But you'll find that, compared to the consecutive chunker at least, it is a little more noise-resistant.
00:07:56.820 | So it requires a bit more of a change over time to actually trigger a split for this
00:08:03.280 | chunker.
00:08:04.280 | So the results tend to be a bit better than the consecutive chunker, but I would say either on par with or maybe even a
00:08:07.880 | little worse than the statistical chunker in a lot of cases.
00:08:11.640 | But it's worth trying just to see what gets the best performance for you for your particular
00:08:17.440 | use case.
00:08:18.440 | Okay.
00:08:19.440 | So we can see that, yes, this chunker definitely took a bit longer than the others.
00:08:25.240 | And let's have a look at what chunks we got.
00:08:27.700 | So coming back up, again, I probably should have changed the threshold here, but let's start
00:08:34.120 | with these.
00:08:35.420 | So yeah, we see generally, I think, probably worse performance than we got with the statistical
00:08:43.680 | chunker.
00:08:44.680 | But especially if I modify the threshold here and tweak that a little bit, generally you
00:08:49.100 | can get better performance than with the consecutive chunker.
00:08:52.140 | So yeah, those are the three chunking methods that we currently have in the semantic chunkers
00:08:58.380 | library.
00:08:59.380 | I will also add that there are differences between these chunkers in which modalities
00:09:05.140 | they can handle.
00:09:06.400 | So the statistical chunker, for now, is only able to handle the text modality.
00:09:14.160 | So great for RAG, not so great if you're wanting to parse video, for example.
00:09:20.260 | Whereas the consecutive chunker is really good at parsing video.
00:09:26.300 | And we have an example on that and I will walk through that in the near future, but
00:09:32.060 | it's something to consider.
00:09:33.600 | So the consecutive chunker can basically handle any modality.
00:09:37.260 | Things like the statistical chunker cannot, and then the cumulative chunker is more
00:09:42.620 | text-focused as well.
00:09:43.900 | So for now, that is it on semantic chunkers.
00:09:47.980 | I hope this has been useful and interesting.
00:09:51.620 | So thank you very much for watching and I will see you again in the next one.