Semantic Chunking for RAG
Chapters
0:00 Semantic Chunking for RAG
0:45 What is Semantic Chunking
3:31 Semantic Chunking in Python
12:17 Adding Context to Chunks
13:41 Providing LLMs with More Context
18:11 Indexing our Chunks
20:27 Creating Chunks for the LLM
27:18 Querying for Chunks
Today we're going to talk about semantic chunking and a few other methods that we can include within our chunking strategy to improve our RAG performance. But first let's begin with semantic chunking and just understand what I mean when I say semantic chunking.

So when we chunk our documents semantically, what we are actually doing is taking how we embed our documents for a RAG pipeline, how we store them and how we retrieve them, and applying that same mechanism to our chunking strategy. So rather than saying I'm going to chunk at 400 tokens, or I'm going to chunk based on whether there are some new lines, and so on, instead we are building our chunks based on semantic meaning.
Now, the reason that we do this: if we imagine we have some document over here with a load of text, we could just naively chunk it using the token count or using different delimiters, or we could consider the fact that we're trying to embed each one of these chunks into a single vector. So these are our ideal end states: we have these single vector embeddings. They are single vector embeddings, but most chunking strategies out there are very prone to producing chunks that contain multiple meanings.

And the optimal way of reducing the likelihood of that is to go for smaller and smaller chunk sizes. We should still go with smaller chunk sizes, but using semantic chunking we can find the optimum chunk size that allows us to capture the meaning here. What we do is find the chunk size that produces essentially a single meaning, so we have some meaning here and it is concise.
And as I mentioned, the reason we do this is that if we, for example, take a slightly larger chunk and bring it over here, we have multiple meanings; let's say there are three potential meanings of what this text could be about. Consider, for example, a newspaper article: you take different chunks of that newspaper article and each chunk has a different meaning. It's talking about something slightly different, even if the topic is generally the same. What we typically do with a naive chunking strategy is take all of those meanings and try to compress them into a single vector embedding. So it means that, okay, we're kind of capturing the meaning of all of our potential meanings here, but we're not capturing the exact meaning of any of them. We're just capturing a sort of diluted version of what all of these things mean together.

So by semantically chunking, we can actually calculate the conciseness of our chunks, identify the optimum length for keeping a chunk concise, and then, using that, create our single vector embeddings. So let's go ahead and see how we actually apply this in Python.
So in this notebook we're going to go through the full pipeline. I mentioned there are some other things we're going to do to our chunks as well. We're going to embed all of that, and then we'll see how we can begin retrieving some of those chunks. There will be a link to this notebook in the description of the video and also in the comments below.

The first thing we're going to do is just install the libraries that we need. So we have semantic-router, which includes our chunking mechanism; we have Pinecone, which is where we're going to be storing and retrieving our embeddings from; and we have Hugging Face Datasets, which is where we're going to be pulling the starting dataset from.
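A minimal install cell might look something like this; the exact package names and version pins are my assumption rather than taken from the notebook:

```python
# Install the three libraries used in this walkthrough.
# Pin versions as needed for reproducibility.
!pip install -qU semantic-router pinecone-client datasets
```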
So this dataset, if you watch these videos often, you will have seen it before. It's a set of AI ArXiv papers, so things like the Llama papers, AHE embeddings, and many other things as well.
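As a rough sketch, pulling a dataset like that from Hugging Face would look as follows; the dataset name and field names here are placeholders, not necessarily those used in the video:

```python
from datasets import load_dataset

# Placeholder dataset name: substitute whichever ArXiv-papers dataset you are using.
data = load_dataset("your-username/ai-arxiv-papers", split="train")
print(data)
print(data[0].keys())  # expecting fields along the lines of id, title, content
```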
So we're going to come down here and we're going to initialize our encoder, so the embedding model. We're going to be using OpenAI for this one, and we're going to use the text-embedding-3-small model. We're going to be using this both to create the embeddings for our chunks once we have created them and are putting them into Pinecone, and also for the chunking mechanism itself. And ideally, that's what you should be doing as well. You should be aligning both of those models, because you are optimizing your chunks for a particular embedding model, so it would not make sense to use a different embedding model from the one you're indexing and retrieving with.
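In semantic-router that setup looks roughly like this; it is a sketch, and the constructor arguments may differ between versions:

```python
import os
from semantic_router.encoders import OpenAIEncoder

# The same encoder is used for semantic chunking and for the final chunk embeddings.
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or "YOUR_API_KEY"
encoder = OpenAIEncoder(name="text-embedding-3-small")
```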
And we're going to be using this RollingWindowSplitter. There are a few parameters we can adjust here to see a little more of what is going on. Specifically, plot_splits and enable_statistics are going to show us a lot more information about what is actually happening when we're producing our chunks. The other parameters that are probably important to note here are the minimum and maximum split tokens. So I'm basically saying I don't want anything lower than 100 tokens in my splits, and for the max I'm saying I don't want anything higher than 500 tokens. So we're kind of setting the bounds of where we want this chunking mechanism to operate.
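A sketch of that initialization, assuming the RollingWindowSplitter from semantic-router; parameter names may vary slightly by version, and the window size shown is just an example value:

```python
from semantic_router.splitters import RollingWindowSplitter

splitter = RollingWindowSplitter(
    encoder=encoder,          # the encoder initialized above
    min_split_tokens=100,     # no chunk smaller than 100 tokens
    max_split_tokens=500,     # no chunk larger than 500 tokens
    window_size=2,            # sentences per rolling window (example value)
    plot_splits=True,         # visualize where splits are made
    enable_statistics=True,   # print split statistics after each run
)
```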
Then we'll come down here and perform our first split on the first document in the dataset. If we come up here, we'll see we have these charts that show what has been happening. As our rolling window of sentences goes through our paper or document, the splitter is calculating the similarity between windows, and then it's identifying the optimum threshold for the specific model we're using. We're using the small text-embedding-3 model from OpenAI, so the threshold is pretty small; you can see up here in the top right, 0.2. Those similarities between our windows are what you can see with the blue line, and you can see that a split has been made wherever there is a red dotted line.

One thing that's kind of interesting here is what happens once it gets to the end of the document. I haven't checked, but I'm pretty sure that this area here is actually the references. You can see that it's splitting many times between references, because they don't really have that much similarity between them.

We can also come down here and see the chunk sizes. They're all within our 100 to 500 range, with the exception of the last one; I assume there were not enough tokens left to fill that chunk. But we can see the structure there. And then we also have the splitting statistics at the bottom here, if you want to check through those.
When looking at this, the thing that you should pay most attention to is your splits by threshold versus your splits by max chunk size. The more splits by threshold you have compared to splits by max chunk size, the better your semantic chunks basically are. The easiest way that I could reduce the max-chunk-size splits is actually just to increase my max split tokens. But the other thing you can do is set the threshold yourself. So we can take a look at the RollingWindowSplitter class here and see that we have this dynamic_threshold value, and we can set that equal to False. If we go back into our code and set dynamic_threshold to False, then we can also set the default threshold value. This is the similarity threshold that defines a split, and you can modify it by going to the encoder's score_threshold and changing it. We can just increase the threshold number; before we had roughly 0.22, so I'm going to go to 0.25.
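A sketch of that change, again assuming the semantic-router API described above; the attribute names may differ between versions:

```python
# Fix the similarity threshold manually instead of letting the splitter find one.
encoder.score_threshold = 0.25  # was roughly 0.22 with the dynamic threshold

splitter = RollingWindowSplitter(
    encoder=encoder,
    dynamic_threshold=False,   # use the encoder's score_threshold instead
    min_split_tokens=100,
    max_split_tokens=500,
    window_size=2,
    plot_splits=True,
    enable_statistics=True,
)
```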
Okay, and you can see that our splits by threshold have increased and the splits by max chunk size have decreased. We can keep going with that if we want to; I think with the new embed 3 models, 0.3 is usually a pretty good value. So yeah, this is starting to look a bit better. I would say probably too many chunks there, actually, so I'm going to take that back to what it was before.

The other thing you can do is increase or decrease your window size. So if I go with 5, which I think is the default value, and try again, you can also see the difference, and you can see here that we get a better ratio. But what you tend to find when you increase the window size is that the similarity gets smoothed out, and I don't necessarily want to always do that; I think it depends on how high-level you're trying to be with your embeddings. So I'm going to just reduce the window size again and go back to what we had at the start. I'm going to re-initialize our encoder and rerun this, this, and this.
Okay, if we come down here, we can see we have these chunks, and they look pretty good. These and these as well, which maybe even seem quite small, but we can see the statistics. Really, though, what we want to be doing is looking at the chunks themselves. So you can see here the first one is the authors; let's just see, yeah, it's just the authors, and then it cuts at the abstract. The next chunk here is actually our abstract. We go through and, if we come to the end here, we see that's basically the full abstract. Then we go on to the next one, and it's going into the introduction. So it has the paper details, the title and the authors, it has the abstract, and then the introduction, and it has broken those apart.

Now we can go on to just having a look at what those split objects are. Each one includes all the information we have there: what score triggered the split, the token count, and so on. Then, if we come down here, for each one of those document split objects we access the text itself by going into the content attribute, and that's how we get our chunk text.
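As a small sketch of what that looks like; the content attribute is mentioned above, while the rest of the shape of the split objects and the dataset fields is assumed:

```python
# Run the splitter over the first document and pull out the raw chunk text.
splits = splitter([data[0]["content"]])

print(splits[0])        # a full split object: text, triggering score, token count, etc.
chunks = [s.content for s in splits]
print(chunks[0][:200])  # first 200 characters of the first chunk
```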
Now, I mentioned before that we're not going to focus just on semantic chunking; it's the majority of what we're talking about, but not the only thing. I also want to show you some other little things we can do in order to improve the quality of our chunks, both in our embeddings and also for the LLM.

So what I'm doing here is taking the title of the paper, and also taking the chunk, which is going to go into the content here, and then we're actually going to use that combination to create our embeddings. We feed it into our embedding model because it gives the embedding model more context as to what we're actually talking about when we include things like the title of a paper or the chapter of a book, these kind of hierarchical components. So we can create a few of those and see what we have. Here we're looking at the paper on unveiling emotions in generative AI. That can be a really useful bit of context when we're looking at paragraphs or chunks that don't mention the overall aim or the overall context of the paper itself. We're going to use this function later when we're creating our embeddings.
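A sketch of that helper based on the description above; the function name and fields are mine, not necessarily those used in the notebook:

```python
def build_chunk(title: str, content: str) -> str:
    """Prefix each chunk with the paper title to give the embedding model more context."""
    return f"# {title}\n\n{content.strip()}"

# Example usage on the chunks produced by the splitter.
title = data[0]["title"]
chunks_to_embed = [build_chunk(title, s.content) for s in splits]
print(chunks_to_embed[0][:300])
```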
The other thing that I want to do, as I mentioned, is for the LLM: I actually want to pull in some context from the surrounding chunks. The way I'm going to do that is to number all of my chunks, add that ID into my metadata, and then just add a reference to the neighbouring chunks. I'm doing that with this pre-chunk and post-chunk field within the metadata here. So if I run this, I'll just show you what I'm doing; let me show you one of them.

We have the content, which is our semantic chunk, we have the pre-chunk, and then here you can see our post-chunk. There's no pre-chunk for this one because it is literally the first chunk. If we go to number one, we will see that this content here becomes the pre-chunk value, and you can see up there we have the post-chunk, so what's coming after, and the current chunk.
Now, that is quite a lot of text, and it's not so efficient when you are storing it; it's just a lot of extra information. Or you can just add the IDs instead, right? I think, especially when you're looking at storing a lot of data, this is probably what you want to do. When it's less data, you can probably just put in the text itself.

So let me show you what we're doing here. The ID for each chunk that we're creating is actually the arXiv ID plus the chunk number, and of course it's pretty easy to set the pre-chunk ID and post-chunk ID because we just need to go i minus one or i plus one. Then we build our metadata, and then let's have a look at a few of these. We're just looking at the first three records from the previous document. We have the title of the document and the authors. The pre-chunk ID is nothing, of course, the post-chunk ID is this, and the current chunk is this one here. Then we look at the next one: current chunk one, post-chunk two, pre-chunk zero. So it didn't actually give us a post-chunk ID there. That's the sort of structure that we're setting up for this dataset.
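Putting that structure together might look roughly like this; the ID scheme and field names follow the description above, but the exact code is an assumption:

```python
# Assumed arXiv-style ID for the document; in practice this comes from the dataset record.
arxiv_id = data[0]["id"]
title = data[0]["title"]

metadata = []
for i, chunk in enumerate(chunks):
    metadata.append({
        "id": f"{arxiv_id}#{i}",
        "title": title,
        "content": chunk,
        # References to the neighbouring chunks, used later to give the LLM extra context.
        "prechunk_id": "" if i == 0 else f"{arxiv_id}#{i-1}",
        "postchunk_id": "" if i == len(chunks) - 1 else f"{arxiv_id}#{i+1}",
    })

metadata[:3]
```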
Let's go on to actually indexing all of this. The first thing we need to do is set up our vector index. Then you want to come down here and we're going to set up our serverless spec. I'm actually using a paid region here, so you can switch this to us-east-1, which is the region available on the free serverless plan. I have mine in us-west-2, so I'm going to leave that, but you probably want to switch it.

Before we create our index, we actually need to find the dimensionality of our encoder. To do that, we can just encode something, like create an embedding, and then read the dimensionality from it. Otherwise you can just look online and you will be able to find the embedding dimension for the model. I'm going to call this index better-rag-chunking, I'm going to be using the dot product metric here, so that's our similarity metric, and here we're actually using embed 3 from OpenAI. And then we pass in our serverless spec as well.
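A sketch of that Pinecone setup with the serverless client; the index name, metric, and free-tier region follow the video, and the rest is standard Pinecone usage:

```python
import time
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
spec = ServerlessSpec(cloud="aws", region="us-east-1")  # free-tier serverless region

# Find the encoder's dimensionality by embedding a dummy string.
dims = len(encoder(["some random text"])[0])

index_name = "better-rag-chunking"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=dims,
        metric="dotproduct",
        spec=spec,
    )
    # Wait for the index to be ready before using it.
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)

index = pc.Index(index_name)
index.describe_index_stats()
```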
We run that, and I actually already have my index, so you can see that I have this vector count here; it's actually pretty high for the full dataset. You can just limit how many vectors you're putting into your index. The splitting, or chunking, that we do here can be expensive, because we're creating many, many embeddings to produce our semantic chunks. Down the line that results in better performance, or can result in better performance, but for this example you may not want to run it over everything. So if you would like to make this quicker, I would recommend just downloading the already-processed data instead; you can also see all the chunks and everything there if you want to check that out as well.

If you want to limit the number of records that you're going to be processing and storing, you can do that here. So if I modify this, you can, for example, just take the first 10,000 records like that, and that will limit how much you're pulling through. Otherwise, if you're using this dataset, what you can also do is convert it to pandas and use iloc, I think like this, to take the first 10,000 records instead.
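For example, either of these would work; this is a sketch assuming the dataset object from earlier:

```python
# Option 1: slice the Hugging Face dataset directly.
small_data = data.select(range(10_000))

# Option 2: convert to pandas and take the first 10,000 rows.
df = data.to_pandas().iloc[:10_000]
```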
Looking at the rest of what we have here: we set the splitter statistics and plot splits to False, because we don't want to be visualizing them on every embedding run; when you're actually creating your embeddings, you don't need to be looking at these. We then build our metadata using that function I showed you before. And the other thing we're doing here is building our chunks for the embedding, so we're using the title as a kind of prefix, then the actual chunk itself, and embedding all of that together, which we do here. After that, we just add everything to Pinecone, which I've already done.
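A rough sketch of that indexing loop, under the same assumptions as the earlier snippets; the batch size and dataset field names are mine:

```python
batch_size = 128

for doc in data:
    splits = splitter([doc["content"]])
    chunks = [s.content for s in splits]
    records = []
    for i, chunk in enumerate(chunks):
        records.append({
            "id": f"{doc['id']}#{i}",
            "text": build_chunk(doc["title"], chunk),  # title-prefixed text to embed
            "metadata": {
                "title": doc["title"],
                "content": chunk,
                "prechunk_id": "" if i == 0 else f"{doc['id']}#{i-1}",
                "postchunk_id": "" if i == len(chunks) - 1 else f"{doc['id']}#{i+1}",
            },
        })
    # Embed and upsert in batches.
    for j in range(0, len(records), batch_size):
        batch = records[j:j + batch_size]
        embeds = encoder([r["text"] for r in batch])
        index.upsert(vectors=[
            (r["id"], emb, r["metadata"]) for r, emb in zip(batch, embeds)
        ])
```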
So I'm going to stop this now and just continue. Now that we have our chunks stored in Pinecone, we can go ahead and actually begin querying them. To do that, we'll set up a little query function. The input to this will be your text, your input query, and the first thing we need to do is create our embedding. So I'm going to create what I call the query vector, or query embedding: I'm going to put my text into a list here and extract that out. Then I'm going to pass this over to our Pinecone index, so I'm going to do index.query to get the context out of that. We pass the vector and top_k, and let's make that a little easier to read: top_k is how many contexts we want to return, and I'm going to say five. Oh, and include_metadata, I want to set that to True, and that will allow us to return the metadata alongside the records. For now I'm just going to return the context, and we'll see what we have in there, and then refine it from there. Let's run that and make sure it actually works.
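As a first pass, that function looks something like this; a sketch, with the metadata fields matching the ones set up earlier:

```python
def query(text: str, top_k: int = 5):
    # Embed the query text with the same encoder used for the chunks.
    xq = encoder([text])[0]
    # Retrieve the top_k most similar chunks, including their metadata.
    return index.query(vector=xq, top_k=top_k, include_metadata=True)

query("what are EHI embeddings?")
```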
So for the query text I'm going to say: what are EHI embeddings? Okay, so it worked, that's good, but it's not in a very good format for what we need. So let's take a look: if we come up to the top here, we have matches, and we want to iterate through each one of those matches. So for match in matches, or let's call it m, I want to get the metadata, and actually we can just go straight into the content. The other things that we might want are the post-chunk ID and pre-chunk ID, if we'd like to pull those through, and also the title as well.

Then what we want to do is format this data, or do something with it. One thing that I would like to do, using the pre and post IDs that we have here, is fetch those neighbouring chunks as well. So I'm going to do index.fetch, and you see that it takes these IDs, so I'm going to pass in IDs equals the pre and post chunk IDs, and let's see what we get. Okay, so at the top here we have the vectors that are returned, so I'm going to get those, and then we need to iterate through each one of these, or we can just get them directly by ID. Maybe that's easier, so I'm going to say these are our other chunks, and then I'm going to get the pre-chunk: pre_chunk equals the other chunks' vectors at the pre ID, which is going to give us this, then we want the metadata, then we want the content. Okay, so that should give us everything that we need, I think, if I do it right. Ah, okay, so we have the ID, followed by metadata, followed by content.
Now what we're going to do is pull these together and include it in this query function here. From there, we get the content, title, and pre and post IDs, then we fetch the other chunks, and then we pull this together in a way that's going to make sense for our LLM. That is going to depend on, you know, your particular use case, so what I'm going to do is something pretty generic, to be honest.

I'm going to say I want some of my pre-chunk and some of my post-chunk around my current chunk, and I also want the title at the top, maybe; I don't know whether that's super necessary, but it's fine. So we'll do chunk equals, and we'll put the title first, then a little bit of space, then the pre-chunk, but I just want the end of the pre-chunk, not everything. I'm just going to slice by characters here; you don't necessarily want to do that, but it's fine, so I'm going to go with, I don't know, the last 400 characters, then a new line, then the chunk itself. You don't need the new lines necessarily, but I'm adding them here just for the sake of readability. Then you want your post-chunk, and I'll do the same there, so we'll take the first 400 characters.

Then I'm going to return the chunks. Of course, we're returning a fair bit here, so you do want to be careful, depending on the LLM that you're using, that you're not going over its context window. Even here, I think it would maybe make more sense to use a smaller top_k value; three is probably fine. Another thing you can do, of course, is use reranking: you can retrieve more records, rerank them to get a more ideal order of those results, and then just take the top few. That's usually the sort of numbers that I would go with. Ah, and the chunk is, sorry, it should be content here, and also we should be doing chunks.append.
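Pulling all of that together, the finished function might look roughly like this; it is a sketch under the same assumptions as before, with the 400-character slices and default top_k of five from the video:

```python
def query(text: str, top_k: int = 5) -> list[str]:
    xq = encoder([text])[0]
    matches = index.query(vector=xq, top_k=top_k, include_metadata=True)["matches"]
    chunks = []
    for m in matches:
        content = m["metadata"]["content"]
        title = m["metadata"]["title"]
        pre_id = m["metadata"]["prechunk_id"]
        post_id = m["metadata"]["postchunk_id"]
        # Fetch the neighbouring chunks so the LLM gets some surrounding context.
        neighbour_ids = [i for i in (pre_id, post_id) if i]  # skip empty IDs at document edges
        other = index.fetch(ids=neighbour_ids)["vectors"] if neighbour_ids else {}
        pre_chunk = other[pre_id]["metadata"]["content"] if pre_id in other else ""
        post_chunk = other[post_id]["metadata"]["content"] if post_id in other else ""
        # Title at the top, end of the pre-chunk, the chunk itself, start of the post-chunk.
        chunks.append(
            f"# {title}\n\n"
            f"{pre_chunk[-400:]}\n"
            f"{content}\n"
            f"{post_chunk[:400]}"
        )
    return chunks

query("what are large language models?")
```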
Okay, and all of this should come back with something, so we can try the query I had before, just: what are EHI embeddings? I don't know if this is in here or not. So EHI, I may not have this in the dataset, so let me try something that I know is going to be in there. We can go with: what are large language models? That should be an easy one. So we have section three here, and it says large language models are computational models that have the capability to understand and generate new text based on a given input, it mentions n-gram models, and so on.

Yeah, so that's just a quick introduction to this idea of semantic chunking, rather than the typical chunking methods that we might see elsewhere, and also this idea of adding some extra context to your chunks, like where we added the title to our chunks, and how we actually go through a full pipeline: taking our dataset, chunking it, adding extra context where needed, then creating our embeddings, and finally, at the end there, searching through it. So yeah, that is it for this walkthrough. I hope this has been useful and interesting. Thank you very much for watching, and I will see you again in the next one. Bye.