
Semantic Chunking for RAG


Chapters

0:00 Semantic Chunking for RAG
0:45 What is Semantic Chunking
3:31 Semantic Chunking in Python
12:17 Adding Context to Chunks
13:41 Providing LLMs with More Context
18:11 Indexing our Chunks
20:27 Creating Chunks for the LLM
27:18 Querying for Chunks

Transcript

Today we're going to talk about semantic chunking and a few other methods that we can include within our chunking strategy to improve our RAG performance. But first, let's begin with semantic chunking and understand what I mean when I say semantic chunking. When we chunk our documents semantically, we take the same mechanism that we use to embed, store, and retrieve our documents in a RAG pipeline and apply it to our chunking strategy.

So rather than saying I'm going to chunk every 400 tokens, or I'm going to chunk wherever there are new lines, and so on, we instead build our chunks to optimize how concisely each one can be embedded. The reason we do this is that if we imagine we have some document over here with a load of text, we could just naively chunk it using a token count or using different delimiters, or we could consider the fact that we're trying to embed each one of these chunks into a single vector embedding.

These single vector embeddings are our ideal end state, but most chunking strategies out there are very prone to packing multiple meanings into a single chunk. The simplest way of reducing the likelihood of that is to go for smaller and smaller chunk sizes, which we should still do.

We should still go with smaller chunk sizes, but using semantic chunking, we can find the optimum chunk size to allow us to take some meaning here. So let's say here to here. And what we do is find that chunk size that produces essentially a single meaning. So we have some meaning here and it is concise.

And as I mentioned, the reason we do this is that if we take a slightly larger chunk and bring it over here, we can end up with multiple meanings, let's say three potential meanings of what this text could be about. Consider a newspaper article, for example: you take different chunks of that newspaper article and each chunk has a different meaning.

It's talking about something slightly different, even if the topic is generally the same. What we typically do with a naive chunking strategy is take all of those meanings and try to compress them into a single vector embedding. So we end up capturing a bit of every potential meaning, but we're not capturing the exact meaning of any of them.

We're just capturing kind of like a diluted version of what all of these things mean together. So by semantically chunking, we can actually calculate the conciseness of our chunks, identify the optimum length for keeping a chunk concise, and then using that, we create our single embedding. And that is semantic chunking.

So let's go ahead and see how we actually apply this in Python. So in this notebook, we're going to go through the full pipeline. So we are going to get a dataset. We're going to chunk it. We're going to prepare it. I mentioned there's some other things we're going to do to our chunks as well.

We're going to embed all of that, and then we'll see how we can begin retrieving some of that information. So we're going to go through this notebook. There will be a link to this in the description of the video and also in the comments below. And the first thing we're going to do is just install the libraries that we need in order to run the notebook.
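For reference, an install cell along these lines should cover it (a sketch; the exact package names and the lack of version pins are assumptions rather than something copied from the notebook):

```python
# install the libraries used in this walkthrough (package names assumed):
# semantic-router for the splitter, pinecone-client for the vector index,
# and Hugging Face datasets for the source data
!pip install -qU semantic-router pinecone-client datasets
```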

So we have semantic-router, which includes our chunking mechanism. We have Pinecone, which is where we're going to be storing and retrieving our embeddings from. And we have Hugging Face datasets, which is where we're going to be pulling the starting dataset from. So this dataset, if you watch these videos often, you will have seen it before.

It's a dump of AI ArXiv papers, so things like the Llama papers, AHE embeddings, and many other things as well. So we're going to come down to here and initialize our encoder, the embedding model. We're going to be using OpenAI for this one, specifically the text-embedding-3-small model.
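Initializing that encoder looks roughly like this (a sketch assuming semantic-router's OpenAIEncoder wrapper and an OpenAI API key in the environment; it is not copied verbatim from the notebook):

```python
import os
from getpass import getpass

from semantic_router.encoders import OpenAIEncoder

# the same encoder is used both for semantic chunking and for the final chunk embeddings
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass("OpenAI API key: ")
encoder = OpenAIEncoder(name="text-embedding-3-small")
```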

We're going to be using this both to create the embeddings for our chunks once we've created them and are putting them into Pinecone, and also for the chunking mechanism itself. And ideally, that's what you should be doing as well. You should be aligning both of those models, because you are optimizing your chunks for a specific embedding model.

So it would not make sense to chunk with a different embedding model from the one you're using to embed those final chunks. Okay. And we're going to be using this RollingWindowSplitter. There are a few parameters we can adjust here to see a little bit more of what is going on.

Specifically, plot_splits and enable_statistics are going to show us a lot more information about what is actually happening when we're producing our chunks. The other parameter that is important to note here is min_split_tokens. With that I'm basically saying I don't want anything lower than 100 tokens in my splits.

And then with max_split_tokens, I'm saying I don't want anything higher than 500 tokens. So we're setting the bounds that we want this chunking mechanism to work within. I'm going to initialize that, and then we'll come down to here and perform our first split on the first document in our ArXiv dataset.
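That initialization and first split might look roughly like the following (a sketch assuming the semantic-router RollingWindowSplitter API and a `data` object loaded from the Hugging Face dataset with a `content` field per paper; both of those are assumptions):

```python
from semantic_router.splitters import RollingWindowSplitter

splitter = RollingWindowSplitter(
    encoder=encoder,          # chunk with the same model we embed with
    dynamic_threshold=True,   # let the splitter find a threshold for this encoder
    min_split_tokens=100,     # no split smaller than 100 tokens
    max_split_tokens=500,     # no split larger than 500 tokens
    plot_splits=True,         # visualize where the splits land
    enable_statistics=True,   # print the splitting statistics
)

# split the first document in the dataset ("content" field name is assumed)
splits = splitter([data[0]["content"]])
```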

So if we come up here, we'll see these charts that show what has been happening. As our rolling window of sentences moves through the paper, or document, it's calculating the similarity between adjacent windows, and then it identifies the optimum threshold for the specific model we're using.

We're using the text-embedding-3-small model from OpenAI, so the threshold is pretty small; you can see it up here in the top right, 0.2. The similarities between our windows are what you can see with the blue line as we go through, and a split has been made wherever you see the red dotted line.

And one thing that's kind of interesting that you can see here is once it gets to the end here, there are many more chunks. And I haven't checked, but I'm pretty sure that this area here is actually the references for the paper. So you can see that it's basically splitting many times between references because they don't really have that much similarity between them.

So yes, we can see that. We can come down to here as well and see the chunk sizes that have been produced. They're all within our 100 to 500 range, with the exception of the last one, which is the final chunk. I assume there were not enough tokens left to fill that chunk, but we can see the structure here.

And then we also have the splitting statistics at the bottom here if you want to check through those. When looking at this, the thing you should pay most attention to is the ratio of splits by threshold to splits by max chunk size. You want more splits by threshold: the more of those you have compared to splits forced by the max chunk size, the better your semantic chunks are going to be.

The easiest way I could reduce this is actually just to increase my max chunk size. But the other thing you can do is set a fixed threshold. If you want to do that, we can take a look at the RollingWindowSplitter class here and see that we have this dynamic_threshold value, and we can set that to False.

So if we go back into our code and set dynamic_threshold to False, we can then also set the default threshold value. This is the similarity threshold that defines a split, and you can modify it by going to the encoder's score_threshold attribute and changing that. We can just increase the threshold number.

Before we had roughly 0.22, so I'm going to go to 0.25 and see what we get. Okay, and you can see that our splits by threshold have increased and the splits by max chunk size have decreased. We can keep going with that if we want to. I think with the new text-embedding-3 models, 0.3 is usually a pretty good value.
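Concretely, that adjustment might look something like this sketch (assuming the encoder's score_threshold attribute and the dynamic_threshold argument described above):

```python
# fix the split threshold manually instead of letting the splitter infer it
encoder.score_threshold = 0.25  # similarity below this value triggers a split

splitter = RollingWindowSplitter(
    encoder=encoder,
    dynamic_threshold=False,  # use encoder.score_threshold as-is
    min_split_tokens=100,
    max_split_tokens=500,
    plot_splits=True,
    enable_statistics=True,
)
splits = splitter([data[0]["content"]])
```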

So yeah, this is starting to look a bit better. Maybe too many chunks; I would say probably too many chunks there, actually. But anyway, you can do that. I'm actually going to take that back to what it was before and set dynamic_threshold back to True. And the other thing you can do is increase or decrease your window size.

So if I go with 5, which I think is the default value, and try again, you can also see the effect that this has on those statistics. You can see here that we get a better ratio. But what you tend to find when you increase the window size is that the similarity gets averaged over more sentences.

And I don't necessarily want to always do that. I think it depends on how high level you're trying to be with your embeddings. In this case, I want to be pretty concise. So I'm going to actually just reduce the window size again, go back to what we had at the start.

So I'm going to remove that, re-initialize our encoder, and rerun this, this, and this. Okay, if we come down here, we can see we have these chunks and they look pretty good. These and these as well, which maybe even seem quite small, but we can see the statistics look decent.

But really what we want to be doing is looking at the chunks as well. So you can see here the first is like the authors and let's just see, yeah, it's just the authors and then it cuts at the abstract. So the next chunk here is actually our abstract.

We go through, and if we come to the end here, we see that's basically the full abstract, just the abstract. Then we go on to the next one, and it's going into the introduction. Okay, so the first chunk has the paper details, the title and the authors.

Then it has the abstract and then the introduction, and it's broken those apart. I think that looks pretty good. So now we can go on to having a look at what those objects are; we can see them here. We have the DocumentSplit object here.

This just includes all the information we have there: what score triggered the split, the token count, and so on. Then if we come down to here, for each one of those DocumentSplit objects we access the text itself by going into the content attribute, and that's how we get our splits, or chunks.
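In other words, something like this (a small sketch assuming the content and token_count fields just mentioned):

```python
# each DocumentSplit carries the chunk text plus some bookkeeping fields
for split in splits[:3]:
    print(split.token_count, split.content[:80])
```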

Now, I mentioned before that semantic chunking is the majority of what we're talking about, but not all of it. I also want to show you some other little things that we can do to improve the quality of our chunks, both for our embeddings and also for the LLM.

So the first one of those is this. So what I'm doing here is I'm actually taking the title of the paper and I'm also taking the chunk, which is going to go into the content here. And then I'm just merging those together. And then we're actually going to use that to create our embeddings.

We won't feed this chunk to our LLM. We just feed it into our embedding model, because including things like the title of a paper or the chapter of a book, these sorts of hierarchical components, gives the embedding model more context about what we're actually talking about.

So we can create a few of those and see what we have. We're looking at the paper on unveiling emotions in generative AI. That can be a really useful bit of context when we're looking at paragraphs or chunks that don't mention the overall aim or the overall context of where they're coming from.
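A small helper captures the idea (a sketch, not the exact notebook function; the "title" field name on the dataset record is an assumption):

```python
def build_chunk(title: str, content: str) -> str:
    """Prefix a chunk with its paper title so the embedding model sees
    the higher-level context the chunk came from."""
    return f"# {title}\n{content}"

# embed-time text for the splits of the first paper
chunks_for_embedding = [
    build_chunk(title=data[0]["title"], content=split.content) for split in splits
]
print(chunks_for_embedding[0][:200])
```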

So we do that, and we're going to use this function later when we're creating our embeddings. Okay. The other thing that I want to do, as I mentioned, is for the LLM: I actually want to pull in some context from the surrounding chunks. The way I'm going to do that is to number all of my chunks, add that ID into my metadata, and then add references to the neighbouring chunks.

So I'm doing that with these pre-chunk and post-chunk fields within the metadata here. If I run this, I'll just show you what I'm doing; let me show you one of them. So metadata, and we go to index zero. Okay.

So we have the title. We have the content, which is our semantic chunk. We have the pre-chunk and then here you can see our post-chunk. There's no pre-chunk for this one because it is literally the first chunk. If we go to number one, we will see that this content here will become the pre-chunk value.

Okay, so you can see up there we have the post-chunk, what's coming after, and the current chunk. Now, that's quite a lot of text, and it's not very efficient when you're storing everything in something like Pinecone. So there are two things you can do.

You can store the neighbouring chunks themselves, which is a lot of extra information, or you can just store their IDs. The ID approach can be a bit easier, and especially when you're storing a lot of data, it's probably what you want to do.

When it's less data, you can probably just put in the text itself; it's not really a big deal. So let me show you what we're doing here. The ID for each chunk that we're creating is actually the ArXiv ID plus the chunk number. And of course it's pretty easy to set the pre-chunk ID and post-chunk ID, because we just need to go i minus one or i plus one.

So we do that. Then we build our metadata and then let's have a look at a few of these. So we're just looking at the first three records from that, from the previous document. So we have the title of the document and the authors. The pre-chunk ID is nothing, of course, post-chunk is this, whereas the current chunk is actually zero.

Okay. Then we look at the next one: the current chunk is one, the post-chunk is two, and the pre-chunk is zero. Then again, the current chunk is two and the pre-chunk is one, but it's the final one within this set of three, so it didn't actually give us a post-chunk ID here. Okay, so that's the sort of structure we're setting up for this dataset.
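Put together, the metadata construction might look like this sketch (the "arxiv_id" and "title" field names and the exact metadata keys are assumptions, not the notebook's own names):

```python
def build_metadata(doc: dict, doc_splits: list) -> list[dict]:
    """Give every chunk an ID (ArXiv ID + chunk number) and reference its neighbours."""
    records = []
    for i, split in enumerate(doc_splits):
        records.append({
            "id": f"{doc['arxiv_id']}#{i}",
            "title": doc["title"],
            "content": split.content,
            # previous / next chunk IDs, empty string where none exists
            "prechunk_id": "" if i == 0 else f"{doc['arxiv_id']}#{i-1}",
            "postchunk_id": "" if i == len(doc_splits) - 1 else f"{doc['arxiv_id']}#{i+1}",
        })
    return records

metadata = build_metadata(doc=data[0], doc_splits=splits)
metadata[:3]
```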

Let's go on to actually indexing all of this. The first thing we need to do is set up our vector index; you go to app.pinecone.io to get your API key. Then you want to come down to here, and we're going to set up our serverless spec. I'm actually using a paid region here, so you can switch this to us-east-1, which is covered by the free serverless plan.

I have mine in us-west-2, so I'm going to leave that, but you probably want to use the free one. Okay. Before we create our index, we need to find the dimensionality of our encoder model. To do that, we can just create an embedding, and then we can see that the dimensionality is 1536.

Otherwise, you can just look online and you will be able to find the embedding dimension of your model. Then we're going to create an index. I'm going to call this one better-rag-chunking, and I'm going to be using the dot product metric here as our similarity metric.

We can use others; we can use cosine, for example. Here we're actually using text-embedding-3 from OpenAI. Then we pass in our serverless spec as well. We run that, and I actually already have my index. Oops, okay, I just messed that up: better-rag-chunking. I want to initialize that.
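The index setup then looks roughly like this (a sketch assuming the Pinecone serverless client with the API key read from the environment; the free-tier region is used here, and the helper names carry over from the earlier sketches):

```python
import os
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# free-tier serverless region; the video uses a paid region instead
spec = ServerlessSpec(cloud="aws", region="us-east-1")

# dimensionality of text-embedding-3-small, measured from a sample embedding
dims = len(encoder(["some text"])[0])  # 1536

index_name = "better-rag-chunking"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=dims,
        metric="dotproduct",  # cosine also works
        spec=spec,
    )
index = pc.Index(index_name)
index.describe_index_stats()
```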

And I already have my index initialized, so you can see that I have this vector count here. Actually, this is pretty high for the full dataset, so I'll show you later how you can just limit how many vectors you're putting into your index. The splitting or chunking that we do here can be expensive, because we're creating many, many embeddings to produce our semantic chunks. Down the line that can result in better performance, but for this example you may not want to spend that much.

So I would recommend if you would like to make this quicker, you can actually just download the pre-chunked data set from here. So this again, it's from Hugging Face. So if I come here, you can see that. You can also see all the chunks and everything if you want to check that out as well.

Actually it looks pretty interesting. So yes, you can do that. If you want to limit the number of records that you're going to be processing and storing in Pinecone, you can do that. So if I modify this, you can do, for example, just take the first 10,000 records like that.

And that will just limit how much you're pulling through there. Otherwise, if you're using this dataset, what you can also do is convert it to pandas and use .iloc, I think like this, to take the first 10,000 records instead. So let me try and run that.
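Either way of limiting the data works; for example (a sketch assuming a Hugging Face datasets.Dataset object called `data`; pick one of the two options):

```python
# option 1: slice the Hugging Face dataset directly
data = data.select(range(10_000))

# option 2: convert to pandas and slice with .iloc
df = data.to_pandas().iloc[:10_000]
```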

Yes, so that will work. Okay, looking at the rest of what we have here, we set the splitter's plot_splits and enable_statistics to False, because we don't want to be visualizing them on every run we're doing here. We just don't need to; again, those are more for investigation.

When you're actually creating your embeddings, you don't need to be looking at these. There's no need to. Then you come down here. We are creating our chunks. We then build our metadata using that function that I showed you before. And then the other thing that we're doing here is building our chunks for the embedding model.

So we're using the title as a kind of prefix, followed by the actual chunk itself, and embedding all of that together, which we do here. After that, we just add everything to Pinecone, which I've already done, so I'm going to stop this now and continue.
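The indexing loop described here might look roughly like this sketch (it reuses the helper names from the earlier sketches; the batching per document and the exact upsert format are assumptions):

```python
from tqdm.auto import tqdm

# turn the visualizations and statistics off for the bulk run
splitter.plot_splits = False
splitter.enable_statistics = False

for doc in tqdm(data):
    doc_splits = splitter([doc["content"]])
    records = build_metadata(doc=doc, doc_splits=doc_splits)
    # embed the title-prefixed chunks; store the plain fields as metadata
    embeds = encoder([
        build_chunk(title=doc["title"], content=r["content"]) for r in records
    ])
    index.upsert(vectors=[
        {
            "id": r["id"],
            "values": emb,
            "metadata": {k: v for k, v in r.items() if k != "id"},
        }
        for r, emb in zip(records, embeds)
    ])
```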

Now that we have our chunks stored in Pinecone, we can go ahead and actually begin querying against them. So to do that, what we need to do is we'll set up like a little function for query. So I'm going to say define query. The input to this will be your text, your input query.

And the first thing we need to do is create our embedding. So we're going to create what I call the query vector, or query embedding. I'm going to call the encoder, put my text into a list here, and extract that out. Then I'm going to pass this over to our Pinecone index.

So I'm going to do index.query, or rather we should get the context out of that, so context equals index.query. We pass in xq, which is our vector; you can use the vector parameter here to make that a little easier to read. Then top_k, which is how many contexts, how many chunks, we'd like to return.

I'm going to say five. Oh, and include_metadata: I want to set that to True, and that will allow us to return the other fields that we stored in there. For now I'm just going to return the context; we'll see what we have in there, and then we'll modify it accordingly.
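As a first pass, that function is just (a sketch):

```python
def query(text: str):
    # embed the query with the same encoder used for the chunks
    xq = encoder([text])[0]
    # pull back the top 5 chunks along with their metadata
    context = index.query(vector=xq, top_k=5, include_metadata=True)
    return context
```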

So run that and make sure it actually works. For the query text I'm going to say, what are EHI embeddings? Okay, so it worked, that's good, but it's not in a very good format for what we need. So I'm going to take a look: if we come up to the top here, we have matches.

So we want to go into our matches and then iterate through each one of those matches. Let me modify this: for each match, which I'll call m, I want to get the metadata.

So metadata equals m's metadata, and actually we can go straight into the content, so content equals metadata content. The other things that we might want are the pre-chunk and post-chunk IDs, if we'd like to pull those through, and also the title. So I can get the title, the pre ID, and the post ID.

Okay, we should be able to run that, yes. Then what we want to do is format this data, or do something with it. One thing that I would like to do, using the pre and post IDs that we have here, is fetch those chunks from our index.

So I'm going to do index.fetch, and you can see that it takes these IDs, so I'm going to pass those in: ids equals the pre and post IDs. Let's see what we get out of that. I just want to scroll to the top.

Okay, so we're at the top here. We have the vectors that come back, so I'm going to get those, vectors, and then we either need to iterate through each one of these, or we can just get the pre and post content directly, which is probably easier. So I'm going to call these our other chunks, and then get the pre content: pre_chunk equals other_chunks, vectors, the pre ID, which is going to give us this, then we want the metadata, then the content. Okay.

And then we'll do the same for the post chunk. Okay, that should give us everything we need, I think, if I do it right. I did something wrong. Ah, okay, so it's the ID, followed by metadata, followed by content; I don't know where I got vectors from there. So ID, there we go.

So that should work, cool. Now what we're going to do is pull these together and include it all in this query function here. So we have our matches; from there we get the content, title, and pre and post IDs; then we get the other chunks; and then we pull this together in a way that's going to make sense for our LLM. That is going to depend on the exact format you need.

So what I'm going to do is something pretty generic, to be honest. I'm going to say I want some of my pre-chunk and some of my post-chunk around my current chunk, and I also want the title at the top. Maybe that's not strictly necessary, but it's fine.

So we'll do chunk equals, and we'll go title first; we'll put the title, then a little bit of space, then I'm going to put in the pre-chunk, but I just want the end of the pre-chunk, not everything.

For the pre-chunk I'm just going to slice on characters here; you don't necessarily want to do that, but it's fine. I'll take the last 400 characters, then a new line, then the chunk itself. You don't need the new line necessarily, but I'm adding it here just for the sake of readability. Then you want your post-chunk, and we just want to take the start of that.

I'll do the same there, so we'll take the first 400 characters, okay, and that's everything I really want. Then I'm going to return the chunks. Of course, we're returning a fair bit here, so you do want to be careful, depending on the LLM that you're using, that you're not going over its context window. Even here, I think it might make more sense to use a smaller top_k value; three is probably fine.

Another thing you can do is use reranking, of course: you can retrieve more records, rerank them to get a more ideal ordering of those results, and then just take the top three. That's usually the sort of number I would go with. Oh, and the chunk, sorry, it should be content here, and we should also be doing chunks.append(chunk).
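Pulling all of that together, including the content and chunks.append fixes just mentioned, the finished function might look like this sketch (the 400-character windows follow the walkthrough; the metadata keys and response attribute names are assumptions):

```python
def query(text: str, top_k: int = 5) -> list[str]:
    xq = encoder([text])[0]
    res = index.query(vector=xq, top_k=top_k, include_metadata=True)
    chunks = []
    for m in res.matches:
        meta = m.metadata
        title, content = meta["title"], meta["content"]
        pre_id, post_id = meta["prechunk_id"], meta["postchunk_id"]
        # fetch the neighbouring chunks so the LLM sees some surrounding context
        ids = [i for i in (pre_id, post_id) if i]
        other = index.fetch(ids=ids).vectors if ids else {}
        prechunk = other[pre_id].metadata["content"] if pre_id in other else ""
        postchunk = other[post_id].metadata["content"] if post_id in other else ""
        # title, tail of the previous chunk, the chunk itself, head of the next chunk
        chunk = (
            f"# {title}\n\n"
            f"{prechunk[-400:]}\n"
            f"{content}\n"
            f"{postchunk[:400]}"
        )
        chunks.append(chunk)
    return chunks

chunks = query("what are large language models?", top_k=3)
```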

Okay, and all of this should come back with something, so we can try the query I had before: what are EHI embeddings? I don't know if this is in here or not; I may not have that in the dataset, so let me try something that I know is going to be in there. We can go with what are large language models, which should be an easy one.

Okay, and let's see what we get here. So we have section three, reuses history and work, so it wants to evaluate... large language models are computational models that have the capability to understand and generate human language, LLMs have the... where are we... new text based on a given input, n-gram models... no, no, no, okay, cool.

So it's explaining what LLMs are there. Yeah, so that's just a quick introduction to this idea of semantic chunking, rather than the typical chunking methods that we might see elsewhere, and also to this idea of adding some extra context to your chunks, like where we added the title. And we went through a full pipeline: taking our dataset, chunking it, adding extra context where needed, creating our embeddings, and then finally, at the end there, searching through it.

So yeah, that is it for this walkthrough, I hope this has been useful and interesting, but for now I will leave it there. So thank you very much for watching, and I will see you again in the next one, bye.