Semantic Chunking for RAG
Chapters
0:00 Semantic Chunking for RAG
0:45 What is Semantic Chunking
3:31 Semantic Chunking in Python
12:17 Adding Context to Chunks
13:41 Providing LLMs with More Context
18:11 Indexing our Chunks
20:27 Creating Chunks for the LLM
27:18 Querying for Chunks
Today we're going to talk about semantic chunking and a few other methods that we can include within our chunking strategy to improve our RAG performance. But first let's begin with semantic chunking and just understand what I mean when I say semantic chunking.

So when we chunk our documents semantically, what we are actually doing is taking how we embed our documents for a RAG pipeline, how we store them and how we retrieve them, and applying that same mechanism to our chunking strategy. So rather than saying I'm going to chunk at 400 tokens, or I'm going to chunk based on whether there are some new lines, and so on, instead we are building our chunks based on semantic meaning.
Now, the reason that we do this: if we imagine we have some document over here with a load of text, we could just naively chunk it using the token count or using different delimiters, or we could consider the fact that we're trying to embed each one of these chunks into a single vector. So these are our ideal end states: we have these single vector embeddings. They are single vector embeddings, but most chunking strategies out there are very prone to producing chunks that contain multiple meanings.

And the optimal way of reducing the likelihood of that is to go for smaller and smaller chunk sizes. We should still go with smaller chunk sizes, but using semantic chunking we can find the optimum chunk size that allows us to capture the meaning here. What we do is find the chunk size that produces essentially a single meaning, so we have some meaning here and it is concise.
And as I mentioned, the reason we do this is that if we, for example, take a slightly larger chunk and bring it over here, we have multiple meanings; let's say there are three potential meanings of what this text could be about. Consider, for example, a newspaper article: you take different chunks of that newspaper article and each chunk has a different meaning. It's talking about something slightly different, even if the topic is generally the same. What we typically do with a naive chunking strategy is take all of those meanings and try to compress them into a single vector embedding. So it means that, okay, we're kind of capturing the meaning of all of our potential meanings here, but we're not capturing the exact meaning of any of them. We're just capturing a sort of diluted version of what all of these things mean together.

So by semantically chunking, we can actually calculate the conciseness of our chunks, identify the optimum length for keeping a chunk concise, and then, using that, create our single vector embeddings. So let's go ahead and see how we actually apply this in Python.
So in this notebook we're going to go through the full pipeline. I mentioned there are some other things we're going to do to our chunks as well. We're going to embed all of that, and then we'll see how we can begin retrieving some of those chunks. There will be a link to this notebook in the description of the video and also in the comments below.

The first thing we're going to do is just install the libraries that we need. So we have semantic-router, which includes our chunking mechanism; we have Pinecone, which is where we're going to be storing and retrieving our embeddings from; and we have Hugging Face Datasets, which is where we're going to be pulling the starting dataset from.
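A minimal install cell might look something like this; the exact package names and version pins are my assumption rather than taken from the notebook:

```python
# Install the three libraries used in this walkthrough.
# Pin versions as needed for reproducibility.
!pip install -qU semantic-router pinecone-client datasets
```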
So this dataset, if you watch these videos often, you will have seen it before. It's a set of AI ArXiv papers, so things like the Llama papers, AHE embeddings, and many other things as well.
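As a rough sketch, pulling a dataset like that from Hugging Face would look as follows; the dataset name and field names here are placeholders, not necessarily those used in the video:

```python
from datasets import load_dataset

# Placeholder dataset name: substitute whichever ArXiv-papers dataset you are using.
data = load_dataset("your-username/ai-arxiv-papers", split="train")
print(data)
print(data[0].keys())  # expecting fields along the lines of id, title, content
```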
So we're going to come down here and we're going to initialize our encoder, so the embedding model. We're going to be using OpenAI for this one, and we're going to use the text-embedding-3-small model. We're going to be using this both to create the embeddings for our chunks once we have created them and are putting them into Pinecone, and also for the chunking mechanism itself. And ideally, that's what you should be doing as well. You should be aligning both of those models, because you are optimizing your chunks for a particular embedding model, so it would not make sense to use a different embedding model from the one you're indexing and retrieving with.
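In semantic-router that setup looks roughly like this; it is a sketch, and the constructor arguments may differ between versions:

```python
import os
from semantic_router.encoders import OpenAIEncoder

# The same encoder is used for semantic chunking and for the final chunk embeddings.
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or "YOUR_API_KEY"
encoder = OpenAIEncoder(name="text-embedding-3-small")
```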
And we're going to be using this RollingWindowSplitter. There are a few parameters we can adjust here to see a little more of what is going on. Specifically, plot_splits and enable_statistics are going to show us a lot more information about what is actually happening when we're producing our chunks. The other parameters that are probably important to note here are the minimum and maximum split tokens. So I'm basically saying I don't want anything lower than 100 tokens in my splits, and for the max I'm saying I don't want anything higher than 500 tokens. So we're kind of setting the bounds of where we want this chunking mechanism to operate.
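A sketch of that initialization, assuming the RollingWindowSplitter from semantic-router; parameter names may vary slightly by version, and the window size shown is just an example value:

```python
from semantic_router.splitters import RollingWindowSplitter

splitter = RollingWindowSplitter(
    encoder=encoder,          # the encoder initialized above
    min_split_tokens=100,     # no chunk smaller than 100 tokens
    max_split_tokens=500,     # no chunk larger than 500 tokens
    window_size=2,            # sentences per rolling window (example value)
    plot_splits=True,         # visualize where splits are made
    enable_statistics=True,   # print split statistics after each run
)
```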
Then we'll come down here and perform our first split on the first document in the dataset. If we come up here, we'll see we have these charts that show what has been happening. As our rolling window of sentences goes through our paper or document, the splitter is calculating the similarity between windows, and then it's identifying the optimum threshold for the specific model we're using. We're using the small text-embedding-3 model from OpenAI, so the threshold is pretty small; you can see up here in the top right, 0.2. Those similarities between our windows are what you can see with the blue line, and you can see that a split has been made wherever there is a red dotted line.

One thing that's kind of interesting here is what happens once it gets to the end of the document. I haven't checked, but I'm pretty sure that this area here is actually the references. You can see that it's splitting many times between references, because they don't really have that much similarity between them.

We can also come down here and see the chunk sizes. They're all within our 100 to 500 range, with the exception of the last one; I assume there were not enough tokens left to fill that chunk. But we can see the structure there. And then we also have the splitting statistics at the bottom here, if you want to check through those.
When looking at this, the thing that you should pay most attention to is your splits by threshold versus your splits by max chunk size. The more splits by threshold you have compared to splits by max chunk size, the better your semantic chunks basically are. The easiest way that I could reduce the max-chunk-size splits is actually just to increase my max split tokens. But the other thing you can do is set the threshold yourself. So we can take a look at the RollingWindowSplitter class here and see that we have this dynamic_threshold value, and we can set that equal to False. If we go back into our code and set dynamic_threshold to False, then we can also set the default threshold value. This is the similarity threshold that defines a split, and you can modify it by going to the encoder's score_threshold and changing it. We can just increase the threshold number; before we had roughly 0.22, so I'm going to go to 0.25.
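A sketch of that change, again assuming the semantic-router API described above; the attribute names may differ between versions:

```python
# Fix the similarity threshold manually instead of letting the splitter find one.
encoder.score_threshold = 0.25  # was roughly 0.22 with the dynamic threshold

splitter = RollingWindowSplitter(
    encoder=encoder,
    dynamic_threshold=False,   # use the encoder's score_threshold instead
    min_split_tokens=100,
    max_split_tokens=500,
    window_size=2,
    plot_splits=True,
    enable_statistics=True,
)
```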
Okay, and you can see that our splits by threshold have increased and the splits by max chunk size have decreased. We can keep going with that if we want to; I think with the new embed 3 models, 0.3 is usually a pretty good value. So yeah, this is starting to look a bit better. I would say probably too many chunks there, actually, so I'm going to take that back to what it was before.

The other thing you can do is increase or decrease your window size. So if I go with 5, which I think is the default value, and try again, you can also see the difference, and you can see here that we get a better ratio. But what you tend to find when you increase the window size is that the similarity gets smoothed out, and I don't necessarily want to always do that; I think it depends on how high-level you're trying to be with your embeddings. So I'm going to just reduce the window size again and go back to what we had at the start. I'm going to re-initialize our encoder and rerun this, this, and this.
Okay, if we come down here, we can see we have these chunks, and they look pretty good. These and these as well, which maybe even seem quite small, but we can see the statistics. Really, though, what we want to be doing is looking at the chunks themselves. So you can see here the first one is the authors; let's just see, yeah, it's just the authors, and then it cuts at the abstract. The next chunk here is actually our abstract. We go through and, if we come to the end here, we see that's basically the full abstract. Then we go on to the next one, and it's going into the introduction. So it has the paper details, the title and the authors, it has the abstract, and then the introduction, and it has broken those apart.

Now we can go on to just having a look at what those split objects are. Each one includes all the information we have there: what score triggered the split, the token count, and so on. Then, if we come down here, for each one of those document split objects we access the text itself by going into the content attribute, and that's how we get our chunk text.
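As a small sketch of what that looks like; the content attribute is mentioned above, while the rest of the shape of the split objects and the dataset fields is assumed:

```python
# Run the splitter over the first document and pull out the raw chunk text.
splits = splitter([data[0]["content"]])

print(splits[0])        # a full split object: text, triggering score, token count, etc.
chunks = [s.content for s in splits]
print(chunks[0][:200])  # first 200 characters of the first chunk
```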
Now, I mentioned before that we're not going to focus just on semantic chunking; it's the majority of what we're talking about, but not the only thing. I also want to show you some other little things we can do in order to improve the quality of our chunks, both in our embeddings and also for the LLM.

So what I'm doing here is taking the title of the paper, and also taking the chunk, which is going to go into the content here, and then we're actually going to use that combination to create our embeddings. We feed it into our embedding model because it gives the embedding model more context as to what we're actually talking about when we include things like the title of a paper or the chapter of a book, these kind of hierarchical components. So we can create a few of those and see what we have. Here we're looking at the paper on unveiling emotions in generative AI. That can be a really useful bit of context when we're looking at paragraphs or chunks that don't mention the overall aim or the overall context of the paper itself. We're going to use this function later when we're creating our embeddings.
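A sketch of that helper based on the description above; the function name and fields are mine, not necessarily those used in the notebook:

```python
def build_chunk(title: str, content: str) -> str:
    """Prefix each chunk with the paper title to give the embedding model more context."""
    return f"# {title}\n\n{content.strip()}"

# Example usage on the chunks produced by the splitter.
title = data[0]["title"]
chunks_to_embed = [build_chunk(title, s.content) for s in splits]
print(chunks_to_embed[0][:300])
```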
The other thing that I want to do, as I mentioned, is for the LLM: I actually want to pull in some context from the surrounding chunks. The way I'm going to do that is to number all of my chunks, add that ID into my metadata, and then just add a reference to the neighbouring chunks. I'm doing that with this pre-chunk and post-chunk field within the metadata here. So if I run this, I'll just show you what I'm doing; let me show you one of them.

We have the content, which is our semantic chunk, we have the pre-chunk, and then here you can see our post-chunk. There's no pre-chunk for this one because it is literally the first chunk. If we go to number one, we will see that this content here becomes the pre-chunk value, and you can see up there we have the post-chunk, so what's coming after, and the current chunk.
Now, that is quite a lot of text, and it's not so efficient when you are storing it; it's just a lot of extra information. Or you can just add the IDs instead, right? I think, especially when you're looking at storing a lot of data, this is probably what you want to do. When it's less data, you can probably just put in the text itself.

So let me show you what we're doing here. The ID for each chunk that we're creating is actually the arXiv ID plus the chunk number, and of course it's pretty easy to set the pre-chunk ID and post-chunk ID because we just need to go i minus one or i plus one. Then we build our metadata, and then let's have a look at a few of these. We're just looking at the first three records from the previous document. We have the title of the document and the authors. The pre-chunk ID is nothing, of course, the post-chunk ID is this, and the current chunk is this one here. Then we look at the next one: current chunk one, post-chunk two, pre-chunk zero. So it didn't actually give us a post-chunk ID there. That's the sort of structure that we're setting up for this dataset.
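Putting that structure together might look roughly like this; the ID scheme and field names follow the description above, but the exact code is an assumption:

```python
# Assumed arXiv-style ID for the document; in practice this comes from the dataset record.
arxiv_id = data[0]["id"]
title = data[0]["title"]

metadata = []
for i, chunk in enumerate(chunks):
    metadata.append({
        "id": f"{arxiv_id}#{i}",
        "title": title,
        "content": chunk,
        # References to the neighbouring chunks, used later to give the LLM extra context.
        "prechunk_id": "" if i == 0 else f"{arxiv_id}#{i-1}",
        "postchunk_id": "" if i == len(chunks) - 1 else f"{arxiv_id}#{i+1}",
    })

metadata[:3]
```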
Let's go on to actually indexing all of this. The first thing we need to do is set up our vector index. Then you want to come down here and we're going to set up our serverless spec. I'm actually using a paid region here, so you can switch this to us-east-1, which is the region available on the free serverless plan. I have mine in us-west-2, so I'm going to leave that, but you probably want to switch it.

Before we create our index, we actually need to find the dimensionality of our encoder. To do that, we can just encode something, like create an embedding, and then read the dimensionality from it. Otherwise you can just look online and you will be able to find the embedding dimension for the model. I'm going to call this index better-rag-chunking, I'm going to be using the dot product metric here, so that's our similarity metric, and here we're actually using embed 3 from OpenAI. And then we pass in our serverless spec as well.
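A sketch of that Pinecone setup with the serverless client; the index name, metric, and free-tier region follow the video, and the rest is standard Pinecone usage:

```python
import time
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
spec = ServerlessSpec(cloud="aws", region="us-east-1")  # free-tier serverless region

# Find the encoder's dimensionality by embedding a dummy string.
dims = len(encoder(["some random text"])[0])

index_name = "better-rag-chunking"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=dims,
        metric="dotproduct",
        spec=spec,
    )
    # Wait for the index to be ready before using it.
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)

index = pc.Index(index_name)
index.describe_index_stats()
```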
We run that, and I actually already have my index, so you can see that I have this vector count here; it's actually pretty high for the full dataset. You can just limit how many vectors you're putting into your index. The splitting, or chunking, that we do here can be expensive, because we're creating many, many embeddings to produce our semantic chunks. Down the line that results in better performance, or can result in better performance, but for this example you may not want to run it over everything. So if you would like to make this quicker, I would recommend just downloading the already-processed data instead; you can also see all the chunks and everything there if you want to check that out as well.

If you want to limit the number of records that you're going to be processing and storing, you can do that here. So if I modify this, you can, for example, just take the first 10,000 records like that, and that will limit how much you're pulling through. Otherwise, if you're using this dataset, what you can also do is convert it to pandas and use iloc, I think like this, to take the first 10,000 records instead.
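For example, either of these would work; this is a sketch assuming the dataset object from earlier:

```python
# Option 1: slice the Hugging Face dataset directly.
small_data = data.select(range(10_000))

# Option 2: convert to pandas and take the first 10,000 rows.
df = data.to_pandas().iloc[:10_000]
```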
Looking at the rest of what we have here: we set the splitter statistics and plot splits to False, because we don't want to be visualizing them on every embedding run; when you're actually creating your embeddings, you don't need to be looking at these. We then build our metadata using that function I showed you before. And the other thing we're doing here is building our chunks for the embedding, so we're using the title as a kind of prefix, then the actual chunk itself, and embedding all of that together, which we do here. After that, we just add everything to Pinecone, which I've already done.
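A rough sketch of that indexing loop, under the same assumptions as the earlier snippets; the batch size and dataset field names are mine:

```python
batch_size = 128

for doc in data:
    splits = splitter([doc["content"]])
    chunks = [s.content for s in splits]
    records = []
    for i, chunk in enumerate(chunks):
        records.append({
            "id": f"{doc['id']}#{i}",
            "text": build_chunk(doc["title"], chunk),  # title-prefixed text to embed
            "metadata": {
                "title": doc["title"],
                "content": chunk,
                "prechunk_id": "" if i == 0 else f"{doc['id']}#{i-1}",
                "postchunk_id": "" if i == len(chunks) - 1 else f"{doc['id']}#{i+1}",
            },
        })
    # Embed and upsert in batches.
    for j in range(0, len(records), batch_size):
        batch = records[j:j + batch_size]
        embeds = encoder([r["text"] for r in batch])
        index.upsert(vectors=[
            (r["id"], emb, r["metadata"]) for r, emb in zip(batch, embeds)
        ])
```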
So I'm going to stop this now and just continue. Now that we have our chunks stored in Pinecone, we can go ahead and actually begin querying them. To do that, we'll set up a little query function. The input to this will be your text, your input query, and the first thing we need to do is create our embedding. So I'm going to create what I call the query vector, or query embedding: I'm going to put my text into a list here and extract that out. Then I'm going to pass this over to our Pinecone index, so I'm going to do index.query to get the context out of that. We pass the vector and top_k, and let's make that a little easier to read: top_k is how many contexts we want to return, and I'm going to say five. Oh, and include_metadata, I want to set that to True, and that will allow us to return the metadata alongside the records. For now I'm just going to return the context, and we'll see what we have in there, and then refine it from there. Let's run that and make sure it actually works.
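As a first pass, that function looks something like this; a sketch, with the metadata fields matching the ones set up earlier:

```python
def query(text: str, top_k: int = 5):
    # Embed the query text with the same encoder used for the chunks.
    xq = encoder([text])[0]
    # Retrieve the top_k most similar chunks, including their metadata.
    return index.query(vector=xq, top_k=top_k, include_metadata=True)

query("what are EHI embeddings?")
```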
So for the query text I'm going to say: what are EHI embeddings? Okay, so it worked, that's good, but it's not in a very good format for what we need. So let's take a look: if we come up to the top here, we have matches, and we want to iterate through each one of those matches. So for match in matches, or let's call it m, I want to get the metadata, and actually we can just go straight into the content. The other things that we might want are the post-chunk ID and pre-chunk ID, if we'd like to pull those through, and also the title as well.

Then what we want to do is format this data, or do something with it. One thing that I would like to do, using the pre and post IDs that we have here, is fetch those neighbouring chunks as well. So I'm going to do index.fetch, and you see that it takes these IDs, so I'm going to pass in IDs equals the pre and post chunk IDs, and let's see what we get. Okay, so at the top here we have the vectors that are returned, so I'm going to get those, and then we need to iterate through each one of these, or we can just get them directly by ID. Maybe that's easier, so I'm going to say these are our other chunks, and then I'm going to get the pre-chunk: pre_chunk equals the other chunks' vectors at the pre ID, which is going to give us this, then we want the metadata, then we want the content. Okay, so that should give us everything that we need, I think, if I do it right. Ah, okay, so we have the ID, followed by metadata, followed by content.
Now what we're going to do is pull these together and include it in this query function here. From there, we get the content, title, and pre and post IDs, then we fetch the other chunks, and then we pull this together in a way that's going to make sense for our LLM. That is going to depend on, you know, your particular use case, so what I'm going to do is something pretty generic, to be honest.

I'm going to say I want some of my pre-chunk and some of my post-chunk around my current chunk, and I also want the title at the top, maybe; I don't know whether that's super necessary, but it's fine. So we'll do chunk equals, and we'll put the title first, then a little bit of space, then the pre-chunk, but I just want the end of the pre-chunk, not everything. I'm just going to slice by characters here; you don't necessarily want to do that, but it's fine, so I'm going to go with, I don't know, the last 400 characters, then a new line, then the chunk itself. You don't need the new lines necessarily, but I'm adding them here just for the sake of readability. Then you want your post-chunk, and I'll do the same there, so we'll take the first 400 characters.

Then I'm going to return the chunks. Of course, we're returning a fair bit here, so you do want to be careful, depending on the LLM that you're using, that you're not going over its context window. Even here, I think it would maybe make more sense to use a smaller top_k value; three is probably fine. Another thing you can do, of course, is use reranking: you can retrieve more records, rerank them to get a more ideal order of those results, and then just take the top few. That's usually the sort of numbers that I would go with. Ah, and the chunk is, sorry, it should be content here, and also we should be doing chunks.append.
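Pulling all of that together, the finished function might look roughly like this; it is a sketch under the same assumptions as before, with the 400-character slices and default top_k of five from the video:

```python
def query(text: str, top_k: int = 5) -> list[str]:
    xq = encoder([text])[0]
    matches = index.query(vector=xq, top_k=top_k, include_metadata=True)["matches"]
    chunks = []
    for m in matches:
        content = m["metadata"]["content"]
        title = m["metadata"]["title"]
        pre_id = m["metadata"]["prechunk_id"]
        post_id = m["metadata"]["postchunk_id"]
        # Fetch the neighbouring chunks so the LLM gets some surrounding context.
        neighbour_ids = [i for i in (pre_id, post_id) if i]  # skip empty IDs at document edges
        other = index.fetch(ids=neighbour_ids)["vectors"] if neighbour_ids else {}
        pre_chunk = other[pre_id]["metadata"]["content"] if pre_id in other else ""
        post_chunk = other[post_id]["metadata"]["content"] if post_id in other else ""
        # Title at the top, end of the pre-chunk, the chunk itself, start of the post-chunk.
        chunks.append(
            f"# {title}\n\n"
            f"{pre_chunk[-400:]}\n"
            f"{content}\n"
            f"{post_chunk[:400]}"
        )
    return chunks

query("what are large language models?")
```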
Okay, and all of this should come back with something, so we can try the query I had before, just: what are EHI embeddings? I don't know if this is in here or not. So EHI, I may not have this in the dataset, so let me try something that I know is going to be in there. We can go with: what are large language models? That should be an easy one. So we have section three here, and it says large language models are computational models that have the capability to understand and generate new text based on a given input, it mentions n-gram models, and so on.

Yeah, so that's just a quick introduction to this idea of semantic chunking, rather than the typical chunking methods that we might see elsewhere, and also this idea of adding some extra context to your chunks, like where we added the title to our chunks, and how we actually go through a full pipeline: taking our dataset, chunking it, adding extra context where needed, then creating our embeddings, and finally, at the end there, searching through it. So yeah, that is it for this walkthrough. I hope this has been useful and interesting. Thank you very much for watching, and I will see you again in the next one. Bye.