
Semantic Chunking for RAG


Chapters

0:00 Semantic Chunking for RAG
0:45 What is Semantic Chunking
3:31 Semantic Chunking in Python
12:17 Adding Context to Chunks
13:41 Providing LLMs with More Context
18:11 Indexing our Chunks
20:27 Creating Chunks for the LLM
27:18 Querying for Chunks

Whisper Transcript

00:00:00.000 | Today we're going to talk about semantic chunking and a few other methods that we can include
00:00:05.880 | within our chunking strategy to improve our RAG performance.
00:00:10.720 | But first let's begin with semantic chunking and just understand what I mean when I say
00:00:17.640 | semantic chunking.
00:00:18.720 | So when we chunk our documents semantically, what we are actually doing is taking how we
00:00:25.940 | embed our documents for a RAG pipeline, how we store them and how we retrieve them and
00:00:32.020 | applying that same mechanism to our chunking strategy.
00:00:35.800 | So rather than saying I'm going to chunk with 400 tokens or I'm going to chunk based on
00:00:41.960 | whether there's some new lines, so on and so on, instead we are building our chunks
00:00:47.480 | to optimize our embedding conciseness.
00:00:51.720 | Now the reason that we do this is if we imagine we have some document over here, there's a
00:00:58.240 | load of text, we could just naively chunk this using the token count or using different
00:01:05.440 | delimiters, or we could consider the fact that we're trying to embed each one of these
00:01:11.520 | chunks into a single vector embedding.
00:01:15.760 | So these are our ideal end states: we have these single vector embeddings.
00:01:21.560 | But most chunking strategies out there are very prone
00:01:30.520 | to containing multiple meanings within our chunks.
00:01:34.000 | And the optimal way of reducing the likelihood of that is to go for smaller and smaller chunk
00:01:40.360 | sizes, which we should still do.
00:01:42.720 | We should still go with smaller chunk sizes, but using semantic chunking, we can find the
00:01:51.080 | optimum chunk size to allow us to take some meaning here.
00:01:57.760 | So let's say here to here.
00:02:00.360 | And what we do is find that chunk size that produces essentially a single meaning.
00:02:09.000 | So we have some meaning here and it is concise.
00:02:14.560 | And as I mentioned, the reason we do this is if we, for example, take a slightly larger
00:02:20.440 | chunk and we bring that over here and we have multiple meanings, let's say there's three
00:02:26.240 | potential meanings of what this text could be about.
00:02:29.720 | If you consider, for example, a newspaper article, you take different chunks of that
00:02:34.640 | newspaper article and each chunk has a different meaning.
00:02:38.520 | It's talking about something slightly different, even if the topic is generally the same.
00:02:43.080 | And what we typically do with our naive chunking strategy is we take all of those meanings
00:02:49.000 | and we try and compress them into a single vector embedding.
00:02:52.320 | So it means that, okay, we're kind of capturing the meaning of all of our potential meanings
00:02:58.760 | here, but we're not capturing the exact meaning of any of them.
00:03:05.060 | We're just capturing kind of like a diluted version of what all of these things mean together.
00:03:10.640 | So by semantically chunking, we can actually calculate the conciseness of our chunks, identify
00:03:17.260 | the optimum length for keeping a chunk concise, and then using that, we create our single
00:03:23.520 | embedding.
00:03:24.520 | And that is semantic chunking.
00:03:26.680 | So let's go ahead and see how we actually apply this in Python.
00:03:31.480 | So in this notebook, we're going to go through the full pipeline.
00:03:34.580 | So we are going to get a dataset.
00:03:38.580 | We're going to chunk it.
00:03:39.860 | We're going to prepare it.
00:03:40.980 | I mentioned there's some other things we're going to do to our chunks as well.
00:03:45.100 | We're going to embed all of that, and then we'll see how we can begin retrieving some
00:03:50.900 | of that information.
00:03:52.460 | So we're going to go through this notebook.
00:03:54.060 | There will be a link to this in the description of the video and also in the comments below.
00:04:01.020 | And the first thing we're going to do is just install the libraries that we need in
00:04:06.020 | order to run the notebook.
00:04:08.400 | So we have SemanticRouter, which includes our chunking mechanism.
00:04:13.400 | We have Pinecone, which is where we're going to be storing and retrieving our embeddings
00:04:17.660 | from.
00:04:18.660 | And we have HuggingFaceDatasets, which is where we're going to be pulling this starting
00:04:23.180 | dataset from.
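A minimal install cell for those three libraries might look like the sketch below; the exact package names and version pins are assumptions, so check the notebook linked in the description for the versions actually used.

```python
!pip install -qU \
    semantic-router \
    pinecone-client \
    datasets
```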
00:04:24.540 | So this dataset, if you watch these videos often, you will have seen it before.
00:04:28.600 | It's a dump of AI arXiv papers.
00:04:33.900 | So things like the Llama papers, AHE embeddings, and many other things as well.
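Loading that dataset from Hugging Face is roughly the following; the dataset ID and split name are assumptions based on the walkthrough.

```python
from datasets import load_dataset

# AI arXiv papers dump; dataset ID assumed from the walkthrough.
data = load_dataset("jamescalam/ai-arxiv2", split="train")
data
```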
00:04:41.460 | So we're going to come down to here and we're going to initialize our encoder, so the embedding
00:04:46.540 | model.
00:04:47.820 | And we're going to be using OpenAI for this one, and we're going to use the text-embedding-3-small
00:04:51.780 | model.
00:04:52.780 | We're going to be using this both to create the embeddings for our chunks once we have
00:04:56.700 | created them and are putting them into Pinecone, and also for the chunking mechanism itself.
00:05:03.460 | And ideally, that's what you should be doing as well.
00:05:06.140 | You should be aligning both of those models because you are optimizing your chunks for
00:05:11.420 | a specific embedding model.
00:05:13.180 | So it would not make sense to chunk with a different embedding model from the one you're
00:05:19.260 | using to embed those final chunks.
00:05:21.820 | Okay.
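Initializing the encoder might look like this sketch; it assumes an OPENAI_API_KEY environment variable and the semantic-router OpenAIEncoder wrapper.

```python
import os
from getpass import getpass
from semantic_router.encoders import OpenAIEncoder

# Prompt for the key if it isn't already in the environment.
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass("OpenAI API key: ")

# The same model is used for chunking and for the final chunk embeddings,
# so the chunks are optimized for the model that will actually embed them.
encoder = OpenAIEncoder(name="text-embedding-3-small")
```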
00:05:22.900 | And we're going to be using this RollingWindowSplitter.
00:05:26.100 | So there's a few parameters we can adjust here to see a little bit more of what is going
00:05:32.180 | on. Specifically, the plot splits and enable statistics options are going to show us a lot
00:05:36.580 | more information about what is actually happening when we're producing our chunks.
00:05:41.940 | Now the other parameters that are probably important to note here is the minimum split
00:05:46.460 | tokens.
00:05:47.460 | So I'm basically saying I don't want anything lower than 100 tokens in my splits.
00:05:53.660 | And then max, I'm saying I don't want anything higher than 500 tokens.
00:05:57.700 | So we're kind of setting the bounds of where we want this chunking mechanism to function
00:06:03.060 | within.
00:06:04.060 | So I'm going to initialize that.
00:06:05.820 | And then we'll come down to here and we'll perform our first splitting on the first document
00:06:12.740 | in our archive data set.
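Putting those parameters together, the splitter setup and the first split might look roughly like this; the `content` field name is an assumption about the dataset schema, and the constructor arguments follow the semantic-router version used in the walkthrough.

```python
from semantic_router.splitters import RollingWindowSplitter

splitter = RollingWindowSplitter(
    encoder=encoder,
    dynamic_threshold=True,   # find the similarity threshold automatically
    min_split_tokens=100,     # no chunk smaller than 100 tokens
    max_split_tokens=500,     # no chunk larger than 500 tokens
    plot_splits=True,         # visualize where splits were made
    enable_statistics=True,   # print splitting statistics
)

# Split the first document in the arXiv dataset.
splits = splitter([data[0]["content"]])
```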
00:06:16.180 | So if we come up here, we'll see we have these charts that show what has been happening.
00:06:21.860 | So as our rolling window of sentences goes through our paper here or document, what it's
00:06:29.860 | doing is calculating the differing similarity between chunks.
00:06:35.180 | And then it's identifying the optimum threshold for the specific model we're using.
00:06:41.020 | We're using the text-embedding-3-small model from OpenAI.
00:06:45.800 | So the threshold is pretty small, you can see up here in the top right, 0.2.
00:06:50.660 | And those similarities between our windows are what you can see with the blue line as
00:06:57.100 | we're going through.
00:06:58.100 | Then you can see that a split has been made where we see the red dotted line.
00:07:03.020 | And one thing that's kind of interesting that you can see here is once it gets to the end
00:07:07.100 | here, there are many more chunks.
00:07:12.100 | And I haven't checked, but I'm pretty sure that this area here is actually the references
00:07:17.940 | for the paper.
00:07:19.960 | So you can see that it's basically splitting many times between references because they
00:07:25.540 | don't really have that much similarity between them.
00:07:28.380 | So yes, we can see that, we can come down to here as well, we can see the chunk sizes
00:07:34.220 | that have been produced.
00:07:35.680 | So yeah, all between our 100 to 500 range, with the exception of the last one
00:07:41.660 | here, which is the final chunk.
00:07:44.100 | So there was, I assume not enough tokens to fit into that chunk, but we can see the structure
00:07:50.380 | here.
00:07:51.580 | And then we also have the splitting statistics at the bottom here if you want to check through
00:07:56.700 | that.
00:07:57.700 | And when looking at this, the thing that you should pay most attention to is your splits
00:08:02.820 | by max chunk size versus your splits by threshold.
00:08:04.900 | You want more splits by threshold.
00:08:08.580 | The more of those you have compared to splits by max chunk size, the better your semantic chunks
00:08:13.860 | are going to be.
00:08:15.140 | The way that I could reduce this, like the easiest way is actually just to increase my
00:08:19.980 | max chunk size.
00:08:21.240 | But the other thing you can do is actually set a threshold manually,
00:08:27.340 | if you want to.
00:08:27.340 | So we can take a look at the rolling window splitter class here and see that we have this
00:08:31.180 | dynamic threshold value, and we can set that equal to false.
00:08:35.260 | So if we go back into our code and set dynamic threshold equals false.
00:08:42.920 | Then we can also set the default threshold value.
00:08:47.980 | This is looking at the similarity threshold for what defines a split.
00:08:52.540 | And you can modify that by going to the encoder, score threshold, and modifying this.
00:09:00.080 | And we can just increase the threshold number.
00:09:03.580 | So before we had roughly 0.22, so I'm going to go to 0.25.
00:09:09.180 | See what we get.
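As a sketch, turning off dynamic thresholding and raising the manual threshold could look like this; whether the splitter reads the encoder's score_threshold at init time or at split time may depend on the library version.

```python
# Raise the similarity threshold that defines a split.
encoder.score_threshold = 0.25

# Re-initialize the splitter with dynamic thresholding disabled so the
# manual threshold above is used instead.
splitter = RollingWindowSplitter(
    encoder=encoder,
    dynamic_threshold=False,
    min_split_tokens=100,
    max_split_tokens=500,
    plot_splits=True,
    enable_statistics=True,
)

splits = splitter([data[0]["content"]])
```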
00:09:10.180 | Okay, and you can see that our splits by threshold has increased, and the splits by max chunk
00:09:16.160 | size has decreased.
00:09:19.160 | And we can keep going with that if we want to.
00:09:21.180 | I think with the new text-embedding-3 models, 0.3 is usually a pretty good value.
00:09:30.520 | So yeah, this is starting to look a bit better.
00:09:33.400 | Maybe too many chunks.
00:09:34.400 | I would say probably too many chunks there, actually.
00:09:37.760 | But anyway, you can do that.
00:09:39.320 | I'm actually going to take that back to what it was before.
00:09:43.640 | I'm going to set dynamic threshold to true.
00:09:47.040 | And the other thing you can do is increase or decrease your window size.
00:09:50.560 | So if I go with 5, which I think is the default value, and try again, you can also see the
00:09:58.480 | effect that this has on those statistics.
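Adjusting the window size is just another constructor argument, roughly:

```python
# A larger rolling window averages similarity over more sentences.
splitter = RollingWindowSplitter(
    encoder=encoder,
    dynamic_threshold=True,
    window_size=5,            # said to be the library default in the walkthrough
    min_split_tokens=100,
    max_split_tokens=500,
    plot_splits=True,
    enable_statistics=True,
)
splits = splitter([data[0]["content"]])
```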
00:10:02.640 | And you can see here that we get a better ratio here.
00:10:07.040 | But what you tend to find when you increase the window size here is that the similarity
00:10:13.120 | is averaged over time.
00:10:15.800 | And I don't necessarily want to always do that.
00:10:19.340 | I think it depends on how high level you're trying to be with your embeddings.
00:10:25.840 | In this case, I want to be pretty concise.
00:10:27.560 | So I'm going to actually just reduce the window size again, go back to what we had at the
00:10:31.480 | start.
00:10:32.480 | So I'm going to remove that.
00:10:33.480 | I'm going to re-initialize our encoder, rerun this, this, and this.
00:10:38.920 | Okay, if we come down here, we can see we have these chunks, they look pretty good.
00:10:46.120 | These and these as well, which seem maybe quite small even, but we can see the statistics
00:10:56.120 | look decent.
00:10:57.120 | But really what we want to be doing is looking at the chunks as well.
00:11:01.920 | So you can see here the first is like the authors and let's just see, yeah, it's just
00:11:08.120 | the authors and then it cuts at the abstract.
00:11:11.600 | So the next chunk here is actually our abstract.
00:11:14.900 | We go through, okay, if we come to the end here, we see that's basically the full abstract,
00:11:24.520 | like just the abstract.
00:11:25.520 | And then we go on to the next one here and we have, it's going into the introduction.
00:11:28.960 | Okay.
00:11:29.960 | So it has the paper details, so the title and the authors.
00:11:35.760 | It has the abstract and then the introduction, and it's broken those apart.
00:11:40.880 | So that looks, I think it looks pretty good.
00:11:43.500 | So now we can go on to just having a look at what those objects are, we can see them
00:11:47.720 | here.
00:11:48.720 | So we have the document split object here.
00:11:52.160 | This just includes all the information that we have there.
00:11:55.660 | What score triggered it, like the split and token count and so on.
00:12:02.200 | Then if we come down to here, so for each one of those DocumentSplit objects, we access
00:12:08.260 | the text itself by going into the content attribute, and that's how we get the text
00:12:16.400 | of our splits or chunks.
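Inspecting a few of those split objects might look like this; attribute names such as content, triggered_score, and token_count are taken from the walkthrough and may differ between library versions.

```python
# Each element of `splits` is a DocumentSplit object; the chunk text is in `content`.
for split in splits[:3]:
    print(split.token_count, split.triggered_score)
    print(split.content[:200], "...")
    print("---")
```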
00:12:18.020 | Now I mentioned before that we're not just going to focus on semantic chunking; it's the majority
00:12:23.780 | of what we're talking about, but not the only thing.
00:12:25.740 | I also want to show you some other little things that we can do in order to improve
00:12:30.700 | the quality of our chunks, both in our embeddings and also for the LLM.
00:12:36.780 | So the first one of those is this.
00:12:39.220 | So what I'm doing here is I'm actually taking the title of the paper and I'm also taking
00:12:44.600 | the chunk, which is going to go into the content here.
00:12:47.840 | And then I'm just merging those together.
00:12:49.800 | And then we're actually going to use that to create our embeddings.
00:12:52.560 | We won't feed this chunk to our LLM.
00:12:57.060 | We just feed it into our embedding model because it gives more context to the embedding model
00:13:02.140 | as to what we're actually talking about when we include things like the title of a paper
00:13:07.020 | or the chapter of a book, these sort of kind of like hierarchical components.
00:13:13.380 | So we can create a few of those and we can see what we have.
00:13:17.400 | So we're looking at the paper on unveiling emotions and generative AI.
00:13:22.140 | So that can be a really useful bit of context when we're looking at, you know, sometimes
00:13:28.020 | paragraphs or chunks that don't mention the overall aim of, or the overall context of
00:13:34.620 | where it is coming from.
00:13:36.580 | So we do that.
00:13:37.580 | We're going to use this function later when we're creating our embeddings.
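A sketch of that title-prefixing helper follows; the exact joining format (here a markdown-style heading) is a guess, and the title field name is an assumption about the dataset schema.

```python
def build_chunk(title: str, content: str) -> str:
    """Prefix a chunk with its paper title so the embedding model gets extra
    context about where the text comes from. This merged text is only used
    for the embedding, not for the LLM prompt."""
    return f"# {title}\n{content}"

# Example: the first chunk of the first paper.
print(build_chunk(title=data[0]["title"], content=splits[0].content))
```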
00:13:40.980 | Okay.
00:13:41.980 | And then the other thing that I want to do, I mentioned it here is for the LLM, I actually
00:13:46.420 | want to pull in some context from the surrounding chunks.
00:13:49.940 | So the way that I'm going to do that, and so I'm just going to number all of my chunks,
00:13:53.980 | add that ID into my metadata, and then just add a reference to the chunks.
00:13:59.560 | So I'm doing that with this pre-chunk and post-chunk field within the metadata here.
00:14:05.320 | So if I run this, I'll just show you what I'm doing.
00:14:08.660 | I don't need the type.
00:14:09.660 | I just want to show you, let me show you one of them.
00:14:14.220 | So metadata and we go zero.
00:14:16.380 | Okay.
00:14:17.380 | So we have the title.
00:14:19.260 | We have the content, which is our semantic chunk.
00:14:23.260 | We have the pre-chunk and then here you can see our post-chunk.
00:14:26.580 | There's no pre-chunk for this one because it is literally the first chunk.
00:14:29.940 | If we go to number one, we will see that this content here will become the pre-chunk value.
00:14:34.780 | Okay.
00:14:35.780 | So you can see up there, we have the post-chunk.
00:14:38.860 | So what's coming after and the current chunk.
00:14:42.700 | Now we have that, that's quite a lot of text and it's not so efficient when you are storing
00:14:49.220 | everything in like Pinecone.
00:14:51.340 | So what you can do is, you can do this.
00:14:54.820 | You can just store the chunks themselves.
00:14:56.660 | It's just a lot of extra information or you can just add the IDs, right?
00:15:01.980 | So this is, it can be a bit easier.
00:15:04.420 | I think, especially when you're looking at storing a lot of data, this is probably what
00:15:08.380 | you want to do.
00:15:09.380 | When it's less data, you can probably just put in the text itself.
00:15:13.680 | It's not really a big deal.
00:15:15.740 | So yes, let me show you what we're doing here.
00:15:19.440 | So the ID for each chunk that we're creating here is actually archive ID plus the chunk
00:15:25.620 | number.
00:15:27.540 | And of course it's pretty easy to set the pre-chunk ID and post-chunk ID because we
00:15:34.140 | just need to either go i minus one or i plus one.
00:15:39.060 | So we do that.
00:15:41.460 | Then we build our metadata and then let's have a look at a few of these.
00:15:48.420 | So we're just looking at the first three records from that, from the previous document.
00:15:53.020 | So we have the title of the document and the authors.
00:15:59.100 | The pre-chunk ID is nothing, of course, post-chunk is this, whereas the current chunk is actually
00:16:04.740 | zero.
00:16:05.740 | Okay.
00:16:06.740 | Then we look at the next one, current chunk one, post-chunk two, pre-chunk zero.
00:16:12.580 | Then again, current two, pre one, and no post, because
00:16:18.380 | it's the final one within this set of three.
00:16:21.220 | So it didn't actually give us a post-chunk ID here.
00:16:24.140 | Okay.
00:16:25.140 | So that's the sort of structure that we're setting up for this dataset.
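The metadata construction could be sketched like this; the ID format (arXiv ID plus chunk number) follows the walkthrough, while the field names and dataset schema (id, title) are assumptions.

```python
def build_metadata(doc: dict, doc_splits: list[str]) -> list[dict]:
    """Give each chunk an ID of `<arxiv_id>#<chunk_number>` plus references to
    its neighbouring chunk IDs, so surrounding context can be fetched later."""
    metadata = []
    for i, chunk in enumerate(doc_splits):
        prechunk_id = "" if i == 0 else f"{doc['id']}#{i-1}"
        postchunk_id = "" if i + 1 == len(doc_splits) else f"{doc['id']}#{i+1}"
        metadata.append({
            "id": f"{doc['id']}#{i}",
            "title": doc["title"],
            "content": chunk,
            "prechunk_id": prechunk_id,
            "postchunk_id": postchunk_id,
        })
    return metadata

metadata = build_metadata(doc=data[0], doc_splits=[s.content for s in splits])
metadata[:3]
```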
00:16:29.380 | Let's go on to actually indexing all of this.
00:16:33.340 | So the first thing we would need to do is set up our vector index.
00:16:38.980 | You go to app.pinecone.io to get that.
00:16:42.860 | Then you want to come down to here and we're going to set up our serverless spec.
00:16:48.940 | And actually I'm using a paid region here, so you can actually switch this to us-east-1,
00:16:54.140 | and this will be the free plan of serverless.
00:16:59.340 | I have mine in US West one too, so I'm going to leave that, but yes, you probably want
00:17:05.100 | to use this.
00:17:06.540 | Okay.
00:17:07.540 | So before we create our index, we actually need to find the dimensionality of our encoder
00:17:13.140 | model.
00:17:14.140 | So to do that, we can just encode, like create an embedding, and then we can see that the dimensionality
00:17:19.620 | is 1536.
00:17:21.220 | Otherwise you can just look online and you will be able to find the embedding dimension
00:17:26.220 | of your models.
00:17:27.860 | Then we're going to create an index.
00:17:29.580 | So I'm going to call this one better rag chunking.
00:17:32.340 | I'm going to be using the dot product metric here, so that's our similarity metric.
00:17:36.980 | We can use others.
00:17:37.980 | We can use cosine, for example.
00:17:40.220 | Here we're actually using the text-embedding-3 models from OpenAI.
00:17:45.100 | And then we pass in our serverless spec as well.
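A consolidated sketch of the Pinecone setup described above; it assumes a PINECONE_API_KEY environment variable and the v3-style Pinecone Python client (method names may differ in older clients).

```python
import os
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Free-tier serverless region; switch to your own region if you're on a paid plan.
spec = ServerlessSpec(cloud="aws", region="us-east-1")

# Find the dimensionality of the encoder by embedding a dummy string.
dims = len(encoder(["some random text"])[0])  # 1536 for text-embedding-3-small

index_name = "better-rag-chunking"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=dims,
        metric="dotproduct",   # cosine would also work with these embeddings
        spec=spec,
    )

index = pc.Index(index_name)
index.describe_index_stats()
```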
00:17:48.140 | We run that, and I actually already have my index.
00:17:52.460 | Actually oops.
00:17:53.460 | Okay.
00:17:54.460 | I just messed that up.
00:17:55.460 | So better rag chunking.
00:17:57.240 | I want to initialize that.
00:17:59.000 | And I already have my index initialized.
00:18:02.000 | So you can see that I have this vector count here.
00:18:04.540 | Actually this is pretty high for the full data set.
00:18:07.180 | So I'll show you later.
00:18:08.180 | You can just limit how many vectors you're putting into your index.
00:18:11.820 | The splitting or chunking that we do here, it can be expensive because we're creating
00:18:15.940 | many, many embeddings to produce our semantic chunks, which down the line results in better
00:18:22.500 | performance, or can result in better performance, but I mean for this example, I don't know
00:18:29.760 | if you want to spend that.
00:18:31.180 | So I would recommend if you would like to make this quicker, you can actually just download
00:18:38.300 | the pre-chunked data set from here.
00:18:41.880 | So this again, it's from Hugging Face.
00:18:45.860 | So if I come here, you can see that.
00:18:50.220 | You can also see all the chunks and everything if you want to check that out as well.
00:18:54.900 | Actually it looks pretty interesting.
00:18:56.460 | So yes, you can do that.
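If you would rather skip the (potentially expensive) chunking step, loading the pre-chunked dataset would look something like this; the dataset ID here is a placeholder, the real one is linked from the video description.

```python
from datasets import load_dataset

# Placeholder dataset ID for the pre-chunked version of the same corpus.
prechunked = load_dataset("jamescalam/ai-arxiv2-semantic-chunks", split="train")
```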
00:19:00.160 | If you want to limit the number of records that you're going to be processing and storing
00:19:06.140 | in Pinecone, you can do that.
00:19:07.900 | So if I modify this, you can do, for example, just take the first 10,000 records like that.
00:19:16.420 | And that will just limit how much you're pulling through there.
00:19:18.980 | Otherwise, if you're using this dataset, what you can also do is convert it to pandas, and we can
00:19:24.980 | go like, iloc, I think, like this, and go with the first 10,000 records like this instead.
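Either way of limiting the number of records is a one-liner, roughly:

```python
# Option 1: slice the Hugging Face dataset directly to the first 10,000 records.
data = data.select(range(10000))

# Option 2: convert to pandas and slice with iloc, as mentioned above.
df = data.to_pandas().iloc[:10000]
```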
00:19:33.820 | So let me try and run that.
00:19:38.420 | So that will work.
00:19:39.420 | So, okay.
00:19:40.780 | Looking at the rest of what we have here, we set the splitter statistics and splits
00:19:45.680 | to false because we don't want to be visualizing them with every like embedding run that we're
00:19:50.660 | doing here.
00:19:51.660 | It's just, we don't need to.
00:19:52.660 | Again, like this is more for investigation.
00:19:56.060 | When you're actually creating your embeddings, you don't need to be looking at these.
00:19:59.140 | There's no need to.
00:20:01.180 | Then you come down here.
00:20:02.180 | We are creating our chunks.
00:20:04.000 | We then build our metadata using that function that I showed you before.
00:20:08.300 | And then the other thing that we're doing here is building our chunks for the embedding
00:20:10.900 | model.
00:20:11.900 | So we're using the title as a kind of like a prefix and then the actual chunk itself
00:20:16.500 | and embedding all of that together, which we do here.
00:20:19.620 | And then after that, we just add everything to Pinecone, which I've already done.
00:20:22.580 | So I'm going to actually stop this now and I can just continue.
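Pulling the pieces above together, the bulk indexing loop might look roughly like this; it assumes the raw Hugging Face dataset form with id, title, and content fields, plus the build_metadata and build_chunk helpers sketched earlier.

```python
from tqdm.auto import tqdm

# Re-initialize the splitter without plots/statistics for the bulk run.
splitter = RollingWindowSplitter(
    encoder=encoder,
    dynamic_threshold=True,
    min_split_tokens=100,
    max_split_tokens=500,
    plot_splits=False,
    enable_statistics=False,
)

for doc in tqdm(data):
    # 1. Create the semantic chunks for this document.
    doc_splits = splitter([doc["content"]])
    # 2. Build metadata: IDs, title, content, pre/post chunk references.
    doc_meta = build_metadata(doc=doc, doc_splits=[s.content for s in doc_splits])
    # 3. Prefix each chunk with its title before embedding.
    to_embed = [build_chunk(m["title"], m["content"]) for m in doc_meta]
    embeds = encoder(to_embed)
    # 4. Upsert (id, vector, metadata) tuples into Pinecone.
    index.upsert(vectors=[(m["id"], vec, m) for m, vec in zip(doc_meta, embeds)])
```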
00:20:27.500 | Now that we have our chunks stored in Pinecone, we can go ahead and actually begin querying
00:20:34.140 | against them.
00:20:35.400 | So to do that, what we need to do is we'll set up like a little function for query.
00:20:41.180 | So I'm going to say define query.
00:20:43.860 | The input to this will be your text, your input query.
00:20:50.740 | And the first thing we need to do is create our embedding.
00:20:53.180 | So we're going to create our, I call it the query vector or query embedding.
00:20:58.780 | So I'm going to do encode.
00:21:01.800 | I'm going to put my text into a list here, extract that out.
00:21:07.700 | And then I'm going to pass this over to our Pinecone index.
00:21:11.820 | So I'm going to do index, or we should get the context out of that actually.
00:21:16.260 | So context equals index.query.
00:21:21.620 | We have xq, which is our vector.
00:21:24.460 | You can actually define that here as well.
00:21:26.920 | So vector top K, make that a little easier to read, top K, which is how many contexts
00:21:36.780 | we'd like to return, how many chunks.
00:21:39.260 | I'm going to say five, oh, and include metadata.
00:21:43.700 | So include metadata.
00:21:44.700 | I want to set that to true, and that will allow us to return like the other records
00:21:49.720 | that we had in there.
00:21:50.740 | So I'm going to just for now return the context, and we'll see what we have in there, and then
00:21:57.140 | we'll just modify it accordingly.
00:21:59.740 | So run that, make sure it actually works as well.
00:22:02.980 | So query, text, I'm going to say, what are EHI embeddings?
00:22:10.820 | Okay, so it worked, that's good, but it's not in a very good format for what we need.
00:22:16.820 | So I'm going to take a look, okay, we have, come up to the top here, we have matches.
00:22:22.020 | So we want to go into our matches.
00:22:25.420 | We then want to iterate through each one of those matches.
00:22:29.420 | So let me modify this.
00:22:35.060 | We want to iterate through each one of those.
00:22:36.860 | So for match in matches, I want to get, should I call this M, I want to get the metadata.
00:22:46.740 | Metadata equals M metadata.
00:22:51.500 | And actually we can just go straight into the content.
00:22:53.500 | So I can get content, metadata, content.
00:22:58.620 | The other things that we might want are the post chunk ID and pre-chunk ID, if we'd like
00:23:03.940 | to pull those through, and also the title as well.
00:23:06.240 | So I can get the title, pre, and also post.
00:23:22.900 | Okay, we should be able to run that, yes.
00:23:28.100 | Then what we want to do is format this data or, you know, do something with it.
00:23:33.180 | So one thing that I would like to do is using the pre and post IDs that we have here, I'd
00:23:37.900 | like to fetch them for our index.
00:23:39.780 | So I'm going to do index fetch, and you see that we get these IDs, so I'm going to get
00:23:44.580 | those.
00:23:45.580 | So we have IDs equals, and then that's the pre and post, pre, post, and let's see what
00:23:50.900 | we get out of that.
00:23:51.900 | I just still want to go to the top.
00:23:54.900 | Okay, so we're at the top here, we have our vectors that we return, I'm going to get those,
00:24:00.500 | so vectors, and then we need to iterate through each one of these, or we can just get the
00:24:05.860 | pre and post content, I think.
00:24:08.420 | So maybe that's easier, so I'm going to say these are our other chunks, and then I'm going
00:24:17.020 | to get the pre content, pre chunk, let's call it, pre chunk equals the other chunks, vectors,
00:24:28.860 | the pre ID, okay, which is going to give us this, then we want the metadata, then we want
00:24:37.140 | the content, okay.
00:24:40.580 | And then we'll do the same for post.
00:24:44.020 | Okay, so that should give us everything that we need, I think, if I do it right, I did
00:24:53.180 | something wrong.
00:24:54.180 | Ah, okay, so we have the ID, followed by metadata, followed by content, don't know where I got
00:25:01.060 | vectors from here.
00:25:02.380 | So ID, there we go.
00:25:07.540 | So that should work, cool.
00:25:10.540 | Now what we're going to do is pull these together, we're going to include it in this query here,
00:25:15.740 | so we have, these are our matches.
00:25:21.240 | From there, we get these, okay, so we've got a content, title, pre and post IDs, then we're
00:25:29.300 | going to get other chunks, okay, and then we're going to pull this together in a way
00:25:34.740 | that's going to make sense for our LLMs, so that is going to depend on, you know, the
00:25:40.180 | format that we exactly need.
00:25:42.900 | So what I'm going to do is something pretty generic, to be honest.
00:25:45.580 | So I'm going to go with, we have our chunks here, I'm going to say, I want some of my
00:25:51.860 | pre chunk and some of my post chunk around my current chunk, and I also want the title
00:25:57.180 | at the top, maybe, I don't know whether that's super necessary, but it's fine.
00:26:02.240 | So we'll do chunk equals, and we'll go title first, or we can even just do this here, we'll
00:26:10.820 | put title, then I'm going to put in a little bit of space, I'm going to put in the pre
00:26:19.260 | chunk, but I just want the end of the pre chunk, not everything.
00:26:23.300 | So pre chunk, and I'm just going to go with string values here, you probably don't necessarily
00:26:28.660 | want to do that, but it's fine, so I'm going to go, I don't know, the last 400 characters,
00:26:36.380 | then new line, I'm going to go to the chunk, you don't need to do new line necessarily,
00:26:42.180 | but I'm going to here, just for sake of readability, then you want to do your post chunk, and we
00:26:48.020 | just want to take the start of that.
00:26:51.180 | I'll do the same, so we'll go the first 400 characters, okay.
00:26:56.540 | And that's everything I really want there.
00:26:58.980 | And then I'm going to return the chunks, and of course, we're returning a fair bit here,
00:27:05.940 | so you do want to be careful, depending on the LLM that you're using, that you're not
00:27:08.940 | going over the context window that it's going to be returning, and even here, I think maybe
00:27:13.820 | it would make more sense to do a smaller top K value, like three is probably fine.
00:27:19.100 | Another thing you can do is use reranking, of course, and that will, you can rerank with
00:27:24.460 | more records, and then rerank to get a more ideal order of those results, and then just
00:27:32.620 | take like the top three.
00:27:33.940 | That's usually the sort of numbers that I would go with.
00:27:37.980 | So the chunk is, sorry, it should be content here, and also we should be doing chunks dot
00:27:46.060 | append chunk.
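Here is a consolidated sketch of the query function built up over this section: embed the query, retrieve the top matches with metadata, fetch the neighbouring chunks by ID, and stitch the title, the tail of the pre-chunk, the chunk itself, and the head of the post-chunk into one string. Attribute access on the responses (res.matches, .vectors, .metadata) follows recent Pinecone clients and may need adjusting for other versions.

```python
def query(text: str, top_k: int = 5) -> list[str]:
    # Embed the query with the same encoder used for the chunks.
    xq = encoder([text])[0]
    res = index.query(vector=xq, top_k=top_k, include_metadata=True)
    chunks = []
    for m in res.matches:
        meta = m.metadata
        title, content = meta["title"], meta["content"]
        pre, post = meta["prechunk_id"], meta["postchunk_id"]
        # Fetch neighbouring chunks (skipping missing neighbours) for extra context.
        neighbour_ids = [i for i in (pre, post) if i]
        fetched = index.fetch(ids=neighbour_ids).vectors if neighbour_ids else {}
        prechunk = fetched[pre].metadata["content"] if pre in fetched else ""
        postchunk = fetched[post].metadata["content"] if post in fetched else ""
        # Title first, then the tail of the previous chunk, the chunk itself,
        # and the head of the next chunk; 400 characters is a rough heuristic.
        chunks.append(
            f"# {title}\n\n"
            f"{prechunk[-400:]}\n"
            f"{content}\n"
            f"{postchunk[:400]}"
        )
    return chunks

query("what are large language models?")
```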
00:27:47.060 | Okay, and all of this should come back with something, so we can try the query I had before,
00:27:54.700 | just like, what are EHI embeddings, I don't know if this is in here or not.
00:28:07.140 | So EHI, I may not have this in the dataset, so let me try something that I know is going
00:28:16.700 | to be in there, so we can go with what are large language models, so it should be an easy one.
00:28:24.380 | Okay, and let's see what we get here.
00:28:27.900 | So we have section three, reuses history and work, so it wants to evaluate large language
00:28:33.620 | models are computational models that have the capability to understand and generate
00:28:37.860 | human language.
00:28:39.340 | LLMs have the, where are we, ability to generate new text based on a given input, n-gram models, no, no, no,
00:28:48.300 | okay, cool.
00:28:49.300 | So it's explaining what LLMs are there.
00:28:51.700 | Yeah, so that's just like a quick introduction to this idea of semantic chunking rather than,
00:28:58.660 | you know, the typical chunking methods that we might see elsewhere, and also this idea
00:29:05.140 | of just adding, you know, some extra context to your chunks, like where we added the title
00:29:11.380 | to our chunks, and how we actually go through a full pipeline where we're
00:29:17.980 | taking our dataset, chunking it, adding extra context where needed, and then actually creating
00:29:23.580 | our embeddings, and then finally at the end there searching through it.
00:29:27.380 | So yeah, that is it for this walkthrough, I hope this has been useful and interesting,
00:29:33.460 | but for now I will leave it there.
00:29:35.540 | So thank you very much for watching, and I will see you again in the next one, bye.