LangChain Data Loaders, Tokenizers, Chunking, and Datasets - Data Prep 101
Chapters
0:00 Data preparation for LLMs
0:45 Downloading the LangChain docs
3:29 Using LangChain document loaders
5:54 How much text can we fit in LLMs?
11:57 Using tiktoken tokenizer to find length of text
16:02 Initializing the recursive text splitter in LangChain
17:25 Why we use chunk overlap
20:23 Chunking with RecursiveCharacterTextSplitter
21:37 Creating the dataset
24:50 Saving and loading with JSONL file
28:40 Data prep is important
00:00:00.000 |
In this video we are going to take a look at what we need to do and what we need to consider 00:00:05.920 |
when we are chunking text for large language models. The best way I can think of 00:00:14.000 |
demonstrating this is to walk through an example. Now we're going to go with what I 00:00:19.280 |
believe is kind of like a rule of thumb that I tend to use when I'm chunking text in order to 00:00:24.400 |
put into a large language model and it doesn't necessarily apply to every use case. You know 00:00:29.200 |
every use case is slightly different but I think this is a pretty good approach at least when we're 00:00:34.400 |
using retrieval augmentation and large language models which I think is where the chunking 00:00:40.640 |
question kind of comes up most often. So let's jump straight into it. In this example what we're 00:00:46.560 |
going to be doing is taking the langchain docs here, literally every page on this website, 00:00:53.760 |
and we're going to be downloading those, taking each one of these pages and then we're going to 00:00:59.840 |
be splitting them into more reasonably sized chunks. Now how are we going to do this? We're 00:01:06.720 |
going to take a look at this notebook here. Now if you'd like to follow along with the code you can 00:01:12.480 |
also run this notebook. I will leave a link to it which will appear somewhere near the top of the 00:01:17.760 |
video right now. Now to get started we're going to be using a few Python libraries. Langchain is 00:01:23.760 |
a pretty big one here so not only is it the documentation that we're downloading but it's 00:01:29.440 |
also going to be how we download that documentation and it's also going to be how we split that 00:01:35.600 |
documentation into chunks. Another dependency here is the tiktoken tokenizer. We'll talk about 00:01:42.160 |
that later and we're just going to visualize and make things a little bit easier to follow with 00:01:45.840 |
these libraries here. In this example the first thing we're going to do is download all of the docs from 00:01:54.080 |
LangChain. Everything is contained within this, the top level page of the LangChain docs. We're 00:02:01.360 |
going to save everything into this directory here and we are going to say we want to get all of the 00:02:10.080 |
.html files. We run that and that will take a moment just to download everything. There's a lot 00:02:18.480 |
in there. My internet connection is also pretty slow so it will probably take me a moment but 00:02:25.040 |
let's go ahead and just have a look at where these are being downloaded. If we come over to the left 00:02:31.120 |
here we can see there is the rtdocs directory there and inside the rtdocs directory we have this 00:02:39.040 |
langchain.readthedocs.io/en/latest path which is just kind of like the path of our docs. In there you can see 00:02:48.080 |
everything's been downloaded. We have the index page which I think is the top level page. You can 00:02:54.480 |
see it's just HTML. We're not going to process this we're going to use langchain to clean this up 00:03:02.080 |
but if we come down a little bit I think maybe we can see something. This is the first page, 00:03:11.120 |
welcome to langchain, LLMs are emerging as a transformative technology, so on and so on. 00:03:17.440 |
We have some other things, other pages. We're just going to process all of this. 00:03:22.960 |
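For reference, the download step is roughly a recursive wget into the rtdocs directory; something like the cell below. The exact docs URL is an assumption based on the langchain.readthedocs.io path shown above, so adjust it to wherever the docs live when you run it.

```python
# Recursively mirror the docs, keeping only .html files, into the rtdocs directory.
# URL is an assumption for illustration; swap in the current docs address.
!wget -r -A.html -P rtdocs https://langchain.readthedocs.io/en/latest/
```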
Back to our code. It's done downloading now. We can come down to here and what we're going to do 00:03:30.560 |
is use the LangChain document loaders and we're going to use a ReadTheDocs loader. ReadTheDocs is a specific 00:03:37.440 |
template that is used quite often for documentation for code libraries. LangChain includes a document 00:03:46.960 |
loader that is specifically built for reading that type of documentation or those HTML pages 00:03:53.120 |
and processing them into a nicer format. It's really easy to use it. We just point it to our 00:04:00.640 |
directory that we just created. What are we doing here? We're loading those docs and here I'm just 00:04:09.280 |
printing out the length of those docs so that we can see. We have 390 HTML pages that have been 00:04:16.080 |
downloaded there for some reason. When I ran this about an hour ago, they actually had 389, 00:04:25.280 |
now they have 390 pages, so it's already out of date. Cool. Let's have a look at one of those 00:04:32.640 |
pages. We have this document object. Inside that we have page content, which is all of our text. 00:04:40.960 |
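As a sketch, the loading step looks something like this, assuming the rtdocs directory created by the download above:

```python
from langchain.document_loaders import ReadTheDocsLoader

# Point the loader at the directory the HTML files were downloaded into
loader = ReadTheDocsLoader('rtdocs')
docs = loader.load()

print(len(docs))                    # number of pages loaded, e.g. 390
print(docs[0].page_content[:500])   # plain-text content of the first page
```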
If we want to print that in a nicer format, we can see this. Looks pretty good. There are some 00:04:51.040 |
messy parts of this, but it's not really a problem. We could try and process that if we wanted to, 00:04:59.440 |
but honestly, I don't really think it's worth it because a large language model can 00:05:04.480 |
handle this very easily. I personally wouldn't really bother with that. I'd just take it as it 00:05:11.920 |
is. Now, at the end of this object, we come right to the end if it lets me, we see that we have this 00:05:22.000 |
metadata here. Inside the metadata we have the source, which is in this case the file path, 00:05:30.480 |
but fortunately the way that we've set this up is that we can just replace rtdocs with 00:05:36.000 |
HTTPS and that will give us a URL for this particular file. Let's come down here and you 00:05:43.040 |
can see that's what I'm doing here. Replace rtdocs with HTTPS. Cool. Then we can click that 00:05:51.200 |
and we come over to here. Now, this is where we start talking about the chunking of what we're 00:05:58.960 |
doing. When we are thinking about chunking, there are a few things to consider. The first thing to 00:06:08.400 |
consider is how much text or how many tokens can our large language model or whatever process is 00:06:16.400 |
what we're doing, how many tokens can it handle? What is optimal for our particular use case? 00:06:22.480 |
The use case that I'm envisioning here is retrieval augmentation for question answering 00:06:30.240 |
using a larger language model. What does that mean exactly? It's probably best if I draw it out. 00:06:36.560 |
We're going to have our large language model over here and we're going to ask it questions. We have 00:06:42.480 |
our question over here. That's supposed to be a Q. It's fine. We have our question. We're 00:06:49.360 |
going to say, "What is the LLMChain in LangChain?" If we pass that straight into our large language 00:06:57.360 |
model, at the moment using GPT 3.5 Turbo, even GPT 4, they can't answer that question because they 00:07:04.960 |
don't know what the LangChain library is. In this scenario, what we would do is we'd go to Vector 00:07:13.600 |
Database. We don't really need to go into too much detail here. We go to Vector Database, which is 00:07:20.000 |
where we store all of the documents that we're processing now, all those LangChain docs. They 00:07:25.760 |
would end up within that space and they would be retrieved. We would pass in five or so of these 00:07:33.600 |
chunks of text that are relevant to our particular query alongside our original query. 00:07:42.320 |
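To make that concrete, here is a rough sketch of how such a retrieval augmented prompt might be assembled; the instruction wording and variable names are just illustrative, not the exact prompt from the video:

```python
query = "What is the LLMChain in LangChain?"

# contexts would come back from the vector database; shown here as placeholders
contexts = [
    "<retrieved chunk 1>", "<retrieved chunk 2>", "<retrieved chunk 3>",
    "<retrieved chunk 4>", "<retrieved chunk 5>",
]

# instructions at the top, then the contexts, then the question
prompt = (
    "Answer the question using the context provided below.\n\n"
    "Context:\n" + "\n\n---\n\n".join(contexts) + "\n\n"
    "Question: " + query
)
```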
What you'd end up with is rather than, let's say this is your prompt, you typically have your 00:07:48.240 |
query. Rather than just a query, you'd have your query and then you'd also have these five bits of 00:07:57.200 |
relevant information below the query. That would all go into the large language model. You would 00:08:03.280 |
essentially say to it, you'd probably have some instructions near the top and those instructions 00:08:08.320 |
would say, I want you to answer this question. You'd maybe give the question here or give it a 00:08:15.760 |
bit later on using the context that we have provided. You would basically, in front of these 00:08:21.920 |
contexts, you would write context. The large language model will answer the question based 00:08:28.560 |
on those contexts. That's the scenario we're envisioning here. In this scenario, if we want to 00:08:37.280 |
input five of these contexts into each one of our retrieval augmented queries, we need to think, 00:08:46.080 |
what is the max token limit of our large language model and how much of that space can be reserved 00:08:53.440 |
for these contexts? In this scenario, let's say that we're using GPT 3.5 Turbo. The token limit 00:09:03.520 |
for GPT 3.5 Turbo is something like 4,096. This includes both. You have your large language model. 00:09:17.200 |
I'm going to put that here. Pretend this is your large language model. This 4,096 includes the 00:09:24.480 |
input to the large language model, so all of your input tokens, and also all of your generated 00:09:33.600 |
output tokens. Basically, we can't just use that full 4,000 tokens on the input. We need to leave 00:09:42.800 |
some space for the output. Also, within the input, we have other components. It's not just the 00:09:49.280 |
context, but we also have the query. That's supposed to say query. As well as that, we might 00:10:00.560 |
also have some instructions. I don't know why I'm writing so badly. As well as the instructions, 00:10:11.040 |
we might also have a bit of chat history if this is a chatbot. Basically, the amount of contexts 00:10:18.960 |
that we can feed in is pretty limited. In this scenario, let's just assume that we can pass in 00:10:26.000 |
a context of around half of the 4,000 tokens. We'll say 2,000 is going to be our limit. If 2,000 00:10:34.240 |
is our limit, that means we need to divide that by five because those 2,000 tokens need to be 00:10:43.840 |
shared by our five contexts, which leaves us with about 400 of these tokens per context. That's our 00:10:55.520 |
maximum chunk size. Now, one question that we might have here is, could we reduce the number 00:11:01.440 |
of tokens further? For sure, we can. I would say the test for the minimum number of tokens that you need within 00:11:09.200 |
a context is: when you read this context, does it make sense? If you have enough words in there 00:11:18.000 |
for that context to make sense to you as a human being, then that means that it is probably enough 00:11:25.840 |
to feed as a chunk of text into a large language model, into an embedding model, and so on. If that 00:11:34.240 |
chunk of text has enough text in there to have some sort of meaning to itself, then the chunk 00:11:41.040 |
is probably big enough. As long as you satisfy that, that should be the criteria for your minimum 00:11:48.000 |
size of that chunk of text. Naturally, for the maximum size of chunk of text, we have the 400 00:11:54.720 |
tokens that we just calculated now. With all of that in mind, we need to take a look at how 00:12:01.200 |
we would actually calculate the size of these chunks, because we're not basing this on character 00:12:08.720 |
length, we're basing this on token length. In order to do that, we need to look at how to tokenize 00:12:15.360 |
text using the same tokenizer that our large language model uses, and then we can actually 00:12:22.960 |
count the number of tokens within each chunk. Getting started with that, we are going to be 00:12:29.680 |
using the tiktoken tokenizer. Now, this is specific to OpenAI models. Obviously, if you're 00:12:35.120 |
using Cohere, HuggingFace, and so on, this is going to be a slightly different approach. 00:12:39.280 |
First, we want to get our encoding. There are multiple tiktoken tokenizers that OpenAI uses. 00:12:48.000 |
This is just one of those. Now, let's initialize that, and I will talk a little bit about where 00:12:53.920 |
we're getting these encoders from. You can actually find details for the tokenizer at 00:13:00.000 |
this link here. This link is to the GitHub repo, tiktoken, tiktoken/model.py. I'm going to click 00:13:08.880 |
through to that. This is in the OpenAI tiktoken repository on GitHub. You can see we have this 00:13:15.360 |
model to encoding dictionary here. Within this, you can see that we have a mapping from each of 00:13:22.080 |
the models to the particular tokenizer that it uses. We are going to use the GPT-3.5 Turbo model, 00:13:29.360 |
which uses cl100k_base. I would say I think most of the more recent models, like the models 00:13:37.120 |
that you'll be using at the time of recording this video, they all use this encoder. The embeddings 00:13:46.240 |
model that is the most up-to-date uses cl100k_base. The chat GPT model, GPT-3.5 Turbo, uses 00:13:55.840 |
cl100k_base. GPT-4 also uses it. The only one that is still kind of a relevant model is the 00:14:03.600 |
text-davinci-003 model. That is the only relevant model that doesn't use that encoder. This one 00:14:10.960 |
uses p50k_base. In reality, you don't even need to go there to find out the encoding that you need 00:14:18.240 |
to use. You can actually just see this: tiktoken.encoding_for_model, and you can run this. You get 00:14:25.440 |
cl100k_base. That's how we know. Now, anything else? I think that is pretty much it. 00:14:33.760 |
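A minimal sketch of getting that encoder with tiktoken:

```python
import tiktoken

# initialize the cl100k_base encoder directly...
tokenizer = tiktoken.get_encoding('cl100k_base')

# ...or look up which encoding a given model uses
print(tiktoken.encoding_for_model('gpt-3.5-turbo'))  # -> <Encoding 'cl100k_base'>
```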
Actually, here I'm creating this tiktoken length function. That is going to take some text. 00:14:41.680 |
It's going to use the tokenizer to calculate the length of that text in terms of tiktoken tokens. 00:14:49.920 |
That's important because we need to use that for our LangChain splitter function in a moment. 00:14:55.760 |
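The length function itself is only a few lines. A sketch, using the tokenizer initialized above; the name tiktoken_len is just a convention used here:

```python
def tiktoken_len(text: str) -> int:
    """Return the number of tiktoken tokens in a piece of text."""
    tokens = tokenizer.encode(text, disallowed_special=())
    return len(tokens)
```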
We create that. Then what we can do is just first, before we jump into the whole chunking component, 00:15:05.840 |
I want to have a look at what the length of documents looks like at the moment. I'm going 00:15:11.760 |
to calculate the token counts with the tiktoken length function. Come to here, we can see the minimum, 00:15:18.720 |
maximum, and average number of tokens. The smallest document contains just 45 tokens. 00:15:24.400 |
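A sketch of how those statistics could be computed with the tiktoken_len helper from above:

```python
# token count for every page in the dataset
token_counts = [tiktoken_len(doc.page_content) for doc in docs]

print(f"min: {min(token_counts)}")
print(f"avg: {int(sum(token_counts) / len(token_counts))}")
print(f"max: {max(token_counts)}")
```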
This is probably a page that we don't really need. It probably doesn't contain anything 00:15:32.320 |
useful in there. Maximum is almost 58,000 tokens, which is really big. I'm not sure what that is, 00:15:40.640 |
but the average is a bit more normal, so 1.3 thousand there. We can visualize the distribution 00:15:49.760 |
of those pages and the amount of tokens they have. The vast majority of pages, they're more towards 00:15:58.480 |
the 1,000 token range, as we can see here. All right, cool. Now, let's continue and we'll start 00:16:08.240 |
and look at how we're going to chunk everything. Again, we're using LangChain here. We're using 00:16:13.120 |
a text splitter and we're using the RecursiveCharacterTextSplitter. Now, this is, I think, 00:16:18.400 |
probably one of the best chunkers or text splitters that LangChain offers at the moment. 00:16:24.080 |
It's very general purpose. They do also offer some text splitters that are more specific 00:16:30.160 |
to Markdown, for example, but I like this one. You can use it for a ton of things. Let me just 00:16:39.280 |
explain it very quickly. Basically, what it's going to do is it's going to take your length 00:16:45.120 |
function, so the tiktoken length function, and it's going to say, "I need to split your text so that each 00:16:51.680 |
chunk does not go over this chunk size here," so this 400. It's going to split based on the 00:16:59.040 |
separators. The reason we have multiple separators is that it first starts by trying to find double 00:17:05.840 |
new lines. This is a double new line separator. It's going to try and split on that first. If it 00:17:10.720 |
can't find a good split using the double new line characters, it will just try a single new line, 00:17:18.960 |
then it will try a space, and as a very last resort, it will just split on anything. 00:17:24.800 |
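Putting those parameters together, initializing the splitter looks roughly like this; the 400 token chunk size comes from the budget we worked out earlier, and the 20 token chunk overlap is explained next:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,        # max tokens per chunk (~(4096 / 2) / 5 contexts)
    chunk_overlap=20,      # tokens shared between neighbouring chunks
    length_function=tiktoken_len,        # measure length in tokens, not characters
    separators=["\n\n", "\n", " ", ""]   # try double newline first, then fall back
)
```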
Cool. Then one final thing that we have here is this chunk overlap. This chunk overlap is saying 00:17:31.120 |
for every chunk, we are going to overlap it with the next chunk by 20 tokens. Let me draw that out 00:17:42.320 |
so it makes more sense. Imagine we have a ton of texts. There's loads of texts here. 00:17:49.280 |
Now, we are going to get a chunk of 400 tokens. Let's say that chunk takes us from 00:18:01.440 |
here all the way to, say, here. We have 400 tokens in this chunk. Then the next chunk, 00:18:11.360 |
if we don't have any chunk overlap, would be the next 400 tokens from this. Let's say it's to here. 00:18:18.080 |
This comes with a problem because we don't know what this information here and this information 00:18:27.280 |
here is about. They could be related. We might be missing out on some important information 00:18:34.400 |
by just splitting in the middle here. It's important to try and avoid that if possible. 00:18:41.680 |
The most naive way or naive approach for doing this is to include a chunk overlap. 00:18:47.760 |
What we would do is, let's say we take the 20 tokens behind this. We're going to go back 00:18:58.720 |
20 tokens, which maybe comes to here. That means that this space here is now going to be shared by 00:19:09.280 |
the first chunk and the next chunk, which will also bring back the next chunk to something like 00:19:18.400 |
here. Now, we have chunk one here, which goes from here up to here. Then we have chunk two, which is 00:19:33.440 |
from here to here. Following on from that, we would also add another chunk overlap for number 00:19:43.120 |
three. Number three would go from here to, let's say, here. Finally, for number four, 00:19:49.040 |
we go from here to here. The chunk overlap is just to make sure that we're not missing any 00:19:55.600 |
important connections between our chunks. It does mean that we're going to have a little bit more 00:20:03.360 |
data to store there, because we're including these chunks of 20 in multiple places. 00:20:10.720 |
But I think that's usually worth it in terms of the better performance that you can get by 00:20:17.920 |
not missing out that important information, that important connection between chunks. 00:20:22.640 |
We initialize that. Then, to actually split the text, we use the text splitter, 00:20:30.080 |
split text. We're going to take docs[5], and we're going to take the page content, which is just the 00:20:36.320 |
plain text. Based on the parameters that we set here, chunk size of 400 and chunk overlap of 20 00:20:46.160 |
using the tiktoken length function, we get two chunks. Let's have a look at the length of those 00:20:51.760 |
two chunks. The first chunk that we get is 346 tokens. Next one, 247. Both within that max upper 00:21:02.800 |
end limit of 400. You see that it's not going to necessarily split on the 400 tokens specifically, 00:21:11.440 |
because we have the specific separators that we would like to use. It's going to optimize 00:21:18.400 |
preferably for this separator. We're not going right up to that limit with every single chunk, 00:21:25.040 |
which is fine. That's kind of ideal. We don't necessarily need to put in a ton of text there. 00:21:32.320 |
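For a single document the split is then just one call. A sketch, using the same page (docs[5]) as in the video:

```python
chunks = text_splitter.split_text(docs[5].page_content)

print(len(chunks))              # -> 2 for this particular page
print(tiktoken_len(chunks[0]))  # -> 346 tokens in the video's run
print(tiktoken_len(chunks[1]))  # -> 247 tokens
```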
That's it for a single document. What we're going to do now is we're going to repeat that 00:21:40.400 |
over the entire dataset. The final format that I want to create here is going to look like this. 00:21:46.640 |
We're going to have the ID, we're going to have our text, and we're going to have the source 00:21:50.000 |
where this text is actually coming from. One thing that you'll notice here is the ID. We're going to 00:22:00.240 |
create an ID and that ID will be unique to each page. We're going to have multiple chunks for each 00:22:08.160 |
page. That means we're also going to add in this chunk identifier onto the end of the ID to make 00:22:14.320 |
sure that every ID for every chunk is actually unique. Let me show you how we're going to create 00:22:21.760 |
that. Essentially, we have the URL here. We're going to replace the rtdocs that we have here 00:22:30.480 |
with the actual HTTPS protocol. I'm just going to print out so you can see what it is. Then we're 00:22:38.160 |
going to take that URL, we're going to add it to this hashlib MD5. This is just a hashing function 00:22:45.680 |
that is going to take our URL and hash it into a unique identifier. This is useful because if we 00:22:55.040 |
are updating this text at some point in the future or this dataset, we can use the same 00:23:02.400 |
hashing function to create our unique IDs. That means that when we update this particular page, 00:23:07.680 |
it will just overwrite the previous versions of that item because we're using the same ID. 00:23:15.760 |
Of course, we can't use the same ID for every single chunk. We also need to add in this here, 00:23:22.640 |
which is like the chunk identifier. It's just a count of the number of chunks. We can see that 00:23:30.400 |
being created here. These are just two examples from the previous page that we just showed. 00:23:37.040 |
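A sketch of how those records could be built for a single page. The exact string replacement and the 12-character truncation of the hash are assumptions for illustration, but the idea of hashing the URL into a stable ID, then appending a chunk counter, is as described:

```python
import hashlib

# rebuild the live URL from the local file path, then hash it into a stable ID
url = doc.metadata['source'].replace('rtdocs/', 'https://')
uid = hashlib.md5(url.encode('utf-8')).hexdigest()[:12]

chunks = text_splitter.split_text(doc.page_content)
records = [
    {'id': f'{uid}-{i}', 'text': chunk, 'source': url}
    for i, chunk in enumerate(chunks)
]
```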
So you can see we have the chunk identifier and indeed the chunks are different. This says 00:23:43.360 |
language model cascades, ICE primer books, Socratic models. Okay, whatever. Let's take a look 00:23:50.160 |
at what is at the end of the first item. It should be something similar. There should be the overlap 00:23:57.120 |
that I mentioned. You can see language model cascades, ICE primer books, Socratic models. 00:24:06.000 |
Same thing. Cool. So there is the overlap. Now what we need to do is repeat this same logic 00:24:14.080 |
that we've just created across our entire dataset. To do that, same thing that we just did. We're 00:24:19.520 |
going to take the URL, we're going to create our unique ID, we're going to take the chunks using 00:24:23.760 |
the text splitter, and then we're going to append these all to our documents list here. That's just 00:24:31.280 |
going to be where we store everything. Okay. Now, so the length of the documents an hour ago was 00:24:39.920 |
a little bit less. Now it is 2,212 documents. Cool. We can now save them to a 00:24:53.040 |
JSON lines file. To do that, we just do this. JSON lines is basically what you can see here. 00:25:01.200 |
If we take a look at the documents, look at the first five, it's this, but it's just in a JSONlines 00:25:08.640 |
file. You can see it here. Same thing. Then once you've saved it and you create your JSONL file, 00:25:17.840 |
you just load it from file like this. With open train.jsonl, wherever you saved it, 00:25:25.200 |
and you just load it iteratively like that. You can take a look. Yeah. Okay, great. That's how 00:25:33.760 |
you would load it. Now, a couple of things here. The reason that we're using JSONL and the reason 00:25:39.840 |
I'm calling this train.JSONL is because this makes it very compatible with HuggingFace datasets. 00:25:47.760 |
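Putting the last few steps together, the full loop plus the JSON lines save and load might look roughly like this, under the same assumptions as the single-page sketch above:

```python
import hashlib
import json

documents = []

for doc in docs:
    url = doc.metadata['source'].replace('rtdocs/', 'https://')
    uid = hashlib.md5(url.encode('utf-8')).hexdigest()[:12]
    chunks = text_splitter.split_text(doc.page_content)
    for i, chunk in enumerate(chunks):
        documents.append({'id': f'{uid}-{i}', 'text': chunk, 'source': url})

# save to a JSON lines file: one JSON object per line
with open('train.jsonl', 'w') as f:
    for record in documents:
        f.write(json.dumps(record) + '\n')

# load it back the same way
with open('train.jsonl', 'r') as f:
    documents = [json.loads(line) for line in f]
```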
HuggingFace datasets is essentially a way of sharing your dataset with others, or just making it more 00:25:54.000 |
accessible for yourself if you set it to be a private dataset. What I want to do is just show 00:25:59.040 |
you how we can actually go about doing that as well. The first thing that we need to do 00:26:04.160 |
is go to HuggingFace.co. That will bring you to the first page of HuggingFace, which may look 00:26:10.800 |
different to you because you may not already have an account on HuggingFace. If you do need an 00:26:17.680 |
account or you need to sign in, there will be a little button over here that says sign up or log 00:26:22.080 |
in. You would follow that, create your account or log in. Then you will see something like this, 00:26:27.920 |
at which point you go over to your profile. We click new dataset. We give our dataset a name. 00:26:34.000 |
I'm going to call it langchain-docs. You can obviously call this whatever you want. 00:26:38.720 |
You can set it to private if you want to keep this dataset private. For me, I'm going to just 00:26:43.360 |
leave it as public. You create your dataset. On here, this is like the page of your dataset, 00:26:51.040 |
like the homepage of your dataset. You go to files. You go to add file, upload files. 00:26:57.200 |
Then you just need to drag in the train.jsonl file to here. For me, that is here. I'm just going to 00:27:08.640 |
go and drag that in. We go down, commit changes to main. We have now uploaded that. We can go 00:27:17.200 |
click on files here and we'll be able to see that we have the train.jsonl file in there. 00:27:22.000 |
Now, to actually use that in our code, we would need to pip install datasets. This is 00:27:27.760 |
the library for HuggingFace datasets. Then we would write this. We do from datasets 00:27:36.320 |
import load_dataset. Then our data would be a load_dataset call. 00:27:45.040 |
Here, we need the name of our dataset. Let's go back to the dataset page. We can find that at 00:27:56.000 |
the top here. You can see it's jamescalam/langchain-docs. We can just copy it, add that into here. 00:28:02.400 |
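The load call itself is then a one-liner; a sketch, assuming the dataset name shown on the dataset page (swap in your own username and dataset name if you created it under a different account):

```python
from datasets import load_dataset

# pull the train.jsonl we uploaded, as the "train" split
data = load_dataset('jamescalam/langchain-docs', split='train')
print(data[0])  # -> {'id': '...', 'text': '...', 'source': '...'}
```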
Our split is the training split. That's where the train.jsonl comes in. Then we can view the data 00:28:11.520 |
details there. Once that has loaded, we will be able to see. We can just extract things. With data[0], 00:28:20.960 |
we can see that we have our text in there. It's super easy to work with. That's why I recommend 00:28:29.040 |
storing your data on HuggingFace datasets if you're wanting to share it. Even if you're wanting 00:28:34.720 |
to do the private approach, you can do that as well. You just need, I think it's like an API key 00:28:39.680 |
and that's pretty much it. That's it for this video. I just wanted to cover some of the 00:28:45.200 |
approaches that we take when we are considering how to chunk our text and actually process it for 00:28:53.760 |
large language models and also see how we might store that data later on as well. 00:29:00.400 |
Both of these items are things I think we miss a lot in the typical videos. We're really focusing on 00:29:07.760 |
the large language model processing or the retrieval augmentation or whatever else. This, 00:29:15.600 |
in reality, is probably one of the most important parts of the entire process. We miss it pretty 00:29:21.120 |
often. Anyway, that's it for this video. Thank you very much for watching. I hope this is all 00:29:27.520 |
being useful and interesting. I will see you again in the next one. Bye.