
LangChain Data Loaders, Tokenizers, Chunking, and Datasets - Data Prep 101


Chapters

0:00 Data preparation for LLMs
0:45 Downloading the LangChain docs
3:29 Using LangChain document loaders
5:54 How much text can we fit in LLMs?
11:57 Using tiktoken tokenizer to find length of text
16:02 Initializing the recursive text splitter in LangChain
17:25 Why we use chunk overlap
20:23 Chunking with RecursiveCharacterTextSplitter
21:37 Creating the dataset
24:50 Saving and loading with JSONL file
28:40 Data prep is important

Transcript

In this video we're going to take a look at what we need to do and consider when we're chunking text for large language models. The best way I can think of to demonstrate this is to walk through an example. We're going to go with what I believe is a good rule of thumb, the one I tend to use when I'm chunking text to put into a large language model, though it doesn't necessarily apply to every use case.

Every use case is slightly different, but I think this is a pretty good approach, at least when we're using retrieval augmentation with large language models, which is where the chunking question comes up most often. So let's jump straight into it. In this example what we're going to do is take the LangChain docs here, literally every page on that website, download them, and then split each of those pages into more reasonably sized chunks.

Now, how are we going to do this? We're going to take a look at this notebook here. If you'd like to follow along with the code, you can also run this notebook; I will leave a link to it, which will appear somewhere near the top of the video right now.

To get started we're going to be using a few Python libraries. LangChain is a pretty big one here: not only is it the documentation that we're downloading, it's also how we load that documentation and how we split it into chunks.

Another dependency here is the tiktoken tokenizer; we'll talk about that later. We're also using a couple of other libraries here just to visualize things and make them a little easier to follow. In this example, the first thing we're going to do is download all of the docs from LangChain. Everything is contained under this top-level page of the LangChain docs.
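The download cell looks roughly like the following; the exact URL and wget flags here are my reconstruction of what's being described, not copied verbatim from the notebook.

```python
# Jupyter notebook cell: the "!" runs a shell command from the notebook.
# Recursively grab every .html page under the LangChain docs and
# save them into a local "rtdocs" directory.
!wget -r -A.html -P rtdocs https://langchain.readthedocs.io/en/latest/
```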

We're going to save everything into this directory here and we are going to say we want to get all of the .html files. We run that and that will take a moment just to download everything. There's a lot in there. My internet connection is also pretty slow so it will probably take me a moment but let's go ahead and just have a look at where these are being downloaded.

If we come over to the left here we can see the rtdocs directory, and inside rtdocs we have langchain.readthedocs.io/en/latest, which is basically the path of our docs. In there you can see everything's been downloaded. We have the index page, which I think is the top-level page.

You can see it's just HTML. We're not going to process this ourselves; we're going to use LangChain to clean it up. But if we come down a little bit we can see something: this is the first page, "Welcome to LangChain", "LLMs are emerging as a transformative technology", and so on.

We have some other pages in there too, and we're just going to process all of them. Back to our code. It's done downloading now. We can come down to here, and what we're going to do is use the LangChain document loaders, specifically the ReadTheDocs loader. Read the Docs is a documentation format that is used quite often for code libraries.

LangChain includes a document loader that is specifically built for reading that type of documentation, those HTML pages, and processing them into a nicer format. It's really easy to use: we just point it at the directory we just created. What are we doing here? We're loading those docs, and I'm printing out the number of docs so that we can see.
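That step, roughly, assuming the rtdocs directory from the download above:

```python
from langchain.document_loaders import ReadTheDocsLoader

# point the loader at the directory we downloaded the docs into
loader = ReadTheDocsLoader("rtdocs")
docs = loader.load()

print(len(docs))  # number of HTML pages that were loaded
```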

We have 390 HTML pages that have been downloaded there. For some reason, when I ran this about an hour ago there were 389 pages; now there are 390, so it's already out of date. Cool. Let's have a look at one of those pages. We have this Document object.

Inside that we have page_content, which is all of our text. If we want to print that in a nicer format, we can see this. It looks pretty good. There are some messy parts, but that's not really a problem. We could try to clean those up if we wanted to, but honestly I don't think it's worth it, because a large language model can handle this very easily.

I personally wouldn't bother with that; I'd just take it as it is. Now, at the end of this object, if we come right to the end, we see that we have this metadata. Inside the metadata we have the source, which in this case is the file path. Fortunately, the way we've set this up means we can just replace rtdocs with https:// and that will give us the URL for this particular file.
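In code, that replacement is just something like this; the exact strings are an assumption based on how the files were saved:

```python
doc = docs[0]
print(doc.metadata["source"])  # local file path under rtdocs/
# swap the local prefix for the https:// protocol to get a browsable URL
print(doc.metadata["source"].replace("rtdocs/", "https://"))
```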

Let's come down here and you can see that's what I'm doing: replacing rtdocs with https://. Cool. Then we can click that link and we end up over here. Now, this is where we start talking about chunking. When we are thinking about chunking, there are a few things to consider.

The first thing to consider is how much text, or how many tokens, our large language model (or whatever process we're using) can handle, and what is optimal for our particular use case. The use case I'm envisioning here is retrieval augmentation for question answering using a large language model.

What does that mean exactly? It's probably best if I draw it out. We're going to have our large language model over here, and we're going to ask it questions. We have our question over here; that's supposed to be a Q, it's fine. We're going to ask, "What is the LLMChain in LangChain?" If we pass that straight into our large language model, at the moment using GPT-3.5 Turbo or even GPT-4, it can't answer that question, because it doesn't know what the LangChain library is.

In this scenario, what we would do is go to a vector database. We don't really need to go into too much detail here. We go to the vector database, which is where we store all of the documents that we're processing now, all those LangChain docs. They would end up within that space and they would be retrieved.

We would pass in five or so of these chunks of text that are relevant to our particular query, alongside our original query. So rather than your prompt being just the query, you'd have your query and then you'd also have these five bits of relevant information below the query.

That would all go into the large language model. You'd probably have some instructions near the top, and those instructions would say something like, "I want you to answer this question using the contexts that we have provided." You'd maybe give the question there, or give it a bit later on.

You would basically write "Context:" in front of these contexts. The large language model will then answer the question based on those contexts. That's the scenario we're envisioning here. In this scenario, if we want to input five of these contexts into each of our retrieval-augmented queries, we need to think: what is the max token limit of our large language model, and how much of that space can be reserved for these contexts?

In this scenario, let's say that we're using GPT-3.5 Turbo. The token limit for GPT-3.5 Turbo is 4,096, and that covers both directions. You have your large language model (I'm going to put that here, pretend this is your large language model), and this 4,096 includes the input to the large language model, so all of your input tokens, and also all of your generated output tokens.

Basically, we can't just use that full 4,000 tokens on the input; we need to leave some space for the output. Also, within the input we have other components. It's not just the contexts: we also have the query (that's supposed to say "query"), and as well as that, we might also have some instructions.

I don't know why my writing is so bad. As well as the instructions, we might also have a bit of chat history if this is a chatbot. Basically, the number of contexts that we can feed in is pretty limited. In this scenario, let's just assume that we can use around half of the 4,000 tokens for the contexts.

We'll say 2,000 is going to be our limit. If 2,000 is our limit, that means we need to divide that by five because those 2,000 tokens need to be shared by our five contexts, which leaves us with about 400 of these tokens per context. That's our maximum chunk size.
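As a quick back-of-the-envelope version of that calculation:

```python
token_limit = 4096                  # GPT-3.5 Turbo context window (input + output)
context_budget = token_limit // 2   # assume roughly half is left for retrieved contexts
num_contexts = 5
chunk_size = context_budget // num_contexts

print(chunk_size)  # ~400 tokens per context
```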

Now, one question that we might have here is: could we reduce the number of tokens further? For sure we can. I would say the minimum number of tokens you need within a context comes down to this: when you read the context, does it make sense? If you have enough words in there for that context to make sense to you as a human being, then it is probably enough to feed as a chunk of text into a large language model, into an embedding model, and so on.

If that chunk of text has enough in it to carry some sort of meaning by itself, then the chunk is probably big enough. As long as you satisfy that, that should be the criterion for your minimum chunk size. Naturally, for the maximum chunk size, we have the 400 tokens that we just calculated.

With all of that in mind, we need to take a look at how we would actually calculate the size of these chunks, because we're not basing this on character length, we're basing this on token length. In order to do that, we need to look at how to tokenize text using the same tokenizer that our large language model uses, and then we can actually count the number of tokens within each chunk.

To get started with that, we are going to be using the tiktoken tokenizer. Now, this is specific to OpenAI models; if you're using Cohere, Hugging Face models, and so on, the approach will be slightly different. First, we want to get our encoding. There are multiple tiktoken encodings that OpenAI uses.

This is just one of those. Now, let's initialize that, and I'll talk a little bit about where we're getting these encoders from. You can actually find details for the tokenizers at this link here, which goes to the tiktoken GitHub repo, to tiktoken/model.py. I'm going to click through to that.

This is in the OpenAI tiktoken repository on GitHub. You can see we have this MODEL_TO_ENCODING dictionary here, which maps each of the models to the particular encoding that it uses. We are going to use the GPT-3.5 Turbo model, which uses cl100k_base.

Most of the more recent models, the models that you'd be using at the time of recording this video, all use this encoding. The most up-to-date embedding model uses cl100k_base, the chat models like GPT-3.5 Turbo use cl100k_base, and GPT-4 also uses it.

The only model that is still really relevant and doesn't use that encoding is text-davinci-003; that one uses p50k_base. In reality, you don't even need to go there to find out which encoding you need to use.

You can actually just call tiktoken.encoding_for_model, run it, and you get cl100k_base back. That's how we know. Now, anything else? I think that is pretty much it. Here I'm creating this tiktoken length function: it takes some text and uses the tokenizer to calculate the length of that text in terms of tiktoken tokens.
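Here's a sketch of that setup; the function name is my own choice, but the tiktoken calls are the standard ones:

```python
import tiktoken

# the encoding used by gpt-3.5-turbo (and gpt-4); you could also look it up
# with tiktoken.encoding_for_model("gpt-3.5-turbo")
tokenizer = tiktoken.get_encoding("cl100k_base")

def tiktoken_len(text: str) -> int:
    """Count how many tokens a piece of text contains."""
    tokens = tokenizer.encode(text, disallowed_special=())
    return len(tokens)

print(tiktoken_len("hello, this is a quick test of the token counter"))
```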

That's important because we need to use it for our LangChain text splitter in a moment. We create that. Then, before we jump into the whole chunking component, I first want to have a look at what the lengths of the documents look like at the moment.

I'm going to calculate the token counts using that tiktoken length function. Coming down here, we can see the minimum, maximum, and average number of tokens. The smallest document contains just 45 tokens; that's probably a page we don't really need, since it probably doesn't contain anything useful. The maximum is almost 58,000 tokens, which is really big.

I'm not sure what that one is, but the average is a bit more normal at around 1,300 tokens. We can visualize the distribution of those pages and the number of tokens they have. The vast majority of pages sit more towards the 1,000-token range, as we can see here. All right, cool.
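Roughly how those stats come out of the length function sketched above:

```python
token_counts = [tiktoken_len(doc.page_content) for doc in docs]

print(f"min: {min(token_counts)}")                           # 45 in this walkthrough
print(f"avg: {int(sum(token_counts) / len(token_counts))}")  # around 1,300
print(f"max: {max(token_counts)}")                           # almost 58,000
```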

Now, let's continue and look at how we're going to chunk everything. Again, we're using LangChain here: we're using a text splitter, specifically the RecursiveCharacterTextSplitter. This is, I think, probably one of the best chunkers or text splitters that LangChain offers at the moment.

It's very general purpose. They do also offer some text splitters that are more specific to Markdown, for example, but I like this one; you can use it for a ton of things. Let me just explain it very quickly. Basically, it's going to take your length function, so the tiktoken length function, and it's going to say, "I need to split your text so that each chunk does not go over this chunk size here," so this 400.

It's going to split based on the separators. The reason we have multiple separators is that it starts by trying to find double newlines (this "\n\n" is the double newline separator), and it will try to split on that first. If it can't find a good split using the double newline characters, it will try a single newline, then a space, and as a very last resort it will just split on anything.

Cool. Then one final thing that we have here is this chunk overlap. The chunk overlap is saying that for every chunk, we are going to overlap it with the next chunk by 20 tokens. Let me draw that out so it makes more sense. Imagine we have a ton of text.

There's loads of text here. Now, we are going to take a chunk of 400 tokens. Let's say that chunk takes us from here all the way to, say, here; we have 400 tokens in this chunk. Then the next chunk, if we don't have any chunk overlap, would be the next 400 tokens from that point.

Let's say it goes to here. This comes with a problem, because we don't know what this information here and this information here are about. They could be related, and we might be missing out on some important information by just splitting in the middle here. It's important to try and avoid that if possible.

The most naive approach to dealing with this is to include a chunk overlap. What we would do is take the 20 tokens behind this point: we're going to go back 20 tokens, which maybe comes to here. That means this space here is now going to be shared by the first chunk and the next chunk, which also brings the start of the next chunk back to something like here.

Now, we have chunk one here, which goes from here up to here. Then we have chunk two, which is from here to here. Following on from that, we would also add another chunk overlap for number three. Number three would go from here to, let's say, here. Finally, for number four, we go from here to here.

The chunk overlap is just to make sure that we're not missing any important connections between our chunks. It does mean that we're going to have a little bit more data to store, because we're storing these 20-token overlaps in multiple places. But I think that's usually worth it for the better performance you get by not missing that important information, that important connection between chunks.
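Putting those pieces together, the splitter initialization looks something like this (reusing the tiktoken_len sketch from earlier):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,                      # max tokens per chunk
    chunk_overlap=20,                    # tokens shared between neighbouring chunks
    length_function=tiktoken_len,        # measure length in tokens, not characters
    separators=["\n\n", "\n", " ", ""],  # tried in order, from double newline down to "anything"
)
```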

We initialize that. Then, to actually split the text, we use text_splitter.split_text. We're going to take docs[5] and pass in its page content, which is just the plain text. Based on the parameters that we set here (a chunk size of 400 and a chunk overlap of 20, using the tiktoken length function), we get two chunks.
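For a single page, that split is just this (docs[5] is simply the example page from the walkthrough):

```python
chunks = text_splitter.split_text(docs[5].page_content)

print(len(chunks))                        # two chunks for this page
print([tiktoken_len(c) for c in chunks])  # both counts stay under the 400-token limit
```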

Let's have a look at the length of those two chunks. The first chunk that we get is 346 tokens; the next one is 247. Both are within that upper limit of 400. You can see that it's not necessarily going to split at exactly 400 tokens, because we have the specific separators that we would like to use.

It's going to optimize for that first separator wherever it can. So we're not going right up to the limit with every single chunk, which is fine; that's kind of ideal, and we don't necessarily need to pack in a ton of text there. That's it for a single document. What we're going to do now is repeat that over the entire dataset.

The final format that I want to create here is going to look like this. We're going to have the ID, we're going to have our text, and we're going to have the source where this text is actually coming from. One thing that you'll notice here is the ID. We're going to create an ID and that ID will be unique to each page.

We're going to have multiple chunks for each page. That means we're also going to add in this chunk identifier onto the end of the ID to make sure that every ID for every chunk is actually unique. Let me show you how we're going to create that. Essentially, we have the URL here.

We're going to replace the rtdocs prefix that we have here with the actual https:// protocol. I'm just going to print it out so you can see what it is. Then we're going to take that URL and pass it into hashlib.md5. This is just a hashing function that is going to take our URL and hash it into a unique identifier.

This is useful because, if we update this text or this dataset at some point in the future, we can use the same hashing function to create our unique IDs. That means that when we update this particular page, it will just overwrite the previous versions of that item, because we're using the same ID.
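In code, that ID creation is roughly this; the string replacement mirrors the earlier source-to-URL trick:

```python
import hashlib

url = doc.metadata["source"].replace("rtdocs/", "https://")
print(url)

# hash the URL into a short, stable identifier for the page
uid = hashlib.md5(url.encode("utf-8")).hexdigest()
print(uid)
```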

Of course, we can't use the same ID for every single chunk of a page, so we also need to add in this chunk identifier, which is just a count of the chunks within the page. We can see that being created here. These are just two examples from the page we just looked at.

So you can see we have the chunk identifier, and indeed the chunks are different. This one says "language model cascades, ICE primer books, Socratic models". Okay, whatever. Let's take a look at what is at the end of the first item. It should be something similar, because there should be the overlap that I mentioned.

You can see "language model cascades, ICE primer books, Socratic models", the same thing. Cool, so there is the overlap. Now what we need to do is repeat this same logic across our entire dataset. To do that, we do the same thing we just did: we take the URL, create our unique ID, get the chunks using the text splitter, and then append them all to our documents list, as sketched below.
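A sketch of that loop, under the same assumptions as before (page-URL hash plus a chunk counter to keep every ID unique):

```python
documents = []

for doc in docs:
    url = doc.metadata["source"].replace("rtdocs/", "https://")
    uid = hashlib.md5(url.encode("utf-8")).hexdigest()
    chunks = text_splitter.split_text(doc.page_content)
    for i, chunk in enumerate(chunks):
        documents.append({
            "id": f"{uid}-{i}",  # page hash + chunk counter
            "text": chunk,
            "source": url,
        })

print(len(documents))  # 2,212 chunks in the walkthrough run
```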

That's just going to be where we store everything. Okay. Now, the number of documents an hour ago was a little bit lower; now it is 2,212 documents. Cool. We can now save them to a JSON lines file. To do that, we just do this. JSON lines is basically what you can see here.

If we take a look at the first five documents, it's this, just written out to a JSON lines file. You can see it here, same thing. Then, once you've saved it and created your JSONL file, you just load it from the file like this: open train.jsonl from wherever you saved it, and load it iteratively like that.
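A minimal version of that save and load, assuming the documents list built above:

```python
import json

# save: one JSON object per line
with open("train.jsonl", "w", encoding="utf-8") as f:
    for doc in documents:
        f.write(json.dumps(doc) + "\n")

# load: read it back line by line
documents = []
with open("train.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        documents.append(json.loads(line))
```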

You can take a look. Yeah, okay, great. That's how you would load it. Now, a couple of things here. The reason that we're using JSONL, and the reason I'm calling this train.jsonl, is that it makes the data very compatible with Hugging Face Datasets, which is essentially a way of sharing your dataset with others, or just making it more accessible for yourself if you set it to be a private dataset.

What I want to do is just show you how we can actually go about doing that as well. The first thing that we need to do is go to huggingface.co. That will bring you to the Hugging Face homepage, which may look different for you if you don't already have an account on Hugging Face.

If you do need an account or you need to sign in, there will be a little button over here that says sign up or log in. You would follow that, create your account or log in. Then you will see something like this, at which point you go over to your profile.

We click New Dataset and we give our dataset a name. I'm going to call it langchain-docs; you can obviously call this whatever you want. You can set it to private if you want to keep this dataset private. For me, I'm just going to leave it as public. You create your dataset.

This page here is like the homepage of your dataset. You go to Files, then Add file, Upload files. Then you just need to drag the train.jsonl file in here. For me, that is here; I'm just going to go and drag that in.

We go down, commit changes to main. We have now uploaded that. We can go click on files here and we'll be able to see that we have the train.jsonl file in there. Now, to actually use that in our code, we would need to pip install datasets. This is the library for HuggingFace datasets.

Then we would write this: from datasets import load_dataset, and then our data is load_dataset with the name of our dataset. Let's go back to the dataset page; we can find the name at the top here, and it's jamescalam/langchain-docs. We can just copy it and add that in here.

Our split is the training split; that's where the train.jsonl naming comes in. Then we can view the dataset details there. Once that has loaded, we can just extract things: with data[0] we can see that we have our text in there. It's super easy to work with.
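Roughly, with the dataset name from the walkthrough (swap in your own username/dataset):

```python
from datasets import load_dataset

# dataset path from the walkthrough; replace with your own user/dataset-name
data = load_dataset("jamescalam/langchain-docs", split="train")

print(data[0]["text"])
```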

That's why I recommend storing your data on Hugging Face Datasets if you want to share it. Even if you want to keep it private, you can do that as well; you just need an API key, I think, and that's pretty much it. That's it for this video.

I just wanted to cover some of the approaches we take when considering how to chunk text and process it for large language models, and also how we might store that data later on. Both of these are things I think we miss a lot in the typical videos.

We're usually focusing on the large language model, the retrieval augmentation, or whatever else, when in reality this is probably one of the most important parts of the entire process, and we skip over it pretty often. Anyway, that's it for this video. Thank you very much for watching. I hope this has all been useful and interesting.

I will see you again in the next one. Bye.