
LangChain Data Loaders, Tokenizers, Chunking, and Datasets - Data Prep 101


Chapters

0:00 Data preparation for LLMs
0:45 Downloading the LangChain docs
3:29 Using LangChain document loaders
5:54 How much text can we fit in LLMs?
11:57 Using tiktoken tokenizer to find length of text
16:02 Initializing the recursive text splitter in LangChain
17:25 Why we use chunk overlap
20:23 Chunking with RecursiveCharacterTextSplitter
21:37 Creating the dataset
24:50 Saving and loading with JSONL file
28:40 Data prep is important

Whisper Transcript

00:00:00.000 | In this video we are going to take a look at what we need to do and what we need to consider
00:00:05.920 | when we are chunking text for large language models. The best way I can think of
00:00:14.000 | demonstrating this is to walk through an example. Now, we're going to go with what I
00:00:19.280 | believe is kind of like a rule of thumb that I tend to use when I'm chunking text in order to
00:00:24.400 | put it into a large language model, and it doesn't necessarily apply to every use case. You know,
00:00:29.200 | every use case is slightly different but I think this is a pretty good approach at least when we're
00:00:34.400 | using retrieval augmentation and large language models which I think is where the chunking
00:00:40.640 | question kind of comes up most often. So let's jump straight into it. In this example what we're
00:00:46.560 | going to be doing is taking the langchain docs here, literally every page on this website,
00:00:53.760 | and we're going to be downloading those, taking each one of these pages and then we're going to
00:00:59.840 | be splitting them into more reasonably sized chunks. Now how are we going to do this? We're
00:01:06.720 | going to take a look at this notebook here. Now if you'd like to follow along with the code you can
00:01:12.480 | also run this notebook. I will leave a link to it which will appear somewhere near the top of the
00:01:17.760 | video right now. Now to get started we're going to be using a few Python libraries. Langchain is
00:01:23.760 | a pretty big one here so not only is it the documentation that we're downloading but it's
00:01:29.440 | also going to be how we download that documentation and it's also going to be how we split that
00:01:35.600 | documentation into chunks. Another dependency here is the tiktoken tokenizer. We'll talk about
00:01:42.160 | that later and we're just going to visualize and make things a little bit easier to follow with
00:01:45.840 | these libraries here. In this example, the first thing we're going to do is download all of the docs from
00:01:54.080 | LangChain. Everything is contained within this, the top level page of the LangChain docs. We're
00:02:01.360 | going to save everything into this directory here and we are going to say we want to get all of the
00:02:10.080 | .html files. We run that and that will take a moment just to download everything. There's a lot
00:02:18.480 | in there. My internet connection is also pretty slow so it will probably take me a moment but
00:02:25.040 | let's go ahead and just have a look at where these are being downloaded. If we come over to the left
00:02:31.120 | here we can see there is the rtdocs directory there and inside rtdocs we have this
00:02:39.040 | langchain.readthedocs.io/en/latest path, which is just kind of like the path of our docs. In there you can see
00:02:48.080 | everything's been downloaded. We have the index page which I think is the top level page. You can
00:02:54.480 | see it's just HTML. We're not going to process this ourselves; we're going to use LangChain to clean this up,
00:03:02.080 | but if we come down a little bit I think maybe we can see something. This is the first page,
00:03:11.120 | welcome to langchain, LLMs are emerging as a transformative technology, so on and so on.
00:03:17.440 | We have some other things, other pages. We're just going to process all of this.
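For reference, the download cell described above boils down to a recursive wget call run from a notebook cell, roughly like the sketch below. The docs URL and the rtdocs output directory are assumptions based on what is shown on screen:

```python
# Notebook cell: mirror every .html page of the docs into a local "rtdocs" folder.
# -r = recursive, -A.html = only keep .html files, -P rtdocs = output directory.
!wget -r -A.html -P rtdocs "https://langchain.readthedocs.io/en/latest/"
```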
00:03:22.960 | Back to our code. It's done downloading now. We can come down to here and what we're going to do
00:03:30.560 | is use the LangChain document loaders and we're going to use the ReadTheDocs loader. Read the Docs is a specific
00:03:37.440 | template that is used quite often for documentation for code libraries. LangChain includes a document
00:03:46.960 | loader that is specifically built for reading that type of documentation or those HTML pages
00:03:53.120 | and processing them into a nicer format. It's really easy to use. We just point it to our
00:04:00.640 | directory that we just created. What are we doing here? We're loading those docs and here I'm just
00:04:09.280 | printing out the length of those docs so that we can see. We have 390 HTML pages that have been
00:04:16.080 | downloaded there. For some reason, when I ran this about an hour ago, they actually had 389;
00:04:25.280 | now they have 390 pages, so it's already out of date. Cool. Let's have a look at one of those
00:04:32.640 | pages. We have this document object. Inside that we have page content, which is all of our text.
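A rough sketch of that loading step, assuming the pages were downloaded into rtdocs as above (the loader lived under langchain.document_loaders in the version used here):

```python
from langchain.document_loaders import ReadTheDocsLoader

# Point the loader at the directory that wget created and parse every HTML page.
loader = ReadTheDocsLoader("rtdocs")
docs = loader.load()

print(len(docs))             # number of pages, e.g. 390
print(docs[0].page_content)  # the cleaned plain text of the first page
```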
00:04:40.960 | If we want to print that in a nicer format, we can see this. Looks pretty good. There are some
00:04:51.040 | messy parts to this, but it's not really a problem. We could try and process that if we wanted to,
00:04:59.440 | but honestly, I don't really think it's worth it because a large language model can
00:05:04.480 | handle this very easily. I personally wouldn't really bother with that. I'd just take it as it
00:05:11.920 | is. Now, at the end of this object, we come right to the end if it lets me, we see that we have this
00:05:22.000 | metadata here. Inside the metadata we have the source, which is in this case the file path,
00:05:30.480 | but fortunately the way that we've set this up is that we can just replace rtdocs with
00:05:36.000 | https:// and that will give us a URL for this particular file. Let's come down here and you
00:05:43.040 | can see that's what I'm doing here: replace rtdocs with https://. Cool. Then we can click that
00:05:51.200 | and we come over to here. Now, this is where we start talking about the chunking of what we're
00:05:58.960 | doing. When we are thinking about chunking, there are a few things to consider. The first thing to
00:06:08.400 | consider is how much text or how many tokens can our large language model or whatever process is
00:06:16.400 | what we're doing, how many tokens can it handle? What is optimal for our particular use case?
00:06:22.480 | The use case that I'm envisioning here is retrieval augmentation for question answering
00:06:30.240 | using a larger language model. What does that mean exactly? It's probably best if I draw it out.
00:06:36.560 | We're going to have our large language model over here and we're going to ask it questions. We have
00:06:42.480 | our question over here. It's supposed to be a Q. It's fine. We have our question. We're
00:06:49.360 | going to say, "What is the LLM chain in LangChain?" If we pass that straight into our large language
00:06:57.360 | model, at the moment using GPT 3.5 Turbo, even GPT 4, they can't answer that question because they
00:07:04.960 | don't know what the LangChain library is. In this scenario, what we would do is we'd go to a vector
00:07:13.600 | database. We don't really need to go into too much detail here. We go to the vector database, which is
00:07:20.000 | where we store all of the documents that we're processing now, all those LangChain docs. They
00:07:25.760 | would end up within that space and they would be retrieved. We would pass in five or so of these
00:07:33.600 | chunks of text that are relevant to our particular query alongside our original query.
00:07:42.320 | What you'd end up with is rather than, let's say this is your prompt, you typically have your
00:07:48.240 | query. Rather than just a query, you'd have your query and then you'd also have these five bits of
00:07:57.200 | relevant information below the query. That would all go into the large language model. You would
00:08:03.280 | essentially say to it, you'd probably have some instructions near the top and those instructions
00:08:08.320 | would say, I want you to answer this question. You'd maybe give the question here, or give it a
00:08:15.760 | bit later on, using the context that we have provided. You would basically, in front of these
00:08:21.920 | contexts, you would write context. The large language model will answer the question based
00:08:28.560 | on those contexts. That's the scenario we're envisioning here. In this scenario, if we want to
00:08:37.280 | input five of these contexts into each one of our retrieval augmented queries, we need to think,
00:08:46.080 | what is the max token limit of our large language model and how much of that space can be reserved
00:08:53.440 | for these contexts? In this scenario, let's say that we're using GPT 3.5 Turbo. The token limit
00:09:03.520 | for GPT 3.5 Turbo is something like 4,096 tokens. This includes both the input and the output. You have your large language model.
00:09:17.200 | I'm going to put that here. Pretend this is your large language model. This 4,096 includes the
00:09:24.480 | input to the large language model, so all of your input tokens, and also all of your generated
00:09:33.600 | output tokens. Basically, we can't just use that full 4,000 tokens on the input. We need to leave
00:09:42.800 | some space for the output. Also, within the input, we have other components. It's not just the
00:09:49.280 | context, but we also have the query. That's supposed to say query. As well as that, we might
00:10:00.560 | also have some instructions. I don't know why I'm writing so badly. As well as the instructions,
00:10:11.040 | we might also have a bit of chat history if this is a chatbot. Basically, the amount of contexts
00:10:18.960 | that we can feed in is pretty limited. In this scenario, let's just assume that we can pass in
00:10:26.000 | a context of around half of the 4,000 tokens. We'll say 2,000 is going to be our limit. If 2,000
00:10:34.240 | is our limit, that means we need to divide that by five because those 2,000 tokens need to be
00:10:43.840 | shared by our five contexts, which leaves us with about 400 of these tokens per context. That's our maximum chunk size.
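As a tiny worked version of that arithmetic (the 50/50 split of the context window is the rough assumption used here, not a fixed rule):

```python
# Rough token budget for retrieval-augmented QA with gpt-3.5-turbo.
context_window = 4096                 # input + output tokens combined
context_budget = context_window // 2  # reserve roughly half for instructions, query, history, output
num_contexts = 5                      # chunks retrieved per query

max_chunk_size = context_budget // num_contexts
print(max_chunk_size)  # 409 -> round down to roughly 400 tokens per chunk
```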
00:10:55.520 | Now, one question that we might have here is, could we reduce the number
00:11:01.440 | of tokens further? For sure, we can. I would say the minimum number of tokens that you need within
00:11:09.200 | a context is this: when you read the context, does it make sense? If you have enough words in there
00:11:18.000 | for that context to make sense to you as a human being, then that means that it is probably enough
00:11:25.840 | to feed as a chunk of text into a large language model, into an embedding model, and so on. If that
00:11:34.240 | chunk of text has enough text in there to have some sort of meaning to itself, then the chunk
00:11:41.040 | is probably big enough. As long as you satisfy that, that should be the criteria for your minimum
00:11:48.000 | size of that chunk of text. Naturally, for the maximum size of chunk of text, we have the 400
00:11:54.720 | tokens that we just calculated now. With all of that in mind, we need to take a look at how
00:12:01.200 | we would actually calculate the size of these chunks, because we're not basing this on character
00:12:08.720 | length, we're basing this on token length. In order to do that, we need to look at how to tokenize
00:12:15.360 | text using the same tokenizer that our large language model uses, and then we can actually
00:12:22.960 | count the number of tokens within each chunk. Getting started with that, we are going to be
00:12:29.680 | using the tiktoken tokenizer. Now, this is specific to OpenAI models. Obviously, if you're
00:12:35.120 | using Cohere, HuggingFace, and so on, this is going to be a slightly different approach.
00:12:39.280 | First, we want to get our encoding. There are multiple tiktoken tokenizers that OpenAI uses.
00:12:48.000 | This is just one of those. Now, let's initialize that, and I will talk a little bit about where
00:12:53.920 | we're getting these encoders from. You can actually find details for the tokenizer at
00:13:00.000 | this link here. The link is to the GitHub repo, tiktoken, specifically tiktoken/model.py. I'm going to click
00:13:08.880 | through to that. This is in the OpenAI tiktoken repository on GitHub. You can see we have this
00:13:15.360 | MODEL_TO_ENCODING dictionary here. Within this, you can see that we have a mapping from each of
00:13:22.080 | the models to the particular tokenizer that it uses. We are going to use the GPT-3.5 Turbo model,
00:13:29.360 | which uses cl100k_base. I would say I think most of the more recent models, like the models
00:13:37.120 | that you'll be using at the time of recording this video, they all use this encoder. The embeddings
00:13:46.240 | model that is the most up-to-date uses cl100k_base. ChatGPT's model, GPT-3.5 Turbo, uses cl100k_base.
00:13:55.840 | GPT-4 also uses it. The only one that is still kind of a relevant model is the text-davinci-003
00:14:03.600 | model. That is the only relevant model that doesn't use that encoder. This one
00:14:10.960 | uses p50k_base. In reality, you don't even need to go there to find out the encoding that you need
00:14:18.240 | to use. You can actually just see this: tiktoken, encoding_for_model, and you can run this. You get
00:14:25.440 | cl100k_base. That's how we know. Now, anything else? I think that is pretty much it.
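A quick sketch of that lookup with tiktoken (the model name here is the one mentioned in the video):

```python
import tiktoken

# Ask tiktoken which encoding a given model uses...
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
print(encoding.name)  # cl100k_base

# ...or grab the encoding directly by name.
tokenizer = tiktoken.get_encoding("cl100k_base")
```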
00:14:33.760 | Actually, here I'm creating this tiktoken length function. That is going to take some text.
00:14:41.680 | It's going to use the tokenizer to calculate the length of that text in terms of tiktoken tokens.
00:14:49.920 | That's important because we need to use that for our LangChain text splitter in a moment.
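A minimal version of that length function might look like this (the helper name tiktoken_len is just a label used here, and passing disallowed_special=() is an assumption so that special-token strings in the docs don't raise errors):

```python
import tiktoken

tokenizer = tiktoken.get_encoding("cl100k_base")

def tiktoken_len(text: str) -> int:
    """Return the number of tiktoken tokens in a piece of text."""
    tokens = tokenizer.encode(text, disallowed_special=())
    return len(tokens)
```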
00:14:55.760 | We create that. Then what we can do is just first, before we jump into the whole chunking component,
00:15:05.840 | I want to have a look at what the length of the documents looks like at the moment. I'm going
00:15:11.760 | to calculate the token counts using the tiktoken length function. Come to here, we can see the minimum,
00:15:18.720 | maximum, and average number of tokens. The smallest document contains just 45 tokens.
00:15:24.400 | This is probably a page that we don't really need. It probably doesn't contain anything
00:15:32.320 | useful in there. Maximum is almost 58,000 tokens, which is really big. I'm not sure what that is,
00:15:40.640 | but the average is a bit more normal, at about 1,300 tokens. We can visualize the distribution
00:15:49.760 | of those pages and the number of tokens they have. The vast majority of pages are more towards
00:15:58.480 | the 1,000 token range, as we can see here. All right, cool.
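The statistics just described can be computed with a few lines like these (the histogram is optional and assumes matplotlib is installed):

```python
# Token count for every downloaded page.
token_counts = [tiktoken_len(doc.page_content) for doc in docs]

print(f"min: {min(token_counts)}")                          # e.g. 45
print(f"max: {max(token_counts)}")                          # e.g. ~58,000
print(f"avg: {sum(token_counts) / len(token_counts):.0f}")  # e.g. ~1,300

# Optional: visualize the distribution of page lengths.
import matplotlib.pyplot as plt
plt.hist(token_counts, bins=50)
plt.xlabel("tokens per page")
plt.ylabel("pages")
plt.show()
```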
00:16:08.240 | Now, let's continue and look at how we're going to chunk everything. Again, we're using LangChain here. We're using
00:16:13.120 | a text splitter, and we're using the RecursiveCharacterTextSplitter. Now, this is, I think,
00:16:18.400 | probably one of the best chunkers or text splitters that LangChain offers at the moment.
00:16:24.080 | It's very general purpose. They do also offer some text splitters that are more specific
00:16:30.160 | to Markdown, for example, but I like this one. You can use it for a ton of things. Let me just
00:16:39.280 | explain it very quickly. Basically, what it's going to do is it's going to take your length
00:16:45.120 | function, so the tiktoken length function, and it's going to say, "I need to split your text so that each
00:16:51.680 | chunk does not go over this chunk size here," so this 400. It's going to split based on the
00:16:59.040 | separators. The reason we have multiple separators is that it first starts by trying to find double
00:17:05.840 | new lines. This is a double new line separator. It's going to try and split on that first. If it
00:17:10.720 | can't find a good split using the double new line characters, it will just try a single new line,
00:17:18.960 | then it will try a space, and as a very last resort, it will just split on anything.
00:17:24.800 | Cool. Then one final thing that we have here is this chunk overlap. This chunk overlap is saying
00:17:31.120 | for every chunk, we are going to overlap it with the next chunk by 20 tokens. Let me draw that out
00:17:42.320 | so it makes more sense. Imagine we have a ton of text. There's loads of text here.
00:17:49.280 | Now, we are going to get a chunk of 400 tokens. Let's say that chunk takes us from
00:18:01.440 | here all the way to, say, here. We have 400 tokens in this chunk. Then the next chunk,
00:18:11.360 | if we don't have any chunk overlap, would be 400 tokens from this. Let's say it's to here.
00:18:18.080 | This comes with a problem because we don't know what this information here and this information
00:18:27.280 | here is about. They could be related. We might be missing out on some important information
00:18:34.400 | by just splitting in the middle here. It's important to try and avoid that if possible.
00:18:41.680 | The most naive way or naive approach for doing this is to include a chunk overlap.
00:18:47.760 | What we would do is, let's say we take the 20 tokens behind this. We're going to go back
00:18:58.720 | 20 tokens, which maybe comes to here. That means that this space here is now going to be shared by
00:19:09.280 | the first chunk and the next chunk, which will also bring back the next chunk to something like
00:19:18.400 | here. Now, we have chunk one here, which goes from here up to here. Then we have chunk two, which is
00:19:33.440 | from here to here. Following on from that, we would also add another chunk overlap for number
00:19:43.120 | three. Number three would go from here to, let's say, here. Finally, for number four,
00:19:49.040 | we go from here to here. The chunk overlap is just to make sure that we're not missing any
00:19:55.600 | important connections between our chunks. It does mean that we're going to have a little bit more
00:20:03.360 | data to store there, because we're including these chunks of 20 in multiple places.
00:20:10.720 | But I think that's usually worth it in terms of the better performance that you can get by
00:20:17.920 | not missing out that important information, that important connection between chunks.
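Putting those pieces together, the splitter initialization looks roughly like this, with chunk size 400, overlap 20, and the tiktoken length function from earlier:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,                      # max tokens per chunk
    chunk_overlap=20,                    # tokens shared between neighbouring chunks
    length_function=tiktoken_len,        # measure length in tokens, not characters
    separators=["\n\n", "\n", " ", ""],  # preferred split points, tried in order
)
```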
00:20:22.640 | We initialize that. Then, to actually split the text, we use the text splitter,
00:20:30.080 | split_text. We're going to take docs[5], and we're going to take the page content, which is just the
00:20:36.320 | plain text. Based on the parameters that we set here, chunk size of 400 and chunk overlap of 20,
00:20:46.160 | using the tiktoken length function, we get two chunks. Let's have a look at the length of those
00:20:51.760 | two chunks. The first chunk that we get is 346 tokens. Next one, 247. Both within that max upper
00:21:02.800 | end limit of 400. You see that it's not going to necessarily split on the 400 tokens specifically,
00:21:11.440 | because we have the specific separators that we would like to use. It's going to optimize
00:21:18.400 | preferably for this separator. We're not going right up to that limit with every single chunk,
00:21:25.040 | which is fine. That's kind of ideal. We don't necessarily need to put in a ton of text there.
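The single-document split just described amounts to something like this:

```python
# Split one page into token-limited chunks.
chunks = text_splitter.split_text(docs[5].page_content)

print(len(chunks))                                # 2 for this particular page
print([tiktoken_len(chunk) for chunk in chunks])  # e.g. [346, 247], both under 400
```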
00:21:32.320 | That's it for a single document. What we're going to do now is we're going to repeat that
00:21:40.400 | over the entire dataset. The final format that I want to create here is going to look like this.
00:21:46.640 | We're going to have the ID, we're going to have our text, and we're going to have the source
00:21:50.000 | where this text is actually coming from. One thing that you'll notice here is the ID. We're going to
00:22:00.240 | create an ID and that ID will be unique to each page. We're going to have multiple chunks for each
00:22:08.160 | page. That means we're also going to add in this chunk identifier onto the end of the ID to make
00:22:14.320 | sure that every ID for every chunk is actually unique. Let me show you how we're going to create
00:22:21.760 | that. Essentially, we have the URL here. We're going to replace the rtdocs prefix that we have here
00:22:30.480 | with the actual https:// protocol. I'm just going to print it out so you can see what it is. Then we're
00:22:38.160 | going to take that URL, we're going to add it to this hashlib MD5. This is just a hashing function
00:22:45.680 | that is going to take our URL and hash it into a unique identifier. This is useful because if we
00:22:55.040 | are updating this text at some point in the future or this dataset, we can use the same
00:23:02.400 | hashing function to create our unique IDs. That means that when we update this particular page,
00:23:07.680 | it will just overwrite the previous versions of that item because we're using the same ID.
00:23:15.760 | Of course, we can't use the same ID for every single chunk. We also need to add in this here,
00:23:22.640 | which is like the chunk identifier. It's just a count of the number of chunks. We can see that
00:23:30.400 | being created here. These are just two examples from the previous page that we just showed.
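A sketch of that ID construction (the prefix replacement and the truncation of the hash to a short prefix are assumptions based on what is shown on screen):

```python
import hashlib

# Turn the local file path back into the page's URL.
url = docs[5].metadata["source"].replace("rtdocs/", "https://")

# Hash the URL so the same page always produces the same ID.
uid = hashlib.md5(url.encode("utf-8")).hexdigest()[:12]

# Append a chunk counter so every chunk of the page gets a unique ID.
ids = [f"{uid}-{i}" for i in range(len(chunks))]
print(ids)  # e.g. ['<hash>-0', '<hash>-1']
```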
00:23:37.040 | So you can see we have the chunk identifier and indeed the chunks are different. This says
00:23:43.360 | language model cascades, ICE primer books, Socratic models. Okay, whatever. Let's take a look
00:23:50.160 | at what is at the end of the first item. It should be something similar. There should be the overlap
00:23:57.120 | that I mentioned. You can see language model cascades, ICE primer books, Socratic models.
00:24:06.000 | Same thing. Cool. So there is the overlap. Now what we need to do is repeat this same logic
00:24:14.080 | that we've just created across our entire dataset. To do that, same thing that we just did. We're
00:24:19.520 | going to take the URL, we're going to create our unique ID, we're going to take the chunks using
00:24:23.760 | the text splitter, and then we're going to append these all to our documents list here. That's just
00:24:31.280 | going to be where we store everything. Okay.
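The full loop over every page then looks roughly like this, building the id / text / source records described above:

```python
documents = []

for doc in docs:
    url = doc.metadata["source"].replace("rtdocs/", "https://")
    uid = hashlib.md5(url.encode("utf-8")).hexdigest()[:12]
    chunks = text_splitter.split_text(doc.page_content)
    for i, chunk in enumerate(chunks):
        documents.append({
            "id": f"{uid}-{i}",  # page hash + chunk counter
            "text": chunk,       # the chunk of page text
            "source": url,       # where the text came from
        })

print(len(documents))  # e.g. 2212
```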
00:24:39.920 | Now, the length of the documents an hour ago was a little bit less; now it is 2,212 documents. Cool.
00:24:53.040 | We can now save them to a JSON lines (JSONL) file. To do that, we just do this. JSONL is basically what you can see here.
00:25:01.200 | If we take a look at the documents, look at the first five, it's this, but it's just in a JSONL
00:25:08.640 | file. You can see it here. Same thing. Then once you've saved it and you create your JSONL file,
00:25:17.840 | you just load it from file like this: with open on train.jsonl, wherever you saved it,
00:25:25.200 | and you just load it iteratively like that. You can take a look. Yeah. Okay, great. That's how
00:25:33.760 | you would load it.
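Saving and reloading the JSONL file can be done with the standard json module, along these lines:

```python
import json

# Save: one JSON object per line.
with open("train.jsonl", "w") as f:
    for doc in documents:
        f.write(json.dumps(doc) + "\n")

# Load: read the file back line by line.
documents = []
with open("train.jsonl", "r") as f:
    for line in f:
        documents.append(json.loads(line))
```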
00:25:39.840 | Now, a couple of things here. The reason that we're using JSONL, and the reason I'm calling this train.jsonl, is because this makes it very compatible with HuggingFace datasets,
00:25:47.760 | which is essentially a way of sharing your dataset with others, or just making it more
00:25:54.000 | accessible for yourself if you set it to be a private dataset. What I want to do is just show
00:25:59.040 | you how we can actually go about doing that as well. The first thing that we need to do
00:26:04.160 | is go to HuggingFace.co. That will bring you to the first page of HuggingFace, which may look
00:26:10.800 | different to you because you may not already have an account on HuggingFace. If you do need an
00:26:17.680 | account or you need to sign in, there will be a little button over here that says sign up or log
00:26:22.080 | in. You would follow that, create your account or log in. Then you will see something like this,
00:26:27.920 | at which point you go over to your profile. We click new dataset. We give our dataset a name.
00:26:34.000 | I'm going to call it langchain-docs. You can obviously call this whatever you want.
00:26:38.720 | You can set it to private if you want to keep this dataset private. For me, I'm going to just
00:26:43.360 | leave it as public. You create your dataset. On here, this is like the page of your dataset,
00:26:51.040 | like the homepage of your dataset. You go to files. You go to add file, upload files.
00:26:57.200 | Then you just need to drag in the train.jsonl file to here. For me, that is here. I'm just going to
00:27:08.640 | go and drag that in. We go down, commit changes to main. We have now uploaded that. We can go
00:27:17.200 | click on files here and we'll be able to see that we have the train.jsonl file in there.
00:27:22.000 | Now, to actually use that in our code, we would need to pip install datasets. This is
00:27:27.760 | the library for HuggingFace datasets. Then we would write this. We do from datasets
00:27:36.320 | import load_dataset. Then our data would be a load_dataset call.
00:27:45.040 | Here, we need the name of our dataset. Let's go back to the dataset page. We can find that at
00:27:56.000 | the top here. It's jamescalam/langchain-docs. We can just copy it and add that in here.
00:28:02.400 | Our split is the training split. That's where the train.jsonl comes in.
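Loading it back from the Hub then looks like this (the dataset name below follows what is shown in the video; swap in your own username/dataset):

```python
from datasets import load_dataset

# "train" matches the train.jsonl file that was uploaded.
data = load_dataset("jamescalam/langchain-docs", split="train")

print(data[0])  # {'id': ..., 'text': ..., 'source': ...}
```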
00:28:11.520 | Then we can view the dataset details there. Once that has loaded, we will be able to see it, and we can just extract things. With data[0],
00:28:20.960 | we can see that we have our text in there. It's super easy to work with. That's why I recommend
00:28:29.040 | storing your data on HuggingFace datasets if you're wanting to share it. Even if you're wanting
00:28:34.720 | to do the private approach, you can do that as well. You just need, I think it's like an API key
00:28:39.680 | and that's pretty much it. That's it for this video. I just wanted to cover some of the
00:28:45.200 | approaches that we take when we are considering how to chunk our text and actually process it for
00:28:53.760 | large language models and also see how we might store that data later on as well.
00:29:00.400 | Both of these items are things I think we miss a lot in typical videos. We're really focusing on
00:29:07.760 | the large language model processing or the retrieval augmentation or whatever else. This,
00:29:15.600 | in reality, is probably one of the most important parts of the entire process. We miss it pretty
00:29:21.120 | often. Anyway, that's it for this video. Thank you very much for watching. I hope this has all
00:29:27.520 | been useful and interesting. I will see you again in the next one. Bye.