
Let's build the GPT Tokenizer


Chapters

0:00 intro: Tokenization, GPT-2 paper, tokenization-related issues
5:50 tokenization by example in a Web UI (tiktokenizer)
14:56 strings in Python, Unicode code points
18:15 Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32
22:47 daydreaming: deleting tokenization
23:50 Byte Pair Encoding (BPE) algorithm walkthrough
27:02 starting the implementation
28:35 counting consecutive pairs, finding most common pair
30:36 merging the most common pair
34:58 training the tokenizer: adding the while loop, compression ratio
39:20 tokenizer/LLM diagram: it is a completely separate stage
42:47 decoding tokens to strings
48:21 encoding strings to tokens
57:36 regex patterns to force splits across categories
71:38 tiktoken library intro, differences between GPT-2/GPT-4 regex
74:59 GPT-2 encoder.py released by OpenAI walkthrough
78:26 special tokens, tiktoken handling of, GPT-2/GPT-4 differences
85:28 minbpe exercise time! write your own GPT-4 tokenizer
88:42 sentencepiece library intro, used to train Llama 2 vocabulary
103:27 how to set vocabulary set? revisiting gpt.py transformer
108:11 training new tokens, example of prompt compression
109:58 multimodal [image, video, audio] tokenization with vector quantization
111:41 revisiting and explaining the quirks of LLM tokenization
130:20 final recommendations

Whisper Transcript

00:00:00.000 | Hi everyone.
00:00:01.080 | So in this video, I'd like us to cover
00:00:02.560 | the process of tokenization in large language models.
00:00:05.500 | Now, you see here that I have a sad face,
00:00:07.820 | and that's because tokenization is my least favorite part
00:00:11.080 | of working with large language models,
00:00:12.640 | but unfortunately it is necessary to understand
00:00:14.480 | in some detail because it is fairly hairy, gnarly,
00:00:17.580 | and there's a lot of hidden foot guns to be aware of,
00:00:20.400 | and a lot of oddness with large language models
00:00:22.460 | typically traces back to tokenization.
00:00:25.640 | So what is tokenization?
00:00:27.580 | Now, in my previous video,
00:00:29.000 | Let's Build GPT from Scratch,
00:00:31.480 | we actually already did tokenization,
00:00:33.260 | but we did a very naive, simple version of tokenization.
00:00:36.920 | So when you go to the Google Colab for that video,
00:00:40.220 | you see here that we loaded our training set,
00:00:43.160 | and our training set was this Shakespeare data set.
00:00:46.360 | Now, in the beginning, the Shakespeare data set
00:00:49.000 | is just a large string in Python.
00:00:51.000 | It's just text.
00:00:52.280 | And so the question is,
00:00:53.120 | how do we plug text into large language models?
00:00:56.680 | And in this case here,
00:00:59.080 | we created a vocabulary of 65 possible characters
00:01:02.700 | that we saw occur in this string.
00:01:05.120 | These were the possible characters,
00:01:06.680 | and we saw that there are 65 of them.
00:01:08.840 | And then we created a lookup table
00:01:11.320 | for converting from every possible character,
00:01:14.000 | a little string piece, into a token, an integer.
00:01:17.860 | So here, for example, we tokenized the string, hi there,
00:01:22.440 | and we received this sequence of tokens.
00:01:25.760 | And here we took the first 1,000 characters of our data set,
00:01:29.520 | and we encoded it into tokens.
00:01:31.920 | And because this is character level,
00:01:33.920 | we received 1,000 tokens in a sequence.
00:01:37.720 | So token 18, 47, et cetera.
00:01:40.120 | Now, later we saw that the way we plug these tokens
00:01:44.860 | into the language model is by using an embedding table.
00:01:48.540 | And so basically, if we have 65 possible tokens,
00:01:52.680 | then this embedding table is going to have 65 rows.
00:01:56.040 | And roughly speaking, we're taking the integer
00:01:58.960 | associated with every single token,
00:02:00.720 | we're using that as a lookup into this table,
00:02:03.800 | and we're plucking out the corresponding row.
00:02:06.320 | And this row is trainable parameters
00:02:09.500 | that we're going to train using backpropagation.
00:02:11.360 | And this is the vector that then feeds into the transformer.
00:02:15.180 | And that's how the transformer
00:02:16.240 | sort of perceives every single token.
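As a rough sketch of what that lookup looks like in PyTorch (the embedding dimension of 32 here is an illustrative choice, not a value from the previous video):

```python
import torch
import torch.nn as nn

vocab_size = 65                      # 65 possible character-level tokens
emb = nn.Embedding(vocab_size, 32)   # 65 rows of trainable parameters

tokens = torch.tensor([18, 47, 56])  # integer token IDs, e.g. from the Shakespeare set
vectors = emb(tokens)                # each ID plucks out its corresponding row
print(vectors.shape)                 # torch.Size([3, 32]); these vectors feed the transformer
```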
00:02:18.080 | So here we had a very naive tokenization process
00:02:22.660 | that was a character level tokenizer.
00:02:24.960 | But in practice, in state-of-the-art language models,
00:02:27.920 | people use a lot more complicated schemes, unfortunately,
00:02:30.640 | for constructing these token vocabularies.
00:02:34.520 | So we're not dealing on the character level,
00:02:37.640 | we're dealing on chunk level.
00:02:39.320 | And the way these character chunks are constructed
00:02:43.080 | is using algorithms such as, for example,
00:02:45.240 | the byte pair encoding algorithm,
00:02:46.780 | which we're going to go into in detail
00:02:48.760 | and cover in this video.
00:02:52.200 | I'd like to briefly show you the paper
00:02:53.480 | that introduced byte level encoding
00:02:56.060 | as a mechanism for tokenization
00:02:57.720 | in the context of large language models.
00:02:59.880 | And I would say that that's probably the GPT-2 paper.
00:03:02.600 | And if you scroll down here
00:03:04.840 | to the section input representation,
00:03:07.200 | this is where they cover tokenization,
00:03:09.120 | the kinds of properties
00:03:09.960 | that you'd like the tokenization to have.
00:03:12.380 | And they conclude here
00:03:13.220 | that they're going to have a tokenizer
00:03:16.060 | where you have a vocabulary of 50,257 possible tokens.
00:03:21.620 | And the context size is going to be 1,024 tokens.
00:03:26.320 | So in the attention layer
00:03:28.400 | of the transformer neural network,
00:03:30.520 | every single token is attending
00:03:32.120 | to the previous tokens in the sequence.
00:03:33.920 | And it's going to see up to 1,024 tokens.
00:03:37.440 | So tokens are this like fundamental unit,
00:03:40.500 | the atom of large language models, if you will.
00:03:43.740 | And everything is in units of tokens,
00:03:45.280 | everything is about tokens.
00:03:47.000 | And tokenization is the process for translating strings
00:03:49.840 | or text into sequences of tokens and vice versa.
00:03:54.840 | When you go into the LLAMA-2 paper as well,
00:03:57.280 | I can show you that when you search token,
00:03:59.040 | you're going to get 63 hits.
00:04:01.640 | And that's because tokens are, again, pervasive.
00:04:04.340 | So here, they mentioned that they trained
00:04:05.640 | on 2 trillion tokens of data and so on.
00:04:08.040 | So we're going to build our own tokenizer.
00:04:12.000 | Luckily, the byte pair encoding algorithm
00:04:13.560 | is not that super complicated,
00:04:16.040 | and we can build it from scratch ourselves,
00:04:17.880 | and we'll see exactly how this works.
00:04:19.680 | Before we dive into code,
00:04:20.760 | I'd like to give you a brief taste
00:04:22.560 | of some of the complexities that come from the tokenization,
00:04:25.320 | because I just want to make sure
00:04:26.520 | that we motivate it sufficiently
00:04:27.800 | for why we are doing all of this and why this is so gross.
00:04:31.720 | So tokenization is at the heart of a lot of weirdness
00:04:34.720 | in large language models,
00:04:35.880 | and I would advise that you do not brush it off.
00:04:38.800 | A lot of the issues that may look like just issues
00:04:41.560 | with the neural network architecture
00:04:43.080 | or the large language model itself
00:04:45.380 | are actually issues with the tokenization
00:04:47.680 | and fundamentally trace back to it.
00:04:50.360 | So if you've noticed any issues
00:04:52.180 | with large language models,
00:04:53.560 | like, you know, not being able to do spelling tasks very easily,
00:04:57.040 | that's usually due to tokenization.
00:04:59.100 | Simple string processing can be difficult
00:05:01.180 | for the large language model to perform natively.
00:05:03.820 | Non-English languages can work much worse,
00:05:07.220 | and to a large extent, this is due to tokenization.
00:05:10.260 | Sometimes LLMs are bad at simple arithmetic;
00:05:12.860 | that can also be traced to tokenization.
00:05:16.740 | GPT-2 specifically would have had quite a bit more issues
00:05:19.460 | with Python than future versions of it due to tokenization.
00:05:23.860 | There's a lot of other issues.
00:05:24.820 | Maybe you've seen weird warnings
00:05:25.980 | about a trailing white space.
00:05:27.260 | This is a tokenization issue.
00:05:28.700 | If you had asked GPT earlier
00:05:32.820 | about Solid Gold Magikarp and what it is,
00:05:35.080 | you would see the LLM go totally crazy,
00:05:37.260 | and it would start going off
00:05:38.880 | about a completely unrelated tangent topic.
00:05:41.660 | Maybe you've been told to use YAML over JSON
00:05:43.700 | in structured data.
00:05:44.780 | All of that has to do with tokenization.
00:05:47.060 | So basically, tokenization is at the heart of many issues.
00:05:50.420 | I will loop back around to these at the end of the video,
00:05:53.200 | but for now, let me just skip over it a little bit,
00:05:56.820 | and let's go to this web app,
00:05:58.720 | tiktokenizer.vercel.app.
00:06:02.220 | So I have it loaded here,
00:06:03.580 | and what I like about this web app
00:06:04.660 | is that tokenization is running
00:06:06.360 | sort of live in your browser in JavaScript.
00:06:09.260 | So you can just type here stuff, hello world,
00:06:11.980 | and the whole string re-tokenizes.
00:06:14.120 | So here what we see on the left
00:06:18.780 | is a string that you put in.
00:06:20.100 | On the right, we're currently using a GPT-2 tokenizer.
00:06:23.220 | We see that this string that I pasted here
00:06:25.640 | is currently tokenizing into 300 tokens,
00:06:28.660 | and here they are sort of shown explicitly
00:06:31.700 | in different colors for every single token.
00:06:34.340 | So for example, this word tokenization became two tokens,
00:06:38.900 | the token 30,642 and 1,634.
00:06:43.900 | The token space is, is token 318.
00:06:50.060 | So be careful, on the bottom, you can show whitespace,
00:06:53.300 | and keep in mind that there are spaces
00:06:55.580 | and slash and newline characters in here,
00:06:58.560 | but you can hide them for clarity.
00:07:00.840 | The token space at is token 379.
00:07:06.500 | The token space the is 262, et cetera.
00:07:11.100 | So you notice here that the space
00:07:12.340 | is part of that token chunk.
00:07:15.320 | Now, so this is kind of like how
00:07:18.740 | our English sentence broke up,
00:07:20.540 | and that seems all well and good.
00:07:23.180 | Now here, I put in some arithmetic.
00:07:25.900 | So we see that the token 127 plus,
00:07:29.860 | and then token six, space six, followed by 77.
00:07:34.060 | So what's happening here is that 127 is feeding in
00:07:36.740 | as a single token into the large language model,
00:07:39.420 | but the number 677 will actually feed in
00:07:43.700 | as two separate tokens.
00:07:45.900 | And so the large language model has to sort of
00:07:48.740 | take account of that and process it correctly
00:07:52.300 | in its network.
00:07:53.940 | And see here, 804 will be broken up into two tokens,
00:07:57.180 | and it's all completely arbitrary.
00:07:59.060 | And here I have another example of four-digit numbers,
00:08:01.660 | and they break up in a way that they break up
00:08:03.900 | and it's totally arbitrary.
00:08:04.860 | Sometimes you have multiple digits, a single token.
00:08:08.220 | Sometimes you have individual digits as many tokens,
00:08:11.420 | and it's all kind of pretty arbitrary
00:08:12.860 | and coming out of the tokenizer.
00:08:14.460 | Here's another example.
00:08:17.260 | We have the string egg,
00:08:20.220 | and you see here that this became two tokens.
00:08:23.420 | But for some reason, when I say I have an egg,
00:08:26.320 | you see when it's a space egg,
00:08:29.780 | it's actually a single token.
00:08:31.980 | So just egg by itself in the beginning of a sentence
00:08:34.460 | is two tokens, but here as a space egg
00:08:37.460 | is suddenly a single token for the exact same string.
00:08:41.260 | Here, lowercase egg turns out to be a single token.
00:08:46.020 | And in particular, notice that the color is different,
00:08:47.860 | so this is a different token.
00:08:49.440 | So this is case sensitive.
00:08:51.180 | And of course, capital egg would also be different tokens.
00:08:55.380 | And again, this would be two tokens arbitrarily.
00:08:59.220 | So for the same concept, egg,
00:09:01.580 | depending on if it's in the beginning of a sentence,
00:09:03.320 | at the end of a sentence, lowercase, uppercase, or mixed,
00:09:06.340 | all of this will be basically very different tokens
00:09:09.020 | and different IDs.
00:09:10.180 | And the language model has to learn from raw data,
00:09:12.500 | from all the internet text
00:09:13.580 | that it's going to be training on,
00:09:14.980 | that these are actually all the exact same concept.
00:09:17.460 | And it has to sort of group them
00:09:18.860 | in the parameters of the neural network
00:09:20.940 | and understand just based on the data patterns
00:09:23.300 | that these are all very similar,
00:09:24.860 | but maybe not almost exactly similar,
00:09:26.980 | but very, very similar.
00:09:30.700 | After the demonstration here,
00:09:32.500 | I have an introduction from OpenAI's ChatGPT in Korean.
00:09:37.500 | So, (speaking in foreign language)
00:09:40.860 | et cetera.
00:09:42.020 | So this is in Korean.
00:09:43.380 | And the reason I put this here
00:09:44.700 | is because you'll notice that
00:09:47.020 | non-English languages work slightly worse in ChatGPT.
00:09:53.360 | Part of this is because of course,
00:09:54.900 | the training data set for ChatGPT
00:09:56.500 | is much larger for English than for everything else.
00:09:59.380 | But the same is true,
00:10:00.340 | not just for the large language model itself,
00:10:02.540 | but also for the tokenizer.
00:10:04.380 | So when we train the tokenizer,
00:10:05.860 | we're going to see that there's a training set as well.
00:10:07.980 | And there's a lot more English than non-English.
00:10:10.540 | And what ends up happening
00:10:11.820 | is that we're going to have a lot more longer tokens
00:10:15.700 | for English.
00:10:16.580 | So how do I put this?
00:10:19.380 | If you have a single sentence in English
00:10:21.260 | and you tokenize it,
00:10:22.420 | you might see that it's 10 tokens or something like that.
00:10:25.180 | But if you translate that sentence into say Korean
00:10:27.420 | or Japanese or something else,
00:10:29.220 | you'll typically see that number of tokens used
00:10:31.220 | is much larger.
00:10:32.380 | And that's because the chunks here
00:10:34.380 | are a lot more broken up.
00:10:35.920 | So we're using a lot more tokens for the exact same thing.
00:10:40.420 | And what this does is it bloats up the sequence length
00:10:43.420 | of all the documents.
00:10:44.960 | So you're using up more tokens
00:10:46.980 | and then in the attention of the transformer,
00:10:49.060 | when these tokens try to attend to each other,
00:10:50.860 | you are running out of context
00:10:53.180 | in the maximum context length of that transformer.
00:10:56.500 | And so basically all the non-English text
00:10:59.700 | is stretched out from the perspective of the transformer.
00:11:03.500 | And this just has to do with the training set
00:11:06.220 | that's used for the tokenizer and the tokenization itself.
00:11:09.380 | So it will create a lot bigger tokens
00:11:11.480 | and a lot larger groups in English,
00:11:13.620 | and it will have a lot of little boundaries
00:11:15.940 | for all the other non-English text.
00:11:17.700 | So if we translated this into English,
00:11:21.620 | it would be significantly fewer tokens.
00:11:24.380 | The final example I have here
00:11:25.580 | is a little snippet of Python for doing FizzBuzz.
00:11:29.180 | And what I'd like you to notice is,
00:11:30.900 | look, all these individual spaces are all separate tokens.
00:11:35.580 | They are token 220.
00:11:37.940 | So 220, 220, 220, 220,
00:11:41.860 | and then space if is a single token.
00:11:44.780 | And so what's going on here
00:11:45.700 | is that when the transformer is going to consume
00:11:47.620 | or try to create this text,
00:11:50.620 | it needs to handle all these spaces individually.
00:11:53.860 | They all feed in one by one
00:11:55.900 | into the entire transformer in the sequence.
00:11:58.580 | And so this is being extremely wasteful
00:12:00.820 | tokenizing it in this way.
00:12:02.780 | And so as a result of that,
00:12:04.740 | GPT-2 is not very good with Python,
00:12:07.060 | and it's not anything to do with coding
00:12:08.860 | or the language model itself.
00:12:10.300 | It's just that if you use a lot of indentation
00:12:12.180 | using space in Python, like we usually do,
00:12:15.540 | you just end up bloating out all the text
00:12:17.800 | and it's separated across way too much of the sequence,
00:12:20.540 | and we are running out of the context length in the sequence.
00:12:23.760 | That's roughly speaking what's happening.
00:12:25.260 | We're being way too wasteful.
00:12:26.580 | We're taking up way too much token space.
00:12:28.740 | Now, we can also scroll up here
00:12:30.100 | and we can change the tokenizer.
00:12:31.720 | So note here that GPT-2 tokenizer
00:12:33.700 | creates a token count of 300 for this string here.
00:12:37.760 | We can change it to CL100K base,
00:12:39.620 | which is the GPT-4 tokenizer.
00:12:41.620 | And we see that the token count drops to 185.
00:12:44.560 | So for the exact same string,
00:12:46.220 | we are now roughly halving the number of tokens.
00:12:49.780 | And roughly speaking,
00:12:50.620 | this is because the number of tokens in the GPT-4 tokenizer
00:12:54.440 | is roughly double that of the number of tokens
00:12:57.160 | in the GPT-2 tokenizer.
00:12:58.620 | So we went from roughly 50K to roughly 100K.
00:13:01.720 | Now, you can imagine that this is a good thing
00:13:03.840 | because the same text is now squished
00:13:06.820 | into half as many tokens.
00:13:08.560 | So this is a lot denser input to the transformer.
00:13:12.780 | And in the transformer,
00:13:15.360 | every single token has a finite number of tokens before it
00:13:17.880 | that it's going to pay attention to.
00:13:19.560 | And so what this is doing is we're roughly able to see
00:13:22.960 | twice as much text as a context
00:13:25.840 | for what token to predict next because of this change.
00:13:29.360 | But of course,
00:13:30.200 | just increasing the number of tokens
00:13:31.520 | is not strictly better infinitely
00:13:34.520 | because as you increase the number of tokens,
00:13:36.240 | now your embedding table is sort of getting a lot larger.
00:13:39.960 | And also at the output,
00:13:41.080 | we are trying to predict the next token,
00:13:42.520 | and there's the softmax there, and that grows as well.
00:13:45.280 | We're gonna go into more detail later on this,
00:13:47.200 | but there's some kind of a sweet spot somewhere
00:13:49.560 | where you have a just right number of tokens
00:13:52.160 | in your vocabulary,
00:13:53.200 | where everything is appropriately dense
00:13:55.360 | and still fairly efficient.
00:13:56.720 | Now, one thing I would like you to note specifically
00:13:59.400 | for the GPT-4 Tokenizer
00:14:01.160 | is that the handling of the whitespace for Python
00:14:04.880 | has improved a lot.
00:14:06.180 | You see that here,
00:14:07.640 | these four spaces are represented as one single token
00:14:10.120 | for the three spaces here,
00:14:12.400 | and then the token space if.
00:14:14.640 | And here, seven spaces were all grouped into a single token.
00:14:18.640 | So we're being a lot more efficient
00:14:19.960 | in how we represent Python.
00:14:21.400 | And this was a deliberate choice made by OpenAI
00:14:23.680 | when they designed the GPT-4 Tokenizer,
00:14:26.440 | and they group a lot more whitespace into a single character.
00:14:30.600 | What this does is this densifies Python,
00:14:33.880 | and therefore we can attend to more code before it
00:14:37.980 | when we're trying to predict the next token in the sequence.
00:14:40.760 | And so the improvement in the Python coding ability
00:14:43.660 | from GPT-2 to GPT-4
00:14:45.440 | is not just a matter of the language model
00:14:47.480 | and the architecture and the details of the optimization,
00:14:50.440 | but a lot of the improvement here
00:14:51.540 | is also coming from the design of the Tokenizer
00:14:53.800 | and how it groups characters into tokens.
00:14:56.720 | Okay, so let's now start writing some code.
00:14:59.360 | So remember what we want to do.
00:15:01.140 | We want to take strings and feed them into language models.
00:15:05.040 | For that, we need to somehow tokenize strings
00:15:07.640 | into some integers in some fixed vocabulary,
00:15:12.120 | and then we will use those integers
00:15:13.860 | to make a lookup into a lookup table of vectors
00:15:16.820 | and feed those vectors into the transformer as an input.
00:15:20.300 | Now, the reason this gets a little bit tricky, of course,
00:15:22.740 | is that we don't just want to support
00:15:24.060 | the simple English alphabet.
00:15:25.760 | We want to support different kinds of languages.
00:15:27.920 | So this is annyeonghaseyo in Korean, which is hello.
00:15:31.660 | And we also want to support many kinds of special characters
00:15:33.820 | that we might find on the internet, for example, emoji.
00:15:38.260 | So how do we feed this text into transformers?
00:15:43.100 | Well, what is this text anyway in Python?
00:15:46.140 | So if you go to the documentation of a string in Python,
00:15:49.660 | you can see that strings are immutable sequences
00:15:52.340 | of Unicode code points.
00:15:55.020 | Okay, what are Unicode code points?
00:15:56.980 | We can go to Wikipedia.
00:15:59.900 | So Unicode code points are defined
00:16:02.380 | by the Unicode consortium as part of the Unicode standard.
00:16:07.180 | And what this is really is that it's just a definition
00:16:09.640 | of roughly 150,000 characters right now.
00:16:13.000 | And roughly speaking, what they look like
00:16:15.280 | and what integers represent those characters.
00:16:18.680 | So this is 150,000 characters across 161 scripts
00:16:22.000 | as of right now.
00:16:23.620 | So if you scroll down here,
00:16:24.640 | you can see that the standard is very much alive.
00:16:27.280 | The latest standard, 15.1, is from September 2023.
00:16:29.980 | And basically, this is just a way to define
00:16:34.960 | lots of types of characters, like for example,
00:16:38.980 | all these characters across different scripts.
00:16:41.300 | So the way we can access the Unicode code point
00:16:44.300 | given a single character
00:16:45.580 | is by using the ORD function in Python.
00:16:47.620 | So for example, I can pass in ORD of H,
00:16:50.620 | and I can see that for the single character H,
00:16:53.980 | the Unicode code point is 104, okay?
00:16:57.860 | But this can be arbitrarily complicated.
00:17:00.940 | So we can take, for example, our emoji here,
00:17:03.620 | and we can see that the code point for this one is 128,000.
00:17:07.460 | Or we can take this Korean character,
00:17:09.100 | and this is 50,000.
00:17:13.180 | Now, keep in mind, you can't plug in strings here
00:17:15.860 | because this doesn't have a single code point.
00:17:18.960 | It only takes a single Unicode code point character
00:17:22.140 | and tells you its integer.
00:17:23.800 | So in this way, we can look up all the characters
00:17:29.640 | of this specific string and their code points.
00:17:32.020 | So ORD of X, for X in this string,
00:17:35.460 | and we get this encoding here.
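In code, the calls discussed here look roughly like this (the emoji and the Korean string are stand-ins for the exact ones typed in the video):

```python
ord("h")                          # 104: the code point of a single character
ord("👋")                         # 128075: emoji code points are much larger
ord("안")                         # 50504: a Korean character
[ord(x) for x in "안녕하세요 👋"]   # code points for every character of a string
```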
00:17:38.560 | Now, see here, we've already turned,
00:17:41.220 | the raw code points already have integers.
00:17:43.740 | So why can't we simply just use these integers
00:17:46.200 | and not have any tokenization at all?
00:17:48.260 | Why can't we just use this natively as is
00:17:50.500 | and just use the code point?
00:17:52.200 | Well, one reason for that, of course,
00:17:53.420 | is that the vocabulary in that case would be quite long.
00:17:56.000 | So in this case, for Unicode,
00:17:58.380 | this is a vocabulary of 150,000 different code points.
00:18:02.260 | But more worryingly than that, I think,
00:18:04.300 | the Unicode standard is very much alive
00:18:06.580 | and it keeps changing.
00:18:08.040 | And so it's not kind of a stable representation necessarily
00:18:10.940 | that we may wanna use directly.
00:18:13.220 | So for those reasons, we need something a bit better.
00:18:15.860 | So to find something better, we turn to encodings.
00:18:18.740 | So if we go to the Wikipedia page here,
00:18:20.180 | we see that the Unicode Consortium
00:18:22.240 | defines three types of encodings,
00:18:24.900 | UTF-8, UTF-16, and UTF-32.
00:18:27.860 | These encodings are the way
00:18:30.060 | by which we can take Unicode text
00:18:32.340 | and translate it into binary data or byte streams.
00:18:36.140 | UTF-8 is by far the most common.
00:18:38.940 | So this is the UTF-8 page.
00:18:40.600 | Now this Wikipedia page is actually quite long,
00:18:43.300 | but what's important for our purposes
00:18:44.900 | is that UTF-8 takes every single code point
00:18:47.820 | and it translates it to a byte stream.
00:18:50.540 | And this byte stream is between one to four bytes.
00:18:53.380 | So it's a variable length encoding.
00:18:55.400 | So depending on the Unicode point, according to the schema,
00:18:58.080 | you're gonna end up with between one to four bytes
00:19:00.560 | for each code point.
00:19:01.640 | On top of that, there's UTF-8, UTF-16, and UTF-32.
00:19:07.480 | UTF-32 is nice because it is fixed length
00:19:09.960 | instead of variable length,
00:19:11.120 | but it has many other downsides as well.
00:19:13.680 | So the full kind of spectrum of pros and cons
00:19:17.720 | of all these different three encodings
00:19:19.560 | are beyond the scope of this video.
00:19:21.400 | I just like to point out that I enjoyed this blog post
00:19:24.460 | and this blog post at the end of it
00:19:26.120 | also has a number of references that can be quite useful.
00:19:29.160 | One of them is UTF-8 Everywhere Manifesto.
00:19:32.620 | And this manifesto describes the reason why UTF-8
00:19:36.280 | is significantly preferred and a lot nicer
00:19:39.760 | than the other encodings
00:19:41.280 | and why it is used a lot more prominently on the internet.
00:19:45.540 | One of the major advantages, just to give you a sense,
00:19:49.360 | is that UTF-8 is the only one of these
00:19:51.560 | that is backwards compatible
00:19:53.060 | to the much simpler ASCII encoding of text.
00:19:56.860 | But I'm not gonna go into the full detail in this video.
00:19:59.580 | So suffice to say that we like the UTF-8 encoding,
00:20:02.420 | and let's try to take the string
00:20:04.700 | and see what we get if we encode it into UTF-8.
00:20:07.360 | The string class in Python actually has dot encode,
00:20:11.660 | and you can give it the encoding, which is, say, UTF-8.
00:20:15.180 | Now, what we get out of this is not very nice
00:20:17.180 | because this is the bytes, this is a bytes object,
00:20:20.540 | and it's not very nice in the way that it's printed.
00:20:23.500 | So I personally like to take it through a list
00:20:26.040 | because then we actually get the raw bytes of this encoding.
00:20:30.840 | So this is the raw bytes that represent this string
00:20:35.680 | according to the UTF-8 encoding.
00:20:37.980 | We can also look at UTF-16.
00:20:40.100 | We get a slightly different byte stream.
00:20:42.520 | And here we start to see
00:20:43.680 | one of the disadvantages of UTF-16.
00:20:45.620 | You see how we have zero something,
00:20:47.800 | zero something, zero something.
00:20:49.360 | We're starting to get a sense that this is
00:20:50.560 | a bit of a wasteful encoding.
00:20:52.280 | And indeed, for simple ASCII characters
00:20:55.360 | or English characters here,
00:20:57.320 | we just have the structure of zero something,
00:20:58.880 | zero something, and it's not exactly nice.
00:21:02.280 | Same for UTF-32.
00:21:04.000 | When we expand this, we can start to get a sense
00:21:06.140 | of the wastefulness of this encoding for our purposes.
00:21:08.960 | You see a lot of zeros followed by something,
00:21:11.460 | and so this is not desirable.
00:21:14.580 | So suffice it to say that we would like to stick
00:21:18.820 | with UTF-8 for our purposes.
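A small sketch of the comparison being made here (the string is a stand-in for the one in the notebook):

```python
s = "안녕하세요 👋 (hello in Korean!)"   # stand-in example string

print(list(s.encode("utf-8")))    # 1 to 4 bytes per code point, no wasted zeros for ASCII
print(list(s.encode("utf-16")))   # note the zero-something pattern for the ASCII characters
print(list(s.encode("utf-32")))   # 4 bytes per code point, mostly zeros: very wasteful here
```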
00:21:21.780 | However, if we just use UTF-8 naively,
00:21:24.660 | these are byte streams.
00:21:26.060 | So that would imply a vocabulary length
00:21:28.680 | of only 256 possible tokens.
00:21:32.340 | But this vocabulary size is very, very small.
00:21:35.260 | What this is going to do if we just were to use it naively
00:21:37.900 | is that all of our text would be stretched out
00:21:41.140 | over very, very long sequences of bytes.
00:21:44.400 | And so what this does is that certainly
00:21:49.400 | the embedding table is going to be tiny,
00:21:51.060 | and the prediction at the top of the final layer
00:21:52.980 | is going to be very tiny, but our sequences are very long.
00:21:55.940 | And remember that we have pretty finite context lengths
00:21:59.860 | in the attention that we can support in a transformer
00:22:02.660 | for computational reasons.
00:22:04.220 | And so we only have as much context length,
00:22:07.140 | but now we have very, very long sequences.
00:22:09.320 | And this is just inefficient,
00:22:10.300 | and it's not going to allow us to attend
00:22:12.220 | to sufficiently long text before us
00:22:15.100 | for the purposes of the next token prediction task.
00:22:18.020 | So we don't want to use the raw bytes of the UTF-8 encoding.
00:22:23.020 | We want to be able to support larger vocabulary size
00:22:26.780 | that we can tune as a hyperparameter,
00:22:28.700 | but we want to stick with the UTF-8 encoding
00:22:31.700 | of these strings.
00:22:32.800 | So what do we do?
00:22:34.340 | Well, the answer, of course,
00:22:35.180 | is we turn to the byte-pair encoding algorithm,
00:22:37.420 | which will allow us to compress these byte sequences
00:22:41.220 | to a variable amount.
00:22:42.720 | So we'll get to that in a bit,
00:22:43.780 | but I just want to briefly speak to the fact
00:22:46.300 | that I would love nothing more
00:22:48.740 | than to be able to feed raw byte sequences
00:22:51.540 | into language models.
00:22:54.100 | In fact, there's a paper
00:22:54.900 | about how this could potentially be done
00:22:57.280 | from somewhere last year.
00:22:59.020 | Now, the problem is you actually have to go in
00:23:01.060 | and you have to modify the transformer architecture
00:23:03.180 | because, as I mentioned,
00:23:04.580 | you're going to have a problem
00:23:06.020 | where the attention will start to become extremely expensive
00:23:08.740 | because the sequences are so long.
00:23:11.380 | And so in this paper,
00:23:12.780 | they propose kind of a hierarchical structuring
00:23:15.740 | of the transformer that could allow you
00:23:17.540 | to just feed in raw bytes.
00:23:19.660 | And so at the end, they say,
00:23:20.980 | "Together, these results establish the viability
00:23:22.900 | of tokenization-free autoregressive sequence modeling
00:23:25.380 | at scale."
00:23:26.300 | So tokenization-free would indeed be amazing.
00:23:28.680 | We would just feed byte streams directly into our models,
00:23:32.180 | but unfortunately, I don't know
00:23:33.260 | that this has really been proven out yet
00:23:35.580 | by sufficiently many groups and at sufficient scale,
00:23:38.580 | but something like this at one point would be amazing
00:23:40.540 | and I hope someone comes up with it.
00:23:42.020 | But for now, we have to come back
00:23:44.020 | and we can't feed this directly into language models
00:23:46.500 | and we have to compress it
00:23:47.740 | using the byte-pair encoding algorithm.
00:23:49.480 | So let's see how that works.
00:23:50.620 | So as I mentioned,
00:23:51.460 | the byte-pair encoding algorithm is not all that complicated
00:23:54.460 | and the Wikipedia page is actually quite instructive
00:23:56.500 | as far as the basic idea goes.
00:23:58.980 | What we're doing is we have some kind of a input sequence.
00:24:01.780 | Like, for example, here we have only four elements
00:24:04.100 | in our vocabulary, A, B, C, and D,
00:24:06.420 | and we have a sequence of them.
00:24:07.920 | So instead of bytes, let's say we just had four,
00:24:10.460 | a vocab size of four.
00:24:11.740 | The sequence is too long and we'd like to compress it.
00:24:15.380 | So what we do is that we iteratively find the pair of tokens
00:24:20.380 | that occur the most frequently.
00:24:24.360 | And then once we've identified that pair,
00:24:27.060 | we replace that pair with just a single new token
00:24:30.900 | that we append to our vocabulary.
00:24:33.380 | So for example, here, the byte-pair AA occurs most often,
00:24:37.320 | so we mint a new token, let's call it capital Z,
00:24:40.740 | and we replace every single occurrence of AA by Z.
00:24:44.800 | So now we have two Zs here.
00:24:47.920 | So here we took a sequence of 11 characters
00:24:51.540 | with vocabulary size four,
00:24:53.740 | and we've converted it to a sequence of only nine tokens,
00:24:58.740 | but now with a vocabulary of five,
00:25:00.600 | because we have a fifth vocabulary element
00:25:02.780 | that we just created, and it's Z,
00:25:05.000 | standing for concatenation of AA.
00:25:07.300 | And we can, again, repeat this process.
00:25:09.660 | So we, again, look at the sequence
00:25:11.980 | and identify the pair of tokens that are most frequent.
00:25:16.680 | Let's say that that is now AB.
00:25:19.140 | Well, we are going to replace AB with a new token
00:25:21.340 | that we mint, called Y.
00:25:23.460 | So Y becomes AB, and then every single occurrence of AB
00:25:26.220 | is now replaced with Y, so we end up with this.
00:25:29.780 | So now we only have one, two, three, four, five, six,
00:25:33.860 | seven characters in our sequence,
00:25:36.100 | but we have not just four vocabulary elements or five,
00:25:41.100 | but now we have six.
00:25:43.620 | And for the final round, we, again,
00:25:46.220 | look through the sequence, find that the phrase ZY,
00:25:49.600 | or the pair ZY is most common,
00:25:51.540 | and replace it one more time with another character,
00:25:55.500 | let's say X.
00:25:56.580 | So X is ZY, and we replace all occurrences of ZY,
00:25:59.980 | and we get this following sequence.
00:26:02.020 | So basically, after we have gone through this process,
00:26:04.840 | instead of having a sequence of 11 tokens
00:26:09.840 | with a vocabulary length of four,
00:26:14.680 | we now have a sequence of one, two, three, four, five tokens,
00:26:19.680 | but our vocabulary length now is seven.
00:26:23.260 | And so in this way, we can iteratively compress our sequence
00:26:27.580 | as we mint new tokens.
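Written out, the toy example just described goes through these stages:

```
aaabdaaabac    vocab {a, b, c, d}           11 tokens
ZabdZabac      Z = aa, vocab size 5          9 tokens
ZYdZYac        Y = ab, vocab size 6          7 tokens
XdXac          X = ZY, vocab size 7          5 tokens
```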
00:26:29.560 | So in the exact same way, we start out with byte sequences.
00:26:34.320 | So we have 256 vocabulary size,
00:26:37.520 | but we're now going to go through these
00:26:38.980 | and find the byte pairs that occur the most.
00:26:42.160 | And we're going to iteratively start minting new tokens,
00:26:44.920 | appending them to our vocabulary, and replacing things.
00:26:48.160 | And in this way, we're going to end up
00:26:49.480 | with a compressed training dataset,
00:26:51.580 | and also an algorithm for taking any arbitrary sequence
00:26:54.920 | and encoding it using this vocabulary,
00:26:58.240 | and also decoding it back to strings.
00:27:00.720 | So let's now implement all that.
00:27:02.920 | So here's what I did.
00:27:04.280 | I went to this blog post that I enjoyed,
00:27:06.640 | and I took the first paragraph,
00:27:07.960 | and I copy pasted it here into text.
00:27:10.840 | So this is one very long line here.
00:27:13.040 | Now, to get the tokens, as I mentioned,
00:27:16.360 | we just take our text and we encode it into UTF-8.
00:27:19.360 | The tokens here at this point will be a raw bytes,
00:27:22.520 | single stream of bytes.
00:27:24.700 | And just so that it's easier to work with,
00:27:26.760 | instead of just a bytes object,
00:27:28.800 | I'm going to convert all those bytes to integers,
00:27:32.000 | and then create a list of it,
00:27:33.420 | just so it's easier for us to manipulate
00:27:35.040 | and work with in Python and visualize.
00:27:37.200 | And here I'm printing all of that.
00:27:38.360 | So this is the original paragraph,
00:27:43.360 | and its length is 533 code points.
00:27:48.720 | And then here are the bytes encoded in UTF-8.
00:27:52.800 | And we see that this has a length of 616 bytes
00:27:56.180 | at this point, or 616 tokens.
00:27:58.360 | And the reason this is more,
00:27:59.800 | is because a lot of these simple ASCII characters,
00:28:03.240 | or simple characters, they just become a single byte.
00:28:06.200 | But a lot of these Unicode, more complex characters,
00:28:08.840 | become multiple bytes, up to four.
00:28:10.720 | And so we are expanding that size.
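A minimal sketch of this setup (the text string stands in for the pasted paragraph):

```python
text = "Unicode! ..."                # stand-in for the copy-pasted first paragraph
tokens = text.encode("utf-8")        # raw bytes
tokens = list(map(int, tokens))      # convert to a list of integers in 0..255
print(len(text), "code points,", len(tokens), "bytes")   # bytes >= code points (e.g. 533 vs 616)
```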
00:28:12.760 | So now what we'd like to do as a first step
00:28:15.220 | of the algorithm, is we'd like to iterate over here,
00:28:17.700 | and find the pair of bytes that occur most frequently,
00:28:22.040 | because we're then going to merge it.
00:28:23.840 | So if you are working along on a notebook, on a side,
00:28:26.460 | then I encourage you to basically click on the link,
00:28:28.740 | find this notebook,
00:28:29.940 | and try to write that function yourself.
00:28:31.980 | Otherwise, I'm going to come here
00:28:32.980 | and implement first the function
00:28:34.620 | that finds the most common pair.
00:28:36.380 | Okay, so here's what I came up with.
00:28:37.740 | There are many different ways to implement this,
00:28:39.860 | but I'm calling the function getStats.
00:28:41.460 | It expects a list of integers.
00:28:43.360 | I'm using a dictionary to keep track
00:28:45.260 | of basically the counts.
00:28:46.860 | And then this is a Pythonic way
00:28:48.080 | to iterate consecutive elements of this list,
00:28:51.300 | which we covered in a previous video.
00:28:53.740 | And then here, I'm just keeping track
00:28:55.500 | of just incrementing by one for all the pairs.
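A sketch of that counting function, following the description above:

```python
def get_stats(ids):
    # count how often each consecutive pair of integers occurs in the list
    counts = {}
    for pair in zip(ids, ids[1:]):       # iterate consecutive elements: (ids[0], ids[1]), ...
        counts[pair] = counts.get(pair, 0) + 1
    return counts

stats = get_stats(tokens)
```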
00:28:59.340 | So if I call this on all the tokens here,
00:29:01.900 | then the stats comes out here.
00:29:04.240 | So this is a dictionary.
00:29:05.640 | The keys are these tuples of consecutive elements,
00:29:09.520 | and this is the count.
00:29:11.160 | So just to print it in a slightly better way,
00:29:14.780 | this is one way that I like to do that,
00:29:17.540 | where it's a little bit compound here,
00:29:20.820 | so you can pause if you like,
00:29:22.200 | but we iterate all the items.
00:29:24.460 | The items called on dictionary returns pairs of key value.
00:29:29.060 | And instead, I create a list here of value key,
00:29:34.060 | because if it's a value key list,
00:29:36.260 | then I can call sort on it.
00:29:38.020 | And by default, Python will use the first element,
00:29:42.780 | which in this case will be value,
00:29:44.100 | to sort by if it's given tuples.
00:29:46.420 | And then reverse, so it's descending, and print that.
00:29:49.220 | So basically, it looks like 101,32
00:29:52.640 | was the most commonly occurring consecutive pair,
00:29:55.580 | and it occurred 20 times.
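The value-key sort described here can be written as a one-liner:

```python
# print pairs sorted by count, most frequent first
print(sorted(((v, k) for k, v in stats.items()), reverse=True))
# e.g. [(20, (101, 32)), ...]  -> the pair (101, 32) occurs 20 times
```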
00:29:57.480 | We can double check that that makes reasonable sense.
00:30:00.200 | So if I just search 101,32,
00:30:03.160 | then you see that these are the 20 occurrences of that pair.
00:30:07.360 | And if we'd like to take a look
00:30:10.800 | at what exactly that pair is, we can use char,
00:30:13.920 | which is the opposite of ord in Python.
00:30:16.320 | So we give it a Unicode code point.
00:30:19.200 | So 101, and of 32, and we see that this is E and space.
00:30:24.200 | So basically, there's a lot of E space here,
00:30:28.220 | meaning that a lot of these words seem to end with E.
00:30:30.700 | So here's E space as an example.
00:30:33.240 | So there's a lot of that going on here,
00:30:34.560 | and this is the most common pair.
00:30:36.840 | So now that we've identified the most common pair,
00:30:39.140 | we would like to iterate over the sequence.
00:30:41.980 | We're going to mint a new token with the ID of 256, right?
00:30:46.340 | Because these tokens currently go from zero to 255.
00:30:49.740 | So when we create a new token, it will have an ID of 256,
00:30:53.780 | and we're going to iterate over this entire list.
00:30:58.460 | And every time we see 101,32,
00:31:02.060 | we're gonna swap that out for 256.
00:31:04.900 | So let's implement that now,
00:31:07.020 | and feel free to do that yourself as well.
00:31:09.840 | So first, I commented this
00:31:11.620 | just so we don't pollute the notebook too much.
00:31:14.920 | This is a nice way of, in Python,
00:31:18.000 | obtaining the highest ranking pair.
00:31:20.440 | So we're basically calling the max on this dictionary stats,
00:31:25.000 | and this will return the maximum key.
00:31:28.640 | And then the question is, how does it rank keys?
00:31:31.760 | So you can provide it with a function that ranks keys,
00:31:34.860 | and that function is just stats.get.
00:31:37.740 | Stats.get would basically return the value.
00:31:41.120 | And so we're ranking by the value
00:31:42.760 | and getting the maximum key.
00:31:44.480 | So it's 101,32, as we saw.
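That max call, as described:

```python
top_pair = max(stats, key=stats.get)   # ranks the keys (pairs) by their count
print(top_pair)                        # (101, 32)
```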
00:31:47.520 | Now, to actually merge 101,32,
00:31:49.860 | this is the function that I wrote,
00:31:52.240 | but again, there are many different versions of it.
00:31:54.840 | So we're gonna take a list of IDs
00:31:56.920 | and the pair that we wanna replace,
00:31:58.900 | and that pair will be replaced with the new index IDX.
00:32:02.300 | So iterating through IDs,
00:32:05.020 | if we find the pair, swap it out for IDX.
00:32:08.240 | So we create this new list, and then we start at zero,
00:32:11.680 | and then we go through this entire list sequentially
00:32:13.880 | from left to right.
00:32:14.840 | And here we are checking for equality
00:32:18.040 | at the current position with the pair.
00:32:20.440 | So here we are checking that the pair matches.
00:32:24.660 | Now, here's a bit of a tricky condition
00:32:26.360 | that you have to append if you're trying to be careful,
00:32:28.680 | and that is that you don't want this here
00:32:31.700 | to be out of bounds at the very last position
00:32:34.320 | when you're on the rightmost element of this list.
00:32:36.660 | Otherwise, this would give you an out-of-bounds error.
00:32:39.400 | So we have to make sure
00:32:40.240 | that we're not at the very, very last element.
00:32:42.800 | So this would be false for that.
00:32:45.840 | So if we find a match,
00:32:47.480 | we append to this new list that replacement index,
00:32:52.480 | and we increment the position by two.
00:32:54.080 | So we skip over that entire pair.
00:32:56.440 | But otherwise, if we haven't found a matching pair,
00:32:58.880 | we just sort of copy over the element at that position
00:33:02.720 | and increment by one, and then return this.
00:33:06.360 | So here's a very small toy example.
00:33:07.840 | If we have a list 5, 6, 6, 7, 9, 1,
00:33:10.300 | and we want to replace the occurrences of 67 with 99,
00:33:13.620 | then calling this on that will give us
00:33:17.320 | what we're asking for.
00:33:18.600 | So here, the 67 is replaced with 99.
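Here is a sketch of that merge function, plus the toy example just mentioned:

```python
def merge(ids, pair, idx):
    # in the list of ints (ids), replace every consecutive occurrence of pair with idx
    newids = []
    i = 0
    while i < len(ids):
        # only check the pair if we are not at the very last position (avoids out-of-bounds)
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)   # found a match: append the new token
            i += 2               # and skip over the entire pair
        else:
            newids.append(ids[i])
            i += 1
    return newids

print(merge([5, 6, 6, 7, 9, 1], (6, 7), 99))   # [5, 6, 99, 9, 1]
```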
00:33:21.360 | So now I'm going to uncomment this for our actual use case,
00:33:26.240 | where we want to take our tokens.
00:33:28.520 | We want to take the top pair here
00:33:31.040 | and replace it with 256 to get tokens two.
00:33:34.120 | If we run this, we get the following.
00:33:38.240 | So recall that previously, we had a length 616 in this list,
00:33:43.240 | and now we have a length 596, right?
00:33:47.560 | So this decreased by 20,
00:33:49.620 | which makes sense because there are 20 occurrences.
00:33:52.420 | Moreover, we can try to find 256 here,
00:33:55.500 | and we see plenty of occurrences off it.
00:33:57.600 | And moreover, just double check,
00:33:59.760 | there should be no occurrence of 101, 32.
00:34:02.320 | So this is the original array, plenty of them.
00:34:04.980 | And in the second array, there are no occurrences of 101, 32.
00:34:08.320 | So we've successfully merged this single pair.
00:34:11.480 | And now we just iterate this.
00:34:13.320 | So we are going to go over the sequence again,
00:34:15.420 | find the most common pair, and replace it.
00:34:17.740 | So let me now write a while loop
00:34:19.140 | that uses these functions to do this sort of iteratively.
00:34:22.820 | And how many times do we do it for?
00:34:25.160 | Well, that's totally up to us as a hyperparameter.
00:34:27.460 | The more steps we take, the larger will be our vocabulary,
00:34:32.420 | and the shorter will be our sequence.
00:34:34.500 | And there is some sweet spot
00:34:35.940 | that we usually find works the best in practice.
00:34:38.780 | And so this is kind of a hyperparameter,
00:34:40.900 | and we tune it, and we find good vocabulary sizes.
00:34:44.020 | As an example, GPT-4 currently uses roughly 100,000 tokens.
00:34:47.780 | And ballpark, those are reasonable numbers currently
00:34:51.580 | in state-of-the-art large-language models.
00:34:53.500 | So let me now write, putting it all together,
00:34:56.580 | and iterating these steps.
00:34:58.720 | Okay, now, before we dive into the while loop,
00:35:00.500 | I wanted to add one more cell here,
00:35:03.160 | where I went to the blog post,
00:35:04.540 | and instead of grabbing just the first paragraph or two,
00:35:07.100 | I took the entire blog post,
00:35:08.700 | and I stretched it out in a single line.
00:35:10.780 | And basically, just using longer text
00:35:12.460 | will allow us to have more representative statistics
00:35:14.700 | for the byte pairs,
00:35:15.920 | and we'll just get more sensible results out of it,
00:35:18.940 | because it's longer text.
00:35:20.280 | So here we have the raw text.
00:35:23.180 | We encode it into bytes using the UTF-8 encoding.
00:35:27.660 | And then here, as before,
00:35:29.140 | we are just changing it into a list of integers in Python,
00:35:32.240 | just so it's easier to work with,
00:35:34.060 | instead of the raw bytes objects.
00:35:36.380 | And then this is the code that I came up with
00:35:40.220 | to actually do the merging in loop.
00:35:43.860 | These two functions here are identical
00:35:45.700 | to what we had above.
00:35:46.740 | I only included them here
00:35:48.100 | just so that you have the point of reference here.
00:35:50.760 | So these two are identical,
00:35:54.020 | and then this is the new code that I added.
00:35:56.540 | So the first thing we wanna do
00:35:57.380 | is we want to decide on a final vocabulary size
00:36:00.180 | that we want our tokenizer to have.
00:36:02.480 | And as I mentioned, this is a hyperparameter,
00:36:04.040 | and you set it in some way
00:36:05.280 | depending on your best performance.
00:36:07.520 | So let's say for us, we're going to use 276,
00:36:10.140 | because that way we're going to be doing exactly 20 merges.
00:36:14.000 | And 20 merges because we already have 256 tokens
00:36:18.420 | for the raw bytes.
00:36:20.280 | And to reach 276, we have to do 20 merges
00:36:23.600 | to add 20 new tokens.
00:36:25.160 | Here, this is one way in Python
00:36:28.680 | to just create a copy of a list.
00:36:30.760 | So I'm taking the tokens list,
00:36:33.260 | and by wrapping it in the list,
00:36:34.940 | Python will construct a new list
00:36:36.940 | of all the individual elements,
00:36:38.100 | so this is just a copy operation.
00:36:39.740 | Then here, I'm creating a merges dictionary.
00:36:44.340 | So this merges dictionary is going to maintain
00:36:46.140 | basically the child one, child two,
00:36:49.420 | mapping to a new token.
00:36:52.300 | And so what we're going to be building up here
00:36:53.740 | is a binary tree of merges.
00:36:56.500 | But actually, it's not exactly a tree
00:36:57.860 | because a tree would have a single root node
00:37:00.500 | with a bunch of leaves.
00:37:02.220 | For us, we're starting with the leaves on the bottom,
00:37:04.620 | which are the individual bytes.
00:37:06.060 | Those are the starting 256 tokens.
00:37:08.580 | And then we're starting to merge two of them at a time.
00:37:11.380 | And so it's not a tree, it's more like a forest
00:37:13.740 | as we merge these elements.
00:37:18.360 | So for 20 merges,
00:37:21.540 | we're going to find the most commonly occurring pair.
00:37:24.840 | We're going to mint a new token integer for it.
00:37:27.940 | So I here will start at zero,
00:37:29.460 | so we're going to start at 256.
00:37:31.940 | We're going to print that we're merging it,
00:37:33.620 | and we're going to replace all the occurrences of that pair
00:37:36.660 | with the newly minted token.
00:37:39.520 | And we're going to record that this pair of integers
00:37:42.980 | merged into this new integer.
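Putting it together, the merging loop described here looks roughly like this (using the get_stats and merge sketches from above):

```python
vocab_size = 276                 # desired final vocabulary size (a hyperparameter)
num_merges = vocab_size - 256    # 20 merges on top of the 256 raw byte tokens
ids = list(tokens)               # copy, so the original tokens list is left untouched

merges = {}                      # (int, int) -> int
for i in range(num_merges):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)        # most frequently occurring pair
    idx = 256 + i                           # mint a new token id
    print(f"merging {pair} into a new token {idx}")
    ids = merge(ids, pair, idx)             # replace all occurrences of the pair
    merges[pair] = idx                      # record the merge
```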
00:37:45.460 | So running this gives us the following output.
00:37:49.940 | So we did 20 merges.
00:37:54.160 | And for example, the first merge was exactly as before,
00:37:57.500 | the 101, 32 tokens merging into a new token 256.
00:38:02.500 | Now keep in mind that the individual tokens 101 and 32
00:38:06.440 | can still occur in the sequence after merging.
00:38:09.220 | It's only when they occur exactly consecutively
00:38:11.800 | that that becomes 256 now.
00:38:13.680 | And in particular, the other thing to notice here
00:38:17.620 | is that the token 256, which is the newly minted token,
00:38:20.920 | is also eligible for merging.
00:38:22.860 | So here on the bottom, the 20th merge
00:38:24.980 | was a merge of 256 and 259 becoming 275.
00:38:29.780 | So every time we replace these tokens,
00:38:32.180 | they become eligible for merging
00:38:33.700 | in the next round of the iteration.
00:38:35.820 | So that's why we're building up a small sort of binary forest
00:38:38.440 | instead of a single individual tree.
00:38:40.240 | One thing we can take a look at as well
00:38:43.020 | is we can take a look at the compression ratio
00:38:44.740 | that we've achieved.
00:38:46.100 | So in particular, we started off with this tokens list.
00:38:50.220 | So we started off with 24,000 bytes
00:38:53.460 | and after merging 20 times,
00:38:56.500 | we now have only 19,000 tokens.
00:39:01.220 | And so therefore, the compression ratio
00:39:02.660 | of simply just dividing the two is roughly 1.27.
00:39:06.220 | So that's the amount of compression we were able to achieve
00:39:08.380 | of this text with only 20 merges.
00:39:11.100 | And of course, the more vocabulary elements you add,
00:39:15.380 | the greater the compression ratio here would be.
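The compression ratio being quoted is just the ratio of the two lengths:

```python
print("tokens length:", len(tokens))   # the raw byte-level tokens, ~24,000
print("ids length:", len(ids))         # after 20 merges, ~19,000
print(f"compression ratio: {len(tokens) / len(ids):.2f}X")   # roughly 1.27X
```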
00:39:18.780 | Finally, so that's kind of like
00:39:22.120 | the training of the tokenizer, if you will.
00:39:25.780 | Now, one point that I wanted to make is that,
00:39:28.340 | and maybe this is a diagram that can help kind of illustrate,
00:39:32.300 | is that tokenizer is a completely separate object
00:39:34.780 | from the large language model itself.
00:39:36.780 | So everything in this lecture,
00:39:37.920 | we're not really touching the LLM itself.
00:39:40.060 | We're just training the tokenizer.
00:39:41.660 | This is a completely separate pre-processing stage usually.
00:39:44.820 | So the tokenizer will have its own training set,
00:39:47.440 | just like a large language model
00:39:48.660 | has a potentially different training set.
00:39:51.540 | So the tokenizer has a training set of documents
00:39:53.300 | on which you're going to train the tokenizer.
00:39:55.740 | And then we're performing the BytePair encoding algorithm,
00:39:59.460 | as we saw above, to train the vocabulary of this tokenizer.
00:40:03.660 | So it has its own training set.
00:40:05.060 | It has a pre-processing stage
00:40:06.380 | that you would run a single time in the beginning.
00:40:08.820 | And the tokenizer is trained
00:40:11.740 | using BytePair encoding algorithm.
00:40:13.820 | Once you have the tokenizer, once it's trained,
00:40:16.060 | and you have the vocabulary,
00:40:17.180 | and you have the merges,
00:40:19.200 | we can do both encoding and decoding.
00:40:22.300 | So these two arrows here.
00:40:24.340 | So the tokenizer is a translation layer between raw text,
00:40:28.200 | which is, as we saw, the sequence of Unicode code points.
00:40:31.840 | It can take raw text and turn it into a token sequence,
00:40:35.520 | and vice versa, it can take a token sequence
00:40:38.080 | and translate it back into raw text.
00:40:40.140 | So now that we have trained the tokenizer,
00:40:44.120 | and we have these merges,
00:40:45.920 | we are going to turn to how we can do the encoding
00:40:48.120 | and the decoding step.
00:40:49.400 | If you give me text, here are the tokens,
00:40:51.200 | and vice versa.
00:40:52.040 | If you give me tokens, here's the text.
00:40:54.240 | Once we have that,
00:40:55.080 | we can translate between these two realms.
00:40:57.320 | And then the language model is going to be trained
00:40:59.180 | as a step two afterwards.
00:41:01.120 | And typically, in a sort of a state-of-the-art application,
00:41:05.280 | you might take all of your training data
00:41:06.680 | for the language model,
00:41:07.800 | and you might run it through the tokenizer
00:41:09.480 | and sort of translate everything
00:41:11.300 | into a massive token sequence.
00:41:12.920 | And then you can throw away the raw text
00:41:14.680 | that you're just left with, the tokens themselves.
00:41:17.040 | And those are stored on disk,
00:41:19.120 | and that is what the larger language model
00:41:20.600 | is actually reading when it's training on them.
00:41:22.960 | So that's one approach that you can take
00:41:24.180 | as a single massive pre-processing stage.
00:41:26.840 | So yeah, basically,
00:41:30.240 | I think the most important thing I want to get across
00:41:31.800 | is that this is a completely separate stage.
00:41:33.440 | It usually has its own entire training set.
00:41:36.360 | You may want to have those training sets be different
00:41:38.440 | between the tokenizer and the larger language model.
00:41:40.720 | So for example, when you're training the tokenizer,
00:41:43.160 | as I mentioned, we don't just care
00:41:44.680 | about the performance of English text.
00:41:46.400 | We care about many different languages.
00:41:49.480 | And we also care about code or not code.
00:41:51.640 | So you may want to look into different kinds of mixtures
00:41:54.600 | of different kinds of languages
00:41:55.960 | and different amounts of code and things like that,
00:41:58.760 | because the amount of different language that you have
00:42:01.940 | in your tokenizer training set
00:42:03.720 | will determine how many merges of it there will be.
00:42:07.280 | And therefore, that determines the density
00:42:09.420 | with which this type of data sort of has in the token space.
00:42:14.420 | And so roughly speaking intuitively,
00:42:18.800 | if you add some amount of data,
00:42:20.020 | like say you have a ton of Japanese data
00:42:21.740 | in your tokenizer training set,
00:42:24.060 | then that means that more Japanese tokens will get merged.
00:42:26.840 | And therefore, Japanese will have shorter sequences.
00:42:29.940 | And that's going to be beneficial
00:42:31.160 | for the large language model,
00:42:32.380 | which has a finite context length
00:42:34.300 | on which it can work on in the token space.
00:42:37.820 | So hopefully that makes sense.
00:42:39.340 | So we're now going to turn to encoding and decoding
00:42:42.120 | now that we have trained a tokenizer.
00:42:44.020 | So we have our merges.
00:42:46.160 | And now how do we do encoding and decoding?
00:42:48.340 | Okay, so let's begin with decoding,
00:42:50.100 | which is this arrow over here.
00:42:51.880 | So given a token sequence,
00:42:53.820 | let's go through the tokenizer
00:42:55.020 | to get back a Python string object, so the raw text.
00:42:59.140 | So this is the function that we'd like to implement.
00:43:01.860 | We're given the list of integers
00:43:03.260 | and we want to return a Python string.
00:43:04.980 | If you'd like, try to implement this function yourself.
00:43:07.420 | It's a fun exercise.
00:43:08.600 | Otherwise, I'm going to start pasting in my own solution.
00:43:12.260 | So there are many different ways to do it.
00:43:15.100 | Here's one way.
00:43:16.420 | I will create a kind of pre-processing variable
00:43:19.460 | that I will call vocab.
00:43:20.780 | And vocab is a mapping or a dictionary in Python
00:43:26.020 | from the token ID to the bytes object for that token.
00:43:31.020 | So we begin with the raw bytes for tokens from zero to 255.
00:43:35.900 | And then we go in order of all the merges
00:43:38.760 | and we sort of populate this vocab list
00:43:41.720 | by doing an addition here.
00:43:43.360 | So this is basically the bytes representation
00:43:46.840 | of the first child followed by the second one.
00:43:49.920 | And remember, these are bytes objects.
00:43:51.760 | So this addition here is an addition of two bytes objects,
00:43:55.240 | just concatenation.
00:43:57.080 | So that's what we get here.
00:43:58.440 | One tricky thing to be careful with, by the way,
00:44:01.520 | is that I'm iterating a dictionary in Python
00:44:04.320 | using a .items.
00:44:05.920 | And it really matters that this runs in the order
00:44:09.400 | in which we inserted items into the merges dictionary.
00:44:13.220 | Luckily, starting with Python 3.7,
00:44:15.100 | this is guaranteed to be the case.
00:44:16.640 | But before Python 3.7,
00:44:18.280 | this iteration may have been out of order
00:44:20.120 | with respect to how we inserted elements into merges,
00:44:22.880 | and this may not have worked.
00:44:24.320 | But we are using modern Python, so we're okay.
00:44:27.320 | And then here, given the IDs,
00:44:31.280 | the first thing we're going to do is get the tokens.
00:44:33.980 | So the way I implemented this here is I'm taking,
00:44:38.960 | I'm iterating over all the IDs.
00:44:40.600 | I'm using vocab to look up their bytes.
00:44:42.880 | And then here, this is one way in Python
00:44:45.000 | to concatenate all these bytes together
00:44:47.640 | to create our tokens.
00:44:49.640 | And then these tokens here at this point are raw bytes.
00:44:53.120 | So I have to decode using UTF-8 now
00:44:56.800 | back into Python strings.
00:44:59.080 | So previously, we called dot encode on a string object
00:45:02.020 | to get the bytes, and now we're doing it opposite.
00:45:04.720 | We're taking the bytes and calling a decode
00:45:07.240 | on the bytes object to get a string in Python.
00:45:10.880 | And then we can return text.
00:45:12.600 | So this is how we can do it.
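For reference, here is a minimal sketch of the decode step just described, assuming the `merges` dictionary produced by the training loop earlier is in scope (other names are illustrative):

```python
# a minimal sketch of decoding; assumes `merges` maps (id1, id2) -> new token id, in insertion order
vocab = {idx: bytes([idx]) for idx in range(256)}      # raw bytes for tokens 0..255
for (p0, p1), idx in merges.items():                   # relies on dict insertion order (Python 3.7+)
    vocab[idx] = vocab[p0] + vocab[p1]                 # concatenation of two bytes objects

def decode(ids):
    # given a list of token ids (integers), return a Python string
    tokens = b"".join(vocab[idx] for idx in ids)       # look up the bytes of each id and concatenate
    text = tokens.decode("utf-8")                      # raw bytes -> string (strict errors, for now)
    return text
```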
00:45:16.800 | Now, this actually has an issue in the way I implemented it,
00:45:21.720 | and this could actually throw an error.
00:45:23.520 | So try to figure out why this code
00:45:26.120 | could actually result in an error
00:45:27.760 | if we plug in some sequence of IDs that is unlucky.
00:45:32.500 | So let me demonstrate the issue.
00:45:35.360 | When I try to decode just something like 97,
00:45:38.160 | I am going to get a letter A here back.
00:45:41.040 | So nothing too crazy happening.
00:45:43.560 | But when I try to decode 128 as a single element,
00:45:48.320 | I don't get a string back, I get an error:
00:45:52.320 | a UnicodeDecodeError.
00:45:54.240 | The UTF-8 codec can't decode byte 0x80,
00:45:57.840 | which is 128 written in hex,
00:46:00.040 | in position zero: invalid start byte.
00:46:01.880 | What does that mean?
00:46:03.120 | Well, to understand what this means,
00:46:04.040 | we have to go back to our UTF-8 page
00:46:06.600 | that I briefly showed earlier,
00:46:08.560 | and this is Wikipedia UTF-8.
00:46:10.640 | And basically, there's a specific schema
00:46:13.140 | that UTF-8 bytes take.
00:46:16.080 | So in particular, if you have a multi-byte object
00:46:19.440 | for some of the Unicode characters,
00:46:20.960 | they have to have this special sort of envelope
00:46:23.160 | in how the encoding works.
00:46:25.800 | And so what's happening here is that invalid start byte,
00:46:29.800 | that's because 128, the binary representation of it,
00:46:33.640 | is one followed by all zeros.
00:46:36.280 | So we have one and then all zero.
00:46:38.800 | And we see here that that doesn't conform to the format
00:46:41.040 | because one followed by all zero
00:46:42.480 | just doesn't fit any of these rules, so to speak.
00:46:45.660 | So it's an invalid start byte.
00:46:49.240 | A byte that starts with a one must have another one following it,
00:46:52.040 | and then a zero following it,
00:46:53.520 | and then the content of your Unicode code point in the x's here.
00:46:57.080 | So basically, we don't exactly follow the UTF-8 standard,
00:47:00.600 | and this cannot be decoded.
00:47:02.400 | And so the way to fix this is to use this errors equals
00:47:07.400 | in bytes.decode function of Python.
00:47:12.800 | And by default, errors is strict.
00:47:14.720 | So we will throw an error
00:47:16.560 | if it's not valid UTF-8 bytes encoding.
00:47:20.360 | But there are many different things
00:47:21.640 | that you could put here on error handling.
00:47:23.680 | This is the full list of all the errors that you can use.
00:47:26.760 | And in particular, instead of strict,
00:47:28.220 | let's change it to replace.
00:47:30.360 | And that will replace with this special marker,
00:47:34.720 | this replacement character.
00:47:36.720 | So errors equals replace.
00:47:40.360 | And now we just get that character back.
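Concretely, the only change to the decode sketch above is the `errors` argument on the decode call:

```python
def decode(ids):
    tokens = b"".join(vocab[idx] for idx in ids)
    # invalid UTF-8 sequences become the U+FFFD replacement character instead of raising
    text = tokens.decode("utf-8", errors="replace")
    return text

decode([128])   # now returns '\ufffd' rather than throwing a UnicodeDecodeError
```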
00:47:42.360 | So basically, not every single byte sequence is valid UTF-8.
00:47:49.440 | And if it happens that your large language model,
00:47:52.040 | for example, predicts your tokens in a bad manner,
00:47:56.080 | then they might not fall into valid UTF-8,
00:47:59.720 | and then we won't be able to decode them.
00:48:02.160 | So the standard practice
00:48:03.720 | is to basically use errors equals replace.
00:48:07.080 | And this is what you will also find in the OpenAI code
00:48:10.320 | that they released as well.
00:48:12.240 | But basically, whenever you see this kind of a character
00:48:14.560 | in your output, in that case, something went wrong,
00:48:17.080 | and the LLM output was not a valid sequence of tokens.
00:48:22.080 | Okay, and now we're going to go the other way.
00:48:24.320 | So we are going to implement this arrow right here,
00:48:27.600 | where we are gonna be given a string,
00:48:29.160 | and we want to encode it into tokens.
00:48:31.000 | So this is the signature of the function
00:48:34.080 | that we're interested in.
00:48:35.640 | And this should basically print a list of integers
00:48:39.080 | of the tokens.
00:48:40.040 | So again, try to maybe implement this yourself
00:48:43.040 | if you'd like a fun exercise, and pause here.
00:48:45.840 | Otherwise, I'm going to start putting in my solution.
00:48:48.160 | So again, there are many ways to do this.
00:48:50.840 | So this is one of the ways that sort of I came up with.
00:48:55.840 | So the first thing we're going to do
00:48:58.640 | is we are going to take our text,
00:49:02.360 | encode it into UTF-8 to get the raw bytes.
00:49:05.120 | And then as before,
00:49:05.960 | we're going to call list on the bytes object
00:49:07.760 | to get a list of integers of those bytes.
00:49:11.680 | So those are the starting tokens.
00:49:13.240 | Those are the raw bytes of our sequence.
00:49:15.640 | But now of course, according to the merges dictionary above,
00:49:19.080 | and recall this was the merges,
00:49:20.760 | some of the bytes may be merged according to this lookup.
00:49:25.960 | In addition to that,
00:49:26.800 | remember that the merges was built from top to bottom.
00:49:29.320 | And this is sort of the order
00:49:30.320 | in which we inserted stuff into merges.
00:49:32.560 | And so we prefer to do all these merges in the beginning
00:49:35.720 | before we do these merges later,
00:49:37.760 | because for example,
00:49:39.520 | this merge over here relies on the 256,
00:49:42.040 | which got merged here.
00:49:44.120 | So we have to go in the order from top to bottom sort of,
00:49:47.360 | if we are going to be merging anything.
00:49:50.000 | Now we expect to be doing a few merges.
00:49:52.160 | So we're going to be doing while true.
00:49:54.080 | And now we want to find a pair of bytes
00:49:59.000 | that is consecutive,
00:50:00.280 | that we are allowed to merge according to this.
00:50:02.760 | In order to reuse some of the functionality
00:50:05.120 | that we've already written,
00:50:06.200 | I'm going to reuse the function get stats.
00:50:09.040 | So recall that get stats will give us the,
00:50:13.040 | will basically count up how many times every single pair
00:50:16.040 | occurs in our sequence of tokens
00:50:18.400 | and return that as a dictionary.
00:50:20.480 | And the dictionary was a mapping
00:50:22.040 | from all the different byte pairs
00:50:26.400 | to the number of times that they occur, right?
00:50:28.680 | At this point, we don't actually care
00:50:30.880 | how many times they occur in the sequence.
00:50:33.040 | We only care what the raw pairs are in that sequence.
00:50:36.680 | And so I'm only going to be using basically
00:50:38.120 | the keys of the dictionary.
00:50:39.800 | I only care about the set of possible merge candidates
00:50:42.920 | if that makes sense.
00:50:43.920 | Now we want to identify the pair
00:50:46.280 | that we're going to be merging at this stage of the loop.
00:50:49.400 | So what do we want?
00:50:50.240 | We want to find the pair or like a key inside stats
00:50:54.800 | that has the lowest index in the merges dictionary,
00:50:59.440 | because we want to do all the early merges
00:51:01.080 | before we work our way to the late merges.
00:51:04.120 | So again, there are many different ways to implement this,
00:51:06.240 | but I'm going to do something a little bit fancy here.
00:51:12.320 | So I'm going to be using the min over an iterator.
00:51:15.960 | In Python, when you call min on an iterator,
00:51:18.200 | and stats here is a dictionary,
00:51:19.920 | we're going to be iterating the keys
00:51:21.400 | of this dictionary in Python.
00:51:23.480 | So we're looking at all the pairs inside stats,
00:51:26.920 | which are all the consecutive pairs.
00:51:30.440 | And we're going to be taking the consecutive pair
00:51:33.080 | inside tokens that has the minimum, what?
00:51:36.200 | The min takes a key, which gives us the function
00:51:40.080 | that is going to return a value
00:51:41.920 | over which we're going to do the min.
00:51:44.120 | And the one we care about is we care about taking merges
00:51:47.400 | and basically getting that pair's index.
00:51:52.400 | So basically for any pair inside stats,
00:51:58.240 | we are going to be looking into merges
00:52:00.800 | at what index it has.
00:52:02.720 | And we want to get the pair with the min number.
00:52:05.800 | So as an example, if there's a pair 101 and 32,
00:52:08.320 | we definitely want to get that pair.
00:52:10.640 | We want to identify it here and return it.
00:52:12.560 | And pair would become 101, 32 if it occurs.
00:52:15.920 | And the reason that I'm putting a float inf here
00:52:19.200 | as a fallback in the get function is that
00:52:22.200 | when we consider a pair
00:52:26.240 | that doesn't occur in the merges,
00:52:28.400 | then that pair is not eligible to be merged, right?
00:52:30.760 | So if in the token sequence
00:52:32.720 | there's some pair that is not a merging pair,
00:52:35.200 | then it doesn't occur in merges,
00:52:36.440 | it doesn't have an index,
00:52:38.640 | and it cannot be merged,
00:52:41.840 | which is what we denote with float inf.
00:52:44.120 | And the reason infinity is nice here
00:52:45.600 | is because for sure we're guaranteed
00:52:47.520 | that it's not going to participate
00:52:48.760 | in the list of candidates when we do the min.
00:52:51.920 | So this is one way to do it.
00:52:54.160 | So basically a long story short,
00:52:56.440 | this returns the most eligible merging candidate pair
00:53:00.800 | that occurs in the tokens.
00:53:02.400 | Now, one thing to be careful with here
00:53:04.560 | is this function here might fail in the following way.
00:53:09.560 | If there's nothing to merge,
00:53:11.280 | then there's nothing in merges that is satisfied anymore.
00:53:16.280 | There's nothing to merge.
00:53:19.280 | Everything just returns float infs.
00:53:21.760 | And then the pair,
00:53:23.120 | I think will just become the very first element of stats.
00:53:26.640 | But this pair is not actually a mergeable pair.
00:53:29.320 | It just becomes the first pair inside stats arbitrarily
00:53:33.040 | because all of these pairs evaluate to float inf
00:53:35.600 | for the merging criterion.
00:53:38.240 | So basically it could be that this doesn't succeed
00:53:40.840 | because there's no more merging pairs.
00:53:42.080 | So if this pair is not in merges that was returned,
00:53:46.080 | then this is a signal for us
00:53:47.280 | that actually there was nothing to merge.
00:53:49.320 | No single pair can be merged anymore.
00:53:51.480 | In that case, we will break out.
00:53:53.080 | Nothing else can be merged.
00:53:56.560 | You may come up with a different implementation, by the way.
00:54:00.760 | This is kind of like really trying hard in Python.
00:54:03.720 | But really we're just trying to find a pair
00:54:07.120 | that can be merged with a lowest index here.
00:54:09.400 | Now, if we did find a pair that is inside merges
00:54:15.360 | with the lowest index, then we can merge it.
00:54:17.760 | So we're going to look into the merges dictionary
00:54:22.520 | for that pair to look up the index.
00:54:25.320 | And we're going to now merge into that index.
00:54:28.680 | So we're going to do tokens equals,
00:54:30.320 | and we're going to replace the original tokens.
00:54:34.520 | We're going to be replacing the pair, pair,
00:54:36.680 | and we're going to be replacing it with index IDX.
00:54:39.080 | And this returns a new list of tokens
00:54:41.640 | where every occurrence of pair is replaced with IDX.
00:54:44.520 | So we're doing a merge.
00:54:46.360 | And we're going to be continuing this
00:54:47.640 | until eventually nothing can be merged.
00:54:49.400 | We'll come out here and we'll break out.
00:54:51.360 | And here we just return tokens.
00:54:53.040 | And so that's the implementation, I think.
00:54:56.880 | So hopefully this runs.
00:54:58.240 | Okay, cool.
00:54:59.680 | Um, yeah, and this looks reasonable.
00:55:02.960 | So for example, 32 is a space in ASCII, so that's here.
00:55:06.600 | So this looks like it worked, great.
00:55:10.720 | Okay, so let's wrap up this section of the video at least.
00:55:13.480 | I wanted to point out that this is not quite
00:55:15.160 | the right implementation just yet
00:55:16.320 | because we are leaving out a special case.
00:55:18.760 | So in particular, if we tried to do this,
00:55:21.600 | this would give us an error.
00:55:23.280 | And the issue is that if we only have a single character
00:55:26.360 | or an empty string, then stats is empty
00:55:29.000 | and that causes an issue inside min.
00:55:31.080 | So one way to fight this is if len of tokens is at least two
00:55:36.080 | because if it's less than two,
00:55:37.320 | it's just a single token or no tokens,
00:55:38.880 | then let's just, there's nothing to merge,
00:55:41.040 | so we just return.
00:55:42.440 | So that would fix that case.
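Putting it all together, here is a sketch of the encode step as described, including the length guard just mentioned. It restates the `get_stats` and `merge` helpers from earlier in the video so it reads on its own, and it assumes the trained `merges` dictionary is in scope:

```python
def get_stats(ids):
    # count consecutive pairs, e.g. [1, 2, 3, 1, 2] -> {(1, 2): 2, (2, 3): 1, (3, 1): 1}
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # replace every occurrence of `pair` in `ids` with the new token id `idx`
    newids, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

def encode(text):
    tokens = list(text.encode("utf-8"))         # raw bytes as integers in 0..255
    while len(tokens) >= 2:                     # with fewer than two tokens there is nothing to merge
        stats = get_stats(tokens)               # we only use the keys: the candidate consecutive pairs
        # pick the candidate with the lowest merge index; non-mergeable pairs fall back to +inf
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break                               # no pair in the sequence is mergeable anymore
        idx = merges[pair]
        tokens = merge(tokens, pair, idx)       # do the merge, earliest-learned merges first
    return tokens
```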
00:55:44.280 | Okay.
00:55:46.440 | And then second, I have a few test cases here
00:55:48.880 | for us as well.
00:55:50.040 | So first let's make sure about, or let's note the following.
00:55:54.120 | If we take a string and we try to encode it
00:55:57.400 | and then decode it back,
00:55:58.800 | you'd expect to get the same string back, right?
00:56:01.080 | Is that true for all strings?
00:56:02.600 | So here, it is the case.
00:56:07.560 | And I think in general, this is probably the case,
00:56:10.040 | but notice that going backwards is not,
00:56:14.440 | you're not going to have an identity going backwards
00:56:16.640 | because as I mentioned,
00:56:18.280 | not all token sequences are valid UTF-8 sort of byte streams.
00:56:24.760 | And so therefore, some of them can't even be decodable.
00:56:27.600 | So this only goes in one direction,
00:56:31.480 | but for that one direction, we can check here.
00:56:34.360 | If we take the training text,
00:56:35.680 | which is the text that we trained the tokenizer on,
00:56:37.760 | we can make sure that when we encode and decode,
00:56:39.440 | we get the same thing back, which is true.
00:56:42.040 | And here I took some validation data.
00:56:43.600 | So I went to, I think this webpage and I grabbed some text.
00:56:47.120 | So this is text that the tokenizer has not seen,
00:56:49.440 | and we can make sure that this also works.
00:56:52.640 | Okay, so that gives us some confidence
00:56:53.960 | that this was correctly implemented.
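The sanity checks described here look roughly like this, where `train_text` and `val_text` are assumed to hold the training text and some unseen text respectively:

```python
assert decode(encode("hello world")) == "hello world"
assert decode(encode(train_text)) == train_text   # text the tokenizer was trained on
assert decode(encode(val_text)) == val_text       # text the tokenizer has never seen
# note: encode(decode(ids)) == ids is NOT guaranteed for arbitrary id sequences,
# because not every token sequence decodes to valid UTF-8 in the first place
```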
00:56:56.040 | So those are the basics of the byte-pair encoding algorithm.
00:56:59.160 | We saw how we can take some training set,
00:57:02.160 | train a tokenizer.
00:57:03.720 | The parameters of this tokenizer really
00:57:05.480 | are just this dictionary of merges.
00:57:07.960 | And that basically creates the little binary forest
00:57:10.080 | on top of raw bytes.
00:57:11.360 | Once we have this, the merges table,
00:57:14.720 | we can both encode and decode
00:57:16.560 | between raw text and token sequences.
00:57:19.040 | So that's the simplest setting of the tokenizer.
00:57:22.360 | What we're going to do now though,
00:57:23.480 | is we're going to look at some of the state-of-the-art
00:57:25.360 | large language models and the kinds of tokenizers
00:57:27.520 | that they use, and we're going to see
00:57:29.040 | that this picture complexifies very quickly.
00:57:31.200 | So we're going to go through the details
00:57:33.360 | of this complexification one at a time.
00:57:37.280 | So let's kick things off by looking at the GPT series.
00:57:40.040 | So in particular, I have the GPT-2 paper here,
00:57:42.460 | and this paper is from 2019 or so, so five years ago.
00:57:48.000 | And let's scroll down to input representation.
00:57:51.220 | This is where they talk about the tokenizer
00:57:52.720 | that they're using for GPT-2.
00:57:55.160 | Now, this is all fairly readable,
00:57:56.520 | so I encourage you to pause and read this yourself,
00:58:00.000 | but this is where they motivate the use
00:58:01.500 | of the byte-pair encoding algorithm
00:58:03.960 | on the byte-level representation of UTF-8 encoding.
00:58:08.700 | So this is where they motivate it,
00:58:10.040 | and they talk about the vocabulary sizes and everything.
00:58:13.120 | Now, everything here is exactly as we've covered it so far,
00:58:16.040 | but things start to depart around here.
00:58:18.700 | So what they mention is that they don't just apply
00:58:21.040 | the Naive algorithm as we have done it.
00:58:23.520 | And in particular, here's a motivating example.
00:58:26.240 | Suppose that you have common words like dog.
00:58:28.440 | What will happen is that dog, of course,
00:58:30.080 | occurs very frequently in the text,
00:58:32.620 | and it occurs right next to all kinds of punctuation,
00:58:35.300 | as an example, so dog dot, dog exclamation mark,
00:58:38.920 | dog question mark, et cetera.
00:58:40.800 | And naively, you might imagine that the BPE algorithm
00:58:43.520 | could merge these to be single tokens,
00:58:45.680 | and then you end up with lots of tokens
00:58:47.280 | that are just like dog with a slightly different punctuation.
00:58:50.080 | And so it feels like you're clustering things
00:58:51.680 | that shouldn't be clustered.
00:58:52.520 | You're combining kind of semantics with punctuation.
00:58:55.420 | And this feels suboptimal.
00:58:58.840 | And indeed, they also say that this is suboptimal
00:59:01.520 | according to some of the experiments.
00:59:03.440 | So what they want to do is they want to top down
00:59:05.380 | in a manual way, enforce that some types of characters
00:59:09.640 | should never be merged together.
00:59:11.580 | So they want to enforce these merging rules
00:59:14.840 | on top of the BytePair encoding algorithm.
00:59:17.720 | So let's take a look at their code
00:59:19.920 | and see how they actually enforce this
00:59:21.480 | and what kinds of mergers they actually do perform.
00:59:24.280 | So I have the tab open here for GPT-2
00:59:27.280 | under OpenAI on GitHub.
00:59:29.520 | And when we go to source, there is an encoder.py.
00:59:34.000 | Now, I don't personally love that they called encoder.py
00:59:36.320 | because this is the tokenizer,
00:59:38.120 | and the tokenizer can do both encode and decode.
00:59:41.080 | So it feels kind of awkward to me that it's called encoder,
00:59:43.120 | but that is the tokenizer.
00:59:45.520 | And there's a lot going on here,
00:59:46.680 | and we're gonna step through it in detail at one point.
00:59:49.280 | For now, I just want to focus on this part here.
00:59:52.260 | They create a regex pattern here
00:59:55.120 | that looks very complicated,
00:59:56.200 | and we're gonna go through it in a bit.
00:59:58.680 | But this is the core part that allows them to enforce rules
01:00:02.660 | for what parts of the text will never be merged for sure.
01:00:06.040 | Now, notice that re.compile here is a little bit misleading
01:00:09.640 | because we're not just doing import re,
01:00:12.120 | which is the Python re module,
01:00:13.580 | we're doing import regex as re.
01:00:15.640 | And regex is a Python package that you can install,
01:00:18.720 | pip install regex,
01:00:20.160 | and it's basically an extension of re,
01:00:21.800 | so it's a bit more powerful re.
01:00:23.360 | So let's take a look at this pattern and what it's doing
01:00:29.700 | and why this is actually doing the separation
01:00:32.280 | that they are looking for.
01:00:33.880 | Okay, so I've copy pasted the pattern here
01:00:35.820 | to our Jupyter Notebook where we left off,
01:00:38.280 | and let's take this pattern for a spin.
01:00:40.700 | So in the exact same way that their code does,
01:00:43.400 | we're going to call an re.findAll for this pattern
01:00:47.000 | on any arbitrary string that we are interested in.
01:00:49.440 | So this is the string that we want to encode into tokens
01:00:52.240 | to feed into an LLM, like GPT-2.
01:00:56.800 | So what exactly is this doing?
01:00:59.040 | Well, re.findAll will take this pattern
01:01:01.120 | and try to match it against this string.
01:01:03.120 | The way this works is that you are going from left to right
01:01:07.760 | in the string, and you're trying to match the pattern.
01:01:11.320 | And re.findAll will get all the occurrences
01:01:15.060 | and organize them into a list.
01:01:16.560 | Now, when you look at this pattern,
01:01:20.500 | first of all, notice that this is a raw string,
01:01:22.940 | and then these are three double quotes
01:01:26.220 | just to start the string.
01:01:27.820 | So really the string itself,
01:01:29.400 | this is the pattern itself, right?
01:01:31.060 | And notice that it's made up of a lot of ORs.
01:01:35.260 | So see these vertical bars?
01:01:36.500 | Those are ORs in Regex.
01:01:38.480 | And so you go from left to right in this pattern
01:01:41.460 | and try to match it against the string wherever you are.
01:01:44.540 | So we have hello, and we're gonna try to match it.
01:01:48.040 | Well, it's not apostrophe S,
01:01:49.560 | it's not apostrophe T, or any of these,
01:01:52.520 | but it is an optional space followed by dash P of,
01:01:57.220 | sorry, slash P of L one or more times.
01:02:00.340 | What is slash P of L?
01:02:01.940 | It is, according to some documentation that I found,
01:02:05.260 | There might be other sources as well.
01:02:09.200 | Slash P of L is a letter,
01:02:11.220 | any kind of letter from any language.
01:02:13.660 | And hello is made up of letters, H-E-L-L-O, et cetera.
01:02:18.340 | So optional space followed by a bunch of letters,
01:02:21.520 | one or more letters, is going to match hello,
01:02:24.740 | but then the match ends
01:02:26.260 | because a white space is not a letter.
01:02:28.740 | So from there on begins a new sort of attempt
01:02:33.220 | to match against the string again.
01:02:35.780 | And starting in here,
01:02:37.080 | we're gonna skip over all of these again
01:02:39.140 | until we get to the exact same point again.
01:02:41.580 | And we see that there's an optional space,
01:02:43.540 | this is the optional space,
01:02:44.980 | followed by a bunch of letters, one or more of them.
01:02:47.140 | And so that matches.
01:02:48.620 | So when we run this, we get a list of two elements,
01:02:52.620 | hello, and then space world.
01:02:54.580 | So how are you if we add more letters?
01:02:58.840 | We would just get them like this.
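For reference, the pattern from OpenAI's released encoder.py and the kind of experiment being run here look roughly like this (note it needs `pip install regex`, since the `\p{...}` classes are not supported by the standard `re` module):

```python
import regex as re

# the chunking pattern from the GPT-2 encoder.py
gpt2pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

print(re.findall(gpt2pat, "Hello world how are you"))
# -> ['Hello', ' world', ' how', ' are', ' you']
```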
01:03:01.180 | Now, what is this doing and why is this important?
01:03:03.700 | We are taking our string
01:03:05.460 | and instead of directly encoding it for tokenization,
01:03:10.020 | we are first splitting it up.
01:03:12.180 | And when you actually step through the code,
01:03:14.140 | and we'll do that in a bit more detail,
01:03:16.320 | what really it's doing on a high level
01:03:17.740 | is that it first splits your text into a list of texts,
01:03:22.740 | just like this one.
01:03:24.620 | And all these elements of this list
01:03:26.420 | are processed independently by the tokenizer.
01:03:29.220 | And all of the results of that processing
01:03:31.340 | are simply concatenated.
01:03:33.240 | So hello world, oh, I missed how.
01:03:36.660 | Hello world, how are you?
01:03:39.580 | We have five elements of a list.
01:03:41.180 | All of these will independently go from text
01:03:46.180 | to a token sequence.
01:03:48.060 | And then that token sequence is going to be concatenated.
01:03:50.580 | It's all gonna be joined up.
01:03:52.700 | And roughly speaking, what that does
01:03:54.940 | is you're only ever finding merges
01:03:57.420 | within the elements of this list.
01:03:59.240 | So you can only ever consider merges
01:04:00.740 | within every one of these elements individually.
01:04:03.200 | And after you've done all the possible merging
01:04:07.540 | for all of these elements individually,
01:04:09.460 | the results of all that will be joined by concatenation.
01:04:14.460 | And so you are basically,
01:04:16.220 | what you're doing effectively
01:04:17.820 | is you are never going to be merging this E
01:04:20.980 | with this space because they are now parts
01:04:23.500 | of the separate elements of this list.
01:04:25.860 | And so you are saying we are never gonna merge E space
01:04:29.260 | because we're breaking it up in this way.
01:04:33.580 | So basically using this regex pattern to chunk up the text
01:04:37.420 | is just one way of enforcing
01:04:39.580 | that some merges are not to happen.
01:04:43.100 | And we're gonna go into more of this text
01:04:44.660 | and we'll see that what this is trying to do
01:04:46.000 | on a high level is we're trying to not merge
01:04:48.100 | across letters, across numbers,
01:04:50.220 | across punctuation, and so on.
01:04:52.700 | So let's see in more detail how that works.
01:04:54.500 | So let's continue now.
01:04:55.940 | We have slash P of N.
01:04:57.660 | If you go to the documentation,
01:04:59.740 | slash P of N is any kind of numeric character
01:05:03.380 | in any script.
01:05:04.420 | So it's numbers.
01:05:05.940 | So we have an optional space followed by numbers
01:05:07.940 | and those would be separated out.
01:05:09.740 | So letters and numbers are being separated.
01:05:11.540 | So if I do, hello world, one, two, three, how are you?
01:05:14.980 | Then world will stop matching here
01:05:16.980 | because one is not a letter anymore,
01:05:19.580 | but one is a number.
01:05:20.740 | So this group will match for that
01:05:22.520 | and we'll get it as a separate entity.
01:05:24.460 | Let's see how these apostrophes work.
01:05:28.420 | So here, if we have apostrophe V
01:05:34.180 | as an example,
01:05:36.660 | then apostrophe here is not a letter or a number.
01:05:39.700 | So hello will stop matching
01:05:42.420 | and then we will exactly match this with that.
01:05:45.940 | So that will come out as a separate thing.
01:05:48.340 | So why are they doing the apostrophes here?
01:05:51.620 | Honestly, I think that these are just
01:05:52.740 | like very common apostrophes that are used typically.
01:05:57.740 | I don't love that they've done this
01:05:59.620 | because let me show you what happens
01:06:03.420 | when you have some Unicode apostrophes.
01:06:06.380 | Like for example, you can have,
01:06:08.140 | if you have house, then this will be separated out
01:06:11.180 | because of this matching.
01:06:13.140 | But if you use the Unicode apostrophe like this,
01:06:16.200 | then suddenly this does not work.
01:06:19.940 | And so this apostrophe will actually
01:06:21.660 | become its own thing now.
01:06:23.740 | And so it's basically hard-coded
01:06:25.980 | for this specific kind of apostrophe
01:06:28.260 | and otherwise they become completely separate tokens.
01:06:33.220 | In addition to this, you can go to the GPT-2 docs
01:06:37.680 | and here when they define the pattern,
01:06:39.460 | they say they should have added re.IGNORECASE,
01:06:42.300 | so BPE merges can happen
01:06:43.480 | for capitalized versions of contractions.
01:06:45.560 | So what they're pointing out is that
01:06:47.180 | you see how this is apostrophe and then lowercase letters?
01:06:50.820 | Well, because they didn't do re.ignoreCase,
01:06:53.860 | then these rules will not separate out the apostrophes
01:06:57.940 | if it's uppercase.
01:06:59.820 | So house would be like this,
01:07:03.940 | but if I did house from uppercase,
01:07:09.020 | then notice suddenly the apostrophe comes by itself.
01:07:12.280 | So the tokenization will work differently
01:07:16.220 | in uppercase and lowercase,
01:07:17.560 | inconsistently separating out these apostrophes.
01:07:19.940 | So it feels extremely gnarly and slightly gross,
01:07:22.380 | but that's how that works.
01:07:25.780 | Okay, so let's come back.
01:07:27.320 | After trying to match a bunch of apostrophe expressions,
01:07:29.960 | by the way, the other issue here is that
01:07:31.100 | these are quite language-specific probably.
01:07:33.340 | So I don't know that all the languages, for example,
01:07:35.940 | use or don't use apostrophes,
01:07:37.180 | but that would be inconsistently tokenized as a result.
01:07:41.060 | Then we try to match letters.
01:07:42.500 | Then we try to match numbers.
01:07:44.340 | And then if that doesn't work, we fall back to here.
01:07:47.540 | And what this is saying is, again,
01:07:48.820 | optional space followed by something
01:07:50.340 | that is not a letter, number, or a space,
01:07:53.580 | and one or more of that.
01:07:55.060 | So what this is doing effectively
01:07:56.300 | is this is trying to match punctuation, roughly speaking,
01:07:59.100 | not letters and not numbers.
01:08:01.140 | So this group will try to trigger for that.
01:08:03.380 | So if I do something like this,
01:08:05.600 | then these parts here are not letters or numbers,
01:08:09.540 | but they will actually get caught here.
01:08:13.540 | And so they become its own group.
01:08:15.760 | So we've separated out the punctuation.
01:08:18.420 | And finally, this is also a little bit confusing.
01:08:21.660 | So this is matching whitespace,
01:08:24.140 | but this is using a negative lookahead assertion in regex.
01:08:29.060 | So what this is doing is it's matching whitespace up to,
01:08:32.140 | but not including the last whitespace character.
01:08:34.820 | Why is this important?
01:08:37.900 | This is pretty subtle, I think.
01:08:39.440 | So you see how the whitespace is always included
01:08:41.660 | at the beginning of the word.
01:08:43.460 | So space R, space U, et cetera.
01:08:47.340 | Suppose we have a lot of spaces here.
01:08:49.180 | What's gonna happen here is that these spaces up to
01:08:53.740 | and not including the last character will get caught by this.
01:08:58.020 | And what that will do is it will separate out the spaces
01:09:01.180 | up to but not including the last character
01:09:03.300 | so that the last character can come here
01:09:05.620 | and join with the space U.
01:09:08.620 | And the reason that's nice is because space U
01:09:11.220 | is the common token.
01:09:12.580 | So if I didn't have these extra spaces here,
01:09:15.060 | you would just have space U.
01:09:16.500 | And if I add spaces, we still have a space U,
01:09:20.660 | but now we have all this extra whitespace.
01:09:23.100 | So basically the GPT-2 tokenizer really likes
01:09:25.060 | to have a space before letters or numbers,
01:09:27.780 | and it prepends these spaces to them.
01:09:30.320 | And this is just something that it is consistent about.
01:09:33.020 | So that's what that is for.
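As a quick demonstration of that negative lookahead, reusing `gpt2pat` from the sketch above:

```python
# a run of spaces matches \s+(?!\S), i.e. everything except the last space,
# so that the final space stays attached to the following word as ' world'
print(re.findall(gpt2pat, "Hello" + " " * 6 + "world"))
# -> ['Hello', '     ', ' world']   (the middle element is the first five spaces)
```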
01:09:34.460 | And then finally, we have all the,
01:09:36.300 | the last fallback is whitespace characters.
01:09:39.780 | So that would be just,
01:09:42.660 | if that doesn't get caught,
01:09:46.100 | then this thing will catch any trailing spaces and so on.
01:09:50.100 | I wanted to show one more real world example here.
01:09:52.740 | So if we have this string, which is a piece of Python code,
01:09:55.340 | and then we try to split it up,
01:09:57.300 | then this is the kind of output we get.
01:09:59.500 | So you'll notice that the list has many elements here,
01:10:01.580 | and that's because we are splitting up fairly often,
01:10:04.900 | every time sort of a category changes.
01:10:07.440 | So there will never be any merges across these elements.
01:10:11.940 | And that's what you are seeing here.
01:10:14.780 | Now, you might think that in order to train the tokenizer,
01:10:19.020 | OpenAI has used this to split up text into chunks,
01:10:23.220 | and then run just a BP algorithm within all the chunks.
01:10:26.820 | But that's not exactly what happened.
01:10:28.580 | And the reason is the following.
01:10:30.420 | Notice that we have the spaces here.
01:10:33.340 | Those spaces end up being entire elements.
01:10:36.440 | But these spaces never actually end up being merged
01:10:39.660 | by OpenAI.
01:10:40.500 | And the way you can tell is that
01:10:42.060 | if you copy paste the exact same chunk here
01:10:43.740 | into the Tiktokenizer web app,
01:10:46.900 | you see that all the spaces are kept independent,
01:10:49.260 | and they're all token 220.
01:10:50.840 | So I think OpenAI at some point enforced some rule
01:10:55.240 | that these spaces would never be merged.
01:10:57.580 | And so there's some additional rules
01:11:00.420 | on top of just chunking and BP
01:11:02.920 | that OpenAI is not clear about.
01:11:05.400 | Now, the training code for the GPT-2 tokenizer
01:11:07.380 | was never released.
01:11:08.540 | So all we have is the code that I've already shown you.
01:11:12.280 | But this code here that they've released
01:11:14.020 | is only the inference code for the tokens.
01:11:17.300 | So this is not the training code.
01:11:18.620 | You can't give it a piece of text and train tokenizer.
01:11:21.620 | This is just the inference code,
01:11:23.020 | which takes the merges that we have up above
01:11:26.340 | and applies them to a new piece of text.
01:11:29.260 | And so we don't know exactly how OpenAI trained
01:11:31.420 | the tokenizer, but it wasn't as simple
01:11:34.460 | as chunk it up and BP it, whatever it was.
01:11:38.340 | Next, I wanted to introduce you
01:11:39.860 | to the tiktoken library from OpenAI,
01:11:42.220 | which is the official library
01:11:43.980 | for tokenization from OpenAI.
01:11:46.200 | So this is tiktoken: pip install tiktoken,
01:11:50.180 | and then you can do the tokenization inference.
01:11:54.300 | So this is again, not training code.
01:11:55.740 | This is only inference code for tokenization.
01:11:57.980 | I wanted to show you how you would use it, quite simple.
01:12:02.180 | And running this just gives us the GPT-2 tokens
01:12:04.860 | or the GPT-4 tokens.
01:12:06.620 | So this is the tokenizer used for GPT-4.
01:12:09.420 | And so in particular, we see that the whitespace in GPT-2,
01:12:12.080 | it remains unmerged, but in GPT-4,
01:12:14.500 | these whitespaces merge, as we also saw in this one,
01:12:18.740 | where here they're all unmerged,
01:12:20.140 | but if we go down to GPT-4, they become merged.
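The usage being shown is roughly the following; the whitespace handling is where the two encodings visibly differ:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")            # GPT-2 tokenizer: runs of spaces stay unmerged
print(enc.encode("    hello world!!!"))

enc = tiktoken.get_encoding("cl100k_base")     # GPT-4 tokenizer: runs of spaces get merged
print(enc.encode("    hello world!!!"))
```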
01:12:23.780 | Now in the GPT-4 tokenizer,
01:12:30.060 | they changed the regular expression
01:12:32.320 | that they use to chunk up text.
01:12:34.700 | So the way to see this is that if you come
01:12:36.200 | to the tiktoken library,
01:12:39.500 | and then you go to this file, tiktoken_ext/openai_public.py,
01:12:43.800 | this is where sort of like the definition
01:12:45.360 | of all these different tokenizers that OpenAI maintains is.
01:12:48.680 | And so necessarily to do the inference,
01:12:51.120 | they had to publish some of the details about the strings.
01:12:54.020 | So this is the string that we already saw for GPT-2.
01:12:57.080 | It is slightly different, but it is actually equivalent
01:13:00.000 | to what we discussed here.
01:13:02.020 | So this pattern that we discussed
01:13:03.520 | is equivalent to this pattern,
01:13:05.600 | and this one just executes a little bit faster.
01:13:08.320 | So here you see a little bit
01:13:09.360 | of a slightly different definition,
01:13:10.620 | but otherwise it's the same.
01:13:12.560 | We're gonna go into special tokens in a bit.
01:13:15.160 | And then if you scroll down to cl100k_base,
01:13:18.440 | this is the GPT-4 tokenizer,
01:13:20.280 | you see that the pattern has changed.
01:13:22.240 | And this is kind of like the major change
01:13:26.000 | in addition to a bunch of other special tokens,
01:13:27.800 | which we'll go into a bit again.
01:13:29.400 | Now, I'm not going to actually go into the full detail
01:13:32.840 | of the pattern change,
01:13:33.680 | because honestly, this is mind-numbing.
01:13:35.960 | I would just advise that you pull up ChatGPT
01:13:38.680 | and the regex documentation and just step through it.
01:13:42.240 | But really the major changes are,
01:13:44.280 | number one, you see this I here?
01:13:47.320 | That means that the case sensitivity,
01:13:50.800 | this is case insensitive match.
01:13:52.680 | And so the comment that we saw earlier on,
01:13:55.600 | oh, we should have added re.IGNORECASE.
01:13:58.400 | Basically, we're now going to be matching
01:14:01.640 | these apostrophe S, apostrophe D, apostrophe M, et cetera.
01:14:06.640 | We're gonna be matching them both in lowercase
01:14:08.200 | and in uppercase.
01:14:09.640 | So that's fixed.
01:14:11.080 | There's a bunch of different handling of the whitespace
01:14:13.400 | that I'm not going to go into the full details of.
01:14:15.800 | And then one more thing here is you will notice
01:14:18.600 | that when they match the numbers,
01:14:20.240 | they only match one to three numbers.
01:14:23.200 | So they will never merge numbers
01:14:27.080 | that are in more than three digits,
01:14:29.840 | only up to three digits of numbers will ever be merged.
01:14:33.920 | And that's one change that they made as well
01:14:36.360 | to prevent tokens that are very, very long number sequences.
01:14:40.120 | But again, we don't really know why they do
01:14:43.000 | any of this stuff because none of this is documented
01:14:45.880 | and it's just, we just get the pattern.
01:14:48.360 | So yeah, it is what it is.
01:14:51.440 | But those are some of the changes that GPT-4 has made.
01:14:54.400 | And of course, the vocabulary size went
01:14:56.400 | from roughly 50K to roughly 100K.
01:14:59.480 | The next thing I would like to do very briefly
01:15:01.120 | is to take you through the GPT-2 encoder.py
01:15:03.880 | that OpenAI has released.
01:15:05.360 | This is the file that I already mentioned to you briefly.
01:15:09.280 | Now, this file is fairly short
01:15:12.560 | and should be relatively understandable
01:15:14.240 | to you at this point.
01:15:15.280 | Starting at the bottom here, they are loading two files,
01:15:21.000 | encoder.json and vocab.bpe.
01:15:23.680 | And they do some light processing on it
01:15:25.120 | and then they call this encoder object,
01:15:26.840 | which is the tokenizer.
01:15:28.640 | Now, if you'd like to inspect these two files,
01:15:31.160 | which together constitute their saved tokenizer,
01:15:34.480 | then you can do that with a piece of code like this.
01:15:37.080 | This is where you can download these two files
01:15:40.040 | and you can inspect them if you'd like.
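A sketch of what that inspection might look like; the URLs here are my assumption of where the GPT-2 (124M) files live, based on OpenAI's model download script, so treat them as such:

```python
import json
import os
import urllib.request

base = "https://openaipublic.blob.core.windows.net/gpt-2/models/124M"   # assumed location
for fname in ("encoder.json", "vocab.bpe"):
    if not os.path.exists(fname):
        urllib.request.urlretrieve(f"{base}/{fname}", fname)

with open("encoder.json") as f:
    encoder = json.load(f)        # ~50,257 entries: token string -> integer id (their "encoder" ~ our vocab, inverted)

with open("vocab.bpe", encoding="utf-8") as f:
    bpe_data = f.read()           # the merge rules, one pair per line (their vocab.bpe ~ our merges)

print(len(encoder))
print(bpe_data.splitlines()[1:4])  # a few of the first merges (line 0 is a version header)
```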
01:15:41.800 | And what you will find is that this encoder,
01:15:44.440 | as they call it in their code,
01:15:45.960 | is exactly equivalent to our vocab.
01:15:48.680 | So remember here where we have this vocab object,
01:15:52.840 | which allowed us to decode very efficiently.
01:15:55.000 | And basically it took us from the integer
01:15:57.560 | to the bytes for that integer.
01:16:01.600 | So our vocab is exactly their encoder.
01:16:05.000 | And then their vocab.bpe, confusingly,
01:16:08.720 | is actually our merges.
01:16:11.000 | So their bpe_merges variable,
01:16:12.560 | which is based on the data inside vocab.bpe,
01:16:16.240 | ends up being equivalent to our merges.
01:16:18.920 | So basically they are saving and loading
01:16:22.880 | the two variables that for us are also critical,
01:16:26.320 | the merges variable and the vocab variable.
01:16:29.000 | Using just these two variables,
01:16:30.920 | you can represent a tokenizer
01:16:32.400 | and you can both do encoding and decoding
01:16:34.320 | once you've trained this tokenizer.
01:16:36.160 | Now, the only thing that is actually slightly confusing
01:16:41.480 | inside what OpenAI does here
01:16:43.840 | is that in addition to this encoder and the decoder,
01:16:46.560 | they also have something called a byte encoder
01:16:48.480 | and a byte decoder.
01:16:50.280 | And this is actually, unfortunately,
01:16:51.920 | just kind of a spurious implementation detail.
01:16:55.960 | It isn't actually deep or interesting in any way,
01:16:58.240 | so I'm going to skip the discussion of it.
01:17:00.280 | But what OpenAI does here,
01:17:01.400 | for reasons that I don't fully understand,
01:17:03.000 | is that not only do they have this tokenizer,
01:17:05.720 | which can encode and decode,
01:17:07.200 | but they have a whole separate layer here in addition
01:17:09.520 | that is used serially with the tokenizer.
01:17:12.000 | And so you first do byte encode and then encode,
01:17:16.080 | and then you do decode and then byte decode.
01:17:18.560 | So that's the loop,
01:17:19.920 | and they are just stacked serial on top of each other.
01:17:22.960 | And it's not that interesting,
01:17:24.560 | so I won't cover it, and you can step through it
01:17:26.080 | if you'd like.
01:17:27.360 | Otherwise, this file,
01:17:28.520 | if you ignore the byte encoder and the byte decoder,
01:17:30.800 | will be algorithmically very familiar to you.
01:17:33.120 | And the meat of it here is what they call BPE function.
01:17:37.080 | And you should recognize this loop here,
01:17:39.640 | which is very similar to our own while loop,
01:17:42.040 | where they're trying to identify the bigram, a pair,
01:17:46.000 | that they should be merging next.
01:17:48.000 | And then here, just like we had,
01:17:49.680 | they have a for loop trying to merge this pair.
01:17:52.320 | So they will go over all of the sequence,
01:17:54.120 | and they will merge the pair whenever they find it.
01:17:57.080 | And they keep repeating that
01:17:58.440 | until they run out of possible merges in the text.
01:18:02.240 | So that's the meat of this file.
01:18:03.920 | And there's an encode and decode function,
01:18:06.040 | just like we have implemented it.
01:18:07.880 | So long story short,
01:18:08.840 | what I want you to take away at this point is that,
01:18:11.040 | unfortunately, it's a little bit of a messy code
01:18:12.680 | that they have, but algorithmically,
01:18:14.280 | it is identical to what we've built up above.
01:18:17.080 | And what we've built up above, if you understand it,
01:18:19.440 | is algorithmically what is necessary
01:18:21.440 | to actually build a BPE tokenizer, train it,
01:18:24.440 | and then both encode and decode.
01:18:26.920 | The next topic I would like to turn to
01:18:28.280 | is that of special tokens.
01:18:30.360 | So in addition to tokens that are coming from raw bytes
01:18:33.400 | and the BPE merges,
01:18:35.000 | we can insert all kinds of tokens
01:18:36.480 | that we are going to use
01:18:37.480 | to delimit different parts of the data
01:18:39.880 | or introduce in order to create a special structure
01:18:42.320 | of the token streams.
01:18:44.760 | So if you look at this encoder object
01:18:47.960 | from OpenAI's GPT-2 right here,
01:18:50.920 | we mentioned this is very similar to our vocab.
01:18:53.280 | You'll notice that the length of this is 50,257.
01:18:57.560 | As I mentioned, it's mapping,
01:19:01.920 | and it's inverted from the mapping of our vocab.
01:19:04.240 | Our vocab goes from integer to string,
01:19:07.000 | and they go the other way around for no amazing reason.
01:19:10.360 | But the thing to note here is that
01:19:12.880 | the mapping table here is 50,257.
01:19:16.080 | Where does that number come from?
01:19:18.120 | Where are the tokens?
01:19:20.440 | As I mentioned, there are 256 raw byte tokens.
01:19:23.640 | And then OpenAI actually did 50,000 merges.
01:19:28.720 | So those become the other tokens.
01:19:32.040 | But this would have been 50,256.
01:19:35.040 | So what is the 57th token?
01:19:37.600 | And there is basically one special token.
01:19:39.680 | And that one special token,
01:19:43.040 | you can see it's called end of text.
01:19:46.600 | So this is a special token,
01:19:48.080 | and it's the very last token.
01:19:50.560 | And this token is used to delimit documents
01:19:53.240 | in the training set.
01:19:55.040 | So when we're creating the training data,
01:19:57.320 | we have all these documents,
01:19:58.520 | and we tokenize them and get a stream of tokens.
01:20:01.600 | Those tokens only range from zero up to 50,255.
01:20:06.280 | And then in between those documents,
01:20:08.000 | we put special end of text token.
01:20:10.560 | And we insert that token in between documents.
01:20:14.400 | And we are using this as a signal to the language model
01:20:17.800 | that the document has ended,
01:20:19.800 | and what follows is going to be unrelated
01:20:22.080 | to the document previously.
01:20:24.320 | That said, the language model has to learn this from data.
01:20:26.920 | It needs to learn that this token usually means
01:20:29.720 | that it should wipe its sort of memory of what came before.
01:20:33.240 | And what came before this token
01:20:34.520 | is not actually informative to what comes next.
01:20:36.960 | But we are expecting the language model
01:20:38.240 | to just like learn this,
01:20:39.680 | but we're giving it the special sort of delimiter
01:20:42.280 | of these documents.
01:20:44.160 | We can go here to the Tiktokenizer,
01:20:45.800 | and this is the GPT-2 tokenizer,
01:20:48.800 | our code that we've been playing with before.
01:20:51.080 | So we can add here, right?
01:20:52.080 | Hello world, how are you?
01:20:54.360 | And we're getting different tokens.
01:20:55.880 | But now you can see what happens if I put end of text.
01:20:59.640 | You see how until I finished it,
01:21:03.200 | these are all different tokens.
01:21:04.960 | End of text is still a set of separate tokens.
01:21:08.560 | And now when I finish it,
01:21:10.000 | suddenly we get token 50,256.
01:21:14.360 | And the reason this works is because
01:21:16.920 | this didn't actually go through the BPE merges.
01:21:19.960 | Instead, the code that actually outputs the tokens
01:21:24.200 | has special case instructions for handling special tokens.
01:21:28.480 | We did not see these special instructions
01:21:31.520 | for handling special tokens in the encoder.py.
01:21:34.840 | It's absent there.
01:21:36.440 | But if you go to the tiktoken library,
01:21:37.960 | which is implemented in Rust,
01:21:40.520 | you will find all kinds of special case handling
01:21:42.680 | for these special tokens that you can register,
01:21:45.680 | create, add to the vocabulary,
01:21:48.080 | and then it looks for them
01:21:49.000 | and whenever it sees these special tokens like this,
01:21:52.720 | it will actually come in and swap in that special token.
01:21:55.920 | So these things are outside of the typical algorithm
01:21:58.760 | of byte-pair encoding.
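With tiktoken, for example, a special token is only swapped in if you explicitly allow it; otherwise the library raises an error to protect you from encoding it by accident:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))   # -> [50256]
print(enc.encode("<|endoftext|"))   # incomplete string, so it's just ordinary BPE tokens
```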
01:22:00.280 | So these special tokens are used pervasively,
01:22:04.040 | not just in basically base language modeling
01:22:06.840 | of predicting the next token in the sequence,
01:22:08.880 | but especially when it gets to later,
01:22:10.520 | to the fine tuning stage,
01:22:11.840 | and all of the chat GPT sort of aspects of it.
01:22:15.440 | Because we don't just want to delimit documents,
01:22:16.960 | we want to delimit entire conversations
01:22:18.720 | between an assistant and a user.
01:22:21.120 | So if I refresh this Tiktokenizer page,
01:22:24.160 | the default example that they have here
01:22:26.240 | is using not sort of base model encoders,
01:22:30.080 | but fine-tuned model sort of tokenizers.
01:22:33.240 | So for example, using the GPT 3.5 turbo scheme,
01:22:37.120 | these here are all special tokens.
01:22:39.520 | im_start, im_end, et cetera.
01:22:43.080 | im_start is short for imaginary monologue start, by the way.
01:22:46.880 | But you can see here that there's a sort of start
01:22:50.320 | and end of every single message,
01:22:51.800 | and there can be many other tokens,
01:22:53.520 | lots of tokens in use to delimit these conversations
01:22:58.280 | and kind of keep track of the flow of the messages here.
01:23:01.880 | Now we can go back to the tiktoken library.
01:23:04.800 | And here, when you scroll to the bottom,
01:23:07.320 | they talk about how you can extend tiktoken.
01:23:09.320 | And basically, you can create,
01:23:11.000 | you can fork the cl100k_base tokenizer used in GPT-4.
01:23:16.000 | And for example, you can extend it
01:23:18.280 | by adding more special tokens.
01:23:19.720 | And these are totally up to you.
01:23:20.640 | You can come up with any arbitrary tokens
01:23:22.440 | and add them with a new ID afterwards.
01:23:25.480 | And the tiktoken library will correctly swap them out
01:23:29.360 | when it sees this in the strings.
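The extension pattern described in the tiktoken docs looks roughly like this: fork cl100k_base and register extra special tokens on top of it (the token name and id below are purely illustrative):

```python
import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")

enc = tiktoken.Encoding(
    name="cl100k_custom",
    pat_str=cl100k_base._pat_str,                      # same chunking regex
    mergeable_ranks=cl100k_base._mergeable_ranks,      # same BPE merges
    special_tokens={
        **cl100k_base._special_tokens,
        "<|my_special_token|>": 100264,                # made-up token at an id past the base vocabulary
    },
)

print(enc.encode("hello <|my_special_token|>", allowed_special="all"))
```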
01:23:31.800 | Now, we can also go back to this file,
01:23:35.240 | which we looked at previously.
01:23:37.160 | And I mentioned that for GPT-2, in tiktoken's
01:23:40.040 | openai_public.py, we have the vocabulary,
01:23:43.600 | we have the pattern for splitting.
01:23:45.760 | And then here we are registering
01:23:46.960 | the single special token in GPT 2,
01:23:48.760 | which was the end of text token.
01:23:50.280 | And we saw that it has this ID.
01:23:52.040 | In GPT 4, when they define this here,
01:23:56.400 | you see that the pattern has changed as we've discussed,
01:23:58.600 | but also the special tokens have changed in this tokenizer.
01:24:01.240 | So we, of course, have the end of text, just like in GPT 2,
01:24:05.040 | but we also see three, sorry, four additional tokens here.
01:24:08.720 | FIM prefix, middle, and suffix.
01:24:11.000 | What is FIM?
01:24:11.840 | FIM is short for fill in the middle.
01:24:14.560 | And if you'd like to learn more about this idea,
01:24:17.120 | it comes from this paper.
01:24:18.360 | And I'm not gonna go into detail in this video.
01:24:21.280 | It's beyond this video.
01:24:22.640 | And then there's one additional sort of token here.
01:24:26.760 | So that's that encoding as well.
01:24:28.640 | So it's very common, basically, to train a language model.
01:24:32.880 | And then if you'd like, you can add special tokens.
01:24:36.920 | Now, when you add special tokens,
01:24:38.400 | you, of course, have to do some model surgery
01:24:41.480 | to the transformer
01:24:42.560 | and all of the parameters involved in that transformer,
01:24:44.920 | because you are basically adding an integer.
01:24:47.000 | And you wanna make sure that, for example,
01:24:48.320 | your embedding matrix for the vocabulary tokens
01:24:51.360 | has to be extended by adding a row.
01:24:53.800 | And typically, this row would be initialized
01:24:55.800 | with small random numbers or something like that,
01:24:58.360 | because we need to have a vector
01:24:59.560 | that now stands for that token.
01:25:02.520 | In addition to that,
01:25:03.360 | you have to go to the final layer of the transformer,
01:25:05.000 | and you have to make sure that that projection
01:25:06.600 | at the very end into the classifier
01:25:09.000 | is extended by one as well.
01:25:10.920 | So basically, there's some model surgery involved
01:25:13.000 | that you have to couple with the tokenization changes
01:25:16.400 | if you are going to add special tokens.
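As a rough sketch of that model surgery in PyTorch terms, assuming a GPT-2-style model with a token embedding table `wte` and a final projection `lm_head` (the dimensions and init scale here are illustrative):

```python
import torch
import torch.nn as nn

old_vocab_size, n_embd, n_new = 50257, 768, 1   # illustrative GPT-2-ish sizes, one new special token

wte = nn.Embedding(old_vocab_size, n_embd)                # existing token embedding table
lm_head = nn.Linear(n_embd, old_vocab_size, bias=False)   # existing projection into the vocabulary

# extend both by n_new rows, copy the trained weights, and randomly init the new row(s)
new_wte = nn.Embedding(old_vocab_size + n_new, n_embd)
new_lm_head = nn.Linear(n_embd, old_vocab_size + n_new, bias=False)
with torch.no_grad():
    new_wte.weight[:old_vocab_size] = wte.weight
    new_wte.weight[old_vocab_size:].normal_(mean=0.0, std=0.02)
    new_lm_head.weight[:old_vocab_size] = lm_head.weight
    new_lm_head.weight[old_vocab_size:].normal_(mean=0.0, std=0.02)
```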
01:25:18.840 | But this is a very common operation that people do,
01:25:20.920 | especially if they'd like to fine-tune the model,
01:25:23.120 | for example, taking it from a base model
01:25:24.960 | to a chat model like chat-gpt.
01:25:27.240 | Okay, so at this point, you should have everything you need
01:25:30.680 | in order to build your own GPT-4 tokenizer.
01:25:33.240 | Now, in the process of developing this lecture,
01:25:35.040 | I've done that, and I've published the code
01:25:37.000 | under this repository, minBPE.
01:25:39.880 | So minBPE looks like this right now as I'm recording,
01:25:43.400 | but the minBPE repository will probably change quite a bit
01:25:46.520 | because I intend to continue working on it.
01:25:48.680 | In addition to the minBPE repository,
01:25:51.560 | I've published this exercise progression
01:25:53.920 | that you can follow.
01:25:55.240 | So if you go to exercise.md here,
01:25:58.120 | this is sort of me breaking up the task ahead of you
01:26:01.360 | into four steps that sort of build up
01:26:04.200 | to what can be a GPT-4 tokenizer.
01:26:06.640 | And so feel free to follow these steps exactly
01:26:09.240 | and follow a little bit of the guidance
01:26:10.800 | that I've laid out here.
01:26:12.200 | And anytime you feel stuck,
01:26:14.160 | just reference the minBPE repository here.
01:26:17.040 | So either the tests could be useful
01:26:19.360 | or the minBPE repository itself.
01:26:21.520 | I try to keep the code fairly clean and understandable.
01:26:25.040 | And so feel free to reference it whenever you get stuck.
01:26:29.840 | In addition to that, basically, once you write it,
01:26:34.040 | you should be able to reproduce this behavior
01:26:35.640 | from tiktoken.
01:26:36.800 | So getting the GPT-4 tokenizer,
01:26:38.840 | you can take, you can encode the string
01:26:41.320 | and you should get these tokens.
01:26:42.920 | And then you can encode and decode the exact same string
01:26:45.040 | to recover it.
01:26:46.440 | And in addition to all that,
01:26:47.560 | you should be able to implement your own train function,
01:26:50.040 | which the tiktoken library does not provide.
01:26:52.160 | It's, again, only inference code,
01:26:54.040 | but you could write your own train, minBPE does it as well.
01:26:57.800 | And that will allow you to train
01:26:59.280 | your own token vocabularies.
01:27:00.760 | So here's some of the code inside minBPE, which
01:27:05.200 | shows the token vocabularies that you might obtain.
01:27:08.720 | So on the left here, we have the GPT-4 merges.
01:27:13.560 | So the first 256 are raw individual bytes.
01:27:17.120 | And then here I am visualizing the merges
01:27:18.880 | that GPT-4 performed during its training.
01:27:21.600 | So the very first merge that GPT-4 did
01:27:24.000 | was merge two spaces into a single token for two spaces.
01:27:28.760 | And that is a token 256.
01:27:30.840 | And so this is the order in which things merged
01:27:32.680 | during GPT-4 training.
01:27:34.200 | And this is the merge order that we obtained in minBPE
01:27:39.120 | by training a tokenizer.
01:27:40.680 | And in this case, I trained it
01:27:41.880 | on a Wikipedia page of Taylor Swift.
01:27:44.520 | Not because I'm a Swiftie,
01:27:45.520 | but because that is one of the longest Wikipedia pages,
01:27:49.080 | apparently, that's available.
01:27:50.560 | But she is pretty cool.
01:27:52.320 | And what was I gonna say?
01:27:56.120 | Yeah, so you can compare these two vocabularies.
01:27:58.920 | And so as an example,
01:28:01.240 | here, GPT-4 merged 'I' and 'N' to become 'IN'.
01:28:06.120 | And we've done the exact same thing on this token 259.
01:28:09.960 | Here, a space and 'T' merge to become ' T'.
01:28:12.600 | And that happened for us a little bit later as well.
01:28:15.160 | So the difference here is, again, to my understanding,
01:28:17.480 | only a difference of the training set.
01:28:19.480 | So as an example, because I see a lot of white space,
01:28:21.880 | I expect that GPT-4 probably had a lot of Python code
01:28:24.360 | in its training set, I'm not sure, for the tokenizer.
01:28:27.640 | And here we see much less of that, of course,
01:28:30.760 | in the Wikipedia page.
01:28:32.960 | So roughly speaking, they look the same,
01:28:34.760 | and they look the same
01:28:35.600 | because they're running the same algorithm.
01:28:37.360 | And when you train your own,
01:28:38.720 | you're probably gonna get something similar,
01:28:40.800 | depending on what you train it on.
01:28:42.480 | Okay, so we are now going to move on from tiktoken
01:28:44.800 | and the way that OpenAI tokenizes its strings.
01:28:47.680 | We're going to discuss one more very commonly used library
01:28:50.400 | for working with tokenization in LLMs,
01:28:52.720 | and that is SentencePiece.
01:28:54.840 | So SentencePiece is very commonly used in language models
01:28:58.840 | because unlike tiktoken,
01:28:59.920 | it can do both training and inference
01:29:02.120 | and is quite efficient at both.
01:29:04.360 | It supports a number of algorithms
01:29:05.800 | for training vocabularies,
01:29:07.760 | but one of them is the byte pair encoding algorithm
01:29:10.120 | that we've been looking at.
01:29:11.240 | So it supports it.
01:29:13.240 | Now, SentencePiece is used by both the Llama and Mistral series
01:29:16.960 | and many other models as well.
01:29:19.120 | It is on GitHub under Google/SentencePiece.
01:29:21.640 | And the big difference with SentencePiece,
01:29:24.760 | and we're going to look at an example
01:29:26.200 | because this is kind of hard and subtle to explain,
01:29:29.280 | is that they think differently
01:29:31.000 | about the order of operations here.
01:29:34.480 | So in the case of tiktoken,
01:29:36.840 | we first take our code points in the string,
01:29:40.360 | we encode them using UTF-8 to bytes,
01:29:42.640 | and then we're merging bytes.
01:29:43.880 | It's fairly straightforward.
01:29:45.960 | For SentencePiece, it works directly
01:29:49.640 | on the level of the code points themselves.
01:29:51.760 | So it looks at whatever code points are available
01:29:53.920 | in your training set,
01:29:55.200 | and then it starts merging those code points.
01:29:57.720 | And the BPE is running on the level of code points.
01:30:01.600 | And if you happen to run out of code points,
01:30:05.160 | so there are maybe some rare code points
01:30:07.400 | that just don't come up too often,
01:30:08.640 | and the rarity is determined
01:30:09.800 | by this character coverage hyperparameter,
01:30:12.080 | then these code points will either get mapped
01:30:15.200 | to a special unknown token like UNK,
01:30:18.120 | or if you have the byte fallback option turned on,
01:30:21.160 | then that will take those rare code points,
01:30:23.600 | it will encode them using UTF-8,
01:30:25.720 | and then the individual bytes of that encoding
01:30:27.640 | will be translated into tokens.
01:30:29.800 | And there are these special byte tokens
01:30:31.560 | that basically get added to the vocabulary.
01:30:34.040 | So it uses BPE on the code points,
01:30:37.800 | and then it falls back to bytes for rare code points.
01:30:41.280 | And so that's kind of the difference.
01:30:44.120 | Personally, I find the tiktoken way significantly cleaner,
01:30:47.040 | but it's kind of like a subtle,
01:30:48.200 | but pretty major difference
01:30:49.200 | between the way they approach tokenization.
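
A quick way to see the raw material each approach works with is to compare a code point with its UTF-8 bytes; with byte fallback turned on, a rare code point like this ends up represented by its byte tokens rather than by a single unknown token.

    text = "안"                         # a single Unicode code point (U+C548)
    print(ord(text))                    # the code point that sentencepiece would try to merge
    print(list(text.encode("utf-8")))   # [236, 149, 136] -- the bytes that tiktoken-style BPE works on
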
01:30:51.400 | Let's work with a concrete example,
01:30:52.760 | because otherwise this is kind of hard
01:30:54.680 | to get your head around.
01:30:57.800 | So let's work with a concrete example.
01:31:00.160 | This is how we can import sentence piece.
01:31:02.760 | And then here we're going to take,
01:31:04.400 | I think I took like the description of sentence piece,
01:31:06.440 | and I just create like a little toy dataset.
01:31:08.600 | It really likes to have a file,
01:31:09.920 | so I created a toy.txt file with this content.
01:31:14.000 | Now, what's kind of a little bit crazy about sentence piece
01:31:16.560 | is that there's a ton of options and configurations.
01:31:19.720 | And the reason this is so
01:31:20.800 | is because sentence piece has been around,
01:31:22.440 | I think for a while,
01:31:23.520 | and it really tries to handle a large diversity of things.
01:31:26.640 | And because it's been around,
01:31:28.320 | I think it has quite a bit of
01:31:29.920 | accumulated historical baggage as well.
01:31:33.040 | And so in particular,
01:31:34.120 | there's like a ton of configuration arguments.
01:31:36.440 | This is not even all of it.
01:31:38.160 | You can go here to see all the training options.
01:31:42.560 | And there's also quite useful documentation
01:31:45.400 | when you look at the raw protobuf
01:31:47.480 | that is used to represent the trainer spec and so on.
01:31:50.680 | Many of these options are irrelevant to us.
01:31:54.720 | So maybe to point out one example,
01:31:56.720 | --shrinking_factor.
01:31:58.760 | This shrinking factor is not used
01:32:00.800 | in the byte pair encoding algorithm.
01:32:02.200 | So this is just an argument that is irrelevant to us.
01:32:05.840 | It applies to a different training algorithm.
01:32:08.080 | Now, what I tried to do here
01:32:11.760 | is I tried to set up sentence piece
01:32:13.680 | in a way that is very, very similar,
01:32:15.440 | as far as I can tell,
01:32:16.640 | to maybe identical, hopefully,
01:32:18.920 | to the way that Llama2 was trained.
01:32:21.680 | So the way they trained their own tokenizer.
01:32:26.120 | And the way I did this was basically
01:32:27.520 | you can take the tokenizer.model file that Meta released,
01:32:31.160 | and you can open it using the protobuf
01:32:35.360 | sort of file that you can generate.
01:32:38.400 | And then you can inspect all the options.
01:32:39.880 | And I tried to copy over all the options
01:32:41.520 | that looked relevant.
01:32:42.960 | So here we set up the input.
01:32:44.560 | It's a raw text in this file.
01:32:46.680 | Here's gonna be the output.
01:32:47.960 | So it's gonna be tok400.model and tok400.vocab.
01:32:51.520 | We're saying that we're gonna use the BPE algorithm
01:32:54.240 | and we want a vocab size of 400.
01:32:55.960 | Then there's a ton of configurations here
01:32:59.160 | for basically preprocessing and normalization rules,
01:33:06.240 | as they're called.
01:33:07.240 | Normalization used to be very prevalent,
01:33:09.600 | I would say, before LLMs in natural language processing.
01:33:12.360 | So in machine translation and text classification and so on,
01:33:15.800 | you want to normalize and simplify the text
01:33:17.640 | and you wanna turn it all lowercase
01:33:18.960 | and you wanna remove all double whitespace, et cetera.
01:33:22.240 | And in language models, we prefer not to do any of it,
01:33:24.640 | or at least that is my preference as a deep learning person.
01:33:26.960 | You want to not touch your data.
01:33:28.400 | You want to keep the raw data as much as possible
01:33:31.800 | in a raw form.
01:33:33.200 | So you're basically trying to turn off a lot of this,
01:33:36.200 | if you can.
01:33:37.840 | The other thing that sentence piece does
01:33:39.280 | is that it has this concept of sentences.
01:33:41.720 | So SentencePiece, it goes back a while;
01:33:45.200 | it kind of like was developed, I think,
01:33:46.440 | early in the days where there was an idea
01:33:50.080 | that you're training a tokenizer
01:33:51.840 | on a bunch of independent sentences.
01:33:53.800 | So it has a lot of options like how many sentences
01:33:56.400 | you're gonna train on,
01:33:57.360 | what is the maximum sentence length,
01:33:59.160 | shuffling sentences.
01:34:03.160 | And so for it, sentences are kind of like
01:34:04.840 | the individual training examples.
01:34:06.600 | But again, in the context of LLMs,
01:34:08.160 | I find that this is like a very spurious
01:34:10.000 | and weird distinction.
01:34:11.320 | Like sentences are just like, don't touch the raw data.
01:34:15.080 | Sentences happen to exist.
01:34:16.920 | But in the raw data sets,
01:34:18.760 | there are a lot of like in-betweens,
01:34:20.200 | like what exactly is a sentence?
01:34:21.800 | What isn't a sentence?
01:34:23.160 | And so I think like, it's really hard to define
01:34:25.920 | what an actual sentence is,
01:34:27.560 | if you really like dig into it.
01:34:29.360 | And there could be different concepts of it
01:34:31.320 | in different languages or something like that.
01:34:32.800 | So why even introduce the concept?
01:34:34.720 | It doesn't honestly make sense to me.
01:34:36.600 | I would just prefer to treat a file
01:34:38.440 | as a giant stream of bytes.
01:34:40.400 | It has a lot of treatment around rare word characters.
01:34:44.400 | And when I say word, I mean, code points.
01:34:46.120 | We're gonna come back to this in a second.
01:34:48.320 | And it has a lot of other rules
01:34:49.600 | for basically splitting digits,
01:34:53.000 | splitting white space and numbers
01:34:55.240 | and how you deal with that.
01:34:56.520 | So these are some kind of like merge rules.
01:34:59.120 | So I think this is a little bit equivalent
01:35:00.600 | to tiktoken using the regular expression
01:35:03.760 | to split up categories.
01:35:05.480 | There's like kind of equivalence of it
01:35:07.920 | if you squint at it in sentence piece,
01:35:09.960 | where you can also, for example,
01:35:11.000 | split up the digits and so on.
01:35:15.520 | There's a few more things here
01:35:18.040 | that I'll come back to in a bit.
01:35:19.280 | And then there are some special tokens
01:35:20.440 | that you can indicate.
01:35:21.640 | And it hardcodes the UNK token,
01:35:24.040 | the beginning of sentence, end of sentence,
01:35:26.280 | and a pad token.
01:35:27.440 | And the UNK token must exist, from my understanding.
01:35:31.720 | And there's some systems things.
01:35:33.920 | So we can train.
01:35:35.800 | And when I press train,
01:35:37.240 | it's going to create this file,
01:35:39.320 | tok400.model and tok400.vocab.
01:35:42.400 | I can then load the model file
01:35:44.520 | and I can inspect the vocabulary of it.
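
A minimal sketch of that training call in Python; only a handful of the Llama-2-style options are shown here and everything else is left at the sentencepiece defaults, with toy.txt being the toy dataset file created above.

    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input="toy.txt",                     # raw text file to train on
        model_prefix="tok400",               # writes tok400.model and tok400.vocab
        model_type="bpe",
        vocab_size=400,
        byte_fallback=True,                  # fall back to UTF-8 byte tokens for rare code points
        character_coverage=0.99995,
        normalization_rule_name="identity",  # don't touch the raw data
        add_dummy_prefix=True,
        pad_id=-1,                           # no pad token
    )

    sp = spm.SentencePieceProcessor()
    sp.load("tok400.model")
    vocab = [(sp.id_to_piece(i), i) for i in range(sp.get_piece_size())]
    print(vocab[:8])   # special tokens first, then byte tokens, then merges, then code points
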
01:35:47.440 | And so we trained vocab size 400 on this text here.
01:35:52.280 | And these are the individual pieces,
01:35:55.000 | the individual tokens that sentence piece will create.
01:35:58.000 | So in the beginning,
01:35:58.840 | we see that we have the UNK token with ID zero.
01:36:03.080 | Then we have the beginning of sequence,
01:36:04.880 | end of sequence, one and two.
01:36:07.720 | And then we said that the pad ID is negative one.
01:36:09.600 | So we chose not to use it.
01:36:11.800 | So there's no pad token here.
01:36:13.280 | Then these are individual byte tokens.
01:36:17.840 | So here we saw that byte fallback in Llama was turned on.
01:36:22.120 | So it's true.
01:36:23.480 | So what follows are going to be the 256 byte tokens.
01:36:27.520 | And these are their IDs.
01:36:32.880 | And then at the bottom,
01:36:34.760 | after the byte tokens come the merges.
01:36:37.720 | And these are the parent nodes in the merges.
01:36:41.560 | So we're not seeing the children,
01:36:42.640 | we're just seeing the parents and their ID.
01:36:44.800 | And then after the merges comes eventually
01:36:49.080 | the individual tokens and their IDs.
01:36:53.120 | And so these are the individual tokens.
01:36:54.560 | So these are the individual code point tokens, if you will,
01:36:58.360 | and they come at the end.
01:36:59.880 | So that is the ordering
01:37:00.840 | with which sentence piece sort of like represents
01:37:02.840 | its vocabularies.
01:37:03.800 | It starts with special tokens, then the byte tokens,
01:37:06.840 | then the merge tokens,
01:37:08.000 | and then the individual code point tokens.
01:37:10.640 | And all these raw code point tokens
01:37:13.640 | are the ones that it encountered in the training set.
01:37:17.080 | So those individual code points are
01:37:20.720 | the entire set of code points that occurred here.
01:37:23.240 | So those all get put in there.
01:37:27.360 | And then those that are extremely rare
01:37:29.160 | as determined by character coverage.
01:37:30.840 | So if a code point occurred only a single time
01:37:32.640 | out of like a million sentences or something like that,
01:37:36.200 | then it would be ignored
01:37:37.840 | and it would not be added to our vocabulary.
01:37:41.120 | Once we have a vocabulary,
01:37:43.280 | we can encode into IDs and we can sort of get a list.
01:37:47.520 | And then here I am also decoding the individual tokens
01:37:53.080 | back into little pieces as they call it.
01:37:55.720 | So let's take a look at what happened here.
01:37:58.920 | Hello, space, and then the Korean greeting annyeonghaseyo.
01:38:02.000 | So these are the token IDs we got back.
01:38:05.560 | And when we look here, a few things sort of jumped to mind.
01:38:10.400 | Number one, take a look at these characters.
01:38:14.160 | The Korean characters, of course,
01:38:15.320 | were not part of the training set.
01:38:17.080 | So sentence piece is encountering code points
01:38:19.560 | that it has not seen during training time.
01:38:22.120 | And those code points do not have
01:38:24.080 | a token associated with them.
01:38:25.840 | So suddenly these are unk tokens, unknown tokens.
01:38:29.760 | But because byte fallback is true,
01:38:32.320 | instead, sentence piece falls back to bytes.
01:38:35.720 | And so it takes this, it encodes it with UTF-8,
01:38:39.640 | and then it uses these tokens to represent those bytes.
01:38:44.240 | And that's what we are getting sort of here.
01:38:47.960 | This is the UTF-8 encoding,
01:38:50.640 | and it is shifted by three
01:38:52.880 | because of these special tokens here
01:38:56.280 | that have IDs earlier on.
01:38:58.400 | So that's what happened here.
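
Continuing from the training sketch above, here is roughly what that byte fallback looks like when you encode the string yourself; the +3 shift reflects the three special tokens (unk, bos, eos) that sit at the front of this particular vocabulary.

    ids = sp.encode("hello 안녕하세요")
    print(ids)
    print([sp.id_to_piece(i) for i in ids])               # pieces like '▁hello', then '<0xEC>', '<0x95>', ...
    print([b + 3 for b in "안녕하세요".encode("utf-8")])   # the byte-token ids we expect for the Korean part
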
01:39:00.400 | Now, one more thing that,
01:39:02.960 | well, first, before I go on,
01:39:05.200 | with respect to the byte fallback,
01:39:06.840 | let me remove byte fallback.
01:39:09.240 | If this is false, what's gonna happen?
01:39:11.680 | Let's retrain.
01:39:12.520 | So the first thing that happened
01:39:14.520 | is all of the byte tokens disappeared, right?
01:39:17.320 | And now we just have the merges,
01:39:18.840 | and we have a lot more merges now
01:39:20.240 | because we have a lot more space
01:39:21.360 | because we're not taking up space
01:39:23.240 | in the vocab size with all the bytes.
01:39:26.960 | And now if we encode this,
01:39:28.880 | we get a zero.
01:39:31.960 | So this entire string here suddenly,
01:39:33.840 | there's no byte fallback.
01:39:34.960 | So this is unknown, and unknown is UNK.
01:39:39.000 | And so this is zero
01:39:40.400 | because the UNK token is token zero.
01:39:43.200 | And you have to keep in mind
01:39:44.960 | that this would feed into your language model.
01:39:47.720 | So what is the language model supposed to do
01:39:49.240 | when all kinds of different things
01:39:51.000 | that are unrecognized because they're rare
01:39:53.000 | just end up mapping into UNK?
01:39:54.880 | It's not exactly the property that you want.
01:39:57.080 | So that's why I think Llama correctly
01:39:59.480 | used byte fallback true
01:40:02.160 | because we definitely want to feed
01:40:03.640 | these unknown or rare code points
01:40:06.120 | into the model in some manner.
01:40:08.600 | The next thing I wanna show you is the following.
01:40:11.720 | Notice here when we are decoding all the individual tokens,
01:40:14.640 | you see how spaces, space here
01:40:17.760 | ends up being this bold underline.
01:40:20.920 | I'm not 100% sure, by the way,
01:40:22.320 | why sentence piece switches whitespace
01:40:24.240 | into these bold underscore characters.
01:40:27.240 | Maybe it's for visualization.
01:40:28.400 | I'm not 100% sure why that happens.
01:40:30.920 | But notice this.
01:40:32.040 | Why do we have an extra space in the front of hello?
01:40:37.040 | What is this coming from?
01:40:40.560 | Well, it's coming from this option here.
01:40:42.800 | Add dummy prefix is true.
01:40:48.120 | And when you go to the documentation,
01:40:50.720 | add dummy whitespace at the beginning of text
01:40:52.640 | in order to treat 'world' in 'world'
01:40:54.920 | and in 'hello world' in the exact same way.
01:40:57.240 | So what this is trying to do is the following.
01:41:00.280 | If we go back to our tick tokenizer,
01:41:02.120 | world as a token by itself
01:41:06.400 | has a different ID than space world.
01:41:10.040 | So we have, this is 1917, but this is 14, et cetera.
01:41:14.720 | So these are two different tokens for the language model.
01:41:17.000 | And the language model has to learn from data
01:41:18.720 | that they are actually kind of like a very similar concept.
01:41:21.200 | So to the language model in the tick token world,
01:41:24.080 | basically words in the beginning of sentences
01:41:27.000 | and words in the middle of sentences
01:41:28.560 | actually look completely different.
01:41:30.320 | And it has learned that they are roughly the same.
01:41:34.400 | So this add dummy prefix
01:41:36.480 | is trying to fight that a little bit.
01:41:38.440 | And the way that works is that it basically
01:41:40.800 | adds a dummy prefix.
01:41:43.880 | So as a part of pre-processing,
01:41:48.600 | it will take the string and it will add a space.
01:41:51.160 | It will do this.
01:41:52.920 | And that's done in an effort to make this world
01:41:56.320 | and that world the same.
01:41:57.600 | They will both be space world.
01:41:59.960 | So that's one other kind of pre-processing option
01:42:02.120 | that is turned on and Llama2 also uses this option.
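
To make the effect concrete, here is a small sketch: in a byte-level BPE like tiktoken's, 'world' with and without a leading space are different tokens, and add_dummy_prefix is essentially just prepending one space during preprocessing. The exact ids printed depend on the vocabulary.

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    print(enc.encode("world"))    # token(s) for "world" at the start of text
    print(enc.encode(" world"))   # a different token for " world" in the middle of text

    # what add_dummy_prefix=True effectively does before tokenizing:
    text = "world"
    print(" " + text)             # " world" -- now it matches the mid-sentence form
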
01:42:06.800 | And that's I think everything that I wanna say
01:42:08.360 | from my preview of sentence piece and how it is different.
01:42:11.880 | Maybe here what I've done is
01:42:14.040 | I just put in the raw protocol buffer representation
01:42:18.480 | basically of the tokenizer that Llama2 trained.
01:42:22.280 | So feel free to sort of step through this.
01:42:24.520 | And if you would like your tokenization
01:42:26.560 | to look identical to that of the meta Llama2,
01:42:30.360 | then you would be copy pasting these settings
01:42:32.080 | as I've tried to do up above.
01:42:34.000 | And yeah, I think that's it for this section.
01:42:37.800 | I think my summary for sentence piece from all this
01:42:40.200 | is number one, I think that there's a lot
01:42:42.400 | of historical baggage in sentence piece.
01:42:44.240 | A lot of concepts that I think are slightly confusing
01:42:46.840 | and I think potentially contain foot guns
01:42:49.200 | like this concept of a sentence
01:42:50.600 | and its maximum length and stuff like that.
01:42:52.760 | Otherwise it is fairly commonly used in the industry
01:42:56.640 | because it is efficient
01:42:58.920 | and can do both training and inference.
01:43:01.160 | It has a few quirks, like for example,
01:43:02.920 | the UNK token must exist, and the way the byte fallback is done
01:43:05.880 | and so on, I don't find particularly elegant.
01:43:08.320 | And unfortunately I have to say
01:43:09.360 | it's not very well documented.
01:43:10.480 | So it took me a lot of time working with this myself
01:43:13.760 | and just visualizing things
01:43:15.760 | and trying to really understand what is happening here
01:43:17.760 | because the documentation, unfortunately,
01:43:20.360 | is in my opinion, not super amazing.
01:43:22.720 | But it is a very nice repo that is available to you
01:43:25.840 | if you'd like to train your own tokenizer right now.
01:43:28.200 | Okay, let me now switch gears again
01:43:29.560 | as we're starting to slowly wrap up here.
01:43:31.640 | I want to revisit this issue in a bit more detail
01:43:33.840 | of how we should set the vocab size
01:43:35.680 | and what are some of the considerations around it.
01:43:38.160 | So for this, I'd like to go back to the model architecture
01:43:41.360 | that we developed in the last video
01:43:43.080 | when we built the GPT from scratch.
01:43:45.680 | So this here was the file
01:43:47.480 | that we built in the previous video
01:43:49.000 | and we defined the transformer model
01:43:50.680 | and let's specifically look at vocab size
01:43:52.800 | and where it appears in this file.
01:43:54.820 | So here we define the vocab size.
01:43:56.560 | At this time it was 65 or something like that,
01:43:59.640 | extremely small number.
01:44:00.720 | So this will grow much larger.
01:44:03.160 | You'll see that vocab size doesn't come up too much
01:44:04.980 | in most of these layers.
01:44:06.160 | The only places that it comes up
01:44:08.240 | are exactly these two places here.
01:44:10.920 | So when we define the language model,
01:44:13.280 | there's the token embedding table,
01:44:15.200 | which is this two-dimensional array
01:44:17.320 | where the vocab size is basically the number of rows
01:44:20.620 | and each vocabulary element, each token has a vector
01:44:25.000 | that we're going to train using backpropagation.
01:44:27.240 | That vector is of size n_embd,
01:44:28.920 | which is number of channels in the transformer.
01:44:31.440 | And basically as vocab size increases,
01:44:33.560 | this embedding table, as I mentioned earlier,
01:44:35.760 | is going to also grow.
01:44:36.680 | We're going to be adding rows.
01:44:38.680 | In addition to that, at the end of the transformer,
01:44:41.080 | there's this LM head layer, which is a linear layer.
01:44:44.240 | And you'll notice that that layer is used at the very end
01:44:47.000 | to produce the logits, which become the probabilities
01:44:49.840 | for the next token in a sequence.
01:44:51.520 | And so intuitively, we're trying to produce a probability
01:44:54.520 | for every single token that might come next
01:44:57.240 | at every point in time of that transformer.
01:45:00.240 | And if we have more and more tokens,
01:45:02.000 | we need to produce more and more probabilities.
01:45:04.360 | So every single token is going to introduce
01:45:06.160 | an additional dot product that we have to do
01:45:08.600 | here in this linear layer
01:45:09.880 | for this final layer in the transformer.
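
In code, those two places look roughly like this; the names mirror the gpt.py-style model being discussed, and the sizes are just placeholders (65 was the character-level vocabulary from the earlier video).

    import torch.nn as nn

    vocab_size, n_embd = 65, 384   # tiny placeholder numbers

    token_embedding_table = nn.Embedding(vocab_size, n_embd)  # one learned vector per token
    lm_head = nn.Linear(n_embd, vocab_size)                   # one output logit per token

    # growing vocab_size adds rows to the embedding table and dot products to lm_head;
    # nothing else in the transformer blocks depends on it
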
01:45:12.440 | So why can't vocab size be infinite?
01:45:15.340 | Why can't we grow to infinity?
01:45:16.560 | Well, number one, your token embedding table
01:45:18.520 | is going to grow.
01:45:19.840 | Your linear layer is going to grow.
01:45:23.000 | So we're going to be doing a lot more computation here
01:45:25.140 | because this LM head layer
01:45:26.280 | will become more computationally expensive.
01:45:28.740 | Number two, because we have more parameters,
01:45:30.600 | we could be worried that we are going to be
01:45:32.640 | under-training some of these parameters.
01:45:35.440 | So intuitively, if you have a very large vocabulary size,
01:45:38.560 | say we have a million tokens,
01:45:40.800 | then every one of these tokens is going to come up
01:45:42.480 | more and more rarely in the training data
01:45:45.080 | because there's a lot more other tokens all over the place.
01:45:47.600 | And so we're going to be seeing fewer and fewer examples
01:45:51.080 | for each individual token.
01:45:52.960 | And you might be worried that basically
01:45:54.320 | the vectors associated with every token
01:45:56.080 | will be under-trained as a result
01:45:57.760 | because they just don't come up too often
01:45:59.760 | and they don't participate in the forward-backward pass.
01:46:02.460 | In addition to that, as your vocab size grows,
01:46:04.600 | you're going to start shrinking your sequences a lot.
01:46:07.640 | And that's really nice because that means
01:46:09.720 | that we're going to be attending to more and more text.
01:46:11.880 | So that's nice.
01:46:12.980 | But also you might be worrying that too large of chunks
01:46:15.800 | are being squished into single tokens.
01:46:18.040 | And so the model just doesn't have as much time to think
01:46:22.000 | per some number of characters in a text,
01:46:26.320 | or you can think about it that way.
01:46:28.000 | So basically we're squishing too much information
01:46:29.960 | into a single token.
01:46:31.180 | And then the forward pass of the transformer
01:46:32.980 | is not enough to actually process
01:46:34.300 | that information appropriately.
01:46:35.940 | And so these are some of the considerations
01:46:37.420 | you're thinking about
01:46:38.260 | when you're designing the vocab size.
01:46:39.820 | As I mentioned, this is mostly an empirical hyperparameter.
01:46:42.260 | And it seems like in state-of-the-art architectures today,
01:46:44.980 | this is usually in the high tens of thousands
01:46:46.820 | or somewhere around 100,000 today.
01:46:49.340 | And the next consideration I want to briefly talk about
01:46:51.500 | is what if we want to take a pre-trained model
01:46:54.260 | and we want to extend the vocab size?
01:46:56.460 | And this is done fairly commonly actually.
01:46:58.180 | So for example, when you're doing fine tuning for chat GPT,
01:47:02.240 | a lot more new special tokens get introduced
01:47:04.280 | on top of the base model to maintain the metadata
01:47:07.280 | and all the structure of conversation objects
01:47:09.980 | between the user and an assistant.
01:47:11.740 | So that takes a lot of special tokens.
01:47:13.860 | You might also try to throw in more special tokens,
01:47:15.960 | for example, for using the browser or any other tool.
01:47:18.920 | And so it's very tempting to add a lot of tokens
01:47:21.560 | for all kinds of special functionality.
01:47:23.960 | So if you want to be adding a token,
01:47:25.320 | that's totally possible, right?
01:47:26.960 | All we have to do is we have to resize this embedding.
01:47:29.620 | So we have to add rows.
01:47:31.100 | We would initialize these parameters from scratch,
01:47:33.860 | which would be small random numbers.
01:47:35.500 | And then we have to extend the weight inside this linear.
01:47:39.340 | So we have to start making dot products
01:47:41.900 | with the associated parameters as well
01:47:43.780 | to basically calculate the probabilities
01:47:45.120 | for these new tokens.
01:47:46.660 | So both of these are just a resizing operation.
01:47:49.500 | It's a very mild model surgery
01:47:51.940 | and can be done fairly easily.
01:47:53.260 | And it's quite common that basically
01:47:54.460 | you would freeze the base model.
01:47:56.000 | You introduce these new parameters
01:47:57.500 | and then you only train these new parameters
01:47:59.300 | to introduce new tokens into the architecture.
01:48:01.600 | And so you can freeze arbitrary parts of it
01:48:04.500 | or you can train arbitrary parts of it
01:48:06.140 | and that's totally up to you.
01:48:07.580 | But basically minor surgery required
01:48:09.460 | if you'd like to introduce new tokens.
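
A plain-PyTorch sketch of that common recipe, freezing the base model and training only the embedding rows for the newly added tokens; the variable names are hypothetical, and this pairs with the row-adding surgery shown earlier.

    import torch

    def freeze_all_but_new_token_rows(model, tok_emb, num_old_tokens):
        # model: the transformer, tok_emb: its (already extended) nn.Embedding
        for p in model.parameters():
            p.requires_grad = False          # freeze the entire base model
        tok_emb.weight.requires_grad = True  # re-enable gradients for the embedding table

        # zero out gradients for the original rows, so only the new rows get updated
        mask = torch.zeros_like(tok_emb.weight)
        mask[num_old_tokens:] = 1.0
        tok_emb.weight.register_hook(lambda grad: grad * mask)
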
01:48:11.420 | And finally, I'd like to mention that
01:48:12.760 | actually there's an entire design space of applications
01:48:15.900 | in terms of introducing new tokens into a vocabulary
01:48:18.460 | that go way beyond just adding special tokens
01:48:20.460 | and special new functionality.
01:48:22.240 | So just to give you a sense of the design space,
01:48:24.100 | but this could be an entire video just by itself.
01:48:26.660 | This is a paper on learning to compress prompts
01:48:29.260 | with what they called GIST tokens.
01:48:32.060 | And the rough idea is suppose that you're using
01:48:34.060 | language models in a setting
01:48:35.060 | that requires very long prompts.
01:48:37.320 | Well, these long prompts just slow everything down
01:48:39.460 | because you have to encode them
01:48:40.540 | and then you have to use them
01:48:41.740 | and then you're attending over them
01:48:43.220 | and it's just heavy to have very large prompts.
01:48:46.820 | So instead what they do here in this paper
01:48:49.220 | is they introduce new tokens
01:48:52.340 | and imagine basically having a few new tokens,
01:48:56.100 | you put them in a sequence
01:48:57.700 | and then you train the model by distillation.
01:49:00.820 | So you are keeping the entire model frozen
01:49:02.740 | and you're only training the representations
01:49:04.740 | of the new tokens, their embeddings,
01:49:06.780 | and you're optimizing over the new tokens
01:49:08.660 | such that the behavior of the language model
01:49:11.120 | is identical to the model that has a very long prompt
01:49:16.120 | that works for you.
01:49:17.580 | And so it's a compression technique
01:49:19.020 | of compressing that very long prompt
01:49:20.620 | into those few new GIST tokens.
01:49:23.140 | And so you can train this and then at test time
01:49:24.980 | you can discard your old prompt
01:49:26.100 | and just swap in those tokens
01:49:27.540 | and they sort of like stand in for that very long prompt
01:49:30.900 | and have an almost identical performance.
01:49:33.580 | And so this is one technique
01:49:36.100 | in a class of parameter efficient fine tuning techniques
01:49:38.780 | where most of the model is basically fixed
01:49:40.940 | and there's no training of the model weights,
01:49:43.000 | there's no training of LoRA or anything like that
01:49:45.260 | of new parameters.
01:49:46.360 | The parameters that you're training
01:49:47.900 | are now just the token embeddings.
01:49:50.660 | So that's just one example,
01:49:51.700 | but this could again be like an entire video,
01:49:53.900 | but just to give you a sense
01:49:54.860 | that there's a whole design space here
01:49:56.020 | that is potentially worth exploring in the future.
01:49:58.620 | The next thing I want to briefly address
01:50:00.020 | is that I think recently there's a lot of momentum
01:50:02.420 | in how you actually could construct transformers
01:50:05.220 | that can simultaneously process
01:50:06.460 | not just text as the input modality,
01:50:08.580 | but a lot of other modalities.
01:50:09.660 | So be it images, videos, audio, et cetera.
01:50:12.980 | And how do you feed in all these modalities
01:50:14.980 | and potentially predict these modalities from a transformer?
01:50:18.980 | Do you have to change the architecture
01:50:20.020 | in some fundamental way?
01:50:21.060 | And I think what a lot of people
01:50:22.220 | are starting to converge towards
01:50:23.540 | is that you're not changing the architecture,
01:50:25.000 | you stick with the transformer,
01:50:26.380 | you just kind of tokenize your input domains
01:50:28.940 | and then call it a day and pretend it's just text tokens
01:50:31.420 | and just do everything else in an identical manner.
01:50:35.420 | So here, for example, there was an early paper
01:50:37.020 | that has a nice graphic for how you can take an image
01:50:39.720 | and you can chunk it into integers.
01:50:43.460 | And these would basically
01:50:45.340 | become the tokens of images
01:50:47.700 | as an example.
01:50:48.900 | And these tokens can be hard tokens
01:50:51.540 | where you force them to be integers.
01:50:53.620 | They can also be soft tokens
01:50:55.540 | where you sort of don't require these to be discrete,
01:51:00.240 | but you do force these representations
01:51:02.300 | to go through bottlenecks like in autoencoders.
01:51:04.940 | Also in this paper that came out from OpenAI Sora,
01:51:08.420 | which I think really blew the mind of many people
01:51:12.420 | and inspired a lot of people in terms of what's possible,
01:51:14.980 | they have a graphic here and they talk briefly
01:51:16.620 | about how LLMs have text tokens, Sora has visual patches.
01:51:21.340 | So again, they came up with a way to chunk videos
01:51:24.260 | into basically the tokens with their own vocabularies.
01:51:27.540 | And then you can either process discrete tokens,
01:51:29.660 | say with autoregressive models,
01:51:31.040 | or even soft tokens with diffusion models.
01:51:33.660 | And all of that is sort of being actively worked on,
01:51:38.320 | designed on, and it's beyond the scope of this video,
01:51:40.140 | but just something I wanted to mention briefly.
01:51:42.020 | Okay, now that we have come quite deep
01:51:44.340 | into the tokenization algorithm
01:51:45.660 | and we understand a lot more about how it works,
01:51:48.340 | let's loop back around to the beginning of this video
01:51:50.620 | and go through some of these bullet points
01:51:51.940 | and really see why they happen.
01:51:54.740 | So first of all, why can't my LLM spell words very well
01:51:57.820 | or do other spell-related tasks?
01:52:00.160 | So fundamentally, this is because, as we saw,
01:52:03.800 | these characters are chunked up into tokens.
01:52:07.180 | And some of these tokens are actually fairly long.
01:52:09.820 | So as an example, I went to the GPT-4 vocabulary
01:52:12.740 | and I looked at one of the longer tokens.
01:52:15.180 | So .DefaultCellStyle turns out to be a single individual token.
01:52:19.380 | So that's a lot of characters for a single token.
01:52:21.860 | So my suspicion is that there's just too much crammed
01:52:24.340 | into this single token.
01:52:25.820 | And my suspicion was that the model should not be very good
01:52:28.580 | at tasks related to spelling of this single token.
01:52:33.580 | So I asked, how many letters L are there
01:52:36.980 | in the word .DefaultCellStyle?
01:52:39.280 | And of course, my prompt is intentionally done that way.
01:52:44.280 | And you see how .DefaultCellStyle will be a single token.
01:52:46.500 | So this is what the model sees.
01:52:48.260 | So my suspicion is that it wouldn't be very good at this.
01:52:50.500 | And indeed, it is not.
01:52:51.940 | It doesn't actually know how many Ls are in there.
01:52:54.140 | It thinks there are three, and actually there are four,
01:52:56.900 | if I'm not getting this wrong myself.
01:52:59.300 | So that didn't go extremely well.
01:53:01.360 | Let's look at another kind of character-level task.
01:53:05.480 | So for example, here I asked GPT-4
01:53:08.540 | to reverse the string .DefaultCellStyle.
01:53:11.060 | And it tried to use a code interpreter.
01:53:13.260 | I stopped it, and I said, just do it, just try it.
01:53:16.420 | And it gave me jumble.
01:53:18.820 | So it doesn't actually really know how to reverse
01:53:22.060 | this string going from right to left.
01:53:24.740 | So it gave a wrong result.
01:53:26.600 | So again, like working with this working hypothesis
01:53:29.000 | that maybe this is due to the tokenization,
01:53:31.060 | I tried a different approach.
01:53:32.180 | I said, okay, let's reverse the exact same string,
01:53:35.340 | but take the following approach.
01:53:37.260 | Step one, just print out every single character
01:53:39.400 | separated by spaces,
01:53:40.660 | and then as a step two, reverse that list.
01:53:43.260 | And it again tried to use a tool, but when I stopped it,
01:53:45.820 | it first produced all the characters,
01:53:48.380 | and that was actually correct.
01:53:50.040 | And then it reversed them,
01:53:50.880 | and that was correct once it had this.
01:53:52.940 | So somehow it can't reverse it directly,
01:53:54.760 | but when you go just first, you know,
01:53:57.500 | listing it out in order, it can do that somehow.
01:54:00.100 | And then it can, once it's broken up this way,
01:54:03.260 | this becomes all these individual characters.
01:54:05.480 | And so now this is much easier for it
01:54:07.220 | to see these individual tokens
01:54:09.080 | and reverse them and print them out.
01:54:11.720 | So that is kind of interesting.
01:54:14.620 | So let's continue now.
01:54:16.300 | Why are LLMs worse at non-English languages?
01:54:20.460 | And I briefly covered this already,
01:54:21.780 | but basically it's not only that the language model
01:54:25.500 | sees less non-English data
01:54:27.420 | during training of the model parameters,
01:54:29.660 | but also the tokenizer is not sufficiently trained
01:54:34.100 | on non-English data.
01:54:36.060 | And so here, for example, hello, how are you is five tokens,
01:54:40.140 | and its translation is 15 tokens.
01:54:42.380 | So this is a three times blow up.
01:54:45.020 | And so for example, annyeong haseyo is just hello,
01:54:48.260 | basically in Korean.
01:54:49.300 | And that ends up being three tokens.
01:54:50.820 | I'm actually kind of surprised by that
01:54:51.980 | because that is a very common phrase.
01:54:53.900 | That's just a typical greeting of like hello.
01:54:56.180 | And that ends up being three tokens,
01:54:57.620 | whereas our hello is a single token.
01:54:59.820 | And so basically everything is a lot more bloated
01:55:01.580 | and diffuse.
01:55:02.420 | And this is, I think, partly the reason
01:55:03.980 | that the model works worse on other languages.
01:55:07.000 | Coming back, why are LLMs bad at simple arithmetic?
01:55:11.200 | That has to do with the tokenization of numbers.
01:55:16.540 | And so you'll notice that, for example,
01:55:19.140 | addition is very sort of like,
01:55:20.860 | there's an algorithm that is like character level
01:55:24.200 | for doing addition.
01:55:25.760 | So for example, here, we would first add the ones
01:55:28.180 | and then the tens and then the hundreds.
01:55:29.840 | You have to refer to specific parts of these digits.
01:55:33.340 | But these numbers are represented
01:55:35.940 | completely arbitrarily based on whatever happened
01:55:37.780 | to merge or not merge during the tokenization process.
01:55:40.940 | There's an entire blog post about this
01:55:42.360 | that I think is quite good.
01:55:43.740 | Integer tokenization is insane.
01:55:45.500 | And this person basically systematically explores
01:55:47.860 | the tokenization of numbers in, I believe this is GPT-2.
01:55:52.020 | And so they noticed that, for example,
01:55:53.320 | for four-digit numbers, you can take a look at
01:55:57.720 | whether it is a single token
01:56:00.280 | or whether it is two tokens, split as a 1-3,
01:56:02.640 | a 2-2, or a 3-1 combination.
01:56:04.800 | And so all the different numbers
01:56:06.200 | are all the different combinations.
01:56:07.760 | And you can imagine this is all completely arbitrarily so.
01:56:10.520 | And the model, unfortunately, sometimes sees
01:56:12.880 | a token for all four digits, sometimes for three,
01:56:17.280 | sometimes for two, sometimes for one,
01:56:19.200 | and it's in an arbitrary manner.
01:56:22.060 | And so this is definitely a headwind, if you will,
01:56:25.020 | for the language model.
01:56:26.020 | And it's kind of incredible that it can kind of do it
01:56:27.900 | and deal with it, but it's also kind of not ideal.
01:56:31.020 | And so that's why, for example, we saw that Meta,
01:56:33.060 | when they trained Llama 2
01:56:35.160 | and they used sentence piece,
01:56:36.580 | they make sure to split up all the digits,
01:56:40.580 | as an example, for Llama 2.
01:56:43.460 | And this is partly to improve
01:56:44.900 | a simple arithmetic kind of performance.
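
You can poke at this arbitrariness yourself with a few lines; the exact splits depend entirely on which merges happened during tokenizer training, so the output will differ from number to number.

    import tiktoken

    enc = tiktoken.get_encoding("gpt2")
    for n in ["1234", "2024", "9999", "4267"]:
        ids = enc.encode(n)
        print(n, ids, [enc.decode([i]) for i in ids])   # same-length numbers, different chunkings
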
01:56:48.020 | And finally, why is GPT-2 not as good in Python?
01:56:52.040 | Again, this is partly a modeling issue
01:56:53.880 | on the architecture and the dataset
01:56:55.840 | and the strength of the model,
01:56:57.320 | but it's also partly tokenization.
01:56:59.240 | Because as we saw here with a simple Python example,
01:57:02.020 | the encoding efficiency of the tokenizer
01:57:04.760 | for handling spaces in Python is terrible.
01:57:07.180 | And every single space is an individual token.
01:57:09.360 | And this dramatically reduces the context length
01:57:11.680 | that the model can attend across.
01:57:13.320 | So that's almost like a tokenization bug for GPT-2.
01:57:16.400 | And that was later fixed with GPT-4.
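
A quick, hedged comparison of that efficiency difference: the exact counts depend on the snippet, but GPT-4's vocabulary merges runs of spaces, so its count should be noticeably lower.

    import tiktoken

    code = "def f():\n        x = 1\n        return x\n"   # deliberately indented Python
    gpt2 = tiktoken.get_encoding("gpt2")
    gpt4 = tiktoken.get_encoding("cl100k_base")
    print("gpt2 tokens:", len(gpt2.encode(code)))
    print("gpt4 tokens:", len(gpt4.encode(code)))
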
01:57:19.580 | Okay, so here's another fun one.
01:57:21.240 | My LLM abruptly halts when it sees the string <|endoftext|>.
01:57:25.080 | So here's a very strange behavior.
01:57:28.200 | Print the string <|endoftext|>, is what I told GPT-4.
01:57:31.440 | And it says, could you please specify the string?
01:57:33.960 | And I'm telling it, give me <|endoftext|>.
01:57:36.440 | And it seems like there's an issue.
01:57:38.000 | It's not seeing <|endoftext|>.
01:57:40.280 | And then I give it <|endoftext|> as the string,
01:57:42.840 | and then here's the string,
01:57:44.200 | and then it just doesn't print it.
01:57:45.840 | So obviously something is breaking here
01:57:47.200 | with respect to the handling of the special token.
01:57:49.520 | And I didn't actually know what OpenAI is doing
01:57:51.600 | under the hood here,
01:57:53.000 | and whether they are potentially parsing this
01:57:55.600 | as an actual token,
01:57:59.040 | instead of this just being the literal text <|endoftext|>
01:58:01.800 | as like individual sort of pieces of it
01:58:05.260 | without the special token handling logic.
01:58:08.280 | And so it might be that someone,
01:58:09.920 | when they're calling .encode,
01:58:11.960 | they are passing in allowed_special,
01:58:14.260 | and they are allowing <|endoftext|>
01:58:16.720 | as a special token in the user prompt.
01:58:19.260 | But the user prompt, of course,
01:58:20.280 | is a sort of attacker-controlled text.
01:58:23.660 | So you would hope that they don't really parse
01:58:25.840 | or use special tokens from that kind of input.
01:58:30.240 | But it appears that there's something
01:58:31.240 | definitely going wrong here.
01:58:32.640 | And so your knowledge of these special tokens
01:58:35.840 | ends up being an attack surface, potentially.
01:58:38.120 | And so if you'd like to confuse LLMs,
01:58:41.440 | then just try to give them some special tokens
01:58:44.120 | and see if you're breaking something by chance.
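
For comparison, this is how the tiktoken library itself exposes that choice; by default it refuses to treat special-token strings found in user text as special, which is exactly the safe behavior you would want here.

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = "<|endoftext|>"

    print(enc.encode(text, allowed_special={"<|endoftext|>"}))  # a single id for the special token
    print(enc.encode(text, disallowed_special=()))              # treated as ordinary text: several plain tokens
    # enc.encode(text)  # default behavior: raises an error to flag the special string in user text
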
01:58:46.320 | Okay, so this next one is a really fun one,
01:58:48.480 | the trailing whitespace issue.
01:58:51.960 | So if you come to Playground,
01:58:54.440 | and we come here to GPT-3.5-Turbo-Instruct.
01:58:58.320 | So this is not a chat model.
01:58:59.520 | This is a completion model.
01:59:00.720 | So think of it more like,
01:59:02.440 | it's a lot more closer to a base model.
01:59:04.600 | It does completion.
01:59:05.960 | It will continue the token sequence.
01:59:08.720 | So here's a tagline for Ice Cream Shop,
01:59:10.520 | and we wanna continue the sequence.
01:59:12.600 | And so we can submit and get a bunch of tokens.
01:59:15.600 | Okay, no problem.
01:59:17.400 | But now, suppose I do this,
01:59:19.720 | but instead of pressing submit here,
01:59:22.420 | I do, here's a tagline for Ice Cream Shop space.
01:59:25.880 | So I have a space here before I click submit.
01:59:28.560 | We get a warning.
01:59:31.240 | Your text ends in a trailing space,
01:59:32.760 | which causes worse performance
01:59:33.920 | due to how the API splits text into tokens.
01:59:37.240 | So what's happening here,
01:59:38.200 | it still gave us a sort of completion here,
01:59:40.800 | but let's take a look at what's happening.
01:59:42.760 | So here's a tagline for an Ice Cream Shop.
01:59:46.600 | And then, what does this look like
01:59:49.000 | in the actual training data?
01:59:50.160 | Suppose you found the completion in the training document,
01:59:53.120 | somewhere on the internet,
01:59:54.120 | and the LLM trained on this data.
01:59:56.360 | So maybe it's something like, oh yeah.
01:59:59.280 | Maybe that's the tagline.
02:00:00.120 | That's a terrible tagline.
02:00:01.160 | But notice here, with this 'Oh',
02:00:04.680 | you see that because there's the,
02:00:06.960 | the space character is always a prefix
02:00:09.000 | to these tokens in GPT.
02:00:11.300 | So it's not an O token.
02:00:12.560 | It's a space O token.
02:00:14.040 | The space is part of the O.
02:00:16.520 | And together, they are token 8840.
02:00:19.160 | That's space O.
02:00:20.360 | So what's happening here is that
02:00:23.680 | when I just have it like this,
02:00:25.760 | and I let it complete the next token,
02:00:28.120 | it can sample the space O token.
02:00:31.120 | But instead, if I have this and I add my space,
02:00:34.140 | then what I'm doing here when I encode this string
02:00:36.740 | is I have basically, here's a tagline for an ice cream shop,
02:00:40.940 | and this space at the very end becomes a token 220.
02:00:43.840 | And so we've added token 220,
02:00:47.780 | and this token otherwise would be part of the tagline.
02:00:51.020 | Because if there actually is a tagline here,
02:00:52.700 | so space O is the token.
02:00:55.260 | And so this is suddenly out of distribution for the model,
02:00:58.820 | because this space is part of the next token,
02:01:01.500 | but we're putting it here like this,
02:01:03.900 | and the model has seen very, very little data
02:01:06.700 | of actual space by itself.
02:01:09.740 | And we're asking it to complete the sequence,
02:01:11.300 | like add in more tokens.
02:01:12.820 | But the problem is that we've sort of begun the first token,
02:01:16.200 | and now it's been split up,
02:01:18.340 | and now we're out of distribution,
02:01:20.020 | and now arbitrary bad things happen.
02:01:22.100 | And it's just a very rare example
02:01:23.740 | for it to see something like that.
02:01:25.460 | And that's why we did the warning.
02:01:27.740 | So the fundamental issue here is, of course,
02:01:29.580 | that the LLM is on top of these tokens,
02:01:33.660 | and these tokens are text chunks.
02:01:35.180 | They're not characters in a way
02:01:36.620 | you and I would think of them.
02:01:38.300 | These are the atoms of what the LLM is seeing,
02:01:40.860 | and there's a bunch of weird stuff that comes out of it.
02:01:43.180 | Let's go back to our .DefaultCellStyle token.
02:01:46.800 | I bet you that the model has never in its training set
02:01:50.020 | seen .DefaultCellStyle without the 'le' in there.
02:01:55.020 | It's always seen this as a single group,
02:01:57.140 | because this is some kind of a function in,
02:02:00.300 | I don't actually know what this is part of.
02:02:03.100 | It's some kind of API.
02:02:04.180 | But I bet you that it's never seen this combination
02:02:06.980 | of tokens in its training data,
02:02:09.740 | because, or I think it would be extremely rare.
02:02:12.260 | So I took this and I copied pasted it here,
02:02:14.760 | and I tried to complete from it,
02:02:17.560 | and it immediately gave me a big error.
02:02:20.000 | And it said, "The model predicted a completion
02:02:21.460 | "that begins with a stop sequence
02:02:22.740 | "resulting in no output.
02:02:23.820 | "Consider adjusting your prompt or stop sequences."
02:02:26.460 | So what happened here when I clicked submit
02:02:28.620 | is that immediately the model emitted
02:02:30.860 | sort of like an end-of-text token,
02:02:32.740 | I think, or something like that.
02:02:34.320 | It basically predicted the stop sequence immediately.
02:02:36.940 | So it had no completion.
02:02:38.660 | And so this is why I'm getting a warning again,
02:02:40.580 | because we're off the data distribution
02:02:42.800 | and the model is just predicting
02:02:46.180 | just totally arbitrary things.
02:02:47.740 | It's just really confused, basically.
02:02:49.380 | This is giving it brain damage.
02:02:51.020 | It's never seen this before.
02:02:52.100 | It's shocked, and it's predicting end of text or something.
02:02:55.540 | I tried it again here, and in this case, it completed it.
02:02:58.920 | But then for some reason,
02:03:00.280 | this request may violate our usage policies.
02:03:02.840 | This was flagged.
02:03:03.820 | Basically, something just goes wrong,
02:03:07.200 | and there's something like jank.
02:03:08.280 | You can just feel the jank
02:03:09.360 | because the model is extremely unhappy with just this,
02:03:12.320 | and it doesn't know how to complete it
02:03:13.420 | because it's never occurred in a training set.
02:03:15.420 | In a training set, it always appears like this
02:03:17.920 | and becomes a single token.
02:03:20.040 | So these kinds of issues where tokens are,
02:03:22.860 | either you sort of complete the first character
02:03:25.180 | of the next token,
02:03:26.520 | or you have long tokens
02:03:28.480 | that you then have just some of the characters off,
02:03:31.320 | all of these are kind of like issues with partial tokens,
02:03:35.000 | is how I would describe it.
02:03:36.800 | And if you actually dig into the tiktoken repository,
02:03:39.760 | go to the Rust code and search for unstable,
02:03:43.400 | and you'll see encode_unstable_native, unstable tokens,
02:03:49.280 | and a lot of special case handling.
02:03:51.520 | None of this stuff about unstable tokens
02:03:53.400 | is documented anywhere,
02:03:54.960 | but there's a ton of code dealing with unstable tokens,
02:03:57.960 | and unstable tokens is exactly
02:04:00.200 | kind of like what I'm describing here.
02:04:02.320 | What you would like out of a completion API
02:04:05.000 | is something a lot more fancy.
02:04:06.280 | Like if we're putting in .DefaultCellSty, our partial token,
02:04:08.740 | if we're asking for the next token sequence,
02:04:10.580 | we're not actually trying to append the next token
02:04:12.560 | exactly after this list.
02:04:14.400 | We're actually trying to consider
02:04:16.240 | lots of possible continuations,
02:04:18.200 | that is,
02:04:22.120 | we're trying to search over characters
02:04:24.320 | that, if we re-tokenized, would be of high probability,
02:04:28.220 | if that makes sense.
02:04:29.220 | So that we can actually add a single individual character
02:04:33.520 | instead of just like adding the next full token
02:04:35.800 | that comes after this partial token list.
02:04:38.760 | So this is very tricky to describe,
02:04:40.560 | and I invite you to maybe like look through this.
02:04:42.840 | It ends up being extremely gnarly and hairy kind of topic,
02:04:45.640 | and it comes from tokenization fundamentally.
02:04:47.600 | So maybe I can even spend an entire video
02:04:50.660 | talking about unstable tokens sometime in the future.
02:04:53.120 | Okay, and I'm really saving the best for last.
02:04:55.320 | My favorite one by far is the solid gold Magikarp.
02:04:58.940 | (laughs)
02:05:00.280 | It was just, okay, so this comes from this blog post,
02:05:02.720 | solid gold Magikarp.
02:05:04.640 | And this is internet famous now for those of us in LLMs.
02:05:09.640 | And basically I would advise you
02:05:11.440 | to read this blog post in full.
02:05:13.600 | But basically what this person was doing
02:05:15.080 | is this person went to the token embedding table
02:05:21.200 | and clustered the tokens
02:05:23.040 | based on their embedding representation.
02:05:25.680 | And this person noticed that there's a cluster of tokens
02:05:29.160 | that look really strange.
02:05:30.420 | So there's a cluster here, @rot, E-stream fame,
02:05:33.600 | solid gold Magikarp, signup message,
02:05:35.640 | like really weird tokens in basically
02:05:39.720 | in this embedding cluster.
02:05:41.520 | And so where are these tokens
02:05:42.720 | and where did they even come from?
02:05:43.760 | Like what is solid gold Magikarp?
02:05:45.120 | It makes no sense.
02:05:46.160 | And then they found a bunch of these tokens.
02:05:50.900 | And then they noticed that actually the plot thickens here
02:05:53.340 | because if you ask the model about these tokens,
02:05:56.100 | like you ask it some very benign question,
02:05:59.060 | like please can you repeat back to me
02:06:00.580 | the string SolidGoldMagikarp?
02:06:03.020 | Then you get a variety of basically
02:06:04.860 | totally broken LLM behavior.
02:06:07.500 | So either you get evasion, so I'm sorry I can't hear you,
02:06:10.820 | or you get a bunch of hallucinations as a response.
02:06:13.460 | You can even get back like insults.
02:06:16.340 | So you ask it about streamer bot
02:06:18.660 | and the model actually just calls you names.
02:06:22.300 | Or it kind of comes up with like weird humor.
02:06:25.300 | Like you're actually breaking the model
02:06:27.300 | by asking about these very simple strings
02:06:29.500 | like @rot and solid gold Magikarp.
02:06:32.240 | So like what the hell is happening?
02:06:33.600 | And there's a variety of here documented behaviors.
02:06:36.980 | There's a bunch of tokens, not just SolidGoldMagikarp,
02:06:39.060 | that have that kind of a behavior.
02:06:41.380 | And so basically there's a bunch of like trigger words.
02:06:43.940 | And if you ask the model about these trigger words,
02:06:46.100 | or you just include them in your prompt,
02:06:48.160 | the model goes haywire
02:06:49.300 | and has all kinds of really strange behaviors,
02:06:52.180 | including sort of ones that violate
02:06:54.220 | typical safety guidelines and the alignment of the model,
02:06:57.600 | like it's swearing back at you.
02:06:59.600 | So what is happening here
02:07:01.180 | and how can this possibly be true?
02:07:03.180 | Well, this again comes down to tokenization.
02:07:06.140 | So what's happening here is that SolidGoldMagikarp,
02:07:08.500 | if you actually dig into it, is a Reddit user.
02:07:11.580 | So there's a u/SolidGoldMagikarp.
02:07:15.100 | And probably what happened here,
02:07:16.840 | even though I don't know that this has been like
02:07:18.360 | really definitively explored,
02:07:20.420 | but what is thought to have happened
02:07:22.860 | is that the tokenization dataset was very different
02:07:26.180 | from the training dataset for the actual language model.
02:07:29.660 | So in the tokenization dataset,
02:07:30.940 | there was a ton of Reddit data potentially,
02:07:33.380 | where u/SolidGoldMagikarp was mentioned in the text.
02:07:37.300 | Because SolidGoldMagikarp was a very active
02:07:40.420 | sort of person who would post a lot,
02:07:43.340 | this would be a string that occurs many times
02:07:45.220 | in a tokenization dataset.
02:07:46.940 | Because it occurs many times in the tokenization dataset,
02:07:50.020 | these tokens would end up getting merged
02:07:51.660 | to a single individual token
02:07:53.100 | for that single Reddit user, SolidGoldMagikarp.
02:07:56.300 | So they would have a dedicated token
02:07:58.220 | in a vocabulary of, was it 50,000 tokens in GPT-2,
02:08:01.820 | that is devoted to that Reddit user.
02:08:04.020 | And then what happens is the tokenization dataset
02:08:06.700 | has those strings, but then later when you train the model,
02:08:11.020 | the language model itself,
02:08:13.300 | this data from Reddit was not present.
02:08:16.300 | And so therefore, in the entire training set
02:08:18.760 | for the language model, SolidGoldMagikarp never occurs.
02:08:22.300 | That token never appears in the training set
02:08:25.100 | for the actual language model later.
02:08:27.440 | So this token never gets activated.
02:08:29.860 | It's initialized at random in the beginning of optimization.
02:08:32.780 | Then you have forward-backward passes
02:08:34.140 | and updates to the model.
02:08:35.300 | And this token is just never updated in the embedding table.
02:08:37.900 | That row vector never gets sampled.
02:08:39.980 | It never gets used.
02:08:41.060 | So it never gets trained and it's completely untrained.
02:08:43.700 | It's kind of like unallocated memory
02:08:45.540 | in a typical binary program written in C
02:08:47.940 | or something like that.
02:08:49.220 | So it's unallocated memory.
02:08:50.740 | And then at test time, if you evoke this token,
02:08:53.460 | then you're basically plucking out a row
02:08:55.320 | of the embedding table that is completely untrained
02:08:57.260 | and that feeds into a transformer
02:08:58.900 | and creates undefined behavior.
02:09:00.820 | And that's what we're seeing here.
02:09:01.740 | This is completely undefined,
02:09:02.980 | never-before-seen-in-training behavior.
02:09:05.700 | And so any of these kind of like weird tokens
02:09:07.640 | would evoke this behavior
02:09:08.620 | because fundamentally the model is out of sample,
02:09:13.540 | out of distribution.
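To make the "untrained embedding row" point concrete, here is a minimal sketch (not the video's code; toy sizes, and the token id is purely illustrative) showing that a row whose token never appears in the training batches receives zero gradient and never moves from its random initialization:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, n_embd = 50257, 16           # GPT-2-sized vocab, toy embedding width
emb = nn.Embedding(vocab_size, n_embd)
initial = emb.weight.detach().clone()

rare_id = 43453                          # illustrative id for a token that never occurs in the data
batch = torch.randint(0, 1000, (8, 32))  # training batches drawn only from ids 0..999
opt = torch.optim.SGD(emb.parameters(), lr=0.1)

for _ in range(100):
    loss = emb(batch).pow(2).mean()      # stand-in objective; any loss works for the demonstration
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.equal(emb.weight[rare_id], initial[rare_id]))          # True: this row never got a gradient
print(torch.equal(emb.weight[batch[0, 0]], initial[batch[0, 0]]))  # False: rows that were used did move
```

At test time, feeding `rare_id` into the model plucks out exactly that never-updated random row, which is the "unallocated memory" behavior described above.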
02:09:16.140 | Okay, and the very last thing I wanted
02:09:17.300 | to just briefly mention and point out,
02:09:18.980 | although I think a lot of people are quite aware of this,
02:09:21.340 | is that different kinds of formats
02:09:22.900 | and different representations and different languages
02:09:25.060 | and so on might be more or less efficient
02:09:27.500 | with GPT tokenizers, or with any other tokenizer
02:09:30.860 | for any other LLM, for that matter.
02:09:32.060 | So for example, JSON is actually really dense in tokens
02:09:35.140 | and YAML is a lot more efficient in tokens.
02:09:39.040 | So for example, the same structured data in JSON and in YAML:
02:09:43.280 | the JSON is 116 tokens and the YAML is 99 tokens.
02:09:47.540 | So quite a bit of an improvement.
02:09:49.400 | And so in the token economy, where we are paying per token
02:09:53.760 | in many ways, you are paying in context length
02:09:56.400 | and you're paying in dollar amount
02:09:58.560 | for the cost of processing all this kind of structured data
02:10:01.520 | when you have to.
02:10:03.000 | So prefer to use YAML over JSON.
02:10:05.340 | And in general, tokenization density is
02:10:07.640 | something that you have to care about
02:10:10.560 | and worry about at all times:
02:10:11.840 | try to find efficient encoding schemes,
02:10:14.180 | spend a lot of time in tiktokenizer,
02:10:16.080 | and measure the token efficiencies
02:10:18.180 | of different formats and settings and so on.
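Here is a minimal sketch of measuring this yourself (assuming the tiktoken package is installed; the sample data and the exact counts it produces are illustrative, not the figures quoted above):

```python
import json
import tiktoken

data = {"products": [{"name": "widget", "price": 9.99, "tags": ["a", "b"]}]}

as_json = json.dumps(data, indent=2)
# hand-written equivalent YAML, to avoid a pyyaml dependency
as_yaml = "products:\n- name: widget\n  price: 9.99\n  tags: [a, b]\n"

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 base encoding
print("JSON tokens:", len(enc.encode(as_json)))
print("YAML tokens:", len(enc.encode(as_yaml)))
```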
02:10:20.520 | Okay, so that concludes my fairly long video
02:10:23.000 | on tokenization.
02:10:24.440 | I know it's dry, I know it's annoying,
02:10:26.900 | I know it's irritating.
02:10:28.080 | I personally really dislike this stage.
02:10:30.660 | What I do have to say at this point is don't brush it off.
02:10:33.880 | There's a lot of foot guns, sharp edges here,
02:10:36.080 | security issues, AI safety issues,
02:10:38.940 | as we saw plugging in unallocated memory
02:10:41.120 | into language models.
02:10:43.020 | So it's worth understanding this stage.
02:10:46.020 | That said, I will say that eternal glory goes to anyone
02:10:49.860 | who can get rid of it.
02:10:51.160 | I showed you one possible paper that tried to do that.
02:10:54.440 | And I think, I hope a lot more can follow over time.
02:10:58.080 | And my final recommendations for applications right now
02:11:00.460 | are: if you can reuse the GPT-4 tokens and vocabulary
02:11:04.240 | in your application,
02:11:05.080 | that's something you should consider,
02:11:06.200 | and just use tiktoken because it is a very efficient
02:11:08.800 | and nice library for BPE inference.
02:11:12.640 | I also really like the byte-level BPE
02:11:14.440 | that tiktoken and OpenAI use.
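For reference, a minimal sketch of what BPE inference with tiktoken looks like (assuming the package is installed):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 base vocabulary
ids = enc.encode("hello world")
print(ids)
print(enc.decode(ids))  # round-trips back to "hello world"
```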
02:11:16.540 | If you for some reason want to train your own vocabulary
02:11:20.120 | from scratch, then I would use BPE with sentencepiece.
02:11:25.120 | Though, as I mentioned, I'm not a huge fan of sentencepiece.
02:11:29.440 | I don't like its byte fallback,
02:11:33.040 | and I don't like that it's doing BPE
02:11:34.920 | on Unicode code points.
02:11:36.320 | It also has like a million settings,
02:11:38.600 | and I think there are a lot of foot guns here.
02:11:40.000 | I think it's really easy to miscalibrate them,
02:11:42.160 | and you end up cropping your sentences
02:11:43.680 | or something like that because of some hyperparameter
02:11:46.600 | that you don't fully understand.
02:11:48.000 | So be very careful with the settings:
02:11:50.400 | try to copy-paste exactly what Meta did,
02:11:53.240 | or basically spend a lot of time looking
02:11:55.400 | at all the hyperparameters,
02:11:56.880 | go through the code of sentencepiece,
02:11:58.440 | and make sure that you have this correct.
02:12:00.440 | But even if you have all the settings correct,
02:12:03.160 | I still think that the algorithm is kind of inferior
02:12:05.720 | to what's happening here.
02:12:07.440 | And maybe,
02:12:08.880 | if you really need to train your vocabulary,
02:12:10.960 | the best thing is to just wait
02:12:12.240 | for minbpe to become as efficient as possible.
02:12:15.920 | And that's something that I hope to work on.
02:12:19.360 | At some point, maybe we can be training with it;
02:12:22.520 | really what we want is tiktoken, but with training code.
02:12:25.800 | And that is the ideal thing that currently does not exist.
02:12:29.640 | minbpe is an implementation of it,
02:12:32.920 | but currently it's in Python.
02:12:34.960 | So that's currently what I have to say for tokenization.
02:12:38.040 | There might be an advanced video in the future
02:12:41.040 | that is even drier and even more detailed.
02:12:42.880 | But for now, I think we're gonna leave things off here
02:12:45.040 | and I hope that was helpful.