Let's build the GPT Tokenizer
Chapters
0:00 intro: Tokenization, GPT-2 paper, tokenization-related issues
5:50 tokenization by example in a Web UI (tiktokenizer)
14:56 strings in Python, Unicode code points
18:15 Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32
22:47 daydreaming: deleting tokenization
23:50 Byte Pair Encoding (BPE) algorithm walkthrough
27:02 starting the implementation
28:35 counting consecutive pairs, finding most common pair
30:36 merging the most common pair
34:58 training the tokenizer: adding the while loop, compression ratio
39:20 tokenizer/LLM diagram: it is a completely separate stage
42:47 decoding tokens to strings
48:21 encoding strings to tokens
57:36 regex patterns to force splits across categories
71:38 tiktoken library intro, differences between GPT-2/GPT-4 regex
74:59 GPT-2 encoder.py released by OpenAI walkthrough
78:26 special tokens, tiktoken handling of, GPT-2/GPT-4 differences
85:28 minbpe exercise time! write your own GPT-4 tokenizer
88:42 sentencepiece library intro, used to train Llama 2 vocabulary
103:27 how to set vocabulary size? revisiting gpt.py transformer
108:11 training new tokens, example of prompt compression
109:58 multimodal [image, video, audio] tokenization with vector quantization
111:41 revisiting and explaining the quirks of LLM tokenization
130:20 final recommendations
the process of tokenization in large language models. 00:00:07.820 |
and that's because tokenization is my least favorite part 00:00:12.640 |
but unfortunately it is necessary to understand 00:00:14.480 |
in some detail because it is fairly hairy, gnarly, 00:00:17.580 |
and there's a lot of hidden foot guns to be aware of, 00:00:20.400 |
and a lot of oddness with large language models 00:00:33.260 |
but we did a very naive, simple version of tokenization. 00:00:36.920 |
So when you go to the Google Colab for that video, 00:00:40.220 |
you see here that we loaded our training set, 00:00:43.160 |
and our training set was this Shakespeare data set. 00:00:46.360 |
Now, in the beginning, the Shakespeare data set 00:00:53.120 |
how do we plug text into large language models? 00:00:59.080 |
we created a vocabulary of 65 possible characters 00:01:11.320 |
for converting from every possible character, 00:01:14.000 |
a little string piece, into a token, an integer. 00:01:17.860 |
So here, for example, we tokenized the string, hi there, 00:01:25.760 |
And here we took the first 1,000 characters of our data set, 00:01:40.120 |
Now, later we saw that the way we plug these tokens 00:01:44.860 |
into the language model is by using an embedding table. 00:01:48.540 |
And so basically, if we have 65 possible tokens, 00:01:52.680 |
then this embedding table is going to have 65 rows. 00:01:56.040 |
And roughly speaking, we're taking the integer 00:02:00.720 |
we're using that as a lookup into this table, 00:02:03.800 |
and we're plucking out the corresponding row. 00:02:09.500 |
that we're going to train using backpropagation. 00:02:11.360 |
And this is the vector that then feeds into the transformer. 00:02:18.080 |
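As a rough sketch of what this lookup looks like in PyTorch (the sizes and token ids below are illustrative toy numbers, not pulled from the actual notebook):

```python
import torch
import torch.nn as nn

vocab_size, n_embd = 65, 32                               # toy numbers for the character-level example
token_embedding_table = nn.Embedding(vocab_size, n_embd)  # 65 rows, one vector per token

idx = torch.tensor([[46, 47]])       # e.g. the token ids for "hi" in that toy vocab
x = token_embedding_table(idx)       # row lookup: (1, 2) ids -> (1, 2, 32) vectors
# these vectors are trained by backpropagation and are what feeds into the transformer
```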
So here we had a very naive tokenization process 00:02:24.960 |
But in practice, in state-of-the-art language models, 00:02:27.920 |
people use a lot more complicated schemes, unfortunately, 00:02:39.320 |
And the way these character chunks are constructed 00:02:59.880 |
And I would say that that's probably the GPT-2 paper. 00:03:16.060 |
where you have a vocabulary of 50,257 possible tokens. 00:03:21.620 |
And the context size is going to be 1,024 tokens. 00:03:40.500 |
the atom of large language models, if you will. 00:03:47.000 |
And tokenization is the process for translating strings 00:03:49.840 |
or text into sequences of tokens and vice versa. 00:04:01.640 |
And that's because tokens are, again, pervasive. 00:04:22.560 |
of some of the complexities that come from the tokenization, 00:04:27.800 |
for why we are doing all of this and why this is so gross. 00:04:31.720 |
So tokenization is at the heart of a lot of weirdness 00:04:35.880 |
and I would advise that you do not brush it off. 00:04:38.800 |
A lot of the issues that may look like just issues 00:04:53.560 |
can, you know, not able to do spelling tasks very easily, 00:05:01.180 |
for the large language model to perform natively. 00:05:07.220 |
and to a large extent, this is due to tokenization. 00:05:16.740 |
GPT-2 specifically would have had quite a bit more issues 00:05:19.460 |
with Python than future versions of it due to tokenization. 00:05:47.060 |
So basically, tokenization is at the heart of many issues. 00:05:50.420 |
I will loop back around to these at the end of the video, 00:05:53.200 |
but for now, let me just skip over it a little bit, 00:06:09.260 |
So you can just type here stuff, hello world, 00:06:20.100 |
On the right, we're currently using a GPT-2 tokenizer. 00:06:34.340 |
So for example, this word tokenization became two tokens, 00:06:50.060 |
So be careful, on the bottom, you can show whitespace, 00:07:29.860 |
and then the token ' 6' (space six), followed by '77'. 00:07:34.060 |
So what's happening here is that 127 is feeding in 00:07:36.740 |
as a single token into the large language model, 00:07:45.900 |
And so the large language model has to sort of 00:07:48.740 |
take account of that and process it correctly 00:07:53.940 |
And see here, 804 will be broken up into two tokens, 00:07:59.060 |
And here I have another example of four-digit numbers, 00:08:01.660 |
and they break up in a way that they break up 00:08:04.860 |
Sometimes you have multiple digits, a single token. 00:08:08.220 |
Sometimes you have individual digits as many tokens, 00:08:20.220 |
and you see here that this became two tokens. 00:08:23.420 |
But for some reason, when I say I have an egg, 00:08:26.320 |
you see when it's a space egg, it's two tokens. 00:08:31.980 |
So just egg by itself in the beginning of a sentence 00:08:37.460 |
is suddenly a single token for the exact same string. 00:08:41.260 |
Here, lowercase egg turns out to be a single token. 00:08:46.020 |
And in particular, notice that the color is different, 00:08:51.180 |
And of course, capital egg would also be different tokens. 00:08:55.380 |
And again, this would be two tokens arbitrarily. 00:09:01.580 |
depending on if it's in the beginning of a sentence, 00:09:03.320 |
at the end of a sentence, lowercase, uppercase, or mixed, 00:09:06.340 |
all of this will be basically very different tokens 00:09:10.180 |
And the language model has to learn from raw data, 00:09:14.980 |
that these are actually all the exact same concept. 00:09:20.940 |
and understand just based on the data patterns 00:09:32.500 |
I have an introduction from OpenAI's ChatGPT in Korean. 00:09:47.020 |
non-English languages work slightly worse in ChatGPT. 00:09:56.500 |
is much larger for English than for everything else. 00:10:00.340 |
not just for the large language model itself, 00:10:05.860 |
we're going to see that there's a training set as well. 00:10:07.980 |
And there's a lot more English than non-English. 00:10:11.820 |
is that we're going to have a lot more longer tokens 00:10:22.420 |
you might see that it's 10 tokens or something like that. 00:10:25.180 |
But if you translate that sentence into say Korean 00:10:29.220 |
you'll typically see that number of tokens used 00:10:35.920 |
So we're using a lot more tokens for the exact same thing. 00:10:40.420 |
And what this does is it bloats up the sequence length 00:10:46.980 |
and then in the attention of the transformer, 00:10:53.180 |
in the maximum context length of that transformer. 00:10:59.700 |
is stretched out from the perspective of the transformer. 00:11:06.220 |
that's used for the tokenizer and the tokenization itself. 00:11:25.580 |
is a little snippet of Python for doing FizzBuzz. 00:11:30.900 |
look, all these individual spaces are all separate tokens. 00:11:45.700 |
is that when the transformer is going to consume 00:11:50.620 |
it needs to handle all these spaces individually. 00:12:10.300 |
It's just that if you use a lot of indentation 00:12:17.800 |
and it's separated across way too much of the sequence, 00:12:20.540 |
and we are running out of the context length in the sequence. 00:12:33.700 |
creates a token count of 300 for this string here. 00:12:41.620 |
And we see that the token count drops to 185. 00:12:46.220 |
we are now roughly halving the number of tokens. 00:12:50.620 |
this is because the number of tokens in the GPT-4 tokenizer 00:12:54.440 |
is roughly double that of the number of tokens 00:13:01.720 |
Now, you can imagine that this is a good thing 00:13:08.560 |
So this is a lot denser input to the transformer. 00:13:15.360 |
every single token has a finite number of tokens before it 00:13:19.560 |
And so what this is doing is we're roughly able to see 00:13:25.840 |
for what token to predict next because of this change. 00:13:34.520 |
because as you increase the number of tokens, 00:13:36.240 |
now your embedding table is sort of getting a lot larger. 00:13:42.520 |
and there's the softmax there, and that grows as well. 00:13:45.280 |
We're gonna go into more detail later on this, 00:13:47.200 |
but there's some kind of a sweet spot somewhere 00:13:56.720 |
Now, one thing I would like you to note specifically 00:14:01.160 |
is that the handling of the whitespace for Python 00:14:07.640 |
these four spaces are represented as one single token 00:14:14.640 |
And here, seven spaces were all grouped into a single token. 00:14:21.400 |
And this was a deliberate choice made by OpenAI 00:14:26.440 |
and they group a lot more whitespace into a single character. 00:14:33.880 |
and therefore we can attend to more code before it 00:14:37.980 |
when we're trying to predict the next token in the sequence. 00:14:40.760 |
And so the improvement in the Python coding ability 00:14:47.480 |
and the architecture and the details of the optimization, 00:14:51.540 |
is also coming from the design of the Tokenizer 00:15:01.140 |
We want to take strings and feed them into language models. 00:15:05.040 |
For that, we need to somehow tokenize strings 00:15:13.860 |
to make a lookup into a lookup table of vectors 00:15:16.820 |
and feed those vectors into the transformer as an input. 00:15:20.300 |
Now, the reason this gets a little bit tricky, of course, 00:15:25.760 |
We want to support different kinds of languages. 00:15:27.920 |
So this is annyeonghaseyo in Korean, which is hello. 00:15:31.660 |
And we also want to support many kinds of special characters 00:15:33.820 |
that we might find on the internet, for example, emoji. 00:15:38.260 |
So how do we feed this text into transformers? 00:15:46.140 |
So if you go to the documentation of a string in Python, 00:15:49.660 |
you can see that strings are immutable sequences 00:16:02.380 |
by the Unicode consortium as part of the Unicode standard. 00:16:07.180 |
And what this is really is that it's just a definition 00:16:15.280 |
and what integers represent those characters. 00:16:18.680 |
So this is 150,000 characters across 161 scripts 00:16:24.640 |
you can see that the standard is very much alive. 00:16:34.960 |
lots of types of characters, like for example, 00:16:38.980 |
all these characters across different scripts. 00:16:41.300 |
So the way we can access the Unicode code point 00:16:50.620 |
and I can see that for the single character H, 00:17:03.620 |
and we can see that the code point for this one is 128,000. 00:17:13.180 |
Now, keep in mind, you can't plug in strings here 00:17:15.860 |
because this doesn't have a single code point. 00:17:18.960 |
It only takes a single Unicode code point character 00:17:23.800 |
So in this way, we can look up all the characters 00:17:29.640 |
of this specific string and their code points. 00:17:43.740 |
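For reference, a small sketch of inspecting code points in Python, roughly as shown here:

```python
print(ord("h"))                      # 104, the Unicode code point of a single character
print(ord("안"))                     # 50504
print(chr(50504))                    # '안', going back the other way
print([ord(ch) for ch in "hi 👋"])   # ord() only accepts one character, so iterate: [104, 105, 32, 128075]
```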
So why can't we simply just use these integers 00:17:53.420 |
is that the vocabulary in that case would be quite long. 00:17:58.380 |
this is a vocabulary of 150,000 different code points. 00:18:08.040 |
And so it's not kind of a stable representation necessarily 00:18:13.220 |
So for those reasons, we need something a bit better. 00:18:15.860 |
So to find something better, we turn to encodings. 00:18:32.340 |
and translate it into binary data or byte streams. 00:18:40.600 |
Now this Wikipedia page is actually quite long, 00:18:50.540 |
And this byte stream is between one to four bytes. 00:18:55.400 |
So depending on the Unicode point, according to the schema, 00:18:58.080 |
you're gonna end up with between one to four bytes 00:19:01.640 |
On top of that, there's UTF-8, UTF-16, and UTF-32. 00:19:13.680 |
So the full kind of spectrum of pros and cons 00:19:21.400 |
I just like to point out that I enjoyed this blog post 00:19:26.120 |
also has a number of references that can be quite useful. 00:19:32.620 |
And this manifesto describes the reason why UTF-8 00:19:41.280 |
and why it is used a lot more prominently on the internet. 00:19:45.540 |
One of the major advantages, just to give you a sense, 00:19:56.860 |
But I'm not gonna go into the full detail in this video. 00:19:59.580 |
So suffice to say that we like the UTF-8 encoding, 00:20:04.700 |
and see what we get if we encode it into UTF-8. 00:20:07.360 |
The string class in Python actually has dot encode, 00:20:11.660 |
and you can give it the encoding, which is, say, UTF-8. 00:20:15.180 |
Now, what we get out of this is not very nice 00:20:17.180 |
because this is the bytes, this is a bytes object, 00:20:20.540 |
and it's not very nice in the way that it's printed. 00:20:23.500 |
So I personally like to take it through a list 00:20:26.040 |
because then we actually get the raw bytes of this encoding. 00:20:30.840 |
So this is the raw bytes that represent this string 00:20:57.320 |
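A small sketch of that in Python (the example string is just for illustration):

```python
s = "안녕하세요 👋 (hello in Korean!)"
raw = s.encode("utf-8")                    # str -> bytes
print(list(raw))                           # the raw bytes as integers, 1 to 4 bytes per code point
print(len(s), "code points ->", len(raw), "bytes")
print(list("hello".encode("utf-16"))[:8])  # [255, 254, 104, 0, 101, 0, 108, 0]: BOM, then byte/zero pairs
```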
we just have the structure of zero something, 00:21:04.000 |
When we expand this, we can start to get a sense 00:21:06.140 |
of the wastefulness of this encoding for our purposes. 00:21:08.960 |
You see a lot of zeros followed by something, 00:21:14.580 |
So suffice it to say that we would like to stick 00:21:32.340 |
But this vocabulary size is very, very small. 00:21:35.260 |
What this is going to do if we just were to use it naively 00:21:37.900 |
is that all of our text would be stretched out 00:21:51.060 |
and the prediction at the top of the final layer 00:21:52.980 |
is going to be very tiny, but our sequences are very long. 00:21:55.940 |
And remember that we have pretty finite context lengths 00:21:59.860 |
in the attention that we can support in a transformer 00:22:15.100 |
for the purposes of the next token prediction task. 00:22:18.020 |
So we don't want to use the raw bytes of the UTF-8 encoding. 00:22:23.020 |
We want to be able to support larger vocabulary size 00:22:35.180 |
is we turn to the byte-pair encoding algorithm, 00:22:37.420 |
which will allow us to compress these byte sequences 00:22:59.020 |
Now, the problem is you actually have to go in 00:23:01.060 |
and you have to modify the transformer architecture 00:23:06.020 |
where the attention will start to become extremely expensive 00:23:12.780 |
they propose kind of a hierarchical structuring 00:23:20.980 |
"Together, these results establish the viability 00:23:22.900 |
of tokenization-free autoregressive sequence modeling 00:23:26.300 |
So tokenization-free would indeed be amazing. 00:23:28.680 |
We would just feed byte streams directly into our models, 00:23:35.580 |
by sufficiently many groups and at sufficient scale, 00:23:38.580 |
but something like this at one point would be amazing 00:23:44.020 |
and we can't feed this directly into language models 00:23:51.460 |
the byte-pair encoding algorithm is not all that complicated 00:23:54.460 |
and the Wikipedia page is actually quite instructive 00:23:58.980 |
What we're doing is we have some kind of a input sequence. 00:24:01.780 |
Like, for example, here we have only four elements 00:24:07.920 |
So instead of bytes, let's say we just had four, 00:24:11.740 |
The sequence is too long and we'd like to compress it. 00:24:15.380 |
So what we do is that we iteratively find the pair of tokens 00:24:27.060 |
we replace that pair with just a single new token 00:24:33.380 |
So for example, here, the byte-pair AA occurs most often, 00:24:37.320 |
so we mint a new token, let's call it capital Z, 00:24:40.740 |
and we replace every single occurrence of AA by Z. 00:24:53.740 |
and we've converted it to a sequence of only nine tokens, 00:25:11.980 |
and identify the pair of tokens that are most frequent. 00:25:19.140 |
Well, we are going to replace AB with a new token 00:25:23.460 |
So Y becomes AB, and then every single occurrence of AB 00:25:26.220 |
is now replaced with Y, so we end up with this. 00:25:29.780 |
So now we only have one, two, three, four, five, six, 00:25:36.100 |
but we have not just four vocabulary elements or five, 00:25:46.220 |
look through the sequence, find that the phrase ZY, 00:25:51.540 |
and replace it one more time with another character, 00:25:56.580 |
So X is ZY, and we replace all occurrences of ZY, 00:26:02.020 |
So basically, after we have gone through this process, 00:26:14.680 |
we now have a sequence of one, two, three, four, five tokens, 00:26:23.260 |
And so in this way, we can iteratively compress our sequence 00:26:29.560 |
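As a small illustration of this walkthrough in Python. Note that ties between equally frequent pairs are broken arbitrarily here, so the intermediate merges may not match the ones above exactly, but the sequence still compresses from 11 tokens down to 5:

```python
from collections import Counter

seq = list("aaabdaaabac")            # 11 tokens to start
new_tokens = {}                      # newly minted token -> the pair it replaced

for new in ["Z", "Y", "X"]:          # mint three new tokens
    pairs = Counter(zip(seq, seq[1:]))
    top = max(pairs, key=pairs.get)  # most frequent consecutive pair
    new_tokens[new] = top
    out, i = [], 0
    while i < len(seq):
        if i < len(seq) - 1 and (seq[i], seq[i + 1]) == top:
            out.append(new)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    seq = out

print("".join(seq), new_tokens)      # 5 tokens remain, plus 3 new vocabulary elements
```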
So in the exact same way, we start out with byte sequences. 00:26:42.160 |
And we're going to iteratively start minting new tokens, 00:26:44.920 |
appending them to our vocabulary, and replacing things. 00:26:51.580 |
and also an algorithm for taking any arbitrary sequence 00:27:16.360 |
we just take our text and we encode it into UTF-8. 00:27:19.360 |
The tokens here at this point will be a raw bytes, 00:27:28.800 |
I'm going to convert all those bytes to integers, 00:27:48.720 |
And then here are the bytes encoded in UTF-8. 00:27:52.800 |
And we see that this has a length of 616 bytes 00:27:59.800 |
is because a lot of these simple ASCII characters, 00:28:03.240 |
or simple characters, they just become a single byte. 00:28:06.200 |
But a lot of these Unicode, more complex characters, 00:28:15.220 |
of the algorithm, is we'd like to iterate over here, 00:28:17.700 |
and find the pair of bytes that occur most frequently, 00:28:23.840 |
So if you are working along on a notebook, on a side, 00:28:26.460 |
then I encourage you to basically click on the link, 00:28:37.740 |
There are many different ways to implement this, 00:28:48.080 |
to iterate consecutive elements of this list, 00:28:55.500 |
of just incrementing by one for all the pairs. 00:29:05.640 |
The keys are these tuples of consecutive elements, 00:29:11.160 |
So just to print it in a slightly better way, 00:29:24.460 |
Calling items() on the dictionary returns pairs of (key, value). 00:29:29.060 |
And instead, I create a list here of value key, 00:29:38.020 |
And by default, Python will use the first element, 00:29:46.420 |
And then reverse, so it's descending, and print that. 00:29:52.640 |
was the most commonly occurring consecutive pair, 00:29:57.480 |
We can double check that that makes reasonable sense. 00:30:03.160 |
then you see that these are the 20 occurrences of that pair. 00:30:10.800 |
at what exactly that pair is, we can use chr, 00:30:19.200 |
So chr of 101 and chr of 32, and we see that this is 'e' and space. 00:30:28.220 |
meaning that a lot of these words seem to end with E. 00:30:36.840 |
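A minimal version of this pair-counting, roughly as described (assuming `tokens` is the list of UTF-8 byte integers from above):

```python
def get_stats(ids):
    """Count how often each consecutive pair occurs in a list of integers."""
    counts = {}
    for pair in zip(ids, ids[1:]):       # iterate consecutive elements
        counts[pair] = counts.get(pair, 0) + 1
    return counts

stats = get_stats(tokens)
top_pair = max(stats, key=stats.get)     # the most commonly occurring consecutive pair
print(top_pair, chr(top_pair[0]), chr(top_pair[1]))   # e.g. (101, 32), i.e. 'e' followed by space
```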
So now that we've identified the most common pair, 00:30:41.980 |
We're going to mint a new token with the ID of 256, right? 00:30:46.340 |
Because these tokens currently go from zero to 255. 00:30:49.740 |
So when we create a new token, it will have an ID of 256, 00:30:53.780 |
and we're going to iterate over this entire list. 00:31:11.620 |
just so we don't pollute the notebook too much. 00:31:20.440 |
So we're basically calling the max on this dictionary stats, 00:31:28.640 |
And then the question is, how does it rank keys? 00:31:31.760 |
So you can provide it with a function that ranks keys, 00:31:52.240 |
but again, there are many different versions of it. 00:31:58.900 |
and that pair will be replaced with the new index IDX. 00:32:08.240 |
So we create this new list, and then we start at zero, 00:32:11.680 |
and then we go through this entire list sequentially 00:32:20.440 |
So here we are checking that the pair matches. 00:32:26.360 |
that you have to append if you're trying to be careful, 00:32:31.700 |
to be out of bounds at the very last position 00:32:34.320 |
when you're on the rightmost element of this list. 00:32:36.660 |
Otherwise, this would give you an out-of-bounds error. 00:32:40.240 |
that we're not at the very, very last element. 00:32:47.480 |
we append to this new list that replacement index, 00:32:56.440 |
But otherwise, if we haven't found a matching pair, 00:32:58.880 |
we just sort of copy over the element at that position 00:33:10.300 |
and we want to replace the occurrences of 67 with 99, 00:33:21.360 |
So now I'm going to uncomment this for our actual use case, 00:33:38.240 |
So recall that previously, we had a length 616 in this list, 00:33:49.620 |
which makes sense because there are 20 occurrences. 00:34:02.320 |
So this is the original array, plenty of them. 00:34:04.980 |
And in the second array, there are no occurrences of 101, 32. 00:34:08.320 |
So we've successfully merged this single pair. 00:34:13.320 |
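A minimal version of this merge function, roughly as described, using the same toy example (`tokens` and `top_pair` refer to the pair-counting sketch above):

```python
def merge(ids, pair, idx):
    """In the list of ints `ids`, replace every consecutive occurrence of `pair` with the new token `idx`."""
    newids = []
    i = 0
    while i < len(ids):
        # the i < len(ids) - 1 guard avoids indexing out of bounds at the very last position
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

print(merge([5, 6, 6, 7, 9, 1], (6, 7), 99))   # [5, 6, 99, 9, 1]
tokens2 = merge(tokens, top_pair, 256)         # e.g. length 616 -> 596 after merging (101, 32)
```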
So we are going to go over the sequence again, 00:34:19.140 |
that uses these functions to do this sort of iteratively. 00:34:25.160 |
Well, that's totally up to us as a hyperparameter. 00:34:27.460 |
The more steps we take, the larger will be our vocabulary, 00:34:35.940 |
that we usually find works the best in practice. 00:34:40.900 |
and we tune it, and we find good vocabulary sizes. 00:34:44.020 |
As an example, GPT-4 currently uses roughly 100,000 tokens. 00:34:47.780 |
And ballpark, those are reasonable numbers currently 00:34:53.500 |
So let me now write, putting it all together, 00:34:58.720 |
Okay, now, before we dive into the while loop, 00:35:04.540 |
and instead of grabbing just the first paragraph or two, 00:35:12.460 |
will allow us to have more representative statistics 00:35:15.920 |
and we'll just get more sensible results out of it, 00:35:23.180 |
We encode it into bytes using the UTF-8 encoding. 00:35:29.140 |
we are just changing it into a list of integers in Python, 00:35:36.380 |
And then this is the code that I came up with 00:35:48.100 |
just so that you have the point of reference here. 00:35:57.380 |
is we want to decide on a final vocabulary size 00:36:02.480 |
And as I mentioned, this is a hyperparameter, 00:36:10.140 |
because that way we're going to be doing exactly 20 merges. 00:36:14.000 |
And 20 merges because we already have 256 tokens 00:36:44.340 |
So this merges dictionary is going to maintain 00:36:52.300 |
And so what we're going to be building up here 00:37:02.220 |
For us, we're starting with the leaves on the bottom, 00:37:08.580 |
And then we're starting to merge two of them at a time. 00:37:11.380 |
And so it's not a tree, it's more like a forest 00:37:21.540 |
we're going to find the most commonly occurring pair. 00:37:24.840 |
We're going to mint a new token integer for it. 00:37:33.620 |
and we're going to replace all the occurrences of that pair 00:37:39.520 |
And we're going to record that this pair of integers 00:37:45.460 |
So running this gives us the following output. 00:37:54.160 |
And for example, the first merge was exactly as before, 00:37:57.500 |
the 101, 32 tokens merging into a new token 256. 00:38:02.500 |
Now keep in mind that the individual tokens 101 and 32 00:38:06.440 |
can still occur in the sequence after merging. 00:38:09.220 |
It's only when they occur exactly consecutively 00:38:13.680 |
And in particular, the other thing to notice here 00:38:17.620 |
is that the token 256, which is the newly minted token, 00:38:35.820 |
So that's why we're building up a small sort of binary forest 00:38:43.020 |
is we can take a look at the compression ratio 00:38:46.100 |
So in particular, we started off with this tokens list. 00:39:02.660 |
of simply just dividing the two is roughly 1.27. 00:39:06.220 |
So that's the amount of compression we were able to achieve 00:39:11.100 |
And of course, the more vocabulary elements you add, 00:39:15.380 |
the greater the compression ratio here would be. 00:39:25.780 |
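Putting the two helpers together, a sketch of the training loop as described (a vocab size of 276 means exactly 20 merges on top of the 256 raw byte tokens; `get_stats` and `merge` are the sketches above):

```python
vocab_size = 276                 # desired final vocabulary size: a hyperparameter
num_merges = vocab_size - 256
ids = list(tokens)               # copy so we don't destroy the original list

merges = {}                      # (int, int) -> int: the child pair -> newly minted token
for i in range(num_merges):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)
    idx = 256 + i
    ids = merge(ids, pair, idx)
    merges[pair] = idx

print(f"compression ratio: {len(tokens) / len(ids):.2f}X")
```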
Now, one point that I wanted to make is that, 00:39:28.340 |
and maybe this is a diagram that can help kind of illustrate, 00:39:32.300 |
is that tokenizer is a completely separate object 00:39:41.660 |
This is a completely separate pre-processing stage usually. 00:39:44.820 |
So the tokenizer will have its own training set, 00:39:51.540 |
So the tokenizer has a training set of documents 00:39:53.300 |
on which you're going to train the tokenizer. 00:39:55.740 |
And then we're performing the byte-pair encoding algorithm, 00:39:59.460 |
as we saw above, to train the vocabulary of this tokenizer. 00:40:06.380 |
that you would run a single time in the beginning. 00:40:13.820 |
Once you have the tokenizer, once it's trained, 00:40:24.340 |
So the tokenizer is a translation layer between raw text, 00:40:28.200 |
which is, as we saw, the sequence of Unicode code points. 00:40:31.840 |
It can take raw text and turn it into a token sequence, 00:40:45.920 |
we are going to turn to how we can do the encoding 00:40:57.320 |
And then the language model is going to be trained 00:41:01.120 |
And typically, in a sort of a state-of-the-art application, 00:41:14.680 |
that you're just left with, the tokens themselves. 00:41:20.600 |
is actually reading when it's training on them. 00:41:30.240 |
I think the most important thing I want to get across 00:41:36.360 |
You may want to have those training sets be different 00:41:38.440 |
between the tokenizer and the larger language model. 00:41:40.720 |
So for example, when you're training the tokenizer, 00:41:51.640 |
So you may want to look into different kinds of mixtures 00:41:55.960 |
and different amounts of code and things like that, 00:41:58.760 |
because the amount of different language that you have 00:42:03.720 |
will determine how many merges of it there will be. 00:42:09.420 |
with which this type of data sort of has in the token space. 00:42:24.060 |
then that means that more Japanese tokens will get merged. 00:42:26.840 |
And therefore, Japanese will have shorter sequences. 00:42:39.340 |
So we're now going to turn to encoding and decoding 00:42:55.020 |
to get back a Python string object, so the raw text. 00:42:59.140 |
So this is the function that we'd like to implement. 00:43:04.980 |
If you'd like, try to implement this function yourself. 00:43:08.600 |
Otherwise, I'm going to start pasting in my own solution. 00:43:16.420 |
I will create a kind of pre-processing variable 00:43:20.780 |
And vocab is a mapping or a dictionary in Python 00:43:26.020 |
from the token ID to the bytes object for that token. 00:43:31.020 |
So we begin with the raw bytes for tokens from zero to 255. 00:43:43.360 |
So this is basically the bytes representation 00:43:46.840 |
of the first child followed by the second one. 00:43:51.760 |
So this addition here is an addition of two bytes objects, 00:43:58.440 |
One tricky thing to be careful with, by the way, 00:44:05.920 |
And it really matters that this runs in the order 00:44:09.400 |
in which we inserted items into the merges dictionary. 00:44:20.120 |
with respect to how we inserted elements into merges, 00:44:24.320 |
But we are using modern Python, so we're okay. 00:44:31.280 |
the first thing we're going to do is get the tokens. 00:44:33.980 |
So the way I implemented this here is I'm taking, 00:44:49.640 |
And then these tokens here at this point are raw bytes. 00:44:59.080 |
So previously, we called dot encode on a string object 00:45:02.020 |
to get the bytes, and now we're doing it opposite. 00:45:07.240 |
on the bytes object to get a string in Python. 00:45:16.800 |
Now, this actually has an issue in the way I implemented it, 00:45:27.760 |
if we plug in some sequence of IDs that is unlucky. 00:45:43.560 |
But when I try to decode 128 as a single element, 00:45:48.320 |
the token 128 is what in string or in Python object? 00:46:16.080 |
So in particular, if you have a multi-byte object 00:46:20.960 |
they have to have this special sort of envelope 00:46:25.800 |
And so what's happening here is that invalid start byte, 00:46:29.800 |
that's because 128, the binary representation of it, 00:46:38.800 |
And we see here that that doesn't conform to the format 00:46:42.480 |
just doesn't fit any of these rules, so to speak. 00:46:45.660 |
So it's an invalid start byte, which is byte one. 00:46:53.520 |
and then the content of your Unicode in excess here. 00:46:57.080 |
So basically, we don't exactly follow the UTF-8 standard, 00:47:02.400 |
And so the way to fix this is to use this errors equals 00:47:23.680 |
This is the full list of all the errors that you can use. 00:47:30.360 |
And that will replace with this special marker, 00:47:42.360 |
So basically, not every single byte sequence is valid UTF-8. 00:47:49.440 |
And if it happens that your large language model, 00:47:52.040 |
for example, predicts your tokens in a bad manner, 00:48:07.080 |
And this is what you will also find in the OpenAI code 00:48:12.240 |
But basically, whenever you see this kind of a character 00:48:14.560 |
in your output, in that case, something went wrong, 00:48:17.080 |
and the LM output was not a valid sequence of tokens. 00:48:22.080 |
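A minimal sketch of the decoding just described, reusing the `merges` dictionary from the training sketch above:

```python
# token id -> bytes, for the 256 raw bytes plus every merged token
vocab = {idx: bytes([idx]) for idx in range(256)}
for (p0, p1), idx in merges.items():     # relies on the dict preserving insertion order
    vocab[idx] = vocab[p0] + vocab[p1]   # concatenation of two bytes objects

def decode(ids):
    """Given a list of token ids, return the corresponding Python string."""
    tokens = b"".join(vocab[idx] for idx in ids)
    # errors="replace" swaps invalid UTF-8 for the U+FFFD replacement character
    # instead of throwing, since not every token sequence decodes to valid UTF-8
    return tokens.decode("utf-8", errors="replace")

print(decode([128]))   # '�' rather than a UnicodeDecodeError
```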
Okay, and now we're going to go the other way. 00:48:24.320 |
So we are going to implement this arrow right here, 00:48:35.640 |
And this should basically print a list of integers 00:48:40.040 |
So again, try to maybe implement this yourself 00:48:43.040 |
if you'd like a fun exercise, and pause here. 00:48:45.840 |
Otherwise, I'm going to start putting in my solution. 00:48:50.840 |
So this is one of the ways that sort of I came up with. 00:49:15.640 |
But now of course, according to the merges dictionary above, 00:49:20.760 |
some of the bytes may be merged according to this lookup. 00:49:26.800 |
remember that the merges was built from top to bottom. 00:49:32.560 |
And so we prefer to do all these merges in the beginning 00:49:44.120 |
So we have to go in the order from top to bottom sort of, 00:50:00.280 |
that we are allowed to merge according to this. 00:50:13.040 |
will basically count up how many times every single pair 00:50:26.400 |
to the number of times that they occur, right? 00:50:33.040 |
We only care what the raw pairs are in that sequence. 00:50:39.800 |
I only care about the set of possible merge candidates 00:50:46.280 |
that we're going to be merging at this stage of the loop. 00:50:50.240 |
We want to find the pair or like a key inside stats 00:50:54.800 |
that has the lowest index in the merges dictionary, 00:51:04.120 |
So again, there are many different ways to implement this, 00:51:06.240 |
but I'm going to do something a little bit fancy here. 00:51:12.320 |
So I'm going to be using the min over an iterator. 00:51:23.480 |
So we're looking at all the pairs inside stats, 00:51:30.440 |
And we're going to be taking the consecutive pair 00:51:36.200 |
The min takes a key, which gives us the function 00:51:44.120 |
And the one we care about is we care about taking merges 00:52:02.720 |
And we want to get the pair with the min number. 00:52:05.800 |
So as an example, if there's a pair 101 and 32, 00:52:15.920 |
And the reason that I'm putting a float inf here 00:52:22.200 |
when we call, when we basically consider a pair 00:52:28.400 |
then that pair is not eligible to be merged, right? 00:52:32.720 |
there's some pair that is not a merging pair, 00:52:38.640 |
and it doesn't have an index and it cannot be merged, 00:52:48.760 |
in the list of candidates when we do the min. 00:52:56.440 |
this returns the most eligible merging candidate pair 00:53:04.560 |
is this function here might fail in the following way. 00:53:11.280 |
then there's nothing in merges that is satisfied anymore. 00:53:23.120 |
I think will just become the very first element of stats. 00:53:26.640 |
But this pair is not actually a mergeable pair. 00:53:29.320 |
It just becomes the first pair inside stats arbitrarily 00:53:33.040 |
because all of these pairs evaluate to float inf 00:53:38.240 |
So basically it could be that this doesn't succeed 00:53:42.080 |
So if this pair is not in merges that was returned, 00:53:56.560 |
You may come up with a different implementation, by the way. 00:54:00.760 |
This is kind of like really trying hard in Python. 00:54:09.400 |
Now, if we did find a pair that is inside merges 00:54:17.760 |
So we're going to look into the mergers dictionary 00:54:25.320 |
And we're going to now merge into that index. 00:54:30.320 |
and we're going to replace the original tokens. 00:54:36.680 |
and we're going to be replacing it with index IDX. 00:54:41.640 |
where every occurrence of pair is replaced with IDX. 00:55:02.960 |
So for example, 32 is a space in ASCII, so that's here. 00:55:10.720 |
Okay, so let's wrap up this section of the video at least. 00:55:23.280 |
And the issue is that if we only have a single character 00:55:31.080 |
So one way to fight this is if len of tokens is at least two 00:55:46.440 |
And then second, I have a few test cases here 00:55:50.040 |
So first let's make sure about, or let's note the following. 00:55:58.800 |
you'd expect to get the same string back, right? 00:56:07.560 |
And I think in general, this is probably the case, 00:56:14.440 |
you're not going to have an identity going backwards 00:56:18.280 |
not all token sequences are valid UTF-8 sort of byte streams. 00:56:24.760 |
And so therefore, some of them can't even be decodable. 00:56:31.480 |
but for that one direction, we can check here. 00:56:35.680 |
which is the text that we trained the tokenizer on, 00:56:37.760 |
we can make sure that when we encode and decode, 00:56:43.600 |
So I went to, I think this webpage and I grabbed some text. 00:56:47.120 |
So this is text that the tokenizer has not seen, 00:56:56.040 |
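A minimal sketch of the encoding function described in this section, including the length guard and the round-trip check (it reuses `get_stats`, `merge`, `merges`, and `decode` from the sketches above):

```python
def encode(text):
    """Given a string, return the list of token ids."""
    tokens = list(text.encode("utf-8"))
    while len(tokens) >= 2:          # nothing to merge for an empty or single-byte input
        stats = get_stats(tokens)
        # among the pairs present, prefer the one that was merged earliest during training
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break                    # no remaining pair is mergeable
        tokens = merge(tokens, pair, merges[pair])
    return tokens

text = "hello world"
assert decode(encode(text)) == text  # encode -> decode round-trips on valid text
```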
So those are the basics of the byte-pair encoding algorithm. 00:57:07.960 |
And that basically creates the little binary forest 00:57:19.040 |
So that's the simplest setting of the tokenizer. 00:57:23.480 |
is we're going to look at some of the state-of-the-art 00:57:25.360 |
large language models and the kinds of tokenizers 00:57:37.280 |
So let's kick things off by looking at the GPT series. 00:57:40.040 |
So in particular, I have the GPT-2 paper here, 00:57:42.460 |
and this paper is from 2019 or so, so five years ago. 00:57:48.000 |
And let's scroll down to input representation. 00:57:56.520 |
so I encourage you to pause and read this yourself, 00:58:03.960 |
on the byte-level representation of UTF-8 encoding. 00:58:10.040 |
and they talk about the vocabulary sizes and everything. 00:58:13.120 |
Now, everything here is exactly as we've covered it so far, 00:58:18.700 |
So what they mention is that they don't just apply 00:58:23.520 |
And in particular, here's a motivating example. 00:58:32.620 |
and it occurs right next to all kinds of punctuation, 00:58:35.300 |
as an example, so dog dot, dog exclamation mark, 00:58:40.800 |
And naively, you might imagine that the BPE algorithm 00:58:47.280 |
that are just like dog with a slightly different punctuation. 00:58:50.080 |
And so it feels like you're clustering things 00:58:52.520 |
You're combining kind of semantics with punctuation. 00:58:58.840 |
And indeed, they also say that this is suboptimal 00:59:03.440 |
So what they want to do is they want to top down 00:59:05.380 |
in a manual way, enforce that some types of characters 00:59:21.480 |
and what kinds of mergers they actually do perform. 00:59:29.520 |
And when we go to source, there is an encoder.py. 00:59:34.000 |
Now, I don't personally love that they called encoder.py 00:59:38.120 |
and the tokenizer can do both encode and decode. 00:59:41.080 |
So it feels kind of awkward to me that it's called encoder, 00:59:46.680 |
and we're gonna step through it in detail at one point. 00:59:49.280 |
For now, I just want to focus on this part here. 00:59:58.680 |
But this is the core part that allows them to enforce rules 01:00:02.660 |
for what parts of the text will never be merged for sure. 01:00:06.040 |
Now, notice that re.compile here is a little bit misleading 01:00:15.640 |
And regex is a Python package that you can install, 01:00:23.360 |
So let's take a look at this pattern and what it's doing 01:00:29.700 |
and why this is actually doing the separation 01:00:40.700 |
So in the exact same way that their code does, 01:00:43.400 |
we're going to call re.findall for this pattern 01:00:47.000 |
on any arbitrary string that we are interested in. 01:00:49.440 |
So this is the string that we want to encode into tokens 01:01:03.120 |
The way this works is that you are going from left to right 01:01:07.760 |
in the string, and you're trying to match the pattern. 01:01:20.500 |
first of all, notice that this is a raw string, 01:01:31.060 |
And notice that it's made up of a lot of ORs. 01:01:38.480 |
And so you go from left to right in this pattern 01:01:41.460 |
and try to match it against the string wherever you are. 01:01:44.540 |
So we have hello, and we're gonna try to match it. 01:01:52.520 |
but it is an optional space followed by \p{L}, 01:02:01.940 |
which, according to some documentation that I found, matches any kind of letter from any language. 01:02:13.660 |
And hello is made up of letters, H-E-L-L-O, et cetera. 01:02:18.340 |
So optional space followed by a bunch of letters, 01:02:21.520 |
one or more letters, is going to match hello, 01:02:28.740 |
So from there on begins a new sort of attempt 01:02:44.980 |
followed by a bunch of letters, one or more of them. 01:02:48.620 |
So when we run this, we get a list of two elements, 01:03:01.180 |
Now, what is this doing and why is this important? 01:03:05.460 |
and instead of directly encoding it for tokenization, 01:03:17.740 |
is that it first splits your text into a list of texts, 01:03:26.420 |
are processed independently by the tokenizer. 01:03:48.060 |
And then that token sequence is going to be concatenated. 01:04:00.740 |
within every one of these elements individually. 01:04:03.200 |
And after you've done all the possible merging 01:04:09.460 |
the results of all that will be joined by concatenation. 01:04:25.860 |
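A small sketch of this chunking step, using the GPT-2 pattern from encoder.py with the `regex` package (the example strings are just for illustration):

```python
import regex as re   # pip install regex; the third-party regex module, not the stdlib re

# the GPT-2 split pattern from OpenAI's encoder.py
gpt2pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

print(gpt2pat.findall("Hello world how are you"))
# ['Hello', ' world', ' how', ' are', ' you']
print(gpt2pat.findall("Hello world123 let's count!!!"))
# ['Hello', ' world', '123', ' let', "'s", ' count', '!!!']
```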
And so you are saying we are never gonna merge E space 01:04:33.580 |
So basically using this regex pattern to chunk up the text 01:04:59.740 |
\p{N} is any kind of numeric character 01:05:05.940 |
So we have an optional space followed by numbers 01:05:11.540 |
So if I do, hello world, one, two, three, how are you? 01:05:36.660 |
then apostrophe here is not a letter or a number. 01:05:42.420 |
and then we will exactly match this with that. 01:05:52.740 |
like very common apostrophes that are used typically. 01:06:08.140 |
if you have house, then this will be separated out 01:06:13.140 |
But if you use the Unicode apostrophe like this, 01:06:28.260 |
and otherwise they become completely separate tokens. 01:06:33.220 |
In addition to this, you can go to the GPT-2 docs 01:06:47.180 |
you see how this is apostrophe and then lowercase letters? 01:06:53.860 |
then these rules will not separate out the apostrophes 01:07:09.020 |
then notice suddenly the apostrophe comes by itself. 01:07:17.560 |
inconsistently separating out these apostrophes. 01:07:19.940 |
So it feels extremely gnarly and slightly gross, 01:07:27.320 |
After trying to match a bunch of apostrophe expressions, 01:07:33.340 |
So I don't know that all the languages, for example, 01:07:37.180 |
but that would be inconsistently tokenized as a result. 01:07:44.340 |
And then if that doesn't work, we fall back to here. 01:07:56.300 |
is this is trying to match punctuation, roughly speaking, 01:08:05.600 |
then these parts here are not letters or numbers, 01:08:18.420 |
And finally, this is also a little bit confusing. 01:08:24.140 |
but this is using a negative lookahead assertion in regex. 01:08:29.060 |
So what this is doing is it's matching whitespace up to, 01:08:32.140 |
but not including the last whitespace character. 01:08:39.440 |
So you see how the whitespace is always included 01:08:49.180 |
What's gonna happen here is that these spaces up to 01:08:53.740 |
and not including the last character will get caught by this. 01:08:58.020 |
And what that will do is it will separate out the spaces 01:09:08.620 |
And the reason that's nice is because space U 01:09:16.500 |
And if I add spaces, we still have a space U, 01:09:23.100 |
So basically the GPT-2 tokenizer really likes 01:09:30.320 |
And this is just something that it is consistent about. 01:09:46.100 |
then this thing will catch any trailing spaces and so on. 01:09:50.100 |
I wanted to show one more real world example here. 01:09:52.740 |
So if we have this string, which is a piece of Python code, 01:09:59.500 |
So you'll notice that the list has many elements here, 01:10:01.580 |
and that's because we are splitting up fairly often, 01:10:07.440 |
So there will never be any mergers within these elements. 01:10:14.780 |
Now, you might think that in order to train the tokenizer, 01:10:19.020 |
OpenAI has used this to split up text into chunks, 01:10:23.220 |
and then run just a BP algorithm within all the chunks. 01:10:36.440 |
But these spaces never actually end up being merged 01:10:46.900 |
you see that all the spaces are kept independent, 01:10:50.840 |
So I think OpenAI at some point enforced some rule 01:11:05.400 |
Now, the training code for the GPT-2 tokenizer 01:11:08.540 |
So all we have is the code that I've already shown you. 01:11:18.620 |
You can't give it a piece of text and train tokenizer. 01:11:29.260 |
And so we don't know exactly how OpenAI trained 01:11:50.180 |
and then you can do the tokenization inference. 01:11:55.740 |
This is only inference code for tokenization. 01:11:57.980 |
I wanted to show you how you would use it, quite simple. 01:12:02.180 |
And running this just gives us the GPT-2 tokens 01:12:09.420 |
And so in particular, we see that the whitespace in GPT-2, 01:12:14.500 |
these whitespaces merge, as we also saw in this one, 01:12:20.140 |
but if we go down to GPT-4, they become merged. 01:12:39.500 |
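A minimal usage sketch with the tiktoken library, along the lines of what is being shown here:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")         # GPT-2: runs of spaces stay as separate tokens
print(enc.encode("    hello world!!!"))

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4: runs of whitespace get merged
print(enc.encode("    hello world!!!"))
print(enc.decode(enc.encode("    hello world!!!")))  # round-trips back to the original string
```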
and then you go to this file, tiktoken_ext/openai_public.py, 01:12:45.360 |
of all these different tokenizers that OpenAI maintains is. 01:12:51.120 |
they had to publish some of the details about the strings. 01:12:54.020 |
So this is the string that we already saw for GPT-2. 01:12:57.080 |
It is slightly different, but it is actually equivalent 01:13:05.600 |
and this one just executes a little bit faster. 01:13:26.000 |
in addition to a bunch of other special tokens, 01:13:29.400 |
Now, I'm not going to actually go into the full detail 01:13:35.960 |
I would just advise that you pull out ChatGPT 01:13:38.680 |
and the regex documentation and just step through it. 01:14:01.640 |
these apostrophe S, apostrophe D, apostrophe M, et cetera. 01:14:06.640 |
We're gonna be matching them both in lowercase 01:14:11.080 |
There's a bunch of different handling of the whitespace 01:14:13.400 |
that I'm not going to go into the full details of. 01:14:15.800 |
And then one more thing here is you will notice 01:14:29.840 |
only up to three digits of numbers will ever be merged. 01:14:36.360 |
to prevent tokens that are very, very long number sequences. 01:14:43.000 |
any of this stuff because none of this is documented 01:14:51.440 |
But those are some of the changes that GPT-4 has made. 01:14:59.480 |
The next thing I would like to do very briefly 01:15:05.360 |
This is the file that I already mentioned to you briefly. 01:15:15.280 |
Starting at the bottom here, they are loading two files, 01:15:28.640 |
Now, if you'd like to inspect these two files, 01:15:31.160 |
which together constitute their saved tokenizer, 01:15:34.480 |
then you can do that with a piece of code like this. 01:15:37.080 |
This is where you can download these two files 01:15:48.680 |
So remember here where we have this vocab object, 01:16:22.880 |
the two variables that for us are also critical, 01:16:36.160 |
Now, the only thing that is actually slightly confusing 01:16:43.840 |
is that in addition to this encoder and the decoder, 01:16:46.560 |
they also have something called a byte encoder 01:16:51.920 |
just kind of a spurious implementation detail. 01:16:55.960 |
It isn't actually deep or interesting in any way, 01:17:07.200 |
but they have a whole separate layer here in addition 01:17:12.000 |
And so you first do byte encode and then encode, 01:17:19.920 |
and they are just stacked serial on top of each other. 01:17:24.560 |
so I won't cover it, and you can step through it 01:17:28.520 |
if you ignore the byte encoder and the byte decoder, 01:17:30.800 |
will be algorithmically very familiar with you. 01:17:33.120 |
And the meat of it here is what they call BPE function. 01:17:42.040 |
where they're trying to identify the bigram, a pair, 01:17:49.680 |
they have a for loop trying to merge this pair. 01:17:54.120 |
and they will merge the pair whenever they find it. 01:17:58.440 |
until they run out of possible merges in the text. 01:18:08.840 |
what I want you to take away at this point is that, 01:18:11.040 |
unfortunately, it's a little bit of a messy code 01:18:14.280 |
it is identical to what we've built up above. 01:18:17.080 |
And what we've built up above, if you understand it, 01:18:30.360 |
So in addition to tokens that are coming from raw bytes 01:18:50.920 |
we mentioned this is very similar to our vocab. 01:18:53.280 |
You'll notice that the length of this is 50,257. 01:19:01.920 |
and it's inverted from the mapping of our vocab. 01:19:07.000 |
and they go the other way around for no amazing reason. 01:19:20.440 |
As I mentioned, there are 256 raw byte tokens. 01:19:58.520 |
and we tokenize them and get a stream of tokens. 01:20:10.560 |
And we insert that token in between documents. 01:20:14.400 |
And we are using this as a signal to the language model 01:20:24.320 |
That said, the language model has to learn this from data. 01:20:26.920 |
It needs to learn that this token usually means 01:20:29.720 |
that it should wipe its sort of memory of what came before. 01:20:34.520 |
is not actually informative to what comes next. 01:20:39.680 |
but we're giving it the special sort of delimiter 01:20:48.800 |
our code that we've been playing with before. 01:20:55.880 |
But now you can see what happens if I put end of text. 01:21:16.920 |
this didn't actually go through the BPE merges. 01:21:19.960 |
Instead, the code that actually outputs the tokens 01:21:24.200 |
has special case instructions for handling special tokens. 01:21:31.520 |
for handling special tokens in the encoder.py. 01:21:40.520 |
you will find all kinds of special case handling 01:21:42.680 |
for these special tokens that you can register, 01:21:49.000 |
and whenever it sees these special tokens like this, 01:21:52.720 |
it will actually come in and swap in that special token. 01:21:55.920 |
So these things are outside of the typical algorithm 01:22:00.280 |
So these special tokens are used pervasively, 01:22:06.840 |
of predicting the next token in the sequence, 01:22:11.840 |
and all of the chat GPT sort of aspects of it. 01:22:15.440 |
Because we don't just want to delimit documents, 01:22:33.240 |
So for example, using the GPT 3.5 turbo scheme, 01:22:43.080 |
This is short for imaginary monologue start, by the way. 01:22:46.880 |
But you can see here that there's a sort of start 01:22:53.520 |
lots of tokens in use to delimit these conversations 01:22:58.280 |
and kind of keep track of the flow of the messages here. 01:23:01.880 |
Now we can go back to the tiktoken library. 01:23:07.320 |
they talk about how you can extend tiktoken. 01:23:11.000 |
you can fork the cl100k_base tokenizer in GPT-4. 01:23:25.480 |
And the tiktoken library will correctly swap them out 01:23:37.160 |
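A sketch of that kind of extension, adapted from the example in the tiktoken documentation (the extra token names and ids here are just the ChatML-style examples it uses):

```python
import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")

enc = tiktoken.Encoding(
    name="cl100k_im",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
    },
)
print(enc.encode("<|im_start|>user hello<|im_end|>", allowed_special="all"))
```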
And I mentioned that the GPT-2 in tiktoken, 01:23:56.400 |
you see that the pattern has changed as we've discussed, 01:23:58.600 |
but also the special tokens have changed in this tokenizer. 01:24:01.240 |
So we, of course, have the end of text, just like in GPT 2, 01:24:05.040 |
but we also see three, sorry, four additional tokens here. 01:24:14.560 |
And if you'd like to learn more about this idea, 01:24:18.360 |
And I'm not gonna go into detail in this video. 01:24:22.640 |
And then there's one additional sort of token here. 01:24:28.640 |
So it's very common, basically, to train a language model. 01:24:32.880 |
And then if you'd like, you can add special tokens. 01:24:38.400 |
you, of course, have to do some model surgery 01:24:42.560 |
and all of the parameters involved in that transformer, 01:24:48.320 |
your embedding matrix for the vocabulary tokens 01:24:55.800 |
with small random numbers or something like that, 01:25:03.360 |
you have to go to the final layer of the transformer, 01:25:05.000 |
and you have to make sure that that projection 01:25:10.920 |
So basically, there's some model surgery involved 01:25:13.000 |
that you have to couple with the tokenization changes 01:25:18.840 |
But this is a very common operation that people do, 01:25:20.920 |
especially if they'd like to fine-tune the model, 01:25:27.240 |
Okay, so at this point, you should have everything you need 01:25:33.240 |
Now, in the process of developing this lecture, 01:25:39.880 |
So minBPE looks like this right now as I'm recording, 01:25:43.400 |
but the minBPE repository will probably change quite a bit 01:25:58.120 |
this is sort of me breaking up the task ahead of you 01:26:06.640 |
And so feel free to follow these steps exactly 01:26:21.520 |
I try to keep the code fairly clean and understandable. 01:26:25.040 |
And so feel free to reference it whenever you get stuck. 01:26:29.840 |
In addition to that, basically, once you write it, 01:26:34.040 |
you should be able to reproduce this behavior 01:26:42.920 |
And then you can encode and decode the exact same string 01:26:47.560 |
you should be able to implement your own train function, 01:26:54.040 |
but you could write your own train, minBPE does it as well. 01:27:00.760 |
So here's some of the code inside minBPE, 01:27:05.200 |
shows the token vocabularies that you might obtain. 01:27:08.720 |
So on the left here, we have the GPT-4 merges. 01:27:24.000 |
was merge two spaces into a single token for two spaces. 01:27:30.840 |
And so this is the order in which things merged 01:27:34.200 |
And this is the merge order that we obtained in minBPE 01:27:45.520 |
but because that is one of the longest Wikipedia pages, 01:27:56.120 |
Yeah, so you can compare these two vocabularies. 01:28:06.120 |
And we've done the exact same thing on this token 259. 01:28:12.600 |
And that happened for us a little bit later as well. 01:28:15.160 |
So the difference here is, again, to my understanding, 01:28:19.480 |
So as an example, because I see a lot of white space, 01:28:21.880 |
I expect that GPT-4 probably had a lot of Python code 01:28:24.360 |
in its training set, I'm not sure, for the tokenizer. 01:28:27.640 |
And here we see much less of that, of course, 01:28:42.480 |
Okay, so we are now going to move on from tiktoken 01:28:44.800 |
and the way that OpenAI tokenizes its strings. 01:28:47.680 |
We're going to discuss one more very commonly used library 01:28:54.840 |
So SentencePiece is very commonly used in language models 01:29:07.760 |
but one of them is the byte-pair encoding algorithm 01:29:13.240 |
Now, SentencePiece is used both by the Llama and Mistral series 01:29:26.200 |
because this is kind of hard and subtle to explain, 01:29:51.760 |
So it looks at whatever code points are available 01:29:55.200 |
and then it starts merging those code points. 01:29:57.720 |
And the BPE is running on the level of code points. 01:30:12.080 |
then these code points will either get mapped 01:30:18.120 |
or if you have the byte fallback option turned on, 01:30:25.720 |
and then the individual bytes of that encoding 01:30:37.800 |
and then it falls back to bytes for rare code points. 01:30:44.120 |
Personally, I find the tick token way significantly cleaner, 01:31:04.400 |
I think I took like the description of sentence piece, 01:31:09.920 |
so I created a toy.txt file with this content. 01:31:14.000 |
Now, what's kind of a little bit crazy about sentence piece 01:31:16.560 |
is that there's a ton of options and configurations. 01:31:23.520 |
and it really tries to handle a large diversity of things. 01:31:34.120 |
there's like a ton of configuration arguments. 01:31:38.160 |
You can go to here to see all the training options. 01:31:47.480 |
that is used to represent the trainer spec and so on. 01:32:02.200 |
So this is just an argument that is irrelevant to us. 01:32:05.840 |
It applies to a different training algorithm. 01:32:27.520 |
you can take the tokenizer.model file that Meta released, 01:32:51.520 |
We're saying that we're gonna use the BP algorithm 01:32:59.160 |
for basically preprocessing and normalization rules, 01:33:09.600 |
I would say, before LLMs in natural language processing. 01:33:12.360 |
So in machine translation and text classification and so on, 01:33:18.960 |
and you wanna remove all double whitespace, et cetera. 01:33:22.240 |
And in language models, we prefer not to do any of it, 01:33:24.640 |
or at least that is my preference as a deep learning person. 01:33:28.400 |
You want to keep the raw data as much as possible 01:33:33.200 |
So you're basically trying to turn off a lot of this, 01:34:11.320 |
Like sentences are just like, don't touch the raw data. 01:34:23.160 |
And so I think like, it's really hard to define 01:34:31.320 |
in different languages or something like that. 01:34:40.400 |
It has a lot of treatment around rare word characters. 01:35:27.440 |
And the UNK token must exist, from my understanding. 01:35:47.440 |
And so we trained vocab size 400 on this text here. 01:35:55.000 |
the individual tokens that sentence piece will create. 01:35:58.840 |
we see that we have the UNK token with the ID zero. 01:36:07.720 |
And then we said that the pad ID is negative one. 01:36:17.840 |
So here we saw that byte fallback in llama was turned on. 01:36:23.480 |
So what follows are going to be the 256 byte tokens. 01:36:37.720 |
And these are the parent nodes in the merges. 01:36:54.560 |
So these are the individual code point tokens, if you will, 01:37:00.840 |
with which sentence piece sort of like represents 01:37:03.800 |
It starts with special tokens, then the byte tokens, 01:37:13.640 |
are the ones that it encountered in the training set. 01:37:20.720 |
the entire set of code points that occurred here. 01:37:30.840 |
So if a code point occurred only a single time 01:37:32.640 |
out of like a million sentences or something like that, 01:37:43.280 |
we can encode into IDs and we can sort of get a list. 01:37:47.520 |
And then here I am also decoding the individual tokens 01:38:05.560 |
And when we look here, a few things sort of jumped to mind. 01:38:17.080 |
So sentence piece is encountering code points 01:38:25.840 |
So suddenly these are unk tokens, unknown tokens. 01:38:35.720 |
And so it takes this, it encodes it with UTF-8, 01:38:39.640 |
and then it uses these tokens to represent those bytes. 01:39:14.520 |
is all of the byte tokens disappeared, right? 01:39:44.960 |
that this would feed into your language model. 01:40:08.600 |
The next thing I wanna show you is the following. 01:40:11.720 |
Notice here when we are decoding all the individual tokens, 01:40:32.040 |
Why do we have an extra space in the front of hello? 01:40:50.720 |
add dummy whitespace at the beginning of text 01:40:57.240 |
So what this is trying to do is the following. 01:41:10.040 |
So we have, this is 1917, but this is 14, et cetera. 01:41:14.720 |
So these are two different tokens for the language model. 01:41:17.000 |
And the language model has to learn from data 01:41:18.720 |
that they are actually kind of like a very similar concept. 01:41:21.200 |
So to the language model in the tiktoken world, 01:41:24.080 |
basically words in the beginning of sentences 01:41:30.320 |
And it has learned that they are roughly the same. 01:41:48.600 |
it will take the string and it will add a space. 01:41:52.920 |
And that's done in an effort to make this world 01:41:59.960 |
So that's one other kind of pre-processing option 01:42:02.120 |
that is turned on and Llama2 also uses this option. 01:42:06.800 |
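A rough sketch of this kind of SentencePiece configuration. The options below mirror the ones discussed in this section, but they are not claimed to be the exact Llama 2 settings:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="toy.txt",
    model_prefix="tok400",
    model_type="bpe",
    vocab_size=400,
    normalization_rule_name="identity",  # don't normalize: keep the raw text as-is
    add_dummy_prefix=True,               # prepend a dummy whitespace to the input
    byte_fallback=True,                  # fall back to byte tokens for rare code points
    character_coverage=0.99995,          # extremely rare code points get dropped from the vocab
    split_digits=True,
    unk_id=0, bos_id=1, eos_id=2, pad_id=-1,
)

sp = spm.SentencePieceProcessor(model_file="tok400.model")
ids = sp.encode("hello 안녕하세요")
print(ids)
print([sp.id_to_piece(i) for i in ids])  # note the leading '▁hello' from add_dummy_prefix
```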
And that's I think everything that I wanna say 01:42:08.360 |
from my preview of sentence piece and how it is different. 01:42:14.040 |
I just put in the raw protocol buffer representation 01:42:18.480 |
basically of the tokenizer that Llama 2 trained. 01:42:18.480 |
to look identical to that of the Meta Llama 2, 01:42:26.560 |
then you would be copy pasting these settings 01:42:34.000 |
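One way to actually read those settings out of a released tokenizer.model file is to parse the protobuf directly. As far as I can tell the generated sentencepiece_model_pb2 module ships with the sentencepiece pip package (it needs protobuf installed), but treat the import path and the file path below as assumptions.

from sentencepiece import sentencepiece_model_pb2 as model_pb

m = model_pb.ModelProto()
m.ParseFromString(open("tokenizer.model", "rb").read())   # e.g. the Llama 2 tokenizer file
print(m.trainer_spec)      # model_type, vocab_size, byte_fallback, split_digits, ...
print(m.normalizer_spec)   # add_dummy_prefix, normalization rule, ...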
And yeah, I think that's it for this section. 01:42:37.800 |
I think my summary for sentence piece from all this 01:42:44.240 |
A lot of concepts that I think are slightly confusing 01:42:52.760 |
Otherwise it is fairly commonly used in the industry 01:43:02.920 |
Things like how the UNK token must exist, and the way the byte fallbacks are done 01:43:05.880 |
and so on, I don't find particularly elegant. 01:43:10.480 |
So it took me a lot of time working with this myself 01:43:15.760 |
and try to really understand what is happening here 01:43:22.720 |
But it is a very nice repo that is available to you 01:43:25.840 |
if you'd like to train your own tokenizer right now. 01:43:31.640 |
I want to revisit this issue in a bit more detail 01:43:35.680 |
and what are some of the considerations around it. 01:43:38.160 |
So for this, I'd like to go back to the model architecture 01:43:56.560 |
At this time it was 65 or something like that, 01:44:03.160 |
You'll see that vocab size doesn't come up too much 01:44:17.320 |
where the vocab size is basically the number of rows 01:44:20.620 |
and each vocabulary element, each token has a vector 01:44:25.000 |
that we're going to train using backpropagation. 01:44:28.920 |
which is number of channels in the transformer. 01:44:33.560 |
this embedding table, as I mentioned earlier, 01:44:38.680 |
In addition to that, at the end of the transformer, 01:44:41.080 |
there's this LM head layer, which is a linear layer. 01:44:44.240 |
And you'll notice that that layer is used at the very end 01:44:47.000 |
to produce the logits, which become the probabilities 01:44:51.520 |
And so intuitively, we're trying to produce a probability 01:45:02.000 |
we need to produce more and more probabilities. 01:45:23.000 |
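In code, the two places where the vocabulary size shows up look roughly like this; GPT-2-ish sizes are picked purely for illustration and the transformer blocks in between are omitted.

import torch
import torch.nn as nn

vocab_size, n_embd = 50257, 768

token_embedding = nn.Embedding(vocab_size, n_embd)    # one learned n_embd-dim row per token
lm_head = nn.Linear(n_embd, vocab_size, bias=False)   # one output logit per token

idx = torch.randint(0, vocab_size, (1, 8))            # a batch of token ids
x = token_embedding(idx)                              # (1, 8, n_embd) vectors that feed the transformer
logits = lm_head(x)                                   # (1, 8, vocab_size) scores for the next token
probs = logits.softmax(dim=-1)
print(x.shape, logits.shape, probs.shape)

Both layers grow linearly with the vocabulary size, which is where the extra parameters and compute come from.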
So we're going to be doing a lot more computation here 01:45:35.440 |
So intuitively, if you have a very large vocabulary size, 01:45:40.800 |
then every one of these tokens is going to come up 01:45:45.080 |
because there's a lot more other tokens all over the place. 01:45:47.600 |
And so we're going to be seeing fewer and fewer examples 01:45:59.760 |
and they don't participate in the forward-backward pass. 01:46:02.460 |
In addition to that, as your vocab size grows, 01:46:04.600 |
you're going to start shrinking your sequences a lot. 01:46:09.720 |
that we're going to be attending to more and more text. 01:46:12.980 |
But also you might be worrying that too large of chunks 01:46:18.040 |
And so the model just doesn't have as much time to think 01:46:28.000 |
So basically we're squishing too much information 01:46:39.820 |
As I mentioned, this is mostly an empirical hyperparameter. 01:46:42.260 |
And it seems like in state-of-the-art architectures today, 01:46:49.340 |
And the next consideration I want to briefly talk about 01:46:51.500 |
is what if we want to take a pre-trained model 01:46:58.180 |
So for example, when you're doing fine tuning for chat GPT, 01:47:04.280 |
on top of the base model to maintain the metadata 01:47:07.280 |
and all the structure of conversation objects 01:47:13.860 |
You might also try to throw in more special tokens, 01:47:15.960 |
for example, for using the browser or any other tool. 01:47:18.920 |
And so it's very tempting to add a lot of tokens 01:47:26.960 |
All we have to do is we have to resize this embedding. 01:47:31.100 |
We would initialize these parameters from scratch, 01:47:35.500 |
And then we have to extend the weight inside this linear. 01:47:46.660 |
So both of these are just a resizing operation. 01:47:59.300 |
to introduce new tokens into the architecture. 01:48:12.760 |
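A minimal sketch of that resizing operation, assuming a plain PyTorch embedding and an untied linear head; real codebases usually wrap this in a helper, but the idea is just copying rows.

import torch
import torch.nn as nn

old_vocab, n_embd, n_new = 50257, 768, 3               # e.g. 3 new special tokens
old_emb = nn.Embedding(old_vocab, n_embd)
old_head = nn.Linear(n_embd, old_vocab, bias=False)

new_emb = nn.Embedding(old_vocab + n_new, n_embd)      # the extra rows start at random init
new_head = nn.Linear(n_embd, old_vocab + n_new, bias=False)
with torch.no_grad():
    new_emb.weight[:old_vocab] = old_emb.weight        # keep all the trained rows
    new_head.weight[:old_vocab] = old_head.weight

# Optionally freeze everything except the new rows and train only those,
# which is the mild fine-tuning variant mentioned above.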
actually there's an entire design space of applications 01:48:15.900 |
in terms of introducing new tokens into a vocabulary 01:48:18.460 |
that go way beyond just adding special tokens 01:48:22.240 |
So just to give you a sense of the design space, 01:48:24.100 |
but this could be an entire video just by itself. 01:48:26.660 |
This is a paper on learning to compress prompts 01:48:32.060 |
And the rough idea is suppose that you're using 01:48:37.320 |
Well, these long prompts just slow everything down 01:48:43.220 |
and it's just heavy to have very large prompts. 01:48:52.340 |
and imagine basically having a few new tokens, 01:48:57.700 |
and then you train the model by distillation. 01:49:11.120 |
is identical to the model that has a very long prompt 01:49:23.140 |
And so you can train this and then at test time 01:49:27.540 |
and they sort of like stand in for that very long prompt 01:49:36.100 |
in a class of parameter efficient fine tuning techniques 01:49:40.940 |
and there's no training of the model weights, 01:49:43.000 |
there's no training of LoRa or anything like that 01:49:51.700 |
but this could again be like an entire video, 01:49:56.020 |
that is potentially worth exploring in the future. 01:50:00.020 |
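To make the distillation idea slightly more concrete, here is a toy sketch, not the paper's actual method: a small stand-in language model is frozen, a handful of fresh gist token ids are appended to its embedding table, and only those embedding rows are trained so that the model's predictions after [gist tokens, user input] match its own predictions after [long prompt, user input]. Everything here (the GRU stand-in, sizes, hyperparameters) is made up for illustration.

import torch
import torch.nn.functional as F

class TinyLM(torch.nn.Module):
    # Stand-in for a frozen pretrained causal LM: maps (B, T) token ids -> (B, T, vocab) logits.
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, dim)
        self.rnn = torch.nn.GRU(dim, dim, batch_first=True)
        self.head = torch.nn.Linear(dim, vocab_size)
    def forward(self, idx):
        h, _ = self.rnn(self.emb(idx))
        return self.head(h)

vocab_size, n_gist = 1000, 4
model = TinyLM(vocab_size + n_gist)                    # extra embedding rows reserved for gist tokens
for p in model.parameters():
    p.requires_grad = False
model.emb.weight.requires_grad = True                  # only the embedding table can receive gradients

long_prompt = torch.randint(0, vocab_size, (1, 50))    # the prompt we want to compress
gist_ids = torch.arange(vocab_size, vocab_size + n_gist)[None]
user_input = torch.randint(0, vocab_size, (1, 20))

opt = torch.optim.Adam([model.emb.weight], lr=1e-2)
for step in range(200):
    with torch.no_grad():                              # teacher: the model conditioned on the long prompt
        teacher = model(torch.cat([long_prompt, user_input], dim=1))[:, -20:]
    student = model(torch.cat([gist_ids, user_input], dim=1))[:, -20:]
    loss = F.kl_div(student.log_softmax(-1), teacher.softmax(-1), reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    model.emb.weight.grad[:vocab_size] = 0             # keep every original embedding fixed
    opt.step()

At test time you would then prepend just the few gist ids instead of the long prompt.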
is that I think recently there's a lot of momentum 01:50:02.420 |
in how you actually could construct transformers 01:50:14.980 |
and potentially predict these modalities from a transformer? 01:50:23.540 |
is that you're not changing the architecture, 01:50:28.940 |
and then call it a day and pretend it's just text tokens 01:50:31.420 |
and just do everything else in an identical manner. 01:50:35.420 |
So here, for example, there was an early paper 01:50:37.020 |
that has a nice graphic for how you can take an image 01:50:45.340 |
so these would basically become the tokens of images 01:50:55.540 |
where you sort of don't require these to be discrete, 01:51:02.300 |
to go through bottlenecks like in autoencoders. 01:51:04.940 |
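The core trick, vector quantization, is easy to sketch: continuous patch features get snapped to their nearest entry in a codebook, and the entry's index becomes the image token. This is only a minimal illustration with random tensors, not the training procedure of any particular paper.

import torch

codebook = torch.randn(1024, 64)                  # 1024 possible image tokens, 64-dim each
features = torch.randn(16, 16, 64)                # e.g. a 16x16 grid of patch embeddings

flat = features.reshape(-1, 64)                   # (256, 64)
dists = torch.cdist(flat, codebook)               # distance to every codebook vector
tokens = dists.argmin(dim=1).reshape(16, 16)      # (16, 16) grid of integer image tokens

# These integers can now be fed to a transformer exactly like text token ids.
print(tokens.shape, tokens.min().item(), tokens.max().item())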
Also in this paper that came out from OpenAI Sora, 01:51:08.420 |
which I think really blew the mind of many people 01:51:12.420 |
and inspired a lot of people in terms of what's possible, 01:51:14.980 |
they have a graphic here and they talk briefly 01:51:16.620 |
about how LLMs have text tokens, Sora has visual patches. 01:51:21.340 |
So again, they came up with a way to chunk up videos 01:51:24.260 |
into basically the tokens with their own vocabularies. 01:51:27.540 |
And then you can either process discrete tokens, 01:51:33.660 |
And all of that is sort of being actively worked on, 01:51:38.320 |
designed on, and it's beyond the scope of this video, 01:51:40.140 |
but just something I wanted to mention briefly. 01:51:45.660 |
and we understand a lot more about how it works, 01:51:48.340 |
let's loop back around to the beginning of this video 01:51:54.740 |
So first of all, why can't my LLM spell words very well 01:52:00.160 |
So fundamentally, this is because, as we saw, 01:52:07.180 |
And some of these tokens are actually fairly long. 01:52:09.820 |
So as an example, I went to the GPT-4 vocabulary 01:52:15.180 |
So .DefaultCellStyle turns out to be a single individual token. 01:52:19.380 |
So that's a lot of characters for a single token. 01:52:21.860 |
So my suspicion is that there's just too much crammed 01:52:25.820 |
And my suspicion was that the model should not be very good 01:52:28.580 |
at tasks related to spelling of this single token. 01:52:39.280 |
And of course, my prompt is intentionally done that way. 01:52:44.280 |
And you see how .DefaultCellStyle will be a single token. 01:52:48.260 |
So my suspicion is that it wouldn't be very good at this. 01:52:51.940 |
It doesn't actually know how many Ls are in there. 01:52:54.140 |
It thinks there are three, and actually there are four, 01:53:01.360 |
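You can see the root cause directly with tiktoken. I'm reproducing the example string from memory here, so treat it as an assumption; any long identifier-like string that happens to be one token makes the same point.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")     # GPT-4's encoding
s = ".DefaultCellStyle"
ids = enc.encode(s)
print(len(s), "characters ->", len(ids), "token(s):", ids)
# The model only ever sees these token ids, never the individual letters,
# which is why counting the l's inside such a chunk is genuinely hard for it.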
Let's look at another kind of character-level task. 01:53:13.260 |
I stopped it, and I said, just do it, just try it. 01:53:18.820 |
So it doesn't actually really know how to reverse 01:53:26.600 |
So again, like working with this working hypothesis 01:53:32.180 |
I said, okay, let's reverse the exact same string, 01:53:37.260 |
Step one, just print out every single character 01:53:43.260 |
And it again tried to use a tool, but when I stopped it, 01:53:57.500 |
listing it out in order, it can do that somehow. 01:54:00.100 |
And then it can, once it's broken up this way, 01:54:03.260 |
this becomes all these individual characters. 01:54:21.780 |
but basically it's not only that the language model 01:54:29.660 |
but also the tokenizer is not sufficiently trained 01:54:36.060 |
And so here, for example, hello, how are you is five tokens, 01:54:45.020 |
And so for example, annyeong haseyo is just hello, 01:54:53.900 |
That's just a typical greeting of like hello. 01:54:59.820 |
And so basically everything is a lot more bloated 01:55:03.980 |
that the model works worse on other languages. 01:55:07.000 |
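A quick way to see the bloat, with the caveat that the Korean sentence here is my own rough equivalent and the exact counts depend on the encoding:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
english = "Hello, how are you?"
korean = "안녕하세요, 어떻게 지내세요?"    # rough Korean equivalent (my own translation)
print("English:", len(enc.encode(english)), "tokens")
print("Korean: ", len(enc.encode(korean)), "tokens")
# Same meaning, noticeably more tokens: the Korean text gets carved into many
# smaller pieces because the BPE merges saw far less Korean during training.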
Coming back, why is LLM bad at simple arithmetic? 01:55:11.200 |
That has to do with the tokenization of numbers. 01:55:20.860 |
there's an algorithm that is like character level 01:55:25.760 |
So for example, here, we would first add the ones 01:55:29.840 |
You have to refer to specific parts of these digits. 01:55:35.940 |
completely arbitrarily based on whatever happened 01:55:37.780 |
to merge or not merge during the tokenization process. 01:55:45.500 |
And this person basically systematically explores 01:55:47.860 |
the tokenization of numbers in, I believe this is GPT-2. 01:55:53.320 |
for four-digit numbers, you can take a look at 01:56:00.280 |
or whether it is two tokens that is a one, three, 01:56:07.760 |
And you can imagine this is all completely arbitrarily so. 01:56:12.880 |
a token for all four digits, sometimes for three, 01:56:22.060 |
And so this is definitely a headwind, if you will, 01:56:26.020 |
And it's kind of incredible that it can kind of do it 01:56:27.900 |
and deal with it, but it's also kind of not ideal. 01:56:31.020 |
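You can reproduce the arbitrariness directly, here with GPT-2's encoding; the specific groupings will differ by tokenizer.

import tiktoken

enc = tiktoken.get_encoding("gpt2")
for n in ["1234", "2048", "9999", "8041", "1000"]:
    ids = enc.encode(n)
    print(n, "->", [enc.decode([i]) for i in ids])
# Some four-digit numbers survive as one chunk, others split 1+3, 2+2, 3+1, etc.,
# purely depending on which digit sequences happened to get merged during training.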
And so that's why, for example, we saw that Meta, 01:56:48.020 |
And finally, why is GPT-2 not as good in Python? 01:56:59.240 |
Because as we saw here with a simple Python example, 01:57:07.180 |
And every single space is an individual token. 01:57:09.360 |
And this dramatically reduces the context length 01:57:13.320 |
So that's almost like a tokenization bug for GPT-2. 01:57:21.240 |
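The difference is easy to measure with a small comparison of the two encodings on an indented snippet; the exact counts depend on the code you feed in.

import tiktoken

code = "def f():\n    if True:\n        return 1\n"
gpt2 = tiktoken.get_encoding("gpt2")           # GPT-2's tokenizer
gpt4 = tiktoken.get_encoding("cl100k_base")    # GPT-4's tokenizer
print("GPT-2:", len(gpt2.encode(code)), "tokens")   # indentation spaces tend to become individual tokens
print("GPT-4:", len(gpt4.encode(code)), "tokens")   # runs of spaces get merged into single tokens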
My LLM abruptly halts when it sees the string end of text. 01:57:28.200 |
Print a string end of text, is what I told GPT-4. 01:57:31.440 |
And it says, could you please specify the string? 01:57:40.280 |
And then I give it end of text as the string, 01:57:47.200 |
with respect to the handling of the special token. 01:57:49.520 |
And I didn't actually know what OpenAI is doing 01:57:53.000 |
and whether they are potentially parsing this 01:58:23.660 |
So you would hope that they don't really parse 01:58:25.840 |
or use special tokens from that kind of input. 01:58:32.640 |
And so your knowledge of these special tokens 01:58:35.840 |
ends up being an attack surface, potentially. 01:58:41.440 |
then just try to give them some special tokens 01:58:44.120 |
and see if you're breaking something by chance. 01:59:12.600 |
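tiktoken itself makes this distinction explicit: special-token text found in the input is rejected by default, and you have to opt in to having it parsed as a special token. A small check, reflecting my understanding of the encode flags:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
s = "<|endoftext|>"

# Opt in: the whole string collapses to the single special token id.
print(enc.encode(s, allowed_special={"<|endoftext|>"}))

# Treat it as ordinary user text: it gets split into normal tokens instead.
print(enc.encode(s, disallowed_special=()))

# The default (allowed_special=set(), disallowed_special="all") raises an error,
# precisely so that user-supplied text can't silently smuggle in special tokens.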
And so we can submit and get a bunch of tokens. 01:59:22.420 |
I do, here's a tagline for Ice Cream Shop space. 01:59:25.880 |
So I have a space here before I click submit. 01:59:50.160 |
Suppose you found the completion in the training document, 02:00:31.120 |
But instead, if I have this and I add my space, 02:00:34.140 |
then what I'm doing here when I encode this string 02:00:36.740 |
is I have basically, here's a tagline for an ice cream shop, 02:00:40.940 |
and this space at the very end becomes a token 220. 02:00:47.780 |
and this token otherwise would be part of the tagline. 02:00:55.260 |
And so this is suddenly out of distribution for the model, 02:00:58.820 |
because this space is part of the next token, 02:01:03.900 |
and the model has seen very, very little data 02:01:09.740 |
And we're asking it to complete the sequence, 02:01:12.820 |
But the problem is that we've sort of begun the first token, 02:01:38.300 |
These are the atoms of what the LLM is seeing, 02:01:40.860 |
and there's a bunch of weird stuff that comes out of it. 02:01:46.800 |
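Here is the trailing-space effect in isolation with the GPT-2 encoding, which is where the token id 220 for a lone space comes from:

import tiktoken

enc = tiktoken.get_encoding("gpt2")
a = "Here's a tagline for an ice cream shop:"
b = "Here's a tagline for an ice cream shop: "   # same prompt plus a trailing space

print(enc.encode(a)[-4:])
print(enc.encode(b)[-4:])
# The trailing " " becomes its own token (220), splitting off a space that in
# training data is almost always glued onto the front of the *next* word, so the
# model is asked to continue from a rarely seen partial-token boundary.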
I bet you that the model has never in its training set 02:02:04.180 |
But I bet you that it's never seen this combination 02:02:09.740 |
because, or I think it would be extremely rare. 02:02:20.000 |
And it said, "The model predicted a completion 02:02:23.820 |
"Consider adjusting your prompt or stop sequences." 02:02:34.320 |
It basically predicted the stop sequence immediately. 02:02:38.660 |
And so this is why I'm getting a warning again, 02:02:52.100 |
It's shocked, and it's predicting end of text or something. 02:02:55.540 |
I tried it again here, and in this case, it completed it. 02:03:09.360 |
because the model is extremely unhappy with just this, 02:03:13.420 |
because it's never occurred in a training set. 02:03:15.420 |
In a training set, it always appears like this 02:03:22.860 |
either you sort of complete the first character 02:03:28.480 |
that you then have just some of the characters off, 02:03:31.320 |
all of these are kind of like issues with partial tokens, 02:03:36.800 |
And if you actually dig into the tiktoken repository, 02:03:43.400 |
and you'll see encode_unstable_native, unstable tokens, 02:03:54.960 |
but there's a ton of code dealing with unstable tokens, 02:04:10.580 |
we're not actually trying to append the next token 02:04:24.320 |
that if we re-tokenized would be of high probability, 02:04:29.220 |
So that we can actually add a single individual character 02:04:33.520 |
instead of just like adding the next full token 02:04:40.560 |
and I invite you to maybe like look through this. 02:04:42.840 |
It ends up being extremely gnarly and hairy kind of topic, 02:04:45.640 |
and it comes from tokenization fundamentally. 02:04:50.660 |
talking about unstable tokens sometime in the future. 02:04:53.120 |
Okay, and I'm really saving the best for last. 02:04:55.320 |
My favorite one by far is SolidGoldMagikarp. 02:05:00.280 |
It was just, okay, so this comes from this blog post, 02:05:04.640 |
And this is internet famous now for those of us in LLMs. 02:05:15.080 |
is this person went to the token embedding table 02:05:25.680 |
And this person noticed that there's a cluster of tokens 02:05:30.420 |
So there's a cluster here: attRot, EStreamFrame, 02:05:50.900 |
And then they noticed that actually the plot thickens here 02:05:53.340 |
because if you ask the model about these tokens, 02:06:07.500 |
So either you get evasion, so I'm sorry I can't hear you, 02:06:10.820 |
or you get a bunch of hallucinations as a response. 02:06:22.300 |
Or it kind of comes up with like weird humor. 02:06:33.600 |
And there's a variety of here documented behaviors. 02:06:36.980 |
There's a bunch of tokens, not just SolidGoldMagikarp 02:06:41.380 |
And so basically there's a bunch of like trigger words. 02:06:43.940 |
And if you ask the model about these trigger words, 02:06:49.300 |
and has all kinds of really strange behaviors, 02:06:54.220 |
typical safety guidelines and the alignment of the model, 02:07:06.140 |
So what's happening here is that SolidGoldMagikarp, 02:07:08.500 |
if you actually dig into it, is a Reddit user. 02:07:16.840 |
even though I don't know that this has been like 02:07:22.860 |
is that the tokenization dataset was very different 02:07:26.180 |
from the training dataset for the actual language model. 02:07:33.380 |
where u/SolidGoldMagikarp was mentioned in the text. 02:07:33.380 |
this would be a string that occurs many times 02:07:46.940 |
Because it occurs many times in the tokenization dataset, 02:07:53.100 |
for that single Reddit user, SolidGoldMagikarp. 02:07:53.100 |
in a vocabulary of, was it 50,000 tokens in GPT-2, 02:08:04.020 |
And then what happens is the tokenization dataset 02:08:06.700 |
has those strings, but then later when you train the model, 02:08:18.760 |
for the language model, SolidGoldMagikarp never occurs. 02:08:18.760 |
It's initialized at random in the beginning of optimization. 02:08:35.300 |
And this token is just never updated in the embedding table. 02:08:41.060 |
So it never gets trained and it's completely untrained. 02:08:50.740 |
And then at test time, if you evoke this token, 02:08:55.320 |
of the embedding table that is completely untrained 02:09:05.700 |
And so any of these kind of like weird tokens 02:09:08.620 |
because fundamentally the model is out of sample, 02:09:18.980 |
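The mechanism is easy to demonstrate in miniature: an embedding row for a token that never shows up in the training batches simply never receives a gradient.

import torch

emb = torch.nn.Embedding(10, 4)                       # tiny vocab of 10 tokens
batch = torch.tensor([[0, 1, 2, 3], [4, 5, 6, 8]])    # token 7 (our stand-in for SolidGoldMagikarp) never appears
loss = emb(batch).sum()                               # any loss that only touches the used rows
loss.backward()
print(emb.weight.grad[7])                             # all zeros: row 7 never gets updated,
                                                      # so it just stays at its random initialization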
although I think a lot of people are quite aware of this, 02:09:22.900 |
and different representations and different languages 02:09:27.500 |
with GPT tokenizers or any tokenizers for any other, 02:09:32.060 |
So for example, JSON is actually really dense in tokens 02:09:39.040 |
So for example, these are the same in JSON and in YAML. 02:09:49.400 |
And so in the token economy where we are paying per token 02:09:53.760 |
in many ways, and you are paying in the context length 02:09:58.560 |
for the cost of processing all this kind of structured data 02:10:05.340 |
And in general, kind of like the tokenization density, 02:10:07.640 |
something that you have to sort of care about 02:10:30.660 |
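A quick way to compare, using an arbitrary made-up record; exact counts will vary, but YAML typically comes out leaner because it drops the braces and quotes.

import json
import tiktoken
import yaml   # pip install pyyaml

enc = tiktoken.get_encoding("cl100k_base")
data = {"name": "ice cream shop", "flavors": ["vanilla", "chocolate"], "open": True}

as_json = json.dumps(data, indent=2)
as_yaml = yaml.dump(data)
print("JSON:", len(enc.encode(as_json)), "tokens")
print("YAML:", len(enc.encode(as_yaml)), "tokens")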
What I do have to say at this point is don't brush it off. 02:10:33.880 |
There's a lot of foot guns, sharp edges here, 02:10:46.020 |
That said, I will say that eternal glory goes to anyone 02:10:51.160 |
I showed you one possible paper that tried to do that. 02:10:54.440 |
And I think, I hope a lot more can follow over time. 02:10:58.080 |
And my final recommendations for the application right now 02:11:00.460 |
are if you can reuse the GPT-4 tokens and vocabulary 02:11:06.200 |
and just use tiktoken because it is very efficient 02:11:16.540 |
If you for some reason want to train your own vocabulary 02:11:20.120 |
from scratch, then I would use the BPE with sentence piece. 02:11:25.120 |
Oops, as I mentioned, I'm not a huge fan of sentence piece. 02:11:36.320 |
I think it's, it also has like a million settings 02:11:40.000 |
and I think it's really easy to miscalibrate them 02:11:43.680 |
or something like that because of some hyperparameter 02:11:50.400 |
try to copy paste exactly maybe what Meta did 02:12:00.440 |
But even if you have all the settings correct, 02:12:03.160 |
I still think that the algorithm is kind of inferior 02:12:12.240 |
for minbpe to become as efficient as possible. 02:12:12.240 |
And that's something that maybe I hope to work on. 02:12:19.360 |
And at some point, maybe we can be training basically, 02:12:22.520 |
really what we want is we want tiktoken, but with training code. 02:12:22.520 |
And that is the ideal thing that currently does not exist. 02:12:34.960 |
So that's currently what I have to say for tokenization. 02:12:38.040 |
There might be an advanced video that has even drier 02:12:42.880 |
But for now, I think we're gonna leave things off here 02:12:48.880 |
And they increased this context size from GPT-1 of 512 02:13:04.600 |
Okay, next, I would like us to briefly walk through 02:13:08.640 |
the code for OpenAI's GPT-2 encoder.py. 02:13:21.240 |
this is a spurious layer that I will explain in a bit.