
Let's build the GPT Tokenizer


Chapters

0:00 intro: Tokenization, GPT-2 paper, tokenization-related issues
5:50 tokenization by example in a Web UI (tiktokenizer)
14:56 strings in Python, Unicode code points
18:15 Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32
22:47 daydreaming: deleting tokenization
23:50 Byte Pair Encoding (BPE) algorithm walkthrough
27:02 starting the implementation
28:35 counting consecutive pairs, finding most common pair
30:36 merging the most common pair
34:58 training the tokenizer: adding the while loop, compression ratio
39:20 tokenizer/LLM diagram: it is a completely separate stage
42:47 decoding tokens to strings
48:21 encoding strings to tokens
57:36 regex patterns to force splits across categories
71:38 tiktoken library intro, differences between GPT-2/GPT-4 regex
74:59 GPT-2 encoder.py released by OpenAI walkthrough
78:26 special tokens, tiktoken handling of, GPT-2/GPT-4 differences
85:28 minbpe exercise time! write your own GPT-4 tokenizer
88:42 sentencepiece library intro, used to train Llama 2 vocabulary
103:27 how to set vocabulary set? revisiting gpt.py transformer
108:11 training new tokens, example of prompt compression
109:58 multimodal [image, video, audio] tokenization with vector quantization
111:41 revisiting and explaining the quirks of LLM tokenization
130:20 final recommendations

Whisper Transcript

00:00:00.000 | Hi everyone.
00:00:01.080 | So in this video, I'd like us to cover
00:00:02.560 | the process of tokenization in large language models.
00:00:05.500 | Now, you see here that I have a sad face,
00:00:07.820 | and that's because tokenization is my least favorite part
00:00:11.080 | of working with large language models,
00:00:12.640 | but unfortunately it is necessary to understand
00:00:14.480 | in some detail because it is fairly hairy, gnarly,
00:00:17.580 | and there's a lot of hidden foot guns to be aware of,
00:00:20.400 | and a lot of oddness with large language models
00:00:22.460 | typically traces back to tokenization.
00:00:25.640 | So what is tokenization?
00:00:27.580 | Now, in my previous video,
00:00:29.000 | Let's Build GPT from Scratch,
00:00:31.480 | we actually already did tokenization,
00:00:33.260 | but we did a very naive, simple version of tokenization.
00:00:36.920 | So when you go to the Google Colab for that video,
00:00:40.220 | you see here that we loaded our training set,
00:00:43.160 | and our training set was this Shakespeare data set.
00:00:46.360 | Now, in the beginning, the Shakespeare data set
00:00:49.000 | is just a large string in Python.
00:00:51.000 | It's just text.
00:00:52.280 | And so the question is,
00:00:53.120 | how do we plug text into large language models?
00:00:56.680 | And in this case here,
00:00:59.080 | we created a vocabulary of 65 possible characters
00:01:02.700 | that we saw occur in this string.
00:01:05.120 | These were the possible characters,
00:01:06.680 | and we saw that there are 65 of them.
00:01:08.840 | And then we created a lookup table
00:01:11.320 | for converting from every possible character,
00:01:14.000 | a little string piece, into a token, an integer.
00:01:17.860 | So here, for example, we tokenized the string, hi there,
00:01:22.440 | and we received this sequence of tokens.
00:01:25.760 | And here we took the first 1,000 characters of our data set,
00:01:29.520 | and we encoded it into tokens.
00:01:31.920 | And because this is character level,
00:01:33.920 | we received 1,000 tokens in a sequence.
00:01:37.720 | So token 18, 47, et cetera.
00:01:40.120 | Now, later we saw that the way we plug these tokens
00:01:44.860 | into the language model is by using an embedding table.
00:01:48.540 | And so basically, if we have 65 possible tokens,
00:01:52.680 | then this embedding table is going to have 65 rows.
00:01:56.040 | And roughly speaking, we're taking the integer
00:01:58.960 | associated with every single token,
00:02:00.720 | we're using that as a lookup into this table,
00:02:03.800 | and we're plucking out the corresponding row.
00:02:06.320 | And this row is trainable parameters
00:02:09.500 | that we're going to train using backpropagation.
00:02:11.360 | And this is the vector that then feeds into the transformer.
00:02:15.180 | And that's how the transformer
00:02:16.240 | sort of perceives every single token.
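As a rough sketch of what that lookup looks like in PyTorch (the embedding dimension of 32 here is an illustrative choice, not a value from the previous video):

```python
import torch
import torch.nn as nn

vocab_size = 65                      # 65 possible character-level tokens
emb = nn.Embedding(vocab_size, 32)   # 65 rows of trainable parameters

tokens = torch.tensor([18, 47, 56])  # integer token IDs, e.g. from the Shakespeare set
vectors = emb(tokens)                # each ID plucks out its corresponding row
print(vectors.shape)                 # torch.Size([3, 32]); these vectors feed the transformer
```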
00:02:18.080 | So here we had a very naive tokenization process
00:02:22.660 | that was a character level tokenizer.
00:02:24.960 | But in practice, in state-of-the-art language models,
00:02:27.920 | people use a lot more complicated schemes, unfortunately,
00:02:30.640 | for constructing these token vocabularies.
00:02:34.520 | So we're not dealing on the character level,
00:02:37.640 | we're dealing on chunk level.
00:02:39.320 | And the way these character chunks are constructed
00:02:43.080 | is using algorithms such as, for example,
00:02:45.240 | the byte pair encoding algorithm,
00:02:46.780 | which we're going to go into in detail
00:02:48.760 | and cover in this video.
00:02:52.200 | I'd like to briefly show you the paper
00:02:53.480 | that introduced byte level encoding
00:02:56.060 | as a mechanism for tokenization
00:02:57.720 | in the context of large language models.
00:02:59.880 | And I would say that that's probably the GPT-2 paper.
00:03:02.600 | And if you scroll down here
00:03:04.840 | to the section input representation,
00:03:07.200 | this is where they cover tokenization,
00:03:09.120 | the kinds of properties
00:03:09.960 | that you'd like the tokenization to have.
00:03:12.380 | And they conclude here
00:03:13.220 | that they're going to have a tokenizer
00:03:16.060 | where you have a vocabulary of 50,257 possible tokens.
00:03:21.620 | And the context size is going to be 1,024 tokens.
00:03:26.320 | So in the attention layer
00:03:28.400 | of the transformer neural network,
00:03:30.520 | every single token is attending
00:03:32.120 | to the previous tokens in the sequence.
00:03:33.920 | And it's going to see up to 1,024 tokens.
00:03:37.440 | So tokens are this like fundamental unit,
00:03:40.500 | the atom of large language models, if you will.
00:03:43.740 | And everything is in units of tokens,
00:03:45.280 | everything is about tokens.
00:03:47.000 | And tokenization is the process for translating strings
00:03:49.840 | or text into sequences of tokens and vice versa.
00:03:54.840 | When you go into the LLAMA-2 paper as well,
00:03:57.280 | I can show you that when you search token,
00:03:59.040 | you're going to get 63 hits.
00:04:01.640 | And that's because tokens are, again, pervasive.
00:04:04.340 | So here, they mentioned that they trained
00:04:05.640 | on 2 trillion tokens of data and so on.
00:04:08.040 | So we're going to build our own tokenizer.
00:04:12.000 | Luckily, the byte pair encoding algorithm
00:04:13.560 | is not that super complicated,
00:04:16.040 | and we can build it from scratch ourselves,
00:04:17.880 | and we'll see exactly how this works.
00:04:19.680 | Before we dive into code,
00:04:20.760 | I'd like to give you a brief taste
00:04:22.560 | of some of the complexities that come from the tokenization,
00:04:25.320 | because I just want to make sure
00:04:26.520 | that we motivate it sufficiently
00:04:27.800 | for why we are doing all of this and why this is so gross.
00:04:31.720 | So tokenization is at the heart of a lot of weirdness
00:04:34.720 | in large language models,
00:04:35.880 | and I would advise that you do not brush it off.
00:04:38.800 | A lot of the issues that may look like just issues
00:04:41.560 | with the neural network architecture
00:04:43.080 | or the large language model itself
00:04:45.380 | are actually issues with the tokenization
00:04:47.680 | and fundamentally trace back to it.
00:04:50.360 | So if you've noticed any issues
00:04:52.180 | with large language models,
00:04:53.560 | like, you know, not being able to do spelling tasks very easily,
00:04:57.040 | that's usually due to tokenization.
00:04:59.100 | Simple string processing can be difficult
00:05:01.180 | for the large language model to perform natively.
00:05:03.820 | Non-English languages can work much worse,
00:05:07.220 | and to a large extent, this is due to tokenization.
00:05:10.260 | Sometimes LLMs are bad at simple arithmetic;
00:05:12.860 | that can also be traced to tokenization.
00:05:16.740 | GPT-2 specifically would have had quite a bit more issues
00:05:19.460 | with Python than future versions of it due to tokenization.
00:05:23.860 | There's a lot of other issues.
00:05:24.820 | Maybe you've seen weird warnings
00:05:25.980 | about a trailing white space.
00:05:27.260 | This is a tokenization issue.
00:05:28.700 | If you had asked GPT earlier
00:05:32.820 | about Solid Gold Magikarp and what it is,
00:05:35.080 | you would see the LLM go totally crazy,
00:05:37.260 | and it would start going off
00:05:38.880 | about a completely unrelated tangent topic.
00:05:41.660 | Maybe you've been told to use YAML over JSON
00:05:43.700 | in structured data.
00:05:44.780 | All of that has to do with tokenization.
00:05:47.060 | So basically, tokenization is at the heart of many issues.
00:05:50.420 | I will loop back around to these at the end of the video,
00:05:53.200 | but for now, let me just skip over it a little bit,
00:05:56.820 | and let's go to this web app,
00:05:58.720 | tiktokenizer.vercel.app.
00:06:02.220 | So I have it loaded here,
00:06:03.580 | and what I like about this web app
00:06:04.660 | is that tokenization is running
00:06:06.360 | sort of live in your browser in JavaScript.
00:06:09.260 | So you can just type here stuff, hello world,
00:06:11.980 | and the whole string re-tokenizes.
00:06:14.120 | So here what we see on the left
00:06:18.780 | is a string that you put in.
00:06:20.100 | On the right, we're currently using a GPT-2 tokenizer.
00:06:23.220 | We see that this string that I pasted here
00:06:25.640 | is currently tokenizing into 300 tokens,
00:06:28.660 | and here they are sort of shown explicitly
00:06:31.700 | in different colors for every single token.
00:06:34.340 | So for example, this word tokenization became two tokens,
00:06:38.900 | the token 30,642 and 1,634.
00:06:43.900 | The token space is, is token 318.
00:06:50.060 | So be careful, on the bottom, you can show whitespace,
00:06:53.300 | and keep in mind that there are spaces
00:06:55.580 | and slash and newline characters in here,
00:06:58.560 | but you can hide them for clarity.
00:07:00.840 | The token space at is token 379.
00:07:06.500 | The token space the is 262, et cetera.
00:07:11.100 | So you notice here that the space
00:07:12.340 | is part of that token chunk.
00:07:15.320 | Now, so this is kind of like how
00:07:18.740 | our English sentence broke up,
00:07:20.540 | and that seems all well and good.
00:07:23.180 | Now here, I put in some arithmetic.
00:07:25.900 | So we see that the token 127 plus,
00:07:29.860 | and then token six, space six, followed by 77.
00:07:34.060 | So what's happening here is that 127 is feeding in
00:07:36.740 | as a single token into the large language model,
00:07:39.420 | but the number 677 will actually feed in
00:07:43.700 | as two separate tokens.
00:07:45.900 | And so the large language model has to sort of
00:07:48.740 | take account of that and process it correctly
00:07:52.300 | in its network.
00:07:53.940 | And see here, 804 will be broken up into two tokens,
00:07:57.180 | and it's all completely arbitrary.
00:07:59.060 | And here I have another example of four-digit numbers,
00:08:01.660 | and they break up in a way that they break up
00:08:03.900 | and it's totally arbitrary.
00:08:04.860 | Sometimes you have multiple digits, a single token.
00:08:08.220 | Sometimes you have individual digits as many tokens,
00:08:11.420 | and it's all kind of pretty arbitrary
00:08:12.860 | and coming out of the tokenizer.
00:08:14.460 | Here's another example.
00:08:17.260 | We have the string egg,
00:08:20.220 | and you see here that this became two tokens.
00:08:23.420 | But for some reason, when I say I have an egg,
00:08:26.320 | you see when it's a space egg,
00:08:29.780 | it's actually a single token.
00:08:31.980 | So just egg by itself in the beginning of a sentence
00:08:34.460 | is two tokens, but here as a space egg
00:08:37.460 | is suddenly a single token for the exact same string.
00:08:41.260 | Here, lowercase egg turns out to be a single token.
00:08:46.020 | And in particular, notice that the color is different,
00:08:47.860 | so this is a different token.
00:08:49.440 | So this is case sensitive.
00:08:51.180 | And of course, capital egg would also be different tokens.
00:08:55.380 | And again, this would be two tokens arbitrarily.
00:08:59.220 | So for the same concept, egg,
00:09:01.580 | depending on if it's in the beginning of a sentence,
00:09:03.320 | at the end of a sentence, lowercase, uppercase, or mixed,
00:09:06.340 | all of this will be basically very different tokens
00:09:09.020 | and different IDs.
00:09:10.180 | And the language model has to learn from raw data,
00:09:12.500 | from all the internet text
00:09:13.580 | that it's going to be training on,
00:09:14.980 | that these are actually all the exact same concept.
00:09:17.460 | And it has to sort of group them
00:09:18.860 | in the parameters of the neural network
00:09:20.940 | and understand just based on the data patterns
00:09:23.300 | that these are all very similar,
00:09:24.860 | but maybe not almost exactly similar,
00:09:26.980 | but very, very similar.
00:09:30.700 | After the demonstration here,
00:09:32.500 | I have an introduction from OpenAI's ChatGPT in Korean.
00:09:37.500 | So, (speaking in foreign language)
00:09:40.860 | et cetera.
00:09:42.020 | So this is in Korean.
00:09:43.380 | And the reason I put this here
00:09:44.700 | is because you'll notice that
00:09:47.020 | non-English languages work slightly worse in ChatGPT.
00:09:53.360 | Part of this is because of course,
00:09:54.900 | the training data set for ChatGPT
00:09:56.500 | is much larger for English than for everything else.
00:09:59.380 | But the same is true,
00:10:00.340 | not just for the large language model itself,
00:10:02.540 | but also for the tokenizer.
00:10:04.380 | So when we train the tokenizer,
00:10:05.860 | we're going to see that there's a training set as well.
00:10:07.980 | And there's a lot more English than non-English.
00:10:10.540 | And what ends up happening
00:10:11.820 | is that we're going to have a lot more longer tokens
00:10:15.700 | for English.
00:10:16.580 | So how do I put this?
00:10:19.380 | If you have a single sentence in English
00:10:21.260 | and you tokenize it,
00:10:22.420 | you might see that it's 10 tokens or something like that.
00:10:25.180 | But if you translate that sentence into say Korean
00:10:27.420 | or Japanese or something else,
00:10:29.220 | you'll typically see that number of tokens used
00:10:31.220 | is much larger.
00:10:32.380 | And that's because the chunks here
00:10:34.380 | are a lot more broken up.
00:10:35.920 | So we're using a lot more tokens for the exact same thing.
00:10:40.420 | And what this does is it bloats up the sequence length
00:10:43.420 | of all the documents.
00:10:44.960 | So you're using up more tokens
00:10:46.980 | and then in the attention of the transformer,
00:10:49.060 | when these tokens try to attend to each other,
00:10:50.860 | you are running out of context
00:10:53.180 | in the maximum context length of that transformer.
00:10:56.500 | And so basically all the non-English text
00:10:59.700 | is stretched out from the perspective of the transformer.
00:11:03.500 | And this just has to do with the training set
00:11:06.220 | that's used for the tokenizer and the tokenization itself.
00:11:09.380 | So it will create a lot bigger tokens
00:11:11.480 | and a lot larger groups in English,
00:11:13.620 | and it will have a lot of little boundaries
00:11:15.940 | for all the other non-English text.
00:11:17.700 | So if we translated this into English,
00:11:21.620 | it would be significantly fewer tokens.
00:11:24.380 | The final example I have here
00:11:25.580 | is a little snippet of Python for doing FizzBuzz.
00:11:29.180 | And what I'd like you to notice is,
00:11:30.900 | look, all these individual spaces are all separate tokens.
00:11:35.580 | They are token 220.
00:11:37.940 | So 220, 220, 220, 220,
00:11:41.860 | and then space if is a single token.
00:11:44.780 | And so what's going on here
00:11:45.700 | is that when the transformer is going to consume
00:11:47.620 | or try to create this text,
00:11:50.620 | it needs to handle all these spaces individually.
00:11:53.860 | They all feed in one by one
00:11:55.900 | into the entire transformer in the sequence.
00:11:58.580 | And so this is being extremely wasteful
00:12:00.820 | tokenizing it in this way.
00:12:02.780 | And so as a result of that,
00:12:04.740 | GPT-2 is not very good with Python,
00:12:07.060 | and it's not anything to do with coding
00:12:08.860 | or the language model itself.
00:12:10.300 | It's just that if you use a lot of indentation
00:12:12.180 | using space in Python, like we usually do,
00:12:15.540 | you just end up bloating out all the text
00:12:17.800 | and it's separated across way too much of the sequence,
00:12:20.540 | and we are running out of the context length in the sequence.
00:12:23.760 | That's roughly speaking what's happening.
00:12:25.260 | We're being way too wasteful.
00:12:26.580 | We're taking up way too much token space.
00:12:28.740 | Now, we can also scroll up here
00:12:30.100 | and we can change the tokenizer.
00:12:31.720 | So note here that GPT-2 tokenizer
00:12:33.700 | creates a token count of 300 for this string here.
00:12:37.760 | We can change it to CL100K base,
00:12:39.620 | which is the GPT-4 tokenizer.
00:12:41.620 | And we see that the token count drops to 185.
00:12:44.560 | So for the exact same string,
00:12:46.220 | we are now roughly halving the number of tokens.
00:12:49.780 | And roughly speaking,
00:12:50.620 | this is because the number of tokens in the GPT-4 tokenizer
00:12:54.440 | is roughly double that of the number of tokens
00:12:57.160 | in the GPT-2 tokenizer.
00:12:58.620 | So we went from roughly 50K to roughly 100K.
00:13:01.720 | Now, you can imagine that this is a good thing
00:13:03.840 | because the same text is now squished
00:13:06.820 | into half as many tokens.
00:13:08.560 | So this is a lot denser input to the transformer.
00:13:12.780 | And in the transformer,
00:13:15.360 | every single token has a finite number of tokens before it
00:13:17.880 | that it's going to pay attention to.
00:13:19.560 | And so what this is doing is we're roughly able to see
00:13:22.960 | twice as much text as a context
00:13:25.840 | for what token to predict next because of this change.
00:13:29.360 | But of course,
00:13:30.200 | just increasing the number of tokens
00:13:31.520 | is not strictly better infinitely
00:13:34.520 | because as you increase the number of tokens,
00:13:36.240 | now your embedding table is sort of getting a lot larger.
00:13:39.960 | And also at the output,
00:13:41.080 | we are trying to predict the next token,
00:13:42.520 | and there's the softmax there, and that grows as well.
00:13:45.280 | We're gonna go into more detail later on this,
00:13:47.200 | but there's some kind of a sweet spot somewhere
00:13:49.560 | where you have a just right number of tokens
00:13:52.160 | in your vocabulary,
00:13:53.200 | where everything is appropriately dense
00:13:55.360 | and still fairly efficient.
00:13:56.720 | Now, one thing I would like you to note specifically
00:13:59.400 | for the GPT-4 Tokenizer
00:14:01.160 | is that the handling of the whitespace for Python
00:14:04.880 | has improved a lot.
00:14:06.180 | You see that here,
00:14:07.640 | these four spaces are represented as one single token
00:14:10.120 | for the three spaces here,
00:14:12.400 | and then the token space if.
00:14:14.640 | And here, seven spaces were all grouped into a single token.
00:14:18.640 | So we're being a lot more efficient
00:14:19.960 | in how we represent Python.
00:14:21.400 | And this was a deliberate choice made by OpenAI
00:14:23.680 | when they designed the GPT-4 Tokenizer,
00:14:26.440 | and they group a lot more whitespace into a single character.
00:14:30.600 | What this does is this densifies Python,
00:14:33.880 | and therefore we can attend to more code before it
00:14:37.980 | when we're trying to predict the next token in the sequence.
00:14:40.760 | And so the improvement in the Python coding ability
00:14:43.660 | from GPT-2 to GPT-4
00:14:45.440 | is not just a matter of the language model
00:14:47.480 | and the architecture and the details of the optimization,
00:14:50.440 | but a lot of the improvement here
00:14:51.540 | is also coming from the design of the Tokenizer
00:14:53.800 | and how it groups characters into tokens.
00:14:56.720 | Okay, so let's now start writing some code.
00:14:59.360 | So remember what we want to do.
00:15:01.140 | We want to take strings and feed them into language models.
00:15:05.040 | For that, we need to somehow tokenize strings
00:15:07.640 | into some integers in some fixed vocabulary,
00:15:12.120 | and then we will use those integers
00:15:13.860 | to make a lookup into a lookup table of vectors
00:15:16.820 | and feed those vectors into the transformer as an input.
00:15:20.300 | Now, the reason this gets a little bit tricky, of course,
00:15:22.740 | is that we don't just want to support
00:15:24.060 | the simple English alphabet.
00:15:25.760 | We want to support different kinds of languages.
00:15:27.920 | So this is annyeonghaseyo in Korean, which is hello.
00:15:31.660 | And we also want to support many kinds of special characters
00:15:33.820 | that we might find on the internet, for example, emoji.
00:15:38.260 | So how do we feed this text into transformers?
00:15:43.100 | Well, what is this text anyway in Python?
00:15:46.140 | So if you go to the documentation of a string in Python,
00:15:49.660 | you can see that strings are immutable sequences
00:15:52.340 | of Unicode code points.
00:15:55.020 | Okay, what are Unicode code points?
00:15:56.980 | We can go to Wikipedia.
00:15:59.900 | So Unicode code points are defined
00:16:02.380 | by the Unicode consortium as part of the Unicode standard.
00:16:07.180 | And what this is really is that it's just a definition
00:16:09.640 | of roughly 150,000 characters right now.
00:16:13.000 | And roughly speaking, what they look like
00:16:15.280 | and what integers represent those characters.
00:16:18.680 | So this is 150,000 characters across 161 scripts
00:16:22.000 | as of right now.
00:16:23.620 | So if you scroll down here,
00:16:24.640 | you can see that the standard is very much alive.
00:16:27.280 | The latest standard, 15.1, is from September 2023.
00:16:29.980 | And basically, this is just a way to define
00:16:34.960 | lots of types of characters, like for example,
00:16:38.980 | all these characters across different scripts.
00:16:41.300 | So the way we can access the Unicode code point
00:16:44.300 | given a single character
00:16:45.580 | is by using the ORD function in Python.
00:16:47.620 | So for example, I can pass in ORD of H,
00:16:50.620 | and I can see that for the single character H,
00:16:53.980 | the Unicode code point is 104, okay?
00:16:57.860 | But this can be arbitrarily complicated.
00:17:00.940 | So we can take, for example, our emoji here,
00:17:03.620 | and we can see that the code point for this one is 128,000.
00:17:07.460 | Or we can take this Korean character,
00:17:09.100 | and this is 50,000.
00:17:13.180 | Now, keep in mind, you can't plug in strings here
00:17:15.860 | because this doesn't have a single code point.
00:17:18.960 | It only takes a single Unicode code point character
00:17:22.140 | and tells you its integer.
00:17:23.800 | So in this way, we can look up all the characters
00:17:29.640 | of this specific string and their code points.
00:17:32.020 | So ORD of X, for X in this string,
00:17:35.460 | and we get this encoding here.
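In code, the calls discussed here look roughly like this (the emoji and the Korean string are stand-ins for the exact ones typed in the video):

```python
ord("h")                          # 104: the code point of a single character
ord("👋")                         # 128075: emoji code points are much larger
ord("안")                         # 50504: a Korean character
[ord(x) for x in "안녕하세요 👋"]   # code points for every character of a string
```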
00:17:38.560 | Now, see here, we've already turned,
00:17:41.220 | the raw code points already have integers.
00:17:43.740 | So why can't we simply just use these integers
00:17:46.200 | and not have any tokenization at all?
00:17:48.260 | Why can't we just use this natively as is
00:17:50.500 | and just use the code point?
00:17:52.200 | Well, one reason for that, of course,
00:17:53.420 | is that the vocabulary in that case would be quite long.
00:17:56.000 | So in this case, for Unicode,
00:17:58.380 | this is a vocabulary of 150,000 different code points.
00:18:02.260 | But more worryingly than that, I think,
00:18:04.300 | the Unicode standard is very much alive
00:18:06.580 | and it keeps changing.
00:18:08.040 | And so it's not kind of a stable representation necessarily
00:18:10.940 | that we may wanna use directly.
00:18:13.220 | So for those reasons, we need something a bit better.
00:18:15.860 | So to find something better, we turn to encodings.
00:18:18.740 | So if we go to the Wikipedia page here,
00:18:20.180 | we see that the Unicode Consortium
00:18:22.240 | defines three types of encodings,
00:18:24.900 | UTF-8, UTF-16, and UTF-32.
00:18:27.860 | These encodings are the way
00:18:30.060 | by which we can take Unicode text
00:18:32.340 | and translate it into binary data or byte streams.
00:18:36.140 | UTF-8 is by far the most common.
00:18:38.940 | So this is the UTF-8 page.
00:18:40.600 | Now this Wikipedia page is actually quite long,
00:18:43.300 | but what's important for our purposes
00:18:44.900 | is that UTF-8 takes every single code point
00:18:47.820 | and it translates it to a byte stream.
00:18:50.540 | And this byte stream is between one to four bytes.
00:18:53.380 | So it's a variable length encoding.
00:18:55.400 | So depending on the Unicode point, according to the schema,
00:18:58.080 | you're gonna end up with between one to four bytes
00:19:00.560 | for each code point.
00:19:01.640 | On top of that, there's UTF-8, UTF-16, and UTF-32.
00:19:07.480 | UTF-32 is nice because it is fixed length
00:19:09.960 | instead of variable length,
00:19:11.120 | but it has many other downsides as well.
00:19:13.680 | So the full kind of spectrum of pros and cons
00:19:17.720 | of all these different three encodings
00:19:19.560 | are beyond the scope of this video.
00:19:21.400 | I just like to point out that I enjoyed this blog post
00:19:24.460 | and this blog post at the end of it
00:19:26.120 | also has a number of references that can be quite useful.
00:19:29.160 | One of them is UTF-8 Everywhere Manifesto.
00:19:32.620 | And this manifesto describes the reason why UTF-8
00:19:36.280 | is significantly preferred and a lot nicer
00:19:39.760 | than the other encodings
00:19:41.280 | and why it is used a lot more prominently on the internet.
00:19:45.540 | One of the major advantages, just to give you a sense,
00:19:49.360 | is that UTF-8 is the only one of these
00:19:51.560 | that is backwards compatible
00:19:53.060 | to the much simpler ASCII encoding of text.
00:19:56.860 | But I'm not gonna go into the full detail in this video.
00:19:59.580 | So suffice to say that we like the UTF-8 encoding,
00:20:02.420 | and let's try to take the string
00:20:04.700 | and see what we get if we encode it into UTF-8.
00:20:07.360 | The string class in Python actually has dot encode,
00:20:11.660 | and you can give it the encoding, which is, say, UTF-8.
00:20:15.180 | Now, what we get out of this is not very nice
00:20:17.180 | because this is the bytes, this is a bytes object,
00:20:20.540 | and it's not very nice in the way that it's printed.
00:20:23.500 | So I personally like to take it through a list
00:20:26.040 | because then we actually get the raw bytes of this encoding.
00:20:30.840 | So this is the raw bytes that represent this string
00:20:35.680 | according to the UTF-8 encoding.
00:20:37.980 | We can also look at UTF-16.
00:20:40.100 | We get a slightly different byte stream.
00:20:42.520 | And here we start to see
00:20:43.680 | one of the disadvantages of UTF-16.
00:20:45.620 | You see how we have zero something,
00:20:47.800 | zero something, zero something.
00:20:49.360 | We're starting to get a sense that this is
00:20:50.560 | a bit of a wasteful encoding.
00:20:52.280 | And indeed, for simple ASCII characters
00:20:55.360 | or English characters here,
00:20:57.320 | we just have the structure of zero something,
00:20:58.880 | zero something, and it's not exactly nice.
00:21:02.280 | Same for UTF-32.
00:21:04.000 | When we expand this, we can start to get a sense
00:21:06.140 | of the wastefulness of this encoding for our purposes.
00:21:08.960 | You see a lot of zeros followed by something,
00:21:11.460 | and so this is not desirable.
00:21:14.580 | So suffice it to say that we would like to stick
00:21:18.820 | with UTF-8 for our purposes.
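A small sketch of the comparison being made here (the string is a stand-in for the one in the notebook):

```python
s = "안녕하세요 👋 (hello in Korean!)"   # stand-in example string

print(list(s.encode("utf-8")))    # 1 to 4 bytes per code point, no wasted zeros for ASCII
print(list(s.encode("utf-16")))   # note the zero-something pattern for the ASCII characters
print(list(s.encode("utf-32")))   # 4 bytes per code point, mostly zeros: very wasteful here
```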
00:21:21.780 | However, if we just use UTF-8 naively,
00:21:24.660 | these are byte streams.
00:21:26.060 | So that would imply a vocabulary length
00:21:28.680 | of only 256 possible tokens.
00:21:32.340 | But this vocabulary size is very, very small.
00:21:35.260 | What this is going to do if we just were to use it naively
00:21:37.900 | is that all of our text would be stretched out
00:21:41.140 | over very, very long sequences of bytes.
00:21:44.400 | And so what this does is that certainly
00:21:49.400 | the embedding table is going to be tiny,
00:21:51.060 | and the prediction at the top of the final layer
00:21:52.980 | is going to be very tiny, but our sequences are very long.
00:21:55.940 | And remember that we have pretty finite context lengths
00:21:59.860 | in the attention that we can support in a transformer
00:22:02.660 | for computational reasons.
00:22:04.220 | And so we only have as much context length,
00:22:07.140 | but now we have very, very long sequences.
00:22:09.320 | And this is just inefficient,
00:22:10.300 | and it's not going to allow us to attend
00:22:12.220 | to sufficiently long text before us
00:22:15.100 | for the purposes of the next token prediction task.
00:22:18.020 | So we don't want to use the raw bytes of the UTF-8 encoding.
00:22:23.020 | We want to be able to support larger vocabulary size
00:22:26.780 | that we can tune as a hyperparameter,
00:22:28.700 | but we want to stick with the UTF-8 encoding
00:22:31.700 | of these strings.
00:22:32.800 | So what do we do?
00:22:34.340 | Well, the answer, of course,
00:22:35.180 | is we turn to the byte-pair encoding algorithm,
00:22:37.420 | which will allow us to compress these byte sequences
00:22:41.220 | to a variable amount.
00:22:42.720 | So we'll get to that in a bit,
00:22:43.780 | but I just want to briefly speak to the fact
00:22:46.300 | that I would love nothing more
00:22:48.740 | than to be able to feed raw byte sequences
00:22:51.540 | into language models.
00:22:54.100 | In fact, there's a paper
00:22:54.900 | about how this could potentially be done
00:22:57.280 | from somewhere last year.
00:22:59.020 | Now, the problem is you actually have to go in
00:23:01.060 | and you have to modify the transformer architecture
00:23:03.180 | because, as I mentioned,
00:23:04.580 | you're going to have a problem
00:23:06.020 | where the attention will start to become extremely expensive
00:23:08.740 | because the sequences are so long.
00:23:11.380 | And so in this paper,
00:23:12.780 | they propose kind of a hierarchical structuring
00:23:15.740 | of the transformer that could allow you
00:23:17.540 | to just feed in raw bytes.
00:23:19.660 | And so at the end, they say,
00:23:20.980 | "Together, these results establish the viability
00:23:22.900 | of tokenization-free autoregressive sequence modeling
00:23:25.380 | at scale."
00:23:26.300 | So tokenization-free would indeed be amazing.
00:23:28.680 | We would just feed byte streams directly into our models,
00:23:32.180 | but unfortunately, I don't know
00:23:33.260 | that this has really been proven out yet
00:23:35.580 | by sufficiently many groups and at sufficient scale,
00:23:38.580 | but something like this at one point would be amazing
00:23:40.540 | and I hope someone comes up with it.
00:23:42.020 | But for now, we have to come back
00:23:44.020 | and we can't feed this directly into language models
00:23:46.500 | and we have to compress it
00:23:47.740 | using the byte-pair encoding algorithm.
00:23:49.480 | So let's see how that works.
00:23:50.620 | So as I mentioned,
00:23:51.460 | the byte-pair encoding algorithm is not all that complicated
00:23:54.460 | and the Wikipedia page is actually quite instructive
00:23:56.500 | as far as the basic idea goes.
00:23:58.980 | What we're doing is we have some kind of a input sequence.
00:24:01.780 | Like, for example, here we have only four elements
00:24:04.100 | in our vocabulary, A, B, C, and D,
00:24:06.420 | and we have a sequence of them.
00:24:07.920 | So instead of bytes, let's say we just had four,
00:24:10.460 | a vocab size of four.
00:24:11.740 | The sequence is too long and we'd like to compress it.
00:24:15.380 | So what we do is that we iteratively find the pair of tokens
00:24:20.380 | that occur the most frequently.
00:24:24.360 | And then once we've identified that pair,
00:24:27.060 | we replace that pair with just a single new token
00:24:30.900 | that we append to our vocabulary.
00:24:33.380 | So for example, here, the byte-pair AA occurs most often,
00:24:37.320 | so we mint a new token, let's call it capital Z,
00:24:40.740 | and we replace every single occurrence of AA by Z.
00:24:44.800 | So now we have two Zs here.
00:24:47.920 | So here we took a sequence of 11 characters
00:24:51.540 | with vocabulary size four,
00:24:53.740 | and we've converted it to a sequence of only nine tokens,
00:24:58.740 | but now with a vocabulary of five,
00:25:00.600 | because we have a fifth vocabulary element
00:25:02.780 | that we just created, and it's Z,
00:25:05.000 | standing for concatenation of AA.
00:25:07.300 | And we can, again, repeat this process.
00:25:09.660 | So we, again, look at the sequence
00:25:11.980 | and identify the pair of tokens that are most frequent.
00:25:16.680 | Let's say that that is now AB.
00:25:19.140 | Well, we are going to replace AB with a new token
00:25:21.340 | that we mint, called Y.
00:25:23.460 | So Y becomes AB, and then every single occurrence of AB
00:25:26.220 | is now replaced with Y, so we end up with this.
00:25:29.780 | So now we only have one, two, three, four, five, six,
00:25:33.860 | seven characters in our sequence,
00:25:36.100 | but we have not just four vocabulary elements or five,
00:25:41.100 | but now we have six.
00:25:43.620 | And for the final round, we, again,
00:25:46.220 | look through the sequence, find that the phrase ZY,
00:25:49.600 | or the pair ZY is most common,
00:25:51.540 | and replace it one more time with another character,
00:25:55.500 | let's say X.
00:25:56.580 | So X is ZY, and we replace all occurrences of ZY,
00:25:59.980 | and we get this following sequence.
00:26:02.020 | So basically, after we have gone through this process,
00:26:04.840 | instead of having a sequence of 11 tokens
00:26:09.840 | with a vocabulary length of four,
00:26:14.680 | we now have a sequence of one, two, three, four, five tokens,
00:26:19.680 | but our vocabulary length now is seven.
00:26:23.260 | And so in this way, we can iteratively compress our sequence
00:26:27.580 | as we mint new tokens.
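Written out, the toy example just described goes through these stages:

```
aaabdaaabac    vocab {a, b, c, d}           11 tokens
ZabdZabac      Z = aa, vocab size 5          9 tokens
ZYdZYac        Y = ab, vocab size 6          7 tokens
XdXac          X = ZY, vocab size 7          5 tokens
```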
00:26:29.560 | So in the exact same way, we start out with byte sequences.
00:26:34.320 | So we have 256 vocabulary size,
00:26:37.520 | but we're now going to go through these
00:26:38.980 | and find the byte pairs that occur the most.
00:26:42.160 | And we're going to iteratively start minting new tokens,
00:26:44.920 | appending them to our vocabulary, and replacing things.
00:26:48.160 | And in this way, we're going to end up
00:26:49.480 | with a compressed training dataset,
00:26:51.580 | and also an algorithm for taking any arbitrary sequence
00:26:54.920 | and encoding it using this vocabulary,
00:26:58.240 | and also decoding it back to strings.
00:27:00.720 | So let's now implement all that.
00:27:02.920 | So here's what I did.
00:27:04.280 | I went to this blog post that I enjoyed,
00:27:06.640 | and I took the first paragraph,
00:27:07.960 | and I copy pasted it here into text.
00:27:10.840 | So this is one very long line here.
00:27:13.040 | Now, to get the tokens, as I mentioned,
00:27:16.360 | we just take our text and we encode it into UTF-8.
00:27:19.360 | The tokens here at this point will be a raw bytes,
00:27:22.520 | single stream of bytes.
00:27:24.700 | And just so that it's easier to work with,
00:27:26.760 | instead of just a bytes object,
00:27:28.800 | I'm going to convert all those bytes to integers,
00:27:32.000 | and then create a list of it,
00:27:33.420 | just so it's easier for us to manipulate
00:27:35.040 | and work with in Python and visualize.
00:27:37.200 | And here I'm printing all of that.
00:27:38.360 | So this is the original paragraph,
00:27:43.360 | and its length is 533 code points.
00:27:48.720 | And then here are the bytes encoded in UTF-8.
00:27:52.800 | And we see that this has a length of 616 bytes
00:27:56.180 | at this point, or 616 tokens.
00:27:58.360 | And the reason this is more,
00:27:59.800 | is because a lot of these simple ASCII characters,
00:28:03.240 | or simple characters, they just become a single byte.
00:28:06.200 | But a lot of these Unicode, more complex characters,
00:28:08.840 | become multiple bytes, up to four.
00:28:10.720 | And so we are expanding that size.
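A minimal sketch of this setup (the text string stands in for the pasted paragraph):

```python
text = "Unicode! ..."                # stand-in for the copy-pasted first paragraph
tokens = text.encode("utf-8")        # raw bytes
tokens = list(map(int, tokens))      # convert to a list of integers in 0..255
print(len(text), "code points,", len(tokens), "bytes")   # bytes >= code points (e.g. 533 vs 616)
```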
00:28:12.760 | So now what we'd like to do as a first step
00:28:15.220 | of the algorithm, is we'd like to iterate over here,
00:28:17.700 | and find the pair of bytes that occur most frequently,
00:28:22.040 | because we're then going to merge it.
00:28:23.840 | So if you are working along on a notebook, on a side,
00:28:26.460 | then I encourage you to basically click on the link,
00:28:28.740 | find this notebook,
00:28:29.940 | and try to write that function yourself.
00:28:31.980 | Otherwise, I'm going to come here
00:28:32.980 | and implement first the function
00:28:34.620 | that finds the most common pair.
00:28:36.380 | Okay, so here's what I came up with.
00:28:37.740 | There are many different ways to implement this,
00:28:39.860 | but I'm calling the function getStats.
00:28:41.460 | It expects a list of integers.
00:28:43.360 | I'm using a dictionary to keep track
00:28:45.260 | of basically the counts.
00:28:46.860 | And then this is a Pythonic way
00:28:48.080 | to iterate consecutive elements of this list,
00:28:51.300 | which we covered in a previous video.
00:28:53.740 | And then here, I'm just keeping track
00:28:55.500 | of just incrementing by one for all the pairs.
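A sketch of that counting function, following the description above:

```python
def get_stats(ids):
    # count how often each consecutive pair of integers occurs in the list
    counts = {}
    for pair in zip(ids, ids[1:]):       # iterate consecutive elements: (ids[0], ids[1]), ...
        counts[pair] = counts.get(pair, 0) + 1
    return counts

stats = get_stats(tokens)
```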
00:28:59.340 | So if I call this on all the tokens here,
00:29:01.900 | then the stats comes out here.
00:29:04.240 | So this is a dictionary.
00:29:05.640 | The keys are these tuples of consecutive elements,
00:29:09.520 | and this is the count.
00:29:11.160 | So just to print it in a slightly better way,
00:29:14.780 | this is one way that I like to do that,
00:29:17.540 | where it's a little bit compound here,
00:29:20.820 | so you can pause if you like,
00:29:22.200 | but we iterate all the items.
00:29:24.460 | The items called on dictionary returns pairs of key value.
00:29:29.060 | And instead, I create a list here of value key,
00:29:34.060 | because if it's a value key list,
00:29:36.260 | then I can call sort on it.
00:29:38.020 | And by default, Python will use the first element,
00:29:42.780 | which in this case will be value,
00:29:44.100 | to sort by if it's given tuples.
00:29:46.420 | And then reverse, so it's descending, and print that.
00:29:49.220 | So basically, it looks like 101,32
00:29:52.640 | was the most commonly occurring consecutive pair,
00:29:55.580 | and it occurred 20 times.
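The value-key sort described here can be written as a one-liner:

```python
# print pairs sorted by count, most frequent first
print(sorted(((v, k) for k, v in stats.items()), reverse=True))
# e.g. [(20, (101, 32)), ...]  -> the pair (101, 32) occurs 20 times
```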
00:29:57.480 | We can double check that that makes reasonable sense.
00:30:00.200 | So if I just search 101,32,
00:30:03.160 | then you see that these are the 20 occurrences of that pair.
00:30:07.360 | And if we'd like to take a look
00:30:10.800 | at what exactly that pair is, we can use char,
00:30:13.920 | which is the opposite of ord in Python.
00:30:16.320 | So we give it a Unicode code point.
00:30:19.200 | So 101, and of 32, and we see that this is E and space.
00:30:24.200 | So basically, there's a lot of E space here,
00:30:28.220 | meaning that a lot of these words seem to end with E.
00:30:30.700 | So here's E space as an example.
00:30:33.240 | So there's a lot of that going on here,
00:30:34.560 | and this is the most common pair.
00:30:36.840 | So now that we've identified the most common pair,
00:30:39.140 | we would like to iterate over the sequence.
00:30:41.980 | We're going to mint a new token with the ID of 256, right?
00:30:46.340 | Because these tokens currently go from zero to 255.
00:30:49.740 | So when we create a new token, it will have an ID of 256,
00:30:53.780 | and we're going to iterate over this entire list.
00:30:58.460 | And every time we see 101,32,
00:31:02.060 | we're gonna swap that out for 256.
00:31:04.900 | So let's implement that now,
00:31:07.020 | and feel free to do that yourself as well.
00:31:09.840 | So first, I commented this
00:31:11.620 | just so we don't pollute the notebook too much.
00:31:14.920 | This is a nice way of, in Python,
00:31:18.000 | obtaining the highest ranking pair.
00:31:20.440 | So we're basically calling the max on this dictionary stats,
00:31:25.000 | and this will return the maximum key.
00:31:28.640 | And then the question is, how does it rank keys?
00:31:31.760 | So you can provide it with a function that ranks keys,
00:31:34.860 | and that function is just stats.get.
00:31:37.740 | Stats.get would basically return the value.
00:31:41.120 | And so we're ranking by the value
00:31:42.760 | and getting the maximum key.
00:31:44.480 | So it's 101,32, as we saw.
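That max call, as described:

```python
top_pair = max(stats, key=stats.get)   # ranks the keys (pairs) by their count
print(top_pair)                        # (101, 32)
```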
00:31:47.520 | Now, to actually merge 101,32,
00:31:49.860 | this is the function that I wrote,
00:31:52.240 | but again, there are many different versions of it.
00:31:54.840 | So we're gonna take a list of IDs
00:31:56.920 | and the pair that we wanna replace,
00:31:58.900 | and that pair will be replaced with the new index IDX.
00:32:02.300 | So iterating through IDs,
00:32:05.020 | if we find the pair, swap it out for IDX.
00:32:08.240 | So we create this new list, and then we start at zero,
00:32:11.680 | and then we go through this entire list sequentially
00:32:13.880 | from left to right.
00:32:14.840 | And here we are checking for equality
00:32:18.040 | at the current position with the pair.
00:32:20.440 | So here we are checking that the pair matches.
00:32:24.660 | Now, here's a bit of a tricky condition
00:32:26.360 | that you have to append if you're trying to be careful,
00:32:28.680 | and that is that you don't want this here
00:32:31.700 | to be out of bounds at the very last position
00:32:34.320 | when you're on the rightmost element of this list.
00:32:36.660 | Otherwise, this would give you an out-of-bounds error.
00:32:39.400 | So we have to make sure
00:32:40.240 | that we're not at the very, very last element.
00:32:42.800 | So this would be false for that.
00:32:45.840 | So if we find a match,
00:32:47.480 | we append to this new list that replacement index,
00:32:52.480 | and we increment the position by two.
00:32:54.080 | So we skip over that entire pair.
00:32:56.440 | But otherwise, if we haven't found a matching pair,
00:32:58.880 | we just sort of copy over the element at that position
00:33:02.720 | and increment by one, and then return this.
00:33:06.360 | So here's a very small toy example.
00:33:07.840 | If we have a list 5, 6, 6, 7, 9, 1,
00:33:10.300 | and we want to replace the occurrences of 67 with 99,
00:33:13.620 | then calling this on that will give us
00:33:17.320 | what we're asking for.
00:33:18.600 | So here, the 67 is replaced with 99.
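Here is a sketch of that merge function, plus the toy example just mentioned:

```python
def merge(ids, pair, idx):
    # in the list of ints (ids), replace every consecutive occurrence of pair with idx
    newids = []
    i = 0
    while i < len(ids):
        # only check the pair if we are not at the very last position (avoids out-of-bounds)
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)   # found a match: append the new token
            i += 2               # and skip over the entire pair
        else:
            newids.append(ids[i])
            i += 1
    return newids

print(merge([5, 6, 6, 7, 9, 1], (6, 7), 99))   # [5, 6, 99, 9, 1]
```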
00:33:21.360 | So now I'm going to uncomment this for our actual use case,
00:33:26.240 | where we want to take our tokens.
00:33:28.520 | We want to take the top pair here
00:33:31.040 | and replace it with 256 to get tokens two.
00:33:34.120 | If we run this, we get the following.
00:33:38.240 | So recall that previously, we had a length 616 in this list,
00:33:43.240 | and now we have a length 596, right?
00:33:47.560 | So this decreased by 20,
00:33:49.620 | which makes sense because there are 20 occurrences.
00:33:52.420 | Moreover, we can try to find 256 here,
00:33:55.500 | and we see plenty of occurrences off it.
00:33:57.600 | And moreover, just double check,
00:33:59.760 | there should be no occurrence of 101, 32.
00:34:02.320 | So this is the original array, plenty of them.
00:34:04.980 | And in the second array, there are no occurrences of 101, 32.
00:34:08.320 | So we've successfully merged this single pair.
00:34:11.480 | And now we just iterate this.
00:34:13.320 | So we are going to go over the sequence again,
00:34:15.420 | find the most common pair, and replace it.
00:34:17.740 | So let me now write a while loop
00:34:19.140 | that uses these functions to do this sort of iteratively.
00:34:22.820 | And how many times do we do it for?
00:34:25.160 | Well, that's totally up to us as a hyperparameter.
00:34:27.460 | The more steps we take, the larger will be our vocabulary,
00:34:32.420 | and the shorter will be our sequence.
00:34:34.500 | And there is some sweet spot
00:34:35.940 | that we usually find works the best in practice.
00:34:38.780 | And so this is kind of a hyperparameter,
00:34:40.900 | and we tune it, and we find good vocabulary sizes.
00:34:44.020 | As an example, GPT-4 currently uses roughly 100,000 tokens.
00:34:47.780 | And ballpark, those are reasonable numbers currently
00:34:51.580 | in state-of-the-art large-language models.
00:34:53.500 | So let me now write, putting it all together,
00:34:56.580 | and iterating these steps.
00:34:58.720 | Okay, now, before we dive into the while loop,
00:35:00.500 | I wanted to add one more cell here,
00:35:03.160 | where I went to the blog post,
00:35:04.540 | and instead of grabbing just the first paragraph or two,
00:35:07.100 | I took the entire blog post,
00:35:08.700 | and I stretched it out in a single line.
00:35:10.780 | And basically, just using longer text
00:35:12.460 | will allow us to have more representative statistics
00:35:14.700 | for the byte pairs,
00:35:15.920 | and we'll just get more sensible results out of it,
00:35:18.940 | because it's longer text.
00:35:20.280 | So here we have the raw text.
00:35:23.180 | We encode it into bytes using the UTF-8 encoding.
00:35:27.660 | And then here, as before,
00:35:29.140 | we are just changing it into a list of integers in Python,
00:35:32.240 | just so it's easier to work with,
00:35:34.060 | instead of the raw bytes objects.
00:35:36.380 | And then this is the code that I came up with
00:35:40.220 | to actually do the merging in loop.
00:35:43.860 | These two functions here are identical
00:35:45.700 | to what we had above.
00:35:46.740 | I only included them here
00:35:48.100 | just so that you have the point of reference here.
00:35:50.760 | So these two are identical,
00:35:54.020 | and then this is the new code that I added.
00:35:56.540 | So the first thing we wanna do
00:35:57.380 | is we want to decide on a final vocabulary size
00:36:00.180 | that we want our tokenizer to have.
00:36:02.480 | And as I mentioned, this is a hyperparameter,
00:36:04.040 | and you set it in some way
00:36:05.280 | depending on your best performance.
00:36:07.520 | So let's say for us, we're going to use 276,
00:36:10.140 | because that way we're going to be doing exactly 20 merges.
00:36:14.000 | And 20 merges because we already have 256 tokens
00:36:18.420 | for the raw bytes.
00:36:20.280 | And to reach 276, we have to do 20 merges
00:36:23.600 | to add 20 new tokens.
00:36:25.160 | Here, this is one way in Python
00:36:28.680 | to just create a copy of a list.
00:36:30.760 | So I'm taking the tokens list,
00:36:33.260 | and by wrapping it in the list,
00:36:34.940 | Python will construct a new list
00:36:36.940 | of all the individual elements,
00:36:38.100 | so this is just a copy operation.
00:36:39.740 | Then here, I'm creating a merges dictionary.
00:36:44.340 | So this merges dictionary is going to maintain
00:36:46.140 | basically the child one, child two,
00:36:49.420 | mapping to a new token.
00:36:52.300 | And so what we're going to be building up here
00:36:53.740 | is a binary tree of merges.
00:36:56.500 | But actually, it's not exactly a tree
00:36:57.860 | because a tree would have a single root node
00:37:00.500 | with a bunch of leaves.
00:37:02.220 | For us, we're starting with the leaves on the bottom,
00:37:04.620 | which are the individual bytes.
00:37:06.060 | Those are the starting 256 tokens.
00:37:08.580 | And then we're starting to merge two of them at a time.
00:37:11.380 | And so it's not a tree, it's more like a forest
00:37:13.740 | as we merge these elements.
00:37:18.360 | So for 20 merges,
00:37:21.540 | we're going to find the most commonly occurring pair.
00:37:24.840 | We're going to mint a new token integer for it.
00:37:27.940 | So I here will start at zero,
00:37:29.460 | so we're going to start at 256.
00:37:31.940 | We're going to print that we're merging it,
00:37:33.620 | and we're going to replace all the occurrences of that pair
00:37:36.660 | with the newly minted token.
00:37:39.520 | And we're going to record that this pair of integers
00:37:42.980 | merged into this new integer.
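Putting it together, the merging loop described here looks roughly like this (using the get_stats and merge sketches from above):

```python
vocab_size = 276                 # desired final vocabulary size (a hyperparameter)
num_merges = vocab_size - 256    # 20 merges on top of the 256 raw byte tokens
ids = list(tokens)               # copy, so the original tokens list is left untouched

merges = {}                      # (int, int) -> int
for i in range(num_merges):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)        # most frequently occurring pair
    idx = 256 + i                           # mint a new token id
    print(f"merging {pair} into a new token {idx}")
    ids = merge(ids, pair, idx)             # replace all occurrences of the pair
    merges[pair] = idx                      # record the merge
```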
00:37:45.460 | So running this gives us the following output.
00:37:49.940 | So we did 20 merges.
00:37:54.160 | And for example, the first merge was exactly as before,
00:37:57.500 | the 101, 32 tokens merging into a new token 256.
00:38:02.500 | Now keep in mind that the individual tokens 101 and 32
00:38:06.440 | can still occur in the sequence after merging.
00:38:09.220 | It's only when they occur exactly consecutively
00:38:11.800 | that that becomes 256 now.
00:38:13.680 | And in particular, the other thing to notice here
00:38:17.620 | is that the token 256, which is the newly minted token,
00:38:20.920 | is also eligible for merging.
00:38:22.860 | So here on the bottom, the 20th merge
00:38:24.980 | was a merge of 256 and 259 becoming 275.
00:38:29.780 | So every time we replace these tokens,
00:38:32.180 | they become eligible for merging
00:38:33.700 | in the next round of the iteration.
00:38:35.820 | So that's why we're building up a small sort of binary forest
00:38:38.440 | instead of a single individual tree.
00:38:40.240 | One thing we can take a look at as well
00:38:43.020 | is we can take a look at the compression ratio
00:38:44.740 | that we've achieved.
00:38:46.100 | So in particular, we started off with this tokens list.
00:38:50.220 | So we started off with 24,000 bytes
00:38:53.460 | and after merging 20 times,
00:38:56.500 | we now have only 19,000 tokens.
00:39:01.220 | And so therefore, the compression ratio
00:39:02.660 | of simply just dividing the two is roughly 1.27.
00:39:06.220 | So that's the amount of compression we were able to achieve
00:39:08.380 | of this text with only 20 merges.
00:39:11.100 | And of course, the more vocabulary elements you add,
00:39:15.380 | the greater the compression ratio here would be.
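The compression ratio being quoted is just the ratio of the two lengths:

```python
print("tokens length:", len(tokens))   # the raw byte-level tokens, ~24,000
print("ids length:", len(ids))         # after 20 merges, ~19,000
print(f"compression ratio: {len(tokens) / len(ids):.2f}X")   # roughly 1.27X
```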
00:39:18.780 | Finally, so that's kind of like
00:39:22.120 | the training of the tokenizer, if you will.
00:39:25.780 | Now, one point that I wanted to make is that,
00:39:28.340 | and maybe this is a diagram that can help kind of illustrate,
00:39:32.300 | is that tokenizer is a completely separate object
00:39:34.780 | from the large language model itself.
00:39:36.780 | So everything in this lecture,
00:39:37.920 | we're not really touching the LLM itself.
00:39:40.060 | We're just training the tokenizer.
00:39:41.660 | This is a completely separate pre-processing stage usually.
00:39:44.820 | So the tokenizer will have its own training set,
00:39:47.440 | just like a large language model
00:39:48.660 | has a potentially different training set.
00:39:51.540 | So the tokenizer has a training set of documents
00:39:53.300 | on which you're going to train the tokenizer.
00:39:55.740 | And then we're performing the BytePair encoding algorithm,
00:39:59.460 | as we saw above, to train the vocabulary of this tokenizer.
00:40:03.660 | So it has its own training set.
00:40:05.060 | It has a pre-processing stage
00:40:06.380 | that you would run a single time in the beginning.
00:40:08.820 | And the tokenizer is trained
00:40:11.740 | using BytePair encoding algorithm.
00:40:13.820 | Once you have the tokenizer, once it's trained,
00:40:16.060 | and you have the vocabulary,
00:40:17.180 | and you have the merges,
00:40:19.200 | we can do both encoding and decoding.
00:40:22.300 | So these two arrows here.
00:40:24.340 | So the tokenizer is a translation layer between raw text,
00:40:28.200 | which is, as we saw, the sequence of Unicode code points.
00:40:31.840 | It can take raw text and turn it into a token sequence,
00:40:35.520 | and vice versa, it can take a token sequence
00:40:38.080 | and translate it back into raw text.
00:40:40.140 | So now that we have trained the tokenizer,
00:40:44.120 | and we have these merges,
00:40:45.920 | we are going to turn to how we can do the encoding
00:40:48.120 | and the decoding step.
00:40:49.400 | If you give me text, here are the tokens,
00:40:51.200 | and vice versa.
00:40:52.040 | If you give me tokens, here's the text.
00:40:54.240 | Once we have that,
00:40:55.080 | we can translate between these two realms.
00:40:57.320 | And then the language model is going to be trained
00:40:59.180 | as a step two afterwards.
00:41:01.120 | And typically, in a sort of a state-of-the-art application,
00:41:05.280 | you might take all of your training data
00:41:06.680 | for the language model,
00:41:07.800 | and you might run it through the tokenizer
00:41:09.480 | and sort of translate everything
00:41:11.300 | into a massive token sequence.
00:41:12.920 | And then you can throw away the raw text
00:41:14.680 | that you're just left with, the tokens themselves.
00:41:17.040 | And those are stored on disk,
00:41:19.120 | and that is what the larger language model
00:41:20.600 | is actually reading when it's training on them.
00:41:22.960 | So that's one approach that you can take
00:41:24.180 | as a single massive pre-processing stage.
00:41:26.840 | So yeah, basically,
00:41:30.240 | I think the most important thing I want to get across
00:41:31.800 | is that this is a completely separate stage.
00:41:33.440 | It usually has its own entire training set.
00:41:36.360 | You may want to have those training sets be different
00:41:38.440 | between the tokenizer and the larger language model.
00:41:40.720 | So for example, when you're training the tokenizer,
00:41:43.160 | as I mentioned, we don't just care
00:41:44.680 | about the performance of English text.
00:41:46.400 | We care about many different languages.
00:41:49.480 | And we also care about code or not code.
00:41:51.640 | So you may want to look into different kinds of mixtures
00:41:54.600 | of different kinds of languages
00:41:55.960 | and different amounts of code and things like that,
00:41:58.760 | because the amount of different language that you have
00:42:01.940 | in your tokenizer training set
00:42:03.720 | will determine how many merges of it there will be.
00:42:07.280 | And therefore, that determines the density
00:42:09.420 | with which this type of data sort of has in the token space.
00:42:14.420 | And so roughly speaking intuitively,
00:42:18.800 | if you add some amount of data,
00:42:20.020 | like say you have a ton of Japanese data
00:42:21.740 | in your tokenizer training set,
00:42:24.060 | then that means that more Japanese tokens will get merged.
00:42:26.840 | And therefore, Japanese will have shorter sequences.
00:42:29.940 | And that's going to be beneficial
00:42:31.160 | for the large language model,
00:42:32.380 | which has a finite context length
00:42:34.300 | on which it can work on in the token space.
00:42:37.820 | So hopefully that makes sense.
00:42:39.340 | So we're now going to turn to encoding and decoding
00:42:42.120 | now that we have trained a tokenizer.
00:42:44.020 | So we have our merges.
00:42:46.160 | And now how do we do encoding and decoding?
00:42:48.340 | Okay, so let's begin with decoding,
00:42:50.100 | which is this arrow over here.
00:42:51.880 | So given a token sequence,
00:42:53.820 | let's go through the tokenizer
00:42:55.020 | to get back a Python string object, so the raw text.
00:42:59.140 | So this is the function that we'd like to implement.
00:43:01.860 | We're given the list of integers
00:43:03.260 | and we want to return a Python string.
00:43:04.980 | If you'd like, try to implement this function yourself.
00:43:07.420 | It's a fun exercise.
00:43:08.600 | Otherwise, I'm going to start pasting in my own solution.
00:43:12.260 | So there are many different ways to do it.
00:43:15.100 | Here's one way.
00:43:16.420 | I will create a kind of pre-processing variable
00:43:19.460 | that I will call vocab.
00:43:20.780 | And vocab is a mapping or a dictionary in Python
00:43:26.020 | from the token ID to the bytes object for that token.
00:43:31.020 | So we begin with the raw bytes for tokens from zero to 255.
00:43:35.900 | And then we go in order of all the merges
00:43:38.760 | and we sort of populate this vocab list
00:43:41.720 | by doing an addition here.
00:43:43.360 | So this is basically the bytes representation
00:43:46.840 | of the first child followed by the second one.
00:43:49.920 | And remember, these are bytes objects.
00:43:51.760 | So this addition here is an addition of two bytes objects,
00:43:55.240 | just concatenation.
00:43:57.080 | So that's what we get here.
00:43:58.440 | One tricky thing to be careful with, by the way,
00:44:01.520 | is that I'm iterating a dictionary in Python
00:44:04.320 | using a .items.
00:44:05.920 | And it really matters that this runs in the order
00:44:09.400 | in which we inserted items into the merges dictionary.
00:44:13.220 | Luckily, starting with Python 3.7,
00:44:15.100 | this is guaranteed to be the case.
00:44:16.640 | But before Python 3.7,
00:44:18.280 | this iteration may have been out of order
00:44:20.120 | with respect to how we inserted elements into merges,
00:44:22.880 | and this may not have worked.
00:44:24.320 | But we are using modern Python, so we're okay.
00:44:27.320 | And then here, given the IDs,
00:44:31.280 | the first thing we're going to do is get the tokens.
00:44:33.980 | So the way I implemented this here is I'm taking,
00:44:38.960 | I'm iterating over all the IDs.
00:44:40.600 | I'm using vocab to look up their bytes.
00:44:42.880 | And then here, this is one way in Python
00:44:45.000 | to concatenate all these bytes together
00:44:47.640 | to create our tokens.
00:44:49.640 | And then these tokens here at this point are raw bytes.
00:44:53.120 | So I have to decode using UTF-8 now
00:44:56.800 | back into Python strings.
00:44:59.080 | So previously, we called dot encode on a string object
00:45:02.020 | to get the bytes, and now we're doing it opposite.
00:45:04.720 | We're taking the bytes and calling a decode
00:45:07.240 | on the bytes object to get a string in Python.
00:45:10.880 | And then we can return text.
00:45:12.600 | So this is how we can do it.
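For reference, here is a minimal sketch of the decode step just described, assuming the `merges` dictionary produced by the training loop earlier is in scope (other names are illustrative):

```python
# a minimal sketch of decoding; assumes `merges` maps (id1, id2) -> new token id, in insertion order
vocab = {idx: bytes([idx]) for idx in range(256)}      # raw bytes for tokens 0..255
for (p0, p1), idx in merges.items():                   # relies on dict insertion order (Python 3.7+)
    vocab[idx] = vocab[p0] + vocab[p1]                 # concatenation of two bytes objects

def decode(ids):
    # given a list of token ids (integers), return a Python string
    tokens = b"".join(vocab[idx] for idx in ids)       # look up the bytes of each id and concatenate
    text = tokens.decode("utf-8")                      # raw bytes -> string (strict errors, for now)
    return text
```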
00:45:16.800 | Now, this actually has an issue in the way I implemented it,
00:45:21.720 | and this could actually throw an error.
00:45:23.520 | So try to figure out why this code
00:45:26.120 | could actually result in an error
00:45:27.760 | if we plug in some sequence of IDs that is unlucky.
00:45:32.500 | So let me demonstrate the issue.
00:45:35.360 | When I try to decode just something like 97,
00:45:38.160 | I am going to get a letter A here back.
00:45:41.040 | So nothing too crazy happening.
00:45:43.560 | But when I try to decode 128 as a single element,
00:45:48.320 | I don't get a string back, I get an error:
00:45:52.320 | a UnicodeDecodeError.
00:45:54.240 | The UTF-8 codec can't decode byte 0x80,
00:45:57.840 | which is 128 written in hex,
00:46:00.040 | in position zero: invalid start byte.
00:46:01.880 | What does that mean?
00:46:03.120 | Well, to understand what this means,
00:46:04.040 | we have to go back to our UTF-8 page
00:46:06.600 | that I briefly showed earlier,
00:46:08.560 | and this is Wikipedia UTF-8.
00:46:10.640 | And basically, there's a specific schema
00:46:13.140 | that UTF-8 bytes take.
00:46:16.080 | So in particular, if you have a multi-byte object
00:46:19.440 | for some of the Unicode characters,
00:46:20.960 | they have to have this special sort of envelope
00:46:23.160 | in how the encoding works.
00:46:25.800 | And so what's happening here is that invalid start byte,
00:46:29.800 | that's because 128, the binary representation of it,
00:46:33.640 | is one followed by all zeros.
00:46:36.280 | So we have one and then all zero.
00:46:38.800 | And we see here that that doesn't conform to the format
00:46:41.040 | because one followed by all zero
00:46:42.480 | just doesn't fit any of these rules, so to speak.
00:46:45.660 | So it's an invalid start byte.
00:46:49.240 | A byte that starts with a one must have another one following it,
00:46:52.040 | and then a zero following it,
00:46:53.520 | and then the content of your Unicode code point in the x's here.
00:46:57.080 | So basically, we don't exactly follow the UTF-8 standard,
00:47:00.600 | and this cannot be decoded.
00:47:02.400 | And so the way to fix this is to use this errors equals
00:47:07.400 | in bytes.decode function of Python.
00:47:12.800 | And by default, errors is strict.
00:47:14.720 | So we will throw an error
00:47:16.560 | if it's not valid UTF-8 bytes encoding.
00:47:20.360 | But there are many different things
00:47:21.640 | that you could put here on error handling.
00:47:23.680 | This is the full list of all the errors that you can use.
00:47:26.760 | And in particular, instead of strict,
00:47:28.220 | let's change it to replace.
00:47:30.360 | And that will replace with this special marker,
00:47:34.720 | this replacement character.
00:47:36.720 | So errors equals replace.
00:47:40.360 | And now we just get that character back.
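Concretely, the only change to the decode sketch above is the `errors` argument on the decode call:

```python
def decode(ids):
    tokens = b"".join(vocab[idx] for idx in ids)
    # invalid UTF-8 sequences become the U+FFFD replacement character instead of raising
    text = tokens.decode("utf-8", errors="replace")
    return text

decode([128])   # now returns '\ufffd' rather than throwing a UnicodeDecodeError
```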
00:47:42.360 | So basically, not every single byte sequence is valid UTF-8.
00:47:49.440 | And if it happens that your large language model,
00:47:52.040 | for example, predicts your tokens in a bad manner,
00:47:56.080 | then they might not fall into valid UTF-8,
00:47:59.720 | and then we won't be able to decode them.
00:48:02.160 | So the standard practice
00:48:03.720 | is to basically use errors equals replace.
00:48:07.080 | And this is what you will also find in the OpenAI code
00:48:10.320 | that they released as well.
00:48:12.240 | But basically, whenever you see this kind of a character
00:48:14.560 | in your output, in that case, something went wrong,
00:48:17.080 | and the LLM output was not a valid sequence of tokens.
00:48:22.080 | Okay, and now we're going to go the other way.
00:48:24.320 | So we are going to implement this arrow right here,
00:48:27.600 | where we are gonna be given a string,
00:48:29.160 | and we want to encode it into tokens.
00:48:31.000 | So this is the signature of the function
00:48:34.080 | that we're interested in.
00:48:35.640 | And this should basically print a list of integers
00:48:39.080 | of the tokens.
00:48:40.040 | So again, try to maybe implement this yourself
00:48:43.040 | if you'd like a fun exercise, and pause here.
00:48:45.840 | Otherwise, I'm going to start putting in my solution.
00:48:48.160 | So again, there are many ways to do this.
00:48:50.840 | So this is one of the ways that sort of I came up with.
00:48:55.840 | So the first thing we're going to do
00:48:58.640 | is we are going to take our text,
00:49:02.360 | encode it into UTF-8 to get the raw bytes.
00:49:05.120 | And then as before,
00:49:05.960 | we're going to call list on the bytes object
00:49:07.760 | to get a list of integers of those bytes.
00:49:11.680 | So those are the starting tokens.
00:49:13.240 | Those are the raw bytes of our sequence.
00:49:15.640 | But now of course, according to the merges dictionary above,
00:49:19.080 | and recall this was the merges,
00:49:20.760 | some of the bytes may be merged according to this lookup.
00:49:25.960 | In addition to that,
00:49:26.800 | remember that the merges was built from top to bottom.
00:49:29.320 | And this is sort of the order
00:49:30.320 | in which we inserted stuff into merges.
00:49:32.560 | And so we prefer to do all these merges in the beginning
00:49:35.720 | before we do these merges later,
00:49:37.760 | because for example,
00:49:39.520 | this merge over here relies on the 256,
00:49:42.040 | which got merged here.
00:49:44.120 | So we have to go in the order from top to bottom sort of,
00:49:47.360 | if we are going to be merging anything.
00:49:50.000 | Now we expect to be doing a few merges.
00:49:52.160 | So we're going to be doing while true.
00:49:54.080 | And now we want to find a pair of bytes
00:49:59.000 | that is consecutive,
00:50:00.280 | that we are allowed to merge according to this.
00:50:02.760 | In order to reuse some of the functionality
00:50:05.120 | that we've already written,
00:50:06.200 | I'm going to reuse the function get stats.
00:50:09.040 | So recall that get stats will give us the,
00:50:13.040 | will basically count up how many times every single pair
00:50:16.040 | occurs in our sequence of tokens
00:50:18.400 | and return that as a dictionary.
00:50:20.480 | And the dictionary was a mapping
00:50:22.040 | from all the different byte pairs
00:50:26.400 | to the number of times that they occur, right?
00:50:28.680 | At this point, we don't actually care
00:50:30.880 | how many times they occur in the sequence.
00:50:33.040 | We only care what the raw pairs are in that sequence.
00:50:36.680 | And so I'm only going to be using basically
00:50:38.120 | the keys of the dictionary.
00:50:39.800 | I only care about the set of possible merge candidates
00:50:42.920 | if that makes sense.
00:50:43.920 | Now we want to identify the pair
00:50:46.280 | that we're going to be merging at this stage of the loop.
00:50:49.400 | So what do we want?
00:50:50.240 | We want to find the pair or like a key inside stats
00:50:54.800 | that has the lowest index in the merges dictionary,
00:50:59.440 | because we want to do all the early merges
00:51:01.080 | before we work our way to the late merges.
00:51:04.120 | So again, there are many different ways to implement this,
00:51:06.240 | but I'm going to do something a little bit fancy here.
00:51:12.320 | So I'm going to be using the min over an iterator.
00:51:15.960 | In Python, when you call min on an iterator,
00:51:18.200 | and stats here is a dictionary,
00:51:19.920 | we're going to be iterating the keys
00:51:21.400 | of this dictionary in Python.
00:51:23.480 | So we're looking at all the pairs inside stats,
00:51:26.920 | which are all the consecutive pairs.
00:51:30.440 | And we're going to be taking the consecutive pair
00:51:33.080 | inside tokens that has the minimum, what?
00:51:36.200 | The min takes a key, which gives us the function
00:51:40.080 | that is going to return a value
00:51:41.920 | over which we're going to do the min.
00:51:44.120 | And the one we care about is we care about taking merges
00:51:47.400 | and basically getting that pair's index.
00:51:52.400 | So basically for any pair inside stats,
00:51:58.240 | we are going to be looking into merges
00:52:00.800 | at what index it has.
00:52:02.720 | And we want to get the pair with the min number.
00:52:05.800 | So as an example, if there's a pair 101 and 32,
00:52:08.320 | we definitely want to get that pair.
00:52:10.640 | We want to identify it here and return it.
00:52:12.560 | And pair would become 101, 32 if it occurs.
00:52:15.920 | And the reason that I'm putting a float inf here
00:52:19.200 | as a fallback in the get function is that
00:52:22.200 | when we consider a pair
00:52:26.240 | that doesn't occur in the merges,
00:52:28.400 | then that pair is not eligible to be merged, right?
00:52:30.760 | So if in the token sequence
00:52:32.720 | there's some pair that is not a merging pair,
00:52:35.200 | then it doesn't occur in merges,
00:52:36.440 | it doesn't have an index,
00:52:38.640 | and it cannot be merged,
00:52:41.840 | which is what we denote with float inf.
00:52:44.120 | And the reason infinity is nice here
00:52:45.600 | is because for sure we're guaranteed
00:52:47.520 | that it's not going to participate
00:52:48.760 | in the list of candidates when we do the min.
00:52:51.920 | So this is one way to do it.
00:52:54.160 | So basically a long story short,
00:52:56.440 | this returns the most eligible merging candidate pair
00:53:00.800 | that occurs in the tokens.
00:53:02.400 | Now, one thing to be careful with here
00:53:04.560 | is this function here might fail in the following way.
00:53:09.560 | If there's nothing to merge,
00:53:11.280 | then there's nothing in merges that is satisfied anymore.
00:53:16.280 | There's nothing to merge.
00:53:19.280 | Everything just returns float infs.
00:53:21.760 | And then the pair,
00:53:23.120 | I think will just become the very first element of stats.
00:53:26.640 | But this pair is not actually a mergeable pair.
00:53:29.320 | It just becomes the first pair inside stats arbitrarily
00:53:33.040 | because all of these pairs evaluate to float inf
00:53:35.600 | for the merging criterion.
00:53:38.240 | So basically it could be that this doesn't succeed
00:53:40.840 | because there's no more merging pairs.
00:53:42.080 | So if this pair is not in merges that was returned,
00:53:46.080 | then this is a signal for us
00:53:47.280 | that actually there was nothing to merge.
00:53:49.320 | No single pair can be merged anymore.
00:53:51.480 | In that case, we will break out.
00:53:53.080 | Nothing else can be merged.
00:53:56.560 | You may come up with a different implementation, by the way.
00:54:00.760 | This is kind of like really trying hard in Python.
00:54:03.720 | But really we're just trying to find a pair
00:54:07.120 | that can be merged with a lowest index here.
00:54:09.400 | Now, if we did find a pair that is inside merges
00:54:15.360 | with the lowest index, then we can merge it.
00:54:17.760 | So we're going to look into the merges dictionary
00:54:22.520 | for that pair to look up the index.
00:54:25.320 | And we're going to now merge into that index.
00:54:28.680 | So we're going to do tokens equals,
00:54:30.320 | and we're going to replace the original tokens.
00:54:34.520 | We're going to be replacing the pair, pair,
00:54:36.680 | and we're going to be replacing it with index IDX.
00:54:39.080 | And this returns a new list of tokens
00:54:41.640 | where every occurrence of pair is replaced with IDX.
00:54:44.520 | So we're doing a merge.
00:54:46.360 | And we're going to be continuing this
00:54:47.640 | until eventually nothing can be merged.
00:54:49.400 | We'll come out here and we'll break out.
00:54:51.360 | And here we just return tokens.
00:54:53.040 | And so that's the implementation, I think.
00:54:56.880 | So hopefully this runs.
00:54:58.240 | Okay, cool.
00:54:59.680 | Um, yeah, and this looks reasonable.
00:55:02.960 | So for example, 32 is a space in ASCII, so that's here.
00:55:06.600 | So this looks like it worked, great.
00:55:10.720 | Okay, so let's wrap up this section of the video at least.
00:55:13.480 | I wanted to point out that this is not quite
00:55:15.160 | the right implementation just yet
00:55:16.320 | because we are leaving out a special case.
00:55:18.760 | So in particular, if we tried to do this,
00:55:21.600 | this would give us an error.
00:55:23.280 | And the issue is that if we only have a single character
00:55:26.360 | or an empty string, then stats is empty
00:55:29.000 | and that causes an issue inside min.
00:55:31.080 | So one way to fight this is if len of tokens is at least two
00:55:36.080 | because if it's less than two,
00:55:37.320 | it's just a single token or no tokens,
00:55:38.880 | then let's just, there's nothing to merge,
00:55:41.040 | so we just return.
00:55:42.440 | So that would fix that case.
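Putting it all together, here is a sketch of the encode step as described, including the length guard just mentioned. It restates the `get_stats` and `merge` helpers from earlier in the video so it reads on its own, and it assumes the trained `merges` dictionary is in scope:

```python
def get_stats(ids):
    # count consecutive pairs, e.g. [1, 2, 3, 1, 2] -> {(1, 2): 2, (2, 3): 1, (3, 1): 1}
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # replace every occurrence of `pair` in `ids` with the new token id `idx`
    newids, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

def encode(text):
    tokens = list(text.encode("utf-8"))         # raw bytes as integers in 0..255
    while len(tokens) >= 2:                     # with fewer than two tokens there is nothing to merge
        stats = get_stats(tokens)               # we only use the keys: the candidate consecutive pairs
        # pick the candidate with the lowest merge index; non-mergeable pairs fall back to +inf
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break                               # no pair in the sequence is mergeable anymore
        idx = merges[pair]
        tokens = merge(tokens, pair, idx)       # do the merge, earliest-learned merges first
    return tokens
```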
00:55:44.280 | Okay.
00:55:46.440 | And then second, I have a few test cases here
00:55:48.880 | for us as well.
00:55:50.040 | So first let's make sure about, or let's note the following.
00:55:54.120 | If we take a string and we try to encode it
00:55:57.400 | and then decode it back,
00:55:58.800 | you'd expect to get the same string back, right?
00:56:01.080 | Is that true for all strings?
00:56:02.600 | So here, it is the case.
00:56:07.560 | And I think in general, this is probably the case,
00:56:10.040 | but notice that going backwards is not,
00:56:14.440 | you're not going to have an identity going backwards
00:56:16.640 | because as I mentioned,
00:56:18.280 | not all token sequences are valid UTF-8 sort of byte streams.
00:56:24.760 | And so therefore, some of them can't even be decodable.
00:56:27.600 | So this only goes in one direction,
00:56:31.480 | but for that one direction, we can check here.
00:56:34.360 | If we take the training text,
00:56:35.680 | which is the text that we trained the tokenizer on,
00:56:37.760 | we can make sure that when we encode and decode,
00:56:39.440 | we get the same thing back, which is true.
00:56:42.040 | And here I took some validation data.
00:56:43.600 | So I went to, I think this webpage and I grabbed some text.
00:56:47.120 | So this is text that the tokenizer has not seen,
00:56:49.440 | and we can make sure that this also works.
00:56:52.640 | Okay, so that gives us some confidence
00:56:53.960 | that this was correctly implemented.
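The sanity checks described here look roughly like this, where `train_text` and `val_text` are assumed to hold the training text and some unseen text respectively:

```python
assert decode(encode("hello world")) == "hello world"
assert decode(encode(train_text)) == train_text   # text the tokenizer was trained on
assert decode(encode(val_text)) == val_text       # text the tokenizer has never seen
# note: encode(decode(ids)) == ids is NOT guaranteed for arbitrary id sequences,
# because not every token sequence decodes to valid UTF-8 in the first place
```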
00:56:56.040 | So those are the basics of the byte-pair encoding algorithm.
00:56:59.160 | We saw how we can take some training set,
00:57:02.160 | train a tokenizer.
00:57:03.720 | The parameters of this tokenizer really
00:57:05.480 | are just this dictionary of merges.
00:57:07.960 | And that basically creates the little binary forest
00:57:10.080 | on top of raw bytes.
00:57:11.360 | Once we have this, the merges table,
00:57:14.720 | we can both encode and decode
00:57:16.560 | between raw text and token sequences.
00:57:19.040 | So that's the simplest setting of the tokenizer.
00:57:22.360 | What we're going to do now though,
00:57:23.480 | is we're going to look at some of the state-of-the-art
00:57:25.360 | large language models and the kinds of tokenizers
00:57:27.520 | that they use, and we're going to see
00:57:29.040 | that this picture complexifies very quickly.
00:57:31.200 | So we're going to go through the details
00:57:33.360 | of this complexification one at a time.
00:57:37.280 | So let's kick things off by looking at the GPT series.
00:57:40.040 | So in particular, I have the GPT-2 paper here,
00:57:42.460 | and this paper is from 2019 or so, so five years ago.
00:57:48.000 | And let's scroll down to input representation.
00:57:51.220 | This is where they talk about the tokenizer
00:57:52.720 | that they're using for GPT-2.
00:57:55.160 | Now, this is all fairly readable,
00:57:56.520 | so I encourage you to pause and read this yourself,
00:58:00.000 | but this is where they motivate the use
00:58:01.500 | of the byte-pair encoding algorithm
00:58:03.960 | on the byte-level representation of UTF-8 encoding.
00:58:08.700 | So this is where they motivate it,
00:58:10.040 | and they talk about the vocabulary sizes and everything.
00:58:13.120 | Now, everything here is exactly as we've covered it so far,
00:58:16.040 | but things start to depart around here.
00:58:18.700 | So what they mention is that they don't just apply
00:58:21.040 | the Naive algorithm as we have done it.
00:58:23.520 | And in particular, here's a motivating example.
00:58:26.240 | Suppose that you have common words like dog.
00:58:28.440 | What will happen is that dog, of course,
00:58:30.080 | occurs very frequently in the text,
00:58:32.620 | and it occurs right next to all kinds of punctuation,
00:58:35.300 | as an example, so dog dot, dog exclamation mark,
00:58:38.920 | dog question mark, et cetera.
00:58:40.800 | And naively, you might imagine that the BPE algorithm
00:58:43.520 | could merge these to be single tokens,
00:58:45.680 | and then you end up with lots of tokens
00:58:47.280 | that are just like dog with a slightly different punctuation.
00:58:50.080 | And so it feels like you're clustering things
00:58:51.680 | that shouldn't be clustered.
00:58:52.520 | You're combining kind of semantics with punctuation.
00:58:55.420 | And this feels suboptimal.
00:58:58.840 | And indeed, they also say that this is suboptimal
00:59:01.520 | according to some of the experiments.
00:59:03.440 | So what they want to do is they want to top down
00:59:05.380 | in a manual way, enforce that some types of characters
00:59:09.640 | should never be merged together.
00:59:11.580 | So they want to enforce these merging rules
00:59:14.840 | on top of the BytePair encoding algorithm.
00:59:17.720 | So let's take a look at their code
00:59:19.920 | and see how they actually enforce this
00:59:21.480 | and what kinds of mergers they actually do perform.
00:59:24.280 | So I have the tab open here for GPT-2
00:59:27.280 | under OpenAI on GitHub.
00:59:29.520 | And when we go to source, there is an encoder.py.
00:59:34.000 | Now, I don't personally love that they called encoder.py
00:59:36.320 | because this is the tokenizer,
00:59:38.120 | and the tokenizer can do both encode and decode.
00:59:41.080 | So it feels kind of awkward to me that it's called encoder,
00:59:43.120 | but that is the tokenizer.
00:59:45.520 | And there's a lot going on here,
00:59:46.680 | and we're gonna step through it in detail at one point.
00:59:49.280 | For now, I just want to focus on this part here.
00:59:52.260 | They create a regex pattern here
00:59:55.120 | that looks very complicated,
00:59:56.200 | and we're gonna go through it in a bit.
00:59:58.680 | But this is the core part that allows them to enforce rules
01:00:02.660 | for what parts of the text will never be merged for sure.
01:00:06.040 | Now, notice that re.compile here is a little bit misleading
01:00:09.640 | because we're not just doing import re,
01:00:12.120 | which is the Python re module,
01:00:13.580 | we're doing import regex as re.
01:00:15.640 | And regex is a Python package that you can install,
01:00:18.720 | pip install regex,
01:00:20.160 | and it's basically an extension of re,
01:00:21.800 | so it's a bit more powerful re.
01:00:23.360 | So let's take a look at this pattern and what it's doing
01:00:29.700 | and why this is actually doing the separation
01:00:32.280 | that they are looking for.
01:00:33.880 | Okay, so I've copy pasted the pattern here
01:00:35.820 | to our Jupyter Notebook where we left off,
01:00:38.280 | and let's take this pattern for a spin.
01:00:40.700 | So in the exact same way that their code does,
01:00:43.400 | we're going to call an re.findAll for this pattern
01:00:47.000 | on any arbitrary string that we are interested in.
01:00:49.440 | So this is the string that we want to encode into tokens
01:00:52.240 | to feed into an LLM, like GPT-2.
01:00:56.800 | So what exactly is this doing?
01:00:59.040 | Well, re.findAll will take this pattern
01:01:01.120 | and try to match it against this string.
01:01:03.120 | The way this works is that you are going from left to right
01:01:07.760 | in the string, and you're trying to match the pattern.
01:01:11.320 | And re.findAll will get all the occurrences
01:01:15.060 | and organize them into a list.
01:01:16.560 | Now, when you look at this pattern,
01:01:20.500 | first of all, notice that this is a raw string,
01:01:22.940 | and then these are three double quotes
01:01:26.220 | just to start the string.
01:01:27.820 | So really the string itself,
01:01:29.400 | this is the pattern itself, right?
01:01:31.060 | And notice that it's made up of a lot of ORs.
01:01:35.260 | So see these vertical bars?
01:01:36.500 | Those are ORs in Regex.
01:01:38.480 | And so you go from left to right in this pattern
01:01:41.460 | and try to match it against the string wherever you are.
01:01:44.540 | So we have hello, and we're gonna try to match it.
01:01:48.040 | Well, it's not apostrophe S,
01:01:49.560 | it's not apostrophe T, or any of these,
01:01:52.520 | but it is an optional space followed by dash P of,
01:01:57.220 | sorry, slash P of L one or more times.
01:02:00.340 | What is slash P of L?
01:02:01.940 | It is, according to some documentation that I found,
01:02:05.260 | There might be other sources as well.
01:02:09.200 | Slash P of L is a letter,
01:02:11.220 | any kind of letter from any language.
01:02:13.660 | And hello is made up of letters, H-E-L-L-O, et cetera.
01:02:18.340 | So optional space followed by a bunch of letters,
01:02:21.520 | one or more letters, is going to match hello,
01:02:24.740 | but then the match ends
01:02:26.260 | because a white space is not a letter.
01:02:28.740 | So from there on begins a new sort of attempt
01:02:33.220 | to match against the string again.
01:02:35.780 | And starting in here,
01:02:37.080 | we're gonna skip over all of these again
01:02:39.140 | until we get to the exact same point again.
01:02:41.580 | And we see that there's an optional space,
01:02:43.540 | this is the optional space,
01:02:44.980 | followed by a bunch of letters, one or more of them.
01:02:47.140 | And so that matches.
01:02:48.620 | So when we run this, we get a list of two elements,
01:02:52.620 | hello, and then space world.
01:02:54.580 | So how are you if we add more letters?
01:02:58.840 | We would just get them like this.
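For reference, the pattern from OpenAI's released encoder.py and the kind of experiment being run here look roughly like this (note it needs `pip install regex`, since the `\p{...}` classes are not supported by the standard `re` module):

```python
import regex as re

# the chunking pattern from the GPT-2 encoder.py
gpt2pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

print(re.findall(gpt2pat, "Hello world how are you"))
# -> ['Hello', ' world', ' how', ' are', ' you']
```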
01:03:01.180 | Now, what is this doing and why is this important?
01:03:03.700 | We are taking our string
01:03:05.460 | and instead of directly encoding it for tokenization,
01:03:10.020 | we are first splitting it up.
01:03:12.180 | And when you actually step through the code,
01:03:14.140 | and we'll do that in a bit more detail,
01:03:16.320 | what really it's doing on a high level
01:03:17.740 | is that it first splits your text into a list of texts,
01:03:22.740 | just like this one.
01:03:24.620 | And all these elements of this list
01:03:26.420 | are processed independently by the tokenizer.
01:03:29.220 | And all of the results of that processing
01:03:31.340 | are simply concatenated.
01:03:33.240 | So hello world, oh, I missed how.
01:03:36.660 | Hello world, how are you?
01:03:39.580 | We have five elements of a list.
01:03:41.180 | All of these will independently go from text
01:03:46.180 | to a token sequence.
01:03:48.060 | And then that token sequence is going to be concatenated.
01:03:50.580 | It's all gonna be joined up.
01:03:52.700 | And roughly speaking, what that does
01:03:54.940 | is you're only ever finding merges
01:03:57.420 | within the elements of this list.
01:03:59.240 | So you can only ever consider merges
01:04:00.740 | within every one of these elements individually.
01:04:03.200 | And after you've done all the possible merging
01:04:07.540 | for all of these elements individually,
01:04:09.460 | the results of all that will be joined by concatenation.
01:04:14.460 | And so you are basically,
01:04:16.220 | what you're doing effectively
01:04:17.820 | is you are never going to be merging this E
01:04:20.980 | with this space because they are now parts
01:04:23.500 | of the separate elements of this list.
01:04:25.860 | And so you are saying we are never gonna merge E space
01:04:29.260 | because we're breaking it up in this way.
01:04:33.580 | So basically using this regex pattern to chunk up the text
01:04:37.420 | is just one way of enforcing
01:04:39.580 | that some merges are not to happen.
01:04:43.100 | And we're gonna go into more of this text
01:04:44.660 | and we'll see that what this is trying to do
01:04:46.000 | on a high level is we're trying to not merge
01:04:48.100 | across letters, across numbers,
01:04:50.220 | across punctuation, and so on.
01:04:52.700 | So let's see in more detail how that works.
01:04:54.500 | So let's continue now.
01:04:55.940 | We have slash P of N.
01:04:57.660 | If you go to the documentation,
01:04:59.740 | slash P of N is any kind of numeric character
01:05:03.380 | in any script.
01:05:04.420 | So it's numbers.
01:05:05.940 | So we have an optional space followed by numbers
01:05:07.940 | and those would be separated out.
01:05:09.740 | So letters and numbers are being separated.
01:05:11.540 | So if I do, hello world, one, two, three, how are you?
01:05:14.980 | Then world will stop matching here
01:05:16.980 | because one is not a letter anymore,
01:05:19.580 | but one is a number.
01:05:20.740 | So this group will match for that
01:05:22.520 | and we'll get it as a separate entity.
01:05:24.460 | Let's see how these apostrophes work.
01:05:28.420 | So here, if we have apostrophe V
01:05:34.180 | as an example,
01:05:36.660 | then apostrophe here is not a letter or a number.
01:05:39.700 | So hello will stop matching
01:05:42.420 | and then we will exactly match this with that.
01:05:45.940 | So that will come out as a separate thing.
01:05:48.340 | So why are they doing the apostrophes here?
01:05:51.620 | Honestly, I think that these are just
01:05:52.740 | like very common apostrophes that are used typically.
01:05:57.740 | I don't love that they've done this
01:05:59.620 | because let me show you what happens
01:06:03.420 | when you have some Unicode apostrophes.
01:06:06.380 | Like for example, you can have,
01:06:08.140 | if you have house, then this will be separated out
01:06:11.180 | because of this matching.
01:06:13.140 | But if you use the Unicode apostrophe like this,
01:06:16.200 | then suddenly this does not work.
01:06:19.940 | And so this apostrophe will actually
01:06:21.660 | become its own thing now.
01:06:23.740 | And so it's basically hard-coded
01:06:25.980 | for this specific kind of apostrophe
01:06:28.260 | and otherwise they become completely separate tokens.
01:06:33.220 | In addition to this, you can go to the GPT-2 docs
01:06:37.680 | and here when they define the pattern,
01:06:39.460 | they say they should have added re.IGNORECASE,
01:06:42.300 | so BPE merges can happen
01:06:43.480 | for capitalized versions of contractions.
01:06:45.560 | So what they're pointing out is that
01:06:47.180 | you see how this is apostrophe and then lowercase letters?
01:06:50.820 | Well, because they didn't do re.ignoreCase,
01:06:53.860 | then these rules will not separate out the apostrophes
01:06:57.940 | if it's uppercase.
01:06:59.820 | So house would be like this,
01:07:03.940 | but if I did house from uppercase,
01:07:09.020 | then notice suddenly the apostrophe comes by itself.
01:07:12.280 | So the tokenization will work differently
01:07:16.220 | in uppercase and lowercase,
01:07:17.560 | inconsistently separating out these apostrophes.
01:07:19.940 | So it feels extremely gnarly and slightly gross,
01:07:22.380 | but that's how that works.
01:07:25.780 | Okay, so let's come back.
01:07:27.320 | After trying to match a bunch of apostrophe expressions,
01:07:29.960 | by the way, the other issue here is that
01:07:31.100 | these are quite language-specific probably.
01:07:33.340 | So I don't know that all the languages, for example,
01:07:35.940 | use or don't use apostrophes,
01:07:37.180 | but that would be inconsistently tokenized as a result.
01:07:41.060 | Then we try to match letters.
01:07:42.500 | Then we try to match numbers.
01:07:44.340 | And then if that doesn't work, we fall back to here.
01:07:47.540 | And what this is saying is, again,
01:07:48.820 | optional space followed by something
01:07:50.340 | that is not a letter, number, or a space,
01:07:53.580 | and one or more of that.
01:07:55.060 | So what this is doing effectively
01:07:56.300 | is this is trying to match punctuation, roughly speaking,
01:07:59.100 | not letters and not numbers.
01:08:01.140 | So this group will try to trigger for that.
01:08:03.380 | So if I do something like this,
01:08:05.600 | then these parts here are not letters or numbers,
01:08:09.540 | but they will actually get caught here.
01:08:13.540 | And so they become its own group.
01:08:15.760 | So we've separated out the punctuation.
01:08:18.420 | And finally, this is also a little bit confusing.
01:08:21.660 | So this is matching whitespace,
01:08:24.140 | but this is using a negative lookahead assertion in regex.
01:08:29.060 | So what this is doing is it's matching whitespace up to,
01:08:32.140 | but not including the last whitespace character.
01:08:34.820 | Why is this important?
01:08:37.900 | This is pretty subtle, I think.
01:08:39.440 | So you see how the whitespace is always included
01:08:41.660 | at the beginning of the word.
01:08:43.460 | So space R, space U, et cetera.
01:08:47.340 | Suppose we have a lot of spaces here.
01:08:49.180 | What's gonna happen here is that these spaces up to
01:08:53.740 | and not including the last character will get caught by this.
01:08:58.020 | And what that will do is it will separate out the spaces
01:09:01.180 | up to but not including the last character
01:09:03.300 | so that the last character can come here
01:09:05.620 | and join with the space U.
01:09:08.620 | And the reason that's nice is because space U
01:09:11.220 | is the common token.
01:09:12.580 | So if I didn't have these extra spaces here,
01:09:15.060 | you would just have space U.
01:09:16.500 | And if I add spaces, we still have a space U,
01:09:20.660 | but now we have all this extra whitespace.
01:09:23.100 | So basically the GPT-2 tokenizer really likes
01:09:25.060 | to have a space before letters or numbers,
01:09:27.780 | and it prepends these spaces to them.
01:09:30.320 | And this is just something that it is consistent about.
01:09:33.020 | So that's what that is for.
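As a quick demonstration of that negative lookahead, reusing `gpt2pat` from the sketch above:

```python
# a run of spaces matches \s+(?!\S), i.e. everything except the last space,
# so that the final space stays attached to the following word as ' world'
print(re.findall(gpt2pat, "Hello" + " " * 6 + "world"))
# -> ['Hello', '     ', ' world']   (the middle element is the first five spaces)
```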
01:09:34.460 | And then finally, we have all the,
01:09:36.300 | the last fallback is whitespace characters.
01:09:39.780 | So that would be just,
01:09:42.660 | if that doesn't get caught,
01:09:46.100 | then this thing will catch any trailing spaces and so on.
01:09:50.100 | I wanted to show one more real world example here.
01:09:52.740 | So if we have this string, which is a piece of Python code,
01:09:55.340 | and then we try to split it up,
01:09:57.300 | then this is the kind of output we get.
01:09:59.500 | So you'll notice that the list has many elements here,
01:10:01.580 | and that's because we are splitting up fairly often,
01:10:04.900 | every time sort of a category changes.
01:10:07.440 | So there will never be any merges across these elements.
01:10:11.940 | And that's what you are seeing here.
01:10:14.780 | Now, you might think that in order to train the tokenizer,
01:10:19.020 | OpenAI has used this to split up text into chunks,
01:10:23.220 | and then run just a BP algorithm within all the chunks.
01:10:26.820 | But that's not exactly what happened.
01:10:28.580 | And the reason is the following.
01:10:30.420 | Notice that we have the spaces here.
01:10:33.340 | Those spaces end up being entire elements.
01:10:36.440 | But these spaces never actually end up being merged
01:10:39.660 | by OpenAI.
01:10:40.500 | And the way you can tell is that
01:10:42.060 | if you copy paste the exact same chunk here
01:10:43.740 | into the Tiktokenizer web app,
01:10:46.900 | you see that all the spaces are kept independent,
01:10:49.260 | and they're all token 220.
01:10:50.840 | So I think OpenAI at some point enforced some rule
01:10:55.240 | that these spaces would never be merged.
01:10:57.580 | And so there's some additional rules
01:11:00.420 | on top of just chunking and BP
01:11:02.920 | that OpenAI is not clear about.
01:11:05.400 | Now, the training code for the GPT-2 tokenizer
01:11:07.380 | was never released.
01:11:08.540 | So all we have is the code that I've already shown you.
01:11:12.280 | But this code here that they've released
01:11:14.020 | is only the inference code for the tokens.
01:11:17.300 | So this is not the training code.
01:11:18.620 | You can't give it a piece of text and train tokenizer.
01:11:21.620 | This is just the inference code,
01:11:23.020 | which takes the merges that we have up above
01:11:26.340 | and applies them to a new piece of text.
01:11:29.260 | And so we don't know exactly how OpenAI trained
01:11:31.420 | the tokenizer, but it wasn't as simple
01:11:34.460 | as chunk it up and BP it, whatever it was.
01:11:38.340 | Next, I wanted to introduce you
01:11:39.860 | to the tiktoken library from OpenAI,
01:11:42.220 | which is the official library
01:11:43.980 | for tokenization from OpenAI.
01:11:46.200 | So this is tiktoken: pip install tiktoken,
01:11:50.180 | and then you can do the tokenization inference.
01:11:54.300 | So this is again, not training code.
01:11:55.740 | This is only inference code for tokenization.
01:11:57.980 | I wanted to show you how you would use it, quite simple.
01:12:02.180 | And running this just gives us the GPT-2 tokens
01:12:04.860 | or the GPT-4 tokens.
01:12:06.620 | So this is the tokenizer used for GPT-4.
01:12:09.420 | And so in particular, we see that the whitespace in GPT-2,
01:12:12.080 | it remains unmerged, but in GPT-4,
01:12:14.500 | these whitespaces merge, as we also saw in this one,
01:12:18.740 | where here they're all unmerged,
01:12:20.140 | but if we go down to GPT-4, they become merged.
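The usage being shown is roughly the following; the whitespace handling is where the two encodings visibly differ:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")            # GPT-2 tokenizer: runs of spaces stay unmerged
print(enc.encode("    hello world!!!"))

enc = tiktoken.get_encoding("cl100k_base")     # GPT-4 tokenizer: runs of spaces get merged
print(enc.encode("    hello world!!!"))
```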
01:12:23.780 | Now in the GPT-4 tokenizer,
01:12:30.060 | they changed the regular expression
01:12:32.320 | that they use to chunk up text.
01:12:34.700 | So the way to see this is that if you come
01:12:36.200 | to the tiktoken library,
01:12:39.500 | and then you go to this file, tiktoken_ext/openai_public.py,
01:12:43.800 | this is where sort of like the definition
01:12:45.360 | of all these different tokenizers that OpenAI maintains is.
01:12:48.680 | And so necessarily to do the inference,
01:12:51.120 | they had to publish some of the details about the strings.
01:12:54.020 | So this is the string that we already saw for GPT-2.
01:12:57.080 | It is slightly different, but it is actually equivalent
01:13:00.000 | to what we discussed here.
01:13:02.020 | So this pattern that we discussed
01:13:03.520 | is equivalent to this pattern,
01:13:05.600 | and this one just executes a little bit faster.
01:13:08.320 | So here you see a little bit
01:13:09.360 | of a slightly different definition,
01:13:10.620 | but otherwise it's the same.
01:13:12.560 | We're gonna go into special tokens in a bit.
01:13:15.160 | And then if you scroll down to cl100k_base,
01:13:18.440 | this is the GPT-4 tokenizer,
01:13:20.280 | you see that the pattern has changed.
01:13:22.240 | And this is kind of like the major change
01:13:26.000 | in addition to a bunch of other special tokens,
01:13:27.800 | which we'll go into a bit again.
01:13:29.400 | Now, I'm not going to actually go into the full detail
01:13:32.840 | of the pattern change,
01:13:33.680 | because honestly, this is mind-numbing.
01:13:35.960 | I would just advise that you pull up ChatGPT
01:13:38.680 | and the regex documentation and just step through it.
01:13:42.240 | But really the major changes are,
01:13:44.280 | number one, you see this I here?
01:13:47.320 | That means that the case sensitivity,
01:13:50.800 | this is case insensitive match.
01:13:52.680 | And so the comment that we saw earlier on,
01:13:55.600 | oh, we should have added re.IGNORECASE.
01:13:58.400 | Basically, we're now going to be matching
01:14:01.640 | these apostrophe S, apostrophe D, apostrophe M, et cetera.
01:14:06.640 | We're gonna be matching them both in lowercase
01:14:08.200 | and in uppercase.
01:14:09.640 | So that's fixed.
01:14:11.080 | There's a bunch of different handling of the whitespace
01:14:13.400 | that I'm not going to go into the full details of.
01:14:15.800 | And then one more thing here is you will notice
01:14:18.600 | that when they match the numbers,
01:14:20.240 | they only match one to three numbers.
01:14:23.200 | So they will never merge numbers
01:14:27.080 | that are in more than three digits,
01:14:29.840 | only up to three digits of numbers will ever be merged.
01:14:33.920 | And that's one change that they made as well
01:14:36.360 | to prevent tokens that are very, very long number sequences.
01:14:40.120 | But again, we don't really know why they do
01:14:43.000 | any of this stuff because none of this is documented
01:14:45.880 | and it's just, we just get the pattern.
01:14:48.360 | So yeah, it is what it is.
01:14:51.440 | But those are some of the changes that GPT-4 has made.
01:14:54.400 | And of course, the vocabulary size went
01:14:56.400 | from roughly 50K to roughly 100K.
01:14:59.480 | The next thing I would like to do very briefly
01:15:01.120 | is to take you through the GPT-2 encoder.py
01:15:03.880 | that OpenAI has released.
01:15:05.360 | This is the file that I already mentioned to you briefly.
01:15:09.280 | Now, this file is fairly short
01:15:12.560 | and should be relatively understandable
01:15:14.240 | to you at this point.
01:15:15.280 | Starting at the bottom here, they are loading two files,
01:15:21.000 | encoder.json and vocab.bpe.
01:15:23.680 | And they do some light processing on it
01:15:25.120 | and then they call this encoder object,
01:15:26.840 | which is the tokenizer.
01:15:28.640 | Now, if you'd like to inspect these two files,
01:15:31.160 | which together constitute their saved tokenizer,
01:15:34.480 | then you can do that with a piece of code like this.
01:15:37.080 | This is where you can download these two files
01:15:40.040 | and you can inspect them if you'd like.
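A sketch of what that inspection might look like; the URLs here are my assumption of where the GPT-2 (124M) files live, based on OpenAI's model download script, so treat them as such:

```python
import json
import os
import urllib.request

base = "https://openaipublic.blob.core.windows.net/gpt-2/models/124M"   # assumed location
for fname in ("encoder.json", "vocab.bpe"):
    if not os.path.exists(fname):
        urllib.request.urlretrieve(f"{base}/{fname}", fname)

with open("encoder.json") as f:
    encoder = json.load(f)        # ~50,257 entries: token string -> integer id (their "encoder" ~ our vocab, inverted)

with open("vocab.bpe", encoding="utf-8") as f:
    bpe_data = f.read()           # the merge rules, one pair per line (their vocab.bpe ~ our merges)

print(len(encoder))
print(bpe_data.splitlines()[1:4])  # a few of the first merges (line 0 is a version header)
```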
01:15:41.800 | And what you will find is that this encoder,
01:15:44.440 | as they call it in their code,
01:15:45.960 | is exactly equivalent to our vocab.
01:15:48.680 | So remember here where we have this vocab object,
01:15:52.840 | which allowed us to decode very efficiently.
01:15:55.000 | And basically it took us from the integer
01:15:57.560 | to the bytes for that integer.
01:16:01.600 | So our vocab is exactly their encoder.
01:16:05.000 | And then their vocab.bpe, confusingly,
01:16:08.720 | is actually our merges.
01:16:11.000 | So their bpe_merges variable,
01:16:12.560 | which is based on the data inside vocab.bpe,
01:16:16.240 | ends up being equivalent to our merges.
01:16:18.920 | So basically they are saving and loading
01:16:22.880 | the two variables that for us are also critical,
01:16:26.320 | the merges variable and the vocab variable.
01:16:29.000 | Using just these two variables,
01:16:30.920 | you can represent a tokenizer
01:16:32.400 | and you can both do encoding and decoding
01:16:34.320 | once you've trained this tokenizer.
01:16:36.160 | Now, the only thing that is actually slightly confusing
01:16:41.480 | inside what OpenAI does here
01:16:43.840 | is that in addition to this encoder and the decoder,
01:16:46.560 | they also have something called a byte encoder
01:16:48.480 | and a byte decoder.
01:16:50.280 | And this is actually, unfortunately,
01:16:51.920 | just kind of a spurious implementation detail.
01:16:55.960 | It isn't actually deep or interesting in any way,
01:16:58.240 | so I'm going to skip the discussion of it.
01:17:00.280 | But what OpenAI does here,
01:17:01.400 | for reasons that I don't fully understand,
01:17:03.000 | is that not only do they have this tokenizer,
01:17:05.720 | which can encode and decode,
01:17:07.200 | but they have a whole separate layer here in addition
01:17:09.520 | that is used serially with the tokenizer.
01:17:12.000 | And so you first do byte encode and then encode,
01:17:16.080 | and then you do decode and then byte decode.
01:17:18.560 | So that's the loop,
01:17:19.920 | and they are just stacked serial on top of each other.
01:17:22.960 | And it's not that interesting,
01:17:24.560 | so I won't cover it, and you can step through it
01:17:26.080 | if you'd like.
01:17:27.360 | Otherwise, this file,
01:17:28.520 | if you ignore the byte encoder and the byte decoder,
01:17:30.800 | will be algorithmically very familiar to you.
01:17:33.120 | And the meat of it here is what they call BPE function.
01:17:37.080 | And you should recognize this loop here,
01:17:39.640 | which is very similar to our own while loop,
01:17:42.040 | where they're trying to identify the bigram, a pair,
01:17:46.000 | that they should be merging next.
01:17:48.000 | And then here, just like we had,
01:17:49.680 | they have a for loop trying to merge this pair.
01:17:52.320 | So they will go over all of the sequence,
01:17:54.120 | and they will merge the pair whenever they find it.
01:17:57.080 | And they keep repeating that
01:17:58.440 | until they run out of possible merges in the text.
01:18:02.240 | So that's the meat of this file.
01:18:03.920 | And there's an encode and decode function,
01:18:06.040 | just like we have implemented it.
01:18:07.880 | So long story short,
01:18:08.840 | what I want you to take away at this point is that,
01:18:11.040 | unfortunately, it's a little bit of a messy code
01:18:12.680 | that they have, but algorithmically,
01:18:14.280 | it is identical to what we've built up above.
01:18:17.080 | And what we've built up above, if you understand it,
01:18:19.440 | is algorithmically what is necessary
01:18:21.440 | to actually build a BPE tokenizer, train it,
01:18:24.440 | and then both encode and decode.
01:18:26.920 | The next topic I would like to turn to
01:18:28.280 | is that of special tokens.
01:18:30.360 | So in addition to tokens that are coming from raw bytes
01:18:33.400 | and the BPE merges,
01:18:35.000 | we can insert all kinds of tokens
01:18:36.480 | that we are going to use
01:18:37.480 | to delimit different parts of the data
01:18:39.880 | or introduce in order to create a special structure
01:18:42.320 | of the token streams.
01:18:44.760 | So if you look at this encoder object
01:18:47.960 | from OpenAI's GPT-2 right here,
01:18:50.920 | we mentioned this is very similar to our vocab.
01:18:53.280 | You'll notice that the length of this is 50,257.
01:18:57.560 | As I mentioned, it's mapping,
01:19:01.920 | and it's inverted from the mapping of our vocab.
01:19:04.240 | Our vocab goes from integer to string,
01:19:07.000 | and they go the other way around for no amazing reason.
01:19:10.360 | But the thing to note here is that
01:19:12.880 | the mapping table here is 50,257.
01:19:16.080 | Where does that number come from?
01:19:18.120 | Where are the tokens?
01:19:20.440 | As I mentioned, there are 256 raw byte tokens.
01:19:23.640 | And then OpenAI actually did 50,000 merges.
01:19:28.720 | So those become the other tokens.
01:19:32.040 | But this would have been 50,256.
01:19:35.040 | So what is the 57th token?
01:19:37.600 | And there is basically one special token.
01:19:39.680 | And that one special token,
01:19:43.040 | you can see it's called end of text.
01:19:46.600 | So this is a special token,
01:19:48.080 | and it's the very last token.
01:19:50.560 | And this token is used to delimit documents
01:19:53.240 | in the training set.
01:19:55.040 | So when we're creating the training data,
01:19:57.320 | we have all these documents,
01:19:58.520 | and we tokenize them and get a stream of tokens.
01:20:01.600 | Those tokens only range from zero up to 50,255.
01:20:06.280 | And then in between those documents,
01:20:08.000 | we put special end of text token.
01:20:10.560 | And we insert that token in between documents.
01:20:14.400 | And we are using this as a signal to the language model
01:20:17.800 | that the document has ended,
01:20:19.800 | and what follows is going to be unrelated
01:20:22.080 | to the document previously.
01:20:24.320 | That said, the language model has to learn this from data.
01:20:26.920 | It needs to learn that this token usually means
01:20:29.720 | that it should wipe its sort of memory of what came before.
01:20:33.240 | And what came before this token
01:20:34.520 | is not actually informative to what comes next.
01:20:36.960 | But we are expecting the language model
01:20:38.240 | to just like learn this,
01:20:39.680 | but we're giving it the special sort of delimiter
01:20:42.280 | of these documents.
01:20:44.160 | We can go here to the Tiktokenizer,
01:20:45.800 | and this is the GPT-2 tokenizer,
01:20:48.800 | our code that we've been playing with before.
01:20:51.080 | So we can add here, right?
01:20:52.080 | Hello world, how are you?
01:20:54.360 | And we're getting different tokens.
01:20:55.880 | But now you can see what happens if I put end of text.
01:20:59.640 | You see how until I finished it,
01:21:03.200 | these are all different tokens.
01:21:04.960 | End of text is still a set of separate tokens.
01:21:08.560 | And now when I finish it,
01:21:10.000 | suddenly we get token 50,256.
01:21:14.360 | And the reason this works is because
01:21:16.920 | this didn't actually go through the BPE merges.
01:21:19.960 | Instead, the code that actually outputs the tokens
01:21:24.200 | has special case instructions for handling special tokens.
01:21:28.480 | We did not see these special instructions
01:21:31.520 | for handling special tokens in the encoder.py.
01:21:34.840 | It's absent there.
01:21:36.440 | But if you go to the tiktoken library,
01:21:37.960 | which is implemented in Rust,
01:21:40.520 | you will find all kinds of special case handling
01:21:42.680 | for these special tokens that you can register,
01:21:45.680 | create, add to the vocabulary,
01:21:48.080 | and then it looks for them
01:21:49.000 | and whenever it sees these special tokens like this,
01:21:52.720 | it will actually come in and swap in that special token.
01:21:55.920 | So these things are outside of the typical algorithm
01:21:58.760 | of byte-pair encoding.
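With tiktoken, for example, a special token is only swapped in if you explicitly allow it; otherwise the library raises an error to protect you from encoding it by accident:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))   # -> [50256]
print(enc.encode("<|endoftext|"))   # incomplete string, so it's just ordinary BPE tokens
```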
01:22:00.280 | So these special tokens are used pervasively,
01:22:04.040 | not just in basically base language modeling
01:22:06.840 | of predicting the next token in the sequence,
01:22:08.880 | but especially when it gets to later,
01:22:10.520 | to the fine tuning stage,
01:22:11.840 | and all of the chat GPT sort of aspects of it.
01:22:15.440 | Because we don't just want to delimit documents,
01:22:16.960 | we want to delimit entire conversations
01:22:18.720 | between an assistant and a user.
01:22:21.120 | So if I refresh this Tiktokenizer page,
01:22:24.160 | the default example that they have here
01:22:26.240 | is using not sort of base model encoders,
01:22:30.080 | but fine-tuned model sort of tokenizers.
01:22:33.240 | So for example, using the GPT 3.5 turbo scheme,
01:22:37.120 | these here are all special tokens.
01:22:39.520 | im_start, im_end, et cetera.
01:22:43.080 | im_start is short for imaginary monologue start, by the way.
01:22:46.880 | But you can see here that there's a sort of start
01:22:50.320 | and end of every single message,
01:22:51.800 | and there can be many other tokens,
01:22:53.520 | lots of tokens in use to delimit these conversations
01:22:58.280 | and kind of keep track of the flow of the messages here.
01:23:01.880 | Now we can go back to the tiktoken library.
01:23:04.800 | And here, when you scroll to the bottom,
01:23:07.320 | they talk about how you can extend tiktoken.
01:23:09.320 | And basically, you can create,
01:23:11.000 | you can fork the cl100k_base tokenizer used in GPT-4.
01:23:16.000 | And for example, you can extend it
01:23:18.280 | by adding more special tokens.
01:23:19.720 | And these are totally up to you.
01:23:20.640 | You can come up with any arbitrary tokens
01:23:22.440 | and add them with a new ID afterwards.
01:23:25.480 | And the tiktoken library will correctly swap them out
01:23:29.360 | when it sees this in the strings.
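The extension pattern described in the tiktoken docs looks roughly like this: fork cl100k_base and register extra special tokens on top of it (the token name and id below are purely illustrative):

```python
import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")

enc = tiktoken.Encoding(
    name="cl100k_custom",
    pat_str=cl100k_base._pat_str,                      # same chunking regex
    mergeable_ranks=cl100k_base._mergeable_ranks,      # same BPE merges
    special_tokens={
        **cl100k_base._special_tokens,
        "<|my_special_token|>": 100264,                # made-up token at an id past the base vocabulary
    },
)

print(enc.encode("hello <|my_special_token|>", allowed_special="all"))
```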
01:23:31.800 | Now, we can also go back to this file,
01:23:35.240 | which we looked at previously.
01:23:37.160 | And I mentioned that for GPT-2, in tiktoken's
01:23:40.040 | openai_public.py, we have the vocabulary,
01:23:43.600 | we have the pattern for splitting.
01:23:45.760 | And then here we are registering
01:23:46.960 | the single special token in GPT 2,
01:23:48.760 | which was the end of text token.
01:23:50.280 | And we saw that it has this ID.
01:23:52.040 | In GPT 4, when they define this here,
01:23:56.400 | you see that the pattern has changed as we've discussed,
01:23:58.600 | but also the special tokens have changed in this tokenizer.
01:24:01.240 | So we, of course, have the end of text, just like in GPT 2,
01:24:05.040 | but we also see three, sorry, four additional tokens here.
01:24:08.720 | FIM prefix, middle, and suffix.
01:24:11.000 | What is FIM?
01:24:11.840 | FIM is short for fill in the middle.
01:24:14.560 | And if you'd like to learn more about this idea,
01:24:17.120 | it comes from this paper.
01:24:18.360 | And I'm not gonna go into detail in this video.
01:24:21.280 | It's beyond this video.
01:24:22.640 | And then there's one additional sort of token here.
01:24:26.760 | So that's that encoding as well.
01:24:28.640 | So it's very common, basically, to train a language model.
01:24:32.880 | And then if you'd like, you can add special tokens.
01:24:36.920 | Now, when you add special tokens,
01:24:38.400 | you, of course, have to do some model surgery
01:24:41.480 | to the transformer
01:24:42.560 | and all of the parameters involved in that transformer,
01:24:44.920 | because you are basically adding an integer.
01:24:47.000 | And you wanna make sure that, for example,
01:24:48.320 | your embedding matrix for the vocabulary tokens
01:24:51.360 | has to be extended by adding a row.
01:24:53.800 | And typically, this row would be initialized
01:24:55.800 | with small random numbers or something like that,
01:24:58.360 | because we need to have a vector
01:24:59.560 | that now stands for that token.
01:25:02.520 | In addition to that,
01:25:03.360 | you have to go to the final layer of the transformer,
01:25:05.000 | and you have to make sure that that projection
01:25:06.600 | at the very end into the classifier
01:25:09.000 | is extended by one as well.
01:25:10.920 | So basically, there's some model surgery involved
01:25:13.000 | that you have to couple with the tokenization changes
01:25:16.400 | if you are going to add special tokens.
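As a rough sketch of that model surgery in PyTorch terms, assuming a GPT-2-style model with a token embedding table `wte` and a final projection `lm_head` (the dimensions and init scale here are illustrative):

```python
import torch
import torch.nn as nn

old_vocab_size, n_embd, n_new = 50257, 768, 1   # illustrative GPT-2-ish sizes, one new special token

wte = nn.Embedding(old_vocab_size, n_embd)                # existing token embedding table
lm_head = nn.Linear(n_embd, old_vocab_size, bias=False)   # existing projection into the vocabulary

# extend both by n_new rows, copy the trained weights, and randomly init the new row(s)
new_wte = nn.Embedding(old_vocab_size + n_new, n_embd)
new_lm_head = nn.Linear(n_embd, old_vocab_size + n_new, bias=False)
with torch.no_grad():
    new_wte.weight[:old_vocab_size] = wte.weight
    new_wte.weight[old_vocab_size:].normal_(mean=0.0, std=0.02)
    new_lm_head.weight[:old_vocab_size] = lm_head.weight
    new_lm_head.weight[old_vocab_size:].normal_(mean=0.0, std=0.02)
```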
01:25:18.840 | But this is a very common operation that people do,
01:25:20.920 | especially if they'd like to fine-tune the model,
01:25:23.120 | for example, taking it from a base model
01:25:24.960 | to a chat model like chat-gpt.
01:25:27.240 | Okay, so at this point, you should have everything you need
01:25:30.680 | in order to build your own GPT-4 tokenizer.
01:25:33.240 | Now, in the process of developing this lecture,
01:25:35.040 | I've done that, and I've published the code
01:25:37.000 | under this repository, minBPE.
01:25:39.880 | So minBPE looks like this right now as I'm recording,
01:25:43.400 | but the minBPE repository will probably change quite a bit
01:25:46.520 | because I intend to continue working on it.
01:25:48.680 | In addition to the minBPE repository,
01:25:51.560 | I've published this exercise progression
01:25:53.920 | that you can follow.
01:25:55.240 | So if you go to exercise.md here,
01:25:58.120 | this is sort of me breaking up the task ahead of you
01:26:01.360 | into four steps that sort of build up
01:26:04.200 | to what can be a GPT-4 tokenizer.
01:26:06.640 | And so feel free to follow these steps exactly
01:26:09.240 | and follow a little bit of the guidance
01:26:10.800 | that I've laid out here.
01:26:12.200 | And anytime you feel stuck,
01:26:14.160 | just reference the minBPE repository here.
01:26:17.040 | So either the tests could be useful
01:26:19.360 | or the minBPE repository itself.
01:26:21.520 | I try to keep the code fairly clean and understandable.
01:26:25.040 | And so feel free to reference it whenever you get stuck.
01:26:29.840 | In addition to that, basically, once you write it,
01:26:34.040 | you should be able to reproduce this behavior
01:26:35.640 | from tiktoken.
01:26:36.800 | So getting the GPT-4 tokenizer,
01:26:38.840 | you can take, you can encode the string
01:26:41.320 | and you should get these tokens.
01:26:42.920 | And then you can encode and decode the exact same string
01:26:45.040 | to recover it.
01:26:46.440 | And in addition to all that,
01:26:47.560 | you should be able to implement your own train function,
01:26:50.040 | which the tiktoken library does not provide.
01:26:52.160 | It's, again, only inference code,
01:26:54.040 | but you could write your own train, minBPE does it as well.
01:26:57.800 | And that will allow you to train
01:26:59.280 | your own token vocabularies.
01:27:00.760 | So here's some of the code inside minBPE, which
01:27:05.200 | shows the token vocabularies that you might obtain.
01:27:08.720 | So on the left here, we have the GPT-4 merges.
01:27:13.560 | So the first 256 are raw individual bytes.
01:27:17.120 | And then here I am visualizing the merges
01:27:18.880 | that GPT-4 performed during its training.
01:27:21.600 | So the very first merge that GPT-4 did
01:27:24.000 | was merge two spaces into a single token for two spaces.
01:27:28.760 | And that is a token 256.
01:27:30.840 | And so this is the order in which things merged
01:27:32.680 | during GPT-4 training.
01:27:34.200 | And this is the merge order that we obtained in minBPE
01:27:39.120 | by training a tokenizer.
01:27:40.680 | And in this case, I trained it
01:27:41.880 | on a Wikipedia page of Taylor Swift.
01:27:44.520 | Not because I'm a Swiftie,
01:27:45.520 | but because that is one of the longest Wikipedia pages,
01:27:49.080 | apparently, that's available.
01:27:50.560 | But she is pretty cool.
01:27:52.320 | And what was I gonna say?
01:27:56.120 | Yeah, so you can compare these two vocabularies.
01:27:58.920 | And so as an example,
01:28:01.240 | here, GPT-4 merged 'I' and 'N' to become 'IN'.
01:28:06.120 | And we've done the exact same thing on this token 259.
01:28:09.960 | Here, a space and 'T' merge to become ' T'.
01:28:12.600 | And that happened for us a little bit later as well.
01:28:15.160 | So the difference here is, again, to my understanding,
01:28:17.480 | only a difference of the training set.
01:28:19.480 | So as an example, because I see a lot of white space,
01:28:21.880 | I expect that GPT-4 probably had a lot of Python code
01:28:24.360 | in its training set, I'm not sure, for the tokenizer.
01:28:27.640 | And here we see much less of that, of course,
01:28:30.760 | in the Wikipedia page.
01:28:32.960 | So roughly speaking, they look the same,
01:28:34.760 | and they look the same
01:28:35.600 | because they're running the same algorithm.
01:28:37.360 | And when you train your own,
01:28:38.720 | you're probably gonna get something similar,
01:28:40.800 | depending on what you train it on.
01:28:42.480 | Okay, so we are now going to move on from tiktoken
01:28:44.800 | and the way that OpenAI tokenizes its strings.
01:28:47.680 | We're going to discuss one more very commonly used library
01:28:50.400 | for working with tokenization in LLMs,
01:28:52.720 | and that is SentencePiece.
01:28:54.840 | So SentencePiece is very commonly used in language models
01:28:58.840 | because unlike tiktoken,
01:28:59.920 | it can do both training and inference
01:29:02.120 | and is quite efficient at both.
01:29:04.360 | It supports a number of algorithms
01:29:05.800 | for training vocabularies,
01:29:07.760 | but one of them is the byte pair encoding algorithm
01:29:10.120 | that we've been looking at.
01:29:11.240 | So it supports it.
01:29:13.240 | Now, SentencePiece is used by both the Llama and Mistral series
01:29:16.960 | and many other models as well.
01:29:19.120 | It is on GitHub under Google/SentencePiece.
01:29:21.640 | And the big difference with SentencePiece,
01:29:24.760 | and we're going to look at an example
01:29:26.200 | because this is kind of hard and subtle to explain,
01:29:29.280 | is that they think differently
01:29:31.000 | about the order of operations here.
01:29:34.480 | So in the case of tiktoken,
01:29:36.840 | we first take our code points in the string,
01:29:40.360 | we encode them using UTF-8 to bytes,
01:29:42.640 | and then we're merging bytes.
01:29:43.880 | It's fairly straightforward.
01:29:45.960 | For SentencePiece, it works directly
01:29:49.640 | on the level of the code points themselves.
01:29:51.760 | So it looks at whatever code points are available
01:29:53.920 | in your training set,
01:29:55.200 | and then it starts merging those code points.
01:29:57.720 | And the BPE is running on the level of code points.
01:30:01.600 | And if you happen to run out of code points,
01:30:05.160 | so there are maybe some rare code points
01:30:07.400 | that just don't come up too often,
01:30:08.640 | and the rarity is determined
01:30:09.800 | by this character coverage hyperparameter,
01:30:12.080 | then these code points will either get mapped
01:30:15.200 | to a special unknown token like UNK,
01:30:18.120 | or if you have the byte fallback option turned on,
01:30:21.160 | then that will take those rare code points,
01:30:23.600 | it will encode them using UTF-8,
01:30:25.720 | and then the individual bytes of that encoding
01:30:27.640 | will be translated into tokens.
01:30:29.800 | And there are these special byte tokens
01:30:31.560 | that basically get added to the vocabulary.
01:30:34.040 | So it uses BPE on the code points,
01:30:37.800 | and then it falls back to bytes for rare code points.
01:30:41.280 | And so that's kind of the difference.
01:30:44.120 | Personally, I find the tiktoken way significantly cleaner,
01:30:47.040 | but it's kind of like a subtle,
01:30:48.200 | but pretty major difference
01:30:49.200 | between the way they approach tokenization.
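
A quick way to see the raw material each approach works with is to compare a code point with its UTF-8 bytes; with byte fallback turned on, a rare code point like this ends up represented by its byte tokens rather than by a single unknown token.

    text = "안"                         # a single Unicode code point (U+C548)
    print(ord(text))                    # the code point that sentencepiece would try to merge
    print(list(text.encode("utf-8")))   # [236, 149, 136] -- the bytes that tiktoken-style BPE works on
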
01:30:51.400 | Let's work with a concrete example,
01:30:52.760 | because otherwise this is kind of hard
01:30:54.680 | to get your head around.
01:30:57.800 | So let's work with a concrete example.
01:31:00.160 | This is how we can import sentence piece.
01:31:02.760 | And then here we're going to take,
01:31:04.400 | I think I took like the description of sentence piece,
01:31:06.440 | and I just create like a little toy dataset.
01:31:08.600 | It really likes to have a file,
01:31:09.920 | so I created a toy.txt file with this content.
01:31:14.000 | Now, what's kind of a little bit crazy about sentence piece
01:31:16.560 | is that there's a ton of options and configurations.
01:31:19.720 | And the reason this is so
01:31:20.800 | is because sentence piece has been around,
01:31:22.440 | I think for a while,
01:31:23.520 | and it really tries to handle a large diversity of things.
01:31:26.640 | And because it's been around,
01:31:28.320 | I think it has quite a bit of
01:31:29.920 | accumulated historical baggage as well.
01:31:33.040 | And so in particular,
01:31:34.120 | there's like a ton of configuration arguments.
01:31:36.440 | This is not even all of it.
01:31:38.160 | You can go here to see all the training options.
01:31:42.560 | And there's also quite useful documentation
01:31:45.400 | when you look at the raw protobuf
01:31:47.480 | that is used to represent the trainer spec and so on.
01:31:50.680 | Many of these options are irrelevant to us.
01:31:54.720 | So maybe to point out one example,
01:31:56.720 | --shrinking_factor.
01:31:58.760 | This shrinking factor is not used
01:32:00.800 | in the byte pair encoding algorithm.
01:32:02.200 | So this is just an argument that is irrelevant to us.
01:32:05.840 | It applies to a different training algorithm.
01:32:08.080 | Now, what I tried to do here
01:32:11.760 | is I tried to set up sentence piece
01:32:13.680 | in a way that is very, very similar,
01:32:15.440 | as far as I can tell,
01:32:16.640 | to maybe identical, hopefully,
01:32:18.920 | to the way that Llama2 was trained.
01:32:21.680 | So the way they trained their own tokenizer.
01:32:26.120 | And the way I did this was basically
01:32:27.520 | you can take the tokenizer.model file that Meta released,
01:32:31.160 | and you can open it using the protobuf
01:32:35.360 | sort of file that you can generate.
01:32:38.400 | And then you can inspect all the options.
01:32:39.880 | And I tried to copy over all the options
01:32:41.520 | that looked relevant.
01:32:42.960 | So here we set up the input.
01:32:44.560 | It's a raw text in this file.
01:32:46.680 | Here's gonna be the output.
01:32:47.960 | So it's gonna be tok400.model and tok400.vocab.
01:32:51.520 | We're saying that we're gonna use the BPE algorithm
01:32:54.240 | and we want a vocab size of 400.
01:32:55.960 | Then there's a ton of configurations here
01:32:59.160 | for basically preprocessing and normalization rules,
01:33:06.240 | as they're called.
01:33:07.240 | Normalization used to be very prevalent,
01:33:09.600 | I would say, before LLMs in natural language processing.
01:33:12.360 | So in machine translation and text classification and so on,
01:33:15.800 | you want to normalize and simplify the text
01:33:17.640 | and you wanna turn it all lowercase
01:33:18.960 | and you wanna remove all double whitespace, et cetera.
01:33:22.240 | And in language models, we prefer not to do any of it,
01:33:24.640 | or at least that is my preference as a deep learning person.
01:33:26.960 | You want to not touch your data.
01:33:28.400 | You want to keep the raw data as much as possible
01:33:31.800 | in a raw form.
01:33:33.200 | So you're basically trying to turn off a lot of this,
01:33:36.200 | if you can.
01:33:37.840 | The other thing that sentence piece does
01:33:39.280 | is that it has this concept of sentences.
01:33:41.720 | So SentencePiece, it goes back a while;
01:33:45.200 | it kind of like was developed, I think,
01:33:46.440 | early in the days where there was an idea
01:33:50.080 | that you're training a tokenizer
01:33:51.840 | on a bunch of independent sentences.
01:33:53.800 | So it has a lot of options like how many sentences
01:33:56.400 | you're gonna train on,
01:33:57.360 | what is the maximum sentence length,
01:33:59.160 | shuffling sentences.
01:34:03.160 | And so for it, sentences are kind of like
01:34:04.840 | the individual training examples.
01:34:06.600 | But again, in the context of LLMs,
01:34:08.160 | I find that this is like a very spurious
01:34:10.000 | and weird distinction.
01:34:11.320 | Like sentences are just like, don't touch the raw data.
01:34:15.080 | Sentences happen to exist.
01:34:16.920 | But in the raw data sets,
01:34:18.760 | there are a lot of like in-betweens,
01:34:20.200 | like what exactly is a sentence?
01:34:21.800 | What isn't a sentence?
01:34:23.160 | And so I think like, it's really hard to define
01:34:25.920 | what an actual sentence is,
01:34:27.560 | if you really like dig into it.
01:34:29.360 | And there could be different concepts of it
01:34:31.320 | in different languages or something like that.
01:34:32.800 | So why even introduce the concept?
01:34:34.720 | It doesn't honestly make sense to me.
01:34:36.600 | I would just prefer to treat a file
01:34:38.440 | as a giant stream of bytes.
01:34:40.400 | It has a lot of treatment around rare word characters.
01:34:44.400 | And when I say word, I mean, code points.
01:34:46.120 | We're gonna come back to this in a second.
01:34:48.320 | And it has a lot of other rules
01:34:49.600 | for basically splitting digits,
01:34:53.000 | splitting white space and numbers
01:34:55.240 | and how you deal with that.
01:34:56.520 | So these are some kind of like merge rules.
01:34:59.120 | So I think this is a little bit equivalent
01:35:00.600 | to tiktoken using the regular expression
01:35:03.760 | to split up categories.
01:35:05.480 | There's like kind of equivalence of it
01:35:07.920 | if you squint at it in sentence piece,
01:35:09.960 | where you can also, for example,
01:35:11.000 | split up the digits and so on.
01:35:15.520 | There's a few more things here
01:35:18.040 | that I'll come back to in a bit.
01:35:19.280 | And then there are some special tokens
01:35:20.440 | that you can indicate.
01:35:21.640 | And it hardcodes the UNK token,
01:35:24.040 | the beginning of sentence, end of sentence,
01:35:26.280 | and a pad token.
01:35:27.440 | And the UNK token must exist, from my understanding.
01:35:31.720 | And there's some systems things.
01:35:33.920 | So we can train.
01:35:35.800 | And when I press train,
01:35:37.240 | it's going to create this file,
01:35:39.320 | tok400.model and tok400.vocab.
01:35:42.400 | I can then load the model file
01:35:44.520 | and I can inspect the vocabulary of it.
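
A minimal sketch of that training call in Python; only a handful of the Llama-2-style options are shown here and everything else is left at the sentencepiece defaults, with toy.txt being the toy dataset file created above.

    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input="toy.txt",                     # raw text file to train on
        model_prefix="tok400",               # writes tok400.model and tok400.vocab
        model_type="bpe",
        vocab_size=400,
        byte_fallback=True,                  # fall back to UTF-8 byte tokens for rare code points
        character_coverage=0.99995,
        normalization_rule_name="identity",  # don't touch the raw data
        add_dummy_prefix=True,
        pad_id=-1,                           # no pad token
    )

    sp = spm.SentencePieceProcessor()
    sp.load("tok400.model")
    vocab = [(sp.id_to_piece(i), i) for i in range(sp.get_piece_size())]
    print(vocab[:8])   # special tokens first, then byte tokens, then merges, then code points
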
01:35:47.440 | And so we trained vocab size 400 on this text here.
01:35:52.280 | And these are the individual pieces,
01:35:55.000 | the individual tokens that sentence piece will create.
01:35:58.000 | So in the beginning,
01:35:58.840 | we see that we have the UNK token with ID zero.
01:36:03.080 | Then we have the beginning of sequence,
01:36:04.880 | end of sequence, one and two.
01:36:07.720 | And then we said that the pad ID is negative one.
01:36:09.600 | So we chose not to use it.
01:36:11.800 | So there's no pad token here.
01:36:13.280 | Then these are individual byte tokens.
01:36:17.840 | So here we saw that byte fallback in Llama was turned on.
01:36:22.120 | So it's true.
01:36:23.480 | So what follows are going to be the 256 byte tokens.
01:36:27.520 | And these are their IDs.
01:36:32.880 | And then at the bottom,
01:36:34.760 | after the byte tokens come the merges.
01:36:37.720 | And these are the parent nodes in the merges.
01:36:41.560 | So we're not seeing the children,
01:36:42.640 | we're just seeing the parents and their ID.
01:36:44.800 | And then after the merges comes eventually
01:36:49.080 | the individual tokens and their IDs.
01:36:53.120 | And so these are the individual tokens.
01:36:54.560 | So these are the individual code point tokens, if you will,
01:36:58.360 | and they come at the end.
01:36:59.880 | So that is the ordering
01:37:00.840 | with which sentence piece sort of like represents
01:37:02.840 | its vocabularies.
01:37:03.800 | It starts with special tokens, then the byte tokens,
01:37:06.840 | then the merge tokens,
01:37:08.000 | and then the individual code point tokens.
01:37:10.640 | And all these raw code point tokens
01:37:13.640 | are the ones that it encountered in the training set.
01:37:17.080 | So those individual code points are
01:37:20.720 | the entire set of code points that occurred here.
01:37:23.240 | So those all get put in there.
01:37:27.360 | And then those that are extremely rare
01:37:29.160 | as determined by character coverage.
01:37:30.840 | So if a code point occurred only a single time
01:37:32.640 | out of like a million sentences or something like that,
01:37:36.200 | then it would be ignored
01:37:37.840 | and it would not be added to our vocabulary.
01:37:41.120 | Once we have a vocabulary,
01:37:43.280 | we can encode into IDs and we can sort of get a list.
01:37:47.520 | And then here I am also decoding the individual tokens
01:37:53.080 | back into little pieces as they call it.
01:37:55.720 | So let's take a look at what happened here.
01:37:58.920 | Hello, space, and then the Korean greeting annyeonghaseyo.
01:38:02.000 | So these are the token IDs we got back.
01:38:05.560 | And when we look here, a few things sort of jumped to mind.
01:38:10.400 | Number one, take a look at these characters.
01:38:14.160 | The Korean characters, of course,
01:38:15.320 | were not part of the training set.
01:38:17.080 | So sentence piece is encountering code points
01:38:19.560 | that it has not seen during training time.
01:38:22.120 | And those code points do not have
01:38:24.080 | a token associated with them.
01:38:25.840 | So suddenly these are unk tokens, unknown tokens.
01:38:29.760 | But because byte fallback is true,
01:38:32.320 | instead, sentence piece falls back to bytes.
01:38:35.720 | And so it takes this, it encodes it with UTF-8,
01:38:39.640 | and then it uses these tokens to represent those bytes.
01:38:44.240 | And that's what we are getting sort of here.
01:38:47.960 | This is the UTF-8 encoding,
01:38:50.640 | and it is shifted by three
01:38:52.880 | because of these special tokens here
01:38:56.280 | that have IDs earlier on.
01:38:58.400 | So that's what happened here.
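
Continuing from the training sketch above, here is roughly what that byte fallback looks like when you encode the string yourself; the +3 shift reflects the three special tokens (unk, bos, eos) that sit at the front of this particular vocabulary.

    ids = sp.encode("hello 안녕하세요")
    print(ids)
    print([sp.id_to_piece(i) for i in ids])               # pieces like '▁hello', then '<0xEC>', '<0x95>', ...
    print([b + 3 for b in "안녕하세요".encode("utf-8")])   # the byte-token ids we expect for the Korean part
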
01:39:00.400 | Now, one more thing that,
01:39:02.960 | well, first, before I go on,
01:39:05.200 | with respect to the byte fallback,
01:39:06.840 | let me remove byte fallback.
01:39:09.240 | If this is false, what's gonna happen?
01:39:11.680 | Let's retrain.
01:39:12.520 | So the first thing that happened
01:39:14.520 | is all of the byte tokens disappeared, right?
01:39:17.320 | And now we just have the merges,
01:39:18.840 | and we have a lot more merges now
01:39:20.240 | because we have a lot more space
01:39:21.360 | because we're not taking up space
01:39:23.240 | in the vocab size with all the bytes.
01:39:26.960 | And now if we encode this,
01:39:28.880 | we get a zero.
01:39:31.960 | So this entire string here suddenly,
01:39:33.840 | there's no byte fallback.
01:39:34.960 | So this is unknown, and unknown is UNK.
01:39:39.000 | And so this is zero
01:39:40.400 | because the UNK token is token zero.
01:39:43.200 | And you have to keep in mind
01:39:44.960 | that this would feed into your language model.
01:39:47.720 | So what is the language model supposed to do
01:39:49.240 | when all kinds of different things
01:39:51.000 | that are unrecognized because they're rare
01:39:53.000 | just end up mapping into UNK?
01:39:54.880 | It's not exactly the property that you want.
01:39:57.080 | So that's why I think Llama correctly
01:39:59.480 | used byte fallback true
01:40:02.160 | because we definitely want to feed
01:40:03.640 | these unknown or rare code points
01:40:06.120 | into the model in some manner.
01:40:08.600 | The next thing I wanna show you is the following.
01:40:11.720 | Notice here when we are decoding all the individual tokens,
01:40:14.640 | you see how spaces, space here
01:40:17.760 | ends up being this bold underline.
01:40:20.920 | I'm not 100% sure, by the way,
01:40:22.320 | why sentence piece switches whitespace
01:40:24.240 | into these bold underscore characters.
01:40:27.240 | Maybe it's for visualization.
01:40:28.400 | I'm not 100% sure why that happens.
01:40:30.920 | But notice this.
01:40:32.040 | Why do we have an extra space in the front of hello?
01:40:37.040 | What is this coming from?
01:40:40.560 | Well, it's coming from this option here.
01:40:42.800 | Add dummy prefix is true.
01:40:48.120 | And when you go to the documentation,
01:40:50.720 | add dummy whitespace at the beginning of text
01:40:52.640 | in order to treat 'world' in 'world'
01:40:54.920 | and in 'hello world' in the exact same way.
01:40:57.240 | So what this is trying to do is the following.
01:41:00.280 | If we go back to our tick tokenizer,
01:41:02.120 | world as a token by itself
01:41:06.400 | has a different ID than space world.
01:41:10.040 | So we have, this is 1917, but this is 14, et cetera.
01:41:14.720 | So these are two different tokens for the language model.
01:41:17.000 | And the language model has to learn from data
01:41:18.720 | that they are actually kind of like a very similar concept.
01:41:21.200 | So to the language model in the tick token world,
01:41:24.080 | basically words in the beginning of sentences
01:41:27.000 | and words in the middle of sentences
01:41:28.560 | actually look completely different.
01:41:30.320 | And it has learned that they are roughly the same.
01:41:34.400 | So this add dummy prefix
01:41:36.480 | is trying to fight that a little bit.
01:41:38.440 | And the way that works is that it basically
01:41:40.800 | adds a dummy prefix.
01:41:43.880 | So as a part of pre-processing,
01:41:48.600 | it will take the string and it will add a space.
01:41:51.160 | It will do this.
01:41:52.920 | And that's done in an effort to make this world
01:41:56.320 | and that world the same.
01:41:57.600 | They will both be space world.
01:41:59.960 | So that's one other kind of pre-processing option
01:42:02.120 | that is turned on and Llama2 also uses this option.
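
To make the effect concrete, here is a small sketch: in a byte-level BPE like tiktoken's, 'world' with and without a leading space are different tokens, and add_dummy_prefix is essentially just prepending one space during preprocessing. The exact ids printed depend on the vocabulary.

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    print(enc.encode("world"))    # token(s) for "world" at the start of text
    print(enc.encode(" world"))   # a different token for " world" in the middle of text

    # what add_dummy_prefix=True effectively does before tokenizing:
    text = "world"
    print(" " + text)             # " world" -- now it matches the mid-sentence form
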
01:42:06.800 | And that's I think everything that I wanna say
01:42:08.360 | from my preview of sentence piece and how it is different.
01:42:11.880 | Maybe here what I've done is
01:42:14.040 | I just put in the raw protocol buffer representation
01:42:18.480 | basically of the tokenizer that Llama2 trained.
01:42:22.280 | So feel free to sort of step through this.
01:42:24.520 | And if you would like your tokenization
01:42:26.560 | to look identical to that of the meta Llama2,
01:42:30.360 | then you would be copy pasting these settings
01:42:32.080 | as I've tried to do up above.
01:42:34.000 | And yeah, I think that's it for this section.
01:42:37.800 | I think my summary for sentence piece from all this
01:42:40.200 | is number one, I think that there's a lot
01:42:42.400 | of historical baggage in sentence piece.
01:42:44.240 | A lot of concepts that I think are slightly confusing
01:42:46.840 | and I think potentially contain foot guns
01:42:49.200 | like this concept of a sentence
01:42:50.600 | and its maximum length and stuff like that.
01:42:52.760 | Otherwise it is fairly commonly used in the industry
01:42:56.640 | because it is efficient
01:42:58.920 | and can do both training and inference.
01:43:01.160 | It has a few quirks, like for example,
01:43:02.920 | the UNK token must exist, and the way the byte fallback is done
01:43:05.880 | and so on, I don't find particularly elegant.
01:43:08.320 | And unfortunately I have to say
01:43:09.360 | it's not very well documented.
01:43:10.480 | So it took me a lot of time working with this myself
01:43:13.760 | and just visualizing things
01:43:15.760 | and trying to really understand what is happening here
01:43:17.760 | because the documentation, unfortunately,
01:43:20.360 | is in my opinion, not super amazing.
01:43:22.720 | But it is a very nice repo that is available to you
01:43:25.840 | if you'd like to train your own tokenizer right now.
01:43:28.200 | Okay, let me now switch gears again
01:43:29.560 | as we're starting to slowly wrap up here.
01:43:31.640 | I want to revisit this issue in a bit more detail
01:43:33.840 | of how we should set the vocab size
01:43:35.680 | and what are some of the considerations around it.
01:43:38.160 | So for this, I'd like to go back to the model architecture
01:43:41.360 | that we developed in the last video
01:43:43.080 | when we built the GPT from scratch.
01:43:45.680 | So this here was the file
01:43:47.480 | that we built in the previous video
01:43:49.000 | and we defined the transformer model
01:43:50.680 | and let's specifically look at vocab size
01:43:52.800 | and where it appears in this file.
01:43:54.820 | So here we define the vocab size.
01:43:56.560 | At this time it was 65 or something like that,
01:43:59.640 | extremely small number.
01:44:00.720 | So this will grow much larger.
01:44:03.160 | You'll see that vocab size doesn't come up too much
01:44:04.980 | in most of these layers.
01:44:06.160 | The only places that it comes up
01:44:08.240 | are exactly these two places here.
01:44:10.920 | So when we define the language model,
01:44:13.280 | there's the token embedding table,
01:44:15.200 | which is this two-dimensional array
01:44:17.320 | where the vocab size is basically the number of rows
01:44:20.620 | and each vocabulary element, each token has a vector
01:44:25.000 | that we're going to train using backpropagation.
01:44:27.240 | That vector is of size n_embd,
01:44:28.920 | which is number of channels in the transformer.
01:44:31.440 | And basically as vocab size increases,
01:44:33.560 | this embedding table, as I mentioned earlier,
01:44:35.760 | is going to also grow.
01:44:36.680 | We're going to be adding rows.
01:44:38.680 | In addition to that, at the end of the transformer,
01:44:41.080 | there's this LM head layer, which is a linear layer.
01:44:44.240 | And you'll notice that that layer is used at the very end
01:44:47.000 | to produce the logits, which become the probabilities
01:44:49.840 | for the next token in a sequence.
01:44:51.520 | And so intuitively, we're trying to produce a probability
01:44:54.520 | for every single token that might come next
01:44:57.240 | at every point in time of that transformer.
01:45:00.240 | And if we have more and more tokens,
01:45:02.000 | we need to produce more and more probabilities.
01:45:04.360 | So every single token is going to introduce
01:45:06.160 | an additional dot product that we have to do
01:45:08.600 | here in this linear layer
01:45:09.880 | for this final layer in the transformer.
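
In code, those two places look roughly like this; the names mirror the gpt.py-style model being discussed, and the sizes are just placeholders (65 was the character-level vocabulary from the earlier video).

    import torch.nn as nn

    vocab_size, n_embd = 65, 384   # tiny placeholder numbers

    token_embedding_table = nn.Embedding(vocab_size, n_embd)  # one learned vector per token
    lm_head = nn.Linear(n_embd, vocab_size)                   # one output logit per token

    # growing vocab_size adds rows to the embedding table and dot products to lm_head;
    # nothing else in the transformer blocks depends on it
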
01:45:12.440 | So why can't vocab size be infinite?
01:45:15.340 | Why can't we grow to infinity?
01:45:16.560 | Well, number one, your token embedding table
01:45:18.520 | is going to grow.
01:45:19.840 | Your linear layer is going to grow.
01:45:23.000 | So we're going to be doing a lot more computation here
01:45:25.140 | because this LM head layer
01:45:26.280 | will become more computationally expensive.
01:45:28.740 | Number two, because we have more parameters,
01:45:30.600 | we could be worried that we are going to be
01:45:32.640 | under-training some of these parameters.
01:45:35.440 | So intuitively, if you have a very large vocabulary size,
01:45:38.560 | say we have a million tokens,
01:45:40.800 | then every one of these tokens is going to come up
01:45:42.480 | more and more rarely in the training data
01:45:45.080 | because there's a lot more other tokens all over the place.
01:45:47.600 | And so we're going to be seeing fewer and fewer examples
01:45:51.080 | for each individual token.
01:45:52.960 | And you might be worried that basically
01:45:54.320 | the vectors associated with every token
01:45:56.080 | will be under-trained as a result
01:45:57.760 | because they just don't come up too often
01:45:59.760 | and they don't participate in the forward-backward pass.
01:46:02.460 | In addition to that, as your vocab size grows,
01:46:04.600 | you're going to start shrinking your sequences a lot.
01:46:07.640 | And that's really nice because that means
01:46:09.720 | that we're going to be attending to more and more text.
01:46:11.880 | So that's nice.
01:46:12.980 | But also you might be worrying that too large of chunks
01:46:15.800 | are being squished into single tokens.
01:46:18.040 | And so the model just doesn't have as much time to think
01:46:22.000 | per some number of characters in a text,
01:46:26.320 | or you can think about it that way.
01:46:28.000 | So basically we're squishing too much information
01:46:29.960 | into a single token.
01:46:31.180 | And then the forward pass of the transformer
01:46:32.980 | is not enough to actually process
01:46:34.300 | that information appropriately.
01:46:35.940 | And so these are some of the considerations
01:46:37.420 | you're thinking about
01:46:38.260 | when you're designing the vocab size.
01:46:39.820 | As I mentioned, this is mostly an empirical hyperparameter.
01:46:42.260 | And it seems like in state-of-the-art architectures today,
01:46:44.980 | this is usually in the high tens of thousands
01:46:46.820 | or somewhere around 100,000 today.
01:46:49.340 | And the next consideration I want to briefly talk about
01:46:51.500 | is what if we want to take a pre-trained model
01:46:54.260 | and we want to extend the vocab size?
01:46:56.460 | And this is done fairly commonly actually.
01:46:58.180 | So for example, when you're doing fine tuning for chat GPT,
01:47:02.240 | a lot more new special tokens get introduced
01:47:04.280 | on top of the base model to maintain the metadata
01:47:07.280 | and all the structure of conversation objects
01:47:09.980 | between the user and an assistant.
01:47:11.740 | So that takes a lot of special tokens.
01:47:13.860 | You might also try to throw in more special tokens,
01:47:15.960 | for example, for using the browser or any other tool.
01:47:18.920 | And so it's very tempting to add a lot of tokens
01:47:21.560 | for all kinds of special functionality.
01:47:23.960 | So if you want to be adding a token,
01:47:25.320 | that's totally possible, right?
01:47:26.960 | All we have to do is we have to resize this embedding.
01:47:29.620 | So we have to add rows.
01:47:31.100 | We would initialize these parameters from scratch,
01:47:33.860 | which would be small random numbers.
01:47:35.500 | And then we have to extend the weight inside this linear.
01:47:39.340 | So we have to start making dot products
01:47:41.900 | with the associated parameters as well
01:47:43.780 | to basically calculate the probabilities
01:47:45.120 | for these new tokens.
01:47:46.660 | So both of these are just a resizing operation.
01:47:49.500 | It's a very mild model surgery
01:47:51.940 | and can be done fairly easily.
01:47:53.260 | And it's quite common that basically
01:47:54.460 | you would freeze the base model.
01:47:56.000 | You introduce these new parameters
01:47:57.500 | and then you only train these new parameters
01:47:59.300 | to introduce new tokens into the architecture.
01:48:01.600 | And so you can freeze arbitrary parts of it
01:48:04.500 | or you can train arbitrary parts of it
01:48:06.140 | and that's totally up to you.
01:48:07.580 | But basically minor surgery required
01:48:09.460 | if you'd like to introduce new tokens.
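
A plain-PyTorch sketch of that common recipe, freezing the base model and training only the embedding rows for the newly added tokens; the variable names are hypothetical, and this pairs with the row-adding surgery shown earlier.

    import torch

    def freeze_all_but_new_token_rows(model, tok_emb, num_old_tokens):
        # model: the transformer, tok_emb: its (already extended) nn.Embedding
        for p in model.parameters():
            p.requires_grad = False          # freeze the entire base model
        tok_emb.weight.requires_grad = True  # re-enable gradients for the embedding table

        # zero out gradients for the original rows, so only the new rows get updated
        mask = torch.zeros_like(tok_emb.weight)
        mask[num_old_tokens:] = 1.0
        tok_emb.weight.register_hook(lambda grad: grad * mask)
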
01:48:11.420 | And finally, I'd like to mention that
01:48:12.760 | actually there's an entire design space of applications
01:48:15.900 | in terms of introducing new tokens into a vocabulary
01:48:18.460 | that go way beyond just adding special tokens
01:48:20.460 | and special new functionality.
01:48:22.240 | So just to give you a sense of the design space,
01:48:24.100 | but this could be an entire video just by itself.
01:48:26.660 | This is a paper on learning to compress prompts
01:48:29.260 | with what they called GIST tokens.
01:48:32.060 | And the rough idea is suppose that you're using
01:48:34.060 | language models in a setting
01:48:35.060 | that requires very long prompts.
01:48:37.320 | Well, these long prompts just slow everything down
01:48:39.460 | because you have to encode them
01:48:40.540 | and then you have to use them
01:48:41.740 | and then you're attending over them
01:48:43.220 | and it's just heavy to have very large prompts.
01:48:46.820 | So instead what they do here in this paper
01:48:49.220 | is they introduce new tokens
01:48:52.340 | and imagine basically having a few new tokens,
01:48:56.100 | you put them in a sequence
01:48:57.700 | and then you train the model by distillation.
01:49:00.820 | So you are keeping the entire model frozen
01:49:02.740 | and you're only training the representations
01:49:04.740 | of the new tokens, their embeddings,
01:49:06.780 | and you're optimizing over the new tokens
01:49:08.660 | such that the behavior of the language model
01:49:11.120 | is identical to the model that has a very long prompt
01:49:16.120 | that works for you.
01:49:17.580 | And so it's a compression technique
01:49:19.020 | of compressing that very long prompt
01:49:20.620 | into those few new GIST tokens.
01:49:23.140 | And so you can train this and then at test time
01:49:24.980 | you can discard your old prompt
01:49:26.100 | and just swap in those tokens
01:49:27.540 | and they sort of like stand in for that very long prompt
01:49:30.900 | and have an almost identical performance.
01:49:33.580 | And so this is one technique
01:49:36.100 | in a class of parameter efficient fine tuning techniques
01:49:38.780 | where most of the model is basically fixed
01:49:40.940 | and there's no training of the model weights,
01:49:43.000 | there's no training of LoRA or anything like that
01:49:45.260 | of new parameters.
01:49:46.360 | The parameters that you're training
01:49:47.900 | are now just the token embeddings.
01:49:50.660 | So that's just one example,
01:49:51.700 | but this could again be like an entire video,
01:49:53.900 | but just to give you a sense
01:49:54.860 | that there's a whole design space here
01:49:56.020 | that is potentially worth exploring in the future.
01:49:58.620 | The next thing I want to briefly address
01:50:00.020 | is that I think recently there's a lot of momentum
01:50:02.420 | in how you actually could construct transformers
01:50:05.220 | that can simultaneously process
01:50:06.460 | not just text as the input modality,
01:50:08.580 | but a lot of other modalities.
01:50:09.660 | So be it images, videos, audio, et cetera.
01:50:12.980 | And how do you feed in all these modalities
01:50:14.980 | and potentially predict these modalities from a transformer?
01:50:18.980 | Do you have to change the architecture
01:50:20.020 | in some fundamental way?
01:50:21.060 | And I think what a lot of people
01:50:22.220 | are starting to converge towards
01:50:23.540 | is that you're not changing the architecture,
01:50:25.000 | you stick with the transformer,
01:50:26.380 | you just kind of tokenize your input domains
01:50:28.940 | and then call it a day and pretend it's just text tokens
01:50:31.420 | and just do everything else in an identical manner.
01:50:35.420 | So here, for example, there was an early paper
01:50:37.020 | that has a nice graphic for how you can take an image
01:50:39.720 | and you can chunk it into integers.
01:50:43.460 | And these would basically
01:50:45.340 | become the tokens of images
01:50:47.700 | as an example.
01:50:48.900 | And these tokens can be hard tokens
01:50:51.540 | where you force them to be integers.
01:50:53.620 | They can also be soft tokens
01:50:55.540 | where you sort of don't require these to be discrete,
01:51:00.240 | but you do force these representations
01:51:02.300 | to go through bottlenecks like in autoencoders.
01:51:04.940 | Also in this paper that came out from OpenAI Sora,
01:51:08.420 | which I think really blew the mind of many people
01:51:12.420 | and inspired a lot of people in terms of what's possible,
01:51:14.980 | they have a graphic here and they talk briefly
01:51:16.620 | about how LLMs have text tokens, Sora has visual patches.
01:51:21.340 | So again, they came up with a way to chunk videos
01:51:24.260 | into basically the tokens with their own vocabularies.
01:51:27.540 | And then you can either process discrete tokens,
01:51:29.660 | say with autoregressive models,
01:51:31.040 | or even soft tokens with diffusion models.
01:51:33.660 | And all of that is sort of being actively worked on,
01:51:38.320 | designed on, and it's beyond the scope of this video,
01:51:40.140 | but just something I wanted to mention briefly.
01:51:42.020 | Okay, now that we have come quite deep
01:51:44.340 | into the tokenization algorithm
01:51:45.660 | and we understand a lot more about how it works,
01:51:48.340 | let's loop back around to the beginning of this video
01:51:50.620 | and go through some of these bullet points
01:51:51.940 | and really see why they happen.
01:51:54.740 | So first of all, why can't my LLM spell words very well
01:51:57.820 | or do other spell-related tasks?
01:52:00.160 | So fundamentally, this is because, as we saw,
01:52:03.800 | these characters are chunked up into tokens.
01:52:07.180 | And some of these tokens are actually fairly long.
01:52:09.820 | So as an example, I went to the GPT-4 vocabulary
01:52:12.740 | and I looked at one of the longer tokens.
01:52:15.180 | So .DefaultCellStyle turns out to be a single individual token.
01:52:19.380 | So that's a lot of characters for a single token.
01:52:21.860 | So my suspicion is that there's just too much crammed
01:52:24.340 | into this single token.
01:52:25.820 | And my suspicion was that the model should not be very good
01:52:28.580 | at tasks related to spelling of this single token.
01:52:33.580 | So I asked, how many letters L are there
01:52:36.980 | in the word .DefaultCellStyle?
01:52:39.280 | And of course, my prompt is intentionally done that way.
01:52:44.280 | And you see how .DefaultCellStyle will be a single token.
01:52:46.500 | So this is what the model sees.
01:52:48.260 | So my suspicion is that it wouldn't be very good at this.
01:52:50.500 | And indeed, it is not.
01:52:51.940 | It doesn't actually know how many Ls are in there.
01:52:54.140 | It thinks there are three, and actually there are four,
01:52:56.900 | if I'm not getting this wrong myself.
01:52:59.300 | So that didn't go extremely well.
01:53:01.360 | Let's look at another kind of character-level task.
01:53:05.480 | So for example, here I asked GPT-4
01:53:08.540 | to reverse the string .DefaultCellStyle.
01:53:11.060 | And it tried to use a code interpreter.
01:53:13.260 | I stopped it, and I said, just do it, just try it.
01:53:16.420 | And it gave me jumble.
01:53:18.820 | So it doesn't actually really know how to reverse
01:53:22.060 | this string going from right to left.
01:53:24.740 | So it gave a wrong result.
01:53:26.600 | So again, like working with this working hypothesis
01:53:29.000 | that maybe this is due to the tokenization,
01:53:31.060 | I tried a different approach.
01:53:32.180 | I said, okay, let's reverse the exact same string,
01:53:35.340 | but take the following approach.
01:53:37.260 | Step one, just print out every single character
01:53:39.400 | separated by spaces,
01:53:40.660 | and then as a step two, reverse that list.
01:53:43.260 | And it again tried to use a tool, but when I stopped it,
01:53:45.820 | it first produced all the characters,
01:53:48.380 | and that was actually correct.
01:53:50.040 | And then it reversed them,
01:53:50.880 | and that was correct once it had this.
01:53:52.940 | So somehow it can't reverse it directly,
01:53:54.760 | but when you go just first, you know,
01:53:57.500 | listing it out in order, it can do that somehow.
01:54:00.100 | And then it can, once it's broken up this way,
01:54:03.260 | this becomes all these individual characters.
01:54:05.480 | And so now this is much easier for it
01:54:07.220 | to see these individual tokens
01:54:09.080 | and reverse them and print them out.
01:54:11.720 | So that is kind of interesting.
01:54:14.620 | So let's continue now.
01:54:16.300 | Why are LLMs worse at non-English languages?
01:54:20.460 | And I briefly covered this already,
01:54:21.780 | but basically it's not only that the language model
01:54:25.500 | sees less non-English data
01:54:27.420 | during training of the model parameters,
01:54:29.660 | but also the tokenizer is not sufficiently trained
01:54:34.100 | on non-English data.
01:54:36.060 | And so here, for example, hello, how are you is five tokens,
01:54:40.140 | and its translation is 15 tokens.
01:54:42.380 | So this is a three times blow up.
01:54:45.020 | And so for example, annyeong haseyo is just hello,
01:54:48.260 | basically in Korean.
01:54:49.300 | And that ends up being three tokens.
01:54:50.820 | I'm actually kind of surprised by that
01:54:51.980 | because that is a very common phrase.
01:54:53.900 | That's just a typical greeting of like hello.
01:54:56.180 | And that ends up being three tokens,
01:54:57.620 | whereas our hello is a single token.
01:54:59.820 | And so basically everything is a lot more bloated
01:55:01.580 | and diffuse.
01:55:02.420 | And this is, I think, partly the reason
01:55:03.980 | that the model works worse on other languages.
01:55:07.000 | Coming back, why are LLMs bad at simple arithmetic?
01:55:11.200 | That has to do with the tokenization of numbers.
01:55:16.540 | And so you'll notice that, for example,
01:55:19.140 | addition is very sort of like,
01:55:20.860 | there's an algorithm that is like character level
01:55:24.200 | for doing addition.
01:55:25.760 | So for example, here, we would first add the ones
01:55:28.180 | and then the tens and then the hundreds.
01:55:29.840 | You have to refer to specific parts of these digits.
01:55:33.340 | But these numbers are represented
01:55:35.940 | completely arbitrarily based on whatever happened
01:55:37.780 | to merge or not merge during the tokenization process.
01:55:40.940 | There's an entire blog post about this
01:55:42.360 | that I think is quite good.
01:55:43.740 | Integer tokenization is insane.
01:55:45.500 | And this person basically systematically explores
01:55:47.860 | the tokenization of numbers in, I believe this is GPT-2.
01:55:52.020 | And so they noticed that, for example,
01:55:53.320 | for four-digit numbers, you can take a look at
01:55:57.720 | whether it is a single token
01:56:00.280 | or whether it is two tokens, split as a 1-3,
01:56:02.640 | a 2-2, or a 3-1 combination.
01:56:04.800 | And so all the different numbers
01:56:06.200 | are all the different combinations.
01:56:07.760 | And you can imagine this is all completely arbitrarily so.
01:56:10.520 | And the model, unfortunately, sometimes sees
01:56:12.880 | a token for all four digits, sometimes for three,
01:56:17.280 | sometimes for two, sometimes for one,
01:56:19.200 | and it's in an arbitrary manner.
01:56:22.060 | And so this is definitely a headwind, if you will,
01:56:25.020 | for the language model.
01:56:26.020 | And it's kind of incredible that it can kind of do it
01:56:27.900 | and deal with it, but it's also kind of not ideal.
01:56:31.020 | And so that's why, for example, we saw that Meta,
01:56:33.060 | when they trained Llama 2
01:56:35.160 | and they used sentence piece,
01:56:36.580 | they make sure to split up all the digits,
01:56:40.580 | as an example, for Llama 2.
01:56:43.460 | And this is partly to improve
01:56:44.900 | a simple arithmetic kind of performance.
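
You can poke at this arbitrariness yourself with a few lines; the exact splits depend entirely on which merges happened during tokenizer training, so the output will differ from number to number.

    import tiktoken

    enc = tiktoken.get_encoding("gpt2")
    for n in ["1234", "2024", "9999", "4267"]:
        ids = enc.encode(n)
        print(n, ids, [enc.decode([i]) for i in ids])   # same-length numbers, different chunkings
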
01:56:48.020 | And finally, why is GPT-2 not as good in Python?
01:56:52.040 | Again, this is partly a modeling issue
01:56:53.880 | on the architecture and the dataset
01:56:55.840 | and the strength of the model,
01:56:57.320 | but it's also partly tokenization.
01:56:59.240 | Because as we saw here with a simple Python example,
01:57:02.020 | the encoding efficiency of the tokenizer
01:57:04.760 | for handling spaces in Python is terrible.
01:57:07.180 | And every single space is an individual token.
01:57:09.360 | And this dramatically reduces the context length
01:57:11.680 | that the model can attend across.
01:57:13.320 | So that's almost like a tokenization bug for GPT-2.
01:57:16.400 | And that was later fixed with GPT-4.
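
A quick, hedged comparison of that efficiency difference: the exact counts depend on the snippet, but GPT-4's vocabulary merges runs of spaces, so its count should be noticeably lower.

    import tiktoken

    code = "def f():\n        x = 1\n        return x\n"   # deliberately indented Python
    gpt2 = tiktoken.get_encoding("gpt2")
    gpt4 = tiktoken.get_encoding("cl100k_base")
    print("gpt2 tokens:", len(gpt2.encode(code)))
    print("gpt4 tokens:", len(gpt4.encode(code)))
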
01:57:19.580 | Okay, so here's another fun one.
01:57:21.240 | My LLM abruptly halts when it sees the string <|endoftext|>.
01:57:25.080 | So here's a very strange behavior.
01:57:28.200 | Print the string <|endoftext|>, is what I told GPT-4.
01:57:31.440 | And it says, could you please specify the string?
01:57:33.960 | And I'm telling it, give me <|endoftext|>.
01:57:36.440 | And it seems like there's an issue.
01:57:38.000 | It's not seeing <|endoftext|>.
01:57:40.280 | And then I give it <|endoftext|> as the string,
01:57:42.840 | and then here's the string,
01:57:44.200 | and then it just doesn't print it.
01:57:45.840 | So obviously something is breaking here
01:57:47.200 | with respect to the handling of the special token.
01:57:49.520 | And I didn't actually know what OpenAI is doing
01:57:51.600 | under the hood here,
01:57:53.000 | and whether they are potentially parsing this
01:57:55.600 | as an actual token,
01:57:59.040 | instead of this just being the literal text <|endoftext|>
01:58:01.800 | as like individual sort of pieces of it
01:58:05.260 | without the special token handling logic.
01:58:08.280 | And so it might be that someone,
01:58:09.920 | when they're calling .encode,
01:58:11.960 | they are passing in allowed_special,
01:58:14.260 | and they are allowing <|endoftext|>
01:58:16.720 | as a special token in the user prompt.
01:58:19.260 | But the user prompt, of course,
01:58:20.280 | is a sort of attacker-controlled text.
01:58:23.660 | So you would hope that they don't really parse
01:58:25.840 | or use special tokens from that kind of input.
01:58:30.240 | But it appears that there's something
01:58:31.240 | definitely going wrong here.
01:58:32.640 | And so your knowledge of these special tokens
01:58:35.840 | ends up being an attack surface, potentially.
01:58:38.120 | And so if you'd like to confuse LLMs,
01:58:41.440 | then just try to give them some special tokens
01:58:44.120 | and see if you're breaking something by chance.
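
For comparison, this is how the tiktoken library itself exposes that choice; by default it refuses to treat special-token strings found in user text as special, which is exactly the safe behavior you would want here.

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = "<|endoftext|>"

    print(enc.encode(text, allowed_special={"<|endoftext|>"}))  # a single id for the special token
    print(enc.encode(text, disallowed_special=()))              # treated as ordinary text: several plain tokens
    # enc.encode(text)  # default behavior: raises an error to flag the special string in user text
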
01:58:46.320 | Okay, so this next one is a really fun one,
01:58:48.480 | the trailing whitespace issue.
01:58:51.960 | So if you come to Playground,
01:58:54.440 | and we come here to GPT-3.5-Turbo-Instruct.
01:58:58.320 | So this is not a chat model.
01:58:59.520 | This is a completion model.
01:59:00.720 | So think of it more like,
01:59:02.440 | it's a lot more closer to a base model.
01:59:04.600 | It does completion.
01:59:05.960 | It will continue the token sequence.
01:59:08.720 | So here's a tagline for Ice Cream Shop,
01:59:10.520 | and we wanna continue the sequence.
01:59:12.600 | And so we can submit and get a bunch of tokens.
01:59:15.600 | Okay, no problem.
01:59:17.400 | But now, suppose I do this,
01:59:19.720 | but instead of pressing submit here,
01:59:22.420 | I do, here's a tagline for Ice Cream Shop space.
01:59:25.880 | So I have a space here before I click submit.
01:59:28.560 | We get a warning.
01:59:31.240 | Your text ends in a trailing space,
01:59:32.760 | which causes worse performance
01:59:33.920 | due to how the API splits text into tokens.
01:59:37.240 | So what's happening here,
01:59:38.200 | it still gave us a sort of completion here,
01:59:40.800 | but let's take a look at what's happening.
01:59:42.760 | So here's a tagline for an Ice Cream Shop.
01:59:46.600 | And then, what does this look like
01:59:49.000 | in the actual training data?
01:59:50.160 | Suppose you found the completion in the training document,
01:59:53.120 | somewhere on the internet,
01:59:54.120 | and the LLM trained on this data.
01:59:56.360 | So maybe it's something like, oh yeah.
01:59:59.280 | Maybe that's the tagline.
02:00:00.120 | That's a terrible tagline.
02:00:01.160 | But notice here, with this 'Oh',
02:00:04.680 | you see that because there's the,
02:00:06.960 | the space character is always a prefix
02:00:09.000 | to these tokens in GPT.
02:00:11.300 | So it's not an O token.
02:00:12.560 | It's a space O token.
02:00:14.040 | The space is part of the O.
02:00:16.520 | And together, they are token 8840.
02:00:19.160 | That's space O.
02:00:20.360 | So what's happening here is that
02:00:23.680 | when I just have it like this,
02:00:25.760 | and I let it complete the next token,
02:00:28.120 | it can sample the space O token.
02:00:31.120 | But instead, if I have this and I add my space,
02:00:34.140 | then what I'm doing here when I encode this string
02:00:36.740 | is I have basically, here's a tagline for an ice cream shop,
02:00:40.940 | and this space at the very end becomes a token 220.
02:00:43.840 | And so we've added token 220,
02:00:47.780 | and this token otherwise would be part of the tagline.
02:00:51.020 | Because if there actually is a tagline here,
02:00:52.700 | so space O is the token.
02:00:55.260 | And so this is suddenly out of distribution for the model,
02:00:58.820 | because this space is part of the next token,
02:01:01.500 | but we're putting it here like this,
02:01:03.900 | and the model has seen very, very little data
02:01:06.700 | of actual space by itself.
02:01:09.740 | And we're asking it to complete the sequence,
02:01:11.300 | like add in more tokens.
02:01:12.820 | But the problem is that we've sort of begun the first token,
02:01:16.200 | and now it's been split up,
02:01:18.340 | and now we're out of distribution,
02:01:20.020 | and now arbitrary bad things happen.
02:01:22.100 | And it's just a very rare example
02:01:23.740 | for it to see something like that.
02:01:25.460 | And that's why we did the warning.
02:01:27.740 | So the fundamental issue here is, of course,
02:01:29.580 | that the LLM is on top of these tokens,
02:01:33.660 | and these tokens are text chunks.
02:01:35.180 | They're not characters in a way
02:01:36.620 | you and I would think of them.
02:01:38.300 | These are the atoms of what the LLM is seeing,
02:01:40.860 | and there's a bunch of weird stuff that comes out of it.
02:01:43.180 | Let's go back to our .DefaultCellStyle token.
02:01:46.800 | I bet you that the model has never in its training set
02:01:50.020 | seen .DefaultCellStyle without the 'le' in there.
02:01:55.020 | It's always seen this as a single group,
02:01:57.140 | because this is some kind of a function in,
02:02:00.300 | I don't actually know what this is part of.
02:02:03.100 | It's some kind of API.
02:02:04.180 | But I bet you that it's never seen this combination
02:02:06.980 | of tokens in its training data,
02:02:09.740 | because, or I think it would be extremely rare.
02:02:12.260 | So I took this and I copied pasted it here,
02:02:14.760 | and I tried to complete from it,
02:02:17.560 | and it immediately gave me a big error.
02:02:20.000 | And it said, "The model predicted a completion
02:02:21.460 | "that begins with a stop sequence
02:02:22.740 | "resulting in no output.
02:02:23.820 | "Consider adjusting your prompt or stop sequences."
02:02:26.460 | So what happened here when I clicked submit
02:02:28.620 | is that immediately the model emitted
02:02:30.860 | sort of like an end-of-text token,
02:02:32.740 | I think, or something like that.
02:02:34.320 | It basically predicted the stop sequence immediately.
02:02:36.940 | So it had no completion.
02:02:38.660 | And so this is why I'm getting a warning again,
02:02:40.580 | because we're off the data distribution
02:02:42.800 | and the model is just predicting
02:02:46.180 | just totally arbitrary things.
02:02:47.740 | It's just really confused, basically.
02:02:49.380 | This is giving it brain damage.
02:02:51.020 | It's never seen this before.
02:02:52.100 | It's shocked, and it's predicting end of text or something.
02:02:55.540 | I tried it again here, and in this case, it completed it.
02:02:58.920 | But then for some reason,
02:03:00.280 | this request may violate our usage policies.
02:03:02.840 | This was flagged.
02:03:03.820 | Basically, something just goes wrong,
02:03:07.200 | and there's something like jank.
02:03:08.280 | You can just feel the jank
02:03:09.360 | because the model is extremely unhappy with just this,
02:03:12.320 | and it doesn't know how to complete it
02:03:13.420 | because it's never occurred in a training set.
02:03:15.420 | In a training set, it always appears like this
02:03:17.920 | and becomes a single token.
02:03:20.040 | So these kinds of issues where tokens are,
02:03:22.860 | either you sort of complete the first character
02:03:25.180 | of the next token,
02:03:26.520 | or you have long tokens
02:03:28.480 | that you then have just some of the characters off,
02:03:31.320 | all of these are kind of like issues with partial tokens,
02:03:35.000 | is how I would describe it.
02:03:36.800 | And if you actually dig into the tiktoken repository,
02:03:39.760 | go to the Rust code and search for unstable,
02:03:43.400 | and you'll see encode_unstable_native, unstable tokens,
02:03:49.280 | and a lot of special case handling.
02:03:51.520 | None of this stuff about unstable tokens
02:03:53.400 | is documented anywhere,
02:03:54.960 | but there's a ton of code dealing with unstable tokens,
02:03:57.960 | and unstable tokens is exactly
02:04:00.200 | kind of like what I'm describing here.
02:04:02.320 | What you would like out of a completion API
02:04:05.000 | is something a lot more fancy.
02:04:06.280 | Like if we're putting in .DefaultCellSty, our partial token,
02:04:08.740 | if we're asking for the next token sequence,
02:04:10.580 | we're not actually trying to append the next token
02:04:12.560 | exactly after this list.
02:04:14.400 | We're actually trying to consider
02:04:16.240 | lots of possible continuations,
02:04:18.200 | that is,
02:04:22.120 | we're trying to search over characters
02:04:24.320 | that, if we re-tokenized, would be of high probability,
02:04:28.220 | if that makes sense.
02:04:29.220 | So that we can actually add a single individual character
02:04:33.520 | instead of just like adding the next full token
02:04:35.800 | that comes after this partial token list.
02:04:38.760 | So this is very tricky to describe,
02:04:40.560 | and I invite you to maybe like look through this.
02:04:42.840 | It ends up being extremely gnarly and hairy kind of topic,
02:04:45.640 | and it comes from tokenization fundamentally.
02:04:47.600 | So maybe I can even spend an entire video
02:04:50.660 | talking about unstable tokens sometime in the future.
02:04:53.120 | Okay, and I'm really saving the best for last.
02:04:55.320 | My favorite one by far is the solid gold Magikarp.
02:04:58.940 | (laughs)
02:05:00.280 | It was just, okay, so this comes from this blog post,
02:05:02.720 | solid gold Magikarp.
02:05:04.640 | And this is internet famous now for those of us in LLMs.
02:05:09.640 | And basically I would advise you
02:05:11.440 | to read this blog post in full.
02:05:13.600 | But basically what this person was doing
02:05:15.080 | is this person went to the token embedding table
02:05:21.200 | and clustered the tokens
02:05:23.040 | based on their embedding representation.
02:05:25.680 | And this person noticed that there's a cluster of tokens
02:05:29.160 | that look really strange.
02:05:30.420 | So there's a cluster here, @rot, E-stream fame,
02:05:33.600 | solid gold Magikarp, signup message,
02:05:35.640 | like really weird tokens in basically
02:05:39.720 | in this embedding cluster.
02:05:41.520 | And so where are these tokens
02:05:42.720 | and where did they even come from?
02:05:43.760 | Like what is solid gold Magikarp?
02:05:45.120 | It makes no sense.
02:05:46.160 | And then they found a bunch of these tokens.
02:05:50.900 | And then they noticed that actually the plot thickens here
02:05:53.340 | because if you ask the model about these tokens,
02:05:56.100 | like you ask it some very benign question,
02:05:59.060 | like please can you repeat back to me
02:06:00.580 | the string SolidGoldMagikarp?
02:06:03.020 | Then you get a variety of basically
02:06:04.860 | totally broken LLM behavior.
02:06:07.500 | So either you get evasion, so I'm sorry I can't hear you,
02:06:10.820 | or you get a bunch of hallucinations as a response.
02:06:13.460 | You can even get back like insults.
02:06:16.340 | So you ask it about streamer bot
02:06:18.660 | and the model actually just calls you names.
02:06:22.300 | Or it kind of comes up with like weird humor.
02:06:25.300 | Like you're actually breaking the model
02:06:27.300 | by asking about these very simple strings
02:06:29.500 | like @rot and solid gold Magikarp.
02:06:32.240 | So like what the hell is happening?
02:06:33.600 | And there's a variety of here documented behaviors.
02:06:36.980 | There's a bunch of tokens, not just SolidGoldMagikarp,
02:06:39.060 | that have that kind of a behavior.
02:06:41.380 | And so basically there's a bunch of like trigger words.
02:06:43.940 | And if you ask the model about these trigger words,
02:06:46.100 | or you just include them in your prompt,
02:06:48.160 | the model goes haywire
02:06:49.300 | and has all kinds of really strange behaviors,
02:06:52.180 | including sort of ones that violate
02:06:54.220 | typical safety guidelines and the alignment of the model,
02:06:57.600 | like it's swearing back at you.
02:06:59.600 | So what is happening here
02:07:01.180 | and how can this possibly be true?
02:07:03.180 | Well, this again comes down to tokenization.
02:07:06.140 | So what's happening here is that SolidGoldMagikarp,
02:07:08.500 | if you actually dig into it, is a Reddit user.
02:07:11.580 | So there's a u/SolidGoldMagikarp.
02:07:15.100 | And probably what happened here,
02:07:16.840 | even though I don't know that this has been like
02:07:18.360 | really definitively explored,
02:07:20.420 | but what is thought to have happened
02:07:22.860 | is that the tokenization dataset was very different
02:07:26.180 | from the training dataset for the actual language model.
02:07:29.660 | So in the tokenization dataset,
02:07:30.940 | there was a ton of Reddit data potentially,
02:07:33.380 | where u/SolidGoldMagikarp was mentioned in the text.
02:07:37.300 | Because SolidGoldMagikarp was a very active
02:07:40.420 | sort of person who would post a lot,
02:07:43.340 | this would be a string that occurs many times
02:07:45.220 | in a tokenization dataset.
02:07:46.940 | Because it occurs many times in the tokenization dataset,
02:07:50.020 | these tokens would end up getting merged
02:07:51.660 | to a single individual token
02:07:53.100 | for that single Reddit user, SolidGoldMagikarp.
02:07:56.300 | So they would have a dedicated token
02:07:58.220 | in a vocabulary of, was it 50,000 tokens in GPT-2,
02:08:01.820 | that is devoted to that Reddit user.
02:08:04.020 | And then what happens is the tokenization dataset
02:08:06.700 | has those strings, but then later when you train the model,
02:08:11.020 | the language model itself,
02:08:13.300 | this data from Reddit was not present.
02:08:16.300 | And so therefore, in the entire training set
02:08:18.760 | for the language model, SolidGoldMagikarp never occurs.
02:08:22.300 | That token never appears in the training set
02:08:25.100 | for the actual language model later.
02:08:27.440 | So this token never gets activated.
02:08:29.860 | It's initialized at random in the beginning of optimization.
02:08:32.780 | Then you have forward-backward passes
02:08:34.140 | and updates to the model.
02:08:35.300 | And this token is just never updated in the embedding table.
02:08:37.900 | That row vector never gets sampled.
02:08:39.980 | It never gets used.
02:08:41.060 | So it never gets trained and it's completely untrained.
02:08:43.700 | It's kind of like unallocated memory
02:08:45.540 | in a typical binary program written in C
02:08:47.940 | or something like that.
02:08:49.220 | So it's unallocated memory.
02:08:50.740 | And then at test time, if you evoke this token,
02:08:53.460 | then you're basically plucking out a row
02:08:55.320 | of the embedding table that is completely untrained
02:08:57.260 | and that feeds into a transformer
02:08:58.900 | and creates undefined behavior.
02:09:00.820 | And that's what we're seeing here.
02:09:01.740 | This is completely undefined,
02:09:02.980 | never-before-seen-in-training behavior.
02:09:05.700 | And so any of these kind of like weird tokens
02:09:07.640 | would evoke this behavior
02:09:08.620 | because fundamentally the model is out of sample,
02:09:13.540 | out of distribution.
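To make the "untrained embedding row" point concrete, here is a minimal sketch (not the video's code; toy sizes, and the token id is purely illustrative) showing that a row whose token never appears in the training batches receives zero gradient and never moves from its random initialization:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, n_embd = 50257, 16           # GPT-2-sized vocab, toy embedding width
emb = nn.Embedding(vocab_size, n_embd)
initial = emb.weight.detach().clone()

rare_id = 43453                          # illustrative id for a token that never occurs in the data
batch = torch.randint(0, 1000, (8, 32))  # training batches drawn only from ids 0..999
opt = torch.optim.SGD(emb.parameters(), lr=0.1)

for _ in range(100):
    loss = emb(batch).pow(2).mean()      # stand-in objective; any loss works for the demonstration
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.equal(emb.weight[rare_id], initial[rare_id]))          # True: this row never got a gradient
print(torch.equal(emb.weight[batch[0, 0]], initial[batch[0, 0]]))  # False: rows that were used did move
```

At test time, feeding `rare_id` into the model plucks out exactly that never-updated random row, which is the "unallocated memory" behavior described above.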
02:09:16.140 | Okay, and the very last thing I wanted
02:09:17.300 | to just briefly mention and point out,
02:09:18.980 | although I think a lot of people are quite aware of this,
02:09:21.340 | is that different kinds of formats
02:09:22.900 | and different representations and different languages
02:09:25.060 | and so on might be more or less efficient
02:09:27.500 | with GPT tokenizers, or with any other tokenizer
02:09:30.860 | for any other LLM, for that matter.
02:09:32.060 | So for example, JSON is actually really dense in tokens
02:09:35.140 | and YAML is a lot more efficient in tokens.
02:09:39.040 | So for example, the same structured data in JSON and in YAML:
02:09:43.280 | the JSON is 116 tokens and the YAML is 99 tokens.
02:09:47.540 | So quite a bit of an improvement.
02:09:49.400 | And so in the token economy, where we are paying per token
02:09:53.760 | in many ways, you are paying in context length
02:09:56.400 | and you're paying in dollar amount
02:09:58.560 | for the cost of processing all this kind of structured data
02:10:01.520 | when you have to.
02:10:03.000 | So prefer to use YAML over JSON.
02:10:05.340 | And in general, tokenization density is
02:10:07.640 | something that you have to care about
02:10:10.560 | and worry about at all times:
02:10:11.840 | try to find efficient encoding schemes,
02:10:14.180 | spend a lot of time in tiktokenizer,
02:10:16.080 | and measure the token efficiencies
02:10:18.180 | of different formats and settings and so on.
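Here is a minimal sketch of measuring this yourself (assuming the tiktoken package is installed; the sample data and the exact counts it produces are illustrative, not the figures quoted above):

```python
import json
import tiktoken

data = {"products": [{"name": "widget", "price": 9.99, "tags": ["a", "b"]}]}

as_json = json.dumps(data, indent=2)
# hand-written equivalent YAML, to avoid a pyyaml dependency
as_yaml = "products:\n- name: widget\n  price: 9.99\n  tags: [a, b]\n"

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 base encoding
print("JSON tokens:", len(enc.encode(as_json)))
print("YAML tokens:", len(enc.encode(as_yaml)))
```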
02:10:20.520 | Okay, so that concludes my fairly long video
02:10:23.000 | on tokenization.
02:10:24.440 | I know it's dry, I know it's annoying,
02:10:26.900 | I know it's irritating.
02:10:28.080 | I personally really dislike this stage.
02:10:30.660 | What I do have to say at this point is don't brush it off.
02:10:33.880 | There's a lot of foot guns, sharp edges here,
02:10:36.080 | security issues, AI safety issues,
02:10:38.940 | as we saw plugging in unallocated memory
02:10:41.120 | into language models.
02:10:43.020 | So it's worth understanding this stage.
02:10:46.020 | That said, I will say that eternal glory goes to anyone
02:10:49.860 | who can get rid of it.
02:10:51.160 | I showed you one possible paper that tried to do that.
02:10:54.440 | And I think, I hope a lot more can follow over time.
02:10:58.080 | And my final recommendations for applications right now
02:11:00.460 | are: if you can reuse the GPT-4 tokens and vocabulary
02:11:04.240 | in your application,
02:11:05.080 | that's something you should consider,
02:11:06.200 | and just use tiktoken because it is a very efficient
02:11:08.800 | and nice library for BPE inference.
02:11:12.640 | I also really like the byte-level BPE
02:11:14.440 | that tiktoken and OpenAI use.
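For reference, a minimal sketch of what BPE inference with tiktoken looks like (assuming the package is installed):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 base vocabulary
ids = enc.encode("hello world")
print(ids)
print(enc.decode(ids))  # round-trips back to "hello world"
```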
02:11:16.540 | If you for some reason want to train your own vocabulary
02:11:20.120 | from scratch, then I would use BPE with sentencepiece.
02:11:25.120 | Though, as I mentioned, I'm not a huge fan of sentencepiece.
02:11:29.440 | I don't like its byte fallback,
02:11:33.040 | and I don't like that it's doing BPE
02:11:34.920 | on Unicode code points.
02:11:36.320 | It also has like a million settings,
02:11:38.600 | and I think there are a lot of foot guns here.
02:11:40.000 | I think it's really easy to miscalibrate them,
02:11:42.160 | and you end up cropping your sentences
02:11:43.680 | or something like that because of some hyperparameter
02:11:46.600 | that you don't fully understand.
02:11:48.000 | So be very careful with the settings:
02:11:50.400 | try to copy-paste exactly what Meta did,
02:11:53.240 | or basically spend a lot of time looking
02:11:55.400 | at all the hyperparameters,
02:11:56.880 | go through the code of sentencepiece,
02:11:58.440 | and make sure that you have this correct.
02:12:00.440 | But even if you have all the settings correct,
02:12:03.160 | I still think that the algorithm is kind of inferior
02:12:05.720 | to what's happening here.
02:12:07.440 | And maybe,
02:12:08.880 | if you really need to train your vocabulary,
02:12:10.960 | the best thing is to just wait
02:12:12.240 | for minbpe to become as efficient as possible.
02:12:15.920 | And that's something that I hope to work on.
02:12:19.360 | At some point, maybe we can be training with it;
02:12:22.520 | really what we want is tiktoken, but with training code.
02:12:25.800 | And that is the ideal thing that currently does not exist.
02:12:29.640 | minbpe is an implementation of it,
02:12:32.920 | but currently it's in Python.
02:12:34.960 | So that's currently what I have to say for tokenization.
02:12:38.040 | There might be an advanced video in the future
02:12:41.040 | that is even drier and even more detailed.
02:12:42.880 | But for now, I think we're gonna leave things off here
02:12:45.040 | and I hope that was helpful.