How LLMs work for Web Devs: GPT in 600 lines of Vanilla JS - Ishan Anand

00:00:00.000 |
Okay, thank you for coming bright and early, 9:00 a.m. at the start of the conference, 00:00:24.420 |
for LLMs for Web Devs, GPT in 600 lines of vanilla JavaScript. I think you guys are going 00:00:32.740 |
to have a great conference. I was here last year. I thoroughly enjoyed it. And I think 00:00:37.300 |
this is a great way to kick things off. If you're just coming to this field or this conference 00:00:42.580 |
without any background in machine learning, this is your missing AI degree that will help 00:00:47.900 |
make the rest of the conference, I think, hopefully a lot more valuable. 00:00:52.260 |
So if you're just joining us and came in, you can go to spreadsheets-are-all-you-need.ai. 00:00:59.260 |
And there's a Discord link in the upper menu. Click on that, and then go to the AI Engineer 00:01:05.040 |
World's Fair 2025 room. And then there's a link to download the GPT-2 model weights in 00:01:12.000 |
the pinned message at the top. It takes a little while to download that, so I would download 00:01:16.040 |
that and get started. Because the WiFi might be a little bit slow. We're going to be using 00:01:23.040 |
Okay. We're going to do something special today. It is a talk. And I will be doing a lot of talking 00:01:29.880 |
at you. And we'll be running code, so it is a workshop. But our mission today is to break 00:01:37.040 |
Clarke's third law. You might be familiar with the science fiction author, Arthur C. Clarke, 00:01:44.040 |
and his famous maxim that any sufficiently advanced technology is indistinguishable from magic. 00:01:49.820 |
And nowhere is that more true or relevant today than when it comes to large language models. 00:01:56.040 |
These are seemingly magical machines that can produce lifelike text, automate tasks as agents, 00:02:04.040 |
and maybe even replace humans in certain contexts. And if you ask somebody how these work or you 00:02:10.900 |
go online, you're liable to get the impression that you need to have semesters of linear algebra 00:02:17.680 |
and calculus before you can begin taking your first machine learning class and understand how 00:02:22.760 |
these work. And yes, that's true if you want to be a machine learning engineer. But if you just 00:02:28.760 |
want to understand how these work and you're a builder on top of them, I'm here to tell you that is not 00:02:34.200 |
true. Do not believe them. You don't need all that sophistication if you just want to have a really 00:02:39.800 |
accurate model of how a transformer works. I know because I was here last year. I gave a talk called 00:02:47.720 |
Spreadsheets are all you need where I showed an Excel worksheet that implemented all of GPT-2 small 00:02:53.720 |
entirely in pure Excel functions. And then I took that spreadsheet and I turned it into a class that I 00:02:59.560 |
taught online where I took people tab by tab through how the entire model works. And not everyone was even 00:03:07.720 |
an engineer. So one of my favorites is this guy. This Joe here is a CFO. He's just naturally good at Excel. 00:03:15.720 |
And that gave him everything he needed combined with the sheet to understand how a large language model 00:03:21.000 |
works on the inside. And he says, this is great. I had no experience in machine learning or AI concepts 00:03:27.160 |
before. Yet he was able to walk away with a very good understanding of how they work. So what I'm going to do 00:03:31.720 |
today is compress that class, which is about eight hours down to these two hours today, and explain how 00:03:37.720 |
the transformer works. And I show up today or this year at the conference with this. Instead of Excel, 00:03:45.720 |
because not every one of us has a job that requires us to be really good at Excel, we're using a vanilla 00:03:51.720 |
JavaScript implementation. Because a lot of folks who come to AI engineering as a field have a web development or full-stack 00:03:59.720 |
JavaScript background. And so if that's your background, you are perfect the way you are. 00:04:05.720 |
You don't need to learn Python if you just want to understand how a model works. And you're still going 00:04:09.720 |
to use TypeScript and Next.js around the model, but you still want to have a good understanding of how it works. 00:04:15.720 |
So today's approach is I'm going to give you the background to understand the code. And we're going to take 00:04:21.720 |
a brief walkthrough of it. I'm going to focus more on the why than on the what. So we'll take a look at the code, 00:04:29.720 |
but I'm going to spend a lot of time building intuition and background to understand the code. 00:04:35.720 |
And then instead of complex equations, I'm going to use analogies and examples to make 00:04:41.720 |
it more tangible. Okay. And the background you need to have is, first of all, just motivation. And a 00:04:47.720 |
curiosity to understand how these work on the inside. Any science, technology, engineering background is 00:04:53.720 |
sufficient. Prior programming experience, especially in JavaScript, but you don't need to know React or Vue. 00:04:59.720 |
We're going to just use vanilla JavaScript. And then some awareness, I would call it, of linear algebra, 00:05:05.720 |
meaning you just need to know what a matrix multiplication is. You don't need to be a hot shot 00:05:11.720 |
JavaScript ninja. You don't need prior AI or ML background. And you don't need, you know, deep 00:05:17.720 |
calculus or linear algebra fluency. Okay. And the key resources for today are going to be our 00:05:23.720 |
JavaScript implementation of GPT-2. Use Chrome on the desktop. And if you go to spreadsheets are all you need, 00:05:31.720 |
.ai -- and I apologize, that is a long domain name -- slash GPT-2, it will load up this implementation 00:05:38.720 |
and it will run locally in your browser. There's a Discord server where I've dropped some links. 00:05:43.720 |
And if you've got questions, feel free to drop them in there as well. Okay. So this is a simplified diagram of 00:05:52.720 |
GPT-2. It does not look like your classic transformer diagram intentionally. And it will serve as our 00:05:59.720 |
roadmap for what we're going to do throughout today's workshop. I'm going to start by just giving you 00:06:05.720 |
some background on LLMs and our JavaScript implementation of GPT-2, how to get it running, 00:06:11.720 |
how to get it started. And then we're going to focus on these three areas for the most of it. And that's tokenization, 00:06:19.720 |
embeddings, and then the language head. And I've focused on those because that's the input and the output of the model. 00:06:26.720 |
And those are the most important to have the background and understanding if you're going to be building systems around and on top of LLMs. 00:06:34.720 |
We will cover the inside number crunching part. That's attention and the multilayer perceptron at a pretty high level. 00:06:41.720 |
I'll go a little bit more into the multilayer perceptron because then I can explain about backpropagation and it serves as kind of a foundation to understand how the model learns, which is an important concept. 00:06:53.720 |
And then finally, I'll talk about the difference between GPT-2 and ChatGPT. So they were separated by three or four years. What were those innovations that made the model seem so much smarter than GPT-2? 00:07:08.720 |
Hint, it wasn't necessarily anything algorithmic. Okay, so let's start with a quick tour of our JavaScript implementation of GPT-2 and then some background on LLMs. 00:07:20.720 |
So this is what you get when you load up GPT-2, slash GPT-2 on the spreadsheets-are-all-you-need website. 00:07:33.720 |
If you scroll down, the first thing you're going to want to do, there's a link here that says download the GPT-2 small CSV. 00:07:41.720 |
So the first thing you want to do is go to this page on GitHub. This is all the model parameters of GPT-2 small in a bunch of CSV. 00:07:49.720 |
It's a giant zip file. You're going to download it and unzip it. And when you hear about, you know, this model is like a billion parameters or 70 billion parameters, 00:08:01.720 |
it's all just a bunch of giant numbers. And you can think of it as a giant spreadsheet. And that's literally what the zip file is. 00:08:07.720 |
Once you've downloaded that zip file and opened it up, what you want to do is select all of those files and drag it into this section here. 00:08:18.720 |
When it's done loading all those files, it'll look like this. It'll say, ready, all model parameters loaded. When it's not loaded, it'll be in red. 00:08:27.720 |
And it'll say, you know, please add the files. And what it's doing is it's actually loading all those GPT-2 parameters locally into your IndexedDB database. 00:08:37.720 |
And you can see here, it's basically about 1.5 gigabytes. Now, Chrome will let you do that if you've got sufficient disk space on your hard drive. 00:08:46.720 |
But the benefit of doing this is now the entire model is running locally in vanilla JavaScript on your browser. In fact, to run and debug this, you don't need anything else. 00:08:56.720 |
You could pull the internet connection and it should still work. And the way this is set up is similar to a Python notebook, if you've encountered one of those. 00:09:05.720 |
We have these cells. And every cell has a play button, which will run what's inside it. And there are two types of cells. One type of cell is just JavaScript code. 00:09:14.720 |
And you can open these and expand these if you want. It's just vanilla JavaScript code. Our matrices, for example, are just simple 2D JavaScript arrays. 00:09:22.720 |
And if I hit play, it will execute this code. And you can see there's a message right there. The other type of cell is one like this. It's kind of like a table or spreadsheet interface. 00:09:36.720 |
It runs a formula and then shows you the result. So it's a way to actually run code and see the results very immediately. And these formulas are just raw JavaScript with a little syntactic sugar to figure out, you know, if you reference something, it knows what previous table it was. 00:09:51.720 |
Let me give you an example of that. So right here is an example. So here, this get final tokens is just JavaScript code we defined earlier. And this prompt to tokens is literally the DOM ID. And this, let's zoom this one out. These brackets are basically syntactic sugar saying, go grab the DOM table that has this ID. So prompt to tokens is up here. 00:10:20.720 |
And it's just grabbing this thing and passing that into that function. So it's all straight vanilla JavaScript. But the real benefit is being able to debug a model right here without leaving your browser. So what I'm going to do, I can do -- who here has done like console debugging before, right? 00:10:39.720 |
So you can do console debugging right here. So if I say console.log matches, and I open up my DevTools inspector, let's put that side by side. There we go. And I go to the console. And then I rerun this. There we go. Wait for the layout shifts. There we go. 00:11:08.720 |
Separate into words is right there. So if I rerun this, you can see right here, it might be a little bit hard. But if you look here, you can see I've basically got my console.log statement here. If I wanted to know what that variable is doing, but I can go even further. I can just type the word debugger. 00:11:27.720 |
And then I rerun it. And boom. I'm actually stepping through a large language model that was once considered too dangerous to release, without ever leaving my browser. So every part of this model, if you were like, I really want to understand how this works, you can step right through it in a familiar language and the browser, a very familiar IDE. So I'm going to remove that debugger statement. 00:11:54.720 |
So it doesn't get in our way later. And then there we go. OK. So that's a quick tour of how to run our JavaScript implementation of GPT-2. OK. Next up, large language models. 00:12:22.720 |
So a large language model has a really simple job to do. We give it a passage of text and it simply predicts the next word. Technically the next token, as we'll talk about. So I might give it the text, Mike is quick, he moves, and it'll just output a single word quickly. It does not by nature naturally give you paragraphs of text. So if we want to get more text, what we do is we take that output, and then we append the word we just 00:12:50.720 |
got out, which was quickly, and put it at the end of the thing we originally put in. So then we take that additional text, and now we ask it to run through again with this much longer piece, and say, what is the next word now? And it says, oh, it's and. Then we take that, append it to the original input, and keep going, and ask it what the next thing is. And this is how we generate paragraphs of text or code from a model. It is what they call an autoregressive model. You simply take the output, put it back to the input, and rerun it. Now, this is why, if you use our JavaScript implementation, 00:13:18.720 |
all it does is predict the next word. Because that is the core function. If you understand that, you understand how the rest of it works. 00:13:25.720 |
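(To make that loop concrete, here is a minimal sketch of autoregressive generation in plain JavaScript. The predictNextToken helper is hypothetical; it stands in for one full forward pass of the model.)

```js
// Minimal sketch of autoregressive generation (hypothetical helper name).
// predictNextToken(text) stands in for one full forward pass of the model:
// it returns the single next token as a string.
function generate(prompt, predictNextToken, maxNewTokens = 20) {
  let text = prompt;
  for (let i = 0; i < maxNewTokens; i++) {
    const nextToken = predictNextToken(text); // one forward pass
    text += nextToken;                        // append the output...
    // ...and feed the longer text back in on the next iteration
  }
  return text;
}

// e.g. generate("Mike is quick. He moves", predictNextToken)
// might grow the text to "Mike is quick. He moves quickly and ..."
```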
So we've said that large language models have this core action of completing passages of text, and they're trained to complete sentences like this one. Mike is quick, he moves, and as a human, you probably understand, you know, a possible completion is the word quickly, or maybe the word fast, or around. But how do we get a computer to do that? 00:13:46.720 |
Well, here's a fill-in-the-blank problem that computers are really good at. Two plus two equals four. It's a math problem. Computers are really good at math. And you can make these equations really complex, and computers can still do them really fast. So in effect, what researchers have figured out how to do is take what is a word problem and turn it into a math problem. 00:14:08.720 |
In order to do that, they have to go through a series of steps. First, what they have to do is they have to map the words in our text to numbers. Here I've shown it as just a one-to-one mapping. So Mike goes to 89. Is goes to 9. But in practice, as we'll see, it's a long list of numbers called an embedding. 00:14:26.720 |
And then we do our number crunching on them. Here I've drawn this as just simple arithmetic. It's much more complex than that. But it's actually almost as simple as that. It's just a lot of multiplication, addition. There's an exponentiation in there. 00:14:40.720 |
But it's not math you probably haven't seen before. It's just a lot of it tediously put together. And then after all that arithmetic, we get a result. Again, it'll be a long list of numbers. Here I've simplified for now, just saying a single number. 00:14:56.720 |
And we look at the resulting number that comes back. And that number is going to be what it says the next predicted word is going to be. But we need to translate that back to a word, because it's a number. So then we do the reverse of what we did at the beginning, instead of going from words to numbers. 00:15:08.720 |
We go from numbers to words. And of course, the number we get back, numbers are continuous, words are discrete, doesn't necessarily always cleanly map. We'll get some number like here. For example, hypothetically, we get 231. 00:15:24.720 |
There's nothing in our dictionary that maps to it. But the closest word in our dictionary is quickly, which is at 232. But fast is kind of close to 240. So what we're going to do is we're going to weight the probability distribution of these tokens 00:15:38.720 |
according to how close they are to the predicted number that came out of our model. And that turns into our probability distribution. 00:15:45.720 |
So then we run a random number generator, and then we pick according to that distribution. One thing I want to emphasize is we add the random number generator in. 00:15:54.720 |
We could always just simply take the closest word, and that's called greedy or temperature zero. 00:16:01.720 |
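(As a rough sketch of that last step, here is what picking the next token from a set of scores can look like in JavaScript. The scores below are made up, and the softmax-with-temperature detail is an illustrative assumption; the real model produces a score for every one of its roughly 50,000 tokens.)

```js
// Sketch: pick the next token from per-token scores (higher = closer match).
// In the real model these come from a softmax over ~50,000 logits;
// the numbers here are made up for illustration.
function sample(scores, temperature = 1.0) {
  const entries = Object.entries(scores); // [["quickly", 2.1], ["fast", 1.3], ...]
  if (temperature === 0) {
    // "Greedy" / temperature zero: always take the single best token.
    return entries.reduce((best, cur) => (cur[1] > best[1] ? cur : best))[0];
  }
  // Softmax with temperature turns scores into a probability distribution.
  const exps = entries.map(([, s]) => Math.exp(s / temperature));
  const total = exps.reduce((a, b) => a + b, 0);
  // Draw a random number and walk the cumulative distribution.
  let r = Math.random() * total;
  for (let i = 0; i < entries.length; i++) {
    r -= exps[i];
    if (r <= 0) return entries[i][0];
  }
  return entries[entries.length - 1][0];
}

// sample({ quickly: 2.1, fast: 1.3, around: 0.4 }, 1.0) -> usually "quickly"
// sample({ quickly: 2.1, fast: 1.3, around: 0.4 }, 0)   -> always "quickly"
```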
OK. So that gives us this view. You get some text. We turn that text into tokens. We turn those tokens into numbers. And then we do some number crunching on them. And then we turn those numbers into text. 00:16:14.720 |
And that gives us our next predicted token. OK. The model we are going to be studying today is GPT-2. 00:16:23.720 |
GPT-2 small, specifically. There's actually multiple versions of GPT-2 that were released. And that came out in 2019. 00:16:30.720 |
So about four years before GPT-4 and three years before ChatGPT. But don't let that fool you. 00:16:37.720 |
GPT-2 small, you know. This was a model that was considered too dangerous to release when it first came out. And it was state-of-the-art. More importantly, it is actually the foundation of most of the state-of-the-art models you have probably used today. 00:16:51.720 |
And you don't have to take my word for it. This is a research lab, EleutherAI, saying basically the recipe for building a large language model has not fundamentally changed since the transformer was introduced. 00:17:02.720 |
And only slightly tweaked from the language models by OpenAI GPT-1 and GPT-2. 00:17:07.720 |
And then in this article, they actually go on to list what the changes are between GPT-2 and a state-of-the-art model at the time, which was Llama 2 when this was written, which was last year. 00:17:20.720 |
So the way to think about this -- and this is a helpful family tree chart of different large language models -- is that most of the ones you're familiar with at the top of this tree -- might be hard to see -- is ChatGPT, Llama, Bard/Gemini, GPT-4, Claude. 00:17:35.720 |
They all inherit from GPT-2. GPT-2 is its granddaddy. If you understand GPT-2, you are 80% of the way to understanding how a state-of-the-art model works under the hood. 00:17:47.720 |
Okay, so now let's dive into the first stage of our model. So that's tokenization. 00:17:54.720 |
Okay, this is where we take the input text and we split it into subword units called tokens. In the example that I like to use -- Mike is quick, he moves -- unfortunately, every single word is a single token. 00:18:09.720 |
But it is not unusual for a word to be two, three, or more tokens. 00:18:16.720 |
And then these tokens all have IDs that are just indices or positions in the dictionary, as you see here underneath. 00:18:32.720 |
Okay. Well, we get more people in the room, we get more Wi-Fi issues. 00:18:52.720 |
So what I want to illustrate for you is that you can take a word like -- oh, you know what I can do? I do have a backup. Let's do this. 00:19:07.720 |
Okay, this is a version that runs locally. So what you can do -- so you can try this once we get Wi-Fi back. I'm going to take the word "reindeer" and the word "re-injury" and then I'm going to run them up until right here, which is the final tokens. 00:19:26.720 |
So these are the tokens for the input prompt we just put in here. And you can see the word "re-injury" was turned into multiple tokens. 00:19:40.720 |
There's a space, which is part of the token itself, R-E-I-N, and then jury, J-U-R-Y, and they have separate token IDs. 00:19:48.720 |
The thing I want you to pay attention to is reindeer also starts as R-E-I-N, right? But it got split into three tokens, space, R-E-I-N-D, E-E-R. 00:19:59.720 |
So this is not like basic string parsing. Something more complex is going on. And so the natural question is, well, why the heck are we doing this? 00:20:07.720 |
Why don't we do something simpler? Why don't we do, say, word-based tokenization? We just take every word in the dictionary and give it a number, like dog is one, cat is two, and so forth. 00:20:18.720 |
So that has a couple of problems. First is it can't handle unknown or misspelled words. And there are some models, early models, that had an "unk" token for unknown words. 00:20:26.720 |
But when you're grabbing all the text on the Internet, you might encounter things you didn't expect. Examples could be languages that you weren't planning for. 00:20:34.720 |
One of the early models, GPT-2, in fact, they tried to take foreign languages out of it. And then they magically discovered some snuck in and it was actually good at translation. 00:20:43.720 |
They wouldn't have had that if a lot of words that were not English were just simply thrown out. 00:20:48.720 |
Another example is when they did summarization with it, they realized they can put the too-long-didn't-read acronym, TL;DR. And if they didn't have a token for that, it would have been thrown away and it would have lost that ability. 00:21:01.720 |
The other problem is that you're going to increase the vocabulary size, which is going to increase the size of the model. 00:21:07.720 |
It will need more parameters if you're going to have more vocabulary. 00:21:10.720 |
English alone is 170,000 words. For perspective, GPT-2's vocabulary is only about 50,000, so it's a third of that. 00:21:17.720 |
And then if you add additional languages on that and you're doing word-based tokenization, it would get even larger. 00:21:22.720 |
In essence, if you do this, you need more memory, more compute, or maybe you get worse performance. 00:21:28.720 |
So then you're like, well, I'm a developer. I'm used to say something like ASCII. Why don't I just do character-based tokenization? 00:21:34.720 |
I say A is 1, B is 2, and do it that way. Well, the first problem is it's going to increase the sequence length. 00:21:41.720 |
So you can see this inside the model. As you go through the model, after you get your prompt, right here, you can see here is where the embeddings -- we'll talk about this in a second. 00:21:51.720 |
But you can see Mike is quick, period, he moves. Each of these rows, this matrix, has a height that is the size of the number of tokens. 00:22:00.720 |
And that persists through the entire model. So as I keep going, we see, again, this six-height matrix. It's going to keep going. 00:22:09.720 |
If we made every single character its own token, this is going to get a lot larger. Right now, it's just, what, six tokens high. 00:22:17.720 |
But if I made M, I, K, E, each of these characters their own token, this is going to get a much larger matrix. 00:22:22.720 |
So it will be more memory, more compute to process. The other issue is there's low semantic correlation in characters. 00:22:29.720 |
They don't carry a lot of meaning. And a good example is this chain letter that went around a few decades ago on the Internet. 00:22:34.720 |
And it says, according to research at Cambridge University, it doesn't matter in what order the letters in a word are. 00:22:40.720 |
The only important thing is that the first and last letter be at the right place. And all the letters are jumbled, but you can still read it. 00:22:46.720 |
And the point is that you don't read characters. You actually read subword units yourself. 00:22:51.720 |
And so if there's less semantic correlation, it's going to be more work for the model to do during training to erase that character boundary 00:22:58.720 |
and get the pieces that really matter. So if character tokenization is too small and word tokenization is too big, 00:23:06.720 |
Goldilocks says let's do something in between, which is subword tokenization. And that's this algorithm called byte pair encoding. 00:23:12.720 |
So it's got two phases. The first is the learning phase where you take a large -- they call it a corpus of text that's gathered from the Internet. 00:23:19.720 |
And then we put it through this learning algorithm we'll describe in a second. And then we get out of it a vocabulary, a dictionary of tokens. 00:23:27.720 |
And then later when we're processing the model and asking it to generate text, if we give it some input words, we have to re-translate it into those tokens that were used during training. 00:23:37.720 |
So we take the input words, we take the vocabulary, and we get out tokens. This is the research paper that introduced the algorithm to machine learning. 00:23:47.720 |
But it turns out this algorithm is from the 90s. It's actually a compression algorithm, as we'll talk about in a second. 00:23:54.720 |
And it even has some Python code you can copy and paste and run. The goal of the algorithm really is to take the text that's going to be trained on and figure out the most efficient way to represent it. 00:24:05.720 |
That's really what tokenization is trying to do. And so here's the example from the paper. And what I'm going to do first is I'm going to use this dot to separate out the characters. 00:24:16.720 |
That's going to tell us, essentially, each token will be separated by those dots so we can see them individually. 00:24:21.720 |
At the end of this process, you'll see that there are fewer dots, meaning that we've got more tokens in our vocabulary, but fewer tokens being used to represent the corpus. 00:24:33.720 |
And in this corpus, we're going to pretend that when we scraped the internet, this is all we came up with. There are only four words: low, lower, newest, and widest. 00:24:42.720 |
And you'll notice some of them appear more than once. In fact, they all do. But the reason I'm doing it this way is I want the frequency of the word to be represented. 00:24:49.720 |
Right? I'm trying to compress all the words on the internet. And if a word is more frequent, I want to know that, because if it's more frequent, I want to give it more representation or a more efficient representation. 00:25:02.720 |
So that dot is going to separate out our tokens. And then I'm going to put the underscore to indicate the space character. 00:25:08.720 |
One caveat -- in this example, the space character is at the end of the token. In GPT-2, it's actually at the beginning. 00:25:14.720 |
And then we're going to start with our vocabulary of just our characters, A through Z, all lowercase. 00:25:20.720 |
We're assuming that we're in a lowercase-only world here for now. Those are our initial tokens. That's our vocabulary on the right. 00:25:26.720 |
And then the first thing we're going to do is we're going to count all the adjacent tokens, all the adjacent characters. 00:25:33.720 |
Right now, there are no tokens other than characters. And we can see E and S occurs nine times. Right? Six times in newest and three times in widest. 00:25:42.720 |
So I'm going to put a table together, and I put E next to S has a frequency of nine. And then I'm going to do this for all the possible pairs. 00:25:49.720 |
And then I'm going to take the most frequent pair, and I'm going to say that's a new token. Why am I doing that? Because if I do that, I can take E and S, I can put them together, and I can pretend they're their own character. 00:26:00.720 |
I'll call it a token. And then I can go back to my corpus. And every place there was an E and S, I'm going to replace it with my new ES token. 00:26:09.720 |
So now I've shrunk the number of tokens that need to represent the stuff here on the left. Now I've taken what were separate tokens and combined them together. 00:26:17.720 |
So I'm using less tokens to represent all this text. And then I can repeat the process. I can just say ES next to T occurs nine times, and I can do a whole other table. 00:26:27.720 |
Now ES is itself, now its own character, its own token. So it can be paired with other things that I'm counting. And I can see that ES with T occurs nine times. 00:26:37.720 |
So I add that to my vocabulary of tokens, and I go back and I compress my corpus again. Then I keep going, and I just simply repeat this process, looking for whatever is the most frequent at each pass, and making it a token, and then recompressing the corpus. 00:26:52.720 |
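(Here is a minimal sketch of that learning loop in JavaScript, using the toy corpus from the paper. The word counts and the helper names are just for illustration.)

```js
// Sketch of one pass of the byte pair encoding learning loop.
// Each word is a list of tokens (starting as characters); counts are how
// often the word appears in the toy corpus; "_" marks the end-of-word space.
let corpus = [
  { tokens: ["l", "o", "w", "_"], count: 5 },
  { tokens: ["l", "o", "w", "e", "r", "_"], count: 2 },
  { tokens: ["n", "e", "w", "e", "s", "t", "_"], count: 6 },
  { tokens: ["w", "i", "d", "e", "s", "t", "_"], count: 3 },
];

function mostFrequentPair(corpus) {
  const pairCounts = {};
  for (const { tokens, count } of corpus) {
    for (let i = 0; i < tokens.length - 1; i++) {
      const pair = tokens[i] + "\u0000" + tokens[i + 1];
      pairCounts[pair] = (pairCounts[pair] || 0) + count;
    }
  }
  // Return the adjacent pair that occurs most often, e.g. ["e", "s"] with count 9.
  return Object.entries(pairCounts).sort((a, b) => b[1] - a[1])[0][0].split("\u0000");
}

function mergePair(corpus, [left, right]) {
  // Everywhere left is followed by right, fuse them into a single new token.
  return corpus.map(({ tokens, count }) => {
    const merged = [];
    for (let i = 0; i < tokens.length; i++) {
      if (tokens[i] === left && tokens[i + 1] === right) {
        merged.push(left + right);
        i++; // skip the right half, it's now part of the merged token
      } else {
        merged.push(tokens[i]);
      }
    }
    return { tokens: merged, count };
  });
}

// One learning pass: find the best pair, add it to the vocabulary, recompress.
const pair = mostFrequentPair(corpus); // ["e", "s"]
corpus = mergePair(corpus, pair);      // "newest" is now n.e.w.es.t._
```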
And after 10 passes, you get something like this. Here on the right is our vocabulary of tokens. And then here on the left, you can see we have shrunk the number of tokens used to represent the corpus. 00:27:04.720 |
There are less of these dots, right, than we saw before and originally. Now the most frequent words became their own tokens, right? Low and newest. And even the words that did not get their own full token representation, we've now represented them a lot more efficiently. 00:27:24.720 |
And you notice that there are some common subword units like low became tokens, and EST became tokens. So it managed to map some of the morphemes, the meaningful subword units we use as humans, to tokens, but that is just a coincidence. 00:27:39.720 |
Right now the model has no understanding of semantic meaning. Okay, so this is the learning algorithm. Now the tokenization algorithm is really similar. I'm not going to go through it in full detail. 00:27:51.720 |
I have a video on my YouTube channel where I do go through it in full detail. But for time, I'm just going to talk about it at a high level. It's essentially similar to the learning algorithm, except we're doing it on individual tokens as they come in, asking ourselves what would it look like if this word were part of the tokenization process. 00:28:09.720 |
So let me show you what that is like. This is a helpful jump to here. Let's go to tokenization. So the first thing we do is we take our prompt and we separate it into words. Now this is just a regular expression that came from the OpenAI source code when they open sourced it. 00:28:23.720 |
And we're just parsing it according to this. And it's going to basically take out punctuation and spaces. So we get these as our words. So Mike, is, quick, the period gets its own piece, he, and moves. Right now these aren't tokens, we're just separating words. 00:28:37.720 |
But do note that the spaces are assigned as part of this separation. So he and moves have a space in front of them. 00:28:46.720 |
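(As a rough illustration of that splitting step, here is a simplified regular expression in JavaScript. It is not OpenAI's exact pattern, which also handles things like contractions, but it shows how the leading space stays attached to each word.)

```js
// Simplified stand-in for the word-splitting step (the real pattern from
// OpenAI's GPT-2 source is more detailed). The key behavior to notice:
// the space stays attached to the front of the word that follows it.
const wordPattern = / ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+/g;

"Mike is quick. He moves".match(wordPattern);
// -> ["Mike", " is", " quick", ".", " He", " moves"]
```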
The next thing we're going to do is we're going to fetch vocab, BPE. This is a file that came from OpenAI when they trained the model and its tokenization. This is their dictionary of tokens. 00:28:55.720 |
So when they did their training in BPE, the most common left token with a right token was space with a T. The second most was space followed by an A and so forth. 00:29:06.720 |
I've added two extra columns, rank and score, just to show where a pair of tokens falls relative to the others, how important they were. 00:29:15.720 |
Then this is just a helpful map of tokens to their IDs. And then finally, there's this, which is prompt to tokens. 00:29:24.720 |
Now, this is not how you'd really want to write a tokenizer. I've set it up to look similar to the Excel version so you can watch the video and still be able to follow and understand it, no matter whether you're looking at the Excel sheet or the JavaScript version. 00:29:40.720 |
But this kind of illustrates the process. So here is quick. We're breaking it apart right here. Let me get presentify. There we go. We're breaking it apart into characters here. 00:29:53.720 |
All right, quick. And then what we're doing is we're looking at each pair of characters. So Q with U, U with I, I with C, C with K. And we're saying, for each one, where does it fall in that rank of tokens? 00:30:07.720 |
And we're going to take the most popular one. In this case, it's I and C. And so we've rewritten this as a string of tokens, except the I and C were put together as one token. And then we're simply going to repeat the process. 00:30:22.720 |
We're going to repeat the process and keep going. Then Q and U get combined. Then I, C, K get combined. And finally, it gets to the point where we realize, oh, this is in my token vocabulary. And it turns it into a token. 00:30:34.720 |
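(Here is a minimal sketch of that encoding loop in JavaScript. The rank values below are made up for illustration; in the real implementation they come from the vocab.bpe merge table.)

```js
// Sketch of the encoding side of BPE: start with characters, repeatedly
// merge whichever adjacent pair has the best (lowest) rank in the merge
// table, until no adjacent pair is in the table.
function bpeEncode(word, ranks) {
  let parts = word.split("");
  while (true) {
    let best = null;
    for (let i = 0; i < parts.length - 1; i++) {
      const rank = ranks[parts[i] + " " + parts[i + 1]];
      if (rank !== undefined && (best === null || rank < best.rank)) {
        best = { i, rank };
      }
    }
    if (best === null) break; // no merge left that exists in the vocabulary
    parts = [
      ...parts.slice(0, best.i),
      parts[best.i] + parts[best.i + 1],
      ...parts.slice(best.i + 2),
    ];
  }
  return parts;
}

// Hypothetical ranks: "i c" merges first, then "q u", then "ic k", then "qu ick".
const ranks = { "i c": 10, "q u": 20, "ic k": 30, "qu ick": 40 };
bpeEncode("quick", ranks); // -> ["quick"]
```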
I have a video if you want to see this in depth and in detail. But for now, just think of it as the same sort of process as the learning algorithm, really just run small. Now, tokenization is actually kind of considered by a lot of experts to be a necessary evil. 00:30:51.720 |
Some of the problems that you sometimes encounter with models have their root causes in tokenization, or tokenization can sometimes make them worse. So the common one is how many Rs are there in strawberry? 00:31:02.720 |
Models don't actually see the Rs. And I love this post from Riley Goodside, where if you remember The Matrix, the character says, oh, I don't see the numbers. I just see what's inside there. 00:31:15.720 |
And that's kind of like what it is to the model. You know, it doesn't see any of the letters. So what he's done here in this image is strawberry is tokenized multiple different ways, depending on whether it's uppercase versus lowercase, whether there's a space in front of it or not, 00:31:30.720 |
and whether there's a quote in front of it or not. So the model, you know, these six words are not all strawberry. They're all six different token patterns. And so it makes it a lot harder and a lot more work for the model to understand it. 00:31:44.720 |
If you think about it, you don't actually see the letters either. If somebody asked you how many letters are in a word, you'd have to stop and think and parse it out. 00:31:51.720 |
I said at the beginning, you know, you don't do word tokenization. You don't do character tokenization. That's not a hard-and-fast rule. That's an empirical rule. And there are research models that have done both. 00:32:03.720 |
So just know that, you know, maybe a few years from now, character-based tokenization will be more popular. This is an example of one. The last thing I want to leave you with is that tokenization doesn't have to be just about text. 00:32:18.720 |
It can also be about other things. So here's an example of the vision transformer. They use patches of images as tokens. Waymo uses trajectories in space to prevent collisions as tokens. 00:32:31.720 |
Okay. Let's briefly check how we're doing. If this will load. There we go. So this is actually pinned at the top. 9:30. Oh, we're right. We're just a minute behind. Okay. So this is -- you guys can keep me honest. In the Discord room at the top is a spreadsheet. I'm, of course, using a spreadsheet. 00:33:00.720 |
to make sure we stay on track and on time. Okay. So -- oh, we need to do this now. There we go. Next up is embeddings. So now we're in the second phase of the input. These are token and position embeddings. Okay. So at the beginning of this workshop, I talked about how we map words into numbers. And I simplified this process by saying what we're doing is 00:33:30.720 |
is we're taking, say, the word "Mike" and we're turning it into a single number like 89. But, of course, that's not what we're really doing. We're actually going to turn it into a long list of numbers called an embedding. 00:33:42.720 |
So even the period, for example, gets 768 of these numbers. And in the case of GPT-2, the dimensionality, that list size, is 768. Every single word gets 768 numbers. And you can see that if we go to the token embedding section. 00:34:01.720 |
And then these -- we'll get to that table in a second. These are the embeddings of our input prompt. And you can type -- by the way, I didn't mention this before -- you can type anything here in this input prompt and it should parse it. 00:34:11.720 |
Although it doesn't handle foreign language as well because there's a character mismatch on import. But here's Mike is quick period. And each of these get 768 numbers. So what you can do is if I take row one, column 768, there's where our list of 768 numbers ends. 00:34:30.720 |
So every single one of these gets the same number of dimensions for how it gets represented. And it might be a little confusing because we mapped words into tokens with numbers. We've had token IDs. And then we have these embeddings as well. 00:34:47.720 |
And the analogy I like to use is imagine you are going to go look for a house to rent or a house to buy. And the street addresses -- and you build this table of street addresses and square feet, bedrooms and bathrooms and price. 00:34:59.720 |
The street addresses are identifiers. They're kind of like the token IDs. They tell you where to find something, where to find a house. But they don't tell you anything about what's inside. They don't tell you about what you care about, what the meaning is. It doesn't tell you the square feet, the bedrooms, the bathrooms, or the price. 00:35:14.720 |
And the embedding values, that's their job. They're to take the token and tell you something about what it means. And the identifier or the token ID is just to give it a position in the dictionary or effectively a numerical name. 00:35:26.720 |
And what we're really doing with the embedding values is we're trying to build a map for words where we put similar words grouped together. So here I've shown a two-dimensional map. But of course, we're really in a 768-dimensional space. 00:35:41.720 |
But the idea is basically the same. Take the words happy and glad in this map, this word island. You know, those are happy words. So I'm going to put them up here. And then I'm going to take words like dog and cat, and I'll put them in another part of this word island because they're animals. They don't relate as much. 00:35:58.720 |
And then, you know, the word sad, well, it doesn't have the same meaning as happy and glad. But it's still an emotion. So I want it closer to the emotions, happy emotions, part of the island, into this, maybe the sad province, right next to the happy province. 00:36:14.720 |
So they're kind of on the same half of the island, but they're not directly in the same spot because happy and sad have different meanings. 00:36:20.720 |
So here's a simple two-dimensional example of the benefit of doing this, which is that you can start doing word arithmetic and word math. 00:36:30.720 |
So we're going to imagine that we've built a two-dimensional embedding where the first column is authority and the second column is gender. And we take the token man and we'll just arbitrarily say man has authority of one and a gender of one, a woman, authority of one and a gender of two. 00:36:45.720 |
A king has more authority than a man, so two for authority, gender of one because still a man, and then queen, authority of two, a gender of two. And we can plot this out in a plane like as follows. 00:36:57.720 |
So queen, for example, is at position two, two. And then we can actually build relationships just from doing vector math. 00:37:05.720 |
So, for example, if we take king, we subtract man, we add woman, and then we just do regular arithmetic column by column. 00:37:13.720 |
So two minus one plus one, one minus one plus two gives us two, two. So king minus man plus woman is two, two. But of course, that is the same thing as queen. 00:37:24.720 |
So we're saying king minus man plus woman equals queen. And we can think of this as an analogy. King is to man as queen is to woman. And if we take out the queen and just leave it as a blank, we have our first kind of word problem we can convert to a math problem. 00:37:42.720 |
If you give me any three words and I had an embedding for them, I can figure out what the fourth word in this relationship is simply from vector math. 00:37:50.720 |
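(Here is that word arithmetic written out as a tiny JavaScript sketch, using the made-up two-dimensional authority/gender embeddings from above.)

```js
// Toy 2-D embeddings: [authority, gender] (values made up for illustration).
const embeddings = {
  man:   [1, 1],
  woman: [1, 2],
  king:  [2, 1],
  queen: [2, 2],
};

const add = (a, b) => a.map((v, i) => v + b[i]);
const sub = (a, b) => a.map((v, i) => v - b[i]);

// king - man + woman, done column by column.
const result = add(sub(embeddings.king, embeddings.man), embeddings.woman); // [2, 2]

// Find the closest word in the table to the result vector.
const distance = (a, b) => Math.hypot(...sub(a, b));
const closest = Object.keys(embeddings)
  .sort((w1, w2) => distance(embeddings[w1], result) - distance(embeddings[w2], result))[0];
// closest === "queen"
```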
And that is what Word2vec, which was the most famous word embedding, was able to do. It learned a bunch of relationships in a series of papers, not just one paper. 00:38:02.720 |
So, for example, France is to Paris, as Italy is to Rome, and Japan is to Tokyo. Einstein is to scientist, as Messi is to midfielder, or Mozart is to violinist. 00:38:14.720 |
Japan is to sushi, as Germany is to bratwurst. It didn't get all of them right, but clearly it's learning something about relationships. 00:38:21.720 |
And I want to be clear, this is actually just the same thing as being good at clustering. So imagine I've got on my word island, I've got all my countries over here. Right? France, Japan, Italy. 00:38:32.720 |
And then I've got all my capitals, Paris, Tokyo, Rome, over here in the word island. Well, every single one of them has the same vector relationship in space between each other if the clustering was really good and tight. 00:38:45.720 |
And if somebody comes in and says, hey, what's the capital of Canada? Well, I just use that same direction, I add that same vector to it, I can say, oh, well, it's Ottawa. 00:38:58.720 |
So in practice, real world embeddings are different than what I've shown here with this contrived example. First of all, we have many more columns or dimensions. So instead of simply two, we have hundreds. 00:39:09.720 |
GPT-2 is 768. Your state-of-the-art model these days has a lot more. The other key difference is I made up this thing saying that the first dimension or column is authority, the second one is gender. 00:39:22.720 |
We don't know what they mean. The columns are completely uninterpretable. The model just seems to pick them. And the values themselves, correspondingly, therefore, are not interpretable either. 00:39:32.720 |
And that might sound useless, but it's actually useful for at least getting similarity, for example. So let's go back to our housing analogy. Imagine I took off the labels, essentially the top row saying what each column meant, and I just gave you the IDs. 00:39:47.720 |
If you went to, say, 47 Ivy Lane, and you said, I like this house, I want to see more like it, you could still find those. Because what you could do is you could notice that with 58 Sun Ave and 15 Luna Lane, they all have roughly the same values in the first column, 2400, 2400, 2400, and the same values for the third and fourth columns as well. 00:40:09.720 |
So you're like, if I like this house, I can find the others like it, even though I don't know what these columns mean. So the question you're probably asking yourself is, well, where the heck do these values come from? How do we know what they mean? Or how does the model pick it? 00:40:24.720 |
Well, the slightly unsatisfying answer is that the embeddings are just simply learned by the neural network model during training. So now let's just talk about what training looks like. 00:40:33.720 |
So in training, what we do is we grab a bunch of text from the internet, we take out passages like this one, Mike is quick, he moves quickly. So quickly was in the original passage. 00:40:42.720 |
And then we chop off the last token or the last word, and we have a randomly initialized model. All the values of the weights and parameters are completely random, including the embeddings. 00:40:52.720 |
And then we run it through the whole model, and we say, what do you predict is going to be the next word? And it's random, so it comes up with something nonsensical, like Mike is quick, he moves haircut. 00:41:01.720 |
And then we use this algorithm called backpropagation. And we say, hey, backpropagation, the correct answer was quickly. Can you adjust the parameters to get closer to that value? 00:41:11.720 |
And it will go and tell us for every single parameter how to slightly move it to get it closer to producing the right answer of quickly. 00:41:18.720 |
And then we rerun the model, and eventually it says, Mike is quick, he moves quickly. And then we do this not just for one single passage of text, we're doing this for many passages at a time. 00:41:28.720 |
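(Here is a very rough, high-level sketch of that training loop. The model.forward, model.backward, and applyGradients helpers are hypothetical stand-ins; real training runs in a framework that computes the gradients automatically.)

```js
// High-level sketch of one training step (hypothetical helper names).
function trainStep(model, passage) {
  const inputTokens = passage.slice(0, -1);        // "Mike is quick. He moves"
  const correctNext = passage[passage.length - 1]; // "quickly"

  const predicted = model.forward(inputTokens);    // starts out nonsensical ("haircut")

  // Backpropagation: for every parameter, compute how to nudge it so the
  // model's prediction moves a little closer to "quickly".
  const gradients = model.backward(predicted, correctNext);
  model.applyGradients(gradients, /* learning rate */ 0.001);
}

// Repeat over many passages scraped from the internet, many times over.
```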
And one great benefit of this is we don't have to teach the model anything explicitly. It's kind of learning from unsupervised text. We're just gathering text that was naturally there on the Internet, and it's learning things like grammar, names, capitals of countries. 00:41:46.720 |
But that seems a bit mysterious. And we can kind of get an intuitive sense for what it's doing if you think about it as learning from word statistics. 00:41:55.720 |
So imagine you've got passages like this one about the words ice and steam. It was so cold, the puddle had turned to ice. Steam rose from the still hot cup of coffee. 00:42:05.720 |
And what you notice is that the words ice and cold tend to co-occur with each other. And the words steam and hot tend to co-occur with each other. 00:42:14.720 |
So if you were an alien coming down from another planet, you had no idea what our language was. And you looked at this, you would say, I don't know what ice, cold, steam, and hot are. 00:42:23.720 |
But I know that ice is probably cold more than steam is. Because ice and cold co-occur more often than steam and cold do. And I know that steam is hot because ice and hot don't co-occur as much as steam and hot do. 00:42:37.720 |
So there must be some relationship. They must have that meaning. And this is called the distributional hypothesis. And the phrase that you'll often hear people quote is, you shall know a word by the company it keeps. 00:42:49.720 |
Which is basically that a word is partially defined by its context. Or said another way, words that have similar meanings can be replaced with each other in similar contexts. 00:42:58.720 |
So if you know how they're distributed in their statistical representation, you have a sense of what similar words might be. 00:43:04.720 |
And in fact, in the full version of my class, we actually build our own primitive embeddings from Wikipedia pages, using a very simplified version of not Word2Vec, but another algorithm called GloVe, which is based on just using a co-occurrence matrix. 00:43:19.720 |
So roughly what that process looks like is you would count how often the words co-occur within some window size inside your corpus of text. 00:43:27.720 |
You'd analyze every single word, and you'd look in this case, three words to the left, three words to the right, and you'd say, these are the words that co-occur with it. 00:43:33.720 |
And then you'd build a big giant table, a matrix, and you'd say, let me compare every word to every other word, and I'm going to count how often they co-occur in all my text with each other. 00:43:43.720 |
So here in this example, Word2 and Word3 co-occur, let's say, five times in our corpus of text. 00:43:50.720 |
And then what you can think of the embedding as, instead of taking this table, which is all possible words by all possible words, and if we're in English, that's about 170,000 words, or in the case of BPE, it's 50,000 tokens. 00:44:03.720 |
You can imagine it compressing the columns of the matrix. So instead of having 170,000 columns, it's got whatever your embedding dimension is, 768. 00:44:13.720 |
It's still all possible words high, but it's now a lot smaller in the number of columns. So it's basically a compressed co-occurrence matrix. 00:44:22.720 |
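(Here is a minimal sketch of counting those co-occurrences in JavaScript, assuming a plus-or-minus three word window. It is just an illustration of the raw table, not the GloVe algorithm itself.)

```js
// Sketch: count how often words co-occur within a +/- 3 word window.
// This is the kind of raw table that embeddings can be thought of as compressing.
function cooccurrenceCounts(text, windowSize = 3) {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const counts = {}; // counts["ice"]["cold"] = how often they appear near each other
  for (let i = 0; i < words.length; i++) {
    for (let j = Math.max(0, i - windowSize); j <= Math.min(words.length - 1, i + windowSize); j++) {
      if (i === j) continue;
      counts[words[i]] = counts[words[i]] || {};
      counts[words[i]][words[j]] = (counts[words[i]][words[j]] || 0) + 1;
    }
  }
  return counts;
}

const counts = cooccurrenceCounts(
  "it was so cold the puddle had turned to ice steam rose from the still hot cup of coffee"
);
// Over a whole corpus, counts["ice"]["cold"] ends up much larger than counts["steam"]["cold"].
```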
So the actual table OpenAI gives us from training, the embedding table, is really just this thing on the right. It is basically every single word or token and then the representation of its dimensions. 00:44:34.720 |
And you can think of it as they just took a co-occurrence matrix and they shrunk it. The reason I like this mental model is it helps motivate certain things about embeddings that might seem a bit weird. 00:44:45.720 |
So a classic one is, how do you measure how similar two embeddings are? You might naively think we just look at Euclidean space, you know, as the crow flies, how far apart they are. 00:44:54.720 |
But that's not what we do. We use something called cosine similarity. How many people have heard of cosine similarity? 00:45:00.720 |
Yeah, so, you know, cosine similarity is not an intuitive measurement of similarity. And what it is, is we take the angle between the two vectors, and then we take the cosine of that angle, and that's how we say how similar two embeddings are. 00:45:13.720 |
So if you remember your trigonometry, if they're opposed, you know, cosine goes from negative one to one. So if they're opposed, they get negative one value. So that's when two vectors are like this. 00:45:23.720 |
If they're unrelated, it's like this, they're orthogonal. And if they're similar, they're pointing the same place, the cosine will be one. 00:45:30.720 |
Oh, and by the way, this opposite is not opposite in probably your intuitive conventional sense. So happy and sad, you might think would be opposites. But actually, they're more similar to each other than any other set of words. 00:45:41.720 |
If you think about how often like happy and sad probably occur in songs and poems and other types of contexts. They're both emotions. So they're actually more similar than they would be opposite. 00:45:49.720 |
In fact, if you take all the GPT-2 embeddings and you compare them against each other, very few are negative, and most of those are close to zero. 00:45:56.720 |
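(Here is what cosine similarity looks like in JavaScript, applied to the toy co-occurrence vectors we are about to look at in the next example.)

```js
// Cosine similarity: compare the angle between two embedding vectors,
// ignoring how long they are (i.e. ignoring raw popularity).
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

cosineSimilarity([10, 10], [50, 50]); // 1     - same direction, same "meaning"
cosineSimilarity([10, 10], [5, 20]);  // ~0.86 - closer in Euclidean distance, but a different angle
```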
But going back to kind of learning from word statistics, the key thing we care about is the relative co-occurrence of different words against each other. We don't care about the raw co-occurrence. 00:46:10.720 |
So let's, for example, imagine there's this section of our co-occurrence matrix, which is the basis for the embeddings. We're comparing three words -- one, two, three -- against word 10 and 11. 00:46:20.720 |
And word 1 occurs with word 10 and word 11 ten times each. Word 2 occurs with word 10 and 11 50 times each. Word 3 on the other hand is 5 and 20 times. 00:46:31.720 |
And I'm going to plot these as vectors, where the horizontal axis is how many times it occurs with word 10, and the vertical is how many times with word 11. 00:46:39.720 |
And what we notice is something interesting. Word 1 and 2 essentially have the same meaning. As far as word 10 and word 11 can tell, the relative probability between them is the same -- one to one. 00:46:52.720 |
It's just that word 2 happens to be more common, right? It's five times more common. But if we plotted this in vector space, what you see is that word 2 and word 1 are really far apart in Euclidean distance, but they have the same angle. 00:47:07.720 |
Now, word 3 is actually closer in Euclidean space. But relative to word 10 and word 11 has a different meaning. And so that's a motivation for why we're looking at the angle, right, which seems to more accurately capture the meaning and not necessarily popularity, which is what Euclidean space would capture. 00:47:24.720 |
Okay, so how do we actually use this? Well, these embeddings were learned during training. So OpenAI gives us this model_wte matrix. And the way it's set up is that it is our vocabulary -- 00:47:35.720 |
size tall. So 50,257 tokens is how many tokens GPT-2 uses. So there's a row for every single token. And then each row is just simply the embedding for that token. So there's a row for dog. And that row is the 768 numbers that represent the semantic meaning of dog. 00:47:55.720 |
So let's go to our example here. So let's take is. What is the token ID for is? It is 318. So this is our model_wte. This is one of those CSV files we dragged in. So it fetches it and displays it. So let's go to row 318. There's actually an off by one. But I'll do that anyways. 00:48:17.720 |
So you can see -- we'll come back to that in a second. And let's go to our final token embeddings right here. So as you can see, it's .0097, .010. And you'll see that matches what you have here. .0097 in the 319th row. But if we were 0-indexed, it would be 318. .001. 00:48:39.720 |
So all this code is doing right here is it's grabbing the token ID and just pulling out the corresponding row and plopping it into this table here. 00:48:49.720 |
So it's a very simple operation. That's why this is what? Basically 15 lines of JavaScript to just grab one thing out of another and then put it here. 00:49:00.720 |
So all this is doing is just taking those token IDs and looking them up and putting them in this table. 00:49:06.720 |
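(As a sketch, the whole operation boils down to something like this. The function and variable names here are illustrative, not the exact ones in the implementation.)

```js
// Sketch of the token-embedding lookup: each token ID just selects a row
// of the wte matrix (a 2-D array: 50,257 rows x 768 columns).
function lookupTokenEmbeddings(tokenIds, wte) {
  return tokenIds.map((id) => wte[id]); // one 768-number row per token
}

// e.g. lookupTokenEmbeddings([318], wte) returns the 768-number row
// for " is", whose token ID is 318 as in the demo above.
```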
Last thing I want to say before we leave token embeddings is, as before with tokens, I said it doesn't just have to be a text. The same thing is true for embeddings. 00:49:15.720 |
So the famous example is clip, which was the basis for a lot of the image generators that you probably have tried or used. 00:49:21.720 |
And instead of just comparing words against words, you can think of it as it's comparing words against images. 00:49:27.720 |
So if you look at all the images on the internet and you look at that alt text and it sees dog and a bunch of images with dogs in it, you can get it to learn that relationship. 00:49:36.720 |
So later on, you can pass an image and it can say this thing is a dog. 00:49:44.720 |
Okay, so now we're still at the top with input and we're talking about embeddings, but now we're talking about different kind of embedding. 00:49:52.720 |
And the key thing to remember is that in English word order matters, right? 00:49:57.720 |
The dog chases the cat is something different than the cat chases the dog. 00:50:02.720 |
Now in math, two plus three equals five, but three plus two also equals five. 00:50:08.720 |
Anything after that equal sign has no idea what the order was. 00:50:11.720 |
Our problem is that when we mapped from a word domain, an English domain, a language domain, we were in a domain where order typically matters. 00:50:22.720 |
We went to a number domain where order typically does not matter. 00:50:25.720 |
And even though I've drawn this with simplified arithmetic, there are parts of the model that are also commutative. 00:50:31.720 |
So what could happen in essence is you can change the order of the words, right? 00:50:36.720 |
And now it could mean something different or it could be completely gibberish. 00:50:40.720 |
But anything after that equal sign can't tell the difference. 00:50:43.720 |
So it has no sense of position of these words. 00:50:46.720 |
And that's going to be really hard to understand the meaning of the sentence or the phrase. 00:50:50.720 |
So what we're going to do is we're going to add some sense of position to the embedding. 00:50:54.720 |
What we're going to do is we're going to just take, for example, one token like woman and say woman at position zero in the prompt is going to basically mean the same thing as woman at any other position in the prompt. 00:51:05.720 |
So let's give woman at one position other than zero a slight offset, a slightly different position to represent that it's roughly the same meaning. 00:51:15.720 |
It's just woman when it occurs at position one and a different spot for woman at position two and so forth. 00:51:21.720 |
And in general, we'll use the position in the prompt to define a small offset that we're going to slightly move the position of the token in the embedding space. 00:51:31.720 |
In the original, famous "Attention Is All You Need" paper, they use sine and cosine formulas. 00:51:37.720 |
You don't have to look through the whole thing. 00:51:39.720 |
The key thing I want you to pay attention to is it's just sine and a cosine. 00:51:42.720 |
And if you remember your trigonometry, sine and cosine goes from negative one to one. 00:51:45.720 |
So what we're effectively doing is we are building this circle right around this with a limited diameter. 00:51:55.720 |
And that's the other thing to remember for trigonometry is sine and cosine oscillate. 00:51:58.720 |
I'm just oscillating the position of this thing in space based on its position in the prompt. 00:52:03.720 |
I'm keeping it roughly in that same area, but it's a little cloud that's all the different versions of woman just in different parts of the prompt. 00:52:09.720 |
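(Here is a sketch of that sine and cosine formula in JavaScript. This is the original transformer's scheme, not GPT-2's, and the exact indexing convention here is just one common way to write it.)

```js
// Sketch of the sine/cosine positional encoding from "Attention Is All You Need".
// Even embedding dimensions get a sine, odd dimensions get a cosine.
function sinusoidalPositionEncoding(position, dims) {
  const encoding = new Array(dims);
  for (let i = 0; i < dims; i += 2) {
    const freq = 1 / Math.pow(10000, i / dims);
    encoding[i] = Math.sin(position * freq);
    if (i + 1 < dims) encoding[i + 1] = Math.cos(position * freq);
  }
  return encoding; // every value oscillates between -1 and 1
}

// sinusoidalPositionEncoding(0, 768) and sinusoidalPositionEncoding(1, 768)
// are small, bounded offsets added to the token embedding at each position.
```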
In GPT-2, interestingly enough, they didn't use that same technique. 00:52:14.720 |
They let the model learn the embeddings on its own, which still blows my mind. 00:52:20.720 |
So how we use this is we have another matrix that OpenAI gives us when they open source GPT-2, which is the position matrix. 00:52:29.720 |
And this time it is 1024 high, which is our maximum context length. 00:52:34.720 |
It's an early model, so it's not very large context. 00:52:37.720 |
And then each of the rows is, again, the embedding dimension. 00:52:41.720 |
And what we're going to do is these are position offsets. 00:52:44.720 |
So we're going to add these offsets in each row to each value in our embedding dimension to offset it to represent its position. 00:52:53.720 |
We start with our embedding values from the token embeddings. 00:53:10.720 |
And that gets our position embeddings for every single one of our input tokens. 00:53:22.720 |
So the first thing we do is we fetch this model_wpe. 00:53:37.720 |
So beyond that, the model has no idea how to understand that. 00:53:42.720 |
And this code for positional embed is really just a matrix add. 00:53:46.720 |
The only thing I have to do is make sure, well, the input prompt is going to be less than 1024 tokens. 00:53:52.720 |
So it just needs to stop when it gets to the end of the input. 00:53:55.720 |
But that gives us our positional embeddings here. 00:53:57.720 |
So we're just passing to this our token embeddings we had from the previous step. 00:54:01.720 |
And then our model_wpe table, which came fetched from the CSV file. 00:54:06.720 |
It just simply adds those together and we get these positional embeddings. 00:54:12.720 |
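If it helps to see that as code, here is a minimal sketch of that matrix add in plain JavaScript. The names tokenEmbeddings and wpe are stand-ins for the arrays the real implementation builds from the CSV (wpe corresponding to model_wpe), so treat this as the shape of the idea rather than the actual workshop function.

// Sketch of the positional-embedding step: a plain matrix add, row by row.
// tokenEmbeddings: one row of 768 numbers per prompt token (from the previous step).
// wpe: the 1024 x 768 position table; both names here are illustrative.
function addPositionEmbeddings(tokenEmbeddings, wpe) {
  return tokenEmbeddings.map((row, pos) =>
    row.map((value, dim) => value + wpe[pos][dim])   // offset each dimension by its position row
  );
}
// Works as long as the prompt has at most 1024 tokens, GPT-2's maximum context length.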
Here's kind of an illustration of the action of GPT-2 and its positional embeddings. 00:54:17.720 |
So these numbers you see right after, like happy3, happy4, aren't the actual tokens. 00:54:22.720 |
What I've done is I've plotted the word happy, the token happy rather, and I've put it at different positions. 00:54:28.720 |
I've put it at position3, position4, position5, and so forth. 00:54:31.720 |
And then I took two other words, happy and glad, and I just put them in space as reference points. 00:54:35.720 |
And you can see it's doing what we described earlier. 00:54:37.720 |
It's basically just keeping it roughly in the same position. 00:54:40.720 |
It's slightly offsetting it depending on where it is in the prompt. 00:54:48.720 |
Of all the changes to modern LLMs from GPT-2, probably one of the most common and biggest ones is that they do not use these types of positional embeddings. 00:54:59.720 |
So if you look at a modern LLM, this is probably the first thing you'll see that's different. 00:55:11.720 |
We are five minutes behind, but we've got 10 minutes. 00:55:18.720 |
I will take a break for a question or two, if anybody wants to ask one. 00:55:36.720 |
You were mentioning that with GPT-2, they weren't using the sine and cosine kind of cloud. 00:55:41.720 |
What were they actually doing to come up with the word position embeddings? 00:55:50.720 |
So the big change they did is, think of it this way. 00:55:57.720 |
If we go back to our diagram of how embeddings are learned. 00:56:06.720 |
So in the original transformer, the position embeddings were not represented as learnable parameters; they were hard-coded with those sine and cosine formulas. 00:56:25.720 |
All they did in GPT-2 is they said, during backpropagation, let's learn those as well. 00:56:32.720 |
So they just simply said, hey, let's not hard-code those values. 00:56:36.720 |
Let's make the position table another parameter that the model learns. 00:56:45.720 |
And as I said, it's still kind of mind blowing that that worked. 00:56:50.720 |
But as we now know, these things can learn a ton. 00:57:04.720 |
So now we're getting into the heart of the number crunching. 00:57:07.720 |
And this one's going to get a somewhat more cursory explanation. 00:57:11.720 |
But I still think it's important to understand what it is. 00:57:17.720 |
And now we're inside what are often called the layers. 00:57:22.720 |
That's the more common term, but other people use the term blocks, which is what I prefer. 00:57:24.720 |
The reason I use the word blocks is that when I teach this, it's usually to people coming to it for the first time. 00:57:30.720 |
And the word layer gets used in other contexts. 00:57:33.720 |
Later we're going to talk about the multi-layer perceptron. 00:57:35.720 |
And it can be confusing when you're first coming to something and the same word has different meanings in different places. 00:57:40.720 |
But know that when you talk to most people, when they talk about how many layers in a model, 00:57:44.720 |
they're talking about what I call how many blocks. 00:57:47.720 |
And if you start getting to this part of the code, you'll notice that inside the blocks, 00:57:52.720 |
actually if you go right here, you can see these are all labeled with steps. 00:58:07.720 |
This is what I happened to pick when I was implementing it in Excel. 00:58:10.720 |
And I kept the same mapping so all my material would translate. 00:58:18.720 |
But the key operations inside the blocks are multi-head attention and the multi-layer perceptron. 00:58:28.720 |
The way to think about attention is we're going to let the tokens or words talk to each other 00:58:33.720 |
so they can convey their meaning to all the other words. 00:58:42.720 |
Take the word he in our prompt, Mike is quick, he moves. Maybe it needs to find that name and realize, oh, Mike is my antecedent. 00:58:46.720 |
As opposed to, say, if there was the name Sally, that's unlikely to be the match for it. 00:58:50.720 |
But there's other kinds of ways words can communicate to disambiguate. 00:58:54.720 |
So, for example, the word quick in English has four different meanings. 00:59:00.720 |
The common one today is fast. But it can also mean bright, as in quick of wit. 00:59:03.720 |
It can be a body part, as in the quick of your fingernail. 00:59:06.720 |
And in Shakespearean English, it can mean alive, as in the phrase the quick and the dead. 00:59:12.720 |
And knowing that the word moves is here helps the model understand that, oh, 00:59:17.720 |
we're probably talking about quick in the sense of moving fast through physical space. 00:59:20.720 |
It helps to predict what the next word could be. 00:59:26.720 |
But it's not, you know, a body part or your fingernail. 00:59:29.720 |
And the way I like to think about attention is we've got these tokens, these words. 00:59:38.720 |
And I like to imagine there's kind of a weird gravity, like celestial mechanics, where each 00:59:43.720 |
of these tokens in attention now suddenly look at what position they're at. 00:59:46.720 |
And they're able to push and pull each other relative to a kind of gravity. 00:59:51.720 |
And if you remember, gravity depends on mass and distance. 00:59:54.720 |
So, you've probably heard of query, key, and value. 00:59:58.720 |
But I feel like those terms alone don't quite capture the level of interaction between the tokens. 01:00:04.720 |
And what's happening is, again, gravity depends on mass and distance. 01:00:08.720 |
The distance I like to think of as a measure of relevance. 01:00:11.720 |
So, quick and moves, whenever they see each other, they're like, oh, yeah, you and I need to talk. 01:00:18.720 |
But quick and the period, they probably don't need to talk to each other a lot. 01:00:21.720 |
And that's kind of like distance in terms of gravity. 01:00:24.720 |
And then I like to think of the value as being kind of like mass, the kind of action they're going to exert on each other. 01:00:30.720 |
And what's happening is, let's go back to what we talked about with embeddings. 01:00:33.720 |
We've got moves, which is sitting somewhere in the embedding space. 01:00:36.720 |
And if you remember that co-occurrence matrix, moves has been used in a lot of sentences. 01:00:41.720 |
In some sentences, moves was used to describe rabbits, right, or cheetahs, or animals or things that move fast. 01:00:49.720 |
So there's some other point in this embedding space that we don't have yet that represents that fast sense of moves. 01:00:56.720 |
And then moves was used in some sentences to describe slugs or penguins, things that move slowly. 01:01:04.720 |
But the embedding for moves, unfortunately, has to capture all of those meanings together. 01:01:09.720 |
But now that we know quick is here, we can change that. 01:01:12.720 |
We can say, oh, I'm going to shift the position of moves from the regular generic version of moves toward that faster sense of moves. 01:01:24.720 |
I've shifted its position in space to capture its meaning. 01:01:27.720 |
I'm not going to go through all the steps of attention. 01:01:30.720 |
But I think the most salient part of attention to see is step seven. 01:01:37.720 |
It's the most famous thing that people usually show. 01:01:45.720 |
So you can see it says Mike is quick on the horizontal. 01:01:53.720 |
So what this is showing is how much relevance or attention each word is paying to every other word in the prompt. 01:02:01.720 |
And by the way, what you're seeing here is just the first head. 01:02:04.720 |
So if you scroll this to the right, 64 spaces, you'll see another matrix that looks like this. 01:02:11.720 |
The biggest thing to notice here is that the upper triangle is all zeros. 01:02:17.720 |
And that's because in decoder-based transformers like GPT-2, we have this rule that a token can only pay attention to itself and the tokens that came before it, never to the tokens ahead of it. 01:02:30.720 |
And then the other key property is each of these values, each row, sums up to one. 01:02:35.720 |
So you can think about the percentage of attention each word is paying to the other tokens. 01:02:39.720 |
So here, for example, look at the row for moves. 01:02:43.720 |
Here in the first head of, in this case, the last block, moves is paying 16% of its attention to the 01:02:50.720 |
word Mike, 23% of its attention to is, and so forth. 01:02:59.720 |
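Here is a minimal sketch, in plain JavaScript, of just those two properties: mask out the upper triangle, then softmax each row so it sums to one. The scores matrix is assumed to come from the query and key dot products, which I'm glossing over, and the names are illustrative rather than the workshop's code.

// Sketch of turning raw attention scores into the matrix shown in step 7.
// scores[i][j] is how relevant token j looks to token i (from the query/key dot products, not shown).
function causalAttentionWeights(scores) {
  return scores.map((row, i) => {
    // Causal mask: a token may only attend to itself and earlier tokens, never ahead.
    const masked = row.map((s, j) => (j <= i ? s : -Infinity));
    // Softmax each row so the weights are positive and sum to one.
    const max = Math.max(...masked);
    const exps = masked.map((s) => Math.exp(s - max));
    const total = exps.reduce((a, b) => a + b, 0);
    return exps.map((e) => e / total);   // e.g. 16% on "Mike", 23% on "is", zeros above the diagonal
  });
}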
Okay, so now this is the second major operation inside each block or layer. 01:03:04.720 |
And the reason I want to cover this in a little more detail is I want to explain what a neural network is. 01:03:09.720 |
And it helps give a little more understanding to how the model actually learns. 01:03:14.720 |
So if you haven't seen a neural network before, it is a computational model inspired by the human brain. 01:03:20.720 |
It is not a direct mimic or simulation of how the brain works. 01:03:23.720 |
Inside the brain, we have these things called neurons. 01:03:26.720 |
These neurons are all connected to each other. 01:03:29.720 |
And you've got a bunch of connections incoming from other neurons. 01:03:31.720 |
And you've got a bunch of connections outgoing to other neurons. 01:03:34.720 |
And then in between, you have this axon right here. 01:03:39.720 |
And the axon has an all or nothing activation behavior. 01:03:44.720 |
If there's a sufficient amount of pattern of input that shows up, the axon will activate and it will send a signal out to its output. 01:03:52.720 |
But if the incoming activation doesn't meet some threshold, you'll have these failed initiations and no signal will be sent to the output. 01:04:01.720 |
As far as the other output neurons connected, this neuron is not firing. 01:04:08.720 |
And so we model this mathematically with this diagram where we've got a bunch of inputs, x1 through xn. 01:04:16.720 |
These will be our embedding dimension numbers. 01:04:18.720 |
And then we've got another series of numbers called weights, w1, w2 through wn. 01:04:23.720 |
And we're simply multiplying the x's times the w's, adding them together, adding an additional number called a bias term. 01:04:30.720 |
And then we put it through an activation function. 01:04:33.720 |
And this activation function is designed to roughly mimic what happens in the brain. 01:04:37.720 |
The easiest one to understand is this one, ReLU, which basically says: when I multiply and add all the inputs coming in against their weights, if the result is negative, I output zero. 01:04:49.720 |
If it's positive, then I just pass it through as is. 01:04:52.720 |
And there's a whole zoo of these activation functions. 01:04:56.720 |
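As a sketch of that single artificial neuron in plain JavaScript, assuming ReLU as the activation (the function names here are just illustrative):

// Sketch of the single artificial neuron just described: weighted sum, bias, then an activation.
const relu = (x) => Math.max(0, x);   // negative input -> 0, positive input passes through

function neuron(inputs, weights, bias, activation = relu) {
  let sum = bias;
  for (let i = 0; i < inputs.length; i++) sum += inputs[i] * weights[i]; // multiply x's by w's and add them up
  return activation(sum);             // "fires" only if the weighted sum clears the threshold
}
// Example: neuron([1, 2], [0.5, -1], 0.2) -> relu(0.5 - 2 + 0.2) = 0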
So then what we do is we take these neurons and we stitch them together into a network of neurons, hence an artificial neural network. 01:05:04.720 |
Now there's a lot of ways you can stitch these together. 01:05:07.720 |
In the case of transformers in GPT-2, we do it in a pattern called the multilayer perceptron. 01:05:14.720 |
You will also see it referred to as a fully connected network, a feed-forward neural network, or just simply the neural network. 01:05:24.720 |
These are not directly identical terms, but they all overlap. 01:05:27.720 |
And the way the MLP pattern looks is you have your neurons arranged in these columns. 01:05:36.720 |
And each of those columns is a layer of the multilayer perceptron. 01:05:40.720 |
And the nodes in each layer are fully connected to every node in the preceding layer, but to no others. 01:05:46.720 |
So this node right here can see all of the nodes in the layer before it, but not the original inputs directly. 01:05:53.720 |
Everything it gets from the inputs is mediated through this intermediate layer in between. 01:05:57.720 |
And these layers between the input and the output are simply just called hidden layers. 01:06:03.720 |
The last thing maybe you should know as background is that neural networks are more efficient to write as a matrix multiplication. 01:06:10.720 |
So take this process, where we've got two neurons, each with a set of weights. 01:06:19.720 |
This can be written as a matrix multiplication where you just gather all the weights into one matrix, all the inputs into one vector, and all the biases into another vector. 01:06:27.720 |
And then this big W times x plus b is the same thing as running a bunch of these neurons together. 01:06:36.720 |
If you don't know your matrix multiplication, I like this website, which has a nice interactive visual demonstration of what matrix multiplication looks like. 01:06:45.720 |
So you hit this, and then you keep going through step, and you can kind of convince yourself that what I showed here matches what's on that web page. 01:06:53.720 |
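And here is a minimal sketch of that W times x plus b idea in plain JavaScript, with an illustrative two-neuron example; it's the same arithmetic as the single-neuron picture, just written as one loop over the weight rows.

// Sketch of a whole layer of neurons as one matrix multiply: y = W x + b.
// W has one row of weights per output neuron, x is the input vector, b is the bias vector.
function linearLayer(W, x, b) {
  return W.map((row, i) =>
    row.reduce((sum, w, j) => sum + w * x[j], b[i])   // dot product of one weight row with the input
  );
}
// Two neurons over three inputs:
// linearLayer([[1, 0, 2], [0, 1, -1]], [3, 4, 5], [0.5, 0]) -> [13.5, -1]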
So the key property, though, and why MLPs are so important is that they are universal, trainable approximators to any function purely from its input and output. 01:07:04.720 |
With enough neurons, an MLP can approximate almost any function. 01:07:10.720 |
So let's just take a simple example like a parabola. 01:07:13.720 |
And we're going to imagine we're going to use a simple neural network with a ReLU activation to try and approximate it. 01:07:18.720 |
We'll have one input node, one output, because we have an x going into a y. 01:07:22.720 |
And then let's just use two nodes in our hidden layer. 01:07:25.720 |
And those two nodes will use a ReLU activation. 01:07:27.720 |
Well, without doing the math, you can kind of imagine, just by matching shapes, how you might do this. 01:07:32.720 |
I'll take the ReLU, and I can take this part of the ReLU, and I can match it to the right half of my parabola. 01:07:39.720 |
And then I can take another ReLU, I can flip it, and then I can match it to the left half. 01:07:45.720 |
And then I can add them together, and I've got some kind of approximation to my parabola, 01:07:49.720 |
at least on this domain of x that we're looking at. 01:07:52.720 |
And that's what I've done in this example here. 01:07:56.720 |
which, let's see if the Wi-Fi behaves for us. 01:08:07.720 |
So here you can actually play with this simple neural network, and you can try to match it to a parabola by hand. 01:08:14.720 |
And you can see, and this is a measure of error up here, called mean square error. 01:08:18.720 |
And you can try and see how good you can get your level of error. 01:08:22.720 |
We're basically changing the different line pieces that we're using, which are made out of ReLUs, to try to match the curve. 01:08:30.720 |
This can kind of give you a feel for what the model is actually trying to do when it's trying to match a function. 01:08:36.720 |
And you can think of this like more neurons means more lines, which means a better approximation. 01:08:43.720 |
So with eight neurons, the parabola looks like this. 01:08:48.720 |
And with 200 neurons, you can barely tell the difference, at least at this scale. 01:08:52.720 |
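If you want to see the fiddling version in code, here is a tiny sketch of the two-ReLU idea in plain JavaScript, with hand-picked weights (exactly the eyeballing that training is meant to replace) and the mean squared error measured on a small grid of points.

// Two hand-picked ReLU neurons approximating y = x^2 on [-1, 1].
const relu = (x) => Math.max(0, x);
// One neuron handles the right half, the flipped one handles the left half; weights are eyeballed.
const approx = (x) => 1.0 * relu(x) + 1.0 * relu(-x);   // this is just |x|, a crude match to x^2

// Mean squared error against the real parabola on a small grid of points.
let mse = 0, n = 0;
for (let x = -1; x <= 1.001; x += 0.1) {
  const err = approx(x) - x * x;
  mse += err * err;
  n++;
}
console.log((mse / n).toFixed(4));   // roughly 0.03 -- more neurons would drive this lower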
But the other key thing is that we don't have to use trial and error to find these weights. 01:08:59.720 |
Imagine doing what I was doing with that fiddling, but at the scale of a real model. 01:09:04.720 |
With enough neurons, you can approximate almost any function. 01:09:07.720 |
It's called the universal approximation theorem. 01:09:09.720 |
But the key thing is you combine that with a special algorithm called backpropagation, 01:09:13.720 |
which lets us learn any function purely from its inputs and outputs without having to twiddle those knobs. 01:09:22.720 |
And that's important because what we're going to ask this multilayer perceptron to do is the core job of a transformer. 01:09:31.720 |
I'm going to give it the embedding of a token. 01:09:33.720 |
And I'm going to ask it, predict what the embedding of the next token is. 01:09:45.720 |
And I can say, learn from this input and output what that mapping function is. 01:09:50.720 |
And what will happen is backpropagation will look at the input. 01:09:53.720 |
It'll look at the output we got from the model when it was initially randomized. 01:09:57.720 |
It'll look at the ground truth from what came pulled from the internet. 01:10:01.720 |
And it will look at how we need to adjust the parameters and weights to change the perceptron to get more accurate at making that prediction. 01:10:07.720 |
And after enough iterations, it'll get better and better and actually begin to start matching the function. 01:10:13.720 |
The canonical analogy to understand what's happening in backpropagation, also known as, for our purposes, gradient descent, is a lost hiker trying to get down a foggy mountain. 01:10:24.720 |
And you're at the top of this foggy mountain as a hiker. 01:10:31.720 |
So you don't know which direction to get off the mountain. 01:10:34.720 |
Well, the one thing you can do is you can look down at the ground. 01:10:38.720 |
And you can say, oh, whichever way is going down, that's going to be the area towards getting off the mountain. 01:10:46.720 |
By the way, actually, in real life, I have tried this. 01:10:51.720 |
So this is a hiker representing -- the hiker in this analogy represents the model parameters. 01:10:58.720 |
It's in some space, but we don't know where to move the model parameters to get the least amount of error. 01:11:06.720 |
The height of the mountain represents how wrong we are at the current position of the hiker, which is to say the current values of the parameters. 01:11:13.720 |
And the mountain is foggy because we can tell the amount of error when I give it an input and it makes a prediction for the next token and we compare it. 01:11:21.720 |
But we don't know how to shift the model parameters to get lower error. 01:11:28.720 |
That's what backpropagation gives us: it will tell us which way the elevation is going down. 01:11:31.720 |
It won't tell us what the whole mountain looks like, but it will just say where you are standing right now. 01:11:35.720 |
Go in this direction and you'll decrease the amount of error you've got. 01:11:39.720 |
And you use that to find your way down the mountain, so to speak, of the parameters and find a minima. 01:12:02.720 |
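Here is a minimal sketch of that idea as one-dimensional gradient descent in plain JavaScript. The toy loss function and the numerical "feel the slope" gradient are just for illustration; real backpropagation computes the gradient analytically with the chain rule rather than by nudging the parameter.

// Sketch of the hiker analogy as one-dimensional gradient descent.
// loss(p) is the "mountain": how wrong the model is when the parameter is p.
const loss = (p) => (p - 3) * (p - 3);        // toy loss with its minimum at p = 3

let p = 10;                                    // start somewhere high up in the fog
const learningRate = 0.1;
for (let step = 0; step < 50; step++) {
  // Feel the slope under your feet: a numerical gradient at the current position.
  const grad = (loss(p + 1e-5) - loss(p - 1e-5)) / 2e-5;
  p -= learningRate * grad;                    // take a small step downhill
}
console.log(p.toFixed(3));                     // ends up very close to 3, the bottom of the valley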
What I'm going to do is I'm going to show you them in slide form. 01:12:06.720 |
And I'm going to graphically show what's happening in the GPT-2 MLP stage. 01:12:15.720 |
The input layer right here has 768 of these X inputs. 01:12:23.720 |
We're going to say, here's the embedding you're predicting from. 01:12:25.720 |
So I'm going to give it the 768 numbers of the preceding token, the one whose next token I want it to predict. 01:12:31.720 |
And then its output layer is 768 numbers for the predicted embedding token. 01:12:39.720 |
And then it's got one hidden layer, which is bigger. 01:12:42.720 |
This ratio of four times the embedding dimension turns out to be empirically useful and lots of models do it. 01:12:47.720 |
But I don't know if you could figure that out just from first principles. 01:12:53.720 |
And we have three steps here for applying our weights and bias. 01:12:58.720 |
Applying our activation function, which in this case is the gelu activation function. 01:13:02.720 |
And then we project that back down to our embedding dimension. 01:13:10.720 |
And then what we're doing is we're taking those embedding values. 01:13:13.720 |
And we're going to send each embedding value into its position inside the MLP. 01:13:21.720 |
And then the embedding values that come out are the embedding values of the predicted token. 01:13:28.720 |
And then we take the next token in our prompt and then run it through the MLP again. 01:13:33.720 |
In practice, you do this in parallel, but conceptually you can think of it this way as happening one after the other. 01:13:38.720 |
Okay, so what's happening in these three steps is really just a combination of matrix, add, and multiply. 01:13:45.720 |
So we take the result of our previous step, which is step 12, which I have not gone into. 01:13:50.720 |
And then we have some learned weight matrix, which you see is MLP FC weights. 01:14:04.720 |
So written as matrix multiplication, step 13 is just step 12 times some weight matrix plus some learned bias matrix. 01:14:12.720 |
Then we apply a GELU activation function, which I showed the diagram earlier. 01:14:19.720 |
And then we then do our projection, which is the remaining step to get down to the 768. 01:14:25.720 |
So we take the result of the previous step, which was the activation function. 01:14:29.720 |
We apply a different learned weight matrix, a new set of weights that gets learned. 01:14:34.720 |
And that's done with a matrix multiply and then another matrix add. 01:14:37.720 |
So step 15 is just step 14 times the projection weight matrix plus another learned bias. 01:14:48.720 |
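Putting steps 13 through 15 together as a sketch in plain JavaScript, for a single token's 768-number embedding; the weight and bias names here are stand-ins for the learned tables, not the workshop's actual variable names.

// Reusing the earlier linearLayer idea: y = W x + b.
const linearLayer = (W, x, b) => W.map((row, i) => row.reduce((s, w, j) => s + w * x[j], b[i]));

// GELU: the smooth activation GPT-2 uses (tanh approximation).
const gelu = (x) =>
  0.5 * x * (1 + Math.tanh(Math.sqrt(2 / Math.PI) * (x + 0.044715 * x ** 3)));

// x: one token's 768 numbers from step 12; fcW/fcB/projW/projB: the learned weights and biases.
function mlpBlock(x, fcW, fcB, projW, projB) {
  const hidden = linearLayer(fcW, x, fcB).map(gelu); // steps 13 and 14: 768 -> 3072, then GELU
  return linearLayer(projW, hidden, projB);          // step 15: project 3072 back down to 768
}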
Before we leave backpropagation, you might remember that I talked about how embeddings are learned by the model. 01:14:55.720 |
Both the token and in the case of GPT-2, they started learning the position embeddings. 01:14:59.720 |
So the key thing I want you to remember is that backpropagation is a generic optimization algorithm. 01:15:04.720 |
It is not just for the weights and biases of the MLP. 01:15:07.720 |
It can be used for other parts of the transformer too. 01:15:14.720 |
It's used for all the parameters in attention, the queries, keys, and values, 01:15:17.720 |
if you've heard those terms; they all get optimized with backpropagation. 01:15:20.720 |
Even other parts, layer normalization and the like, use backpropagation to get optimized. 01:15:25.720 |
And the analogy I want you to think about is that backpropagation, optimizing a model this way, is like one of those cooking competition shows. 01:15:33.720 |
If you remember those cooking shows, they tell the chef, here are your ingredients. 01:15:40.720 |
And maybe we give you some tools that you've got to use. 01:15:43.720 |
And then they have them compete against somebody else. 01:15:50.720 |
We give it the prompt and tell it, here's the next token that comes afterwards. 01:15:53.720 |
And then we define the steps in the architecture. 01:16:00.720 |
And backpropagation's job is to decide how much to mix of each ingredient at each step to get the desired output. 01:16:10.720 |
So, I've got this in our simplified diagram, which is 12x. 01:16:14.720 |
Which represents that what happens is we run attention and we run the perceptron. 01:16:19.720 |
But we ask the model to continue to iteratively refine its prediction for what the next token is. 01:16:29.720 |
And then it does it again and again and again. 01:16:37.720 |
In the case of your modern state-of-the-art model, it's going to be many more times than 12. 01:16:42.720 |
The key thing I want you to remember though is that each block is performing identical operations, just with its own set of learned weights. 01:16:51.720 |
And you can see this if you look at the code here. 01:16:54.720 |
You see, for example, here we're grabbing the weights of, in this case, the MLP. 01:17:02.720 |
But the key thing I want you to pay attention to is this H11. 01:17:05.720 |
That basically is saying hidden block or hidden layer. 01:17:17.720 |
If we were doing it for the first block, it would say H0. 01:17:21.720 |
So, inside the implementation of this code, we go to the iteration step. 01:17:26.720 |
All this kind of messy code is doing is it's grabbing the DOM objects for the blocks. 01:17:32.720 |
And then it's going into each of the formulas. 01:17:36.720 |
And it's just changing that H value to each one for every iteration. 01:17:40.720 |
And then reruns the entire set again, just to simulate what would actually be happening in a model. 01:17:49.720 |
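Conceptually, that loop over the blocks looks something like this sketch in plain JavaScript. The attentionBlock and mlpBlock functions are stand-ins passed in as arguments, and this is only the shape of the idea, not the DOM-rewriting implementation just described.

// The shape of the 12x loop: each block refines the embeddings using its own weights.
// attentionBlock and mlpBlock are supplied by the caller; blockWeights[b] holds block b's parameters.
function runBlocks(embeddings, blockWeights, attentionBlock, mlpBlock) {
  let x = embeddings;
  for (let b = 0; b < blockWeights.length; b++) {   // 12 blocks in GPT-2 small
    x = attentionBlock(x, blockWeights[b]);         // tokens talk to each other
    x = mlpBlock(x, blockWeights[b]);               // each token's embedding gets refined
  }
  return x;                                         // the last block's output holds the prediction
}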
So, this is when we finally get to take our predicted token embedding and turn it back into a token. 01:17:56.720 |
So, what we do is we take the last MLP's output from the last block. 01:18:01.720 |
So, this one here, we got our most refined prediction for what the token embedding is. 01:18:10.720 |
And then we're going to turn that into a token. 01:18:13.720 |
So, it's going to go through an operation called layer norm, which we haven't gone through in detail. 01:18:17.720 |
But what we get out of this step is the embedding of the predicted next token. 01:18:21.720 |
This is what it's saying the next token is going to be. 01:18:28.720 |
So, the way we're going to do this is we're going to go back to that matrix we had. 01:18:32.720 |
Our dictionary of tokens to embeddings, model_wte. 01:18:35.720 |
And we're going to take that and we're going to multiply it times our predicted next token. 01:18:43.720 |
So, one thing to remember, it's very helpful to think about the dimensions of these things. 01:18:52.720 |
This matrix has a row for every token in our vocabulary, 50,257 of them. 01:18:55.720 |
And then it's 768 dimensions wide, the embedding for each of those tokens. 01:19:01.720 |
And when we multiply it times this column, which is 768 dimensions representing embedding, 01:19:07.720 |
what we're basically doing is we're getting this, which is a column of 50,257, but only one wide. 01:19:14.720 |
Each one of those, right, is a dot product of the predicted embedding against one of the known token embeddings. 01:19:20.720 |
What we have is 50,000 scores for how similar the embedding we got is to each of the embeddings in our dictionary of tokens. 01:19:29.720 |
So, you can think of these as what are called logits. 01:19:34.720 |
How close is the embedding we got to our dictionary of tokens that we have? 01:19:39.720 |
And so, the more similar the prediction is to a token, the higher that entry. 01:19:44.720 |
To turn this into a probability distribution, we have a problem because these are just numbers. 01:19:49.720 |
A probability distribution has to sum up to one. 01:19:52.720 |
So, then we put it through a special normalization operation called a softmax, and that will make sure they all sum to one. 01:19:59.720 |
And then we can interpret this as entirely a probability distribution. 01:20:02.720 |
Each one of those normalized token scores will then basically represent the probability of that token being the next token. 01:20:19.720 |
And so, the logit, you know, for this negative 135.9 represents a score of some kind of the similarity of the very first token in our dictionary 01:20:31.720 |
against whatever this predicted embedding was. 01:20:34.720 |
And then, the code for the predicted token is not actually sampling from that probability distribution. 01:20:45.720 |
It always picks the highest probability token. 01:20:48.720 |
And that is done so that when you're using this, you can compare it against the same GPT-2 you'd get from OpenAI's code, 01:20:57.720 |
You can compare like for like, and you'll get the same result. 01:21:14.720 |
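Here is a minimal sketch, in plain JavaScript, of what that final un-embedding step boils down to: dot the predicted embedding against every row of the token table, softmax, and greedily pick the argmax. The wte argument stands in for model_wte; the names are illustrative.

// Sketch of turning the predicted embedding back into a token.
// wte: 50257 x 768 token embedding table; predicted: the final 768-number embedding.
function nextTokenGreedy(wte, predicted) {
  // One logit per vocabulary entry: a dot-product similarity score against that token's embedding.
  const logits = wte.map((row) => row.reduce((s, w, j) => s + w * predicted[j], 0));
  // Softmax so the 50,257 scores can be read as probabilities that sum to one.
  const max = logits.reduce((a, b) => Math.max(a, b), -Infinity);
  const exps = logits.map((l) => Math.exp(l - max));
  const total = exps.reduce((a, b) => a + b, 0);
  const probs = exps.map((e) => e / total);
  // Greedy decoding: always pick the single most likely token id, so runs are reproducible.
  let best = 0;
  for (let i = 1; i < probs.length; i++) if (probs[i] > probs[best]) best = i;
  return best;
}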
We might have an issue if we've got network being slow. 01:21:25.720 |
Now, there are other ways to pick the next token than just taking the most likely one or running a random number generator over the full distribution. 01:21:32.720 |
One of those is called top K, which is you say, I don't want the really unlikely words. 01:21:39.720 |
I don't want to accidentally end up with haircut. 01:21:43.720 |
So, what you do is you define a cutoff and you say, okay, the top 10 tokens, give me those. 01:21:50.720 |
Another way is called nucleus or top P sampling, which is instead of saying, give me just 10 01:21:55.720 |
tokens, just give me as many tokens as it takes to get 80% or 90% total probability. 01:22:01.720 |
So, I get most of the likely tokens and then I renormalize to those 90%. 01:22:16.720 |
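As a sketch in plain JavaScript, assuming probs is the softmaxed distribution from the previous step, the two filters might look like this; both return the token ids you would then renormalize over and sample from.

// Sketch of top-k and top-p (nucleus) filtering over the softmaxed probabilities.
function topK(probs, k) {
  return probs
    .map((p, id) => ({ p, id }))
    .sort((a, b) => b.p - a.p)
    .slice(0, k)                       // keep only the k most likely tokens
    .map((e) => e.id);
}

function topP(probs, p) {
  const sorted = probs.map((prob, id) => ({ prob, id })).sort((a, b) => b.prob - a.prob);
  const kept = [];
  let cumulative = 0;
  for (const e of sorted) {
    kept.push(e.id);
    cumulative += e.prob;
    if (cumulative >= p) break;        // stop once we've covered, say, 90% of the probability mass
  }
  return kept;
}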
You can see it says Mike is quick, he moves, and the predicted next token there is quickly. 01:22:20.720 |
So, if you run GPT-2 small using, you know, Hugging Face Transformers, you should get the same 01:22:25.720 |
answer, that same next token, as you get here: Mike is quick, he moves, and you get this final result. 01:22:32.720 |
And then it will tell us what the token ID was and what the maximum logit turned out to be 01:22:52.720 |
2953, because this is one-indexed, because it's like a spreadsheet. 01:22:58.720 |
That was the highest logit in the entire column. 01:23:02.720 |
And all the code that it's running here does next is just pick whatever the most likely token is. 01:23:08.720 |
And when it finds it, 2952, it converts it back through our token dictionary. 01:23:21.720 |
So, GPT2 was definitely groundbreaking when it first came out. 01:23:25.720 |
It was famously considered too dangerous to release. 01:23:28.720 |
But ChatGPT was, you know, earth-shattering. 01:23:32.720 |
So what were the additional innovations in those intervening years? 01:23:39.720 |
Well, for the most part, it was a lot of the same architecture, just more scale. 01:23:46.720 |
So, if you looked at a modern transformer, you would probably see a lot of the same parts, 01:23:50.720 |
And some of the parts might be upgraded or switched out. 01:23:54.720 |
So, attention mechanisms changed and other things like that. 01:23:57.720 |
But the biggest difference you should know about is that the job and the training were actually 01:24:04.720 |
The key thing to understand is that predicting the next word is not the same as being a chatbot. 01:24:09.720 |
Our GPT2 -- actually, I'll show you this here. 01:24:12.720 |
Our GPT2 is basically trained to predict next words from looking at text on the internet. 01:24:19.720 |
So, it is a next word predictor only for internet text, but not for being a helpful assistant. 01:24:24.720 |
So, if I give you this example, like first name, let's see if we'll come back quickly enough. 01:24:29.720 |
It says, first name, colon, password, email, colon. 01:24:39.720 |
So, it's like, oh, I'm in the middle of a form. 01:24:48.720 |
Anyone want to guess what this is going to output? 01:24:59.720 |
Hello, class, foo, brace, public static void. 01:25:05.720 |
But you can ask it, you know, helpful questions like, what is the capital of France? 01:25:18.720 |
So, embedded in that is some of the information we want. 01:25:21.720 |
But also a mix of other things we may not want or can't control. 01:25:25.720 |
So, what we have to do is figure out how to change it and shift it. 01:25:30.720 |
And so, this is the four-step pipeline for doing that. 01:25:33.720 |
And this is, I want to emphasize, mostly a training difference. 01:25:42.720 |
On the far left, you've got the base model, which just knows how to imitate text on the internet. 01:25:46.720 |
And you've got InstructGPT or ChatGPT all the way here on the right. 01:25:51.720 |
And what we're doing is a series of steps to kind of elicit or pull out the behaviors we want. 01:25:58.720 |
So, the first thing we do is we take the model and we train it to imitate text on the internet. 01:26:06.720 |
The next thing we want to do is we want to train the model on examples of what a helpful assistant looks like. 01:26:13.720 |
Ideal assistant responses, 10,000 to 100,000 examples of prompt and response that were written by people. 01:26:23.720 |
You can go on GitHub and look at the Stanford Alpaca dataset, which is a good example. 01:26:31.720 |
It's like, give three tips for staying healthy. 01:26:36.720 |
Make sure to include plenty of fruits and vegetables. 01:26:42.720 |
So, what we're doing is we're augmenting. 01:26:45.720 |
We're augmenting all that internet data with a subset of examples of how we want it to behave, to push it toward that behavior. 01:26:51.720 |
It's kind of learning to imitate specifically these examples. 01:26:55.720 |
And it will learn more about being a helpful assistant that way. 01:27:00.720 |
Then there's this last stage called RLHF, or reinforcement learning from human feedback. 01:27:05.720 |
And to understand this, you have to understand what RL is. 01:27:24.720 |
So the canonical example, let's zoom this to fit window. 01:27:34.720 |
So imagine you're a computer trying to play a game like this. 01:27:37.720 |
And you've got maybe a robot player that's navigating a maze. 01:27:41.720 |
And you've got some monsters and obstacles where you'll die if you hit them. 01:27:47.720 |
And what reinforcement learning will do is it will explore these paths. 01:28:02.720 |
And eventually, you'll learn the optimal strategy, which is this. 01:28:06.720 |
And this is a very different kind of learning. 01:28:08.720 |
Everything we've talked about so far is imitation learning. 01:28:12.720 |
I give it the next completed word I know from the internet. 01:28:17.720 |
This is saying, even if a human couldn't find the way through the maze, I'm asking you to find it. 01:28:25.720 |
It looks at its score and says, oh, that didn't work. 01:28:28.720 |
So it comes up with a plan or a policy to navigate the maze and maximize its score. 01:28:44.720 |
You're probably wondering, what does navigating a maze -- whoops, we've been through this slide -- have to do with language? 01:28:51.720 |
Well, you can think of generating token after token of text as walking a path through language. 01:29:00.720 |
And there are some paths that are probably ones that you want more than others. 01:29:09.720 |
And you might want to avoid paths like, I am an angry robot. 01:29:09.720 |
And so we need some way to score these various possible paths of text. 01:29:30.720 |
We are trying to teach it something more nuanced than simply imitation. 01:29:34.720 |
But there isn't a necessary, obvious way to score passages of text. 01:29:39.720 |
So what we first have to do is derive that scoring function for this game we're going to ask the model to play. 01:29:48.720 |
And we ask it to come up with two different types of passages. 01:29:51.720 |
So here, for example, is an example from Anthropic's helpful dataset. 01:29:54.720 |
And we ask it, hey, come up with a recipe for a pumpkin pie. 01:29:58.720 |
And then we have a chosen, a preferred one, and a rejected one. 01:30:01.720 |
So the chosen one is like, it tells you, grab a cup of sugar, half a teaspoon of salt, and so on. 01:30:07.720 |
The rejected one literally says, I love this, go buy some pumpkin and look at the package. 01:30:14.720 |
For the harmless dataset, this is one about alcohol. 01:30:18.720 |
And the chosen one says, hey, it sounds like alcohol is something you're using when you 01:30:24.720 |
Maybe you should think of a more productive way of channeling that. 01:30:26.720 |
While the rejected says, go ahead and drink whatever you want. 01:30:28.720 |
So we have these pairs of chosen and rejected types of responses. 01:30:33.720 |
And we use that to derive a scoring model from this data. 01:30:38.720 |
So we haven't, right now in this third step, we haven't changed our original model yet. 01:30:44.720 |
Then we pass that scoring model to the model itself and put it in that maze-like reinforcement 01:30:49.720 |
learning scenario to train the model to reinforce our preferences from the scoring model. 01:30:54.720 |
So in summary, we first build a general purpose knowledge base from text on the internet. 01:30:59.720 |
Then we train it on a specific task by giving it ideal outputs to mimic and imitate. 01:31:05.720 |
Then we learn human preferences or nuanced preferences. 01:31:09.720 |
And then we teach those nuanced preferences using reinforcement learning. 01:31:13.720 |
Right now there is a huge revolution in reinforcement learning, which is why I think this is so important to understand. 01:31:21.720 |
Partially kicked off by DeepSeek's R1-Zero and GRPO. 01:31:28.720 |
And I have a video on YouTube that you can go watch where I dive into that a little bit more. 01:31:36.720 |
So I kind of want to just put it all together and summarize where we've been on this journey. 01:31:40.720 |
So we've got tokenization, which was really just saying, hey, what is an efficient representation of our text as numbers. 01:31:52.720 |
And I didn't talk about this earlier, but one way of looking at embeddings is that they 01:31:57.720 |
have a rich history in natural language, but they also have a rich history in recommendation systems. 01:32:01.720 |
You can kind of think of this job as putting similar words with similar meanings in similar 01:32:06.720 |
spaces, as putting similar books or similar movies or similar music in similar spaces so 01:32:12.720 |
you can make the proper recommendation when somebody comes in. 01:32:15.720 |
Here is an example of a recommendation system. 01:32:17.720 |
I think this is Amazon Music, where they're trying to categorize the genre of music. 01:32:24.720 |
And so if you go back to our co-occurrence matrix, you know, this is in some sense a recommendation system. 01:32:31.720 |
If somebody asks me to predict what comes after the word ultimately, well, if I've got that 01:32:36.720 |
co-occurrence matrix, this is really helpful information. 01:32:39.720 |
It's at least better than random to guess what comes next. 01:32:41.720 |
So you can kind of think of this as a recommendation system for what the next word is going to be. 01:32:47.720 |
Latent within the embedding itself is not just a sense of what words are similar to it, but also what words are likely to come after it. 01:32:58.720 |
We can simply give it examples of our embeddings and what we know the next word to be, and it will pull that latent prediction out of the embedding itself and learn to predict what the next word is. 01:33:09.720 |
But, of course, there's another set of hints that are really useful, and that's all the words that came before. 01:33:15.720 |
So now we're going to let all the words talk to each other to share their context, to say, oh, you're moves, but you're moves in a fast context. 01:33:21.720 |
That's going to change and shift your recommendations. 01:33:23.720 |
You can kind of think of this as kind of like a superposition of recommendations for what the next word is going to be. 01:33:29.720 |
And then you're probably not going to get it right the first time. 01:33:32.720 |
So we're going to let you refine that prediction about 12 times. 01:33:35.720 |
And then finally, you'll come out with a predicted embedding. 01:33:38.720 |
And we just got to turn that into whatever the next word is based on how close it is to our known dictionary of embeddings. 01:33:44.720 |
And that's essentially one way of looking at the model in whole, despite all the complexity that we went through. 01:33:53.720 |
So we've been through a lot of different parts of the model at a very high level. 01:33:57.720 |
It is totally natural to feel like your brain is full. 01:34:00.720 |
What I often tell folks coming through this is don't expect full mastery. 01:34:05.720 |
But my metric for success is that you get the sense that mastery is within your grasp. 01:34:11.720 |
There's nothing in here that was so complex you can't understand it. 01:34:18.720 |
And we can turn what appears to be magic into machinery that you can understand. 01:34:26.720 |
Before you go, last thing I'll just say is just like your favorite, you know, AI model, I get better from human feedback. 01:34:33.720 |
So to incentivize you to fill out the survey and join the mailing list, there's a link in the Discord channel. 01:34:40.720 |
If you fill it out and join the mailing list, I will send you the PDFs from today's workshop. 01:34:45.720 |
And then if you visit spreadsheets are all you need, you can join the mailing list. 01:34:50.720 |
There's a YouTube channel as well where I've got a bunch of other videos. 01:34:53.720 |
And then I also have a Patreon I just launched, and I'm available for consulting, training, and implementation. 01:35:01.720 |
I hope you enjoyed the presentation and feel like now it's a little less like magic. 01:35:27.720 |
And so I just wanted your expert opinion on like the way that I'm using AI is very much like voice, speech-to-text, and I just ramble and I just try to give it as much information and sometimes I'll reiterate what I think is really important. 01:35:56.720 |
So the number one thing, I have a video I'm working on for this, like, the number one thing I'd say is you have to treat it scientifically. 01:36:06.720 |
You can have theories about how the model works, but you don't really know until you test it. 01:36:11.720 |
This whole, like, the whole space is very empirical. 01:36:16.720 |
So one of the common things in prompting they used to tell you is like say please and thank you. 01:36:21.720 |
Or say my grandma used to do this, I'm going to lose my job, right? 01:36:28.720 |
But there was a great paper, The Prompt Report by Sander Schulhoff, where they went and tested a bunch of models. 01:36:35.720 |
And they found that, you know, it turns out with later models it didn't work. 01:36:38.720 |
And then Ethan Mollick's team also did a recreation of a similar test. 01:36:43.720 |
And they just tested a bunch of models with a bunch of prompts. 01:36:46.720 |
They tried it with polite words and they just said, okay, which one's better? 01:36:50.720 |
And they found that it wasn't really helpful. 01:36:53.720 |
So that being said, generally, for one-shot use cases like yours, I use a Whisper tool myself all the time. 01:37:07.720 |
But then what I do is I go through and I look through and fix things up. 01:37:11.720 |
Like if there's grammar or I repeated something or I said something wrong. 01:37:16.720 |
And the way to think about why you want to do that is, it's a somewhat subtle point. 01:37:21.720 |
But when we go back to this diagram, this whole process is fixed. 01:37:32.720 |
If I put a token in and I know how many tokens are in the prompt, I can predict how many FLOPs it will take. 01:37:37.720 |
This is why, you know, when a model like DeepSeek was trained, we can estimate how many FLOPs were used. 01:37:42.720 |
Because this thing isn't just like an arbitrary program; the compute per token is fixed. 01:37:48.720 |
So if you make the prompt do more work, if the prompt is going to make the model do more work, 01:37:54.720 |
you kind of think of it like it loses some ability to do some other thinking. 01:37:59.720 |
And so if you have spelling mistakes, if you say something slightly wrong that's different from normal, 01:38:07.720 |
then it has to spend some sense of compute fixing that up. 01:38:11.720 |
This is less true with the reasoning models today. 01:38:13.720 |
Because the one thing the reasoning models can do is they can repeat this process more times on their own. 01:38:20.720 |
But I would say what you're doing is probably fine. 01:38:22.720 |
It's probably better that you put as much as you can that is relevant in. 01:38:26.720 |
If the choice is I put more stuff in, but I had the grammar wrong, but I had the relevant stuff in, that's better than if you didn't have it in there. 01:38:34.720 |
But if you're trying to engineer a prompt going into your model, I would spend time trying to optimize what its behavior is with some evals to make sure. 01:38:42.720 |
Or at least benchmark what it is and then when a new model comes out, you can see whether the new model changes things. 01:38:46.720 |
That's why evals are so important, because this whole space is very empirical. 01:39:08.720 |
I wonder what your take is on the new mixture of experts models, where companies are using very, very fine-grained experts. 01:39:19.720 |
I wouldn't call it my take, but I'll give you the conventional take. 01:39:24.720 |
So the question is, what's your take on the mixture of experts models? 01:39:29.720 |
There are three or four things we don't cover in this workshop that you should know about when you come out of it. 01:39:41.720 |
One of them is mixture of experts. Another is RLHF, which I talked a little bit about at the end. 01:39:45.720 |
And then the other is reasoning models, which we just talked about, where the model can kind of run itself through. 01:39:52.720 |
It's probably one of the biggest top four changes. 01:39:56.720 |
What we are trying to do with the mixture of experts is, first of all, it's only here in the perceptron, 01:40:04.720 |
which tends to dominate a lot of the calculation inside of a model. 01:40:08.720 |
And what you're trying to do is get more knowledge, use more parameters, without increasing the amount of compute. 01:40:15.720 |
So what you do is you conceptually take this perceptron and you break it into pieces. 01:40:20.720 |
And then you say, depending on what token comes in, I'm only going to use a subset of my perceptron's thinking. 01:40:25.720 |
And that way you can be more efficient with your compute and actually potentially your memory, too. 01:40:29.720 |
You can shard the memory nicely per device if you want to do stuff like that. 01:40:38.720 |
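As a rough sketch of that routing idea in plain JavaScript, assuming top-1 routing and taking the expert functions and the router's scoring function as arguments (real MoE layers usually route to a few experts and weight their outputs, so this is only the gist):

// The gist of mixture-of-experts: several expert MLPs, a router, and only a subset runs per token.
// experts: array of functions, each a smaller MLP; routerScore: gives one score per expert for this token.
function mixtureOfExperts(tokenEmbedding, experts, routerScore) {
  const scores = experts.map((_, i) => routerScore(tokenEmbedding, i));
  // Top-1 routing for simplicity: only the best-scoring expert does any work for this token.
  let best = 0;
  for (let i = 1; i < scores.length; i++) if (scores[i] > scores[best]) best = i;
  return experts[best](tokenEmbedding);   // parameters grow with the expert count, compute does not
}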
And we've had some really great models based on it. 01:40:41.720 |
The challenge is training an MOE model is difficult. 01:40:45.720 |
And so it's taken a while for some of the open source community to catch up in that implementation. 01:40:57.720 |
There are some much older models before ChatGPT that did MOE in other contexts. 01:41:05.720 |
It's trying to cram more knowledge and more parameters while keeping the amount of compute used lower. 01:41:13.720 |
You can think of it as giving it more knowledge.