
GPT Internals Masterclass - with Ishan Anand


Transcript

>> All right. Let's go. >> Okay. Did you want to do an intro or should I just go ahead and start? >> Intro. I was just excited to have Ishan back. I guess you already previewed last week, but also you spoke at the World's Fair, and you're world-famous for Spreadsheets Are All You Need.

It's your thing. But also, it means that you understand models on a very fundamental level because you have manually re-implemented them, and today you decided to tackle GPT-2. So welcome. >> Yeah. Thank you. So I'm excited to be presenting Language Models Are Unsupervised Multitask Learners. For context, the other thing swyx had alluded to in previous paper clubs is there's going to be paper clubs for more, not test of time, but something similar, and this is therefore more evergreen, more general, more of a novice audience than might be a traditional paper club reading, where I think it's a lot more sophisticated.

But hopefully, if you're just coming to the field of AI engineering, this video and some others like it, and some resources I'll point you to, will help you get started in understanding how LLMs actually work under the hood. The name of the paper is Language Models Are Unsupervised Multitask Learners.

It is not actually officially called the GPT-2 paper, but that's how everyone refers to it. You'll notice a couple of big names here. You might recognize Dario, you might recognize Ilya. If you're in the community, you'll recognize Alec. A lot of these people went on to continue to do great and amazing things in the community.

I am, as swyx noted, Ishan Anand. You can find me here at my homepage. I'm probably best known in the AI community for Spreadsheets Are All You Need, which is an implementation of GPT-2 entirely in Excel. I teach a class on Maven that's basically seven to eight hours long, where we go through every single part of that spreadsheet.

For people who have minimal AI background, it's a great first class in AI. I'm an AI consultant and educator, and I'm really excited to give you the abbreviated version of that and the GPT-2 paper today. Let's get started. Here's what we're going to talk about. We're going to talk about why you should even pay attention to GPT-2.

Strangely enough, I get this question from my class. People are like, "Oh, I saw that it was GPT-2. That's got to be out of date." We should talk about why that's important and why you should pay attention. Then we'll talk about the dataset. Then we'll talk about actually the results.

This slide is backwards. Then we'll talk about the model architecture. No, sorry. We'll talk about the model architecture, then the results. Then we'll talk about the future directions because we know what the future is going to hold, but they didn't. We'll talk about it as if we didn't really know.

I don't know if Angad is here. If he is, he can jump on or let me know. I don't know if I'll see it in the chat. But he led a really great paper club about, I think it was eight or nine months ago on the original GPT-1 paper.

I highly recommend checking that out. Also partially because, spoiler alert, the GPT-2 paper doesn't talk a lot about the model architecture. There's a limit to what you'll learn about the model and model building from the GPT-2 paper. It's actually GPT-1 just scaled up. >> Angad is actually here. >> Oh, he is.

Oh, Angad, did you want to just jump in and say anything about this? This is your slide. >> Yeah. Can you guys hear me? >> Yeah. >> Yes. >> Cool. Thank you for the shout-out to my overview of GPT-1. As you said, GPT-1 is the precursor to GPT-2. The architecture is almost exactly the same.

There are going to be some little differences that I think you're going to present in today's discussion. But I think everyone should at least give GPT-1 a read, or try to see how they actually arrived at or settled on the transformer architecture, and also their training objective of language modeling via next-token prediction.

We've got some cool resources for you guys. We have the official paper from the OpenAI team. We also have a blog post, basically a write-up about GPT-1 from 2024, so it's a retrospective look back on GPT-1. Also, we have, as you mentioned, the paper club episode about GPT-1.

It is recorded and it is on YouTube. Please make sure you at least give it a read or a watch. It's going to be an immense resource for you guys. Back to you. >> Okay. Thank you. I see a request for the slides, which is a great question.

Let me do that right now because my pet peeve is people who withhold the slides from you. Let me do this. Anyone with the link can go ahead and comment. Copy link, and then I will drop it in the chat. Let's see. There we go. You should have it in the Zoom chat, and I will drop it, or somebody can please drop it in the Discord as well for the paper club.

Thank you. Okay. Thank you. Great. Whoops. We're not going to auto-play the video. There we go. So let's get started. So first off, why does GPT-2 matter? Well, the first thing is it was one of the first cases where we saw that one model is really all you need. We had a single model that solved multiple tasks, and here's the key, without any supervised training on any of those tasks.

It had one simple objective, which was to predict the next word, and that let it learn how to do many different tasks. This seems obvious today because we're basically six years later. But at the time, it was not obvious. GPT-1, for example, was pre-trained to predict the next word.

Then what they did is they stuck different structured output configurations on top of it, and fine-tuned it on each of these tasks, like classification, similarity, multiple choice. So actually, this is right from the GPT-1 paper right here. So this is from GPT-1; that was the setup. So we had one model that was pre-trained, and then fine-tuned, and then had a structured output setup for every single task.

So it still was not like ChatGPT where you could just talk to it, and GPT-2 still wasn't quite there. But you could start using prompt engineering to get the right result. By contrast, GPT-2 was pre-trained again on predicting the next word, but then you just gave it task-specific prompts, few or zero-shot prompts, as we'll see in the results section, to get the desired output that you wanted.

A useful and interesting comparison and contrast is the Google MultiModel paper. You'll recognize these names: Noam Shazeer, Aidan Gomez, Lukasz Kaiser, Ashish Vaswani, a lot of the same people from Attention Is All You Need. These guys know how to name a paper, I'll say. One Model To Learn Them All, obviously a Lord of the Rings reference.

Here it is. It's a multi-task model. It's even multi-modal, and believe it or not, it's also a mixture of experts. It can take an image, it can caption it, it can categorize it, it can do translation. This is way back in 2017. But the key thing is that it was supervised fine-tuned, or supervised trained, for each task, although it was done jointly all in the same model, and they had task-specific architectural components for each task.

It was not the same just-predict-the-next-word setup. It was, I'm going to give you these datasets for each of these different tasks, but I'm just doing it in the same model across all of them. The key hypothesis of the paper, then, is that, as you see here, our speculation is that a language model with large enough capacity, so something that's large enough, will infer and learn to perform the tasks demonstrated in that dataset regardless of how they were procured.

It'll be able to basically learn multiple tasks entirely unsupervised. This is right from the GPT-2 paper, which I'll pull right up. I'll go back and forth between the paper. Where did they have this? There it is. It's right there, "our speculation," right in the beginning. The other key element here is the emergence of prompting is all you need, and the emergence of prompt engineering.

Here, we're just using prompts in order to condition the model. As a follow-on to what we talked about earlier, we started to see prompt engineering take on a role for the first time as a way to control a model, where previously you would have stuck a different head on top of it and fine-tuned it.

It's also the emergence of scale is all you need. This multitask capability emerges and improves as the model's training time, size, and dataset increase. As you can see here, when we look at GPT-1, then GPT-2, which was basically a 10x scale-up in the number of parameters and also in the dataset size, and then it set the stage later for GPT-3, which was going to scale it up by another 100 times on the number of parameters, with the idea, after they saw these results, that, hey, we'll simply scale up the model larger and we'll get better results.

The other interesting thing about GPT-2 is it's also the continuation of this idea that the decoder is all you need. If you're new to transformers, the transformer traditionally, in the original Vaswani implementation, had the left half here, which was called the encoder, and then the decoder on the right half, which was in charge of generating the output.

They basically dissected it in half and said, you only need the decoder, because all we're going to do is basically text generation. They were obviously not the first to do this; GPT-1 preceded it. Around the same time as GPT-1 was this other paper, also by Google, and Noam Shazeer and Lukasz Kaiser again, which is Generating Wikipedia by Summarizing Long Sequences.

It was another decoder-only model, and I'm not sure who was first. They were both published in 2018. This one was published in January of 2018. GPT-1, I think, was the middle of the year, but you never know when these start. So it's not quite clear. Moving on, this eventually ended up being, as we now know today, the most popular way you would implement a large language model.

And GPT-2 is basically the ancestor of all the major models you're probably familiar with. So GPT-4, Claude, ChatGPT, Cohere is in here, Bard, now Gemini, Llama; they're all decoder-only transformer models. So the key idea is if you understand the GPT-2 architecture, you're basically 80% of the way to understanding what a modern large language model looks like.

A lot of the components are still the same. Maybe they've replaced layer norm with RMS norm and so forth. There's probably only a few other changes, but it's most of the way there, 80%. So it was highly influential. And part of that may be because of a lot of the hype around GPT-2, but part of that is also that it was, as of this recording, the last open-source model from OpenAI.

So a lot of people dug into it and took inspiration from it. It was also probably one of the first AI models to break out of the AI bubble. So this is the famous passage where they prompted GPT-2 to write a fake news article about unicorns living in the Andes Mountains.

This got a lot of press in 2019. In fact, it was one of the reasons, the risk of misinformation, that initially the open source release of GPT-2 was only the smallest model. They didn't release the source or weights for the larger models until later that year.

I believe it was in November. And it got a ton of attention. It was called the AI model too dangerous to release, which, for better or for worse, made it break through the AI bubble and into the public consciousness. So that's why GPT-2 is all you need, in a sense, to get started and why it's so important to the field.

Okay, now let's talk about the data set, because the data is a huge part of any AI model, especially a large language model. The problem they faced is, if we're going to train an unsupervised large language model, we need a data set that is sufficiently large, that is high quality, and that demonstrates a wide range of tasks the model can learn from, because those tasks should be embedded in it.

It should be sufficient enough that it has that wide variety, even though we're not explicitly going to fine-tune it on any of these particular tasks. And the solution they came up with was to create a new data set using the internet. Let's just grab text on the internet. But we also need it to be high quality.

So we're going to use social media for a quality signal. And then because the web, I say internet here, I'm really talking about the web, has a wide range of tasks, it should be sufficiently large if we have a large enough data set. It should demonstrate a variety of different tasks that we can use to test the model.

So they created this data set called WebText. First, they started by gathering all the outbound links from Reddit before December 2017. Then they removed links with less than three karma. So that was their quality signal. If it didn't get enough karma, then it was not a high-quality link. I want to be really clear.

They didn't actually scrape Reddit and Reddit conversations. They just used Reddit to rank sites, kind of like how Google ranks sites through PageRank. The insight was that Reddit karma might be a better proxy for human quality judgments. Then they actually removed Wikipedia entries. And the reason they did this is some of the tests we're going to talk about later are actually tests that involve Wikipedia as part of the data set, as part of the evaluation.

So they wanted to avoid data contamination and not train the model on text that it would later be tested on. Although it's not shown in this diagram, they also removed any non-English text, or they tried to. It turns out some leaked in and turned into a capability to do translation.

And then they extracted the raw text from the HTML files using the Dragnet or Newspaper Python libraries, and that got them the WebText data set, which was 8 million documents or 40 gigabytes of data. And to put this in perspective, the GPT-1 model was trained on the Books Corpus, which is a series of unpublished books that was about 4.8, roughly five, gigabytes in size.
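Putting the filtering steps just described together, here is a hypothetical sketch of the pipeline in code; the link objects and the extractText() helper are illustrative assumptions, not OpenAI's actual tooling.

    // Hypothetical sketch of the WebText filtering described above; the link
    // objects and the extractText() helper are assumptions for illustration,
    // not OpenAI's actual pipeline.
    function buildWebText(redditOutboundLinks) {
      const cutoff = new Date('2017-12-01');
      return redditOutboundLinks
        .filter(link => link.postedAt < cutoff)               // links shared before December 2017
        .filter(link => link.karma >= 3)                      // karma as the human quality signal
        .filter(link => !link.url.includes('wikipedia.org'))  // avoid contaminating later evaluations
        .map(link => extractText(link.html));                 // Dragnet / Newspaper extraction step
    }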

So this is about an order of magnitude more data. It was pretty large. It was not the largest data set at the time. I believe there was a BERT one that was larger, but it was one of the largest. And then, to put it in relative terms, I did an estimate that it's roughly another order of magnitude bigger than this--

Sorry, GPT-3's data set compared to the GPT-2 WebText data set. Okay, now let's talk. Oh, let's see, we got questions. Let's see, should I pause for questions or just keep going? - I mean, if anyone has questions, now is a good time. - Okay. Sure, I just opened the door for it.

Are there any questions? I see a "keep going." Looks like people are handling some of the questions in chat. Okay, let's talk about the architecture of these models. So the GPT-2 "models," and I put this in quotes, were a series of four models. A couple of notes: I put GPT-1 in as a comparison point.

GPT-1 was 117 million parameters. It had 12 layers, 768 for the embedding dimension, and a 512 context length. For the four models that they created for GPT-2, when you read the paper, you should note that they do not refer to them as small, medium, large, XL. That naming came after. Instead, when you read the paper, and this can be a little confusing, they reserve GPT-2 as the name for the largest model.

So GPT-2 XL for them means GPT-2. They're the same, which is confusing if you're using Hugging Face Transformers, because if you use the bare model name "gpt2", you just get GPT-2 small. So to avoid the confusion, just be aware. And then in the text, they refer to all the other smaller variants as WebText language models.

So these three are called the WebText language models, and then this is what they refer to as GPT-2 in the paper. So for them, GPT-2 is simply the largest model, which makes sense in retrospect, because small really is just a replication of GPT-1. And one other thing: they tried to replicate it so closely that they originally reported the size as 117 million parameters.

Turns out it was 124 million. So when you download the weights from, I guess, Azure now, these two are the same. They're just simply renamed because there was a typo in the size of the model. And you can see, basically, the largest model has an embedding dimension of 1,600, so roughly twice as big as the original GPT-1's.

They increased the context length for all of them, and it has a lot more layers. Hence, a much bigger model. Unfortunately-- well, there are a few changes architecturally from GPT-1 to GPT-2. First is the larger size, which we saw in the previous slide. They also increased the vocabulary slightly, from roughly 40,000 to roughly 50,000 tokens.
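To make the four sizes concrete, here is a small sketch of the commonly cited configurations of the released checkpoints; these numbers come from my own recollection of the public configs rather than from the slide, so treat them as approximate.

    // The four GPT-2 configurations as commonly cited from the released checkpoints.
    // All share a 1,024-token context window and a vocabulary of roughly 50,000 tokens.
    const GPT2_CONFIGS = {
      small:  { params: '124M', layers: 12, dModel: 768,  heads: 12 },
      medium: { params: '355M', layers: 24, dModel: 1024, heads: 16 },
      large:  { params: '774M', layers: 36, dModel: 1280, heads: 20 },
      xl:     { params: '1.5B', layers: 48, dModel: 1600, heads: 25 }  // what the paper calls "GPT-2"
    };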

And this one is actually interesting. They moved layer norm from post-activation to pre-activation. They were inspired by a paper that did this in an image model and proposed that pre-activation was better. It turns out, a year or two after this, somebody did work on language models and showed that pre-activation was better for layer norm there as well, and improved training stability in certain cases.
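To make that layer norm change concrete, here is a minimal pseudocode sketch of the two block layouts, assuming layerNorm(), attention(), mlp(), and add() helpers exist; this shows the structural idea, not either model's actual code.

    // GPT-1 style (post-norm): normalize AFTER each residual add.
    function postNormBlock(x) {
      x = layerNorm(add(x, attention(x)));
      x = layerNorm(add(x, mlp(x)));
      return x;
    }

    // GPT-2 style (pre-norm): normalize BEFORE each sub-layer, so the residual
    // path itself stays untouched, which tends to help training stability.
    function preNormBlock(x) {
      x = add(x, attention(layerNorm(x)));
      x = add(x, mlp(layerNorm(x)));
      return x;
    }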

Unfortunately, the paper has few details on the actual training. So we know that the batch size was 512 and that the learning rate was tuned for each model size, but the exact numbers are not specified. And this is really interesting because the GPT-3 paper, for example, went into a lot more detail on this.

In fact, it's, I think, right here near the beginning. Let's see. There it is. You've got a table here on the batch size and learning rate. And I think the appendix actually has the AdamW parameters as well. So there isn't a lot of detail on how the model works and how it was specifically trained.

Thankfully, however, the community, partially because it was eventually open sourced and had so much attention on it, came up with a large number of implementations. So there's the official OpenAI implementation, which is right here. And one thing I like pointing out to people, if you go into the source and you click on this, if you're new to AI and machine learning, you don't realize how small the actual code is because all the knowledge is in the parameters.

If you add up all the code here and you take out the TensorFlow, it's basically just 500 lines of code. It's one of my favorite statistics to help people understand, yes, you can understand this. It's only 500 lines of code. You can grok it if you just spend a week or two on it.

So don't feel like this is magic that you'll never understand. The most popular way to use it, probably today, is through Hugging Face Transformers, which is another implementation of it that uses the same OpenAI weights that they released publicly. And then-- whoops, there we go. This is technically not an implementation, but a really popular guide to how the inside of the model works.

I found this an extremely helpful resource as well, where Jay Alammar, who's now at Cohere, goes through in detail how every single step of the transformer works. He has really great diagrams and illustrations for how the model works. Another popular implementation is minGPT from Andrej Karpathy, which is a PyTorch reimplementation.

The original version of GPT-2 was in TensorFlow. So this one's in PyTorch. You can see it here on GitHub. And it's also OpenAI weight compatible for GPT-2. And then he has llm.c, which implements GPT-2 entirely in C, without any PyTorch, for performance. A lesser known, but I think equally interesting, implementation is TransformerLens from Neel Nanda.

And I think this helps go to why GPT-2 is so interesting and important. A lot of folks in mechanistic interpretability like to use small models to do experiments. And TransformerLens is a tool for running understanding and interpretability experiments on GPT-2-style large language models. And in fact, if you see the video that I did at the AI Engineer World's Fair last year, I do a version of Golden Gate Claude.

That was thanks to Neel Nanda and his team, who had done sparse autoencoders for GPT-2, partially using a version of this thing called SAELens. And I just basically used one of their vectors and stuck it in my spreadsheet. And that was a huge benefit and boon. But it's a great way to learn how these models actually work by doing experiments on them.

Another great visualization is this one, which is Transformer Explainer. It has some really nice graphics. You can watch essentially how information propagates through the network. Another great visualization is this one, which is very popular. It's got nanoGPT, GPT-2 small, and GPT-3. You can kind of see-- I like this view because you can see how much smaller nanoGPT is compared to GPT-2 small.

And then here's GPT-2 XL. It really makes it very visceral in terms of how it feels. And the one challenge I have with visualizations is they're fun to look at, but you can't actually go in and make modifications to them. You can't build your mental model by interactively changing things within them.

So there's mine, which is Spreadsheets Are All You Need, which is an Excel file that implements all of GPT-2 small entirely in Excel. Let me see if I can pull that one up. Oh, wonderful. It restarted on me because I'm running in parallel as well. We'll do this right now.

I'll show you the other one. So that one, you can see there's a video right here from AI Engineer World's Fair. And I walk through the spreadsheet version of this, of GPT-2. It's a really abbreviated version of how the model works. And then the most recent version is this one, which is GPT-2 entirely in your browser.

So this one's entirely in JavaScript. And let me walk through it for just 5 or 10 minutes as kind of an intro to how transformer models work. Before I do that, I am going to give you a five-minute introduction on how to think about a transformer model. So basically, we have this.

I like this simplified diagram rather than the canonical diagram. Basically, you're taking text. You turn that text into tokens. You turn those tokens into numbers. We do some math or number crunching on them. We turn those numbers into text. And that becomes our next token. We translate those numbers back out.

And the way I like to think about this is tokenization is just representation. But you have your token and position embeddings. And this is really a map for words. We're basically grouping similar words together. So I like to imagine, say, a two-dimensional map. But in this case, in the case of GPT-2 small, it's 768, 1,600 in GPT-2 XL.

So you can imagine happy and glad are sitting here. And sad's maybe a little close to it, but not quite as close. And then things that are very different, like dog, cat, are over here. And rather than thinking about this long list of numbers as just some arbitrary list of numbers, these are points in a space.

Instead of two dimensions, though, they're now points in a 768-dimensional space or 1,600-dimensional space. But what we've done is we've grouped similar words together. And once we've done that, when we think about it, similar words should also share the same next word predictions. So the next word after happy is probably also the next word after a similar word like glad.
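As a tiny illustration of what "close together" means numerically, here is a hedged sketch of cosine similarity over plain arrays; with real GPT-2 vectors, a pair like happy and glad scores noticeably higher than a pair like happy and dog.

    // Cosine similarity between two embedding vectors represented as plain arrays.
    function cosineSimilarity(a, b) {
      let dot = 0, normA = 0, normB = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
      }
      return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }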

And then that gives us kind of a boost or a head start that we can hand to a neural network. Neural networks are really good when you give them questions and answers; they'll learn to pick out what the answer is. So you give it photos, and you say which ones are dogs and which ones are cats.

It'll learn to figure that out. In this case, we'll give it sentences, and we'll ask it to complete the next word, or give it, say, a single word and say, what is the next word after it? And it will start learning that. The only other wrinkle is we have additional hints we can give it, which is all the hints from all the other words that came before it.

And really, what the transformer is doing is it's letting every word look at every other word to inform what its actual meaning is. Get a better hint. Instead of just taking a one word or two or three word history prediction, it's going to look at all the past words.

And then it refines that prediction over 12 iterations, in the case of GPT-2 small; more in the larger ones. And then we get a predicted number back that we just convert back out to a word. So putting it all together, we get basically this diagram. So we take a prompt, we split it into tokens, we convert those tokens into numbers, and then we refine that prediction iteratively, and we pick the next most likely token.
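In code form, that whole loop is roughly the following simplified sketch; tokenize, embed, transformerBlock, unembed, lastRow, and argmax are hypothetical helpers standing in for the real steps.

    function predictNextToken(prompt) {
      const tokens = tokenize(prompt);             // text -> token IDs
      let x = embed(tokens);                       // token IDs -> vectors, with positions added
      for (let layer = 0; layer < 12; layer++) {   // 12 blocks in GPT-2 small, 48 in XL
        x = transformerBlock(x, layer);            // each block refines the representation
      }
      const logits = unembed(lastRow(x));          // scores over the ~50k-token vocabulary
      return argmax(logits);                       // pick the most likely next token
    }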

I've simplified what happens in here. There's actually-- and you'll see this in the spreadsheet-- there's 16 steps. And they all are mapped out here. I can-- let's see if this loaded up. Oh, wonderful. Let's go back to the spreadsheet, if it will load up. Let's see why Excel isn't working.

But that's fine, because I'm going to demonstrate the web version of this. So this is the same thing as GPT-2 small, except running entirely in JavaScript. And what's exciting about this is you don't need to have Excel anymore. You can just come with your browser, and it will run entirely locally.

And the way it's structured-- let's pull this up here. Here we go. It's actually a series of vanilla JavaScript components. Everything is like a Python notebook. You've got a cell here that wraps everything. There's only two types of cells right now. One is simply like a spreadsheet. It basically runs a function, and then it shows the result in a table, as you can see here.

And then the last one is just defining code in raw JavaScript. And what's great about this is you can debug the LLM entirely in your browser. No PyTorch, nothing else getting in the way. So to run this, the first thing you want to do is click this link and download the zip file, which is going to have all the model parameters.

You'll drag and drop those into here. It'll basically stick 1.5 gigabytes into your browser's IndexedDB. And then you're sitting right here. So the first thing we do is we define matrix operations in raw JavaScript. So this is our matrix multiply, also defined in raw JavaScript. Really simple two-dimensional arrays are the structure we use for this.
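For flavor, a matrix multiply over plain two-dimensional arrays looks roughly like this; it's my own sketch in the same spirit as the demo's helpers, not the project's exact code.

    function matMul(a, b) {
      const rows = a.length, inner = b.length, cols = b[0].length;
      const out = Array.from({ length: rows }, () => new Array(cols).fill(0));
      for (let i = 0; i < rows; i++) {
        for (let k = 0; k < inner; k++) {
          for (let j = 0; j < cols; j++) {
            out[i][j] += a[i][k] * b[k][j];   // accumulate the dot products
          }
        }
      }
      return out;
    }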

There's a transpose. There's a last row. And then you enter your prompt here. You hit Run, and it'll actually run the model. So let me show you, though, the debugging capabilities. So I'm going to do this. I'm going to take this thing, which is separating into words. And I'm going to run it up to here.

And you can see our prompt is "Mike is quick, he moves." It separates into these words. But what I can do, without ever leaving my browser, is go here and say, well, you know what? I want to see what matches looks like.

What is that array? console.log(matches). Hit this. Now rerun this function. Oh, there it is. Right there in my debugger. I can just see the result. And heck, you know what? Maybe I really want to just step through this thing. So I can hit this, put in a debugger statement, hit Play, and boom.

I'm right here inside my DevTools, and I can debug a large language model at any layer of abstraction that I want. So great way to kind of get a handle on what's happening under the hood. And I'll take you through a brief view of this, so let me get rid of these statements.

Redefine that function, and then I'll reset it. And then if we hit Run, what it's going to do is it's going to run through each one of these in order. So the first section is really just defining basic matrix operations. The next section is our tokenization. So here we separate things into words.

Then we actually take those, and we do the BPE algorithm to turn them into tokens, which we'll get out here. So if I run to here, I'll see if we-- well, here's our list of words-- our tokens, rather, and then their token IDs. And then if we keep going, this will turn the tokens into embeddings.

So this is a series of steps to do that. Then finally, we add the positional embeddings. We're basically just walking through this same diagram, in order, that I showed you here. So we tokenize, then we turn it into embeddings, and then we go inside each of the blocks.

And this is the 16 steps. These match the same steps in the Excel sheet, where we'll basically do layer norm, for example. We'll do multi-head attention. And we'll do the multi-layer perceptron. I'm going to do the following. If I hit Run and turn away, it'll actually stop running because it's in the browser, and the browser's optimization stops it.

But you can hit Run, go to this page right now, and then click it, and you can watch it run. It'll take about a minute to predict the next token. So that's a quick overview of GPT-2 and some resources to understand the model in more detail. Happy at the end if we've got questions to go into this in more detail, but I don't want to spend our entire time on that.

OK. That was the demo. Any questions so far? Let's see. I'm going to look in chat. Oh, boy. Is there a Python Jupyter version of this? It's going to be fun when he checks the chat now. OK, great. This looks like Jupyter. Yes, it does. My drawer looks like Ishan's desktop.

Awesome. OK, this is definitely fun. There is a Jupyter version. Well, the Jupyter version of this would be minGPT, probably running inside Jupyter, or TransformerLens running inside Jupyter. All of these other ones are basically Python implementations. This one is no Python. You can just run it right in your browser.

So nothing to install. So no Jupyter-- you don't even need Python. You can just use JavaScript. Helps web developers kind of get up to speed. So that's the answer to that one. Is there an intuitive way to understand positional embeddings? Yes. I'll pause for this. Let's see. I'll answer this question and then move on.

Positional embeddings. Doo-doo-doo-doo-doo-doo-doo. OK, so you probably know that the embeddings we've talked about are positions in a space. I showed another diagram where basically we had elements in some two-dimensional space. I showed it as just two dimensions. Here's your canonical man, woman, king, queen, where king minus man plus woman equals queen.

This is a contrived example, and we've put them in different parts of the space. Now, the problem we have is that-- let's go back to this. In English, "the dog chases the cat" and "the cat chases the dog" have very different meanings. Position matters. So here's another example I use in my class, which is if I take the word "only" and I put it into four different positions, these are four different sentences.

"Only I thanked her for the gift" means nobody else thanked her. "I thanked her only for the gift" means I didn't thank her for anything else. The problem we have is that in English, word order matters. But in math, very often, position does not matter. So 3 plus 2 is the same as 2 plus 3.

So they both equal 5. And so this is one of the hardest things to realize. What the large language model is doing is it's taking a word problem, and it's converting it to numbers, turning it into a number problem. This is a realm where order matters, and this is a realm where order does not matter.

And so the math, everything after the equal sign, cannot see the order of the stuff between them, even though-- I don't know if the spreadsheet came up. Let's see. There it is. Even though you can look in this spreadsheet, and you're like, well, why can't it see it? I can see it in order, just like you can see in order-- let's pull PowerPoint back here-- just like you can see the order between 2 plus 3, and you can see the addition, it can't.

The math can't. So what we need to do is give it a way to understand what the position is. So the way we do that in GPT-2-- note that there's something called RoPE, which does it slightly differently-- is we say that-- let's go back to this diagram I led with.

The woman at position 0 probably means the same thing as woman in the other position. So we're just going to move it slightly so that woman at position x is almost in the same location in the embedding space as woman at position 0. It's just slightly offset. And in general, we're going to just move it slightly in some region so it doesn't move around too much, but stays close to it.

So it can at least tell woman at one position apart from woman at other positions. Inside Attention Is All You Need, which is the original transformer paper, they basically use sine and cosine. If you remember sine and cosine, they're bounded between -1 and 1. So they're basically keeping it on the circle. It's oscillating around this.
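For reference, that sinusoidal scheme looks roughly like this; it's a sketch of the published formula, and again, GPT-2 itself does not use it.

    // Sinusoidal position vector for one position, per Attention Is All You Need:
    // even dimensions get sin(pos / 10000^(i/dModel)), odd dimensions get cos.
    function sinusoidalPosition(pos, dModel) {
      const pe = new Array(dModel);
      for (let i = 0; i < dModel; i += 2) {
        const freq = Math.pow(10000, i / dModel);
        pe[i] = Math.sin(pos / freq);
        pe[i + 1] = Math.cos(pos / freq);   // every value stays within [-1, 1]
      }
      return pe;
    }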

Inside GPT-2, they actually just let it learn the positional embeddings itself. And they are simply just added. So the way this works inside the spreadsheet is here are your token and text embeddings. Sorry, right here. So this is the embedding for the word Mike. This is 768 columns after column 3.

Same here for all these other ones. Each one of these, if you look at this formula, you'll see there's a plus model WPE. That is a set of parameters right here. So the first row-- so this is what a million parameters looks like. It's a bunch of numbers. So for anything that's in the first row, this number gets added to it.

So you can actually go back to the one I showed you right here. And you add negative 0.18 to that value. And the thing that's in the first row gets that added to this position right here. And then the thing that's in the second row basically gets added element-wise-- so the second row gets added to whatever token is in the second position.

The token in the third position gets the third row added to it. And this goes for all 1,024. So there are 1,024 rows here for every single position in the context. OK, I was off by a column. So that math only worked on the second column. Let me keep going so we don't run out of time.
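Before moving on, in code form that row-wise addition is roughly the following; wte and wpe are assumed here to be plain lookup tables of arrays, standing in for the spreadsheet's parameter blocks.

    // Each token's embedding gets the WPE row for its position added element-wise:
    // position 0 gets row 0, position 1 gets row 1, and so on, up to 1,024 rows.
    function embedWithPositions(tokenIds, wte, wpe) {
      return tokenIds.map((id, t) =>
        wte[id].map((value, d) => value + wpe[t][d])
      );
    }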

But I'll take more questions later. OK. OK, let's keep going. OK, let's talk about the results. So they got SOTA, which is state-of-the-art, for seven out of eight language modeling tasks at zero-shot. Actually, they claim it's zero-shot; some of these I'd call few-shot.

But at the time, it was probably good enough to call it zero-shot. So here are the different tasks. And here are the results. I'm going to go through a couple of the notable ones. So one is LAMBADA, which is a task of predicting the final word of a long passage.

And this data set is set up so that you can't predict the final word just by looking at the target sentence or even the previous sentence. You need to go through the entire passage and have some kind of sense of understanding to complete it. So here's one where they've underlined the word "dancing" because that's how far back you have to go to find the word that's the answer.

I like this example from the paper because "camera" is the word to complete the sentence. And it's never even mentioned here. You have to infer they're dealing with a camera. He's like, you just have to click the shutter. So it's really a test of long-passage understanding. In fact, I can't remember exactly, but I believe the acronym stands for something like language modeling broadened to account for discourse aspects.

One thing to know is that GPT-2, when they tested it, was actually not fully doing that great. But it was coming up with completions to keep going for the sentence. And so what they did is they added a stop word filter. And they only let it use words that could end a sentence.

Because it would come up with other likely completions for the sentence, but they would have kept going. And so they would have been the wrong answer. So they basically had to modify slightly the end result in order to get the correct values. When they did that, they got-- then they achieved state-of-the-art results.

This is the children's book test, similar, where you basically have a long passage. And you need to answer a question here. So in this case, she thought that Mr. Blank had exaggerated matters a little. So again, it's a fill in the blank. And then you're given a series of choices.

And then the data set has the right answer. Now, a large language model just completes the end of the sentence. It's not like a BERT model where it could complete somewhere in the middle that's masked out. So the way they set this up, because it's a decoder, is they computed the probability of each one of these choices.

And then those choices, along with the probabilities for the rest of the words that complete the sentence, they added that up into one probability for each choice, and then compared the joint probability of "Baxter had exaggerated matters a little," "Cropper had exaggerated matters a little," and so on. And they picked whichever one of those combinations had the highest probability according to the language model.
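A hedged sketch of that scoring trick: fill each choice into the sentence, score the whole thing with the language model, and keep the highest-probability completion. The sentenceLogProb() helper is an assumption standing in for summing the model's log-probabilities over a sentence's tokens.

    function pickChoice(passage, sentenceWithBlank, choices) {
      let best = null;
      let bestScore = -Infinity;
      for (const choice of choices) {
        // '____' is a hypothetical placeholder marking the blank in the sentence.
        const candidate = passage + ' ' + sentenceWithBlank.replace('____', choice);
        const score = sentenceLogProb(candidate);   // joint probability under the LM
        if (score > bestScore) { bestScore = score; best = choice; }
      }
      return best;
    }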

This is the 1 billion word benchmark. It is the only one of the eight on that table where GPT-2 did not hit state-of-the-art. By the way, one thing I should add: it says they hit seven out of eight. There are other tasks, which we'll talk about in a second, where the model did not hit state-of-the-art.

But in those language modeling tasks in that table, it was seven out of the eight. Their conclusion is that the reason this happened is that the 1 billion word benchmark does a ton of destructive pre-processing on the data. So this is a screenshot from the 1 billion word benchmark.

This is not from the GPT-2 paper. But it describes a bunch of the steps they do to pre-process it. And then the last thing they do is do sentence-level shuffling, which removes the long-range structure. So in some sense, you could argue it's not even a valid test. And so it's not surprising that it didn't do as well on that last benchmark.

Let me go back to those so you can see that. So here's LAMBADA. Here is the children's book test. Here's the 1 billion word benchmark. Some of these are perplexity, so lower is better. So if it's in bold, it's better than the state-of-the-art. That's what you see here. Some of these are accuracy, so higher is better.

So here, you can see in bold where they've achieved higher than state-of-the-art. This is the only one where the state-of-the-art was still out of reach for them. That was the 1 billion word benchmark. Another one they tried was question answering, and they did not achieve state-of-the-art on it. This is the Conversational Question Answering (CoQA) data set.

This is an example from the paper for that data set. And you can see it's, again, a passage and then a series of questions with answers and actually reasoning. So they didn't hit state-of-the-art. But there are two interesting things. One is they matched or exceeded three out of the four baselines without using any of the training data.

The other baselines had actually used the training data. This is, again, the power of pre-training on a large enough data set, which is very surprising, at least was surprising at the time. The other thing that they note, which jumped out to me, is GPT-2 would often use simple heuristics to answer who questions, where it would look for names that were in the preceding passage.

And it would just use that as its heuristic. And to me, that reminds me of, if you're familiar with induction heads, which are heads inside multi-head attention, whose whole job is to, if it sees a passage that says "Harry Potter" like five times, the next time it sees "Harry," it's like, oh, the next likely thing is "Potter." So it's very interesting to see even that kind of sense of something like an induction head inside GPT-2 in these early experiments.

Another thing they tested was summarization. It was tested on news stories from CNN and the Daily Mail. And again, we have kind of early prompt engineering. They induced summarization by appending TL;DR-- too long, didn't read, which is something humans had been doing for a long time now-- to a passage.

And it turns out it starts summarizing. Unfortunately, it was not state-of-the-art. You can see the results here. Here's GPT-2 TL;DR, this row right here. And you can see the state-of-the-art is doing a lot better. But it was still promising. The end result that came out resembled a summary. But it turned out it confused certain details, like the number of cars in a crash or where a logo was placed or things like that.

And unfortunately, it just barely outperformed picking three random sentences from the article, as you can see. That's this line here. So then they wanted to test, well, is TL;DR doing anything at all? So they dropped TL;DR. And you can see GPT-2, without any TL;DR hint, is doing worse. So TL;DR definitely is actually steering the model.

It's actually prompt engineering the model. It's just the model isn't powerful enough. OK, another one they tried was translation. And in this case, they induced translation by few-shot prompting of English and French pairs. Again, we see this early prompt engineering. What was really surprising, even though they didn't achieve state-of-the-art, is that they still beat other baselines.

But the entire 40-gigabyte data set only had 10 megabytes of French data. So they went back, and they found only a little naturally occurring French in there. But they were really surprised. So, quoting the paper: this performance was surprising to us, since we deliberately removed non-English web pages from WebText as a filtering step.

In order to confirm this, we ran a byte-level language detector on WebText, which detected only 10 megabytes of data in the French language, which is approximately 500 times smaller than the monolingual French corpus common in prior unsupervised machine translation research. That's really surprising. You might remember there was an example-- I think this was Gemini or Bard-- where it learned to translate a very esoteric language that has something like only 100 or 1,000 speakers from a very small data set of it. I feel like this kind of parallels that.
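Both of those early prompt-engineering tricks, the TL;DR summarization hint from before and these few-shot translation pairs, boil down to string construction. Here is a hedged sketch of the prompt formats as the paper describes them; the generate() helper and the example sentences are illustrative assumptions.

    // articleText is assumed to hold the news story being summarized.
    const articleText = '...news story text...';

    // Summarization: append the TL;DR hint and let the model keep writing.
    const summaryPrompt = articleText + '\nTL;DR:';

    // Translation: a few "english sentence = french sentence" pairs, then a final
    // "english sentence =" that the model is expected to complete.
    const translationPrompt =
      "I like apples. = J'aime les pommes.\n" +
      "The weather is nice. = Il fait beau.\n" +
      "Where is the library? =";

    // generate() is a hypothetical sampling helper; the "task" is simply
    // whatever the model writes next.
    const summary = generate(summaryPrompt);
    const translation = generate(translationPrompt);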

I feel like this is kind of parallels of that. And then they also tried question answering. So this is an example from that paper that rolled out that data set, where you have a question coming from Wikipedia, and then a long answer, and then a short answer for each of those prompts.

And so they did the short answer. They seeded it with question and answer pairs. Again, this is why Wikipedia was removed from the training data. And they got poor results. The baseline was like 30% to 50%. GPT-2 XL got 4.1%, and GPT-2 Small got less than 1%. They don't actually give us the number.

But it does indicate that size helps. So maybe a sufficiently large model could exceed state of the art. If you want to get better than the baseline of 30% to 50%, just build a large enough model. And that's all you need to do. You don't have to do any other algorithmic improvements.

There's this hilarious little footnote inside the paper, which says that Alec Radford overestimated his skill at random trivia. So if you, I don't know, run into a wild Alec Radford in his natural habitat of San Francisco, do not play dead. Do not back away slowly. Challenge him to random trivia, and you have a better-than-random chance of beating him.

OK, future directions. Really, this was the beginning of scale is all you need, the kind of thing we talked about at the beginning. Given that the WebText models appear to underfit the WebText data set, as they note here in this figure from the paper, it seems that size is improving model performance on many tasks.

We talked about how GPT-2 small didn't do as well as GPT-2 large, and it seems like just increasing the size of the model improves things. So then the question is, does size help even more? And of course, that leads us to, hey, let's put a ton of money on making-- instead of going up by a factor of 10, let's go up by a factor of 100, and let's see if we get a much better model.

And I'm sure you all know the answer. The answer is they did. But that leads us into setting the stage for GPT-3. OK, and that is it. I will take questions in, I think, the 10 minutes we have remaining. And I'll look at the chat, see what we got.

- The chat is a mess. - Oh, boy. Is that a good thing, or is that a sign of a bad or a good-- - Yeah, it means we're engaged and having productive discussions. I feel like-- what's his name? - Leanne had some questions about the sine function, which I mean, I answered from my point of view, but I'm curious if you have-- - What's the question on the sine function?

- Positional encoding is a spherical jitter in the embedding space. - Yes. - Jitter, to me, means randomness, and there's no randomness here. - Well, that's a-- it's an adjustment. I called it an oscillation because it is-- you're right. It is predictable. It's formulaic based on what position you're in.

And so that's-- that might just be a translation issue. I agree. Jitter typically means random, but it is-- I called it an oscillation. So we'll just slightly move it around inside this space. That seems-- Oh, tell me-- - That is jitter. That is a form of jitter. I don't like this explanation.

- Why not? Tell me. - A position embedding is a whole different embedding. You're saying we're moving the position of woman, but no, we're pairing the position of woman with the position having significance y or whatever. Your example with the word order, you're not really changing the position of woman to anti-woman, right?

Are you? - No, we're keeping it roughly in the same embedding space. We're moving it slightly. - There's position space, and then there's the word token embedding. I don't know how to phrase it. - Well, no, in GPT-2 we're not inside something like RoPE, where it happens inside the attention mechanism.

So you are literally using the very same embeddings. - Oh, OK. - So you are inside-- - RoPE. Yeah, I'm-- - You are in-- yeah, you're thinking-- so that's the key difference. So you're not in a separate positional space, right? So here is-- here I've taken happy-- this is happy at position 8, happy at position 1, happy with a capital H.

So it's a different word. Here's glad at position 2. Here's happy 3, happy 4, happy 6. These are all the same happy. Just put a number when I plotted it with PCA. So this is a dimensionality reduction of 768. And these are, you can see, happy at 3, 4, 5, 6, 7, 8.

They're all roughly close to each other. Glad is a whole other word. I just put it at position 2 so we can see what it is. And happy 1. They're in the same embedding space. So this is the same embedding space as, you know-- do-do-do-do-do-do-do. I have a diagram here of it somewhere.

Do-do-do-- well, I did. I have a PCA plot of the same type of thing. There it is. This is glad, happy, happy, capital, joyful, dog, cat, rabbit, right? This is not positional embedding. This is just this stuff put in PCA, two-dimensional from 768. And then I did the same thing, except I did it in positional for just one word to see-- to go through what is the difference.

Where is that? That's this one. So this is-- here, you can see it right here. So happy is the same happy 1, glad. These are just the same things with the positional embeddings added onto them for whatever row. So we're in-- in GPT-2, you're in the same embedding space.

You're just moving them around. That is a crucial difference. That's like one of the crucial differences between modern transformers and GPT-2. The other one that I call out is RMS norm, which, for example, is used in Llama. I think I actually have a slide on the major differences. That might be a good way to close out: GPT-2 versus something like Llama.

So-- - Beautiful. - Yeah. So you've got-- so you can see, this is Llama 3 405B. What does a modern model look like compared to GPT-2? They're both decoder transformers, just at a lot larger size. Same architecture, just more of it. So more layers, the embedding dimensions are larger, the context is larger. The training data and the training cost went up.

We're talking $125 million is the estimate. And then the same pieces, but they're swapped out. So instead of learned absolute positional embeddings, we go to RoPE. Layer norm gets replaced with RMS norm. Multi-head attention gets grouped-query attention. GELU gets replaced with SwiGLU. And then the other key difference is, compared to ChatGPT, we don't have any supervised fine-tuning, any RLHF.

That process isn't there at all. It's just simply a pre-trained model, whereas models today typically go through some form of post-training. So that's a good way to kind of bridge us to what the future, or the present, looks like from today, at least the day of the recording. That's great-- yeah, we need-- I need this, but updated for DeepSeek and-- Yeah, the next time I do my class, I'll probably redo this with Llama and DeepSeek.

So I might add that column. So yeah. Any other questions? That's good discussion. Is this Llama 1? What? Llama 1 or Llama 2? Llama 3 405B, the one from August. OK. Thanks. Yeah, good question. Yeah, there's some-- Let's see what else is in the chat. --in the chat, yeah.

Why would you want pre-layer norm instead of post-layer norm? I see that one. I'm going to go to that one because I do have a slide for that. My-- there's an article that I found that explains this. Oh, really? I'd love to see that. Yeah, it's like some question, some article about what the original transformers got wrong.

I think you have it. Oh, you have the same diagram I pulled up. OK. Oh, yeah. So this is-- but this was-- the key thing is I'm pretty sure this came out after the GPT-2 paper itself. So GPT-2 doesn't cite this paper. They cite a different paper, which does the same thing inside a, I think, a visual or a CNN that had skip connections.

And they propose, hey-- and so I think the obvious thing was like, OK, it works for vision. Maybe it'll also work for language to go to a pre-layer norm transformer. And it worked. And then I think maybe around the same time, or-- I don't know how close these are.

Well, June-- it's about a year later, right? They actually show some benefits and improvements for pre-layer norm. There is a trade-off, which I've forgotten what it is. But this is the paper. You can find it. There's the arXiv number. And you can Google the title. But that's a quick answer there for a pointer on why you'd want to do that.

Send me-- drop that link for that article. I'd be curious to see it. Yeah. Which-- the one about walks on a hypersphere. Is it this paper? Or this is another paper about the benefits of layer norm that I like as well. Let's see. Somebody click the link. Oh, I remember this one now.

OK, yes. But what's really fascinating to me about this, for all these model architecture, is we keep finding afterwards, like, oh, yeah, this is what it's doing. This paper came out after GPT-- after we did that change. And so-- and we keep finding ones. This is one of my favorite ones, where I would have thought Dropout would have explained self-repair inside these kinds of models.

But this research claimed it was surprisingly layer norm. So it goes to show you how much of this is interestingly empirical, but guided by an intuition from playing with these things. Let me see if there's another chat I can address. Let's see. There is no maximum length angle among positional embeddings is entirely learned by the model. There's a question there of what is the maximum length angle. If you save these, I can reply to these in Discord, swyx.

There's a question there of what is the maximum length angle. If you save these, I can reply to these in Discord, Swix. - Yeah, good. - Let's see. And there's the article. I know we're just at two minutes. Oh, that's the article you were pulling up. OK, great. - It's basically a recycling of the one that you already had.

- Got it. Let's see. I'm trying to go through. Is there anything else you saw that was interesting I should cover in the last 60 seconds? Do I think we've hit the scaling wall? - Different topic. - That's a different topic. But I'll just say people who have said that have learned to rue the day.

Now that we've got test time compute, at least. Let's see. - So my response is that that is kind of moving the goalposts. Like if you want to say that you haven't hit a wall, then OK, scale up GPC 4 to GPC 5. And where is it? So in some sense, we haven't hit it.

We have hit one. And we're just redirecting the attention. - I will just say I know a excellent podcast that held a debate in Vancouver last year in December between two very qualified experts on this very topic. And I would refer you to that. - Yeah, wallguy1. - What?

- The pro wall person. - The pro-- yeah, yeah. Yeah, he did. He did. - OK. Cool. I mean, I can call it. I think we can continue in Discord. Thank you so much, Ishan. That was amazing as always. Actually, not even as always. I think that's too dismissive of a word for what you did.

Your slides are amazing. So thank you. - Oh, thank you. - Yeah, we don't have a paper picked for next week. Again, we can pick one in the Discord if anyone wants to volunteer. You have a high bar set. - Well, thank you for having me. And I look forward to being back at some point in the future.

And I hope-- I want more people to participate. I don't want people to feel like they have to match this. - Yeah, exactly. Yeah. He teaches a course. That's why he has those. - Yes, that's why I have all these slides. This is slides from my class. - Oh, yeah, plug your course.

Where do people sign up? - Oh, they can go to Maven. I didn't want to be too commercial. But if you go to Maven, there's the class. And then you can see what people have said about it. And right now-- so Maven usually is live. I leave it open right now.

People can attend on demand. And then you get the recordings. You get access to the Discord. You get all the quizzes. And then if you want to attend when I do my next live one, you can attend a future live cohort for free. If you have questions about the class, feel free to shoot me a question over Twitter, LinkedIn, Discord, whatever.

But that's where you can find it on Maven. So you have a class on Maven as well, right? - Yeah, but it's more AI engineering, quote unquote. So you treat the language model as a black box, and you go from there. - Yeah, this is-- I should be clear.

I had one guy who signed up thinking this was about using AI with Excel. No, no, this is about how the actual model works. And I use the Excel spreadsheet that implements it. That's this thing. So you can understand every single step. And then I also use this web version as well.

And I walk through so you get a sense of how the entire model works. But you can try this web version. Anyone can go to this page and try it out. I actually see at least one of my former students here. - OK, cool. All right. Thank you so much.

Have a nice day, everyone. - Thanks. - Thank you. - Thank you. Bye.