GPT Internals Masterclass - with Ishan Anand

00:00:04.080 |
>> Okay. Did you want to do an intro or should I just go ahead and start? 00:00:07.820 |
>> Intro. I was just excited to have Ishan back. 00:00:18.760 |
and you're world-famous for Spreadsheets Are All You Need. 00:00:23.240 |
But also, it means that you understand models on a very fundamental level 00:00:28.080 |
because you have manually re-implemented them, 00:00:30.860 |
and today you decided to tackle GPT-2. So welcome. 00:00:38.440 |
to be presenting Language Models are Unsupervised Multitask Learners. 00:00:42.740 |
For context, the other thing swyx had alluded to in previous paper clubs, 00:00:57.760 |
than might be a traditional paper club reading, 00:01:02.760 |
But hopefully, if you're just coming to the field of AI engineering, 00:01:10.040 |
and some resources I'll point you to will help you get started 00:01:12.440 |
in understanding how LLMs actually work under the hood. 00:01:15.920 |
The name of the paper is Language Models are Unsupervised Multitask Learners. 00:01:20.400 |
It is not actually officially called the GPT-2 paper, 00:01:31.700 |
If you're in the community, you'll recognize Alec. 00:01:34.680 |
A lot of these people went on to continue to do 00:01:45.840 |
I'm probably best known in the AI community for Spreadsheets Are All You Need, 00:01:50.040 |
which is an implementation of GPT-2 entirely in Excel. 00:01:53.360 |
I teach a class on Maven that's basically seven to eight hours long, 00:01:57.440 |
where we go through every single part of that spreadsheet. 00:02:00.600 |
For people who have actually minimal AI background, 00:02:07.040 |
I'm an AI consultant and educator, and really excited to give you 00:02:09.880 |
the abbreviated version of that and the GPT-2 paper today. 00:02:13.280 |
Let's get started. Here's what we're going to talk about. 00:02:17.840 |
We're going to talk about why should you even pay attention to GPT-2. 00:02:21.360 |
Strangely enough, I get this question from my class. 00:02:24.480 |
People are like, "Oh, I saw that it was GPT-2." 00:02:26.640 |
I thought, "Oh, that's got to be out of date." 00:02:28.440 |
We should talk about why that's important and why you should pay attention. 00:02:36.800 |
Then we'll talk about the model architecture. 00:02:38.600 |
No, sorry. We'll talk about the model architecture, then the results. 00:02:43.360 |
because we know what the future is going to hold, but they didn't. 00:02:46.440 |
We'll talk about it as if we didn't really know. 00:02:59.440 |
I think it was eight or nine months ago on the original GPT-1 paper. 00:03:08.840 |
the GPT-2 paper doesn't talk a lot about the model architecture. 00:03:12.680 |
There's a limit to what you'll learn about the model and model building from the GPT-2 paper. 00:03:25.120 |
did you want to just jump in and say anything about this? 00:03:34.120 |
>> Cool. Thank you for having me give my overview about GPT-1. 00:03:40.120 |
As you said, GPT-1 is the precursor to GPT-2. 00:03:43.680 |
The architecture is almost exactly the same. 00:03:46.920 |
There are going to be some little differences 00:03:49.080 |
that I think you're going to present in today's discussion. 00:03:51.800 |
But I think everyone should at least give GPT-1 a read or 00:03:56.480 |
try to see how they actually achieved or settled on the transformer architecture, 00:04:02.680 |
and also their training objective of language modeling as next-token prediction. 00:04:11.960 |
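For reference, the shared training objective being pointed to here is the standard next-token log-likelihood; this is Equation 1 in the GPT-1 paper, where $k$ is the context window and $\theta$ the model parameters:

$$ L(\theta) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \theta) $$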
We have the official paper from the OpenAI team. 00:04:22.200 |
so it's worth looking back on GPT-1. 00:04:34.000 |
Please make sure you at least give it a read or something like this. 00:04:38.400 |
It's going to be an immense resource for you guys. Back to you. 00:04:44.360 |
>> Okay. Thank you. I see a request for the slides, 00:04:51.600 |
my pet peeve is people who hold back the slides from you. 00:04:58.920 |
copy link, and then I will drop it in the chat. 00:05:07.800 |
You should have it in the Zoom chat and I will drop it, 00:05:09.840 |
or somebody can drop it please in the Discord as well for the paper club. 00:05:17.960 |
Great. Whoops. We're not going to auto-play the video. There we go. 00:05:31.560 |
the first cases where we saw one model is really all you need. 00:05:35.640 |
We had a single model that solved multiple tasks, 00:05:40.640 |
without any supervised training on any of those tasks. 00:05:46.760 |
and that let it learn how to do many different tasks. 00:05:49.680 |
This seems obvious today because we're basically six years later. 00:06:03.680 |
different structured output configurations on top of it, 00:06:07.000 |
and fine-tuned it on each of these tasks like classification, 00:06:15.440 |
this is right from the GPT-1 paper right here. 00:06:29.520 |
So it still was not like ChatGPT where you could just talk to it, 00:06:34.440 |
But you could start getting prompt engineering to get the right result. 00:06:37.840 |
By contrast, GPT-2 was pre-trained again on predicting the next word, 00:06:42.560 |
but then you just gave it task-specific prompts, 00:06:50.920 |
A useful and interesting compare-and-contrast is the Google MultiModel paper. 00:07:03.240 |
a lot of the same people from Attention is All You Need. 00:07:05.720 |
These guys know how to name a paper, I'll say. 00:07:14.920 |
and believe it or not, it's also a mixture of experts. 00:07:25.280 |
But the key thing is that it was supervised fine-tuned or 00:07:31.360 |
although it was done jointly all in the same model, 00:07:33.360 |
and they had task-specific architectural components for each task. 00:07:37.160 |
It was not the same just predict the next word. 00:07:41.960 |
these datasets for each of these different tasks, 00:07:44.120 |
but I'm just doing in the same model across them. 00:07:47.080 |
The key hypothesis of the paper then is that, 00:07:56.480 |
will infer and learn to perform tasks demonstrated in 00:08:00.400 |
that dataset regardless of how you procured them. 00:08:04.480 |
It'll be able to basically learn multiple tasks entirely unsupervised. 00:08:32.920 |
it's the emergence of prompting is all you need, 00:08:37.320 |
Here, we're just using prompts in order to condition the model. 00:08:42.320 |
we started to see prompt engineering take on a role for 00:08:45.400 |
the first time as a way to control a model, where 00:08:47.440 |
previously you would have stuck a different head on 00:08:51.920 |
It's also the emergence of scale is all you need. 00:08:55.400 |
This multitask capability emerges and improves as the models, 00:09:17.240 |
which was going to scale it up by 100 times on the number of 00:09:20.760 |
parameters with the idea after they saw these results that, hey, 00:09:27.040 |
The other interesting thing about GPT-2 is it's also 00:09:30.880 |
the continuation of this idea that the decoder is all you need. 00:09:37.040 |
transformer has traditionally in the original Vaswani implementation, 00:09:46.040 |
which was in charge of generating the output. 00:09:50.760 |
you only need the decoder because all we're going to do 00:09:56.480 |
They were not obviously the first to do this, 00:10:00.040 |
Around the same time as GPT-1 was this other paper, 00:10:04.000 |
also by Google, with Noam Shazeer and Kaiser again, 00:10:07.120 |
which is generating Wikipedia by summarizing long sequences. 00:10:33.400 |
the most popular way you would implement a large language model. 00:10:36.360 |
And GPT-2 is basically the ancestor of all the major models 00:10:55.760 |
So the key idea is if you understand the GPT-2 architecture, 00:10:59.720 |
you're basically 80% of the way to understanding 00:11:02.400 |
what a modern large language model looks like. 00:11:08.080 |
Maybe they've replaced layer norm with RMS norm and so forth. 00:11:17.760 |
And part of that may be because of a lot of the hype around GPT-2, 00:11:21.080 |
but part of that is also the last open source model 00:11:26.320 |
So a lot of people dug into it and took inspiration from it. 00:11:30.800 |
It was also probably one of the first AI models 00:11:35.160 |
So this is the famous passage where they prompted GPT-2 00:11:52.960 |
that initially the open source release of GPT-2 00:11:57.880 |
They didn't release the source or weights for the larger model 00:12:06.480 |
It was called the AI model too dangerous to release, 00:12:17.600 |
So that's why GPT-2 is all you need, in a sense, 00:12:21.320 |
to get started and why it's so important to the field. 00:12:27.640 |
because the data is a huge part of any AI model, 00:12:33.840 |
The problem they faced is if we're going to train 00:12:40.120 |
we need a data set that is sufficiently large, 00:12:50.520 |
It should be sufficient enough that it has that wide variety, 00:12:53.400 |
even though we're not explicitly going to fine-tune it 00:12:59.520 |
was to create a new data set using the internet. 00:13:07.960 |
So we're going to use social media for a quality signal. 00:13:11.040 |
And then because the web, I say internet here, 00:13:18.520 |
It should demonstrate a variety of different tasks 00:13:23.960 |
So they created this data set called WebText. 00:13:27.280 |
First, they started by gathering all the outbound links 00:13:33.040 |
Then they removed links with less than three karma. 00:13:41.520 |
They didn't actually scrape Reddit and Reddit conversations. 00:13:47.160 |
kind of like how Google ranks sites through PageRank. 00:13:54.080 |
Then they actually removed Wikipedia entries. 00:13:57.880 |
is some of the tests we're going to talk about later 00:14:02.400 |
as part of the data set, as part of the evaluation. 00:14:08.480 |
and not putting, you know, training the model on text 00:14:13.800 |
They also, although not shown in this diagram, 00:14:15.760 |
is they also removed any non-English text, or they tried to. 00:14:21.000 |
and turned into a capability to do translation. 00:14:24.960 |
And then they extracted the raw text from the HTML files 00:14:28.280 |
using the Dragnet and Newspaper Python libraries 00:14:35.400 |
which was 8 million documents or 40 gigabytes of data. 00:14:41.360 |
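As a rough sketch of that recipe (not OpenAI's actual pipeline; the reddit_posts schema here is hypothetical), the karma filtering plus Newspaper-style text extraction looks something like this:

```python
# Illustrative approximation of the WebText recipe: keep Reddit outbound links
# with at least 3 karma, drop Wikipedia, and extract article text from the HTML.
from newspaper import Article  # the Newspaper (newspaper3k) library mentioned above

def build_webtext_subset(reddit_posts):
    """reddit_posts: iterable of dicts like {"url": ..., "karma": ...} (hypothetical schema)."""
    documents = []
    for post in reddit_posts:
        if post["karma"] < 3:                   # karma as the social-media quality signal
            continue
        if "wikipedia.org" in post["url"]:      # Wikipedia held out because the evals use it
            continue
        article = Article(post["url"])
        article.download()
        article.parse()                         # pulls the main body text out of the HTML
        if article.text:
            documents.append(article.text)
    return documents
```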
the GPT-1 model was trained on the Books Corpus, 00:14:45.960 |
that was about 4.8, roughly five gigabytes in size. 00:14:50.160 |
So this is about an order of magnitude more data. 00:14:56.040 |
I believe there was a BERT one that was larger, 00:15:05.600 |
another order of magnitude bigger than this. 00:15:07.800 |
Sorry, GPT-3 compared to the GPT-2 data set for WebText. 00:15:20.960 |
Let's see, should I pause for questions or just keep going? 00:15:25.680 |
- I mean, if anyone has questions, now is a good time. 00:15:35.400 |
Looks like people are handling some of the questions in chat. 00:15:38.280 |
Okay, let's talk about the architecture of these models. 00:15:50.320 |
A couple notes, so I put GPT-1 as a comparison point. 00:15:57.600 |
It had 12 layers, 768 for the embedding dimension 00:16:09.440 |
that they do not refer to them as small, medium, large, XL. 00:16:17.440 |
GPT-2, they reserve as the name for the largest model. 00:16:25.960 |
the Hugging Face Transformers, if you use GPT-2 00:16:28.040 |
as the bare model name, you just get GPT-2 small. 00:16:35.400 |
And then in the text, they refer to all the other small variants 00:16:40.760 |
So these three are called the web text language models, 00:16:43.760 |
and then this is what they refer to as GPT-2 in the paper. 00:16:47.080 |
So for them, GPT-2 is simply the largest model, 00:16:50.840 |
because small really is just a replication of GPT-1. 00:16:55.000 |
And one other thing is they even tried to replicate it so much 00:17:03.480 |
So when you download the weights from, I guess, Azure now, 00:17:08.040 |
They're just simply renamed because there was a typo 00:17:12.480 |
And you can see, basically, the largest model 00:17:18.840 |
so roughly twice as big in the embedding dimensions. 00:17:21.200 |
They increased the context length for all of them, 00:17:28.600 |
Unfortunately-- well, there are a few changes architecturally 00:17:35.240 |
First is a larger size, which we saw in the previous slide. 00:18:11.960 |
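For reference, here are the four released sizes as they appear in the published configs (parameter counts are the commonly cited approximations):

```python
# The GPT-2 family, small through XL; "small" matches GPT-1's 12 layers and 768 dims.
GPT2_CONFIGS = {
    "gpt2-small":  {"n_layer": 12, "d_model": 768,  "n_head": 12, "n_ctx": 1024, "params": "124M"},
    "gpt2-medium": {"n_layer": 24, "d_model": 1024, "n_head": 16, "n_ctx": 1024, "params": "355M"},
    "gpt2-large":  {"n_layer": 36, "d_model": 1280, "n_head": 20, "n_ctx": 1024, "params": "774M"},
    "gpt2-xl":     {"n_layer": 48, "d_model": 1600, "n_head": 25, "n_ctx": 1024, "params": "1.5B"},
}
```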
improved training stability in certain cases. 00:18:15.960 |
Unfortunately, the paper has few details on the actual training. 00:18:21.560 |
but the learning rate was tuned for each size model, 00:18:27.200 |
And this is really interesting because the GPT-3 paper, 00:18:31.280 |
for example, went into a lot more detail on this. 00:18:33.960 |
In fact, it's, I think, right here near the beginning. 00:18:37.880 |
You've got a table here on the batch size learning rate. 00:18:40.320 |
And I think the appendix actually has the AdamW 00:18:45.120 |
So there isn't a lot of detail on how the model works 00:18:53.520 |
partially because it was eventually open sourced 00:19:00.320 |
So there's the official OpenAI implementation, 00:19:07.560 |
if you go into the source and you click on this, 00:19:12.900 |
you don't realize how small the actual code is 00:19:15.360 |
because all the knowledge is in the parameters. 00:19:18.460 |
If you add up all the code here and you take out the TensorFlow, 00:19:26.000 |
to help people understand, yes, you can understand this. 00:19:29.980 |
You can grok it if you just spend a week or two on it. 00:19:33.000 |
So don't feel like this is magic that you'll never understand. 00:19:38.000 |
The most popular way to use it, probably today, 00:19:43.040 |
is another implementation of it that uses the same OpenAI 00:19:59.640 |
as well, where Jay Alammar, who's now at Cohere, 00:20:04.300 |
goes through in detail how every single step of the transformer 00:20:09.080 |
He has really great diagrams and illustrations 00:20:16.920 |
minGPT from Andrej Karpathy, which is a PyTorch 00:20:21.240 |
The original version of GPT-2 was in TensorFlow. 00:20:28.280 |
And it's also OpenAI weight compatible for GPT-2. 00:20:32.840 |
And then he has llm.c, which implements GPT-2 entirely in C 00:20:40.640 |
A lesser known, but I think equally interesting 00:20:44.280 |
implementation is TransformerLens from Neel Nanda. 00:20:52.600 |
A lot of folks in mechanistic interpretability 00:20:58.720 |
And TransformerLens is a tool for running understanding 00:21:03.920 |
and interpretability experiments on large language models 00:21:14.200 |
that I did at the AI Engineer World's Fair last year, 00:21:27.080 |
using a version of this thing called SAE Lens for GPT-2. 00:21:30.600 |
And I just basically used one of their vectors 00:21:35.840 |
But it's a great way to learn how these models actually work 00:21:47.440 |
You can watch essentially how information propagates 00:22:01.560 |
I like this view because you can see how much smaller 00:22:09.560 |
It really makes it very visceral in terms of how it feels. 00:22:14.700 |
And the one challenge I have with visualizations 00:22:17.560 |
is they're fun to look at, but you can't actually go in 00:22:23.080 |
by interactively changing things within them. 00:22:28.200 |
Are All You Need, which is an Excel file that 00:22:31.520 |
implements all of GPT-2 small entirely in Excel. 00:22:40.640 |
It restarted on me because I'm running in parallel as well. 00:22:46.720 |
So that one, you can see there's a video right here 00:23:04.600 |
And then the most recent version is this one, 00:23:13.980 |
And let me walk through it for just 5 or 10 minutes 00:23:18.440 |
as kind of an intro to how transformer models work. 00:23:53.960 |
But you have your token and position embeddings. 00:23:58.720 |
We're basically grouping similar words together. 00:24:01.200 |
So I like to imagine, say, a two-dimensional map. 00:24:03.800 |
But in this case, in the case of GPT-2 small, 00:24:09.840 |
So you can imagine happy and glad are sitting here. 00:24:12.640 |
And sad's maybe a little close to it, but not quite as close. 00:24:15.520 |
And then things that are very different, like dog, cat, 00:24:18.800 |
And rather than thinking about this long list of numbers 00:24:28.200 |
they're now points in a 768-dimensional space 00:24:33.880 |
But what we've done is we've grouped similar words together. 00:24:37.260 |
And once we've done that, when we think about it, 00:24:39.300 |
similar words should also share the same next word predictions. 00:24:43.480 |
So the next word after happy is probably also the next word 00:24:49.040 |
And then that gives us kind of a boost or heads up 00:24:54.320 |
Neural networks are really good if you give them 00:24:59.560 |
So you give it photos, and you say which ones are dogs 00:25:12.800 |
The only other wrinkle is we have additional hints 00:25:14.960 |
we can give it, which is all the hints from all the other words 00:25:20.320 |
is it's letting every word look at every other word 00:25:26.880 |
Instead of just taking a one word or two or three word 00:25:29.900 |
history prediction, it's going to look at all the past words. 00:25:32.640 |
And then it refines that prediction over 12 iterations. 00:25:35.240 |
In the case of GPT-2 small; more in the larger ones. 00:25:39.440 |
back that we just convert back out to a word. 00:25:42.480 |
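If you would rather poke at this loop in code than in Excel, here is a minimal sketch using the Hugging Face GPT-2 weights (a tooling assumption on my part; the talk's own implementations are the Excel and JavaScript versions):

```python
# Predict the next token the same way the spreadsheet does: tokenize, run the
# stack of transformer blocks, and read off the most likely next word.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")        # "gpt2" is GPT-2 small
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "Mike is quick, he moves"                       # the prompt used later in the demo
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits                     # embeddings -> 12 blocks -> logits
next_id = int(logits[0, -1].argmax())                    # greedy choice of the next token
print(tokenizer.decode([next_id]))
```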
So putting it all together, we get basically this diagram. 00:25:50.640 |
and then we refine that prediction iteratively, 00:25:56.880 |
There's actually-- and you'll see this in the spreadsheet-- 00:26:06.560 |
Let's go back to the spreadsheet, if it will load up. 00:26:13.200 |
But that's fine, because I'm going to demonstrate 00:26:40.400 |
It's actually a series of vanilla JavaScript components. 00:26:45.400 |
You've got a cell here that wraps everything. 00:26:54.440 |
shows the result in a table, as you can see here. 00:27:00.720 |
And what's great about this is you can debug the LLM entirely 00:27:08.040 |
So to run this, the first thing you want to do 00:27:10.000 |
is click this link and download the zip file, which 00:27:15.360 |
It'll basically stick 1.5 gigabytes into your browser's IndexedDB. 00:27:20.800 |
So the first thing we do is we define matrix operations, 00:27:26.920 |
So this is our matrix multiply, also defined in raw JavaScript. 00:27:38.280 |
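The web companion writes this in raw JavaScript; for readers following along here, the same triple-loop idea in Python is just a few lines (a toy sketch, not the project's actual code):

```python
# Naive matrix multiply: out[i][j] is the dot product of row i of a and column j of b.
def matmul(a, b):
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                out[i][j] += a[i][k] * b[k][j]
    return out

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19.0, 22.0], [43.0, 50.0]]
```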
You hit Run, and it'll actually run the model. 00:27:40.280 |
So let me show you, though, the debugging capabilities. 00:27:49.620 |
And you can see our prompt is "Mike is quick, he moves." 00:27:53.480 |
But what I can do is I can just write from the-- 00:27:57.560 |
I can go here, and I can say, well, you know what? 00:28:14.680 |
I can just see the result. And heck, you know what? 00:28:16.920 |
Maybe I really want to just step through this thing. 00:28:19.440 |
So I can hit this, put the debugger statement, hit Play, 00:28:33.240 |
So great way to kind of get a handle on what's 00:28:37.200 |
And I'll take you through a brief view of this, 00:28:42.560 |
Redefine that function, and then I'll reset it. 00:28:50.160 |
And then if we hit Run, what it's going to do 00:28:53.000 |
is it's going to run through each one of these in order. 00:29:04.120 |
Then we actually take those, and we do the BPE algorithm 00:29:07.560 |
to turn them into tokens, which we'll get out here. 00:29:14.320 |
our tokens, rather, and then their token IDs. 00:29:16.840 |
And then if we keep going, this will turn the tokens 00:29:22.780 |
Then finally, we turn them into the positional embedding. 00:29:25.080 |
We're basically just walking through this same diagram 00:29:32.000 |
So we tokenize, then we turn it into embeddings, 00:29:39.240 |
These match the same steps in the Excel sheet, 00:29:42.440 |
where we'll basically do layer norm, for example. 00:29:53.120 |
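Here are those same first steps sketched against the Hugging Face GPT-2 weights, so you can see the tokenize, embed, position-embed, and layer-norm stages outside the spreadsheet (again a tooling assumption; the demo itself is in JavaScript):

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tok = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2Model.from_pretrained("gpt2").eval()

ids = tok("Mike is quick, he moves", return_tensors="pt").input_ids   # BPE -> token IDs
tok_emb = gpt2.wte(ids)                                  # token embeddings (768-dim each)
pos_emb = gpt2.wpe(torch.arange(ids.shape[1]))           # learned position embeddings, row i for position i
x = tok_emb + pos_emb                                    # input to the first transformer block
x = gpt2.h[0].ln_1(x)                                    # the first layer norm inside block 0
print(x.shape)                                           # (1, number of tokens, 768)
```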
actually stop running because it's in the browser, 00:29:57.360 |
But you can hit Run, go to this page right now, 00:30:02.240 |
It'll take about a minute to predict the next token. 00:30:04.640 |
So that's a quick overview of GPT-2 and some resources 00:30:30.000 |
It's going to be fun when he checks the chat now. 00:31:07.740 |
Helps web developers kind of get up to speed. 00:31:33.720 |
OK, so you know that probably embeddings we've talked about 00:31:52.800 |
Here's your canonical man, woman, king, queen, 00:31:55.720 |
where king minus man plus woman equals queen. 00:31:58.040 |
This is a contrived example, and we've put them 00:32:14.160 |
and the cat chases the dog have very different meanings. 00:32:19.880 |
So here's another example I use in my class, which 00:32:35.240 |
The problem we have is that in English, word order matters. 00:32:40.840 |
But in math, very often, position does not matter. 00:32:50.520 |
And so this is one of the hardest things to realize. 00:33:02.760 |
and this is a realm where order does not matter. 00:33:05.200 |
And so the math, everything after the equal sign, 00:33:07.760 |
cannot see the order of the stuff between them, 00:33:15.040 |
Even though you can look in this spreadsheet, 00:33:19.120 |
I can see it in order, just like you can see in order-- 00:33:27.240 |
just like you can see the order between 2 plus 3, 00:33:36.920 |
So the way we do that is we basically say, in GPT-2-- 00:33:41.500 |
note that there's something called RoPE, which 00:33:44.960 |
We say that-- let's go back to this diagram I led with. 00:33:47.880 |
The woman at position 0 probably means the same thing 00:34:03.320 |
And in general, we're going to just move it slightly 00:34:05.880 |
in some region so it doesn't move around too much, 00:34:08.960 |
So it can at least tell woman at different positions 00:34:18.480 |
If you remember sine and cosine, they limit to 1. 00:34:21.200 |
So they're basically keeping it in the circle. 00:34:47.480 |
Each one of these, if you look at this formula, 00:35:00.760 |
So the first row-- so this is what a million parameters 00:35:09.080 |
So you can actually go back to the one I showed you right here. 00:35:36.960 |
The token in the third position gets the third row added to it. 00:35:42.080 |
So there are 1,024 rows here for every single position 00:35:49.720 |
So that math only worked on the second column. 00:35:52.000 |
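A toy NumPy illustration of that lookup table, just to make the "row i gets added to the token at position i" point concrete (random numbers stand in for the learned weights):

```python
import numpy as np

n_ctx, d_model = 1024, 768                       # GPT-2 small: 1,024 positions x 768 dims (~0.8M parameters)
wpe = np.random.randn(n_ctx, d_model) * 0.01     # stand-in for the learned position table

token_embeddings = np.random.randn(5, d_model)   # pretend we already embedded 5 tokens
positions = np.arange(5)
x = token_embeddings + wpe[positions]            # the token at position i gets row i added to it
print(x.shape)                                   # (5, 768)
```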
Let me keep going so we don't run out of time. 00:36:07.440 |
for seven out of eight language modeling tasks at zero shot. 00:36:12.400 |
I'd call this, actually-- they claim it's zero shot. 00:36:17.040 |
But at the time, it was probably good enough to call it zero shot. 00:36:23.000 |
I'm going to go through a couple of the notable ones. 00:36:28.240 |
of predicting the next word of a long passage. 00:36:36.280 |
at the target sentence or even the last previous sentence. 00:36:42.520 |
and have some kind of sense of understanding to complete it. 00:36:49.920 |
because that's how far you have to be to find the word that's 00:36:55.200 |
because camera is the word to complete the sentence. 00:37:00.520 |
You have to infer they're dealing with the camera. 00:37:02.600 |
He's like, you just have to click the shutter. 00:37:04.520 |
So it's really a test of long passage understanding. 00:37:14.360 |
One thing to know is that GPT-2, when they tested it, 00:37:26.080 |
And so what they did is they added a stop word filter. 00:37:30.680 |
And they only let it use words that could end a sentence. 00:37:34.720 |
Because it would come up with other likely completions 00:37:37.640 |
for the sentence, but they would have kept going. 00:37:39.680 |
And so they would have been the wrong answer. 00:37:41.600 |
So they basically had to modify slightly the end result 00:37:54.720 |
similar, where you basically have a long passage. 00:38:16.200 |
It's not like a BERT model where it could complete somewhere 00:38:21.520 |
So the way they set this up, because it's a decoder, 00:38:24.520 |
is they computed the probability of each one of these choices. 00:38:28.680 |
And then those choices, along with the probabilities 00:38:32.760 |
for the rest of the other words to complete the sentence, 00:38:39.720 |
that joint probability of "Baxter had exaggerated matters 00:38:43.440 |
a little" versus "Cropper had exaggerated matters a little." 00:38:46.000 |
And they picked whichever one of those combinations 00:38:48.480 |
had the highest probability according to the language model. 00:38:56.560 |
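A hedged sketch of that scoring trick (using the Hugging Face weights for convenience): score each candidate completion by the model's total log-probability of the sentence and keep the winner.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)      # predictions for tokens 2..n
    targets = ids[:, 1:].unsqueeze(-1)
    return logprobs.gather(-1, targets).sum().item()          # sum of log P(token | prefix)

choices = ["Baxter", "Cropper"]                                # the two names from the example
best = max(choices, key=lambda c: sentence_logprob(f"{c} had exaggerated matters a little."))
print(best)
```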
It is the only one of the eight on that table 00:39:05.640 |
There are other tasks, which we'll talk about in a second, 00:39:08.280 |
where the model did not hit state-of-the-art. 00:39:11.120 |
But in those language modeling tasks in that table, 00:39:14.920 |
Their conclusion is that the reason this happened 00:39:21.240 |
does a ton of destructive pre-processing on the data. 00:39:25.640 |
So this is a screenshot from the 1 billion word benchmark. 00:39:31.720 |
But it describes a bunch of the steps they do to pre-process 00:39:43.560 |
And so it's not surprising that it didn't do as well 00:39:57.280 |
Some of these are perplexity, so lower is better. 00:39:59.360 |
So if it's in bold, it's better than the state-of-the-art. 00:40:02.720 |
Some of these are accuracy, so higher is better. 00:40:04.760 |
So here, you can see in bold when they've achieved higher 00:40:07.660 |
This is the only one where the state-of-the-art was still 00:40:15.800 |
Another one they tried was question answering. 00:40:18.300 |
And they did not achieve state-of-the-art on it. 00:40:20.300 |
This is the conversation question answering data set. 00:40:23.980 |
This is an example from the paper for that data set. 00:40:26.820 |
And you can see it's, again, a passage and then 00:40:29.020 |
a series of questions with answers and actually reasoning. 00:40:37.000 |
One is they matched or exceeded three out of the four 00:40:39.940 |
baselines without using any of the training data. 00:40:42.780 |
The other baselines had actually used the training data. 00:40:48.240 |
on a large enough data set, which is very surprising, 00:40:52.700 |
The other thing that they note, which jumped out to me, 00:40:58.580 |
to answer who questions, where it would look for names 00:41:11.900 |
are heads inside multi-head attention, whose whole job is 00:41:14.260 |
to, if it sees a passage that says "Harry Potter" like five 00:41:17.060 |
times, the next time it sees "Harry," it's like, oh, 00:41:21.380 |
So it's very interesting to see even that kind of sense 00:41:23.740 |
of something like an induction head inside GPT-2 00:41:31.740 |
It was tested on news stories from CNN and the Daily Mail. 00:41:35.060 |
And again, we have kind of early prompt engineering. 00:41:37.420 |
They induced summarization by appending TL;DR-- 00:41:41.020 |
too long, didn't read, which is something humans 00:41:55.980 |
And you can see the state-of-the-art is doing 00:42:01.540 |
The end result that came out resembled a summary. 00:42:03.980 |
But it turned out it confused certain details, 00:42:09.220 |
or where a logo was placed or things like that. 00:42:12.700 |
And unfortunately, it just barely outperformed 00:42:14.540 |
picking three random sentences from the article, 00:42:23.620 |
And you can see GPT-2, without any TL;DR hint, is doing worse. 00:42:27.260 |
So TL;DR definitely is actually steering the model. 00:42:43.140 |
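In code, the TL;DR trick is just a prompt suffix; here is a toy sketch (the article string is a placeholder, and the top-k of 2 follows the sampling setup the paper describes for summarization):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")        # GPT-2 small for illustration
article = "..."                                              # a long news article would go here
prompt = article + "\nTL;DR:"
out = generator(prompt, max_new_tokens=100, do_sample=True, top_k=2)[0]["generated_text"]
print(out[len(prompt):])                                     # the continuation is treated as the summary
```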
by few-shot prompting of English and French pairs. 00:43:01.980 |
So they went back, and they found a few naturally 00:43:09.260 |
since we deliberately removed non-English web pages 00:43:13.260 |
In order to confirm this, we ran a byte-level language detector 00:43:15.980 |
on WebText, which detected only 10 megabytes of data 00:43:18.540 |
in the French language, which is approximately 500 times 00:43:21.420 |
smaller than the monolingual French corpus common 00:43:24.220 |
in prior unsupervised machine translation research. 00:43:33.620 |
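The translation prompt in the paper is just a few example pairs of the form "english sentence = french sentence" followed by a final "english sentence ="; the example pairs below are my own, hypothetical ones:

```python
examples = [
    ("the cat sat on the mat", "le chat s'est assis sur le tapis"),
    ("I like coffee", "j'aime le café"),
]
prompt = "\n".join(f"{en} = {fr}" for en, fr in examples)
prompt += "\nwhere is the train station ="                   # the model is then sampled after the "="
print(prompt)
```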
learn to translate a very esoteric language that 00:43:37.060 |
has something like only 100 or 1,000 speakers 00:43:41.220 |
I feel like this is kind of parallels of that. 00:43:48.900 |
that rolled out that data set, where you have a question 00:43:51.300 |
coming from Wikipedia, and then a long answer, 00:43:54.980 |
and then a short answer for each of those prompts. 00:43:59.900 |
They seeded it with question and answer pairs. 00:44:01.820 |
Again, this is why Wikipedia was removed from the training data. 00:44:11.820 |
GPT-2 XL got 4.1%, and GPT-2 Small got less than 1%. 00:44:26.540 |
If you want to get better than the baseline of 30% to 50%, 00:44:32.660 |
You don't have to do any other algorithmic improvements. 00:44:40.340 |
inside the paper, which says that Alec Radford 00:44:47.300 |
So if you, I don't know, run into a wild Alec Radford 00:44:51.500 |
in his natural habitat of San Francisco, do not play dead. 00:45:07.980 |
is all you need, kind of things we talked about earlier 00:45:11.140 |
Given that the WebText language models appear to underfit the WebText 00:45:13.940 |
data set, as they note here in this figure from the paper, 00:45:18.020 |
it seems that size is improving model performance 00:45:21.620 |
We talked about how GPT-2 small didn't do as well as GPT-2 00:45:29.380 |
So then the question is, does size help even more? 00:45:35.700 |
instead of going up by 10, let's go up by a factor of 100, 00:45:38.460 |
and let's see if we'll get a much better model. 00:45:43.220 |
But that leads us into setting the stage for GPT-3. 00:46:02.540 |
Is that a good thing, or is that a sign of a bad or a good-- 00:46:05.020 |
- Yeah, it means we're engaged and having productive 00:46:12.700 |
- Leanne had some questions about the sine function, which 00:46:34.940 |
I called it an oscillation because it is-- you're right. 00:46:38.820 |
It's formulaic based on what position you're in. 00:46:41.180 |
And so that's-- that might just be a translation issue. 00:46:49.180 |
So we'll just slightly move it around inside this space. 00:47:04.740 |
- The position embedding is a whole different embedding. 00:47:07.540 |
You're saying we're moving the position of woman, 00:47:13.540 |
with the position to having significance y or whatever. 00:47:23.860 |
you're not really changing the position of woman 00:47:29.980 |
- No, we're keeping it roughly in the same embedding space. 00:47:46.460 |
where we're inside the attention mechanism in GPT-2. 00:47:50.100 |
So you are literally using the very same embeddings. 00:48:01.700 |
So you're not in a separate positional space, right? 00:48:08.460 |
this is happy at position 8, happy at position 1, 00:48:17.580 |
I just put a number on it when I plotted it with PCA. 00:48:17.580 |
So this is a dimensionality reduction from 768 dimensions down to 2. 00:48:19.580 |
And these are, you can see, happy at 3, 4, 5, 6, 7, 8. 00:48:29.540 |
I just put it at position 2 so we can see what it is. 00:48:34.900 |
So this is the same embedding space as, you know-- 00:48:54.540 |
This is glad, happy, Happy with a capital H, joyful, dog, cat, 00:48:54.540 |
This is just this stuff put in PCA, two-dimensional from 768. 00:49:08.660 |
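The plot being described is easy to reproduce approximately; here is a sketch with scikit-learn (my tooling choice, not necessarily the one used for the slide) that projects a few GPT-2 token embeddings from 768 dimensions down to 2:

```python
import torch
from sklearn.decomposition import PCA
from transformers import GPT2Tokenizer, GPT2Model

tok = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2Model.from_pretrained("gpt2").eval()

words = [" glad", " happy", " joyful", " dog", " cat"]   # leading spaces matter in GPT-2's BPE
ids = [tok(w).input_ids[0] for w in words]               # take the first token of each word
vectors = gpt2.wte.weight[ids].detach().numpy()          # look up the 768-dim embeddings

points = PCA(n_components=2).fit_transform(vectors)      # 768 dims -> 2 dims for plotting
for word, (x, y) in zip(words, points):
    print(word, round(float(x), 3), round(float(y), 3))
```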
And then I did the same thing, except I did it 00:49:17.900 |
So this is-- here, you can see it right here. 00:49:25.560 |
These are just the same things with the positional embeddings 00:49:29.660 |
So we're in-- in GPT-2, you're in the same embedding space. 00:49:38.660 |
probably between modern transformers and GPT-2. 00:49:42.300 |
The other one that I call out is RMS norm, for example, 00:49:48.780 |
I think I actually have a slide on the major differences. 00:49:51.300 |
That might be a good way to close out, is GPT-2 00:50:03.380 |
So you've got-- so you can see, this is Llama 405B. 00:50:07.500 |
What does a modern model look like compared to GPT-2? 00:50:18.580 |
So more layers, embedding dimensions are larger, 00:50:22.220 |
The training data and the training cost went up. 00:50:30.580 |
So instead of learned absolute positional embeddings, 00:51:04.620 |
looks like from today, at least the day of the recording. 00:51:19.500 |
I'll probably redo this with Llama and DeepSeek. 00:52:07.700 |
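Since RMS norm keeps coming up as one of the main differences, here is a quick side-by-side sketch: GPT-2's LayerNorm subtracts the mean and divides by the standard deviation, then scales and shifts, while RMSNorm skips the mean subtraction and the shift.

```python
import torch

def layer_norm(x, gain, bias, eps=1e-5):
    mean = x.mean(-1, keepdim=True)
    var = x.var(-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps) * gain + bias   # normalize, then scale and shift

def rms_norm(x, gain, eps=1e-5):
    rms = torch.sqrt(x.pow(2).mean(-1, keepdim=True) + eps)   # no mean subtraction
    return x / rms * gain                                     # scale only, no shift
```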
My-- there's an article that I found that explains this. 00:52:15.860 |
Yeah, it's like some question, some article about what 00:52:28.660 |
the key thing is I'm pretty sure this came out 00:52:44.300 |
does the same thing inside a, I think, a visual or a CNN 00:52:53.660 |
and so I think the obvious thing was like, OK, 00:53:05.820 |
And then I think maybe around the same time, or-- 00:53:11.820 |
They actually show some benefits and improvements 00:53:16.740 |
There is a trade-off, which I've forgotten what it is. 00:53:24.780 |
But that's a quick answer there for a pointer 00:53:44.820 |
Or this is another paper of the benefits of layer norm 00:54:00.380 |
But what's really fascinating to me about this, 00:54:05.060 |
is we keep finding afterwards, like, oh, yeah, 00:54:20.300 |
have explained self-repair inside these kinds of models. 00:54:24.140 |
But this research claimed it was surprisingly layer norm. 00:54:33.900 |
by an intuition of playing with these things. 00:54:37.700 |
Let me see if there's another chat I can address. 00:54:46.780 |
There is no maximum length angle among positional embeddings 00:54:55.220 |
There's a question there of what is the maximum length angle. 00:54:59.660 |
If you save these, I can reply to these in Discord, swyx. 00:54:59.660 |
- It's basically a recycling of the one that you already had. 00:55:26.820 |
Is there anything else you saw that was interesting I should 00:55:44.300 |
Now that we've got test time compute, at least. 00:55:48.780 |
- So my response is that that is kind of moving the goalposts. 00:55:56.020 |
Like if you want to say that you haven't hit a wall, 00:56:11.180 |
- I will just say I know an excellent podcast that 00:56:11.180 |
held a debate in Vancouver last year in December 00:56:19.620 |
between two very qualified experts on this very topic. 00:56:46.460 |
I think that's too dismissive of a word for what you did. 00:56:55.020 |
- Yeah, we don't have a paper picked for next week. 00:57:09.740 |
And I look forward to being back at some point in the future. 00:57:15.740 |
I don't want people to feel like they have to match this. 00:57:36.460 |
And then you can see what people have said about it. 00:57:50.860 |
And then if you want to attend when I do my next live one, 00:57:53.700 |
you can attend a future live cohort for free. 00:57:57.780 |
feel free to shoot me a question over Twitter, LinkedIn, 00:58:08.540 |
- Yeah, but it's more AI engineering, quote unquote. 00:58:11.300 |
So you treat the language model as a black box, 00:58:20.940 |
No, no, this is about how the actual model works. 00:58:23.180 |
And I use the Excel spreadsheet that implements it. 00:58:29.420 |
And then I also use this web version as well. 00:58:40.860 |
I actually see at least one of my former students here.