
GPT Internals Masterclass - with Ishan Anand



00:00:00.000 | >> All right. Let's go.
00:00:04.080 | >> Okay. Did you want to do an intro or should I just go ahead and start?
00:00:07.820 | >> Intro. I was just excited to have Ishan back.
00:00:13.160 | I guess you already previewed last week,
00:00:15.760 | but also you spoke at the World's Fair,
00:00:18.760 | and you're world-famous for Spreadsheets Are All You Need.
00:00:21.420 | It's your thing.
00:00:23.240 | But also, it means that you understand models on a very fundamental level
00:00:28.080 | because you have manually re-implemented them,
00:00:30.860 | and today you decided to tackle GPT-2. So welcome.
00:00:35.120 | >> Yeah. Thank you. So I'm excited
00:00:38.440 | to be presenting Language Models are Unsupervised Multitask Learners.
00:00:42.740 | For context, the other thing swyx had alluded to in previous paper clubs,
00:00:48.000 | is there's going to be paper clubs for more,
00:00:51.260 | not test of time, but something similar,
00:00:53.740 | and this is therefore more evergreen,
00:00:56.060 | more general, more novice audience,
00:00:57.760 | than might be a traditional paper club reading,
00:01:00.280 | where I think it's a lot more sophisticated.
00:01:02.760 | But hopefully, if you're just coming to the field of AI engineering,
00:01:07.600 | this video and some others like it,
00:01:10.040 | and some resources I'll point you to will help you get started
00:01:12.440 | in understanding how LLMs actually work under the hood.
00:01:15.920 | The name of the paper is Language Models are Unsupervised Multitask Learners.
00:01:20.400 | It is not actually officially called the GPT-2 paper,
00:01:23.160 | but that's how everyone refers to it.
00:01:25.160 | You'll notice a couple of big names here.
00:01:28.440 | You might recognize Dario,
00:01:29.880 | you might recognize Ilya.
00:01:31.700 | If you're in the community, you'll recognize Alec.
00:01:34.680 | A lot of these people went on to continue to do
00:01:37.640 | great and amazing things in the community.
00:01:40.880 | I am, as swyx noted, Ishan Anand.
00:01:43.880 | You can find me here at my homepage.
00:01:45.840 | I'm probably best known in the AI community for Spreadsheets Are All You Need,
00:01:50.040 | that is an implementation of GPT-2 entirely in Excel.
00:01:53.360 | I teach a class on Maven that's basically seven to eight hours long,
00:01:57.440 | where we go through every single part of that spreadsheet.
00:02:00.600 | For people who have actually minimal AI background,
00:02:04.760 | it's a great first class in AI.
00:02:07.040 | I'm an AI consultant and educator, and really excited to give you
00:02:09.880 | the abbreviated version of that and the GPT-2 paper today.
00:02:13.280 | Let's get started. Here's what we're going to talk about.
00:02:17.840 | We're going to talk about why should you even pay attention to GPT-2.
00:02:21.360 | Strangely enough, I get this question from my class.
00:02:24.480 | People are like, "Oh, I saw that it was GPT-2."
00:02:26.640 | They think, "Oh, that's got to be out of date."
00:02:28.440 | We should talk about why that's important and why you should pay attention.
00:02:31.400 | Then we'll talk about the dataset.
00:02:33.260 | Then we'll talk about actually the results.
00:02:35.280 | This slide is backwards.
00:02:36.800 | Then we'll talk about the model architecture.
00:02:38.600 | No, sorry. We'll talk about the model architecture, then the results.
00:02:41.000 | Then we'll talk about the future directions
00:02:43.360 | because we know what the future is going to hold, but they didn't.
00:02:46.440 | We'll talk about it as if we didn't really know.
00:02:49.000 | I don't know if Angad is here.
00:02:51.560 | If he is, he can jump on or let me know.
00:02:54.400 | I don't know if I'll see it in the chat.
00:02:55.560 | But he led a really great paper club about,
00:02:59.440 | I think it was eight or nine months ago on the original GPT-1 paper.
00:03:03.480 | I highly recommend checking that out.
00:03:06.020 | Also partially because, spoiler alert,
00:03:08.840 | the GPT-2 paper doesn't talk a lot about the model architecture.
00:03:12.680 | There's a limit to what you'll learn about the model and model building from the GPT-2 paper.
00:03:18.720 | It's actually GPT-1 just scaled up.
00:03:21.880 | >> Angad is actually here.
00:03:23.840 | >> Oh, he is. Oh, Angad,
00:03:25.120 | did you want to just jump in and say anything about this?
00:03:27.280 | This is your slide.
00:03:28.960 | >> Yeah. Can you guys hear me?
00:03:32.760 | >> Yeah.
00:03:33.040 | >> Yes.
00:03:34.120 | >> Cool. Thank you for mentioning my overview of GPT-1.
00:03:40.120 | As you said, GPT-1 is the precursor to GPT-2.
00:03:43.680 | The architecture is roughly almost exactly the same.
00:03:46.920 | There are going to be some little differences
00:03:49.080 | that I think you're going to present in today's discussion.
00:03:51.800 | But I think everyone should at least give GPT-1 a read or
00:03:56.480 | try to see how they actually achieved or settled on the transformer architecture,
00:04:02.680 | and also their training objective of language modeling at the next token prediction.
00:04:08.280 | We've got some cool resources for you guys.
00:04:11.960 | We have the official paper from the OpenAI team.
00:04:15.960 | We also have a blog post,
00:04:17.920 | basically a write-up about GPT-1 from 2024,
00:04:22.200 | so a retrospective look back at GPT-1.
00:04:26.280 | Also, we have, as you mentioned,
00:04:28.000 | the paper club episode about GPT-1.
00:04:31.480 | It is recorded and it is on YouTube.
00:04:34.000 | Please make sure you at least give it a watch or something like that.
00:04:38.400 | It's going to be an immense resource for you guys. Back to you.
00:04:44.360 | >> Okay. Thank you. I see a request for the slides,
00:04:47.960 | which is a great question.
00:04:49.360 | Let me do that right now because
00:04:51.600 | my pet peeve is people who withhold the slides from you.
00:04:54.720 | Let me do this. Anyone at the link,
00:04:57.320 | you can go ahead and comment,
00:04:58.920 | copy link, and then I will drop it in the chat.
00:05:03.640 | Let's see. There we go.
00:05:07.800 | You should have it in the Zoom chat and I will drop it,
00:05:09.840 | or somebody can drop it please in the Discord as well for the paper club.
00:05:15.040 | Thank you. Okay. Thank you.
00:05:17.960 | Great. Whoops. We're not going to auto-play the video. There we go.
00:05:22.280 | So let's get started. So first off,
00:05:24.920 | why GPT-2 matters?
00:05:27.800 | Well, the first thing is it was one of
00:05:31.560 | the first cases where we saw one model is really all you need.
00:05:35.640 | We had a single model that solved multiple tasks,
00:05:39.280 | and here's the key, without
00:05:40.640 | any supervised training on any of those tasks.
00:05:44.040 | It had one simple objective,
00:05:45.240 | which was to predict the next word,
00:05:46.760 | and that let it learn how to do many different tasks.
00:05:49.680 | This seems obvious today because we're basically six years later.
00:05:54.660 | But at the time, it was not obvious.
00:05:56.720 | GPT-1, for example,
00:05:58.640 | was pre-trained to predict the next word.
00:06:01.400 | Then what they did is they stuck
00:06:03.680 | different structured output configurations on top of it,
00:06:07.000 | and fine-tuned it on each of these tasks like classification,
00:06:11.320 | similarity, multiple choice.
00:06:13.720 | So actually I have,
00:06:15.440 | this is right from the GPT-1 paper right here.
00:06:19.400 | So this is from GPT-1, that was the setup.
00:06:21.880 | So we had one model that was pre-trained,
00:06:24.800 | and then fine-tuned, and then had
00:06:27.000 | structured output set for every single task.
00:06:29.520 | So it still was not like ChatGPT where you could just talk to it,
00:06:32.400 | and GPT-2 still quite wasn't there.
00:06:34.440 | But you could start getting prompt engineering to get the right result.
00:06:37.840 | By contrast, GPT-2 was pre-trained again on predicting the next word,
00:06:42.560 | but then you just gave it task-specific prompts,
00:06:45.560 | few or zero-shot prompts,
00:06:47.240 | as we'll see in the results section,
00:06:48.800 | to get the desired output that you wanted.
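A rough sketch of that difference, with hypothetical names and assuming a pretrained decoder `backbone` that returns hidden states (illustrative only, not code from either paper):

```python
import torch.nn as nn

# GPT-1 style: bolt a task-specific head onto the pretrained decoder and fine-tune it per task.
class SentimentClassifier(nn.Module):
    def __init__(self, backbone, d_model=768, num_classes=2):
        super().__init__()
        self.backbone = backbone                     # pretrained transformer decoder (assumed)
        self.head = nn.Linear(d_model, num_classes)  # new structured output for this one task

    def forward(self, token_ids):
        hidden = self.backbone(token_ids)            # (batch, seq_len, d_model)
        return self.head(hidden[:, -1, :])           # classify from the final position

# GPT-2 style: no new head, no task-specific fine-tuning; the task lives in the prompt.
prompt = "Review: The movie was wonderful.\nSentiment:"
# Feed `prompt` to the unchanged language model and read off its next-token prediction.
```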
00:06:50.920 | A useful and interesting compare-and-contrast is Google's MultiModel.
00:06:56.960 | You'll recognize these names: Noam Shazeer,
00:07:00.480 | Aidan Gomez, Kaiser, Vaswani,
00:07:03.240 | a lot of the same people from Attention Is All You Need.
00:07:05.720 | These guys know how to name a paper, I'll say.
00:07:08.440 | One Model to Learn Them All,
00:07:09.560 | obviously, Lord of the Rings reference.
00:07:11.360 | Here it is. It's a multi-task model.
00:07:13.320 | It's even multi-modal,
00:07:14.920 | and believe it or not, it's also a mixture of experts.
00:07:17.760 | It can take an image,
00:07:20.240 | it can caption it, it can categorize it,
00:07:22.040 | it can do translation.
00:07:23.400 | This is way back in 2017.
00:07:25.280 | But the key thing is that it was supervised fine-tuned or
00:07:28.880 | supervised tuned for each task,
00:07:31.360 | although it was done jointly all in the same model,
00:07:33.360 | and they had task-specific architectural components for each task.
00:07:37.160 | It was not the same just predict the next word.
00:07:40.640 | It was, I'm going to give you
00:07:41.960 | these datasets for each of these different tasks,
00:07:44.120 | but I'm just doing it in the same model across them.
00:07:47.080 | The key hypothesis of the paper then is that,
00:07:50.480 | as you see here, our speculation is that
00:07:52.600 | a language model with large enough capacity,
00:07:55.000 | so something that's large enough,
00:07:56.480 | will infer and learn to perform tasks demonstrated in
00:08:00.400 | that dataset regardless of how you procured them.
00:08:04.480 | It'll be able to basically learn multiple tasks entirely unsupervised.
00:08:09.320 | This is right from GPT-2 paper,
00:08:11.920 | which I'll pull right up.
00:08:13.560 | I'll go back and forth between the paper.
00:08:16.400 | Where did they have this? There it is.
00:08:19.120 | It's right, our speculation, right there.
00:08:26.320 | It's right here, right in the beginning.
00:08:29.160 | The other key element here is that,
00:08:32.920 | it's the emergence of prompting is all you need,
00:08:35.480 | and the emergence of prompt engineering.
00:08:37.320 | Here, we're just using prompts in order to condition the model.
00:08:40.280 | A follow-on to what we talked about earlier,
00:08:42.320 | we started to see prompt engineering take on a role for
00:08:45.400 | the first time as a way to control a model, where
00:08:47.440 | previously you would have stuck a different head on
00:08:49.360 | top of it and fine-tuned it.
00:08:51.920 | It's also the emergence of scale is all you need.
00:08:55.400 | This multitask capability emerges and improves as the model's
00:09:00.680 | training time, size, and dataset increase.
00:09:03.960 | As you can see here, when we look at GPT-1,
00:09:08.000 | then GPT-2, which basically was
00:09:09.600 | a 10x scale up on the size of parameters,
00:09:12.040 | also in the dataset size,
00:09:13.800 | and then it set the stage later for GPT-3,
00:09:17.240 | which was going to scale it up by 100 times on the number of
00:09:20.760 | parameters with the idea after they saw these results that, hey,
00:09:23.520 | we'll just simply scale up the model
00:09:24.880 | larger and we'll get better results.
00:09:27.040 | The other interesting thing about GPT-2 is it's also
00:09:30.880 | the continuation of this idea that the decoder is all you need.
00:09:34.960 | If you're new to transformers,
00:09:37.040 | transformer has traditionally in the original Vaswani implementation,
00:09:41.240 | had the left half here,
00:09:42.400 | which was called the encoder,
00:09:43.960 | and then the decoder on the right half,
00:09:46.040 | which was in charge of generating the output.
00:09:48.680 | They basically dissected in half and said,
00:09:50.760 | you only need the decoder because all we're going to do
00:09:53.080 | is basically text generation.
00:09:56.480 | They were not obviously the first to do this,
00:09:58.840 | GPT-1 preceded it.
00:10:00.040 | Around the same time as GPT-1 was this other paper,
00:10:04.000 | also from Google, with Noam Shazeer and Kaiser again,
00:10:07.120 | which is generating Wikipedia by summarizing long sequences.
00:10:10.880 | It was another decoder only,
00:10:12.840 | and I'm not sure who was first.
00:10:14.880 | They were both published in 2018.
00:10:17.000 | This one is published in January of 2018.
00:10:20.200 | GPT-1, I think, was middle of the year,
00:10:22.440 | but you never know when these start.
00:10:24.120 | So it's not quite clear.
00:10:27.760 | Moving on, this eventually ended up being,
00:10:32.400 | as we now know today,
00:10:33.400 | the most popular way you would implement a large language model.
00:10:36.360 | And GPT-2 is basically the ancestor of all the major models
00:10:42.600 | you have probably familiar with.
00:10:44.160 | So GPT-4, CLOD, ChatGPT, Cohera is in here,
00:10:51.040 | BARD, now Gemini, LLAMA,
00:10:53.040 | they're all decoder transformer models.
00:10:55.760 | So the key idea is if you understand the GPT-2 architecture,
00:10:59.720 | you're basically 80% of the way to understanding
00:11:02.400 | what a modern large language model looks like.
00:11:06.080 | A lot of the components are still the same.
00:11:08.080 | Maybe they've replaced layer norm with RMS norm and so forth.
00:11:11.640 | There's probably only a few other changes,
00:11:13.880 | but most of the way there it's 80%.
00:11:16.080 | So it was highly influential.
00:11:17.760 | And part of that may be because of a lot of the hype around GPT-2,
00:11:21.080 | but part of it is also that this was the last open-source model
00:11:24.560 | from OpenAI as of this recording.
00:11:26.320 | So a lot of people dug into it and took inspiration from it.
00:11:30.800 | It was also probably one of the first AI models
00:11:33.120 | to break out of the AI bubble.
00:11:35.160 | So this is the famous passage where they prompted GPT-2
00:11:40.680 | to write a fake news article about unicorns
00:11:43.200 | living in the Andes Mountains.
00:11:44.840 | This got a lot and a lot of press in 2019.
00:11:49.680 | In fact, it was one of the reasons,
00:11:51.360 | the risk of misinformation,
00:11:52.960 | that initially the open source release of GPT-2
00:11:56.560 | was only the smallest model.
00:11:57.880 | They didn't release the source or weights for the larger model
00:12:01.480 | until later that year.
00:12:02.760 | I believe it was in November.
00:12:04.840 | And it got a ton of attention.
00:12:06.480 | It was called the AI model too dangerous to release,
00:12:10.080 | which, for better or for worse,
00:12:11.800 | made it break through the AI bubble
00:12:14.080 | and into the public consciousness.
00:12:17.600 | So that's why GPT-2 is all you need, in a sense,
00:12:21.320 | to get started and why it's so important to the field.
00:12:25.080 | Okay, now let's talk about the data set
00:12:27.640 | because the data is a huge part of any AI model,
00:12:31.880 | especially the large language model.
00:12:33.840 | The problem they faced is if we're going to train
00:12:36.600 | an unsupervised large language model,
00:12:40.120 | we need a data set that is sufficiently large,
00:12:43.400 | that is high quality,
00:12:45.160 | and that demonstrates a wide range of tasks
00:12:47.760 | that the model can learn from
00:12:48.760 | because those tasks should be embedded in it.
00:12:50.520 | It should be sufficient enough that it has that wide variety,
00:12:53.400 | even though we're not explicitly going to fine-tune it
00:12:55.360 | on any of these particular tasks.
00:12:57.640 | And the solution they came up with
00:12:59.520 | was to create a new data set using the internet.
00:13:03.800 | Let's just grab text on the internet.
00:13:06.200 | But we also need it to be high quality.
00:13:07.960 | So we're going to use social media for a quality signal.
00:13:11.040 | And then because the web, I say internet here,
00:13:13.000 | I'm really talking about the web,
00:13:14.120 | has a wide range of tasks,
00:13:15.800 | it should be sufficiently large
00:13:17.240 | if we gather enough of it.
00:13:18.520 | It should demonstrate a variety of different tasks
00:13:20.480 | that we can use to test the model.
00:13:23.960 | So they created this data set called WebText.
00:13:27.280 | First, they started by gathering all the outbound links
00:13:29.720 | from Reddit before December 2017.
00:13:33.040 | Then they removed links with less than three karma.
00:13:36.120 | So that was their quality signal.
00:13:37.640 | If it didn't get enough karma,
00:13:39.240 | then it was not a high-quality link.
00:13:40.600 | I want to be really clear.
00:13:41.520 | They didn't actually scrape Reddit and Reddit conversations.
00:13:45.040 | They just used Reddit to rank sites,
00:13:47.160 | kind of like how Google ranks sites through PageRank.
00:13:49.960 | At this point, they realized Reddit karma
00:13:51.400 | might be a good proxy for human quality.
00:13:54.080 | Then they actually removed Wikipedia entries.
00:13:56.680 | And the reason they did this
00:13:57.880 | is some of the tests we're going to talk about later
00:14:00.160 | are actually tests that involve Wikipedia
00:14:02.400 | as part of the data set, as part of the evaluation.
00:14:05.440 | So they wanted to avoid data contamination
00:14:08.480 | and not, you know, train the model on text
00:14:11.280 | that it would later be tested on.
00:14:13.800 | They also, although not shown in this diagram,
00:14:15.760 | removed any non-English text, or at least they tried to.
00:14:19.480 | Turns out some leaked in
00:14:21.000 | and turned into a capability to do translation.
00:14:24.960 | And then they extracted the raw text from the HTML files
00:14:28.280 | using the DragNet or newspaper Python frameworks or libraries
00:14:33.520 | and that got them WebText data set,
00:14:35.400 | which was 8 million documents or 40 gigabytes of data.
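A hedged sketch of that pipeline (the `reddit_outbound_links` input and its fields are hypothetical; the Newspaper library shown is one of the two extractors the paper mentions):

```python
from urllib.parse import urlparse
from newspaper import Article   # one of the extraction libraries mentioned above

def keep_link(link):
    """WebText-style filters: karma as a quality signal, no Wikipedia."""
    if link["karma"] < 3:                            # drop links with fewer than 3 karma
        return False
    if "wikipedia.org" in urlparse(link["url"]).netloc:
        return False                                 # avoid contaminating the evaluations
    return True

def extract_text(url):
    article = Article(url)
    article.download()
    article.parse()
    return article.text                              # raw text, HTML stripped

corpus = [extract_text(l["url"]) for l in reddit_outbound_links if keep_link(l)]
```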
00:14:39.440 | And to put this in perspective,
00:14:41.360 | the GPT-1 model was trained on the Books Corpus,
00:14:44.200 | which is a series of unpublished books
00:14:45.960 | that was about 4.8, roughly five gigabytes in size.
00:14:50.160 | So this is about an order of magnitude more data.
00:14:52.520 | It was pretty large.
00:14:53.560 | It was not the largest data set at the time.
00:14:56.040 | I believe there was a BERT one that was larger,
00:14:58.080 | but it was one of the largest.
00:15:00.000 | And then to put it relative to GPT-3,
00:15:02.120 | I did an estimate that GPT-3 is roughly around
00:15:05.600 | another order of magnitude bigger than this,
00:15:07.800 | that is, GPT-3's data set compared to the GPT-2 WebText data set.
00:15:14.000 | Okay, now let's talk.
00:15:16.080 | Oh, let's see, we got questions.
00:15:20.960 | Let's see, should I pause for questions or just keep going?
00:15:25.680 | - I mean, if anyone has questions, now is a good time.
00:15:27.920 | - Okay.
00:15:29.760 | Sure, I just opened the door for it.
00:15:32.560 | Are there any questions?
00:15:33.800 | I see a keep going.
00:15:35.400 | Looks like people are handling some of the questions in chat.
00:15:38.280 | Okay, let's talk about the architecture of these models.
00:15:43.680 | So the GPT-2 models, I put this in quotes,
00:15:48.000 | were a series of four models.
00:15:50.320 | A couple notes, so I put GPT-1 as a comparison point.
00:15:55.360 | GPT-1 was 117 million parameters.
00:15:57.600 | It had 12 layers, 768 for the embedding dimension
00:16:00.920 | and a 512 context length.
00:16:02.960 | The four models that they create for GPT-2,
00:16:07.440 | when you read the paper, you should note
00:16:09.440 | that they do not refer to them as small, medium, large, XL.
00:16:12.640 | That came after.
00:16:14.040 | Instead, when you read the paper,
00:16:15.440 | and this can be a little confusing,
00:16:17.440 | GPT-2, they reserve as the name for the largest model.
00:16:20.600 | So GPT-2 XL for them means GPT-2.
00:16:24.360 | They're the same, which if you're using
00:16:25.960 | the Hugging Face Transformers library, if you use gpt2
00:16:28.040 | as the bare model name, you just get GPT-2 small.
00:16:31.840 | So to avoid the confusion, just be aware.
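For example, with the Hugging Face Transformers library, the bare name gets you the small model, and the 1.5B model the paper calls GPT-2 is gpt2-xl:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

small = AutoModelForCausalLM.from_pretrained("gpt2")     # 124M parameters, i.e. "GPT-2 small"
xl = AutoModelForCausalLM.from_pretrained("gpt2-xl")     # 1.5B parameters, the paper's "GPT-2"
tokenizer = AutoTokenizer.from_pretrained("gpt2")        # same BPE vocabulary for every size

print(sum(p.numel() for p in small.parameters()))        # roughly 124 million
```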
00:16:35.400 | And then in the text, they refer to all the other small variants
00:16:38.600 | as web text language models.
00:16:40.760 | So these three are called the web text language models,
00:16:43.760 | and then this is what they refer to as GPT-2 in the paper.
00:16:47.080 | So for them, GPT-2 is simply the largest model,
00:16:49.080 | which makes sense in retrospect,
00:16:50.840 | because small really is just a replication of GPT-1.
00:16:55.000 | And one other thing is they even tried to replicate it so much
00:16:58.320 | that they originally reported the size
00:17:00.040 | as 117 million parameters.
00:17:01.800 | Turns out it was 124.
00:17:03.480 | So when you download the weights from, I guess, Azure now,
00:17:06.760 | these two are the same.
00:17:08.040 | They're just simply renamed because there was a typo
00:17:10.600 | in the size of the model.
00:17:12.480 | And you can see, basically, the largest model
00:17:14.280 | compared to the original GPT-1 is 1,600,
00:17:18.840 | so roughly twice as big in the embedding dimensions.
00:17:21.200 | They increased the context length for all of them,
00:17:23.240 | and it has a lot more layers.
00:17:24.640 | So, hence, a much larger model.
00:17:28.600 | Unfortunately-- well, there are a few changes architecturally
00:17:32.120 | from GPT-1 compared to GPT-2.
00:17:35.240 | First is a larger size, which we saw in the previous slide.
00:17:39.040 | They also increased the vocabulary size slightly
00:17:41.640 | from about 40,000 to just over 50,000 tokens.
00:17:45.640 | And this one is actually interesting.
00:17:48.240 | They moved layer norm from post-activation
00:17:50.360 | to pre-activation.
00:17:52.120 | They were inspired by a paper that
00:17:57.960 | did this in an image model and proposed
00:18:00.080 | that pre-activation was better.
00:18:01.840 | It turns out a year or two after this,
00:18:04.240 | somebody did work on language models
00:18:07.100 | and showed that pre-activation was actually
00:18:09.480 | better as well for layer norm and actually
00:18:11.960 | improved training stability in certain cases.
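In rough pseudocode, where ln1, ln2, attn, and mlp stand in for the real sublayers, the move looks like this (GPT-2 also adds one extra layer norm after the final block):

```python
# GPT-1 / original transformer: post-norm (normalize after the residual add)
def post_norm_block(x):
    x = ln1(x + attn(x))
    x = ln2(x + mlp(x))
    return x

# GPT-2: pre-norm (normalize the input of each sublayer instead)
def pre_norm_block(x):
    x = x + attn(ln1(x))
    x = x + mlp(ln2(x))
    return x
```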
00:18:15.960 | Unfortunately, the paper has few details on the actual training.
00:18:19.520 | So we know that the batch size was 512,
00:18:21.560 | and the learning rate was tuned for each model size,
00:18:24.780 | but the exact numbers are not specified.
00:18:27.200 | And this is really interesting because the GPT-3 paper,
00:18:31.280 | for example, went into a lot more detail on this.
00:18:33.960 | In fact, it's, I think, right here near the beginning.
00:18:36.880 | Let's see.
00:18:37.360 | There it is.
00:18:37.880 | You've got a table here on the batch size learning rate.
00:18:40.320 | And I think the appendix actually has the AdamW
00:18:42.840 | parameters as well.
00:18:45.120 | So there isn't a lot of detail on how the model works
00:18:49.600 | and how it was specifically trained.
00:18:51.600 | Thankfully, however, the community,
00:18:53.520 | partially because it was eventually open sourced
00:18:55.560 | and had so much attention on it, came up
00:18:57.480 | with a large number of implementations.
00:19:00.320 | So there's the official OpenAI implementation,
00:19:03.840 | which is right here.
00:19:04.840 | And one thing I like pointing out to people,
00:19:07.560 | if you go into the source and you click on this,
00:19:09.680 | if you're new to AI and machine learning,
00:19:12.900 | you don't realize how small the actual code is
00:19:15.360 | because all the knowledge is in the parameters.
00:19:18.460 | If you add up all the code here and you take out the TensorFlow,
00:19:21.920 | it's basically just 500 lines of code.
00:19:24.440 | It's one of my favorite statistics
00:19:26.000 | to help people understand, yes, you can understand this.
00:19:28.320 | It's only 500 lines of code.
00:19:29.980 | You can grok it if you just spend a week or two on it.
00:19:33.000 | So don't feel like this is magic that you'll never understand.
00:19:38.000 | The most popular way to use it, probably today,
00:19:40.880 | is through Hugging Face Transformers, which
00:19:43.040 | is another implementation of it that uses the same OpenAI
00:19:45.840 | weights that they released publicly.
00:19:49.140 | And then-- whoops, there we go.
00:19:52.520 | This is technically not an implementation,
00:19:54.760 | but a really popular guide to how
00:19:56.480 | the inside of the model works.
00:19:57.880 | I found this an extremely helpful resource
00:19:59.640 | as well, where Jay Alammar, who's now at Cohere,
00:20:04.300 | goes through in detail how every single step of the transformer
00:20:08.600 | works.
00:20:09.080 | He has really great diagrams and illustrations
00:20:12.800 | for how the model works.
00:20:14.720 | Another popular implementation is
00:20:16.920 | minGPT from Andrej Karpathy, which is a PyTorch
00:20:20.480 | reimplementation.
00:20:21.240 | The original version of GPT-2 was in TensorFlow.
00:20:24.880 | So this one's in PyTorch.
00:20:26.600 | You can see it here at GitHub.
00:20:28.280 | And it's also OpenAI weight compatible for GPT-2.
00:20:32.840 | And then he has LLM.c, which implements GPT-2 entirely in C
00:20:37.120 | without any PyTorch for performance.
00:20:40.640 | A lesser known, but I think equally interesting
00:20:44.280 | implementation is Transformer Lens from Neil Nanda.
00:20:48.600 | And I think this helps go to why GPT-2 is
00:20:51.080 | so interesting and important.
00:20:52.600 | A lot of folks in mechanistic interpretability
00:20:55.000 | like to use small models to do experiments.
00:20:58.720 | And Transformer Lens is a tool for running understanding
00:21:03.920 | and interpretability experiments on large language models
00:21:07.480 | that are GPT-2 style.
00:21:09.680 | And in fact, if you see the video
00:21:14.200 | that I did at the AI Engineer World's Fair last year,
00:21:18.800 | I do a version of Golden Gate Claude.
00:21:20.920 | That was thanks to Neel Nanda and his team
00:21:23.840 | who had done sparse autoencoders partially
00:21:27.080 | using a version of this thing called SAE Lens for GPT-2.
00:21:30.600 | And I just basically used one of their vectors
00:21:32.620 | and stuck it in my spreadsheet.
00:21:33.920 | And that was a huge benefit and boon.
00:21:35.840 | But it's a great way to learn how these models actually work
00:21:39.960 | by doing experiments on them.
00:21:42.000 | Another great visualization is this one,
00:21:44.080 | which is Transformer Explainer.
00:21:45.560 | It has some really nice graphics.
00:21:47.440 | You can watch essentially how information propagates
00:21:49.960 | through the network.
00:21:51.920 | Another great visualization is this one,
00:21:55.480 | which is very popular.
00:21:56.560 | It's got nanoGPT, GPT-2 small, and GPT-3.
00:22:00.620 | You can kind of see--
00:22:01.560 | I like this view because you can see how much smaller
00:22:04.760 | nanoGPT is compared to GPT-2 small.
00:22:07.920 | And then here's GPT-2 XL.
00:22:09.560 | It really makes it very visceral in terms of how it feels.
00:22:14.700 | And the one challenge I have with visualizations
00:22:17.560 | is they're fun to look at, but you can't actually go in
00:22:20.320 | and make modifications to them.
00:22:21.640 | You can't build your mental model
00:22:23.080 | by interactively changing things within them.
00:22:25.720 | So there's mine, which is Spreadsheets
00:22:28.200 | Are All You Need, which is an Excel file that
00:22:31.520 | implements all of GPT-2 small entirely in Excel.
00:22:36.200 | Let me see if I can pull that one up.
00:22:37.880 | Oh, wonderful.
00:22:40.640 | It restarted on me because I'm running in parallel as well.
00:22:43.880 | We'll do this right now.
00:22:45.500 | I'll show you the other one.
00:22:46.720 | So that one, you can see there's a video right here
00:22:57.000 | from AI Engineer World's Fair.
00:22:58.280 | And I walk through the spreadsheet version
00:22:59.960 | of this, of GPT-2.
00:23:01.800 | It's a really abbreviated version
00:23:03.160 | of how the model works.
00:23:04.600 | And then the most recent version is this one,
00:23:07.720 | which is GPT-2 entirely in your browser.
00:23:11.160 | So this one's entirely in JavaScript.
00:23:13.980 | And let me walk through it for just 5 or 10 minutes
00:23:18.440 | as kind of an intro to how transformer models work.
00:23:21.200 | Before I do that, I am going to give you
00:23:24.920 | a five-minute introduction on how
00:23:26.960 | to think about a transformer model.
00:23:29.960 | So basically, we have this.
00:23:34.000 | I like this simplified diagram rather
00:23:35.520 | than the canonical diagram.
00:23:37.000 | Basically, you're taking text.
00:23:38.360 | You turn that text into tokens.
00:23:40.320 | You turn those tokens into numbers.
00:23:41.840 | We do some math or number crunching on them.
00:23:44.040 | We turn those numbers into text.
00:23:45.640 | And that becomes our next token.
00:23:47.880 | We translate those numbers back out.
00:23:50.180 | And the way I like to think about this
00:23:51.880 | is tokenization is just representation.
00:23:53.960 | But you have your token and position embeddings.
00:23:56.320 | And this is really a map for words.
00:23:58.720 | We're basically grouping similar words together.
00:24:01.200 | So I like to imagine, say, a two-dimensional map.
00:24:03.800 | But in this case, in the case of GPT-2 small,
00:24:06.160 | it's 768, 1,600 in GPT-2 XL.
00:24:09.840 | So you can imagine happy and glad are sitting here.
00:24:12.640 | And sad's maybe a little close to it, but not quite as close.
00:24:15.520 | And then things that are very different, like dog, cat,
00:24:17.800 | are over here.
00:24:18.800 | And rather than thinking about this long list of numbers
00:24:22.960 | as just some arbitrary list of numbers,
00:24:24.640 | these are points in a space.
00:24:26.440 | Instead of two dimensions, though,
00:24:28.200 | they're now points in a 768-dimensional space
00:24:31.640 | or 1,600-dimensional space.
00:24:33.880 | But what we've done is we've grouped similar words together.
00:24:37.260 | And once we've done that, when we think about it,
00:24:39.300 | similar words should also share the same next word predictions.
00:24:43.480 | So the next word after happy is probably also the next word
00:24:47.000 | after a similar word like glad.
00:24:49.040 | And then that gives us kind of a boost or head start
00:24:52.480 | when we go to a neural network.
00:24:54.320 | Neural networks are really good if you give them
00:24:56.920 | a question and an answer, and they'll
00:24:58.600 | learn to pick out what the answer is.
00:24:59.560 | So you give it photos, and you say which ones are dogs
00:25:01.860 | and which ones are cats.
00:25:03.120 | It'll learn to figure that out.
00:25:04.640 | In this case, we'll give it sentences,
00:25:06.360 | and we'll ask it to complete the next word,
00:25:08.280 | or give it, say, a single word and say,
00:25:09.860 | what is the next word after it?
00:25:11.160 | And it will start learning that.
00:25:12.800 | The only other wrinkle is we have additional hints
00:25:14.960 | we can give it, which is all the hints from all the other words
00:25:17.620 | that came before it.
00:25:18.600 | And really, what the transformer is doing
00:25:20.320 | is it's letting every word look at every other word
00:25:23.220 | to inform what its actual meaning is.
00:25:26.040 | Get a better hint.
00:25:26.880 | Instead of just taking a one word or two or three word
00:25:29.900 | history prediction, it's going to look at all the past words.
00:25:32.640 | And then it refines that prediction over 12 iterations.
00:25:35.240 | That's in the case of GPT-2 small; more in the larger ones.
00:25:38.060 | And then we get a predicted number
00:25:39.440 | back that we just convert back out to a word.
00:25:42.480 | So putting it all together, we get basically this diagram.
00:25:47.080 | So we take a prompt, we split into tokens,
00:25:48.960 | we convert those tokens into numbers,
00:25:50.640 | and then we refine that prediction iteratively,
00:25:52.760 | and we pick the next most likely token.
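That loop is small enough to write out; a minimal greedy-decoding sketch using the publicly released weights via Hugging Face (greedy argmax is a simplification here, the model itself just gives you probabilities):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("Mike is quick. He moves", return_tensors="pt").input_ids
for _ in range(10):                                   # predict 10 tokens, one at a time
    logits = model(ids).logits                        # (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax()                  # most likely next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```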
00:25:55.080 | I've simplified what happens in here.
00:25:56.880 | There's actually-- and you'll see this in the spreadsheet--
00:25:59.480 | there's 16 steps.
00:26:01.360 | And they all are mapped out here.
00:26:03.440 | I can-- let's see if this loaded up.
00:26:05.240 | Oh, wonderful.
00:26:06.560 | Let's go back to the spreadsheet, if it will load up.
00:26:08.800 | Let's see why Excel isn't working.
00:26:13.200 | But that's fine, because I'm going to demonstrate
00:26:15.200 | the web version of this.
00:26:16.280 | So this is the same thing as GPT-2, small,
00:26:19.600 | except running entirely in JavaScript.
00:26:22.200 | And what's exciting about this is you don't
00:26:25.400 | need to have Excel anymore.
00:26:27.160 | You can just come with your browser,
00:26:29.160 | and it will run entirely locally.
00:26:31.440 | And the way it's structured--
00:26:33.640 | let's pull this up here.
00:26:34.720 | Here we go.
00:26:40.400 | It's actually a series of vanilla JavaScript components.
00:26:43.700 | Everything is like a Python notebook.
00:26:45.400 | You've got a cell here that wraps everything.
00:26:48.360 | There's only two types of cells right now.
00:26:50.480 | One is simply like a spreadsheet.
00:26:52.400 | It basically runs a function, and then it
00:26:54.440 | shows the result in a table, as you can see here.
00:26:57.240 | And then the last one is just defining code
00:26:59.160 | in raw JavaScript.
00:27:00.720 | And what's great about this is you can debug the LLM entirely
00:27:04.280 | in your browser.
00:27:05.640 | No PyTorch, nothing else getting in the way.
00:27:08.040 | So to run this, the first thing you want to do
00:27:10.000 | is click this link and download the zip file, which
00:27:12.040 | is going to have all the model parameters.
00:27:13.440 | You'll drag and drop those into here.
00:27:15.360 | It'll basically stick 1.5 gigabytes into your browser's IndexedDB.
00:27:19.300 | And then you're sitting right here.
00:27:20.800 | So the first thing we do is we define matrix operations,
00:27:24.720 | and then in raw JavaScript.
00:27:26.920 | So this is our matrix multiply, also defined in raw JavaScript.
00:27:29.760 | Really simple two-dimensional arrays
00:27:31.640 | are the structure we use for this.
00:27:34.160 | There's a transpose.
00:27:35.160 | There's a last row.
00:27:36.400 | And then you enter your prompt here.
00:27:38.280 | You hit Run, and it'll actually run the model.
00:27:40.280 | So let me show you, though, the debugging capabilities.
00:27:43.280 | So I'm going to do this.
00:27:44.600 | I'm going to take this thing, which
00:27:46.200 | is separating into words.
00:27:48.040 | And I'm going to run it up to here.
00:27:49.620 | And you can see our prompt is "Mike is quick, he moves."
00:27:51.960 | It separates into these words.
00:27:53.480 | But what I can do, without
00:27:55.880 | ever leaving my browser, is this.
00:27:57.560 | I can go here, and I can say, well, you know what?
00:27:59.880 | I want to see what is--
00:28:03.080 | what does matches look like?
00:28:04.280 | What is that array?
00:28:05.080 | Console.log matches.
00:28:09.760 | Hit this.
00:28:10.920 | Now rerun this function.
00:28:12.360 | Oh, there it is.
00:28:13.280 | Right there in my debugger.
00:28:14.680 | I can just see the result. And heck, you know what?
00:28:16.920 | Maybe I really want to just step through this thing.
00:28:19.440 | So I can hit this, put the debugger statement, hit Play,
00:28:25.000 | and boom.
00:28:25.720 | I'm right here inside my DevTools,
00:28:28.320 | and I can debug a large language model
00:28:30.160 | at any layer of abstraction that I want.
00:28:33.240 | So great way to kind of get a handle on what's
00:28:35.440 | happening under the hood.
00:28:37.200 | And I'll take you through a brief view of this,
00:28:40.760 | so let me get rid of these statements.
00:28:42.560 | Redefine that function, and then I'll reset it.
00:28:50.160 | And then if we hit Run, what it's going to do
00:28:53.000 | is it's going to run through each one of these in order.
00:28:55.720 | So the first section is really just defining
00:28:57.720 | basic matrix operations.
00:28:59.640 | The next section is our tokenization.
00:29:01.720 | So here we separate things into words.
00:29:04.120 | Then we actually take those, and we do the BPE algorithm
00:29:07.560 | to turn them into tokens, which we'll get out here.
00:29:10.080 | So if I run to here, I'll see if we-- well,
00:29:12.200 | here's our list of words--
00:29:14.320 | our tokens, rather, and then their token IDs.
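If you want to reproduce that tokenization step outside the browser, the same GPT-2 byte-pair encoding is available in, for example, the tiktoken library (a sketch; the demo's own BPE code should give matching IDs):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")                 # GPT-2's BPE vocabulary (~50k tokens)
ids = enc.encode("Mike is quick, he moves")
print(ids)                                          # the integer token IDs
print([enc.decode([i]) for i in ids])               # the text piece each ID maps back to
```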
00:29:16.840 | And then if we keep going, this will turn the tokens
00:29:19.000 | into embeddings.
00:29:21.040 | So this is a series of steps to do that.
00:29:22.780 | Then finally, we turn them into the positional embedding.
00:29:25.080 | We're basically just walking through this same diagram
00:29:29.520 | in order that I showed you here.
00:29:32.000 | So we tokenize, then we turn it into embeddings,
00:29:34.560 | and then we go inside each of the blocks.
00:29:37.360 | And this is the 16 steps.
00:29:39.240 | These match the same steps in the Excel sheet,
00:29:42.440 | where we'll basically do layer norm, for example.
00:29:46.000 | We'll do multi-head attention.
00:29:48.200 | And we'll do the multi-layer perceptron.
00:29:50.280 | I'm going to do the following.
00:29:51.520 | If I hit Run and turn away, it'll
00:29:53.120 | actually stop running because it's in the browser,
00:29:55.320 | and the browser optimization stops it.
00:29:57.360 | But you can hit Run, go to this page right now,
00:30:00.400 | and then click it, and you can watch it run.
00:30:02.240 | It'll take about a minute to predict the next token.
00:30:04.640 | So that's a quick overview of GPT-2 and some resources
00:30:08.640 | to understand the model in more detail.
00:30:10.280 | Happy at the end if we've got questions
00:30:12.000 | to go into this in more detail, but I
00:30:13.540 | don't want to spend our entire time on that.
00:30:18.840 | That was the demo.
00:30:19.800 | Any questions so far?
00:30:21.000 | Let's see.
00:30:24.620 | I'm going to look in chat.
00:30:27.160 | Oh, boy.
00:30:28.040 | Is there a Python Jupyter version of this?
00:30:30.000 | It's going to be fun when he checks the chat now.
00:30:32.280 | OK, great.
00:30:34.640 | This looks like Jupyter.
00:30:35.680 | Yes, it does.
00:30:37.280 | My drawer looks like Ishan's desktop.
00:30:39.440 | Awesome.
00:30:41.160 | OK, this is definitely fun.
00:30:43.080 | There is a Jupyter version.
00:30:45.120 | Well, the Jupyter version of this
00:30:46.720 | would be minGPT, probably running inside,
00:30:51.120 | or Transformer Lens running inside Jupyter.
00:30:55.200 | All of these other ones are basically Python
00:30:57.080 | implementations.
00:30:58.200 | This one is no Python.
00:30:59.520 | You can just run it right in your browser.
00:31:02.200 | So nothing to install.
00:31:04.360 | So no Jupyter-- you don't even need Python.
00:31:06.200 | You can just use JavaScript.
00:31:07.740 | Helps web developers kind of get up to speed.
00:31:09.620 | So that's the answer to that one.
00:31:10.960 | Is there an intuitive way to understand
00:31:12.580 | positional embeddings?
00:31:15.940 | I'll pause for this.
00:31:21.240 | Let's see.
00:31:22.200 | I'll answer this question and then move on.
00:31:30.120 | Positional embeddings.
00:31:32.520 | Doo-doo-doo-doo-doo-doo-doo.
00:31:33.720 | OK, so you probably know that the embeddings we've talked about
00:31:42.760 | are positions in a space.
00:31:44.840 | I showed another diagram where basically we
00:31:46.760 | had elements in some two-dimensional space.
00:31:51.120 | I showed it as just two dimensions.
00:31:52.800 | Here's your canonical man, woman, king, queen,
00:31:55.720 | where king minus man plus woman equals queen.
00:31:58.040 | This is a contrived example, and we've put them
00:31:59.920 | in different parts of space.
00:32:01.880 | When the problem we have is that--
00:32:07.080 | let's go back to this.
00:32:09.880 | In English, the dog chases the cat,
00:32:14.160 | and the cat chases the dog have very different meanings.
00:32:17.640 | Position matters.
00:32:19.880 | So here's another example I use in my class, which
00:32:22.440 | is if I take the word "only" and I put it
00:32:25.240 | into four different positions, these
00:32:26.960 | are four different sentences.
00:32:29.400 | "Only I thanked her for the gift"
00:32:30.640 | means nobody else thanked her.
00:32:31.760 | "I thanked her only for the gift"
00:32:33.140 | means I didn't thank her for anything else.
00:32:35.240 | The problem we have is that in English, word order matters.
00:32:40.840 | But in math, very often, position does not matter.
00:32:45.760 | So 3 plus 2 is the same as 2 plus 3.
00:32:48.560 | So they both equal 5.
00:32:50.520 | And so this is one of the hardest things to realize.
00:32:53.120 | What the large language model is doing
00:32:55.680 | is it's taking a word problem, and it's
00:32:57.840 | converting it to numbers, turning it
00:32:59.460 | into a number problem.
00:33:01.040 | This is a realm where order matters,
00:33:02.760 | and this is a realm where order does not matter.
00:33:05.200 | And so the math, everything after the equal sign,
00:33:07.760 | cannot see the order of the stuff between them,
00:33:09.960 | even though--
00:33:11.160 | I don't know if the spreadsheet came up.
00:33:13.280 | Let's see.
00:33:14.520 | There it is.
00:33:15.040 | Even though you can look in this spreadsheet,
00:33:17.320 | and you're like, well, why can't it see it?
00:33:19.120 | I can see it in order, just like you can see in order--
00:33:24.120 | let's pull PowerPoint back here--
00:33:27.240 | just like you can see the order between 2 plus 3,
00:33:29.640 | and you can see the addition, it can't.
00:33:31.560 | The math can't.
00:33:32.800 | So what we need to do is give it a way
00:33:34.480 | to understand what the position is.
00:33:36.920 | So the way we do that is we basically say, in GPT-2--
00:33:41.500 | note that there's something called RoPE, which
00:33:43.420 | does it slightly differently.
00:33:44.960 | We say that-- let's go back to this diagram I led with.
00:33:47.880 | The woman at position 0 probably means the same thing
00:33:50.840 | as woman in the other position.
00:33:53.080 | So we're just going to move it slightly
00:33:55.600 | so that woman at position x is almost
00:33:58.240 | in the same location in the embedding space
00:34:00.280 | as woman at position 0.
00:34:01.520 | It's just slightly offset.
00:34:03.320 | And in general, we're going to just move it slightly
00:34:05.880 | in some region so it doesn't move around too much,
00:34:08.000 | but stays close to it.
00:34:08.960 | So it can at least tell woman at different positions
00:34:11.400 | from other positions.
00:34:14.120 | Inside attention is all you need,
00:34:15.620 | which is the original transformer paper.
00:34:17.000 | They basically use the sine and cosine.
00:34:18.480 | If you remember sine and cosine, they're bounded by 1.
00:34:21.200 | So they're basically keeping it in the circle.
00:34:23.120 | It's oscillating around this.
00:34:25.000 | Inside GPT-2, they actually just let
00:34:28.360 | it learn the positional embeddings itself.
00:34:30.240 | And they are simply just added.
00:34:32.320 | So the way this works inside the spreadsheet
00:34:36.480 | is here are your token and text embeddings.
00:34:39.400 | Sorry, right here.
00:34:40.680 | So this is the embedding for the word Mike.
00:34:42.440 | This is 768 columns after column 3.
00:34:45.560 | Same here for all these other ones.
00:34:47.480 | Each one of these, if you look at this formula,
00:34:50.720 | you'll see there's a plus model WPE.
00:34:53.840 | That is a set of parameters right here.
00:35:00.760 | So the first row-- so this is what a million parameters
00:35:03.240 | looks like.
00:35:03.720 | It's a bunch of numbers.
00:35:05.680 | So for anything that's in the first row,
00:35:07.880 | this number gets added to it.
00:35:09.080 | So you can actually go back to the one I showed you right here.
00:35:12.480 | And you add negative 0.18 to that value.
00:35:18.440 | And the thing that's in the first row
00:35:20.240 | gets that added to this position right here.
00:35:25.080 | And then the thing that's in the second row
00:35:27.200 | basically gets, every element-wise,
00:35:29.920 | added to it the token for the next--
00:35:33.080 | so the second row gets added to whatever's
00:35:34.760 | in the second position.
00:35:36.960 | The token in the third position gets the third row added to it.
00:35:39.920 | And this goes for all 1,024.
00:35:42.080 | So there are 1,024 rows here for every single position
00:35:46.800 | in the context.
00:35:47.800 | OK, I was off by a column.
00:35:49.720 | So that math only worked on the second column.
00:35:52.000 | Let me keep going so we don't run out of time.
00:35:53.880 | But I'll take more questions later.
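In code terms, that whole step is two table lookups and an element-wise add; a minimal NumPy sketch, where wte and wpe are the token and position embedding tables (roughly 50257 x 768 and 1024 x 768 for GPT-2 small; random values here just to show the shapes):

```python
import numpy as np

def embed(token_ids, wte, wpe):
    """Token embedding plus learned positional embedding, GPT-2 style."""
    positions = np.arange(len(token_ids))       # 0, 1, 2, ... for each slot in the prompt
    return wte[token_ids] + wpe[positions]      # row i of wpe is added to whatever sits at position i

wte = np.random.randn(50257, 768)               # stand-in token embedding table
wpe = np.random.randn(1024, 768)                # stand-in position embedding table
x = embed([12, 99, 7], wte, wpe)                # -> shape (3, 768)
```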
00:35:59.760 | OK, let's keep going.
00:36:01.800 | OK, let's talk about the results.
00:36:03.840 | So they got SOTA, which is state-of-the-art,
00:36:07.440 | for seven out of eight language modeling tasks at zero shot.
00:36:12.400 | I'd call this, actually-- they claim it's zero shot.
00:36:14.800 | Some of these I'd call as few shot.
00:36:17.040 | But at the time, it was probably good enough to call it zero shot.
00:36:20.240 | So here are the different tasks.
00:36:21.740 | And here are the results.
00:36:23.000 | I'm going to go through a couple of the notable ones.
00:36:25.680 | So one is lambada, which is a task
00:36:28.240 | of predicting the next word of a long passage.
00:36:30.680 | And this data set is set up so that you
00:36:33.080 | can't predict the next word just by looking
00:36:36.280 | at the target sentence or even the last previous sentence.
00:36:39.640 | You need to go through the entire passage
00:36:42.520 | and have some kind of sense of understanding to complete it.
00:36:45.360 | So here's one where it's like they've
00:36:47.720 | underscored or underlined the word "dancing"
00:36:49.920 | because that's how far you have to be to find the word that's
00:36:52.520 | the answer.
00:36:53.520 | I like this example from the paper
00:36:55.200 | because camera is the word to complete the sentence.
00:36:59.020 | And it's never even here.
00:37:00.520 | You have to infer they're dealing with the camera.
00:37:02.600 | He's like, you just have to click the shutter.
00:37:04.520 | So it's really a test of long passage understanding.
00:37:07.840 | In fact, I can't remember the--
00:37:10.000 | I believe something like long is the acronym
00:37:12.320 | for what this stands for.
00:37:14.360 | One thing to know is that GPT-2, when they tested it,
00:37:19.360 | was actually not fully doing that great.
00:37:22.120 | But it was coming up with completions
00:37:23.680 | to keep going for the sentence.
00:37:26.080 | And so what they did is they added a stop word filter.
00:37:30.680 | And they only let it use words that could end a sentence.
00:37:34.720 | Because it would come up with other likely completions
00:37:37.640 | for the sentence, but they would have kept going.
00:37:39.680 | And so they would have been the wrong answer.
00:37:41.600 | So they basically had to modify slightly the end result
00:37:45.440 | in order to get the correct values.
00:37:47.880 | When they did that, they got--
00:37:49.840 | then they achieved state-of-the-art results.
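Mechanically, a filter like that is just a mask on the output distribution before you pick a token; a sketch (the allowed_ids list is a stand-in, since the paper doesn't spell out the exact word list):

```python
import numpy as np

def pick_sentence_ending_token(logits, allowed_ids):
    """Keep only tokens allowed to end a sentence, then take the most likely one."""
    masked = np.full_like(logits, -np.inf)       # rule everything out by default
    masked[allowed_ids] = logits[allowed_ids]    # re-admit the allowed candidates
    return int(np.argmax(masked))

# logits: the final-position scores over the vocabulary
# allowed_ids: token IDs for words that can plausibly end the sentence (a stand-in)
```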
00:37:52.760 | This is the children's book test,
00:37:54.720 | similar, where you basically have a long passage.
00:37:58.600 | And you need to answer a question here.
00:38:02.920 | So in this case, she thought that Mr. Blank
00:38:05.120 | had exaggerated matters a little.
00:38:06.520 | So again, it's a fill in the blank.
00:38:08.560 | And then you're given a series of choices.
00:38:10.520 | And then the data set has the right answer.
00:38:13.200 | Now, a large language model just completes
00:38:15.240 | the end of the sentence.
00:38:16.200 | It's not like a BERT model where it could complete somewhere
00:38:19.880 | in the middle that's masked out.
00:38:21.520 | So the way they set this up, because it's a decoder,
00:38:24.520 | is they computed the probability of each one of these choices.
00:38:28.680 | And then those choices, along with the probabilities
00:38:32.760 | for the rest of the other words to complete the sentence,
00:38:35.280 | they added that up to one probability
00:38:36.900 | for each one of these, and then compared
00:38:39.720 | that joint probability of Baxter had exaggerated matters
00:38:43.440 | a little, Cropper had exaggerated matters a little.
00:38:46.000 | And they picked whichever one of those combinations
00:38:48.480 | had the highest probability according to the language model.
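A sketch of that scoring trick with a causal LM (the helper name and the shortened context are mine, not the paper's): splice each candidate into the blank, run the model, sum the log-probabilities of the tokens, and keep the highest-scoring candidate:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def sequence_logprob(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits
    logprobs = F.log_softmax(logits[0, :-1], dim=-1)       # prediction for each next token
    targets = ids[0, 1:]                                    # the tokens that actually follow
    return logprobs[torch.arange(len(targets)), targets].sum().item()

context = "She thought that Mr. {} had exaggerated matters a little."
choices = ["Baxter", "Cropper"]
best = max(choices, key=lambda c: sequence_logprob(context.format(c)))
```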
00:38:53.160 | This is the 1 billion word benchmark.
00:38:56.560 | It is the only one of the eight on that table
00:38:59.840 | that GPT-2 did not hit state-of-the-art.
00:39:01.720 | By the way, one thing I should add.
00:39:03.820 | It says we hit seven out of eight.
00:39:05.640 | There are other tasks, which we'll talk about in a second,
00:39:08.280 | where the model did not hit state-of-the-art.
00:39:11.120 | But in those language modeling tasks in that table,
00:39:13.400 | it was seven out of the eight.
00:39:14.920 | Their conclusion is that the reason this happened
00:39:18.320 | is that the 1 billion word benchmark
00:39:21.240 | does a ton of destructive pre-processing on the data.
00:39:25.640 | So this is a screenshot from the 1 billion word benchmark.
00:39:29.880 | This is not from the GPT-2 paper.
00:39:31.720 | But it describes a bunch of the steps they do to pre-process the data.
00:39:35.160 | And then the last thing they do is
00:39:36.620 | do sentence-level shuffling, which removes
00:39:39.200 | the long-range structure.
00:39:40.680 | So in some sense, you could argue
00:39:42.040 | it's not even a valid test.
00:39:43.560 | And so it's not surprising that it didn't do as well
00:39:46.320 | on that last benchmark.
00:39:49.080 | Let me go back to those so you can see that.
00:39:51.000 | So here's Lambada.
00:39:52.400 | Here is the children's book test.
00:39:54.400 | Here's the 1 billion word benchmark.
00:39:57.280 | Some of these are perplexity, so lower is better.
00:39:59.360 | So if it's in bold, it's better than the state-of-the-art.
00:40:01.320 | That's what you see here.
00:40:02.720 | Some of these are accuracy, so higher is better.
00:40:04.760 | So here, you can see in bold when they've achieved higher
00:40:06.780 | than state-of-the-art.
00:40:07.660 | This is the only one where the state-of-the-art was still
00:40:10.140 | out of reach for them.
00:40:11.460 | That was the 1 billion word benchmark.
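For reference, the perplexity being reported is just the exponentiated average negative log-likelihood the model assigns to the test tokens, which is why lower is better:

$$ \mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(x_i \mid x_{<i})\right) $$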
00:40:15.800 | Another one they tried was question answering.
00:40:18.300 | And they did not achieve state-of-the-art on it.
00:40:20.300 | This is the conversation question answering data set.
00:40:23.980 | This is an example from the paper for that data set.
00:40:26.820 | And you can see it's, again, a passage and then
00:40:29.020 | a series of questions with answers and actually reasoning.
00:40:33.660 | So they didn't hit state-of-the-art.
00:40:35.460 | But there are two interesting things.
00:40:37.000 | One is they matched or exceeded three out of the four
00:40:39.940 | baselines without using any of the training data.
00:40:42.780 | The other baselines had actually used the training data.
00:40:46.500 | This is, again, the power of pre-training
00:40:48.240 | on a large enough data set, which is very surprising,
00:40:50.960 | at least was surprising at the time.
00:40:52.700 | The other thing that they note, which jumped out to me,
00:40:54.980 | is GPT-2 would often use simple heuristics
00:40:58.580 | to answer who questions, where it would look for names
00:41:03.380 | that were in the preceding passage.
00:41:05.260 | And it would just use that as its heuristic.
00:41:07.400 | And to me, that reminds me of, if you're
00:41:10.380 | familiar with induction heads, which
00:41:11.900 | are heads inside multi-head attention, whose whole job is
00:41:14.260 | to, if it sees a passage that says "Harry Potter" like five
00:41:17.060 | times, the next time it sees "Harry," it's like, oh,
00:41:19.340 | the next likely thing is "Potter."
00:41:21.380 | So it's very interesting to see even that kind of sense
00:41:23.740 | of something like an induction head inside GPT-2
00:41:26.740 | in these early experiments.
00:41:28.700 | Another thing they tested was summarization.
00:41:31.740 | It was tested on news stories from CNN and the Daily Mail.
00:41:35.060 | And again, we have kind of early prompt engineering.
00:41:37.420 | They induced summarization by appending TL;DR--
00:41:41.020 | too long, didn't read, which is something humans
00:41:43.060 | had been doing for a long time now--
00:41:45.300 | to a passage.
00:41:46.460 | And it turns out it starts summarizing.
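Concretely, the prompt engineering here was just a suffix; a sketch (article_text is whatever news story you're summarizing):

```python
article_text = "(a CNN or Daily Mail news story would go here)"
prompt = article_text + "\nTL;DR:"
# Feed `prompt` to GPT-2 and let it keep generating; the continuation tends to read
# like a rough summary. The paper sampled on the order of 100 tokens and kept the
# first few generated sentences as the summary.
```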
00:41:49.320 | Unfortunately, it was not state-of-the-art.
00:41:51.060 | You can see the results here.
00:41:52.540 | Here's GPT-2 TL;DR, this row right here.
00:41:55.980 | And you can see the state-of-the-art is doing
00:41:58.380 | a lot better.
00:41:59.380 | But it was still promising.
00:42:01.540 | The end result that came out resembled a summary.
00:42:03.980 | But it turned out it confused certain details,
00:42:06.420 | like the number of cars in a crash
00:42:09.220 | or where a logo was placed or things like that.
00:42:12.700 | And unfortunately, it just barely outperformed
00:42:14.540 | picking three random sentences from the article,
00:42:16.740 | as you can see.
00:42:17.660 | That's this line here.
00:42:19.380 | So then they wanted to test, well, is TL;DR
00:42:21.180 | doing anything at all?
00:42:22.140 | So they dropped TL;DR.
00:42:23.620 | And you can see GPT-2, without any TL;DR hint, is doing worse.
00:42:27.260 | So TL;DR definitely is actually steering the model.
00:42:30.660 | It's actually prompt engineering the model.
00:42:33.340 | It's just the model isn't powerful enough.
00:42:35.100 | OK, another one they tried was translation.
00:42:40.860 | And in this case, they induced translation
00:42:43.140 | by few-shot prompting of English and French pairs.
00:42:46.580 | Again, we see this early prompt engineering.
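The few-shot conditioning looked roughly like stacked "english sentence = french sentence" pairs followed by one unfinished pair (the example sentences here are mine):

```python
prompt = (
    "good morning = bonjour\n"
    "thank you very much = merci beaucoup\n"
    "where is the train station ="
)
# GPT-2's continuation of the last line is read off as the translation.
```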
00:42:48.940 | What was really surprising, even though they
00:42:52.060 | didn't achieve state-of-the-art, is
00:42:54.060 | that they still beat other baselines.
00:42:56.100 | But the entire 40-gigabyte data set
00:43:00.340 | only had 10 megabytes of French data.
00:43:01.980 | So they went back, and they found a little naturally
00:43:04.180 | occurring French data in there.
00:43:05.620 | But they were really surprised.
00:43:07.620 | So the performance was surprising to us,
00:43:09.260 | since we deliberately removed non-English web pages
00:43:11.500 | from WebText as a filtering step.
00:43:13.260 | In order to confirm this, we ran a byte-level language detector
00:43:15.980 | on WebText, which detected only 10 megabytes of data
00:43:18.540 | in the French language, which is approximately 500 times
00:43:21.420 | smaller than the monolingual French corpus common
00:43:24.220 | in prior unsupervised machine translation research.
00:43:27.540 | That's really surprising.
00:43:29.380 | You might remember there was an example--
00:43:31.740 | I think this was Gemini or BARD--
00:43:33.620 | learn to translate a very esoteric language that
00:43:37.060 | has something like only 100 or 1,000 speakers
00:43:39.540 | by having a very small data set of it.
00:43:44.140 | I feel like this kind of parallels that.
00:43:44.140 | And then they also tried question answering.
00:43:46.540 | So this is an example from the paper
00:43:48.900 | that introduced that data set, where you have a question
00:43:51.300 | coming from Wikipedia, and then a long answer,
00:43:54.980 | and then a short answer for each of those prompts.
00:43:57.940 | And so they did the short answer.
00:43:59.900 | They seeded it with question and answer pairs.
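A rough sketch of that conditioning follows; the "Q:"/"A:" formatting is my guess, since the paper only says the context is seeded with example question-answer pairs before asking the new question.

```python
# Sketch of the question-answering setup: condition on example question/answer
# pairs, then ask a new question and have the model continue with a short answer.
# The example pairs below are placeholders I made up, not the evaluation data.
qa_pairs = [
    ("Who wrote the play Romeo and Juliet?", "William Shakespeare"),
    ("What is the capital of France?", "Paris"),
]
question = "Who developed the theory of general relativity?"

prompt = "".join(f"Q: {q}\nA: {a}\n" for q, a in qa_pairs) + f"Q: {question}\nA:"
print(prompt)
# GPT-2 is then asked to continue after "A:", and the generated answer is scored
# against the data set's short answers.
```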
00:44:01.820 | Again, this is why Wikipedia was removed from the training data.
00:44:06.100 | And they got poor results.
00:44:08.320 | The baseline was like 30% to 50%.
00:44:11.820 | GPT-2 XL got 4.1%, and GPT-2 Small got less than 1%.
00:44:17.420 | They don't actually give us the number.
00:44:19.020 | But it does indicate that size helps.
00:44:22.460 | So maybe a sufficiently large model
00:44:25.060 | could exceed state of the art.
00:44:26.540 | If you want to get better than the baseline of 30% to 50%,
00:44:29.780 | just build a large enough model.
00:44:31.420 | And that's all you need to do.
00:44:32.660 | You don't have to do any other algorithmic improvements.
00:44:36.820 | There's this hilarious little footnote
00:44:40.340 | inside the paper, which says that Alec Radford
00:44:44.940 | overestimated his skill at random trivia.
00:44:47.300 | So if you, I don't know, run into a wild Alec Radford
00:44:51.500 | in his natural habitat of San Francisco, do not play dead.
00:44:54.940 | Do not back away slowly.
00:44:56.100 | Challenge him to random trivia, and you've
00:44:57.900 | got a better-than-random chance of beating him.
00:45:01.900 | OK, future directions.
00:45:05.140 | OK, really, this was the beginning of the "scale
00:45:07.980 | is all you need" kind of thing we talked about earlier
00:45:10.260 | at the beginning.
00:45:11.140 | Given that the models appear to underfit the WebText
00:45:13.940 | data set, as they note here in this figure from the paper,
00:45:18.020 | it seems that size is improving model performance
00:45:20.900 | on many tasks.
00:45:21.620 | We talked about how GPT-2 small didn't do as well as GPT-2
00:45:25.400 | large, and it seems like just increasing
00:45:27.580 | the size of the model improves things.
00:45:29.380 | So then the question is, does size help even more?
00:45:31.500 | And of course, that leads us to, hey, let's
00:45:33.460 | put a ton of money on making--
00:45:35.700 | instead of going up by 10, let's go up by a factor of 100,
00:45:38.460 | and let's see if we'll get a much smarter model.
00:45:40.140 | And I'm sure you all know the answer.
00:45:41.740 | The answer is they did.
00:45:43.220 | But that leads us into setting the stage for GPT-3.
00:45:47.660 | OK, and that is it.
00:45:51.100 | I will take questions in, I think,
00:45:53.180 | the 10 minutes we have remaining.
00:45:54.500 | And I'll look at the chat, see what we got.
00:45:56.300 | - The chat is a mess.
00:46:00.780 | - Oh, boy.
00:46:02.540 | Is that a good thing, or is that a sign of a bad or a good--
00:46:05.020 | - Yeah, it means we're engaged and having productive
00:46:07.500 | discussions.
00:46:08.180 | I feel like-- what's his name?
00:46:12.700 | - Leanne had some questions about the sine function, which
00:46:15.940 | I mean, I answered from my point of view,
00:46:17.700 | but I'm curious if you have--
00:46:18.940 | - What's the question on the sine function?
00:46:21.860 | - Positional encoding is a spherical jitter
00:46:24.660 | in the embedding space.
00:46:26.300 | - Yes.
00:46:26.860 | - Jitter, to me, means randomness,
00:46:28.580 | and there's no randomness here.
00:46:31.540 | - Well, that's a--
00:46:33.420 | it's an adjustment.
00:46:34.940 | I called it an oscillation because it is-- you're right.
00:46:37.820 | It is predictable.
00:46:38.820 | It's formulaic based on what position you're in.
00:46:41.180 | And so that's-- that might just be a translation issue.
00:46:45.020 | I agree.
00:46:45.660 | Jitter typically means random, but it is--
00:46:47.860 | I called it an oscillation.
00:46:49.180 | So we'll just slightly move it around inside this space.
00:46:53.740 | That seems--
00:46:55.620 | Oh, tell me--
00:46:57.300 | - That is jitter.
00:46:58.420 | That is a form of jitter.
00:47:01.220 | I don't like this explanation.
00:47:03.180 | - Why not?
00:47:03.700 | Tell me.
00:47:04.740 | - A position embedding is a whole different embedding.
00:47:07.540 | You're saying we're moving the position of "woman,"
00:47:10.700 | but no, we're pairing the embedding of "woman"
00:47:13.540 | with a position that has significance y or whatever.
00:47:18.660 | Your example with the word order,
00:47:23.860 | you're not really changing the position of woman
00:47:26.340 | to anti-woman, right?
00:47:27.660 | Are you?
00:47:29.980 | - No, we're keeping it roughly in the same embedding space.
00:47:33.700 | We're moving it slightly.
00:47:35.460 | - There's position space, and then there's
00:47:37.820 | the word token embedding.
00:47:40.020 | I don't know how to phrase it.
00:47:41.620 | - Well, no, we're not inside something like RoPE,
00:47:46.460 | where positions are applied inside the attention mechanism, in GPT-2.
00:47:50.100 | So you are literally using the very same embeddings.
00:47:54.860 | - Oh, OK.
00:47:55.860 | - So you are inside--
00:47:56.900 | - [INAUDIBLE] the rope.
00:47:58.020 | Yeah, I'm--
00:47:58.540 | - You are in-- yeah, you're thinking--
00:48:00.180 | so that's the key difference.
00:48:01.700 | So you're not in a separate positional space, right?
00:48:04.580 | So here is-- here I've taken like happy--
00:48:08.460 | this is happy at position 8, happy at position 1,
00:48:11.100 | happy with capital.
00:48:11.940 | So it's a different word.
00:48:13.060 | Here's glad at position 2.
00:48:14.220 | Here's happy 3, happy 4, happy 6.
00:48:16.380 | These are all the same happy.
00:48:17.580 | I just put a number on each when I plotted it with PCA.
00:48:19.580 | So this is a dimensionality reduction from 768 dimensions.
00:48:22.740 | And these are, you can see, happy at 3, 4, 5, 6, 7, 8.
00:48:26.620 | They're all roughly close to each other.
00:48:28.220 | Glad is a whole other word.
00:48:29.540 | I just put it at position 2 so we can see what it is.
00:48:31.740 | And happy 1.
00:48:32.540 | They're in the same embedding space.
00:48:34.900 | So this is the same embedding space as, you know--
00:48:39.740 | do-do-do-do-do-do-do.
00:48:42.140 | I have a diagram here of it somewhere.
00:48:44.740 | Do-do-do-- well, I did.
00:48:49.900 | I have a PCA plot of the same type of thing.
00:48:53.300 | There it is.
00:48:54.540 | This is glad, happy, happy with a capital H, joyful, dog, cat,
00:48:59.300 | rabbit, right?
00:48:59.900 | These are not positional embeddings.
00:49:02.140 | This is just these words put through PCA, reduced from 768 dimensions to two.
00:49:08.660 | And then I did the same thing, except I did it
00:49:10.820 | in positional for just one word to see--
00:49:14.860 | to go through what is the difference.
00:49:16.420 | Where is that?
00:49:17.140 | That's this one.
00:49:17.900 | So this is-- here, you can see it right here.
00:49:23.580 | So happy is the same happy 1, glad.
00:49:25.560 | These are just the same things with the positional embeddings
00:49:28.140 | added onto them for whatever row.
00:49:29.660 | So we're in-- in GPT-2, you're in the same embedding space.
00:49:33.260 | You're just moving them around.
00:49:34.860 | That is a crucial difference.
00:49:36.300 | That's like one of the crucial differences,
00:49:38.660 | probably between modern transformers and GPT-2.
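If you want to reproduce that kind of plot yourself, here is a minimal sketch (my own code; the package choices are assumptions, not anything from the talk): pull GPT-2's token embedding matrix (wte) and learned positional embedding matrix (wpe), add them the way the model does at its input, and project with PCA.

```python
# Sketch of reproducing the "same word at different positions" PCA plot.
# GPT-2's input to the first block is wte[token] + wpe[position], so the
# position just nudges the token vector within the same 768-d embedding space.
import torch
from transformers import GPT2Model, GPT2Tokenizer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

wte = model.wte.weight.detach()  # token embeddings, shape (50257, 768)
wpe = model.wpe.weight.detach()  # learned positional embeddings, shape (1024, 768)

happy_id = tokenizer.encode(" happy")[0]
glad_id = tokenizer.encode(" glad")[0]

points, labels = [], []
for pos in range(1, 9):
    points.append(wte[happy_id] + wpe[pos])
    labels.append(f"happy@{pos}")
points.append(wte[glad_id] + wpe[2])
labels.append("glad@2")

xy = PCA(n_components=2).fit_transform(torch.stack(points).numpy())
plt.scatter(xy[:, 0], xy[:, 1])
for (x, y), lab in zip(xy, labels):
    plt.annotate(lab, (x, y))
plt.show()
```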
00:49:42.300 | The other one that I call out is RMS norm, for example,
00:49:46.740 | which is used in Llama.
00:49:48.780 | I think I actually have a slide on the major differences.
00:49:51.300 | That might be a good way to close out, is GPT-2
00:49:55.940 | versus something like LLAMA.
00:50:01.620 | - Beautiful.
00:50:02.900 | - Yeah.
00:50:03.380 | So you've got-- so you can see, this is Llama 3 405B.
00:50:07.500 | What does a modern model look like compared to GPT-2?
00:50:12.580 | So they're both decoder transformers,
00:50:14.900 | just a lot larger size.
00:50:16.620 | Same architecture, similar-- more size.
00:50:18.580 | So more layers, embedding dimensions are larger,
00:50:20.860 | context is larger.
00:50:22.220 | The training data and the training cost went up.
00:50:25.820 | We're talking $125 million is the estimate.
00:50:28.420 | And then the same pieces, but they're moved.
00:50:30.580 | So instead of learned absolute positional embeddings,
00:50:33.020 | we go to RoPE.
00:50:34.260 | Layer norm gets replaced with RMS norm.
00:50:36.580 | Multi-head attention gets grouped-query attention.
00:50:39.100 | GELU gets replaced with SwiGLU.
00:50:41.140 | And then the other key difference
00:50:43.540 | is, compared to ChatGPT, we don't
00:50:47.180 | have any supervised fine-tuning, any RLHF.
00:50:50.900 | That process isn't there at all.
00:50:52.460 | It's just simply a pre-trained model,
00:50:54.020 | whereas models today typically go
00:50:55.740 | through some form of post-training.
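Since the LayerNorm-to-RMSNorm swap came up in that list, here is a minimal side-by-side sketch (mine, not from either model's codebase): LayerNorm centers on the mean and divides by the standard deviation with a learned gain and bias, while RMSNorm drops the mean-centering and the bias and just rescales by the root-mean-square.

```python
# Minimal sketch of the LayerNorm -> RMSNorm swap mentioned above.
import torch

def layer_norm(x, gain, bias, eps=1e-5):
    # GPT-2 style: center on the mean, scale by the standard deviation, then gain/bias.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps) * gain + bias

def rms_norm(x, gain, eps=1e-6):
    # Llama style: no mean-centering, no bias, just rescale by the root-mean-square.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * gain

d = 8
x = torch.randn(2, d)
print(layer_norm(x, torch.ones(d), torch.zeros(d)))
print(rms_norm(x, torch.ones(d)))
```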
00:50:57.940 | So that's a good way to kind of bridge us
00:51:02.780 | to what the future or the present
00:51:04.620 | looks like from today, at least the day of the recording.
00:51:08.940 | That's a great-- yeah, we need--
00:51:11.340 | I need this, but updated for DeepSeek and--
00:51:14.500 | [LAUGHTER]
00:51:17.540 | Yeah, the next time I do my class,
00:51:19.500 | I'll probably redo this with Llama and DeepSeek.
00:51:22.260 | So I might add that column.
00:51:24.860 | So yeah.
00:51:28.100 | Any other questions?
00:51:29.540 | That's good discussion.
00:51:30.940 | Is this LLAMA 1?
00:51:35.900 | What?
00:51:37.060 | LLAMA 1 or LLAMA 2?
00:51:40.460 | Llama 3 405B, the one from August.
00:51:45.980 | Thanks.
00:51:46.820 | Yeah, good question.
00:51:48.820 | [AUDIO OUT]
00:51:54.220 | Yeah, there's some--
00:51:55.060 | Let's see what else is in the chat.
00:51:56.540 | --in the chat, yeah.
00:51:58.460 | Why would you want pre-layer norm instead
00:52:00.460 | of post-layer norm?
00:52:01.980 | I see that one.
00:52:03.780 | I'm going to go to that one because I
00:52:05.300 | do have a slide for that.
00:52:07.700 | My-- there's an article that I found that explains this.
00:52:13.460 | Oh, really?
00:52:14.060 | I'd love to see that.
00:52:15.860 | Yeah, it's like some question, some article about what
00:52:21.020 | the original transformers got wrong.
00:52:22.940 | I think you have it.
00:52:23.780 | Oh, you have the same diagram I pulled up.
00:52:25.860 | Oh, yeah.
00:52:26.660 | So this is-- but this was--
00:52:28.660 | the key thing is I'm pretty sure this came out
00:52:33.700 | after the GPT-2 paper itself.
00:52:39.180 | So GPT-2 doesn't cite this paper.
00:52:42.460 | They cite a different paper, which
00:52:44.300 | does the same thing inside, I think, a vision model or a CNN
00:52:51.020 | that had skip connections.
00:52:52.580 | And they propose, hey--
00:52:53.660 | and so I think the obvious thing was like, OK,
00:52:55.580 | it works for vision.
00:52:57.620 | Maybe it'll also work for language
00:52:59.100 | to go to a pre-layer norm transformer.
00:53:04.100 | And it worked.
00:53:05.820 | And then I think maybe around the same time, or--
00:53:08.140 | I don't know how close these are.
00:53:09.780 | Well, June-- it's about a year later, right?
00:53:11.820 | They actually show some benefits and improvements
00:53:15.140 | for pre-layer norm.
00:53:16.740 | There is a trade-off, which I've forgotten what it is.
00:53:18.940 | But this is the paper.
00:53:19.860 | You can find it.
00:53:21.580 | There's the archive number.
00:53:23.500 | And you can Google the title.
00:53:24.780 | But that's a quick answer there for a pointer
00:53:26.700 | on why you'd want to do that.
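For reference, here is a schematic sketch (mine, with stand-in sub-layers) of the structural difference: post-LN, as in the original transformer, normalizes after the residual addition, while pre-LN, as in GPT-2, normalizes the input to each sub-layer so the residual path itself stays untouched, which tends to make deep stacks easier to train.

```python
# Schematic sketch of post-layer-norm vs pre-layer-norm transformer blocks.
# `attn` and `mlp` stand in for the real sub-layers.
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Original transformer ordering: sub-layer, add residual, then normalize."""
    def __init__(self, d, attn, mlp):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        x = self.norm1(x + self.attn(x))
        x = self.norm2(x + self.mlp(x))
        return x

class PreLNBlock(nn.Module):
    """GPT-2 ordering: normalize the sub-layer input; the residual stream stays clean."""
    def __init__(self, d, attn, mlp):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x

# Quick shape check with linear layers as stand-ins:
block = PreLNBlock(16, attn=nn.Linear(16, 16), mlp=nn.Linear(16, 16))
print(block(torch.randn(2, 4, 16)).shape)  # torch.Size([2, 4, 16])
```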
00:53:29.500 | Send me-- drop that link for that article.
00:53:31.260 | I'd be curious to see it.
00:53:34.040 | Yeah.
00:53:38.980 | Which-- as walks on a hypersphere.
00:53:43.380 | Is it this paper?
00:53:44.820 | Or is this another paper on the benefits of layer norm
00:53:52.740 | that I like as well.
00:53:54.220 | Let's see.
00:53:54.720 | Somebody click the link.
00:53:56.660 | Oh, I remember this one now.
00:53:58.780 | OK, yes.
00:54:00.380 | But what's really fascinating to me about this,
00:54:02.940 | for all these model architecture changes,
00:54:05.060 | is we keep finding afterwards, like, oh, yeah,
00:54:09.460 | this is what it's doing.
00:54:10.940 | This paper came out after GPT--
00:54:13.280 | after we did that change.
00:54:14.980 | And so-- and we keep finding ones.
00:54:16.700 | This is one of my favorite ones, where
00:54:18.820 | I would have thought Dropout would
00:54:20.300 | have explained self-repair inside these kinds of models.
00:54:24.140 | But this research claimed it was surprisingly layer norm.
00:54:28.540 | So it goes to show you how much of this
00:54:31.060 | is interestingly empirical, but guided
00:54:33.900 | by an intuition of playing with these things.
00:54:37.700 | Let me see if there's another chat I can address.
00:54:39.780 | Let's see.
00:54:46.780 | "There is no maximum length; the angle among positional embeddings
00:54:52.260 | is entirely learned by the model."
00:54:55.220 | There's a question there of what is the maximum length angle.
00:54:59.660 | If you save these, I can reply to these in Discord, Swix.
00:55:04.180 | - Yeah, good.
00:55:05.780 | - Let's see.
00:55:06.340 | And there's the article.
00:55:11.340 | I know we're just at two minutes.
00:55:14.500 | Oh, that's the article you were pulling up.
00:55:16.340 | OK, great.
00:55:17.580 | - It's basically a recycling of the one that you already had.
00:55:20.340 | - Got it.
00:55:22.340 | Let's see.
00:55:22.900 | I'm trying to go through.
00:55:26.820 | Is there anything else you saw that was interesting I should
00:55:28.500 | cover in the last 60 seconds?
00:55:31.940 | Do I think we've hit the scaling wall?
00:55:36.060 | - Different topic.
00:55:37.180 | - That's a different topic.
00:55:38.460 | But I'll just say people who have said that
00:55:40.780 | have learned to rue the day.
00:55:44.300 | Now that we've got test time compute, at least.
00:55:48.260 | Let's see.
00:55:48.780 | - So my response is that that is kind of moving the goalposts.
00:55:56.020 | Like if you want to say that you haven't hit a wall,
00:55:59.620 | then OK, scale up GPT-4 to GPT-5.
00:56:02.180 | And where is it?
00:56:03.900 | So in some sense, we haven't hit it.
00:56:06.180 | We have hit one.
00:56:07.420 | And we're just redirecting the attention.
00:56:11.180 | - I will just say I know an excellent podcast that
00:56:15.380 | held a debate in Vancouver last year in December
00:56:19.620 | between two very qualified experts on this very topic.
00:56:23.780 | And I would refer you to that.
00:56:25.340 | - Yeah, wall guy won.
00:56:27.740 | - What?
00:56:29.340 | - The pro wall person.
00:56:30.660 | - The pro-- yeah, yeah.
00:56:32.140 | Yeah, he did.
00:56:34.460 | He did.
00:56:36.220 | - OK.
00:56:38.180 | Cool.
00:56:38.860 | I mean, I can call it.
00:56:40.100 | I think we can continue in Discord.
00:56:41.620 | Thank you so much, Ishan.
00:56:42.700 | That was amazing as always.
00:56:44.980 | Actually, not even as always.
00:56:46.460 | I think that's too dismissive of a word for what you did.
00:56:51.460 | Your slides are amazing.
00:56:52.500 | So thank you.
00:56:53.140 | - Oh, thank you.
00:56:55.020 | - Yeah, we don't have a paper picked for next week.
00:56:57.500 | Again, we can pick one in the Discord
00:57:00.820 | if anyone wants to volunteer.
00:57:02.580 | You have a high bar set.
00:57:03.580 | - Well, thank you for having me.
00:57:09.740 | And I look forward to being back at some point in the future.
00:57:13.580 | And I hope--
00:57:14.340 | I want more people to participate.
00:57:15.740 | I don't want people to feel like they have to match this.
00:57:17.620 | - Yeah, exactly.
00:57:18.660 | Yeah.
00:57:19.620 | He teaches a course.
00:57:20.500 | That's why he has those.
00:57:21.180 | - Yes, that's why I have all these slides.
00:57:22.940 | This is slides from my class.
00:57:25.660 | - Oh, yeah, plug your course.
00:57:27.060 | Where do people sign up?
00:57:28.100 | - Oh, they can go to Maven.
00:57:31.420 | I didn't want to be too commercial.
00:57:32.900 | But if you go to Maven, there's the class.
00:57:36.460 | And then you can see what people have said about it.
00:57:39.220 | And right now-- so Maven usually is live.
00:57:43.020 | I leave it open right now.
00:57:44.660 | People can attend on demand.
00:57:46.180 | And then you get the recordings.
00:57:48.500 | You get access to the Discord.
00:57:49.820 | You get all the quizzes.
00:57:50.860 | And then if you want to attend when I do my next live one,
00:57:53.700 | you can attend a future live cohort for free.
00:57:56.220 | If you have questions about the class,
00:57:57.780 | feel free to shoot me a question over Twitter, LinkedIn,
00:58:00.340 | Discord, whatever.
00:58:02.020 | But that's where you can find it on Maven.
00:58:03.740 | So you have a class on Maven as well, right?
00:58:08.540 | - Yeah, but it's more AI engineering, quote unquote.
00:58:11.300 | So you treat the language model as a black box,
00:58:13.460 | and you go from there.
00:58:14.660 | - Yeah, this is--
00:58:15.660 | I should be clear.
00:58:16.860 | I had one guy who signed up thinking
00:58:18.420 | this was about using AI with Excel.
00:58:20.940 | No, no, this is about how the actual model works.
00:58:23.180 | And I use the Excel spreadsheet that implements it.
00:58:26.060 | That's this thing.
00:58:27.100 | So you can understand every single step.
00:58:29.420 | And then I also use this web version as well.
00:58:34.740 | And I walk through so you get a sense
00:58:36.500 | of how the entire model works.
00:58:37.740 | But you can try this web version.
00:58:39.100 | Anyone can go to this page and try it out.
00:58:40.860 | I actually see at least one of my former students here.
00:58:43.460 | - OK, cool.
00:58:47.780 | All right.
00:58:48.380 | Thank you so much.
00:58:49.140 | Have a nice day, everyone.
00:58:50.500 | - Thanks.
00:58:51.460 | - Thank you.
00:58:51.980 | - Thank you.