GPT Internals Masterclass - with Ishan Anand

00:00:04.080 |
>> Okay. Did you want to do an intro or should I just go ahead and start? 00:00:07.820 |
>> Intro. I was just excited to have Ishan back. 00:00:18.760 |
and you're world-famous for Spreadsheets Are All You Need. 00:00:23.240 |
But also, it means that you understand models on a very fundamental level 00:00:28.080 |
because you have manually re-implemented them, 00:00:30.860 |
and today you decided to tackle GPT-2. So welcome. 00:00:38.440 |
to be presenting Language Models are Unsupervised Multitask Learners. 00:00:42.740 |
For context, the other thing swyx had alluded to in previous paper clubs, 00:00:57.760 |
than might be a traditional paper club reading, 00:01:02.760 |
But hopefully, if you're just coming to the field of AI engineering, 00:01:10.040 |
and some resources I'll point you to will help you get started 00:01:12.440 |
in understanding how LLMs actually work under the hood. 00:01:15.920 |
The name of the paper is Language Models are Unsupervised Multitask Learners. 00:01:20.400 |
It is not actually officially called the GPT-2 paper, 00:01:31.700 |
If you're in the community, you'll recognize Alec. 00:01:34.680 |
A lot of these people went on to continue to do 00:01:45.840 |
I'm probably best known in the AI community for Spreadsheets Are All You Need, 00:01:50.040 |
which is an implementation of GPT-2 entirely in Excel. 00:01:53.360 |
I teach a class on Maven that's basically seven to eight hours long, 00:01:57.440 |
where we go through every single part of that spreadsheet. 00:02:00.600 |
For people who have actually minimal AI background, 00:02:07.040 |
I'm an AI consultant and educator, and really excited to give you 00:02:09.880 |
the abbreviated version of that and the GPT-2 paper today. 00:02:13.280 |
Let's get started. Here's what we're going to talk about. 00:02:17.840 |
We're going to talk about why should you even pay attention to GPT-2. 00:02:21.360 |
Strangely enough, I get this question from my class. 00:02:24.480 |
People are like, "Oh, I saw that it was GPT-2." 00:02:26.640 |
I thought, "Oh, that's got to be out of date." 00:02:28.440 |
We should talk about why that's important and why you should pay attention. 00:02:36.800 |
Then we'll talk about the model architecture. 00:02:38.600 |
No, sorry. We'll talk about the model architecture, then the results. 00:02:43.360 |
because we know what the future is going to hold, but they didn't. 00:02:46.440 |
We'll talk about it as if we didn't really know. 00:02:59.440 |
I think it was eight or nine months ago on the original GPT-1 paper. 00:03:08.840 |
the GPT-2 paper doesn't talk a lot about the model architecture. 00:03:12.680 |
There's a limit to what you'll learn about the model and model building from the GPT-2 paper. 00:03:25.120 |
did you want to just jump in and say anything about this? 00:03:34.120 |
>> Cool. Thank you for having me give my overview about GPT-1. 00:03:40.120 |
As you said, GPT-1 is the precursor to GPT-2. 00:03:43.680 |
The architecture is almost exactly the same. 00:03:46.920 |
There are going to be some little differences 00:03:49.080 |
that I think you're going to present in today's discussion. 00:03:51.800 |
But I think everyone should at least give GPT-1 a read or 00:03:56.480 |
try to see how they actually achieved or settled on the transformer architecture, 00:04:02.680 |
and also their training objective of language modeling as next-token prediction. 00:04:11.960 |
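For reference, the shared training objective being pointed to here is the standard next-token log-likelihood; this is Equation 1 in the GPT-1 paper, where $k$ is the context window and $\theta$ the model parameters:

$$ L(\theta) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \theta) $$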
We have the official paper from the OpenAI team. 00:04:22.200 |
so it's worth looking back on GPT-1. 00:04:34.000 |
Please make sure you at least give it a read or something like this. 00:04:38.400 |
It's going to be an immense resource for you guys. Back to you. 00:04:44.360 |
>> Okay. Thank you. I see a request for the slides, 00:04:51.600 |
my pet peeve is people who hold back the slides from you. 00:04:58.920 |
copy link, and then I will drop it in the chat. 00:05:07.800 |
You should have it in the Zoom chat and I will drop it, 00:05:09.840 |
or somebody can drop it please in the Discord as well for the paper club. 00:05:17.960 |
Great. Whoops. We're not going to auto-play the video. There we go. 00:05:31.560 |
the first cases where we saw one model is really all you need. 00:05:35.640 |
We had a single model that solved multiple tasks, 00:05:40.640 |
without any supervised training on any of those tasks. 00:05:46.760 |
and that let it learn how to do many different tasks. 00:05:49.680 |
This seems obvious today because we're basically six years later. 00:06:03.680 |
different structured output configurations on top of it, 00:06:07.000 |
and fine-tuned it on each of these tasks like classification, 00:06:15.440 |
this is right from the GPT-1 paper right here. 00:06:29.520 |
So it still was not like ChatGPT where you could just talk to it, 00:06:34.440 |
But you could start getting prompt engineering to get the right result. 00:06:37.840 |
By contrast, GPT-2 was pre-trained again on predicting the next word, 00:06:42.560 |
but then you just gave it task-specific prompts, 00:06:50.920 |
A useful and interesting compare-and-contrast is the Google MultiModel paper. 00:07:03.240 |
a lot of the same people from Attention is All You Need. 00:07:05.720 |
These guys know how to name a paper, I'll say. 00:07:14.920 |
and believe it or not, it's also a mixture of experts. 00:07:25.280 |
But the key thing is that it was supervised fine-tuned or 00:07:31.360 |
although it was done jointly all in the same model, 00:07:33.360 |
and they had task-specific architectural components for each task. 00:07:37.160 |
It was not the same just predict the next word. 00:07:41.960 |
these datasets for each of these different tasks, 00:07:44.120 |
but I'm just doing in the same model across them. 00:07:47.080 |
The key hypothesis of the paper then is that, 00:07:56.480 |
will infer and learn to perform tasks demonstrated in 00:08:00.400 |
that dataset regardless of how you procured them. 00:08:04.480 |
It'll be able to basically learn multiple tasks entirely unsupervised. 00:08:32.920 |
it's the emergence of prompting is all you need, 00:08:37.320 |
Here, we're just using prompts in order to condition the model. 00:08:42.320 |
we started to see prompt engineering take on a role for 00:08:45.400 |
the first time as a way to control a model, where 00:08:47.440 |
previously you would have stuck a different head on 00:08:51.920 |
It's also the emergence of scale is all you need. 00:08:55.400 |
This multitask capability emerges and improves as the models, 00:09:17.240 |
which was going to scale it up by 100 times on the number of 00:09:20.760 |
parameters with the idea after they saw these results that, hey, 00:09:27.040 |
The other interesting thing about GPT-2 is it's also 00:09:30.880 |
the continuation of this idea that the decoder is all you need. 00:09:37.040 |
transformer has traditionally in the original Vaswani implementation, 00:09:46.040 |
which was in charge of generating the output. 00:09:50.760 |
you only need the decoder because all we're going to do 00:09:56.480 |
They were not obviously the first to do this, 00:10:00.040 |
Around the same time as GPT-1 was this other paper, 00:10:04.000 |
also by Google, with Noam Shazeer and Kaiser again, 00:10:07.120 |
which is generating Wikipedia by summarizing long sequences. 00:10:33.400 |
the most popular way you would implement a large language model. 00:10:36.360 |
And GPT-2 is basically the ancestor of all the major models 00:10:55.760 |
So the key idea is if you understand the GPT-2 architecture, 00:10:59.720 |
you're basically 80% of the way to understanding 00:11:02.400 |
what a modern large language model looks like. 00:11:08.080 |
Maybe they've replaced layer norm with RMS norm and so forth. 00:11:17.760 |
And part of that may be because of a lot of the hype around GPT-2, 00:11:21.080 |
but part of that is also the last open source model 00:11:26.320 |
So a lot of people dug into it and took inspiration from it. 00:11:30.800 |
It was also probably one of the first AI models 00:11:35.160 |
So this is the famous passage where they prompted GPT-2 00:11:52.960 |
that initially the open source release of GPT-2 00:11:57.880 |
They didn't release the source or weights for the larger model 00:12:06.480 |
It was called the AI model too dangerous to release, 00:12:17.600 |
So that's why GPT-2 is all you need, in a sense, 00:12:21.320 |
to get started and why it's so important to the field. 00:12:27.640 |
because the data is a huge part of any AI model, 00:12:33.840 |
The problem they faced is if we're going to train 00:12:40.120 |
we need a data set that is sufficiently large, 00:12:50.520 |
It should be sufficient enough that it has that wide variety, 00:12:53.400 |
even though we're not explicitly going to fine-tune it 00:12:59.520 |
was to create a new data set using the internet. 00:13:07.960 |
So we're going to use social media for a quality signal. 00:13:11.040 |
And then because the web, I say internet here, 00:13:18.520 |
It should demonstrate a variety of different tasks 00:13:23.960 |
So they created this data set called WebText. 00:13:27.280 |
First, they started by gathering all the outbound links 00:13:33.040 |
Then they removed links with less than three karma. 00:13:41.520 |
They didn't actually scrape Reddit and Reddit conversations. 00:13:47.160 |
kind of like how Google ranks sites through PageRank. 00:13:54.080 |
Then they actually removed Wikipedia entries. 00:13:57.880 |
is some of the tests we're going to talk about later 00:14:02.400 |
as part of the data set, as part of the evaluation. 00:14:08.480 |
and not putting, you know, training the model on text 00:14:13.800 |
They also, although not shown in this diagram, 00:14:15.760 |
is they also removed any non-English text, or they tried to. 00:14:21.000 |
and turned into a capability to do translation. 00:14:24.960 |
And then they extracted the raw text from the HTML files 00:14:28.280 |
using the Dragnet and Newspaper Python libraries 00:14:35.400 |
which was 8 million documents or 40 gigabytes of data. 00:14:41.360 |
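As a rough sketch of that recipe (not OpenAI's actual pipeline; the reddit_posts schema here is hypothetical), the karma filtering plus Newspaper-style text extraction looks something like this:

```python
# Illustrative approximation of the WebText recipe: keep Reddit outbound links
# with at least 3 karma, drop Wikipedia, and extract article text from the HTML.
from newspaper import Article  # the Newspaper (newspaper3k) library mentioned above

def build_webtext_subset(reddit_posts):
    """reddit_posts: iterable of dicts like {"url": ..., "karma": ...} (hypothetical schema)."""
    documents = []
    for post in reddit_posts:
        if post["karma"] < 3:                   # karma as the social-media quality signal
            continue
        if "wikipedia.org" in post["url"]:      # Wikipedia held out because the evals use it
            continue
        article = Article(post["url"])
        article.download()
        article.parse()                         # pulls the main body text out of the HTML
        if article.text:
            documents.append(article.text)
    return documents
```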
the GPT-1 model was trained on the Books Corpus, 00:14:45.960 |
that was about 4.8, roughly five gigabytes in size. 00:14:50.160 |
So this is about an order of magnitude more data. 00:14:56.040 |
I believe there was a BERT one that was larger, 00:15:05.600 |
another order of magnitude bigger than this. 00:15:07.800 |
Sorry, GPT-3 compared to the GPT-2 data set for WebText. 00:15:20.960 |
Let's see, should I pause for questions or just keep going? 00:15:25.680 |
- I mean, if anyone has questions, now is a good time. 00:15:35.400 |
Looks like people are handling some of the questions in chat. 00:15:38.280 |
Okay, let's talk about the architecture of these models. 00:15:50.320 |
A couple notes, so I put GPT-1 as a comparison point. 00:15:57.600 |
It had 12 layers, 768 for the embedding dimension 00:16:09.440 |
that they do not refer to them as small, medium, large, XL. 00:16:17.440 |
GPT-2, they reserve as the name for the largest model. 00:16:25.960 |
the Hugging Face Transformers, if you use GPT-2 00:16:28.040 |
as the bare model name, you just get GPT-2 small. 00:16:35.400 |
And then in the text, they refer to all the other small variants 00:16:40.760 |
So these three are called the web text language models, 00:16:43.760 |
and then this is what they refer to as GPT-2 in the paper. 00:16:47.080 |
So for them, GPT-2 is simply the largest model, 00:16:50.840 |
because small really is just a replication of GPT-1. 00:16:55.000 |
And one other thing is they even tried to replicate it so much 00:17:03.480 |
So when you download the weights from, I guess, Azure now, 00:17:08.040 |
They're just simply renamed because there was a typo 00:17:12.480 |
And you can see, basically, the largest model 00:17:18.840 |
so roughly twice as big in the embedding dimensions. 00:17:21.200 |
They increased the context length for all of them, 00:17:28.600 |
Unfortunately-- well, there are a few changes architecturally 00:17:35.240 |
First is a larger size, which we saw in the previous slide. 00:18:11.960 |
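For reference, here are the four released sizes as they appear in the published configs (parameter counts are the commonly cited approximations):

```python
# The GPT-2 family, small through XL; "small" matches GPT-1's 12 layers and 768 dims.
GPT2_CONFIGS = {
    "gpt2-small":  {"n_layer": 12, "d_model": 768,  "n_head": 12, "n_ctx": 1024, "params": "124M"},
    "gpt2-medium": {"n_layer": 24, "d_model": 1024, "n_head": 16, "n_ctx": 1024, "params": "355M"},
    "gpt2-large":  {"n_layer": 36, "d_model": 1280, "n_head": 20, "n_ctx": 1024, "params": "774M"},
    "gpt2-xl":     {"n_layer": 48, "d_model": 1600, "n_head": 25, "n_ctx": 1024, "params": "1.5B"},
}
```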
improved training stability in certain cases. 00:18:15.960 |
Unfortunately, the paper has few details on the actual training. 00:18:21.560 |
but the learning rate was tuned for each size model, 00:18:27.200 |
And this is really interesting because the GPT-3 paper, 00:18:31.280 |
for example, went into a lot more detail on this. 00:18:33.960 |
In fact, it's, I think, right here near the beginning. 00:18:37.880 |
You've got a table here on the batch size learning rate. 00:18:40.320 |
And I think the appendix actually has the AdamW 00:18:45.120 |
So there isn't a lot of detail on how the model works 00:18:53.520 |
partially because it was eventually open sourced 00:19:00.320 |
So there's the official OpenAI implementation, 00:19:07.560 |
if you go into the source and you click on this, 00:19:12.900 |
you don't realize how small the actual code is 00:19:15.360 |
because all the knowledge is in the parameters. 00:19:18.460 |
If you add up all the code here and you take out the TensorFlow, 00:19:26.000 |
to help people understand, yes, you can understand this. 00:19:29.980 |
You can grok it if you just spend a week or two on it. 00:19:33.000 |
So don't feel like this is magic that you'll never understand. 00:19:38.000 |
The most popular way to use it, probably today, 00:19:43.040 |
is another implementation of it that uses the same OpenAI 00:19:59.640 |
as well, where Jay Alammar, who's now at Cohere, 00:20:04.300 |
goes through in detail how every single step of the transformer 00:20:09.080 |
He has really great diagrams and illustrations 00:20:16.920 |
minGPT from Andrej Karpathy, which is a PyTorch 00:20:21.240 |
The original version of GPT-2 was in TensorFlow. 00:20:28.280 |
And it's also OpenAI weight compatible for GPT-2. 00:20:32.840 |
And then he has llm.c, which implements GPT-2 entirely in C 00:20:40.640 |
A lesser known, but I think equally interesting 00:20:44.280 |
implementation is TransformerLens from Neel Nanda. 00:20:52.600 |
A lot of folks in mechanistic interpretability 00:20:58.720 |
And TransformerLens is a tool for running understanding 00:21:03.920 |
and interpretability experiments on large language models 00:21:14.200 |
that I did at the AI Engineer World's Fair last year, 00:21:27.080 |
using a version of this thing called SAE Lens for GPT-2. 00:21:30.600 |
And I just basically used one of their vectors 00:21:35.840 |
But it's a great way to learn how these models actually work 00:21:47.440 |
You can watch essentially how information propagates 00:22:01.560 |
I like this view because you can see how much smaller 00:22:09.560 |
It really makes it very visceral in terms of how it feels. 00:22:14.700 |
And the one challenge I have with visualizations 00:22:17.560 |
is they're fun to look at, but you can't actually go in 00:22:23.080 |
by interactively changing things within them. 00:22:28.200 |
Are All You Need, which is an Excel file that 00:22:31.520 |
implements all of GPT-2 small entirely in Excel. 00:22:40.640 |
It restarted on me because I'm running in parallel as well. 00:22:46.720 |
So that one, you can see there's a video right here 00:23:04.600 |
And then the most recent version is this one, 00:23:13.980 |
And let me walk through it for just 5 or 10 minutes 00:23:18.440 |
as kind of an intro to how transformer models work. 00:23:53.960 |
But you have your token and position embeddings. 00:23:58.720 |
We're basically grouping similar words together. 00:24:01.200 |
So I like to imagine, say, a two-dimensional map. 00:24:03.800 |
But in this case, in the case of GPT-2 small, 00:24:09.840 |
So you can imagine happy and glad are sitting here. 00:24:12.640 |
And sad's maybe a little close to it, but not quite as close. 00:24:15.520 |
And then things that are very different, like dog, cat, 00:24:18.800 |
And rather than thinking about this long list of numbers 00:24:28.200 |
they're now points in a 768-dimensional space 00:24:33.880 |
But what we've done is we've grouped similar words together. 00:24:37.260 |
And once we've done that, when we think about it, 00:24:39.300 |
similar words should also share the same next word predictions. 00:24:43.480 |
So the next word after happy is probably also the next word 00:24:49.040 |
And then that gives us kind of a boost or heads up 00:24:54.320 |
Neural networks are really good if you give them 00:24:59.560 |
So you give it photos, and you say which ones are dogs 00:25:12.800 |
The only other wrinkle is we have additional hints 00:25:14.960 |
we can give it, which is all the hints from all the other words 00:25:20.320 |
is it's letting every word look at every other word 00:25:26.880 |
Instead of just taking a one word or two or three word 00:25:29.900 |
history prediction, it's going to look at all the past words. 00:25:32.640 |
And then it refines that prediction over 12 iterations. 00:25:35.240 |
In the case of GPT-2 small; more in the larger ones. 00:25:39.440 |
back that we just convert back out to a word. 00:25:42.480 |
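If you would rather poke at this loop in code than in Excel, here is a minimal sketch using the Hugging Face GPT-2 weights (a tooling assumption on my part; the talk's own implementations are the Excel and JavaScript versions):

```python
# Predict the next token the same way the spreadsheet does: tokenize, run the
# stack of transformer blocks, and read off the most likely next word.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")        # "gpt2" is GPT-2 small
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "Mike is quick, he moves"                       # the prompt used later in the demo
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits                     # embeddings -> 12 blocks -> logits
next_id = int(logits[0, -1].argmax())                    # greedy choice of the next token
print(tokenizer.decode([next_id]))
```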
So putting it all together, we get basically this diagram. 00:25:50.640 |
and then we refine that prediction iteratively, 00:25:56.880 |
There's actually-- and you'll see this in the spreadsheet-- 00:26:06.560 |
Let's go back to the spreadsheet, if it will load up. 00:26:13.200 |
But that's fine, because I'm going to demonstrate 00:26:40.400 |
It's actually a series of vanilla JavaScript components. 00:26:45.400 |
You've got a cell here that wraps everything. 00:26:54.440 |
shows the result in a table, as you can see here. 00:27:00.720 |
And what's great about this is you can debug the LLM entirely 00:27:08.040 |
So to run this, the first thing you want to do 00:27:10.000 |
is click this link and download the zip file, which 00:27:15.360 |
It'll basically stick 1.5 gigabytes into your browser's IndexedDB. 00:27:20.800 |
So the first thing we do is we define matrix operations, 00:27:26.920 |
So this is our matrix multiply, also defined in raw JavaScript. 00:27:38.280 |
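The web companion writes this in raw JavaScript; for readers following along here, the same triple-loop idea in Python is just a few lines (a toy sketch, not the project's actual code):

```python
# Naive matrix multiply: out[i][j] is the dot product of row i of a and column j of b.
def matmul(a, b):
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                out[i][j] += a[i][k] * b[k][j]
    return out

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19.0, 22.0], [43.0, 50.0]]
```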
You hit Run, and it'll actually run the model. 00:27:40.280 |
So let me show you, though, the debugging capabilities. 00:27:49.620 |
And you can see our prompt is "Mike is quick, he moves." 00:27:53.480 |
But what I can do is I can just write from the-- 00:27:57.560 |
I can go here, and I can say, well, you know what? 00:28:14.680 |
I can just see the result. And heck, you know what? 00:28:16.920 |
Maybe I really want to just step through this thing. 00:28:19.440 |
So I can hit this, put the debugger statement, hit Play, 00:28:33.240 |
So great way to kind of get a handle on what's 00:28:37.200 |
And I'll take you through a brief view of this, 00:28:42.560 |
Redefine that function, and then I'll reset it. 00:28:50.160 |
And then if we hit Run, what it's going to do 00:28:53.000 |
is it's going to run through each one of these in order. 00:29:04.120 |
Then we actually take those, and we do the BPE algorithm 00:29:07.560 |
to turn them into tokens, which we'll get out here. 00:29:14.320 |
our tokens, rather, and then their token IDs. 00:29:16.840 |
And then if we keep going, this will turn the tokens 00:29:22.780 |
Then finally, we turn them into the positional embedding. 00:29:25.080 |
We're basically just walking through this same diagram 00:29:32.000 |
So we tokenize, then we turn it into embeddings, 00:29:39.240 |
These match the same steps in the Excel sheet, 00:29:42.440 |
where we'll basically do layer norm, for example. 00:29:53.120 |
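Here are those same first steps sketched against the Hugging Face GPT-2 weights, so you can see the tokenize, embed, position-embed, and layer-norm stages outside the spreadsheet (again a tooling assumption; the demo itself is in JavaScript):

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tok = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2Model.from_pretrained("gpt2").eval()

ids = tok("Mike is quick, he moves", return_tensors="pt").input_ids   # BPE -> token IDs
tok_emb = gpt2.wte(ids)                                  # token embeddings (768-dim each)
pos_emb = gpt2.wpe(torch.arange(ids.shape[1]))           # learned position embeddings, row i for position i
x = tok_emb + pos_emb                                    # input to the first transformer block
x = gpt2.h[0].ln_1(x)                                    # the first layer norm inside block 0
print(x.shape)                                           # (1, number of tokens, 768)
```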
actually stop running because it's in the browser, 00:29:57.360 |
But you can hit Run, go to this page right now, 00:30:02.240 |
It'll take about a minute to predict the next token. 00:30:04.640 |
So that's a quick overview of GPT-2 and some resources 00:30:30.000 |
It's going to be fun when he checks the chat now. 00:31:07.740 |
Helps web developers kind of get up to speed. 00:31:33.720 |
OK, so you know that probably embeddings we've talked about 00:31:52.800 |
Here's your canonical man, woman, king, queen, 00:31:55.720 |
where king minus man plus woman equals queen. 00:31:58.040 |
This is a contrived example, and we've put them 00:32:14.160 |
and the cat chases the dog have very different meanings. 00:32:19.880 |
So here's another example I use in my class, which 00:32:35.240 |
The problem we have is that in English, word order matters. 00:32:40.840 |
But in math, very often, position does not matter. 00:32:50.520 |
And so this is one of the hardest things to realize. 00:33:02.760 |
and this is a realm where order does not matter. 00:33:05.200 |
And so the math, everything after the equal sign, 00:33:07.760 |
cannot see the order of the stuff between them, 00:33:15.040 |
Even though you can look in this spreadsheet, 00:33:19.120 |
I can see it in order, just like you can see in order-- 00:33:27.240 |
just like you can see the order between 2 plus 3, 00:33:36.920 |
So the way we do that is we basically say, in GPT-2-- 00:33:41.500 |
note that there's something called RoPE, which 00:33:44.960 |
We say that-- let's go back to this diagram I led with. 00:33:47.880 |
The woman at position 0 probably means the same thing 00:34:03.320 |
And in general, we're going to just move it slightly 00:34:05.880 |
in some region so it doesn't move around too much, 00:34:08.960 |
So it can at least tell woman at different positions 00:34:18.480 |
If you remember sine and cosine, they limit to 1. 00:34:21.200 |
So they're basically keeping it in the circle. 00:34:47.480 |
Each one of these, if you look at this formula, 00:35:00.760 |
So the first row-- so this is what a million parameters 00:35:09.080 |
So you can actually go back to the one I showed you right here. 00:35:36.960 |
The token in the third position gets the third row added to it. 00:35:42.080 |
So there are 1,024 rows here for every single position 00:35:49.720 |
So that math only worked on the second column. 00:35:52.000 |
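A toy NumPy illustration of that lookup table, just to make the "row i gets added to the token at position i" point concrete (random numbers stand in for the learned weights):

```python
import numpy as np

n_ctx, d_model = 1024, 768                       # GPT-2 small: 1,024 positions x 768 dims (~0.8M parameters)
wpe = np.random.randn(n_ctx, d_model) * 0.01     # stand-in for the learned position table

token_embeddings = np.random.randn(5, d_model)   # pretend we already embedded 5 tokens
positions = np.arange(5)
x = token_embeddings + wpe[positions]            # the token at position i gets row i added to it
print(x.shape)                                   # (5, 768)
```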
Let me keep going so we don't run out of time. 00:36:07.440 |
for seven out of eight language modeling tasks at zero shot. 00:36:12.400 |
I'd call this, actually-- they claim it's zero shot. 00:36:17.040 |
But at the time, it was probably good enough to call it zero shot. 00:36:23.000 |
I'm going to go through a couple of the notable ones. 00:36:28.240 |
of predicting the next word of a long passage. 00:36:36.280 |
at the target sentence or even the last previous sentence. 00:36:42.520 |
and have some kind of sense of understanding to complete it. 00:36:49.920 |
because that's how far you have to be to find the word that's 00:36:55.200 |
because camera is the word to complete the sentence. 00:37:00.520 |
You have to infer they're dealing with the camera. 00:37:02.600 |
He's like, you just have to click the shutter. 00:37:04.520 |
So it's really a test of long passage understanding. 00:37:14.360 |
One thing to know is that GPT-2, when they tested it, 00:37:26.080 |
And so what they did is they added a stop word filter. 00:37:30.680 |
And they only let it use words that could end a sentence. 00:37:34.720 |
Because it would come up with other likely completions 00:37:37.640 |
for the sentence, but they would have kept going. 00:37:39.680 |
And so they would have been the wrong answer. 00:37:41.600 |
So they basically had to modify slightly the end result 00:37:54.720 |
similar, where you basically have a long passage. 00:38:16.200 |
It's not like a BERT model where it could complete somewhere 00:38:21.520 |
So the way they set this up, because it's a decoder, 00:38:24.520 |
is they computed the probability of each one of these choices. 00:38:28.680 |
And then those choices, along with the probabilities 00:38:32.760 |
for the rest of the other words to complete the sentence, 00:38:39.720 |
that joint probability of "Baxter had exaggerated matters 00:38:43.440 |
a little" versus "Cropper had exaggerated matters a little." 00:38:46.000 |
And they picked whichever one of those combinations 00:38:48.480 |
had the highest probability according to the language model. 00:38:56.560 |
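A hedged sketch of that scoring trick (using the Hugging Face weights for convenience): score each candidate completion by the model's total log-probability of the sentence and keep the winner.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)      # predictions for tokens 2..n
    targets = ids[:, 1:].unsqueeze(-1)
    return logprobs.gather(-1, targets).sum().item()          # sum of log P(token | prefix)

choices = ["Baxter", "Cropper"]                                # the two names from the example
best = max(choices, key=lambda c: sentence_logprob(f"{c} had exaggerated matters a little."))
print(best)
```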
It is the only one of the eight on that table 00:39:05.640 |
There are other tasks, which we'll talk about in a second, 00:39:08.280 |
where the model did not hit state-of-the-art. 00:39:11.120 |
But in those language modeling tasks in that table, 00:39:14.920 |
Their conclusion is that the reason this happened 00:39:21.240 |
does a ton of destructive pre-processing on the data. 00:39:25.640 |
So this is a screenshot from the 1 billion word benchmark. 00:39:31.720 |
But it describes a bunch of the steps they do to pre-process 00:39:43.560 |
And so it's not surprising that it didn't do as well 00:39:57.280 |
Some of these are perplexity, so lower is better. 00:39:59.360 |
So if it's in bold, it's better than the state-of-the-art. 00:40:02.720 |
Some of these are accuracy, so higher is better. 00:40:04.760 |
So here, you can see in bold when they've achieved higher 00:40:07.660 |
This is the only one where the state-of-the-art was still 00:40:15.800 |
Another one they tried was question answering. 00:40:18.300 |
And they did not achieve state-of-the-art on it. 00:40:20.300 |
This is the conversation question answering data set. 00:40:23.980 |
This is an example from the paper for that data set. 00:40:26.820 |
And you can see it's, again, a passage and then 00:40:29.020 |
a series of questions with answers and actually reasoning. 00:40:37.000 |
One is they matched or exceeded three out of the four 00:40:39.940 |
baselines without using any of the training data. 00:40:42.780 |
The other baselines had actually used the training data. 00:40:48.240 |
on a large enough data set, which is very surprising, 00:40:52.700 |
The other thing that they note, which jumped out to me, 00:40:58.580 |
to answer who questions, where it would look for names 00:41:11.900 |
are heads inside multi-head attention, whose whole job is 00:41:14.260 |
to, if it sees a passage that says "Harry Potter" like five 00:41:17.060 |
times, the next time it sees "Harry," it's like, oh, 00:41:21.380 |
So it's very interesting to see even that kind of sense 00:41:23.740 |
of something like an induction head inside GPT-2 00:41:31.740 |
It was tested on news stories from CNN and the Daily Mail. 00:41:35.060 |
And again, we have kind of early prompt engineering. 00:41:37.420 |
They induced summarization by appending TL;DR-- 00:41:41.020 |
too long, didn't read, which is something humans 00:41:55.980 |
And you can see the state-of-the-art is doing 00:42:01.540 |
The end result that came out resembled a summary. 00:42:03.980 |
But it turned out it confused certain details, 00:42:09.220 |
or where a logo was placed or things like that. 00:42:12.700 |
And unfortunately, it just barely outperformed 00:42:14.540 |
picking three random sentences from the article, 00:42:23.620 |
And you can see GPT-2, without any TL;DR hint, is doing worse. 00:42:27.260 |
So TL;DR definitely is actually steering the model. 00:42:43.140 |
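In code, the TL;DR trick is just a prompt suffix; here is a toy sketch (the article string is a placeholder, and the top-k of 2 follows the sampling setup the paper describes for summarization):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")        # GPT-2 small for illustration
article = "..."                                              # a long news article would go here
prompt = article + "\nTL;DR:"
out = generator(prompt, max_new_tokens=100, do_sample=True, top_k=2)[0]["generated_text"]
print(out[len(prompt):])                                     # the continuation is treated as the summary
```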
by few-shot prompting of English and French pairs. 00:43:01.980 |
So they went back, and they found a few naturally 00:43:09.260 |
since we deliberately removed non-English web pages 00:43:13.260 |
In order to confirm this, we ran a byte-level language detector 00:43:15.980 |
on WebText, which detected only 10 megabytes of data 00:43:18.540 |
in the French language, which is approximately 500 times 00:43:21.420 |
smaller than the monolingual French corpus common 00:43:24.220 |
in prior unsupervised machine translation research. 00:43:33.620 |
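The translation prompt in the paper is just a few example pairs of the form "english sentence = french sentence" followed by a final "english sentence ="; the example pairs below are my own, hypothetical ones:

```python
examples = [
    ("the cat sat on the mat", "le chat s'est assis sur le tapis"),
    ("I like coffee", "j'aime le café"),
]
prompt = "\n".join(f"{en} = {fr}" for en, fr in examples)
prompt += "\nwhere is the train station ="                   # the model is then sampled after the "="
print(prompt)
```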
learn to translate a very esoteric language that 00:43:37.060 |
has something like only 100 or 1,000 speakers 00:43:41.220 |
I feel like this is kind of parallels of that. 00:43:48.900 |
that rolled out that data set, where you have a question 00:43:51.300 |
coming from Wikipedia, and then a long answer, 00:43:54.980 |
and then a short answer for each of those prompts. 00:43:59.900 |
They seeded it with question and answer pairs. 00:44:01.820 |
Again, this is why Wikipedia was removed from the training data. 00:44:11.820 |
GPT-2 XL got 4.1%, and GPT-2 Small got less than 1%. 00:44:26.540 |
If you want to get better than the baseline of 30% to 50%, 00:44:32.660 |
You don't have to do any other algorithmic improvements. 00:44:40.340 |
inside the paper, which says that Alec Radford 00:44:47.300 |
So if you, I don't know, run into a wild Alec Radford 00:44:51.500 |
in his natural habitat of San Francisco, do not play dead. 00:45:07.980 |
is all you need, kind of things we talked about earlier 00:45:11.140 |
Given that the WebText language models appear to underfit the WebText 00:45:13.940 |
data set, as they note here in this figure from the paper, 00:45:18.020 |
it seems that size is improving model performance 00:45:21.620 |
We talked about how GPT-2 small didn't do as well as GPT-2 00:45:29.380 |
So then the question is, does size help even more? 00:45:35.700 |
instead of going up by 10, let's go up by a factor of 100, 00:45:38.460 |
and let's see if we'll get a much better model. 00:45:43.220 |
But that leads us into setting the stage for GPT-3. 00:46:02.540 |
Is that a good thing, or is that a sign of a bad or a good-- 00:46:05.020 |
- Yeah, it means we're engaged and having productive 00:46:12.700 |
- Leanne had some questions about the sine function, which 00:46:34.940 |
I called it an oscillation because it is-- you're right. 00:46:38.820 |
It's formulaic based on what position you're in. 00:46:41.180 |
And so that's-- that might just be a translation issue. 00:46:49.180 |
So we'll just slightly move it around inside this space. 00:47:04.740 |
- The position embedding is a whole different embedding. 00:47:07.540 |
You're saying we're moving the position of woman, 00:47:13.540 |
with the position to having significance y or whatever. 00:47:23.860 |
you're not really changing the position of woman 00:47:29.980 |
- No, we're keeping it roughly in the same embedding space. 00:47:46.460 |
where we're inside the attention mechanism in GPT-2. 00:47:50.100 |
So you are literally using the very same embeddings. 00:48:01.700 |
So you're not in a separate positional space, right? 00:48:08.460 |
this is happy at position 8, happy at position 1, 00:48:17.580 |
I just put a number on it when I plotted it with PCA. 00:48:17.580 |
So this is a dimensionality reduction from 768 dimensions down to 2. 00:48:19.580 |
And these are, you can see, happy at 3, 4, 5, 6, 7, 8. 00:48:29.540 |
I just put it at position 2 so we can see what it is. 00:48:34.900 |
So this is the same embedding space as, you know-- 00:48:54.540 |
This is glad, happy, Happy with a capital H, joyful, dog, cat, 00:48:54.540 |
This is just this stuff put in PCA, two-dimensional from 768. 00:49:08.660 |
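The plot being described is easy to reproduce approximately; here is a sketch with scikit-learn (my tooling choice, not necessarily the one used for the slide) that projects a few GPT-2 token embeddings from 768 dimensions down to 2:

```python
import torch
from sklearn.decomposition import PCA
from transformers import GPT2Tokenizer, GPT2Model

tok = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2Model.from_pretrained("gpt2").eval()

words = [" glad", " happy", " joyful", " dog", " cat"]   # leading spaces matter in GPT-2's BPE
ids = [tok(w).input_ids[0] for w in words]               # take the first token of each word
vectors = gpt2.wte.weight[ids].detach().numpy()          # look up the 768-dim embeddings

points = PCA(n_components=2).fit_transform(vectors)      # 768 dims -> 2 dims for plotting
for word, (x, y) in zip(words, points):
    print(word, round(float(x), 3), round(float(y), 3))
```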
And then I did the same thing, except I did it 00:49:17.900 |
So this is-- here, you can see it right here. 00:49:25.560 |
These are just the same things with the positional embeddings 00:49:29.660 |
So we're in-- in GPT-2, you're in the same embedding space. 00:49:38.660 |
probably between modern transformers and GPT-2. 00:49:42.300 |
The other one that I call out is RMS norm, for example, 00:49:48.780 |
I think I actually have a slide on the major differences. 00:49:51.300 |
That might be a good way to close out, is GPT-2 00:50:03.380 |
So you've got-- so you can see, this is Llama 405B. 00:50:07.500 |
What does a modern model look like compared to GPT-2? 00:50:18.580 |
So more layers, embedding dimensions are larger, 00:50:22.220 |
The training data and the training cost went up. 00:50:30.580 |
So instead of learned absolute positional embeddings, 00:51:04.620 |
looks like from today, at least the day of the recording. 00:51:19.500 |
I'll probably redo this with Llama and DeepSeek. 00:52:07.700 |
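Since RMS norm keeps coming up as one of the main differences, here is a quick side-by-side sketch: GPT-2's LayerNorm subtracts the mean and divides by the standard deviation, then scales and shifts, while RMSNorm skips the mean subtraction and the shift.

```python
import torch

def layer_norm(x, gain, bias, eps=1e-5):
    mean = x.mean(-1, keepdim=True)
    var = x.var(-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps) * gain + bias   # normalize, then scale and shift

def rms_norm(x, gain, eps=1e-5):
    rms = torch.sqrt(x.pow(2).mean(-1, keepdim=True) + eps)   # no mean subtraction
    return x / rms * gain                                     # scale only, no shift
```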
My-- there's an article that I found that explains this. 00:52:15.860 |
Yeah, it's like some question, some article about what 00:52:28.660 |
the key thing is I'm pretty sure this came out 00:52:44.300 |
does the same thing inside a, I think, a visual or a CNN 00:52:53.660 |
and so I think the obvious thing was like, OK, 00:53:05.820 |
And then I think maybe around the same time, or-- 00:53:11.820 |
They actually show some benefits and improvements 00:53:16.740 |
There is a trade-off, which I've forgotten what it is. 00:53:24.780 |
But that's a quick answer there for a pointer 00:53:44.820 |
Or this is another paper of the benefits of layer norm 00:54:00.380 |
But what's really fascinating to me about this, 00:54:05.060 |
is we keep finding afterwards, like, oh, yeah, 00:54:20.300 |
have explained self-repair inside these kinds of models. 00:54:24.140 |
But this research claimed it was surprisingly layer norm. 00:54:33.900 |
by an intuition of playing with these things. 00:54:37.700 |
Let me see if there's another chat I can address. 00:54:46.780 |
There is no maximum length angle among positional embeddings 00:54:55.220 |
There's a question there of what is the maximum length angle. 00:54:59.660 |
If you save these, I can reply to these in Discord, swyx. 00:54:59.660 |
- It's basically a recycling of the one that you already had. 00:55:26.820 |
Is there anything else you saw that was interesting I should 00:55:44.300 |
Now that we've got test time compute, at least. 00:55:48.780 |
- So my response is that that is kind of moving the goalposts. 00:55:56.020 |
Like if you want to say that you haven't hit a wall, 00:56:11.180 |
- I will just say I know an excellent podcast that 00:56:11.180 |
held a debate in Vancouver last year in December 00:56:19.620 |
between two very qualified experts on this very topic. 00:56:46.460 |
I think that's too dismissive of a word for what you did. 00:56:55.020 |
- Yeah, we don't have a paper picked for next week. 00:57:09.740 |
And I look forward to being back at some point in the future. 00:57:15.740 |
I don't want people to feel like they have to match this. 00:57:36.460 |
And then you can see what people have said about it. 00:57:50.860 |
And then if you want to attend when I do my next live one, 00:57:53.700 |
you can attend a future live cohort for free. 00:57:57.780 |
feel free to shoot me a question over Twitter, LinkedIn, 00:58:08.540 |
- Yeah, but it's more AI engineering, quote unquote. 00:58:11.300 |
So you treat the language model as a black box, 00:58:20.940 |
No, no, this is about how the actual model works. 00:58:23.180 |
And I use the Excel spreadsheet that implements it. 00:58:29.420 |
And then I also use this web version as well. 00:58:40.860 |
I actually see at least one of my former students here.