Breaking down the OG GPT Paper by Alec Radford

00:00:00.000 | Okay. Sure. So, hey everyone. My name is Amged. I'm a machine learning engineer.

00:00:06.000 | I generally do ML consulting services to startups. I help them like ship AI-powered products,

00:00:13.000 | especially in the field of NLP and speech-to-text applications.

00:00:17.000 | And I run a blog where I like publish posts about ML stuff.

00:00:23.000 | So feel free to check it out. I've done some posts about Whisper.

00:00:27.000 | Yeah. So with that out of the way, let's get directly to what we want to discuss today,

00:00:33.000 | which is like the GPT-1 paper by the folks at OpenAI.

00:00:40.000 | So the paper is titled "Improving Language Understanding by Generative Pre-training".

00:00:45.000 | It was published in June 2018. And these are like the authors.

00:00:50.000 | Let me switch to presentation mode. Yeah. And these are like the authors of the paper.

00:00:54.000 | So mainly Alec Radford and Elia Suskiver, very well-known publishers in the ML field.

00:01:02.000 | So let's get started. Back in 2018, deep learning was becoming very popular.

00:01:10.000 | But the main thing with deep learning is like it is very, very data hungry.

00:01:15.000 | There are like good news and bad news.

00:01:17.000 | The good news is like there is a ton of data available everywhere on the Internet and just online.

00:01:24.000 | We have like tons and tons of data.

00:01:26.000 | The bad news is this data is not annotated and it's not curated and it's basically very, very messy.

00:01:33.000 | And if you want to train machine learning models back then,

00:01:36.000 | the only way to do it is just to annotate data yourself or hire like data annotators.

00:01:41.000 | And these tend to be very, very expensive and difficult to scale and hire.

00:01:45.000 | Like if you think GPUs are expensive, you have not worked with data annotators.

00:01:49.000 | That's what I like to say. So this makes like deep learning very restricted.

00:01:54.000 | You cannot use it everywhere.

00:01:56.000 | Like you're only restricted to fields that have good, high quality annotated data sets.

00:02:02.000 | And this is a very big bottleneck.

00:02:05.000 | So people back then in 2018 are trying to solve this problem.

00:02:09.000 | Like how do we get over the fact that we need labeled data?

00:02:15.000 | And one potential solution to this problem is like unsupervised learning.

00:02:20.000 | So the question is, what if we can leverage like just the linguistic information from the unlabeled data?

00:02:26.000 | So we have like a bunch of text, like novels, books, articles.

00:02:32.000 | How can we leverage like the linguistic information from this?

00:02:35.000 | And the answer to this could be using unsupervised learning.

00:02:40.000 | And if you can do this successfully, this alleviates the need for large amounts of labeled data.

00:02:45.000 | Because basically you can utilize Wikipedia, which is very big, or even all the papers published on archive and so on.

00:02:55.000 | And even if you have like lots of labeled data, using unsupervised learning as a step one or step zero

00:03:01.000 | is going to make your model actually perform better than just training on this labeled data.

00:03:06.000 | Because of transfer learning, where you can transfer the things that your model has learned during pre-training

00:03:12.000 | into your actual objective.

00:03:15.000 | And a good evidence of this is back in 2017 and 20, or even a bit before 2017,

00:03:20.000 | is like people have been using pre-trained word embeddings.

00:03:24.000 | Things like Wave2Vec, GloVe, FastText, to achieve very good performance on many tasks like classification and machine learning.

00:03:32.000 | Sorry, machine translation.

00:03:34.000 | So this is a good evidence that actually unsupervised learning works and it's a very good approach.

00:03:40.000 | So this is the premise of the paper actually, like unsupervised learning is very promising.

00:03:48.000 | So let's talk a bit about word embedding.

00:03:51.000 | So the idea behind word embedding is like you want to project words, so just text, into an n-dimensional space.

00:03:58.000 | And n is usually 300 or 1,000 back then.

00:04:02.000 | And this space has a very special property that words that are similar in meaning have very similar products.

00:04:08.000 | Sorry, very similar vectors.

00:04:11.000 | And by similar vectors I mean like you can measure similarity by things like cosine similarity dot product, like L2 distance.

00:04:20.000 | So we can capture similarity between words that don't each other but have similar meanings.

00:04:27.000 | For example, the word "booking" and "reservation".

00:04:30.000 | Just from the syntax, they are very different words, but they have very similar meanings.

00:04:34.000 | We're just booking something.

00:04:36.000 | And similarly, the word "adam" and "sgd".

00:04:39.000 | These are like completely different words, but they are both actually like optimizers and use in machine learning.

00:04:44.000 | So they are very similar and their vectors are going to be probably similar.

00:04:48.000 | The most common implementation of word embedding is like "wave2vec" by Google.

00:04:53.000 | This is what popularized the usage of word embedding.

00:04:57.000 | And then "glove" by Stanford and "fasttext" by Facebook.

00:05:01.000 | And the way these word embedding models are trained is like leveraging co-occurrence between words that have similar contexts.

00:05:09.000 | For example, similar words tend to occur in similar contexts.

00:05:13.000 | Like you would generally find the word "adam" associated with learning rate or machine learning.

00:05:19.000 | Similarly, "sgd" is associated with learning rate as well.

00:05:22.000 | So you can conclude that these two words are kind of similar.

00:05:25.000 | And the way these word embedding are used is that they are like utilized by training a head.

00:05:31.000 | Like for example, if you want to classify a word as being positive or negative, having a positive sentiment or a negative sentiment,

00:05:38.000 | you can train a classification head on top of the frozen embeddings.

00:05:42.000 | So you use the word embeddings like as a fixed feature input, input features.

00:05:47.000 | Just frozen things without training the word embeddings themselves.

00:05:54.000 | And this has like a significant drawback or a bunch of significant drawbacks.

00:05:59.000 | The first thing is it doesn't utilize the context of the text, so you're just using the word itself.

00:06:05.000 | And some words, even the same word, can have very different meanings depending on the context.

00:06:11.000 | For example, the river bank, like the Amazon bank or the oil bank, is different from the HSBC bank or like the JPMorgan bank,

00:06:19.000 | even though these are the exact same words.

00:06:22.000 | So if you're going to use just wave2vec, these two words will have the same vector, even though they have very different contexts.

00:06:29.000 | And even beyond this, natural language has nuances that cannot just be captured by using words.

00:06:35.000 | Like the way you write things like "glamour3on" and "skillissue" on OpenAI, the way you write it this way,

00:06:41.000 | you have a very specific intonation compared just to writing OpenAI in the normal way.

00:06:47.000 | So are there any questions about word embeddings so far?

00:07:00.000 | Someone says in the chat, nope. I cannot read the chat, so if just someone can say this using the microphone, that'd be great.

00:07:08.000 | I don't think there are any questions. I think Sean said no.

00:07:11.000 | But if anyone has any questions, you can maybe drop it in the chat, and then I can just surface them as and when.

00:07:17.000 | But it seems like everyone's okay with it for now. I'll keep an eye on the chat for you.

00:07:21.000 | Yeah, great. Yeah, thank you. That's appreciated.

00:07:24.000 | So this was word embedding. The main limitation or drawback is you cannot use context.

00:07:30.000 | So can we go beyond word embedding?

00:07:33.000 | Word embeddings are too local. We need something that's more global that can capture the higher-level semantics.

00:07:39.000 | But this also has its complications.

00:07:43.000 | How do you leverage more than word-level information from unlabeled text?

00:07:47.000 | You're going to have some questions that you need to answer.

00:07:51.000 | First is, which objective should you use while training?

00:07:54.000 | Do you want to use language modeling or machine translation or discourse coherence or something else?

00:08:00.000 | And in 2024 right now, we definitely know the answer.

00:08:05.000 | Language modeling works very well because this is what we've been using for the past three years.

00:08:09.000 | But back then in 2017 and 2018, this was not very clear.

00:08:13.000 | For example, this paper came out before BERT.

00:08:16.000 | And even the original machine learning, the original transformer paper, attention is all you need to use machine translation.

00:08:25.000 | So back then, this definitely was not very obvious to people.

00:08:29.000 | The second question is, how should we transfer these embeddings to the target task?

00:08:35.000 | There are a bunch of ways to do this.

00:08:37.000 | Do you want to modify the model architecture?

00:08:43.000 | Each task is going to require each specific modification to the model itself.

00:08:49.000 | This is one approach, and this requires very deep knowledge of model architecture and being just a wizard to modify the architecture.

00:08:59.000 | The second approach is to use a specific recipe or schema to do transfer learning.

00:09:05.000 | A very popular example of this is ULM Fit by Jeremy Howard and Sebastian Ruder.

00:09:11.000 | We're going to cover this in the next two slides, I think.

00:09:14.000 | The third option is also you can add an auxiliary learning objective during pre-training.

00:09:19.000 | While you're pre-training on language modeling, you can have an auxiliary learning objective like machine translation or discourse coherence.

00:09:27.000 | These are some of the approaches you might want to take when you are deciding on the target task.

00:09:34.000 | All these questions made going beyond word embedding not so straightforward.

00:09:42.000 | They made it difficult to utilize semi-supervised learning or unsupervised pre-training.

00:09:49.000 | Let's take a look at ULM Fit and how they did this.

00:09:53.000 | This is a paper titled "Universal Language Model Fine-Tuning for Text Classification."

00:10:00.000 | Their objective is text classification, but they want to also build a universal language model.

00:10:06.000 | This is a very similar work in NLP. It's a very well-known paper and it has a big impact.

00:10:12.000 | The question they are raising is instead of just utilizing the word embeddings,

00:10:21.000 | like you're going to have a classifier, you're going to have an embedding layer and a classification layer.

00:10:26.000 | The old way of doing this, you're going to use the embedding layer from Web2Vec and you are going to randomly initialize the classification head.

00:10:34.000 | They are asking why not just have a good initialization point for all the layers, not just the embedding layer.

00:10:42.000 | Their answer to this question is an approach called ULM Fit.

00:10:47.000 | It is a three-step recipe for state-of-the-art text classification back then.

00:10:52.000 | We have three steps. The first step is to train the language model on general domain data.

00:10:58.000 | We call this pre-training on large corpus these days.

00:11:02.000 | You just train your language model on a very big corpus like Wikitext.

00:11:07.000 | This was big back then, but it's probably small now.

00:11:10.000 | You can do pre-training on 15 trillion tokens of data if you are a big organization like Meta, for example, or even more.

00:11:19.000 | This is the first step. The second step is you do fine-tune the language model on your target data.

00:11:26.000 | You keep doing language modeling in this step.

00:11:30.000 | This is similar to what we call continued pre-training on target domain.

00:11:35.000 | You just take your LLAMA270B base and you just do language modeling on, let's say, financial books to try to make your model like Bloomberg GPT, for example.

00:11:48.000 | But you're still doing language modeling. You're not doing any task-specific training.

00:11:54.000 | The third and final step is to train a classifier on your label data.

00:11:58.000 | This is the fine-tuning step that we are all familiar with.

00:12:02.000 | Let's say you want to classify Amazon reviews as being positive, neutral, or negative.

00:12:07.000 | You're just going to get maybe 1,000 reviews on the label for each review and just train the model in a supervised fashion.

00:12:16.000 | This was a very good paper and it was very influential and made a big buzz in the ML field.

00:12:26.000 | This paper was released in 2018, but it did not mention the word "transformer" at all.

00:12:32.000 | They did not even reference the paper.

00:12:35.000 | The architecture of the model used was an RNN-based model. I think it was an LSTM.

00:12:40.000 | So, no transformers at all.

00:12:43.000 | And this is kind of a big gap in this work that GPT folks are going to fill.

00:12:50.000 | Otherwise, this paper would have been a very, very good paper.

00:12:55.000 | It still is a good paper.

00:12:57.000 | So, let's talk a bit about GPT, the thing that we want to talk about today.

00:13:04.000 | GPT stands for Generative Pre-trained Transformer, the keyword is "transformer".

00:13:10.000 | It was developed by OpenAI, by these awesome folks.

00:13:15.000 | It was actually one of the first things that made OpenAI a popular organization in the ML field.

00:13:22.000 | The whole premise is to use a semi-supervised approach for NLU tasks, natural language understanding tasks.

00:13:32.000 | So, the goal is to learn a universal representation that transfers very well to other downstream tasks.

00:13:41.000 | So, basically have a good starting point, instead of starting from scratch every single time.

00:13:47.000 | And their approach is basically two-step.

00:13:50.000 | The first step is to do unsupervised pre-training, and then supervised fine-tuning.

00:13:55.000 | And this is kind of where the word "semi-supervised" comes from.

00:13:59.000 | So, a mixture of unsupervised training and supervised training.

00:14:03.000 | And their architecture is transformers, of course, because we have GPUs, and GPUs love transformers.

00:14:09.000 | So, that's a good thing.

00:14:14.000 | So, to do this approach, you're going to need two things.

00:14:17.000 | The first thing is a very big corpus of unlabeled text.

00:14:20.000 | And then the second thing is a dataset that is annotated, that is ready to use for supervised fine-tuning.

00:14:28.000 | So, you could have multiple datasets if you want to train your model for several tasks.

00:14:34.000 | But the good news is your target tasks do not need to be in the same domain as the unlabeled text.

00:14:40.000 | So, for example, let's say you want to train on financial tasks.

00:14:45.000 | You're going to give the model some information about a stock and ask it about how it performs.

00:14:50.000 | Or whether it should buy or sell.

00:14:52.000 | So, like a financial classifier.

00:14:55.000 | Actually, you can pre-train on just like normal general data.

00:14:59.000 | Like you can pre-train on a corpus from the Internet and then just fine-tune on your desired tasks.

00:15:05.000 | Like your unlabeled corpus does not need to be in the same domain as your objective.

00:15:11.000 | And this is good news because we have a lot of general-purpose text that you can use for pre-training.

00:15:17.000 | While obtaining very domain-specific corpus is more involved.

00:15:22.000 | And a very minor note here is the name can be misleading in this work.

00:15:27.000 | The word "generative" here mainly refers to the pre-training step.

00:15:31.000 | The actual tasks that they had in mind are more discriminative.

00:15:35.000 | So, like classification, question answering, and semantic similarity.

00:15:38.000 | That is, natural language understanding tasks.

00:15:41.000 | So, they did not discuss machine translation or just being a chatbot in this work.

00:15:49.000 | And they actually released their blog post under a different, and I think more fitting title,

00:15:54.000 | called "Improving Language Understanding with Unsupervised Learning".

00:15:58.000 | And this is like the key idea here, like unsupervised learning.

00:16:03.000 | This is a very nice blog post on their blog.

00:16:07.000 | So, now we have discussed EOLMFET and GPT.

00:16:14.000 | Let's also discuss some of the other related work in this domain.

00:16:18.000 | So, let's first start with semi-supervised learning.

00:16:21.000 | The work GPT falls under this domain.

00:16:25.000 | And back then this was becoming very popular, like sequence labeling, text classification tasks.

00:16:31.000 | People were doing semi-supervised learning for these.

00:16:34.000 | And you have different levels for this approach.

00:16:38.000 | So, the first and basic level is just using language statistics to get the features.

00:16:46.000 | Like use bag of words, tf, idf, and all these classical machine learning stuff.

00:16:53.000 | You can use them as input features and just train a classifier on top of this.

00:16:58.000 | But that's not very helpful, because two words that are different in syntax,

00:17:03.000 | but are similar in semantics, are going to have very different features.

00:17:08.000 | So, it makes the model a bit limited.

00:17:13.000 | The second step is to use the word embedding, like we discussed previously.

00:17:16.000 | And this approach allows you to capture the semantics,

00:17:20.000 | but it's also very limited in that it's based on words.

00:17:24.000 | We're not capturing the higher level semantics.

00:17:27.000 | The third level is like sequence embeddings.

00:17:30.000 | So, instead of just using the word to get the embedding,

00:17:35.000 | you are going to utilize the entire sequence.

00:17:38.000 | So, like the sentence or the paragraph to get the embedding.

00:17:41.000 | And this actually allows us to utilize the context

00:17:44.000 | to understand the high level semantics of the text.

00:17:47.000 | And this is like level 3 is where GPT falls.

00:17:50.000 | We are using the entire sequence to generate embeddings

00:17:54.000 | that we can use for classification or any other task.

00:17:59.000 | So, this is regarding semi-supervised learning.

00:18:02.000 | The second field, and the more specific field, is unsupervised learning.

00:18:07.000 | So, it's a special case of semi-supervised learning.

00:18:10.000 | These terminologies can be confusing, I know.

00:18:14.000 | But the goal is to find a good initial starting point

00:18:18.000 | instead of just going directly to do unsupervised learning objective.

00:18:27.000 | This is a typo, probably.

00:18:30.000 | So, the early works in unsupervised learning

00:18:36.000 | was actually used in vision, in image classification.

00:18:39.000 | So, for example, you can use ResNet

00:18:41.000 | that was trained to classify ImageNet.

00:18:43.000 | You can take the backbone, the ResNet as a backbone,

00:18:46.000 | and then do like segmentation

00:18:49.000 | or just try to detect pneumonia in chest x-rays.

00:18:52.000 | And this is a very good starting point.

00:18:54.000 | Even though ResNet was...

00:18:56.000 | Even though ImageNet is just classifying images

00:19:00.000 | as being cats or dogs or humans or horses.

00:19:04.000 | So, it's more of a general purpose dataset,

00:19:06.000 | but you learn a very good representation

00:19:09.000 | that can be used in your downstream task.

00:19:13.000 | And in the field of vision,

00:19:15.000 | this actually proved to be quite well

00:19:18.000 | because people have found out that pre-training

00:19:20.000 | acts as a regularization scheme.

00:19:22.000 | Your model tends to have better generalization

00:19:25.000 | if you pre-train it on a very large corpus.

00:19:29.000 | And the authors mentioned in the paper

00:19:31.000 | that the closest line of work to their work

00:19:33.000 | is actually what we discovered...

00:19:35.000 | What we covered so far,

00:19:37.000 | the ULMFET by Howard and Ruder,

00:19:39.000 | and also another work by Dai et al.

00:19:43.000 | But I did not go into...

00:19:45.000 | I did not go in detail into this work.

00:19:49.000 | However, the main drawback of these two works

00:19:52.000 | and everything else, almost everything else back then,

00:19:55.000 | is like they are using RNNs.

00:19:57.000 | And we know that by 2018,

00:19:59.000 | like, transformer reigns supreme

00:20:01.000 | because we have GPUs.

00:20:04.000 | So we can train transformers more efficiently.

00:20:06.000 | And also RNNs have, like...

00:20:08.000 | They are limited in their ability

00:20:10.000 | to process large contexts

00:20:13.000 | because of, like, the gradient vanishing problems

00:20:16.000 | and so on.

00:20:19.000 | So this is, like, the most common approach back then.

00:20:21.000 | Another approach is also to use the head-end representation

00:20:24.000 | from a pre-trained language model

00:20:26.000 | or machine translation model

00:20:27.000 | as auxiliary features while training your model.

00:20:30.000 | So you can have...

00:20:32.000 | Let's say you have a machine learning model.

00:20:36.000 | You're going to get the representations

00:20:38.000 | or, like, the hidden state of this model

00:20:41.000 | and then just give it to your classifier

00:20:44.000 | as, like, additional features.

00:20:47.000 | And as you can imagine, this is, like...

00:20:49.000 | This involves, like, a substantial amount of work

00:20:51.000 | and new parameters for each task.

00:20:54.000 | So this is not very universal.

00:20:57.000 | So any questions so far

00:20:59.000 | before we actually go in detail into the approach?

00:21:04.000 | - I think there was a question from Sook.

00:21:07.000 | Did AlexNet use GPUs?

00:21:10.000 | And were the transformers... - AlexNet?

00:21:11.000 | - ...the first ones to use the GPUs?

00:21:13.000 | I think he just put it in the chat.

00:21:15.000 | Yeah, but...

00:21:18.000 | Were transformers, like, some of the...

00:21:19.000 | - Yeah, AlexNet, yeah.

00:21:22.000 | Yeah, AlexNet did definitely use GPUs,

00:21:24.000 | but back then, I think they were writing Qt code.

00:21:27.000 | I think Alex Kryzewski was the person

00:21:30.000 | doing the GPU programming stuff himself.

00:21:36.000 | So, yeah, I think AlexNet did definitely use GPUs,

00:21:39.000 | if I remember correctly.

00:21:41.000 | - Yes, okay.

00:21:43.000 | Cool, cool.

00:21:45.000 | Well, seems like they did.

00:21:47.000 | Okay, I would just like to see

00:21:49.000 | if anyone else has any other questions.

00:21:50.000 | I think it should be good for now.

00:21:53.000 | - Yeah.

00:21:55.000 | Someone said, "According to papers with Qt, they did."

00:21:58.000 | Yeah, I think they did use GPUs.

00:22:01.000 | So let's go into details about the GPT approach.

00:22:06.000 | So we have two steps.

00:22:08.000 | The first step is unsupervised pre-training.

00:22:10.000 | So basically, the goal is to train

00:22:12.000 | a high-capacity language model

00:22:14.000 | on a large corpus of text.

00:22:16.000 | The training objective here is language modeling.

00:22:18.000 | That is, given a sequence of tokens,

00:22:21.000 | try to correctly predict the next token.

00:22:24.000 | So let's say the cat sat on D.

00:22:27.000 | You're trying to predict the next token,

00:22:29.000 | which I think should be the mat.

00:22:31.000 | So this is basically language modeling.

00:22:34.000 | The loss function D used is negative log-likelihood.

00:22:38.000 | It's also called cross-entropy.

00:22:40.000 | Basically, this equation,

00:22:43.000 | the negative log-likelihood of the correct label.

00:22:47.000 | And you do this over--

00:22:49.000 | a very, very important note here is

00:22:51.000 | you do this over the entire sequence.

00:22:53.000 | So if you have a sentence that said,

00:22:55.000 | "The cat sat on the mat,"

00:22:57.000 | you do this over every single token in this sentence.

00:23:00.000 | And this is very important.

00:23:02.000 | So you are training on your entire input--

00:23:05.000 | your entire corpus,

00:23:08.000 | not just the last token.

00:23:11.000 | So the architecture is like a transformer,

00:23:14.000 | a multi-layer transformer,

00:23:16.000 | but they are using the decoder only.

00:23:18.000 | They are not using the encoder.

00:23:20.000 | So--and the difference between the encoder and the decoder,

00:23:24.000 | I think, can be summarized to, like, how you do attention.

00:23:27.000 | So in encoders,

00:23:29.000 | every token has access to every other token.

00:23:32.000 | But in decoders,

00:23:34.000 | every token has access only to the tokens

00:23:36.000 | that came before it.

00:23:38.000 | So you only have one directional attention.

00:23:40.000 | I think it's also called left-to-right attention

00:23:44.000 | or right-to-left, or, like,

00:23:46.000 | you only have attention in one direction.

00:23:49.000 | And basically, the transformer architecture

00:23:52.000 | applies, like, multi-headed mask itself attention operation

00:23:55.000 | over the input tokens,

00:23:57.000 | and then this is followed by a position-wise feed-forward layer.

00:24:00.000 | And you keep doing this for, like, n,

00:24:02.000 | where n is the number of transformer layers,

00:24:05.000 | and then you just generate an auto-distribution

00:24:08.000 | over the vocabulary.

00:24:10.000 | So let's take this into more detail.

00:24:12.000 | You have your input text

00:24:14.000 | that has been tokenized

00:24:16.000 | into tokens,

00:24:18.000 | and these tokens have been encoded.

00:24:21.000 | So you have, like, the token integer or token ID.

00:24:24.000 | So it's a number.

00:24:26.000 | So you take this number, and you pass it through the embedding layer,

00:24:29.000 | the token embedding layer,

00:24:31.000 | also called the semantic embedding layer,

00:24:33.000 | and you get a vector for this token.

00:24:36.000 | And you also get the positional embedding for this token.

00:24:39.000 | You sum these up, like, just vector addition,

00:24:43.000 | and you get your input to the first transformer block,

00:24:47.000 | which is H0.

00:24:49.000 | So just the positional embedding

00:24:53.000 | and the token embedding added together,

00:24:56.000 | and you get your input, H0.

00:24:58.000 | And then you take H0 and pass it to each block,

00:25:02.000 | each transformer block,

00:25:04.000 | and the output of the first block is going to be the input

00:25:06.000 | to the second block, and so on.

00:25:08.000 | And this is what this equation says.

00:25:11.000 | And at the end,

00:25:13.000 | once you've done going through all these blocks,

00:25:15.000 | you're going to go to the output layer,

00:25:18.000 | which is actually a reverse embedding.

00:25:21.000 | So you have a vector, but you want to go back to a token,

00:25:24.000 | or, like, I should say,

00:25:26.000 | a probability distribution over all the tokens.

00:25:30.000 | And once you have this distribution,

00:25:32.000 | you pass it through a softmax

00:25:34.000 | to get actual probabilities that sum up to one,

00:25:37.000 | instead of just logits or scores.

00:25:40.000 | And I think we've covered the transformer paper,

00:25:43.000 | so I think people are familiar with this,

00:25:46.000 | but if you have other questions, please go ahead.

00:25:49.000 | A small remark about this is that we use

00:25:51.000 | tied input and output token embedding,

00:25:53.000 | so the embedding layer in this step

00:25:55.000 | is the same as the one used at the final step.

00:25:58.000 | So...

00:26:00.000 | Yeah, someone says causal masking.

00:26:08.000 | Yes, exactly, causal language modeling.

00:26:11.000 | So the second step is supervised fine-tuning.

00:26:14.000 | So the goal of this step

00:26:16.000 | is to adapt the parameters of the pre-trained model

00:26:19.000 | to your supervised target task.

00:26:22.000 | And for this, we need a label data set

00:26:24.000 | where each instance is a pair of, like,

00:26:26.000 | you have a sequence of input tokens and the label.

00:26:28.000 | So, for example, an input sequence

00:26:31.000 | could be this product is very bad,

00:26:33.000 | and your label could be, like, a negative sentiment.

00:26:37.000 | So the inputs are passed through the pre-trained model

00:26:40.000 | to obtain the final transformer blocks activation.

00:26:43.000 | So if you go back a bit,

00:26:46.000 | you take your input sequence and pass it

00:26:48.000 | through this same transformer

00:26:51.000 | and get the output of the final token,

00:26:53.000 | and then you compare this to your label.

00:26:57.000 | So you're gonna get the hidden representation

00:27:03.000 | of the last encoder layer at the M token,

00:27:06.000 | where M is, like, the final token, the input sequence.

00:27:08.000 | And you just pass this through your classification head

00:27:11.000 | or whatever head you're using.

00:27:13.000 | So, for example, in classification,

00:27:15.000 | we're gonna use Softmax on top of a linear layer

00:27:18.000 | to get your output and compare it with the label.

00:27:23.000 | And you are using roughly the same loss function,

00:27:26.000 | which is negative log-light-load estimation

00:27:28.000 | or cross-entropy.

00:27:32.000 | And a key distinction here

00:27:35.000 | between this and the previous step

00:27:37.000 | is you are only calculating the loss

00:27:39.000 | over only the output token, not the entire sequence.

00:27:42.000 | So the loss is only on Y, not on X1 or XM and so on.

00:27:48.000 | So the only extra parameters you need for this

00:27:51.000 | is your classification head.

00:27:54.000 | So, for example, W-Y,

00:27:56.000 | if you're trying to do classification,

00:27:58.000 | the parameter metrics of the--

00:28:02.000 | the metrics of the parameters of the output layer.

00:28:06.000 | And also embeddings if you're adding new tokens.

00:28:09.000 | And we're gonna see this in a bit.

00:28:12.000 | Something they also experimented with

00:28:14.000 | is auxiliary training objective.

00:28:16.000 | So they also used language modeling

00:28:18.000 | as an auxiliary objective in fine-tuning.

00:28:20.000 | So not just this--

00:28:23.000 | not just this classification, but also language modeling.

00:28:26.000 | And they say this helped them

00:28:29.000 | by improving the generalization of the supervised model

00:28:32.000 | and accelerating convergence.

00:28:34.000 | They also say this is in line with prior work.

00:28:37.000 | They also observed better performance

00:28:39.000 | when using it as an auxiliary objective.

00:28:42.000 | And the way you do this is your loss function

00:28:45.000 | is now a sum of multiple loss functions

00:28:48.000 | where one loss function is this one,

00:28:51.000 | the classification loss function,

00:28:53.000 | and also you have the language modeling loss function

00:28:57.000 | with a certain weight, like lambda here is like a weight.

00:29:00.000 | And lambda could be, for example, 0.5 or 0.3.

00:29:04.000 | So you have like a summation of multiple losses.

00:29:09.000 | And a small note for myself here.

00:29:12.000 | I'm not sure if auxiliary language--

00:29:14.000 | like auxiliary objectives are popular today.

00:29:17.000 | I think people just do supervised fine-tuning

00:29:20.000 | without an auxiliary objective.

00:29:23.000 | That's just my take.

00:29:26.000 | So any questions so far?

00:29:30.000 | I think the chat seems to not have any questions.

00:29:37.000 | So maybe you want to just-- you can just continue from now.

00:29:41.000 | Sure.

00:29:42.000 | So now we have discovered--

00:29:47.000 | we have covered the approach, the basic two steps.

00:29:51.000 | You will now get a very good, let's say, classifier,

00:29:54.000 | because you have done unsupervised pre-training

00:29:57.000 | and supervised fine-tuning on classifiers.

00:29:59.000 | But GPT-1 is actually trying to be more--

00:30:01.000 | it's trying to be more of a universal model

00:30:03.000 | than just a classifier.

00:30:05.000 | So they are trying to handle multiple tasks,

00:30:07.000 | like classification, entailment, semantic similarity,

00:30:10.000 | and answering multiple choice questions.

00:30:13.000 | And these are all, as you can see,

00:30:15.000 | discriminative tasks other than generative tasks,

00:30:18.000 | as we discussed before.

00:30:21.000 | So for tasks like classification, this is very easy.

00:30:23.000 | You can do what we have covered so far.

00:30:25.000 | Just add the head on top and do the classification.

00:30:28.000 | But other tasks have different structured inputs and outputs.

00:30:33.000 | So for example, text entailment has ordered sentence pairs.

00:30:36.000 | MCQs have a question with multiple answers, and so on.

00:30:41.000 | So each task has its own specific structure.

00:30:45.000 | And the way people have dealt with this previously

00:30:47.000 | is just learn a specific architecture for each task

00:30:53.000 | on top of your model.

00:30:55.000 | And this defeats the whole purpose of the GPT work.

00:31:00.000 | We're trying to do something that's

00:31:02.000 | global, general, and a general-purpose model,

00:31:05.000 | rather than having multiple task-specific architectures.

00:31:11.000 | So instead of using this approach,

00:31:14.000 | they opted to use a different approach,

00:31:17.000 | where they convert the structured inputs

00:31:21.000 | into just tokens.

00:31:23.000 | So they are trying to create a multitask format.

00:31:27.000 | And this is similar to what people have used in future work

00:31:30.000 | like T5 and Whisper, where basically you're

00:31:33.000 | trying to model different tasks just using

00:31:37.000 | tokens and special tokens.

00:31:40.000 | These input transformations allows

00:31:42.000 | us to use the same architecture for different tasks.

00:31:45.000 | So you don't need to do a lot of modification.

00:31:49.000 | And we're going to go into this in two details

00:31:51.000 | in the next slide.

00:31:53.000 | So for example, let's take textual entailment.

00:31:56.000 | This task involves reading a pair of sentences

00:31:59.000 | and judging the relationship between them.

00:32:01.000 | So the relationship could be one of entailment, contradiction,

00:32:03.000 | or neutral.

00:32:04.000 | And a small note is that this task is still challenging

00:32:10.000 | because your model needs to have good understanding

00:32:13.000 | and reasoning of the language, because it

00:32:16.000 | can be confusing sometimes.

00:32:19.000 | So you have your premise, and you have your hypothesis,

00:32:22.000 | and you're trying to classify or try

00:32:24.000 | to predict the relationship as being entailment,

00:32:26.000 | contradiction, or neutral.

00:32:29.000 | So the way to do this is just to concatenate

00:32:32.000 | the premise and the hypothesis.

00:32:34.000 | So this could be a sentence, and this could be a sentence.

00:32:37.000 | You just concatenate them and add a special delimiter token

00:32:40.000 | in between them.

00:32:41.000 | And obviously, you add your start token

00:32:43.000 | at the beginning and your end token at the end.

00:32:46.000 | And just try to train a classifier on top of this.

00:32:49.000 | And your classifier should classify

00:32:52.000 | this sequence of input tokens as one of entailment,

00:32:55.000 | contradiction, or neutral.

00:32:57.000 | So this has become just a classification task

00:33:01.000 | by just doing transformation.

00:33:06.000 | The second task that we cover is semantic similarity.

00:33:09.000 | And I think this is very popular nowadays,

00:33:12.000 | because of retrieval, augmented generation,

00:33:15.000 | and embeddings in general.

00:33:18.000 | So all this cool rag stuff.

00:33:21.000 | So this task is about predicting how semantically

00:33:25.000 | similar two sentences are.

00:33:27.000 | And semantic similarity just means how close in meaning

00:33:30.000 | they are.

00:33:31.000 | Do they talk about the same thing?

00:33:33.000 | Do they mean the same thing?

00:33:35.000 | Do they have similar meaning or not?

00:33:38.000 | And again, this can be challenging,

00:33:40.000 | because you can have two paragraphs that

00:33:43.000 | have very different usage of words,

00:33:45.000 | but they convey the same meaning.

00:33:47.000 | So this can be challenging if your model is not smart enough.

00:33:50.000 | And one note about this task is there

00:33:52.000 | is no inherent ordering of the sentences.

00:33:54.000 | So you can just--

00:33:55.000 | there is no sentence A and sentence B.

00:33:57.000 | You just have two sentences.

00:33:59.000 | Unlike, for example, in the entailment,

00:34:01.000 | you have a very specific order.

00:34:02.000 | Like, you have a distinction between the premise

00:34:05.000 | and the hypothesis.

00:34:06.000 | But here you have just two random sentences.

00:34:09.000 | And the way they approach this is using a Siamese architecture

00:34:14.000 | where we have--

00:34:15.000 | where we generate two input sequences

00:34:18.000 | and pass them through the transformer

00:34:20.000 | and compare between them.

00:34:22.000 | So basically, Siamese architecture

00:34:24.000 | is a fancy way of saying that you

00:34:26.000 | are using the same model twice, or two identical versions

00:34:30.000 | of the model.

00:34:31.000 | So you get your sentence-- first sentence and second sentence,

00:34:34.000 | and then concatenate them and add your special tokens.

00:34:37.000 | And you pass them through the transformer

00:34:39.000 | and you get some vector at the end.

00:34:41.000 | And also, you do the same thing, but you reverse

00:34:44.000 | the order of the sentences.

00:34:47.000 | So you get the second input sequence,

00:34:49.000 | pass it to the transformer, and you get your vector.

00:34:52.000 | So basically, at the end, you're going to have two vectors.

00:34:55.000 | You just add them--

00:34:57.000 | do vector addition on top of them.

00:34:59.000 | And then you just add a head on top of the output vector.

00:35:05.000 | So you just multiply the vector by some layer,

00:35:09.000 | a linear layer, for example, that has parameters w,

00:35:12.000 | and you get your output.

00:35:14.000 | And for example, if you're doing--

00:35:16.000 | like, if you just have--

00:35:18.000 | if you're just interested in knowing

00:35:20.000 | whether these sentences are similar or not similar,

00:35:23.000 | so you have only two labels, you can train a classifier.

00:35:26.000 | If you're interested in having more of a scale of similarity,

00:35:30.000 | like from 0 to 10, you can train a regressor on top of this.

00:35:33.000 | So that's very cool.

00:35:37.000 | We're still using the same architecture, by the way,

00:35:39.000 | just the transformer here.

00:35:41.000 | We're not modifying it.

00:35:42.000 | We're just being smart about how to approach this task.

00:35:48.000 | You can extend this to do question answering.

00:35:51.000 | So let's say you have a question and you have four choices.

00:35:55.000 | Or let's say you have a document.

00:35:56.000 | You have a question about the document,

00:35:58.000 | and you have four potential answers to this document.

00:36:01.000 | And you want your model to pick one answer of these.

00:36:04.000 | So the way to do this is you take your document or context

00:36:08.000 | and add the question and then add the first answer.

00:36:12.000 | And then you get your first input sequence.

00:36:15.000 | And you pass this to the transformer.

00:36:17.000 | And similarly, you get your context or document

00:36:20.000 | and add the question and add the second answer

00:36:22.000 | and make this into a sequence and give it to the transformer.

00:36:25.000 | And you do this for all the answers.

00:36:27.000 | And then you just compare between the scores

00:36:31.000 | the model gives to each of these potential answers.

00:36:36.000 | So for example, we have a Wikipedia article,

00:36:39.000 | and we have a question, and we have answer A.

00:36:42.000 | We concatenate all of these with some special tokens.

00:36:46.000 | And we pass them to the transformer

00:36:49.000 | and to the final linear layer, and we get the score.

00:36:53.000 | So score for answer A and score for answer B

00:36:55.000 | and score for answer C. And just do a softmax

00:36:57.000 | to get a proper probability distribution.

00:37:01.000 | And that's it.

00:37:02.000 | You've got your model to do question answering.

00:37:05.000 | And you can do this also for other tasks

00:37:07.000 | like common sense reasoning.

00:37:10.000 | So any questions about these transformations for each task?

00:37:18.000 | I think there was a question about modifying the input

00:37:22.000 | sequence to both possible orderings,

00:37:24.000 | at least when it comes to semantics in learning.

00:37:27.000 | I think someone had a question about that.

00:37:31.000 | So he just said that he doesn't really

00:37:34.000 | understand what it means to modify the input sequence

00:37:37.000 | for both possible orderings.

00:37:41.000 | Yeah, sure.

00:37:42.000 | So if you're doing textual entailment,

00:37:45.000 | you have distinction between the premise and the hypothesis.

00:37:47.000 | Like you have a premise and a hypothesis.

00:37:52.000 | So you can just mix them up.

00:37:54.000 | You have a very specific premise and a very specific hypothesis.

00:37:59.000 | And the ordering here matters.

00:38:00.000 | Like you should put the premise first, and then

00:38:02.000 | the special token, and then the hypothesis.

00:38:04.000 | You can put the hypothesis first, and then the delimiter,

00:38:07.000 | and then the premise.

00:38:08.000 | So this is what they mean by the ordering here matters.

00:38:12.000 | But for semantic similarity, you just have two sentences.

00:38:16.000 | There is no inherent ordering for the semantic similarity

00:38:20.000 | task.

00:38:23.000 | I hope this answers the question.

00:38:27.000 | I think it did.

00:38:29.000 | He said, ah, I see.

00:38:30.000 | Thanks.

00:38:30.500 | I think probably that's about it.

00:38:33.000 | I don't see any other questions.

00:38:36.000 | He said, does it also combat things like positional biases

00:38:39.000 | and transformers by switching the order of the sentences

00:38:42.000 | itself?

00:38:43.000 | And you've added two quantities.

00:38:45.000 | Yeah, exactly.

00:38:46.000 | I think, yeah.

00:38:47.000 | I think this is one of the motivations they did this.

00:38:50.000 | They want to say, there is no inherent order for this task.

00:38:54.000 | So maybe the transformer will just have a bias.

00:38:57.000 | So let's try both ways.

00:38:58.000 | Like let's give it the first sentence, and then the second.

00:39:01.000 | And let's give it the other way around

00:39:05.000 | to get over the positional bias.

00:39:08.000 | Because I think for some models, they

00:39:12.000 | pay more attention to the last few tokens in the input.

00:39:17.000 | And they disregard the earlier tokens.

00:39:19.000 | This tends to happen sometimes, yes.

00:39:23.000 | I think, yeah.

00:39:24.000 | I think this is a good way to think about this.

00:39:29.000 | Awesome.

00:39:30.000 | I think-- I don't see any other questions in the chat.

00:39:33.000 | So I can just service them as and when they come.

00:39:35.000 | So I think we should be good for now by the looks of it.

00:39:38.000 | Sure.

00:39:39.000 | Sure.

00:39:39.500 | So that was the whole approach.

00:39:41.000 | We can now discuss some details about the training.

00:39:44.000 | And back then, OpenAI actually did release info

00:39:47.000 | about their training and models.

00:39:49.000 | They don't do this now, unfortunately.

00:39:52.000 | But anyways, so the data set they use for training

00:39:55.000 | is BookCorpus for training the language model.

00:39:58.000 | So this is in step one, which is unsupervised training.

00:40:01.000 | BookCorpus.

00:40:03.000 | And back then, this was huge.

00:40:04.000 | It has 7,000 unique unpublished books from different genres.

00:40:08.000 | So the variety here helps as well.

00:40:11.000 | They have adventure, fantasy, and romance.

00:40:13.000 | So kind of like a very good, big, diverse data set.

00:40:18.000 | And the main advantage of this data set,

00:40:21.000 | and the reason they chose it, is because it

00:40:23.000 | has long stretches of text.

00:40:25.000 | So if you have a book or a novel,

00:40:27.000 | you have a paragraph that's maybe 10 lines or more.

00:40:31.000 | And this helps the model to learn

00:40:33.000 | how to handle long-range information

00:40:35.000 | and how to handle long context.

00:40:37.000 | For example, there is also another data set

00:40:39.000 | that's called WordBenchmark, which is also big and diverse.

00:40:43.000 | But it doesn't have this long context.

00:40:46.000 | It's just a bunch of small sentences, I would say.

00:40:51.000 | And this way, your model will not

00:40:54.000 | learn how to handle long context.

00:40:56.000 | And a side note here is like ELMo is also

00:41:00.000 | a very seminal work in NLP.

00:41:03.000 | And it falls under the same domain

00:41:05.000 | of unsupervised pre-training and having good embeddings.

00:41:10.000 | So it's one of the fundamental papers and works in NLP.

00:41:15.000 | So they say their model achieves a very low token

00:41:18.000 | level perplexity of 18.4 on this corpus.

00:41:22.000 | But I don't think this is actually low today.

00:41:25.000 | I think perplexity of 18 is a bit high.

00:41:28.000 | But I'm not sure.

00:41:30.000 | And their model is about--

00:41:32.000 | their data set is about 5 gigabyte in size.

00:41:34.000 | So not quite big by today's standard, of course.

00:41:39.000 | The architecture, as we mentioned,

00:41:41.000 | they have a transformer model.

00:41:42.000 | They use a tokenizer.

00:41:43.000 | They use byte-pair encoding back then, which is also, I think,

00:41:49.000 | what's used right now.

00:41:50.000 | So this is still not changed from today.

00:41:53.000 | They have a surprisingly big vocab size

00:41:57.000 | by their time's standard.

00:41:59.000 | Like, they have 40,000 tokens.

00:42:01.000 | I think Llamatu had also something

00:42:04.000 | that's in the same range, like 40,000, 50,000 tokens.

00:42:07.000 | So back then, this was actually quite big, I would say.

00:42:11.000 | And they used the FTFY library to clean the raw text

00:42:14.000 | and then do some standardization using spaCy tokenizer.

00:42:19.000 | So this is good work in the tokenization area, I would say.

00:42:25.000 | And their model is just a typical transformer model,

00:42:29.000 | like typical by today's standard, just a transformer,

00:42:32.000 | decoder-only transformer with, like, 12 layers and mask

00:42:35.000 | self-attention.

00:42:36.000 | And it has a very big size of 117 millions.

00:42:40.000 | And this was actually big back then, although this

00:42:43.000 | is trivial now, of course.

00:42:45.000 | Their embedding, they used learned positional embeddings

00:42:48.000 | compared to the sinusoidal embedding

00:42:51.000 | in the original transformer paper.

00:42:53.000 | So this was actually much simpler to implement.

00:42:57.000 | And the model just has to learn the embeddings in training.

00:43:01.000 | They have a context size of 512.

00:43:04.000 | Again, back then, that was very big.

00:43:07.000 | And they also used tied input and output token embeddings,

00:43:10.000 | as we mentioned.

00:43:11.000 | And for the attention block, they have 12 attention heads.

00:43:14.000 | Each head has 64--

00:43:17.000 | each head has a dimension of 64 for a total of 768 dimensions

00:43:24.000 | for the entire attention block.

00:43:27.000 | And after the attention, you have the MLP layer,

00:43:29.000 | also known as position-wise feedforward network.

00:43:32.000 | And the size of the inner state of this network is 3,072--

00:43:38.000 | the size is 3,072.

00:43:41.000 | And this means that, actually, this is an expansion

00:43:43.000 | and then contraction network.

00:43:46.000 | So you go from 768 to 3K, and then you

00:43:49.000 | go back from 3K to 768.

00:43:51.000 | And all this is happening inside this MLP.

00:43:53.000 | And the activation layer they used

00:43:55.000 | is the GLO activation layer.

00:43:59.000 | The optimizer, they used Adam, which was becoming

00:44:03.000 | very popular back then.

00:44:04.000 | I think Adam was developed in 2014, 2015.

00:44:09.000 | So it was getting a lot of traction back then.

00:44:11.000 | And the maximum learning rate is this learning rate,

00:44:14.000 | which is also very popular nowadays.

00:44:16.000 | I think this was used in one of the Lama-Tung models.

00:44:20.000 | So this is very familiar.

00:44:23.000 | They do warm-up as well from 0 to the maximum learning

00:44:26.000 | rate over 2K steps.

00:44:29.000 | And then they just cosine annealing

00:44:32.000 | to-- they do cosine annealing from the maximum learning rate

00:44:35.000 | to 0, so just learning rate decay.

00:44:39.000 | Their compute, they used one machine

00:44:43.000 | that has 8x P600 GPU.

00:44:47.000 | This is the same family of GPUs as P100, I think.

00:44:50.000 | But I don't have much information about this GPU.

00:44:53.000 | They train for 30 days, which actually is not bad.

00:44:59.000 | That's not very long.

00:45:00.000 | But back then, this was very long, I think.

00:45:03.000 | Their utilization dimension is 0.33.

00:45:06.000 | And the total flops is 0.6 petaflop,

00:45:10.000 | so almost one petaflop.

00:45:12.000 | And for reference, I think the tiny grad machine

00:45:15.000 | is trying to give you one petaflop.

00:45:19.000 | Or I could be wrong.

00:45:21.000 | Yeah, I could be wrong about this.

00:45:23.000 | So anyway, the compute they used is almost one petaflop days.

00:45:28.000 | And this is the way they calculated this.

00:45:31.000 | And this is actually how the model looks

00:45:33.000 | like if you try to do it in the transformers framework.

00:45:36.000 | So you have token embedding, position embedding,

00:45:38.000 | and then some dropouts.

00:45:40.000 | And then you have your actual transformer blocks.

00:45:44.000 | And this is the architecture of the language model.

00:45:48.000 | So there is no head in here.

00:45:50.000 | So the block is just attention, and then layer norm,

00:45:53.000 | and then MLP, and then layer norm.

00:45:55.000 | So if there are no questions, we can move on

00:46:02.000 | to the second step, which is supervised fine tuning.

00:46:07.000 | There was one question from Sean,

00:46:09.000 | which was, are perplexity numbers comparable

00:46:11.000 | across different models?

00:46:13.000 | And I think we discussed just now

00:46:15.000 | that the model itself has a perplexity of 18.4.

00:46:18.000 | And so in this case, would it be OK to compare--

00:46:23.000 | is it a metric that's invariant across different models

00:46:28.000 | in this sense?

00:46:31.000 | Yeah, so that's a good question.

00:46:32.000 | I think perplexity is just a metric.

00:46:34.000 | Like, you're going to measure perplexity

00:46:36.000 | on a certain data set.

00:46:37.000 | Like, let's say you have this data set that

00:46:39.000 | is 1 trillion tokens, or 1 billion tokens.

00:46:42.000 | And you measure perplexity of GPT-1 on this data set.

00:46:46.000 | And you can measure the perplexity of Lama2

00:46:48.000 | on this data set.

00:46:49.000 | And I think you can compare this metric.

00:46:52.000 | It's like saying Lama2 has a human eval of 70,

00:46:57.000 | and GPT-4 has a human eval of maybe 90.

00:47:00.000 | It's not totally fair, but we can do it, I guess.

00:47:03.000 | We can get away with doing it.

00:47:05.000 | Yeah, but it might not be 100% fair.

00:47:07.000 | Yeah, so that's a good point.

00:47:13.000 | OK, cool.

00:47:14.000 | I think that's probably the only question.

00:47:16.000 | So I think we should move on to supervised fine-tuning.

00:47:20.000 | So the second step is supervised fine-tuning.

00:47:22.000 | And they use these data sets for this task.

00:47:24.000 | So for sentence similarity, they use these data sets.

00:47:28.000 | And for classifications, they used COLA,

00:47:30.000 | which I think is popular now as well.

00:47:34.000 | I'm not going to go into details about this,

00:47:36.000 | because I don't have much information about these.

00:47:40.000 | And the architecture is just use the same backbone

00:47:43.000 | as pre-training.

00:47:44.000 | And you add your head, mostly classifier head,

00:47:48.000 | with a dropout of 0.1.

00:47:50.000 | They train for three epochs with a batch size of 32.

00:47:53.000 | This is also still standard nowadays as well.

00:47:57.000 | People usually train, do fine-tuning for three epochs

00:48:00.000 | with a batch size of maybe 32 or 64.

00:48:03.000 | So yeah, that's common nowadays as well.

00:48:06.000 | The learning rate is 6.25, 8.25.

00:48:10.000 | And also, people use very similar learning rates nowadays

00:48:15.000 | as well.

00:48:17.000 | So this is very familiar.

00:48:18.000 | This architecture for fine-tuning

00:48:21.000 | is very familiar, even today.

00:48:24.000 | And they also use learning rate decay with a warm-up.

00:48:28.000 | And when they use the auxiliary language,

00:48:32.000 | the auxiliary objective, they used a lambda of 0.5.

00:48:36.000 | So the weight of the auxiliary language modeling objective

00:48:39.000 | was 0.5.

00:48:41.000 | But this is not popular nowadays.

00:48:43.000 | Most people don't do this.

00:48:44.000 | Any question about this?

00:48:53.000 | I think we should--

00:48:54.000 | Before we go to the benchmarks?

00:48:55.000 | Yeah, the chat.

00:48:56.000 | There's no questions in the chat.

00:48:58.000 | So I think we can just move on to the benchmarks.

00:49:01.000 | Sure.

00:49:02.000 | So the benchmark is like--

00:49:04.000 | they do a lot of SOTA performance on many tasks.

00:49:07.000 | Like you can see here, almost every single task,

00:49:10.000 | except this one.

00:49:11.000 | And there are ones where they have significant improvement,

00:49:14.000 | like this in QNLI.

00:49:17.000 | They have like 6% absolute improvement.

00:49:21.000 | I think-- let's go back a bit.

00:49:23.000 | QNLI-- where is it?

00:49:27.000 | In here.

00:49:28.000 | I think the QNLI is more of a natural language understanding

00:49:38.000 | where it requires reasoning.

00:49:39.000 | And this is where they make the most improvement, I think.

00:49:43.000 | But overall, they are doing very good performance on many tasks.

00:49:49.000 | And they are comparing to LSTMs and other models.

00:49:52.000 | So good news.

00:49:53.000 | GPT-1 tends to be a good model.

00:49:56.000 | Yeah, this is for natural language inference.

00:49:59.000 | For question answering and common sense,

00:50:01.000 | they also make very big gains, as you can see here.

00:50:05.000 | So 76 compared to 86.

00:50:10.000 | They compare also to multiple models.

00:50:13.000 | And one of them is actually an ensemble of nine models,

00:50:16.000 | as they denote here by 9x.

00:50:20.000 | And also, the same happens for semantic similarity

00:50:23.000 | and classification.

00:50:24.000 | Although, I would say the boost in performance

00:50:28.000 | is not as big as the previous ones.

00:50:30.000 | They are good on some metrics and bad on other metrics.

00:50:39.000 | So this is my favorite part of the paper.

00:50:45.000 | They did-- after the benchmarks, they now have a good model.

00:50:48.000 | So they are trying to understand why their model is good

00:50:51.000 | and why their GPT-1 is suddenly SOTA.

00:50:56.000 | So they do analysis and try to understand

00:51:00.000 | why this is happening.

00:51:02.000 | So the first step is they're trying

00:51:04.000 | to analyze the impact of the transfer learning,

00:51:07.000 | the number of layers transferred from the pre-trained model

00:51:10.000 | to the fine-tuned model.

00:51:13.000 | So they just take the pre-trained transformer,

00:51:16.000 | and they take all the layers and do fine-tuning.

00:51:20.000 | And then they take 11 layers, and then they do fine-tuning.

00:51:23.000 | And then they take 10 layers and do fine-tuning, and so on.

00:51:26.000 | They compare the performance of these models.

00:51:28.000 | And as you can see, the more layers you add,

00:51:30.000 | the more performance you get.

00:51:32.000 | And if you do zero layers transferred,

00:51:35.000 | you're not taking any of the pre-trained weights.

00:51:39.000 | You're just starting from scratch.

00:51:41.000 | You get this performance.

00:51:42.000 | But if you just add one layer, you

00:51:44.000 | get a very significant boost.

00:51:46.000 | The solid lines here are the dev data sets.

00:51:50.000 | The dashed line is the training data set.

00:51:52.000 | So you can see that a very big improvement in performance

00:51:54.000 | just by adding one layer.

00:51:56.000 | And this is almost like a continuous trend.

00:52:00.000 | So they observe the obvious fact that transferring more layers

00:52:06.000 | improves performance.

00:52:07.000 | And just only transferring the embedding layers

00:52:10.000 | also improves performance significantly.

00:52:13.000 | And they mentioned that each layer adds a boost of 9%

00:52:17.000 | for each layer you add on this task, multi-NLI.

00:52:22.000 | And this analysis actually indicates

00:52:24.000 | that each layer in the pre-trained model

00:52:26.000 | actually has a purpose.

00:52:27.000 | And it learns something that's useful.

00:52:30.000 | And it's very helpful in the fine-tuning model.

00:52:34.000 | So different layers learn different things.

00:52:37.000 | They learn different scales.

00:52:38.000 | And they are all helpful.

00:52:40.000 | This is a very good finding of this work.

00:52:44.000 | So each layer in the pre-trained model

00:52:46.000 | contains useful information on functionality

00:52:48.000 | for solving for target tasks like classification.

00:52:51.000 | The second piece of analysis they did is zero-shot evaluation.

00:52:59.000 | And this was, I would say, radical back then.

00:53:03.000 | They are trying to evaluate how good the pre-trained model is

00:53:08.000 | on these tasks without even fine-tuning.

00:53:11.000 | And they want to answer the questions like,

00:53:13.000 | why is the language model pre-training effective?

00:53:16.000 | Why does pre-training a transformer

00:53:20.000 | on language modeling help us when we are doing classification?

00:53:24.000 | They have a hypothesis that the underlying generative model

00:53:28.000 | learns actually multiple tasks in the pre-training.

00:53:31.000 | So it's not just learning language modeling.

00:53:33.000 | Because if you think about it, if you're

00:53:36.000 | learning how to model language properly,

00:53:40.000 | you're more likely to have a deeper understanding

00:53:43.000 | of the language and just natural language.

00:53:47.000 | Like, for example, if you speak only English and French,

00:53:53.000 | and you cannot speak, let's say, Spanish,

00:53:55.000 | you won't be able to do classifications in Spanish.

00:53:59.000 | But if you speak English and you speak French,

00:54:01.000 | you probably have the understanding

00:54:03.000 | to do these tasks.

00:54:05.000 | So this is something cool and something to think about a bit.

00:54:10.000 | Another thing is attention is very helpful.

00:54:13.000 | Transformers show very big improvement compared to LSTMs.

00:54:18.000 | Or that's what they are hypothesizing about

00:54:21.000 | and what they are trying to test.

00:54:23.000 | So to test these two hypotheses, they

00:54:25.000 | designed a series of heuristics.

00:54:27.000 | They basically tried to evaluate the pre-trained model

00:54:32.000 | on these tasks before doing the supervised fine-tuning.

00:54:36.000 | And they have different modifications

00:54:38.000 | for each data set and each task.

00:54:42.000 | For example, for linguistic acceptability with the COLA,

00:54:46.000 | they used the examples.

00:54:48.000 | And they take the average token log probability

00:54:51.000 | for each token and use this as a score.

00:54:54.000 | And they have a threshold.

00:54:56.000 | And they just determine if the model got this right or wrong.

00:54:59.000 | Something that's very cool is how

00:55:03.000 | they handle sentiment analysis.

00:55:05.000 | So you've got your, let's say, Amazon review.

00:55:08.000 | And you append the token vary to the review

00:55:11.000 | and restrict the language to generate only two tokens, one

00:55:15.000 | of two tokens, positive or negative.

00:55:17.000 | And you see which token has the higher score or higher

00:55:20.000 | probability.

00:55:21.000 | And this is the prediction of the model on this sentiment

00:55:24.000 | or this paragraph.

00:55:26.000 | So just restrict the language model prediction

00:55:29.000 | to these two tokens and see which one is higher.

00:55:34.000 | This is actually pretty cool.

00:55:35.000 | It's kind of similar to constrained grammar

00:55:38.000 | and constrained output.

00:55:41.000 | Another example is question answering.

00:55:43.000 | So for each answer, as you said, concatenate the document

00:55:46.000 | question and answer in the input.

00:55:48.000 | And you average the token log probability of just the answer

00:55:52.000 | and use this as the score for this answer.

00:55:54.000 | And do this for multiple answers and just

00:55:56.000 | pick the one with the highest score.

00:55:58.000 | And for winning grad schemas, for this task,

00:56:06.000 | let's say you have a sentence that

00:56:08.000 | says the city councilman refused the demonstrators a permit

00:56:11.000 | because they advocated violence.

00:56:13.000 | So the goal here is to find this word "they."

00:56:17.000 | What does it refer to?

00:56:18.000 | Does it refer to the demonstrators

00:56:20.000 | or the city councilman?

00:56:22.000 | This is what they mean by winning grad schema.

00:56:24.000 | And I think it's a very popular task.

00:56:27.000 | So the way they do this is they take this word "they"

00:56:30.000 | and replace it with the two possible answers.

00:56:33.000 | In this case, councilman and demonstrators.

00:56:35.000 | So you have example A, you use the word "councilman."

00:56:40.000 | And example B, use the word "demonstrators."

00:56:42.000 | And just average the token probability

00:56:47.000 | of what comes after this word.

00:56:49.000 | And then you just take the one with the higher probability

00:56:55.000 | score.

00:56:58.000 | So this is how they wanted to evaluate the model.

00:57:01.000 | Let's see, actually, the evaluation results.

00:57:04.000 | And surprise, surprise, the longer

00:57:08.000 | you train the model and the more data you give the model,

00:57:13.000 | the more it tends to perform on this task.

00:57:15.000 | So the better zero-shot performance it has, basically.

00:57:19.000 | This is a very cool graph.

00:57:21.000 | So we have different tasks here.

00:57:23.000 | And the solid line is the GPT transformer.

00:57:26.000 | And you can see with more steps, more training steps,

00:57:29.000 | the better performance you have.

00:57:31.000 | And they also did a very cool analysis

00:57:33.000 | where they trained an LSTM, which is the dashed line.

00:57:36.000 | And you can see that for almost every single task,

00:57:39.000 | the transformer is better than the LSTM.

00:57:43.000 | So they observed the obvious that more training gives

00:57:47.000 | better performance.

00:57:49.000 | More pre-training gives better performance,

00:57:51.000 | even in zero-shot settings.

00:57:54.000 | And this suggests that generative pre-training

00:57:57.000 | actually-- in generative pre-training,

00:58:00.000 | the model learns a lot of tasks, not just language modeling.

00:58:03.000 | And they mentioned something here.

00:58:05.000 | They said they observed that the LSTM exhibits higher variance

00:58:08.000 | in its zero-shot performance, suggesting

00:58:10.000 | that the inclusive bits of the transformer assist in transfer.

00:58:13.000 | But I'm not sure what they mean by high variance

00:58:16.000 | in this graph, actually.

00:58:18.000 | So if anyone has any clue about this, please share it with us.

00:58:21.000 | Yeah, it's better less than D1, and GPUs are all we need.

00:58:29.000 | Sadly.

00:58:36.000 | So the final section in the paper

00:58:40.000 | is the appellation studies they did.

00:58:43.000 | They have three appellation studies.

00:58:45.000 | They want to first see the effect

00:58:47.000 | of the auxiliary language modeling objective

00:58:50.000 | during fine-tuning.

00:58:52.000 | So in this case, we have the normal transformer training

00:58:58.000 | with auxiliary language modeling, which is row 1.

00:59:01.000 | And we have the one without auxiliary language modeling,

00:59:04.000 | which is row 3.

00:59:06.000 | They say that the trend suggests that larger datasets benefit

00:59:12.000 | from auxiliary training compared to smaller datasets.

00:59:15.000 | And this is kind of like actually probably

00:59:18.000 | why people stop doing auxiliary language modeling.

00:59:20.000 | It doesn't seem to be quite important, probably.

00:59:26.000 | The second appellation study is the effect of the transformer.

00:59:29.000 | So they just trained the usual transformer,

00:59:32.000 | and they compared it with an LSTM, so row 1 and row 4.

00:59:38.000 | And you can see, yeah, the transformer

00:59:40.000 | has generally better performance than the LSTM on most tasks.

00:59:47.000 | So they mentioned that they observed a 5.6 average drop

00:59:50.000 | when they used the LSTM.

00:59:54.000 | And it only outperforms the transformer

00:59:56.000 | on one dataset, which is MRPC.

01:00:01.000 | I'm not sure what this is.

01:00:03.000 | The second appellation study, and I

01:00:05.000 | would say the most important one,

01:00:06.000 | is the effect of pre-training.

01:00:08.000 | So they compared the transformer that

01:00:10.000 | has been trained in the framework they proposed,

01:00:13.000 | the two-step framework.

01:00:14.000 | And they also compared it to a transformer that

01:00:17.000 | was directly trained on the supervised task, that

01:00:20.000 | is, without any pre-training.

01:00:21.000 | So we have row 1 and row 2.

01:00:23.000 | You can see that this is where the huge difference shows up,

01:00:26.000 | like 74 compared to 59, 45 compared to 18, 88 compared

01:00:32.000 | to 71.

01:00:33.000 | So this is a very big difference.

01:00:36.000 | Yeah, and we observed that no pre-training actually

01:00:39.000 | hurts the performance quite a lot on almost all tasks.

01:00:44.000 | And this results in 15% decrease.

01:00:48.000 | And this is actually the worst-performing model

01:00:50.000 | out of all these models.

01:00:52.000 | Yeah.

01:00:53.000 | So I would say that the conclusion

01:00:55.000 | from this appellation study is that pre-training

01:00:58.000 | is very, very important.

01:00:59.000 | I think this is the final section before the future

01:01:07.000 | studies do you want to discuss.

01:01:10.000 | Any questions so far?

01:01:11.000 | I think chat seems good.

01:01:17.000 | Do you want to-- you have one more slide, I think.

01:01:19.000 | You said future studies.

01:01:20.000 | Yeah.

01:01:21.000 | Do you want to just finish it up?

01:01:23.000 | Then we can open it up to questions if people--

01:01:26.000 | Sure.

01:01:27.000 | Yeah.

01:01:28.000 | Sure.

01:01:29.000 | So this section, I don't think it was in the paper.

01:01:31.000 | I think it was only on the blog post.

01:01:33.000 | They discuss what future work could be.

01:01:37.000 | And the first approach is surprise, surprise,

01:01:40.000 | just scale up.

01:01:42.000 | So they mentioned that they noticed improvement

01:01:44.000 | in language modeling.

01:01:45.000 | And it correlates with downstream tasks.

01:01:48.000 | And they're only using very limited hardware,

01:01:50.000 | many GPUs on a machine and training

01:01:52.000 | on a very small data set.

01:01:54.000 | So maybe there is room for improvement

01:01:56.000 | if you scale the model, the training, and the data set.

01:01:58.000 | And I think we know the answer to this question,

01:02:01.000 | or the answer to this hypothesis.

01:02:04.000 | The second approach is--

01:02:07.000 | the second futuristic approach they want to try

01:02:09.000 | is try to see if there is-- you can tweak the fine-tuning

01:02:14.000 | approach instead of just doing vanilla fine-tuning.

01:02:17.000 | You can use adaptation or one of the other fancy ways

01:02:21.000 | of doing fine-tuning.

01:02:23.000 | I don't think this is as important as they might

01:02:28.000 | have thought about this.

01:02:29.000 | Because people right now are doing just simple fine-tuning,

01:02:32.000 | and it works.

01:02:34.000 | So yeah, not quite as promising as the first approach, which

01:02:37.000 | is just scaling up.

01:02:39.000 | And the third one is understanding

01:02:41.000 | of why generative pretraining helps.

01:02:44.000 | And they've done some ablations and analysis about this.

01:02:48.000 | They want to do even more further targeted experiments

01:02:51.000 | and research to understand this.

01:02:53.000 | So basically, observability and explainability

01:02:56.000 | in machine learning.

01:02:59.000 | And one very good question they ask

01:03:01.000 | is, how much does longer context help compared

01:03:06.000 | to just improved world knowledge when

01:03:09.000 | you are doing pretraining?

01:03:11.000 | So for example, this GPT model is

01:03:13.000 | able to process a longer context because it's a transformer.

01:03:18.000 | So is this the thing that makes it softer performance?

01:03:21.000 | Or is it the fact that they train on a bigger data set

01:03:26.000 | with a longer training time, and it obtains world knowledge?

01:03:30.000 | And I think both are important, I would say.

01:03:33.000 | But that's a very good question.

01:03:37.000 | So yeah, that was it in the GPT.

01:03:39.000 | They introduce a framework for achieving SOTA performance

01:03:42.000 | by doing two-step approach, do pretraining,

01:03:44.000 | and then fine-tuning.

01:03:45.000 | The goal of pretraining is to obtain a very good world

01:03:53.000 | knowledge and a very good starting point.

01:03:55.000 | And you do the pretraining on a diverse corpus

01:03:58.000 | with long stretches of text.

01:04:00.000 | So the model acquires, actually, world knowledge.

01:04:02.000 | And then you actually do transfer learning

01:04:04.000 | by fine-tuning on tasks like question answering and so on.

01:04:08.000 | And the results is that they improve

01:04:11.000 | the state-of-the-art performance online data sets out of 12.

01:04:15.000 | And they successfully have utilized

01:04:18.000 | the unsupervised approach to do transfer learning

01:04:21.000 | to discriminative tasks.

01:04:23.000 | So now we have a clear way of how

01:04:24.000 | to do unsupervised training and semi-supervised training.

01:04:31.000 | And the work also highlights that transformers

01:04:35.000 | is a very good architecture, and larger data sets are good.

01:04:39.000 | And this is a very important push

01:04:41.000 | in the direction of scaling up and pretraining.

01:04:44.000 | And this is actually what people are

01:04:46.000 | going to do for the next six years, from 2018 to 2024.

01:04:52.000 | Yeah, so very good paper, very good work.

01:04:56.000 | And with that, we've come to the end of this paper.

01:04:58.000 | So if you have any questions, I think we can take questions now.