Back to Index

Breaking down the OG GPT Paper by Alec Radford


Transcript

Okay. Sure. So, hey everyone. My name is Amged. I'm a machine learning engineer. I generally do ML consulting for startups. I help them ship AI-powered products, especially in the field of NLP and speech-to-text applications. And I run a blog where I publish posts about ML stuff.

So feel free to check it out. I've done some posts about Whisper. Yeah. So with that out of the way, let's get directly to what we want to discuss today, which is the GPT-1 paper by the folks at OpenAI. So the paper is titled "Improving Language Understanding by Generative Pre-Training".

It was published in June 2018. And these are like the authors. Let me switch to presentation mode. Yeah. And these are the authors of the paper. So mainly Alec Radford and Ilya Sutskever, very well-known researchers in the ML field. So let's get started. Back in 2018, deep learning was becoming very popular.

But the main thing with deep learning is like it is very, very data hungry. There are like good news and bad news. The good news is like there is a ton of data available everywhere on the Internet and just online. We have like tons and tons of data. The bad news is this data is not annotated and it's not curated and it's basically very, very messy.

And if you wanted to train machine learning models back then, the only way to do it was just to annotate data yourself or hire data annotators. And these tend to be very, very expensive and difficult to scale and hire. Like if you think GPUs are expensive, you have not worked with data annotators.

That's what I like to say. So this makes deep learning very restricted. You cannot use it everywhere. Like you're only restricted to fields that have good, high-quality annotated datasets. And this is a very big bottleneck. So people back then in 2018 were trying to solve this problem.

Like how do we get over the fact that we need labeled data? And one potential solution to this problem is like unsupervised learning. So the question is, what if we can leverage like just the linguistic information from the unlabeled data? So we have like a bunch of text, like novels, books, articles.

How can we leverage like the linguistic information from this? And the answer to this could be using unsupervised learning. And if you can do this successfully, this alleviates the need for large amounts of labeled data. Because basically you can utilize Wikipedia, which is very big, or even all the papers published on arXiv and so on.

And even if you have like lots of labeled data, using unsupervised learning as a step one or step zero is going to make your model actually perform better than just training on this labeled data. Because of transfer learning, where you can transfer the things that your model has learned during pre-training into your actual objective.

And good evidence of this is that back in 2017 and 2018, or even a bit before 2017, people had been using pre-trained word embeddings. Things like word2vec, GloVe, fastText, to achieve very good performance on many tasks like classification and machine translation. So this is good evidence that unsupervised learning actually works and it's a very good approach.

So this is the premise of the paper actually, like unsupervised learning is very promising. So let's talk a bit about word embedding. So the idea behind word embedding is like you want to project words, so just text, into an n-dimensional space. And n is usually 300 or 1,000 back then.

And this space has a very special property that words that are similar in meaning have very similar vectors. And by similar vectors I mean like you can measure similarity by things like cosine similarity, dot product, or L2 distance. So we can capture similarity between words that don't resemble each other but have similar meanings.
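To make "similar vectors" concrete, here is a tiny sketch with made-up 4-dimensional vectors; real embeddings would be 300-dimensional or more and come from a trained model like word2vec, but the similarity measure works the same way:

```python
import numpy as np

# Toy word vectors, invented for illustration only; real ones come from a
# trained embedding model and are much higher-dimensional.
booking     = np.array([0.9, 0.1, 0.3, 0.7])
reservation = np.array([0.8, 0.2, 0.4, 0.6])
banana      = np.array([-0.5, 0.9, -0.1, 0.2])

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(booking, reservation))  # high: similar meanings
print(cosine_similarity(booking, banana))       # lower: unrelated words
```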

For example, the words "booking" and "reservation". Just from the syntax, they are very different words, but they have very similar meanings. We're just booking something. And similarly, the words "adam" and "sgd". These are like completely different words, but they are both actually optimizers used in machine learning.

So they are very similar and their vectors are going to be probably similar. The most common implementation of word embeddings is word2vec by Google. This is what popularized the usage of word embeddings. And then GloVe by Stanford and fastText by Facebook. And the way these word embedding models are trained is by leveraging co-occurrence between words that have similar contexts.

For example, similar words tend to occur in similar contexts. Like you would generally find the word "adam" associated with learning rate or machine learning. Similarly, "sgd" is associated with learning rate as well. So you can conclude that these two words are kind of similar. And the way these word embeddings are used is by training a head on top of them.

Like for example, if you want to classify a word as being positive or negative, having a positive sentiment or a negative sentiment, you can train a classification head on top of the frozen embeddings. So you use the word embeddings as fixed input features. Just frozen things, without training the word embeddings themselves.
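Here is a minimal PyTorch sketch of that setup, with a random matrix standing in for real pretrained word2vec/GloVe vectors; the point is that only the classification head has trainable parameters:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, num_classes = 10_000, 300, 2

# Stand-in for a pretrained embedding matrix; in practice you would load
# real word2vec or GloVe vectors here.
pretrained_vectors = torch.randn(vocab_size, embed_dim)

class FrozenEmbeddingClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # freeze=True: the embeddings are fixed features, never updated
        self.embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
        self.head = nn.Linear(embed_dim, num_classes)  # only this part trains

    def forward(self, token_ids):
        vectors = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        pooled = vectors.mean(dim=1)          # average the word vectors
        return self.head(pooled)              # (batch, num_classes)

model = FrozenEmbeddingClassifier()
logits = model(torch.randint(0, vocab_size, (4, 12)))  # batch of 4 sequences
print(logits.shape)  # torch.Size([4, 2])
```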

And this has a significant drawback, or a bunch of significant drawbacks. The first thing is it doesn't utilize the context of the text, so you're just using the word itself. And some words, even the same word, can have very different meanings depending on the context. For example, "bank" as in a river bank, like the bank of the Amazon, is different from "bank" as in HSBC or JPMorgan, even though it is the exact same word.

So if you're going to use just word2vec, these two usages will have the same vector, even though they have very different contexts. And even beyond this, natural language has nuances that cannot be captured by word vectors alone. Like the way people write stylized spellings of "OpenAI" online carries a very specific intonation compared to just writing OpenAI in the normal way.

So are there any questions about word embeddings so far? Someone says in the chat, nope. I cannot read the chat, so if just someone can say this using the microphone, that'd be great. I don't think there are any questions. I think Sean said no. But if anyone has any questions, you can maybe drop it in the chat, and then I can just surface them as and when.

But it seems like everyone's okay with it for now. I'll keep an eye on the chat for you. Yeah, great. Yeah, thank you. That's appreciated. So this was word embedding. The main limitation or drawback is you cannot use context. So can we go beyond word embedding? Word embeddings are too local.

We need something that's more global that can capture the higher-level semantics. But this also has its complications. How do you leverage more than word-level information from unlabeled text? You're going to have some questions that you need to answer. First is, which objective should you use while training? Do you want to use language modeling or machine translation or discourse coherence or something else?

And in 2024 right now, we definitely know the answer. Language modeling works very well because this is what we've been using for the past three years. But back then in 2017 and 2018, this was not very clear. For example, this paper came out before BERT. And even the original transformer paper, "Attention Is All You Need", used machine translation.

So back then, this definitely was not very obvious to people. The second question is, how should we transfer these embeddings to the target task? There are a bunch of ways to do this. Do you want to modify the model architecture? Each task is going to require its own specific modification to the model itself.

This is one approach, and this requires very deep knowledge of model architecture and being just a wizard to modify the architecture. The second approach is to use a specific recipe or schema to do transfer learning. A very popular example of this is ULMFiT by Jeremy Howard and Sebastian Ruder.

We're going to cover this in the next two slides, I think. The third option is also you can add an auxiliary learning objective during pre-training. While you're pre-training on language modeling, you can have an auxiliary learning objective like machine translation or discourse coherence. These are some of the approaches you might want to take when you are deciding on the target task.

All these questions made going beyond word embeddings not so straightforward. They made it difficult to utilize semi-supervised learning or unsupervised pre-training. Let's take a look at ULMFiT and how they did this. This is a paper titled "Universal Language Model Fine-tuning for Text Classification." Their objective is text classification, but they want to also build a universal language model.

This is a very seminal work in NLP. It's a very well-known paper and it has a big impact. The question they are raising is instead of just utilizing the word embeddings, like you're going to have a classifier, you're going to have an embedding layer and a classification layer. The old way of doing this, you're going to use the embedding layer from word2vec and you are going to randomly initialize the classification head.

They are asking why not just have a good initialization point for all the layers, not just the embedding layer. Their answer to this question is an approach called ULMFiT. It is a three-step recipe for state-of-the-art text classification back then. We have three steps. The first step is to train the language model on general domain data.

We call this pre-training on a large corpus these days. You just train your language model on a very big corpus like WikiText. This was big back then, but it's probably small now. You can do pre-training on 15 trillion tokens of data if you are a big organization like Meta, for example, or even more.

This is the first step. The second step is you fine-tune the language model on your target data. You keep doing language modeling in this step. This is similar to what we call continued pre-training on the target domain. You just take your Llama 2 70B base and you just do language modeling on, let's say, financial books to try to make your model like BloombergGPT, for example.

But you're still doing language modeling. You're not doing any task-specific training. The third and final step is to train a classifier on your labeled data. This is the fine-tuning step that we are all familiar with. Let's say you want to classify Amazon reviews as being positive, neutral, or negative.

You're just going to get maybe 1,000 reviews and the label for each review and just train the model in a supervised fashion. This was a very good paper and it was very influential and made a big buzz in the ML field. This paper was released in 2018, but it did not mention the word "transformer" at all.

They did not even reference the paper. The architecture of the model used was an RNN-based model. I think it was an LSTM. So, no transformers at all. And this is kind of a big gap in this work that GPT folks are going to fill. Otherwise, this paper would have been a very, very good paper.

It still is a good paper. So, let's talk a bit about GPT, the thing that we want to talk about today. GPT stands for Generative Pre-trained Transformer, the keyword is "transformer". It was developed by OpenAI, by these awesome folks. It was actually one of the first things that made OpenAI a popular organization in the ML field.

The whole premise is to use a semi-supervised approach for NLU tasks, natural language understanding tasks. So, the goal is to learn a universal representation that transfers very well to other downstream tasks. So, basically have a good starting point, instead of starting from scratch every single time. And their approach is basically two-step.

The first step is to do unsupervised pre-training, and then supervised fine-tuning. And this is kind of where the word "semi-supervised" comes from. So, a mixture of unsupervised training and supervised training. And their architecture is transformers, of course, because we have GPUs, and GPUs love transformers. So, that's a good thing.

So, to do this approach, you're going to need two things. The first thing is a very big corpus of unlabeled text. And then the second thing is a dataset that is annotated, that is ready to use for supervised fine-tuning. So, you could have multiple datasets if you want to train your model for several tasks.

But the good news is your target tasks do not need to be in the same domain as the unlabeled text. So, for example, let's say you want to train on financial tasks. You're going to give the model some information about a stock and ask it about how it performs.

Or whether it should buy or sell. So, like a financial classifier. Actually, you can pre-train on just like normal general data. Like you can pre-train on a corpus from the Internet and then just fine-tune on your desired tasks. Like your unlabeled corpus does not need to be in the same domain as your objective.

And this is good news because we have a lot of general-purpose text that you can use for pre-training. While obtaining very domain-specific corpus is more involved. And a very minor note here is the name can be misleading in this work. The word "generative" here mainly refers to the pre-training step.

The actual tasks that they had in mind are more discriminative. So, like classification, question answering, and semantic similarity. That is, natural language understanding tasks. So, they did not discuss machine translation or just being a chatbot in this work. And they actually released their blog post under a different, and I think more fitting title, called "Improving Language Understanding with Unsupervised Learning".

And this is like the key idea here, like unsupervised learning. This is a very nice blog post on their blog. So, now we have discussed ULMFiT and GPT. Let's also discuss some of the other related work in this domain. So, let's first start with semi-supervised learning. The work GPT falls under this domain.

And back then this was becoming very popular, like sequence labeling, text classification tasks. People were doing semi-supervised learning for these. And you have different levels for this approach. So, the first and basic level is just using language statistics to get the features. Like using bag of words, TF-IDF, and all this classical machine learning stuff.

You can use them as input features and just train a classifier on top of this. But that's not very helpful, because two words that are different in syntax, but are similar in semantics, are going to have very different features. So, it makes the model a bit limited. The second step is to use the word embedding, like we discussed previously.

And this approach allows you to capture the semantics, but it's also very limited in that it's based on words. We're not capturing the higher level semantics. The third level is like sequence embeddings. So, instead of just using the word to get the embedding, you are going to utilize the entire sequence.

So, like the sentence or the paragraph to get the embedding. And this actually allows us to utilize the context to understand the high level semantics of the text. And this is like level 3 is where GPT falls. We are using the entire sequence to generate embeddings that we can use for classification or any other task.

So, this is regarding semi-supervised learning. The second field, and the more specific field, is unsupervised pre-training. So, it's a special case of semi-supervised learning. These terminologies can be confusing, I know. But the goal is to find a good initial starting point instead of modifying the supervised learning objective.

This is probably a typo on the slide. So, the early work in unsupervised pre-training was actually in vision, in image classification. So, for example, you can use a ResNet that was trained to classify ImageNet. You can take the ResNet as a backbone, and then do segmentation or just try to detect pneumonia in chest x-rays.

And this is a very good starting point. Even though ImageNet is just about classifying images as being cats or dogs or humans or horses. So, it's more of a general-purpose dataset, but you learn a very good representation that can be used in your downstream task.

And in the field of vision, this actually proved to work quite well, because people have found out that pre-training acts as a regularization scheme. Your model tends to have better generalization if you pre-train it on a very large corpus. And the authors mention in the paper that the closest line of work to their work is actually what we covered...

What we covered so far, ULMFiT by Howard and Ruder, and also another work by Dai et al. But I did not go in detail into that work. However, the main drawback of these two works, and almost everything else back then, is that they are using RNNs.

And we know that by 2018, like, transformers reign supreme because we have GPUs. So we can train transformers more efficiently. And also RNNs are limited in their ability to process large contexts because of, like, the vanishing gradient problem and so on. So this was, like, the most common approach back then.

Another approach is also to use the hidden representations from a pre-trained language model or machine translation model as auxiliary features while training your model. So you can have... Let's say you have a machine translation model. You're going to get the representations or, like, the hidden states of this model and then just give them to your classifier as, like, additional features.

And as you can imagine, this is, like... This involves, like, a substantial amount of work and new parameters for each task. So this is not very universal. So any questions so far before we actually go in detail into the approach? - I think there was a question from Sook.

Did AlexNet use GPUs? And were the transformers... - AlexNet? - ...the first ones to use the GPUs? I think he just put it in the chat. Yeah, but... Were transformers, like, some of the... - Yeah, AlexNet, yeah. Yeah, AlexNet did definitely use GPUs, but back then, I think they were writing CUDA code.

I think Alex Krizhevsky was the person doing the GPU programming stuff himself. So, yeah, I think AlexNet did definitely use GPUs, if I remember correctly. - Yes, okay. Cool, cool. Well, seems like they did. Okay, I would just like to see if anyone else has any other questions. I think it should be good for now.

- Yeah. Someone said, "According to Papers with Code, they did." Yeah, I think they did use GPUs. So let's go into details about the GPT approach. So we have two steps. The first step is unsupervised pre-training. So basically, the goal is to train a high-capacity language model on a large corpus of text.

The training objective here is language modeling. That is, given a sequence of tokens, try to correctly predict the next token. So let's say, "The cat sat on the..." You're trying to predict the next token, which I think should be "mat". So this is basically language modeling. The loss function used is negative log-likelihood.

It's also called cross-entropy. Basically, this equation is the negative log-likelihood of the correct label. And a very, very important note here is you do this over the entire sequence. So if you have a sentence that says, "The cat sat on the mat," you do this over every single token in this sentence.
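As a rough PyTorch sketch of that loss (random logits stand in for the model's actual output), note that the cross-entropy is averaged over every position, not just the last one:

```python
import torch
import torch.nn.functional as F

# logits: model outputs, shape (batch, seq_len, vocab_size)
# tokens: input token IDs, shape (batch, seq_len)
batch, seq_len, vocab_size = 2, 8, 40_000
logits = torch.randn(batch, seq_len, vocab_size)
tokens = torch.randint(0, vocab_size, (batch, seq_len))

# Predict token t+1 from positions up to t, so shift by one:
# position i's logits are scored against token i+1.
shift_logits = logits[:, :-1, :]
shift_labels = tokens[:, 1:]

# Negative log-likelihood (cross-entropy) averaged over EVERY position.
loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_labels.reshape(-1),
)
print(loss.item())
```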

And this is very important. So you are training on your entire input, your entire corpus, not just the last token. So the architecture is a multi-layer transformer, but they are using the decoder only. They are not using the encoder. And the difference between the encoder and the decoder, I think, comes down to how you do attention.

So in encoders, every token has access to every other token. But in decoders, every token has access only to the tokens that came before it. So you only have one directional attention. I think it's also called left-to-right attention or right-to-left, or, like, you only have attention in one direction.
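A tiny sketch of what that one-directional, causal masking looks like in practice:

```python
import torch

seq_len = 5
# Lower-triangular matrix: position i may attend only to positions 0..i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
print(causal_mask)

# In attention, scores at disallowed positions are set to -inf before the
# softmax, so each token only "sees" the tokens that came before it.
scores = torch.randn(seq_len, seq_len)
scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)
print(weights)
```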

And basically, the transformer architecture applies a multi-headed masked self-attention operation over the input tokens, and then this is followed by a position-wise feed-forward layer. And you keep doing this for n layers, where n is the number of transformer layers, and then you just generate an output distribution over the vocabulary.

So let's take this into more detail. You have your input text that has been tokenized into tokens, and these tokens have been encoded. So you have, like, the token integer or token ID. So it's a number. So you take this number, and you pass it through the embedding layer, the token embedding layer, also called the semantic embedding layer, and you get a vector for this token.

And you also get the positional embedding for this token. You sum these up, like, just vector addition, and you get your input to the first transformer block, which is H0. So just the positional embedding and the token embedding added together, and you get your input, H0. And then you take H0 and pass it to each block, each transformer block, and the output of the first block is going to be the input to the second block, and so on.

And this is what this equation says. And at the end, once you're done going through all these blocks, you're going to go to the output layer, which is actually a reverse embedding. So you have a vector, but you want to go back to a token, or, I should say, a probability distribution over all the tokens.

And once you have this distribution, you pass it through a softmax to get actual probabilities that sum up to one, instead of just logits or scores. And I think we've covered the transformer paper, so I think people are familiar with this, but if you have other questions, please go ahead.
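Putting those pieces together, here is a simplified sketch of the forward pass, with `nn.Identity` standing in for the real masked-attention blocks; it is not the actual GPT-1 code, just the shape of the computation, including the tied output projection:

```python
import torch
import torch.nn as nn

vocab_size, n_ctx, d_model, n_layer = 40_000, 512, 768, 12

token_emb = nn.Embedding(vocab_size, d_model)   # W_e
pos_emb = nn.Embedding(n_ctx, d_model)          # W_p
# Stand-ins for the real masked-attention + MLP transformer blocks.
blocks = nn.ModuleList([nn.Identity() for _ in range(n_layer)])

def forward(token_ids):                               # (batch, seq_len)
    positions = torch.arange(token_ids.size(1))
    h = token_emb(token_ids) + pos_emb(positions)     # h_0: token emb + positional emb
    for block in blocks:                              # h_l = block(h_{l-1})
        h = block(h)
    # Tied output projection: reuse the token embedding matrix W_e.
    logits = h @ token_emb.weight.T                   # (batch, seq_len, vocab)
    return torch.softmax(logits, dim=-1)              # probabilities over the vocabulary

probs = forward(torch.randint(0, vocab_size, (1, 10)))
print(probs.shape)  # torch.Size([1, 10, 40000])
```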

A small remark about this is that they use tied input and output token embeddings, so the embedding matrix at the input is the same one used at the output layer. So... Yeah, someone says causal masking. Yes, exactly, causal language modeling. So the second step is supervised fine-tuning.

So the goal of this step is to adapt the parameters of the pre-trained model to your supervised target task. And for this, we need a labeled dataset where each instance is a pair: a sequence of input tokens and a label. So, for example, an input sequence could be "this product is very bad", and your label could be a negative sentiment.

So the inputs are passed through the pre-trained model to obtain the final transformer block's activations. So if you go back a bit, you take your input sequence and pass it through this same transformer and get the output at the final token, and then you compare this to your label.

So you're gonna get the hidden representation of the last transformer layer at token m, where m is the final token of the input sequence. And you just pass this through your classification head or whatever head you're using. So, for example, in classification, we're gonna use a softmax on top of a linear layer to get your output and compare it with the label.

And you are using roughly the same loss function, which is negative log-likelihood, or cross-entropy. And a key distinction here between this and the previous step is you are only calculating the loss on the output label, not the entire sequence. So the loss is only on y, not on x1 through xm.
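A rough sketch of that fine-tuning head, assuming `h` already holds the final transformer layer's activations for a batch (random tensors stand in for the real model outputs):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, num_classes = 768, 2
classifier_head = nn.Linear(d_model, num_classes)   # W_y, the only new weights

# h: final-layer activations from the pretrained model, (batch, seq_len, d_model)
h = torch.randn(4, 16, d_model)
labels = torch.tensor([0, 1, 1, 0])                  # e.g. negative / positive

h_last = h[:, -1, :]                  # hidden state at the last token (position m)
logits = classifier_head(h_last)      # (batch, num_classes)

# Cross-entropy only on the label y, not on the input tokens x_1..x_m.
loss = F.cross_entropy(logits, labels)
print(loss.item())
```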

So the only extra parameters you need for this are your classification head. So, for example, W_y if you're trying to do classification, the parameter matrix of the output layer. And also embeddings if you're adding new tokens. And we're gonna see this in a bit.

Something they also experimented with is auxiliary training objective. So they also used language modeling as an auxiliary objective in fine-tuning. So not just this-- not just this classification, but also language modeling. And they say this helped them by improving the generalization of the supervised model and accelerating convergence. They also say this is in line with prior work.

They also observed better performance when using it as an auxiliary objective. And the way you do this is your loss function is now a sum of multiple loss functions where one loss function is this one, the classification loss function, and also you have the language modeling loss function with a certain weight, like lambda here is like a weight.

And lambda could be, for example, 0.5 or 0.3. So you have like a summation of multiple losses. And a small note for myself here. I'm not sure if auxiliary language-- like auxiliary objectives are popular today. I think people just do supervised fine-tuning without an auxiliary objective. That's just my take.

So any questions so far? I think the chat seems to not have any questions. So maybe you want to just-- you can just continue from now. Sure. So now we have covered the approach, the basic two steps. You will now get a very good, let's say, classifier, because you have done unsupervised pre-training and supervised fine-tuning for classification.

But GPT-1 is actually trying to be more of a universal model than just a classifier. So they are trying to handle multiple tasks, like classification, entailment, semantic similarity, and answering multiple choice questions. And these are all, as you can see, discriminative tasks rather than generative tasks, as we discussed before.

So for tasks like classification, this is very easy. You can do what we have covered so far. Just add the head on top and do the classification. But other tasks have different structured inputs and outputs. So for example, text entailment has ordered sentence pairs. MCQs have a question with multiple answers, and so on.

So each task has its own specific structure. And the way people have dealt with this previously is to just learn a specific architecture for each task on top of your model. And this defeats the whole purpose of the GPT work. We're trying to do something that's general, a general-purpose model, rather than having multiple task-specific architectures.

So instead of using this approach, they opted to use a different approach, where they convert the structured inputs into just tokens. So they are trying to create a multitask format. And this is similar to what people have used in future work like T5 and Whisper, where basically you're trying to model different tasks just using tokens and special tokens.

These input transformations allow us to use the same architecture for different tasks. So you don't need to do a lot of modification. And we're going to go into this in more detail in the next slides. So for example, let's take textual entailment. This task involves reading a pair of sentences and judging the relationship between them.

So the relationship could be one of entailment, contradiction, or neutral. And a small note is that this task is still challenging because your model needs to have good understanding and reasoning of the language, because it can be confusing sometimes. So you have your premise, and you have your hypothesis, and you're trying to classify or try to predict the relationship as being entailment, contradiction, or neutral.

So the way to do this is just to concatenate the premise and the hypothesis. So this could be a sentence, and this could be a sentence. You just concatenate them and add a special delimiter token in between them. And obviously, you add your start token at the beginning and your end token at the end.
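A toy sketch of that input transformation, using made-up special-token strings purely for illustration (GPT-1 actually uses learned embeddings for its start, delimiter, and extract tokens):

```python
# Hypothetical special-token strings just for illustration.
START, DELIM, END = "<start>", "<delim>", "<extract>"

def build_entailment_input(premise: str, hypothesis: str) -> str:
    # [start] premise [delim] hypothesis [extract] -> fed to the transformer,
    # then a 3-way head (entailment / contradiction / neutral) on top.
    return f"{START} {premise} {DELIM} {hypothesis} {END}"

print(build_entailment_input(
    "A man is playing a guitar on stage.",
    "A person is performing music.",
))
```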

And just try to train a classifier on top of this. And your classifier should classify this sequence of input tokens as one of entailment, contradiction, or neutral. So this has become just a classification task by just doing a transformation. The second task that we cover is semantic similarity. And I think this is very popular nowadays, because of retrieval-augmented generation and embeddings in general.

So all this cool RAG stuff. So this task is about predicting how semantically similar two sentences are. And semantic similarity just means how close in meaning they are. Do they talk about the same thing? Do they mean the same thing? Do they have similar meaning or not? And again, this can be challenging, because you can have two paragraphs that have very different usage of words, but they convey the same meaning.

So this can be challenging if your model is not smart enough. And one note about this task is there is no inherent ordering of the sentences. So you can just-- there is no sentence A and sentence B. You just have two sentences. Unlike, for example, in the entailment, you have a very specific order.

Like, you have a distinction between the premise and the hypothesis. But here you have just two random sentences. And the way they approach this is using a Siamese architecture where we have-- where we generate two input sequences and pass them through the transformer and compare between them. So basically, Siamese architecture is a fancy way of saying that you are using the same model twice, or two identical versions of the model.

So you get your sentence-- first sentence and second sentence, and then concatenate them and add your special tokens. And you pass them through the transformer and you get some vector at the end. And also, you do the same thing, but you reverse the order of the sentences. So you get the second input sequence, pass it to the transformer, and you get your vector.

So basically, at the end, you're going to have two vectors. You just add them-- do vector addition on top of them. And then you just add a head on top of the output vector. So you just multiply the vector by some layer, a linear layer, for example, that has parameters w, and you get your output.
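A rough sketch of that two-ordering trick; the `encode` function here is only a placeholder for building the delimited sequence and running it through the pretrained transformer to get the final token's hidden state:

```python
import torch
import torch.nn as nn

d_model = 768
similarity_head = nn.Linear(d_model, 1)   # regression head; use 2 outputs for a classifier

def encode(text_a: str, text_b: str) -> torch.Tensor:
    # Placeholder for: build "<start> a <delim> b <extract>", run the
    # pretrained transformer, and return the final token's hidden state.
    return torch.randn(d_model)

def similarity_score(sent1: str, sent2: str) -> float:
    h_ab = encode(sent1, sent2)       # ordering A, B
    h_ba = encode(sent2, sent1)       # ordering B, A
    h = h_ab + h_ba                   # element-wise sum of the two representations
    return similarity_head(h).item()

print(similarity_score("The cat sat on the mat.", "A cat is sitting on a mat."))
```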

And for example, if you're doing-- like, if you just have-- if you're just interested in knowing whether these sentences are similar or not similar, so you have only two labels, you can train a classifier. If you're interested in having more of a scale of similarity, like from 0 to 10, you can train a regressor on top of this.

So that's very cool. We're still using the same architecture, by the way, just the transformer here. We're not modifying it. We're just being smart about how to approach this task. You can extend this to do question answering. So let's say you have a question and you have four choices.

Or let's say you have a document. You have a question about the document, and you have four potential answers to this document. And you want your model to pick one answer of these. So the way to do this is you take your document or context and add the question and then add the first answer.

And then you get your first input sequence. And you pass this to the transformer. And similarly, you get your context or document and add the question and add the second answer and make this into a sequence and give it to the transformer. And you do this for all the answers.

And then you just compare between the scores the model gives to each of these potential answers. So for example, we have a Wikipedia article, and we have a question, and we have answer A. We concatenate all of these with some special tokens. And we pass them to the transformer and to the final linear layer, and we get the score.
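A rough sketch of that scoring loop; `score_sequence` is a placeholder for the concatenation plus the transformer plus the final linear scoring layer:

```python
import torch

def score_sequence(context: str, question: str, answer: str) -> torch.Tensor:
    # Placeholder for: concatenate context, question, and answer with special
    # tokens, run the pretrained transformer, and apply the linear score layer.
    return torch.randn(1)

def pick_answer(context: str, question: str, answers: list[str]) -> int:
    scores = torch.cat([score_sequence(context, question, a) for a in answers])
    probs = torch.softmax(scores, dim=0)   # normalize the per-answer scores
    return int(torch.argmax(probs))

answers = ["Paris", "Berlin", "Madrid", "Rome"]
best = pick_answer("An article about France...", "What is the capital of France?", answers)
print(answers[best])
```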

So score for answer A and score for answer B and score for answer C. And just do a softmax to get a proper probability distribution. And that's it. You've got your model to do question answering. And you can do this also for other tasks like common sense reasoning. So any questions about these transformations for each task?

I think there was a question about modifying the input sequence to both possible orderings, at least when it comes to semantic similarity. I think someone had a question about that. So he just said that he doesn't really understand what it means to modify the input sequence for both possible orderings.

Yeah, sure. So if you're doing textual entailment, you have distinction between the premise and the hypothesis. Like you have a premise and a hypothesis. So you can just mix them up. You have a very specific premise and a very specific hypothesis. And the ordering here matters. Like you should put the premise first, and then the special token, and then the hypothesis.

You can't just put the hypothesis first, and then the delimiter, and then the premise. So this is what they mean by the ordering here matters. But for semantic similarity, you just have two sentences. There is no inherent ordering for the semantic similarity task. I hope this answers the question. I think it did.

He said, ah, I see. Thanks. I think probably that's about it. I don't see any other questions. He said, does it also combat things like positional biases in transformers, by switching the order of the sentences itself and adding the two quantities? Yeah, exactly. I think, yeah. I think this is one of the motivations they did this.

They want to say, there is no inherent order for this task. So maybe the transformer will just have a bias. So let's try both ways. Like let's give it the first sentence, and then the second. And let's give it the other way around to get over the positional bias.

Because I think for some models, they pay more attention to the last few tokens in the input. And they disregard the earlier tokens. This tends to happen sometimes, yes. I think, yeah. I think this is a good way to think about this. Awesome. I think-- I don't see any other questions in the chat.

So I can just surface them as and when they come. So I think we should be good for now by the looks of it. Sure. Sure. So that was the whole approach. We can now discuss some details about the training. And back then, OpenAI actually did release info about their training and models.

They don't do this now, unfortunately. But anyways, so the data set they use for training is BookCorpus for training the language model. So this is in step one, which is unsupervised training. BookCorpus. And back then, this was huge. It has 7,000 unique unpublished books from different genres. So the variety here helps as well.

They have adventure, fantasy, and romance. So kind of like a very good, big, diverse data set. And the main advantage of this data set, and the reason they chose it, is because it has long stretches of text. So if you have a book or a novel, you have a paragraph that's maybe 10 lines or more.

And this helps the model to learn how to handle long-range information and how to handle long context. For example, there is also another dataset called the 1B Word Benchmark, which is also big and diverse. But it doesn't have this long context. It's just a bunch of small sentences, I would say.

And this way, your model will not learn how to handle long context. And a side note here is like ELMo is also a very seminal work in NLP. And it falls under the same domain of unsupervised pre-training and having good embeddings. So it's one of the fundamental papers and works in NLP.

So they say their model achieves a very low token-level perplexity of 18.4 on this corpus. But I don't think this is actually low today. I think a perplexity of 18 is a bit high. But I'm not sure. And their dataset is about 5 gigabytes in size.

So not quite big by today's standards, of course. The architecture, as we mentioned, is a transformer model. For the tokenizer, they used byte-pair encoding back then, which is also, I think, what's used right now. So this has still not changed today. They have a surprisingly big vocab size by their time's standards.

Like, they have 40,000 tokens. I think Llama 2 also had something in the same range, like 40,000, 50,000 tokens. So back then, this was actually quite big, I would say. And they used the ftfy library to clean the raw text and then do some standardization using the spaCy tokenizer. So this is good work in the tokenization area, I would say.

And their model is just a typical transformer model, like typical by today's standards, a decoder-only transformer with 12 layers and masked self-attention. And it has a size of 117 million parameters. This was actually big back then, although it's trivial now, of course. For embeddings, they used learned positional embeddings compared to the sinusoidal embeddings in the original transformer paper.

So this was actually much simpler to implement. And the model just has to learn the embeddings in training. They have a context size of 512. Again, back then, that was very big. And they also used tied input and output token embeddings, as we mentioned. And for the attention block, they have 12 attention heads.

Each head has a dimension of 64, for a total of 768 dimensions for the entire attention block. And after the attention, you have the MLP layer, also known as the position-wise feed-forward network. And the size of the inner state of this network is 3,072.
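Pulled together, the shape of the model is roughly the following; this is a plain config sketch based on the numbers above, not OpenAI's actual configuration file:

```python
# Approximate GPT-1 hyperparameters as described in the paper.
gpt1_config = {
    "vocab_size": 40_000,               # BPE vocabulary
    "n_ctx": 512,                       # context length
    "n_layer": 12,                      # decoder-only transformer blocks
    "n_head": 12,                       # attention heads
    "d_head": 64,                       # per-head dimension (12 * 64 = 768)
    "d_model": 768,                     # residual / embedding width
    "d_inner": 3072,                    # MLP inner size (4x expansion)
    "activation": "gelu",
    "positional_embedding": "learned",  # not sinusoidal
}
print(gpt1_config)
```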

And this means that, actually, this is an expansion and then contraction network. So you go from 768 to 3K, and then you go back from 3K to 768. And all this is happening inside this MLP. And the activation they used is the GELU activation. The optimizer, they used Adam, which was becoming very popular back then.

I think Adam was developed in 2014, 2015. So it was getting a lot of traction back then. And the maximum learning rate is 2.5e-4, which is also a very popular value nowadays. I think this was used in one of the Llama 2 models. So this is very familiar. They do warm-up as well, from 0 to the maximum learning rate over 2K steps.

And then they do cosine annealing from the maximum learning rate down to 0, so just learning rate decay. For compute, they used one machine that has 8x P600 GPUs. This is the same family of GPUs as the P100, I think. But I don't have much information about this GPU.

They train for 30 days, which actually is not bad. That's not very long. But back then, this was very long, I think. Their assumed utilization is 0.33. And the total compute is almost one petaflop/s-day. And for reference, I think the tinygrad machine is trying to give you one petaflop.

Or I could be wrong. Yeah, I could be wrong about this. So anyway, the compute they used is almost one petaflop/s-day. And this is the way they calculated this. And this is actually what the model looks like if you try to do it in the Transformers framework. So you have token embedding, position embedding, and then some dropouts.

And then you have your actual transformer blocks. And this is the architecture of the language model. So there is no head in here. So the block is just attention, and then layer norm, and then MLP, and then layer norm. So if there are no questions, we can move on to the second step, which is supervised fine tuning.

There was one question from Sean, which was, are perplexity numbers comparable across different models? And I think we discussed just now that the model itself has a perplexity of 18.4. And so in this case, would it be OK to compare-- is it a metric that's invariant across different models in this sense?

Yeah, so that's a good question. I think perplexity is just a metric. Like, you're going to measure perplexity on a certain data set. Like, let's say you have this data set that is 1 trillion tokens, or 1 billion tokens. And you measure perplexity of GPT-1 on this data set.

And you can measure the perplexity of Llama 2 on this data set. And I think you can compare this metric. It's like saying Llama 2 has a HumanEval of 70, and GPT-4 has a HumanEval of maybe 90. It's not totally fair, but we can do it, I guess. We can get away with doing it.

Yeah, but it might not be 100% fair. Yeah, so that's a good point. OK, cool. I think that's probably the only question. So I think we should move on to supervised fine-tuning. So the second step is supervised fine-tuning. And they use these data sets for this task. So for sentence similarity, they use these data sets.

And for classification, they used CoLA, which I think is popular now as well. I'm not going to go into details about these, because I don't have much information about these datasets. And the architecture just uses the same backbone as pre-training. And you add your head, mostly a classifier head, with a dropout of 0.1.

They train for three epochs with a batch size of 32. This is also still standard nowadays. People usually do fine-tuning for three epochs with a batch size of maybe 32 or 64. So yeah, that's common nowadays as well. The learning rate is 6.25e-5. And also, people use very similar learning rates nowadays as well.

So this is very familiar. This setup for fine-tuning is very familiar, even today. And they also use learning rate decay with a warm-up. And when they used the auxiliary language modeling objective, they used a lambda of 0.5. So the weight of the auxiliary language modeling objective was 0.5.

But this is not popular nowadays. Most people don't do this. Any question about this? I think we should-- Before we go to the benchmarks? Yeah, the chat. There's no questions in the chat. So I think we can just move on to the benchmarks. Sure. So on the benchmarks, they get SOTA performance on many tasks.

Like you can see here, almost every single task, except this one. And there are ones where they have significant improvement, like this in QNLI. They have like 6% absolute improvement. I think-- let's go back a bit. QNLI-- where is it? In here. I think QNLI is more of a natural language understanding task that requires reasoning.

And this is where they make the most improvement, I think. But overall, they are doing very good performance on many tasks. And they are comparing to LSTMs and other models. So good news. GPT-1 tends to be a good model. Yeah, this is for natural language inference. For question answering and common sense, they also make very big gains, as you can see here.

So 76 compared to 86. They compare also to multiple models. And one of them is actually an ensemble of nine models, as they denote here by 9x. And also, the same happens for semantic similarity and classification. Although, I would say the boost in performance is not as big as the previous ones.

They are good on some metrics and bad on other metrics. So this is my favorite part of the paper. They did-- after the benchmarks, they now have a good model. So they are trying to understand why their model is good and why their GPT-1 is suddenly SOTA. So they do analysis and try to understand why this is happening.

So the first step is they're trying to analyze the impact of the transfer learning, the number of layers transferred from the pre-trained model to the fine-tuned model. So they just take the pre-trained transformer, and they take all the layers and do fine-tuning. And then they take 11 layers, and then they do fine-tuning.

And then they take 10 layers and do fine-tuning, and so on. They compare the performance of these models. And as you can see, the more layers you add, the more performance you get. And if you do zero layers transferred, you're not taking any of the pre-trained weights. You're just starting from scratch.

You get this performance. But if you just add one layer, you get a very significant boost. The solid lines here are the dev data sets. The dashed line is the training data set. So you can see that a very big improvement in performance just by adding one layer. And this is almost like a continuous trend.

So they observe the obvious fact that transferring more layers improves performance. And just transferring only the embedding layer already improves performance significantly. And they mention that transferring all the layers gives a boost of up to 9% on this task, MultiNLI. And this analysis actually indicates that each layer in the pre-trained model actually has a purpose.

And it learns something that's useful. And it's very helpful in the fine-tuned model. So different layers learn different things, at different scales. And they are all helpful. This is a very good finding of this work. So each layer in the pre-trained model contains useful information or functionality for solving the target tasks, like classification.

The second piece of analysis they did is zero-shot evaluation. And this was, I would say, radical back then. They are trying to evaluate how good the pre-trained model is on these tasks without even fine-tuning. And they want to answer the questions like, why is the language model pre-training effective?

Why does pre-training a transformer on language modeling help us when we are doing classification? They have a hypothesis that the underlying generative model learns actually multiple tasks in the pre-training. So it's not just learning language modeling. Because if you think about it, if you're learning how to model language properly, you're more likely to have a deeper understanding of the language and just natural language.

Like, for example, if you speak only English and French, and you cannot speak, let's say, Spanish, you won't be able to do classifications in Spanish. But if you speak English and you speak French, you probably have the understanding to do these tasks. So this is something cool and something to think about a bit.

Another thing is attention is very helpful. Transformers show very big improvement compared to LSTMs. Or that's what they are hypothesizing about and what they are trying to test. So to test these two hypotheses, they designed a series of heuristics. They basically tried to evaluate the pre-trained model on these tasks before doing the supervised fine-tuning.

And they have different modifications for each data set and each task. For example, for linguistic acceptability with CoLA, they take the examples and compute the average token log probability for each example and use this as a score. And they have a threshold. And they just determine if the model got this right or wrong.

Something that's very cool is how they handle sentiment analysis. So you've got your, let's say, Amazon review. And you append the token "very" to the review and restrict the language model to generate only one of two tokens, positive or negative. And you see which token has the higher score or higher probability.

And this is the prediction of the model on this sentiment, on this paragraph. So just restrict the language model prediction to these two tokens and see which one is higher. This is actually pretty cool. It's kind of similar to constrained grammar and constrained output.
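A rough sketch of that sentiment heuristic; `token_log_probs` is only a placeholder for querying the real language model:

```python
import math
import random

def token_log_probs(prompt: str, candidates: list[str]) -> dict[str, float]:
    # Placeholder for the real language model: return log P(candidate | prompt)
    # for each candidate next token.
    return {c: math.log(random.uniform(0.01, 1.0)) for c in candidates}

def zero_shot_sentiment(review: str) -> str:
    # Append the token "very" and compare P("positive") vs P("negative").
    prompt = review + " very"
    scores = token_log_probs(prompt, ["positive", "negative"])
    return max(scores, key=scores.get)

print(zero_shot_sentiment("This product broke after two days."))
```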

Another example is question answering. So for each answer, as we said, you concatenate the document, question, and answer in the input. And you average the token log probability of just the answer and use this as the score for this answer. And you do this for multiple answers and just pick the one with the highest score.

And for Winograd schemas, for this task, let's say you have a sentence that says the city councilmen refused the demonstrators a permit because they advocated violence. So the goal here is to find this word "they." What does it refer to? Does it refer to the demonstrators or the city councilmen?

This is what they mean by a Winograd schema. And I think it's a very popular task. So the way they do this is they take this word "they" and replace it with the two possible answers. In this case, councilmen and demonstrators. So you have example A, where you use the word "councilmen." And example B, where you use the word "demonstrators." And you just average the token probability of what comes after this word.

And then you just take the one with the higher probability score. So this is how they wanted to evaluate the model. Let's see, actually, the evaluation results. And surprise, surprise, the longer you train the model and the more data you give the model, the better it tends to perform on these tasks.

So the better zero-shot performance it has, basically. This is a very cool graph. So we have different tasks here. And the solid line is the GPT transformer. And you can see with more steps, more training steps, the better performance you have. And they also did a very cool analysis where they trained an LSTM, which is the dashed line.

And you can see that for almost every single task, the transformer is better than the LSTM. So they observed the obvious that more training gives better performance. More pre-training gives better performance, even in zero-shot settings. And this suggests that generative pre-training actually-- in generative pre-training, the model learns a lot of tasks, not just language modeling.

And they mentioned something here. They said they observed that the LSTM exhibits higher variance in its zero-shot performance, suggesting that the inductive bias of the transformer assists in transfer. But I'm not sure what they mean by high variance in this graph, actually. So if anyone has any clue about this, please share it with us.

Yeah, the bitter lesson, then, and GPUs are all we need. Sadly. So the final section in the paper is the ablation studies they did. They have three ablation studies. They want to first see the effect of the auxiliary language modeling objective during fine-tuning. So in this case, we have the normal transformer training with auxiliary language modeling, which is row 1.

And we have the one without auxiliary language modeling, which is row 3. They say that the trend suggests that larger datasets benefit from auxiliary training compared to smaller datasets. And this is kind of like actually probably why people stopped doing auxiliary language modeling. It doesn't seem to be that important, probably.

The second ablation study is the effect of the transformer. So they just trained the usual transformer, and they compared it with an LSTM, so row 1 and row 4. And you can see, yeah, the transformer has generally better performance than the LSTM on most tasks. So they mention that they observed a 5.6 average score drop when they used the LSTM.

And it only outperforms the transformer on one dataset, which is MRPC. I'm not sure what this is. The third ablation study, and I would say the most important one, is the effect of pre-training. So they compared the transformer that has been trained in the framework they proposed, the two-step framework.

And they also compared it to a transformer that was directly trained on the supervised task, that is, without any pre-training. So we have row 1 and row 2. You can see that this is where the huge difference shows up, like 74 compared to 59, 45 compared to 18, 88 compared to 71.

So this is a very big difference. Yeah, and they observed that no pre-training actually hurts the performance quite a lot on almost all tasks. And this results in about a 15% decrease. And this is actually the worst-performing model out of all these models. Yeah. So I would say that the conclusion from this ablation study is that pre-training is very, very important.

I think this is the final section before the future directions they want to discuss. Any questions so far? I think chat seems good. Do you want to-- you have one more slide, I think. You said future directions. Yeah. Do you want to just finish it up? Then we can open it up to questions if people-- Sure.

Yeah. Sure. So this section, I don't think it was in the paper. I think it was only on the blog post. They discuss what future work could be. And the first approach is surprise, surprise, just scale up. So they mentioned that they noticed improvement in language modeling. And it correlates with downstream tasks.

And they're only using very limited hardware, just a few GPUs on one machine, and training on a very small dataset. So maybe there is room for improvement if you scale the model, the training, and the dataset. And I think we know the answer to this question, or the answer to this hypothesis.

The second approach is-- the second futuristic approach they want to try is try to see if there is-- you can tweak the fine-tuning approach instead of just doing vanilla fine-tuning. You can use adaptation or one of the other fancy ways of doing fine-tuning. I don't think this is as important as they might have thought about this.

Because people right now are doing just simple fine-tuning, and it works. So yeah, not quite as promising as the first approach, which is just scaling up. And the third one is understanding of why generative pretraining helps. And they've done some ablations and analysis about this. They want to do even more further targeted experiments and research to understand this.

So basically, observability and explainability in machine learning. And one very good question they ask is, how much does longer context help compared to just improved world knowledge when you are doing pretraining? So for example, this GPT model is able to process a longer context because it's a transformer. So is this the thing that makes it softer performance?

Or is it the fact that they train on a bigger data set with a longer training time, and it obtains world knowledge? And I think both are important, I would say. But that's a very good question. So yeah, that was it for GPT. They introduce a framework for achieving SOTA performance with a two-step approach: pretraining and then fine-tuning.

The goal of pretraining is to obtain a very good world knowledge and a very good starting point. And you do the pretraining on a diverse corpus with long stretches of text. So the model acquires, actually, world knowledge. And then you actually do transfer learning by fine-tuning on tasks like question answering and so on.

And the result is that they improve the state-of-the-art performance on 9 datasets out of 12. And they successfully have utilized the unsupervised approach to do transfer learning to discriminative tasks. So now we have a clear way of how to do unsupervised training and semi-supervised training. And the work also highlights that transformers are a very good architecture, and larger datasets are good.

And this is a very important push in the direction of scaling up and pretraining. And this is actually what people are going to do for the next six years, from 2018 to 2024. Yeah, so very good paper, very good work. And with that, we've come to the end of this paper.

So if you have any questions, I think we can take questions now.