Breaking down the OG GPT Paper by Alec Radford
00:00:00.000 |
Okay. Sure. So, hey everyone. My name is Amged. I'm a machine learning engineer. 00:00:06.000 |
I generally do ML consulting for startups. I help them ship AI-powered products, 00:00:13.000 |
especially in the field of NLP and speech-to-text applications. 00:00:17.000 |
And I run a blog where I publish posts about ML stuff. 00:00:23.000 |
So feel free to check it out. I've done some posts about Whisper. 00:00:27.000 |
Yeah. So with that out of the way, let's get directly to what we want to discuss today, 00:00:33.000 |
which is like the GPT-1 paper by the folks at OpenAI. 00:00:40.000 |
So the paper is titled "Improving Language Understanding by Generative Pre-Training". 00:00:45.000 |
It was published in June 2018. And these are like the authors. 00:00:50.000 |
Let me switch to presentation mode. Yeah. And these are like the authors of the paper. 00:00:54.000 |
So mainly Alec Radford and Ilya Sutskever, very well-known researchers in the ML field. 00:01:02.000 |
So let's get started. Back in 2018, deep learning was becoming very popular. 00:01:10.000 |
But the main thing with deep learning is like it is very, very data hungry. 00:01:17.000 |
The good news is like there is a ton of data available everywhere on the Internet and just online. 00:01:26.000 |
The bad news is this data is not annotated and it's not curated and it's basically very, very messy. 00:01:33.000 |
And if you want to train machine learning models back then, 00:01:36.000 |
the only way to do it was to annotate data yourself or hire data annotators. 00:01:41.000 |
And these tend to be very, very expensive and difficult to scale and hire. 00:01:45.000 |
Like if you think GPUs are expensive, you have not worked with data annotators. 00:01:49.000 |
That's what I like to say. So this makes deep learning very restricted: 00:01:56.000 |
you're restricted to fields that have good, high-quality annotated data sets. 00:02:05.000 |
So people back then in 2018 were trying to solve this problem. 00:02:09.000 |
Like how do we get over the fact that we need labeled data? 00:02:15.000 |
And one potential solution to this problem is like unsupervised learning. 00:02:20.000 |
So the question is, what if we can leverage like just the linguistic information from the unlabeled data? 00:02:26.000 |
So we have like a bunch of text, like novels, books, articles. 00:02:32.000 |
How can we leverage like the linguistic information from this? 00:02:35.000 |
And the answer to this could be using unsupervised learning. 00:02:40.000 |
And if you can do this successfully, this alleviates the need for large amounts of labeled data. 00:02:45.000 |
Because basically you can utilize Wikipedia, which is very big, or even all the papers published on archive and so on. 00:02:55.000 |
And even if you have like lots of labeled data, using unsupervised learning as a step one or step zero 00:03:01.000 |
is going to make your model actually perform better than just training on this labeled data. 00:03:06.000 |
Because of transfer learning, you can transfer the things that your model has learned during pre-training to the downstream task. 00:03:15.000 |
And good evidence of this is that back in 2017, or even a bit before, 00:03:20.000 |
is like people have been using pre-trained word embeddings. 00:03:24.000 |
Things like word2vec, GloVe, and fastText, to achieve very good performance on many tasks like classification and machine translation. 00:03:34.000 |
So this is good evidence that unsupervised learning actually works and is a very good approach. 00:03:40.000 |
So this is the premise of the paper actually, like unsupervised learning is very promising. 00:03:51.000 |
So the idea behind word embedding is like you want to project words, so just text, into an n-dimensional space. 00:04:02.000 |
And this space has a very special property: words that are similar in meaning have very similar vectors. 00:04:11.000 |
And by similar vectors I mean you can measure similarity with things like cosine similarity, dot product, or L2 distance. 00:04:20.000 |
So we can capture similarity between words that don't resemble each other on the surface but have similar meanings. 00:04:27.000 |
For example, the word "booking" and "reservation". 00:04:30.000 |
Just from the syntax, they are very different words, but they have very similar meanings. 00:04:39.000 |
Or take "Adam" and "SGD": these are completely different words, but they are both actually optimizers used in machine learning. 00:04:44.000 |
So they are very similar and their vectors are going to be probably similar. 00:04:48.000 |
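To make this concrete, here's a minimal sketch of measuring similarity between embeddings. The vectors below are made-up toy values, not real word2vec embeddings:

```python
import numpy as np

# Toy 4-dimensional "embeddings" for illustration only; real word2vec
# vectors are typically ~300-dimensional and learned from a large corpus.
booking     = np.array([0.9, 0.1, 0.3, 0.7])
reservation = np.array([0.8, 0.2, 0.4, 0.6])
banana      = np.array([-0.5, 0.9, 0.0, -0.3])

def cosine_similarity(a, b):
    # 1.0 = same direction (similar meaning), ~0 = unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(booking, reservation))  # high: similar meaning
print(cosine_similarity(booking, banana))       # low: unrelated words
```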
The most common implementation of word embeddings is word2vec by Google. 00:04:53.000 |
This is what popularized the usage of word embedding. 00:04:57.000 |
And then GloVe by Stanford and fastText by Facebook. 00:05:01.000 |
And the way these word embedding models are trained is like leveraging co-occurrence between words that have similar contexts. 00:05:09.000 |
For example, similar words tend to occur in similar contexts. 00:05:13.000 |
Like you would generally find the word "Adam" associated with learning rate or machine learning. 00:05:19.000 |
Similarly, "SGD" is associated with learning rate as well. 00:05:22.000 |
So you can conclude that these two words are kind of similar. 00:05:25.000 |
And the way these word embeddings are used is by training a head on top of them. 00:05:31.000 |
Like for example, if you want to classify a piece of text as being positive or negative, having a positive sentiment or a negative sentiment, 00:05:38.000 |
you can train a classification head on top of the frozen embeddings. 00:05:42.000 |
So you use the word embeddings as fixed input features. 00:05:47.000 |
Just frozen things without training the word embeddings themselves. 00:05:54.000 |
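As a rough PyTorch sketch of this "frozen features plus trainable head" setup (the dimensions and the random matrix standing in for pre-trained vectors are placeholders):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, num_classes = 10_000, 300, 2

# Pretend these rows came from a pre-trained word2vec/GloVe matrix.
pretrained_vectors = torch.randn(vocab_size, emb_dim)

embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
head = nn.Linear(emb_dim, num_classes)  # the only trainable part

def classify(token_ids):              # token_ids: (batch, seq_len)
    vectors = embedding(token_ids)    # (batch, seq_len, emb_dim), frozen
    pooled = vectors.mean(dim=1)      # average the word vectors
    return head(pooled)               # logits: positive vs. negative

optimizer = torch.optim.Adam(head.parameters())  # only the head is updated
```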
And this has like a significant drawback or a bunch of significant drawbacks. 00:05:59.000 |
The first thing is it doesn't utilize the context of the text, so you're just using the word itself. 00:06:05.000 |
And some words, even the same word, can have very different meanings depending on the context. 00:06:11.000 |
For example, the river bank, like the bank of the Amazon or the Nile, is different from the HSBC bank or the JPMorgan bank. 00:06:22.000 |
So if you're going to use just word2vec, these two senses of "bank" will have the same vector, even though they appear in very different contexts. 00:06:29.000 |
And even beyond this, natural language has nuances that cannot just be captured by using words. 00:06:35.000 |
Like when you write something in a stylized way, say spelling a name like OpenAI with numbers or unusual casing, 00:06:41.000 |
it has a very specific intonation compared to just writing OpenAI in the normal way. 00:06:47.000 |
So are there any questions about word embeddings so far? 00:07:00.000 |
Someone says in the chat, nope. I cannot read the chat, so if someone can just say this using the microphone, that'd be great. 00:07:08.000 |
I don't think there are any questions. I think Sean said no. 00:07:11.000 |
But if anyone has any questions, you can maybe drop it in the chat, and then I can just surface them as and when. 00:07:17.000 |
But it seems like everyone's okay with it for now. I'll keep an eye on the chat for you. 00:07:21.000 |
Yeah, great. Yeah, thank you. That's appreciated. 00:07:24.000 |
So this was word embedding. The main limitation or drawback is you cannot use context. 00:07:33.000 |
Word embeddings are too local. We need something that's more global that can capture the higher-level semantics. 00:07:43.000 |
How do you leverage more than word-level information from unlabeled text? 00:07:47.000 |
You're going to have some questions that you need to answer. 00:07:51.000 |
First is, which objective should you use while training? 00:07:54.000 |
Do you want to use language modeling or machine translation or discourse coherence or something else? 00:08:00.000 |
And in 2024 right now, we definitely know the answer. 00:08:05.000 |
Language modeling works very well because this is what we've been using for the past three years. 00:08:09.000 |
But back then in 2017 and 2018, this was not very clear. 00:08:13.000 |
For example, this paper came out before BERT. 00:08:16.000 |
And even the original transformer paper, "Attention Is All You Need", used machine translation as its training objective. 00:08:25.000 |
So back then, this definitely was not very obvious to people. 00:08:29.000 |
The second question is, how should we transfer these embeddings to the target task? 00:08:37.000 |
Do you want to modify the model architecture? 00:08:43.000 |
Each task is going to require its own specific modification to the model itself. 00:08:49.000 |
This is one approach, and this requires very deep knowledge of model architecture and being just a wizard to modify the architecture. 00:08:59.000 |
The second approach is to use a specific recipe or schema to do transfer learning. 00:09:05.000 |
A very popular example of this is ULMFiT by Jeremy Howard and Sebastian Ruder. 00:09:11.000 |
We're going to cover this in the next two slides, I think. 00:09:14.000 |
The third option is also you can add an auxiliary learning objective during pre-training. 00:09:19.000 |
While you're pre-training on language modeling, you can have an auxiliary learning objective like machine translation or discourse coherence. 00:09:27.000 |
These are some of the approaches you might want to take when you are deciding on the target task. 00:09:34.000 |
All these questions made going beyond word embedding not so straightforward. 00:09:42.000 |
They made it difficult to utilize semi-supervised learning or unsupervised pre-training. 00:09:49.000 |
Let's take a look at ULMFiT and how they did this. 00:09:53.000 |
This is a paper titled "Universal Language Model Fine-Tuning for Text Classification." 00:10:00.000 |
Their objective is text classification, but they want to also build a universal language model. 00:10:06.000 |
This is a seminal work in NLP. It's a very well-known paper and it had a big impact. 00:10:12.000 |
The question they are raising is instead of just utilizing the word embeddings, 00:10:21.000 |
like you're going to have a classifier, you're going to have an embedding layer and a classification layer. 00:10:26.000 |
The old way of doing this, you're going to use the embedding layer from word2vec and you are going to randomly initialize the classification head. 00:10:34.000 |
They are asking why not just have a good initialization point for all the layers, not just the embedding layer. 00:10:42.000 |
Their answer to this question is an approach called ULMFiT. 00:10:47.000 |
It is a three-step recipe for state-of-the-art text classification back then. 00:10:52.000 |
We have three steps. The first step is to train the language model on general domain data. 00:10:58.000 |
We call this pre-training on large corpus these days. 00:11:02.000 |
You just train your language model on a very big corpus like WikiText-103. 00:11:07.000 |
This was big back then, but it's probably small now. 00:11:10.000 |
You can do pre-training on 15 trillion tokens of data if you are a big organization like Meta, for example, or even more. 00:11:19.000 |
This is the first step. The second step is you do fine-tune the language model on your target data. 00:11:26.000 |
You keep doing language modeling in this step. 00:11:30.000 |
This is similar to what we call continued pre-training on target domain. 00:11:35.000 |
You just take your Llama 2 70B base and you just do language modeling on, let's say, financial books to try to make your model like BloombergGPT, for example. 00:11:48.000 |
But you're still doing language modeling. You're not doing any task-specific training. 00:11:54.000 |
The third and final step is to train a classifier on your label data. 00:11:58.000 |
This is the fine-tuning step that we are all familiar with. 00:12:02.000 |
Let's say you want to classify Amazon reviews as being positive, neutral, or negative. 00:12:07.000 |
You're just going to get maybe 1,000 reviews with a label for each review and just train the model in a supervised fashion. 00:12:16.000 |
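Here's a minimal sketch of that three-step recipe in PyTorch. The tiny LSTM, the data, and the hyperparameters are illustrative placeholders; the real ULMFiT adds tricks like discriminative learning rates, slanted triangular schedules, and gradual unfreezing:

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab=10_000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)
    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.head(h)

lm, ce = TinyLM(), nn.CrossEntropyLoss()

def lm_loss(batch):  # next-token prediction on unlabeled text
    logits = lm(batch[:, :-1])
    return ce(logits.reshape(-1, logits.size(-1)), batch[:, 1:].reshape(-1))

# Step 1: train lm_loss on a general corpus (e.g. WikiText).
# Step 2: keep training lm_loss, but on your target-domain corpus.
# Step 3: swap in a classifier head and train on the labeled data.
clf_head = nn.Linear(128, 3)  # e.g. positive / neutral / negative

def clf_loss(tokens, labels):
    h, _ = lm.rnn(lm.emb(tokens))
    return ce(clf_head(h[:, -1]), labels)  # last hidden state as summary
```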
This was a very good paper and it was very influential and made a big buzz in the ML field. 00:12:26.000 |
This paper was released in 2018, but it did not mention the word "transformer" at all. 00:12:35.000 |
The architecture of the model used was an RNN-based model. I think it was an LSTM. 00:12:43.000 |
And this is kind of a big gap in this work that GPT folks are going to fill. 00:12:50.000 |
Otherwise, this paper would have been a very, very good paper. 00:12:57.000 |
So, let's talk a bit about GPT, the thing that we want to talk about today. 00:13:04.000 |
GPT stands for Generative Pre-trained Transformer, the keyword is "transformer". 00:13:10.000 |
It was developed by OpenAI, by these awesome folks. 00:13:15.000 |
It was actually one of the first things that made OpenAI a popular organization in the ML field. 00:13:22.000 |
The whole premise is to use a semi-supervised approach for NLU tasks, natural language understanding tasks. 00:13:32.000 |
So, the goal is to learn a universal representation that transfers very well to other downstream tasks. 00:13:41.000 |
So, basically have a good starting point, instead of starting from scratch every single time. 00:13:50.000 |
The first step is to do unsupervised pre-training, and then supervised fine-tuning. 00:13:55.000 |
And this is kind of where the word "semi-supervised" comes from. 00:13:59.000 |
So, a mixture of unsupervised training and supervised training. 00:14:03.000 |
And their architecture is transformers, of course, because we have GPUs, and GPUs love transformers. 00:14:14.000 |
So, to do this approach, you're going to need two things. 00:14:17.000 |
The first thing is a very big corpus of unlabeled text. 00:14:20.000 |
And then the second thing is a dataset that is annotated, that is ready to use for supervised fine-tuning. 00:14:28.000 |
So, you could have multiple datasets if you want to train your model for several tasks. 00:14:34.000 |
But the good news is your target tasks do not need to be in the same domain as the unlabeled text. 00:14:40.000 |
So, for example, let's say you want to train on financial tasks. 00:14:45.000 |
You're going to give the model some information about a stock and ask it about how it performs. 00:14:55.000 |
Actually, you can pre-train on just like normal general data. 00:14:59.000 |
Like you can pre-train on a corpus from the Internet and then just fine-tune on your desired tasks. 00:15:05.000 |
Like your unlabeled corpus does not need to be in the same domain as your objective. 00:15:11.000 |
And this is good news because we have a lot of general-purpose text that you can use for pre-training. 00:15:17.000 |
While obtaining a very domain-specific corpus is more involved. 00:15:22.000 |
And a very minor note here is the name can be misleading in this work. 00:15:27.000 |
The word "generative" here mainly refers to the pre-training step. 00:15:31.000 |
The actual tasks that they had in mind are more discriminative. 00:15:35.000 |
So, like classification, question answering, and semantic similarity. 00:15:38.000 |
That is, natural language understanding tasks. 00:15:41.000 |
So, they did not discuss machine translation or just being a chatbot in this work. 00:15:49.000 |
And they actually released their blog post under a different, and I think more fitting title, 00:15:54.000 |
called "Improving Language Understanding with Unsupervised Learning". 00:15:58.000 |
And this is like the key idea here, like unsupervised learning. 00:16:14.000 |
Let's also discuss some of the other related work in this domain. 00:16:18.000 |
So, let's first start with semi-supervised learning. 00:16:25.000 |
And back then this was becoming very popular for tasks like sequence labeling and text classification. 00:16:31.000 |
People were doing semi-supervised learning for these. 00:16:34.000 |
And you have different levels for this approach. 00:16:38.000 |
So, the first and basic level is just using language statistics to get the features. 00:16:46.000 |
Like bag-of-words, TF-IDF, and all this classical machine learning stuff. 00:16:53.000 |
You can use them as input features and just train a classifier on top of this. 00:16:58.000 |
But that's not very helpful, because two words that are different in syntax, 00:17:03.000 |
but are similar in semantics, are going to have very different features. 00:17:13.000 |
The second step is to use the word embedding, like we discussed previously. 00:17:16.000 |
And this approach allows you to capture the semantics, 00:17:20.000 |
but it's also very limited in that it's based on words. 00:17:24.000 |
We're not capturing the higher level semantics. 00:17:30.000 |
So, instead of just using the word to get the embedding, 00:17:35.000 |
you are going to utilize the entire sequence. 00:17:38.000 |
So, like the sentence or the paragraph to get the embedding. 00:17:41.000 |
And this actually allows us to utilize the context 00:17:44.000 |
to understand the high level semantics of the text. 00:17:50.000 |
We are using the entire sequence to generate embeddings 00:17:54.000 |
that we can use for classification or any other task. 00:17:59.000 |
So, this is regarding semi-supervised learning. 00:18:02.000 |
The second field, and the more specific one, is unsupervised pre-training. 00:18:07.000 |
So, it's a special case of semi-supervised learning. 00:18:10.000 |
These terminologies can be confusing, I know. 00:18:14.000 |
But the goal is to find a good initial starting point 00:18:18.000 |
instead of modifying the supervised learning objective. 00:18:36.000 |
This approach was actually used in vision, in image classification. 00:18:43.000 |
You can take a ResNet pre-trained on ImageNet as a backbone, 00:18:49.000 |
and fine-tune it to, say, detect pneumonia in chest x-rays. 00:18:56.000 |
Even though ImageNet is just classifying everyday images, 00:19:18.000 |
this works because people have found out that with pre-training, 00:19:22.000 |
your model tends to have better generalization. 00:19:49.000 |
However, the main drawback of these two works, 00:19:52.000 |
and almost everything else back then, 00:20:04.000 |
is that they used LSTM-based models, whereas we can train transformers more efficiently. 00:20:13.000 |
LSTMs also struggle with long-range context because of the vanishing gradient problem. 00:20:19.000 |
Still, this was the most common approach back then. 00:20:21.000 |
Another approach is also to use the hidden representations of a pre-trained model 00:20:27.000 |
as auxiliary features while training your model on the target task. 00:20:49.000 |
This involves a substantial amount of new parameters for each separate task. 00:20:59.000 |
So, are there any questions before we actually go in detail into the approach? 00:21:24.000 |
but back then, I think they were writing CUDA code. 00:21:36.000 |
So, yeah, I think AlexNet definitely did use GPUs, 00:21:55.000 |
Someone said, "According to Papers with Code, they did." 00:22:01.000 |
So let's go into details about the GPT approach. 00:22:16.000 |
The training objective here is language modeling. 00:22:34.000 |
The loss function used is negative log-likelihood: 00:22:43.000 |
the negative log-likelihood of the correct next token. 00:22:57.000 |
you do this over every single token in this sentence. 00:23:20.000 |
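Written out, this is equation (1) from the paper: maximize the log-likelihood of each token given the previous k tokens, which is the same as minimizing the negative log-likelihood:

```latex
L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)
```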
So--and the difference between the encoder and the decoder, 00:23:24.000 |
I think, can be summarized to, like, how you do attention. 00:23:40.000 |
I think it's also called left-to-right attention 00:23:52.000 |
So each block applies a multi-headed masked self-attention operation, 00:23:57.000 |
and then this is followed by a position-wise feed-forward layer. 00:24:05.000 |
and then you just generate an output distribution over the target tokens. 00:24:21.000 |
So you have, like, the token integer or token ID. 00:24:26.000 |
So you take this number, and you pass it through the embedding layer, 00:24:36.000 |
And you also get the positional embedding for this token. 00:24:39.000 |
You sum these up, like, just vector addition, 00:24:43.000 |
and you get your input to the first transformer block, 00:24:58.000 |
And then you take H0 and pass it to each block, 00:25:04.000 |
and the output of the first block is going to be the input to the second block, and so on. 00:25:13.000 |
Once you're done going through all these blocks, you get your final vector. 00:25:21.000 |
So you have a vector, but you want to go back to a token, 00:25:26.000 |
so you project it into a probability distribution over all the tokens. 00:25:34.000 |
You apply a softmax to get actual probabilities that sum up to one. 00:25:40.000 |
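These steps are equation (2) in the paper: embed the tokens and add positions, run the stack of blocks, then project back to the vocabulary with the transposed embedding matrix and a softmax:

```latex
h_0 = U W_e + W_p
h_l = \mathrm{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n]
P(u) = \mathrm{softmax}(h_n W_e^{\top})
```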
And I think we've covered the transformer paper, 00:25:46.000 |
but if you have other questions, please go ahead. 00:25:55.000 |
Someone asked whether the embedding matrix used at the input is the same as the one used at the final step; yes, the input and output embeddings are tied. 00:26:11.000 |
So the second step is supervised fine-tuning. 00:26:16.000 |
The goal here is to adapt the parameters of the pre-trained model 00:26:26.000 |
you have a sequence of input tokens and the label. 00:26:33.000 |
and your label could be, like, a negative sentiment. 00:26:37.000 |
So the inputs are passed through the pre-trained model 00:26:40.000 |
to obtain the final transformer block's activations. 00:26:57.000 |
So you're going to get the hidden representation of the last token, h_l^m, 00:27:06.000 |
where m is the final token of the input sequence. 00:27:08.000 |
And you just pass this through your classification head 00:27:15.000 |
we're gonna use Softmax on top of a linear layer 00:27:18.000 |
to get your output and compare it with the label. 00:27:23.000 |
And you are using roughly the same loss function, 00:27:39.000 |
over only the output token, not the entire sequence. 00:27:42.000 |
So the loss is only on y, not on x1 through xm. 00:27:48.000 |
So the only extra parameters you need for this 00:28:02.000 |
are the matrix of parameters of the output layer, W_y. 00:28:06.000 |
And also embeddings for any new tokens you're adding, like the delimiters. 00:28:23.000 |
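In the paper's notation, the head is a softmax over a linear projection W_y of the final token's last-layer activation, and the supervised loss is the log-likelihood of the labels:

```latex
P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}(h_l^m W_y)
L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)
```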
They found it helps to include an auxiliary objective during fine-tuning: not just this classification, but also language modeling. 00:28:29.000 |
It helps by improving the generalization of the supervised model and by accelerating convergence. 00:28:34.000 |
They also say this is in line with prior work. 00:28:42.000 |
And the way you do this is your loss function 00:28:53.000 |
and also you have the language modeling loss function 00:28:57.000 |
with a certain weight, like lambda here is like a weight. 00:29:00.000 |
And lambda could be, for example, 0.5 or 0.3. 00:29:04.000 |
So you have like a summation of multiple losses. 00:29:17.000 |
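So the combined fine-tuning objective, equation (5) in the paper, is just:

```latex
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})
```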
Nowadays, I think people just do supervised fine-tuning without this auxiliary loss. 00:29:30.000 |
I think the chat seems to not have any questions. 00:29:37.000 |
So maybe you want to just-- you can just continue from now. 00:29:47.000 |
we have covered the approach, the basic two steps. 00:29:51.000 |
You will now get a very good, let's say, classifier, 00:29:54.000 |
because you have done unsupervised pre-training first. 00:30:07.000 |
The target tasks are things like classification, entailment, semantic similarity, 00:30:15.000 |
so discriminative tasks rather than generative tasks. 00:30:21.000 |
So for tasks like classification, this is very easy. 00:30:25.000 |
Just add the head on top and do the classification. 00:30:28.000 |
But other tasks have different structured inputs and outputs. 00:30:33.000 |
So for example, text entailment has ordered sentence pairs. 00:30:36.000 |
MCQs have a question with multiple answers, and so on. 00:30:45.000 |
And the way people have dealt with this previously 00:30:47.000 |
is just learn a specific architecture for each task 00:30:55.000 |
And this defeats the whole purpose of the GPT work. 00:31:02.000 |
which is to have a global, general-purpose model, 00:31:05.000 |
rather than having multiple task-specific architectures. 00:31:23.000 |
So they are trying to create a multitask format. 00:31:27.000 |
And this is similar to what people have used in later work. 00:31:42.000 |
They convert structured inputs into ordered sequences, which allows us to use the same architecture for different tasks. 00:31:45.000 |
So you don't need to do a lot of modification. 00:31:49.000 |
And we're going to go into this in detail for a few example tasks. 00:31:53.000 |
So for example, let's take textual entailment. 00:31:56.000 |
This task involves reading a pair of sentences 00:32:01.000 |
So the relationship could be one of entailment, contradiction, or neutral. 00:32:04.000 |
And a small note is that this task is still challenging 00:32:10.000 |
because your model needs to have a good understanding of both sentences and how they relate. 00:32:19.000 |
So you have your premise, and you have your hypothesis, 00:32:24.000 |
and you want to predict the relationship as being entailment, contradiction, or neutral. 00:32:34.000 |
So this could be a sentence, and this could be a sentence. 00:32:37.000 |
You just concatenate them with a special delimiter token between them, 00:32:43.000 |
plus a start token at the beginning and an extract token at the end. 00:32:46.000 |
And you just try to train a classifier on top of this 00:32:52.000 |
that labels the whole input sequence as one of entailment, contradiction, or neutral. 00:32:57.000 |
So this has become just a classification task. 00:33:06.000 |
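A tiny sketch of this input transformation. The literal strings "<s>", "$", and "<e>" are placeholders; the paper actually uses randomly initialized start, delimiter, and extract token embeddings:

```python
def build_entailment_input(premise_tokens, hypothesis_tokens):
    # <s> premise $ hypothesis <e>  -- one flat sequence for the transformer
    return ["<s>"] + premise_tokens + ["$"] + hypothesis_tokens + ["<e>"]

seq = build_entailment_input(
    ["a", "man", "is", "sleeping"],    # premise
    ["a", "person", "is", "resting"],  # hypothesis
)
# A classification head reads the final hidden state at "<e>" and
# predicts entailment / contradiction / neutral.
```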
The second task that we cover is semantic similarity. 00:33:21.000 |
So this task is about predicting how semantically similar two sentences are. 00:33:27.000 |
And semantic similarity just means how close in meaning the two sentences are. 00:33:47.000 |
So this can be challenging if your model is not smart enough. 00:34:02.000 |
And unlike entailment, there is no distinction here between a premise and a hypothesis; the two sentences have no inherent order. 00:34:09.000 |
And the way they approach this is using a Siamese-style setup: 00:34:26.000 |
you use the same model twice, on two versions of the input. 00:34:31.000 |
So you get your sentence-- first sentence and second sentence, 00:34:34.000 |
and then concatenate them and add your special tokens. 00:34:41.000 |
And also, you do the same thing, but you reverse the order of the two sentences, 00:34:49.000 |
pass it to the transformer, and you get your vector. 00:34:52.000 |
So basically, at the end, you're going to have two vectors. 00:34:59.000 |
You add these two vectors element-wise, and then you just add a head on top of the resulting vector. 00:35:05.000 |
So you just multiply the vector by some layer, 00:35:09.000 |
a linear layer, for example, that has parameters w, 00:35:20.000 |
whether these sentences are similar or not similar, 00:35:23.000 |
so you have only two labels, you can train a classifier. 00:35:26.000 |
If you're interested in having more of a scale of similarity, 00:35:30.000 |
like from 0 to 10, you can train a regressor on top of this. 00:35:37.000 |
We're still using the same architecture, by the way, 00:35:42.000 |
We're just being smart about how to approach this task. 00:35:48.000 |
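A sketch of the two-ordering trick, assuming hypothetical `transformer` and `head` callables, where `transformer` returns the final hidden state at the extract token:

```python
def similarity_logits(transformer, head, sent_a, sent_b):
    # Process both orderings independently, since there is no inherent order.
    h_ab = transformer(["<s>"] + sent_a + ["$"] + sent_b + ["<e>"])
    h_ba = transformer(["<s>"] + sent_b + ["$"] + sent_a + ["<e>"])
    # Add the two sequence representations element-wise, then apply the head:
    # two logits for a similar/not-similar classifier, or one for a regressor.
    return head(h_ab + h_ba)
```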
You can extend this to do question answering. 00:35:51.000 |
So let's say you have a document, a question, and four choices: 00:35:58.000 |
you have four potential answers relating to this document. 00:36:01.000 |
And you want your model to pick one answer of these. 00:36:04.000 |
So the way to do this is you take your document or context 00:36:08.000 |
and add the question and then add the first answer. 00:36:17.000 |
And similarly, you get your context or document 00:36:20.000 |
and add the question and add the second answer 00:36:22.000 |
and make this into a sequence and give it to the transformer. 00:36:31.000 |
Then you compare the scores the model gives to each of these potential answers. 00:36:39.000 |
So we have a document, we have a question, and we have answer A. 00:36:42.000 |
We concatenate all of these with some special tokens. 00:36:49.000 |
We pass it through the transformer and the final linear layer, and we get the score. 00:36:55.000 |
Same for answer B and answer C. And we just do a softmax over these scores to pick the answer. 00:37:02.000 |
You've got your model to do question answering. 00:37:10.000 |
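A sketch of that multiple-choice scoring, again with hypothetical `transformer` and `linear` callables that map a sequence to a scalar score:

```python
import math

def answer_probs(transformer, linear, context, question, answers):
    scores = []
    for answer in answers:
        # [start] context [delim] question [delim] answer [extract]
        seq = ["<s>"] + context + ["$"] + question + ["$"] + answer + ["<e>"]
        h = transformer(seq)             # final hidden state at "<e>"
        scores.append(float(linear(h)))  # scalar score for this candidate
    # Softmax across the candidates gives a distribution over the answers.
    exps = [math.exp(s - max(scores)) for s in scores]
    return [e / sum(exps) for e in exps]
```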
So any questions about these transformations for each task? 00:37:18.000 |
I think there was a question about modifying the input, 00:37:24.000 |
at least when it comes to semantic similarity. 00:37:34.000 |
Can you help us understand what it means to modify the input sequence? 00:37:45.000 |
So in entailment, you have a distinction between the premise and the hypothesis. 00:37:54.000 |
You have a very specific premise and a very specific hypothesis. 00:38:00.000 |
Like you should put the premise first, then the delimiter, and then the hypothesis. 00:38:04.000 |
You can't just put the hypothesis first, and then the delimiter, and then the premise. 00:38:08.000 |
So this is what they mean by the ordering here matters. 00:38:12.000 |
But for semantic similarity, you just have two sentences. 00:38:16.000 |
There is no inherent ordering for the semantic similarity task. 00:38:36.000 |
He said, does it also combat things like positional biases 00:38:39.000 |
and transformers by switching the order of the sentences 00:38:47.000 |
I think this is one of the motivations they did this. 00:38:50.000 |
They want to say, there is no inherent order for this task. 00:38:54.000 |
So maybe the transformer will just have a bias. 00:38:58.000 |
Like let's give it the first sentence, and then the second. 00:39:12.000 |
Maybe the model would pay more attention to the last few tokens in the input. 00:39:24.000 |
I think this is a good way to think about this. 00:39:30.000 |
I think-- I don't see any other questions in the chat. 00:39:33.000 |
So I can just surface them as and when they come. 00:39:35.000 |
So I think we should be good for now by the looks of it. 00:39:41.000 |
We can now discuss some details about the training. 00:39:44.000 |
And back then, OpenAI actually did release info about their training setup. 00:39:52.000 |
But anyways, so the data set they use for training 00:39:55.000 |
is BookCorpus for training the language model. 00:39:58.000 |
So this is in step one, which is unsupervised training. 00:40:04.000 |
It has 7,000 unique unpublished books from different genres. 00:40:13.000 |
So it's kind of a very good, big, diverse data set. 00:40:27.000 |
And it has long stretches of contiguous text: you have paragraphs that are maybe 10 lines or more. 00:40:39.000 |
An alternative is the 1 Billion Word Benchmark, which is also big and diverse, 00:40:46.000 |
but it's shuffled at the sentence level, so it's just a bunch of small sentences, I would say. 00:41:05.000 |
ELMo, for example, used that benchmark and demonstrated the power of unsupervised pre-training and having good embeddings. 00:41:10.000 |
So it's one of the fundamental papers and works in NLP. 00:41:15.000 |
So they say their model achieves a very low token-level perplexity of 18.4 on this corpus. 00:41:22.000 |
But I don't think this is actually low today. 00:41:34.000 |
And the model itself is not quite big by today's standard, of course. 00:41:43.000 |
They use byte-pair encoding, which is also, I think, 00:42:04.000 |
what modern tokenizers use, with a vocabulary in the same range, like 40,000 or 50,000 tokens. 00:42:07.000 |
So back then, this was actually quite big, I would say. 00:42:11.000 |
And they used the ftfy library to clean the raw text 00:42:14.000 |
and then do some standardization using the spaCy tokenizer. 00:42:19.000 |
So this is good work in the tokenization area, I would say. 00:42:25.000 |
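Roughly, the cleaning step looks like this; a sketch, since the exact pipeline in their codebase may differ:

```python
import ftfy
import spacy

nlp = spacy.blank("en")  # just the English tokenizer, no full pipeline

raw = "The modelâ€™s   output"             # mojibake from a bad encoding
clean = ftfy.fix_text(raw)                 # repairs the garbled characters
tokens = [tok.text for tok in nlp(clean)]  # spaCy-standardized word tokens
# The byte-pair encoding vocabulary is then learned and applied on top
# of this cleaned, standardized text.
```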
And their model is just a typical transformer model, 00:42:29.000 |
like typical by today's standard, just a transformer, 00:42:32.000 |
decoder-only transformer with 12 layers and masked self-attention. 00:42:40.000 |
And this was actually big back then, although it's tiny by today's standards. 00:42:45.000 |
For their embeddings, they used learned positional embeddings instead of the sinusoidal version from the original transformer paper. 00:42:53.000 |
So this was actually much simpler to implement. 00:42:57.000 |
And the model just has to learn the embeddings in training. 00:43:07.000 |
And they also used tied input and output token embeddings, 00:43:11.000 |
And for the attention block, they have 12 attention heads. 00:43:17.000 |
Each head has a dimension of 64, for a total of 768 dimensions. 00:43:27.000 |
And after the attention, you have the MLP layer, 00:43:29.000 |
also known as position-wise feedforward network. 00:43:32.000 |
And the size of the inner state of this network is 3,072. 00:43:41.000 |
And this means that, actually, this is an expansion factor of 4x over the 768-dimensional hidden state. 00:43:59.000 |
The optimizer they used is Adam, which was becoming very popular. 00:44:09.000 |
So it was getting a lot of traction back then. 00:44:11.000 |
And the maximum learning rate is 2.5e-4, 00:44:16.000 |
which I think was also used in one of the Llama models. 00:44:23.000 |
They do warm-up as well, linearly from 0 to the maximum learning rate over the first 2,000 updates, 00:44:32.000 |
and then they do cosine annealing from the maximum learning rate down to 0. 00:44:47.000 |
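Per the paper, the maximum learning rate was 2.5e-4, warmed up linearly over the first 2,000 updates and then cosine-annealed to 0. A small sketch of that schedule (total_steps is a placeholder; the real value depends on the corpus and batch size):

```python
import math

def gpt1_lr(step, max_lr=2.5e-4, warmup=2000, total_steps=100_000):
    if step < warmup:
        return max_lr * step / warmup  # linear warm-up from 0
    progress = (step - warmup) / (total_steps - warmup)
    return 0.5 * max_lr * (1 + math.cos(math.pi * progress))  # cosine to 0
```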
This is the same family of GPUs as P100, I think. 00:44:50.000 |
But I don't have much information about this GPU. 00:44:53.000 |
They train for 30 days, which actually is not bad. 00:45:12.000 |
And for reference, I think the tinygrad machine offers around that much compute today. 00:45:23.000 |
So anyway, the compute they used is almost one petaflop/s-day. 00:45:33.000 |
This is what the model looks like if you load it in the transformers framework. 00:45:36.000 |
So you have token embedding, position embedding, 00:45:40.000 |
And then you have your actual transformer blocks. 00:45:44.000 |
And this is the architecture of the language model. 00:45:50.000 |
So the block is just attention, then a layer norm, then the MLP, and then another layer norm. 00:46:02.000 |
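A sketch of one such block in PyTorch. Note the post-layer-norm ordering (residual add, then LayerNorm), matching the original transformer, and GELU in the MLP as the paper specifies:

```python
import torch
import torch.nn as nn

class GPT1Block(nn.Module):
    """One decoder block: masked self-attention -> LN -> MLP -> LN."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln_1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # inner state 3072 = 4 x 768
            nn.GELU(),                        # the paper uses GELU
            nn.Linear(4 * d_model, d_model),
        )
        self.ln_2 = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                       device=x.device), diagonal=1)
        a, _ = self.attn(x, x, x, attn_mask=causal)  # masked self-attention
        x = self.ln_1(x + a)               # residual add, then LayerNorm
        return self.ln_2(x + self.mlp(x))  # same pattern for the MLP
```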
Before we move on to the second step, which is supervised fine-tuning, there's a question. 00:46:15.000 |
You mentioned that the model itself has a perplexity of 18.4. 00:46:18.000 |
And so in this case, would it be OK to compare-- 00:46:23.000 |
is it a metric that's invariant across different models 00:46:42.000 |
Perplexity depends on the data: you measure the perplexity of GPT-1 on this particular data set. 00:46:52.000 |
It's like saying Llama 2 has a HumanEval of 70, 00:47:00.000 |
It's not totally fair, but we can do it, I guess. 00:47:16.000 |
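For reference, perplexity is just the exponentiated average per-token negative log-likelihood, which is exactly why it's only comparable when measured on the same corpus with the same tokenization:

```python
import math

def perplexity(token_nlls):
    # token_nlls: one negative log-likelihood (in nats) per token
    return math.exp(sum(token_nlls) / len(token_nlls))

print(perplexity([2.9, 2.9, 2.9]))  # exp(2.9) is roughly 18, GPT-1 territory
```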
So I think we should move on to supervised fine-tuning. 00:47:20.000 |
So the second step is supervised fine-tuning. 00:47:24.000 |
So for sentence similarity, they use these data sets. 00:47:36.000 |
I won't go through them one by one because I don't have much information about these. 00:47:40.000 |
And the architecture just uses the same backbone from pre-training, 00:47:44.000 |
and you add your head, mostly a classifier head, on top. 00:47:50.000 |
They train for three epochs with a batch size of 32. 00:47:53.000 |
This is also still standard nowadays as well. 00:47:57.000 |
People usually train, do fine-tuning for three epochs 00:48:10.000 |
And also, people use very similar learning rates nowadays 00:48:24.000 |
And they also use learning rate decay with a warm-up. 00:48:32.000 |
And for the auxiliary objective, they used a lambda of 0.5. 00:48:36.000 |
So the weight of the auxiliary language modeling objective is 0.5. 00:48:58.000 |
So I think we can just move on to the benchmarks. 00:49:04.000 |
they achieve SOTA performance on many tasks. 00:49:07.000 |
Like you can see here, on almost every single task they improve over the previous state of the art. 00:49:11.000 |
And there are ones where they have significant improvements, 00:49:28.000 |
I think the QNLI is more of a natural language understanding task. 00:49:39.000 |
And this is where they make the most improvement, I think. 00:49:43.000 |
But overall, they are doing very good performance on many tasks. 00:49:49.000 |
And they are comparing to LSTMs and other models. 00:49:56.000 |
Yeah, this is for natural language inference. 00:50:01.000 |
they also make very big gains, as you can see here. 00:50:13.000 |
And one of them is actually an ensemble of nine models, 00:50:20.000 |
And also, the same happens for semantic similarity 00:50:24.000 |
Although, I would say the boost in performance 00:50:30.000 |
They are good on some metrics and bad on other metrics. 00:50:45.000 |
They did-- after the benchmarks, they now have a good model. 00:50:48.000 |
So they are trying to understand why their model is good. 00:51:04.000 |
The first piece of analysis is to analyze the impact of the transfer learning: 00:51:07.000 |
the number of layers transferred from the pre-trained model to the fine-tuned model. 00:51:13.000 |
So they just take the pre-trained transformer, 00:51:16.000 |
and they take all the layers and do fine-tuning. 00:51:20.000 |
And then they take 11 layers, and then they do fine-tuning. 00:51:23.000 |
And then they take 10 layers and do fine-tuning, and so on. 00:51:26.000 |
They compare the performance of these models. 00:51:35.000 |
With zero layers transferred, you're not taking any of the pre-trained weights. 00:51:52.000 |
So you can see that a very big improvement in performance comes from transferring more layers. 00:52:00.000 |
So they observe the fairly obvious fact that transferring more layers improves performance, 00:52:07.000 |
while transferring only the embedding layers gives a much smaller boost. 00:52:13.000 |
And they mentioned a benefit of up to 9% from transferring 00:52:17.000 |
all of the layers on this task, MultiNLI. 00:52:30.000 |
This indicates that each layer of the pre-trained model contains useful functionality 00:52:48.000 |
for solving the target tasks like classification. 00:52:51.000 |
The second piece of analysis they did is zero-shot evaluation. 00:52:59.000 |
And this was, I would say, radical back then. 00:53:03.000 |
They are trying to evaluate how good the pre-trained model is 00:53:13.000 |
why is the language model pre-training effective? 00:53:20.000 |
Why does training on language modeling help us when we are doing classification? 00:53:24.000 |
They have a hypothesis that the underlying generative model 00:53:28.000 |
learns actually multiple tasks in the pre-training. 00:53:40.000 |
The intuition is that to model language well, you're more likely to need a deeper understanding of the text. 00:53:47.000 |
Like, for example, if you speak only English and French, 00:53:55.000 |
you won't be able to do classification in Spanish. 00:53:59.000 |
But if you also pick up Spanish along the way, then you can. 00:54:05.000 |
So this is something cool and something to think about a bit. 00:54:13.000 |
Transformers show very big improvement compared to LSTMs. 00:54:27.000 |
They basically tried to evaluate the pre-trained model 00:54:32.000 |
on these tasks before doing the supervised fine-tuning. 00:54:42.000 |
For example, for linguistic acceptability with CoLA, 00:54:48.000 |
they take the average token log probability and apply a threshold, 00:54:56.000 |
and they just determine if the model got this right or wrong. 00:55:05.000 |
So you've got your, let's say, Amazon review. 00:55:11.000 |
You append the word "very" and restrict the language model to generate only two tokens, "positive" and "negative". 00:55:17.000 |
And you see which token has the higher score or higher probability. 00:55:21.000 |
And this is the prediction of the model on this sentiment task. 00:55:26.000 |
So just restrict the language model prediction 00:55:29.000 |
to these two tokens and see which one is higher. 00:55:43.000 |
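A sketch of this restricted-decoding trick, assuming a hypothetical `next_token_logprobs` function that returns the model's log-probability for each candidate next token:

```python
def zero_shot_sentiment(next_token_logprobs, review_tokens):
    # Per the paper: append the token "very" and compare the probability
    # the LM assigns to "positive" vs. "negative" as the next word.
    logprobs = next_token_logprobs(review_tokens + ["very"])
    return "positive" if logprobs["positive"] > logprobs["negative"] \
        else "negative"
```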
So for each answer, as we said, you concatenate the document, the question, and that answer. 00:55:48.000 |
And you average the token log probability of just the answer tokens, and pick the answer with the highest average. 00:56:08.000 |
For example, the sentence says the city councilmen refused the demonstrators a permit because they feared violence. 00:56:13.000 |
So the goal here is to figure out what this word "they" refers to. 00:56:22.000 |
This is what they mean by Winograd schemas. 00:56:27.000 |
So the way they do this is they take this word "they" 00:56:30.000 |
and replace it with the two possible answers. 00:56:35.000 |
So in example A, you use the word "councilmen," and in example B, the word "demonstrators." 00:56:49.000 |
And then you just take the substitution with the higher probability as the model's answer. 00:56:58.000 |
So this is how they wanted to evaluate the model. 00:57:08.000 |
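A sketch of the substitution trick, assuming a hypothetical `avg_logprob` function that returns the model's average per-token log-probability for a sentence:

```python
def resolve_winograd(avg_logprob, sentence, pronoun, candidates):
    # Substitute each candidate referent for the ambiguous pronoun and keep
    # the variant the language model finds more probable.
    scored = {c: avg_logprob(sentence.replace(pronoun, c, 1))
              for c in candidates}
    return max(scored, key=scored.get)

# e.g. resolve_winograd(avg_logprob,
#     "The city councilmen refused the demonstrators a permit "
#     "because they feared violence.",
#     "they", ["the councilmen", "the demonstrators"])
```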
And they found that the longer you train the model and the more data you give the model, 00:57:15.000 |
the better zero-shot performance it has, basically. 00:57:26.000 |
And you can see that with more steps, more training steps, performance improves. 00:57:33.000 |
They also compare with an LSTM they trained, which is the dashed line. 00:57:36.000 |
And you can see that for almost every single task, 00:57:43.000 |
So they observed the fairly obvious fact that more training gives better zero-shot performance. 00:57:54.000 |
And this suggests that during generative pre-training, 00:58:00.000 |
the model learns a lot of tasks, not just language modeling. 00:58:05.000 |
They said they observed that the LSTM exhibits higher variance in its zero-shot performance, suggesting 00:58:10.000 |
that the inductive bias of the transformer assists in transfer. 00:58:13.000 |
But I'm not sure what they mean by high variance 00:58:18.000 |
So if anyone has any clue about this, please share it with us. 00:58:21.000 |
Yeah, it's better than the LSTM, and GPUs are all we need. 00:58:52.000 |
So in this case, we have the normal transformer training 00:58:58.000 |
with auxiliary language modeling, which is row 1. 00:59:01.000 |
And we have the one without auxiliary language modeling, 00:59:06.000 |
They say that the trend suggests that larger datasets benefit 00:59:12.000 |
from auxiliary training compared to smaller datasets. 00:59:18.000 |
And maybe this is why people stopped doing auxiliary language modeling. 00:59:20.000 |
It doesn't seem to be quite important, probably. 00:59:26.000 |
The second ablation study is the effect of the transformer. 00:59:32.000 |
and they compared it with an LSTM, so row 1 and row 4. 00:59:40.000 |
You can see the transformer has generally better performance than the LSTM on most tasks. 00:59:47.000 |
So they mentioned that they observed a 5.6 average score drop when using the LSTM instead of the transformer. 01:00:10.000 |
So this is the full transformer that has been trained in the framework they proposed, with pre-training. 01:00:14.000 |
And they also compared it to a transformer that 01:00:17.000 |
was directly trained on the supervised task, that is, without any pre-training. 01:00:23.000 |
You can see that this is where the huge difference shows up, 01:00:26.000 |
like 74 compared to 59, 45 compared to 18, 88 compared to a much lower number, and so on. 01:00:36.000 |
Yeah, and we observed that no pre-training actually 01:00:39.000 |
hurts the performance quite a lot on almost all tasks. 01:00:48.000 |
And this is actually the worst-performing model. So the takeaway from this ablation study is that pre-training matters a lot. 01:00:59.000 |
I think this is the final section before the future work. 01:01:17.000 |
Do you want to-- you have one more slide, I think. 01:01:23.000 |
Then we can open it up to questions if people-- 01:01:29.000 |
So this section, I don't think it was in the paper. 01:01:37.000 |
And the first approach is, surprise surprise, scaling. 01:01:42.000 |
So they mentioned that they noticed improvement with more scale, 01:01:48.000 |
and they're only using very limited hardware, so they want to see what happens 01:01:56.000 |
if you scale the model, the training, and the data set. 01:01:58.000 |
And I think we know the answer to this question, 01:02:07.000 |
the second futuristic approach they want to try 01:02:09.000 |
is to see if you can tweak the fine-tuning 01:02:14.000 |
approach instead of just doing vanilla fine-tuning. 01:02:17.000 |
You can use adapters or one of the other fancier ways of fine-tuning. 01:02:23.000 |
I don't think this turned out to be as important as they might have expected. 01:02:29.000 |
Because people right now are doing just simple fine-tuning, and it works. 01:02:34.000 |
So yeah, not quite as promising as the first approach, which is scaling. 01:02:44.000 |
And they've done some ablations and analysis about this. 01:02:48.000 |
They want to do even more targeted experiments to understand what the model learns during pre-training. 01:02:53.000 |
So basically, observability and explainability of the model. 01:03:01.000 |
One question is, how much does longer context help compared to better world knowledge? 01:03:13.000 |
The model is able to process a longer context because it's a transformer. 01:03:18.000 |
So is this the thing that gives it SOTA performance? 01:03:21.000 |
Or is it the fact that they train on a bigger data set 01:03:26.000 |
with a longer training time, and it obtains world knowledge? 01:03:39.000 |
They introduce a framework for achieving SOTA performance on natural language understanding tasks. 01:03:45.000 |
The goal of pretraining is to obtain a very good world understanding, 01:03:55.000 |
And you do the pretraining on a diverse corpus 01:04:00.000 |
So the model acquires, actually, world knowledge. 01:04:04.000 |
Then you transfer this knowledge by fine-tuning on tasks like question answering and so on. 01:04:11.000 |
They achieved state-of-the-art performance on 9 data sets out of 12. 01:04:18.000 |
This shows that the unsupervised approach to transfer learning works, 01:04:24.000 |
and it encourages more work on unsupervised pre-training and semi-supervised training. 01:04:31.000 |
And the work also highlights that the transformer 01:04:35.000 |
is a very good architecture, and larger data sets are good. 01:04:41.000 |
So it pushed the field in the direction of scaling up and pretraining, 01:04:46.000 |
which is exactly what everyone was going to do for the next six years, from 2018 to 2024. 01:04:56.000 |
And with that, we've come to the end of this paper. 01:04:58.000 |
So if you have any questions, I think we can take questions now.