
Breaking down the OG GPT Paper by Alec Radford



00:00:00.000 | Okay. Sure. So, hey everyone. My name is Amged. I'm a machine learning engineer.
00:00:06.000 | I generally do ML consulting services to startups. I help them like ship AI-powered products,
00:00:13.000 | especially in the field of NLP and speech-to-text applications.
00:00:17.000 | And I run a blog where I like publish posts about ML stuff.
00:00:23.000 | So feel free to check it out. I've done some posts about Whisper.
00:00:27.000 | Yeah. So with that out of the way, let's get directly to what we want to discuss today,
00:00:33.000 | which is like the GPT-1 paper by the folks at OpenAI.
00:00:40.000 | So the paper is titled "Improving Language Understanding by Generative Pre-training".
00:00:45.000 | It was published in June 2018. And these are like the authors.
00:00:50.000 | Let me switch to presentation mode. Yeah. And these are like the authors of the paper.
00:00:54.000 | So mainly Alec Radford and Ilya Sutskever, very well-known researchers in the ML field.
00:01:02.000 | So let's get started. Back in 2018, deep learning was becoming very popular.
00:01:10.000 | But the main thing with deep learning is like it is very, very data hungry.
00:01:15.000 | There are like good news and bad news.
00:01:17.000 | The good news is like there is a ton of data available everywhere on the Internet and just online.
00:01:24.000 | We have like tons and tons of data.
00:01:26.000 | The bad news is this data is not annotated and it's not curated and it's basically very, very messy.
00:01:33.000 | And if you want to train machine learning models back then,
00:01:36.000 | the only way to do it is just to annotate data yourself or hire like data annotators.
00:01:41.000 | And these tend to be very, very expensive and difficult to scale and hire.
00:01:45.000 | Like if you think GPUs are expensive, you have not worked with data annotators.
00:01:49.000 | That's what I like to say. So this makes like deep learning very restricted.
00:01:54.000 | You cannot use it everywhere.
00:01:56.000 | Like you're only restricted to fields that have good, high quality annotated data sets.
00:02:02.000 | And this is a very big bottleneck.
00:02:05.000 | So people back then in 2018 are trying to solve this problem.
00:02:09.000 | Like how do we get over the fact that we need labeled data?
00:02:15.000 | And one potential solution to this problem is like unsupervised learning.
00:02:20.000 | So the question is, what if we can leverage like just the linguistic information from the unlabeled data?
00:02:26.000 | So we have like a bunch of text, like novels, books, articles.
00:02:32.000 | How can we leverage like the linguistic information from this?
00:02:35.000 | And the answer to this could be using unsupervised learning.
00:02:40.000 | And if you can do this successfully, this alleviates the need for large amounts of labeled data.
00:02:45.000 | Because basically you can utilize Wikipedia, which is very big, or even all the papers published on archive and so on.
00:02:55.000 | And even if you have like lots of labeled data, using unsupervised learning as a step one or step zero
00:03:01.000 | is going to make your model actually perform better than just training on this labeled data.
00:03:06.000 | Because of transfer learning, where you can transfer the things that your model has learned during pre-training
00:03:12.000 | into your actual objective.
00:03:15.000 | And a good evidence of this is that back in 2017, or even a bit before 2017,
00:03:20.000 | people have been using pre-trained word embeddings,
00:03:24.000 | things like word2vec, GloVe, and fastText, to achieve very good performance on many tasks like classification and machine learning.
00:03:32.000 | Sorry, machine translation.
00:03:34.000 | So this is a good evidence that actually unsupervised learning works and it's a very good approach.
00:03:40.000 | So this is the premise of the paper actually, like unsupervised learning is very promising.
00:03:48.000 | So let's talk a bit about word embedding.
00:03:51.000 | So the idea behind word embedding is like you want to project words, so just text, into an n-dimensional space.
00:03:58.000 | And n is usually 300 or 1,000 back then.
00:04:02.000 | And this space has a very special property that words that are similar in meaning have very similar products.
00:04:08.000 | Sorry, very similar vectors.
00:04:11.000 | And by similar vectors I mean like you can measure similarity by things like cosine similarity, dot product, or L2 distance.
00:04:20.000 | So we can capture similarity between words that don't resemble each other but have similar meanings.
00:04:27.000 | For example, the word "booking" and "reservation".
00:04:30.000 | Just from the syntax, they are very different words, but they have very similar meanings.
00:04:34.000 | We're just booking something.
00:04:36.000 | And similarly, the word "adam" and "sgd".
00:04:39.000 | These are like completely different words, but they are both actually optimizers used in machine learning.
00:04:44.000 | So they are very similar and their vectors are going to be probably similar.
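As a rough sketch of the similarity idea (not from the paper or the talk; the vectors below are made-up toy stand-ins for real 300-dimensional embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(a, b) = (a . b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors standing in for real word embeddings.
adam   = np.array([0.9, 0.1, 0.8, 0.0])
sgd    = np.array([0.8, 0.2, 0.7, 0.1])
banana = np.array([0.0, 0.9, 0.1, 0.8])

print(cosine_similarity(adam, sgd))     # high: both are optimizer names
print(cosine_similarity(adam, banana))  # low: unrelated words
```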
00:04:48.000 | The most common implementation of word embedding is like "word2vec" by Google.
00:04:53.000 | This is what popularized the usage of word embedding.
00:04:57.000 | And then "GloVe" by Stanford and "fastText" by Facebook.
00:05:01.000 | And the way these word embedding models are trained is like leveraging co-occurrence between words that have similar contexts.
00:05:09.000 | For example, similar words tend to occur in similar contexts.
00:05:13.000 | Like you would generally find the word "adam" associated with learning rate or machine learning.
00:05:19.000 | Similarly, "sgd" is associated with learning rate as well.
00:05:22.000 | So you can conclude that these two words are kind of similar.
00:05:25.000 | And the way these word embeddings are used is that they are utilized by training a head.
00:05:31.000 | Like for example, if you want to classify a word as being positive or negative, having a positive sentiment or a negative sentiment,
00:05:38.000 | you can train a classification head on top of the frozen embeddings.
00:05:42.000 | So you use the word embeddings like as a fixed feature input, input features.
00:05:47.000 | Just frozen things without training the word embeddings themselves.
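A minimal PyTorch sketch of this frozen-embeddings setup (the embedding matrix here is random, standing in for real word2vec/GloVe vectors; only the head is trained):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, emb_dim, num_classes = 10_000, 300, 2

# Pretend these are pre-trained vectors loaded from word2vec / GloVe.
pretrained_vectors = torch.randn(vocab_size, emb_dim)

embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)  # fixed input features
classifier = nn.Linear(emb_dim, num_classes)                               # the only trainable part

token_ids = torch.randint(0, vocab_size, (8, 20))   # batch of 8 sentences, 20 tokens each
sentence_vec = embedding(token_ids).mean(dim=1)     # average the word vectors
logits = classifier(sentence_vec)                   # e.g. positive vs. negative sentiment

labels = torch.randint(0, num_classes, (8,))
loss = F.cross_entropy(logits, labels)
loss.backward()                                     # gradients reach the head, not the embeddings
```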
00:05:54.000 | And this has like a significant drawback or a bunch of significant drawbacks.
00:05:59.000 | The first thing is it doesn't utilize the context of the text, so you're just using the word itself.
00:06:05.000 | And some words, even the same word, can have very different meanings depending on the context.
00:06:11.000 | For example, the river bank, like the Amazon bank or the Nile bank, is different from the HSBC bank or like the JPMorgan bank,
00:06:19.000 | even though these are the exact same words.
00:06:22.000 | So if you're going to use just word2vec, these two occurrences will have the same vector, even though they have very different contexts.
00:06:29.000 | And even beyond this, natural language has nuances that cannot just be captured by using words.
00:06:35.000 | Like the way people write stylized or meme spellings of OpenAI, the way you write it this way,
00:06:41.000 | you have a very specific intonation compared just to writing OpenAI in the normal way.
00:06:47.000 | So are there any questions about word embeddings so far?
00:07:00.000 | Someone says in the chat, nope. I cannot read the chat, so if just someone can say this using the microphone, that'd be great.
00:07:08.000 | I don't think there are any questions. I think Sean said no.
00:07:11.000 | But if anyone has any questions, you can maybe drop it in the chat, and then I can just surface them as and when.
00:07:17.000 | But it seems like everyone's okay with it for now. I'll keep an eye on the chat for you.
00:07:21.000 | Yeah, great. Yeah, thank you. That's appreciated.
00:07:24.000 | So this was word embedding. The main limitation or drawback is you cannot use context.
00:07:30.000 | So can we go beyond word embedding?
00:07:33.000 | Word embeddings are too local. We need something that's more global that can capture the higher-level semantics.
00:07:39.000 | But this also has its complications.
00:07:43.000 | How do you leverage more than word-level information from unlabeled text?
00:07:47.000 | You're going to have some questions that you need to answer.
00:07:51.000 | First is, which objective should you use while training?
00:07:54.000 | Do you want to use language modeling or machine translation or discourse coherence or something else?
00:08:00.000 | And in 2024 right now, we definitely know the answer.
00:08:05.000 | Language modeling works very well because this is what we've been using for the past three years.
00:08:09.000 | But back then in 2017 and 2018, this was not very clear.
00:08:13.000 | For example, this paper came out before BERT.
00:08:16.000 | And even the original transformer paper, "Attention Is All You Need", used machine translation.
00:08:25.000 | So back then, this definitely was not very obvious to people.
00:08:29.000 | The second question is, how should we transfer these embeddings to the target task?
00:08:35.000 | There are a bunch of ways to do this.
00:08:37.000 | Do you want to modify the model architecture?
00:08:43.000 | Each task is going to require its own specific modification to the model itself.
00:08:49.000 | This is one approach, and this requires very deep knowledge of model architecture and being just a wizard to modify the architecture.
00:08:59.000 | The second approach is to use a specific recipe or schema to do transfer learning.
00:09:05.000 | A very popular example of this is ULMFiT by Jeremy Howard and Sebastian Ruder.
00:09:11.000 | We're going to cover this in the next two slides, I think.
00:09:14.000 | The third option is also you can add an auxiliary learning objective during pre-training.
00:09:19.000 | While you're pre-training on language modeling, you can have an auxiliary learning objective like machine translation or discourse coherence.
00:09:27.000 | These are some of the approaches you might want to take when you are deciding on the target task.
00:09:34.000 | All these questions made going beyond word embedding not so straightforward.
00:09:42.000 | They made it difficult to utilize semi-supervised learning or unsupervised pre-training.
00:09:49.000 | Let's take a look at ULMFiT and how they did this.
00:09:53.000 | This is a paper titled "Universal Language Model Fine-Tuning for Text Classification."
00:10:00.000 | Their objective is text classification, but they want to also build a universal language model.
00:10:06.000 | This is a very seminal work in NLP. It's a very well-known paper and it had a big impact.
00:10:12.000 | The question they are raising is instead of just utilizing the word embeddings,
00:10:21.000 | like you're going to have a classifier, you're going to have an embedding layer and a classification layer.
00:10:26.000 | The old way of doing this, you're going to use the embedding layer from word2vec and you are going to randomly initialize the classification head.
00:10:34.000 | They are asking why not just have a good initialization point for all the layers, not just the embedding layer.
00:10:42.000 | Their answer to this question is an approach called ULMFiT.
00:10:47.000 | It is a three-step recipe for state-of-the-art text classification back then.
00:10:52.000 | We have three steps. The first step is to train the language model on general domain data.
00:10:58.000 | We call this pre-training on large corpus these days.
00:11:02.000 | You just train your language model on a very big corpus like Wikitext.
00:11:07.000 | This was big back then, but it's probably small now.
00:11:10.000 | You can do pre-training on 15 trillion tokens of data if you are a big organization like Meta, for example, or even more.
00:11:19.000 | This is the first step. The second step is you do fine-tune the language model on your target data.
00:11:26.000 | You keep doing language modeling in this step.
00:11:30.000 | This is similar to what we call continued pre-training on target domain.
00:11:35.000 | You just take your Llama 2 70B base and you just do language modeling on, let's say, financial books to try to make your model like BloombergGPT, for example.
00:11:48.000 | But you're still doing language modeling. You're not doing any task-specific training.
00:11:54.000 | The third and final step is to train a classifier on your label data.
00:11:58.000 | This is the fine-tuning step that we are all familiar with.
00:12:02.000 | Let's say you want to classify Amazon reviews as being positive, neutral, or negative.
00:12:07.000 | You're just going to get maybe 1,000 reviews and the label for each review, and just train the model in a supervised fashion.
00:12:16.000 | This was a very good paper and it was very influential and made a big buzz in the ML field.
00:12:26.000 | This paper was released in 2018, but it did not mention the word "transformer" at all.
00:12:32.000 | They did not even reference the paper.
00:12:35.000 | The architecture of the model used was an RNN-based model. I think it was an LSTM.
00:12:40.000 | So, no transformers at all.
00:12:43.000 | And this is kind of a big gap in this work that GPT folks are going to fill.
00:12:50.000 | Otherwise, this paper would have been a very, very good paper.
00:12:55.000 | It still is a good paper.
00:12:57.000 | So, let's talk a bit about GPT, the thing that we want to talk about today.
00:13:04.000 | GPT stands for Generative Pre-trained Transformer, the keyword is "transformer".
00:13:10.000 | It was developed by OpenAI, by these awesome folks.
00:13:15.000 | It was actually one of the first things that made OpenAI a popular organization in the ML field.
00:13:22.000 | The whole premise is to use a semi-supervised approach for NLU tasks, natural language understanding tasks.
00:13:32.000 | So, the goal is to learn a universal representation that transfers very well to other downstream tasks.
00:13:41.000 | So, basically have a good starting point, instead of starting from scratch every single time.
00:13:47.000 | And their approach is basically two-step.
00:13:50.000 | The first step is to do unsupervised pre-training, and then supervised fine-tuning.
00:13:55.000 | And this is kind of where the word "semi-supervised" comes from.
00:13:59.000 | So, a mixture of unsupervised training and supervised training.
00:14:03.000 | And their architecture is transformers, of course, because we have GPUs, and GPUs love transformers.
00:14:09.000 | So, that's a good thing.
00:14:14.000 | So, to do this approach, you're going to need two things.
00:14:17.000 | The first thing is a very big corpus of unlabeled text.
00:14:20.000 | And then the second thing is a dataset that is annotated, that is ready to use for supervised fine-tuning.
00:14:28.000 | So, you could have multiple datasets if you want to train your model for several tasks.
00:14:34.000 | But the good news is your target tasks do not need to be in the same domain as the unlabeled text.
00:14:40.000 | So, for example, let's say you want to train on financial tasks.
00:14:45.000 | You're going to give the model some information about a stock and ask it about how it performs.
00:14:50.000 | Or whether it should buy or sell.
00:14:52.000 | So, like a financial classifier.
00:14:55.000 | Actually, you can pre-train on just like normal general data.
00:14:59.000 | Like you can pre-train on a corpus from the Internet and then just fine-tune on your desired tasks.
00:15:05.000 | Like your unlabeled corpus does not need to be in the same domain as your objective.
00:15:11.000 | And this is good news because we have a lot of general-purpose text that you can use for pre-training.
00:15:17.000 | While obtaining very domain-specific corpus is more involved.
00:15:22.000 | And a very minor note here is the name can be misleading in this work.
00:15:27.000 | The word "generative" here mainly refers to the pre-training step.
00:15:31.000 | The actual tasks that they had in mind are more discriminative.
00:15:35.000 | So, like classification, question answering, and semantic similarity.
00:15:38.000 | That is, natural language understanding tasks.
00:15:41.000 | So, they did not discuss machine translation or just being a chatbot in this work.
00:15:49.000 | And they actually released their blog post under a different, and I think more fitting title,
00:15:54.000 | called "Improving Language Understanding with Unsupervised Learning".
00:15:58.000 | And this is like the key idea here, like unsupervised learning.
00:16:03.000 | This is a very nice blog post on their blog.
00:16:07.000 | So, now we have discussed ULMFiT and GPT.
00:16:14.000 | Let's also discuss some of the other related work in this domain.
00:16:18.000 | So, let's first start with semi-supervised learning.
00:16:21.000 | The work GPT falls under this domain.
00:16:25.000 | And back then this was becoming very popular, like sequence labeling, text classification tasks.
00:16:31.000 | People were doing semi-supervised learning for these.
00:16:34.000 | And you have different levels for this approach.
00:16:38.000 | So, the first and basic level is just using language statistics to get the features.
00:16:46.000 | Like use bag of words, TF-IDF, and all this classical machine learning stuff.
00:16:53.000 | You can use them as input features and just train a classifier on top of this.
00:16:58.000 | But that's not very helpful, because two words that are different in syntax,
00:17:03.000 | but are similar in semantics, are going to have very different features.
00:17:08.000 | So, it makes the model a bit limited.
00:17:13.000 | The second step is to use the word embedding, like we discussed previously.
00:17:16.000 | And this approach allows you to capture the semantics,
00:17:20.000 | but it's also very limited in that it's based on words.
00:17:24.000 | We're not capturing the higher level semantics.
00:17:27.000 | The third level is like sequence embeddings.
00:17:30.000 | So, instead of just using the word to get the embedding,
00:17:35.000 | you are going to utilize the entire sequence.
00:17:38.000 | So, like the sentence or the paragraph to get the embedding.
00:17:41.000 | And this actually allows us to utilize the context
00:17:44.000 | to understand the high level semantics of the text.
00:17:47.000 | And this level 3 is where GPT falls.
00:17:50.000 | We are using the entire sequence to generate embeddings
00:17:54.000 | that we can use for classification or any other task.
00:17:59.000 | So, this is regarding semi-supervised learning.
00:18:02.000 | The second field, and the more specific field, is unsupervised pre-training.
00:18:07.000 | So, it's a special case of semi-supervised learning.
00:18:10.000 | These terminologies can be confusing, I know.
00:18:14.000 | But the goal is to find a good initial starting point
00:18:18.000 | instead of modifying the supervised learning objective.
00:18:27.000 | This is a typo, probably.
00:18:30.000 | So, the early works in unsupervised pre-training
00:18:36.000 | were actually in vision, in image classification.
00:18:39.000 | So, for example, you can use ResNet
00:18:41.000 | that was trained to classify ImageNet.
00:18:43.000 | You can take the backbone, the ResNet as a backbone,
00:18:46.000 | and then do like segmentation
00:18:49.000 | or just try to detect pneumonia in chest x-rays.
00:18:52.000 | And this is a very good starting point.
00:18:54.000 | Even though ResNet was...
00:18:56.000 | Even though ImageNet is just classifying images
00:19:00.000 | as being cats or dogs or humans or horses.
00:19:04.000 | So, it's more of a general purpose dataset,
00:19:06.000 | but you learn a very good representation
00:19:09.000 | that can be used in your downstream task.
00:19:13.000 | And in the field of vision,
00:19:15.000 | this actually proved to work quite well
00:19:18.000 | because people have found out that pre-training
00:19:20.000 | acts as a regularization scheme.
00:19:22.000 | Your model tends to have better generalization
00:19:25.000 | if you pre-train it on a very large corpus.
00:19:29.000 | And the authors mentioned in the paper
00:19:31.000 | that the closest line of work to their work
00:19:33.000 | is actually what we discovered...
00:19:35.000 | What we covered so far,
00:19:37.000 | the ULMFET by Howard and Ruder,
00:19:39.000 | and also another work by Dai et al.
00:19:43.000 | But I did not go into...
00:19:45.000 | I did not go in detail into this work.
00:19:49.000 | However, the main drawback of these two works
00:19:52.000 | and everything else, almost everything else back then,
00:19:55.000 | is like they are using RNNs.
00:19:57.000 | And we know that by 2018,
00:19:59.000 | like, transformer reigns supreme
00:20:01.000 | because we have GPUs.
00:20:04.000 | So we can train transformers more efficiently.
00:20:06.000 | And also RNNs have, like...
00:20:08.000 | They are limited in their ability
00:20:10.000 | to process large contexts
00:20:13.000 | because of, like, the gradient vanishing problems
00:20:16.000 | and so on.
00:20:19.000 | So this is, like, the most common approach back then.
00:20:21.000 | Another approach is also to use the hidden representation
00:20:24.000 | from a pre-trained language model
00:20:26.000 | or machine translation model
00:20:27.000 | as auxiliary features while training your model.
00:20:30.000 | So you can have...
00:20:32.000 | Let's say you have a machine translation model.
00:20:36.000 | You're going to get the representations
00:20:38.000 | or, like, the hidden state of this model
00:20:41.000 | and then just give it to your classifier
00:20:44.000 | as, like, additional features.
00:20:47.000 | And as you can imagine, this is, like...
00:20:49.000 | This involves, like, a substantial amount of work
00:20:51.000 | and new parameters for each task.
00:20:54.000 | So this is not very universal.
00:20:57.000 | So any questions so far
00:20:59.000 | before we actually go in detail into the approach?
00:21:04.000 | - I think there was a question from Sook.
00:21:07.000 | Did AlexNet use GPUs?
00:21:10.000 | And were the transformers... - AlexNet?
00:21:11.000 | - ...the first ones to use the GPUs?
00:21:13.000 | I think he just put it in the chat.
00:21:15.000 | Yeah, but...
00:21:18.000 | Were transformers, like, some of the...
00:21:19.000 | - Yeah, AlexNet, yeah.
00:21:22.000 | Yeah, AlexNet did definitely use GPUs,
00:21:24.000 | but back then, I think they were writing CUDA code.
00:21:27.000 | I think Alex Krizhevsky was the person
00:21:30.000 | doing the GPU programming stuff himself.
00:21:36.000 | So, yeah, I think AlexNet did definitely use GPUs,
00:21:39.000 | if I remember correctly.
00:21:41.000 | - Yes, okay.
00:21:43.000 | Cool, cool.
00:21:45.000 | Well, seems like they did.
00:21:47.000 | Okay, I would just like to see
00:21:49.000 | if anyone else has any other questions.
00:21:50.000 | I think it should be good for now.
00:21:53.000 | - Yeah.
00:21:55.000 | Someone said, "According to Papers with Code, they did."
00:21:58.000 | Yeah, I think they did use GPUs.
00:22:01.000 | So let's go into details about the GPT approach.
00:22:06.000 | So we have two steps.
00:22:08.000 | The first step is unsupervised pre-training.
00:22:10.000 | So basically, the goal is to train
00:22:12.000 | a high-capacity language model
00:22:14.000 | on a large corpus of text.
00:22:16.000 | The training objective here is language modeling.
00:22:18.000 | That is, given a sequence of tokens,
00:22:21.000 | try to correctly predict the next token.
00:22:24.000 | So let's say "The cat sat on the..."
00:22:27.000 | You're trying to predict the next token,
00:22:29.000 | which I think should be "mat".
00:22:31.000 | So this is basically language modeling.
00:22:34.000 | The loss function they used is negative log-likelihood.
00:22:38.000 | It's also called cross-entropy.
00:22:40.000 | Basically, this equation,
00:22:43.000 | the negative log-likelihood of the correct label.
00:22:47.000 | And you do this over--
00:22:49.000 | a very, very important note here is
00:22:51.000 | you do this over the entire sequence.
00:22:53.000 | So if you have a sentence that said,
00:22:55.000 | "The cat sat on the mat,"
00:22:57.000 | you do this over every single token in this sentence.
00:23:00.000 | And this is very important.
00:23:02.000 | So you are training on your entire input--
00:23:05.000 | your entire corpus,
00:23:08.000 | not just the last token.
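A minimal sketch of that objective in PyTorch (the logits are random placeholders for the transformer's output; the point is that the cross-entropy is taken at every position, with inputs and targets shifted by one):

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 40_000, 8, 2

logits = torch.randn(batch, seq_len, vocab_size)         # model output: one distribution per position
tokens = torch.randint(0, vocab_size, (batch, seq_len))  # the actual token ids

pred   = logits[:, :-1, :]   # predictions made at positions 0 .. L-2
target = tokens[:, 1:]       # the "next token" each position should predict

loss = F.cross_entropy(pred.reshape(-1, vocab_size), target.reshape(-1))
print(loss)   # averaged over every position of every sequence, not just the last token
```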
00:23:11.000 | So the architecture is like a transformer,
00:23:14.000 | a multi-layer transformer,
00:23:16.000 | but they are using the decoder only.
00:23:18.000 | They are not using the encoder.
00:23:20.000 | So--and the difference between the encoder and the decoder,
00:23:24.000 | I think, can be summarized to, like, how you do attention.
00:23:27.000 | So in encoders,
00:23:29.000 | every token has access to every other token.
00:23:32.000 | But in decoders,
00:23:34.000 | every token has access only to the tokens
00:23:36.000 | that came before it.
00:23:38.000 | So you only have one directional attention.
00:23:40.000 | I think it's also called left-to-right attention
00:23:44.000 | or right-to-left, or, like,
00:23:46.000 | you only have attention in one direction.
00:23:49.000 | And basically, the transformer architecture
00:23:52.000 | applies, like, a multi-headed masked self-attention operation
00:23:55.000 | over the input tokens,
00:23:57.000 | and then this is followed by a position-wise feed-forward layer.
00:24:00.000 | And you keep doing this for, like, n,
00:24:02.000 | where n is the number of transformer layers,
00:24:05.000 | and then you just generate an output distribution
00:24:08.000 | over the vocabulary.
00:24:10.000 | So let's take this into more detail.
00:24:12.000 | You have your input text
00:24:14.000 | that has been tokenized
00:24:16.000 | into tokens,
00:24:18.000 | and these tokens have been encoded.
00:24:21.000 | So you have, like, the token integer or token ID.
00:24:24.000 | So it's a number.
00:24:26.000 | So you take this number, and you pass it through the embedding layer,
00:24:29.000 | the token embedding layer,
00:24:31.000 | also called the semantic embedding layer,
00:24:33.000 | and you get a vector for this token.
00:24:36.000 | And you also get the positional embedding for this token.
00:24:39.000 | You sum these up, like, just vector addition,
00:24:43.000 | and you get your input to the first transformer block,
00:24:47.000 | which is H0.
00:24:49.000 | So just the positional embedding
00:24:53.000 | and the token embedding added together,
00:24:56.000 | and you get your input, H0.
00:24:58.000 | And then you take H0 and pass it to each block,
00:25:02.000 | each transformer block,
00:25:04.000 | and the output of the first block is going to be the input
00:25:06.000 | to the second block, and so on.
00:25:08.000 | And this is what this equation says.
00:25:11.000 | And at the end,
00:25:13.000 | once you've done going through all these blocks,
00:25:15.000 | you're going to go to the output layer,
00:25:18.000 | which is actually a reverse embedding.
00:25:21.000 | So you have a vector, but you want to go back to a token,
00:25:24.000 | or, like, I should say,
00:25:26.000 | a probability distribution over all the tokens.
00:25:30.000 | And once you have this distribution,
00:25:32.000 | you pass it through a softmax
00:25:34.000 | to get actual probabilities that sum up to one,
00:25:37.000 | instead of just logits or scores.
00:25:40.000 | And I think we've covered the transformer paper,
00:25:43.000 | so I think people are familiar with this,
00:25:46.000 | but if you have other questions, please go ahead.
00:25:49.000 | A small remark about this is that we use
00:25:51.000 | tied input and output token embedding,
00:25:53.000 | so the embedding layer in this step
00:25:55.000 | is the same as the one used at the final step.
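As a rough sketch of that forward pass (this is not the paper's code; it reuses PyTorch's built-in encoder layer with a causal mask as a simplified stand-in for GPT's decoder blocks, and the sizes follow the numbers discussed later in the talk):

```python
import torch
import torch.nn as nn

class TinyGPT(nn.Module):
    def __init__(self, vocab_size=40_000, ctx_len=512, d_model=768, n_layers=12, n_heads=12):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # W_e: token embeddings
        self.pos_emb = nn.Embedding(ctx_len, d_model)      # W_p: learned positional embeddings
        block = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           activation="gelu", batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=n_layers)

    def forward(self, token_ids):
        b, t = token_ids.shape
        positions = torch.arange(t, device=token_ids.device)
        h = self.tok_emb(token_ids) + self.pos_emb(positions)         # h_0 = token + position embeddings
        causal_mask = nn.Transformer.generate_square_subsequent_mask(t)  # one-directional attention
        h = self.blocks(h, mask=causal_mask)                          # h_1 ... h_n
        logits = h @ self.tok_emb.weight.T                            # tied output embedding
        return torch.softmax(logits, dim=-1)                          # P(next token) at each position

model = TinyGPT()
probs = model(torch.randint(0, 40_000, (1, 16)))   # shape (1, 16, vocab_size)
```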
00:25:58.000 | So...
00:26:00.000 | Yeah, someone says causal masking.
00:26:08.000 | Yes, exactly, causal language modeling.
00:26:11.000 | So the second step is supervised fine-tuning.
00:26:14.000 | So the goal of this step
00:26:16.000 | is to adapt the parameters of the pre-trained model
00:26:19.000 | to your supervised target task.
00:26:22.000 | And for this, we need a label data set
00:26:24.000 | where each instance is a pair of, like,
00:26:26.000 | you have a sequence of input tokens and the label.
00:26:28.000 | So, for example, an input sequence
00:26:31.000 | could be this product is very bad,
00:26:33.000 | and your label could be, like, a negative sentiment.
00:26:37.000 | So the inputs are passed through the pre-trained model
00:26:40.000 | to obtain the final transformer blocks activation.
00:26:43.000 | So if you go back a bit,
00:26:46.000 | you take your input sequence and pass it
00:26:48.000 | through this same transformer
00:26:51.000 | and get the output of the final token,
00:26:53.000 | and then you compare this to your label.
00:26:57.000 | So you're gonna get the hidden representation
00:27:03.000 | of the last transformer layer at the m-th token,
00:27:06.000 | where m is, like, the final token of the input sequence.
00:27:08.000 | And you just pass this through your classification head
00:27:11.000 | or whatever head you're using.
00:27:13.000 | So, for example, in classification,
00:27:15.000 | we're gonna use Softmax on top of a linear layer
00:27:18.000 | to get your output and compare it with the label.
00:27:23.000 | And you are using roughly the same loss function,
00:27:26.000 | which is negative log-likelihood
00:27:28.000 | or cross-entropy.
00:27:32.000 | And a key distinction here
00:27:35.000 | between this and the previous step
00:27:37.000 | is you are calculating the loss
00:27:39.000 | over only the output token, not the entire sequence.
00:27:42.000 | So the loss is only on Y, not on X1 or XM and so on.
00:27:48.000 | So the only extra parameters you need for this
00:27:51.000 | is your classification head.
00:27:54.000 | So, for example, W-Y,
00:27:56.000 | if you're trying to do classification,
00:27:58.000 | the parameter matrix of the output layer.
00:28:06.000 | And also embeddings if you're adding new tokens.
00:28:09.000 | And we're gonna see this in a bit.
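A minimal sketch of the fine-tuning head (the hidden states here are random placeholders for h_l^m, the final layer's activation at the last input token; W_y is the only new weight matrix):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, num_classes = 768, 2            # e.g. positive vs. negative sentiment
head = nn.Linear(d_model, num_classes)   # W_y: the newly added parameters

# Placeholder for h_l^m of 4 sequences in a batch.
h_last = torch.randn(4, d_model)
labels = torch.tensor([0, 1, 1, 0])

logits = head(h_last)                    # softmax(h_l^m W_y) is applied inside the loss
loss = F.cross_entropy(logits, labels)   # loss on y only, not on x_1 .. x_m
loss.backward()
```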
00:28:12.000 | Something they also experimented with
00:28:14.000 | is auxiliary training objective.
00:28:16.000 | So they also used language modeling
00:28:18.000 | as an auxiliary objective in fine-tuning.
00:28:20.000 | So not just this--
00:28:23.000 | not just this classification, but also language modeling.
00:28:26.000 | And they say this helped them
00:28:29.000 | by improving the generalization of the supervised model
00:28:32.000 | and accelerating convergence.
00:28:34.000 | They also say this is in line with prior work.
00:28:37.000 | They also observed better performance
00:28:39.000 | when using it as an auxiliary objective.
00:28:42.000 | And the way you do this is your loss function
00:28:45.000 | is now a sum of multiple loss functions
00:28:48.000 | where one loss function is this one,
00:28:51.000 | the classification loss function,
00:28:53.000 | and also you have the language modeling loss function
00:28:57.000 | with a certain weight, like lambda here is like a weight.
00:29:00.000 | And lambda could be, for example, 0.5 or 0.3.
00:29:04.000 | So you have like a summation of multiple losses.
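A minimal sketch of that combined objective (all the logits below are random placeholders for outputs of the shared transformer; the point is only how the two losses are weighted and summed):

```python
import torch
import torch.nn.functional as F

lam = 0.5                     # weight of the auxiliary language-modeling loss
vocab_size, num_classes = 40_000, 2

lm_logits  = torch.randn(4, 16, vocab_size, requires_grad=True)  # next-token predictions
cls_logits = torch.randn(4, num_classes, requires_grad=True)     # task predictions
next_tokens = torch.randint(0, vocab_size, (4, 16))
labels      = torch.randint(0, num_classes, (4,))

l_lm  = F.cross_entropy(lm_logits.reshape(-1, vocab_size), next_tokens.reshape(-1))
l_cls = F.cross_entropy(cls_logits, labels)

total_loss = l_cls + lam * l_lm   # the fine-tuning objective with the auxiliary LM term
total_loss.backward()
```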
00:29:09.000 | And a small note for myself here.
00:29:12.000 | I'm not sure if auxiliary language--
00:29:14.000 | like auxiliary objectives are popular today.
00:29:17.000 | I think people just do supervised fine-tuning
00:29:20.000 | without an auxiliary objective.
00:29:23.000 | That's just my take.
00:29:26.000 | So any questions so far?
00:29:30.000 | I think the chat seems to not have any questions.
00:29:37.000 | So maybe you want to just-- you can just continue from now.
00:29:41.000 | Sure.
00:29:42.000 | So now we have discovered--
00:29:47.000 | we have covered the approach, the basic two steps.
00:29:51.000 | You will now get a very good, let's say, classifier,
00:29:54.000 | because you have done unsupervised pre-training
00:29:57.000 | and supervised fine-tuning on classifiers.
00:29:59.000 | But GPT-1 is actually trying to be more--
00:30:01.000 | it's trying to be more of a universal model
00:30:03.000 | than just a classifier.
00:30:05.000 | So they are trying to handle multiple tasks,
00:30:07.000 | like classification, entailment, semantic similarity,
00:30:10.000 | and answering multiple choice questions.
00:30:13.000 | And these are all, as you can see,
00:30:15.000 | discriminative tasks other than generative tasks,
00:30:18.000 | as we discussed before.
00:30:21.000 | So for tasks like classification, this is very easy.
00:30:23.000 | You can do what we have covered so far.
00:30:25.000 | Just add the head on top and do the classification.
00:30:28.000 | But other tasks have different structured inputs and outputs.
00:30:33.000 | So for example, text entailment has ordered sentence pairs.
00:30:36.000 | MCQs have a question with multiple answers, and so on.
00:30:41.000 | So each task has its own specific structure.
00:30:45.000 | And the way people have dealt with this previously
00:30:47.000 | is just learn a specific architecture for each task
00:30:53.000 | on top of your model.
00:30:55.000 | And this defeats the whole purpose of the GPT work.
00:31:00.000 | We're trying to do something that's
00:31:02.000 | global, general, and a general-purpose model,
00:31:05.000 | rather than having multiple task-specific architectures.
00:31:11.000 | So instead of using this approach,
00:31:14.000 | they opted to use a different approach,
00:31:17.000 | where they convert the structured inputs
00:31:21.000 | into just tokens.
00:31:23.000 | So they are trying to create a multitask format.
00:31:27.000 | And this is similar to what people have used in future work
00:31:30.000 | like T5 and Whisper, where basically you're
00:31:33.000 | trying to model different tasks just using
00:31:37.000 | tokens and special tokens.
00:31:40.000 | These input transformations allows
00:31:42.000 | us to use the same architecture for different tasks.
00:31:45.000 | So you don't need to do a lot of modification.
00:31:49.000 | And we're going to go into this in more detail
00:31:51.000 | in the next slide.
00:31:53.000 | So for example, let's take textual entailment.
00:31:56.000 | This task involves reading a pair of sentences
00:31:59.000 | and judging the relationship between them.
00:32:01.000 | So the relationship could be one of entailment, contradiction,
00:32:03.000 | or neutral.
00:32:04.000 | And a small note is that this task is still challenging
00:32:10.000 | because your model needs to have good understanding
00:32:13.000 | and reasoning of the language, because it
00:32:16.000 | can be confusing sometimes.
00:32:19.000 | So you have your premise, and you have your hypothesis,
00:32:22.000 | and you're trying to classify or try
00:32:24.000 | to predict the relationship as being entailment,
00:32:26.000 | contradiction, or neutral.
00:32:29.000 | So the way to do this is just to concatenate
00:32:32.000 | the premise and the hypothesis.
00:32:34.000 | So this could be a sentence, and this could be a sentence.
00:32:37.000 | You just concatenate them and add a special delimiter token
00:32:40.000 | in between them.
00:32:41.000 | And obviously, you add your start token
00:32:43.000 | at the beginning and your end token at the end.
00:32:46.000 | And just try to train a classifier on top of this.
00:32:49.000 | And your classifier should classify
00:32:52.000 | this sequence of input tokens as one of entailment,
00:32:55.000 | contradiction, or neutral.
00:32:57.000 | So this has become just a classification task
00:33:01.000 | by just doing transformation.
00:33:06.000 | The second task that we cover is semantic similarity.
00:33:09.000 | And I think this is very popular nowadays,
00:33:12.000 | because of retrieval, augmented generation,
00:33:15.000 | and embeddings in general.
00:33:18.000 | So all this cool rag stuff.
00:33:21.000 | So this task is about predicting how semantically
00:33:25.000 | similar two sentences are.
00:33:27.000 | And semantic similarity just means how close in meaning
00:33:30.000 | they are.
00:33:31.000 | Do they talk about the same thing?
00:33:33.000 | Do they mean the same thing?
00:33:35.000 | Do they have similar meaning or not?
00:33:38.000 | And again, this can be challenging,
00:33:40.000 | because you can have two paragraphs that
00:33:43.000 | have very different usage of words,
00:33:45.000 | but they convey the same meaning.
00:33:47.000 | So this can be challenging if your model is not smart enough.
00:33:50.000 | And one note about this task is there
00:33:52.000 | is no inherent ordering of the sentences.
00:33:54.000 | So you can just--
00:33:55.000 | there is no sentence A and sentence B.
00:33:57.000 | You just have two sentences.
00:33:59.000 | Unlike, for example, in the entailment,
00:34:01.000 | you have a very specific order.
00:34:02.000 | Like, you have a distinction between the premise
00:34:05.000 | and the hypothesis.
00:34:06.000 | But here you have just two random sentences.
00:34:09.000 | And the way they approach this is using a Siamese architecture
00:34:14.000 | where we have--
00:34:15.000 | where we generate two input sequences
00:34:18.000 | and pass them through the transformer
00:34:20.000 | and compare between them.
00:34:22.000 | So basically, Siamese architecture
00:34:24.000 | is a fancy way of saying that you
00:34:26.000 | are using the same model twice, or two identical versions
00:34:30.000 | of the model.
00:34:31.000 | So you get your sentence-- first sentence and second sentence,
00:34:34.000 | and then concatenate them and add your special tokens.
00:34:37.000 | And you pass them through the transformer
00:34:39.000 | and you get some vector at the end.
00:34:41.000 | And also, you do the same thing, but you reverse
00:34:44.000 | the order of the sentences.
00:34:47.000 | So you get the second input sequence,
00:34:49.000 | pass it to the transformer, and you get your vector.
00:34:52.000 | So basically, at the end, you're going to have two vectors.
00:34:55.000 | You just add them--
00:34:57.000 | do vector addition on top of them.
00:34:59.000 | And then you just add a head on top of the output vector.
00:35:05.000 | So you just multiply the vector by some layer,
00:35:09.000 | a linear layer, for example, that has parameters w,
00:35:12.000 | and you get your output.
00:35:14.000 | And for example, if you're doing--
00:35:16.000 | like, if you just have--
00:35:18.000 | if you're just interested in knowing
00:35:20.000 | whether these sentences are similar or not similar,
00:35:23.000 | so you have only two labels, you can train a classifier.
00:35:26.000 | If you're interested in having more of a scale of similarity,
00:35:30.000 | like from 0 to 10, you can train a regressor on top of this.
00:35:33.000 | So that's very cool.
00:35:37.000 | We're still using the same architecture, by the way,
00:35:39.000 | just the transformer here.
00:35:41.000 | We're not modifying it.
00:35:42.000 | We're just being smart about how to approach this task.
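A minimal sketch of that Siamese-style setup (encode_last_token is a placeholder returning a fake hidden state instead of running the shared pre-trained transformer; the special-token spellings are illustrative):

```python
import torch
import torch.nn as nn

d_model = 768

def encode_last_token(tokens):
    # Placeholder for: run `tokens` through the shared pre-trained transformer
    # and return the final layer's state at the last token.
    return torch.randn(d_model)

def build_input(a, b):
    return ["<s>"] + a + ["<$>"] + b + ["<e>"]

sent1 = ["the", "hotel", "was", "fully", "booked"]
sent2 = ["no", "rooms", "were", "available"]

h_ab = encode_last_token(build_input(sent1, sent2))   # order: sentence 1 then sentence 2
h_ba = encode_last_token(build_input(sent2, sent1))   # the reversed order
h = h_ab + h_ba                                       # element-wise addition of the two vectors

head = nn.Linear(d_model, 2)                          # similar / not similar (or a regressor)
score = head(h)
```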
00:35:48.000 | You can extend this to do question answering.
00:35:51.000 | So let's say you have a question and you have four choices.
00:35:55.000 | Or let's say you have a document.
00:35:56.000 | You have a question about the document,
00:35:58.000 | and you have four potential answers to this document.
00:36:01.000 | And you want your model to pick one answer of these.
00:36:04.000 | So the way to do this is you take your document or context
00:36:08.000 | and add the question and then add the first answer.
00:36:12.000 | And then you get your first input sequence.
00:36:15.000 | And you pass this to the transformer.
00:36:17.000 | And similarly, you get your context or document
00:36:20.000 | and add the question and add the second answer
00:36:22.000 | and make this into a sequence and give it to the transformer.
00:36:25.000 | And you do this for all the answers.
00:36:27.000 | And then you just compare between the scores
00:36:31.000 | the model gives to each of these potential answers.
00:36:36.000 | So for example, we have a Wikipedia article,
00:36:39.000 | and we have a question, and we have answer A.
00:36:42.000 | We concatenate all of these with some special tokens.
00:36:46.000 | And we pass them to the transformer
00:36:49.000 | and to the final linear layer, and we get the score.
00:36:53.000 | So score for answer A and score for answer B
00:36:55.000 | and score for answer C. And just do a softmax
00:36:57.000 | to get a proper probability distribution.
00:37:01.000 | And that's it.
00:37:02.000 | You've got your model to do question answering.
00:37:05.000 | And you can do this also for other tasks
00:37:07.000 | like common sense reasoning.
00:37:10.000 | So any questions about these transformations for each task?
00:37:18.000 | I think there was a question about modifying the input
00:37:22.000 | sequence to both possible orderings,
00:37:24.000 | at least when it comes to semantics in learning.
00:37:27.000 | I think someone had a question about that.
00:37:31.000 | So he just said that he doesn't really
00:37:34.000 | understand what it means to modify the input sequence
00:37:37.000 | for both possible orderings.
00:37:41.000 | Yeah, sure.
00:37:42.000 | So if you're doing textual entailment,
00:37:45.000 | you have distinction between the premise and the hypothesis.
00:37:47.000 | Like you have a premise and a hypothesis.
00:37:52.000 | So you can just mix them up.
00:37:54.000 | You have a very specific premise and a very specific hypothesis.
00:37:59.000 | And the ordering here matters.
00:38:00.000 | Like you should put the premise first, and then
00:38:02.000 | the special token, and then the hypothesis.
00:38:04.000 | You can put the hypothesis first, and then the delimiter,
00:38:07.000 | and then the premise.
00:38:08.000 | So this is what they mean by the ordering here matters.
00:38:12.000 | But for semantic similarity, you just have two sentences.
00:38:16.000 | There is no inherent ordering for the semantic similarity
00:38:20.000 | task.
00:38:23.000 | I hope this answers the question.
00:38:27.000 | I think it did.
00:38:29.000 | He said, ah, I see.
00:38:30.000 | Thanks.
00:38:30.500 | I think probably that's about it.
00:38:33.000 | I don't see any other questions.
00:38:36.000 | He said, does it also combat things like positional biases
00:38:39.000 | in transformers by switching the order of the sentences
00:38:42.000 | itself,
00:38:43.000 | since you've added the two quantities?
00:38:45.000 | Yeah, exactly.
00:38:46.000 | I think, yeah.
00:38:47.000 | I think this is one of the motivations they did this.
00:38:50.000 | They want to say, there is no inherent order for this task.
00:38:54.000 | So maybe the transformer will just have a bias.
00:38:57.000 | So let's try both ways.
00:38:58.000 | Like let's give it the first sentence, and then the second.
00:39:01.000 | And let's give it the other way around
00:39:05.000 | to get over the positional bias.
00:39:08.000 | Because I think for some models, they
00:39:12.000 | pay more attention to the last few tokens in the input.
00:39:17.000 | And they disregard the earlier tokens.
00:39:19.000 | This tends to happen sometimes, yes.
00:39:23.000 | I think, yeah.
00:39:24.000 | I think this is a good way to think about this.
00:39:29.000 | Awesome.
00:39:30.000 | I think-- I don't see any other questions in the chat.
00:39:33.000 | So I can just service them as and when they come.
00:39:35.000 | So I think we should be good for now by the looks of it.
00:39:38.000 | Sure.
00:39:39.000 | Sure.
00:39:39.500 | So that was the whole approach.
00:39:41.000 | We can now discuss some details about the training.
00:39:44.000 | And back then, OpenAI actually did release info
00:39:47.000 | about their training and models.
00:39:49.000 | They don't do this now, unfortunately.
00:39:52.000 | But anyways, so the data set they use for training
00:39:55.000 | is BookCorpus for training the language model.
00:39:58.000 | So this is in step one, which is unsupervised training.
00:40:01.000 | BookCorpus.
00:40:03.000 | And back then, this was huge.
00:40:04.000 | It has 7,000 unique unpublished books from different genres.
00:40:08.000 | So the variety here helps as well.
00:40:11.000 | They have adventure, fantasy, and romance.
00:40:13.000 | So kind of like a very good, big, diverse data set.
00:40:18.000 | And the main advantage of this data set,
00:40:21.000 | and the reason they chose it, is because it
00:40:23.000 | has long stretches of text.
00:40:25.000 | So if you have a book or a novel,
00:40:27.000 | you have a paragraph that's maybe 10 lines or more.
00:40:31.000 | And this helps the model to learn
00:40:33.000 | how to handle long-range information
00:40:35.000 | and how to handle long context.
00:40:37.000 | For example, there is also another data set
00:40:39.000 | that's called the 1 Billion Word Benchmark, which is also big and diverse.
00:40:43.000 | But it doesn't have this long context.
00:40:46.000 | It's just a bunch of small sentences, I would say.
00:40:51.000 | And this way, your model will not
00:40:54.000 | learn how to handle long context.
00:40:56.000 | And a side note here is like ELMo is also
00:41:00.000 | a very seminal work in NLP.
00:41:03.000 | And it falls under the same domain
00:41:05.000 | of unsupervised pre-training and having good embeddings.
00:41:10.000 | So it's one of the fundamental papers and works in NLP.
00:41:15.000 | So they say their model achieves a very low token
00:41:18.000 | level perplexity of 18.4 on this corpus.
00:41:22.000 | But I don't think this is actually low today.
00:41:25.000 | I think perplexity of 18 is a bit high.
00:41:28.000 | But I'm not sure.
00:41:30.000 | And their model is about--
00:41:32.000 | their data set is about 5 gigabyte in size.
00:41:34.000 | So not quite big by today's standard, of course.
00:41:39.000 | The architecture, as we mentioned,
00:41:41.000 | they have a transformer model.
00:41:42.000 | They use a tokenizer.
00:41:43.000 | They use byte-pair encoding back then, which is also, I think,
00:41:49.000 | what's used right now.
00:41:50.000 | So this is still not changed from today.
00:41:53.000 | They have a surprisingly big vocab size
00:41:57.000 | by their time's standard.
00:41:59.000 | Like, they have 40,000 tokens.
00:42:01.000 | I think Llama 2 also had something
00:42:04.000 | that's in the same range, like 40,000, 50,000 tokens.
00:42:07.000 | So back then, this was actually quite big, I would say.
00:42:11.000 | And they used the FTFY library to clean the raw text
00:42:14.000 | and then do some standardization using spaCy tokenizer.
00:42:19.000 | So this is good work in the tokenization area, I would say.
00:42:25.000 | And their model is just a typical transformer model,
00:42:29.000 | like typical by today's standard, just a transformer,
00:42:32.000 | decoder-only transformer with, like, 12 layers and mask
00:42:35.000 | self-attention.
00:42:36.000 | And it has a very big size of 117 million parameters.
00:42:40.000 | And this was actually big back then, although this
00:42:43.000 | is trivial now, of course.
00:42:45.000 | Their embedding, they used learned positional embeddings
00:42:48.000 | compared to the sinusoidal embedding
00:42:51.000 | in the original transformer paper.
00:42:53.000 | So this was actually much simpler to implement.
00:42:57.000 | And the model just has to learn the embeddings in training.
00:43:01.000 | They have a context size of 512.
00:43:04.000 | Again, back then, that was very big.
00:43:07.000 | And they also used tied input and output token embeddings,
00:43:10.000 | as we mentioned.
00:43:11.000 | And for the attention block, they have 12 attention heads.
00:43:14.000 | Each head has 64--
00:43:17.000 | each head has a dimension of 64 for a total of 768 dimensions
00:43:24.000 | for the entire attention block.
00:43:27.000 | And after the attention, you have the MLP layer,
00:43:29.000 | also known as position-wise feedforward network.
00:43:32.000 | And the size of the inner state of this network is 3,072--
00:43:38.000 | the size is 3,072.
00:43:41.000 | And this means that, actually, this is an expansion
00:43:43.000 | and then contraction network.
00:43:46.000 | So you go from 768 to 3K, and then you
00:43:49.000 | go back from 3K to 768.
00:43:51.000 | And all this is happening inside this MLP.
00:43:53.000 | And the activation layer they used
00:43:55.000 | is the GLO activation layer.
00:43:59.000 | The optimizer, they used Adam, which was becoming
00:44:03.000 | very popular back then.
00:44:04.000 | I think Adam was developed in 2014, 2015.
00:44:09.000 | So it was getting a lot of traction back then.
00:44:11.000 | And the maximum learning rate is this learning rate, 2.5e-4,
00:44:14.000 | which is also very popular nowadays.
00:44:16.000 | I think this was used in one of the Llama 2 models.
00:44:20.000 | So this is very familiar.
00:44:23.000 | They do warm-up as well from 0 to the maximum learning
00:44:26.000 | rate over 2K steps.
00:44:29.000 | And then they just cosine annealing
00:44:32.000 | to-- they do cosine annealing from the maximum learning rate
00:44:35.000 | to 0, so just learning rate decay.
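A minimal sketch of that schedule, assuming the paper's maximum learning rate of 2.5e-4 and 2,000 warm-up steps (the total step count here is just an illustrative number):

```python
import math

max_lr = 2.5e-4
warmup_steps = 2_000
total_steps = 100_000   # illustrative

def learning_rate(step: int) -> float:
    if step < warmup_steps:
        return max_lr * step / warmup_steps                      # linear warm-up from 0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine annealing down to 0

for s in (0, 1_000, 2_000, 50_000, 100_000):
    print(s, learning_rate(s))
```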
00:44:39.000 | Their compute, they used one machine
00:44:43.000 | that has 8x P600 GPU.
00:44:47.000 | This is the same family of GPUs as P100, I think.
00:44:50.000 | But I don't have much information about this GPU.
00:44:53.000 | They train for 30 days, which actually is not bad.
00:44:59.000 | That's not very long.
00:45:00.000 | But back then, this was very long, I think.
00:45:03.000 | Their utilization, they mention, is 0.33.
00:45:06.000 | And the total compute is 0.96 petaflop/s-days,
00:45:10.000 | so almost one petaflop/s-day.
00:45:12.000 | And for reference, I think the tinygrad machine
00:45:15.000 | is trying to give you one petaflop.
00:45:19.000 | Or I could be wrong.
00:45:21.000 | Yeah, I could be wrong about this.
00:45:23.000 | So anyway, the compute they used is almost one petaflop/s-day.
00:45:28.000 | And this is the way they calculated this.
00:45:31.000 | And this is actually how the model looks
00:45:33.000 | like if you try to do it in the transformers framework.
00:45:36.000 | So you have token embedding, position embedding,
00:45:38.000 | and then some dropouts.
00:45:40.000 | And then you have your actual transformer blocks.
00:45:44.000 | And this is the architecture of the language model.
00:45:48.000 | So there is no head in here.
00:45:50.000 | So the block is just attention, and then layer norm,
00:45:53.000 | and then MLP, and then layer norm.
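If you want to inspect that structure yourself, this is roughly how it looks, assuming the Hugging Face transformers library and its openai-gpt checkpoint:

```python
from transformers import OpenAIGPTModel

model = OpenAIGPTModel.from_pretrained("openai-gpt")   # the headless language model backbone
print(model)                                           # token + position embeddings, dropout, 12 blocks
print(sum(p.numel() for p in model.parameters()))      # on the order of 117M parameters
```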
00:45:55.000 | So if there are no questions, we can move on
00:46:02.000 | to the second step, which is supervised fine tuning.
00:46:07.000 | There was one question from Sean,
00:46:09.000 | which was, are perplexity numbers comparable
00:46:11.000 | across different models?
00:46:13.000 | And I think we discussed just now
00:46:15.000 | that the model itself has a perplexity of 18.4.
00:46:18.000 | And so in this case, would it be OK to compare--
00:46:23.000 | is it a metric that's invariant across different models
00:46:28.000 | in this sense?
00:46:31.000 | Yeah, so that's a good question.
00:46:32.000 | I think perplexity is just a metric.
00:46:34.000 | Like, you're going to measure perplexity
00:46:36.000 | on a certain data set.
00:46:37.000 | Like, let's say you have this data set that
00:46:39.000 | is 1 trillion tokens, or 1 billion tokens.
00:46:42.000 | And you measure perplexity of GPT-1 on this data set.
00:46:46.000 | And you can measure the perplexity of Lama2
00:46:48.000 | on this data set.
00:46:49.000 | And I think you can compare this metric.
00:46:52.000 | It's like saying Lama2 has a human eval of 70,
00:46:57.000 | and GPT-4 has a human eval of maybe 90.
00:47:00.000 | It's not totally fair, but we can do it, I guess.
00:47:03.000 | We can get away with doing it.
00:47:05.000 | Yeah, but it might not be 100% fair.
00:47:07.000 | Yeah, so that's a good point.
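For reference, a quick sketch of where a number like 18.4 comes from: perplexity is just the exponential of the average per-token negative log-likelihood on some evaluation set, so it is only comparable across models scored on the same data (and, strictly, the same tokenization). The loss value below is illustrative:

```python
import math

mean_nll_per_token = 2.912          # illustrative average cross-entropy, in nats
perplexity = math.exp(mean_nll_per_token)
print(perplexity)                   # about 18.4
```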
00:47:13.000 | OK, cool.
00:47:14.000 | I think that's probably the only question.
00:47:16.000 | So I think we should move on to supervised fine-tuning.
00:47:20.000 | So the second step is supervised fine-tuning.
00:47:22.000 | And they use these data sets for this task.
00:47:24.000 | So for sentence similarity, they use these data sets.
00:47:28.000 | And for classifications, they used COLA,
00:47:30.000 | which I think is popular now as well.
00:47:34.000 | I'm not going to go into details about this,
00:47:36.000 | because I don't have much information about these.
00:47:40.000 | And the architecture is just use the same backbone
00:47:43.000 | as pre-training.
00:47:44.000 | And you add your head, mostly classifier head,
00:47:48.000 | with a dropout of 0.1.
00:47:50.000 | They train for three epochs with a batch size of 32.
00:47:53.000 | This is also still standard nowadays as well.
00:47:57.000 | People usually train, do fine-tuning for three epochs
00:48:00.000 | with a batch size of maybe 32 or 64.
00:48:03.000 | So yeah, that's common nowadays as well.
00:48:06.000 | The learning rate is 6.25e-5.
00:48:10.000 | And also, people use very similar learning rates nowadays
00:48:15.000 | as well.
00:48:17.000 | So this is very familiar.
00:48:18.000 | This architecture for fine-tuning
00:48:21.000 | is very familiar, even today.
00:48:24.000 | And they also use learning rate decay with a warm-up.
00:48:28.000 | And when they use the auxiliary language,
00:48:32.000 | the auxiliary objective, they used a lambda of 0.5.
00:48:36.000 | So the weight of the auxiliary language modeling objective
00:48:39.000 | was 0.5.
00:48:41.000 | But this is not popular nowadays.
00:48:43.000 | Most people don't do this.
00:48:44.000 | Any question about this?
00:48:53.000 | I think we should--
00:48:54.000 | Before we go to the benchmarks?
00:48:55.000 | Yeah, the chat.
00:48:56.000 | There's no questions in the chat.
00:48:58.000 | So I think we can just move on to the benchmarks.
00:49:01.000 | Sure.
00:49:02.000 | So the benchmark is like--
00:49:04.000 | they achieve SOTA performance on many tasks.
00:49:07.000 | Like you can see here, almost every single task,
00:49:10.000 | except this one.
00:49:11.000 | And there are ones where they have significant improvement,
00:49:14.000 | like this in QNLI.
00:49:17.000 | They have like 6% absolute improvement.
00:49:21.000 | I think-- let's go back a bit.
00:49:23.000 | QNLI-- where is it?
00:49:27.000 | In here.
00:49:28.000 | I think the QNLI is more of a natural language understanding
00:49:38.000 | where it requires reasoning.
00:49:39.000 | And this is where they make the most improvement, I think.
00:49:43.000 | But overall, they are doing very good performance on many tasks.
00:49:49.000 | And they are comparing to LSTMs and other models.
00:49:52.000 | So good news.
00:49:53.000 | GPT-1 tends to be a good model.
00:49:56.000 | Yeah, this is for natural language inference.
00:49:59.000 | For question answering and common sense,
00:50:01.000 | they also make very big gains, as you can see here.
00:50:05.000 | So 76 compared to 86.
00:50:10.000 | They compare also to multiple models.
00:50:13.000 | And one of them is actually an ensemble of nine models,
00:50:16.000 | as they denote here by 9x.
00:50:20.000 | And also, the same happens for semantic similarity
00:50:23.000 | and classification.
00:50:24.000 | Although, I would say the boost in performance
00:50:28.000 | is not as big as the previous ones.
00:50:30.000 | They are good on some metrics and bad on other metrics.
00:50:39.000 | So this is my favorite part of the paper.
00:50:45.000 | They did-- after the benchmarks, they now have a good model.
00:50:48.000 | So they are trying to understand why their model is good
00:50:51.000 | and why their GPT-1 is suddenly SOTA.
00:50:56.000 | So they do analysis and try to understand
00:51:00.000 | why this is happening.
00:51:02.000 | So the first step is they're trying
00:51:04.000 | to analyze the impact of the transfer learning,
00:51:07.000 | the number of layers transferred from the pre-trained model
00:51:10.000 | to the fine-tuned model.
00:51:13.000 | So they just take the pre-trained transformer,
00:51:16.000 | and they take all the layers and do fine-tuning.
00:51:20.000 | And then they take 11 layers, and then they do fine-tuning.
00:51:23.000 | And then they take 10 layers and do fine-tuning, and so on.
00:51:26.000 | They compare the performance of these models.
00:51:28.000 | And as you can see, the more layers you transfer,
00:51:30.000 | the better the performance you get.
00:51:32.000 | And if you do zero layers transferred,
00:51:35.000 | you're not taking any of the pre-trained weights.
00:51:39.000 | You're just starting from scratch.
00:51:41.000 | You get this performance.
00:51:42.000 | But if you just add one layer, you
00:51:44.000 | get a very significant boost.
00:51:46.000 | The solid lines here are the dev data sets.
00:51:50.000 | The dashed line is the training data set.
00:51:52.000 | So you can see a very big improvement in performance
00:51:54.000 | just by adding one layer.
00:51:56.000 | And this is almost like a continuous trend.
00:52:00.000 | So they observe the obvious fact that transferring more layers
00:52:06.000 | improves performance.
00:52:07.000 | And just only transferring the embedding layers
00:52:10.000 | also improves performance significantly.
00:52:13.000 | And they mentioned that transferring layers gives a boost
00:52:17.000 | of up to 9% in total for full transfer on this task, MultiNLI.
00:52:22.000 | And this analysis actually indicates
00:52:24.000 | that each layer in the pre-trained model
00:52:26.000 | actually has a purpose.
00:52:27.000 | And it learns something that's useful.
00:52:30.000 | And it's very helpful in the fine-tuning model.
00:52:34.000 | So different layers learn different things.
00:52:37.000 | They learn different scales.
00:52:38.000 | And they are all helpful.
00:52:40.000 | This is a very good finding of this work.
00:52:44.000 | So each layer in the pre-trained model
00:52:46.000 | contains useful functionality
00:52:48.000 | for solving the target tasks, like classification.
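One plausible way to set up this layer-transfer experiment (my assumption about the mechanics, not the paper's released code) is to copy the embeddings plus the first k pre-trained transformer blocks into the target-task model and initialise anything beyond that from scratch:

```python
# Hedged sketch of the "transfer the first k layers" analysis (assumed setup).
import copy
import torch.nn as nn

def build_target_blocks(pretrained_blocks: nn.ModuleList, k: int,
                        make_fresh_block) -> nn.ModuleList:
    """Copy the first k pre-trained blocks; build the rest freshly initialised."""
    blocks = [copy.deepcopy(pretrained_blocks[i]) for i in range(k)]
    blocks += [make_fresh_block() for _ in range(len(pretrained_blocks) - k)]
    return nn.ModuleList(blocks)

# Sweeping k from 0 (train from scratch) to 12 (full transfer) and fine-tuning
# each variant gives the curve discussed above: the more layers transferred,
# the better the downstream performance.
```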
00:52:51.000 | The second piece of analysis they did is zero-shot evaluation.
00:52:59.000 | And this was, I would say, radical back then.
00:53:03.000 | They are trying to evaluate how good the pre-trained model is
00:53:08.000 | on these tasks without even fine-tuning.
00:53:11.000 | And they want to answer the questions like,
00:53:13.000 | why is the language model pre-training effective?
00:53:16.000 | Why does pre-training a transformer
00:53:20.000 | on language modeling help us when we are doing classification?
00:53:24.000 | They have a hypothesis that the underlying generative model
00:53:28.000 | learns actually multiple tasks in the pre-training.
00:53:31.000 | So it's not just learning language modeling.
00:53:33.000 | Because if you think about it, if you're
00:53:36.000 | learning how to model language properly,
00:53:40.000 | you're more likely to have a deeper understanding
00:53:43.000 | of the language and just natural language.
00:53:47.000 | Like, for example, if you speak only English and French,
00:53:53.000 | and you cannot speak, let's say, Spanish,
00:53:55.000 | you won't be able to do classification in Spanish.
00:53:59.000 | But in English and French, you probably already have
00:54:01.000 | the understanding
00:54:03.000 | to do these tasks.
00:54:05.000 | So this is something cool and something to think about a bit.
00:54:10.000 | Another thing is attention is very helpful.
00:54:13.000 | Transformers show very big improvement compared to LSTMs.
00:54:18.000 | Or that's what they are hypothesizing about
00:54:21.000 | and what they are trying to test.
00:54:23.000 | So to test these two hypotheses, they
00:54:25.000 | designed a series of heuristics.
00:54:27.000 | They basically tried to evaluate the pre-trained model
00:54:32.000 | on these tasks before doing the supervised fine-tuning.
00:54:36.000 | And they have different modifications
00:54:38.000 | for each data set and each task.
00:54:42.000 | For example, for linguistic acceptability with CoLA,
00:54:46.000 | they take each example sentence
00:54:48.000 | and compute the average token log probability
00:54:51.000 | over its tokens, and use this as a score.
00:54:54.000 | And they have a threshold.
00:54:56.000 | And they just determine if the model got this right or wrong.
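As a rough sketch of that CoLA heuristic (the helper and the threshold value are assumptions for illustration): score each sentence by its mean per-token log-probability under the language model and compare against a threshold.

```python
# Hedged sketch of the CoLA zero-shot heuristic (assumed helper, placeholder threshold).

def mean_token_logprob(token_logprobs: list[float]) -> float:
    """Average log-probability the LM assigns to the sentence's tokens."""
    return sum(token_logprobs) / len(token_logprobs)

def is_acceptable(token_logprobs: list[float], threshold: float = -5.0) -> bool:
    # The real threshold would be chosen on validation data; -5.0 is a placeholder.
    return mean_token_logprob(token_logprobs) > threshold
```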
00:54:59.000 | Something that's very cool is how
00:55:03.000 | they handle sentiment analysis.
00:55:05.000 | So you've got your, let's say, Amazon review.
00:55:08.000 | And you append the token "very" to the review
00:55:11.000 | and restrict the language model to generate only one
00:55:15.000 | of two tokens, positive or negative.
00:55:17.000 | And you see which token has the higher score or higher
00:55:20.000 | probability.
00:55:21.000 | And this is the model's sentiment prediction
00:55:24.000 | for this review.
00:55:26.000 | So just restrict the language model prediction
00:55:29.000 | to these two tokens and see which one is higher.
00:55:34.000 | This is actually pretty cool.
00:55:35.000 | It's kind of similar to constrained grammar
00:55:38.000 | and constrained output.
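A tiny sketch of that sentiment trick, assuming a hypothetical `next_token_logprob(prompt, token)` helper that returns the language model's log-probability for a given next token:

```python
# Hedged sketch of the zero-shot sentiment heuristic (next_token_logprob is hypothetical).

def zero_shot_sentiment(review: str, next_token_logprob) -> str:
    prompt = review + " very"  # append the token "very" as described above
    pos = next_token_logprob(prompt, "positive")
    neg = next_token_logprob(prompt, "negative")
    # Restrict the prediction to the two tokens and pick the more likely one.
    return "positive" if pos > neg else "negative"
```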
00:55:41.000 | Another example is question answering.
00:55:43.000 | So for each candidate answer, you concatenate the document,
00:55:46.000 | question, and answer in the input.
00:55:48.000 | And you average the token log probability of just the answer
00:55:52.000 | and use this as the score for this answer.
00:55:54.000 | And do this for multiple answers and just
00:55:56.000 | pick the one with the highest score.
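A sketch of that multiple-choice scoring, assuming a hypothetical `answer_token_logprobs(context, answer)` helper that returns the log-probabilities of the answer tokens given the concatenated context:

```python
# Hedged sketch of the multiple-choice zero-shot scoring (assumed helper).

def pick_answer(document: str, question: str, candidates: list[str],
                answer_token_logprobs) -> str:
    def score(answer: str) -> float:
        logps = answer_token_logprobs(document + " " + question, answer)
        return sum(logps) / len(logps)  # mean log-prob of the answer tokens only
    return max(candidates, key=score)
```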
00:55:58.000 | And for Winograd schemas, for this task,
00:56:06.000 | let's say you have a sentence that
00:56:08.000 | says the city councilman refused the demonstrators a permit
00:56:11.000 | because they advocated violence.
00:56:13.000 | So the goal here is to find this word "they."
00:56:17.000 | What does it refer to?
00:56:18.000 | Does it refer to the demonstrators
00:56:20.000 | or the city councilman?
00:56:22.000 | This is what they mean by a Winograd schema.
00:56:24.000 | And I think it's a very popular task.
00:56:27.000 | So the way they do this is they take this word "they"
00:56:30.000 | and replace it with the two possible answers.
00:56:33.000 | In this case, councilman and demonstrators.
00:56:35.000 | So in example A, you use the word "councilman,"
00:56:40.000 | and in example B, you use the word "demonstrators."
00:56:42.000 | And just average the token probability
00:56:47.000 | of what comes after this word.
00:56:49.000 | And then you just take the one with the higher probability
00:56:55.000 | score.
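And a sketch of the Winograd heuristic, assuming a hypothetical `continuation_logprobs(text, start_after)` helper that returns the log-probabilities of the tokens following the substituted candidate:

```python
# Hedged sketch of the Winograd-schema heuristic (assumed helper).

def resolve_pronoun(sentence: str, pronoun: str, candidates: list[str],
                    continuation_logprobs) -> str:
    def score(candidate: str) -> float:
        resolved = sentence.replace(pronoun, candidate, 1)
        # Average log-prob of the tokens that come after the substitution.
        logps = continuation_logprobs(resolved, start_after=candidate)
        return sum(logps) / len(logps)
    return max(candidates, key=score)
```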
00:56:58.000 | So this is how they wanted to evaluate the model.
00:57:01.000 | Let's see, actually, the evaluation results.
00:57:04.000 | And surprise, surprise, the longer
00:57:08.000 | you train the model and the more data you give the model,
00:57:13.000 | the better it tends to perform on these tasks.
00:57:15.000 | So the better zero-shot performance it has, basically.
00:57:19.000 | This is a very cool graph.
00:57:21.000 | So we have different tasks here.
00:57:23.000 | And the solid line is the GPT transformer.
00:57:26.000 | And you can see with more steps, more training steps,
00:57:29.000 | the better performance you have.
00:57:31.000 | And they also did a very cool analysis
00:57:33.000 | where they trained an LSTM, which is the dashed line.
00:57:36.000 | And you can see that for almost every single task,
00:57:39.000 | the transformer is better than the LSTM.
00:57:43.000 | So they observed the obvious fact that more training gives
00:57:47.000 | better performance.
00:57:49.000 | More pre-training gives better performance,
00:57:51.000 | even in zero-shot settings.
00:57:54.000 | And this suggests that during
00:57:57.000 | generative pre-training,
00:58:00.000 | the model actually learns a lot of tasks, not just language modeling.
00:58:03.000 | And they mentioned something here.
00:58:05.000 | They said they observed that the LSTM exhibits higher variance
00:58:08.000 | in its zero-shot performance, suggesting
00:58:10.000 | that the inductive bias of the transformer assists in transfer.
00:58:13.000 | But I'm not sure what they mean by high variance
00:58:16.000 | in this graph, actually.
00:58:18.000 | So if anyone has any clue about this, please share it with us.
00:58:21.000 | Yeah, it's Bitter Lesson 101, and GPUs are all we need.
00:58:29.000 | Sadly.
00:58:36.000 | So the final section in the paper
00:58:40.000 | is the ablation studies they did.
00:58:43.000 | They have three ablation studies.
00:58:45.000 | They want to first see the effect
00:58:47.000 | of the auxiliary language modeling objective
00:58:50.000 | during fine-tuning.
00:58:52.000 | So in this case, we have the normal transformer training
00:58:58.000 | with auxiliary language modeling, which is row 1.
00:59:01.000 | And we have the one without auxiliary language modeling,
00:59:04.000 | which is row 3.
00:59:06.000 | They say that the trend suggests that larger datasets benefit
00:59:12.000 | from auxiliary training compared to smaller datasets.
00:59:15.000 | And this is actually probably
00:59:18.000 | why people stopped doing auxiliary language modeling.
00:59:20.000 | It doesn't seem to be that important.
00:59:26.000 | The second ablation study is the effect of the transformer.
00:59:29.000 | So they just trained the usual transformer,
00:59:32.000 | and they compared it with an LSTM, so row 1 and row 4.
00:59:38.000 | And you can see, yeah, the transformer
00:59:40.000 | has generally better performance than the LSTM on most tasks.
00:59:47.000 | So they mentioned that they observed a 5.6 average score drop
00:59:50.000 | when they used the LSTM.
00:59:54.000 | And it only outperforms the transformer
00:59:56.000 | on one dataset, which is MRPC.
01:00:01.000 | I'm not sure about this one, but I believe it's the Microsoft Research Paraphrase Corpus.
01:00:03.000 | The third ablation study, and I
01:00:05.000 | would say the most important one,
01:00:06.000 | is the effect of pre-training.
01:00:08.000 | So they compared the transformer that
01:00:10.000 | has been trained in the framework they proposed,
01:00:13.000 | the two-step framework.
01:00:14.000 | And they also compared it to a transformer that
01:00:17.000 | was directly trained on the supervised task, that
01:00:20.000 | is, without any pre-training.
01:00:21.000 | So we have row 1 and row 2.
01:00:23.000 | You can see that this is where the huge difference shows up,
01:00:26.000 | like 74 compared to 59, 45 compared to 18, 88 compared
01:00:32.000 | to 71.
01:00:33.000 | So this is a very big difference.
01:00:36.000 | Yeah, and we observed that no pre-training actually
01:00:39.000 | hurts the performance quite a lot on almost all tasks.
01:00:44.000 | And this results in about a 15% decrease.
01:00:48.000 | And this is actually the worst-performing model
01:00:50.000 | out of all these models.
01:00:52.000 | Yeah.
01:00:53.000 | So I would say that the conclusion
01:00:55.000 | from this ablation study is that pre-training
01:00:58.000 | is very, very important.
01:00:59.000 | I think this is the final section before the future
01:01:07.000 | studies. Do you want to discuss anything?
01:01:10.000 | Any questions so far?
01:01:11.000 | I think chat seems good.
01:01:17.000 | Do you want to-- you have one more slide, I think.
01:01:19.000 | You said future studies.
01:01:20.000 | Yeah.
01:01:21.000 | Do you want to just finish it up?
01:01:23.000 | Then we can open it up to questions if people--
01:01:26.000 | Sure.
01:01:27.000 | Yeah.
01:01:28.000 | Sure.
01:01:29.000 | So this section, I don't think it was in the paper.
01:01:31.000 | I think it was only on the blog post.
01:01:33.000 | They discuss what future work could be.
01:01:37.000 | And the first approach is surprise, surprise,
01:01:40.000 | just scale up.
01:01:42.000 | So they mentioned that they noticed improvements
01:01:44.000 | in language modeling
01:01:45.000 | correlate with improvements on downstream tasks.
01:01:48.000 | And they're only using very limited hardware,
01:01:50.000 | a handful of GPUs on a single machine, and training
01:01:52.000 | on a fairly small data set.
01:01:54.000 | So maybe there is room for improvement
01:01:56.000 | if you scale the model, the training, and the data set.
01:01:58.000 | And I think we know the answer to this question,
01:02:01.000 | or the answer to this hypothesis.
01:02:04.000 | The second approach,
01:02:07.000 | the second future direction they want to try,
01:02:09.000 | is to see whether you can tweak the fine-tuning
01:02:14.000 | approach instead of just doing vanilla fine-tuning.
01:02:17.000 | You can use adaptation methods or one of the other fancier ways
01:02:21.000 | of doing fine-tuning.
01:02:23.000 | I don't think this turned out to be as important
01:02:28.000 | as they might have thought.
01:02:29.000 | Because people right now are doing just simple fine-tuning,
01:02:32.000 | and it works.
01:02:34.000 | So yeah, not quite as promising as the first approach, which
01:02:37.000 | is just scaling up.
01:02:39.000 | And the third one is understanding
01:02:41.000 | of why generative pretraining helps.
01:02:44.000 | And they've done some ablations and analysis about this.
01:02:48.000 | They want to do even more further targeted experiments
01:02:51.000 | and research to understand this.
01:02:53.000 | So basically, interpretability and explainability
01:02:56.000 | in machine learning.
01:02:59.000 | And one very good question they ask
01:03:01.000 | is, how much does longer context help compared
01:03:06.000 | to just improved world knowledge when
01:03:09.000 | you are doing pretraining?
01:03:11.000 | So for example, this GPT model is
01:03:13.000 | able to process a longer context because it's a transformer.
01:03:18.000 | So is this the thing that gives it SOTA performance?
01:03:21.000 | Or is it the fact that they train on a bigger data set
01:03:26.000 | with a longer training time, and it obtains world knowledge?
01:03:30.000 | And I think both are important, I would say.
01:03:33.000 | But that's a very good question.
01:03:37.000 | So yeah, that was it for the GPT-1 paper.
01:03:39.000 | They introduce a framework for achieving SOTA performance
01:03:42.000 | with a two-step approach: do pre-training,
01:03:44.000 | and then fine-tuning.
01:03:45.000 | The goal of pretraining is to obtain a very good world
01:03:53.000 | knowledge and a very good starting point.
01:03:55.000 | And you do the pretraining on a diverse corpus
01:03:58.000 | with long stretches of text.
01:04:00.000 | So the model acquires, actually, world knowledge.
01:04:02.000 | And then you actually do transfer learning
01:04:04.000 | by fine-tuning on tasks like question answering and so on.
01:04:08.000 | And the result is that they improve
01:04:11.000 | the state-of-the-art performance on 9 data sets out of the 12 studied.
01:04:15.000 | And they successfully have utilized
01:04:18.000 | the unsupervised approach to do transfer learning
01:04:21.000 | to discriminative tasks.
01:04:23.000 | So now we have a clear way of how
01:04:24.000 | to do unsupervised training and semi-supervised training.
01:04:31.000 | And the work also highlights that the transformer
01:04:35.000 | is a very good architecture, and larger data sets are good.
01:04:39.000 | And this is a very important push
01:04:41.000 | in the direction of scaling up and pretraining.
01:04:44.000 | And this is actually what people are
01:04:46.000 | going to do for the next six years, from 2018 to 2024.
01:04:52.000 | Yeah, so very good paper, very good work.
01:04:56.000 | And with that, we've come to the end of this paper.
01:04:58.000 | So if you have any questions, I think we can take questions now.