
Stanford CS25: V1 | Transformers United: DL Models that have revolutionized NLP, CV, RL


Chapters

0:00 Introduction
2:43 Overview of Transformers
6:03 Attention mechanisms
7:53 Self-attention
11:38 Other necessary ingredients
13:32 Encoder-Decoder Architecture
16:02 Advantages & Disadvantages
18:04 Applications of Transformers

Whisper Transcript

00:00:05.880 | >> Hey everyone, welcome to the first and
00:00:08.640 | introductory lecture for CS25, Transformers United.
00:00:13.000 | So CS25 was a class that the three of us created and
00:00:16.480 | taught at Stanford in the fall of 2021.
00:00:19.760 | And the subject of the class is not, as the picture might suggest,
00:00:23.800 | about robots that can transform into cars.
00:00:27.480 | It's about deep learning models, and specifically a particular kind of
00:00:31.620 | deep learning model that has revolutionized multiple fields,
00:00:35.340 | from natural language processing to things like computer vision and
00:00:39.840 | reinforcement learning, to name a few.
00:00:41.480 | We have an exciting set of videos lined up for you.
00:00:45.880 | We had some truly fantastic speakers come and
00:00:48.400 | give talks about how they were applying transformers in their own research.
00:00:53.280 | And we hope you will enjoy and learn from these talks.
00:00:56.440 | So this video is purely an introductory lecture to talk a little bit about
00:01:00.560 | transformers.
00:01:01.800 | And before we get started, I'd like to introduce the instructors.
00:01:05.560 | So my name is Advay.
00:01:07.000 | I am a software engineer at a company called Applied Intuition.
00:01:10.320 | Before this, I was a master's student in CS at Stanford.
00:01:14.320 | And I am one of the co-instructors for CS25.
00:01:19.680 | Chaitanya, Div, if the two of you could introduce yourselves.
00:01:22.480 | >> So hi, everyone.
00:01:23.880 | I am a PhD student at Stanford.
00:01:26.280 | Before this, I was pursuing a master's here.
00:01:29.520 | I'm researching a lot in generative modeling, reinforcement learning, and
00:01:32.760 | robotics.
00:01:33.880 | So nice to meet you all.
00:01:35.280 | >> Yeah, that was Div, since he didn't say his name.
00:01:38.400 | Chaitanya, if you want to introduce yourself.
00:01:40.560 | >> Yeah, hi, everyone.
00:01:41.920 | My name is Chaitanya, and
00:01:43.520 | I'm currently working as an ML engineer at a startup called Moveworks.
00:01:48.360 | Before that, I was a master's student at Stanford specializing in NLP and
00:01:52.360 | was a member of the prize-winning Stanford's team for the Alexa Prize Challenge.
00:01:56.280 | >> All right, awesome.
00:01:59.680 | So moving on to the rest of this talk.
00:02:04.520 | Essentially, what we hope you will learn watching these videos, and
00:02:08.840 | what we hope the people who took our class in the fall of 2021 learned, is three things.
00:02:15.320 | One is we hope you will have an understanding of how transformers work.
00:02:19.720 | Secondly, we hope you will learn, and by the end of these talks,
00:02:23.560 | understand how transformers are being applied beyond just natural language
00:02:28.080 | processing.
00:02:29.460 | And thirdly, we hope that some of these talks will spark some new ideas within you,
00:02:34.360 | and hopefully lead to new directions of research,
00:02:37.320 | new kinds of innovation, and things of that sort.
00:02:40.280 | And to begin, we're going to talk a little bit about transformers and
00:02:48.280 | introduce some of the context behind transformers as well.
00:02:52.280 | And for that, I'd like to hand it off to Div.
00:02:53.960 | >> So hi, everyone.
00:03:02.040 | So welcome to our transformer seminar.
00:03:04.960 | So I will start first with an overview of the attention timeline and
00:03:08.860 | how it came to be.
00:03:09.960 | The key idea about transformers was the self-attention mechanism that was
00:03:14.260 | developed in 2017, and it all started with this one paper called
00:03:17.220 | Attention Is All You Need by Vaswani et al.
00:03:19.900 | Before 2017, we used to have this prehistoric era where we had older models
00:03:24.420 | like RNNs, LSTMs, and simpler attention mechanisms.
00:03:28.620 | And eventually, the growth in transformers has exploded into other fields and
00:03:32.580 | has become prominent in all of machine learning.
00:03:35.540 | And I'll go on and show how this has been used.
00:03:39.900 | So in the prehistoric era, there used to be RNNs.
00:03:43.660 | There were different models, like sequence-to-sequence models, LSTMs, and GRUs.
00:03:47.700 | They were good at encoding some sort of memory, but they did not work for
00:03:51.840 | encoding long sequences, and they were very bad at encoding context.
00:03:55.220 | So here is an example where if you have a sentence like, I grew up in France,
00:03:59.180 | dot, dot, dot, so I speak fluent dash.
00:04:02.380 | Then you want to fill this with French based on the context, but
00:04:05.340 | an LSTM model might not know what it is and might just make a very big mistake here.
00:04:09.780 | Similarly, we can show some sort of correlation map here where if you have
00:04:13.940 | a pronoun like it, we want it to correlate to one of the past nouns that we have seen
00:04:18.420 | so far, like animal, but again, older models were really not good at this context encoding.
00:04:26.100 | So where we are currently is on the verge of takeoff.
00:04:29.580 | We're beginning to realize the potential of transformers in different fields.
00:04:32.900 | We have started to use them to solve long-sequence problems and
00:04:36.740 | protein folding, such as the AlphaFold model from DeepMind,
00:04:42.780 | which gets around 95% accuracy, and on different challenges in offline RL.
00:04:47.580 | We can use them for few-shot and zero-shot generalization for text and image generation.
00:04:52.100 | And we can also use this for content generation.
00:04:53.860 | So here's an example from OpenAI, where you can give a different text prompt and
00:04:58.580 | have an AI-generated fictional image for you.
00:05:01.260 | And so there's a talk on this that you can also watch on YouTube,
00:05:06.500 | which basically says that LSTMs are dead and long live transformers.
00:05:09.940 | So what's the future?
00:05:13.340 | So we can enable a lot more applications for transformers.
00:05:17.660 | They can be applied to any form of sequence modeling.
00:05:20.740 | So we could use them for real understanding.
00:05:23.340 | We can use them for finance and a lot more.
00:05:25.740 | So basically imagine all sorts of generative modeling problems.
00:05:29.180 | Nevertheless, there are a lot of missing ingredients.
00:05:31.660 | So like the human brain, we need some sort of external memory unit,
00:05:35.860 | which is the hippocampus for us.
00:05:37.860 | And there are some early works here.
00:05:40.460 | So one nice work you might want to check out is called Neural Turing Machines.
00:05:44.340 | Similarly, the current attention mechanisms are very computationally complex
00:05:49.100 | in terms of time, and they scale quadratically, which we'll discuss later.
00:05:52.500 | And we want to make them more linear.
00:05:54.700 | And the third problem is that we want to align our current sort of language models
00:05:58.500 | with how the human brain works and human values.
00:06:01.340 | And this is also a big issue.
00:06:03.260 | OK, so now I will dive deeper into the attention mechanisms
00:06:10.260 | and show how they came to be.
00:06:12.260 | So initially, these used to be very simple mechanisms.
00:06:17.780 | Attention was inspired by the process of assigning importance to,
00:06:21.220 | or putting attention on, different parts of an image,
00:06:24.420 | where, similar to a human, you might focus more on the foreground
00:06:28.380 | if you have an image of a dog, compared to the rest of the background.
00:06:31.060 | So in the case of soft attention, what you do is you learn a
00:06:34.340 | soft attention weighting for each pixel, which can be a weight between 0 and 1.
00:06:39.100 | The problem over here is that this is a very expensive computation.
00:06:42.380 | As shown in the figure on the left,
00:06:46.540 | you can see we are calculating this attention map for the whole image.
00:06:48.740 | What you can do instead, with hard attention, is just calculate a 0-or-1 attention map,
00:06:55.500 | where we directly put a 1 wherever the dog is and a 0 wherever it's background.
00:07:00.780 | This is less computationally expensive,
00:07:03.260 | but the problem is it's not differentiable, which makes things harder to train.
00:07:06.140 | Going forward, we also have different varieties of basic attention mechanisms
00:07:10.980 | that were proposed before self-attention.
00:07:14.140 | So the first variety here is global attention models.
00:07:17.500 | So in global attention models, for each hidden-layer output,
00:07:23.420 | you learn an attention weight, a_t.
00:07:26.300 | And this is element-wise multiplied with your current output to calculate your
00:07:29.660 | final output, y_t.
00:07:31.260 | Similarly, you have local attention models,
00:07:35.340 | where instead of calculating the global attention over the whole sequence length,
00:07:39.980 | you only calculate the attention over a small window.
00:07:43.500 | And then you weight the current output by the attention over that window
00:07:49.100 | to get the final output you need.
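As a rough illustration of the two variants just described, here is a minimal NumPy sketch. The dot-product scoring function, the variable names, and the window size are our own simplifying assumptions, not details from the lecture.

```python
import numpy as np

def global_attention_context(decoder_state, encoder_states):
    """Global attention: score the decoder state against *every* encoder
    hidden state, softmax the scores, and return the weighted context vector."""
    scores = encoder_states @ decoder_state              # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                             # attention weights a_t over the whole sequence
    return weights @ encoder_states                      # context vector used to form the output y_t

def local_attention_context(decoder_state, encoder_states, center, window=2):
    """Local attention: the same computation, restricted to a small window around `center`."""
    lo, hi = max(0, center - window), min(len(encoder_states), center + window + 1)
    return global_attention_context(decoder_state, encoder_states[lo:hi])

# Toy usage: 6 source positions with hidden size 4.
rng = np.random.default_rng(0)
h_enc = rng.normal(size=(6, 4))
h_dec = rng.normal(size=(4,))
print(global_attention_context(h_dec, h_enc).shape)            # (4,)
print(local_attention_context(h_dec, h_enc, center=3).shape)   # (4,)
```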
00:07:50.340 | So moving on, I'll pass on to Chaitanya to discuss self-attention mechanisms and
00:07:56.340 | transformers.
00:07:58.580 | >> Yeah, thank you, Div, for covering a brief overview of how the primitive
00:08:03.660 | versions of attention work.
00:08:05.340 | Now, just before we talk about self-attention, just a bit of a trivia that
00:08:10.060 | this term was first introduced by a paper from Lin et al,
00:08:14.060 | which provided a framework for a self-attentive mechanism for
00:08:18.620 | our sentence embeddings.
00:08:22.540 | And now moving on to the main crux of the transformers paper,
00:08:26.420 | which was the self-attention block.
00:08:28.380 | So self-attention is the basis, the main building block, for
00:08:33.300 | what makes the transformer models work so well and what enables them and
00:08:38.220 | makes them so powerful.
00:08:40.260 | So to think of it more easily,
00:08:42.140 | we can break down the self-attention as a search retrieval problem.
00:08:46.980 | So the problem is that, given a query Q, we need to find the set of keys K
00:08:53.260 | which are most similar to Q, and return the corresponding values V.
00:08:58.220 | Now, these three vectors can be drawn from the same source.
00:09:01.060 | For example, we can have that Q, K, and V are all equal to a single vector X,
00:09:05.620 | where X can be output of a previous layer.
00:09:08.580 | In transformers, these vectors are obtained by applying different linear
00:09:12.820 | transformations to X.
00:09:14.460 | So as to enable the model to capture more complex interactions between
00:09:18.780 | the different tokens at different places of the sentence.
00:09:22.860 | Now, attention is computed as just a weighted summation of the values,
00:09:27.540 | where each value is weighted by the similarity between the query and
00:09:31.860 | the corresponding key.
00:09:33.260 | And in the transformers paper, they use the scaled dot product as the similarity
00:09:37.820 | function for the queries and keys.
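To make the search-and-retrieval analogy concrete, here is a minimal NumPy sketch of scaled dot-product attention as described in the paper; the toy shapes and random projection matrices are our own illustration rather than anything from the lecture.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention, following Vaswani et al. (2017).
    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v); returns (n_queries, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # similarity of every query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                                        # weighted sum of the values

# Toy example: 3 tokens, with Q, K, V all derived from the same input X.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                                   # 3 tokens, model dimension 4
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))   # different linear transformations of X
out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)  # (3, 4)
```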
00:09:40.260 | And another important aspect of the transformers was the introduction of
00:09:44.580 | multi-head self-attention.
00:09:46.420 | So what multi-head self-attention means is that
00:09:50.700 | at every layer, the self-attention is performed multiple times in parallel,
00:09:54.260 | which enables the model to learn multiple representation subspaces.
00:09:58.660 | So in a way, you can think of it as each head having the power to look at
00:10:05.820 | different things and to learn different semantics.
00:10:08.580 | For example, one head can be learning to predict the part of
00:10:13.580 | speech for the tokens.
00:10:15.220 | Another head might be learning the syntactic structure of the sentence,
00:10:19.660 | and so on, all of which help in understanding what the sentence
00:10:26.860 | means.
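A small sketch of the multi-head idea follows. Splitting one set of projections evenly across heads is a simplification of the per-head projection matrices used in the paper, and the dimensions are illustrative only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Toy multi-head self-attention over X of shape (n_tokens, d_model)."""
    n_tokens, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # one big projection, then split per head
    outputs = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        outputs.append(softmax(scores) @ V[:, s])    # each head attends in its own subspace
    return np.concatenate(outputs, axis=-1) @ W_o    # concatenate the heads and mix them

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))                          # 5 tokens, d_model = 8
W = [rng.normal(size=(8, 8)) for _ in range(4)]
print(multi_head_self_attention(X, *W, n_heads=2).shape)  # (5, 8)
```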
00:10:28.740 | Now, to better understand how self-attention works and what the
00:10:32.020 | different computations are, there is a short video.
00:10:35.140 | So as you can see, there are three incoming tokens.
00:10:41.180 | So input 1, input 2, input 3.
00:10:43.220 | We apply linear transformations to get the key and value vectors for each input,
00:10:49.100 | and then once a query Q comes, we calculate its similarity with the
00:10:53.220 | respective key vectors, and then multiply those scores with the value
00:10:58.900 | vectors, and then add them all up to get the output.
00:11:02.740 | The same computation is then performed on all the tokens, and we get the output
00:11:08.420 | of the self-attention layer.
00:11:10.380 | So as you can see here, the final output of the self-attention layer is in dark
00:11:14.580 | green that's at the top of the screen.
00:11:17.660 | So now again, for the final token, we perform everything the same way: queries
00:11:21.500 | multiplied by keys.
00:11:22.820 | We get the similarity scores, and then those similarity scores weigh the value
00:11:26.820 | vectors, and then we finally perform the addition to get the self-attention
00:11:31.300 | output of the transformers.
00:11:39.220 | Apart from self-attention, there are some other necessary ingredients that make
00:11:44.620 | the transformer so powerful.
00:11:46.540 | One important aspect is the presence of positional representations or the
00:11:50.700 | embedding layer.
00:11:51.740 | So one reason RNNs worked well was that they process the
00:11:58.020 | information in a sequential order, so there was this notion of ordering,
00:12:03.220 | which is also very important in understanding language, because we all
00:12:06.980 | know that we read any piece of text from left to right in most languages,
00:12:14.940 | and right to left in some languages.
00:12:17.220 | This notion of ordering is lost in self-attention,
00:12:20.740 | because every word is attending to every other word.
00:12:24.060 | That's why this paper introduced a separate embedding layer for introducing
00:12:28.900 | positional representations.
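The original paper's positional representations are fixed sinusoidal encodings that are simply added to the token embeddings (learned positional embeddings are another common choice); a small sketch, assuming the formulation from Vaswani et al.:

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """Fixed sinusoidal encodings from Vaswani et al.:
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))"""
    positions = np.arange(n_positions)[:, None]          # (n_positions, 1)
    dims = np.arange(0, d_model, 2)[None, :]             # even embedding dimensions
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoding is simply added to the token embeddings before the first layer:
# embeddings = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
print(sinusoidal_positional_encoding(4, 6))
```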
00:12:30.980 | The second important aspect is having nonlinearities.
00:12:34.300 | So if you think of all the computation that is happening in the self-attention
00:12:38.100 | layer, it's all linear because it's all matrix multiplication.
00:12:41.220 | But as we all know, that deep learning models work well when they are able to
00:12:47.700 | learn more complex mappings between input and output, which can be attained
00:12:52.140 | by a simple MLP.
00:12:54.220 | And the third important component of the transformers is the masking.
00:12:59.020 | So masking is what allows us to parallelize the operations.
00:13:03.020 | Since every word can attend to every other word, in the decoder part of the
00:13:07.220 | transformers, which Advay is going to be talking about later, the problem
00:13:11.380 | arises that you don't want the decoder to look into the future, because that can
00:13:16.540 | result in data leakage.
00:13:18.340 | So that's why masking helps the decoder avoid that future information and learn
00:13:24.860 | only from what the model has processed so far.
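A minimal sketch of that causal mask, assuming we already have the raw query-key scores: positions above the diagonal are pushed to effectively zero weight before the softmax, so no token can attend to the future.

```python
import numpy as np

def causal_mask(n_tokens):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return np.tril(np.ones((n_tokens, n_tokens), dtype=bool))

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -1e9)        # blocked positions get ~zero attention weight
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(2).normal(size=(4, 4))   # raw query-key scores for 4 tokens
weights = masked_softmax(scores, causal_mask(4))
print(np.round(weights, 2))   # upper triangle is (numerically) zero: no peeking at the future
```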
00:13:29.580 | So now on to the encoder-decoder architecture of the transformers.
00:13:34.780 | Advay?
00:13:36.580 | - Yeah, thanks, Chaitanya, for talking about self-attention.
00:13:39.620 | So self-attention is sort of the key ingredient or one of the key ingredients
00:13:44.940 | that allows transformers to work so well.
00:13:47.540 | But at a very high level, the model that was proposed in the Vaswani et al.
00:13:51.940 | paper of 2017 was like previous language models in the sense that it had an
00:13:57.540 | encoder-decoder architecture.
00:13:59.540 | What that means is, let's say you're working on a translation problem.
00:14:02.900 | You want to translate English to French.
00:14:04.940 | The way that would work is you would read in the entire input of your English
00:14:09.340 | sentence, you would encode that input, so that's the encoder part of the network.
00:14:13.780 | And then you would generate token by token the corresponding French translation.
00:14:18.420 | And the decoder is the part of the network that is responsible for generating
00:14:22.980 | those tokens.
00:14:24.420 | So you can think of these encoder blocks and decoder blocks as essentially
00:14:29.980 | something like Lego.
00:14:30.980 | They have these sub-components that make them up.
00:14:34.580 | And in particular, the encoder block has three main sub-components.
00:14:38.540 | The first is a self-attention layer that Chaitanya talked about earlier.
00:14:43.180 | And as talked about earlier as well, you need a feed-forward layer after that
00:14:48.500 | because the self-attention layer only performs linear operations.
00:14:52.140 | And so you need something that can capture the non-linearities.
00:14:55.540 | You also have a layer norm after this.
00:14:58.060 | And lastly, there are residual connections between different encoder blocks.
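Putting those sub-components together, here is a single-head sketch of one encoder block, using the post-norm ordering from the original paper (layer norm applied after each residual connection). The weight shapes and the lack of learned layer-norm gain/bias parameters are simplifications of our own.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_block(X, W_q, W_k, W_v, W1, b1, W2, b2):
    """One encoder block: self-attention, then feed-forward, each followed by
    a residual connection and layer norm (post-norm, as in the original paper)."""
    d_k = W_q.shape[1]
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    attn = softmax(Q @ K.T / np.sqrt(d_k)) @ V          # self-attention (single head here)
    X = layer_norm(X + attn)                            # residual connection + layer norm
    ffn = np.maximum(0, X @ W1 + b1) @ W2 + b2          # position-wise feed-forward (ReLU)
    return layer_norm(X + ffn)                          # residual connection + layer norm

rng = np.random.default_rng(3)
d, d_ff = 8, 32
X = rng.normal(size=(5, d))                             # 5 tokens, d_model = 8
params = [rng.normal(size=s) * 0.1 for s in
          [(d, d), (d, d), (d, d), (d, d_ff), (d_ff,), (d_ff, d), (d,)]]
print(encoder_block(X, *params).shape)  # (5, 8)
```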
00:15:02.740 | The decoder is very similar to the encoder, but there's one difference,
00:15:06.340 | which is that it has an extra layer, because the decoder doesn't just do
00:15:10.100 | multi-head attention on the output of the previous layers.
00:15:13.900 | So for context, the encoder does multi-head attention in each self-attention
00:15:19.340 | layer of the encoder block.
00:15:21.300 | In each of the encoder blocks, that multi-head attention looks only at the
00:15:26.020 | outputs of the previous encoder layers.
00:15:29.420 | The decoder does that too, in the sense that it also looks at the previous
00:15:34.380 | layers of the decoder, but it additionally looks at the output of the encoder.
00:15:38.420 | And so it needs an extra multi-head attention layer over the encoder outputs.
00:15:43.740 | And lastly, there's masking as well.
00:15:46.700 | Because every token can look at every other token,
00:15:51.300 | you want to make sure in the decoder that you're not looking into the future.
00:15:55.100 | So if you're at position 3, for instance,
00:15:57.340 | you shouldn't be able to look at position 4 or position 5.
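And the decoder block in the same toy style: masked self-attention over the decoder's own inputs, then cross-attention over the encoder outputs, then the feed-forward layer. Layer norms are omitted here for brevity, and the single-head projections in the dictionary `P` are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V, mask=None):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)            # block attention to masked positions
    return softmax(scores) @ V

def decoder_block(Y, enc_out, P):
    """One decoder block (single head, layer norms omitted): masked self-attention,
    then cross-attention over the encoder outputs, then a feed-forward layer."""
    n = Y.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))        # position i sees only positions <= i
    Y = Y + attention(Y @ P["q1"], Y @ P["k1"], Y @ P["v1"], mask=causal)   # masked self-attention
    Y = Y + attention(Y @ P["q2"], enc_out @ P["k2"], enc_out @ P["v2"])    # cross-attention over encoder
    return Y + np.maximum(0, Y @ P["w1"]) @ P["w2"]      # ReLU feed-forward with a residual

rng = np.random.default_rng(4)
d, d_ff = 8, 16
P = {k: rng.normal(size=(d, d)) * 0.1 for k in ["q1", "k1", "v1", "q2", "k2", "v2"]}
P["w1"], P["w2"] = rng.normal(size=(d, d_ff)) * 0.1, rng.normal(size=(d_ff, d)) * 0.1
print(decoder_block(rng.normal(size=(4, d)), rng.normal(size=(6, d)), P).shape)  # (4, 8)
```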
00:16:00.020 | So those are sort of all the components that led to the creation of the model
00:16:08.060 | in the Vaswani et al paper.
00:16:10.700 | And let's talk a little bit about the advantages and drawbacks of this model.
00:16:16.540 | So the two main advantages, which are huge advantages and which are why
00:16:20.660 | transformers have done such a good job of revolutionizing many,
00:16:25.740 | many fields within deep learning, are as follows.
00:16:29.660 | So the first is there is this constant path length between any two positions
00:16:34.260 | in a sequence because every token in the sequence is looking at every other token.
00:16:39.700 | And this basically solves the problem that Div talked about earlier
00:16:43.060 | with long sequences.
00:16:44.580 | With long sequences, if you're trying to
00:16:47.940 | predict a token that depends on a word that was far, far back in the sentence,
00:16:54.220 | you no longer have the problem of losing that context.
00:16:55.940 | Now, the distance between them is only one in terms of the path length.
00:17:00.700 | Also, because of the nature of the computation that's happening,
00:17:03.900 | transformer models lend themselves really well to parallelization.
00:17:07.380 | And because of the advances that we've had with GPUs, basically,
00:17:10.900 | if you take a transformer model with n parameters and you take a model that
00:17:14.660 | isn't a transformer, say an LSTM, also with n parameters,
00:17:18.340 | training the transformer model is going to be much faster because of the
00:17:22.020 | parallelization that it leverages.
00:17:24.500 | So those are the advantages.
00:17:26.180 | The main disadvantage is basically that self-attention takes quadratic time,
00:17:31.220 | because every token looks at every other token.
00:17:33.620 | Order n squared, as you might know, does not scale.
00:17:36.500 | And there's actually been a lot of work in trying to tackle this.
00:17:40.220 | So we've linked to some here.
00:17:41.500 | Big Bird, Linformer, and Reformer are all approaches to try and
00:17:44.580 | make this linear or quasi-linear, essentially.
00:17:47.580 | And yeah, we highly recommend going through Jay Alammar's blog,
00:17:55.820 | the Illustrated Transformer, which provides great visualizations and
00:17:59.940 | explains everything that we just talked about in great detail.
00:18:02.260 | Yeah, and I'd like to pass it on to Chaitanya for applications of transformers.
00:18:10.460 | So now moving on to some of the recent work, the work that
00:18:14.780 | very shortly followed the Transformers paper.
00:18:18.060 | So one of the models that came out was GPT, the GPT architecture,
00:18:23.740 | which was released by OpenAI.
00:18:24.940 | The latest model that OpenAI has in the GPT series is GPT-3.
00:18:31.260 | So it consists of only the decoder blocks from Transformers and is trained
00:18:35.500 | on a traditional language modeling task,
00:18:40.300 | which is predicting the next token given the last t tokens that the model has seen.
00:18:45.900 | And for any downstream task, now
00:18:49.900 | you can just train a classification layer on the last hidden state,
00:18:53.260 | which can have any number of labels.
00:18:57.660 | And since the model is generative in nature, you can also use the pre-trained
00:19:03.420 | network for generative kinds of tasks, such as summarization
00:19:09.100 | and natural language generation, for instance.
00:19:12.380 | Another important reason GPT-3 gained popularity was its ability to
00:19:18.540 | perform what the authors called in-context learning.
00:19:23.580 | So this is the ability wherein the model can learn, under few-shot settings,
00:19:28.780 | what the task is and complete the task without performing any gradient updates.
00:19:33.820 | For example, let's say the model is shown a bunch of addition examples.
00:19:38.620 | And then if you pass in a new input
00:19:42.220 | and just leave it at the equals sign, the model tries to predict the next token,
00:19:49.020 | which very well comes out to be the sum of the numbers that are shown.
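As a concrete illustration of this few-shot setup, here is a hypothetical prompt of the kind described; the exact numbers and formatting are our own example, not taken from the GPT-3 paper.

```python
# A few-shot "in-context learning" prompt: the worked examples live entirely in
# the input text, and the model is simply asked to continue it. No weights change.
prompt = """12 + 9 = 21
4 + 15 = 19
23 + 8 = 31
17 + 6 ="""
# A model like GPT-3 would be asked to generate the next token(s) after this
# prompt, ideally producing " 23".
```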
00:19:55.660 | Another example can also be the spell correction task or the translation task.
00:20:00.540 | So this was the ability that made GPT-3 so much talked about in the NLP world.
00:20:08.060 | And right now, many applications have been made using GPT-3,
00:20:13.820 | one of them being Copilot in VS Code, which tries to generate a piece of code
00:20:21.820 | given a docstring or similar natural language text.
00:20:26.140 | Another major model that came out that was based on the Transformers architecture was BERT.
00:20:32.220 | So BERT gets its name from an acronym: Bidirectional
00:20:37.020 | Encoder Representations from Transformers.
00:20:38.940 | It consists of only the encoder blocks of the Transformers, which is unlike GPT-3,
00:20:44.620 | which had only the decoder blocks.
00:20:46.620 | Now, because of this change, there comes a problem: since BERT has only the encoder blocks,
00:20:55.100 | it sees the entire piece of text.
00:20:57.260 | So it cannot be pre-trained on a naive language modeling task,
00:21:00.220 | because of the problem of data leakage from the future.
00:21:02.940 | So what the authors came up with was a clever idea.
00:21:06.700 | They came up with a novel task called masked language modeling,
00:21:10.860 | in which certain words are replaced with a placeholder,
00:21:15.180 | and the model then tries to predict those words given the entire context.
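A toy sketch of that masked-language-modeling corruption step is below. The 15% masking rate matches the BERT paper, but this simplified recipe skips BERT's extra rules (sometimes keeping the chosen token or replacing it with a random one).

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Toy masked-language-modeling corruption: hide some tokens and remember the targets.
    (BERT's actual recipe also sometimes keeps or randomly replaces the chosen tokens.)"""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(mask_token)
            targets[i] = tok          # the model is trained to predict this original token
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
print(mask_tokens(tokens, mask_prob=0.3))
```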
00:21:19.980 | Now, apart from this token-level task, the authors also added a second objective
00:21:25.900 | called the next sentence prediction, which was a sentence-level task.
00:21:28.780 | Wherein, given two chunks of text, the model tried to predict
00:21:34.300 | whether the second sentence
00:21:37.740 | followed the first sentence or not.
00:21:39.180 | And now, after pre-training this model, for any downstream task
00:21:43.580 | the model can be further fine-tuned with an additional classification layer,
00:21:49.580 | just like it was in GPT-3.
00:21:54.300 | So these are the two models that have been very popular
00:21:58.940 | and have made their way into a lot of applications.
00:22:02.540 | But the landscape has changed quite a lot since we taught this class.
00:22:07.020 | There are models with different pre-training techniques, like ELECTRA and DeBERTa.
00:22:07.020 | And there are also models that do well in other modalities,
00:22:15.260 | which we are going to be talking about in other lectures in this series as well.
00:22:15.260 | So yeah, that's all from this lecture.
00:22:18.700 | And thank you for tuning in.
00:22:20.140 | - Yeah, just want to end by saying thank you all for watching this.
00:22:25.580 | And we have a really exciting set of videos with truly amazing speakers.
00:22:30.940 | And we hope you are able to derive value from that.
00:22:33.180 | - Okay, thanks a lot.
00:22:35.100 | - Thank you.
00:22:35.820 | - Thank you, everyone.
00:22:37.500 | - Bye.
00:22:38.800 | - Bye.