Stanford CS25: V1 I Transformers United: DL Models that have revolutionized NLP, CV, RL
Chapters
0:00 Introduction
2:43 Overview of Transformers
6:03 Attention mechanisms
7:53 Self-attention
11:38 Other necessary ingredients
13:32 Encoder-Decoder Architecture
16:02 Advantages & Disadvantages
18:04 Applications of Transformers
Welcome to the introductory lecture for CS25, Transformers United. 00:00:13.000 |
So CS25 was a class that the three of us created and taught at Stanford in the fall of 2021. 00:00:19.760 |
The subject of the class is not, as the picture might suggest, 00:00:23.800 |
about robots that can transform into cars. 00:00:27.480 |
It's about deep learning models, and specifically a particular kind of 00:00:31.620 |
deep learning model that has revolutionized multiple fields, 00:00:35.340 |
starting from natural language processing to things like computer vision and reinforcement learning. 00:00:41.480 |
We have an exciting set of videos lined up for you. 00:00:45.880 |
We have some truly fantastic speakers come and 00:00:48.400 |
give talks about how they were applying transformers in their own research. 00:00:53.280 |
And we hope you will enjoy and learn from these talks. 00:00:56.440 |
So this video is purely an introductory lecture to talk a little bit about transformers. 00:01:01.800 |
And before we get started, I'd like to introduce the instructors. 00:01:07.000 |
I am a software engineer at a company called Applied Intuition. 00:01:10.320 |
Before this, I was a master's student in CS at Stanford. 00:01:19.680 |
Chaitanya, Div, if the two of you could introduce yourselves. 00:01:29.520 |
I'm researching a lot in generative modeling and reinforcement learning. 00:01:35.280 |
>> Yeah, that was Div, since he didn't say his name. 00:01:38.400 |
Chaitanya, if you want to introduce yourself. 00:01:43.520 |
I'm currently working as an ML engineer at a startup called Moveworks. 00:01:48.360 |
Before that, I was a master's student at Stanford specializing in NLP, and 00:01:52.360 |
I was a member of Stanford's prize-winning team for the Alexa Prize Challenge. 00:02:04.520 |
Essentially, what we hope you will learn watching these videos, and 00:02:08.840 |
what we hope the people who took our class in the fall of 2021 learned, is three things. 00:02:15.320 |
One is we hope you will have an understanding of how transformers work. 00:02:19.720 |
Secondly, we hope that by the end of these talks, you will 00:02:23.560 |
understand how transformers are being applied beyond just natural language processing. 00:02:29.460 |
And thirdly, we hope that some of these talks will spark some new ideas within you, 00:02:34.360 |
and hopefully lead to new directions of research, 00:02:37.320 |
new kinds of innovation, and things of that sort. 00:02:40.280 |
And to begin, we're going to talk a little bit about transformers and 00:02:48.280 |
introduce some of the context behind transformers as well. 00:02:52.280 |
And for that, I'd like to hand it off to Div. 00:03:04.960 |
So I will start first with an overview of the attention timeline. 00:03:09.960 |
The key idea behind transformers was the self-attention mechanism that was 00:03:14.260 |
developed in 2017, and it all started with this one paper called "Attention Is All You Need." 00:03:19.900 |
Before 2017, we used to have this prehistoric era where we had older models 00:03:24.420 |
like RNNs, LSTMs, and simpler attention mechanisms. 00:03:28.620 |
And eventually, the growth in transformers has exploded into other fields and 00:03:32.580 |
has become prominent in all of machine learning. 00:03:35.540 |
And I'll go on to show how this has been used. 00:03:39.900 |
So in the prehistoric era, there used to be RNNs. 00:03:43.660 |
There were different models, like the sequence-to-sequence, LSTMs, GRUs. 00:03:47.700 |
They were good at encoding some sort of memory, but they did not work for 00:03:51.840 |
encoding long sequences, and they were very bad at encoding context. 00:03:55.220 |
So here is an example: if you have a sentence like "I grew up in France... I speak fluent ___," 00:04:02.380 |
then you want to fill in the blank with "French" based on the context, but 00:04:05.340 |
an LSTM model might not know what it is and might make a very big mistake here. 00:04:09.780 |
Similarly, we can show some sort of correlation map here where if you have 00:04:13.940 |
a pronoun like it, we want it to correlate to one of the past nouns that we have seen 00:04:18.420 |
so far, like animal, but again, older models were really not good at this context encoding. 00:04:26.100 |
So where we are currently now is on the verge of takeoff. 00:04:29.580 |
We're beginning to realize the potential of transformers in different fields. 00:04:32.900 |
We have started to use them to solve long-sequence problems such as 00:04:36.740 |
protein folding, where the AlphaFold model from DeepMind 00:04:42.780 |
gets 95% accuracy on the protein structure prediction challenge, and also in offline RL. 00:04:47.580 |
We can use them for few-shot and zero-shot generalization, and for text and image generation. 00:04:52.100 |
And we can also use this for content generation. 00:04:53.860 |
So here's an example from OpenAI, where you can give a different text prompt and 00:04:58.580 |
have an AI-generated fictional image for you. 00:05:01.260 |
And so there's a talk on this that you can also watch on YouTube, 00:05:06.500 |
which basically says that LSTMs are dead, long live transformers. 00:05:13.340 |
So we can enable a lot more applications for transformers. 00:05:17.660 |
They can be applied to any form of sequence modeling. 00:05:25.740 |
So basically imagine all sorts of generative modeling problems. 00:05:29.180 |
Nevertheless, there are a lot of missing ingredients. 00:05:31.660 |
So, like the human brain, we need some sort of external memory unit. 00:05:40.460 |
So one nice work you might want to check out is called Neural Turing Machines. 00:05:44.340 |
Similarly, the current attention mechanisms are very computationally complex 00:05:49.100 |
in terms of time, and they scale quadratically, which we'll discuss later. 00:05:54.700 |
And the third problem is that we want to align our current sort of language models 00:05:58.500 |
with how the human brain works and human values. 00:06:03.260 |
OK, so now I will dive deeper into the attention mechanisms. 00:06:12.260 |
So initially, they used to be very simple mechanisms. 00:06:17.780 |
Attention was inspired by the process of importance weighting, 00:06:21.220 |
or putting attention on different parts of an image, 00:06:24.420 |
similar to how a human might focus more on the foreground 00:06:28.380 |
if you have an image of a dog, compared to the rest of the background. 00:06:31.060 |
So in the case of soft attention, what you do is you learn the simple 00:06:34.340 |
soft attention weighting for each pixel, which can be a weight between 0 and 1. 00:06:39.100 |
The problem over here is that this is a very expensive computation, 00:06:42.380 |
because, as shown in the figure on the left, 00:06:46.540 |
we are calculating this attention map for the whole image. 00:06:48.740 |
What you can do instead is calculate a hard 0-or-1 attention map, 00:06:55.500 |
where we directly put a 1 wherever the dog is and a 0 wherever it's background, 00:07:03.260 |
but the problem is that this is not differentiable and makes things harder to train. 00:07:06.140 |
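To make the contrast concrete, here is a minimal NumPy sketch, with made-up scores standing in for the learned per-region relevance:

```python
# A minimal sketch of soft vs. hard attention; the scores are made-up
# stand-ins for learned per-region relevance.
import numpy as np

scores = np.array([2.0, 0.5, -1.0])   # relevance of three image regions

# Soft attention: differentiable weights in (0, 1) that sum to 1.
soft = np.exp(scores) / np.exp(scores).sum()

# Hard attention: a 0/1 one-hot selection of the single best region.
# The argmax step is what breaks differentiability.
hard = np.zeros_like(scores)
hard[np.argmax(scores)] = 1.0

print(soft)  # ~[0.79, 0.18, 0.04] -> every region gets some weight
print(hard)  # [1. 0. 0.] -> all-or-nothing
```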
Going forward, we also have different varieties of basic attention mechanisms 00:07:10.980 |
that were proposed before self-attention. 00:07:14.140 |
So the first variety here is global attention models. 00:07:17.500 |
So in global attention models, for each hidden-state output, 00:07:26.300 |
you calculate attention weights over every input in the sequence, and these are
element-wise multiplied with your current output to calculate your final output. 00:07:35.340 |
The second variety is local attention models,
where instead of calculating the global attention over the whole sequence length, 00:07:39.980 |
you only calculate the attention over a small window, 00:07:43.500 |
and then you weight the window's attention into the current output to get the final output. 00:07:50.340 |
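As a rough sketch of the difference (a single current output attending over hidden states, with illustrative shapes and random values):

```python
# A rough sketch of global vs. local attention weighting, assuming one
# current output (the query) attending over a sequence of hidden states.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
seq_len, d = 10, 8
hiddens = rng.normal(size=(seq_len, d))  # hidden states for the sequence
query = rng.normal(size=(d,))            # current output

# Global attention: scores over the entire sequence length.
global_w = softmax(hiddens @ query)
context_global = global_w @ hiddens

# Local attention: scores only over a small window around position p.
p, window = 5, 2
lo, hi = max(0, p - window), min(seq_len, p + window + 1)
local_w = softmax(hiddens[lo:hi] @ query)
context_local = local_w @ hiddens[lo:hi]
```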
So moving on, I'll pass it on to Chaitanya to discuss self-attention mechanisms. 00:07:58.580 |
>> Yeah, thank you, Div, for covering a brief overview of how the primitive attention mechanisms worked. 00:08:05.340 |
Now, just before we talk about self-attention, just a bit of a trivia that 00:08:10.060 |
this term was first introduced by a paper from Lin et al., 00:08:14.060 |
which provided a framework for a self-attentive mechanism for sentence embeddings. 00:08:22.540 |
And now moving on to the main crux of the transformers paper: self-attention. 00:08:28.380 |
So self-attention is the basic building block 00:08:33.300 |
that makes the transformer models work so well, and to understand it, 00:08:42.140 |
we can break down self-attention as a search and retrieval problem. 00:08:46.980 |
So the problem is that, given a query Q, we need to find a set of keys K 00:08:53.260 |
which are most similar to Q, and return the corresponding values V. 00:08:58.220 |
Now, these three vectors can be drawn from the same source. 00:09:01.060 |
For example, we can have Q, K, and V all equal to a single input vector X. 00:09:08.580 |
In transformers, these vectors are instead obtained by applying different linear 00:09:14.460 |
transformations to X, so as to enable the model to capture more complex interactions between 00:09:18.780 |
the different tokens at different places in the sentence. 00:09:22.860 |
Now, attention is computed as a weighted summation of the value vectors, 00:09:27.540 |
where the weights come from the similarities between the query and key vectors. 00:09:33.260 |
And in the transformers paper, they use the scaled dot product as the similarity function. 00:09:40.260 |
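For reference, the scaled dot-product attention from the paper can be written as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key vectors, and dividing by $\sqrt{d_k}$ keeps the dot products from growing so large that the softmax saturates.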
And another important aspect of the transformers was the introduction of multi-head self-attention. 00:09:46.420 |
So what multi-head self-attention means is that, at every layer, 00:09:50.700 |
the self-attention is performed multiple times in parallel, 00:09:54.260 |
which enables the model to learn multiple representation subspaces. 00:09:58.660 |
So in a way, you can think of it that each head has a power to look at 00:10:05.820 |
different things and to learn different semantics. 00:10:08.580 |
For example, one head can be learning to predict the part of speech of each token, 00:10:15.220 |
one head might be learning the syntactic structure of the sentence, 00:10:19.660 |
and all the other things that are needed to understand what the sentence means. 00:10:28.740 |
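In the paper's notation, the heads are computed in parallel from separately projected queries, keys, and values, then concatenated and projected back down:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)$$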
Now, to better understand how self-attention works and what the 00:10:32.020 |
different computations are, there is a short video. 00:10:35.140 |
So as you can see, there are three incoming tokens. 00:10:43.220 |
We apply linear transformations to get the key and value vectors for each input, 00:10:49.100 |
and then once a query Q comes, we calculate its similarity with the 00:10:53.220 |
respective key vectors, and then multiply those scores with the value 00:10:58.900 |
vector, and then add them all up to get the output. 00:11:02.740 |
The same computation is then performed on all the tokens, and we get the outputs. 00:11:10.380 |
So as you can see here, the final output of the self-attention layer is shown in the darker color. 00:11:17.660 |
So now again, for the final token, we perform everything the same way. 00:11:22.820 |
We get the similarity scores, and then those similarity scores weigh the value 00:11:26.820 |
vectors, and then we finally perform the addition to get the self-attention output. 00:11:39.220 |
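The steps narrated above can be sketched in a few lines of NumPy; this is a simplified single-head version with made-up dimensions and random weights, not a full transformer layer:

```python
# A minimal sketch of single-head self-attention, mirroring the steps above.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 3, 8, 4           # three incoming tokens

X = rng.normal(size=(seq_len, d_model))   # token embeddings
W_q = rng.normal(size=(d_model, d_k))     # learned linear transformations
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v       # queries, keys, values

scores = Q @ K.T / np.sqrt(d_k)           # scaled dot-product similarities
weights = softmax(scores)                 # each token's attention over all keys
output = weights @ V                      # weighted sum of the value vectors

print(output.shape)                       # (3, 4): one output per token
```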
Apart from self-attention, there are some other necessary ingredients that make the transformers work so well. 00:11:46.540 |
One important aspect is the presence of positional representations, or positional embeddings. 00:11:51.740 |
So one reason RNNs worked very well was that they process the 00:11:58.020 |
information in a sequential ordering, so there was this notion of ordering, 00:12:03.220 |
which is also very important in understanding language, because we all 00:12:06.980 |
know that we read any piece of text from left to right in most languages. 00:12:17.220 |
So there is a notion of ordering, which is lost in kind of self-attention 00:12:20.740 |
because every word is attending to every other word. 00:12:24.060 |
That's why this paper introduced a separate embedding layer for introducing positional information. 00:12:30.980 |
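For reference, the original paper's fixed sinusoidal encodings assign each position $pos$ and embedding dimension index $i$ the values:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)$$

These are added to the token embeddings so the model can recover word order; learned positional embeddings, which the paper also tried, play the same role.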
The second important aspect is having nonlinearities. 00:12:34.300 |
So if you think of all the computation that is happening in the self-attention 00:12:38.100 |
layer, it's all linear because it's all matrix multiplication. 00:12:41.220 |
But as we all know, deep learning models work well when they are able to 00:12:47.700 |
learn more complex mappings between input and output, which can be attained with nonlinear activations in the feed-forward layers. 00:12:54.220 |
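Concretely, the position-wise feed-forward network in the paper is a two-layer MLP with a ReLU in between, which is what supplies the nonlinearity:

$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$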
And the third important component of the transformers is the masking. 00:12:59.020 |
So masking is what allows us to parallelize the operations. 00:13:03.020 |
Since every word can attend to every other word, in the decoder part of the 00:13:07.220 |
transformers, which Advay is going to be talking about later, the problem 00:13:11.380 |
comes that you don't want the decoder to look into the future, because that can 00:13:18.340 |
leak future information. So that's why masking helps the decoder avoid that future information and learn only from the past. 00:13:29.580 |
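A minimal sketch of what this causal masking looks like in NumPy: positions above the diagonal (the future) are set to $-\infty$ before the softmax, so they receive exactly zero attention weight:

```python
# A minimal sketch of causal (decoder) masking with made-up scores.
import numpy as np

seq_len = 4
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))

# True above the diagonal = "future" positions each token must not see.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[future] = -np.inf            # softmax(-inf) -> weight of exactly 0

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))         # row t only attends to positions <= t
```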
So now on to the encoder-decoder architecture of the transformers. 00:13:36.580 |
- Yeah, thanks, Chaitanya, for talking about self-attention. 00:13:39.620 |
So self-attention is sort of the key ingredient, or one of the key ingredients, of the transformer. 00:13:47.540 |
But at a very high level, the model that was proposed in the Vaswani et al. 00:13:51.940 |
paper of 2017 was like previous language models in the sense that it had an encoder-decoder architecture. 00:13:59.540 |
What that means is, let's say you're working on a translation problem. 00:14:04.940 |
The way that would work is you would read in the entire input of your English 00:14:09.340 |
sentence, you would encode that input, so that's the encoder part of the network. 00:14:13.780 |
And then you would generate token by token the corresponding French translation. 00:14:18.420 |
And the decoder is the part of the network that is responsible for generating the French tokens. 00:14:24.420 |
So you can think of these encoder blocks and decoder blocks as essentially building blocks: 00:14:30.980 |
they have these sub-components that make them up. 00:14:34.580 |
And in particular, the encoder block has three main sub-components. 00:14:38.540 |
The first is a self-attention layer that Saithanya talked about earlier. 00:14:43.180 |
And as talked about earlier as well, you need a feed-forward layer after that 00:14:48.500 |
because the self-attention layer only performs linear operations. 00:14:52.140 |
And so you need something that can capture the non-linearities. 00:14:58.060 |
And lastly, there are residual connections around each of these layers. 00:15:02.740 |
The decoder is very similar to the encoder, but there's one difference, 00:15:06.340 |
which is that it has this extra layer because the decoder doesn't just do 00:15:10.100 |
multi-head attention on the output of the previous layers. 00:15:13.900 |
So for context, the encoder does multi-head attention for each self-attention layer. 00:15:21.300 |
In each of the encoder blocks, it does multi-head attention looking only at the outputs of the previous encoder layer. 00:15:29.420 |
The decoder does that too, in the sense that it also looks at the previous 00:15:34.380 |
layers of the decoder, but in addition it looks at the output of the encoder. 00:15:38.420 |
And so it needs an extra multi-head attention layer over the encoder outputs. 00:15:46.700 |
So, because every token can look at every other token, 00:15:51.300 |
you want to make sure in the decoder that you're not looking into the future: 00:15:57.340 |
if you're at position 3, you shouldn't be able to look at position 4 and position 5. 00:16:00.020 |
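Putting these pieces together, here is a schematic NumPy sketch of how the sub-components stack inside each block; it is single-head with stand-in helpers (the names and the simplified feed-forward are ours, not the paper's code):

```python
# A schematic sketch of the encoder/decoder block structure described above.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q_in, kv_in, mask=None, d_k=8):
    # A real model uses learned Q/K/V projections; identity here for brevity.
    scores = q_in @ kv_in.T / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ kv_in

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def feed_forward(x):
    return np.maximum(0, x)          # stand-in for the two-layer MLP

def encoder_block(x):
    x = layer_norm(x + attention(x, x))               # self-attention + residual
    return layer_norm(x + feed_forward(x))            # nonlinearity + residual

def decoder_block(y, enc_out, causal_mask):
    y = layer_norm(y + attention(y, y, causal_mask))  # masked self-attention
    y = layer_norm(y + attention(y, enc_out))         # extra layer: attend to encoder
    return layer_norm(y + feed_forward(y))
```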
So those are sort of all the components that led to the creation of the transformer model. 00:16:10.700 |
And let's talk a little bit about the advantages and drawbacks of this model. 00:16:16.540 |
So the two main advantages, which are huge advantages and which are why 00:16:20.660 |
transformers have done such a good job of revolutionizing many, 00:16:25.740 |
many fields within deep learning, are as follows. 00:16:29.660 |
So the first is there is this constant path length between any two positions 00:16:34.260 |
in a sequence because every token in the sequence is looking at every other token. 00:16:39.700 |
And this basically solves the problem that Div talked about earlier. 00:16:44.580 |
You no longer have the problem with long sequences, where if you're trying to 00:16:47.940 |
predict a token that depends on a word that was far, far back in the sentence, 00:16:54.220 |
you would lose that context. 00:16:55.940 |
Now, the distance between them is only one in terms of the path length. 00:17:00.700 |
Also, because of the nature of the computation that's happening, 00:17:03.900 |
transformer models lend themselves really well to parallelization. 00:17:07.380 |
And because of the advances that we've had with GPUs, basically, 00:17:10.900 |
if you take a transformer model with n parameters and you take a model that 00:17:14.660 |
isn't a transformer, say like an LSTM, also with n parameters, 00:17:18.340 |
training the transformer model is going to be much faster because of the parallelization. 00:17:26.180 |
The disadvantages are basically self-attention takes quadratic time 00:17:31.220 |
because every token looks at every other token. 00:17:33.620 |
Order n squared, as you might know, does not scale well. 00:17:36.500 |
And there's actually been a lot of work in trying to tackle this. 00:17:41.500 |
Big Bird, Linformer, and Reformer are all approaches to try and 00:17:44.580 |
make this linear or quasi-linear, essentially. 00:17:47.580 |
And yeah, we highly recommend going through Jay Alammar's blog, 00:17:55.820 |
the Illustrated Transformer, which provides great visualizations and 00:17:59.940 |
explains everything that we just talked about in great detail. 00:18:02.260 |
Yeah, and I'd like to pass it on to Chaitanya for applications of transformers. 00:18:10.460 |
So now moving on to like some of the recent work, some of the work that 00:18:14.780 |
very shortly followed the Transformers paper. 00:18:18.060 |
So one of the models that came out was GPT, the GPT architecture from OpenAI; 00:18:24.940 |
the latest model that OpenAI has in the GPT series is GPT-3. 00:18:31.260 |
So it consists of only the decoder blocks from Transformers and is trained 00:18:35.500 |
on a traditional language modeling task, 00:18:40.300 |
which is predicting the next token given the last T tokens that the model has seen. 00:18:45.900 |
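Written out, this next-token objective is just minimizing the negative log-likelihood of each token given the window of tokens before it:

$$\mathcal{L} = -\sum_{t} \log p_\theta\!\left(x_t \mid x_{t-T}, \ldots, x_{t-1}\right)$$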
And for any downstream task, 00:18:49.900 |
you can just train a classification layer on the last hidden state. 00:18:57.660 |
And since the model is generative in nature, you can also use the pre-trained 00:19:03.420 |
network for generative kinds of tasks, 00:19:09.100 |
such as summarization and natural language generation. 00:19:12.380 |
Another important reason GPT-3 gained popularity was its ability to 00:19:18.540 |
perform what the authors called in-context learning. 00:19:23.580 |
This is the ability wherein the model can figure out, in a few-shot setting, 00:19:28.780 |
what the task is and complete it without performing any gradient updates. 00:19:33.820 |
For example, let's say the model is shown a bunch of addition examples. 00:19:38.620 |
And then if you pass in a new input 00:19:42.220 |
and just end it at the equals sign, the model tries to predict the next token, 00:19:49.020 |
which very well comes out to be the sum of the numbers shown. 00:19:55.660 |
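A hypothetical few-shot prompt of this kind might look like the following, where the model is expected to continue with "27":

```
12 + 9 = 21
15 + 8 = 23
23 + 4 =
```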
Other examples can be the spell-correction task or the translation task. 00:20:00.540 |
So this was the ability that made GPT-3 so much talked about in the NLP world. 00:20:08.060 |
And right now also, many applications have been made using GPT-3, 00:20:13.820 |
one of them being GitHub Copilot in VS Code, which tries to generate a piece of code 00:20:21.820 |
given a docstring or a natural-language description. 00:20:26.140 |
Another major model that came out that was based on the Transformers architecture was BERT. 00:20:32.220 |
So BERT lends its name from being an acronym for Bidirectional Encoder Representations from Transformers. 00:20:38.940 |
It consists of only the encoder blocks of the Transformers, unlike GPT-3, which uses only the decoder blocks. 00:20:46.620 |
Now, because of this change, there comes a problem because BERT has only the encoder blocks. 00:20:57.260 |
It cannot be pre-trained on a naive language modeling task 00:21:00.220 |
because of the problem of data leakage from the future. 00:21:02.940 |
So what the authors came up with was a clever idea. 00:21:06.700 |
And they came up with a novel task called masked language modeling, 00:21:10.860 |
which involves replacing certain words with a [MASK] placeholder token. 00:21:15.180 |
And then the model tries to predict those words given the entire context; for example, given "the [MASK] sat on the mat," the model should predict "cat." 00:21:19.980 |
Now, apart from this token-level task, the authors also added a second objective 00:21:25.900 |
called the next sentence prediction, which was a sentence-level task. 00:21:28.780 |
Wherein, given two chunks of text, the model tried to predict 00:21:34.300 |
whether the second sentence followed the first sentence or not in the original text. 00:21:39.180 |
And now, after pre-training this model for any downstream task, 00:21:43.580 |
the model can be further fine-tuned with an additional classification layer. 00:21:49.580 |
So these are the two models that have been very popular 00:21:54.300 |
and have made their way into a lot of applications. 00:21:58.940 |
But the landscape has changed quite a lot since we taught this class. 00:22:02.540 |
There are models with different pre-training techniques, like ELECTRA and DeBERTa. 00:22:07.020 |
And there are also models that do well in other modalities, 00:22:12.140 |
and which we are going to be talking about in other lecture series as well. 00:22:20.140 |
- Yeah, just want to end by saying thank you all for watching this. 00:22:25.580 |
And we have a really exciting set of videos with truly amazing speakers. 00:22:30.940 |
And we hope you are able to derive value from them. 00:22:30.940 |