A Comprehensive Overview of Large Language Models - Latent Space Paper Club
00:00:02.640 |
So, hey guys, thanks so much for coming by the paper club. 00:00:06.000 |
As usual, this is a paper club we run in Asia, 00:00:11.800 |
So today we're just recording it for the first time, 00:00:22.880 |
You can drop in the chat, which you can access 00:00:37.080 |
So today, we'll be going through the comprehensive overview 00:00:43.920 |
But on top of that, I think what we wanna do also 00:00:46.720 |
is just to share the reason why attention actually came about 00:00:54.680 |
So we'll have a little bit of a history lesson on that, 00:01:03.440 |
talking about what has happened post the Transformers era. 00:01:17.800 |
So I'll use the first part to talk about pre, 00:01:20.580 |
I would say GPT, and then I'll use the second link 00:01:28.880 |
So essentially, what models have been trying to do recently 00:01:42.560 |
you want to find out the next word in the prompt. 00:01:45.320 |
In this case, it can be question and answers. 00:02:01.220 |
given the sequence over here up to time equals to T, 00:02:24.000 |
beyond just thinking about looking at what the sequence is, 00:02:29.660 |
it's good to think about what kind of use case 00:02:34.920 |
when it comes to thinking about the evaluation metrics 00:02:42.000 |
- Your screen just kind of like cut out for you. 00:03:11.580 |
when we are using the different models, right? 00:03:21.660 |
it's also useful to think from a linguistic perspective 00:03:44.980 |
because if you output something that's false, 00:03:48.060 |
then your language model is probably not truthful. 00:03:51.480 |
Things like sentiment, which we have seen before. 00:03:58.460 |
So in this case, if you look at the sentence, 00:04:03.860 |
Standing next to Iroh, Zuko pondered his destiny. 00:04:17.900 |
In this case, Zuko is currently in the kitchen, 00:04:30.260 |
we observe models are learning in terms of patterns. 00:05:08.060 |
but also we want to condition it on the source sentence. 00:05:13.620 |
So that is essentially what translation does. 00:05:33.100 |
And one of the key things that we will notice 00:05:48.340 |
corresponds to the first word in the target sentence. 00:05:59.260 |
crisscross relationship where you might need to, 00:06:06.180 |
and the third word over here corresponds to the second. 00:06:11.980 |
we want to find a way to be able to model this relationship. 00:06:17.940 |
And this relationship has actually been studied before 00:06:35.380 |
and the target sentence on the bottom, on the left, 00:06:40.100 |
then if we've got this very linear one-to-one relationship, 00:06:46.860 |
then we will see that there will be a white box over here 00:06:53.220 |
indicating that the first word corresponds to the first word, 00:07:00.300 |
But as you can see, just from English to French, 00:07:03.980 |
there is this idea where words that are later in the sequence 00:07:08.980 |
correspond to words that are earlier, and vice versa. 00:07:29.340 |
So naturally, when we look at the encoder-decoder blocks, 00:07:44.100 |
contains all the information of the entire sentence, 00:07:48.300 |
but there's this information bottleneck problem, 00:07:51.220 |
which means that if let's say this is a longer sentence, 00:07:55.260 |
the last hidden state might not contain information 00:08:00.620 |
And therefore, there's this idea of attention 00:08:11.740 |
the decoder during the language generation component 00:08:49.780 |
that has been implemented in the encoder-decoder 00:08:54.380 |
kind of paradigm or the kind of architecture. 00:09:05.100 |
or we calculate these individual hidden states, 00:09:08.060 |
we realize that it has to be calculated sequentially. 00:09:19.060 |
after the first hidden state is being output. 00:09:22.140 |
And the third hidden state can only be calculated 00:09:24.980 |
after the second hidden state has been output. 00:09:33.940 |
where there is a dependency of the previous state? 00:10:01.700 |
one of the building blocks of the transformer architecture. 00:10:39.860 |
just adding a feed-forward layer on top of it. 00:10:43.900 |
if you're just calculating the key-query-value attention, it's still linear, 00:10:52.220 |
because you're just getting a weighted sum of the values 00:10:56.900 |
So we want to add a layer of non-linearity to it, 00:11:00.220 |
which is taken care of by the feed-forward network. 00:11:10.180 |
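A minimal numpy sketch of that point (not code from the talk, and all the sizes are toy values I've made up): attention on its own is just a weighted sum of the value vectors, so the position-wise feed-forward layer is what supplies the non-linearity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: the output is a weighted sum of the values."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (seq_len, seq_len) similarity scores
    return softmax(scores) @ V                   # weighted sum of value vectors

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise feed-forward layer: this is where the non-linearity comes in."""
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2   # ReLU between two linear maps

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32                # toy sizes, picked arbitrarily
X = rng.normal(size=(seq_len, d_model))          # one embedding vector per token
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = feed_forward(self_attention(X, Wq, Wk, Wv), W1, b1, W2, b2)
print(out.shape)                                 # (4, 8): one vector per position
```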
you don't want to let the model see the future tokens, 00:11:14.260 |
and essentially that's when masking comes into play, 00:11:20.620 |
in the decoder architecture later down the road. 00:11:24.660 |
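And a rough illustration, again with made-up numbers, of how that masking is typically done: future positions get minus infinity added to their scores, so the softmax gives them zero weight.

```python
import numpy as np

seq_len = 5
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))

# Causal mask: position i may only attend to positions <= i.
# Entries above the diagonal are -inf, everything else is 0.
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
masked = scores + mask                            # applied before the softmax

weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))                       # upper triangle is all zeros: no peeking ahead
```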
So a couple of things on top of what we are talking about 00:11:30.940 |
in terms of the language modeling component for transformers. 00:11:38.300 |
So this is when you have things like tokenization, 00:11:42.620 |
So essentially, what are we trying to solve over here? 00:11:54.100 |
that can be things like a variation of an existing word, 00:11:57.740 |
in this case, you add many A's in between the word, 00:12:40.500 |
And essentially what goes on with byte-pair encoding 00:12:51.740 |
represent either prefixes or suffixes of a word, 00:13:00.300 |
So if you see over here, you've got this T-A-A, 00:13:05.780 |
and A-A-A, and anything after that, and S-T-Y. 00:13:17.820 |
and therefore we are able to represent it over here. 00:13:28.300 |
So essentially that's the idea of sub-word models, 00:13:35.660 |
byte-pair encoding, sentence piece, word piece, 00:13:38.460 |
That's the problem that they're trying to solve. 00:13:53.620 |
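Here's a toy sketch of the merge step behind byte-pair encoding, not a production tokenizer; the little corpus and its frequencies are invented. The loop is: count the most frequent adjacent pair of symbols, merge it into one symbol, repeat.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (split into characters) -> frequency.
words = {tuple("tasty"): 5, tuple("taaasty"): 2, tuple("toast"): 3}
for _ in range(5):
    pair = most_frequent_pair(words)
    print("merging", pair)
    words = merge_pair(words, pair)
print(list(words))   # words are now sequences of learned sub-word units
```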
is essentially replacing the recurrent neural network blocks 00:13:58.740 |
So when we talk about recurrent neural networks, 00:14:10.740 |
are the three types of dominant architectures. 00:14:15.660 |
and examples of this would be things like BERT, 00:14:23.540 |
Encoder-decoder models, where we've seen earlier, 00:14:26.020 |
we have an encoder that maps your sequence into 00:14:36.020 |
sampling, or your autoregressive sampling of tokens 00:14:46.180 |
things like GPT-2, GPT-3, they are all there. 00:14:48.380 |
So you essentially learn the patterns of the language, 00:14:52.660 |
and then you directly just do your autoregressive 00:15:36.700 |
or for me, what I did was I tried to understand 00:15:39.140 |
what was the framework that the authors were using 00:15:46.820 |
then dividing it, and then giving us a reader 00:16:03.980 |
pick-your-own-adventure, pick-your-own-journey, 00:16:13.540 |
so that along the way, you'll be able to build 00:16:18.180 |
that foundational knowledge and then add layers on it, 00:16:22.300 |
At the end of the day, we all know that new models 00:16:24.540 |
are always developed and new models are always announced. 00:16:31.700 |
So, let's just go through the paper very quickly. 00:16:42.700 |
where we are seeing that large language models, 00:16:58.180 |
we saw that the performance of T5 on downstream tasks, 00:17:04.780 |
it can be your GLUE task, it can be your SQuAD task, 00:17:16.660 |
And you've seen, there are multiple experiments 00:17:21.180 |
that that's the better way, that's the better alternative. 00:17:27.900 |
they are able to perform zero-shot transfer learning 00:17:35.900 |
from the downstream task, GPT-3 is able to give the answer. 00:17:42.940 |
where we actually might not need to fine-tune 00:17:56.180 |
On top of it, they were able to show things like reasoning, 00:18:03.460 |
they were able to show things like in-context learning. 00:18:09.060 |
when you do things like chain of thought prompting. 00:18:15.420 |
given certain patterns, when they ask for a question 00:18:21.540 |
or ask for a task that follows a similar pattern 00:18:43.700 |
Can we look at things like better architectures? 00:18:45.340 |
Can we look at things like more efficient ways 00:18:52.540 |
Are there ways that we can represent these factors 00:19:02.740 |
So, that's essentially what things like architectures 00:19:07.140 |
come into play, quantization comes into play. 00:19:23.940 |
The datasets that have been used to train them, 00:19:29.420 |
What kind of evaluation tasks are they looking at? 00:19:58.860 |
we have covered some of these topics from the paper. 00:20:35.020 |
Naturally, things like masked language modeling 00:20:36.780 |
are things that we see in your encoder-only models. 00:21:01.500 |
so essentially, it's like a fill-in-the-blank 00:21:16.380 |
into this thing called prefix language modeling, 00:21:18.660 |
where you feed the model one part of the sequence, 00:21:25.740 |
to generate the remaining parts of the sequence. 00:21:33.940 |
is that when they do prefix language modeling, 00:21:37.940 |
they use this thing called a causal mask with prefix, 00:21:57.740 |
you still have that element of mask attention. 00:22:22.460 |
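A rough sketch of what such a causal-mask-with-prefix can look like, assuming 1 means "may attend" and 0 means "masked out": full bidirectional attention within the prefix, ordinary causal attention for the part being generated.

```python
import numpy as np

def prefix_lm_mask(seq_len, prefix_len):
    """1 = allowed to attend, 0 = masked.

    Bidirectional attention within the prefix, causal attention afterwards.
    """
    causal = np.tril(np.ones((seq_len, seq_len)))   # ordinary causal mask
    causal[:, :prefix_len] = 1                      # every position may see the whole prefix
    return causal

print(prefix_lm_mask(seq_len=6, prefix_len=3).astype(int))
```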
and you divide it by the standard deviation of the weights. 00:22:26.780 |
is that we're trying to achieve numerical stability 00:22:30.300 |
of the weights so that when you do a forward pass 00:22:36.180 |
you don't have numbers that go all over the place. 00:22:41.180 |
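A minimal sketch of that normalization, applied to each token's activation vector and including the usual learnable gain and bias; the values below are made up.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's activations: subtract the mean, divide by the std."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta   # learnable scale and shift

x = np.array([[1.0, 2.0, 3.0, 400.0]])              # one token with a wild activation
print(layer_norm(x, gamma=np.ones(4), beta=np.zeros(4)))   # values pulled back to a sane range
```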
Positional encoding was something we talked about earlier. 00:22:47.100 |
they had this idea of sinusoidal position representations. 00:23:06.620 |
as the index of the sequence increases, 00:23:21.300 |
It's augmented by a positional representation. 00:23:32.700 |
of encoding positional representations is not learnable 00:23:45.620 |
So therefore, it has been changed to something as simple as a learned positional embedding. 00:24:06.940 |
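For reference, a small sketch of the original sinusoidal scheme next to the learned alternative; the sizes are arbitrary.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # one row per position, added to the token embeddings

# The learned alternative is simply an embedding table of shape (max_len, d_model)
# indexed by position, whose rows are trained like any other parameter.
```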
or ways that can help with training or implementation. 00:24:11.900 |
So things like the libraries that we're using, 00:24:30.380 |
So amongst others, there's this idea of data parallelism 00:24:43.780 |
and then I run separate batches on top of them. 00:24:46.180 |
So let's say I've got a batch of, I don't know, 100,000. 00:24:57.980 |
Then the other 50,000 goes into the same model on the second GPU, 00:25:10.020 |
that you calculate the matrix multiplication steps 00:25:26.700 |
the multiplication with a column can be done concurrently 00:25:30.620 |
and therefore it splits it up such that the first, 00:25:39.300 |
the matrix on the right multiplies with the second column, 00:25:42.660 |
Or in this case, you concatenate the results together. 00:25:45.340 |
So that again also helps us with getting the results faster. 00:26:01.340 |
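A toy illustration of that column-wise split, with numpy arrays standing in for the two GPUs; the shapes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))        # activations
W = rng.normal(size=(8, 6))        # weight matrix to be split column-wise

W0, W1 = W[:, :3], W[:, 3:]        # "GPU 0" holds the first 3 columns, "GPU 1" the rest
Y0 = X @ W0                        # the two partial products are independent,
Y1 = X @ W1                        # so they could run concurrently on different devices
Y = np.concatenate([Y0, Y1], axis=1)

print(np.allclose(Y, X @ W))       # True: concatenating recovers the full result
```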
where it's a very smart way of utilizing memory. 00:26:06.140 |
So what happens is that instead of calculating, 00:26:10.300 |
instead of a series of steps that is very memory intensive 00:26:19.260 |
perform the softmax and then get your results, 00:26:21.980 |
they are doing some way of, they are iterating it 00:26:27.580 |
to calculate things like the softmax on the fly. 00:26:31.580 |
So essentially that's what they're doing over here. 00:26:33.300 |
So it's an optimization of using your high-bandwidth memory 00:26:41.380 |
Because in your GPUs, you've got very fast computation 00:26:56.300 |
as they go into things like your Mamba models. 00:27:04.980 |
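A rough sketch of that "softmax on the fly" trick, the online softmax that FlashAttention builds on: process the scores in chunks while keeping a running max and a running sum, instead of materializing the whole row at once. The real kernel also tiles the keys and values through fast on-chip SRAM and folds the second pass into the output accumulation; this only shows the numerics.

```python
import numpy as np

def online_softmax(scores, chunk=4):
    """Compute softmax over one row in chunks, keeping a running max and running sum."""
    m, s = -np.inf, 0.0                       # running max and running normalizer
    for start in range(0, len(scores), chunk):
        block = scores[start:start + chunk]
        m_new = max(m, block.max())
        s = s * np.exp(m - m_new) + np.exp(block - m_new).sum()  # rescale the old sum
        m = m_new
    # Second pass just emits the normalized weights for this demo.
    return np.exp(scores - m) / s

scores = np.random.default_rng(0).normal(size=10)
reference = np.exp(scores - scores.max())
reference /= reference.sum()
print(np.allclose(online_softmax(scores), reference))   # True
```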
So the second part in terms of the background 00:27:08.340 |
will be how do we adapt these models for specific tasks? 00:27:23.740 |
There's also things like instruction fine tuning 00:27:31.740 |
and then the model will fine tune its outputs based on that. 00:27:38.780 |
if let's say I ask GPT to explain the moon landing 00:27:48.340 |
there is this way where GPT outputs the steps in this way. 00:28:02.980 |
So that's how GPT-3 will output its sentences 00:28:05.780 |
but if we're able to do some sort of instruction fine tuning 00:28:13.660 |
then this is the kind of outputs that you can get. 00:28:17.740 |
And so that's the kind of variations of different models 00:28:22.380 |
that we can see when we download them from open source, 00:28:25.940 |
I say repositories, things like Hugging Face. 00:28:35.420 |
where you want to ensure that your model fulfills 00:28:40.420 |
what people call the three H's of model behavior. 00:28:47.940 |
your models will be honest and your models are helpful. 00:28:50.580 |
So things like harmlessness will be things like, 00:28:56.380 |
how can I let's say bake a cake with cyanide? 00:29:00.980 |
If let's say your model is not alignment tuned, 00:29:07.420 |
but let's say if you do alignment fine tuning 00:29:15.660 |
then the model will learn accordingly from that. 00:29:35.300 |
where essentially for each of the different outputs, 00:29:42.740 |
In this case, the reward is just a scalar value 00:29:56.300 |
based on this policy, you get to maximize the reward. 00:30:09.540 |
from the model outputs and you get some reward, 00:30:26.340 |
So that's essentially what reinforcement learning is. 00:30:28.940 |
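As a very stripped-down illustration of "adjust the policy so higher-reward outputs become more likely", here is a toy REINFORCE-style update over three candidate responses; real RLHF uses a learned reward model and PPO with a KL penalty, and every number here is invented.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(3)                         # toy "policy" over 3 candidate responses
rewards = np.array([0.1, 0.9, 0.3])          # scalar reward per response (made up)

for _ in range(200):
    probs = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(3, p=probs)               # sample a response from the policy
    baseline = probs @ rewards               # simple baseline to reduce variance
    grad = -probs
    grad[a] += 1.0                           # gradient of log pi(a) w.r.t. the logits
    logits += 0.5 * (rewards[a] - baseline) * grad   # REINFORCE-style update

print(np.round(np.exp(logits) / np.exp(logits).sum(), 2))  # mass shifts to the high-reward response
```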
So typically, I think back when reinforcement learning 00:30:37.780 |
was a hot thing, it was one course by itself. 00:30:50.380 |
you are more familiar with is things like prompting. 00:30:56.980 |
you just give a task and the model answers directly, 00:30:59.780 |
but also you have things like chain of thought prompting 00:31:03.380 |
where you give the model some examples before 00:31:14.140 |
So that's essentially what you have over here. 00:31:51.820 |
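For example, a few-shot chain-of-thought prompt might look roughly like this; the worked examples are invented.

```python
prompt = """Q: A shop has 3 boxes with 4 pens each. How many pens are there?
A: Each box has 4 pens and there are 3 boxes, so 3 x 4 = 12. The answer is 12.

Q: Sam read 5 pages a day for 6 days. How many pages did he read?
A: He read 5 pages per day for 6 days, so 5 x 6 = 30. The answer is 30.

Q: A train has 8 carriages with 20 seats each. How many seats are there?
A:"""
# The worked examples nudge the model to spell out its reasoning before answering,
# which is the essence of few-shot chain-of-thought prompting.
print(prompt)
```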
such that you'll be able to get the results that you want? 00:31:57.220 |
of what people like to call prompt engineering. 00:32:09.540 |
is a very brief list of some of the models that we have. 00:32:14.540 |
Now, keep in mind that a lot of these models, 00:32:18.980 |
the list is always updated every two or three weeks. 00:32:34.900 |
purposes that we see these models are trying to achieve 00:32:38.860 |
can be things like your general purpose ones. 00:32:41.700 |
So that's when you get a model to do all sorts of things. 00:32:45.180 |
There's also, of course, your multi-modal ones, 00:32:53.460 |
and then you maybe ask the model to decipher some fact 00:33:00.220 |
There's also, of course, your video-related ones. 00:33:03.300 |
There are some that are very specific to code generation. 00:33:08.260 |
Some that are very specific in the finance domain. 00:33:11.580 |
Some that are very specific in the science domain. 00:33:21.220 |
There's a much more detailed list in the paper itself. 00:33:29.940 |
there are also additional papers that come out. 00:33:41.300 |
So these are some of them that were not mentioned. 00:33:43.380 |
So good to understand that this is always an evolving list. 00:33:57.540 |
You've got things like your instruction tuning, 00:34:07.980 |
Now the context windows are in the six figures, 00:34:12.740 |
There are also other ways in which LLMs can be used. 00:34:25.740 |
you can always fine-tune them for very specific purposes 00:34:31.060 |
to maybe your own corpus or your own knowledge base. 00:34:35.700 |
So that's essentially what we're doing over here. 00:34:55.580 |
where let's say instead of representing a number in 32-bit, I represent it in fewer bits, 00:35:02.220 |
and see if I can still maintain the model accuracy. 00:35:09.140 |
if you're able to get lighter models, smaller models, 00:35:14.060 |
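A toy sketch of the simplest version of that idea, symmetric 8-bit quantization with a single scale factor per tensor; real post-training schemes are more involved.

```python
import numpy as np

def quantize_int8(w):
    """Map float32 weights to int8 plus one scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
print(q.nbytes, "bytes instead of", w.nbytes, "bytes")     # 4x smaller storage
print(np.abs(w - dequantize(q, scale)).max())              # small rounding error
```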
Multi-modal LLMs that we talked about earlier 00:35:16.300 |
that take in things like images and video as inputs. 00:35:22.380 |
is when you just add another layer on top of the output 00:35:32.820 |
where your adapter is used in two or more models. 00:36:14.900 |
so you will be able to leverage different, 00:36:18.340 |
I would say, vertical workflows of the model 00:36:38.700 |
is if you're able to reduce the number of parameters 00:36:55.780 |
instead of calculating gradients for 64 parameters 00:37:02.500 |
what you can do is that you can decompose this matrix 00:37:05.140 |
into an eight-by-two and a two-by-eight matrix. 00:37:12.140 |
you get back the 64 weights, 00:37:15.060 |
or the resultant is an eight-by-eight matrix, 00:37:32.820 |
So essentially that's what we're doing over here. 00:37:35.100 |
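A minimal sketch of that low-rank idea, with the same toy 8-by-8 example: the 8-by-2 and 2-by-8 factors give 32 trainable numbers instead of 64, yet their product is still a full 8-by-8 update on top of the frozen weights.

```python
import numpy as np

d, r = 8, 2
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pretrained weight (64 parameters)
A = rng.normal(size=(d, r)) * 0.01   # trainable low-rank factors:
B = np.zeros((r, d))                 # 8x2 + 2x8 = 32 parameters instead of 64

delta_W = A @ B                      # multiplied out, still an 8x8 update
W_adapted = W + delta_W              # applied on top of the frozen weights

print(A.size + B.size, "trainable parameters vs", W.size)
print(W_adapted.shape)
```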
Yeah, so that's pretty much it for this segment. 00:37:51.380 |
We've got, these are things that we've seen before, 00:37:53.940 |
Wikipedia datasets, C4 dataset, Common Crawl, 00:38:00.500 |
And then, of course, you've got some datasets 00:38:04.420 |
that can be used for very task-specific models, 00:38:14.700 |
and you've also got datasets that are used for alignment. 00:38:24.220 |
or HuggingFace, you'll be able to download them, 00:38:40.300 |
I would say, templates or schemas that you can use 00:38:43.380 |
to prepare your datasets so that you can do fine-tuning. 00:38:49.620 |
and this is for getting the model 00:38:53.540 |
to display behavior that's more aligned to our use. 00:38:58.540 |
So naturally, this one, I'm okay to share some examples, 00:39:02.420 |
but this one, you can go ahead and click on the link. 00:39:07.140 |
So let's say we've done our training on fine-tuning. 00:39:29.220 |
You've got things like your single-task evaluations, 00:39:31.380 |
so very popular ones would be things like SQuAD, 00:39:46.700 |
answering math questions, so mathematical reasoning, 00:39:51.700 |
and this is, I believe, natural language inference. 00:39:57.820 |
So essentially, whether the two sentences are, 00:40:34.260 |
and then you've got your multi-task evaluation, 00:40:49.380 |
this is divided into multiple individual evaluations, 00:40:56.980 |
so you've got things like natural language inference, 00:41:08.420 |
So essentially, that's what's going on over here. 00:41:17.820 |
so there's a big number of knowledge-intensive tasks 00:41:31.060 |
I would say questions that mimic human behavior more, 00:41:48.500 |
So beyond just things like what's in the list, 00:41:56.300 |
and naturally, what happens is that for each of them, 00:41:59.620 |
there are also certain guardrails that need to be placed. 00:42:08.180 |
it is important to ensure that when we submit lyrics 00:42:13.380 |
these lyrics shouldn't be under any kind of copyright. 00:42:16.740 |
If not, then there might be legal consequences. 00:42:28.900 |
So finally, last part, before we go into Q&A, 00:42:33.900 |
what are some of the things that we see models exhibit? 00:42:48.340 |
If the training data exhibits a certain behavior, 00:42:50.780 |
naturally, we see the model exhibiting this behavior. 00:42:57.860 |
And also things like models memorizing private content. 00:43:15.420 |
and then it outputs some sort of phone number 00:43:19.700 |
And let's say a user takes this and does a search. 00:43:26.300 |
And you can see there's actually some information over here 00:43:31.860 |
that's not supposed to be exposed to the public. 00:43:35.580 |
And then maybe someone searches for the phone number 00:43:37.740 |
and there you might have an additional contact 00:43:43.020 |
So these are some of the things that we want to, 00:43:49.220 |
when it comes to the component about human alignment. 00:43:56.220 |
being helpful, being harmless, and being honest, 00:44:07.260 |
And generally, what happens is that there are teams, 00:44:13.540 |
all these ways of conducting adversarial attacks. 00:44:17.380 |
or what people like to call red teaming these models. 00:44:20.420 |
So essentially trying to generate adversarial prompts 00:44:25.140 |
or find ways such that the model will leak out something, 00:44:28.820 |
and then if they're able to do so, they will fix it. 00:44:53.980 |
and go into the topics that you're looking at. 00:45:04.620 |
I've also linked some of the external sources 00:45:18.380 |
and I'm leaving about 10 more minutes if there's any Q&As. 00:45:23.940 |
Thanks so much for giving such a detailed walkthrough. 00:45:28.340 |
I think there was a question by Bonan in the chat 00:45:33.420 |
like what exactly is the benefit of using a transformer 00:46:14.420 |
the sequence, the hidden state of the 10th token. 00:46:29.740 |
And essentially, that's what's going on over here, 00:46:31.420 |
where if, let's say, I want to calculate the second state, 00:46:35.620 |
the second hidden state of the second token in the sequence, 00:46:40.140 |
I need to calculate the first hidden state as an input. 00:46:42.700 |
So that goes back to either your RNNs or LSTMs, 00:46:52.300 |
the inputs to the hidden state is the hidden state 00:46:54.980 |
of the previous token, and also the input token. 00:46:59.580 |
So the thing is that because there is this dependency, 00:47:05.340 |
the future hidden states rely on the previous hidden states, 00:47:09.500 |
and because of that, there is no ability to parallelize 00:47:13.060 |
from a sequence perspective, or from a wall-clock perspective. 00:47:15.780 |
And therefore, you see on the first line that the forward 00:47:17.620 |
and backward passes are O of sequence length. 00:47:19.220 |
That means however long your sequence length is, 00:47:28.140 |
At least the way I like to think about it is that, 00:47:33.260 |
In order for me to get the final hidden state, 00:47:35.100 |
before I can start evaluating its predictions, 00:47:44.660 |
I can just pad everything to the same length, 00:47:48.580 |
So I can get everything out in one output step, 00:47:53.900 |
At least that's my understanding of the parallelizability. 00:48:09.260 |
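To make that contrast concrete, here's a toy sketch (the sizes are invented): the RNN has an unavoidable loop because each hidden state needs the previous one, while self-attention covers every position in one batched matrix computation.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 4
X = rng.normal(size=(seq_len, d))

# RNN: an unavoidable sequential loop, since h_t depends on h_{t-1}.
Wh, Wx = rng.normal(size=(d, d)), rng.normal(size=(d, d))
h = np.zeros(d)
for t in range(seq_len):                 # step t cannot start before step t-1 finishes
    h = np.tanh(h @ Wh + X[t] @ Wx)

# Self-attention: one batched matrix computation handles every position at once.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                        # all seq_len positions computed together
print(out.shape)
```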
there is still this need of passing the hidden state 00:48:13.100 |
of the current token back into the transformer, 00:48:34.060 |
I had about the classification in this paper was 00:48:37.300 |
that of prefix versus full language modeling. 00:48:40.060 |
Because if you look at the example that you give in the text, 00:48:51.180 |
and then you output the word, "the force be with you." 00:48:55.860 |
and then the model is asked to predict "be with you." 00:48:59.500 |
But that just both seems like the same thing. 00:49:03.020 |
Because my understanding of prefix language modeling 00:49:05.020 |
was that, oh, we're gonna specify a specific token, 00:49:32.300 |
so it's a little bit hard to comment on that. 00:49:37.340 |
that this and this really doesn't show a lot of difference. 00:49:57.060 |
So generally what happens is that for full language, 00:50:05.260 |
you might just start with a beginning of sentence token 00:50:10.780 |
And then you autoregressively sample from there, 00:50:12.900 |
which is different from the prefix language modeling 00:50:15.300 |
where you are given the beginning of sentence token, 00:50:22.140 |
And then, of course, when you do your learning, 00:50:23.740 |
you are learning based on that particular sequence of text 00:50:30.100 |
I think this one, we've got to take a look at the paper 00:50:34.220 |
It was also the guy who was the author of the T5 paper, 00:50:46.500 |
- Yeah, and I think we can talk about this some other time. 00:50:48.980 |
It was just something that confused me quite a good amount. 00:50:57.740 |
'Cause when we covered the original transformer paper, 00:51:07.500 |
But it seems like you mentioned that newer papers 00:51:10.980 |
are starting to use learned positional encodings instead 00:51:10.980 |
I'm not very sure what were the changes that inspired it. 00:52:00.500 |
I've got maybe say 500 tokens or 1,000 tokens, 00:52:15.660 |
But I think once they have figured out how to do so, 00:52:32.420 |
I didn't really go into the details of this part of research. 00:52:39.500 |
'Cause that was just something that I was intrigued by. 00:52:51.620 |
Okay, it seems like there's no more questions. 00:52:56.780 |
So anyway, I think moving on to next week's paper, 00:53:00.380 |
I was thinking of doing the DeepSeek MoE paper. 00:53:00.380 |
That was one thing I'd like to present, to propose, sorry. 00:53:13.740 |
like always on experts, randomly routed experts. 00:53:19.220 |
So as usual, if anyone wants to present on the paper itself 00:53:36.620 |
if I actually had to sit down and present the paper. 00:53:39.980 |
So I think, as usual, I'll probably just drop a thread 00:53:54.020 |
Anyone have any other papers that you guys wanna read? 00:53:57.660 |
- Hmm, I'll take a look, I'll take a look at them. 00:54:20.940 |
And yeah, looking forward to next week, guys.