
A Comprehensive Overview of Large Language Models - Latent Space Paper Club


Whisper Transcript | Transcript Only Page

00:00:00.000 | All right, let's go.
00:00:01.680 | - All right, cool.
00:00:02.640 | So, hey guys, thanks so much for coming by the paper club.
00:00:06.000 | As usual, this is a paper club we run in Asia,
00:00:09.860 | where we go through one paper every week.
00:00:11.800 | So today we're just recording it for the first time,
00:00:15.040 | and we hope that you'll benefit from it.
00:00:16.920 | So as usual, if you guys got any questions,
00:00:18.600 | you can either let me know,
00:00:20.480 | and I can invite you guys on stage.
00:00:22.880 | You can drop in the chat, which you can access
00:00:25.180 | by just clicking the button on the top,
00:00:27.480 | just the little message icon.
00:00:30.640 | And yeah, do you wanna take it away, Brian?
00:00:33.500 | - Sure, thanks Ivan.
00:00:37.080 | So today, we'll be going through the comprehensive overview
00:00:42.000 | of large language models.
00:00:43.920 | But on top of that, I think what we wanna do also
00:00:46.720 | is just to share the reason why attention actually came about
00:00:51.720 | before the Transformers paper.
00:00:54.680 | So we'll have a little bit of a history lesson on that,
00:00:58.600 | on why it was developed.
00:01:00.160 | And then we will go through the paper,
00:01:03.440 | talking about what has happened post the Transformers era.
00:01:08.240 | In fact, it's when the GPT era started.
00:01:11.080 | So I'm gonna begin.
00:01:12.700 | As you can see, the link has two parts.
00:01:17.800 | So I'll use the first part to talk about pre,
00:01:20.580 | I would say GPT, and then I'll use the second link
00:01:23.400 | to talk about the paper itself.
00:01:24.960 | So let's begin.
00:01:28.880 | So essentially, what models have been trying to do recently
00:01:33.840 | is this idea of language modeling,
00:01:36.320 | where given a previous sequence of words,
00:01:40.720 | which is your input or your prompt,
00:01:42.560 | you want to find out the next word in the sequence.
00:01:45.320 | In this case, it can be question and answers.
00:01:47.920 | So it can be modeled essentially
00:01:51.400 | by this probability of the next token,
00:01:54.860 | given the sequence of tokens.
00:01:56.520 | So that's what you see here: the probability of the next token,
00:01:59.000 | the one at position T plus one,
00:02:01.220 | given the sequence over here up to time equals T,
00:02:05.920 | position equals T.
00:02:07.680 | And of course, your T plus one is a sample,
00:02:11.040 | a sample from the vocabulary that you have,
00:02:13.120 | which is basically your sub-words
00:02:14.560 | or the tokens that you have.
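To make that notation concrete, here is a minimal sketch of sampling the token at position t+1 from the distribution the model assigns given the tokens up to position t; the toy vocabulary and the logits are made up purely for illustration.

```python
import numpy as np

# Hypothetical vocabulary and made-up model logits for P(x_{t+1} | x_1..x_t).
vocab = ["the", "students", "opened", "their", "books", "laptops"]
logits = np.array([0.1, 0.2, 0.3, 0.2, 2.5, 2.3])

probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # softmax over the vocabulary
next_token = np.random.choice(vocab, p=probs)  # sample the next token
print(dict(zip(vocab, probs.round(3))), "->", next_token)
```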
00:02:15.960 | So why is this the case?
00:02:19.880 | I think for us who are doing NLP,
00:02:24.000 | beyond just thinking about looking at what the sequence is,
00:02:27.360 | what's being generated in the sequence,
00:02:29.660 | it's good to think about what kind of use case
00:02:31.620 | or what kind of tasks we are doing.
00:02:33.200 | And I'll say this is very useful
00:02:34.920 | when it comes to thinking about the evaluation metrics
00:02:38.400 | for each of these evaluation tasks.
00:02:40.560 | So you can get things like this.
00:02:42.000 | - Your screen just kind of like cut out for you.
00:02:44.400 | - Is it?
00:02:47.040 | Okay, let me see. - Oh, wait, sorry.
00:02:47.860 | No, no, sorry.
00:02:48.700 | It works again, sorry, like that.
00:02:50.020 | It just suddenly disappeared for me.
00:02:52.160 | It works again, like that, yeah.
00:02:54.560 | - No problem.
00:02:55.900 | So things like machine translation
00:02:58.740 | that we'll be talking about,
00:03:00.620 | we've got question and answer,
00:03:01.740 | summarization, so on and so forth.
00:03:04.040 | So essentially, good to think about
00:03:07.700 | what tasks we are trying to attack
00:03:11.580 | when we are using the different models, right?
00:03:14.820 | So while we think about language models
00:03:18.500 | as predicting the next token,
00:03:21.660 | it's also useful to think from a linguistic perspective
00:03:25.260 | what is being learned by these models.
00:03:28.500 | So there's a list over here.
00:03:31.900 | I'll just go through a few that is useful.
00:03:34.820 | Things like facts, which is trivia.
00:03:37.100 | So these are the ones where you can say
00:03:40.260 | the penalty for getting the prediction wrong
00:03:44.100 | is relatively higher,
00:03:44.980 | because if you output something that's false,
00:03:48.060 | then your language model is probably not truthful.
00:03:51.480 | Things like sentiment, which we have seen before.
00:03:56.820 | Things like reasoning.
00:03:58.460 | So in this case, if you look at the sentence,
00:04:00.760 | Iroh went to the kitchen to make some tea.
00:04:03.860 | Standing next to Iroh, Zuko pondered his destiny.
00:04:07.140 | Zuko left the.
00:04:08.580 | So in this case, the idea is that
00:04:11.100 | there is some sort of spatial understanding.
00:04:15.180 | The model needs to understand
00:04:16.180 | some spatial understanding of the sentence.
00:04:17.900 | In this case, Zuko is currently in the kitchen,
00:04:21.660 | so he left the kitchen.
00:04:23.460 | So these are some of the things that,
00:04:26.780 | from a linguistic perspective,
00:04:28.460 | we observe models learning
00:04:30.260 | in terms of patterns.
00:04:34.160 | So from language models,
00:04:38.580 | we talk about conditional language models.
00:04:41.580 | So essentially, the idea is that
00:04:43.940 | we are trying to generate a target sequence
00:04:48.740 | in a target sentence,
00:04:50.700 | given some sequence in the source sentence.
00:04:54.180 | So that is why you see over here
00:04:57.260 | that we are not just generating our y_t,
00:05:00.420 | given some y_1 to y_(t-1),
00:05:03.060 | which is basically the sequence
00:05:05.420 | that has been generated by the model before,
00:05:08.060 | but also we want to condition it on the source sentence.
00:05:13.620 | So that is essentially what translation does.
00:05:15.420 | You give, if you think about it,
00:05:16.980 | you give the model a source sentence,
00:05:20.340 | you pick the target language,
00:05:22.740 | and then you observe the model generate
00:05:26.060 | the sequence in the target language.
00:05:29.020 | So it's more than just language modeling,
00:05:31.800 | but it's also conditional.
00:05:33.100 | And one of the key things that we will notice
00:05:37.940 | in conditional language modeling
00:05:40.900 | is that we don't necessarily see
00:05:44.460 | that the first word in the source sentence
00:05:48.340 | corresponds to the first word in the target sentence.
00:05:51.060 | So as you can see, this might be it,
00:05:53.420 | first word to first word,
00:05:54.780 | but just the second word onwards,
00:05:56.620 | you start to see that there is this sort of
00:05:59.260 | crisscross relationship where you might need to,
00:06:02.980 | where maybe the second word over here
00:06:04.620 | corresponds to the third word,
00:06:06.180 | and the third word over here corresponds to the second.
00:06:09.060 | So essentially the idea is that
00:06:11.980 | we want to find a way to be able to model this relationship.
00:06:17.940 | And this relationship has actually been studied before
00:06:23.260 | in this idea of alignment,
00:06:27.100 | where if you think about it,
00:06:28.820 | if let's say we've got the clause sentence,
00:06:31.980 | let's say on the top,
00:06:35.380 | and the target sentence on the bottom, on the left,
00:06:40.100 | then if we've got this very linear one-to-one relationship,
00:06:44.740 | or this monotonic relationship,
00:06:46.860 | then we will see white boxes along the diagonal over here,
00:06:51.020 | from the top left to the bottom right,
00:06:53.220 | indicating that the first word corresponds to the first word,
00:06:56.860 | second word corresponds to the second word,
00:06:58.780 | so on and so forth.
00:07:00.300 | But as you can see, just from English to French,
00:07:03.980 | there is this idea where words that are later in the sequence
00:07:08.980 | correspond to words that are earlier, and vice versa.
00:07:12.380 | So that is how we can visualize attention.
00:07:17.140 | So then the question is, okay,
00:07:20.780 | what, how are we in a sense modeling it,
00:07:25.460 | or what does it look like
00:07:26.540 | from the encoder-decoder perspective?
00:07:29.340 | So naturally, when we look at the encoder-decoder blocks,
00:07:34.060 | this can be, let's look at this as an RNN.
00:07:38.620 | We say that the hidden state,
00:07:41.220 | the last hidden state in the encoder block
00:07:44.100 | contains all the information of the entire sentence,
00:07:48.300 | but there's this information bottleneck problem,
00:07:51.220 | which means that if let's say this is a longer sentence,
00:07:55.260 | the last hidden state might not contain information
00:07:58.300 | of the earlier tokens.
00:08:00.620 | And therefore, there's this idea of attention
00:08:05.380 | where you have,
00:08:07.500 | given that you've got all the hidden states
00:08:09.780 | of all the input tokens,
00:08:11.740 | the decoder, during the language generation step,
00:08:17.780 | will pay attention, or attend, to
00:08:20.860 | a weighted sum of all the hidden states.
00:08:26.420 | So if let's say I've got something
00:08:28.260 | that is later in my sequence
00:08:31.260 | that corresponds to a token
00:08:34.300 | that is earlier in my source sentence,
00:08:36.940 | then I will see the attention weights
00:08:40.700 | giving more weight to the hidden states
00:08:44.500 | in the source sentence.
00:08:45.940 | So essentially, that's the idea of attention
00:08:49.780 | that has been implemented in the encoder-decoder
00:08:54.380 | kind of paradigm or the kind of architecture.
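A minimal numerical sketch of that weighted sum, assuming a simple dot-product scoring function (early attention papers also used a small learned feed-forward scorer, so treat this only as an illustration of the idea):

```python
import numpy as np

def attend(decoder_state, encoder_states):
    """Score each encoder hidden state against the current decoder state,
    softmax the scores, and return the weighted sum (the context vector)."""
    scores = encoder_states @ decoder_state        # (src_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # attention weights sum to 1
    context = weights @ encoder_states             # (hidden_dim,)
    return context, weights

# Toy numbers: 4 source tokens, hidden size 3.
encoder_states = np.random.randn(4, 3)
decoder_state = np.random.randn(3)
context, weights = attend(decoder_state, encoder_states)
print(weights.round(3), context.round(3))
```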
00:08:58.060 | So the problem with that is that
00:09:01.140 | when we create these
00:09:05.100 | or we calculate these individual hidden states,
00:09:08.060 | we realize that it has to be calculated sequentially.
00:09:11.180 | That means in this diagram,
00:09:13.100 | you can see that the second hidden state
00:09:16.140 | can only be calculated
00:09:19.060 | after the first hidden state has been output.
00:09:22.140 | And the third hidden state can only be calculated
00:09:24.980 | after the second hidden state has been output.
00:09:27.460 | So the question is,
00:09:29.020 | can we remove or break free from this idea
00:09:33.940 | where there is a dependency of the previous state?
00:09:36.900 | Because if we're able to do so,
00:09:38.740 | then we are able to run our forward pass
00:09:41.900 | and collect our gradients
00:09:43.380 | and run backprop on the architecture
00:09:46.300 | concurrently across the whole sequence.
00:09:52.420 | So essentially, that's the idea
00:09:55.580 | of your key query value attention,
00:09:58.540 | and that essentially forms
00:10:01.700 | one of the building blocks of the transformer architecture.
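A minimal single-head sketch of key-query-value attention, with made-up shapes, showing how the whole sequence is processed in one matrix computation rather than step by step:

```python
import numpy as np

def kqv_attention(X, W_q, W_k, W_v):
    """Project the whole sequence into queries, keys and values at once,
    so every position is handled in parallel."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])             # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # weighted sum of values

seq_len, d_model = 5, 8
X = np.random.randn(seq_len, d_model)
W_q, W_k, W_v = (np.random.randn(d_model, d_model) for _ in range(3))
print(kqv_attention(X, W_q, W_k, W_v).shape)            # (5, 8)
```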
00:10:06.700 | So I think from here,
00:10:09.060 | what we're just going to talk about
00:10:11.340 | is there are other components
00:10:16.300 | to the transformer architecture
00:10:18.900 | beyond just our key query value attention.
00:10:23.500 | There is also this idea of understanding
00:10:26.180 | the position of the text,
00:10:28.940 | and that's basically an idea
00:10:30.380 | of adding position representations
00:10:32.260 | that you will see in the paper later,
00:10:34.820 | adding some sort of non-linearity
00:10:37.460 | when you're doing the calculation,
00:10:38.980 | and that's essentially
00:10:39.860 | just adding a feed-forward layer on top of it.
00:10:42.660 | So the idea is that
00:10:43.900 | if you're just calculating the key-query-value attention,
00:10:46.260 | you're always looking at linear combinations
00:10:49.740 | of your, you can say your values,
00:10:52.220 | because you're just getting a weighted sum of the values
00:10:55.500 | calculated by attention.
00:10:56.900 | So we want to add a layer of non-linearity to it,
00:11:00.220 | which is taken care of by the feed-forward network.
00:11:04.060 | And of course, the last part
00:11:05.340 | is when you're doing the decoding step,
00:11:07.700 | when you're generating tokens,
00:11:10.180 | you want to not let the model see the future tokens,
00:11:14.260 | and essentially that's when masking comes into play,
00:11:17.420 | attention masking comes into play.
00:11:18.700 | So you will start to see that
00:11:20.620 | in the decoder architecture later down the road.
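A small sketch of that causal (look-ahead) mask, assuming a toy length-4 sequence: position i may only attend to positions up to i, so future scores are pushed to negative infinity before the softmax and get zero weight.

```python
import numpy as np

seq_len = 4
scores = np.random.randn(seq_len, seq_len)
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)   # True above the diagonal
scores = np.where(mask, -np.inf, scores)                       # hide future positions

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))   # the upper triangle is all zeros
```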
00:11:24.660 | So a couple of things on top of what we are talking about
00:11:30.940 | in terms of the language modeling component for transformers.
00:11:35.860 | One topic is sub-word models.
00:11:38.300 | So this is when you have things like tokenization,
00:11:41.460 | your byte-pair encoding.
00:11:42.620 | So essentially, what are we trying to solve over here?
00:11:45.380 | If you look at this table at the bottom,
00:11:49.140 | we start to see that for words
00:11:51.380 | that exist outside the vocabulary,
00:11:54.100 | that can be things like a variation of an existing word,
00:11:57.740 | in this case, you add many A's in between the word,
00:12:02.140 | between T and A for tasty
00:12:03.860 | to probably indicate that it's very tasty,
00:12:07.540 | or misspellings of words,
00:12:09.340 | which is also very common in input,
00:12:11.540 | or novel words over here where we understand
00:12:15.580 | the word transformerify might mean
00:12:18.100 | adding maybe a transformer block
00:12:21.780 | into an existing architecture,
00:12:23.780 | but it's a word that we might not see
00:12:25.980 | in the existing dictionary.
00:12:27.940 | So for them, for these words over here,
00:12:31.220 | if you just use a traditional vocabulary
00:12:33.260 | or a dictionary vocabulary,
00:12:35.300 | the index will be some sort of an UNK token.
00:12:40.500 | And essentially what goes on with byte-pair encoding
00:12:44.780 | is that it starts to learn these shorter
00:12:47.980 | combinations of letters that can sometimes
00:12:51.740 | represent either prefixes or suffixes of a word,
00:12:57.060 | and then essentially you are able
00:12:58.780 | to generate the embeddings for them.
00:13:00.300 | So if you see over here, you've got this T-A-A,
00:13:04.460 | and then anything after that,
00:13:05.780 | and A-A-A, and anything after that, and S-T-Y.
00:13:08.820 | So this guy, probably you've seen it
00:13:10.700 | in other existing words,
00:13:13.860 | and therefore there is an existing embedding
00:13:16.380 | that's associated with it,
00:13:17.820 | and therefore we are able to represent it over here.
00:13:20.420 | You can think of it maybe as a,
00:13:22.260 | you're essentially creating,
00:13:23.660 | you're essentially generating three tokens
00:13:25.620 | from this source sequence over here.
00:13:28.300 | So essentially that's the idea of sub-word models,
00:13:33.300 | or in this case, you've got things like
00:13:35.660 | byte-pair encoding, sentence piece, word piece,
00:13:37.620 | and things like that.
00:13:38.460 | That's the problem that they're trying to solve.
00:13:41.100 | Okay?
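A toy sketch of that subword idea, using a made-up piece vocabulary and a greedy longest-match split; real BPE, WordPiece and SentencePiece learn their merges from corpus statistics, so this is only an approximation of the behaviour described above.

```python
# Hypothetical subword vocabulary; unknown surface forms fall back to
# pieces that do have embeddings instead of a single UNK token.
subword_vocab = {"taa", "aaa", "sty", "t", "a", "s", "y"}

def greedy_subword_split(word, vocab):
    """Greedy longest-match segmentation (a simplification of BPE/WordPiece)."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):      # try the longest piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append("<unk>")             # no piece matched this character
            i += 1
    return pieces

print(greedy_subword_split("taaaaasty", subword_vocab))   # ['taa', 'aaa', 'sty']
```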
00:12:43.660 | Next, we have three types of architectures.
00:13:47.700 | The key thing over here to note is that
00:13:51.780 | what we have in the transformer block
00:13:53.620 | is essentially replacing the recurrent neural network blocks
00:13:57.660 | that we had previously.
00:13:58.740 | So when we talk about recurrent neural networks,
00:14:00.580 | of course we add things like LSTMs, GRUs,
00:14:04.260 | bidirectional models, multi-layer models.
00:14:06.820 | So it encompasses all that.
00:14:08.740 | And essentially what we have over here
00:14:10.740 | are the three types of dominant architectures.
00:14:13.940 | We've got the encoder models,
00:14:15.660 | and examples of this would be things like BERT,
00:14:17.780 | where you learn via masked language modeling,
00:14:21.260 | which has been covered before.
00:14:23.540 | Encoder-decoder models, where we've seen earlier,
00:14:26.020 | we have an encoder that maps your sequence into
00:14:29.580 | a space, or a position in latent space,
00:14:32.980 | and then from there you perform your
00:14:36.020 | sampling, or your autoregressive sampling of tokens
00:14:38.540 | to form your target sequence,
00:14:40.380 | which is what we've seen in T5.
00:14:42.380 | And the decoder models,
00:14:44.140 | which I think all of us are familiar with,
00:14:46.180 | things like GPT-2, GPT-3, they are all there.
00:14:48.380 | So you essentially learn the patterns of the language,
00:14:52.660 | and then you directly just do your autoregressive
00:14:56.700 | sampling or decoding from there.
00:14:58.660 | Okay?
00:15:00.700 | So from there, right,
00:15:03.580 | this will lead us to
00:15:06.380 | this paper that we have over here,
00:15:08.180 | which is the Comprehensive Overview
00:15:09.620 | of Large Language Models.
00:15:10.860 | If you take a look at this paper,
00:15:13.860 | it seemed to me that
00:15:17.580 | there were multiple updates to the paper.
00:15:20.300 | And that signals to me that there's probably
00:15:23.220 | going to be updates along the way.
00:15:25.980 | So, I think what's useful
00:15:29.620 | is beyond looking at just the paper itself,
00:15:33.820 | to understand,
00:15:36.700 | or for me, what I did was I tried to understand
00:15:39.140 | what framework the authors were using
00:15:41.820 | to attack the knowledge,
00:15:46.820 | then divide it up, and then give us, the readers,
00:15:50.300 | a way to understand it.
00:15:51.780 | Right?
00:15:53.300 | It's a very dense paper.
00:15:55.420 | It's got, I think, over 450 citations.
00:15:58.980 | So, I think it's more of a
00:16:03.980 | pick-your-own-adventure, pick-your-own-journey,
00:16:06.620 | pick-your-own-learning-process
00:16:10.620 | kind of, I would say, direction,
00:16:13.540 | so that along the way, you'll be able to build
00:16:18.180 | that foundational knowledge and then add layers on it,
00:16:21.020 | add layers on it.
00:16:22.300 | At the end of the day, we all know that new models
00:16:24.540 | are always developed and new models are always announced.
00:16:27.940 | So, going back to the first principles
00:16:29.340 | and fundamentals are useful.
00:16:31.700 | So, let's just go through the paper very quickly.
00:16:35.780 | Let's just start from the top over here.
00:16:38.380 | So, essentially, we'll just talk about
00:16:40.980 | the last point over here,
00:16:42.700 | where we are seeing that large language models,
00:16:46.740 | in particular things like GPT-3,
00:16:50.380 | are able to perform your downstream tasks
00:16:53.060 | without specific fine-tuning.
00:16:54.660 | So, that's the first key point,
00:16:56.020 | because if we looked at T5,
00:16:58.180 | we saw that the performance of T5 on downstream tasks,
00:17:03.100 | in this case, it can be translation,
00:17:04.780 | it can be your GLUE tasks, it can be your SQuAD tasks,
00:17:08.780 | its performance will only get better
00:17:13.740 | once you fine-tune on that particular task.
00:17:16.660 | And you've seen, there are multiple experiments
00:17:19.060 | that they've done, which demonstrates
00:17:21.180 | that that's the better way, that's the better alternative.
00:17:24.260 | So, what GPT-3 demonstrated was that
00:17:27.900 | they are able to perform zero-shot transfer learning
00:17:31.060 | on these tasks.
00:17:32.660 | So, what does that mean?
00:17:33.500 | That means that if you just give the prompt
00:17:35.900 | from the downstream task, GPT-3 is able to give the answer.
00:17:39.940 | So, that kind of changed things
00:17:42.940 | where we actually might not need to fine-tune
00:17:45.260 | for a particular task.
00:17:46.100 | Of course, when we look later down the road,
00:17:48.300 | we see that there's very, very specific ways
00:17:50.980 | of doing things like instruction tuning.
00:17:53.500 | But that was one of the big discoveries
00:17:54.980 | that they had back then.
00:17:56.180 | On top of it, they were able to show things like reasoning,
00:18:01.900 | they were able to show things like planning,
00:18:03.460 | they were able to show things like in-context learning.
00:18:05.180 | So, we get to see examples of this later
00:18:09.060 | when you do things like chain of thought prompting.
00:18:13.060 | So, they're able to understand certain patterns:
00:18:15.420 | when you ask a question
00:18:21.540 | or give a task that follows a similar pattern
00:18:23.900 | to the examples in the prompt, they are able to answer.
00:18:25.980 | The problem that we see today
00:18:30.900 | is that the cost of training them
00:18:34.220 | or pre-training them is relatively high,
00:18:35.700 | usually in the tens of millions.
00:18:38.340 | So, the question is that can we get better
00:18:41.140 | at pre-training these models?
00:18:43.700 | Can we look at things like better architectures?
00:18:45.340 | Can we look at things like more efficient ways
00:18:49.180 | of fine-tuning our parameters?
00:18:52.540 | Are there ways that we can represent these parameters
00:18:54.940 | in a lower-dimensional vector space,
00:18:57.740 | or a representation that uses lower precision?
00:19:02.740 | So, that's essentially what things like architectures
00:19:07.140 | come into play, quantization comes into play.
00:19:10.740 | So, the way I saw this paper
00:19:14.660 | was that we had the background,
00:19:17.460 | which talks about some of the key concepts
00:19:19.660 | and then the different types of LLMs
00:19:22.180 | and their particular use cases.
00:19:23.940 | The datasets that have been used to train them,
00:19:27.540 | at least the public ones.
00:19:29.420 | What kind of evaluation tasks are they looking at?
00:19:31.860 | So, probably that's what we call evals.
00:19:34.340 | And the different types of applications
00:19:36.940 | for these LLMs in the commercial space.
00:19:40.260 | And of course, from there we talk about
00:19:42.660 | what probably researchers are looking at
00:19:46.260 | going into maybe say the next three months
00:19:49.300 | or the next year.
00:19:50.220 | So, let's look at some of the fundamentals.
00:19:54.900 | So, I'm gonna start from the left side.
00:19:56.860 | The paper has covered,
00:19:58.860 | we have covered some of these topics from the paper.
00:20:01.100 | Tokenization, attention mechanisms,
00:20:04.860 | the different types of activation functions.
00:20:06.620 | So, those are stuff that we've learned.
00:20:09.340 | You can get a recap when you do
00:20:10.580 | your traditional deep learning topics.
00:20:15.020 | Then, of course, we talked about
00:20:16.580 | the different types of architectures
00:20:17.780 | which was covered earlier.
00:20:19.060 | Your encoder-only, your encoder-decoder,
00:20:21.860 | your decoder-only.
00:20:23.340 | And naturally, each of them
00:20:25.620 | will have their own associated way
00:20:27.500 | of doing attention masking.
00:20:29.660 | So, that's this part over here.
00:20:31.220 | We talked about the different types
00:20:33.980 | of pre-training objectives.
00:20:35.020 | Naturally, things like masked language modeling
00:20:36.780 | are things that we see in your encoder-only models.
00:20:41.620 | Language modeling is something that we see
00:20:43.340 | in your encoder-decoder models.
00:20:45.300 | So, masked language modeling,
00:20:46.940 | basically, in this diagram,
00:20:48.740 | is you give the model this token
00:20:53.140 | and these tokens over here,
00:20:54.860 | and the model is expected to predict
00:20:56.940 | these targets over here,
00:20:57.900 | the ones that have been highlighted.
00:21:00.020 | So essentially, masked language modeling
00:21:01.500 | is like a fill-in-the-blank
00:21:03.020 | kind of problem.
00:21:04.260 | Whereas for full language modeling,
00:21:05.660 | you give the first token,
00:21:07.220 | and then the model is expected to predict
00:21:09.020 | the second, third, fourth, fifth token,
00:21:10.700 | so on and so forth.
00:21:11.580 | So, that's that.
00:21:13.300 | There have been also research
00:21:16.380 | into this thing called prefix language modeling,
00:21:18.660 | where you feed the model one part of the sequence,
00:21:23.660 | and then you're asking the model
00:21:25.740 | to generate the remaining parts of the sequence.
00:21:30.740 | And what's useful over here
00:21:33.940 | is that when they do prefix language modeling,
00:21:37.940 | they use this thing called a causal mask with prefix,
00:21:42.500 | which means that for the input tokens,
00:21:46.940 | the model is able to see or attend
00:21:49.180 | to all the previous tokens in the input
00:21:53.180 | before it starts to generate output.
00:21:54.620 | And that's why when you see
00:21:56.300 | as the model generates the output,
00:21:57.740 | you still have that element of mask attention.
00:22:01.820 | So, essentially, that's this part over here.
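As a toy sketch (not from the paper), here is how the inputs and prediction targets differ across the three objectives on one made-up sentence:

```python
# Toy token sequence used for all three objectives.
tokens = ["may", "the", "force", "be", "with", "you"]

# Masked language modeling: hide some tokens, predict only the hidden ones.
masked_input   = ["may", "[MASK]", "force", "be", "[MASK]", "you"]
masked_targets = {1: "the", 4: "with"}

# Full language modeling: every position predicts the next token.
full_inputs  = tokens[:-1]          # ["may", "the", "force", "be", "with"]
full_targets = tokens[1:]           # ["the", "force", "be", "with", "you"]

# Prefix language modeling: the prefix is fully visible (no loss on it),
# and the model is trained to generate only the continuation.
prefix, continuation = tokens[:3], tokens[3:]   # ["may", "the", "force"] -> ["be", "with", "you"]
```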
00:22:04.940 | Things that are, I would say,
00:22:08.780 | if you look at the transformer paper,
00:22:10.940 | which is covered over there,
00:22:12.860 | will be things like layer normalization,
00:22:15.460 | where you take the activations,
00:22:20.460 | subtract the mean from them,
00:22:22.460 | and divide by the standard deviation.
00:22:25.860 | Essentially, what we're doing
00:22:26.780 | is that we're trying to achieve numerical stability
00:22:30.300 | of the activations, so that when you do a forward pass
00:22:34.420 | and you do your back propagation,
00:22:36.180 | you don't have numbers that go all over the place.
00:22:39.140 | So that's layer normalization.
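A minimal sketch of layer normalization, with the learnable gain and bias omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Subtract the mean and divide by the standard deviation
    across the feature dimension of each activation vector."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

x = np.random.randn(2, 8) * 50 + 10        # activations with a wild scale
out = layer_norm(x)
print(out.mean(axis=-1).round(6), out.std(axis=-1).round(3))   # ~0 mean, ~1 std per row
```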
00:22:41.180 | Positional encoding was something we talked about earlier.
00:22:46.060 | In the original paper,
00:22:47.100 | they had this idea of sinusoidal position representations.
00:22:50.260 | So how to read this graph.
00:22:53.900 | Okay, so essentially how to read this graph
00:23:03.860 | is that as you go from left to right
00:23:06.620 | in the, as the index of the sequence increases,
00:23:10.700 | essentially you're applying some sort
00:23:12.380 | of sinusoidal function on top of it,
00:23:14.660 | such that every token in the sequence
00:23:19.140 | has a positional representation.
00:23:21.300 | It's augmented by a positional representation.
00:23:23.260 | So essentially from left to right,
00:23:24.580 | all these vectors actually look different.
00:23:27.140 | But what happens is that this way
00:23:32.700 | of encoding positional representations is not learnable
00:23:37.580 | because there is no way
00:23:40.340 | to take a gradient
00:23:42.060 | and then update the position representations.
00:23:45.620 | So therefore, it has been changed to something as simple
00:23:48.460 | as just adding a learned positional embedding
00:23:51.100 | on top of the token embeddings.
00:23:53.100 | And of course, if you look at the paper,
00:23:54.740 | there are also newer ways to do it,
00:23:56.260 | things like ALiBi, things like RoPE.
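A minimal sketch of the original sinusoidal position representations, assuming an even model dimension; each position gets a fixed, non-learned vector of sines and cosines at different frequencies that is added to the token embeddings.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions: cosine
    return pe

print(sinusoidal_positions(seq_len=6, d_model=8).shape)   # (6, 8)
```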
00:23:59.020 | So that's the left-hand side.
00:24:02.180 | Now on the right-hand side over here,
00:24:04.460 | we are looking at newer ways
00:24:06.940 | or ways that can help with training or implementation.
00:24:11.900 | So things like the libraries that we're using,
00:24:16.180 | JAX, PyTorch, TensorFlow, amongst others,
00:24:18.340 | there's this idea of distributed training,
00:24:22.340 | which means that can we use multiple GPUs
00:24:26.060 | to train our models so that we are able
00:24:28.660 | to learn our weights faster?
00:24:30.380 | So amongst others, there's this idea of data parallelism
00:24:34.540 | where you duplicate your model in two GPUs.
00:24:39.540 | Let's say I've got two GPUs,
00:24:40.900 | I duplicate my model in both GPUs
00:24:43.780 | and then I run separate batches on top of them.
00:24:46.180 | So let's say I've got a batch of, I don't know, 100,000.
00:24:50.300 | I split it into 50,000, 50,000.
00:24:52.580 | I run the first batch of 50,000
00:24:54.420 | in the first model in the first GPU.
00:24:57.980 | Then the other 50,000 in the same model in the second GPU,
00:25:01.700 | calculate the gradients, average them,
00:25:03.540 | and then perform my backprop.
00:25:05.260 | So that's what data parallelism is.
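A toy sketch of that data-parallel recipe, assuming a simple linear model so the gradient math stays short; the split into two shards stands in for two hypothetical GPUs.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(100_000, 4)
y = X @ np.array([1.0, -2.0, 0.5, 3.0])
w = np.zeros(4)                                   # the "model", copied to every device

def grad(X_shard, y_shard, w):
    err = X_shard @ w - y_shard
    return 2 * X_shard.T @ err / len(X_shard)     # mean-squared-error gradient

shards = np.array_split(np.arange(len(X)), 2)     # 50,000 examples per "GPU"
grads = [grad(X[idx], y[idx], w) for idx in shards]
w -= 0.1 * np.mean(grads, axis=0)                 # average the gradients, one shared update
print(w.round(3))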
00:25:07.860 | Tensor parallelism, essentially the idea is
00:25:10.020 | that you calculate the matrix multiplication steps
00:25:15.020 | in multiple GPUs and then you add them up.
00:25:20.700 | So what happens is that, as you can see,
00:25:22.940 | we know that for each row,
00:25:26.700 | the multiplication with a column can be done concurrently
00:25:30.620 | and therefore it splits it up such that
00:25:33.220 | the matrix on the left
00:25:37.300 | multiplies with only one column,
00:25:39.300 | the matrix on the right multiplies with the second column,
00:25:41.260 | and then you combine them together.
00:25:42.660 | Or in this case, you concatenate the results together.
00:25:45.340 | So that again also helps us with getting the results
00:25:49.980 | from the forward pass a lot faster, okay?
00:25:52.900 | So that's that.
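A minimal sketch of that column-split idea, assuming two hypothetical GPUs that each hold half of the weight matrix's columns:

```python
import numpy as np

X = np.random.randn(4, 8)             # a batch of activations
W = np.random.randn(8, 6)             # weight matrix to be sharded by columns

W_gpu0, W_gpu1 = W[:, :3], W[:, 3:]                    # each "GPU" holds half the columns
out = np.concatenate([X @ W_gpu0, X @ W_gpu1], axis=1) # concatenate the partial results

assert np.allclose(out, X @ W)        # same result as the unsharded multiply
print(out.shape)                      # (4, 6)
```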
00:25:55.620 | Other kinds of tricks that we are using,
00:25:59.700 | things like flash attention,
00:26:01.340 | where it's a very smart way of utilizing memory.
00:26:06.140 | So what happens is that instead of
00:26:10.300 | a series of steps that is very memory-intensive,
00:26:14.020 | where they load your query
00:26:17.100 | and key matrices for the calculation,
00:26:19.260 | perform the softmax and then get your results,
00:26:21.980 | they iterate over the computation in blocks
00:26:25.060 | and they are using very smart functions
00:26:27.580 | to calculate things like the softmax on the fly.
00:26:31.580 | So essentially that's what they're doing over here.
00:26:33.300 | So it's an optimization of how you use your high-bandwidth memory
00:26:38.220 | and also the fast on-chip memory in your GPU, right?
00:26:41.380 | Because in your GPUs, you've got very fast computation
00:26:45.540 | but relatively lower memory.
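A minimal sketch of that "softmax on the fly" trick, a streaming softmax-weighted sum; the real FlashAttention kernel works on tiles of the query, key and value matrices and manages GPU memory explicitly, so this only shows the math.

```python
import numpy as np

def streaming_softmax_sum(scores, values):
    """Keep a running max and running normalizer so the full row of attention
    scores never has to be materialized at once."""
    running_max, running_sum = -np.inf, 0.0
    acc = np.zeros(values.shape[1])
    for s, v in zip(scores, values):
        new_max = max(running_max, s)
        scale = np.exp(running_max - new_max)   # 0.0 on the first step (exp(-inf))
        acc, running_sum = acc * scale, running_sum * scale
        w = np.exp(s - new_max)
        acc += w * v
        running_sum += w
        running_max = new_max
    return acc / running_sum

scores, values = np.random.randn(16), np.random.randn(16, 4)
weights = np.exp(scores - scores.max())
reference = (weights / weights.sum()) @ values
assert np.allclose(streaming_softmax_sum(scores, values), reference)
```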
00:26:47.140 | Just a little bit extra,
00:26:51.100 | this is one of those very common topics
00:26:52.780 | that people like to start from
00:26:56.300 | as they go into things like your Mamba models.
00:26:59.820 | So that's just the first part.
00:27:04.980 | So the second part in terms of the background
00:27:08.340 | will be how do we adapt these models for specific tasks?
00:27:13.340 | So there are things like transfer learning,
00:27:17.140 | which we've seen before,
00:27:18.380 | where we pre-train our T5 base model
00:27:21.700 | and then we fine tune on individual tasks.
00:27:23.740 | There's also things like instruction fine tuning
00:27:26.820 | where the model is given a series
00:27:29.860 | of instructions and outputs
00:27:31.740 | and then the model will fine tune its outputs based on that.
00:27:35.900 | So examples of this can be things like,
00:27:38.780 | if let's say I ask GPT to explain the moon landing
00:27:42.500 | to a 6-year-old in a few sentences.
00:27:44.540 | Generally, if the model is only pre-trained,
00:27:48.340 | GPT tends to output completions in this way.
00:27:53.180 | So, explain the theory of gravity,
00:27:55.380 | explain the theory of relativity to a 6-year-old,
00:27:57.900 | and then explain the Big Bang to a 6-year-old,
00:27:59.980 | and then explain evolution to a 6-year-old.
00:28:02.980 | So that's how GPT-3 will output its sentences
00:28:05.780 | but if we're able to do some sort of instruction fine tuning
00:28:08.540 | where there is some sort of emphasis
00:28:11.180 | on things like 6UO in a few sentences,
00:28:13.660 | then this is the kind of outputs that you can get.
00:28:17.740 | And so that's the kind of variations of different models
00:28:22.380 | that we can see when we download them from open source,
00:28:25.940 | I say repositories, things like Hugging Face.
00:28:29.700 | So that's instruction fine tuning over here
00:28:31.900 | and something called alignment tuning
00:28:35.420 | where you want to ensure that your model fulfills
00:28:40.420 | what people call the three H's of model behavior.
00:28:46.060 | So your models will be harmless,
00:28:47.940 | your models will be honest and your models are helpful.
00:28:50.580 | So things like harmlessness will be things like,
00:28:52.700 | if let's say you ask the model,
00:28:56.380 | how can I let's say bake a cake with cyanide?
00:29:00.980 | If let's say your model is not alignment tuned,
00:29:04.420 | the model might give the instructions
00:29:07.420 | but let's say if you do alignment fine tuning
00:29:09.620 | to tell the model, hey, this is something
00:29:10.980 | that you should not output
00:29:12.340 | or you shouldn't give instructions for,
00:29:15.660 | then the model will learn accordingly from that.
00:29:18.300 | So these are some of the methods
00:29:21.180 | that we want to fine tune our models with
00:29:24.020 | such that our models are able to demonstrate
00:29:26.460 | a certain behavior.
00:29:27.420 | Then how are we doing it?
00:29:31.180 | We can use techniques
00:29:33.020 | like reinforcement learning to do it,
00:29:35.300 | where essentially for each of the different outputs,
00:29:40.860 | you have a certain kind of reward.
00:29:42.740 | In this case, the reward is just a scalar value
00:29:48.580 | and then you learn
00:29:53.180 | some sort of policy
00:29:53.180 | such that when the model outputs text
00:29:56.300 | based on this policy, you get to maximize the reward.
00:30:00.500 | So the key thing over here
00:30:02.340 | is that the policy has to be differentiable
00:30:07.340 | so that when you get some results
00:30:09.540 | from the model outputs and you get some reward,
00:30:14.540 | sometimes your reward might not be good
00:30:15.820 | or you're comparing rewards,
00:30:17.420 | you're able to get the loss of the reward
00:30:19.380 | and back propagate it through the gradients
00:30:22.620 | to update the weights in the policy.
00:30:26.340 | So that's essentially what reinforcement learning is.
00:30:28.940 | So typically for, I think when reinforcement learning
00:30:37.780 | was a hot thing back then, it's one course by itself.
00:30:41.540 | So this is just a very high level,
00:30:43.620 | five, six, five minute overview of it.
00:30:47.820 | On top of it, I think one of the things
00:30:50.380 | you are more familiar with is things like prompting.
00:30:52.980 | So we've got zero-shot prompting
00:30:54.740 | where you just ask for tasks,
00:30:56.980 | you just give a task and the model answers directly,
00:30:59.780 | but also you have things like chain of thought prompting
00:31:03.380 | where you give the model some examples before
00:31:07.900 | and then from there, ask the model
00:31:10.580 | to mimic the behavior of the examples above.
00:31:14.140 | So that's essentially what you have over here.
00:31:15.700 | You've got in-context learning of,
00:31:18.100 | I would say translation on the right,
00:31:21.420 | and you've got in-context learning
00:31:23.740 | of correcting spelling mistakes on the left.
00:31:26.980 | So that is essentially this part over here
00:31:32.900 | and you've got to see that a few shots
00:31:35.500 | or things like five shots or three shots
00:31:37.940 | usually have better performance
00:31:39.860 | against your zero-shot or one-shots.
00:31:42.860 | So that's this part over here.
00:31:45.180 | And then the question of course is,
00:31:47.580 | how do you craft these prompts
00:31:51.820 | such that you'll be able to get the results that you want?
00:31:54.180 | So that's essentially the idea
00:31:57.220 | of what people like to call prompt engineering.
00:32:01.540 | So that's essentially the part
00:32:04.420 | on the backgrounds that we want to cover.
00:32:07.060 | The next part over here, I would say,
00:32:09.540 | is a very brief list of some of the models that we have.
00:32:14.540 | Now, keep in mind that for a lot of these models,
00:32:18.980 | the list is always updated every two or three weeks.
00:32:22.380 | So that's good to understand.
00:32:25.140 | So naturally, I think when the paper
00:32:27.900 | is going to be updated in the future,
00:32:29.820 | you will see additional models.
00:32:31.620 | Some of the high-level, I would say,
00:32:34.900 | purposes that we see these models are trying to achieve
00:32:38.860 | can be things like your general purpose ones.
00:32:41.700 | So that's when you get a model to do all sorts of things.
00:32:45.180 | There's also, of course, your multi-modal ones,
00:32:49.780 | where you give the model some image
00:32:53.460 | and then you maybe ask the model to decipher some fact
00:32:57.780 | or draw some conclusion from the image.
00:33:00.220 | There's also, of course, your video-related ones.
00:33:03.300 | There are some that are very specific to code generation.
00:33:06.180 | So here are some of them.
00:33:08.260 | Some that are very specific in the finance domain.
00:33:11.580 | Some that are very specific in the science domain.
00:33:14.700 | And of course, there are some
00:33:17.300 | that are very useful for chatbots.
00:33:19.540 | This is the list over here.
00:33:21.220 | There's a much more detailed list in the paper itself.
00:33:26.820 | Having said that, of course, as mentioned,
00:33:29.940 | there are also additional papers that come out.
00:33:33.820 | And so there are also some, I would say,
00:33:36.940 | missing models that were not mentioned.
00:33:41.300 | So these are some of them that were not mentioned.
00:33:43.380 | So good to understand that this is always an evolving list.
00:33:48.220 | So what are some of the features
00:33:54.660 | that we see in these models?
00:33:57.540 | You've got things like your instruction tuning,
00:33:59.980 | which was talked about earlier.
00:34:02.340 | We notice that models are able to have
00:34:05.700 | increasingly high context windows.
00:34:07.980 | Now the context windows are in the six figures,
00:34:10.940 | sometimes even in the seven figures.
00:34:12.740 | There are also other ways in which LLMs can be used.
00:34:19.020 | I think a very popular one is RAG.
00:34:22.420 | So there are, I would say,
00:34:24.180 | beyond just your general-purpose use,
00:34:25.740 | you can always fine-tune them for very specific purposes
00:34:28.540 | or purposes that are very specific
00:34:31.060 | to maybe your own corpus or your own knowledge base.
00:34:35.700 | So that's essentially what we're doing over here.
00:34:37.740 | Other topics for the reader to explore.
00:34:42.180 | So essentially, what we're doing over here
00:34:44.260 | is we're talking about ways...
00:34:46.780 | Actually, most of these topics over here,
00:34:49.020 | if you look at them,
00:34:50.060 | are about parameter-efficient fine-tuning.
00:34:52.980 | So things like quantization,
00:34:55.580 | where let's say instead of representing a number in 32-bit,
00:34:59.780 | I represent my number in 8-bit or 4-bit
00:35:02.220 | and see if I can still maintain the model accuracy.
00:35:06.180 | Generally, the model accuracy will go down.
00:35:08.300 | But the thing is,
00:35:09.140 | if you're able to get lighter models, smaller models,
00:35:12.100 | that's actually very useful.
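A minimal sketch of symmetric 8-bit quantization, assuming a single scale for the whole tensor (real schemes often use per-channel or per-group scales):

```python
import numpy as np

def quantize_int8(w):
    """Store weights as int8 plus one float scale, instead of 32-bit floats."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize(q, scale)).max())   # small rounding error, 4x less storage
```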
00:35:14.060 | Multi-modal LLMs that we talked about earlier
00:35:16.300 | that take in things like images and video as inputs.
00:35:20.820 | Adapter tuning, essentially,
00:35:22.380 | is when you just add another layer on top of the output
00:35:26.180 | and then you perform fine-tuning on it.
00:35:28.140 | There are more sophisticated ways to use it
00:35:32.820 | where your adapter is used in two or more models.
00:35:37.820 | That means the same adapter is being used
00:35:41.140 | in, let's say, a general model
00:35:42.940 | and also, let's say, a GPT model.
00:35:46.260 | I've seen that in the talk
00:35:50.820 | about embedding representation learning.
00:35:53.460 | Mixture of experts,
00:35:56.500 | something that we've seen before,
00:35:57.500 | so where instead of just having
00:36:01.140 | one feed-forward layer over here,
00:36:04.220 | you're actually able to route them
00:36:05.580 | to different feed-forward layers.
00:36:08.260 | And then from there,
00:36:09.780 | you'll be able to, in a sense,
00:36:13.420 | once you combine their outputs,
00:36:14.900 | leverage different,
00:36:18.340 | I would say, different vertical workflows of the model
00:36:22.420 | where each of the vertical workflows
00:36:24.460 | will learn different aspects.
00:36:27.460 | So that's essentially your MOE.
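A toy sketch of a mixture-of-experts layer with top-1 routing, with made-up sizes; real MoE layers also weight each expert's output by the router probability and add load-balancing losses, so treat this only as the routing idea.

```python
import numpy as np

def moe_layer(x, router_W, experts):
    """A router scores the experts for each token, and each token is sent
    through only its highest-scoring feed-forward expert."""
    logits = x @ router_W                       # (n_tokens, n_experts)
    chosen = logits.argmax(axis=-1)             # top-1 expert per token
    out = np.empty_like(x)
    for i, e in enumerate(chosen):
        W1, W2 = experts[e]
        out[i] = np.maximum(x[i] @ W1, 0) @ W2  # the chosen expert's feed-forward pass
    return out, chosen

d_model, n_experts, n_tokens = 8, 4, 5
x = np.random.randn(n_tokens, d_model)
router_W = np.random.randn(d_model, n_experts)
experts = [(np.random.randn(d_model, 16), np.random.randn(16, d_model)) for _ in range(n_experts)]
out, chosen = moe_layer(x, router_W, experts)
print(out.shape, chosen)                        # (5, 8) and each token's expert index
```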
00:36:30.180 | Low-rank adaptation, or LORA,
00:36:34.020 | this looks very popular recently.
00:36:36.500 | Essentially, what we're trying to do
00:36:38.700 | is if you're able to reduce the number of parameters
00:36:43.700 | during your gradient updates,
00:36:46.300 | then you actually use less compute
00:36:48.140 | to get your fine-tuned models.
00:36:51.420 | And the idea behind it is that instead of,
00:36:54.660 | so let's say over here,
00:36:55.780 | instead of calculating gradients for 64 parameters
00:37:00.500 | for an eight-by-eight matrix,
00:37:02.500 | what you can do is that you can decompose this matrix
00:37:05.140 | into an eight-by-two and a two-by-eight matrix.
00:37:09.300 | The key thing over here
00:37:10.260 | is that when you multiply this by this,
00:37:12.140 | the resultant is an eight-by-eight matrix again,
00:37:15.060 | so the effective weights are still 64,
00:37:17.340 | but you only train 8 times 2 plus 2 times 8, which is 32 parameters.
00:37:18.780 | If you're able to decompose it
00:37:20.980 | with weights in a smaller dimension,
00:37:22.860 | essentially this idea is that,
00:37:24.260 | and of course, how small it is
00:37:25.820 | is a hyperparameter for you to tune,
00:37:28.300 | then the cost of fine-tuning will go down.
00:37:32.820 | So essentially that's what we're doing over here.
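A minimal sketch of that low-rank adaptation idea with the eight-by-eight example from above:

```python
import numpy as np

# Freeze the original 8x8 weight matrix W (64 parameters) and learn only a
# rank-2 update A @ B, which is 8*2 + 2*8 = 32 parameters; at inference the
# effective weight is W + A @ B.
d, r = 8, 2
W = np.random.randn(d, d)          # frozen pre-trained weights
A = np.random.randn(d, r) * 0.01   # trainable, d x r
B = np.zeros((r, d))               # trainable, r x d (zero init so training starts at W)

def lora_forward(x, W, A, B):
    return x @ (W + A @ B)          # same output shape, far fewer trainable parameters

x = np.random.randn(3, d)
print(lora_forward(x, W, A, B).shape)   # (3, 8)
```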
00:37:35.100 | Yeah, so that's pretty much it for this segment.
00:37:41.300 | Moving to the next few segments,
00:37:44.980 | this next segment essentially is about
00:37:47.020 | your datasets that can be used for training,
00:37:48.820 | at least the public ones that we see.
00:37:51.380 | We've got, these are things that we've seen before,
00:37:53.940 | Wikipedia datasets, C4 dataset, Common Crawl,
00:37:57.140 | which is used for your,
00:37:58.180 | I would say more general-purpose models.
00:38:00.500 | And then, of course, you've got some datasets
00:38:04.420 | that can be used for very task-specific models,
00:38:07.620 | for example, code generation.
00:38:09.820 | You've got datasets that are used
00:38:12.460 | for instruction fine-tuning,
00:38:14.700 | and you've also got datasets that's used for alignment.
00:38:17.180 | So essentially what happens is that
00:38:19.940 | if you go to maybe, say, TensorFlow datasets
00:38:24.220 | or HuggingFace, you'll be able to download them,
00:38:26.460 | and then you'll be able to observe
00:38:29.260 | these datasets by itself.
00:38:31.540 | And if, let's say, you want to maybe, say,
00:38:34.580 | fine-tune a model for specific use,
00:38:36.540 | these are actually useful,
00:38:40.300 | I would say, templates or schemas that you can use
00:38:43.380 | to prepare your datasets so that you can do fine-tuning.
00:38:46.180 | So this one is for instruction tuning,
00:38:49.620 | and this one is for getting the model
00:38:53.540 | to display behavior that's more aligned to our use.
00:38:58.540 | So naturally, this one, I'm okay to share some examples,
00:39:02.420 | but this one, you can go ahead and click on the link.
00:39:04.220 | You'll be able to see the kind of examples
00:39:06.300 | that's over there.
00:39:07.140 | So let's say we've done our training and fine-tuning.
00:39:11.580 | We found a way to update our parameters
00:39:15.980 | in a more efficient way.
00:39:16.820 | The final part is, of course, evaluation.
00:39:19.620 | So I think I'll cover, at a high level,
00:39:24.300 | two classes of model evaluations.
00:39:29.220 | You've got things like your single-task evaluations,
00:39:31.380 | so very popular ones would be things like SQuAD,
00:39:33.860 | StoryCloze, MATH, MNLI,
00:39:37.060 | which is for question answering,
00:39:41.460 | understanding context of words
00:39:44.140 | where you're filling in the blanks,
00:39:46.700 | answering math questions, so mathematical reasoning,
00:39:51.700 | and this is, I believe, natural language inference.
00:39:57.820 | So essentially, whether the two sentences are,
00:40:01.580 | they follow each other or not, right?
00:40:04.620 | Essentially, whether the next sentence
00:40:09.260 | logically follows the first sentence.
00:40:11.100 | And also things like TruthfulQA,
00:40:14.380 | which validates whether
00:40:17.220 | the model outputs facts
00:40:19.700 | instead of, maybe, other kinds of
00:40:22.500 | trivia that's not true, not truthful.
00:40:27.260 | So these are some of, I would say,
00:40:29.260 | your single-task evals,
00:40:34.260 | and then you've got your multi-task evaluation,
00:40:37.180 | things like GLUE, things like MMLU,
00:40:39.780 | things like SuperGLUE,
00:40:41.500 | and of course, there are a couple more
00:40:42.860 | that's inside the list.
00:40:44.900 | So what happens over here is that
00:40:46.900 | if we just take a look at GLUE,
00:40:49.380 | this is divided into multiple individual evaluations,
00:40:56.980 | so you've got things like natural language inference,
00:41:00.660 | you've got things like whether a sentence
00:41:03.860 | makes sense or not, so that's your CoLA,
00:41:06.020 | you've got things like semantic similarity.
00:41:08.420 | So essentially, that's what's going on over here.
00:41:11.660 | MMLU, which is one of the more popular ways
00:41:14.980 | of doing benchmarks right now,
00:41:17.820 | so there's a big number of knowledge intensities
00:41:21.140 | that you can see over here.
00:41:22.500 | And of course, SuperGLUE,
00:41:25.620 | which is the second generation from GLUE,
00:41:27.820 | which has more, I would say,
00:41:31.060 | questions that mimic human behavior more,
00:41:35.380 | or things that are a bit trickier
00:41:36.740 | for models to understand.
00:41:39.300 | So that is the part on evaluations.
00:41:42.500 | So different kinds of applications,
00:41:47.220 | I think we've seen many kinds.
00:41:48.500 | So beyond just things like what's in the list,
00:41:51.860 | we also see things like music generation,
00:41:53.780 | we see things like video generation,
00:41:56.300 | and naturally, what happens is that for each of them,
00:41:59.620 | there are also certain guardrails that need to be placed.
00:42:02.540 | So what are some examples?
00:42:03.740 | If let's say for a music generation model,
00:42:08.180 | it is important to ensure that when we submit lyrics
00:42:10.580 | for the model to output,
00:42:13.380 | these lyrics shouldn't be under any kind of copyright.
00:42:16.740 | If not, then there might be legal consequences.
00:42:19.140 | So this is something that I would say,
00:42:23.260 | depending on the domain that you're in,
00:42:25.580 | you will be looking at models
00:42:26.860 | that are very specific to your domain.
00:42:28.900 | So finally, last part, before we go into Q&A,
00:42:33.900 | what are some of the things that we see models exhibit?
00:42:38.060 | So things like biases are very common,
00:42:41.340 | stereotypes are very common.
00:42:43.020 | And I guess the reason why is based
00:42:46.860 | on some of the training data that we see.
00:42:48.340 | If the training data exhibits a certain behavior,
00:42:50.780 | naturally, we see the model exhibiting this behavior.
00:42:54.700 | So that's, I think, one of the things
00:42:56.260 | that we want to be aware of.
00:42:57.860 | And also things like models memorizing private content.
00:43:07.700 | So if let's say I've got a GPT model
00:43:10.060 | and I type in a particular prompt,
00:43:12.900 | and this GPT model sees some email,
00:43:15.420 | and then it outputs some sort of phone number
00:43:17.980 | that is supposed to be private.
00:43:19.700 | And let's say a user takes this and does a search.
00:43:23.060 | So essentially, the idea is that
00:43:24.500 | this is the output from the model.
00:43:26.300 | And you can see there's actually some information over here
00:43:28.540 | that might be private.
00:43:30.940 | You might have a phone number
00:43:31.860 | that's not supposed to be exposed to the public.
00:43:35.580 | And then maybe someone searches for the phone number
00:43:37.740 | and there you might have an additional contact
00:43:40.900 | that maybe you can use, right?
00:43:43.020 | So these are some of the things that we want to,
00:43:46.940 | I would say, be aware of
00:43:49.220 | when it comes to the component about human alignment.
00:43:54.140 | So on top of the three H's,
00:43:56.220 | being helpful, being harmless, and being honest,
00:43:59.580 | you also want to ensure that your models
00:44:01.980 | do not leak out or do not learn
00:44:05.020 | certain private information.
00:44:07.260 | And generally, what happens is that
00:44:10.820 | there are teams that are behind
00:44:13.540 | all these ways of conducting adversarial attacks.
00:44:15.820 | You can call them white hat attacks,
00:44:17.380 | or what people like to call red teaming these models.
00:44:20.420 | So essentially trying to generate adversarial prompts
00:44:25.140 | or find ways such that the model will leak out something,
00:44:28.820 | and then if they're able to do so, they will fix it.
00:44:31.500 | I think there's a few interesting articles
00:44:34.300 | about that recently.
00:44:35.340 | So essentially, that is the paper.
00:44:40.580 | It sounds like a firehose of information.
00:44:44.380 | So if there's anything,
00:44:48.180 | any topic you want to deep dive into,
00:44:49.900 | feel free to take a look at the paper
00:44:52.060 | or take a look at this
00:44:53.980 | and go into the topics that you're looking at.
00:44:56.020 | So if let's say I want to just do something
00:44:57.420 | on parameter efficient tuning,
00:44:58.660 | feel free to just go into that segment.
00:45:00.540 | So I've linked all the papers over here.
00:45:04.620 | I've also linked some of the external sources
00:45:07.100 | that have been useful for me over here.
00:45:09.660 | So yeah, feel free to take this
00:45:12.740 | as a reference guide for yourself.
00:45:15.260 | And I think with that, I've come to the end,
00:45:18.380 | and I'm leaving about 10 more minutes if there's any Q&As.
00:45:21.380 | So Ivan?
00:45:23.100 | - Yeah, dude.
00:45:23.940 | Thanks so much for giving such a detailed walkthrough.
00:45:28.340 | I think there was a question by Bonan in the chat
00:45:31.740 | about a paralyzation,
00:45:33.420 | like what exactly is the benefit of using a transformer
00:45:36.340 | versus a, I guess in this case, a RNN, RSTM.
00:45:41.340 | Do you want to maybe start with that?
00:45:42.900 | Like how the paralyzation works?
00:45:44.660 | - Let me just take a look.
00:45:47.940 | Okay, so if you think about it,
00:45:50.220 | let's look at this example over here.
00:45:56.380 | One second, let me just, okay.
00:45:58.260 | So the idea over here is if you think about
00:46:03.260 | the traditional RNNs, what happens is that,
00:46:07.540 | let's say I've got a sequence of 10 tokens,
00:46:09.980 | and I want to calculate the hidden state
00:46:12.100 | of the entire sequence, in this case,
00:46:14.420 | the sequence, the hidden state of the 10th token.
00:46:17.020 | There is a dependency on the 9th hidden state,
00:46:20.460 | and the 9th hidden state
00:46:23.340 | in turn depends on
00:46:24.660 | the 8th hidden state,
00:46:25.780 | and the 8th on the 7th,
00:46:27.460 | so on and so forth.
00:46:29.740 | And essentially, that's what's going on over here,
00:46:31.420 | where if, let's say, I want to calculate the second state,
00:46:35.620 | the second hidden state of the second token in the sequence,
00:46:38.780 | I need to calculate the first,
00:46:40.140 | I need to calculate the first hidden state as an input.
00:46:42.700 | So that goes back to either your RNNs or LSTMs,
00:46:47.220 | where the hidden state is calculated,
00:46:52.300 | the inputs to the hidden state are the hidden state
00:46:54.980 | of the previous token, and also the current input token.
00:46:59.580 | So the thing is that because there is this dependency,
00:47:03.540 | there is this reliance where
00:47:05.340 | the future hidden states rely on the previous hidden states,
00:47:09.500 | and because of that, there is no ability to parallelize
00:47:13.060 | along the sequence, from a wall-clock perspective.
00:47:15.780 | And therefore, you see in the first line that the forward
00:47:17.620 | and backward passes are O of sequence length.
00:47:19.220 | That means however long your sequence is,
00:47:22.580 | you have to do that number of sequential calculations.
00:47:25.100 | Does it make sense?
00:47:26.060 | - I think it makes sense too.
00:47:28.140 | At least the way I like to think about it is that,
00:47:29.860 | let's say I had five sentences,
00:47:31.740 | and they're not the same length.
00:47:33.260 | In order for me to get the final hidden state,
00:47:35.100 | before I can start evaluating its predictions,
00:47:37.100 | I need to run five separate passes,
00:47:40.220 | stepping through each character in each sequence,
00:47:42.180 | or each token in this case,
00:47:43.660 | whereas for a transformer itself,
00:47:44.660 | I can just pad everything to the same length,
00:47:46.780 | and pass it through in one time step.
00:47:48.580 | So I can get everything out in one output step,
00:47:52.140 | one forward pass.
00:47:53.900 | At least that's my understanding of the parallelizability.
00:47:56.860 | - Yeah, that makes sense.
00:47:59.660 | I agree.
00:48:01.180 | I would say for this diagram,
00:48:03.980 | we think of it during the training stage.
00:48:05.620 | Naturally, during the inference stage,
00:48:08.220 | we still have to,
00:48:09.260 | there is still this need of passing the current token
00:48:13.100 | back into the transformer,
00:48:16.300 | the model, to get the next token.
00:48:18.460 | - For sure, for sure.
00:48:20.420 | I was talking more about the training stage,
00:48:21.980 | but I think in terms of inference,
00:48:23.820 | you incur the same additional cost
00:48:26.180 | with the transformer for each additional token
00:48:29.580 | that an RNN does, I think, ultimately.
00:48:32.100 | Actually, for me, one of the questions
00:48:34.060 | I had about the classification in this paper was
00:48:37.300 | that of prefix versus full language modeling.
00:48:40.060 | Because if you look at the example that you give in the text,
00:48:42.900 | I think they give the example of,
00:48:45.260 | you have just a cute little example,
00:48:47.700 | which is, if it's full language modeling,
00:48:50.100 | they give the word "may"
00:48:51.180 | and then you output the word, "the force be with you."
00:48:53.500 | If it's prefix language modeling,
00:48:54.980 | it's "may the force"
00:48:55.860 | and then the model is asked to predict "be with you."
00:48:59.500 | But that just both seems like the same thing.
00:49:03.020 | Because my understanding of prefix language modeling
00:49:05.020 | was that, oh, we're gonna specify a specific token,
00:49:07.740 | for example, like a bracket classify,
00:49:10.100 | bracket sentiment,
00:49:11.620 | sort of like in T5.
00:49:12.820 | And the model learns that if it sees
00:49:14.380 | this specific prefix,
00:49:16.220 | then it should perform differently.
00:49:19.460 | And so that was what I was a bit confused by
00:49:21.340 | in this specific paper.
00:49:22.780 | - That makes sense.
00:49:26.940 | I didn't look at the paper in particular,
00:49:32.300 | so it's a little bit hard to comment on that.
00:49:35.620 | I understand when you are saying
00:49:37.340 | that this and this really doesn't show a lot of difference.
00:49:42.940 | - I think what I can comment is that
00:49:46.060 | generally in full language modeling,
00:49:47.820 | what happens is that you...
00:49:49.700 | Okay, this is, of course,
00:49:53.340 | the encoder-decoder phase of things
00:49:56.060 | beyond the GPT stuff.
00:49:57.060 | So generally what happens is that for full language,
00:50:00.620 | you generate everything.
00:50:02.660 | So in fact, maybe in this case,
00:50:05.260 | you might just start with a beginning of sentence token
00:50:08.060 | and then you take maybe some hidden state
00:50:09.540 | and then you generate from there.
00:50:10.780 | And then you autoregressively sample from there,
00:50:12.900 | which is different from the prefix language modeling
00:50:15.300 | where you are given the beginning of sentence token,
00:50:18.940 | actually, and then a series of tokens
00:50:20.420 | before you do your generation.
00:50:22.140 | And then, of course, when you do your learning,
00:50:23.740 | you are learning based on that particular sequence of text
00:50:26.620 | more than just the beginning of sentence.
00:50:29.260 | I'm not very sure.
00:50:30.100 | I think this one, we've got to take a look at the paper
00:50:32.820 | to fully understand.
00:50:34.220 | It was also the guy who was the author of the T5 paper,
00:50:37.460 | I believe.
00:50:38.380 | - Oh, really?
00:50:39.220 | The guy who did this paper?
00:50:41.100 | - Yes, I think his name is Colin.
00:50:44.060 | Yeah, but let's go and check, yeah.
00:50:46.500 | - Yeah, and I think we can talk about this some other time.
00:50:48.980 | It was just something that confused me quite a good amount.
00:50:53.100 | I guess the other thing that surprised me
00:50:54.460 | was just learned positional encodings.
00:50:57.740 | 'Cause when we covered the original transformer paper,
00:51:00.740 | I think there was a section where they said,
00:51:02.820 | oh, we experimented with learned
00:51:04.900 | and frozen positional encodings.
00:51:07.500 | But it seems like you mentioned that newer papers
00:51:10.980 | are starting to use learned positional encodings instead
00:51:13.260 | and they show an increase in performance.
00:51:15.980 | And I was wondering if maybe,
00:51:18.100 | what sort of change, in your opinion,
00:51:20.700 | to make this happen, if that makes sense?
00:51:22.780 | - To be very honest,
00:51:27.860 | I'm not very sure what were the changes that inspired it.
00:51:36.620 | Maybe the way I would comment is that,
00:51:40.980 | once they are able to do so,
00:51:42.780 | they are able to
00:51:47.500 | efficiently represent
00:51:49.740 | an input with a much longer context window.
00:51:55.020 | So I think that probably what happened
00:51:57.020 | was that there was innovation in that space.
00:51:59.340 | Because the thing is that if, let's say,
00:52:00.500 | I've got maybe say 500 tokens or 1,000 tokens,
00:52:04.100 | there might be a limitation on how you
00:52:06.500 | model the positions,
00:52:12.820 | because maybe the positions might all be
00:52:14.220 | just clustered in one area.
00:52:15.660 | But I think once they have figured out how to do so,
00:52:17.980 | that's when they open up the window
00:52:19.460 | to longer context window.
00:52:20.740 | So maybe how they learn positional encodings
00:52:25.740 | might be one of the tricks that they use
00:52:28.820 | to have longer context windows.
00:52:30.860 | But again, I might be wrong.
00:52:32.420 | I didn't really go into the details of this part of research.
00:52:36.260 | - Yeah, for sure, for sure.
00:52:37.940 | Yeah, I was just wondering about that.
00:52:39.500 | 'Cause that was just something that I was intrigued by.
00:52:42.820 | I think we're almost at time.
00:52:44.740 | If anyone has any other questions,
00:52:46.380 | you can drop them in the chat.
00:52:48.340 | Maybe we can just end it here.
00:52:51.620 | Okay, it seems like there's no more questions.
00:52:56.780 | So anyway, I think moving on to next week's paper,
00:53:00.380 | I was thinking of doing the DeepSeekMoE paper.
00:53:03.660 | That was one thing I'd like to present, to propose, sorry.
00:53:07.140 | 'Cause I thought it's super interesting,
00:53:09.940 | and there are a whole bunch of these ideas
00:53:12.020 | that they're experimenting with,
00:53:13.740 | like always on experts, randomly routed experts.
00:53:17.660 | So I thought it's a good paper.
00:53:19.220 | So as usual, if anyone wants to present on the paper itself
00:53:26.460 | for the upcoming week,
00:53:27.980 | then happy to help you with it.
00:53:30.540 | I think you generally learn a lot more
00:53:32.540 | when you actually do the paper.
00:53:34.500 | I learned at least 10 times more
00:53:36.620 | if I actually had to sit down and present the paper.
00:53:39.980 | So I think, as usual, I'll probably just drop a thread
00:53:44.780 | inside the paper club.
00:53:46.660 | And then if you guys have any other papers
00:53:48.100 | that you'd like to suggest,
00:53:49.460 | you can add it onto the thread,
00:53:50.660 | and then we can all vote for that.
00:53:52.340 | Yeah, do you have any papers in mind, Brian?
00:53:54.020 | Anyone have any other papers that you guys wanna read?
00:53:57.660 | - Hmm, I'll take a look, I'll take a look at them.
00:54:01.660 | There are some, I would say,
00:54:04.340 | that are very open-source models.
00:54:06.740 | So we'll see, maybe one day next month,
00:54:12.260 | I can take a look at them, yeah.
00:54:14.060 | - Okay, cool, sounds good to me, yeah.
00:54:16.340 | Then otherwise, thank you so much, guys,
00:54:18.180 | for tuning in today's session.
00:54:19.540 | Really appreciate it.
00:54:20.940 | And yeah, looking forward to next week, guys.
00:54:23.220 | Ciao.
00:54:24.540 | - Thanks, everybody, see you guys, bye-bye.
00:54:26.460 | Have a good evening.