A Comprehensive Overview of Large Language Models - Latent Space Paper Club
00:00:02.640 |
So, hey guys, thanks so much for coming by the paper club. 00:00:06.000 |
As usual, this is a paper club we run in Asia, 00:00:11.800 |
So today we're just recording it for the first time, 00:00:22.880 |
You can drop in the chat, which you can access 00:00:37.080 |
So today, we'll be going through the comprehensive overview 00:00:43.920 |
But on top of that, I think what we wanna do also 00:00:46.720 |
is just to share the reason why attention actually came about 00:00:54.680 |
So we'll have a little bit of a history lesson on that, 00:01:03.440 |
talking about what has happened post the Transformers era. 00:01:17.800 |
So I'll use the first part to talk about pre, 00:01:20.580 |
I would say GPT, and then I'll use the second link 00:01:28.880 |
So essentially, what models have been trying to do recently 00:01:42.560 |
you want to find out the next word in the prompt. 00:01:45.320 |
In this case, it can be question and answers. 00:02:01.220 |
given the sequence over here up to time equals to T, 00:02:24.000 |
beyond just thinking about looking at what the sequence is, 00:02:29.660 |
it's good to think about what kind of use case 00:02:34.920 |
when it comes to thinking about the evaluation metrics 00:02:42.000 |
- Your screen just kind of like cut out for you. 00:03:11.580 |
when we are using the different models, right? 00:03:21.660 |
it's also useful to think from a linguistic perspective 00:03:44.980 |
because if you output something that's false, 00:03:48.060 |
then your language model is probably not truthful. 00:03:51.480 |
Things like sentiment, which we have seen before. 00:03:58.460 |
So in this case, if you look at the sentence, 00:04:03.860 |
Standing next to Iroh, Zuko pondered his destiny. 00:04:17.900 |
In this case, Zuko is currently in the kitchen, 00:04:30.260 |
we observe models are learning in terms of patterns. 00:05:08.060 |
but also we want to condition it on the source sentence. 00:05:13.620 |
So that is essentially what translation does. 00:05:33.100 |
And one of the key things that we will notice 00:05:48.340 |
corresponds to the first word in the target sentence. 00:05:59.260 |
crisscross relationship where you might need to, 00:06:06.180 |
and the third word over here corresponds to the second. 00:06:11.980 |
we want to find a way to be able to model this relationship. 00:06:17.940 |
And this relationship has actually been studied before 00:06:35.380 |
and the target sentence on the bottom, on the left, 00:06:40.100 |
then if we've got this very linear one-to-one relationship, 00:06:46.860 |
then we will see that there will be a white box over here 00:06:53.220 |
indicating that the first word corresponds to the first word, 00:07:00.300 |
But as you can see, just from English to French, 00:07:03.980 |
there is this idea where words that are later in the sequence 00:07:08.980 |
correspond to words that are earlier, and vice versa. 00:07:29.340 |
So naturally, when we look at the encoder-decoder blocks, 00:07:44.100 |
contains all the information of the entire sentence, 00:07:48.300 |
but there's this information bottleneck problem, 00:07:51.220 |
which means that if let's say this is a longer sentence, 00:07:55.260 |
the last hidden state might not contain information 00:08:00.620 |
And therefore, there's this idea of attention 00:08:11.740 |
the decoder during the language generation component 00:08:49.780 |
that has been implemented in the encoder-decoder 00:08:54.380 |
kind of paradigm or the kind of architecture. 00:09:05.100 |
or we calculate these individual hidden states, 00:09:08.060 |
we realize that it has to be calculated sequentially. 00:09:19.060 |
after the first hidden state is being output. 00:09:22.140 |
And the third hidden state can only be calculated 00:09:24.980 |
after the second hidden state has been output. 00:09:33.940 |
where there is a dependency of the previous state? 00:10:01.700 |
one of the building blocks of the transformer architecture. 00:10:39.860 |
just adding a feed-forward layer on top of it. 00:10:43.900 |
if you're just calculating the key-query-value attention, it's still linear, 00:10:52.220 |
because you're just getting a weighted sum of the values 00:10:56.900 |
So we want to add a layer of non-linearity to it, 00:11:00.220 |
which is taken care of by the feed-forward network. 00:11:10.180 |
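A minimal numpy sketch of that point (not code from the talk, and all the sizes are toy values I've made up): attention on its own is just a weighted sum of the value vectors, so the position-wise feed-forward layer is what supplies the non-linearity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: the output is a weighted sum of the values."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (seq_len, seq_len) similarity scores
    return softmax(scores) @ V                   # weighted sum of value vectors

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise feed-forward layer: this is where the non-linearity comes in."""
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2   # ReLU between two linear maps

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32                # toy sizes, picked arbitrarily
X = rng.normal(size=(seq_len, d_model))          # one embedding vector per token
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = feed_forward(self_attention(X, Wq, Wk, Wv), W1, b1, W2, b2)
print(out.shape)                                 # (4, 8): one vector per position
```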
you don't want to let the model see the future tokens, 00:11:14.260 |
and essentially that's when masking comes into play, 00:11:20.620 |
in the decoder architecture later down the road. 00:11:24.660 |
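And a rough illustration, again with made-up numbers, of how that masking is typically done: future positions get minus infinity added to their scores, so the softmax gives them zero weight.

```python
import numpy as np

seq_len = 5
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))

# Causal mask: position i may only attend to positions <= i.
# Entries above the diagonal are -inf, everything else is 0.
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
masked = scores + mask                            # applied before the softmax

weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))                       # upper triangle is all zeros: no peeking ahead
```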
So a couple of things on top of what we are talking about 00:11:30.940 |
in terms of the language modeling component for transformers. 00:11:38.300 |
So this is when you have things like tokenization, 00:11:42.620 |
So essentially, what are we trying to solve over here? 00:11:54.100 |
that can be things like a variation of an existing word, 00:11:57.740 |
in this case, you add many A's in between the word, 00:12:40.500 |
And essentially what goes on with byte-pair encoding 00:12:51.740 |
represent either prefixes or suffixes of a word, 00:13:00.300 |
So if you see over here, you've got this T-A-A, 00:13:05.780 |
and A-A-A, and anything after that, and S-T-Y. 00:13:17.820 |
and therefore we are able to represent it over here. 00:13:28.300 |
So essentially that's the idea of sub-word models, 00:13:35.660 |
byte-pair encoding, sentence piece, word piece, 00:13:38.460 |
That's the problem that they're trying to solve. 00:13:53.620 |
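Here's a toy sketch of the merge step behind byte-pair encoding, not a production tokenizer; the little corpus and its frequencies are invented. The loop is: count the most frequent adjacent pair of symbols, merge it into one symbol, repeat.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (split into characters) -> frequency.
words = {tuple("tasty"): 5, tuple("taaasty"): 2, tuple("toast"): 3}
for _ in range(5):
    pair = most_frequent_pair(words)
    print("merging", pair)
    words = merge_pair(words, pair)
print(list(words))   # words are now sequences of learned sub-word units
```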
is essentially replacing the recurrent neural network blocks 00:13:58.740 |
So when we talk about recurrent neural networks, 00:14:10.740 |
are the three types of dominant architectures. 00:14:15.660 |
and examples of this would be things like BERT, 00:14:23.540 |
Encoder-decoder models, where we've seen earlier, 00:14:26.020 |
we have an encoder that maps your sequence into 00:14:36.020 |
sampling, or your autoregressive sampling of tokens 00:14:46.180 |
things like GPT-2, GPT-3, they are all there. 00:14:48.380 |
So you essentially learn the patterns of the language, 00:14:52.660 |
and then you directly just do your autoregressive 00:15:36.700 |
or for me, what I did was I tried to understand 00:15:39.140 |
what was the framework that the authors were using 00:15:46.820 |
then dividing it, and then giving us a reader 00:16:03.980 |
pick-your-own-adventure, pick-your-own-journey, 00:16:13.540 |
so that along the way, you'll be able to build 00:16:18.180 |
that foundational knowledge and then add layers on it, 00:16:22.300 |
At the end of the day, we all know that new models 00:16:24.540 |
are always developed and new models are always announced. 00:16:31.700 |
So, let's just go through the paper very quickly. 00:16:42.700 |
where we are seeing that large language models, 00:16:58.180 |
we saw that the performance of T5 on downstream tasks, 00:17:04.780 |
it can be your GLUE task, it can be your SQuAD task, 00:17:16.660 |
And you've seen, there are multiple experiments 00:17:21.180 |
that that's the better way, that's the better alternative. 00:17:27.900 |
they are able to perform zero-shot transfer learning 00:17:35.900 |
from the downstream task, GPT-3 is able to give the answer. 00:17:42.940 |
where we actually might not need to fine-tune 00:17:56.180 |
On top of it, they were able to show things like reasoning, 00:18:03.460 |
they were able to show things like in-context learning. 00:18:09.060 |
when you do things like chain of thought prompting. 00:18:15.420 |
given certain patterns, when they ask for a question 00:18:21.540 |
or ask for a task that follows a similar pattern 00:18:43.700 |
Can we look at things like better architectures? 00:18:45.340 |
Can we look at things like more efficient ways 00:18:52.540 |
Are there ways that we can represent these factors 00:19:02.740 |
So, that's essentially what things like architectures 00:19:07.140 |
come into play, quantization comes into play. 00:19:23.940 |
The datasets that have been used to train them, 00:19:29.420 |
What kind of evaluation tasks are they looking at? 00:19:58.860 |
we have covered some of these topics from the paper. 00:20:35.020 |
Naturally, things like masked language modeling 00:20:36.780 |
are things that we see in your encoder-only models. 00:21:01.500 |
so essentially, it's like a fill-in-the-blank 00:21:16.380 |
into this thing called prefix language modeling, 00:21:18.660 |
where you feed the model one part of the sequence, 00:21:25.740 |
to generate the remaining parts of the sequence. 00:21:33.940 |
is that when they do prefix language modeling, 00:21:37.940 |
they use this thing called a causal mask with prefix, 00:21:57.740 |
you still have that element of mask attention. 00:22:22.460 |
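A rough sketch of what such a causal-mask-with-prefix can look like, assuming 1 means "may attend" and 0 means "masked out": full bidirectional attention within the prefix, ordinary causal attention for the part being generated.

```python
import numpy as np

def prefix_lm_mask(seq_len, prefix_len):
    """1 = allowed to attend, 0 = masked.

    Bidirectional attention within the prefix, causal attention afterwards.
    """
    causal = np.tril(np.ones((seq_len, seq_len)))   # ordinary causal mask
    causal[:, :prefix_len] = 1                      # every position may see the whole prefix
    return causal

print(prefix_lm_mask(seq_len=6, prefix_len=3).astype(int))
```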
and you divide it by the standard deviation of the weights. 00:22:26.780 |
is that we're trying to achieve numerical stability 00:22:30.300 |
of the weights so that when you do a forward pass 00:22:36.180 |
you don't have numbers that go all over the place. 00:22:41.180 |
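A minimal sketch of that normalization, applied to each token's activation vector and including the usual learnable gain and bias; the values below are made up.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's activations: subtract the mean, divide by the std."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta   # learnable scale and shift

x = np.array([[1.0, 2.0, 3.0, 400.0]])              # one token with a wild activation
print(layer_norm(x, gamma=np.ones(4), beta=np.zeros(4)))   # values pulled back to a sane range
```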
Positional encoding was something we talked about earlier. 00:22:47.100 |
they had this idea of sinusoidal position representations. 00:23:06.620 |
as the index of the sequence increases, 00:23:21.300 |
It's augmented by a positional representation. 00:23:32.700 |
of encoding positional representations is not learnable 00:23:45.620 |
So therefore, it has been changed to something as simple as a learned positional embedding. 00:24:06.940 |
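For reference, a small sketch of the original sinusoidal scheme next to the learned alternative; the sizes are arbitrary.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # one row per position, added to the token embeddings

# The learned alternative is simply an embedding table of shape (max_len, d_model)
# indexed by position, whose rows are trained like any other parameter.
```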
or ways that can help with training or implementation. 00:24:11.900 |
So things like the libraries that we're using, 00:24:30.380 |
So amongst others, there's this idea of data parallelism 00:24:43.780 |
and then I run separate batches on top of them. 00:24:46.180 |
So let's say I've got a batch of, I don't know, 100,000. 00:24:57.980 |
Then the other 50,000 goes into the same model on the second GPU, 00:25:10.020 |
that you calculate the matrix multiplication steps 00:25:26.700 |
the multiplication with a column can be done concurrently 00:25:30.620 |
and therefore it splits it up such that the first, 00:25:39.300 |
the matrix on the right multiplies with the second column, 00:25:42.660 |
Or in this case, you concatenate the results together. 00:25:45.340 |
So that again also helps us with getting the results faster. 00:26:01.340 |
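A toy illustration of that column-wise split, with numpy arrays standing in for the two GPUs; the shapes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))        # activations
W = rng.normal(size=(8, 6))        # weight matrix to be split column-wise

W0, W1 = W[:, :3], W[:, 3:]        # "GPU 0" holds the first 3 columns, "GPU 1" the rest
Y0 = X @ W0                        # the two partial products are independent,
Y1 = X @ W1                        # so they could run concurrently on different devices
Y = np.concatenate([Y0, Y1], axis=1)

print(np.allclose(Y, X @ W))       # True: concatenating recovers the full result
```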
where it's a very smart way of utilizing memory. 00:26:06.140 |
So what happens is that instead of calculating, 00:26:10.300 |
instead of a series of steps that is very memory intensive 00:26:19.260 |
perform the softmax and then get your results, 00:26:21.980 |
they are doing some way of, they are iterating it 00:26:27.580 |
to calculate things like the softmax on the fly. 00:26:31.580 |
So essentially that's what they're doing over here. 00:26:33.300 |
So it's an optimization of using your high-bandwidth memory 00:26:41.380 |
Because in your GPUs, you've got very fast computation 00:26:56.300 |
as they go into things like your Mamba models. 00:27:04.980 |
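A rough sketch of that "softmax on the fly" trick, the online softmax that FlashAttention builds on: process the scores in chunks while keeping a running max and a running sum, instead of materializing the whole row at once. The real kernel also tiles the keys and values through fast on-chip SRAM and folds the second pass into the output accumulation; this only shows the numerics.

```python
import numpy as np

def online_softmax(scores, chunk=4):
    """Compute softmax over one row in chunks, keeping a running max and running sum."""
    m, s = -np.inf, 0.0                       # running max and running normalizer
    for start in range(0, len(scores), chunk):
        block = scores[start:start + chunk]
        m_new = max(m, block.max())
        s = s * np.exp(m - m_new) + np.exp(block - m_new).sum()  # rescale the old sum
        m = m_new
    # Second pass just emits the normalized weights for this demo.
    return np.exp(scores - m) / s

scores = np.random.default_rng(0).normal(size=10)
reference = np.exp(scores - scores.max())
reference /= reference.sum()
print(np.allclose(online_softmax(scores), reference))   # True
```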
So the second part in terms of the background 00:27:08.340 |
will be how do we adapt these models for specific tasks? 00:27:23.740 |
There's also things like instruction fine tuning 00:27:31.740 |
and then the model will fine tune its outputs based on that. 00:27:38.780 |
if let's say I ask GPT to explain the moon landing 00:27:48.340 |
there is this way where GPT outputs the steps in this way. 00:28:02.980 |
So that's how GPT-3 will output its sentences 00:28:05.780 |
but if we're able to do some sort of instruction fine tuning 00:28:13.660 |
then this is the kind of outputs that you can get. 00:28:17.740 |
And so that's the kind of variations of different models 00:28:22.380 |
that we can see when we download them from open source, 00:28:25.940 |
I say repositories, things like Hugging Face. 00:28:35.420 |
where you want to ensure that your model fulfills 00:28:40.420 |
what people call the three H's of model behavior. 00:28:47.940 |
your models will be honest and your models are helpful. 00:28:50.580 |
So things like harmlessness will be things like, 00:28:56.380 |
how can I let's say bake a cake with cyanide? 00:29:00.980 |
If let's say your model is not alignment tuned, 00:29:07.420 |
but let's say if you do alignment fine tuning 00:29:15.660 |
then the model will learn accordingly from that. 00:29:35.300 |
where essentially for each of the different outputs, 00:29:42.740 |
In this case, the reward is just a scalar value 00:29:56.300 |
based on this policy, you get to maximize the reward. 00:30:09.540 |
from the model outputs and you get some reward, 00:30:26.340 |
So that's essentially what reinforcement learning is. 00:30:28.940 |
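As a very stripped-down illustration of "adjust the policy so higher-reward outputs become more likely", here is a toy REINFORCE-style update over three candidate responses; real RLHF uses a learned reward model and PPO with a KL penalty, and every number here is invented.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(3)                         # toy "policy" over 3 candidate responses
rewards = np.array([0.1, 0.9, 0.3])          # scalar reward per response (made up)

for _ in range(200):
    probs = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(3, p=probs)               # sample a response from the policy
    baseline = probs @ rewards               # simple baseline to reduce variance
    grad = -probs
    grad[a] += 1.0                           # gradient of log pi(a) w.r.t. the logits
    logits += 0.5 * (rewards[a] - baseline) * grad   # REINFORCE-style update

print(np.round(np.exp(logits) / np.exp(logits).sum(), 2))  # mass shifts to the high-reward response
```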
So typically, I think back when reinforcement learning 00:30:37.780 |
was a hot thing, it was one course by itself. 00:30:50.380 |
you are more familiar with is things like prompting. 00:30:56.980 |
you just give a task and the model answers directly, 00:30:59.780 |
but also you have things like chain of thought prompting 00:31:03.380 |
where you give the model some examples before 00:31:14.140 |
So that's essentially what you have over here. 00:31:51.820 |
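For example, a few-shot chain-of-thought prompt might look roughly like this; the worked examples are invented.

```python
prompt = """Q: A shop has 3 boxes with 4 pens each. How many pens are there?
A: Each box has 4 pens and there are 3 boxes, so 3 x 4 = 12. The answer is 12.

Q: Sam read 5 pages a day for 6 days. How many pages did he read?
A: He read 5 pages per day for 6 days, so 5 x 6 = 30. The answer is 30.

Q: A train has 8 carriages with 20 seats each. How many seats are there?
A:"""
# The worked examples nudge the model to spell out its reasoning before answering,
# which is the essence of few-shot chain-of-thought prompting.
print(prompt)
```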
such that you'll be able to get the results that you want? 00:31:57.220 |
of what people like to call prompt engineering. 00:32:09.540 |
is a very brief list of some of the models that we have. 00:32:14.540 |
Now, keep in mind that a lot of these models, 00:32:18.980 |
the list is always updated every two or three weeks. 00:32:34.900 |
purposes that we see these models are trying to achieve 00:32:38.860 |
can be things like your general purpose ones. 00:32:41.700 |
So that's when you get a model to do all sorts of things. 00:32:45.180 |
There's also, of course, your multi-modal ones, 00:32:53.460 |
and then you maybe ask the model to decipher some fact 00:33:00.220 |
There's also, of course, your video-related ones. 00:33:03.300 |
There are some that are very specific to code generation. 00:33:08.260 |
Some that are very specific in the finance domain. 00:33:11.580 |
Some that are very specific in the science domain. 00:33:21.220 |
There's a much more detailed list in the paper itself. 00:33:29.940 |
there are also additional papers that come out. 00:33:41.300 |
So these are some of them that were not mentioned. 00:33:43.380 |
So good to understand that this is always an evolving list. 00:33:57.540 |
You've got things like your instruction tuning, 00:34:07.980 |
Now the context windows are in the six figures, 00:34:12.740 |
There are also other ways in which LLMs can be used. 00:34:25.740 |
you can always fine-tune them for very specific purposes 00:34:31.060 |
to maybe your own corpus or your own knowledge base. 00:34:35.700 |
So that's essentially what we're doing over here. 00:34:55.580 |
where let's say instead of representing a number in 32-bit, I represent it in fewer bits, 00:35:02.220 |
and see if I can still maintain the model accuracy. 00:35:09.140 |
if you're able to get lighter models, smaller models, 00:35:14.060 |
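A toy sketch of the simplest version of that idea, symmetric 8-bit quantization with a single scale factor per tensor; real post-training schemes are more involved.

```python
import numpy as np

def quantize_int8(w):
    """Map float32 weights to int8 plus one scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
print(q.nbytes, "bytes instead of", w.nbytes, "bytes")     # 4x smaller storage
print(np.abs(w - dequantize(q, scale)).max())              # small rounding error
```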
Multi-modal LLMs that we talked about earlier 00:35:16.300 |
that take in things like images and video as inputs. 00:35:22.380 |
is when you just add another layer on top of the output 00:35:32.820 |
where your adapter is used in two or more models. 00:36:14.900 |
so you will be able to leverage different, 00:36:18.340 |
I would say, vertical workflows of the model 00:36:38.700 |
is if you're able to reduce the number of parameters 00:36:55.780 |
instead of calculating gradients for 64 parameters 00:37:02.500 |
what you can do is that you can decompose this matrix 00:37:05.140 |
into an eight-by-two and a two-by-eight matrix. 00:37:12.140 |
you get back the 64 weights, 00:37:15.060 |
or the resultant is an eight-by-eight matrix, 00:37:32.820 |
So essentially that's what we're doing over here. 00:37:35.100 |
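A minimal sketch of that low-rank idea, with the same toy 8-by-8 example: the 8-by-2 and 2-by-8 factors give 32 trainable numbers instead of 64, yet their product is still a full 8-by-8 update on top of the frozen weights.

```python
import numpy as np

d, r = 8, 2
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pretrained weight (64 parameters)
A = rng.normal(size=(d, r)) * 0.01   # trainable low-rank factors:
B = np.zeros((r, d))                 # 8x2 + 2x8 = 32 parameters instead of 64

delta_W = A @ B                      # multiplied out, still an 8x8 update
W_adapted = W + delta_W              # applied on top of the frozen weights

print(A.size + B.size, "trainable parameters vs", W.size)
print(W_adapted.shape)
```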
Yeah, so that's pretty much it for this segment. 00:37:51.380 |
We've got, these are things that we've seen before, 00:37:53.940 |
Wikipedia datasets, C4 dataset, Common Crawl, 00:38:00.500 |
And then, of course, you've got some datasets 00:38:04.420 |
that can be used for very task-specific models, 00:38:14.700 |
and you've also got datasets that are used for alignment. 00:38:24.220 |
or HuggingFace, you'll be able to download them, 00:38:40.300 |
I would say, templates or schemas that you can use 00:38:43.380 |
to prepare your datasets so that you can do fine-tuning. 00:38:49.620 |
and this is for getting the model 00:38:53.540 |
to display behavior that's more aligned to our use. 00:38:58.540 |
So naturally, this one, I'm okay to share some examples, 00:39:02.420 |
but this one, you can go ahead and click on the link. 00:39:07.140 |
So let's say we've done our training on fine-tuning. 00:39:29.220 |
You've got things like your single-task evaluations, 00:39:31.380 |
so very popular ones would be things like SQuAD, 00:39:46.700 |
answering math questions, so mathematical reasoning, 00:39:51.700 |
and this is, I believe, natural language inference. 00:39:57.820 |
So essentially, whether the two sentences are, 00:40:34.260 |
and then you've got your multi-task evaluation, 00:40:49.380 |
this is divided into multiple individual evaluations, 00:40:56.980 |
so you've got things like natural language inference, 00:41:08.420 |
So essentially, that's what's going on over here. 00:41:17.820 |
so there's a big number of knowledge-intensive tasks 00:41:31.060 |
I would say questions that mimic human behavior more, 00:41:48.500 |
So beyond just things like what's in the list, 00:41:56.300 |
and naturally, what happens is that for each of them, 00:41:59.620 |
there are also certain guardrails that need to be placed. 00:42:08.180 |
it is important to ensure that when we submit lyrics 00:42:13.380 |
these lyrics shouldn't be under any kind of copyright. 00:42:16.740 |
If not, then there might be legal consequences. 00:42:28.900 |
So finally, last part, before we go into Q&A, 00:42:33.900 |
what are some of the things that we see models exhibit? 00:42:48.340 |
If the training data exhibits a certain behavior, 00:42:50.780 |
naturally, we see the model exhibiting this behavior. 00:42:57.860 |
And also things like models memorizing private content. 00:43:15.420 |
and then it outputs some sort of phone number 00:43:19.700 |
And let's say a user takes this and does a search. 00:43:26.300 |
And you can see there's actually some information over here 00:43:31.860 |
that's not supposed to be exposed to the public. 00:43:35.580 |
And then maybe someone searches for the phone number 00:43:37.740 |
and there you might have an additional contact 00:43:43.020 |
So these are some of the things that we want to, 00:43:49.220 |
when it comes to the component about human alignment. 00:43:56.220 |
being helpful, being harmless, and being honest, 00:44:07.260 |
And generally, what happens is that there are teams, 00:44:13.540 |
all these ways of conducting adversarial attacks. 00:44:17.380 |
or what people like to call red teaming these models. 00:44:20.420 |
So essentially trying to generate adversarial prompts 00:44:25.140 |
or find ways such that the model will leak out something, 00:44:28.820 |
and then if they're able to do so, they will fix it. 00:44:53.980 |
and go into the topics that you're looking at. 00:45:04.620 |
I've also linked some of the external sources 00:45:18.380 |
and I'm leaving about 10 more minutes if there's any Q&As. 00:45:23.940 |
Thanks so much for giving such a detailed walkthrough. 00:45:28.340 |
I think there was a question by Bonan in the chat 00:45:33.420 |
like what exactly is the benefit of using a transformer 00:46:14.420 |
the sequence, the hidden state of the 10th token. 00:46:29.740 |
And essentially, that's what's going on over here, 00:46:31.420 |
where if, let's say, I want to calculate the second state, 00:46:35.620 |
the second hidden state of the second token in the sequence, 00:46:40.140 |
I need to calculate the first hidden state as an input. 00:46:42.700 |
So that goes back to either your RNNs or LSTMs, 00:46:52.300 |
the inputs to the hidden state is the hidden state 00:46:54.980 |
of the previous token, and also the input token. 00:46:59.580 |
So the thing is that because there is this dependency, 00:47:05.340 |
the future hidden states rely on the previous hidden states, 00:47:09.500 |
and because of that, there is no ability to parallelize 00:47:13.060 |
from a sequence perspective, or from a wall-clock perspective. 00:47:15.780 |
And therefore, you see on the first line that the forward 00:47:17.620 |
and backward passes are O of sequence length. 00:47:19.220 |
That means however long your sequence length is, 00:47:28.140 |
At least the way I like to think about it is that, 00:47:33.260 |
In order for me to get the final hidden state, 00:47:35.100 |
before I can start evaluating its predictions, 00:47:44.660 |
I can just pad everything to the same length, 00:47:48.580 |
So I can get everything out in one output step, 00:47:53.900 |
At least that's my understanding of the parallelizability. 00:48:09.260 |
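To make that contrast concrete, here's a toy sketch (the sizes are invented): the RNN has an unavoidable loop because each hidden state needs the previous one, while self-attention covers every position in one batched matrix computation.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 4
X = rng.normal(size=(seq_len, d))

# RNN: an unavoidable sequential loop, since h_t depends on h_{t-1}.
Wh, Wx = rng.normal(size=(d, d)), rng.normal(size=(d, d))
h = np.zeros(d)
for t in range(seq_len):                 # step t cannot start before step t-1 finishes
    h = np.tanh(h @ Wh + X[t] @ Wx)

# Self-attention: one batched matrix computation handles every position at once.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                        # all seq_len positions computed together
print(out.shape)
```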
there is still this need of passing the hidden state 00:48:13.100 |
of the current token back into the transformer, 00:48:34.060 |
I had about the classification in this paper was 00:48:37.300 |
that of prefix versus full language modeling. 00:48:40.060 |
Because if you look at the example that you give in the text, 00:48:51.180 |
and then you output the word, "the force be with you." 00:48:55.860 |
and then the model is asked to predict "be with you." 00:48:59.500 |
But that just both seems like the same thing. 00:49:03.020 |
Because my understanding of prefix language modeling 00:49:05.020 |
was that, oh, we're gonna specify a specific token, 00:49:32.300 |
so it's a little bit hard to comment on that. 00:49:37.340 |
that this and this really doesn't show a lot of difference. 00:49:57.060 |
So generally what happens is that for full language, 00:50:05.260 |
you might just start with a beginning of sentence token 00:50:10.780 |
And then you autoregressively sample from there, 00:50:12.900 |
which is different from the prefix language modeling 00:50:15.300 |
where you are given the beginning of sentence token, 00:50:22.140 |
And then, of course, when you do your learning, 00:50:23.740 |
you are learning based on that particular sequence of text 00:50:30.100 |
I think this one, we've got to take a look at the paper 00:50:34.220 |
It was also the guy who was the author of the T5 paper, 00:50:46.500 |
- Yeah, and I think we can talk about this some other time. 00:50:48.980 |
It was just something that confused me quite a good amount. 00:50:57.740 |
'Cause when we covered the original transformer paper, 00:51:07.500 |
But it seems like you mentioned that newer papers 00:51:10.980 |
are starting to use learned positional encodings instead 00:51:10.980 |
I'm not very sure what were the changes that inspired it. 00:52:00.500 |
I've got maybe say 500 tokens or 1,000 tokens, 00:52:15.660 |
But I think once they have figured out how to do so, 00:52:32.420 |
I didn't really go into the details of this part of research. 00:52:39.500 |
'Cause that was just something that I was intrigued by. 00:52:51.620 |
Okay, it seems like there's no more questions. 00:52:56.780 |
So anyway, I think moving on to next week's paper, 00:53:00.380 |
I was thinking of doing the DeepSeek MoE paper. 00:53:00.380 |
That was one thing I'd like to present, to propose, sorry. 00:53:13.740 |
like always on experts, randomly routed experts. 00:53:19.220 |
So as usual, if anyone wants to present on the paper itself 00:53:36.620 |
if I actually had to sit down and present the paper. 00:53:39.980 |
So I think, as usual, I'll probably just drop a thread 00:53:54.020 |
Anyone have any other papers that you guys wanna read? 00:53:57.660 |
- Hmm, I'll take a look, I'll take a look at them. 00:54:20.940 |
And yeah, looking forward to next week, guys.