
A Comprehensive Overview of Large Language Models - Latent Space Paper Club


Transcript

All right, let's go. - All right, cool. So, hey guys, thanks so much for coming by the paper club. As usual, this is a paper club we run in Asia, where we go through one paper every week. So today we're just recording it for the first time, and we hope that you'll benefit from it.

So as usual, if you guys got any questions, you can either let me know, and I can invite you guys on stage. You can drop in the chat, which you can access by just clicking the button on the top, just the little message icon. And yeah, do you wanna take it away, Brian?

- Sure, thanks Ivan. So today, we'll be going through A Comprehensive Overview of Large Language Models. But on top of that, I think what we wanna do also is just to share the reason why attention actually came about before the Transformers paper. So we'll have a little bit of a history lesson on that, on why it was developed.

And then we will go through the paper, talking about what has happened post the Transformers era, which in fact is when the GPT era started. So I'm gonna begin. As you can see, the link has two parts. So I'll use the first part to talk about the pre-GPT era, and then I'll use the second link to talk about the paper proper.

So let's begin. So essentially, what models have been trying to do recently is this idea of language modeling, where given a previous sequence of words, which is your input or your prompt, you want to predict the next word in the sequence. In this case, it can be question answering.

So it can be modeled essentially by this probability of the next token, given the sequence of tokens. So you can see the next token, the one at position T plus one, conditioned on the sequence over here up to position T. And of course, the token at T plus one is a sample from the vocabulary that you have, which is basically your sub-words or the tokens that you have.
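To make that concrete, here's a rough sketch in Python; the toy vocabulary and the logits are made up, just to show what sampling from that next-token distribution looks like:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]      # toy vocabulary of tokens / sub-words
logits = np.array([1.2, 0.3, 2.1, 0.5, -0.4])   # made-up scores for the token at position T+1

# P(token at T+1 | tokens 1..T) is just a softmax over the vocabulary.
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()

next_token = np.random.choice(vocab, p=probs)   # sample the next token from the distribution
print(next_token)
```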

So why is this the case? I think for us who are doing NLP, beyond just thinking about looking at what the sequence is, what's being generated in the sequence, it's good to think about what kind of use case or what kind of tasks we are doing. And I'll say this is very useful when it comes to thinking about the evaluation metrics for each of these evaluation tasks.

So you can get things like this. - Your screen just kind of like cut out for you. - Is it? Okay, let me see. - Oh, wait, sorry. No, no, sorry. It works again, sorry, like that. It just suddenly disappeared for me. It works again, like that, yeah. - No problem.

So things like machine translation that we'll be talking about, we've got question and answer, summarization, so on and so forth. So essentially, good to think about what tasks we are trying to attack when we are using the different models, right? So while we think about language models as predicting the next token, it's also useful to think from a linguistic perspective what is being learned by these models.

So there's a list over here. I'll just go through a few that are useful. Things like facts, which is trivia. So these are the ones where you can say the penalty for getting the prediction wrong is relatively higher, because if you output something that's false, then your language model is probably not truthful.

Things like sentiment, which we have seen before. Things like reasoning. So in this case, if you look at the sentence: Iroh went to the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ___. So in this case, the idea is that there is some sort of spatial understanding.

The model needs some spatial understanding of the scene. In this case, Zuko is currently in the kitchen, so he left the kitchen. So these are some of the things that, from a syntactic perspective, or more broadly a linguistic perspective, we observe models are learning in terms of patterns.

So from language models, we talk about conditional language models. So essentially, the idea is that we are trying to generate a target sequence in a target sentence, given some sequence in the source sentence. So that is why you see over here that we are not just generating our yt, given some y1 to yt minus one, which is basically the sequence that has been generated by the model before, but also we want to condition it on the source sentence.

So that is essentially what translation does. You give, if you think about it, you give the model a source sentence, you pick the target language, and then you observe the model generate the sequence in the target language. So it's more than just language modeling, but it's also conditional. And one of the key things that we will notice in conditional language modeling is that we don't necessarily see that the first word in the source sentence corresponds to the first word in the target sentence.

So as you can see, this might be it, first word to first word, but from the second word onwards, you start to see that there is this sort of crisscross relationship, where maybe the second word over here corresponds to the third word, and the third word over here corresponds to the second.

So essentially the idea is that we want to find a way to model this relationship. And this relationship has actually been studied before under this idea of alignment. If you think about it, let's say we've got the source sentence on the top and the target sentence down the left. Then if we've got this very linear one-to-one relationship, this monotonic relationship, we will see white boxes running from the top left to the bottom right, indicating that the first word corresponds to the first word, the second word corresponds to the second word, and so on and so forth.

But as you can see, just from English to French, there is this idea where words that are later in the sequence correspond to words that are earlier, and vice versa. So that is how we can visualize attention. So then the question is, okay, how are we in a sense modeling it, or what does it look like from the encoder-decoder perspective?

So naturally, when we look at the encoder-decoder blocks, let's look at this as an RNN. We say that the last hidden state in the encoder block contains all the information of the entire sentence, but there's this information bottleneck problem, which means that if this is a longer sentence, the last hidden state might not contain information about the earlier tokens.

And therefore, there's this idea of attention where, given that you've got the hidden states of all the input tokens, the decoder, during language generation, will pay attention to, or attend to, a weighted sum of all those hidden states. So if I've got something that is later in my sequence that corresponds to a token that is earlier in my source sentence, then I will see the attention weights giving more weight to that token's hidden state in the source sentence.
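As a small sketch of that weighted sum, with random vectors standing in for the encoder and decoder hidden states:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

encoder_states = np.random.randn(6, 16)   # one hidden state per source token
decoder_state = np.random.randn(16)       # current decoder hidden state

scores = encoder_states @ decoder_state   # score every source position against the decoder state
weights = softmax(scores)                 # attention weights, they sum to 1
context = weights @ encoder_states        # weighted sum of all the encoder hidden states
```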

So essentially, that's the idea of attention as it has been implemented in the encoder-decoder kind of paradigm, or that kind of architecture. The problem with that is that when we calculate these individual hidden states, we realize that they have to be calculated sequentially. That means, in this diagram, the second hidden state can only be calculated after the first hidden state has been output.

And the third hidden state can only be calculated after the second hidden state has been output. So the question is, can we remove or break free from this dependency on the previous state? Because if we're able to do so, then we are able to run our forward pass, collect our gradients, and run backprop on the architecture concurrently across the whole sequence.

So essentially, that's the idea of your key query value attention, and that essentially forms one of the building blocks of the transformer architecture. So I think from here, what we're just going to talk about is there are other components to the transformer architecture beyond just our key query value attention.

There is also this idea of understanding the position of the text, and that's basically an idea of adding position representations that you will see in the paper later, adding some sort of non-linearity when you're doing the calculation, and that's essentially just adding a feed-forward layer on top of it.

So the idea is that if you're just calculating key-query-value attention, you're always looking at linear combinations of your values, because you're just getting a weighted sum of the values calculated by attention. So we want to add a layer of non-linearity to it, which is taken care of by the feed-forward network.

And of course, the last part is when you're doing the decoding step, when you're generating tokens, you want to not let the model see the future tokens, and essentially that's when masking comes into play, attention masking comes into play. So you will start to see that in the decoder architecture later down the road.
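Putting those pieces together, here's a rough single-head sketch of key-query-value attention with a causal mask; the dimensions and random weights are arbitrary, and the positional representations and feed-forward layer are left out for brevity:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

T, d = 5, 16                                    # sequence length, model dimension
x = np.random.randn(T, d)                       # token embeddings (plus positions, in practice)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

Q, K, V = x @ Wq, x @ Wk, x @ Wv                # queries, keys, values for every position at once
scores = Q @ K.T / np.sqrt(d)                   # scaled dot-product scores, all pairs in one matmul
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores[mask] = -1e9                             # causal mask: a position cannot see future tokens
output = softmax(scores, axis=-1) @ V           # weighted sum of the values
```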

So a couple of things on top of what we are talking about in terms of the language modeling component for transformers. One topic is sub-word models. So this is when you have things like tokenization, your byte-pair encoding. So essentially, what are we trying to solve over here? If you look at this table at the bottom, we start to see words that exist outside the vocabulary. That can be a variation of an existing word, in this case adding many A's between the T and the A in tasty, probably to indicate that it's very tasty. It can be misspellings of words, which are also very common in input. Or it can be novel words, like transformerify over here, which we understand might mean adding maybe a transformer block into an existing architecture, but it's a word that we might not see in the existing dictionary.

So for these words over here, if you just use a traditional vocabulary or a dictionary vocabulary, the index will be some sort of UNK token. And essentially what goes on with byte-pair encoding is that it learns these shorter combinations of letters that can sometimes represent either prefixes or suffixes of a word, and then essentially you are able to generate the embeddings for them.

So if you see over here, you've got this T-A-A and then anything after that, A-A-A and then anything after that, and S-T-Y. So these pieces you've probably seen in other existing words, and therefore there is an existing embedding associated with each of them, and therefore we are able to represent the word over here.

You can think of it as essentially generating three tokens from this source sequence over here. So essentially that's the idea of sub-word models; in this space you've got things like byte-pair encoding, SentencePiece, WordPiece, and things like that. That's the problem that they're trying to solve.
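If it helps, here's a toy version of the byte-pair encoding merge loop; the tiny corpus and the number of merges are made up purely for illustration:

```python
import re
from collections import Counter

def pair_counts(vocab):
    # Count adjacent symbol pairs across the space-separated word vocabulary.
    counts = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge(pair, vocab):
    # Replace every occurrence of the chosen pair with a single merged symbol.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words split into characters, with an end-of-word marker.
vocab = {"t a s t y </w>": 5, "t a a a a a s t y </w>": 2, "t a s t e </w>": 3}
for _ in range(5):                    # learn 5 merges
    counts = pair_counts(vocab)
    best = max(counts, key=counts.get)
    vocab = merge(best, vocab)
    print("merged:", best)
```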

Okay? So, three types of architectures. The key thing to note over here is that the transformer block is essentially replacing the recurrent neural network blocks that we had previously. So when we talk about recurrent neural networks, that of course includes things like LSTMs, GRUs, bidirectional models, multi-layer models.

So it encompasses all that. And essentially what we have over here are the three types of dominant architectures. We've got the encoder models, and examples of this would be things like BERT, where you learn via masked language modeling, which has been covered before. Encoder-decoder models, which we've seen earlier, where an encoder maps your sequence into a space, or a position in latent space, and then from there you perform your autoregressive sampling of tokens to form your target sequence, which is what we've seen in T5.

And the decoder models, which I think all of us are familiar with, things like GPT-2, GPT-3, they are all there. So you essentially learn the language patterns, and then you directly do your autoregressive sampling or decoding from there. Okay? So, from there, we will lead to this paper that we have over here, which is A Comprehensive Overview of Large Language Models.

If you take a look at this paper, it seemed to me that there were multiple updates to the paper, and that signals to me that there are probably going to be more updates along the way. So, I think what's useful, beyond looking at just the paper itself, is what I did: I tried to understand the framework that the authors were using to organize the knowledge, divide it up, and present it so that a reader can understand it.

Right? It's a very dense paper. It's got, I think, over 450 citations. So, I think it's more of a pick-your-own-adventure, pick-your-own-journey, pick-your-own-learning-process kind of direction, so that along the way, you'll be able to build that foundational knowledge and then keep adding layers on top of it.

At the end of the day, we all know that new models are always developed and new models are always announced. So, going back to the first principles and fundamentals are useful. So, let's just go through the paper very quickly. Let's just start from the top over here. So, essentially, we'll just talk about the last point over here, where we are seeing that large language models, in particular things like GPT-3, are able to perform your downstream tasks without specific fine-tuning.

So, that's the first key point, because if we looked at T5, we saw that the performance of T5 on downstream tasks, in this case it can be translation, it can be your GLUE tasks, it can be your SQuAD task, only gets better once you fine-tune on that particular task.

And you've seen, there are multiple experiments that they've done which demonstrate that that's the better way, that's the better alternative. So, what GPT-3 demonstrated was that it is able to perform zero-shot transfer learning on these tasks. So, what does that mean? That means that if you just give the prompt from the downstream task, GPT-3 is able to give the answer.

So, that kind of changed things where we actually might not need to fine-tune for a particular task. Of course, when we look later down the road, we see that there's very, very specific ways of doing things like instruction tuning. But that was one of the big discoveries that they had back then.

On top of it, they were able to show things like reasoning, they were able to show things like planning, they were able to show things like in-context learning. So, we get to see examples of this later when you do things like chain-of-thought prompting. So, given certain patterns in the prompt, when you ask a question or ask for a task that follows a similar pattern, the model is able to answer.

The problem that we see today is that the cost of training them or pre-training them is relatively high, usually in the tens of millions. So, the question is that can we get better at pre-training these models? Can we look at things like better architectures? Can we look at things like more efficient ways of fine-tuning our parameters?

Are there ways that we can represent these parameters with lower precision, in a representation that uses less granularity? That's essentially where things like better architectures come into play, and quantization comes into play. So, the way I saw this paper was that we had the background, which talks about some of the key concepts, and then the different types of LLMs and their particular use cases.

The datasets that have been used to train them, at least the public ones. What kind of evaluation tasks are they looking at? So, probably that's what we call evals. And the different types of applications for these LLMs in commercial settings. And of course, from there we talk about what researchers are probably looking at going into maybe the next three months or the next year.

So, let's look at some of the fundamentals. So, I'm gonna start from the left side. The paper has covered, we have covered some of these topics from the paper. Tokenization, attention mechanisms, the different types of activation functions. So, those are stuff that we've learned. You can get a recap when you do your traditional deep learning topics.

Then, of course, we talked about the different types of architectures which was covered earlier. Your encoder-only, your encoder-decoder, your decoder-only. And naturally, each of them will have their own associated way of doing attention masking. So, that's this part over here. We talked about the different types of pre-training objectives.

Naturally, things like masked language modeling are what we see in your encoder-only models, while full language modeling is what we see in your decoder models. So, masked language modeling, basically, in this diagram, is you give the model this token and these tokens over here, and the model is expected to predict these targets over here, the ones that have been highlighted.

So essentially, masked language modeling is like a fill-in-the-blank kind of problem. Whereas for full language modeling, you give the first token, and then the model is expected to predict the second, third, fourth, fifth token, and so on and so forth. So, that's that. There has also been research into this thing called prefix language modeling, where you feed the model one part of the sequence, and then you ask the model to generate the remaining part of the sequence.

And what's useful over here is that when they do prefix language modeling, they use this thing called a causal mask with prefix, which means that for the input tokens, the model is able to see or attend to all the tokens in the prefix before it starts to generate output, and as the model generates the output, you still have that element of masked attention. So, essentially, that's this part over here.
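Here's a small sketch of the difference between a plain causal mask and a causal mask with prefix; the sequence length and prefix length are arbitrary, and True means a position is allowed to attend:

```python
import numpy as np

T, prefix_len = 6, 3        # total length, length of the input / prefix part

# Plain causal mask: position i can only attend to positions <= i.
causal = np.tril(np.ones((T, T), dtype=bool))

# Causal mask with prefix: inside the prefix, every token can attend to the
# whole prefix; the generated part is still causal.
prefix = causal.copy()
prefix[:prefix_len, :prefix_len] = True

print(causal.astype(int))
print(prefix.astype(int))
```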

Then there are things that, if you look at the transformer paper, are covered over there, like layer normalization, where you subtract the mean and divide by the standard deviation.

Essentially, what we're doing is trying to achieve numerical stability, so that when you do a forward pass and your backpropagation, you don't have numbers that go all over the place. So that's layer normalization.
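As a quick sketch, layer normalization over some token activations looks roughly like this; the learnable gamma and beta are just set to ones and zeros here:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Subtract the mean and divide by the standard deviation along the feature
    # dimension, then rescale with the learnable gamma and beta.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta

h = np.random.randn(4, 16)                      # activations for 4 tokens, 16 features each
out = layer_norm(h, gamma=np.ones(16), beta=np.zeros(16))
```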

Positional encoding was something we talked about earlier. In the original paper, they had this idea of sinusoidal position representations. So how to read this graph: essentially, as you go from left to right, as the index of the sequence increases, you're applying some sort of sinusoidal function on top of it, such that every token in the sequence has a positional representation.

Each token is augmented by a positional representation, so from left to right, all these vectors actually look different. But this way of encoding positional representations is not learnable, because there is no way to have a gradient and then update the positions.
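For reference, the sinusoidal scheme from the original paper looks roughly like this; note that nothing in it is learnable, the values are fixed functions of the position:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

embeddings = np.random.randn(10, 32)            # token embeddings
x = embeddings + sinusoidal_positions(10, 32)   # positions added on top, no gradients involved
```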

So therefore, it has been changed to something as simple as just adding a learned positional representation on top of the embeddings. And of course, if you look at the paper, there are also newer ways to do it, things like ALiBi, things like RoPE. So that's the left-hand side. Now on the right-hand side over here, we are looking at newer ways, or ways that can help with training or implementation.

So things like the libraries that we're using, JAX, PyTorch, TensorFlow, amongst others. There's this idea of distributed training, which asks: can we use multiple GPUs to train our models so that we are able to learn our weights faster? So amongst others, there's this idea of data parallelism where you duplicate your model across two GPUs.

Let's say I've got two GPUs, I duplicate my model in both GPUs and then I run separate batches on top of them. So let's say I've got a batch of, I don't know, 100,000. I split it into 50,000, 50,000. I run the first batch of 50,000 in the first model in the first GPU.

Then the other 50,000 in the same model in the second GPU, calculate the gradients, average them, and then perform my backprop. So that's what data parallelism is. Tensor parallelism, essentially, is the idea that you calculate the matrix multiplication steps on multiple GPUs and then you combine them. So what happens is that, as you can see, the multiplication of each row with a column can be done concurrently, and therefore it splits the work up such that the matrix on the left multiplies with only one column.

The matrix on the right multiplies with the second column, and then you combine them together, or in this case, you concatenate the results together. So that again also helps us with getting the results from the forward pass a lot faster, okay? So that's that.
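Here's a tiny sketch of that column split in plain NumPy, pretending each half of the weight matrix lives on a different GPU:

```python
import numpy as np

x = np.random.randn(4, 8)             # a batch of activations
W = np.random.randn(8, 6)             # weight matrix to be split column-wise

W_gpu0, W_gpu1 = W[:, :3], W[:, 3:]   # each half would live on a different GPU
y0 = x @ W_gpu0                       # computed on "GPU 0"
y1 = x @ W_gpu1                       # computed on "GPU 1"
y = np.concatenate([y0, y1], axis=1)

assert np.allclose(y, x @ W)          # same result as doing the whole matmul on one device
```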

Other kinds of tricks that we are using are things like flash attention, which is a very smart way of utilizing memory. So what happens is that instead of a series of steps that is very memory intensive, where you load your query and key matrices for the calculation, perform the softmax, and then get your results, they iterate over the computation in blocks and use very smart functions to calculate things like the softmax on the fly.

So essentially that's what they're doing over here. It's an optimization of how you use the high-bandwidth memory and the on-chip memory in your GPU, because in your GPUs you've got very fast computation but relatively limited fast memory. Just a little bit extra: this is one of those very common topics that people like to start from as they go into things like your Mamba models.

So that's just the first part. So the second part in terms of the background will be how do we adapt these models for specific tasks? So there are things like transfer learning, which we've seen before, where we pre-train our T5 base model and then we fine tune on individual tasks.

There's also things like instruction fine-tuning, where the model is given a series of instructions and outputs, and then the model will fine-tune its outputs based on that. So examples of this can be things like, let's say I ask GPT to explain the moon landing to a six-year-old in a few sentences.

Generally, if the model is only pre-trained, GPT tends to just continue the pattern: explain the theory of gravity to a six-year-old, then explain the theory of relativity to a six-year-old, then explain the Big Bang to a six-year-old, then explain evolution to a six-year-old. So that's how GPT-3 will output its sentences. But if we're able to do some sort of instruction fine-tuning, where there is some emphasis on things like six-year-old and in a few sentences, then this is the kind of output that you can get.

And so those are the kinds of variations of different models that we can see when we download them from open-source, I would say, repositories, things like Hugging Face. So that's instruction fine-tuning over here, and then there's something called alignment tuning, where you want to ensure that your model fulfills what people call the three H's of model behavior.

So your models will be harmless, your models will be honest and your models are helpful. So things like harmlessness will be things like, if let's say you ask the model, how can I let's say bake a cake with cyanide? If let's say your model is not alignment tuned, the model might give the instructions but let's say if you do alignment fine tuning to tell the model, hey, this is something that you should not output or you shouldn't give instructions for, then the model will learn accordingly from that.

So these are some of the methods that we want to fine-tune our models with, such that our models are able to demonstrate a certain behavior. Then how are we doing it? We can use techniques like reinforcement learning, where essentially for each of the different outputs, you have a certain kind of reward.

In this case, the reward is just a scalar value, and then you learn some sort of policy such that when the model outputs text based on this policy, you get to maximize the reward. So the key thing over here is that the policy has to be differentiable, so that when you get some results from the model's outputs and some reward, maybe a reward that isn't good, or you're comparing rewards, you're able to turn that into a loss and backpropagate it through the gradients to update the weights of the policy.
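As a very stripped-down sketch of that idea, here's a two-response toy policy trained with a REINFORCE-style update and made-up scalar rewards; real RLHF is far more involved:

```python
import numpy as np

logits = np.zeros(2)                 # learnable policy parameters: one logit per candidate response
rewards = np.array([1.0, -1.0])      # scalar reward for each response, e.g. from a reward model
lr = 0.1

for _ in range(200):
    probs = np.exp(logits) / np.exp(logits).sum()
    a = np.random.choice(2, p=probs)             # sample a response from the current policy
    grad_log_prob = -probs
    grad_log_prob[a] += 1.0                      # gradient of log pi(a) with respect to the logits
    logits += lr * rewards[a] * grad_log_prob    # REINFORCE update: push up high-reward responses

probs = np.exp(logits) / np.exp(logits).sum()
print(probs)   # most of the probability mass ends up on the higher-reward response
```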

So that's essentially what reinforcement learning is. Typically, back when reinforcement learning was the hot thing, it was a whole course by itself, so this is just a very high-level, five-minute overview of it. On top of that, I think one of the things you are more familiar with is things like prompting.

So we've got zero-shot prompting, where you just give a task and the model answers directly, but also you have things like few-shot and chain-of-thought prompting, where you give the model some examples first and then, from there, ask the model to mimic the behavior of the examples above.

So that's essentially what you have over here. You've got in-context learning of, I would say, translation on the right, and you've got in-context learning of correcting spelling mistakes on the left. So that is essentially this part over here, and you get to see that few-shot settings, things like five-shot or three-shot, usually perform better than your zero-shot or one-shot settings.
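Just to make the few-shot idea concrete, this is roughly what assembling such a prompt looks like; the example pairs and wording are made up:

```python
# Hypothetical few-shot examples for the spelling-correction task.
examples = [
    ("I am happpy today", "I am happy today"),
    ("She is verry tall", "She is very tall"),
    ("He runned to school", "He ran to school"),
]
query = "The weater is nice"

prompt = "Correct the spelling and grammar:\n"
for wrong, right in examples:
    prompt += f"Input: {wrong}\nOutput: {right}\n"
prompt += f"Input: {query}\nOutput:"

print(prompt)   # this string is what gets sent to the model; it completes the last Output line
```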

So that's this part over here. And then the question of course is, how do you craft these prompts such that you'll be able to get the results that you want? So that's essentially the idea of what people like to call prompt engineering. So that's essentially the part on the backgrounds that we want to cover.

The next part over here, I would say, is a very brief list of some of the models that we have. Now, keep in mind that this list of models gets updated every two or three weeks, so that's good to understand. So naturally, I think when the paper gets updated in the future, you will see additional models.

Some of the high-level, I would say, purposes that we see these models are trying to achieve can be things like your general purpose ones. So that's when you get a model to do all sorts of things. There's also, of course, your multi-modal ones, where you give the model some image and then you maybe ask the model to decipher some fact or draw some conclusion from the image.

There's also, of course, your video-related ones. There are some that are very specific to code generation. So here are some of them. Some that are very specific in the finance domain. Some that are very specific in the science domain. And of course, there are some that are very useful for chatbots.

This is the list over here; there's a much more detailed list in the paper itself. Having said that, of course, as mentioned, there are also additional papers that keep coming out, and so there are also some, I would say, missing models. These are some of the ones that were not mentioned.

So good to understand that this is always an evolving list. So what are some of the features that we see in these models? You've got things like your instruction tuning, which was talked about earlier. We notice that models are able to have increasingly high context windows. Now the context windows are in the six figures, sometimes even in the seven figures.

There are also other ways in which LLMs can be used; I think a very popular one is RAG. So beyond just your general-purpose use, you can always fine-tune them for very specific purposes, or purposes that are very specific to maybe your own corpus or your own knowledge base.

So that's essentially what we're doing over here. Then there are other topics for the reader to explore. Actually, most of these topics over here, if you look at them, are about parameter-efficient fine-tuning. So things like quantization, where instead of representing a number in 32-bit, I represent my number in 8-bit or 4-bit and see if I can still maintain the model accuracy.
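A minimal sketch of that idea, using simple symmetric 8-bit quantization of a random weight matrix:

```python
import numpy as np

w = np.random.randn(4, 4).astype(np.float32)    # full-precision (32-bit) weights

scale = np.abs(w).max() / 127.0                 # one scale for the whole tensor
w_int8 = np.round(w / scale).astype(np.int8)    # stored in 8 bits instead of 32
w_dequant = w_int8.astype(np.float32) * scale   # approximately recovered at compute time

print("max absolute error:", np.abs(w - w_dequant).max())
```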

Generally, the model accuracy will go down. But the thing is, if you're able to get lighter models, smaller models, that's actually very useful. Multi-modal LLMs that we talked about earlier that take in things like images and video as inputs. Adapter tuning, essentially, is when you just add another layer on top of the output and then you perform fine-tuning on it.

There are more sophisticated ways to use it where your adapter is shared across two or more models, meaning the same adapter is used in, let's say, a general model and also, let's say, a GPT model. I've seen that in talks about embedding representation learning. Mixture of experts is something that we've seen before, where instead of just having one feed-forward layer over here, you're actually able to route tokens to different feed-forward layers.

And then from there, once you combine them together, you'll be able to leverage different, I would say, vertical workflows of the model, where each of the vertical workflows will learn different aspects. So that's essentially your MoE.
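A rough sketch of that routing idea for a single token, with a top-1 router and random matrices standing in for the expert feed-forward layers:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d, n_experts = 16, 4
x = np.random.randn(d)                                        # one token's hidden state
experts = [np.random.randn(d, d) for _ in range(n_experts)]   # stand-ins for the expert feed-forward layers
router = np.random.randn(d, n_experts)                        # the routing layer

gate = softmax(x @ router)             # router scores over the experts
top = int(np.argmax(gate))             # top-1 routing: pick a single expert for this token
y = gate[top] * (x @ experts[top])     # only that expert's feed-forward runs for this token
```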

Low-rank adaptation, or LoRA, looks very popular recently. Essentially, what we're trying to do is reduce the number of parameters during your gradient updates, so that you use less compute to get your fine-tuned models. And the idea behind it is that instead of calculating gradients for 64 parameters in an eight-by-eight matrix, what you can do is decompose the update into an eight-by-two and a two-by-eight matrix.

The key thing over here is that when you multiply the two together, the result is an eight-by-eight matrix again, which is 64 weights. So if you're able to decompose it into weights in a smaller dimension, and of course how small that rank is is a hyperparameter for you to tune, then the cost of fine-tuning will go down.
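And a bare-bones sketch of that decomposition, matching the eight-by-eight example with rank two, so 32 trainable numbers instead of 64:

```python
import numpy as np

d, r = 8, 2                            # original dimension and the chosen rank (a hyperparameter)

W = np.random.randn(d, d)              # frozen pretrained weight, 64 values, no gradients
A = np.random.randn(d, r) * 0.01       # trainable, 8 x 2 = 16 parameters
B = np.zeros((r, d))                   # trainable, 2 x 8 = 16 parameters, starts at zero

def forward(x):
    # The low-rank product A @ B is an 8 x 8 update, but only 32 numbers are ever trained.
    return x @ (W + A @ B)

print(forward(np.random.randn(3, d)).shape)   # (3, 8)
```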

So essentially that's what we're doing over here. Yeah, so that's pretty much it for this segment. The next segment is essentially about the datasets that can be used for training, at least the public ones that we see. These are things that we've seen before: Wikipedia datasets, the C4 dataset, Common Crawl, which are used for your, I would say, more general-purpose models.

And then, of course, you've got some datasets that can be used for very task-specific models, for example, code generation. You've got datasets that are used for instruction fine-tuning, and you've also got datasets that are used for alignment. So essentially, if you go to, say, TensorFlow Datasets or Hugging Face, you'll be able to download them and then observe these datasets for yourself.

And if, let's say, you want to fine-tune a model for a specific use, these are actually useful, I would say, templates or schemas that you can use to prepare your datasets so that you can do fine-tuning. So this one is for instruction tuning, and this one is for getting the model to display behavior that's more aligned to our use.

So naturally, this one, I'm okay to share some examples, but this one, you can go ahead and click on the link. You'll be able to see the kind of examples that's over there. So let's say we've done our training on fine-tuning. We found a way to update our parameters in a more efficient way.

The final part is, of course, evaluation. So I think I'll cover, at a high level, two classes of model evaluations. You've got things like your single-task evaluations, so very popular ones would be things like SQuAD, StoryCloze, MATH, and MNLI, which are for question answering, understanding the context of words where you're filling in the blanks, answering math questions, so mathematical reasoning, and, I believe, natural language inference.

So essentially, whether two sentences follow each other or not, right? Whether the next sentence logically follows the first sentence. And also things like TruthfulQA, which validates whether the model outputs facts instead of, say, other kinds of trivia that's not true, not truthful.

So these are some of, I would say, your single-task evals, and then you've got your multi-task evaluations, things like GLUE, things like MMLU, things like SuperGLUE, and of course, there are a couple more inside the list. So if we just take a look at GLUE, it is divided into multiple individual evaluations: you've got things like natural language inference, you've got things like whether a sentence makes sense or not, so that's your CoLA, and you've got things like semantic similarity.

So essentially, that's what's going on over here. Then there's MMLU, which is one of the more popular ways of doing benchmarks right now, with a big number of knowledge-intensive subjects that you can see over here. And of course, SuperGLUE, which is the second generation of GLUE, with more, I would say, questions that mimic human behavior, or things that are a bit trickier for models to understand.

So that is the part on evaluations. So different kinds of applications, I think we've seen many kinds. So beyond just things like what's in the list, we also see things like music generation, we see things like video generation, and naturally, what happens is that for each of them, there are also certain guardrails that need to be placed.

So what are some examples? If let's say for a music generation model, it is important to ensure that when we submit lyrics for the model to output, these lyrics shouldn't be under any kind of copyright. If not, then there might be legal consequences. So this is something that I would say, depending on the domain that you're in, you will be looking at models that are very specific to your domain.

So finally, last part, before we go into Q&A, what are some of the things that we see models exhibit? So things like biases are very common, stereotypes are very common. And I guess the reason why is based on some of the training data that we see. If the training data exhibits a certain behavior, naturally, we see the model exhibiting this behavior.

So that's, I think, one of the things that we want to be aware of. And also things like models memorizing private content. So if let's say I've got a GPT model and I type in a particular prompt, and this GPT model sees some email, and then it outputs some sort of phone number that is supposed to be private.

And let's say a user takes this and does a search. So essentially, the idea is that this is the output from the model. And you can see there's actually some information over here that might be private. You might have a phone number that's not supposed to be exposed to the public.

And then maybe someone searches for the phone number, and there you might have an additional contact that maybe you can use, right? So these are some of the things that we want to, I would say, be aware of when it comes to the component about human alignment. So on top of the three H's, being helpful, being harmless, and being honest, you also want to ensure that your models do not leak out or do not learn certain private information.

And generally, what happens is that there are teams behind all these ways of conducting adversarial attacks. You can call them white-hat attacks, or what people like to call red-teaming these models. So essentially they're trying to generate adversarial prompts or find ways to make the model leak something out, and then if they're able to do so, they will fix it.

I think there's a few interesting articles about that recently. So essentially, that is the paper. It sounds like a firehose of information. So if there's anything, any topic you want to deep dive into, feel free to take a look at the paper or take a look at this and go into the topics that you're looking at.

So if let's say I want to just do something on parameter efficient tuning, feel free to just go into that segment. So I've linked all the papers over here. I've also linked some of the external sources that have been useful for me over here. So yeah, feel free to take this as a reference guide for yourself.

And I think with that, I've come to the end, and I'm leaving about 10 more minutes if there are any Q&As. So Ivan? - Yeah, dude. Thanks so much for giving such a detailed walkthrough. I think there was a question by Bonan in the chat about parallelization, like what exactly is the benefit of using a transformer versus, I guess in this case, an RNN or LSTM.

Do you want to maybe start with that? Like how the parallelization works? - Let me just take a look. Okay, so if you think about it, let's look at this example over here. One second, let me just, okay. So the idea over here is that if you think about the traditional RNNs, what happens is that, let's say I've got a sequence of 10 tokens, and I want to calculate the hidden state of the entire sequence, in this case, the hidden state of the 10th token.

There is a dependency on the 9th hidden state, and the 9th hidden state in turn depends on the 8th hidden state, and so on and so forth. And essentially, that's what's going on over here: if, let's say, I want to calculate the hidden state of the second token in the sequence, I need to calculate the first hidden state as an input.

So that goes back to either your RNNs or LSTMs, where the inputs to a hidden state are the hidden state of the previous token and also the current input token. So the thing is that because of this dependency, where future hidden states rely on the previous hidden states, there is no ability to parallelize along the sequence, from a wall-clock perspective.

And therefore, you see on the first line that the forward and backward passes are O of sequence length. That means however long your sequence is, you have to do that number of sequential calculations. Does that make sense? - I think it makes sense. At least the way I like to think about it is that, let's say I had five sentences, and they're not the same length.

In order for me to get the final hidden state, before I can start evaluating its predictions, I need to run five passes, and for each character in each sequence, or each token in this case, for a transformer itself, I can just pad everything to the same length, and pass it through in one time step.

So I can get everything out in one output step, one forward pass. At least that's my understanding of the parallelizability. - Yeah, that makes sense. I agree. I would say for this diagram, we think of it during the training stage. Naturally, during the inference stage, there is still this need to pass the hidden state of the current token back into the transformer, the model, to get the next token.

- For sure, for sure. I was talking more about the training stage, but I think in terms of inference, with a transformer you still incur the additional cost for each additional token that an RNN does, ultimately. Actually, for me, one of the questions I had about the classification in this paper was that of prefix versus full language modeling.

Because if you look at the example that they give in the text, it's just a cute little example, which is: if it's full language modeling, they give the word "may" and then you output "the force be with you." If it's prefix language modeling, it's "may the force" and then the model is asked to predict "be with you." But those just both seem like the same thing.

Because my understanding of prefix language modeling was that, oh, we're gonna specify a specific token, for example, like a bracket classify, bracket sentiment, sort of like in T5. And the model learns that if it sees this specific prefix, then it should perform differently. And so that was what I was a bit confused by in this specific paper.

- That makes sense. I didn't look at the paper in particular, so it's a little bit hard to comment on that. I understand what you are saying, that this and this really don't show a lot of difference. - I think what I can comment is that generally in full language modeling, what happens is that you...

Okay, this is, of course, the encoder-decoder phase of things beyond the GPT stuff. So generally what happens is that for full language, you generate everything. So in fact, maybe in this case, you might just start with a beginning of sentence token and then you take maybe some hidden state and then you generate from there.

And then you autoregressively sample from there, which is different from the prefix language modeling where you are given the beginning of sentence token, actually, and then a series of tokens before you do your generation. And then, of course, when you do your learning, you are learning based on that particular sequence of text more than just the beginning of sentence.

I'm not very sure. I think this one, we've got to take a look at the paper to fully understand. It was also the guy who was the author of the T5 paper, I believe. - Oh, really? The guy who did this paper? - Yes, I think his name is Colin.

Yeah, but let's go and check, yeah. - Yeah, and I think we can talk about this some other time. It was just something that confused me quite a good amount. I guess the other thing that surprised me was just like learn positional encodings. 'Cause when we covered the original transformer paper, I think there was a section where they said, oh, we experimented with learn and frozen positional encodings.

But it seems like you mentioned that newer papers are starting to use learned positional encodings instead, and they show an increase in performance. And I was wondering, in your opinion, what sort of change made this happen, if that makes sense? - To be very honest, I'm not very sure what the changes were that inspired it.

Maybe the way I would comment is that, once they are able to do so, they are able to represent, they are able to efficiently represent an input with a much longer context window. So I think that probably what happened was that there was innovation in that space. Because the thing is that if, let's say, I've got maybe say 500 tokens or 1,000 tokens, there might be a limitation on how you model the positions, because maybe the positions might all be just clustered in one area.

But I think once they figured out how to do so, that's when they opened up the door to longer context windows. So maybe how they learn positional encodings might be one of the tricks that they use to get longer context windows. But again, I might be wrong. I didn't really go into the details of this part of the research.

- Yeah, for sure, for sure. Yeah, I was just wondering about that. 'Cause that was just something that I was intrigued by. I think we're almost at time. If anyone has any other questions, you can drop them in the chat. Maybe we can just end it here. Okay, it seems like there's no more questions.

So anyway, moving on to next week's paper, I was thinking of doing the DeepSeek MoE paper. That was one thing I'd like to propose. 'Cause I thought it's super interesting, and there are a whole bunch of these ideas that they're experimenting with, like always-on experts, randomly routed experts.

So I thought it's a good paper. So as usual, if anyone wants to present on the paper itself for the upcoming week, then happy to help you with it. I think you generally learn a lot more when you actually do the paper. I learned at least 10 times more if I actually had to sit down and present the paper.

So I think, as usual, I'll probably just drop a thread inside the paper club. And then if you guys have any other papers that you'd like to suggest, you can add it onto the thread, and then we can all vote for that. Yeah, do you have any papers in mind, Brian?

Anyone have any other papers that you guys wanna read? - Hmm, I'll take a look at them. There are some that are, I would say, very open-source models. So we'll see; maybe one day next month, I can take a look at them, yeah. - Okay, cool, sounds good to me, yeah.

Then otherwise, thank you so much, guys, for tuning in to today's session. Really appreciate it. And yeah, looking forward to next week, guys. Ciao. - Thanks, everybody, see you guys, bye-bye. Have a good evening.