
LLM Asia Paper Club Survey Round


Chapters

0:00 Let's Think Dot by Dot: Hidden Computation in Transformer Language Models
12:30 Uncertainty Estimation and Quantification for LLMs: A Simple Supervised Approach
32:23 Monosemanticity
44:54 Medusa - Simple Speculative Decoding using multiple heads

Transcript

Okay, great. I will now start presenting this paper. I found out about this paper recently; it's called "Let's Think Dot by Dot". The motivation behind choosing this paper is that I've always had the burning question: how do LLMs actually think?

Now, we know that chain-of-thought prompting shows that by allowing LLMs to think out loud before answering, their performance improves considerably compared to direct answering techniques. This provides us some intuition as to how LLMs reason through their tasks.

However, recent work discussed in this paper suggests that for chain-of-thought reasoning, LLM answers can also be unfaithful to the intermediate reasoning steps. Simply put, the answers do not really tally with their workings. Now, here are some guiding questions before moving on to the related work section.

The first question is: do LLMs actually need to think out loud, basically write down their thoughts, or are they able to think internally like humans? The second question is: are there any specific tasks showing that LLMs are not simply relying on semantic information, but possibly on other kinds of information found in the tokens in their inputs?

So I'll just skim through the related work section just to provide some additional context on this work. The first part of the related work talks about computational complexity. They discuss that there are different levels of computational and task complexity: the lowest level is a class called TC0, and higher levels of task complexity include graph connectivity, solving Sudoku, etc.

And this requires some form of recursion and checking multiple constraints at the same time. Now, the second part of the related work section talks about how transformers use tokens for reasoning. In the chain-of-thought prompting example, we can see that transformers can indeed solve higher-complexity problems when additional reasoning tokens are provided.

These tokens actually help to decompose the problem into smaller problems and improve the model's reasoning capabilities. So the question is: are the LLMs making full use of these extra tokens as part of their reasoning process? And the third part of the related work section talks about potential model performance improvements on more complex tasks.

Some work also suggested that using filler tokens could improve performance on nested quantifier tasks, basically tasks that have some form of nested constraints that depend on each other. Now, what the authors wanted is to create new datasets that specifically target these kinds of problems to test their filler-token hypothesis.

Now, let me move on to the methodology section and talk about the tasks created by the authors. I have a short section here talking about the nested quantifier, and then I will send the link over later, but this will just be a short summary about what exactly are nested quantifiers.

So due to time constraints, I'll just move on to the first task created by the authors. The first task is called 3SUM. Basically, you have a list of numbers and you want to check if there's a group of three numbers inside that list that adds up to 0 modulo 10.

So the example question is that you have a list of 2, 3, 5, 7, 1, 8, and the answer could be 2, 3, 5 or 2, 7, 1, because the sum would be 10, and 10 divided by 10 leaves remainder 0.
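To make the task concrete, here is a minimal sketch of how such a 3SUM instance could be checked under the modulo-10 formulation just described (illustrative only, not the paper's exact data format):

```python
from itertools import combinations

def has_3sum_mod10(numbers):
    """Return a triple whose sum is 0 mod 10, if one exists."""
    for triple in combinations(numbers, 3):
        if sum(triple) % 10 == 0:
            return triple
    return None

print(has_3sum_mod10([2, 3, 5, 7, 1, 8]))  # e.g. (2, 3, 5), since 10 % 10 == 0
```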

For the second task, it's called 2SUM-Transform, which also involves a sequence of numbers. Basically, they have one list and then a hidden transformed list: they take the original list and apply some transformation to each element, and that transformation is only revealed at the end. So for example, you have an original list and then you add 3 to each number to get the transformed list.
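A rough sketch of what such an instance might look like, assuming the same modulo-10 convention as 3SUM (the paper's exact scoring and data format may differ):

```python
from itertools import combinations

def two_sum_transform(numbers, offset):
    """Apply the transform revealed at the end (here a simple +offset),
    then look for a pair on the transformed list summing to 0 mod 10."""
    transformed = [(n + offset) % 10 for n in numbers]
    for pair in combinations(transformed, 2):
        if sum(pair) % 10 == 0:
            return pair
    return None

print(two_sum_transform([2, 3, 5, 7, 1, 8], offset=3))  # e.g. (6, 4)
```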

And then what this task actually requires is that the model remembers the transformation information and integrates it into its computation. So let me move on to the experiment section of the paper. Basically, the authors use a scaled-down, randomly initialized Llama model in their training process, with three different model configurations for the setup.

The first configuration uses filler tokens, basically a series of dots, as intermediate tokens between the input tokens and the output tokens. The second configuration uses chain-of-thought reasoning tokens as the intermediate tokens instead of dots. The third configuration is sort of the control setup, where there are no filler tokens in between the input and output.
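Just to illustrate the three setups (the exact serialization of inputs in the paper may differ from this sketch):

```python
# Illustrative only: the exact token format used in the paper may differ.
question = "2 3 5 7 1 8"

filler_input  = question + " : " + ". " * 12 + "ANS"            # config 1: dot filler tokens
cot_input     = question + " : <chain-of-thought tokens> ANS"   # config 2: CoT intermediate tokens
control_input = question + " : ANS"                             # config 3: no intermediate tokens
```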

And then the results show that for the 3SUM task, the first task created by the authors, as the length of the sequence of numbers increases, the gap in performance between the configuration with filler tokens and the one with no filler tokens widens quite drastically. This basically shows that as the task complexity increases, the filler tokens do carry some form of task-related computation.

And the next figure covers a different experiment, where the authors freeze all the previous layers of the transformer and only keep the final attention layer trainable. Basically, they are trying to show that whatever changes they make in the final layer are what cause the results to change.

Basically, any performance changes result only from changes in the final layer. What they tried there was increasing the percentage of filler tokens made available to the final layer, and they saw that on their tasks this increases the accuracy up to nearly 100%.

Accuracy increases sharply up to about 60% of the filler tokens kept, and then it tapers off with diminishing returns past the 60% mark. Now, for the 2SUM-Transform task, which is the second task, the authors lay out a table comparing the chain-of-thought, filler-token, and no-token configurations.

This shows that, as expected, chain of thought still comes out on top, and the filler tokens get quite close to the chain-of-thought reasoning tokens, while the no-token configuration shows around 78% accuracy. Both of the former configurations noticeably exceed the performance of no intermediate tokens.

An additional insight the authors found in their experiments is that, before using the filler tokens to replace the intermediate tokens, they first needed the chain-of-thought configuration to be set up, to see how the chain-of-thought tokens are placed as intermediate tokens, and only then could they replace them one-to-one with filler tokens.

So this basically means this is quite a challenging learning problem, whereby prior knowledge of how to place the intermediate tokens is needed before you can use the filler tokens. This also implies that the filler tokens must be placed in some particular arrangement or configuration in order to show good performance.

So, at the end of the paper, some food for thought: the experiment results imply that although filler tokens do not contain any semantic information or instructions for the language model to follow, they still contain some form of structural or syntactic information. So, a possible question is: why use dots as the filler tokens?

We could also use other kinds of tokens, right? As long as the tokens do not really contain any kind of semantic meaning, they could all be valid choices for the model. So, I would love to see different types of tokens tried, to see if there are any different kinds of results.

Although dot tokens are technically filler tokens, they also serve as a form of ellipsis in the English language, acting as a form of continuation. So, in summary, this paper shows how large language models can use filler tokens to achieve higher performance on certain tasks compared to not using any intermediate tokens at all.

The results also show that filler tokens provide some form of computational benefit, not through semantics, but maybe through some form of structural information. So, that's the end of my survey. I think it's an interesting paper, and I'm open to any questions. Yep. I think Casper had a question in the chat itself, whether the filler tokens were special tokens introduced after the model was trained, or just the same token used for pure filler that the tokenizer had learnt, perhaps during the pre-training phase.

They actually include the filler tokens, as I said, in the pre-training phase or so. Yeah. So, they were not like a special token added in afterwards. There's another question from Warren, which is why not just use unknown token or something instead of dots in this case? Yeah, that's an interesting question also, yeah.

I think you'll probably have different results, yeah. It seems any rare form of token should be fine, yeah. I don't know if you've seen the memes on Twitter, where if you prompt the model and you put a space in between, or like one or two spaces, and then you just let it do the normal inference call, the performance actually increases sometimes, because the space itself is another token, right?

Oh, okay. And then that sort of modifies the computation itself. Oh, yeah, that's quite interesting, yeah. It's all these hidden complexities that we don't really know about. Yeah, it's quite interesting. I just commented: could it also be because of positional biases? Do you want to maybe elaborate on that?

I'm not too sure about what positional biases are. I mean, you can drop it in the chat or you can unmute yourself. Yeah, sure. So what I meant was, there's positional encoding in the transformers, right? And because the models train on a lot of sequences, I think maybe if it trains too much, it gets saturated, and then it kind of learns the positional encoding as well.

So if you put a space or two, then that particular sequence happens to be aligned with how the positions were observed in the training data, and then it might improve in performance or something like that.

I just know, like, this kind of bias exists, but I don't really know how to actually explain it. Let me see where I can find an article on it. For sure, for sure. I don't think the positional biases of the prompt and the original input would change, but I think perhaps what you're thinking about is that the additional space itself, plus positional bias, might better reflect the training data or where the answer might be.

Is that what you're trying to say? Something like this. Okay. Let me see if I can pull this link up. Large language models are not very... I'm looking into it. Yeah. I guess if there's no more questions, then maybe we can move on to the next paper.

Do you want to go next? Yeah. Okay, I'll stop sharing my screen. Oh, you want me to go next? Okay. Yeah. Give me a moment. Can you see the screen? Yep. Just give me a while. Okay. Yeah. Hi, everyone. Yeah. This paper is a bit different. So it's about how we use a second model actually to estimate the uncertainty of LLM's response.

So it's quite an obscure paper, actually. I spent a few days trying to think of a paper to sort of talk about. But I did enjoy reading this paper. For one, it approaches the problem quite differently. It's from a bunch of people in operations research, which is actually where I did grad school.

And the sort of approach of doing it is actually more of just getting data, fitting a model in, and then trying to make use of that model for a real-world outcome. So let's just dive right in. So the TLDR is actually just three points. Number one, here they train a regression model.

And, in fact, it's a classical one, a random forest, to estimate the uncertainty of an LLM's response. The input for that is just the LLM's hidden-layer activations of the last token, or, in the case of a model behind, let's say, OpenAI's APIs or some other LLM provider where you don't actually get the hidden layers, a grey-box model, you can use some of the probability-related outputs as the input to this regression model.

And the output, it's actually not the logits of the language model, but instead it's actually a task-specific score, and typically between 0 and 1, about the certainty of the answer. So I'll just elaborate a bit more about that later. The paper covers a bit about the existing methods, and I think most of the methods right now for quantifying uncertainty, they tend to be based directly on the output of the language model.

What do I mean by that? Say, given a fixed prompt, you might get different outputs, and you try to do sampling there to see what the variation in the output is, or you might add some perturbations to your prompt and see how that results in variation in your output, and you measure it thereafter.

The paper keeps mentioning that these are unsupervised methods, in contrast to what they're doing with a supervised method, where they actually have ground truth to some degree, and I'll share a bit more about that when I explain the problem mathematically. And this kind of work has been applied before, where they take a second model to quantify the uncertainty of the base language model, and it's been done for transformers, but not for recent large language models like Llama 3 or Gemma.

So before we go into the paper itself, actually, why does this matter? I think the paper sort of points out point two and point three in my list here of four points. They show that by doing this, you actually can get improved performance on certain tasks, and it's a potential use case in detecting hallucinations because by getting a sort of confidence score for an answer, you can then perhaps hedge, do a different sampling approach.

Point one and point four are actually my own ideas upon reading the paper. Say you have a language-model-powered app, a typical chatbot, people talk about RAG chatbots: if you're able to provide a certainty score about the response, the UI/UX could incorporate that into explaining the certainty of the response.

And number four is interesting because, as you'll see later on in this quick sharing, what we're doing is basically predicting the performance of an LLM answer. That actually opens up possible use cases for auto evals: you've already done your evals on maybe a good training set, and now you can scale this up to auto evals, so that when you deploy an LLM system live, this auto eval could help highlight cases where your system might potentially be giving low-confidence answers.

So let's just try to formalize this a little bit, so that's where we go to this part where we express the problem mathematically. The first thing is that an LLM is just abstracted into: I give it an input and it generates a response. The prompt here, denoted by x, is a series of tokens, x_1 to x_k, and they all belong to this set X, which is the vocabulary of the language model.

Thereafter, it would generate a response y, which is over in the second line here: a vector y consisting of multiple tokens, say m tokens, and again they belong to the set Y, which is also the vocabulary. The third line is the probability distribution, where each token y_j is drawn from the conditional probability given the input prompt, the vector x, and all the earlier outputs y_1 up to y_{j-1}.

So that's how we set up the problem, where it's just x and y. Typically, we're not just using the language model to do completions, we actually want to use it for downstream tasks like Q&A, MCQ, or translation, and here we have a scoring function, so this could be BLEU. It's this function s, which takes in the true answer y_true and the model-generated answer y and maps them to a value between 0 and 1, so it's just any generic scoring function you can think of.

So then the task of uncertainty estimation is effectively learning a function G, so in this third step over here, it actually then sort of predicts the score given the input prompt and the output response. So given the prompt that I feed into the language model, and the response that the language model gives, can I then predict how good that answer is?
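Putting that together in notation (my own rendering of the setup, so the symbols may differ slightly from the paper's):

```latex
\mathbf{x} = (x_1, \dots, x_k),\ x_i \in \mathcal{X}
\qquad
\mathbf{y} = (y_1, \dots, y_m),\quad
y_j \sim p\left(\,\cdot \mid \mathbf{x},\, y_1, \dots, y_{j-1}\right)

s\!\left(\mathbf{y}^{\mathrm{true}}, \mathbf{y}\right) \in [0, 1]
\qquad \text{(task-specific score, e.g. exact match or BLEU)}

\text{learn } g \text{ such that } \quad
g(\mathbf{x}, \mathbf{y}) \;\approx\; s\!\left(\mathbf{y}^{\mathrm{true}}, \mathbf{y}\right)
```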

So then the paper explains how they can apply this approach to all sorts of language models: white-box language models where you have the weights; grey-box language models where you don't have the weights but you get more details about the output, like say the probabilities of the various tokens, the logprobs; or completely black-box models, like maybe an API provider that doesn't give any indication of what goes on behind the scenes and only gives you the final chat completion.

So let's just share a bit more about the white-box language models, because the methods thereafter extend from here. For the white-box language models, we first want to build a dataset called d_raw, which consists of four things.

Number one, the input prompt, which is x_i; number two, the output, the answer that was generated by the language model, y_i; number three, the true answer, y_i_true; and lastly, the fourth item, the score, which is the evaluation of the generated answer against the true answer. And notice how if I give the same x_i, I might get different y_i's because of the probabilistic nature of a language model, so that actually gives me more training data as well.

So then from each row entry in this raw dataset, I want to extract features to construct what they call the uncertainty dataset. The uncertainty dataset, denoted d_un, consists of tuples of v_i and the score, where v_i is a vector of selected features.

So there are billions of parameters in a language model, and what they suggest is to use the hidden-layer activations, and for this experiment they actually used the activations from the middle layer and the last layer. They even suggest other ways of getting features, such as appending a phrase like "How certain are you about the response?" to the input prompt and then getting the activations with respect to that prompt.

They mention that as a method other people have used, but I don't think they implemented it in their approach, from what I read of the paper and the code. So once you have the uncertainty dataset, which to recap is just a supervised learning setup, a series of feature vectors each paired with a score, you then train any good old-fashioned supervised learning model to predict that score.

And then once you have that trained model, which is now in point 4 of what we see here, I can use it at inference time: I have an input text, I feed it through the LLM, I extract the features, and I use my trained machine learning model to predict the uncertainty score.
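Here's a minimal sketch of that white-box pipeline, assuming a Hugging Face causal LM and a random forest as the supervised learner (the model name, layer choices, and toy data below are illustrative, not the paper's exact code):

```python
import numpy as np
import torch
from sklearn.ensemble import RandomForestRegressor
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # illustrative choice of white-box model
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)

def last_token_features(text: str) -> np.ndarray:
    """Hidden activations of the last token, taken from the middle and last layers."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = lm(**ids)
    hs = out.hidden_states                      # tuple: embeddings + one entry per block
    mid, last = hs[len(hs) // 2], hs[-1]
    return torch.cat([mid[0, -1], last[0, -1]]).float().numpy()

# Toy labelled set: (prompt, model response, task-specific score in [0, 1]).
prompts   = ["Q: What is the capital of France? A:"]
responses = [" Paris"]
scores    = [1.0]

V = np.stack([last_token_features(p + r) for p, r in zip(prompts, responses)])
g = RandomForestRegressor(n_estimators=200).fit(V, scores)

# Inference time: predict how good a new answer is likely to be.
confidence = g.predict(last_token_features("Q: Who wrote Hamlet? A: Shakespeare")[None, :])
```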

So they do spend a bit of time in the paper, actually only in the appendix, so this is the kind of paper where you read the appendix to actually get the algorithm, how they actually implemented it. So they use 320 features. So these 320 features consist of 20 features that they use from the gray box LLM, which I'll explain later, and another 300 features of which they get 100 by running a LASSO regression, 100 by calculating the mutual information, and 100 by calculating the correlation coefficient, and then they train a random forest regressor with these 320 features.
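As a rough sketch of that feature-selection step (the stand-in V and s arrays below are random placeholders; the paper's exact procedure lives in its appendix):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
V = rng.normal(size=(500, 4096))   # stand-in for last-token activations (n_samples x n_hidden)
s = rng.uniform(size=500)          # stand-in for the task scores

def top_k(values, k=100):
    """Indices of the k features with the largest absolute value of the given statistic."""
    return np.argsort(np.abs(values))[-k:]

lasso_idx = top_k(Lasso(alpha=1e-3, max_iter=5000).fit(V, s).coef_)   # 100 by LASSO weight
mi_idx    = top_k(mutual_info_regression(V, s))                       # 100 by mutual information
corr_idx  = top_k(np.array([np.corrcoef(V[:, j], s)[0, 1] for j in range(V.shape[1])]))  # 100 by correlation

selected = np.unique(np.concatenate([lasso_idx, mi_idx, corr_idx]))
V_selected = V[:, selected]        # plus the 20 grey-box features gives roughly 320 inputs
```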

You know, I don't think they fine-tuned, I mean, optimized the hyperparameters for this regression model, and it wasn't the point of the paper, which was just to demonstrate that it's possible, so it's probably an area to work on if people are interested in tinkering. So that's what they do.

To recap, prompt, feed language model, get the activations, those activations, select a few, train another regression model. The second thing that they explain is how you can then incorporate this for gray box language models. So here, they actually then just come up with 20 features related to the output probabilities from both the response, but also the question or the prompt, and it's sort of covered there in the paper.

It's not too interesting, but what I thought was somewhat interesting was how they could get it from just 20 features. Then the last category is the black-box language models. So it's stuff like, let's say, OpenAI or Anthropic. How do you actually get some uncertainty estimates from there? That's what I thought was a bit clever: they take an input prompt and they feed it to the proprietary, let's say, black-box model, and they get the output response.

At the same time, they take that original input prompt and they feed it through a white-box model, say Llama 3, and get the activations of Llama 3, and they're trying to map the activations in Llama 3 to the outputs of a proprietary model, and somehow that seems to work. In the paper, they do the experiments with Llama 7B and Gemma...

I think Llama-2-7B and Gemma-7B as black boxes, using the other open-source model as the white box for the uncertainty estimation. So just going into the results, they cover three tasks, Q&A, MCQ, and translation, and they give the results here, and they compare against other methods for getting uncertainty estimates, because what you want is not just an answer.

You want something like a probability between 0 and 1, and here what the figures you see are actually the AUC scores. So then they demonstrate that their method is better in getting a higher AUC for these various tasks for Q&A and translation. So the AUC of the score vis-à-vis the answer being correct, binary, yes or no correct.

So just to give an example of what this looks like, in this screenshot the question is, "What musical featured the songs 'A Secretary Is Not a Toy' and 'The Company Way'?" I have no idea what that is, but apparently the answer is "How to Succeed in Business Without Really Trying". And if you look at the table over here, if you were to take some sort of greedy approach and take the max probability, I believe you would get, maybe, the wrong answer with a confidence score of 0.9. But if you look at, say, the white-box method, the WB-S column, the greedy answer gets a confidence score of 0.14, while answer one, which is the correct answer, gets a confidence score of 0.22.

So in this case, they found that this confidence score approach could actually yield more accurate answers. So this is just an example. Just wrapping up quite soon. So remember how I said that what they're doing is training a regression model on the hidden activations? And I said that they were using the middle layer and the last layer.

They found that you actually get better performance when using the middle layer, and they cite the literature suggesting that the middle layers of these language models are better at summarization. So just to quote: this may come from the fact that the last layer focuses more on generation of the next token instead of summarizing information of the whole sentence, as discussed by other authors.

So that was new information for me. I thought it was quite interesting. One last thing is that there's right now a Kaggle competition where they're trying to predict how people rank chatbot responses on the arena. So I thought there were some interesting synergies between that and this paper, which was pretty cool.

And if we can automate some of the eval work, then, more from my perspective, where I build a large language model powered application and deploy it, I also want to know whether my large language model is actually doing well in real time, and maybe something like this could come in useful.

So yeah, thanks for listening. Awesome, dude. That was a great presentation. Honestly, a super interesting paper. I think there were two questions inside the chat itself. The first one was from Oman, which was: how is measuring the uncertainty related to the concept of a well-calibrated model? Ah, yes, yeah.

Yeah. So the paper does cover the concept of calibration. So maybe just to share, to get everyone up to speed about what calibration is: a model is considered well-calibrated if, let's say, when the model says the probability of rain is 0.4, then it actually rains about 40% of the time.

So calibration sort of means that the numerical score the model produces behaves like a probability. In this case, they do mention calibration in the paper, and they're effectively saying that, number one, a model that produces a good confidence score is more likely to be well-calibrated.

And number two, that usual methods of calibration, like I think isotonic regression or I believe some other binning method can be applied. And they do have some numerical results on calibration as well in the paper. Well, I guess, would it be right to say that then a well-calibrated model is a model that performs well on the task that you've assigned it or is it slightly more complex than that?

Yes and no. Yes, it effectively means that. Well, you can think of a calibrated model as one that gives an actual probability, because just because a model spits out a number between zero and one doesn't mean it's a probability. So for the folks who are not too familiar with this, you can go to the scikit-learn documentation, and they give an example of calibration, of how you can calibrate scores to become more like probabilities.
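For reference, a minimal example of the kind of calibration the scikit-learn docs cover, using isotonic regression (the toy scores and labels here are made up for illustration):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# raw_scores: uncalibrated confidence scores; labels: 1 if the answer was correct, else 0
raw_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.6])
labels     = np.array([0,   0,   1,    1,   1,   0  ])

calibrator = IsotonicRegression(out_of_bounds="clip")
calibrated = calibrator.fit_transform(raw_scores, labels)
print(calibrated)  # monotone remapping of raw scores toward empirical frequencies
```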

I see. I think there are a few more questions, if it's okay with you. No, I'm sorry, I was reading it. Yeah, sorry. How does using a white-box model for a black-box model actually work? I am a bit doubtful as well. Ah yes, so Warren, yeah. Initially it took me a while to understand that too, because how can you use one model for another model given that the architectures are different?

So I think all they're doing is, okay, for let's say OpenAI, I run my prompt through OpenAI and I get a result and I keep that result. At the same time, I take my original prompt and I run it through Llama-3-70B or whatever, and then I get the activations of Llama-3-70B, and then I just use those activations as the input for my secondary model to predict the uncertainty.

So they're trying to shortcut their way into predicting it. The way I see it intuitively is that they're using the open-source, white-box model as a way to create some sort of representation of the prompt, and then they use a secondary downstream model to learn the mapping from how Llama-3-70B represents the prompt to the quality of the outputs of OpenAI's models, yeah.
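In sketch form, that recipe looks roughly like this (call_blackbox_api, task_score, and labelled_examples are placeholder names I'm introducing, and last_token_features is the kind of white-box feature extractor sketched earlier):

```python
from sklearn.ensemble import RandomForestRegressor

# Placeholder names: labelled_examples is a list of (prompt, gold_answer) pairs,
# call_blackbox_api queries the proprietary model, task_score returns a score in [0, 1],
# and last_token_features extracts prompt activations from an open white-box model.
features, scores = [], []
for prompt, gold in labelled_examples:
    answer = call_blackbox_api(prompt)             # answer comes from the black-box model
    scores.append(task_score(gold, answer))        # how good that answer was
    features.append(last_token_features(prompt))   # representation comes from the open model

g_blackbox = RandomForestRegressor(n_estimators=200).fit(features, scores)
```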

So for Caspar's question about how this would compare to methods just using logprobs, that's another thing that took me a while to eventually realize: if, let's say, I use logprobs, that's sort of predicting the probability of the next token, but the probability of the next token isn't necessarily the probability that my answer is correct, because that is task specific.

So let's say I want to do a question and answer and then I get a probability of like this is a MacBook. You can get some probability for that statement, but I don't get a probability for whether the statement this is a MacBook is the correct statement for my given task, yeah.

And the next one: for the mathematical models that measure uncertainty, is it dependent on the loss function? For example, when comparing a simple model trained on L2 loss versus one that goes through additional complexity like RLHF model-based scoring, does the mathematical theory still hold? Okay, I'm not too sure, but there was a part of this paper that did have some theory about how, under certain conditions, the optimal solution to this problem is the Bayes-optimal classifier, but there are certain conditional independence properties that aren't satisfied, and also, if I recall... actually, I have the paper open here.

Yeah, so, because when we train a language model, the loss function is actually a different loss function. Over here. Sorry. Yeah, over here. So they basically say that if the large language models aren't trained on the cross-entropy loss, the theorem doesn't hold, but they do say that because large language models are trained on larger data and so on and so forth, there are apparently some approximations that can be done.

Yeah, so I didn't fully understand section 3.3, but maybe you could take a read and see if it helps. So yeah, it's trying to justify why they use hidden layers as features. Yeah. All right. All right. That's about it. Thanks, everyone. Thanks for presenting, dude. Yeah, I think next is Casper with monosemanticity, I think.

All right. Do you mind? Okay. So I promise it's not just the basics, but let me just get through some basic stuff first before we go through some interesting dashboards and visualizations. So I'm covering the paper Towards Monosemanticity, which Anthropic released, I believe it was late last year. And basically the primary contribution of this paper was demonstrating that it's possible to use sparse autoencoders to identify features in a language model.

So that's really fuzzy, but maybe just to level set and make sure everyone has an idea of what I'm talking about here. You know, this is probably one of the most important works in the field of mechanistic interpretability or mechinterp over the past sort of few years. And so what is mechanistic interpretability?

It's basically making the inner workings of a neural network human interpretable. And this is in contrast to other interpretability or AI explainability approaches, which take more of a behavioral approach. If you think of like behaviorists in psychology, that's often the approach that's taken by some researchers. Mechinterp, on the other hand, is very much concerned with the activations within a model and figuring out at a very sort of granular level how it is a model is working.

So really what we're trying to do here, or what mechinterp is trying to do, is trying to find features, right? And, you know, it's a bit of a fuzzy definition, but think of a feature as sort of a property of a token or a group of tokens. For example, there might be a feature that fires on pronouns, and this might be an attention feature, where that feature firing might signify that the pronoun is attending to another proper noun somewhere in the sequence.

Or, you know, it could be a sort of more fuzzy sort of feature where if this feature is firing, then the sort of sequence sounds angry, right? And there's all sorts of features. Basically, you know, if you think about what a model is capable of, what it can represent, you should have a feature underlying all of those capabilities, right?

Now, the problem here is that it's actually really hard to identify features because models are so big. And in addition to that, you don't have nice clean features where one feature corresponds to a single neuron. And the sort of hypothesis is that it's because a model represents far more features than there are neurons available in the model.

If you think about, you know, every single feature that you'd want to represent to represent the world, then it's sort of like, it's sort of obvious that a model can't be large enough such that you have one neuron that corresponds to each feature, right? But empirically, you know, it's been shown that it is possible to identify features.

And this was initially done through really hard manual work and just, you know, people eyeballing feature activations. And broadly, this approach of figuring out what features a model represents is sometimes also called dictionary learning. So with that sort of background, you know, maybe I'll jump into the paper. So Towards Monosemanticity was one of the first papers to really introduce and demonstrate that sparse autoencoders appear to be a pretty effective dictionary learning method.

And what was done in this paper is that the authors at Anthropic trained sparse autoencoders to reconstruct the outputs of an MLP layer. But in this reconstruction, there's two things added. One is a sparsity penalty, to remove noise and find more interpretable feature activations. And two, an expansion factor, which makes it easier to represent or identify more features using the SAE.
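A minimal sparse autoencoder along those lines might look like this in PyTorch; this is a generic sketch of the recipe (reconstruction loss plus an L1 sparsity penalty, with an expansion factor), not Anthropic's actual implementation:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_mlp: int, expansion: int = 8):
        super().__init__()
        d_hidden = d_mlp * expansion              # more latent "features" than input dims
        self.encoder = nn.Linear(d_mlp, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_mlp)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # feature activations
        recon = self.decoder(features)
        return recon, features

sae = SparseAutoencoder(d_mlp=512, expansion=8)
acts = torch.randn(64, 512)                       # stand-in for MLP-layer activations

recon, feats = sae(acts)
l1_coeff = 1e-3                                   # sparsity penalty strength
loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().sum(dim=-1).mean()
loss.backward()
```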

And the way they did that in the paper was using a toy model with a single layer. So a single layer MLP. And they took the sort of hidden representations from the MLP and trained SAEs with a range of expansion factors from 1x to 256x. So, you know, I think -- now that I've gone through that, let me actually just skip to -- let me actually share my other screen where I've actually pulled up the dashboards of the SAEs, right?

And then you can actually get a sense of what's actually being identified by these SAEs, right? So, you know, one example feature which they talk about in the paper is a feature that happens to fire when the model thinks that something is a DNA sequence in lowercase, right? And that looks pretty accurate.

You know, what's interesting about these features is that if you look at what they do in terms of affecting downstream output is that if this feature fires on any single token, it has a very strong impact in terms of up-weighting subsequent tokens which are also, you know, DNA-like, right?

You know, you have other sort of -- there's all sorts of features. Some of them aren't so interpretable. Some of them are. You know, you'll have some that fire on some languages. You'll have some that fire on Caesar-shifted encoded words. And I'd encourage you to actually look at the visualization.

It's linked in the Towards Monosemanticity paper and super interesting. Maybe just to give you more -- you know, another sort of -- give you a sense of what other work is going on. So that work was done on a toy model, a single layer. It's not a real language model.

But on the open-source side, there's people training SAEs now using the same techniques on all sorts of models. GPT-2-small is a favorite just because it's very well understood by interpretability researchers. And you can see all sorts of features identified in GPT-2-small. And it's actually super easy to train an SAE now.

You can just use an off-the-shelf library and play around with it. But maybe just going back to, like, you know, why -- going back to, you know, why this matters and what the limitations are and where we're headed in the future. Let me just switch back to my other screen.

And here we go. You know, so -- maybe let me just talk about limitations first, right? This is a pretty new technique. Less than a year old now, really, at least in the context of mechinterp. You know, it's pretty clear that the feature set identified by SAEs isn't complete.

There's rare features. There's features that might get ignored for some reason or another due to training methods not being that sort of refined. You know, you also have this limitation where as you increase the expansion factor in the SAE, you get this phenomenon of feature splitting where, you know, what was previously a single feature now becomes a family of, like, 20 features, for example, all of which are, like, sub-features.

And it gets a bit complicated. So you don't necessarily get these very nice, clean representations. Everything's sort of fuzzy. But there is work going towards improving them, right? And there's quite a lot of work on improving SAEs. This is now seen as the most promising line of mechinterp work within both Anthropic and DeepMind.

And there's also folks at OpenAI working on SAEs. And OpenAI actually open-sourced their SAEs for GPT-2 small, which is pretty interesting. But, yeah, you know, I don't think this is -- this is, like, very much like an exploratory paper, right? It's sort of like -- I think the sort of point is that, hey, there's this new technique.

It looks pretty interesting. But the jury is still out on whether this ends up being, you know, the technique that solves mechinterp. But, yeah, that's it. >> Awesome. Thanks for the presentation. I think Warren had two questions. What is the difference between an SAE and a probing classifier? I think the other one he was asking is, what makes SAEs such a good fit for this specific task?

Why not use other models to extract monosemanticity? >> Yeah, sure. So it is similar in some regards to linear probes. But I think the nice thing about SAEs is that -- and this sort of kind of covers the second question to some extent -- is that you get a few nice features.

One is that it's unsupervised, so you can learn a lot of features at once, which is nice. It doesn't need any sort of humans eyeballing things. And two, so why SAEs in particular? You know, I think the elements of the SAE that are helpful in this regard are, one, the sparsity, which is important.

And also the ability to just reconstruct the inputs, right? Because you want that. And that helps you actually -- if you're reconstructing hidden activations, you get a very nice natural way to assess the goodness of an SAE, because you can measure the reconstruction error. So, you know, how you typically evaluate these models.

One metric that you measure them on is actually reconstruction error, right? So you ablate the actual sort of activations in the model, and you replace them with actually the reconstructed activations from the SAE. All righty. Any other questions? >> Seems like I think that's about it. I guess if that's the case, then let me just share briefly on Medusa.

Okay. Let me just see if I can find the screen. Awesome. Thanks for the reminder, Doug Abel. Appreciate it. All right. Sorry about that. So today I'll just be presenting quickly on Medusa. So the concept behind Medusa is just basically it's a better way to do speculative decoding. And so I think before we move into what speculative decoding is, I think it's important to see what are the main problems with model inference.

Just so we're all on the same page, I'm just going to paste the link that I'm looking at in the chat itself. So if you're familiar with any large language model, a large chunk of the time we spend is basically transferring the weights from, well, the high-bandwidth memory, like what you see over here, over to the cache, where there's a very limited amount of space and where most of the calculations happen.

So what that means is that if we run inference one time, we load in all the parameters, we load in all the inputs, and we get out only one token, and then we have to repeat the whole step. There are a lot of optimizations that have been done around it, but basically the main problem is still not fixed: you do all this work and you only get one token out.

So what people have done is this thing called speculative decoding. So what does that mean? Say we have a huge model, like a Llama 70B, and we have a smaller model, which we call a draft model, like a Llama 7B. The Llama 7B can take an initial prompt and quickly generate a whole bunch of tokens itself.

So, given some prompt, it generates like N tokens that it thinks are going to follow this specific prompt. So now we have a bunch of candidates, right? A proposed sequence from this draft model. So how do we know what percentage of these tokens are correct?

I guess along the same lines, how many of these tokens we should reject? The way to do this is to basically just batch all these tokens and feed it into the model itself. So what this could look like is let's say our prompt is the capital of France is, and our smaller model says, hey, the capital of France is Paris and it's a beautiful city.

For verification, we would pass in "the capital of France is", "the capital of France is Paris", and so on, and at each step what we're basically trying to see is: does the completion from the smaller model match what the big model would have produced? And what we're doing here is we're able to batch all of these inside a single forward pass.

And we don't incur a huge back and forth transfer. And so we see this huge speed up in terms of the decoding speed itself when we're generating tokens. So there's some problems with this though. The first one is, of course, optimization, because now we need a smaller model. We need to feed it through this whole chunk of data.

And we need to somehow do the reconciliation between the draft completions and the big model's completions. The second, which I think is the bigger problem, is that the draft model might not accurately reflect the capabilities or the world knowledge of the larger model itself. If you pair it with, let's say, a Gemma 2B, it might not really behave the same as a Llama 7B, even if it's able to decode like six times as fast.
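Before getting to Medusa itself, here's a rough sketch of the draft-and-verify loop just described, assuming Hugging Face style causal LMs that return .logits and simplifying acceptance to a greedy token match (real implementations use a probabilistic accept/reject rule):

```python
import torch

def speculative_step(big_model, draft_model, input_ids, n_draft=5):
    """One draft-and-verify step: the small model proposes n_draft tokens greedily,
    and the big model checks them all in a single forward pass."""
    draft_ids = input_ids
    for _ in range(n_draft):                                   # cheap autoregressive drafting
        next_id = draft_model(draft_ids).logits[:, -1].argmax(-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)

    big_logits = big_model(draft_ids).logits                   # one pass over prompt + draft
    proposed = draft_ids[0, input_ids.shape[1]:]               # the n_draft proposed tokens
    verified = big_logits[0, input_ids.shape[1] - 1:-1].argmax(-1)  # big model's own choices

    n_accept = 0                                               # keep the longest matching prefix
    while n_accept < n_draft and proposed[n_accept] == verified[n_accept]:
        n_accept += 1
    return torch.cat([input_ids, proposed[:n_accept].unsqueeze(0)], dim=-1)
```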

So in comes Medusa. Traditionally we have some input that passes through an embedding, it goes through your transformer layers, and we get our hidden state. This is going to be a single vector whose dimension is the same as your embedding dimension. And what we would always do is we'd say, okay, you then have a linear layer.

And you get out basically a vector with a whole bunch of probabilities that correspond to the probability that each individual token for that position is the next token that should be selected. So that's the original transformer flow. What Medusa does is that it slaps on a whole bunch of new MLP heads that operate on the same hidden state and try to predict the tokens further ahead.

So you can see the LM head predicts the next token, the first Medusa head predicts the second token in the completion, the second head predicts the third token, and so on. These heads aren't anything special. They're really just MLP networks that generate a distribution over the vocabulary.

So you can see over here that all it does is take the final hidden state, which is a one-by-d vector, and multiply it by a single weight matrix, W1 of head k, which is d by d. Then we add the hidden state back as a residual, pass the result through a vocabulary projection, and do a softmax.

I feel like I might be messing up the dimensions, but basically you're going to get out a probability distribution at the end over the whole vocabulary.
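In code, a single Medusa head is roughly the following sketch (dimension values are assumed; the head is a residual MLP on the hidden state followed by a projection to the vocabulary):

```python
import torch
import torch.nn as nn

class MedusaHead(nn.Module):
    """One extra decoding head: residual MLP on the hidden state, then project to vocab."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)        # the d x d matrix for this head
        self.lm_head = nn.Linear(d_model, vocab_size)  # maps to vocabulary logits

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        h = hidden + torch.nn.functional.silu(self.proj(hidden))  # residual connection
        return torch.softmax(self.lm_head(h), dim=-1)             # distribution over tokens

head = MedusaHead(d_model=4096, vocab_size=32000)      # illustrative sizes
probs = head(torch.randn(1, 4096))                     # (1, vocab_size) for the k-th next token
```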

Each head is essentially going to produce a probability distribution over all these different choices, and so you're going to get the top s_k different options for each position. I think the best way to see it is over here, where we can see the completions, right? So this is what the original language modeling head predicts.

These are going to be the next tokens predicted by the first Medusa head, the second Medusa head, the third Medusa head, and so on. And what they do is always keep the first token that's generated by the original language modeling head, to guarantee that you at least get a valid completion.

But then when it comes to how the other candidate tokens are chosen, there's a particular scheme they have. I think they also mentioned that they run some sort of greedy algorithm on a training dataset, whereby they try to determine the optimal shape of the tree, basically how many candidates to keep at each individual node level.

So there are two ways that they do this training. One is that you freeze the base LM and you only train the heads. And what this does is that you're basically just doing the same cross-entropy loss, but you apply this sort of decaying constant term, a constant taken to the power of k.

So what this means is that for the overall loss that you're calculating, they call it L_Medusa-1, and this is the first way to train the Medusa heads, you're essentially weighting the heads that predict tokens further and further out into the sequence less and less.
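In symbols, the Medusa-1 loss is roughly the following (my rendering from the description above; lambda is the decay constant and p_t^{(k)} is the k-th head's distribution):

```latex
\mathcal{L}_{\text{Medusa-1}}
  = \sum_{k=1}^{K} -\,\lambda_{k}\,\log p_t^{(k)}\!\left(y_{t+k+1}\right),
\qquad \lambda_{k} = \lambda^{k}
```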

And this is actually super fast: you just need about five hours with 60K samples and your 7B model is good to go. The harder way, which yields a lot better results, is for you to train the LM and the individual heads together.

That results in a new loss equation: this is your original language modeling loss, and this is, well, what we had over here, but now scaled by a small weighting term, lambda_0, which is basically a very small constant, so that the head predictions don't mess up the overall loss, because the Medusa heads are going to be super wrong at the start.

Since they're not trained on the dataset and haven't seen anything, they're just MLPs. So they do some sort of linear warmup, whereby the learning rate is slowly increased over time and then maybe decreased, and there's some scheduling going on there. The last part is just the dataset, which I think is pretty interesting.

If you're only training the Medusa heads, the dataset is not really a problem: you can just take a public dataset and train on it. And if you're worried that the dataset you're training the model on doesn't actually reflect what the model has learned, you can basically take a public dataset with a whole bunch of prompts and just get your model to generate the completions itself.

And for certain models that support system and user roles, you can have multi-turn conversations, which is great. So that generally works pretty well, from what they say, if you just freeze the base LM and train the Medusa heads. But they do say that if you are training the whole model plus the Medusa heads, which is this step over here, you probably want to also include a little KL divergence term, so that when you run the loss the model parameters don't change too much, and you minimize the difference between your model and the original model.

So that essentially you're still outputting like high quality completions. So yeah, that's basically the Medusa paper summarized pretty fast. There's a whole bunch of stuff that I've skipped over, but this is basically the main high level idea behind the Medusa paper itself. So yeah, happy to take any questions.

Let me just try to pull up the chat if there's any questions. But yeah. Okay. Does it cost more than increasing the beam search parameter? So if I use five Medusa heads, is it like five times more expensive? So I don't think they actually use beam search inside this itself.

So the way that I've seen it, I looked at the code before this to try and see, and they provide a few different ways. The first one is some greedy and nucleus sampling. Basically, the idea is that all you're doing with Medusa is changing the way that you generate these speculated tokens.

So you originally used the draft model, but with Medusa you use these heads. You're still running it through a separate verification step. So depending on how your beam search is implemented with the Medusa heads, I guess that will really determine the completion.

But I think it's not super clear in the paper how exactly they do the final computation. They just say that they try to find the longest prefix that's common across all the different potential completions that are generated. I hope that answers the question, Bennett. Let me see if I can pull up some stuff.

Let's see. Multi-token prediction. I've seen this paper. I haven't read it yet. So I think perhaps I will look at it after this and figure it out. Yeah. But I guess if not, then I'm just going to end the recording over here. If anyone has any questions, happy to answer it.

I'm just going to end the recording over here. All right. Yongxin, I think I need you to end the recording. Okay. Okay, let's share.