LLM Asia Paper Club Survey Round
Chapters
0:00 Let's Think Dot by Dot: Hidden Computation in Transformer Language Models
12:30 Uncertainty Estimation and Quantification for LLMs: A Simple Supervised Approach
32:23 Monosemanticity
44:54 Medusa - Simple Speculative Decoding using multiple heads
00:00:01.000 |
Okay, I will now start on presenting this paper. 00:00:04.000 |
Basically, I found out about this paper recently, 00:00:08.000 |
which is a paper called "Let's think dot by dot". 00:00:11.000 |
And the motivation behind choosing this paper is that 00:00:14.000 |
basically I always have the burning question as to like, 00:00:19.000 |
Now, we know that chain of thought reasoning process 00:00:22.000 |
or chain of thought prompting actually shows that 00:00:24.000 |
by allowing LLMs to think out loud before answering, 00:00:28.000 |
their performance actually improves considerably 00:00:30.000 |
when compared to direct answering techniques. 00:00:33.000 |
This actually provides us some intuition as to 00:00:42.000 |
suggests that for chain of thought reasoning, 00:00:56.000 |
before moving on to the related works section. 00:01:06.000 |
or are they able to think internally like humans? 00:01:10.000 |
are there any specific tasks to show that LLMs 00:01:13.000 |
are not simply relying on the semantic information 00:01:17.000 |
and possibly relying on other kinds of information 00:01:23.000 |
So I'll just skim through the related works section 00:01:26.000 |
just to provide some additional context on this work. 00:01:33.000 |
Basically, they are discussing that there are 00:01:39.000 |
Basically, the lowest level is something called TC0, 00:01:42.000 |
and the higher levels of task complexity include 00:01:51.000 |
and checking multiple constraints at the same time. 00:01:54.000 |
Now, the second part of the related works section 00:01:56.000 |
is talking about how transformers use tokens for reasoning. 00:02:06.000 |
when additional reasoning tokens are provided. 00:02:09.000 |
These tokens actually help to decompose the problem 00:02:13.000 |
and improve the model's reasoning capabilities. 00:02:18.000 |
are the LLMs making full use of these extra tokens 00:02:24.000 |
And the third part of the related works section 00:02:27.000 |
is talking about potential model performance improvements 00:02:32.000 |
Some work also suggested that using filler tokens 00:02:35.000 |
could improve performance on nested quantifier tasks. 00:02:38.000 |
Basically, tasks that have some form of constraints, 00:02:41.000 |
nested constraints that depend on each other. 00:02:49.000 |
that specifically target these kinds of problems 00:02:55.000 |
Now, let me move on to the methodology section 00:02:57.000 |
and talk about the tasks created by the authors. 00:03:13.000 |
I'll just move on to the first task created by the authors. 00:03:19.000 |
Basically, it's that you have a following list of numbers 00:03:22.000 |
and you want to check if there's a group of three numbers 00:03:25.000 |
inside that list that adds up to 0 modulo 10. 00:03:34.000 |
and the answer could be like 2, 3, 5 or 2, 7, 1 00:03:44.000 |
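To make the 3SUM task concrete, here is a minimal sketch (my own illustration, not the authors' code) of the modulo-10 check being described:

```python
from itertools import combinations

def has_3sum_mod10(numbers):
    """Return True if any three numbers in the list sum to 0 modulo 10."""
    return any((a + b + c) % 10 == 0 for a, b, c in combinations(numbers, 3))

# The examples mentioned above: 2 + 3 + 5 = 10 and 2 + 7 + 1 = 10, both 0 mod 10.
print(has_3sum_mod10([2, 3, 5]))   # True
print(has_3sum_mod10([2, 7, 1]))   # True
print(has_3sum_mod10([1, 2, 4]))   # False, since 1 + 2 + 4 = 7
```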
And for the second task, it's called 2SUM transform, 00:03:48.000 |
which is also regarding a sequence of numbers. 00:03:59.000 |
and then they are adding some form of permutation 00:04:03.000 |
and that permutation is only revealed at the end. 00:04:07.000 |
and then you add 3 to each number in the transformed list. 00:04:16.000 |
The key point is that it requires the model to remember the transformation information 00:04:20.000 |
and how to integrate this into its computational thinking process. 00:04:25.000 |
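And here is a rough sketch of my reading of the 2SUM-Transform task, where the transform (here "+3 to each number") is only revealed after the sequence itself, so the model has to hold the raw numbers and apply the transform before checking pairs; the exact details are assumptions on my part:

```python
def two_sum_transform(numbers, offset=3):
    """Check whether any pair of transformed numbers sums to 0 modulo 10,
    where the transform (add `offset` to every number) is only revealed
    after the sequence itself."""
    transformed = [n + offset for n in numbers]
    return any((transformed[i] + transformed[j]) % 10 == 0
               for i in range(len(transformed))
               for j in range(i + 1, len(transformed)))

print(two_sum_transform([1, 3, 8]))  # True: (1+3) + (3+3) = 10
print(two_sum_transform([1, 2, 5]))  # False: 4+5=9, 4+8=12, 5+8=13
```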
So let me move on to the experiment section of the paper. 00:04:30.000 |
So basically, the authors are using some form of scaled-down, 00:04:33.000 |
randomly-initialized LLaMA model in their training process 00:04:33.000 |
using three different model configurations for the setup. 00:04:37.000 |
And the first configuration is basically using filler tokens, 00:04:45.000 |
basically a series of dots as intermediate tokens 00:04:48.000 |
between the input tokens and the output tokens. 00:04:51.000 |
Now, the second configuration is that instead of using dots, 00:04:54.000 |
they are using the chain of thought reasoning tokens 00:04:59.000 |
Now, the third configuration is the sort of like the control setup 00:05:02.000 |
where there are no filler tokens in between the input and output. 00:05:08.000 |
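As a minimal sketch of how these three setups might format a single training example (the token strings and filler count here are made up for illustration, not taken from the paper):

```python
def format_example(question, answer, mode, cot="", n_filler=10):
    """Build the training string for one of the three configurations:
    'filler' puts dots between input and output, 'cot' puts real
    chain-of-thought tokens there, and 'none' answers immediately."""
    if mode == "filler":
        middle = " ".join(["."] * n_filler)
    elif mode == "cot":
        middle = cot
    else:  # "none": the control setup
        middle = ""
    return " ".join(part for part in (question, middle, answer) if part)

for mode in ("filler", "cot", "none"):
    print(mode, "->", format_example("4 3 9 6 1", "True", mode,
                                     cot="4 3 3 sums to 10 so ANS True"))
```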
And then the results show that for the 3SUM task, 00:05:14.000 |
as the length of the sequence of numbers increases, 00:05:20.000 |
between the configuration with the filler tokens 00:05:29.000 |
And then this basically shows that as the task complexity increases, 00:05:40.000 |
And the next figure is talking about a different experiment 00:05:44.000 |
which is basically the authors trying to freeze the model weights 00:05:49.000 |
and only modify the final attention layer. 00:05:52.000 |
And then basically, they are trying to prove that 00:05:54.000 |
whatever changes they make in the final layer, 00:05:59.000 |
Basically, the performance improvements or performance changes 00:06:02.000 |
are only resulting from the changes in the final layer. 00:06:06.000 |
So in the final layer, what they tried is that 00:06:08.000 |
they tried to increase the percentage of filler tokens 00:06:14.000 |
actually it increases the accuracy up to like 100%. 00:06:25.000 |
until it tapers off and then it has some diminishing returns 00:06:34.000 |
the authors lay out a table comparing the chain of thought, 00:06:37.000 |
the filler tokens, and the no-tokens configuration. 00:06:44.000 |
the chain of thought configuration still comes out on top. 00:06:48.000 |
And the filler token is sort of getting quite close 00:06:53.000 |
while the no-tokens is showing like a 78% accuracy. 00:06:58.000 |
Now, for both of the aforementioned configurations, 00:07:12.000 |
what the authors actually found out in their experiment is that 00:07:19.000 |
they actually needed the chain of thought configuration 00:07:22.000 |
to be set up first to see how the chain of thought tokens 00:07:32.000 |
So this basically means that this is actually quite challenging. 00:07:44.000 |
This also implies that the filler tokens must be placed 00:07:46.000 |
in some particular arrangement or configuration 00:07:55.000 |
some food for thought is basically saying that 00:08:07.000 |
they still contain some form of structural information 00:08:17.000 |
We can also use other kinds of tokens, right? 00:08:30.000 |
to see if there are any different kinds of results. 00:08:33.000 |
Although dot tokens are technically filler tokens, 00:08:43.000 |
So, in summary, this paper basically shows 00:08:47.000 |
how large language models can use filler tokens 00:08:50.000 |
to achieve higher performance in certain tasks 00:08:53.000 |
when compared to not using any filler tokens at all 00:09:05.000 |
but maybe some form of structural information. 00:09:18.000 |
I think Casper had a question in the chat itself, 00:09:21.000 |
whether the filler tokens were special tokens 00:09:27.000 |
or was it just the same token used for pure filler 00:09:41.000 |
So, they were not like a special token added in afterwards. 00:09:50.000 |
which is: why not just use an unknown token or something? 00:09:54.000 |
Yeah, that's an interesting question also, yeah. 00:10:00.000 |
you'll probably have different results, yeah. 00:10:10.000 |
if you prompt the model and you put a space in between, 00:10:17.000 |
and then you let it do the normal inference call, 00:10:21.000 |
because the space itself is another token, right? 00:10:25.000 |
that sort of like modifies the computation itself. 00:10:43.000 |
could it also be because of positional biases? 00:10:48.000 |
I'm not too sure about what positional biases are. 00:10:50.000 |
I mean, you can drop it in the chat or you can unmute yourself. 00:10:55.000 |
there's a positional encoding in the transformers, right? 00:10:57.000 |
So because the models will train on a lot of sequences, 00:11:02.000 |
I think maybe if it trains too much, it gets saturated, 00:11:04.000 |
then it kind of like learns the positional encoding as well. 00:11:09.000 |
then just nice, most of the training sequences, 00:11:11.000 |
that particular one also is aligned with, like, 00:11:24.000 |
but I don't really know how to actually explain it. 00:11:26.000 |
Let me see where I can find an article on it. 00:11:32.000 |
the prompt and the original input would change, 00:11:34.000 |
but I think perhaps what you're thinking about 00:11:36.000 |
is the additional space itself plus positional bias 00:11:55.000 |
Large language models are not very valid in this. 00:12:43.000 |
So it's about how we use a second model actually 00:12:48.000 |
to estimate the uncertainty of LLM's response. 00:12:53.000 |
I spent a few days trying to think of a paper to sort of talk about. 00:12:59.000 |
For one, it approaches the problem quite differently. 00:13:02.000 |
It's from a bunch of people in operations research, 00:13:07.000 |
And the sort of approach of doing it is actually more of just getting data, 00:13:11.000 |
fitting a model in, and then trying to make use of that model 00:13:19.000 |
Number one, here they train a regression model. 00:13:22.000 |
And, in fact, actually, it's a classical one, 00:13:24.000 |
random forest, to estimate the uncertainty of an LLM's response. 00:13:28.000 |
And the input for that is just the LLM's hidden layers 00:13:32.000 |
of the last token, activations of the last token, 00:13:41.000 |
where you don't actually get the actual hidden layers, 00:13:46.000 |
you can actually use some of the probability-related output 00:13:53.000 |
And the output, it's actually not the logits of the language model, 00:13:59.000 |
but instead it's actually a task-specific score, 00:14:02.000 |
and typically between 0 and 1, about the certainty of the answer. 00:14:05.000 |
So I'll just elaborate a bit more about that later. 00:14:09.000 |
The paper covers a bit about the existing methods, 00:14:12.000 |
and I think most of the methods right now for quantifying uncertainty, 00:14:17.000 |
they tend to be based directly on the output of the language model. 00:14:22.000 |
Say, given a fixed prompt, you might get different outputs, 00:14:25.000 |
and you try to do sampling there to see what the variation in the output is, 00:14:29.000 |
or you might add some perturbations to your prompt 00:14:31.000 |
and see how that results in variation in your output, 00:14:41.000 |
and in contrast to what they're doing with a supervised method 00:14:44.000 |
where they actually have ground truth to some degree, 00:14:53.000 |
where they take a second model to quantify the base language model, 00:14:59.000 |
but not for recent large language models like Llama 3 or Gemma. 00:15:07.000 |
I think the paper sort of points out point two and point three 00:15:15.000 |
you actually can get improved performance on certain tasks, 00:15:18.000 |
and it's a potential use case in detecting hallucinations 00:15:21.000 |
because by getting a sort of confidence score for an answer, 00:15:24.000 |
you can then perhaps hedge, do a different sampling approach. 00:15:29.000 |
Point one and point four are actually my own ideas upon reading the paper 00:15:33.000 |
where actually if, say, you have a language model 00:15:40.000 |
a sort of certainty score about the response, 00:15:43.000 |
imagine a typical chatbot, people talk about RAG chatbots, 00:15:46.000 |
if you're able to provide like the certainty score, 00:15:51.000 |
into explaining the certainty of the response. 00:15:56.000 |
as you'll see later on in this quick sharing, 00:16:04.000 |
and that actually opens up possible use cases of auto evals 00:16:09.000 |
you've already done your evals for like maybe a good training set, 00:16:13.000 |
and then now you can scale this up to auto evals, 00:16:15.000 |
and actually when you deploy an LLM system live, 00:16:18.000 |
this auto eval could actually help sort of highlight cases 00:16:22.000 |
where your system might potentially be giving low confidence answers. 00:16:27.000 |
So let's just try to formalize this a little bit, 00:16:34.000 |
So the first thing is an LLM is just abstracted into, 00:16:37.000 |
I give it an input and it generates a response, 00:16:40.000 |
and the prompt here denoted by X is a series of tokens, 00:16:44.000 |
and it's X1 to XK, and they all belong to this set chi, 00:16:48.000 |
which is the vocabulary of the language model. 00:16:54.000 |
which is over in the second line here, Y, it's a vector Y, 00:16:58.000 |
where it consists of multiple tokens, like say M tokens, 00:17:06.000 |
but it's actually also the probability distribution, 00:17:08.000 |
which is the third line here where each token, y_j, 00:17:12.000 |
is sampled from the conditional probability distribution 00:17:20.000 |
given the input prompt and all the previous outputs, y_1 all the way to y_{j-1}. 00:17:29.000 |
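Putting that verbal description into symbols (my own transcription of the setup being read out):

```latex
% Prompt: a sequence of K tokens from the vocabulary \mathcal{X}
x = (x_1, \dots, x_K), \qquad x_k \in \mathcal{X}

% Response: M tokens, generated autoregressively
y = (y_1, \dots, y_M)

% Each output token is drawn from the conditional distribution
y_j \sim p(\,\cdot \mid x_1, \dots, x_K,\ y_1, \dots, y_{j-1}\,)
```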
Typically then, if you use your language model, 00:17:35.000 |
we actually want to then use it for downstream tasks 00:17:40.000 |
and here we actually have a scoring function, 00:18:16.000 |
given the input prompt and the output response. 00:18:20.000 |
So given the prompt that I feed into the language model, 00:18:24.000 |
and the response that the language model gives, 00:18:36.000 |
white box language models where you have the weights, 00:18:38.000 |
grey box language models where you don't have the weights, 00:18:41.000 |
and maybe you have more details about output, 00:18:43.000 |
like say the probabilities of the various tokens, 00:18:47.000 |
the logprobs, or completely black box models, 00:18:47.000 |
any sort of indication of what goes on behind the scenes, 00:18:57.000 |
and it just only gives a final chat completion. 00:19:03.000 |
because these sort of methods thereafter extend from here. 00:19:08.000 |
So the sort of, for the white box language models, 00:19:30.000 |
and lastly, the fourth item, which is the score, 00:19:35.000 |
vis-a-vis the answer that the model generated. 00:19:43.000 |
because of the probabilistic nature of a language model, 00:19:45.000 |
so that actually gives me more training data as well. 00:19:57.000 |
to construct what they call the uncertainty data set. 00:20:09.000 |
and v_i here is a vector of selected features. 00:20:13.000 |
So there are billions of parameters in a language model, 00:20:27.000 |
They even suggest other ways of getting features, 00:20:51.000 |
in their approach when I read through the paper and the code. 00:21:01.000 |
where you just have a series of feature vectors, 00:21:01.000 |
supervised learning model to predict that score. 00:21:28.000 |
so this is the kind of paper where you read the appendix 00:21:51.000 |
and 100 by calculating the correlation coefficient, 00:21:54.000 |
and then they train a random forest regressor 00:22:10.000 |
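As a rough sketch of that pipeline (features of the base LLM's response in, task-specific score out), with placeholder data standing in for the selected hidden-layer activations and the grading scores:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical uncertainty dataset: v_i = selected hidden-layer activations of
# the last token (random placeholders here), s_i = task-specific score in [0, 1]
# (e.g. correctness of the LLM's answer against ground truth).
rng = np.random.default_rng(0)
V = rng.normal(size=(1000, 256))     # 1000 examples, 256 selected features
s = rng.uniform(size=1000)           # placeholder scores

V_train, V_test, s_train, s_test = train_test_split(V, s, test_size=0.2, random_state=0)

# The "second model": a classical random forest regressor that maps features
# of the base LLM's response to a confidence / uncertainty score.
uq_model = RandomForestRegressor(n_estimators=200, random_state=0)
uq_model.fit(V_train, s_train)

predicted_confidence = uq_model.predict(V_test)
print(predicted_confidence[:5])
```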
it was just to demonstrate that it's possible, 00:34:10.000 |
And, you know, it's a bit of a fuzzy definition, 00:35:14.000 |
it's actually really hard to identify features 00:35:42.000 |
every single feature that you'd want to represent 00:37:25.000 |
it's easier to sort of represent more features 00:37:43.000 |
And they took the sort of hidden representations 00:38:41.000 |
You know, what's interesting about these features 00:38:53.000 |
is that if this feature fires on any single token, 00:39:31.000 |
It's linked in the Towards Monosemanticity paper 00:39:44.000 |
give you a sense of what other work is going on. 00:40:14.000 |
And it's actually super easy to train an SAE now. 00:40:36.000 |
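For reference, the usual recipe really is quite short: an overcomplete linear encoder/decoder trained with a reconstruction loss plus an L1 sparsity penalty. This is a bare-bones sketch with placeholder sizes and hyperparameters, not any particular released implementation:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, expansion=8):
        super().__init__()
        d_hidden = d_model * expansion             # overcomplete feature dictionary
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)
        return recon, features

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                    # sparsity strength (placeholder)

# `acts` would normally be residual-stream activations collected from the model;
# here it is just random data so the snippet runs.
acts = torch.randn(4096, 768)
for batch in acts.split(256):
    recon, features = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coeff * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```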
why this matters and what the limitations are 00:40:52.000 |
maybe let me just talk about limitations first, right? 00:41:05.000 |
that the feature set identified by SAEs isn't complete. 00:41:13.000 |
due to training methods not being that sort of refined. 00:41:23.000 |
where as you increase the expansion factor in the SAE, 00:41:30.000 |
where, you know, what was previously a single feature 00:41:34.000 |
now becomes a family of, like, 20 features, for example, 00:41:44.000 |
these very nice, clean representations necessarily. 00:41:50.000 |
But there is work going towards, like, improving them, right? 00:41:53.000 |
And there's quite a lot of work on improving SAEs. 00:42:00.000 |
of mech interp work within both Anthropic and DeepMind. 00:42:00.000 |
And there's also folks at OpenAI working on SAEs. 00:42:09.000 |
And OpenAI actually open-sourced their SAEs for GPT-2 small, 00:42:16.000 |
But, yeah, you know, I don't think this is -- 00:42:18.000 |
this is, like, very much like an exploratory paper, right? 00:42:29.000 |
But the jury is still out on whether this ends up being, 00:42:32.000 |
you know, the technique that solves mech interp. 00:42:46.000 |
What is the difference between SAE versus a probing classifier? 00:42:52.000 |
what makes SAEs such a good fit for this specific task? 00:42:56.000 |
Why not use other models to extract monosemanticity? 00:43:02.000 |
So it is similar in some regards to linear probes. 00:43:08.000 |
But I think the nice thing about SAEs is that -- 00:43:11.000 |
and this sort of kind of covers the second question to some extent -- 00:43:20.000 |
so you can learn a lot of features at once, which is nice. 00:43:23.000 |
It doesn't need any sort of humans eyeballing things. 00:43:36.000 |
You know, I think, you know, the features that are helpful -- 00:43:42.000 |
or rather than features, the elements of the SAE, 00:43:42.000 |
which are helpful in this regard, one is like the sparsity, 00:43:49.000 |
And also the ability to just reconstruct the inputs, right? 00:44:00.000 |
you get a very nice natural way to assess the goodness of an SAE, 00:44:04.000 |
because you can measure the reconstruction error. 00:44:07.000 |
So, you know, how you typically evaluate these models. 00:44:10.000 |
One metric that you measure them on is actually reconstruction error, right? 00:44:14.000 |
So you ablate the actual sort of activations in the model, 00:44:19.000 |
and you replace them with actually the reconstructed activations from the 00:44:46.000 |
I guess if that's the case, then let me just share briefly on Medusa. 00:45:00.000 |
So today I'll just be presenting quickly on Medusa. 00:45:04.000 |
So the concept behind Medusa is just basically it's a better way to do 00:45:10.000 |
And so I think before we move into what speculative decoding is, 00:45:13.000 |
I think it's important to see what are the main problems with model 00:45:18.000 |
I'm just going to paste the link that I'm looking at in the chat itself. 00:45:23.000 |
So if you're familiar with any large language model, 00:45:26.000 |
a large chunk of the time that we spend, it's basically running, 00:45:31.000 |
I guess the high-bandwidth memory environment, like what you see over here, 00:45:35.000 |
which is where there's a very limited amount of space and where most of the 00:45:40.000 |
So what that means is that if we run inference one time, 00:45:43.000 |
we load in all parameters, we load in all inputs, 00:45:49.000 |
we only get one token and we have to repeat the whole step. 00:45:52.000 |
There are a lot of optimizations that have been done around it, 00:45:55.000 |
but basically the main problem is still not fixed that you, 00:46:00.000 |
you do all this work and you only get one token out. 00:46:03.000 |
So what people have done is this thing called speculative decoding. 00:46:09.000 |
So I think we have a huge model, like a Llama 70B, 00:46:12.000 |
and we have a smaller model, which we call a draft model, 00:46:18.000 |
A Llama 7B can take an initial prompt and quickly generate a whole bunch of 00:46:23.000 |
So you can think, let's say you've seen some prompt, 00:46:26.000 |
it generates like N tokens that are supposed to be, 00:46:29.000 |
that it thinks are going to follow this specific prompt itself. 00:46:36.000 |
So now we have a proposed sequence from this draft model. 00:46:39.000 |
So how do we know what percentage of these tokens are correct? 00:46:43.000 |
I guess along the same lines, how many of these tokens we should reject? 00:46:47.000 |
The way to do this is to basically just batch all these tokens and feed 00:46:52.000 |
So what this could look like is let's say our prompt is the capital of 00:46:58.000 |
the capital of France is Paris and it's a beautiful city. 00:47:01.000 |
For inference, we would pass in the capital of France is, 00:47:07.000 |
And each step, what we're basically trying to see is, 00:47:09.000 |
does the completion from the smaller model match the big model? 00:47:14.000 |
And so what we're doing here is we're able to batch all of these inside a 00:47:18.000 |
And we don't incur a huge back and forth transfer. 00:47:21.000 |
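Here is a toy sketch of that draft-then-verify loop (greedy acceptance only; `draft_model` and `big_model` are hypothetical callables that return the greedy next-token prediction for every position in one pass):

```python
def speculative_step(big_model, draft_model, prompt_ids, n_draft=5):
    """One round of greedy speculative decoding (toy illustration).

    draft_model(ids) / big_model(ids) are assumed to return the greedy
    next-token prediction after every prefix of `ids` in a single pass."""
    # 1. The cheap draft model proposes n_draft tokens autoregressively.
    draft_ids = list(prompt_ids)
    for _ in range(n_draft):
        draft_ids.append(draft_model(draft_ids)[-1])

    # 2. The big model scores the whole proposed sequence in ONE batched pass.
    big_preds = big_model(draft_ids[:-1])   # big model's choice after each prefix

    # 3. Accept draft tokens from the left until the first disagreement.
    accepted = list(prompt_ids)
    for pos in range(len(prompt_ids), len(draft_ids)):
        big_token = big_preds[pos - 1]
        accepted.append(big_token)          # the big model's token is always safe to keep
        if draft_ids[pos] != big_token:
            break                           # reject everything after the mismatch
    return accepted

# Toy demo with list-based "models": the correct next token is always t + 1,
# but the draft model makes a mistake whenever it sees the token 3.
big = lambda ids: [t + 1 for t in ids]
draft = lambda ids: [t + 1 if t != 3 else 99 for t in ids]
print(speculative_step(big, draft, [1, 2, 3]))  # -> [1, 2, 3, 4]: wrong guess replaced, rest discarded
```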
And so we see this huge speed up in terms of the decoding speed itself 00:47:35.000 |
We need to feed it through this whole chunk of data. 00:47:38.000 |
And we need to somehow do the reconciliation between the original 00:47:43.000 |
The second, which I think is the bigger problem, 00:47:45.000 |
is that the draft model might not accurately reflect the capabilities or 00:47:50.000 |
the world knowledge of the larger model itself. 00:47:54.000 |
If you compare, let's say, a Gemma 2B, 00:47:56.000 |
it might not really be the same as a Llama 7B, 00:47:58.000 |
even if it's able to decode like six times as fast. 00:48:04.000 |
So traditionally we have some input passes through an embedding. 00:48:09.000 |
It goes through your transformer layers and we get our hidden state. 00:48:12.000 |
This is going to be a single vector of some dimensions. 00:48:15.000 |
That's going to be the same as your embedding dimension itself. 00:48:18.000 |
And what we would always do is we'd say, okay, 00:48:26.000 |
And you get out basically a vector with a whole bunch of probabilities that 00:48:30.000 |
correspond to the probability that each individual token for that position is 00:48:37.000 |
So that's the original transformer flow itself. 00:48:41.000 |
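In rough PyTorch terms (placeholder sizes, not the actual model), that standard flow is just:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 4096, 32000                # placeholder dimensions
lm_head = nn.Linear(d_model, vocab_size, bias=False)

hidden_state = torch.randn(1, d_model)           # last hidden state from the transformer
logits = lm_head(hidden_state)                   # project up to vocabulary size
probs = torch.softmax(logits, dim=-1)            # probability of each token coming next
next_token = probs.argmax(dim=-1)
```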
What Medusa does is that it slaps on a whole bunch of new MLPs that operate 00:48:45.000 |
on the same hidden state and try to make the same prediction itself. 00:48:52.000 |
So while the original language modeling head predicts "is", the first Medusa head 00:48:55.000 |
predicts the second token in the completion, 00:48:58.000 |
and it goes for "is", comma, "the", and so on across the heads. 00:49:05.000 |
They're really just MLP networks that generate a distribution over the vocabulary. 00:49:08.000 |
So you can see over here that all it does is it's just, well, 00:49:13.000 |
This is probably going to be a one times D vector, right? 00:49:16.000 |
It's multiplied by a single weight matrix, W1 for head k, which is d by d. 00:49:26.000 |
And so then once we add these two together as a residual connection, we do a softmax. 00:49:29.000 |
I feel like I might be messing up the dimensions, but basically it's a, 00:49:33.000 |
you're going to get out a probability distribution at the end that's equal to 00:49:38.000 |
Each head is essentially going to produce a probability distribution over all 00:49:45.000 |
the vocabulary, giving you basically s_k different options for each token. 00:49:47.000 |
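A sketch of one such head, following the formulation in the Medusa paper (a softmax over a residual SiLU MLP of the hidden state, projected to the vocabulary); the sizes are the same placeholders as above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size, n_heads = 4096, 32000, 4    # placeholder sizes

class MedusaHead(nn.Module):
    """One extra decoding head: a residual MLP over the same last hidden state,
    followed by its own projection to the vocabulary."""
    def __init__(self):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_model)               # the "d by d" matrix
        self.w2 = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, h):
        return torch.softmax(self.w2(F.silu(self.w1(h)) + h), dim=-1)

heads = nn.ModuleList([MedusaHead() for _ in range(n_heads)])
h = torch.randn(1, d_model)                                 # last hidden state
# Head k predicts the (k+2)-th token of the continuation; keep its top s_k candidates.
s_k = 5
candidates = [head(h).topk(s_k, dim=-1).indices for head in heads]
```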
I think the best way to see it is sort of over here, 00:49:52.000 |
So this is what the original language modeling head sort of predicts. 00:49:56.000 |
These are going to be the next tokens that are predicted by the 00:50:01.000 |
first Medusa head, the second Medusa head, the third Medusa head, and so on. 00:50:06.000 |
And so what they do is that they always choose the first token that's 00:50:10.000 |
generated by the original language modeling head to guarantee that your 00:50:15.000 |
But then when it comes to the other tokens being chosen and that by itself, 00:50:22.000 |
So I think they also mentioned that they do some sort of greedy algorithm 00:50:27.000 |
whereby they try to determine the optimal size for the tree. 00:50:33.000 |
So there are two ways that they do this training. 00:50:34.000 |
One is that you freeze the base LM and you only train the heads. 00:50:37.000 |
And what this does is that you basically are just doing the same cross-entropy loss, 00:50:42.000 |
but you apply this sort of weighting constant term here that is a constant 00:50:49.000 |
So what this means is that for the overall loss that you're calculating, 00:50:55.000 |
This is the first way that you train the Medusa heads. 00:51:00.000 |
the heads that are predicting tokens that are further and further out into 00:51:05.000 |
And this is actually super fast because it only takes around five hours. 00:51:08.000 |
You just need about five hours with 60K samples and your 7B model is good 00:51:14.000 |
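A sketch of that first (frozen-base) training objective as I understand it: one cross-entropy term per head, scaled by a constant that decays for heads predicting further into the future. The decay value is my recollection of the paper and should be treated as a placeholder:

```python
import torch
import torch.nn.functional as F

def medusa1_loss(head_logits, head_labels, decay=0.8):
    """head_logits: list of [batch, seq, vocab] tensors, one per Medusa head.
    head_labels: list of [batch, seq] token-id tensors, one per head, each
    shifted so it lines up with the token that head is supposed to predict."""
    loss = torch.tensor(0.0)
    for k, (logits, labels) in enumerate(zip(head_logits, head_labels)):
        lam_k = decay ** (k + 1)                # constant weight, smaller for later heads
        loss = loss + lam_k * F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    return loss
```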
The harder way, which gives a lot better results, is basically for you to 00:51:20.000 |
That results in a new loss equation, where this is your original language 00:51:23.000 |
modeling loss, and this is, well, what we had over here. 00:51:27.000 |
So this time they have this smaller weighting term, lambda_0, 00:51:29.000 |
which is basically a very, very small term. 00:51:31.000 |
So that the head prediction doesn't mess up the overall loss because the 00:51:35.000 |
Medusa heads are going to be super wrong at the start. 00:51:38.000 |
Since they're not trained on the dataset yet and haven't seen anything. 00:51:43.000 |
So they do some sort of linear warmups whereby the learning rate is slowly 00:51:46.000 |
increased over time and then maybe decrease and there's some scheduling 00:51:51.000 |
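And the second, joint recipe would then look something like the following, reusing `medusa1_loss` from the sketch above; the small weight (lambda_0 in my notation) keeps the initially-wrong heads from dominating, and its value here is a placeholder, not taken from the paper:

```python
def medusa2_loss(lm_loss, head_logits, head_labels, lambda_0=0.2):
    """Joint objective: the original language-modeling loss plus the
    Medusa-head loss scaled down by a small constant weight."""
    return lm_loss + lambda_0 * medusa1_loss(head_logits, head_labels)
```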
The last part is just this dataset, which I think was, 00:51:57.000 |
If you train your model on this dataset itself, 00:52:00.000 |
just on the Medusa heads, it's not really a problem. 00:52:03.000 |
You can just take a public dataset and you can train it. 00:52:09.000 |
if you're worried that the dataset that you're training a model on doesn't 00:52:13.000 |
you can just basically take a public dataset with a whole bunch of prompts 00:52:16.000 |
and just get your model to generate a completion itself. 00:52:20.000 |
And for certain models that basically have the ability to 00:52:28.000 |
you can have multi-turn conversations, which is great. 00:52:35.000 |
If you just freeze the base LM and you just train the Medusa heads, 00:52:39.000 |
but they do say that if you are training the whole model itself, 00:52:43.000 |
plus the Medusa heads, which is this step over here, 00:52:46.000 |
you probably want to also include like a little KL divergence term so that 00:52:51.000 |
the model parameters don't change so much and you want to sort of minimize 00:52:55.000 |
the difference of your model from your original model. 00:52:58.000 |
So that essentially you're still outputting like high quality completions. 00:53:04.000 |
So yeah, that's basically the Medusa paper summarized pretty fast. 00:53:07.000 |
There's a whole bunch of stuff that I've skipped over, 00:53:10.000 |
but this is basically the main high level idea behind the Medusa paper itself. 00:53:18.000 |
Let me just try to pull up the chat if there's any questions. 00:53:32.000 |
Does it cost more than increasing the beam search parameter? 00:53:45.000 |
So I don't think they actually use beam search inside this itself. 00:53:52.000 |
so I looked at the code before this to try and see, 00:53:55.000 |
The first one is some greedy and nucleus sampling. 00:53:58.000 |
Basically the idea is that all you're doing with Medusa is that you're 00:54:05.000 |
You're changing the way that you generate these speculated tokens itself. 00:54:13.000 |
You're still running it through a separate search. 00:54:18.000 |
how your beam search is implemented with the Medusa heads themselves, 00:54:21.000 |
I guess it will really determine the completion. 00:54:26.000 |
But I think it's not super clear in the paper how exactly they do the 00:54:34.000 |
They just say that they try to find the longest prefix length that's 00:54:37.000 |
common across all the different potential completions that are generated. 00:54:55.000 |
So I think perhaps I will look at it after this and figure it out. 00:55:03.000 |
But I guess if not, then I'm just going to end the recording over here. 00:55:06.000 |
If anyone has any questions, happy to answer it. 00:55:08.000 |
I'm just going to end the recording over here. 00:55:19.000 |
Yongxin, I think I need you to end the recording.