[Paper Club] Writing in the Margins: Chunked Prefill KV Caching for Long Context Retrieval
00:00:00.000 |
it's recorded but tl;dr probably just shared internal so sam we'll we'll get you recording 00:00:06.240 |
after cool i'm i'm gonna pass off to you guys whenever you want to start i'm sure people 00:00:16.080 |
trickle in so from writer side i would be presenting i don't know if you guys can hear me 00:00:23.840 |
yep all right uh there should be also a sim i don't know if he's also present but i will 00:00:30.480 |
kick off the presentation let me know whenever you everyone is ready and i will start 00:00:56.240 |
okay i will start by sharing my screen meanwhile let's see if it works 00:01:07.120 |
desktop 2 and let me know if you can see my screen 00:01:13.840 |
yep we can see your slides all right perfect so tonight uh can i start right everyone 00:01:27.600 |
yeah you should be good it's recording people might trickle in every now and then but 00:01:32.800 |
All right, so tonight I will be presenting a paper that came out from Writer: Writing in the Margins, 00:01:40.240 |
a better inference pattern for long context retrieval. 00:01:44.720 |
It's a paper about long context and how we can leverage the KV cache to make 00:01:53.040 |
long context modeling more effective. I will give a lightly technical presentation first, 00:02:02.800 |
and later we can deep dive into the details of the paper. So let me open the 00:02:08.800 |
chat so i can see what everyone is writing meanwhile i'm while i'm talking oh yeah i think 00:02:16.000 |
i think you can just ignore the chat people are gonna be it's gonna be really buzzy and 00:02:20.560 |
there's gonna be a lot of people like oh wow that's so cool everything yeah vibhu and i will 00:02:24.800 |
take care of the chat for you perfect something super crazy pops up we'll let you know otherwise 00:02:29.360 |
we'll we'll take care of it all right perfect all right so i will skip the part of what is 00:02:34.320 |
a language model, but in short: a language model is a probabilistic model that leverages the prompt 00:02:39.040 |
to predict the next token, and we generate text with a language model 00:02:42.560 |
iteratively, one token at a time. Most language models nowadays are based on the 00:02:49.360 |
transformer architecture, and in a transformer, whenever we have a prompt, the first 00:02:55.120 |
thing we do is put this prompt into the memory of the language model, which is known as 00:02:59.760 |
the KV cache: the keys and values in the transformer layers. The 00:03:07.360 |
operation of creating this initial KV cache is known as prefilling. So the first thing that the 00:03:13.680 |
inference engine, which could be vLLM or any other inference 00:03:18.160 |
framework you're using, does with your prompt is this prefilling. 00:03:25.200 |
And, if you're interested, prefilling is actually one of the most expensive 00:03:29.040 |
parts of processing a prompt, because it has a quadratic cost with respect 00:03:33.840 |
to compute as well as with respect to memory. So imagine we have a prompt 00:03:39.360 |
that says "Austin is a city in". The first thing 00:03:45.600 |
we do to generate text from this prompt is prefill it into the language model's KV cache, 00:03:51.280 |
and then we leverage it to generate tokens, one token at a time. The KV cache is a kind 00:03:58.160 |
of memory in the language model, or really in any transformer model that is autoregressive: 00:04:04.160 |
if it contains some tokens, the language model will leverage them; if it doesn't contain 00:04:10.160 |
those tokens, the language model cannot leverage them. The language model only sees 00:04:14.480 |
what is inside the KV cache. So what happens with our prompt "Austin is a city in"? The first thing 00:04:19.280 |
we do is the prefilling, which puts these tokens into the KV cache, one for 00:04:25.600 |
each layer of the transformer. Then we ask the language 00:04:31.120 |
model for the next token; suppose the next token is the word "Texas". We take this token "Texas", 00:04:37.200 |
we keep it in the KV cache so that the language model can leverage it to generate the next token, 00:04:42.080 |
and suppose the next token is a comma, and so on, until we generate the entire response. 00:04:47.680 |
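A minimal sketch of the prefill-then-decode loop just described, using Hugging Face transformers with a small placeholder model ("gpt2" here only for illustration); this is not the paper's code, just the standard KV-cache usage pattern:

```python
# Illustrative only: prefill the prompt once, then decode one token at a time,
# reusing the KV cache. "gpt2" is a placeholder model for the sketch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tok("Austin is a city in", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: one forward pass over the whole prompt builds the KV cache.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(-1, keepdim=True)

    generated = [next_id]
    for _ in range(10):
        # Decode: each step feeds only the newest token and reuses the cache.
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```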
Now, we have seen that in the past few years prompts have become longer and 00:04:53.840 |
longer: we started with 2,000-token context windows, then 4,000, 8,000, 32k, 64k, 100k, and now we have reached 00:05:01.840 |
millions of tokens. This means you can send an entire book to the language model and 00:05:08.880 |
ask it questions about that book. However, with great prompts comes 00:05:14.400 |
great responsibility, and the reason is the following. Imagine you have a very long 00:05:21.520 |
prompt: suppose you have a book and you want the language model to answer questions about this 00:05:26.720 |
book, and suppose the book is around 1 million tokens long. If you try to prefill this 00:05:34.400 |
1-million-token prompt into the KV cache, the language model will actually not be able to do 00:05:42.320 |
that, because, as I said before, prefilling is one of the most expensive operations we do 00:05:48.080 |
when running inference with language models. Why? Because it's quadratic with respect to the 00:05:53.360 |
sequence length, in terms of memory and also in terms of compute. So language models 00:05:58.560 |
cannot prefill the entire prompt into the KV cache in one single pass. What they do instead is 00:06:05.200 |
they do it in chunks. This is called chunked prefill, and it's an experimental feature that has 00:06:10.240 |
recently been introduced in vLLM, but it's probably used in more sophisticated inference engines at 00:06:16.800 |
major companies. With chunked prefill, we basically split the prompt of the 00:06:27.200 |
user into multiple chunks and we prefill each chunk step by step. Suppose the user sent 00:06:33.440 |
an entire book made up of 10 chapters. The chunks are usually of a fixed size, 00:06:39.920 |
which means a chunk has no particular contextual meaning: it may not be the first 00:06:44.720 |
chapter of the book or the second chapter of the book, it could just be the first 00:06:49.040 |
4,000 tokens of the prompt, then the next 4,000 tokens of the prompt. So we prefill the first 00:06:55.280 |
chunk into the KV cache, then the second chunk, so now the KV cache 00:07:00.480 |
contains the first chunk and the second chunk, then the third chunk, and so on, until the whole prompt is 00:07:05.680 |
inside the KV cache, which can then be leveraged to generate tokens (see the sketch below). 00:07:12.720 |
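A hedged sketch of chunked prefill, continuing the snippet above (it reuses `model` and `tok`, with `prompt_ids` now standing in for a much longer prompt); the chunk size is illustrative, and real engines like vLLM manage this with paged attention rather than a plain loop:

```python
# Illustrative only: split the prompt into fixed-size chunks and prefill each
# chunk on top of the same KV cache, instead of one huge forward pass.
import torch

chunk_size = 4096  # illustrative; engines pick this to fit GPU memory

past = None
with torch.no_grad():
    for start in range(0, prompt_ids.shape[1], chunk_size):
        chunk = prompt_ids[:, start:start + chunk_size]
        out = model(chunk, past_key_values=past, use_cache=True)
        past = out.past_key_values  # the cache now covers every chunk seen so far
# After the loop the full prompt is in the cache and decoding can start as above.
```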
Intuitively, this is very similar to how a student would read a book. Imagine a student is given a book to read 00:07:18.640 |
and then has to answer a question about this book. The student 00:07:25.760 |
would read the first chapter, and now the brain of the student contains information about 00:07:29.920 |
only the first chapter. Then the student reads the second chapter, and now the brain of 00:07:33.840 |
the student contains information about the first chapter and the second chapter of the book, and 00:07:37.680 |
so on, until the student reads the last chapter; now the brain of the student contains 00:07:42.080 |
information about all the chapters. Then the student is given the 00:07:49.520 |
question and has to leverage the information read from the book to 00:07:55.520 |
answer it. However, the student would struggle to do this. Why? Because intuitively, when 00:08:00.960 |
you read a book, by the moment you start reading the second chapter you may already be forgetting 00:08:06.240 |
what the first chapter was about. So what would be a better strategy for this student? 00:08:14.640 |
Well, the student could take some annotations while reading the chapters, 00:08:21.040 |
and this is what we do with writing in the margins. Because for a very long prompt we are 00:08:27.760 |
already forced to split the prompt into multiple chunks to do this chunked prefill, why not 00:08:35.520 |
leverage the partially prefilled KV cache to generate some annotations that can then be 00:08:42.080 |
leveraged to improve the model's ability to extract information from this prompt? 00:08:47.120 |
So basically, from a technical point of view, writing in the margins works as follows. 00:08:54.160 |
We have this very large prompt and we split it into chunks, because we are forced to split 00:08:58.800 |
it into chunks: we cannot prefill the entire context into the KV cache. We prefill the first chunk, and 00:09:07.360 |
then, after the first chunk, we add a prompt that tells the model: okay, use the text above to 00:09:13.280 |
extract information about the query, where the query is the question we want to get an 00:09:18.800 |
answer to. So now the KV cache contains the first chunk and this prompt, and we leverage it to 00:09:25.120 |
generate a few tokens, which are this margin annotation. Then, and this is another trick, you 00:09:31.200 |
can delete stuff from the KV cache, but only from the end. Why can you do that? Because the language model 00:09:38.720 |
is an autoregressive model, which means that every token depends on all 00:09:45.040 |
past tokens; so you can delete stuff from the end and regenerate it eventually, but you 00:09:51.920 |
cannot, of course, delete stuff from the beginning or from the middle, because it would invalidate 00:09:56.960 |
all the future tokens. You can always remove stuff from the end. So what we do is we prefill 00:10:01.280 |
the first part of the prompt, we prefill this extractive instruction, we generate a few tokens, which are the 00:10:08.640 |
margin annotation, then we can delete this margin and the instruction we added, and then 00:10:14.720 |
we prefill the second chunk, we append another extractive prompt, we generate a few more 00:10:23.120 |
tokens, which are the second margin, so the second margin will depend on the first and the second 00:10:27.680 |
chunk. Then we delete the second margin's tokens and the extractive prompt, and so on, until we have processed the whole prompt. A rough sketch of this loop is shown below. 00:10:34.720 |
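A rough sketch of that loop under stated assumptions: it continues the snippets above (reusing `model`, `tok`, and a long `prompt_ids`), uses a made-up extractive prompt, greedy decoding, and a cache truncation that slices the legacy (key, value) tuple format; the official repository's implementation differs in the details.

```python
# Illustrative only: Writing in the Margins as a chunked-prefill loop.
# prefill chunk -> append extractive prompt -> generate a margin ->
# roll the cache back to the end of the chunk -> prefill the next chunk.
import torch

query = "How many employees were hired in 2006?"   # example query from the talk
chunk_size, margin_len = 4096, 64                   # illustrative sizes

def cache_len(past):
    # Keys/values have shape (batch, heads, seq_len, head_dim).
    return past[0][0].shape[2]

def crop_cache(past, keep_len):
    # An autoregressive cache can only be truncated from the end.
    # Assumes the legacy tuple format; newer transformers Cache objects
    # expose their own crop() helper instead.
    return tuple((k[:, :, :keep_len, :], v[:, :, :keep_len, :]) for k, v in past)

@torch.no_grad()
def greedy_generate(model, past, input_ids, max_new_tokens):
    out = model(input_ids, past_key_values=past, use_cache=True)
    past, next_id = out.past_key_values, out.logits[:, -1].argmax(-1, keepdim=True)
    ids = [next_id]
    for _ in range(max_new_tokens - 1):
        out = model(next_id, past_key_values=past, use_cache=True)
        past, next_id = out.past_key_values, out.logits[:, -1].argmax(-1, keepdim=True)
        ids.append(next_id)
    return past, torch.cat(ids, dim=-1)

extract_ids = tok(
    "\nUse the text above to extract information relevant to the query: " + query,
    return_tensors="pt").input_ids

past, margins = None, []
with torch.no_grad():
    for start in range(0, prompt_ids.shape[1], chunk_size):
        chunk = prompt_ids[:, start:start + chunk_size]
        past = model(chunk, past_key_values=past, use_cache=True).past_key_values
        checkpoint = cache_len(past)                # end of the prefilled context
        past, margin_ids = greedy_generate(model, past, extract_ids, margin_len)
        margins.append(tok.decode(margin_ids[0], skip_special_tokens=True))
        past = crop_cache(past, checkpoint)         # drop the prompt + margin tokens
# `margins` now holds one annotation per chunk; the context stays in the cache.
```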
This is a visualization that we have also put in the paper. 00:10:42.640 |
Basically, imagine you are given a very large prompt. What we do is prefill the first 00:10:48.080 |
part into the KV cache and then extract information about what is present in the KV cache, 00:10:54.160 |
and we call that the first margin. We can then also classify these margins, and in the paper we also 00:11:02.080 |
show that the computation 00:11:08.000 |
of a margin and the classification of the previous margin can be overlapped inside the same request 00:11:12.560 |
in the batch, so you don't need to add another request to the batch; but this is just a little 00:11:17.200 |
KV cache trick for optimizing the inference. So what do we do with these margins? Basically, we 00:11:23.360 |
append them at the end, right before asking the question to the language model. Our goal is: 00:11:29.360 |
we have this very big context and then we have some question. Instead of just asking 00:11:34.400 |
the question, where the model may not be able to find the answer, we generate these margins and append 00:11:38.960 |
them right before asking the question, and then we ask the model the question. Now the model 00:11:43.200 |
can also leverage these margins, which are present right before the question, to better answer the 00:11:51.040 |
question. Why will these margins be leveraged by the language model? First 00:11:58.560 |
of all, because the instruction we use to extract them is an extractive 00:12:06.560 |
summary: we ask the language model to extract information using the prefilled KV cache 00:12:14.560 |
while knowing what the query is, so the margins are relevant to answering that particular query. 00:12:22.480 |
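Continuing the sketch above, this is where the margins come back in: after the last chunk, the kept margins and the question are appended to the same cache and the answer is decoded, with no re-prefill of the context. The template below is an assumption for illustration, not the paper's exact wording.

```python
# Illustrative only: append margins (ideally only the relevant ones) plus the
# question on top of the already-prefilled context, then decode the answer.
tail_text = ("\n\nMargin notes:\n" + "\n".join(margins)
             + "\n\nQuestion: " + query + "\nAnswer:")
tail_ids = tok(tail_text, return_tensors="pt").input_ids

past, answer_ids = greedy_generate(model, past, tail_ids, max_new_tokens=64)
print(tok.decode(answer_ids[0], skip_special_tokens=True))
```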
Why do we add these margins at the end, right before asking the question? Because a few months ago 00:12:28.960 |
there was a paper called Lost in the Middle. Basically it says, and it's actually true, 00:12:34.400 |
that if the relevant information we're trying to extract from 00:12:39.200 |
this prompt, which could be very large, is present either at the beginning or at the end of the 00:12:45.520 |
prompt, then the language model will very likely be able to 00:12:50.240 |
find it. However, if this information is present somewhere in the middle, then the 00:12:54.720 |
language model will be less likely to find it. That's why we add the margins at the end: 00:13:01.760 |
it improves the language model's ability to leverage this information. But does it even work? 00:13:08.880 |
So yes, we have proof: we show in the paper a comparison across pre-trained language models. 00:13:15.520 |
First of all, we are not fine-tuning any language model, we are not changing anything. 00:13:20.480 |
This is just a different way of utilizing something that is already being done, which 00:13:25.440 |
is the chunked prefill of the KV cache, to improve a language model's ability to leverage long context. 00:13:31.680 |
So it can be used with any transformer model without fine-tuning it, 00:13:35.920 |
just by doing this inference differently: don't just blindly prefill 00:13:42.240 |
the KV cache, but prefill it chunk by chunk, because you are forced to anyway, and then leverage it to extract 00:13:47.040 |
these margins, and then leverage all the extracted margins at the end. 00:13:55.440 |
I think someone had mic on I just made it accidental yeah go ahead 00:14:04.800 |
So yeah, in the paper we show that it helps pre-trained language 00:14:11.360 |
models, without any fine-tuning, to better utilize long context, and we provide a few benchmarks. 00:14:19.920 |
As you can see, smaller models, for example, show a bigger improvement. Here we have, for 00:14:28.240 |
example, "long context", the generic pattern we use for long context: just the 00:14:33.440 |
context and the question, whatever the benchmark is. Then we have "RAG", which basically means that, 00:14:40.640 |
instead of giving the entire context plus the margins and then the question, we are 00:14:49.840 |
treating each of the chunks separately, asking the language model which of them are 00:14:56.720 |
relevant, and then only providing the relevant ones, which is what we would do in RAG, 00:15:03.040 |
and then asking the language model to leverage only the relevant ones at the end. And then there is the writing 00:15:11.120 |
in the margins approach, which is all the context, plus the margins that were extracted during the 00:15:17.920 |
chunked prefill, and then the question at the end. Now, how is this different from a prompting strategy? 00:15:26.480 |
Because you may be thinking: okay, but we can already take a very large prompt, split it into 00:15:32.080 |
chunks, use the language model to summarize each of these chunks independently, 00:15:38.800 |
and then send it all to the language model again to answer the question. Well, it's a matter of 00:15:45.120 |
cost, so let's talk about cost. Imagine we have this very big book made up of 10 chapters, 00:15:53.920 |
and then we have a question at the end that we want to get an answer to, which is: what is the 00:15:58.480 |
answer to life, the universe and everything? Now, 00:16:04.160 |
if you don't use writing in the margins, what you would do is feed the 00:16:10.640 |
first chapter of the book and extract a margin from it (actually, let me show you this 00:16:15.120 |
other slide), then take the second chapter of the book and extract a margin from that, 00:16:20.240 |
then the third chapter and extract a margin from that, and so on. Suppose each 00:16:24.560 |
chapter is 100,000 tokens; then the cost to generate the margins, assuming the margins 00:16:30.800 |
themselves are really small, is more or less 1 million tokens, because 100,000 tokens multiplied by 00:16:35.680 |
10 chapters is around 1 million tokens. But then you also need to send the entire book, plus 00:16:41.280 |
these extracted margins, to the language model again to generate the answer, and that 00:16:48.400 |
would cost you another million, because the model has to redo the prefilling 00:16:52.560 |
of 1 million tokens. So it would cost you 2 million tokens. But with writing in the margins, 00:16:58.400 |
for a 1-million-token prompt it would cost you more or less 1 million tokens, because you don't 00:17:03.440 |
have to re-prefill the entire context: you have already prefilled it, so 00:17:10.640 |
the KV cache is growing and you're extracting some information along the way, and you don't have to 00:17:15.600 |
repopulate it, which is what you would do if you treat each chunk independently and 00:17:24.560 |
do the kind of chunking that we commonly do. 00:17:29.840 |
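A back-of-the-envelope version of that cost argument, with the chapter and margin sizes assumed as in the example above:

```python
# Illustrative arithmetic only: prefilled tokens for a 10-chapter, 1M-token book.
chapters, tokens_per_chapter = 10, 100_000
book = chapters * tokens_per_chapter            # 1,000,000 tokens

# Independent chunking: prefill each chapter once to summarize it, then
# re-prefill the whole book (plus summaries) to produce the final answer.
independent_chunking = book + book              # ~2,000,000 prefilled tokens

# Writing in the margins: one incremental prefill; margins reuse the cache,
# so only the (small) margin generations are added on top.
writing_in_the_margins = book                   # ~1,000,000 prefilled tokens

print(independent_chunking, writing_in_the_margins)
```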
So what are the advantages? Well, it's compatible with any transformer model, 00:17:35.600 |
and there are other advantages that I want to show you in a video that I made, so let me show this; 00:17:42.320 |
it's on my LinkedIn post. I don't know if you can see my screen again now? Yeah, 00:17:48.880 |
looks good, we see the LinkedIn video. Perfect. Okay, so we have a large document, suppose 1 00:17:54.560 |
million tokens, and then we have a question like: how many employees were hired in 2006? 00:17:58.560 |
Now, when we have a very large prompt, it must be prefilled into the KV cache in chunks, and this 00:18:06.720 |
operation is called chunked prefill. What happens with chunked prefill is that you 00:18:12.720 |
take the first chunk and prefill it into the KV cache. Then what we do is add to the KV cache 00:18:20.160 |
this extractive summary prompt, which is, for example: use the text above to extract information 00:18:24.800 |
about the following query: how many employees were hired in 2006? Then we generate tokens 00:18:30.960 |
using whatever is inside the KV cache, which is the first chunk plus this 00:18:36.160 |
extractive summary prompt, and suppose we generate the few tokens that are visible now. 00:18:42.800 |
We take these tokens and save them, so we decode whatever is generated using the tokenizer 00:18:50.320 |
and save it, and then we remove it from the KV cache. Now, from an 00:18:56.880 |
implementation point of view, you don't actually remove anything from the KV cache: 00:19:01.840 |
usually the KV cache allocation is static, and even in vLLM it's done using 00:19:07.680 |
the so-called paged attention, so you actually allocate pages of KV cache. 00:19:13.120 |
So basically, it's not like you are removing stuff from the KV cache, you are just resizing 00:19:22.240 |
the tensor, which is actually an O(1) operation, so it doesn't have additional cost: you just 00:19:27.920 |
keep track of how many tokens are in use. 00:19:33.600 |
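A small sketch of that "deletion is just a pointer move" point, under the assumption of a pre-allocated cache; this is a toy data structure, not vLLM's paged allocator:

```python
# Illustrative only: with a pre-allocated cache you track how many slots are in
# use; "deleting" the margin is resetting that counter, an O(1) bookkeeping step.
import torch

class PreallocatedKV:
    def __init__(self, capacity, heads, head_dim):
        self.k = torch.zeros(1, heads, capacity, head_dim)
        self.v = torch.zeros(1, heads, capacity, head_dim)
        self.used = 0                       # pointer: number of valid token slots

    def append(self, k_new, v_new):
        n = k_new.shape[2]
        self.k[:, :, self.used:self.used + n] = k_new
        self.v[:, :, self.used:self.used + n] = v_new
        self.used += n

    def rollback_to(self, checkpoint):
        self.used = checkpoint              # no data is moved or freed
```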
So anyway, we take this margin, we save it somewhere, and then we prefill the second chunk into the KV cache, 00:19:36.320 |
and then we again add another extractive summary prompt and leverage it to generate the 00:19:45.040 |
second margin, which will depend on the first and the second chunk of the prompt, 00:19:51.920 |
and so on for all the chunks, so we end up with a list of margins. 00:19:56.320 |
Then we can also classify these margins, because some of 00:20:06.240 |
the margins may be the model hallucinating, or the model just saying "I cannot find this 00:20:12.160 |
information", etc. So we classify them: we can either use an auxiliary classifier or we 00:20:17.440 |
can use the model itself to classify them, and we show in the paper that you don't actually have to 00:20:22.640 |
create a new request to the model to classify these margins; you can overlap the 00:20:27.040 |
generation of a margin with the classification of the previous margin using the same request 00:20:31.760 |
in the batch. Does the margin pertain to all previous chunks or only the current one? 00:20:38.240 |
All the chunks up to that margin: so the first margin only the first chunk, the second margin 00:20:44.080 |
chunks one and two, the third margin chunks one, two and three, etc., because we want to leverage 00:20:49.280 |
the prefilled KV cache. Okay, so we classify these margins, we append them at the end, 00:21:10.160 |
and then we generate the final answer. 00:21:12.080 |
So the advantage is that, as I said before, we exploit the chunked prefill of a large prompt to 00:21:21.440 |
generate intermediate summaries. We are exploiting something that we are already forced 00:21:26.640 |
to do but are not leveraging right now, so it kind of comes for free, with just a minor 00:21:34.160 |
compute cost, because you have the cost of generating these margins, but you avoid the 00:21:41.520 |
bigger cost of prefilling. If you just use chunking techniques, like you can do with 00:21:47.360 |
LangChain for example, you pay twice the cost of prefilling; with this system you don't pay 00:21:54.480 |
twice the cost of prefilling, which is very expensive. To give you an insight into how 00:22:00.800 |
expensive prefilling is: whenever you work with OpenAI or Cohere or 00:22:10.320 |
any other provider, whenever you send a request, your request is always overlapped with the token 00:22:17.120 |
generation of other requests. The first time your prompt arrives at their server, they overlap 00:22:21.840 |
the prefill of your request with the token generation of others, because prefilling 00:22:26.960 |
is compute bound, since it's very expensive computationally, while token generation is 00:22:32.800 |
memory bound. So to always utilize the GPU fully, they schedule together 00:22:40.400 |
one prefill with multiple token generations. So, it's compatible with any off-the-shelf 00:22:46.880 |
language model without any fine-tuning, and we show some benchmarks in the paper. 00:22:54.000 |
It improves the ability of any language model to extract relevant information, so it addresses the 00:22:58.080 |
lost-in-the-middle problem. Another cool thing you can do is this: 00:23:04.640 |
because you now generate these margins while prefilling the prompt, you can also feed 00:23:13.120 |
these margins back to the user, and the user can classify them for you, like a thumbs up or thumbs down. 00:23:17.840 |
So it adds a human in the loop, and the user can also visualize the progress of how the 00:23:23.600 |
prefilling is going. When you have a very, very large prompt, because the cost 00:23:29.200 |
of prefilling is quadratic, it becomes really expensive to prefill it, and 00:23:36.320 |
the user may have to wait many seconds. So you can give feedback to the user about how 00:23:42.080 |
much context has been processed, and you can leverage the waiting time of the 00:23:46.000 |
user to get thumbs up or thumbs down on these margins, which can improve the ability of 00:23:50.560 |
the language model to use them. The user can also exit early: if the user found the relevant 00:23:56.640 |
information in one of these margins, they can say "okay, stop inference", and they would not have 00:24:00.720 |
to pay for all of the context being processed. We also provide an implementation: 00:24:07.920 |
if you go to this URL, github.com/writer/writing in the margins, you can find 00:24:14.640 |
our implementation of how we actually do this with the KV cache. I don't know how to delete 00:24:20.080 |
this line, it's so annoying, let me check, I believe it's annotate and then delete, clear all drawings. 00:24:28.480 |
Okay, so if you go here you can see what we do. It's simple and it works 00:24:39.040 |
with any language model; here we provide a demo with Llama, Phi and Qwen. For 00:24:45.760 |
example, the code that is present here in the GitHub repository matches exactly the code 00:24:51.680 |
we present in the paper, which is the pseudocode you can see here: how we split into 00:24:58.000 |
segments, how we prefill into the KV cache, and how we delete stuff from the KV cache. It's all 00:25:03.520 |
present here. We also show the state of the KV cache at each 00:25:09.920 |
line of code so that the user can understand what is happening, and this is the code for 00:25:15.760 |
the method we use to delete stuff from the KV cache. All right, let me see if there is 00:25:22.480 |
something missing here. Yeah, here we provide a comparison of how it differs from 00:25:28.480 |
RAG and how it differs from just long-context processing. Questions? 00:25:34.960 |
All right, let me check the chat. But Eugene is 00:25:43.840 |
crushing it answering the questions in the chat, thanks Eugene. 00:25:47.840 |
My pleasure. Thank you for taking the time to share the paper with us and even preparing 00:25:54.080 |
slides. Yeah, okay, the slides were from another talk I gave at the company, so I'm 00:26:03.120 |
reusing stuff, but thank you, thank you everyone. So let me go through the questions 00:26:11.280 |
and see if there is something in the chat that I can answer. Yes, we are not making 00:26:22.160 |
any change to the model architecture, so you don't have to fine-tune anything, you don't have to change 00:26:26.320 |
anything. Can you use this with something like LangChain? No, because it requires a 00:26:33.840 |
modification of how the inference engine is using the model: when you work at the 00:26:40.560 |
KV cache level, you cannot just work with the APIs and tell them to remove stuff from the 00:26:46.000 |
KV cache or overlap stuff in the KV cache. But it doesn't require changes to the 00:26:51.760 |
weights of the model, that's why we talk about no fine-tuning. Is the extractive summary prompt 00:26:59.520 |
just the instruction to produce the margin? Yes, the extractive summary prompt is basically 00:27:04.880 |
a prompt that we add after each chunk to extract relevant information about that query. So 00:27:12.000 |
it's not just about finding relevant information in general, but information about the specific query, because this 00:27:16.480 |
inference pattern that we introduced is specifically for prompts that are composed 00:27:22.320 |
of a context plus an instruction, so we always know what the instruction is. That's why 00:27:26.720 |
this is the best use of this inference pattern. Now, in the paper we 00:27:34.720 |
also show how chunked prefill works at the KV cache level, if you are familiar with how 00:27:40.480 |
the KV cache works with the queries, the keys, etc. We also show how to overlap the computation of a 00:27:47.360 |
margin with the classification of a margin, and this is exactly the representation 00:27:54.720 |
of the KV cache during the prefilling of one chunk and how it can be overlapped with the 00:28:01.840 |
classification using the same language model and the same request. 00:28:09.120 |
Sorry, could you go deeper into overlap? I don't know where the overlap happens: is it between 00:28:14.480 |
the different chunks, or are you overlapping the different chunks? Let's talk about overlap. 00:28:19.440 |
Let's first visualize it; here we have a nice 00:28:26.960 |
representation of it. So you extract the margin, and you need to find a way to classify 00:28:33.360 |
it. You can either use an auxiliary classifier, so another model, to classify it as relevant 00:28:39.040 |
or irrelevant, or you use the same language model to classify it. But if you want to use the same 00:28:43.520 |
language model to classify it, you would normally need to create another request in the batch, because 00:28:47.200 |
you don't want the classification request to see anything else in the KV cache; you just want 00:28:53.520 |
to ask the language model: okay, I asked a language model to extract information about 00:28:58.400 |
this query here, "is Ethan Washington in a marble-floored room?", and the language model 00:29:05.840 |
extracted this stuff here, is it relevant to the query or not? To do that you would normally need 00:29:11.680 |
to create another request in the batch. But we show here that you can actually do it during 00:29:17.600 |
chunked prefilling, in the same request in the batch. 00:29:23.520 |
When you do chunked prefilling, what you are doing is adding the first chunk 00:29:29.200 |
to the KV cache, so the keys and the queries are the first chunk; this is C1 that you see here. 00:29:36.080 |
Then after this we also add an extractive summary prompt, 00:29:40.960 |
and we use it to generate tokens. So this 00:29:48.080 |
is the prefilling of the first chunk plus the 00:29:53.440 |
extractive summary, and then we use it to generate tokens: this is the first 00:29:58.000 |
margin being generated. Usually we pre-allocate the KV cache, so the KV cache is not a growing 00:30:05.040 |
tensor; we pre-allocate it with a fixed number of, let's say, padding tokens, but they are not 00:30:09.280 |
really padding tokens, they are just unused slots in the KV cache, and we replace them with the 00:30:14.960 |
tokens that are actually generated by the language model. So suppose you have these 00:30:19.280 |
unused tokens, which I call padding here. What do we do after we have generated the first 00:30:27.040 |
margin? We delete this margin, and also the instruction tokens. What we are actually doing 00:30:32.480 |
is not deleting anything: we just change the pointer position of the KV cache, i.e. how many tokens 00:30:37.840 |
are in use. So now the pointer, suppose, is pointing here. Then we can prefill the second chunk. The 00:30:44.400 |
second chunk needs to attend to all its tokens in a causal way, so each token in the second chunk 00:30:52.240 |
needs to attend only to itself and all the previous tokens in the same chunk, but it also needs 00:30:57.040 |
to attend to all the past tokens of the first chunk that was already prefilled. We also 00:31:01.840 |
need to prefill the extractive summary prompt, which can attend to all the past tokens it 00:31:07.920 |
has seen. But then, skipping some tokens that we reserve for the generation of 00:31:19.360 |
the margin, we can also prefill the classification instruction for the first margin, which was 00:31:30.560 |
generated in the previous step. Then, during the token generation step, after we have 00:31:36.240 |
prefilled the second segment along with the first generated margin, we can generate the tokens of the 00:31:49.600 |
second margin but also classify the first one, which we already obtained in the step before. So we are 00:31:56.320 |
generating two token sets in the same token generation step of the same request: one 00:32:01.840 |
using only the part relevant to the first chunk, the second chunk, and the extractive summary 00:32:08.000 |
after the second chunk, and one, as you can see from the attention mask, 00:32:14.240 |
using only the part that is relevant to classifying the first margin that 00:32:20.400 |
was extracted in the previous step. So you can also do it like this. 00:32:30.160 |
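To make the packing idea concrete, here is a generic illustration (an assumption for teaching purposes, not the paper's exact mask layout) of how two logically separate jobs can share one forward pass: a causal mask where the second segment is simply blocked from attending to the first.

```python
# Illustrative only: pack two independent segments into one sequence and build
# an additive attention bias so segment B cannot attend to segment A.
import torch

len_a, len_b = 5, 3      # e.g. "prefill chunk 2" tokens and "classify margin 1" tokens
total = len_a + len_b

bias = torch.full((total, total), float("-inf"))
bias[torch.tril(torch.ones(total, total, dtype=torch.bool))] = 0.0   # causal mask
bias[len_a:, :len_a] = float("-inf")       # segment B is blinded to segment A

# `bias` would be added to the attention scores; both segments are then
# processed in the same forward pass, so one request does two jobs at once.
print(bias)
```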
Thank you. Yeah, sorry, there's a question in the chat, and I think from the explanation Naz's question is clear: when you 00:32:36.560 |
are creating the margin for the second chunk, you're actually paying attention to the first and the 00:32:43.600 |
second chunk? Yeah. Okay, so could you change the attention mask to only look at the latest chunk? 00:32:50.160 |
That's exactly the question. Yes, but it's actually not possible; let me clarify why it's 00:32:56.640 |
not possible. The KV cache is made up of contextualized tokens: these tokens 00:33:01.360 |
are not standalone, they are contextualized. Token number one in the KV cache is a contextualized 00:33:06.720 |
version of tokens number zero and one; token number two in the KV cache is a contextualized 00:33:10.880 |
version of tokens zero, one and two. So if you tell the model to only look at the last tokens, 00:33:15.440 |
you are creating an autoregressive model that is generating the logits for, let's say, x10 00:33:24.800 |
while conditioning only on x9 and x8, which are contextualized tokens that contain information 00:33:33.600 |
about tokens seven, six, five that you are not otherwise including, so you are actually going out of distribution. 00:33:42.000 |
How much of the KV cache do you prefill with the chunk versus leave free? 00:33:53.600 |
Okay, if you use vLLM for example, they use this thing called paged attention, so 00:33:58.000 |
they allocate one entire page, which is actually a lot of tokens, so it's like another 00:34:03.840 |
chunk, which is more than enough to generate the margin. So what are the next steps for this? Well, 00:34:11.520 |
the next step is, for sure, we are sending it to conferences, getting it published and presenting it 00:34:18.880 |
around. But we are currently focused on long context modeling, and we are looking at how 00:34:27.200 |
long context can be better leveraged, so we will keep working in this 00:34:32.000 |
field; we will be researching a lot in this field. 00:34:51.600 |
I think there's a question from Amad: how are the queries chosen for the query-based summarization? 00:35:03.680 |
Yes, okay, so we work with a prompt that is made up of context plus 00:35:11.920 |
query, so we always know what the query is; that's the structure of the prompts we work 00:35:17.200 |
with. What's the use case at Writer that led you to this research? Well, I am personally very 00:35:25.280 |
interested in long context modeling, and I am given the freedom to research what I like, and 00:35:31.600 |
Writer is also interested in long context modeling, so things intersect, and here we are. And 00:35:37.680 |
we have many smart people working together, we did a few brainstorming sessions, and yeah. 00:35:44.160 |
What's the latency you see for a typical request? 00:35:52.400 |
Well, there is no real latency increase because of this; you are just paying 00:36:00.000 |
more to generate more tokens in intermediate steps. Of course, what happens is that 00:36:07.520 |
before, you would process the entire prompt as chunk prefilling, 00:36:14.800 |
chunk prefilling, chunk prefilling; now you have chunk prefilling interleaved with some token generation, which 00:36:19.600 |
will slow down the entire request. But you are actually getting something back, which is feedback, 00:36:25.120 |
and you get the possibility to see what the model is actually seeing at each step, 00:36:29.360 |
so you get a human in the loop. The human is waiting, but 00:36:34.000 |
waiting with some feedback, which is nice; we like to have progress bars, right? 00:36:38.480 |
And, maybe this is sensitive: do you happen to have a demo showing how this actually looks 00:36:45.360 |
in the user interface, or is it something we have to sign up to Writer to see? 00:36:49.120 |
We don't have that yet, but we are working on demos, yeah. 00:36:54.160 |
here we have you know we have a concept on how it would look like 00:37:04.320 |
So, okay, in some cases writing in the margins does not work as well as other methods. There are 00:37:15.200 |
two factors. First of all, because each margin is kind of a summarization of what is 00:37:22.000 |
present in the context, it depends highly on how good that model is at summarizing: the better 00:37:28.480 |
the margin, the better the information it will extract, and the better it can be leveraged. 00:37:33.280 |
If you think about the student: if your note-taking skills are not so good, then probably 00:37:38.240 |
your notes will not be useful. The second thing is the comparison you see here with RAG. 00:37:45.600 |
For this RAG baseline we actually put ourselves in the worst condition possible, which is "let's help RAG beat 00:37:51.440 |
us", but then RAG still doesn't beat us. Usually in RAG what you do is you have these 00:37:56.720 |
chunks, you extract some vectors from these chunks, and then you match them, with a dot 00:38:03.120 |
product or whatever, against the query. What we did for the RAG baseline is we asked the language model 00:38:09.760 |
itself; yes, it was charitable, because we asked the language model to decide 00:38:16.480 |
whether that particular chunk is relevant. So you have a 70-billion-parameter model telling you if a 00:38:23.040 |
chunk is relevant, compared to extracting some vector and matching it with a dot product; I mean, we helped RAG a 00:38:29.680 |
lot. So if we had compared against a naive RAG approach, we would do much better. 00:38:43.280 |
Does anyone else have questions, or want to come on screen to just ask Umar and Sam? 00:39:03.200 |
If there's nobody else that is interested: I actually am having a little bit of trouble 00:39:15.680 |
wrapping my mind around why chunked prefilling is so much more efficient. I looked at the 00:39:24.480 |
reference and I kind of get the idea, but maybe you can help me understand 00:39:30.960 |
the intuition. Okay, first of all, chunked prefill doesn't exist because it's 00:39:36.880 |
more efficient; it exists because we need it, we must do it. When you prefill a chunk into the 00:39:43.440 |
language model, let me show you, here we have the KV cache representation, right? When you 00:39:48.080 |
prefill a chunk into the language model, let's say this chunk number one, C1, you are generating 00:39:54.080 |
a quadratic matrix: as you can see, if you have four tokens you are generating a four-by-four 00:40:00.720 |
matrix, which is prohibitive to generate for very long prompts. Imagine you have 1 00:40:07.840 |
million tokens: that's a 1-million-by-1-million matrix, where each of these values is actually 00:40:12.720 |
a dot product of vectors, and the computation cost of that would really be very slow, 00:40:20.880 |
even though GPUs are really good at parallelizing when you have a lot of operations. 00:40:25.120 |
But anyway, the main problem is the memory of this 00:40:30.000 |
prefilling: when you generate that matrix it's really huge and it doesn't fit. So 00:40:35.040 |
we are forced to do this chunked prefilling, chunk one, chunk two, 00:40:41.040 |
but we were not leveraging these chunks; we do it only because we are forced to, and it's 00:40:47.520 |
slower than just doing it in one pass. But since we are already forced to do it, why not use them? 00:40:54.160 |
Yeah, yeah, I definitely think I got most of the paper, just the chunked prefilling part, 00:41:00.880 |
the background. If you want more information I can give you some references: one is the vLLM 00:41:08.800 |
page, since it's now an experimental implementation in vLLM, 00:41:13.760 |
and there was an NVIDIA explanation of chunked prefilling, so I will send a link later; 00:41:21.520 |
NVIDIA recently published an article about chunked prefilling. But basically, prefilling is the most 00:41:27.520 |
expensive part of working with long prompts for language models, that's why they need it. 00:41:34.080 |
But what I guess I was having trouble understanding is why that is. Is it just because it's quadratic, 00:41:39.440 |
and so you have to break it up into chunks? Maybe I can take a stab at it. 00:41:44.800 |
You can imagine: let's look at the attention mask here. Here I'm generating the first margin. 00:41:50.320 |
What am I doing here? I already have seven tokens in the KV cache and I am generating 00:41:58.720 |
the eighth token, so I am doing seven dot products. So token generation, which means generating one 00:42:04.960 |
token using whatever is in the KV cache, is linear with respect to whatever is inside the 00:42:08.800 |
KV cache; prefilling the KV cache is quadratic, and mostly because it's quadratic it's very expensive. 00:42:14.000 |
So we are comparing something that is linear with something that is quadratic. And if you consider long-context prompts: 00:42:19.680 |
if you are working with a two-million-token context window, 1,999,000 tokens 00:42:26.720 |
will be prompt; nobody will ever generate more than, let's say, five thousand tokens. 00:42:32.240 |
So yeah, the most expensive part is actually prefilling. 00:42:40.240 |
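A rough way to see the linear-vs-quadratic gap with the numbers from this discussion (purely illustrative arithmetic):

```python
# Illustrative only: attention-score work for a long prompt.
n = 1_000_000                      # prompt tokens
prefill_scores = n * (n + 1) // 2  # causal prefill computes ~n^2/2 query-key scores
one_decode_step = n                # generating one new token computes ~n scores

print(prefill_scores // one_decode_step)   # prefill ~ half a million decode steps of score work
```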
So just to get an intuition: the other extreme is that your chunk size is one token, right? So what's the trade-off? 00:42:46.240 |
What they do is basically try to make the chunk as big as possible while it still fits in the 00:42:53.200 |
GPU, so I think good numbers are like four thousand tokens or eight thousand 00:43:01.760 |
tokens, something in this range. And as I said before, token generation is usually memory 00:43:09.760 |
bound, meaning the limitation is only given by how much KV cache your memory can 00:43:14.160 |
hold, while prefilling is compute bound. So to maximize GPU utilization, 00:43:20.000 |
whenever you work with OpenAI or Cohere, they just overlap your new request with other people's 00:43:26.400 |
old requests: while they are generating tokens they are also prefilling, so the GPU is utilized 00:43:31.280 |
at 100%. Okay, yeah, that's helpful, and if you do send those links I'll definitely read them. 00:43:39.040 |
Thank you. Is there a break-even point where, at a certain context length, it becomes 00:43:46.880 |
more valuable to do writing in the margins than just using the LLM by itself, 00:43:54.160 |
where the compute evens out, or are margins always better? I believe, okay, 00:44:02.480 |
writing in the margins is like reading a book and taking margin notes versus just reading the book. I 00:44:08.640 |
think it's always worthwhile to read the book with the margins, but you are actually paying the 00:44:12.720 |
price for that, right? It's not something that comes for free: you are paying the 00:44:17.120 |
cost of generating these margins, so you're putting in some effort and then you leverage it. 00:44:22.640 |
So the question is: does it always help? So far, yes. It's not something that 00:44:31.520 |
you get for free, so you pay, and is it worth it? So far, yes. And is it always worth it? 00:44:41.120 |
So far, from our data, it's always worth it. Even if you're literally 00:44:48.000 |
just talking about a single chunk? If you have like a sentence instead of a book? Oh really? 00:44:53.760 |
Yeah, then it's not worthwhile, because in that case you are not even doing chunked prefilling, right? 00:45:00.560 |
If you have only a small context, it just gets prefilled at once; but when you have a long one, yeah. 00:45:06.560 |
Makes sense. Once you start growing, I think the question still stands, right? Let's say you've 00:45:13.280 |
got a paragraph and your chunks are sentences, or you've got a page, so you've got a thousand 00:45:17.600 |
tokens and your chunks are sentences: is every sentence relevant? So some form of highlighting, where 00:45:22.720 |
you have one page and for every sentence you highlight what's relevant or not. At that level 00:45:28.080 |
it's kind of negligible whether you throw the whole thing in a prompt versus highlight 00:45:33.840 |
seven of the 40 sentences and do this approach; that's the other non-extreme, right? 00:45:41.520 |
Yeah, so basically, whenever the context can just be prefilled into the KV cache without any 00:45:48.480 |
chunked prefilling, I believe it's not worth using this; but if the context is long, then it helps, 00:45:55.920 |
and it helps much more than chunking separately like we do with the APIs, 00:46:01.680 |
with LangChain, because there you are paying twice the cost of prefilling, while here we are 00:46:07.120 |
only paying it once. We also show in the ablation studies that it's always better to 00:46:13.440 |
send the context plus the margins, never just the margins. You can see this here in the ablation on context 00:46:20.400 |
compression: if you only send the margins, or only the context, it is always worse. 00:46:28.480 |
Context being the whole thing, right? The entire book. 00:46:42.480 |
And so, building off that ablation: let's say you've got a model that doesn't have the context window, 00:46:50.560 |
so you have to do some sort of chunking or splitting. Let's say you have a total context window 00:46:55.280 |
of 8,000 tokens and you have a million-token document. There are approaches where 00:47:00.560 |
you can process chunk by chunk and then combine. What you would expect is 00:47:06.640 |
that you could, for each chunk, do this writing-in-the-margins approach at every level and then scale 00:47:11.920 |
that down with however many steps you need. Is that intuition still pretty accurate? 00:47:18.320 |
I believe with an 8,000-token context window 00:47:25.280 |
the latency would be higher, right, because at each step you are adding 00:47:31.840 |
more. At that kind of scale it's more convenient to just do independent chunking and 00:47:39.040 |
generation, because you can always split the context into chunks, 00:47:45.760 |
send multiple requests, and pay the price of recomputing in that kind of range. 00:47:50.960 |
But when you are talking about 64,000 tokens, it starts making more sense to use this approach. 00:47:57.200 |
So at that smaller level I think the traditional approaches work fine. 00:48:08.880 |
Any other questions? Well, I think another question that came up was: why now, right? 00:48:16.480 |
I mean, why did nobody think about this before? Because actually, we didn't have very long context 00:48:22.880 |
models before, and we were not forced to do chunked prefilling. As 00:48:30.480 |
you can see with vLLM, this feature is an experimental feature right now. 00:48:38.400 |
Right now we need this chunked prefilling and everyone is doing it, so that's why we have 00:48:43.840 |
this. Innovation always starts from some problem that you face and some 00:48:49.120 |
need that you have; right now we have this need and we have the capability, and that's how we 00:48:55.360 |
came up with this. Awesome, we have a few more minutes if anyone else has last-minute questions, 00:49:11.360 |
please feel free to ask. And a big shout-out and thanks to the Writer team for presenting. 00:49:20.080 |
Thank you guys for listening. You are welcome to send us your questions; we 00:49:28.960 |
have a GitHub repository, and I suggest looking at the code, it's really well 00:49:35.200 |
commented and it follows the same pattern we shared in the paper, so it's 00:49:40.560 |
easily understandable for everyone. We used a lot of nice tricks; one of the 00:49:45.280 |
tricks is that you can always delete stuff from the end of the KV cache, and now you also 00:49:49.440 |
know why. It's an interesting project if anyone's interested: on Fridays we have a similar 00:49:57.840 |
AI in Action session where we try to take a practical angle outside of papers. The code is up, the 00:50:03.360 |
paper is up, and if anyone wants to run it, present it, and share their learnings, it'll be a really good 00:50:07.680 |
learning exercise; that's always there. I guess we got a question from Jimmy: any future 00:50:14.960 |
work in this direction? Well, for sure we will keep working on 00:50:21.920 |
long context and how we can better leverage long context. There is another kind of problem with 00:50:27.600 |
long context: how well language models use long context actually depends highly 00:50:34.160 |
on the attention mechanism and how the softmax works. We have seen, with the paper on 00:50:39.360 |
attention sinks, that the language model allocates its attention in a particular way: when you do the 00:50:46.720 |
attention mechanism you are doing a weighted sum over the tokens, where each token is given a weight, 00:50:52.480 |
and we see that most of the weight is given to the first few tokens. So there is a lot of research in 00:50:58.560 |
this area; a few days ago another paper came out called sigmoid attention, which 00:51:03.840 |
is also studying the distribution of these logits. So I think the attention 00:51:10.400 |
mechanism will play a big part in how we can extend long context, if we can also fix 00:51:16.880 |
this part. I am very interested in the KV cache and in optimizing long context modeling, so 00:51:23.040 |
we are working in this direction, because it's needed by the market and also 00:51:31.200 |
I like it, and I think it's cool to be able to analyze an entire book or an entire codebase 00:51:36.720 |
instead of hoping that RAG finds the right chunk. 00:51:40.720 |
Awesome, well, big shout-out to the Writer team, always great to have you guys present. 00:51:49.840 |
Sam is in Discord as well, I'm sure he'll relay questions and such. We've got the recording, we'll 00:51:56.080 |
share it with your team, and I don't know what you choose to do with it. Next week we've got 00:52:01.920 |
swyx; he'll be presenting some of the Strawberry, Q*, Quiet-STaR papers, so he'll be 00:52:08.240 |
doing that next week. And then the following week, if anyone's interested in anything, volunteers are 00:52:12.640 |
always open; I posted a few papers in the paper club channel, and I think there's also the Mistral stuff, so if anyone 00:52:18.800 |
wants to lead, pop in there. Otherwise, next week swyx is doing the Strawberry and STaR stuff. 00:52:27.760 |
cool thank you guys thanks everyone take care 00:52:45.840 |
question is there a you can um is there any way you can copy over the comments 00:52:52.000 |
um i'm trying to do it but it seems to like lazy load like as you scroll up and down it's 00:52:58.800 |
really painful let me see if i can extract these comments 00:53:03.520 |
do you know if they normally get saved as a zoom recording i'm using you can click save chat where 00:53:14.880 |
you save chat anyone want to help me out chat there's i was able to copy the comments without 00:53:21.600 |
any problem uh the file is usually saved on the host computer in a folder like documents zoom 00:53:29.840 |
see there's a chat log file that as long as okay i got it i just i just saved the chat i'll throw 00:53:38.480 |
it in discord i have a text file of it yes slides would be great too if we can grab them uh yeah 00:53:44.560 |
get it all i'll pick them i'll pick them right now perfect thanks guys sweet all right thanks