It's recorded, but tl;dr it will probably just be shared internally. Sam, we'll get you the recording after. Cool, I'll pass it off to you guys whenever you want to start; people will trickle in.

So from the Writer side I'll be presenting. I don't know if you can hear me? Yep. Sam should also be here from our side, I don't know if he's joined yet, but I will kick off the presentation; let me know when everyone is ready and I'll start. I think we're good, take it away. Okay, I'll start by sharing my screen, let me know if you can see it. Yep, we can see your slides. Perfect. Can I start? Yeah, you should be good; it's recording and people might trickle in every now and then.

All right, so tonight I'll be presenting a paper that came out from Writer: "Writing in the Margins: Better Inference Pattern for Long Context Retrieval." It's a paper about long context and how we leverage the KV cache to make long-context modeling more effective. I'll give a not-too-technical presentation, and later we can deep dive into the details of the paper. Let me open the chat so I can see what everyone is writing while I'm talking. You can mostly ignore the chat; it gets really busy with people going "oh wow, that's so cool." Vibhu and I will take care of the chat for you, and if something super important pops up we'll let you know. Perfect.

I'll skip the part about what a language model is, but in short: a language model is a probabilistic model that leverages the prompt to predict the next token, and we generate text iteratively, one token at a time. Most language models nowadays are based on the transformer, and in a transformer, whenever we have a prompt, the first thing we do is put that prompt into the memory of the model, known as the KV cache, which is the keys and values in the transformer layers. The operation of creating this initial KV cache is known as prefilling. So the first thing the inference engine (vLLM, TensorRT, or whatever framework you're using) does with your prompt is this prefilling, and prefilling is actually one of the most expensive parts of processing a prompt, because it has a quadratic cost with respect to both compute and memory.

Imagine we have a prompt that says "Austin is a city in." The first thing we do to generate text with this prompt is prefill it into the KV cache, and then we leverage the cache to generate tokens one at a time. The KV cache is the memory of any autoregressive transformer model: if it contains some tokens, the model can leverage them; if it doesn't, the model cannot. The model only sees what is inside the KV cache. So with our prompt "Austin is a city in," the first step is the prefilling, which puts these tokens into the KV cache, one cache per transformer layer.
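To make the prefill-then-decode idea concrete, here is a minimal sketch using Hugging Face transformers. It is not the paper's code; the model name is only an example and any causal LM would do.

```python
# Minimal sketch of prefill + one-token-at-a-time decoding with a KV cache.
# Assumes a Hugging Face causal LM; the model name below is just an example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Austin is a city in"
input_ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    # Prefill: one forward pass over the whole prompt builds the KV cache.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values

    # Decode: generate one token at a time, feeding only the newest token
    # and reusing the cached keys/values for everything before it.
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    generated = [next_id]
    for _ in range(10):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(prompt + tok.decode(torch.cat(generated, dim=-1)[0], skip_special_tokens=True))
```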
Then we generate the next token by asking the model what comes next. Suppose the next token is the word "Texas": we keep it in the KV cache so the model can leverage it to generate the following token, and suppose that one is a comma, and so on, until we have generated the entire response.

Now, in the past few years prompts have been getting longer and longer. We started with 2,000-token context windows, then 4,000, 8,000, 32,000, 64,000, 100,000, and now we have reached millions of tokens. That means you can send an entire book to the language model and ask it questions about the book. However, with great prompts comes great responsibility, and the reason is the following. Suppose you have a book of about one million tokens and you want the model to answer questions about it. If you try to prefill this one-million-token prompt into the KV cache in a single pass, the model actually cannot do it, because, as I said before, prefilling is one of the most expensive operations in language model inference: it is quadratic with respect to the sequence length in terms of memory and also in terms of compute. So inference engines do not prefill the entire prompt in one pass; they do it in chunks. This is called chunked prefill, and it is an experimental feature recently introduced in vLLM, though it is probably standard in the more sophisticated inference engines at major companies. (There is a small code sketch of chunked prefill right after the student analogy below.)

With chunked prefill we split the user's prompt into multiple chunks and prefill each chunk step by step. Suppose the user sends an entire book made up of 10 chapters. The chunks are usually of a fixed size, so a chunk has no contextual meaning: it may not be the first or second chapter of the book, it could just be the first 4,000 tokens of the prompt, then the next 4,000 tokens, and so on. We prefill the first chunk into the KV cache, then the second chunk, so the cache now contains the first and second chunks, then the third chunk, and so on, until the whole prompt is inside the KV cache, which can then be leveraged to generate tokens.

Intuitively, this is very similar to how a student would read a book. Imagine a student is given a book to read and then a question to answer about it. The student reads the first chapter, and now their brain contains information about only the first chapter. Then they read the second chapter, and now their brain contains information about the first and second chapters, and so on, until they have read the last chapter and their brain contains information about all the chapters. Then the student is given the question and has to leverage what they read from the book to answer it. But the student would struggle, because intuitively, the moment you start reading the second chapter you may already be forgetting what the first chapter was about.
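Under the same assumptions as the sketch above, chunked prefill is just the prefill pass split into fixed-size pieces, with the KV cache carried forward between passes:

```python
# Sketch of chunked prefill: instead of one forward pass over a very long prompt,
# split its token ids into fixed-size chunks and run one pass per chunk, reusing
# the growing KV cache. Assumes `model` is a Hugging Face causal LM as above.
import torch

def chunked_prefill(model, input_ids, chunk_size=4096):
    past = None
    out = None
    for start in range(0, input_ids.shape[1], chunk_size):
        chunk = input_ids[:, start:start + chunk_size]
        with torch.no_grad():
            out = model(chunk, past_key_values=past, use_cache=True)
        past = out.past_key_values
    # `past` now holds the KV cache for the whole prompt; the last logits can be
    # used to start decoding the answer.
    return past, out.logits[:, -1]
```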
So what would a better strategy be for this student? The student could read the chapters and take some annotations along the way, and that is what we do with writing in the margins. For a very long prompt we are already forced to split the prompt into chunks to do chunked prefill, so why not leverage the partially prefilled KV cache to generate some annotations that can then be leveraged to improve the model's ability to extract information from the prompt?

From a technical point of view, writing in the margins works as follows. We have this very large prompt and we split it into chunks, because we are forced to: we cannot prefill the entire context into the KV cache at once. We prefill the first chunk, and after it we add a prompt that tells the model, "use the text above to extract information about the query," where the query is the question we want answered. Now the KV cache contains the first chunk plus this prompt, and we leverage it to generate a few tokens, which are the margin annotation. Then comes another trick: you can delete from the KV cache, but only from the end. Why can you do that? Because the model is autoregressive, every token depends on all past tokens, so you can delete tokens from the end and eventually regenerate them, but you cannot delete from the beginning or from the middle, because that would invalidate all the tokens after them. So we prefill the first part of the prompt, prefill the extractive instruction, generate a few tokens (the margin), then delete the margin and the instruction we added. Then we prefill the second chunk, append another extractive prompt, and generate a few more tokens: the second margin, which depends on the first and the second chunk. Then we delete the second margin's tokens and the extractive prompt, and so on, until we have processed the whole prompt. This is the visualization we also put in the paper: given a very large prompt, we prefill the first part into the KV cache, then extract information from what is in the cache and call it the first margin.

We can then also classify these margins, and in the paper we show that the computation of a margin and the classification of a margin can be overlapped inside the same request in the batch, so you do not need another request; it is a little KV cache trick for optimizing inference. And what do we do with these margins? We append them at the end, right before asking the question. Our goal is: we have this very big context and then some question, and if we just ask the question the model may not be able to find the answer, so we generate the margins and append them right before the question, and then we ask the question. Now the model can also leverage these margins, which sit right before the question, to better answer it.
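Here is a simplified sketch of that loop, assuming a Hugging Face-style causal LM. The prompt wording is hypothetical, the cache handling uses the legacy tuple format and may need adapting to newer `Cache` objects, and the official implementation (linked later) differs in detail.

```python
# Simplified sketch of the writing-in-the-margins inference loop:
# prefill a chunk, append an extractive prompt, generate a short margin,
# then roll the KV cache back to the chunk boundary and continue.
import torch

def trim_cache(past, length):
    # "Delete from the end" by slicing the sequence dimension of a legacy
    # tuple-style cache; newer transformers Cache objects expose similar cropping.
    return tuple((k[:, :, :length, :], v[:, :, :length, :]) for k, v in past)

def write_in_the_margins(model, tok, chunks, query, max_margin_tokens=128):
    margins, past, cache_len = [], None, 0
    extract = f"\nUse the text above to extract information relevant to: {query}\n"
    extract_ids = tok(extract, return_tensors="pt").input_ids.to(model.device)

    for chunk_text in chunks:
        chunk_ids = tok(chunk_text, return_tensors="pt").input_ids.to(model.device)
        with torch.no_grad():
            out = model(chunk_ids, past_key_values=past, use_cache=True)
        past = out.past_key_values
        cache_len += chunk_ids.shape[1]        # cache now ends at the chunk boundary

        # Append the extractive prompt and greedily decode a short margin.
        with torch.no_grad():
            out = model(extract_ids, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        margin_ids = []
        for _ in range(max_margin_tokens):
            margin_ids.append(next_id)
            with torch.no_grad():
                out = model(next_id, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
            if tok.eos_token_id is not None and next_id.item() == tok.eos_token_id:
                break
        margins.append(tok.decode(torch.cat(margin_ids, dim=-1)[0], skip_special_tokens=True))

        # Discard the extractive prompt and the margin: only removals from the
        # end of the cache are allowed, so roll back to the chunk boundary.
        past = trim_cache(past, cache_len)

    return margins, past
```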
Why will these margins be leveraged by the language model? First of all, because the instruction we use to extract them is an extractive summary: we ask the model to extract information, using the prefilled KV cache, knowing what the query is, so the margins are relevant to answering that particular query. Second, why do we add these margins at the end, right before asking the question? A few months ago there was a paper called "Lost in the Middle," and what it describes is actually true: if the relevant information in a very large prompt is present either at the beginning or at the end of the prompt, the model will very likely be able to find it, but if it is somewhere in the middle, the model is much less likely to find it. That is why we add the margins at the end: it improves the model's ability to leverage this information. (There is a small sketch of this final prompt layout after the benchmark discussion below.)

But does it even work? Yes, and we show it in the paper with a comparison across pre-trained language models. First of all, we are not fine-tuning any language model and we are not changing anything: this is just a different way of utilizing something that is already being done, the chunked prefill of the KV cache, to improve a model's ability to leverage long context. So it can be used with any transformer model without fine-tuning, just by doing inference differently: don't just blindly prefill the KV cache, prefill it chunk by chunk (because you are forced to anyway), leverage that to extract the margins, and then leverage all the extracted margins at the end. (I think someone had their mic on. Just accidental, go ahead.)

So in the paper we show that this helps pre-trained language models, without any fine-tuning, better utilize long context, and we provide a few benchmarks. Smaller models show a bigger improvement. We compare three setups: "LLM," the generic long-context pattern, which is just the context and the question, whatever the benchmark is; "RAG," where instead of giving the entire context plus the margins and then the question, we process each chunk separately, ask the language model which chunks are relevant, and provide only the relevant ones at the end, which is what you would do in RAG; and the writing-in-the-margins approach, which is all the context, plus the margins extracted during chunked prefill, plus the question at the end.
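To make that final prompt layout concrete: a minimal sketch of the context-then-margins-then-question ordering. The template wording here is hypothetical, not the paper's exact prompt.

```python
# Hypothetical layout of the final answer prompt: full context first, then the
# accumulated margins, then the question last, so the most useful signal sits
# near the end of the prompt (per the "Lost in the Middle" observation).
def build_final_prompt(context: str, margins: list[str], query: str) -> str:
    notes = "\n".join(f"- {m}" for m in margins)
    return (
        f"{context}\n\n"
        f"Relevant notes extracted while reading:\n{notes}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

In the actual writing-in-the-margins flow the context is already sitting in the KV cache, so in practice only the margins and the question need to be appended after the prefilled cache rather than rebuilding the whole string.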
Now, how is this different from a prompting strategy? You may be thinking: we can already take a very large prompt, split it into chunks, use the language model to summarize each chunk independently, and then send everything to the model again to answer the question. Well, it's a matter of cost. Imagine we have this very big book made up of 10 chapters, and at the end a question we want answered, like "what is the answer to the universe and everything?" If you don't use writing in the margins, you would feed the first chapter of the book and extract a margin from it, then the second chapter and extract a margin from that, then the third chapter, and so on. Suppose each chapter is 100,000 tokens. Then the cost to generate the margins (and suppose the margins themselves are really small) is roughly one million tokens, because 100,000 tokens times 10 chapters is about a million. But then you also need to send the entire book plus these extracted margins to the language model again to generate the answer, and that costs you another million, because the model has to redo the prefilling of one million tokens. So it would cost you about two million tokens. With writing in the margins, for a one-million-token prompt it costs you roughly one million tokens, because you don't have to re-prefill the entire context: you have already prefilled it. The KV cache keeps growing, you extract information along the way, and you never have to repopulate it, which is what you would do if you treated each chunk independently with the kind of chunking we commonly do. (A quick back-of-the-envelope with these numbers appears a bit further below.)

So what are the advantages? It's compatible with any transformer model, and there are other advantages I want to show you in a video I made; this is from my LinkedIn post. Can you see my screen again? Yeah, we see the LinkedIn video. Perfect. So: we have a large document, say one million tokens, and a question like "how many employees were hired in 2006?" When we have a very large prompt it must be prefilled into the KV cache in chunks, the chunked prefill. We take the first chunk and prefill it into the KV cache. Then we add to the KV cache an extractive summary prompt, for example "use the text above to extract information about the following query: how many employees were hired in 2006?", and we generate tokens using whatever is inside the KV cache, which is the first chunk plus this extractive summary prompt. We take the generated tokens, decode them with the tokenizer, save them, and then remove them from the KV cache. From an implementation point of view you don't actually remove anything from the KV cache: the allocation is usually static, and even vLLM uses the so-called PagedAttention, where you allocate pages of KV cache. So you're not really removing anything, you're just resizing the tensor, which is an O(1) operation with no additional cost; you just keep track of how many tokens are in use. Anyway, we take this margin, save it somewhere, prefill the second chunk into the KV cache, add another extractive summary prompt, and use it to generate the second margin, which will depend on the first and the second chunk of the prompt, and so on for all the chunks. So we end up with a list of margins, and we can also classify them, because some of the margins may be the model hallucinating or just saying "I cannot find this information." To classify them we can either use an auxiliary classifier or use the model itself.
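Putting rough numbers on that comparison; this just restates the talk's own example, with the margins' token count assumed negligible:

```python
# Back-of-the-envelope prefill cost for a 10-chapter book, ~100k tokens per
# chapter, ignoring the (small) margin tokens themselves.
chapters, tokens_per_chapter = 10, 100_000
book = chapters * tokens_per_chapter                  # ~1M tokens

# Independent chunking (summarize each chunk, then re-send everything):
# every chapter is prefilled once to produce its summary, and the whole book
# plus the summaries is prefilled again to answer the question.
independent = book + book                             # ~2M prefilled tokens

# Writing in the margins: the book is prefilled once, chunk by chunk, and the
# margins are generated along the way from the cache that already exists.
wim = book                                            # ~1M prefilled tokens

print(independent, wim)                               # 2000000 1000000
```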
And we show in the paper that you don't actually have to create a new request to the model to classify these margins: you can overlap the generation of a margin with the classification of the previous margin, using the same request in the batch. Question from the chat: does a margin pertain to all previous chunks or only the current one? All the chunks up to that margin: the first margin sees only the first chunk, the second margin sees chunks one and two, the third margin sees chunks one, two, and three, and so on, because we want to leverage the prefilled KV cache.

So we classify these margins, append them at the end, append the question, and generate the final answer. The advantage, as I said before, is that we exploit the chunked prefill of a large prompt to generate intermediate summaries: we are exploiting something we are already forced to do but are not leveraging right now. It comes almost for free, with only the minor compute cost of generating the margins, and you avoid the much bigger cost of prefilling twice. If you just use chunking techniques, like you can do with LangChain, you pay the prefilling cost twice; with this system you don't, and prefilling is very expensive. To give you an idea of how expensive prefilling is: whenever you work with OpenAI or Cohere or any other provider, your request is always overlapped with the token generation of other requests. The first time your prompt reaches their servers, they overlap the prefill of your request with the token generation of others, because prefilling is compute-bound (it is computationally very expensive) while token generation is memory-bound, so to keep the GPU fully utilized they schedule one prefill together with multiple token generations.

So: it's compatible with any off-the-shelf language model without any fine-tuning, we show benchmarks in the paper, and it improves the model's ability to extract relevant information, addressing the lost-in-the-middle problem. Another cool thing you can do: because you generate these margins while prefilling the prompt, you can also feed them back to the user, and the user can classify them for you with a thumbs up or thumbs down. That adds a human in the loop, and the user can also see the progress of the prefilling. When you have a very, very large prompt, prefilling it will become really expensive (the cost is quadratic) and the user may have to wait many seconds, so you can give the user feedback on how much context has been processed and use the waiting time to collect thumbs up or down on the margins, which can improve the model's ability to use them. The user can also exit early: if they find the relevant information in one of the margins, they can stop inference and avoid paying for the rest of the context to be processed. (There is a tiny sketch of this streaming idea just below.)

We also provide an implementation: if you go to github.com/writer/writing-in-the-margins you can find our code showing how we actually do all of this with the KV cache.
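As a purely illustrative sketch of that human-in-the-loop idea; the function and callback names here are hypothetical, not from the repo:

```python
# Hypothetical sketch: stream each margin back to the user as it is produced,
# show prefill progress, and allow an early exit once the user has the answer.
# `write_margin_for_chunk` stands in for one iteration of the margin loop above.
from typing import Callable, Iterable

def stream_margins(chunks: Iterable[str],
                   write_margin_for_chunk: Callable[[str], str],
                   on_margin: Callable[[int, int, str], bool]) -> list[str]:
    chunks = list(chunks)
    margins = []
    for i, chunk in enumerate(chunks, start=1):
        margin = write_margin_for_chunk(chunk)          # prefill chunk + generate margin
        margins.append(margin)
        keep_going = on_margin(i, len(chunks), margin)  # progress + thumbs up/down
        if not keep_going:                              # user found the answer early
            break
    return margins

# Example callback: print progress and stop when the user types "stop".
def console_feedback(done: int, total: int, margin: str) -> bool:
    print(f"[{done}/{total}] margin: {margin}")
    return input("continue? (enter / 'stop') ").strip().lower() != "stop"
```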
Annotate, then delete... clear all drawings. Okay. So if you go to the repository you can see what we do. It's simple and it works with any language model; here we provide a demo with Llama, Phi, and Qwen. The code in the GitHub repository matches exactly the pseudocode we present in the paper: how we split into segments, how we prefill into the KV cache, and how we delete from the KV cache. We also show the state of the KV cache at each line of code so the reader can understand what is happening, and this here is the code for the method we use to delete from the KV cache. Let me see if there is anything missing... here we also provide a comparison of how it differs from RAG and from plain long-context processing.

Questions? Let me check the chat. Eugene is crushing it answering the questions in the chat, thanks Eugene. My pleasure. Thank you for taking the time to share the paper with us and even preparing slides. Well, the slides were from another talk I gave inside the company, so it's reusing stuff, but thank you, thank you everyone.

Let me go through the chat and see what I can answer. Yes, we are not making any change to the model architecture, so you don't have to fine-tune anything or change anything. Can you use this with something like LangChain? No, because it requires a modification of how the inference engine uses the model: when you work at the KV cache level, you can't just use the APIs and tell them to remove things from the KV cache or overlap things in the KV cache. But it doesn't require changes to the weights of the model, which is why we say no fine-tuning is needed. Is the extractive summary prompt just the instruction to produce the margin? Yes: it's a prompt we add after each chunk to extract information relevant to the query, not just generally relevant information but information about that specific query, because this inference pattern is meant for prompts composed of a context plus an instruction, so we always know what the instruction is; that is the best use of this pattern.

In the paper we also show how chunked prefill works at the KV cache level, if you are familiar with the queries, keys, and so on, and we show how to overlap the computation of a margin with the classification of a margin. This figure is the representation of the KV cache during the prefill of one chunk and how it can be overlapped with the classification, using the same request to the same language model. Question: could you go deeper into the overlap? Where does the overlap happen: between the different chunks, or within a chunk? Let's visualize it; we have a nice representation here. You extract the margin and you need a way to classify it: you can either use an auxiliary classifier, another model that labels it as relevant or irrelevant, or you can use the same language model to classify it.
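A naive sketch of that classification step as its own request, just to show what the model is being asked; the prompt wording is hypothetical, and the paper's trick, described next, avoids the extra request by overlapping classification with the next chunk's prefill:

```python
# Naive version: classify a margin with the same model in a separate request.
import torch

def classify_margin(model, tok, query: str, margin: str) -> bool:
    prompt = (
        f"Query: {query}\n"
        f"Extracted note: {margin}\n"
        "Is this note relevant to answering the query? Answer yes or no: "
    )
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=3, do_sample=False)
    answer = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    return answer.strip().lower().startswith("yes")
```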
But if you want to use the same language model to classify, you would normally need to create another request in the batch, because you don't want the classification request to see anything else in the KV cache. You just want to ask the model: I asked you to extract information about this query, "is Ethan Washington in a marble-floored room?", and you extracted this text; is it relevant to the query or not? Done that way, you need another request in the batch, but we show that you can fold it into the chunked prefill of the same request.

When you do chunked prefill, you first add the first chunk to the KV cache, so the keys and queries are the first chunk, the c1 you see here. Then we add the extractive summary prompt after it and use it to generate tokens: that part is the prefill of the first chunk plus the extractive summary, and the token generation that follows produces the first margin. Usually we pre-allocate the KV cache, so it's not a growing tensor: we pre-allocate it with a fixed number of, let's say, padding tokens, which aren't really padding, just unused slots in the cache, and we replace them with the tokens actually generated by the model. After we have generated the first margin, we "delete" the margin and the extractive prompt tokens, but we aren't really deleting anything: we just change the pointer that tracks how many positions of the KV cache are in use, so suppose the pointer now points back here. Then we can prefill the second chunk. Each token of the second chunk needs to attend causally to itself and the previous tokens of the same chunk, plus all the tokens of the first chunk that was already prefilled. We also prefill the extractive summary prompt, which can see all the past tokens, and, skipping a few positions that we reserve for the margin that will be generated, we can additionally prefill the classification instruction for the first margin, the one generated in the previous step. Then, in the token generation step that follows this prefill of the second segment, we generate the tokens of the second margin while classifying the first margin, which we already obtained in the step before. So we are generating two token streams in the same request during the same generation step: one attending, as the attention mask shows, to the first chunk, the second chunk, and the extractive summary after the second chunk, and one attending only to the part relevant to classifying the first margin produced in the previous step. So you can also do it like this. Thank you.
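A toy illustration of that pointer bookkeeping; the shapes are made up, and real engines such as vLLM manage this with PagedAttention pages rather than one flat pre-allocated buffer:

```python
# Toy sketch of "delete from the end" on a pre-allocated KV cache: the tensors
# are allocated once with headroom, and a single integer tracks how many
# positions are in use. Dropping the extractive prompt and a generated margin
# is just moving that integer back; no data is copied or freed.
import torch

class PreallocatedKV:
    def __init__(self, num_layers, num_heads, head_dim, max_len, dtype=torch.float16):
        # 2 = keys and values, 1 = batch size in this toy example.
        self.buf = torch.zeros(num_layers, 2, 1, num_heads, max_len, head_dim, dtype=dtype)
        self.used = 0                       # the "pointer": positions currently in use

    def append(self, layer_kv):
        # layer_kv: (num_layers, 2, 1, num_heads, n_new, head_dim)
        n_new = layer_kv.shape[-2]
        self.buf[..., self.used:self.used + n_new, :] = layer_kv
        self.used += n_new

    def truncate_to(self, length):
        # "Delete from the end" in O(1): later appends simply overwrite the slots.
        self.used = length

cache = PreallocatedKV(num_layers=4, num_heads=8, head_dim=64, max_len=8192)
# ... append the first chunk's keys/values ...
boundary = cache.used                       # remember the chunk boundary
# ... append the extractive prompt + margin keys/values, generate the margin ...
cache.truncate_to(boundary)                 # roll back to the chunk boundary
```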
There's a question in the chat; I think from the explanation Naz's question is clear: when you create the margin for the second chunk, you're actually attending to the first and the second chunk. Yes. So could you change the attention mask to only look at the latest chunk? That's exactly the question. Yes, but it's not actually possible, and let me clarify why: the KV cache is made up of contextualized tokens, not isolated ones. Token number one in the KV cache is a contextualized version of tokens zero and one; token number two is a contextualized version of tokens zero, one, and two. If you tell the model to look only at the last tokens, you're asking an autoregressive model to produce the logits of, say, p(x10) while looking only at x9 and x8, which are contextualized tokens that contain information about x7, x6, x5 that you are not providing, so you go out of distribution. That's why. Thank you.

Next question: how much of the KV cache do you prefill with the chunk versus leave free for generation? If you use vLLM, for example, it uses this thing called PagedAttention, so it allocates an entire page, which is a lot of tokens, like another chunk, more than enough to generate the margin.

What are the next steps for this? For sure we're sending it to conferences, getting it published, and presenting it around, but we're focused on long-context modeling and looking at how long context can be better leveraged, so we'll be researching a lot in this field. There's a question from Ahmad: how are the queries chosen for the query-based summarization and classification? We work with prompts that consist of a context plus a query, so we always know what the query is; that is the structure of the prompt we work with. What use case at Writer led to this research? I am personally very interested in long-context modeling and I'm given the freedom to research what I like, and Writer is also interested in long-context modeling, so things intersect and here we are; plus we have many smart people working together, we did a few brainstorms, and that's it. What latency do you see for a typical request? There's no real latency increase because of this; you're just paying more to generate some extra tokens in the intermediate steps. What changes is that before, you would process the entire prompt as chunk prefill after chunk prefill; now you have chunk prefills interleaved with some token generation, which slows down the whole request, but you're getting something back: feedback, and the possibility to see what the model is actually seeing at each step, so you get a human in the loop. The human is waiting, but waiting with some feedback; it's nice to have progress bars, right? And maybe this is sensitive, but do you have a demo of how this looks in the user interface, or do we have to sign up to Writer to see it? We don't have that yet, but we're working on demos; here we have a concept of how it would look. Thank you.
Okay, next: in some cases writing in the margins does not work as well as other methods; why? There are two factors. First, each margin is a kind of summarization of what is present in the context so far, so it depends highly on how good that model is at summarizing: the better the margin, the better the information it extracts and the better it can be leveraged. Think of the student: if your note-taking skills aren't good, your notes probably won't be useful. The second factor is the comparison with RAG that you see here. For this RAG baseline we actually put ourselves in the worst possible condition, basically trying to help RAG beat us, and it still doesn't. Usually in RAG you take the chunks, extract vectors from them, and match them against the query with a dot product or similar. What we did instead was ask the language model itself whether each particular chunk is relevant, so you have a 70-billion-parameter model telling you whether a chunk is relevant, rather than a vector match via dot product. We helped RAG a lot; against a naive RAG approach we would do much better.

Thanks Umar. Does anyone else have questions, or want to come on screen to ask Umar and Sam more questions? I'm actually having a little trouble wrapping my mind around why chunked prefill is so much more efficient; I looked at the reference and I kind of get the idea, but maybe you can help me with the intuition. Okay, first of all, chunked prefill doesn't exist because it's more efficient; it exists because we need it, we must do it. When you prefill a chunk into the language model, look at the KV cache representation here: prefilling chunk number one, c1, generates a quadratic matrix. With four tokens you generate a four-by-four matrix, and that becomes prohibitive for very long prompts: with one million tokens that's a one-million-by-one-million matrix where each value is the dot product of two vectors. The computation cost would also make it slow, although GPUs are really good at parallelizing that many operations; the real problem is the memory of this prefill, because when you generate that matrix it is huge and it doesn't fit. (Some rough numbers follow below.) So we are forced to chunk. We do this chunked prefilling, chunk one, chunk two, but we weren't leveraging those chunks, because we're only doing it out of necessity; it's slower than doing it in one pass, but since we're already forced to do it, why not use it? Yeah, I think I got most of the paper, it was just the chunked prefill background. If you want more information I can give you some references: one is the vLLM page, since it's now an experimental implementation in vLLM, and NVIDIA recently published an article explaining chunked prefill; I'll send the links later. But basically prefilling is the most expensive part of working with long prompts for language models, that's why.
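A rough back-of-the-envelope for that size problem. This is the naive view: fused attention kernels such as FlashAttention never materialize the full score matrix, but the quadratic compute remains.

```python
# Naive size of the attention-score matrix for a one-shot prefill of 1M tokens,
# versus the per-chunk matrix that chunked prefill works with.
seq_len = 1_000_000
bytes_per_score = 2                                   # fp16
full = seq_len * seq_len * bytes_per_score
print(f"{full / 1e12:.0f} TB of scores per head per layer")            # ~2 TB

chunk_size, cache_len = 4_096, 1_000_000
chunked = chunk_size * cache_len * bytes_per_score
print(f"{chunked / 1e9:.1f} GB per head per layer at the last chunk")  # ~8.2 GB
```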
But what I was having trouble understanding is why that is: is it just because it's quadratic, so you have to break it up into chunks? Maybe I can take a stab at it. Look at the attention mask here, where I'm generating the first margin: I already have seven tokens in the KV cache and I'm generating the eighth token, so I'm doing seven dot products. Token generation, meaning generating one token using whatever is in the KV cache, is linear with respect to the size of the KV cache; prefilling the KV cache is quadratic, and mostly because it's quadratic it's very expensive. So we're comparing something linear with something quadratic. And if you think about long-context prompts: if you're working with a two-million-token context window, roughly 1,999,000 of those tokens will be prompt; nobody ever generates more than, say, five thousand tokens. So the most expensive part is the prefilling.

Just to get the intuition: the other extreme is a chunk size of one token, right? So what's the trade-off? What they do is make the chunk as big as possible while it still fits in the GPU; I think good numbers are around four thousand or eight thousand tokens, something in that range. And as I said before, token generation is usually memory-bound, meaning the limit is how much KV cache your memory can hold, while prefilling is compute-bound; so to maximize GPU utilization, providers like OpenAI or Cohere overlap your new request's prefill with other people's ongoing token generation, so the GPU is 100% utilized. Okay, that's helpful, and if you send those links I'll definitely read them, thank you.

Is there a break-even point, a certain context length where it becomes more valuable to do writing in the margins than just using the LLM by itself, where the compute equals out, or is it altogether better with margins? I believe... okay, writing in the margins is like reading a book and taking margin notes versus just reading the book. I think it's always worth reading the book with the margins, but you are actually paying the price for that: it's not something that comes for free, you pay the cost of generating the margins, you put in some effort, and then you leverage it. So does it always help? So far, from our data, yes. It's not free, you pay, but is it worth it? So far, yes. Even if you're literally just talking about a single chunk, like a sentence instead of a book? Oh, in that case no, it's not worth it, because then you're not even doing chunked prefill: when the context is small it just gets prefilled at once. But once the context starts growing, yeah, it makes sense.
I think the point still stands, though: say you've got a paragraph and your chunks are sentences, or you've got a page, a thousand tokens, and your chunk is every sentence, so it's a form of highlighting, marking which sentences on the page are relevant. At that level it's kind of negligible whether you throw the whole thing into a prompt or highlight seven of the 40 sentences with this approach; that's the other non-extreme, right? Yeah, basically whenever the context can just be prefilled into the KV cache without any chunked prefill, I don't believe it's worth using; but when the context is long, it helps, and it helps much more than chunking separately like we do through APIs with LangChain, because there you pay the prefilling cost twice, while here we pay it only once. We also prove in the ablation studies that it's always better to send the context plus the margins, never just the margins: you can see in the context-compression ablation that sending only the margins, or only the context, is always worse than providing both. (Context here meaning the whole thing, the entire book.)

Building off that ablation: say you've got a model that doesn't have the context window, so you have to do some sort of chunking and splitting; say a total context of 8,000 tokens and a million-token document. There are approaches where, if you know how, you can process chunk by chunk and then combine. What you would expect is that you could do this writing-in-the-margins approach for each chunk at every level and then scale it down with however many steps you need; is that intuition still pretty accurate? With an 8,000-token context window I believe the latency would be higher, because at each step you're adding more. At that scale it's actually more convenient to just do independent chunking and generation: you can always split the context into chunks, send multiple requests, and pay the price of that extra compute in that range. But when you're talking about 64,000 tokens and up, it starts making more sense to use this approach; at the smaller level the traditional approaches work fine.

Any other questions? Well, another question that came up is: why now? Why did nobody think about this before? Because we didn't have very-long-context models before, and we weren't even forced to do chunked prefill. As you can see from vLLM, it's an experimental feature there right now; it's only now that we need chunked prefill and everyone is doing it, and that's why we have this. Innovation always starts from some problem you face and some need you have: right now we have this need and we have the capability, and that's how we came up with this.

Awesome. We have a few more minutes if anyone has last-minute questions, and a big shout-out and thanks to the Writer team for presenting. Thank you for listening. You're welcome to send us your questions; we have the GitHub repository, and I suggest looking at the code, it's heavily commented and follows the same pattern we show in the paper, so it's easily understandable for everyone, and we did a lot of nice tricks in there, like the fact that you can always delete from the end of the KV cache, and now you also know why. It's an interesting project. If anyone's interested, on Fridays we have a similar "AI in Action" session where we take a practical angle beyond the papers.
The code is up and the paper is up, so if anyone wants to run it, present it, and share their learnings, it would be a really good learning exercise; that offer is always there. We got a question from Jimmy: any future work in this direction? For sure we'll keep working on long context and how to better leverage it. There's another kind of problem with long context: how well language models use long context depends heavily on the attention mechanism and how the softmax works. We've seen, with the paper on attention sinks, that because attention is a weighted sum over the tokens, where each token is given a weight, most of the weight goes to the first few tokens. There's a lot of research in this area; a few days ago another paper came out on sigmoid attention, which also studies the distribution of these logits. So I think the attention mechanism will play a big part in how we can extend long context, if we can also fix that part. I'm very interested in the KV cache and in optimizing long-context modeling, so we're working in this direction, both because the market needs it and because I like it; I think it's cool to be able to analyze an entire book or an entire codebase instead of hoping that RAG finds the right piece.

Awesome. Big shout-out to the Writer team, always great to have you present. Sam is in the Discord as well, so I'm sure he'll relay questions. We've got the recording and we'll share it with your team; I don't know what you'll choose to do with it. Next week we've got swyx presenting some of the Strawberry, Q*, and Quiet-STaR papers. The following week, if anyone's interested in anything, volunteers are always open; I posted a few papers in the paper club channel, and I think there's also the Mistral stuff, so if anyone wants to lead, pop in there. Otherwise, next week swyx is doing the Strawberry and STaR stuff; that's on the agenda. Cool, thank you guys. Thanks everyone, take care. Thank you.

Yes, I was just about to end the meeting. Quick question: is there any way you can copy over the comments? I'm trying, but the chat seems to lazy-load as you scroll up and down, which is really painful; let me see if I can extract these comments. Do you know if they normally get saved with a Zoom recording? You can click "save chat." I was able to copy the comments without any problem; the file is usually saved on the host's computer in a folder like Documents/Zoom/meeting date and time. On Windows there's a chat log file. Okay, I got it; I just saved the chat and I'll throw the text file in Discord. Slides would be great too if we can grab them. Yeah, I'll get it all and post them right now. Perfect, thanks guys. Sweet, all right, thanks.