
[Paper Club] Writing in the Margins: Chunked Prefill KV Caching for Long Context Retrieval



00:00:00.000 | it's recorded but tl;dr probably just shared internal so sam we'll we'll get you recording
00:00:06.240 | after cool i'm i'm gonna pass off to you guys whenever you want to start i'm sure people
00:00:16.080 | trickle in so from writer side i would be presenting i don't know if you guys can hear me
00:00:23.840 | yep all right uh there should be also a sim i don't know if he's also present but i will
00:00:30.480 | kick off the presentation let me know whenever you everyone is ready and i will start
00:00:35.920 | i think we're good take it away
00:00:56.240 | okay i will start by sharing my screen meanwhile let's see if it works
00:01:07.120 | desktop 2 and let me know if you can see my screen
00:01:13.840 | yep we can see your slides all right perfect so tonight uh can i start right everyone
00:01:27.600 | yeah you should be good it's recording people might trickle in every now and then but
00:01:32.800 | all right, so tonight i will be presenting a paper that came out from writer: writing in the margins,
00:01:40.240 | better inference pattern for long context retrieval.
00:01:44.720 | it's a paper about long context and how we leverage the kv cache to make
00:01:53.040 | long context modeling more effective. so i will do a somewhat technical but not too
00:02:02.800 | technical presentation, and later we can deep dive into the details of the paper. so let me open the
00:02:08.800 | chat so i can see what everyone is writing while i'm talking. oh yeah, i think
00:02:16.000 | i think you can just ignore the chat people are gonna be it's gonna be really buzzy and
00:02:20.560 | there's gonna be a lot of people like oh wow that's so cool everything yeah vibhu and i will
00:02:24.800 | take care of the chat for you. perfect. if something super crazy pops up we'll let you know, otherwise
00:02:29.360 | we'll we'll take care of it all right perfect all right so i will skip the part of what is
00:02:34.320 | a language model, but okay: a language model is a probabilistic model that leverages the prompt
00:02:39.040 | to generate what the next token is. and how do we generate text with the language model? we do it
00:02:42.560 | iteratively, one token at a time. most of the language models nowadays are based on the
00:02:49.360 | transformer model, and in the transformer model, basically, whenever we have a prompt the first
00:02:55.120 | thing that we do is we put this prompt into the memory of the language model, which is known as
00:02:59.760 | the kv cache, which is basically the keys and values in the transformer layers. and the
00:03:07.360 | operation of creating this initial kv cache is known as prefilling. so the first thing that the
00:03:13.680 | inference engine, which could be vllm, which could be tensorrt-llm or any other inference
00:03:18.160 | framework that you're using, does with your prompt is this prefilling.
00:03:25.200 | and if you're interested, actually, the prefilling is one of the most expensive
00:03:29.040 | parts of processing a prompt, because it has a quadratic cost with respect
00:03:33.840 | to compute as well as with respect to memory. so imagine that we have a prompt
00:03:39.360 | a prompt that says "austin is a city in". then the first thing that
00:03:45.600 | we do to generate text with this prompt is we prefill it into the language model, into the kv cache,
00:03:51.280 | and then we leverage it to generate tokens one token at a time. the kv cache is this kind
00:03:58.160 | of memory in the language model, in any transformer model that is autoregressive, such that
00:04:04.160 | if it contains some tokens then the language model will leverage them, and if it doesn't contain
00:04:10.160 | those tokens then the language model cannot leverage them. so the language model only sees
00:04:14.480 | what is inside of the kv cache. so what happens with our prompt "austin is a city in" is the first thing
00:04:19.280 | that we do is this prefilling, which puts these tokens into the kv cache, which is one for
00:04:25.600 | each layer of the transformer, and then we generate the next token by asking the language
00:04:31.120 | model what the next token is. suppose the next token is the word "texas": we take this token "texas",
00:04:37.200 | we keep it in the kv cache so that the language model can leverage it to generate the next token,
00:04:42.080 | and suppose the next token is a comma, etc. etc., until we generate the entire response of
00:04:47.680 | the language model.
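A minimal sketch of the prefill-then-decode loop just described, assuming the Hugging Face transformers API; the model name is only a placeholder and greedy decoding is used for simplicity:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder: any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt_ids = tokenizer("Austin is a city in", return_tensors="pt").input_ids

# Prefill: one forward pass over the whole prompt builds the KV cache (past_key_values).
with torch.no_grad():
    out = model(input_ids=prompt_ids, use_cache=True)
past = out.past_key_values

# Decode: feed only the newest token; the model reuses the cached keys and values.
next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
generated = [next_id]
for _ in range(10):
    with torch.no_grad():
        out = model(input_ids=next_id, past_key_values=past, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    generated.append(next_id)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```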
00:04:53.840 | now, we have seen that in the past few years prompts are becoming longer and longer: we started with a 2,000-token context window, then 4,000, 8,000, 32k, 64k or 100k, and now we have reached
00:05:01.840 | millions of tokens. this means that you can send an entire book to the language model and
00:05:08.880 | you can ask the language model questions about this book. however, with great prompts comes
00:05:14.400 | great responsibility, and the reason is the following. so imagine you have a very long
00:05:21.520 | prompt: suppose that you have a book and you want the language model to answer questions about this
00:05:26.720 | book, and suppose that this book is like 1 million tokens long. if you do the prefilling of this
00:05:34.400 | 1-million-token prompt in the kv cache, the language model will actually not be able to do
00:05:42.320 | that, because as i said before the prefilling is one of the most expensive operations that we do
00:05:48.080 | when inferencing language models. why? because it's quadratic with respect to the
00:05:53.360 | sequence length, in terms of memory and also in terms of compute. so language models actually
00:05:58.560 | cannot prefill the entire prompt in one single pass into the kv cache. so what they do is
00:06:05.200 | they do it by chunks. this is called chunked prefill, and it's an experimental feature that has
00:06:10.240 | been recently introduced in vllm, but it's probably used in more sophisticated inference engines at
00:06:16.800 | major companies. so what we do with chunked prefilling, basically, is we split the prompt of the
00:06:27.200 | user into multiple chunks and we prefill each chunk step by step. so suppose that the user sent
00:06:33.440 | an entire book which is made up of 10 chapters. now, the chunks usually are of a fixed size:
00:06:39.920 | this doesn't mean that a chunk has some contextual meaning, it may not be the first
00:06:44.720 | chapter of the book or the second chapter of the book, it could just be the first
00:06:49.040 | 4,000 tokens of the prompt and then the next 4,000 tokens of the prompt. so we prefill the first
00:06:55.280 | chunk into the kv cache, then we prefill the second chunk into the kv cache, so now the kv cache
00:07:00.480 | contains the first chunk and the second chunk, then the third chunk, etc. etc., until all the prompt is
00:07:05.680 | inside of the kv cache, which can then be leveraged to generate tokens.
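A hedged sketch of that chunked prefill loop, reusing the transformers-style interface from the previous block; the 4,096-token chunk size is just an assumed example value:

```python
import torch

CHUNK_SIZE = 4096  # fixed-size chunks: the split is positional, not semantic

def chunked_prefill(model, input_ids, chunk_size=CHUNK_SIZE):
    """Prefill a long prompt into the KV cache one chunk at a time."""
    past, out = None, None
    for start in range(0, input_ids.shape[1], chunk_size):
        chunk = input_ids[:, start:start + chunk_size]
        with torch.no_grad():
            # Each chunk attends causally to itself and to everything already cached.
            out = model(input_ids=chunk, past_key_values=past, use_cache=True)
        past = out.past_key_values
    # The cache now holds the whole prompt; the logits of the last chunk's final token
    # can seed token-by-token generation exactly as in the earlier sketch.
    return past, out.logits
```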
00:07:12.720 | this is intuitively very similar to how a student would read a book. for example, imagine a student is given a book to read
00:07:18.640 | and then has to answer a question about this book. what the student would do is
00:07:25.760 | read the first chapter, and now the brain of the student contains information about
00:07:29.920 | only the first chapter then the student would read the second chapter and now the brain of
00:07:33.840 | the student contains information about the first chapter and the second chapter of the book and
00:07:37.680 | then etc etc until the student reads the last chapter now the brain of the student contains
00:07:42.080 | information about all the chapters and then we the student will be given the question and the
00:07:49.520 | question and then the student has to leverage the information that he has read from the book to
00:07:55.520 | answer this question. however, the student would struggle to do it. why? because intuitively, when
00:08:00.960 | you read a book, the moment you start reading the second chapter you may have already forgotten
00:08:06.240 | what the first chapter is about. so what is a better strategy that this student could
00:08:14.640 | follow? well, the student could read the chapters and, while reading, take some annotations.
00:08:21.040 | and this is what we do with writing in the margins. because for a very long prompt we are
00:08:27.760 | already forced to split the prompt into multiple chunks to do this chunked prefill, why not
00:08:35.520 | leverage the partially prefilled kv cache to generate some annotations that can then be
00:08:42.080 | leveraged to improve the model's capability of extracting information from this prompt?
00:08:47.120 | so basically, writing in the margins, from a technical point of view, works as follows.
00:08:54.160 | we have this very large prompt, we split it into chunks because we are forced to split
00:08:58.800 | it into chunks, we cannot prefill the entire context into the kv cache. we prefill the first chunk, and
00:09:07.360 | then after the first chunk we add a prompt that tells the model: okay, use the text above to
00:09:13.280 | extract information about the query, where the query is the question that we want to get an
00:09:18.800 | answer to. so now the kv cache contains the first chunk and this prompt here, and we leverage it to
00:09:25.120 | generate a few tokens, which is this margin annotation. and then, this is another trick: you
00:09:31.200 | can delete stuff from the kv cache, but only from the end. why can you do that? because the language model
00:09:38.720 | is an autoregressive model, which means that every token depends on all
00:09:45.040 | past tokens, which means you can delete stuff from the end and regenerate it eventually, but you
00:09:51.920 | cannot, of course, delete stuff from the beginning or from the middle, because it would invalidate
00:09:56.960 | all the future tokens. but you can always remove stuff from the end. so what we do is we prefill
00:10:01.280 | the first part of the prompt, we prefill this extractive instruction, we generate a few tokens which is the
00:10:08.640 | margin annotation, then we can delete this margin and this instruction that we have added, and then
00:10:14.720 | we prefill the second chunk. we then append another extractive prompt, we generate a few more
00:10:23.120 | tokens, which is the second margin, so the second margin will depend on the first and the second
00:10:27.680 | chunk. then we delete this second margin's tokens, we delete the extractive prompt, etc. etc., until
00:10:34.720 | we have processed the entire prompt.
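A sketch of that loop in code, under stated assumptions: a transformers-style causal LM, the legacy tuple KV-cache layout (newer `DynamicCache` objects expose a `crop()` method instead), and an illustrative extraction prompt; this is not the exact prompt or code from the paper or repository:

```python
import torch

def truncate_cache(past, keep_len):
    # Assumes the legacy tuple layout: one (key, value) pair per layer, each shaped
    # (batch, heads, seq_len, head_dim). Dropping entries from the END is always safe
    # because later tokens never influence earlier cached entries (causal attention).
    return tuple((k[:, :, :keep_len, :], v[:, :, :keep_len, :]) for k, v in past)

def greedy_decode(model, last_logits, past, max_new_tokens):
    next_id = last_logits[:, -1].argmax(dim=-1, keepdim=True)
    out_ids = [next_id]
    for _ in range(max_new_tokens - 1):
        out = model(input_ids=next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        out_ids.append(next_id)
    return torch.cat(out_ids, dim=-1), past

EXTRACT_TMPL = "\n\nUse the text above to extract information relevant to this query: {query}\n"

@torch.no_grad()
def write_in_the_margins(model, tokenizer, context_ids, query, chunk_size=4096, margin_len=128):
    past, margins = None, []
    cached_len = 0  # number of *context* tokens currently kept in the cache
    for start in range(0, context_ids.shape[1], chunk_size):
        # 1. Prefill the next chunk on top of the existing cache.
        chunk = context_ids[:, start:start + chunk_size]
        out = model(input_ids=chunk, past_key_values=past, use_cache=True)
        past = out.past_key_values
        cached_len += chunk.shape[1]

        # 2. Append the extractive instruction and generate a margin note.
        instr = tokenizer(EXTRACT_TMPL.format(query=query), return_tensors="pt").input_ids
        out = model(input_ids=instr, past_key_values=past, use_cache=True)
        margin_ids, past = greedy_decode(model, out.logits, out.past_key_values, margin_len)
        margins.append(tokenizer.decode(margin_ids[0], skip_special_tokens=True))

        # 3. "Delete" the instruction and margin from the end, keeping only the context.
        past = truncate_cache(past, cached_len)
    return margins, past
```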
00:10:42.640 | and this is a visualization that we have also put in the paper. so basically, imagine you are given a very large prompt. what we do is we prefill the first
00:10:48.080 | part in the kv cache and then we extract information from what is present in the kv cache,
00:10:54.160 | and we call it the first margin. we can then also classify these margins, and in the paper we also
00:11:02.080 | show that actually the computation
00:11:08.000 | of the margin and the classification of the margin can be overlapped inside of the same request
00:11:12.560 | in the batch, so you don't need to have another request in the batch. but this is okay, a little
00:11:17.200 | kv cache trick for optimizing the inference. so what do we do with these margins? basically we
00:11:23.360 | append them at the end, before asking the question to the language model. because our goal is: okay,
00:11:29.360 | we have this very big context and then we have some question. instead of just asking
00:11:34.400 | the question, where the model may not be able to find the answer, we generate these margins and then we append
00:11:38.960 | them right before asking the question, and then we ask the model the question. so now the model
00:11:43.200 | can also leverage these margins, which are present right before the question, to better answer the
00:11:51.040 | question. why will these margins be leveraged by the language model? first
00:11:58.560 | of all because the instruction that we use to extract them is an extractive
00:12:06.560 | summary: we ask the language model to extract information using the prefilled kv cache,
00:12:14.560 | knowing what the query is, so the margins are relevant to answer that particular query.
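As a small illustration of that final prompt layout (the wording of the template is illustrative, not the paper's exact prompt):

```python
def build_final_prompt(context: str, margins: list[str], question: str) -> str:
    # Keep only margins that survived classification (empty strings dropped here for brevity).
    notes = "\n".join(f"- {m}" for m in margins if m.strip())
    return (
        f"{context}\n\n"
        f"Notes extracted while reading, relevant to the question:\n{notes}\n\n"
        f"Question: {question}\nAnswer:"
    )
```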
00:12:22.480 | why do we add these margins at the end, before asking the question? because a few months ago
00:12:28.960 | there was a paper called lost in the middle. basically it says, and it's actually true,
00:12:34.400 | that if the relevant information that we're trying to extract from
00:12:39.200 | this prompt, which could be very large, is present either at the beginning or at the end of the
00:12:45.520 | prompt, then the language model will very likely be able to
00:12:50.240 | find it; however, if this information is present kind of in the middle, then the
00:12:54.720 | language model will be less likely to find it. so that's why we add them at the end:
00:13:01.760 | it improves the language model's ability to leverage this information. but does it even work?
00:13:08.880 | so yes: we show in the paper a comparison across pre-trained language models.
00:13:15.520 | so, first of all, we are not fine-tuning any language model, we are not changing anything;
00:13:20.480 | this is just a different way of utilizing something that is already being done, which
00:13:25.440 | is the chunked prefill of the kv cache, to improve a language model's ability to leverage long context.
00:13:31.680 | so it can be used with any transformer model without fine-tuning it
00:13:35.920 | just by doing this inference differently: that is, don't just blindly prefill
00:13:42.240 | the kv cache, but prefill chunk by chunk, because you are forced to, and then leverage it to extract
00:13:47.040 | these margins, and then leverage all these extracted margins at the end.
00:13:54.240 | what you are asking me to do
00:14:04.800 | I think someone had their mic on, I just muted it, accidental. yeah, go ahead.
00:14:04.800 | so yeah, in the paper we show that it helps pre-trained language
00:14:11.360 | models, without any fine-tuning, to better utilize long context, and we provide a few benchmarks.
00:14:19.920 | as you can see, for example, smaller models show more improvement. so here we have, for
00:14:28.240 | example, long context, the generic pattern that we use for long context: just the
00:14:33.440 | context and the question, whatever the benchmark is. then we have rag, which means basically that,
00:14:40.640 | instead of giving the entire context plus the margins and then the question, we are
00:14:49.840 | basically processing each of the chunks separately, asking the language model which of them is
00:14:56.720 | relevant, and then only providing the relevant ones, which is what we would do in rag,
00:15:03.040 | and then asking the language model to leverage only the relevant ones at the end. or the writing
00:15:11.120 | in the margins approach, which is all the context plus the margins that were extracted during this
00:15:17.920 | chunked prefill, and then the question at the end. now, how is this different from a prompting strategy?
00:15:26.480 | because you may be thinking: okay, but we can already take a very large prompt, split it into
00:15:32.080 | chunks, and then use the language model to kind of summarize each of these chunks independently,
00:15:38.800 | and then send it to the language model again to answer the question. well, it's a matter of
00:15:45.120 | cost. so let's talk about cost. imagine that we have this very big book made up of 10 chapters,
00:15:53.920 | and then we have a question at the end that we want to get an answer to, which is "what is the
00:15:58.480 | answer to, like, the universe and everything". now, if you don't use writing in the margins,
00:16:04.160 | what you would do is you would feed the
00:16:10.640 | first chapter of the book and then extract some margin with that (actually, let me show you this
00:16:15.120 | other slide), then you would take the second chapter of the book and extract the margin with that,
00:16:20.240 | then the third chapter of the book and extract the margin with that, etc. etc. suppose that each
00:16:24.560 | of these chapters is 100,000 tokens. then the cost to generate the margins, supposing that the margins
00:16:30.800 | are really small, is more or less 1 million tokens, because 100,000 tokens multiplied by
00:16:35.680 | 10 chapters is around 1 million tokens. but then you need to also send the entire book plus
00:16:41.280 | these extracted margins again to the language model to generate the answer, and it
00:16:48.400 | would cost you another million, because the model has to redo this prefilling
00:16:52.560 | of 1 million tokens. so it would cost you 2 million tokens. but with writing in the margins, for a
00:16:58.400 | 1-million-token prompt it would cost you more or less 1 million tokens, because you don't
00:17:03.440 | have to re-prefill the entire context, because you have already prefilled it:
00:17:10.640 | the kv cache is growing and you're extracting some information, and then you don't have to
00:17:15.600 | repopulate it, which is what you would do if you treat each chunk independently and then
00:17:24.560 | do, kind of, let's say, the chunking that we commonly do.
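The token accounting for that example, written out (numbers are the speaker's rough figures, not benchmark results):

```python
chapters, tokens_per_chapter = 10, 100_000
context_tokens = chapters * tokens_per_chapter   # 1,000,000 tokens in the book

# Chunk-and-summarize through an API: each chapter is prefilled once to produce its
# summary, then the whole book plus the summaries is prefilled again for the final answer.
naive_prefill = context_tokens + context_tokens  # ~2,000,000 prefilled tokens

# Writing in the Margins: the book is prefilled once, chunk by chunk, and the margins are
# generated from the cache that already exists, so only the short instruction and margin
# tokens are extra.
wim_prefill = context_tokens                     # ~1,000,000 prefilled tokens

print(naive_prefill, wim_prefill)                # 2000000 1000000
```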
00:17:29.840 | so what are the advantages well it's compatible with any transformer model
00:17:35.600 | there are other advantages that i want to show you in a video that i made so let me show this
00:17:42.320 | is on my linkedin post i don't know if you can see now again my my screen right it's a yeah
00:17:48.880 | looks good we see the linkedin video perfect okay so we have a large document suppose 1
00:17:54.560 | million tokens and then we have a question like how many employees were hired in 2006
00:17:58.560 | now, when we have a very large prompt, it must be prefilled into the kv cache by chunks, and this
00:18:06.720 | operation is called chunked prefill. so what happens with chunked prefill is that you
00:18:12.720 | take the first chunk and you prefill it into the kv cache. then what we do is we add to the kv cache
00:18:20.160 | this extractive summary prompt, which is, for example: use the text above to extract information
00:18:24.800 | about the following query: how many employees were hired in 2006? and then we generate tokens
00:18:30.960 | using whatever is inside the kv cache, which is the first chunk plus this
00:18:36.160 | extractive summary prompt, and suppose that we generate a few tokens, which are visible now.
00:18:42.800 | we take these tokens and we save them, so we decode whatever is generated using the tokenizer
00:18:50.320 | and we save it, and then we remove it from the kv cache. now, if you are looking at it from an
00:18:56.880 | implementation-wise point of view, you don't actually remove anything from the kv cache:
00:19:01.840 | usually the kv cache allocation is static, and even if you use vllm, it's done using
00:19:07.680 | the so-called paged attention, so you actually allocate pages of kv cache.
00:19:13.120 | so basically, it's not like you are removing stuff from the kv cache, you are just resizing
00:19:22.240 | the tensor, which is actually an O(1) operation, so it doesn't have additional cost: you just
00:19:27.920 | keep track of how many tokens there are.
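A tiny sketch of why that "deletion" is effectively free; the buffer layout and class name here are illustrative, not vLLM's or the repo's actual implementation:

```python
import torch

class PreallocatedLayerKV:
    """One layer's KV buffer, preallocated to max_len; trimming the end is O(1) bookkeeping."""

    def __init__(self, num_heads, max_len, head_dim, dtype=torch.float16):
        self.k = torch.zeros(1, num_heads, max_len, head_dim, dtype=dtype)
        self.v = torch.zeros(1, num_heads, max_len, head_dim, dtype=dtype)
        self.length = 0  # number of valid cached tokens

    def append(self, new_k, new_v):
        n = new_k.shape[2]
        self.k[:, :, self.length:self.length + n] = new_k
        self.v[:, :, self.length:self.length + n] = new_v
        self.length += n

    def truncate_to(self, keep_len):
        # No copy, no free: the margin/instruction tokens at the end are simply marked unused.
        self.length = keep_len

    def view(self):
        return self.k[:, :, :self.length], self.v[:, :, :self.length]
```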
00:19:33.600 | so anyway, we take this margin, we save it somewhere, and then we prefill the second chunk into the kv cache,
00:19:36.320 | and then we again add another extractive summary prompt, and then we leverage it to generate the
00:19:45.040 | second margin, which will depend on the first and the second chunk of the prompt,
00:19:51.920 | etc. etc. for all the chunks. so we will have a list of margins.
00:19:56.320 | and then we can also classify these margins. so we can also say which ones, because some of
00:20:06.240 | the margins may be the model hallucinating, or the model just saying "i cannot find this
00:20:12.160 | information", etc. so we can also classify them. we can either use an auxiliary classifier or we
00:20:17.440 | can use the model itself to classify them, and we show in the paper that you don't have to actually
00:20:22.640 | create a new request to the model to classify these margins: you can actually overlap the
00:20:27.040 | generation of the margins with the classification of the previous margin using the same request
00:20:31.760 | in the batch. "does the margin pertain to all previous chunks or only the current one?"
00:20:38.240 | all the chunks up to that margin. so the first margin, only the first chunk; the second margin,
00:20:44.080 | chunks one and two; the third margin, chunks one, two and three, etc., because we want to leverage
00:20:49.280 | the prefilled kv cache. okay, so we classify these margins and then we append them at the end,
00:20:56.960 | and then we append the question
00:21:10.160 | and then we generate the answer the final answer
00:21:12.080 | so the advantage is that, as i said before, we exploit the chunked prefill of a large prompt to
00:21:21.440 | generate intermediate summaries. so we are exploiting something that we are already forced
00:21:26.640 | to do but are not leveraging right now, so it kind of comes for free, just with a minor, let's say,
00:21:34.160 | compute cost, because you have the cost of generating these margins, but you avoid the
00:21:41.520 | bigger cost of prefilling. so if you just use chunking techniques, like you can do with
00:21:47.360 | langchain for example, you pay twice the cost of prefilling; if you do this system, you don't pay
00:21:54.480 | twice the cost of prefilling, which is very expensive. and to give you an insight into how
00:22:00.800 | expensive prefilling is: basically, in most cases, whenever you work with openai or with cohere or
00:22:10.320 | any other provider, whenever you send a request, your request is always overlapped with the token
00:22:17.120 | generation of other requests. so the first time your prompt comes to their server, they are overlapping
00:22:21.840 | the prefill of your request with the token generation of others, because the prefilling
00:22:26.960 | is compute bound, because it's very expensive computationally, while the token generation is
00:22:32.800 | memory bound. so to always utilize the gpu fully, they kind of schedule together
00:22:40.400 | one prefill with multiple token generations. so it's compatible with any off-the-shelf
00:22:46.880 | language model without any fine-tuning, and we show some benchmarks in the paper.
00:22:54.000 | it improves the ability of any language model to extract relevant information, so solving the
00:22:58.080 | lost in the middle problem. and another cool thing that you can do is, basically,
00:23:04.640 | because now you generate these margins while prefilling the prompt, you can also feed
00:23:13.120 | these margins back to the user, and the user can classify them for you, like a thumbs up or thumbs down.
00:23:17.840 | so it adds a human in the loop, and also the user can visualize the progress of how the
00:23:23.600 | prefilling is going, because when you have a very, very large prompt, i believe that, because the cost
00:23:29.200 | of prefilling is quadratic, it will become really expensive to prefill it, and
00:23:36.320 | the user may have to wait many seconds. so you can actually give feedback to the user on
00:23:42.080 | how much context has been processed, and you can actually leverage the waiting time of the
00:23:46.000 | user to get thumbs up or thumbs down on these margins, which can actually improve the ability of
00:23:50.560 | the language model to use them. and the user can also early exit: if the user found the relevant
00:23:56.640 | information in one of these margins, you can say "okay, stop inference", and the user would not have
00:24:00.720 | to pay for all the context being processed. and we also provide an implementation:
00:24:07.920 | if you go to this url, github.com/writer/writing-in-the-margins, you can find
00:24:14.640 | our implementation of how we actually do this stuff with the kv cache. i don't know how to delete
00:24:20.080 | this line, it's so annoying, let me check... i believe it's annotate and then delete, clear, clear all
00:24:28.480 | drawings. okay, so if you go here you can see what we do. basically, it's simple, it works
00:24:39.040 | with any language model; here we provide a demo with llama, phi and qwen. here we show, for
00:24:45.760 | example, this code that is present here in the github repository, which matches exactly the code
00:24:51.680 | that we present in the paper, which is the pseudocode that you can see here: how we split into
00:24:58.000 | segments, how we prefill into the kv cache, and how we delete stuff from the kv cache, so it's
00:25:03.520 | present here. so here you have a very simple... we also show the state of the kv cache at each
00:25:09.920 | line of code, so that the user can understand what is happening here, and this is the code for
00:25:15.760 | the method that we use to delete stuff from the kv cache. all right, let me see if there is
00:25:22.480 | something that is missing here. yeah, here we provide a comparison of how it differs from
00:25:28.480 | rag and how it differs from just long context processing of questions.
00:25:34.960 | all right, let me check the chat. but Eugene is, like,
00:25:43.840 | crushing it answering the questions in the chat. thanks Eugene.
00:25:47.840 | my pleasure thank you for taking time to share with us about uh the paper and even preparing
00:25:54.080 | slides uh yeah this okay the slides were from another talk I gave in the company so it's uh
00:26:03.120 | reusing stuff but yeah thank you thank you everyone um so let me go through the question
00:26:11.280 | if there is something in the chat that I can answer. um, yes, we are not doing
00:26:22.160 | any change to the model architecture, so you don't have to fine-tune anything, you don't have to change
00:26:26.320 | anything. "can you use this stuff with, like, langchain?" no, because it requires a
00:26:33.840 | modification of how the inference engine is using the model. because when you work at the
00:26:40.560 | kv cache level, you cannot just work with the apis and tell them to, you know, remove stuff from the
00:26:46.000 | kv cache or overlap stuff in the kv cache. but it doesn't require changes to the model, to the
00:26:51.760 | weights of the model; that's why we talk about no fine-tuning here. "is the extractive summary prompt
00:26:59.520 | just the instruction to produce the margin?" yes, so the extractive summary prompt is basically
00:27:04.880 | a prompt that we add after each chunk to extract relevant information about that query. so
00:27:12.000 | it's not just to find the relevant information, but about the specific query, because this
00:27:16.480 | inference pattern that we introduced is specific to those prompts that are composed
00:27:22.320 | of a context plus an instruction, so we always know what the instruction is. that's why
00:27:26.720 | this is, i mean, the best use of this inference pattern. now, in the paper we
00:27:34.720 | also show how chunked prefill works at the kv cache level, if you are familiar with how
00:27:40.480 | the kv cache works, with the queries, the keys, etc., but we also show how to overlap the computation of a
00:27:47.360 | margin with the classification of a margin, and this is exactly the representation
00:27:54.720 | of the kv cache during the prefilling of one chunk and how it can be overlapped with the
00:28:01.840 | classification using the same language model and the same request.
00:28:09.120 | sorry, could you go deeper into overlap? I don't know where, like, the overlap happens: is it between
00:28:14.480 | the different chunks, or are you overlapping the different chunks? um, let's talk about overlap.
00:28:19.440 | so, let's say, first visualize it; let's say here we do have a nice
00:28:26.960 | representation of that, so it is here. so you extract the margin and you need to find a way to classify
00:28:33.360 | it. you can either use an auxiliary classifier, so use another model to classify it as relevant
00:28:39.040 | or irrelevant, or you use the same language model to classify it. but if you want to use the same
00:28:43.520 | language model to classify, you would need to create another request in the batch, because
00:28:47.200 | you don't want the classification request to see anything in the kv cache; you just want
00:28:53.520 | to ask the language model: okay, I asked a language model to extract information about
00:28:58.400 | this query here, "is Ethan Washington in a marble floored room?", and the language model
00:29:05.840 | extracted this stuff here; is it relevant to the query or not? if you want to do it, you would need
00:29:11.680 | to create another request in the batch. but we show here that you can actually do it, in the
00:29:17.600 | chunked prefilling, in the same request in the batch.
00:29:23.520 | so when you do chunked prefilling, basically what you are doing is you are adding the first chunk
00:29:29.200 | to the kv cache, so the keys and the queries are the first chunk, so this is c1 that you see here.
00:29:36.080 | and then what we do is, after this, we also want to add an extractive summary
00:29:40.960 | prompt, right, and then we use this one to generate tokens. so
00:29:48.080 | i would say this is the part of the prefilling of the first chunk, so the first chunk plus the
00:29:53.440 | extractive summary, and then we use it to generate tokens, so this is the first
00:29:58.000 | margin token generated. usually we pre-allocate the kv cache, so the kv cache is not like a growing
00:30:05.040 | tensor; we pre-allocate it with a fixed number of, let's say, padding tokens, but they are not
00:30:09.280 | really padding tokens, they are just unused spaces in the kv cache, and then we replace them with the
00:30:14.960 | tokens that are actually generated by the language model. so suppose that you have these
00:30:19.280 | unused tokens, which i call padding here. basically, what do we do after we have generated the first
00:30:27.040 | margin? we delete this margin, right, and also the instruction tokens. so what we are actually doing
00:30:32.480 | is, we don't delete anything, we just change the pointer position of the kv cache, of how many tokens
00:30:37.840 | are used. so now the pointer, suppose it's pointing here. then we can prefill the second chunk. so the
00:30:44.400 | second chunk needs to attend to all its tokens in a causal way, so each token in the second chunk
00:30:52.240 | needs to attend to only itself and all the previous tokens in the same chunk, but also needs
00:30:57.040 | to attend to all the past tokens of the first chunk that was already prefilled. and we also
00:31:01.840 | need to prefill the extractive summary prompt, which can see all the past tokens that
00:31:07.920 | came before it. but then, skipping some tokens that we reserve for the generation of
00:31:19.360 | the margin, we can also prefill the classification instruction for the first margin, which was
00:31:30.560 | generated in the previous step. so this is the
00:31:36.240 | prefilling of the second segment, along with the extractive instruction. after we have
00:31:43.600 | prefilled the second segment along with the first generated margin, during the token generation step we can generate the tokens of the
00:31:49.600 | second margin but also classify the first one, which we already obtained in the step before. so we are
00:31:56.320 | generating two token sets here, in the token generation step, in the same request: one
00:32:01.840 | using only the part relevant to the first chunk, the second chunk and the extractive summary
00:32:08.000 | after the second chunk, and one, as you can see from the attention mask,
00:32:14.240 | using only the part that is relevant to classifying the first margin that
00:32:20.400 | was generated in the previous step. so you can do it like this as well. thank you.
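Purely to illustrate the masking mechanics being described, here is a small sketch in which two logical requests share one physical request by giving their token spans different visibility rules; the span layout and sizes are hypothetical and not the exact layout from the paper:

```python
import torch

def build_overlapped_mask(spans, visible, total_len):
    """mask[i, j] == True means query position i may attend to key position j."""
    mask = torch.zeros(total_len, total_len, dtype=torch.bool)
    for name, span in spans.items():
        allowed = [spans[v] for v in visible[name]] + [span]
        for i in span:
            for other in allowed:
                for j in other:
                    if j <= i:  # keep causality within and across spans
                        mask[i, j] = True
    return mask

# Hypothetical layout of one physical request: chunk 1 is already cached; chunk 2 and the
# extractive prompt are prefilled, and, in the same request, the previously generated
# margin 1 plus its classification instruction are prefilled with a mask that hides the chunks.
spans = {
    "chunk1": range(0, 4),
    "chunk2": range(4, 8),
    "extract_prompt": range(8, 10),
    "margin1": range(10, 12),
    "classify_prompt": range(12, 14),
}
visible = {
    "chunk1": [],
    "chunk2": ["chunk1"],
    "extract_prompt": ["chunk1", "chunk2"],
    "margin1": [],                      # the classification stream ignores the chunks
    "classify_prompt": ["margin1"],
}
mask = build_overlapped_mask(spans, visible, total_len=14)
print(mask.int())
```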
00:32:30.160 | yeah, sorry, there's a question in the chat, and i think from the explanation naz's question is clear: so when you
00:32:36.560 | are creating the margin for the second chunk, you're actually paying attention to the first and
00:32:43.600 | second? yeah. okay, so could you change the attention mask to only look at the latest chunk? however...
00:32:50.160 | that's exactly the question, yeah. yes, but it's not possible, actually. i mean, let me clarify why it's
00:32:56.640 | not possible: because the kv cache is made up of contextualized tokens. these tokens
00:33:01.360 | are not standalone, they are contextualized. so token number one in the kv cache is a contextualized
00:33:06.720 | version of tokens number zero and one, token number two in the kv cache is a contextualized
00:33:10.880 | version of tokens zero, one and two. so if you tell the model to only look at the last tokens,
00:33:15.440 | you are creating an autoregressive model that is generating the logits of p of, let's say, x10
00:33:24.800 | while only looking at x9 and x8, which are contextualized tokens that contain information
00:33:33.600 | about x7, x6, x5, but you are not using those. so you are actually going out of distribution,
00:33:39.760 | and this is why. thank you.
00:33:42.000 | how much of the kv cache do you prefill with the chunk versus leave free? um, okay,
00:33:53.600 | if you use, for example, vllm, they use this thing called paged attention, so actually
00:33:58.000 | they allocate one entire page, which is actually a lot of tokens, so it's like another
00:34:03.840 | chunk, which is more than enough to generate the margin. so what are the next steps for this? well,
00:34:11.520 | the next step is, for sure, we are sending it to conferences, getting it published and presenting it
00:34:18.880 | around. but we are recently focused on long context modeling, and actually we are looking at, you know,
00:34:27.200 | how long context can be better leveraged. so we will be working in this
00:34:32.000 | field; actually, how to say, we will be researching a lot in this field.
00:34:51.600 | i think there's a question from amad how are the queries chosen for the query based summarization
00:35:01.120 | uh i think it's a classification right
00:35:03.680 | uh yes okay so the query is basically uh we work with a prompt that is made up of context plus
00:35:11.920 | query so we all always know what is the query that's the structure of the prompt that we work
00:35:17.200 | with. uh, what's the use case at writer that led you to this research? well, i am personally very
00:35:25.280 | interested in long context modeling and i am given the freedom to research what i like, and
00:35:31.600 | writer is also interested in long context modeling, so things intersect and here we are. and then,
00:35:37.680 | you know, we have many smart people working together, we did a few brainstorming sessions, and yeah.
00:35:44.160 | what's the latency you see for a typical request?
00:35:52.400 | well, you are delayed... there is no kind of latency increase because of this per se, you are just paying
00:36:00.000 | more to generate more tokens in the intermediate steps. of course, what would happen is that
00:36:07.520 | before, for example, you needed to process the entire prompt at once, so chunk prefilling,
00:36:14.800 | chunk prefilling, chunk prefilling; now you have chunk prefilling with some token generation, which
00:36:19.600 | will slow down the entire request. but you are actually getting something back, which is feedback,
00:36:25.120 | and you are getting the possibility to see what the model is actually seeing at each step,
00:36:29.360 | so you get a human in the loop. so the human is waiting, but
00:36:34.000 | is waiting, let's say, with some feedback, which is nice: it's nice to have progress bars, right?
00:36:38.480 | yeah, and maybe this is sensitive, but do you happen to have a demo showing how this actually looks
00:36:45.360 | in the user interface, or is it something that we have to sign up to writer to actually see?
00:36:49.120 | we don't have that but we are working on demos yeah
00:36:54.160 | here we have you know we have a concept on how it would look like
00:36:58.960 | yeah thank you
00:37:04.320 | so, okay, in some cases writing in the margins does not work as well as other methods. there are
00:37:15.200 | two factors. first of all, because each margin is kind of a summarization of what is
00:37:22.000 | present in the context, it depends highly on how good that model is at summarizing: the better
00:37:28.480 | the margin, the better the information it will extract and the better it can be leveraged. so,
00:37:33.280 | if you think about the student, if your note-taking skills are not so good, then probably
00:37:38.240 | your notes will not be useful. the second thing is actually the comparison here that you see with rag.
00:37:45.600 | for this rag baseline we actually put ourselves in the worst condition possible, which is "let's help rag beat
00:37:51.440 | us", but then actually rag doesn't beat us. usually in rag what you do is you have these
00:37:56.720 | chunks and you extract some vectors from these chunks, and then you match them, with the dot
00:38:03.120 | product or whatever, against the query. what we did with rag, actually, is we asked the language model...
00:38:09.760 | yes, it was charitable, because we asked the language model to see
00:38:16.480 | if that particular chunk is relevant. so actually you have a 70-billion-parameter model telling you if that
00:38:23.040 | chunk is relevant, compared to extracting some vector and matching it with a dot product. i mean, we helped rag a
00:38:29.680 | lot. so actually, compared to a naive rag approach, we would do much better.
00:38:36.080 | thanks umar
00:38:43.280 | do we have anyone else who has questions, or wants to come on screen to ask umar and sam
00:39:02.000 | more questions?
00:39:03.200 | if there's nobody else that is interested: i actually am having a little bit of trouble
00:39:15.680 | wrapping my mind around why chunked prefilling is so much more efficient. i looked at the
00:39:24.480 | reference and i kind of get the idea, but maybe you can help me understand
00:39:30.960 | the intuition. okay, first of all, chunked prefill doesn't exist because it's
00:39:36.880 | more efficient; it exists because we need it, we must do it. so when you prefill a chunk into the
00:39:43.440 | language model, let me show you, actually here we have the kv cache representation, right. so when you
00:39:48.080 | prefill a chunk into the language model, let's say this one, chunk number one, c1, you are generating
00:39:54.080 | a quadratic matrix: as you can see, if you have four tokens you are generating a four-by-four
00:40:00.720 | matrix, which is prohibitive to generate for very long prompts. like, imagine you have 1
00:40:07.840 | million tokens: that's a 1-million-by-1-million matrix, where each of these values is actually
00:40:12.720 | a dot product of two vectors, and the computation cost of that would really be very slow.
00:40:20.880 | the gpus are really good at parallelizing: in this case, when you have a lot of operations,
00:40:25.120 | they will actually be parallelized, but anyway, the problem is actually the memory of this
00:40:30.000 | prefilling, because when you generate it, it's really huge and it doesn't fit. yeah, so
00:40:35.040 | we are forced to do this chunking. then, okay, we do this chunked prefilling, so chunk one,
00:40:41.040 | but we are not leveraging these chunks, because we are forced to do it, right? so it's
00:40:47.520 | slower than just doing it in one pass, but since we are already forced to do it, why not use them?
00:40:54.160 | yeah, no, i definitely... i think i got most of the paper, just the chunked prefilling part,
00:41:00.880 | the background. if you want more information i can give you some references: one is the vllm
00:41:08.800 | page; it's an experimental implementation in vllm now.
00:41:13.760 | there was an nvidia explanation on chunked prefilling, so i will send a link later:
00:41:21.520 | nvidia recently published an article about chunked prefilling. but basically, prefilling is the most
00:41:27.520 | expensive part of working with long prompts for language models, that's why they need it.
00:41:34.080 | but what i guess i was having trouble understanding is why that is. is it just because it's quadratic,
00:41:39.440 | and so you have to break it up into chunks? is that... maybe i can take a stab at it.
00:41:44.800 | so you can imagine... let's look at the attention mask here. here i'm generating the first margin.
00:41:50.320 | what am i doing here? i already have seven tokens in the kv cache and i am generating
00:41:58.720 | the eighth token, so i am doing seven dot products. so token generation, which means generating one
00:42:04.960 | token using whatever is in the kv cache, is linear with respect to whatever is inside the
00:42:08.800 | kv cache. prefilling the kv cache is quadratic, and mostly because it's quadratic it's very expensive.
00:42:14.000 | so we are comparing something that is linear with something that is quadratic. and if you think about prompts,
00:42:19.680 | long context: if you are working with a two-million context window, one million nine hundred ninety-
00:42:26.720 | nine thousand tokens will be prompt; nobody will ever generate more than, let's say, five thousand tokens.
00:42:32.240 | so yeah, the most expensive part is actually the prefilling.
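A rough count of attention-score computations for that 2-million-token scenario, just to make the linear-versus-quadratic gap concrete:

```python
n_prompt = 1_995_000   # prompt tokens to prefill (almost the whole 2M window)
n_generated = 5_000    # tokens generated afterwards

# Causal prefill computes roughly n^2 / 2 query-key dot products in total.
prefill_scores = n_prompt * (n_prompt + 1) // 2

# Each generated token attends to everything already cached: roughly n dot products per token.
decode_scores = sum(n_prompt + i for i in range(n_generated))

print(f"prefill: {prefill_scores:.3e} dot products")  # ~2.0e+12
print(f"decode : {decode_scores:.3e} dot products")   # ~1.0e+10
```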
00:42:40.240 | so, just to get an intuition here: the other extreme is that your chunk size is one token, right? so what's the trade-off?
00:42:46.240 | so what they do is basically they try to make it as big as possible until it fits in the
00:42:53.200 | gpu. so i think good numbers are like four thousand tokens or eight thousand
00:43:01.760 | tokens, or something in this range. and as i said before, usually the token generation is memory
00:43:09.760 | bound, meaning that the limitation is only given by how much kv cache your memory can
00:43:14.160 | hold, while prefilling is compute bound. so to maximize the gpu utilization,
00:43:20.000 | whenever you work with openai or cohere, they just overlap your new request with other people's
00:43:26.400 | old requests, so while they are generating tokens they are also prefilling, so the gpu is utilized
00:43:31.280 | 100%. okay, yeah, no, that's helpful, and if you do send those links i'll definitely read them.
00:43:39.040 | thank you. is there a break-even point where, at a certain context length, it becomes
00:43:46.880 | more valuable to do writing in the margins than just using the llm by itself, like
00:43:54.160 | where the compute equals out? or is writing in the margins always better? i believe... okay,
00:44:02.480 | writing in the margins is like you read a book and you have margins, versus just reading the book. i
00:44:08.640 | think it's always convenient to read the book with the margins, because you are actually paying the
00:44:12.720 | price for that, right? it's not something that comes for free, you are actually paying the
00:44:17.120 | cost of generating these margins, so you're actually putting in some effort, and then you leverage it.
00:44:22.640 | so what we could ask is: okay, does it always help? so far, yes. so it's not like something that
00:44:31.520 | you get for free, right? so you pay, and is it worth it? so far, yes. and is it always worth it?
00:44:41.120 | so far, from our data, it's always worth it. like, even if you're literally
00:44:48.000 | just talking about a chunk, if you have, like, a sentence instead of a book? oh, really?
00:44:53.760 | yeah. then it's not convenient, because in that case you are not even doing chunked prefilling, right?
00:45:00.560 | because if you have only a small context, then it's just prefilled at once. but when you have... yeah, yeah,
00:45:06.560 | makes sense. once you start growing, i think the topic still stands, right? like, let's say you've
00:45:13.280 | got a paragraph and your chunks are sentences, or you've got a page, right, so you've got a thousand
00:45:17.600 | tokens and your chunks are, per se, "is every sentence relevant". so some form of highlighting: you know,
00:45:22.720 | you have one page, but for every sentence you highlight what's relevant or not. at that level
00:45:28.080 | it's kind of negligible to throw the whole thing in a prompt, versus "do i want to highlight
00:45:33.840 | seven of the 40 sentences and do this approach". i think that's the other non-extreme, right?
00:45:41.520 | yeah, so basically, whenever the context can just be prefilled in the kv cache without any
00:45:48.480 | chunked prefilling, i believe it's not worth it to use it, but if it's long, then it helps.
00:45:55.920 | and it helps much more than chunking separately, like we do with the apis, right,
00:46:01.680 | with langchain, because there you are paying twice the cost of prefilling; in this case we are
00:46:07.120 | only paying once. and we also prove in the ablation studies that actually it's always convenient to
00:46:13.440 | send the context plus the margins, never just the margins. so you can see here, this ablation, context
00:46:20.400 | compression: if you only send the margins, or only the context, it is always worse than
00:46:25.360 | providing them both.
00:46:28.480 | context being the whole right the entire book
00:46:42.480 | and so, building off that ablation: let's say you've got a model that doesn't have the context
00:46:50.560 | window, and you have to do some sort of chunking or splitting. let's say you have a total context
00:46:55.280 | of 8,000 tokens and you have a million-token document. there are approaches where, if you know how,
00:47:00.560 | you can process chunk by chunk and then combine. so what you would expect is
00:47:06.640 | you could, for each chunk, do this writing in the margins approach at every level and then scale
00:47:11.920 | that down with however many steps you need. is that intuition still pretty accurate?
00:47:18.320 | i believe with an 8,000, how to say, context window,
00:47:25.280 | i believe that the latency would be higher, right, because at each step you are adding
00:47:31.840 | more. yes, at that kind of latency it's even more convenient to just do independent chunking and
00:47:39.040 | generation in that kind of range, because you can always split the context into chunks and
00:47:45.760 | then you just send multiple requests, and you can pay the price of compute in that kind of range.
00:47:50.960 | but when you are talking about 64,000 tokens, that starts making more sense to use this approach.
00:47:57.200 | so for that level, i think the traditional approaches work fine.
00:48:08.880 | any other questions? uh, well, i think another question that came up was: why now, right?
00:48:16.480 | i mean, why did nobody think about this before? because actually we didn't have long context,
00:48:22.880 | very long context, models before, and we were not forced to even do chunked prefilling. as
00:48:30.480 | you can see from vllm, they have this feature as an experimental feature right now in vllm. so it's
00:48:38.400 | because right now we need this chunked prefilling and everyone is doing it, that's why we have
00:48:43.840 | this. so, you know, innovation always starts from some problem that you face and some
00:48:49.120 | need that you have. so right now we have this need and we have the capability, and that's how we
00:48:55.360 | came up with this. awesome. we have a few more minutes, if anyone else has last-minute questions,
00:49:11.360 | um please feel free to ask and big shout out thanks to the writer team for presenting
00:49:20.080 | thank you guys for listening. you are welcome to send us your questions. we
00:49:28.960 | have a github repository; i suggest looking at the code, it's really, you know, very well
00:49:35.200 | commented and it follows the same kind of pattern that we shared in the paper, so it's
00:49:40.560 | easily understandable for everyone. and we did a lot of nice tricks, you know, like one of the
00:49:45.280 | tricks is that you can always delete stuff from the end of the kv cache, and you also
00:49:49.440 | know why now. it's an interesting project. if anyone's interested, on fridays we have a similar
00:49:57.840 | ai in action session where we try to take a practical angle outside of the papers. there's the code up, there's the
00:50:03.360 | paper up; if anyone wants to run it and present it and share their learnings, it'll be a really good
00:50:07.680 | learning exercise, but that's always there. um, i guess we got a question from jimmy: any future
00:50:14.960 | work in this direction any future work in this direction well for sure we will keep working on
00:50:21.920 | long context and how we can better leverage long context so there is another kind of problem with
00:50:27.600 | long context, which is: how well language models use long context actually depends highly
00:50:34.160 | on the attention mechanism and how the softmax works. we have seen, with the paper on
00:50:39.360 | attention sinks, that the language model allocates a lot of... because when you do an
00:50:46.720 | attention mechanism you are doing a weighted sum over the tokens, and each token is given a weight,
00:50:52.480 | and we see that most of the weight is given to the first few tokens. so there is a lot of research in
00:50:58.560 | this area; recently, a few days ago, another paper came out called sigmoid attention, which
00:51:03.840 | is also studying, you know, the distribution of these logits. so i think the attention
00:51:10.400 | mechanism will play a big part in how we can extend the long context so if we can also fix
00:51:16.880 | this part here so i am very interested you know in the kvcache and optimizing long context modeling so
00:51:23.040 | we are we are working in this direction because it's it's needed by the market and also it's i
00:51:31.200 | like it and i i think it's cool to be able to analyze an entire book or an entire codebase
00:51:36.720 | instead of hoping that the rag finds the right one
00:51:40.720 | awesome well uh big shout out to writer team always great to have you guys present
00:51:49.840 | um sam is in discord as well i'm sure he'll relay questions and stuff we've got the recording we'll
00:51:56.080 | share it with your team i don't know what you choose to do with it but um next week we've got
00:52:01.920 | swyx. he'll be presenting some of the strawberry, q-star, quiet-star, all those papers, so he'll be
00:52:08.240 | doing that next week. and then the following week, if anyone's interested in anything, volunteers are
00:52:12.640 | always open. i posted a few papers in paper club; i think there's also the mistral stuff, so if anyone
00:52:18.800 | wants to lead, pop in there. otherwise, next week swyx is doing strawberry and star stuff, so
00:52:25.920 | that's on the agenda
00:52:27.760 | cool thank you guys thanks everyone take care
00:52:35.840 | thank you
00:52:38.420 | yes i was just about to end the meeting yes
00:52:45.840 | question: is there any way you can copy over the comments?
00:52:52.000 | um, i'm trying to do it but it seems to, like, lazy load as you scroll up and down, it's
00:52:58.800 | really painful. let me see if i can extract these comments.
00:53:03.520 | do you know if they normally get saved as a zoom recording i'm using you can click save chat where
00:53:14.880 | you save chat anyone want to help me out chat there's i was able to copy the comments without
00:53:21.600 | any problem uh the file is usually saved on the host computer in a folder like documents zoom
00:53:27.760 | meeting date and time if you're on windows
00:53:29.840 | see there's a chat log file that as long as okay i got it i just i just saved the chat i'll throw
00:53:38.480 | it in discord i have a text file of it yes slides would be great too if we can grab them uh yeah
00:53:44.560 | get it all i'll pick them i'll pick them right now perfect thanks guys sweet all right thanks
00:53:50.800 | But, yeah.