[Paper Club] Writing in the Margins: Chunked Prefill KV Caching for Long Context Retrieval
00:00:00.000 |
it's recorded but tl;dr probably just shared internal so sam we'll we'll get you recording 00:00:06.240 |
after cool i'm i'm gonna pass off to you guys whenever you want to start i'm sure people 00:00:16.080 |
trickle in so from writer side i would be presenting i don't know if you guys can hear me 00:00:23.840 |
yep all right uh there should be also a sim i don't know if he's also present but i will 00:00:30.480 |
kick off the presentation let me know whenever you everyone is ready and i will start 00:00:56.240 |
okay i will start by sharing my screen meanwhile let's see if it works 00:01:07.120 |
desktop 2 and let me know if you can see my screen 00:01:13.840 |
yep we can see your slides all right perfect so tonight uh can i start right everyone 00:01:27.600 |
yeah you should be good it's recording people might trickle in every now and then but 00:01:32.800 |
All right, so tonight I will be presenting a paper that came out from Writer: Writing in the Margins, 00:01:40.240 |
a better inference pattern for long context retrieval. 00:01:44.720 |
It's a paper about long context and how we can leverage the KV cache to make 00:01:53.040 |
long context modeling more effective. I will give a lightly technical presentation first, 00:02:02.800 |
and later we can deep dive into the details of the paper. So let me open the 00:02:08.800 |
chat so i can see what everyone is writing meanwhile i'm while i'm talking oh yeah i think 00:02:16.000 |
i think you can just ignore the chat people are gonna be it's gonna be really buzzy and 00:02:20.560 |
there's gonna be a lot of people like oh wow that's so cool everything yeah vibhu and i will 00:02:24.800 |
take care of the chat for you perfect something super crazy pops up we'll let you know otherwise 00:02:29.360 |
we'll we'll take care of it all right perfect all right so i will skip the part of what is 00:02:34.320 |
a language model, but in short: a language model is a probabilistic model that leverages the prompt 00:02:39.040 |
to predict the next token, and we generate text with a language model 00:02:42.560 |
iteratively, one token at a time. Most language models nowadays are based on the 00:02:49.360 |
transformer architecture, and in a transformer, whenever we have a prompt, the first 00:02:55.120 |
thing we do is put this prompt into the memory of the language model, which is known as 00:02:59.760 |
the KV cache: the keys and values in the transformer layers. The 00:03:07.360 |
operation of creating this initial KV cache is known as prefilling. So the first thing that the 00:03:13.680 |
inference engine, which could be vLLM or any other inference 00:03:18.160 |
framework you're using, does with your prompt is this prefilling. 00:03:25.200 |
And, if you're interested, prefilling is actually one of the most expensive 00:03:29.040 |
parts of processing a prompt, because it has a quadratic cost with respect 00:03:33.840 |
to compute as well as with respect to memory. So imagine we have a prompt 00:03:39.360 |
that says "Austin is a city in". The first thing 00:03:45.600 |
we do to generate text from this prompt is prefill it into the language model's KV cache, 00:03:51.280 |
and then we leverage it to generate tokens, one token at a time. The KV cache is a kind 00:03:58.160 |
of memory in the language model, or really in any transformer model that is autoregressive: 00:04:04.160 |
if it contains some tokens, the language model will leverage them; if it doesn't contain 00:04:10.160 |
those tokens, the language model cannot leverage them. The language model only sees 00:04:14.480 |
what is inside the KV cache. So what happens with our prompt "Austin is a city in"? The first thing 00:04:19.280 |
we do is the prefilling, which puts these tokens into the KV cache, one for 00:04:25.600 |
each layer of the transformer. Then we ask the language 00:04:31.120 |
model for the next token; suppose the next token is the word "Texas". We take this token "Texas", 00:04:37.200 |
we keep it in the KV cache so that the language model can leverage it to generate the next token, 00:04:42.080 |
and suppose the next token is a comma, and so on, until we generate the entire response. 00:04:47.680 |
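A minimal sketch of the prefill-then-decode loop just described, using Hugging Face transformers with a small placeholder model ("gpt2" here only for illustration); this is not the paper's code, just the standard KV-cache usage pattern:

```python
# Illustrative only: prefill the prompt once, then decode one token at a time,
# reusing the KV cache. "gpt2" is a placeholder model for the sketch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tok("Austin is a city in", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: one forward pass over the whole prompt builds the KV cache.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(-1, keepdim=True)

    generated = [next_id]
    for _ in range(10):
        # Decode: each step feeds only the newest token and reuses the cache.
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```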
Now, we have seen that in the past few years prompts have become longer and 00:04:53.840 |
longer: we started with 2,000-token context windows, then 4,000, 8,000, 32k, 64k, 100k, and now we have reached 00:05:01.840 |
millions of tokens. This means you can send an entire book to the language model and 00:05:08.880 |
ask it questions about that book. However, with great prompts comes 00:05:14.400 |
great responsibility, and the reason is the following. Imagine you have a very long 00:05:21.520 |
prompt: suppose you have a book and you want the language model to answer questions about this 00:05:26.720 |
book, and suppose the book is around 1 million tokens long. If you try to prefill this 00:05:34.400 |
1-million-token prompt into the KV cache, the language model will actually not be able to do 00:05:42.320 |
that, because, as I said before, prefilling is one of the most expensive operations we do 00:05:48.080 |
when running inference with language models. Why? Because it's quadratic with respect to the 00:05:53.360 |
sequence length, in terms of memory and also in terms of compute. So language models 00:05:58.560 |
cannot prefill the entire prompt into the KV cache in one single pass. What they do instead is 00:06:05.200 |
they do it in chunks. This is called chunked prefill, and it's an experimental feature that has 00:06:10.240 |
recently been introduced in vLLM, but it's probably used in more sophisticated inference engines at 00:06:16.800 |
major companies. With chunked prefill, we basically split the prompt of the 00:06:27.200 |
user into multiple chunks and we prefill each chunk step by step. Suppose the user sent 00:06:33.440 |
an entire book made up of 10 chapters. The chunks are usually of a fixed size, 00:06:39.920 |
which means a chunk has no particular contextual meaning: it may not be the first 00:06:44.720 |
chapter of the book or the second chapter of the book, it could just be the first 00:06:49.040 |
4,000 tokens of the prompt, then the next 4,000 tokens of the prompt. So we prefill the first 00:06:55.280 |
chunk into the KV cache, then the second chunk, so now the KV cache 00:07:00.480 |
contains the first chunk and the second chunk, then the third chunk, and so on, until the whole prompt is 00:07:05.680 |
inside the KV cache, which can then be leveraged to generate tokens (see the sketch below). 00:07:12.720 |
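A hedged sketch of chunked prefill, continuing the snippet above (it reuses `model` and `tok`, with `prompt_ids` now standing in for a much longer prompt); the chunk size is illustrative, and real engines like vLLM manage this with paged attention rather than a plain loop:

```python
# Illustrative only: split the prompt into fixed-size chunks and prefill each
# chunk on top of the same KV cache, instead of one huge forward pass.
import torch

chunk_size = 4096  # illustrative; engines pick this to fit GPU memory

past = None
with torch.no_grad():
    for start in range(0, prompt_ids.shape[1], chunk_size):
        chunk = prompt_ids[:, start:start + chunk_size]
        out = model(chunk, past_key_values=past, use_cache=True)
        past = out.past_key_values  # the cache now covers every chunk seen so far
# After the loop the full prompt is in the cache and decoding can start as above.
```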
Intuitively, this is very similar to how a student would read a book. Imagine a student is given a book to read 00:07:18.640 |
and then has to answer a question about this book. The student 00:07:25.760 |
would read the first chapter, and now the brain of the student contains information about 00:07:29.920 |
only the first chapter. Then the student reads the second chapter, and now the brain of 00:07:33.840 |
the student contains information about the first chapter and the second chapter of the book, and 00:07:37.680 |
so on, until the student reads the last chapter; now the brain of the student contains 00:07:42.080 |
information about all the chapters. Then the student is given the 00:07:49.520 |
question and has to leverage the information read from the book to 00:07:55.520 |
answer it. However, the student would struggle to do this. Why? Because intuitively, when 00:08:00.960 |
you read a book, by the moment you start reading the second chapter you may already be forgetting 00:08:06.240 |
what the first chapter was about. So what would be a better strategy for this student? 00:08:14.640 |
Well, the student could take some annotations while reading the chapters, 00:08:21.040 |
and this is what we do with writing in the margins. Because for a very long prompt we are 00:08:27.760 |
already forced to split the prompt into multiple chunks to do this chunked prefill, why not 00:08:35.520 |
leverage the partially prefilled KV cache to generate some annotations that can then be 00:08:42.080 |
leveraged to improve the model's ability to extract information from this prompt? 00:08:47.120 |
So basically, from a technical point of view, writing in the margins works as follows. 00:08:54.160 |
We have this very large prompt and we split it into chunks, because we are forced to split 00:08:58.800 |
it into chunks: we cannot prefill the entire context into the KV cache. We prefill the first chunk, and 00:09:07.360 |
then, after the first chunk, we add a prompt that tells the model: okay, use the text above to 00:09:13.280 |
extract information about the query, where the query is the question we want to get an 00:09:18.800 |
answer to. So now the KV cache contains the first chunk and this prompt, and we leverage it to 00:09:25.120 |
generate a few tokens, which are this margin annotation. Then, and this is another trick, you 00:09:31.200 |
can delete stuff from the KV cache, but only from the end. Why can you do that? Because the language model 00:09:38.720 |
is an autoregressive model, which means that every token depends on all 00:09:45.040 |
past tokens; so you can delete stuff from the end and regenerate it eventually, but you 00:09:51.920 |
cannot, of course, delete stuff from the beginning or from the middle, because it would invalidate 00:09:56.960 |
all the future tokens. You can always remove stuff from the end. So what we do is we prefill 00:10:01.280 |
the first part of the prompt, we prefill this extractive instruction, we generate a few tokens, which are the 00:10:08.640 |
margin annotation, then we can delete this margin and the instruction we added, and then 00:10:14.720 |
we prefill the second chunk, we append another extractive prompt, we generate a few more 00:10:23.120 |
tokens, which are the second margin, so the second margin will depend on the first and the second 00:10:27.680 |
chunk. Then we delete the second margin's tokens and the extractive prompt, and so on, until we have processed the whole prompt. A rough sketch of this loop is shown below. 00:10:34.720 |
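A rough sketch of that loop under stated assumptions: it continues the snippets above (reusing `model`, `tok`, and a long `prompt_ids`), uses a made-up extractive prompt, greedy decoding, and a cache truncation that slices the legacy (key, value) tuple format; the official repository's implementation differs in the details.

```python
# Illustrative only: Writing in the Margins as a chunked-prefill loop.
# prefill chunk -> append extractive prompt -> generate a margin ->
# roll the cache back to the end of the chunk -> prefill the next chunk.
import torch

query = "How many employees were hired in 2006?"   # example query from the talk
chunk_size, margin_len = 4096, 64                   # illustrative sizes

def cache_len(past):
    # Keys/values have shape (batch, heads, seq_len, head_dim).
    return past[0][0].shape[2]

def crop_cache(past, keep_len):
    # An autoregressive cache can only be truncated from the end.
    # Assumes the legacy tuple format; newer transformers Cache objects
    # expose their own crop() helper instead.
    return tuple((k[:, :, :keep_len, :], v[:, :, :keep_len, :]) for k, v in past)

@torch.no_grad()
def greedy_generate(model, past, input_ids, max_new_tokens):
    out = model(input_ids, past_key_values=past, use_cache=True)
    past, next_id = out.past_key_values, out.logits[:, -1].argmax(-1, keepdim=True)
    ids = [next_id]
    for _ in range(max_new_tokens - 1):
        out = model(next_id, past_key_values=past, use_cache=True)
        past, next_id = out.past_key_values, out.logits[:, -1].argmax(-1, keepdim=True)
        ids.append(next_id)
    return past, torch.cat(ids, dim=-1)

extract_ids = tok(
    "\nUse the text above to extract information relevant to the query: " + query,
    return_tensors="pt").input_ids

past, margins = None, []
with torch.no_grad():
    for start in range(0, prompt_ids.shape[1], chunk_size):
        chunk = prompt_ids[:, start:start + chunk_size]
        past = model(chunk, past_key_values=past, use_cache=True).past_key_values
        checkpoint = cache_len(past)                # end of the prefilled context
        past, margin_ids = greedy_generate(model, past, extract_ids, margin_len)
        margins.append(tok.decode(margin_ids[0], skip_special_tokens=True))
        past = crop_cache(past, checkpoint)         # drop the prompt + margin tokens
# `margins` now holds one annotation per chunk; the context stays in the cache.
```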
This is a visualization that we have also put in the paper. 00:10:42.640 |
Basically, imagine you are given a very large prompt. What we do is prefill the first 00:10:48.080 |
part into the KV cache and then extract information about what is present in the KV cache, 00:10:54.160 |
and we call that the first margin. We can then also classify these margins, and in the paper we also 00:11:02.080 |
show that the computation 00:11:08.000 |
of a margin and the classification of the previous margin can be overlapped inside the same request 00:11:12.560 |
in the batch, so you don't need to add another request to the batch; but this is just a little 00:11:17.200 |
KV cache trick for optimizing the inference. So what do we do with these margins? Basically, we 00:11:23.360 |
append them at the end, right before asking the question to the language model. Our goal is: 00:11:29.360 |
we have this very big context and then we have some question. Instead of just asking 00:11:34.400 |
the question, where the model may not be able to find the answer, we generate these margins and append 00:11:38.960 |
them right before asking the question, and then we ask the model the question. Now the model 00:11:43.200 |
can also leverage these margins, which are present right before the question, to better answer the 00:11:51.040 |
question. Why will these margins be leveraged by the language model? First 00:11:58.560 |
of all, because the instruction we use to extract them is an extractive 00:12:06.560 |
summary: we ask the language model to extract information using the prefilled KV cache 00:12:14.560 |
while knowing what the query is, so the margins are relevant to answering that particular query. 00:12:22.480 |
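Continuing the sketch above, this is where the margins come back in: after the last chunk, the kept margins and the question are appended to the same cache and the answer is decoded, with no re-prefill of the context. The template below is an assumption for illustration, not the paper's exact wording.

```python
# Illustrative only: append margins (ideally only the relevant ones) plus the
# question on top of the already-prefilled context, then decode the answer.
tail_text = ("\n\nMargin notes:\n" + "\n".join(margins)
             + "\n\nQuestion: " + query + "\nAnswer:")
tail_ids = tok(tail_text, return_tensors="pt").input_ids

past, answer_ids = greedy_generate(model, past, tail_ids, max_new_tokens=64)
print(tok.decode(answer_ids[0], skip_special_tokens=True))
```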
Why do we add these margins at the end, right before asking the question? Because a few months ago 00:12:28.960 |
there was a paper called Lost in the Middle. Basically it says, and it's actually true, 00:12:34.400 |
that if the relevant information we're trying to extract from 00:12:39.200 |
this prompt, which could be very large, is present either at the beginning or at the end of the 00:12:45.520 |
prompt, then the language model will very likely be able to 00:12:50.240 |
find it. However, if this information is present somewhere in the middle, then the 00:12:54.720 |
language model will be less likely to find it. That's why we add the margins at the end: 00:13:01.760 |
it improves the language model's ability to leverage this information. But does it even work? 00:13:08.880 |
So yes, we have proof: we show in the paper a comparison across pre-trained language models. 00:13:15.520 |
First of all, we are not fine-tuning any language model, we are not changing anything. 00:13:20.480 |
This is just a different way of utilizing something that is already being done, which 00:13:25.440 |
is the chunked prefill of the KV cache, to improve a language model's ability to leverage long context. 00:13:31.680 |
So it can be used with any transformer model without fine-tuning it, 00:13:35.920 |
just by doing this inference differently: don't just blindly prefill 00:13:42.240 |
the KV cache, but prefill it chunk by chunk, because you are forced to anyway, and then leverage it to extract 00:13:47.040 |
these margins, and then leverage all the extracted margins at the end. 00:13:55.440 |
I think someone had mic on I just made it accidental yeah go ahead 00:14:04.800 |
So yeah, in the paper we show that it helps pre-trained language 00:14:11.360 |
models, without any fine-tuning, to better utilize long context, and we provide a few benchmarks. 00:14:19.920 |
As you can see, smaller models, for example, show a bigger improvement. Here we have, for 00:14:28.240 |
example, "long context", the generic pattern we use for long context: just the 00:14:33.440 |
context and the question, whatever the benchmark is. Then we have "RAG", which basically means that, 00:14:40.640 |
instead of giving the entire context plus the margins and then the question, we are 00:14:49.840 |
treating each of the chunks separately, asking the language model which of them are 00:14:56.720 |
relevant, and then only providing the relevant ones, which is what we would do in RAG, 00:15:03.040 |
and then asking the language model to leverage only the relevant ones at the end. And then there is the writing 00:15:11.120 |
in the margins approach, which is all the context, plus the margins that were extracted during the 00:15:17.920 |
chunked prefill, and then the question at the end. Now, how is this different from a prompting strategy? 00:15:26.480 |
Because you may be thinking: okay, but we can already take a very large prompt, split it into 00:15:32.080 |
chunks, use the language model to summarize each of these chunks independently, 00:15:38.800 |
and then send it all to the language model again to answer the question. Well, it's a matter of 00:15:45.120 |
cost, so let's talk about cost. Imagine we have this very big book made up of 10 chapters, 00:15:53.920 |
and then we have a question at the end that we want to get an answer to, which is: what is the 00:15:58.480 |
answer to life, the universe and everything? Now, 00:16:04.160 |
if you don't use writing in the margins, what you would do is feed the 00:16:10.640 |
first chapter of the book and extract a margin from it (actually, let me show you this 00:16:15.120 |
other slide), then take the second chapter of the book and extract a margin from that, 00:16:20.240 |
then the third chapter and extract a margin from that, and so on. Suppose each 00:16:24.560 |
chapter is 100,000 tokens; then the cost to generate the margins, assuming the margins 00:16:30.800 |
themselves are really small, is more or less 1 million tokens, because 100,000 tokens multiplied by 00:16:35.680 |
10 chapters is around 1 million tokens. But then you also need to send the entire book, plus 00:16:41.280 |
these extracted margins, to the language model again to generate the answer, and that 00:16:48.400 |
would cost you another million, because the model has to redo the prefilling 00:16:52.560 |
of 1 million tokens. So it would cost you 2 million tokens. But with writing in the margins, 00:16:58.400 |
for a 1-million-token prompt it would cost you more or less 1 million tokens, because you don't 00:17:03.440 |
have to re-prefill the entire context: you have already prefilled it, so 00:17:10.640 |
the KV cache is growing and you're extracting some information along the way, and you don't have to 00:17:15.600 |
repopulate it, which is what you would do if you treat each chunk independently and 00:17:24.560 |
do the kind of chunking that we commonly do. 00:17:29.840 |
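A back-of-the-envelope version of that cost argument, with the chapter and margin sizes assumed as in the example above:

```python
# Illustrative arithmetic only: prefilled tokens for a 10-chapter, 1M-token book.
chapters, tokens_per_chapter = 10, 100_000
book = chapters * tokens_per_chapter            # 1,000,000 tokens

# Independent chunking: prefill each chapter once to summarize it, then
# re-prefill the whole book (plus summaries) to produce the final answer.
independent_chunking = book + book              # ~2,000,000 prefilled tokens

# Writing in the margins: one incremental prefill; margins reuse the cache,
# so only the (small) margin generations are added on top.
writing_in_the_margins = book                   # ~1,000,000 prefilled tokens

print(independent_chunking, writing_in_the_margins)
```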
So what are the advantages? Well, it's compatible with any transformer model, 00:17:35.600 |
and there are other advantages that I want to show you in a video that I made, so let me show this; 00:17:42.320 |
it's on my LinkedIn post. I don't know if you can see my screen again now? Yeah, 00:17:48.880 |
looks good, we see the LinkedIn video. Perfect. Okay, so we have a large document, suppose 1 00:17:54.560 |
million tokens, and then we have a question like: how many employees were hired in 2006? 00:17:58.560 |
Now, when we have a very large prompt, it must be prefilled into the KV cache in chunks, and this 00:18:06.720 |
operation is called chunked prefill. What happens with chunked prefill is that you 00:18:12.720 |
take the first chunk and prefill it into the KV cache. Then what we do is add to the KV cache 00:18:20.160 |
this extractive summary prompt, which is, for example: use the text above to extract information 00:18:24.800 |
about the following query: how many employees were hired in 2006? Then we generate tokens 00:18:30.960 |
using whatever is inside the KV cache, which is the first chunk plus this 00:18:36.160 |
extractive summary prompt, and suppose we generate the few tokens that are visible now. 00:18:42.800 |
We take these tokens and save them, so we decode whatever is generated using the tokenizer 00:18:50.320 |
and save it, and then we remove it from the KV cache. Now, from an 00:18:56.880 |
implementation point of view, you don't actually remove anything from the KV cache: 00:19:01.840 |
usually the KV cache allocation is static, and even in vLLM it's done using 00:19:07.680 |
the so-called paged attention, so you actually allocate pages of KV cache. 00:19:13.120 |
So basically, it's not like you are removing stuff from the KV cache, you are just resizing 00:19:22.240 |
the tensor, which is actually an O(1) operation, so it doesn't have additional cost: you just 00:19:27.920 |
keep track of how many tokens are in use. 00:19:33.600 |
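A small sketch of that "deletion is just a pointer move" point, under the assumption of a pre-allocated cache; this is a toy data structure, not vLLM's paged allocator:

```python
# Illustrative only: with a pre-allocated cache you track how many slots are in
# use; "deleting" the margin is resetting that counter, an O(1) bookkeeping step.
import torch

class PreallocatedKV:
    def __init__(self, capacity, heads, head_dim):
        self.k = torch.zeros(1, heads, capacity, head_dim)
        self.v = torch.zeros(1, heads, capacity, head_dim)
        self.used = 0                       # pointer: number of valid token slots

    def append(self, k_new, v_new):
        n = k_new.shape[2]
        self.k[:, :, self.used:self.used + n] = k_new
        self.v[:, :, self.used:self.used + n] = v_new
        self.used += n

    def rollback_to(self, checkpoint):
        self.used = checkpoint              # no data is moved or freed
```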
So anyway, we take this margin, we save it somewhere, and then we prefill the second chunk into the KV cache, 00:19:36.320 |
and then we again add another extractive summary prompt and leverage it to generate the 00:19:45.040 |
second margin, which will depend on the first and the second chunk of the prompt, 00:19:51.920 |
and so on for all the chunks, so we end up with a list of margins. 00:19:56.320 |
Then we can also classify these margins, because some of 00:20:06.240 |
the margins may be the model hallucinating, or the model just saying "I cannot find this 00:20:12.160 |
information", etc. So we classify them: we can either use an auxiliary classifier or we 00:20:17.440 |
can use the model itself to classify them, and we show in the paper that you don't actually have to 00:20:22.640 |
create a new request to the model to classify these margins; you can overlap the 00:20:27.040 |
generation of a margin with the classification of the previous margin using the same request 00:20:31.760 |
in the batch. Does the margin pertain to all previous chunks or only the current one? 00:20:38.240 |
All the chunks up to that margin: so the first margin only the first chunk, the second margin 00:20:44.080 |
chunks one and two, the third margin chunks one, two and three, etc., because we want to leverage 00:20:49.280 |
the prefilled KV cache. Okay, so we classify these margins, we append them at the end, 00:21:10.160 |
and then we generate the final answer. 00:21:12.080 |
So the advantage is that, as I said before, we exploit the chunked prefill of a large prompt to 00:21:21.440 |
generate intermediate summaries. We are exploiting something that we are already forced 00:21:26.640 |
to do but are not leveraging right now, so it kind of comes for free, with just a minor 00:21:34.160 |
compute cost, because you have the cost of generating these margins, but you avoid the 00:21:41.520 |
bigger cost of prefilling. If you just use chunking techniques, like you can do with 00:21:47.360 |
LangChain for example, you pay twice the cost of prefilling; with this system you don't pay 00:21:54.480 |
twice the cost of prefilling, which is very expensive. To give you an insight into how 00:22:00.800 |
expensive prefilling is: whenever you work with OpenAI or Cohere or 00:22:10.320 |
any other provider, whenever you send a request, your request is always overlapped with the token 00:22:17.120 |
generation of other requests. The first time your prompt arrives at their server, they overlap 00:22:21.840 |
the prefill of your request with the token generation of others, because prefilling 00:22:26.960 |
is compute bound, since it's very expensive computationally, while token generation is 00:22:32.800 |
memory bound. So to always utilize the GPU fully, they schedule together 00:22:40.400 |
one prefill with multiple token generations. So, it's compatible with any off-the-shelf 00:22:46.880 |
language model without any fine-tuning, and we show some benchmarks in the paper. 00:22:54.000 |
It improves the ability of any language model to extract relevant information, so it addresses the 00:22:58.080 |
lost-in-the-middle problem. Another cool thing you can do is this: 00:23:04.640 |
because you now generate these margins while prefilling the prompt, you can also feed 00:23:13.120 |
these margins back to the user, and the user can classify them for you, like a thumbs up or thumbs down. 00:23:17.840 |
So it adds a human in the loop, and the user can also visualize the progress of how the 00:23:23.600 |
prefilling is going. When you have a very, very large prompt, because the cost 00:23:29.200 |
of prefilling is quadratic, it becomes really expensive to prefill it, and 00:23:36.320 |
the user may have to wait many seconds. So you can give feedback to the user about how 00:23:42.080 |
much context has been processed, and you can leverage the waiting time of the 00:23:46.000 |
user to get thumbs up or thumbs down on these margins, which can improve the ability of 00:23:50.560 |
the language model to use them. The user can also exit early: if the user found the relevant 00:23:56.640 |
information in one of these margins, they can say "okay, stop inference", and they would not have 00:24:00.720 |
to pay for all of the context being processed. We also provide an implementation: 00:24:07.920 |
if you go to this URL, github.com/writer/writing in the margins, you can find 00:24:14.640 |
our implementation of how we actually do this with the KV cache. I don't know how to delete 00:24:20.080 |
this line, it's so annoying, let me check, I believe it's annotate and then delete, clear all drawings. 00:24:28.480 |
Okay, so if you go here you can see what we do. It's simple and it works 00:24:39.040 |
with any language model; here we provide a demo with Llama, Phi and Qwen. For 00:24:45.760 |
example, the code that is present here in the GitHub repository matches exactly the code 00:24:51.680 |
we present in the paper, which is the pseudocode you can see here: how we split into 00:24:58.000 |
segments, how we prefill into the KV cache, and how we delete stuff from the KV cache. It's all 00:25:03.520 |
present here. We also show the state of the KV cache at each 00:25:09.920 |
line of code so that the user can understand what is happening, and this is the code for 00:25:15.760 |
the method we use to delete stuff from the KV cache. All right, let me see if there is 00:25:22.480 |
something missing here. Yeah, here we provide a comparison of how it differs from 00:25:28.480 |
RAG and how it differs from just long-context processing. Questions? 00:25:34.960 |
All right, let me check the chat. But Eugene is 00:25:43.840 |
crushing it answering the questions in the chat, thanks Eugene. 00:25:47.840 |
My pleasure. Thank you for taking the time to share the paper with us and even preparing 00:25:54.080 |
slides. Yeah, okay, the slides were from another talk I gave at the company, so I'm 00:26:03.120 |
reusing stuff, but thank you, thank you everyone. So let me go through the questions 00:26:11.280 |
and see if there is something in the chat that I can answer. Yes, we are not making 00:26:22.160 |
any change to the model architecture, so you don't have to fine-tune anything, you don't have to change 00:26:26.320 |
anything. Can you use this with something like LangChain? No, because it requires a 00:26:33.840 |
modification of how the inference engine is using the model: when you work at the 00:26:40.560 |
KV cache level, you cannot just work with the APIs and tell them to remove stuff from the 00:26:46.000 |
KV cache or overlap stuff in the KV cache. But it doesn't require changes to the 00:26:51.760 |
weights of the model, that's why we talk about no fine-tuning. Is the extractive summary prompt 00:26:59.520 |
just the instruction to produce the margin? Yes, the extractive summary prompt is basically 00:27:04.880 |
a prompt that we add after each chunk to extract relevant information about that query. So 00:27:12.000 |
it's not just about finding relevant information in general, but information about the specific query, because this 00:27:16.480 |
inference pattern that we introduced is specifically for prompts that are composed 00:27:22.320 |
of a context plus an instruction, so we always know what the instruction is. That's why 00:27:26.720 |
this is the best use of this inference pattern. Now, in the paper we 00:27:34.720 |
also show how chunked prefill works at the KV cache level, if you are familiar with how 00:27:40.480 |
the KV cache works with the queries, the keys, etc. We also show how to overlap the computation of a 00:27:47.360 |
margin with the classification of a margin, and this is exactly the representation 00:27:54.720 |
of the KV cache during the prefilling of one chunk and how it can be overlapped with the 00:28:01.840 |
classification using the same language model and the same request. 00:28:09.120 |
Sorry, could you go deeper into overlap? I don't know where the overlap happens: is it between 00:28:14.480 |
the different chunks, or are you overlapping the different chunks? Let's talk about overlap. 00:28:19.440 |
Let's first visualize it; here we have a nice 00:28:26.960 |
representation of it. So you extract the margin, and you need to find a way to classify 00:28:33.360 |
it. You can either use an auxiliary classifier, so another model, to classify it as relevant 00:28:39.040 |
or irrelevant, or you use the same language model to classify it. But if you want to use the same 00:28:43.520 |
language model to classify it, you would normally need to create another request in the batch, because 00:28:47.200 |
you don't want the classification request to see anything else in the KV cache; you just want 00:28:53.520 |
to ask the language model: okay, I asked a language model to extract information about 00:28:58.400 |
this query here, "is Ethan Washington in a marble-floored room?", and the language model 00:29:05.840 |
extracted this stuff here, is it relevant to the query or not? To do that you would normally need 00:29:11.680 |
to create another request in the batch. But we show here that you can actually do it during 00:29:17.600 |
chunked prefilling, in the same request in the batch. 00:29:23.520 |
When you do chunked prefilling, what you are doing is adding the first chunk 00:29:29.200 |
to the KV cache, so the keys and the queries are the first chunk; this is C1 that you see here. 00:29:36.080 |
Then after this we also add an extractive summary prompt, 00:29:40.960 |
and we use it to generate tokens. So this 00:29:48.080 |
is the prefilling of the first chunk plus the 00:29:53.440 |
extractive summary, and then we use it to generate tokens: this is the first 00:29:58.000 |
margin being generated. Usually we pre-allocate the KV cache, so the KV cache is not a growing 00:30:05.040 |
tensor; we pre-allocate it with a fixed number of, let's say, padding tokens, but they are not 00:30:09.280 |
really padding tokens, they are just unused slots in the KV cache, and we replace them with the 00:30:14.960 |
tokens that are actually generated by the language model. So suppose you have these 00:30:19.280 |
unused tokens, which I call padding here. What do we do after we have generated the first 00:30:27.040 |
margin? We delete this margin, and also the instruction tokens. What we are actually doing 00:30:32.480 |
is not deleting anything: we just change the pointer position of the KV cache, i.e. how many tokens 00:30:37.840 |
are in use. So now the pointer, suppose, is pointing here. Then we can prefill the second chunk. The 00:30:44.400 |
second chunk needs to attend to all its tokens in a causal way, so each token in the second chunk 00:30:52.240 |
needs to attend only to itself and all the previous tokens in the same chunk, but it also needs 00:30:57.040 |
to attend to all the past tokens of the first chunk that was already prefilled. We also 00:31:01.840 |
need to prefill the extractive summary prompt, which can attend to all the past tokens it 00:31:07.920 |
has seen. But then, skipping some tokens that we reserve for the generation of 00:31:19.360 |
the margin, we can also prefill the classification instruction for the first margin, which was 00:31:30.560 |
generated in the previous step. Then, during the token generation step, after we have 00:31:36.240 |
prefilled the second segment along with the first generated margin, we can generate the tokens of the 00:31:49.600 |
second margin but also classify the first one, which we already obtained in the step before. So we are 00:31:56.320 |
generating two token sets in the same token generation step of the same request: one 00:32:01.840 |
using only the part relevant to the first chunk, the second chunk, and the extractive summary 00:32:08.000 |
after the second chunk, and one, as you can see from the attention mask, 00:32:14.240 |
using only the part that is relevant to classifying the first margin that 00:32:20.400 |
was extracted in the previous step. So you can also do it like this. 00:32:30.160 |
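To make the packing idea concrete, here is a generic illustration (an assumption for teaching purposes, not the paper's exact mask layout) of how two logically separate jobs can share one forward pass: a causal mask where the second segment is simply blocked from attending to the first.

```python
# Illustrative only: pack two independent segments into one sequence and build
# an additive attention bias so segment B cannot attend to segment A.
import torch

len_a, len_b = 5, 3      # e.g. "prefill chunk 2" tokens and "classify margin 1" tokens
total = len_a + len_b

bias = torch.full((total, total), float("-inf"))
bias[torch.tril(torch.ones(total, total, dtype=torch.bool))] = 0.0   # causal mask
bias[len_a:, :len_a] = float("-inf")       # segment B is blinded to segment A

# `bias` would be added to the attention scores; both segments are then
# processed in the same forward pass, so one request does two jobs at once.
print(bias)
```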
Thank you. Yeah, sorry, there's a question in the chat, and I think from the explanation Naz's question is clear: when you 00:32:36.560 |
are creating the margin for the second chunk, you're actually paying attention to the first and the 00:32:43.600 |
second chunk? Yeah. Okay, so could you change the attention mask to only look at the latest chunk? 00:32:50.160 |
That's exactly the question. Yes, but it's actually not possible; let me clarify why it's 00:32:56.640 |
not possible. The KV cache is made up of contextualized tokens: these tokens 00:33:01.360 |
are not standalone, they are contextualized. Token number one in the KV cache is a contextualized 00:33:06.720 |
version of tokens number zero and one; token number two in the KV cache is a contextualized 00:33:10.880 |
version of tokens zero, one and two. So if you tell the model to only look at the last tokens, 00:33:15.440 |
you are creating an autoregressive model that is generating the logits for, let's say, x10 00:33:24.800 |
while conditioning only on x9 and x8, which are contextualized tokens that contain information 00:33:33.600 |
about tokens seven, six, five that you are not otherwise including, so you are actually going out of distribution. 00:33:42.000 |
How much of the KV cache do you prefill with the chunk versus leave free? 00:33:53.600 |
Okay, if you use vLLM for example, they use this thing called paged attention, so 00:33:58.000 |
they allocate one entire page, which is actually a lot of tokens, so it's like another 00:34:03.840 |
chunk, which is more than enough to generate the margin. So what are the next steps for this? Well, 00:34:11.520 |
the next step is, for sure, we are sending it to conferences, getting it published and presenting it 00:34:18.880 |
around. But we are currently focused on long context modeling, and we are looking at how 00:34:27.200 |
long context can be better leveraged, so we will keep working in this 00:34:32.000 |
field; we will be researching a lot in this field. 00:34:51.600 |
I think there's a question from Amad: how are the queries chosen for the query-based summarization? 00:35:03.680 |
Yes, okay, so we work with a prompt that is made up of context plus 00:35:11.920 |
query, so we always know what the query is; that's the structure of the prompts we work 00:35:17.200 |
with. What's the use case at Writer that led you to this research? Well, I am personally very 00:35:25.280 |
interested in long context modeling, and I am given the freedom to research what I like, and 00:35:31.600 |
Writer is also interested in long context modeling, so things intersect, and here we are. And 00:35:37.680 |
we have many smart people working together, we did a few brainstorming sessions, and yeah. 00:35:44.160 |
What's the latency you see for a typical request? 00:35:52.400 |
Well, there is no real latency increase because of this; you are just paying 00:36:00.000 |
more to generate more tokens in intermediate steps. Of course, what happens is that 00:36:07.520 |
before, you would process the entire prompt as chunk prefilling, 00:36:14.800 |
chunk prefilling, chunk prefilling; now you have chunk prefilling interleaved with some token generation, which 00:36:19.600 |
will slow down the entire request. But you are actually getting something back, which is feedback, 00:36:25.120 |
and you get the possibility to see what the model is actually seeing at each step, 00:36:29.360 |
so you get a human in the loop. The human is waiting, but 00:36:34.000 |
waiting with some feedback, which is nice; we like to have progress bars, right? 00:36:38.480 |
And, maybe this is sensitive: do you happen to have a demo showing how this actually looks 00:36:45.360 |
in the user interface, or is it something we have to sign up to Writer to see? 00:36:49.120 |
We don't have that yet, but we are working on demos, yeah. 00:36:54.160 |
here we have you know we have a concept on how it would look like 00:37:04.320 |
So, okay, in some cases writing in the margins does not work as well as other methods. There are 00:37:15.200 |
two factors. First of all, because each margin is kind of a summarization of what is 00:37:22.000 |
present in the context, it depends highly on how good that model is at summarizing: the better 00:37:28.480 |
the margin, the better the information it will extract, and the better it can be leveraged. 00:37:33.280 |
If you think about the student: if your note-taking skills are not so good, then probably 00:37:38.240 |
your notes will not be useful. The second thing is the comparison you see here with RAG. 00:37:45.600 |
For this RAG baseline we actually put ourselves in the worst condition possible, which is "let's help RAG beat 00:37:51.440 |
us", but then RAG still doesn't beat us. Usually in RAG what you do is you have these 00:37:56.720 |
chunks, you extract some vectors from these chunks, and then you match them, with a dot 00:38:03.120 |
product or whatever, against the query. What we did for the RAG baseline is we asked the language model 00:38:09.760 |
itself; yes, it was charitable, because we asked the language model to decide 00:38:16.480 |
whether that particular chunk is relevant. So you have a 70-billion-parameter model telling you if a 00:38:23.040 |
chunk is relevant, compared to extracting some vector and matching it with a dot product; I mean, we helped RAG a 00:38:29.680 |
lot. So if we had compared against a naive RAG approach, we would do much better. 00:38:43.280 |
Does anyone else have questions, or want to come on screen to just ask Umar and Sam? 00:39:03.200 |
If there's nobody else that is interested: I actually am having a little bit of trouble 00:39:15.680 |
wrapping my mind around why chunked prefilling is so much more efficient. I looked at the 00:39:24.480 |
reference and I kind of get the idea, but maybe you can help me understand 00:39:30.960 |
the intuition. Okay, first of all, chunked prefill doesn't exist because it's 00:39:36.880 |
more efficient; it exists because we need it, we must do it. When you prefill a chunk into the 00:39:43.440 |
language model, let me show you, here we have the KV cache representation, right? When you 00:39:48.080 |
prefill a chunk into the language model, let's say this chunk number one, C1, you are generating 00:39:54.080 |
a quadratic matrix: as you can see, if you have four tokens you are generating a four-by-four 00:40:00.720 |
matrix, which is prohibitive to generate for very long prompts. Imagine you have 1 00:40:07.840 |
million tokens: that's a 1-million-by-1-million matrix, where each of these values is actually 00:40:12.720 |
a dot product of vectors, and the computation cost of that would really be very slow, 00:40:20.880 |
even though GPUs are really good at parallelizing when you have a lot of operations. 00:40:25.120 |
But anyway, the main problem is the memory of this 00:40:30.000 |
prefilling: when you generate that matrix it's really huge and it doesn't fit. So 00:40:35.040 |
we are forced to do this chunked prefilling, chunk one, chunk two, 00:40:41.040 |
but we were not leveraging these chunks; we do it only because we are forced to, and it's 00:40:47.520 |
slower than just doing it in one pass. But since we are already forced to do it, why not use them? 00:40:54.160 |
Yeah, yeah, I definitely think I got most of the paper, just the chunked prefilling part, 00:41:00.880 |
the background. If you want more information I can give you some references: one is the vLLM 00:41:08.800 |
page, since it's now an experimental implementation in vLLM, 00:41:13.760 |
and there was an NVIDIA explanation of chunked prefilling, so I will send a link later; 00:41:21.520 |
NVIDIA recently published an article about chunked prefilling. But basically, prefilling is the most 00:41:27.520 |
expensive part of working with long prompts for language models, that's why they need it. 00:41:34.080 |
But what I guess I was having trouble understanding is why that is. Is it just because it's quadratic, 00:41:39.440 |
and so you have to break it up into chunks? Maybe I can take a stab at it. 00:41:44.800 |
You can imagine: let's look at the attention mask here. Here I'm generating the first margin. 00:41:50.320 |
What am I doing here? I already have seven tokens in the KV cache and I am generating 00:41:58.720 |
the eighth token, so I am doing seven dot products. So token generation, which means generating one 00:42:04.960 |
token using whatever is in the KV cache, is linear with respect to whatever is inside the 00:42:08.800 |
KV cache; prefilling the KV cache is quadratic, and mostly because it's quadratic it's very expensive. 00:42:14.000 |
So we are comparing something that is linear with something that is quadratic. And if you consider long-context prompts: 00:42:19.680 |
if you are working with a two-million-token context window, 1,999,000 tokens 00:42:26.720 |
will be prompt; nobody will ever generate more than, let's say, five thousand tokens. 00:42:32.240 |
So yeah, the most expensive part is actually prefilling. 00:42:40.240 |
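A rough way to see the linear-vs-quadratic gap with the numbers from this discussion (purely illustrative arithmetic):

```python
# Illustrative only: attention-score work for a long prompt.
n = 1_000_000                      # prompt tokens
prefill_scores = n * (n + 1) // 2  # causal prefill computes ~n^2/2 query-key scores
one_decode_step = n                # generating one new token computes ~n scores

print(prefill_scores // one_decode_step)   # prefill ~ half a million decode steps of score work
```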
So just to get an intuition: the other extreme is that your chunk size is one token, right? So what's the trade-off? 00:42:46.240 |
What they do is basically try to make the chunk as big as possible while it still fits in the 00:42:53.200 |
GPU, so I think good numbers are like four thousand tokens or eight thousand 00:43:01.760 |
tokens, something in this range. And as I said before, token generation is usually memory 00:43:09.760 |
bound, meaning the limitation is only given by how much KV cache your memory can 00:43:14.160 |
hold, while prefilling is compute bound. So to maximize GPU utilization, 00:43:20.000 |
whenever you work with OpenAI or Cohere, they just overlap your new request with other people's 00:43:26.400 |
old requests: while they are generating tokens they are also prefilling, so the GPU is utilized 00:43:31.280 |
at 100%. Okay, yeah, that's helpful, and if you do send those links I'll definitely read them. 00:43:39.040 |
Thank you. Is there a break-even point where, at a certain context length, it becomes 00:43:46.880 |
more valuable to do writing in the margins than just using the LLM by itself, 00:43:54.160 |
where the compute evens out, or are margins always better? I believe, okay, 00:44:02.480 |
writing in the margins is like reading a book and taking margin notes versus just reading the book. I 00:44:08.640 |
think it's always worthwhile to read the book with the margins, but you are actually paying the 00:44:12.720 |
price for that, right? It's not something that comes for free: you are paying the 00:44:17.120 |
cost of generating these margins, so you're putting in some effort and then you leverage it. 00:44:22.640 |
So the question is: does it always help? So far, yes. It's not something that 00:44:31.520 |
you get for free, so you pay, and is it worth it? So far, yes. And is it always worth it? 00:44:41.120 |
So far, from our data, it's always worth it. Even if you're literally 00:44:48.000 |
just talking about a single chunk? If you have like a sentence instead of a book? Oh really? 00:44:53.760 |
Yeah, then it's not worthwhile, because in that case you are not even doing chunked prefilling, right? 00:45:00.560 |
If you have only a small context, it just gets prefilled at once; but when you have a long one, yeah. 00:45:06.560 |
Makes sense. Once you start growing, I think the question still stands, right? Let's say you've 00:45:13.280 |
got a paragraph and your chunks are sentences, or you've got a page, so you've got a thousand 00:45:17.600 |
tokens and your chunks are sentences: is every sentence relevant? So some form of highlighting, where 00:45:22.720 |
you have one page and for every sentence you highlight what's relevant or not. At that level 00:45:28.080 |
it's kind of negligible whether you throw the whole thing in a prompt versus highlight 00:45:33.840 |
seven of the 40 sentences and do this approach; that's the other non-extreme, right? 00:45:41.520 |
Yeah, so basically, whenever the context can just be prefilled into the KV cache without any 00:45:48.480 |
chunked prefilling, I believe it's not worth using this; but if the context is long, then it helps, 00:45:55.920 |
and it helps much more than chunking separately like we do with the APIs, 00:46:01.680 |
with LangChain, because there you are paying twice the cost of prefilling, while here we are 00:46:07.120 |
only paying it once. We also show in the ablation studies that it's always better to 00:46:13.440 |
send the context plus the margins, never just the margins. You can see this here in the ablation on context 00:46:20.400 |
compression: if you only send the margins, or only the context, it is always worse. 00:46:28.480 |
Context being the whole thing, right? The entire book. 00:46:42.480 |
And so, building off that ablation: let's say you've got a model that doesn't have the context window, 00:46:50.560 |
so you have to do some sort of chunking or splitting. Let's say you have a total context window 00:46:55.280 |
of 8,000 tokens and you have a million-token document. There are approaches where 00:47:00.560 |
you can process chunk by chunk and then combine. What you would expect is 00:47:06.640 |
that you could, for each chunk, do this writing-in-the-margins approach at every level and then scale 00:47:11.920 |
that down with however many steps you need. Is that intuition still pretty accurate? 00:47:18.320 |
I believe with an 8,000-token context window 00:47:25.280 |
the latency would be higher, right, because at each step you are adding 00:47:31.840 |
more. At that kind of scale it's more convenient to just do independent chunking and 00:47:39.040 |
generation, because you can always split the context into chunks, 00:47:45.760 |
send multiple requests, and pay the price of recomputing in that kind of range. 00:47:50.960 |
But when you are talking about 64,000 tokens, it starts making more sense to use this approach. 00:47:57.200 |
So at that smaller level I think the traditional approaches work fine. 00:48:08.880 |
Any other questions? Well, I think another question that came up was: why now, right? 00:48:16.480 |
I mean, why did nobody think about this before? Because actually, we didn't have very long context 00:48:22.880 |
models before, and we were not forced to do chunked prefilling. As 00:48:30.480 |
you can see with vLLM, this feature is an experimental feature right now. 00:48:38.400 |
Right now we need this chunked prefilling and everyone is doing it, so that's why we have 00:48:43.840 |
this. Innovation always starts from some problem that you face and some 00:48:49.120 |
need that you have; right now we have this need and we have the capability, and that's how we 00:48:55.360 |
came up with this. Awesome, we have a few more minutes if anyone else has last-minute questions, 00:49:11.360 |
please feel free to ask. And a big shout-out and thanks to the Writer team for presenting. 00:49:20.080 |
Thank you guys for listening. You are welcome to send us your questions; we 00:49:28.960 |
have a GitHub repository, and I suggest looking at the code, it's really well 00:49:35.200 |
commented and it follows the same pattern we shared in the paper, so it's 00:49:40.560 |
easily understandable for everyone. We used a lot of nice tricks; one of the 00:49:45.280 |
tricks is that you can always delete stuff from the end of the KV cache, and now you also 00:49:49.440 |
know why. It's an interesting project if anyone's interested: on Fridays we have a similar 00:49:57.840 |
AI in Action session where we try to take a practical angle outside of papers. The code is up, the 00:50:03.360 |
paper is up, and if anyone wants to run it, present it, and share their learnings, it'll be a really good 00:50:07.680 |
learning exercise; that's always there. I guess we got a question from Jimmy: any future 00:50:14.960 |
work in this direction? Well, for sure we will keep working on 00:50:21.920 |
long context and how we can better leverage long context. There is another kind of problem with 00:50:27.600 |
long context: how well language models use long context actually depends highly 00:50:34.160 |
on the attention mechanism and how the softmax works. We have seen, with the paper on 00:50:39.360 |
attention sinks, that the language model allocates its attention in a particular way: when you do the 00:50:46.720 |
attention mechanism you are doing a weighted sum over the tokens, where each token is given a weight, 00:50:52.480 |
and we see that most of the weight is given to the first few tokens. So there is a lot of research in 00:50:58.560 |
this area; a few days ago another paper came out called sigmoid attention, which 00:51:03.840 |
is also studying the distribution of these logits. So I think the attention 00:51:10.400 |
mechanism will play a big part in how we can extend long context, if we can also fix 00:51:16.880 |
this part. I am very interested in the KV cache and in optimizing long context modeling, so 00:51:23.040 |
we are working in this direction, because it's needed by the market and also 00:51:31.200 |
I like it, and I think it's cool to be able to analyze an entire book or an entire codebase 00:51:36.720 |
instead of hoping that RAG finds the right chunk. 00:51:40.720 |
Awesome, well, big shout-out to the Writer team, always great to have you guys present. 00:51:49.840 |
Sam is in Discord as well, I'm sure he'll relay questions and such. We've got the recording, we'll 00:51:56.080 |
share it with your team, and I don't know what you choose to do with it. Next week we've got 00:52:01.920 |
swyx; he'll be presenting some of the Strawberry, Q*, Quiet-STaR papers, so he'll be 00:52:08.240 |
doing that next week. And then the following week, if anyone's interested in anything, volunteers are 00:52:12.640 |
always open; I posted a few papers in the paper club channel, and I think there's also the Mistral stuff, so if anyone 00:52:18.800 |
wants to lead, pop in there. Otherwise, next week swyx is doing the Strawberry and STaR stuff. 00:52:27.760 |
cool thank you guys thanks everyone take care 00:52:45.840 |
question is there a you can um is there any way you can copy over the comments 00:52:52.000 |
um i'm trying to do it but it seems to like lazy load like as you scroll up and down it's 00:52:58.800 |
really painful let me see if i can extract these comments 00:53:03.520 |
do you know if they normally get saved as a zoom recording i'm using you can click save chat where 00:53:14.880 |
you save chat anyone want to help me out chat there's i was able to copy the comments without 00:53:21.600 |
any problem uh the file is usually saved on the host computer in a folder like documents zoom 00:53:29.840 |
see there's a chat log file that as long as okay i got it i just i just saved the chat i'll throw 00:53:38.480 |
it in discord i have a text file of it yes slides would be great too if we can grab them uh yeah 00:53:44.560 |
get it all i'll pick them i'll pick them right now perfect thanks guys sweet all right thanks