
The Physics of Language Models: Knowledge Capacity Scaling Laws


Transcript

...we can go whenever you're ready, and I think people will just fire up the chat. What time is it for you? Where are you? I'm in Dhaka, Bangladesh, so it's about 2 a.m. for me. Oh my god, okay. I've done paper clubs in Singapore when it was 4 a.m., so I feel the pain. Thank you, that's dedication, love it. Well, since you're up, we can get started. Just give me a second and I'll share my screen. I feel like we have this backlog and I'm going to try to clear it so that we have new papers we can cover; we can also cover timeless papers, which is always helpful. Are you able to figure out the sharing? Yes, I am. Most people using Zoom in the browser have to restart the browser, which is annoying. Okay, go for it. Why am I getting this weird prompt? I'll just say hello for now. Okay, everyone can see my screen? Yep.

So let's start. Once again, thank you everyone for joining and allowing me to present this paper. Today I want to discuss the third edition, I think the last paper of the knowledge part of the Physics of Language Models series. The researchers divided the series broadly into three categories; they wanted to understand intelligence in three different areas. One was knowledge, another was logical or mathematical reasoning, and the third was understanding the grammar of language itself. This is the last part of the knowledge series from those researchers, and in this edition they explore the knowledge capacity scaling laws of large language models.

The core problem they address is that we currently see amazing capabilities from LLMs: they can store really large amounts of knowledge, and you can talk to them like search engines and ask all kinds of questions. However, there is a lack of precise understanding of the storage capacity of these models. For a model of a certain size, say two billion or seven billion parameters, how much knowledge does it store? Before these researchers published their findings, we did not have a clear, quantitative understanding of this. They also try to uncover the relationship, a mathematical formula, between knowledge storage and model size. Framed as a few broad questions: how much knowledge is stored per parameter in a language model, what relationship do we see between model size and knowledge capacity, and what factors affect this storage capacity?

They argue that most prior studies in this area focused on metrics like loss and perplexity, which are good and nice, but don't give a granular understanding. What does a perplexity of 3.2 actually mean? If I give the model a very big document, can it answer my factual questions accurately? You cannot answer that kind of question from those metrics alone.
Those metrics also don't account for all the different kinds of knowledge that occur in daily life, for example structured versus unstructured knowledge. How do you measure a language model's capacity to understand and demonstrate those kinds of knowledge?

Why is this kind of study useful? A few reasons. When you are trying to deploy or choose an LLM for a use case, say in-context learning or RAG over your enterprise documents or some other knowledge store, how do you actually choose among models? Suppose you chose a moderately sized model, a two, eight, or nine billion parameter model that fit within your budget constraints, tested it on your dataset, and found some inconsistencies. For example, it can handle simple factual questions about company A, like what company A is or when it was established, but it fails on more complex queries: if you phrase the question differently, include more background knowledge in the question, or use a language the benchmarks didn't clearly report on. How do you iterate and choose a different model, or improve the one you have, whether you are training it or only running inference on it? And based on those findings, how do you choose the best model size to go with, and what resources should you plan for?

The researchers of this paper came up with a framework for approaching these questions in a structured way. They wanted to measure knowledge at the bit level. To do that, they constructed synthetic datasets with a very well-defined structure: knowledge tuples of the form (name, attribute, value), where each tuple represents a certain number of bits, so the entire dataset they constructed has a known total number of bits. This kind of controlled experimentation allowed them to answer the questions we asked at the start of this talk in terms of bits per parameter, and to come up with numbers that help determine the knowledge storage efficiency of different models.
Generating synthetic data also eliminates external factors, things about how a natural dataset happens to be constructed, that could prevent us from establishing a clear relationship for exactly the metrics we want to understand. So creating synthetic data in this format was crucial.

What they established from their experiments was a very clean finding: a consistent two bits of knowledge stored per parameter, across different architectures, quantization settings, and efficiency techniques. Two bits of knowledge per parameter is the general rule that came out of their experiments, and it was consistent across model scales, whether two billion parameter models, nine billion ones, and so on. It was validated on different architectures, GPT-2, LLaMA, Mistral, and mixture-of-experts variants, and also, I believe, on two quantization settings: from 16- or 32-bit floating point down to int8 and int4.
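To make the bit counting a bit more concrete, here is a minimal sketch, purely my own illustration and not the paper's exact estimator (the authors use a more careful information-theoretic lower bound), of how the information content of a synthetic (name, attribute, value) dataset can be tallied and related to a bits-per-parameter number. The attribute pools and counts below are hypothetical.

```python
import math

# Hypothetical value pools for each attribute in the synthetic biographies.
# If a value is drawn uniformly from a pool of size n, it carries log2(n) bits.
# (A simplification: the paper derives a more careful lower bound, and the
# person's name itself also contributes bits.)
attribute_pool_sizes = {
    "birth_date": 12 * 28 * 100,  # month x day x birth-year range
    "birth_city": 200,
    "university": 300,
    "major": 100,
    "employer": 263,
}
num_people = 10_000_000  # hypothetical number of synthetic individuals

bits_per_person = sum(math.log2(n) for n in attribute_pool_sizes.values())
total_bits = num_people * bits_per_person
print(f"~{bits_per_person:.1f} bits per person, ~{total_bits / 1e9:.2f}B bits total")

# At the reported ~2 bits/parameter, storing this dataset needs roughly:
print(f"~{total_bits / 2 / 1e6:.0f}M parameters at 2 bits per parameter")

# And the headline number discussed below: a 7B-parameter model at
# 2 bits/parameter can hold about 14B bits of this kind of knowledge.
print(f"7e9 params x 2 bits = {7e9 * 2 / 1e9:.0f}B bits")
```

The exact numbers here are immaterial; the point is that the synthetic setup gives a well-defined "bits of knowledge in the dataset" that can be divided by the parameter count.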

So what they established is that the model stores roughly two bits of information per parameter, so a seven billion parameter model can encode about 14 billion bits of knowledge, which roughly corresponds to all of the English Wikipedia articles available plus textbook knowledge. I have a question for you, and maybe I missed this, but did they give a real-world example of what two bits of knowledge is? Do you know what I mean? Like, if I had two bits of knowledge, would that be a sentence, or a couple of words?

Sure. From what I understood from the paper, two bits of knowledge corresponds to this structured representation from the dataset. Oh interesting, so it'd be somebody's name and an attribute about them. Yeah, because the synthetic dataset was constructed from generated personal biographies, and they measured knowledge in terms of this structured format. So my understanding is that the two bits of information relate to this structured representation, to whether the model is able to extract that representation or not.

Cool, thank you. Okay, so if we dive a little into the architectures they experimented with, we see them scaling up from GPT-2 to LLaMA, across these two scales. An interesting finding from this part was that when GPT-2 was combined with rotary embeddings, its performance on this kind of knowledge extraction was actually on par with the LLaMA and Mistral models.

That was a bit of information I found interesting. I'm not sure exactly why this happens, but one reason may be that LLaMA and Mistral have a somewhat complicated gated MLP inside their architectures, which makes training a bit unstable.

I think that's what they said in the paper. In contrast, GPT-2 has a simpler architecture than LLaMA and Mistral, so when they essentially swapped its position embeddings for rotary embeddings, that was the interesting finding there.

The other part was the effect of architecture on storage capacity. They experimented with MoE architectures, which represent knowledge in a very sparse form, and they found that this sparsity was actually very helpful for representing knowledge: MoE architectures are quite efficient at it. One reason could be the sparse nature of the architecture; I'm not an expert on MoE and have studied it very little, so if anyone present tonight can elaborate on why sparsity might be a good way to represent knowledge, please do.

Going a bit deeper on the different kinds of layers, they did a few ablations, for example on the MLP layers. One variation fully removed the MLP layers from the transformer, another just reduced the size of the MLP layers. What they found was that completely removing the MLP layers hinders knowledge storage capacity a lot: performance degrades roughly 1.5 times compared to merely shrinking the MLP. That was interesting to me, because it suggests knowledge is not concentrated in particular layers of a language model; it is distributed across the entire architecture, and removing whole layer types won't do us any good.

The other comparison they did was on quantization. If they quantized from 16- or 32-bit floating point down to eight bits, there was no significant capacity loss, which is good: it means we can have small, compact models with the same knowledge representation as their big and bulky counterparts. Interestingly, though, if you quantize further, down to four bits, there is a drastic roughly two-times capacity reduction. So if you want a model that represents the knowledge in your training data, or your in-context data, efficiently, for now you probably wouldn't want to consider four-bit variants.
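As a toy illustration only (my own, not from the paper, and it measures weight round-trip error rather than knowledge capacity), here is what symmetric round-to-nearest quantization to int8 versus int4 does to a tensor of weights; the much coarser int4 grid is one intuition for why capacity drops there.

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric round-to-nearest quantization followed by de-quantization."""
    qmax = 2 ** (bits - 1) - 1       # 127 for int8, 7 for int4
    scale = np.abs(w).max() / qmax   # per-tensor scale
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

w = np.random.default_rng(0).normal(size=100_000).astype(np.float32)
for bits in (8, 4):
    err = np.abs(fake_quantize(w, bits) - w).mean()
    print(f"int{bits}: mean absolute round-trip error = {err:.5f}")
```

The paper's actual claim is about how much knowledge a trained model retains after quantization (essentially none lost at int8, roughly half lost at int4), which this weight-error toy does not measure directly.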
The other interesting result from their experiments was that in order for a language model to reach that two bits per parameter capacity, each piece of knowledge had to be exposed to the model about a thousand times during training. When I first read this I thought it meant running something like a thousand epochs over the entire dataset, but that's not it. It means that a particular data point, for example "Anya Forger was born on December 10, 1996", should appear about a thousand times over the course of training. That was interesting, and if you reduce it from a thousand exposures to about a hundred, the capacity drops to roughly one bit per parameter, so the number of times a piece of knowledge is exposed to the model has a direct effect on how much capacity it effectively gets.

Quickly, on data quality at training time: they found that junk, meaning low-quality data, at a seven-to-one ratio of junk to useful data can reduce the capacity for useful knowledge by as much as 20 times in their most extreme setting, so this has a very critical impact on model efficiency. I was very interested in whether the authors would propose a good rule of thumb for this data mixture of low-quality and high-quality data, but they only report experiments with different mixtures; unfortunately there is no recommendation for what the most efficient mixture for training is.
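To make "exposures" and the junk-to-useful ratio concrete, here is a hedged sketch of assembling a toy pre-training mixture. The function and the naive sentence duplication are my own simplifications; in the paper, an "exposure" counts how many times a fact is seen over the whole training run, possibly through different paraphrases rather than literal copies.

```python
import random

def build_mixture(useful_facts, junk_sentences, exposures=1000, junk_to_useful=7):
    """Assemble a toy pre-training corpus with a controlled number of
    exposures per fact and a junk-to-useful ratio (counted in sentences)."""
    corpus = []
    for fact in useful_facts:
        corpus.extend([fact] * exposures)      # each fact seen `exposures` times
    n_junk = junk_to_useful * len(corpus)      # e.g. 7 junk sentences per useful one
    corpus.extend(random.choices(junk_sentences, k=n_junk))
    random.shuffle(corpus)
    return corpus

facts = ["Anya Forger was born on 10 December 1996."]
junk = ["lorem ipsum dolor sit amet.", "low-quality repetitive web text."]
mix = build_mixture(facts, junk)
print(len(mix))  # 1000 useful + 7000 junk = 8000 sentences
```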

One interesting related thing I read this week: in the MiniMax paper, which I believe was released at the start of this week, they also trained on a mixture of low-quality and high-quality data, and they found that if they eliminated the low-quality data completely, they saw a performance degradation, which was a bit counterintuitive to me.

So what they suggest is that keeping a good mix of low-quality and high-quality data actually improves performance. I suspect it is because, at the right ratio, the mix acts a bit like a data augmentation technique for the language model, so eliminating low-quality data entirely won't do us any good; that is the takeaway from this part. The most interesting part, though, is that during training, language models can actually identify sources of useful information if you simply prepend the source name to the training data. They tested this by prepending wikipedia.org or other trusted sources to their training data, and they saw that it helps the model prioritize high-quality data and pay more attention to the knowledge coming from those sources during training.
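Here is a minimal sketch of that source-tagging trick as described in the talk: prepend a source identifier, such as a domain name, to each training document so the model can learn which sources to trust. The exact formatting of the tag is my assumption, not the paper's.

```python
def tag_with_source(document: str, source: str) -> str:
    """Prepend a source identifier to a training document so the model can
    learn, during pre-training, to weight knowledge from trusted sources."""
    return f"{source}\n{document}"

trusted = tag_with_source("Anya Forger was born on 10 December 1996.", "wikipedia.org")
junky = tag_with_source("lorem ipsum dolor sit amet.", "content-farm.example")
print(trusted)
```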

That is a high-level overview of the paper. Some interesting research and engineering directions that follow from it: I would be excited to see more blog posts or research on parameter-efficient techniques targeted at these scaling-law findings, or at specific knowledge needs; more work on architecture selection under the deployment constraints of these models; and more efficient quantization techniques that can maximize knowledge representation.

Another interesting direction would be data engineering and data curation techniques for the pre-training and post-training stages. What would be optimal exposure or repetition ratios during pre-training and post-training? What kinds of data should you fine-tune on if you want to support a particular kind of knowledge? And one tagging technique the authors showed is that simply prepending the name of your source to the data helps knowledge storage.

There might be other techniques as well that would be interesting to explore, and of course more open-source software that can help with data quality detection would be interesting too. So that concludes this paper for now.

Do you want me to move on to the next part, or... I think there are a bunch of questions we could go through, if people want to unmute and ask. Eugene, it looks like there's a bit of a backlog.

Oh yeah, sure. Okay, I guess I can just go through some of the questions. On the two bits of knowledge per parameter: does this mean all parameters, including embeddings and fully connected layers? I think you sort of answered that, and we've had a lot of chat discussion here as well, via the ablations on how the MLP and attention layers are essential.

Another question, from Pio: how was this synthetic dataset used to arrive at the two bits per parameter finding? The data is synthetic, so the models would not have been trained on it, right? Pio, if you want to unmute yourself and explain your question a bit more.

Yeah, thank you. I'm not sure I understand exactly, because at the beginning you presented this tuple structure, right, name, attribute, value, and from my understanding that was synthetically generated, and later they were testing different models like GPT, LLaMA, etc. So how exactly did they get from this tuple format to two bits per parameter? That's not clear to me.

Oh, so you're asking how this representation translates into bits per parameter, right? Or maybe how exactly they found the two bits per parameter, because it's unclear to me; the models are pre-trained, right? Oh no, the models are actually not pre-trained; they trained the models from scratch on these synthetic datasets.

All the datasets they generated were biographies that were synthetically generated. For example: my name is Anya Forger, I was born on 10 December 1996, I was born in Germany, and I studied there.

So that is one instance of a factual data point, and they generated, I believe, a few million of these, and then they pre-trained and fine-tuned on these synthetic datasets specifically. The reason this was done is so that the model only learns the knowledge stored in these datasets, because they wanted to test the model's ability to extract knowledge, or rather to extract structured knowledge, from them.
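For concreteness, here is a hedged sketch of how such synthetic biographies could be generated from (name, attribute, value) tuples. The field names, value pools, and sentence template are my own illustrative choices; the paper uses richer templates and multiple paraphrases per person.

```python
import random

FIRST_NAMES = ["Anya", "Loid", "Yor", "Damian"]
LAST_NAMES = ["Forger", "Desmond", "Briar"]
MONTHS = ["March", "July", "December"]
CITIES = ["Berlin", "Munich", "Hamburg"]
UNIVERSITIES = ["TU Berlin", "LMU Munich", "Universitaet Hamburg"]

def sample_person(rng: random.Random) -> dict:
    """Sample one synthetic individual as (attribute, value) pairs."""
    return {
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        "birth_date": f"{rng.randint(1, 28)} {rng.choice(MONTHS)} {rng.randint(1950, 2000)}",
        "birth_city": rng.choice(CITIES),
        "university": rng.choice(UNIVERSITIES),
    }

def to_biography(person: dict) -> str:
    """Render the tuples as one plain-text biography sentence."""
    return (f"{person['name']} was born on {person['birth_date']} in "
            f"{person['birth_city']} and studied at {person['university']}.")

rng = random.Random(0)
for _ in range(3):
    print(to_biography(sample_person(rng)))
```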

Using a synthetic dataset was essential so that there were no other external factors that might affect the model's behavior. So in all of their experiments the models were not pre-trained.

The models were trained from scratch, then trained or fine-tuned using these synthetic datasets only. And I think your question was how they go from these sentences to encoding only two bits of information, was that it? Yeah, I think so, because now I get the training part, let's say.

So now I'm thinking about the evaluation. You mentioned extraction, so was it just prompting, starting with the name, and then the LLM needed to complete the correct attribute and value, and that means the tuple is stored correctly? Or how exactly was this extraction evaluated? Oh, so if I remember correctly, and anyone please correct me if I'm wrong.

If I remember correctly, the way they evaluated it was that there was a base synthetic biography dataset of plain sentences, and from that biography dataset they also generated question-answer pairs. So if the data said "my name is such-and-such and I was born there", the corresponding question-answer pairs from that bio would be things like: who is Anya Forger, where was she born, where did she study?
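A hedged sketch of that evaluation loop: turn each person's tuples into question-answer pairs, prompt the trained model with the question, and score exact matches against the known answer key. The question templates are my own, and `model_generate` is a placeholder for whatever inference call you use.

```python
def make_qa_pairs(person: dict) -> list[tuple[str, str]]:
    """Turn one person's (attribute, value) tuples into QA pairs."""
    name = person["name"]
    return [
        (f"When was {name} born?", person["birth_date"]),
        (f"Which city was {name} born in?", person["birth_city"]),
        (f"Which university did {name} attend?", person["university"]),
    ]

def exact_match_accuracy(model_generate, people: list[dict]) -> float:
    """Score knowledge extraction by exact match against the answer key.
    `model_generate(question) -> str` stands in for the trained model."""
    pairs = [qa for p in people for qa in make_qa_pairs(p)]
    correct = sum(model_generate(q).strip() == a for q, a in pairs)
    return correct / len(pairs)

# Toy usage with a dummy "model" that always gives the same answer:
people = [{"name": "Anya Forger", "birth_date": "10 December 1996",
           "birth_city": "Berlin", "university": "TU Berlin"}]
print(exact_match_accuracy(lambda q: "10 December 1996", people))  # 1 of 3 correct
```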

So they had question-answer pairs derived from the biography data as well. As far as the exact experimental setup goes, they trained on mixtures of those biographies and their question-answer pairs, in different proportions. And since the question-answer pairs were generated from the biographies, you also have the answer key available to you.

So you can use that for evaluation as well. I believe the evaluation was done like that, and if the evaluation scores are high, if the model performs well on the evaluation set, what they explained is that it means the model was able to encode this structure efficiently. And how did they verify where that structure is encoded?

If I remember correctly, they also did probing on the model internals, and since the model was not trained on anything other than this biography data, with techniques like P-probing, which is positional probing, they were able to see that at certain positions this knowledge was encoded in the layers.
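As a rough sketch of the probing idea (not the paper's exact P-probing procedure): freeze the trained model, collect hidden states at a chosen token position, and fit a linear classifier to predict the attribute value. If a simple probe recovers the value at that position, the knowledge is linearly decodable there. Hidden states are assumed to be precomputed arrays here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(hidden_states: np.ndarray, labels: np.ndarray) -> float:
    """Fit a linear probe on frozen hidden states collected at one position.

    hidden_states: (num_examples, hidden_dim) activations, e.g. taken right
                   after the person's name in each biography.
    labels:        (num_examples,) integer ids of the attribute value.
    Returns held-out accuracy; high accuracy suggests the value is
    linearly decodable at that position.
    """
    split = int(0.8 * len(labels))
    probe = LogisticRegression(max_iter=1000)
    probe.fit(hidden_states[:split], labels[:split])
    return probe.score(hidden_states[split:], labels[split:])

# Toy usage on random data (real activations would come from the trained model):
rng = np.random.default_rng(0)
h = rng.normal(size=(500, 64)).astype(np.float32)
y = rng.integers(0, 10, size=500)
print(linear_probe_accuracy(h, y))  # about chance level (~0.1) on random data
```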

So a model that had very high evaluation performance was also probed, and you could see it was able to encode this tuple structure efficiently. Does that answer your question? Yeah, really well, thank you so much, now I really understand it, thank you.

No problem, thank you very much for asking. Anyone else have outstanding questions in the chat they didn't think were addressed? Don't be shy. Okay, I guess I can do one from Jean-Marc Sommet: is it the mix of data or the quantity of data that improves the capacity?

I think you also mentioned something about there being a sweet spot, whereby having some amount of low-quality data actually improves capacity. Could you talk a little more about that? What is that amount, what ratio? Sure. From the paper I don't actually remember a specific dataset size that they experimented on, but they did mention experimenting with different mixtures of high-quality and low-quality data, junk versus useful data.

One of the figures I quoted came from their systematic experiments on different mixture ratios, and one of the extremes was that at a seven-to-one ratio of junk to useful data there was a significant degradation of knowledge capacity, around 20 times.

What this means is that you may keep a bit of low-quality, or repetitive, data, but the ratio should be quite small. However, as I was saying, there is no sweet spot mentioned in the paper, and I was a bit sad about that, because I was really looking forward to a result like that. I think anyone in this room would agree that whenever we train or pre-train any small transformer, or even a small image or text classification model, the training data ratio and size we settle on is very much a matter of hoping and praying that it just works.

I was really hoping I could get some insight there, so maybe future experiments or studies will give us more enlightenment on this part. But from this paper I don't think an optimal mixture was given. What they do suggest is that you cannot go too extreme with your junk data, and you also cannot fully eliminate your low-quality data; you need a mixture, and you have to find the optimal one from a few of your own experiments.

Gotcha, thank you so much. Thank you very much; these are very interesting questions, they're really pushing me to think hard. Anyone else? I guess there's one logistics question about whether you could share the slide deck; maybe you could just drop it in the Discord chat. I actually thought, who made this deck? Did you make it?

I think the profs themselves. Okay, I see. I'm sorry, come again please? Can we share the slides, or how do we get the slides? Sure, I can upload it on Discord after this is over. How to join Discord, how to get the Discord link? Wow, I guess a lot of folks are joining from somewhere else; let me just share the invite link right now. I don't know how to get a Discord alias, I'll figure it out. Here's the Discord link. Maybe I should also share the link to the paper club channel; that's really where most of the discussions happen. Any other questions? I guess there's a question from Colette.

Colette, do you want to ask your question? Okay, maybe I'll just read it on your behalf. Oh, go ahead. Yeah, it's interesting that we can store two bits of knowledge per parameter, but how useful is the stored information? Because, you know, some of the evaluation metrics look at practical application: does it really add value, or is there a lot of noise? Was there any discussion of how we do knowledge extraction, of data curation and model architecture? I don't think I understand the question, can you please simplify it? Yeah, it's simply: do we have a task-optimized knowledge representation? I know we can store two bits of knowledge per parameter, but the question is, in a real-world application, can we control the knowledge that we store or not?

Okay, so this was a very expository kind of study that they wanted to do on language models, and I guess the main aim was to uncover whether there are mathematical laws behind these very common things we throw around when using LLMs, like "this model is very knowledgeable" or "this model is very creative". From my understanding these are very subjective words to describe very capable systems, and I guess the researchers thought it does these systems a disservice, so why not try to understand them in their entirety instead of through our own projected subjective words like creative, knowledgeable, or good. This kind of study is a step toward mathematically pinning down a formula for terms we throw around, like "knowledgeable LLMs", and they conducted it in a very specific experimental setting. I think they wanted to show that if you can control the conditions around a language model, you can study different aspects of it, and this paper shows how to study a few aspects of knowledge representation. Of course the research community can expand on these studies to look at other kinds of knowledge or other aspects of LLMs; for example, maybe we could come up with our own study of creativity.

So I don't think, and maybe someone else can shed some light on this, that you can translate these numbers directly into the real world; I'm not sure how much they transfer, because these studies were, in their own words, done in very controlled environments, and that's what they came up with. How much they apply in the real world, maybe not entirely, not a hundred percent, but it gives us a framework to start thinking about these things and to make decisions, like choosing models or selecting training data for LLMs. I agree with you, thank you very much. Thank you for asking the question.

I think we only had time for one paper then. Yeah, this turned out to be a long discussion.
There are lots of questions we didn't even get to; RJ and Eric and Eugene were all getting questions. But I want to respect everyone's time. That was really, really good. I think there's a lot of interest in separating knowledge and intelligence, so the ablations are always important, and I always think it's a gift to the community. Thank you very much for having me; those were amazing questions and I also learned a lot. I will share the slides in the Discord; please feel free to correct anything you see in the slides, and let me know if you have any other suggestions. Thank you once again for having me. Awesome. We don't have a paper for next week yet. We could get Sharmay back for part two, but I also think a lot of people want to talk about DeepSeek. Definitely very hot. Is that synonymous with R1? I haven't been keeping up, honestly. Yeah, of course, R1; V3 was also trained on R1. I assume R1 is the topic because it's the most important paper of the year so far. So do we have a volunteer for next week? We have a volunteer. Oh, I don't know, that's what I thought I saw in the Discord, but maybe I misread that. No, that was a joke. Oh, okay. We have R1 next week then. Is it Vibhu? Maybe Vibhu, I don't know. Anyway, we'll settle this on Discord. Okay, we have to go.

Thanks, everyone. Bye. Bye. Bye. Bye. Bye. Thank you.