The Physics of Language Models: Knowledge Capacity Scaling Laws

We can go whenever you're ready, and I think people will just fire up the chat.

What time is it for you? Where are you?

I'm in Dhaka, Bangladesh, so it's about 2 a.m. for me.

Oh my god. Okay, we've done paper clubs in Singapore when it was 4 a.m., so I feel the pain. Thank you, that's dedication, love it. Well, since you're up, we can get started.

Okay, just give me a second and I'll share my screen.

I feel like we have this backlog, and I'm going to try to clear it so that we have new papers to cover. We can also cover timeless papers, which is always helpful. Are you able to figure out the sharing? People who use Zoom in the browser have to restart the browser, which is annoying; I know you had to do that recently.

Is that so? Okay. All right, go for it.

Okay, so everyone can see my screen? All right.
Yep.

Okay, so let's start. Once again, thank you everyone for joining and allowing me to present this paper. Today I want to discuss the third edition, which I believe is the last paper of the knowledge part of the Physics of Language Models series. The researchers divided the series broadly into three categories; they wanted to understand intelligence in three different areas. One was knowledge, another was logical or mathematical reasoning, and the third was understanding the grammar of language itself. This paper is the last part of the knowledge series from those researchers. In this edition they explore the knowledge capacity and scaling laws of large language models.
The core problem they are addressing is this: we currently see amazing capabilities from LLMs. They can store very large amounts of knowledge; you can talk to them like search engines and ask them all kinds of questions. However, there is a lack of precise understanding of the storage capacity of these models. For a model with a certain number of parameters, say seven billion or two billion, how much knowledge does it actually store? Before these researchers published their findings, we did not have a clear, quantitative understanding of this. They also try to uncover the relationship, something like a mathematical formula, between knowledge storage and model size. Put as a few broad questions: how much knowledge does a language model store per parameter, what relationship do we see between model size and knowledge capacity, and what factors affect this storage capacity?
Before this research, they argue, most studies in this area focused on metrics like loss and perplexity. Those are good and nice, but they don't give us a very granular understanding. What does a perplexity score of 3.2 actually mean? If I give the model a very big document, can it answer my factual questions accurately? You cannot answer that kind of question from those metrics alone. Those metrics also don't account for all the different kinds of knowledge that occur in daily life, for example structured versus unstructured knowledge. How do you measure a language model's capacity to understand and demonstrate those kinds of knowledge?
If I may say why this kind of study is useful, a few of the reasons are these. When you are trying to deploy or choose LLMs for certain use cases, for instance in-context learning or RAG over your enterprise documents or any other kind of knowledge store, how do you actually go about choosing among models? Say you chose a moderately sized model, a two-billion or eight- or nine-billion-parameter model, that fit within your budget constraints, you tested it on your dataset, and you found some inconsistency. For example, it could understand simple factual questions, like what company A is or when it was established if you have data on company A, but it kept failing on more complex queries: when you phrase the question in a different way, when you include more background knowledge in the question, or when you use the model in a language for which the benchmarks were not clearly reported. How do you then iterate, choose different models, or improve on these models, whether you are training them or only running inference? And based on findings like these, how do you choose the best model size to go with, and what kind of resources should you plan for?
and um what kind of resources should you plan for those so um the researchers uh of this paper 00:10:46.400 |
um what they came up with is they came up with a kind of like a framework for understanding these 00:10:54.880 |
questions uh in a very structured or of they tried to give us a framework for this um for 00:11:05.120 |
these kind of scenarios um so what they do is uh they wanted to measure knowledge at the bit level 00:11:16.000 |
and uh so how they do it is that they constructed synthetic uh data sets synthetic data sets that 00:11:29.280 |
has a very um very defined structure for example um uh they have a knowledge tuple format which 00:11:40.560 |
consists of the name and attribute and value and it has a certain number of bits that it represents 00:11:51.680 |
and the entire data set that they wanted to um represent it that that they 00:11:58.480 |
um that they constructed has a certain number of bits and 00:12:05.520 |
so and using controlled um this kind of controlled experimentation 00:12:15.280 |
um um allowed them to quantify those questions that we asked in the um at the start of this 00:12:25.120 |
talk in terms of like bits per parameter and and they wanted to like um and allow them to 00:12:36.800 |
come up with numbers for that can help us determine storage efficiency knowledge storage 00:12:46.400 |
efficiency of different models in terms of numbers and also by generating this kind of like synthetic 00:12:56.880 |
data uh this allows us and this allows us to like eliminate possibilities of um like external 00:13:07.120 |
factors that um if the data set was constructed in a different way um that which could like um 00:13:18.480 |
inhibit uh this uh which could like um like effect uh or not like give us a like a very 00:13:31.760 |
clear not not like help us in a establishing a very clear relationship of 00:13:39.520 |
uh the only the metrics that we wanted to understand so um creating synthetic 00:13:47.280 |
data was crucial synthetic data in this format was crucial for that 00:14:04.320 |
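As a purely illustrative sketch of the setup being described, here is what generating (name, attribute, value) tuples and counting how many bits of knowledge they carry could look like in Python. The attribute pools, their sizes, and the simple log2 accounting are assumptions made for illustration; the paper's actual construction and bit-complexity measure are more careful than this.

```python
import math
import random

def synthesize_bios(n_people=1000, seed=0):
    """Toy generator of (name, attribute, value) knowledge tuples in the spirit
    of the paper's synthetic biographies. Pools and sizes are made up."""
    rng = random.Random(seed)
    pools = {
        "birth_year": [str(y) for y in range(1900, 2000)],  # 100 possible values
        "birth_city": [f"City{i}" for i in range(200)],      # 200 possible values
        "major":      [f"Major{i}" for i in range(100)],     # 100 possible values
    }
    tuples, total_bits = [], 0.0
    for i in range(n_people):
        name = f"Person{i}"
        for attr, pool in pools.items():
            value = rng.choice(pool)
            tuples.append((name, attr, value))
            # A uniformly random choice out of |pool| values carries log2(|pool|) bits.
            total_bits += math.log2(len(pool))
    return tuples, total_bits

tuples, bits = synthesize_bios()
print(f"{len(tuples)} tuples, ~{bits:,.0f} bits of knowledge in the dataset")
```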
So what did they establish from their experiments? Their findings were very clean. They consistently saw about two bits of knowledge stored per parameter, across different architectures, different quantization settings, and different efficiency techniques. This two-bits-per-parameter figure is the general rule they came up with from their experiments, and it was also consistent across model scales, be it two-billion-parameter models, nine-billion-parameter ones, and so on. It was validated on different architectures, GPT-2, LLaMA, Mistral, and mixture-of-experts variants, and, I guess, on two quantization settings, going from 16- or 32-bit floating point down to int8 and also int4. So roughly speaking, every billion parameters stores about two billion bits of information, which means a seven-billion-parameter model can encode about 14 billion bits of knowledge. That roughly translates to all of the English Wikipedia articles available plus textbook knowledge.
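Here is a back-of-the-envelope illustration of what the two-bits-per-parameter rule of thumb implies. The bits-per-fact figure below is an assumption for illustration, not a number from the paper.

```python
# ~2 bits of knowledge per parameter (the paper's rule of thumb)
n_params = 7e9                    # a 7B-parameter model
capacity_bits = 2 * n_params      # ~1.4e10 bits of knowledge

# Assume a single (name, attribute, value) fact carries a few dozen bits.
bits_per_fact = 30
print(f"{capacity_bits:.1e} bits ~ {capacity_bits / bits_per_fact:.2e} atomic facts")
```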
Yeah, so I have a question for you. Did they give a real-world example of what two bits of knowledge is? Maybe I missed this, but what is two bits of knowledge in the real world? Do you know what I mean? Like, if I had two bits of knowledge, would that be a sentence, or a couple of words?

Sure. From what I understood from the paper, two bits of knowledge corresponds to this structured representation from the dataset.

Oh, interesting, so it would be somebody's name and an attribute about them.

Yeah, partly because the dataset was constructed that way: the synthetic data was generated from personal biographies, and they measured whether the model could represent this structured form. So my understanding is that two bits of information corresponds to this structured representation, that is, whether the model is able to extract this representation or not.

Cool, thank you.
Okay, so if we dive a little bit into the architectures they experimented with, we see them scaling up from GPT-2 to LLaMA, across these two scales. An interesting finding from this part was that GPT-2, when combined with rotary embeddings, performed on par with the LLaMA and Mistral models on these knowledge-extraction tasks. That was a bit of information I found interesting. I'm not sure why this happens, actually, but one of the reasons may be that the LLaMA and Mistral models have a somewhat complicated gated MLP inside their architectures, which makes training a bit unstable; I think that's what they said in the paper. GPT-2, in contrast, has a simpler architecture compared to LLaMA and Mistral. So when they essentially just swapped the position embedding layers, that was the interesting finding.
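For reference, rotary embeddings replace GPT-2's learned absolute position embeddings by rotating query and key feature pairs through a position-dependent angle before attention. A minimal PyTorch sketch of the idea, not the authors' implementation, might look like this:

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to queries or keys of shape
    (batch, seq_len, n_heads, head_dim); head_dim must be even.
    A minimal sketch of the idea, not the exact code used in the paper."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per pair of feature dimensions.
    freqs = torch.pow(base, -torch.arange(half, dtype=torch.float32) / half)
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), freqs)
    cos = angles.cos()[None, :, None, :]   # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```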
The other part was the effect of sparsity on storage capacity. For that, they experimented with MoE architectures, which represent knowledge in a very sparse form, and they found that this sparsity was actually very helpful for representing knowledge.
And if we go a bit deeper into the different kinds of layers, they did a few ablations, for example on the MLP layers. One of the variations fully removed the MLP layers from the transformer; another just reduced the size of the MLP layers. What they found was that completely removing the MLP layers hinders knowledge storage capacity a lot: performance degrades about 1.5 times compared to simply reducing the size of the MLP layers. That was interesting to me, because it means knowledge is not concentrated in certain layers of a language model. Rather, knowledge is distributed across the entire architecture, and removing those layers outright won't do us any good.
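As a toy illustration of that kind of ablation, here is a hypothetical transformer block in which the MLP can be shrunk via a width multiplier or removed entirely by setting it to zero. Causal masking and other details are omitted, and this is not the paper's code:

```python
import torch.nn as nn

class Block(nn.Module):
    """Toy transformer block: mlp_ratio=4.0 is the usual size, smaller values
    shrink the MLP, and mlp_ratio=0 removes it, mirroring the ablation."""
    def __init__(self, d_model: int = 768, n_heads: int = 12, mlp_ratio: float = 4.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.has_mlp = mlp_ratio > 0
        if self.has_mlp:
            hidden = int(d_model * mlp_ratio)
            self.ln2 = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(
                nn.Linear(d_model, hidden), nn.GELU(), nn.Linear(hidden, d_model)
            )

    def forward(self, x):
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        if self.has_mlp:
            x = x + self.mlp(self.ln2(x))
        return x
```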
The other comparison they did was on quantization. What we see is that if they quantize from 16- or 32-bit floats down to 8 bits, there is no significant capacity loss, which is good: it means we can have very small and compact models with the same knowledge representation as their big and bulky counterparts. Interestingly, though, if you quantize further, down to 4 bits, there is a drastic, roughly two-times reduction in capacity. So if you want a model that represents the knowledge in your training data, or I guess your in-context data, efficiently, for now you wouldn't want to consider 4-bit variants of LLMs.
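To build some intuition for why 4-bit quantization is so much harsher than 8-bit, here is a small, purely illustrative symmetric fake-quantization sketch; the paper's actual quantization recipe may differ:

```python
import torch

def fake_quantize(w: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Symmetric per-tensor fake quantization: round weights onto a grid of
    2**(n_bits - 1) - 1 levels per sign, then map back to float. Illustrative only."""
    qmax = 2 ** (n_bits - 1) - 1          # 127 for int8, 7 for int4
    scale = w.abs().max() / qmax
    return (w / scale).round().clamp(-qmax, qmax) * scale

w = torch.randn(4096, 4096)
for bits in (8, 4):
    err = (w - fake_quantize(w, bits)).abs().mean()
    print(f"int{bits}: mean absolute rounding error {err:.4f}")
```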
And as I said earlier, MoE architectures are very efficient for representing knowledge; one reason could be the nature of their sparse architecture. I'm not an expert on MoE architectures and have studied them very little, so if others present tonight can elaborate on why sparsity might be a good way to represent knowledge, please do.

Okay, the other interesting aspect that came out of their experimentation: for a language model to reach that two-bits-per-parameter capacity efficiently, each particular piece of knowledge had to be seen during training about a thousand times, a thousand exposures of that data point. When I first read this, I thought it meant something like running a thousand epochs so the model sees the entire dataset a thousand times, but that's not it. From my understanding, it's that the particular data point, for example "Anya Forger was born on December 10, 1996," this particular knowledge representation, or this particular sentence in the training data, should be exposed to the model a thousand times. That was interesting. And if you reduce it from a thousand exposures to a hundred, you see a drop of about one bit per parameter in capacity, so there is something like a linear relationship between the number of times knowledge is exposed to the model and the capacity.
Now, quickly, on data quality at training time. By "junk" they mean low-quality data. They found that a seven-to-one ratio of low-quality to useful data reduces the capacity for useful knowledge dramatically, by as much as twentyfold in some of their settings, so this has a very critical impact on model efficiency. Unfortunately, I was very interested in whether the authors would propose a good rule of thumb for this kind of data mixture, low-quality versus high-quality data in training, but I think they just experimented with different data mixtures and reported what they found. We don't get a recommendation from them on what the most efficient data mixture for training models is.
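And a tiny sketch of the mixture knob being discussed, just to make the seven-to-one ratio concrete. The function and variable names are made up, and the paper sweeps several ratios rather than fixing one:

```python
import random

def sample_mixture(useful_docs, junk_docs, junk_to_useful=7, n_docs=10_000, seed=0):
    """Draw a training stream in which junk documents outnumber useful ones
    by roughly junk_to_useful : 1."""
    rng = random.Random(seed)
    p_junk = junk_to_useful / (junk_to_useful + 1)
    return [
        rng.choice(junk_docs) if rng.random() < p_junk else rng.choice(useful_docs)
        for _ in range(n_docs)
    ]
```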
One interesting thing I also read this week, when I went over the MiniMax paper that was released at the start of the week: they also trained on a mixture of low-quality and high-quality data, but they found that if they eliminated low-quality data completely, they saw a performance degradation, which was a bit counterintuitive to me. So they also suggested that keeping a good mix of low-quality and high-quality data actually improves performance. I guess it's because, at the right ratio, the low-quality data kind of serves as a data augmentation technique for the language model. So eliminating low-quality data entirely won't do us any good; that's the takeaway from this part.
The most interesting part of this is that during training, language models can actually identify sources of useful information if you just prepend the source name in the training data. What they did was test this by prepending wikipedia.org, or other trusted sources, to their training data, and they saw that it actually helps the model prioritize high-quality data and pay more attention to the knowledge coming from that kind of data during training.
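A minimal sketch of what that kind of tagging could look like in practice, prepending the source domain to each pre-training document. The field names and formatting here are hypothetical; the paper's exact scheme may differ:

```python
def prepend_source(example: dict) -> str:
    """Prefix each training document with its source domain so the model can
    learn which sources tend to carry reliable knowledge."""
    return f"{example['source_domain']}\n{example['text']}"

doc = {
    "source_domain": "wikipedia.org",
    "text": "Anya Forger was born on December 10, 1996 in Germany.",
}
print(prepend_source(doc))
# wikipedia.org
# Anya Forger was born on December 10, 1996 in Germany.
```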
I guess that is a very high-level overview of the paper. Some of the interesting research and engineering directions from it: I would be very excited to see more blogs or more research published on parameter-efficient techniques that target the scaling-law findings from this paper, or our specific knowledge needs; more suggestions or research on architecture selection for the deployment constraints of these kinds of models; and more efficient quantization techniques that can preserve or maximize knowledge representation.

Another interesting direction would be to explore data engineering and data curation techniques for the pre-training and post-training stages: what the optimal exposure or repetition ratios would be during pre-training and post-training, and what kinds of data you should fine-tune on if you want to support knowledge of different kinds. One tagging technique the authors showed us is that simply prepending the names of your sources in the data helps knowledge; there might be other techniques as well, which would be interesting to explore. And of course, with these findings in mind, more open-source software that can help us with data-quality detection would be interesting too. So yeah, that concludes this paper for now.
Do you want me to move on to the next part, or...

I think there are a bunch of questions we could go through, if people want to unmute and ask. Eugene, it looks like there's a bit of a backlog.

Oh yeah, sure.

Okay, I guess maybe I can just go through some of the questions. On the two bits per parameter: does this mean all parameters, including embeddings, fully connected layers, and so on? I think you sort of answered that, and we've had a lot of chat discussion here as well, through the ablations on how the MLP and attention layers are essential. Another question, from Pio: how was this synthetic dataset used to get the two-bits-per-parameter finding? The data is synthetic, so the models will not train on it, right? Pio, if you want to unmute yourself and explain your question a bit more.

Yeah, thank you.
So, I'm not sure I understand exactly. At the beginning you presented this tuple structure, right, name, attribute, value, and from my understanding that was synthetically generated, and later they were testing different models, GPT, LLaMA, and so on. So how exactly did they get from this tuple format to the two bits per parameter? That part is not clear to me.
Oh, so you're asking how this representation translates into bits per parameter, right?

Or maybe how exactly they found the two bits per parameter, because it's unclear to me; the models are pre-trained, right?

Oh no, the models are actually not pre-trained; they trained the models from scratch using these synthetic datasets. All of the datasets they generated came from synthetically generated biographies. To give an example: "My name is Anya Forger. I was born on 10 December 1996. I was born in Germany, and I studied there." That is one instance of the factual dataset. They generated, I guess, a few million of these, and then they pre-trained and fine-tuned on these synthetic datasets specifically. The reason this was done is so that the model only learns the knowledge stored in these datasets, because they wanted to test the model's ability to extract knowledge, or, to put it better, to extract structured knowledge, from this data. Using a synthetic dataset was essential so that no other external factors could affect the model's behavior. So in all of their experiments the models were not pre-trained on anything else; they were trained from scratch, or fine-tuned, using these synthetic datasets only. And I think your question was how they go from these sentences to encoding only two bits of information. Was that your question?

Yeah, I think I now get the training part, so let me ask about the evaluation. You mentioned extraction. Was that just prompting, starting with the name, where the LLM needed to complete the correct attribute and value, and that means the tuple is stored correctly? Or how exactly was this extraction executed?
Oh, if I remember correctly, and anyone can correct me if I'm wrong: there was the base synthetic biography dataset, which is plain sentences, and from those biographies they also generated question-answer pairs. So if the bio said "my name is such-and-such and I was born there," the corresponding question-answer pairs from that bio would be things like "Who is Anya Forger?", "Where was she born?", "Where did she study?" So they had question-answer pairs derived from the biography data as well. As for how they trained, the exact experimental setup was that they trained on mixtures of those biographies and their question-answer pairs, in different proportions. And for evaluation: since the question-answer pairs were generated from the biographies, you also have the answer key available, so you can use that for evaluation. If the model performs well on that evaluation set, what they explain is that it means the model was able to encode this structure efficiently. And how did they establish where this structure lives? If I remember correctly, they also did different probing techniques inside the model architecture. Since the model was not trained on anything other than this biography dataset, with techniques like P-probing, which is positional probing, they were able to see that this knowledge was encoded at certain positions in the layers. So a model that had very high evaluation performance was actually probed, and it was seen that it encoded this tuple structure efficiently. Does that answer your question?
Yeah, really well. Thank you so much, now I really understand it. Thank you.
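For a concrete flavor of the evaluation setup described above, here is a tiny sketch that turns one synthetic biography record into question-answer pairs with a built-in answer key. The field names, templates, and values are all made up for illustration:

```python
def qa_pairs_from_bio(bio: dict):
    """Derive (question, answer) pairs from one synthetic biography record,
    so the generated answers double as the evaluation answer key."""
    name = bio["name"]
    templates = {
        "birth_date": f"When was {name} born?",
        "birth_city": f"Where was {name} born?",
        "university": f"Where did {name} study?",
    }
    return [(q, bio[attr]) for attr, q in templates.items() if attr in bio]

bio = {
    "name": "Anya Forger",
    "birth_date": "December 10, 1996",
    "birth_city": "Berlin",
    "university": "TU Munich",
}
for question, answer in qa_pairs_from_bio(bio):
    print(question, "->", answer)
```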
Anyone else have outstanding questions from the chat that they feel weren't addressed? Don't be shy. Okay, I guess I can do one from Jean-Marc Sommet: is it the mix of data or the quantity of data that improves the capacity? I also think you mentioned something about there being a sweet spot whereby having some amount of low-quality data actually improves capacity. Could you talk a little bit more about that? What is that amount, and what ratio?
Sure. From the paper, I actually don't remember a specific dataset size they experimented with, but they did mention that they experimented with different mixtures of high-quality and low-quality data, junk versus useful data. One of the figures I quoted was this one: they conducted systematic experiments on different mixture ratios, and at one extreme, if you use a seven-to-one ratio of junk to useful data, there is a significant knowledge capacity degradation, around twentyfold. So what this means is that you do need a bit of low-quality, or repetitive, data, if I may put it that way, but the ratio should be very small. However, as I was saying, there is no sweet spot mentioned in the paper, and I was a bit sad about that, because I was really looking forward to results like that. I think anyone in this room would agree: whenever we try to train or pre-train anything, a small transformer, or even a small image-classification or text-classification model, the training data ratio or size we come up with is very much a matter of hoping and praying that it just works. I was really hoping to get some insight there, and maybe in future experiments or studies we can get more enlightenment on this part. But from this paper, I don't think a clearly optimal mixture was given. What they do suggest is that you cannot go too extreme with your junk data, and you also cannot fully eliminate your low-quality data; you need a mixture, and you need to arrive at the optimal mixture from a few of your own experiments.

Gotcha, thank you so much.

Thank you very much.
Very interesting questions, actually; they're really pushing me to think hard.
I guess there's one logistics question about whether you could share the slide deck; maybe you could just drop it in the Discord chat.

I actually thought, who made these slides? Did you make them?

I think the profs themselves.

Okay, I was wondering. Sorry, come again, please?

Can we share the slides, or how do we get the slides?

Sure, I can upload them on Discord after this is over.

There are also questions about how to join Discord and how to get the call link; I guess a lot of folks are joining from somewhere else, so let me just share the link right now. I'm not sure how you get a Discord alias; it should be under server settings, I'll figure it out. Here's the Discord link. And maybe I should also share the link to the paper club channel, because that's really where most of the discussions happen.
I guess there's a question from Colette. Colette, do you want to ask it? Okay, maybe I'll just read it on your behalf... oh, go ahead.

Yeah. It's interesting that we can store two bits of knowledge per parameter, but how useful is the stored information, or the stored knowledge? Because some of the evaluation metrics look at the practical application: does it really add value, or is there a lot of noise? I don't know. Was there any discussion of how we do knowledge extraction, in terms of data curation and model architecture?

I don't think I understand the question, actually. Can you please simplify it?

Yeah, it's simply: do we have a task-optimized knowledge representation? I know that we can store two bits of knowledge per parameter, but the question is, in a real-world application, can we control the knowledge that we store or not?
Okay, so this study is really a very expository kind of study of language models, and I guess the main aim was to uncover whether there are laws, any kind of mathematical laws, behind these very common things we casually throw around while using LLMs, like "this model is very knowledgeable" or "this model is very creative." From my understanding, those are very subjective words to describe very capable systems, and I guess the researchers felt this does those systems a disservice: why don't we try to understand them more, and in their entirety, instead of projecting our own subjective words like creative or knowledgeable or good? So, from my understanding, this kind of study is a direction toward understanding, or mathematically trying to formulate, these terms we throw around, like "knowledgeable LLMs." They conducted this study, I guess, to show us that if you can control the conditions around a language model, you can study different aspects of it, and this paper shows how to study these few aspects of knowledge representation. Of course, the research community can expand on these studies to look at other kinds of knowledge, or other aspects of LLMs; there are lots of other aspects, and for example we could come up with our own study of creativity. As for whether you can translate these numbers as they are into the real world, and how much they translate, I'm not sure, because the studies you see here were, in their own words, done in very controlled environments, and this is what they came up with. How much do they apply in the real world? Maybe not entirely, not 100 percent, but they give us a certain framework to start thinking with, and to make decisions with, when choosing models or selecting training data for LLMs.
I think we only had time for one paper, then. Yeah, this turned out to be a long discussion, with lots of questions that we didn't even get to; RJ and Eric and Eugene were all getting questions. But I want to respect everyone's time. That was really, really good. I think there's a lot of interest in separating knowledge and intelligence, so the ablations are always important, and I always think of this as a gift to the community.

Thank you very much for having me. Those were amazing questions and I also learned a lot. I will share the slides in the Discord; please feel free to correct anything you see in the slides, and let me know if you have any other suggestions. Thank you once again for having me.

Awesome. We don't have a paper for next week yet. We could get Sharmay back for part two, but I also think a lot of people want to talk about DeepSeek.

Definitely very hot. Is that synonymous with R1? I haven't been keeping up, honestly.

Yeah, of course, R1. V3 was also trained on R1. I assume R1 is the topic, because it is the most important paper of the year so far. So, we have a volunteer for next week, don't we?

We have a volunteer? Oh, I don't know.

That's what I thought from the Discord, but maybe I misread that.

No, that was a joke.

Oh, okay. We have R1 next week then. Is it Vibhu? Maybe Vibhu, I don't know. Anyway, we'll settle this on Discord. Okay, we have to go. Thanks, everyone. Bye. Thank you.