The Physics of Language Models: Knowledge Capacity Scaling Laws

We can go whenever you're ready, and I think people will just fire up the chat.

What time is it for you? Where are you?

I'm in Dhaka, Bangladesh, so it's about 2 a.m. for me.

Oh my god. Okay, we've done paper clubs in Singapore when it was 4 a.m., so I feel the pain. Thank you, that's dedication, love it. Well, since you're up, we can get started.

Okay, just give me a second and I'll share my screen.

I feel like we have this backlog, and I'm going to try to clear it so that we have new papers to cover. We can also cover timeless papers, which is always helpful. Are you able to figure out the sharing? People who use Zoom in the browser have to restart the browser, which is annoying; I know you had to do that recently.

Is that so? Okay. All right, go for it.

Okay, so everyone can see my screen? All right.
Yep.

Okay, so let's start. Once again, thank you everyone for joining and allowing me to present this paper. Today I want to discuss the third edition, which I believe is the last paper of the knowledge part of the Physics of Language Models series. The researchers divided the series broadly into three categories; they wanted to understand intelligence in three different areas. One was knowledge, another was logical or mathematical reasoning, and the third was understanding the grammar of language itself. This paper is the last part of the knowledge series from those researchers. In this edition they explore the knowledge capacity and scaling laws of large language models.
The core problem they are addressing is this: we currently see amazing capabilities from LLMs. They can store very large amounts of knowledge; you can talk to them like search engines and ask them all kinds of questions. However, there is a lack of precise understanding of the storage capacity of these models. For a model with a certain number of parameters, say seven billion or two billion, how much knowledge does it actually store? Before these researchers published their findings, we did not have a clear, quantitative understanding of this. They also try to uncover the relationship, something like a mathematical formula, between knowledge storage and model size. Put as a few broad questions: how much knowledge does a language model store per parameter, what relationship do we see between model size and knowledge capacity, and what factors affect this storage capacity?
Before this research, they argue, most studies in this area focused on metrics like loss and perplexity. Those are good and nice, but they don't give us a very granular understanding. What does a perplexity score of 3.2 actually mean? If I give the model a very big document, can it answer my factual questions accurately? You cannot answer that kind of question from those metrics alone. Those metrics also don't account for all the different kinds of knowledge that occur in daily life, for example structured versus unstructured knowledge. How do you measure a language model's capacity to understand and demonstrate those kinds of knowledge?
If I may say why this kind of study is useful, a few of the reasons are these. When you are trying to deploy or choose LLMs for certain use cases, for instance in-context learning or RAG over your enterprise documents or any other kind of knowledge store, how do you actually go about choosing among models? Say you chose a moderately sized model, a two-billion or eight- or nine-billion-parameter model, that fit within your budget constraints, you tested it on your dataset, and you found some inconsistency. For example, it could understand simple factual questions, like what company A is or when it was established if you have data on company A, but it kept failing on more complex queries: when you phrase the question in a different way, when you include more background knowledge in the question, or when you use the model in a language for which the benchmarks were not clearly reported. How do you then iterate, choose different models, or improve on these models, whether you are training them or only running inference? And based on findings like these, how do you choose the best model size to go with, and what kind of resources should you plan for?
and um what kind of resources should you plan for those so um the researchers uh of this paper 00:10:46.400 |
um what they came up with is they came up with a kind of like a framework for understanding these 00:10:54.880 |
questions uh in a very structured or of they tried to give us a framework for this um for 00:11:05.120 |
these kind of scenarios um so what they do is uh they wanted to measure knowledge at the bit level 00:11:16.000 |
and uh so how they do it is that they constructed synthetic uh data sets synthetic data sets that 00:11:29.280 |
has a very um very defined structure for example um uh they have a knowledge tuple format which 00:11:40.560 |
consists of the name and attribute and value and it has a certain number of bits that it represents 00:11:51.680 |
and the entire data set that they wanted to um represent it that that they 00:11:58.480 |
um that they constructed has a certain number of bits and 00:12:05.520 |
so and using controlled um this kind of controlled experimentation 00:12:15.280 |
um um allowed them to quantify those questions that we asked in the um at the start of this 00:12:25.120 |
talk in terms of like bits per parameter and and they wanted to like um and allow them to 00:12:36.800 |
come up with numbers for that can help us determine storage efficiency knowledge storage 00:12:46.400 |
efficiency of different models in terms of numbers and also by generating this kind of like synthetic 00:12:56.880 |
data uh this allows us and this allows us to like eliminate possibilities of um like external 00:13:07.120 |
factors that um if the data set was constructed in a different way um that which could like um 00:13:18.480 |
inhibit uh this uh which could like um like effect uh or not like give us a like a very 00:13:31.760 |
clear not not like help us in a establishing a very clear relationship of 00:13:39.520 |
uh the only the metrics that we wanted to understand so um creating synthetic 00:13:47.280 |
data was crucial synthetic data in this format was crucial for that 00:14:04.320 |
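As a purely illustrative sketch of the setup being described, here is what generating (name, attribute, value) tuples and counting how many bits of knowledge they carry could look like in Python. The attribute pools, their sizes, and the simple log2 accounting are assumptions made for illustration; the paper's actual construction and bit-complexity measure are more careful than this.

```python
import math
import random

def synthesize_bios(n_people=1000, seed=0):
    """Toy generator of (name, attribute, value) knowledge tuples in the spirit
    of the paper's synthetic biographies. Pools and sizes are made up."""
    rng = random.Random(seed)
    pools = {
        "birth_year": [str(y) for y in range(1900, 2000)],  # 100 possible values
        "birth_city": [f"City{i}" for i in range(200)],      # 200 possible values
        "major":      [f"Major{i}" for i in range(100)],     # 100 possible values
    }
    tuples, total_bits = [], 0.0
    for i in range(n_people):
        name = f"Person{i}"
        for attr, pool in pools.items():
            value = rng.choice(pool)
            tuples.append((name, attr, value))
            # A uniformly random choice out of |pool| values carries log2(|pool|) bits.
            total_bits += math.log2(len(pool))
    return tuples, total_bits

tuples, bits = synthesize_bios()
print(f"{len(tuples)} tuples, ~{bits:,.0f} bits of knowledge in the dataset")
```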
So what did they establish from their experiments? Their findings were very clean. They consistently saw about two bits of knowledge stored per parameter, across different architectures, different quantization settings, and different efficiency techniques. This two-bits-per-parameter figure is the general rule they came up with from their experiments, and it was also consistent across model scales, be it two-billion-parameter models, nine-billion-parameter ones, and so on. It was validated on different architectures, GPT-2, LLaMA, Mistral, and mixture-of-experts variants, and, I guess, on two quantization settings, going from 16- or 32-bit floating point down to int8 and also int4. So roughly speaking, every billion parameters stores about two billion bits of information, which means a seven-billion-parameter model can encode about 14 billion bits of knowledge. That roughly translates to all of the English Wikipedia articles available plus textbook knowledge.
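Here is a back-of-the-envelope illustration of what the two-bits-per-parameter rule of thumb implies. The bits-per-fact figure below is an assumption for illustration, not a number from the paper.

```python
# ~2 bits of knowledge per parameter (the paper's rule of thumb)
n_params = 7e9                    # a 7B-parameter model
capacity_bits = 2 * n_params      # ~1.4e10 bits of knowledge

# Assume a single (name, attribute, value) fact carries a few dozen bits.
bits_per_fact = 30
print(f"{capacity_bits:.1e} bits ~ {capacity_bits / bits_per_fact:.2e} atomic facts")
```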
Yeah, so I have a question for you. Did they give a real-world example of what two bits of knowledge is? Maybe I missed this, but what is two bits of knowledge in the real world? Do you know what I mean? Like, if I had two bits of knowledge, would that be a sentence, or a couple of words?

Sure. From what I understood from the paper, two bits of knowledge corresponds to this structured representation from the dataset.

Oh, interesting, so it would be somebody's name and an attribute about them.

Yeah, partly because the dataset was constructed that way: the synthetic data was generated from personal biographies, and they measured whether the model could represent this structured form. So my understanding is that two bits of information corresponds to this structured representation, that is, whether the model is able to extract this representation or not.

Cool, thank you.
Okay, so if we dive a little bit into the architectures they experimented with, we see them scaling up from GPT-2 to LLaMA, across these two scales. An interesting finding from this part was that GPT-2, when combined with rotary embeddings, performed on par with the LLaMA and Mistral models on these knowledge-extraction tasks. That was a bit of information I found interesting. I'm not sure why this happens, actually, but one of the reasons may be that the LLaMA and Mistral models have a somewhat complicated gated MLP inside their architectures, which makes training a bit unstable; I think that's what they said in the paper. GPT-2, in contrast, has a simpler architecture compared to LLaMA and Mistral. So when they essentially just swapped the position embedding layers, that was the interesting finding.
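For reference, rotary embeddings replace GPT-2's learned absolute position embeddings by rotating query and key feature pairs through a position-dependent angle before attention. A minimal PyTorch sketch of the idea, not the authors' implementation, might look like this:

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to queries or keys of shape
    (batch, seq_len, n_heads, head_dim); head_dim must be even.
    A minimal sketch of the idea, not the exact code used in the paper."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per pair of feature dimensions.
    freqs = torch.pow(base, -torch.arange(half, dtype=torch.float32) / half)
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), freqs)
    cos = angles.cos()[None, :, None, :]   # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```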
The other part was the effect of sparsity on storage capacity. For that, they experimented with MoE architectures, which represent knowledge in a very sparse form, and they found that this sparsity was actually very helpful for representing knowledge.
And if we go a bit deeper into the different kinds of layers, they did a few ablations, for example on the MLP layers. One of the variations fully removed the MLP layers from the transformer; another just reduced the size of the MLP layers. What they found was that completely removing the MLP layers hinders knowledge storage capacity a lot: performance degrades about 1.5 times compared to simply reducing the size of the MLP layers. That was interesting to me, because it means knowledge is not concentrated in certain layers of a language model. Rather, knowledge is distributed across the entire architecture, and removing those layers outright won't do us any good.
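As a toy illustration of that kind of ablation, here is a hypothetical transformer block in which the MLP can be shrunk via a width multiplier or removed entirely by setting it to zero. Causal masking and other details are omitted, and this is not the paper's code:

```python
import torch.nn as nn

class Block(nn.Module):
    """Toy transformer block: mlp_ratio=4.0 is the usual size, smaller values
    shrink the MLP, and mlp_ratio=0 removes it, mirroring the ablation."""
    def __init__(self, d_model: int = 768, n_heads: int = 12, mlp_ratio: float = 4.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.has_mlp = mlp_ratio > 0
        if self.has_mlp:
            hidden = int(d_model * mlp_ratio)
            self.ln2 = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(
                nn.Linear(d_model, hidden), nn.GELU(), nn.Linear(hidden, d_model)
            )

    def forward(self, x):
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        if self.has_mlp:
            x = x + self.mlp(self.ln2(x))
        return x
```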
The other comparison they did was on quantization. What we see is that if they quantize from 16- or 32-bit floats down to 8 bits, there is no significant capacity loss, which is good: it means we can have very small and compact models with the same knowledge representation as their big and bulky counterparts. Interestingly, though, if you quantize further, down to 4 bits, there is a drastic, roughly two-times reduction in capacity. So if you want a model that represents the knowledge in your training data, or I guess your in-context data, efficiently, for now you wouldn't want to consider 4-bit variants of LLMs.
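To build some intuition for why 4-bit quantization is so much harsher than 8-bit, here is a small, purely illustrative symmetric fake-quantization sketch; the paper's actual quantization recipe may differ:

```python
import torch

def fake_quantize(w: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Symmetric per-tensor fake quantization: round weights onto a grid of
    2**(n_bits - 1) - 1 levels per sign, then map back to float. Illustrative only."""
    qmax = 2 ** (n_bits - 1) - 1          # 127 for int8, 7 for int4
    scale = w.abs().max() / qmax
    return (w / scale).round().clamp(-qmax, qmax) * scale

w = torch.randn(4096, 4096)
for bits in (8, 4):
    err = (w - fake_quantize(w, bits)).abs().mean()
    print(f"int{bits}: mean absolute rounding error {err:.4f}")
```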
And as I said earlier, MoE architectures are very efficient for representing knowledge; one reason could be the nature of their sparse architecture. I'm not an expert on MoE architectures and have studied them very little, so if others present tonight can elaborate on why sparsity might be a good way to represent knowledge, please do.

Okay, the other interesting aspect that came out of their experimentation: for a language model to reach that two-bits-per-parameter capacity efficiently, each particular piece of knowledge had to be seen during training about a thousand times, a thousand exposures of that data point. When I first read this, I thought it meant something like running a thousand epochs so the model sees the entire dataset a thousand times, but that's not it. From my understanding, it's that the particular data point, for example "Anya Forger was born on December 10, 1996," this particular knowledge representation, or this particular sentence in the training data, should be exposed to the model a thousand times. That was interesting. And if you reduce it from a thousand exposures to a hundred, you see a drop of about one bit per parameter in capacity, so there is something like a linear relationship between the number of times knowledge is exposed to the model and the capacity.
Now, quickly, on data quality at training time. By "junk" they mean low-quality data. They found that a seven-to-one ratio of low-quality to useful data reduces the capacity for useful knowledge dramatically, by as much as twentyfold in some of their settings, so this has a very critical impact on model efficiency. Unfortunately, I was very interested in whether the authors would propose a good rule of thumb for this kind of data mixture, low-quality versus high-quality data in training, but I think they just experimented with different data mixtures and reported what they found. We don't get a recommendation from them on what the most efficient data mixture for training models is.
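And a tiny sketch of the mixture knob being discussed, just to make the seven-to-one ratio concrete. The function and variable names are made up, and the paper sweeps several ratios rather than fixing one:

```python
import random

def sample_mixture(useful_docs, junk_docs, junk_to_useful=7, n_docs=10_000, seed=0):
    """Draw a training stream in which junk documents outnumber useful ones
    by roughly junk_to_useful : 1."""
    rng = random.Random(seed)
    p_junk = junk_to_useful / (junk_to_useful + 1)
    return [
        rng.choice(junk_docs) if rng.random() < p_junk else rng.choice(useful_docs)
        for _ in range(n_docs)
    ]
```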
One interesting thing I also read this week, when I went over the MiniMax paper that was released at the start of the week: they also trained on a mixture of low-quality and high-quality data, but they found that if they eliminated low-quality data completely, they saw a performance degradation, which was a bit counterintuitive to me. So they also suggested that keeping a good mix of low-quality and high-quality data actually improves performance. I guess it's because, at the right ratio, the low-quality data kind of serves as a data augmentation technique for the language model. So eliminating low-quality data entirely won't do us any good; that's the takeaway from this part.
The most interesting part of this is that during training, language models can actually identify sources of useful information if you just prepend the source name in the training data. What they did was test this by prepending wikipedia.org, or other trusted sources, to their training data, and they saw that it actually helps the model prioritize high-quality data and pay more attention to the knowledge coming from that kind of data during training.
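A minimal sketch of what that kind of tagging could look like in practice, prepending the source domain to each pre-training document. The field names and formatting here are hypothetical; the paper's exact scheme may differ:

```python
def prepend_source(example: dict) -> str:
    """Prefix each training document with its source domain so the model can
    learn which sources tend to carry reliable knowledge."""
    return f"{example['source_domain']}\n{example['text']}"

doc = {
    "source_domain": "wikipedia.org",
    "text": "Anya Forger was born on December 10, 1996 in Germany.",
}
print(prepend_source(doc))
# wikipedia.org
# Anya Forger was born on December 10, 1996 in Germany.
```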
I guess that is a very high-level overview of the paper. Some of the interesting research and engineering directions from it: I would be very excited to see more blogs or more research published on parameter-efficient techniques that target the scaling-law findings from this paper, or our specific knowledge needs; more suggestions or research on architecture selection for the deployment constraints of these kinds of models; and more efficient quantization techniques that can preserve or maximize knowledge representation.

Another interesting direction would be to explore data engineering and data curation techniques for the pre-training and post-training stages: what the optimal exposure or repetition ratios would be during pre-training and post-training, and what kinds of data you should fine-tune on if you want to support knowledge of different kinds. One tagging technique the authors showed us is that simply prepending the names of your sources in the data helps knowledge; there might be other techniques as well, which would be interesting to explore. And of course, with these findings in mind, more open-source software that can help us with data-quality detection would be interesting too. So yeah, that concludes this paper for now.
Do you want me to move on to the next part, or...

I think there are a bunch of questions we could go through, if people want to unmute and ask. Eugene, it looks like there's a bit of a backlog.

Oh yeah, sure.

Okay, I guess maybe I can just go through some of the questions. On the two bits per parameter: does this mean all parameters, including embeddings, fully connected layers, and so on? I think you sort of answered that, and we've had a lot of chat discussion here as well, through the ablations on how the MLP and attention layers are essential. Another question, from Pio: how was this synthetic dataset used to get the two-bits-per-parameter finding? The data is synthetic, so the models will not train on it, right? Pio, if you want to unmute yourself and explain your question a bit more.

Yeah, thank you.
So, I'm not sure I understand exactly. At the beginning you presented this tuple structure, right, name, attribute, value, and from my understanding that was synthetically generated, and later they were testing different models, GPT, LLaMA, and so on. So how exactly did they get from this tuple format to the two bits per parameter? That part is not clear to me.
Oh, so you're asking how this representation translates into bits per parameter, right?

Or maybe how exactly they found the two bits per parameter, because it's unclear to me; the models are pre-trained, right?

Oh no, the models are actually not pre-trained; they trained the models from scratch using these synthetic datasets. All of the datasets they generated came from synthetically generated biographies. To give an example: "My name is Anya Forger. I was born on 10 December 1996. I was born in Germany, and I studied there." That is one instance of the factual dataset. They generated, I guess, a few million of these, and then they pre-trained and fine-tuned on these synthetic datasets specifically. The reason this was done is so that the model only learns the knowledge stored in these datasets, because they wanted to test the model's ability to extract knowledge, or, to put it better, to extract structured knowledge, from this data. Using a synthetic dataset was essential so that no other external factors could affect the model's behavior. So in all of their experiments the models were not pre-trained on anything else; they were trained from scratch, or fine-tuned, using these synthetic datasets only. And I think your question was how they go from these sentences to encoding only two bits of information. Was that your question?

Yeah, I think I now get the training part, so let me ask about the evaluation. You mentioned extraction. Was that just prompting, starting with the name, where the LLM needed to complete the correct attribute and value, and that means the tuple is stored correctly? Or how exactly was this extraction executed?
Oh, if I remember correctly, and anyone can correct me if I'm wrong: there was the base synthetic biography dataset, which is plain sentences, and from those biographies they also generated question-answer pairs. So if the bio said "my name is such-and-such and I was born there," the corresponding question-answer pairs from that bio would be things like "Who is Anya Forger?", "Where was she born?", "Where did she study?" So they had question-answer pairs derived from the biography data as well. As for how they trained, the exact experimental setup was that they trained on mixtures of those biographies and their question-answer pairs, in different proportions. And for evaluation: since the question-answer pairs were generated from the biographies, you also have the answer key available, so you can use that for evaluation. If the model performs well on that evaluation set, what they explain is that it means the model was able to encode this structure efficiently. And how did they establish where this structure lives? If I remember correctly, they also did different probing techniques inside the model architecture. Since the model was not trained on anything other than this biography dataset, with techniques like P-probing, which is positional probing, they were able to see that this knowledge was encoded at certain positions in the layers. So a model that had very high evaluation performance was actually probed, and it was seen that it encoded this tuple structure efficiently. Does that answer your question?
Yeah, really well. Thank you so much, now I really understand it. Thank you.
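For a concrete flavor of the evaluation setup described above, here is a tiny sketch that turns one synthetic biography record into question-answer pairs with a built-in answer key. The field names, templates, and values are all made up for illustration:

```python
def qa_pairs_from_bio(bio: dict):
    """Derive (question, answer) pairs from one synthetic biography record,
    so the generated answers double as the evaluation answer key."""
    name = bio["name"]
    templates = {
        "birth_date": f"When was {name} born?",
        "birth_city": f"Where was {name} born?",
        "university": f"Where did {name} study?",
    }
    return [(q, bio[attr]) for attr, q in templates.items() if attr in bio]

bio = {
    "name": "Anya Forger",
    "birth_date": "December 10, 1996",
    "birth_city": "Berlin",
    "university": "TU Munich",
}
for question, answer in qa_pairs_from_bio(bio):
    print(question, "->", answer)
```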
Anyone else have outstanding questions from the chat that they feel weren't addressed? Don't be shy. Okay, I guess I can do one from Jean-Marc Sommet: is it the mix of data or the quantity of data that improves the capacity? I also think you mentioned something about there being a sweet spot whereby having some amount of low-quality data actually improves capacity. Could you talk a little bit more about that? What is that amount, and what ratio?
Sure. From the paper, I actually don't remember a specific dataset size they experimented with, but they did mention that they experimented with different mixtures of high-quality and low-quality data, junk versus useful data. One of the figures I quoted was this one: they conducted systematic experiments on different mixture ratios, and at one extreme, if you use a seven-to-one ratio of junk to useful data, there is a significant knowledge capacity degradation, around twentyfold. So what this means is that you do need a bit of low-quality, or repetitive, data, if I may put it that way, but the ratio should be very small. However, as I was saying, there is no sweet spot mentioned in the paper, and I was a bit sad about that, because I was really looking forward to results like that. I think anyone in this room would agree: whenever we try to train or pre-train anything, a small transformer, or even a small image-classification or text-classification model, the training data ratio or size we come up with is very much a matter of hoping and praying that it just works. I was really hoping to get some insight there, and maybe in future experiments or studies we can get more enlightenment on this part. But from this paper, I don't think a clearly optimal mixture was given. What they do suggest is that you cannot go too extreme with your junk data, and you also cannot fully eliminate your low-quality data; you need a mixture, and you need to arrive at the optimal mixture from a few of your own experiments.

Gotcha, thank you so much.

Thank you very much.
Very interesting questions, actually; they're really pushing me to think hard.
I guess there's one logistics question about whether you could share the slide deck; maybe you could just drop it in the Discord chat.

I actually thought, who made these slides? Did you make them?

I think the profs themselves.

Okay, I was wondering. Sorry, come again, please?

Can we share the slides, or how do we get the slides?

Sure, I can upload them on Discord after this is over.

There are also questions about how to join Discord and how to get the call link; I guess a lot of folks are joining from somewhere else, so let me just share the link right now. I'm not sure how you get a Discord alias; it should be under server settings, I'll figure it out. Here's the Discord link. And maybe I should also share the link to the paper club channel, because that's really where most of the discussions happen.
I guess there's a question from Colette. Colette, do you want to ask it? Okay, maybe I'll just read it on your behalf... oh, go ahead.

Yeah. It's interesting that we can store two bits of knowledge per parameter, but how useful is the stored information, or the stored knowledge? Because some of the evaluation metrics look at the practical application: does it really add value, or is there a lot of noise? I don't know. Was there any discussion of how we do knowledge extraction, in terms of data curation and model architecture?

I don't think I understand the question, actually. Can you please simplify it?

Yeah, it's simply: do we have a task-optimized knowledge representation? I know that we can store two bits of knowledge per parameter, but the question is, in a real-world application, can we control the knowledge that we store or not?
Okay, so this study is really a very expository kind of study of language models, and I guess the main aim was to uncover whether there are laws, any kind of mathematical laws, behind these very common things we casually throw around while using LLMs, like "this model is very knowledgeable" or "this model is very creative." From my understanding, those are very subjective words to describe very capable systems, and I guess the researchers felt this does those systems a disservice: why don't we try to understand them more, and in their entirety, instead of projecting our own subjective words like creative or knowledgeable or good? So, from my understanding, this kind of study is a direction toward understanding, or mathematically trying to formulate, these terms we throw around, like "knowledgeable LLMs." They conducted this study, I guess, to show us that if you can control the conditions around a language model, you can study different aspects of it, and this paper shows how to study these few aspects of knowledge representation. Of course, the research community can expand on these studies to look at other kinds of knowledge, or other aspects of LLMs; there are lots of other aspects, and for example we could come up with our own study of creativity. As for whether you can translate these numbers as they are into the real world, and how much they translate, I'm not sure, because the studies you see here were, in their own words, done in very controlled environments, and this is what they came up with. How much do they apply in the real world? Maybe not entirely, not 100 percent, but they give us a certain framework to start thinking with, and to make decisions with, when choosing models or selecting training data for LLMs.
I think we only had time for one paper, then. Yeah, this turned out to be a long discussion, with lots of questions that we didn't even get to; RJ and Eric and Eugene were all getting questions. But I want to respect everyone's time. That was really, really good. I think there's a lot of interest in separating knowledge and intelligence, so the ablations are always important, and I always think of this as a gift to the community.

Thank you very much for having me. Those were amazing questions and I also learned a lot. I will share the slides in the Discord; please feel free to correct anything you see in the slides, and let me know if you have any other suggestions. Thank you once again for having me.

Awesome. We don't have a paper for next week yet. We could get Sharmay back for part two, but I also think a lot of people want to talk about DeepSeek.

Definitely very hot. Is that synonymous with R1? I haven't been keeping up, honestly.

Yeah, of course, R1. V3 was also trained on R1. I assume R1 is the topic, because it is the most important paper of the year so far. So, we have a volunteer for next week, don't we?

We have a volunteer? Oh, I don't know.

That's what I thought from the Discord, but maybe I misread that.

No, that was a joke.

Oh, okay. We have R1 next week then. Is it Vibhu? Maybe Vibhu, I don't know. Anyway, we'll settle this on Discord. Okay, we have to go. Thanks, everyone. Bye. Thank you.