Stanford CS25: V4 | Behind the Scenes of LLM Pre-training: StarCoder Use Case
Thank you for joining CS25 Transformers Day's last class. Today, we have Loubna, who is a machine learning engineer in the science team at Hugging Face, working on large language models for code and synthetic data. She's part of the core team of the BigCode project and has co-authored The Stack dataset and the StarCoder models. Thank you so much for coming to our talk today. And as always, the attendance link and the Slido questions are on our website, and we'll be taking questions from there.
I'm a machine learning engineer at Hugging Face in the science team, and today I'll tell you about the behind the scenes of LLM pre-training, using StarCoder as a use case. As you will see, my slides will be a series of questions and answers about how these models are trained.

When ChatGPT first came out, a lot of people thought that there was some secret sauce behind it, and that it would take the open source community a lot of time to catch up, because the open source models that we had back then were lagging far behind. But now it seems that the community has kind of figured out the recipe for training good LLMs. For example, today we have Llama 3 70B Instruct, which has almost the same performance as GPT-4. It also allows the community to build very cool use cases on top of it. So we've made a lot of progress in the open field, and this is not the only model that's out there.
We're now observing kind of a rise of open LLMs, with strong models being released regularly. That was the case, for example, with Google DeepMind's Gemma models. You can also see this on the leaderboards, which are kind of the go-to place for comparing these models. And you can see in this plot that as we went from 2023 to 2024, the gap between the closed models and the open models kept shrinking.

So we're on a very good path, but there are still some things missing, and this is mainly due to releases missing out on important details about how the data was processed and how the models were trained. This is usually the case for two main reasons. One is liability, because when companies publicly disclose the training data, they expose themselves to scrutiny and potential legal issues. The other reason for not disclosing the details is competitiveness: some companies want to stay the best at training LLMs, so they don't want to give away all the details of their training.
Nevertheless, because we have a lot of releases, we can get a good idea of the common recipe. Regarding architecture, I think transformers are now kind of the default, but there are also other interesting architectures out there, or you can use a mixture of experts, which can give you a better quality-versus-inference-cost trade-off. I think architecture is a topic that's already thoroughly explored, and there are other aspects that maybe deserve more attention. Regarding compute, there is not much I can tell you about that, except maybe to go ask Jensen for more GPUs.

But the part that I'm the most interested in is data, which I think is the backbone of LLMs. Because now almost everyone is using more or less the same architecture and training techniques, and for a given compute budget, data is what makes some models better than others. So it's really worth spending time exploring this data and understanding how to get the highest quality samples.

So now we're going to try to answer our previous question of how to train a good LLM by answering another one: how do we get good training data? First, we need to understand how much data we need. Then, once we've figured out the size of the data, where do we find it? And to clean it, which filtering techniques make the most sense?
To answer the first question, how much data we need, the answer is the scaling laws. You want to know how much data to train a model on, but also what the optimal size of the model is. The scaling laws study the allocation of a compute budget between data size and model size. I'm going to present a brief history of the scaling laws, because I think it's really interesting to see how the sizes of the models progressed through time, along with the data sets and the number of tokens we train them on.

I think the first to establish the scaling laws were Kaplan et al. at OpenAI. They tried to fit the laws as a function of the data size and the model size, and they concluded that when your compute increases, you should increase your parameter count much more aggressively than your training tokens, which you should only increase a little. This means that if you have more resources to train your models, you should mainly make them bigger. And this is what led to models like GPT-3, which is 175 billion parameters trained on only 300 billion tokens, which, if we think about it now, is very little data. So all these models were actually very under-trained.
Then came the Chinchilla paper, which revisited these laws. They found that the reason Kaplan concluded data should be scaled less was an issue with the learning rate schedule: although they were changing the data size, they were not using the cosine schedule that corresponded to each data size. When you fix that, you find that you should scale your data and your model size roughly equally. In their paper, they train Chinchilla, a 70 billion parameter model, on 1.4 trillion tokens, which is the Chinchilla optimal point, and it outperformed much larger models like GPT-3 and Gopher, which was over 200 billion parameters.

The way it works is that you fit the loss for different model and data sizes, and then you try to find the sweet spot that minimizes the loss for a given compute budget. And it tells you what your model size should be and how many tokens you should train on. As you can see here, if we try to fit the laws, there's a linear increase for data and also for model size as the compute budget grows. And you can see that, for example, the Chinchilla model, which is 70 billion parameters, was trained on less than 2 trillion tokens. Compare that to a much smaller model like LLaMA 1 7B: it was trained on almost as much data as the Chinchilla model, so it was trained way past its own Chinchilla optimal point.
And we might be wondering, why is that the case? Did Meta not use their compute budget in an optimal way? The answer is that compute optimal for training is not the only thing you care about. When you train a model, you don't only care about what you're going to spend in training, but also about what you will spend during inference, when you serve the model. This is why people prefer training smaller models for longer rather than stopping at the compute-optimal point. This was the case for LLaMA 1 and for other models like Mistral: as the model kept training past the optimal point, it kept improving. So you should not read the Chinchilla scaling laws as saying that compute optimal during training is what is optimal overall.

Think about the cost of serving something like GPT-4, whose training alone is estimated at around $100 million: the larger the model, the more time and money it takes to process tokens at inference. The scaling laws simply don't take the inference cost into account. So when we train smaller models for longer, we're not respecting the Chinchilla scaling laws; we're choosing to pay what we call a compute overhead. It's kind of a sacrifice that you make during training, but it pays off during inference, because you will save a lot of cost and money.
There is a nice blog post about Harm's law, which tries to measure the compute overhead that you will be paying when you choose to train a smaller model for longer. For example, there's a space on Hugging Face where you can play with this trade-off. If we take a 7B model and we train it on 1 trillion tokens, you can see that we land here, at a model smaller than the Chinchilla-optimal one for that budget, and this gives approximately, I think, a 40% compute overhead. But then during inference, as the table here shows, you save much more than that. So that's something that almost everyone is doing now, which is why we see models that are much, much smaller than the Chinchilla-optimal size but trained on many more tokens.
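To make these numbers concrete, here is a minimal sketch, in Python, of the usual rules of thumb: training compute is approximated as C ≈ 6·N·D, and the Chinchilla-optimal point uses roughly 20 tokens per parameter. The 7B-on-1T configuration is the hypothetical example from the talk; the exact overhead depends on the fitted loss curves, which this sketch does not include.

```python
import math

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Rough compute-optimal (N, D) for a budget C, assuming D = k * N.
    From C = 6 * N * (k * N) we get N = sqrt(C / (6 * k))."""
    n_opt = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_opt, tokens_per_param * n_opt

# Hypothetical example: a 7B model trained on 1T tokens.
n_params, n_tokens = 7e9, 1e12
budget = training_flops(n_params, n_tokens)
n_opt, d_opt = chinchilla_optimal(budget)
print(f"budget ~{budget:.2e} FLOPs")
print(f"compute-optimal: ~{n_opt/1e9:.0f}B params on ~{d_opt/1e12:.2f}T tokens")
# Harm's law compares the compute you actually spend against the compute a
# compute-optimal model would need to reach the same loss; the gap is the overhead.
```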
For further reading, there are some very interesting papers on this. For example, there's the paper called Scaling Data-Constrained Language Models, which shows what to do if you are limited in data. Let's say you want to train a 7B on 10 trillion tokens, but you don't have these 10 trillion tokens. This paper says that you can basically repeat your data for up to around four epochs and still get almost the same performance as with unique data. So for example, instead of using 8 trillion unique tokens, you could use just 2 trillion and repeat them four times. And this is especially useful for some domains where we have almost exhausted all the data that's publicly available. Code is a good example: I think The Stack already covers almost all the code that's publicly available, so it's going to be very hard to scrape and get much more. If you want to train on more tokens, the only option is to actually repeat the data during training. And this is good news, because repeating the data up to around four epochs barely hurts performance.
Another interesting paper when it comes to scaling laws is the DeepSeek LLM paper, where they studied how the scaling laws change with data quality. They tried different data subsets and different filtering, and they found that the scaling laws were changing. This is very important, because up until now we were using Chinchilla, but the Chinchilla laws were fitted on specific data sets, which are not necessarily the ones we are using now. So it's really important to be aware of that, and ideally people should come up with their own scaling laws that work for their own data sets. They also conclude that when you have higher quality data, more of the compute should be allocated to the model size and not to the data size. So these are interesting things to keep in mind when you're deciding on your model and data sizes.
So we have answered the first question, I hope. Let's say now you have your compute budget, a fixed number of GPUs for a certain number of days, and you also know approximately how much data you want to use. The next question is, where do you find this data? For example, Llama 3 was trained on 15 trillion tokens. The web is basically the only place where you can actually get such a large volume of data. There are also curated sources, which are of high quality but much smaller, like Wikipedia, books, arXiv, or Stack Exchange.
And usually, to create these web data sets, people start from Common Crawl, but you need to do some heavy filtering at a very large scale. For example, just the latest dump is over 400 terabytes, so if you want to filter Common Crawl yourself, you will need a lot of resources and a dedicated team. The other option is to use an existing filtered web data set that researchers have already built from Common Crawl and released, for example FineWeb. And here, for example, it shows the performance, which is an aggregation over multiple popular NLP benchmarks like HellaSwag, MMLU, PIQA, and others. It averages them and compares to other data sets like C4, RefinedWeb, SlimPajama, and The Pile. So that was for web; you can get 15 trillion tokens there relatively easily.
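As a small illustration of how you might access such a filtered web corpus, here is a minimal sketch using the Hugging Face datasets library; the dataset id HuggingFaceFW/fineweb and the text field name are assumptions based on the public release, and streaming avoids downloading the full corpus.

```python
from datasets import load_dataset

# Stream the corpus instead of downloading hundreds of gigabytes locally.
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

for i, sample in enumerate(fineweb):
    print(sample["text"][:200].replace("\n", " "))  # peek at the extracted web text
    if i == 2:
        break
```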
And then for code data, we have released The Stack data set, which is the largest data set of open source code. Version 1 consisted of over 6 terabytes of permissively licensed code. The way we built it is that we first cloned all the public repositories on GitHub. But we don't want all of that data, because a lot of it can be configs or extensions that we don't need, so after filtering the extensions we ended up with almost 90 terabytes of data. Then we looked at licenses: you can have permissive licenses like Apache 2.0 or MIT, and more restrictive licenses like GPL. We kept only the permissive ones and deduplicated the files, and we ended up with almost 3 terabytes of deduplicated data.

The Stack also comes with a very cool tool for opt-out. This tool is basically a space where you can go, and it tells you if any of your GitHub repositories are in the data set, and then you can request to be removed from all the future trainings of BigCode. We did that for The Stack v1, but also for The Stack v2. And the v2 is a much larger and enhanced data set. This time, instead of cloning GitHub repositories ourselves, we started from the Software Heritage archive. These data sets, The Stack v1 and The Stack v2, are openly released. This shows how The Stack v2 compares to the v1: after filtering, it's four or five times larger.
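For reference, a language subset of the deduplicated Stack can be loaded in a similar way; the dataset id bigcode/the-stack-dedup, the data_dir layout, and the content field are assumptions based on how the dataset is published on the Hub, and access may require accepting its terms there.

```python
from datasets import load_dataset

# Restrict to a single language subset and stream it.
stack_python = load_dataset(
    "bigcode/the-stack-dedup",
    data_dir="data/python",
    split="train",
    streaming=True,
)

first_file = next(iter(stack_python))
print(first_file["content"][:200])  # the source code itself lives in "content"
```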
So I talked about how to get web data and how to get code data. Another source that has become very popular this year and last year is synthetic data. I think this was mainly sparked by the Phi series of models from Microsoft. Their first paper was called Textbooks Are All You Need, and they basically generated synthetic textbooks and exercises with a larger LLM and used them to build a new pre-training corpus. They were able to match and outperform much larger models, even though this model was trained on almost entirely synthetic data. Now a lot of models are using synthetic data as part of their pre-training mix, for example using an LLM to annotate samples and only keep the high-quality ones, or using synthetic data to improve performance on coding, reasoning, and long contexts.

I'm personally also working on that at Hugging Face. We recently released a data set called Cosmopedia, which was the largest data set of synthetic text at the time. And instead of using closed models like GPT-4, it used an open-source model, Mixtral 8x7B. We also released a blog post that explains how we built it, because it can be very tricky to get very diverse samples. We used an approach where 80% of the data is seeded by web samples: we ask the model to generate textbooks that are related to these web samples. For example, we can have a topic that is mathematics, and we ask the model to generate a textbook in the field of mathematics that is related to a given web sample. The more web samples we add, the more diversity we add. We also used some curated sources like Stanford courses and WikiHow, where we use extracts from these pages as seeds for the generations. You can find more details in the Cosmopedia blog post.
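To give a feel for the seeding idea, here is a minimal sketch of how a textbook prompt could be conditioned on a topic plus a web extract; the prompt wording and the helper function are illustrative assumptions, not the actual Cosmopedia prompts.

```python
def build_textbook_prompt(topic: str, web_extract: str) -> str:
    """Condition the generator on a topic plus a web snippet, so that
    different snippets steer the model toward different sub-topics."""
    return (
        f"Write a clear, self-contained textbook section about {topic}.\n"
        "Take inspiration from the following web extract, but do not copy it:\n"
        f"---\n{web_extract}\n---\n"
        "Textbook section:"
    )

prompt = build_textbook_prompt(
    topic="mathematics",
    web_extract="A forum thread discussing why prime factorization underlies RSA...",
)
# The prompt would then be sent to an open model such as Mixtral 8x7B Instruct.
print(prompt)
```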
So that was the answer to our second question, which was where to find the data. And if you're following, we have one question left, which is how to filter this data. Because, for example, if you use Common Crawl, you can't train on it as it is. And even if you use The Stack, we did not train our models on all of it: we did a lot of filtering to get a data set that is smaller but of much higher quality. On this topic, I will cite a slide from another lecture, which is very interesting by the way, where they state that a high-quality data set might be the most important ingredient for training a good LLM. Data is actually the focus of a lot of recent work, and at the same time, in technical reports the sections about data sets are becoming smaller and smaller, because people are realizing that the data set is actually the backbone, and it is what makes some models much better than others. So it's really important to spend a lot of time creating these data sets and trying to remove all the outliers and samples that can hurt the model during training.
Let's take the Yi models as an example of a filtering pipeline. In Yi's case, they get English and Chinese web data, and then they apply a cascade of filters. Some are simple heuristics: for example, you look for documents that have a lot of repeated lines and remove samples where that ratio is very high. Deduplication also matters a lot. There are papers that study the effect of duplicates on training, and they find that keeping duplicates in the training set can hurt performance. You want to remove documents that are exactly identical, but you also want near-deduplication, which uses techniques like MinHash. For Yi, after that, they also did more filtering on top, but those are the classic filtering steps you will find in most pipelines.
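As a rough illustration of the near-deduplication step, here is a minimal sketch using the third-party datasketch library; the shingling scheme and the similarity threshold are illustrative choices, and the real BigCode pipeline is distributed and far more elaborate.

```python
from datasketch import MinHash, MinHashLSH

def shingles(text: str, n: int = 5):
    """Character n-grams; a crude stand-in for proper tokenized shingles."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def build_minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for s in shingles(text):
        m.update(s.encode("utf-8"))
    return m

docs = {
    "a": "def add(x, y):\n    return x + y\n",
    "b": "def add(x, y):\n    return x + y\n\n",   # near-duplicate of "a"
    "c": "import os\nprint(os.getcwd())\n",
}

lsh = MinHashLSH(threshold=0.7, num_perm=128)  # approximate Jaccard threshold
kept = []
for doc_id, text in docs.items():
    m = build_minhash(text)
    if lsh.query(m):        # a very similar document was already kept, so drop this one
        continue
    lsh.insert(doc_id, m)
    kept.append(doc_id)

print(kept)  # expected to keep "a" and "c" and drop the near-duplicate "b"
```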
Now the question is: OK, we can do deduplication, and I think we have established methods to do that, but how do we find the other good filters? You can, for sure, find some filters in the literature. But if you want to really build a data set that stands out, you need to invest some time trying to find additional techniques that work for your data. This can be done with manual inspection, which is tedious but very important, and from that you can come up with filters that help you during the training. For example, for us, when we were developing the StarCoder models, we asked ourselves what the best ways to filter code are. We used some standard filters, for example on line length or the fraction of alphanumeric characters. But we also tried to come up with slightly more complex filters that could help us, like looking for files that have a lot of comments, because code that is well-documented is probably of a higher quality than a code file that doesn't have any comments.
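Here is a minimal sketch of what such a comment-density heuristic could look like for Python files; the thresholds and the exact rule are illustrative, not the actual StarCoder filter.

```python
def comment_ratio(source: str) -> float:
    """Fraction of non-empty lines that are comments (Python-style '#')."""
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    if not lines:
        return 0.0
    comment_lines = sum(1 for line in lines if line.startswith("#"))
    return comment_lines / len(lines)

def keep_file(source: str, min_ratio: float = 0.01, max_ratio: float = 0.8) -> bool:
    # Drop files with no comments at all, and files that are almost entirely
    # comments (often auto-generated headers or license boilerplate).
    ratio = comment_ratio(source)
    return min_ratio <= ratio <= max_ratio

print(keep_file("x = 1\ny = 2\n"))                                         # False: no comments
print(keep_file("# add two numbers\ndef add(a, b):\n    return a + b\n"))  # True
```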
Another filter we tried was removing all the files from repositories with fewer than five stars. This ended up removing over 70% of the data set, and the resulting model was the worst model that we trained in all our ablation experiments, simply because it removed too much data. So although the filter sounded reasonable, it was not worth using. This is why it's very important that, when you have a filter, you run what we call an ablation model. The ablation is basically: you take a subset of your data set, you train a small model on it, and you see how it behaves with and without the filtering.

And you might be wondering: if I use a small model, does it really extrapolate to larger models? To make the ablations as informative as possible, you should select a set of high-signal benchmarks that already give you conclusions about the effect of your filtering at a small scale. This can be some of the popular NLP benchmarks for LLMs. Another important thing is running the same training with different seeds, because sometimes the effect of a filtering technique is within the noise. So it's always better to run the same experiment with two different seeds, and then do something like averaging the scores, so that you have more robust conclusions about the effect of your filters. For these ablations we are talking about roughly 1-billion-parameter models trained on a relatively small number of tokens, and that's how you find the filterings that work best for your data sets.
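As a tiny illustration of the "multiple seeds, then average" idea, here is a minimal sketch; the benchmark names are ones mentioned earlier in the talk, but every score below is made up.

```python
from statistics import mean

# Two training seeds per data variant; scores are invented for illustration.
runs = {
    "baseline": [
        {"hellaswag": 0.412, "mmlu": 0.271, "piqa": 0.684},
        {"hellaswag": 0.405, "mmlu": 0.268, "piqa": 0.679},
    ],
    "with_comment_filter": [
        {"hellaswag": 0.421, "mmlu": 0.274, "piqa": 0.690},
        {"hellaswag": 0.409, "mmlu": 0.270, "piqa": 0.688},
    ],
}

for variant, seed_scores in runs.items():
    per_benchmark = {b: mean(s[b] for s in seed_scores) for b in seed_scores[0]}
    macro = mean(per_benchmark.values())   # single aggregate number per variant
    print(variant, {k: round(v, 3) for k, v in per_benchmark.items()},
          "macro:", round(macro, 3))
```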
Now let me tell you how we filtered The Stack data sets. For StarCoder 1, even though The Stack v1 had over 6 terabytes of source code, we only used about 800 gigabytes of these 6 terabytes. For StarCoder 2 we used The Stack v2, where this time we started from around 32 terabytes and over 600 programming languages. The overall approach was similar, but the filtering techniques are a bit different.

First, we wanted to include a lot of programming languages. We looked at them, and we didn't keep all of them: we only kept the popular ones and excluded, for example, configs and languages that are no longer maintained. For StarCoder 2, we included more languages, over 600. We also added other sources the model can learn from, such as GitHub issues, Git commits, and notebooks, and for the v2 we also added Kaggle notebooks and pull requests.

Then, as I told you, we had some filters to remove low-quality files and auto-generated content. One of them is the average line length: if a file has an average line length that is too high, there's probably something wrong with it, and it's probably auto-generated. But since we had almost 100 programming languages, we should not use the same threshold for all of them for this filter, because some languages just naturally have longer lines. So we had to adapt the thresholds and look at some samples from each language, and we built an inspection tool that helps us look at 100 samples per extension and choose reasonable thresholds.
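A minimal sketch of such a per-language threshold filter is below; the threshold values are illustrative placeholders, not the ones used for StarCoder.

```python
# Illustrative per-extension cutoffs; real values came from manual inspection.
MAX_AVERAGE_LINE_LENGTH = {"py": 100, "js": 150, "html": 400}
DEFAULT_MAX_AVERAGE_LINE_LENGTH = 120

def average_line_length(source: str) -> float:
    lines = source.splitlines() or [""]
    return sum(len(line) for line in lines) / len(lines)

def passes_line_length_filter(source: str, extension: str) -> bool:
    threshold = MAX_AVERAGE_LINE_LENGTH.get(extension, DEFAULT_MAX_AVERAGE_LINE_LENGTH)
    return average_line_length(source) <= threshold

print(passes_line_length_filter("def f():\n    return 1\n", "py"))  # True
print(passes_line_length_filter("x" * 10_000, "py"))                # False: likely minified or generated
```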
The third filtering step was near-deduplication. We found that near-deduplication was the filtering that gave us the biggest performance boost, and it transfers well: even though we have 86 programming languages, we don't need to tune the deduplication separately for each language. And here I show you some results of the effect of deduplication. For example, you can see this model trained on the Python all-license subset: with no filtering, you get a lower pass@1 than with near-deduplication, and the same goes for other subsets, like the permissive-license one. So we decided to use strong near-deduplication on our data set to really remove all the duplicates and near-duplicates.
The next step was removing personal identifiable information, PII. This could be names, emails, keys, or passwords. There are tools that try to detect secrets and prompt users to remove them, but they don't catch everything. So our approach was to first annotate a PII data set: the annotators were tasked with labeling the PII they found, so if they find a name they label it as a name, and if they find an email they label it as an email. Then we trained StarPII, which is our NER model for detecting PII, and we ran it on the whole StarCoder training data. That step was quite expensive, because it's a neural network and it needs to run on GPUs.
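For illustration, here is a minimal sketch of running a PII NER model with the transformers pipeline API; the model id bigcode/starpii is my assumption for the released model (it may be gated on the Hub), and the example string is made up.

```python
from transformers import pipeline

pii_detector = pipeline(
    "token-classification",
    model="bigcode/starpii",           # assumed model id; may require accepting terms
    aggregation_strategy="simple",     # merge sub-word tokens into entity spans
)

code = 'EMAIL = "jane.doe@example.com"  # contact the author'
for entity in pii_detector(code):
    print(entity["entity_group"], repr(entity["word"]), round(entity["score"], 2))
```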
The last step in our filtering was data decontamination: you should make sure to remove the benchmarks and test sets from your training data, otherwise your evaluation numbers will just be inflated. So we removed the benchmarks that we used for evaluation from our training sets.
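A minimal sketch of one simple form of decontamination, exact substring matching against benchmark prompts, is shown below; real pipelines typically also use n-gram overlap, and the example strings are made up.

```python
def decontaminate(training_files, benchmark_strings):
    """Drop any training file that contains a benchmark string verbatim."""
    return [
        content
        for content in training_files
        if not any(needle in content for needle in benchmark_strings)
    ]

benchmark_strings = ["def has_close_elements(numbers, threshold):"]  # illustrative signature
files = [
    "def add(a, b):\n    return a + b\n",
    "def has_close_elements(numbers, threshold):\n    ...\n",  # would leak a benchmark-style problem
]
print(len(decontaminate(files, benchmark_strings)))  # 1
```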
The very last step in the data curation of The Stack was formatting. Since we're dealing with code files, we can allow ourselves to apply some nice formatting that gives the model useful metadata. For example, for StarCoder we took the code file and prepended special tokens that indicate the repository name, and another token, the file name, that indicates which file this is. This is interesting because, for these models, I guess their main use case is to be plugged into an IDE, so it can be useful to prepend the code with the name of the file, for example file.py, so that the model knows this is a Python file. If the code is in another language, when you add the file name, the model will know which language it should generate code in. We also added a token for the number of GitHub stars, and we tried prompting the model with something like "this file has 100 stars" to see if it would generate higher quality code than if it were told zero stars. We didn't really find any differences during inference.
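A minimal sketch of this kind of metadata formatting is below; the exact token strings (<reponame>, <filename>, <gh_stars>, <|endoftext|>) follow what I believe the StarCoder setup used, but treat them as assumptions rather than a specification.

```python
def format_training_example(repo: str, path: str, stars: int, code: str) -> str:
    """Prepend repository and file metadata with special tokens before the code."""
    return (
        f"<reponame>{repo}"
        f"<filename>{path}"
        f"<gh_stars>{stars}\n"
        f"{code}"
        "<|endoftext|>"
    )

example = format_training_example(
    repo="octocat/hello-world",
    path="utils/math_helpers.py",
    stars=100,
    code="def square(x):\n    return x * x\n",
)
print(example)
```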
There's one more thing about structure: some files are in the same repository and could give each other useful context. But when we built The Stack v1, we just shuffled files, and when we trained the model, it did not know whether two files belonged to the same repository. For StarCoder 2, we decided to keep files that are in the same repository next to each other, separated by special tokens like a file separator, so that the model can try to learn the links between them. On the training side, there is tooling like Nanotron, which implements 3D parallelism, and LightEval for doing the evaluation, so you are able to run your full trainings but also your ablation models: you prepare the data, then train with Nanotron and evaluate with LightEval. This is similar to the tooling we used for both The Stack and the StarCoder models.

And I think we just answered our third question, which was how to filter and process the data. So now you know where you can get this data, both web and code and synthetic, and you also know how you can properly filter it.
So now, let me tell you a little bit more about code LLMs specifically. I'm trying to give you a bit of an overview of these models so that you not only know how to train good LLMs, but also know how to build very cool code assistants on top of them.

How all of this started was when GitHub Copilot was released. It was so much better than all the other code completion models that came before it, which were very small and much less capable. GitHub Copilot was using the Codex model by OpenAI, and they basically showed that you can train a code LLM in the same way that you train an LLM for English: you take a large transformer, give it a lot of code data, and it will learn to code. It works very well, much better than the smaller, specialized models we had before. That was over two years ago, and the field didn't stop there: since then a lot of code models have been released, models that are either trained only on code or trained on both code and natural language. So you can see that we've made a lot of progress in this code generation field, which is amazing, and this is the result of the community's work on open models and data sets. For example, as you can see in the leaderboard, the best models score almost 80% on the HumanEval code evaluation benchmark, which means they get almost 80% of the problems right, which is a very large number.
When talking about the landscape of open code LLMs: in BigCode, we have released The Stack data set, which is now the default data set for training on code, as well as the StarCoder models. Meta also released some very good code models with the CodeLlama series, and there are other models too, like the recent Granite models from IBM. So there are different providers of code LLMs that you can choose from.

The main reason we started the BigCode collaboration was to make this kind of training fully open and reproducible. We released all the details about the training, including the data sets, and we also released the code for the data processing and the model training. We had over 1,000 researchers joining our Slack and contributing to the discussions, and the collaboration had a broader impact, where The Stack was used in the pre-training of a lot of prominent code models, like CodeGen and StableCode.
For us, an open release is not just about the data sets, but also about opt-out tools to respect people's wishes, for example if they don't want their code to be included in the trainings, and about removing personal identifiable information. So an open release does not mean just releasing model weights and stopping there; it also means making your work reproducible by fully documenting the pipeline, releasing the tools for using these models, and publishing technical reports that document the whole pipeline.

In BigCode we went from SantaCoder, which was part of our ablations to understand how to filter The Stack data set, to StarCoder, which was released last year as a 15-billion-parameter code generation model, and then to StarCoder 2, which was trained on many more programming languages. StarCoder was also rated as the most transparent model by the Stanford Foundation Model Transparency Index, which is really rewarding given the efforts that we put into data governance and into making the model release as open and well-documented as possible.
Regarding evaluation: StarCoder 15B, when it was released, was the state-of-the-art code model of its size, and this was also the case for StarCoder 2 15B. It was even close to, or better than, some larger models: it was matching CodeLlama 34B and was close to DeepSeek 33B on some benchmarks. And here, for example, you can see the results across a larger set of benchmarks. When you evaluate on many benchmarks, there's a much lower chance that contamination or overfitting on a single one misleads you, and you can see how your model behaves as you add more evaluation benchmarks. I think that's just a good practice that everyone should follow. We also released some tooling, like a VS Code extension that includes a membership test: it tries to see if the generated code was in the training data and can show you where it came from. That's part of our code attribution efforts in BigCode.
Maybe you're interested in using these models to build your own code assistant. There's a nice blog post by Sourab and Sayak where they take a code model like StarCoder or CodeLlama, fine-tune it on the Hugging Face internal libraries, and then deploy it with Ollama to have a local code assistant. The pipeline is very similar to what we did in pre-training: you gather your code data, you try to filter out the things you don't want to keep, then you do the deduplication and you train your model. In this case it's just a fine-tuning, so it needs far fewer resources; for example, a 7B model can be fine-tuned in a Google Colab.
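As a rough sketch of what such a parameter-efficient fine-tuning setup can look like with the peft library (this is not the blog post's exact script; the model id, target module names, and hyperparameters are illustrative assumptions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "bigcode/starcoderbase-1b"   # a small variant, assumed here so it fits on one GPU
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

lora_config = LoraConfig(
    r=16,                                # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn", "c_proj"], # attention projections; names depend on the architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()       # only a small fraction of weights will be trained

# From here, training proceeds as usual (for example with transformers.Trainer or
# trl's SFTTrainer) on the filtered, deduplicated internal code dataset.
```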
For comparing models, there's the Open LLM Leaderboard that evaluates open models on a set of benchmarks, and there's also the LLM arena, which compares instruction-tuned models based on human votes. For code models, one of the most popular benchmarks is HumanEval: you have a function that the model has to autocomplete, you run the generated solution against unit tests, and then you compute a metric that we call pass@1, the fraction of problems for which the generation passes the tests. This is the metric reported in this leaderboard, and it allows you to see how well each model does on these code generation problems.
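For reference, pass@k is usually computed with the unbiased estimator introduced in the Codex paper; here is a minimal sketch (the sample counts below are made up).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples per problem, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 200 generations for one problem, 37 of them pass the tests.
print(round(pass_at_k(200, 37, 1), 3))    # pass@1 == c / n == 0.185
print(round(pass_at_k(200, 37, 10), 3))   # pass@10 is considerably higher
```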
But these benchmarks usually have an issue of contamination and overfitting, especially for instruction-tuned models. I don't know if you've already looked at what these instruction data sets contain, but for code they usually include a lot of exercise-style problems, so there's a very high chance of contamination, which means having some files that look like HumanEval exercises in your instruction-tuning data set. Here, for example, this plot is from the LiveCodeBench leaderboard. Their solution was to have a leaderboard where they regularly scrape new problems from code contest platforms and evaluate the models only on the problems that were released after each model's release date. This way, they are sure that there is no contamination. And when they compared evaluating on all the problems versus only the newer ones, some models dropped noticeably, which suggests contamination. So that's one interesting thing to keep in mind. This leaderboard is also interesting because it compares not just open models but also closed models like GPT-4, so you can see where the open source community is standing with respect to them. And with that, if you have any questions, I can answer them.
Yes, thank you very much for the great, insightful talk. I'm not sure if there are any in-person questions; otherwise I will get started with the Slido questions. I think I had submitted some of these as well. So, someone's asking: what are the consequences of training AI models on AI-generated synthetic data? And is natural data still necessary for things like learning robustness and covering the real distribution?
Regarding the consequences of training on AI-generated data, I can think of two main ones. The first is reinforcing biases: models already have biases, and if we train on data that is generated by them, we will just amplify those biases. The other thing is contamination: these models may generate content that looks like the benchmarks, and when you train on that, you will have contamination in your data set. For example, one of the critiques of the Phi models is that, because people did not see the synthetic training data and the models were very good on the benchmarks, they wondered: are these models really good, or are they just trained on data that resembles the benchmarks? So I think contamination and reinforcing biases are the two main risks.

And regarding synthetic data not matching the web distribution, I think that's a very good point. When we first trained models on Cosmopedia, we found that it was worse than the web, which was surprising, because we had spent a lot of time curating this data set and it looks so much cleaner than the web. Adding more seeds to cover more topics helped us compensate for part of that gap, but adding some web data always gives you a performance boost. So yes, there is some noise, and also some specific patterns in web data, that you seem to need in order to keep full coverage of what natural distributions look like.
So it sounds like you're saying a good training mix would include both synthetic data and web data? Yes, the experiments we ran show that that's the case. You can try to spend some time carefully curating the topics, but you'll probably still miss out on some things, and what looks cleaner to us is not always what works best for training models. It seems that keeping some filtered web helps. And if you look at the Phi technical reports, they also include filtered web data in the mix. I think that now seems like maybe the best way to go.
Another question is: is RLHF-type preference data more important than unsupervised pre-training data? I think they serve different purposes. The unsupervised pre-training is what gives the model its knowledge. But nowadays, many people skip RL entirely and just do instruction tuning, where you train the model on pairs of instructions and solutions, and there are also alignment methods that don't use reinforcement learning but work just as well. In any case, you definitely need to run a supervised training on top of the pre-trained model if you want to use it as a chat assistant.
Does multimodal grounding, for example including images and videos along with the text, reduce the need for text-only data? Oh, so the question is asking whether multimodal grounding helps: if you have images and videos along with the text, does this reduce the amount of text-only data you need? I can't really answer that, because I haven't tried it. But I think most multimodal models, for example Idefics, which was recently released, are still built on top of a pre-trained text-only LLM, and that seems to be the case for most vision and language models. But I don't really know what the right percentages for each modality would be.
Are there any major differences between training text models versus code models, other than the training data being different? Not really in the architecture; StarCoder, for example, is similar to a Llama or a Mistral model, just trained on code. One thing that you probably want is long context, because if you want to use these models in VS Code, for example, and you want to add all the neighboring files as context, you need a large context window. We also used first MQA and then GQA to have faster inference, but these are techniques that are implemented for general LLMs too. Maybe the difference is in what you prioritize: for example, having a smaller model that can be served quickly inside an IDE rather than a much larger model that would be too slow.
So if you have a very tiny compute budget, for example a single GPU, what would you recommend prioritizing? There are some great solutions for parameter-efficient fine-tuning and on-device deployment, and you should be able to run them on one GPU, depending on the model size. So I think you should just find a very well-curated data set, because quality is more important than quantity, and then use one of these techniques for easy fine-tuning.
They're saying: I'm guessing the optimal amount of training data differs across domains, for example code versus natural language? Right now we're mostly following the same scaling laws. I think there have been comparisons of English and code, but for other domains, I don't know, like medical data, things could change. That's why I mentioned the DeepSeek paper, where they showed that the laws are really heavily dependent on the data: they changed from one generic data set to another, better-curated one. So there probably are differences, but it's still underexplored how these scaling laws change from one domain to another.
Speaking of different domains, code versus text, someone's asking: what are some of the interesting differences when it comes to tokenizers? Yeah, so when we were training the StarCoder tokenizer, we trained it on the same mixture we were using for the training data, so our code mixture, and we did some analysis to see if there were any outliers. But overall, it's very close to training a tokenizer for text. And now most LLMs have a significant code portion in their pre-training data anyway. In the end, you can often use one tokenizer for both LLMs and code models, because even in code you have a lot of Markdown and English in comments and docstrings, so you end up representing all the English tokens anyway.
And here's a question about fine-tuning: do the same data curation principles apply, or do you have different or additional recommendations? When you're preparing data for fine-tuning, it's a somewhat different setting. You're not going to train on all of The Stack; you probably want to continue training on a specific language. So you can invest more time and filter even more heavily, because for fine-tuning you don't need as much data as for pre-training, where we did not have enough to be very picky. For instruction tuning, for example, there are papers where they instruction-tuned on only 1,000 instructions and got a model that was much better than one trained on much larger, noisier instruction sets. So I think data curation is even more important when you're fine-tuning.
You might have also touched upon this briefly: what should people think about when publishing very large data sets, including the more nuanced or less obvious considerations? I think it's about really using tools for filtering and documentation: you should be aware of whether the licenses are respected, whether you have an opt-out tool for your data set, and whether you have removed personal information. If there are some remaining concerns, you could try to add a gate to the release. For example, for us, we released some of the data sets behind a gate with usage terms. So it's good to think about these kinds of things in advance.