Stanford CS25: V4 | Behind the Scenes of LLM Pre-training: StarCoder Use Case
Thank you for joining CS25 Transformers Day's last class. Today, we have Loubna, who is a machine learning engineer in the science team at Hugging Face, working on large language models for code and synthetic data. She's part of the core team of the BigCode project and has co-authored The Stack dataset and the StarCoder models. Thank you so much for coming to our talk today. And as always, the attendance link and the Slido questions are on our website, and we'll be taking questions from there.
I'm a machine learning engineer at Hugging Face in the science team, and today I'll tell you about the behind the scenes of LLM pre-training, using StarCoder as a use case. As you will see, my slides will be a series of questions and answers about how these models are trained.

When ChatGPT first came out, a lot of people thought that there was some secret sauce behind it, and that it would take the open source community a lot of time to catch up, because the open source models that we had back then were lagging far behind. But now it seems that the community has kind of figured out the recipe for training good LLMs. For example, today we have Llama 3 70B Instruct, which has almost the same performance as GPT-4. It also allows the community to build very cool use cases on top of it. So we've made a lot of progress in the open field, and this is not the only model that's out there.
We're now observing kind of a rise of open LLMs, with strong models being released regularly. That was the case, for example, with Google DeepMind's Gemma models. You can also see this on the leaderboards, which are kind of the go-to place for comparing these models. And you can see in this plot that as we went from 2023 to 2024, the gap between the closed models and the open models kept shrinking.

So we're on a very good path, but there are still some things missing, and this is mainly due to releases missing out on important details about how the data was processed and how the models were trained. This is usually the case for two main reasons. One is liability, because when companies publicly disclose the training data, they expose themselves to scrutiny and potential legal issues. The other reason for not disclosing the details is competitiveness: some companies want to stay the best at training LLMs, so they don't want to give away all the details of their training.
Nevertheless, because we have a lot of releases, we can get a good idea of the common recipe. Regarding architecture, I think transformers are now kind of the default, but there are also other interesting architectures out there, or you can use a mixture of experts, which can give you a better quality-versus-inference-cost trade-off. I think architecture is a topic that's already thoroughly explored, and there are other aspects that maybe deserve more attention. Regarding compute, there is not much I can tell you about that, except maybe to go ask Jensen for more GPUs.

But the part that I'm the most interested in is data, which I think is the backbone of LLMs. Because now almost everyone is using more or less the same architecture and training techniques, and for a given compute budget, data is what makes some models better than others. So it's really worth spending time exploring this data and understanding how to get the highest quality samples.

So now we're going to try to answer our previous question of how to train a good LLM by answering another one: how do we get good training data? First, we need to understand how much data we need. Then, once we've figured out the size of the data, where do we find it? And to clean it, which filtering techniques make the most sense?
To answer the first question, how much data we need, the answer is the scaling laws. You want to know how much data to train a model on, but also what the optimal size of the model is. The scaling laws study the allocation of a compute budget between data size and model size. I'm going to present a brief history of the scaling laws, because I think it's really interesting to see how the sizes of the models progressed through time, along with the data sets and the number of tokens we train them on.

I think the first to establish the scaling laws were Kaplan et al. at OpenAI. They tried to fit the laws as a function of the data size and the model size, and they concluded that when your compute increases, you should increase your parameter count much more aggressively than your training tokens, which you should only increase a little. This means that if you have more resources to train your models, you should mainly make them bigger. And this is what led to models like GPT-3, which is 175 billion parameters trained on only 300 billion tokens, which, if we think about it now, is very little data. So all these models were actually very under-trained.
Then came the Chinchilla paper, which revisited these laws. They found that the reason Kaplan concluded data should be scaled less was an issue with the learning rate schedule: although they were changing the data size, they were not using the cosine schedule that corresponded to each data size. When you fix that, you find that you should scale your data and your model size roughly equally. In their paper, they train Chinchilla, a 70 billion parameter model, on 1.4 trillion tokens, which is the Chinchilla optimal point, and it outperformed much larger models like GPT-3 and Gopher, which was over 200 billion parameters.

The way it works is that you fit the loss for different model and data sizes, and then you try to find the sweet spot that minimizes the loss for a given compute budget. And it tells you what your model size should be and how many tokens you should train on. As you can see here, if we try to fit the laws, there's a linear increase for data and also for model size as the compute budget grows. And you can see that, for example, the Chinchilla model, which is 70 billion parameters, was trained on less than 2 trillion tokens. Compare that to a much smaller model like LLaMA 1 7B: it was trained on almost as much data as the Chinchilla model, so it was trained way past its own Chinchilla optimal point.
And we might be wondering, why is that the case? Did Meta not use their compute budget in an optimal way? The answer is that compute optimal for training is not the only thing you care about. When you train a model, you don't only care about what you're going to spend in training, but also about what you will spend during inference, when you serve the model. This is why people prefer training smaller models for longer rather than stopping at the compute-optimal point. This was the case for LLaMA 1 and for other models like Mistral: as the model kept training past the optimal point, it kept improving. So you should not read the Chinchilla scaling laws as saying that compute optimal during training is what is optimal overall.

Think about the cost of serving something like GPT-4, whose training alone is estimated at around $100 million: the larger the model, the more time and money it takes to process tokens at inference. The scaling laws simply don't take the inference cost into account. So when we train smaller models for longer, we're not respecting the Chinchilla scaling laws; we're choosing to pay what we call a compute overhead. It's kind of a sacrifice that you make during training, but it pays off during inference, because you will save a lot of cost and money.
There is a nice blog post about Harm's law, which tries to measure the compute overhead that you will be paying when you choose to train a smaller model for longer. For example, there's a space on Hugging Face where you can play with this trade-off. If we take a 7B model and we train it on 1 trillion tokens, you can see that we land here, at a model smaller than the Chinchilla-optimal one for that budget, and this gives approximately, I think, a 40% compute overhead. But then during inference, as the table here shows, you save much more than that. So that's something that almost everyone is doing now, which is why we see models that are much, much smaller than the Chinchilla-optimal size but trained on many more tokens.
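To make these numbers concrete, here is a minimal sketch, in Python, of the usual rules of thumb: training compute is approximated as C ≈ 6·N·D, and the Chinchilla-optimal point uses roughly 20 tokens per parameter. The 7B-on-1T configuration is the hypothetical example from the talk; the exact overhead depends on the fitted loss curves, which this sketch does not include.

```python
import math

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Rough compute-optimal (N, D) for a budget C, assuming D = k * N.
    From C = 6 * N * (k * N) we get N = sqrt(C / (6 * k))."""
    n_opt = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_opt, tokens_per_param * n_opt

# Hypothetical example: a 7B model trained on 1T tokens.
n_params, n_tokens = 7e9, 1e12
budget = training_flops(n_params, n_tokens)
n_opt, d_opt = chinchilla_optimal(budget)
print(f"budget ~{budget:.2e} FLOPs")
print(f"compute-optimal: ~{n_opt/1e9:.0f}B params on ~{d_opt/1e12:.2f}T tokens")
# Harm's law compares the compute you actually spend against the compute a
# compute-optimal model would need to reach the same loss; the gap is the overhead.
```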
For further reading, there are some very interesting papers on this. For example, there's the paper called Scaling Data-Constrained Language Models, which shows what to do if you are limited in data. Let's say you want to train a 7B on 10 trillion tokens, but you don't have these 10 trillion tokens. This paper says that you can basically repeat your data for up to around four epochs and still get almost the same performance as with unique data. So for example, instead of using 8 trillion unique tokens, you could use just 2 trillion and repeat them four times. And this is especially useful for some domains where we have almost exhausted all the data that's publicly available. Code is a good example: I think The Stack already covers almost all the code that's publicly available, so it's going to be very hard to scrape and get much more. If you want to train on more tokens, the only option is to actually repeat the data during training. And this is good news, because repeating the data up to around four epochs barely hurts performance.
Another interesting paper when it comes to scaling laws is the DeepSeek LLM paper, where they studied how the scaling laws change with data quality. They tried different data subsets and different filtering, and they found that the scaling laws were changing. This is very important, because up until now we were using Chinchilla, but the Chinchilla laws were fitted on specific data sets, which are not necessarily the ones we are using now. So it's really important to be aware of that, and ideally people should come up with their own scaling laws that work for their own data sets. They also conclude that when you have higher quality data, more of the compute should be allocated to the model size and not to the data size. So these are interesting things to keep in mind when you're deciding on your model and data sizes.
So we have answered the first question, I hope. Let's say now you have your compute budget, a fixed number of GPUs for a certain number of days, and you also know approximately how much data you want to use. The next question is, where do you find this data? For example, Llama 3 was trained on 15 trillion tokens. The web is basically the only place where you can actually get such a large volume of data. There are also curated sources, which are of high quality but much smaller, like Wikipedia, books, arXiv, or Stack Exchange.
And usually, to create these web data sets, people start from Common Crawl, but you need to do some heavy filtering at a very large scale. For example, just the latest dump is over 400 terabytes, so if you want to filter Common Crawl yourself, you will need a lot of resources and a dedicated team. The other option is to use an existing filtered web data set that researchers have already built from Common Crawl and released, for example FineWeb. And here, for example, it shows the performance, which is an aggregation over multiple popular NLP benchmarks like HellaSwag, MMLU, PIQA, and others. It averages them and compares to other data sets like C4, RefinedWeb, SlimPajama, and The Pile. So that was for web; you can get 15 trillion tokens there relatively easily.
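As a small illustration of how you might access such a filtered web corpus, here is a minimal sketch using the Hugging Face datasets library; the dataset id HuggingFaceFW/fineweb and the text field name are assumptions based on the public release, and streaming avoids downloading the full corpus.

```python
from datasets import load_dataset

# Stream the corpus instead of downloading hundreds of gigabytes locally.
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

for i, sample in enumerate(fineweb):
    print(sample["text"][:200].replace("\n", " "))  # peek at the extracted web text
    if i == 2:
        break
```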
And then for code data, we have released The Stack data set, which is the largest data set of open source code. Version 1 consisted of over 6 terabytes of permissively licensed code. The way we built it is that we first cloned all the public repositories on GitHub. But we don't want all of that data, because a lot of it can be configs or extensions that we don't need, so after filtering the extensions we ended up with almost 90 terabytes of data. Then we looked at licenses: you can have permissive licenses like Apache 2.0 or MIT, and more restrictive licenses like GPL. We kept only the permissive ones and deduplicated the files, and we ended up with almost 3 terabytes of deduplicated data.

The Stack also comes with a very cool tool for opt-out. This tool is basically a space where you can go, and it tells you if any of your GitHub repositories are in the data set, and then you can request to be removed from all the future trainings of BigCode. We did that for The Stack v1, but also for The Stack v2. And the v2 is a much larger and enhanced data set. This time, instead of cloning GitHub repositories ourselves, we started from the Software Heritage archive. These data sets, The Stack v1 and The Stack v2, are openly released. This shows how The Stack v2 compares to the v1: after filtering, it's four or five times larger.
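For reference, a language subset of the deduplicated Stack can be loaded in a similar way; the dataset id bigcode/the-stack-dedup, the data_dir layout, and the content field are assumptions based on how the dataset is published on the Hub, and access may require accepting its terms there.

```python
from datasets import load_dataset

# Restrict to a single language subset and stream it.
stack_python = load_dataset(
    "bigcode/the-stack-dedup",
    data_dir="data/python",
    split="train",
    streaming=True,
)

first_file = next(iter(stack_python))
print(first_file["content"][:200])  # the source code itself lives in "content"
```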
So I talked about how to get web data and how to get code data. Another source that has become very popular this year and last year is synthetic data. I think this was mainly sparked by the Phi series of models from Microsoft. Their first paper was called Textbooks Are All You Need, and they basically generated synthetic textbooks and exercises with a larger LLM and used them to build a new pre-training corpus. They were able to match and outperform much larger models, even though this model was trained on almost entirely synthetic data. Now a lot of models are using synthetic data as part of their pre-training mix, for example using an LLM to annotate samples and only keep the high-quality ones, or using synthetic data to improve performance on coding, reasoning, and long contexts.

I'm personally also working on that at Hugging Face. We recently released a data set called Cosmopedia, which was the largest data set of synthetic text at the time. And instead of using closed models like GPT-4, it used an open-source model, Mixtral 8x7B. We also released a blog post that explains how we built it, because it can be very tricky to get very diverse samples. We used an approach where 80% of the data is seeded by web samples: we ask the model to generate textbooks that are related to these web samples. For example, we can have a topic that is mathematics, and we ask the model to generate a textbook in the field of mathematics that is related to a given web sample. The more web samples we add, the more diversity we add. We also used some curated sources like Stanford courses and WikiHow, where we use extracts from these pages as seeds for the generations. You can find more details in the Cosmopedia blog post.
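To give a feel for the seeding idea, here is a minimal sketch of how a textbook prompt could be conditioned on a topic plus a web extract; the prompt wording and the helper function are illustrative assumptions, not the actual Cosmopedia prompts.

```python
def build_textbook_prompt(topic: str, web_extract: str) -> str:
    """Condition the generator on a topic plus a web snippet, so that
    different snippets steer the model toward different sub-topics."""
    return (
        f"Write a clear, self-contained textbook section about {topic}.\n"
        "Take inspiration from the following web extract, but do not copy it:\n"
        f"---\n{web_extract}\n---\n"
        "Textbook section:"
    )

prompt = build_textbook_prompt(
    topic="mathematics",
    web_extract="A forum thread discussing why prime factorization underlies RSA...",
)
# The prompt would then be sent to an open model such as Mixtral 8x7B Instruct.
print(prompt)
```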
So that was the answer to our second question, which was where to find the data. And if you're following, we have one question left, which is how to filter this data. Because, for example, if you use Common Crawl, you can't train on it as it is. And even if you use The Stack, we did not train our models on all of it: we did a lot of filtering to get a data set that is smaller but of much higher quality. On this topic, I will cite a slide from another lecture, which is very interesting by the way, where they state that a high-quality data set might be the most important ingredient for training a good LLM. Data is actually the focus of a lot of recent work, and at the same time, in technical reports the sections about data sets are becoming smaller and smaller, because people are realizing that the data set is actually the backbone, and it is what makes some models much better than others. So it's really important to spend a lot of time creating these data sets and trying to remove all the outliers and samples that can hurt the model during training.
Let's take the Yi models as an example of a filtering pipeline. In Yi's case, they get English and Chinese web data, and then they apply a cascade of filters. Some are simple heuristics: for example, you look for documents that have a lot of repeated lines and remove samples where that ratio is very high. Deduplication also matters a lot. There are papers that study the effect of duplicates on training, and they find that keeping duplicates in the training set can hurt performance. You want to remove documents that are exactly identical, but you also want near-deduplication, which uses techniques like MinHash. For Yi, after that, they also did more filtering on top, but those are the classic filtering steps you will find in most pipelines.
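As a rough illustration of the near-deduplication step, here is a minimal sketch using the third-party datasketch library; the shingling scheme and the similarity threshold are illustrative choices, and the real BigCode pipeline is distributed and far more elaborate.

```python
from datasketch import MinHash, MinHashLSH

def shingles(text: str, n: int = 5):
    """Character n-grams; a crude stand-in for proper tokenized shingles."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def build_minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for s in shingles(text):
        m.update(s.encode("utf-8"))
    return m

docs = {
    "a": "def add(x, y):\n    return x + y\n",
    "b": "def add(x, y):\n    return x + y\n\n",   # near-duplicate of "a"
    "c": "import os\nprint(os.getcwd())\n",
}

lsh = MinHashLSH(threshold=0.7, num_perm=128)  # approximate Jaccard threshold
kept = []
for doc_id, text in docs.items():
    m = build_minhash(text)
    if lsh.query(m):        # a very similar document was already kept, so drop this one
        continue
    lsh.insert(doc_id, m)
    kept.append(doc_id)

print(kept)  # expected to keep "a" and "c" and drop the near-duplicate "b"
```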
Now the question is: OK, we can do deduplication, and I think we have established methods to do that, but how do we find the other good filters? You can, for sure, find some filters in the literature. But if you want to really build a data set that stands out, you need to invest some time trying to find additional techniques that work for your data. This can be done with manual inspection, which is tedious but very important, and from that you can come up with filters that help you during the training. For example, for us, when we were developing the StarCoder models, we asked ourselves what the best ways to filter code are. We used some standard filters, for example on line length or the fraction of alphanumeric characters. But we also tried to come up with slightly more complex filters that could help us, like looking for files that have a lot of comments, because code that is well-documented is probably of a higher quality than a code file that doesn't have any comments.
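Here is a minimal sketch of what such a comment-density heuristic could look like for Python files; the thresholds and the exact rule are illustrative, not the actual StarCoder filter.

```python
def comment_ratio(source: str) -> float:
    """Fraction of non-empty lines that are comments (Python-style '#')."""
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    if not lines:
        return 0.0
    comment_lines = sum(1 for line in lines if line.startswith("#"))
    return comment_lines / len(lines)

def keep_file(source: str, min_ratio: float = 0.01, max_ratio: float = 0.8) -> bool:
    # Drop files with no comments at all, and files that are almost entirely
    # comments (often auto-generated headers or license boilerplate).
    ratio = comment_ratio(source)
    return min_ratio <= ratio <= max_ratio

print(keep_file("x = 1\ny = 2\n"))                                         # False: no comments
print(keep_file("# add two numbers\ndef add(a, b):\n    return a + b\n"))  # True
```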
Another filter we tried was removing all the files from repositories with fewer than five stars. This ended up removing over 70% of the data set, and the resulting model was the worst model that we trained in all our ablation experiments, simply because it removed too much data. So although the filter sounded reasonable, it was not worth using. This is why it's very important that, when you have a filter, you run what we call an ablation model. The ablation is basically: you take a subset of your data set, you train a small model on it, and you see how it behaves with and without the filtering.

And you might be wondering: if I use a small model, does it really extrapolate to larger models? To make the ablations as informative as possible, you should select a set of high-signal benchmarks that already give you conclusions about the effect of your filtering at a small scale. This can be some of the popular NLP benchmarks for LLMs. Another important thing is running the same training with different seeds, because sometimes the effect of a filtering technique is within the noise. So it's always better to run the same experiment with two different seeds, and then do something like averaging the scores, so that you have more robust conclusions about the effect of your filters. For these ablations we are talking about roughly 1-billion-parameter models trained on a relatively small number of tokens, and that's how you find the filterings that work best for your data sets.
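As a tiny illustration of the "multiple seeds, then average" idea, here is a minimal sketch; the benchmark names are ones mentioned earlier in the talk, but every score below is made up.

```python
from statistics import mean

# Two training seeds per data variant; scores are invented for illustration.
runs = {
    "baseline": [
        {"hellaswag": 0.412, "mmlu": 0.271, "piqa": 0.684},
        {"hellaswag": 0.405, "mmlu": 0.268, "piqa": 0.679},
    ],
    "with_comment_filter": [
        {"hellaswag": 0.421, "mmlu": 0.274, "piqa": 0.690},
        {"hellaswag": 0.409, "mmlu": 0.270, "piqa": 0.688},
    ],
}

for variant, seed_scores in runs.items():
    per_benchmark = {b: mean(s[b] for s in seed_scores) for b in seed_scores[0]}
    macro = mean(per_benchmark.values())   # single aggregate number per variant
    print(variant, {k: round(v, 3) for k, v in per_benchmark.items()},
          "macro:", round(macro, 3))
```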
Now let me tell you how we filtered The Stack data sets. For StarCoder 1, even though The Stack v1 had over 6 terabytes of source code, we only used about 800 gigabytes of these 6 terabytes. For StarCoder 2 we used The Stack v2, where this time we started from around 32 terabytes and over 600 programming languages. The overall approach was similar, but the filtering techniques are a bit different.

First, we wanted to include a lot of programming languages. We looked at them, and we didn't keep all of them: we only kept the popular ones and excluded, for example, configs and languages that are no longer maintained. For StarCoder 2, we included more languages, over 600. We also added other sources the model can learn from, such as GitHub issues, Git commits, and notebooks, and for the v2 we also added Kaggle notebooks and pull requests.

Then, as I told you, we had some filters to remove low-quality files and auto-generated content. One of them is the average line length: if a file has an average line length that is too high, there's probably something wrong with it, and it's probably auto-generated. But since we had almost 100 programming languages, we should not use the same threshold for all of them for this filter, because some languages just naturally have longer lines. So we had to adapt the thresholds and look at some samples from each language, and we built an inspection tool that helps us look at 100 samples per extension and choose reasonable thresholds.
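A minimal sketch of such a per-language threshold filter is below; the threshold values are illustrative placeholders, not the ones used for StarCoder.

```python
# Illustrative per-extension cutoffs; real values came from manual inspection.
MAX_AVERAGE_LINE_LENGTH = {"py": 100, "js": 150, "html": 400}
DEFAULT_MAX_AVERAGE_LINE_LENGTH = 120

def average_line_length(source: str) -> float:
    lines = source.splitlines() or [""]
    return sum(len(line) for line in lines) / len(lines)

def passes_line_length_filter(source: str, extension: str) -> bool:
    threshold = MAX_AVERAGE_LINE_LENGTH.get(extension, DEFAULT_MAX_AVERAGE_LINE_LENGTH)
    return average_line_length(source) <= threshold

print(passes_line_length_filter("def f():\n    return 1\n", "py"))  # True
print(passes_line_length_filter("x" * 10_000, "py"))                # False: likely minified or generated
```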
The third filtering step was near-deduplication. We found that near-deduplication was the filtering that gave us the biggest performance boost, and it transfers well: even though we have 86 programming languages, we don't need to tune the deduplication separately for each language. And here I show you some results of the effect of deduplication. For example, you can see this model trained on the Python all-license subset: with no filtering, you get a lower pass@1 than with near-deduplication, and the same goes for other subsets, like the permissive-license one. So we decided to use strong near-deduplication on our data set to really remove all the duplicates and near-duplicates.
The next step was removing personal identifiable information, PII. This could be names, emails, keys, or passwords. There are tools that try to detect secrets and prompt users to remove them, but they don't catch everything. So our approach was to first annotate a PII data set: the annotators were tasked with labeling the PII they found, so if they find a name they label it as a name, and if they find an email they label it as an email. Then we trained StarPII, which is our NER model for detecting PII, and we ran it on the whole StarCoder training data. That step was quite expensive, because it's a neural network and it needs to run on GPUs.
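For illustration, here is a minimal sketch of running a PII NER model with the transformers pipeline API; the model id bigcode/starpii is my assumption for the released model (it may be gated on the Hub), and the example string is made up.

```python
from transformers import pipeline

pii_detector = pipeline(
    "token-classification",
    model="bigcode/starpii",           # assumed model id; may require accepting terms
    aggregation_strategy="simple",     # merge sub-word tokens into entity spans
)

code = 'EMAIL = "jane.doe@example.com"  # contact the author'
for entity in pii_detector(code):
    print(entity["entity_group"], repr(entity["word"]), round(entity["score"], 2))
```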
The last step in our filtering was data decontamination: you should make sure to remove the benchmarks and test sets from your training data, otherwise your evaluation numbers will just be inflated. So we removed the benchmarks that we used for evaluation from our training sets.
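A minimal sketch of one simple form of decontamination, exact substring matching against benchmark prompts, is shown below; real pipelines typically also use n-gram overlap, and the example strings are made up.

```python
def decontaminate(training_files, benchmark_strings):
    """Drop any training file that contains a benchmark string verbatim."""
    return [
        content
        for content in training_files
        if not any(needle in content for needle in benchmark_strings)
    ]

benchmark_strings = ["def has_close_elements(numbers, threshold):"]  # illustrative signature
files = [
    "def add(a, b):\n    return a + b\n",
    "def has_close_elements(numbers, threshold):\n    ...\n",  # would leak a benchmark-style problem
]
print(len(decontaminate(files, benchmark_strings)))  # 1
```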
The very last step in the data curation of The Stack was formatting. Since we're dealing with code files, we can allow ourselves to apply some nice formatting that gives the model useful metadata. For example, for StarCoder we took the code file and prepended special tokens that indicate the repository name, and another token, the file name, that indicates which file this is. This is interesting because, for these models, I guess their main use case is to be plugged into an IDE, so it can be useful to prepend the code with the name of the file, for example file.py, so that the model knows this is a Python file. If the code is in another language, when you add the file name, the model will know which language it should generate code in. We also added a token for the number of GitHub stars, and we tried prompting the model with something like "this file has 100 stars" to see if it would generate higher quality code than if it were told zero stars. We didn't really find any differences during inference.
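A minimal sketch of this kind of metadata formatting is below; the exact token strings (<reponame>, <filename>, <gh_stars>, <|endoftext|>) follow what I believe the StarCoder setup used, but treat them as assumptions rather than a specification.

```python
def format_training_example(repo: str, path: str, stars: int, code: str) -> str:
    """Prepend repository and file metadata with special tokens before the code."""
    return (
        f"<reponame>{repo}"
        f"<filename>{path}"
        f"<gh_stars>{stars}\n"
        f"{code}"
        "<|endoftext|>"
    )

example = format_training_example(
    repo="octocat/hello-world",
    path="utils/math_helpers.py",
    stars=100,
    code="def square(x):\n    return x * x\n",
)
print(example)
```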
There's one more thing about structure: some files are in the same repository and could give each other useful context. But when we built The Stack v1, we just shuffled files, and when we trained the model, it did not know whether two files belonged to the same repository. For StarCoder 2, we decided to keep files that are in the same repository next to each other, separated by special tokens like a file separator, so that the model can try to learn the links between them. On the training side, there is tooling like Nanotron, which implements 3D parallelism, and LightEval for doing the evaluation, so you are able to run your full trainings but also your ablation models: you prepare the data, then train with Nanotron and evaluate with LightEval. This is similar to the tooling we used for both The Stack and the StarCoder models.

And I think we just answered our third question, which was how to filter and process the data. So now you know where you can get this data, both web and code and synthetic, and you also know how you can properly filter it.
So now, let me tell you a little bit more about code LLMs specifically. I'm trying to give you a bit of an overview of these models so that you not only know how to train good LLMs, but also know how to build very cool code assistants on top of them.

How all of this started was when GitHub Copilot was released. It was so much better than all the other code completion models that came before it, which were very small and much less capable. GitHub Copilot was using the Codex model by OpenAI, and they basically showed that you can train a code LLM in the same way that you train an LLM for English: you take a large transformer, give it a lot of code data, and it will learn to code. It works very well, much better than the smaller, specialized models we had before. That was over two years ago, and the field didn't stop there: since then a lot of code models have been released, models that are either trained only on code or trained on both code and natural language. So you can see that we've made a lot of progress in this code generation field, which is amazing, and this is the result of the community's work on open models and data sets. For example, as you can see in the leaderboard, the best models score almost 80% on the HumanEval code evaluation benchmark, which means they get almost 80% of the problems right, which is a very large number.
When talking about the landscape of open code LLMs: in BigCode, we have released The Stack data set, which is now the default data set for training on code, as well as the StarCoder models. Meta also released some very good code models with the CodeLlama series, and there are other models too, like the recent Granite models from IBM. So there are different providers of code LLMs that you can choose from.

The main reason we started the BigCode collaboration was to make this kind of training fully open and reproducible. We released all the details about the training, including the data sets, and we also released the code for the data processing and the model training. We had over 1,000 researchers joining our Slack and contributing to the discussions, and the collaboration had a broader impact, where The Stack was used in the pre-training of a lot of prominent code models, like CodeGen and StableCode.
For us, an open release is not just about the data sets, but also about opt-out tools to respect people's wishes, for example if they don't want their code to be included in the trainings, and about removing personal identifiable information. So an open release does not mean just releasing model weights and stopping there; it also means making your work reproducible by fully documenting the pipeline, releasing the tools for using these models, and publishing technical reports that document the whole pipeline.

In BigCode we went from SantaCoder, which was part of our ablations to understand how to filter The Stack data set, to StarCoder, which was released last year as a 15-billion-parameter code generation model, and then to StarCoder 2, which was trained on many more programming languages. StarCoder was also rated as the most transparent model by the Stanford Foundation Model Transparency Index, which is really rewarding given the efforts that we put into data governance and into making the model release as open and well-documented as possible.
Regarding evaluation: StarCoder 15B, when it was released, was the state-of-the-art code model of its size, and this was also the case for StarCoder 2 15B. It was even close to, or better than, some larger models: it was matching CodeLlama 34B and was close to DeepSeek 33B on some benchmarks. And here, for example, you can see the results across a larger set of benchmarks. When you evaluate on many benchmarks, there's a much lower chance that contamination or overfitting on a single one misleads you, and you can see how your model behaves as you add more evaluation benchmarks. I think that's just a good practice that everyone should follow. We also released some tooling, like a VS Code extension that includes a membership test: it tries to see if the generated code was in the training data and can show you where it came from. That's part of our code attribution efforts in BigCode.
Maybe you're interested in using these models to build your own code assistant. There's a nice blog post by Sourab and Sayak where they take a code model like StarCoder or CodeLlama, fine-tune it on the Hugging Face internal libraries, and then deploy it with Ollama to have a local code assistant. The pipeline is very similar to what we did in pre-training: you gather your code data, you try to filter out the things you don't want to keep, then you do the deduplication and you train your model. In this case it's just a fine-tuning, so it needs far fewer resources; for example, a 7B model can be fine-tuned in a Google Colab.
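As a rough sketch of what such a parameter-efficient fine-tuning setup can look like with the peft library (this is not the blog post's exact script; the model id, target module names, and hyperparameters are illustrative assumptions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "bigcode/starcoderbase-1b"   # a small variant, assumed here so it fits on one GPU
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

lora_config = LoraConfig(
    r=16,                                # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn", "c_proj"], # attention projections; names depend on the architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()       # only a small fraction of weights will be trained

# From here, training proceeds as usual (for example with transformers.Trainer or
# trl's SFTTrainer) on the filtered, deduplicated internal code dataset.
```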
For comparing models, there's the Open LLM Leaderboard that evaluates open models on a set of benchmarks, and there's also the LLM arena, which compares instruction-tuned models based on human votes. For code models, one of the most popular benchmarks is HumanEval: you have a function that the model has to autocomplete, you run the generated solution against unit tests, and then you compute a metric that we call pass@1, the fraction of problems for which the generation passes the tests. This is the metric reported in this leaderboard, and it allows you to see how well each model does on these code generation problems.
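For reference, pass@k is usually computed with the unbiased estimator introduced in the Codex paper; here is a minimal sketch (the sample counts below are made up).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples per problem, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 200 generations for one problem, 37 of them pass the tests.
print(round(pass_at_k(200, 37, 1), 3))    # pass@1 == c / n == 0.185
print(round(pass_at_k(200, 37, 10), 3))   # pass@10 is considerably higher
```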
But these benchmarks usually have an issue of contamination and overfitting, especially for instruction-tuned models. I don't know if you've already looked at what these instruction data sets contain, but for code they usually include a lot of exercise-style problems, so there's a very high chance of contamination, which means having some files that look like HumanEval exercises in your instruction-tuning data set. Here, for example, this plot is from the LiveCodeBench leaderboard. Their solution was to have a leaderboard where they regularly scrape new problems from code contest platforms and evaluate the models only on the problems that were released after each model's release date. This way, they are sure that there is no contamination. And when they compared evaluating on all the problems versus only the newer ones, some models dropped noticeably, which suggests contamination. So that's one interesting thing to keep in mind. This leaderboard is also interesting because it compares not just open models but also closed models like GPT-4, so you can see where the open source community is standing with respect to them. And with that, if you have any questions, I can answer them.
Yes, thank you very much for the great, insightful talk. I'm not sure if there are any in-person questions; otherwise I will get started with the Slido questions. I think I had submitted some of these as well. So, someone's asking: what are the consequences of training AI models on AI-generated synthetic data? And is natural data still necessary for things like learning robustness and covering the real distribution?
Regarding the consequences of training on AI-generated data, I can think of two main ones. The first is reinforcing biases: models already have biases, and if we train on data that is generated by them, we will just amplify those biases. The other thing is contamination: these models may generate content that looks like the benchmarks, and when you train on that, you will have contamination in your data set. For example, one of the critiques of the Phi models is that, because people did not see the synthetic training data and the models were very good on the benchmarks, they wondered: are these models really good, or are they just trained on data that resembles the benchmarks? So I think contamination and reinforcing biases are the two main risks.

And regarding synthetic data not matching the web distribution, I think that's a very good point. When we first trained models on Cosmopedia, we found that it was worse than the web, which was surprising, because we had spent a lot of time curating this data set and it looks so much cleaner than the web. Adding more seeds to cover more topics helped us compensate for part of that gap, but adding some web data always gives you a performance boost. So yes, there is some noise, and also some specific patterns in web data, that you seem to need in order to keep full coverage of what natural distributions look like.
So it sounds like you're saying a good training mix would include both synthetic data and web data? Yes, the experiments we ran show that that's the case. You can try to spend some time carefully curating the topics, but you'll probably still miss out on some things, and what looks cleaner to us is not always what works best for training models. It seems that keeping some filtered web helps. And if you look at the Phi technical reports, they also include filtered web data in the mix. I think that now seems like maybe the best way to go.
Another question is: is RLHF-type preference data more important than unsupervised pre-training data? I think they serve different purposes. The unsupervised pre-training is what gives the model its knowledge. But nowadays, many people skip RL entirely and just do instruction tuning, where you train the model on pairs of instructions and solutions, and there are also alignment methods that don't use reinforcement learning but work just as well. In any case, you definitely need to run a supervised training on top of the pre-trained model if you want to use it as a chat assistant.
Does multimodal grounding, for example including images and videos along with the text, reduce the need for text-only data? Oh, so the question is asking whether multimodal grounding helps: if you have images and videos along with the text, does this reduce the amount of text-only data you need? I can't really answer that, because I haven't tried it. But I think most multimodal models, for example Idefics, which was recently released, are still built on top of a pre-trained text-only LLM, and that seems to be the case for most vision and language models. But I don't really know what the right percentages for each modality would be.
Are there any major differences between training text models versus code models, other than the training data being different? Not really in the architecture; StarCoder, for example, is similar to a Llama or a Mistral model, just trained on code. One thing that you probably want is long context, because if you want to use these models in VS Code, for example, and you want to add all the neighboring files as context, you need a large context window. We also used first MQA and then GQA to have faster inference, but these are techniques that are implemented for general LLMs too. Maybe the difference is in what you prioritize: for example, having a smaller model that can be served quickly inside an IDE rather than a much larger model that would be too slow.
So if you have a very tiny compute budget, for example a single GPU, what would you recommend prioritizing? There are some great solutions for parameter-efficient fine-tuning and on-device deployment, and you should be able to run them on one GPU, depending on the model size. So I think you should just find a very well-curated data set, because quality is more important than quantity, and then use one of these techniques for easy fine-tuning.
They're saying: I'm guessing the optimal amount of training data differs across domains, for example code versus natural language? Right now we're mostly following the same scaling laws. I think there have been comparisons of English and code, but for other domains, I don't know, like medical data, things could change. That's why I mentioned the DeepSeek paper, where they showed that the laws are really heavily dependent on the data: they changed from one generic data set to another, better-curated one. So there probably are differences, but it's still underexplored how these scaling laws change from one domain to another.
Speaking of different domains, code versus text, someone's asking: what are some of the interesting differences when it comes to tokenizers? Yeah, so when we were training the StarCoder tokenizer, we trained it on the same mixture we were using for the training data, so our code mixture, and we did some analysis to see if there were any outliers. But overall, it's very close to training a tokenizer for text. And now most LLMs have a significant code portion in their pre-training data anyway. In the end, you can often use one tokenizer for both LLMs and code models, because even in code you have a lot of Markdown and English in comments and docstrings, so you end up representing all the English tokens anyway.
And here's a question about fine-tuning: do the same data curation principles apply, or do you have different or additional recommendations? When you're preparing data for fine-tuning, it's a somewhat different setting. You're not going to train on all of The Stack; you probably want to continue training on a specific language. So you can invest more time and filter even more heavily, because for fine-tuning you don't need as much data as for pre-training, where we did not have enough to be very picky. For instruction tuning, for example, there are papers where they instruction-tuned on only 1,000 instructions and got a model that was much better than one trained on much larger, noisier instruction sets. So I think data curation is even more important when you're fine-tuning.
You might have also touched upon this briefly: what should people think about when publishing very large data sets, including the more nuanced or less obvious considerations? I think it's about really using tools for filtering and documentation: you should be aware of whether the licenses are respected, whether you have an opt-out tool for your data set, and whether you have removed personal information. If there are some remaining concerns, you could try to add a gate to the release. For example, for us, we released some of the data sets behind a gate with usage terms. So it's good to think about these kinds of things in advance.