
Stanford CS25: V4 | Behind the Scenes of LLM Pre-training: StarCoder Use Case


Transcript

- Hello. Thank you for joining CS25 Transformers Day's last class. Today, we have Loubna, who is a machine learning engineer on the science team at Hugging Face, working on large language models for code and synthetic data generation. She's part of the core team of the BigCode project and has co-authored the Stack dataset and the StarCoder models for code generation.

Thank you so much for coming to our talk today. And as always, the attendance link and the Slido questions are on our website, and we'll be taking questions after the talk. Thank you, and you can take it away now. - Hi. Thank you for the introduction. So I'm Loubna. I'm a machine learning engineer at Hugging Face on the science team.

And today, I'll tell you about the behind the scenes of training large language models. And I will use the StarCoder model that our team has trained as a use case. So today's plan is very simple: we're going to try to answer this question. What does it take to train a good LLM?

So it's one question, but it's very loaded, and it has a lot of follow-ups. And as you will see, my slides will be a series of questions and answers. So a few years ago, a lot of people thought that there was some hidden secret sauce behind the strong closed models like GPT-4, and that it would take the open-source community a lot of time to catch up, because the open-source models that we had back then were much smaller and less performant.

But now it seems that the community kind of figured out most of the pieces for getting strong LLMs, as was predicted in this Google memo that was leaked and published on SemiAnalysis. For example, today we have Llama 3 70B Instruct, which has almost the same performance as GPT-4, but it has unlocked so many use cases because the model weights are open.

The model can be quantized and can even run on a consumer desktop. It also allows the community to build very cool use cases on top through fine tuning. So we've made a lot of progress in the open field, and this is not the only model that's out there. We're now observing kind of a rise of open LLMs.

More and more companies are embracing releasing models. That was the case, for example, with Google DeepMind's Gemma models and with Mistral's models, and also models from other providers. Here I put a plot from the LMSYS arena, which is kind of the go-to leaderboard for comparing instruct models nowadays.

It uses human evaluation. And you can see in this plot that as we went from 2023 to May '24, the gap in performance between the closed models and the open models is shrinking, which is very promising. So we're on a great path, but there are still a lot of limitations.

And this is mainly due to releases missing important details about how the data was processed and how the models were trained. And this is usually the case for two main reasons. The first one is to avoid legal scrutiny, because when companies publicly disclose the training data, if the training was not done properly and copyrights were not respected, they risk facing a legal investigation.

The other reason for not disclosing the details can be to maintain a competitive edge. So some companies want to be the best at training LLMs, so they don't want to give all the details for their training. Nevertheless, because we have a lot of releases, I think we can still answer this question and put a lot of pieces together.

So what do we need to train a good LLM? The first thing is probably the model. You need to have a good architecture. And I think now transformers are kind of the default, but there are also other interesting architectures like Mamba, which is a state-space model, or you can use a mixture of experts, which combines multiple expert sub-networks.

But I'm not going to spend a lot of time in this lecture on models, because I think it's a topic that's already thoroughly explored, and there are other aspects that maybe deserve a little bit more attention. So that was it for models. Then for GPUs, I don't think there's much I can tell you about that, except maybe ask Jensen.

But the part that I'm the most interested in is data, which I think is the backbone of LLMs. Because now almost everyone is using the same architecture and the same training techniques, and for a given budget, data is what makes some models better than others. So it's really worth spending time exploring this data and understanding how to get the highest-quality samples.

So now we're going to try to answer our previous question of how to train a good LLM by how do we get good training data. And I think the answer to this is threefold. First, we need to understand how much data do we need. And then once we've figured out the size of the data that we need, where can we get this data?

And to clean it, which filtering techniques make more sense and will give us the best performance? So to answer the first one, the answer to that is the scaling laws. You want to know how much data you want to train a model on, but also what is the optimal size of the model.

And the scaling laws try to study the allocation of a compute budget between data size and model size. This means: should you take a smaller model and train it on more data, or take a larger model and train it on less data? And I'm going to present a brief history of the scaling laws, because I think it's really interesting to see how the sizes of the models progressed through time and also how the size of the data sets and the number of tokens we train on have changed, because there were really some drastic changes in that.

I think the first to establish the scaling laws was Kaplan et al. from OpenAI. They tried to fit the laws as a function of the data size and model size. And they found that if you have a 10 times increase in your compute, you should increase your parameter count by about 5.5 times.

But your training tokens, you should only increase by about 1.8 times. This means that if you have more resources to train your models, you should make the model much larger. But the data, it's fine. You shouldn't increase it that much. And this is what led to models like GPT-3, which is 175 billion parameters, which was only trained on 300 billion tokens, which if we think about it now, is really small.

Other models also followed this, for example OPT, which was the same size as GPT-3 and trained on a similar amount of data. There was also BLOOM. So all these models are actually very under-trained. Then the Chinchilla scaling laws came after. And they kind of revisited the scaling laws.

And they found that the reason Kaplan et al. thought that data should not be scaled as much as model size is that they used a fixed cosine scheduler for all their experiments. So although they were changing the data size, the cosine schedule was fixed. This meant that the performance of some models was underestimated, because the cosine schedule did not correspond to the data size.

This led to kind of false conclusions. And Chinchilla gave us new scaling laws that say you should scale your data and your model size equally. In their paper, they trained a 70 billion parameter model on 1.4 trillion tokens, which is the Chinchilla-optimal point. And it outperformed much larger models like GPT-3 and Gopher, which was over 280 billion parameters.

So here, for example, I have a plot which shows what the scaling laws try to do. For example, here you have isoFLOP curves, where each curve uses a fixed compute budget. And then you try to find the sweet spot, which is the optimum for your budget allocation. And it tells you what your model size should be and what your data size should be.

And as you can see here, if we try to fit the laws, we can see that there's a linear increase for data and also model size. In this scheme, I tried to show how we've moved from the Chinchilla scaling laws to today's models. And you can see that, for example, the Chinchilla model, which is 70 billion parameters, was trained on about 1.4 trillion tokens.
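To make this allocation concrete, here is a rough back-of-the-envelope sketch, not code from any of these papers, assuming the common approximations that training compute is C ≈ 6·N·D and that the compute-optimal point sits around 20 tokens per parameter:

```python
# Back-of-the-envelope Chinchilla-style allocation.
# Assumptions: training FLOPs C ~ 6 * N * D, and a compute-optimal
# ratio of roughly 20 training tokens per parameter.

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that roughly split a FLOP budget optimally."""
    # C = 6 * N * D and D = r * N  =>  N = sqrt(C / (6 * r))
    n_opt = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    d_opt = tokens_per_param * n_opt
    return n_opt, d_opt

if __name__ == "__main__":
    budget = 5.8e23  # roughly the Chinchilla training budget in FLOPs
    n, d = chinchilla_optimal(budget)
    print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")  # ~70B, ~1.4T
```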

But then after that, we have LLaMA, which was released last year. And it was just a 7B model, and it was trained on about as much data as the Chinchilla model. So it was trained way past the Chinchilla-optimal point. And we might be wondering, why is that the case?

Did Meta not use their compute budgets in an optimal way? And the answer to that is that compute-optimal is not always optimal. Because when you train a model, you don't only care about what you're going to spend in training, but you also care about the inference. The model is trained one time, but inference happens for much longer.

The model is going to be served, so you want to save some cost there. This is why people prefer training smaller models for longer rather than using much larger models that are trained on less data. This was the case for LLaMA 1, for other models like Mistral, but also for Llama 3, which went even further and trained not on 1 trillion tokens, but on 15 trillion tokens.

And if you check the report, the loss kept going down, and the downstream evaluations kept improving as the model kept training. And I think this is really interesting, because some people misunderstood the Chinchilla scaling laws as saying compute-optimal is optimal. But that's not the case.

Because the inference cost is not considered. So for example, this is the cost of training GPT-4: it's estimated at around $100 million. But the inference is also very expensive, and the larger the model becomes, the more time it takes to process tokens. So the scaling laws don't take the inference cost into consideration.

And if we do take the inference cost, which is the case for most people, because they want to use these models in inference, you might prefer using the smaller models and training them longer. And when we do that, we're not respecting the Chinchilla scaling laws; we're choosing to pay what we call a compute overhead.

It's kind of a sacrifice that you make during the training: you choose to pay more. But this will have a benefit during inference, because you will save a lot of cost and money. And there's this very interesting blog post by Harm de Vries, which tries to measure the compute overhead that you will be paying when you choose to train a small model.

For example, here, there's a Space on Hugging Face where you can input the model size and how many tokens you want to train on, and it will show you where you are relative to the Chinchilla-optimal point. So for example, if we take a 7B model and we train it on 1 trillion tokens, you can see that we are here.

It's the red dot, and it's before the Chinchilla-optimal model. And this gives approximately, sorry, a 13% compute overhead. But then during inference, as it shows here in the table, there are almost 50% cost savings. So that's something that almost everyone is doing now, which is why we see models that are much, much smaller than one or two years ago.
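As a toy illustration of this trade-off, in the spirit of that blog post, here is a small sketch; the token counts are made-up assumptions and it only compares raw FLOPs, not model quality:

```python
# Toy comparison of training cost vs. lifetime inference cost.
# Assumptions: train FLOPs ~ 6 * N * D, inference FLOPs ~ 2 * N per token,
# and a made-up number of tokens served over the model's lifetime.
# Whether the two configurations reach the same quality is a separate
# question answered by fitted scaling laws, not by this script.

def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

def inference_flops(n_params: float, tokens_served: float) -> float:
    return 2 * n_params * tokens_served

TOKENS_SERVED = 5e12  # assumed lifetime serving volume

configs = {
    "larger, Chinchilla-style 70B": (70e9, 1.4e12),
    "smaller, overtrained 7B":      (7e9, 2.0e12),
}
for name, (n, d) in configs.items():
    t, i = train_flops(n, d), inference_flops(n, TOKENS_SERVED)
    print(f"{name}: train={t:.2e} FLOPs, inference={i:.2e} FLOPs, total={t + i:.2e}")
```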

For further reading, there are some very interesting papers about scaling laws. For example, there's this paper called Scaling Data-Constrained Language Models, which shows that if you are limited in your data size-- let's say, for example, you want to train a 7B on 10 trillion tokens, but you don't have these 10 trillion tokens.

This paper says that you can basically repeat your data up to four times, so four epochs, and you will get similar performance as if you used unique tokens. So for example, instead of using 8 trillion unique tokens, you could use just 2 trillion and repeat them four times, and you get almost the same performance as if these tokens were unique.

And this is especially useful for some domains where we have almost exhausted all the data that's publicly available. As I will show you later, the Stack v2, which is a code data set that we released, I think it has almost all the code available. So it's going to be very hard to scrape and get more code.

And if you want to train models longer, the only option is to actually repeat the data during training. And this is good news, because repeating the data up to four times is actually significant. Another paper that I think is interesting when it comes to scaling laws is the DeepSeek LLM paper.

They try to establish new scaling laws that are suited to their data, because they found that the scaling behavior is highly dependent on the data quality. So they tried different data subsets, different filtering, and they found that the scaling laws changed. This is very important, because up until now we were using the Chinchilla laws, but Chinchilla used fixed data sets.

They are not necessarily the ones that we are using now. So it's really important to be aware of that. And this is why DeepSeek tried to come up with their own scaling laws that work for their data sets. And they also conclude that when you have higher-quality data sets, maybe more compute should be allocated to the model size and not to the data size.

So these are interesting things to keep in mind when it comes to scaling LLMs. So we have answered the first question, I hope. How much data to train LLMs? So let's say now you have your compute budget, a fixed number of GPUs for a certain amount of days, and you also know approximately how much data you want to use.

The question is that, where do you find this type of data? For example, Llama3 was trained on 15 trillion tokens, but where do you get 15 trillion tokens? That's a huge number. To get this data, the two main sources where you can actually get a very large volume of data are the web and then GitHub code.

There are some other curated sources. Those are of high quality but much smaller, like Wikipedia, books, arXiv, or Stack Exchange. You can also get a new type of data that's been very trendy recently, which is synthetic data. But let's first start with the sources where you can get very large volumes.

The first one is web data, so that's basically web pages. To create these data sets, people usually start from Common Crawl, which is a public repository of crawled web pages. Common Crawl crawls pages regularly and publishes dumps every few months. But if you start from there, you will need to do some heavy filtering at a very large scale.

For example, just the latest dump has over 400 terabytes, and they have almost 95 dumps. So that's not a very easy task, and you will need to have a lot of resources and a team to be able to do that processing. The other option is to use an existing filtered web data set.

Other researchers have already filtered Common Crawl and released the results. And luckily, we do have data sets that are very large and well filtered. One of them is FineWeb, which was recently released by Hugging Face, and it has 15 trillion tokens of web data. It's not just a large data set; it also has the best performance among the publicly available data sets.

And here, for example, it shows the performance, which is an aggregation over multiple popular benchmarks for NLP, like HellaSwag, MMLU, PIQA, and others. And it averages them and compares to other data sets like C4, RefinedWeb, SlimPajama, and The Pile. So that was for web; you can get 15 trillion tokens there.

And then for code data, we have released the Stack data set, which is the largest data set of open source code. This data set comes in two versions. Version 1 consisted of 6 terabytes of permissive code. And how we built this data set is that we first cloned all the public repositories on GitHub.

So this gave us over 130 million repositories and around 100 terabytes of data. But we don't want all of that data, because a lot of it can be configs or extensions that we don't need or languages that are no longer maintained. So we did some file extension filtering, and we ended up with almost 90 terabytes of data.

After that, we filtered repositories based on their licenses. So we can have permissive licenses like Apache 2 or MIT. We can have more restrictive licenses like GPL. So we filtered all the repositories that did not have a permissive license. And after that, we did the deduplication to remove files that are similar.

So we ended up with almost 3 terabytes of deduplicated data. The Stack also comes with a very cool tool for opt-out. This tool is basically a Space where you can go and type your GitHub username, and it tells you if any of your GitHub repositories are in the data set.

And if that's the case, there's also an option to fill out a form and request to be removed from all future BigCode trainings. So we did that for the Stack v1, but also for the Stack v2. And the v2 is a much larger and enhanced data set compared to the v1.

This time, instead of cloning GitHub repositories, we went through Software Heritage, which is an archive of code. They already did the scraping, and we just extracted the data from their archive. And we ended up, after all the filtering, with almost 1 trillion tokens, which is a lot compared to the v1, where we got around 200 billion tokens at the end.

We also added some high-quality resources like GitHub issues, math and code data sets, and pull requests. So these data sets, the Stack v1 and the Stack v2, can be used to train LLMs on code, or to train general LLMs and include code as a subset of the general web data.

This shows how the Stack v2 compares to the v1. And you can see that before filtering, it's almost 10 times larger. And after filtering, it's four or five times larger. So I talked about how to get web data and how to get code data. And then I also mentioned synthetic data.

And it's this year and last year that synthetic data became very important for LLM training. And I think that in the next few years, it will become even more important. And I think this was mainly sparked by the Phi series of models by Microsoft. Their first paper was called Textbooks Are All You Need.

And they basically generated synthetic textbooks using GPT-3.5 and GPT-4. And they tried to build a new pre-training corpus that is synthetic. And they were able to match and outperform models that are trained on web data sets. So this model was trained on almost entirely synthetic data. But now some of the very popular LLMs are using synthetic data as part of their pre-training mix.

For example, for Claude 3, in the model card they say that they generate data internally and include it in the pre-training. This is also the case for Llama 3, where they used LLMs to build classifiers that would annotate samples and only keep the high-quality ones. But they also generated synthetic content to improve performance on coding, reasoning, and long contexts.

So synthetic data is a very new topic, but it seems really interesting. I'm personally also working on that at Hugging Face. We recently released a data set called Cosmopedia, which was the largest data set of synthetic text. It had almost 25 billion tokens. And instead of using closed models like GPT-4, it used an open-source model, which is Mixtral 8x7B.

And we also released a blog post that explains how we created this data set. Because it can be very tricky to get very diverse samples. So we used an approach where we had 80% of the data that comes from the web. And then we tried to use these web samples to build new prompts that ask models to generate textbooks that are related to these web samples.

But while giving them more context, so we can limit the generations. For example, we can have a topic that is mathematics. And then we have web samples that are related to mathematics. And each time we give the model a prompt, generate a textbook in the field of mathematics that is related to this web sample.

And the more web samples we add, the more diversity we add. We also used some curated sources like Stanford courses and WikiHow, where we use extracts from these pages to ask the models to generate content that is related to them. You can find more details in the Cosmopedia blog post.
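To give a flavor of this seeding approach, here is an illustrative sketch; the prompt wording and helper names are hypothetical and not the actual Cosmopedia pipeline:

```python
# Illustrative sketch of the seeding idea: pair a topic with a web
# extract and ask an instruction model for a textbook-style page.
# The prompt wording and helper below are hypothetical examples,
# not the actual Cosmopedia generation code.

PROMPT_TEMPLATE = (
    "Write a clear, self-contained textbook section in the field of {topic}.\n"
    "It should be related to the following web extract, but go deeper and "
    "explain the concepts step by step:\n\n{web_extract}\n"
)

def build_prompts(topic: str, web_extracts: list[str]) -> list[str]:
    # Each distinct web extract seeds a different generation, which is
    # what gives the synthetic corpus its diversity.
    return [
        PROMPT_TEMPLATE.format(topic=topic, web_extract=extract[:1500])
        for extract in web_extracts
    ]

prompts = build_prompts(
    "mathematics",
    ["A blog post introducing eigenvalues...",
     "A forum thread about solving linear systems..."],
)
# These prompts would then be sent to an open model such as Mixtral 8x7B
# through whatever inference stack you use (vLLM, TGI, etc.).
```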

So I guess now we also have the answer for our second question, which was where to find the data. And if you're following, we have one question left, which is how can we filter this data? Because for example, if you use Common Crawl, you need to filter it. And even if you use the Stack, we did not train our models on the Stack directly.

We did a lot of filtering to get a data set that is smaller but has a higher quality. And here I will cite this slide from Thomas Wolf's presentation. This lecture is very interesting, by the way; you can find it here. And this is from the Yi paper, where they state that a high-quality data set can elicit very advanced capabilities from a standard architecture.

And this is actually the focus of many recent papers. And we can see that in model releases, the sections about data sets are becoming smaller and smaller because people are realizing that the data set is actually the backbone, and it is the one that is making some models much better than others.

So it's really important to spend a lot of time creating these data sets and trying to remove all the outliers and samples that can hurt the model during training. This is the pipeline from the Yi paper for filtering their web data. So first, they do language filtering.

So in Yi's case, they keep English and Chinese. Then they apply some filtering techniques to remove low-quality samples. For example, there are some metrics where you look for files that have a lot of repeated lines and then remove them. There's also rule-based correction. You can also use perplexity filtering, where you compute something like a loss and remove samples that have a very high one.
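As a minimal sketch of this kind of heuristic filtering, here is what a repeated-lines filter could look like; the thresholds are illustrative, not the ones used in the Yi paper:

```python
# Minimal sketch of two rule-based quality filters: drop very short
# documents and documents dominated by repeated lines. The thresholds
# are illustrative only.

def duplicate_line_fraction(text: str) -> float:
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 1.0
    return 1.0 - len(set(lines)) / len(lines)

def keep_document(text: str,
                  max_dup_line_frac: float = 0.3,
                  min_chars: int = 200) -> bool:
    if len(text) < min_chars:
        return False
    if duplicate_line_fraction(text) > max_dup_line_frac:
        return False
    return True

docs = ["spam spam\n" * 50, "A reasonably long article with varied content..." * 10]
print([keep_document(d) for d in docs])  # [False, True]
```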

Then after that, they also did a step which is very important, deduplication. Because there are a lot of papers that study the effect of duplicates on training, and they find that keeping duplicates in the training data can cause models to memorize, leaving them less room to be creative.

So this hurts the performance of models, and it's always advised to remove duplicates using exact deduplication to remove files that are exactly identical, but also near-deduplication to remove files that are similar. And this uses techniques like MinHash deduplication. For Yi, after that, they also did more filtering on top, like semantic and topic filtering.
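Here is a small near-deduplication sketch using MinHash and LSH with the datasketch library; this is one common way to do it, not the exact implementation used for these data sets:

```python
# Near-deduplication sketch with MinHash + LSH using the `datasketch`
# library; production pipelines use more optimized implementations,
# but the idea is the same.
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    words = text.split()
    # Hash word 5-grams (shingles) rather than single words.
    for i in range(max(len(words) - 4, 1)):
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

def near_dedup(docs: dict[str, str], threshold: float = 0.85) -> list[str]:
    """Return ids of documents kept after near-deduplication."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for doc_id, text in docs.items():
        sig = minhash(text)
        if lsh.query(sig):        # a similar document was already kept
            continue
        lsh.insert(doc_id, sig)
        kept.append(doc_id)
    return kept
```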

But usually, you can do the classic filtering and deduplication and then be more creative for the other filters. This was also the case for FineWeb. The reason it is better than other data sets is that they spent a lot of time trying to come up with better filters and also deduplicated the data well.

Now the question is, OK, we can do deduplication. I think we have methods that are established to do that. We can also do language filtering. But then if you want to filter the data to remove garbage and lower quality files, how do you come up with good filters? You can, for sure, find some filters in the literature.

But if you want to really build a data set that is better than what exists, you need to invest some time trying to find more techniques that work better for your case. This can be done with manual inspection, which is always a good idea to look at the data and see what it actually looks like.

And you can come up with filters to help you during the training. But that is usually not enough, because you might have an intuition about a filter that should work better for your model, but then when you actually train, this filtering doesn't help. For example, for us, when we were developing the StarCoder series of models, we were thinking, OK, what are the best ways for us to filter code?

So we used some standard filters, for example to remove auto-generated content. But we also tried to come up with slightly more complex filters that could help us, like looking for files that have a lot of comments, because code that is well documented is probably of higher quality than a code file that doesn't have any comments.

So we implemented this filter that looks for files that have almost no comments and then removes them. And we trained a model on that. It turned out the performance improvement was really negligible. It was not as much as we thought. We also tried to use another filter, which is using the stars of a repository as an indicator of quality.
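For illustration, a minimal version of that comment-based filter could look like the sketch below, for Python-style files; the threshold is a made-up example:

```python
# Rough sketch of the "almost no comments" heuristic for Python-style
# files; the 1% threshold is a made-up example. As noted above, in our
# ablations this kind of filter only gave a negligible improvement,
# so always measure before adopting it.

def comment_ratio(code: str) -> float:
    lines = [line.strip() for line in code.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(1 for line in lines if line.startswith("#")) / len(lines)

def keep_file(code: str, min_ratio: float = 0.01) -> bool:
    return comment_ratio(code) >= min_ratio
```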

So we tried removing all the files from repositories that have fewer than five stars. And this ended up removing over 70% of the data set. And then when we trained on it, the model was the worst model that we trained in all our ablation experiments, simply because it removed too much data.

It was not worth using this filtering technique. This is why it's very important that when you have a filter, you should run what we call an ablation model. The ablation is basically you take a subset of your data set after you applied the filtering. And you train a small model on it and see how it behaves with and without the filtering.

And you might be wondering, OK, if I use a small model, but does it really extrapolate to larger models? I think that's a good question. But generally, from our experience, we found that this does extrapolate for most of the ablations. When you're doing these ablations, you should select a set of high signal benchmarks that could show you some-- give you some conclusions about the effect of your filtering early in the training.

This can be some of the popular NLP benchmarks for LLMs, for example HellaSwag or MMLU. You should also run the training with different seeds to reduce the noise. Because sometimes you can have filtering techniques that don't give you a very big difference, and if you train with just one seed, you might draw conclusions from what is actually just noise.

So if you can and you have the compute, it's always better to run the same experiment with two or three different seeds, and then average them so that you reduce the noise and have more robust conclusions about the effect of your filtering.
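As a small sketch of that aggregation step, assuming you already have per-seed benchmark scores from your evaluation harness (the numbers below are placeholders):

```python
# Aggregating ablation results across seeds; the scores below are
# placeholders, assumed to come from your evaluation harness.
from statistics import mean, stdev

scores = {
    "baseline":      {"hellaswag": [0.412, 0.408, 0.415], "mmlu": [0.262, 0.259, 0.264]},
    "with_filter_x": {"hellaswag": [0.418, 0.421, 0.414], "mmlu": [0.263, 0.266, 0.260]},
}

for run, benches in scores.items():
    for bench, per_seed in benches.items():
        print(f"{run:>13} {bench:>9}: {mean(per_seed):.3f} +/- {stdev(per_seed):.3f}")
# Only treat the filter as a win if the averaged gap is clearly larger
# than the seed-to-seed noise.
```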

For example, for the FineWeb data set, the authors ran over 200 ablations. These were roughly 1 billion parameter models trained on, I think, 30 billion tokens. And this is how they were able to find filters that worked better for their data set. Now let's go back to our StarCoder use case.

And I will tell you about how we filtered the Stack data sets. So for version 1, if you remember, we had 6 terabytes of source code, but when we trained StarCoder, we only used about 800 gigabytes of these 6 terabytes. So a lot of this data was filtered out during our curation.

The same happened for the Stack V2, where this time we started from 32 terabytes and 600 programming languages. And after the filtering, we ended up with only 6.3 terabytes of code. And for filtering code, the approach is a bit similar to just filtering web data, but the filtering techniques are a bit different.

So first, we wanted to include a lot of programming languages. And we looked at them, and we didn't keep all of them. We only kept the popular ones and excluded, for example, configs and languages that are no longer maintained. So this was for v1. For StarCoder2, we included more languages, over 600.

And then we added some other sources that could be interesting for a code model to learn from, which are GitHub issues, Git commits, and Jupyter notebooks. For the v2, we also added Kaggle notebooks and pull requests. The second step, after we selected the languages we wanted to train on, was data quality inspection.

So basically, as I told you, we had some filters to remove low-quality files and auto-generated content. An example is the average line length. If you have an average line length that is too high, there's probably something wrong with this file; it's probably auto-generated. But since we had almost 100 programming languages, we could not use the same threshold for all the languages for this filter, because some programming languages just have longer lines.

So it's important to do some inspection and look at some samples from these languages. In our case, we had the BigCode community, which helped us look at 100 samples per extension and derive the appropriate thresholds and filtering heuristics. The third filtering step was near-deduplication. We found that near-deduplication was the filter that gave us the biggest performance boost.

It's also very easy to apply, because it's language agnostic. Even though we have 86 programming languages, we don't need to change the deduplication for each language. We can just apply it to the whole data set. And here I show you some results of the effects of deduplication. For example, here you can see this model, Python all-license.

If the filtering is none, you get a pass@1, which is our code metric, of 13. But if you apply near-deduplication, you go from 13 to 17. That's a very big performance bump. The same goes for other subsets, like permissive-license. So we decided to use deduplication for our data set, and to use strong deduplication to really remove all the files that could be similar.

Another step in our pipeline is to remove personally identifiable information. This could be names, emails, keys, or passwords, because we scraped code from GitHub. And although GitHub has some tools to detect secrets and prompt users to remove them, that doesn't always happen, and we found that there were still a lot of secrets in the data set.

And when we train our model, we don't want it to be trained on that, because at inference it might generate sensitive or personal data. So our approach to removing it was to first annotate a data set for PII. We collaborated with an annotation company to annotate some samples, and the annotators were tasked with labeling the PII when they found it.

For example, if they find a name, they give it the class name; if they find an email, they label it as an email. So it was a named entity recognition task. And then we trained StarPII, which is our NER model, to detect this PII. And then we ran it on the whole StarCoder training data.
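As a sketch of what running such an NER model looks like with the transformers pipeline API, assuming you have access to the StarPII checkpoint on the Hub (it is gated) and a GPU:

```python
# Running an NER-based PII detector over code with the transformers
# pipeline API. Assumes access to the gated StarPII checkpoint on the
# Hub and a GPU for reasonable throughput.
from transformers import pipeline

pii_detector = pipeline(
    "token-classification",
    model="bigcode/starpii",          # gated; request access first
    aggregation_strategy="simple",
    device=0,                         # run on GPU
)

code = 'EMAIL = "jane.doe@example.com"\nAPI_KEY = "sk-not-a-real-key"\n'
for entity in pii_detector(code):
    print(entity["entity_group"], repr(code[entity["start"]:entity["end"]]))
# Detected spans can then be masked or replaced with placeholder values
# before the data goes into pre-training.
```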

Running this model over the whole data set took almost 800 A100 GPU hours, because it's a neural network and it needs to run on GPUs. The last step in our filtering was data decontamination, because you should make sure to remove the benchmarks and test sets from your training data. Otherwise, your evaluation numbers will just be inflated.
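A very simple version of such a decontamination check could look like the sketch below; real pipelines usually also use n-gram overlap rather than only exact substring matches:

```python
# Very simple decontamination sketch: drop training documents that
# contain an exact substring of any evaluation prompt. Real pipelines
# usually also check n-gram overlap rather than only exact matches.

def decontaminate(train_docs: list[str], benchmark_texts: list[str],
                  probe_len: int = 50) -> list[str]:
    probes = [t[:probe_len] for t in benchmark_texts if len(t) >= probe_len]
    clean = []
    for doc in train_docs:
        if any(probe in doc for probe in probes):
            continue  # likely a leaked benchmark sample
        clean.append(doc)
    return clean
```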

So we made sure to remove the benchmarks that we used for evaluation from our training sets. The last step in the data curation of the Stack was to format the data. Now that the data is filtered, and because code is different from text, we can allow ourselves to apply some nice formatting that could help us at inference.

For example, for StarCoder, we had the code file, but before the code we added a token that indicates the repository name, another token that indicates the file name, and another one for the number of stars. And this is interesting, because for StarCoder and other code models, I guess their main use case is to be plugged into an IDE, for example VS Code.

And when you're using them, it could be interesting to prepend the code file with the name of the file, for example, so that the model knows this is a Python file. If it's in another language, when you add the file name and you have the extension, it can know which language it should generate code in.

We also added a GitHub stars token, and we tried to play with it, like saying this file has 100 stars and seeing if the model would generate higher-quality code than if it were to generate for zero stars. We didn't really find any differences during inference, but it was fun to add all this formatting.
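To illustrate, the formatting described above could be built like this; the token names follow the ones documented for StarCoder, but treat the exact strings as an assumption and adjust them to whatever special tokens your tokenizer defines:

```python
# Sketch of the metadata formatting described above. The token names
# follow the ones documented for StarCoder (<reponame>, <filename>,
# <gh_stars>), but treat them as an assumption and use whatever special
# tokens your tokenizer actually defines.

def format_file(repo: str, path: str, stars: int, code: str) -> str:
    return f"<reponame>{repo}<filename>{path}<gh_stars>{stars}\n{code}"

sample = format_file(
    repo="bigcode/example-repo",        # hypothetical repository
    path="utils/math_helpers.py",
    stars=100,
    code="def add(a, b):\n    return a + b\n",
)
print(sample)
```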

For StarCoder2, one of the improvements was that it was repository-aware. When we have GitHub repositories, some files in the same repository are related to each other. But when we built the Stack v1, we just shuffled files, so we didn't keep this repository structure.

And when we trained the model, we just shuffled them, and the model did not know if two files belonged to the same repository. But for StarCoder2, we tried to keep files that are in the same repository next to each other. And we did that by concatenating them with some special tokens, like a file separator token that basically separates files.

And this way, the model can kind of know which files are in the same repository and try to find links between them.
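A sketch of this repository-aware packing, using a repository-name token and a file-separator token in the spirit of the StarCoder2 format (the exact token strings are an assumption here):

```python
# Sketch of repository-aware packing: files from the same repository are
# concatenated with a file-separator token so the model can attend across
# them. The <repo_name> / <file_sep> strings follow the StarCoder2
# convention but should be treated as assumptions here.

def pack_repository(repo_name: str, files: dict[str, str]) -> str:
    parts = [f"<repo_name>{repo_name}"]
    for path, code in files.items():
        parts.append(f"<file_sep>{path}\n{code}")
    return "".join(parts)

packed = pack_repository(
    "bigcode/example-repo",                     # hypothetical repository
    {"model.py": "class Model:\n    pass\n",
     "train.py": "from model import Model\n"},
)
print(packed)
```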

On the tooling side, Hugging Face has released datatrove for data processing, nanotron for training with 3D parallelism, and lighteval for doing the evaluation. So this is kind of a good stack to be able to run your full trainings, but also your ablation models: you can do the filtering with datatrove, then train with nanotron and evaluate with lighteval. They're well integrated together and they make one ecosystem. So that's for general LLMs. For code LLMs, we also released the code we used for both the Stack and the StarCoder models under the BigCode organization on GitHub.

And I think we just answered our third question, which was how to filter the data. So now you know, first, how much data you need, then where you can get this data, whether web, code, synthetic, or curated, and also how you can properly filter the data and test the filtering techniques you have in mind.

So now, let me tell you a little bit more about code LLMs, because that's kind of what I'm working on. And I'm trying to give you a little bit of an overview about these models so that you know how to train good LLMs, but you also know how to build very cool code assistants and completion models.

So how all of this started was when GitHub Copilot was released. And it was very interesting, because it was so much better than all the other code completion models that were before it, which were very small and much less performant. And GitHub Copilot was using the Codex model by OpenAI.

And they just showed that you can train a code LLM in the same way that you train an LLM for English, for example. You can just take a large transformer model and give it a lot of code data, and it will learn this code. Because before, a lot of people were trying to treat code very differently, for example, by using abstract syntax trees.

But what the Codex model showed is that you can treat code like text. And if you want to predict the next line, you can predict the next text. You just do next token prediction, and you get your code. It works very well, much better compared to the more feature-engineered techniques.

And that was over two years ago, and we didn't have any good open-code models. But today, if you go to the hub, you can find that we have over 1,700 models that are trained on code. So these are models that are either trained only on code or LLMs that included code as part of their training.

So you can see that we've made a lot of progress in this code generation field, which is amazing. And this is the result of the community's work to build very good instruction-tuned models and base models. For example, here, as you can see in the leaderboard, we have some very strong models that score almost 80% on the code evaluation benchmark HumanEval, which means they get almost 80% of the problems right, which is a very large number.

And when talking about the landscape of open code LLMs, in BigCode we have released the Stack data set, which is now the default data set for training on code, and also the StarCoder and StarCoder2 families of models and other instruction-tuned models with the H4 team, like StarChat2. Meta also released some very good code models, which are the Code Llama series of models that go from 7B to 70B.

There are also the DeepSeek models, which are very strong. And we have other models, like the recent Granite models from IBM, CodeQwen, CodeGen, and StableCode. So there are different providers for code LLMs and also for code data sets. And the main reason we started the BigCode collaboration and trained the StarCoder models was to kind of have a collaboration with full data transparency.

We released all the details about the training, but also the data is public so that people can inspect it and use it. And we also have the code for the processing and the model weights. And the collaboration was open. We had over 1,000 researchers joining our Slack and following the journey with us.

And this kind of created a BigCode ecosystem where the stack was used in the pre-training of a lot of prominent code models, like CodeGen and StableCode. And the StarCoder models were used as basis for a lot of community fine tunings. And I think it's very important to be aware of what makes a release of an LLM, whether it be a code LLM or a general LLM, open and responsible.

And I think this is fourfold. First, it's really good for the community and for research in AI in general. If you can make open access data sets, this will mean having data inspection tools, but also opt-out tools to respect people's wishes regarding their data sets. For example, if they don't want to be included in the trainings, they should be able to opt-out.

It's also important to remove personally identifiable information. So an open release does not mean just releasing model weights and stopping there, but also making your work reproducible by fully documenting the pipeline for using these models, and also releasing tools for evaluation and technical reports that document the whole pipeline.

And for us in BigCode, we kind of went from SantaCoder, which was part of our ablations to understand how to filter the Stack data set. Then we went to StarCoder, which was released last year, a 15 billion parameter code generation model. And then this year, we released StarCoder2, which was trained on many more programming languages and had a much higher evaluation score.

And StarCoder was also rated as the most transparent model by the Stanford Foundation Model Transparency Index, which is really rewarding, given the efforts that we put into data governance and into making the model release as transparent as possible. Regarding evaluation, for example, StarCoder 15B, when it was released, was the state-of-the-art code model.

And this was also the case for StarCoder2-15B among other 15B models. And it was even close to or better than larger models. I don't have the plot here, but it was matching CodeLlama-34B and it was close to DeepSeek-Coder 33B on some benchmarks. And here, for example, you can see the results on different benchmarks.

Because when releasing a model, it's really important that you don't just report a number on one benchmark, but add as many benchmarks as you can. In case you had contamination on one benchmark, although we tried to avoid it, there's a very low chance that you also had contamination on all the other benchmarks.

And it also allows you to fully understand how your model behaves if you add more evaluation benchmarks. And I think that's just a good practice that everyone should be doing with their releases. So with the StarCoder models, we also released some tooling, like a VS Code extension, which has a membership test that tries to see if the generated code was in the training data and highlights that to the user.

So that's part of our code attribution efforts for these code models. Maybe you're interested in using these models to build your own personal copilot and fine-tune StarCoder or Code Llama or other models on your personal code bases. To do that, there's a very nice blog post by Sourab and Sayak, where they take a code model, train it on the Hugging Face internal libraries, and then deploy it with Ollama to have a local code assistant.

And the pipeline is very similar to what we did in pre-training. First, you take your data set, you try to filter out the things you don't want to keep, and then you do the deduplication and you train your model. In this case, it will just be a fine-tuning, so it will be much quicker.

You can use libraries like PEFT, which does parameter-efficient fine-tuning, where you don't need to train all the parameters of your model, but only inject a few trainable parameters. This makes the training much faster. For example, a 7B model can be fine-tuned in a Google Colab.
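As a minimal sketch of this kind of parameter-efficient fine-tuning with the peft library (the base model and hyperparameters are just illustrative defaults, not the ones from that blog post):

```python
# Minimal LoRA fine-tuning setup with the peft library; the base model
# and hyperparameters are illustrative defaults, not the ones from the
# blog post mentioned above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "bigcode/starcoder2-3b"    # any causal LM checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()    # typically well under 1% of the weights
# From here, plug the model into your usual Trainer / TRL SFT loop.
```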

Now let's go back to evaluation. For LLMs, there's the Open LLM Leaderboard that evaluates models. There's also the LMSYS arena, which compares instruct models and uses human evaluation. For code models, one of the most popular benchmarks is HumanEval. It's basically a benchmark where you have a function that the model has to autocomplete.

And then when the function is completed, you take this solution, run it against multiple unit tests, and count how many solutions pass and how many fail. Then you compute a metric that we call pass@1, for example. This is the one that's reported in this leaderboard.
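For reference, pass@k is usually computed with the unbiased estimator from the Codex paper; a minimal sketch:

```python
# pass@k with the unbiased estimator from the Codex paper: generate n
# samples per problem, count the c that pass the unit tests, and estimate
# the probability that at least one of k samples passes.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples for one problem, 30 of them pass the tests.
print(f"pass@1  = {pass_at_k(200, 30, 1):.3f}")   # 0.150
print(f"pass@10 = {pass_at_k(200, 30, 10):.3f}")
```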

And this gives you the HumanEval score. There's also a translation of this benchmark into 18 other languages; here I show Java, JavaScript, and C++. This benchmark is called MultiPL-E. It allows you to see how well each model does on each programming language and choose the one that's the most interesting for you.

But these benchmarks usually have an issue of contamination and overfitting, especially for instruction-tuned models. I don't know if you've already checked what these data sets look like, but usually for code, there is an instruction that asks the model to generate an exercise. And often, if you look at them, they look really similar to HumanEval, which is function implementations.

So there's a very high chance of having contamination, which means having some files that look like HumanEval exercises in your instruction-tuning data set. So here, for example, this plot is from the LiveCodeBench leaderboard. And they find that some models may be overfitting on HumanEval. So their solution was to have a leaderboard called LiveCodeBench, where they regularly scrape new problems from contest platforms like LeetCode.

And they evaluate the models only on the problems that were released after the model release date. This way, they are sure that there is no contamination. And for example, that was the case here. They tried to evaluate these models on all the data they have. And then they compared the performance to the data that was only released after the model release.

And they found that some models were not consistent in their results. So that's one interesting thing to keep in mind. And this is another leaderboard that's interesting for comparing not just open models but also closed models like GPT-4, to see where the open-source community stands compared to these closed code models.

So that was my presentation. Thank you very much for your attention. And if you have any questions, I can answer them. Yes, thank you very much for the great insightful talk. So we have some questions here on Slido. I'm not sure if there are any in-person questions, or else I will get started with the Slido question.

Sure. OK, I guess not. So I'll ask some of the questions online. I think I had submitted some of these as well. It seems like there's some questions about synthetic data. Let me see. I was also wondering about this. So someone's asking, what are the consequences of training AI models on AI-generated synthetic data?

Do you foresee any problems with this? And there's a related question. Does synthetic data closely represent the natural distribution of language? I assume some low-quality data from humans is necessary for things like learning robustness and so forth. Yeah, sure. These are very great questions. So about the consequences of training models on AI-generated data, I can think of two main ones.

The first is reinforcing some biases, because models already have some biases, and if we train on data that is generated by them, we might be reinforcing them even more. The other thing is, for example, contamination. These models might generate content that looks like the evaluation benchmarks. And when you train on that, you will have contamination in your data.

So for example, one of the critiques of the Phi models is that, because people could not see the synthetic data and the models were very good on the benchmarks, they were very skeptical: are these models really good, or are they just overfitting on the benchmarks? So I think contamination and reinforcing biases are the main things to keep in mind.

And regarding synthetic data not being the same as web distribution, I think that's a very good point. And for example, when we were developing Cosmopedia, first we found that it was worse than the web. And it was surprising, because we spent a lot of time trying to curate this data set, which looks so much cleaner than the web.

And then adding some web data and trying to add more topics helped us compensate for some of the gaps. But adding some web data always gives you a performance boost. So yes, there is some noise and there are some specific patterns in web data that probably need to be included in the training mix to keep full coverage of what the natural distribution looks like.

So it sounds like you're saying a good training set would have a mix, potentially, of synthetic and natural data. Is that correct? Yeah, I think so. Some experiments we ran show that that's the case. Because you can try to spend some time to carefully curate the topics, but you'll probably be missing out on some things.

And the human intuition that we have is not always what works for training models. It seems that keeping some filtered web helps. And also, if you see the Phi technical reports, for example, in Phi 3, they insist a lot on filtering the web and including it in the pre-training.

And I think that now seems like maybe the best way to go. That makes sense. Great. Another question is, is RLHF-type preference data more important than unsupervised pre-training data? Should we spend more resources on RLHF data? Yeah, that's a good question. So for example, the unsupervised pre-training is mainly to get base models.

But then you can't use these base models as chat assistants. You need to do another step. So you can either do RLHF, but nowadays people are just doing instruction tuning without needing to go through RL, where you just train the model on pairs of instructions and solutions. And that seems to work very well.

And there are now some methods that don't use reinforcement learning but work as well, for example, DPO or ORPO. So I think if you want a chat assistant, you definitely need to run a supervised training on top of the unsupervised one. But it doesn't necessarily have to be RLHF.

There are some other algorithms now. Great, great. And here's a multimodal question. Does multimodal grounding, for example, including images and videos along with the text, reduce the need for so much text-only data? Yeah, what do you mean? I'm sorry. Oh, the question is asking, does multimodal grounding help? If you have images and videos along with the text, does this reduce the amount of text-only data required to train models?

So I probably can't answer that because I haven't tried. But for the multimodal models, for example Idefics, which was recently released, there's always a significant text portion. That seems to be the case for most vision and language models. But yeah, I don't really know about the percentages for each.

Right, OK. A more general question-- you probably touched upon some of this-- but are there any major differences between training text versus code models, other than the training data being different? Yes, that's a good question. So the training data is different. Regarding the training itself, we use a similar architecture.

For example, StarCoder used an architecture similar to Llama or Mistral. I think one thing that you probably want is long context, because if you want to use these models, for example in VS Code, and you want to add all the neighboring files in the context, you should be able to fit a very large context.

So we tried to do some long-context extension. But again, people also do this for general LLMs. We also care a lot about inference, so we used first MQA and then GQA to have faster inference. But these are also techniques that are implemented for general LLMs. So I'd say overall, it's very similar.

But yeah, maybe you should prioritize some things, like having a smaller model that can be used in IDEs with fast inference rather than a much larger model that would need heavier deployment. Yeah. All right, great. And here's also a general question. I guess they're asking for advice. So if you have a very tiny compute budget, for example, a single GPU, what would you recommend prioritizing?

Let's assume you're fine-tuning a model. Yeah, so I think, for example, now there are some great solutions for on-device deployment and fine-tuning. For example, you can run quantized models with llama.cpp or other frameworks. And with techniques like PEFT, you don't need to do full model fine-tuning.

And you should be able to run this on one GPU, even in a 7B model. So I think you should just find a very well-curated data set because quality is more important than quantity. And then use one of these techniques for easy fine tuning, and that should work. All right, great.

Here's a question asking-- I guess, different from pre-training, but they're saying, I'm guessing the optimal amount of training data depends heavily on the domain as well as the task at hand, right? Yes, probably. Right now, we're following the existing scaling laws. I think they tried to compare English to code, and they found that the findings still hold.

But maybe if you go to another domain, I don't know, like medical, things could change. And that's why I mentioned the DeepSeek paper, where they mentioned that it's really heavily dependent on the data. And for them, it was the same domain; they just changed data sets, like going from one generic data set to another well-curated one.

And things started changing. So I think that's probably the case, but it's underexplored how these scaling laws change depending on domains. So it's good to be aware of that when developing models for domains that are not explored by these-- Speaking of different domains, code versus text, someone's asking, what are some of the interesting differences between tokenizing for general purpose, like text, versus for code generation?

Yeah, so when we were training the tokenizer, I think one thing that was important to keep was number splitting. We used standard BPE and trained it on the data set we were using for training, so our code mixture. And we did some analysis to see if there were any outliers, any tokens that were underrepresented or overrepresented, as sanity checks.
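As a small sketch of that kind of tokenizer training with the tokenizers library, with digit splitting enabled; the vocabulary size and special tokens are illustrative, not the exact StarCoder settings:

```python
# Training a byte-level BPE tokenizer with digit splitting, using the
# `tokenizers` library. The vocabulary size and special tokens are
# illustrative, not the exact StarCoder settings.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),   # split numbers digit by digit
    pre_tokenizers.ByteLevel(add_prefix_space=False),
])

trainer = trainers.BpeTrainer(
    vocab_size=49152,
    special_tokens=["<|endoftext|>", "<fim_prefix>", "<fim_middle>", "<fim_suffix>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

def corpus():  # stand-in for iterating over the code training mixture
    yield "def add(a, b):\n    return a + b\n"
    yield "for i in range(1024):\n    print(i)\n"

tokenizer.train_from_iterator(corpus(), trainer=trainer)
print(tokenizer.encode("x = 12345").tokens)  # digits come out as separate tokens
```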

But overall, it's very close to training a text tokenizer. And now most LLM tokenizers have a significant code portion, so they're also trained on a lot of code. And at the end, you can use a tokenizer trained for general LLMs on code, or the other way around, because even in code you have a lot of Markdown.

So there's a lot of English. So you end up representing all the English tokens, for example, in your code tokenizer. I agree. And here's the question about fine tuning, I guess, compared to pre-training. So they're asking, do the same principles apply for fine tuning? Or do you make a different or additional recommendation?

So yeah, for fine tuning, I think when you're preparing the data, it's probably a different thing. You're not going to train on all of the stack. You probably want to continue training on specific language. So maybe you could invest more time to even heavily filter. Because for fine tuning, you don't need as much data as for pre-training.

For example, for us, some of the filters we tried didn't work out because they removed a lot of data, and we did not have enough left for our pre-training. While for fine-tuning, for example for instruction tuning, there was the LIMA paper, where they instruction-tuned on only 1,000 instructions.

And they had a model that was much better than training on millions of samples. So I think data curation is even much more important when it comes to fine tuning. Great, great. One last question, I guess. So you might have also touched upon this briefly. But what are some considerations to make when publishing very large data sets and more nuanced or less known things to be aware of?

Yeah, so maybe on the technical side, really using tools for filtering and documentation. That is what we tried to do with the Stack. And maybe more on the governance side, make sure the licenses are respected and the copyrights are respected. Do you have an opt-out tool for your data set?

And maybe try to release it on the Hub to make it easily accessible for people. If there are some concerns, you could try to add a gate. For example, we released the data set that we used for PII detection, but we added a gating mechanism because it contained sensitive information.

So it's good to think of these kind of things in advance before releasing a data set. But yeah, in general, these are my advice. All right, great. Do we have any in-person questions? If not, then we can probably conclude. Thank you.