(upbeat music) - I'm very happy to be here. Thank you for the invitation. So I'm going to be talking about synthetic data in 2024, and then about small on-device models. I think the most interesting thing about synthetic data this year is that we now have it everywhere in the large language model pipeline.
I think initially synthetic data was mainly used just for post-training, because naturally that's the part where we needed human annotators to show the models how they should answer instructions, and how to be helpful and not toxic. And once we had LLMs that were really performant, we replaced the human annotators with synthetic data.
And then after that, we realized that we don't really have good benchmarks to measure whether models follow instructions well, whether they are creative enough, or whether they are chatty enough. So we also started using LLMs as judges. And I think this year, and towards the end of last year, we also moved to the pre-training part and started generating synthetic data for pre-training, to replace some parts of the web.
And the motivation behind that is that you have a lot of control over synthetic data. You can control your prompt and basically also the kind of data that you generate. So instead of just trying to filter the web, you could try to get the LLM to generate what you think the best web pages could look like and then train your models on that.
So this is how we went from having no synthetic data at all in the LLM pipeline to having it everywhere. And the cool thing is that today you can train an LLM with an entirely synthetic pipeline. For example, you can use our Cosmopedia datasets and train a 1B model on 150 billion tokens that are 100% synthetic.
And those are also of good quality. Then you can instruction-tune the model on a synthetic SFT dataset, you can also do DPO on a synthetic dataset, and then to evaluate if the model is good, you can use a benchmark that uses LLMs as judges, for example MT-Bench or AlpacaEval.
So I think this is really mind-blowing, because just a few years ago we wouldn't have thought this was possible. There are a lot of concerns about model collapse, and I'm going to talk about that later, but we'll see that if we use synthetic data properly and curate it carefully, that shouldn't happen.
And the reason synthetic data is very popular right now is that we have really strong models, both open and closed. It is really cheap and fast to use compared to human annotations, which cost a lot and take a lot of time. And also for open models right now, we have some really good inference frameworks.
So if you have enough GPUs, it's really easy to spin them up and generate a lot of synthetic data. Some examples are vLLM, TGI, and TensorRT-LLM. Now let's talk about the elephant in the room: model collapse. Is this the end? If you look at the media and, for example, some papers in Nature, it's really scary, because there's a lot of synthetic data out there on the web, and naturally we train on the web.
So we're going to be training on a lot of synthetic data, and if model collapse is going to happen, we should really take that seriously. And the other issue is that, as I said, a lot of people think the web is polluted because there's a lot of synthetic data in it.
For example, when we were building the FineWeb dataset, Guilherme and Hynek were interested in how much synthetic data there is on the web. There isn't really a method to properly measure the amount of synthetic data, or to say whether a webpage is synthetic or not. But one thing we can do is look for proxy words, for example expressions like "as a large language model" or words like "delve", that we know are frequently generated by ChatGPT.
We can measure the frequency of these words in our dataset and compare it across years. For example, here we measured the ratio of these words in different dumps of Common Crawl, and we can see that the ratio really increased after ChatGPT's release. If the amount of synthetic data hadn't changed, you would expect this ratio to stay constant, which is not the case.
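To make this concrete, here is a minimal sketch of that kind of measurement; the marker list, the documents, and the counting logic are illustrative rather than our exact pipeline:

```python
# Minimal sketch (not our exact pipeline): count how often ChatGPT-flavored
# marker phrases appear in a list of documents, e.g. one list per Common Crawl dump.
import re

MARKERS = [
    "as a large language model",
    "as an ai language model",
    "delve",
]
PATTERN = re.compile("|".join(re.escape(m) for m in MARKERS))

def marker_ratio(documents):
    """Fraction of documents containing at least one marker phrase."""
    hits = sum(1 for doc in documents if PATTERN.search(doc.lower()))
    return hits / max(len(documents), 1)

# Hypothetical usage: compare dumps before and after ChatGPT's release.
dump_2021 = ["Some ordinary web page about cooking.", "Another page."]
dump_2024 = ["As a large language model, I cannot...", "Let's delve into this topic."]
print(marker_ratio(dump_2021), marker_ratio(dump_2024))
```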
So there's probably a lot of synthetic data on the web, but does this really make models worse? What we did is train different models on these different dumps, compute their performance on popular NLP benchmarks, and then compute an aggregated score. And surprisingly, you can see that the latest dumps are actually even better than the earlier ones.
So if there's some synthetic data in there, at least it did not make the models worse, which is really encouraging. Personally, I wouldn't say the web is poisoned by synthetic data; maybe it's even making it richer. And the issue with the model collapse studies is that they were done at a small scale: you would ask the model to complete, for example, a Wikipedia paragraph, then train it on these new generations, and do that iteratively.
I think if you take that approach, it's normal to observe this kind of behavior, because the quality is going to be worse: the model is already small, and if you train it only on its own generations, you shouldn't expect it to become better. But what we're really doing here is taking a model that is very large and distilling its knowledge into a smaller model.
And in this way, you can expect to get better performance for your small model. Using synthetic data for pre-training became really popular after the Textbooks Are All You Need paper, where Microsoft trained a series of small models on textbooks that were generated by a large LLM.
And they found that these models were actually better than models that are much larger. So this was really interesting, it was a first of its kind, but it was also met with a lot of skepticism, which is a good thing in research, it pushes you to question things.
Because the dataset they trained on was not public, people were not really sure whether these models were really good or whether there was just some data contamination, and it was really hard to check when you only have the model weights. At Hugging Face, because we're open source, we tried to reproduce what they did.
So this is our Cosmopedia dataset. We basically tried to follow a similar approach to what they documented in the paper, and we created a synthetic dataset of textbooks, blog posts, and stories totaling almost 30 billion tokens, and we trained some models on that. We found that the key ingredient for getting a good synthetic dataset is keeping it as diverse as possible.
Because if you just throw the same prompt at your model, like "generate a textbook about linear algebra", the textbooks are going to look alike even if you change the temperature, so there's no way you could scale to millions of samples. The way you do that is by creating prompts that have some seeds that make them diverse.
In our case, we would ask the model to generate a textbook, but make it related to an extract from a webpage, and we would also frame it to stay within a topic. For example, here we put an extract about cardiovascular bioimaging, and then we ask the model to generate a textbook related to medicine that is also related to this webpage.
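The actual Cosmopedia prompts are in the public repository; as a rough illustration only, a seeded prompt could be assembled like this (the wording and function name are mine):

```python
# Rough illustration of a seeded textbook prompt (not the exact Cosmopedia template).
def build_textbook_prompt(seed_extract: str, audience: str = "college students",
                          topic: str = "medicine") -> str:
    return (
        f"Write a long and detailed textbook chapter about {topic}, "
        f"suitable for {audience}.\n"
        f"The chapter should be related to the following web extract, "
        f"but go well beyond it:\n\n{seed_extract}\n"
    )

seed = "Cardiovascular bioimaging allows clinicians to visualize the heart..."
print(build_textbook_prompt(seed, audience="middle school students"))
```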
This is a really nice approach because there are so many webpages out there, so you can be sure that your generations are going to be diverse when you change the seed example. One thing that's challenging with this is that you want the seed samples to be related to your topics.
So we used a search tool over the FineWeb dataset to find the pages that are related to the topics we're interested in. And then we also did a lot of experiments with the type of generations we wanted the model to produce, for example a textbook for middle school students or a textbook for college students.
And we found that some generation styles help on some specific benchmarks while others help on other benchmarks. For example, college textbooks are really good for MMLU, while middle school textbooks are good for benchmarks like OpenBookQA and PIQA. This is a sample from our search tool.
You have a top category, which is a topic, then some subtopics, and then the topic hits, which are the webpages in FineWeb that belong to these topics. And here you can see the comparison between Cosmopedia, where we had two versions, V1 and V2 in blue and red, and FineWeb.
As you can see, throughout the training, training on Cosmopedia was consistently better. So we managed to get a dataset that was actually good to train these models on. It's of course much smaller than FineWeb, only 30 billion tokens, but that's about the scale of Microsoft's dataset.
So we managed to reproduce a bit of what they did, and the dataset is public, so everyone can go there and check that everything is all right. And this is a recent paper from NVIDIA, Nemotron-CC. They took things a bit further and generated not a few billion tokens but 1.9 trillion tokens, which is huge.
And we'll see later how they did that; it's more about rephrasing the web. So today there are some really huge synthetic datasets out there, and they're public, so you can try to filter them even further if you want to get higher-quality corpora.
This rephrasing-the-web approach was suggested in a paper by Pratyush Maini, where they take some samples from the C4 dataset and then use an LLM to rewrite these samples into a better format. For example, they ask an LLM to rewrite the sample into a Wikipedia passage or into a Q&A page.
The interesting thing about this approach is that you can use a model that is small, because rewriting doesn't require knowledge, it's just rewriting a page into a different style. So the model doesn't need extensive knowledge of what it's rewriting, compared to asking a model to generate a new textbook without giving it any grounding.
So here they rewrite some samples from C4 into Q&A and into Wikipedia style, and they find that doing this works better than training just on C4. What they did in Nemotron-CC is a similar approach: they rewrite some pages from Common Crawl for two reasons. One is to improve pages that are low quality.
So they rewrite them into, for example, a Wikipedia-style page, so they look better. The other reason is to create more diverse data: they take a dataset that they have already heavily filtered, and then they take these pages that are already high quality and ask the model to rewrite them in Q&A format, with open-ended questions or multiple-choice questions.
This way they can reuse the same page multiple times without worrying about duplicates, because it's the same information written differently. So I think that's also a really interesting approach for generating synthetic data just by rephrasing the pages that you already have.
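Both papers give their exact prompts; this is only a hedged sketch of what a style-conditioned rewriting prompt might look like, with wording that is mine rather than theirs:

```python
# Hedged sketch of web rephrasing prompts (the actual WRAP / Nemotron-CC prompts differ).
STYLES = {
    "wikipedia": "Rewrite the following web page as a clear, factual Wikipedia-style passage.",
    "qa": "Rewrite the following web page as a list of question-and-answer pairs covering its content.",
}

def build_rephrase_prompt(page_text: str, style: str = "wikipedia") -> str:
    return f"{STYLES[style]}\n\nWeb page:\n{page_text}\n\nRewritten version:"

page = "buy cheap widgets!! best widgets 2019 ... widgets are small mechanical parts used in..."
print(build_rephrase_prompt(page, style="qa"))
```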
There's also an approach called ProX, where they start from a webpage and then generate a program that rewrites that page to make it better and less noisy. For example, here you can see that there's some leftover metadata in the webpage, and you don't necessarily want to keep that for training your model.
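As a rough illustration, and very much my own sketch rather than an actual program produced by ProX, such a cleanup program could be as simple as dropping obviously boilerplate lines:

```python
# My own sketch of what a generated cleanup program might look like (not ProX's output).
import re

def clean_page(lines):
    cleaned = []
    for line in lines:
        # Drop leftover metadata such as timestamps, share buttons, navigation crumbs.
        if re.match(r"^(posted on|share this|home\s*>|tags?:)", line.strip().lower()):
            continue
        cleaned.append(line)
    return "\n".join(cleaned)

page = ["Posted on 2021-03-02 by admin", "Share this article",
        "Widgets are small mechanical parts..."]
print(clean_page(page))
```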
So they train a model that can generate programs like this, which normalize the page and remove extra lines. I think this approach is also interesting, but it's maybe less scalable than the approaches I presented before. So that was it for rephrasing and generating new textbooks. Another approach that I think is really good, and becoming really popular for using synthetic data in pre-training, is building better classifiers for filtering the web.
For example, here we released a dataset called FineWeb-Edu, and the way we built it is by taking Llama 3 and asking it to rate the educational content of webpages from zero to five. So for example, if a page is a really good textbook that could be useful in a school setting, it would get a really high score.
And if a page is just an advertisement or promotional material, it would get a lower score. Then we take these synthetic annotations and train a classifier on them, a small BERT-like model, and we run this classifier on all of FineWeb, which is a 15-trillion-token dataset.
Then we only keep the pages that have a score of three or higher. So in our case, we went from 15 trillion tokens to just 1.5 trillion tokens that are really highly educational.
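As a sketch of how you could apply this kind of filter yourself, assuming the released FineWeb-Edu classifier id on the Hub (check the dataset page if in doubt) and treating the threshold as configurable:

```python
# Sketch of educational-quality filtering with a regression classifier.
# The model id below is assumed to be the classifier released with FineWeb-Edu;
# treat this as a sketch, not the exact production pipeline.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "HuggingFaceFW/fineweb-edu-classifier"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

def edu_score(text: str) -> float:
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze().item()  # regression head: roughly a 0-5 score

pages = ["An introduction to photosynthesis for high school biology...",
         "LIMITED OFFER!!! Buy now and save 50% on widgets!!!"]
kept = [p for p in pages if edu_score(p) >= 3]
print(len(kept), "pages kept")
```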
And as you can see here, FineWeb-Edu outperforms all the other public web datasets by a large margin on a couple of benchmarks; here I show the aggregated score. You can see that this approach is really effective for filtering web datasets to get better corpora for training your LLMs. Others also tried this approach. There's, for example, the DCLM dataset, where they also trained a classifier, but not to detect educational content.
Instead, they trained it on the OpenHermes dataset, which is a dataset for instruction tuning, and on the ELI5 subreddit. They also get a really high-quality dataset, which is very information-dense and can help you train some really good LLMs. And then Nemotron-CC also used this approach, but instead of using one classifier, they used an ensemble of classifiers.
So they use, for example, the DCLM classifier and also classifiers like the one we used for FineWeb-Edu, and then they combine these scores with an ensemble method to retain only the highest-quality pages. And they get a dataset that works even better than the ones we developed.
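Nemotron-CC describes its exact recipe in the paper; a hedged sketch of combining several classifier scores could be as simple as this:

```python
# Hedged sketch of combining several quality classifiers (not Nemotron-CC's exact recipe).
def ensemble_keep(page_scores: dict, thresholds: dict) -> bool:
    """Keep a page if any classifier rates it above its own threshold.

    page_scores / thresholds map classifier names to scores, e.g.
    {"edu": 3.2, "dclm": 0.85}. The real pipeline may bucket or rank instead.
    """
    return any(page_scores[name] >= thresholds[name] for name in thresholds)

thresholds = {"edu": 3.0, "dclm": 0.8}
print(ensemble_keep({"edu": 2.1, "dclm": 0.9}, thresholds))  # True: the DCLM classifier likes it
print(ensemble_keep({"edu": 1.0, "dclm": 0.2}, thresholds))  # False: rejected by both
```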
So that was it for synthetic data for pre-training. Now we can go back to post-training. I think there are a lot of interesting post-training datasets out there. One that was released recently is AgentInstruct by Microsoft, where they basically try to target some specific skills and improve the performance of models on them.
For example, here you can see code, brain teasers, and open-domain QA, and they managed to get a dataset such that fine-tuning Mistral 7B on it outperforms the original instruct model that was released by Mistral. And as I said, to get good synthetic data, you really have to have a framework to make sure that your data is diverse.
So for example, they always seed the generations with either source code or raw text documents, and then they rewrite them to make sure they're easier to generate instructions from, and they use that for their instruction data generation. There's also the Tulu 3 SFT mixture, which was released recently by Allen AI.
It's also really good quality and it covers a wide range of tasks. The way they make sure that this dataset is diverse is by using personas from the Persona Hub dataset, which is basically a dataset of, I think, over a million personas. For example, in the Tulu 3 mixture, to generate a new code snippet, they would give the model a persona, for example a machine learning researcher interested in neural networks, and then ask it to generate a coding problem.
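As a hedged sketch of persona-seeded generation (the persona list and prompt wording here are made up, not the actual Tulu 3 templates):

```python
# Hedged sketch of persona-seeded instruction generation (prompt wording is mine).
import random

personas = [
    "a machine learning researcher interested in neural networks",
    "a high school chemistry teacher preparing lab exercises",
    "a backend engineer who loves functional programming",
]

def build_persona_prompt(task: str) -> str:
    persona = random.choice(personas)
    return (
        f"You are {persona}.\n"
        f"Write a {task} that this persona would plausibly come up with, "
        f"then provide a complete solution."
    )

print(build_persona_prompt("challenging coding problem"))
```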
This way you make sure that your dataset is really diverse, and then you can further filter the dataset, for example using reward models. We also released a dataset called SmolTalk, where we also tried to cover a wide range of tasks. And as you can see here, when fine-tuning Mistral 7B on this dataset, we also outperformed the original Mistral instruct on a number of benchmarks, notably on mathematics and on instruction following with IFEval.
Another paper that I wanted to mention is the Multilingual Data Arbitrage paper by Cohere. They want to generate a multilingual dataset for post-training, and they have a really interesting problem: there isn't one model that's really good at all the languages they wanted.
So what they do is use not just one teacher model, but multiple teachers, and then they have a router which sends the prompts they have to these models. Then they get the completions, and they have a reward model that scores all these generations and only keeps the best one.
This is like arbitrage in finance. What I think was interesting here is that it shows that synthetic data doesn't have to come from a single model. Because we have so many good models now, you can pool these models together and get a dataset that's of really high quality, that's diverse, and that covers all your needs.
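A minimal sketch of that arbitrage loop, with placeholder teacher and reward functions standing in for the real models (the actual pipeline also routes prompts to language-appropriate teachers first):

```python
# Hedged sketch of multi-teacher "arbitrage": several teachers answer, a reward
# model picks the best completion. The callables below are placeholders.
from typing import Callable, List

def arbitrage(prompt: str,
              teachers: List[Callable[[str], str]],
              reward: Callable[[str, str], float]) -> str:
    completions = [teacher(prompt) for teacher in teachers]
    scores = [reward(prompt, c) for c in completions]
    return completions[scores.index(max(scores))]

# Toy stand-ins for real teacher models and a reward model.
teacher_a = lambda p: f"[teacher A answer to: {p}]"
teacher_b = lambda p: f"[teacher B answer to: {p}]"
toy_reward = lambda p, c: len(c)  # a real reward model would score quality
print(arbitrage("Explique la photosynthèse.", [teacher_a, teacher_b], toy_reward))
```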
I was supposed to put a meme there, but I ran out of time. Yeah, so that was it for synthetic data. Now we can see what's happening in the small models field in 2024. I don't know if you know, but we now have some really good small models.
For example, Llama 3.2 1B matches Llama 2 13B, which was released last year, on the LMSYS Arena, which is basically the default go-to leaderboard for evaluating models using human evaluation. And as you can see here, the scores of the two models are really close. So I think we've made a huge leap forward in terms of small models.
Of course, that's just one data point, but there's more. For example, if you look at this chart from the Qwen 2.5 blog post, it shows that today we have some really good models that are only 3 or 4 billion parameters and that score really high on MMLU, which is a really popular benchmark for evaluating models.
You can see here that the blue dots have more than 65 on MMLU and the gray ones have less, and for example Llama 33B had less. So now we have a 3B model that outperforms a 33B model released earlier on the MMLU benchmark. I think people are starting to realize that we shouldn't just scale and scale models, but should try to make them more efficient.
I don't know if you knew, but you can also chat with a 3B+ model on your iPhone. For example, here, this is an app called PocketPal, where you can go and select a model from Hugging Face; it has a large choice. Here, we loaded Phi-3.5, which is 3.8 billion parameters, on this iPhone, and we can chat with it.
And you can see that the latency is also acceptable. For example, here I asked it to give me a joke about NeurIPS, so let's see what it has to say. Okay: why did the neural network attend NeurIPS? Because it heard there would be a lot of layers and fun, and it wanted to train its sense of humor.
So not very funny, but at least it can run on device. I think we now have good small models, but we also have good frameworks and tools to use them, so we're really close to having edge and on-device models that are really good.
And I think for a while we've had this narrative that just training larger models is better. Of course, this is supported by scaling laws: as you can see here, when we scale the model size, the loss is lower, and obviously you get a better model. We can see this, for example, in the GPT family of models, where we went from about 100 million parameters to more than a trillion parameters.
And of course, we all observed the performance improvement when using the latest model. But one thing that we shouldn't forget is that when we scale the model, we also scale the inference costs and time. And so the largest models are gonna cost so much more. So I think now, instead of just building larger models, we should be focusing on building more efficient models.
It's no longer a race for the largest models, since these models are really expensive to run, and they require a really good infrastructure to do that, and they cannot run on, for example, consumer hardware. And when you try to build more efficient models that match larger models, that's when you can really unlock some really interesting on-device use cases.
And I think a trend that we're noticing now is training smaller models for longer. For example, if you compare how long Llama 1 was trained compared to Llama 3, there is a huge increase in pre-training length: Llama 1 was trained on 1 trillion tokens, but Llama 3 8B was trained on 15 trillion tokens.
So Meta managed to get a model that's the same size but performs so much better, by choosing to spend more compute during training, because as we know, training is a one-time cost but inference is ongoing. If we want to see what the small model trends are in 2024, I think this MobileLLM paper by Meta is interesting.
They study different models that have less than 1 billion parameters and try to find which architecture makes the most sense for these models. For example, they find that depth is more important than width, so it's more important to have models with more layers than to make them wider.
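As a concrete illustration of the kind of deep-and-narrow configuration they favor, together with the GQA and tied-embedding choices I mention next, here is a made-up config in transformers; the numbers are illustrative, not the paper's exact settings:

```python
# Illustrative deep-and-narrow sub-billion configuration (made-up numbers, not MobileLLM's).
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32000,
    hidden_size=1024,          # narrow
    intermediate_size=2816,
    num_hidden_layers=30,      # but deep
    num_attention_heads=16,
    num_key_value_heads=4,     # grouped-query attention
    tie_word_embeddings=True,  # tied input/output embeddings
)
model = LlamaForCausalLM(config)
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```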
They also find that GQA helps and that tying the embeddings helps. So I think it's a nice study overall for models that are just a few hundred million parameters. There's also the Apple Intelligence tech report, which is interesting. For Apple Intelligence, they had two models, one on the server and another on device.
It had 3 billion parameters. And I think the interesting part is that they trained this model using pruning and then distillation. And for example, they have this table where they show that using pruning and distillation works much better than training from scratch. And they also have some interesting insights about how they specialize their models on specific tasks.
Like, for example, summarization and rewriting. There's also this paper by NVIDIA that was released recently. I think you've already had a talk about hybrid models, which was also interesting. For this model, they used a hybrid architecture between state space models and transformers, and they managed to train a 1B model that's really performant without needing to train it on a lot of tokens.
Regarding our work, we just recently released SmolLM2, a series of three models which are the best in class in each model size. For example, our 1.7B model outperforms Llama 3.2 1B and also Qwen 2.5 1.5B. The way we managed to train this model is that we spent a lot of time trying to curate the pre-training datasets.
We did a lot of ablations, trying to find which datasets are good and also how to mix them. We also created some new math and code datasets that we're releasing soon. We basically spent a lot of time trying to find the best mixture to train these models on.
We also trained these models for very long. For example, SmolLM1 was trained only on 1 trillion tokens, but this model is trained on 11 trillion tokens, and we saw that the performance kept improving. The models didn't really plateau mid-training, which I think is really interesting.
It shows that you can train such small models for very long and keep getting performance gains. What's interesting about SmolLM2 is that it's fully open: we also released the pre-training code base, the fine-tuning code, the datasets, and the evaluation in this repository. There are also really interesting small models not just for text but also for vision.
For example, here you can see SmolVLM, which is a 2B model that's really efficient: it doesn't consume a lot of RAM and it also has good performance. There's also Moondream 0.5B, which was released recently; it's the smallest vision language model, and as you can see, there isn't a big trade-off compared to Moondream 2B.
So now I've shown you that we have some really good small models, and we also have the tools to use them, but why should you consider using small models, and when? I think small models are really interesting because they can run on device: because these models are small and they run fast, you can run them on your laptop but also on your mobile phone.
And this means that your data stays local: you don't have to send your queries to third parties, which really enhances privacy. That was, for example, one of the big selling points for Apple Intelligence. Also, right now we have so many frameworks for on-device inference, for example MLX, MLC, llama.cpp, and Transformers.js.
So we have a lot of options, and each of them has great features. Small models are also really powerful if you choose to specialize them. For example, there's a startup called NuMind, which took a small LLM and fine-tuned it on text-extraction datasets.
And they managed to get a model that's not very far from models that are much larger. So I think text extraction is one use case where small models can be really performant, and it makes sense to use them instead of larger models. You can also chat with these models in the browser.
For example, here you can go there, load the model, even turn off your internet, and just start chatting with the model locally. Speaking of text extraction, if you don't want to fine-tune the models, there's a really good method called structured generation, where you can basically force the model to follow a JSON schema that you define.
For example, here we try to force the model to follow a schema for extracting key information from GitHub issues. So we can input free text, which is a complaint about a GitHub repository, something not working, and then you can run it there and the model extracts everything that is relevant for creating your GitHub issue.
For example, the priority, which here is high, the type of the issue, a bug, and then a title and an estimate of how long this will take to fix. And you can do this right in the browser: you can transform your text into a GitHub issue that's properly formatted.
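As a sketch of the kind of schema behind such a demo (the field names here are illustrative, not the exact ones used), you can define it with pydantic and hand the resulting JSON schema to a constrained-decoding library such as outlines or llama.cpp grammars:

```python
# Sketch of the kind of JSON schema used to constrain the model's output (pydantic v2).
# Field names are illustrative, not the exact demo's schema.
from enum import Enum
from pydantic import BaseModel

class Priority(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"

class GitHubIssue(BaseModel):
    title: str
    issue_type: str          # e.g. "bug", "feature"
    priority: Priority
    estimated_hours: int

# Constrained-decoding libraries can take this JSON schema and force the model
# to generate output that validates against it.
print(GitHubIssue.model_json_schema())
```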
So what's next for synthetic data and small models? I think domain-specific synthetic data is already important and is going to be even more important. For example, generating synthetic data for math: I think this would really help improve the reasoning of a lot of models, and a lot of people are doing it, for example Qwen 2.5 Math, and everyone's trying to reproduce o1.
So I think for synthetic data, specializing it on some domains is going to be really important. And then for small models, I think specializing them through fine-tuning is also going to be really important, because a lot of companies are just using these large models because they are better.
But on some tasks, you can already get decent performance with small models, so you don't need to pay a much larger cost just to make your model better at your task by a few percent. And this is not just for text; I think it also applies to other modalities like vision and audio.
And I think you should also watch out for on-device frameworks and applications. For example, the app I showed, PocketPal, or Ollama, all these frameworks are becoming really popular, and I'm pretty sure we're going to get more of them in 2025, and users really like that. I should also share a hot take.
I think that in AI, we started with fine-tuning, for example trying to make BERT work on some specific use cases and really struggling to do that. Then we got models that are much larger, so we just switched to prompt engineering to get the models to solve our tasks.
I think we're going back to fine-tuning, where we realize these models are really costly and it's better to use a small model and try to specialize it. So I think it's a bit of a cycle, and we're going to start seeing more fine-tuning and less of just prompt engineering the models.
So that was my talk. Thank you for following. And if you have any questions, we can take them now. (audience applauding)