Best of 2024: Synthetic Data / Smol Models, Loubna Ben Allal, HuggingFace [LS Live! @ NeurIPS 2024]

Chapters
0:00 Introduction and Overview
0:18 Synthetic Data in 2024
1:09 Synthetic Data in Pre-Training
2:57 Model Collapse Concerns
4:11 Synthetic Data Quality and Benchmarks
8:51 Rephrasing and Textbook Generation
11:17 Synthetic Data for Filtering and Classification
13:28 Post-Training with Synthetic Data
16:17 Advancements in Small Models
18:17 On-Device and Efficient Models
25:14 Future Trends and Conclusion
00:00:08.900 | So I'm gonna be talking about synthetic data in 2024.
00:00:11.620 | And then I'm gonna be talking about small on-device models.
00:00:24.000 | I think initially synthetic data was mainly used
00:00:29.980 | where we needed human annotators to show the models
00:00:38.360 | And when we had LLMs that were really performant,
00:00:46.820 | we realized that we don't really have good benchmarks
00:00:49.500 | to measure if models follow instructions well,
00:00:52.780 | if they are creative enough, or if they are chatty enough.
00:00:59.380 | And I think this year and towards the end of last year,
00:01:05.380 | and we started generating synthetic data for pre-training
00:01:12.660 | is that you have a lot of control over synthetic data.
00:01:16.740 | and basically also the kind of data that you generate.
00:01:23.860 | what you think the best web pages could look like
00:01:28.100 | So this is how we went from not having synthetic data
00:01:30.580 | at all in the LLM pipeline to having this everywhere.
00:01:36.540 | you can train an LLM with like an entirely synthetic pipeline.
00:01:40.580 | For example, you can use our Cosmopedia datasets
00:01:42.820 | and you can train a 1B model on like 150 billion tokens
00:01:57.660 | you can use a benchmark that uses LLMs as a judge,
00:02:03.420 | So I think this is like a really mind blowing
00:02:08.860 | And I think there's a lot of concerns about model collapse
00:02:13.020 | but we'll see that like if we use synthetic data properly
00:02:15.900 | and we curate it carefully, that shouldn't happen.
00:02:18.840 | And the reason synthetic data is very popular right now
00:02:36.100 | we have some really good inference frameworks.
00:02:47.320 | Now let's talk about the elephant in the room,
00:02:58.660 | it's really scary because there's a lot of synthetic data
00:03:04.240 | So we're gonna be training on a lot of synthetic data.
00:03:14.860 | I think a lot of people think the web is polluted
00:03:19.140 | And for example, when we're building FineWeb datasets,
00:03:23.180 | we're interested in like how much synthetic data
00:03:26.580 | So there isn't really a method to properly measure
00:03:33.820 | But one thing we can do is to try to look for like
00:03:41.900 | that we know are actually generated by ChatGPT.
00:03:44.500 | We could try to measure the amount of these words
00:03:47.340 | in our dataset and compare them to the previous years.
00:03:50.040 | For example, here, we measured the ratio of these words
00:03:54.820 | And we can see that like the ratio really increased
00:03:58.980 | So if we were to say that the amount of synthetic data didn't change
00:04:03.420 | you would expect this ratio to stay constant,
00:04:06.980 | So there's a lot of synthetic data probably on the web,
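As a rough illustration of the measurement described above, here is a minimal sketch of counting ChatGPT-associated marker words across dumps from different years; the word list and the toy documents are assumptions, not the ones behind the talk's plot.

```python
import re

# Hypothetical words whose frequency spiked after ChatGPT's release; the actual
# word list behind the talk's plot is not specified here.
MARKER_WORDS = {"delve", "showcasing", "underscores", "tapestry"}

def marker_word_ratio(documents):
    """Fraction of all tokens that are one of the marker words."""
    total = hits = 0
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())
        total += len(tokens)
        hits += sum(1 for t in tokens if t in MARKER_WORDS)
    return hits / max(total, 1)

# Toy stand-ins for web dumps from different years; if the amount of synthetic
# data were constant, this ratio should stay roughly flat over time.
dumps = {
    "2013": ["an old forum post about fixing a bicycle chain"],
    "2024": ["this article will delve into a rich tapestry of ideas, showcasing key points"],
}
for year, docs in dumps.items():
    print(year, marker_word_ratio(docs))
```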
00:04:12.040 | So what we did is we trained different models
00:04:22.320 | And surprisingly, you can see that the latest dumps
00:04:24.480 | are actually even better than the dumps that are before.
00:04:34.440 | So personally, I wouldn't say the web is polluted
00:04:40.180 | And the issue with like model collapse is that,
00:04:48.320 | and you would ask the model to complete, for example,
00:04:51.920 | and then you would train it on these new generations,
00:04:57.280 | it's normal to observe this kind of behavior,
00:05:02.400 | And then if you train it just on these generations,
00:05:13.920 | you can expect to get like better performance
00:05:25.600 | where Microsoft basically trained a series of small models
00:05:35.040 | were actually better than models that are much larger.
00:05:41.640 | but it was also met with a lot of skepticism,
00:05:48.240 | Because the dataset that they trained on was not public.
00:05:54.560 | or maybe there's just some data contamination.
00:06:01.560 | And as Hugging Face, because we're like open source,
00:06:07.880 | We basically tried to follow a similar approach
00:06:11.400 | And we created a synthetic dataset of textbooks
00:06:26.000 | is trying as much as possible to keep it diverse.
00:06:28.880 | Because if you just throw the same prompts at your model,
00:06:31.160 | like generate a textbook about linear algebra,
00:06:37.080 | So there's no way you could scale to millions of samples.
00:06:40.680 | And the way you do that is by creating prompts
00:06:48.560 | we would ask the model to generate a textbook,
00:06:50.880 | but make it related to an extract from a webpage.
00:06:54.160 | And also we try to frame it to stay within topic.
00:07:02.200 | and then we ask the model to generate a textbook
00:07:04.640 | related to medicine that is also related to this webpage.
00:07:14.760 | is not gonna be diverse when you change the seed example.
00:07:30.000 | and find the pages that are related to the topics
00:07:36.120 | with the type of generations we want the model to generate.
00:07:41.160 | for middle school students or a textbook for college students.
00:07:43.840 | And we found that like some generation styles
00:07:49.760 | For example, college textbooks are really good for MMLU,
00:07:52.640 | while middle school textbooks are good for benchmarks
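As a minimal sketch of the seeded prompting idea described above, here is how a topic, an audience style, and a web extract can be combined into one prompt; the template, topic, and audience labels are illustrative assumptions, not the actual Cosmopedia templates.

```python
# Illustrative only: the real Cosmopedia prompt templates, topic clusters, and
# audience styles differ from these made-up examples.
PROMPT_TEMPLATE = (
    "Write a {audience} textbook section about {topic}.\n"
    "Stay within the topic, but ground the content in this web extract:\n"
    "---\n{web_extract}\n---"
)

def build_prompt(topic: str, audience: str, web_extract: str) -> str:
    """Combine a topic, a target audience/style, and a seed web extract, so that
    changing the seed changes the generation and keeps the dataset diverse."""
    return PROMPT_TEMPLATE.format(topic=topic, audience=audience, web_extract=web_extract)

print(build_prompt(
    topic="medicine",
    audience="middle school",
    web_extract="A short page explaining how vaccines train the immune system...",
))
```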
00:07:56.840 | This is like a sample from like our search tool.
00:08:01.600 | For example, you have a top category, which is a topic,
00:08:10.760 | And here you can see the comparison between Cosmopedia.
00:08:14.640 | We had two versions, V1 and V2 in blue and red,
00:08:22.880 | training on Cosmopedia was consistently better.
00:08:27.160 | that was actually good to train these models on.
00:08:29.800 | It's of course so much smaller than FineWeb,
00:08:33.600 | but that's the scale that Microsoft's datasets were.
00:08:36.200 | So we kind of managed to reproduce a bit what they did,
00:08:39.240 | and the dataset is public, so everyone can go there,
00:08:43.880 | And this is the recent paper from NVIDIA, Nemotron CC.
00:09:04.640 | some really huge synthetic datasets out there,
00:09:08.520 | so like you can try to filter them even further
00:09:11.080 | if you wanna get like more high-quality corpora.
00:09:18.040 | this approach was suggested in this paper by Pratyush,
00:09:28.720 | to rewrite these samples into a better format.
00:09:31.880 | For example, they ask an LLM to rewrite the sample
00:09:46.520 | it's just rewriting a page into a different style.
00:09:49.240 | So the model doesn't need to have like knowledge
00:09:54.320 | compared to just asking a model to generate a new textbook
00:10:07.960 | And so what they did in Nemotron CC is a similar approach.
00:10:13.880 | They rewrite some pages from Common Crawl for two reasons.
00:10:18.320 | One is to like improve pages that are low quality.
00:10:22.400 | So they rewrite them into, for example, a Wikipedia page,
00:10:26.440 | And another reason is to create more diverse datasets.
00:10:29.680 | So they have a dataset that they already heavily filtered,
00:10:33.000 | and then they take these pages that are already high quality
00:10:42.960 | So this way they can reuse the same page multiple times
00:10:45.920 | without fearing like having multiple duplicates
00:10:52.440 | So I think that's also a really interesting approach
00:10:57.200 | just by rephrasing the pages that you already have.
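A minimal sketch of the rephrasing idea described above, in the spirit of that rephrasing paper and Nemotron CC; the prompt wording and style list are assumptions, not the papers' actual templates, and the model call is left as a placeholder.

```python
# Illustrative sketch of rephrasing-based augmentation: an instruct model only
# restyles content that is already on the page, so it needs less world
# knowledge than generating a new textbook from scratch.
STYLES = {
    "wikipedia": "Rewrite the following web page as a clear, factual Wikipedia-style article.",
    "qa": "Rewrite the following web page as a list of question-answer pairs.",
}

def rephrase_prompt(page_text: str, style: str) -> str:
    """Build a prompt asking an instruct model to rewrite an existing page in a given style."""
    return f"{STYLES[style]}\n\nPage:\n{page_text}"

page = "buy cheap widgets!!! our widgets r the best, click here, free shipping..."
for style in STYLES:
    prompt = rephrase_prompt(page, style)
    # rewritten = instruct_model.generate(prompt)  # plug in any capable LLM here
    print(prompt[:80])
```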
00:11:13.280 | that there's some leftover metadata in the webpage
00:11:19.240 | So they train a model that can generate programs
00:11:22.600 | that can like normalize and remove lines that are extra.
00:11:25.880 | So I think this approach is also interesting,
00:11:57.080 | the educational content of webpages from zero to five.
00:12:00.920 | So for example, if a page is like a really good textbook
00:12:10.040 | or promotional material, it would get a lower score.
00:12:13.320 | And then after that, we take these synthetic annotations
00:12:20.880 | And then we run this classifier on all of FineWeb,
00:12:25.920 | And then we only keep the pages that have like a score
00:12:31.000 | we went from 15 trillion tokens to just 1.5 trillion tokens.
00:12:37.160 | And as you can see here, FineWeb-Edu outperforms
00:12:40.280 | all the other public web datasets by a larger margin
00:12:47.840 | And you can see that this approach is really effective
00:12:50.040 | for filtering web datasets to get like better corpora
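A minimal sketch of that annotate-then-filter loop; the real FineWeb-Edu pipeline scores many more pages with an LLM and trains a classifier on embeddings, so the toy features, toy labels, and the keep-threshold of 3 below are illustrative assumptions.

```python
# Sketch of "annotate with an LLM, train a cheap classifier, filter the corpus".
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Step 1: a small set of pages with LLM-provided educational scores (0 to 5).
pages = [
    "Introduction to photosynthesis: plants convert light into chemical energy.",
    "BUY NOW!!! Limited-time offer on luxury watches, click here.",
    "A step-by-step proof of the Pythagorean theorem with worked examples.",
    "Celebrity gossip roundup: who wore it best this weekend?",
]
llm_scores = [4, 0, 5, 1]

# Step 2: train a cheap classifier to imitate the LLM annotator.
vec = TfidfVectorizer()
X = vec.fit_transform(pages)
clf = LogisticRegression(max_iter=1000).fit(X, [s >= 3 for s in llm_scores])

# Step 3: run the classifier over the full corpus and keep only "educational" pages.
corpus = [
    "How vaccines train the immune system, explained for students.",
    "Flash sale: 70% off sneakers today only!",
]
keep = [doc for doc, ok in zip(corpus, clf.predict(vec.transform(corpus))) if ok]
print(keep)
```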
00:13:07.320 | Instead, they trained it on OpenHermes dataset,
00:13:15.920 | And then they also get really high quality datasets,
00:13:22.800 | and can help you train some really good LLMs.
00:13:33.920 | So they use, for example, the DCLM classifier
00:13:46.000 | And they get a dataset that works even better
00:13:49.840 | So that was it for like synthetic data for pre-training.
00:14:03.840 | where they basically try to target some specific skills
00:14:07.080 | and improve the performance of models on them.
00:14:14.160 | and they managed to get a dataset that outperforms
00:14:40.880 | they're easier to generate instructions from.
00:14:57.280 | And the way they make sure that this dataset is diverse
00:15:00.240 | is by using personas from the Persona Hub datasets,
00:15:17.240 | and then ask it to generate like a coding problem.
00:15:19.920 | This way you make sure that your dataset is really diverse,
00:15:22.480 | and then you can further filter the datasets,
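A minimal sketch of persona-conditioned prompt generation in the spirit of Persona Hub; the personas and the task template below are made up for illustration.

```python
# Illustrative sketch: pairing many different personas with the same task
# template yields diverse instructions. These personas are invented examples.
personas = [
    "a retired air-traffic controller who loves board games",
    "a marine biologist studying coral reefs",
    "a high-school teacher preparing a robotics club",
]
TASK = "Write a short, self-contained coding problem that this person might care about."

def persona_prompts(personas, task):
    """One generation prompt per persona; a teacher LLM answers each of them."""
    return [f"You are {persona}. {task}" for persona in personas]

for prompt in persona_prompts(personas, TASK):
    # completion = teacher_model.generate(prompt)  # plug in any strong teacher model
    print(prompt)
```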
00:15:30.720 | and we also tried to cover the wide range of tasks.
00:15:37.880 | we also outperformed the original Mistral instruct
00:15:52.000 | called the Multilingual Data Arbitrage by Cohere.
00:15:55.600 | And basically they want to generate a dataset
00:16:01.960 | It's the fact that there isn't like one model
00:16:04.120 | that's really good at all the languages they wanted.
00:16:09.240 | not just one teacher model, but multiple teachers,
00:16:34.760 | and get like a dataset that's of a really high quality,
00:16:37.440 | and that's diverse, and that covers all your needs.
00:16:43.520 | I was supposed to put a meme there, but lack of time.
00:16:46.760 | Yeah, so that was it for like synthetic data.
00:17:01.000 | but like now we have some really good small models.
00:17:03.480 | For example, Llama 3.2 1B, it matches Llama 2 13B
00:17:08.480 | that was released last year on the LMSYS Arena,
00:17:11.680 | which is basically the default go-to leaderboard
00:17:14.160 | for evaluating models using human evaluation.
00:17:20.600 | So I think we've made like a huge leap forward
00:17:24.160 | Of course, that's just one data point, but there's more.
00:17:32.960 | it shows that today we have some really good models
00:17:35.640 | that are only like 3 billion parameters and 4 billion,
00:17:41.600 | which is a really popular benchmark for evaluating models.
00:17:47.360 | have more than 65 on MMLU and the gray ones have less.
00:17:55.000 | So now we have a 3B model that outperforms a 33B model
00:18:02.840 | So I think now people are starting to realize
00:18:05.760 | that like we shouldn't just scale and scale models,
00:18:08.840 | but we should try to make them more efficient.
00:18:14.760 | but you can also chat with a 3B+ model on your iPhone.
00:18:18.480 | For example, here, this is an app called PocketPal,
00:18:21.120 | where you can go and select a model from Hugging Face.
00:18:28.840 | which is 3.8 billion parameters on this iPhone,
00:18:33.840 | And you can see that even the latency is also acceptable.
00:18:37.600 | For example, here, I asked it to give me a joke
00:18:40.240 | about NeurIPS, so let's see what it has to say.
00:18:43.000 | Okay, why did the neural network attend NeurIPS?
00:18:49.480 | Because it heard there would be a lot of layers and fun,
00:18:54.760 | So not very funny, but at least it can run on device.
00:18:57.400 | Yeah, so I think now we have good small models,
00:19:02.160 | but we also have like good frameworks and tools
00:19:06.240 | So I think we're really close to having like really on-edge
00:19:12.440 | And I think for a while, we've had this narrative
00:19:18.280 | Of course, this is supported by the scaling laws.
00:19:23.440 | when we scale the model size, the loss is lower,
00:19:37.000 | And of course, we all observed the performance improvement
00:19:48.120 | And so the largest models are gonna cost so much more.
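For reference, the scaling-law picture being appealed to here is usually written in the Chinchilla-style parametric form (Hoffmann et al., 2022), where loss falls as a power law in both parameter count and training tokens; this formula comes from that literature, not from the talk itself.

```latex
% Chinchilla-style parametric scaling law: E is the irreducible loss,
% N the number of parameters, D the number of training tokens.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```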
00:19:51.480 | So I think now, instead of just building larger models,
00:19:56.320 | we should be focusing on building more efficient models.
00:19:59.120 | It's no longer a race for the largest models,
00:20:01.680 | since these models are really expensive to run,
00:20:04.040 | and they require a really good infrastructure to do that,
00:20:07.240 | and they cannot run on, for example, consumer hardware.
00:20:10.560 | And when you try to build more efficient models
00:20:21.920 | is the trend of training smaller models longer.
00:20:24.840 | For example, if you compare how long Llama was trained
00:20:29.720 | there is a huge increase in the pre-training length.
00:20:35.320 | but Llama 3 8B was trained on 15 trillion tokens.
00:20:38.600 | So Meta managed to get a model that's the same size,
00:20:43.760 | by choosing to make that sacrifice during training,
00:20:47.960 | because as we know, training is a one-time cost,
00:20:52.080 | If we wanna see what the small model trends were in 2024,
00:20:58.840 | I think this MobileLLM paper by Meta is interesting.
00:21:16.600 | that have more layers than just making them more wide.
00:21:26.120 | for models that are just a few hundred million parameters.
00:21:30.240 | There's also the Apple Intelligence Tech Report,
00:21:34.520 | So for Apple Intelligence, they had two models,
00:21:36.760 | one that was on server and another model that was on device.
00:21:49.200 | where they show that using pruning and distillation
00:21:52.080 | works much better than training from scratch.
00:21:56.360 | about how they specialize their models on specific tasks.
00:21:59.560 | Like for example, summarization and rewriting.
00:22:08.480 | I think you've already had a talk about hybrid models.
00:22:12.720 | And this model, they used a hybrid architecture
00:22:22.040 | without needing to train it on a lot of tokens.
00:22:32.840 | which are the best in class in each model size.
00:22:35.920 | For example, our 1.7B model outperforms Llama 1B
00:22:53.760 | We also created some new math and code datasets
00:23:11.240 | but this model is trained on 11 trillion tokens.
00:23:13.840 | And we saw that the performance kept improving.
00:23:15.800 | The models didn't really plateau mid-training,
00:23:19.400 | It shows that you can train such small models for very long
00:23:26.480 | What's interesting about SmolLM2 is that it's fully open.
00:23:36.280 | Also, there's really interesting small models for text,
00:23:50.480 | There's also Moondream 0.5B, which was released recently.
00:23:55.040 | It's like the smallest vision language model.
00:23:57.240 | And as you can see, there isn't a big trade-off
00:24:09.080 | but why should you consider using small models and when?
00:24:18.240 | Because these models are small and they can run fast,
00:24:24.720 | And this means that your dataset stays locally.
00:24:27.200 | You don't have to send your queries to third parties.
00:24:33.040 | one of the big selling points for Apple Intelligence.
00:24:35.760 | Also, right now we really have so many frameworks
00:24:41.520 | For example, there's MLX, MLC, llama.cpp, Transformers.js.
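As one concrete illustration of local inference with one of those stacks (not shown in the talk), this is roughly how a small quantized model can be run fully offline with the llama-cpp-python bindings for llama.cpp; the GGUF filename is a placeholder.

```python
# Rough sketch of fully local inference via llama-cpp-python (bindings for
# llama.cpp). The model path is a placeholder; any small instruct GGUF export
# (e.g. a SmolLM2 or Phi-3.5 quantization) would work the same way.
from llama_cpp import Llama

llm = Llama(model_path="./smollm2-1.7b-instruct-q4_k_m.gguf", n_ctx=2048)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Tell me a joke about NeurIPS."}],
    max_tokens=64,
)
print(response["choices"][0]["message"]["content"])
```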
00:24:56.720 | For example, here there's a startup called NuMind,
00:25:00.320 | and then they fine-tuned this on text extraction datasets.
00:25:04.960 | that's not very far from models that are much larger.
00:25:07.880 | So I think text extraction is like one use case
00:25:15.640 | You can also chat with these models in browser.
00:25:21.120 | you can load the model, you can even turn off your internet
00:25:23.560 | and just start chatting with the model locally.
00:25:31.040 | there are really good methods for structured generation.
00:25:40.680 | to follow a schema for extracting key information
00:25:48.000 | which is a complaint about a GitHub repository,
00:25:53.080 | and the model can extract anything that is relevant
00:26:05.680 | And you can just like do this in the browser.
00:26:08.000 | You can transform your text into a GitHub issue
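The browser demo relies on constrained, schema-guided decoding; as a simplified Python sketch of the same idea, the snippet below only prompts for JSON and then validates it strictly, and the schema fields plus the hard-coded response are made-up stand-ins.

```python
# Simplified sketch of schema-guided extraction. A structured-generation library
# would constrain decoding to the schema; here we just prompt for JSON and
# validate it. The schema fields and the canned output are invented examples.
import json
from pydantic import BaseModel

class GitHubIssue(BaseModel):
    title: str
    body: str
    labels: list[str]

complaint = "Your CLI crashes on Windows whenever the config file has a unicode path!"
prompt = (
    "Extract a GitHub issue from the complaint below. Respond with JSON matching "
    f"this schema: {json.dumps(GitHubIssue.model_json_schema())}\n\nComplaint: {complaint}"
)
# raw = small_model.generate(prompt)  # any local instruct model could go here
raw = '{"title": "CLI crashes on unicode config paths (Windows)", "body": "...", "labels": ["bug", "windows"]}'

issue = GitHubIssue.model_validate_json(raw)  # raises if the output drifts off-schema
print(issue.title, issue.labels)
```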
00:26:12.680 | So what's next for synthetic data and small models?
00:26:25.600 | For example, generating synthetic data for math.
00:26:45.280 | I think specializing them through fine-tuning,
00:26:55.560 | I think you can already get decent performance
00:26:58.480 | So you don't need to pay like a cost that's much larger
00:27:01.760 | just to make your model better at your task by a few percent.
00:27:07.400 | And I think it also applies for other modalities
00:27:26.680 | Maybe for others too. I should also say a hot take.
00:27:32.520 | we started like with fine-tuning, for example,
00:27:35.120 | trying to make BERT work on some specific use cases
00:27:40.080 | And then we had some models that are much larger.
00:27:41.960 | So we just switched to like prompt engineering
00:27:48.360 | where we realized these models are really costly.
00:27:54.880 | and we're gonna start to see like more fine-tuning
00:27:57.200 | and less of just like prompt engineering the models.
00:28:02.960 | And if you have any questions, we can take them now.