
Best of 2024: Synthetic Data / Smol Models, Loubna Ben Allal, HuggingFace [LS Live! @ NeurIPS 2024]


Chapters

0:00 Introduction and Overview
0:18 Synthetic Data in 2024
1:09 Synthetic Data in Pre-Training
2:57 Model Collapse Concerns
4:11 Synthetic Data Quality and Benchmarks
8:51 Rephrasing and Textbook Generation
11:17 Synthetic Data for Filtering and Classification
13:28 Post-Training with Synthetic Data
16:17 Advancements in Small Models
18:17 On-Device and Efficient Models
25:14 Future Trends and Conclusion

Whisper Transcript | Transcript Only Page

00:00:00.000 | (upbeat music)
00:00:02.580 | - I'm very happy to be here.
00:00:07.280 | Thank you for the invitation.
00:00:08.900 | So I'm gonna be talking about synthetic data in 2024.
00:00:11.620 | And then I'm gonna be talking about small on-device models.
00:00:14.620 | So I think the most interesting thing
00:00:17.760 | about synthetic data this year is that like,
00:00:19.960 | now we have it everywhere
00:00:21.400 | in the large language models pipeline.
00:00:24.000 | I think initially synthetic data was mainly used
00:00:26.860 | just for post-training,
00:00:28.300 | because naturally that's the part
00:00:29.980 | where we needed human annotators to show the models
00:00:33.220 | how they should answer instructions,
00:00:35.060 | how they should be helpful and not toxic.
00:00:38.360 | And when we had LLMs that were really performant,
00:00:41.060 | we replaced the human annotators
00:00:43.900 | just with synthetic data.
00:00:45.700 | And then after that,
00:00:46.820 | we realized that we don't really have good benchmarks
00:00:49.500 | to measure if models follow instructions well,
00:00:52.780 | if they are creative enough, or if they are chatty enough.
00:00:55.700 | So we also started using LLMs as judges.
00:00:59.380 | And I think this year and towards the end of last year,
00:01:02.740 | we also went to the pre-training parts
00:01:05.380 | and we started generating synthetic data for pre-training
00:01:08.740 | to kind of replace some parts of the web.
00:01:11.100 | And the motivation behind that
00:01:12.660 | is that you have a lot of control over synthetic data.
00:01:15.360 | You can control your prompt
00:01:16.740 | and basically also the kind of data that you generate.
00:01:19.500 | So instead of just trying to filter the web,
00:01:21.780 | you could try to get the LLM to generate
00:01:23.860 | what you think the best web pages could look like
00:01:26.300 | and then train your models on that.
00:01:28.100 | So this is how we went from not having synthetic data
00:01:30.580 | at all in the LLM pipeline to having this everywhere.
00:01:33.220 | And so the cool thing is like today,
00:01:36.540 | you can train an LLM with like an entirely synthetic pipeline.
00:01:40.580 | For example, you can use our Cosmopedia datasets
00:01:42.820 | and you can train a 1B model on like 150 billion tokens
00:01:45.960 | that are 100% synthetic.
00:01:48.140 | And those are also of good quality.
00:01:49.860 | And then you can instruction tune the model
00:01:51.500 | on a synthetic SFT dataset.
00:01:53.460 | You can also do DPO on a synthetic dataset.
00:01:55.900 | And then to evaluate if the model is good,
00:01:57.660 | you can use a benchmark that uses LLMs as a judge,
00:02:00.760 | for example, MTBench or AlpacaEval.
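
A minimal sketch of what such a fully synthetic pipeline could look like with the Hugging Face stack. The dataset names, base model, and trainer arguments below are illustrative assumptions (and trl argument names vary across versions), not the exact recipe described in the talk:

```python
# Sketch: synthetic pre-training data, then synthetic SFT, then synthetic DPO.
# Dataset/model names are examples; trainer argument names differ across trl versions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

model_id = "HuggingFaceTB/SmolLM2-1.7B"  # assumed base model
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 1) Pre-training corpus of synthetic textbooks (Cosmopedia-style data);
#    this would be fed to your pre-training framework.
pretrain_ds = load_dataset("HuggingFaceTB/cosmopedia-100k", split="train")

# 2) Instruction tuning on a synthetic SFT mixture.
sft_ds = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
SFTTrainer(model=model, train_dataset=sft_ds,
           args=SFTConfig(output_dir="sft-out")).train()

# 3) Preference tuning (DPO) on a synthetic preference dataset.
dpo_ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
DPOTrainer(model=model, train_dataset=dpo_ds, processing_class=tokenizer,
           args=DPOConfig(output_dir="dpo-out")).train()

# 4) Evaluation would then use an LLM-as-judge benchmark such as MT-Bench or AlpacaEval.
```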
00:02:03.420 | So I think this is really mind-blowing,
00:02:05.200 | because like just a few years ago,
00:02:06.520 | we wouldn't have thought this was possible.
00:02:08.860 | And I think there's a lot of concerns about model collapse
00:02:11.380 | and I'm gonna talk about that later,
00:02:13.020 | but we'll see that like if we use synthetic data properly
00:02:15.900 | and we curate it carefully, that shouldn't happen.
00:02:18.840 | And the reason synthetic data is very popular right now
00:02:23.300 | is that we have really strong models,
00:02:25.660 | both open and closed.
00:02:27.860 | It is really cheap and fast to use
00:02:29.500 | compared to human annotations,
00:02:31.420 | which cost a lot and take a lot of time.
00:02:33.800 | And also for open models right now,
00:02:36.100 | we have some really good inference frameworks.
00:02:38.260 | So if you have enough GPUs,
00:02:39.660 | it's really easy to spawn these GPUs
00:02:41.740 | and generate like a lot of synthetic data.
00:02:44.300 | Some examples are vLLM, TGI and TensorRT-LLM.
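
For illustration, a batched generation loop with vLLM might look like the sketch below; the model name and prompts are placeholders:

```python
# Sketch: batched synthetic data generation with vLLM (model and prompts are placeholders).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # any strong open model
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)

prompts = [f"Write a clear textbook section about topic #{i}." for i in range(10_000)]
outputs = llm.generate(prompts, params)               # runs batched on the GPU(s)

synthetic_samples = [o.outputs[0].text for o in outputs]
```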
00:02:47.320 | Now let's talk about the elephant in the room,
00:02:52.500 | model collapse.
00:02:53.660 | Is this the end?
00:02:54.500 | If you look at the media and all of like,
00:02:56.540 | for example, some papers in Nature,
00:02:58.660 | it's really scary because there's a lot of synthetic data
00:03:01.740 | out there in the web
00:03:02.860 | and naturally we train on the web.
00:03:04.240 | So we're gonna be training a lot of synthetic data.
00:03:06.860 | And if model collapse is gonna happen,
00:03:08.700 | we should really try to take that seriously.
00:03:11.100 | And the other issue is that, as I said,
00:03:14.860 | a lot of people think the web is polluted
00:03:17.020 | because there's a lot of synthetic data.
00:03:19.140 | And for example, when we were building the FineWeb dataset
00:03:21.620 | here, Guilherme and Hynek,
00:03:23.180 | we were interested in like how much synthetic data
00:03:25.060 | is there on the web.
00:03:26.580 | So there isn't really a method to properly measure
00:03:29.860 | the amount of synthetic data
00:03:31.180 | or to say if a webpage is synthetic or not.
00:03:33.820 | But one thing we can do is to try to look for like
00:03:36.260 | proxy words, for example,
00:03:37.900 | expressions like "as a large language model"
00:03:40.500 | or words like "delve",
00:03:41.900 | that we know are actually generated by ChatGPT.
00:03:44.500 | We could try to measure the amount of these words
00:03:47.340 | in our dataset and compare them to the previous years.
00:03:50.040 | For example, here, we measured the ratio of these words
00:03:52.940 | in different dumps of Common Crawl.
00:03:54.820 | And we can see that like the ratio really increased
00:03:57.380 | after ChatGPT's release.
00:03:58.980 | So if the amount of synthetic data hadn't changed,
00:04:03.420 | you would expect this ratio to stay constant,
00:04:05.500 | which is not the case.
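
As a rough sketch of that measurement, one could compute the fraction of documents containing proxy expressions across different dumps. The dump names and loading call below are schematic (FineWeb exposes Common Crawl dumps as configs on the Hub), not the exact script used:

```python
# Sketch: measure the ratio of ChatGPT-style "proxy words" in different Common Crawl dumps.
from datasets import load_dataset

PROXY_TERMS = ["as a large language model", "as an ai language model", "delve"]

def proxy_ratio(dump_name: str, n_docs: int = 100_000) -> float:
    ds = load_dataset("HuggingFaceFW/fineweb", name=dump_name,
                      split="train", streaming=True)
    hits = total = 0
    for doc in ds.take(n_docs):
        text = doc["text"].lower()
        hits += any(term in text for term in PROXY_TERMS)
        total += 1
    return hits / total

for dump in ["CC-MAIN-2021-04", "CC-MAIN-2023-50", "CC-MAIN-2024-10"]:  # example dumps
    print(dump, proxy_ratio(dump))
```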
00:04:06.980 | So there's a lot of synthetic data probably on the web,
00:04:09.560 | but does this really make models worse?
00:04:12.040 | So what we did is we trained different models
00:04:14.180 | on these different dumps,
00:04:15.580 | and we then computed their performance
00:04:18.220 | on popular like NLP benchmarks,
00:04:20.040 | and then we computed the aggregated score.
00:04:22.320 | And surprisingly, you can see that the latest dumps
00:04:24.480 | are actually even better than the dumps that came before.
00:04:27.200 | So if there's some synthetic data there,
00:04:29.040 | at least it did not make the models worse.
00:04:31.860 | Yeah, which is really encouraging.
00:04:34.440 | So personally, I wouldn't say the web is polluted
00:04:36.720 | by synthetic data.
00:04:37.960 | Maybe it's even making it richer.
00:04:40.180 | And the issue with like model collapse is that,
00:04:45.240 | for example, those studies,
00:04:46.440 | they were done at like a small scale,
00:04:48.320 | and you would ask the model to complete, for example,
00:04:50.760 | a Wikipedia paragraph,
00:04:51.920 | and then you would train it on these new generations,
00:04:54.240 | and you would do that iteratively.
00:04:56.080 | I think if you do that approach,
00:04:57.280 | it's normal to observe this kind of behavior,
00:05:00.000 | because the quality is gonna be worse
00:05:01.160 | because the model is already small.
00:05:02.400 | And then if you train it just on these generations,
00:05:04.480 | you shouldn't expect it to become better.
00:05:06.400 | But what we're really doing here
00:05:07.640 | is that we take a model that is very large,
00:05:09.560 | and we try to distill its knowledge
00:05:11.200 | into a model that is smaller.
00:05:12.880 | And in this way,
00:05:13.920 | you can expect to get like better performance
00:05:16.160 | for your small model.
00:05:18.040 | And using synthetic data for pre-training
00:05:20.800 | has become really popular
00:05:22.400 | after the Textbooks Are All You Need papers,
00:05:25.600 | where Microsoft basically trained a series of small models
00:05:29.160 | on textbooks that were generated using a large LLM.
00:05:33.600 | And then they found that these models
00:05:35.040 | were actually better than models that are much larger.
00:05:38.320 | So this was really interesting.
00:05:39.600 | It was like a first of its kind,
00:05:41.640 | but it was also met with a lot of skepticism,
00:05:44.200 | which is a good thing in research,
00:05:45.560 | it pushes you to question things.
00:05:48.240 | Because the dataset that they trained on was not public.
00:05:50.920 | So people were not really sure
00:05:52.720 | if these models are really good,
00:05:54.560 | or maybe there's just some data contamination.
00:05:57.120 | So it was really hard to check
00:05:58.640 | if you just have the weights of the models.
00:06:01.560 | And as Hugging Face, because we're like open source,
00:06:03.760 | we tried to reproduce what they did.
00:06:05.760 | So this is our Cosmopedia dataset.
00:06:07.880 | We basically tried to follow a similar approach
00:06:09.840 | to what they documented in the paper.
00:06:11.400 | And we created a synthetic dataset of textbooks
00:06:14.120 | and blog posts and stories
00:06:15.760 | that had almost 30 billion tokens.
00:06:18.200 | And we trained some models on that.
00:06:20.760 | And we found that the key ingredient
00:06:23.720 | to getting a good dataset that is synthetic
00:06:26.000 | is trying as much as possible to keep it diverse.
00:06:28.880 | Because if you just throw the same prompt at your model,
00:06:31.160 | like "generate a textbook about linear algebra",
00:06:34.080 | and even if you change the temperature,
00:06:35.720 | the textbooks are gonna look alike.
00:06:37.080 | So there's no way you could scale to millions of samples.
00:06:40.680 | And the way you do that is by creating prompts
00:06:43.800 | that have some seeds that make them diverse.
00:06:46.680 | In our case, the prompt,
00:06:48.560 | we would ask the model to generate a textbook,
00:06:50.880 | but make it related to an extract from a webpage.
00:06:54.160 | And also we try to frame it to stay within topic.
00:06:57.440 | For example, here, we put like an extract
00:07:00.080 | about cardiovascular bioimaging,
00:07:02.200 | and then we ask the model to generate a textbook
00:07:04.640 | related to medicine that is also related to this webpage.
00:07:08.240 | And this is a really nice approach
00:07:09.600 | because there's so many webpages out there.
00:07:11.960 | So you can be sure that your generation
00:07:14.760 | is gonna be diverse when you change the seed example.
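
A schematic version of that kind of seeded prompt; the template wording and example extract are illustrative, not the exact Cosmopedia prompt:

```python
# Sketch: build diverse textbook-generation prompts by seeding each one with a web extract.
PROMPT_TEMPLATE = (
    "Here is an extract from a webpage:\n\"{extract}\"\n\n"
    "Write an extensive and detailed textbook chapter in the field of {topic}, "
    "related to the extract above, suitable for {audience}."
)

def build_prompt(extract: str, topic: str, audience: str = "college students") -> str:
    # Truncate long extracts so the seed stays a hint, not the whole context.
    return PROMPT_TEMPLATE.format(extract=extract[:1000], topic=topic, audience=audience)

prompt = build_prompt(
    extract="Cardiovascular bioimaging enables non-invasive assessment of heart function...",
    topic="medicine",
    audience="middle school students",
)
```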
00:07:19.760 | One thing that's challenging with this
00:07:21.200 | is that you want the seed samples
00:07:23.000 | to be related to your topics.
00:07:25.320 | So we use like a search tool
00:07:27.720 | to try to go through all of the FineWeb dataset
00:07:30.000 | and find the pages that are related to the topics
00:07:32.320 | we're interested in.
00:07:33.360 | And then we also do a lot of experiments
00:07:36.120 | with the type of generations we want the model to generate.
00:07:39.320 | For example, we ask it for textbooks
00:07:41.160 | for middle school students or a textbook for college students.
00:07:43.840 | And we found that like some generation styles
00:07:45.880 | help on some specific benchmarks
00:07:47.600 | while others help on other benchmarks.
00:07:49.760 | For example, college textbooks are really good for MMLU,
00:07:52.640 | while middle school textbooks are good for benchmarks
00:07:54.720 | like OpenBookQA and PIQA.
00:07:56.840 | This is like a sample from like our search tool.
00:08:01.600 | For example, you have a top category, which is a topic,
00:08:04.080 | and then you have some subtopics,
00:08:05.520 | and then you have the topic hits,
00:08:06.920 | which are basically the webpages in FineWeb
00:08:09.200 | that belong to these topics.
00:08:10.760 | And here you can see the comparison between Cosmopedia
00:08:14.640 | (we had two versions, V1 and V2, in blue and red)
00:08:18.640 | and FineWeb.
00:08:20.640 | And as you can see throughout the training,
00:08:22.880 | training on Cosmopedia was consistently better.
00:08:25.840 | So we managed to get a dataset
00:08:27.160 | that was actually good to train these models on.
00:08:29.800 | It's of course so much smaller than FineWeb,
00:08:31.840 | it's only 30 billion tokens,
00:08:33.600 | but that's the scale that Microsoft's dataset was.
00:08:36.200 | So we kind of managed to reproduce a bit what they did,
00:08:39.240 | and the dataset is public, so everyone can go there,
00:08:41.880 | check if everything is all right.
00:08:43.880 | And this is the recent paper from NVIDIA, Nemotron CC.
00:08:49.840 | They took things a bit further
00:08:51.920 | and they generated not a few billion tokens,
00:08:54.120 | but 1.9 trillion tokens, which is huge.
00:08:57.960 | And we can see later how they did that.
00:09:00.040 | It's more of like rephrasing the web.
00:09:01.920 | So we can see today that there's like
00:09:04.640 | some really huge synthetic datasets out there,
00:09:07.680 | and they're public,
00:09:08.520 | so like you can try to filter them even further
00:09:11.080 | if you wanna get like higher quality corpora.
00:09:13.480 | So for this rephrasing the web,
00:09:18.040 | this approach was suggested in this paper by Pratyush,
00:09:22.080 | where basically,
00:09:23.600 | they take some samples from the C4 dataset,
00:09:27.160 | and then they use an LLM
00:09:28.720 | to rewrite these samples into a better format.
00:09:31.880 | For example, they ask an LLM to rewrite the sample
00:09:34.840 | into a Wikipedia passage or into a Q&A page.
00:09:38.760 | And the interesting thing in this approach
00:09:41.160 | is that you can use a model that is small
00:09:43.800 | because rewriting doesn't require knowledge,
00:09:46.520 | it's just rewriting a page into a different style.
00:09:49.240 | So the model doesn't need to have like extensive knowledge
00:09:51.960 | of what it's rewriting,
00:09:54.320 | compared to just asking a model to generate a new textbook
00:09:57.120 | without giving it any ground truth.
00:09:59.480 | So here they rewrite some samples from C4
00:10:02.240 | into Q&A, into Wikipedia,
00:10:04.080 | and they find that doing this works better
00:10:06.440 | than training just on C4.
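
A minimal sketch of that rephrasing setup, reusing a small instruct model; the prompt wording, model choice, and sample count are assumptions rather than the paper's exact recipe:

```python
# Sketch: rephrase raw C4-style web text into cleaner styles (Wikipedia-like, Q&A) with a small model.
from datasets import load_dataset
from vllm import LLM, SamplingParams

STYLES = {
    "wikipedia": "Rewrite the following text in the style of a Wikipedia passage:\n\n{doc}",
    "qa": "Convert the following text into a question-and-answer page:\n\n{doc}",
}

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")   # rewriting needs style, not new knowledge
params = SamplingParams(temperature=0.7, max_tokens=512)

docs = load_dataset("allenai/c4", "en", split="train", streaming=True).take(1000)
prompts = [STYLES["wikipedia"].format(doc=d["text"][:2000]) for d in docs]
rephrased = [o.outputs[0].text for o in llm.generate(prompts, params)]
```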
00:10:07.960 | And so what they did in Nemotron CC is a similar approach.
00:10:13.880 | They rewrite some pages from Common Crawl for two reasons.
00:10:18.320 | One is to like improve pages that are low quality.
00:10:22.400 | So they rewrite them into, for example, Wikipedia page,
00:10:25.120 | so they look better.
00:10:26.440 | And another reason is to create more diverse datasets.
00:10:29.680 | So they have a dataset that they already heavily filtered,
00:10:33.000 | and then they take these pages that are already high quality
00:10:35.720 | and they ask the model to rewrite them
00:10:37.840 | in Q&A format into like open-ended questions
00:10:41.360 | or like multiple-choice questions.
00:10:42.960 | So this way they can reuse the same page multiple times
00:10:45.920 | without fearing like having multiple duplicates
00:10:48.520 | because it's the same information,
00:10:50.160 | but it's gonna be written differently.
00:10:52.440 | So I think that's also a really interesting approach
00:10:54.480 | for like generating synthetic data
00:10:57.200 | just by rephrasing the pages that you already have.
00:10:59.840 | There's also this approach called Prox
00:11:04.000 | where they try to start from a webpage
00:11:06.880 | and then they generate a program
00:11:08.440 | which finds how to write that page
00:11:10.160 | to make it better and less noisy.
00:11:12.160 | For example, here you can see
00:11:13.280 | that there's some leftover metadata in the webpage
00:11:16.200 | and you don't necessarily want to keep that
00:11:17.800 | for training your model.
00:11:19.240 | So they train a model that can generate programs
00:11:22.600 | that can like normalize and remove lines that are extra.
00:11:25.880 | So I think this approach is also interesting,
00:11:27.680 | but it's maybe less scalable
00:11:29.240 | than the approaches that I presented before.
00:11:31.480 | So that was it for like rephrasing
00:11:36.080 | and generating new textbooks.
00:11:37.920 | Another approach that I think is really good
00:11:40.360 | and becoming really popular
00:11:41.600 | for using synthetic data for pre-training
00:11:44.200 | is basically building better classifiers
00:11:47.200 | for filtering the web.
00:11:48.960 | For example, here we released a dataset
00:11:50.920 | called FineWeb-Edu, and the way we built it
00:11:53.720 | is by taking Llama 3 and asking it to rate
00:11:57.080 | the educational content of webpages from zero to five.
00:12:00.920 | So for example, if a page is like a really good textbook
00:12:03.720 | that could be useful in a school setting,
00:12:05.880 | it would get a really high score.
00:12:07.320 | And if a page is just like an advertisement
00:12:10.040 | or promotional material, it would get a lower score.
00:12:13.320 | And then after that, we take these synthetic annotations
00:12:16.240 | and we train a classifier on them.
00:12:18.240 | It's a classifier like a BERT model.
00:12:20.880 | And then we run this classifier on all of FineWeb,
00:12:23.640 | which is a 15 trillion token dataset.
00:12:25.920 | And then we only keep the pages that have like a score
00:12:28.360 | that's higher than three.
00:12:29.600 | So for example, in our case,
00:12:31.000 | we went from 15 trillion tokens to just 1.5 trillion tokens.
00:12:34.880 | Those are really highly educational.
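
A sketch of that filtering step, assuming the released HuggingFaceTB/fineweb-edu-classifier checkpoint and treating its output as a single 0-to-5 regression score; details of the head and preprocessing are assumptions:

```python
# Sketch: score pages with an educational-quality classifier and keep those scoring >= 3.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ckpt = "HuggingFaceTB/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt)

def edu_score(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze().item()   # single regression logit, roughly in [0, 5]

pages = ["An introduction to photosynthesis for students...", "BUY CHEAP WATCHES NOW!!!"]
kept = [p for p in pages if edu_score(p) >= 3.0]
```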
00:12:37.160 | And as you can see here, FineWeb-Edu outperforms
00:12:40.280 | all the other public web datasets by a large margin
00:12:44.120 | on a couple of benchmarks.
00:12:45.520 | Here I show the aggregated score.
00:12:47.840 | And you can see that this approach is really effective
00:12:50.040 | for filtering web datasets to get like better corpuses
00:12:53.240 | for training your LLMs.
00:12:54.880 | Others also try to do this approach.
00:13:00.400 | There's, for example, the DCLM datasets,
00:13:03.240 | where they also train the classifier,
00:13:05.000 | but not to detect educational content.
00:13:07.320 | Instead, they trained it on the OpenHermes dataset,
00:13:10.160 | which is a dataset for instruction tuning,
00:13:12.440 | and also on the Explain Like I'm Five (ELI5) subreddit.
00:13:15.920 | And then they also get a really high quality dataset,
00:13:20.000 | which is like very information dense
00:13:22.800 | and can help you train some really good LLMs.
00:13:25.000 | And then Nemotron-CC,
00:13:27.960 | they also did this approach,
00:13:29.520 | but instead of using one classifier,
00:13:31.440 | they used an ensemble of classifiers.
00:13:33.920 | So they use, for example, the DCLM classifier
00:13:36.200 | and also classifiers like the ones we used
00:13:38.160 | in FineWeb-Edu.
00:13:39.520 | And then they combine these scores
00:13:41.400 | with an ensemble method
00:13:43.160 | to only retain the best high quality pages.
00:13:46.000 | And they get a dataset that works even better
00:13:48.400 | than the ones we developed.
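
Schematically, the ensembling step just combines per-page scores from several quality classifiers before thresholding. The normalization and "max" combination rule below are one plausible choice for illustration, not necessarily Nemotron-CC's exact recipe:

```python
# Sketch: combine scores from several quality classifiers into one filtering decision.
def ensemble_keep(page_scores: dict[str, float], thresholds: dict[str, float]) -> bool:
    # Normalize each classifier's score by its own threshold, then take the best vote.
    normalized = [page_scores[name] / thresholds[name] for name in page_scores]
    return max(normalized) >= 1.0

page_scores = {"fineweb_edu": 2.1, "dclm_fasttext": 0.92}   # hypothetical per-page scores
thresholds = {"fineweb_edu": 3.0, "dclm_fasttext": 0.80}
print(ensemble_keep(page_scores, thresholds))               # kept: the fastText-style vote passes
```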
00:13:49.840 | So that was it for like synthetic data for pre-training.
00:13:54.360 | Now we can go back to post-training.
00:13:56.640 | I think there's a lot of interesting
00:13:58.080 | post-training datasets out there.
00:13:59.960 | One that was released recently,
00:14:01.760 | is AgentInstruct by Microsoft,
00:14:03.840 | where they basically try to target some specific skills
00:14:07.080 | and improve the performance of models on them.
00:14:10.080 | For example, here you can see code,
00:14:11.800 | brain teasers, open domain QA,
00:14:14.160 | and they managed to get a dataset such that,
00:14:17.120 | when fine-tuning Mistral 7B on it,
00:14:19.120 | it outperforms the original instruct model
00:14:21.960 | that was released by Mistral.
00:14:23.840 | And as I said, to get good synthetic data,
00:14:28.840 | you really have to have a framework
00:14:30.400 | to make sure that your data is diverse.
00:14:33.000 | So for example, for them,
00:14:34.240 | they always seed the generations
00:14:36.200 | with either source code or raw text documents.
00:14:39.040 | And then they rewrite them to make sure
00:14:40.880 | they're easier to generate instructions from.
00:14:43.080 | And then they use that
00:14:44.080 | for their like instruction data generation.
00:14:47.600 | There's also the Tulu 3 SFT mixture,
00:14:50.560 | which was released recently by Allen AI.
00:14:53.360 | It's also really good quality
00:14:54.760 | and it covers a wide range of tasks.
00:14:57.280 | And the way they make sure that this dataset is diverse
00:15:00.240 | is by using personas from the PersonaHub dataset,
00:15:04.120 | which is basically a dataset of like,
00:15:05.800 | I think over a million personas.
00:15:07.560 | And for example, in the Tulu 3 mixture,
00:15:09.800 | to generate like a new code snippet,
00:15:11.440 | they would give the model a persona,
00:15:13.200 | for example, a machine learning researcher
00:15:15.640 | interested in neural networks,
00:15:17.240 | and then ask it to generate like a coding problem.
00:15:19.920 | This way you make sure that your dataset is really diverse,
00:15:22.480 | and then you can further filter the datasets,
00:15:24.480 | for example, using the reward models.
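
A sketch of that persona-seeding trick; the PersonaHub dataset ID, field name, and prompt wording are assumptions for illustration:

```python
# Sketch: use personas to diversify synthetic instruction data.
import random
from datasets import load_dataset

personas = load_dataset("proj-persona/PersonaHub", "persona", split="train")  # assumed ID/config

def coding_prompt(persona: str) -> str:
    return (f"You are {persona}. Write a challenging but self-contained Python coding problem "
            f"that someone like you would care about, then provide a reference solution.")

sampled = random.sample(range(len(personas)), k=5)
prompts = [coding_prompt(personas[i]["persona"]) for i in sampled]  # field name assumed

# These prompts would then be sent to a strong teacher model, and the outputs
# optionally filtered with a reward model.
```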
00:15:26.400 | We also released a dataset called SmolTalk,
00:15:30.720 | and we also tried to cover a wide range of tasks.
00:15:33.520 | And as you can see here, for example,
00:15:35.280 | when fine-tuning Mistral 7B on the dataset,
00:15:37.880 | we also outperformed the original Mistral instruct
00:15:40.920 | on a number of benchmarks,
00:15:42.600 | notably on mathematics
00:15:44.120 | and instruction following with IFEval.
00:15:46.240 | Another paper that's really interesting
00:15:50.280 | I wanted to mention is this one
00:15:52.000 | called the Multilingual Data Arbitrage by Cohere.
00:15:55.600 | And basically they want to generate a dataset
00:15:58.200 | for post-training that is multilingual,
00:16:00.360 | and they have a really interesting problem.
00:16:01.960 | It's the fact that there isn't like one model
00:16:04.120 | that's really good at all the languages they wanted.
00:16:06.840 | So what they do is that like they use
00:16:09.240 | not just one teacher model, but multiple teachers,
00:16:12.080 | and then they have a router,
00:16:13.600 | which basically sends the prompts they have
00:16:15.760 | to all these models.
00:16:16.840 | And then they get the completions,
00:16:18.360 | and they have a reward model
00:16:19.520 | that rates all these generations
00:16:21.240 | and only keeps the best one.
00:16:23.000 | And this is like arbitrage in finance.
00:16:24.920 | So what I think was interesting in this,
00:16:27.480 | it shows that like synthetic data,
00:16:28.800 | it doesn't have to come from a single model.
00:16:31.040 | And because we have so many good models now,
00:16:34.760 | you could like pool these models together
00:16:34.760 | and get like a dataset that's of a really high quality,
00:16:37.440 | and that's diverse, and that covers all your needs.
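
In pseudocode, the arbitrage loop routes each prompt to several teachers and lets a reward model keep the best completion. Everything below (teacher callables, the scoring function) is a placeholder for real model endpoints, not Cohere's implementation:

```python
# Sketch: "data arbitrage" with multiple teacher models and a reward model as the judge.
from typing import Callable

Teacher = Callable[[str], str]              # prompt -> completion
RewardModel = Callable[[str, str], float]   # (prompt, completion) -> score

def arbitrage(prompt: str, teachers: dict[str, Teacher], reward: RewardModel) -> dict:
    completions = {name: teach(prompt) for name, teach in teachers.items()}
    best = max(completions, key=lambda name: reward(prompt, completions[name]))
    return {"prompt": prompt, "completion": completions[best], "teacher": best}

# Usage: build a multilingual SFT set by running arbitrage() over prompts in many languages,
# routing each language to the pool of teachers that handles it best.
```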
00:16:43.520 | I was supposed to put a meme there, but lack of time.
00:16:46.760 | Yeah, so that was it for like synthetic data.
00:16:52.640 | And now we can go to see what's happening
00:16:55.000 | in the small models field in 2024.
00:16:57.840 | I don't know if you know,
00:17:01.000 | but like now we have some really good small models.
00:17:03.480 | For example, Llama 3.2 1B, it matches Llama 2 13B,
00:17:08.480 | that was released last year, on the LMSYS arena,
00:17:11.680 | which is basically the default go-to leaderboard
00:17:14.160 | for evaluating models using human evaluation.
00:17:17.440 | And as you can see here,
00:17:18.600 | the scores of the models are really close.
00:17:20.600 | So I think we've made like a huge leap forward
00:17:22.640 | in terms of small models.
00:17:24.160 | Of course, that's just one data point, but there's more.
00:17:28.120 | For example, if you look at this chart
00:17:30.600 | from the Qwen 2.5 blog post,
00:17:32.960 | it shows that today we have some really good models
00:17:35.640 | that are only like 3 billion or 4 billion parameters
00:17:39.160 | that score really high on MMLU,
00:17:41.600 | which is a really popular benchmark for evaluating models.
00:17:45.040 | And you can see here that the blue dots
00:17:47.360 | have more than 65 on MMLU and the gray ones have less.
00:17:52.160 | And for example, Llama 33B had less.
00:17:55.000 | So now we have a 3B model that outperforms a 33B model
00:17:59.480 | that was released earlier, on the MMLU benchmark.
00:18:02.840 | So I think now people are starting to realize
00:18:05.760 | that like we shouldn't just scale and scale models,
00:18:08.840 | but we should try to make them more efficient.
00:18:11.360 | I don't know if you knew,
00:18:14.760 | but you can also chat with a 3b+ model on your iPhone.
00:18:18.480 | For example, here, this is an app called PocketPal,
00:18:21.120 | where you can go and select a model from Hugging Face.
00:18:24.080 | It has a large choice.
00:18:25.480 | For example, here, we loaded Phi-3.5,
00:18:28.840 | which is 3.8 billion parameters, on this iPhone,
00:18:32.400 | and we can chat with it.
00:18:33.840 | And you can see that even the latency is also acceptable.
00:18:37.600 | For example, here, I asked it to give me a joke
00:18:40.240 | about NeurIPS, so let's see what it has to say.
00:18:43.000 | Okay, why did the neural network attend NeurIPS?
00:18:49.480 | Because it heard there would be a lot of layers and fun,
00:18:52.320 | and it wanted to train its sense of humor.
00:18:54.760 | So not very funny, but at least it can run on device.
00:18:57.400 | Yeah, so I think now we have good small models,
00:19:02.160 | but we also have like good frameworks and tools
00:19:04.600 | to use these small models.
00:19:06.240 | So I think we're really close to having like really on-edge
00:19:09.320 | and on-device models that are really good.
00:19:12.440 | And I think for a while, we've had this narrative
00:19:15.160 | that just training larger models is better.
00:19:18.280 | Of course, this is supported by the scaling laws.
00:19:22.040 | As you can see here, for example,
00:19:23.440 | when we scale the model size, the loss is lower,
00:19:26.080 | and obviously you get a better model.
00:19:28.240 | But, and we can see this, for example,
00:19:31.000 | in the GPT family of models,
00:19:32.760 | how we went from just 100 million parameters
00:19:34.920 | to more than a trillion parameters.
00:19:37.000 | And of course, we all observed the performance improvement
00:19:39.720 | when using the latest model.
00:19:42.160 | But one thing that we shouldn't forget
00:19:43.760 | is that when we scale the model,
00:19:45.360 | we also scale the inference costs and time.
00:19:48.120 | And so the largest models are gonna cost so much more.
00:19:51.480 | So I think now, instead of just building larger models,
00:19:56.320 | we should be focusing on building more efficient models.
00:19:59.120 | It's no longer a race for the largest models,
00:20:01.680 | since these models are really expensive to run,
00:20:04.040 | and they require a really good infrastructure to do that,
00:20:07.240 | and they cannot run on, for example, consumer hardware.
00:20:10.560 | And when you try to build more efficient models
00:20:12.920 | that match larger models,
00:20:14.960 | that's when you can really unlock
00:20:16.760 | some really interesting on-device use cases.
00:20:18.960 | And I think a trend that we're noticing now
00:20:21.920 | is the trend of training smaller models longer.
00:20:24.840 | For example, if you compare how long Llama 1 was trained
00:20:28.280 | compared to Llama 3,
00:20:29.720 | there is a huge increase in the pre-training length.
00:20:33.280 | Llama 1 was trained on 1 trillion tokens,
00:20:35.320 | but Llama 3 8B was trained on 15 trillion tokens.
00:20:38.600 | So Meta managed to get a model that's the same size,
00:20:41.920 | but it performs so much better
00:20:43.760 | by choosing to spend more compute during training,
00:20:47.960 | because as we know, training is a one-time cost,
00:20:49.960 | but inference is something that's ongoing.
00:20:52.080 | If we wanna see what the small model trends are in 2024,
00:20:58.840 | I think this MobileLLM paper by Meta is interesting.
00:21:02.080 | They try to study different models
00:21:04.840 | that have less than 1 billion parameters
00:21:07.760 | and find which architecture makes most sense
00:21:10.000 | for these models.
00:21:11.120 | For example, they find that depth
00:21:13.040 | is more important than width.
00:21:15.040 | So it's more important to have models
00:21:16.600 | that have more layers than just making them more wide.
00:21:19.920 | They also find that GQA helps,
00:21:22.400 | and that tying the embeddings helps.
00:21:24.400 | So I think it's a nice study overall
00:21:26.120 | for models that are just a few hundred million parameters.
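
Those findings (deep and narrow, GQA, tied embeddings) translate roughly into a config like the sketch below; the exact sizes are made up for illustration and are not MobileLLM's published hyperparameters:

```python
# Sketch: a deep-and-narrow sub-1B Llama-style config with GQA and tied embeddings.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    hidden_size=1024,            # narrow width...
    intermediate_size=2816,
    num_hidden_layers=30,        # ...but many layers: depth over width
    num_attention_heads=16,
    num_key_value_heads=4,       # grouped-query attention (fewer KV heads than query heads)
    tie_word_embeddings=True,    # share input/output embeddings to save parameters
    vocab_size=32000,
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```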
00:21:30.240 | There's also the Apple Intelligence Tech Report,
00:21:32.880 | which is interesting.
00:21:34.520 | So for Apple Intelligence, they had two models,
00:21:36.760 | one that was on server and another model that was on device.
00:21:40.600 | It had 3 billion parameters.
00:21:42.800 | And I think the interesting part
00:21:44.160 | is that they trained this model
00:21:45.400 | using pruning and then distillation.
00:21:47.800 | And for example, they have this table
00:21:49.200 | where they show that using pruning and distillation
00:21:52.080 | works much better than training from scratch.
00:21:54.640 | And they also have some interesting insights
00:21:56.360 | about how they specialize their models on specific tasks.
00:21:59.560 | Like for example, summarization and rewriting.
00:22:02.040 | There's also this paper by NVIDIA
00:22:07.000 | that was released recently.
00:22:08.480 | I think you've already had a talk about hybrid models.
00:22:10.840 | That was also interesting.
00:22:12.720 | And this model, they used a hybrid architecture
00:22:16.040 | between state space models and transformers.
00:22:18.800 | And they managed to train a 1B model
00:22:20.600 | that's really performant
00:22:22.040 | without needing to train it on a lot of tokens.
00:22:24.440 | And regarding our work,
00:22:28.200 | we just recently released SmolLM2.
00:22:30.800 | So it's a series of three models,
00:22:32.840 | which are the best in class in each model size.
00:22:35.920 | For example, our 1.7B model outperforms Llama 3.2 1B
00:22:40.080 | and also Qwen 2.5.
00:22:42.160 | And how we managed to train this model
00:22:44.520 | is that we spent a lot of time
00:22:46.240 | trying to curate the pre-training datasets.
00:22:48.600 | We did a lot of ablations,
00:22:49.840 | trying to find which datasets are good
00:22:52.240 | and also how to mix them.
00:22:53.760 | We also created some new math and code datasets
00:22:56.400 | that we're releasing soon.
00:22:57.880 | But we basically really spent a lot of time
00:22:59.560 | trying to find what's the best mixture
00:23:01.160 | that you can train these models on.
00:23:03.000 | And then we spent some time trying to like,
00:23:05.440 | we also trained these models for very long.
00:23:07.560 | For example, SmolLM1 was trained
00:23:09.440 | only on 1 trillion tokens,
00:23:11.240 | but this model is trained on 11 trillion tokens.
00:23:13.840 | And we saw that the performance kept improving.
00:23:15.800 | The models didn't really plateau mid-training,
00:23:18.120 | which I think is really interesting.
00:23:19.400 | It shows that you can train such small models for very long
00:23:22.520 | and keep getting performance gains.
00:23:26.480 | What's interesting about SmolLM2 is that it's fully open.
00:23:29.320 | We also released the pre-training code base,
00:23:32.040 | the fine-tuning code and datasets,
00:23:33.840 | and also the evaluation code in this repository.
00:23:36.280 | Also, there are really interesting small models for text,
00:23:41.440 | but also for vision.
00:23:42.720 | For example, here you can see SmolVLM,
00:23:44.600 | which is a 2B model that's really efficient.
00:23:46.680 | It doesn't consume a lot of RAM
00:23:48.360 | and it also has a good performance.
00:23:50.480 | There's also Moondream 0.5B, which was released recently.
00:23:55.040 | It's like the smallest vision language model.
00:23:57.240 | And as you can see, there isn't a big trade-off
00:23:59.840 | compared to Moondream 2B.
00:24:01.560 | So now I showed you that we have
00:24:05.840 | some really good small models.
00:24:07.280 | We also have the tools to use them,
00:24:09.080 | but why should you consider using small models and when?
00:24:11.920 | I think small models are really interesting
00:24:15.840 | because of the on-device feature.
00:24:18.240 | Because these models are small and they can run fast,
00:24:20.760 | you can basically run them on your laptop,
00:24:22.880 | but also on your mobile phone.
00:24:24.720 | And this means that your dataset stays locally.
00:24:27.200 | You don't have to send your queries to third parties.
00:24:30.240 | And this really enhances privacy.
00:24:32.200 | That was, for example,
00:24:33.040 | one of the big selling points for Apple Intelligence.
00:24:35.760 | Also, right now we really have so many frameworks
00:24:39.920 | to do on-device inference.
00:24:41.520 | For example, there's MLX, MLC, llama.cpp, Transformers.js.
00:24:45.480 | So we have a lot of options
00:24:46.800 | and each of them have great features.
00:24:48.840 | So you have so many options for doing that.
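
For example, with the llama.cpp Python bindings, running a small quantized model fully locally looks roughly like this; the GGUF file path is a placeholder:

```python
# Sketch: local, offline inference with llama-cpp-python (the GGUF path is a placeholder).
from llama_cpp import Llama

llm = Llama(model_path="./smollm2-1.7b-instruct-q4_k_m.gguf", n_ctx=2048)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Tell me a joke about NeurIPS."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```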
00:24:52.800 | Small models are also really powerful
00:24:54.920 | if you choose to specialize them.
00:24:56.720 | For example, here there's a startup called NuMind,
00:24:59.360 | which took SmolLM,
00:25:00.320 | and then they fine-tuned this on text extraction datasets.
00:25:03.440 | And they managed to get a model
00:25:04.960 | that's not very far from models that are much larger.
00:25:07.880 | So I think text extraction is like one use case
00:25:10.000 | where small models can be really performant
00:25:12.240 | and it makes sense to use them
00:25:13.720 | instead of just using larger models.
00:25:15.640 | You can also chat with these models in browser.
00:25:19.560 | For example, here you can go there,
00:25:21.120 | you can load the model, you can even turn off your internet
00:25:23.560 | and just start chatting with the model locally.
00:25:26.240 | Speaking of text extraction,
00:25:29.480 | if you don't want to fine-tune the models,
00:25:31.040 | there's a really good method called structured generation.
00:25:34.440 | You can basically force the models
00:25:35.960 | to follow a JSON schema that you defined.
00:25:38.520 | For example, here we try to force the model
00:25:40.680 | to follow a schema for extracting key information
00:25:44.840 | from GitHub issues.
00:25:46.160 | So we can input free text,
00:25:48.000 | which is a complaint about a GitHub repository,
00:25:50.680 | something not working.
00:25:52.040 | And then you can run it there
00:25:53.080 | and the model can extract anything that is relevant
00:25:55.400 | for your GitHub issue creation.
00:25:57.240 | For example, the priority.
00:25:58.640 | For example, here priority is high,
00:26:00.400 | the type of the issue, bug,
00:26:01.880 | and then a title and the estimation
00:26:03.840 | of how long this will take to fix.
00:26:05.680 | And you can just like do this in the browser.
00:26:08.000 | You can transform your text into a GitHub issue
00:26:11.120 | that's properly formatted.
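
A sketch of that structured-generation idea with a Pydantic schema. The outlines calls below reflect the library's JSON-constrained interface as I understand it and may differ across versions; the schema fields and model choice are assumptions:

```python
# Sketch: constrain a small model to emit a GitHub-issue JSON object.
from typing import Literal
from pydantic import BaseModel
import outlines

class GitHubIssue(BaseModel):
    title: str
    priority: Literal["low", "medium", "high"]
    type: Literal["bug", "feature", "question"]
    estimated_hours: int

model = outlines.models.transformers("HuggingFaceTB/SmolLM2-1.7B-Instruct")
generator = outlines.generate.json(model, GitHubIssue)   # API name may vary by outlines version

complaint = "The app crashes every time I upload a file larger than 10 MB. Please fix ASAP."
issue = generator(f"Extract a structured GitHub issue from this report:\n{complaint}")
print(issue)   # a GitHubIssue instance that always respects the schema
```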
00:26:12.680 | So what's next for synthetic data and small models?
00:26:19.000 | I think that domain specific synthetic data
00:26:21.520 | is gonna be, it's already important,
00:26:23.600 | it's gonna be even more important.
00:26:25.600 | For example, generating synthetic data for math.
00:26:28.720 | I think this really would help improve
00:26:31.120 | the reasoning of a lot of models.
00:26:33.080 | And a lot of people are doing it,
00:26:34.320 | for example, Qwen 2.5 Math,
00:26:36.080 | and everyone's trying to reproduce o1.
00:26:38.400 | And so I think for synthetic data,
00:26:40.280 | trying to specialize it on some domains
00:26:42.120 | is gonna be really important.
00:26:43.920 | And then for small models,
00:26:45.280 | I think specializing them through fine-tuning,
00:26:47.800 | it's also gonna be really important.
00:26:49.840 | 'Cause I think a lot of companies
00:26:51.160 | are just trying to use these large models
00:26:53.240 | because they are better.
00:26:54.640 | But on some tasks,
00:26:55.560 | I think you can already get decent performance
00:26:57.640 | with small models.
00:26:58.480 | So you don't need to pay like a cost that's much larger
00:27:01.760 | just to make your model better at your task by a few percent.
00:27:05.800 | And this is not just for text.
00:27:07.400 | And I think it also applies for other modalities
00:27:09.800 | like vision and audio.
00:27:11.800 | And I think you should also watch out
00:27:13.320 | for on-device frameworks and applications.
00:27:15.720 | For example, like the app I showed,
00:27:19.240 | PocketPal, Ollama, all these frameworks
00:27:19.240 | are becoming really popular.
00:27:20.760 | And I'm pretty sure that we're gonna get
00:27:22.120 | like more of them in 2025.
00:27:24.160 | And users really like that.
00:27:26.680 | Maybe, to end, I should also say a hot take.
00:27:31.280 | I think that like in AI,
00:27:32.520 | we started like with fine-tuning, for example,
00:27:35.120 | trying to make BERT work on some specific use cases
00:27:38.280 | and really struggling to do that.
00:27:40.080 | And then we had some models that are much larger.
00:27:41.960 | So we just switched to like prompt engineering
00:27:44.760 | to get the models to solve our tasks.
00:27:46.760 | I think we're going back to fine-tuning
00:27:48.360 | where we realized these models are really costly.
00:27:50.360 | It's better to use just a small model.
00:27:51.880 | We'll try to specialize it.
00:27:53.360 | So I think it's a little bit of a cycle
00:27:54.880 | and we're gonna start to see like more fine-tuning
00:27:57.200 | and less of just like prompt engineering the models.
00:27:59.920 | So that was my talk.
00:28:01.960 | Thank you for following.
00:28:02.960 | And if you have any questions, we can take them now.
00:28:05.600 | (audience applauding)