Best of 2024: Synthetic Data / Smol Models, Loubna Ben Allal, HuggingFace [LS Live! @ NeurIPS 2024]

Chapters
0:00 Introduction and Overview
0:18 Synthetic Data in 2024
1:09 Synthetic Data in Pre-Training
2:57 Model Collapse Concerns
4:11 Synthetic Data Quality and Benchmarks
8:51 Rephrasing and Textbook Generation
11:17 Synthetic Data for Filtering and Classification
13:28 Post-Training with Synthetic Data
16:17 Advancements in Small Models
18:17 On-Device and Efficient Models
25:14 Future Trends and Conclusion
00:00:08.900 | So I'm gonna be talking about synthetic data in 2024.
00:00:11.620 | And then I'm gonna be talking about small on-device models.
00:00:24.000 | I think initially synthetic data was mainly used
00:00:29.980 | where we needed human annotators to show the models
00:00:38.360 | And when we had LLMs that were really performant,
00:00:46.820 | we realized that we don't really have good benchmarks
00:00:49.500 | to measure if models follow instructions well,
00:00:52.780 | if they are creative enough, or if they are chatty enough.
00:00:59.380 | And I think this year and towards the end of last year,
00:01:05.380 | and we started generating synthetic data for pre-training
00:01:12.660 | is that you have a lot of control over synthetic data.
00:01:16.740 | and basically also the kind of data that you generate.
00:01:23.860 | what you think the best web pages could look like
00:01:28.100 | So this is how we went from not having synthetic data
00:01:30.580 | at all in the LLM pipeline to having this everywhere.
00:01:36.540 | you can train an LLM with like an entirely synthetic pipeline.
00:01:40.580 | For example, you can use our Cosmopedia datasets
00:01:42.820 | and you can train a 1B model on like 150 billion tokens
00:01:57.660 | you can use a benchmark that uses LLMs as a judge,
00:02:03.420 | So I think this is like a really mind blowing
00:02:08.860 | And I think there's a lot of concerns about model collapse
00:02:13.020 | but we'll see that like if we use synthetic data properly
00:02:15.900 | and we curate it carefully, that shouldn't happen.
00:02:18.840 | And the reason synthetic data is very popular right now
00:02:36.100 | we have some really good inference frameworks.
00:02:47.320 | Now let's talk about the elephant in the room,
00:02:58.660 | it's really scary because there's a lot of synthetic data
00:03:04.240 | So we're gonna be training on a lot of synthetic data.
00:03:14.860 | I think a lot of people think the web is polluted
00:03:19.140 | And for example, when we're building FineWeb datasets,
00:03:23.180 | we're interested in like how much synthetic data
00:03:26.580 | So there isn't really a method to properly measure
00:03:33.820 | But one thing we can do is to try to look for like
00:03:41.900 | that we know are actually generated by ChatGPT.
00:03:44.500 | We could try to measure the amount of these words
00:03:47.340 | in our dataset and compare them to the previous years.
00:03:50.040 | For example, here, we measured the ratio of these words
00:03:54.820 | And we can see that like the ratio really increased
00:03:58.980 | So if we were to say that the amount of synthetic data didn't change
00:04:03.420 | you would expect this ratio to stay constant,
00:04:06.980 | So there's a lot of synthetic data probably on the web,
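As a rough illustration of the measurement described above, here is a minimal sketch of counting ChatGPT-associated marker words across dumps from different years; the word list and the toy documents are assumptions, not the ones behind the talk's plot.

```python
import re

# Hypothetical words whose frequency spiked after ChatGPT's release; the actual
# word list behind the talk's plot is not specified here.
MARKER_WORDS = {"delve", "showcasing", "underscores", "tapestry"}

def marker_word_ratio(documents):
    """Fraction of all tokens that are one of the marker words."""
    total = hits = 0
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())
        total += len(tokens)
        hits += sum(1 for t in tokens if t in MARKER_WORDS)
    return hits / max(total, 1)

# Toy stand-ins for web dumps from different years; if the amount of synthetic
# data were constant, this ratio should stay roughly flat over time.
dumps = {
    "2013": ["an old forum post about fixing a bicycle chain"],
    "2024": ["this article will delve into a rich tapestry of ideas, showcasing key points"],
}
for year, docs in dumps.items():
    print(year, marker_word_ratio(docs))
```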
00:04:12.040 | So what we did is we trained different models
00:04:22.320 | And surprisingly, you can see that the latest dumps
00:04:24.480 | are actually even better than the dumps that are before.
00:04:34.440 | So personally, I wouldn't say the web is polluted
00:04:40.180 | And the issue with like model collapse is that,
00:04:48.320 | and you would ask the model to complete, for example,
00:04:51.920 | and then you would train it on these new generations,
00:04:57.280 | it's normal to observe this kind of behavior,
00:05:02.400 | And then if you train it just on these generations,
00:05:13.920 | you can expect to get like better performance
00:05:25.600 | where Microsoft basically trained a series of small models
00:05:35.040 | were actually better than models that are much larger.
00:05:41.640 | but it was also met with a lot of skepticism,
00:05:48.240 | Because the dataset that they trained on was not public.
00:05:54.560 | or maybe there's just some data contamination.
00:06:01.560 | And as Hugging Face, because we're like open source,
00:06:07.880 | We basically tried to follow a similar approach
00:06:11.400 | And we created a synthetic dataset of textbooks
00:06:26.000 | is trying as much as possible to keep it diverse.
00:06:28.880 | Because if you just throw the same prompts at your model,
00:06:31.160 | like generate a textbook about linear algebra,
00:06:37.080 | So there's no way you could scale to millions of samples.
00:06:40.680 | And the way you do that is by creating prompts
00:06:48.560 | we would ask the model to generate a textbook,
00:06:50.880 | but make it related to an extract from a webpage.
00:06:54.160 | And also we try to frame it to stay within topic.
00:07:02.200 | and then we ask the model to generate a textbook
00:07:04.640 | related to medicine that is also related to this webpage.
00:07:14.760 | is not gonna be diverse when you change the seed example.
00:07:30.000 | and find the pages that are related to the topics
00:07:36.120 | with the type of generations we want the model to generate.
00:07:41.160 | for middle school students or a textbook for college students.
00:07:43.840 | And we found that like some generation styles
00:07:49.760 | For example, college textbooks are really good for MMLU,
00:07:52.640 | while middle school textbooks are good for benchmarks
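As a minimal sketch of the seeded prompting idea described above, here is how a topic, an audience style, and a web extract can be combined into one prompt; the template, topic, and audience labels are illustrative assumptions, not the actual Cosmopedia templates.

```python
# Illustrative only: the real Cosmopedia prompt templates, topic clusters, and
# audience styles differ from these made-up examples.
PROMPT_TEMPLATE = (
    "Write a {audience} textbook section about {topic}.\n"
    "Stay within the topic, but ground the content in this web extract:\n"
    "---\n{web_extract}\n---"
)

def build_prompt(topic: str, audience: str, web_extract: str) -> str:
    """Combine a topic, a target audience/style, and a seed web extract, so that
    changing the seed changes the generation and keeps the dataset diverse."""
    return PROMPT_TEMPLATE.format(topic=topic, audience=audience, web_extract=web_extract)

print(build_prompt(
    topic="medicine",
    audience="middle school",
    web_extract="A short page explaining how vaccines train the immune system...",
))
```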
00:07:56.840 | This is like a sample from like our search tool.
00:08:01.600 | For example, you have a top category, which is a topic,
00:08:10.760 | And here you can see the comparison between Cosmopedia.
00:08:14.640 | We had two versions, V1 and V2 in blue and red,
00:08:22.880 | training on Cosmopedia was consistently better.
00:08:27.160 | that was actually good to train these models on.
00:08:29.800 | It's of course so much smaller than FineWeb,
00:08:33.600 | but that's the scale that Microsoft's datasets were.
00:08:36.200 | So we kind of managed to reproduce a bit what they did,
00:08:39.240 | and the dataset is public, so everyone can go there,
00:08:43.880 | And this is the recent paper from NVIDIA, Nemotron CC.
00:09:04.640 | some really huge synthetic datasets out there,
00:09:08.520 | so like you can try to filter them even further
00:09:11.080 | if you wanna get like more high-quality corpora.
00:09:18.040 | this approach was suggested in this paper by Pratyush,
00:09:28.720 | to rewrite these samples into a better format.
00:09:31.880 | For example, they ask an LLM to rewrite the sample
00:09:46.520 | it's just rewriting a page into a different style.
00:09:49.240 | So the model doesn't need to have like knowledge
00:09:54.320 | compared to just asking a model to generate a new textbook
00:10:07.960 | And so what they did in Nemotron CC is a similar approach.
00:10:13.880 | They rewrite some pages from Common Crawl for two reasons.
00:10:18.320 | One is to like improve pages that are low quality.
00:10:22.400 | So they rewrite them into, for example, a Wikipedia page,
00:10:26.440 | And another reason is to create more diverse datasets.
00:10:29.680 | So they have a dataset that they already heavily filtered,
00:10:33.000 | and then they take these pages that are already high quality
00:10:42.960 | So this way they can reuse the same page multiple times
00:10:45.920 | without fearing like having multiple duplicates
00:10:52.440 | So I think that's also a really interesting approach
00:10:57.200 | just by rephrasing the pages that you already have.
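A minimal sketch of the rephrasing idea described above, in the spirit of that rephrasing paper and Nemotron CC; the prompt wording and style list are assumptions, not the papers' actual templates, and the model call is left as a placeholder.

```python
# Illustrative sketch of rephrasing-based augmentation: an instruct model only
# restyles content that is already on the page, so it needs less world
# knowledge than generating a new textbook from scratch.
STYLES = {
    "wikipedia": "Rewrite the following web page as a clear, factual Wikipedia-style article.",
    "qa": "Rewrite the following web page as a list of question-answer pairs.",
}

def rephrase_prompt(page_text: str, style: str) -> str:
    """Build a prompt asking an instruct model to rewrite an existing page in a given style."""
    return f"{STYLES[style]}\n\nPage:\n{page_text}"

page = "buy cheap widgets!!! our widgets r the best, click here, free shipping..."
for style in STYLES:
    prompt = rephrase_prompt(page, style)
    # rewritten = instruct_model.generate(prompt)  # plug in any capable LLM here
    print(prompt[:80])
```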
00:11:13.280 | that there's some leftover metadata in the webpage
00:11:19.240 | So they train a model that can generate programs
00:11:22.600 | that can like normalize and remove lines that are extra.
00:11:25.880 | So I think this approach is also interesting,
00:11:57.080 | the educational content of webpages from zero to five.
00:12:00.920 | So for example, if a page is like a really good textbook
00:12:10.040 | or promotional material, it would get a lower score.
00:12:13.320 | And then after that, we take these synthetic annotations
00:12:20.880 | And then we run this classifier on all of FineWeb,
00:12:25.920 | And then we only keep the pages that have like a score
00:12:31.000 | we went from 15 trillion tokens to just 1.5 trillion tokens.
00:12:37.160 | And as you can see here, FineWeb-Edu outperforms
00:12:40.280 | all the other public web datasets by a larger margin
00:12:47.840 | And you can see that this approach is really effective
00:12:50.040 | for filtering web datasets to get like better corpora
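A minimal sketch of that annotate-then-filter loop; the real FineWeb-Edu pipeline scores many more pages with an LLM and trains a classifier on embeddings, so the toy features, toy labels, and the keep-threshold of 3 below are illustrative assumptions.

```python
# Sketch of "annotate with an LLM, train a cheap classifier, filter the corpus".
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Step 1: a small set of pages with LLM-provided educational scores (0 to 5).
pages = [
    "Introduction to photosynthesis: plants convert light into chemical energy.",
    "BUY NOW!!! Limited-time offer on luxury watches, click here.",
    "A step-by-step proof of the Pythagorean theorem with worked examples.",
    "Celebrity gossip roundup: who wore it best this weekend?",
]
llm_scores = [4, 0, 5, 1]

# Step 2: train a cheap classifier to imitate the LLM annotator.
vec = TfidfVectorizer()
X = vec.fit_transform(pages)
clf = LogisticRegression(max_iter=1000).fit(X, [s >= 3 for s in llm_scores])

# Step 3: run the classifier over the full corpus and keep only "educational" pages.
corpus = [
    "How vaccines train the immune system, explained for students.",
    "Flash sale: 70% off sneakers today only!",
]
keep = [doc for doc, ok in zip(corpus, clf.predict(vec.transform(corpus))) if ok]
print(keep)
```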
00:13:07.320 | Instead, they trained it on OpenHermes dataset,
00:13:15.920 | And then they also get really high quality datasets,
00:13:22.800 | and can help you train some really good LLMs.
00:13:33.920 | So they use, for example, the DCLM classifier
00:13:46.000 | And they get a dataset that works even better
00:13:49.840 | So that was it for like synthetic data for pre-training.
00:14:03.840 | where they basically try to target some specific skills
00:14:07.080 | and improve the performance of models on them.
00:14:14.160 | and they managed to get a dataset that outperforms
00:14:40.880 | they're easier to generate instructions from.
00:14:57.280 | And the way they make sure that this dataset is diverse
00:15:00.240 | is by using personas from the Persona Hub datasets,
00:15:17.240 | and then ask it to generate like a coding problem.
00:15:19.920 | This way you make sure that your dataset is really diverse,
00:15:22.480 | and then you can further filter the datasets,
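A minimal sketch of persona-conditioned prompt generation in the spirit of Persona Hub; the personas and the task template below are made up for illustration.

```python
# Illustrative sketch: pairing many different personas with the same task
# template yields diverse instructions. These personas are invented examples.
personas = [
    "a retired air-traffic controller who loves board games",
    "a marine biologist studying coral reefs",
    "a high-school teacher preparing a robotics club",
]
TASK = "Write a short, self-contained coding problem that this person might care about."

def persona_prompts(personas, task):
    """One generation prompt per persona; a teacher LLM answers each of them."""
    return [f"You are {persona}. {task}" for persona in personas]

for prompt in persona_prompts(personas, TASK):
    # completion = teacher_model.generate(prompt)  # plug in any strong teacher model
    print(prompt)
```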
00:15:30.720 | and we also tried to cover the wide range of tasks.
00:15:37.880 | we also outperformed the original Mistral instruct
00:15:52.000 | called the Multilingual Data Arbitrage by Cohere.
00:15:55.600 | And basically they want to generate a dataset
00:16:01.960 | It's the fact that there isn't like one model
00:16:04.120 | that's really good at all the languages they wanted.
00:16:09.240 | not just one teacher model, but multiple teachers,
00:16:34.760 | and get like a dataset that's of a really high quality,
00:16:37.440 | and that's diverse, and that covers all your needs.
00:16:43.520 | I was supposed to put a meme there, but lack of time.
00:16:46.760 | Yeah, so that was it for like synthetic data.
00:17:01.000 | but like now we have some really good small models.
00:17:03.480 | For example, Llama 3.2 1B, it matches Llama 2 13B
00:17:08.480 | that was released last year on the LMSYS Arena,
00:17:11.680 | which is basically the default go-to leaderboard
00:17:14.160 | for evaluating models using human evaluation.
00:17:20.600 | So I think we've made like a huge leap forward
00:17:24.160 | Of course, that's just one data point, but there's more.
00:17:32.960 | it shows that today we have some really good models
00:17:35.640 | that are only like 3 billion parameters and 4 billion,
00:17:41.600 | which is a really popular benchmark for evaluating models.
00:17:47.360 | have more than 65 on MMLU and the gray ones have less.
00:17:55.000 | So now we have a 3B model that outperforms a 33B model
00:18:02.840 | So I think now people are starting to realize
00:18:05.760 | that like we shouldn't just scale and scale models,
00:18:08.840 | but we should try to make them more efficient.
00:18:14.760 | but you can also chat with a 3B+ model on your iPhone.
00:18:18.480 | For example, here, this is an app called PocketPal,
00:18:21.120 | where you can go and select a model from Hugging Face.
00:18:28.840 | which is 3.8 billion parameters on this iPhone,
00:18:33.840 | And you can see that even the latency is also acceptable.
00:18:37.600 | For example, here, I asked it to give me a joke
00:18:40.240 | about NeurIPS, so let's see what it has to say.
00:18:43.000 | Okay, why did the neural network attend NeurIPS?
00:18:49.480 | Because it heard there would be a lot of layers and fun,
00:18:54.760 | So not very funny, but at least it can run on device.
00:18:57.400 | Yeah, so I think now we have good small models,
00:19:02.160 | but we also have like good frameworks and tools
00:19:06.240 | So I think we're really close to having like really on-edge
00:19:12.440 | And I think for a while, we've had this narrative
00:19:18.280 | Of course, this is supported by the scaling laws.
00:19:23.440 | when we scale the model size, the loss is lower,
00:19:37.000 | And of course, we all observed the performance improvement
00:19:48.120 | And so the largest models are gonna cost so much more.
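For reference, the scaling-law picture being appealed to here is usually written in the Chinchilla-style parametric form (Hoffmann et al., 2022), where loss falls as a power law in both parameter count and training tokens; this formula comes from that literature, not from the talk itself.

```latex
% Chinchilla-style parametric scaling law: E is the irreducible loss,
% N the number of parameters, D the number of training tokens.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```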
00:19:51.480 | So I think now, instead of just building larger models,
00:19:56.320 | we should be focusing on building more efficient models.
00:19:59.120 | It's no longer a race for the largest models,
00:20:01.680 | since these models are really expensive to run,
00:20:04.040 | and they require a really good infrastructure to do that,
00:20:07.240 | and they cannot run on, for example, consumer hardware.
00:20:10.560 | And when you try to build more efficient models
00:20:21.920 | is the trend of training smaller models longer.
00:20:24.840 | For example, if you compare how long Llama was trained
00:20:29.720 | there is a huge increase in the pre-training length.
00:20:35.320 | but Llama 3 8B was trained on 15 trillion tokens.
00:20:38.600 | So Meta managed to get a model that's the same size,
00:20:43.760 | by choosing to make that sacrifice during training,
00:20:47.960 | because as we know, training is a one-time cost,
00:20:52.080 | If we wanna see what the small model trends were in 2024,
00:20:58.840 | I think this MobileLLM paper by Meta is interesting.
00:21:16.600 | that have more layers than just making them more wide.
00:21:26.120 | for models that are just a few hundred million parameters.
00:21:30.240 | There's also the Apple Intelligence Tech Report,
00:21:34.520 | So for Apple Intelligence, they had two models,
00:21:36.760 | one that was on server and another model that was on device.
00:21:49.200 | where they show that using pruning and distillation
00:21:52.080 | works much better than training from scratch.
00:21:56.360 | about how they specialize their models on specific tasks.
00:21:59.560 | Like for example, summarization and rewriting.
00:22:08.480 | I think you've already had a talk about hybrid models.
00:22:12.720 | And this model, they used a hybrid architecture
00:22:22.040 | without needing to train it on a lot of tokens.
00:22:32.840 | which are the best in class in each model size.
00:22:35.920 | For example, our 1.7B model outperforms Llama 1B
00:22:53.760 | We also created some new math and code datasets
00:23:11.240 | but this model is trained on 11 trillion tokens.
00:23:13.840 | And we saw that the performance kept improving.
00:23:15.800 | The models didn't really plateau mid-training,
00:23:19.400 | It shows that you can train such small models for very long
00:23:26.480 | What's interesting about SmolLM2 is that it's fully open.
00:23:36.280 | Also, there's really interesting small models for text,
00:23:50.480 | There's also Moondream 0.5B, which was released recently.
00:23:55.040 | It's like the smallest vision language model.
00:23:57.240 | And as you can see, there isn't a big trade-off
00:24:09.080 | but why should you consider using small models and when?
00:24:18.240 | Because these models are small and they can run fast,
00:24:24.720 | And this means that your dataset stays locally.
00:24:27.200 | You don't have to send your queries to third parties.
00:24:33.040 | one of the big selling points for Apple Intelligence.
00:24:35.760 | Also, right now we really have so many frameworks
00:24:41.520 | For example, there's MLX, MLC, llama.cpp, Transformers.js.
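As one concrete illustration of local inference with one of those stacks (not shown in the talk), this is roughly how a small quantized model can be run fully offline with the llama-cpp-python bindings for llama.cpp; the GGUF filename is a placeholder.

```python
# Rough sketch of fully local inference via llama-cpp-python (bindings for
# llama.cpp). The model path is a placeholder; any small instruct GGUF export
# (e.g. a SmolLM2 or Phi-3.5 quantization) would work the same way.
from llama_cpp import Llama

llm = Llama(model_path="./smollm2-1.7b-instruct-q4_k_m.gguf", n_ctx=2048)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Tell me a joke about NeurIPS."}],
    max_tokens=64,
)
print(response["choices"][0]["message"]["content"])
```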
00:24:56.720 | For example, here there's a startup called NuMind,
00:25:00.320 | and then they fine-tuned this on text extraction datasets.
00:25:04.960 | that's not very far from models that are much larger.
00:25:07.880 | So I think text extraction is like one use case
00:25:15.640 | You can also chat with these models in browser.
00:25:21.120 | you can load the model, you can even turn off your internet
00:25:23.560 | and just start chatting with the model locally.
00:25:31.040 | there are really good methods for structured generation.
00:25:40.680 | to follow a schema for extracting key information
00:25:48.000 | which is a complaint about a GitHub repository,
00:25:53.080 | and the model can extract anything that is relevant
00:26:05.680 | And you can just like do this in the browser.
00:26:08.000 | You can transform your text into a GitHub issue
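The browser demo relies on constrained, schema-guided decoding; as a simplified Python sketch of the same idea, the snippet below only prompts for JSON and then validates it strictly, and the schema fields plus the hard-coded response are made-up stand-ins.

```python
# Simplified sketch of schema-guided extraction. A structured-generation library
# would constrain decoding to the schema; here we just prompt for JSON and
# validate it. The schema fields and the canned output are invented examples.
import json
from pydantic import BaseModel

class GitHubIssue(BaseModel):
    title: str
    body: str
    labels: list[str]

complaint = "Your CLI crashes on Windows whenever the config file has a unicode path!"
prompt = (
    "Extract a GitHub issue from the complaint below. Respond with JSON matching "
    f"this schema: {json.dumps(GitHubIssue.model_json_schema())}\n\nComplaint: {complaint}"
)
# raw = small_model.generate(prompt)  # any local instruct model could go here
raw = '{"title": "CLI crashes on unicode config paths (Windows)", "body": "...", "labels": ["bug", "windows"]}'

issue = GitHubIssue.model_validate_json(raw)  # raises if the output drifts off-schema
print(issue.title, issue.labels)
```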
00:26:12.680 | So what's next for synthetic data and small models?
00:26:25.600 | For example, generating synthetic data for math.
00:26:45.280 | I think specializing them through fine-tuning,
00:26:55.560 | I think you can already get decent performance
00:26:58.480 | So you don't need to pay like a cost that's much larger
00:27:01.760 | just to make your model better at your task by a few percent.
00:27:07.400 | And I think it also applies for other modalities
00:27:26.680 | Maybe for others too. I should also say a hot take.
00:27:32.520 | we started like with fine-tuning, for example,
00:27:35.120 | trying to make BERT work on some specific use cases
00:27:40.080 | And then we had some models that are much larger.
00:27:41.960 | So we just switched to like prompt engineering
00:27:48.360 | where we realized these models are really costly.
00:27:54.880 | and we're gonna start to see like more fine-tuning
00:27:57.200 | and less of just like prompt engineering the models.
00:28:02.960 | And if you have any questions, we can take them now.