
Training Llama 2, 3 & 4: The Path to Open Source AGI — with Thomas Scialom of Meta AI


Chapters

0:00 Introductions
4:16 The Llama Origin Story
7:34 Are there RLHF scaling laws?
9:56 Avoiding the "Chinchilla trap"
12:15 Why 405B?
14:27 FP8 training and other scaling research
17:48 Llama 3 vs Llama 2
18:32 Synthetic data for pre-training
21:43 Tool use to generate synthetic data
22:40 Pre-training data recipe
26:00 Why not MoE?
27:05 Why RLHF is so important
37:06 How they eval models
41:50 Benchmarking Uncertainty
44:04 Structured output and tool calling
45:52 Llama 4 & Agents
52:01 Will Meta keep releasing open models?
53:55 Why tokenizer vocab size is underrated
59:12 AI & Startups
63:13 Hiring at Meta AI

Whisper Transcript

00:00:00.000 | (upbeat music)
00:00:02.580 | - Hey everyone.
00:00:05.640 | Welcome to the Latent Space Podcast.
00:00:07.440 | This is Alessio, partner and CTO
00:00:09.320 | in Residence at Decibel Partners.
00:00:10.800 | And I'm joined by my co-host, Swyx,
00:00:12.560 | founder of Smol AI.
00:00:13.760 | - Hey, and today we have a very special episode
00:00:17.240 | with Thomas Scialom.
00:00:17.240 | I don't know how to describe,
00:00:18.360 | you've done so much work
00:00:19.720 | in a very short amount of time at Meta,
00:00:21.240 | but you were most notably leading Llama 2.
00:00:24.200 | And now today we're also coordinating
00:00:26.760 | on the release of Llama 3.
00:00:27.840 | So welcome.
00:00:28.680 | - Thanks for having me.
00:00:29.640 | - To be clear, obviously the Llama 3 405B.
00:00:33.080 | Is that the official size number that we're going with?
00:00:35.880 | Or is it, do we just say 400B?
00:00:37.080 | - For the text model only, yes.
00:00:40.560 | A bit of additional parameters
00:00:42.080 | for the multimodal version that will come later.
00:00:44.040 | - Awesome. Awesome.
00:00:45.160 | Just to quickly go over your background.
00:00:46.840 | Actually we had a slightly similar past.
00:00:48.680 | I was also a quantitative trader
00:00:50.640 | and it looks like you did five years in quant finance,
00:00:53.080 | working at Trading Timer in SOC Gen.
00:00:55.240 | And then you transitioned into natural language,
00:00:58.280 | getting your PhD at Sorbonne,
00:00:59.920 | working on Recital as well.
00:01:01.880 | And then right after your PhD, joining Meta.
00:01:04.640 | - No, it's exactly that.
00:01:05.600 | But basically, I think it was at the AlphaGo moment,
00:01:08.360 | when I was doing some trading,
00:01:09.760 | that I said, I need to understand
00:01:13.080 | the technology behind that.
00:01:14.640 | And I wanted to study machine learning.
00:01:16.040 | I first did some training,
00:01:17.680 | like a six-month executive degree,
00:01:20.240 | at the end of which I knew what XGBoost was at the time
00:01:23.080 | and nothing about deep learning at all.
00:01:25.680 | So, and most of the people around were like PhD people.
00:01:30.400 | Okay, PhD seems pretty cool.
00:01:32.600 | Deep learning seems pretty cool.
00:01:33.840 | So I want to do a PhD in deep learning.
00:01:36.760 | That's when I joined.
00:01:38.600 | We have this PhD program in France,
00:01:41.320 | shared between a company and academia.
00:01:44.560 | And so I did my PhD with Recital and Sorbonne University
00:01:48.120 | on natural language generation with reinforcement learning.
00:01:51.000 | I guess it was a good topic.
00:01:52.920 | I was not like a visionary.
00:01:54.680 | It was very random.
00:01:56.240 | I had a company that offered me this topic,
00:01:59.960 | and I started something like two weeks before BERT.
00:02:03.120 | - Excellent timing.
00:02:03.960 | Yeah, we actually also just released our episode
00:02:06.040 | with Clémentine Fourrier, who also did her PhD
00:02:09.120 | with a company in kind of like a very similar format.
00:02:11.720 | I think, yeah, very underrated, very underrated.
00:02:14.040 | This sort of PhD with industry expertise
00:02:16.920 | because you're also like publishing papers the whole time.
00:02:19.160 | I looked at your publishing history.
00:02:21.040 | You were doing like summarization work.
00:02:23.360 | You're doing factual consistency work.
00:02:25.000 | You released some benchmarks
00:02:26.720 | and then you worked on language GANs
00:02:28.560 | before the transformers took over.
00:02:30.680 | - We can come back to that later,
00:02:33.320 | but I should have, I mean, those papers have like 10, 50 citations.
00:02:38.000 | I'm pretty sure that if I had called them
00:02:39.960 | RLHF without the human in the loop,
00:02:44.600 | but with a discriminator, which is synthetic,
00:02:47.680 | instead of a human in the loop,
00:02:48.720 | I would have gotten many more citations today,
00:02:51.000 | because all the inspiration for those papers
00:02:54.000 | came from the original OpenAI paper on RLHF.
00:02:57.520 | But in academia, we don't have a way
00:02:59.640 | to pay for annotation online like that.
00:03:03.120 | So how do you simulate it?
00:03:05.920 | - Yeah, a lot of these ideas are repeated,
00:03:07.400 | like discriminator, generator,
00:03:08.480 | we just call them different names now,
00:03:10.040 | like verifier, whatever. - Exactly.
00:03:12.480 | - Well, I think your progress into NLP was like really strong
00:03:16.040 | 'cause like the first thing you worked on at Meta was Bloom.
00:03:18.080 | - Yeah, actually I started to work on that
00:03:20.280 | before joining Meta.
00:03:21.560 | I was not like one of the main contributors,
00:03:24.120 | but it was at the intersection of multilinguality,
00:03:26.880 | which was very important to me, large language modeling.
00:03:30.600 | And that's why actually my first big project
00:03:33.200 | at Meta and the team I was working on was Galactica.
00:03:36.120 | And actually, an interesting step back from BLOOM
00:03:39.440 | was that we made a lot of mistakes,
00:03:41.160 | but that was expected in a way.
00:03:43.400 | We learned a lot
00:03:44.760 | while trying to scale towards multilinguality.
00:03:48.400 | In fact, we learned later that multilinguality
00:03:51.440 | almost emerged naturally with very, very few data,
00:03:54.240 | which was really surprising
00:03:55.360 | and not expected at all for us at the time.
00:03:57.440 | - I mean, my learning from that is just,
00:03:58.920 | there's a natural harmony of language
00:04:01.480 | that is abstract from English.
00:04:03.480 | When you learn English, you learn language
00:04:05.320 | and then language just translates
00:04:07.040 | to other forms of languages,
00:04:08.920 | especially if they're the same family, right?
00:04:10.600 | Like, yeah, so maybe we should get right into Llama 2,
00:04:13.600 | spend a little bit of time there
00:04:14.600 | and then we'll go into Llama 3.
00:04:16.040 | So what is the story of Llama 2 from your point of view?
00:04:19.560 | - Yeah, so as I was saying, I started at Meta on Galactica.
00:04:24.080 | That was one of the first large language models at Meta.
00:04:27.040 | It's a language model for science.
00:04:28.640 | We released it in, I think, December or end of November,
00:04:31.720 | I don't remember, one year and a half ago.
00:04:34.560 | I don't know if people remember,
00:04:35.840 | but it was huge on Twitter,
00:04:38.400 | both with people thinking it's the end of science
00:04:41.280 | and with people saying that it hallucinated a lot of papers,
00:04:43.680 | although I was like, it's super awesome.
00:04:45.640 | I still think it was super awesome,
00:04:47.560 | but you know, we didn't do like instruction tuning
00:04:50.120 | or RLHF techniques at the time.
00:04:52.200 | It was a weird moment because two weeks later,
00:04:54.560 | ChatGPT came out, and that's the moment where,
00:04:57.920 | I think, everything at the company went upside down
00:05:00.720 | and where we had a huge push from leadership
00:05:04.720 | to now work on that and make a ChatGPT as soon as possible.
00:05:07.640 | So we had this one, two months of like what to do.
00:05:11.080 | I actually was working on Galactica Instruct,
00:05:14.240 | which basically you could connect to Overleaf.
00:05:16.440 | We had a partnership with Overleaf,
00:05:18.720 | the Google Docs of scientists, where you can write papers,
00:05:22.680 | and you write there in LaTeX,
00:05:25.040 | and you have to do a lot of citations.
00:05:27.080 | So the idea was that you can,
00:05:28.480 | just like ChatGPT or GPT Instruct,
00:05:30.720 | ask it to swap two columns in a LaTeX table.
00:05:34.200 | That's something very, very time-consuming.
00:05:36.680 | I can promise.
00:05:37.720 | You could like say, oh, find me a citation
00:05:39.640 | about LLMs and bias.
00:05:42.320 | It would find you some papers,
00:05:45.240 | and automatically insert the BibTeX in LaTeX.
00:05:45.240 | So that was pretty cool.
00:05:46.080 | But because of the backlash,
00:05:47.040 | we never released it in the end.
00:05:49.280 | - Oh, because the Galactica backlash.
00:05:51.880 | - Yes.
00:05:52.720 | - Like I was just saying like today it's not solved
00:05:55.680 | because Lucas Beyer is still asking
00:05:55.680 | for the citation generator.
00:05:58.440 | - I told him.
00:06:00.800 | I was like, dude, we had that two years ago
00:06:02.840 | and I promise, I tested it.
00:06:02.840 | It works so well.
00:06:04.400 | I had it on Overleaf integrated.
00:06:06.280 | I tested it.
00:06:07.280 | - Wow, okay.
00:06:08.720 | - Yeah, yeah, yeah.
00:06:09.560 | No, it went quite far in fact.
00:06:11.720 | And actually, about citations,
00:06:13.440 | it's anecdotal,
00:06:14.600 | but because Galactica was trained to cite papers
00:06:18.400 | with all the references in the paper,
00:06:19.960 | that's what made this capability emerge so easily at instruction-tuning time.
00:06:24.280 | Actually, Galactica Instruct
00:06:26.400 | was the first annotation project for RLHF at Meta.
00:06:30.400 | That was a follow-up of Galactica that we were preparing.
00:06:33.000 | And at the same time,
00:06:34.600 | my friends from the Paris office created Llama 1.
00:06:38.760 | It's like to connect the dots with what we said before.
00:06:41.760 | The last author was Guillaume Lample who founded Mistral.
00:06:44.760 | The first author is Hugo Touvron
00:06:45.920 | who worked with me on Llama 2, still at Meta.
00:06:48.960 | Both did that PhD program within Meta
00:06:51.480 | as the company, alongside academia.
00:06:53.600 | So that's a pretty good program indeed.
00:06:56.240 | And so we worked on Llama 2 from that point.
00:06:59.240 | We had all the support from the company leadership.
00:07:01.600 | That was one of the main priorities.
00:07:03.720 | We had Llama 1 and Galactica
00:07:05.400 | as the backbone of a good language model.
00:07:08.320 | We started from Llama 1,
00:07:09.840 | and we worked mainly with Guillaume
00:07:12.800 | on how to make instruction-following
00:07:15.160 | and chat models that will follow instructions.
00:07:18.280 | So all the supervised fine-tuning stage,
00:07:20.480 | then the RLHF, there are some papers,
00:07:22.480 | so we had some intuitions from there we could use.
00:07:25.240 | But in fact, at large scale,
00:07:26.960 | and that was probably the biggest challenge for us,
00:07:31.200 | there's no research anymore.
00:07:32.560 | We don't know how much to scale.
00:07:34.720 | - Can you describe what scale you're talking about?
00:07:36.280 | - Yeah, yeah, to what level to scale the annotation.
00:07:39.640 | Is the annotation like, do you need 100,000,
00:07:42.240 | 1 million, 10 million annotation
00:07:44.440 | of supervised fine-tuning, of RLHF preferences?
00:07:47.480 | We had no idea.
00:07:48.880 | What is the actual algorithm to do?
00:07:50.920 | How often to retrain the models?
00:07:52.720 | You have just the basic,
00:07:54.000 | but then when it comes to ChatGPT
00:07:56.160 | or GPT Instruct or Claude,
00:07:58.560 | no one published the details there.
00:08:00.440 | And so we had to reinvent the wheel there
00:08:02.320 | in a very short amount of time.
00:08:03.800 | - And what about parameter size?
00:08:05.440 | This is one question that a lot of folks
00:08:07.320 | had about Llama 3.
00:08:08.680 | So Llama 1, you had 7B, 13B, 33B, 65B model sizes,
00:08:13.680 | and then Llama 2: 7, 13, 70.
00:08:17.960 | How do you kind of evaluate what's worth training,
00:08:20.400 | especially when you think about data?
00:08:21.560 | It's like, you know, maybe 100,000 is enough
00:08:23.720 | for like a 7B model,
00:08:24.760 | but it's not enough for a 70B model.
00:08:27.040 | How do you decide model size,
00:08:28.560 | especially when you're maybe annotation constrained
00:08:31.120 | on some of these things?
00:08:32.120 | - That's a very good question.
00:08:33.840 | And there's no good answer.
00:08:35.320 | There's so many parameters to take into account
00:08:38.120 | from the scaling laws at training time
00:08:41.120 | to get the best performance.
00:08:43.200 | The GPU constraints on different hardware,
00:08:45.880 | and we think about Meta, but also about the community,
00:08:49.520 | and people are not just using H100s,
00:08:51.640 | there are also A100s, there are different sizes of GPU memory.
00:08:55.800 | So which size will fit in what,
00:08:57.800 | and what is the most useful?
00:08:59.560 | Also at inference time, not just at fine tuning time,
00:09:02.200 | then you can maybe do some tricks at inference time
00:09:04.720 | to quantize it a bit or FP16 or FP8 now.
00:09:08.760 | All those constraints make it very, very challenging.
00:09:11.880 | At inference time, you have a lot of costs.
00:09:14.000 | So how to trade off between inference cost
00:09:16.160 | and training cost?
00:09:17.440 | It's a very challenging problem.
00:09:19.360 | In general, we tend to think, in particular for Llama 3.
00:09:23.320 | Llama 2, maybe I would say, was like Llama 1:
00:09:25.480 | we had a flagship model, which was 70B.
00:09:27.840 | It's also because the project had some roots
00:09:30.480 | in reproducing Chinchilla, which was a 70B.
00:09:34.600 | For Llama 3, we also moved one size up.
00:09:37.960 | The flagship model is 405B.
00:09:40.640 | I think that there was also the question of,
00:09:42.600 | we want a model at this time,
00:09:44.800 | we have this amount of compute.
00:09:46.560 | Given the scaling laws and the amount of tokens
00:09:48.400 | we have to train it, what would be the right balance
00:09:51.320 | to still fit in at inference time?
00:09:53.960 | So we try to have some trade-off like that.
00:09:56.640 | - Yeah.
00:09:57.480 | And you mentioned Chinchilla is the best way to go,
00:09:59.640 | but then you tweeted recently,
00:10:00.960 | "Don't fall into the Chinchilla trap
00:10:02.600 | if you want your model to be used by billions of people."
00:10:05.080 | So what's the updated state of scaling laws?
00:10:08.000 | I think there was obviously the Kaplan one,
00:10:10.200 | and then there was Chinchilla,
00:10:11.400 | and then people kind of got the Llama scaling law,
00:10:13.800 | like the 100 to 200x token-to-parameter ratio.
00:10:18.160 | What's your updated thinking
00:10:19.960 | on how to think about scaling laws
00:10:21.760 | when you pick model size and training data?
00:10:24.160 | - Right.
00:10:25.000 | So, as you said, there's this Kaplan paper with scaling laws,
00:10:28.360 | where they figured out, basically, they tried two dimensions:
00:10:32.360 | the model weights and the amount of training,
00:10:37.040 | like the number of steps, training tokens, epochs.
00:10:40.000 | And from that, they figured that model size is what matters.
00:10:43.520 | So GPT-3 was way too big
00:10:46.240 | compared to the actual number of training tokens,
00:10:48.560 | because they made a mistake by not adapting the scheduler.
00:10:51.440 | That's what Chinchilla emphasized and discovered.
00:10:54.800 | To be fair, I think Kaplan knew that
00:10:56.560 | at the time of Chinchilla paper.
00:10:57.920 | But yeah, basically Chinchilla said
00:11:00.600 | we have to revisit the scaling laws
00:11:02.880 | originally published by Kaplan
00:11:05.000 | and emphasize much more the importance of training tokens.
00:11:08.280 | And they did some really good scaling laws
00:11:10.080 | showing that there's an optimal,
00:11:11.960 | basically you need to double the number of training tokens
00:11:14.640 | every time you double the model weights,
00:11:17.520 | to get an optimal ratio,
00:11:20.400 | so that for a finite amount of compute,
00:11:22.800 | you will end up with the best results in your paper.
00:11:24.920 | And what I call the Chinchilla trap
00:11:26.920 | is that this is good if you want the best flagship model
00:11:29.440 | that obtains the highest performance on paper.
00:11:31.880 | But if you want to use your model at inference time,
00:11:35.880 | then of the two dimensions,
00:11:38.080 | one remains, the model weights,
00:11:40.240 | but one drops out: the number of tokens you trained on,
00:11:42.800 | the number of steps.
00:11:43.920 | And so, to be compute-efficient at inference time,
00:11:46.920 | it's much better to train for a much longer time,
00:11:49.600 | even if it's an additional effort,
00:11:52.400 | than to have a bigger model.
00:11:53.920 | That's what I refer to as the Chinchilla trap.
00:11:57.040 | Not that Chinchilla was wrong,
00:11:58.280 | but if you consider inference time,
00:12:01.640 | you need to go beyond Chinchilla.
00:12:03.440 | And in fact, that's what the Llama 1 folks did
00:12:06.480 | by overtraining it, since they could have gotten
00:12:08.840 | better performance in their paper,
00:12:10.440 | but they preferred to create the best artifact
00:12:13.720 | that would be used by the community.
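(To make the "Chinchilla trap" trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the common approximations of ~6*N*D training FLOPs, ~2*N inference FLOPs per token, and a rough Chinchilla-optimal ratio of ~20 tokens per parameter; the specific budgets and sizes are illustrative assumptions, not Meta's numbers.)

```python
# Back-of-the-envelope comparison: Chinchilla-optimal vs. over-trained ("beyond Chinchilla").
# Assumed approximations: train FLOPs ~ 6*N*D, inference FLOPs ~ 2*N per token,
# Chinchilla-optimal ratio ~ 20 training tokens per parameter.

def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

def inference_flops_per_token(n_params: float) -> float:
    return 2 * n_params

# Budget that is Chinchilla-optimal for a 70B model (70B params, ~1.4T tokens).
compute_budget = train_flops(70e9, 1.4e12)

# Option A: the Chinchilla-optimal model for this budget.
a_params, a_tokens = 70e9, 1.4e12
# Option B: a smaller model over-trained on far more tokens for the same budget.
b_params = 8e9
b_tokens = compute_budget / (6 * b_params)

print(f"A: {a_params/1e9:.0f}B params, {a_tokens/1e12:.1f}T tokens, "
      f"{inference_flops_per_token(a_params):.2e} FLOPs/token at inference")
print(f"B: {b_params/1e9:.0f}B params, {b_tokens/1e12:.1f}T tokens, "
      f"{inference_flops_per_token(b_params):.2e} FLOPs/token at inference")
# Same training budget, but B is roughly 9x cheaper per generated token,
# at the cost of slightly worse "paper" performance than the compute-optimal choice.
```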
00:12:15.280 | - So that's the scaling thinking.
00:12:17.480 | What else went into the Llama 3 planning?
00:12:21.000 | So Llama 3, you have a pretty good model.
00:12:22.920 | People really liked it.
00:12:24.120 | In Llama 3, you drop the intermediate sizes.
00:12:26.720 | So it's 8, 70, and now 405B.
00:12:29.920 | What was the thinking behind going so large?
00:12:31.680 | I mean, you talked about the hardware capabilities
00:12:33.720 | at inference, it's not like I can run a 405B model at home,
00:12:37.080 | for sure.
00:12:37.920 | And it might be hard to even get the cloud resources
00:12:40.720 | to do it.
00:12:41.560 | What was the decision there?
00:12:43.320 | - The decision is super simple.
00:12:45.520 | We want the best model.
00:12:47.360 | We want to be number one and number two.
00:12:49.640 | We started one year and a half ago
00:12:52.360 | and we did quite some journey.
00:12:54.080 | We filled the gap with GPT-4.
00:12:55.960 | So that will be the first open source model
00:12:58.760 | that actually compares to GPT-4.
00:13:01.200 | There's now GPT-4o, of course.
00:13:03.520 | And we're close, but we're not there yet.
00:13:06.320 | Not in all capabilities.
00:13:07.640 | But the gap is getting smaller and smaller.
00:13:10.280 | There's also like what compute we had at the time
00:13:12.840 | when we started to run in January.
00:13:15.040 | We put a lot of effort there,
00:13:16.560 | but as like Mark announced, we have more and more GPUs.
00:13:19.560 | So the next generation will be bigger.
00:13:21.200 | So that's what drives the decision.
00:13:22.800 | Now, maybe let me reflect on two things he said.
00:13:25.800 | You cannot use it at home.
00:13:27.480 | That's probably true.
00:13:28.720 | But quantized to FP8, it can run on a node,
00:13:32.520 | even with a long context of 128K tokens.
00:13:36.480 | Second thing is, I'm hopeful that the community
00:13:39.720 | will lead to a lot of findings by open sourcing it.
00:13:42.560 | And there are smart ways to actually let you use it
00:13:47.040 | on your computer.
00:13:48.200 | If you remember, when we published models,
00:13:51.360 | people were saying it's too big.
00:13:52.440 | And after two weeks, it was running on a Raspberry Pi.
00:13:54.960 | I don't know if it will be the same,
00:13:56.080 | but I hope it's the same kind of trend.
00:13:58.720 | And by releasing those models, we are enabling that.
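(As a rough illustration of why FP8 matters for running a 405B model on a single node, the sketch below estimates weight memory at different precisions. It counts weights only, ignoring KV cache, activations, and runtime overhead, so these are illustrative lower bounds.)

```python
# Rough weight-memory estimate for a 405B-parameter dense model (weights only;
# KV cache, activations, and framework overhead are ignored).
N_PARAMS = 405e9
BYTES_PER_PARAM = {"FP32": 4, "FP16/BF16": 2, "FP8": 1}

for fmt, nbytes in BYTES_PER_PARAM.items():
    gib = N_PARAMS * nbytes / 2**30
    print(f"{fmt:>10}: ~{gib:,.0f} GiB of weights")
# FP16 weights alone (~754 GiB) exceed a typical 8x80GB node (640 GB),
# while FP8 (~377 GiB) fits on one node with headroom for a long-context KV cache.
```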
00:14:02.320 | Now, the last thing I want to add is having bigger models
00:14:06.200 | enables us to collect better data,
00:14:07.840 | for instance, at the RLHF stage,
00:14:10.080 | because that's the model we use for the annotation.
00:14:12.000 | And so we distill, straightforwardly,
00:14:14.400 | this annotation from this better model
00:14:17.360 | to the other models.
00:14:18.640 | So I can guarantee you that the quality
00:14:20.040 | of the smaller models we are releasing with Llama 3
00:14:23.040 | is also thanks to having these artifacts
00:14:25.720 | with which we can collect and train.
00:14:27.560 | - Yeah, there's a lot of really good info there.
00:14:29.760 | One thing I'll just briefly touch on for quantization,
00:14:33.480 | there was a recent Noam Shazeer blog post.
00:14:36.200 | Noam is writing again for some reason.
00:14:39.160 | And he was talking about sort of native FP8 training.
00:14:43.880 | It seems like that is most useful for inference.
00:14:47.000 | That is what you expect the open source community
00:14:48.960 | to do with your weights once you release them anyway.
00:14:51.640 | Is there any movement or thinking
00:14:53.520 | about just moving to FP8
00:14:55.680 | or whatever other new format is in vogue these days?
00:14:59.920 | - There are also these papers on training with,
00:15:02.800 | I forget the name,
00:15:03.640 | but there are two follow-up papers
00:15:05.200 | on just zero, one, or minus one weights.
00:15:08.880 | And like, there's a lot of work there.
00:15:11.200 | I think it's promising directions overall.
00:15:13.480 | Regarding FP8 in particular,
00:15:15.680 | there's also the possibility for the community
00:15:17.040 | to try FP8 or the methods that are very easy
00:15:20.120 | at fine tuning time for the model.
00:15:22.560 | So I'm really looking forward to what the community
00:15:25.320 | can do there.
00:15:26.160 | Although like scaling,
00:15:28.000 | I don't know if it's all you need,
00:15:29.040 | but I will not bet against scaling.
00:15:31.440 | And one of the ways to get more scale
00:15:33.960 | is by having better algorithms, so that we can train
00:15:37.280 | to the same level with less compute.
00:15:40.520 | - Less compute and less memory.
00:15:42.840 | Yeah, like inference time memory
00:15:44.320 | is becoming a real constraint.
00:15:46.240 | - Yeah, yeah, but also training with FP8.
00:15:48.600 | If you're not training with FP8 or, I mean,
00:15:50.440 | FP0 is probably nonsense,
00:15:52.520 | but to what extent, how far we can go, you know?
00:15:55.560 | And every time you unlock something compared to
00:15:58.880 | what we had two, three years ago with FP32 or FP64,
00:16:02.480 | it's like huge progress in terms of scaling.
00:16:05.160 | - For me, it's interesting to say,
00:16:06.800 | to see you mention the ternary quantization,
00:16:10.480 | like the 1.58 bit thing.
00:16:12.560 | 'Cause I didn't know that,
00:16:14.000 | I don't know how much to believe, you know?
00:16:15.400 | Like there's a lot of these kinds of papers
00:16:17.040 | where it makes a lot of noise,
00:16:18.520 | but it doesn't actually pan out, it doesn't scale.
00:16:20.520 | - I totally agree with you.
00:16:21.600 | It's so hard for researchers, at least for me,
00:16:25.480 | to see all those papers published,
00:16:28.200 | all those cool ideas,
00:16:29.760 | all those results that are preliminary.
00:16:32.280 | And in all this massive amount of research,
00:16:36.440 | what will scale or not?
00:16:37.960 | What will resist the test of time or not?
00:16:40.120 | And are we maybe losing some gems
00:16:44.240 | that people are just not working on
00:16:45.360 | because there's too much research around?
00:16:48.160 | I don't know, maybe.
00:16:49.080 | And that's like some problems to have.
00:16:51.600 | That's cool to have these problems nowadays,
00:16:53.720 | compared to probably what Yann LeCun and the others had
00:16:56.080 | 30 years ago, but still it's a problem.
00:16:58.360 | - For what it's worth,
00:16:59.360 | I do think that FAIR is putting out incredible research.
00:17:03.600 | Probably, it doesn't seem like it's your group,
00:17:05.880 | but you also recently published MobileLLM,
00:17:08.920 | which on the small model side is a really good research
00:17:12.520 | on just small model architecture.
00:17:14.880 | It looks like Hugging Face is also replicating it,
00:17:16.880 | and it's doing quite well.
00:17:18.720 | There's a lot of ideas on shared weights and shared matrices
00:17:21.920 | and model architecture stuff that we can talk about
00:17:24.760 | for smaller scale models.
00:17:27.960 | Llama is not at that scale,
00:17:27.960 | but it seems like one of the big themes of this year
00:17:30.800 | is on-device, in-browser,
00:17:33.240 | small models that are good enough for daily use.
00:17:36.280 | I do want to talk about architecture.
00:17:38.560 | I'm not sure when you're releasing
00:17:39.720 | the Llama 3 research paper,
00:17:41.880 | but in Llama 2, you talked a little bit
00:17:43.280 | about the architecture choices.
00:17:45.200 | - It will be released the day, I think, of the release.
00:17:48.560 | - Okay, what should people know?
00:17:50.080 | What are the major choices of Llama 3 versus Llama 2?
00:17:53.640 | - There's not a lot of changes in terms of architectures.
00:17:57.440 | I think we can do a lot better in the future,
00:18:00.320 | and not just with transformers,
00:18:01.920 | but, for instance, to me, it doesn't make sense
00:18:04.120 | to use the same amount of compute per token,
00:18:06.400 | for every token.
00:18:07.480 | Like, the architecture lacks flexibility there.
00:18:09.560 | There's a lot of research to go there,
00:18:11.560 | but still, that's the best thing we have for now.
00:18:14.160 | And so, it's the same recipe as Llama 2,
00:18:17.280 | in terms of architecture and training,
00:18:20.560 | but we put so much effort into scaling the data
00:18:24.360 | and the quality of the data.
00:18:25.880 | There's now 15 trillion tokens,
00:18:27.680 | compared to two trillion,
00:18:29.280 | so it's almost another order of magnitude there, as well,
00:18:31.520 | including for the smaller models.
00:18:33.000 | - One of the things I noticed on the paper
00:18:35.240 | is that you used Llama 2 to do the data cleaning
00:18:38.400 | for what went into Llama 3.
00:18:40.320 | I think there's a lot of chatter, obviously,
00:18:41.720 | about synthetic data,
00:18:43.000 | and there was the "Refrace the Web" paper
00:18:45.360 | that came out, maybe a few months ago,
00:18:47.080 | about using Mastral to make training data better.
00:18:50.160 | Any learnings from that?
00:18:51.960 | It's like, is there,
00:18:53.200 | how much can you rewrite with the models?
00:18:56.240 | I'm sure people would love to hear more about it.
00:18:58.520 | - Right, so it's a very interesting research direction.
00:19:02.320 | Synthetic data, in general.
00:19:03.760 | Synthetic data for pre-training.
00:19:05.480 | My intuition is that the web is full of shit,
00:19:11.520 | in terms of text,
00:19:13.160 | and training on those tokens is a waste of compute.
00:19:15.880 | Just having a good classifier that labels that is cool,
00:19:19.520 | and Llama 2 was, at the time,
00:19:21.560 | before Llama 3,
00:19:22.400 | the best model we had access to, legally,
00:19:26.240 | to label the web
00:19:29.400 | and select what are the good tokens and the bad tokens.
00:19:32.040 | The additional thing is that
00:19:33.360 | it also enabled us to have a topic tag,
00:19:37.360 | like, is it about law?
00:19:38.560 | Is it about politics?
00:19:39.640 | Is it about chemistry, math, reasoning?
00:19:42.040 | So you can also adapt the mixture a bit
00:19:44.800 | to balance the diversity a bit more.
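(A hypothetical sketch of what such an LLM-based quality and topic labeler for pre-training data could look like; the prompt, labels, and the `generate` callable are illustrative assumptions, not Meta's actual pipeline.)

```python
# Hypothetical sketch: using an instruction-tuned LLM to label web documents
# with a quality score and a topic tag, then filtering and grouping the pre-training mix.
# `generate(prompt)` stands in for whatever inference API is available; it is assumed.

TOPICS = ["law", "politics", "chemistry", "math", "reasoning", "code", "other"]

PROMPT = """Rate the educational quality of the following web text from 0 (spam/garbage)
to 5 (high quality), and pick one topic from: {topics}.
Answer as: quality=<int> topic=<name>

Text:
{text}
"""

def label_document(text: str, generate) -> tuple[int, str]:
    """Ask the labeling model for a quality score and a topic tag."""
    answer = generate(PROMPT.format(topics=", ".join(TOPICS), text=text[:4000]))
    quality = int(answer.split("quality=")[1].split()[0])
    topic = answer.split("topic=")[1].split()[0]
    return quality, topic

def filter_corpus(docs, generate, min_quality: int = 3):
    """Keep only documents above a quality threshold, grouped by topic for mix balancing."""
    kept: dict[str, list[str]] = {t: [] for t in TOPICS}
    for doc in docs:
        quality, topic = label_document(doc, generate)
        if quality >= min_quality:
            kept.setdefault(topic, []).append(doc)
    return kept
```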
00:19:48.200 | - To me, I'm not exactly sure what you guys did,
00:19:51.120 | but I feel like when people say synthetic data,
00:19:54.400 | there needs to be different categories of synthetic data now
00:19:57.160 | because I think there's so many different usage
00:19:59.960 | of this thing.
00:20:00.800 | But specifically synthetic data for pre-training,
00:20:02.800 | it feels almost like you're running multiple epochs
00:20:06.760 | on the raw data while it's rephrased
00:20:10.520 | or reformatted by a language model, right?
00:20:13.600 | And in my mind, it's very similar to computer vision,
00:20:15.880 | where you do data augmentation on an item, right?
00:20:19.120 | Like, we're doing data augmentation.
00:20:20.760 | That's the less cool name for synthetic data.
00:20:22.680 | (laughs)
00:20:23.680 | - That's very interesting.
00:20:24.520 | I totally agree with you regarding pre-training,
00:20:28.120 | I totally endorse what you said.
00:20:29.920 | I think it's very different, though,
00:20:31.480 | for post-training and the future direction
00:20:33.320 | on synthetic data that I'm personally excited about.
00:20:35.960 | Like, for instance, what I'm excited about is
00:20:38.840 | we had this survey on augmented LLM a year ago,
00:20:41.680 | and all the idea is like,
00:20:43.000 | if you augment your LLM with something else,
00:20:45.480 | it can be a retriever, it can be search, it can be a tool,
00:20:48.520 | it can be a calculator, it can be a code execution.
00:20:51.360 | Then you are not just distilling,
00:20:54.800 | like doing some data augmentation with your model,
00:20:58.120 | but you're actually adding some expert skills
00:21:01.080 | that possibly go beyond the model weights.
00:21:03.800 | For instance, if your model
00:21:06.240 | got a calculation wrong before,
00:21:09.920 | and now it has access to a calculator,
00:21:11.760 | and you can retrain your model on that,
00:21:13.640 | then you're learning something new.
00:21:15.040 | If your model didn't know something about Llama 2,
00:21:17.840 | it probably doesn't know a lot about Llama 3,
00:21:19.760 | but now if it can search online about it,
00:21:22.440 | and then you train the model on that,
00:21:24.360 | then you have a positive feedback loop,
00:21:26.120 | like what we call expert iteration,
00:21:28.080 | targeting directly the weakness of the model.
00:21:30.160 | It's like continual augmentation of the language model,
00:21:33.720 | much beyond just data augmentation.
00:21:35.480 | - How related is this to tool use?
00:21:37.440 | Like, are you teaching it to use tools to augment the model,
00:21:41.280 | or are you saying, like, do active learning,
00:21:44.080 | do like, where it's weak,
00:21:45.760 | go augment the model with extra data,
00:21:48.600 | and then memorize that new data, right?
00:21:50.960 | - What I said is more like in terms of directions,
00:21:52.840 | not for Llama 3, but like,
00:21:54.800 | when it knows how to use a tool and correct itself,
00:21:58.160 | this is like a very promising direction
00:22:00.480 | that goes much beyond the augmentation
00:22:02.280 | for like, in the future,
00:22:04.680 | to keep collecting new data, new tokens.
00:22:06.680 | People are saying, like, we are running out of tokens.
00:22:09.080 | But if you think about those kinds of tokens,
00:22:10.920 | where the model always goes to correct its own weaknesses,
00:22:14.120 | it can say, like, okay, what's 10 plus 10?
00:22:17.040 | Okay, that's an easy example, the model probably knows it,
00:22:18.760 | but imagine something more complex. Take 10 plus 10:
00:22:21.920 | I expect this to be 20.
00:22:24.000 | Let's verify with a calculator,
00:22:25.920 | which is easy for a basic agent now, powered by an LLM.
00:22:29.640 | And then you verify, with respect to what you expected,
00:22:33.040 | that it's correct.
00:22:34.000 | If it's not, you can backpropagate this example
00:22:37.200 | directly to the weights,
00:22:38.320 | and so the model will keep learning new things.
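(A minimal sketch of that expert-iteration idea, with a calculator as the external tool; `model.generate` and the data format are hypothetical placeholders, not a real API.)

```python
# Hypothetical expert-iteration loop: the model answers arithmetic prompts,
# a calculator tool verifies them, and verified (prompt, correct answer) pairs
# are collected as new fine-tuning data targeting the model's weaknesses.
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def calculator(a: float, op: str, b: float) -> float:
    """External tool: exact arithmetic the model can defer to."""
    return OPS[op](a, b)

def expert_iteration(model, problems):
    """Collect corrected examples wherever the model's own answer is wrong."""
    new_training_data = []
    for a, op, b in problems:
        prompt = f"What is {a} {op} {b}?"
        model_answer = float(model.generate(prompt))   # assumed interface
        truth = calculator(a, op, b)                   # tool-verified ground truth
        if model_answer != truth:
            # The model was wrong: keep the verified answer as a new training target.
            new_training_data.append((prompt, str(truth)))
    return new_training_data  # fed back into fine-tuning
```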
00:22:40.080 | - It makes sense.
00:22:40.920 | What have been your insights?
00:22:41.840 | You know, you mentioned about just like using calculators.
00:22:44.680 | What are the insights?
00:22:45.520 | I think it's just, in general,
00:22:47.120 | a lot of that is just driven using code generation,
00:22:49.360 | apart from just tool use.
00:22:50.920 | What are your insights on just like the data mix
00:22:53.640 | of how much code, how much multilinguality,
00:22:56.760 | which is something that you're also passionate about?
00:22:58.960 | We know that that's changed between Llama 2 and Llama 3.
00:23:01.680 | Is it changing for different stages
00:23:03.480 | between the different sizes of Llama 3?
00:23:05.720 | Like, you know, anything like of that sort?
00:23:08.080 | - No, it didn't.
00:23:09.680 | For the different sizes, we used the same, mostly.
00:23:12.400 | What happened is we changed the data mix
00:23:15.160 | during the training of Llama 3,
00:23:17.120 | with some findings that happened along the way,
00:23:19.120 | I mean, training is long,
00:23:20.360 | so you have to do something while it's training.
00:23:22.480 | And what the team did,
00:23:23.440 | I was working on my side on multimodal and post-training,
00:23:25.400 | but the pre-training team did quite a lot of work
00:23:28.280 | to have some new findings
00:23:30.480 | and improve the data mixture along the way.
00:23:32.320 | And they integrated that before the end of the training.
00:23:35.640 | - I sense a movement in terms of like the curriculum
00:23:39.280 | that people are adopting during pre-training
00:23:41.600 | and even post-training about, you know,
00:23:43.880 | what the mix should be.
00:23:44.720 | Like Snowflake is doing some interesting work
00:23:46.600 | with enterprise intelligence or whatever they call it.
00:23:50.480 | What are your goals with post-training?
00:23:51.800 | Like just at a high level, you know, like,
00:23:53.480 | how do you work with, like, the pre-training team?
00:23:55.760 | - I think it's quite easy for now
00:23:57.920 | because there's not yet like this kind
00:23:59.880 | of continual augmentation where it could feed back
00:24:02.800 | like pre-training, things like that.
00:24:04.440 | One of the big continuum between pre-training
00:24:06.760 | and post-training in particular is continual pre-training
00:24:09.880 | where you actually continue the pre-training
00:24:12.560 | before RLHF in a self-supervised way,
00:24:14.880 | but on expert level domains,
00:24:16.880 | like for it to have an expert in code
00:24:18.720 | and an expert in like reasoning
00:24:20.200 | or an expert in multilinguality
00:24:22.640 | that enables us to collect even better RLHF annotations after.
00:24:25.520 | So that's one thing.
00:24:26.720 | And then you start from those models
00:24:28.880 | to actually do the RLHF stage.
00:24:31.920 | And about your question on the goals and vision,
00:24:33.320 | the goal was to get the best model in those dimensions.
00:24:36.880 | That's actually one thing very different,
00:24:39.160 | I can comment, compared to Llama 2.
00:24:41.960 | Llama 2, you know, as I said, we were nowhere.
00:24:44.800 | We built entirely, end-to-end, all the stack,
00:24:47.200 | from data annotation, contracts, methodology, protocol,
00:24:50.640 | and algorithms for RLHF at Meta.
00:24:52.440 | And we had to limit our scope.
00:24:54.560 | We were not able to work on everything.
00:24:56.880 | We focused mainly on helpfulness,
00:24:59.760 | following instructions, for Llama 2.
00:25:02.520 | And you can see that in the following months after Llama 2,
00:25:06.680 | a lot of open source models came out,
00:25:09.360 | distilling GPT-4 mainly,
00:25:12.200 | but obtaining better reasoning, math, coding chat models.
00:25:16.760 | And we didn't annotate at all for code,
00:25:18.840 | nor for reasoning or multilinguality.
00:25:22.000 | And one thing I'm quite proud of is that, with the early preview release
00:25:26.160 | we did of Llama 3 back in February, May, or March,
00:25:30.160 | I don't remember, it led quickly, almost instantly, to
00:25:34.160 | state-of-the-art results for the model size,
00:25:36.320 | almost competing with GPT-4 on the Arena leaderboard,
00:25:40.200 | where humans pit models against each other,
00:25:41.760 | compare two models and select their preference.
00:25:45.600 | And no one since then has been able to put out
00:25:48.480 | a Llama 3 model better than what we did
00:25:51.520 | on most of the domains,
00:25:52.920 | from code, reasoning, multilinguality, to helpfulness.
00:25:56.280 | So that's the sign that this time, as opposed to Llama 2,
00:25:58.520 | we tackled all those different aspects.
00:26:00.640 | - Do you have any other thoughts
00:26:01.800 | on the more synthetic data focused models,
00:26:05.000 | kind of like a Nemotron?
00:26:06.720 | I think folks were asking if you see that
00:26:08.760 | as an interesting direction too,
00:26:10.880 | kind of having specific synthetic data generation things.
00:26:14.240 | - I don't know about this model exactly,
00:26:15.720 | but I think Llama had better performance overall.
00:26:18.640 | I'm very bullish on synthetic data
00:26:21.120 | generation, but I think it just gets better
00:26:24.240 | when you have a better model.
00:26:25.680 | I'm not really bullish on having a model
00:26:27.880 | only for synthetic data generation.
00:26:29.720 | I understand the need of having bigger models,
00:26:32.600 | but then you can rationalize it:
00:26:34.920 | yeah, maybe people will not use them for inference,
00:26:36.840 | but to distill some specific knowledge
00:26:39.840 | into synthetic data.
00:26:41.000 | That narrative is, I think I totally agree with that,
00:26:45.600 | but having a model purely for that
00:26:48.120 | and not like good at other things,
00:26:50.200 | I don't think it's the case.
00:26:51.520 | - Makes sense.
00:26:52.360 | One of the architecture questions
00:26:53.640 | that I forgot to mention in there was,
00:26:55.480 | so just the architecture choice of like a very big,
00:26:57.920 | you know, 400B dense model.
00:26:59.960 | I actually honestly thought that maybe 175 or, you know,
00:27:04.040 | was kind of the peak, you know,
00:27:06.120 | whatever can fit on like an H100.
00:27:08.000 | So basically I think the common question that people have
00:27:10.200 | is like, why no MoE?
00:27:11.360 | In a way that Mistral and the others have gone in,
00:27:14.120 | you know, it seems like the trend has been MoEs
00:27:16.520 | and you guys have bucked the trend there.
00:27:18.520 | - I heard that question a lot.
00:27:20.240 | Different aspects there.
00:27:21.840 | Why not MoE in the future?
00:27:23.440 | The other thing is, I think a dense model
00:27:27.080 | is just one specific variation of the model,
00:27:30.960 | for one hyperparameter, of an MoE
00:27:32.960 | with basically one expert.
00:27:34.480 | So it's just a hyperparameter
00:27:36.560 | we haven't optimized a lot yet,
00:27:38.960 | but we have some stuff ongoing,
00:27:40.560 | and that's a hyperparameter we'll explore in the future.
00:27:43.760 | - Let's make sure we run through everything on post-training.
00:27:46.440 | You also had a recent tweet about RLHF
00:27:48.720 | versus imitation learning, explained in one tweet.
00:27:52.080 | So we'll put this in the show notes,
00:27:53.400 | but it's basically like two charts about doctor opinions.
00:27:57.760 | On one side, there's like,
00:27:58.880 | whether or not the suggestion is good
00:28:01.240 | from like a content perspective.
00:28:03.560 | And the chatbots rank really highly
00:28:05.120 | and the physicians are kind of like, you know,
00:28:06.680 | a bell curve, as you might imagine.
00:28:08.440 | But then the empathetic voting,
00:28:11.080 | most physicians are rated not empathetic
00:28:13.600 | or slightly empathetic versus all the model responses
00:28:16.720 | are rated very empathetic and empathetic at worst.
00:28:20.840 | You know, most people might look at it
00:28:22.320 | and not really get much from it,
00:28:23.680 | but obviously it resonated with you.
00:28:25.320 | Can you run people through like some of the choices
00:28:27.800 | you make in post-training to like optimize
00:28:29.920 | for one of the two and getting the best responses?
00:28:33.080 | - I think the tweet was about like the intuition
00:28:35.720 | of why reinforcement learning with human feedback works.
00:28:39.160 | When we started Llama 2,
00:28:41.680 | I had like this budget of annotations in millions of dollars
00:28:44.680 | and okay, what to do?
00:28:46.760 | I'm responsible for that.
00:28:47.680 | I'm accountable for a model at the end
00:28:49.200 | that can follow instructions
00:28:50.520 | and compete with GPT 3.5 at the time.
00:28:53.440 | What to do?
00:28:54.560 | You can annotate supervised fine-tuning data,
00:28:56.840 | which means a human creates a prompt
00:28:59.560 | and also writes themselves the answer expected from the model.
00:29:04.560 | So then you train on that in a supervised manner,
00:29:09.240 | but that's very classic and standard
00:29:11.960 | fine-tuning in machine learning.
00:29:14.480 | The other thing is reinforcement learning
00:29:16.320 | with human feedback where the annotators type a prompt,
00:29:18.880 | but this time you sample two different answers
00:29:20.840 | from your model and you ask the annotator
00:29:22.680 | which one he prefers.
00:29:24.320 | And then you will train on the preference, basically,
00:29:26.400 | to simplify.
00:29:27.920 | When you say you'll train on preferences over model outputs,
00:29:30.160 | that seems very weird and not really robust,
00:29:33.880 | training on synthetic data generated by the model.
00:29:36.440 | So I was like, let's annotate 100,000 or so
00:29:38.680 | of supervised fine-tuning data.
00:29:40.680 | And let's annotate a bit of preference data to do RLHF,
00:29:42.960 | because everyone is doing it.
00:29:44.480 | And we had this human evaluation
00:29:46.840 | a few weeks into the Llama 2 project,
00:29:49.160 | where our model was already better
00:29:52.560 | than the annotation from the humans.
00:29:55.160 | So you'd get a prompt,
00:29:56.440 | you check what the human would have annotated as an answer.
00:29:59.400 | You check what the model generates.
00:30:01.120 | And most of the time, the model was better.
00:30:04.200 | I was like, oh, maybe the annotators are pretty bad.
00:30:06.480 | Let's look at that.
00:30:07.920 | And no, the model was pretty good.
00:30:10.640 | And so I understood the intuition behind RLHF.
00:30:13.440 | Those models are already super good at some tasks.
00:30:15.960 | And with RLHF, then what you have is,
00:30:19.240 | imagine a distribution, a Gaussian distribution,
00:30:21.840 | which was basically the tweets.
00:30:23.840 | And you have on the left, bad outputs
00:30:26.920 | and on the right, good outputs.
00:30:28.560 | And the same with medical diagnostics from a doctor.
00:30:31.640 | You have good outputs on the right
00:30:33.080 | and the bad diagnostics on the left.
00:30:35.440 | But you have the distribution
00:30:37.080 | and when you collect all the diagnostics from doctors,
00:30:39.320 | hopefully it's mostly on the right.
00:30:41.000 | Most of the time, there are good diagnostics,
00:30:43.680 | but humans make mistakes, right?
00:30:46.000 | So there are bad diagnostics.
00:30:47.600 | On the left, you still have a few examples,
00:30:51.600 | which means the distribution's curve is not at zero there.
00:30:55.000 | And the same way for humans,
00:30:56.280 | they make mistakes when they annotate.
00:30:58.160 | And so, training on behavioral cloning to reflect humans,
00:31:01.880 | the model will also learn to make some mistakes,
00:31:03.920 | just like humans.
00:31:05.480 | And so you will have some bad outputs
00:31:07.040 | from the model from time to time, reflecting humans.
00:31:09.840 | And you cannot go beyond that
00:31:11.280 | if you train on human outputs.
00:31:13.000 | But now, if I ask a doctor to check a sample from my model,
00:31:17.720 | or a sample from two doctors,
00:31:19.160 | one diagnostic and another diagnostic,
00:31:21.160 | one is better than the other,
00:31:22.600 | it's easy for a doctor to say which one is better.
00:31:25.120 | The same way, if I sample from my model
00:31:26.960 | that learned a human distribution of answers,
00:31:29.440 | there's a bad one from time to time, like humans,
00:31:31.880 | but most of the time, good answers.
00:31:33.440 | And I ask a human to choose which one he prefers.
00:31:35.760 | Personally, I'm really bad at creating poems.
00:31:38.080 | The example I give a lot of time,
00:31:39.960 | try to write a haiku in three lines
00:31:42.160 | of about language models.
00:31:44.200 | I don't know about you,
00:31:45.240 | take like five seconds to think what you could come up with,
00:31:48.280 | I'm terrible.
00:31:49.400 | But yet, if I check two poems generated by a model
00:31:52.560 | or a human, I can tell which one I prefer.
00:31:54.480 | I'm good at discriminating.
00:31:56.040 | And because of that,
00:31:57.360 | you can have a model that flattens out the bad outputs
00:32:00.800 | and learns to shift only towards the best
00:32:03.040 | and better and better outputs.
00:32:04.600 | And you can even end up with superhuman abilities,
00:32:07.680 | since I'm bad at writing a poem,
00:32:09.960 | but I'm good at judging which one is better.
00:32:12.040 | So I can actually annotate data
00:32:13.760 | beyond my own skills at creating it.
00:32:16.800 | That's the magic of RLHF.
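(To make "train on preferences, not demonstrations" concrete, here is a minimal sketch of the standard pairwise Bradley-Terry reward-model loss from the RLHF literature; this is the generic formulation, not necessarily Meta's exact implementation.)

```python
# Minimal sketch of the pairwise (Bradley-Terry) reward-model loss used in the RLHF
# literature: instead of cloning human-written answers, a model is trained so that
# the preferred ("chosen") answer scores higher than the rejected one.
import torch
import torch.nn.functional as F

def preference_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: rewards would come from a reward model scoring (prompt, answer) pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])     # scores for the answers annotators preferred
rejected = torch.tensor([0.7, 0.5, -0.1])  # scores for the answers they rejected
loss = preference_loss(chosen, rejected)
print(f"pairwise preference loss: {loss.item():.3f}")
```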
00:32:18.760 | - Yeah, we have one episode, RLHF 201,
00:32:21.560 | with Nathan Lambert from the Allen Institute,
00:32:24.480 | who was at Hugging Face leading RLHF before.
00:32:26.960 | And he mentioned one of the things that makes RLHF work
00:32:29.600 | is that humans are not maybe great
00:32:31.880 | at creating a lot of things,
00:32:33.280 | but they're usually very good at giving an opinion
00:32:35.520 | on which one to they prefer.
00:32:37.720 | So they're able to actually annotate data
00:32:39.440 | of things they would never create from scratch.
00:32:42.160 | One question actually that he asked me to ask you,
00:32:44.640 | how much in post-training you attribute improvement
00:32:47.440 | to the RLHF side versus the instruction fine-tuning side,
00:32:51.720 | and maybe how you think about prioritizing the two
00:32:54.120 | and what areas they impact the most?
00:32:56.280 | - You mean between supervised fine-tuning,
00:32:58.400 | like supervised fine-tuning annotation
00:33:00.240 | and preference annotation?
00:33:01.760 | - Yeah.
00:33:02.600 | - So 100% to RLHF.
00:33:04.760 | In fact, that's quite interesting.
00:33:06.520 | You start, for Llama 2, with a pre-trained model,
00:33:09.480 | and you have to get to an instruction model, a chat model.
00:33:13.240 | Otherwise, the model is just like finishing sentences.
00:33:16.360 | So you need that to start RLHF.
00:33:18.000 | So we had to annotate like 10,000 examples.
00:33:20.440 | What did we do for Llama 3?
00:33:22.280 | You start with a new pre-trained model,
00:33:23.960 | and then you want, before starting the RLHF,
00:33:26.000 | to have now a chat model.
00:33:28.040 | That is not too bad.
00:33:29.080 | The option one was, let's do human annotation again,
00:33:32.640 | like SFT stage.
00:33:34.240 | But in fact, by the principle I said before,
00:33:37.360 | the annotation would actually be worse than Llama 2.
00:33:39.760 | So what we did is that we generated all the data
00:33:41.880 | on the prompts with Llama 2,
00:33:43.400 | and we applied basically the last round of Llama 2 we had
00:33:46.520 | to kick off and start Llama 3 post-training.
00:33:49.160 | So Llama 3 post-training doesn't have any human-
00:33:51.880 | written answers there, basically, almost.
00:33:54.120 | It's just leveraging pure synthetic data from Llama 2.
00:33:57.920 | - Do you have an intuition
00:33:58.960 | on which areas work better for which?
00:34:01.480 | For example, you mentioned the physicians are experts.
00:34:03.760 | What about maybe code, or, yeah,
00:34:06.000 | you also have multimodal work going on,
00:34:07.320 | like image generation,
00:34:09.080 | or does this apply to any modality, any subject?
00:34:12.240 | - That's an open research question.
00:34:13.840 | The intuition in general is like, for instance,
00:34:15.800 | for code, because this is factual,
00:34:17.720 | you can check if the code is correct or not.
00:34:19.720 | RLHF is not the way to go.
00:34:21.120 | You'd prefer to do supervised fine-tuning,
00:34:23.320 | with a human writing the code.
00:34:24.880 | But in fact, because humans make mistakes,
00:34:26.640 | because actually even in code
00:34:28.520 | there are some preferences that they might have,
00:34:30.640 | and maybe for some other reasons that we don't know,
00:34:33.720 | RLHF is so much more scalable.
00:34:35.680 | It costs less, it's easier, and it leads in general
00:34:38.360 | to just better performance.
00:34:40.040 | And maybe we can come up with a compromise.
00:34:42.320 | We actually suggested teacher-forcing in Llama 3,
00:34:46.280 | a new method that kind of fills the gap between,
00:34:48.960 | not teacher-forcing, sorry, teacher-critic.
00:34:52.000 | Teacher-forcing is a way to train the models.
00:34:53.960 | Teacher-critic is where it reconciles
00:34:56.240 | and unifies supervised fine-tuning and RLHF,
00:34:58.600 | so that when you do human preference
00:35:01.080 | and you have two outputs,
00:35:02.560 | but both are very bad, in code for instance,
00:35:05.720 | you will ask the human to edit the best answer
00:35:08.480 | to make it correct.
00:35:09.920 | So now you are doing SFT
00:35:11.760 | when both answers were really bad,
00:35:14.160 | so that you can get out of the local minimum of your model.
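(A rough sketch of how such a teacher-critic-style annotation flow could be wired up; the sampling and annotator callbacks are hypothetical placeholders, not Meta's actual tooling. Each prompt yields either a preference pair for RLHF, or an edited answer for SFT when both samples are judged too bad.)

```python
# Hypothetical sketch of a "teacher-critic"-style annotation flow: every prompt yields
# either a preference pair (for RLHF) or a human-edited answer (for SFT) when both
# model samples are too bad. `model.sample` and the annotator callbacks are placeholders.
from dataclasses import dataclass, field

@dataclass
class AnnotationBatch:
    preference_pairs: list = field(default_factory=list)  # (prompt, chosen, rejected)
    sft_examples: list = field(default_factory=list)      # (prompt, edited_answer)

def annotate(prompts, model, judge_preference, judge_quality, edit_answer) -> AnnotationBatch:
    batch = AnnotationBatch()
    for prompt in prompts:
        a, b = model.sample(prompt), model.sample(prompt)   # two candidate answers
        chosen, rejected = judge_preference(prompt, a, b)   # annotator picks the better one
        if judge_quality(prompt, chosen):                   # good enough: keep as preference data
            batch.preference_pairs.append((prompt, chosen, rejected))
        else:                                               # both bad: annotator edits the winner
            batch.sft_examples.append((prompt, edit_answer(prompt, chosen)))
    return batch
```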
00:35:17.560 | - I think this is like super promising,
00:35:19.440 | and it seems like there's just,
00:35:21.280 | well, do you have an idea?
00:35:23.200 | You know, you started with this question
00:35:24.440 | of how much scale you need.
00:35:26.160 | Do you now have a better idea?
00:35:28.160 | - No, what we know is it's not plateauing yet.
00:35:31.120 | - It's not plateauing yet, yeah.
00:35:32.120 | So just infinite amounts more. Well, you know,
00:35:35.160 | Scale AI and all the annotation providers
00:35:37.600 | are very happy to hear that.
00:35:39.040 | And so you mentioned at the start of the conversation
00:35:43.720 | about the AlphaGo moment,
00:35:45.200 | and I feel like this is very interesting to reflect on,
00:35:47.960 | right, like we're basically saying that,
00:35:50.520 | I think that one of the lessons from AlphaGo
00:35:52.600 | is that people thought that human interest in Go
00:35:55.160 | would be diminished
00:35:57.880 | because computers are better than humans,
00:36:00.040 | but then we have this sort of centaur model
00:36:01.840 | where like humans and computers are actually doing better
00:36:04.840 | than either humans or computers would be alone.
00:36:08.120 | And I think we're seeing that with this,
00:36:09.880 | what you were talking about, this RLHF improvement, right,
00:36:12.520 | that we're kind of building human preference into the model
00:36:15.360 | and the blending of the human preference
00:36:17.560 | and the model capability is actually doing better
00:36:20.120 | than we could on our own.
00:36:21.800 | I just think it's pretty fascinating.
00:36:23.680 | - It is fascinating.
00:36:25.120 | - The other thing is RLHF came from the alignment community
00:36:28.840 | and I think there's a lot of perception
00:36:30.880 | that maybe it's due to safety concerns,
00:36:33.240 | but I feel like it's like really over the past
00:36:35.560 | like two, three years expanded to just,
00:36:38.280 | this produces a better model period,
00:36:40.440 | even if you don't really,
00:36:41.800 | are not that concerned about existential risk.
00:36:43.960 | I always feel like it's so interesting to see this,
00:36:47.520 | like people who take alignment super seriously,
00:36:50.080 | they're the first to consider super alignment.
00:36:52.440 | And now we're considered like,
00:36:54.080 | I'm almost thinking about this as like super quality,
00:36:56.520 | that we are training models
00:36:58.280 | that are higher quality than humans.
00:37:00.400 | And it's not really about alignment so much as like,
00:37:03.480 | we now see that this is actually possible.
00:37:06.280 | - Yeah.
00:37:07.120 | - And it's not even for alignment purposes.
00:37:08.760 | We just think it's like better at reasoning,
00:37:10.480 | better at knowledge, better at everything.
00:37:11.960 | - Well, I don't know how much better yet it is on those,
00:37:14.400 | but clearly it's super human on some writing skills
00:37:18.040 | and it's super useful.
00:37:19.160 | I think that's great, to be honest.
00:37:20.760 | - Yeah, perhaps we can transition to evals.
00:37:23.520 | We've had some questions about the 400B details
00:37:27.400 | that we want to disclose.
00:37:28.600 | By the time this podcast comes out,
00:37:30.400 | we'll have disclosed them.
00:37:31.720 | Yeah, I think last time you disclosed like the evals
00:37:35.240 | while you were still training,
00:37:37.040 | what should people know about the high level headlines
00:37:42.560 | for the new Llama 3?
00:37:42.560 | - At a high level,
00:37:43.560 | it's the best open source model ever.
00:37:46.600 | It's better than GPT-4.
00:37:49.440 | I mean, what version?
00:37:50.720 | But by far, compared to the version originally released,
00:37:54.800 | even now, I think there's maybe the latest Claude 3.5
00:37:59.280 | and GPT-4o that are outperforming it.
00:38:01.640 | And that's it, period.
00:38:03.040 | So for the 405B, that's a flagship,
00:38:06.360 | that's a pretty good model.
00:38:07.800 | Not yet the number one.
00:38:08.920 | We still have a journey to get there.
00:38:11.200 | For the 70B and 8B,
00:38:13.520 | they are like world-class models of their size
00:38:16.120 | among general models.
00:38:17.360 | - And are the benchmark numbers
00:38:19.120 | from the initial checkpoint still right?
00:38:21.440 | So the April 15 checkpoint,
00:38:24.120 | MMLU on Instruct is like 86,
00:38:27.280 | GPQA, 48,
00:38:28.680 | HumanEval, 84,
00:38:30.160 | GSM8K, 94,
00:38:31.840 | MATH, 57.8.
00:38:33.640 | Is this still roughly the same performance?
00:38:35.640 | Or, you know, I haven't seen the numbers yet either.
00:38:38.160 | We're just breaking the news right now, so.
00:38:40.360 | - No, it's roughly that.
00:38:42.240 | - Awesome.
00:38:43.080 | So talking about evals,
00:38:44.680 | we just had an episode with Clémentine from Hugging Face
00:38:47.320 | about leaderboards and arenas and evals and benchmarks
00:38:51.000 | and all of that.
00:38:52.120 | How do you think about evals during the training process?
00:38:55.840 | And then when the handoff happens,
00:38:57.760 | do you already know exactly what you want to improve?
00:39:00.600 | And I know that, for example,
00:39:01.920 | to improve like maybe an arena score,
00:39:03.480 | you need different than like an MMLU score.
00:39:05.840 | How do you think about prioritizing
00:39:07.400 | the post-training improvement based on benchmarks?
00:39:10.040 | - That's a super hard and good question.
00:39:13.320 | There's no good answer.
00:39:14.160 | I mean, evals is an open research problem,
00:39:16.880 | like in particular when you're trying to tackle
00:39:19.040 | so many capabilities.
00:39:20.520 | And, you know, it's also like,
00:39:22.440 | as soon as you're trying
00:39:23.760 | to push numbers on a benchmark,
00:39:26.960 | it stops being a good benchmark,
00:39:28.480 | because then you don't know if you're overfitting it
00:39:30.440 | and whether it will transfer to similar capabilities.
00:39:33.560 | So evaluation for language models,
00:39:37.320 | in particular on post-training,
00:39:39.080 | is a very hard problem.
00:39:42.040 | We tackle that by playing with different methods
00:39:45.240 | like reward models, model-as-a-judge evaluation,
00:39:49.200 | having a diversity of prompts,
00:39:51.920 | diversity of benchmarks as well
00:39:53.240 | for a lot of different capabilities.
00:39:55.200 | That limits the possibility of hacking them, of course.
00:39:58.280 | We do also a lot of human evaluation.
00:40:00.960 | I also do a lot of model testing, quality analysis,
00:40:04.200 | like testing some prompts myself.
00:40:06.200 | I feel it was much easier during Lama 2
00:40:09.640 | when the model was worse than it is today.
00:40:12.800 | Now the model is getting so good
00:40:15.760 | that it's hard to get to some prompts
00:40:18.200 | to break them and to compare models
00:40:20.040 | and see the edge cases.
00:40:21.760 | So it's getting harder.
00:40:23.160 | And a great way also to compare models is, you know,
00:40:25.600 | through the different rounds we have done for RLHF.
00:40:29.040 | Every time we upload a new model,
00:40:30.480 | for all the annotation we are doing,
00:40:32.520 | we have the win rate between the previous model
00:40:34.160 | and the new model by just sampling,
00:40:36.920 | for every prompt we annotate,
00:40:38.680 | sample A with the old model
00:40:41.720 | and sample B with the new model.
00:40:43.120 | And so we can calculate automatically a win rate.
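A minimal sketch of how such a win rate can be computed from pairwise preference annotations. The annotation format below is a hypothetical illustration, not Meta's internal tooling:

```python
from collections import Counter

def win_rate(annotations):
    """Fraction of annotations where the new model's sample is preferred.

    `annotations` is a list of dicts like {"prompt": ..., "preferred": "new" | "old" | "tie"}
    (a hypothetical format); a tie counts as half a win for each side.
    """
    counts = Counter(a["preferred"] for a in annotations)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["new"] + 0.5 * counts["tie"]) / total

# Example: six prompts annotated during one RLHF round.
example = [
    {"prompt": "p1", "preferred": "new"},
    {"prompt": "p2", "preferred": "new"},
    {"prompt": "p3", "preferred": "old"},
    {"prompt": "p4", "preferred": "tie"},
    {"prompt": "p5", "preferred": "new"},
    {"prompt": "p6", "preferred": "old"},
]
print(f"win rate of the new model: {win_rate(example):.2f}")  # 0.58
```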
00:40:45.360 | - Interesting.
00:40:46.200 | What are areas that you had to work the hardest
00:40:48.880 | to catch up to like the private models?
00:40:51.640 | Maybe like there's, you know,
00:40:52.880 | not as good public data or whatnot,
00:40:54.800 | or is performance improvement just kind of even
00:40:57.640 | across the spectrum?
00:40:59.960 | - Honestly, all of them.
00:41:01.720 | We were behind all of them, between Llama 2 and GPT-4.
00:41:06.720 | I mean, it's different challenges every time,
00:41:10.440 | like being good at code or reasoning
00:41:12.440 | is something we didn't do at Lama 2,
00:41:14.160 | so we had to build everything from scratch.
00:41:16.120 | Improving on helpfulness,
00:41:18.120 | which is one of the main dimensions
00:41:20.280 | that people look at, I think, in the Arena,
00:41:22.600 | which is, by the way, a very interesting evaluation.
00:41:25.360 | Because when we did the preview,
00:41:27.120 | and I don't know yet what will be the results
00:41:29.000 | for this new Lama 3,
00:41:30.400 | we ended very high in this blind test leaderboard.
00:41:34.640 | And to be honest, I didn't expect that.
00:41:37.760 | I knew we had good results internally,
00:41:39.720 | but how that will transfer to perception from the community,
00:41:43.640 | people like using it in practice
00:41:45.280 | and comparing it to the other models,
00:41:47.280 | I didn't expect that positive feedback,
00:41:50.440 | that high ELO score on this benchmark.
00:41:54.000 | It doesn't say like everything.
00:41:55.920 | As I said before, which is also interesting,
00:41:57.840 | because it's a community that judge the prompts
00:42:00.440 | and create the prompts and judge the answers.
00:42:03.040 | We are limited, we are not like good to do that.
00:42:05.720 | And so it gives you a very good indicator
00:42:07.920 | of how good, helpful,
00:42:09.360 | how on the main core of the distribution,
00:42:12.760 | simple prompts about the tone of the model
00:42:15.000 | compared to the others,
00:42:16.080 | but for much more complex problems,
00:42:17.480 | much more intelligence,
00:42:19.080 | reasoning coding of complex stuff,
00:42:21.800 | it doesn't tell the full story.
00:42:24.360 | You know, like while we had the 70B preview
00:42:27.240 | at the level of GPT-4,
00:42:28.920 | even better at the time.
00:42:30.760 | I think it was partly true,
00:42:32.160 | but clearly we were not at like GPT-4 level
00:42:34.200 | in code or reasoning.
00:42:35.920 | We are now.
00:42:36.800 | - There's some conversation about like the math score.
00:42:40.360 | Apparently like the next GPT, GPT-next or whatever,
00:42:43.040 | is in the region of 90, which is a big, big jump
00:42:45.640 | from the current state of the art.
00:42:48.320 | It will be interesting.
00:42:49.400 | One of our previous guests,
00:42:50.960 | rounding out the topics on just potential models,
00:42:53.440 | areas of development and evals,
00:42:55.120 | Clementine is looking for a confidence estimation
00:42:58.480 | or uncertainty benchmark.
00:43:00.840 | One of our previous guests, Brian Bischoff,
00:43:02.600 | is also asking about like,
00:43:04.320 | how do we think about evals for practical things
00:43:07.160 | like confidence estimation, structured output,
00:43:10.200 | stuff like that.
00:43:11.360 | - Yeah, I think we lack actually of such evaluations.
00:43:14.720 | One number I was asking the team, like two days ago,
00:43:17.360 | to report at some point is,
00:43:19.480 | okay, we have this accuracy on MMLU,
00:43:22.160 | on whatever, on MATH and GSM8K.
00:43:25.880 | What if we change a bit the prompt
00:43:27.400 | and instead of telling the model you have this question,
00:43:30.000 | you have to answer A, B, C, or D.
00:43:32.000 | What if we tell the model you have to answer A, B, C, or D,
00:43:35.120 | or you don't know.
00:43:36.480 | And maybe the accuracy will be a bit lower,
00:43:39.880 | but I'm curious to see if between some models
00:43:41.880 | we have different calibrations,
00:43:43.400 | where maybe model A have 50% correct,
00:43:47.000 | model B has 50% correct,
00:43:48.960 | but model A answered 100% of the questions.
00:43:52.600 | So 50% are not correct.
00:43:54.560 | Model B actually answered only 60% of the questions.
00:43:57.400 | So 40% of the time, it said, I don't know.
00:43:59.960 | I prefer model B.
00:44:01.160 | And we are not like reflecting that in evaluations.
00:44:03.960 | - I think this is very relevant
00:44:05.760 | for post-training in particular,
00:44:07.120 | because it seems that the general consensus
00:44:09.680 | is that base models are more calibrated
00:44:12.640 | than post-train models, right?
00:44:14.480 | Something like that.
00:44:15.400 | - Exactly.
00:44:16.240 | - That seems to be the research from OpenAI as well.
00:44:18.160 | I don't know the degree of this,
00:44:20.160 | and maybe we can invert it, right?
00:44:21.640 | Maybe post-training can help to increase calibration
00:44:24.320 | rather than decrease it.
00:44:25.760 | I feel like this is a little bit
00:44:27.840 | of being too similar to humans,
00:44:30.400 | because humans are not calibrated very well.
00:44:34.120 | - Yeah, and that's the goal of post-training,
00:44:35.600 | I think, to make models more calibrated,
00:44:38.160 | to not be biased toward answering A, B, C, or D
00:44:41.280 | as often as possible,
00:44:42.960 | to follow the uniform distribution.
00:44:44.960 | - And on the structured output tool calling side,
00:44:47.520 | do you think that it's not an explicit part of the evals?
00:44:51.680 | Obviously, you worked on Toolformer
00:44:53.600 | and the language augmentation.
00:44:56.200 | Do you encourage the open-source community
00:44:58.160 | to fine-tune Lama3 to do tool calling,
00:45:01.720 | or do you want to just have that in the model from day one?
00:45:04.960 | - We have that from day one.
00:45:06.480 | Good news for the community.
00:45:07.800 | We are state-of-the-art there.
00:45:09.720 | I think the model will be pretty good at that.
00:45:12.920 | We have a lot of gems about tools in the paper,
00:45:16.160 | but the model is fine-tuned to do tool usage,
00:45:18.960 | to zero-shot function calling.
00:45:21.720 | There are some system prompts,
00:45:22.840 | so if you tell the model it can use search
00:45:26.000 | and image generation, it can do a lot of stuff,
00:45:28.480 | like code execution as well,
00:45:29.960 | even in a multi-message way,
00:45:32.600 | so almost multi-step agents,
00:45:36.000 | which kind of sparks our agent work.
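For intuition, here is a generic sketch of the orchestration side of zero-shot function calling: detect a tool call in the model's output, dispatch it, and feed the result back as a new message so the model can keep going over multiple steps. The JSON-in-text convention and the stub tools are assumptions for illustration; the actual Llama 3 tool-call format is described in the paper and model card:

```python
import json

# Stub tools standing in for real search / code execution backends.
TOOLS = {
    "search": lambda query: f"(stub) top results for {query!r}",
    "python": lambda code: f"(stub) executed: {code!r}",
}

def handle_model_output(text):
    """If the model emitted a tool call like {"tool": ..., "arguments": {...}},
    dispatch it and return a tool message to append to the conversation;
    otherwise treat the text as the final assistant answer."""
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        call = None
    if not isinstance(call, dict) or call.get("tool") not in TOOLS:
        return {"role": "assistant", "final": True, "content": text}
    result = TOOLS[call["tool"]](**call.get("arguments", {}))
    # Feeding this back lets the model continue in a multi-message, multi-step way.
    return {"role": "tool", "final": False, "content": result}

print(handle_model_output('{"tool": "search", "arguments": {"query": "Llama 3 context length"}}'))
print(handle_model_output("The context length is 8K tokens."))
```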
00:45:38.040 | - Okay, you talked about agents,
00:45:39.160 | so I guess we should probably mention
00:45:40.840 | the work on agent stuff.
00:45:42.200 | And you also, in our pre-conversation,
00:45:44.480 | mentioned that you're already starting work on Lama4.
00:45:46.840 | What does agents have to do with Lama4?
00:45:48.840 | How does your work on Gaia inform all this work?
00:45:51.360 | - Yeah, so we published one year ago,
00:45:53.280 | Gaia General Assistant Benchmark.
00:45:55.920 | That followed a direction I really like pursuing.
00:45:59.320 | I mean, everyone passionate about AI
00:46:01.280 | and trying to build Jarvis will go there.
00:46:03.760 | So I did Toolformer and the survey on augmented models.
00:46:07.880 | In fact, reflecting back, I was,
00:46:10.120 | okay, we have Galactica, we have Lama1,
00:46:14.360 | we have Toolformer,
00:46:15.760 | and there's like GPT-3.5 at the time.
00:46:18.800 | If you don't have a good instruct model
00:46:21.240 | to follow instructions,
00:46:22.800 | the extension and the future of Toolformer is limited.
00:46:26.520 | So we need to work on that, and we did Lama2,
00:46:28.960 | and then now Lama3.
00:46:30.400 | And it's very interesting.
00:46:31.760 | On General Assistant Benchmark, so Gaia,
00:46:34.480 | agents powered by language models
00:46:36.680 | perform at zero with GPT-3.5
00:46:38.960 | and at something very significant,
00:46:41.520 | like 30, 40, 60% with GPT-4.
00:46:45.600 | So there's a gap of intelligence here.
00:46:47.760 | And I think this gap of intelligence,
00:46:49.280 | this threshold that you pass
00:46:50.880 | in terms of zero-shot function calling,
00:46:53.480 | following complex instruction
00:46:54.920 | that can span over a page of constraints,
00:46:58.560 | those things that make today's agents,
00:47:02.040 | with ReAct loops, pre-planning,
00:47:04.520 | multi-step reasoning, function calling,
00:47:07.640 | work in practice, is this gap of intelligence.
00:47:10.760 | So now that we have Lama3,
00:47:12.440 | I'll be back to agents.
00:47:14.080 | I expect some incremental and significant progress
00:47:16.600 | on pre-planning, post-planning,
00:47:18.360 | but I'm really hopeful that we can gain
00:47:21.120 | some order of magnitude of scaling
00:47:23.280 | by interconnecting well models into agents
00:47:27.840 | as a more complex system that can do planning,
00:47:30.320 | that can do backtracking,
00:47:32.640 | that can take actions,
00:47:35.240 | navigate the web, execute code.
00:47:37.440 | - Okay, there's a lot there.
00:47:39.680 | When you say integrating world models,
00:47:42.000 | is there anything from JEPA?
00:47:43.520 | Is that something that we're talking about
00:47:46.400 | or is that a different line of research?
00:47:48.560 | - No, not directly.
00:47:50.120 | That's the same goal, I would say,
00:47:52.760 | but JEPA is very, very fundamental research,
00:47:56.280 | which has some promising early results.
00:47:58.760 | And what I was looking right now
00:48:01.000 | on state-of-the-art results on Gaia,
00:48:02.960 | there's a leaderboard, by the way,
00:48:04.720 | you mentioned Clementine before,
00:48:06.200 | she contributed to Gaia as well,
00:48:08.080 | and Hugging Face put a leaderboard there on their website.
00:48:11.960 | There's some state-of-the-art results.
00:48:14.520 | What is interesting is like GPT-4 alone has 0%,
00:48:19.360 | or like 5%, I think, on level one,
00:48:22.000 | and there are three levels of difficulty.
00:48:24.040 | But OS-Copilot, and AutoGen from Microsoft,
00:48:28.440 | and recently Hugging Face agents,
00:48:30.600 | obtained up to 60% on level one.
00:48:33.920 | So connecting an LLM to an agent
00:48:36.080 | that can do all those things,
00:48:37.720 | moves new capabilities much further forward.
00:48:40.960 | This is kind of a breakthrough.
00:48:42.160 | And those models are purely based
00:48:44.920 | on instruction tuning models, following instructions,
00:48:48.520 | where like you have an orchestrator,
00:48:50.040 | and you say to your LLM, okay, this is your task,
00:48:53.240 | you have access to these tools, you can navigate the web,
00:48:56.200 | can you do a plan of what you should do?
00:48:58.200 | And then, okay, that's the plan.
00:49:00.280 | Now, execute the first step.
00:49:02.360 | Did you manage to succeed for the first step?
00:49:05.080 | Or do you want to rethink your plan
00:49:08.000 | because you ran into a dilemma?
00:49:10.040 | And you have kind of all this orchestration
00:49:12.600 | by system prompting, instruction following,
00:49:15.960 | and just that, which is quite suboptimal,
00:49:18.880 | and probably later you need to go to latent space
00:49:21.720 | and more JEPA-style approaches.
00:49:22.920 | But just that is getting us
00:49:25.040 | to some really impressive results already.
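A minimal sketch of that plan / execute / revise loop driven purely by prompting. The model class here is a canned stand-in for any instruction-tuned chat API, and the "FINAL ANSWER" convention is an assumption chosen for the example, not a specific framework:

```python
class FakeInstructModel:
    """Stands in for an instruction-tuned model API; returns canned replies so the loop runs."""
    def __init__(self):
        self.calls = 0

    def chat(self, messages):
        self.calls += 1
        if self.calls == 1:
            return "Plan: 1) search the web  2) extract the key facts  3) write the answer"
        if self.calls < 4:
            return f"(executed step {self.calls - 1}, tool call succeeded)"
        return "FINAL ANSWER: here is the result of the task."

def run_agent(model, task, max_steps=8):
    messages = [
        {"role": "system", "content": "You are an agent with web browsing and code execution. "
                                      "First write a numbered plan, then execute it step by step."},
        {"role": "user", "content": task},
    ]
    plan = model.chat(messages)                                # ask for a plan first
    messages.append({"role": "assistant", "content": plan})
    for step in range(1, max_steps + 1):
        messages.append({"role": "user",
                         "content": f"Execute step {step}. If the last step failed or you are "
                                    "stuck, revise the plan before continuing."})
        action = model.chat(messages)                          # next action or final answer
        messages.append({"role": "assistant", "content": action})
        if action.startswith("FINAL ANSWER:"):
            return action
    return "stopped after max_steps without a final answer"

print(run_agent(FakeInstructModel(), "Find the Gaia level-1 state of the art and summarize it."))
```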
00:49:28.080 | - And do you see the planning and review
00:49:31.480 | to always be needed in the future?
00:49:33.120 | This is kind of like Andrej Karpathy's idea
00:49:34.840 | of like more tokens equal more thinking.
00:49:36.960 | So like the more you're having it write tokens
00:49:39.440 | and like think about the outcome,
00:49:41.080 | the better the result you're probably gonna get.
00:49:43.680 | Do you think that's always gonna be the case?
00:49:45.200 | Or that in the future, like the model,
00:49:47.680 | you can just say this is the task
00:49:49.080 | and then I'll just return the answer directly
00:49:51.200 | and do all of that in the latent space, so to speak?
00:49:54.600 | - Right.
00:49:55.440 | I think in the future, it should be,
00:49:57.360 | it should hopefully go more toward, this is a task
00:50:00.280 | and I return the answer.
00:50:01.640 | But we need to teach that to the model, to train that,
00:50:03.960 | which is far from now.
00:50:05.400 | A very medium- to long-term direction
00:50:08.240 | that could be really relevant here
00:50:09.600 | is thinking in latent space.
00:50:12.200 | I know some early works are doing that.
00:50:14.280 | And that's a way probably to move to.
00:50:17.680 | First you think,
00:50:18.960 | and then you don't have to write all the tokens.
00:50:21.160 | Like it's in your head.
00:50:22.160 | It doesn't have to be as constrained
00:50:24.240 | as plain text from an LLM.
00:50:26.000 | And once you have done your thoughts,
00:50:27.840 | you can just write the final answer or take an action.
00:50:30.520 | - Just a commentary on that.
00:50:31.600 | Anthropic actually cheats at this right now.
00:50:34.160 | If you look at the system prompt in Claude Artifacts,
00:50:37.320 | it actually has a thinking section
00:50:38.760 | that is explicitly removed from the output,
00:50:42.520 | which is, I mean, they're still spending the tokens,
00:50:45.000 | but instead of training for it, at the prompting level,
00:50:49.640 | you can simulate this.
00:50:51.120 | And then at ICLR, there was the pause token,
00:50:53.880 | the backtrack token.
00:50:54.880 | I feel like all these are token level stopgap measures.
00:50:59.520 | I feel like it's still not the final form.
00:51:01.560 | Like we still need to have at the architecture level,
00:51:04.840 | some kind of variable inference length thing
00:51:08.680 | that lets you actually think in latent space
00:51:10.320 | like you're talking about.
00:51:11.160 | I don't know if there's any papers
00:51:12.560 | that you're thinking about.
00:51:14.040 | - No, but that's interesting
00:51:15.080 | because that's what we said at the beginning
00:51:16.720 | of the discussion.
00:51:18.880 | If you remember, like we are lacking flexibility
00:51:21.640 | in the pre-training architecture, transformers,
00:51:24.280 | where we spend the same amount of compute per token.
00:51:27.680 | And so because of that, how can you like mitigate this
00:51:31.280 | by generating more tokens?
00:51:32.920 | So more thoughts, more compute,
00:51:35.080 | because you have only access to this dimension.
00:51:37.320 | Ideally, you want an architecture that will
00:51:39.880 | naturally enable this to emerge, basically.
00:51:43.000 | - Any papers come to mind there
00:51:44.240 | that you would recommend people read,
00:51:45.440 | or is this like completely new science
00:51:47.400 | that we have to do?
00:51:50.000 | - No, I mean, it's early science.
00:51:52.400 | I don't know any work that managed to get there.
00:51:54.960 | I know, like, for instance, the Universal Transformer
00:51:58.480 | had this idea of a number n,
00:52:00.600 | where you can compute on the layer n times,
00:52:05.120 | with n being decided by the architecture itself
00:52:08.080 | with respect to the complexity of the token.
00:52:09.960 | I think there's a paper from DeepMind
00:52:11.840 | on mixture of experts with like a skip layer.
00:52:15.040 | Mixture of, is it this one?
00:52:17.160 | - Mixture of depths.
00:52:18.200 | - I don't, I'm not sure it's this one, maybe.
00:52:20.240 | But like, basically the idea was that
00:52:22.160 | with a mixture of expert,
00:52:23.120 | you have an expert that is an identity matrix
00:52:25.480 | that you can skip.
00:52:26.760 | And so like you can,
00:52:28.680 | but you know, it's early works, very preliminary works.
00:52:32.120 | Like for instance, I haven't yet seen a lot of work
00:52:33.840 | putting the compute of generating a token into the loss.
00:52:38.000 | That's gonna be interesting when we start to do that.
00:52:40.240 | - I know we're getting up on time,
00:52:42.160 | but we had just a few more questions.
00:52:44.480 | We definitely want to ask you.
00:52:46.000 | So as you think about,
00:52:47.000 | there were reports that Llama 4
00:52:48.640 | started training in June.
00:52:50.400 | If you think about the evolution of the models,
00:52:52.000 | I think up until Llama3,
00:52:53.680 | you know, with MetaAI and some of these things,
00:52:55.320 | I'm like, it makes sense
00:52:56.720 | that they want to build their own models
00:52:57.920 | and their multimodal models.
00:52:59.120 | Sounds like Llama4, maybe a lot of the focus
00:53:02.000 | will also be a more agentic behavior and have all of this.
00:53:05.000 | I'm curious, like at what point it's like, okay,
00:53:07.240 | this is a research direction that we still want to take,
00:53:09.400 | even though, you know,
00:53:10.400 | it doesn't fit right into the product.
00:53:11.960 | Like what's that discussion internally
00:53:13.760 | about what to focus on as you keep scaling these models?
00:53:16.800 | - Yeah, I think it's a balance, you know, between,
00:53:19.560 | well, we want to be number one.
00:53:21.520 | Mark wants to be number one there.
00:53:23.640 | And there's this understanding also that, you know,
00:53:26.360 | this is a critical technology in the future.
00:53:29.520 | And even if, nowadays, that research
00:53:32.640 | is not like directly intersecting product,
00:53:35.840 | we don't want to be late in the game as we were in the past.
00:53:39.160 | So that's the first thing.
00:53:40.840 | The second thing is,
00:53:41.680 | and we think that this technology will change the world.
00:53:44.480 | We want to work towards AGI and AGI will change the world.
00:53:49.040 | And if Meta develop an AGI,
00:53:51.840 | it will probably intersect pretty easily the products.
00:53:55.160 | Now, the thing is, with that in mind,
00:53:58.200 | we have to balance with product needs.
00:54:00.400 | And there's always this ongoing discussion
00:54:02.240 | and this balance to find,
00:54:03.800 | between a flagship model
00:54:05.840 | and maybe a model that will be
00:54:07.880 | more adapted to product needs.
00:54:10.120 | And it doesn't have to be decorrelated.
00:54:12.560 | As I said before,
00:54:13.400 | like you can also leverage the big models
00:54:14.880 | to distill some capabilities into a smaller one
00:54:18.320 | that will maybe be more suited to the product.
00:54:20.880 | There's always this back and forth.
00:54:22.720 | There's also the fact that the product
00:54:24.960 | kind of feeds ideas back to the research,
00:54:26.920 | evaluations that are grounded in actual use cases,
00:54:29.880 | that we can also measure ourselves with respect to,
00:54:32.280 | is there some progress
00:54:33.640 | or is it just on an academic benchmark?
00:54:36.160 | - So one, before we transition off,
00:54:38.400 | I think there's the hidden side maybe of these LLMs
00:54:41.160 | that most people don't think about,
00:54:42.760 | which is the tokenizer and the vocab size,
00:54:46.120 | especially the size of it.
00:54:47.480 | So Llama 3 is a 128K-token vocab tokenizer.
00:54:52.240 | GPT-4 was 100K, 4o is 200K.
00:54:56.520 | How should people think about the impact that it has?
00:54:59.320 | So basically like, I mean, the TLDR is like in the vocab,
00:55:02.680 | you have this kind of like concepts represented as tokens.
00:55:05.280 | So usually the larger the vocab size,
00:55:07.560 | the more nuanced the model can be
00:55:09.880 | about thinking about different things.
00:55:11.680 | What are the scaling laws of those tokenizers?
00:55:14.120 | You know, is 128K kind of like very large
00:55:17.040 | and it doesn't really matter?
00:55:17.960 | Like, do you want to double it?
00:55:19.640 | Like any thoughts there would be great.
00:55:21.760 | - There's a lot of dimensions to take into account here.
00:55:23.800 | I think the first thing obvious to say is LLAMA3
00:55:27.160 | compared to LLAMA2 is multilingual,
00:55:29.000 | has multilingual capabilities.
00:55:30.560 | We worked on that.
00:55:31.920 | And so, because you have languages
00:55:33.680 | that are not just Latin languages like English,
00:55:35.960 | there's a lot of different characters.
00:55:38.440 | You want to include them to represent special words there,
00:55:42.400 | and so you need to have a bigger vocabulary size.
00:55:45.520 | That's the obvious thing,
00:55:47.240 | which is also probably why GPT-4o
00:55:49.440 | has a much bigger vocabulary
00:55:52.760 | as it's like naturally multilingual,
00:55:55.200 | multimodal in speech.
00:55:58.040 | So that's why we went from a 32K to a 128K vocabulary size.
00:56:03.040 | The interesting thing I think to discuss about tokenizer
00:56:06.200 | is about scaling laws related to that.
00:56:09.520 | If you increase your vocab size,
00:56:13.800 | well, you have a bigger matrix
00:56:15.480 | which takes longer to compute.
00:56:17.240 | It depends on the model size,
00:56:19.400 | but for a small model,
00:56:20.480 | it has a much bigger impact than a bigger model.
00:56:23.560 | So increasing that,
00:56:26.520 | said otherwise:
00:56:28.120 | the 128K vocabulary size
00:56:30.280 | is the same for the 8B, 70B, or 405B,
00:56:33.840 | but relatively, in percentage
00:56:35.360 | of the total number of weights,
00:56:37.040 | for the 8B it's much more than for the 405B;
00:56:39.880 | it weighs more compared to the total number of weights.
00:56:42.880 | So that has more impact in terms of training speed there.
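A back-of-the-envelope sketch of that relative cost, assuming untied input and output embeddings and the publicly reported Llama 3 hidden sizes (treat both as approximations rather than official accounting):

```python
VOCAB = 128_256  # Llama 3 vocabulary size

def embedding_params(hidden_dim, tied=False):
    # Input embedding, plus the output head if embeddings are untied.
    return VOCAB * hidden_dim * (1 if tied else 2)

for name, hidden, total in [("8B", 4096, 8e9), ("70B", 8192, 70e9), ("405B", 16384, 405e9)]:
    emb = embedding_params(hidden)
    print(f"{name}: ~{emb / 1e9:.2f}B embedding params, ~{emb / total:.1%} of all weights")
# The same 128K vocabulary is roughly 13% of the 8B model but only about 1% of the 405B,
# which is why changing the tokenizer costs relatively more training speed on small models.
```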
00:56:46.400 | But what is interesting is with a bigger vocabulary,
00:56:49.800 | for the same text,
00:56:51.520 | you have fewer tokens, right?
00:56:54.360 | And so you can train your model
00:56:56.560 | on the same amount of knowledge with fewer steps.
00:57:00.280 | So for the same compute,
00:57:02.120 | you can see more knowledge if you don't repeat epochs.
00:57:04.800 | That's one cool thing.
00:57:06.080 | The second thing is at inference time,
00:57:08.640 | you know that the context length is not measured
00:57:11.600 | in the size of the text, but in the number of tokens.
00:57:13.920 | And so you can compress more,
00:57:15.360 | so that now with a bigger tokenizer,
00:57:19.400 | 128K, more vocabulary,
00:57:21.800 | you can get to longer text
00:57:23.720 | for the same number of tokens.
00:57:26.520 | An 8K context basically, or 128K,
00:57:29.760 | with this tokenizer now means about 30% less to encode for the same text.
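A toy calculation of the same point; the characters-per-token ratios below are made-up illustrative values, not measurements of any real tokenizer:

```python
def chars_in_context(context_tokens, chars_per_token):
    # How much raw text fits in a fixed token budget for a given compression ratio.
    return context_tokens * chars_per_token

old = chars_in_context(8_192, 3.5)   # hypothetical 32K-vocab tokenizer
new = chars_in_context(8_192, 4.5)   # hypothetical 128K-vocab tokenizer
print(f"old tokenizer: ~{old:,.0f} characters fit in 8K tokens")
print(f"new tokenizer: ~{new:,.0f} characters fit in 8K tokens ({new / old - 1:.0%} more text)")
# Equivalently, the same document needs 1 - 3.5 / 4.5, about 22% fewer tokens, with the new tokenizer.
```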
00:57:34.760 | - How are tokenizer vocabs built?
00:57:37.600 | I actually don't know that.
00:57:38.480 | What's the work that goes into it?
00:57:39.680 | And then like, why are people using smaller ones?
00:57:43.080 | Is it harder to make them?
00:57:44.080 | Or is it just about some of the things you mentioned
00:57:46.520 | around scaling the training and all of that?
00:57:49.120 | - Oh, no, there's different methods,
00:57:51.120 | but one became quite standard.
00:57:53.360 | Although it could change in the future.
00:57:54.880 | - BPE?
00:57:55.720 | - Yeah, exactly.
00:57:56.560 | - Well, BPE is for text.
00:57:58.080 | I don't know about multimodal vocab.
00:58:00.640 | That's, I haven't read anything about.
00:58:02.640 | - Yeah, let's keep that question.
00:58:04.920 | I'm not expert there.
00:58:05.760 | And I don't remember exactly what they ended to do.
00:58:08.200 | - Now that you're saying this, right?
00:58:09.360 | Okay, so now we have 100K vocab, 200K vocab.
00:58:13.680 | Do we see a million vocab?
00:58:15.720 | Do we see infinity, which is no tokenizer?
00:58:18.560 | You know, like what's the natural limit of tokenization?
00:58:22.000 | - Yeah, that's a good question.
00:58:23.080 | I don't know.
00:58:24.560 | I think there's a limit with respect
00:58:26.440 | that will grow with respect to the model size.
00:58:29.160 | So bigger models means possibly bigger vocabulary
00:58:34.160 | without affecting too much of training.
00:58:36.320 | But yeah, there's a lot of people.
00:58:38.440 | That's not my domain of expertise,
00:58:39.840 | but a lot of people are discussing the interest
00:58:42.120 | of having this kind of tokenizer,
00:58:44.000 | which doesn't feel quite natural.
00:58:46.120 | Could we go to character level tokenizer?
00:58:48.360 | Could we go to actually multimodal tokenizer,
00:58:51.760 | which will like decompose at pixel level?
00:58:55.280 | I don't know.
00:58:56.120 | Future directions, that could be very promising.
00:58:58.920 | - Yeah, I would say the diffusion people
00:59:00.520 | have actually started to swing back to pixel level.
00:59:03.840 | And probably that will presage the language people
00:59:07.680 | also moving towards 1 million vocabulary
00:59:11.280 | and then whatever the natural limit is for character level.
00:59:15.240 | - I think we can maybe transition
00:59:16.680 | towards some of your personal stuff.
00:59:18.720 | We kept you here for a long time.
00:59:20.040 | We also, this is a very distributed podcast.
00:59:22.560 | You know, I'm in the Bay Area, you're in France,
00:59:24.360 | Sean is in Singapore.
00:59:25.240 | So everybody is on a different time zone.
00:59:28.680 | You also do some startup investing and advising.
00:59:31.680 | You know, we had Soumith Chintala on the podcast.
00:59:33.720 | He also mentioned he always enjoys kind of working
00:59:36.240 | with founders and researchers.
00:59:38.120 | Any company you're involved with that you want to shout out
00:59:40.720 | that you think is super promising,
00:59:42.480 | requests for startups that you've had,
00:59:44.880 | anything around that space would be awesome.
00:59:47.960 | - Two cool companies I can think of now:
00:59:51.000 | one is Lindy, which is based in the Bay Area,
00:59:53.680 | with Flo Crivello.
00:59:55.240 | Very cool one.
00:59:56.560 | - Yeah, he's a good friend.
00:59:57.560 | - Flo.
00:59:58.400 | - Why do you like it?
00:59:59.240 | - Flo is really good, like he's a Frenchman, I guess.
01:00:02.200 | And number two, very recently, I really liked Open Devin,
01:00:07.200 | which is basically trying to reproduce Devin.
01:00:10.760 | - We interviewed him at ICLR.
01:00:12.240 | Both are agent startups.
01:00:14.120 | What do you think is like the direction
01:00:15.680 | that startups should be working on, you know, agent-wise,
01:00:18.120 | and maybe what is not working?
01:00:20.720 | - That's a tough question.
01:00:22.160 | One thing I say quite often is,
01:00:24.600 | deep learning has this very specificity
01:00:27.160 | that makes it challenging to predict:
01:00:29.640 | it's a self-destructive technology.
01:00:33.440 | So think, like, you know, Grammarly,
01:00:35.520 | the startup where, with this technology,
01:00:37.400 | you plug and play and it corrects your grammatical errors.
01:00:41.160 | Everyone told them, guys, deep learning
01:00:43.960 | creates a barrier to entry: annotate data, create data.
01:00:47.640 | And they had a lot of data for that.
01:00:49.600 | And the next day, with the same exact technology,
01:00:52.520 | deep learning, someone comes with chat GPT and tell them,
01:00:55.480 | yeah, I can do the same, better, and so many other things.
01:00:58.720 | Zero barrier to entry from yesterday to today.
01:01:02.760 | And what is crazy here
01:01:04.160 | is that it's based on the same technology.
01:01:06.320 | And so there's a lot of people working nowadays
01:01:09.560 | to try to mitigate issues
01:01:12.040 | with current generation of models.
01:01:14.520 | And I'm telling them, like,
01:01:16.200 | assume always the next generation will get better.
01:01:18.800 | So if your business will benefit
01:01:21.760 | from a new generation with better abilities,
01:01:24.280 | that's a good business.
01:01:25.560 | If your business may be replaceable,
01:01:27.680 | and if all the work you have done may vanish
01:01:30.280 | and be like wasted because there's better models,
01:01:33.000 | then maybe change.
01:01:35.160 | - Yeah, I mean, yes, but better is so unpredictable.
01:01:38.160 | Like if you asked me before, let's say March of this year,
01:01:42.080 | I would have said that maybe, you know,
01:01:44.520 | voice chat is still very defensible.
01:01:47.680 | And then suddenly, you know,
01:01:48.920 | OpenAI demoed their sort of real-time voice thing.
01:01:52.400 | It's sort of natively multimodal.
01:01:55.000 | It's easy to not anticipate
01:01:57.960 | a dimension where it gets better,
01:01:59.640 | but finding another one that resists is harder.
01:02:02.280 | I would say in general,
01:02:03.560 | assume you will have progress everywhere.
01:02:06.040 | It may not be right,
01:02:08.240 | but it's a bit dangerous to bet against that.
01:02:11.480 | - Is there any space that you think is overrated by founders
01:02:15.080 | that are trying to build something that like, yeah,
01:02:18.200 | either, you know, the new models are just gonna do,
01:02:19.960 | or like, you just don't think
01:02:20.920 | there's that much interest from folks?
01:02:23.640 | - It's a challenging time for founders.
01:02:25.240 | It's very exciting.
01:02:26.520 | There's a lot of funds, a lot of applications as well,
01:02:28.720 | a lot of stuff to build.
01:02:30.600 | That's pretty cool.
01:02:31.560 | But what is hard is,
01:02:32.640 | because this technology is moving so fast,
01:02:34.760 | I see like now a lot of fundamental stacks
01:02:37.920 | that are like the unicorns of today.
01:02:40.160 | Foundation models, foundational stuff like clusters,
01:02:42.800 | data annotation, things like that.
01:02:44.320 | There's a lot, but fewer successful,
01:02:46.680 | for now at least, application companies.
01:02:49.640 | And it's hard to build an application
01:02:52.360 | when it changes so fast, as we discussed before.
01:02:54.760 | So it is both cool and yet, like,
01:02:58.120 | we haven't found a good use case
01:03:01.320 | that is like the big new company there.
01:03:04.840 | And I want to see it.
01:03:06.440 | - Yeah, we definitely see the same, you know,
01:03:08.440 | all of our agent companies, or at least, you know,
01:03:10.840 | building agents are the ones getting the most traction.
01:03:13.360 | Most companies are like,
01:03:14.200 | hey, I actually don't have that much expertise
01:03:15.840 | and I'm just waiting for the models to get better.
01:03:18.160 | So I'm not really sure if I need this now.
01:03:20.200 | So it's an interesting time to be investors.
01:03:23.520 | Anything else we missed?
01:03:25.080 | This was kind of like a masterclass
01:03:27.080 | in how to build state-of-the-art LLM.
01:03:28.720 | So it's going to be a highly played episode, I'm sure.
01:03:32.520 | Any final thoughts you want to share?
01:03:34.800 | - There's two things I can, I guess I can say.
01:03:36.560 | One is that Llama is hiring talent worldwide.
01:03:41.320 | And two, you can contact me, reach me out on LinkedIn,
01:03:45.040 | I'm looking for GenAI technology
01:03:47.880 | and founders that will create the future.
01:03:51.120 | - Okay. On hiring, is there one role that you're like,
01:03:53.920 | man, like we really need this kind of person.
01:03:56.960 | If you describe it, that person will be referred to you.
01:04:00.680 | Right? Like, because we're trying to broadcast it
01:04:03.680 | to the whole world.
01:04:05.320 | - Researchers with good common sense,
01:04:07.520 | first-principles thinking,
01:04:09.000 | not necessarily like huge expertise on LLMs,
01:04:10.920 | but more being super rigorous, meticulous, structured.
01:04:15.080 | - Awesome, man, thank you again for coming on
01:04:17.320 | and hope everybody gets to enjoy LLAMA 3 today
01:04:19.760 | since it just came out
01:04:20.960 | and we'll have you again for LLAMA 4.
01:04:22.960 | (upbeat music)
01:04:25.560 | (upbeat music)
01:04:28.160 | (upbeat music)
01:04:30.760 | (upbeat music)