
Training Llama 2, 3 & 4: The Path to Open Source AGI — with Thomas Scialom of Meta AI


Chapters

0:00 Introductions
4:16 The Llama Origin Story
7:34 Are there RLHF scaling laws?
9:56 Avoiding the "Chinchilla trap"
12:15 Why 405B?
14:27 FP8 training and other scaling research
17:48 Llama 3 vs Llama 2
18:32 Synthetic data for pre-training
21:43 Tool use to generate synthetic data
22:40 Pre-training data recipe
26:00 Why not MoE?
27:05 Why RLHF is so important
37:06 How they eval models
41:50 Benchmarking Uncertainty
44:04 Structured output and tool calling
45:52 Llama 4 & Agents
52:01 Will Meta keep releasing open models?
53:55 Why tokenizer vocab size is underrated
59:12 AI & Startups
63:13 Hiring at Meta AI

Whisper Transcript

00:00:00.000 | (upbeat music)
00:00:02.580 | - Hey everyone.
00:00:05.640 | Welcome to the Latent Space Podcast.
00:00:07.440 | This is Alessio, partner and CTO
00:00:09.320 | in Residence at Decibel Partners.
00:00:10.800 | And I'm joined by my co-host, Swyx,
00:00:12.560 | founder of Smol AI.
00:00:13.760 | - Hey, and today we have a very special episode
00:00:17.240 | with Thomas Scialom.
00:00:17.240 | I don't know how to describe,
00:00:18.360 | you've done so much work
00:00:19.720 | in a very short amount of time at Meta,
00:00:21.240 | but you were most notably leading Llama 2.
00:00:24.200 | And now today we're also coordinating
00:00:26.760 | on the release of Llama 3.
00:00:27.840 | So welcome.
00:00:28.680 | - Thanks for having me.
00:00:29.640 | - To be clear, obviously the Llama 3 405B.
00:00:33.080 | Is that the official size number that we're going with?
00:00:35.880 | Or is it, do we just say 400B?
00:00:37.080 | - For the text model only, yes.
00:00:40.560 | A bit of additional parameters
00:00:42.080 | for the multimodal version that will come later.
00:00:44.040 | - Awesome. Awesome.
00:00:45.160 | Just to quickly go over your background.
00:00:46.840 | Actually we had a slightly similar past.
00:00:48.680 | I was also a quantitative trader
00:00:50.640 | and it looks like you did five years in quant finance,
00:00:53.080 | working at Trading Timer in SOC Gen.
00:00:55.240 | And then you transitioned into natural language,
00:00:58.280 | getting your PhD at Sorbonne,
00:00:59.920 | working on Recital as well.
00:01:01.880 | And then right after your PhD, joining Meta.
00:01:04.640 | - No, it's exactly that.
00:01:05.600 | But basically, I think it was at the AlphaGo moment,
00:01:08.360 | when I was doing some trading,
00:01:09.760 | that I said, I need to understand
00:01:13.080 | the technology behind that.
00:01:14.640 | And I wanted to study machine learning.
00:01:16.040 | I first did some training,
00:01:17.680 | like a six-month executive degree,
00:01:20.240 | at the end of which I knew what XGBoost was at the time
00:01:23.080 | and nothing about deep learning at all.
00:01:25.680 | So, and most of the people around were like PhD people.
00:01:30.400 | Okay, PhD seems pretty cool.
00:01:32.600 | Deep learning seems pretty cool.
00:01:33.840 | So I want to do a PhD in deep learning.
00:01:36.760 | That's when I joined.
00:01:38.600 | We have this PhD program in France,
00:01:41.320 | shared between a company and academia.
00:01:44.560 | And so I did my PhD with Recital and Sorbonne University
00:01:48.120 | on natural language generation with reinforcement learning.
00:01:51.000 | I guess it was a good topic.
00:01:52.920 | I was not like a visionary.
00:01:54.680 | It was very random.
00:01:56.240 | I had a company that offered me this topic,
00:01:59.960 | and I started something like two weeks before BERT.
00:02:03.120 | - Excellent timing.
00:02:03.960 | Yeah, we actually also just released our episode
00:02:06.040 | with Clémentine Fourrier, who also did her PhD
00:02:09.120 | with a company in kind of like a very similar format.
00:02:11.720 | I think, yeah, very underrated, very underrated.
00:02:14.040 | This sort of PhD with industry expertise
00:02:16.920 | because you're also like publishing papers the whole time.
00:02:19.160 | I looked at your publishing history.
00:02:21.040 | You were doing like summarization work.
00:02:23.360 | You're doing factual consistency work.
00:02:25.000 | You released some benchmarks
00:02:26.720 | and then you worked on language GANs
00:02:28.560 | before the transformers took over.
00:02:30.680 | - We can come back to that later,
00:02:33.320 | but I should have, I mean, those papers have like 10, 50 citations.
00:02:38.000 | I'm pretty sure that if I had called them
00:02:39.960 | RLHF without the human in the loop,
00:02:44.600 | but with a discriminator, which is synthetic,
00:02:47.680 | instead of a human in the loop,
00:02:48.720 | I would have gotten many more citations today,
00:02:51.000 | because all the inspiration for those papers
00:02:54.000 | came from the original OpenAI paper on RLHF.
00:02:57.520 | But in academia, we don't have a way
00:02:59.640 | to pay for annotation online like that.
00:03:03.120 | So how do you simulate it?
00:03:05.920 | - Yeah, a lot of these ideas are repeated,
00:03:07.400 | like discriminator, generator,
00:03:08.480 | we just call them different names now,
00:03:10.040 | like verifier, whatever. - Exactly.
00:03:12.480 | - Well, I think your progress into NLP was like really strong
00:03:16.040 | 'cause like the first thing you worked on at Meta was Bloom.
00:03:18.080 | - Yeah, actually I started to work on that
00:03:20.280 | before joining Meta.
00:03:21.560 | I was not like one of the main contributors,
00:03:24.120 | but it was at the intersection of multilinguality,
00:03:26.880 | which was very important to me, large language modeling.
00:03:30.600 | And that's why actually my first big project
00:03:33.200 | at Meta and the team I was working on was Galactica.
00:03:36.120 | And actually, an interesting step back from BLOOM
00:03:39.440 | was that we made a lot of mistakes,
00:03:41.160 | but that was expected in a way.
00:03:43.400 | We learned a lot
00:03:44.760 | while trying to scale towards multilinguality.
00:03:48.400 | In fact, we learned later that multilinguality
00:03:51.440 | almost emerged naturally with very, very few data,
00:03:54.240 | which was really surprising
00:03:55.360 | and not expected at all for us at the time.
00:03:57.440 | - I mean, my learning from that is just,
00:03:58.920 | there's a natural harmony of language
00:04:01.480 | that is abstract from English.
00:04:03.480 | When you learn English, you learn language
00:04:05.320 | and then language just translates
00:04:07.040 | to other forms of languages,
00:04:08.920 | especially if they're the same family, right?
00:04:10.600 | Like, yeah, so maybe we should get right into Llama 2,
00:04:13.600 | spend a little bit of time there
00:04:14.600 | and then we'll go into Llama 3.
00:04:16.040 | So what is the story of Llama 2 from your point of view?
00:04:19.560 | - Yeah, so as I was saying, I started at Meta on Galactica.
00:04:24.080 | That was one of the first large language models at Meta.
00:04:27.040 | It's a language model for science.
00:04:28.640 | We released it in, I think, December or end of November,
00:04:31.720 | I don't remember, one year and a half ago.
00:04:34.560 | I don't know if people remember,
00:04:35.840 | but it was huge on Twitter,
00:04:38.400 | both with people thinking it's the end of science
00:04:41.280 | and with people saying that it hallucinated a lot of papers,
00:04:43.680 | although I was like, it's super awesome.
00:04:45.640 | I still think it was super awesome,
00:04:47.560 | but you know, we didn't do like instruction tuning
00:04:50.120 | or RLHF techniques at the time.
00:04:52.200 | It was a weird moment because two weeks later,
00:04:54.560 | ChatGPT came out, and that's the moment where,
00:04:57.920 | I think, everything at the company went upside down
00:05:00.720 | and where we had a huge push from leadership
00:05:04.720 | to now work on that and make a ChatGPT as soon as possible.
00:05:07.640 | So we had this one, two months of like what to do.
00:05:11.080 | I actually was working on Galactica Instruct,
00:05:14.240 | which basically you could connect to Overleaf.
00:05:16.440 | We had a partnership with Overleaf,
00:05:18.720 | the Google Docs of scientists, where you can write papers,
00:05:22.680 | and you write there in LaTeX,
00:05:25.040 | and you have to do a lot of citations.
00:05:27.080 | So the idea was that you can,
00:05:28.480 | just like ChatGPT or GPT Instruct,
00:05:30.720 | ask it to swap two columns in a LaTeX table.
00:05:34.200 | That's something very, very time-consuming.
00:05:36.680 | I can promise.
00:05:37.720 | You could like say, oh, find me a citation
00:05:39.640 | about LLMs and bias.
00:05:42.320 | It would find you some papers,
00:05:45.240 | and automatically insert the BibTeX in LaTeX.
00:05:45.240 | So that was pretty cool.
00:05:46.080 | But because of the backlash,
00:05:47.040 | we never released it in the end.
00:05:49.280 | - Oh, because the Galactica backlash.
00:05:51.880 | - Yes.
00:05:52.720 | - Like I was just saying like today it's not solved
00:05:55.680 | because Lucas Beyer is still asking
00:05:55.680 | for the citation generator.
00:05:58.440 | - I told him.
00:06:00.800 | I was like, dude, we had that two years ago
00:06:02.840 | and I promise, I tested it.
00:06:02.840 | It works so well.
00:06:04.400 | I had it on Overleaf integrated.
00:06:06.280 | I tested it.
00:06:07.280 | - Wow, okay.
00:06:08.720 | - Yeah, yeah, yeah.
00:06:09.560 | No, it went quite far in fact.
00:06:11.720 | And actually, about citations,
00:06:13.440 | it's anecdotal,
00:06:14.600 | but because Galactica was trained to cite papers
00:06:18.400 | with all the references in the paper,
00:06:19.960 | that's what made this capability emerge so easily at instruction-tuning time.
00:06:24.280 | Actually, Galactica Instruct
00:06:26.400 | was the first annotation project for RLHF at Meta.
00:06:30.400 | That was a follow-up of Galactica that we were preparing.
00:06:33.000 | And at the same time,
00:06:34.600 | my friends from the Paris office created Llama 1.
00:06:38.760 | It's like to connect the dots with what we said before.
00:06:41.760 | The last author was Guillaume Lample who founded Mistral.
00:06:44.760 | The first author is Hugo Touvron
00:06:45.920 | who worked with me on Llama 2, still at Meta.
00:06:48.960 | Both did that PhD program within Meta
00:06:51.480 | as the company, alongside academia.
00:06:53.600 | So that's a pretty good program indeed.
00:06:56.240 | And so we worked on Llama 2 from that point.
00:06:59.240 | We had all the support from the company leadership.
00:07:01.600 | That was one of the main priorities.
00:07:03.720 | We had Llama 1 and Galactica
00:07:05.400 | as the backbone of a good language model.
00:07:08.320 | We started from Llama 1,
00:07:09.840 | and we worked mainly with Guillaume
00:07:12.800 | on how to make instruction-following
00:07:15.160 | and chat models that will follow instructions.
00:07:18.280 | So all the supervised fine-tuning stage,
00:07:20.480 | then the RLHF, there are some papers,
00:07:22.480 | so we had some intuitions from there we could use.
00:07:25.240 | But in fact, at large scale,
00:07:26.960 | and that was probably the biggest challenge for us,
00:07:31.200 | there's no research anymore.
00:07:32.560 | We don't know how much to scale.
00:07:34.720 | - Can you describe what scale you're talking about?
00:07:36.280 | - Yeah, yeah, to what level to scale the annotation.
00:07:39.640 | Is the annotation like, do you need 100,000,
00:07:42.240 | 1 million, 10 million annotation
00:07:44.440 | of supervised fine-tuning, of RLHF preferences?
00:07:47.480 | We had no idea.
00:07:48.880 | What is the actual algorithm to do?
00:07:50.920 | How often to retrain the models?
00:07:52.720 | You have just the basic,
00:07:54.000 | but then when it comes to ChatGPT
00:07:56.160 | or GPT Instruct or Claude,
00:07:58.560 | no one published the details there.
00:08:00.440 | And so we had to reinvent the wheel there
00:08:02.320 | in a very short amount of time.
00:08:03.800 | - And what about parameter size?
00:08:05.440 | This is one question that a lot of folks
00:08:07.320 | had about Llama 3.
00:08:08.680 | So Llama 1, you had 7B, 13B, 33B, 65B model sizes,
00:08:13.680 | and then Llama 2: 7, 13, 70.
00:08:17.960 | How do you kind of evaluate what's worth training,
00:08:20.400 | especially when you think about data?
00:08:21.560 | It's like, you know, maybe 100,000 is enough
00:08:23.720 | for like a 7B model,
00:08:24.760 | but it's not enough for a 70B model.
00:08:27.040 | How do you decide model size,
00:08:28.560 | especially when you're maybe annotation constrained
00:08:31.120 | on some of these things?
00:08:32.120 | - That's a very good question.
00:08:33.840 | And there's no good answer.
00:08:35.320 | There's so many parameters to take into account
00:08:38.120 | from the scaling laws at training time
00:08:41.120 | to get the best performance.
00:08:43.200 | The GPU constraints on different hardware,
00:08:45.880 | and we think about Meta, but also about the community,
00:08:49.520 | and people are not just using H100s,
00:08:51.640 | there are also A100s, there are different sizes of GPU memory.
00:08:55.800 | So which size will fit in what,
00:08:57.800 | and what is the most useful?
00:08:59.560 | Also at inference time, not just at fine tuning time,
00:09:02.200 | then you can maybe do some tricks at inference time
00:09:04.720 | to quantize it a bit or FP16 or FP8 now.
00:09:08.760 | All those constraints make it very, very challenging.
00:09:11.880 | At inference time, you have a lot of costs.
00:09:14.000 | So how to trade off between inference cost
00:09:16.160 | and training cost?
00:09:17.440 | It's a very challenging problem.
00:09:19.360 | In general, we tend to think, in particular for Llama 3.
00:09:23.320 | Llama 2, maybe I would say, was like Llama 1:
00:09:25.480 | we had a flagship model, which was 70B.
00:09:27.840 | It's also because the project had some roots
00:09:30.480 | in reproducing Chinchilla, which was a 70B.
00:09:34.600 | For Llama 3, we also moved one size up.
00:09:37.960 | The flagship model is 405B.
00:09:40.640 | I think that there was also the question of,
00:09:42.600 | we want a model at this time,
00:09:44.800 | we have this amount of compute.
00:09:46.560 | Given the scaling laws and the amount of tokens
00:09:48.400 | we have to train it, what would be the right balance
00:09:51.320 | to still fit in at inference time?
00:09:53.960 | So we try to have some trade-off like that.
00:09:56.640 | - Yeah.
00:09:57.480 | And you mentioned Chinchilla is the best way to go,
00:09:59.640 | but then you tweeted recently,
00:10:00.960 | "Don't fall into the Chinchilla trap
00:10:02.600 | if you want your model to be used by billions of people."
00:10:05.080 | So what's the updated state of scaling laws?
00:10:08.000 | I think there was obviously the Kaplan one,
00:10:10.200 | and then there was Chinchilla,
00:10:11.400 | and then people kind of got the Llama scaling law,
00:10:13.800 | like the 100 to 200x token-to-parameter ratio.
00:10:18.160 | What's your updated thinking
00:10:19.960 | on how to think about scaling laws
00:10:21.760 | when you pick model size and training data?
00:10:24.160 | - Right.
00:10:25.000 | So, as you said, there's this Kaplan paper with scaling laws,
00:10:28.360 | where they figured out, basically, they tried two dimensions:
00:10:32.360 | the model weights and the amount of training,
00:10:37.040 | like the number of steps, training tokens, epochs.
00:10:40.000 | And from that, they figured that model size is what matters.
00:10:43.520 | So GPT-3 was way too big
00:10:46.240 | compared to the actual number of training tokens,
00:10:48.560 | because they made a mistake by not adapting the scheduler.
00:10:51.440 | That's what Chinchilla emphasized and discovered.
00:10:54.800 | To be fair, I think Kaplan knew that
00:10:56.560 | at the time of Chinchilla paper.
00:10:57.920 | But yeah, basically Chinchilla said
00:11:00.600 | we have to revisit the scaling laws
00:11:02.880 | originally published by Kaplan
00:11:05.000 | and emphasize much more the importance of training tokens.
00:11:08.280 | And they did some really good scaling laws
00:11:10.080 | showing that there's an optimal,
00:11:11.960 | basically you need to double the number of training tokens
00:11:14.640 | every time you double the model weights,
00:11:17.520 | to get an optimal ratio,
00:11:20.400 | so that for a finite amount of compute,
00:11:22.800 | you will end up with the best results in your paper.
00:11:24.920 | And what I call the Chinchilla trap
00:11:26.920 | is that this is good if you want the best flagship model
00:11:29.440 | that obtains the highest performance on paper.
00:11:31.880 | But if you want to use your model at inference time,
00:11:35.880 | then of the two dimensions,
00:11:38.080 | one remains, the model weights,
00:11:40.240 | but one drops out: the number of tokens you trained on,
00:11:42.800 | the number of steps.
00:11:43.920 | And so, to be compute-efficient at inference time,
00:11:46.920 | it's much better to train for a much longer time,
00:11:49.600 | even if it's an additional effort,
00:11:52.400 | than to have a bigger model.
00:11:53.920 | That's what I refer to as the Chinchilla trap.
00:11:57.040 | Not that Chinchilla was wrong,
00:11:58.280 | but if you consider inference time,
00:12:01.640 | you need to go beyond Chinchilla.
00:12:03.440 | And in fact, that's what the Llama 1 folks did
00:12:06.480 | by overtraining it, since they could have gotten
00:12:08.840 | better performance in their paper,
00:12:10.440 | but they preferred to create the best artifact
00:12:13.720 | that would be used by the community.
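(To make the "Chinchilla trap" trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the common approximations of ~6*N*D training FLOPs, ~2*N inference FLOPs per token, and a rough Chinchilla-optimal ratio of ~20 tokens per parameter; the specific budgets and sizes are illustrative assumptions, not Meta's numbers.)

```python
# Back-of-the-envelope comparison: Chinchilla-optimal vs. over-trained ("beyond Chinchilla").
# Assumed approximations: train FLOPs ~ 6*N*D, inference FLOPs ~ 2*N per token,
# Chinchilla-optimal ratio ~ 20 training tokens per parameter.

def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

def inference_flops_per_token(n_params: float) -> float:
    return 2 * n_params

# Budget that is Chinchilla-optimal for a 70B model (70B params, ~1.4T tokens).
compute_budget = train_flops(70e9, 1.4e12)

# Option A: the Chinchilla-optimal model for this budget.
a_params, a_tokens = 70e9, 1.4e12
# Option B: a smaller model over-trained on far more tokens for the same budget.
b_params = 8e9
b_tokens = compute_budget / (6 * b_params)

print(f"A: {a_params/1e9:.0f}B params, {a_tokens/1e12:.1f}T tokens, "
      f"{inference_flops_per_token(a_params):.2e} FLOPs/token at inference")
print(f"B: {b_params/1e9:.0f}B params, {b_tokens/1e12:.1f}T tokens, "
      f"{inference_flops_per_token(b_params):.2e} FLOPs/token at inference")
# Same training budget, but B is roughly 9x cheaper per generated token,
# at the cost of slightly worse "paper" performance than the compute-optimal choice.
```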
00:12:15.280 | - So that's the scaling thinking.
00:12:17.480 | What else went into the Llama 3 planning?
00:12:21.000 | So Llama 3, you have a pretty good model.
00:12:22.920 | People really liked it.
00:12:24.120 | In Llama 3, you drop the intermediate sizes.
00:12:26.720 | So it's 8, 70, and now 405B.
00:12:29.920 | What was the thinking behind going so large?
00:12:31.680 | I mean, you talked about the hardware capabilities
00:12:33.720 | at inference, it's not like I can run a 405B model at home,
00:12:37.080 | for sure.
00:12:37.920 | And it might be hard to even get the cloud resources
00:12:40.720 | to do it.
00:12:41.560 | What was the decision there?
00:12:43.320 | - The decision is super simple.
00:12:45.520 | We want the best model.
00:12:47.360 | We want to be number one and number two.
00:12:49.640 | We started one year and a half ago
00:12:52.360 | and we did quite some journey.
00:12:54.080 | We filled the gap with GPT-4.
00:12:55.960 | So that will be the first open source model
00:12:58.760 | that actually compares to GPT-4.
00:13:01.200 | There's now GPT-4o, of course.
00:13:03.520 | And we're close, but we're not there yet.
00:13:06.320 | Not in all capabilities.
00:13:07.640 | But the gap is getting smaller and smaller.
00:13:10.280 | There's also like what compute we had at the time
00:13:12.840 | when we started to run in January.
00:13:15.040 | We put a lot of effort there,
00:13:16.560 | but as like Mark announced, we have more and more GPUs.
00:13:19.560 | So the next generation will be bigger.
00:13:21.200 | So that's what drives the decision.
00:13:22.800 | Now, maybe let me reflect on two things he said.
00:13:25.800 | You cannot use it at home.
00:13:27.480 | That's probably true.
00:13:28.720 | But quantized to FP8, it can run on a node,
00:13:32.520 | even with a long context of 128K tokens.
00:13:36.480 | Second thing is, I'm hopeful that the community
00:13:39.720 | will lead to a lot of findings by open sourcing it.
00:13:42.560 | And there are smart ways to actually let you use it
00:13:47.040 | on your computer.
00:13:48.200 | If you remember, when we published models,
00:13:51.360 | people were saying it's too big.
00:13:52.440 | And after two weeks, it was running on a Raspberry Pi.
00:13:54.960 | I don't know if it will be the same,
00:13:56.080 | but I hope it's the same kind of trend.
00:13:58.720 | And by releasing those models, we are enabling that.
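(As a rough illustration of why FP8 matters for running a 405B model on a single node, the sketch below estimates weight memory at different precisions. It counts weights only, ignoring KV cache, activations, and runtime overhead, so these are illustrative lower bounds.)

```python
# Rough weight-memory estimate for a 405B-parameter dense model (weights only;
# KV cache, activations, and framework overhead are ignored).
N_PARAMS = 405e9
BYTES_PER_PARAM = {"FP32": 4, "FP16/BF16": 2, "FP8": 1}

for fmt, nbytes in BYTES_PER_PARAM.items():
    gib = N_PARAMS * nbytes / 2**30
    print(f"{fmt:>10}: ~{gib:,.0f} GiB of weights")
# FP16 weights alone (~754 GiB) exceed a typical 8x80GB node (640 GB),
# while FP8 (~377 GiB) fits on one node with headroom for a long-context KV cache.
```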
00:14:02.320 | Now, the last thing I want to add is having bigger models
00:14:06.200 | enables us to collect better data,
00:14:07.840 | for instance, at the RLHF stage,
00:14:10.080 | because that's the model we use for the annotation.
00:14:12.000 | And so we distill, straightforwardly,
00:14:14.400 | this annotation from this better model
00:14:17.360 | to the other models.
00:14:18.640 | So I can guarantee you that the quality
00:14:20.040 | of the smaller models we are releasing with Llama 3
00:14:23.040 | is also thanks to having these artifacts
00:14:25.720 | with which we can collect and train.
00:14:27.560 | - Yeah, there's a lot of really good info there.
00:14:29.760 | One thing I'll just briefly touch on for quantization,
00:14:33.480 | there was a recent Noam Shazeer blog post.
00:14:36.200 | Noam is writing again for some reason.
00:14:39.160 | And he was talking about sort of native FP8 training.
00:14:43.880 | It seems like that is most useful for inference.
00:14:47.000 | That is what you expect the open source community
00:14:48.960 | to do with your weights once you release them anyway.
00:14:51.640 | Is there any movement or thinking
00:14:53.520 | about just moving to FP8
00:14:55.680 | or whatever other new format is in vogue these days?
00:14:59.920 | - There are also these papers on training with,
00:15:02.800 | I forget the name,
00:15:03.640 | but there are two follow-up papers
00:15:05.200 | on just zero, one, or minus one weights.
00:15:08.880 | And like, there's a lot of work there.
00:15:11.200 | I think it's promising directions overall.
00:15:13.480 | Regarding FP8 in particular,
00:15:15.680 | there's also the possibility for the community
00:15:17.040 | to try FP8 or the methods that are very easy
00:15:20.120 | at fine tuning time for the model.
00:15:22.560 | So I'm really looking forward to what the community
00:15:25.320 | can do there.
00:15:26.160 | Although like scaling,
00:15:28.000 | I don't know if it's all you need,
00:15:29.040 | but I will not bet against scaling.
00:15:31.440 | And one of the ways to get more scale
00:15:33.960 | is by having better algorithms, so that we can train
00:15:37.280 | to the same level with less compute.
00:15:40.520 | - Less compute and less memory.
00:15:42.840 | Yeah, like inference time memory
00:15:44.320 | is becoming a real constraint.
00:15:46.240 | - Yeah, yeah, but also training with FP8.
00:15:48.600 | If you're not training with FP8 or, I mean,
00:15:50.440 | FP0 is probably nonsense,
00:15:52.520 | but to what extent, how far we can go, you know?
00:15:55.560 | And every time you unlock something compared to
00:15:58.880 | what we had two, three years ago with FP32 or FP64,
00:16:02.480 | it's like huge progress in terms of scaling.
00:16:05.160 | - For me, it's interesting to say,
00:16:06.800 | to see you mention the ternary quantization,
00:16:10.480 | like the 1.58 bit thing.
00:16:12.560 | 'Cause I didn't know that,
00:16:14.000 | I don't know how much to believe, you know?
00:16:15.400 | Like there's a lot of these kinds of papers
00:16:17.040 | where it makes a lot of noise,
00:16:18.520 | but it doesn't actually pan out, it doesn't scale.
00:16:20.520 | - I totally agree with you.
00:16:21.600 | It's so hard for researchers, at least for me,
00:16:25.480 | to see all those papers published,
00:16:28.200 | all those cool ideas,
00:16:29.760 | all those results that are preliminary.
00:16:32.280 | And in all this massive amount of research,
00:16:36.440 | what will scale or not?
00:16:37.960 | What will resist the test of time or not?
00:16:40.120 | And are we maybe losing some gems
00:16:44.240 | that people are just not working on
00:16:45.360 | because there's too much research around?
00:16:48.160 | I don't know, maybe.
00:16:49.080 | And that's like some problems to have.
00:16:51.600 | That's cool to have these problems nowadays,
00:16:53.720 | compared to probably what Yann LeCun and the others had
00:16:56.080 | 30 years ago, but still it's a problem.
00:16:58.360 | - For what it's worth,
00:16:59.360 | I do think that FAIR is putting out incredible research.
00:17:03.600 | Probably, it doesn't seem like it's your group,
00:17:05.880 | but you also recently published MobileLLM,
00:17:08.920 | which on the small model side is a really good research
00:17:12.520 | on just small model architecture.
00:17:14.880 | It looks like Hugging Face is also replicating it,
00:17:16.880 | and it's doing quite well.
00:17:18.720 | There's a lot of ideas on shared weights and shared matrices
00:17:21.920 | and model architecture stuff that we can talk about
00:17:24.760 | for smaller scale models.
00:17:27.960 | Llama is not at that scale,
00:17:27.960 | but it seems like one of the big themes of this year
00:17:30.800 | is on-device, in-browser,
00:17:33.240 | small models that are good enough for daily use.
00:17:36.280 | I do want to talk about architecture.
00:17:38.560 | I'm not sure when you're releasing
00:17:39.720 | the Llama 3 research paper,
00:17:41.880 | but in Llama 2, you talked a little bit
00:17:43.280 | about the architecture choices.
00:17:45.200 | - It will be released the day, I think, of the release.
00:17:48.560 | - Okay, what should people know?
00:17:50.080 | What are the major choices of Llama 3 versus Llama 2?
00:17:53.640 | - There's not a lot of changes in terms of architectures.
00:17:57.440 | I think we can do a lot better in the future,
00:18:00.320 | and not just with transformers,
00:18:01.920 | but, for instance, to me, it doesn't make sense
00:18:04.120 | to use the same amount of compute per token,
00:18:06.400 | for every token.
00:18:07.480 | Like, the architecture lacks flexibility there.
00:18:09.560 | There's a lot of research to go there,
00:18:11.560 | but still, that's the best thing we have for now.
00:18:14.160 | And so, it's the same recipe as Llama 2,
00:18:17.280 | in terms of architecture and training,
00:18:20.560 | but we put so much effort into scaling the data
00:18:24.360 | and the quality of the data.
00:18:25.880 | There's now 15 trillion tokens,
00:18:27.680 | compared to two trillion,
00:18:29.280 | so it's almost another order of magnitude there, as well,
00:18:31.520 | including for the smaller models.
00:18:33.000 | - One of the things I noticed on the paper
00:18:35.240 | is that you used Llama 2 to do the data cleaning
00:18:38.400 | for what went into Llama 3.
00:18:40.320 | I think there's a lot of chatter, obviously,
00:18:41.720 | about synthetic data,
00:18:43.000 | and there was the "Refrace the Web" paper
00:18:45.360 | that came out, maybe a few months ago,
00:18:47.080 | about using Mastral to make training data better.
00:18:50.160 | Any learnings from that?
00:18:51.960 | It's like, is there,
00:18:53.200 | how much can you rewrite with the models?
00:18:56.240 | I'm sure people would love to hear more about it.
00:18:58.520 | - Right, so it's a very interesting research direction.
00:19:02.320 | Synthetic data, in general.
00:19:03.760 | Synthetic data for pre-training.
00:19:05.480 | My intuition is that the web is full of shit,
00:19:11.520 | in terms of text,
00:19:13.160 | and training on those tokens is a waste of compute.
00:19:15.880 | Just having a good classifier that labels that is cool,
00:19:19.520 | and Llama 2 was, at the time,
00:19:21.560 | before Llama 3,
00:19:22.400 | the best model we had access to, legally,
00:19:26.240 | to label the web
00:19:29.400 | and select what are the good tokens and the bad tokens.
00:19:32.040 | The additional thing is that
00:19:33.360 | it also enabled us to have a topic tag,
00:19:37.360 | like, is it about law?
00:19:38.560 | Is it about politics?
00:19:39.640 | Is it about chemistry, math, reasoning?
00:19:42.040 | So you can also adapt the mixture a bit
00:19:44.800 | to balance the diversity a bit more.
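(A hypothetical sketch of what such an LLM-based quality and topic labeler for pre-training data could look like; the prompt, labels, and the `generate` callable are illustrative assumptions, not Meta's actual pipeline.)

```python
# Hypothetical sketch: using an instruction-tuned LLM to label web documents
# with a quality score and a topic tag, then filtering and grouping the pre-training mix.
# `generate(prompt)` stands in for whatever inference API is available; it is assumed.

TOPICS = ["law", "politics", "chemistry", "math", "reasoning", "code", "other"]

PROMPT = """Rate the educational quality of the following web text from 0 (spam/garbage)
to 5 (high quality), and pick one topic from: {topics}.
Answer as: quality=<int> topic=<name>

Text:
{text}
"""

def label_document(text: str, generate) -> tuple[int, str]:
    """Ask the labeling model for a quality score and a topic tag."""
    answer = generate(PROMPT.format(topics=", ".join(TOPICS), text=text[:4000]))
    quality = int(answer.split("quality=")[1].split()[0])
    topic = answer.split("topic=")[1].split()[0]
    return quality, topic

def filter_corpus(docs, generate, min_quality: int = 3):
    """Keep only documents above a quality threshold, grouped by topic for mix balancing."""
    kept: dict[str, list[str]] = {t: [] for t in TOPICS}
    for doc in docs:
        quality, topic = label_document(doc, generate)
        if quality >= min_quality:
            kept.setdefault(topic, []).append(doc)
    return kept
```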
00:19:48.200 | - To me, I'm not exactly sure what you guys did,
00:19:51.120 | but I feel like when people say synthetic data,
00:19:54.400 | there needs to be different categories of synthetic data now
00:19:57.160 | because I think there's so many different usage
00:19:59.960 | of this thing.
00:20:00.800 | But specifically synthetic data for pre-training,
00:20:02.800 | it feels almost like you're running multiple epochs
00:20:06.760 | on the raw data while it's rephrased
00:20:10.520 | or reformatted by a language model, right?
00:20:13.600 | And in my mind, it's very similar to computer vision,
00:20:15.880 | where you do data augmentation on an item, right?
00:20:19.120 | Like, we're doing data augmentation.
00:20:20.760 | That's the less cool name for synthetic data.
00:20:22.680 | (laughs)
00:20:23.680 | - That's very interesting.
00:20:24.520 | I totally agree with you regarding pre-training,
00:20:28.120 | I totally endorse what you said.
00:20:29.920 | I think it's very different, though,
00:20:31.480 | for post-training and the future direction
00:20:33.320 | on synthetic data that I'm personally excited about.
00:20:35.960 | Like, for instance, what I'm excited about is
00:20:38.840 | we had this survey on augmented LLM a year ago,
00:20:41.680 | and all the idea is like,
00:20:43.000 | if you augment your LLM with something else,
00:20:45.480 | it can be a retriever, it can be search, it can be a tool,
00:20:48.520 | it can be a calculator, it can be a code execution.
00:20:51.360 | Then you are not just distilling,
00:20:54.800 | like doing some data augmentation with your model,
00:20:58.120 | but you're actually adding some expert skills
00:21:01.080 | that possibly go beyond the model weights.
00:21:03.800 | For instance, if your model
00:21:06.240 | got a calculation wrong before,
00:21:09.920 | and now it has access to a calculator,
00:21:11.760 | and you can retrain your model on that,
00:21:13.640 | then you're learning something new.
00:21:15.040 | If your model didn't know something about Llama 2,
00:21:17.840 | it probably doesn't know a lot about Llama 3,
00:21:19.760 | but now if it can search online about it,
00:21:22.440 | and then you train the model on that,
00:21:24.360 | then you have a positive feedback loop,
00:21:26.120 | like what we call expert iteration,
00:21:28.080 | targeting directly the weakness of the model.
00:21:30.160 | It's like continual augmentation of the language model,
00:21:33.720 | much beyond just data augmentation.
00:21:35.480 | - How related is this to tool use?
00:21:37.440 | Like, are you teaching it to use tools to augment the model,
00:21:41.280 | or are you saying, like, do active learning,
00:21:44.080 | do like, where it's weak,
00:21:45.760 | go augment the model with extra data,
00:21:48.600 | and then memorize that new data, right?
00:21:50.960 | - What I said is more like in terms of directions,
00:21:52.840 | not for Llama 3, but like,
00:21:54.800 | when it knows how to use a tool and correct itself,
00:21:58.160 | this is like a very promising direction
00:22:00.480 | that goes much beyond the augmentation
00:22:02.280 | for like, in the future,
00:22:04.680 | to keep collecting new data, new tokens.
00:22:06.680 | People are saying, like, we are running out of tokens.
00:22:09.080 | But if you think about those kinds of tokens,
00:22:10.920 | where the model always goes to correct its own weaknesses,
00:22:14.120 | it can say, like, okay, what's 10 plus 10?
00:22:17.040 | Okay, that's an easy example, the model probably knows it,
00:22:18.760 | but imagine something more complex. Take 10 plus 10:
00:22:21.920 | I expect this to be 20.
00:22:24.000 | Let's verify with a calculator,
00:22:25.920 | which is easy for a basic agent now, powered by an LLM.
00:22:29.640 | And then you verify, with respect to what you expected,
00:22:33.040 | that it's correct.
00:22:34.000 | If it's not, you can backpropagate this example
00:22:37.200 | directly to the weights,
00:22:38.320 | and so the model will keep learning new things.
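(A minimal sketch of that expert-iteration idea, with a calculator as the external tool; `model.generate` and the data format are hypothetical placeholders, not a real API.)

```python
# Hypothetical expert-iteration loop: the model answers arithmetic prompts,
# a calculator tool verifies them, and verified (prompt, correct answer) pairs
# are collected as new fine-tuning data targeting the model's weaknesses.
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def calculator(a: float, op: str, b: float) -> float:
    """External tool: exact arithmetic the model can defer to."""
    return OPS[op](a, b)

def expert_iteration(model, problems):
    """Collect corrected examples wherever the model's own answer is wrong."""
    new_training_data = []
    for a, op, b in problems:
        prompt = f"What is {a} {op} {b}?"
        model_answer = float(model.generate(prompt))   # assumed interface
        truth = calculator(a, op, b)                   # tool-verified ground truth
        if model_answer != truth:
            # The model was wrong: keep the verified answer as a new training target.
            new_training_data.append((prompt, str(truth)))
    return new_training_data  # fed back into fine-tuning
```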
00:22:40.080 | - It makes sense.
00:22:40.920 | What have been your insights?
00:22:41.840 | You know, you mentioned about just like using calculators.
00:22:44.680 | What are the insights?
00:22:45.520 | I think it's just, in general,
00:22:47.120 | a lot of that is just driven using code generation,
00:22:49.360 | apart from just tool use.
00:22:50.920 | What are your insights on just like the data mix
00:22:53.640 | of how much code, how much multilinguality,
00:22:56.760 | which is something that you're also passionate about?
00:22:58.960 | We know that that's changed between Llama 2 and Llama 3.
00:23:01.680 | Is it changing for different stages
00:23:03.480 | between the different sizes of Llama 3?
00:23:05.720 | Like, you know, anything like of that sort?
00:23:08.080 | - No, it didn't.
00:23:09.680 | For the different sizes, we used the same, mostly.
00:23:12.400 | What happened is we changed the data mix
00:23:15.160 | during the training of Llama 3,
00:23:17.120 | with some findings that happened along the way,
00:23:19.120 | I mean, training is long,
00:23:20.360 | so you have to do something while it's training.
00:23:22.480 | And what the team did,
00:23:23.440 | I was working on my side on multimodal and post-training,
00:23:25.400 | but the pre-training team did quite a lot of work
00:23:28.280 | to have some new findings
00:23:30.480 | and improve the data mixture along the way.
00:23:32.320 | And they integrated that before the end of the training.
00:23:35.640 | - I sense a movement in terms of like the curriculum
00:23:39.280 | that people are adopting during pre-training
00:23:41.600 | and even post-training about, you know,
00:23:43.880 | what the mix should be.
00:23:44.720 | Like Snowflake is doing some interesting work
00:23:46.600 | with enterprise intelligence or whatever they call it.
00:23:50.480 | What are your goals with post-training?
00:23:51.800 | Like just at a high level, you know, like,
00:23:53.480 | how do you work with, like, the pre-training team?
00:23:55.760 | - I think it's quite easy for now
00:23:57.920 | because there's not yet like this kind
00:23:59.880 | of continual augmentation where it could feed back
00:24:02.800 | like pre-training, things like that.
00:24:04.440 | One of the big continuum between pre-training
00:24:06.760 | and post-training in particular is continual pre-training
00:24:09.880 | where you actually continue the pre-training
00:24:12.560 | before RLHF in a self-supervised way,
00:24:14.880 | but on expert level domains,
00:24:16.880 | like for it to have an expert in code
00:24:18.720 | and an expert in like reasoning
00:24:20.200 | or an expert in multilinguality
00:24:22.640 | that enables us to collect even better RLHF annotations after.
00:24:25.520 | So that's one thing.
00:24:26.720 | And then you start from those models
00:24:28.880 | to actually do the RLHF stage.
00:24:31.920 | And about your question on the goals and vision,
00:24:33.320 | the goal was to get the best model in those dimensions.
00:24:36.880 | That's actually one thing very different,
00:24:39.160 | I can comment, compared to Llama 2.
00:24:41.960 | Llama 2, you know, as I said, we were nowhere.
00:24:44.800 | We built entirely, end-to-end, all the stack,
00:24:47.200 | from data annotation, contracts, methodology, protocol,
00:24:50.640 | and algorithms for RLHF at Meta.
00:24:52.440 | And we had to limit our scope.
00:24:54.560 | We were not able to work on everything.
00:24:56.880 | We focused mainly on helpfulness,
00:24:59.760 | following instructions, for Llama 2.
00:25:02.520 | And you can see that in the following months after Llama 2,
00:25:06.680 | a lot of open source models came out,
00:25:09.360 | distilling GPT-4 mainly,
00:25:12.200 | but obtaining better reasoning, math, coding chat models.
00:25:16.760 | And we didn't annotate at all for code,
00:25:18.840 | nor for reasoning or multilinguality.
00:25:22.000 | And one thing I'm quite proud of is that, with the early preview release
00:25:26.160 | we did of Llama 3 back in February, May, or March,
00:25:30.160 | I don't remember, it led quickly, almost instantly, to
00:25:34.160 | state-of-the-art results for the model size,
00:25:36.320 | almost competing with GPT-4 on the Arena leaderboard,
00:25:40.200 | where humans pit models against each other,
00:25:41.760 | compare two models and select their preference.
00:25:45.600 | And no one since then has been able to put out
00:25:48.480 | a Llama 3 model better than what we did
00:25:51.520 | on most of the domains,
00:25:52.920 | from code, reasoning, multilinguality, to helpfulness.
00:25:56.280 | So that's the sign that this time, as opposed to Llama 2,
00:25:58.520 | we tackled all those different aspects.
00:26:00.640 | - Do you have any other thoughts
00:26:01.800 | on the more synthetic data focused models,
00:26:05.000 | kind of like a Nemotron?
00:26:06.720 | I think folks were asking if you see that
00:26:08.760 | as an interesting direction too,
00:26:10.880 | kind of having specific synthetic data generation things.
00:26:14.240 | - I don't know about this model exactly,
00:26:15.720 | but I think Llama had better performance overall.
00:26:18.640 | I'm very bullish on synthetic data
00:26:21.120 | generation, but I think it just gets better
00:26:24.240 | when you have a better model.
00:26:25.680 | I'm not really bullish on having a model
00:26:27.880 | only for synthetic data generation.
00:26:29.720 | I understand the need of having bigger models,
00:26:32.600 | but then you can rationalize it:
00:26:34.920 | yeah, maybe people will not use them for inference,
00:26:36.840 | but to distill some specific knowledge
00:26:39.840 | into synthetic data.
00:26:41.000 | That narrative is, I think I totally agree with that,
00:26:45.600 | but having a model purely for that
00:26:48.120 | and not like good at other things,
00:26:50.200 | I don't think it's the case.
00:26:51.520 | - Makes sense.
00:26:52.360 | One of the architecture questions
00:26:53.640 | that I forgot to mention in there was,
00:26:55.480 | so just the architecture choice of like a very big,
00:26:57.920 | you know, 400B dense model.
00:26:59.960 | I actually honestly thought that maybe 175 or, you know,
00:27:04.040 | was kind of the peak, you know,
00:27:06.120 | whatever can fit on like an H100.
00:27:08.000 | So basically I think the common question that people have
00:27:10.200 | is like, why no MoE?
00:27:11.360 | In a way that Mistral and the others have gone in,
00:27:14.120 | you know, it seems like the trend has been MoEs
00:27:16.520 | and you guys have bucked the trend there.
00:27:18.520 | - I heard that question a lot.
00:27:20.240 | Different aspects there.
00:27:21.840 | Why not MoE in the future?
00:27:23.440 | The other thing is, I think a dense model
00:27:27.080 | is just one specific variation of the model,
00:27:30.960 | for one hyperparameter, of an MoE
00:27:32.960 | with basically one expert.
00:27:34.480 | So it's just a hyperparameter
00:27:36.560 | we haven't optimized a lot yet,
00:27:38.960 | but we have some stuff ongoing,
00:27:40.560 | and that's a hyperparameter we'll explore in the future.
00:27:43.760 | - Let's make sure we run through everything on post-training.
00:27:46.440 | You also had a recent tweet about RLHF
00:27:48.720 | versus imitation learning, explained in one tweet.
00:27:52.080 | So we'll put this in the show notes,
00:27:53.400 | but it's basically like two charts about doctor opinions.
00:27:57.760 | On one side, there's like,
00:27:58.880 | whether or not the suggestion is good
00:28:01.240 | from like a content perspective.
00:28:03.560 | And the chatbots rank really highly
00:28:05.120 | and the physicians are kind of like, you know,
00:28:06.680 | a bell curve, as you might imagine.
00:28:08.440 | But then the empathetic voting,
00:28:11.080 | most physicians are rated not empathetic
00:28:13.600 | or slightly empathetic versus all the model responses
00:28:16.720 | are rated very empathetic and empathetic at worst.
00:28:20.840 | You know, most people might look at it
00:28:22.320 | and not really get much from it,
00:28:23.680 | but obviously it resonated with you.
00:28:25.320 | Can you run people through like some of the choices
00:28:27.800 | you make in post-training to like optimize
00:28:29.920 | for one of the two and getting the best responses?
00:28:33.080 | - I think the tweet was about like the intuition
00:28:35.720 | of why reinforcement learning with human feedback works.
00:28:39.160 | When we started Llama 2,
00:28:41.680 | I had like this budget of annotations in millions of dollars
00:28:44.680 | and okay, what to do?
00:28:46.760 | I'm responsible for that.
00:28:47.680 | I'm accountable for a model at the end
00:28:49.200 | that can follow instructions
00:28:50.520 | and compete with GPT 3.5 at the time.
00:28:53.440 | What to do?
00:28:54.560 | You can annotate supervised fine-tuning data,
00:28:56.840 | which means a human creates a prompt
00:28:59.560 | and also writes themselves the answer expected from the model.
00:29:04.560 | So then you train on that in a supervised manner,
00:29:09.240 | but that's very classic and standard
00:29:11.960 | fine-tuning in machine learning.
00:29:14.480 | The other thing is reinforcement learning
00:29:16.320 | with human feedback where the annotators type a prompt,
00:29:18.880 | but this time you sample two different answers
00:29:20.840 | from your model and you ask the annotator
00:29:22.680 | which one he prefers.
00:29:24.320 | And then you will train on the preference, basically,
00:29:26.400 | to simplify.
00:29:27.920 | When you say you'll train on preferences over model outputs,
00:29:30.160 | that seems very weird and not really robust,
00:29:33.880 | training on synthetic data generated by the model.
00:29:36.440 | So I was like, let's annotate 100,000 or so
00:29:38.680 | of supervised fine-tuning data.
00:29:40.680 | And let's annotate a bit of preference data to do RLHF,
00:29:42.960 | because everyone is doing it.
00:29:44.480 | And we had this human evaluation
00:29:46.840 | a few weeks into the Llama 2 project,
00:29:49.160 | where our model was already better
00:29:52.560 | than the annotation from the humans.
00:29:55.160 | So you'd get a prompt,
00:29:56.440 | you check what the human would have annotated as an answer.
00:29:59.400 | You check what the model generates.
00:30:01.120 | And most of the time, the model was better.
00:30:04.200 | I was like, oh, maybe the annotators are pretty bad.
00:30:06.480 | Let's look at that.
00:30:07.920 | And no, the model was pretty good.
00:30:10.640 | And so I understood the intuition behind RLHF.
00:30:13.440 | Those models are already super good at some tasks.
00:30:15.960 | And with RLHF, then what you have is,
00:30:19.240 | imagine a distribution, a Gaussian distribution,
00:30:21.840 | which was basically the tweets.
00:30:23.840 | And you have on the left, bad outputs
00:30:26.920 | and on the right, good outputs.
00:30:28.560 | And the same with medical diagnostics from a doctor.
00:30:31.640 | You have good outputs on the right
00:30:33.080 | and the bad diagnostics on the left.
00:30:35.440 | But you have the distribution
00:30:37.080 | and when you collect all the diagnostics from doctors,
00:30:39.320 | hopefully it's mostly on the right.
00:30:41.000 | Most of the time, there are good diagnostics,
00:30:43.680 | but humans make mistakes, right?
00:30:46.000 | So there are bad diagnostics.
00:30:47.600 | On the left, you still have a few examples,
00:30:51.600 | which means the distribution's curve is not at zero there.
00:30:55.000 | And the same way for humans,
00:30:56.280 | they make mistakes when they annotate.
00:30:58.160 | And so, training on behavioral cloning to reflect humans,
00:31:01.880 | the model will also learn to make some mistakes,
00:31:03.920 | just like humans.
00:31:05.480 | And so you will have some bad outputs
00:31:07.040 | from the model from time to time, reflecting humans.
00:31:09.840 | And you cannot go beyond that
00:31:11.280 | if you train on human outputs.
00:31:13.000 | But now, if I ask a doctor to check a sample from my model,
00:31:17.720 | or a sample from two doctors,
00:31:19.160 | one diagnostic and another diagnostic,
00:31:21.160 | one is better than the other,
00:31:22.600 | it's easy for a doctor to say which one is better.
00:31:25.120 | The same way, if I sample from my model
00:31:26.960 | that learned a human distribution of answers,
00:31:29.440 | there's a bad one from time to time, like humans,
00:31:31.880 | but most of the time, good answers.
00:31:33.440 | And I ask a human to choose which one he prefers.
00:31:35.760 | Personally, I'm really bad at creating poems.
00:31:38.080 | The example I give a lot of time,
00:31:39.960 | try to write a haiku in three lines
00:31:42.160 | of about language models.
00:31:44.200 | I don't know about you,
00:31:45.240 | take like five seconds to think what you could come up with,
00:31:48.280 | I'm terrible.
00:31:49.400 | But yet, if I check two poems generated by a model
00:31:52.560 | or a human, I can tell which one I prefer.
00:31:54.480 | I'm good at discriminating.
00:31:56.040 | And because of that,
00:31:57.360 | you can have a model that flattens out the bad outputs
00:32:00.800 | and learns to shift only towards the best
00:32:03.040 | and better and better outputs.
00:32:04.600 | And you can even end up with superhuman abilities,
00:32:07.680 | since I'm bad at writing a poem,
00:32:09.960 | but I'm good at judging which one is better.
00:32:12.040 | So I can actually annotate data
00:32:13.760 | beyond my own skills at creating it.
00:32:16.800 | That's the magic of RLHF.
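(To make "train on preferences, not demonstrations" concrete, here is a minimal sketch of the standard pairwise Bradley-Terry reward-model loss from the RLHF literature; this is the generic formulation, not necessarily Meta's exact implementation.)

```python
# Minimal sketch of the pairwise (Bradley-Terry) reward-model loss used in the RLHF
# literature: instead of cloning human-written answers, a model is trained so that
# the preferred ("chosen") answer scores higher than the rejected one.
import torch
import torch.nn.functional as F

def preference_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: rewards would come from a reward model scoring (prompt, answer) pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])     # scores for the answers annotators preferred
rejected = torch.tensor([0.7, 0.5, -0.1])  # scores for the answers they rejected
loss = preference_loss(chosen, rejected)
print(f"pairwise preference loss: {loss.item():.3f}")
```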
00:32:18.760 | - Yeah, we have one episode, RLHF 201,
00:32:21.560 | with Nathan Lambert from the Allen Institute,
00:32:24.480 | who was at Hugging Face leading RLHF before.
00:32:26.960 | And he mentioned one of the things that makes RLHF work
00:32:29.600 | is that humans are not maybe great
00:32:31.880 | at creating a lot of things,
00:32:33.280 | but they're usually very good at giving an opinion
00:32:35.520 | on which one to they prefer.
00:32:37.720 | So they're able to actually annotate data
00:32:39.440 | of things they would never create from scratch.
00:32:42.160 | One question actually that he asked me to ask you,
00:32:44.640 | how much in post-training you attribute improvement
00:32:47.440 | to the RLHF side versus the instruction fine-tuning side,
00:32:51.720 | and maybe how you think about prioritizing the two
00:32:54.120 | and what areas they impact the most?
00:32:56.280 | - You mean between supervised fine-tuning,
00:32:58.400 | like supervised fine-tuning annotation
00:33:00.240 | and preference annotation?
00:33:01.760 | - Yeah.
00:33:02.600 | - So 100% to RLHF.
00:33:04.760 | In fact, that's quite interesting.
00:33:06.520 | You start, for Llama 2, with a pre-trained model,
00:33:09.480 | and you have to get to an instruction model, a chat model.
00:33:13.240 | Otherwise, the model is just like finishing sentences.
00:33:16.360 | So you need that to start RLHF.
00:33:18.000 | So we had to annotate like 10,000 examples.
00:33:20.440 | What did we do for Llama 3?
00:33:22.280 | You start with a new pre-trained model,
00:33:23.960 | and then you want, before starting the RLHF,
00:33:26.000 | to have now a chat model.
00:33:28.040 | That is not too bad.
00:33:29.080 | The option one was, let's do human annotation again,
00:33:32.640 | like SFT stage.
00:33:34.240 | But in fact, by the principle I said before,
00:33:37.360 | the annotation would actually be worse than Llama 2.
00:33:39.760 | So what we did is that we generated all the data
00:33:41.880 | on the prompts with Llama 2,
00:33:43.400 | and we applied basically the last round of Llama 2 we had
00:33:46.520 | to kick off and start Llama 3 post-training.
00:33:49.160 | So Llama 3 post-training doesn't have any human-
00:33:51.880 | written answers there, basically, almost.
00:33:54.120 | It's just leveraging pure synthetic data from Llama 2.
00:33:57.920 | - Do you have an intuition
00:33:58.960 | on which areas work better for which?
00:34:01.480 | For example, you mentioned the physicians are experts.
00:34:03.760 | What about maybe code, or, yeah,
00:34:06.000 | you also have multimodal work going on,
00:34:07.320 | like image generation,
00:34:09.080 | or does this apply to any modality, any subject?
00:34:12.240 | - That's an open research question.
00:34:13.840 | The intuition in general is like, for instance,
00:34:15.800 | for code, because this is factual,
00:34:17.720 | you can check if the code is correct or not.
00:34:19.720 | RLHF is not the way to go.
00:34:21.120 | You'd prefer to do supervised fine-tuning,
00:34:23.320 | with a human writing the code.
00:34:24.880 | But in fact, because humans make mistakes,
00:34:26.640 | because actually even in code
00:34:28.520 | there are some preferences that they might have,
00:34:30.640 | and maybe for some other reasons that we don't know,
00:34:33.720 | RLHF is so much more scalable.
00:34:35.680 | It costs less, it's easier, and it leads in general
00:34:38.360 | to just better performance.
00:34:40.040 | And maybe we can come up with a compromise.
00:34:42.320 | We actually suggested teacher-forcing in Llama 3,
00:34:46.280 | a new method that kind of fills the gap between,
00:34:48.960 | not teacher-forcing, sorry, teacher-critic.
00:34:52.000 | Teacher-forcing is a way to train the models.
00:34:53.960 | Teacher-critic is where it reconciles
00:34:56.240 | and unifies supervised fine-tuning and RLHF,
00:34:58.600 | so that when you do human preference
00:35:01.080 | and you have two outputs,
00:35:02.560 | but both are very bad, in code for instance,
00:35:05.720 | you will ask the human to edit the best answer
00:35:08.480 | to make it correct.
00:35:09.920 | So now you are doing SFT
00:35:11.760 | when both answers were really bad,
00:35:14.160 | so that you can get out of the local minimum of your model.
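(A rough sketch of how such a teacher-critic-style annotation flow could be wired up; the sampling and annotator callbacks are hypothetical placeholders, not Meta's actual tooling. Each prompt yields either a preference pair for RLHF, or an edited answer for SFT when both samples are judged too bad.)

```python
# Hypothetical sketch of a "teacher-critic"-style annotation flow: every prompt yields
# either a preference pair (for RLHF) or a human-edited answer (for SFT) when both
# model samples are too bad. `model.sample` and the annotator callbacks are placeholders.
from dataclasses import dataclass, field

@dataclass
class AnnotationBatch:
    preference_pairs: list = field(default_factory=list)  # (prompt, chosen, rejected)
    sft_examples: list = field(default_factory=list)      # (prompt, edited_answer)

def annotate(prompts, model, judge_preference, judge_quality, edit_answer) -> AnnotationBatch:
    batch = AnnotationBatch()
    for prompt in prompts:
        a, b = model.sample(prompt), model.sample(prompt)   # two candidate answers
        chosen, rejected = judge_preference(prompt, a, b)   # annotator picks the better one
        if judge_quality(prompt, chosen):                   # good enough: keep as preference data
            batch.preference_pairs.append((prompt, chosen, rejected))
        else:                                               # both bad: annotator edits the winner
            batch.sft_examples.append((prompt, edit_answer(prompt, chosen)))
    return batch
```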
00:35:17.560 | - I think this is like super promising,
00:35:19.440 | and it seems like there's just,
00:35:21.280 | well, do you have an idea?
00:35:23.200 | You know, you started with this question
00:35:24.440 | of how much scale you need.
00:35:26.160 | Do you now have a better idea?
00:35:28.160 | - No, what we know is it's not plateauing yet.
00:35:31.120 | - It's not plateauing yet, yeah.
00:35:32.120 | So just infinite amounts more. Well, you know,
00:35:35.160 | Scale AI and all the annotation providers
00:35:37.600 | are very happy to hear that.
00:35:39.040 | And so you mentioned at the start of the conversation
00:35:43.720 | about the AlphaGo moment,
00:35:45.200 | and I feel like this is very interesting to reflect on,
00:35:47.960 | right, like we're basically saying that,
00:35:50.520 | I think that one of the lessons from AlphaGo
00:35:52.600 | is that people thought that human interest in Go
00:35:55.160 | would be diminished
00:35:57.880 | because computers are better than humans,
00:36:00.040 | but then we have this sort of centaur model
00:36:01.840 | where like humans and computers are actually doing better
00:36:04.840 | than either humans or computers would be alone.
00:36:08.120 | And I think we're seeing that with this,
00:36:09.880 | what you were talking about, this RLHF improvement, right,
00:36:12.520 | that we're kind of building human preference into the model
00:36:15.360 | and the blending of the human preference
00:36:17.560 | and the model capability is actually doing better
00:36:20.120 | than we could on our own.
00:36:21.800 | I just think it's pretty fascinating.
00:36:23.680 | - It is fascinating.
00:36:25.120 | - The other thing is RLHF came from the alignment community
00:36:28.840 | and I think there's a lot of perception
00:36:30.880 | that maybe it's due to safety concerns,
00:36:33.240 | but I feel like it's like really over the past
00:36:35.560 | like two, three years expanded to just,
00:36:38.280 | this produces a better model period,
00:36:40.440 | even if you don't really,
00:36:41.800 | are not that concerned about existential risk.
00:36:43.960 | I always feel like it's so interesting to see this,
00:36:47.520 | like people who take alignment super seriously,
00:36:50.080 | they're the first to consider super alignment.
00:36:52.440 | And now we're considered like,
00:36:54.080 | I'm almost thinking about this as like super quality,
00:36:56.520 | that we are training models
00:36:58.280 | that are higher quality than humans.
00:37:00.400 | And it's not really about alignment so much as like,
00:37:03.480 | we now see that this is actually possible.
00:37:06.280 | - Yeah.
00:37:07.120 | - And it's not even for alignment purposes.
00:37:08.760 | We just think it's like better at reasoning,
00:37:10.480 | better at knowledge, better at everything.
00:37:11.960 | - Well, I don't know how much better yet it is on those,
00:37:14.400 | but clearly it's super human on some writing skills
00:37:18.040 | and it's super useful.
00:37:19.160 | I think that's great, to be honest.
00:37:20.760 | - Yeah, perhaps we can transition to evals.
00:37:23.520 | We've had some questions about the 400B details
00:37:27.400 | that we want to disclose.
00:37:28.600 | By the time this podcast comes out,
00:37:30.400 | we'll have disclosed them.
00:37:31.720 | Yeah, I think last time you disclosed like the evals
00:37:35.240 | while you were still training,
00:37:37.040 | what should people know about the high level headlines
00:37:42.560 | for the new Llama 3?
00:37:42.560 | - At a high level,
00:37:43.560 | it's the best open source model ever.
00:37:46.600 | It's better than GPT-4.
00:37:49.440 | I mean, what version?
00:37:50.720 | But by far, compared to the version originally released,
00:37:54.800 | even now, I think there's maybe the latest Claude 3.5
00:37:59.280 | and GPT-4o that are outperforming it.
00:38:01.640 | And that's it, period.
00:38:03.040 | So for the 405B, that's a flagship,
00:38:06.360 | that's a pretty good model.
00:38:07.800 | Not yet the number one.
00:38:08.920 | We still have a journey to get there.
00:38:11.200 | For the 70B and 8B,
00:38:13.520 | they are like world-class models of their size
00:38:16.120 | among general models.
00:38:17.360 | - And are the benchmark numbers
00:38:19.120 | from the initial checkpoint still right?
00:38:21.440 | So the April 15 checkpoint,
00:38:24.120 | MMLU on Instruct is like 86,
00:38:27.280 | GPQA, 48,
00:38:28.680 | HumanEval, 84,
00:38:30.160 | GSM8K, 94,
00:38:31.840 | MATH, 57.8.
00:38:33.640 | Is this still roughly the same performance?
00:38:35.640 | Or, you know, I haven't seen the numbers yet either.
00:38:38.160 | We're just breaking the news right now, so.
00:38:40.360 | - No, it's roughly that.
00:38:42.240 | - Awesome.
00:38:43.080 | So talking about evals,
00:38:44.680 | we just had an episode with Clémentine from Hugging Face
00:38:47.320 | about leaderboards and arenas and evals and benchmarks
00:38:51.000 | and all of that.
00:38:52.120 | How do you think about evals during the training process?
00:38:55.840 | And then when the handoff happens,
00:38:57.760 | do you already know exactly what you want to improve?
00:39:00.600 | And I know that, for example,
00:39:01.920 | to improve like maybe an arena score,
00:39:03.480 | you need different than like an MMLU score.
00:39:05.840 | How do you think about prioritizing
00:39:07.400 | the post-training improvement based on benchmarks?
00:39:10.040 | - That's a super hard and good question.
00:39:13.320 | There's no good answer.
00:39:14.160 | I mean, evals is an open research problem,
00:39:16.880 | like in particular when you're trying to tackle
00:39:19.040 | so many capabilities.
00:39:20.520 | And, you know, it's also like,
00:39:22.440 | as soon as you're trying
00:39:23.760 | to push numbers on a benchmark,
00:39:26.960 | it stops being a good benchmark,
00:39:28.480 | because then you don't know if you're overfitting it
00:39:30.440 | and whether it will transfer to similar capabilities.
00:39:33.560 | So evaluation for language models,
00:39:37.320 | in particular on post-training,
00:39:39.080 | is a very hard problem.
00:39:42.040 | We tackle that by playing with different methods
00:39:45.240 | like reward models, model-as-a-judge evaluation,
00:39:49.200 | having a diversity of prompts,
00:39:51.920 | diversity of benchmarks as well
00:39:53.240 | for a lot of different capabilities.
00:39:55.200 | That limits the possibility of hacking them, of course.
00:39:58.280 | We do also a lot of human evaluation.
00:40:00.960 | I also do a lot of model testing, quality analysis,
00:40:04.200 | like testing some prompts myself.
00:40:06.200 | I feel it was much easier during Lama 2
00:40:09.640 | when the model was worse than it is today.
00:40:12.800 | Now the model is getting so good
00:40:15.760 | that it's hard to get to some prompts
00:40:18.200 | to break them and to compare models
00:40:20.040 | and see the edge cases.
00:40:21.760 | So it's getting harder.
00:40:23.160 | And a great way also to compare models is, you know,
00:40:25.600 | through the different rounds we have done for RLHF.
00:40:29.040 | Every time we upload a new model,
00:40:30.480 | for all the annotation we are doing,
00:40:32.520 | we have the win rate between the previous model
00:40:34.160 | and the new model by just sampling,
00:40:36.920 | for every prompt we annotate,
00:40:38.680 | sample A with the old model
00:40:41.720 | and sample B with the new model.
00:40:43.120 | And so we can calculate automatically a win rate.
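A minimal sketch of how such a win rate can be computed from pairwise preference annotations. The annotation format below is a hypothetical illustration, not Meta's internal tooling:

```python
from collections import Counter

def win_rate(annotations):
    """Fraction of annotations where the new model's sample is preferred.

    `annotations` is a list of dicts like {"prompt": ..., "preferred": "new" | "old" | "tie"}
    (a hypothetical format); a tie counts as half a win for each side.
    """
    counts = Counter(a["preferred"] for a in annotations)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["new"] + 0.5 * counts["tie"]) / total

# Example: six prompts annotated during one RLHF round.
example = [
    {"prompt": "p1", "preferred": "new"},
    {"prompt": "p2", "preferred": "new"},
    {"prompt": "p3", "preferred": "old"},
    {"prompt": "p4", "preferred": "tie"},
    {"prompt": "p5", "preferred": "new"},
    {"prompt": "p6", "preferred": "old"},
]
print(f"win rate of the new model: {win_rate(example):.2f}")  # 0.58
```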
00:40:45.360 | - Interesting.
00:40:46.200 | What are areas that you had to work the hardest
00:40:48.880 | to catch up to like the private models?
00:40:51.640 | Maybe like there's, you know,
00:40:52.880 | not as good public data or whatnot,
00:40:54.800 | or is performance improvement just kind of even
00:40:57.640 | across the spectrum?
00:40:59.960 | - Honestly, all of them.
00:41:01.720 | We were behind all of them, between Llama 2 and GPT-4.
00:41:06.720 | I mean, it's different challenges every time,
00:41:10.440 | like being good at code or reasoning
00:41:12.440 | is something we didn't do at Lama 2,
00:41:14.160 | so we had to build everything from scratch.
00:41:16.120 | Improving on helpfulness,
00:41:18.120 | which is one of the main dimensions
00:41:20.280 | that people look at, I think, in the Arena,
00:41:22.600 | which is, by the way, a very interesting evaluation.
00:41:25.360 | Because when we did the preview,
00:41:27.120 | and I don't know yet what will be the results
00:41:29.000 | for this new Lama 3,
00:41:30.400 | we ended very high in this blind test leaderboard.
00:41:34.640 | And to be honest, I didn't expect that.
00:41:37.760 | I knew we had good results internally,
00:41:39.720 | but how that will transfer to perception from the community,
00:41:43.640 | people like using it in practice
00:41:45.280 | and comparing it to the other models,
00:41:47.280 | I didn't expect that positive feedback,
00:41:50.440 | that high ELO score on this benchmark.
00:41:54.000 | It doesn't say like everything.
00:41:55.920 | As I said before, which is also interesting,
00:41:57.840 | because it's a community that judge the prompts
00:42:00.440 | and create the prompts and judge the answers.
00:42:03.040 | We are limited, we are not like good to do that.
00:42:05.720 | And so it gives you a very good indicator
00:42:07.920 | of how good, helpful,
00:42:09.360 | how on the main core of the distribution,
00:42:12.760 | simple prompts about the tone of the model
00:42:15.000 | compared to the others,
00:42:16.080 | but for much more complex problems,
00:42:17.480 | much more intelligence,
00:42:19.080 | reasoning coding of complex stuff,
00:42:21.800 | it doesn't tell the full story.
00:42:24.360 | You know, like while we had the 70B preview
00:42:27.240 | at the level of GPT-4,
00:42:28.920 | even better at the time.
00:42:30.760 | I think it was partly true,
00:42:32.160 | but clearly we were not at like GPT-4 level
00:42:34.200 | in code or reasoning.
00:42:35.920 | We are now.
00:42:36.800 | - There's some conversation about like the math score.
00:42:40.360 | Apparently like the next GPT, GPT-next or whatever,
00:42:43.040 | is in the region of 90, which is a big, big jump
00:42:45.640 | from the current state of the art.
00:42:48.320 | It will be interesting.
00:42:49.400 | One of our previous guests,
00:42:50.960 | rounding out the topics on just potential models,
00:42:53.440 | areas of development and evals,
00:42:55.120 | Clementine is looking for a confidence estimation
00:42:58.480 | or uncertainty benchmark.
00:43:00.840 | One of our previous guests, Brian Bischoff,
00:43:02.600 | is also asking about like,
00:43:04.320 | how do we think about evals for practical things
00:43:07.160 | like confidence estimation, structured output,
00:43:10.200 | stuff like that.
00:43:11.360 | - Yeah, I think we lack actually of such evaluations.
00:43:14.720 | One number I was asking the team, like two days ago,
00:43:17.360 | to report at some point is,
00:43:19.480 | okay, we have this accuracy on MMLU,
00:43:22.160 | on whatever, on MATH and GSM8K.
00:43:25.880 | What if we change a bit the prompt
00:43:27.400 | and instead of telling the model you have this question,
00:43:30.000 | you have to answer A, B, C, or D.
00:43:32.000 | What if we tell the model you have to answer A, B, C, or D,
00:43:35.120 | or you don't know.
00:43:36.480 | And maybe the accuracy will be a bit lower,
00:43:39.880 | but I'm curious to see if between some models
00:43:41.880 | we have different calibrations,
00:43:43.400 | where maybe model A have 50% correct,
00:43:47.000 | model B has 50% correct,
00:43:48.960 | but model A answered 100% of the questions.
00:43:52.600 | So 50% are not correct.
00:43:54.560 | Model B actually answered only 60% of the questions.
00:43:57.400 | So 40% of the time, it said, I don't know.
00:43:59.960 | I prefer model B.
00:44:01.160 | And we are not like reflecting that in evaluations.
00:44:03.960 | - I think this is very relevant
00:44:05.760 | for post-training in particular,
00:44:07.120 | because it seems that the general consensus
00:44:09.680 | is that base models are more calibrated
00:44:12.640 | than post-train models, right?
00:44:14.480 | Something like that.
00:44:15.400 | - Exactly.
00:44:16.240 | - That seems to be the research from OpenAI as well.
00:44:18.160 | I don't know the degree of this,
00:44:20.160 | and maybe we can invert it, right?
00:44:21.640 | Maybe post-training can help to increase calibration
00:44:24.320 | rather than decrease it.
00:44:25.760 | I feel like this is a little bit
00:44:27.840 | of being too similar to humans,
00:44:30.400 | because humans are not calibrated very well.
00:44:34.120 | - Yeah, and that's the goal of post-training,
00:44:35.600 | I think, to make models more calibrated,
00:44:38.160 | to not be biased toward answering A, B, C, or D
00:44:41.280 | as often as possible,
00:44:42.960 | to follow the uniform distribution.
00:44:44.960 | - And on the structured output tool calling side,
00:44:47.520 | do you think that it's not an explicit part of the evals?
00:44:51.680 | Obviously, you worked on Toolformer
00:44:53.600 | and the language augmentation.
00:44:56.200 | Do you encourage the open-source community
00:44:58.160 | to fine-tune Lama3 to do tool calling,
00:45:01.720 | or do you want to just have that in the model from day one?
00:45:04.960 | - We have that from day one.
00:45:06.480 | Good news for the community.
00:45:07.800 | We are state-of-the-art there.
00:45:09.720 | I think the model will be pretty good at that.
00:45:12.920 | We have a lot of gems about tools in the paper,
00:45:16.160 | but the model is fine-tuned to do tool usage,
00:45:18.960 | to zero-shot function calling.
00:45:21.720 | There are some system prompts,
00:45:22.840 | so if you tell the model it can use search
00:45:26.000 | and image generation, it can do a lot of stuff,
00:45:28.480 | like code execution as well,
00:45:29.960 | even in a multi-message way,
00:45:32.600 | so almost multi-step agents,
00:45:36.000 | which kind of sparks our agent work.
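For intuition, here is a generic sketch of the orchestration side of zero-shot function calling: detect a tool call in the model's output, dispatch it, and feed the result back as a new message so the model can keep going over multiple steps. The JSON-in-text convention and the stub tools are assumptions for illustration; the actual Llama 3 tool-call format is described in the paper and model card:

```python
import json

# Stub tools standing in for real search / code execution backends.
TOOLS = {
    "search": lambda query: f"(stub) top results for {query!r}",
    "python": lambda code: f"(stub) executed: {code!r}",
}

def handle_model_output(text):
    """If the model emitted a tool call like {"tool": ..., "arguments": {...}},
    dispatch it and return a tool message to append to the conversation;
    otherwise treat the text as the final assistant answer."""
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        call = None
    if not isinstance(call, dict) or call.get("tool") not in TOOLS:
        return {"role": "assistant", "final": True, "content": text}
    result = TOOLS[call["tool"]](**call.get("arguments", {}))
    # Feeding this back lets the model continue in a multi-message, multi-step way.
    return {"role": "tool", "final": False, "content": result}

print(handle_model_output('{"tool": "search", "arguments": {"query": "Llama 3 context length"}}'))
print(handle_model_output("The context length is 8K tokens."))
```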
00:45:38.040 | - Okay, you talked about agents,
00:45:39.160 | so I guess we should probably mention
00:45:40.840 | the work on agent stuff.
00:45:42.200 | And you also, in our pre-conversation,
00:45:44.480 | mentioned that you're already starting work on Lama4.
00:45:46.840 | What does agents have to do with Lama4?
00:45:48.840 | How does your work on Gaia inform all this work?
00:45:51.360 | - Yeah, so we published one year ago,
00:45:53.280 | Gaia General Assistant Benchmark.
00:45:55.920 | That followed a direction I really like pursuing.
00:45:59.320 | I mean, everyone passionate about AI
00:46:01.280 | and trying to build Jarvis will go there.
00:46:03.760 | So I did Toolformer and the survey on augmented models.
00:46:07.880 | In fact, reflecting back, I was,
00:46:10.120 | okay, we have Galactica, we have Lama1,
00:46:14.360 | we have Toolformer,
00:46:15.760 | and there's like GPT-3.5 at the time.
00:46:18.800 | If you don't have a good instruct model
00:46:21.240 | to follow instructions,
00:46:22.800 | the extension and the future of Toolformer is limited.
00:46:26.520 | So we need to work on that, and we did Lama2,
00:46:28.960 | and then now Lama3.
00:46:30.400 | And it's very interesting.
00:46:31.760 | On General Assistant Benchmark, so Gaia,
00:46:34.480 | agents powered by language models
00:46:36.680 | perform at zero with GPT-3.5
00:46:38.960 | and at something very significant,
00:46:41.520 | like 30, 40, 60% with GPT-4.
00:46:45.600 | So there's a gap of intelligence here.
00:46:47.760 | And I think this gap of intelligence,
00:46:49.280 | this threshold that you pass
00:46:50.880 | in terms of zero-shot function calling,
00:46:53.480 | following complex instruction
00:46:54.920 | that can span over a page of constraints,
00:46:58.560 | those things that make today's agents,
00:47:02.040 | with ReAct loops, pre-planning,
00:47:04.520 | multi-step reasoning, function calling,
00:47:07.640 | work in practice, is this gap of intelligence.
00:47:10.760 | So now that we have Lama3,
00:47:12.440 | I'll be back to agents.
00:47:14.080 | I expect some incremental and significant progress
00:47:16.600 | on pre-planning, post-planning,
00:47:18.360 | but I'm really hopeful that we can gain
00:47:21.120 | some order of magnitude of scaling
00:47:23.280 | by interconnecting well models into agents
00:47:27.840 | as a more complex system that can do planning,
00:47:30.320 | that can do backtracking,
00:47:32.640 | that can take actions,
00:47:35.240 | navigate the web, execute code.
00:47:37.440 | - Okay, there's a lot there.
00:47:39.680 | When you say integrating world models,
00:47:42.000 | is there anything from JEPA?
00:47:43.520 | Is that something that we're talking about
00:47:46.400 | or is that a different line of research?
00:47:48.560 | - No, not directly.
00:47:50.120 | That's the same goal, I would say,
00:47:52.760 | but JEPA is very, very fundamental research,
00:47:56.280 | which has some promising early results.
00:47:58.760 | And what I was looking right now
00:48:01.000 | on state-of-the-art results on Gaia,
00:48:02.960 | there's a leaderboard, by the way,
00:48:04.720 | you mentioned Clementine before,
00:48:06.200 | she contributed to Gaia as well,
00:48:08.080 | and Hugging Face put a leaderboard there on their website.
00:48:11.960 | There's some state-of-the-art results.
00:48:14.520 | What is interesting is like GPT-4 alone has 0%,
00:48:19.360 | or like 5%, I think, on level one,
00:48:22.000 | and there are three levels of difficulty.
00:48:24.040 | But OS-Copilot, and AutoGen from Microsoft,
00:48:28.440 | and recently Hugging Face agents,
00:48:30.600 | obtained up to 60% on level one.
00:48:33.920 | So connecting an LLM to an agent
00:48:36.080 | that can do all those things,
00:48:37.720 | moves new capabilities much further forward.
00:48:40.960 | This is kind of a breakthrough.
00:48:42.160 | And those models are purely based
00:48:44.920 | on instruction tuning models, following instructions,
00:48:48.520 | where like you have an orchestrator,
00:48:50.040 | and you say to your LLM, okay, this is your task,
00:48:53.240 | you have access to these tools, you can navigate the web,
00:48:56.200 | can you do a plan of what you should do?
00:48:58.200 | And then, okay, that's the plan.
00:49:00.280 | Now, execute the first step.
00:49:02.360 | Did you manage to succeed for the first step?
00:49:05.080 | Or do you want to rethink your plan
00:49:08.000 | because you ran into a dilemma?
00:49:10.040 | And you have kind of all this orchestration
00:49:12.600 | by system prompting, instruction following,
00:49:15.960 | and just that, which is quite suboptimal,
00:49:18.880 | and probably later you need to go to latent space
00:49:21.720 | and more JEPA-style approaches.
00:49:22.920 | But just that is getting us
00:49:25.040 | to some really impressive results already.
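A minimal sketch of that plan / execute / revise loop driven purely by prompting. The model class here is a canned stand-in for any instruction-tuned chat API, and the "FINAL ANSWER" convention is an assumption chosen for the example, not a specific framework:

```python
class FakeInstructModel:
    """Stands in for an instruction-tuned model API; returns canned replies so the loop runs."""
    def __init__(self):
        self.calls = 0

    def chat(self, messages):
        self.calls += 1
        if self.calls == 1:
            return "Plan: 1) search the web  2) extract the key facts  3) write the answer"
        if self.calls < 4:
            return f"(executed step {self.calls - 1}, tool call succeeded)"
        return "FINAL ANSWER: here is the result of the task."

def run_agent(model, task, max_steps=8):
    messages = [
        {"role": "system", "content": "You are an agent with web browsing and code execution. "
                                      "First write a numbered plan, then execute it step by step."},
        {"role": "user", "content": task},
    ]
    plan = model.chat(messages)                                # ask for a plan first
    messages.append({"role": "assistant", "content": plan})
    for step in range(1, max_steps + 1):
        messages.append({"role": "user",
                         "content": f"Execute step {step}. If the last step failed or you are "
                                    "stuck, revise the plan before continuing."})
        action = model.chat(messages)                          # next action or final answer
        messages.append({"role": "assistant", "content": action})
        if action.startswith("FINAL ANSWER:"):
            return action
    return "stopped after max_steps without a final answer"

print(run_agent(FakeInstructModel(), "Find the Gaia level-1 state of the art and summarize it."))
```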
00:49:28.080 | - And do you see the planning and review
00:49:31.480 | to always be needed in the future?
00:49:33.120 | This is kind of like Andrej Karpathy's idea
00:49:34.840 | of like more tokens equal more thinking.
00:49:36.960 | So like the more you're having it write tokens
00:49:39.440 | and like think about the outcome,
00:49:41.080 | the better the result you're probably gonna get.
00:49:43.680 | Do you think that's always gonna be the case?
00:49:45.200 | Or that in the future, like the model,
00:49:47.680 | you can just say this is the task
00:49:49.080 | and then I'll just return the answer directly
00:49:51.200 | and do all of that in the latent space, so to speak?
00:49:54.600 | - Right.
00:49:55.440 | I think in the future, it should be,
00:49:57.360 | it should hopefully go more toward, this is a task
00:50:00.280 | and I return the answer.
00:50:01.640 | But we need to teach that to the model, to train that,
00:50:03.960 | which is far from now.
00:50:05.400 | A very medium- to long-term direction
00:50:08.240 | that could be really relevant here
00:50:09.600 | is thinking in latent space.
00:50:12.200 | I know some early works are doing that.
00:50:14.280 | And that's a way probably to move to.
00:50:17.680 | First you think,
00:50:18.960 | and then you don't have to write all the tokens.
00:50:21.160 | Like it's in your head.
00:50:22.160 | It doesn't have to be as constrained
00:50:24.240 | as plain text from an LLM.
00:50:26.000 | And once you have done your thoughts,
00:50:27.840 | you can just write the final answer or take an action.
00:50:30.520 | - Just a commentary on that.
00:50:31.600 | Anthropic actually cheats at this right now.
00:50:34.160 | If you look at the system prompt in Claude Artifacts,
00:50:37.320 | it actually has a thinking section
00:50:38.760 | that is explicitly removed from the output,
00:50:42.520 | which is, I mean, they're still spending the tokens,
00:50:45.000 | but instead of training for it, at the prompting level,
00:50:49.640 | you can simulate this.
00:50:51.120 | And then at ICLR, there was the pause token,
00:50:53.880 | the backtrack token.
00:50:54.880 | I feel like all these are token level stopgap measures.
00:50:59.520 | I feel like it's still not the final form.
00:51:01.560 | Like we still need to have at the architecture level,
00:51:04.840 | some kind of variable inference length thing
00:51:08.680 | that lets you actually think in latent space
00:51:10.320 | like you're talking about.
00:51:11.160 | I don't know if there's any papers
00:51:12.560 | that you're thinking about.
00:51:14.040 | - No, but that's interesting
00:51:15.080 | because that's what we said at the beginning
00:51:16.720 | of the discussion.
00:51:18.880 | If you remember, like we are lacking flexibility
00:51:21.640 | in the pre-training architecture, transformers,
00:51:24.280 | where we spend the same amount of compute per token.
00:51:27.680 | And so because of that, how can you like mitigate this
00:51:31.280 | by generating more tokens?
00:51:32.920 | So more thoughts, more compute,
00:51:35.080 | because you have only access to this dimension.
00:51:37.320 | Ideally, you want an architecture that will
00:51:39.880 | naturally enable this to emerge, basically.
00:51:43.000 | - Any papers come to mind there
00:51:44.240 | that you would recommend people read,
00:51:45.440 | or is this like completely new science
00:51:47.400 | that we have to do?
00:51:50.000 | - No, I mean, it's early science.
00:51:52.400 | I don't know any work that managed to get there.
00:51:54.960 | I know, like, for instance, the Universal Transformer
00:51:58.480 | had this idea of a number n,
00:52:00.600 | where you can compute on the layer n times,
00:52:05.120 | with n being decided by the architecture itself
00:52:08.080 | with respect to the complexity of the token.
00:52:09.960 | I think there's a paper from DeepMind
00:52:11.840 | on mixture of experts with like a skip layer.
00:52:15.040 | Mixture of, is it this one?
00:52:17.160 | - Mixture of depths.
00:52:18.200 | - I don't, I'm not sure it's this one, maybe.
00:52:20.240 | But like, basically the idea was that
00:52:22.160 | with a mixture of expert,
00:52:23.120 | you have an expert that is an identity matrix
00:52:25.480 | that you can skip.
00:52:26.760 | And so like you can,
00:52:28.680 | but you know, it's early works, very preliminary works.
00:52:32.120 | Like for instance, I haven't yet seen a lot of work
00:52:33.840 | putting the compute of generating a token into the loss.
00:52:38.000 | That's gonna be interesting when we start to do that.
00:52:40.240 | - I know we're getting up on time,
00:52:42.160 | but we had just a few more questions.
00:52:44.480 | We definitely want to ask you.
00:52:46.000 | So as you think about,
00:52:47.000 | there were reports that Llama 4
00:52:48.640 | started training in June.
00:52:50.400 | If you think about the evolution of the models,
00:52:52.000 | I think up until Llama3,
00:52:53.680 | you know, with MetaAI and some of these things,
00:52:55.320 | I'm like, it makes sense
00:52:56.720 | that they want to build their own models
00:52:57.920 | and their multimodal models.
00:52:59.120 | Sounds like Llama4, maybe a lot of the focus
00:53:02.000 | will also be a more agentic behavior and have all of this.
00:53:05.000 | I'm curious, like at what point it's like, okay,
00:53:07.240 | this is a research direction that we still want to take,
00:53:09.400 | even though, you know,
00:53:10.400 | it doesn't fit right into the product.
00:53:11.960 | Like what's that discussion internally
00:53:13.760 | about what to focus on as you keep scaling these models?
00:53:16.800 | - Yeah, I think it's a balance, you know, between,
00:53:19.560 | well, we want to be number one.
00:53:21.520 | Mark wants to be number one there.
00:53:23.640 | And there's this understanding also that, you know,
00:53:26.360 | this is a critical technology in the future.
00:53:29.520 | And even if, nowadays, that research
00:53:32.640 | is not like directly intersecting product,
00:53:35.840 | we don't want to be late in the game as we were in the past.
00:53:39.160 | So that's the first thing.
00:53:40.840 | The second thing is,
00:53:41.680 | and we think that this technology will change the world.
00:53:44.480 | We want to work towards AGI and AGI will change the world.
00:53:49.040 | And if Meta develop an AGI,
00:53:51.840 | it will probably intersect pretty easily the products.
00:53:55.160 | Now, the thing is, with that in mind,
00:53:58.200 | we have to balance with product needs.
00:54:00.400 | And there's always this ongoing discussion
00:54:02.240 | and this balance to find,
00:54:03.800 | between a flagship model
00:54:05.840 | and maybe a model that will be
00:54:07.880 | more adapted to product needs.
00:54:10.120 | And it doesn't have to be decorrelated.
00:54:12.560 | As I said before,
00:54:13.400 | like you can also leverage the big models
00:54:14.880 | to distill some capabilities into a smaller one
00:54:18.320 | that will maybe be more suited to the product.
00:54:20.880 | There's always this back and forth.
00:54:22.720 | There's also the fact that the product
00:54:24.960 | kind of feeds ideas back to the research,
00:54:26.920 | evaluations that are grounded in actual use cases,
00:54:29.880 | that we can also measure ourselves with respect to,
00:54:32.280 | is there some progress
00:54:33.640 | or is it just on an academic benchmark?
00:54:36.160 | - So one, before we transition off,
00:54:38.400 | I think there's the hidden side maybe of these LLMs
00:54:41.160 | that most people don't think about,
00:54:42.760 | which is the tokenizer and the vocab size,
00:54:46.120 | especially the size of it.
00:54:47.480 | So Llama 3 is a 128K-token vocab tokenizer.
00:54:52.240 | GPT-4 was 100K, 4o is 200K.
00:54:56.520 | How should people think about the impact that it has?
00:54:59.320 | So basically like, I mean, the TLDR is like in the vocab,
00:55:02.680 | you have this kind of like concepts represented as tokens.
00:55:05.280 | So usually the larger the vocab size,
00:55:07.560 | the more nuanced the model can be
00:55:09.880 | about thinking about different things.
00:55:11.680 | What are the scaling laws of those tokenizers?
00:55:14.120 | You know, is 128K kind of like very large
00:55:17.040 | and it doesn't really matter?
00:55:17.960 | Like, do you want to double it?
00:55:19.640 | Like any thoughts there would be great.
00:55:21.760 | - There's a lot of dimensions to take into account here.
00:55:23.800 | I think the first thing obvious to say is LLAMA3
00:55:27.160 | compared to LLAMA2 is multilingual,
00:55:29.000 | has multilingual capabilities.
00:55:30.560 | We worked on that.
00:55:31.920 | And so, because you have languages
00:55:33.680 | that are not just Latin languages like English,
00:55:35.960 | there's a lot of different characters.
00:55:38.440 | You want to include them to represent special words there,
00:55:42.400 | and so you need to have a bigger vocabulary size.
00:55:45.520 | That's the obvious thing,
00:55:47.240 | which is also probably why GPT-4o
00:55:49.440 | has a much bigger vocabulary
00:55:52.760 | as it's like naturally multilingual,
00:55:55.200 | multimodal in speech.
00:55:58.040 | So that's why we went from a 32K to a 128K vocabulary size.
00:56:03.040 | The interesting thing I think to discuss about tokenizer
00:56:06.200 | is about scaling laws related to that.
00:56:09.520 | If you increase your vocab size,
00:56:13.800 | well, you have a bigger matrix
00:56:15.480 | which takes longer to compute.
00:56:17.240 | It depends on the model size,
00:56:19.400 | but for a small model,
00:56:20.480 | it has a much bigger impact than a bigger model.
00:56:23.560 | So increasing that,
00:56:26.520 | said otherwise:
00:56:28.120 | the 128K vocabulary size
00:56:30.280 | is the same for the 8B, 70B, or 405B,
00:56:33.840 | but relatively, in percentage
00:56:35.360 | of the total number of weights,
00:56:37.040 | for the 8B it's much more than for the 405B;
00:56:39.880 | it weighs more compared to the total number of weights.
00:56:42.880 | So that has more impact in terms of training speed there.
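A back-of-the-envelope sketch of that relative cost, assuming untied input and output embeddings and the publicly reported Llama 3 hidden sizes (treat both as approximations rather than official accounting):

```python
VOCAB = 128_256  # Llama 3 vocabulary size

def embedding_params(hidden_dim, tied=False):
    # Input embedding, plus the output head if embeddings are untied.
    return VOCAB * hidden_dim * (1 if tied else 2)

for name, hidden, total in [("8B", 4096, 8e9), ("70B", 8192, 70e9), ("405B", 16384, 405e9)]:
    emb = embedding_params(hidden)
    print(f"{name}: ~{emb / 1e9:.2f}B embedding params, ~{emb / total:.1%} of all weights")
# The same 128K vocabulary is roughly 13% of the 8B model but only about 1% of the 405B,
# which is why changing the tokenizer costs relatively more training speed on small models.
```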
00:56:46.400 | But what is interesting is with a bigger vocabulary,
00:56:49.800 | for the same text,
00:56:51.520 | you have fewer tokens, right?
00:56:54.360 | And so you can train your model
00:56:56.560 | on the same amount of knowledge with fewer steps.
00:57:00.280 | So for the same compute,
00:57:02.120 | you can see more knowledge if you don't repeat epochs.
00:57:04.800 | That's one cool thing.
00:57:06.080 | The second thing is at inference time,
00:57:08.640 | you know that the context length is not measured
00:57:11.600 | in the size of the text, but in the number of tokens.
00:57:13.920 | And so you can compress more,
00:57:15.360 | so that now with a bigger tokenizer,
00:57:19.400 | 128K, more vocabulary,
00:57:21.800 | you can get to longer text
00:57:23.720 | for the same number of tokens.
00:57:26.520 | An 8K context basically, or 128K,
00:57:29.760 | with this tokenizer now means about 30% less to encode for the same text.
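A toy calculation of the same point; the characters-per-token ratios below are made-up illustrative values, not measurements of any real tokenizer:

```python
def chars_in_context(context_tokens, chars_per_token):
    # How much raw text fits in a fixed token budget for a given compression ratio.
    return context_tokens * chars_per_token

old = chars_in_context(8_192, 3.5)   # hypothetical 32K-vocab tokenizer
new = chars_in_context(8_192, 4.5)   # hypothetical 128K-vocab tokenizer
print(f"old tokenizer: ~{old:,.0f} characters fit in 8K tokens")
print(f"new tokenizer: ~{new:,.0f} characters fit in 8K tokens ({new / old - 1:.0%} more text)")
# Equivalently, the same document needs 1 - 3.5 / 4.5, about 22% fewer tokens, with the new tokenizer.
```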
00:57:34.760 | - How are tokenizer vocabs built?
00:57:37.600 | I actually don't know that.
00:57:38.480 | What's the work that goes into it?
00:57:39.680 | And then like, why are people using smaller ones?
00:57:43.080 | Is it harder to make them?
00:57:44.080 | Or is it just about some of the things you mentioned
00:57:46.520 | around scaling the training and all of that?
00:57:49.120 | - Oh, no, there's different methods,
00:57:51.120 | but one became quite standard.
00:57:53.360 | Although it could change in the future.
00:57:54.880 | - BPE?
00:57:55.720 | - Yeah, exactly.
00:57:56.560 | - Well, BPE is for text.
00:57:58.080 | I don't know about multimodal vocab.
00:58:00.640 | That's, I haven't read anything about.
00:58:02.640 | - Yeah, let's keep that question.
00:58:04.920 | I'm not expert there.
00:58:05.760 | And I don't remember exactly what they ended to do.
00:58:08.200 | - Now that you're saying this, right?
00:58:09.360 | Okay, so now we have 100K vocab, 200K vocab.
00:58:13.680 | Do we see a million vocab?
00:58:15.720 | Do we see infinity, which is no tokenizer?
00:58:18.560 | You know, like what's the natural limit of tokenization?
00:58:22.000 | - Yeah, that's a good question.
00:58:23.080 | I don't know.
00:58:24.560 | I think there's a limit with respect
00:58:26.440 | that will grow with respect to the model size.
00:58:29.160 | So bigger models means possibly bigger vocabulary
00:58:34.160 | without affecting too much of training.
00:58:36.320 | But yeah, there's a lot of people.
00:58:38.440 | That's not my domain of expertise,
00:58:39.840 | but a lot of people are discussing the interest
00:58:42.120 | of having this kind of tokenizer,
00:58:44.000 | which doesn't feel quite natural.
00:58:46.120 | Could we go to character level tokenizer?
00:58:48.360 | Could we go to actually multimodal tokenizer,
00:58:51.760 | which will like decompose at pixel level?
00:58:55.280 | I don't know.
00:58:56.120 | Future directions, that could be very promising.
00:58:58.920 | - Yeah, I would say the diffusion people
00:59:00.520 | have actually started to swing back to pixel level.
00:59:03.840 | And probably that will presage the language people
00:59:07.680 | also moving towards 1 million vocabulary
00:59:11.280 | and then whatever the natural limit is for character level.
00:59:15.240 | - I think we can maybe transition
00:59:16.680 | towards some of your personal stuff.
00:59:18.720 | We kept you here for a long time.
00:59:20.040 | We also, this is a very distributed podcast.
00:59:22.560 | You know, I'm in the Bay Area, you're in France,
00:59:24.360 | Sean is in Singapore.
00:59:25.240 | So everybody is on a different time zone.
00:59:28.680 | You also do some startup investing and advising.
00:59:31.680 | You know, we had Soumith Chintala on the podcast.
00:59:33.720 | He also mentioned he always enjoys kind of working
00:59:36.240 | with founders and researchers.
00:59:38.120 | Any company you're involved with that you want to shout out
00:59:40.720 | that you think is super promising,
00:59:42.480 | requests for startups that you've had,
00:59:44.880 | anything around that space would be awesome.
00:59:47.960 | - Two cool companies I can think of now:
00:59:51.000 | one is Lindy, which is based in the Bay Area,
00:59:53.680 | with Flo Crivello.
00:59:55.240 | Very cool one.
00:59:56.560 | - Yeah, he's a good friend.
00:59:57.560 | - Flo.
00:59:58.400 | - Why do you like it?
00:59:59.240 | - Flo is really good, like he's a Frenchman, I guess.
01:00:02.200 | And number two, very recently, I really liked Open Devin,
01:00:07.200 | which is basically trying to reproduce Devin.
01:00:10.760 | - We interviewed him at ICLR.
01:00:12.240 | Both are agent startups.
01:00:14.120 | What do you think is like the direction
01:00:15.680 | that startups should be working on, you know, agent-wise,
01:00:18.120 | and maybe what is not working?
01:00:20.720 | - That's a tough question.
01:00:22.160 | One thing I say quite often is,
01:00:24.600 | deep learning has this very specificity
01:00:27.160 | that makes it challenging to predict:
01:00:29.640 | it's a self-destructive technology.
01:00:33.440 | So think, like, you know, Grammarly,
01:00:35.520 | the startup where, with this technology,
01:00:37.400 | you plug and play and it corrects your grammatical errors.
01:00:41.160 | Everyone told them, guys, deep learning
01:00:43.960 | creates a barrier to entry: annotate data, create data.
01:00:47.640 | And they had a lot of data for that.
01:00:49.600 | And the next day, with the same exact technology,
01:00:52.520 | deep learning, someone comes with chat GPT and tell them,
01:00:55.480 | yeah, I can do the same, better, and so many other things.
01:00:58.720 | Zero barrier to entry from yesterday to today.
01:01:02.760 | And what is crazy here
01:01:04.160 | is that it's based on the same technology.
01:01:06.320 | And so there's a lot of people working nowadays
01:01:09.560 | to try to mitigate issues
01:01:12.040 | with current generation of models.
01:01:14.520 | And I'm telling them, like,
01:01:16.200 | assume always the next generation will get better.
01:01:18.800 | So if your business will benefit
01:01:21.760 | from a new generation with better abilities,
01:01:24.280 | that's a good business.
01:01:25.560 | If your business may be replaceable,
01:01:27.680 | and if all the work you have done may vanish
01:01:30.280 | and be like wasted because there's better models,
01:01:33.000 | then maybe change.
01:01:35.160 | - Yeah, I mean, yes, but better is so unpredictable.
01:01:38.160 | Like if you asked me before, let's say March of this year,
01:01:42.080 | I would have said that maybe, you know,
01:01:44.520 | voice chat is still very defensible.
01:01:47.680 | And then suddenly, you know,
01:01:48.920 | OpenAI demoed their sort of real-time voice thing.
01:01:52.400 | It's sort of natively multimodal.
01:01:55.000 | It's easy to not anticipate
01:01:57.960 | a dimension where it gets better,
01:01:59.640 | but finding another one that resists is harder.
01:02:02.280 | I would say in general,
01:02:03.560 | assume you will have progress everywhere.
01:02:06.040 | It may not be right,
01:02:08.240 | but it's a bit dangerous to bet against that.
01:02:11.480 | - Is there any space that you think is overrated by founders
01:02:15.080 | that are trying to build something that like, yeah,
01:02:18.200 | either, you know, the new models are just gonna do,
01:02:19.960 | or like, you just don't think
01:02:20.920 | there's that much interest from folks?
01:02:23.640 | - It's a challenging time for founders.
01:02:25.240 | It's very exciting.
01:02:26.520 | There's a lot of funds, a lot of applications as well,
01:02:28.720 | a lot of stuff to build.
01:02:30.600 | That's pretty cool.
01:02:31.560 | But what is hard is,
01:02:32.640 | because this technology is moving so fast,
01:02:34.760 | I see like now a lot of fundamental stacks
01:02:37.920 | that are like the unicorns of today.
01:02:40.160 | Foundation models, foundational stuff like clusters,
01:02:42.800 | data annotation, things like that.
01:02:44.320 | There's a lot, but fewer successful,
01:02:46.680 | for now at least, application companies.
01:02:49.640 | And it's hard to build an application
01:02:52.360 | when it changes so fast, as we discussed before.
01:02:54.760 | So it is both cool and yet, like,
01:02:58.120 | we haven't found a good use case
01:03:01.320 | that is like the big new company there.
01:03:04.840 | And I want to see it.
01:03:06.440 | - Yeah, we definitely see the same, you know,
01:03:08.440 | all of our agent companies, or at least, you know,
01:03:10.840 | building agents are the ones getting the most traction.
01:03:13.360 | Most companies are like,
01:03:14.200 | hey, I actually don't have that much expertise
01:03:15.840 | and I'm just waiting for the models to get better.
01:03:18.160 | So I'm not really sure if I need this now.
01:03:20.200 | So it's an interesting time to be investors.
01:03:23.520 | Anything else we missed?
01:03:25.080 | This was kind of like a masterclass
01:03:27.080 | in how to build state-of-the-art LLM.
01:03:28.720 | So it's going to be a highly played episode, I'm sure.
01:03:32.520 | Any final thoughts you want to share?
01:03:34.800 | - There's two things I can, I guess I can say.
01:03:36.560 | One is that Llama is hiring talent worldwide.
01:03:41.320 | And two, you can contact me, reach me out on LinkedIn,
01:03:45.040 | I'm looking for GenAI technology
01:03:47.880 | and founders that will create the future.
01:03:51.120 | - Okay. On hiring, is there one role that you're like,
01:03:53.920 | man, like we really need this kind of person.
01:03:56.960 | If you describe it, that person will be referred to you.
01:04:00.680 | Right? Like, because we're trying to broadcast it
01:04:03.680 | to the whole world.
01:04:05.320 | - Researchers with good common sense,
01:04:07.520 | first-principles thinking,
01:04:09.000 | not necessarily like huge expertise on LLMs,
01:04:10.920 | but more being super rigorous, meticulous, structured.
01:04:15.080 | - Awesome, man, thank you again for coming on
01:04:17.320 | and hope everybody gets to enjoy LLAMA 3 today
01:04:19.760 | since it just came out
01:04:20.960 | and we'll have you again for LLAMA 4.
01:04:22.960 | (upbeat music)
01:04:25.560 | (upbeat music)
01:04:28.160 | (upbeat music)
01:04:30.760 | (upbeat music)