
Training Llama 2, 3 & 4: The Path to Open Source AGI — with Thomas Scialom of Meta AI


Chapters

0:00 Introductions
4:16 The Llama Origin Story
7:34 Are there RLHF scaling laws?
9:56 Avoiding the "Chinchilla trap"
12:15 Why 405B?
14:27 FP8 training and other scaling research
17:48 Llama 3 vs Llama 2
18:32 Synthetic data for pre-training
21:43 Tool use to generate synthetic data
22:40 Pre-training data recipe
26:00 Why not MoE?
27:05 Why RLHF is so important
37:06 How they eval models
41:50 Benchmarking Uncertainty
44:04 Structured output and tool calling
45:52 Llama 4 & Agents
52:01 Will Meta keep releasing open models?
53:55 Why tokenizer vocab size is underrated
59:12 AI & Startups
63:13 Hiring at Meta AI

Transcript

(upbeat music) - Hey everyone. Welcome to the Latent Space Podcast. This is Alessio, Partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. - Hey, and today we have a very special episode with Thomas Scialom. I don't know how to describe it, you've done so much work in a very short amount of time at Meta, but you were most notably leading Llama 2.

And now today we're also coordinating on the release of Llama 3. So welcome. - Thanks for having me. - To be clear, obviously the Llama 3 405B. Is that the official size number that we're going with? Or do we just say 400B? - For the text model only, yes.

A bit of additional parameters for the multimodal version that will come later. - Awesome. Awesome. Just to quickly go over your background. Actually we had a slightly similar past. I was also a quantitative trader, and it looks like you did five years in quant finance, working in trading at SocGen.

And then you transitioned into natural language, getting your PhD at Sorbonne, working with reciTAL as well. And then right after your PhD, joining Meta. - No, it's exactly that. But basically I think it's at the AlphaGo moment where I was doing some trading. I said, I need to understand the technology behind that.

And I wanted to study machine learning. I first did some training, like a six-month executive degree, at the end of which I knew XGBoost at the time and nothing about deep learning at all. And most of the people around were PhD people. Okay, a PhD seems pretty cool.

Deep learning seems pretty cool. So I want to do a PhD in deep learning. That's where I joined. We have this PhD program in France, jointly between a company and academia. And so I did my PhD with reciTAL and Sorbonne University on natural language generation with reinforcement learning. I guess it was a good topic.

I was not like a visionary. It was very random. I had a company that offered me this topic, and it was something like I started two weeks before BERT. - Excellent timing. Yeah, we actually also just released our episode with Clémentine Fourrier, who also did her PhD with a company in kind of a very similar format.

I think, yeah, very underrated, very underrated. This sort of PhD with industry expertise because you're also like publishing papers the whole time. I looked at your publishing history. You were doing like summarization work. You're doing factual consistency work. You released some benchmarks and then you worked on language GANs before the transformers took over.

- We can come back to that later, but I should have, I mean, those papers have like 10, 50 citations. I'm pretty sure that if I had called it RLHF without a human in the loop, but with a discriminator, which is synthetic, as the human in the loop, I would get many more citations today, because all the inspiration for those papers came from the original OpenAI paper on RLHF.

But in academia, we don't have a way to pay for annotation online like that. So how to simulate it? - Yeah, a lot of these ideas are repeated, like discriminator, generator, we just call them different names now, like verifier, whatever. - Exactly. - Well, I think your progress into NLP was really strong, 'cause the first thing you worked on at Meta was BLOOM.

- Yeah, actually I started to work on that before joining Meta. I was not one of the main contributors, but it was at the intersection of multilinguality, which was very important to me, and large language modeling. And that's why actually my first big project at Meta and the team I was working on was Galactica.

And actually, an interesting step back from BLOOM was that we made a lot of mistakes, but that was expected in a way. We learned a lot, trying to scale towards multilinguality. In fact, we learned later that multilinguality almost emerged naturally with very, very little data, which was really surprising and not expected at all for us at the time.

- I mean, my learning from that is just, there's a natural harmony of language that is abstracted from English. When you learn English, you learn language, and then language just translates to other forms of languages, especially if they're in the same family, right? Like, yeah, so maybe we should get right into Llama 2, spend a little bit of time there, and then we'll go into Llama 3.

So what is the story of Llama 2 from your point of view? - Yeah, so as I was saying, I started at Meta on Galactica. That was one of the first large language models at Meta. It's a language model for science. We released it in, I think, December or end of November, I don't remember, one year and a half ago.

I don't know if people remember, but it was huge on Twitter, both with people thinking it's the end of science, and with a lot of complaints about it hallucinating papers, although I was like, it's super awesome. I still think it was super awesome, but you know, we didn't do instruction tuning or RLHF techniques at the time.

It was a weird moment because two weeks later, ChatGPT came out, and that's the moment where, I think, everything at the company went upside down and where we had huge traction from leadership to now work on that and make a ChatGPT as soon as possible. So we had this one, two months of what to do.

I actually was working on Galactica Instruct, which basically you could connect to tools. We had a partnership with Overleaf, the Google Docs of scientists, where you can write papers, and you write there in LaTeX, you have to do a lot of citations. So the idea was that you could, just like ChatGPT or InstructGPT, ask it to swap two columns in a LaTeX table.

That's something very, very time-consuming, I can promise. You could say, oh, find me a citation about LLMs and bias, and it would find you some papers and automatically insert the bib entry in LaTeX. So that was pretty cool. But because of the backlash, we never released it in the end.

- Oh, because of the Galactica backlash. - Yes. - I was just saying, like, today it's not solved, because Lucas Beyer is still asking for the citation generator. - I told him. I was like, dude, we had that two years ago, and I promise, I tested it. It works so well.

I had it on Overleaf integrated. I tested it. - Wow, okay. - Yeah, yeah, yeah. No, it went quite far, in fact. And actually, about citations, it's anecdotal, but because of the way Galactica was trained to cite papers with all the references in the paper, that's what made it emerge so easily at instruction-tuning time.

Actually, Galactica Instruct was the first annotation project for RLHF at Meta. That was a follow-up of Galactica that we were preparing. And at the same time, my friends from the Paris office created Llama 1. It's like to connect the dots with what we said before. The last author was Guillaume Lample, who founded Mistral.

The first author is Hugo Touvron, who worked with me on Llama 2, still at Meta. Both did a PhD program between Meta as a company and academia. So that's a pretty good program indeed. And so we worked on Llama 2 from that point. We had all the support from the company leadership.

That was one of the main priorities. We had Llama 1 and Galactica as the backbone of a good language model. We started from Llama 1, and we worked mainly with Guillaume on how to make instruction-following and chat models that will follow instructions. So all the supervised fine-tuning stage, then the RLHF, there are some papers.

So you had some intuition from there we could use. But in fact, at large scale, and that was probably the biggest challenge for us, there's no research anymore. We don't know how much to scale. - Can you describe what scale you're talking about? - Yeah, yeah, to what level of annotation to scale.

For the annotation, do you need 100,000, 1 million, 10 million annotations of supervised fine-tuning, of RLHF preference? We had no idea. What is the actual algorithm to do? How often to retrain the models? You have just the basics, but then when it comes to ChatGPT or InstructGPT or Claude, no one published the details there.

And so we had to reinvent the wheel there in a very short amount of time. - And what about parameter size? This is one question that a lot of folks had about Llama 3. So Llama 1, you had 7B, 13B, 33B, 65B model sizes, and then Llama 2, 7, 13, 70.

How do you kind of evaluate what's worth training, especially when you think about data? It's like, you know, maybe 100,000 is enough for like a 7B model, but it's not enough for a 70B model. How do you decide model size, especially when you're maybe annotation constrained on some of these things?

- That's a very good question. And there's no good answer. There are so many parameters to take into account, from the scaling laws at training time to get the best performance, to the GPU constraints on different hardware - and we think about Meta, but also about the community, and people are not just using H100s; there are A100s, there are different sizes of GPU memory.

So which size will fit in what, and what is the most useful? Also at inference time, not just at fine-tuning time; then you can maybe do some tricks at inference time to quantize it a bit, or FP16 or FP8 now. All those constraints make it very, very challenging.

At inference time, you have a lot of costs. So how to trade off between inference cost and training cost? It's a very challenging problem. In general, we tend to think in particular for Llama 3. Llama 2, maybe I would say it's like Llama 1: we had a flagship model, which was 70B.

It's also because the project had its roots in reproducing Chinchilla, which was a 70B. For Llama 3, we also moved one size up. The flagship model is 405B. I think there was also the question of, we want a model at this time, we have this amount of compute.

Given the scaling laws and the amount of tokens we have to train it, what would be the right balance to still fit in at inference time? So we try to have some trade-off like that. - Yeah. And you mentioned Chinchilla is the best way to go, but then you tweeted recently, "Don't fall into the Chinchilla trap if you want your model to be used by billions of people." So what's the updated state of scaling laws?

I think there was obviously the Kaplan paper, and then there was Chinchilla, and then people kind of got the Llama scaling law, like the 100 to 200x token-to-parameter ratio. What's your updated thinking on how to think about scaling laws when you pick model size and training data?

- Right. So, as you said, there's this Kaplan paper with scaling laws, but they figured out, basically they tried two dimensions: the model weights, and the amount of training, like number of steps, training tokens, epochs. And from that, they figured that model size is what matters. So GPT-3 was way too big compared to the actual number of training tokens, because they made a mistake not adapting the scheduler.

That's what Chinchilla emphasized and discovered. To be fair, I think Kaplan knew that at the time of the Chinchilla paper. But yeah, basically Chinchilla said we have to revisit the scaling laws originally published by Kaplan and emphasize much more the importance of training tokens. And they did some really good scaling laws showing that there's an optimum: basically you need to double the number of training tokens every time you double the model weights to get an optimal ratio, so that for a fixed amount of compute, you will end with the best results in your paper.

And what I call the Chinchilla trap is that that's good if you want the best flagship model that obtains the highest performance in your paper. But if you want to use your model at inference time, of the two dimensions, one remains, the model weights, but the other drops: the number of tokens you trained it on, the number of steps.

And so to be compute efficient at inference time, it's much better to train it for a much longer training time, even if it's an additional effort, than to have a bigger model. That's what I refer to as the Chinchilla trap. Not that Chinchilla was wrong, but if you consider inference time, you need to go beyond Chinchilla.
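
To make the trade-off concrete, here is a back-of-the-envelope sketch. The 6ND training-FLOPs and 2N inference-FLOPs-per-token approximations and the roughly 20-tokens-per-parameter Chinchilla ratio are standard rules of thumb; the specific numbers are illustrative, not Meta's actual figures.

```python
# Back-of-the-envelope: Chinchilla-optimal vs. overtrained-for-inference.
# Rules of thumb: training FLOPs ~ 6*N*D, inference FLOPs ~ 2*N per token,
# Chinchilla-optimal data ~ 20 tokens per parameter. Numbers are illustrative.

def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

def infer_flops_per_token(n_params: float) -> float:
    return 2 * n_params

# Fix a compute budget sized for a Chinchilla-optimal 70B model.
budget = train_flops(70e9, 20 * 70e9)

# Option A: Chinchilla-optimal 70B. Option B: 8B overtrained on the same budget.
a_params, a_tokens = 70e9, 20 * 70e9
b_params = 8e9
b_tokens = budget / (6 * b_params)

print(f"A: 70B on {a_tokens:.2e} tokens, {infer_flops_per_token(a_params):.2e} FLOPs per generated token")
print(f"B:  8B on {b_tokens:.2e} tokens, {infer_flops_per_token(b_params):.2e} FLOPs per generated token")
# Same training compute, but B is ~9x cheaper per token at inference time, which is
# the point of going "beyond Chinchilla" for models meant to be served at scale.
```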

And in fact, that's what the Llama 1 folks did by overtraining it, since they could have gotten a better performance in their paper, but they preferred to create the best artifact that will be used by the community. - So that's the scaling thinking. What else went into Llama 3 planning?

So with Llama 3, you have a pretty good model. People really liked it. In Llama 3, you dropped the intermediate weight, so it's an 8, a 70, and now 405B. What was the thinking behind going so large? I mean, you talked about the hardware capabilities at inference, like I cannot run a 405B model at home for sure.

And it might be hard to even get the cloud resources to do it. What was the decision there? - The decision is super simple. We want the best model. We want to be number one and number two. We started one year and a half ago and we did quite some journey.

We filled the gap with GPT-4. So that will be the first open source model that actually compares to GPT-4. There's now GPT-4o, of course. And we're close, but we're not there yet. Not in all capabilities. But the gap is getting smaller and smaller. There's also what compute we had at the time when we started to run in January.

We put a lot of effort there, but as Mark announced, we have more and more GPUs. So the next generation will be bigger. So that's what drives the decision. Now, maybe let me reflect on two things you said. You cannot use it at home. That's probably true. But quantized to FP8, it can run on a node, even with a long context of 128K tokens.

Second thing is, I'm hopeful that the community will lead to a lot of findings by open sourcing it. And there are smart ways to actually make it usable on your computer. If you remember, when we published models, people were saying it's too big. And after two weeks, it was running on a Raspberry Pi.

I don't know if it will be the same, but I hope it's the same kind of trend. And by releasing those models, we are enabling that. Now, the last thing I want to add is having bigger models enables us to collect better data, for instance at the RLHF stage, because that's the model we use for the annotation.

And so we distill, straightforwardly, this annotation from this better model to the other models. So I can guarantee you that the quality of the smaller models we are releasing with Llama 3 is also thanks to having these artifacts where we can collect and train. - Yeah, there's a lot of really good info there.

One thing I'll just briefly touch on for quantization, there was a recent Noam Shazeer blog post. Noam is writing again for some reason. And he was talking about sort of native FP8 training. It seems like that is most useful for inference. That is what you expect the open source community to do with your weights once you release them anyway.

Is there any movement or thinking about just moving to FP8 or whatever other new format is in vogue these days? - There are also these papers, like training with, I forget the name, but there are two follow-up papers on just zero, one, or minus one weights. And there's a lot of work there.

I think it's promising directions overall. Regarding FP8 in particular, there's also the possibility for the community to try FP8 or the methods that are very easy at fine tuning time for the model. So I'm really looking forward to what the community can do there. Although like scaling, I don't know if it's all you need, but I will not bet against scaling.

And one of the way to get more scale is by having better algorithms that we can train for the same level for less compute. - Less compute and less memory. Yeah, like inference time memory is becoming a real constraint. - Yeah, yeah, but also training with FP8. If you're not training with FP8 or, I mean, FP0 is probably nonsense, but to what extent, how far we can go, you know?

And every time you unlock, compared to what we had two, three years ago in 32- or 64-bit, it's like huge progress in terms of scaling. - For me, it's interesting to see you mention the ternary quantization, like the 1.58-bit thing. 'Cause I didn't know that, I don't know how much to believe it, you know?

Like there's a lot of these kinds of papers where it makes a lot of noise, but it doesn't actually pan out, it doesn't scale. - I totally agree with you. It's so hard for researchers, at least for me, to see all those papers published, all those cool ideas, all those results that are preliminary.
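
For readers who have not seen the papers being referenced: a minimal sketch of what BitNet-b1.58-style ternary weight quantization looks like (round weights to {-1, 0, +1} with a per-tensor scale). This is a simplified illustration of the idea only, not those papers' full training recipe.

```python
import numpy as np

def ternary_quantize(w: np.ndarray):
    """Round weights to {-1, 0, +1} with a per-tensor scale (BitNet-b1.58-style sketch)."""
    scale = np.abs(w).mean()
    w_q = np.clip(np.round(w / (scale + 1e-8)), -1, 1)
    return w_q, scale

def dequantize(w_q: np.ndarray, scale: float) -> np.ndarray:
    return w_q * scale

w = np.random.randn(4, 4).astype(np.float32)
w_q, s = ternary_quantize(w)
print(w_q)                                        # entries are only -1, 0, or +1 (~1.58 bits each)
print(np.abs(w - dequantize(w_q, s)).mean())      # mean quantization error
```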

And in all this massive amount of research, what will scale or not? What will resist the test of time or not? And are we like losing maybe some gems that are not just, people are not working on them, but because there's too much research around. I don't know, maybe.

And that's like some problems to have. That's cool to have these problems nowadays, compared to probably what Yann LeCun and the others had 30 years ago, but still it's a problem. - For what it's worth, I do think that FAIR is putting out incredible research. Probably, it doesn't seem like it's your group, but you also recently published Mobile LLM, which on the small model side is a really good research on just small model architecture.

It looks like Hugging Face is also replicating it, and it's doing quite well. There's a lot of ideas on shared weights and shared matrices and model architecture stuff that we can talk about for smaller scale models. LLAMA is not at that scale, but it seems like one of the big themes of this year is on-device, in-browser, small models that are good enough for daily use.

I do want to talk about architecture. I'm not sure when you're releasing the LLAMA 3 research paper, but in LLAMA 2, you talked a little bit about the architecture choices. - It will be released the day, I think, of the release. - Okay, what should people know? What are the major choices of LLAMA 3 versus LLAMA 2?

- There's not a lot of changes in terms of architecture. I think we can do a lot better in the future, and not just with transformers, but, for instance, to me, it doesn't make sense to use the same amount of compute per token, for every token. Like, the architecture lacks flexibility.

There's a lot of research to go there, but still, that's the best thing we have for now. And so, it's the same recipe as Llama 2, in terms of architecture and training, but we put so much effort on scaling the data, and the quality of data. There's now 15 trillion tokens, compared to two trillion, so it's another order of magnitude there, as well, including for the smaller models.

- One of the things I noticed in the paper is that you used Llama 2 to do the data cleaning for what went into Llama 3. I think there's a lot of chatter, obviously, about synthetic data, and there was the "Rephrase the Web" paper that came out, maybe a few months ago, about using Mistral to make training data better.

Any learnings from that? Like, how much can you rewrite with the models? I'm sure people would love to hear more about it. - Right, so it's a very interesting research direction: synthetic data in general, synthetic data for pre-training. My intuition is that the web is full of shit, in terms of text, and training on those tokens is a waste of compute.

Just having a good classifier that labelizes that is cool, and LLAMA was, at the time, before LLAMA 3, the best model we had access to, legally, to labelize the web and select what are the good tokens and the bad tokens. The additional thing is that it also enabled to have a topic tag, like, is it about law?

Is it about politics? Is it about chemistry, math, reasoning? So that you can also adapt a bit the mixture to balance a bit more the diversity. - To me, I'm not exactly sure what you guys did, but I feel like when people say synthetic data, there needs to be different categories of synthetic data now because I think there's so many different usage of this thing.
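
A rough sketch of what the "labelize the web" approach described above can look like in practice: score each document for quality with an LLM, tag its topic, then filter and rebalance the mixture. The prompt, score scale, thresholds, and the `generate` callable are hypothetical placeholders, not Meta's actual pipeline.

```python
# Sketch: LLM-as-classifier for pre-training data curation. Prompt, scores, and
# thresholds are hypothetical; `generate` stands in for any instruct model you can call.

QUALITY_PROMPT = (
    "Rate the quality of the following web text from 0 (junk) to 3 (high value), "
    "and give a one-word topic tag (e.g. law, politics, math, code, chemistry).\n"
    "Answer as: <score> <topic>\n\nTEXT:\n{doc}"
)

def label_document(doc: str, generate) -> tuple[int, str]:
    reply = generate(QUALITY_PROMPT.format(doc=doc[:4000])).strip()
    score, topic = reply.split(maxsplit=1)
    return int(score), topic.lower()

def curate(corpus, generate, min_score=2):
    kept, topic_counts = [], {}
    for doc in corpus:
        score, topic = label_document(doc, generate)
        if score >= min_score:                       # drop low-quality tokens
            kept.append((doc, topic))
            topic_counts[topic] = topic_counts.get(topic, 0) + 1
    return kept, topic_counts                        # counts let you rebalance the topic mixture
```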

But specifically synthetic data for pre-training, it feels almost like you're running multiple epochs on the raw data while it's rephrased or reformatted by a language model, right? And in my mind, it's very similar to computer vision, where you do data augmentation on an item, right? Like, we're doing data augmentation.

That's the less cool name for synthetic data. (laughs) - That's very interesting. I totally agree with you related to pre-training, totally second what you said. I think it's very different, though, for post-training and the future directions on synthetic data that I'm personally excited about. Like, for instance, what I'm excited about is we had this survey on augmented LLMs a year ago, and the whole idea is, if you augment your LLM with something else, it can be a retriever, it can be search, it can be a tool, it can be a calculator, it can be code execution.

Then you are not just distilling, like doing some data augmentation with your model, but you're actually adding some expert skills that possibly go beyond the model weights. For instance, if your model got a calculation wrong before, and now it has access to a calculator, and you can retrain your model on that, then you're learning something new.

If your model didn't know something about Llama 2, it probably doesn't know a lot about Llama 3, but now if it can search online about it, and then you train the model on that, then you have a positive feedback loop, like what we call expert iteration, targeting directly the weaknesses of the model.

It's like continual augmentation of the language model, much beyond just data augmentation. - How related is this to tool use? Like, are you teaching it to use tools to augment the model, or are you saying, like, do active learning, do like, where it's weak, go augment the model with extra data, and then memorize that new data, right?

- What I said is more like in terms of directions, not for LLM 3, but like, when it knows how to use a tool and correct itself, this is like a very promising direction that goes much beyond the augmentation for like, in the future, to keep collecting new data, new token.

People are saying we are lacking tokens. But if you think about those kinds of tokens, where the model always goes to correct its own weaknesses, it can say, okay, that's 10 plus 10 - okay, that's an easy example, the model probably knows it, but imagine something more complex. 10 plus 10:

I expect this to be 20. Let's verify with a calculator, which is easy for a basic agent now, powered by an LLM. And then you verify, with respect to what you expected, that it's correct. If it's not, you can backpropagate this example directly to the weights, and so it will keep learning new things.
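
A toy sketch of the expert-iteration loop being described: sample an answer, verify it with a tool (a calculator here), and keep only the verified traces as new training data. All interfaces and prompts are made up for illustration.

```python
# Toy expert-iteration loop: verify model answers with a tool and keep the verified
# traces as new training examples. `generate` is any model call; interfaces are illustrative.

def calculator(expression: str) -> float:
    return float(eval(expression, {"__builtins__": {}}))   # toy tool; a real agent would sandbox this

def expert_iteration_round(expressions, generate, n_samples=4):
    new_training_data = []
    for expr in expressions:                                # e.g. "10 + 10"
        target = calculator(expr)
        for _ in range(n_samples):
            answer = generate(f"Compute: {expr}")           # model proposes an answer
            try:
                correct = abs(float(answer) - target) < 1e-6
            except ValueError:
                correct = False
            if correct:                                     # only verified answers become new tokens,
                new_training_data.append((f"Compute: {expr}", answer))
                break                                       # targeting exactly the model's weaknesses
    return new_training_data                                # fine-tune on this, then repeat
```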

- It makes sense. What have been your insights? You know, you mentioned just using calculators. What are the insights? I think, in general, a lot of that is just driven by code generation, apart from just tool use. What are your insights on the data mix, of how much code, how much multilinguality, which is something that you're also passionate about?

We know that changed between Llama 2 and Llama 3. Is it changing for different stages, between the different sizes of Llama 3? Like, you know, anything of that sort? - No, it didn't. For the different sizes, we use mostly the same. What happens is we changed the data mix during the training of Llama 3, with some findings that happened - I mean, training is long, so you have to do something while it's training.

And what the team did - I was working on my side on multimodal post-training, but the pre-training team did quite a lot of work to get some new findings and improve the data mixture along the way. And they intersected it before the end of the training. - I sense a movement in terms of the curriculum that people are adopting during pre-training and even post-training about, you know, what the mix should be.

Like Snowflake is doing some interesting work with enterprise intelligence or whatever they call it. What are your goals with post-training? Like, just at a high level, you know, how do you work with the pre-training team? - I think it's quite easy for now, because there's not yet this kind of continual augmentation where it could feed back into pre-training, things like that.

One of the big continuums between pre-training and post-training in particular is continual pre-training, where you actually continue the pre-training before RLHF in a self-supervised way, but on expert-level domains, like for it to have an expert in code and an expert in reasoning or an expert in multilinguality, which enables collecting even better RLHF annotations after.

So that's one thing. And then you start from those models to actually do the RLHF stage. And regarding your question about goals and vision: the goal was to get the best model in those dimensions. That's actually one thing I can comment on that is very different compared to Llama 2. Llama 2, you know, as I said, we were nowhere.

We built entirely end-to-end all the stack, from data annotation, contracts, methodology, protocol, to algorithms for RLHF at Meta. And we had to limit our scope; we were not able to work on everything. We focused mainly on helpfulness, following instructions, for Llama 2. And you can see that in the following months after Llama 2, a lot of open source models came, distilling GPT-4 mainly, but obtaining better reasoning, math, coding, chat models.

And we didn't annotate at all for code, nor for reasoning or multilinguality. And one thing I'm quite proud of is that with the early preview release we did of Llama 3 back in February or March, I don't remember, it led quickly, almost instantly, to state-of-the-art results for the model size, almost competing with GPT-4 on the Arena leaderboard, where models fight each other and humans compare two models and select their preference.

And no one since then has been able to make a Llama 3 model better than what we did on most of the domains, from code, reasoning, multilinguality, helpfulness. So that's the sign that this time, as opposed to Llama 2, we tackled all those different aspects. - Do you have any other thoughts on the more synthetic data focused models, kind of like Nemotron?

I think folks were asking if you see that as an interesting direction too, kind of having specific synthetic data generation things. - I don't know about this model exactly, but I think Llama had better performance overall. I'm very bullish on synthetic data generation, but I think it just gets better when you have a better model.

I'm not really bullish on having a model only for synthetic data generation. I understand the need for having bigger models, which you can then rationalize - yeah, maybe people will not use them for inference, but to distill some specific knowledge as synthetic data. I totally agree with that narrative, but having a model purely for that and not good at other things, I don't think that's the case.

- Makes sense. One of the architecture questions that I forgot to mention in there was, so, just the architecture choice of a very big, you know, 400B dense model. I actually honestly thought that maybe 175B or so was kind of the peak, you know, whatever can fit on an H100.

So basically I think the common question that people have is like, why no MOE? In a way that Mistral and the others have gone in, you know, it seems like the trend has been MOEs and you guys have bucked the trend there. - I heard that question a lot.

Different aspects there. Why not MoE? Maybe in the future. The other thing is, I think a dense model is just one specific variation of an MoE, for one hyperparameter: basically one expert. So it's just a hyperparameter we haven't optimized a lot yet, but we have some stuff ongoing, and that's a hyperparameter we'll explore in the future.

- Let's make sure we run through everything on post-training. You also had a recent tweet about RLHF versus imitation learning, explained in one tweet. So we'll put this in the show notes, but it's basically two charts about doctor opinions. On one side, there's whether or not the suggestion is good from a content perspective.

And the chatbots rank really highly and the physicians are kind of like, you know, a bell curve, as you might imagine. But then the empathetic voting, most physicians are rated not empathetic or slightly empathetic versus all the model responses are rated very empathetic and empathetic at worst. You know, most people might look at it and not really get much from it, but obviously it resonated with you.

Can you run people through some of the choices you make in post-training to optimize for one of the two and get the best responses? - I think the tweet was about the intuition of why reinforcement learning with human feedback works. When we started Llama 2, I had this budget of annotations in millions of dollars, and okay, what to do?

I'm responsible for that. I'm accountable for a model at the end that can follow instructions and compete with GPT-3.5 at the time. What to do? You can annotate supervised fine-tuning data, which means a human creates a prompt and also writes themselves the answer expected from the model.

So then you train on that, in a supervised manner, but that's very classic and standard fine-tuning machine learning. The other thing is reinforcement learning with human feedback, where the annotators type a prompt, but this time you sample two different answers from your model, and you ask the annotator which one they prefer.

And then you will train on the preference, basically, to simplify. When you ask to train on the preference of the model, that seems very weird and not really robust - training on synthetic data generated by the model. So I was like, let's annotate 100,000 or so of supervised fine-tuning data, and let's annotate a bit of preference data to do RLHF, because everyone is doing it.

And we had this human evaluation after a few weeks into the Llama 2 project, where our model was already better than the annotation from the humans. So you'd get a prompt, you check what the human would have annotated as an answer, you check what the model generates, and most of the time, the model was better.

I was like, oh, maybe the annotators are pretty bad. Let's look at that. And no, the model was pretty good. And so I understood the intuition behind RLHF. Those models are already super good at some tasks. And with RLHF, then what you have is, imagine a distribution, a Gaussian distribution, which was basically the tweets.

And you have on the left, bad outputs and on the right, good outputs. And the same with medical diagnostics from a doctor. You have good outputs on the right and the bad diagnostics on the left. But you have the distribution and when you collect all the diagnostics from doctors, hopefully it's mostly on the right.

There are, a lot of the time, good diagnostics, but humans make mistakes, right? So there are bad diagnostics. On the left, you still have a few examples, which means the curve of the distribution is not at zero there. And in the same way, humans make mistakes when they annotate. And so training on behavioral cloning to reflect humans, the model will learn to also make some mistakes, just like humans.

And so you will have some bad outputs from the model time to time, reflecting humans. And you cannot go beyond that if you train on human outputs. But now, if I ask a doctor to check a sample from my model, or a sample from two doctors, one diagnostic and another diagnostic, one is better than the other, it's easy for a doctor to say which one is better.

The same way, if I sample from my model that learned the human distribution of answers, there's a bad one from time to time, like humans, but most of the time, good answers. And I ask a human to choose which one they prefer. Personally, I'm really bad at creating poems. The example I give a lot of the time: try to write a haiku, in three lines, about language models.

I don't know about you - take like five seconds to think about what you could come up with - I'm terrible. But yet, if I check two poems generated by a model or a human, I can tell which one I prefer. I'm good at discriminating. And because of that, you can have a model that flattens out the bad outputs and learns to only shift towards the best and better and better outputs.

And you can even end up with superhuman abilities, since I'm bad at writing a poem, but I'm good at judging which one is better. So I can actually annotate data beyond my own skills at creating it. That's the magic of RLHF. - Yeah, we have one episode, RLHF 201, with Nathan Lambert from the Allen Institute, who was at Hugging Face leading RLHF before.
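
For readers who want the mechanics behind "training on the preference": a common way to do it is to fit a reward model on pairwise comparisons with a Bradley-Terry-style loss, and then optimize the policy against that reward. The snippet below is a generic sketch of that pairwise objective, not Meta's exact recipe.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style objective for a reward model trained on human preferences:
    push r(chosen) above r(rejected). Generic sketch, not Meta's exact recipe."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: scalar rewards the reward model assigned to each answer of a preference pair.
r_chosen = torch.tensor([1.3, 0.2, 2.1])
r_rejected = torch.tensor([0.4, 0.9, 1.0])
print(pairwise_reward_loss(r_chosen, r_rejected).item())
# Lower loss when the reward model already ranks the human-preferred answers higher.
```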

And he mentioned one of the things that makes RLHF work is that humans are not maybe great at creating a lot of things, but they're usually very good at giving an opinion on which one they prefer. So they're able to actually annotate data for things they would never create from scratch.

One question actually that he asked me to ask you, how much in post-training you attribute improvement to the RLHF side versus the instruction fine-tuning side, and maybe how you think about prioritizing the two and what areas they impact the most? - You mean between supervised fine-tuning, like supervised fine-tuning annotation and preference annotation?

- Yeah. - So, 100% to RLHF. In fact, that's quite interesting. You start for Llama 2 with a pre-trained model, and you have to get to an instruction model, to a chat model. Otherwise, the model is just finishing sentences. So you need that to start RLHF. So we had to annotate like 10,000 examples.

What did we do for Llama 3? You start with a new pre-trained model, and then you want, before starting the RLHF, to have a chat model that is not too bad. Option one was, let's do human annotation again, like the SFT stage. But in fact, by the principle I said before, the annotation would actually be worse than Llama 2.

So what we did is that we generated all the data on the prompts with Llama 2, and we applied basically the last round of Llama 2 we had to kick off and start Llama 3 post-training. So Llama 3 post-training doesn't have any human-written answers there, basically, almost. It's just leveraging pure synthetic data from Llama 2.

- Do you have an intuition on which areas work better for which? For example, you mentioned the physicians being experts. What about maybe code? Or, yeah, you also have a multimodal model you're working on, so image generation - does this apply to any modality, any subject? - That's an open research question.

The intuition in general is, for instance, for code, because it is factual, you can check if the code is correct or not, RLHF is not the way to go. You'd prefer to do supervised fine-tuning, with a human writing the code. But in fact, because humans make mistakes, because actually even in code there are some preferences that they might feel like that.

And maybe for some other reasons that we don't know, RLHF is so much more scalable. It costs less, it's easier, and it leads in general to just better performance. And maybe we can come up with a compromise. We actually suggested teacher-forcing in Llama 3 - a new method that kind of fills the gap - not teacher-forcing, sorry, teacher-critic.

Teacher-forcing is a way to train the models. Teacher-critic is where it reconciles and unifies supervised fine-tuning and RLHF, so that when you do human preference and you have two outputs, but both are very bad - in code, for instance - you will ask the human to edit the best answer to make it correct.

So now you are doing SFT when all the answer was really bad, so that you can get out from the local minimum of your model. - I think this is like super promising, and it seems like there's just, well, do you have an idea? You know, you started with this question of how much scale you need.

Do you now have a better idea? - No, what we know is it's not plateauing yet. - It's not plateauing yet, yeah. So just infinite amounts more - well, you know, Scale AI and all the annotation providers are very happy to hear that. And so you mentioned at the start of the conversation the AlphaGo moment, and I feel like this is very interesting to reflect on, right? Like, we're basically saying that, I think that one of the lessons from AlphaGo is that people thought that human interest in Go would be diminished because computers are better than humans, but then we have this sort of centaur model where humans and computers are actually doing better than either humans or computers would be alone.

And I think we're seeing that with this, what are you talking about, this RLHF improvement, right, that we're kind of building human preference into the model and the blending of the human preference and the model capability is actually doing better than we could on our own. I just think it's pretty fascinating.

- It is fascinating. - The other thing is RLHF came from the alignment community and I think there's a lot of conception that maybe it's like due to safety concerns, but I feel like it's like really over the past like two, three years expanded to just, this produces a better model period, even if you don't really, are not that concerned about existential risk.

I always feel like it's so interesting to see this, like people who take alignment super seriously, they're the first to consider super alignment. And now we're considered like, I'm almost thinking about this as like super quality, that we are training models that are higher quality than humans. And it's not really about alignment so much as like, we now see that this is actually possible.

- Yeah. - And it's not even for alignment purposes. We just think it's like better at reasoning, better at knowledge, better at everything. - Well, I don't know how much better yet it is on those, but clearly it's super human on some writing skills and it's super useful. I think that's great, to be honest.

- Yeah, perhaps we can transition to evals. We've had some questions about the 400B details that we want to disclose. By the time this podcast comes out, we'll have disclosed them. Yeah, I think last time you disclosed the evals while you were still training, so what should people know about the high-level headlines for the new Llama 3?

- At a high level, it's the best open source model ever. It's better than GPT-4. I mean, which version? But by far compared to the version originally released. Even now, I think there's maybe only the latest Claude 3.5 and GPT-4o that are outperforming it. And that's it, period.

So for the 405B, that's the flagship, that's a pretty good model. Not yet the number one. We still have a journey to get there. For the 70B and 8B, they are world-class models of this size for general models. - And are the benchmark numbers from the initial checkpoint still right?

So the April 15 checkpoint, MMLU on Instruct is like 86, GPQA 48, HumanEval 84, GSM8K 94, MATH 57.8. Is this still roughly the same performance? Or, you know, I haven't seen the numbers yet either. We're just breaking the news right now, so. - No, it's roughly that. - Awesome.

So talking about evals, we just had an episode with Clémentine from Hugging Face about leaderboards and arenas and evals and benchmarks and all of that. How do you think about evals during the training process? And then when the handoff happens, do you already know exactly what you want to improve?

And I know that, for example, to improve maybe an Arena score, you need something different than for an MMLU score. How do you think about prioritizing the post-training improvement based on benchmarks? - Those are super hard and good questions. There's no good answer. I mean, evals are an open research problem, in particular when you're trying to tackle so many capabilities.

And, you know, it's also like, as soon as you're trying to push numbers on a benchmark, it stops being a good benchmark, because then you don't know if you're overfitting to it and whether it will transfer to similar capabilities. So evaluation for language models, in particular on post-training, is a very hard problem.

We tackle that by playing with different methods, like reward models, evaluation with a model as a judge, having a diversity of prompts, a diversity of benchmarks as well, for a lot of different capabilities. That limits the possibility of hacking them, of course. We also do a lot of human evaluation. I also do a lot of model test quality analysis, like testing some prompts myself.

I feel it was much easier during Llama 2, when the model was worse than today. Now the model is getting so good that it's hard to find prompts that break it, and to compare models and see the edge cases. So it's getting harder. And a great way also to compare models is, you know, through the different rounds we have done of RLHF.

Every time we upload a new model, for all the annotation we are doing, we have the win rate between the previous model and the new model, by just sampling, for every prompt we annotate, sample A with the old model and sample B with the new model. And so we can calculate automatically a win rate.
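
A minimal sketch of that automatic win-rate computation: for every annotated prompt, the annotator saw one sample from the previous model and one from the new model, and the preferred side is recorded (the field names here are illustrative).

```python
# Win rate of the new model over the previous one, from preference annotations.
# One record per annotated prompt; field names are illustrative.

def win_rate(annotations) -> float:
    wins = ties = 0
    for record in annotations:                      # e.g. {"preferred": "new" | "old" | "tie"}
        if record["preferred"] == "new":
            wins += 1
        elif record["preferred"] == "tie":
            ties += 1
    return (wins + 0.5 * ties) / len(annotations)   # ties counted as half a win

example = [{"preferred": "new"}, {"preferred": "old"}, {"preferred": "new"}, {"preferred": "tie"}]
print(win_rate(example))                            # 0.625 -> the new model is preferred more often than not
```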

- Interesting. What are areas that you had to work the hardest to catch up to the private models? Maybe there's, you know, not as good public data or whatnot, or is performance improvement just kind of even across the spectrum? - Honestly, all of them; we were behind all of them, between Llama 2 and GPT-4.

I mean, it's a different challenge every time. Like, being good at code or reasoning is something we didn't do for Llama 2, so we had to build everything from scratch. Improving on helpfulness, which is one of the main dimensions that people look at, I think, in the Arena, which is by the way a very interesting evaluation.

Because when we did the preview, and I don't know yet what the results will be for this new Llama 3, we ended very high on this blind test leaderboard. And to be honest, I didn't expect that. I knew we had good results internally, but how that would transfer to perception from the community, people using it in practice and comparing it to the other models, I didn't expect that positive feedback, that high ELO score on this benchmark.

It doesn't say everything. As I said before, it's also interesting because it's the community that creates the prompts and judges the answers. We are limited; we are not as good at doing that. And so it gives you a very good indicator of how good and helpful the model is on the main core of the distribution, simple prompts about the tone of the model compared to the others, but for much more complex problems, much more intelligent reasoning, coding of complex stuff, it doesn't tell the full story.

You know, while we had the 70B preview at the level of GPT-4, even better at the time - I think it was partly true, but clearly we were not at GPT-4 level in code or reasoning. We are now. - There's some conversation about the math score. Apparently the next GPT, GPT-next or whatever, is in the region of 90, which is a big, big jump from the current state of the art.

It will be interesting. Rounding out the topics on potential model areas of development and evals: Clémentine is looking for a confidence estimation or uncertainty benchmark, and one of our previous guests, Bryan Bischof, is also asking about how we think about evals for practical things like confidence estimation, structured output, stuff like that.

- Yeah, I think we actually lack such evaluations. One number I was asking the team, like two days ago, to report at some point is: okay, we have this accuracy on MMLU, on whatever, on MATH and GSM8K. What if we change the prompt a bit, and instead of telling the model, you have this question, you have to answer A, B, C, or D,

what if we tell the model it has to answer A, B, C, or D, or "I don't know"? And maybe the accuracy will be a bit lower, but I'm curious to see if some models have different calibrations, where maybe model A has 50% correct, model B has 50% correct, but model A answered 100% of the questions.

So 50% are not correct. Model B actually answered only 60% of the questions, so 40% of the time it said, I don't know. I prefer model B. And we are not reflecting that in evaluations. - I think this is very relevant for post-training in particular, because it seems that the general consensus is that base models are more calibrated than post-trained models, right?

Something like that. - Exactly. - That seems to be the research from OpenAI as well. I don't know the degree of this, and maybe we can invert it, right? Maybe post-training can help to increase calibration rather than decrease it. I feel like this is a little bit of being too similar to humans, because humans are not calibrated very well.
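
A sketch of the calibration-aware scoring being described: give the model an "I don't know" option and report both coverage (how often it answers) and accuracy on the answered subset, instead of one flat accuracy number. The toy data mirrors the 50%/60% example above; everything here is illustrative.

```python
# Calibration-aware scoring: models may answer A/B/C/D or abstain with "I don't know".
# Report coverage and accuracy-when-answering instead of a single flat accuracy.

IDK = "I don't know"

def calibrated_report(predictions, gold):
    answered = [(p, g) for p, g in zip(predictions, gold) if p != IDK]
    coverage = len(answered) / len(gold)
    acc_when_answering = sum(p == g for p, g in answered) / len(answered) if answered else 0.0
    return coverage, acc_when_answering

gold    = ["A", "B", "C", "D", "A", "B", "C", "D", "A", "B"]
model_a = ["A", "B", "C", "D", "A", "C", "D", "A", "B", "C"]          # answers everything, half wrong
model_b = ["A", "B", "C", "D", "A", "C", IDK, IDK, IDK, IDK]          # abstains when unsure

print(calibrated_report(model_a, gold))   # (1.0, 0.5)   -> 100% answered, 50% of answers correct
print(calibrated_report(model_b, gold))   # (0.6, ~0.83) -> 60% answered, most of those correct
```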

- Yeah, and that's the goal of post-training, I think, to make models more calibrated, to not be biased toward answering A, B, C, or D as often as possible, to follow the uniform distribution. - And on the structured output tool calling side, do you think that it's not an explicit part of the evals?

Obviously, you worked on Toolformer and the language model augmentation work. Do you encourage the open-source community to fine-tune Llama 3 to do tool calling, or do you want to just have that in the model from day one? - We have that from day one. Good news for the community. We are state-of-the-art there.

I think the model will be pretty good at that. We have a lot of gems about tools in the paper, but the model is fine-tuned to do tool usage, to do zero-shot function calling. There are some system prompts - if you tell the model to use search, or image generation, it can do a lot of stuff, like code execution as well, even in a multi-message way, so almost multi-step agents, which kind of sparks our agent work.
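
To make "zero-shot function calling" concrete, here is what such an interaction typically looks like: the system prompt declares the available tools and the model replies with a structured call that an orchestrator executes. The JSON schema and tool names below are a generic illustration, not the official Llama 3 tool-calling format (the release paper documents that).

```python
import json

# Generic zero-shot function-calling flow; the schema is illustrative, not Llama 3's official format.
SYSTEM_PROMPT = """You have access to these tools:
- search(query: str): search the web
- python(code: str): execute Python code
When a tool is needed, reply ONLY with JSON: {"tool": <name>, "arguments": {...}}"""

TOOLS = {
    "search": lambda query: f"(top results for {query!r})",
    "python": lambda code: str(eval(code, {"__builtins__": {}})),   # toy executor, not sandboxed
}

def run_turn(model_reply: str) -> str:
    """Execute the tool call if the model emitted one; otherwise pass the text through."""
    try:
        call = json.loads(model_reply)
        return TOOLS[call["tool"]](**call["arguments"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return model_reply                                          # plain answer, no tool needed

print(run_turn('{"tool": "python", "arguments": {"code": "10 + 10"}}'))   # -> 20
```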

- Okay, you talked about agents, so I guess we should probably mention the work on agent stuff. And you also, in our pre-conversation, mentioned that you're already starting work on Llama 4. What do agents have to do with Llama 4? How does your work on GAIA inform all this work? - Yeah, so we published GAIA, the General Assistant benchmark, one year ago.

That followed a direction I really like pursuing. I mean, everyone passionate about AI and trying to build Jarvis will go there. So I did Toolformer and the survey on augmented models. In fact, reflecting back, I was like, okay, we have Galactica, we have Llama 1, we have Toolformer, and there's GPT-3.5 out there at the time.

If you don't have a good instruct model to follow instructions, the extension and the future of Toolformer is limited. So we needed to work on that, and we did Llama 2, and then now Llama 3. And it's very interesting: on the General Assistant Benchmark, so GAIA, agents powered by language models perform at zero with GPT-3.5, and at something very significant, like 30, 40, 60%, with GPT-4.

So there's a gap of intelligence here. And I think this gap of intelligence, this threshold that you pass in terms of zero-shot function calling, following complex instructions that can span over a page of constraints - those things that make the agents of nowadays, with ReAct loops, pre-planning, multi-step reasoning, function calling, work in practice - is this gap of intelligence.

So now that we have Llama 3, I'll be back to agents. I expect some incremental and significant progress on pre-planning, post-planning, but I'm really hopeful that we can gain some order of magnitude of scaling by interconnecting world models into agents, as a more complex system that can do planning, that can do backtracking, that can take actions, navigate the web, execute code.

- Okay, there's a lot there. When you say integrating world models, is there anything from JEPA? Is that something that we're talking about or is that a different line of research? - No, not directly. That's the same goal, I would say, but JEPA is very, very fundamental research, which has some promising early results.

And what I was looking at right now is the state-of-the-art results on GAIA. There's a leaderboard, by the way - you mentioned Clémentine before; she contributed to GAIA as well, and Hugging Face put a leaderboard there on their website. There are some state-of-the-art results. What is interesting is that GPT-4 alone has 0%, or like 5%, I think, on level 1 - there are three levels of difficulty.

But OS-Copilot, and AutoGen from Microsoft, and recently the Hugging Face agent, obtained up to 60% on level 1. So connecting an LLM to an agent that can do all those things moves new capabilities much further forward. This is kind of a breakthrough. And those models are purely based on instruction-tuned models, following instructions, where you have an orchestrator, and you say to your LLM, okay, this is your task, you have access to these tools, you can navigate the web, can you do a plan of what you should do?

And then, okay, that's the plan. Now, execute the first step. Did you manage to succeed on the first step? Or do you want to rethink your plan because you ran into a dilemma? And you have all this orchestration by system prompting, instruction following, and just that, which is quite suboptimal, and probably you need to go later into latent space and more JEPA style.
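
A skeletal version of the plan-execute-replan orchestration just described, driven purely by prompting an instruction-following model. The prompts, step format, and stopping rule are an illustrative sketch of the pattern, not any particular framework's API.

```python
# Skeletal plan/execute/replan agent loop driven purely by system prompting and
# instruction following. `generate` is any chat call, `execute` runs a tool/action;
# prompts and formats are illustrative.

def run_agent(task: str, generate, execute, tools_description: str, max_steps: int = 10) -> str:
    plan = generate(
        f"Task: {task}\nYou can use these tools:\n{tools_description}\n"
        "Write a numbered plan of the steps you will take."
    )
    history = []
    for _ in range(max_steps):
        action = generate(
            f"Task: {task}\nPlan:\n{plan}\nProgress so far:\n{history}\n"
            "Execute the next step. If the task is complete, start your reply with DONE:"
        )
        if action.startswith("DONE:"):
            return action[len("DONE:"):].strip()
        observation = execute(action)             # run the tool call, code, or web navigation
        history.append((action, observation))
        plan = generate(                          # let the model backtrack/replan after a dead end
            f"Task: {task}\nOld plan:\n{plan}\nLatest result:\n{observation}\n"
            "Revise the plan if needed, otherwise repeat it."
        )
    return "Stopped after max_steps without finishing."
```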

But just that is getting us to some really impressive results already. - And do you see the planning and review as always being needed in the future? This is kind of like Andrej Karpathy's idea of more tokens equals more thinking. So the more you're having it write tokens and think about the outcome, the better the result you're probably going to get to.

Do you think that's always gonna be the case? Or that in the future, like the model, you can just say this is the task and then I'll just return the answer directly and do all of that in the latent space, so to speak? - Right. I think in the future, it should be, it should hopefully go more as this is a task and I return it.

But we need to teach the model that, to train that, which is far from now. A very medium- to long-term direction that could be really relevant here is thinking in latent space. I know some early works are doing that. And that's a way probably to move to. First you think, and then you don't have to write all the tokens.

Like it's in your head. It doesn't have to be as constrained as plain-text LLM output. And once you have done your thoughts, you can just write the final answer or take an action. - Just a commentary on that. Anthropic actually cheats at this right now. If you look at the system prompt for Claude Artifacts, they actually have a thinking section that is explicitly removed from the output, which is, I mean, they're still spending the tokens, but that is before training; at the prompting level, you can simulate this.

And then at ICLR, there was the pause token, the backtrack token. I feel like all these are token-level stopgap measures. I feel like it's still not the final form. Like we still need to have, at the architecture level, some kind of variable inference-length thing that lets you actually think in latent space like you're talking about.

I don't know if there's any papers that you're thinking about. - No, but that's interesting because that's what we said at the beginning of the discussion. If you remember, like we are lacking the flexibility for pre-training architecture transformers, where we spend the same amount of compute per token. And so because of that, how can you like mitigate this by generating more tokens?

So more thoughts, more compute, because you have only access to this dimension. Ideally, you want an architecture that will enable naturally to make this emerge, basically. - Any papers come to mind there that you would recommend people read, or this is like completely new science that we have to do.

- No, I mean, it's early science. I don't know any work that managed to get there. I know, for instance, the Universal Transformer had this idea of a number n, where you can compute on the layer n times, with n being decided by the architecture itself with respect to the complexity of the token.

I think there's a paper from DeepMind on mixture of experts with, like, a skippable layer. Mixture of... is it this one? - Mixture of depths. - I don't... I'm not sure it's this one, maybe. But basically the idea was that with a mixture of experts, you have an expert that is an identity matrix that you can skip.

And so you can, but you know, it's early work, very preliminary work. For instance, I haven't yet seen a lot of work putting the compute of generating a token into the loss. That's going to be interesting when we start to do that. - I know we're getting up on time, but we had just a few more questions.

We definitely want to ask you. So as you think about - there were reports that Llama 4 started training again in June - if you think about the evolution of the models, I think up until Llama 3, you know, with Meta AI and some of these things, it makes sense that they want to build their own models, and their multimodal models.

Sounds like Llama4, maybe a lot of the focus will also be a more agentic behavior and have all of this. I'm curious, like at what point it's like, okay, this is a research direction that we still want to take, even though, you know, it doesn't fit right into the product.

Like what's that discussion internally about what to focus on as you keep scaling these models? - Yeah, I think it's a balance, you know, between, well, we want to be number one. Mark wants to be number one there. And there's this understanding also that, you know, this is a critical technology in the future.

And even if nowadays that research is not directly intersecting product, we don't want to be late to the game, as we were in the past. So that's the first thing. The second thing is, we think that this technology will change the world. We want to work towards AGI, and AGI will change the world.

And if Meta develops an AGI, it will probably intersect pretty easily with the products. Now, the first thing is, with that in mind, we have to balance with product needs. And there's always this ongoing discussion, this balance to find, between a flagship model and maybe a model that will be more adapted to product needs.

And it doesn't have to be decorrelated. As I said before, you can also leverage the big models to distill some capabilities into a smaller one that will maybe be more suited to the product. There's always this back and forth. There's also the fact that the product feeds ideas back to the research: evaluations that are grounded in actual use cases, so that we can also measure ourselves with respect to whether there is some progress, or whether it's just on an academic benchmark.

- So one thing, before we transition off: I think there's a hidden side of these LLMs that most people don't think about, which is the tokenizer and the vocab size, especially of these models. So Llama 3 has a 128K-token vocab tokenizer. GPT-4 was 100K, GPT-4o is 200K. How should people think about the impact that it has?

So basically, I mean, the TL;DR is that in the vocab, you have these kinds of concepts represented as tokens. So usually the larger the vocab size, the more nuanced the model can be about thinking about different things. What are the scaling laws of those tokenizers? You know, is 128K kind of very large and it doesn't really matter?

Like, do you want to double it? Any thoughts there would be great. - There are a lot of dimensions to take into account here. I think the first, obvious thing to say is that Llama 3, compared to Llama 2, is multilingual, has multilingual capabilities. We worked on that. And so, because you have languages that are not just Latin languages like English, there are a lot of different characters.

You want to include them to represent the special words there, and so you need a bigger vocabulary size. That's the obvious thing, and it's also probably why GPT-4o has a much bigger vocabulary, as it's naturally multilingual and multimodal with speech. So that's why we went from a 32K to a 128K vocabulary size.

The interesting thing to discuss about the tokenizer, I think, is the scaling laws related to it. If you increase your vocab size, well, you have a bigger embedding matrix, which takes longer to compute. It depends on the model size, but for a small model it has a much bigger impact than for a bigger model.

Put another way, the 128K vocabulary size is the same for the 8B, 70B, or 405B, but as a percentage of the total number of weights it is much larger for the 8B than for the 405B. So it has more impact on training speed for the smaller models.
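A quick back-of-the-envelope makes the point; the hidden sizes below are the published Llama 3 values, used here as assumptions:

```python
# Rough illustration: the input embedding matrix is vocab_size * d_model parameters,
# a far bigger share of an 8B model than of a 405B one. Hidden sizes are the
# published Llama 3 values (4096 / 8192 / 16384); treat this as an approximation.
VOCAB = 128_256
models = {"8B": (4096, 8e9), "70B": (8192, 70e9), "405B": (16384, 405e9)}
for name, (d_model, total_params) in models.items():
    emb = VOCAB * d_model                        # input embeddings alone
    print(f"{name}: ~{emb / 1e6:.0f}M embedding params, {emb / total_params:.1%} of total")
# 8B:   ~525M  params, roughly 6.6% of the total
# 405B: ~2101M params, roughly 0.5% of the total
```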

But what is interesting is that with a bigger vocabulary, for the same text, you have fewer tokens, right? And so you can train your model on the same amount of knowledge with fewer steps. So for the same compute, the model can see more knowledge, if you're not epoching over the data. That's one cool thing.
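Here's a tiny way to measure that effect yourself, using OpenAI's public tokenizers via tiktoken as a stand-in (the Llama tokenizers aren't bundled there); the savings from a larger vocab are usually biggest on multilingual text and code:

```python
# Compare how many tokens two vocab sizes need for the same text; illustrative only.
import tiktoken

small = tiktoken.get_encoding("cl100k_base")   # ~100K vocab (GPT-4)
large = tiktoken.get_encoding("o200k_base")    # ~200K vocab (GPT-4o)

text = (
    "Training large language models on multilingual web data means the tokenizer "
    "has to cover many scripts and many domains, not just English prose. "
) * 20
print("100K vocab:", len(small.encode(text)), "tokens")
print("200K vocab:", len(large.encode(text)), "tokens")
```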

The second thing is at inference time: you know that the context length is not measured in the size of the text, but in the number of tokens. And so you can compress more, so that now with the bigger tokenizer, 128K, more vocabulary, you can fit longer text into the same number of tokens.

8K, basically, or 128K, with this tokenizer now means roughly 30% fewer tokens to encode the same text. - How are tokenizer vocabs built? I actually don't know that. What's the work that goes into it? And why are people using smaller ones? Is it harder to make them? Or is it just about some of the things you mentioned around scaling the training?

- Oh, there are different methods, but one became quite standard, although it could change in the future. - BPE? - Yeah, exactly. - Well, BPE is for text. I don't know about multimodal vocabs; I haven't read anything about that. - Yeah, let's leave that question. I'm not an expert there.
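For the curious, a toy version of BPE training fits in a few lines: start from characters and repeatedly merge the most frequent adjacent pair until the vocabulary is full. Real tokenizers add byte-level fallback and regex pre-splitting and train on enormous corpora, so this is only a sketch:

```python
# Toy BPE trainer: greedily merge the most frequent adjacent symbol pair.
from collections import Counter

def train_bpe(corpus, vocab_size):
    words = [list(w) for w in corpus]                 # start from characters
    vocab = {c for w in words for c in w}
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))               # count adjacent pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        vocab.add(a + b)
        new_words = []
        for w in words:                               # apply the merge everywhere
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return vocab, merges

vocab, merges = train_bpe(["lower", "lowest", "newer", "wider"] * 10, vocab_size=20)
print(sorted(vocab))
print(merges)
```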

And I don't remember exactly what they ended up doing. - Now that you're saying this, right, okay: so now we have 100K vocabs, 200K vocabs. Do we see a million-token vocab? Do we go to infinity, which is no tokenizer at all? What's the natural limit of tokenization? - Yeah, that's a good question.

I don't know. I think there's a limit that will grow with the model size. So bigger models mean possibly a bigger vocabulary without affecting training too much. But yeah, that's not my domain of expertise, but a lot of people are discussing whether it even makes sense to have this kind of tokenizer, which doesn't feel natural.

Could we go to a character-level tokenizer? Could we actually go to a multimodal tokenizer, which would decompose things at the pixel level? I don't know. Those are future directions that could be very promising. - Yeah, I would say the diffusion people have actually started to swing back to the pixel level. And probably that will presage the language people also moving towards a one-million-token vocabulary, and then whatever the natural limit is for character level.

- I think we can maybe transition towards some of your personal stuff. We've kept you here for a long time. Also, this is a very distributed podcast: I'm in the Bay Area, you're in France, Sean is in Singapore. So everybody is in a different time zone.

You also do some startup investing and advising. You know, when we had Soumith Chintala on the podcast, he also mentioned he always enjoys working with founders and researchers. Any companies you're involved with that you want to shout out that you think are super promising? Requests for startups you've had? Anything around that space would be awesome.

- Two cool companies I can think of now. One is Lindy, which is based in the Bay Area, with Flo Crivello. A very cool one. - Yeah, he's a good friend, Flo. Why do you like it? - Flo is really good, and he's a Frenchman, I guess. And number two, very recently, I really liked OpenDevin, which is basically trying to reproduce Devin.

- We interviewed him at ICLR. Both are agent startups. What do you think is the direction startups should be working on, agent-wise, and maybe what is not working? - That's a tough question. One thing I say quite often is that deep learning has this peculiarity that makes it challenging to predict: it's a self-destructive technology.

Think of Grammarly: a startup where you plug and play and it corrects your grammatical errors. Everyone told them, guys, deep learning creates a barrier to entry, annotate data, create data. And they had a lot of data for that. And then the next day, with the same exact technology, deep learning, someone comes along with ChatGPT and tells them, yeah, I can do the same thing, better, and so many other things.

Zero barrier to entry from yesterday to today. And what is crazy here is that it's based on the same technology. And so there are a lot of people working nowadays to try to mitigate issues with the current generation of models. And I'm telling them: always assume the next generation will be better.

So if your business will benefit from a new generation with better abilities, that's a good business. If your business may be replaced, and all the work you've done may vanish and be wasted because there are better models, then maybe change course. - Yeah, I mean, yes, but "better" is so unpredictable.

If you'd asked me before, let's say, March of this year, I would have said that maybe voice chat was still very defensible. And then suddenly OpenAI demoed their real-time voice thing, which is natively multimodal. It's easy to fail to anticipate the dimension along which things get better; finding one that resists is harder.

I would say, in general, assume you will have progress everywhere. It may not turn out to be right, but it's a bit dangerous to bet against that. - Is there any space you think is overrated by founders, where they're trying to build something that either the new models are just going to do, or where you just don't think there's that much interest?

- It's a challenging time for founders, but it's very exciting. There's a lot of funding, a lot of applications as well, a lot of stuff to build. That's pretty cool. But what is hard is that, because this technology is moving so fast, what I see succeeding now is a lot of the fundamental stack; those are the unicorns of today.

Foundation models, foundational things like clusters, data annotation, things like that. There's a lot there; what's less successful yet, for now at least, are the application companies. And it's hard to build an application when things change so fast, as we discussed before. So it's crowded, and yet we haven't found a good use case, the company that becomes the new big thing there.

And I want to see it. - Yeah, we definitely see the same thing. All of our agent companies, or at least the ones building agents, are the ones getting the most traction. Most other companies are like, hey, I actually don't have that much expertise, and I'm just waiting for the models to get better.

So I'm not really sure if I need this now. It's an interesting time to be an investor. Anything else we missed? This was kind of a masterclass in how to build state-of-the-art LLMs, so it's going to be a highly played episode, I'm sure. Any final thoughts you want to share?

- There are two things I guess I can say. One is that Llama is hiring talent worldwide. And two, you can contact me, reach out to me on LinkedIn; I'm looking for Gen AI technology and founders that will create the future. - Okay. For hiring, is there one role where you're like, man, we really need this kind of person?

If you describe it, that person will be referred to you, right? Because we're trying to broadcast it to the whole world. - Researchers with good common sense and first-principles thinking, not necessarily huge expertise on LLMs, but being super rigorous, meticulous, and structured. - Awesome. Thank you again for coming on. I hope everybody gets to enjoy Llama 3 today since it just came out, and we'll have you back for Llama 4.

(upbeat music)