Training Llama 2, 3 & 4: The Path to Open Source AGI — with Thomas Scialom of Meta AI
Chapters
0:00 Introductions
4:16 The Llama Origin Story
7:34 Are there RLHF scaling laws?
9:56 Avoiding the "Chinchilla trap"
12:15 Why 405B?
14:27 FP8 training and other scaling research
17:48 Llama 3 vs Llama 2
18:32 Synthetic data for pre-training
21:43 Tool use to generate synthetic data
22:40 Pre-training data recipe
26:00 Why not MoE?
27:05 Why RLHF is so important
37:06 How they eval models
41:50 Benchmarking Uncertainty
44:04 Structured output and tool calling
45:52 Llama 4 & Agents
52:01 Will Meta keep releasing open models?
53:55 Why tokenizer vocab size is underrated
59:12 AI & Startups
63:13 Hiring at Meta AI
- Hey, and today we have a very special episode 00:00:33.080 |
Is that the official size number that we're going with? 00:00:42.080 |
for the multimodal version that will come later. 00:00:50.640 |
and it looks like you did five years in quant finance, 00:00:55.240 |
And then you transitioned into natural language, 00:01:05.600 |
But basically I think it's at the AlphaGo moment 00:01:20.240 |
at the end of which I knew, like, what XGBoost was at the time 00:01:25.680 |
So, and most of the people around were like PhD people. 00:01:44.560 |
And so I did my PhD with Recital and Sorbonne University 00:01:48.120 |
on natural language generation with reinforcement learning. 00:01:56.240 |
I've had a company that offered me like this topic 00:01:59.960 |
and it was something like I started two weeks before BERT. 00:02:03.960 |
Yeah, we actually also just released our episode 00:02:06.040 |
with Clémentine Fourrier, who also did her PhD 00:02:09.120 |
with a company in kind of like a very similar format. 00:02:11.720 |
I think, yeah, very underrated, very underrated. 00:02:16.920 |
because you're also like publishing papers the whole time. 00:02:33.320 |
but I should have, I mean, papers have like 10, 50 citations. 00:02:44.600 |
but like a discriminator, which is synthetic, 00:02:51.000 |
because like all the inspiration for this paper 00:02:54.000 |
was actually from the original OpenAI paper on RLHF. 00:03:12.480 |
- Well, I think your progress into NLP was like really strong 00:03:16.040 |
'cause like the first thing you worked on at Meta was Bloom. 00:03:24.120 |
but it was at the intersection of multilinguality, 00:03:26.880 |
which was very important to me, large language modeling. 00:03:33.200 |
at Meta and the team I was working on was Galactica. 00:03:36.120 |
And actually interesting step back from Bloom 00:03:44.760 |
but like trying to scale towards like multilinguality. 00:03:48.400 |
In fact, we learned later that multilinguality 00:03:51.440 |
almost emerged naturally with very, very little data, 00:04:08.920 |
especially if they're the same family, right? 00:04:10.600 |
Like, yeah, so maybe we should get right into Llama 2, 00:04:16.040 |
So like what is the story of Llama 2 from your point of view? 00:04:19.560 |
- Yeah, so as I was saying, I started at Meta on Galactica. 00:04:24.080 |
That was one of the first large language models at Meta. 00:04:28.640 |
We released it in, I think, December or end of November, 00:04:38.400 |
both with people like thinking it's the end of science 00:04:41.280 |
and like that with a lot of hallucination papers, 00:04:47.560 |
but you know, we didn't do like instruction tuning 00:04:52.200 |
It was a weird moment because two weeks later, 00:04:54.560 |
ChatGPT came out and that's a moment where like, 00:04:57.920 |
I think all the tech companies went upside down 00:05:04.720 |
to now work on that and make a ChatGPT as soon as possible. 00:05:07.640 |
So we had this one, two months of like what to do. 00:05:11.080 |
I actually was working on Galactica Instruct, 00:05:18.720 |
the Google Doc of like scientists where you can write papers 00:05:52.720 |
- Like I was just saying like today it's not solved 00:06:14.600 |
but because the way Galactica was trained to cite papers 00:06:19.960 |
that's what made it emerge so easily at instruction tuning time. 00:06:26.400 |
was the first annotation project for RLHF at Meta. 00:06:30.400 |
That was a follow-up of Galactica that we were preparing. 00:06:34.600 |
my friends from the Paris office created Llama 1. 00:06:38.760 |
It's like to connect the dots with what we said before. 00:06:41.760 |
The last author was Guillaume Lample who founded Mistral. 00:06:45.920 |
who worked with me on Llama 2, still at Meta. 00:06:59.240 |
We had all the support from the company leadership. 00:07:15.160 |
and chat models that will follow instructions. 00:07:22.480 |
So you had some intuition from there we could use. 00:07:26.960 |
and that was probably the biggest challenge for us, 00:07:34.720 |
- Can you describe what scale you're talking? 00:07:36.280 |
- Yeah, yeah, to what level of annotation to scale. 00:08:08.680 |
So Llama 1, you had 7B, 13B, 33B, 65B model sizes, 00:08:17.960 |
How do you kind of evaluate what's worth training, 00:08:28.560 |
especially when you're maybe annotation constrained 00:08:35.320 |
There's so many parameters to take into account 00:08:43.200 |
The GPU constraints and on what different hardware, 00:08:45.880 |
and we think about Meta, but also about the community, 00:08:51.640 |
but there's 800, there's different sizes of GPU memory. 00:08:59.560 |
Also at inference time, not just at fine tuning time, 00:09:02.200 |
then you can maybe do some tricks at inference time 00:09:08.760 |
All those constraints make it very, very challenging. 00:09:08.760 |
In general, we tend to think in particular for Llama 3. 00:09:19.360 |
It's also because the project was taking some routes 00:09:30.480 |
to reproducing Chinchilla, which was a 70B. 00:09:46.560 |
Given the scaling laws and the amount of tokens 00:09:48.400 |
we have to train it, what would be the right balance 00:09:57.480 |
And you mentioned Chinchilla is the best way to go, 00:10:02.600 |
if you want your model to be used by billions of people." 00:10:11.400 |
and then people kind of got the Llama scaling law, 00:10:13.800 |
like the 100 to 200X kind of parameter to token ratio. 00:10:25.000 |
So, as you said, this Kaplan paper on scaling laws, 00:10:28.360 |
but they figured out, basically they tried two dimensions. 00:10:32.360 |
The model weights and the amount of training time, 00:10:37.040 |
like number of steps, training tokens, epochs. 00:10:40.000 |
And for that, they figured that model size is what matters. 00:10:46.240 |
compared to the actual number of training tokens 00:10:48.560 |
because they made a mistake by not adapting the scheduler. 00:10:51.440 |
That's what Chinchilla emphasized and discovered. 00:11:05.000 |
and emphasize much more the importance of training tokens. 00:11:11.960 |
basically you need to double the number of training tokens 00:11:22.800 |
you will end with the best results in your paper. 00:11:26.920 |
That is good if you want the best flagship model 00:11:29.440 |
that obtains the highest performance on your paper. 00:11:31.880 |
But if you want to use your model at inference time, 00:11:40.240 |
but one drops the number of tokens you train it, 00:11:43.920 |
And so to be compute efficient at inference time, 00:11:46.920 |
it's much better to train it for a much longer time, 00:11:49.600 |
even if it's an effort, an additional effort, 00:11:53.920 |
That's what I call, what I refer to as, the Chinchilla trap. 00:12:17.480 |
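A rough back-of-the-envelope sketch of the trade-off just described. The ~20 tokens-per-parameter rule is the Chinchilla heuristic and C ≈ 6·N·D is a standard FLOPs approximation; the concrete numbers below are illustrative, not Meta's actual planning figures.

```python
# Illustrative arithmetic only, not Meta's planning code: compare the
# Chinchilla compute-optimal heuristic (~20 training tokens per parameter)
# with the "overtrained" ratios discussed above (100-200x and beyond).

def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Chinchilla rule of thumb: D is roughly 20 * N."""
    return tokens_per_param * n_params

def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard approximation: C is roughly 6 * N * D FLOPs."""
    return 6.0 * n_params * n_tokens

n = 8e9                                    # an 8B-parameter model
d_optimal = chinchilla_optimal_tokens(n)   # ~160B tokens
d_overtrained = 200 * n                    # ~1.6T tokens, inference-friendly regime

print(f"Chinchilla-optimal: {d_optimal:.1e} tokens, {training_flops(n, d_optimal):.1e} FLOPs")
print(f"Overtrained:        {d_overtrained:.1e} tokens, {training_flops(n, d_overtrained):.1e} FLOPs")
# The overtrained run costs ~10x more training compute, but every one of the
# billions of inference calls is then served by the same small, cheap model.
```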
What else went into the Llama 3 kind of planning? 00:12:24.120 |
In Llama 3, you dropped like the intermediate weights. 00:12:31.680 |
I mean, you talked about the hardware capabilities 00:12:33.720 |
at inference, like I can now run a 405B model at home 00:12:37.920 |
And it might be hard to even get the cloud resources 00:13:10.280 |
There's also like what compute we had at the time 00:13:16.560 |
but as like Mark announced, we have more and more GPUs. 00:13:22.800 |
Now, maybe let me reflect on two things he said. 00:13:36.480 |
Second thing is, I'm hopeful that the community 00:13:39.720 |
will lead to a lot of findings by open sourcing it. 00:13:42.560 |
And there are smart ways to actually make use of it 00:13:52.440 |
And after two weeks, it was running on a Raspberry Pi. 00:13:58.720 |
And by releasing those models, we are enabling that. 00:14:02.320 |
Now, the last thing I want to add is having bigger models 00:14:10.080 |
because that's the model we use for the annotation. 00:14:20.040 |
of the smaller models we are releasing with Llama 3 00:14:27.560 |
- Yeah, there's a lot of really good info there. 00:14:29.760 |
One thing I'll just briefly touch on for quantization, 00:14:39.160 |
And he was talking about sort of native FP8 training. 00:14:43.880 |
It seems like that is most useful for inference. 00:14:47.000 |
That is what you expect the open source community 00:14:48.960 |
to do with your weights once you release them anyway. 00:14:55.680 |
or whatever other new format is in vogue these days? 00:15:05.200 |
on like just zero, one, or minus one weights. 00:15:15.680 |
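For illustration, a minimal sketch of the kind of post-release weight quantization the community typically applies to open weights: a generic symmetric int8 round-trip in NumPy, not Meta's pipeline nor any specific library's implementation.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = (np.random.randn(4096, 4096) * 0.02).astype(np.float32)  # fake FP32 weights
q, scale = quantize_int8(w)                                   # 4x smaller to store
w_hat = dequantize(q, scale)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```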
there's also the possibility for the community 00:15:22.560 |
So I'm really looking forward to what the community 00:15:33.960 |
is by having better algorithms that we can train 00:15:52.520 |
but to what extent, how far we can go, you know? 00:15:58.880 |
what we had two, three years ago in 32- or 64-bit, 00:16:18.520 |
but it doesn't actually pan out, it doesn't scale. 00:16:21.600 |
It's so hard for researchers, at least for me, 00:16:40.120 |
And are we like losing maybe some gems that are not just, 00:16:45.360 |
but because there's too much research around. 00:16:53.720 |
compared to probably what Yann LeCun and the others had 00:16:59.360 |
I do think that FAIR is putting out incredible research. 00:17:03.600 |
Probably, it doesn't seem like it's your group, 00:17:08.920 |
which on the small model side is really good research 00:17:14.880 |
It looks like Hugging Face is also replicating it, 00:17:18.720 |
There's a lot of ideas on shared weights and shared matrices 00:17:21.920 |
and model architecture stuff that we can talk about 00:17:27.960 |
but it seems like one of the big themes of this year 00:17:33.240 |
small models that are good enough for daily use. 00:17:45.200 |
- It will be released the day, I think, of the release. 00:17:50.080 |
What are the major choices of Llama 3 versus Llama 2? 00:17:53.640 |
- There's not a lot of changes in terms of architectures. 00:17:57.440 |
I think we can do a lot better in the future, 00:18:01.920 |
but, for instance, to me, it doesn't make sense 00:18:07.480 |
Like, there's an architectural lack of flexibility. 00:18:11.560 |
but still, that's the best thing we have for now. 00:18:17.280 |
in terms of architectures and training, than Llama 2, 00:18:20.560 |
but we put so much effort on scaling the data, 00:18:35.240 |
is that you used Llama 2 to do the data cleaning 00:18:47.080 |
about using Mistral to make training data better. 00:18:56.240 |
I'm sure people would love to hear more about it. 00:18:58.520 |
- Right, so it's a very interesting research direction. 00:19:05.480 |
My intuition is that the web is full of shit, 00:19:13.160 |
and training on those tokens is a waste of compute. 00:19:15.880 |
Just having a good classifier that labels that is cool, 00:19:29.400 |
and select what are the good tokens and the bad tokens. 00:19:48.200 |
- To me, I'm not exactly sure what you guys did, 00:19:51.120 |
but I feel like when people say synthetic data, 00:19:54.400 |
there needs to be different categories of synthetic data now 00:19:57.160 |
because I think there's so many different usage 00:20:00.800 |
But specifically synthetic data for pre-training, 00:20:02.800 |
it feels almost like you're running multiple epochs 00:20:13.600 |
And in my mind, it's very similar to computer vision, 00:20:15.880 |
where you do data augmentation on an item, right? 00:20:20.760 |
That's the less cool name for synthetic data. 00:20:24.520 |
I totally agree with you related to pre-training, 00:20:33.320 |
on synthetic data that I'm personally excited. 00:20:35.960 |
Like, for instance, what I'm excited about is 00:20:38.840 |
we had this survey on augmented LLMs a year ago, 00:20:45.480 |
it can be a retriever, it can be search, it can be a tool, 00:20:48.520 |
it can be a calculator, it can be a code execution. 00:20:54.800 |
like doing some data augmentation with your model, 00:20:58.120 |
but you're actually adding some expert skills 00:21:15.040 |
If your model didn't know something about LLM 2, 00:21:28.080 |
targeting directly the weakness of the model. 00:21:30.160 |
It's like continual augmentation of the language model, 00:21:37.440 |
Like, are you teaching it to use tools to augment the model, 00:21:50.960 |
- What I said is more like in terms of directions, 00:21:54.800 |
when it knows how to use a tool and correct itself, 00:22:06.680 |
People are saying like, we are lacking tokens. 00:22:10.920 |
where the model always go to correct its own weakness, 00:22:17.040 |
Okay, that's an easy example, probably the model knows, 00:22:18.760 |
but imagine for something more complex, 10 plus 10. 00:22:25.920 |
which is easy for a basic agent now, powered by LLM. 00:22:29.640 |
And then you verified with respect to what you expected, 00:22:34.000 |
If it's not, you can back propagate this example 00:22:41.840 |
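A toy sketch of the verify-and-collect loop being described: the model answers, a trusted tool checks the answer, and failures become new training examples aimed at that exact weakness. The stand-in "model" and all names are hypothetical.

```python
import operator

def tool_calculator(a: int, op: str, b: int) -> int:
    """A tool the model can trust: plain arithmetic."""
    return {"+": operator.add, "-": operator.sub, "*": operator.mul}[op](a, b)

def model_answer(prompt: str) -> int:
    """Stand-in for sampling the LLM; deliberately wrong sometimes."""
    return 21 if prompt == "10 + 10" else 0

new_training_examples = []
for a, op, b in [(10, "+", 10), (7, "*", 6)]:
    prompt = f"{a} {op} {b}"
    predicted = model_answer(prompt)
    verified = tool_calculator(a, op, b)   # verify with respect to what you expected
    if predicted != verified:
        # "Back-propagate this example": keep the corrected pair for fine-tuning.
        new_training_examples.append({"prompt": prompt, "target": str(verified)})

print(new_training_examples)  # only the cases the model currently gets wrong
```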
You know, you mentioned about just like using calculators. 00:22:47.120 |
a lot of that is just driven using code generation, 00:22:50.920 |
What are your insights on just like the data mix 00:22:56.760 |
which is something that you're also passionate about? 00:22:58.960 |
We know that that's changed between Llama 2 and Llama 3. 00:23:09.680 |
For the different size, we use the same, mostly. 00:23:20.360 |
so you have to do something while it's training. 00:23:23.440 |
I was working on my side on multimodal post-training, 00:23:25.400 |
but so the pre-training team did quite a lot of work 00:23:32.320 |
And they intersected before the end of the training. 00:23:35.640 |
- I sense a movement in terms of like the curriculum 00:23:44.720 |
Like Snowflake is doing some interesting work 00:23:46.600 |
with enterprise intelligence or whatever they call it. 00:23:53.480 |
how do you work with, like, the pre-train team? 00:23:59.880 |
of continual augmentation where it could feed back 00:24:04.440 |
One of the big continuum between pre-training 00:24:06.760 |
and post-training in particular is continual pre-training 00:24:22.640 |
that enables us to collect even better RLHF annotations after. 00:24:33.320 |
like goal was to get the best model in those dimensions. 00:24:47.200 |
from data annotation, contracts, methodology, protocols, 00:24:54.560 |
We were like not allowed also to work on that. 00:25:02.520 |
And you can see that in the following months after Llama 2, 00:25:12.200 |
but obtaining better reasoning, math, coding, chat models. 00:25:22.000 |
And one thing I'm quite proud is with the early preview release 00:25:26.160 |
we did of Llama 3 back in February, maybe, or March, 00:25:30.160 |
I don't remember, it led quickly, almost instantly, to like 00:25:36.320 |
almost competing with GPT-4 on the Arena leaderboard, 00:25:41.760 |
compare like two models and select their preference. 00:25:52.920 |
from code reasoning, multilinguality, helpfulness. 00:25:56.280 |
So that's the sign that this time, as opposed to Llama 2, 00:26:10.880 |
kind of having specific synthetic data generation things. 00:26:15.720 |
but I think like Llama had better performance overall. 00:26:25.680 |
I'm not really bullish on having like a model 00:26:29.720 |
I understand the need of having like bigger models, 00:26:34.920 |
Yeah, maybe people will not use them for inference, 00:26:41.000 |
That narrative is, I think I totally agree with that, 00:26:55.480 |
so just the architecture choice of like a very big, 00:26:59.960 |
I actually honestly thought that maybe 175 or, you know, 00:27:08.000 |
So basically I think the common question that people have 00:27:11.360 |
In a way that Mistral and the others have gone in, 00:27:14.120 |
you know, it seems like the trend has been MOEs 00:27:40.560 |
and that's a hyperparameter we'll explore in the future. 00:27:43.760 |
- Let's make sure we run through everything on post-training. 00:27:46.440 |
You also had a recent tweet about RLHF 00:27:48.720 |
versus imitation learning explained in one tweet. 00:27:53.400 |
but it's basically like two charts about doctor opinions. 00:28:05.120 |
and the physicians are kind of like, you know, 00:28:13.600 |
or slightly empathetic versus all the model responses 00:28:16.720 |
are rated very empathetic and empathetic at worst. 00:28:25.320 |
Can you run people through like some of the choices 00:28:29.920 |
for one of the two and getting the best responses? 00:28:33.080 |
- I think the tweet was about like the intuition 00:28:35.720 |
of why reinforcement learning with human feedback works. 00:28:41.680 |
I had like this budget of annotations in millions of dollars 00:28:59.560 |
and to also write themselves the answer expected from the model. 00:29:04.560 |
So then you train on that in a supervised manner, 00:29:16.320 |
with human feedback where the annotators type a prompt, 00:29:18.880 |
but this time you sample two different answers 00:29:24.320 |
And then you will train on the preference, basically, 00:29:27.920 |
When you ask to train on the preference of the model, 00:29:33.880 |
training on synthetic data generated by the model. 00:29:40.680 |
And let's annotate a bit of preference to do RLHF 00:29:56.440 |
you check what the human will have annotated as an answer. 00:30:04.200 |
I was like, oh, maybe the annotators are pretty bad. 00:30:10.640 |
And so I understood the intuition behind RLHF. 00:30:13.440 |
Those models are already super good at some tasks. 00:30:19.240 |
imagine a distribution, a Gaussian distribution, 00:30:28.560 |
And the same with medical diagnostics from a doctor. 00:30:37.080 |
and when you collect all the diagnostics from doctors, 00:30:41.000 |
There are, a lot of the time, good diagnostics, 00:30:47.600 |
On the left, you have still a bit of examples, 00:30:51.600 |
which makes curves not at zero, the distribution. 00:30:58.160 |
And so training on behavioral cloning to reflect humans, 00:31:01.880 |
the model will learn to also make some mistakes, 00:31:07.040 |
from the model time to time, reflecting humans. 00:31:13.000 |
But now, if I ask a doctor to check a sample from my model, 00:31:22.600 |
it's easy for a doctor to say which one is better. 00:31:29.440 |
and there's one bad from time to time, like humans, 00:31:33.440 |
And I ask a human to choose which one he prefers. 00:31:35.760 |
Personally, I'm really bad at creating poems. 00:31:45.240 |
take like five seconds to think about what you could come up with, 00:31:49.400 |
But yet, if I check two poems generated by a model, 00:31:57.360 |
you can have a model that flattens the bad outputs, 00:32:04.600 |
And you can even end up with superhuman abilities, 00:32:21.560 |
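A minimal sketch of the pairwise objective this kind of preference data usually feeds, a Bradley-Terry style reward-model loss: the annotator only has to pick the better answer, never write one. Illustrative PyTorch, not Meta's implementation.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the preferred answer's reward above the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores a reward model might assign to two sampled answers per prompt.
r_chosen = torch.tensor([1.2, 0.3, 2.0], requires_grad=True)
r_rejected = torch.tensor([0.7, 0.9, -0.5], requires_grad=True)
loss = preference_loss(r_chosen, r_rejected)
loss.backward()
print(loss.item())
```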
with Nathan Lambert from the Allen Institute, 00:32:26.960 |
And he mentioned one of the things that makes RLHF work 00:32:33.280 |
but they're usually very good at giving an opinion 00:32:39.440 |
of things they would never create from scratch. 00:32:42.160 |
One question actually that he asked me to ask you, 00:32:44.640 |
how much in post-training you attribute improvement 00:32:47.440 |
to the RLHF side versus the instruction fine-tuning side, 00:32:51.720 |
and maybe how you think about prioritizing the two 00:33:06.520 |
You start for Llama 2 with a pre-trained model, 00:33:09.480 |
and you have to get to an instruction model, to a chat model. 00:33:13.240 |
Otherwise, the model is just like finishing sentences. 00:33:29.080 |
The option one was, let's do human annotation again, 00:33:37.360 |
the annotation would actually be worse than Llama 2. 00:33:39.760 |
So what we did is that we generated all the data 00:33:43.400 |
and we applied basically the last round of Llama 2 we had 00:33:49.160 |
So Llama 3 post-training doesn't have any human 00:33:54.120 |
It's just leveraging pure synthetic data from Llama 2. 00:34:01.480 |
For example, you mentioned the physicians are experts. 00:34:09.080 |
or does this apply to any modality, any subject? 00:34:13.840 |
The intuition in general is like, for instance, 00:34:28.520 |
there's some preferences that they might feel like that. 00:34:30.640 |
And maybe for some other reasons that we don't know, 00:34:35.680 |
It costs less, it's easier, and it leads in general 00:34:42.320 |
We actually suggested teacher-forcing in Llama 3, 00:34:46.280 |
a new method that kind of fills a gap between, 00:34:52.000 |
Teacher-forcing is good to train the models. 00:35:02.560 |
but both are very bad in the code, for instance, 00:35:05.720 |
you will ask the human to edit the best answer 00:35:14.160 |
so that you can get out of the local minimum of your model. 00:35:28.160 |
- No, what we know is it's not plateauing yet. 00:35:32.120 |
So just infinite amounts more while, you know, 00:35:39.040 |
And so you mentioned at the start of the conversation 00:35:45.200 |
and I feel like this is very interesting to reflect on, 00:35:52.600 |
is that people thought that human interest in Go 00:36:01.840 |
where like humans and computers are actually doing better 00:36:04.840 |
than either humans or computers would alone. 00:36:09.880 |
what you're talking about, this RLHF improvement, right, 00:36:12.520 |
that we're kind of building human preference into the model 00:36:17.560 |
and the model capability is actually doing better 00:36:25.120 |
- The other thing is RLHF came from the alignment community 00:36:33.240 |
but I feel like it's like really over the past 00:36:41.800 |
are not that concerned about existential risk. 00:36:43.960 |
I always feel like it's so interesting to see this, 00:36:47.520 |
like people who take alignment super seriously, 00:36:50.080 |
they're the first to consider super alignment. 00:36:54.080 |
I'm almost thinking about this as like super quality, 00:37:00.400 |
And it's not really about alignment so much as like, 00:37:11.960 |
- Well, I don't know how much better yet it is on those, 00:37:14.400 |
but clearly it's super human on some writing skills 00:37:23.520 |
We've had some questions about the 400B details 00:37:31.720 |
Yeah, I think last time you disclosed like the evals 00:37:37.040 |
what should people know about the high level headlines 00:37:50.720 |
But by far, compared to the version originally released, 00:37:54.800 |
even now, I think there's maybe the latest Claude Sonnet 3.5 00:38:13.520 |
they are like world-class models of this size 00:38:35.640 |
Or, you know, I haven't seen the numbers yet either. 00:38:44.680 |
we just had an episode with Clémentine from Hugging Face 00:38:47.320 |
about leaderboards and arenas and evals and benchmarks 00:38:52.120 |
How do you think about evals during the training process? 00:38:57.760 |
do you already know exactly what you want to improve? 00:39:07.400 |
the post-training improvement based on benchmarks? 00:39:16.880 |
like in particular when you're trying to tackle 00:39:23.760 |
you're trying like to push numbers on a benchmark, 00:39:28.480 |
because then you don't know if you're overfitting it 00:39:30.440 |
and it will transfer to similar capabilities. 00:39:42.040 |
We tackle that playing with different methods 00:39:45.240 |
like reward models, evaluation, model as a judge, 00:39:55.200 |
That limits the possibility of hacking them, of course. 00:40:00.960 |
I do also a lot of model tests, quality analysis, 00:40:23.160 |
And a great way also to compare models is, you know, 00:40:25.600 |
through the different rounds we have done for RLHF. 00:40:32.520 |
we have the win rate between the previous model 00:40:43.120 |
And so we can calculate automatically a win rate. 00:40:46.200 |
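A minimal sketch of such an automatic win-rate computation between two model versions; the judge below is a placeholder heuristic standing in for a strong LLM or reward model, and every name is illustrative.

```python
from typing import Callable, List

def win_rate(prompts: List[str],
             model_new: Callable[[str], str],
             model_old: Callable[[str], str],
             judge: Callable[[str, str, str], int]) -> float:
    """judge(prompt, a, b) returns 1 if answer a is preferred over b, else 0."""
    wins = sum(judge(p, model_new(p), model_old(p)) for p in prompts)
    return wins / len(prompts)

# Toy stand-ins so the sketch runs end to end.
model_old = lambda p: "short answer"
model_new = lambda p: "longer, more helpful answer"
judge = lambda p, a, b: 1 if len(a) > len(b) else 0   # placeholder for an LLM judge

print(win_rate(["prompt 1", "prompt 2"], model_new, model_old, judge))  # 1.0
```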
What are areas that you had to work the hardest 00:40:54.800 |
or is performance improvement just kind of even 00:41:01.720 |
we were behind on all of them between Llama 2 and GPT-4. 00:41:06.720 |
I mean, it's different challenges every time, 00:41:22.600 |
which is, by the way, a very interesting evaluation. 00:41:27.120 |
and I don't know yet what will be the results 00:41:30.400 |
we ended very high in this blind test leaderboard. 00:41:39.720 |
but how that will transfer to perception from the community, 00:41:57.840 |
because it's a community that judge the prompts 00:42:00.440 |
and create the prompts and judge the answers. 00:42:03.040 |
We are limited, we are not like good at doing that. 00:42:36.800 |
- There's some conversation about like the math score. 00:42:40.360 |
Apparently like the next GPT-next or whatever 00:42:50.960 |
rounding out the topics on just potential models, 00:42:55.120 |
Clémentine is looking for confidence estimation 00:43:04.320 |
how do we think about evals for practical things 00:43:07.160 |
like confidence estimation, structured output, 00:43:11.360 |
- Yeah, I think we actually lack such evaluations. 00:43:14.720 |
One number I was sharing like two days ago 00:43:27.400 |
and instead of telling the model you have this question, 00:43:32.000 |
What if we tell the model you have to answer A, B, C, or D, 00:43:54.560 |
Model B actually said like, answered only 60%. 00:43:57.400 |
So for 40% of the time, it said, I don't know. 00:44:01.160 |
And we are not like reflecting that in evaluations. 00:44:16.240 |
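A sketch of the evaluation just described: multiple-choice questions with an explicit "I don't know" option, reporting accuracy on attempted questions alongside the abstention rate. Data and model outputs are made up for illustration.

```python
def score(predictions, gold):
    attempted = [(p, g) for p, g in zip(predictions, gold) if p != "IDK"]
    accuracy_when_answering = (
        sum(p == g for p, g in attempted) / len(attempted) if attempted else 0.0
    )
    abstention_rate = predictions.count("IDK") / len(predictions)
    return accuracy_when_answering, abstention_rate

gold        = ["A", "C", "B", "D", "A"]
model_a_out = ["A", "B", "B", "D", "C"]        # always answers, sometimes wrong
model_b_out = ["A", "IDK", "B", "IDK", "A"]    # abstains when unsure

print(score(model_a_out, gold))  # (0.6, 0.0)
print(score(model_b_out, gold))  # (1.0, 0.4)
```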
- That seems to be the research from OpenAI as well. 00:44:21.640 |
Maybe post-training can help to increase calibration 00:44:34.120 |
- Yeah, and that's the goal of post-training, 00:44:38.160 |
to not be biased toward answering A, B, C, or D 00:44:44.960 |
- And on the structured output tool calling side, 00:44:47.520 |
do you think that it's not an explicit part of the evals? 00:45:01.720 |
or do you want to just have that in the model from day one? 00:45:09.720 |
I think the model will be pretty good at that. 00:45:12.920 |
We have a lot of gems about tools in the paper, 00:45:16.160 |
but the model is fine-tuned to do tool usage, 00:45:44.480 |
mentioned that you're already starting work on Llama 4. 00:45:48.840 |
How does your work on GAIA inform all this work? 00:45:55.920 |
That followed a direction I really like pursuing. 00:46:03.760 |
So I did Toolformer and the survey on augmented models. 00:46:15.760 |
and there's like GPT-3.5 at the time in Llama 4. 00:46:22.800 |
the extension and the feature of Toolformer is limited. 00:46:26.520 |
So we need to work on that, and we did Llama 2, 00:47:07.640 |
work in practice is like this gap of intelligence. 00:47:14.080 |
I expect some incremental and significant progress 00:47:27.840 |
as a more complex system that can do planning, 00:48:08.080 |
and Hugging Face put a leaderboard there on their website. 00:48:14.520 |
What is interesting is like GPT-4 alone has 0%, 00:48:24.040 |
But OS-Copilot, then, and AutoGen from Microsoft, 00:48:44.920 |
on instruction tuning models, following instructions, 00:48:50.040 |
and you say to your LLM, okay, this is your task, 00:48:53.240 |
you have access to these tools, you can navigate the web, 00:49:02.360 |
Did you manage to succeed for the first step? 00:49:18.880 |
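A toy sketch of the agent loop being described: prompt an instruction-tuned model with the task and the available tools, act step by step, and ask it to check each step. Every function here is a hypothetical stand-in; real systems such as OS-Copilot or AutoGen are far more elaborate.

```python
def llm(prompt: str) -> str:
    """Stand-in for the instruction-tuned model."""
    return "ACTION: search('GAIA benchmark')"

def run_tool(action: str) -> str:
    """Stand-in for tool execution (web search, code, calculator...)."""
    return "OBSERVATION: GAIA is a benchmark for general AI assistants."

task = "Find out what the GAIA benchmark measures."
history = [f"Task: {task}", "Tools: search, browse, python"]
for step in range(3):                                  # bounded number of steps
    action = llm("\n".join(history))                   # decide the next action
    observation = run_tool(action)                     # execute it with a tool
    history += [action, observation]
    # Ask the model to check its own progress before continuing.
    verdict = llm("\n".join(history) + "\nDid you succeed at this step? yes/no")
    if "yes" in verdict.lower():
        break
print("\n".join(history))
```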
and probably you need to go later in latent space 00:49:36.960 |
So like the more you're having it write tokens 00:49:41.080 |
and like the better result you're probably gonna get to. 00:49:43.680 |
Do you think that's always gonna be the case? 00:49:49.080 |
and then I'll just return the answer directly 00:49:51.200 |
and do all of that in the latent space, so to speak? 00:49:57.360 |
it should hopefully go more as this is a task 00:50:01.640 |
But we need to teach that to the model to train that, 00:50:18.960 |
and then you don't have to write all the tokens. 00:50:27.840 |
you can just write the final answer or take an action. 00:50:34.160 |
If you look at the system prompt in Claude Artifacts, 00:50:42.520 |
which is, I mean, they're still spending the tokens, 00:50:45.000 |
but that is before training it is at the prompting level, 00:50:51.120 |
And then at ICLR, there was the pause token, 00:50:54.880 |
I feel like all these are token level stopgap measures. 00:51:01.560 |
Like we still need to have at the architecture level, 00:51:18.880 |
If you remember, like we are lacking the flexibility 00:51:24.280 |
where we spend the same amount of compute per token. 00:51:27.680 |
And so because of that, how can you like mitigate this 00:51:35.080 |
because you have only access to this dimension. 00:51:37.320 |
Ideally, you want an architecture that will enable 00:51:52.400 |
I don't know any work that managed to get there. 00:51:54.960 |
I know like, for instance, you had the universal transformer 00:52:00.600 |
and you can like compute on the layer n times 00:52:18.200 |
- I don't, I'm not sure it's this one, maybe. 00:52:23.120 |
you have an expert that is an identity matrix 00:52:28.680 |
but you know, it's early works, very preliminary works. 00:52:33.840 |
like putting the compute of generating a token into the loss. 00:52:38.000 |
That's gonna be interesting when we start to do that. 00:52:50.400 |
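A toy sketch of the idea mentioned above: a per-token router that can send a token through a compute block or straight through an identity path (the "identity expert"). This is an illustrative soft-gated version, not any published architecture; a hard gate, with its cost added to the loss, is what would actually save compute.

```python
import torch
import torch.nn as nn

class SkipOrComputeLayer(nn.Module):
    """Per token, a router decides whether to spend compute on the block
    or to pass the token through unchanged (the 'identity expert')."""
    def __init__(self, d_model: int):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                   nn.Linear(4 * d_model, d_model))
        self.router = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.router(x))   # (batch, seq, 1), in [0, 1]
        # Soft gate for the sketch; a hard 0/1 gate would actually skip the FLOPs.
        return gate * self.block(x) + (1 - gate) * x

layer = SkipOrComputeLayer(d_model=64)
tokens = torch.randn(2, 10, 64)
print(layer(tokens).shape)  # torch.Size([2, 10, 64])
```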
If you think about the evolution of the models, 00:52:53.680 |
you know, with MetaAI and some of these things, 00:53:02.000 |
will also be a more agentic behavior and have all of this. 00:53:05.000 |
I'm curious, like at what point it's like, okay, 00:53:07.240 |
this is a research direction that we still want to take, 00:53:13.760 |
about what to focus on as you keep scaling these models? 00:53:16.800 |
- Yeah, I think it's a balance, you know, between, 00:53:23.640 |
And there's this understanding also that, you know, 00:53:32.640 |
if nowadays it's not like directly intersecting product, 00:53:35.840 |
we don't want to be late in the game as we had in the past. 00:53:41.680 |
and we think that this technology will change the world. 00:53:44.480 |
We want to work towards AGI and AGI will change the world. 00:53:51.840 |
it will probably intersect pretty easily the products. 00:54:14.880 |
to distill some capabilities to a smaller one 00:54:18.320 |
that will be maybe more suited like research. 00:54:26.920 |
evaluations that are grounded in actual use cases, 00:54:29.880 |
that we can also measure ourselves with respect to, 00:54:38.400 |
I think there's the hidden side maybe of these LLMs 00:54:56.520 |
How should people think about the impact that it has? 00:54:59.320 |
So basically like, I mean, the TLDR is like in the vocab, 00:55:02.680 |
you have this kind of like concepts represented as tokens. 00:55:11.680 |
What are the scaling laws of those tokenizers? 00:55:21.760 |
- There's a lot of dimensions to take into account here. 00:55:23.800 |
I think the first thing obvious to say is Llama 3 00:55:33.680 |
that are not just Latin languages like English, 00:55:38.440 |
You want to include them to represent like special word there 00:55:42.400 |
and so you need to have a bigger vocabulary size. 00:55:58.040 |
So that's why we went from 30K to 128K vocabulary size. 00:56:03.040 |
The interesting thing I think to discuss about tokenizer 00:56:20.480 |
it has a much bigger impact on a small model than on a bigger model, 00:56:39.880 |
because the embedding weighs more compared to the total number of weights. 00:56:42.880 |
So that has more impact in terms of training speed there. 00:56:46.400 |
But what is interesting is with a bigger vocabulary, 00:56:56.560 |
on the same amount of knowledge with fewer steps. 00:57:02.120 |
you can see more knowledge if you don't epoch. 00:57:08.640 |
you know that the context length is not in amount of text, 00:57:29.760 |
now with this tokenizer it means about 30% fewer tokens to encode the same text. 00:57:39.680 |
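Rough arithmetic for the point above: if the new tokenizer needs about 30% fewer tokens for the same text, a fixed context window covers proportionally more text and a fixed training-token budget sees more raw data. All numbers below are illustrative, not Llama's actual figures.

```python
chars_per_token_old = 3.5                          # hypothetical average, smaller vocab
chars_per_token_new = chars_per_token_old / 0.7    # ~30% fewer tokens per character

context_tokens = 8192                              # a fixed context window, in tokens
print("context covers (chars), old:", int(context_tokens * chars_per_token_old))
print("context covers (chars), new:", int(context_tokens * chars_per_token_new))

train_budget_tokens = 1e13                         # a fixed pre-training token budget
print("raw text seen (chars), old:", train_budget_tokens * chars_per_token_old)
print("raw text seen (chars), new:", train_budget_tokens * chars_per_token_new)
```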
And then like, why are people using smaller ones? 00:57:44.080 |
Or is it just about some of the things you mentioned 00:58:05.760 |
And I don't remember exactly what they ended up doing. 00:58:18.560 |
You know, like what's the natural limit of tokenization? 00:58:26.440 |
that will grow with respect to the model size. 00:58:29.160 |
So bigger models means possibly bigger vocabulary 00:58:39.840 |
but a lot of people are discussing the interest 00:58:48.360 |
Could we go to actually multimodal tokenizer, 00:58:56.120 |
Future directions, that could be very promising. 00:59:00.520 |
have actually started to swing back to pixel level. 00:59:03.840 |
And probably that will presage the language people 00:59:11.280 |
and then whatever the natural limit is for character level. 00:59:22.560 |
You know, I'm in the Bay Area, you're in France, 00:59:28.680 |
You also do some startup investing and advising. 00:59:31.680 |
You know, we had Soumith Chintala on the podcast. 00:59:33.720 |
He also mentioned he always enjoys kind of working 00:59:38.120 |
Any company you're involved with that you want to shout out 00:59:51.000 |
one is Lindy, which is based in the Bay Area, 00:59:59.240 |
- Flo is really good, like he's a Frenchman, I guess. 01:00:02.200 |
And number two, very recently, I really liked Open Devin, 01:00:07.200 |
which is basically trying to reproduce Devin. 01:00:15.680 |
that startups should be working on, you know, agent-wise, 01:00:29.640 |
that it's a self-destructing, self-destructive technology. 01:00:37.400 |
you plug and play and it corrects your grammatical errors. 01:00:43.960 |
create a barrier to entry, annotate data, create data. 01:00:49.600 |
And the next day, with the same exact technology, 01:00:52.520 |
deep learning, someone comes with ChatGPT and tells them, 01:00:55.480 |
yeah, I can do the same, better, and so many other things. 01:00:58.720 |
Zero barrier to entry from yesterday to today. 01:01:06.320 |
And so there's a lot of people working nowadays 01:01:16.200 |
assume always the next generation will get better. 01:01:30.280 |
and be like wasted because there's better models, 01:01:35.160 |
- Yeah, I mean, yes, but better is so unpredictable. 01:01:38.160 |
Like if you asked me before, let's say March of this year, 01:01:48.920 |
OpenAI demoed their sort of real-time voice thing. 01:01:59.640 |
but to find another one that resisted, it's harder. 01:02:08.240 |
but it's a bit dangerous to bet against that. 01:02:11.480 |
- Is there any space that you think is overrated by founders 01:02:15.080 |
that are trying to build something that like, yeah, 01:02:18.200 |
either, you know, the new models are just gonna do, 01:02:26.520 |
There's a lot of funds, a lot of applications as well, 01:02:40.160 |
Foundational models, foundational like clusters, 01:02:52.360 |
when it changes so fast, as we discussed before. 01:03:06.440 |
- Yeah, we definitely see the same, you know, 01:03:08.440 |
all of our agent companies, or at least, you know, 01:03:10.840 |
building agents are the ones getting the most traction. 01:03:14.200 |
hey, I actually don't have that much expertise 01:03:15.840 |
and I'm just waiting for the models to get better. 01:03:28.720 |
So it's going to be a highly played episode, I'm sure. 01:03:34.800 |
- There's two things I can, I guess I can say. 01:03:41.320 |
And two, you can contact me, reach out to me on LinkedIn, 01:03:53.920 |
man, like we really need this kind of person. 01:03:56.960 |
If you describe it, that person will be referred to you. 01:04:00.680 |
Right? Like, because we're trying to broadcast it 01:04:10.920 |
but more being super rigorous, meticulous, structured. 01:04:17.320 |
and hope everybody gets to enjoy Llama 3 today