Open Source AI is AI we can Trust — with Soumith Chintala of Meta AI
Chapters
0:00 Introductions
0:51 Extrinsic vs Intrinsic Success
2:58 Importance of Open Source and Its Impact
4:25 PyTorch vs TinyGrad
10:23 Why PyTorch is the Switzerland of frameworks
12:44 Modular's Mojo + PyTorch?
16:12 PyTorch vs Apple's MLX
19:46 FAIR / PyTorch Alumni
22:38 How can AI inference providers differentiate?
25:48 How to build good benchmarks and learnings from Anyscale's
29:51 Most interesting unexplored ideas
33:23 What people get wrong about synthetic data
42:00 Meta AI's evolution
45:20 How do you allocate 600,000 GPUs?
49:24 Even the GPU Rich are GPU Poor
55:49 Meta's MTIA silicon
58:56 Why we need open source
67:00 Open source's coordination problem for feedback gathering
81:16 Beyond text generation
89:02 Osmo and the Future of Smell Recognition Technology
This is Alessio, partner and CTO in Residence 00:00:06.480 |
And I'm joined by my co-host, Swyx, founder of Smol.ai. 00:00:09.480 |
Hey, and today we have in the studio, Soumith Chintala. 00:00:13.800 |
On one of your rare visits from New York, where you live. 00:00:18.640 |
You got your start in computer vision at NYU with Yann LeCun. 00:00:30.440 |
So if people want to know more about the history of Soumith, 00:00:33.320 |
history of PyTorch, they can go to that podcast. 00:00:39.400 |
or I don't know if it's your luck or your drive 00:00:42.280 |
to find AI early and then find the right quality mentor. 00:00:47.460 |
Because I guess Yann really introduced you to that world. 00:00:51.480 |
You're talking about extrinsic success, right? 00:01:02.760 |
be extrinsically perceived as good and successful. 00:01:10.620 |
that is now like one of the coolest things in the world 00:01:18.640 |
the first thing I tried to become was a 3D VFX artist. 00:01:45.060 |
is this person successful or not might be different. 00:01:48.120 |
But I think after a baseline, your happiness is probably 00:01:59.680 |
that I often refer to about the power of intrinsic motivation 00:02:03.020 |
versus extrinsic and how long extrinsic lasts. 00:02:07.440 |
But anyway, now you are an investor in Runway. 00:02:19.680 |
He actually tried to become an animator in his early years 00:02:22.520 |
and failed, or didn't get accepted by Disney, 00:02:25.260 |
and then went and created Pixar and then got bought by Disney 00:02:31.340 |
So you joined Facebook in 2014 and eventually became 00:02:41.060 |
But you also-- I think maybe people don't know that you also 00:02:44.040 |
involved in more of the hardware and cluster decisions. 00:02:48.500 |
there, because we're all about hardware this month. 00:02:52.600 |
And then finally, I don't know what else should people 00:02:55.860 |
know about you on the personal side or the professional side. 00:03:00.940 |
like a big passion of mine and probably forms 00:03:11.980 |
It's like one of those things that I attribute to-- 00:03:21.220 |
to distribute opportunity in a way that is very powerful. 00:03:33.940 |
And in college, actually, I didn't have internet, 00:03:40.880 |
So just having-- and knowledge was very centralized. 00:03:49.940 |
And that ended up helping me learn quicker and faster 00:04:04.900 |
I always push regardless of what I get paid for. 00:04:10.100 |
I think I would do that as a passion project on the side. 00:04:16.500 |
as well that open source has, open models versus closed 00:04:20.860 |
But maybe you want to touch a little bit on PyTorch 00:04:23.020 |
before we move on to sort of meta AI in general. 00:04:25.500 |
Yeah, we kind of touched on PyTorch in a lot of episodes. 00:04:31.660 |
He called PyTorch a CISC and TinyGrad a RISC. 00:04:36.500 |
I would love to get your thoughts on PyTorch design 00:04:42.420 |
I know you talk a lot about kind of having a happy path 00:04:45.900 |
to start with and then making complexity hidden away, 00:04:52.340 |
is I think you have like 250 primitive operators in PyTorch. 00:04:57.600 |
So how do you think about some of the learnings 00:05:02.020 |
that maybe he's going to run into that you already 00:05:04.520 |
had in the past seven, eight years almost of running PyTorch? 00:05:13.940 |
think it's two different models that people generally 00:05:23.060 |
And my v1 is like super complex, feature complete, whatever. 00:05:28.840 |
Or other people say they will get incrementally ambitious. 00:05:33.160 |
They say, oh, we'll start with something simple, 00:05:35.120 |
and then we'll slowly layer out complexity in a way 00:05:37.700 |
that optimally applies Huffman coding or whatever. 00:05:42.860 |
Where the density of users are and what they're using, 00:05:47.680 |
I would want to keep it in the easy, happy path. 00:06:01.360 |
George, I think, just like we started with PyTorch, 00:06:05.000 |
George started with the incrementally ambitious thing. 00:06:19.440 |
So I think there is no real magic to which why PyTorch 00:06:26.640 |
I think it's probably partly necessitated and partly 00:06:32.100 |
because we built with the technology available under us 00:06:43.120 |
I think if we had to rewrite it, we would probably 00:06:45.960 |
think about ways to rewrite it in a vastly simplified way, 00:06:52.980 |
But a lot of that complexity comes from the fact 00:07:07.720 |
and then you have DRAM and SSD, and then you have network. 00:07:17.220 |
and then you have different levels of network hierarchies, 00:07:19.960 |
NVLink plus InfiniBand or RoCE or something like that. 00:07:26.680 |
And the way the flops are available on your hardware, 00:07:37.880 |
onto both the memory hierarchy and the flops available. 00:07:45.040 |
like a fairly hard mathematical problem to do this setup, 00:07:55.000 |
And finding the optimal thing is like, what is optimal? 00:07:58.440 |
What is optimal depends on the input variables themselves. 00:08:02.440 |
So like, OK, what is the shape of your input tensors, 00:08:05.240 |
and what is the operation you're trying to do, 00:08:20.240 |
the same for every input configuration you have. 00:08:27.400 |
For example, just as the shape of the tensors change, 00:08:31.560 |
let's say you have three input tensors into a sparse dot 00:08:43.000 |
will vastly change how you do this optimally placing 00:08:48.640 |
this operation onto the hardware in a way that will 00:08:53.440 |
So a lot of our complexity comes from writing out 00:08:59.240 |
like hundreds of configurations for each single PyTorch 00:09:07.200 |
and symbolically generating the final CUDA code or CPU code. 00:09:15.000 |
There's no way to avoid it, because mathematically we 00:09:17.080 |
haven't found symbolic ways to do this that also 00:09:40.520 |
I don't think, unless we have great breakthroughs, 00:09:47.640 |
Or he should be thinking about a narrower problem, such as, 00:09:51.240 |
I'm only going to make this work for self-driving car ConvNets. 00:09:55.920 |
Or I'm only going to make this work for LLM transformers 00:10:10.480 |
to power all of the AI research that is happening 00:10:13.720 |
and keep zero compile time and all these other factors, 00:10:18.120 |
I think it's not easy to avoid the complexity. 00:10:28.160 |
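To make the shape-dependence concrete, here is a toy Python sketch, not PyTorch internals: the helper names are made up, but it shows why the "best" implementation of a single operator changes with the input configuration, which is where the hundreds of per-operator code paths come from.

```python
import torch

# Toy dispatcher: pick an implementation of matmul based on the input shapes.
# Real frameworks also branch on dtype, strides, device, and hardware features,
# which multiplies the number of code paths per operator.

def matmul_naive(a, b):
    # fine when the working set fits comfortably in cache
    return a @ b

def matmul_chunked(a, b, chunk=1024):
    # process row blocks to keep the working set small for large inputs
    return torch.cat([a[i:i + chunk] @ b for i in range(0, a.shape[0], chunk)])

def dispatch_matmul(a, b):
    if a.shape[0] * a.shape[1] < (1 << 20):
        return matmul_naive(a, b)
    return matmul_chunked(a, b)

out = dispatch_matmul(torch.randn(4096, 512), torch.randn(512, 256))
print(out.shape)  # torch.Size([4096, 256])
```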
If you think about frameworks, they have the model target. 00:10:36.560 |
TensorFlow is trying to be optimized to make TPUs go brr 00:10:43.000 |
I think George is trying to make, especially AMD stack, 00:10:47.880 |
How come PyTorch has been such a Switzerland 00:10:54.320 |
First, Meta is not in the business of selling hardware. 00:10:57.760 |
Meta is not in the business of cloud compute. 00:11:03.640 |
We kind of-- the way Meta thinks about funding PyTorch is it's 00:11:11.640 |
just like we're funding it because it's net good for Meta 00:11:17.000 |
to fund PyTorch because PyTorch has become a standard 00:11:27.240 |
It gives us various leverage and all that within our own work. 00:11:40.440 |
I think the way we think about it is not in terms of Switzerland 00:11:44.440 |
Actually, the way we articulated to all hardware vendors 00:11:47.940 |
and software vendors and all who come to us being like, 00:11:51.600 |
we want to build a backend in core for PyTorch 00:12:00.160 |
If users are using a particular piece of hardware, 00:12:05.460 |
We very much don't want to king make the hardware 00:12:11.720 |
So as the MacBooks have GPUs and as that stuff 00:12:19.960 |
we pushed Apple to push some engineers and work 00:12:25.040 |
And we spent significant time from like meta funded 00:12:29.820 |
Because a lot of people are using the Apple GPUs 00:12:35.360 |
So we kind of mostly look at it from the demand side. 00:12:40.960 |
which hardware should we start taking opinions on? 00:12:44.480 |
Is there a future in which-- because Mojo or Modular 00:13:01.760 |
So if Mojo is like a PIP install and it's readily available 00:13:06.960 |
and users feel like they can use Mojo so smoothly 00:13:21.440 |
In the same way, PyTorch now depends on Triton, 00:13:26.500 |
And we never had a conversation that was like, huh, 00:13:38.720 |
It almost doesn't-- those conversations don't really 00:13:43.000 |
The conversations are more like, well, does Triton 00:13:45.200 |
have 10,000 dependencies and is it hard to install? 00:13:54.160 |
We look at these things from a user experience point of view. 00:14:16.100 |
would you look to solve that you have right now? 00:14:25.680 |
It's more performance, mainly a performance pitch, 00:14:30.960 |
Yeah, I think the performance pitch for Mojo was like, 00:14:56.180 |
So PyTorch exposes-- it's actually not 250 operators, 00:15:04.400 |
and people write their ideas in the 1,000 operators of PyTorch. 00:15:10.080 |
Mojo is like, well, maybe it's OK to completely sidestep 00:15:17.240 |
those 1,000 operators of PyTorch and just write it 00:15:20.160 |
in a more natural form, just write like raw Python, 00:15:25.400 |
So from the consideration of how do we intersect PyTorch 00:15:33.600 |
where you have custom stuff for some parts of your program, 00:15:42.880 |
how to make it easier for, say, torch.compile to smoothly also 00:15:49.200 |
consume Mojo subgraphs, and the interoperability 00:16:00.480 |
would be replacing PyTorch, not augmenting PyTorch. 00:16:06.240 |
So in that sense, I don't see a synergy in more deeply 00:16:15.040 |
have written something in Mojo and there's some performance 00:16:24.160 |
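As a rough illustration of that torch.compile interoperability point, here is a hedged sketch using PyTorch's public custom-backend hook; the backend below just falls back to eager execution, but a hypothetical Mojo or other vendor compiler could return an optimized callable for the captured subgraph instead.

```python
import torch

def inspect_backend(gm: torch.fx.GraphModule, example_inputs):
    # torch.compile hands captured FX subgraphs to the backend; a third-party
    # compiler could lower `gm` here and return its own optimized callable.
    gm.graph.print_tabular()
    return gm.forward  # fall back to eager so the sketch runs anywhere

@torch.compile(backend=inspect_backend)
def fn(x, y):
    return torch.relu(x @ y) + 1.0

fn(torch.randn(8, 8), torch.randn(8, 8))
```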
what should people think of PyTorch versus MLX? 00:16:26.720 |
I mean, MLX is early, and I know the folks well. 00:16:32.160 |
Awni used to work at FAIR, and I used to chat with him 00:16:42.560 |
The way I think about MLX is that MLX is specialized 00:17:05.040 |
be supporting Apple and we will just focus on enabling-- 00:17:13.060 |
but once you go server side or whatever, that's not my problem 00:17:19.000 |
Or MLX, it enters the server side set of things as well. 00:17:26.240 |
If the first thing will happen, MLX's overall addressable 00:17:44.120 |
and they will have vastly more complex work to do. 00:17:49.460 |
They probably wouldn't be able to move as fast in certain ways. 00:17:52.800 |
Like having to deal with distributed compute. 00:17:58.520 |
like having a generalization of the concept of a back end, 00:18:02.400 |
how they treat compilation with plus overheads. 00:18:07.000 |
Right now, they deeply assume the whole MPS graph thing. 00:18:12.480 |
So they need to think about all these additional things 00:18:16.680 |
if they end up expanding onto the server side. 00:18:19.480 |
And they'll probably build something like PyTorch 00:18:26.020 |
And I think there they will fail on the lack of differentiation. 00:18:31.780 |
It wouldn't be obvious to people why they would want to use it. 00:18:36.200 |
I mean, there are some cloud companies offering M1 and M2 00:18:41.120 |
I feel like it might be interesting for Apple 00:18:43.320 |
to pursue that market, but it's not their core. 00:18:45.760 |
Yeah, I mean, if Apple can figure out their interconnect 00:18:52.480 |
Honestly, that's more interesting than the cars. 00:18:56.160 |
I think the mode that NVIDIA has right now, I feel like, 00:19:06.940 |
I'm sure there is other silicon that is not bad at all. 00:19:10.660 |
But the interconnect, like NVLink, is uniquely awesome. 00:19:16.340 |
So I'm sure the other hardware providers are working on it. 00:19:21.060 |
I feel like when you say it's uniquely awesome, you 00:19:23.260 |
have some appreciation of it that the rest of us don't. 00:19:28.800 |
do you mean when you say NVIDIA is very good at networking? 00:19:32.000 |
Obviously, they made the acquisition maybe 15 years ago. 00:19:46.700 |
Who are some of the other FAIR/PyTorch alumni that 00:19:51.220 |
I know you have Fireworks AI, Lightning AI, Lepton. 00:20:00.060 |
Yeah, so Yangqing and I used to be framework rivals, 00:20:06.460 |
I mean, we were all a very small, close-knit community 00:20:13.060 |
Caffe, Torch, Theano, Chainer, Keras, various frameworks. 00:20:22.820 |
I mean, it used to be more like 20 frameworks. 00:20:39.900 |
and saw if someone wrote their own convolution kernel, 00:20:47.140 |
And there were four or five convolution kernels 00:21:08.180 |
And at some point there, I built out these benchmarks 00:21:18.020 |
benchmarking all the convolution kernels that 00:21:25.380 |
And it hilariously became big enough that at that time, 00:21:30.060 |
AI was getting important, but not important enough 00:21:37.020 |
in to do these kind of benchmarking and standardization. 00:21:41.780 |
So a lot of the startups were using convnet-benchmarks 00:21:55.820 |
I remember Nervana actually was at the top of the pack 00:21:58.420 |
because Scott Gray wrote amazingly fast convolution 00:22:10.660 |
I think mainly Lepton and Fireworks are the two most obvious ones. 00:22:10.660 |
But I'm sure the fingerprints are a lot wider. 00:22:27.060 |
They're just people who worked within the PyTorch and Caffe 00:22:38.980 |
I think both as an investor and people looking 00:22:59.060 |
And they're like, you know, we are deep in the PyTorch 00:23:02.380 |
ecosystem, and we serve billions of inferences a day 00:23:05.140 |
or whatever at Facebook, and now we can do it for you. 00:23:22.580 |
What should people know about these sort of new inference 00:23:28.140 |
At that point, you would be investing in them 00:23:43.660 |
is that they're really good at GPU programming 00:23:48.140 |
or understanding the complexity of serving models 00:23:52.780 |
once it hits a certain scale, various expertise 00:23:58.380 |
from the infra and AI and GPUs point of view. 00:24:06.980 |
is whether their understanding of the external markets 00:24:19.980 |
understanding how to be disciplined about making money, 00:24:26.980 |
actually, I will de-emphasize the investing bit, 00:24:31.820 |
It's more like, OK, you're PyTorch gods, of course. 00:24:39.020 |
I mean, I would not care about who's building something 00:24:48.580 |
And it's usability, and reliability, and speed. 00:24:53.980 |
Yeah, if someone from some random unknown place 00:25:04.100 |
and I have the bandwidth, I probably will give it a shot. 00:25:06.780 |
And if it turns out to be great, I'll just use it. 00:25:11.700 |
And then maybe one more thing about benchmarks, 00:25:13.660 |
since we already brought it up, and you brought up 00:25:16.620 |
There was some recent drama around Anyscale. 00:25:22.340 |
and obviously they looked great on their own benchmarks. 00:25:28.220 |
I feel like there are two lines of criticism. 00:25:30.260 |
One, which is they didn't test apples for apples 00:25:33.620 |
on the kind of endpoints that the other providers 00:25:36.940 |
that they are competitors with on their benchmarks. 00:25:48.060 |
Yeah, I mean, in summary, basically my criticism 00:25:53.140 |
that Anyscale built these benchmarks for end users 00:26:06.060 |
is give that end user a full understanding of what 00:26:22.980 |
You need to understand your total cost of ownership 00:26:27.700 |
Not like, oh, like one API call is like $0.01, 00:26:36.580 |
People can misprice to cheat on those benchmarks. 00:26:39.220 |
So you want to understand, OK, how much is it 00:26:42.860 |
going to cost me if I actually subscribe to you 00:26:45.980 |
and do like a million API calls a month or something? 00:26:49.460 |
And then you want to understand the latency and reliability, 00:26:55.340 |
not just from one call you made, but an aggregate of calls 00:27:01.140 |
you made over various times of the day and times of the week 00:27:08.260 |
Is it just like some generic single paragraph 00:27:22.460 |
It was a much more narrow sliver of what should 00:27:30.060 |
And I'm pretty sure if before they released it, 00:27:33.580 |
they showed it to their other stakeholders who 00:27:43.020 |
would have easily just pointed out these gaps. 00:27:46.020 |
And I think they didn't do that, and they just released it. 00:27:50.020 |
So I think those were the two main criticisms. 00:27:52.620 |
And I think they were fair, and Robert took it well. 00:27:56.060 |
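For readers who want to apply that critique, a minimal sketch of the aggregate-over-time idea follows; the endpoint URL and payload are placeholders, and a real benchmark would also vary prompts, output lengths, and time of day and week.

```python
import statistics
import time
import requests  # assumes the requests package is installed

ENDPOINT = "https://api.example.com/v1/completions"  # placeholder, not a real provider

def sample_latencies(prompt: str, n: int = 20) -> list[float]:
    # Measure many calls instead of trusting a single request.
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        requests.post(ENDPOINT, json={"prompt": prompt, "max_tokens": 256}, timeout=60)
        samples.append(time.perf_counter() - start)
    return samples

lat = sample_latencies("Summarize the following paragraph ...")
print("p50:", statistics.median(lat))
print("p95:", statistics.quantiles(lat, n=20)[18])  # 95th percentile cut point
```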
Yeah, we'll have him on at some point, and we'll discuss it. 00:28:07.740 |
because otherwise everyone's going to play dirty. 00:28:11.860 |
My view of the LLM inference market in general 00:28:19.260 |
The margins are going to drive down towards the bare minimum. 00:28:23.940 |
It's going to be all kinds of arbitrage between how much you 00:28:26.820 |
can get the hardware for and then how much you sell the API 00:28:34.500 |
You need to figure out how to squeeze your margins. 00:28:40.260 |
I think Together and Fireworks and all these people 00:28:42.860 |
are trying to build some faster CUDA kernels and faster 00:28:50.540 |
But those modes only last for a month or two. 00:28:57.580 |
Even if they're not published, the idea space is small. 00:29:06.460 |
the discovery rate is going to be pretty high. 00:29:09.020 |
It's not like we're talking about a combinatorial thing 00:29:13.300 |
You're talking about like llama-style LLM models, 00:29:23.180 |
It's not even like we have a huge diversity of hardware 00:29:32.940 |
the rate at which these ideas are going to get figured out 00:29:38.180 |
The standard one that I know of is fusing operators 00:29:43.420 |
on figuring out how to improve your memory bandwidth 00:29:51.420 |
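A minimal sketch of that fusion idea, using torch.compile as a stand-in for the hand-written kernels being discussed: eager mode materializes an intermediate tensor for every pointwise op, while the compiled version can fuse the chain and touch memory once.

```python
import torch

def pointwise_chain(x):
    y = x * 2.0        # eager: writes an intermediate tensor
    z = torch.sin(y)   # eager: reads it back, writes another
    return torch.relu(z)

fused = torch.compile(pointwise_chain)  # the compiler can fuse the pointwise chain

x = torch.randn(1 << 20)
torch.testing.assert_close(pointwise_chain(x), fused(x))
```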
Any ideas instead of things that are not being beaten to death 00:29:54.700 |
that people should be paying more attention to? 00:29:56.900 |
One thing I was like, you have 1,000 operators. 00:30:01.260 |
that you're seeing maybe outside of this little bubble? 00:30:08.940 |
But basically, it's used in a lot of exotic ways 00:30:13.740 |
from the ML angle, like, OK, what kind of models 00:30:18.180 |
And you get all the way from state space model 00:30:21.980 |
then all these things to stuff like nth-order differentiable 00:30:35.220 |
I think there's one set of interestingness factor 00:30:42.500 |
And then there's the other set of interesting factor 00:30:46.620 |
It's used in Mars Rover simulations, to drug discovery, 00:31:06.940 |
I think in terms of the most interesting application 00:31:17.380 |
are also very critical and really important it is used in. 00:31:39.300 |
And I was scared more about the fact that they were using GANs 00:31:43.620 |
Because at that time, I was a researcher focusing on GANs. 00:31:47.420 |
The diversity is probably the most interesting, 00:31:49.740 |
how many different things it is being used in. 00:32:09.660 |
search and symbolic stuff with differentiable models. 00:32:16.300 |
I think the whole AlphaGo style model is one example. 00:32:26.620 |
to do it for LLMs as well with various reward 00:32:34.340 |
but the whole AlphaGeometry thing was interesting. 00:32:39.380 |
the symbolic models with the gradient-based ones. 00:32:50.820 |
when you intersect biology and chemistry with ML. 00:33:03.340 |
So yeah, maybe from the ML side, those things to me 00:33:09.780 |
People are very excited about the AlphaGeometry thing. 00:33:18.740 |
into the real-world applications, but I'm sure it-- 00:33:25.740 |
You know how the whole thing about synthetic data 00:33:39.820 |
People think synthetic data is some kind of magic wand 00:33:50.340 |
right now because we, as humans, have figured out 00:34:06.100 |
So we've figured out how to ground particle physics 00:34:42.540 |
and just understanding how language can be broken down 00:34:46.420 |
into formal symbolism is something that we've figured out. 00:34:53.060 |
all this knowledge on these subjects, either synthetically-- 00:34:57.340 |
I mean, we created those subjects in our heads, 00:35:05.380 |
But we haven't figured out how to teach neural networks 00:35:19.820 |
So in areas where we have the symbolic models 00:35:23.340 |
and we need to teach all the knowledge we have 00:35:29.820 |
that is better encoded in the symbolic models, 00:35:34.100 |
a bunch of synthetic data, a bunch of input-output pairs, 00:35:42.580 |
that we already have a better low-rank model of 00:35:46.420 |
in gradient descent in a much more overparameterized way. 00:35:50.420 |
Outside of this, where we don't have good symbolic models, 00:35:55.020 |
synthetic data obviously doesn't make any sense. 00:36:00.020 |
where it'll work in all cases and every case or whatever. 00:36:09.140 |
we need to impart that knowledge to neural networks 00:36:12.700 |
and we figured out the synthetic data is a vehicle 00:36:18.540 |
But people, because maybe they don't know enough 00:36:27.060 |
but they hear the next wave of data revolution 00:36:30.100 |
is synthetic data, they think it's some kind of magic 00:36:32.940 |
where we just create a bunch of random data somehow. 00:36:38.500 |
And then they think that's just a revolution, 00:36:40.940 |
and I think that's maybe a gap in understanding 00:36:49.220 |
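As a toy illustration of what "synthetic data grounded in a symbolic model" means here: when you already hold the symbolic model (exact integer arithmetic below), you can emit unlimited labeled input-output pairs for a network to fit. This is only a sketch of the idea, not any particular pipeline.

```python
import random

def make_example() -> tuple[str, str]:
    # The symbolic model (exact integer arithmetic) supplies the ground-truth label.
    a, b = random.randint(0, 999), random.randint(0, 999)
    op = random.choice(["+", "-", "*"])
    question = f"{a} {op} {b} ="
    answer = str({"+": a + b, "-": a - b, "*": a * b}[op])
    return question, answer

dataset = [make_example() for _ in range(10_000)]
print(dataset[:3])
```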
- There's two more that I'll put in front of you 00:36:54.380 |
One is, I have this joke that it's only synthetic data 00:37:04.980 |
They're distilling GPT-4 by creating synthetic data 00:37:07.660 |
from GPT-4, creating mock textbooks inspired by Phi-2, 00:37:11.540 |
and then fine-tuning open source models like Llama. 00:37:15.940 |
- And so, should we call that synthetic data? 00:37:19.900 |
- Yeah, I mean, the outputs of LLMs, are they synthetic data? 00:37:29.340 |
If your goal is you're creating synthetic data 00:37:36.540 |
with the goal of trying to distill GPT-4's superiority 00:37:40.780 |
into another model, I guess you can call it synthetic data, 00:37:45.300 |
but it also feels disingenuous because your goal is like, 00:37:57.980 |
- I've often thought of this as data set washing. 00:38:07.120 |
that has all the data in it that we don't know 00:38:14.920 |
- But they also, to be fair, they also use larger models 00:38:21.560 |
- That is, I think, a very, very accepted use of synthetic. 00:38:25.460 |
I think it's a very interesting time where we don't really 00:38:28.960 |
have good social models of what is acceptable 00:38:33.960 |
depending on how many bits of information you use 00:38:44.560 |
It's like, okay, you use like one bit, is that okay? 00:38:51.920 |
Okay, what about if you use like 20 bits, is that okay? 00:38:59.280 |
Like, I don't think we as society have ever been 00:39:08.480 |
or where is the boundary of socially accepted understanding 00:39:15.960 |
Like, we haven't been tested this mathematically before, 00:39:23.600 |
- So yeah, I think this New York Times OpenAI case 00:39:23.600 |
is solving this very stark paradigm difference 00:39:49.180 |
All you need is variation or diversity of samples 00:40:04.440 |
that is like you're basically trying to create, 00:40:10.500 |
well, language, I know how to parameterize language 00:40:26.580 |
- Yeah, so I think that's 100% like synthetic, right? 00:40:37.460 |
or like some implicit symbolic model of language. 00:40:45.940 |
is just the architecture of the language models 00:40:50.380 |
I think like the, maybe the thing that people grasp 00:40:55.340 |
to deal with numbers because of the tokenizer. 00:41:03.040 |
that will be better with symbolic understanding? 00:41:06.180 |
- I am not sure if it's a fundamental issue or not. 00:41:09.500 |
I think we just don't understand transformers enough. 00:41:13.220 |
I don't even mean transformers as an architecture. 00:41:19.460 |
like combining the tokenizer and transformers 00:41:24.700 |
like when you show math heavy questions versus not. 00:41:35.180 |
I, you know, there's common criticisms that are like, 00:41:38.340 |
well, you know, transformers will just fail at X 00:41:42.260 |
but then when you scale them up to sufficient scale, 00:41:51.940 |
where they're trying to figure out these answers 00:41:53.580 |
called like the science of deep learning or something. 00:42:11.700 |
And Llama 1 was, you know, you are such a believer 00:42:16.680 |
Llama 1 was more or less like the real breakthrough 00:42:20.840 |
The most interesting thing for us covering in this podcast 00:42:32.200 |
the scaling models for open source models or smaller models 00:42:45.860 |
There was OPT before, which I'm also very proud of. 00:42:53.160 |
- Because we bridged the gap in understanding 00:42:56.620 |
of how complex it is to train these models to the world. 00:43:11.800 |
But no one really talked about why it's complex. 00:43:20.660 |
- I met Susan and she's very, very outspoken. 00:43:28.540 |
Like, you know, that's kind of obvious in retrospect. 00:43:34.420 |
- But you trained it according to Chinchilla at the time or? 00:43:40.420 |
but I think it's a commonly held belief at this point 00:43:50.860 |
Guillaume Lample and team. Guillaume is fantastic 00:43:56.740 |
I wasn't too involved in that side of things. 00:44:06.660 |
how did they think about scaling laws and all of that? 00:44:19.580 |
with like their infrastructure needs and stuff. 00:44:35.040 |
what we were missing from the industry's understanding 00:44:45.000 |
and we needed more to train the models for longer. 00:44:48.100 |
And we made, I think, a few tweaks to the architecture 00:44:51.600 |
and we scaled up more and like that is Llama 2. 00:44:56.120 |
I think Llama 2, you can think of it as like, 00:45:00.160 |
the team kind of rebuilt their muscle around Llama 2. 00:45:04.320 |
And Hugo, I think, who's the first author, is fantastic. 00:45:07.760 |
And I think he did play a reasonably big role 00:45:11.320 |
in Llama 1 as well, and he overlaps between Llama 1 00:45:14.520 |
So Llama 3, obviously, hopefully will be awesome. 00:45:21.680 |
and then we'll try and fish Llama 3 spoilers out of you. 00:45:38.560 |
Could they have just gone longer or were you just like, 00:46:13.840 |
and including all of the other GPU or accelerator stuff, 00:46:20.960 |
it would be 600 and something K aggregate capacity. 00:46:25.960 |
That's a lot of GPUs, we'll talk about it separately, 00:46:45.600 |
- Yeah, so I think it's all a matter of time. 00:46:53.160 |
It's like when do you stop training the previous one 00:47:15.360 |
When you start working on iPhone 2, where is iPhone 1? 00:47:15.360 |
So mostly the considerations are time and generation 00:47:28.320 |
- So one of the things with the scaling laws, 00:47:37.600 |
you would rather pay a lot more maybe at training 00:47:45.220 |
I think in your tweet you say you can try and guess 00:47:50.320 |
Can you just give people a bit of understanding? 00:47:52.240 |
It's like, because I've already seen a lot of VCs say, 00:47:58.900 |
How do you allocate between the research like FAIR 00:48:09.280 |
like AI generated stickers on WhatsApp and all that? 00:48:12.720 |
- Yeah, we haven't talked about any of this publicly 00:48:24.900 |
You run a company, you run like a VC portfolio, 00:48:39.580 |
and you kind of decide should I invest in this project 00:48:42.260 |
or this other project or how much should I invest 00:48:52.820 |
and it also comes into play like how is your, 00:48:59.700 |
Like overall, like what you can fit of what size 00:49:08.460 |
Like, I mean, I think the details would add more spice 00:49:18.960 |
I mean, this looks like they just think about this 00:49:24.000 |
- Right, so even the GPU rich run through the same struggles 00:49:27.920 |
while having to decide where to allocate things? 00:49:30.800 |
- Yeah, I mean like at some point, I forgot who said it 00:49:44.700 |
you figure out how to make do with smaller models 00:49:48.140 |
but like no one as of today, I think would feel like 00:50:07.760 |
So like that conversation, I don't think I've heard 00:50:20.900 |
and she's trying to put it to interesting uses 00:50:28.820 |
- I mean, that's a cool high conviction opinion 00:50:42.060 |
and she probably will have very differentiated ideas 00:50:46.080 |
and I mean, think about the correlation of ideas 00:51:04.180 |
I used to be a, I used to do image models and stuff 00:51:17.780 |
because oh yeah, someone else did the same thing you did. 00:51:24.260 |
I don't understand why I need to fight for the same pie. 00:51:32.980 |
- And how do you reconcile that with how we started 00:51:36.740 |
the discussion about intrinsic versus extrinsic 00:51:50.600 |
I walked through a lot of the posters and whatnot, 00:51:52.980 |
there seems to be multiple apps in a way in the research, 00:52:01.480 |
on something that is like maybe not as interesting, 00:52:04.500 |
just because of funding and visibility and whatnot 00:52:10.260 |
- I think there's a baseline level of compatibility 00:52:22.020 |
Like, and like whatever reasonable, normal lifestyle 00:52:34.220 |
Like you wouldn't want to be doing something so obscure 00:52:39.220 |
that people are like, I don't know, like you can work on it. 00:52:42.960 |
With a limit on fundability, I'm just like observing 00:52:47.020 |
something like three months of compute, right? 00:52:50.220 |
That's the like max that you can spend on any one project. 00:52:53.440 |
- But like, I think that's very ill specified, 00:52:58.820 |
- So I think the notion of fundability is broader. 00:53:03.820 |
It's more like, hey, are these family of models 00:53:06.780 |
within the acceptable set of you're not crazy 00:53:33.820 |
like image classification to them or something, 00:53:42.640 |
Maybe if you're a neuroscientist, it actually is feasible. 00:53:46.320 |
But if you're like an AI engineer, like the audience 00:53:50.120 |
of these podcasts, then it's less, it's more questionable. 00:53:54.760 |
So I think like, the way I think about it is like, 00:53:57.680 |
you need to figure out how you can be in the baseline level 00:54:01.800 |
of fundability just so that you can just live. 00:54:06.400 |
And then after that, really focus on intrinsic motivation 00:54:11.400 |
and depends on your strengths, like how you can play 00:54:16.740 |
to your strengths and your interests at the same time. 00:54:21.060 |
Like you, like I try to look at a bunch of ideas 00:54:26.060 |
that are interesting to me, but also try to play 00:54:34.960 |
I'm interested in it, but when I want to work 00:54:38.720 |
on something like that, I try to partner with someone 00:54:40.800 |
who is actually a good like theoretical ML person 00:54:43.440 |
and see if I actually have any value to provide. 00:54:48.280 |
So I think you'd want to find that intersection 00:54:50.840 |
of ideas you like, and that also play to your strengths. 00:54:57.520 |
Everything else, like actually finding extrinsic success 00:55:01.160 |
and all of that I think is, the way I think about it 00:55:06.820 |
When you're talking about building ecosystems and stuff, 00:55:10.560 |
like slightly different considerations come into play, 00:55:16.600 |
- Yeah, I should, we're gonna pivot a little bit 00:55:23.600 |
But one more thing I wanted to establish for meta 00:55:25.720 |
is like this 600K number, just kind of rounding out 00:55:31.060 |
So including your own inference needs, right? 00:55:39.380 |
- Yeah, so like, there's a decent amount of workload 00:55:42.400 |
serving Facebook and Instagram and you know, whatever. 00:55:45.920 |
And then is there interest in like your own hardware? 00:55:57.620 |
I think we've even showed like the standard photograph 00:56:05.000 |
I mean, like as in the chip that you basically 00:56:25.220 |
- Like what gaps do you have that the market doesn't offer? 00:56:31.120 |
So basically, remember how I told you about the whole, 00:56:39.360 |
Fundamentally, like when you build a hardware, 00:56:42.080 |
like you make it general enough that a wide set of customers 00:56:46.680 |
and a wide set of workloads can use it effectively 00:56:49.800 |
while trying to get the maximum level of performance 00:56:58.460 |
the more hardware efficient it's going to be, 00:57:04.460 |
the more easier it's going to be to find like the software, 00:57:14.020 |
that one or two workloads to that hardware and so on. 00:57:17.080 |
So it's pretty well understood across the industry 00:57:21.840 |
that if you have a sufficiently large volume enough workload, 00:57:26.840 |
you can specialize it and get some efficiency gains, 00:57:35.460 |
So the way you can think about everyone building, 00:57:42.560 |
like I think a bunch of the other large companies 00:57:48.860 |
is each large company has a sufficient enough set 00:57:53.840 |
of verticalized workloads that have a pattern to them 00:58:03.920 |
like an Nvidia or an AMD GPU does not exploit. 00:58:11.520 |
that you're leaving on the table by not exploiting that. 00:58:21.120 |
that those workloads will exist in the same form, 00:58:25.100 |
that it's worth spending the time to build out a chip 00:58:32.640 |
Like obviously something like this is only useful 00:58:42.040 |
of those kinds of workloads being in the same kind 00:58:49.860 |
So yeah, that's why we're building our own chips. 00:59:00.560 |
and going back to open source, you had a very good tweet. 00:59:03.600 |
You said that a single company's close source effort 00:59:06.360 |
rate limits against people's imaginations and needs. 00:59:13.960 |
that some of the meta AI work in open source has been doing 00:59:17.200 |
and maybe directions of the whole open source AI space? 00:59:20.960 |
In general, I think first I think it's worth talking 00:59:25.280 |
about this in terms of open and not just open source 00:59:28.940 |
because like with the whole notion of model weights, 00:59:31.920 |
no one even knows what source means for these things. 00:59:35.500 |
But just for the discussion, when I say open source, 00:59:39.360 |
you can assume it's just I'm talking about open. 00:59:42.240 |
And then there's the whole notion of like licensing 00:59:47.280 |
- Commercial, non-commercial, commercial with clauses 00:59:57.160 |
is that you make the distribution to be very wide. 01:00:06.300 |
and like people can do transformative things. 01:00:27.420 |
and do something with it is very transformative to me. 01:00:32.260 |
Like I got this thing in a very accessible way. 01:00:38.700 |
And then like so it's very, very various degrees, right? 01:00:44.100 |
but it's like actually like a commercial license, 01:00:50.020 |
from like gaining value that they didn't previously have 01:00:54.780 |
that they maybe had to pay a closed source company for it. 01:00:59.100 |
So open source is just a very interesting tool 01:01:06.540 |
One is like some large company doing a lot of work 01:01:12.260 |
And that kind of effort is not really feasible 01:01:15.820 |
by say like a band of volunteers doing it the same way. 01:01:19.860 |
So there's both a capital and operational expenditure 01:01:33.740 |
They're not as tangible as like direct revenue 01:01:37.660 |
So in that part, Meta has been doing incredibly good things. 01:01:42.660 |
They fund a huge amount of the PyTorch development. 01:01:47.900 |
They've open sourced Llama and those family of models. 01:01:52.060 |
And several other fairly transformative projects. 01:02:19.220 |
and we have a high talent density of great AI people. 01:02:27.700 |
And the thesis for that, I remember when FAIR was started, 01:02:38.300 |
What exactly is the benefit from a commercial perspective? 01:02:53.280 |
Our ability to build various product integrations, 01:03:11.380 |
was uniquely in our possession or not for us. 01:03:31.380 |
Still the same to a large extent with the Llama stuff 01:03:35.220 |
and it's a bit more, I think it's the same values, 01:03:46.160 |
And then there's the second kind of open source, 01:03:50.420 |
which is oh, we built this project nights and weekends 01:03:54.140 |
and we're very smart people and we open sourced it 01:04:15.980 |
They're different and beneficial in their own ways. 01:04:28.580 |
If someone's not really looking at a particular space, 01:04:33.780 |
because it's not commercially viable or whatever, 01:04:35.980 |
like a band of volunteers can just coordinate online 01:04:44.820 |
I wanna cover a little bit about open source LLMs maybe. 01:04:51.820 |
So open source LLMs have been very interesting 01:04:54.620 |
because I think we were trending towards an increase 01:05:08.200 |
Like where more and more pressure within the community 01:05:17.580 |
And then the LLM revolution kind of took the opposite effect. 01:05:28.020 |
and DeepMind kind of like all the other cloud 01:06:07.800 |
What is my accessibility to any of these closer models? 01:06:40.380 |
And I actually have seen, living in New York, 01:06:52.700 |
like Dyson spheres or whatever, that's a thing. 01:07:07.960 |
they're probably not globally optimal decisions. 01:07:11.780 |
So I think open source, the distribution of open source, 01:07:27.740 |
it's going great in the fact that LoRA, I think, 01:07:31.560 |
came out of the necessity of open source models 01:07:44.580 |
out of the academic open source side of things. 01:07:54.480 |
did any of them already have LoRA or DPO internally? 01:08:00.540 |
Maybe, but that does not advance humanity in any way. 01:08:14.680 |
So I don't know, it just feels fundamentally good. 01:08:22.860 |
well, what are the ways in which it is not okay? 01:08:37.860 |
very much related to what kind of cultural culture 01:08:42.860 |
they grew up in, what kind of society they grew up in. 01:08:50.140 |
If they grew up in a society that they trusted, 01:08:52.960 |
then I think they take the closed source argument. 01:08:57.900 |
And if they grew up in a society that they couldn't trust, 01:09:00.420 |
where the norm was that you didn't trust your government, 01:09:05.500 |
then I think the open source argument is what they take. 01:09:10.360 |
to people's innate biases from their childhood 01:09:15.360 |
and their trust in society and governmental aspects 01:09:21.900 |
that push them towards one opinion or the other. 01:09:26.260 |
And I'm definitely in the camp of open source 01:09:33.860 |
Closed source to me just means that centralization of power, 01:09:46.180 |
We're actively disaggregating the centralization of power 01:09:55.180 |
We are, I think, benefiting from so many people 01:10:00.660 |
that aren't allowed by say Silicon Valley left wing tropes. 01:10:13.180 |
but they're not culturally accepted universally in the world. 01:10:20.420 |
And I think open source is not winning in certain ways. 01:10:25.420 |
These are all the things in which, as I mentioned, 01:10:29.980 |
it's actually being very good and beneficial and winning. 01:10:33.060 |
I think one of the ways in which it's not winning, 01:10:36.340 |
at some point I should write a long form post about this, 01:10:39.220 |
is I think it has a classic coordination problem. 01:10:50.480 |
they will just be better coordinated than open source. 01:11:12.100 |
like if you go to the LocalLlama subreddit on Reddit, 01:11:12.100 |
that are being produced from say, Nous Research. 01:11:19.020 |
And one common theme is they're all using these fine tuning 01:11:34.420 |
or human preferences data sets that are very limited 01:11:49.080 |
like say front-ends like Ooba or HuggingChat or Ollama, 01:11:49.080 |
All the people using all of these front-ends, 01:12:04.380 |
but there's no way for them to give feedback. 01:12:24.580 |
they are not exposing the ability to give feedback. 01:12:31.340 |
Maybe open source models are being as used as GPT is 01:12:34.940 |
at this point in all kinds of, in a very fragmented way. 01:12:39.700 |
Like in aggregate, all the open source models together 01:12:47.180 |
But the amount of feedback that is driving back 01:12:50.240 |
into the open source ecosystem is like negligible, 01:13:05.000 |
you'd want someone to create a sinkhole for the feedback, 01:13:09.260 |
like maybe Hugging Face or someone just finds like, 01:13:12.960 |
okay, like I will make available a call to log a string 01:13:17.960 |
along with like a bit of information of positive or negative 01:13:24.300 |
And then you would want to send pull requests 01:13:26.540 |
to all the open source front ends, like Ooba and all, 01:13:26.540 |
being like, hey, we're just integrating like a feedback UI. 01:13:34.660 |
And then work with like the closed source people 01:13:37.260 |
is also being like, look, it doesn't cost you anything, 01:13:47.480 |
And then I think a bunch of open source researchers 01:13:54.640 |
I'm sure like it will be exploited by spam bots 01:13:59.280 |
to inject your advertising product into like the next-- 01:14:08.760 |
In the same way, I'm sure like all the close providers 01:14:17.920 |
I'm sure they are figuring out if that's legit or not. 01:14:21.600 |
That kind of data filtering needs to be done. 01:14:31.160 |
and that like data cleaning effort both to be like there. 01:14:46.360 |
who's gonna go coordinate all of this integration 01:14:49.840 |
across all of these like open source front ends. 01:14:52.840 |
But I think if we do that, if that actually happens, 01:15:00.800 |
of the open source models having a runaway effect 01:15:10.000 |
Probably doesn't have a chance against Google 01:15:13.360 |
because Google has Android and Chrome and Gmail 01:15:41.720 |
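A hypothetical sketch of the feedback "sinkhole" call described above: the endpoint, schema, and client here are invented for illustration, but the point is that any open source front end could emit one small record per thumbs up or down to a shared collector.

```python
import json
import urllib.request

SINKHOLE_URL = "https://feedback.example.org/v1/log"  # hypothetical collector

def log_feedback(model: str, prompt: str, response: str, positive: bool) -> None:
    # One record per rating: the generated string plus a positive/negative bit.
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "response": response,
        "rating": "positive" if positive else "negative",
    }).encode("utf-8")
    req = urllib.request.Request(
        SINKHOLE_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req, timeout=10)

# A front end would call this when the user clicks thumbs up/down, e.g.:
# log_feedback("llama-2-70b-chat", user_prompt, model_reply, positive=True)
```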
Because in a way like OpenAI's goal is to get to AGI, right? 01:15:47.160 |
we're more focused on personal better usage or like-- 01:15:52.680 |
But I think like, largely I actually don't think people 01:16:05.560 |
AGI means it's powering 40% of world economic output 01:16:21.520 |
Like, generally the notion of like powering X percent 01:16:26.400 |
of economic output is not defined well at all 01:16:31.160 |
for me to understand like how to know when we got to AGI 01:16:40.640 |
Like, you know, you can look at it in terms of intelligence 01:16:48.200 |
We're basically integrating like the current set 01:16:50.520 |
of AI technologies into so many real world use cases 01:16:55.160 |
where we find value that if some new version of AI comes in, 01:17:01.360 |
we can find, like we can be like, ah, this helps me more. 01:17:05.300 |
In that sense, I think like the whole process 01:17:10.180 |
of like how we think we got to AGI will be continuous 01:17:13.700 |
and not like not discontinuous like how I think 01:17:21.280 |
So I think the open source thing will be very much in line 01:17:26.280 |
with getting to AGI because open source has that 01:17:39.900 |
really no one says, huh, I don't wanna use it 01:17:49.080 |
I don't know if I like the models, you know, whatever. 01:18:00.800 |
So I definitely think it has a good chance of achieving 01:18:19.480 |
this very interesting concept of a feedback sinkhole 01:18:22.120 |
for open source to really catch up in terms of 01:18:33.860 |
I think the criticism there was like the kind of people 01:18:35.720 |
that go to a specific website to give feedback 01:18:40.760 |
and that's why the models trained on Open Assistant 01:18:43.640 |
didn't really seem like they'd have caught on 01:18:48.760 |
are LMSys out of UC Berkeley who have the LMSys arena 01:18:53.080 |
which is being touted as one of the only ways, 01:18:59.720 |
'cause there's nothing to cheat on except for ELO. 01:19:07.560 |
I don't know if you've talked to any of these people. 01:19:14.340 |
I haven't talked to them directly about this yet 01:19:23.300 |
is always going to be way more distributed than centralized. 01:19:26.880 |
Like which is the power of the open source movement. 01:19:31.580 |
Like the UI within which these models are going to be used 01:19:50.200 |
Like the LMSys leaderboard is the best thing we have 01:19:50.200 |
right now to understand whether a model is better or not 01:19:59.640 |
But it's also biased and only having a sliver of view 01:20:07.880 |
to the LMSys leaderboard and then using a model 01:20:07.880 |
Like GitHub co-pilot style usage is not captured 01:20:18.000 |
in say like the LMSys thing and so many other styles 01:20:18.000 |
like the Character AI style things is not captured in LMSys. 01:20:22.560 |
- Yeah, so like I think like yeah, my point is like 01:20:45.420 |
with all these like ways in which it's being used. 01:20:49.460 |
Even if you get like the top hundred front ends 01:20:54.180 |
that the model like open source models are used through, 01:21:01.860 |
I think that's already like a substantial thing. 01:21:25.520 |
You're an investor in 1X, which is a humanoid assistant. 01:21:34.580 |
You advise a bunch of robotics projects at NYU. 01:21:42.040 |
On a more, yeah, maybe you have another thing. 01:21:43.800 |
What are like the things that you're most excited about 01:21:51.800 |
I have more things I'm generally excited about 01:21:56.920 |
Investing is one way to try to clear those urges. 01:22:01.920 |
I'm generally excited about robotics being a possibility, 01:22:09.800 |
home robotics being like five to seven years away 01:22:17.560 |
I think like it's not like next year or two years from now, 01:22:24.040 |
I think like a lot more robotics companies might pop out. 01:22:36.300 |
My view is actually hardware is still the bottleneck 01:22:43.240 |
but like I don't think there's any like obvious 01:22:53.600 |
I spend a lot of time, a lot of personal time, 01:22:55.980 |
I spend like every Wednesday afternoon at NYU 01:23:10.300 |
- As of today, we just deployed a couple months ago, 01:23:19.520 |
into like several tens of New York City homes 01:23:26.960 |
And we're basically starting to build out a framework 01:23:34.120 |
on fairly simple tasks, like picking this cup 01:23:40.440 |
or like taking a few pieces of cloth on the ground 01:23:44.560 |
and put it somewhere else or open your microwave. 01:23:55.480 |
So like the key thing, I think one of the things 01:24:12.740 |
A lot of the current robotics research if you see, 01:24:15.660 |
they're like, oh yeah, we collected like 50 demos 01:24:22.940 |
It's a sample, the number of samples you need 01:24:24.860 |
for this thing to do the task is really high. 01:24:32.700 |
and that's sufficient for it to actually like do the task. 01:24:35.860 |
But it comes with like less generalization, right? 01:24:47.360 |
That's very interesting in general, the space. 01:24:57.580 |
for it to be truly useful in the home or whatever. 01:25:10.700 |
But I think like lots of work is happening in the field. 01:25:15.500 |
- Yeah, one of my friends, Carlo at Berkeley, 01:25:28.960 |
or is it just like the actual servos and the... 01:25:33.020 |
- Yeah, by hardware I mean like the actual like servos, 01:25:37.380 |
like the motors, servos, even like the sensors, 01:25:47.380 |
that still like is so much better compared to 01:25:59.120 |
We have, our skin is like all available touch sensing 01:26:14.180 |
that can lift large loads in like the dexterity 01:26:19.860 |
So in terms of hardware, I mean like in terms 01:26:24.500 |
of those capabilities, like we haven't figured out 01:26:31.860 |
I mean Tesla has been making incredible progress. 01:26:34.660 |
1X, I think, announced their new thing that looks incredible. 01:26:34.660 |
But we're really not anywhere close to like the hardware 01:26:50.120 |
And there's obviously the other thing I want to call out is 01:27:02.400 |
I mean like that's the other thing we are incredible at. 01:27:10.520 |
If you buy a product, an electronics product of any kind, 01:27:20.900 |
and I have to like do some reasonable amount of work on it. 01:27:28.540 |
where it's very controlled and specialized or whatever, 01:27:31.580 |
like you're talking about reliability like in those ranges. 01:27:42.880 |
I mean like we're gonna start thinking about, 01:27:45.320 |
okay, now we have this thing and we need to figure out 01:27:47.460 |
how to get reliability high enough to deploy it into homes 01:27:59.180 |
- I just realized that Google has a play in this 01:28:12.820 |
- I used to, we have a small robotics program 01:28:17.280 |
I actually used to do it at FAIR a little bit 01:28:19.580 |
before I moved into Infra and focused on my Meta time 01:28:23.340 |
on a lot of like other infrastructural stuff. 01:28:26.940 |
So yeah, Meta's robotics program is a lot smaller. 01:28:30.700 |
- Seems like it would be a fit in personal computing. 01:28:36.140 |
- You can think of it as like Meta has a ridiculously 01:28:50.920 |
I think for Meta, like the robot is not as important 01:28:56.280 |
as like the physical devices kind of stuff, for sure. 01:29:15.140 |
is that he realized that you can smell cancer. 01:29:21.020 |
- Yeah, I mean first like the very interesting reason 01:29:43.660 |
He built this thing called Tangent from Google, 01:29:48.120 |
like another like alternative framework and stuff. 01:29:59.540 |
He just happens to also love like neural networks 01:30:05.140 |
So incredibly smart guy, one of the smartest people I know. 01:30:16.620 |
is something that we haven't even started to scrape 01:30:22.660 |
When we think about audio or images or video, 01:30:26.580 |
they're like so advanced that we have the concept 01:30:31.240 |
We have the concept of like frequency spectrums. 01:30:34.300 |
Like, you know, we figured out how ears process 01:30:37.080 |
like frequencies in mel spectrum or whatever, 01:30:39.880 |
like logarithmically scaled images for like RGB, YUV. 01:30:44.100 |
Like we have so many different kinds of parameterizations. 01:30:47.020 |
We have formalized these two senses ridiculously. 01:30:58.740 |
We're like where we were with images and say in 1920 01:31:10.060 |
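For contrast with how little formalism smell has, here is what one established audio parameterization looks like in code: a mel spectrogram with logarithmic scaling, using torchaudio's standard transforms on a fake one-second waveform.

```python
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16_000, n_fft=400, n_mels=64)
to_db = torchaudio.transforms.AmplitudeToDB()  # the logarithmic scaling step

waveform = torch.randn(1, 16_000)   # one second of fake audio at 16 kHz
log_mel = to_db(mel(waveform))      # shape: (1, 64 mel bins, time frames)
print(log_mel.shape)
```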
of like having a smell sensor just eventually 01:31:18.380 |
Like as of today, you don't really think about 01:31:22.100 |
like when you're watching an Instagram reel or something, 01:31:24.380 |
huh, like I also would love to know what it smelled like 01:31:28.700 |
and if you're watching a reel of a food or something. 01:31:32.020 |
You don't because we really haven't as a society 01:31:41.500 |
I think the more near term effects are obviously 01:31:44.360 |
going to be around things that provide more obvious utility 01:31:49.420 |
in the short term, like maybe smelling cancer 01:31:52.580 |
or like repelling mosquitoes better or stuff like that. 01:32:02.420 |
- Yeah, like I mean think about how you can customize 01:32:09.180 |
you can customize a shoe or something, right? 01:32:15.780 |
I think if he's able to figure out a near term value for it, 01:32:26.940 |
on the long term which is really in uncharted territory. 01:32:35.660 |
it would be pretty obvious to like kids of the generation 01:32:48.540 |
they're watching something and then they immediately get 01:32:53.380 |
like a smell sense off that remote experience as well. 01:32:57.140 |
Like we haven't really progressed enough in that dimension 01:33:07.300 |
Awesome, I mean we touched on a lot of things. 01:33:18.060 |
- I don't really have a lot of calls to action