Yann LeCun: Meta AI, Open Source, Limits of LLMs, AGI & the Future of AI | Lex Fridman Podcast #416
Chapters
0:00 Introduction
2:18 Limits of LLMs
13:54 Bilingualism and thinking
17:46 Video prediction
25:07 JEPA (Joint-Embedding Predictive Architecture)
28:15 JEPA vs LLMs
37:31 DINO and I-JEPA
38:51 V-JEPA
44:22 Hierarchical planning
50:40 Autoregressive LLMs
66:06 AI hallucination
71:30 Reasoning in AI
89:02 Reinforcement learning
94:10 Woke AI
103:48 Open source
107:26 AI and ideology
109:58 Marc Andreessen
117:56 Llama 3
124:20 AGI
128:48 AI doomers
144:38 Joscha Bach
148:51 Humanoid robots
158:00 Hope for the future
00:00:06.000 |
as a much bigger danger than everything else. 00:00:15.120 |
we should keep AI systems under lock and key, 00:00:18.560 |
because it's too dangerous to put it in the hands of everybody. 00:00:32.360 |
I believe that people are fundamentally good. 00:00:38.480 |
can make them smarter, it just empowers the goodness 00:00:52.480 |
because they don't think that people are fundamentally good. 00:00:57.680 |
- The following is a conversation with Yann LeCun, 00:01:21.380 |
by open-sourcing many of their biggest models, 00:01:54.380 |
this happens to be somewhat a controversial position. 00:01:58.840 |
And so it's been fun seeing Yann get into a lot of intense 00:02:18.000 |
You've had some strong statements, technical statements, 00:02:22.420 |
about the future of artificial intelligence recently, 00:02:25.580 |
throughout your career, actually, but recently as well. 00:02:38.740 |
These are the large language models like GPT-4, 00:02:45.080 |
and why are they not going to take us all the way? 00:02:49.040 |
The first is that there is a number of characteristics 00:02:53.500 |
For example, the capacity to understand the world, 00:03:14.140 |
of intelligent systems or entities, humans, animals. 00:03:23.060 |
or they can only do them in a very primitive way. 00:03:26.560 |
And they don't really understand the physical world. 00:03:31.340 |
They can't really reason, and they certainly can't plan. 00:03:34.420 |
And so, if you expect the system to become intelligent 00:03:38.860 |
just without having the possibility of doing those things, 00:03:44.980 |
that is not to say that autoregressive LLMs are not useful. 00:04:01.020 |
but as a path towards human-level intelligence, 00:04:14.020 |
Those LLMs are trained on enormous amounts of texts, 00:04:16.540 |
basically the entirety of all publicly available texts 00:04:21.520 |
That's typically on the order of 10 to the 13 tokens. 00:04:28.220 |
So that's two times 10 to the 13 bytes as training data. 00:04:35.160 |
to just read through this at eight hours a day. 00:04:38.680 |
So it seems like an enormous amount of knowledge, right? 00:04:43.020 |
But then you realize it's really not that much data. 00:04:52.300 |
and they tell you a four-year-old has been awake 00:05:19.700 |
And so 10 to the 15 bytes for a four-year-old 00:05:28.700 |
What that tells you is that through sensory input, 00:05:33.860 |
we see a lot more information than we do through language. 00:05:40.960 |
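A rough back-of-the-envelope version of this comparison, in Python. The 16,000 waking hours come up later in the conversation; the roughly 20 MB/s of visual input is an assumed, illustrative bandwidth chosen only to show how a figure on the order of 10 to the 15 bytes arises:

```python
# Illustrative arithmetic only; the ~20 MB/s visual bandwidth is an assumption.
text_bytes = 1e13 * 2                        # ~10^13 tokens at ~2 bytes per token
waking_seconds = 16_000 * 3600               # ~16,000 waking hours for a young child
visual_bytes = waking_seconds * 2e7          # assumed ~20 MB/s reaching the visual system
print(f"text:   {text_bytes:.1e} bytes")     # ~2e13 bytes of public text
print(f"vision: {visual_bytes:.1e} bytes")   # ~1e15 bytes of sensory input
```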
most of what we learn and most of our knowledge 00:05:49.520 |
Everything that we learn in the first few years of life 00:05:59.400 |
some of the intuition behind what you're saying. 00:06:01.720 |
So it is true there's several orders of magnitude 00:06:16.580 |
your comparison between sensory data versus language, 00:06:29.340 |
So there's a lot of wisdom in language, there's words, 00:06:40.620 |
already has enough wisdom and knowledge in there 00:06:48.740 |
construct a world model, an understanding of the world, 00:07:01.740 |
like whether intelligence needs to be grounded in reality. 00:07:09.260 |
intelligence cannot appear without some grounding 00:07:12.340 |
in some reality, it doesn't need to be physical reality, 00:07:22.340 |
Language is a very approximate representation of our percepts 00:07:29.500 |
I mean, there's a lot of tasks that we accomplish 00:07:40.700 |
Everything that's physical, mechanical, whatever, 00:07:43.540 |
when we build something, when we accomplish a task, 00:07:47.100 |
a model task of grabbing something, et cetera, 00:07:52.900 |
by essentially imagining the result of the outcome 00:07:57.180 |
of a sequence of actions that we might imagine. 00:08:06.060 |
And that's, I would argue, most of our knowledge 00:08:09.900 |
is derived from that interaction with the physical world. 00:08:13.740 |
So a lot of my colleagues who are more interested 00:08:17.420 |
in things like computer vision are really in that camp 00:08:25.100 |
And then other people coming from the NLP side, 00:08:37.140 |
And the complexity of the world is hard to imagine. 00:08:51.020 |
that we take completely for granted in the real world 00:08:53.580 |
that we don't even imagine require intelligence, right? 00:09:03.300 |
it seems to be easy to do high-level complex tasks 00:09:09.700 |
Whereas the thing we take for granted that we do every day, 00:09:16.380 |
or grabbing an object, we can't do with computers. 00:09:29.500 |
But then they can't learn to drive in 20 hours 00:09:35.460 |
They can't learn to clear out the dinner table 00:09:38.660 |
and fill up the dishwasher like any 10-year-old 00:09:45.860 |
What type of learning or reasoning architecture 00:09:50.700 |
or whatever are we missing that basically prevent us 00:09:55.700 |
from having level five sort of cars and domestic robots? 00:10:00.900 |
- Can a large language model construct a world model 00:10:09.340 |
but just doesn't know how to deal with visual data 00:10:17.220 |
- So yeah, that's what a lot of people are working on. 00:10:22.540 |
And the more complex answer is you can use all kinds 00:10:48.580 |
And we have a number of ways to train vision systems. 00:10:51.340 |
These are supervised, semi-supervised, self-supervised, 00:10:55.220 |
That will turn any image into a high-level representation. 00:11:01.100 |
Basically a list of tokens that are really similar 00:11:04.500 |
to the kind of tokens that a typical LLM takes as an input. 00:11:17.140 |
And you just expect LLM to kind of, during training, 00:11:21.620 |
to kind of be able to use those representations 00:11:32.700 |
I mean, there are LLMs that have some vision extension. 00:11:36.700 |
But these are basically hacks in the sense that those things 00:11:46.460 |
They don't really understand intuitive physics, 00:11:51.220 |
- So you don't think there's something special to you 00:11:53.300 |
about intuitive physics, about sort of common sense reasoning 00:11:55.980 |
about the physical space, about physical reality? 00:12:04.060 |
with the type of LLMs that we are working with today. 00:12:09.300 |
But the main reason is the way LLMs are trained 00:12:16.580 |
you remove some of the words in that text, you mask them, 00:12:26.180 |
And if you build this neural net in a particular way 00:12:33.220 |
that are to the left of the one it's trying to predict, 00:12:36.140 |
then what you have is a system that basically 00:12:38.020 |
is trained to predict the next word in a text, right? 00:12:52.740 |
of all the possible words in your dictionary. 00:12:56.260 |
it predicts tokens that are kind of sub-word units. 00:13:02.780 |
because there's only a finite number of possible words 00:13:07.380 |
And you can just compute the distribution over them. 00:13:16.860 |
Of course, there's a higher chance of picking words 00:13:18.820 |
that have a higher probability within the distribution. 00:13:32.300 |
And once you do this, you shift it into the input, et cetera. 00:13:46.300 |
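A minimal sketch of the loop just described: predict a distribution over the next token, pick one, shift it into the input, and repeat. The toy bigram table is purely illustrative; a real LLM conditions on the whole prefix, not just the last token:

```python
import random

# Hypothetical next-token distributions, standing in for a trained network.
toy_model = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.5, "ran": 0.5},
    "sat": {"down": 1.0},
    "ran": {"away": 1.0},
}

def sample_next(prev_token):
    # Produce a probability distribution over the vocabulary and sample from it.
    dist = toy_model.get(prev_token, {"<eos>": 1.0})
    tokens, probs = zip(*dist.items())
    return random.choices(tokens, weights=probs, k=1)[0]

sequence = ["the"]
for _ in range(4):
    nxt = sample_next(sequence[-1])   # pick a word from the predicted distribution
    sequence.append(nxt)              # shift it into the input and predict again
print(" ".join(sequence))             # e.g. "the cat sat down <eos>"
```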
And there is a difference between this kind of process 00:13:50.620 |
and a process by which before producing a word, 00:14:00.460 |
and it's relatively independent of the language 00:14:06.980 |
let's say a mathematical concept or something, 00:14:10.940 |
and the answer that we're planning to produce 00:14:19.460 |
- Chomsky just rolled his eyes, but I understand. 00:14:21.700 |
So you're saying that there's a bigger abstraction 00:14:25.420 |
that goes before language and maps onto language. 00:14:31.140 |
It's certainly true for a lot of thinking that we do. 00:14:35.780 |
like you're saying your thinking is same in French 00:14:42.060 |
- Pretty much, or is this like, how flexible are you? 00:14:45.740 |
Like if there's a probability distribution? 00:14:48.060 |
- Well, it depends what kind of thinking, right? 00:14:54.340 |
I get much better in French than English about that. 00:15:00.500 |
Like is your humor an abstract, like when you tweet 00:15:03.340 |
and your tweets are sometimes a little bit spicy, 00:15:05.780 |
is there an abstract representation in your brain 00:15:13.380 |
of imagining the reaction of a reader to that text. 00:15:21.980 |
- Or figure out like a reaction you want to cause 00:15:34.340 |
or imagining something you want to build out of wood 00:15:41.100 |
is absolutely nothing to do with language really. 00:15:44.980 |
like an internal monologue in any particular language. 00:15:47.700 |
You're imagining mental models of the thing, right? 00:15:52.180 |
I mean, if I ask you to imagine what this water bottle 00:16:02.500 |
And so clearly there is a more abstract level 00:16:06.860 |
of representation in which we do most of our thinking 00:16:18.940 |
as opposed to an output being muscle actions, right? 00:16:34.980 |
It's like, it's a bit like the subconscious actions 00:16:42.820 |
you're doing something, you're completely concentrated. 00:16:44.980 |
And someone comes to you and ask you a question 00:16:49.660 |
You don't have time to think about the answer 00:16:51.460 |
but the answer is easy so you don't need to pay attention. 00:17:01.220 |
It retrieves it because it's accumulated a lot of knowledge 00:17:06.140 |
but it's going to just spit out one token after the other 00:17:13.060 |
- But you're making it sound just one token after the other, 00:17:17.260 |
one token at a time generation is bound to be simplistic. 00:17:22.260 |
But if the world model is sufficiently sophisticated 00:17:30.180 |
the most likely thing it generates is a sequence of tokens 00:17:39.140 |
- Okay, but then that assumes that those systems 00:17:53.780 |
not complete, but one that has a deep understanding 00:17:58.580 |
- Yeah, so can you build this first of all by prediction? 00:18:14.180 |
because language is very poor in terms of weak 00:18:21.380 |
So building world models means observing the world 00:18:27.140 |
and understanding why the world is evolving the way it is. 00:18:32.140 |
And then the extra component of a world model 00:18:41.780 |
is going to evolve as a consequence of an action 00:18:47.020 |
here is my idea of the state of the world at time T, 00:18:51.020 |
What is the predicted state of the world at time T plus one? 00:18:55.700 |
Now that state of the world does not need to represent 00:19:01.180 |
It just needs to represent enough that's relevant 00:19:18.600 |
You take a video, show a system, a piece of video, 00:19:22.420 |
and then ask it to predict the remainder of the video. 00:19:29.380 |
Do the same thing as sort of the autoregressive LLMs do, 00:19:35.060 |
Either one frame at a time or a group of frames at a time. 00:19:41.180 |
The idea of doing this has been floating around 00:19:51.060 |
have been trying to do this for about 10 years. 00:19:53.380 |
And you can't really do the same trick as with LLMs 00:20:06.860 |
but you can predict the distribution over words. 00:20:11.580 |
what you would have to do is predict the distribution 00:20:16.500 |
And we don't really know how to do that properly. 00:20:19.980 |
We do not know how to represent distributions 00:20:33.060 |
is because the world is incredibly more complicated 00:20:37.300 |
and richer in terms of information than text. 00:20:58.340 |
that's going to be in the room as I pan around. 00:21:00.100 |
The system cannot predict what's going to be in the room 00:21:09.340 |
It can't predict what the painting on the wall looks like 00:21:16.140 |
So there's no way I can predict all those details. 00:21:26.420 |
is to have a model that has what's called a latent variable. 00:21:29.820 |
And the latent variable is fed to a neural net 00:21:33.020 |
and it's supposed to represent all the information 00:21:43.860 |
for the prediction to do a good job at predicting pixels, 00:21:47.180 |
including the fine texture of the carpet and the couch 00:21:54.980 |
That has been a complete failure, essentially. 00:22:04.700 |
We tried VAEs, all kinds of regularized autoencoders. 00:22:15.700 |
to learn good representations of images or video 00:22:20.260 |
that could then be used as input to, for example, 00:22:44.100 |
and then try to reconstruct the complete video or image 00:22:52.180 |
the system will develop good representations of images 00:22:54.900 |
that you can use for object recognition, segmentation, 00:22:58.460 |
That has been essentially a complete failure. 00:23:04.460 |
That's the principle that is used for LLMs, right? 00:23:09.340 |
Is it that it's very difficult to form a good representation 00:23:19.340 |
Is it in terms of the consistency of image to image 00:23:27.140 |
of all the ways you failed, what's that look like? 00:23:35.300 |
first of all, I have to tell you exactly what doesn't work 00:23:37.220 |
because there is something else that does work. 00:23:40.060 |
So the thing that does not work is training the system 00:23:55.740 |
And we have a whole slew of techniques for this 00:23:59.100 |
that are, you know, variant of denoising autoencoders, 00:24:02.540 |
something called MAE developed by some of my colleagues 00:24:10.460 |
or things like this where you train the system 00:24:13.100 |
by corrupting text, except you corrupt images, 00:24:25.540 |
but you train it supervised, with labeled data, 00:24:30.100 |
with textual descriptions of images, et cetera, 00:24:35.780 |
And the performance on recognition tasks is much better 00:24:39.700 |
than if you do this self-supervised pre-training. 00:24:45.420 |
The architecture of the encoder is good, okay? 00:24:49.500 |
to reconstruct images does not lead it to produce, 00:25:08.860 |
What are these architectures that you're so excited about? 00:25:15.300 |
to reconstruct the full image from a corrupted version, 00:25:21.540 |
you take the corrupted or transformed version, 00:25:27.140 |
which in general are identical, but not necessarily. 00:25:31.580 |
And then you train a predictor on top of those encoders 00:25:36.580 |
to predict the representation of the full input 00:25:42.460 |
from the representation of the corrupted one, okay? 00:25:47.460 |
So joint embedding, because you're taking the full input 00:25:51.100 |
and the corrupted version or transformed version, 00:25:54.140 |
run them both through encoders, you get a joint embedding. 00:25:59.140 |
can I predict the representation of the full one 00:26:01.980 |
from the representation of the corrupted one, okay? 00:26:07.820 |
so that means joint embedding predictive architecture 00:26:12.620 |
the representation of the good guy from the bad guy. 00:26:55.700 |
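A minimal sketch of that joint-embedding predictive setup, in PyTorch-style code. The sizes, the crude masking used as corruption, and the two-encoder arrangement are illustrative stand-ins, not Meta's implementation, and a real JEPA also needs the anti-collapse machinery discussed next:

```python
import torch
import torch.nn as nn

dim_in, dim_rep = 784, 64
encoder = nn.Sequential(nn.Linear(dim_in, 256), nn.ReLU(), nn.Linear(256, dim_rep))
target_encoder = nn.Sequential(nn.Linear(dim_in, 256), nn.ReLU(), nn.Linear(256, dim_rep))
predictor = nn.Sequential(nn.Linear(dim_rep, dim_rep), nn.ReLU(), nn.Linear(dim_rep, dim_rep))

x = torch.randn(32, dim_in)                       # "full" inputs (stand-in for images)
x_corr = x * (torch.rand_like(x) > 0.5).float()   # crude corruption: mask half the input

s_full = target_encoder(x).detach()               # representation of the full input
s_corr = encoder(x_corr)                          # representation of the corrupted input
loss = ((predictor(s_corr) - s_full) ** 2).mean() # predict in representation space, not pixels
loss.backward()
```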
and produces representations that are constant. 00:27:02.780 |
and those things have been around since the early '90s. 00:27:07.140 |
You also show pairs of images that you know are different, 00:27:13.420 |
and then you push away the representations from each other. 00:27:17.540 |
So you say, not only do representations of things 00:27:23.900 |
but representations of things that we know are different 00:27:26.620 |
And that prevents the collapse, but it has some limitation. 00:27:30.140 |
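A sketch of that contrastive recipe, in the spirit of the early Siamese-network losses (margin, sizes, and data are illustrative): pull together embeddings of pairs known to show the same thing, push apart pairs known to be different, up to a margin:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, same, margin=1.0):
    d = F.pairwise_distance(z1, z2)                # distance between the two embeddings
    pull = same * d.pow(2)                         # same content: make the distance small
    push = (1 - same) * F.relu(margin - d).pow(2)  # different content: push beyond the margin
    return (pull + push).mean()

z1, z2 = torch.randn(8, 64), torch.randn(8, 64)    # embeddings of two views
same = torch.randint(0, 2, (8,)).float()           # 1 = same thing, 0 = different things
print(contrastive_loss(z1, z2, same))
```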
And there's a whole bunch of techniques that have appeared 00:27:43.260 |
But there are limitations to those contrastive methods. 00:27:47.260 |
What has changed in the last three, four years 00:27:52.420 |
is now we have methods that are non-contrastive. 00:27:54.900 |
So they don't require those negative contrastive samples 00:28:01.620 |
You train them only with images that are different versions 00:28:01.620 |
And we have half a dozen different methods for this now. 00:28:17.860 |
between joint embedding architectures and LLMs? 00:28:26.980 |
Whether we should say that you don't like the term AGI, 00:28:40.220 |
Well, we'll probably continue to argue about it, it's great. 00:28:52.620 |
- And AMI stands for advanced machine intelligence. 00:29:16.020 |
by reconstruction, generate the inputs, right? 00:29:20.060 |
They generate the original input that is non-corrupted, 00:29:29.940 |
And there is a huge amount of resources spent in the system 00:29:33.420 |
to actually predict all those pixels, all the details. 00:29:36.180 |
In a JEPA, you're not trying to predict all the pixels, 00:29:40.500 |
you're only trying to predict an abstract representation 00:29:49.460 |
So what the JEPA system, when it's being trained, 00:29:51.460 |
is trying to do is extract as much information as possible 00:29:54.820 |
from the input, but yet only extract information 00:30:00.500 |
Okay, so there's a lot of things in the world 00:30:05.220 |
if you have a self-driving car driving down the street 00:30:13.180 |
And it could be a windy day, so the leaves on the tree 00:30:16.260 |
are kind of moving in kind of semi-chaotic random ways 00:30:31.380 |
And so when you do the prediction in representation space, 00:30:35.940 |
you're not going to have to predict every single pixel 00:30:38.860 |
And that, you know, not only is a lot simpler, 00:30:43.540 |
but also it allows the system to essentially learn 00:30:49.780 |
where, you know, what can be modeled and predicted 00:31:03.140 |
this is something we do absolutely all the time. 00:31:06.980 |
we describe it at a particular level of abstraction. 00:31:10.100 |
And we don't always describe every natural phenomenon 00:31:20.060 |
to describe what happens in the world, you know, 00:31:22.660 |
starting from quantum field theory to like atomic theory 00:31:25.620 |
and molecules, you know, and chemistry materials, 00:31:34.780 |
So we can't just only model everything at the lowest level. 00:31:40.460 |
And that's what the idea of JEPA is really about. 00:31:44.540 |
Learn abstract representation in a self-supervised manner. 00:31:49.540 |
And, you know, you can do it hierarchically as well. 00:31:56.300 |
And in language, we can get away without doing this 00:31:58.540 |
because language is already to some level abstract 00:32:02.580 |
and already has eliminated a lot of information 00:32:07.060 |
And so we can get away without doing the joint embedding, 00:32:11.020 |
without, you know, lifting the abstraction level 00:32:19.980 |
but it's generative in this abstract representation space. 00:32:24.220 |
- And you're saying language, we were lazy with language 00:32:27.300 |
'cause we already got the abstract representation for free. 00:32:31.980 |
actually think about generally intelligent systems. 00:32:34.580 |
We have to deal with a full mess of physical reality, 00:32:40.100 |
And you do have to do this step of jumping from 00:32:58.180 |
And the thing is those self-supervised algorithm 00:33:00.500 |
that learned by prediction, even in representation space, 00:33:09.260 |
if the input data you feed them is more redundant. 00:33:24.060 |
sensory input like vision than there is in text, 00:33:33.420 |
Language might represent more information really 00:33:40.260 |
And so self-supervised learning will not work as well. 00:33:43.700 |
- Is it possible to join the self-supervised training 00:33:56.540 |
even though you talked down about those 10 to the 13 tokens. 00:34:00.260 |
Those 10 to the 13 tokens represent the entirety 00:34:03.340 |
a large fraction of what us humans have figured out, 00:34:11.380 |
and the contents of all the books and the articles 00:34:14.180 |
and the full spectrum of human intellectual creation. 00:34:18.980 |
So is it possible to join those two together? 00:34:30.340 |
And in fact, that's what people are doing at the moment 00:34:35.220 |
We're using language as a crutch to help the deficiencies 00:34:40.020 |
of our vision systems to kind of learn good representations 00:34:46.460 |
And the problem with this is that we might improve 00:34:53.780 |
I mean, our language models by feeding them images, 00:34:58.100 |
but we're not gonna get to the level of even the intelligence 00:35:01.740 |
or level of understanding of the world of a cat or a dog, 00:35:08.620 |
and they understand the world much better than any LLM. 00:35:14.140 |
and sort of imagine the result of a bunch of actions. 00:35:30.780 |
how do we get systems to learn how the world works? 00:35:33.300 |
- So this kind of joint embedding, predictive architecture, 00:35:51.340 |
In fact, the techniques we're using are non-contrastive. 00:35:54.260 |
So not only is the architecture non-generative, 00:35:57.740 |
the learning procedures we're using are non-contrastive. 00:36:05.700 |
and there's a number of methods that use this principle. 00:36:25.700 |
And there's another one also called DINO, 00:36:41.340 |
and then you corrupt that input or transform it, 00:36:43.540 |
run it through essentially what amounts to the same encoder 00:36:53.100 |
but train a predictor to predict a representation 00:36:55.260 |
of the first uncorrupted input from the corrupted input. 00:37:22.620 |
with the collapse of the type I was explaining before, 00:37:24.700 |
where the system basically ignores the input. 00:37:39.300 |
- So what kind of data are we talking about here? 00:37:47.340 |
you corrupt it by changing the cropping, for example, 00:38:01.620 |
- Basic horrible things that sort of degrade the quality 00:38:25.220 |
and train the entire system, encoder and predictor, 00:38:27.660 |
to predict the representation of the good one 00:38:29.500 |
from the representation of the corrupted one. 00:38:35.420 |
Doesn't need to know that it's an image, for example, 00:38:42.380 |
Whereas with DINO, you need to know it's an image 00:38:44.380 |
because you need to do things like geometry transformation 00:38:53.860 |
is called V-JEPA, so it's basically the same idea as I-JEPA, 00:39:02.740 |
And what we mask is actually kind of a temporal tube, 00:39:04.980 |
so a whole segment of each frame in the video 00:39:10.340 |
- And that tube is like statically positioned 00:39:12.860 |
throughout the frames, so it's literally a straight tube. 00:39:15.860 |
- The tube, yeah, typically is 16 frames or something, 00:39:18.860 |
and we mask the same region over the entire 16 frames. 00:39:22.340 |
It's a different one for every video, obviously. 00:39:28.540 |
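A small sketch of that temporal-tube masking, assuming a 16-frame clip laid out as a grid of patches (all sizes are illustrative):

```python
import numpy as np

T, H, W = 16, 14, 14                    # 16 frames of a 14x14 patch grid
rng = np.random.default_rng(0)
y, x = rng.integers(0, H - 4), rng.integers(0, W - 4)
mask = np.zeros((T, H, W), dtype=bool)
mask[:, y:y + 4, x:x + 4] = True        # the same 4x4 region is masked in every frame
# A different tube would be drawn for every clip in the batch.
```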
so as to predict the representation of the full video 00:39:44.980 |
it can tell you what action is taking place in the video 00:39:49.740 |
So that's the first time we get something of that quality. 00:39:55.980 |
- So that's a good test that good representations form, 00:40:03.460 |
that seemed to indicate that the representation allows us, 00:40:10.660 |
is physically possible or completely impossible 00:40:15.340 |
or an object suddenly jumped from one location to another 00:40:21.860 |
- So it's able to capture some physics-based constraints 00:40:30.220 |
- About the appearance and the disappearance of objects? 00:40:40.740 |
to this kind of a world model that understands enough 00:40:52.660 |
but there are systems already, robotic systems, 00:41:04.860 |
where imagine that you have a video and a complete video. 00:41:23.380 |
or you just mask the second half of the video, for example. 00:41:27.260 |
And then you train a Jepa system of the type I described 00:41:32.260 |
to predict the representation of the full video 00:41:36.140 |
but you also feed the predictor with an action. 00:41:56.820 |
You're not gonna be able to predict all the details 00:41:59.940 |
of objects that appear in the view, obviously, 00:42:05.740 |
you can probably predict what's gonna happen. 00:42:08.660 |
So now what you have is a internal model that says, 00:42:13.100 |
here is my idea of state of the world at time t, 00:42:17.860 |
here is a prediction of the state of the world 00:42:43.520 |
I can predict that if I have an object like this 00:42:54.420 |
And if I push it with a particular force on the table, 00:43:00.060 |
it's probably not gonna move with the same force. 00:43:03.620 |
So we have this internal model of the world in our mind, 00:43:21.580 |
predict what the outcome of the sequence of action 00:43:23.620 |
is going to be, measure to what extent the final state 00:43:31.020 |
like moving the bottle to the left of the table, 00:43:38.460 |
that will minimize this objective at runtime. 00:43:46.140 |
And in optimal control, this is a very classical thing. 00:43:50.580 |
You have a model of the system you want to control 00:44:10.980 |
This is the way rocket trajectories have been planned 00:44:21.860 |
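A toy sketch of that planning-by-optimization loop, in the spirit of model-predictive control: imagine the outcome of a candidate action sequence with the world model, score it against the objective, and improve the actions by gradient descent. The hand-written dynamics below are a stand-in for a learned world model:

```python
import torch

def world_model(state, action):
    return state + action                  # toy dynamics: each action nudges the state

goal = torch.tensor([2.0, -1.0])           # desired final state (the task objective)
actions = torch.zeros(5, 2, requires_grad=True)
opt = torch.optim.SGD([actions], lr=0.1)

for _ in range(100):
    state = torch.zeros(2)
    for a in actions:                      # roll the imagined action sequence forward
        state = world_model(state, a)
    cost = ((state - goal) ** 2).sum()     # distance of the predicted outcome from the goal
    opt.zero_grad()
    cost.backward()
    opt.step()                             # adjust the planned actions, not the model
print(actions.detach())                    # the five actions now sum roughly to the goal
```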
but you also often talk about hierarchical planning. 00:44:26.860 |
- Can hierarchical planning emerge from this somehow? 00:44:29.860 |
You will have to build a specific architecture 00:44:34.660 |
So hierarchical planning is absolutely necessary 00:44:39.580 |
If I want to go from, let's say, from New York to Paris, 00:44:52.140 |
At a high level, a very abstract representation 00:44:55.100 |
of my location, I will have to decompose this 00:45:04.700 |
Okay, so my sub-goal is now going to the airport. 00:45:09.140 |
My objective function is my distance to the airport. 00:45:14.140 |
Well, I have to go in the street and hail a taxi, 00:45:26.220 |
going to the elevator, going down the elevator, 00:45:36.380 |
open the door of my office, go to the elevator, 00:45:47.420 |
to millisecond by millisecond muscle control. 00:45:56.540 |
in terms of millisecond by millisecond muscle control. 00:46:08.060 |
You know, how long it's going to take to catch a taxi 00:46:10.660 |
or to go to the airport with traffic, you know. 00:46:14.980 |
I mean, you would have to know exactly the condition 00:46:17.500 |
of everything to be able to do this planning. 00:46:27.380 |
And nobody really knows how to do this in AI. 00:46:35.340 |
to run the appropriate multiple levels of representation 00:46:43.060 |
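A toy sketch of that hierarchical decomposition, with a hypothetical hand-written lookup standing in for the part nobody yet knows how to learn:

```python
def decompose(goal):
    # Hypothetical, hand-written sub-goals; a real system would have to learn these.
    table = {
        "go to Paris": ["go to the airport", "catch a plane to Paris"],
        "go to the airport": ["go down to the street", "hail a taxi"],
    }
    return table.get(goal, [goal])

def plan(goal, level):
    if level == 0:
        return [f"low-level actions for: {goal}"]   # ultimately muscle control
    steps = []
    for sub_goal in decompose(goal):                # plan only down to sub-goals
        steps += plan(sub_goal, level - 1)          # refine each one at the level below
    return steps

print(plan("go to Paris", level=2))
```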
So like, can you use an LLM, state-of-the-art LLM 00:46:50.940 |
by doing exactly the kind of detailed set of questions 00:46:56.940 |
can you give me a list of 10 steps I need to do 00:47:10.340 |
can you give me a list of 10 steps to make each one of those 00:47:16.420 |
Maybe not, whatever you can actually act upon 00:47:24.500 |
So the first thing is LLMs will be able to answer 00:47:27.700 |
some of those questions down to some level of abstraction 00:47:30.500 |
under the condition that they've been trained 00:47:34.480 |
with similar scenarios in their training set. 00:47:37.260 |
- They would be able to answer all of those questions, 00:47:40.100 |
but some of them may be hallucinated, meaning non-factual. 00:47:44.260 |
- Yeah, true, I mean, they will probably produce 00:47:45.780 |
some answer, except they're not gonna be able 00:47:47.420 |
to really kind of produce millisecond by millisecond 00:47:49.660 |
muscle control of how you stand up from your chair, right? 00:47:59.620 |
but only under the condition that they've been trained 00:48:04.180 |
They're not gonna be able to plan for situations 00:48:09.420 |
They basically are going to have to regurgitate 00:48:12.980 |
So where, like, just for the example of New York to Paris, 00:48:18.940 |
Like at which layer of abstraction do you think you'll start? 00:48:22.620 |
'Cause like I can imagine almost every single part of that 00:48:25.420 |
an LLM will be able to answer somewhat accurately, 00:48:27.760 |
especially when you're talking about New York 00:48:33.940 |
to solve that problem if you fine-tuned it for it. 00:48:36.660 |
You know, just, and so I can't say that an LLM cannot do this. 00:48:42.420 |
It can do this if you train it for it, there's no question. 00:48:45.780 |
Down to a certain level where things can be formulated 00:48:51.340 |
But like, if you wanna go down to like how you, you know, 00:48:53.840 |
climb down the stairs or just stand up from your chair 00:48:56.100 |
in terms of words, like you can't, you can't do it. 00:48:59.380 |
You need, that's one of the reasons you need experience 00:49:04.940 |
of the physical world, which is much higher bandwidth 00:49:07.740 |
than what you can express in words, in human language. 00:49:15.740 |
that that's what we need for like the interaction 00:49:20.620 |
And then just the LLMs are the thing that sits on top of it 00:49:28.580 |
that I need to book a plane ticket and I need to know, 00:49:33.700 |
- Sure, and you know, a lot of plans that people know about 00:49:37.060 |
that are relatively high level are actually learned. 00:49:40.740 |
They're not people, most people don't invent the, you know, 00:49:50.260 |
we have some ability to do this, of course, obviously, 00:49:59.540 |
Like they've seen other people use those plans 00:50:01.280 |
or they've been told how to do things, right? 00:50:04.180 |
That you can't invent how you like take a person 00:50:07.660 |
who's never heard of airplanes and tell them like, 00:50:11.660 |
And they're probably not going to be able to kind of, 00:50:18.820 |
So certainly LLMs are going to be able to do this, 00:50:20.740 |
but then how you link this from the low level of actions, 00:50:25.740 |
that needs to be done with things like JEPA that basically 00:50:32.400 |
lifts the abstraction level of the representation 00:50:34.780 |
without attempting to reconstruct every detail 00:50:40.740 |
I would love to sort of linger on your skepticism 00:50:48.400 |
So one way I would like to test that skepticism 00:50:54.980 |
But if I apply everything you said today and in general 00:51:04.080 |
maybe a little bit less, no, let's say three years ago, 00:51:07.900 |
I wouldn't be able to predict the success of LLMs. 00:51:12.620 |
So does it make sense to you that autoregressive LLMs 00:51:24.260 |
Because if I were to take your wisdom and intuition 00:51:34.300 |
would be able to do the kind of things they're doing. 00:51:36.260 |
- No, there's one thing that autoregressive LLMs 00:51:39.260 |
or that LLMs in general, not just the autoregressive ones, 00:51:42.420 |
but including the BERT-style bidirectional ones 00:51:45.260 |
are exploiting and it's self supervised learning. 00:51:53.300 |
So those things are an incredibly impressive demonstration 00:51:58.300 |
that self supervised learning actually works. 00:52:02.140 |
The idea that started, it didn't start with BERT, 00:52:07.140 |
but it was really kind of a good demonstration with this. 00:52:09.660 |
So the idea that you take a piece of text, you corrupt it 00:52:25.680 |
It allowed us to create systems that understand language, 00:52:31.380 |
systems that can translate hundreds of languages 00:52:34.980 |
in any direction, systems that are multilingual. 00:52:38.200 |
So they're not, it's a single system that can be trained 00:52:43.260 |
and translate in any direction and produce summaries 00:52:51.780 |
And then there's a special case of it where you, 00:52:58.580 |
to not elaborate a representation of the text 00:53:11.580 |
And that's what you can build an autoregressive LLM from. 00:53:23.120 |
that are just trying to produce words from the previous one 00:53:31.260 |
they tend to really kind of understand more about the, 00:53:35.900 |
about language when you train them on lots of data, 00:53:40.720 |
And that surprise occurred quite a while back, 00:53:50.580 |
going back to, you know, the GPT kind of work 00:53:56.820 |
- You mean like GPT-2, like there's a certain place 00:54:02.060 |
might actually keep giving us an emergent benefit. 00:54:06.720 |
- Yeah, I mean, there were work from various places, 00:54:20.860 |
you're so charismatic and you said so many words, 00:54:25.880 |
But again, the same intuition you're applying 00:54:31.600 |
cannot have a deep understanding of the world, 00:54:38.060 |
does it make sense to you that they're able to form 00:54:50.840 |
- Well, we're fooled by their fluency, right? 00:54:58.900 |
all the characteristics of human intelligence, 00:55:08.940 |
Without understanding anything, just hanging out with it. 00:55:11.420 |
- Alan Turing would decide that a Turing test 00:55:15.520 |
This is what the AI community has decided many years ago, 00:55:18.940 |
that the Turing test was a really bad test of intelligence. 00:55:26.300 |
- Hans Moravec would say that Moravec paradox still applies. 00:55:32.180 |
- You don't think he would be really impressed? 00:55:34.260 |
- No, of course, everybody would be impressed, 00:55:35.800 |
but it's not a question of being impressed or not. 00:55:47.580 |
There's a whole industry that is being built around them. 00:56:02.580 |
I'm seeing this from basically 10 years of research 00:56:12.200 |
Actually, that's going back more than 10 years, 00:56:15.300 |
so basically capturing the internal structure 00:56:18.260 |
of a piece of a set of inputs without training the system 00:56:22.460 |
for any particular task, like learning representations. 00:56:25.220 |
You know, the conference I co-founded 14 years ago 00:56:35.980 |
And it's been my obsession for almost 40 years now, so. 00:56:39.900 |
So learning representation is really the thing. 00:56:45.780 |
And then we started working on what we used to call 00:56:48.940 |
unsupervised learning, and sort of revived the idea 00:57:00.780 |
actually works pretty well if you can collect enough data. 00:57:03.940 |
And so the whole idea of unsupervised self-supervised 00:57:20.540 |
and really pushing for, like, finding new methods 00:57:24.740 |
to do self-supervised learning, both for text 00:57:29.740 |
And some of that work has been incredibly successful. 00:57:38.300 |
content moderation on Meta, for example, on Facebook, 00:57:43.300 |
whether a piece of text is hate speech or not, or something, 00:57:46.460 |
is due to that progress using self-supervised learning 00:57:51.220 |
transformer architectures and blah, blah, blah. 00:57:53.700 |
But that's the big success of self-supervised learning. 00:57:55.740 |
We had similar success in speech recognition, 00:58:00.140 |
which is also a joint embedding architecture, by the way, 00:58:06.740 |
speech recognition systems that are multilingual 00:58:10.540 |
with mostly unlabeled data and only need a few minutes 00:58:14.140 |
of label data to actually do speech recognition. 00:58:17.980 |
We have systems now based on those combination of ideas 00:58:22.180 |
that can do real-time translation of hundreds of languages 00:58:30.220 |
just fascinating languages that don't have written forms. 00:58:38.740 |
using an internal representation of kind of speech units 00:58:41.220 |
that are discrete. It's called Textless NLP. 00:58:50.980 |
we tried to apply this idea to learning representations 00:58:54.700 |
of images by training a system to predict videos, 00:58:57.340 |
learning intuitive physics by training a system 00:59:05.060 |
with generative models, with models that predict pixels. 00:59:23.980 |
We abandoned this idea of predicting every pixel 00:59:33.260 |
So there's ample evidence that we're not gonna be able 00:59:37.700 |
to learn good representations of the real world 00:59:46.820 |
If you're really interested in human level AI, 00:59:54.900 |
to get far with the joint embedding representation. 01:00:07.580 |
the kind of reasoning that LLMs are able to do, 01:00:13.620 |
but the kind of stuff that LLMs are able to do 01:00:25.980 |
would you jump a type of approach looking at video, 01:00:58.020 |
that might be difficult for a purely language-based system 01:01:07.180 |
from reading texts, the entirety of the publicly available 01:01:12.700 |
from New York to Paris by snapping my fingers. 01:01:29.860 |
So that link from the low level to the high level, 01:01:34.860 |
the thing is that the high level that language expresses 01:01:38.860 |
is based on the common experience of the low level, 01:01:47.660 |
we know we have a common experience of the world, 01:01:50.620 |
like a lot of it is similar, and the LLMs don't have that. 01:02:01.060 |
You and I have a common experience of the world 01:02:05.860 |
and stuff like this, and that common knowledge of the world 01:02:21.180 |
you're going to get this stuff that's between the lines. 01:02:28.620 |
you're going to have to understand how gravity works, 01:02:31.660 |
even if you don't have an explicit explanation of gravity. 01:02:37.360 |
there is explicit explanation of gravity in Wikipedia. 01:02:40.020 |
But the stuff that we think of as common sense reasoning, 01:02:51.820 |
Now you could say, as you have, there's not enough text. 01:02:59.160 |
which is that to be able to do high-level common sense, 01:03:15.380 |
I would not agree with the fact that implicit 01:03:18.980 |
in all languages in the world is the underlying reality. 01:03:34.340 |
okay, there's the dark web, meaning whatever, 01:03:37.460 |
the private conversations, like DMs and stuff like this, 01:03:41.160 |
which is much, much larger probably than what's available, 01:03:46.880 |
- You don't need to communicate the stuff that is common. 01:03:52.140 |
Like when you, you don't need to, but it comes through. 01:04:02.460 |
will be explanation of the fact that cups fall, 01:04:09.380 |
And then you'll have some very vague information 01:04:12.700 |
about what kind of things explode when they hit the ground. 01:04:16.740 |
And then maybe you'll make a joke about entropy 01:04:19.840 |
and we'll never be able to reconstruct this again. 01:04:22.000 |
Like, okay, you'll make a little joke like this, 01:04:27.020 |
And from the jokes, you can piece together the fact 01:04:32.860 |
You don't need to see, it'll be very inefficient. 01:04:36.860 |
It's easier for like, to knock the thing over, 01:04:46.600 |
- I just think that most of the information of this type 01:04:54.320 |
is just not present in text, in any description, essentially. 01:04:59.320 |
- And the sensory data is a much richer source 01:05:04.360 |
- I mean, that's the 16,000 hours of wake time 01:05:20.500 |
And then language doesn't come in until like a year in life. 01:05:30.780 |
You know about inertia, you know about gravity, 01:05:32.700 |
you know the stability, you know about the distinction 01:05:38.100 |
You know, by 18 months, you know about why people 01:05:42.380 |
want to do things and you help them if they can't. 01:05:45.500 |
I mean, there's a lot of things that you learn mostly 01:05:47.940 |
by observation, really not even through interaction. 01:05:53.420 |
babies don't really have any influence on the world. 01:05:58.080 |
And you accumulate like a gigantic amount of knowledge 01:06:02.980 |
So that's what we're missing from current AI systems. 01:06:06.400 |
- I think in one of your slides, you have this nice plot 01:06:10.040 |
that is one of the ways you show that LLMs are limited. 01:06:13.940 |
I wonder if you could talk about hallucinations 01:06:17.940 |
Why hallucinations happen from large language models 01:06:22.940 |
and why, and to what degree is that a fundamental flaw 01:06:29.360 |
- Right, so because of the autoregressive prediction, 01:06:34.100 |
every time an LLM produces a token or a word, 01:06:37.220 |
there is some level of probability for that word 01:06:40.740 |
to take you out of the set of reasonable answers. 01:06:45.620 |
And if you assume, which is a very strong assumption, 01:06:59.500 |
What that means is that every time you produce a token, 01:07:05.420 |
of correct answer decreases, and it decreases exponentially. 01:07:08.660 |
- So there's a strong, like you said, assumption there 01:07:10.420 |
that if there's a non-zero probability of making a mistake, 01:07:23.740 |
So the probability that an answer would be nonsensical 01:07:27.860 |
increases exponentially with the number of tokens. 01:07:40.220 |
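A quick numerical illustration of that compounding argument, under the strong independence assumption just mentioned and with an illustrative per-token error rate:

```python
e = 0.01                       # assumed probability that any one token drifts off course
for n in (10, 100, 1000):
    print(n, (1 - e) ** n)     # ~0.90, ~0.37, ~0.00004: correctness decays with length
```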
towards the truth, because on average, hopefully, 01:07:44.380 |
the truth is well-represented in the training set? 01:07:58.700 |
by having it produce answers for all kinds of questions 01:08:04.860 |
And people are people, so a lot of the questions 01:08:08.100 |
that they have are very similar to each other. 01:08:13.700 |
of questions that people will ask by collecting data. 01:08:23.140 |
to produce good answers for all of those things. 01:08:25.620 |
And it's probably going to be able to learn that 01:08:30.880 |
But then there is the enormous set of prompts 01:08:43.260 |
the proportion of prompts that have been used for training 01:08:48.640 |
It's a tiny, tiny, tiny subset of all possible prompts. 01:08:53.940 |
And so the system will behave properly on the prompts 01:08:56.600 |
that has been either trained, pre-trained, or fine-tuned. 01:09:09.260 |
So whatever training the system has been subject 01:09:24.540 |
it's been trained on or things that are similar. 01:09:27.300 |
And then it will just spew complete nonsense. 01:09:30.340 |
- When you say prompt, do you mean that exact prompt? 01:09:42.620 |
or to say a thing that hasn't been said before 01:09:48.340 |
where like you put essentially a random sequence 01:09:53.820 |
And that's enough to kind of throw the system 01:09:56.060 |
into a mode where it's gonna answer something 01:09:59.820 |
completely different than it would have answered 01:10:04.980 |
basically get it, go outside of its conditioning, right? 01:10:09.340 |
- So that's a very clear demonstration of it. 01:10:11.300 |
But of course, that goes outside of what is designed to do. 01:10:42.540 |
And all of a sudden the answer is complete nonsense. 01:10:46.900 |
which fraction of prompts that humans are likely to generate 01:10:54.380 |
- So the problem is that there is a long tail. 01:10:58.620 |
- This is an issue that a lot of people have realized 01:11:08.180 |
And you can fine tune the system for the 80% or whatever 01:11:18.700 |
that you're not gonna be able to fine tune the system 01:11:25.740 |
Essentially, which is not really what you want. 01:11:27.820 |
You want systems that can reason, certainly that can plan. 01:11:30.820 |
So the type of reasoning that takes place in LLM 01:11:37.060 |
is because the amount of computation that is spent 01:11:43.820 |
So if you ask a question and that question has an answer 01:11:50.340 |
the amount of computation devoted to computing that answer 01:11:54.820 |
It's like, it's the size of the prediction network 01:12:00.060 |
with its 36 layers or 92 layers or whatever it is, 01:12:09.180 |
if the question being asked is simple to answer, 01:12:19.740 |
The amount of computation the system will be able 01:12:25.620 |
or is proportional to the number of tokens produced 01:12:35.020 |
with a complex problem or a complex question, 01:12:38.540 |
we spend more time trying to solve it and answer it, right? 01:12:45.580 |
there's an iterative element where you're like 01:12:56.780 |
Does this mean it's a fundamental flaw of LLM? 01:12:59.500 |
Does it mean that, there's more part to that question. 01:13:03.340 |
Now you're just behaving like an LLM, immediately answering. 01:13:17.140 |
some of these kinds of mechanisms, like you said, 01:13:19.560 |
persistent long-term memory or reasoning, so on. 01:13:24.560 |
But we need that world model that comes from language. 01:13:40.900 |
because a lot of people are working on reasoning 01:13:46.720 |
I mean, even if we restrict ourselves to language, 01:13:54.640 |
before your answer in terms that are not necessarily linked 01:13:59.420 |
with the language you're gonna use to produce the answer. 01:14:03.980 |
that allows you to plan what you're gonna say 01:14:19.660 |
would be extremely different from auto-regressive LLMs. 01:14:27.940 |
as the difference between what psychologists call 01:14:34.820 |
that you can accomplish without deliberately, consciously 01:14:43.040 |
that you can just do it subconsciously, right? 01:14:48.580 |
you can drive without really thinking about it. 01:14:58.300 |
you can play against a non-experienced chess player 01:15:02.580 |
You just recognize the pattern and you play, right? 01:15:13.480 |
And then there is all the tasks where you need to plan. 01:15:15.200 |
So if you are a not so experienced chess player 01:15:20.660 |
or you play against another experienced chess player, 01:15:27.220 |
And you're much better if you have time to think about it 01:15:30.520 |
than you are if you play blitz with limited time. 01:15:39.580 |
which uses your internal world model, that's system two. 01:15:48.580 |
How do we build a system that can do this kind of planning 01:15:53.340 |
that or reasoning that devotes more resources 01:16:00.320 |
And it's not going to be autoregressive prediction of tokens. 01:16:03.780 |
It's going to be more something akin to inference 01:16:08.060 |
of latent variables in what used to be called 01:16:19.720 |
You know, the prompt is like observed variables. 01:16:24.640 |
And what the model does is that it's basically a measure of, 01:16:45.180 |
which is let's say zero if the answer is a good answer 01:16:51.120 |
if the answer is not a good answer for the question. 01:16:58.900 |
The way you would do is, you know, produce the prompt 01:17:02.520 |
and then search through the space of possible answers 01:17:11.580 |
But that energy-based model would need the model 01:17:18.580 |
- Well, so really what you need to do would be 01:17:31.060 |
So in sort of the space of abstract thoughts, 01:17:46.460 |
So now the way the system produces its answer 01:17:53.140 |
minimizing an objective function, basically, right? 01:18:03.040 |
of the thought of the answer, representation of the answer. 01:18:06.660 |
We feed that to basically an autoregressive decoder, 01:18:10.680 |
which can be very simple, that turns this into a text 01:18:37.740 |
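A minimal sketch of that inference-by-optimization idea: hold the prompt fixed, refine an abstract latent representation of the answer by gradient descent on an energy function, and only then decode it. The untrained modules, sizes, and step count are illustrative stand-ins:

```python
import torch
import torch.nn as nn

energy = nn.Sequential(nn.Linear(64 + 32, 128), nn.ReLU(), nn.Linear(128, 1))
decoder = nn.Linear(32, 1000)              # would map the latent to token logits

prompt_rep = torch.randn(64)               # representation of the observed prompt
z = torch.zeros(32, requires_grad=True)    # abstract representation of the answer
opt = torch.optim.SGD([z], lr=0.1)

for _ in range(50):                        # harder questions could simply get more steps
    e = energy(torch.cat([prompt_rep, z])).squeeze()
    opt.zero_grad()
    e.backward()
    opt.step()                             # move the latent toward lower energy

logits = decoder(z.detach())               # finally turn the refined latent into tokens
```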
Just to linger on it, you kind of briefly described it, 01:18:47.820 |
- So you have an abstract representation inside the system. 01:18:56.400 |
that predicts a representation of the answer, 01:18:59.820 |
But that representation may not be a good answer 01:19:04.180 |
because there might be some complicated reasoning 01:19:14.480 |
and modifies it so as to minimize a cost function 01:19:32.420 |
whether an answer is a good answer for a question. 01:19:36.000 |
- But suppose such a system could be created. 01:19:38.960 |
But what's the process, this kind of search-like process? 01:19:44.040 |
You can do this if the entire system is differentiable, 01:19:56.900 |
Then by gradient descent, by backpropagating gradients, 01:20:00.640 |
you can figure out how to modify the representation 01:20:08.600 |
So now you have a representation of the answer 01:20:27.600 |
- Right, so you're operating in this abstract representation. 01:20:30.080 |
I mean, this goes back to the joint embedding 01:20:32.640 |
that is better to work in the space of, I don't know, 01:20:37.320 |
or to romanticize the notion like space of concepts 01:20:40.680 |
versus the space of concrete sensory information. 01:20:47.320 |
- Okay, but can this do something like reasoning, 01:20:51.960 |
- Well, not really, only in a very simple way. 01:20:54.160 |
I mean, basically you can think of those things 01:20:56.440 |
as doing the kind of optimization I was talking about, 01:21:02.280 |
which is the space of possible sequences of tokens. 01:21:05.880 |
And they do this optimization in a horribly inefficient way, 01:21:13.400 |
And that's incredibly wasteful in terms of computation. 01:21:32.480 |
in continuous space where you can do gradient descent 01:21:44.280 |
But you can only do this in continuous spaces 01:21:50.360 |
like ability to think deeply or to reason deeply. 01:21:59.240 |
that's better or worse based on deep reasoning? 01:22:04.720 |
- Right, so then we're asking the question of conceptually, 01:22:07.480 |
how do you train an energy-based model, right? 01:22:17.340 |
and it tells you whether Y is compatible with X or not. 01:22:28.120 |
a continuation of the video, you know, whatever. 01:22:32.440 |
And it tells you whether Y is compatible with X. 01:22:35.080 |
And the way it tells you that Y is compatible with X 01:22:37.440 |
is that the output of that function will be zero 01:22:51.880 |
Is you show it pairs of X and Y that are compatible, 01:22:58.840 |
and you train the parameters of the big neural net inside 01:23:08.920 |
well, I'm just gonna say zero for everything. 01:23:11.680 |
So now you have to have a process to make sure 01:23:13.520 |
that for a wrong Y, the energy would be larger than zero. 01:23:21.840 |
So contrastive method is you show an X and a bad Y 01:23:25.040 |
and you tell the system, well, that's, you know, 01:23:28.400 |
give a high energy to this, like push up the energy, right? 01:23:37.680 |
The problem with this is if the space of Y is large, 01:23:48.640 |
And people do this, they do this when you train a system 01:24:00.160 |
that tells you whether an answer is good or bad. 01:24:11.600 |
There is another set of methods which are non-contrastive 01:24:17.360 |
and I prefer those, and those non-contrastive methods 01:24:28.760 |
that are compatible, that come from your training set. 01:24:36.080 |
And the way you do this is by having a regularizer, 01:24:50.440 |
And the precise way to do this is all kinds of different 01:24:54.160 |
specific ways to do this depending on the architecture, 01:25:06.160 |
because there's only a limited volume of space 01:25:11.920 |
by the construction of the system or by the regularizer, 01:25:36.240 |
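One concrete flavor of such a regularizer, sketched in the spirit of the variance criterion used by methods like VICReg (the margin and sizes are illustrative): keeping every embedding dimension spread out across the batch limits how much of the space can sit at low energy, without any negative samples:

```python
import torch

def variance_regularizer(z, margin=1.0, eps=1e-4):
    std = torch.sqrt(z.var(dim=0) + eps)       # per-dimension spread over the batch
    return torch.relu(margin - std).mean()     # penalize dimensions that collapse

z = torch.randn(32, 64)                        # a batch of embeddings
print(variance_regularizer(z))                 # near zero for healthy, spread-out embeddings
```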
- Yeah, so you can do this with language directly 01:25:39.760 |
by just X is a text and Y is a continuation of that text. 01:25:49.720 |
I mean, that's going to do what LLMs are doing. 01:25:54.640 |
how the internal structure of the system is built. 01:25:57.280 |
If the internal structure of the system is built 01:26:04.760 |
that you can manipulate so as to minimize the output energy. 01:26:12.920 |
Then that Z can be viewed as representation of a good answer 01:26:16.760 |
that you can translate into a Y that is a good answer. 01:26:24.640 |
- Very similar way, but you have to have this way 01:26:26.760 |
of preventing collapse, of ensuring that, you know, 01:26:30.360 |
there is high energy for things you don't train it on. 01:26:38.720 |
it's done in a way that people don't realize is being done, 01:26:42.680 |
It's due to the fact that when you give a high probability 01:26:45.960 |
to a word, automatically you give low probability 01:26:50.800 |
to other words, because you only have a finite amount 01:26:54.400 |
of probability to go around right there to sum to one. 01:26:57.800 |
So when you minimize the cross entropy or whatever, 01:27:04.560 |
you're increasing the probability your system will give 01:27:08.480 |
to the correct word, but you're also decreasing 01:27:10.200 |
the probability it will give to the incorrect words. 01:27:12.360 |
Now, indirectly, that gives a low probability to, 01:27:17.120 |
a high probability to sequences of words that are good 01:27:19.480 |
and low probability to sequences of words that are bad, 01:27:23.600 |
And it's not obvious why this actually works at all, 01:27:26.800 |
but because you're not doing it on a joint probability 01:27:32.920 |
you're just doing it kind of sort of factorize 01:27:36.920 |
that probability in terms of conditional probabilities 01:27:44.000 |
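A tiny illustration of the point about probability having to "go around": with a softmax over a finite vocabulary, raising the score of the correct word necessarily lowers the probability of every other word:

```python
import math

def softmax(logits):
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

print(softmax([0.0, 0.0, 0.0]))   # uniform: [0.33, 0.33, 0.33]
print(softmax([2.0, 0.0, 0.0]))   # ~[0.79, 0.11, 0.11]: the other words lose probability
```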
- So we've been doing this with JEPA architectures, 01:27:48.040 |
So there, the compatibility between two things is, 01:27:53.040 |
here's an image or a video, here's a corrupted, shifted, 01:27:57.480 |
or transformed version of that image or video or masked. 01:28:01.080 |
Okay, and then the energy of the system is the prediction 01:28:11.800 |
The predicted representation of the good thing, 01:28:14.480 |
versus the actual representation of the good thing, right? 01:28:17.360 |
So you run the corrupted image to the system, 01:28:20.840 |
predict the representation of the good input, uncorrupted, 01:28:28.040 |
So this system will tell you, this is a good, 01:28:31.760 |
this is a good image and this is a corrupted version. 01:28:36.680 |
It will give you zero energy if those two things 01:28:39.000 |
are effectively, one of them is a corrupted version 01:28:46.480 |
- And hopefully that whole process gives you a really nice 01:28:49.760 |
compressed representation of reality, of visual reality. 01:28:54.560 |
- And we know it does because then we use those 01:28:56.440 |
representations as input to a classification system. 01:28:59.360 |
- That classification system works really nicely, okay. 01:29:02.000 |
Well, so to summarize, you recommend in a spicy way 01:29:10.400 |
You recommend that we abandon generative models 01:29:21.740 |
Abandon probabilistic models in favor of energy-based models 01:29:26.220 |
Abandon contrastive methods in favor of regularized methods. 01:29:32.100 |
You've been for a while a critic of reinforcement learning. 01:29:37.000 |
So the last recommendation is that we abandon RL 01:29:46.560 |
when planning doesn't yield the predicted outcome. 01:29:50.440 |
And we use RL in that case to adjust the world model 01:29:55.960 |
So you mentioned RLHF, reinforcement learning 01:30:02.980 |
Why do you still hate reinforcement learning? 01:30:07.080 |
and I think it should not be abandoned completely, 01:30:14.480 |
because it's incredibly inefficient in terms of samples. 01:30:21.400 |
is to first have it learn good representations of the world 01:30:38.060 |
You can use, if you've learned a world model, 01:30:40.000 |
you can use the world model to plan a sequence of actions 01:30:51.360 |
Your idea of whether you were gonna fall from your bike 01:30:56.260 |
might be wrong, or whether the person you're fighting 01:31:01.560 |
with MMA was gonna do something and then do something else. 01:31:09.520 |
Either your objective function does not reflect 01:31:13.680 |
the actual objective function you want to optimize, 01:31:19.760 |
So you didn't, the prediction you were making 01:31:22.060 |
about what was gonna happen in the world is inaccurate. 01:31:35.880 |
This is what RL deals with to some extent, right? 01:31:41.080 |
And the way to adjust your world model, even in advance, 01:31:44.200 |
is to explore parts of the space where your world model, 01:31:48.180 |
where you know that your world model is inaccurate. 01:31:50.720 |
That's called curiosity basically, or play, right? 01:31:54.080 |
When you play, you kind of explore parts of the state space 01:32:15.120 |
When it comes time to learning a particular task, 01:32:18.720 |
you already have all the good representations, 01:32:21.840 |
but you need to adjust it for the situation at hand. 01:32:29.620 |
This reinforcement learning with human feedback. 01:32:32.600 |
Why did it have such a transformational effect 01:32:43.560 |
and some of it is just purely supervised, actually. 01:32:50.180 |
And then there is various ways to use human feedback, right? 01:32:57.240 |
multiple answers that are produced by the model. 01:33:00.020 |
And then what you do is you train an objective function 01:33:13.680 |
and you can backpropagate gradient through this 01:33:16.200 |
so that it only produces highly rated answers. 01:33:29.380 |
So something that, basically a small neural net 01:33:31.800 |
that estimates to what extent an answer is good, right? 01:34:09.160 |
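A minimal sketch of such a reward model, trained from human preferences with a common pairwise ranking loss (the features, sizes, and random stand-in data are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

pair_good = torch.randn(16, 128)   # representations of prompt+answer pairs humans preferred
pair_bad = torch.randn(16, 128)    # the alternatives they ranked lower
margin = reward_model(pair_good) - reward_model(pair_bad)
loss = -F.logsigmoid(margin).mean()   # preferred answers should score higher
loss.backward()                       # the chat model is then tuned toward high-scoring answers
```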
- Now, a lot of people have been very critical 01:34:19.080 |
for essentially, in my words, I could say super woke. 01:34:23.260 |
Woke in the negative connotation of that word. 01:34:26.580 |
There's some almost hilariously absurd things that it does, 01:34:32.740 |
like generating images of a black George Washington, 01:34:43.220 |
which is refusing to comment on or generate images 01:34:55.540 |
one of the most sort of legendary protest images in history. 01:35:06.740 |
and therefore everybody started asking questions 01:35:09.780 |
of what is the process of designing these LLMs, 01:35:33.020 |
and I've made that point multiple times in various forums. 01:35:43.100 |
People can complain that AI systems are biased, 01:35:57.540 |
and that is potentially offensive to some people, 01:36:15.400 |
because of historical incorrectness and things like that. 01:36:27.380 |
is it possible to produce an AI system that is not biased? 01:36:33.400 |
And it's not because of technological challenges, 01:36:37.600 |
although there are technological challenges to that. 01:36:41.360 |
It's because bias is in the eye of the beholder. 01:36:48.800 |
about what constitutes bias for a lot of things. 01:36:53.580 |
I mean, there are facts that are indisputable, 01:37:12.640 |
And the answer is the same answer that we found 01:37:28.160 |
It's because we don't want all of our information 01:37:36.680 |
'cause that's opposite to the whole idea of democracy 01:37:45.040 |
In science, people have to argue for different opinions 01:37:48.160 |
and science makes progress when people disagree 01:37:51.400 |
and they come up with an answer and a consensus forms. 01:37:54.600 |
And it's true in all democracies around the world. 01:37:57.720 |
So there is a future which is already happening 01:38:14.820 |
You can already buy them from Meta, the Ray-Ban Meta, 01:38:18.120 |
where you can talk to them and they are connected 01:38:28.840 |
and there is a camera in the system, in the glasses, so that 01:38:32.440 |
you can ask it, what can you tell me about this building 01:38:36.520 |
You can be looking at a menu in a foreign language 01:38:44.800 |
So a lot of our interactions with the digital world 01:38:48.280 |
are gonna be mediated by those systems in the near future. 01:38:51.120 |
Increasingly, the search engines that we're gonna use 01:39:01.440 |
that we just ask a question and it will answer 01:39:05.160 |
and then point you to perhaps appropriate reference for it. 01:39:08.880 |
But here is the thing, we cannot afford those systems 01:39:15.320 |
Because those systems will constitute the repository 01:39:29.120 |
For the same reason, the press has to be diverse. 01:39:32.200 |
So how do we get a diverse set of AI assistance? 01:39:35.520 |
It's very expensive and difficult to train a base model, 01:39:42.240 |
You know, in the future, it might be something different, 01:39:46.040 |
So only a few companies can do this properly. 01:39:49.560 |
And if some of those top systems are open source, 01:40:05.560 |
whether they are individual citizens, groups of citizens, 01:40:11.400 |
government organizations, NGOs, companies, whatever, 01:40:17.320 |
to take those open source systems, AI systems, 01:40:23.920 |
and fine-tune them for their own purpose on their own data, 01:40:34.640 |
So I tell you, I talk to the French government quite a bit, 01:40:51.200 |
regardless of how well-intentioned those companies are, right? 01:41:05.400 |
I was talking with the founder of Infosys in India. 01:41:19.920 |
so that Llama 2 speaks all 22 official languages in India. 01:41:28.240 |
Moustapha Cissé, who used to be a scientist at FAIR, 01:41:37.960 |
And what he's trying to do is basically have LLM 01:41:42.880 |
so that people can have access to medical information, 01:41:47.560 |
There's a very small number of doctors per capita in Senegal. 01:41:59.160 |
you can have AI systems that are not only diverse 01:42:01.760 |
in terms of political opinions or things of that type, 01:42:18.920 |
And you can have an industry, an ecosystem of companies 01:42:24.560 |
for vertical applications in industry, right? 01:42:27.200 |
You have, I don't know, a publisher has thousands of books, 01:42:30.240 |
and they want to build a system that allows the customer 01:42:37.640 |
You need to train on their proprietary data, right? 01:42:43.520 |
it's called MetaMate, and it's basically an LLM 01:42:46.640 |
that can answer any question about internal stuff 01:42:55.240 |
A lot of companies want this not just for their employees, 01:42:57.880 |
but also for their customers, to take care of their customers. 01:43:00.760 |
So the only way you're gonna have an AI industry, 01:43:10.280 |
on top of which any group can build specialized systems. 01:43:15.280 |
So the inevitable direction of history 01:43:15.280 |
will be built on top of open source platforms. 01:43:30.160 |
So meaning like a company like Meta or Google or so on 01:43:40.560 |
after building the foundation pre-trained model, 01:43:53.620 |
but companies are supposed to make money somehow, 01:43:56.240 |
and open source is like giving away, I don't know, 01:44:01.040 |
Mark made a video, Mark Zuckerberg, very sexy video, 01:44:11.120 |
The math of that is just for the GPUs, that's 100 billion, 01:44:17.680 |
plus the infrastructure for training everything. 01:44:22.360 |
So I'm no business guy, but how do you make money on that? 01:44:27.360 |
So the vision you paint is a really powerful one, 01:44:32.560 |
- Okay, so you have several business models, right? 01:44:44.160 |
and the financing of that service is either through ads 01:45:00.500 |
by talking to the customers through WhatsApp, 01:45:08.700 |
what topping do you want or what size, blah, blah, blah. 01:45:11.340 |
The business will pay for that, okay, that's a model. 01:45:21.760 |
that is on the more kind of classical services, 01:45:24.360 |
it can be ad supported or there's several models. 01:45:31.600 |
potential customer base and you need to build that system 01:45:48.060 |
then other people can do the same kind of task 01:45:52.920 |
basically provide fine tuned models for businesses. 01:46:08.440 |
we already have a huge user base and customer base, right? 01:46:17.540 |
and there is a way to derive revenue from this. 01:46:22.280 |
And it doesn't hurt that we provide that system 01:46:32.640 |
for others to build applications on top of it too. 01:46:37.400 |
for our customers, we can just buy it from them. 01:46:39.840 |
It could be that they will improve the platform. 01:46:46.400 |
I mean, there are literally millions of downloads 01:46:46.400 |
of Llama 2 and thousands of people who have provided ideas 01:46:49.000 |
to make the system available to sort of a wide community 01:47:04.320 |
of people and there's literally thousands of businesses 01:47:09.640 |
So our ability to, Meta's ability to derive revenue 01:47:20.480 |
by the distribution of base models in open source. 01:47:26.640 |
- The fundamental criticism that Gemini is getting 01:47:28.680 |
is that, as you pointed out on the West Coast, 01:47:31.040 |
just to clarify, we're currently on the East Coast 01:47:34.680 |
where I would suppose Meta AI headquarters would be. 01:47:37.840 |
So there are strong words about the West Coast, 01:47:47.000 |
I think it's fair to say that most tech people 01:47:49.960 |
have a political affiliation with the left wing. 01:47:55.320 |
And so the problem that people are criticizing Gemini with 01:48:01.160 |
that you mentioned, that their ideological lean 01:48:17.160 |
Have you witnessed this kind of ideological lean 01:48:24.240 |
I don't think the issue has to do with the political leaning 01:48:29.300 |
It has to do with the acceptability or political leanings 01:48:38.280 |
So a big company cannot afford to offend too many people. 01:48:58.020 |
it's impossible to do it properly for everyone. 01:49:11.560 |
one set of people are going to see it as biased, 01:49:15.700 |
and another set of people is going to see it as biased. 01:49:18.600 |
And then in addition to this, there's the issue of, 01:49:25.840 |
You're going to have black Nazi soldiers in the-- 01:49:30.840 |
- Yeah, so we should mention image generation 01:49:33.640 |
of black Nazi soldiers, which is not factually accurate. 01:49:38.960 |
- Right, and can be offensive for some people as well, right? 01:49:42.180 |
So it's going to be impossible to kind of produce systems 01:49:49.080 |
So the only solution that I see is diversity. 01:49:52.200 |
- And diversity is the full meaning of that word, 01:50:06.040 |
The conclusion is only startups and open source 01:50:08.640 |
can avoid the issue that he's highlighting with big tech. 01:50:17.480 |
One, ever escalating demands from internal activists, 01:50:20.760 |
employee mobs, crazed executives, broken boards, 01:50:24.440 |
pressure groups, extremist regulators, government agencies, 01:50:27.240 |
the press, in quotes, experts, and everything, 01:50:34.240 |
Two, constant risk of generating a bad answer 01:50:37.360 |
or drawing a bad picture or rendering a bad video. 01:50:40.600 |
Who knows what it is going to say or do at any moment? 01:50:44.440 |
Three, legal exposure, product liability, slander, 01:50:59.700 |
like how good it actually is in terms of usable 01:51:03.600 |
and pleasant to use and effective and all that kind of stuff 01:51:06.920 |
and five, publicity of bad text, images, video, 01:51:10.440 |
actually puts those examples into the training data 01:51:24.440 |
So if you're going to do the fine tuning yourself 01:51:29.080 |
and keep a closed source, essentially the problem there 01:51:33.200 |
is then trying to minimize the number of people 01:51:37.280 |
And you're saying that's almost impossible to do right 01:51:46.800 |
Marc is right about a number of things that he lists 01:51:55.300 |
Certainly congressional investigations is one of them. 01:52:00.400 |
Legal liability, making things that get people 01:52:15.120 |
because they don't want to hurt anyone, first of all, 01:52:21.280 |
and then second, they want to preserve their business. 01:52:23.200 |
So it's essentially impossible for systems like this 01:52:26.920 |
that can inevitably formulate political opinions 01:52:34.040 |
but that people may disagree about moral issues 01:52:38.360 |
and questions about religion and things like that, right, 01:52:43.360 |
or cultural issues that people from different communities 01:52:50.120 |
So there's only kind of a relatively small number of things 01:52:52.560 |
that people will sort of agree on, basic principles. 01:52:57.560 |
But beyond that, if you want those systems to be useful, 01:53:15.480 |
- That's right, open source enables diversity. 01:53:18.200 |
- This can be a fascinating world where if it's true 01:53:21.560 |
that the open source world, if meta leads the way 01:53:28.560 |
like governments will have a fine tune model. 01:53:31.520 |
And then potentially people that vote left and right 01:53:44.400 |
but that's on us humans, we get to figure out. 01:53:48.280 |
Basically the technology enables humans to human 01:53:52.000 |
more effectively and all the difficult ethical questions 01:53:56.160 |
that humans raise will just leave it up to us 01:54:02.640 |
- Yeah, I mean, there are some limits to what, 01:54:04.760 |
the same way there are limits to free speech, 01:54:06.480 |
there has to be some limit to the kind of stuff 01:54:08.880 |
that those systems might be authorized to produce, 01:54:16.440 |
So, I mean, that's one thing I've been interested in, 01:54:18.280 |
which is in the type of architecture that we were discussing 01:54:21.800 |
before, where the output of a system is a result 01:54:31.960 |
And we can put guardrails in open source systems. 01:54:37.400 |
I mean, if we eventually have systems that are built 01:54:39.760 |
with this blueprint, we can put guardrails in those systems 01:54:44.200 |
that guarantee that there is sort of a minimum set 01:54:47.640 |
of guardrails that make the system non-dangerous 01:54:53.680 |
And then the fine tuning that people will add 01:54:58.200 |
or the additional guardrails that people will add 01:55:00.400 |
will kind of cater to their community, whatever it is. 01:55:04.960 |
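For the objective-driven architecture being referred to, where the output is produced by minimizing objectives at inference time, guardrails can in principle just be extra cost terms added to the task objective: a baked-in minimum set plus whatever a community layers on top. The toy sketch below invents all the specifics (the vector-valued "answer", the norm-based guardrail, the weights) purely to show the mechanism.

```python
import torch

def task_cost(answer, goal):
    # Stand-in for "how well does this output satisfy the request".
    return ((answer - goal) ** 2).sum()

def base_guardrail(answer):
    # Stand-in for a non-negotiable safety cost (e.g. a learned harmfulness score).
    return torch.relu(answer.norm() - 1.0)

def community_guardrail(answer):
    # Stand-in for an additional, community-specific constraint added by fine-tuners.
    return torch.relu(answer[0] - 0.5)

goal = torch.ones(16)
answer = torch.zeros(16, requires_grad=True)
opt = torch.optim.SGD([answer], lr=0.1)

# Inference-as-optimization: the produced answer minimizes
# task objective + weighted guardrail objectives.
for _ in range(200):
    opt.zero_grad()
    loss = (task_cost(answer, goal)
            + 10.0 * base_guardrail(answer)
            + 10.0 * community_guardrail(answer))
    loss.backward()
    opt.step()

print(answer.norm())  # settles near the guardrail boundary instead of the unconstrained goal
```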
- And yeah, the fine tuning will be more about 01:55:07.240 |
the gray areas of what is hate speech, what is dangerous 01:55:14.560 |
I mean, like, but still, even with the objectives 01:55:20.800 |
or at least there's a paper where a collection 01:55:32.360 |
does the LLM make it any easier than a search would, 01:55:39.480 |
- Right, so the increasing number of studies on this 01:55:44.480 |
seems to point to the fact that it doesn't help. 01:55:57.280 |
if you already have access to a search engine 01:56:01.040 |
And so the sort of increased information you get 01:56:04.920 |
or the ease with which you get it doesn't really help you. 01:56:09.040 |
The second thing is it's one thing to have a list 01:56:12.080 |
of instructions of how to make a chemical weapon, 01:56:34.280 |
So it's too dangerous actually to kind of ever use. 01:56:39.280 |
And it's in fact banned by international treaties. 01:56:58.440 |
I can give you a very precise list of instructions 01:57:08.280 |
you're still gonna have to blow up a dozen of them 01:57:25.240 |
- And it requires even the common sense expertise 01:57:29.080 |
which is how to take language-based instructions 01:57:42.400 |
A lot of biologists have posted on this, actually, 01:57:46.400 |
do you realize how hard it is to actually do the lab work? 01:57:51.840 |
Yeah, and that's where Hans Moravec comes to light once again. 01:57:51.840 |
Mark announced that Llama 3 is coming out eventually. 01:57:59.360 |
First of all, Llama 2 that's already out there, 01:58:06.920 |
just the future of the open source under Meta? 01:58:12.760 |
So there's gonna be like various versions of Llama 01:58:18.080 |
bigger, better, multimodal, things like that. 01:58:36.880 |
Maybe they're trained from video, so they have some world model. 01:58:36.880 |
Maybe, you know, capable of the type of reasoning 01:58:45.360 |
Like, when is the research that is going in that direction 01:59:07.040 |
So, you know, last week we published the V-JEPA work, 01:59:07.040 |
And then the next step is gonna be world models 01:59:18.960 |
based on kind of this type of idea, training from video. 01:59:26.120 |
and work taking place elsewhere, and also at UC Berkeley 01:59:38.480 |
My bet is that those systems are gonna be JEPA-like. 01:59:54.720 |
called Danijar Hafner, who is now at DeepMind, 01:59:58.720 |
that learn representations, and then use them for planning 02:00:04.160 |
And a lot of work at Berkeley by Pieter Abbeel, 02:00:09.560 |
Sergey Levine, a bunch of other people of that type. 02:00:12.400 |
I'm collaborating with actually in the context 02:00:34.200 |
I haven't been that excited about the direction 02:00:36.720 |
of machine learning and AI since 10 years ago 02:00:42.280 |
And before that, 30 years ago, we were working on, 02:01:05.760 |
There is some set of ideas to make progress there 02:01:14.600 |
What I like is that somewhat we get onto a good direction 02:01:19.600 |
and perhaps succeed before my brain turns to a white sauce 02:01:32.380 |
is it beautiful to you just the amount of GPUs involved, 02:01:38.000 |
sort of the whole training process on this much compute? 02:01:45.320 |
and humans together have built these computing devices 02:02:04.320 |
There's just the details of how to train on that, 02:02:07.680 |
how to build the infrastructure and the hardware, 02:02:19.600 |
- Well, I used to be a hardware guy many years ago. 02:02:23.080 |
Hardware has improved a little bit, changed a little bit. 02:02:27.800 |
- I mean, certainly scale is necessary, but not sufficient. 02:02:34.600 |
I mean, we're still far in terms of compute power 02:02:37.000 |
from what we would need to match the compute power 02:02:42.880 |
This may occur in the next couple of decades, 02:02:51.920 |
So there's a lot of progress to make in hardware. 02:03:00.240 |
I mean, there's a bit coming from silicon technology, 02:03:03.000 |
but a lot of it coming from architectural innovation 02:03:06.440 |
and quite a bit coming from more efficient ways 02:03:10.200 |
of implementing the architectures that have become popular, 02:03:13.640 |
basically a combination of Transformers and ConvNets, right? 02:03:27.280 |
We're gonna have to come up with like new principles, 02:03:30.200 |
new fabrication technology, new basic components, 02:03:34.560 |
perhaps based on sort of different principles 02:03:52.920 |
- Well, if we wanna make it ubiquitous, yeah, certainly. 02:03:56.640 |
'Cause we're gonna have to reduce the power consumption. 02:04:01.640 |
A GPU today, right, is half a kilowatt to a kilowatt. 02:04:08.640 |
And the GPU is way below the compute power of the human brain. 02:04:13.100 |
You need something like 100,000 or a million to match it. 02:04:28.560 |
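Rough arithmetic behind that point, taking roughly 700 W per GPU (the middle of the half-a-kilowatt-to-a-kilowatt range just mentioned), a commonly cited ~20 W for the human brain, and the low end of the 100,000-to-a-million GPU estimate; the brain figure is an outside approximation, not something stated in the conversation.

```python
brain_watts = 20        # commonly cited estimate for the human brain's power draw
gpu_watts = 700         # middle of the "half a kilowatt to a kilowatt" range above
gpus_needed = 100_000   # low end of the "100,000 or a million" estimate above

cluster_watts = gpu_watts * gpus_needed
print(f"Cluster power: {cluster_watts / 1e6:.0f} MW vs. brain: {brain_watts} W")
print(f"Power-efficiency gap: ~{cluster_watts / brain_watts:,.0f}x")
```

That multi-million-fold gap is what new hardware principles would have to close to make this ubiquitous.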
not the next few years, potentially farther away. 02:04:35.720 |
- So first of all, it's not going to be an event, right? 02:04:43.140 |
that somehow somebody is gonna discover the secret, 02:04:52.400 |
And then turn on a machine and then we have AGI. 02:05:02.640 |
Are we gonna have systems that can learn from video 02:05:07.000 |
how the world works and learn good representations? 02:05:09.440 |
Yeah, before we get them to the scale and performance 02:05:13.060 |
that we observe in humans, it's gonna take quite a while. 02:05:17.240 |
Are we gonna get systems that can have large amount 02:05:23.320 |
of associative memories so they can remember stuff? 02:05:26.660 |
Yeah, but same, it's not gonna happen tomorrow. 02:05:31.460 |
We have a lot of them, but to get this to work together 02:05:37.040 |
Are we gonna have systems that can reason and plan, 02:05:45.000 |
Yeah, but before we get this to work properly, 02:05:49.320 |
And before we get all those things to work together, 02:05:51.280 |
and then on top of this, have systems that can learn 02:05:54.020 |
like hierarchical planning, hierarchical representations, 02:06:07.860 |
probably much more, because there are a lot of problems 02:06:11.060 |
that we're not seeing right now that we have not encountered, 02:06:15.300 |
and so we don't know if there is an easy solution 02:06:18.600 |
So, you know, it's not just around the corner. 02:06:23.380 |
I mean, I've been hearing people for the last 12, 15 years 02:06:27.580 |
claiming that, you know, AGI is just around the corner 02:06:32.620 |
And I knew they were wrong when they were saying it. 02:06:39.740 |
from the birth of the term artificial intelligence, 02:06:58.780 |
Moravec's paradox is a consequence of realizing 02:07:03.780 |
So first of all, intelligence is not a linear thing 02:07:08.620 |
you can measure with a scalar, with a single number. 02:07:11.500 |
You know, can you say that humans are smarter 02:07:20.220 |
But in some ways, orangutans are smarter than humans 02:07:23.820 |
that allows them to survive in the forest, for example. 02:07:26.820 |
- So IQ is a very limited measure of intelligence. 02:07:38.780 |
But because humans kind of come in relatively 02:07:53.780 |
that may be relevant for some tasks, but not others. 02:07:56.620 |
But then if you're talking about other intelligent entities 02:08:02.540 |
for which the basic things that are easy to them 02:08:07.140 |
is very different, then it doesn't mean anything. 02:08:15.780 |
and an ability to acquire new skills efficiently, right? 02:08:22.900 |
And the collection of skills that any particular 02:08:27.540 |
intelligent entity possesses or is capable of learning quickly 02:08:31.700 |
is different from the collection of skills of another one. 02:08:42.860 |
as to whether one is more intelligent than the other. 02:08:46.900 |
- So you push back against what are called AI doomers a lot. 02:09:02.180 |
of catastrophe scenarios of how AI could escape or control 02:09:11.220 |
And that relies on a whole bunch of assumptions 02:09:15.540 |
So the first assumption is that the emergence 02:09:21.820 |
that at some point we're going to figure out the secret 02:09:25.100 |
and we'll turn on a machine that is super intelligent. 02:09:30.500 |
it's going to take over the world and kill us all. 02:09:35.900 |
We're going to have systems that are like as smart as a cat, 02:09:39.700 |
have all the characteristics of human level intelligence, 02:09:44.700 |
but their level of intelligence would be like a cat 02:09:53.900 |
to kind of make those things more intelligent. 02:09:56.780 |
we're also going to put some guard rails in them 02:09:58.580 |
and learn how to kind of put some guard rails 02:10:01.740 |
And we're not going to do this with just one, 02:10:04.820 |
that it's going to be lots of different people doing this. 02:10:09.260 |
at making intelligent systems that are controllable and safe 02:10:15.980 |
then we can use the good ones to go against the rogue ones. 02:10:25.500 |
So it's not going to be like we're going to be exposed 02:10:27.700 |
to like a single rogue AI that's going to kill us all. 02:10:33.300 |
which is the fact that because the system is intelligent, 02:10:53.460 |
it seems to be that the more intelligent species 02:10:55.580 |
are the ones that end up dominating the other. 02:10:58.180 |
And even extinguishing the others sometimes by design, 02:11:06.780 |
And so there is sort of a thinking by which you say, 02:11:12.940 |
well, if AI systems are more intelligent than us, 02:11:19.660 |
if not by design, simply because they don't care about us. 02:11:23.180 |
And that's just preposterous for a number of reasons. 02:11:27.780 |
First reason is they're not going to be a species. 02:11:30.340 |
They're not going to be a species that competes with us. 02:11:33.220 |
They're not going to have the desire to dominate 02:11:37.220 |
that has to be hardwired into an intelligent system. 02:11:43.580 |
It is hardwired in baboons, in chimpanzees, in wolves, 02:11:49.980 |
The species in which this desire to dominate or submit 02:12:03.300 |
Non-social species like orangutans don't have it, right? 02:12:06.740 |
And they are as smart as we are almost, right? 02:12:09.500 |
- And to you, there's not significant incentive 02:12:12.140 |
for humans to encode that into the AI systems. 02:12:15.180 |
And to the degree they do, there'll be other AIs 02:12:24.380 |
to make AI systems submissive to humans, right? 02:12:27.660 |
I mean, this is the way we're going to build them, right? 02:12:30.300 |
And so then people say, "Oh, but look at LLMs. 02:12:33.980 |
And they're right, LLMs are not controllable. 02:12:36.780 |
But objective-driven AI, so systems that derive their answers 02:12:57.140 |
- I've heard that before somewhere, I don't remember. 02:13:05.660 |
could there be unintended consequences also from all of this? 02:13:35.180 |
that was unexpected because the guardrail wasn't right, 02:13:38.460 |
and we're going to correct them so that they do it right. 02:13:41.180 |
The idea somehow that we can't get it slightly wrong 02:13:52.980 |
the analogy I've used many times is turbojet design. 02:14:07.180 |
I mean, those are incredibly complex pieces of hardware. 02:14:21.020 |
with a two-engine jetliner at near the speed of sound. 02:14:34.540 |
like a general principle of how to make turbojets safe? 02:14:43.380 |
Is there a separate group within General Electric 02:14:58.980 |
because a better turbojet is also a safer turbojet. 02:15:04.780 |
Like, do you need specific provisions to make AI safe? 02:15:10.540 |
and they will be safe because they are designed 02:15:40.820 |
And you could see governments using that as a weapon. 02:15:43.540 |
So do you think if you imagine such a system, 02:15:47.540 |
there's any parallel to something like nuclear weapons? 02:15:58.740 |
So you're saying there's going to be gradual development. 02:16:01.860 |
There's going to be, I mean, it might be rapid, 02:16:09.060 |
- So that AI system designed by Vladimir Putin or whatever, 02:16:13.140 |
or his minions is going to be like trying to talk 02:16:40.980 |
They're going to be talking to your AI assistant, 02:16:43.420 |
which is going to be as smart as theirs, right? 02:16:58.780 |
Like, is this thing like telling me the truth? 02:17:00.740 |
Like, it's not even going to be able to get to you 02:17:03.300 |
because it's only going to talk to your AI assistant. 02:17:10.740 |
You're not even seeing the email, the spam email, right? 02:17:13.940 |
It's automatically put in a folder that you never see. 02:17:18.340 |
That AI system that tries to convince you of something 02:17:23.260 |
which is going to be at least as smart as it. 02:17:25.500 |
And it's going to say, this is spam, you know, 02:17:28.580 |
it's not even going to bring it to your attention. 02:17:32.220 |
- So to you, it's very difficult for any one AI system 02:17:37.500 |
to where it can convince even the other AI systems. 02:17:40.100 |
So like, there's always going to be this kind of race 02:17:59.420 |
but this is why nuclear weapons are so interesting 02:18:07.380 |
That, you know, you could imagine Hitler, Stalin, 02:18:17.620 |
and that having a different kind of impact on the world 02:18:20.620 |
than the United States getting the weapon first. 02:18:32.200 |
and then Manhattan Project-like effort for AI. 02:18:35.780 |
- No, as I said, it's not going to be an event. 02:18:39.180 |
It's going to be, you know, continuous progress. 02:18:42.020 |
And whenever, you know, one breakthrough occurs, 02:18:46.200 |
it's going to be widely disseminated really quickly, 02:18:51.040 |
I mean, this is not a domain where, you know, 02:19:02.300 |
and this kind of information disseminates extremely quickly. 02:19:05.460 |
We've seen this over the last few years, right? 02:19:08.100 |
Where you have a new, like, you know, even take AlphaGo, 02:19:13.980 |
even without like particularly detailed information, right? 02:19:17.980 |
- Yeah, this is an industry that's not good at secrecy. 02:19:22.920 |
just the fact that you know that something is possible 02:19:25.920 |
makes you like realize that it's worth investing the time 02:19:35.220 |
And, you know, same for, you know, all the innovations 02:19:41.480 |
of, you know, self-supervised learning, transformers, 02:19:47.520 |
you don't need to know exactly the details of how they work 02:19:52.760 |
because it's deployed and then it's getting reproduced. 02:19:54.720 |
And then, you know, people who work for those companies move. 02:19:59.720 |
They go from one company to another and, you know, 02:20:05.120 |
What makes the success of the US tech industry 02:20:09.760 |
and Silicon Valley in particular is exactly that, 02:20:11.760 |
is because information circulates really, really quickly 02:20:17.480 |
And so, you know, the whole region sort of is ahead 02:20:24.600 |
- So maybe I, just to linger on the psychology of AI doomers, 02:20:36.860 |
You say, engineer says, "I invented this new thing. 02:20:48.960 |
like misinformation, propaganda, hate speech, ban it now." 02:20:52.720 |
Then writing doomers come in, akin to the AI doomers. 02:21:04.180 |
to write hate speech, regulate ball pens now. 02:21:18.460 |
Government should require a license for a pen manufacturer." 02:21:21.740 |
I mean, this does seem to be part of human psychology 02:21:32.280 |
So what deep insights can you speak to about this? 02:21:37.280 |
- Well, there is a natural fear of new technology 02:21:53.700 |
by major transformations that are either cultural phenomena 02:22:01.000 |
And they fear for their culture, they fear for their job, 02:22:20.380 |
like any technological revolution or cultural phenomenon 02:22:24.060 |
was always accompanied by, you know, groups or reaction 02:22:29.060 |
in the media that basically attributed all the problems, 02:22:40.660 |
Electricity was going to kill everyone at some point. 02:22:44.400 |
You know, the train was going to be a horrible thing 02:22:59.420 |
of all the horrible things people imagined would arrive 02:23:10.840 |
You know, it's just wonderful examples of, you know, 02:23:15.840 |
jazz or comic books being blamed for unemployment 02:23:22.400 |
or, you know, young people not wanting to work anymore 02:23:38.520 |
The question is, you know, do we embrace change 02:23:53.800 |
I think one thing they worry about with big tech, 02:23:55.880 |
something we've been talking about over and over, 02:24:17.560 |
a huge amount of money and control this technology 02:24:29.080 |
- Well, that's exactly why we need open source platforms. 02:24:31.920 |
- Yeah, I just wanted to nail the point home more and more. 02:24:40.600 |
like I said, you do get a little bit flavorful 02:24:46.760 |
Joscha Bach tweeted something that you LOL'd at 02:24:46.760 |
"but whether the pod bay doors should be opened or closed 02:25:06.940 |
You know, this is something that really worries me 02:25:12.000 |
that AI, our AI overlords will speak down to us 02:25:20.420 |
and you sort of resist that with your way of being. 02:25:29.560 |
how you can avoid the over-fearing, I suppose, 02:25:47.760 |
a widely diverse set of people to build AI assistants 02:25:52.760 |
that represent the diversity of cultures, opinions, 02:25:57.320 |
languages, and value systems across the world 02:26:00.000 |
so that you're not bound to just be brainwashed 02:26:05.000 |
by a particular way of thinking because of a single AI entity. 02:26:10.000 |
So, I mean, I think it's a really, really important question 02:26:13.960 |
for society and the problem I'm seeing is that, 02:26:29.480 |
- Is because I see the danger of this concentration of power 02:26:36.400 |
as a much bigger danger than everything else. 02:26:39.900 |
That if we really want diversity of opinion, AI systems 02:26:44.900 |
that in the future where we'll all be interacting 02:26:51.080 |
through AI systems, we need those to be diverse 02:26:58.400 |
and creeds and political opinions and whatever 02:27:07.840 |
And what works against this is people who think that 02:27:12.840 |
for reasons of security, we should keep AI systems 02:27:17.920 |
under lock and key because it's too dangerous 02:27:24.200 |
Because it could be used by terrorists or something. 02:27:26.720 |
That would lead to potentially a very bad future 02:27:33.800 |
in which all of our information diet is controlled 02:27:39.060 |
by a small number of companies through proprietary systems. 02:27:47.640 |
to build systems that are on the whole good for humanity? 02:27:53.280 |
Isn't that what democracy and free speech is all about? 02:27:57.480 |
- Do you trust institutions to do the right thing? 02:28:03.160 |
And yeah, there's bad people who are gonna do bad things 02:28:05.400 |
but they're not going to have superior technology 02:28:08.620 |
So then it's gonna be my good AI against your bad AI. 02:28:12.600 |
I mean, it's the examples that we were just talking about 02:28:16.380 |
of maybe some rogue country will build some AI system 02:28:31.880 |
But then they will have to go past our AI systems. 02:28:40.440 |
- And doesn't put any articles in their sentences. 02:28:43.260 |
- Well, it'll be at the very least absurdly comedic. 02:28:49.300 |
- Okay, so since we talked about the physical reality, 02:28:54.300 |
I'd love to ask your vision of the future with robots 02:29:03.240 |
you've been speaking about would empower robots 02:29:06.720 |
to be more effective collaborators with us humans. 02:29:10.480 |
So since Tesla's Optimus team has been showing us 02:29:17.160 |
I think it really reinvigorated the whole industry 02:29:20.560 |
that I think Boston Dynamics has been leading 02:29:30.080 |
- Unitree, but there's like a lot of them. 02:29:58.700 |
other than for like kind of pre-programmed behavior 02:30:02.660 |
And the main issue is, again, the Moravec paradox, 02:30:10.420 |
how the world works and kind of plan actions? 02:30:13.200 |
And so we can do it for really specialized tasks. 02:30:16.620 |
And the way Boston Dynamics goes about it is basically 02:30:30.780 |
with a lot of innovation, a little bit of perception. 02:30:35.820 |
like they can't build a domestic robot, right? 02:30:43.820 |
from completely autonomous level five driving. 02:31:13.060 |
we're not gonna have significant progress in robotics. 02:31:16.940 |
So a lot of the people working on robotic hardware 02:31:20.560 |
at the moment are betting or banking on the fact 02:31:24.300 |
that AI is gonna make sufficient progress towards that. 02:31:28.060 |
- And they're hoping to discover a product in it too. 02:31:45.720 |
So there's the factory setting where humanoid robots 02:31:48.300 |
can help automate some aspects of the factory. 02:31:53.340 |
'cause of all the safety required and all this kind of stuff. 02:32:00.420 |
I think you mentioned loading the dishwasher, right? 02:32:07.620 |
- I mean, there's cleaning up, cleaning the house, 02:32:12.620 |
clearing up the table after a meal, washing the dishes, 02:32:21.600 |
I mean, all the tasks that in principle could be automated, 02:32:37.280 |
- Well, navigation in a way that's compelling 02:32:46.600 |
'cause there is a so-called embodied AI group at FAIR. 02:32:51.600 |
And they've been not building their own robots, 02:32:57.200 |
And you can tell a robot dog go to the fridge 02:33:03.660 |
and they can probably pick up a can in the fridge 02:33:12.640 |
as long as it's been trained to recognize them, 02:33:14.820 |
which vision systems work pretty well nowadays. 02:33:22.420 |
that would be sophisticated enough to do things 02:33:35.080 |
robots in general, in the whole, more and more. 02:33:36.740 |
Because that gets humans to really directly interact 02:33:42.120 |
And in so doing, it allows us to philosophically, 02:33:45.260 |
psychologically explore our relationships with robots. 02:33:48.100 |
It can be really, really, really interesting. 02:33:50.760 |
So I hope you make progress on the whole JEPA thing soon. 02:33:54.340 |
- Well, I mean, I hope things work as planned. 02:33:58.640 |
I mean, again, we've been working on this idea 02:34:03.180 |
of self-supervised learning from video for 10 years. 02:34:07.120 |
And only made significant progress in the last two or three. 02:34:12.080 |
- And actually, you've mentioned that there's a lot 02:34:20.480 |
and this kind of stuff, there's a lot of possibilities still 02:34:28.040 |
that's looking to go to grad school and do a PhD? 02:34:35.600 |
this idea of how do you train a world model by observation. 02:34:48.800 |
to have emergent properties like we have with LLMs. 02:34:53.080 |
that can be done without necessarily scaling up. 02:35:03.760 |
but it's the world of, let's say, the internet 02:35:11.540 |
consists in doing a search in a search engine 02:35:14.060 |
or interrogating a database or running a simulation 02:35:24.500 |
a sequence of actions to give the solution to a problem? 02:35:32.200 |
is not just a question of planning physical actions. 02:35:38.960 |
for a dialog system or for any kind of intelligent system. 02:36:00.760 |
Then there is the question of hierarchical planning. 02:36:03.580 |
So the example I mentioned of planning a trip 02:36:13.780 |
involves hierarchical planning in some sense. 02:36:17.460 |
And we really have absolutely no idea how to do this. 02:36:20.640 |
Like there's zero demonstration of hierarchical planning 02:36:25.640 |
in AI where the various levels of representations 02:36:36.440 |
We can do like two-level hierarchical planning 02:36:41.100 |
So for example, you have like a dog-like robot, right? 02:36:44.840 |
You want it to go from the living room to the kitchen. 02:36:48.300 |
You can plan a path that avoids the obstacle. 02:36:51.260 |
And then you can send this to a lower level planner 02:37:03.900 |
We specify what the proper levels of abstraction, 02:37:09.820 |
the representation at each level of abstraction have to be. 02:37:14.100 |
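A toy version of that hand-designed two-level scheme: a high-level planner searches a coarse grid for waypoints, and a separate low-level controller (just a stub here) would turn each waypoint into leg motions. The grid, the map, and the function names are all invented for illustration.

```python
from collections import deque

# Coarse, hand-specified abstraction of the apartment: 0 = free cell, 1 = obstacle.
GRID = [
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]

def high_level_plan(start, goal):
    """Breadth-first search over grid cells; returns a list of waypoints."""
    queue, parents = deque([start]), {start: None}
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0]) \
                    and GRID[nr][nc] == 0 and nxt not in parents:
                parents[nxt] = cell
                queue.append(nxt)
    return None

def low_level_controller(waypoint):
    # Stand-in for the planner that turns one waypoint into actual leg motions.
    print(f"stepping toward {waypoint}")

for waypoint in high_level_plan(start=(0, 0), goal=(2, 4)):  # living room -> kitchen
    low_level_controller(waypoint)
```

Both levels of abstraction here are fixed by hand, which is exactly the limitation being raised: nothing in this sketch learns what the right intermediate representation should be.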
How do you learn that hierarchical representation 02:37:19.800 |
We, you know, with ConvNets and deep learning, 02:37:22.280 |
we can train the system to learn hierarchical representations 02:37:26.300 |
What is the equivalent when what you're trying 02:37:32.140 |
So you want basically a robot dog or humanoid robot 02:37:43.080 |
It might have some trouble at the TSA, but yeah. 02:37:47.420 |
- No, but even doing something fairly simple, 02:37:49.100 |
like a household task, like, you know, cooking or something. 02:37:57.140 |
We take, and once again, we take it for granted. 02:37:59.540 |
What hope do you have for the future of humanity? 02:38:05.120 |
We're talking about so many exciting technologies, 02:38:15.100 |
If you look at social media, there's a lot of, 02:38:17.140 |
there's wars going on, there's division, there's hatred, 02:38:40.340 |
I mean, AI basically will amplify human intelligence. 02:38:55.860 |
perhaps execute a task in ways that are much better 02:39:24.880 |
I certainly have a lot of experience with this, 02:39:29.720 |
of having people working with me who are smarter than me. 02:39:39.960 |
that assist us in all of our tasks, our daily lives, 02:39:45.520 |
I think would be an absolutely wonderful thing. 02:40:07.080 |
I mean, for the same reason that public education 02:40:14.800 |
And the internet is also a good thing intrinsically. 02:40:23.200 |
Because it helps the communication of information 02:40:30.680 |
and knowledge and the transmission of knowledge. 02:40:41.080 |
that perhaps an equivalent event in the history of humanity 02:40:46.080 |
to what might be provided by the generalization of AI assistants 02:40:56.960 |
The fact that people could have access to books. 02:41:01.960 |
Books were a lot cheaper than they were before. 02:41:06.400 |
And so a lot more people had an incentive to learn to read, 02:41:29.360 |
escape from religious doctrine, democracy, science, 02:41:35.840 |
and certainly without this there wouldn't have been 02:41:40.840 |
the American Revolution or the French Revolution. 02:41:43.400 |
And so we'd still be under feudal regimes, perhaps. 02:41:57.680 |
Now, it also created 200 years of revolution. 02:42:03.680 |
It created 200 years of essentially religious conflicts 02:42:07.520 |
in Europe because the first thing that people read 02:42:15.520 |
there was a different interpretation of the Bible 02:42:23.920 |
And in fact, the Catholic Church didn't like the idea 02:42:27.600 |
of the printing press, but they had no choice. 02:42:30.080 |
And so it had some bad effects and some good effects. 02:42:51.720 |
but realized someone else came up with the same idea before me. 02:42:55.640 |
Compare this with what happened in the Ottoman Empire. 02:43:04.000 |
And it didn't ban it for all languages, only for Arabic. 02:43:11.840 |
You could actually print books in Latin or Hebrew 02:43:16.000 |
or whatever in the Ottoman Empire, just not in Arabic. 02:43:25.760 |
just wanted to preserve the control over the population 02:43:29.520 |
and the dogma, religious dogma and everything. 02:43:33.040 |
But after talking with the UAE Minister of AI, 02:43:44.520 |
And the other reason was that it was to preserve 02:43:52.280 |
There's an art form, which is writing those beautiful 02:44:00.320 |
Arabic poems or whatever religious text in this thing. 02:44:04.880 |
And it was a very powerful corporation of scribes, 02:44:07.440 |
basically, that kind of ran a big chunk of the empire 02:44:25.400 |
Who are the people who are asking that AI be regulated 02:44:37.560 |
of technological transformation like AI on the job market 02:44:45.280 |
And there are economists who are much more expert 02:45:02.320 |
Nobody can tell what professions are gonna be hot 10 or 15 years from now. 02:45:09.400 |
The same way if we go back 20 years in the past, 02:45:15.040 |
that like the hottest job even like 10 years ago 02:45:19.040 |
was mobile app developer, like smartphones hadn't even been invented. 02:45:23.400 |
- Most of the jobs of the future might be in the metaverse. 02:45:29.120 |
- But the point is you can't possibly predict. 02:45:31.960 |
But you're right, I mean, you made a lot of strong points 02:45:34.680 |
and I believe that people are fundamentally good. 02:45:56.680 |
because they don't think that people are fundamentally good. 02:46:04.480 |
or they don't trust the institution to do the right thing 02:46:09.480 |
- Well, I think both you and I believe in humanity. 02:46:16.480 |
in saying thank you for pushing the open source movement, 02:46:20.120 |
pushing to making both research in AI open source, 02:46:24.320 |
making it available to people and also the models themselves 02:46:32.280 |
in such colorful, beautiful ways on the internet. 02:46:39.040 |
So yeah, thank you for speaking to me once again. 02:46:45.640 |
- Thanks for listening to this conversation with Yann LeCun. 02:46:49.640 |
please check out our sponsors in the description. 02:46:55.680 |
The only way to discover the limits of the possible 02:47:03.560 |
Thank you for listening and hope to see you next time.