
Yann Lecun: Meta AI, Open Source, Limits of LLMs, AGI & the Future of AI | Lex Fridman Podcast #416


Chapters

0:00 Introduction
2:18 Limits of LLMs
13:54 Bilingualism and thinking
17:46 Video prediction
25:07 JEPA (Joint-Embedding Predictive Architecture)
28:15 JEPA vs LLMs
37:31 DINO and I-JEPA
38:51 V-JEPA
44:22 Hierarchical planning
50:40 Autoregressive LLMs
66:06 AI hallucination
71:30 Reasoning in AI
89:02 Reinforcement learning
94:10 Woke AI
103:48 Open source
107:26 AI and ideology
109:58 Marc Andreessen
117:56 Llama 3
124:20 AGI
128:48 AI doomers
144:38 Joscha Bach
148:51 Humanoid robots
158:00 Hope for the future

Whisper Transcript

00:00:00.000 | I see the danger of this concentration
00:00:02.180 | of power through proprietary AI systems
00:00:06.000 | as a much bigger danger than everything else.
00:00:08.920 | What works against this is people
00:00:11.960 | who think that for reasons of security,
00:00:15.120 | we should keep AI systems under lock and key,
00:00:18.560 | because it's too dangerous to put it in the hands of everybody.
00:00:22.040 | That would lead to a very bad future
00:00:25.360 | in which all of our information diet
00:00:27.640 | is controlled by a small number of companies
00:00:30.800 | through proprietary systems.
00:00:32.360 | I believe that people are fundamentally good.
00:00:34.320 | And so if AI, especially open-source AI,
00:00:38.480 | can make them smarter, it just empowers the goodness
00:00:43.320 | in humans.
00:00:44.240 | So I share that feeling, OK?
00:00:46.740 | I think people are fundamentally good.
00:00:50.280 | And in fact, a lot of doomers are doomers,
00:00:52.480 | because they don't think that people are fundamentally good.
00:00:55.180 | (air whooshing)
00:00:57.680 | - The following is a conversation with Yann LeCun,
00:01:01.060 | his third time on this podcast.
00:01:02.860 | He is the chief AI scientist at Meta,
00:01:05.500 | professor at NYU, Turing Award winner,
00:01:08.780 | and one of the seminal figures
00:01:10.840 | in the history of artificial intelligence.
00:01:13.180 | He and Meta AI have been big proponents
00:01:16.900 | of open-sourcing AI development
00:01:19.540 | and have been walking the walk
00:01:21.380 | by open-sourcing many of their biggest models,
00:01:23.980 | including LLAMA 2 and eventually LLAMA 3.
00:01:28.180 | Also, Yann has been an outspoken critic
00:01:31.900 | of those people in the AI community
00:01:34.380 | who warned about the looming danger
00:01:36.520 | and existential threat of AGI.
00:01:39.660 | He believes the AGI will be created one day,
00:01:43.580 | but it will be good.
00:01:45.500 | It will not escape human control,
00:01:47.660 | nor will it dominate and kill all humans.
00:01:52.160 | At this moment of rapid AI development,
00:01:54.380 | this happens to be somewhat a controversial position.
00:01:58.840 | And so it's been fun seeing Yann get into a lot of intense
00:02:02.620 | and fascinating discussions online
00:02:04.880 | as we do in this very conversation.
00:02:08.660 | This is the Lex Fridman Podcast.
00:02:10.480 | To support it, please check out our sponsors
00:02:12.460 | in the description.
00:02:13.740 | And now, dear friends, here's Yann LeCun.
00:02:18.000 | You've had some strong statements, technical statements,
00:02:22.420 | about the future of artificial intelligence recently,
00:02:25.580 | throughout your career, actually, but recently as well.
00:02:28.320 | You've said that autoregressive LLMs
00:02:31.940 | are not the way we're going to make progress
00:02:36.780 | towards superhuman intelligence.
00:02:38.740 | These are the large language models like GPT-4,
00:02:41.940 | like LLAMA 2 and 3 soon, and so on.
00:02:44.260 | How do they work,
00:02:45.080 | and why are they not going to take us all the way?
00:02:47.740 | - For a number of reasons.
00:02:49.040 | The first is that there is a number of characteristics
00:02:51.820 | of intelligent behavior.
00:02:53.500 | For example, the capacity to understand the world,
00:02:58.820 | understand the physical world,
00:03:00.320 | the ability to remember and retrieve things,
00:03:05.460 | persistent memory, the ability to reason,
00:03:10.340 | and the ability to plan.
00:03:12.360 | Those are four essential characteristics
00:03:14.140 | of intelligent systems or entities, humans, animals.
00:03:19.140 | LLMs can do none of those,
00:03:23.060 | or they can only do them in a very primitive way.
00:03:26.560 | And they don't really understand the physical world.
00:03:29.700 | They don't really have persistent memory.
00:03:31.340 | They can't really reason, and they certainly can't plan.
00:03:34.420 | And so, if you expect the system to become intelligent
00:03:38.860 | just without having the possibility of doing those things,
00:03:43.580 | you're making a mistake.
00:03:44.980 | That is not to say that autoregressive LLMs are not useful.
00:03:50.900 | They're certainly useful.
00:03:52.100 | That's not to say that they're not interesting,
00:03:55.600 | or that we can't build a whole ecosystem
00:03:58.220 | of applications around them.
00:04:00.180 | Of course we can,
00:04:01.020 | but as a path towards human-level intelligence,
00:04:05.980 | they're missing essential components.
00:04:08.700 | And then there is another tidbit or fact
00:04:11.280 | I think is very interesting.
00:04:14.020 | Those LLMs are trained on enormous amounts of texts,
00:04:16.540 | basically the entirety of all publicly available texts
00:04:20.620 | on the internet, right?
00:04:21.520 | That's typically on the order of 10 to the 13 tokens.
00:04:26.520 | Each token is typically two bytes.
00:04:28.220 | So that's two times 10 to the 13 bytes of training data.
00:04:31.980 | It would take you or me 170,000 years
00:04:35.160 | to just read through this at eight hours a day.
00:04:38.680 | So it seems like an enormous amount of knowledge, right?
00:04:41.320 | That those systems can accumulate.
00:04:43.020 | But then you realize it's really not that much data.
00:04:48.300 | If you talk to developmental psychologists
00:04:52.300 | and they tell you a four-year-old has been awake
00:04:54.540 | for 16,000 hours in his whole life,
00:04:57.620 | and the amount of information
00:05:01.420 | that has reached the visual cortex
00:05:05.740 | of that child in four years,
00:05:08.720 | is about 10 to the 15 bytes.
00:05:12.140 | And you can compute this by estimating
00:05:13.940 | that the optic nerve carries
00:05:16.380 | about 20 megabytes per second, roughly.
00:05:19.700 | And so 10 to the 15 bytes for a four-year-old
00:05:22.220 | versus two times 10 to the 13 bytes
00:05:25.460 | for 170,000 years worth of reading.
00:05:28.700 | What that tells you is that through sensory input,
00:05:33.860 | we see a lot more information than we do through language.
00:05:37.640 | And that despite our intuition,
00:05:40.960 | most of what we learn and most of our knowledge
00:05:43.920 | is through our observation and interaction
00:05:47.000 | with the real world, not through language.
00:05:49.520 | Everything that we learn in the first few years of life
00:05:51.720 | and certainly everything that animals learn
00:05:54.920 | has nothing to do with language.
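
For concreteness, here is the rough arithmetic behind that comparison as a small Python sketch. The 2 bytes per token, 16,000 waking hours, and ~20 MB/s optic-nerve figure are the numbers quoted in the conversation; the 0.75 words-per-token and 250 words-per-minute reading speed are illustrative assumptions.

```python
# Back-of-the-envelope check of the text-vs-vision numbers quoted above.

tokens = 1e13                          # ~all publicly available text
text_bytes = tokens * 2                # ~2 bytes per token -> ~2e13 bytes

# Time to read it all at 8 hours/day.
# Assumption: ~0.75 words per token, ~250 words per minute.
words = tokens * 0.75
reading_seconds = words / (250 / 60)
reading_years = reading_seconds / (8 * 3600 * 365)
print(f"text: {text_bytes:.1e} bytes, ~{reading_years:,.0f} years of reading")

# Visual input of a four-year-old: 16,000 waking hours at ~20 MB/s
# through the optic nerve (figures quoted above).
visual_bytes = 16_000 * 3600 * 20e6
print(f"vision: {visual_bytes:.1e} bytes")          # ~1.2e15 bytes
print(f"vision/text ratio: ~{visual_bytes / text_bytes:.0f}x")
```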
00:05:57.100 | - So it'd be good to maybe push against
00:05:59.400 | some of the intuition behind what you're saying.
00:06:01.720 | So it is true there's several orders of magnitude
00:06:05.900 | more data coming into the human mind.
00:06:08.100 | And the human mind
00:06:11.240 | is able to learn very quickly from that,
00:06:12.980 | filter the data very quickly.
00:06:15.200 | You know, somebody might argue
00:06:16.580 | your comparison between sensory data versus language,
00:06:19.800 | that language is already very compressed.
00:06:23.220 | It already contains a lot more information
00:06:25.260 | than the bytes it takes to store them,
00:06:27.240 | if you compare it to visual data.
00:06:29.340 | So there's a lot of wisdom in language, there's words,
00:06:31.800 | and the way we stitch them together,
00:06:33.820 | it already contains a lot of information.
00:06:36.240 | So is it possible that language alone
00:06:40.620 | already has enough wisdom and knowledge in there
00:06:45.620 | to be able to, from that language,
00:06:48.740 | construct a world model, an understanding of the world,
00:06:52.660 | an understanding of the physical world
00:06:54.700 | that you're saying all LLMs lack?
00:06:56.660 | - So it's a big debate among philosophers
00:07:00.060 | and also cognitive scientists,
00:07:01.740 | like whether intelligence needs to be grounded in reality.
00:07:05.160 | I'm clearly in the camp that, yes,
00:07:09.260 | intelligence cannot appear without some grounding
00:07:12.340 | in some reality, it doesn't need to be physical reality,
00:07:16.980 | it could be simulated,
00:07:17.980 | but the environment is just much richer
00:07:20.860 | than what you can express in language.
00:07:22.340 | Language is a very approximate representation of our percepts
00:07:27.340 | and our mental models, right?
00:07:29.500 | I mean, there's a lot of tasks that we accomplish
00:07:32.220 | where we manipulate a mental model
00:07:35.620 | of the situation at hand,
00:07:38.300 | and that has nothing to do with language.
00:07:40.700 | Everything that's physical, mechanical, whatever,
00:07:43.540 | when we build something, when we accomplish a task,
00:07:47.100 | a model task of grabbing something, et cetera,
00:07:50.260 | we plan for action sequences and we do this
00:07:52.900 | by essentially imagining the result of the outcome
00:07:57.180 | of a sequence of actions that we might imagine.
00:08:01.260 | And that requires mental models
00:08:03.900 | that don't have much to do with language.
00:08:06.060 | And that's, I would argue, most of our knowledge
00:08:09.900 | is derived from that interaction with the physical world.
00:08:13.740 | So a lot of my colleagues who are more interested
00:08:17.420 | in things like computer vision are really on that camp
00:08:20.500 | that AI needs to be embodied, essentially.
00:08:25.100 | And then other people coming from the NLP side,
00:08:28.420 | or maybe some other motivation
00:08:32.860 | don't necessarily agree with that.
00:08:35.020 | And philosophers are split as well.
00:08:37.140 | And the complexity of the world is hard to imagine.
00:08:46.460 | It's hard to represent all the complexities
00:08:51.020 | that we take completely for granted in the real world
00:08:53.580 | that we don't even imagine require intelligence, right?
00:08:55.740 | This is the old Moravec paradox
00:08:58.020 | from the pioneer of robotics, Hans Moravec.
00:09:01.260 | who said, how is it that with computers,
00:09:03.300 | it seems to be easy to do high-level complex tasks
00:09:05.820 | like playing chess and solving integrals
00:09:08.420 | and doing things like that?
00:09:09.700 | Whereas the thing we take for granted that we do every day,
00:09:13.380 | like, I don't know, learning to drive a car
00:09:16.380 | or grabbing an object, we can't do with computers.
00:09:19.980 | And we have LLMs that can pass the bar exam,
00:09:26.820 | so they must be smart.
00:09:29.500 | But then they can't learn to drive in 20 hours
00:09:33.060 | like any 17-year-old.
00:09:35.460 | They can't learn to clear out the dinner table
00:09:38.660 | and fill up the dishwasher like any 10-year-old
00:09:41.100 | can learn in one shot.
00:09:42.220 | Why is that?
00:09:44.500 | Like, what are we missing?
00:09:45.860 | What type of learning or reasoning architecture
00:09:50.700 | or whatever are we missing that basically prevents us
00:09:55.700 | from having level five self-driving cars and domestic robots?
00:10:00.900 | - Can a large language model construct a world model
00:10:05.580 | that does know how to drive
00:10:07.740 | and does know how to fill a dishwasher
00:10:09.340 | but just doesn't know how to deal with visual data
00:10:11.620 | at this time?
00:10:12.580 | So it can operate in a space of concepts.
00:10:17.220 | - So yeah, that's what a lot of people are working on.
00:10:19.980 | So the answer, the short answer is no.
00:10:22.540 | And the more complex answer is you can use all kinds
00:10:26.220 | of tricks to get an LLM to basically digest
00:10:31.220 | visual representations of images or video
00:10:38.740 | or audio for that matter.
00:10:42.380 | And a classical way of doing this
00:10:45.420 | is you train a vision system in some way.
00:10:48.580 | And we have a number of ways to train vision systems.
00:10:51.340 | These are supervised, semi-supervised, self-supervised,
00:10:53.820 | all kinds of different ways.
00:10:55.220 | That will turn any image into a high-level representation.
00:11:01.100 | Basically a list of tokens that are really similar
00:11:04.500 | to the kind of tokens that a typical LLM takes as an input.
00:11:10.700 | And then you just feed that to the LLM
00:11:15.260 | in addition to the text.
00:11:17.140 | And you just expect the LLM to kind of, during training,
00:11:21.620 | to kind of be able to use those representations
00:11:25.500 | to help make decisions.
00:11:27.180 | I mean, there's been work along those lines
00:11:29.140 | for quite a long time.
00:11:30.420 | And now you see those systems, right?
00:11:32.700 | I mean, there are LLMs that have some vision extension.
00:11:36.700 | But these are basically hacks in the sense that those things
00:11:40.060 | are not like trained end-to-end to handle,
00:11:42.500 | to really understand the world.
00:11:43.860 | They're not trained with video, for example.
00:11:46.460 | They don't really understand intuitive physics,
00:11:49.020 | at least not at the moment.
00:11:51.220 | - So you don't think there's something special to you
00:11:53.300 | about intuitive physics, about sort of common sense reasoning
00:11:55.980 | about the physical space, about physical reality?
00:11:59.100 | That to you is a giant leap
00:12:00.780 | that LLMs are just not able to do?
00:12:02.860 | - We're not gonna be able to do this
00:12:04.060 | with the type of LLMs that we are working with today.
00:12:07.860 | And there's a number of reasons for this.
00:12:09.300 | But the main reason is the way LLMs are trained
00:12:14.300 | is that you take a piece of text,
00:12:16.580 | you remove some of the words in that text, you mask them,
00:12:20.300 | you replace them by blank markers,
00:12:22.660 | and you train a gigantic neural net
00:12:24.260 | to predict the words that are missing.
00:12:26.180 | And if you build this neural net in a particular way
00:12:30.300 | so that it can only look at words
00:12:33.220 | that are to the left of the one it's trying to predict,
00:12:36.140 | then what you have is a system that basically
00:12:38.020 | is trained to predict the next word in a text, right?
00:12:40.060 | So then you can feed it a text, a prompt,
00:12:43.460 | and you can ask it to predict the next word.
00:12:45.860 | It can never predict the next word exactly.
00:12:48.220 | And so what it's gonna do is produce
00:12:51.380 | a probability distribution
00:12:52.740 | of all the possible words in your dictionary.
00:12:55.020 | In fact, it doesn't predict words,
00:12:56.260 | it predicts tokens that are kind of sub-word units.
00:12:59.020 | And so it's easy to handle the uncertainty
00:13:01.900 | in the prediction there,
00:13:02.780 | because there's only a finite number of possible words
00:13:05.700 | in the dictionary.
00:13:07.380 | And you can just compute the distribution over them.
00:13:09.900 | Then what the system does is that
00:13:13.020 | it picks a word from that distribution.
00:13:16.860 | Of course, there's a higher chance of picking words
00:13:18.820 | that have a higher probability within the distribution.
00:13:21.420 | So you sample from the distribution
00:13:22.820 | to actually produce a word.
00:13:25.260 | And then you shift that word into the input.
00:13:27.460 | And so that allows the system
00:13:29.820 | now to predict the second word, right?
00:13:32.300 | And once you do this, you shift it into the input, et cetera.
00:13:35.300 | That's called autoregressive prediction,
00:13:37.580 | which is why those LLMs
00:13:39.900 | should be called autoregressive LLMs.
00:13:41.740 | But we just call them LLMs.
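
As a minimal sketch of that autoregressive loop (assuming a PyTorch-style decoder-only `model` that maps a batch of token ids to next-token logits; the function and parameter names are illustrative, not any particular library's API):

```python
import torch

def generate(model, prompt_ids, n_new_tokens, temperature=1.0):
    """Autoregressive decoding as described above: compute a distribution
    over the finite token vocabulary, sample one token from it, shift it
    into the input, and repeat."""
    ids = prompt_ids  # shape (1, T) of token ids
    for _ in range(n_new_tokens):
        logits = model(ids)[:, -1, :]                        # logits for the next token
        probs = torch.softmax(logits / temperature, dim=-1)  # distribution over tokens
        next_id = torch.multinomial(probs, num_samples=1)    # sample, not just argmax
        ids = torch.cat([ids, next_id], dim=1)               # feed it back as input
    return ids
```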
00:13:46.300 | And there is a difference between this kind of process
00:13:50.620 | and a process by which before producing a word,
00:13:54.700 | when you talk, when you and I talk,
00:13:56.660 | you and I are bilingual.
00:13:58.500 | We think about what we're gonna say,
00:14:00.460 | and it's relatively independent of the language
00:14:02.620 | in which we're gonna say it.
00:14:04.420 | When we talk about, I don't know,
00:14:06.980 | let's say a mathematical concept or something,
00:14:09.100 | the kind of thinking that we're doing
00:14:10.940 | and the answer that we're planning to produce
00:14:13.380 | is not linked to whether we're gonna say it
00:14:16.620 | in French or Russian or English.
00:14:19.460 | - Chomsky just rolled his eyes, but I understand.
00:14:21.700 | So you're saying that there's a bigger abstraction
00:14:25.420 | that goes before language and maps onto language.
00:14:30.300 | - Right.
00:14:31.140 | It's certainly true for a lot of thinking that we do.
00:14:34.020 | - Is that obvious that we don't,
00:14:35.780 | like you're saying your thinking is same in French
00:14:39.180 | as it is in English?
00:14:40.380 | - Yeah, pretty much.
00:14:42.060 | - Pretty much, or is this like, how flexible are you?
00:14:45.740 | Like if there's a probability of distribution?
00:14:48.060 | - Well, it depends what kind of thinking, right?
00:14:51.060 | If it's just, if it's like producing puns,
00:14:54.340 | I get much better in French than English about that.
00:14:56.940 | - No, but so we're right.
00:14:58.540 | Is there an abstract representation of puns?
00:15:00.500 | Like is your humor an abstract, like when you tweet
00:15:03.340 | and your tweets are sometimes a little bit spicy,
00:15:05.780 | is there an abstract representation in your brain
00:15:09.180 | of a tweet before it maps onto English?
00:15:11.780 | - There is an abstract representation
00:15:13.380 | of imagining the reaction of a reader to that text.
00:15:18.380 | - Or you start with laughter
00:15:19.940 | and then figure out how to make that happen?
00:15:21.980 | - Or figure out like a reaction you want to cause
00:15:25.620 | and then figure out how to say it, right?
00:15:27.460 | So that it causes that reaction.
00:15:29.100 | But that's like really close to language.
00:15:30.780 | But think about like a mathematical concept
00:15:34.340 | or imagining something you want to build out of wood
00:15:38.380 | or something like this, right?
00:15:40.100 | The kind of thinking you're doing
00:15:41.100 | has absolutely nothing to do with language really.
00:15:43.500 | Like it's not like you have necessarily
00:15:44.980 | like an internal monologue in any particular language.
00:15:47.700 | You're imagining mental models of the thing, right?
00:15:52.180 | I mean, if I ask you to imagine what this water bottle
00:15:55.980 | will look like if I rotate it 90 degrees,
00:16:00.140 | that has nothing to do with language.
00:16:02.500 | And so clearly there is a more abstract level
00:16:06.860 | of representation in which we do most of our thinking
00:16:11.460 | and we plan what we're gonna say.
00:16:13.940 | If the output is uttered words
00:16:18.940 | as opposed to an output being muscle actions, right?
00:16:24.820 | We plan our answer before we produce it.
00:16:29.300 | And LLMs don't do that.
00:16:30.380 | They just produce one word after the other
00:16:32.900 | instinctively if you want.
00:16:34.980 | It's like, it's a bit like the subconscious actions
00:16:39.980 | where you don't, like you're distracted,
00:16:42.820 | you're doing something, you're completely concentrated.
00:16:44.980 | And someone comes to you and ask you a question
00:16:48.300 | and you kind of answer the question.
00:16:49.660 | You don't have time to think about the answer
00:16:51.460 | but the answer is easy so you don't need to pay attention.
00:16:54.060 | You sort of respond automatically.
00:16:55.980 | That's kind of what an LLM does, right?
00:16:58.540 | It doesn't think about its answer really.
00:17:01.220 | It retrieves it because it's accumulated a lot of knowledge
00:17:04.540 | so it can retrieve some things
00:17:06.140 | but it's going to just spit out one token after the other
00:17:10.980 | without planning the answer.
00:17:13.060 | - But you're making it sound just one token after the other,
00:17:17.260 | one token at a time generation is bound to be simplistic.
00:17:22.260 | But if the world model is sufficiently sophisticated,
00:17:28.260 | then even one token at a time,
00:17:30.180 | the most likely sequence of tokens it generates
00:17:35.420 | is going to be a deeply profound thing.
00:17:39.140 | - Okay, but then that assumes that those systems
00:17:42.780 | actually possess an internal world model.
00:17:44.900 | - So it really goes to the,
00:17:46.500 | I think the fundamental question is
00:17:48.780 | can you build a really complete world model,
00:17:53.780 | not complete, but one that has a deep understanding
00:17:57.740 | of the world.
00:17:58.580 | - Yeah, so can you build this first of all by prediction?
00:18:03.580 | - Right.
00:18:04.420 | - And the answer is probably yes.
00:18:06.260 | Can you build it by predicting words?
00:18:10.720 | And the answer is most probably no
00:18:14.180 | because language is very poor in terms of weak
00:18:17.940 | or low bandwidth if you want.
00:18:19.340 | There's just not enough information there.
00:18:21.380 | So building world models means observing the world
00:18:27.140 | and understanding why the world is evolving the way it is.
00:18:32.140 | And then the extra component of a world model
00:18:38.540 | is something that can predict how the world
00:18:41.780 | is going to evolve as a consequence of an action
00:18:44.020 | you might take, right?
00:18:45.520 | So what a world model really is:
00:18:47.020 | here is my idea of the state of the world at time T,
00:18:49.180 | here is an action I might take.
00:18:51.020 | What is the predicted state of the world at time T plus one?
00:18:55.700 | Now that state of the world does not need to represent
00:18:59.340 | everything about the world.
00:19:01.180 | It just needs to represent enough that's relevant
00:19:03.460 | for this planning of the action,
00:19:06.140 | but not necessarily all the details.
00:19:08.440 | Now here is the problem.
00:19:10.260 | You're not going to be able to do this
00:19:11.860 | with generative models.
00:19:14.900 | So a generative model is trained on video,
00:19:16.860 | and we've tried to do this for 10 years.
00:19:18.600 | You take a video, show a system a piece of video,
00:19:22.420 | and then ask it to predict the remainder of the video.
00:19:25.780 | Basically predict what's going to happen.
00:19:27.860 | - One frame at a time.
00:19:29.380 | Do the same thing as sort of the autoregressive LLMs do,
00:19:33.340 | but for video.
00:19:34.220 | - Right.
00:19:35.060 | Either one frame at a time or a group of frames at a time.
00:19:38.220 | But yeah, a large video model, if you want.
00:19:41.180 | The idea of doing this has been floating around
00:19:46.220 | for a long time.
00:19:47.060 | And at FAIR, some of my colleagues and I
00:19:51.060 | have been trying to do this for about 10 years.
00:19:53.380 | And you can't really do the same trick as with LLMs
00:19:58.500 | because LLMs, as I said,
00:20:02.060 | you can't predict exactly which word
00:20:04.180 | is going to follow a sequence of words,
00:20:06.860 | but you can predict the distribution over words.
00:20:09.540 | Now, if you go to video,
00:20:11.580 | what you would have to do is predict the distribution
00:20:13.540 | over all possible frames in a video.
00:20:16.500 | And we don't really know how to do that properly.
00:20:19.980 | We do not know how to represent distributions
00:20:22.420 | over high dimensional continuous spaces
00:20:24.660 | in ways that are useful.
00:20:25.860 | And there lies the main issue.
00:20:31.340 | And the reason we can't do this
00:20:33.060 | is because the world is incredibly more complicated
00:20:37.300 | and richer in terms of information than text.
00:20:40.540 | Text is discrete.
00:20:41.620 | Video is high dimensional and continuous.
00:20:45.020 | A lot of details in this.
00:20:47.260 | So if I take a video of this room
00:20:49.740 | and the video is a camera panning around,
00:20:54.580 | there is no way I can predict everything
00:20:58.340 | that's going to be in the room as I pan around.
00:21:00.100 | The system cannot predict what's going to be in the room
00:21:02.220 | as the camera is panning.
00:21:03.500 | Maybe it's going to predict this is a room
00:21:07.060 | where there is a light
00:21:07.900 | and there is a wall and things like that.
00:21:09.340 | It can't predict what the painting on the wall looks like
00:21:11.700 | or what the texture of the couch looks like.
00:21:14.180 | Certainly not the texture of the carpet.
00:21:16.140 | So there's no way I can predict all those details.
00:21:19.180 | So the way to handle this is,
00:21:23.380 | one way possibly to handle this,
00:21:24.900 | which we've been working for a long time,
00:21:26.420 | is to have a model that has what's called a latent variable.
00:21:29.820 | And the latent variable is fed to a neural net
00:21:33.020 | and it's supposed to represent all the information
00:21:35.220 | about the world that you don't perceive yet.
00:21:37.940 | and with which you need to augment the system
00:21:43.860 | for the prediction to do a good job at predicting pixels,
00:21:47.180 | including the fine texture of the carpet and the couch
00:21:52.180 | and the painting on the wall.
00:21:54.980 | That has been a complete failure, essentially.
00:22:00.180 | And we've tried lots of things.
00:22:01.340 | We tried just straight neural nets.
00:22:03.820 | We tried GANs.
00:22:04.700 | We tried VAEs, all kinds of regularized autoencoders.
00:22:09.700 | We tried many things.
00:22:13.900 | We also tried those kinds of methods
00:22:15.700 | to learn good representations of images or video
00:22:20.260 | that could then be used as input to, for example,
00:22:24.900 | an image classification system.
00:22:26.580 | And that also has basically failed.
00:22:29.580 | Like all the systems that attempt to predict
00:22:32.540 | missing parts of an image or video
00:22:34.820 | from a corrupted version of it, basically.
00:22:40.220 | So I take an image or a video,
00:22:41.660 | corrupt it or transform it in some way,
00:22:44.100 | and then try to reconstruct the complete video or image
00:22:47.500 | from the corrupted version.
00:22:48.900 | And then hope that internally,
00:22:52.180 | the system will develop good representations of images
00:22:54.900 | that you can use for object recognition, segmentation,
00:22:57.620 | whatever it is.
00:22:58.460 | That has been essentially a complete failure.
00:23:01.820 | And it works really well for text.
00:23:04.460 | That's the principle that is used for LLMs, right?
00:23:07.140 | - So where's the failure exactly?
00:23:09.340 | Is it that it's very difficult to form a good representation
00:23:13.420 | of an image, like a good embedding of
00:23:16.740 | all the important information in the image?
00:23:19.340 | Is it in terms of the consistency of image to image
00:23:21.860 | to image to image that forms the video?
00:23:23.980 | Like what are the, if we do a highlight reel
00:23:27.140 | of all the ways you failed, what's that look like?
00:23:30.660 | - Okay, so the reason this doesn't work is,
00:23:35.300 | first of all, I have to tell you exactly what doesn't work
00:23:37.220 | because there is something else that does work.
00:23:40.060 | So the thing that does not work is training the system
00:23:44.220 | to learn representations of images
00:23:47.820 | by training it to reconstruct a good image
00:23:52.140 | from a corrupted version of it, okay?
00:23:54.020 | That's what doesn't work.
00:23:55.740 | And we have a whole slew of techniques for this
00:23:59.100 | that are, you know, variant of denoising autoencoders,
00:24:02.540 | something called MAE developed by some of my colleagues
00:24:05.060 | at FAIR, masked autoencoder.
00:24:07.020 | So it's basically like the, you know, LLMs
00:24:10.460 | or things like this where you train the system
00:24:13.100 | by corrupting text, except you corrupt images,
00:24:15.300 | you remove patches from it and you train
00:24:17.220 | a gigantic neural net to reconstruct.
00:24:19.500 | The features you get are not good.
00:24:20.980 | And you know they're not good because
00:24:23.420 | if you now train the same architecture,
00:24:25.540 | but you train it supervised, with labeled data,
00:24:30.100 | with textual descriptions of images, et cetera,
00:24:34.060 | you do get good representations.
00:24:35.780 | And the performance on recognition tasks is much better
00:24:39.700 | than if you do this self-supervised pretraining.
00:24:42.660 | - So the architecture is good.
00:24:44.580 | - The architecture is good.
00:24:45.420 | The architecture of the encoder is good, okay?
00:24:48.020 | But the fact that you train the system
00:24:49.500 | to reconstruct images does not lead it to produce,
00:24:53.780 | to learn good generic features of images.
00:24:56.300 | - When you train in a self-supervised way.
00:24:58.380 | - Self-supervised by reconstruction.
00:25:00.380 | - Yeah, by reconstruction.
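
As a rough sketch of the corrupt-then-reconstruct objective being described (an MAE-like recipe in spirit only; `patchify`, `encoder`, and `decoder` are hypothetical placeholders, and the real MAE feeds the encoder only the visible patches):

```python
import torch
import torch.nn.functional as F

def reconstruction_pretrain_step(encoder, decoder, images, mask_ratio=0.75):
    """Self-supervised pretraining by reconstruction: corrupt the image by
    masking patches, then train encoder + decoder to predict the missing
    pixels. This is the recipe described above as not yielding good
    visual features."""
    patches = patchify(images)                                  # (B, N, D), helper assumed
    mask = torch.rand(patches.shape[:2], device=patches.device) < mask_ratio
    corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)    # zero out masked patches
    latent = encoder(corrupted)                                 # representation of corrupted input
    recon = decoder(latent)                                     # reconstruct all patches in pixel space
    return F.mse_loss(recon[mask], patches[mask])               # error on the masked patches
```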
00:25:01.380 | - Okay, so what's the alternative?
00:25:04.380 | - The alternative is joint embedding.
00:25:07.500 | - What is joint embedding?
00:25:08.860 | What are these architectures that you're so excited about?
00:25:11.260 | - Okay, so now instead of training a system
00:25:13.380 | to encode the image and then training it
00:25:15.300 | to reconstruct the full image from a corrupted version,
00:25:20.060 | you take the full image,
00:25:21.540 | you take the corrupted or transformed version,
00:25:25.380 | you run them both through encoders,
00:25:27.140 | which in general are identical, but not necessarily.
00:25:31.580 | And then you train a predictor on top of those encoders
00:25:36.580 | to predict the representation of the full input
00:25:42.460 | from the representation of the corrupted one, okay?
00:25:47.460 | So joint embedding, because you're taking the full input
00:25:51.100 | and the corrupted version or transformed version,
00:25:54.140 | run them both through encoders, you get a joint embedding.
00:25:57.260 | And then you're saying,
00:25:59.140 | can I predict the representation of the full one
00:26:01.980 | from the representation of the corrupted one, okay?
00:26:05.180 | And I call this a JEPA,
00:26:07.820 | so that means joint embedding predictive architecture
00:26:09.860 | because there's joint embedding
00:26:11.220 | and there is this predictor that predicts
00:26:12.620 | the representation of the good guy from the bad guy.
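
Schematically, the architecture just described might look like this (a sketch, not the actual I-JEPA implementation; `encoder`, `target_encoder`, `predictor`, and `corrupt` are placeholders):

```python
import torch
import torch.nn.functional as F

def jepa_loss(encoder, target_encoder, predictor, x, corrupt):
    """Joint Embedding Predictive Architecture, schematically: encode the
    full input and a corrupted/transformed view, then predict the
    representation of the full input from that of the corrupted one.
    The prediction error lives in representation space, not pixel space."""
    with torch.no_grad():                 # the "good" branch provides the target
        target = target_encoder(x)        # representation of the full input
    context = encoder(corrupt(x))         # representation of the corrupted input
    pred = predictor(context)             # predict the target representation
    return F.mse_loss(pred, target)
```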
00:26:15.300 | And the big question is,
00:26:18.300 | how do you train something like this?
00:26:20.700 | And until five years ago or six years ago,
00:26:23.780 | we didn't have particularly good answers
00:26:26.260 | for how you train those things,
00:26:27.660 | except for one called contrastive learning.
00:26:31.900 | And the idea of contrastive learning
00:26:36.540 | is you take a pair of images that are,
00:26:39.860 | again, an image and a corrupted version
00:26:42.420 | or degraded version somehow,
00:26:44.260 | or transformed version of the original one,
00:26:47.180 | and you train the predicted representation
00:26:49.900 | to be the same as that of the original.
00:26:51.820 | If you only do this, the system collapses.
00:26:53.900 | It basically completely ignores the input
00:26:55.700 | and produces representations that are constant.
00:26:58.100 | So the contrastive methods avoid this,
00:27:02.780 | and those things have been around since the early '90s.
00:27:05.380 | I had a paper on this in 1993.
00:27:07.140 | You also show pairs of images that you know are different,
00:27:13.420 | and then you push away the representations from each other.
00:27:17.540 | So you say, not only do representations of things
00:27:20.380 | that we know are the same,
00:27:22.020 | should be the same or should be similar,
00:27:23.900 | but representations of things that we know are different
00:27:25.740 | should be different.
00:27:26.620 | And that prevents the collapse, but it has some limitation.
00:27:30.140 | And there's a whole bunch of techniques that have appeared
00:27:33.260 | over the last six, seven years
00:27:35.660 | that can revive this type of method,
00:27:38.060 | some of them from FAIR,
00:27:40.340 | some of them from Google and other places.
00:27:43.260 | But there are limitations to those contrastive methods.
00:27:47.260 | What has changed in the last three, four years
00:27:52.420 | is now we have methods that are non-contrastive.
00:27:54.900 | So they don't require those negative contrastive samples
00:27:58.940 | of images that we know are different.
00:28:01.620 | You train them only with images that are different versions
00:28:06.380 | or different views of the same thing,
00:28:08.180 | and you rely on some other tweaks
00:28:10.740 | to prevent the system from collapsing.
00:28:12.740 | And we have half a dozen different methods for this now.
00:28:15.980 | - So what is the fundamental difference
00:28:17.860 | between joint embedding architectures and LLMs?
00:28:22.340 | So can JEPA take us to AGI?
00:28:26.980 | Though we should say that you don't like the term AGI,
00:28:31.780 | and we'll probably argue.
00:28:33.020 | I think every single time I've talked to you
00:28:34.900 | we've argued about the G in AGI.
00:28:36.860 | - Yes.
00:28:37.700 | - I get it, I get it.
00:28:40.220 | Well, we'll probably continue to argue about it, it's great.
00:28:43.300 | You like AMI, 'cause you like French,
00:28:47.220 | and "ami" is, I guess, friend in French.
00:28:51.780 | - Yes.
00:28:52.620 | - And AMI stands for advanced machine intelligence.
00:28:55.820 | - Right.
00:28:56.660 | - But either way, can JEPA take us to that,
00:29:00.500 | towards that advanced machine intelligence?
00:29:02.580 | - Well, so it's a first step.
00:29:04.620 | Okay, so first of all, what's the difference
00:29:07.260 | with generative architectures like LLMs?
00:29:11.060 | So LLMs, or vision systems that are trained
00:29:16.020 | by reconstruction, generate the inputs, right?
00:29:20.060 | They generate the original input that is non-corrupted,
00:29:25.060 | non-transformed, right?
00:29:27.340 | So you have to predict all the pixels.
00:29:29.940 | And there is a huge amount of resources spent in the system
00:29:33.420 | to actually predict all those pixels, all the details.
00:29:36.180 | In a JEPA, you're not trying to predict all the pixels,
00:29:40.500 | you're only trying to predict an abstract representation
00:29:43.980 | of the inputs, right?
00:29:47.020 | And that's much easier in many ways.
00:29:49.460 | So what the JEPA system, when it's being trained,
00:29:51.460 | is trying to do is extract as much information as possible
00:29:54.820 | from the input, but yet only extract information
00:29:58.180 | that is relatively easily predictable.
00:30:00.500 | Okay, so there's a lot of things in the world
00:30:03.660 | that we cannot predict, like for example,
00:30:05.220 | if you have a self-driving car driving down the street
00:30:08.180 | or road, there may be trees around the road.
00:30:13.180 | And it could be a windy day, so the leaves on the tree
00:30:16.260 | are kind of moving in kind of semi-chaotic random ways
00:30:19.620 | that you can't predict and you don't care,
00:30:22.020 | you don't wanna predict.
00:30:23.660 | So what you want is your encoder
00:30:25.300 | to basically eliminate all those details.
00:30:27.300 | It will tell you there's moving leaves,
00:30:28.780 | but it's not gonna keep the details
00:30:29.980 | of exactly what's going on.
00:30:31.380 | And so when you do the prediction in representation space,
00:30:35.940 | you're not going to have to predict every single pixel
00:30:38.020 | of every leaf.
00:30:38.860 | And that, you know, not only is a lot simpler,
00:30:43.540 | but also it allows the system to essentially learn
00:30:47.420 | an abstract representation of the world
00:30:49.780 | where, you know, what can be modeled and predicted
00:30:53.500 | is preserved and the rest is viewed as noise
00:30:57.460 | and eliminated by the encoder.
00:30:59.140 | So it kind of lifts the level of abstraction
00:31:00.980 | of the representation.
00:31:02.300 | If you think about this,
00:31:03.140 | this is something we do absolutely all the time.
00:31:05.460 | Whenever we describe a phenomenon,
00:31:06.980 | we describe it at a particular level of abstraction.
00:31:10.100 | And we don't always describe every natural phenomenon
00:31:13.420 | in terms of quantum field theory, right?
00:31:15.260 | That would be impossible, right?
00:31:17.460 | So we have multiple levels of abstraction
00:31:20.060 | to describe what happens in the world, you know,
00:31:22.660 | starting from quantum field theory to like atomic theory
00:31:25.620 | and molecules, you know, and chemistry materials,
00:31:29.060 | and, you know, all the way up to, you know,
00:31:31.700 | kind of concrete objects in the real world
00:31:33.940 | and things like that.
00:31:34.780 | So we can't just only model everything at the lowest level.
00:31:40.460 | And that's what the idea of JEPA is really about.
00:31:44.540 | Learn abstract representation in a self-supervised manner.
00:31:49.540 | And, you know, you can do it hierarchically as well.
00:31:52.100 | So that I think is an essential component
00:31:54.500 | of an intelligent system.
00:31:56.300 | And in language, we can get away without doing this
00:31:58.540 | because language is already to some level abstract
00:32:02.580 | and already has eliminated a lot of information
00:32:05.460 | that is not predictable.
00:32:07.060 | And so we can get away without doing the joint embedding,
00:32:11.020 | without, you know, lifting the abstraction level
00:32:13.780 | and by directly predicting words.
00:32:15.420 | - So joint embedding, it's still generative,
00:32:19.980 | but it's generative in this abstract representation space.
00:32:23.380 | - Yeah.
00:32:24.220 | - And you're saying language, we were lazy with language
00:32:27.300 | 'cause we already got the abstract representation for free.
00:32:30.380 | And now we have to zoom out,
00:32:31.980 | actually think about generally intelligent systems.
00:32:34.580 | We have to deal with a full mess of physical reality,
00:32:39.260 | of reality.
00:32:40.100 | And you do have to do this step of jumping from
00:32:44.940 | the full, rich, detailed reality
00:32:51.340 | to a abstract representation of that reality
00:32:54.820 | based on what you can then reason
00:32:56.340 | and all that kind of stuff.
00:32:57.340 | - Right.
00:32:58.180 | And the thing is, those self-supervised algorithms
00:33:00.500 | that learn by prediction, even in representation space,
00:33:04.740 | learn more concepts
00:33:09.260 | if the input data you feed them is more redundant.
00:33:11.980 | The more redundancy there is in the data,
00:33:14.020 | the more they're able to capture
00:33:15.500 | some internal structure of it.
00:33:17.780 | And so there, there is way more redundancy
00:33:20.460 | in the structure in perceptual inputs,
00:33:24.060 | sensory input like vision than there is in text,
00:33:28.460 | which is not nearly as redundant.
00:33:29.980 | This is back to the question you were asking
00:33:32.500 | a few minutes ago.
00:33:33.420 | Language might represent more information really
00:33:35.540 | because it's already compressed.
00:33:36.700 | You're right about that,
00:33:37.660 | but that means it's also less redundant.
00:33:40.260 | And so self-supervised learning will not work as well.
00:33:43.700 | - Is it possible to join the self-supervised training
00:33:48.700 | on visual data and self-supervised training
00:33:52.300 | on language data?
00:33:53.900 | There is a huge amount of knowledge,
00:33:56.540 | even though you talked down about those 10 to the 13 tokens.
00:34:00.260 | Those 10 to the 13 tokens represent the entirety,
00:34:03.340 | or a large fraction, of what us humans have figured out,
00:34:08.300 | both the shit talk on Reddit
00:34:11.380 | and the contents of all the books and the articles
00:34:14.180 | and the full spectrum of human intellectual creation.
00:34:18.980 | So is it possible to join those two together?
00:34:22.260 | - Well, eventually, yes.
00:34:23.740 | But I think if we do this too early,
00:34:27.860 | we run the risk of being tempted to cheat.
00:34:30.340 | And in fact, that's what people are doing at the moment
00:34:32.180 | with the vision language model.
00:34:33.540 | We're basically cheating.
00:34:35.220 | We're using language as a crutch to help the deficiencies
00:34:40.020 | of our vision systems to kind of learn good representations
00:34:44.740 | from images and video.
00:34:46.460 | And the problem with this is that we might improve
00:34:51.100 | our visual language system a bit,
00:34:53.780 | I mean, our language models by feeding them images,
00:34:58.100 | but we're not gonna get to the level of even the intelligence
00:35:01.740 | or level of understanding of the world of a cat or a dog,
00:35:05.580 | which doesn't have language.
00:35:07.380 | You know, they don't have language
00:35:08.620 | and they understand the world much better than any LLM.
00:35:12.060 | They can plan really complex actions
00:35:14.140 | and sort of imagine the result of a bunch of actions.
00:35:17.940 | How do we get machines to learn that?
00:35:20.460 | Before we combine that with language,
00:35:22.940 | obviously, if we combine this with language,
00:35:24.820 | this is gonna be a winner.
00:35:26.220 | But before that, we have to focus on like,
00:35:30.780 | how do we get systems to learn how the world works?
00:35:33.300 | - So this kind of joint embedding, predictive architecture,
00:35:38.300 | for you, that's gonna be able to learn
00:35:40.060 | something like common sense,
00:35:41.380 | something like what a cat uses to predict
00:35:45.580 | how to mess with its owner most optimally
00:35:48.340 | by knocking over a thing.
00:35:49.940 | - That's the hope.
00:35:51.340 | In fact, the techniques we're using are non-contrastive.
00:35:54.260 | So not only is the architecture non-generative,
00:35:57.740 | the learning procedures we're using are non-contrastive.
00:36:01.540 | So we have two sets of techniques.
00:36:03.660 | One set is based on distillation
00:36:05.700 | and there's a number of methods that use this principle.
00:36:10.300 | One by DeepMind called BYOL,
00:36:11.620 | a couple by FAIR, one called VicReg,
00:36:17.940 | and another one called IJPA.
00:36:20.140 | And VicReg, I should say,
00:36:21.500 | is not a distillation method actually,
00:36:23.620 | but IJPA and BYOL certainly are.
00:36:25.700 | And there's another one also called DINO or DINO,
00:36:29.300 | also produced from FAIR.
00:36:31.820 | And the idea of those things is that
00:36:32.940 | you take the full input, let's say an image,
00:36:35.820 | you run it through an encoder,
00:36:37.820 | produces a representation,
00:36:41.340 | and then you corrupt that input or transform it,
00:36:43.540 | run it through essentially what amounts to the same encoder
00:36:46.540 | with some minor differences.
00:36:48.500 | And then train a predictor.
00:36:50.420 | Sometimes the predictor is very simple,
00:36:51.900 | sometimes it doesn't exist,
00:36:53.100 | but train a predictor to predict a representation
00:36:55.260 | of the first uncorrupted input from the corrupted input.
00:37:00.260 | But you only train the second branch.
00:37:05.460 | You only train the part of the network
00:37:07.540 | that is fed with the corrupted input.
00:37:10.780 | The other network you don't train,
00:37:12.780 | but since they share the same weight,
00:37:14.260 | when you modify the first one,
00:37:15.980 | it also modifies the second one.
00:37:17.580 | And with various tricks,
00:37:19.660 | you can prevent the system from collapsing,
00:37:22.620 | with the collapse of the type I was explaining before,
00:37:24.700 | where the system basically ignores the input.
00:37:26.900 | So that works very well.
00:37:31.060 | The two techniques we've developed at FAIR,
00:37:34.780 | DINO and I-JEPA work really well for that.
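
A small sketch of the trick just described, where only the branch fed with the corrupted input is trained and the other branch simply tracks it, here via an exponential moving average of the weights, one common BYOL/I-JEPA-style choice; the names and momentum value are illustrative:

```python
import torch

@torch.no_grad()
def update_target_encoder(encoder, target_encoder, momentum=0.996):
    """The untrained branch gets no gradients; its weights just follow the
    trained encoder as an exponential moving average. Together with a few
    other tricks, this prevents the collapse where the system ignores its
    input and outputs a constant representation."""
    for p, p_tgt in zip(encoder.parameters(), target_encoder.parameters()):
        p_tgt.mul_(momentum).add_(p, alpha=1.0 - momentum)

# Schematic training step:
#   loss = jepa_loss(encoder, target_encoder, predictor, x, corrupt)  # as sketched earlier
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
#   update_target_encoder(encoder, target_encoder)
```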
00:37:39.300 | - So what kind of data are we talking about here?
00:37:41.780 | - So there's several scenario.
00:37:43.380 | One scenario is you take an image,
00:37:47.340 | you corrupt it by changing the cropping, for example,
00:37:52.340 | changing the size a little bit,
00:37:54.380 | maybe changing the orientation, blurring it,
00:37:56.700 | changing the colors,
00:37:58.300 | doing all kinds of horrible things to it.
00:38:00.060 | - But basic horrible things.
00:38:01.620 | - Basic horrible things that sort of degrade the quality
00:38:03.820 | a little bit and change the framing,
00:38:06.420 | crop the image.
00:38:08.380 | And in some cases, in the case of I-JEPA,
00:38:12.220 | you don't need to do any of this,
00:38:13.220 | you just mask some parts of it, right?
00:38:16.380 | You just basically remove some regions,
00:38:19.460 | like a big block, essentially.
00:38:21.860 | And then run through the encoders
00:38:25.220 | and train the entire system, encoder and predictor,
00:38:27.660 | to predict the representation of the good one
00:38:29.500 | from the representation of the corrupted one.
00:38:31.740 | So that's I-JEPA.
00:38:35.420 | Doesn't need to know that it's an image, for example,
00:38:38.300 | because the only thing it needs to know
00:38:39.540 | is how to do this masking.
00:38:42.380 | Whereas with DINO, you need to know it's an image
00:38:44.380 | because you need to do things like geometry transformation
00:38:47.540 | and blurring and things like that
00:38:49.300 | that are really image-specific.
00:38:50.860 | A more recent version of this that we have
00:38:53.860 | is called V-JEPA, so it's basically the same idea as I-JEPA,
00:38:56.860 | except it's applied to video.
00:38:59.180 | So now you take a whole video
00:39:00.780 | and you mask a whole chunk of it.
00:39:02.740 | And what we mask is actually kind of a temporal tube,
00:39:04.980 | so a whole segment of each frame in the video
00:39:08.740 | over the entire video.
00:39:10.340 | - And that tube is like statically positioned
00:39:12.860 | throughout the frames, so it's literally a straight tube.
00:39:15.860 | - The tube, yeah, typically is 16 frames or something,
00:39:18.860 | and we mask the same region over the entire 16 frames.
00:39:22.340 | It's a different one for every video, obviously.
00:39:24.620 | And then again, train that system
00:39:28.540 | so as to predict the representation of the full video
00:39:31.300 | from the partially masked video.
00:39:33.260 | That works really well.
00:39:35.380 | It's the first system that we have
00:39:36.860 | that learns good representations of video
00:39:39.940 | so that when you feed those representations
00:39:41.820 | to a supervised classifier head,
00:39:44.980 | it can tell you what action is taking place in the video
00:39:47.780 | with pretty good accuracy.
00:39:49.740 | So that's the first time we get something of that quality.
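
As a toy illustration of the temporal-tube masking described above (the grid size, block size, and shapes are made up for the example; this is not the released V-JEPA code):

```python
import torch

def tube_mask(n_frames=16, grid_h=14, grid_w=14, block=6):
    """Boolean mask over a grid of video patches (True = masked).
    One spatial block is picked at random and masked identically in
    every frame, forming a straight 'tube' through time."""
    top = torch.randint(0, grid_h - block + 1, (1,)).item()
    left = torch.randint(0, grid_w - block + 1, (1,)).item()
    mask = torch.zeros(n_frames, grid_h, grid_w, dtype=torch.bool)
    mask[:, top:top + block, left:left + block] = True   # same region in all 16 frames
    return mask

# Training (schematic): predict the representation of the full clip from
# the representation of the clip with this tube masked out.
```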
00:39:55.980 | - So that's a good test that good representations form,
00:39:58.660 | that means there's something to this.
00:40:00.300 | - Yeah, we have also a preliminary result
00:40:03.460 | that seemed to indicate that the representation
00:40:07.140 | allows our system to tell whether the video
00:40:10.660 | is physically possible or completely impossible
00:40:13.940 | because some object disappeared,
00:40:15.340 | or an object suddenly jumped from one location to another
00:40:19.540 | or changed shape or something.
00:40:21.860 | - So it's able to capture some physics-based constraints
00:40:26.860 | about the reality represented in the video?
00:40:29.260 | - Yeah.
00:40:30.220 | - About the appearance and the disappearance of objects?
00:40:33.140 | - Yeah, that's really new.
00:40:35.740 | - Okay, but can this actually get us
00:40:40.740 | to this kind of a world model that understands enough
00:40:45.580 | about the world to be able to drive a car?
00:40:48.020 | - Possibly.
00:40:50.060 | I mean, this is gonna take a while
00:40:51.540 | before we get to that point,
00:40:52.660 | but there are systems already, robotic systems,
00:40:56.900 | that are based on this idea.
00:40:58.700 | And what you need for this
00:41:02.700 | is a slightly modified version of this
00:41:04.860 | where imagine that you have a video, a complete video.
00:41:09.860 | And what you're doing to this video
00:41:13.980 | is that you're either translating it in time
00:41:17.620 | towards the future,
00:41:18.460 | so you only see the beginning of the video,
00:41:19.980 | but you don't see the latter part of it
00:41:21.740 | that is in the original one,
00:41:23.380 | or you just mask the second half of the video, for example.
00:41:27.260 | And then you train a JEPA system of the type I described
00:41:32.260 | to predict the representation of the full video
00:41:33.980 | from the shifted one,
00:41:36.140 | but you also feed the predictor with an action.
00:41:39.660 | For example, the wheel is turned 10 degrees
00:41:42.820 | to the right or something, right?
00:41:45.420 | So if it's a dashcam in a car
00:41:49.860 | and you know the angle of the wheel,
00:41:51.340 | you should be able to predict to some extent
00:41:52.900 | what's going to happen to what you see.
00:41:56.820 | You're not gonna be able to predict all the details
00:41:59.940 | of objects that appear in the view, obviously,
00:42:02.780 | but at an abstract representation level,
00:42:05.740 | you can probably predict what's gonna happen.
00:42:08.660 | So now what you have is an internal model that says,
00:42:13.100 | here is my idea of state of the world at time t,
00:42:15.260 | here is an action I'm taking,
00:42:17.860 | here is a prediction of the state of the world
00:42:19.300 | at time t plus one, t plus delta t,
00:42:21.980 | t plus two seconds, whatever it is.
00:42:24.300 | If you have a model of this type,
00:42:26.180 | you can use it for planning.
00:42:27.940 | So now you can do what LLMs cannot do,
00:42:31.540 | which is planning what you're gonna do
00:42:33.980 | so as to arrive at a particular outcome
00:42:37.580 | or satisfy a particular objective, right?
00:42:40.780 | So you can have a number of objectives.
00:42:43.520 | I can predict that if I have an object like this
00:42:50.820 | and I open my hand, it's gonna fall, right?
00:42:54.420 | And if I push it with a particular force on the table,
00:42:58.180 | it's gonna move.
00:42:59.020 | If I push the table itself,
00:43:00.060 | it's probably not gonna move with the same force.
00:43:03.620 | So we have this internal model of the world in our mind,
00:43:08.620 | which allows us to plan sequences of actions
00:43:11.780 | to arrive at a particular goal.
00:43:13.340 | And so now if you have this world model,
00:43:18.540 | we can imagine a sequence of actions,
00:43:21.580 | predict what the outcome of the sequence of action
00:43:23.620 | is going to be, measure to what extent the final state
00:43:28.300 | satisfies a particular objective,
00:43:31.020 | like moving the bottle to the left of the table,
00:43:35.060 | and then plan a sequence of actions
00:43:38.460 | that will minimize this objective at runtime.
00:43:41.500 | We're not talking about learning,
00:43:42.340 | we're talking about inference time, right?
00:43:44.260 | So this is planning, really.
00:43:46.140 | And in optimal control, this is a very classical thing.
00:43:48.340 | It's called model predictive control.
00:43:50.580 | You have a model of the system you want to control
00:43:53.780 | that can predict the sequence of states
00:43:56.340 | corresponding to a sequence of commands.
00:43:58.980 | And you're planning a sequence of commands
00:44:02.260 | so that according to your world model,
00:44:04.180 | the end state of the system
00:44:06.420 | will satisfy an objective that you fix.
00:44:10.980 | This is the way rocket trajectories have been planned
00:44:15.980 | since computers have been around,
00:44:17.740 | so since the early '60s, essentially.
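
A schematic of that model-predictive-control loop in Python, under the assumption of a learned `world_model(state, action) -> next_state` and a scalar `objective(state)`; the random-shooting search and all names are illustrative, not a description of any existing system:

```python
import torch

def plan(world_model, objective, state, action_dim, horizon=10, n_candidates=256):
    """Model predictive control, schematically: imagine the outcome of many
    candidate action sequences with the world model, score how well each
    final (abstract) state satisfies the objective, and return the first
    action of the best sequence. This is inference-time planning; nothing
    is learned here."""
    best_cost, best_actions = float("inf"), None
    for _ in range(n_candidates):
        actions = torch.randn(horizon, action_dim)    # random-shooting candidates
        s = state
        for a in actions:                             # roll out: s[t+1] = f(s[t], a[t])
            s = world_model(s, a)
        cost = float(objective(s))                    # distance to the goal state
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions[0]                            # act, then replan at the next step
```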
00:44:20.100 | - So yes, for model predictive control,
00:44:21.860 | but you also often talk about hierarchical planning.
00:44:26.020 | - Yeah.
00:44:26.860 | - Can hierarchical planning emerge from this somehow?
00:44:29.020 | - Well, so no.
00:44:29.860 | You will have to build a specific architecture
00:44:32.820 | to allow for hierarchical planning.
00:44:34.660 | So hierarchical planning is absolutely necessary
00:44:36.900 | if you want to plan complex actions.
00:44:39.580 | If I want to go from, let's say, from New York to Paris,
00:44:43.340 | this is the example I use all the time,
00:44:45.460 | and I'm sitting in my office at NYU,
00:44:48.180 | my objective that I need to minimize
00:44:50.500 | is my distance to Paris.
00:44:52.140 | At a high level, a very abstract representation
00:44:55.100 | of my location, I will have to decompose this
00:44:58.380 | into two sub-goals.
00:44:59.380 | First one is go to the airport.
00:45:02.260 | Second one is catch a plane to Paris.
00:45:04.700 | Okay, so my sub-goal is now going to the airport.
00:45:09.140 | My objective function is my distance to the airport.
00:45:11.700 | How do I go to the airport?
00:45:14.140 | Well, I have to go in the street and hail a taxi,
00:45:18.300 | which you can do in New York.
00:45:19.700 | Okay, now I have another sub-goal.
00:45:22.740 | Go down on the street, what that means,
00:45:26.220 | going to the elevator, going down the elevator,
00:45:28.860 | walk out the street.
00:45:29.860 | How do I go to the elevator?
00:45:32.700 | I have to stand up for my chair,
00:45:36.380 | open the door of my office, go to the elevator,
00:45:39.420 | push the button.
00:45:40.700 | How do I get up from my chair?
00:45:42.340 | Like, you know, you can imagine going down,
00:45:44.020 | all the way down to basically what amounts
00:45:47.420 | to millisecond by millisecond muscle control.
00:45:50.420 | Okay, and obviously you're not going to plan
00:45:54.180 | your entire trip from New York to Paris
00:45:56.540 | in terms of millisecond by millisecond muscle control.
00:46:00.300 | First, that would be incredibly expensive,
00:46:02.300 | but it will also be completely impossible
00:46:03.800 | because you don't know all the conditions
00:46:06.480 | of what's going to happen.
00:46:08.060 | You know, how long it's going to take to catch a taxi
00:46:10.660 | or to go to the airport with traffic, you know.
00:46:14.980 | I mean, you would have to know exactly the condition
00:46:17.500 | of everything to be able to do this planning.
00:46:19.940 | And you don't have the information.
00:46:21.460 | So you have to do this hierarchical planning
00:46:24.020 | so that you can start acting
00:46:25.420 | and then sort of replanning as you go.
00:46:27.380 | And nobody really knows how to do this in AI.
00:46:32.060 | Nobody knows how to train a system
00:46:35.340 | to learn the appropriate multiple levels of representation
00:46:38.620 | so that hierarchical planning works.
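
A toy sketch of the New-York-to-Paris style decomposition just walked through (the `decompose` and `is_primitive` helpers are entirely hypothetical; as noted above, nobody currently knows how to learn these levels of representation):

```python
def hierarchical_plan(goal, decompose, is_primitive):
    """Recursively expand an abstract goal ('go to Paris') into sub-goals
    ('go to the airport', 'catch a plane to Paris'), stopping at goals that
    can be acted on directly. A real agent would expand the lower levels
    lazily and replan as it goes, rather than unrolling everything down to
    millisecond-level muscle control up front."""
    if is_primitive(goal):
        return [goal]
    plan = []
    for subgoal in decompose(goal):
        plan += hierarchical_plan(subgoal, decompose, is_primitive)
    return plan

# e.g. decompose('office at NYU -> Paris') might return
#      ['go to the airport', 'catch a plane to Paris']   (purely illustrative)
```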
00:46:41.380 | - Does something like that already emerge?
00:46:43.060 | So like, can you use an LLM, state-of-the-art LLM
00:46:48.460 | to get you from New York to Paris
00:46:50.940 | by doing exactly the kind of detailed set of questions
00:46:55.060 | that you just did, which is,
00:46:56.940 | can you give me a list of 10 steps I need to do
00:47:01.220 | to get from New York to Paris?
00:47:02.660 | And then for each of those steps,
00:47:05.420 | can you give me a list of 10 steps,
00:47:07.140 | how I make that step happen?
00:47:09.180 | And for each of those steps,
00:47:10.340 | can you give me a list of 10 steps to make each one of those
00:47:13.180 | until you're moving your individual muscles?
00:47:16.420 | Maybe not, whatever you can actually act upon
00:47:19.620 | using your mind.
00:47:20.660 | - Right, so there's a lot of questions
00:47:23.180 | that are actually implied by this, right?
00:47:24.500 | So the first thing is LLMs will be able to answer
00:47:27.700 | some of those questions down to some level of abstraction
00:47:30.500 | under the condition that they've been trained
00:47:34.480 | with similar scenarios in their training set.
00:47:37.260 | - They would be able to answer all of those questions,
00:47:40.100 | but some of them may be hallucinated, meaning non-factual.
00:47:44.260 | - Yeah, true, I mean, they will probably produce
00:47:45.780 | some answer, except they're not gonna be able
00:47:47.420 | to really kind of produce millisecond by millisecond
00:47:49.660 | muscle control of how you stand up from your chair, right?
00:47:53.220 | So, but down to some level of abstraction
00:47:55.580 | where you can describe things by words,
00:47:57.860 | they might be able to give you a plan,
00:47:59.620 | but only under the condition that they've been trained
00:48:01.500 | to produce those kinds of plans, right?
00:48:04.180 | They're not gonna be able to plan for situations
00:48:06.700 | that they never encountered before.
00:48:09.420 | They basically are going to have to regurgitate
00:48:11.380 | the template that they've been trained on.
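The recursive prompting idea raised in the question can be sketched as follows. `ask_llm` is a hypothetical stand-in for whatever chat-completion call is available, not a real library function, and the depth limit is arbitrary.

```python
# Sketch of the recursive prompting idea from the conversation: ask an LLM for
# sub-steps, then ask again for each sub-step, down to a fixed depth.
# `ask_llm` is a hypothetical placeholder, not a real API.

def ask_llm(prompt: str) -> list[str]:
    """Placeholder: send the prompt to an LLM and parse a numbered list of steps."""
    raise NotImplementedError("wire this to an LLM client of your choice")

def expand(task: str, depth: int = 0, max_depth: int = 3) -> dict:
    """Recursively decompose a task into sub-steps via repeated prompting."""
    if depth >= max_depth:
        return {task: []}
    steps = ask_llm(f"Give me a short list of steps to accomplish: {task}")
    return {task: [expand(step, depth + 1, max_depth) for step in steps]}

# Example (requires a real ask_llm implementation):
# expand("get from New York to Paris")
```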
00:48:12.980 | So where, like, just for the example of New York to Paris,
00:48:16.020 | is it gonna start getting into trouble?
00:48:18.940 | Like at which layer of abstraction do you think you'll start?
00:48:22.620 | 'Cause like I can imagine almost every single part of that
00:48:25.420 | an LLM will be able to answer somewhat accurately,
00:48:27.760 | especially when you're talking about New York
00:48:29.340 | and Paris, major cities.
00:48:31.060 | - So, I mean, certainly an LLM would be able
00:48:33.940 | to solve that problem if you fine-tuned it for it.
00:48:36.660 | You know, so I can't say that an LLM cannot do this.
00:48:42.420 | It can do this if you train it for it, there's no question.
00:48:45.780 | Down to a certain level where things can be formulated
00:48:49.900 | in terms of words.
00:48:51.340 | But like, if you wanna go down to like how you, you know,
00:48:53.840 | climb down the stairs or just stand up from your chair
00:48:56.100 | in terms of words, like you can't, you can't do it.
00:48:59.380 | You need, that's one of the reasons you need experience
00:49:04.940 | of the physical world, which is much higher bandwidth
00:49:07.740 | than what you can express in words, in human language.
00:49:11.060 | - So everything we've been talking about
00:49:12.460 | on the joint embedding space, is it possible
00:49:15.740 | that that's what we need for like the interaction
00:49:18.020 | with physical reality on the robotics front?
00:49:20.620 | And then just the LLMs are the thing that sits on top of it
00:49:24.660 | for the bigger reasoning about like the fact
00:49:28.580 | that I need to book a plane ticket and I need to know,
00:49:31.660 | I know how to go to the websites and so on.
00:49:33.700 | - Sure, and you know, a lot of plans that people know about
00:49:37.060 | that are relatively high level are actually learned.
00:49:40.740 | They're not invented. Most people don't invent
00:49:45.260 | plans by themselves, you know,
00:49:50.260 | we have some ability to do this, of course, obviously,
00:49:54.180 | but most plans that people use are plans
00:49:57.920 | that they've been trained on.
00:49:59.540 | Like they've seen other people use those plans
00:50:01.280 | or they've been told how to do things, right?
00:50:04.180 | You can't invent them. Like, take a person
00:50:07.660 | who's never heard of airplanes and ask them,
00:50:10.220 | how do you go from New York to Paris?
00:50:11.660 | They're probably not going to be able to kind of,
00:50:14.700 | you know, deconstruct the whole plan
00:50:16.180 | unless they've seen examples of that before.
00:50:18.820 | So certainly LLMs are going to be able to do this,
00:50:20.740 | but then how you link this from the low level of actions,
00:50:25.740 | that needs to be done with things like JEPA that basically
00:50:32.400 | lifts the abstraction level of the representation
00:50:34.780 | without attempting to reconstruct every detail
00:50:36.700 | of the situation.
00:50:38.080 | That's what we need JEPA for.
00:50:40.740 | I would love to sort of linger on your skepticism
00:50:44.260 | around autoregressive LLMs.
00:50:48.400 | So one way I would like to test that skepticism
00:50:51.960 | is everything you say makes a lot of sense.
00:50:54.980 | But if I apply everything you said today and in general
00:51:01.500 | to like, I don't know, 10 years ago,
00:51:04.080 | maybe a little bit less, no, let's say three years ago,
00:51:07.900 | I wouldn't be able to predict the success of LLMs.
00:51:12.620 | So does it make sense to you that autoregressive LLMs
00:51:17.060 | are able to be so damn good?
00:51:19.620 | - Yes.
00:51:21.780 | - Can you explain your intuition?
00:51:24.260 | Because if I were to take your wisdom and intuition
00:51:29.120 | at face value, I would say there's no way
00:51:31.420 | autoregressive LLMs, one token at a time,
00:51:34.300 | would be able to do the kind of things they're doing.
00:51:36.260 | - No, there's one thing that autoregressive LLMs,
00:51:39.260 | or that LLMs in general, not just the autoregressive ones,
00:51:42.420 | but including the BERT-style bidirectional ones,
00:51:45.260 | are exploiting, and it's self-supervised learning.
00:51:49.220 | And I've been a very, very strong advocate
00:51:51.060 | of self-supervised learning for many years.
00:51:53.300 | So those things are an incredibly impressive demonstration
00:51:58.300 | that self-supervised learning actually works.
00:52:02.140 | The idea didn't start with BERT,
00:52:07.140 | but BERT was really kind of a good demonstration of this.
00:52:09.660 | So the idea that you take a piece of text, you corrupt it,
00:52:14.660 | and then you train a gigantic neural net
00:52:16.200 | to reconstruct the parts that are missing,
00:52:18.300 | that has
00:52:20.920 | produced an enormous amount of benefits.
00:52:25.680 | It allowed us to create systems that understand language,
00:52:31.380 | systems that can translate hundreds of languages
00:52:34.980 | in any direction, systems that are multilingual.
00:52:38.200 | So they're not, it's a single system that can be trained
00:52:40.900 | to understand hundreds of languages
00:52:43.260 | and translate in any direction and produce summaries
00:52:48.260 | and then answer questions and produce text.
00:52:51.780 | And then there's a special case of it,
00:52:54.740 | which is the autoregressive trick,
00:52:56.620 | where you constrain the system
00:52:58.580 | to not elaborate a representation of the text
00:53:02.020 | from looking at the entire text,
00:53:03.740 | but only predicting a word
00:53:06.540 | from the words that come before, right?
00:53:08.580 | You do this by constraining
00:53:10.380 | the architecture of the network.
00:53:11.580 | And that's how you can build an autoregressive LLM.
00:53:15.140 | So there was a surprise many years ago
00:53:17.660 | with what's called decoder-only LLMs.
00:53:20.940 | So, you know, systems of this type
00:53:23.120 | that are just trying to produce words from the previous one
00:53:28.120 | and the fact that when you scale them up,
00:53:31.260 | they tend to really kind of understand
00:53:35.900 | more about language when you train them on lots of data,
00:53:38.140 | you make them really big.
00:53:39.380 | That was kind of a surprise.
00:53:40.720 | And that surprise occurred quite a while back,
00:53:42.900 | like, you know, with work from, you know,
00:53:47.900 | Google meta, open AI, et cetera, you know,
00:53:50.580 | going back to, you know, the GPT kind of work
00:53:54.620 | general pre-trained transformers.
00:53:56.820 | - You mean like GPT-2, like there's a certain place
00:54:00.380 | where you start to realize scaling
00:54:02.060 | might actually keep giving us an emergent benefit.
00:54:06.720 | - Yeah, I mean, there were work from various places,
00:54:09.240 | but if you want to kind of, you know,
00:54:12.900 | place it in the GPT timeline,
00:54:16.380 | that would be around GPT-2, yeah.
00:54:17.980 | - Well, I just, 'cause you said it,
00:54:20.860 | you're so charismatic and you said so many words,
00:54:23.620 | but self-supervised learning, yes.
00:54:25.880 | But again, the same intuition you're applying
00:54:29.060 | to saying that autoregressive LLMs
00:54:31.600 | cannot have a deep understanding of the world,
00:54:35.240 | if we just apply that same intuition,
00:54:38.060 | does it make sense to you that they're able to form
00:54:41.680 | enough of a representation of the world
00:54:43.840 | to be damn convincing, essentially passing
00:54:48.340 | the original Turing test with flying colors?
00:54:50.840 | - Well, we're fooled by their fluency, right?
00:54:53.320 | We just assume that if a system is fluent
00:54:56.100 | in manipulating language, then it has
00:54:58.900 | all the characteristics of human intelligence,
00:55:00.780 | but that impression is false.
00:55:04.140 | We're really fooled by it.
00:55:06.560 | - What do you think Alan Turing would say?
00:55:08.940 | Without understanding anything, just hanging out with it.
00:55:11.420 | - Alan Turing would decide that a Turing test
00:55:13.140 | is a really bad test, okay?
00:55:15.520 | This is what the AI community has decided many years ago,
00:55:18.940 | that the Turing test was a really bad test of intelligence.
00:55:22.080 | - What would Hans Moravec say
00:55:23.320 | about the large language models?
00:55:26.300 | - Hans Moravec would say that Moravec paradox still applies.
00:55:29.760 | - Okay. - Okay.
00:55:31.340 | Okay, we can pass the--
00:55:32.180 | - You don't think he would be really impressed?
00:55:34.260 | - No, of course, everybody would be impressed,
00:55:35.800 | but it's not a question of being impressed or not.
00:55:39.980 | It's a question of knowing what the limit
00:55:42.100 | of those systems can do.
00:55:44.260 | Again, they are impressive.
00:55:45.940 | They can do a lot of useful things.
00:55:47.580 | There's a whole industry that is being built around them.
00:55:49.820 | They're gonna make progress.
00:55:51.900 | But there is a lot of things they cannot do,
00:55:53.720 | and we have to realize what they cannot do,
00:55:55.600 | and then figure out how we get there.
00:55:59.920 | And I'm not seeing this,
00:56:02.580 | I'm seeing this from basically 10 years of research
00:56:06.740 | on the idea of self-supervised learning.
00:56:12.200 | Actually, that's going back more than 10 years,
00:56:13.820 | but the idea of self-supervised learning,
00:56:15.300 | so basically capturing the internal structure
00:56:18.260 | of a piece of a set of inputs without training the system
00:56:22.460 | for any particular task, like learning representations.
00:56:25.220 | You know, the conference I co-founded 14 years ago
00:56:28.880 | is called International Conference
00:56:30.820 | on Learning Representations.
00:56:31.900 | That's the entire issue that deep learning
00:56:34.060 | is dealing with, right?
00:56:35.980 | And it's been my obsession for almost 40 years now, so.
00:56:39.900 | So learning representation is really the thing.
00:56:42.820 | For the longest time, we could only do this
00:56:44.340 | with supervised learning.
00:56:45.780 | And then we started working on what we used to call
00:56:48.940 | unsupervised learning, and sort of revived the idea
00:56:53.580 | of unsupervised learning in the early 2000s
00:56:56.660 | with Yoshua Bengio and Jeff Hinton,
00:56:59.340 | then discovered that supervised learning
00:57:00.780 | actually works pretty well if you can collect enough data.
00:57:03.940 | And so the whole idea of unsupervised self-supervised
00:57:07.460 | learning kind of took a backseat for a bit,
00:57:10.980 | and then I kind of tried to revive it
00:57:14.860 | in a big way, you know, starting in 2014,
00:57:18.580 | basically when we started FAIR,
00:57:20.540 | and really pushing for, like, finding new methods
00:57:24.740 | to do self-supervised learning, both for text
00:57:27.180 | and for images and for video and audio.
00:57:29.740 | And some of that work has been incredibly successful.
00:57:33.020 | I mean, the reason why we have multilingual
00:57:35.500 | translation systems, you know, things that do
00:57:38.300 | content moderation on Meta, for example, on Facebook,
00:57:41.780 | that are multilingual, to understand
00:57:43.300 | whether a piece of text is hate speech or not, or something,
00:57:46.460 | is due to that progress using self-supervised learning
00:57:48.740 | for NLP, combining this with, you know,
00:57:51.220 | transformer architectures and blah, blah, blah.
00:57:53.700 | But that's the big success of self-supervised learning.
00:57:55.740 | We had similar success in speech recognition,
00:57:59.020 | a system called Wave2Vec,
00:58:00.140 | which is also a joint embedding architecture, by the way,
00:58:02.460 | trained with contrastive learning.
00:58:03.700 | And that system also can produce
00:58:06.740 | speech recognition systems that are multilingual
00:58:10.540 | with mostly unlabeled data and only need a few minutes
00:58:14.140 | of labeled data to actually do speech recognition.
00:58:16.900 | That's amazing.
00:58:17.980 | We have systems now based on those combination of ideas
00:58:22.180 | that can do real-time translation of hundreds of languages
00:58:25.660 | into each other, speech to speech.
00:58:28.020 | - Speech to speech, even including,
00:58:30.220 | just fascinating languages that don't have written forms.
00:58:34.340 | - That's right. - They're spoken only.
00:58:35.500 | - That's right, we don't go through text.
00:58:36.780 | It goes directly from speech to speech
00:58:38.740 | using an internal representation of kind of speech units
00:58:41.220 | that are discrete. It's called textless NLP.
00:58:44.580 | We used to call it this way, but yeah.
00:58:47.100 | So that, I mean, incredible success there.
00:58:49.220 | And then, you know, for 10 years,
00:58:50.980 | we tried to apply this idea to learning representations
00:58:54.700 | of images by training a system to predict videos,
00:58:57.340 | learning intuitive physics by training a system
00:58:59.900 | to predict what's gonna happen in the video,
00:59:02.300 | and tried and tried and failed and failed
00:59:05.060 | with generative models, with models that predict pixels.
00:59:09.300 | We could not get them to learn
00:59:11.300 | good representations of images.
00:59:13.220 | We could not get them to learn
00:59:14.420 | good representations of videos.
00:59:16.420 | And we tried many times.
00:59:17.260 | We published lots of papers on it.
00:59:19.140 | Yeah, well, they kind of sort of worked,
00:59:20.820 | but not really great.
00:59:22.300 | They started working when
00:59:23.980 | we abandoned this idea of predicting every pixel
00:59:28.220 | and basically just did the joint embedding
00:59:30.420 | and predicted in representation space.
00:59:32.300 | That works.
00:59:33.260 | So there's ample evidence that we're not gonna be able
00:59:37.700 | to learn good representations of the real world
00:59:42.020 | using generative models.
00:59:43.260 | So I'm telling people,
00:59:44.620 | everybody's talking about generative AI.
00:59:46.820 | If you're really interested in human level AI,
00:59:48.860 | abandon the idea of generative AI.
00:59:50.620 | - Okay, but you really think it's possible
00:59:54.900 | to get far with the joint embedding representation.
00:59:57.420 | So like, there's common sense reasoning,
01:00:01.380 | and then there's high level reasoning.
01:00:05.700 | I feel like those are two,
01:00:07.580 | the kind of reasoning that LLMs are able to do,
01:00:11.620 | okay, let me not use the word reasoning,
01:00:13.620 | but the kind of stuff that LLMs are able to do
01:00:16.020 | seems fundamentally different
01:00:17.380 | than the common sense reasoning we use
01:00:19.540 | to navigate the world.
01:00:20.900 | It seems like we're gonna need both.
01:00:22.500 | Would you be able to get,
01:00:25.100 | with the joint embedding,
01:00:25.980 | with a JEPA type of approach looking at video,
01:00:29.140 | would you be able to learn, let's see,
01:00:33.020 | well, how to get from New York to Paris,
01:00:35.460 | or how to understand the state of politics
01:00:40.460 | and the world today, right?
01:00:44.420 | These are things where various humans
01:00:46.700 | generate a lot of language and opinions on
01:00:49.020 | in the space of language,
01:00:50.100 | but don't visually represent that
01:00:52.860 | in any clearly compressible way.
01:00:56.060 | - Right, well, there's a lot of situations
01:00:58.020 | that might be difficult for a purely language-based system
01:01:02.740 | to know, like, okay, you can probably learn
01:01:07.180 | from reading texts, the entirety of the publicly available
01:01:10.780 | texts in the world that I cannot get
01:01:12.700 | from New York to Paris by snapping my fingers.
01:01:15.380 | That's not gonna work, right?
01:01:16.300 | - Yes.
01:01:17.140 | - But there's probably sort of more complex
01:01:20.980 | scenarios of this type,
01:01:22.300 | which an LLM may never have encountered
01:01:25.700 | and may not be able to determine
01:01:27.700 | whether it's possible or not.
01:01:29.860 | So that link from the low level to the high level,
01:01:34.860 | the thing is that the high level that language expresses
01:01:38.860 | is based on the common experience of the low level,
01:01:43.260 | which LLMs currently do not have.
01:01:45.220 | When we talk to each other,
01:01:47.660 | we know we have a common experience of the world,
01:01:50.620 | like a lot of it is similar, and LLMs don't have that.
01:01:59.060 | But see, it's present.
01:02:01.060 | You and I have a common experience of the world
01:02:02.860 | in terms of the physics of how gravity works
01:02:05.860 | and stuff like this, and that common knowledge of the world
01:02:10.860 | I feel like is there in the language.
01:02:15.500 | We don't explicitly express it,
01:02:17.780 | but if you have a huge amount of text,
01:02:21.180 | you're going to get this stuff that's between the lines.
01:02:24.180 | In order to form a consistent world model,
01:02:28.620 | you're going to have to understand how gravity works,
01:02:31.660 | even if you don't have an explicit explanation of gravity.
01:02:35.140 | So even though in the case of gravity,
01:02:37.360 | there is explicit explanation of gravity in Wikipedia.
01:02:40.020 | But the stuff that we think of as common sense reasoning,
01:02:45.020 | I feel like to generate language correctly,
01:02:49.300 | you're going to have to figure that out.
01:02:51.820 | Now you could say, as you have, there's not enough text.
01:02:54.300 | Sorry, okay, so what? (laughs)
01:02:56.940 | You don't think so.
01:02:57.780 | - No, I agree with what you just said,
01:02:59.160 | which is that to be able to do high-level common sense,
01:03:03.680 | to have high-level common sense,
01:03:04.780 | you need to have the low-level common sense
01:03:06.920 | to build on top of.
01:03:08.020 | - Yeah, but that's not there.
01:03:10.280 | - And that's not there in LLMs.
01:03:11.580 | LLMs are purely trained from text.
01:03:13.340 | So then the other statement you made,
01:03:15.380 | I would not agree with the fact that implicit
01:03:18.980 | in all languages in the world is the underlying reality.
01:03:22.740 | There's a lot about underlying reality,
01:03:24.460 | which is not expressed in language.
01:03:26.840 | Is that obvious to you?
01:03:27.960 | - Yeah, totally.
01:03:28.980 | - So like all the conversations we have,
01:03:34.340 | okay, there's the dark web, meaning whatever,
01:03:37.460 | the private conversations, like DMs and stuff like this,
01:03:41.160 | which is much, much larger probably than what's available,
01:03:44.900 | what LLMs are trained on.
01:03:46.880 | - You don't need to communicate the stuff that is common.
01:03:49.980 | - But the humor, all of it.
01:03:51.300 | No, you do.
01:03:52.140 | Like when you, you don't need to, but it comes through.
01:03:54.520 | Like if I accidentally knock this over,
01:03:58.300 | you'll probably make fun of me.
01:03:59.500 | In the content of you making fun of me
01:04:02.460 | will be an explanation of the fact that cups fall,
01:04:07.360 | and that gravity works in this way.
01:04:09.380 | And then you'll have some very vague information
01:04:12.700 | about what kind of things explode when they hit the ground.
01:04:16.740 | And then maybe you'll make a joke about entropy
01:04:19.000 | or something like this,
01:04:19.840 | and we'll never be able to reconstruct this again.
01:04:22.000 | Like, okay, you'll make a little joke like this,
01:04:25.060 | and there'll be trillion of other jokes.
01:04:27.020 | And from the jokes, you can piece together the fact
01:04:29.580 | that gravity works and mugs can break
01:04:31.900 | and all this kind of stuff.
01:04:32.860 | You don't need to see it; it'll be very inefficient.
01:04:36.860 | It's easier, like, to knock the thing over,
01:04:41.860 | but I feel like it would be there
01:04:44.380 | if you have enough of that data.
01:04:46.600 | - I just think that most of the information of this type
01:04:50.700 | that we have accumulated when we were babies
01:04:54.320 | is just not present in text, in any description, essentially.
01:04:59.320 | - And the sensory data is a much richer source
01:05:03.180 | for getting that kind of understanding.
01:05:04.360 | - I mean, that's the 16,000 hours of wake time
01:05:07.700 | of a four-year-old and 10 to the 15 bytes
01:05:11.600 | going through vision, just vision, right?
01:05:13.480 | There is a similar bandwidth of touch
01:05:17.600 | and a little less through audio.
01:05:20.500 | And then language doesn't come in until like a year in life.
01:05:25.500 | And by the time you are nine years old,
01:05:28.580 | you've learned about gravity.
01:05:30.780 | You know about inertia, you know about gravity,
01:05:32.700 | you know the stability, you know about the distinction
01:05:36.280 | between animate and inanimate objects.
01:05:38.100 | You know, by 18 months, you know about why people
01:05:42.380 | want to do things and you help them if they can't.
01:05:45.500 | I mean, there's a lot of things that you learn mostly
01:05:47.940 | by observation, really not even through interaction.
01:05:52.340 | In the first few months of life,
01:05:53.420 | babies don't really have any influence on the world.
01:05:55.900 | They can only observe, right?
01:05:58.080 | And you accumulate like a gigantic amount of knowledge
01:06:02.060 | just from that.
01:06:02.980 | So that's what we're missing from current AI systems.
01:06:06.400 | - I think in one of your slides, you have this nice plot
01:06:10.040 | that is one of the ways you show that LLMs are limited.
01:06:13.940 | I wonder if you could talk about hallucinations
01:06:16.120 | from your perspectives.
01:06:17.940 | The why hallucinations happen from large language models
01:06:22.940 | and why, and to what degree is that a fundamental flaw
01:06:27.540 | of large language models?
01:06:29.360 | - Right, so because of the autoregressive prediction,
01:06:34.100 | every time an LLM produces a token or a word,
01:06:37.220 | there is some level of probability for that word
01:06:40.740 | to take you out of the set of reasonable answers.
01:06:45.620 | And if you assume, which is a very strong assumption,
01:06:48.000 | that those errors are independent
01:06:50.620 | across a sequence
01:06:55.180 | of tokens being produced.
01:06:59.500 | What that means is that every time you produce a token,
01:07:02.400 | the probability that you stay within the set
01:07:05.420 | of correct answer decreases, and it decreases exponentially.
01:07:08.660 | - So there's a strong, like you said, assumption there
01:07:10.420 | that if there's a non-zero probability of making a mistake,
01:07:14.780 | which there appears to be,
01:07:16.260 | then there's going to be a kind of drift.
01:07:18.700 | - Yeah, and that drift is exponential.
01:07:21.360 | It's like errors accumulate, right?
01:07:23.740 | So the probability that an answer would be nonsensical
01:07:27.860 | increases exponentially with the number of tokens.
01:07:31.400 | - Is that obvious to you, by the way?
01:07:33.820 | Well, so mathematically speaking, maybe,
01:07:36.780 | but isn't there a kind of gravitational pull
01:07:40.220 | towards the truth, because on average, hopefully,
01:07:44.380 | the truth is well-represented in the training set?
01:07:48.940 | - No, it's basically a struggle
01:07:50.920 | against the curse of dimensionality.
01:07:55.540 | So the way you can correct for this
01:07:57.040 | is that you fine-tune the system
01:07:58.700 | by having it produce answers for all kinds of questions
01:08:02.540 | that people might come up with.
01:08:04.860 | And people are people, so a lot of the questions
01:08:08.100 | that they have are very similar to each other.
01:08:10.260 | So you can probably cover 80% or whatever
01:08:13.700 | of questions that people will ask by collecting data.
01:08:18.700 | And then you fine-tune the system
01:08:23.140 | to produce good answers for all of those things.
01:08:25.620 | And it's probably going to be able to learn that
01:08:28.280 | because it's got a lot of capacity to learn.
01:08:30.880 | But then there is the enormous set of prompts
01:08:36.920 | that you have not covered during training.
01:08:39.900 | And that set is enormous.
01:08:41.340 | Within the set of all possible prompts,
01:08:43.260 | the proportion of prompts that have been used for training
01:08:47.340 | is absolutely tiny.
01:08:48.640 | It's a tiny, tiny, tiny subset of all possible prompts.
01:08:53.940 | And so the system will behave properly on the prompts
01:08:56.600 | that it has been either pre-trained or fine-tuned on.
01:08:59.540 | But then there is an entire space of things
01:09:04.180 | that it cannot possibly have been trained on
01:09:06.840 | because the number is gigantic.
01:09:09.260 | So whatever training the system has been subjected to
01:09:14.260 | in order to produce appropriate answers,
01:09:18.060 | you can break it by finding out a prompt
01:09:20.540 | that will be outside of the set of prompts
01:09:24.540 | it's been trained on or things that are similar.
01:09:27.300 | And then it will just spew complete nonsense.
01:09:30.340 | - When you say prompt, do you mean that exact prompt?
01:09:33.460 | Or do you mean a prompt that's like
01:09:36.020 | in many parts very different than,
01:09:38.660 | like is it that easy to ask a question
01:09:42.620 | or to say a thing that hasn't been said before
01:09:45.540 | on the internet?
01:09:46.380 | - I mean, people have come up with things
01:09:48.340 | where like you put essentially a random sequence
01:09:52.300 | of characters in the prompt.
01:09:53.820 | And that's enough to kind of throw the system
01:09:56.060 | into a mode where it's gonna answer something
01:09:59.820 | completely different than it would have answered
01:10:02.060 | without this.
01:10:03.420 | So that's a way to jailbreak the system,
01:10:04.980 | basically get it to go outside of its conditioning, right?
01:10:09.340 | - So that's a very clear demonstration of it.
01:10:11.300 | But of course, that goes outside of what is designed to do.
01:10:16.300 | If you actually stitch together
01:10:20.900 | reasonably grammatical sentences,
01:10:22.900 | is it that easy to break it?
01:10:26.520 | - Yeah, some people have done things like
01:10:29.060 | you write a sentence in English, right?
01:10:31.260 | That has, or you ask a question in English
01:10:33.860 | and it produces a perfectly fine answer.
01:10:36.740 | And then you just substitute a few words
01:10:38.780 | by the same word in another language.
01:10:42.540 | And all of a sudden the answer is complete nonsense.
01:10:44.740 | - So I guess what I'm saying is like,
01:10:46.900 | which fraction of prompts that humans are likely to generate
01:10:51.900 | are going to break the system?
01:10:54.380 | - So the problem is that there is a long tail.
01:10:57.660 | - Yes.
01:10:58.620 | - This is an issue that a lot of people have realized
01:11:02.620 | in dealing with social networks and stuff like that,
01:11:04.140 | which is there's a very, very long tail
01:11:06.340 | of things that people will ask.
01:11:08.180 | And you can fine tune the system for the 80% or whatever
01:11:12.940 | of the things that most people will ask.
01:11:16.180 | And then this long tail is so large
01:11:18.700 | that you're not gonna be able to fine tune the system
01:11:20.780 | for all the conditions.
01:11:21.940 | And in the end, the system ends up being
01:11:23.780 | kind of a giant lookup table, right?
01:11:25.740 | Essentially, which is not really what you want.
01:11:27.820 | You want systems that can reason, certainly that can plan.
01:11:30.820 | So the type of reasoning that takes place in an LLM
01:11:33.820 | is very, very primitive.
01:11:35.540 | And the reason you can tell it's primitive
01:11:37.060 | is because the amount of computation that is spent
01:11:41.020 | per token produced is constant.
01:11:43.820 | So if you ask a question and that question has an answer
01:11:47.900 | in a given number of tokens,
01:11:50.340 | the amount of computation devoted to computing that answer
01:11:52.780 | can be exactly estimated.
01:11:54.820 | It's like, it's the size of the prediction network
01:12:00.060 | with its 36 layers or 92 layers or whatever it is,
01:12:03.140 | multiplied by number of tokens, that's it.
01:12:06.220 | And so essentially it doesn't matter
01:12:09.180 | if the question being asked is simple to answer,
01:12:14.180 | complicated to answer, impossible to answer
01:12:17.820 | because it's undecidable or something.
01:12:19.740 | The amount of computation the system will be able
01:12:23.100 | to devote to the answer is constant
01:12:25.620 | or is proportional to the number of tokens produced
01:12:27.900 | in the answer, right?
01:12:29.700 | This is not the way we work.
01:12:30.940 | The way we reason is that when we're faced
01:12:35.020 | with a complex problem or a complex question,
01:12:38.540 | we spend more time trying to solve it and answer it, right?
01:12:42.820 | Because it's more difficult.
01:12:43.900 | - There's a prediction element,
01:12:45.580 | there's a iterative element where you're like
01:12:48.020 | adjusting your understanding of a thing
01:12:52.500 | by going over and over and over.
01:12:54.740 | There's a hierarchical element, so on.
01:12:56.780 | Does this mean it's a fundamental flaw of LLMs?
01:12:59.500 | Does it mean that, there's more part to that question.
01:13:03.340 | Now you're just behaving like an LLM, immediately answering.
01:13:08.740 | No, that it's just the low-level world model
01:13:13.740 | on top of which we can then build
01:13:17.140 | some of these kinds of mechanisms, like you said,
01:13:19.560 | persistent long-term memory or reasoning, so on.
01:13:24.560 | But we need that world model that comes from language.
01:13:28.440 | Is it, maybe it is not so difficult
01:13:30.760 | to build this kind of reasoning system
01:13:33.660 | on top of a well-constructed world model.
01:13:36.740 | - Okay, whether it's difficult or not,
01:13:38.440 | the near future will tell,
01:13:40.900 | because a lot of people are working on reasoning
01:13:43.580 | and planning abilities for dialogue systems.
01:13:46.720 | I mean, even if we restrict ourselves to language,
01:13:50.660 | just having the ability to plan your answer
01:13:54.640 | before you answer, in terms that are not necessarily linked
01:13:59.420 | with the language you're gonna use to produce the answer.
01:14:02.220 | So this idea of this mental model
01:14:03.980 | that allows you to plan what you're gonna say
01:14:05.960 | before you say it.
01:14:06.820 | That is very important.
01:14:11.680 | I think there's going to be a lot of systems
01:14:13.820 | over the next few years
01:14:14.820 | that are going to have this capability,
01:14:17.340 | but the blueprint of those systems
01:14:19.660 | would be extremely different from auto-regressive LLMs.
01:14:23.140 | So it's the same difference
01:14:27.940 | as the difference between what psychologists call
01:14:30.660 | system one and system two in humans, right?
01:14:32.580 | So system one is the type of tasks
01:14:34.820 | that you can accomplish without deliberately, consciously
01:14:37.840 | thinking about how you do them.
01:14:40.280 | You just do them.
01:14:42.080 | You've done them enough
01:14:43.040 | that you can just do it subconsciously, right?
01:14:45.380 | Without thinking about them.
01:14:46.520 | If you're an experienced driver,
01:14:48.580 | you can drive without really thinking about it.
01:14:51.080 | And you can talk to someone at the same time
01:14:52.700 | or listen to the radio, right?
01:14:54.140 | If you are a very experienced chess player,
01:14:58.300 | you can play against a non-experienced chess player
01:15:01.060 | without really thinking either.
01:15:02.580 | You just recognize the pattern and you play, right?
01:15:05.380 | That's system one.
01:15:06.640 | So all the things that you do instinctively
01:15:09.760 | without really having to deliberately plan
01:15:12.660 | and think about it.
01:15:13.480 | And then there is all the tasks where you need to plan.
01:15:15.200 | So if you are a not so experienced chess player
01:15:19.540 | or you are experienced,
01:15:20.660 | or you play against another experienced chess player,
01:15:22.980 | you think about all kinds of options, right?
01:15:24.760 | You think about it for a while, right?
01:15:27.220 | And you're much better if you have time to think about it
01:15:30.520 | than you are if you play blitz with limited time.
01:15:34.580 | So this type of deliberate planning,
01:15:39.580 | which uses your internal world model, that's system two.
01:15:44.540 | This is what LLMs currently cannot do.
01:15:46.540 | So how do we get them to do this, right?
01:15:48.580 | How do we build a system that can do this kind of planning
01:15:53.340 | that or reasoning that devotes more resources
01:15:57.420 | to complex problems than to simple problems?
01:16:00.320 | And it's not going to be autoregressive prediction of tokens.
01:16:03.780 | It's going to be more something akin to inference
01:16:08.060 | of latent variables in what used to be called
01:16:13.060 | probabilistic models or graphical models
01:16:16.260 | and things of that type.
01:16:17.720 | So basically, the principle is like this.
01:16:19.720 | You know, the prompt is like observed variables.
01:16:24.640 | And what the model does is that it's basically a measure of,
01:16:31.000 | it can measure to what extent an answer
01:16:36.180 | is a good answer for a prompt, okay?
01:16:38.960 | So think of it as some gigantic neural net,
01:16:41.180 | but it's got only one output.
01:16:42.660 | And that output is a scalar number,
01:16:45.180 | which is let's say zero if the answer is a good answer
01:16:48.580 | for the question and a large number
01:16:51.120 | if the answer is not a good answer for the question.
01:16:53.500 | Imagine you had this model.
01:16:55.460 | If you had such a model,
01:16:56.620 | you could use it to produce good answers.
01:16:58.900 | The way you would do is, you know, produce the prompt
01:17:02.520 | and then search through the space of possible answers
01:17:05.260 | for one that minimizes that number.
01:17:07.460 | That's called an energy-based model.
01:17:11.580 | But that energy-based model would need the model
01:17:16.420 | constructed by the LLM.
01:17:18.580 | - Well, so really what you need to do would be
01:17:21.340 | to not search over possible strings of text
01:17:24.940 | that minimize that energy.
01:17:27.780 | But what you would do is do this
01:17:29.420 | in abstract representation space.
01:17:31.060 | So in sort of the space of abstract thoughts,
01:17:34.500 | you would elaborate a thought, right,
01:17:37.060 | using this process of minimizing the output
01:17:40.960 | of your model, okay, which is just a scalar.
01:17:43.860 | It's an optimization process, right?
01:17:46.460 | So now the way the system produces its answer
01:17:49.420 | is through optimization by, you know,
01:17:53.140 | minimizing an objective function, basically, right?
01:17:56.380 | And this is, we're talking about inference.
01:17:57.720 | We're not talking about training, right?
01:17:59.300 | The system has been trained already.
01:18:01.060 | So now we have an abstract representation
01:18:03.040 | of the thought of the answer, representation of the answer.
01:18:06.660 | We feed that to basically an autoregressive decoder,
01:18:10.680 | which can be very simple, that turns this into a text
01:18:13.580 | that expresses this thought, okay?
01:18:15.700 | So that, in my opinion, is the blueprint
01:18:18.100 | of future dialogue systems.
01:18:20.220 | They will think about their answer,
01:18:23.460 | plan their answer by optimization
01:18:25.900 | before turning it into text.
01:18:27.340 | And that is Turing complete.
01:18:31.300 | - Can you explain exactly
01:18:32.380 | what the optimization problem there is?
01:18:34.500 | Like, what's the objective function?
01:18:37.740 | Just to linger on it, you kind of briefly described it,
01:18:40.500 | but over what space are you optimizing?
01:18:43.800 | - The space of representations.
01:18:45.720 | - It goes abstract representation.
01:18:47.820 | - So you have an abstract representation inside the system.
01:18:51.620 | You have a prompt.
01:18:52.500 | The prompt goes through an encoder,
01:18:53.660 | produces a representation,
01:18:55.180 | perhaps goes through a predictor
01:18:56.400 | that predicts a representation of the answer,
01:18:58.460 | of the proper answer.
01:18:59.820 | But that representation may not be a good answer
01:19:04.180 | because there might be some complicated reasoning
01:19:06.600 | you need to do, right?
01:19:07.600 | So then you have another process
01:19:11.240 | that takes the representation of the answers
01:19:14.480 | and modifies it so as to minimize a cost function
01:19:19.480 | that measures to what extent the answer
01:19:21.560 | is a good answer for the question.
01:19:23.020 | Now, let's sort of ignore for a moment
01:19:27.840 | the issue
01:19:29.760 | of how you train that system to measure
01:19:32.420 | whether an answer is a good answer for a question.
01:19:36.000 | - But suppose such a system could be created.
01:19:38.960 | But what's the process, this kind of search-like process?
01:19:42.440 | - It's an optimization process.
01:19:44.040 | You can do this if the entire system is differentiable,
01:19:47.680 | that scalar output is the result
01:19:50.120 | of running through some neural net,
01:19:52.560 | running the representation of the answers
01:19:55.760 | through some neural net.
01:19:56.900 | Then by gradient descent, by backpropagating gradients,
01:20:00.640 | you can figure out how to modify the representation
01:20:03.320 | of the answers so as to minimize that.
01:20:05.160 | - So that's still a gradient-based.
01:20:06.680 | - It's gradient-based inference.
01:20:08.600 | So now you have a representation of the answer
01:20:10.480 | in abstract space.
01:20:12.080 | Now you can turn it into text, right?
01:20:15.640 | And the cool thing about this is that
01:20:18.660 | the representation now can be optimized
01:20:21.600 | through gradient descent,
01:20:22.520 | but also is independent of the language
01:20:24.640 | in which you're going to express the answer.
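Here is a minimal PyTorch sketch of this "inference as optimization" idea: the energy network's weights stay frozen and gradient descent runs on a latent answer representation z so as to minimize a scalar energy. The tiny two-layer energy net and random prompt representation are toy stand-ins, not a trained model.

```python
# Gradient-based inference: optimize a latent answer representation z,
# not the network weights, to minimize a scalar energy E(prompt_repr, z).
import torch

d = 64                                      # dimensionality of the representations
energy_net = torch.nn.Sequential(           # toy energy function (would be pretrained)
    torch.nn.Linear(2 * d, 128), torch.nn.ReLU(), torch.nn.Linear(128, 1)
)
for p in energy_net.parameters():
    p.requires_grad_(False)                 # weights frozen at inference time

prompt_repr = torch.randn(d)                # output of an encoder (random toy here)
z = torch.zeros(d, requires_grad=True)      # latent representation of the answer
opt = torch.optim.SGD([z], lr=0.1)

for step in range(100):                     # the inference loop is an optimization
    energy = energy_net(torch.cat([prompt_repr, z])).squeeze()
    opt.zero_grad()
    energy.backward()                       # gradients flow into z only
    opt.step()

# z now encodes the (toy) low-energy answer; a decoder would turn it into text,
# independently of the language used to express it.
```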
01:20:27.600 | - Right, so you're operating in this abstract representation.
01:20:30.080 | I mean, this goes back to the joint embedding
01:20:32.640 | that is better to work in the space of, I don't know,
01:20:37.320 | or to romanticize the notion like space of concepts
01:20:40.680 | versus the space of concrete sensory information.
01:20:45.680 | - Right.
01:20:47.320 | - Okay, but can this do something like reasoning,
01:20:50.720 | which is what we're talking about?
01:20:51.960 | - Well, not really, only in a very simple way.
01:20:54.160 | I mean, basically you can think of those things
01:20:56.440 | as doing the kind of optimization I was talking about,
01:21:00.320 | except they're optimizing in a discrete space,
01:21:02.280 | which is the space of possible sequences of tokens.
01:21:05.880 | And they do this optimization in a horribly inefficient way,
01:21:09.280 | which is generate a lot of hypotheses
01:21:11.280 | and then select the best ones.
01:21:13.400 | And that's incredibly wasteful in terms of computation.
01:21:18.400 | 'Cause you basically have to run your LLM
01:21:20.880 | for like every possible generated sequence.
01:21:24.880 | And it's incredibly wasteful.
01:21:28.880 | So it's much better to do an optimization
01:21:32.480 | in continuous space where you can do gradient descent
01:21:35.040 | as opposed to like generate tons of things
01:21:36.760 | and then select the best.
01:21:38.200 | You just iteratively refine your answer
01:21:41.120 | to go towards the best, right?
01:21:42.960 | That's much more efficient.
01:21:44.280 | But you can only do this in continuous spaces
01:21:46.560 | with differentiable functions.
01:21:48.200 | - You're talking about the reasoning,
01:21:50.360 | like ability to think deeply or to reason deeply.
01:21:55.200 | How do you know what is an answer
01:21:59.240 | that's better or worse based on deep reasoning?
01:22:04.720 | - Right, so then we're asking the question of conceptually,
01:22:07.480 | how do you train an energy-based model, right?
01:22:09.380 | So an energy-based model is a function
01:22:11.920 | with a scalar output, just a number.
01:22:13.900 | You give it two inputs, X and Y,
01:22:17.340 | and it tells you whether Y is compatible with X or not.
01:22:20.480 | X you observe, let's say it's a prompt,
01:22:22.680 | an image, a video, whatever.
01:22:24.680 | And Y is a proposal for an answer,
01:22:28.120 | a continuation of the video, you know, whatever.
01:22:32.440 | And it tells you whether Y is compatible with X.
01:22:35.080 | And the way it tells you that Y is compatible with X
01:22:37.440 | is that the output of that function will be zero
01:22:39.800 | if Y is compatible with X.
01:22:41.200 | It would be a positive number, non-zero,
01:22:44.800 | if Y is not compatible with X.
01:22:46.380 | Okay, how do you train a system like this
01:22:49.800 | at a completely general level?
01:22:51.880 | Is you show it pairs of X and Y that are compatible,
01:22:56.200 | a question and the corresponding answer,
01:22:58.840 | and you train the parameters of the big neural net inside
01:23:01.720 | to produce zero.
01:23:03.680 | Okay, now that doesn't completely work
01:23:07.280 | because the system might decide,
01:23:08.920 | well, I'm just gonna say zero for everything.
01:23:11.680 | So now you have to have a process to make sure
01:23:13.520 | that for a wrong Y, the energy would be larger than zero.
01:23:18.520 | And there you have two options.
01:23:20.560 | One is contrastive methods.
01:23:21.840 | So contrastive method is you show an X and a bad Y
01:23:25.040 | and you tell the system, well, that's, you know,
01:23:28.400 | give a high energy to this, like push up the energy, right?
01:23:30.880 | Change the weights in the neural net
01:23:32.320 | that computes the energy so that it goes up.
01:23:34.480 | So that's contrastive methods.
01:23:37.680 | The problem with this is if the space of Y is large,
01:23:41.320 | the number of such contrastive samples
01:23:44.680 | you're gonna have to show is gigantic.
01:23:48.640 | And people do this, they do this when you train a system
01:23:52.800 | with RLHF, basically what you're training
01:23:55.200 | is what's called a reward model,
01:23:57.640 | which is basically an objective function
01:24:00.160 | that tells you whether an answer is good or bad.
01:24:02.560 | And that's basically exactly what this is.
01:24:06.960 | So we already do this to some extent.
01:24:08.560 | We're just not using it for inference.
01:24:09.960 | We're just using it for training.
01:24:11.600 | There is another set of methods which are non-contrastive
01:24:17.360 | and I prefer those, and those non-contrastive methods
01:24:20.960 | basically say, okay, the energy function
01:24:25.960 | needs to have low energy on pairs of X, Y's
01:24:28.760 | that are compatible, that come from your training set.
01:24:31.480 | How do you make sure that the energy
01:24:34.160 | is gonna be higher everywhere else?
01:24:36.080 | And the way you do this is by having a regularizer,
01:24:42.240 | a criterion, a term in your cost function
01:24:45.200 | that basically minimizes the volume of space
01:24:49.200 | that can take low energy.
01:24:50.440 | And there are all kinds of different
01:24:54.160 | specific ways to do this depending on the architecture,
01:24:56.440 | but that's the basic principle.
01:24:58.560 | So that if you push down the energy function
01:25:01.000 | for particular regions in the X, Y space,
01:25:04.080 | it will automatically go up in other places
01:25:06.160 | because there's only a limited volume of space
01:25:09.360 | that can take low energy, okay,
01:25:11.920 | by the construction of the system or by the regularizer,
01:25:14.840 | the regularizing function.
01:25:16.840 | - We've been talking very generally,
01:25:18.880 | but what is a good X and a good Y?
01:25:21.480 | What is a good representation of X and Y?
01:25:25.880 | 'Cause we've been talking about language
01:25:27.320 | and if you just take language directly,
01:25:30.520 | that presumably is not good.
01:25:32.320 | So there has to be some kind of
01:25:33.320 | abstract representation of ideas.
01:25:36.240 | - Yeah, so you can do this with language directly
01:25:39.760 | by just X is a text and Y is a continuation of that text.
01:25:43.640 | Or X is a question, Y is the answer.
01:25:48.200 | - But you're saying that's not gonna take,
01:25:49.720 | I mean, that's going to do what LLMs are doing.
01:25:52.720 | - Well, no, it depends on how you,
01:25:54.640 | how the internal structure of the system is built.
01:25:57.280 | If the internal structure of the system is built
01:25:59.480 | in such a way that inside of the system,
01:26:02.240 | there is a latent variable, let's call it Z,
01:26:04.760 | that you can manipulate so as to minimize the output energy.
01:26:12.920 | Then that Z can be viewed as representation of a good answer
01:26:16.760 | that you can translate into a Y that is a good answer.
01:26:19.520 | - So this kind of system could be trained
01:26:22.720 | in a very similar way.
01:26:24.640 | - Very similar way, but you have to have this way
01:26:26.760 | of preventing collapse, of ensuring that, you know,
01:26:30.360 | there is high energy for things you don't train it on.
01:26:33.120 | And currently it's very implicit in LLMs,
01:26:38.720 | it's done in a way that people don't realize is being done,
01:26:40.720 | but it is being done.
01:26:42.680 | It's due to the fact that when you give a high probability
01:26:45.960 | to a word, automatically you give low probability
01:26:50.800 | to other words, because you only have a finite amount
01:26:54.400 | of probability to go around right there to sum to one.
01:26:57.800 | So when you minimize the cross entropy or whatever,
01:27:00.520 | when you train your LLM to produce the,
01:27:03.240 | to predict the next word,
01:27:04.560 | you're increasing the probability your system will give
01:27:08.480 | to the correct word, but you're also decreasing
01:27:10.200 | the probability it will give to the incorrect words.
01:27:12.360 | Now, indirectly, that gives
01:27:17.120 | a high probability to sequences of words that are good
01:27:19.480 | and low probability to sequences of words that are bad,
01:27:21.720 | but it's very indirect.
01:27:23.600 | And it's not obvious why this actually works at all,
01:27:26.800 | but because you're not doing it on a joint probability
01:27:31.080 | of all the symbols in a sequence,
01:27:32.920 | you're just kind of factorizing
01:27:36.920 | that probability in terms of conditional probabilities
01:27:39.640 | over successive tokens.
01:27:41.480 | - How do you do this for visual data?
01:27:44.000 | - So we've been doing this with JEPA architectures,
01:27:46.160 | basically with I-JEPA.
01:27:48.040 | So there, the compatibility between two things is:
01:27:53.040 | here's an image or a video, here's a corrupted, shifted,
01:27:57.480 | or transformed version of that image or video, or a masked one.
01:28:01.080 | Okay, and then the energy of the system is the prediction
01:28:05.800 | error of the representation of the image.
01:28:11.800 | The predicted representation of the good thing,
01:28:14.480 | versus the actual representation of the good thing, right?
01:28:17.360 | So you run the corrupted image through the system,
01:28:20.840 | predict the representation of the good input, uncorrupted,
01:28:24.600 | and then compute the prediction error.
01:28:26.400 | That's the energy of the system.
01:28:28.040 | So this system will tell you, this is a good,
01:28:31.760 | this is a good image and this is a corrupted version.
01:28:36.680 | It will give you zero energy if those two things
01:28:39.000 | are effectively, one of them is a corrupted version
01:28:42.280 | of the other.
01:28:43.120 | It gives you a high energy if the two images
01:28:45.280 | are completely different.
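A sketch of that energy computed in representation space, JEPA-style: encode the corrupted view, predict the representation of the clean view, and use the prediction error as the energy. The tiny convolutional encoder, linear predictor, and crude random masking are placeholders, not the actual I-JEPA networks.

```python
# Prediction error in representation space as the energy of a (clean, corrupted) pair.
import torch

encoder = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
)                                           # image -> 8-dim representation (toy)
predictor = torch.nn.Linear(8, 8)           # predicts the clean representation

def energy(clean: torch.Tensor, corrupted: torch.Tensor) -> torch.Tensor:
    """Low when `corrupted` is a corrupted version of `clean`, high otherwise."""
    target = encoder(clean).detach()              # representation of the clean view
    prediction = predictor(encoder(corrupted))    # predicted from the corrupted view
    return ((prediction - target) ** 2).mean()    # prediction error = energy

clean = torch.rand(1, 3, 32, 32)
corrupted = clean * (torch.rand_like(clean) > 0.5)   # crude random masking
print(energy(clean, corrupted).item())
```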
01:28:46.480 | - And hopefully that whole process gives you a really nice
01:28:49.760 | compressed representation of reality, of visual reality.
01:28:54.560 | - And we know it does because then we use those
01:28:56.440 | representations as input to a classification system.
01:28:59.360 | - That classification system works really nicely, okay.
01:29:02.000 | Well, so to summarize, you recommend in a spicy way
01:29:08.560 | that only yellow raccoon can.
01:29:10.400 | You recommend that we abandon generative models
01:29:12.700 | in favor of joint embedding architectures.
01:29:15.280 | Abandon autoregressive generation.
01:29:17.760 | Abandon, this feels like a court testimony.
01:29:21.740 | Abandon probabilistic models in favor of energy-based models
01:29:25.160 | as we talked about.
01:29:26.220 | Abandon contrastive methods in favor of regularized methods.
01:29:30.100 | And let me ask you about this.
01:29:32.100 | You've been for a while a critic of reinforcement learning.
01:29:37.000 | So the last recommendation is that we abandon RL
01:29:41.320 | in favor of model predictive control,
01:29:43.640 | as you were talking about, and only use RL
01:29:46.560 | when planning doesn't yield the predicted outcome.
01:29:50.440 | And we use RL in that case to adjust the world model
01:29:54.600 | or the critic.
01:29:55.960 | So you mentioned RLHF, reinforcement learning
01:30:00.960 | with human feedback.
01:30:02.980 | Why do you still hate reinforcement learning?
01:30:05.840 | - I don't hate reinforcement learning
01:30:07.080 | and I think it should not be abandoned completely,
01:30:12.080 | but I think its use should be minimized
01:30:14.480 | because it's incredibly inefficient in terms of samples.
01:30:18.440 | And so the proper way to train a system
01:30:21.400 | is to first have it learn good representations of the world
01:30:26.400 | and world models from mostly observation,
01:30:29.620 | maybe a little bit of interactions.
01:30:31.560 | - And then steered based on that.
01:30:33.080 | If the representation is good,
01:30:34.280 | then the adjustments should be minimal.
01:30:36.800 | - Yeah, and now there's two things.
01:30:38.060 | You can use, if you've learned a world model,
01:30:40.000 | you can use the world model to plan a sequence of actions
01:30:42.680 | to arrive at a particular objective.
01:30:44.480 | You don't need RL unless the way you measure
01:30:49.560 | whether you succeed might be inexact.
01:30:51.360 | Your idea of whether you were gonna fall from your bike
01:30:56.260 | might be wrong, or whether the person you're fighting
01:31:01.560 | with MMA was gonna do something and then do something else.
01:31:04.560 | So there's two ways you can be wrong.
01:31:09.520 | Either your objective function does not reflect
01:31:13.680 | the actual objective function you want to optimize,
01:31:16.360 | or your world model is inaccurate, right?
01:31:19.760 | So you didn't, the prediction you were making
01:31:22.060 | about what was gonna happen in the world is inaccurate.
01:31:25.280 | So if you want to adjust your world model
01:31:27.280 | while you are operating the world,
01:31:30.880 | or your objective function,
01:31:32.680 | that is basically in the realm of RL.
01:31:35.880 | This is what RL deals with to some extent, right?
01:31:39.600 | So adjust your world model.
01:31:41.080 | And the way to adjust your world model, even in advance,
01:31:44.200 | is to explore parts of the space where your world model,
01:31:48.180 | where you know that your world model is inaccurate.
01:31:50.720 | That's called curiosity basically, or play, right?
01:31:54.080 | When you play, you kind of explore parts of the state space
01:31:58.680 | that you don't want to do for real,
01:32:03.680 | because it might be dangerous,
01:32:05.800 | but you can adjust your world model
01:32:07.880 | without killing yourself, basically.
01:32:11.640 | So that's what you want to use RL for.
01:32:15.120 | When it comes time to learning a particular task,
01:32:18.720 | you already have all the good representations,
01:32:20.560 | you already have your world model,
01:32:21.840 | but you need to adjust it for the situation at hand.
01:32:25.200 | That's when you use RL.
01:32:26.640 | - Why do you think RLHF works so well?
01:32:29.620 | This reinforcement learning with human feedback.
01:32:32.600 | Why did it have such a transformational effect
01:32:34.880 | on large language models before?
01:32:37.440 | - So what's had the transformational effect
01:32:39.920 | is human feedback.
01:32:42.000 | There is many ways to use it,
01:32:43.560 | and some of it is just purely supervised, actually.
01:32:45.760 | It's not really reinforcement learning.
01:32:47.440 | - So it's the HF.
01:32:49.280 | - It's the HF.
01:32:50.180 | And then there is various ways to use human feedback, right?
01:32:53.320 | So you can ask humans to rate answers,
01:32:57.240 | multiple answers that are produced by a world model.
01:33:00.020 | And then what you do is you train an objective function
01:33:05.560 | to predict that rating.
01:33:07.380 | And then you can use that objective function
01:33:11.520 | to predict whether an answer is good,
01:33:13.680 | and you can backpropagate gradient through this
01:33:15.120 | to fine tune your system
01:33:16.200 | so that it only produces highly rated answers.
01:33:19.880 | Okay, so that's one way.
01:33:22.680 | So that's like, in RL that means training
01:33:27.360 | what's called a reward model, right?
01:33:29.380 | So something that, basically a small neural net
01:33:31.800 | that estimates to what extent an answer is good, right?
01:33:35.120 | It's very similar to the objective
01:33:36.600 | I was talking about earlier for planning,
01:33:39.720 | except now it's not used for planning.
01:33:41.320 | It's used for fine tuning your system.
01:33:43.180 | I think it would be much more efficient
01:33:45.560 | to use it for planning,
01:33:46.520 | but currently it's used to fine tune
01:33:51.040 | the parameters of the system.
01:33:52.620 | Now there's several ways to do this.
01:33:54.920 | You know, some of them are supervised.
01:33:57.520 | You just, you know, ask a human person like,
01:34:00.000 | what is a good answer for this, right?
01:34:02.360 | Then you just type the answer.
01:34:04.240 | I mean, there's lots of ways
01:34:07.160 | that those systems are being adjusted.
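A minimal sketch of that reward-model recipe, with toy embeddings standing in for answers and a single linear layer standing in for the tunable part of an LLM; the sizes, data, and training loop are illustrative assumptions, not Meta's pipeline:

```python
# Step 1: fit a small reward model on human preference ratings.
# Step 2: freeze it and fine-tune a stand-in "policy" by backpropagating
# the reward model's score, so the policy produces highly rated answers.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
DIM = 16  # assumed size of an answer representation (stand-in for text)

# Reward model: a small neural net that estimates how good an answer is.
reward_model = nn.Sequential(nn.Linear(DIM, 32), nn.ReLU(), nn.Linear(32, 1))

# Toy human feedback: pairs of (preferred answer, rejected answer) embeddings.
preferred = torch.randn(64, DIM)
rejected = torch.randn(64, DIM)

opt_rm = torch.optim.Adam(reward_model.parameters(), lr=1e-2)
for _ in range(200):
    # Bradley-Terry style loss: the preferred answer should score higher.
    margin = reward_model(preferred) - reward_model(rejected)
    loss = -F.logsigmoid(margin).mean()
    opt_rm.zero_grad()
    loss.backward()
    opt_rm.step()

# Freeze the reward model; only the policy will be updated from here on.
for p in reward_model.parameters():
    p.requires_grad_(False)

policy = nn.Linear(DIM, DIM)    # stand-in for the tunable part of an LLM
prompts = torch.randn(64, DIM)  # stand-in prompt representations

opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-2)
for _ in range(200):
    answers = policy(prompts)
    score = reward_model(answers).mean()  # predicted "rating" of the answers
    opt_pi.zero_grad()
    (-score).backward()                   # maximize the predicted rating
    opt_pi.step()

print("mean predicted rating after tuning:",
      reward_model(policy(prompts)).mean().item())
```

In production RLHF the answers are token sequences and the policy update is usually done with PPO or a related method rather than raw gradient ascent through the reward model, but the shape of the recipe is the same: collect ratings, fit a reward model, then optimize the system against it.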
01:34:09.160 | - Now, a lot of people have been very critical
01:34:13.560 | of the recently released Google's Gemini 1.5
01:34:19.080 | for essentially, in my words, I could say super woke.
01:34:23.260 | Woke in the negative connotation of that word.
01:34:26.580 | There's some almost hilariously absurd things that it does,
01:34:30.340 | like it modifies history,
01:34:32.740 | like generating images of a black George Washington,
01:34:37.540 | or perhaps more seriously,
01:34:40.840 | something that you commented on Twitter,
01:34:43.220 | which is refusing to comment on or generate images
01:34:48.300 | of, or even descriptions of Tiananmen Square
01:34:51.860 | or the Tank Man,
01:34:55.540 | one of the most sort of legendary protest images in history.
01:35:00.540 | Of course, these images are highly censored
01:35:05.260 | by the Chinese government,
01:35:06.740 | and therefore everybody started asking questions
01:35:09.780 | of what is the process of designing these LLMs,
01:35:14.700 | what is the role of censorship in these,
01:35:17.500 | and all that kind of stuff.
01:35:19.020 | So you commented on Twitter saying
01:35:22.660 | that open source is the answer, essentially.
01:35:26.100 | So can you explain?
01:35:28.220 | - I actually made that comment
01:35:31.180 | on just about every social network I can,
01:35:33.020 | and I've made that point multiple times in various forums.
01:35:38.020 | Here's my point of view on this.
01:35:43.100 | People can complain that AI systems are biased,
01:35:47.260 | and they generally are biased
01:35:49.740 | by the distribution of the training data
01:35:52.060 | that they've been trained on.
01:35:53.860 | That reflects biases in society,
01:35:57.540 | and that is potentially offensive to some people,
01:36:03.980 | or potentially not.
01:36:06.880 | And some techniques to de-bias
01:36:10.700 | then become offensive to some people
01:36:15.400 | because of historical incorrectness and things like that.
01:36:20.400 | And so you can ask the question,
01:36:25.520 | you can ask two questions.
01:36:26.360 | The first question is,
01:36:27.380 | is it possible to produce an AI system that is not biased?
01:36:30.960 | And the answer is absolutely not.
01:36:33.400 | And it's not because of technological challenges,
01:36:37.600 | although there are technological challenges to that.
01:36:41.360 | It's because bias is in the eye of the beholder.
01:36:45.480 | Different people may have different ideas
01:36:48.800 | about what constitutes bias for a lot of things.
01:36:53.580 | I mean, there are facts that are indisputable,
01:36:57.080 | but there are a lot of opinions or things
01:36:59.780 | that can be expressed in different ways.
01:37:02.080 | And so you cannot have an unbiased system.
01:37:05.040 | That's just an impossibility.
01:37:08.800 | And so what's the answer to this?
01:37:12.640 | And the answer is the same answer that we found
01:37:16.520 | in liberal democracy about the press.
01:37:20.860 | The press needs to be free and diverse.
01:37:24.220 | We have free speech for a good reason.
01:37:28.160 | It's because we don't want all of our information
01:37:31.880 | to come from a unique source
01:37:36.680 | 'cause that's opposite to the whole idea of democracy
01:37:40.040 | and progress of ideas and even science.
01:37:45.040 | In science, people have to argue for different opinions
01:37:48.160 | and science makes progress when people disagree
01:37:51.400 | and they come up with an answer and a consensus forms.
01:37:54.600 | And it's true in all democracies around the world.
01:37:57.720 | So there is a future which is already happening
01:38:02.720 | where every single one of our interactions
01:38:05.640 | with the digital world will be mediated
01:38:08.040 | by AI systems, AI assistants, right?
01:38:11.740 | We're gonna have smart glasses.
01:38:14.820 | You can already buy them from Meta, the Ray-Ban Meta,
01:38:18.120 | where you can talk to them and they are connected
01:38:21.520 | with an LLM and you can get answers
01:38:23.600 | on any question you have.
01:38:25.920 | Or you can be looking at a monument
01:38:28.840 | and there is a camera in the glasses,
01:38:32.440 | so you can ask it, what can you tell me about this building
01:38:35.680 | or this monument?
01:38:36.520 | You can be looking at a menu in a foreign language
01:38:39.160 | and the thing will translate it for you
01:38:40.760 | or you can do real-time translation
01:38:43.640 | if you speak different languages.
01:38:44.800 | So a lot of our interactions with the digital world
01:38:48.280 | are gonna be mediated by those systems in the near future.
01:38:51.120 | Increasingly, the search engines that we're gonna use
01:38:57.000 | are not gonna be search engines.
01:38:58.080 | They're gonna be dialogue systems
01:39:01.440 | that we just ask a question and it will answer
01:39:05.160 | and then point you to perhaps appropriate reference for it.
01:39:08.880 | But here is the thing, we cannot afford those systems
01:39:11.920 | to come from a handful of companies
01:39:13.960 | on the west coast of the US.
01:39:15.320 | Because those systems will constitute the repository
01:39:20.080 | of all human knowledge.
01:39:22.040 | And we cannot have that be controlled
01:39:25.600 | by a small number of people, right?
01:39:27.960 | It has to be diverse.
01:39:29.120 | For the same reason, the press has to be diverse.
01:39:32.200 | So how do we get a diverse set of AI assistance?
01:39:35.520 | It's very expensive and difficult to train a base model,
01:39:40.120 | right, a base LLM at the moment.
01:39:42.240 | You know, in the future, it might be something different,
01:39:43.920 | but at the moment, that's an LLM.
01:39:46.040 | So only a few companies can do this properly.
01:39:49.560 | And if some of those top systems are open source,
01:39:55.560 | anybody can use them.
01:39:57.120 | Anybody can fine-tune them.
01:39:59.120 | If we put in place some systems
01:40:01.680 | that allows any group of people,
01:40:05.560 | whether they are individual citizens, groups of citizens,
01:40:11.400 | government organizations, NGOs, companies, whatever,
01:40:17.320 | to take those open source systems, AI systems,
01:40:23.920 | and fine-tune them for their own purpose on their own data,
01:40:27.680 | then we're gonna have a very large diversity
01:40:29.640 | of different AI systems that are specialized
01:40:32.840 | for all of those things, right?
01:40:34.640 | So I tell you, I talk to the French government quite a bit,
01:40:38.200 | and the French government will not accept
01:40:41.360 | that the digital diet of all their citizens
01:40:44.560 | be controlled by three companies
01:40:46.400 | on the west coast of the US.
01:40:48.120 | That's just not acceptable.
01:40:49.640 | It's a danger to democracy,
01:40:51.200 | regardless of how well-intentioned those companies are, right?
01:40:54.560 | And it's also a danger to local culture,
01:41:01.000 | to values, to language, right?
01:41:05.400 | I was talking with the founder of Infosys in India.
01:41:10.400 | He's funding a project to fine-tune LLAMA 2,
01:41:16.640 | the open source model produced by Meta,
01:41:19.920 | so that LLAMA 2 speaks all 22 official languages in India.
01:41:23.120 | It's very important for people in India.
01:41:26.480 | I was talking to a former colleague of mine,
01:41:28.240 | Moustapha Cissé, who used to be a scientist at FAIR,
01:41:31.320 | and then moved back to Africa,
01:41:32.480 | created a research lab for Google in Africa,
01:41:35.200 | and now has a new startup called Cara.
01:41:37.960 | And what he's trying to do is basically have an LLM
01:41:40.520 | that speaks the local languages in Senegal
01:41:42.880 | so that people can have access to medical information,
01:41:46.200 | 'cause they don't have access to doctors.
01:41:47.560 | It's a very small number of doctors per capita in Senegal.
01:41:51.920 | I mean, you can't have any of this
01:41:55.480 | unless you have open source platforms.
01:41:57.960 | So with open source platforms,
01:41:59.160 | you can have AI systems that are not only diverse
01:42:01.760 | in terms of political opinions or things of that type,
01:42:05.040 | but in terms of language, culture,
01:42:10.000 | value systems, political opinions,
01:42:15.440 | technical abilities in various domains.
01:42:18.920 | And you can have an industry, an ecosystem of companies
01:42:22.200 | that fine tune those open source systems
01:42:24.560 | for vertical applications in industry, right?
01:42:27.200 | You have, I don't know, a publisher has thousands of books,
01:42:30.240 | and they want to build a system that allows the customer
01:42:32.880 | to just ask a question about the content
01:42:36.040 | of any of their books.
01:42:37.640 | You need to train on their proprietary data, right?
01:42:41.000 | You have a company, we have one within Meta,
01:42:43.520 | it's called MetaMate, and it's basically an LLM
01:42:46.640 | that can answer any question about internal stuff
01:42:50.200 | about the company.
01:42:52.080 | Very useful.
01:42:53.280 | A lot of companies want this, right?
01:42:55.240 | A lot of companies want this not just for their employees,
01:42:57.880 | but also for their customers, to take care of their customers.
01:43:00.760 | So the only way you're gonna have an AI industry,
01:43:04.360 | the only way you're gonna have AI systems
01:43:06.280 | that are not uniquely biased
01:43:08.680 | is if you have open source platforms
01:43:10.280 | on top of which any group can build specialized systems.
01:43:15.280 | So the direction of, inevitable direction of history
01:43:21.680 | is that the vast majority of AI systems
01:43:26.040 | will be built on top of open source platforms.
01:43:28.400 | - So that's a beautiful vision.
01:43:30.160 | So meaning like a company like Meta or Google or so on
01:43:37.880 | should take only minimal fine-tuning steps
01:43:40.560 | after building the foundation pre-trained model,
01:43:44.880 | as few steps as possible.
01:43:47.240 | - Basically.
01:43:48.080 | - Can Meta afford to do that?
01:43:51.520 | - No.
01:43:52.360 | - So I don't know if you know this,
01:43:53.620 | but companies are supposed to make money somehow,
01:43:56.240 | and open source is like giving away, I don't know,
01:44:01.040 | Mark made a video, Mark Zuckerberg, very sexy video,
01:44:06.120 | talking about 350,000 Nvidia H100s.
01:44:11.120 | The math of that, just for the GPUs, is roughly 10 billion dollars,
01:44:17.680 | plus the infrastructure for training everything.
01:44:22.360 | So I'm no business guy, but how do you make money on that?
01:44:27.360 | So the vision you paint is a really powerful one,
01:44:30.180 | but how is it possible to make money?
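A quick back-of-the-envelope check of that GPU figure, assuming a rough street price of about $30,000 per H100; Meta's actual cost is not public:

```python
# Rough cost arithmetic (prices are assumptions, not Meta's actual spend):
h100_count = 350_000
price_per_gpu_usd = 30_000          # rough street price per H100-class GPU
gpu_spend = h100_count * price_per_gpu_usd
print(f"GPU spend alone: ~${gpu_spend / 1e9:.0f} billion")  # roughly $10 billion
```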
01:44:32.560 | - Okay, so you have several business models, right?
01:44:36.760 | The business model that Meta is built around
01:44:39.600 | is you offer a free service,
01:44:44.160 | and the financing of that service is either through ads
01:44:50.080 | or through business customers.
01:44:52.680 | So for example, if you have an LLM
01:44:54.940 | that can help a mom and pop pizza place
01:45:00.500 | by talking to the customers through WhatsApp,
01:45:03.640 | and so the customers can just order a pizza
01:45:06.580 | and the system will just ask them like,
01:45:08.700 | what topping do you want or what size, blah, blah, blah.
01:45:11.340 | The business will pay for that, okay, that's a model.
01:45:15.280 | And otherwise, if it's a system
01:45:21.760 | that is on the more kind of classical services,
01:45:24.360 | it can be ad supported or there's several models.
01:45:28.140 | But the point is, if you have a big enough
01:45:31.600 | potential customer base and you need to build that system
01:45:36.240 | anyway for them, it doesn't hurt you
01:45:41.080 | to actually distribute it in open source.
01:45:43.240 | - Again, I'm no business guy,
01:45:45.400 | but if you release the open source model,
01:45:48.060 | then other people can do the same kind of task
01:45:51.720 | and compete on it,
01:45:52.920 | basically provide fine tuned models for businesses.
01:45:57.000 | So is the bet that Meta is making,
01:45:59.700 | by the way, I'm a huge fan of all this,
01:46:01.460 | but is the bet that Meta is making is like,
01:46:03.840 | we'll do a better job of it?
01:46:05.580 | - Well, no, the bet is more,
01:46:08.440 | we already have a huge user base and customer base, right?
01:46:13.440 | So it's gonna be useful to them.
01:46:15.320 | Whatever we offer them is gonna be useful
01:46:17.540 | and there is a way to derive revenue from this.
01:46:22.280 | And it doesn't hurt that we provide that system
01:46:26.680 | or the base model, right?
01:46:29.400 | The foundation model in open source
01:46:32.640 | for others to build applications on top of it too.
01:46:35.820 | If those applications turn out to be useful
01:46:37.400 | for our customers, we can just buy it from them.
01:46:39.840 | It could be that they will improve the platform.
01:46:44.280 | In fact, we see this already.
01:46:46.400 | I mean, there is literally millions of downloads
01:46:49.000 | of LLAMA 2, and thousands of people who have provided ideas
01:46:53.720 | about how to make it better.
01:46:55.600 | So this clearly accelerates progress
01:46:59.320 | to make the system available to sort of a wide community
01:47:04.320 | of people and there's literally thousands of businesses
01:47:07.800 | who are building applications with it.
01:47:09.640 | So our ability to, Meta's ability to derive revenue
01:47:18.200 | from this technology is not impaired
01:47:20.480 | by the distribution of base models in open source.
01:47:26.640 | - The fundamental criticism that Gemini is getting
01:47:28.680 | is that, as you pointed out on the West Coast,
01:47:31.040 | just to clarify, we're currently on the East Coast
01:47:34.680 | where I would suppose Meta AI headquarters would be.
01:47:37.840 | So there are strong words about the West Coast,
01:47:42.540 | but I guess the issue that happens is,
01:47:47.000 | I think it's fair to say that most tech people
01:47:49.960 | have a political affiliation with the left wing.
01:47:53.920 | They lean left.
01:47:55.320 | And so the problem that people are criticizing Gemini with
01:47:58.440 | is that there's, in that de-biasing process
01:48:01.160 | that you mentioned, that their ideological lean
01:48:06.000 | becomes obvious.
01:48:08.940 | Is this something that could be escaped?
01:48:14.520 | You're saying open source is the only way.
01:48:17.160 | Have you witnessed this kind of ideological lean
01:48:19.640 | that makes engineering difficult?
01:48:22.360 | - No, I don't think it has to do,
01:48:24.240 | I don't think the issue has to do with the political leaning
01:48:26.740 | of the people designing those systems.
01:48:29.300 | It has to do with the acceptability or political leanings
01:48:34.300 | of their customer base or audience, right?
01:48:38.280 | So a big company cannot afford to offend too many people.
01:48:43.640 | So they're going to make sure
01:48:46.480 | that whatever product they put out is safe,
01:48:49.440 | whatever that means.
01:48:50.440 | And it's very possible to overdo it.
01:48:56.200 | And it's also very possible to,
01:48:58.020 | it's impossible to do it properly for everyone.
01:49:00.360 | You're not going to satisfy everyone.
01:49:02.520 | So that's what I said before.
01:49:03.760 | You cannot have a system that is unbiased,
01:49:05.680 | that is perceived as unbiased by everyone.
01:49:07.760 | It's gonna be, you push it in one way,
01:49:11.560 | one set of people are going to see it as biased,
01:49:14.640 | and then you push it the other way,
01:49:15.700 | and another set of people is going to see it as biased.
01:49:18.600 | And then in addition to this, there's the issue of,
01:49:21.640 | if you push the system,
01:49:22.660 | perhaps a little too far in one direction,
01:49:24.260 | it's going to be non-factual, right?
01:49:25.840 | You're going to have black Nazi soldiers in the--
01:49:30.840 | - Yeah, so we should mention image generation
01:49:33.640 | of black Nazi soldiers, which is not factually accurate.
01:49:38.960 | - Right, and can be offensive for some people as well, right?
01:49:42.180 | So it's going to be impossible to kind of produce systems
01:49:47.180 | that are unbiased for everyone.
01:49:49.080 | So the only solution that I see is diversity.
01:49:52.200 | - And diversity is the full meaning of that word,
01:49:54.980 | diversity of in every possible way.
01:49:57.940 | - Yeah.
01:49:59.380 | - Marc Andreessen just tweeted today,
01:50:02.640 | let me do a TLDR.
01:50:06.040 | The conclusion is only startups and open source
01:50:08.640 | can avoid the issue that he's highlighting with big tech.
01:50:12.240 | He's asking, can big tech actually field
01:50:15.440 | generative AI products?
01:50:17.480 | One, ever escalating demands from internal activists,
01:50:20.760 | employee mobs, crazed executives, broken boards,
01:50:24.440 | pressure groups, extremist regulators, government agencies,
01:50:27.240 | the press, in quotes, experts, and everything,
01:50:31.240 | corrupting the output.
01:50:34.240 | Two, constant risk of generating a bad answer
01:50:37.360 | or drawing a bad picture or rendering a bad video.
01:50:40.600 | Who knows what it is going to say or do at any moment?
01:50:44.440 | Three, legal exposure, product liability, slander,
01:50:48.160 | election law, many other things, and so on.
01:50:51.720 | Anything that makes Congress mad.
01:50:53.900 | Four, continuous attempts to tighten grip
01:50:57.240 | on acceptable output, degrade the model,
01:50:59.700 | like how good it actually is in terms of being usable
01:51:03.600 | and pleasant to use and effective and all that kind of stuff.
01:51:06.920 | And five, publicity of bad text, images, video,
01:51:10.440 | actually puts those examples into the training data
01:51:13.080 | for the next version and so on.
01:51:15.780 | So he just highlights how difficult this is
01:51:18.360 | from all kinds of people being unhappy.
01:51:21.040 | As you said, you can't create a system
01:51:23.040 | that makes everybody happy.
01:51:24.440 | So if you're going to do the fine tuning yourself
01:51:29.080 | and keep it closed source, essentially the problem there
01:51:33.200 | is then trying to minimize the number of people
01:51:35.080 | who are going to be unhappy.
01:51:37.280 | And you're saying that's almost impossible to do right
01:51:42.280 | and the better way is to do open source.
01:51:44.740 | - Basically, yeah.
01:51:46.800 | Marc is right about a number of things that he lists
01:51:51.760 | that indeed scare large companies.
01:51:55.300 | Certainly congressional investigations is one of them.
01:52:00.400 | Legal liability, making things that get people
01:52:05.400 | to hurt themselves or hurt others.
01:52:09.200 | Big companies are really careful
01:52:12.580 | about not producing things of this type
01:52:15.120 | because they don't want to hurt anyone, first of all,
01:52:21.280 | and then second, they want to preserve their business.
01:52:23.200 | So it's essentially impossible for systems like this
01:52:26.920 | that can inevitably formulate political opinions
01:52:30.960 | and opinions about various things
01:52:32.840 | that may be political or not,
01:52:34.040 | but that people may disagree about moral issues
01:52:38.360 | and questions about religion and things like that, right,
01:52:43.360 | or cultural issues that people from different communities
01:52:47.960 | would disagree with in the first place.
01:52:50.120 | So there's only kind of a relatively small number of things
01:52:52.560 | that people will sort of agree on, basic principles.
01:52:57.560 | But beyond that, if you want those systems to be useful,
01:53:01.840 | they will necessarily have to offend
01:53:05.200 | a number of people inevitably.
01:53:08.080 | - And so open source is just better.
01:53:10.960 | And then-- - Diversity is better, right.
01:53:13.280 | - And open source enables diversity.
01:53:15.480 | - That's right, open source enables diversity.
01:53:18.200 | - This can be a fascinating world where if it's true
01:53:21.560 | that the open source world, if meta leads the way
01:53:23.960 | and creates this kind of open source
01:53:25.840 | foundation model world, there's going to be,
01:53:28.560 | like governments will have a fine tune model.
01:53:31.520 | And then potentially people that vote left and right
01:53:36.520 | will have their own model and preference
01:53:40.640 | to be able to choose.
01:53:42.000 | And it will potentially divide us even more,
01:53:44.400 | but that's on us humans, we get to figure out.
01:53:48.280 | Basically the technology enables humans to human
01:53:52.000 | more effectively and all the difficult ethical questions
01:53:56.160 | that humans raise will just leave it up to us
01:54:01.040 | to figure that out.
01:54:02.640 | - Yeah, I mean, there are some limits to what,
01:54:04.760 | the same way there are limits to free speech,
01:54:06.480 | there has to be some limit to the kind of stuff
01:54:08.880 | that those systems might be authorized to produce,
01:54:13.880 | some guardrails.
01:54:16.440 | So, I mean, that's one thing I've been interested in,
01:54:18.280 | which is in the type of architecture that we were discussing
01:54:21.800 | before, where the output of a system is a result
01:54:26.760 | of an inference to satisfy an objective,
01:54:29.840 | that objective can include guardrails.
01:54:31.960 | And we can put guardrails in open source systems.
01:54:37.400 | I mean, if we eventually have systems that are built
01:54:39.760 | with this blueprint, we can put guardrails in those systems
01:54:44.200 | that guarantee that there is sort of a minimum set
01:54:47.640 | of guardrails that make the system non-dangerous
01:54:50.040 | and non-toxic, et cetera.
01:54:51.480 | Basic things that everybody would agree on.
01:54:53.680 | And then the fine tuning that people will add
01:54:58.200 | or the additional guardrails that people will add
01:55:00.400 | will kind of cater to their community, whatever it is.
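A minimal sketch of what "the objective can include guardrails" might look like in an objective-driven system, assuming a toy two-dimensional output, a quadratic task cost, and a hinge-style penalty as the guardrail; all of this is illustrative, not a description of any deployed system:

```python
# "Inference by optimization": the answer z is found by minimizing an
# objective that is task cost plus a hard-wired guardrail penalty.
import torch

def task_cost(z):
    """Stand-in for 'how well does output z do what was asked'."""
    target = torch.tensor([3.0, 4.0])
    return ((z - target) ** 2).sum()

def guardrail_cost(z):
    """Stand-in guardrail: heavily penalize outputs outside an allowed region
    (here, anything with norm > 2 counts as 'not allowed')."""
    return 10.0 * torch.relu(z.norm() - 2.0) ** 2

z = torch.zeros(2, requires_grad=True)       # the answer being inferred
opt = torch.optim.SGD([z], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    loss = task_cost(z) + guardrail_cost(z)  # objective includes the guardrail
    loss.backward()
    opt.step()

# The answer moves toward the task target only as far as the guardrail allows.
print("inferred output:", z.detach(), "norm:", z.detach().norm().item())
```

The design point is that the guardrail is part of the objective being optimized at inference time, rather than a behavior the model may or may not have absorbed during fine-tuning.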
01:55:04.960 | - And yeah, the fine tuning will be more about
01:55:07.240 | the gray areas of what is hate speech, what is dangerous
01:55:10.400 | and all that kind of stuff.
01:55:11.480 | I mean, you've--
01:55:12.320 | - Or different value systems.
01:55:13.360 | - Different value systems.
01:55:14.560 | I mean, like, but still, even with the objectives
01:55:16.760 | of how to build a bioweapon, for example,
01:55:18.760 | I think something you've commented on,
01:55:20.800 | or at least there's a paper where a collection
01:55:24.040 | of researchers is just trying to understand
01:55:26.400 | the social impacts of these LLMs.
01:55:29.320 | And I guess one threshold that's nice is like,
01:55:32.360 | does the LLM make it any easier than a search would,
01:55:37.360 | like a Google search would?
01:55:39.480 | - Right, so the increasing number of studies on this
01:55:44.480 | seems to point to the fact that it doesn't help.
01:55:49.600 | So having an LLM doesn't help you design
01:55:53.480 | or build a bioweapon or a chemical weapon
01:55:57.280 | if you already have access to a search engine
01:56:00.200 | and a library.
01:56:01.040 | And so the sort of increased information you get
01:56:04.920 | or the ease with which you get it doesn't really help you.
01:56:08.200 | That's the first thing.
01:56:09.040 | The second thing is it's one thing to have a list
01:56:12.080 | of instructions of how to make a chemical weapon,
01:56:15.600 | for example, or a bioweapon.
01:56:17.160 | It's another thing to actually build it.
01:56:20.040 | And it's much harder than you might think.
01:56:22.160 | And LLM will not help you with that.
01:56:24.000 | In fact, nobody in the world,
01:56:27.080 | not even countries use bioweapons
01:56:29.560 | because most of the time they have no idea
01:56:32.320 | how to protect their own populations against it.
01:56:34.280 | So it's too dangerous actually to kind of ever use.
01:56:39.280 | And it's in fact banned by international treaties.
01:56:44.280 | Chemical weapons is different.
01:56:45.680 | It's also banned by treaties,
01:56:47.680 | but it's the same problem.
01:56:50.760 | It's difficult to use it in a way
01:56:53.120 | that doesn't turn against the perpetrators.
01:56:56.520 | But we could ask Elon Musk.
01:56:58.440 | I can give you a very precise list of instructions
01:57:01.800 | of how you build a rocket engine.
01:57:03.440 | And even if you have a team of 50 engineers
01:57:06.800 | that are really experienced building it,
01:57:08.280 | you're still gonna have to blow up a dozen of them
01:57:10.120 | before you get one that works.
01:57:11.560 | And it's the same with chemical weapons
01:57:18.040 | or bioweapons or things like this.
01:57:19.560 | It requires expertise in the real world
01:57:23.080 | that an LLM is not gonna help you with.
01:57:25.240 | - And it requires even the common sense expertise
01:57:28.040 | that we've been talking about,
01:57:29.080 | which is how to take language-based instructions
01:57:34.000 | and materialize them in the physical world.
01:57:36.880 | It requires a lot of knowledge
01:57:38.480 | that's not in the instructions.
01:57:41.560 | - Yeah, exactly.
01:57:42.400 | A lot of biologists have posted on this, actually,
01:57:44.520 | in response to those things saying like,
01:57:46.400 | do you realize how hard it is to actually do the lab work?
01:57:49.240 | Like, you know, this is not trivial.
01:57:51.840 | - Yeah, and that's Hans Moravec comes to light once again.
01:57:59.360 | Just to linger on LLAMA,
01:58:01.800 | Mark announced that LLAMA 3 is coming out eventually.
01:58:03.480 | I don't think there's a release date.
01:58:06.920 | But what are you most excited about?
01:58:08.960 | First of all, LLAMA 2 that's already out there,
01:58:12.760 | and maybe the future, LLAMA 3, 4, 5, 6, 10,
01:58:15.600 | just the future of the open source under Meta?
01:58:15.600 | - Well, a number of things.
01:58:18.080 | So there's gonna be like various versions of LLAMA
01:58:22.000 | that are improvements of previous LLAMAs,
01:58:26.880 | bigger, better, multimodal, things like that.
01:58:30.680 | And then in future generations,
01:58:32.000 | systems that are capable of planning,
01:58:34.120 | that really understand how the world works.
01:58:36.880 | Maybe are trained from video, so they have some world model.
01:58:39.600 | Maybe, you know, capable of the type of reasoning
01:58:42.160 | and planning I was talking about earlier.
01:58:44.120 | Like, how long is that gonna take?
01:58:45.360 | Like, when is the research that is going in that direction
01:58:48.520 | going to sort of feed into the product line,
01:58:52.080 | if you want, of LLAMA?
01:58:53.520 | I don't know.
01:58:54.360 | I can't tell you.
01:58:55.200 | And there is, you know, a few breakthroughs
01:58:56.320 | that we have to basically go through
01:58:59.680 | before we can get there.
01:59:01.880 | But you'll be able to monitor our progress
01:59:04.560 | because we publish our research, right?
01:59:07.040 | So, you know, last week we published the V-JEPA work,
01:59:12.040 | which is sort of a first step
01:59:13.240 | towards training systems from video.
01:59:15.000 | And then the next step is gonna be world models
01:59:18.960 | based on kind of this type of idea, training from video.
01:59:23.760 | There's similar work taking place at DeepMind,
01:59:26.120 | and also at UC Berkeley
01:59:30.840 | on world models from video.
01:59:33.800 | A lot of people are working on this.
01:59:35.160 | I think a lot of good ideas are appearing.
01:59:38.480 | My bet is that those systems are gonna be JEPA-like.
01:59:41.760 | They're not gonna be generative models.
01:59:43.960 | And we'll see what the future will tell.
01:59:48.960 | There's really good work by a gentleman
01:59:54.720 | called Danijar Hafner, who is now at DeepMind,
01:59:56.880 | who's worked on kind of models of this type
01:59:58.720 | that learn representations, and then use them for planning
02:00:01.800 | or learning tasks by reinforcement learning.
02:00:04.160 | And a lot of work at Berkeley by Pieter Abbeel,
02:00:09.560 | Sergey Levine, a bunch of other people of that type.
02:00:12.400 | I'm collaborating with actually in the context
02:00:15.360 | of some grants with my NYU hat.
02:00:18.160 | And then collaborations also through Meta,
02:00:22.360 | 'cause the lab at Berkeley is associated
02:00:25.640 | with Meta in some way, so with FAIR.
02:00:28.280 | So I think it's very exciting.
02:00:30.720 | I think, I'm super excited about,
02:00:34.200 | I haven't been that excited about the direction
02:00:36.720 | of machine learning and AI since 10 years ago
02:00:41.320 | when FAIR was started.
02:00:42.280 | And before that, 30, let's say 35 years ago,
02:00:46.120 | we were working on convolutional nets
02:00:48.720 | and the early days of neural nets.
02:00:52.000 | So I'm super excited because I see a path
02:00:56.280 | towards potentially human level intelligence
02:00:59.200 | with systems that can understand the world,
02:01:04.120 | remember, plan, reason.
02:01:05.760 | There is some set of ideas to make progress there
02:01:10.480 | that might have a chance of working.
02:01:12.400 | And I'm really excited about this.
02:01:14.600 | What I like is that somewhat we get onto a good direction
02:01:19.600 | and perhaps succeed before my brain turns to a white sauce
02:01:24.920 | or before I need to retire. (laughs)
02:01:28.320 | - Yeah, yeah.
02:01:30.160 | Are you also excited by, are you,
02:01:32.380 | is it beautiful to you just the amount of GPUs involved,
02:01:38.000 | sort of the whole training process on this much compute?
02:01:42.880 | It's just zooming out, just looking at earth
02:01:45.320 | and humans together have built these computing devices
02:01:49.720 | and are able to train this one brain.
02:01:53.560 | That we then open source.
02:01:55.740 | Like giving birth to this open source brain
02:02:01.040 | trained on this gigantic compute system.
02:02:04.320 | There's just the details of how to train on that,
02:02:07.680 | how to build the infrastructure and the hardware,
02:02:10.060 | the cooling, all of this kind of stuff.
02:02:12.240 | Are you just still, most of your excitement
02:02:14.360 | is in the theory aspect of it?
02:02:17.240 | Meaning like the software?
02:02:19.600 | - Well, I used to be a hardware guy many years ago.
02:02:21.480 | - Yes, yes, that's right. - Decades ago.
02:02:23.080 | Hardware has improved a little bit, changed a little bit.
02:02:26.960 | Yeah.
02:02:27.800 | - I mean, certainly scale is necessary, but not sufficient.
02:02:32.360 | - Absolutely.
02:02:33.200 | - So we certainly need compute.
02:02:34.600 | I mean, we're still far in terms of compute power
02:02:37.000 | from what we would need to match the compute power
02:02:40.800 | of the human brain.
02:02:42.880 | This may occur in the next couple of decades,
02:02:45.040 | but we're still some ways away.
02:02:47.600 | And certainly in terms of power efficiency,
02:02:49.880 | we're really far.
02:02:51.920 | So there's a lot of progress to make in hardware.
02:02:56.000 | And right now, a lot of the progress is not,
02:03:00.240 | I mean, there's a bit coming from silicon technology,
02:03:03.000 | but a lot of it coming from architectural innovation
02:03:06.440 | and quite a bit coming from more efficient ways
02:03:10.200 | of implementing the architectures that have become popular,
02:03:13.640 | basically a combination of Transformers and ConvNets, right?
02:03:17.520 | And so there's still some ways to go
02:03:22.280 | until we're gonna saturate.
02:03:27.280 | We're gonna have to come up with like new principles,
02:03:30.200 | new fabrication technology, new basic components,
02:03:34.560 | perhaps based on sort of different principles
02:03:38.880 | and those classical digital CMOS.
02:03:42.000 | - Interesting.
02:03:42.840 | So you think in order to build AMI,
02:03:47.440 | we potentially might need
02:03:50.520 | some hardware innovation too.
02:03:52.920 | - Well, if we wanna make it ubiquitous, yeah, certainly.
02:03:56.640 | 'Cause we're gonna have to reduce the power consumption.
02:04:01.640 | A GPU today, right, is half a kilowatt to a kilowatt.
02:04:05.580 | Human brain is about 25 watts.
02:04:08.640 | And a GPU is way below the compute power of the human brain.
02:04:13.100 | You need something like 100,000 or a million to match it.
02:04:16.040 | So we are off by a huge factor.
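A rough check of that factor, using the approximate wattage numbers just mentioned:

```python
# Rough power-efficiency gap (all numbers approximate):
gpu_watts = 700                 # one modern GPU, roughly 0.5-1 kW
brain_watts = 25                # human brain
gpus_to_match_brain = 100_000   # lower end of the estimate above

cluster_watts = gpu_watts * gpus_to_match_brain          # ~70 megawatts
gap = cluster_watts / brain_watts
print(f"~{cluster_watts / 1e6:.0f} MW vs {brain_watts} W "
      f"-> off by a factor of ~{gap:,.0f}")              # a few million
```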
02:04:19.760 | - You often say that AGI is not coming soon,
02:04:26.280 | meaning like not this year,
02:04:28.560 | not the next few years, potentially farther away.
02:04:32.760 | What's your basic intuition behind that?
02:04:35.720 | - So first of all, it's not going to be an event, right?
02:04:39.080 | The idea somehow, which is popularized
02:04:41.560 | by science fiction and Hollywood,
02:04:43.140 | that somehow somebody is gonna discover the secret,
02:04:47.060 | the secret to AGI or human level AI or AMI,
02:04:50.860 | whatever you wanna call it.
02:04:52.400 | And then turn on a machine and then we have AGI.
02:04:55.220 | That's just not going to happen.
02:04:57.140 | It's not going to be an event.
02:04:58.640 | It's gonna be gradual progress.
02:05:02.640 | Are we gonna have systems that can learn from video
02:05:07.000 | how the world works and learn good representations?
02:05:09.440 | Yeah, before we get them to the scale and performance
02:05:13.060 | that we observe in humans, it's gonna take quite a while.
02:05:15.600 | It's not gonna happen in one day.
02:05:17.240 | Are we gonna get systems that can have large amount
02:05:23.320 | of associative memories so they can remember stuff?
02:05:26.660 | Yeah, but same, it's not gonna happen tomorrow.
02:05:28.720 | I mean, there is some basic techniques
02:05:30.440 | that need to be developed.
02:05:31.460 | We have a lot of them, but to get this to work together
02:05:34.800 | with a full system is another story.
02:05:37.040 | Are we gonna have systems that can reason and plan,
02:05:39.200 | perhaps along the lines of objective-driven
02:05:42.160 | AI architectures that I described before?
02:05:45.000 | Yeah, but before we get this to work properly,
02:05:47.480 | it's gonna take a while.
02:05:49.320 | And before we get all those things to work together,
02:05:51.280 | and then on top of this, have systems that can learn
02:05:54.020 | like hierarchical planning, hierarchical representations,
02:05:56.800 | systems that can be configured for a lot
02:05:59.640 | of different situations at hands,
02:06:01.020 | the way the human brain can.
02:06:02.640 | All of this is gonna take at least a decade,
02:06:07.860 | probably much more, because there are a lot of problems
02:06:11.060 | that we're not seeing right now that we have not encountered,
02:06:15.300 | and so we don't know if there is an easy solution
02:06:17.280 | within this framework.
02:06:18.600 | So, you know, it's not just around the corner.
02:06:23.380 | I mean, I've been hearing people for the last 12, 15 years
02:06:27.580 | claiming that, you know, AGI is just around the corner
02:06:30.040 | and being systematically wrong.
02:06:32.620 | And I knew they were wrong when they were saying it.
02:06:34.520 | I call their bullshit.
02:06:35.580 | Why do you think people have been calling,
02:06:38.220 | first of all, I mean, from the beginning,
02:06:39.740 | from the birth of the term artificial intelligence,
02:06:42.780 | there has been an eternal optimism
02:06:45.340 | that's perhaps unlike other technologies?
02:06:49.100 | Is it Moravec's paradox?
02:06:51.820 | Is that the explanation for why people
02:06:54.420 | are so optimistic about AGI?
02:06:57.060 | - I don't think it's just Moravec's paradox.
02:06:58.780 | Moravec's paradox is a consequence of realizing
02:07:01.260 | that the world is not as easy as we think.
02:07:03.780 | So first of all, intelligence is not a linear thing
02:07:08.620 | you can measure with a scalar, with a single number.
02:07:11.500 | You know, can you say that humans are smarter
02:07:15.260 | than orangutans?
02:07:18.340 | In some ways, yes.
02:07:20.220 | But in some ways, orangutans are smarter than humans
02:07:22.140 | in a lot of domains
02:07:23.820 | that allow them to survive in the forest, for example.
02:07:26.820 | - So IQ is a very limited measure of intelligence.
02:07:30.380 | You know, intelligence is bigger
02:07:31.580 | than what IQ, for example, measures?
02:07:33.900 | - Well, IQ can measure, you know,
02:07:36.580 | approximately something for humans.
02:07:38.780 | But because humans kind of come in relatively
02:07:44.660 | kind of uniform form, right?
02:07:49.060 | But it only measures one type of ability
02:07:53.780 | that may be relevant for some tasks, but not others.
02:07:56.620 | But then if you're talking about other intelligent entities
02:08:02.540 | for which the basic things that are easy to them
02:08:07.140 | is very different, then it doesn't mean anything.
02:08:11.420 | So intelligence is a collection of skills
02:08:15.780 | and an ability to acquire new skills efficiently, right?
02:08:22.900 | And the collection of skills that any particular
02:08:27.540 | intelligent entity possess or is capable of learning quickly
02:08:31.700 | is different from the collection of skills of another one.
02:08:35.340 | And because it's a multidimensional thing,
02:08:37.460 | the set of skills is a high-dimensional space,
02:08:39.500 | you can't measure it,
02:08:41.340 | you cannot compare two things
02:08:42.860 | as to whether one is more intelligent than the other.
02:08:45.780 | It's multidimensional.
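As a tiny concrete illustration of that point, two skill profiles can each be better on some dimensions and worse on others, so neither dominates the other; the skills and numbers below are made up:

```python
# Two "skill vectors" in a toy 3-dimensional skill space: neither dominates
# the other on every skill, so there is no single-number ordering between them.
import numpy as np

a = np.array([0.9, 0.2, 0.7])   # e.g. tool use, foraging, navigation
b = np.array([0.4, 0.95, 0.6])

print("a >= b on every skill?", bool(np.all(a >= b)))   # False
print("b >= a on every skill?", bool(np.all(b >= a)))   # False -> incomparable
```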
02:08:46.900 | - So you push back against what are called AI doomers a lot.
02:08:53.740 | Can you explain their perspective
02:08:57.780 | and why you think they're wrong?
02:08:59.780 | - Okay, so AI doomers imagine all kinds
02:09:02.180 | of catastrophe scenarios of how AI could escape our control
02:09:07.180 | and basically kill us all.
02:09:09.580 | (laughs)
02:09:11.220 | And that relies on a whole bunch of assumptions
02:09:14.460 | that are mostly false.
02:09:15.540 | So the first assumption is that the emergence
02:09:19.380 | of super intelligence could be an event,
02:09:21.820 | that at some point we're going to figure out the secret
02:09:25.100 | and we'll turn on a machine that is super intelligent.
02:09:28.300 | And because we'd never done it before,
02:09:30.500 | it's going to take over the world and kill us all.
02:09:33.060 | That is false.
02:09:33.940 | It's not going to be an event.
02:09:35.900 | We're going to have systems that are like as smart as a cat,
02:09:39.700 | have all the characteristics of human level intelligence,
02:09:44.700 | but their level of intelligence would be like a cat
02:09:47.540 | or a parrot maybe or something.
02:09:49.860 | And then we're going to work our way up
02:09:53.900 | to kind of make those things more intelligent.
02:09:55.420 | And as we make them more intelligent,
02:09:56.780 | we're also going to put some guard rails in them
02:09:58.580 | and learn how to kind of put some guard rails
02:10:00.460 | so they behave properly.
02:10:01.740 | And we're not going to do this with just one,
02:10:03.860 | it's not going to be one effort,
02:10:04.820 | that it's going to be lots of different people doing this.
02:10:07.620 | And some of them are going to succeed
02:10:09.260 | at making intelligent systems that are controllable and safe
02:10:13.180 | and have the right guard rails.
02:10:14.420 | And if some other goes rogue,
02:10:15.980 | then we can use the good ones to go against the rogue ones.
02:10:20.380 | So it's going to be my smart AI police
02:10:23.300 | against your rogue AI.
02:10:25.500 | So it's not going to be like we're going to be exposed
02:10:27.700 | to like a single rogue AI that's going to kill us all.
02:10:29.940 | That's just not happening.
02:10:31.860 | Now, there is another fallacy,
02:10:33.300 | which is the fact that because the system is intelligent,
02:10:36.340 | it necessarily wants to take over.
02:10:38.060 | And there is several arguments
02:10:43.420 | that make people scared of this,
02:10:44.780 | which I think are completely false as well.
02:10:48.500 | So one of them is in nature,
02:10:53.460 | it seems to be that the more intelligent species
02:10:55.580 | are the ones that end up dominating the other.
02:10:58.180 | And even extinguishing the others sometimes by design,
02:11:03.180 | sometimes just by mistake.
02:11:06.780 | And so there is sort of a thinking by which you say,
02:11:12.940 | well, if AI systems are more intelligent than us,
02:11:17.420 | surely they're going to eliminate us,
02:11:19.660 | if not by design, simply because they don't care about us.
02:11:23.180 | And that's just preposterous for a number of reasons.
02:11:27.780 | First reason is they're not going to be a species.
02:11:30.340 | They're not going to be a species that competes with us.
02:11:33.220 | They're not going to have the desire to dominate
02:11:35.420 | because the desire to dominate is something
02:11:37.220 | that has to be hardwired into an intelligent system.
02:11:41.020 | It is hardwired in humans.
02:11:43.580 | It is hardwired in baboons, in chimpanzees, in wolves,
02:11:48.860 | not in orangutans.
02:11:49.980 | This desire to dominate or submit
02:11:56.340 | or attain status in other ways
02:11:59.060 | is specific to social species.
02:12:03.300 | Non-social species like orangutans don't have it, right?
02:12:06.740 | And they are as smart as we are almost, right?
02:12:09.500 | - And to you, there's not significant incentive
02:12:12.140 | for humans to encode that into the AI systems.
02:12:15.180 | And to the degree they do, there'll be other AIs
02:12:18.980 | that sort of punish them for it.
02:12:23.100 | Or out-compete them.
02:12:23.100 | - Well, there's all kinds of incentives
02:12:24.380 | to make AI systems submissive to humans, right?
02:12:27.660 | I mean, this is the way we're going to build them, right?
02:12:30.300 | And so then people say, "Oh, but look at LLMs.
02:12:32.780 | "LLMs are not controllable."
02:12:33.980 | And they're right, LLMs are not controllable.
02:12:36.780 | But object-driven AI, so systems that derive their answers
02:12:41.500 | by optimization of an objective
02:12:43.780 | means they have to optimize this objective,
02:12:45.820 | and that objective can include guardrails.
02:12:48.380 | One guardrail is obey humans.
02:12:52.860 | Another guardrail is don't obey humans
02:12:54.660 | if it's hurting other humans.
02:12:57.140 | - I've heard that before somewhere, I don't remember.
02:12:59.620 | - Yes, maybe in a book.
02:13:02.020 | - Yeah, but speaking of that book,
02:13:05.660 | could there be unintended consequences also from all of this?
02:13:09.260 | - No, of course.
02:13:10.700 | So this is not a simple problem, right?
02:13:12.660 | I mean, designing those guardrails
02:13:14.620 | so that the system behaves properly
02:13:16.300 | is not going to be a simple issue
02:13:20.860 | for which there is a silver bullet,
02:13:22.500 | for which you have a mathematical proof
02:13:24.020 | that the system can be safe.
02:13:25.660 | It's going to be a very progressive,
02:13:27.460 | iterative design system
02:13:28.740 | where we put those guardrails in such a way
02:13:31.820 | that the system behaves properly.
02:13:33.020 | And sometimes they're going to do something
02:13:35.180 | that was unexpected because the guardrail wasn't right,
02:13:38.460 | and we're going to correct them so that they do it right.
02:13:41.180 | The idea somehow that we can't get it slightly wrong
02:13:44.140 | because if we get it slightly wrong,
02:13:45.500 | we all die is ridiculous.
02:13:47.980 | We're just going to go progressively.
02:13:51.580 | And it's just going to be,
02:13:52.980 | the analogy I've used many times is turbojet design.
02:13:57.980 | How did we figure out how to make turbojets
02:14:03.180 | so unbelievably reliable, right?
02:14:07.180 | I mean, those are incredibly complex pieces of hardware.
02:14:11.140 | They run at really high temperatures
02:14:12.740 | for 20 hours at a time sometimes.
02:14:17.540 | And we can fly halfway around the world
02:14:21.020 | with a two-engine jetliner at near the speed of sound.
02:14:26.020 | Like how incredible is this?
02:14:28.580 | It's just unbelievable, right?
02:14:31.060 | And did we do this because we invented
02:14:34.540 | like a general principle of how to make turbojets safe?
02:14:37.060 | No, it took decades to kind of fine-tune
02:14:39.820 | the design of those systems
02:14:40.940 | so that they were safe.
02:14:43.380 | Is there a separate group within General Electric
02:14:48.380 | or SNECMA or whatever that is specialized
02:14:52.500 | in turbojet safety?
02:14:54.660 | No, the design is all about safety
02:14:58.980 | because a better turbojet is also a safer turbojet.
02:15:01.260 | So a more reliable one.
02:15:03.660 | It's the same for AI.
02:15:04.780 | Like, do you need specific provisions to make AI safe?
02:15:08.580 | No, you need to make better AI systems
02:15:10.540 | and they will be safe because they are designed
02:15:12.700 | to be more useful and more controllable.
02:15:16.380 | So let's imagine a system, AI system,
02:15:18.980 | that's able to be incredibly convincing
02:15:23.300 | and can convince you of anything.
02:15:24.940 | I can at least imagine such a system.
02:15:28.060 | And I can see such a system be weapon-like
02:15:33.940 | because it can control people's minds.
02:15:35.460 | We're pretty gullible.
02:15:37.020 | We want to believe a thing.
02:15:38.540 | You can have an AI system that controls it.
02:15:40.820 | And you could see governments using that as a weapon.
02:15:43.540 | So do you think if you imagine such a system,
02:15:47.540 | there's any parallel to something like nuclear weapons?
02:15:54.420 | So why is that technology different?
02:15:58.740 | So you're saying there's going to be gradual development.
02:16:01.860 | There's going to be, I mean, it might be rapid,
02:16:04.300 | but there'll be iterative.
02:16:05.860 | And then we'll be able to kind of respond
02:16:08.020 | and so on.
02:16:09.060 | - So that AI system designed by Vladimir Putin or whatever,
02:16:13.140 | or his minions is going to be like trying to talk
02:16:18.140 | to every American to convince them to vote
02:16:23.420 | for whoever pleases Putin or whatever,
02:16:28.420 | or rile people up against each other
02:16:36.860 | as they've been trying to do.
02:16:38.260 | They're not going to be talking to you.
02:16:40.980 | They're going to be talking to your AI assistant,
02:16:43.420 | which is going to be as smart as theirs, right?
02:16:48.340 | That AI, because as I said, in the future,
02:16:51.180 | every single one of your interactions
02:16:52.620 | with the digital world will be mediated
02:16:54.220 | by your AI assistant.
02:16:55.820 | So the first thing you're going to ask is,
02:16:57.580 | is this a scam?
02:16:58.780 | Like, is this thing like telling me the truth?
02:17:00.740 | Like, it's not even going to be able to get to you
02:17:03.300 | because it's only going to talk to your AI assistant.
02:17:05.820 | Your AI assistant is not even going to,
02:17:08.620 | it's going to be like a spam filter, right?
02:17:10.740 | You're not even seeing the email, the spam email, right?
02:17:13.940 | It's automatically put in a folder that you never see.
02:17:17.420 | It's going to be the same thing.
02:17:18.340 | That AI system that tries to convince you of something
02:17:21.540 | is going to be talking to your AI assistant,
02:17:23.260 | which is going to be at least as smart as it.
02:17:25.500 | And it's going to say, this is spam, you know,
02:17:28.580 | it's not even going to bring it to your attention.
02:17:32.220 | - So to you, it's very difficult for any one AI system
02:17:35.260 | to take such a big leap ahead
02:17:37.500 | to where it can convince even the other AI systems.
02:17:40.100 | So like, there's always going to be this kind of race
02:17:44.220 | where nobody's way ahead.
02:17:46.660 | - That's the history of the world.
02:17:48.900 | History of the world is, you know,
02:17:50.140 | whenever there is a progress someplace,
02:17:52.380 | there is a countermeasure.
02:17:54.100 | And, you know, it's a cat and mouse game.
02:17:57.620 | - This is why, mostly, yes,
02:17:59.420 | but this is why nuclear weapons are so interesting
02:18:01.700 | because that was such a powerful weapon
02:18:05.340 | that it mattered who got it first.
02:18:07.380 | That, you know, you could imagine Hitler, Stalin,
02:18:13.020 | Mao getting the weapon first
02:18:17.620 | and that having a different kind of impact on the world
02:18:20.620 | than the United States getting the weapon first.
02:18:24.140 | But to you, nuclear weapons is like,
02:18:27.480 | you don't imagine a breakthrough discovery
02:18:32.200 | and then Manhattan Project-like effort for AI.
02:18:35.780 | - No, as I said, it's not going to be an event.
02:18:39.180 | It's going to be, you know, continuous progress.
02:18:42.020 | And whenever, you know, one breakthrough occurs,
02:18:46.200 | it's going to be widely disseminated really quickly,
02:18:48.920 | probably first within industry.
02:18:51.040 | I mean, this is not a domain where, you know,
02:18:53.680 | government or military organizations
02:18:55.560 | are particularly innovative
02:18:57.740 | and they're in fact way behind.
02:18:59.340 | And so this is going to come from industry
02:19:02.300 | and this kind of information disseminates extremely quickly.
02:19:05.460 | We've seen this over the last few years, right?
02:19:08.100 | Where you have a new, like, you know, even take AlphaGo,
02:19:11.980 | this was reproduced within three months,
02:19:13.980 | even without like particularly detailed information, right?
02:19:17.980 | - Yeah, this is an industry that's not good at secrecy.
02:19:21.240 | - No, but even if there is,
02:19:22.920 | just the fact that you know that something is possible
02:19:25.920 | makes you like realize that it's worth investing the time
02:19:30.220 | to actually do it.
02:19:31.080 | You may be the second person to do it,
02:19:32.920 | but you know, you'll do it.
02:19:35.220 | And, you know, same for, you know, all the innovations
02:19:41.480 | of, you know, self-supervised learning, transformers,
02:19:44.200 | decoder-only architectures, LLMs.
02:19:46.320 | I mean, those things,
02:19:47.520 | you don't need to know exactly the details of how they work
02:19:49.840 | to know that, you know, it's possible
02:19:52.760 | because it's deployed and then it's getting reproduced.
02:19:54.720 | And then, you know, people who work for those companies move.
02:19:59.720 | They go from one company to another and, you know,
02:20:03.400 | the information disseminates.
02:20:05.120 | What makes the success of the US tech industry
02:20:09.760 | and Silicon Valley in particular is exactly that,
02:20:11.760 | is because information circulates really, really quickly
02:20:14.480 | and, you know, disseminates very quickly.
02:20:17.480 | And so, you know, the whole region sort of is ahead
02:20:21.760 | because of that circulation of information.
02:20:24.600 | - So maybe I, just to linger on the psychology of AI doomers,
02:20:28.560 | you give, in the classic Yann LeCun way,
02:20:31.960 | a pretty good example of just
02:20:34.200 | when a new technology comes to be.
02:20:36.860 | You say, engineer says, "I invented this new thing.
02:20:41.300 | I call it a ball pen."
02:20:44.320 | And then the Twittersphere responds, "OMG,
02:20:47.320 | people could write horrible things with it
02:20:48.960 | like misinformation, propaganda, hate speech, ban it now."
02:20:52.720 | Then writing doomers come in, akin to the AI doomers.
02:20:57.580 | Imagine if everyone can get a ball pen.
02:21:00.980 | This could destroy society.
02:21:02.300 | There should be a law against using ball pen
02:21:04.180 | to write hate speech, regulate ball pens now.
02:21:07.240 | And then the pencil industry mogul says,
02:21:09.720 | "Yeah, ball pens are very dangerous,
02:21:12.680 | unlike pencil writing, which is erasable.
02:21:15.740 | Ball pen writing stays forever.
02:21:18.460 | Government should require a license for a pen manufacturer."
02:21:21.740 | I mean, this does seem to be part of human psychology
02:21:27.660 | when it comes up against new technology.
02:21:32.280 | So what deep insights can you speak to about this?
02:21:37.280 | - Well, there is a natural fear of new technology
02:21:42.720 | and the impact it can have on society.
02:21:45.320 | And people have kind of instinctive reaction
02:21:48.940 | to the world they know being threatened
02:21:53.700 | by major transformations that are either cultural phenomena
02:21:58.320 | or technological revolutions.
02:22:01.000 | And they fear for their culture, they fear for their job,
02:22:05.660 | they fear for the future of their children
02:22:09.980 | and their way of life, right?
02:22:13.800 | So any change is feared.
02:22:16.920 | And you see this, you know, along history,
02:22:20.380 | like any technological revolution or cultural phenomenon
02:22:24.060 | was always accompanied by, you know, groups or reaction
02:22:29.060 | in the media that basically attributed all the problems,
02:22:34.600 | the current problems of society
02:22:37.780 | to that particular change, right?
02:22:40.660 | Electricity was going to kill everyone at some point.
02:22:44.400 | You know, the train was going to be a horrible thing
02:22:47.880 | because, you know, you can't breathe
02:22:49.180 | past 50 kilometers an hour.
02:22:50.860 | And so there's a wonderful website
02:22:54.000 | called the Pessimist Archive,
02:22:55.640 | which has all those newspaper clips
02:22:59.420 | of all the horrible things people imagined would arrive
02:23:02.800 | because of either technological innovation
02:23:07.480 | or a cultural phenomenon.
02:23:10.840 | You know, it's just wonderful examples of, you know,
02:23:15.840 | jazz or comic books being blamed for unemployment
02:23:22.400 | or, you know, young people not wanting to work anymore
02:23:26.360 | and things like that, right?
02:23:27.360 | And that has existed for centuries.
02:23:30.700 | And it's, you know, knee-jerk reactions.
02:23:38.520 | The question is, you know, do we embrace change
02:23:40.800 | or do we resist it?
02:23:44.080 | And what are the real dangers
02:23:47.200 | as opposed to the imagined ones?
02:23:50.500 | - So people worry about,
02:23:53.800 | I think one thing they worry about with big tech,
02:23:55.880 | something we've been talking about over and over,
02:23:58.640 | but I think worth mentioning again,
02:24:02.320 | they worry about how powerful AI will be
02:24:05.960 | and they worry about it being in the hands
02:24:08.720 | of one centralized power
02:24:10.080 | of just a handful of central control.
02:24:13.760 | And so that's the skepticism with big tech.
02:24:15.880 | You can make, these companies can make
02:24:17.560 | a huge amount of money and control this technology
02:24:21.800 | and by so doing, you know, take advantage,
02:24:26.680 | abuse the little guy in society.
02:24:29.080 | - Well, that's exactly why we need open source platforms.
02:24:31.920 | - Yeah, I just wanted to nail the point home more and more.
02:24:36.920 | - Yes.
02:24:38.480 | - So let me ask you on your,
02:24:40.600 | like I said, you do get a little bit flavorful
02:24:45.200 | on the internet.
02:24:46.760 | Joscha Bach tweeted something that you LOL'd at
02:24:50.800 | in reference to HAL 9000.
02:24:53.320 | Quote, "I appreciate your argument
02:24:55.560 | "and I fully understand your frustration,
02:24:57.420 | "but whether the pod bay doors should be opened or closed
02:25:00.960 | "is a complex and nuanced issue."
02:25:03.840 | So you're at the head of Meta AI.
02:25:06.940 | You know, this is something that really worries me
02:25:12.000 | that AI, our AI overlords will speak down to us
02:25:16.640 | with corporate speak of this nature
02:25:20.420 | and you sort of resist that with your way of being.
02:25:23.400 | Is this something you can just comment on,
02:25:27.100 | sort of working at a big company,
02:25:29.560 | how you can avoid the over-fearing that, I suppose,
02:25:34.560 | through caution creates harm?
02:25:41.360 | - Yeah, again, I think the answer to this
02:25:43.880 | is open source platforms and then enabling
02:25:47.760 | a widely diverse set of people to build AI assistants
02:25:52.760 | that represent the diversity of cultures, opinions,
02:25:57.320 | languages, and value systems across the world
02:26:00.000 | so that you're not bound to just be brainwashed
02:26:05.000 | by a particular way of thinking because of a single AI entity.
02:26:10.000 | So, I mean, I think it's a really, really important question
02:26:13.960 | for society and the problem I'm seeing is that,
02:26:17.440 | which is why I've been so vocal
02:26:21.880 | and sometimes a little sardonic about it.
02:26:25.160 | - Never stop, never stop, Yann.
02:26:27.720 | (laughing)
02:26:28.640 | We love it.
02:26:29.480 | - Is because I see the danger of this concentration of power
02:26:33.000 | through proprietary AI systems
02:26:36.400 | as a much bigger danger than everything else.
02:26:39.900 | If we really want diversity of opinion in a future
02:26:44.900 | where we'll all be interacting
02:26:51.080 | through AI systems, we need those systems to be diverse
02:26:54.280 | for the preservation of diversity of ideas
02:26:58.400 | and creeds and political opinions and whatever
02:27:03.400 | and the preservation of democracy.
02:27:07.840 | And what works against this is people who think that
02:27:12.840 | for reasons of security, we should keep AI systems
02:27:17.920 | under lock and key because it's too dangerous
02:27:20.280 | to put it in the hands of everybody.
02:27:24.200 | Because it could be used by terrorists or something.
02:27:26.720 | That would lead to potentially a very bad future
02:27:33.800 | in which all of our information diet is controlled
02:27:39.060 | by a small number of companies through proprietary systems.
02:27:43.140 | - Do you trust humans with this technology
02:27:47.640 | to build systems that are on the whole good for humanity?
02:27:53.280 | Isn't that what democracy and free speech is all about?
02:27:56.560 | - I think so.
02:27:57.480 | - Do you trust institutions to do the right thing?
02:28:00.400 | Do you trust people to do the right thing?
02:28:03.160 | And yeah, there's bad people who are gonna do bad things
02:28:05.400 | but they're not going to have superior technology
02:28:07.780 | to the good people.
02:28:08.620 | So then it's gonna be my good AI against your bad AI.
02:28:12.600 | I mean, it's the examples that we were just talking about
02:28:16.380 | of maybe some rogue country will build some AI system
02:28:22.320 | that's gonna try to convince everybody
02:28:23.960 | to go into a civil war or something
02:28:27.480 | or elect a favorable ruler.
02:28:31.880 | But then they will have to go past our AI systems.
02:28:36.600 | - An AI system with a strong Russian accent
02:28:38.760 | will be trying to convince our--
02:28:40.440 | - And doesn't put any articles in their sentences.
02:28:43.260 | - Well, it'll be at the very least absurdly comedic.
02:28:49.300 | - Okay, so since we talked about the physical reality,
02:28:54.300 | I'd love to ask your vision of the future with robots
02:28:59.160 | in this physical reality.
02:29:00.580 | So many of the kinds of intelligence
02:29:03.240 | you've been speaking about would empower robots
02:29:06.720 | to be more effective collaborators with us humans.
02:29:10.480 | So since Tesla's Optimus team has been showing us
02:29:15.180 | some progress on humanoid robots,
02:29:17.160 | I think it really reinvigorated the whole industry
02:29:20.560 | that I think Boston Dynamics has been leading
02:29:22.860 | for a very, very long time.
02:29:24.280 | So now there's all kinds of companies,
02:29:25.660 | Figure AI, obviously, Boston Dynamics.
02:29:29.120 | - Unitree.
02:29:30.080 | - Unitree, but there's like a lot of them.
02:29:33.500 | It's great. - There's a lot of them.
02:29:34.340 | - It's great, I mean, I love it.
02:29:36.340 | So do you think there'll be millions
02:29:41.540 | of humanoid robots walking around soon?
02:29:44.020 | - Not soon, but it's gonna happen.
02:29:46.260 | Like the next decade, I think,
02:29:47.380 | is gonna be really interesting in robots.
02:29:49.500 | Like the emergence of the robotics industry
02:29:53.660 | has been in the waiting for 10, 20 years
02:29:57.720 | without really emerging,
02:29:58.700 | other than for like kind of pre-programmed behavior
02:30:01.660 | and stuff like that.
02:30:02.660 | And the main issue is, again, Moravec's paradox,
02:30:08.700 | like how do we get the system to understand
02:30:10.420 | how the world works and kind of plan actions?
02:30:13.200 | And so we can do it for really specialized tasks.
02:30:16.620 | And the way Boston Dynamics goes about it is basically
02:30:21.620 | with a lot of handcrafted dynamical models
02:30:25.900 | and careful planning in advance,
02:30:29.300 | which is very classical robotics
02:30:30.780 | with a lot of innovation, a little bit of perception.
02:30:34.220 | But it's still not,
02:30:35.820 | like they can't build a domestic robot, right?
02:30:38.800 | And we're still some distance away
02:30:43.820 | from completely autonomous level five driving.
02:30:46.220 | And we're certainly very far away
02:30:49.540 | from having level five autonomous driving
02:30:53.660 | by a system that can train itself
02:30:55.820 | by driving 20 hours like any 17-year-old.
02:30:59.500 | So until we have, again, world models,
02:31:06.420 | systems that can train themselves
02:31:09.300 | to understand how the world works,
02:31:13.060 | we're not gonna have significant progress in robotics.
02:31:16.940 | So a lot of the people working on robotic hardware
02:31:20.560 | at the moment are betting or banking on the fact
02:31:24.300 | that AI is gonna make sufficient progress towards that.
02:31:28.060 | - And they're hoping to discover a product in it too.
02:31:31.060 | Before you have a really strong world model,
02:31:34.660 | there'll be an almost strong world model.
02:31:38.060 | And people are trying to find a product
02:31:41.440 | in a clumsy robot, I suppose.
02:31:43.720 | Like not a perfectly efficient robot.
02:31:45.720 | So there's the factory setting where humanoid robots
02:31:48.300 | can help automate some aspects of the factory.
02:31:51.260 | I think that's a crazy difficult task
02:31:53.340 | 'cause of all the safety required and all this kind of stuff.
02:31:56.000 | I think in the home is more interesting,
02:31:58.260 | but then you start to think,
02:32:00.420 | I think you mentioned loading the dishwasher, right?
02:32:03.200 | - Yeah.
02:32:04.580 | - I suppose that's one of the main problems
02:32:06.640 | you're working on.
02:32:07.620 | - I mean, there's cleaning up, cleaning the house,
02:32:12.620 | clearing up the table after a meal, washing the dishes,
02:32:18.720 | all those tasks, cooking.
02:32:21.600 | I mean, all the tasks that in principle could be automated,
02:32:24.040 | but are actually incredibly sophisticated,
02:32:26.720 | really complicated.
02:32:28.320 | - But even just basic navigation
02:32:29.720 | around a space full of uncertainty.
02:32:32.120 | - That sort of works.
02:32:33.160 | Like you can sort of do this now.
02:32:35.560 | Navigation is fine.
02:32:37.280 | - Well, navigation in a way that's compelling
02:32:40.100 | to us humans is a different thing.
02:32:42.900 | - Yeah, it's not gonna be necessarily.
02:32:45.380 | I mean, we have demos actually,
02:32:46.600 | 'cause there is a so-called embodied AI group at FAIR.
02:32:51.600 | And they've been not building their own robots,
02:32:55.180 | but using commercial robots.
02:32:57.200 | And you can tell a robot dog go to the fridge
02:33:02.360 | and they can actually open the fridge
02:33:03.660 | and they can probably pick up a can in the fridge
02:33:05.900 | and stuff like that and bring it to you.
02:33:09.380 | So it can navigate, it can grab objects
02:33:12.640 | as long as it's been trained to recognize them,
02:33:14.820 | which vision systems work pretty well nowadays.
02:33:17.200 | But it's not like a completely general robot
02:33:22.420 | that would be sophisticated enough to do things
02:33:26.180 | like clearing up the dinner table.
02:33:29.300 | - Yeah, to me, that's an exciting future
02:33:33.300 | of getting humanoid robots,
02:33:35.080 | robots in general, in the whole, more and more.
02:33:36.740 | Because that gets humans to really directly interact
02:33:40.340 | with AI systems in the physical space.
02:33:42.120 | And in so doing, it allows us to philosophically,
02:33:45.260 | psychologically explore our relationships with robots.
02:33:48.100 | It can be really, really, really interesting.
02:33:50.760 | So I hope you make progress on the whole JEPA thing soon.
02:33:54.340 | - Well, I mean, I hope things work as planned.
02:33:58.640 | I mean, again, we've been working on this idea
02:34:03.180 | of self-supervised learning from video for 10 years.
02:34:07.120 | And only made significant progress in the last two or three.
02:34:12.080 | - And actually, you've mentioned that there's a lot
02:34:14.240 | of interesting breakthroughs that can happen
02:34:15.760 | without having access to a lot of compute.
02:34:18.380 | So if you're interested in doing a PhD
02:34:20.480 | and this kind of stuff, there's a lot of possibilities still
02:34:24.140 | to do innovative work.
02:34:25.600 | So what advice would you give to an undergrad
02:34:28.040 | that's looking to go to grad school and do a PhD?
02:34:32.340 | - So basically, I've listed them already,
02:34:35.600 | this idea of how do you train a world model by observation.
02:34:38.660 | And you don't have to train necessarily
02:34:41.400 | on gigantic data sets or...
02:34:44.320 | I mean, it could turn out to be necessary
02:34:47.080 | to actually train on large data sets,
02:34:48.800 | to have emergent properties like we have with LLMs.
02:34:51.780 | But I think there is a lot of good ideas
02:34:53.080 | that can be done without necessarily scaling up.
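To make that first research direction a bit more concrete, here is a minimal, hypothetical sketch of training a world model from passive observation by predicting in representation space rather than in pixel space (a JEPA-flavored setup). It is an illustration under toy assumptions only, not Meta's I-JEPA or V-JEPA code; the `Encoder` and `Predictor` names, sizes, and the EMA trick are placeholders.

```python
# A minimal, hypothetical sketch of learning a world model from passive observation
# by predicting in representation space rather than pixel space (JEPA-flavored).
# Illustration only, not Meta's I-JEPA / V-JEPA code.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an observation (a flat vector standing in for a video frame) to a latent."""
    def __init__(self, obs_dim: int = 64, latent_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))

    def forward(self, x):
        return self.net(x)

class Predictor(nn.Module):
    """Predicts the latent of the next observation from the current latent."""
    def __init__(self, latent_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))

    def forward(self, z):
        return self.net(z)

encoder, predictor, target_encoder = Encoder(), Predictor(), Encoder()
target_encoder.load_state_dict(encoder.state_dict())   # slowly-updated copy, to limit collapse
opt = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

for step in range(100):
    # Pretend (obs_t, obs_next) are consecutive frames sampled from passively observed video.
    obs_t, obs_next = torch.randn(32, 64), torch.randn(32, 64)
    z_pred = predictor(encoder(obs_t))                  # predict in representation space...
    with torch.no_grad():
        z_target = target_encoder(obs_next)             # ...not in pixel space
    loss = ((z_pred - z_target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                               # EMA update of the target encoder
        for p, tp in zip(encoder.parameters(), target_encoder.parameters()):
            tp.mul_(0.99).add_(0.01 * p)
```

The point of the sketch is only the shape of the objective: predict the representation of the next observation, not its pixels, so the model can ignore unpredictable detail.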
02:34:56.760 | Then there is how do you do planning
02:34:58.480 | with a learned world model?
02:35:00.600 | If the world the system evolves in
02:35:02.660 | is not the physical world,
02:35:03.760 | but it's the world of, let's say, the internet
02:35:06.800 | or some sort of world where an action
02:35:11.540 | consists in doing a search in a search engine
02:35:14.060 | or interrogating a database or running a simulation
02:35:18.180 | or calling a calculator
02:35:19.820 | or solving a differential equation,
02:35:21.520 | how do you get a system to actually plan
02:35:24.500 | a sequence of actions to give the solution to a problem?
02:35:29.720 | And so the question of planning
02:35:32.200 | is not just a question of planning physical actions.
02:35:35.680 | It could be planning actions to use tools
02:35:38.960 | for a dialog system or for any kind of intelligent system.
02:35:42.320 | And there's some work on this,
02:35:45.480 | but not a huge amount.
02:35:47.080 | Some work at FAIR, one called Toolformer,
02:35:50.840 | which was a couple of years ago,
02:35:52.480 | and some more recent work on planning.
02:35:55.460 | But I don't think we have a good solution
02:35:59.700 | for any of that.
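To make "planning in a non-physical world" concrete, here is a toy, hypothetical sketch in which the actions are tool calls (search, calculator, database). The `TOOLS` dict plays the role of a trivial world model predicting the effect of each call, and `learned_cost` is a stand-in for a learned critic; this is illustrative only, not Toolformer's actual method.

```python
# Toy sketch: planning a sequence of tool calls with a (stand-in) learned world model.
# Names and the heuristic cost are hypothetical; this is not FAIR's Toolformer.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass(frozen=True)
class State:
    """Abstract task state: whatever facts the system has gathered so far."""
    facts: Tuple[str, ...]

# The "actions" are tool invocations; each entry predicts the effect of one call.
TOOLS: Dict[str, Callable[[State], State]] = {
    "search":     lambda s: State(s.facts + ("search_result",)),
    "calculator": lambda s: State(s.facts + ("computed_value",)),
    "database":   lambda s: State(s.facts + ("db_record",)),
}

def learned_cost(state: State) -> float:
    """Stand-in for a learned critic/energy: lower when the state looks closer to an answer."""
    needed = {"search_result", "computed_value"}        # toy notion of "enough to answer"
    return float(len(needed - set(state.facts)))

def plan_with_tools(start: State, horizon: int = 4) -> List[str]:
    """Greedy planning in the space of tool calls, guided by the cost of imagined outcomes."""
    state, plan = start, []
    for _ in range(horizon):
        if learned_cost(state) == 0:
            break
        # Imagine each tool call with the world model, keep the lowest-cost outcome.
        best_tool = min(TOOLS, key=lambda t: learned_cost(TOOLS[t](state)))
        state = TOOLS[best_tool](state)
        plan.append(best_tool)
    return plan

if __name__ == "__main__":
    print(plan_with_tools(State(facts=())))             # e.g. ['search', 'calculator']
```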
02:36:00.760 | Then there is the question of hierarchical planning.
02:36:03.580 | So the example I mentioned of planning a trip
02:36:07.980 | from New York to Paris, that's hierarchical,
02:36:11.360 | but almost every action that we take
02:36:13.780 | involves hierarchical planning in some sense.
02:36:17.460 | And we really have absolutely no idea how to do this.
02:36:20.640 | Like there's zero demonstration of hierarchical planning
02:36:25.640 | in AI where the various levels of representations
02:36:30.640 | that are necessary have been learned.
02:36:36.440 | We can do like two-level hierarchical planning
02:36:41.100 | when we design the two levels.
02:36:41.100 | So for example, you have like a dog-like robot, right?
02:36:44.840 | You want it to go from the living room to the kitchen.
02:36:48.300 | You can plan a path that avoids the obstacle.
02:36:51.260 | And then you can send this to a lower level planner
02:36:55.180 | that figures out how to move the legs
02:36:56.960 | to kind of follow that trajectory, right?
02:36:59.540 | So that works, but that two-level planning
02:37:01.600 | is designed by hand, right?
02:37:03.900 | We specify what the proper levels of abstraction,
02:37:09.820 | the representation at each level of abstraction have to be.
02:37:13.140 | How do you learn this?
02:37:14.100 | How do you learn that hierarchical representation
02:37:16.620 | of action plans, right?
02:37:19.800 | We, you know, with ConvNets and deep learning,
02:37:22.280 | we can train the system to learn hierarchical representations
02:37:25.320 | of percepts.
02:37:26.300 | What is the equivalent when what you're trying
02:37:29.200 | to represent are action plans?
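As a concrete, deliberately hand-designed version of the two-level example above, here is a minimal sketch assuming a grid-world robot dog: the high level plans a collision-free path of waypoints with breadth-first search, and a stub low-level controller turns each waypoint into leg commands. Both levels of abstraction are specified by hand, which is exactly the limitation being pointed out; the function names and the gait stub are hypothetical.

```python
# Hand-designed two-level hierarchical planning, as a toy sketch (not learned).
# High level: plan a collision-free path on a coarse grid (living room -> kitchen).
# Low level: a stub gait controller that turns each waypoint into leg commands.
from collections import deque
from typing import List, Tuple

Cell = Tuple[int, int]

def high_level_plan(grid: List[str], start: Cell, goal: Cell) -> List[Cell]:
    """Breadth-first search over grid cells; '#' marks obstacles."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:                                 # reconstruct the waypoint path
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#" \
                    and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return []                                            # no path found

def low_level_controller(a: Cell, b: Cell) -> List[str]:
    """Stub for the leg-motion planner: emits a fixed gait toward the next waypoint."""
    heading = {(1, 0): "south", (-1, 0): "north", (0, 1): "east", (0, -1): "west"}
    direction = heading[(b[0] - a[0], b[1] - a[1])]
    return [f"step {leg} leg toward {direction}"
            for leg in ("front-left", "rear-right", "front-right", "rear-left")]

if __name__ == "__main__":
    floor = ["....#....",
             "....#....",
             ".........",
             "....#...."]
    waypoints = high_level_plan(floor, start=(0, 0), goal=(0, 8))
    for a, b in zip(waypoints, waypoints[1:]):
        _ = low_level_controller(a, b)                   # would be sent to the gait controller
    print("waypoints:", waypoints)
```

The open question in the conversation is how to learn both the waypoint level and the gait level, plus the interface between them, rather than specifying them by hand as this sketch does.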
02:37:30.760 | - For action plans, yeah.
02:37:32.140 | So you want basically a robot dog or humanoid robot
02:37:35.520 | that turns on and travels from New York
02:37:38.240 | to Paris all by itself.
02:37:40.240 | - For example.
02:37:42.080 | - All right.
02:37:43.080 | It might have some trouble at the TSA, but yeah.
02:37:47.420 | - No, but even doing something fairly simple,
02:37:49.100 | like a household task, like, you know, cooking or something.
02:37:53.860 | - Yeah, there's a lot involved.
02:37:55.340 | It's a super complex task.
02:37:57.140 | We take, and once again, we take it for granted.
02:37:59.540 | What hope do you have for the future of humanity?
02:38:05.120 | We're talking about so many exciting technologies,
02:38:07.820 | so many exciting possibilities.
02:38:09.540 | What gives you hope when you look out
02:38:12.100 | over the next 10, 20, 50, 100 years?
02:38:15.100 | If you look at social media, there's a lot of,
02:38:17.140 | there's wars going on, there's division, there's hatred,
02:38:21.660 | all this kind of stuff.
02:38:22.860 | That's also part of humanity.
02:38:24.620 | But amidst all that, what gives you hope?
02:38:27.000 | - I love that question.
02:38:30.300 | We can make humanity smarter with AI.
02:38:37.900 | Okay.
02:38:40.340 | I mean, AI basically will amplify human intelligence.
02:38:45.300 | It's as if every one of us will have a staff
02:38:50.020 | of smart AI assistants.
02:38:52.180 | They might be smarter than us.
02:38:53.740 | They'll do our bidding,
02:38:55.860 | perhaps execute a task in ways that are much better
02:39:03.680 | than we could do ourselves,
02:39:05.320 | because they'd be smarter than us.
02:39:07.820 | And so it's like everyone would be the boss
02:39:10.620 | of a staff of super smart virtual people.
02:39:15.680 | So we shouldn't feel threatened by this
02:39:18.120 | any more than we should feel threatened
02:39:19.640 | by being the manager of a group of people,
02:39:22.920 | some of whom are more intelligent than us.
02:39:24.880 | I certainly have a lot of experience with this,
02:39:29.720 | of having people working with me who are smarter than me.
02:39:34.200 | That's actually a wonderful thing.
02:39:36.400 | So having machines that are smarter than us,
02:39:39.960 | that assist us in all of our tasks, our daily lives,
02:39:43.880 | whether it's professional or personal,
02:39:45.520 | I think would be an absolutely wonderful thing.
02:39:47.960 | Because intelligence is the commodity
02:39:52.280 | that is most in demand.
02:39:54.080 | That's really what, I mean,
02:39:55.520 | all the mistakes that humanity makes
02:39:56.960 | are because of a lack of intelligence, really,
02:39:58.960 | or lack of knowledge, which is related.
02:40:01.600 | So making people smarter can only be better.
02:40:07.080 | I mean, for the same reason that public education
02:40:09.640 | is a good thing.
02:40:12.280 | And books are a good thing.
02:40:14.800 | And the internet is also a good thing intrinsically.
02:40:17.360 | And even social networks are a good thing
02:40:19.520 | if you run them properly.
02:40:21.560 | It's difficult, but you can.
02:40:23.200 | Because it helps the communication of information
02:40:30.680 | and knowledge and the transmission of knowledge.
02:40:33.880 | So AI is gonna make humanity smarter.
02:40:36.440 | And the analogy I've been using is the fact
02:40:41.080 | that perhaps an equivalent event in the history of humanity
02:40:46.080 | to what might be provided by the generalization of AI assistants
02:40:52.320 | is the invention of the printing press.
02:40:55.240 | It made everybody smarter.
02:40:56.960 | The fact that people could have access to books.
02:41:01.960 | Books were a lot cheaper than they were before.
02:41:06.400 | And so a lot more people had an incentive to learn to read,
02:41:10.520 | which wasn't the case before.
02:41:11.920 | And people became smarter.
02:41:17.400 | It enabled the Enlightenment, right?
02:41:21.120 | There wouldn't have been an Enlightenment
02:41:22.200 | without the printing press.
02:41:24.360 | It enabled philosophy, rationalism,
02:41:29.360 | escape from religious doctrine, democracy, science,
02:41:35.840 | and certainly without this there wouldn't have been
02:41:40.840 | the American Revolution or the French Revolution.
02:41:43.400 | And so we'd still be under feudal regimes, perhaps.
02:41:47.760 | And so it completely transformed the world
02:41:53.840 | because people became smarter
02:41:55.360 | and kind of learned about things.
02:41:57.680 | Now, it also created 200 years of revolution.
02:42:03.680 | It created 200 years of essentially religious conflicts
02:42:07.520 | in Europe because the first thing that people read
02:42:10.880 | was the Bible and realized that perhaps
02:42:15.520 | there was a different interpretation of the Bible
02:42:17.280 | than what the priests were telling them.
02:42:20.000 | And so that created the Protestant movement
02:42:22.840 | and created the rift.
02:42:23.920 | And in fact, the Catholic Church didn't like the idea
02:42:27.600 | of the printing press, but they had no choice.
02:42:30.080 | And so it had some bad effects and some good effects.
02:42:32.880 | I don't think anyone today would say
02:42:34.320 | that the invention of the printing press
02:42:36.000 | had an overall negative effect,
02:42:38.320 | despite the fact that it created 200 years
02:42:41.240 | of religious conflicts in Europe.
02:42:44.480 | Now, compare this.
02:42:45.920 | And I thought I was very proud of myself
02:42:49.560 | to come up with this analogy,
02:42:51.720 | but realized someone else came up with the same idea before me.
02:42:55.640 | Compare this with what happened in the Ottoman Empire.
02:42:59.000 | The Ottoman Empire banned the printing press
02:43:02.800 | for 200 years.
02:43:04.000 | And it didn't ban it for all languages, only for Arabic.
02:43:11.840 | You could actually print books in Latin or Hebrew
02:43:16.000 | or whatever in the Ottoman Empire, just not in Arabic.
02:43:19.360 | And I thought it was because the rulers
02:43:25.760 | just wanted to preserve the control over the population
02:43:29.520 | and the dogma, religious dogma and everything.
02:43:33.040 | But after talking with the UAE Minister of AI,
02:43:37.280 | Omar Al Olama,
02:43:40.120 | he told me no, there was another reason.
02:43:44.520 | And the other reason was that it was to preserve
02:43:52.280 | the corporation of calligraphers.
02:43:52.280 | There's an art form, which is writing those beautiful
02:44:00.320 | Arabic poems or whatever religious text in this thing.
02:44:04.880 | And it was a very powerful corporation of scribes,
02:44:07.440 | basically, that kind of ran a big chunk of the empire,
02:44:12.240 | and we couldn't put them out of business.
02:44:14.160 | So they banned the printing press in part
02:44:16.440 | to protect that business.
02:44:18.560 | Now, what's the analogy for AI today?
02:44:23.320 | Who are we protecting by banning AI?
02:44:25.400 | Who are the people who are asking that AI be regulated
02:44:28.880 | to protect their jobs?
02:44:31.800 | And of course, it's a real question
02:44:35.240 | of what is going to be the effect
02:44:37.560 | of technological transformation like AI on the job market
02:44:42.560 | and the labor market.
02:44:45.280 | And there are economists who are much more expert
02:44:48.400 | at this than I am, but when I talk to them,
02:44:50.320 | they tell us we're not gonna run out of jobs.
02:44:54.680 | This is not gonna cause mass unemployment.
02:44:57.800 | This is just gonna be gradual shift
02:45:01.040 | of different professions.
02:45:02.320 | The professions that are gonna be hot 10 or 15 years from now,
02:45:05.920 | We have no idea today what they're gonna be.
02:45:09.400 | The same way if we go back 20 years in the past,
02:45:12.200 | like who could have thought 20 years ago
02:45:15.040 | that like the hottest job even like 10 years ago
02:45:19.040 | was mobile app developer, like smartphones weren't invented.
02:45:23.400 | - Most of the jobs of the future might be in the metaverse.
02:45:27.080 | - Well, it could be, yeah.
02:45:29.120 | - But the point is you can't possibly predict.
02:45:31.960 | But you're right, I mean, you made a lot of strong points
02:45:34.680 | and I believe that people are fundamentally good.
02:45:38.520 | And so if AI, especially open source AI
02:45:42.680 | can make them smarter,
02:45:45.840 | it just empowers the goodness in humans.
02:45:48.400 | - So I share that feeling, okay?
02:45:50.880 | I think people are fundamentally good.
02:45:52.800 | And in fact, a lot of doomers are doomers
02:45:56.680 | because they don't think that people are fundamentally good.
02:45:59.720 | And they either don't trust people
02:46:04.480 | or they don't trust the institution to do the right thing
02:46:07.920 | so that people behave properly.
02:46:09.480 | - Well, I think both you and I believe in humanity.
02:46:13.560 | And I think I speak for a lot of people
02:46:16.480 | in saying thank you for pushing the open source movement,
02:46:20.120 | pushing to making both research in AI open source,
02:46:24.320 | making it available to people and also the models themselves
02:46:27.760 | making it open source.
02:46:28.680 | So thank you for that.
02:46:30.360 | And thank you for speaking your mind
02:46:32.280 | in such colorful, beautiful ways on the internet.
02:46:34.320 | I hope you never stop.
02:46:35.720 | You're one of the most fun people I know
02:46:37.880 | and get to be a fan of.
02:46:39.040 | So yeah, thank you for speaking to me once again.
02:46:42.360 | And thank you for being you.
02:46:44.000 | - Thank you, Lex.
02:46:45.640 | - Thanks for listening to this conversation with Yann LeCun.
02:46:48.320 | To support this podcast,
02:46:49.640 | please check out our sponsors in the description.
02:46:52.240 | And now let me leave you with some words
02:46:54.200 | from Arthur C. Clarke.
02:46:55.680 | The only way to discover the limits of the possible
02:46:59.840 | is to go beyond them and to the impossible.
02:47:03.560 | Thank you for listening and hope to see you next time.
02:47:07.760 | (upbeat music)
02:47:10.360 | (upbeat music)