
Yann Lecun: Meta AI, Open Source, Limits of LLMs, AGI & the Future of AI | Lex Fridman Podcast #416


Chapters

0:00 Introduction
2:18 Limits of LLMs
13:54 Bilingualism and thinking
17:46 Video prediction
25:07 JEPA (Joint-Embedding Predictive Architecture)
28:15 JEPA vs LLMs
37:31 DINO and I-JEPA
38:51 V-JEPA
44:22 Hierarchical planning
50:40 Autoregressive LLMs
66:06 AI hallucination
71:30 Reasoning in AI
89:02 Reinforcement learning
94:10 Woke AI
103:48 Open source
107:26 AI and ideology
109:58 Marc Andreessen
117:56 Llama 3
124:20 AGI
128:48 AI doomers
144:38 Joscha Bach
148:51 Humanoid robots
158:00 Hope for the future

Whisper Transcript

00:00:00.000 | I see the danger of this concentration
00:00:02.180 | of power through proprietary AI systems
00:00:06.000 | as a much bigger danger than everything else.
00:00:08.920 | What works against this is people
00:00:11.960 | who think that for reasons of security,
00:00:15.120 | we should keep AI systems under lock and key,
00:00:18.560 | because it's too dangerous to put it in the hands of everybody.
00:00:22.040 | That would lead to a very bad future
00:00:25.360 | in which all of our information diet
00:00:27.640 | is controlled by a small number of companies
00:00:30.800 | through proprietary systems.
00:00:32.360 | I believe that people are fundamentally good.
00:00:34.320 | And so if AI, especially open-source AI,
00:00:38.480 | can make them smarter, it just empowers the goodness
00:00:43.320 | in humans.
00:00:44.240 | So I share that feeling, OK?
00:00:46.740 | I think people are fundamentally good.
00:00:50.280 | And in fact, a lot of doomers are doomers,
00:00:52.480 | because they don't think that people are fundamentally good.
00:00:55.180 | (air whooshing)
00:00:57.680 | - The following is a conversation with Yann LeCun,
00:01:01.060 | his third time on this podcast.
00:01:02.860 | He is the chief AI scientist at Meta,
00:01:05.500 | professor at NYU, Turing Award winner,
00:01:08.780 | and one of the seminal figures
00:01:10.840 | in the history of artificial intelligence.
00:01:13.180 | He and Meta AI have been big proponents
00:01:16.900 | of open-sourcing AI development
00:01:19.540 | and have been walking the walk
00:01:21.380 | by open-sourcing many of their biggest models,
00:01:23.980 | including LLAMA 2 and eventually LLAMA 3.
00:01:28.180 | Also, Yann has been an outspoken critic
00:01:31.900 | of those people in the AI community
00:01:34.380 | who warned about the looming danger
00:01:36.520 | and existential threat of AGI.
00:01:39.660 | He believes the AGI will be created one day,
00:01:43.580 | but it will be good.
00:01:45.500 | It will not escape human control,
00:01:47.660 | nor will it dominate and kill all humans.
00:01:52.160 | At this moment of rapid AI development,
00:01:54.380 | this happens to be somewhat a controversial position.
00:01:58.840 | And so it's been fun seeing Yann get into a lot of intense
00:02:02.620 | and fascinating discussions online
00:02:04.880 | as we do in this very conversation.
00:02:08.660 | This is the Lex Fridman Podcast.
00:02:10.480 | To support it, please check out our sponsors
00:02:12.460 | in the description.
00:02:13.740 | And now, dear friends, here's Yann LeCun.
00:02:18.000 | You've had some strong statements, technical statements,
00:02:22.420 | about the future of artificial intelligence recently,
00:02:25.580 | throughout your career, actually, but recently as well.
00:02:28.320 | You've said that autoregressive LLMs
00:02:31.940 | are not the way we're going to make progress
00:02:36.780 | towards superhuman intelligence.
00:02:38.740 | These are the large language models like GPT-4,
00:02:41.940 | like LLAMA 2 and 3 soon, and so on.
00:02:44.260 | How do they work,
00:02:45.080 | and why are they not going to take us all the way?
00:02:47.740 | - For a number of reasons.
00:02:49.040 | The first is that there is a number of characteristics
00:02:51.820 | of intelligent behavior.
00:02:53.500 | For example, the capacity to understand the world,
00:02:58.820 | understand the physical world,
00:03:00.320 | the ability to remember and retrieve things,
00:03:05.460 | persistent memory, the ability to reason,
00:03:10.340 | and the ability to plan.
00:03:12.360 | Those are four essential characteristics
00:03:14.140 | of intelligent systems or entities, humans, animals.
00:03:19.140 | LLMs can do none of those,
00:03:23.060 | or they can only do them in a very primitive way.
00:03:26.560 | And they don't really understand the physical world.
00:03:29.700 | They don't really have persistent memory.
00:03:31.340 | They can't really reason, and they certainly can't plan.
00:03:34.420 | And so, if you expect the system to become intelligent
00:03:38.860 | just without having the possibility of doing those things,
00:03:43.580 | you're making a mistake.
00:03:44.980 | That is not to say that autoregressive LLMs are not useful.
00:03:50.900 | They're certainly useful.
00:03:52.100 | That's not to say that they're not interesting,
00:03:55.600 | or that we can't build a whole ecosystem
00:03:58.220 | of applications around them.
00:04:00.180 | Of course we can,
00:04:01.020 | but as a path towards human-level intelligence,
00:04:05.980 | they're missing essential components.
00:04:08.700 | And then there is another tidbit or fact
00:04:11.280 | I think is very interesting.
00:04:14.020 | Those LLMs are trained on enormous amounts of texts,
00:04:16.540 | basically the entirety of all publicly available texts
00:04:20.620 | on the internet, right?
00:04:21.520 | That's typically on the order of 10 to the 13 tokens.
00:04:26.520 | Each token is typically two bytes.
00:04:28.220 | So that's two times 10 to the 13 bytes of training data.
00:04:31.980 | It would take you or me 170,000 years
00:04:35.160 | to just read through this at eight hours a day.
00:04:38.680 | So it seems like an enormous amount of knowledge, right?
00:04:41.320 | That those systems can accumulate.
00:04:43.020 | But then you realize it's really not that much data.
00:04:48.300 | If you talk to developmental psychologists
00:04:52.300 | and they tell you a four-year-old has been awake
00:04:54.540 | for 16,000 hours in his whole life,
00:04:57.620 | and the amount of information
00:05:01.420 | that has reached the visual cortex
00:05:05.740 | of that child in four years,
00:05:08.720 | is about 10 to the 15 bytes.
00:05:12.140 | And you can compute this by estimating
00:05:13.940 | that the optic nerve carries
00:05:16.380 | about 20 megabytes per second, roughly.
00:05:19.700 | And so 10 to the 15 bytes for a four-year-old
00:05:22.220 | versus two times 10 to the 13 bytes
00:05:25.460 | for 170,000 years worth of reading.
00:05:28.700 | What that tells you is that through sensory input,
00:05:33.860 | we see a lot more information than we do through language.
00:05:37.640 | And that despite our intuition,
00:05:40.960 | most of what we learn and most of our knowledge
00:05:43.920 | is through our observation and interaction
00:05:47.000 | with the real world, not through language.
00:05:49.520 | Everything that we learn in the first few years of life
00:05:51.720 | and certainly everything that animals learn
00:05:54.920 | has nothing to do with language.
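
For concreteness, here is the rough arithmetic behind that comparison as a small Python sketch. The 2 bytes per token, 16,000 waking hours, and ~20 MB/s optic-nerve figure are the numbers quoted in the conversation; the 0.75 words-per-token and 250 words-per-minute reading speed are illustrative assumptions.

```python
# Back-of-the-envelope check of the text-vs-vision numbers quoted above.

tokens = 1e13                          # ~all publicly available text
text_bytes = tokens * 2                # ~2 bytes per token -> ~2e13 bytes

# Time to read it all at 8 hours/day.
# Assumption: ~0.75 words per token, ~250 words per minute.
words = tokens * 0.75
reading_seconds = words / (250 / 60)
reading_years = reading_seconds / (8 * 3600 * 365)
print(f"text: {text_bytes:.1e} bytes, ~{reading_years:,.0f} years of reading")

# Visual input of a four-year-old: 16,000 waking hours at ~20 MB/s
# through the optic nerve (figures quoted above).
visual_bytes = 16_000 * 3600 * 20e6
print(f"vision: {visual_bytes:.1e} bytes")          # ~1.2e15 bytes
print(f"vision/text ratio: ~{visual_bytes / text_bytes:.0f}x")
```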
00:05:57.100 | - So it'd be good to maybe push against
00:05:59.400 | some of the intuition behind what you're saying.
00:06:01.720 | So it is true there's several orders of magnitude
00:06:05.900 | more data coming into the human mind.
00:06:08.100 | And the human mind
00:06:11.240 | is able to learn very quickly from that,
00:06:12.980 | filter the data very quickly.
00:06:15.200 | You know, somebody might argue
00:06:16.580 | your comparison between sensory data versus language,
00:06:19.800 | that language is already very compressed.
00:06:23.220 | It already contains a lot more information
00:06:25.260 | than the bytes it takes to store them,
00:06:27.240 | if you compare it to visual data.
00:06:29.340 | So there's a lot of wisdom in language, there's words,
00:06:31.800 | and the way we stitch them together,
00:06:33.820 | it already contains a lot of information.
00:06:36.240 | So is it possible that language alone
00:06:40.620 | already has enough wisdom and knowledge in there
00:06:45.620 | to be able to, from that language,
00:06:48.740 | construct a world model, an understanding of the world,
00:06:52.660 | an understanding of the physical world
00:06:54.700 | that you're saying all LLMs lack?
00:06:56.660 | - So it's a big debate among philosophers
00:07:00.060 | and also cognitive scientists,
00:07:01.740 | like whether intelligence needs to be grounded in reality.
00:07:05.160 | I'm clearly in the camp that, yes,
00:07:09.260 | intelligence cannot appear without some grounding
00:07:12.340 | in some reality, it doesn't need to be physical reality,
00:07:16.980 | it could be simulated,
00:07:17.980 | but the environment is just much richer
00:07:20.860 | than what you can express in language.
00:07:22.340 | Language is a very approximate representation of our percepts
00:07:27.340 | and our mental models, right?
00:07:29.500 | I mean, there's a lot of tasks that we accomplish
00:07:32.220 | where we manipulate a mental model
00:07:35.620 | of the situation at hand,
00:07:38.300 | and that has nothing to do with language.
00:07:40.700 | Everything that's physical, mechanical, whatever,
00:07:43.540 | when we build something, when we accomplish a task,
00:07:47.100 | a model task of grabbing something, et cetera,
00:07:50.260 | we plan for action sequences and we do this
00:07:52.900 | by essentially imagining the result of the outcome
00:07:57.180 | of a sequence of actions that we might imagine.
00:08:01.260 | And that requires mental models
00:08:03.900 | that don't have much to do with language.
00:08:06.060 | And that's, I would argue, most of our knowledge
00:08:09.900 | is derived from that interaction with the physical world.
00:08:13.740 | So a lot of my colleagues who are more interested
00:08:17.420 | in things like computer vision are really on that camp
00:08:20.500 | that AI needs to be embodied, essentially.
00:08:25.100 | And then other people coming from the NLP side,
00:08:28.420 | or maybe some other motivation
00:08:32.860 | don't necessarily agree with that.
00:08:35.020 | And philosophers are split as well.
00:08:37.140 | And the complexity of the world is hard to imagine.
00:08:46.460 | It's hard to represent all the complexities
00:08:51.020 | that we take completely for granted in the real world
00:08:53.580 | that we don't even imagine require intelligence, right?
00:08:55.740 | This is the old Moravec paradox
00:08:58.020 | from the pioneer of robotics, Hans Moravec.
00:09:01.260 | who said, how is it that with computers,
00:09:03.300 | it seems to be easy to do high-level complex tasks
00:09:05.820 | like playing chess and solving integrals
00:09:08.420 | and doing things like that?
00:09:09.700 | Whereas the thing we take for granted that we do every day,
00:09:13.380 | like, I don't know, learning to drive a car
00:09:16.380 | or grabbing an object, we can't do with computers.
00:09:19.980 | And we have LLMs that can pass the bar exam,
00:09:26.820 | so they must be smart.
00:09:29.500 | But then they can't learn to drive in 20 hours
00:09:33.060 | like any 17-year-old.
00:09:35.460 | They can't learn to clear out the dinner table
00:09:38.660 | and fill up the dishwasher like any 10-year-old
00:09:41.100 | can learn in one shot.
00:09:42.220 | Why is that?
00:09:44.500 | Like, what are we missing?
00:09:45.860 | What type of learning or reasoning architecture
00:09:50.700 | or whatever are we missing that basically prevents us
00:09:55.700 | from having level five self-driving cars and domestic robots?
00:10:00.900 | - Can a large language model construct a world model
00:10:05.580 | that does know how to drive
00:10:07.740 | and does know how to fill a dishwasher
00:10:09.340 | but just doesn't know how to deal with visual data
00:10:11.620 | at this time?
00:10:12.580 | So it can operate in a space of concepts.
00:10:17.220 | - So yeah, that's what a lot of people are working on.
00:10:19.980 | So the answer, the short answer is no.
00:10:22.540 | And the more complex answer is you can use all kinds
00:10:26.220 | of tricks to get an LLM to basically digest
00:10:31.220 | visual representations of images or video
00:10:38.740 | or audio for that matter.
00:10:42.380 | And a classical way of doing this
00:10:45.420 | is you train a vision system in some way.
00:10:48.580 | And we have a number of ways to train vision systems.
00:10:51.340 | These are supervised, semi-supervised, self-supervised,
00:10:53.820 | all kinds of different ways.
00:10:55.220 | That will turn any image into a high-level representation.
00:11:01.100 | Basically a list of tokens that are really similar
00:11:04.500 | to the kind of tokens that a typical LLM takes as an input.
00:11:10.700 | And then you just feed that to the LLM
00:11:15.260 | in addition to the text.
00:11:17.140 | And you just expect the LLM to kind of, during training,
00:11:21.620 | to kind of be able to use those representations
00:11:25.500 | to help make decisions.
00:11:27.180 | I mean, there's been work along those lines
00:11:29.140 | for quite a long time.
00:11:30.420 | And now you see those systems, right?
00:11:32.700 | I mean, there are LLMs that have some vision extension.
00:11:36.700 | But these are basically hacks in the sense that those things
00:11:40.060 | are not like trained end-to-end to handle,
00:11:42.500 | to really understand the world.
00:11:43.860 | They're not trained with video, for example.
00:11:46.460 | They don't really understand intuitive physics,
00:11:49.020 | at least not at the moment.
00:11:51.220 | - So you don't think there's something special to you
00:11:53.300 | about intuitive physics, about sort of common sense reasoning
00:11:55.980 | about the physical space, about physical reality?
00:11:59.100 | That to you is a giant leap
00:12:00.780 | that LLMs are just not able to do?
00:12:02.860 | - We're not gonna be able to do this
00:12:04.060 | with the type of LLMs that we are working with today.
00:12:07.860 | And there's a number of reasons for this.
00:12:09.300 | But the main reason is the way LLMs are trained
00:12:14.300 | is that you take a piece of text,
00:12:16.580 | you remove some of the words in that text, you mask them,
00:12:20.300 | you replace them by blank markers,
00:12:22.660 | and you train a gigantic neural net
00:12:24.260 | to predict the words that are missing.
00:12:26.180 | And if you build this neural net in a particular way
00:12:30.300 | so that it can only look at words
00:12:33.220 | that are to the left of the one it's trying to predict,
00:12:36.140 | then what you have is a system that basically
00:12:38.020 | is trained to predict the next word in a text, right?
00:12:40.060 | So then you can feed it a text, a prompt,
00:12:43.460 | and you can ask it to predict the next word.
00:12:45.860 | It can never predict the next word exactly.
00:12:48.220 | And so what it's gonna do is produce
00:12:51.380 | a probability distribution
00:12:52.740 | of all the possible words in your dictionary.
00:12:55.020 | In fact, it doesn't predict words,
00:12:56.260 | it predicts tokens that are kind of sub-word units.
00:12:59.020 | And so it's easy to handle the uncertainty
00:13:01.900 | in the prediction there,
00:13:02.780 | because there's only a finite number of possible words
00:13:05.700 | in the dictionary.
00:13:07.380 | And you can just compute the distribution over them.
00:13:09.900 | Then what the system does is that
00:13:13.020 | it picks a word from that distribution.
00:13:16.860 | Of course, there's a higher chance of picking words
00:13:18.820 | that have a higher probability within the distribution.
00:13:21.420 | So you sample from the distribution
00:13:22.820 | to actually produce a word.
00:13:25.260 | And then you shift that word into the input.
00:13:27.460 | And so that allows the system
00:13:29.820 | now to predict the second word, right?
00:13:32.300 | And once you do this, you shift it into the input, et cetera.
00:13:35.300 | That's called autoregressive prediction,
00:13:37.580 | which is why those LLMs
00:13:39.900 | should be called autoregressive LLMs.
00:13:41.740 | But we just call them LLMs.
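
As a minimal sketch of that autoregressive loop (assuming a PyTorch-style decoder-only `model` that maps a batch of token ids to next-token logits; the function and parameter names are illustrative, not any particular library's API):

```python
import torch

def generate(model, prompt_ids, n_new_tokens, temperature=1.0):
    """Autoregressive decoding as described above: compute a distribution
    over the finite token vocabulary, sample one token from it, shift it
    into the input, and repeat."""
    ids = prompt_ids  # shape (1, T) of token ids
    for _ in range(n_new_tokens):
        logits = model(ids)[:, -1, :]                        # logits for the next token
        probs = torch.softmax(logits / temperature, dim=-1)  # distribution over tokens
        next_id = torch.multinomial(probs, num_samples=1)    # sample, not just argmax
        ids = torch.cat([ids, next_id], dim=1)               # feed it back as input
    return ids
```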
00:13:46.300 | And there is a difference between this kind of process
00:13:50.620 | and a process by which before producing a word,
00:13:54.700 | when you talk, when you and I talk,
00:13:56.660 | you and I are bilingual.
00:13:58.500 | We think about what we're gonna say,
00:14:00.460 | and it's relatively independent of the language
00:14:02.620 | in which we're gonna say it.
00:14:04.420 | When we talk about, I don't know,
00:14:06.980 | let's say a mathematical concept or something,
00:14:09.100 | the kind of thinking that we're doing
00:14:10.940 | and the answer that we're planning to produce
00:14:13.380 | is not linked to whether we're gonna say it
00:14:16.620 | in French or Russian or English.
00:14:19.460 | - Chomsky just rolled his eyes, but I understand.
00:14:21.700 | So you're saying that there's a bigger abstraction
00:14:25.420 | that goes before language and maps onto language.
00:14:30.300 | - Right.
00:14:31.140 | It's certainly true for a lot of thinking that we do.
00:14:34.020 | - Is that obvious that we don't,
00:14:35.780 | like you're saying your thinking is same in French
00:14:39.180 | as it is in English?
00:14:40.380 | - Yeah, pretty much.
00:14:42.060 | - Pretty much, or is this like, how flexible are you?
00:14:45.740 | Like if there's a probability of distribution?
00:14:48.060 | - Well, it depends what kind of thinking, right?
00:14:51.060 | If it's just, if it's like producing puns,
00:14:54.340 | I get much better in French than English about that.
00:14:56.940 | - No, but so we're right.
00:14:58.540 | Is there an abstract representation of puns?
00:15:00.500 | Like is your humor an abstract, like when you tweet
00:15:03.340 | and your tweets are sometimes a little bit spicy,
00:15:05.780 | is there an abstract representation in your brain
00:15:09.180 | of a tweet before it maps onto English?
00:15:11.780 | - There is an abstract representation
00:15:13.380 | of imagining the reaction of a reader to that text.
00:15:18.380 | - Or you start with laughter
00:15:19.940 | and then figure out how to make that happen?
00:15:21.980 | - Or figure out like a reaction you want to cause
00:15:25.620 | and then figure out how to say it, right?
00:15:27.460 | So that it causes that reaction.
00:15:29.100 | But that's like really close to language.
00:15:30.780 | But think about like a mathematical concept
00:15:34.340 | or imagining something you want to build out of wood
00:15:38.380 | or something like this, right?
00:15:40.100 | The kind of thinking you're doing
00:15:41.100 | has absolutely nothing to do with language really.
00:15:43.500 | Like it's not like you have necessarily
00:15:44.980 | like an internal monologue in any particular language.
00:15:47.700 | You're imagining mental models of the thing, right?
00:15:52.180 | I mean, if I ask you to imagine what this water bottle
00:15:55.980 | will look like if I rotate it 90 degrees,
00:16:00.140 | that has nothing to do with language.
00:16:02.500 | And so clearly there is a more abstract level
00:16:06.860 | of representation in which we do most of our thinking
00:16:11.460 | and we plan what we're gonna say.
00:16:13.940 | If the output is uttered words
00:16:18.940 | as opposed to an output being muscle actions, right?
00:16:24.820 | We plan our answer before we produce it.
00:16:29.300 | And LLMs don't do that.
00:16:30.380 | They just produce one word after the other
00:16:32.900 | instinctively if you want.
00:16:34.980 | It's like, it's a bit like the subconscious actions
00:16:39.980 | where you don't, like you're distracted,
00:16:42.820 | you're doing something, you're completely concentrated.
00:16:44.980 | And someone comes to you and ask you a question
00:16:48.300 | and you kind of answer the question.
00:16:49.660 | You don't have time to think about the answer
00:16:51.460 | but the answer is easy so you don't need to pay attention.
00:16:54.060 | You sort of respond automatically.
00:16:55.980 | That's kind of what an LLM does, right?
00:16:58.540 | It doesn't think about its answer really.
00:17:01.220 | It retrieves it because it's accumulated a lot of knowledge
00:17:04.540 | so it can retrieve some things
00:17:06.140 | but it's going to just spit out one token after the other
00:17:10.980 | without planning the answer.
00:17:13.060 | - But you're making it sound just one token after the other,
00:17:17.260 | one token at a time generation is bound to be simplistic.
00:17:22.260 | But if the world model is sufficiently sophisticated,
00:17:28.260 | then even one token at a time,
00:17:30.180 | the most likely sequence of tokens it generates
00:17:35.420 | is going to be a deeply profound thing.
00:17:39.140 | - Okay, but then that assumes that those systems
00:17:42.780 | actually possess an internal world model.
00:17:44.900 | - So it really goes to the,
00:17:46.500 | I think the fundamental question is
00:17:48.780 | can you build a really complete world model,
00:17:53.780 | not complete, but one that has a deep understanding
00:17:57.740 | of the world.
00:17:58.580 | - Yeah, so can you build this first of all by prediction?
00:18:03.580 | - Right.
00:18:04.420 | - And the answer is probably yes.
00:18:06.260 | Can you build it by predicting words?
00:18:10.720 | And the answer is most probably no
00:18:14.180 | because language is very poor in terms of weak
00:18:17.940 | or low bandwidth if you want.
00:18:19.340 | There's just not enough information there.
00:18:21.380 | So building world models means observing the world
00:18:27.140 | and understanding why the world is evolving the way it is.
00:18:32.140 | And then the extra component of a world model
00:18:38.540 | is something that can predict how the world
00:18:41.780 | is going to evolve as a consequence of an action
00:18:44.020 | you might take, right?
00:18:45.520 | So what a world model really is:
00:18:47.020 | here is my idea of the state of the world at time T,
00:18:49.180 | here is an action I might take.
00:18:51.020 | What is the predicted state of the world at time T plus one?
00:18:55.700 | Now that state of the world does not need to represent
00:18:59.340 | everything about the world.
00:19:01.180 | It just needs to represent enough that's relevant
00:19:03.460 | for this planning of the action,
00:19:06.140 | but not necessarily all the details.
00:19:08.440 | Now here is the problem.
00:19:10.260 | You're not going to be able to do this
00:19:11.860 | with generative models.
00:19:14.900 | So a generative model is trained on video,
00:19:16.860 | and we've tried to do this for 10 years.
00:19:18.600 | You take a video, show a system a piece of video,
00:19:22.420 | and then ask it to predict the remainder of the video.
00:19:25.780 | Basically predict what's going to happen.
00:19:27.860 | - One frame at a time.
00:19:29.380 | Do the same thing as sort of the autoregressive LLMs do,
00:19:33.340 | but for video.
00:19:34.220 | - Right.
00:19:35.060 | Either one frame at a time or a group of frames at a time.
00:19:38.220 | But yeah, a large video model, if you want.
00:19:41.180 | The idea of doing this has been floating around
00:19:46.220 | for a long time.
00:19:47.060 | And at FAIR, some of my colleagues and I
00:19:51.060 | have been trying to do this for about 10 years.
00:19:53.380 | And you can't really do the same trick as with LLMs
00:19:58.500 | because LLMs, as I said,
00:20:02.060 | you can't predict exactly which word
00:20:04.180 | is going to follow a sequence of words,
00:20:06.860 | but you can predict the distribution over words.
00:20:09.540 | Now, if you go to video,
00:20:11.580 | what you would have to do is predict the distribution
00:20:13.540 | over all possible frames in a video.
00:20:16.500 | And we don't really know how to do that properly.
00:20:19.980 | We do not know how to represent distributions
00:20:22.420 | over high dimensional continuous spaces
00:20:24.660 | in ways that are useful.
00:20:25.860 | And there lies the main issue.
00:20:31.340 | And the reason we can't do this
00:20:33.060 | is because the world is incredibly more complicated
00:20:37.300 | and richer in terms of information than text.
00:20:40.540 | Text is discrete.
00:20:41.620 | Video is high dimensional and continuous.
00:20:45.020 | A lot of details in this.
00:20:47.260 | So if I take a video of this room
00:20:49.740 | and the video is a camera panning around,
00:20:54.580 | there is no way I can predict everything
00:20:58.340 | that's going to be in the room as I pan around.
00:21:00.100 | The system cannot predict what's going to be in the room
00:21:02.220 | as the camera is panning.
00:21:03.500 | Maybe it's going to predict this is a room
00:21:07.060 | where there is a light
00:21:07.900 | and there is a wall and things like that.
00:21:09.340 | It can't predict what the painting on the wall looks like
00:21:11.700 | or what the texture of the couch looks like.
00:21:14.180 | Certainly not the texture of the carpet.
00:21:16.140 | So there's no way I can predict all those details.
00:21:19.180 | So the way to handle this is,
00:21:23.380 | one way possibly to handle this,
00:21:24.900 | which we've been working for a long time,
00:21:26.420 | is to have a model that has what's called a latent variable.
00:21:29.820 | And the latent variable is fed to a neural net
00:21:33.020 | and it's supposed to represent all the information
00:21:35.220 | about the world that you don't perceive yet.
00:21:37.940 | and with which you need to augment the system
00:21:43.860 | for the prediction to do a good job at predicting pixels,
00:21:47.180 | including the fine texture of the carpet and the couch
00:21:52.180 | and the painting on the wall.
00:21:54.980 | That has been a complete failure, essentially.
00:22:00.180 | And we've tried lots of things.
00:22:01.340 | We tried just straight neural nets.
00:22:03.820 | We tried GANs.
00:22:04.700 | We tried VAEs, all kinds of regularized autoencoders.
00:22:09.700 | We tried many things.
00:22:13.900 | We also tried those kinds of methods
00:22:15.700 | to learn good representations of images or video
00:22:20.260 | that could then be used as input to, for example,
00:22:24.900 | an image classification system.
00:22:26.580 | And that also has basically failed.
00:22:29.580 | Like all the systems that attempt to predict
00:22:32.540 | missing parts of an image or video
00:22:34.820 | from a corrupted version of it, basically.
00:22:40.220 | So I take an image or a video,
00:22:41.660 | corrupt it or transform it in some way,
00:22:44.100 | and then try to reconstruct the complete video or image
00:22:47.500 | from the corrupted version.
00:22:48.900 | And then hope that internally,
00:22:52.180 | the system will develop good representations of images
00:22:54.900 | that you can use for object recognition, segmentation,
00:22:57.620 | whatever it is.
00:22:58.460 | That has been essentially a complete failure.
00:23:01.820 | And it works really well for text.
00:23:04.460 | That's the principle that is used for LLMs, right?
00:23:07.140 | - So where's the failure exactly?
00:23:09.340 | Is it that it's very difficult to form a good representation
00:23:13.420 | of an image, like a good embedding of
00:23:16.740 | all the important information in the image?
00:23:19.340 | Is it in terms of the consistency of image to image
00:23:21.860 | to image to image that forms the video?
00:23:23.980 | Like what are the, if we do a highlight reel
00:23:27.140 | of all the ways you failed, what's that look like?
00:23:30.660 | - Okay, so the reason this doesn't work is,
00:23:35.300 | first of all, I have to tell you exactly what doesn't work
00:23:37.220 | because there is something else that does work.
00:23:40.060 | So the thing that does not work is training the system
00:23:44.220 | to learn representations of images
00:23:47.820 | by training it to reconstruct a good image
00:23:52.140 | from a corrupted version of it, okay?
00:23:54.020 | That's what doesn't work.
00:23:55.740 | And we have a whole slew of techniques for this
00:23:59.100 | that are, you know, variant of denoising autoencoders,
00:24:02.540 | something called MAE developed by some of my colleagues
00:24:05.060 | at FAIR, masked autoencoder.
00:24:07.020 | So it's basically like the, you know, LLMs
00:24:10.460 | or things like this where you train the system
00:24:13.100 | by corrupting text, except you corrupt images,
00:24:15.300 | you remove patches from it and you train
00:24:17.220 | a gigantic neural net to reconstruct.
00:24:19.500 | The features you get are not good.
00:24:20.980 | And you know they're not good because
00:24:23.420 | if you now train the same architecture,
00:24:25.540 | but you train it supervised, with labeled data,
00:24:30.100 | with textual descriptions of images, et cetera,
00:24:34.060 | you do get good representations.
00:24:35.780 | And the performance on recognition tasks is much better
00:24:39.700 | than if you do this self-supervised pretraining.
00:24:42.660 | - So the architecture is good.
00:24:44.580 | - The architecture is good.
00:24:45.420 | The architecture of the encoder is good, okay?
00:24:48.020 | But the fact that you train the system
00:24:49.500 | to reconstruct images does not lead it to produce,
00:24:53.780 | to learn good generic features of images.
00:24:56.300 | - When you train in a self-supervised way.
00:24:58.380 | - Self-supervised by reconstruction.
00:25:00.380 | - Yeah, by reconstruction.
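
As a rough sketch of the corrupt-then-reconstruct objective being described (an MAE-like recipe in spirit only; `patchify`, `encoder`, and `decoder` are hypothetical placeholders, and the real MAE feeds the encoder only the visible patches):

```python
import torch
import torch.nn.functional as F

def reconstruction_pretrain_step(encoder, decoder, images, mask_ratio=0.75):
    """Self-supervised pretraining by reconstruction: corrupt the image by
    masking patches, then train encoder + decoder to predict the missing
    pixels. This is the recipe described above as not yielding good
    visual features."""
    patches = patchify(images)                                  # (B, N, D), helper assumed
    mask = torch.rand(patches.shape[:2], device=patches.device) < mask_ratio
    corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)    # zero out masked patches
    latent = encoder(corrupted)                                 # representation of corrupted input
    recon = decoder(latent)                                     # reconstruct all patches in pixel space
    return F.mse_loss(recon[mask], patches[mask])               # error on the masked patches
```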
00:25:01.380 | - Okay, so what's the alternative?
00:25:04.380 | - The alternative is joint embedding.
00:25:07.500 | - What is joint embedding?
00:25:08.860 | What are these architectures that you're so excited about?
00:25:11.260 | - Okay, so now instead of training a system
00:25:13.380 | to encode the image and then training it
00:25:15.300 | to reconstruct the full image from a corrupted version,
00:25:20.060 | you take the full image,
00:25:21.540 | you take the corrupted or transformed version,
00:25:25.380 | you run them both through encoders,
00:25:27.140 | which in general are identical, but not necessarily.
00:25:31.580 | And then you train a predictor on top of those encoders
00:25:36.580 | to predict the representation of the full input
00:25:42.460 | from the representation of the corrupted one, okay?
00:25:47.460 | So joint embedding, because you're taking the full input
00:25:51.100 | and the corrupted version or transformed version,
00:25:54.140 | run them both through encoders, you get a joint embedding.
00:25:57.260 | And then you're saying,
00:25:59.140 | can I predict the representation of the full one
00:26:01.980 | from the representation of the corrupted one, okay?
00:26:05.180 | And I call this a JEPA,
00:26:07.820 | so that means joint embedding predictive architecture
00:26:09.860 | because there's joint embedding
00:26:11.220 | and there is this predictor that predicts
00:26:12.620 | the representation of the good guy from the bad guy.
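
Schematically, the architecture just described might look like this (a sketch, not the actual I-JEPA implementation; `encoder`, `target_encoder`, `predictor`, and `corrupt` are placeholders):

```python
import torch
import torch.nn.functional as F

def jepa_loss(encoder, target_encoder, predictor, x, corrupt):
    """Joint Embedding Predictive Architecture, schematically: encode the
    full input and a corrupted/transformed view, then predict the
    representation of the full input from that of the corrupted one.
    The prediction error lives in representation space, not pixel space."""
    with torch.no_grad():                 # the "good" branch provides the target
        target = target_encoder(x)        # representation of the full input
    context = encoder(corrupt(x))         # representation of the corrupted input
    pred = predictor(context)             # predict the target representation
    return F.mse_loss(pred, target)
```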
00:26:15.300 | And the big question is,
00:26:18.300 | how do you train something like this?
00:26:20.700 | And until five years ago or six years ago,
00:26:23.780 | we didn't have particularly good answers
00:26:26.260 | for how you train those things,
00:26:27.660 | except for one called contrastive learning.
00:26:31.900 | And the idea of contrastive learning
00:26:36.540 | is you take a pair of images that are,
00:26:39.860 | again, an image and a corrupted version
00:26:42.420 | or degraded version somehow,
00:26:44.260 | or transformed version of the original one,
00:26:47.180 | and you train the predicted representation
00:26:49.900 | to be the same as that of the original.
00:26:51.820 | If you only do this, the system collapses.
00:26:53.900 | It basically completely ignores the input
00:26:55.700 | and produces representations that are constant.
00:26:58.100 | So the contrastive methods avoid this,
00:27:02.780 | and those things have been around since the early '90s.
00:27:05.380 | I had a paper on this in 1993.
00:27:07.140 | You also show pairs of images that you know are different,
00:27:13.420 | and then you push away the representations from each other.
00:27:17.540 | So you say, not only do representations of things
00:27:20.380 | that we know are the same,
00:27:22.020 | should be the same or should be similar,
00:27:23.900 | but representations of things that we know are different
00:27:25.740 | should be different.
00:27:26.620 | And that prevents the collapse, but it has some limitation.
00:27:30.140 | And there's a whole bunch of techniques that have appeared
00:27:33.260 | over the last six, seven years
00:27:35.660 | that can revive this type of method,
00:27:38.060 | some of them from FAIR,
00:27:40.340 | some of them from Google and other places.
00:27:43.260 | But there are limitations to those contrastive methods.
00:27:47.260 | What has changed in the last three, four years
00:27:52.420 | is now we have methods that are non-contrastive.
00:27:54.900 | So they don't require those negative contrastive samples
00:27:58.940 | of images that we know are different.
00:28:01.620 | You train them only with images that are different versions
00:28:06.380 | or different views of the same thing,
00:28:08.180 | and you rely on some other tweaks
00:28:10.740 | to prevent the system from collapsing.
00:28:12.740 | And we have half a dozen different methods for this now.
00:28:15.980 | - So what is the fundamental difference
00:28:17.860 | between joint embedding architectures and LLMs?
00:28:22.340 | So can JEPA take us to AGI?
00:28:26.980 | Though we should say that you don't like the term AGI,
00:28:31.780 | and we'll probably argue.
00:28:33.020 | I think every single time I've talked to you
00:28:34.900 | we've argued about the G in AGI.
00:28:36.860 | - Yes.
00:28:37.700 | - I get it, I get it.
00:28:40.220 | Well, we'll probably continue to argue about it, it's great.
00:28:43.300 | You like AMI, 'cause you like French,
00:28:47.220 | and "ami" is, I guess, friend in French.
00:28:51.780 | - Yes.
00:28:52.620 | - And AMI stands for advanced machine intelligence.
00:28:55.820 | - Right.
00:28:56.660 | - But either way, can JEPA take us to that,
00:29:00.500 | towards that advanced machine intelligence?
00:29:02.580 | - Well, so it's a first step.
00:29:04.620 | Okay, so first of all, what's the difference
00:29:07.260 | with generative architectures like LLMs?
00:29:11.060 | So LLMs, or vision systems that are trained
00:29:16.020 | by reconstruction, generate the inputs, right?
00:29:20.060 | They generate the original input that is non-corrupted,
00:29:25.060 | non-transformed, right?
00:29:27.340 | So you have to predict all the pixels.
00:29:29.940 | And there is a huge amount of resources spent in the system
00:29:33.420 | to actually predict all those pixels, all the details.
00:29:36.180 | In a JEPA, you're not trying to predict all the pixels,
00:29:40.500 | you're only trying to predict an abstract representation
00:29:43.980 | of the inputs, right?
00:29:47.020 | And that's much easier in many ways.
00:29:49.460 | So what the JEPA system, when it's being trained,
00:29:51.460 | is trying to do is extract as much information as possible
00:29:54.820 | from the input, but yet only extract information
00:29:58.180 | that is relatively easily predictable.
00:30:00.500 | Okay, so there's a lot of things in the world
00:30:03.660 | that we cannot predict, like for example,
00:30:05.220 | if you have a self-driving car driving down the street
00:30:08.180 | or road, there may be trees around the road.
00:30:13.180 | And it could be a windy day, so the leaves on the tree
00:30:16.260 | are kind of moving in kind of semi-chaotic random ways
00:30:19.620 | that you can't predict and you don't care,
00:30:22.020 | you don't wanna predict.
00:30:23.660 | So what you want is your encoder
00:30:25.300 | to basically eliminate all those details.
00:30:27.300 | It will tell you there's moving leaves,
00:30:28.780 | but it's not gonna keep the details
00:30:29.980 | of exactly what's going on.
00:30:31.380 | And so when you do the prediction in representation space,
00:30:35.940 | you're not going to have to predict every single pixel
00:30:38.020 | of every leaf.
00:30:38.860 | And that, you know, not only is a lot simpler,
00:30:43.540 | but also it allows the system to essentially learn
00:30:47.420 | an abstract representation of the world
00:30:49.780 | where, you know, what can be modeled and predicted
00:30:53.500 | is preserved and the rest is viewed as noise
00:30:57.460 | and eliminated by the encoder.
00:30:59.140 | So it kind of lifts the level of abstraction
00:31:00.980 | of the representation.
00:31:02.300 | If you think about this,
00:31:03.140 | this is something we do absolutely all the time.
00:31:05.460 | Whenever we describe a phenomenon,
00:31:06.980 | we describe it at a particular level of abstraction.
00:31:10.100 | And we don't always describe every natural phenomenon
00:31:13.420 | in terms of quantum field theory, right?
00:31:15.260 | That would be impossible, right?
00:31:17.460 | So we have multiple levels of abstraction
00:31:20.060 | to describe what happens in the world, you know,
00:31:22.660 | starting from quantum field theory to like atomic theory
00:31:25.620 | and molecules, you know, and chemistry materials,
00:31:29.060 | and, you know, all the way up to, you know,
00:31:31.700 | kind of concrete objects in the real world
00:31:33.940 | and things like that.
00:31:34.780 | So we can't just only model everything at the lowest level.
00:31:40.460 | And that's what the idea of JEPA is really about.
00:31:44.540 | Learn abstract representation in a self-supervised manner.
00:31:49.540 | And, you know, you can do it hierarchically as well.
00:31:52.100 | So that I think is an essential component
00:31:54.500 | of an intelligent system.
00:31:56.300 | And in language, we can get away without doing this
00:31:58.540 | because language is already to some level abstract
00:32:02.580 | and already has eliminated a lot of information
00:32:05.460 | that is not predictable.
00:32:07.060 | And so we can get away without doing the joint embedding,
00:32:11.020 | without, you know, lifting the abstraction level
00:32:13.780 | and by directly predicting words.
00:32:15.420 | - So joint embedding, it's still generative,
00:32:19.980 | but it's generative in this abstract representation space.
00:32:23.380 | - Yeah.
00:32:24.220 | - And you're saying language, we were lazy with language
00:32:27.300 | 'cause we already got the abstract representation for free.
00:32:30.380 | And now we have to zoom out,
00:32:31.980 | actually think about generally intelligent systems.
00:32:34.580 | We have to deal with a full mess of physical reality,
00:32:39.260 | of reality.
00:32:40.100 | And you do have to do this step of jumping from
00:32:44.940 | the full, rich, detailed reality
00:32:51.340 | to a abstract representation of that reality
00:32:54.820 | based on what you can then reason
00:32:56.340 | and all that kind of stuff.
00:32:57.340 | - Right.
00:32:58.180 | And the thing is, those self-supervised algorithms
00:33:00.500 | that learn by prediction, even in representation space,
00:33:04.740 | learn more concepts
00:33:09.260 | if the input data you feed them is more redundant.
00:33:11.980 | The more redundancy there is in the data,
00:33:14.020 | the more they're able to capture
00:33:15.500 | some internal structure of it.
00:33:17.780 | And so there, there is way more redundancy
00:33:20.460 | in the structure in perceptual inputs,
00:33:24.060 | sensory input like vision than there is in text,
00:33:28.460 | which is not nearly as redundant.
00:33:29.980 | This is back to the question you were asking
00:33:32.500 | a few minutes ago.
00:33:33.420 | Language might represent more information really
00:33:35.540 | because it's already compressed.
00:33:36.700 | You're right about that,
00:33:37.660 | but that means it's also less redundant.
00:33:40.260 | And so self-supervised learning will not work as well.
00:33:43.700 | - Is it possible to join the self-supervised training
00:33:48.700 | on visual data and self-supervised training
00:33:52.300 | on language data?
00:33:53.900 | There is a huge amount of knowledge,
00:33:56.540 | even though you talked down about those 10 to the 13 tokens.
00:34:00.260 | Those 10 to the 13 tokens represent the entirety,
00:34:03.340 | or a large fraction, of what us humans have figured out,
00:34:08.300 | both the shit talk on Reddit
00:34:11.380 | and the contents of all the books and the articles
00:34:14.180 | and the full spectrum of human intellectual creation.
00:34:18.980 | So is it possible to join those two together?
00:34:22.260 | - Well, eventually, yes.
00:34:23.740 | But I think if we do this too early,
00:34:27.860 | we run the risk of being tempted to cheat.
00:34:30.340 | And in fact, that's what people are doing at the moment
00:34:32.180 | with the vision language model.
00:34:33.540 | We're basically cheating.
00:34:35.220 | We're using language as a crutch to help the deficiencies
00:34:40.020 | of our vision systems to kind of learn good representations
00:34:44.740 | from images and video.
00:34:46.460 | And the problem with this is that we might improve
00:34:51.100 | our visual language system a bit,
00:34:53.780 | I mean, our language models by feeding them images,
00:34:58.100 | but we're not gonna get to the level of even the intelligence
00:35:01.740 | or level of understanding of the world of a cat or a dog,
00:35:05.580 | which doesn't have language.
00:35:07.380 | You know, they don't have language
00:35:08.620 | and they understand the world much better than any LLM.
00:35:12.060 | They can plan really complex actions
00:35:14.140 | and sort of imagine the result of a bunch of actions.
00:35:17.940 | How do we get machines to learn that?
00:35:20.460 | Before we combine that with language,
00:35:22.940 | obviously, if we combine this with language,
00:35:24.820 | this is gonna be a winner.
00:35:26.220 | But before that, we have to focus on like,
00:35:30.780 | how do we get systems to learn how the world works?
00:35:33.300 | - So this kind of joint embedding, predictive architecture,
00:35:38.300 | for you, that's gonna be able to learn
00:35:40.060 | something like common sense,
00:35:41.380 | something like what a cat uses to predict
00:35:45.580 | how to mess with its owner most optimally
00:35:48.340 | by knocking over a thing.
00:35:49.940 | - That's the hope.
00:35:51.340 | In fact, the techniques we're using are non-contrastive.
00:35:54.260 | So not only is the architecture non-generative,
00:35:57.740 | the learning procedures we're using are non-contrastive.
00:36:01.540 | So we have two sets of techniques.
00:36:03.660 | One set is based on distillation
00:36:05.700 | and there's a number of methods that use this principle.
00:36:10.300 | One by DeepMind called BYOL,
00:36:11.620 | a couple by FAIR, one called VicReg,
00:36:17.940 | and another one called IJPA.
00:36:20.140 | And VicReg, I should say,
00:36:21.500 | is not a distillation method actually,
00:36:23.620 | but IJPA and BYOL certainly are.
00:36:25.700 | And there's another one also called DINO or DINO,
00:36:29.300 | also produced from FAIR.
00:36:31.820 | And the idea of those things is that
00:36:32.940 | you take the full input, let's say an image,
00:36:35.820 | you run it through an encoder,
00:36:37.820 | produces a representation,
00:36:41.340 | and then you corrupt that input or transform it,
00:36:43.540 | run it through essentially what amounts to the same encoder
00:36:46.540 | with some minor differences.
00:36:48.500 | And then train a predictor.
00:36:50.420 | Sometimes the predictor is very simple,
00:36:51.900 | sometimes it doesn't exist,
00:36:53.100 | but train a predictor to predict a representation
00:36:55.260 | of the first uncorrupted input from the corrupted input.
00:37:00.260 | But you only train the second branch.
00:37:05.460 | You only train the part of the network
00:37:07.540 | that is fed with the corrupted input.
00:37:10.780 | The other network you don't train,
00:37:12.780 | but since they share the same weight,
00:37:14.260 | when you modify the first one,
00:37:15.980 | it also modifies the second one.
00:37:17.580 | And with various tricks,
00:37:19.660 | you can prevent the system from collapsing,
00:37:22.620 | with the collapse of the type I was explaining before,
00:37:24.700 | where the system basically ignores the input.
00:37:26.900 | So that works very well.
00:37:31.060 | The two techniques we've developed at FAIR,
00:37:34.780 | DINO and I-JEPA work really well for that.
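
A small sketch of the trick just described, where only the branch fed with the corrupted input is trained and the other branch simply tracks it, here via an exponential moving average of the weights, one common BYOL/I-JEPA-style choice; the names and momentum value are illustrative:

```python
import torch

@torch.no_grad()
def update_target_encoder(encoder, target_encoder, momentum=0.996):
    """The untrained branch gets no gradients; its weights just follow the
    trained encoder as an exponential moving average. Together with a few
    other tricks, this prevents the collapse where the system ignores its
    input and outputs a constant representation."""
    for p, p_tgt in zip(encoder.parameters(), target_encoder.parameters()):
        p_tgt.mul_(momentum).add_(p, alpha=1.0 - momentum)

# Schematic training step:
#   loss = jepa_loss(encoder, target_encoder, predictor, x, corrupt)  # as sketched earlier
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
#   update_target_encoder(encoder, target_encoder)
```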
00:37:39.300 | - So what kind of data are we talking about here?
00:37:41.780 | - So there's several scenario.
00:37:43.380 | One scenario is you take an image,
00:37:47.340 | you corrupt it by changing the cropping, for example,
00:37:52.340 | changing the size a little bit,
00:37:54.380 | maybe changing the orientation, blurring it,
00:37:56.700 | changing the colors,
00:37:58.300 | doing all kinds of horrible things to it.
00:38:00.060 | - But basic horrible things.
00:38:01.620 | - Basic horrible things that sort of degrade the quality
00:38:03.820 | a little bit and change the framing,
00:38:06.420 | crop the image.
00:38:08.380 | And in some cases, in the case of I-JEPA,
00:38:12.220 | you don't need to do any of this,
00:38:13.220 | you just mask some parts of it, right?
00:38:16.380 | You just basically remove some regions,
00:38:19.460 | like a big block, essentially.
00:38:21.860 | And then run through the encoders
00:38:25.220 | and train the entire system, encoder and predictor,
00:38:27.660 | to predict the representation of the good one
00:38:29.500 | from the representation of the corrupted one.
00:38:31.740 | So that's I-JEPA.
00:38:35.420 | Doesn't need to know that it's an image, for example,
00:38:38.300 | because the only thing it needs to know
00:38:39.540 | is how to do this masking.
00:38:42.380 | Whereas with DINO, you need to know it's an image
00:38:44.380 | because you need to do things like geometry transformation
00:38:47.540 | and blurring and things like that
00:38:49.300 | that are really image-specific.
00:38:50.860 | A more recent version of this that we have
00:38:53.860 | is called V-JEPA, so it's basically the same idea as I-JEPA,
00:38:56.860 | except it's applied to video.
00:38:59.180 | So now you take a whole video
00:39:00.780 | and you mask a whole chunk of it.
00:39:02.740 | And what we mask is actually kind of a temporal tube,
00:39:04.980 | so a whole segment of each frame in the video
00:39:08.740 | over the entire video.
00:39:10.340 | - And that tube is like statically positioned
00:39:12.860 | throughout the frames, so it's literally a straight tube.
00:39:15.860 | - The tube, yeah, typically is 16 frames or something,
00:39:18.860 | and we mask the same region over the entire 16 frames.
00:39:22.340 | It's a different one for every video, obviously.
00:39:24.620 | And then again, train that system
00:39:28.540 | so as to predict the representation of the full video
00:39:31.300 | from the partially masked video.
00:39:33.260 | That works really well.
00:39:35.380 | It's the first system that we have
00:39:36.860 | that learns good representations of video
00:39:39.940 | so that when you feed those representations
00:39:41.820 | to a supervised classifier head,
00:39:44.980 | it can tell you what action is taking place in the video
00:39:47.780 | with pretty good accuracy.
00:39:49.740 | So that's the first time we get something of that quality.
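
As a toy illustration of the temporal-tube masking described above (the grid size, block size, and shapes are made up for the example; this is not the released V-JEPA code):

```python
import torch

def tube_mask(n_frames=16, grid_h=14, grid_w=14, block=6):
    """Boolean mask over a grid of video patches (True = masked).
    One spatial block is picked at random and masked identically in
    every frame, forming a straight 'tube' through time."""
    top = torch.randint(0, grid_h - block + 1, (1,)).item()
    left = torch.randint(0, grid_w - block + 1, (1,)).item()
    mask = torch.zeros(n_frames, grid_h, grid_w, dtype=torch.bool)
    mask[:, top:top + block, left:left + block] = True   # same region in all 16 frames
    return mask

# Training (schematic): predict the representation of the full clip from
# the representation of the clip with this tube masked out.
```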
00:39:55.980 | - So that's a good test that good representations form,
00:39:58.660 | that means there's something to this.
00:40:00.300 | - Yeah, we have also a preliminary result
00:40:03.460 | that seemed to indicate that the representation
00:40:07.140 | allows our system to tell whether the video
00:40:10.660 | is physically possible or completely impossible
00:40:13.940 | because some object disappeared,
00:40:15.340 | or an object suddenly jumped from one location to another
00:40:19.540 | or changed shape or something.
00:40:21.860 | - So it's able to capture some physics-based constraints
00:40:26.860 | about the reality represented in the video?
00:40:29.260 | - Yeah.
00:40:30.220 | - About the appearance and the disappearance of objects?
00:40:33.140 | - Yeah, that's really new.
00:40:35.740 | - Okay, but can this actually get us
00:40:40.740 | to this kind of a world model that understands enough
00:40:45.580 | about the world to be able to drive a car?
00:40:48.020 | - Possibly.
00:40:50.060 | I mean, this is gonna take a while
00:40:51.540 | before we get to that point,
00:40:52.660 | but there are systems already, robotic systems,
00:40:56.900 | that are based on this idea.
00:40:58.700 | And what you need for this
00:41:02.700 | is a slightly modified version of this
00:41:04.860 | where imagine that you have a video, a complete video.
00:41:09.860 | And what you're doing to this video
00:41:13.980 | is that you're either translating it in time
00:41:17.620 | towards the future,
00:41:18.460 | so you only see the beginning of the video,
00:41:19.980 | but you don't see the latter part of it
00:41:21.740 | that is in the original one,
00:41:23.380 | or you just mask the second half of the video, for example.
00:41:27.260 | And then you train a JEPA system of the type I described
00:41:32.260 | to predict the representation of the full video
00:41:33.980 | from the shifted one,
00:41:36.140 | but you also feed the predictor with an action.
00:41:39.660 | For example, the wheel is turned 10 degrees
00:41:42.820 | to the right or something, right?
00:41:45.420 | So if it's a dashcam in a car
00:41:49.860 | and you know the angle of the wheel,
00:41:51.340 | you should be able to predict to some extent
00:41:52.900 | what's going to happen to what you see.
00:41:56.820 | You're not gonna be able to predict all the details
00:41:59.940 | of objects that appear in the view, obviously,
00:42:02.780 | but at an abstract representation level,
00:42:05.740 | you can probably predict what's gonna happen.
00:42:08.660 | So now what you have is an internal model that says,
00:42:13.100 | here is my idea of state of the world at time t,
00:42:15.260 | here is an action I'm taking,
00:42:17.860 | here is a prediction of the state of the world
00:42:19.300 | at time t plus one, t plus delta t,
00:42:21.980 | t plus two seconds, whatever it is.
00:42:24.300 | If you have a model of this type,
00:42:26.180 | you can use it for planning.
00:42:27.940 | So now you can do what LLMs cannot do,
00:42:31.540 | which is planning what you're gonna do
00:42:33.980 | so as to arrive at a particular outcome
00:42:37.580 | or satisfy a particular objective, right?
00:42:40.780 | So you can have a number of objectives.
00:42:43.520 | I can predict that if I have an object like this
00:42:50.820 | and I open my hand, it's gonna fall, right?
00:42:54.420 | And if I push it with a particular force on the table,
00:42:58.180 | it's gonna move.
00:42:59.020 | If I push the table itself,
00:43:00.060 | it's probably not gonna move with the same force.
00:43:03.620 | So we have this internal model of the world in our mind,
00:43:08.620 | which allows us to plan sequences of actions
00:43:11.780 | to arrive at a particular goal.
00:43:13.340 | And so now if you have this world model,
00:43:18.540 | we can imagine a sequence of actions,
00:43:21.580 | predict what the outcome of the sequence of action
00:43:23.620 | is going to be, measure to what extent the final state
00:43:28.300 | satisfies a particular objective,
00:43:31.020 | like moving the bottle to the left of the table,
00:43:35.060 | and then plan a sequence of actions
00:43:38.460 | that will minimize this objective at runtime.
00:43:41.500 | We're not talking about learning,
00:43:42.340 | we're talking about inference time, right?
00:43:44.260 | So this is planning, really.
00:43:46.140 | And in optimal control, this is a very classical thing.
00:43:48.340 | It's called model predictive control.
00:43:50.580 | You have a model of the system you want to control
00:43:53.780 | that can predict the sequence of states
00:43:56.340 | corresponding to a sequence of commands.
00:43:58.980 | And you're planning a sequence of commands
00:44:02.260 | so that according to your world model,
00:44:04.180 | the end state of the system
00:44:06.420 | will satisfy an objective that you fix.
00:44:10.980 | This is the way rocket trajectories have been planned
00:44:15.980 | since computers have been around,
00:44:17.740 | so since the early '60s, essentially.
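
A schematic of that model-predictive-control loop in Python, under the assumption of a learned `world_model(state, action) -> next_state` and a scalar `objective(state)`; the random-shooting search and all names are illustrative, not a description of any existing system:

```python
import torch

def plan(world_model, objective, state, action_dim, horizon=10, n_candidates=256):
    """Model predictive control, schematically: imagine the outcome of many
    candidate action sequences with the world model, score how well each
    final (abstract) state satisfies the objective, and return the first
    action of the best sequence. This is inference-time planning; nothing
    is learned here."""
    best_cost, best_actions = float("inf"), None
    for _ in range(n_candidates):
        actions = torch.randn(horizon, action_dim)    # random-shooting candidates
        s = state
        for a in actions:                             # roll out: s[t+1] = f(s[t], a[t])
            s = world_model(s, a)
        cost = float(objective(s))                    # distance to the goal state
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions[0]                            # act, then replan at the next step
```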
00:44:20.100 | - So yes, for model predictive control,
00:44:21.860 | but you also often talk about hierarchical planning.
00:44:26.020 | - Yeah.
00:44:26.860 | - Can hierarchical planning emerge from this somehow?
00:44:29.020 | - Well, so no.
00:44:29.860 | You will have to build a specific architecture
00:44:32.820 | to allow for hierarchical planning.
00:44:34.660 | So hierarchical planning is absolutely necessary
00:44:36.900 | if you want to plan complex actions.
00:44:39.580 | If I want to go from, let's say, from New York to Paris,
00:44:43.340 | this is the example I use all the time,
00:44:45.460 | and I'm sitting in my office at NYU,
00:44:48.180 | my objective that I need to minimize
00:44:50.500 | is my distance to Paris.
00:44:52.140 | At a high level, a very abstract representation
00:44:55.100 | of my location, I will have to decompose this
00:44:58.380 | into two sub-goals.
00:44:59.380 | First one is go to the airport.
00:45:02.260 | Second one is catch a plane to Paris.
00:45:04.700 | Okay, so my sub-goal is now going to the airport.
00:45:09.140 | My objective function is my distance to the airport.
00:45:11.700 | How do I go to the airport?
00:45:14.140 | Well, I have to go in the street and hail a taxi,
00:45:18.300 | which you can do in New York.
00:45:19.700 | Okay, now I have another sub-goal.
00:45:22.740 | Go down on the street, what that means,
00:45:26.220 | going to the elevator, going down the elevator,
00:45:28.860 | walk out the street.
00:45:29.860 | How do I go to the elevator?
00:45:32.700 | I have to stand up for my chair,
00:45:36.380 | open the door of my office, go to the elevator,
00:45:39.420 | push the button.
00:45:40.700 | How do I get up from my chair?
00:45:42.340 | Like, you know, you can imagine going down,
00:45:44.020 | all the way down to basically what amounts
00:45:47.420 | to millisecond by millisecond muscle control.
00:45:50.420 | Okay, and obviously you're not going to plan
00:45:54.180 | your entire trip from New York to Paris
00:45:56.540 | in terms of millisecond by millisecond muscle control.
00:46:00.300 | First, that would be incredibly expensive,
00:46:02.300 | but it will also be completely impossible
00:46:03.800 | because you don't know all the conditions
00:46:06.480 | of what's going to happen.
00:46:08.060 | You know, how long it's going to take to catch a taxi
00:46:10.660 | or to go to the airport with traffic, you know.
00:46:14.980 | I mean, you would have to know exactly the condition
00:46:17.500 | of everything to be able to do this planning.
00:46:19.940 | And you don't have the information.
00:46:21.460 | So you have to do this hierarchical planning
00:46:24.020 | so that you can start acting
00:46:25.420 | and then sort of replanning as you go.
00:46:27.380 | And nobody really knows how to do this in AI.
00:46:32.060 | Nobody knows how to train a system
00:46:35.340 | to learn the appropriate multiple levels of representation
00:46:38.620 | so that hierarchical planning works.
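
A toy sketch of the New-York-to-Paris style decomposition just walked through (the `decompose` and `is_primitive` helpers are entirely hypothetical; as noted above, nobody currently knows how to learn these levels of representation):

```python
def hierarchical_plan(goal, decompose, is_primitive):
    """Recursively expand an abstract goal ('go to Paris') into sub-goals
    ('go to the airport', 'catch a plane to Paris'), stopping at goals that
    can be acted on directly. A real agent would expand the lower levels
    lazily and replan as it goes, rather than unrolling everything down to
    millisecond-level muscle control up front."""
    if is_primitive(goal):
        return [goal]
    plan = []
    for subgoal in decompose(goal):
        plan += hierarchical_plan(subgoal, decompose, is_primitive)
    return plan

# e.g. decompose('office at NYU -> Paris') might return
#      ['go to the airport', 'catch a plane to Paris']   (purely illustrative)
```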
00:46:41.380 | - Does something like that already emerge?
00:46:43.060 | So like, can you use an LLM, state-of-the-art LLM
00:46:48.460 | to get you from New York to Paris
00:46:50.940 | by doing exactly the kind of detailed set of questions
00:46:55.060 | that you just did, which is,
00:46:56.940 | can you give me a list of 10 steps I need to do
00:47:01.220 | to get from New York to Paris?
00:47:02.660 | And then for each of those steps,
00:47:05.420 | can you give me a list of 10 steps,
00:47:07.140 | how I make that step happen?
00:47:09.180 | And for each of those steps,
00:47:10.340 | can you give me a list of 10 steps to make each one of those
00:47:13.180 | until you're moving your individual muscles?
00:47:16.420 | Maybe not, whatever you can actually act upon
00:47:19.620 | using your mind.
00:47:20.660 | - Right, so there's a lot of questions
00:47:23.180 | that are actually implied by this, right?
00:47:24.500 | So the first thing is LLMs will be able to answer
00:47:27.700 | some of those questions down to some level of abstraction
00:47:30.500 | under the condition that they've been trained
00:47:34.480 | with similar scenarios in their training set.
00:47:37.260 | - They would be able to answer all of those questions,
00:47:40.100 | but some of them may be hallucinated, meaning non-factual.
00:47:44.260 | - Yeah, true, I mean, they will probably produce
00:47:45.780 | some answer, except they're not gonna be able
00:47:47.420 | to really kind of produce millisecond by millisecond
00:47:49.660 | muscle control of how you stand up from your chair, right?
00:47:53.220 | So, but down to some level of abstraction
00:47:55.580 | where you can describe things by words,
00:47:57.860 | they might be able to give you a plan,
00:47:59.620 | but only under the condition that they've been trained
00:48:01.500 | to produce those kinds of plans, right?
00:48:04.180 | They're not gonna be able to plan for situations
00:48:06.700 | that they never encountered before.
00:48:09.420 | They basically are going to have to regurgitate
00:48:11.380 | the template that they've been trained on.
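The recursive prompting idea raised in the question can be sketched as follows. `ask_llm` is a hypothetical stand-in for whatever chat-completion call is available, not a real library function, and the depth limit is arbitrary.

```python
# Sketch of the recursive prompting idea from the conversation: ask an LLM for
# sub-steps, then ask again for each sub-step, down to a fixed depth.
# `ask_llm` is a hypothetical placeholder, not a real API.

def ask_llm(prompt: str) -> list[str]:
    """Placeholder: send the prompt to an LLM and parse a numbered list of steps."""
    raise NotImplementedError("wire this to an LLM client of your choice")

def expand(task: str, depth: int = 0, max_depth: int = 3) -> dict:
    """Recursively decompose a task into sub-steps via repeated prompting."""
    if depth >= max_depth:
        return {task: []}
    steps = ask_llm(f"Give me a short list of steps to accomplish: {task}")
    return {task: [expand(step, depth + 1, max_depth) for step in steps]}

# Example (requires a real ask_llm implementation):
# expand("get from New York to Paris")
```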
00:48:12.980 | So where, like, just for the example of New York to Paris,
00:48:16.020 | is it gonna start getting into trouble?
00:48:18.940 | Like at which layer of abstraction do you think you'll start?
00:48:22.620 | 'Cause like I can imagine almost every single part of that
00:48:25.420 | an LLM will be able to answer somewhat accurately,
00:48:27.760 | especially when you're talking about New York
00:48:29.340 | and Paris, major cities.
00:48:31.060 | - So, I mean, certainly an LLM would be able
00:48:33.940 | to solve that problem if you fine-tuned it for it.
00:48:36.660 | You know, so I can't say that an LLM cannot do this.
00:48:42.420 | It can do this if you train it for it, there's no question.
00:48:45.780 | Down to a certain level where things can be formulated
00:48:49.900 | in terms of words.
00:48:51.340 | But like, if you wanna go down to like how you, you know,
00:48:53.840 | climb down the stairs or just stand up from your chair
00:48:56.100 | in terms of words, like you can't, you can't do it.
00:48:59.380 | You need, that's one of the reasons you need experience
00:49:04.940 | of the physical world, which is much higher bandwidth
00:49:07.740 | than what you can express in words, in human language.
00:49:11.060 | - So everything we've been talking about
00:49:12.460 | on the joint embedding space, is it possible
00:49:15.740 | that that's what we need for like the interaction
00:49:18.020 | with physical reality on the robotics front?
00:49:20.620 | And then just the LLMs are the thing that sits on top of it
00:49:24.660 | for the bigger reasoning about like the fact
00:49:28.580 | that I need to book a plane ticket and I need to know,
00:49:31.660 | I know how to go to the websites and so on.
00:49:33.700 | - Sure, and you know, a lot of plans that people know about
00:49:37.060 | that are relatively high level are actually learned.
00:49:40.740 | They're not invented. Most people don't invent
00:49:45.260 | plans by themselves, you know,
00:49:50.260 | we have some ability to do this, of course, obviously,
00:49:54.180 | but most plans that people use are plans
00:49:57.920 | that they've been trained on.
00:49:59.540 | Like they've seen other people use those plans
00:50:01.280 | or they've been told how to do things, right?
00:50:04.180 | You can't invent them. Like, take a person
00:50:07.660 | who's never heard of airplanes and ask them,
00:50:10.220 | how do you go from New York to Paris?
00:50:11.660 | They're probably not going to be able to kind of,
00:50:14.700 | you know, deconstruct the whole plan
00:50:16.180 | unless they've seen examples of that before.
00:50:18.820 | So certainly LLMs are going to be able to do this,
00:50:20.740 | but then how you link this from the low level of actions,
00:50:25.740 | that needs to be done with things like JEPA that basically
00:50:32.400 | lifts the abstraction level of the representation
00:50:34.780 | without attempting to reconstruct every detail
00:50:36.700 | of the situation.
00:50:38.080 | That's what we need JEPA for.
00:50:40.740 | I would love to sort of linger on your skepticism
00:50:44.260 | around autoregressive LLMs.
00:50:48.400 | So one way I would like to test that skepticism
00:50:51.960 | is everything you say makes a lot of sense.
00:50:54.980 | But if I apply everything you said today and in general
00:51:01.500 | to like, I don't know, 10 years ago,
00:51:04.080 | maybe a little bit less, no, let's say three years ago,
00:51:07.900 | I wouldn't be able to predict the success of LLMs.
00:51:12.620 | So does it make sense to you that autoregressive LLMs
00:51:17.060 | are able to be so damn good?
00:51:19.620 | - Yes.
00:51:21.780 | - Can you explain your intuition?
00:51:24.260 | Because if I were to take your wisdom and intuition
00:51:29.120 | at face value, I would say there's no way
00:51:31.420 | autoregressive LLMs, one token at a time,
00:51:34.300 | would be able to do the kind of things they're doing.
00:51:36.260 | - No, there's one thing that autoregressive LLMs,
00:51:39.260 | or that LLMs in general, not just the autoregressive ones,
00:51:42.420 | but including the BERT-style bidirectional ones,
00:51:45.260 | are exploiting, and it's self-supervised learning.
00:51:49.220 | And I've been a very, very strong advocate
00:51:51.060 | of self-supervised learning for many years.
00:51:53.300 | So those things are an incredibly impressive demonstration
00:51:58.300 | that self-supervised learning actually works.
00:52:02.140 | The idea didn't start with BERT,
00:52:07.140 | but BERT was really kind of a good demonstration of this.
00:52:09.660 | So the idea that you take a piece of text, you corrupt it,
00:52:14.660 | and then you train a gigantic neural net
00:52:16.200 | to reconstruct the parts that are missing,
00:52:18.300 | that has
00:52:20.920 | produced an enormous amount of benefits.
00:52:25.680 | It allowed us to create systems that understand language,
00:52:31.380 | systems that can translate hundreds of languages
00:52:34.980 | in any direction, systems that are multilingual.
00:52:38.200 | So they're not, it's a single system that can be trained
00:52:40.900 | to understand hundreds of languages
00:52:43.260 | and translate in any direction and produce summaries
00:52:48.260 | and then answer questions and produce text.
00:52:51.780 | And then there's a special case of it,
00:52:54.740 | which is the autoregressive trick,
00:52:56.620 | where you constrain the system
00:52:58.580 | to not elaborate a representation of the text
00:53:02.020 | from looking at the entire text,
00:53:03.740 | but only predicting a word
00:53:06.540 | from the words that come before, right?
00:53:08.580 | You do this by constraining
00:53:10.380 | the architecture of the network.
00:53:11.580 | And that's how you can build an autoregressive LLM.
00:53:15.140 | So there was a surprise many years ago
00:53:17.660 | with what's called decoder-only LLMs.
00:53:20.940 | So, you know, systems of this type
00:53:23.120 | that are just trying to produce words from the previous one
00:53:28.120 | and the fact that when you scale them up,
00:53:31.260 | they tend to really kind of understand
00:53:35.900 | more about language when you train them on lots of data,
00:53:38.140 | you make them really big.
00:53:39.380 | That was kind of a surprise.
00:53:40.720 | And that surprise occurred quite a while back,
00:53:42.900 | like, you know, with work from, you know,
00:53:47.900 | Google meta, open AI, et cetera, you know,
00:53:50.580 | going back to, you know, the GPT kind of work
00:53:54.620 | general pre-trained transformers.
00:53:56.820 | - You mean like GPT-2, like there's a certain place
00:54:00.380 | where you start to realize scaling
00:54:02.060 | might actually keep giving us an emergent benefit.
00:54:06.720 | - Yeah, I mean, there were work from various places,
00:54:09.240 | but if you want to kind of, you know,
00:54:12.900 | place it in the GPT timeline,
00:54:16.380 | that would be around GPT-2, yeah.
00:54:17.980 | - Well, I just, 'cause you said it,
00:54:20.860 | you're so charismatic and you said so many words,
00:54:23.620 | but self-supervised learning, yes.
00:54:25.880 | But again, the same intuition you're applying
00:54:29.060 | to saying that autoregressive LLMs
00:54:31.600 | cannot have a deep understanding of the world,
00:54:35.240 | if we just apply that same intuition,
00:54:38.060 | does it make sense to you that they're able to form
00:54:41.680 | enough of a representation of the world
00:54:43.840 | to be damn convincing, essentially passing
00:54:48.340 | the original Turing test with flying colors?
00:54:50.840 | - Well, we're fooled by their fluency, right?
00:54:53.320 | We just assume that if a system is fluent
00:54:56.100 | in manipulating language, then it has
00:54:58.900 | all the characteristics of human intelligence,
00:55:00.780 | but that impression is false.
00:55:04.140 | We're really fooled by it.
00:55:06.560 | - What do you think Alan Turing would say?
00:55:08.940 | Without understanding anything, just hanging out with it.
00:55:11.420 | - Alan Turing would decide that a Turing test
00:55:13.140 | is a really bad test, okay?
00:55:15.520 | This is what the AI community has decided many years ago,
00:55:18.940 | that the Turing test was a really bad test of intelligence.
00:55:22.080 | - What would Hans Moravec say
00:55:23.320 | about the large language models?
00:55:26.300 | - Hans Moravec would say that Moravec paradox still applies.
00:55:29.760 | - Okay. - Okay.
00:55:31.340 | Okay, we can pass the--
00:55:32.180 | - You don't think he would be really impressed?
00:55:34.260 | - No, of course, everybody would be impressed,
00:55:35.800 | but it's not a question of being impressed or not.
00:55:39.980 | It's a question of knowing what the limit
00:55:42.100 | of those systems can do.
00:55:44.260 | Again, they are impressive.
00:55:45.940 | They can do a lot of useful things.
00:55:47.580 | There's a whole industry that is being built around them.
00:55:49.820 | They're gonna make progress.
00:55:51.900 | But there is a lot of things they cannot do,
00:55:53.720 | and we have to realize what they cannot do,
00:55:55.600 | and then figure out how we get there.
00:55:59.920 | And I'm not seeing this,
00:56:02.580 | I'm seeing this from basically 10 years of research
00:56:06.740 | on the idea of self-supervised learning.
00:56:12.200 | Actually, that's going back more than 10 years,
00:56:13.820 | but the idea of self-supervised learning,
00:56:15.300 | so basically capturing the internal structure
00:56:18.260 | of a piece of a set of inputs without training the system
00:56:22.460 | for any particular task, like learning representations.
00:56:25.220 | You know, the conference I co-founded 14 years ago
00:56:28.880 | is called International Conference
00:56:30.820 | on Learning Representations.
00:56:31.900 | That's the entire issue that deep learning
00:56:34.060 | is dealing with, right?
00:56:35.980 | And it's been my obsession for almost 40 years now, so.
00:56:39.900 | So learning representation is really the thing.
00:56:42.820 | For the longest time, we could only do this
00:56:44.340 | with supervised learning.
00:56:45.780 | And then we started working on what we used to call
00:56:48.940 | unsupervised learning, and sort of revived the idea
00:56:53.580 | of unsupervised learning in the early 2000s
00:56:56.660 | with Yoshua Bengio and Jeff Hinton,
00:56:59.340 | then discovered that supervised learning
00:57:00.780 | actually works pretty well if you can collect enough data.
00:57:03.940 | And so the whole idea of unsupervised self-supervised
00:57:07.460 | learning kind of took a backseat for a bit,
00:57:10.980 | and then I kind of tried to revive it
00:57:14.860 | in a big way, you know, starting in 2014,
00:57:18.580 | basically when we started FAIR,
00:57:20.540 | and really pushing for, like, finding new methods
00:57:24.740 | to do self-supervised learning, both for text
00:57:27.180 | and for images and for video and audio.
00:57:29.740 | And some of that work has been incredibly successful.
00:57:33.020 | I mean, the reason why we have multilingual
00:57:35.500 | translation systems, you know, things that do
00:57:38.300 | content moderation on Meta, for example, on Facebook,
00:57:41.780 | that are multilingual, to understand
00:57:43.300 | whether a piece of text is hate speech or not, or something,
00:57:46.460 | is due to that progress using self-supervised learning
00:57:48.740 | for NLP, combining this with, you know,
00:57:51.220 | transformer architectures and blah, blah, blah.
00:57:53.700 | But that's the big success of self-supervised learning.
00:57:55.740 | We had similar success in speech recognition,
00:57:59.020 | a system called Wave2Vec,
00:58:00.140 | which is also a joint embedding architecture, by the way,
00:58:02.460 | trained with contrastive learning.
00:58:03.700 | And that system also can produce
00:58:06.740 | speech recognition systems that are multilingual
00:58:10.540 | with mostly unlabeled data and only need a few minutes
00:58:14.140 | of labeled data to actually do speech recognition.
00:58:16.900 | That's amazing.
00:58:17.980 | We have systems now based on those combination of ideas
00:58:22.180 | that can do real-time translation of hundreds of languages
00:58:25.660 | into each other, speech to speech.
00:58:28.020 | - Speech to speech, even including,
00:58:30.220 | just fascinating languages that don't have written forms.
00:58:34.340 | - That's right. - They're spoken only.
00:58:35.500 | - That's right, we don't go through text.
00:58:36.780 | It goes directly from speech to speech
00:58:38.740 | using an internal representation of kind of speech units
00:58:41.220 | that are discrete. It's called textless NLP.
00:58:44.580 | We used to call it this way, but yeah.
00:58:47.100 | So that, I mean, incredible success there.
00:58:49.220 | And then, you know, for 10 years,
00:58:50.980 | we tried to apply this idea to learning representations
00:58:54.700 | of images by training a system to predict videos,
00:58:57.340 | learning intuitive physics by training a system
00:58:59.900 | to predict what's gonna happen in the video,
00:59:02.300 | and tried and tried and failed and failed
00:59:05.060 | with generative models, with models that predict pixels.
00:59:09.300 | We could not get them to learn
00:59:11.300 | good representations of images.
00:59:13.220 | We could not get them to learn
00:59:14.420 | good representations of videos.
00:59:16.420 | And we tried many times.
00:59:17.260 | We published lots of papers on it.
00:59:19.140 | Yeah, well, they kind of sort of worked,
00:59:20.820 | but not really great.
00:59:22.300 | They started working when
00:59:23.980 | we abandoned this idea of predicting every pixel
00:59:28.220 | and basically just did the joint embedding
00:59:30.420 | and predicted in representation space.
00:59:32.300 | That works.
00:59:33.260 | So there's ample evidence that we're not gonna be able
00:59:37.700 | to learn good representations of the real world
00:59:42.020 | using generative models.
00:59:43.260 | So I'm telling people,
00:59:44.620 | everybody's talking about generative AI.
00:59:46.820 | If you're really interested in human level AI,
00:59:48.860 | abandon the idea of generative AI.
00:59:50.620 | - Okay, but you really think it's possible
00:59:54.900 | to get far with the joint embedding representation.
00:59:57.420 | So like, there's common sense reasoning,
01:00:01.380 | and then there's high level reasoning.
01:00:05.700 | I feel like those are two,
01:00:07.580 | the kind of reasoning that LLMs are able to do,
01:00:11.620 | okay, let me not use the word reasoning,
01:00:13.620 | but the kind of stuff that LLMs are able to do
01:00:16.020 | seems fundamentally different
01:00:17.380 | than the common sense reasoning we use
01:00:19.540 | to navigate the world.
01:00:20.900 | It seems like we're gonna need both.
01:00:22.500 | Would you be able to get,
01:00:25.100 | with the joint embedding,
01:00:25.980 | with a JEPA type of approach looking at video,
01:00:29.140 | would you be able to learn, let's see,
01:00:33.020 | well, how to get from New York to Paris,
01:00:35.460 | or how to understand the state of politics
01:00:40.460 | and the world today, right?
01:00:44.420 | These are things where various humans
01:00:46.700 | generate a lot of language and opinions on
01:00:49.020 | in the space of language,
01:00:50.100 | but don't visually represent that
01:00:52.860 | in any clearly compressible way.
01:00:56.060 | - Right, well, there's a lot of situations
01:00:58.020 | that might be difficult for a purely language-based system
01:01:02.740 | to know, like, okay, you can probably learn
01:01:07.180 | from reading texts, the entirety of the publicly available
01:01:10.780 | texts in the world that I cannot get
01:01:12.700 | from New York to Paris by snapping my fingers.
01:01:15.380 | That's not gonna work, right?
01:01:16.300 | - Yes.
01:01:17.140 | - But there's probably sort of more complex
01:01:20.980 | scenarios of this type,
01:01:22.300 | which an LLM may never have encountered
01:01:25.700 | and may not be able to determine
01:01:27.700 | whether it's possible or not.
01:01:29.860 | So that link from the low level to the high level,
01:01:34.860 | the thing is that the high level that language expresses
01:01:38.860 | is based on the common experience of the low level,
01:01:43.260 | which LLMs currently do not have.
01:01:45.220 | When we talk to each other,
01:01:47.660 | we know we have a common experience of the world,
01:01:50.620 | like a lot of it is similar, and LLMs don't have that.
01:01:59.060 | But see, it's present.
01:02:01.060 | You and I have a common experience of the world
01:02:02.860 | in terms of the physics of how gravity works
01:02:05.860 | and stuff like this, and that common knowledge of the world
01:02:10.860 | I feel like is there in the language.
01:02:15.500 | We don't explicitly express it,
01:02:17.780 | but if you have a huge amount of text,
01:02:21.180 | you're going to get this stuff that's between the lines.
01:02:24.180 | In order to form a consistent world model,
01:02:28.620 | you're going to have to understand how gravity works,
01:02:31.660 | even if you don't have an explicit explanation of gravity.
01:02:35.140 | So even though in the case of gravity,
01:02:37.360 | there is explicit explanation of gravity in Wikipedia.
01:02:40.020 | But the stuff that we think of as common sense reasoning,
01:02:45.020 | I feel like to generate language correctly,
01:02:49.300 | you're going to have to figure that out.
01:02:51.820 | Now you could say, as you have, there's not enough text.
01:02:54.300 | Sorry, okay, so what? (laughs)
01:02:56.940 | You don't think so.
01:02:57.780 | - No, I agree with what you just said,
01:02:59.160 | which is that to be able to do high-level common sense,
01:03:03.680 | to have high-level common sense,
01:03:04.780 | you need to have the low-level common sense
01:03:06.920 | to build on top of.
01:03:08.020 | - Yeah, but that's not there.
01:03:10.280 | - And that's not there in LLMs.
01:03:11.580 | LLMs are purely trained from text.
01:03:13.340 | So then the other statement you made,
01:03:15.380 | I would not agree with the fact that implicit
01:03:18.980 | in all languages in the world is the underlying reality.
01:03:22.740 | There's a lot about underlying reality,
01:03:24.460 | which is not expressed in language.
01:03:26.840 | Is that obvious to you?
01:03:27.960 | - Yeah, totally.
01:03:28.980 | - So like all the conversations we have,
01:03:34.340 | okay, there's the dark web, meaning whatever,
01:03:37.460 | the private conversations, like DMs and stuff like this,
01:03:41.160 | which is much, much larger probably than what's available,
01:03:44.900 | what LLMs are trained on.
01:03:46.880 | - You don't need to communicate the stuff that is common.
01:03:49.980 | - But the humor, all of it.
01:03:51.300 | No, you do.
01:03:52.140 | Like when you, you don't need to, but it comes through.
01:03:54.520 | Like if I accidentally knock this over,
01:03:58.300 | you'll probably make fun of me.
01:03:59.500 | In the content of you making fun of me
01:04:02.460 | will be an explanation of the fact that cups fall,
01:04:07.360 | and that gravity works in this way.
01:04:09.380 | And then you'll have some very vague information
01:04:12.700 | about what kind of things explode when they hit the ground.
01:04:16.740 | And then maybe you'll make a joke about entropy
01:04:19.000 | or something like this,
01:04:19.840 | and we'll never be able to reconstruct this again.
01:04:22.000 | Like, okay, you'll make a little joke like this,
01:04:25.060 | and there'll be trillion of other jokes.
01:04:27.020 | And from the jokes, you can piece together the fact
01:04:29.580 | that gravity works and mugs can break
01:04:31.900 | and all this kind of stuff.
01:04:32.860 | You don't need to see it; it'll be very inefficient.
01:04:36.860 | It's easier, like, to knock the thing over,
01:04:41.860 | but I feel like it would be there
01:04:44.380 | if you have enough of that data.
01:04:46.600 | - I just think that most of the information of this type
01:04:50.700 | that we have accumulated when we were babies
01:04:54.320 | is just not present in text, in any description, essentially.
01:04:59.320 | - And the sensory data is a much richer source
01:05:03.180 | for getting that kind of understanding.
01:05:04.360 | - I mean, that's the 16,000 hours of wake time
01:05:07.700 | of a four-year-old and 10 to the 15 bytes
01:05:11.600 | going through vision, just vision, right?
01:05:13.480 | There is a similar bandwidth of touch
01:05:17.600 | and a little less through audio.
01:05:20.500 | And then language doesn't come in until like a year in life.
01:05:25.500 | And by the time you are nine years old,
01:05:28.580 | you've learned about gravity.
01:05:30.780 | You know about inertia, you know about gravity,
01:05:32.700 | you know the stability, you know about the distinction
01:05:36.280 | between animate and inanimate objects.
01:05:38.100 | You know, by 18 months, you know about why people
01:05:42.380 | want to do things and you help them if they can't.
01:05:45.500 | I mean, there's a lot of things that you learn mostly
01:05:47.940 | by observation, really not even through interaction.
01:05:52.340 | In the first few months of life,
01:05:53.420 | babies don't really have any influence on the world.
01:05:55.900 | They can only observe, right?
01:05:58.080 | And you accumulate like a gigantic amount of knowledge
01:06:02.060 | just from that.
01:06:02.980 | So that's what we're missing from current AI systems.
01:06:06.400 | - I think in one of your slides, you have this nice plot
01:06:10.040 | that is one of the ways you show that LLMs are limited.
01:06:13.940 | I wonder if you could talk about hallucinations
01:06:16.120 | from your perspectives.
01:06:17.940 | The why hallucinations happen from large language models
01:06:22.940 | and why, and to what degree is that a fundamental flaw
01:06:27.540 | of large language models?
01:06:29.360 | - Right, so because of the autoregressive prediction,
01:06:34.100 | every time an LLM produces a token or a word,
01:06:37.220 | there is some level of probability for that word
01:06:40.740 | to take you out of the set of reasonable answers.
01:06:45.620 | And if you assume, which is a very strong assumption,
01:06:48.000 | that those errors are independent
01:06:50.620 | across a sequence
01:06:55.180 | of tokens being produced.
01:06:59.500 | What that means is that every time you produce a token,
01:07:02.400 | the probability that you stay within the set
01:07:05.420 | of correct answer decreases, and it decreases exponentially.
01:07:08.660 | - So there's a strong, like you said, assumption there
01:07:10.420 | that if there's a non-zero probability of making a mistake,
01:07:14.780 | which there appears to be,
01:07:16.260 | then there's going to be a kind of drift.
01:07:18.700 | - Yeah, and that drift is exponential.
01:07:21.360 | It's like errors accumulate, right?
01:07:23.740 | So the probability that an answer would be nonsensical
01:07:27.860 | increases exponentially with the number of tokens.
01:07:31.400 | - Is that obvious to you, by the way?
01:07:33.820 | Well, so mathematically speaking, maybe,
01:07:36.780 | but isn't there a kind of gravitational pull
01:07:40.220 | towards the truth, because on average, hopefully,
01:07:44.380 | the truth is well-represented in the training set?
01:07:48.940 | - No, it's basically a struggle
01:07:50.920 | against the curse of dimensionality.
01:07:55.540 | So the way you can correct for this
01:07:57.040 | is that you fine-tune the system
01:07:58.700 | by having it produce answers for all kinds of questions
01:08:02.540 | that people might come up with.
01:08:04.860 | And people are people, so a lot of the questions
01:08:08.100 | that they have are very similar to each other.
01:08:10.260 | So you can probably cover 80% or whatever
01:08:13.700 | of questions that people will ask by collecting data.
01:08:18.700 | And then you fine-tune the system
01:08:23.140 | to produce good answers for all of those things.
01:08:25.620 | And it's probably going to be able to learn that
01:08:28.280 | because it's got a lot of capacity to learn.
01:08:30.880 | But then there is the enormous set of prompts
01:08:36.920 | that you have not covered during training.
01:08:39.900 | And that set is enormous.
01:08:41.340 | Within the set of all possible prompts,
01:08:43.260 | the proportion of prompts that have been used for training
01:08:47.340 | is absolutely tiny.
01:08:48.640 | It's a tiny, tiny, tiny subset of all possible prompts.
01:08:53.940 | And so the system will behave properly on the prompts
01:08:56.600 | that it has been either pre-trained or fine-tuned on.
01:08:59.540 | But then there is an entire space of things
01:09:04.180 | that it cannot possibly have been trained on
01:09:06.840 | because the number is gigantic.
01:09:09.260 | So whatever training the system has been subjected to
01:09:14.260 | in order to produce appropriate answers,
01:09:18.060 | you can break it by finding out a prompt
01:09:20.540 | that will be outside of the set of prompts
01:09:24.540 | it's been trained on or things that are similar.
01:09:27.300 | And then it will just spew complete nonsense.
01:09:30.340 | - When you say prompt, do you mean that exact prompt?
01:09:33.460 | Or do you mean a prompt that's like
01:09:36.020 | in many parts very different than,
01:09:38.660 | like is it that easy to ask a question
01:09:42.620 | or to say a thing that hasn't been said before
01:09:45.540 | on the internet?
01:09:46.380 | - I mean, people have come up with things
01:09:48.340 | where like you put essentially a random sequence
01:09:52.300 | of characters in the prompt.
01:09:53.820 | And that's enough to kind of throw the system
01:09:56.060 | into a mode where it's gonna answer something
01:09:59.820 | completely different than it would have answered
01:10:02.060 | without this.
01:10:03.420 | So that's a way to jailbreak the system,
01:10:04.980 | basically get it to go outside of its conditioning, right?
01:10:09.340 | - So that's a very clear demonstration of it.
01:10:11.300 | But of course, that goes outside of what is designed to do.
01:10:16.300 | If you actually stitch together
01:10:20.900 | reasonably grammatical sentences,
01:10:22.900 | is it that easy to break it?
01:10:26.520 | - Yeah, some people have done things like
01:10:29.060 | you write a sentence in English, right?
01:10:31.260 | That has, or you ask a question in English
01:10:33.860 | and it produces a perfectly fine answer.
01:10:36.740 | And then you just substitute a few words
01:10:38.780 | by the same word in another language.
01:10:42.540 | And all of a sudden the answer is complete nonsense.
01:10:44.740 | - So I guess what I'm saying is like,
01:10:46.900 | which fraction of prompts that humans are likely to generate
01:10:51.900 | are going to break the system?
01:10:54.380 | - So the problem is that there is a long tail.
01:10:57.660 | - Yes.
01:10:58.620 | - This is an issue that a lot of people have realized
01:11:02.620 | in dealing with social networks and stuff like that,
01:11:04.140 | which is there's a very, very long tail
01:11:06.340 | of things that people will ask.
01:11:08.180 | And you can fine tune the system for the 80% or whatever
01:11:12.940 | of the things that most people will ask.
01:11:16.180 | And then this long tail is so large
01:11:18.700 | that you're not gonna be able to fine tune the system
01:11:20.780 | for all the conditions.
01:11:21.940 | And in the end, the system ends up being
01:11:23.780 | kind of a giant lookup table, right?
01:11:25.740 | Essentially, which is not really what you want.
01:11:27.820 | You want systems that can reason, certainly that can plan.
01:11:30.820 | So the type of reasoning that takes place in an LLM
01:11:33.820 | is very, very primitive.
01:11:35.540 | And the reason you can tell it's primitive
01:11:37.060 | is because the amount of computation that is spent
01:11:41.020 | per token produced is constant.
01:11:43.820 | So if you ask a question and that question has an answer
01:11:47.900 | in a given number of tokens,
01:11:50.340 | the amount of computation devoted to computing that answer
01:11:52.780 | can be exactly estimated.
01:11:54.820 | It's like, it's the size of the prediction network
01:12:00.060 | with its 36 layers or 92 layers or whatever it is,
01:12:03.140 | multiplied by number of tokens, that's it.
01:12:06.220 | And so essentially it doesn't matter
01:12:09.180 | if the question being asked is simple to answer,
01:12:14.180 | complicated to answer, impossible to answer
01:12:17.820 | because it's undecidable or something.
01:12:19.740 | The amount of computation the system will be able
01:12:23.100 | to devote to the answer is constant
01:12:25.620 | or is proportional to the number of tokens produced
01:12:27.900 | in the answer, right?
01:12:29.700 | This is not the way we work.
01:12:30.940 | The way we reason is that when we're faced
01:12:35.020 | with a complex problem or a complex question,
01:12:38.540 | we spend more time trying to solve it and answer it, right?
01:12:42.820 | Because it's more difficult.
01:12:43.900 | - There's a prediction element,
01:12:45.580 | there's a iterative element where you're like
01:12:48.020 | adjusting your understanding of a thing
01:12:52.500 | by going over and over and over.
01:12:54.740 | There's a hierarchical element, so on.
01:12:56.780 | Does this mean it's a fundamental flaw of LLMs?
01:12:59.500 | Does it mean that, there's more part to that question.
01:13:03.340 | Now you're just behaving like an LLM, immediately answering.
01:13:08.740 | No, that it's just the low-level world model
01:13:13.740 | on top of which we can then build
01:13:17.140 | some of these kinds of mechanisms, like you said,
01:13:19.560 | persistent long-term memory or reasoning, so on.
01:13:24.560 | But we need that world model that comes from language.
01:13:28.440 | Is it, maybe it is not so difficult
01:13:30.760 | to build this kind of reasoning system
01:13:33.660 | on top of a well-constructed world model.
01:13:36.740 | - Okay, whether it's difficult or not,
01:13:38.440 | the near future will tell,
01:13:40.900 | because a lot of people are working on reasoning
01:13:43.580 | and planning abilities for dialogue systems.
01:13:46.720 | I mean, even if we restrict ourselves to language,
01:13:50.660 | just having the ability to plan your answer
01:13:54.640 | before you answer, in terms that are not necessarily linked
01:13:59.420 | with the language you're gonna use to produce the answer.
01:14:02.220 | So this idea of this mental model
01:14:03.980 | that allows you to plan what you're gonna say
01:14:05.960 | before you say it.
01:14:06.820 | That is very important.
01:14:11.680 | I think there's going to be a lot of systems
01:14:13.820 | over the next few years
01:14:14.820 | that are going to have this capability,
01:14:17.340 | but the blueprint of those systems
01:14:19.660 | would be extremely different from auto-regressive LLMs.
01:14:23.140 | So it's the same difference
01:14:27.940 | as the difference between what psychologists call
01:14:30.660 | system one and system two in humans, right?
01:14:32.580 | So system one is the type of tasks
01:14:34.820 | that you can accomplish without deliberately, consciously
01:14:37.840 | thinking about how you do them.
01:14:40.280 | You just do them.
01:14:42.080 | You've done them enough
01:14:43.040 | that you can just do it subconsciously, right?
01:14:45.380 | Without thinking about them.
01:14:46.520 | If you're an experienced driver,
01:14:48.580 | you can drive without really thinking about it.
01:14:51.080 | And you can talk to someone at the same time
01:14:52.700 | or listen to the radio, right?
01:14:54.140 | If you are a very experienced chess player,
01:14:58.300 | you can play against a non-experienced chess player
01:15:01.060 | without really thinking either.
01:15:02.580 | You just recognize the pattern and you play, right?
01:15:05.380 | That's system one.
01:15:06.640 | So all the things that you do instinctively
01:15:09.760 | without really having to deliberately plan
01:15:12.660 | and think about it.
01:15:13.480 | And then there is all the tasks where you need to plan.
01:15:15.200 | So if you are a not so experienced chess player
01:15:19.540 | or you are experienced,
01:15:20.660 | or you play against another experienced chess player,
01:15:22.980 | you think about all kinds of options, right?
01:15:24.760 | You think about it for a while, right?
01:15:27.220 | And you're much better if you have time to think about it
01:15:30.520 | than you are if you play blitz with limited time.
01:15:34.580 | So this type of deliberate planning,
01:15:39.580 | which uses your internal world model, that's system two.
01:15:44.540 | This is what LLMs currently cannot do.
01:15:46.540 | So how do we get them to do this, right?
01:15:48.580 | How do we build a system that can do this kind of planning
01:15:53.340 | that or reasoning that devotes more resources
01:15:57.420 | to complex problems than to simple problems?
01:16:00.320 | And it's not going to be autoregressive prediction of tokens.
01:16:03.780 | It's going to be more something akin to inference
01:16:08.060 | of latent variables in what used to be called
01:16:13.060 | probabilistic models or graphical models
01:16:16.260 | and things of that type.
01:16:17.720 | So basically, the principle is like this.
01:16:19.720 | You know, the prompt is like observed variables.
01:16:24.640 | And what the model does is that it's basically a measure of,
01:16:31.000 | it can measure to what extent an answer
01:16:36.180 | is a good answer for a prompt, okay?
01:16:38.960 | So think of it as some gigantic neural net,
01:16:41.180 | but it's got only one output.
01:16:42.660 | And that output is a scalar number,
01:16:45.180 | which is let's say zero if the answer is a good answer
01:16:48.580 | for the question and a large number
01:16:51.120 | if the answer is not a good answer for the question.
01:16:53.500 | Imagine you had this model.
01:16:55.460 | If you had such a model,
01:16:56.620 | you could use it to produce good answers.
01:16:58.900 | The way you would do is, you know, produce the prompt
01:17:02.520 | and then search through the space of possible answers
01:17:05.260 | for one that minimizes that number.
01:17:07.460 | That's called an energy-based model.
01:17:11.580 | But that energy-based model would need the model
01:17:16.420 | constructed by the LLM.
01:17:18.580 | - Well, so really what you need to do would be
01:17:21.340 | to not search over possible strings of text
01:17:24.940 | that minimize that energy.
01:17:27.780 | But what you would do is do this
01:17:29.420 | in abstract representation space.
01:17:31.060 | So in sort of the space of abstract thoughts,
01:17:34.500 | you would elaborate a thought, right,
01:17:37.060 | using this process of minimizing the output
01:17:40.960 | of your model, okay, which is just a scalar.
01:17:43.860 | It's an optimization process, right?
01:17:46.460 | So now the way the system produces its answer
01:17:49.420 | is through optimization by, you know,
01:17:53.140 | minimizing an objective function, basically, right?
01:17:56.380 | And this is, we're talking about inference.
01:17:57.720 | We're not talking about training, right?
01:17:59.300 | The system has been trained already.
01:18:01.060 | So now we have an abstract representation
01:18:03.040 | of the thought of the answer, representation of the answer.
01:18:06.660 | We feed that to basically an autoregressive decoder,
01:18:10.680 | which can be very simple, that turns this into a text
01:18:13.580 | that expresses this thought, okay?
01:18:15.700 | So that, in my opinion, is the blueprint
01:18:18.100 | of future dialogue systems.
01:18:20.220 | They will think about their answer,
01:18:23.460 | plan their answer by optimization
01:18:25.900 | before turning it into text.
01:18:27.340 | And that is Turing complete.
01:18:31.300 | - Can you explain exactly
01:18:32.380 | what the optimization problem there is?
01:18:34.500 | Like, what's the objective function?
01:18:37.740 | Just to linger on it, you kind of briefly described it,
01:18:40.500 | but over what space are you optimizing?
01:18:43.800 | - The space of representations.
01:18:45.720 | - It goes abstract representation.
01:18:47.820 | - So you have an abstract representation inside the system.
01:18:51.620 | You have a prompt.
01:18:52.500 | The prompt goes through an encoder,
01:18:53.660 | produces a representation,
01:18:55.180 | perhaps goes through a predictor
01:18:56.400 | that predicts a representation of the answer,
01:18:58.460 | of the proper answer.
01:18:59.820 | But that representation may not be a good answer
01:19:04.180 | because there might be some complicated reasoning
01:19:06.600 | you need to do, right?
01:19:07.600 | So then you have another process
01:19:11.240 | that takes the representation of the answers
01:19:14.480 | and modifies it so as to minimize a cost function
01:19:19.480 | that measures to what extent the answer
01:19:21.560 | is a good answer for the question.
01:19:23.020 | Now, let's sort of ignore for a moment
01:19:27.840 | the issue
01:19:29.760 | of how you train that system to measure
01:19:32.420 | whether an answer is a good answer for a question.
01:19:36.000 | - But suppose such a system could be created.
01:19:38.960 | But what's the process, this kind of search-like process?
01:19:42.440 | - It's an optimization process.
01:19:44.040 | You can do this if the entire system is differentiable,
01:19:47.680 | that scalar output is the result
01:19:50.120 | of running through some neural net,
01:19:52.560 | running the representation of the answers
01:19:55.760 | through some neural net.
01:19:56.900 | Then by gradient descent, by backpropagating gradients,
01:20:00.640 | you can figure out how to modify the representation
01:20:03.320 | of the answers so as to minimize that.
01:20:05.160 | - So that's still a gradient-based.
01:20:06.680 | - It's gradient-based inference.
01:20:08.600 | So now you have a representation of the answer
01:20:10.480 | in abstract space.
01:20:12.080 | Now you can turn it into text, right?
01:20:15.640 | And the cool thing about this is that
01:20:18.660 | the representation now can be optimized
01:20:21.600 | through gradient descent,
01:20:22.520 | but also is independent of the language
01:20:24.640 | in which you're going to express the answer.
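Here is a minimal PyTorch sketch of this "inference as optimization" idea: the energy network's weights stay frozen and gradient descent runs on a latent answer representation z so as to minimize a scalar energy. The tiny two-layer energy net and random prompt representation are toy stand-ins, not a trained model.

```python
# Gradient-based inference: optimize a latent answer representation z,
# not the network weights, to minimize a scalar energy E(prompt_repr, z).
import torch

d = 64                                      # dimensionality of the representations
energy_net = torch.nn.Sequential(           # toy energy function (would be pretrained)
    torch.nn.Linear(2 * d, 128), torch.nn.ReLU(), torch.nn.Linear(128, 1)
)
for p in energy_net.parameters():
    p.requires_grad_(False)                 # weights frozen at inference time

prompt_repr = torch.randn(d)                # output of an encoder (random toy here)
z = torch.zeros(d, requires_grad=True)      # latent representation of the answer
opt = torch.optim.SGD([z], lr=0.1)

for step in range(100):                     # the inference loop is an optimization
    energy = energy_net(torch.cat([prompt_repr, z])).squeeze()
    opt.zero_grad()
    energy.backward()                       # gradients flow into z only
    opt.step()

# z now encodes the (toy) low-energy answer; a decoder would turn it into text,
# independently of the language used to express it.
```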
01:20:27.600 | - Right, so you're operating in this abstract representation.
01:20:30.080 | I mean, this goes back to the joint embedding
01:20:32.640 | that is better to work in the space of, I don't know,
01:20:37.320 | or to romanticize the notion like space of concepts
01:20:40.680 | versus the space of concrete sensory information.
01:20:45.680 | - Right.
01:20:47.320 | - Okay, but can this do something like reasoning,
01:20:50.720 | which is what we're talking about?
01:20:51.960 | - Well, not really, only in a very simple way.
01:20:54.160 | I mean, basically you can think of those things
01:20:56.440 | as doing the kind of optimization I was talking about,
01:21:00.320 | except they're optimizing in a discrete space,
01:21:02.280 | which is the space of possible sequences of tokens.
01:21:05.880 | And they do this optimization in a horribly inefficient way,
01:21:09.280 | which is generate a lot of hypotheses
01:21:11.280 | and then select the best ones.
01:21:13.400 | And that's incredibly wasteful in terms of computation.
01:21:18.400 | 'Cause you basically have to run your LLM
01:21:20.880 | for like every possible generated sequence.
01:21:24.880 | And it's incredibly wasteful.
01:21:28.880 | So it's much better to do an optimization
01:21:32.480 | in continuous space where you can do gradient descent
01:21:35.040 | as opposed to like generate tons of things
01:21:36.760 | and then select the best.
01:21:38.200 | You just iteratively refine your answer
01:21:41.120 | to go towards the best, right?
01:21:42.960 | That's much more efficient.
01:21:44.280 | But you can only do this in continuous spaces
01:21:46.560 | with differentiable functions.
01:21:48.200 | - You're talking about the reasoning,
01:21:50.360 | like ability to think deeply or to reason deeply.
01:21:55.200 | How do you know what is an answer
01:21:59.240 | that's better or worse based on deep reasoning?
01:22:04.720 | - Right, so then we're asking the question of conceptually,
01:22:07.480 | how do you train an energy-based model, right?
01:22:09.380 | So an energy-based model is a function
01:22:11.920 | with a scalar output, just a number.
01:22:13.900 | You give it two inputs, X and Y,
01:22:17.340 | and it tells you whether Y is compatible with X or not.
01:22:20.480 | X you observe, let's say it's a prompt,
01:22:22.680 | an image, a video, whatever.
01:22:24.680 | And Y is a proposal for an answer,
01:22:28.120 | a continuation of the video, you know, whatever.
01:22:32.440 | And it tells you whether Y is compatible with X.
01:22:35.080 | And the way it tells you that Y is compatible with X
01:22:37.440 | is that the output of that function will be zero
01:22:39.800 | if Y is compatible with X.
01:22:41.200 | It would be a positive number, non-zero,
01:22:44.800 | if Y is not compatible with X.
01:22:46.380 | Okay, how do you train a system like this
01:22:49.800 | at a completely general level?
01:22:51.880 | Is you show it pairs of X and Y that are compatible,
01:22:56.200 | a question and the corresponding answer,
01:22:58.840 | and you train the parameters of the big neural net inside
01:23:01.720 | to produce zero.
01:23:03.680 | Okay, now that doesn't completely work
01:23:07.280 | because the system might decide,
01:23:08.920 | well, I'm just gonna say zero for everything.
01:23:11.680 | So now you have to have a process to make sure
01:23:13.520 | that for a wrong Y, the energy would be larger than zero.
01:23:18.520 | And there you have two options.
01:23:20.560 | One is contrastive methods.
01:23:21.840 | So contrastive method is you show an X and a bad Y
01:23:25.040 | and you tell the system, well, that's, you know,
01:23:28.400 | give a high energy to this, like push up the energy, right?
01:23:30.880 | Change the weights in the neural net
01:23:32.320 | that computes the energy so that it goes up.
01:23:34.480 | So that's contrastive methods.
01:23:37.680 | The problem with this is if the space of Y is large,
01:23:41.320 | the number of such contrastive samples
01:23:44.680 | you're gonna have to show is gigantic.
01:23:48.640 | And people do this, they do this when you train a system
01:23:52.800 | with RLHF, basically what you're training
01:23:55.200 | is what's called a reward model,
01:23:57.640 | which is basically an objective function
01:24:00.160 | that tells you whether an answer is good or bad.
01:24:02.560 | And that's basically exactly what this is.
01:24:06.960 | So we already do this to some extent.
01:24:08.560 | We're just not using it for inference.
01:24:09.960 | We're just using it for training.
01:24:11.600 | There is another set of methods which are non-contrastive
01:24:17.360 | and I prefer those, and those non-contrastive methods
01:24:20.960 | basically say, okay, the energy function
01:24:25.960 | needs to have low energy on pairs of X, Y's
01:24:28.760 | that are compatible, that come from your training set.
01:24:31.480 | How do you make sure that the energy
01:24:34.160 | is gonna be higher everywhere else?
01:24:36.080 | And the way you do this is by having a regularizer,
01:24:42.240 | a criterion, a term in your cost function
01:24:45.200 | that basically minimizes the volume of space
01:24:49.200 | that can take low energy.
01:24:50.440 | And there are all kinds of different
01:24:54.160 | specific ways to do this depending on the architecture,
01:24:56.440 | but that's the basic principle.
01:24:58.560 | So that if you push down the energy function
01:25:01.000 | for particular regions in the X, Y space,
01:25:04.080 | it will automatically go up in other places
01:25:06.160 | because there's only a limited volume of space
01:25:09.360 | that can take low energy, okay,
01:25:11.920 | by the construction of the system or by the regularizer,
01:25:14.840 | the regularizing function.
01:25:16.840 | - We've been talking very generally,
01:25:18.880 | but what is a good X and a good Y?
01:25:21.480 | What is a good representation of X and Y?
01:25:25.880 | 'Cause we've been talking about language
01:25:27.320 | and if you just take language directly,
01:25:30.520 | that presumably is not good.
01:25:32.320 | So there has to be some kind of
01:25:33.320 | abstract representation of ideas.
01:25:36.240 | - Yeah, so you can do this with language directly
01:25:39.760 | by just X is a text and Y is a continuation of that text.
01:25:43.640 | Or X is a question, Y is the answer.
01:25:48.200 | - But you're saying that's not gonna take,
01:25:49.720 | I mean, that's going to do what LLMs are doing.
01:25:52.720 | - Well, no, it depends on how you,
01:25:54.640 | how the internal structure of the system is built.
01:25:57.280 | If the internal structure of the system is built
01:25:59.480 | in such a way that inside of the system,
01:26:02.240 | there is a latent variable, let's call it Z,
01:26:04.760 | that you can manipulate so as to minimize the output energy.
01:26:12.920 | Then that Z can be viewed as representation of a good answer
01:26:16.760 | that you can translate into a Y that is a good answer.
01:26:19.520 | - So this kind of system could be trained
01:26:22.720 | in a very similar way.
01:26:24.640 | - Very similar way, but you have to have this way
01:26:26.760 | of preventing collapse, of ensuring that, you know,
01:26:30.360 | there is high energy for things you don't train it on.
01:26:33.120 | And currently it's very implicit in LLMs,
01:26:38.720 | it's done in a way that people don't realize is being done,
01:26:40.720 | but it is being done.
01:26:42.680 | It's due to the fact that when you give a high probability
01:26:45.960 | to a word, automatically you give low probability
01:26:50.800 | to other words, because you only have a finite amount
01:26:54.400 | of probability to go around right there to sum to one.
01:26:57.800 | So when you minimize the cross entropy or whatever,
01:27:00.520 | when you train your LLM to produce the,
01:27:03.240 | to predict the next word,
01:27:04.560 | you're increasing the probability your system will give
01:27:08.480 | to the correct word, but you're also decreasing
01:27:10.200 | the probability it will give to the incorrect words.
01:27:12.360 | Now, indirectly, that gives
01:27:17.120 | a high probability to sequences of words that are good
01:27:19.480 | and low probability to sequences of words that are bad,
01:27:21.720 | but it's very indirect.
01:27:23.600 | And it's not obvious why this actually works at all,
01:27:26.800 | but because you're not doing it on a joint probability
01:27:31.080 | of all the symbols in a sequence,
01:27:32.920 | you're just kind of factorizing
01:27:36.920 | that probability in terms of conditional probabilities
01:27:39.640 | over successive tokens.
01:27:41.480 | - How do you do this for visual data?
01:27:44.000 | - So we've been doing this with JEPA architectures,
01:27:46.160 | basically with I-JEPA.
01:27:48.040 | So there, the compatibility between two things is:
01:27:53.040 | here's an image or a video, here's a corrupted, shifted,
01:27:57.480 | or transformed version of that image or video, or a masked one.
01:28:01.080 | Okay, and then the energy of the system is the prediction
01:28:05.800 | error of the representation of the image.
01:28:11.800 | The predicted representation of the good thing,
01:28:14.480 | versus the actual representation of the good thing, right?
01:28:17.360 | So you run the corrupted image through the system,
01:28:20.840 | predict the representation of the good input, uncorrupted,
01:28:24.600 | and then compute the prediction error.
01:28:26.400 | That's the energy of the system.
01:28:28.040 | So this system will tell you, this is a good,
01:28:31.760 | this is a good image and this is a corrupted version.
01:28:36.680 | It will give you zero energy if those two things
01:28:39.000 | are effectively, one of them is a corrupted version
01:28:42.280 | of the other.
01:28:43.120 | It gives you a high energy if the two images
01:28:45.280 | are completely different.
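A sketch of that energy computed in representation space, JEPA-style: encode the corrupted view, predict the representation of the clean view, and use the prediction error as the energy. The tiny convolutional encoder, linear predictor, and crude random masking are placeholders, not the actual I-JEPA networks.

```python
# Prediction error in representation space as the energy of a (clean, corrupted) pair.
import torch

encoder = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
)                                           # image -> 8-dim representation (toy)
predictor = torch.nn.Linear(8, 8)           # predicts the clean representation

def energy(clean: torch.Tensor, corrupted: torch.Tensor) -> torch.Tensor:
    """Low when `corrupted` is a corrupted version of `clean`, high otherwise."""
    target = encoder(clean).detach()              # representation of the clean view
    prediction = predictor(encoder(corrupted))    # predicted from the corrupted view
    return ((prediction - target) ** 2).mean()    # prediction error = energy

clean = torch.rand(1, 3, 32, 32)
corrupted = clean * (torch.rand_like(clean) > 0.5)   # crude random masking
print(energy(clean, corrupted).item())
```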
01:28:46.480 | - And hopefully that whole process gives you a really nice
01:28:49.760 | compressed representation of reality, of visual reality.
01:28:54.560 | - And we know it does because then we use those
01:28:56.440 | representations as input to a classification system.
01:28:59.360 | - That classification system works really nicely, okay.
01:29:02.000 | Well, so to summarize, you recommend in a spicy way
01:29:08.560 | that only yellow raccoon can.
01:29:10.400 | You recommend that we abandon generative models
01:29:12.700 | in favor of joint embedding architectures.
01:29:15.280 | Abandon autoregressive generation.
01:29:17.760 | Abandon, this feels like a court testimony.
01:29:21.740 | Abandon probabilistic models in favor of energy-based models
01:29:25.160 | as we talked about.
01:29:26.220 | Abandon contrastive methods in favor of regularized methods.
01:29:30.100 | And let me ask you about this.
01:29:32.100 | You've been for a while a critic of reinforcement learning.
01:29:37.000 | So the last recommendation is that we abandon RL
01:29:41.320 | in favor of model predictive control,
01:29:43.640 | as you were talking about, and only use RL
01:29:46.560 | when planning doesn't yield the predicted outcome.
01:29:50.440 | And we use RL in that case to adjust the world model
01:29:54.600 | or the critic.
01:29:55.960 | So you mentioned RLHF, reinforcement learning
01:30:00.960 | with human feedback.
01:30:02.980 | Why do you still hate reinforcement learning?
01:30:05.840 | - I don't hate reinforcement learning
01:30:07.080 | and I think it should not be abandoned completely,
01:30:12.080 | but I think its use should be minimized
01:30:14.480 | because it's incredibly inefficient in terms of samples.
01:30:18.440 | And so the proper way to train a system
01:30:21.400 | is to first have it learn good representations of the world
01:30:26.400 | and world models from mostly observation,
01:30:29.620 | maybe a little bit of interactions.
01:30:31.560 | - And then steered based on that.
01:30:33.080 | If the representation is good,
01:30:34.280 | then the adjustments should be minimal.
01:30:36.800 | - Yeah, and now there's two things.
01:30:38.060 | You can use, if you've learned a world model,
01:30:40.000 | you can use the world model to plan a sequence of actions
01:30:42.680 | to arrive at a particular objective.
01:30:44.480 | You don't need RL unless the way you measure
01:30:49.560 | whether you succeed might be inexact.
01:30:51.360 | Your idea of whether you were gonna fall from your bike
01:30:56.260 | might be wrong, or whether the person you're fighting
01:31:01.560 | with MMA was gonna do something and then do something else.
01:31:04.560 | So there's two ways you can be wrong.
01:31:09.520 | Either your objective function does not reflect
01:31:13.680 | the actual objective function you want to optimize,
01:31:16.360 | or your world model is inaccurate, right?
01:31:19.760 | So you didn't, the prediction you were making
01:31:22.060 | about what was gonna happen in the world is inaccurate.
01:31:25.280 | So if you want to adjust your world model
01:31:27.280 | while you are operating the world,
01:31:30.880 | or your objective function,
01:31:32.680 | that is basically in the realm of RL.
01:31:35.880 | This is what RL deals with to some extent, right?
01:31:39.600 | So adjust your world model.
01:31:41.080 | And the way to adjust your world model, even in advance,
01:31:44.200 | is to explore parts of the space where your world model,
01:31:48.180 | where you know that your world model is inaccurate.
01:31:50.720 | That's called curiosity basically, or play, right?
01:31:54.080 | When you play, you kind of explore parts of the state space
01:31:58.680 | that you don't want to do for real,
01:32:03.680 | because it might be dangerous,
01:32:05.800 | but you can adjust your world model
01:32:07.880 | without killing yourself, basically.
01:32:11.640 | So that's what you want to use RL for.
01:32:15.120 | When it comes time to learning a particular task,
01:32:18.720 | you already have all the good representations,
01:32:20.560 | you already have your world model,
01:32:21.840 | but you need to adjust it for the situation at hand.
01:32:25.200 | That's when you use RL.
01:32:26.640 | - Why do you think RLHF works so well?
01:32:29.620 | This reinforcement learning with human feedback.
01:32:32.600 | Why did it have such a transformational effect
01:32:34.880 | on large language models before?
01:32:37.440 | - So what's had the transformational effect
01:32:39.920 | is human feedback.
01:32:42.000 | There is many ways to use it,
01:32:43.560 | and some of it is just purely supervised, actually.
01:32:45.760 | It's not really reinforcement learning.
01:32:47.440 | - So it's the HF.
01:32:49.280 | - It's the HF.
01:32:50.180 | And then there is various ways to use human feedback, right?
01:32:53.320 | So you can ask humans to rate answers,
01:32:57.240 | multiple answers that are produced by a world model.
01:33:00.020 | And then what you do is you train an objective function
01:33:05.560 | to predict that rating.
01:33:07.380 | And then you can use that objective function
01:33:11.520 | to predict whether an answer is good,
01:33:13.680 | and you can backpropagate gradient through this
01:33:15.120 | to fine tune your system
01:33:16.200 | so that it only produces highly rated answers.
01:33:19.880 | Okay, so that's one way.
01:33:22.680 | So that's like, in RL that means training
01:33:27.360 | what's called a reward model, right?
01:33:29.380 | So something that, basically a small neural net
01:33:31.800 | that estimates to what extent an answer is good, right?
01:33:35.120 | It's very similar to the objective
01:33:36.600 | I was talking about earlier for planning,
01:33:39.720 | except now it's not used for planning.
01:33:41.320 | It's used for fine tuning your system.
01:33:43.180 | I think it would be much more efficient
01:33:45.560 | to use it for planning,
01:33:46.520 | but currently it's used to fine tune
01:33:51.040 | the parameters of the system.
01:33:52.620 | Now there's several ways to do this.
01:33:54.920 | You know, some of them are supervised.
01:33:57.520 | You just, you know, ask a human person like,
01:34:00.000 | what is a good answer for this, right?
01:34:02.360 | Then you just type the answer.
01:34:04.240 | I mean, there's lots of ways
01:34:07.160 | that those systems are being adjusted.
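A minimal sketch of that reward-model recipe, with toy embeddings standing in for answers and a single linear layer standing in for the tunable part of an LLM; the sizes, data, and training loop are illustrative assumptions, not Meta's pipeline:

```python
# Step 1: fit a small reward model on human preference ratings.
# Step 2: freeze it and fine-tune a stand-in "policy" by backpropagating
# the reward model's score, so the policy produces highly rated answers.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
DIM = 16  # assumed size of an answer representation (stand-in for text)

# Reward model: a small neural net that estimates how good an answer is.
reward_model = nn.Sequential(nn.Linear(DIM, 32), nn.ReLU(), nn.Linear(32, 1))

# Toy human feedback: pairs of (preferred answer, rejected answer) embeddings.
preferred = torch.randn(64, DIM)
rejected = torch.randn(64, DIM)

opt_rm = torch.optim.Adam(reward_model.parameters(), lr=1e-2)
for _ in range(200):
    # Bradley-Terry style loss: the preferred answer should score higher.
    margin = reward_model(preferred) - reward_model(rejected)
    loss = -F.logsigmoid(margin).mean()
    opt_rm.zero_grad()
    loss.backward()
    opt_rm.step()

# Freeze the reward model; only the policy will be updated from here on.
for p in reward_model.parameters():
    p.requires_grad_(False)

policy = nn.Linear(DIM, DIM)    # stand-in for the tunable part of an LLM
prompts = torch.randn(64, DIM)  # stand-in prompt representations

opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-2)
for _ in range(200):
    answers = policy(prompts)
    score = reward_model(answers).mean()  # predicted "rating" of the answers
    opt_pi.zero_grad()
    (-score).backward()                   # maximize the predicted rating
    opt_pi.step()

print("mean predicted rating after tuning:",
      reward_model(policy(prompts)).mean().item())
```

In production RLHF the answers are token sequences and the policy update is usually done with PPO or a related method rather than raw gradient ascent through the reward model, but the shape of the recipe is the same: collect ratings, fit a reward model, then optimize the system against it.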
01:34:09.160 | - Now, a lot of people have been very critical
01:34:13.560 | of the recently released Google's Gemini 1.5
01:34:19.080 | for essentially, in my words, I could say super woke.
01:34:23.260 | Woke in the negative connotation of that word.
01:34:26.580 | There's some almost hilariously absurd things that it does,
01:34:30.340 | like it modifies history,
01:34:32.740 | like generating images of a black George Washington,
01:34:37.540 | or perhaps more seriously,
01:34:40.840 | something that you commented on Twitter,
01:34:43.220 | which is refusing to comment on or generate images
01:34:48.300 | of, or even descriptions of Tiananmen Square
01:34:51.860 | or the Tank Man,
01:34:55.540 | one of the most sort of legendary protest images in history.
01:35:00.540 | Of course, these images are highly censored
01:35:05.260 | by the Chinese government,
01:35:06.740 | and therefore everybody started asking questions
01:35:09.780 | of what is the process of designing these LLMs,
01:35:14.700 | what is the role of censorship in these,
01:35:17.500 | and all that kind of stuff.
01:35:19.020 | So you commented on Twitter saying
01:35:22.660 | that open source is the answer, essentially.
01:35:26.100 | So can you explain?
01:35:28.220 | - I actually made that comment
01:35:31.180 | on just about every social network I can,
01:35:33.020 | and I've made that point multiple times in various forums.
01:35:38.020 | Here's my point of view on this.
01:35:43.100 | People can complain that AI systems are biased,
01:35:47.260 | and they generally are biased
01:35:49.740 | by the distribution of the training data
01:35:52.060 | that they've been trained on.
01:35:53.860 | That reflects biases in society,
01:35:57.540 | and that is potentially offensive to some people,
01:36:03.980 | or potentially not.
01:36:06.880 | And some techniques to de-bias
01:36:10.700 | then become offensive to some people
01:36:15.400 | because of historical incorrectness and things like that.
01:36:20.400 | And so you can ask the question,
01:36:25.520 | you can ask two questions.
01:36:26.360 | The first question is,
01:36:27.380 | is it possible to produce an AI system that is not biased?
01:36:30.960 | And the answer is absolutely not.
01:36:33.400 | And it's not because of technological challenges,
01:36:37.600 | although there are technological challenges to that.
01:36:41.360 | It's because bias is in the eye of the beholder.
01:36:45.480 | Different people may have different ideas
01:36:48.800 | about what constitutes bias for a lot of things.
01:36:53.580 | I mean, there are facts that are indisputable,
01:36:57.080 | but there are a lot of opinions or things
01:36:59.780 | that can be expressed in different ways.
01:37:02.080 | And so you cannot have an unbiased system.
01:37:05.040 | That's just an impossibility.
01:37:08.800 | And so what's the answer to this?
01:37:12.640 | And the answer is the same answer that we found
01:37:16.520 | in liberal democracy about the press.
01:37:20.860 | The press needs to be free and diverse.
01:37:24.220 | We have free speech for a good reason.
01:37:28.160 | It's because we don't want all of our information
01:37:31.880 | to come from a unique source
01:37:36.680 | 'cause that's opposite to the whole idea of democracy
01:37:40.040 | and progress of ideas and even science.
01:37:45.040 | In science, people have to argue for different opinions
01:37:48.160 | and science makes progress when people disagree
01:37:51.400 | and they come up with an answer and a consensus forms.
01:37:54.600 | And it's true in all democracies around the world.
01:37:57.720 | So there is a future which is already happening
01:38:02.720 | where every single one of our interactions
01:38:05.640 | with the digital world will be mediated
01:38:08.040 | by AI systems, AI assistants, right?
01:38:11.740 | We're gonna have smart glasses.
01:38:14.820 | You can already buy them from Meta, the Ray-Ban Meta,
01:38:18.120 | where you can talk to them and they are connected
01:38:21.520 | with an LLM and you can get answers
01:38:23.600 | on any question you have.
01:38:25.920 | Or you can be looking at a monument
01:38:28.840 | and there is a camera in the glasses,
01:38:32.440 | so you can ask it, what can you tell me about this building
01:38:35.680 | or this monument?
01:38:36.520 | You can be looking at a menu in a foreign language
01:38:39.160 | and the thing will translate it for you
01:38:40.760 | or you can do real-time translation
01:38:43.640 | if you speak different languages.
01:38:44.800 | So a lot of our interactions with the digital world
01:38:48.280 | are gonna be mediated by those systems in the near future.
01:38:51.120 | Increasingly, the search engines that we're gonna use
01:38:57.000 | are not gonna be search engines.
01:38:58.080 | They're gonna be dialogue systems
01:39:01.440 | that we just ask a question and it will answer
01:39:05.160 | and then point you to perhaps appropriate reference for it.
01:39:08.880 | But here is the thing, we cannot afford those systems
01:39:11.920 | to come from a handful of companies
01:39:13.960 | on the west coast of the US.
01:39:15.320 | Because those systems will constitute the repository
01:39:20.080 | of all human knowledge.
01:39:22.040 | And we cannot have that be controlled
01:39:25.600 | by a small number of people, right?
01:39:27.960 | It has to be diverse.
01:39:29.120 | For the same reason, the press has to be diverse.
01:39:32.200 | So how do we get a diverse set of AI assistance?
01:39:35.520 | It's very expensive and difficult to train a base model,
01:39:40.120 | right, a base LLM at the moment.
01:39:42.240 | You know, in the future, it might be something different,
01:39:43.920 | but at the moment, that's an LLM.
01:39:46.040 | So only a few companies can do this properly.
01:39:49.560 | And if some of those top systems are open source,
01:39:55.560 | anybody can use them.
01:39:57.120 | Anybody can fine-tune them.
01:39:59.120 | If we put in place some systems
01:40:01.680 | that allows any group of people,
01:40:05.560 | whether they are individual citizens, groups of citizens,
01:40:11.400 | government organizations, NGOs, companies, whatever,
01:40:17.320 | to take those open source systems, AI systems,
01:40:23.920 | and fine-tune them for their own purpose on their own data,
01:40:27.680 | then we're gonna have a very large diversity
01:40:29.640 | of different AI systems that are specialized
01:40:32.840 | for all of those things, right?
01:40:34.640 | So I tell you, I talk to the French government quite a bit,
01:40:38.200 | and the French government will not accept
01:40:41.360 | that the digital diet of all their citizens
01:40:44.560 | be controlled by three companies
01:40:46.400 | on the west coast of the US.
01:40:48.120 | That's just not acceptable.
01:40:49.640 | It's a danger to democracy,
01:40:51.200 | regardless of how well-intentioned those companies are, right?
01:40:54.560 | And it's also a danger to local culture,
01:41:01.000 | to values, to language, right?
01:41:05.400 | I was talking with the founder of Infosys in India.
01:41:10.400 | He's funding a project to fine-tune LLAMA 2,
01:41:16.640 | the open source model produced by Meta,
01:41:19.920 | so that LLAMA 2 speaks all 22 official languages in India.
01:41:23.120 | It's very important for people in India.
01:41:26.480 | I was talking to a former colleague of mine,
01:41:28.240 | Moustapha Cissé, who used to be a scientist at FAIR,
01:41:31.320 | and then moved back to Africa,
01:41:32.480 | created a research lab for Google in Africa,
01:41:35.200 | and now has a new startup called Cara.
01:41:37.960 | And what he's trying to do is basically have an LLM
01:41:40.520 | that speaks the local languages in Senegal
01:41:42.880 | so that people can have access to medical information,
01:41:46.200 | 'cause they don't have access to doctors.
01:41:47.560 | It's a very small number of doctors per capita in Senegal.
01:41:51.920 | I mean, you can't have any of this
01:41:55.480 | unless you have open source platforms.
01:41:57.960 | So with open source platforms,
01:41:59.160 | you can have AI systems that are not only diverse
01:42:01.760 | in terms of political opinions or things of that type,
01:42:05.040 | but in terms of language, culture,
01:42:10.000 | value systems, political opinions,
01:42:15.440 | technical abilities in various domains.
01:42:18.920 | And you can have an industry, an ecosystem of companies
01:42:22.200 | that fine tune those open source systems
01:42:24.560 | for vertical applications in industry, right?
01:42:27.200 | You have, I don't know, a publisher has thousands of books,
01:42:30.240 | and they want to build a system that allows the customer
01:42:32.880 | to just ask a question about the content
01:42:36.040 | of any of their books.
01:42:37.640 | You need to train on their proprietary data, right?
01:42:41.000 | You have a company, we have one within Meta,
01:42:43.520 | it's called MetaMate, and it's basically an LLM
01:42:46.640 | that can answer any question about internal stuff
01:42:50.200 | about the company.
01:42:52.080 | Very useful.
01:42:53.280 | A lot of companies want this, right?
01:42:55.240 | A lot of companies want this not just for their employees,
01:42:57.880 | but also for their customers, to take care of their customers.
01:43:00.760 | So the only way you're gonna have an AI industry,
01:43:04.360 | the only way you're gonna have AI systems
01:43:06.280 | that are not uniquely biased
01:43:08.680 | is if you have open source platforms
01:43:10.280 | on top of which any group can build specialized systems.
01:43:15.280 | So the direction of, inevitable direction of history
01:43:21.680 | is that the vast majority of AI systems
01:43:26.040 | will be built on top of open source platforms.
01:43:28.400 | - So that's a beautiful vision.
01:43:30.160 | So meaning like a company like Meta or Google or so on
01:43:37.880 | should take only minimal fine-tuning steps
01:43:40.560 | after building the foundation pre-trained model,
01:43:44.880 | as few steps as possible.
01:43:47.240 | - Basically.
01:43:48.080 | - Can Meta afford to do that?
01:43:51.520 | - No.
01:43:52.360 | - So I don't know if you know this,
01:43:53.620 | but companies are supposed to make money somehow,
01:43:56.240 | and open source is like giving away, I don't know,
01:44:01.040 | Mark made a video, Mark Zuckerberg, very sexy video,
01:44:06.120 | talking about 350,000 Nvidia H100s.
01:44:11.120 | The math of that, just for the GPUs, is roughly 10 billion dollars,
01:44:17.680 | plus the infrastructure for training everything.
01:44:22.360 | So I'm no business guy, but how do you make money on that?
01:44:27.360 | So the vision you paint is a really powerful one,
01:44:30.180 | but how is it possible to make money?
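A quick back-of-the-envelope check of that GPU figure, assuming a rough street price of about $30,000 per H100; Meta's actual cost is not public:

```python
# Rough cost arithmetic (prices are assumptions, not Meta's actual spend):
h100_count = 350_000
price_per_gpu_usd = 30_000          # rough street price per H100-class GPU
gpu_spend = h100_count * price_per_gpu_usd
print(f"GPU spend alone: ~${gpu_spend / 1e9:.0f} billion")  # roughly $10 billion
```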
01:44:32.560 | - Okay, so you have several business models, right?
01:44:36.760 | The business model that Meta is built around
01:44:39.600 | is you offer a free service,
01:44:44.160 | and the financing of that service is either through ads
01:44:50.080 | or through business customers.
01:44:52.680 | So for example, if you have an LLM
01:44:54.940 | that can help a mom and pop pizza place
01:45:00.500 | by talking to the customers through WhatsApp,
01:45:03.640 | and so the customers can just order a pizza
01:45:06.580 | and the system will just ask them like,
01:45:08.700 | what topping do you want or what size, blah, blah, blah.
01:45:11.340 | The business will pay for that, okay, that's a model.
01:45:15.280 | And otherwise, if it's a system
01:45:21.760 | that is on the more kind of classical services,
01:45:24.360 | it can be ad supported or there's several models.
01:45:28.140 | But the point is, if you have a big enough
01:45:31.600 | potential customer base and you need to build that system
01:45:36.240 | anyway for them, it doesn't hurt you
01:45:41.080 | to actually distribute it in open source.
01:45:43.240 | - Again, I'm no business guy,
01:45:45.400 | but if you release the open source model,
01:45:48.060 | then other people can do the same kind of task
01:45:51.720 | and compete on it,
01:45:52.920 | basically provide fine tuned models for businesses.
01:45:57.000 | So is the bet that Meta is making,
01:45:59.700 | by the way, I'm a huge fan of all this,
01:46:01.460 | but is the bet that Meta is making is like,
01:46:03.840 | we'll do a better job of it?
01:46:05.580 | - Well, no, the bet is more,
01:46:08.440 | we already have a huge user base and customer base, right?
01:46:13.440 | So it's gonna be useful to them.
01:46:15.320 | Whatever we offer them is gonna be useful
01:46:17.540 | and there is a way to derive revenue from this.
01:46:22.280 | And it doesn't hurt that we provide that system
01:46:26.680 | or the base model, right?
01:46:29.400 | The foundation model in open source
01:46:32.640 | for others to build applications on top of it too.
01:46:35.820 | If those applications turn out to be useful
01:46:37.400 | for our customers, we can just buy it from them.
01:46:39.840 | It could be that they will improve the platform.
01:46:44.280 | In fact, we see this already.
01:46:46.400 | I mean, there is literally millions of downloads
01:46:49.000 | of LLAMA 2, and thousands of people who have provided ideas
01:46:53.720 | about how to make it better.
01:46:55.600 | So this clearly accelerates progress
01:46:59.320 | to make the system available to sort of a wide community
01:47:04.320 | of people and there's literally thousands of businesses
01:47:07.800 | who are building applications with it.
01:47:09.640 | So our ability to, Meta's ability to derive revenue
01:47:18.200 | from this technology is not impaired
01:47:20.480 | by the distribution of base models in open source.
01:47:26.640 | - The fundamental criticism that Gemini is getting
01:47:28.680 | is that, as you pointed out on the West Coast,
01:47:31.040 | just to clarify, we're currently on the East Coast
01:47:34.680 | where I would suppose Meta AI headquarters would be.
01:47:37.840 | So there are strong words about the West Coast,
01:47:42.540 | but I guess the issue that happens is,
01:47:47.000 | I think it's fair to say that most tech people
01:47:49.960 | have a political affiliation with the left wing.
01:47:53.920 | They lean left.
01:47:55.320 | And so the problem that people are criticizing Gemini with
01:47:58.440 | is that there's, in that de-biasing process
01:48:01.160 | that you mentioned, that their ideological lean
01:48:06.000 | becomes obvious.
01:48:08.940 | Is this something that could be escaped?
01:48:14.520 | You're saying open source is the only way.
01:48:17.160 | Have you witnessed this kind of ideological lean
01:48:19.640 | that makes engineering difficult?
01:48:22.360 | - No, I don't think it has to do,
01:48:24.240 | I don't think the issue has to do with the political leaning
01:48:26.740 | of the people designing those systems.
01:48:29.300 | It has to do with the acceptability or political leanings
01:48:34.300 | of their customer base or audience, right?
01:48:38.280 | So a big company cannot afford to offend too many people.
01:48:43.640 | So they're going to make sure
01:48:46.480 | that whatever product they put out is safe,
01:48:49.440 | whatever that means.
01:48:50.440 | And it's very possible to overdo it.
01:48:56.200 | And it's also very possible to,
01:48:58.020 | it's impossible to do it properly for everyone.
01:49:00.360 | You're not going to satisfy everyone.
01:49:02.520 | So that's what I said before.
01:49:03.760 | You cannot have a system that is unbiased,
01:49:05.680 | that is perceived as unbiased by everyone.
01:49:07.760 | It's gonna be, you push it in one way,
01:49:11.560 | one set of people are going to see it as biased,
01:49:14.640 | and then you push it the other way,
01:49:15.700 | and another set of people is going to see it as biased.
01:49:18.600 | And then in addition to this, there's the issue of,
01:49:21.640 | if you push the system,
01:49:22.660 | perhaps a little too far in one direction,
01:49:24.260 | it's going to be non-factual, right?
01:49:25.840 | You're going to have black Nazi soldiers in the--
01:49:30.840 | - Yeah, so we should mention image generation
01:49:33.640 | of black Nazi soldiers, which is not factually accurate.
01:49:38.960 | - Right, and can be offensive for some people as well, right?
01:49:42.180 | So it's going to be impossible to kind of produce systems
01:49:47.180 | that are unbiased for everyone.
01:49:49.080 | So the only solution that I see is diversity.
01:49:52.200 | - And diversity is the full meaning of that word,
01:49:54.980 | diversity of in every possible way.
01:49:57.940 | - Yeah.
01:49:59.380 | - Marc Andreessen just tweeted today,
01:50:02.640 | let me do a TLDR.
01:50:06.040 | The conclusion is only startups and open source
01:50:08.640 | can avoid the issue that he's highlighting with big tech.
01:50:12.240 | He's asking, can big tech actually field
01:50:15.440 | generative AI products?
01:50:17.480 | One, ever escalating demands from internal activists,
01:50:20.760 | employee mobs, crazed executives, broken boards,
01:50:24.440 | pressure groups, extremist regulators, government agencies,
01:50:27.240 | the press, in quotes, experts, and everything,
01:50:31.240 | corrupting the output.
01:50:34.240 | Two, constant risk of generating a bad answer
01:50:37.360 | or drawing a bad picture or rendering a bad video.
01:50:40.600 | Who knows what it is going to say or do at any moment?
01:50:44.440 | Three, legal exposure, product liability, slander,
01:50:48.160 | election law, many other things, and so on.
01:50:51.720 | Anything that makes Congress mad.
01:50:53.900 | Four, continuous attempts to tighten grip
01:50:57.240 | on acceptable output, degrade the model,
01:50:59.700 | like how good it actually is in terms of being usable
01:51:03.600 | and pleasant to use and effective and all that kind of stuff.
01:51:06.920 | And five, publicity of bad text, images, video,
01:51:10.440 | actually puts those examples into the training data
01:51:13.080 | for the next version and so on.
01:51:15.780 | So he just highlights how difficult this is
01:51:18.360 | from all kinds of people being unhappy.
01:51:21.040 | As you said, you can't create a system
01:51:23.040 | that makes everybody happy.
01:51:24.440 | So if you're going to do the fine tuning yourself
01:51:29.080 | and keep it closed source, essentially the problem there
01:51:33.200 | is then trying to minimize the number of people
01:51:35.080 | who are going to be unhappy.
01:51:37.280 | And you're saying that's almost impossible to do right
01:51:42.280 | and the better way is to do open source.
01:51:44.740 | - Basically, yeah.
01:51:46.800 | Marc is right about a number of things that he lists
01:51:51.760 | that indeed scare large companies.
01:51:55.300 | Certainly congressional investigations is one of them.
01:52:00.400 | Legal liability, making things that get people
01:52:05.400 | to hurt themselves or hurt others.
01:52:09.200 | Big companies are really careful
01:52:12.580 | about not producing things of this type
01:52:15.120 | because they don't want to hurt anyone, first of all,
01:52:21.280 | and then second, they want to preserve their business.
01:52:23.200 | So it's essentially impossible for systems like this
01:52:26.920 | that can inevitably formulate political opinions
01:52:30.960 | and opinions about various things
01:52:32.840 | that may be political or not,
01:52:34.040 | but that people may disagree about moral issues
01:52:38.360 | and questions about religion and things like that, right,
01:52:43.360 | or cultural issues that people from different communities
01:52:47.960 | would disagree with in the first place.
01:52:50.120 | So there's only kind of a relatively small number of things
01:52:52.560 | that people will sort of agree on, basic principles.
01:52:57.560 | But beyond that, if you want those systems to be useful,
01:53:01.840 | they will necessarily have to offend
01:53:05.200 | a number of people inevitably.
01:53:08.080 | - And so open source is just better.
01:53:10.960 | And then-- - Diversity is better, right.
01:53:13.280 | - And open source enables diversity.
01:53:15.480 | - That's right, open source enables diversity.
01:53:18.200 | - This can be a fascinating world where if it's true
01:53:21.560 | that the open source world, if meta leads the way
01:53:23.960 | and creates this kind of open source
01:53:25.840 | foundation model world, there's going to be,
01:53:28.560 | like governments will have a fine tune model.
01:53:31.520 | And then potentially people that vote left and right
01:53:36.520 | will have their own model and preference
01:53:40.640 | to be able to choose.
01:53:42.000 | And it will potentially divide us even more,
01:53:44.400 | but that's on us humans, we get to figure out.
01:53:48.280 | Basically the technology enables humans to human
01:53:52.000 | more effectively and all the difficult ethical questions
01:53:56.160 | that humans raise will just leave it up to us
01:54:01.040 | to figure that out.
01:54:02.640 | - Yeah, I mean, there are some limits to what,
01:54:04.760 | the same way there are limits to free speech,
01:54:06.480 | there has to be some limit to the kind of stuff
01:54:08.880 | that those systems might be authorized to produce,
01:54:13.880 | some guardrails.
01:54:16.440 | So, I mean, that's one thing I've been interested in,
01:54:18.280 | which is in the type of architecture that we were discussing
01:54:21.800 | before, where the output of a system is a result
01:54:26.760 | of an inference to satisfy an objective,
01:54:29.840 | that objective can include guardrails.
01:54:31.960 | And we can put guardrails in open source systems.
01:54:37.400 | I mean, if we eventually have systems that are built
01:54:39.760 | with this blueprint, we can put guardrails in those systems
01:54:44.200 | that guarantee that there is sort of a minimum set
01:54:47.640 | of guardrails that make the system non-dangerous
01:54:50.040 | and non-toxic, et cetera.
01:54:51.480 | Basic things that everybody would agree on.
01:54:53.680 | And then the fine tuning that people will add
01:54:58.200 | or the additional guardrails that people will add
01:55:00.400 | will kind of cater to their community, whatever it is.
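A minimal sketch of what "the objective can include guardrails" might look like in an objective-driven system, assuming a toy two-dimensional output, a quadratic task cost, and a hinge-style penalty as the guardrail; all of this is illustrative, not a description of any deployed system:

```python
# "Inference by optimization": the answer z is found by minimizing an
# objective that is task cost plus a hard-wired guardrail penalty.
import torch

def task_cost(z):
    """Stand-in for 'how well does output z do what was asked'."""
    target = torch.tensor([3.0, 4.0])
    return ((z - target) ** 2).sum()

def guardrail_cost(z):
    """Stand-in guardrail: heavily penalize outputs outside an allowed region
    (here, anything with norm > 2 counts as 'not allowed')."""
    return 10.0 * torch.relu(z.norm() - 2.0) ** 2

z = torch.zeros(2, requires_grad=True)       # the answer being inferred
opt = torch.optim.SGD([z], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    loss = task_cost(z) + guardrail_cost(z)  # objective includes the guardrail
    loss.backward()
    opt.step()

# The answer moves toward the task target only as far as the guardrail allows.
print("inferred output:", z.detach(), "norm:", z.detach().norm().item())
```

The design point is that the guardrail is part of the objective being optimized at inference time, rather than a behavior the model may or may not have absorbed during fine-tuning.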
01:55:04.960 | - And yeah, the fine tuning will be more about
01:55:07.240 | the gray areas of what is hate speech, what is dangerous
01:55:10.400 | and all that kind of stuff.
01:55:11.480 | I mean, you've--
01:55:12.320 | - Or different value systems.
01:55:13.360 | - Different value systems.
01:55:14.560 | I mean, like, but still, even with the objectives
01:55:16.760 | of how to build a bioweapon, for example,
01:55:18.760 | I think something you've commented on,
01:55:20.800 | or at least there's a paper where a collection
01:55:24.040 | of researchers is just trying to understand
01:55:26.400 | the social impacts of these LLMs.
01:55:29.320 | And I guess one threshold that's nice is like,
01:55:32.360 | does the LLM make it any easier than a search would,
01:55:37.360 | like a Google search would?
01:55:39.480 | - Right, so the increasing number of studies on this
01:55:44.480 | seems to point to the fact that it doesn't help.
01:55:49.600 | So having an LLM doesn't help you design
01:55:53.480 | or build a bioweapon or a chemical weapon
01:55:57.280 | if you already have access to a search engine
01:56:00.200 | and a library.
01:56:01.040 | And so the sort of increased information you get
01:56:04.920 | or the ease with which you get it doesn't really help you.
01:56:08.200 | That's the first thing.
01:56:09.040 | The second thing is it's one thing to have a list
01:56:12.080 | of instructions of how to make a chemical weapon,
01:56:15.600 | for example, or a bioweapon.
01:56:17.160 | It's another thing to actually build it.
01:56:20.040 | And it's much harder than you might think.
01:56:22.160 | And LLM will not help you with that.
01:56:24.000 | In fact, nobody in the world,
01:56:27.080 | not even countries use bioweapons
01:56:29.560 | because most of the time they have no idea
01:56:32.320 | how to protect their own populations against it.
01:56:34.280 | So it's too dangerous actually to kind of ever use.
01:56:39.280 | And it's in fact banned by international treaties.
01:56:44.280 | Chemical weapons is different.
01:56:45.680 | It's also banned by treaties,
01:56:47.680 | but it's the same problem.
01:56:50.760 | It's difficult to use it in a way
01:56:53.120 | that doesn't turn against the perpetrators.
01:56:56.520 | But we could ask Elon Musk.
01:56:58.440 | I can give you a very precise list of instructions
01:57:01.800 | of how you build a rocket engine.
01:57:03.440 | And even if you have a team of 50 engineers
01:57:06.800 | that are really experienced building it,
01:57:08.280 | you're still gonna have to blow up a dozen of them
01:57:10.120 | before you get one that works.
01:57:11.560 | And it's the same with chemical weapons
01:57:18.040 | or bioweapons or things like this.
01:57:19.560 | It requires expertise in the real world
01:57:23.080 | that an LLM is not gonna help you with.
01:57:25.240 | - And it requires even the common sense expertise
01:57:28.040 | that we've been talking about,
01:57:29.080 | which is how to take language-based instructions
01:57:34.000 | and materialize them in the physical world.
01:57:36.880 | It requires a lot of knowledge
01:57:38.480 | that's not in the instructions.
01:57:41.560 | - Yeah, exactly.
01:57:42.400 | A lot of biologists have posted on this, actually,
01:57:44.520 | in response to those things saying like,
01:57:46.400 | do you realize how hard it is to actually do the lab work?
01:57:49.240 | Like, you know, this is not trivial.
01:57:51.840 | - Yeah, and that's Hans Moravec comes to light once again.
01:57:59.360 | Just to linger on LLAMA,
01:58:01.800 | Mark announced that LLAMA 3 is coming out eventually.
01:58:03.480 | I don't think there's a release date.
01:58:06.920 | But what are you most excited about?
01:58:08.960 | First of all, LLAMA 2 that's already out there,
01:58:12.760 | and maybe the future, LLAMA 3, 4, 5, 6, 10,
01:58:15.600 | just the future of the open source under Meta?
01:58:15.600 | - Well, a number of things.
01:58:18.080 | So there's gonna be like various versions of LLAMA
01:58:22.000 | that are improvements of previous LLAMAs,
01:58:26.880 | bigger, better, multimodal, things like that.
01:58:30.680 | And then in future generations,
01:58:32.000 | systems that are capable of planning,
01:58:34.120 | that really understand how the world works.
01:58:36.880 | Maybe are trained from video, so they have some world model.
01:58:39.600 | Maybe, you know, capable of the type of reasoning
01:58:42.160 | and planning I was talking about earlier.
01:58:44.120 | Like, how long is that gonna take?
01:58:45.360 | Like, when is the research that is going in that direction
01:58:48.520 | going to sort of feed into the product line,
01:58:52.080 | if you want, of LLAMA?
01:58:53.520 | I don't know.
01:58:54.360 | I can't tell you.
01:58:55.200 | And there is, you know, a few breakthroughs
01:58:56.320 | that we have to basically go through
01:58:59.680 | before we can get there.
01:59:01.880 | But you'll be able to monitor our progress
01:59:04.560 | because we publish our research, right?
01:59:07.040 | So, you know, last week we published the V-JEPA work,
01:59:12.040 | which is sort of a first step
01:59:13.240 | towards training systems from video.
01:59:15.000 | And then the next step is gonna be world models
01:59:18.960 | based on kind of this type of idea, training from video.
01:59:23.760 | There's similar work taking place at DeepMind,
01:59:26.120 | and also at UC Berkeley
01:59:30.840 | on world models from video.
01:59:33.800 | A lot of people are working on this.
01:59:35.160 | I think a lot of good ideas are appearing.
01:59:38.480 | My bet is that those systems are gonna be JEPA-like.
01:59:41.760 | They're not gonna be generative models.
01:59:43.960 | And we'll see what the future will tell.
01:59:48.960 | There's really good work by a gentleman
01:59:54.720 | called Danijar Hafner, who is now at DeepMind,
01:59:56.880 | who's worked on kind of models of this type
01:59:58.720 | that learn representations, and then use them for planning
02:00:01.800 | or learning tasks by reinforcement learning.
02:00:04.160 | And a lot of work at Berkeley by Pieter Abbeel,
02:00:09.560 | Sergey Levine, a bunch of other people of that type.
02:00:12.400 | I'm collaborating with actually in the context
02:00:15.360 | of some grants with my NYU hat.
02:00:18.160 | And then collaborations also through Meta,
02:00:22.360 | 'cause the lab at Berkeley is associated
02:00:25.640 | with Meta in some way, so with FAIR.
02:00:28.280 | So I think it's very exciting.
02:00:30.720 | I think, I'm super excited about,
02:00:34.200 | I haven't been that excited about the direction
02:00:36.720 | of machine learning and AI since 10 years ago
02:00:41.320 | when FAIR was started.
02:00:42.280 | And before that, 30, let's say 35 years ago,
02:00:46.120 | we were working on convolutional nets
02:00:48.720 | and the early days of neural nets.
02:00:52.000 | So I'm super excited because I see a path
02:00:56.280 | towards potentially human level intelligence
02:00:59.200 | with systems that can understand the world,
02:01:04.120 | remember, plan, reason.
02:01:05.760 | There is some set of ideas to make progress there
02:01:10.480 | that might have a chance of working.
02:01:12.400 | And I'm really excited about this.
02:01:14.600 | What I like is that somewhat we get onto a good direction
02:01:19.600 | and perhaps succeed before my brain turns to a white sauce
02:01:24.920 | or before I need to retire. (laughs)
02:01:28.320 | - Yeah, yeah.
02:01:30.160 | Are you also excited by, are you,
02:01:32.380 | is it beautiful to you just the amount of GPUs involved,
02:01:38.000 | sort of the whole training process on this much compute?
02:01:42.880 | It's just zooming out, just looking at earth
02:01:45.320 | and humans together have built these computing devices
02:01:49.720 | and are able to train this one brain.
02:01:53.560 | That we then open source.
02:01:55.740 | Like giving birth to this open source brain
02:02:01.040 | trained on this gigantic compute system.
02:02:04.320 | There's just the details of how to train on that,
02:02:07.680 | how to build the infrastructure and the hardware,
02:02:10.060 | the cooling, all of this kind of stuff.
02:02:12.240 | Are you just still, most of your excitement
02:02:14.360 | is in the theory aspect of it?
02:02:17.240 | Meaning like the software?
02:02:19.600 | - Well, I used to be a hardware guy many years ago.
02:02:21.480 | - Yes, yes, that's right. - Decades ago.
02:02:23.080 | Hardware has improved a little bit, changed a little bit.
02:02:26.960 | Yeah.
02:02:27.800 | - I mean, certainly scale is necessary, but not sufficient.
02:02:32.360 | - Absolutely.
02:02:33.200 | - So we certainly need compute.
02:02:34.600 | I mean, we're still far in terms of compute power
02:02:37.000 | from what we would need to match the compute power
02:02:40.800 | of the human brain.
02:02:42.880 | This may occur in the next couple of decades,
02:02:45.040 | but we're still some ways away.
02:02:47.600 | And certainly in terms of power efficiency,
02:02:49.880 | we're really far.
02:02:51.920 | So there's a lot of progress to make in hardware.
02:02:56.000 | And right now, a lot of the progress is not,
02:03:00.240 | I mean, there's a bit coming from silicon technology,
02:03:03.000 | but a lot of it coming from architectural innovation
02:03:06.440 | and quite a bit coming from more efficient ways
02:03:10.200 | of implementing the architectures that have become popular,
02:03:13.640 | basically a combination of Transformers and ConvNets, right?
02:03:17.520 | And so there's still some ways to go
02:03:22.280 | until we're gonna saturate.
02:03:27.280 | We're gonna have to come up with like new principles,
02:03:30.200 | new fabrication technology, new basic components,
02:03:34.560 | perhaps based on sort of different principles
02:03:38.880 | and those classical digital CMOS.
02:03:42.000 | - Interesting.
02:03:42.840 | So you think in order to build AMI,
02:03:47.440 | we potentially might need
02:03:50.520 | some hardware innovation too.
02:03:52.920 | - Well, if we wanna make it ubiquitous, yeah, certainly.
02:03:56.640 | 'Cause we're gonna have to reduce the power consumption.
02:04:01.640 | A GPU today, right, is half a kilowatt to a kilowatt.
02:04:05.580 | Human brain is about 25 watts.
02:04:08.640 | And a GPU is way below the compute power of the human brain.
02:04:13.100 | You need something like 100,000 or a million to match it.
02:04:16.040 | So we are off by a huge factor.
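A rough check of that factor, using the approximate wattage numbers just mentioned:

```python
# Rough power-efficiency gap (all numbers approximate):
gpu_watts = 700                 # one modern GPU, roughly 0.5-1 kW
brain_watts = 25                # human brain
gpus_to_match_brain = 100_000   # lower end of the estimate above

cluster_watts = gpu_watts * gpus_to_match_brain          # ~70 megawatts
gap = cluster_watts / brain_watts
print(f"~{cluster_watts / 1e6:.0f} MW vs {brain_watts} W "
      f"-> off by a factor of ~{gap:,.0f}")              # a few million
```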
02:04:19.760 | - You often say that AGI is not coming soon,
02:04:26.280 | meaning like not this year,
02:04:28.560 | not the next few years, potentially farther away.
02:04:32.760 | What's your basic intuition behind that?
02:04:35.720 | - So first of all, it's not going to be an event, right?
02:04:39.080 | The idea somehow, which is popularized
02:04:41.560 | by science fiction and Hollywood,
02:04:43.140 | that somehow somebody is gonna discover the secret,
02:04:47.060 | the secret to AGI or human level AI or AMI,
02:04:50.860 | whatever you wanna call it.
02:04:52.400 | And then turn on a machine and then we have AGI.
02:04:55.220 | That's just not going to happen.
02:04:57.140 | It's not going to be an event.
02:04:58.640 | It's gonna be gradual progress.
02:05:02.640 | Are we gonna have systems that can learn from video
02:05:07.000 | how the world works and learn good representations?
02:05:09.440 | Yeah, before we get them to the scale and performance
02:05:13.060 | that we observe in humans, it's gonna take quite a while.
02:05:15.600 | It's not gonna happen in one day.
02:05:17.240 | Are we gonna get systems that can have large amount
02:05:23.320 | of associative memories so they can remember stuff?
02:05:26.660 | Yeah, but same, it's not gonna happen tomorrow.
02:05:28.720 | I mean, there is some basic techniques
02:05:30.440 | that need to be developed.
02:05:31.460 | We have a lot of them, but to get this to work together
02:05:34.800 | with a full system is another story.
02:05:37.040 | Are we gonna have systems that can reason and plan,
02:05:39.200 | perhaps along the lines of objective-driven
02:05:42.160 | AI architectures that I described before?
02:05:45.000 | Yeah, but before we get this to work properly,
02:05:47.480 | it's gonna take a while.
02:05:49.320 | And before we get all those things to work together,
02:05:51.280 | and then on top of this, have systems that can learn
02:05:54.020 | like hierarchical planning, hierarchical representations,
02:05:56.800 | systems that can be configured for a lot
02:05:59.640 | of different situations at hands,
02:06:01.020 | the way the human brain can.
02:06:02.640 | All of this is gonna take at least a decade,
02:06:07.860 | probably much more, because there are a lot of problems
02:06:11.060 | that we're not seeing right now that we have not encountered,
02:06:15.300 | and so we don't know if there is an easy solution
02:06:17.280 | within this framework.
02:06:18.600 | So, you know, it's not just around the corner.
02:06:23.380 | I mean, I've been hearing people for the last 12, 15 years
02:06:27.580 | claiming that, you know, AGI is just around the corner
02:06:30.040 | and being systematically wrong.
02:06:32.620 | And I knew they were wrong when they were saying it.
02:06:34.520 | I call their bullshit.
02:06:35.580 | Why do you think people have been calling,
02:06:38.220 | first of all, I mean, from the beginning,
02:06:39.740 | from the birth of the term artificial intelligence,
02:06:42.780 | there has been an eternal optimism
02:06:45.340 | that's perhaps unlike other technologies?
02:06:49.100 | Is it Moravec's paradox?
02:06:51.820 | Is that the explanation for why people
02:06:54.420 | are so optimistic about AGI?
02:06:57.060 | - I don't think it's just Moravec's paradox.
02:06:58.780 | Moravec's paradox is a consequence of realizing
02:07:01.260 | that the world is not as easy as we think.
02:07:03.780 | So first of all, intelligence is not a linear thing
02:07:08.620 | you can measure with a scalar, with a single number.
02:07:11.500 | You know, can you say that humans are smarter
02:07:15.260 | than orangutans?
02:07:18.340 | In some ways, yes.
02:07:20.220 | But in some ways, orangutans are smarter than humans
02:07:22.140 | in a lot of domains
02:07:23.820 | that allow them to survive in the forest, for example.
02:07:26.820 | - So IQ is a very limited measure of intelligence.
02:07:30.380 | You know, intelligence is bigger
02:07:31.580 | than what IQ, for example, measures?
02:07:33.900 | - Well, IQ can measure, you know,
02:07:36.580 | approximately something for humans.
02:07:38.780 | But because humans kind of come in relatively
02:07:44.660 | kind of uniform form, right?
02:07:49.060 | But it only measures one type of ability
02:07:53.780 | that may be relevant for some tasks, but not others.
02:07:56.620 | But then if you're talking about other intelligent entities
02:08:02.540 | for which the basic things that are easy to them
02:08:07.140 | is very different, then it doesn't mean anything.
02:08:11.420 | So intelligence is a collection of skills
02:08:15.780 | and an ability to acquire new skills efficiently, right?
02:08:22.900 | And the collection of skills that any particular
02:08:27.540 | intelligent entity possess or is capable of learning quickly
02:08:31.700 | is different from the collection of skills of another one.
02:08:35.340 | And because it's a multidimensional thing,
02:08:37.460 | the set of skills is a high-dimensional space,
02:08:39.500 | you can't measure it,
02:08:41.340 | you cannot compare two things
02:08:42.860 | as to whether one is more intelligent than the other.
02:08:45.780 | It's multidimensional.
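As a tiny concrete illustration of that point, two skill profiles can each be better on some dimensions and worse on others, so neither dominates the other; the skills and numbers below are made up:

```python
# Two "skill vectors" in a toy 3-dimensional skill space: neither dominates
# the other on every skill, so there is no single-number ordering between them.
import numpy as np

a = np.array([0.9, 0.2, 0.7])   # e.g. tool use, foraging, navigation
b = np.array([0.4, 0.95, 0.6])

print("a >= b on every skill?", bool(np.all(a >= b)))   # False
print("b >= a on every skill?", bool(np.all(b >= a)))   # False -> incomparable
```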
02:08:46.900 | - So you push back against what are called AI doomers a lot.
02:08:53.740 | Can you explain their perspective
02:08:57.780 | and why you think they're wrong?
02:08:59.780 | - Okay, so AI doomers imagine all kinds
02:09:02.180 | of catastrophe scenarios of how AI could escape our control
02:09:07.180 | and basically kill us all.
02:09:09.580 | (laughs)
02:09:11.220 | And that relies on a whole bunch of assumptions
02:09:14.460 | that are mostly false.
02:09:15.540 | So the first assumption is that the emergence
02:09:19.380 | of super intelligence could be an event,
02:09:21.820 | that at some point we're going to figure out the secret
02:09:25.100 | and we'll turn on a machine that is super intelligent.
02:09:28.300 | And because we'd never done it before,
02:09:30.500 | it's going to take over the world and kill us all.
02:09:33.060 | That is false.
02:09:33.940 | It's not going to be an event.
02:09:35.900 | We're going to have systems that are like as smart as a cat,
02:09:39.700 | have all the characteristics of human level intelligence,
02:09:44.700 | but their level of intelligence would be like a cat
02:09:47.540 | or a parrot maybe or something.
02:09:49.860 | And then we're going to work our way up
02:09:53.900 | to kind of make those things more intelligent.
02:09:55.420 | And as we make them more intelligent,
02:09:56.780 | we're also going to put some guard rails in them
02:09:58.580 | and learn how to kind of put some guard rails
02:10:00.460 | so they behave properly.
02:10:01.740 | And we're not going to do this with just one,
02:10:03.860 | it's not going to be one effort,
02:10:04.820 | that it's going to be lots of different people doing this.
02:10:07.620 | And some of them are going to succeed
02:10:09.260 | at making intelligent systems that are controllable and safe
02:10:13.180 | and have the right guard rails.
02:10:14.420 | And if some other goes rogue,
02:10:15.980 | then we can use the good ones to go against the rogue ones.
02:10:20.380 | So it's going to be my smart AI police
02:10:23.300 | against your rogue AI.
02:10:25.500 | So it's not going to be like we're going to be exposed
02:10:27.700 | to like a single rogue AI that's going to kill us all.
02:10:29.940 | That's just not happening.
02:10:31.860 | Now, there is another fallacy,
02:10:33.300 | which is the fact that because the system is intelligent,
02:10:36.340 | it necessarily wants to take over.
02:10:38.060 | And there is several arguments
02:10:43.420 | that make people scared of this,
02:10:44.780 | which I think are completely false as well.
02:10:48.500 | So one of them is in nature,
02:10:53.460 | it seems to be that the more intelligent species
02:10:55.580 | are the ones that end up dominating the other.
02:10:58.180 | And even extinguishing the others sometimes by design,
02:11:03.180 | sometimes just by mistake.
02:11:06.780 | And so there is sort of a thinking by which you say,
02:11:12.940 | well, if AI systems are more intelligent than us,
02:11:17.420 | surely they're going to eliminate us,
02:11:19.660 | if not by design, simply because they don't care about us.
02:11:23.180 | And that's just preposterous for a number of reasons.
02:11:27.780 | First reason is they're not going to be a species.
02:11:30.340 | They're not going to be a species that competes with us.
02:11:33.220 | They're not going to have the desire to dominate
02:11:35.420 | because the desire to dominate is something
02:11:37.220 | that has to be hardwired into an intelligent system.
02:11:41.020 | It is hardwired in humans.
02:11:43.580 | It is hardwired in baboons, in chimpanzees, in wolves,
02:11:48.860 | not in orangutans.
02:11:49.980 | This desire to dominate or submit
02:11:56.340 | or attain status in other ways
02:11:59.060 | is specific to social species.
02:12:03.300 | Non-social species like orangutans don't have it, right?
02:12:06.740 | And they are as smart as we are almost, right?
02:12:09.500 | - And to you, there's not significant incentive
02:12:12.140 | for humans to encode that into the AI systems.
02:12:15.180 | And to the degree they do, there'll be other AIs
02:12:18.980 | that sort of punish them for it.
02:12:23.100 | Or out-compete them.
02:12:23.100 | - Well, there's all kinds of incentives
02:12:24.380 | to make AI systems submissive to humans, right?
02:12:27.660 | I mean, this is the way we're going to build them, right?
02:12:30.300 | And so then people say, "Oh, but look at LLMs.
02:12:32.780 | "LLMs are not controllable."
02:12:33.980 | And they're right, LLMs are not controllable.
02:12:36.780 | But object-driven AI, so systems that derive their answers
02:12:41.500 | by optimization of an objective
02:12:43.780 | means they have to optimize this objective,
02:12:45.820 | and that objective can include guardrails.
02:12:48.380 | One guardrail is obey humans.
02:12:52.860 | Another guardrail is don't obey humans
02:12:54.660 | if it's hurting other humans.
02:12:57.140 | - I've heard that before somewhere, I don't remember.
02:12:59.620 | - Yes, maybe in a book.
02:13:02.020 | - Yeah, but speaking of that book,
02:13:05.660 | could there be unintended consequences also from all of this?
02:13:09.260 | - No, of course.
02:13:10.700 | So this is not a simple problem, right?
02:13:12.660 | I mean, designing those guardrails
02:13:14.620 | so that the system behaves properly
02:13:16.300 | is not going to be a simple issue
02:13:20.860 | for which there is a silver bullet,
02:13:22.500 | for which you have a mathematical proof
02:13:24.020 | that the system can be safe.
02:13:25.660 | It's going to be a very progressive,
02:13:27.460 | iterative design system
02:13:28.740 | where we put those guardrails in such a way
02:13:31.820 | that the system behaves properly.
02:13:33.020 | And sometimes they're going to do something
02:13:35.180 | that was unexpected because the guardrail wasn't right,
02:13:38.460 | and we're going to correct them so that they do it right.
02:13:41.180 | The idea somehow that we can't get it slightly wrong
02:13:44.140 | because if we get it slightly wrong,
02:13:45.500 | we all die is ridiculous.
02:13:47.980 | We're just going to go progressively.
02:13:51.580 | And it's just going to be,
02:13:52.980 | the analogy I've used many times is turbojet design.
02:13:57.980 | How did we figure out how to make turbojets
02:14:03.180 | so unbelievably reliable, right?
02:14:07.180 | I mean, those are incredibly complex pieces of hardware.
02:14:11.140 | They run at really high temperatures
02:14:12.740 | for 20 hours at a time sometimes.
02:14:17.540 | And we can fly halfway around the world
02:14:21.020 | with a two-engine jetliner at near the speed of sound.
02:14:26.020 | Like how incredible is this?
02:14:28.580 | It's just unbelievable, right?
02:14:31.060 | And did we do this because we invented
02:14:34.540 | like a general principle of how to make turbojets safe?
02:14:37.060 | No, it took decades to kind of fine-tune
02:14:39.820 | the design of those systems
02:14:40.940 | so that they were safe.
02:14:43.380 | Is there a separate group within General Electric
02:14:48.380 | or SNECMA or whatever that is specialized
02:14:52.500 | in turbojet safety?
02:14:54.660 | No, the design is all about safety
02:14:58.980 | because a better turbojet is also a safer turbojet.
02:15:01.260 | So a more reliable one.
02:15:03.660 | It's the same for AI.
02:15:04.780 | Like, do you need specific provisions to make AI safe?
02:15:08.580 | No, you need to make better AI systems
02:15:10.540 | and they will be safe because they are designed
02:15:12.700 | to be more useful and more controllable.
02:15:16.380 | So let's imagine a system, AI system,
02:15:18.980 | that's able to be incredibly convincing
02:15:23.300 | and can convince you of anything.
02:15:24.940 | I can at least imagine such a system.
02:15:28.060 | And I can see such a system be weapon-like
02:15:33.940 | because it can control people's minds.
02:15:35.460 | We're pretty gullible.
02:15:37.020 | We want to believe a thing.
02:15:38.540 | You can have an AI system that controls it.
02:15:40.820 | And you could see governments using that as a weapon.
02:15:43.540 | So do you think if you imagine such a system,
02:15:47.540 | there's any parallel to something like nuclear weapons?
02:15:54.420 | So why is that technology different?
02:15:58.740 | So you're saying there's going to be gradual development.
02:16:01.860 | There's going to be, I mean, it might be rapid,
02:16:04.300 | but there'll be iterative.
02:16:05.860 | And then we'll be able to kind of respond
02:16:08.020 | and so on.
02:16:09.060 | - So that AI system designed by Vladimir Putin or whatever,
02:16:13.140 | or his minions is going to be like trying to talk
02:16:18.140 | to every American to convince them to vote
02:16:23.420 | for whoever pleases Putin or whatever,
02:16:28.420 | or rile people up against each other
02:16:36.860 | as they've been trying to do.
02:16:38.260 | They're not going to be talking to you.
02:16:40.980 | They're going to be talking to your AI assistant,
02:16:43.420 | which is going to be as smart as theirs, right?
02:16:48.340 | That AI, because as I said, in the future,
02:16:51.180 | every single one of your interactions
02:16:52.620 | with the digital world will be mediated
02:16:54.220 | by your AI assistant.
02:16:55.820 | So the first thing you're going to ask is,
02:16:57.580 | is this a scam?
02:16:58.780 | Like, is this thing like telling me the truth?
02:17:00.740 | Like, it's not even going to be able to get to you
02:17:03.300 | because it's only going to talk to your AI assistant.
02:17:05.820 | Your AI assistant is not even going to,
02:17:08.620 | it's going to be like a spam filter, right?
02:17:10.740 | You're not even seeing the email, the spam email, right?
02:17:13.940 | It's automatically put in a folder that you never see.
02:17:17.420 | It's going to be the same thing.
02:17:18.340 | That AI system that tries to convince you of something
02:17:21.540 | is going to be talking to your AI assistant,
02:17:23.260 | which is going to be at least as smart as it.
02:17:25.500 | And it's going to say, this is spam, you know,
02:17:28.580 | it's not even going to bring it to your attention.
02:17:32.220 | - So to you, it's very difficult for any one AI system
02:17:35.260 | to take such a big leap ahead
02:17:37.500 | to where it can convince even the other AI systems.
02:17:40.100 | So like, there's always going to be this kind of race
02:17:44.220 | where nobody's way ahead.
02:17:46.660 | - That's the history of the world.
02:17:48.900 | History of the world is, you know,
02:17:50.140 | whenever there is a progress someplace,
02:17:52.380 | there is a countermeasure.
02:17:54.100 | And, you know, it's a cat and mouse game.
02:17:57.620 | - This is why, mostly, yes,
02:17:59.420 | but this is why nuclear weapons are so interesting
02:18:01.700 | because that was such a powerful weapon
02:18:05.340 | that it mattered who got it first.
02:18:07.380 | That, you know, you could imagine Hitler, Stalin,
02:18:13.020 | Mao getting the weapon first
02:18:17.620 | and that having a different kind of impact on the world
02:18:20.620 | than the United States getting the weapon first.
02:18:24.140 | But to you, nuclear weapons is like,
02:18:27.480 | you don't imagine a breakthrough discovery
02:18:32.200 | and then Manhattan Project-like effort for AI.
02:18:35.780 | - No, as I said, it's not going to be an event.
02:18:39.180 | It's going to be, you know, continuous progress.
02:18:42.020 | And whenever, you know, one breakthrough occurs,
02:18:46.200 | it's going to be widely disseminated really quickly,
02:18:48.920 | probably first within industry.
02:18:51.040 | I mean, this is not a domain where, you know,
02:18:53.680 | government or military organizations
02:18:55.560 | are particularly innovative
02:18:57.740 | and they're in fact way behind.
02:18:59.340 | And so this is going to come from industry
02:19:02.300 | and this kind of information disseminates extremely quickly.
02:19:05.460 | We've seen this over the last few years, right?
02:19:08.100 | Where you have a new, like, you know, even take AlphaGo,
02:19:11.980 | this was reproduced within three months,
02:19:13.980 | even without like particularly detailed information, right?
02:19:17.980 | - Yeah, this is an industry that's not good at secrecy.
02:19:21.240 | - No, but even if there is,
02:19:22.920 | just the fact that you know that something is possible
02:19:25.920 | makes you like realize that it's worth investing the time
02:19:30.220 | to actually do it.
02:19:31.080 | You may be the second person to do it,
02:19:32.920 | but you know, you'll do it.
02:19:35.220 | And, you know, same for, you know, all the innovations
02:19:41.480 | of, you know, self-supervised learning, transformers,
02:19:44.200 | decoder-only architectures, LLMs.
02:19:46.320 | I mean, those things,
02:19:47.520 | you don't need to know exactly the details of how they work
02:19:49.840 | to know that, you know, it's possible
02:19:52.760 | because it's deployed and then it's getting reproduced.
02:19:54.720 | And then, you know, people who work for those companies move.
02:19:59.720 | They go from one company to another and, you know,
02:20:03.400 | the information disseminates.
02:20:05.120 | What makes the success of the US tech industry
02:20:09.760 | and Silicon Valley in particular is exactly that,
02:20:11.760 | is because information circulates really, really quickly
02:20:14.480 | and, you know, disseminates very quickly.
02:20:17.480 | And so, you know, the whole region sort of is ahead
02:20:21.760 | because of that circulation of information.
02:20:24.600 | - So maybe I, just to linger on the psychology of AI doomers,
02:20:28.560 | you give, in the classic Yann LeCun way,
02:20:31.960 | a pretty good example of just
02:20:34.200 | when a new technology comes to be.
02:20:36.860 | You say, engineer says, "I invented this new thing.
02:20:41.300 | I call it a ball pen."
02:20:44.320 | And then the Twittersphere responds, "OMG,
02:20:47.320 | people could write horrible things with it
02:20:48.960 | like misinformation, propaganda, hate speech, ban it now."
02:20:52.720 | Then writing doomers come in, akin to the AI doomers.
02:20:57.580 | Imagine if everyone can get a ball pen.
02:21:00.980 | This could destroy society.
02:21:02.300 | There should be a law against using ball pen
02:21:04.180 | to write hate speech, regulate ball pens now.
02:21:07.240 | And then the pencil industry mogul says,
02:21:09.720 | "Yeah, ball pens are very dangerous,
02:21:12.680 | unlike pencil writing, which is erasable.
02:21:15.740 | Ball pen writing stays forever.
02:21:18.460 | Government should require a license for a pen manufacturer."
02:21:21.740 | I mean, this does seem to be part of human psychology
02:21:27.660 | when it comes up against new technology.
02:21:32.280 | So what deep insights can you speak to about this?
02:21:37.280 | - Well, there is a natural fear of new technology
02:21:42.720 | and the impact it can have on society.
02:21:45.320 | And people have kind of instinctive reaction
02:21:48.940 | to the world they know being threatened
02:21:53.700 | by major transformations that are either cultural phenomena
02:21:58.320 | or technological revolutions.
02:22:01.000 | And they fear for their culture, they fear for their job,
02:22:05.660 | they fear for the future of their children
02:22:09.980 | and their way of life, right?
02:22:13.800 | So any change is feared.
02:22:16.920 | And you see this, you know, along history,
02:22:20.380 | like any technological revolution or cultural phenomenon
02:22:24.060 | was always accompanied by, you know, groups or reaction
02:22:29.060 | in the media that basically attributed all the problems,
02:22:34.600 | the current problems of society
02:22:37.780 | to that particular change, right?
02:22:40.660 | Electricity was going to kill everyone at some point.
02:22:44.400 | You know, the train was going to be a horrible thing
02:22:47.880 | because, you know, you can't breathe
02:22:49.180 | past 50 kilometers an hour.
02:22:50.860 | And so there's a wonderful website
02:22:54.000 | called the Pessimist Archive,
02:22:55.640 | which has all those newspaper clips
02:22:59.420 | of all the horrible things people imagined would arrive
02:23:02.800 | because of either technological innovation
02:23:07.480 | or a cultural phenomenon.
02:23:10.840 | You know, it's just wonderful examples of, you know,
02:23:15.840 | jazz or comic books being blamed for unemployment
02:23:22.400 | or, you know, young people not wanting to work anymore
02:23:26.360 | and things like that, right?
02:23:27.360 | And that has existed for centuries.
02:23:30.700 | And it's, you know, knee-jerk reactions.
02:23:38.520 | The question is, you know, do we embrace change
02:23:40.800 | or do we resist it?
02:23:44.080 | And what are the real dangers
02:23:47.200 | as opposed to the imagined ones?
02:23:50.500 | - So people worry about,
02:23:53.800 | I think one thing they worry about with big tech,
02:23:55.880 | something we've been talking about over and over,
02:23:58.640 | but I think worth mentioning again,
02:24:02.320 | they worry about how powerful AI will be
02:24:05.960 | and they worry about it being in the hands
02:24:08.720 | of one centralized power
02:24:10.080 | of just a handful of central control.
02:24:13.760 | And so that's the skepticism with big tech.
02:24:15.880 | You can make, these companies can make
02:24:17.560 | a huge amount of money and control this technology
02:24:21.800 | and by so doing, you know, take advantage,
02:24:26.680 | abuse the little guy in society.
02:24:29.080 | - Well, that's exactly why we need open source platforms.
02:24:31.920 | - Yeah, I just wanted to nail the point home more and more.
02:24:36.920 | - Yes.
02:24:38.480 | - So let me ask you on your,
02:24:40.600 | like I said, you do get a little bit flavorful
02:24:45.200 | on the internet.
02:24:46.760 | Joscha Bach tweeted something that you LOL'd at
02:24:50.800 | in reference to HAL 9000.
02:24:53.320 | Quote, "I appreciate your argument
02:24:55.560 | "and I fully understand your frustration,
02:24:57.420 | "but whether the pod bay doors should be opened or closed
02:25:00.960 | "is a complex and nuanced issue."
02:25:03.840 | So you're at the head of Meta AI.
02:25:06.940 | You know, this is something that really worries me
02:25:12.000 | that AI, our AI overlords will speak down to us
02:25:16.640 | with corporate speak of this nature
02:25:20.420 | and you sort of resist that with your way of being.
02:25:23.400 | Is this something you can just comment on,
02:25:27.100 | sort of working at a big company,
02:25:29.560 | how you can avoid the over-fearing that, I suppose,
02:25:34.560 | through caution creates harm?
02:25:41.360 | - Yeah, again, I think the answer to this
02:25:43.880 | is open source platforms and then enabling
02:25:47.760 | a widely diverse set of people to build AI assistants
02:25:52.760 | that represent the diversity of cultures, opinions,
02:25:57.320 | languages, and value systems across the world
02:26:00.000 | so that you're not bound to just be brainwashed
02:26:05.000 | by a particular way of thinking because of a single AI entity.
02:26:10.000 | So, I mean, I think it's a really, really important question
02:26:13.960 | for society and the problem I'm seeing is that,
02:26:17.440 | which is why I've been so vocal
02:26:21.880 | and sometimes a little sardonic about it.
02:26:25.160 | - Never stop, never stop, Yann.
02:26:27.720 | (laughing)
02:26:28.640 | We love it.
02:26:29.480 | - Is because I see the danger of this concentration of power
02:26:33.000 | through proprietary AI systems
02:26:36.400 | as a much bigger danger than everything else.
02:26:39.900 | If we really want diversity of opinion in a future
02:26:44.900 | where we'll all be interacting
02:26:51.080 | through AI systems, we need those systems to be diverse
02:26:54.280 | for the preservation of diversity of ideas
02:26:58.400 | and creeds and political opinions and whatever
02:27:03.400 | and the preservation of democracy.
02:27:07.840 | And what works against this is people who think that
02:27:12.840 | for reasons of security, we should keep AI systems
02:27:17.920 | under lock and key because it's too dangerous
02:27:20.280 | to put it in the hands of everybody.
02:27:24.200 | Because it could be used by terrorists or something.
02:27:26.720 | That would lead to potentially a very bad future
02:27:33.800 | in which all of our information diet is controlled
02:27:39.060 | by a small number of companies through proprietary systems.
02:27:43.140 | - Do you trust humans with this technology
02:27:47.640 | to build systems that are on the whole good for humanity?
02:27:53.280 | Isn't that what democracy and free speech is all about?
02:27:56.560 | - I think so.
02:27:57.480 | - Do you trust institutions to do the right thing?
02:28:00.400 | Do you trust people to do the right thing?
02:28:03.160 | And yeah, there's bad people who are gonna do bad things
02:28:05.400 | but they're not going to have superior technology
02:28:07.780 | to the good people.
02:28:08.620 | So then it's gonna be my good AI against your bad AI.
02:28:12.600 | I mean, it's the examples that we were just talking about
02:28:16.380 | of maybe some rogue country will build some AI system
02:28:22.320 | that's gonna try to convince everybody
02:28:23.960 | to go into a civil war or something
02:28:27.480 | or elect a favorable ruler.
02:28:31.880 | But then they will have to go past our AI systems.
02:28:36.600 | - An AI system with a strong Russian accent
02:28:38.760 | will be trying to convince our--
02:28:40.440 | - And doesn't put any articles in their sentences.
02:28:43.260 | - Well, it'll be at the very least absurdly comedic.
02:28:49.300 | - Okay, so since we talked about the physical reality,
02:28:54.300 | I'd love to ask your vision of the future with robots
02:28:59.160 | in this physical reality.
02:29:00.580 | So many of the kinds of intelligence
02:29:03.240 | you've been speaking about would empower robots
02:29:06.720 | to be more effective collaborators with us humans.
02:29:10.480 | So since Tesla's Optimus team has been showing us
02:29:15.180 | some progress on humanoid robots,
02:29:17.160 | I think it really reinvigorated the whole industry
02:29:20.560 | that I think Boston Dynamics has been leading
02:29:22.860 | for a very, very long time.
02:29:24.280 | So now there's all kinds of companies,
02:29:25.660 | Figure AI, obviously, Boston Dynamics.
02:29:29.120 | - Unitree.
02:29:30.080 | - Unitree, but there's like a lot of them.
02:29:33.500 | It's great. - There's a lot of them.
02:29:34.340 | - It's great, I mean, I love it.
02:29:36.340 | So do you think there'll be millions
02:29:41.540 | of humanoid robots walking around soon?
02:29:44.020 | - Not soon, but it's gonna happen.
02:29:46.260 | Like the next decade, I think,
02:29:47.380 | is gonna be really interesting in robots.
02:29:49.500 | Like the emergence of the robotics industry
02:29:53.660 | has been in the waiting for 10, 20 years
02:29:57.720 | without really emerging,
02:29:58.700 | other than for like kind of pre-programmed behavior
02:30:01.660 | and stuff like that.
02:30:02.660 | And the main issue is, again, Moravec's paradox,
02:30:08.700 | like how do we get the system to understand
02:30:10.420 | how the world works and kind of plan actions?
02:30:13.200 | And so we can do it for really specialized tasks.
02:30:16.620 | And the way Boston Dynamics goes about it is basically
02:30:21.620 | with a lot of handcrafted dynamical models
02:30:25.900 | and careful planning in advance,
02:30:29.300 | which is very classical robotics
02:30:30.780 | with a lot of innovation, a little bit of perception.
02:30:34.220 | But it's still not,
02:30:35.820 | like they can't build a domestic robot, right?
02:30:38.800 | And we're still some distance away
02:30:43.820 | from completely autonomous level five driving.
02:30:46.220 | And we're certainly very far away
02:30:49.540 | from having level five autonomous driving
02:30:53.660 | by a system that can train itself
02:30:55.820 | by driving 20 hours like any 17-year-old.
02:30:59.500 | So until we have, again, world models,
02:31:06.420 | systems that can train themselves
02:31:09.300 | to understand how the world works,
02:31:13.060 | we're not gonna have significant progress in robotics.
02:31:16.940 | So a lot of the people working on robotic hardware
02:31:20.560 | at the moment are betting or banking on the fact
02:31:24.300 | that AI is gonna make sufficient progress towards that.
02:31:28.060 | - And they're hoping to discover a product in it too.
02:31:31.060 | Before you have a really strong world model,
02:31:34.660 | there'll be an almost strong world model.
02:31:38.060 | And people are trying to find a product
02:31:41.440 | in a clumsy robot, I suppose.
02:31:43.720 | Like not a perfectly efficient robot.
02:31:45.720 | So there's the factory setting where humanoid robots
02:31:48.300 | can help automate some aspects of the factory.
02:31:51.260 | I think that's a crazy difficult task
02:31:53.340 | 'cause of all the safety required and all this kind of stuff.
02:31:56.000 | I think in the home is more interesting,
02:31:58.260 | but then you start to think,
02:32:00.420 | I think you mentioned loading the dishwasher, right?
02:32:03.200 | - Yeah.
02:32:04.580 | - I suppose that's one of the main problems
02:32:06.640 | you're working on.
02:32:07.620 | - I mean, there's cleaning up, cleaning the house,
02:32:12.620 | clearing up the table after a meal, washing the dishes,
02:32:18.720 | all those tasks, cooking.
02:32:21.600 | I mean, all the tasks that in principle could be automated,
02:32:24.040 | but are actually incredibly sophisticated,
02:32:26.720 | really complicated.
02:32:28.320 | - But even just basic navigation
02:32:29.720 | around a space full of uncertainty.
02:32:32.120 | - That sort of works.
02:32:33.160 | Like you can sort of do this now.
02:32:35.560 | Navigation is fine.
02:32:37.280 | - Well, navigation in a way that's compelling
02:32:40.100 | to us humans is a different thing.
02:32:42.900 | - Yeah, it's not gonna be necessarily.
02:32:45.380 | I mean, we have demos actually,
02:32:46.600 | 'cause there is a so-called embodied AI group at FAIR.
02:32:51.600 | And they've been not building their own robots,
02:32:55.180 | but using commercial robots.
02:32:57.200 | And you can tell a robot dog go to the fridge
02:33:02.360 | and they can actually open the fridge
02:33:03.660 | and they can probably pick up a can in the fridge
02:33:05.900 | and stuff like that and bring it to you.
02:33:09.380 | So it can navigate, it can grab objects
02:33:12.640 | as long as it's been trained to recognize them,
02:33:14.820 | which vision systems work pretty well nowadays.
02:33:17.200 | But it's not like a completely general robot
02:33:22.420 | that would be sophisticated enough to do things
02:33:26.180 | like clearing up the dinner table.
02:33:29.300 | - Yeah, to me, that's an exciting future
02:33:33.300 | of getting humanoid robots,
02:33:35.080 | robots in general, in the whole, more and more.
02:33:36.740 | Because that gets humans to really directly interact
02:33:40.340 | with AI systems in the physical space.
02:33:42.120 | And in so doing, it allows us to philosophically,
02:33:45.260 | psychologically explore our relationships with robots.
02:33:48.100 | It can be really, really, really interesting.
02:33:50.760 | So I hope you make progress on the whole JEPA thing soon.
02:33:54.340 | - Well, I mean, I hope things work as planned.
02:33:58.640 | I mean, again, we've been working on this idea
02:34:03.180 | of self-supervised learning from video for 10 years.
02:34:07.120 | And only made significant progress in the last two or three.
02:34:12.080 | - And actually, you've mentioned that there's a lot
02:34:14.240 | of interesting breakthroughs that can happen
02:34:15.760 | without having access to a lot of compute.
02:34:18.380 | So if you're interested in doing a PhD
02:34:20.480 | and this kind of stuff, there's a lot of possibilities still
02:34:24.140 | to do innovative work.
02:34:25.600 | So what advice would you give to an undergrad
02:34:28.040 | that's looking to go to grad school and do a PhD?
02:34:32.340 | - So basically, I've listed them already,
02:34:35.600 | this idea of how do you train a world model by observation.
02:34:38.660 | And you don't have to train necessarily
02:34:41.400 | on gigantic data sets or...
02:34:44.320 | I mean, it could turn out to be necessary
02:34:47.080 | to actually train on large data sets,
02:34:48.800 | to have emergent properties like we have with LLMs.
02:34:51.780 | But I think there is a lot of good ideas
02:34:53.080 | that can be done without necessarily scaling up.
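To make that first research direction a bit more concrete, here is a minimal, hypothetical sketch of training a world model from passive observation by predicting in representation space rather than in pixel space (a JEPA-flavored setup). It is an illustration under toy assumptions only, not Meta's I-JEPA or V-JEPA code; the `Encoder` and `Predictor` names, sizes, and the EMA trick are placeholders.

```python
# A minimal, hypothetical sketch of learning a world model from passive observation
# by predicting in representation space rather than pixel space (JEPA-flavored).
# Illustration only, not Meta's I-JEPA / V-JEPA code.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an observation (a flat vector standing in for a video frame) to a latent."""
    def __init__(self, obs_dim: int = 64, latent_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))

    def forward(self, x):
        return self.net(x)

class Predictor(nn.Module):
    """Predicts the latent of the next observation from the current latent."""
    def __init__(self, latent_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))

    def forward(self, z):
        return self.net(z)

encoder, predictor, target_encoder = Encoder(), Predictor(), Encoder()
target_encoder.load_state_dict(encoder.state_dict())   # slowly-updated copy, to limit collapse
opt = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

for step in range(100):
    # Pretend (obs_t, obs_next) are consecutive frames sampled from passively observed video.
    obs_t, obs_next = torch.randn(32, 64), torch.randn(32, 64)
    z_pred = predictor(encoder(obs_t))                  # predict in representation space...
    with torch.no_grad():
        z_target = target_encoder(obs_next)             # ...not in pixel space
    loss = ((z_pred - z_target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                               # EMA update of the target encoder
        for p, tp in zip(encoder.parameters(), target_encoder.parameters()):
            tp.mul_(0.99).add_(0.01 * p)
```

The point of the sketch is only the shape of the objective: predict the representation of the next observation, not its pixels, so the model can ignore unpredictable detail.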
02:34:56.760 | Then there is how do you do planning
02:34:58.480 | with a learned world model?
02:35:00.600 | If the world the system evolves in
02:35:02.660 | is not the physical world,
02:35:03.760 | but it's the world of, let's say, the internet
02:35:06.800 | or some sort of world where an action
02:35:11.540 | consists in doing a search in a search engine
02:35:14.060 | or interrogating a database or running a simulation
02:35:18.180 | or calling a calculator
02:35:19.820 | or solving a differential equation,
02:35:21.520 | how do you get a system to actually plan
02:35:24.500 | a sequence of actions to give the solution to a problem?
02:35:29.720 | And so the question of planning
02:35:32.200 | is not just a question of planning physical actions.
02:35:35.680 | It could be planning actions to use tools
02:35:38.960 | for a dialog system or for any kind of intelligent system.
02:35:42.320 | And there's some work on this,
02:35:45.480 | but not a huge amount.
02:35:47.080 | Some work at FAIR, one called Toolformer,
02:35:50.840 | which was a couple of years ago,
02:35:52.480 | and some more recent work on planning.
02:35:55.460 | But I don't think we have a good solution
02:35:59.700 | for any of that.
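To make "planning in a non-physical world" concrete, here is a toy, hypothetical sketch in which the actions are tool calls (search, calculator, database). The `TOOLS` dict plays the role of a trivial world model predicting the effect of each call, and `learned_cost` is a stand-in for a learned critic; this is illustrative only, not Toolformer's actual method.

```python
# Toy sketch: planning a sequence of tool calls with a (stand-in) learned world model.
# Names and the heuristic cost are hypothetical; this is not FAIR's Toolformer.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass(frozen=True)
class State:
    """Abstract task state: whatever facts the system has gathered so far."""
    facts: Tuple[str, ...]

# The "actions" are tool invocations; each entry predicts the effect of one call.
TOOLS: Dict[str, Callable[[State], State]] = {
    "search":     lambda s: State(s.facts + ("search_result",)),
    "calculator": lambda s: State(s.facts + ("computed_value",)),
    "database":   lambda s: State(s.facts + ("db_record",)),
}

def learned_cost(state: State) -> float:
    """Stand-in for a learned critic/energy: lower when the state looks closer to an answer."""
    needed = {"search_result", "computed_value"}        # toy notion of "enough to answer"
    return float(len(needed - set(state.facts)))

def plan_with_tools(start: State, horizon: int = 4) -> List[str]:
    """Greedy planning in the space of tool calls, guided by the cost of imagined outcomes."""
    state, plan = start, []
    for _ in range(horizon):
        if learned_cost(state) == 0:
            break
        # Imagine each tool call with the world model, keep the lowest-cost outcome.
        best_tool = min(TOOLS, key=lambda t: learned_cost(TOOLS[t](state)))
        state = TOOLS[best_tool](state)
        plan.append(best_tool)
    return plan

if __name__ == "__main__":
    print(plan_with_tools(State(facts=())))             # e.g. ['search', 'calculator']
```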
02:36:00.760 | Then there is the question of hierarchical planning.
02:36:03.580 | So the example I mentioned of planning a trip
02:36:07.980 | from New York to Paris, that's hierarchical,
02:36:11.360 | but almost every action that we take
02:36:13.780 | involves hierarchical planning in some sense.
02:36:17.460 | And we really have absolutely no idea how to do this.
02:36:20.640 | Like there's zero demonstration of hierarchical planning
02:36:25.640 | in AI where the various levels of representations
02:36:30.640 | that are necessary have been learned.
02:36:36.440 | We can do like two-level hierarchical planning
02:36:41.100 | when we design the two levels.
02:36:41.100 | So for example, you have like a dog-like robot, right?
02:36:44.840 | You want it to go from the living room to the kitchen.
02:36:48.300 | You can plan a path that avoids the obstacle.
02:36:51.260 | And then you can send this to a lower level planner
02:36:55.180 | that figures out how to move the legs
02:36:56.960 | to kind of follow that trajectory, right?
02:36:59.540 | So that works, but that two-level planning
02:37:01.600 | is designed by hand, right?
02:37:03.900 | We specify what the proper levels of abstraction,
02:37:09.820 | the representation at each level of abstraction have to be.
02:37:13.140 | How do you learn this?
02:37:14.100 | How do you learn that hierarchical representation
02:37:16.620 | of action plans, right?
02:37:19.800 | We, you know, with ConvNets and deep learning,
02:37:22.280 | we can train the system to learn hierarchical representations
02:37:25.320 | of percepts.
02:37:26.300 | What is the equivalent when what you're trying
02:37:29.200 | to represent are action plans?
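As a concrete, deliberately hand-designed version of the two-level example above, here is a minimal sketch assuming a grid-world robot dog: the high level plans a collision-free path of waypoints with breadth-first search, and a stub low-level controller turns each waypoint into leg commands. Both levels of abstraction are specified by hand, which is exactly the limitation being pointed out; the function names and the gait stub are hypothetical.

```python
# Hand-designed two-level hierarchical planning, as a toy sketch (not learned).
# High level: plan a collision-free path on a coarse grid (living room -> kitchen).
# Low level: a stub gait controller that turns each waypoint into leg commands.
from collections import deque
from typing import List, Tuple

Cell = Tuple[int, int]

def high_level_plan(grid: List[str], start: Cell, goal: Cell) -> List[Cell]:
    """Breadth-first search over grid cells; '#' marks obstacles."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:                                 # reconstruct the waypoint path
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#" \
                    and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return []                                            # no path found

def low_level_controller(a: Cell, b: Cell) -> List[str]:
    """Stub for the leg-motion planner: emits a fixed gait toward the next waypoint."""
    heading = {(1, 0): "south", (-1, 0): "north", (0, 1): "east", (0, -1): "west"}
    direction = heading[(b[0] - a[0], b[1] - a[1])]
    return [f"step {leg} leg toward {direction}"
            for leg in ("front-left", "rear-right", "front-right", "rear-left")]

if __name__ == "__main__":
    floor = ["....#....",
             "....#....",
             ".........",
             "....#...."]
    waypoints = high_level_plan(floor, start=(0, 0), goal=(0, 8))
    for a, b in zip(waypoints, waypoints[1:]):
        _ = low_level_controller(a, b)                   # would be sent to the gait controller
    print("waypoints:", waypoints)
```

The open question in the conversation is how to learn both the waypoint level and the gait level, plus the interface between them, rather than specifying them by hand as this sketch does.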
02:37:30.760 | - For action plans, yeah.
02:37:32.140 | So you want basically a robot dog or humanoid robot
02:37:35.520 | that turns on and travels from New York
02:37:38.240 | to Paris all by itself.
02:37:40.240 | - For example.
02:37:42.080 | - All right.
02:37:43.080 | It might have some trouble at the TSA, but yeah.
02:37:47.420 | - No, but even doing something fairly simple,
02:37:49.100 | like a household task, like, you know, cooking or something.
02:37:53.860 | - Yeah, there's a lot involved.
02:37:55.340 | It's a super complex task.
02:37:57.140 | We take, and once again, we take it for granted.
02:37:59.540 | What hope do you have for the future of humanity?
02:38:05.120 | We're talking about so many exciting technologies,
02:38:07.820 | so many exciting possibilities.
02:38:09.540 | What gives you hope when you look out
02:38:12.100 | over the next 10, 20, 50, 100 years?
02:38:15.100 | If you look at social media, there's a lot of,
02:38:17.140 | there's wars going on, there's division, there's hatred,
02:38:21.660 | all this kind of stuff.
02:38:22.860 | That's also part of humanity.
02:38:24.620 | But amidst all that, what gives you hope?
02:38:27.000 | - I love that question.
02:38:30.300 | We can make humanity smarter with AI.
02:38:37.900 | Okay.
02:38:40.340 | I mean, AI basically will amplify human intelligence.
02:38:45.300 | It's as if every one of us will have a staff
02:38:50.020 | of smart AI assistants.
02:38:52.180 | They might be smarter than us.
02:38:53.740 | They'll do our bidding,
02:38:55.860 | perhaps execute a task in ways that are much better
02:39:03.680 | than we could do ourselves,
02:39:05.320 | because they'd be smarter than us.
02:39:07.820 | And so it's like everyone would be the boss
02:39:10.620 | of a staff of super smart virtual people.
02:39:15.680 | So we shouldn't feel threatened by this
02:39:18.120 | any more than we should feel threatened
02:39:19.640 | by being the manager of a group of people,
02:39:22.920 | some of whom are more intelligent than us.
02:39:24.880 | I certainly have a lot of experience with this,
02:39:29.720 | of having people working with me who are smarter than me.
02:39:34.200 | That's actually a wonderful thing.
02:39:36.400 | So having machines that are smarter than us,
02:39:39.960 | that assist us in all of our tasks, our daily lives,
02:39:43.880 | whether it's professional or personal,
02:39:45.520 | I think would be an absolutely wonderful thing.
02:39:47.960 | Because intelligence is the commodity
02:39:52.280 | that is most in demand.
02:39:54.080 | That's really what, I mean,
02:39:55.520 | all the mistakes that humanity makes
02:39:56.960 | are because of a lack of intelligence, really,
02:39:58.960 | or lack of knowledge, which is related.
02:40:01.600 | So making people smarter can only be better.
02:40:07.080 | I mean, for the same reason that public education
02:40:09.640 | is a good thing.
02:40:12.280 | And books are a good thing.
02:40:14.800 | And the internet is also a good thing intrinsically.
02:40:17.360 | And even social networks are a good thing
02:40:19.520 | if you run them properly.
02:40:21.560 | It's difficult, but you can.
02:40:23.200 | Because it helps the communication of information
02:40:30.680 | and knowledge and the transmission of knowledge.
02:40:33.880 | So AI is gonna make humanity smarter.
02:40:36.440 | And the analogy I've been using is the fact
02:40:41.080 | that perhaps an equivalent event in the history of humanity
02:40:46.080 | to what might be provided by the generalization of AI assistants
02:40:52.320 | is the invention of the printing press.
02:40:55.240 | It made everybody smarter.
02:40:56.960 | The fact that people could have access to books.
02:41:01.960 | Books were a lot cheaper than they were before.
02:41:06.400 | And so a lot more people had an incentive to learn to read,
02:41:10.520 | which wasn't the case before.
02:41:11.920 | And people became smarter.
02:41:17.400 | It enabled the Enlightenment, right?
02:41:21.120 | There wouldn't have been an Enlightenment
02:41:22.200 | without the printing press.
02:41:24.360 | It enabled philosophy, rationalism,
02:41:29.360 | escape from religious doctrine, democracy, science,
02:41:35.840 | and certainly without this there wouldn't have been
02:41:40.840 | the American Revolution or the French Revolution.
02:41:43.400 | And so we'd still be under feudal regimes, perhaps.
02:41:47.760 | And so it completely transformed the world
02:41:53.840 | because people became smarter
02:41:55.360 | and kind of learned about things.
02:41:57.680 | Now, it also created 200 years of revolution.
02:42:03.680 | It created 200 years of essentially religious conflicts
02:42:07.520 | in Europe because the first thing that people read
02:42:10.880 | was the Bible and realized that perhaps
02:42:15.520 | there was a different interpretation of the Bible
02:42:17.280 | than what the priests were telling them.
02:42:20.000 | And so that created the Protestant movement
02:42:22.840 | and created the rift.
02:42:23.920 | And in fact, the Catholic Church didn't like the idea
02:42:27.600 | of the printing press, but they had no choice.
02:42:30.080 | And so it had some bad effects and some good effects.
02:42:32.880 | I don't think anyone today would say
02:42:34.320 | that the invention of the printing press
02:42:36.000 | had an overall negative effect,
02:42:38.320 | despite the fact that it created 200 years
02:42:41.240 | of religious conflicts in Europe.
02:42:44.480 | Now, compare this.
02:42:45.920 | And I thought I was very proud of myself
02:42:49.560 | to come up with this analogy,
02:42:51.720 | but realized someone else came up with the same idea before me.
02:42:55.640 | Compare this with what happened in the Ottoman Empire.
02:42:59.000 | The Ottoman Empire banned the printing press
02:43:02.800 | for 200 years.
02:43:04.000 | And it didn't ban it for all languages, only for Arabic.
02:43:11.840 | You could actually print books in Latin or Hebrew
02:43:16.000 | or whatever in the Ottoman Empire, just not in Arabic.
02:43:19.360 | And I thought it was because the rulers
02:43:25.760 | just wanted to preserve the control over the population
02:43:29.520 | and the dogma, religious dogma and everything.
02:43:33.040 | But after talking with the UAE Minister of AI,
02:43:37.280 | Omar Al Olama,
02:43:40.120 | he told me no, there was another reason.
02:43:44.520 | And the other reason was that it was to preserve
02:43:52.280 | the corporation of calligraphers.
02:43:52.280 | There's an art form, which is writing those beautiful
02:44:00.320 | Arabic poems or whatever religious text in this thing.
02:44:04.880 | And it was a very powerful corporation of scribes,
02:44:07.440 | basically, that kind of ran a big chunk of the empire,
02:44:12.240 | and we couldn't put them out of business.
02:44:14.160 | So they banned the printing press in part
02:44:16.440 | to protect that business.
02:44:18.560 | Now, what's the analogy for AI today?
02:44:23.320 | Who are we protecting by banning AI?
02:44:25.400 | Who are the people who are asking that AI be regulated
02:44:28.880 | to protect their jobs?
02:44:31.800 | And of course, it's a real question
02:44:35.240 | of what is going to be the effect
02:44:37.560 | of technological transformation like AI on the job market
02:44:42.560 | and the labor market.
02:44:45.280 | And there are economists who are much more expert
02:44:48.400 | at this than I am, but when I talk to them,
02:44:50.320 | they tell us we're not gonna run out of jobs.
02:44:54.680 | This is not gonna cause mass unemployment.
02:44:57.800 | This is just gonna be gradual shift
02:45:01.040 | of different professions.
02:45:02.320 | The professions that are gonna be hot 10 or 15 years from now,
02:45:05.920 | We have no idea today what they're gonna be.
02:45:09.400 | The same way if we go back 20 years in the past,
02:45:12.200 | like who could have thought 20 years ago
02:45:15.040 | that like the hottest job even like 10 years ago
02:45:19.040 | was mobile app developer, like smartphones weren't invented.
02:45:23.400 | - Most of the jobs of the future might be in the metaverse.
02:45:27.080 | - Well, it could be, yeah.
02:45:29.120 | - But the point is you can't possibly predict.
02:45:31.960 | But you're right, I mean, you made a lot of strong points
02:45:34.680 | and I believe that people are fundamentally good.
02:45:38.520 | And so if AI, especially open source AI
02:45:42.680 | can make them smarter,
02:45:45.840 | it just empowers the goodness in humans.
02:45:48.400 | - So I share that feeling, okay?
02:45:50.880 | I think people are fundamentally good.
02:45:52.800 | And in fact, a lot of doomers are doomers
02:45:56.680 | because they don't think that people are fundamentally good.
02:45:59.720 | And they either don't trust people
02:46:04.480 | or they don't trust the institution to do the right thing
02:46:07.920 | so that people behave properly.
02:46:09.480 | - Well, I think both you and I believe in humanity.
02:46:13.560 | And I think I speak for a lot of people
02:46:16.480 | in saying thank you for pushing the open source movement,
02:46:20.120 | pushing to making both research in AI open source,
02:46:24.320 | making it available to people and also the models themselves
02:46:27.760 | making it open source.
02:46:28.680 | So thank you for that.
02:46:30.360 | And thank you for speaking your mind
02:46:32.280 | in such colorful, beautiful ways on the internet.
02:46:34.320 | I hope you never stop.
02:46:35.720 | You're one of the most fun people I know
02:46:37.880 | and get to be a fan of.
02:46:39.040 | So yeah, thank you for speaking to me once again.
02:46:42.360 | And thank you for being you.
02:46:44.000 | - Thank you, Lex.
02:46:45.640 | - Thanks for listening to this conversation with Yann LeCun.
02:46:48.320 | To support this podcast,
02:46:49.640 | please check out our sponsors in the description.
02:46:52.240 | And now let me leave you with some words
02:46:54.200 | from Arthur C. Clarke.
02:46:55.680 | The only way to discover the limits of the possible
02:46:59.840 | is to go beyond them and to the impossible.
02:47:03.560 | Thank you for listening and hope to see you next time.
02:47:07.760 | (upbeat music)
02:47:10.360 | (upbeat music)