Yann LeCun: Deep Learning, ConvNets, and Self-Supervised Learning | Lex Fridman Podcast #36
Chapters
0:00 Intro
1:11 Space Odyssey
1:38 Value misalignment
3:05 Designing objective functions
4:12 Ethical systems
4:57 Holding secrets
5:42 Autonomous AI
9:15 Intuition
10:49 Learning and Reasoning
13:10 Working Memory
16:07 Energy minimization
18:20 Expert systems
21:13 Causal inference
24:23 Deep learning in the 90s
31:10 A war story
34:48 Toy problems
36:41 Interactive environments
40:48 How many boolean functions
45:08 Self-supervised learning
47:18 Ground language in reality
50:28 Handling uncertainty in the world
52:14 Selfsupervised learning
53:53 Model-based reinforcement learning
56:58 Transfer learning
The following is a conversation with Yann LeCun. 00:00:03.080 |
He's considered to be one of the fathers of deep learning, 00:00:09.040 |
which is the recent revolution in AI that's captivated the world 00:00:12.280 |
with the possibility of what machines can learn from data. 00:00:18.520 |
a vice president and chief AI scientist at Facebook, 00:00:26.280 |
He's probably best known as the founding father 00:00:40.080 |
unafraid to speak his mind in a distinctive French accent 00:00:45.720 |
both in the rigorous medium of academic research 00:00:57.960 |
give it five stars on iTunes, support it on Patreon, 00:01:00.960 |
or simply connect with me on Twitter at Lex Fridman, 00:01:06.840 |
And now, here's my conversation with Yann LeCun. 00:01:15.400 |
HAL 9000 decides to get rid of the astronauts 00:01:20.360 |
for people that haven't seen the movie, spoiler alert, 00:01:23.040 |
because he, it, she believes that the astronauts, 00:01:31.600 |
Do you see HAL as flawed in some fundamental way 00:01:52.120 |
and the machine strives to achieve this objective. 00:01:55.680 |
And if you don't put any constraints on this objective, 00:01:58.120 |
like don't kill people and don't do things like this, 00:02:00.760 |
the machine, given the power, will do stupid things 00:02:07.960 |
or damaging things to achieve this objective. 00:02:10.160 |
It's a little bit like, I mean, we're used to this 00:02:31.520 |
and education, obviously, to sort of correct for those. 00:02:35.160 |
- So maybe just pushing a little further on that point, 00:02:44.360 |
There's a, there's fuzziness around the ambiguity 00:02:49.800 |
but, you know, do you think that there will be a time 00:02:54.800 |
from a utilitarian perspective where an AI system, 00:02:58.160 |
where it is not misalignment, where it is alignment 00:03:02.800 |
that an AI system will make decisions that are difficult? 00:03:06.800 |
I mean, eventually we'll have to figure out how to do this. 00:03:12.600 |
because we've been doing this with humans for millennia. 00:03:20.880 |
And we don't do it by, you know, programming things, 00:03:30.680 |
And it's actually the design of an objective function. 00:03:40.720 |
So there is this idea somehow that it's a new thing 00:03:44.560 |
for people to try to design objective functions 00:03:47.920 |
But no, we've been writing laws for millennia 00:03:52.080 |
So that's where, you know, the science of lawmaking 00:04:02.840 |
- So it's nothing, there's nothing special about HAL 00:04:09.440 |
to make some of these difficult ethical judgments 00:04:13.000 |
- Yeah, and we have systems like this already 00:04:15.080 |
that make many decisions for ourselves in society 00:04:22.600 |
like rules about things that sometimes have bad side effects. 00:04:27.480 |
And we have to be flexible enough about those rules 00:04:39.640 |
- Wow, is that by accident or is there a lot- 00:04:52.520 |
so an improvement of HAL 9000, what would you improve? 00:04:57.920 |
I wouldn't ask you to hold secrets and tell lies 00:05:01.960 |
because that's really what breaks it in the end. 00:05:03.800 |
That's the fact that it's asking itself questions 00:05:11.720 |
all the secrecy of the preparation of the mission 00:05:13.960 |
and the fact that it was a discovery on the lunar surface 00:05:28.560 |
- So you think there's never should be a set of things 00:05:35.480 |
like a set of facts that should not be shared 00:05:42.320 |
- Well, I think, no, I think it should be a bit like 00:06:05.920 |
and we can sort of hardwire this into our machines 00:06:10.920 |
So I'm not, you know, an advocate of the three laws 00:06:14.640 |
of robotics, you know, the Asimov kind of thing, 00:06:26.920 |
these are not questions that are kind of really worth 00:06:31.120 |
asking today because we just don't have the technology 00:06:34.400 |
We don't have autonomous intelligent machines. 00:06:37.480 |
semi-intelligent machines that are very specialized. 00:06:40.960 |
But they don't really sort of satisfy an objective. 00:06:43.320 |
They're just, you know, kind of trained to do one thing. 00:06:49.960 |
of a full-fledged autonomous intelligent system, 00:06:53.320 |
asking the question of how we design this objective, 00:07:01.560 |
in that it helps us understand our own ethical codes, 00:07:10.240 |
if you imagine that an AGI system is here today, 00:07:14.280 |
how would we program it as a kind of nice thought experiment 00:07:39.200 |
but certainly they shouldn't be framed as HAL. 00:07:49.400 |
but what is the most beautiful or surprising idea 00:08:11.040 |
The fact that you can build gigantic neural nets, 00:08:16.440 |
train them on relatively small amounts of data, 00:08:26.920 |
breaks everything you read in every textbook, right? 00:08:29.240 |
Every pre-deep learning textbook that told you 00:09:00.360 |
before I knew anything that this is a good idea. 00:09:10.080 |
- So, okay, so can you talk through the intuition 00:09:12.280 |
of why it was obvious to you if you remember? 00:09:16.120 |
it's sort of like those people in the late 19th century 00:09:19.960 |
who proved that heavier than air flight was impossible, right? 00:09:30.400 |
it's obviously wrong as an empirical question, right? 00:09:39.920 |
And we know it's a large network of neurons in interaction 00:09:43.160 |
and that learning takes place by changing the connections. 00:09:49.320 |
but sort of trying to derive basic principles, 00:09:59.680 |
that I've been convinced of since I was an undergrad 00:10:04.680 |
that intelligence is inseparable from learning. 00:10:10.040 |
an intelligent machine by basically programming, 00:10:20.320 |
arrives at this intelligence through learning. 00:10:25.800 |
machine learning was a completely obvious path. 00:10:35.200 |
and learning is the automation of intelligence. 00:10:44.560 |
Because do you think of reasoning as learning? 00:11:03.440 |
- Do you think neural networks can be made to reason? 00:11:18.280 |
will emerge from it, you know, from learning? 00:11:23.160 |
all of our kind of model of what reasoning is 00:11:28.880 |
and are therefore incompatible with gradient-based learning. 00:11:39.920 |
that don't use kind of gradient information, if you want. 00:11:47.520 |
it's just that it's incompatible with learning 00:11:57.560 |
with suspicion by a lot of computer scientists 00:12:07.720 |
the kind of math you do in electrical engineering 00:12:10.200 |
than the kind of math you do in computer science. 00:12:12.760 |
And, you know, nothing in machine learning is exact, right? 00:12:16.200 |
Computer science is all about sort of, you know, 00:12:21.920 |
of like, you know, every index has to be right 00:12:24.200 |
and you can prove that an algorithm is correct, right? 00:12:27.200 |
Machine learning is the science of sloppiness, really. 00:12:33.560 |
So, okay, maybe let's feel around in the dark 00:12:41.840 |
or a system that works with continuous functions 00:12:54.320 |
build on previous knowledge, build on extra knowledge, 00:12:59.520 |
generalize outside of any training set ever built. 00:13:04.560 |
If, yeah, maybe do you have inklings of thoughts 00:13:14.200 |
I think, you know, we'd be building it right now. 00:13:22.240 |
So what you need to have is a working memory. 00:13:25.320 |
So you need to have some device, if you want, 00:13:29.960 |
some subsystem that can store a relatively large number 00:13:43.920 |
there are kind of three main types of memory. 00:13:45.760 |
One is the sort of memory of the state of your cortex, 00:13:51.560 |
and that sort of disappears within 20 seconds. 00:13:53.800 |
You can't remember things for more than about 20 seconds 00:13:56.200 |
or a minute if you don't have any other form of memory. 00:14:00.400 |
The second type of memory, which is longer term, 00:14:04.320 |
So you can, you know, you came into this building, 00:14:06.520 |
you remember where the exit is, where the elevators are. 00:14:15.360 |
You might remember something about what I said, 00:14:19.360 |
you might remember something about what I said, 00:14:28.400 |
And then the longer term memory is in the synapse, 00:14:36.440 |
is that you want the hippocampus-like thing, right? 00:14:44.560 |
Neural Turing Machines and stuff like that, right? 00:14:46.680 |
Transformers, which have sort of a memory in there, 00:14:57.120 |
Another thing you need is some sort of network 00:15:03.200 |
get an information back, and then kind of crunch on it, 00:15:14.280 |
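A minimal sketch of the kind of differentiable memory read that memory networks and transformers use, assuming simple dot-product attention over a small key/value store (an illustration, not any of the specific systems mentioned):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d = 8                                # embedding size
keys = rng.normal(size=(10, d))      # 10 memory slots: their "addresses"
values = rng.normal(size=(10, d))    # 10 memory slots: the stored content
query = rng.normal(size=d)           # what the network wants to look up

weights = softmax(keys @ query)      # soft addressing: how well each slot matches
read = weights @ values              # differentiable read: weighted sum of contents

print(weights.round(3))              # attention distribution over the memory slots
print(read.shape)                    # (8,): a vector the rest of the network can use
```

Because the read is a soft, weighted sum rather than a hard lookup, gradients flow through it and the whole thing can be trained end to end.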
is a process by which you update your knowledge 00:15:31.080 |
so that seems to be too small to contain the knowledge 00:15:39.200 |
- Well, a transformer doesn't have this idea of recurrence. 00:15:47.120 |
- But recurrence would build on the knowledge somehow. 00:15:54.680 |
and expand the amount of information, perhaps, 00:16:00.320 |
But is this something that just can emerge with size? 00:16:04.760 |
Because it seems like everything we have now is too small. 00:16:13.000 |
way, I mean, sort of the original memory network 00:16:15.200 |
maybe had something like the right architecture, 00:16:20.520 |
so that the memory contains all of Wikipedia, 00:16:25.120 |
- So there's a need for new ideas there, okay. 00:16:31.360 |
which is very classical, also, in some types of AI, 00:16:36.360 |
and it's based on, let's call it energy minimization. 00:17:22.120 |
You have a model of what's going to happen in the world 00:17:25.560 |
and that allows you to, by energy minimization, 00:17:29.840 |
that optimizes a particular objective function, 00:17:43.520 |
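A minimal sketch of inference by energy minimization, assuming a toy quadratic energy and plain gradient descent (the energy, shapes, and step size are all made up for illustration):

```python
import numpy as np

# Toy energy E(x, y) = ||y - W x||^2: x is the observed input, y is the variable
# we infer. Inference here means finding the y that minimizes the energy.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))   # fixed parameters of the (toy) world model
x = rng.normal(size=5)        # observation

def energy(y):
    return float(np.sum((y - W @ x) ** 2))

def energy_grad(y):
    return 2.0 * (y - W @ x)  # analytic gradient of the quadratic energy

y = np.zeros(3)               # arbitrary initial guess
for _ in range(200):          # gradient descent as the minimization procedure
    y -= 0.1 * energy_grad(y)

print("inferred y:", np.round(y, 4))
print("final energy:", round(energy(y), 8))  # close to 0: y has converged to W @ x
```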
And perhaps what led to the ability of humans to reason 00:18:16.720 |
is not a useful way to think about knowledge? 00:18:20.280 |
- Graphs are a little brittle, or logic representation. 00:18:23.960 |
So basically, you know, variables that have values 00:18:31.280 |
is a little too rigid and too brittle, right? 00:18:32.840 |
So one of the, you know, some of the early efforts 00:18:35.680 |
in that respect were to put probabilities on them. 00:18:40.680 |
So a rule, you know, if you have this and that symptom, 00:18:44.280 |
you know, you have this disease with that probability, 00:18:58.320 |
you know, Bayesian networks and graphical models 00:19:00.320 |
and causal inference and variational, you know, method. 00:19:04.960 |
So there is, I mean, certainly a lot of interesting work 00:19:11.440 |
The main issue with this is knowledge acquisition. 00:19:13.880 |
How do you reduce a bunch of data to a graph of this type? 00:19:21.840 |
on the human being to encode, to add knowledge. 00:19:31.440 |
do you want to represent knowledge as symbols, 00:19:34.640 |
and do you want to manipulate them with logic? 00:19:37.240 |
And again, that's incompatible with learning. 00:19:42.680 |
Geoff Hinton has been advocating for many decades 00:20:14.360 |
"From Machine Learning to Machine Reasoning." 00:20:23.160 |
and then put the result back in the same space. 00:20:24.920 |
So it's this idea of working memory, basically. 00:20:36.900 |
I mean, you can learn basic logic operations there. 00:20:43.400 |
There's a big debate on sort of how much prior structure 00:20:46.680 |
you have to put in for this kind of stuff to emerge. 00:20:57.520 |
from the, you mentioned causal inference world. 00:21:00.240 |
So his worry is that the current neural networks 00:21:06.960 |
what causes what causal inference between things. 00:21:12.760 |
- So I think he's right and wrong about this. 00:21:21.320 |
people sort of didn't worry too much about this. 00:21:23.800 |
But there's a lot of people now working on causal inference. 00:21:26.200 |
And there's a paper that just came out last week 00:21:32.000 |
exactly on that problem of how do you kind of, 00:21:48.040 |
- I'd like to read that paper because that ultimately, 00:21:51.200 |
the challenge there is also seems to fall back 00:21:56.920 |
to ultimately decide causality between things. 00:22:01.920 |
- People are not very good at establishing causality, 00:22:06.560 |
and physicists actually don't believe in causality 00:22:08.560 |
because look at all the basic laws of macrophysics 00:22:17.400 |
- It's as soon as you start looking at macroscopic systems, 00:22:28.360 |
- Is it emergent or is it part of the fundamental fabric 00:22:34.320 |
- Or is it a bias of intelligent systems that, 00:22:48.480 |
the math doesn't care about the flow of time. 00:22:54.120 |
People themselves are not very good at establishing 00:22:58.960 |
If you ask, I think it was in one of Seymour Papert's books 00:23:08.880 |
he's the guy who co-authored the book "Perceptrons" 00:23:17.240 |
He, in the sense of studying learning in humans 00:23:21.080 |
and machines, that's why he got interested in perceptron. 00:23:32.680 |
a lot of kids will say, they will think for a while 00:23:35.840 |
and they'll say, "Oh, it's the branches in the trees. 00:23:40.120 |
So they get the causal relationship backwards. 00:23:42.600 |
And it's because their understanding of the world 00:23:46.280 |
I mean, these are like four or five year old kids. 00:23:48.800 |
It gets better and then you understand that this, 00:23:57.440 |
because of our common sense understanding of things, 00:24:04.960 |
there's a lot of stuff that we can figure out 00:24:24.520 |
but all of humanity has been completely deluded 00:24:34.600 |
you attribute it to some deity, some divinity. 00:24:40.120 |
That's a way of saying, "I don't know the cause, 00:24:43.080 |
- So you mentioned Marvin Minsky and the irony of 00:24:54.600 |
You were there in the '90s, you were there in the '80s, 00:24:58.120 |
In the '90s, why do you think people lost faith 00:25:00.640 |
in deep learning in the '90s and found it again 00:25:13.840 |
I mean, I think I would put that around 1995, 00:25:23.760 |
from mainstream machine learning, if you want. 00:25:26.280 |
There were, it was basically electrical engineering 00:25:39.600 |
I was too close to it to really sort of analyze it 00:25:50.800 |
So the first one is, at the time, neural nets were, 00:25:57.880 |
in the sense that you would implement backprop 00:26:18.680 |
You would probably make some very basic mistakes, 00:26:21.320 |
like, you know, badly initialize your weights, 00:26:27.640 |
And of course, you know, and you would train on XOR 00:26:29.280 |
because you didn't have any other dataset to train on. 00:26:32.000 |
And of course, you know, it works half the time. 00:26:36.280 |
Also, you would train it with batch gradient, 00:26:40.240 |
So there was a lot of, there was a bag of tricks 00:26:42.680 |
that you had to know to make those things work, 00:26:54.720 |
to be able to kind of, you know, display things, 00:26:59.360 |
kind of get a good intuition for how to get them to work, 00:27:02.120 |
have enough flexibility so you can create, you know, 00:27:04.640 |
network architectures like convolutional nets 00:27:09.160 |
I mean, you had to write everything from scratch. 00:27:22.760 |
which, by the way, is one of my favorite languages. 00:27:34.960 |
it's that we had to write our Lisp interpreter. 00:27:37.560 |
Okay, 'cause it's not like we used one that existed. 00:27:50.920 |
we invented this idea of basically having modules 00:27:57.680 |
and then interconnecting those modules in a graph. 00:28:04.800 |
and we were able to implement this using our Lisp system. 00:28:14.400 |
So we actually wrote a compiler for that Lisp interpreter 00:28:16.880 |
so that Patrice Simard, who is now at Microsoft, 00:28:26.640 |
and then we'll have a self-contained compute system 00:28:32.280 |
Neither PyTorch nor TensorFlow can do this today. 00:28:40.280 |
- I mean, there's something like that in PyTorch 00:28:48.120 |
we had to invest a huge amount of effort to do this. 00:28:52.440 |
if you don't completely believe in the concept, 00:28:55.160 |
you're not going to invest the time to do this. 00:28:59.320 |
or today, this would turn into Torch or PyTorch 00:29:05.000 |
everybody would use it and realize it's good. 00:29:13.800 |
release anything in open source of this nature. 00:29:17.760 |
And so we could not distribute our code, really. 00:29:24.920 |
I also read that there was some almost patent, 00:29:27.760 |
like a patent on convolutional neural networks. 00:30:07.600 |
And there was a period where the US patent office 00:30:24.040 |
I mean, I never actually strongly believed in this, 00:30:28.880 |
Facebook basically doesn't believe in this kind of patent. 00:30:33.200 |
Google files patents because they've been burned by Apple. 00:30:38.200 |
And so now they do this for defensive purpose, 00:30:59.480 |
They are there because of the legal landscape 00:31:11.760 |
So what happens was the first patent about convolutional net 00:31:15.400 |
was about kind of the early version of convolutional net 00:31:22.800 |
with stride more than one, if you want, right? 00:31:25.200 |
And then there was a second one on convolutional nets 00:31:28.400 |
with separate pooling layers, training with backprop. 00:31:36.200 |
At the time, the life of a patent was 17 years. 00:31:39.320 |
So here's what happened over the next few years 00:31:44.640 |
character recognition technology around convolutional nets. 00:31:56.160 |
In 1995, it was for large check reading machines 00:32:00.640 |
And those systems were developed by an engineering group 00:32:16.600 |
And the lawyers just looked at all the patents 00:32:20.400 |
and they distributed the patents among the various companies. 00:32:22.960 |
They gave the convolutional net patent to NCR 00:32:26.400 |
because they were actually selling products that used it. 00:32:29.200 |
But nobody at NCR had any idea what a convolutional net was. 00:32:39.880 |
where I didn't actually work on machine learning 00:32:44.880 |
And between 2002 and 2007, I was working on them, 00:32:48.840 |
crossing my finger that nobody at NCR would notice. 00:32:52.040 |
- Yeah, and I hope that this kind of somewhat, 00:32:58.320 |
relative openness of the community now will continue. 00:33:02.920 |
- It accelerates the entire progress of the industry. 00:33:13.000 |
is not whether Facebook or Google or Microsoft or IBM 00:33:21.080 |
We want to build intelligent virtual assistants 00:33:24.960 |
We don't have monopoly on good ideas for this. 00:33:33.840 |
to human level intelligence and common sense, 00:33:52.600 |
This calls to the gap between the space of ideas 00:34:00.440 |
of practical application that you often speak to. 00:34:07.880 |
"to have a solution to artificial general intelligence, 00:34:14.220 |
"or who claim to have figured out how the brain works. 00:34:29.120 |
But I think your opinion is still MNIST and ImageNet, 00:34:43.360 |
and the practical testing, the practical application 00:34:57.280 |
as some sort of standard kind of benchmark, if you want. 00:35:04.280 |
people, Jason Weston, Antoine Bordes and a few others 00:35:14.320 |
to access working memory and things like this. 00:35:16.920 |
And it was very useful, even though it wasn't a real task. 00:35:26.040 |
It's just that I was really struck by the fact that 00:35:29.560 |
a lot of people, particularly a lot of people 00:35:31.120 |
with money to invest would be fooled by people telling them, 00:35:40.200 |
So there's a lot of people who try to take advantage 00:35:58.620 |
or it may be very difficult to establish a benchmark. 00:36:02.560 |
Establishing benchmarks is part of the process. 00:36:14.920 |
to just every kind of information you can pull off 00:36:24.940 |
what kind of benchmarks do you see that start creeping 00:36:33.600 |
like reasoning, like maybe you don't like the term, 00:36:41.520 |
- A lot of people are working on interactive environments 00:36:44.160 |
in which you can train and test intelligent systems. 00:36:50.200 |
the classical paradigm of supervised learning 00:37:10.100 |
the order in which you see them shouldn't matter, 00:37:17.560 |
which is the case, for example, in robotics, right? 00:37:32.960 |
so that creates also a dependency between samples, right? 00:37:40.960 |
is gonna be probably in the same building, most likely. 00:37:47.920 |
of this training set, test set hypothesis break 00:37:56.400 |
So people are setting up artificial environments 00:38:05.840 |
and can interact with objects and things like this. 00:38:18.800 |
and you have games, you know, things like that. 00:38:29.840 |
because it implies that human intelligence is general 00:38:35.760 |
and human intelligence is nothing like general. 00:38:40.860 |
We think it's general, we like to think of ourselves 00:38:54.220 |
is ability to learn, as we were talking about learning, 00:39:18.520 |
coming out of one of your eyes, okay, two million total, 00:39:23.440 |
It's one million nerve fibers, your optical nerve. 00:39:30.640 |
So the input to your visual cortex is one million bits. 00:39:36.880 |
Now, they're connected to your brain in a particular way, 00:39:41.940 |
that are kind of a little bit like a convolutional net, 00:39:44.160 |
they're kind of local in space and things like this. 00:39:55.720 |
and I put a device that makes a random permutation, 00:40:04.600 |
is a fixed but random permutation of all the pixels. 00:40:07.820 |
There's no way in hell that your visual cortex, 00:40:22.680 |
- No, because now two pixels that are nearby in the world 00:40:25.640 |
will end up in very different places in your visual cortex, 00:40:29.240 |
and your neurons there have no connections with each other 00:40:35.040 |
the hardware is built in many ways to support-- 00:40:42.600 |
- Yeah, but it's still pretty damn impressive. 00:40:54.040 |
so let's imagine you want to train your visual system 00:40:58.280 |
to recognize particular patterns of those one million bits. 00:41:33.020 |
How many of those functions can actually be computed 00:41:37.240 |
And the answer is a tiny, tiny, tiny, tiny, tiny, tiny sliver 00:41:51.480 |
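A back-of-the-envelope version of that counting argument, for n = one million binary inputs:

```latex
% n = 10^6 binary inputs (one per optic-nerve fiber)
\text{possible inputs: } 2^{n} = 2^{10^{6}}, \qquad
\text{Boolean functions } f:\{0,1\}^{n}\to\{0,1\}: \; 2^{2^{n}} = 2^{2^{10^{6}}}
```

Any network with a fixed, finite number of connections can realize only a vanishing fraction of those 2^(2^n) functions, which is the sliver in question.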
Okay, that's an argument against the word general. 00:41:55.540 |
I agree with your intuition, but I'm not sure it's, 00:42:13.380 |
that are outside of our comprehension, right? 00:42:54.140 |
When you reduce the volume, the temperature goes up, 00:42:57.380 |
the pressure goes up, things like that, right? 00:43:02.180 |
Those are the things you can know about that system. 00:43:16.720 |
And what you don't know about it is the entropy, 00:43:23.980 |
The energy contained in that thing is what we call heat. 00:43:45.420 |
And you're right, that's a nice way to put it. 00:43:47.340 |
We're general to all the things we can imagine, 00:43:50.220 |
which is a very tiny subset of all things that are possible. 00:43:56.300 |
or the Kolmogorov-Chaitin-Solomonoff complexity. 00:44:05.580 |
except for all the ones that you can actually write down. 00:44:13.500 |
So we can just call it artificial intelligence. 00:44:31.580 |
and it's difficult to define what human intelligence is. 00:44:43.900 |
Damn impressive demonstration of intelligence, whatever. 00:44:46.700 |
And so on that topic, most successes in deep learning 00:44:57.860 |
Is there a hope to reduce involvement of human input 00:45:16.620 |
The only thing I'm interested in at the moment is, 00:45:19.100 |
I call it self-supervised learning, not unsupervised, 00:45:21.260 |
'cause unsupervised learning is a loaded term. 00:45:24.020 |
People who know something about machine learning 00:45:31.580 |
And the wide public, when you say unsupervised learning, 00:45:33.660 |
oh my God, machines are gonna learn by themselves 00:45:40.820 |
- Yeah, so I call it self-supervised learning 00:45:42.940 |
because in fact, the underlying algorithms that are used 00:45:52.340 |
is not predict a particular set of variables, 00:46:06.420 |
But what you train the machine to do is basically 00:46:18.820 |
and ask it to predict what's gonna happen next. 00:46:20.980 |
And of course, after a while, you can show what happens 00:46:27.540 |
You can do, like all the latest, most successful models 00:46:34.820 |
You know, sort of BERT-style systems, for example, right? 00:46:38.700 |
You show it a window of a dozen words on a text corpus, 00:46:59.540 |
- So you construct, so in an unsupervised way, 00:47:05.100 |
- Or video, or the physical world, or whatever, right? 00:47:24.780 |
to have kind of true human level intelligence, 00:47:26.860 |
I think you need to ground language in reality. 00:47:29.260 |
So some people are attempting to do this, right? 00:47:32.820 |
Having systems that kind of have some visual representation 00:47:41.100 |
But it's like a huge technical problem that is not solved, 00:47:45.100 |
and that explains why self-supervised learning works 00:47:52.780 |
in the context of image recognition and video, 00:48:00.700 |
it's much easier to represent uncertainty in the prediction 00:48:06.940 |
than it is in the context of things like video and images. 00:48:13.940 |
you know, 15% of the words that are taken out. 00:48:20.060 |
There is a hundred thousand words in the lexicon, 00:48:33.140 |
So there, representing uncertainty in the prediction 00:48:58.780 |
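A minimal sketch of that BERT-style recipe — mask roughly 15% of the tokens and train the network to output a full distribution over the vocabulary at each blank, which is how the uncertainty is represented — using a toy vocabulary and random logits in place of a real model (everything here is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "[MASK]"]
word_to_id = {w: i for i, w in enumerate(vocab)}

sentence = ["the", "cat", "sat", "on", "the", "mat"]
tokens = [word_to_id[w] for w in sentence]

# Corrupt the input: blank out roughly 15% of the positions.
n_masked = max(1, len(tokens) * 15 // 100)
mask_positions = rng.choice(len(tokens), size=n_masked, replace=False)
corrupted = list(tokens)
for p in mask_positions:
    corrupted[p] = word_to_id["[MASK]"]

# A real model would map `corrupted` to logits; random logits stand in here.
logits = rng.normal(size=(len(tokens), len(vocab)))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Training objective: cross-entropy on the masked positions only. The output is
# a full distribution over the vocabulary, so uncertainty about the missing word
# is captured instead of forcing a single "correct" guess.
loss = -np.mean([np.log(probs[p, tokens[p]]) for p in mask_positions])
print("masked positions:", sorted(int(p) for p in mask_positions))
print("cross-entropy on masks:", round(float(loss), 3))
```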
You can't train a system to make one prediction. 00:49:22.820 |
but I might turn my head to the left or to the right. 00:49:27.020 |
If you don't have a system that can predict this, 00:49:31.820 |
to kind of minimize the error with a prediction 00:49:37.020 |
in all possible future positions that I might be in, 00:49:59.340 |
there might be artificial ways of like self-play in games 00:50:03.260 |
to where you can simulate part of the environment. 00:50:16.940 |
And because you can do huge amounts of data generation, 00:50:21.580 |
Well, it creeps up on the problem from the side of data, 00:50:26.020 |
and you don't think that's the right way to creep up. 00:50:30.980 |
So if you have a machine learn a predictive model 00:50:42.540 |
Just give a few frames of the game to a ConvNet, 00:50:47.020 |
and then have it generate the next few frames. 00:50:49.660 |
And if the game is deterministic, it works fine. 00:50:52.380 |
And that includes feeding the system with the action 00:51:03.060 |
The problem comes from the fact that the real world 00:51:09.700 |
And so there you get those blurry predictions, 00:51:11.340 |
and you can't do planning with blurry predictions. 00:51:14.300 |
Right, so if you have a perfect model of the world, 00:51:26.740 |
But if your model is imperfect, how can you plan? 00:51:33.900 |
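A tiny numerical illustration of that failure mode, with made-up "frames": when two futures are equally likely, the single prediction that minimizes squared error is their average — the blur — which is not a future that can actually occur, and not something you can plan against.

```python
import numpy as np

# Two equally likely futures for a tiny 1-D "frame": an edge on the left or right.
futures = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])

def expected_mse(prediction):
    return float(np.mean((futures - prediction) ** 2))

candidates = {
    "commit to future A": futures[0],
    "commit to future B": futures[1],
    "average of futures (blurry)": futures.mean(axis=0),
}
for name, pred in candidates.items():
    print(f"{name}: {pred} -> expected MSE {expected_mse(pred):.3f}")
# The blurry average scores best under squared error, yet it is not a possible
# future -- which is why planning on top of it breaks down.
```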
What are your thoughts on the extension of this, 00:51:39.660 |
it's connected to something you were talking about 00:51:44.580 |
So as opposed to sort of completely unsupervised 00:51:58.100 |
So if you think about a robot exploring a space 00:52:05.260 |
every once in a while asking for human input, 00:52:14.180 |
It's going to make things that we can already do 00:52:16.780 |
more efficient, or they will learn slightly more efficiently, 00:52:29.340 |
there's no conflict between self-supervised learning, 00:52:34.340 |
reinforcement learning, and supervised learning, 00:52:38.020 |
I see self-supervised learning as a preliminary 00:53:04.580 |
take about 80 hours of training to reach the level 00:53:07.780 |
that any human can reach in about 15 minutes. 00:53:10.020 |
They get better than humans, but it takes them a long time. 00:53:21.020 |
Oriol Vinyals and his team's system to play StarCraft, 00:53:27.020 |
plays, you know, a single map, a single type of player, 00:53:38.780 |
with about the equivalent of 200 years of training 00:53:46.380 |
It's not something that no human can, could ever do. 00:53:50.060 |
- I mean, I'm not sure what lesson to take away from that. 00:54:00.180 |
It would probably have to drive millions of hours. 00:54:03.940 |
It will have to kill thousands of pedestrians. 00:54:11.620 |
before it figures out that it's a bad idea, first of all. 00:54:15.140 |
And second of all, before it figures out how not to do it. 00:54:18.460 |
And so, I mean, this type of learning, obviously, 00:54:28.700 |
which I've been advocating for like five years now, 00:54:31.380 |
is that we have predictive models of the world 00:54:34.660 |
that include the ability to predict under uncertainty. 00:54:54.300 |
we know that if we turn the wheel to the right, 00:54:59.860 |
Because we have a pretty good model of intuitive physics 00:55:05.300 |
Babies learn this around the age of eight or nine months, 00:55:14.180 |
of the effect of turning the wheel of the car. 00:55:18.060 |
So there's a lot of things that we bring to the table, 00:55:20.620 |
which is basically our predictive model of the world. 00:55:23.500 |
And that model allows us to not do stupid things 00:55:35.260 |
but that allows us to learn really, really, really quickly. 00:55:38.780 |
So that's called model-based reinforcement learning. 00:55:41.340 |
There's some imitation and supervised learning 00:55:58.100 |
- And the physics is somewhat transferable from, 00:56:09.060 |
you don't need to be from a particularly intelligent species 00:56:12.620 |
to know that if you spill water from a container, 00:56:31.260 |
That's what self-supervised learning is all about. 00:56:34.060 |
- If you were to try to construct a benchmark for, 00:56:42.300 |
Do you think it's useful, interesting/possible 00:57:04.300 |
is train on some gigantic dataset of labeled digit, 00:57:10.540 |
We do this at Facebook, like in production, right? 00:57:15.860 |
to predict hashtags that people type on Instagram. 00:57:18.180 |
And we train on billions of images, literally billions. 00:57:38.180 |
What kind of transfer learning would be useful and impressive? 00:57:48.020 |
you know, have a kind of scenario for benchmark 00:57:54.500 |
and you can, and it's very large number of unlabeled data. 00:58:02.060 |
It could be where you do, you know, frame prediction. 00:58:04.820 |
It could be images where you could choose to, 00:58:18.780 |
and then you train on a particular supervised task, 00:58:32.100 |
as you increase the number of labeled training samples. 00:58:44.860 |
than if you train from scratch, from random weights. 00:58:47.460 |
So that to reach the same level of performance 00:58:50.500 |
in a completely supervised, purely supervised system 00:58:54.140 |
would reach, you would need way fewer samples. 00:58:57.700 |
because it will answer the question to like, you know, 00:59:02.980 |
Okay, you know, if I want to get to a particular 00:59:12.140 |
Can I do, you know, self-supervised pre-training 00:59:15.340 |
to reduce this to about a hundred or something? 00:59:25.020 |
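A minimal sketch of that evaluation protocol — pre-train a representation on unlabeled data, then fine-tune on a growing number of labeled samples and compare against training from scratch — using scikit-learn on synthetic data, with PCA as a crude stand-in for the self-supervised stage (all names and numbers here are placeholders, not a proposed benchmark):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: a large "unlabeled" pool plus a held-out test set.
X, y = make_classification(n_samples=20000, n_features=50, n_informative=10,
                           random_state=0)
X_pool, y_pool = X[:15000], y[:15000]   # labels exist but are mostly left unused
X_test, y_test = X[15000:], y[15000:]

# "Pre-training" on the unlabeled pool (labels ignored at this stage).
encoder = PCA(n_components=10).fit(X_pool)

# Fine-tune on a growing number of labeled samples, with and without pre-training.
for n_labels in [50, 200, 1000, 5000]:
    Xs, ys = X_pool[:n_labels], y_pool[:n_labels]
    pretrained = LogisticRegression(max_iter=1000).fit(encoder.transform(Xs), ys)
    scratch = LogisticRegression(max_iter=1000).fit(Xs, ys)
    print(n_labels, "labels |",
          "with pre-training:", round(pretrained.score(encoder.transform(X_test), y_test), 3),
          "| from scratch:", round(scratch.score(X_test, y_test), 3))
```

The quantity of interest is the curve of test performance versus number of labeled samples, and how far pre-training shifts it to the left.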
- Telling you, active learning, but you disagree. 00:59:32.460 |
It's just gonna make things that we already do. 00:59:40.900 |
So I worked with a lot of large scale datasets 00:59:54.300 |
It's, you know, working with the data you have. 00:59:56.140 |
I mean, certainly people are doing things like, 01:00:09.420 |
And with just that, I would probably reach the same. 01:00:12.420 |
So it's a weak form of active learning, if you want. 01:00:16.340 |
- Yes, but there might be a much stronger version. 01:00:20.980 |
- That's what, and that's an open question if it exists. 01:00:23.940 |
The question is how much stronger it can get. 01:00:32.140 |
and deep learning can solve the autonomous driving problem. 01:00:38.300 |
possibilities of deep learning in this space? 01:00:42.980 |
I mean, I don't think we'll ever have a self-driving system, 01:01:04.220 |
and that was the case for autonomous driving, 01:01:11.300 |
but there's a lot of engineering that's involved 01:01:13.700 |
in kind of, you know, taking care of corner cases 01:01:29.100 |
now computer vision, natural language processing. 01:01:43.820 |
decent level of autonomy, where you don't expect 01:01:52.580 |
100 square kilometers or square miles in Phoenix, 01:01:55.340 |
but the weather is nice and the roads are wide, 01:02:03.260 |
with tons of lidars and sophisticated sensors 01:02:11.300 |
And you engineer the hell out of everything else, 01:02:17.940 |
so you have a complete 3D model of everything. 01:02:37.500 |
but I think eventually the long-term solution 01:02:45.020 |
of self-supervised learning and model-based reinforcement 01:02:50.860 |
- But ultimately learning will be not just at the core, 01:02:54.780 |
but really the fundamental part of the system. 01:02:57.180 |
- Yeah, it already is, but it'll become more and more. 01:03:00.340 |
- What do you think it takes to build a system 01:03:04.060 |
You talked about the AI system in the movie "Her" 01:03:22.860 |
but I don't know how many obstacles there are after this. 01:03:26.620 |
there is a bunch of mountains that we have to climb 01:03:38.380 |
have been overly optimistic about the result of AI. 01:03:54.540 |
is that all the problems you want to solve are exponential. 01:03:56.340 |
And so you can't actually use it for anything useful. 01:04:00.060 |
- Yeah, so yeah, all you see is the first peak. 01:04:02.260 |
So what are the first couple of peaks for "Her"? 01:04:09.780 |
How do we get machines to learn models of the world 01:04:15.820 |
So we've been working with cognitive scientists. 01:04:22.260 |
So this Emmanuel Dupoux, who is at FAIR in Paris 01:04:26.620 |
half-time, and is also a researcher at a French university. 01:04:42.700 |
And you can measure this in sort of various ways. 01:04:45.660 |
So things like distinguishing animate objects 01:04:52.940 |
you can tell the difference at age two, three months. 01:05:02.900 |
You know, there are various things like this. 01:05:06.460 |
the fact that objects are not supposed to float in the air, 01:05:10.060 |
you learn this around the age of eight or nine months. 01:05:12.580 |
If you look at a lot of, you know, eight-month-old babies, 01:05:15.340 |
you give them a bunch of toys on their high chair. 01:05:18.540 |
First thing they do is they throw them on the ground 01:05:29.700 |
but they, you know, they need to do the experiment, right? 01:05:32.660 |
So, you know, how do we get machines to learn like babies? 01:05:36.580 |
Mostly by observation with a little bit of interaction 01:05:41.220 |
because I think that's really a crucial piece 01:05:49.500 |
it needs to have a predictive model of the world. 01:05:51.340 |
So something that says, here is a world at time T, 01:05:54.060 |
here is a state of the world at time T plus one 01:06:01.260 |
- Yeah, well, but we don't know how to represent 01:06:04.820 |
So it's gotta be something weaker than that, okay? 01:06:23.260 |
is some sort of objective that you want to optimize. 01:06:26.020 |
Am I reaching the goal of grabbing this object? 01:06:48.780 |
computes your level of contentment or miscontentment. 01:07:14.860 |
of what your basal ganglia is going to tell you. 01:07:23.740 |
And you're predicting this because of your model of the world 01:07:26.100 |
and your sort of predictor of this objective, right? 01:07:35.140 |
you have the hardwired contentment objective computer, 01:07:46.700 |
which basically predicts your level of contentment. 01:08:04.420 |
- Call this a policy network or something like that, right? 01:08:13.940 |
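A minimal sketch of that three-part arrangement — a predictive world model, a hardwired objective that scores predicted states, and action selection by rolling the model forward — on a toy one-dimensional world (all specifics are invented for illustration; a real system would learn the model and amortize the search into a policy network):

```python
def world_model(state, action):
    # Learned from experience in a real system; here a known toy dynamics.
    return state + action

def discontentment(state):
    # Hardwired objective ("basal ganglia"): squared distance from a goal at 3.0.
    return (state - 3.0) ** 2

def predicted_cost(state, plan):
    # Roll the world model forward and sum the predicted discontentment.
    total = 0.0
    for action in plan:
        state = world_model(state, action)
        total += discontentment(state)
    return total

state = 0.0
actions = [-1.0, 0.0, 1.0]
# Brute-force the best 3-step plan; a policy network would amortize this search.
best_plan = min(
    ((a1, a2, a3) for a1 in actions for a2 in actions for a3 in actions),
    key=lambda plan: predicted_cost(state, plan),
)
print("chosen plan:", best_plan)   # drives the state toward the goal at 3.0
```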
And you can be stupid in three different ways. 01:08:16.100 |
You can be stupid because your model of the world is wrong. 01:08:19.380 |
You can be stupid because your objective is not aligned 01:08:36.340 |
but you're unable to figure out a course of action 01:08:42.380 |
Some people who are in charge of big countries 01:08:57.980 |
you've criticized the art project that is Sophia the Robot. 01:09:07.540 |
is that it uses our natural inclination to anthropomorphize 01:09:11.740 |
things that look like humans and give them more. 01:09:14.780 |
Do you think that could be used by AI systems 01:09:38.500 |
about their marketing or behavior in general, 01:09:45.660 |
- I mean, don't you think, here's a tough question. 01:09:55.980 |
feels that Sophia can do way more than she actually can. 01:10:22.100 |
are taking advantage of the same misunderstanding 01:10:47.940 |
I mean, the reviewers are generally not very forgiving 01:10:57.180 |
And, but there are certainly quite a few startups 01:10:59.660 |
that have had a huge amount of hype around this 01:11:05.500 |
and I've been calling it out when I've seen it. 01:11:08.020 |
So yeah, but to go back to your original question, 01:11:13.020 |
I think, I don't think embodiment is necessary. 01:11:20.460 |
without some level of grounding in the real world. 01:11:30.300 |
- Can you talk to ground, what grounding means? 01:11:34.020 |
so there is this classic problem of common sense reasoning, 01:11:41.020 |
And so I tell you the trophy doesn't fit in a suitcase 01:11:49.180 |
And the it in the first case refers to the trophy 01:11:55.180 |
is because you know what the trophy and the suitcase are, 01:11:57.020 |
you know, one is supposed to fit in the other one, 01:12:00.620 |
and the big object doesn't fit in a small object 01:12:03.020 |
unless it's a TARDIS, you know, things like that, right? 01:12:05.300 |
So you have this knowledge of how the world works, 01:12:10.660 |
I don't believe you can learn everything about the world 01:12:14.700 |
by just being told in language how the world works. 01:12:18.020 |
I think you need some low-level perception of the world, 01:12:21.740 |
you know, be it visual touch, you know, whatever, 01:12:23.740 |
but some higher bandwidth perception of the world. 01:12:32.540 |
There's a lot of things that just will never appear in text 01:12:37.020 |
So I think common sense will emerge from, you know, 01:12:45.660 |
or perhaps even interacting in virtual environments 01:12:48.900 |
and possibly, you know, robot interacting in the real world. 01:12:55.980 |
but I think there's a need for some grounding. 01:13:04.860 |
It just needs to have an awareness, a grounding. 01:13:07.700 |
- Right, but it needs to know how the world works 01:13:10.140 |
to have, you know, to not be frustrating to talk to. 01:13:14.420 |
- And you talked about emotions being important. 01:13:29.340 |
the thing that calculates your level of miscontentment, 01:13:46.420 |
You have this inkling that there is some chance 01:13:49.260 |
that something really bad is gonna happen to you, 01:13:53.700 |
is gonna happen to you, you kind of give up, right? 01:14:08.860 |
So you mentioned very practical things of fear, 01:14:13.420 |
- But they are kind of the results of, you know, drives. 01:14:16.340 |
- Yeah, there's deeper biological stuff going on, 01:14:38.500 |
- You know, I think the first one we'll create 01:14:53.620 |
- Well, what's a good question to ask, you know, 01:14:56.900 |
to be impressed? - What is the cause of wind? 01:14:58.940 |
And if she answers, oh, it's because the leaves 01:15:03.900 |
of the tree are moving and that creates wind, 01:15:12.620 |
- No, and then you tell her, actually, you know, 01:15:24.500 |
to do common sense reasoning about the physical world. 01:15:26.980 |
- Yeah, and you'll sum it up with a causal inference.