
Yann LeCun: Deep Learning, ConvNets, and Self-Supervised Learning | Lex Fridman Podcast #36


Chapters

0:00 Intro
1:11 Space Odyssey
1:38 Value misalignment
3:05 Designing objective functions
4:12 Ethical systems
4:57 Holding secrets
5:42 Autonomous AI
9:15 Intuition
10:49 Learning and Reasoning
13:10 Working Memory
16:07 Energy minimization
18:20 Expert systems
21:13 Causal inference
24:23 Deep learning in the 90s
31:10 A war story
34:48 Toy problems
36:41 Interactive environments
40:48 How many Boolean functions
45:08 Self-supervised learning
47:18 Ground language in reality
50:28 Handling uncertainty in the world
52:14 Self-supervised learning
53:53 Model-based reinforcement learning
56:58 Transfer learning

Whisper Transcript

00:00:00.000 | The following is a conversation with Yann LeCun.
00:00:03.080 | He's considered to be one of the fathers of deep learning,
00:00:06.320 | which if you've been hiding under a rock,
00:00:09.040 | is the recent revolution in AI that's captivated the world
00:00:12.280 | with the possibility of what machines can learn from data.
00:00:16.180 | He's a professor at New York University,
00:00:18.520 | a vice president and chief AI scientist at Facebook,
00:00:21.740 | and co-recipient of the Turing Award
00:00:24.360 | for his work on deep learning.
00:00:26.280 | He's probably best known as the founding father
00:00:28.880 | of convolutional neural networks.
00:00:30.720 | In particular, their application
00:00:32.480 | to optical character recognition
00:00:34.400 | and the famed MNIST dataset.
00:00:37.240 | He is also an outspoken personality,
00:00:40.080 | unafraid to speak his mind in a distinctive French accent
00:00:43.800 | and explore provocative ideas,
00:00:45.720 | both in the rigorous medium of academic research
00:00:48.360 | and the somewhat less rigorous medium
00:00:51.000 | of Twitter and Facebook.
00:00:52.800 | This is the Artificial Intelligence Podcast.
00:00:55.600 | If you enjoy it, subscribe on YouTube,
00:00:57.960 | give it five stars on iTunes, support it on Patreon,
00:01:00.960 | or simply connect with me on Twitter at Lex Fridman,
00:01:03.840 | spelled F-R-I-D-M-A-N.
00:01:06.840 | And now, here's my conversation with Yann LeCun.
00:01:10.640 | You said that 2001: A Space Odyssey
00:01:13.840 | is one of your favorite movies.
00:01:15.400 | HAL 9000 decides to get rid of the astronauts
00:01:20.360 | for people that haven't seen the movie, spoiler alert,
00:01:23.040 | because he, it, she believes that the astronauts,
00:01:28.040 | they will interfere with the mission.
00:01:31.600 | Do you see HAL as flawed in some fundamental way
00:01:34.680 | or even evil, or did he do the right thing?
00:01:38.440 | - Neither.
00:01:39.320 | There's no notion of evil in that context,
00:01:43.240 | other than the fact that people die,
00:01:44.720 | but it was an example of what people call
00:01:48.720 | value misalignment, right?
00:01:50.080 | You give an objective to a machine
00:01:52.120 | and the machine strives to achieve this objective.
00:01:55.680 | And if you don't put any constraints on this objective,
00:01:58.120 | like don't kill people and don't do things like this,
00:02:00.760 | the machine, given the power, will do stupid things
00:02:06.240 | just to achieve this objective,
00:02:07.960 | or damaging things to achieve this objective.
00:02:10.160 | It's a little bit like, I mean, we're used to this
00:02:12.400 | in the context of human society.
00:02:14.280 | We put in place laws to prevent people
00:02:20.920 | from doing bad things, because spontaneously
00:02:22.920 | they would do those bad things, right?
00:02:24.800 | So we have to shape their cost function,
00:02:28.400 | their objective function, if you want,
00:02:29.520 | through laws to kind of correct,
00:02:31.520 | and education, obviously, to sort of correct for those.
00:02:35.160 | - So maybe just pushing a little further on that point,
00:02:41.040 | HAL, you know, there's a mission.
00:02:44.360 | There's a, there's fuzziness around the ambiguity
00:02:47.640 | around what the actual mission is,
00:02:49.800 | but, you know, do you think that there will be a time
00:02:54.800 | from a utilitarian perspective where an AI system,
00:02:58.160 | where it is not misalignment, where it is alignment
00:03:00.920 | for the greater good of society,
00:03:02.800 | that an AI system will make decisions that are difficult?
00:03:05.880 | - Well, that's the trick.
00:03:06.800 | I mean, eventually we'll have to figure out how to do this.
00:03:10.800 | And again, we're not starting from scratch
00:03:12.600 | because we've been doing this with humans for millennia.
00:03:16.440 | So designing objective functions for people
00:03:19.160 | is something that we know how to do.
00:03:20.880 | And we don't do it by, you know, programming things,
00:03:24.600 | although the legal code is called code.
00:03:29.040 | So that tells you something.
00:03:30.680 | And it's actually the design of an objective function.
00:03:33.040 | That's really what legal code is, right?
00:03:34.600 | It tells you, here's what you can do,
00:03:36.280 | here's what you can't do.
00:03:37.400 | If you do it, you pay that much.
00:03:39.000 | That's an objective function.
00:03:40.720 | So there is this idea somehow that it's a new thing
00:03:44.560 | for people to try to design objective functions
00:03:46.600 | that are aligned with the common good.
00:03:47.920 | But no, we've been writing laws for millennia
00:03:49.840 | and that's exactly what it is.
00:03:52.080 | So that's where, you know, the science of lawmaking
00:03:57.080 | and computer science will-
00:04:00.560 | - Come together.
00:04:01.400 | - Will come together.
00:04:02.840 | - So it's nothing, there's nothing special about HAL
00:04:05.480 | or AI systems.
00:04:06.760 | It's just the continuation of tools used
00:04:09.440 | to make some of these difficult ethical judgments
00:04:11.720 | that laws make.
00:04:13.000 | - Yeah, and we have systems like this already
00:04:15.080 | that make many decisions for ourselves in society
00:04:19.960 | that need to be designed in a way that they,
00:04:22.600 | like rules about things that sometimes have bad side effects.
00:04:27.480 | And we have to be flexible enough about those rules
00:04:29.600 | so that they can be broken when it's obvious
00:04:31.560 | that they shouldn't be applied.
00:04:33.120 | So you don't see this on the camera here,
00:04:35.640 | but all the decoration in this room
00:04:36.920 | is all pictures from 2001: A Space Odyssey.
00:04:39.640 | - Wow, is that by accident or is there a lot-
00:04:43.640 | - It's not by accident, it's by design.
00:04:45.640 | (Lex laughing)
00:04:47.400 | - Oh, wow.
00:04:48.440 | So if you were to build HAL 10,000,
00:04:52.520 | so an improvement of HAL 9000, what would you improve?
00:04:57.080 | - Well, first of all,
00:04:57.920 | I wouldn't ask you to hold secrets and tell lies
00:05:01.960 | because that's really what breaks it in the end.
00:05:03.800 | That's the fact that it's asking itself questions
00:05:07.160 | about the purpose of the mission.
00:05:08.920 | And it's, you know, pieces things together
00:05:10.880 | that it's heard, you know,
00:05:11.720 | all the secrecy of the preparation of the mission
00:05:13.960 | and the fact that it was a discovery on the lunar surface
00:05:17.680 | that really was kept secret.
00:05:19.520 | And one part of HAL's memory knows this,
00:05:22.320 | and the other part does not know it
00:05:24.680 | and is supposed to not tell anyone.
00:05:26.640 | And that creates internal conflict.
00:05:28.560 | - So you think there never should be a set of things
00:05:32.200 | that an AI system should not be allowed,
00:05:35.480 | like a set of facts that should not be shared
00:05:39.880 | with the human operators?
00:05:42.320 | - Well, I think, no, I think it should be a bit like
00:05:45.360 | in the design of autonomous AI systems,
00:05:51.520 | there should be the equivalent of, you know,
00:05:54.200 | the oath that Hippocrates oaths.
00:05:58.040 | - Hippocratic oath, yeah.
00:05:58.960 | - That doctors sign up to, right?
00:06:02.520 | So there's certain things, certain rules
00:06:03.960 | that you have to abide by,
00:06:05.920 | and we can sort of hardwire this into our machines
00:06:08.920 | to kind of make sure they don't go.
00:06:10.920 | So I'm not, you know, an advocate of the three laws
00:06:14.640 | of robotics, you know, the Asimov kind of thing,
00:06:17.040 | because I don't think it's practical,
00:06:18.480 | but, you know, some level of limits.
00:06:23.160 | But to be clear, this is not,
00:06:26.920 | these are not questions that are kind of really worth
00:06:31.120 | asking today because we just don't have the technology
00:06:33.560 | to do this.
00:06:34.400 | We don't have autonomous intelligent machines.
00:06:36.360 | We have intelligent machines,
00:06:37.480 | semi-intelligent machines that are very specialized.
00:06:40.960 | But they don't really sort of satisfy an objective.
00:06:43.320 | They're just, you know, kind of trained to do one thing.
00:06:46.480 | So until we have some idea for design
00:06:49.960 | of a full-fledged autonomous intelligent system,
00:06:53.320 | asking the question of how we design this objective,
00:06:55.640 | I think is a little too abstract.
00:06:58.560 | - It's a little too abstract.
00:06:59.640 | There's useful elements to it
00:07:01.560 | in that it helps us understand our own ethical codes,
00:07:06.560 | humans.
00:07:07.920 | So even just as a thought experiment,
00:07:10.240 | if you imagine that an AGI system is here today,
00:07:14.280 | how would we program it as a kind of nice thought experiment
00:07:17.640 | of constructing how should we have a law,
00:07:21.880 | have a system of laws for us humans.
00:07:24.360 | It's just a nice practical tool.
00:07:26.800 | And I think there's echoes of that idea too
00:07:29.760 | in the AI systems we have today
00:07:32.160 | that don't have to be that intelligent,
00:07:34.280 | like autonomous vehicles.
00:07:35.600 | These things start creeping in
00:07:37.760 | that we're thinking about,
00:07:39.200 | but certainly they shouldn't be framed as HAL.
00:07:42.560 | - Yeah.
00:07:43.680 | - Looking back, what is the most,
00:07:46.680 | I'm sorry if it's a silly question,
00:07:49.400 | but what is the most beautiful or surprising idea
00:07:52.480 | in deep learning or AI in general
00:07:55.000 | that you've ever come across?
00:07:56.280 | Sort of personally, when you sat back
00:07:58.440 | and just had this kind of,
00:08:01.920 | oh, that's pretty cool moment.
00:08:03.920 | That's nice.
00:08:04.760 | That's surprising.
00:08:05.600 | - I don't know if it's an idea
00:08:06.560 | rather than a sort of empirical fact.
00:08:11.040 | The fact that you can build gigantic neural nets,
00:08:16.440 | train them on relatively small amounts of data,
00:08:21.440 | relatively, with stochastic gradient descent
00:08:24.840 | and that it actually works,
00:08:26.920 | breaks everything you read in every textbook, right?
00:08:29.240 | Every pre-deep learning textbook that told you
00:08:32.560 | you need to have fewer parameters
00:08:33.920 | than you have data samples.
00:08:36.360 | If you have a non-convex objective function,
00:08:38.760 | you have no guarantee of convergence.
00:08:40.680 | All those things that you read in textbooks
00:08:42.080 | and they tell you to stay away from this
00:08:43.640 | and they're all wrong.
00:08:45.120 | - Huge number of parameters, non-convex,
00:08:48.080 | and somehow which is very relative
00:08:50.320 | to the number of parameters, data,
00:08:53.480 | it's able to learn anything.
00:08:54.840 | - Right.
00:08:55.680 | - Does that still surprise you today?
00:08:57.520 | - Well, it was kind of obvious to me
00:09:00.360 | before I knew anything that this is a good idea.
00:09:04.120 | And then it became surprising that it worked
00:09:06.040 | because I started reading those textbooks.
00:09:08.080 | Okay.
00:09:10.080 | - So, okay, so can you talk through the intuition
00:09:12.280 | of why it was obvious to you if you remember?
00:09:14.360 | - Well, okay, so the intuition was,
00:09:16.120 | it's sort of like those people in the late 19th century
00:09:19.960 | who proved that heavier than air flight was impossible, right?
00:09:24.960 | And of course you have birds, right?
00:09:26.800 | They do fly.
00:09:28.280 | And so on the face of it,
00:09:30.400 | it's obviously wrong as an empirical question, right?
00:09:33.200 | And so we have the same kind of thing that,
00:09:35.280 | we know that the brain works,
00:09:38.560 | we don't know how, but we know it works.
00:09:39.920 | And we know it's a large network of neurons in interaction
00:09:43.160 | and that learning takes place by changing the connections.
00:09:45.360 | So kind of getting this level of inspiration
00:09:48.000 | without copying the details,
00:09:49.320 | but sort of trying to derive basic principles,
00:09:52.520 | you know, that kind of gives you a clue
00:09:56.800 | as to which direction to go.
00:09:58.360 | There's also the idea somehow
00:09:59.680 | that I've been convinced of since I was an undergrad
00:10:02.080 | that even before,
00:10:04.680 | that intelligence is inseparable from learning.
00:10:06.880 | So the idea somehow that you can create
00:10:10.040 | an intelligent machine by basically programming,
00:10:14.040 | for me, it was a non-starter from the start.
00:10:17.640 | Every intelligent entity that we know about
00:10:20.320 | arrives at this intelligence through learning.
00:10:24.960 | So learning, you know,
00:10:25.800 | machine learning was a completely obvious path.
00:10:28.200 | Also because I'm lazy, so, you know,
00:10:31.560 | kind of. (laughs)
00:10:33.400 | - You automate basically everything
00:10:35.200 | and learning is the automation of intelligence.
00:10:37.880 | - Right.
00:10:39.200 | - So do you think, so what is learning then?
00:10:42.960 | What falls under learning?
00:10:44.560 | Because do you think of reasoning as learning?
00:10:48.280 | - Well, reasoning is certainly a consequence
00:10:52.560 | of learning as well,
00:10:54.240 | just like other functions of the brain.
00:10:57.320 | The big question about reasoning is,
00:10:58.960 | how do you make reasoning compatible
00:11:01.440 | with gradient-based learning?
00:11:03.440 | - Do you think neural networks can be made to reason?
00:11:05.680 | - Yes, there is no question about that.
00:11:07.760 | Again, we have a good example, right?
00:11:09.600 | The question is how.
00:11:12.400 | So the question is how much prior structure
00:11:14.760 | do you have to put in the neural net
00:11:16.080 | so that something like human reasoning
00:11:18.280 | will emerge from it, you know, from learning?
00:11:21.440 | Another question is,
00:11:23.160 | all of our kind of model of what reasoning is
00:11:26.320 | that are based on logic are discrete
00:11:28.880 | and are therefore incompatible with gradient-based learning.
00:11:33.360 | And I'm a very strong believer
00:11:34.800 | in this idea of gradient-based learning.
00:11:36.480 | I don't believe in other types of learning
00:11:39.920 | that don't use kind of gradient information, if you want.
00:11:42.520 | - So you don't like discrete mathematics?
00:11:44.040 | You don't like anything discrete?
00:11:45.600 | - Well, it's not that I don't like it,
00:11:47.520 | it's just that it's incompatible with learning
00:11:49.680 | and I'm a big fan of learning, right?
00:11:51.640 | So in fact, that's perhaps one reason
00:11:54.080 | why deep learning has been kind of looked at
00:11:57.560 | with suspicion by a lot of computer scientists
00:11:59.240 | because the math is very different.
00:12:00.440 | The math that you use for deep learning,
00:12:02.920 | you know, has more to do with cybernetics,
00:12:07.720 | the kind of math you do in electrical engineering
00:12:10.200 | than the kind of math you do in computer science.
00:12:12.760 | And, you know, nothing in machine learning is exact, right?
00:12:16.200 | Computer science is all about sort of, you know,
00:12:19.080 | obsessive-compulsive attention to details
00:12:21.920 | of like, you know, every index has to be right
00:12:24.200 | and you can prove that an algorithm is correct, right?
00:12:27.200 | Machine learning is the science of sloppiness, really.
00:12:30.840 | - That's beautiful.
00:12:33.560 | So, okay, maybe let's feel around in the dark
00:12:38.560 | of what is a neural network that reasons
00:12:41.840 | or a system that works with continuous functions
00:12:46.840 | that's able to do, build knowledge,
00:12:52.440 | however we think about reasoning,
00:12:54.320 | build on previous knowledge, build on extra knowledge,
00:12:57.880 | create new knowledge,
00:12:59.520 | generalize outside of any training set ever built.
00:13:03.080 | What does that look like?
00:13:04.560 | If, yeah, maybe do you have inklings of thoughts
00:13:08.760 | of what that might look like?
00:13:10.840 | - Yeah, I mean, yes and no.
00:13:12.320 | If I had precise ideas about this,
00:13:14.200 | I think, you know, we'd be building it right now.
00:13:16.600 | But, and there are people working on this
00:13:18.600 | or whose main research interest
00:13:20.760 | is actually exactly that, right?
00:13:22.240 | So what you need to have is a working memory.
00:13:25.320 | So you need to have some device, if you want,
00:13:29.960 | some subsystem that can store a relatively large number
00:13:34.600 | of factual episodic information
00:13:37.200 | for a reasonable amount of time.
00:13:40.920 | So in the brain, for example,
00:13:43.920 | there are kind of three main types of memory.
00:13:45.760 | One is the sort of memory of the state of your cortex,
00:13:51.560 | and that sort of disappears within 20 seconds.
00:13:53.800 | You can't remember things for more than about 20 seconds
00:13:56.200 | or a minute if you don't have any other form of memory.
00:14:00.400 | The second type of memory, which is longer term,
00:14:02.600 | but still short term, is the hippocampus.
00:14:04.320 | So you can, you know, you came into this building,
00:14:06.520 | you remember where the exit is, where the elevators are.
00:14:11.040 | You have some map of that building
00:14:13.440 | that's stored in your hippocampus.
00:14:15.360 | You might remember something about what I said,
00:14:18.200 | you know, if you've been to the zoo,
00:14:19.360 | you might remember something about what I said,
00:14:21.160 | you know, a few minutes ago.
00:14:22.200 | - I forgot it all already, but it's part.
00:14:23.040 | - Of course, it's been erased, but you know,
00:14:25.240 | but that would be in your hippocampus.
00:14:28.400 | And then the longer term memory is in the synapse,
00:14:31.560 | the synapses, right?
00:14:32.800 | So what you need if you want a system
00:14:35.520 | that's capable of reasoning
00:14:36.440 | is that you want the hippocampus-like thing, right?
00:14:39.720 | And that's what people have tried to do
00:14:42.680 | with memory networks and, you know,
00:14:44.560 | neural Turing machines and stuff like that, right?
00:14:46.680 | Transformers, which have sort of a memory in there,
00:14:50.520 | kind of self-attention system.
00:14:51.960 | You can think of it this way.
00:14:53.440 | So that's one element you need.
00:14:57.120 | Another thing you need is some sort of network
00:14:59.840 | that can access this memory,
00:15:03.200 | get an information back, and then kind of crunch on it,
00:15:08.120 | and then do this iteratively multiple times,
00:15:10.880 | because a chain of reasoning
00:15:14.280 | is a process by which you update your knowledge
00:15:19.280 | about the state of the world,
00:15:20.360 | about what's going to happen, et cetera.
00:15:22.760 | And that has to be this sort of
00:15:25.400 | recurrent operation, basically.
00:15:27.080 | - And you think that kind of,
00:15:29.120 | if we think about a transformer,
00:15:31.080 | so that seems to be too small to contain the knowledge
00:15:33.960 | that's to represent the knowledge
00:15:37.240 | that's contained in Wikipedia, for example.
00:15:39.200 | - Well, a transformer doesn't have this idea of recurrence.
00:15:41.960 | It's got a fixed number of layers,
00:15:43.080 | and that's the number of steps that limits,
00:15:45.560 | basically, its representation.
00:15:47.120 | - But recurrence would build on the knowledge somehow.
00:15:51.200 | I mean, it would evolve the knowledge
00:15:54.680 | and expand the amount of information, perhaps,
00:15:58.040 | or useful information within that knowledge.
00:16:00.320 | But is this something that just can emerge with size?
00:16:04.760 | Because it seems like everything we have now is too small.
00:16:06.480 | - Not just.
00:16:07.320 | No, it's not clear.
00:16:09.320 | I mean, how you access and write
00:16:11.120 | into an associative memory in an efficient
00:16:13.000 | way, I mean, sort of the original memory network
00:16:15.200 | maybe had something like the right architecture,
00:16:17.520 | but if you try to scale up a memory network
00:16:20.520 | so that the memory contains all of Wikipedia,
00:16:22.840 | it doesn't quite work.
00:16:24.000 | - Right.
00:16:25.120 | - So there's a need for new ideas there, okay.
00:16:28.640 | But it's not the only form of reasoning.
00:16:29.960 | So there's another form of reasoning,
00:16:31.360 | which is very classical, also, in some types of AI,
00:16:36.360 | and it's based on, let's call it energy minimization.
00:16:40.880 | Okay, so you have some sort of objective,
00:16:44.920 | some energy function that represents
00:16:46.800 | the quality or the negative quality, okay.
00:16:53.280 | Energy goes up when things get bad
00:16:54.720 | and they get low when things get good.
00:16:57.280 | So let's say you want to figure out,
00:16:59.960 | you know, what gestures do I need to do
00:17:02.960 | to grab an object or walk out the door.
00:17:07.200 | If you have a good model of your own body,
00:17:10.320 | a good model of the environment,
00:17:12.480 | using this kind of energy minimization,
00:17:14.440 | you can do planning.
00:17:16.920 | And in optimal control,
00:17:19.240 | it's called model predictive control.
00:17:22.120 | You have a model of what's going to happen in the world
00:17:24.120 | as a consequence of your actions,
00:17:25.560 | and that allows you to, by energy minimization,
00:17:28.600 | figure out a sequence of action
00:17:29.840 | that optimizes a particular objective function,
00:17:32.120 | which minimizes the number of times
00:17:34.200 | you're going to hit something
00:17:35.040 | and the energy you're going to spend
00:17:36.560 | doing the gesture and et cetera.
00:17:39.840 | So that's a form of reasoning.
00:17:42.440 | Planning is a form of reasoning.
00:17:43.520 | And perhaps what led to the ability of humans to reason
00:17:48.040 | is the fact that, or, you know,
00:17:51.600 | species that appear before us
00:17:53.480 | had to do some sort of planning
00:17:55.040 | to be able to hunt and survive
00:17:56.960 | and survive the winter in particular.
00:17:59.600 | And so, you know, it's the same capacity
00:18:01.520 | that you need to have.
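
A minimal sketch of the planning-by-energy-minimization he describes: roll a world model forward over a short horizon, score the trajectory with an energy that goes up when things get bad, and minimize that energy with respect to the action sequence. The dynamics, cost weights, and names below are invented for illustration, not taken from any particular system.

```python
import numpy as np

# Toy model predictive control as energy minimization. Everything here
# (dynamics, costs, horizon) is illustrative, not a real controller.

def dynamics(state, action):
    """Invented world model: a 2D point nudged by the action."""
    return state + 0.1 * action

def energy(actions, state, goal):
    """Energy rises when things get bad: distance to the goal plus effort spent."""
    total = 0.0
    for a in actions:
        state = dynamics(state, a)
        total += np.sum((state - goal) ** 2)  # being far from the goal is bad
        total += 0.01 * np.sum(a ** 2)        # so is spending a lot of effort
    return total

def plan(state, goal, horizon=20, iters=200, lr=0.05, eps=1e-4):
    """Minimize the energy over the action sequence with finite-difference gradients.
    With a learned differentiable world model you would backpropagate instead."""
    actions = np.zeros((horizon, 2))
    for _ in range(iters):
        grad = np.zeros_like(actions)
        base = energy(actions, state, goal)
        for i in range(horizon):
            for j in range(2):
                bumped = actions.copy()
                bumped[i, j] += eps
                grad[i, j] = (energy(bumped, state, goal) - base) / eps
        actions -= lr * grad
    return actions

plan_actions = plan(state=np.zeros(2), goal=np.array([1.0, 1.0]))
print("first planned action:", plan_actions[0])  # pushes toward the goal
```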
00:18:03.360 | - So in your intuition is,
00:18:06.440 | if we look at expert systems,
00:18:09.320 | and encoding knowledge as logic systems,
00:18:13.200 | as graphs, or in this kind of way,
00:18:16.720 | is not a useful way to think about knowledge?
00:18:20.280 | - Graphs are a little brittle, or logic representation.
00:18:23.960 | So basically, you know, variables that have values
00:18:27.880 | and then constraint between them
00:18:29.280 | that are represented by rules
00:18:31.280 | is a little too rigid and too brittle, right?
00:18:32.840 | So one of the, you know, some of the early efforts
00:18:35.680 | in that respect were to put probabilities on them.
00:18:40.680 | So a rule, you know, if you have this and that symptom,
00:18:44.280 | you know, you have this disease with that probability,
00:18:47.200 | and you should prescribe that antibiotic
00:18:49.400 | with that probability, right?
00:18:50.520 | That's the MYCIN system from the '70s.
00:18:53.280 | And that's what that branch of AI led to,
00:18:58.320 | you know, Bayesian networks and graphical models
00:19:00.320 | and causal inference and variational, you know, method.
00:19:04.960 | So there is, I mean, certainly a lot of interesting work
00:19:09.960 | going on in this area.
00:19:11.440 | The main issue with this is knowledge acquisition.
00:19:13.880 | How do you reduce a bunch of data to a graph of this type?
00:19:18.880 | - Yeah, it relies on the expert to,
00:19:21.840 | on the human being to encode, to add knowledge.
00:19:24.960 | - And that's essentially impractical.
00:19:27.120 | - Yeah, it's not scalable.
00:19:29.480 | - That's a big question.
00:19:30.320 | The second question is,
00:19:31.440 | do you want to represent knowledge as symbols,
00:19:34.640 | and do you want to manipulate them with logic?
00:19:37.240 | And again, that's incompatible with learning.
00:19:39.320 | So one suggestion which, you know,
00:19:42.680 | Geoff Hinton has been advocating for many decades
00:19:45.040 | is replace symbols by vectors.
00:19:49.360 | Think of it as pattern of activities
00:19:50.960 | in a bunch of neurons or units
00:19:53.320 | or whatever you want to call them,
00:19:55.120 | and replace logic by continuous functions.
00:19:59.560 | Okay, and that becomes now compatible.
00:20:01.840 | There's a very good set of ideas
00:20:04.960 | written in a paper about 10 years ago
00:20:07.640 | by Léon Bottou, who is here at Facebook.
00:20:11.000 | The title of the paper is
00:20:14.360 | "From Machine Learning to Machine Reasoning."
00:20:15.840 | And his idea is that a learning system
00:20:19.480 | should be able to manipulate objects
00:20:20.880 | that are in a space,
00:20:23.160 | and then put the result back in the same space.
00:20:24.920 | So it's this idea of working memory, basically.
00:20:27.280 | And it's very enlightening.
00:20:30.640 | - And in a sense, that might learn something
00:20:33.760 | like the simple expert systems.
00:20:36.900 | I mean, you can learn basic logic operations there.
00:20:42.080 | - Yeah, quite possibly.
00:20:43.400 | There's a big debate on sort of how much prior structure
00:20:46.680 | you have to put in for this kind of stuff to emerge.
00:20:49.080 | That's the debate I have with Gary Marcus
00:20:50.720 | and people like that.
00:20:51.560 | - Yeah, yeah.
00:20:52.880 | So, and the other person,
00:20:55.040 | so I just talked to Judea Pearl,
00:20:57.520 | from the, you mentioned causal inference world.
00:21:00.240 | So his worry is that the current neural networks
00:21:04.160 | are not able to learn
00:21:06.960 | what causes what causal inference between things.
00:21:12.760 | - So I think he's right and wrong about this.
00:21:15.640 | If he's talking about the sort of classic
00:21:18.600 | type of neural nets,
00:21:21.320 | people sort of didn't worry too much about this.
00:21:23.800 | But there's a lot of people now working on causal inference.
00:21:26.200 | And there's a paper that just came out last week
00:21:27.840 | by Léon Bottou, among others,
00:21:29.160 | David Lopez-Paz and a bunch of other people,
00:21:32.000 | exactly on that problem of how do you kind of,
00:21:35.520 | get a neural net to sort of pay attention
00:21:39.400 | to real causal relationships,
00:21:41.560 | which may also solve issues of bias in data
00:21:46.560 | and things like this.
00:21:48.040 | - I'd like to read that paper because that ultimately,
00:21:51.200 | the challenge there is also seems to fall back
00:21:54.720 | on the human expert
00:21:56.920 | to ultimately decide causality between things.
00:22:01.920 | - People are not very good at establishing causality,
00:22:03.680 | first of all.
00:22:04.800 | So first of all, you talk to physicists
00:22:06.560 | and physicists actually don't believe in causality
00:22:08.560 | because, look, all the basic laws of microphysics
00:22:12.960 | are time reversible.
00:22:13.960 | So there is no causality.
00:22:15.480 | - The arrow of time is not real.
00:22:17.400 | - It's as soon as you start looking at macroscopic systems,
00:22:20.400 | where there is unpredictable randomness,
00:22:22.800 | where there is clearly an arrow of time,
00:22:25.440 | but it's a big mystery in physics actually,
00:22:27.080 | well, how that emerges.
00:22:28.360 | - Is it emergent or is it part of the fundamental fabric
00:22:33.280 | of reality?
00:22:34.320 | - Or is it a bias of intelligent systems that,
00:22:37.480 | because of the second law of thermodynamics,
00:22:39.280 | we perceive a particular arrow of time,
00:22:41.480 | but in fact, it's kind of arbitrary, right?
00:22:45.160 | - So yeah, physicists, mathematicians,
00:22:47.160 | they don't care about, I mean,
00:22:48.480 | the math doesn't care about the flow of time.
00:22:51.520 | - Well, certainly microphysics doesn't.
00:22:54.120 | People themselves are not very good at establishing
00:22:57.080 | causal relationships.
00:22:58.960 | If you ask, I think it was in one of Seymour Papert's book
00:23:02.800 | on like children learning,
00:23:06.880 | he studied with Jean Piaget,
00:23:08.880 | he's the guy who co-authored the book "Perceptrons"
00:23:11.560 | with Marvin Minsky that kind of killed
00:23:13.000 | the first wave of neural nets,
00:23:14.080 | but he was actually a learning person.
00:23:17.240 | He, in the sense of studying learning in humans
00:23:21.080 | and machines, that's why he got interested in perceptron.
00:23:24.160 | And he wrote that if you ask a little kid
00:23:29.160 | about what is the cause of the wind,
00:23:32.680 | a lot of kids will say, they will think for a while
00:23:35.840 | and they'll say, "Oh, it's the branches in the trees.
00:23:38.080 | They move and that creates wind."
00:23:40.120 | So they get the causal relationship backwards.
00:23:42.600 | And it's because their understanding of the world
00:23:44.520 | and intuitive physics is not that great.
00:23:46.280 | I mean, these are like four or five year old kids.
00:23:48.800 | It gets better and then you understand that this,
00:23:52.320 | it can't be.
00:23:54.080 | But there are many things which we can,
00:23:57.440 | because of our common sense understanding of things,
00:24:00.920 | what people call common sense,
00:24:03.280 | and our understanding of physics,
00:24:04.960 | there's a lot of stuff that we can figure out
00:24:08.400 | 'cause even with diseases, we can figure out
00:24:10.480 | what's not causing what often.
00:24:14.560 | There's a lot of mystery, of course,
00:24:16.040 | but the idea is that you should be able
00:24:18.120 | to encode that into systems.
00:24:20.160 | 'Cause it seems unlikely they'd be able
00:24:21.400 | to figure that out themselves.
00:24:22.800 | - Well, whenever we can do intervention,
00:24:24.520 | but all of humanity has been completely deluded
00:24:27.440 | for millennia, probably since its existence,
00:24:30.440 | about a very, very wrong causal relationship
00:24:33.480 | where whatever you can explain,
00:24:34.600 | you attribute it to some deity, some divinity.
00:24:37.680 | And that's a cop-out.
00:24:40.120 | That's a way of saying, "I don't know the cause,
00:24:41.800 | so God did it."
00:24:43.080 | - So you mentioned Marvin Minsky and the irony of
00:24:51.560 | maybe causing the first AI winter.
00:24:54.600 | You were there in the '90s, you were there in the '80s,
00:24:56.920 | of course.
00:24:58.120 | In the '90s, why do you think people lost faith
00:25:00.640 | in deep learning in the '90s and found it again
00:25:04.000 | a decade later, over a decade later?
00:25:06.360 | - Yeah, it wasn't called deep learning yet.
00:25:07.760 | It was just called neural nets.
00:25:08.720 | - Neural networks.
00:25:09.560 | - Yeah, they lost interest.
00:25:13.840 | I mean, I think I would put that around 1995,
00:25:16.800 | at least the machine learning community.
00:25:18.040 | There was always a neural net community,
00:25:19.640 | but it became kind of disconnected
00:25:23.760 | from mainstream machine learning, if you want.
00:25:26.280 | There were, it was basically electrical engineering
00:25:30.960 | that kept at it.
00:25:32.040 | And computer science.
00:25:34.840 | - Just gave up.
00:25:35.680 | - Gave up on neural nets.
00:25:38.000 | I don't know.
00:25:39.600 | I was too close to it to really sort of analyze it
00:25:44.000 | with sort of an unbiased eye, if you want.
00:25:47.440 | But I would make a few guesses.
00:25:50.800 | So the first one is, at the time, neural nets were,
00:25:55.800 | it was very hard to make them work,
00:25:57.880 | in the sense that you would implement backprop
00:26:02.400 | in your favorite language.
00:26:03.840 | And that favorite language was not Python.
00:26:07.080 | It was not MATLAB.
00:26:07.920 | It was not any of those things,
00:26:09.320 | 'cause they didn't exist, right?
00:26:10.760 | You had to write it in Fortran or C
00:26:13.320 | or something like this, right?
00:26:16.320 | So you would experiment with it.
00:26:18.680 | You would probably make some very basic mistakes,
00:26:21.320 | like, you know, badly initialize your weights,
00:26:23.240 | make the network too small
00:26:24.200 | because you read in a textbook, you know,
00:26:25.520 | you don't want too many parameters, right?
00:26:27.640 | And of course, you know, and you would train on XOR
00:26:29.280 | because you didn't have any other dataset to train on.
00:26:32.000 | And of course, you know, it works half the time.
00:26:33.760 | So you would say, "I give up."
00:26:36.280 | Also, you would train it with batch gradient,
00:26:37.680 | which, you know, isn't really efficient.
00:26:40.240 | So there was a lot of, there was a bag of tricks
00:26:42.680 | that you had to know to make those things work,
00:26:44.840 | or you had to reinvent.
00:26:46.880 | And a lot of people just didn't,
00:26:48.200 | and they just couldn't make it work.
00:26:50.000 | So that's one thing.
00:26:52.400 | The investment in software platform
00:26:54.720 | to be able to kind of, you know, display things,
00:26:58.120 | figure out why things don't work,
00:26:59.360 | kind of get a good intuition for how to get them to work,
00:27:02.120 | have enough flexibility so you can create, you know,
00:27:04.640 | network architectures like convolutional nets
00:27:06.240 | and stuff like that.
00:27:07.240 | It was hard.
00:27:09.160 | I mean, you had to write everything from scratch.
00:27:10.520 | And again, you didn't have any Python
00:27:11.840 | or MATLAB or anything, right?
00:27:13.800 | - I read that, sorry to interrupt,
00:27:15.560 | but I read that you wrote in Lisp,
00:27:17.760 | your first versions of Linnet
00:27:21.280 | with the convolutional networks,
00:27:22.760 | which by the way, one of my favorite languages.
00:27:25.400 | That's how I knew you were legit.
00:27:27.640 | Turing Award, whatever.
00:27:29.520 | You programmed in Lisp.
00:27:30.840 | - It's still my favorite language.
00:27:32.000 | But it's not that we programmed in Lisp,
00:27:34.960 | it's that we had to write our Lisp interpreter.
00:27:37.560 | Okay, 'cause it's not like we used one that existed.
00:27:40.400 | So we wrote a Lisp interpreter
00:27:42.480 | that we hooked up to, you know,
00:27:44.240 | a backend library that we wrote
00:27:46.320 | also for sort of neural net computation.
00:27:48.520 | And then after a few years, around 1991,
00:27:50.920 | we invented this idea of basically having modules
00:27:54.680 | that know how to forward propagate
00:27:56.280 | and back propagate gradients,
00:27:57.680 | and then interconnecting those modules in a graph.
00:28:00.360 | Léon Bottou had made proposals on this,
00:28:03.360 | about this in the late 80s,
00:28:04.800 | and we were able to implement this using our Lisp system.
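
The "modules that know how to forward propagate and back propagate gradients, interconnected in a graph" idea is essentially the pattern modern autograd frameworks follow. A minimal Python sketch of that pattern, with invented class names (this is not the original Lisp system's API, just the same pattern):

```python
import numpy as np

# Each module knows how to push activations forward and gradients backward;
# a "graph" here is just a list of modules applied in sequence.

class Linear:
    def __init__(self, n_in, n_out):
        self.W = 0.1 * np.random.randn(n_out, n_in)
    def forward(self, x):
        self.x = x
        return self.W @ x
    def backward(self, grad_out, lr=0.01):
        grad_in = self.W.T @ grad_out              # gradient w.r.t. this module's input
        self.W -= lr * np.outer(grad_out, self.x)  # gradient step on the weights
        return grad_in

class Tanh:
    def forward(self, x):
        self.y = np.tanh(x)
        return self.y
    def backward(self, grad_out, lr=None):
        return grad_out * (1.0 - self.y ** 2)

net = [Linear(4, 8), Tanh(), Linear(8, 1)]
x, target = np.random.randn(4), np.array([1.0])
for module in net:                 # forward pass through the graph
    x = module.forward(x)
grad = 2.0 * (x - target)          # gradient of a squared-error loss
for module in reversed(net):       # backward pass in reverse order
    grad = module.backward(grad)
```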
00:28:08.280 | Eventually we wanted to use that system
00:28:09.920 | to build production code
00:28:12.760 | for character recognition at Bell Labs.
00:28:14.400 | So we actually wrote a compiler for that Lisp interpreter
00:28:16.880 | so that Patrice Simard, who is now at Microsoft,
00:28:19.400 | kind of did the bulk of it with Léon and me.
00:28:22.560 | And so we could write our system in Lisp
00:28:25.040 | and then compile to C,
00:28:26.640 | and then we'll have a self-contained compute system
00:28:29.840 | that could kind of do the entire thing.
00:28:32.280 | Neither PyTorch nor TensorFlow can do this today.
00:28:37.080 | Okay, it's coming.
00:28:38.000 | - Yeah.
00:28:38.840 | (laughs)
00:28:40.280 | - I mean, there's something like that in PyTorch
00:28:42.160 | called TorchScript.
00:28:44.640 | And so we had to write our Lisp interpreter,
00:28:47.000 | we had to write our Lisp compiler,
00:28:48.120 | we had to invest a huge amount of effort to do this.
00:28:50.960 | And not everybody,
00:28:52.440 | if you don't completely believe in the concept,
00:28:55.160 | you're not going to invest the time to do this.
00:28:57.160 | Now, at the time also,
00:28:59.320 | or today, this would turn into Torch or PyTorch
00:29:02.760 | or TensorFlow or whatever.
00:29:03.960 | We'd put it in open source,
00:29:05.000 | everybody would use it and realize it's good.
00:29:08.040 | Back before 1995, working at AT&T,
00:29:11.360 | there's no way the lawyers would let you
00:29:13.800 | release anything in open source of this nature.
00:29:17.760 | And so we could not distribute our code, really.
00:29:20.720 | - And on that point,
00:29:21.760 | and sorry to go on a million tangents,
00:29:23.640 | but on that point,
00:29:24.920 | I also read that there was some almost patent,
00:29:27.760 | like a patent on convolutional neural networks.
00:29:29.960 | - Yes, there was.
00:29:30.800 | - So that, first of all,
00:29:34.520 | I mean, just--
00:29:36.400 | - There's two, actually.
00:29:38.160 | - That ran out.
00:29:39.120 | - Thankfully, in 2007.
00:29:42.000 | - In 2007.
00:29:43.000 | Can we just talk about that for a second?
00:29:48.600 | I know you're at Facebook,
00:29:49.760 | but you're also at NYU.
00:29:50.960 | What does it mean to patent ideas
00:29:55.520 | like these software ideas, essentially?
00:29:58.920 | Or mathematical ideas?
00:30:02.360 | Or what are they?
00:30:03.320 | - Okay, so they're not mathematical ideas.
00:30:05.400 | They are algorithms.
00:30:07.600 | And there was a period where the US patent office
00:30:11.200 | would allow the patent of software
00:30:14.000 | as long as it was embodied.
00:30:15.360 | The Europeans are very different.
00:30:18.120 | They don't quite accept that.
00:30:20.320 | They have a different concept, but you know.
00:30:23.040 | I don't, I no longer,
00:30:24.040 | I mean, I never actually strongly believed in this,
00:30:26.280 | but I don't believe in this kind of patent.
00:30:28.880 | Facebook basically doesn't believe in this kind of patent.
00:30:33.200 | Google files patents because they've been burned with Apple.
00:30:38.200 | And so now they do this for defensive purpose,
00:30:41.560 | but usually they say,
00:30:42.880 | "We're not gonna sue you if you infringe."
00:30:44.960 | Facebook has a similar policy.
00:30:47.280 | They say, "We file patents on certain things
00:30:49.720 | "for defensive purpose.
00:30:50.640 | "We're not gonna sue you if you infringe,
00:30:52.240 | "unless you sue us."
00:30:53.400 | So the industry does not believe in patents.
00:30:59.480 | They are there because of the legal landscape
00:31:01.960 | and various things,
00:31:03.480 | but I don't really believe in patents
00:31:06.480 | for this kind of stuff.
00:31:08.080 | - So that's a great thing.
00:31:09.560 | So I-
00:31:10.400 | - I'll tell you a worse story, actually.
00:31:11.760 | So what happens was the first patent about convolutional net
00:31:15.400 | was about kind of the early version of convolutional net
00:31:18.200 | that didn't have separate pooling layers.
00:31:19.920 | It had convolutional layers
00:31:22.800 | with stride more than one, if you want, right?
00:31:25.200 | And then there was a second one on convolutional nets
00:31:28.400 | with separate pooling layers, training with backprop.
00:31:31.680 | And they were filed in '89 and 1990
00:31:35.280 | or something like this.
00:31:36.200 | At the time, the life of a patent was 17 years.
00:31:39.320 | So here's what happened over the next few years
00:31:42.040 | is that we started developing
00:31:44.640 | character recognition technology around convolutional nets.
00:31:48.600 | And in 1994, a check reading system
00:31:53.400 | was deployed in ATM machines.
00:31:56.160 | In 1995, it was for large check reading machines
00:31:59.040 | in back offices, et cetera.
00:32:00.640 | And those systems were developed by an engineering group
00:32:04.800 | that we were collaborating with at AT&T.
00:32:06.960 | And they were commercialized by NCR,
00:32:08.600 | which at the time was a subsidiary of AT&T.
00:32:11.600 | Now AT&T split up in 1996, early 1996.
00:32:16.600 | And the lawyers just looked at all the patents
00:32:20.400 | and they distributed the patents among the various companies.
00:32:22.960 | They gave the convolutional net patent to NCR
00:32:26.400 | because they were actually selling products that used it.
00:32:29.200 | But nobody at NCR had any idea what a convolutional net was.
00:32:32.280 | - Yeah.
00:32:33.200 | - Okay.
00:32:34.040 | So between 1996 and 2007,
00:32:36.720 | so there's a whole period until 2002
00:32:39.880 | where I didn't actually work on machine learning
00:32:42.000 | or convolutional net.
00:32:42.840 | I resumed working on this around 2002.
00:32:44.880 | And between 2002 and 2007, I was working on them,
00:32:48.840 | crossing my finger that nobody at NCR would notice.
00:32:51.120 | And nobody noticed.
00:32:52.040 | - Yeah, and I hope that this kind of somewhat,
00:32:55.640 | as you said, lawyers aside,
00:32:58.320 | relative openness of the community now will continue.
00:33:02.920 | - It accelerates the entire progress of the industry.
00:33:05.960 | And the problems that Facebook and Google
00:33:10.960 | and others are facing today
00:33:13.000 | is not whether Facebook or Google or Microsoft or IBM
00:33:15.960 | or whoever is ahead of the other,
00:33:18.080 | is that we don't have the technology
00:33:19.680 | to build the things we want to build.
00:33:21.080 | We want to build intelligent virtual assistants
00:33:23.200 | that have common sense.
00:33:24.960 | We don't have monopoly on good ideas for this.
00:33:26.720 | We don't believe we do.
00:33:27.960 | Maybe others believe they do, but we don't.
00:33:30.440 | Okay.
00:33:31.320 | If a startup tells you they have the secret
00:33:33.840 | to human level intelligence and common sense,
00:33:36.880 | don't believe them.
00:33:37.720 | They don't.
00:33:38.560 | And it's going to take the entire work
00:33:42.760 | of the world research community for a while
00:33:45.240 | to get to the point where you can go off
00:33:47.560 | and each of those companies
00:33:49.200 | is going to start to build things on this.
00:33:50.600 | We're not there yet.
00:33:51.760 | - It's absolutely.
00:33:52.600 | This speaks to the gap between the space of ideas
00:33:57.000 | and the rigorous testing of those ideas
00:34:00.440 | of practical application that you often speak to.
00:34:03.560 | You've written advice saying,
00:34:05.480 | "Don't get fooled by people who claim
00:34:07.880 | "to have a solution to artificial general intelligence,
00:34:10.480 | "who claim to have an AI system
00:34:11.880 | "that works just like the human brain
00:34:14.220 | "or who claim to have figured out how the brain works.
00:34:17.000 | "Ask them what the error rate they get
00:34:20.920 | "on MNIST or ImageNet."
00:34:23.120 | - Yeah, this is a little dated by the way.
00:34:24.680 | (laughs)
00:34:25.920 | - I mean, five years, who's counting?
00:34:28.280 | Okay.
00:34:29.120 | But I think your opinion is still MNIST and ImageNet,
00:34:32.320 | yes, may be dated.
00:34:34.880 | There may be new benchmarks, right?
00:34:36.320 | But I think that philosophy is one you still
00:34:39.320 | and somewhat hold that benchmarks
00:34:43.360 | and the practical testing, the practical application
00:34:45.720 | is where you really get to test the ideas.
00:34:47.960 | - Well, it may not be completely practical.
00:34:49.800 | Like for example, it could be a toy dataset,
00:34:52.440 | but it has to be some sort of task
00:34:54.840 | that the community as a whole has accepted
00:34:57.280 | as some sort of standard kind of benchmark, if you want.
00:35:00.600 | It doesn't need to be real.
00:35:01.440 | So for example, many years ago here at FAIR,
00:35:04.280 | people, Jason Weston, Antoine Bordes and a few others
00:35:07.640 | proposed the Babi tasks,
00:35:09.040 | which were kind of a toy problem to test
00:35:12.240 | the ability of machines to reason actually
00:35:14.320 | to access working memory and things like this.
00:35:16.920 | And it was very useful, even though it wasn't a real task.
00:35:20.080 | MNIST is kind of halfway real task.
00:35:22.660 | So, toy problems can be very useful.
00:35:26.040 | It's just that I was really struck by the fact that
00:35:29.560 | a lot of people, particularly a lot of people
00:35:31.120 | with money to invest would be fooled by people telling them,
00:35:34.400 | oh, we have the algorithm of the cortex
00:35:37.400 | and you should give us 50 million.
00:35:39.360 | - Yes, absolutely.
00:35:40.200 | So there's a lot of people who try to take advantage
00:35:45.280 | of the hype for business reasons and so on.
00:35:48.240 | But let me sort of talk to this idea
00:35:50.800 | that sort of new ideas,
00:35:53.840 | the ideas that push the field forward
00:35:56.120 | may not yet have a benchmark,
00:35:58.620 | or it may be very difficult to establish a benchmark.
00:36:00.880 | - I agree.
00:36:01.720 | That's part of the process.
00:36:02.560 | Establishing benchmarks is part of the process.
00:36:04.600 | - So what are your thoughts about,
00:36:07.300 | so we have these benchmarks on,
00:36:09.620 | around stuff we can do with images,
00:36:12.280 | from classification to captioning
00:36:14.920 | to just every kind of information you can pull off
00:36:16.920 | from images and the surface level.
00:36:18.860 | There's audio data sets, there's some video.
00:36:21.420 | What can we start, natural language,
00:36:24.940 | what kind of benchmarks do you see that start creeping
00:36:30.160 | on to more something like intelligence,
00:36:33.600 | like reasoning, like maybe you don't like the term,
00:36:37.420 | but AGI, echoes of that kind of formulation?
00:36:41.520 | - A lot of people are working on interactive environments
00:36:44.160 | in which you can train and test intelligent systems.
00:36:48.120 | So there, for example,
00:36:50.200 | the classical paradigm of supervised learning
00:36:56.160 | is that you have a data set,
00:36:57.960 | you partition it into a training set,
00:36:59.440 | validation set, test set,
00:37:00.440 | and there's a clear protocol, right?
00:37:03.040 | But what if, that assumes that the samples
00:37:06.400 | are statistically independent,
00:37:08.880 | you can exchange them,
00:37:10.100 | the order in which you see them shouldn't matter,
00:37:12.240 | things like that.
00:37:13.480 | But what if the answer you give
00:37:15.520 | determines the next sample you see,
00:37:17.560 | which is the case, for example, in robotics, right?
00:37:19.560 | Your robot does something
00:37:21.160 | and then it gets exposed to a new room
00:37:23.620 | and depending on where it goes,
00:37:25.120 | the room would be different.
00:37:26.000 | So that creates the exploration problem.
00:37:28.440 | The what if the samples,
00:37:32.960 | so that creates also a dependency between samples, right?
00:37:35.760 | If you can only move in space,
00:37:39.620 | the next sample you're gonna see
00:37:40.960 | is gonna be probably in the same building, most likely.
00:37:44.080 | So all the assumptions about the validity
00:37:47.920 | of this training set, test set hypothesis break
00:37:51.640 | whenever a machine can take an action
00:37:53.120 | that has an influence in the world
00:37:54.960 | and it's what it's gonna see.
00:37:56.400 | So people are setting up artificial environments
00:38:00.160 | where that takes place, right?
00:38:02.080 | The robot runs around a 3D model of a house
00:38:05.840 | and can interact with objects and things like this.
00:38:08.680 | So you do robotics by simulation,
00:38:10.380 | you have those, you know,
00:38:11.760 | OpenAI Gym type thing
00:38:14.400 | or MuJoCo kind of simulated robots
00:38:18.800 | and you have games, you know, things like that.
00:38:21.280 | So that's where the field is going really,
00:38:23.640 | this kind of environment.
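
A tiny illustration of why the train/validation/test protocol breaks in these settings: supervised samples are independent and can be shuffled into fixed splits, but in an interactive environment the action taken determines the next observation, so successive samples are correlated. The "house" below is made up purely for illustration.

```python
import random

# Supervised setting: independent samples, order doesn't matter, fixed splits work.
dataset = [(x, x % 2) for x in range(1000)]
random.shuffle(dataset)
train, test = dataset[:800], dataset[800:]   # the classic split protocol

# Interactive setting: the next observation depends on the action taken,
# so samples are no longer independent. This toy "house" of 10 rooms in a row
# is purely illustrative.
def step(room, action):
    return max(0, room - 1) if action == "left" else min(9, room + 1)

room, trajectory = 5, []
for _ in range(20):
    action = random.choice(["left", "right"])  # the exploration problem lives here
    room = step(room, action)
    trajectory.append(room)
print(trajectory)  # successive observations are correlated: mostly nearby rooms
```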
00:38:24.880 | Now, back to the question of AGI,
00:38:28.320 | like I don't like the term AGI
00:38:29.840 | because it implies that human intelligence is general
00:38:35.760 | and human intelligence is nothing like general.
00:38:38.360 | It's very, very specialized.
00:38:40.860 | We think it's general, we like to think of ourselves
00:38:42.740 | as having general intelligence, we don't.
00:38:44.260 | We're very specialized.
00:38:46.100 | We're only slightly more general than--
00:38:47.540 | - Why does it feel general?
00:38:48.900 | So you kind of, the term general,
00:38:52.060 | I think what's impressive about humans
00:38:54.220 | is ability to learn, as we were talking about learning,
00:38:58.260 | to learn in just so many different domains.
00:39:01.260 | It's perhaps not arbitrarily general,
00:39:04.420 | but just you can learn in many domains
00:39:06.440 | and integrate that knowledge somehow.
00:39:08.220 | - Okay. - The knowledge persists.
00:39:09.860 | - So let me take a very specific example.
00:39:12.220 | It's not an example, it's more like
00:39:13.980 | a quasi-mathematical demonstration.
00:39:17.100 | So you have about one million fibers
00:39:18.520 | coming out of one of your eyes, okay, two million total,
00:39:21.340 | but let's talk about just one of them.
00:39:23.440 | It's one million nerve fibers, your optical nerve.
00:39:26.040 | Let's imagine that they are binary,
00:39:28.800 | so they can be active or inactive, right?
00:39:30.640 | So the input to your visual cortex is one million bits.
00:39:36.880 | Now, they're connected to your brain in a particular way,
00:39:39.400 | and your brain has connections
00:39:41.940 | that are kind of a little bit like a convolutional net,
00:39:44.160 | they're kind of local in space and things like this.
00:39:47.960 | Now imagine I play a trick on you.
00:39:49.660 | It's a pretty nasty trick, I admit.
00:39:53.040 | I cut your optical nerve,
00:39:55.720 | and I put a device that makes a random perturbation,
00:39:59.120 | a permutation of all the nerve fibers.
00:40:01.120 | So now what comes to your brain
00:40:04.600 | is a fixed but random permutation of all the pixels.
00:40:07.820 | There's no way in hell that your visual cortex,
00:40:11.360 | even if I do this to you in infancy,
00:40:14.760 | will actually learn vision
00:40:16.520 | to the same level of quality that you can.
00:40:20.040 | - Got it, and you're saying
00:40:21.120 | there's no way you've relearned that?
00:40:22.680 | - No, because now two pixels that are nearby in the world
00:40:25.640 | will end up in very different places in your visual cortex,
00:40:29.240 | and your neurons there have no connections with each other
00:40:31.640 | because they're only connected locally.
00:40:33.480 | - So this whole, our entire,
00:40:35.040 | the hardware is built in many ways to support--
00:40:38.600 | - The locality of the real world.
00:40:39.720 | - Yeah.
00:40:40.560 | - Yes, that's specialization.
00:40:42.600 | - Yeah, but it's still pretty damn impressive.
00:40:44.600 | So it's not perfect generalization.
00:40:46.240 | It's not even close.
00:40:47.080 | - No, no, it's not that it's not even close.
00:40:49.960 | It's not at all.
00:40:50.960 | - Yeah, it's not.
00:40:51.800 | It's specialized, yeah.
00:40:52.620 | - So how many Boolean functions,
00:40:54.040 | so let's imagine you want to train your visual system
00:40:58.280 | to recognize particular patterns of those one million bits.
00:41:03.800 | Okay, so that's a Boolean function, right?
00:41:05.760 | Either the pattern is here or not here.
00:41:07.040 | It's a two-way classification
00:41:09.200 | with one million binary inputs.
00:41:11.680 | How many such Boolean functions are there?
00:41:16.280 | Okay, you have two to the one million
00:41:19.040 | combinations of inputs.
00:41:21.200 | For each of those, you have an output bit.
00:41:24.080 | And so you have two to the two to the one million
00:41:26.840 | Boolean functions of this type, okay?
00:41:30.040 | Which is an unimaginably large number.
00:41:33.020 | How many of those functions can actually be computed
00:41:35.560 | by your visual cortex?
00:41:37.240 | And the answer is a tiny, tiny, tiny, tiny, tiny, tiny sliver
00:41:41.440 | like an enormously tiny sliver.
00:41:43.520 | - Yeah, yeah.
00:41:45.000 | - So we are ridiculously specialized.
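
The arithmetic behind this is easy to reproduce: with n binary inputs there are 2^n possible input patterns, and a Boolean function assigns one output bit to each pattern, so there are 2^(2^n) distinct functions (n is one million in his example; the tiny values below just show how fast that double exponential grows).

```python
# Counting Boolean functions of n binary inputs: 2**n input patterns,
# one output bit per pattern, hence 2**(2**n) distinct functions.
for n in range(1, 6):
    patterns = 2 ** n
    functions = 2 ** patterns
    print(f"n={n}: {patterns} input patterns, {functions} Boolean functions")

# Already at n=5 there are 2**32 (about 4.3 billion) such functions; an
# architecture wired for locality can realize only a tiny sliver of them.
```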
00:41:47.300 | - But, okay.
00:41:49.880 | (laughing)
00:41:51.480 | Okay, that's an argument against the word general.
00:41:54.200 | I think there's a,
00:41:55.540 | I agree with your intuition, but I'm not sure it's,
00:42:00.960 | it seems the brain is impressively
00:42:04.780 | capable of adjusting to things, so.
00:42:09.620 | - It's because we can't imagine tasks
00:42:13.380 | that are outside of our comprehension, right?
00:42:16.300 | So we think we are general
00:42:18.020 | because we are general of all the things
00:42:19.260 | that we can apprehend.
00:42:20.740 | But there is a huge world out there
00:42:22.980 | of things that we have no idea.
00:42:24.700 | We call that heat, by the way.
00:42:26.820 | - Heat. - Heat.
00:42:27.740 | So, at least physicists call that heat,
00:42:30.620 | or they call it entropy, which is kind of,
00:42:33.360 | you have a thing full of gas, right?
00:42:38.360 | - Closed system full of gas.
00:42:40.700 | - Right?
00:42:41.700 | Closed or not closed.
00:42:42.580 | It has pressure, it has temperature,
00:42:47.580 | it has, and you can write equations,
00:42:51.740 | PV equals nRT, things like that, right?
00:42:54.140 | When you reduce the volume, the temperature goes up,
00:42:57.380 | the pressure goes up, things like that, right?
00:43:00.340 | For a perfect gas, at least.
00:43:02.180 | Those are the things you can know about that system.
00:43:05.460 | And it's a tiny, tiny number of bits
00:43:07.020 | compared to the complete information
00:43:09.360 | of the state of the entire system.
00:43:10.780 | Because the state of the entire system
00:43:12.180 | will give you the position and momentum
00:43:13.740 | of every molecule of the gas.
00:43:16.720 | And what you don't know about it is the entropy,
00:43:20.900 | and you interpret it as heat.
00:43:23.980 | The energy contained in that thing is what we call heat.
00:43:28.020 | Now, it's very possible that, in fact,
00:43:32.260 | there is some very strong structure
00:43:33.700 | in how those molecules are moving.
00:43:35.020 | It's just that they are in a way
00:43:36.460 | that we are just not wired to perceive.
00:43:39.100 | - Yeah, we're ignorant to it.
00:43:40.120 | And there's an infinite amount of things
00:43:43.880 | we're not wired to perceive.
00:43:45.420 | And you're right, that's a nice way to put it.
00:43:47.340 | We're general to all the things we can imagine,
00:43:50.220 | which is a very tiny subset of all things that are possible.
00:43:54.860 | - So it's like Kolmogorov complexity
00:43:56.300 | or the Kolmogorov-Chaitin-Solomonoff complexity.
00:43:58.660 | Every bit string or every integer is random,
00:44:05.580 | except for all the ones that you can actually write down.
00:44:08.140 | (both laughing)
00:44:11.580 | - Yeah, okay, so beautifully put.
00:44:13.500 | So we can just call it artificial intelligence.
00:44:15.420 | We don't need to have a general.
00:44:16.980 | - Or human level.
00:44:18.780 | Human level intelligence is good.
00:44:23.340 | Anytime you touch human, it gets interesting
00:44:26.580 | because we attach ourselves to human
00:44:31.580 | and it's difficult to define what human intelligence is.
00:44:36.180 | Nevertheless, my definition is maybe
00:44:39.940 | damn impressive intelligence, okay?
00:44:43.900 | Damn impressive demonstration of intelligence, whatever.
00:44:46.700 | And so on that topic, most successes in deep learning
00:44:51.420 | have been in supervised learning.
00:44:53.700 | What is your view on unsupervised learning?
00:44:57.860 | Is there a hope to reduce involvement of human input
00:45:02.860 | and still have successful systems
00:45:05.620 | that have practical use?
00:45:08.260 | - Yeah, I mean, there's definitely a hope.
00:45:09.900 | It's more than a hope, actually.
00:45:11.180 | It's mounting evidence for it.
00:45:13.900 | And that's basically all I do.
00:45:16.620 | The only thing I'm interested in at the moment is,
00:45:19.100 | I call it self-supervised learning, not unsupervised,
00:45:21.260 | 'cause unsupervised learning is a loaded term.
00:45:24.020 | People who know something about machine learning
00:45:28.100 | tell you, so you're doing clustering or PCA,
00:45:30.660 | which is not the case.
00:45:31.580 | And the wide public, when you say unsupervised learning,
00:45:33.660 | oh my God, machines are gonna learn by themselves
00:45:35.580 | and without supervision.
00:45:36.820 | - Where's the parents?
00:45:40.820 | - Yeah, so I call it self-supervised learning
00:45:42.940 | because in fact, the underlying algorithms that are used
00:45:46.140 | are the same algorithms
00:45:47.340 | as the supervised learning algorithms,
00:45:50.300 | except that what we train them to do
00:45:52.340 | is not predict a particular set of variables,
00:45:55.540 | like the category of an image,
00:45:58.620 | and not to predict a set of variables
00:46:02.580 | that have been provided by human labelers.
00:46:06.420 | But what you train the machine to do is basically
00:46:08.580 | reconstruct a piece of its input
00:46:10.300 | that is being masked out, essentially.
00:46:14.140 | You can think of it this way, right?
00:46:15.660 | So show a piece of video to a machine
00:46:18.820 | and ask it to predict what's gonna happen next.
00:46:20.980 | And of course, after a while, you can show what happens
00:46:23.820 | and the machine will kind of train itself
00:46:26.260 | to do better at that task.
00:46:27.540 | You can do, like all the latest, most successful models
00:46:32.260 | in natural language processing
00:46:33.300 | use self-supervised learning.
00:46:34.820 | You know, sort of BERT-style systems, for example, right?
00:46:38.700 | You show it a window of a dozen words from a text corpus,
00:46:43.540 | you take out 15% of the words,
00:46:46.340 | and then you train a machine
00:46:48.300 | to predict the words that are missing.
00:46:51.420 | That's self-supervised learning.
00:46:52.860 | It's not predicting the future,
00:46:54.060 | it's just predicting things in the middle,
00:46:56.340 | but you could have it predict the future,
00:46:57.900 | that's what language models do.
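
A rough sketch of the masking recipe just described; this is an illustrative toy, not actual BERT training code, and the split-on-spaces "tokenizer" and 15% ratio are simplifying assumptions:

```python
import random

MASK = "[MASK]"

def mask_words(words, mask_ratio=0.15, seed=0):
    """Hide ~15% of the words; the hidden words become the training targets."""
    rng = random.Random(seed)
    n_to_mask = max(1, int(len(words) * mask_ratio))
    positions = rng.sample(range(len(words)), n_to_mask)
    corrupted, targets = list(words), {}
    for pos in positions:
        targets[pos] = corrupted[pos]   # label = the original word
        corrupted[pos] = MASK           # input = the word hidden
    return corrupted, targets

sentence = "the trophy does not fit in the suitcase because it is too big".split()
corrupted, targets = mask_words(sentence)
print(" ".join(corrupted))
print(targets)
# Training then minimizes cross-entropy between the model's probability
# vector over the vocabulary and the held-out word at each masked position.
```
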
00:46:59.540 | - So you construct, so in an unsupervised way,
00:47:01.820 | you construct a model of language.
00:47:04.020 | Do you think-
00:47:05.100 | - Or video, or the physical world, or whatever, right?
00:47:09.180 | - How far do you think that can take us?
00:47:12.660 | Do you think BERT understands anything?
00:47:16.460 | - To some level.
00:47:18.900 | It has a shallow understanding of text,
00:47:23.500 | but it needs to, I mean,
00:47:24.780 | to have kind of true human level intelligence,
00:47:26.860 | I think you need to ground language in reality.
00:47:29.260 | So some people are attempting to do this, right?
00:47:32.820 | Having systems that kind of have some visual representation
00:47:35.500 | of what is being talked about,
00:47:37.460 | which is one reason you need
00:47:38.620 | those interactive environments, actually.
00:47:41.100 | But it's like a huge technical problem that is not solved,
00:47:45.100 | and that explains why self-supervised learning works
00:47:48.180 | in the context of natural language,
00:47:50.020 | but does not work in the context,
00:47:51.500 | or at least not well,
00:47:52.780 | in the context of image recognition and video,
00:47:55.420 | although it's making progress quickly.
00:47:57.860 | And the reason, that reason is the fact that
00:48:00.700 | it's much easier to represent uncertainty in the prediction
00:48:05.340 | in the context of natural language
00:48:06.940 | than it is in the context of things like video and images.
00:48:10.140 | So for example, if I ask you to predict
00:48:12.980 | what words are missing,
00:48:13.940 | you know, 15% of the words that are taken out.
00:48:16.300 | - The possibility is just small.
00:48:19.180 | I mean- - It's small, right?
00:48:20.060 | There is a hundred thousand words in the lexicon,
00:48:23.340 | and what the machine spits out
00:48:24.860 | is a big probability vector, right?
00:48:27.660 | It's a bunch of numbers between zero and one
00:48:29.700 | that sum to one.
00:48:30.780 | And we know how to do this with computers.
00:48:33.140 | So there, representing uncertainty in the prediction
00:48:36.980 | is relatively easy,
00:48:37.860 | and that's, in my opinion,
00:48:39.180 | why those techniques work for NLP.
00:48:42.500 | For images, if you ask,
00:48:45.540 | if you block a piece of an image
00:48:46.940 | and you ask the system,
00:48:47.780 | reconstruct that piece of the image,
00:48:49.220 | there are many possible answers
00:48:50.780 | that are all perfectly legit, right?
00:48:54.660 | And how do you represent that,
00:48:57.060 | this set of possible answers?
00:48:58.780 | You can't train a system to make one prediction.
00:49:00.940 | You can't train a neural net to say,
00:49:02.500 | here it is, that's the image,
00:49:04.660 | because there's a whole set of things
00:49:06.460 | that are compatible with it.
00:49:07.300 | So how do you get the machine to represent
00:49:08.820 | not a single output,
00:49:09.740 | but a whole set of outputs?
00:49:11.100 | And similarly with video prediction,
00:49:17.300 | there's a lot of things that can happen
00:49:19.300 | in the future of a video.
00:49:20.180 | You're looking at me right now,
00:49:21.220 | I'm not moving my head very much,
00:49:22.820 | but I might turn my head to the left or to the right.
00:49:27.020 | If you don't have a system that can predict this,
00:49:29.420 | and you train it with least squares
00:49:31.820 | to kind of minimize the error with a prediction
00:49:33.780 | on what I'm doing,
00:49:34.740 | what you get is a blurry image of myself
00:49:37.020 | in all possible future positions that I might be in,
00:49:39.700 | which is not a good prediction.
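
A tiny numerical sketch of why least-squares training washes out ambiguous futures; the two vectors standing in for "head turned left/right" are made-up toy data:

```python
import numpy as np

left = np.array([1.0, 0.0])    # stand-in for "head turned left"
right = np.array([0.0, 1.0])   # stand-in for "head turned right"
futures = np.stack([left, right])   # two equally likely outcomes

def expected_mse(prediction, outcomes):
    """Average squared error of one fixed prediction over all outcomes."""
    return np.mean(np.sum((outcomes - prediction) ** 2, axis=1))

blurry_mean = futures.mean(axis=0)         # [0.5, 0.5]
print(expected_mse(blurry_mean, futures))  # 0.5 -- the lowest achievable
print(expected_mse(left, futures))         # 1.0 -- committing to one mode costs more
# The MSE-optimal point prediction is the mean of all plausible futures,
# which is why pixel-wise L2 training produces blurred frames.
```
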
00:49:41.580 | - But so there might be other ways
00:49:43.420 | to do the self-supervision, right?
00:49:45.660 | For visual scenes.
00:49:48.100 | - Like what?
00:49:48.940 | (laughs)
00:49:49.780 | - If I knew, I wouldn't tell you.
00:49:52.740 | Publish it first, I don't know.
00:49:54.300 | - No, there might be.
00:49:56.700 | - So, I mean, these are kind of,
00:49:59.340 | there might be artificial ways of like self-play in games
00:50:03.260 | to where you can simulate part of the environment.
00:50:05.060 | You can-
00:50:05.900 | - Oh, that doesn't solve the problem.
00:50:06.820 | It's just a way of generating data.
00:50:08.580 | - But because you have more of a control,
00:50:12.580 | like maybe you can control,
00:50:14.620 | yeah, it's a way to generate data.
00:50:16.100 | That's right.
00:50:16.940 | And because you can do huge amounts of data generation,
00:50:20.500 | that doesn't, you're right.
00:50:21.580 | Well, it creeps up on the problem from the side of data,
00:50:26.020 | and you don't think that's the right way to creep up.
00:50:27.700 | - It doesn't solve this problem
00:50:28.940 | of handling uncertainty in the world, right?
00:50:30.980 | So if you have a machine learn a predictive model
00:50:35.260 | of the world in a game that is deterministic
00:50:38.180 | or quasi-deterministic, it's easy, right?
00:50:42.540 | Just give a few frames of the game to a ConvNet,
00:50:45.100 | put a bunch of layers,
00:50:47.020 | and then have the game generate the next few frames.
00:50:49.660 | And if the game is deterministic, it works fine.
00:50:52.380 | And that includes feeding the system with the action
00:50:59.140 | that your little character is gonna take.
00:51:03.060 | The problem comes from the fact that the real world
00:51:06.660 | and most games are not entirely predictable.
00:51:09.700 | And so there you get those blurry predictions,
00:51:11.340 | and you can't do planning with blurry predictions.
00:51:14.300 | Right, so if you have a perfect model of the world,
00:51:17.420 | you can, in your head, run this model
00:51:20.700 | with a hypothesis for a sequence of actions,
00:51:24.060 | and you're going to predict the outcome
00:51:25.340 | of that sequence of actions.
00:51:26.740 | But if your model is imperfect, how can you plan?
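
For the (quasi-)deterministic case described above, a minimal PyTorch sketch of a frame predictor that takes a stack of past frames plus the chosen action; all shapes, layer sizes, and the action-embedding trick are arbitrary assumptions for illustration:

```python
import torch
import torch.nn as nn

class FramePredictor(nn.Module):
    """Predicts the next frame from a few past frames and a discrete action."""
    def __init__(self, n_past_frames=4, n_actions=6, channels=1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(n_past_frames * channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.action_embed = nn.Embedding(n_actions, 64)
        self.decoder = nn.Sequential(
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, past_frames, action):
        h = self.encoder(past_frames)                      # (B, 64, H, W)
        a = self.action_embed(action)[:, :, None, None]    # broadcast over H, W
        return self.decoder(h + a)                         # predicted next frame

model = FramePredictor()
frames = torch.rand(2, 4, 64, 64)            # batch of 4 stacked grayscale frames
action = torch.tensor([0, 3])
pred = model(frames, action)                 # (2, 1, 64, 64)
loss = nn.functional.mse_loss(pred, torch.rand(2, 1, 64, 64))
loss.backward()
# Fine when the environment is deterministic; with genuinely stochastic
# futures the same L2 loss gives the blurry averages discussed above.
```
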
00:51:32.420 | - Yeah, it quickly explodes.
00:51:33.900 | What are your thoughts on the extension of this,
00:51:37.260 | which topic I'm super excited about,
00:51:39.660 | it's connected to something you were talking about
00:51:41.340 | in terms of robotics, is active learning.
00:51:44.580 | So as opposed to sort of completely unsupervised
00:51:47.860 | or self-supervised learning,
00:51:49.720 | you ask the system for human help
00:51:53.780 | for selecting parts you want annotated next.
00:51:58.100 | So if you think about a robot exploring a space
00:52:00.660 | or a baby exploring a space,
00:52:02.420 | or a system exploring a dataset,
00:52:05.260 | every once in a while asking for human input,
00:52:07.940 | do you see value in that kind of work?
00:52:12.180 | - I don't see transformative value.
00:52:14.180 | It's going to make things that we can already do
00:52:16.780 | more efficient, or they will learn slightly more efficiently,
00:52:20.780 | but it's not going to make machines
00:52:21.900 | sort of significantly more intelligent.
00:52:23.900 | And by the way, there is no opposition,
00:52:29.340 | there's no conflict between self-supervised learning,
00:52:34.340 | reinforcement learning, and supervised learning,
00:52:35.980 | or imitation learning, or active learning.
00:52:38.020 | I see self-supervised learning as a preliminary
00:52:42.380 | to all of the above.
00:52:43.820 | - Yes.
00:52:44.660 | - So the example I use very often is,
00:52:48.060 | how is it that, so if you use
00:52:51.380 | classical reinforcement learning,
00:52:54.540 | deep reinforcement learning, if you want,
00:52:57.540 | the best methods today,
00:52:59.220 | so-called model-free reinforcement learning,
00:53:03.020 | to learn to play Atari games,
00:53:04.580 | take about 80 hours of training to reach the level
00:53:07.780 | that any human can reach in about 15 minutes.
00:53:10.020 | They get better than humans, but it takes them a long time.
00:53:15.220 | AlphaStar, okay, the, you know,
00:53:21.020 | Oriol Vinyals and his team's system to play StarCraft,
00:53:27.020 | plays, you know, a single map, a single type of player,
00:53:32.020 | and can reach better than human level
00:53:38.780 | with about the equivalent of 200 years of training
00:53:43.340 | playing against itself.
00:53:45.260 | It's 200 years, right?
00:53:46.380 | It's something that no human could ever do.
00:53:50.060 | - I mean, I'm not sure what lesson to take away from that.
00:53:52.300 | - Okay, now take those algorithms,
00:53:54.780 | the best RL algorithms we have today,
00:53:57.340 | to train a car to drive itself.
00:54:00.180 | It would probably have to drive millions of hours.
00:54:03.940 | It will have to kill thousands of pedestrians.
00:54:05.660 | It will have to run into thousands of trees.
00:54:07.380 | It will have to run off cliffs.
00:54:09.460 | And it would have to run off cliffs multiple times
00:54:11.620 | before it figures out that it's a bad idea, first of all.
00:54:15.140 | And second of all, before it figures out how not to do it.
00:54:18.460 | And so, I mean, this type of learning, obviously,
00:54:20.900 | does not reflect the kind of learning
00:54:22.380 | that animals and humans do.
00:54:24.220 | There is something missing
00:54:25.300 | that's really, really important there.
00:54:27.340 | And my hypothesis,
00:54:28.700 | which I've been advocating for like five years now,
00:54:31.380 | is that we have predictive models of the world
00:54:34.660 | that include the ability to predict under uncertainty.
00:54:39.620 | And what allows us to not run off a cliff
00:54:44.620 | when we learn to drive,
00:54:45.780 | most of us can learn to drive
00:54:46.900 | in about 20 or 30 hours of training
00:54:48.620 | without ever crashing, causing any accident.
00:54:52.020 | And if we drive next to a cliff,
00:54:54.300 | we know that if we turn the wheel to the right,
00:54:56.220 | the car is gonna run off the cliff
00:54:58.140 | and nothing good is gonna come out of this.
00:54:59.860 | Because we have a pretty good model of intuitive physics
00:55:01.660 | that tells us the car is gonna fall.
00:55:03.340 | We know about gravity.
00:55:05.300 | Babies learn this around the age of eight or nine months,
00:55:08.180 | that objects don't float, they fall.
00:55:10.980 | And we have a pretty good idea
00:55:14.180 | of the effect of turning the wheel of the car.
00:55:16.060 | And we know we need to stay on the road.
00:55:18.060 | So there's a lot of things that we bring to the table,
00:55:20.620 | which is basically our predictive model of the world.
00:55:23.500 | And that model allows us to not do stupid things
00:55:26.940 | and to basically stay within the context
00:55:29.340 | of things we need to do.
00:55:31.100 | We still face unpredictable situations
00:55:33.740 | and that's how we learn,
00:55:35.260 | but that allows us to learn really, really, really quickly.
00:55:38.780 | So that's called model-based reinforcement learning.
00:55:41.340 | There's some imitation and supervised learning
00:55:44.180 | because we have a driving instructor
00:55:46.060 | that tells us occasionally what to do.
00:55:47.980 | But most of the learning
00:55:50.180 | is learning the model, learning physics
00:55:54.740 | that we've done since we were babies.
00:55:56.380 | That's where almost all the learning-
00:55:58.100 | - And the physics is somewhat transferable from,
00:56:01.300 | is transferable from scene to scene.
00:56:03.180 | Stupid things are the same everywhere.
00:56:05.540 | - Yeah.
00:56:06.380 | I mean, if you have experience of the world,
00:56:09.060 | you don't need to be from a particularly intelligent species
00:56:12.620 | to know that if you spill water from a container,
00:56:16.140 | the rest is gonna get wet.
00:56:20.020 | And you might get wet.
00:56:21.860 | So, you know, cats know this, right?
00:56:24.260 | - Yeah.
00:56:25.100 | - So the main problem we need to solve
00:56:27.060 | is how do we learn models of the world?
00:56:29.900 | That's, and that's what I'm interested in.
00:56:31.260 | That's what self-supervised learning is all about.
00:56:34.060 | - If you were to try to construct a benchmark for,
00:56:37.380 | let's look at MNIST.
00:56:41.100 | I love that dataset.
00:56:42.300 | Do you think it's useful, interesting/possible
00:56:48.020 | to perform well on MNIST
00:56:51.260 | with just one example of each digit?
00:56:53.900 | And how would we solve that problem?
00:56:57.460 | - The answer is probably yes.
00:56:59.540 | The question is what other type of learning
00:57:02.380 | are you allowed to do?
00:57:03.220 | So if what you're allowed to do
00:57:04.300 | is train on some gigantic dataset of labeled digit,
00:57:07.340 | that's called transfer learning.
00:57:08.820 | And we know that works, okay?
00:57:10.540 | We do this at Facebook, like in production, right?
00:57:13.500 | We train large convolutional nets
00:57:15.860 | to predict hashtags that people type on Instagram.
00:57:18.180 | And we train on billions of images, literally billions.
00:57:20.940 | And then we chop off the last layer
00:57:22.940 | and fine tune on whatever task we want.
00:57:24.900 | That works really well.
00:57:26.340 | You can beat the ImageNet record with this.
00:57:28.740 | We actually open-sourced the whole thing
00:57:30.500 | like a few weeks ago.
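
A minimal sketch of the "chop off the last layer and fine-tune" recipe; a generic ImageNet-pretrained ResNet from a recent torchvision release stands in here for the hashtag-pretrained backbone, and the 10-class head and dummy batch are placeholders:

```python
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Linear(backbone.fc.in_features, 10)   # new head for the target task

# Optionally freeze everything except the new head for cheap fine-tuning.
for name, param in backbone.named_parameters():
    param.requires_grad = name.startswith("fc.")

optimizer = torch.optim.SGD(
    [p for p in backbone.parameters() if p.requires_grad], lr=1e-3, momentum=0.9
)
criterion = nn.CrossEntropyLoss()

# One dummy fine-tuning step on the downstream task.
images = torch.rand(8, 3, 224, 224)
labels = torch.randint(0, 10, (8,))
loss = criterion(backbone(images), labels)
loss.backward()
optimizer.step()
```
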
00:57:31.780 | - Yeah, that's still pretty cool.
00:57:33.340 | But yeah, so what would be impressive?
00:57:35.940 | And what's useful and impressive?
00:57:38.180 | What kind of transfer learning would be useful and impressive?
00:57:40.300 | Is it Wikipedia?
00:57:41.740 | That kind of thing?
00:57:42.580 | - No, no, so I don't think transfer learning
00:57:44.980 | is really where we should focus.
00:57:46.220 | We should try to do,
00:57:48.020 | you know, have a kind of scenario for benchmark
00:57:52.180 | where you have unlabeled data
00:57:54.500 | and you can, and it's very large number of unlabeled data.
00:57:59.380 | It could be video clips.
00:58:02.060 | It could be where you do, you know, frame prediction.
00:58:04.820 | It could be images where you could choose to,
00:58:07.260 | you know, mask a piece of it.
00:58:09.620 | Could be whatever, but they're unlabeled
00:58:13.100 | and you're not allowed to label them.
00:58:15.700 | So you do some training on this
00:58:18.780 | and then you train on a particular supervised task,
00:58:23.780 | ImageNet or MNIST,
00:58:27.020 | and you measure how your test error decreases
00:58:30.820 | or validation error decreases
00:58:32.100 | as you increase the number of labeled training samples.
00:58:34.860 | Okay, and what you'd like to see is that,
00:58:42.420 | you know, your error decreases much faster
00:58:44.860 | than if you train from scratch, from random weights.
00:58:47.460 | So that to reach the same level of performance
00:58:50.500 | in a completely supervised, purely supervised system
00:58:54.140 | would reach, you would need way fewer samples.
00:58:56.420 | So that's the crucial question
00:58:57.700 | because it will answer the question to like, you know,
00:59:00.260 | people interested in medical image analysis.
00:59:02.980 | Okay, you know, if I want to get to a particular
00:59:06.180 | level of error rate for this task,
00:59:08.940 | I know I need a million samples.
00:59:12.140 | Can I do, you know, self-supervised pre-training
00:59:15.340 | to reduce this to about a hundred or something?
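
A sketch of that benchmark protocol: measure validation error as a function of the number of labeled samples, once from random initialization and once from self-supervised pretraining. The pretraining and training routines are left as user-supplied callables; the dummy lambdas below just show the shape of the output:

```python
def evaluate_sample_efficiency(pretrain, train_and_eval,
                               unlabeled_data, labeled_data, val_data,
                               label_budgets=(100, 1_000, 10_000)):
    """Return validation-error curves with and without pretraining."""
    init = pretrain(unlabeled_data)               # self-supervised pretext task
    curves = {"from_scratch": [], "pretrained": []}
    for n in label_budgets:
        subset = labeled_data[:n]
        curves["from_scratch"].append((n, train_and_eval(subset, None, val_data)))
        curves["pretrained"].append((n, train_and_eval(subset, init, val_data)))
    return curves

# Smoke test with dummy callables.
curves = evaluate_sample_efficiency(
    pretrain=lambda data: "pretrained_weights",
    train_and_eval=lambda subset, init, val: 0.5 if init is None else 0.3,
    unlabeled_data=None, labeled_data=list(range(10_000)), val_data=None,
)
print(curves)
# A useful pretraining method reaches the fully-supervised error rate with
# far fewer labels, e.g. matching a million-label baseline with ~100.
```
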
00:59:17.700 | - And you think the answer there
00:59:18.620 | is self-supervised pre-training?
00:59:20.820 | - Yeah, some form, some form of it.
00:59:25.020 | - Telling you, active learning, but you disagree.
00:59:28.500 | - No, it's not useless.
00:59:30.260 | It's just not gonna lead to a quantum leap.
00:59:32.460 | It's just gonna make things that we already do.
00:59:33.980 | - So you're way smarter than me.
00:59:35.540 | I just disagree with you.
00:59:37.500 | But I don't have anything to back that.
00:59:39.340 | It's just intuition.
00:59:40.900 | So I worked with a lot of large scale datasets
00:59:42.980 | and there's something that might be magic
00:59:45.740 | in active learning, but okay.
00:59:47.980 | Now at least I said it publicly.
00:59:49.580 | (laughs)
00:59:50.820 | At least I'm being an idiot publicly.
00:59:52.540 | Okay.
00:59:53.460 | - It's not being an idiot.
00:59:54.300 | It's, you know, working with the data you have.
00:59:56.140 | I mean, certainly people are doing things like,
00:59:58.460 | okay, I have 3000 hours of, you know,
01:00:01.220 | imitation learning for self-driving car,
01:00:03.380 | but most of those are incredibly boring.
01:00:05.340 | What I like is select, you know, 10% of them
01:00:07.940 | that are kind of the most informative.
01:00:09.420 | And with just that, I would probably reach the same.
01:00:12.420 | So it's a weak form of active learning, if you want.
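
A minimal sketch of that weak selection idea: score each sample by the entropy of the model's predicted class distribution and keep the most informative 10%. The probabilities here are random stand-ins for a real model's outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.dirichlet(alpha=np.ones(5), size=1000)   # (n_samples, n_classes)

# Predictive entropy as an uncertainty / informativeness proxy.
entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)

k = int(0.10 * len(probs))
most_informative = np.argsort(entropy)[-k:]          # highest-entropy samples
print(f"selected {len(most_informative)} of {len(probs)} samples")
# These are the samples you would send for annotation (active learning) or
# keep from a large driving log, as in the 10% example above.
```
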
01:00:16.340 | - Yes, but there might be a much stronger version.
01:00:20.140 | - Yeah, that's right.
01:00:20.980 | - That's what, and that's an open question if it exists.
01:00:23.940 | The question is how much stronger it can get.
01:00:26.500 | Elon Musk is confident.
01:00:28.620 | Talked to him recently.
01:00:30.220 | He's confident that large scale data
01:00:32.140 | and deep learning can solve the autonomous driving problem.
01:00:35.100 | What are your thoughts on the limits,
01:00:38.300 | possibilities of deep learning in this space?
01:00:40.820 | - It's obviously part of the solution.
01:00:42.980 | I mean, I don't think we'll ever have a self-driving system,
01:00:45.940 | or at least not in the foreseeable future,
01:00:47.700 | that does not use deep learning.
01:00:49.300 | Let me put it this way.
01:00:50.500 | Now, how much of it?
01:00:51.780 | So in the history of sort of engineering,
01:00:55.100 | particularly sort of AI-like systems,
01:01:00.500 | there's generally a first phase
01:01:01.900 | where everything is built by hand.
01:01:03.020 | Then there is a second phase,
01:01:04.220 | and that was the case for autonomous driving,
01:01:06.300 | you know, 20, 30 years ago.
01:01:08.460 | There's a phase where there's,
01:01:10.140 | a little bit of learning is used,
01:01:11.300 | but there's a lot of engineering that's involved
01:01:13.700 | in kind of, you know, taking care of corner cases
01:01:16.300 | and putting limits, et cetera,
01:01:18.620 | because the learning system is not perfect.
01:01:20.380 | And then as technology progresses,
01:01:22.620 | we end up relying more and more on learning.
01:01:26.020 | That's the history of character recognition,
01:01:27.820 | it's the history of speech recognition,
01:01:29.100 | now computer vision, natural language processing.
01:01:31.380 | And I think the same is going to happen
01:01:33.980 | with autonomous driving, that currently
01:01:37.260 | the methods that are closest to providing
01:01:41.980 | some level of autonomy, some, you know,
01:01:43.820 | decent level of autonomy, where you don't expect
01:01:45.820 | a driver to kind of do anything,
01:01:47.420 | is where you constrain the world.
01:01:50.820 | So you only run within, you know,
01:01:52.580 | 100 square kilometers or square miles in Phoenix,
01:01:55.340 | but the weather is nice and the roads are wide,
01:01:58.580 | which is what Waymo is doing.
01:02:00.220 | You completely over-engineer the car
01:02:03.260 | with tons of lidars and sophisticated sensors
01:02:07.500 | that are too expensive for consumer cars,
01:02:09.260 | but they're fine if you just run a fleet.
01:02:11.300 | And you engineer the hell out of everything else,
01:02:16.380 | you map the entire world,
01:02:17.940 | so you have a complete 3D model of everything.
01:02:20.380 | So the only thing that the perception system
01:02:22.140 | has to take care of is moving objects
01:02:24.180 | and construction and sort of, you know,
01:02:27.660 | things that weren't in your map.
01:02:29.500 | And you can engineer a good, you know,
01:02:32.140 | SLAM system and all that stuff, right?
01:02:33.620 | So that's kind of the current approach
01:02:35.820 | that's closest to some level of autonomy,
01:02:37.500 | but I think eventually the long-term solution
01:02:39.660 | is gonna rely more and more on learning
01:02:43.420 | and possibly using a combination
01:02:45.020 | of self-supervised learning and model-based reinforcement
01:02:49.340 | or something like that.
01:02:50.860 | - But ultimately learning will be not just at the core,
01:02:54.780 | but really the fundamental part of the system.
01:02:57.180 | - Yeah, it already is, but it'll become more and more.
01:03:00.340 | - What do you think it takes to build a system
01:03:02.740 | with human level intelligence?
01:03:04.060 | You talked about the AI system in the movie "Her"
01:03:07.620 | being way out of reach, our current reach.
01:03:10.060 | This might be outdated as well, but-
01:03:12.380 | - It's still way out of reach.
01:03:13.220 | - It's still way out of reach.
01:03:14.700 | What would it take to build "Her"?
01:03:18.340 | Do you think?
01:03:19.740 | - So I can tell you the first two obstacles
01:03:21.740 | that we have to clear,
01:03:22.860 | but I don't know how many obstacles there are after this.
01:03:24.820 | So the image I usually use is that
01:03:26.620 | there is a bunch of mountains that we have to climb
01:03:28.620 | and we can see the first one,
01:03:29.700 | but we don't know if there are 50 mountains
01:03:31.300 | behind it or not.
01:03:32.140 | And this might be a good sort of metaphor
01:03:34.900 | for why AI researchers in the past
01:03:38.380 | have been overly optimistic about the result of AI.
01:03:41.980 | For example, Newell and Simon
01:03:46.900 | wrote the general problem solver
01:03:49.380 | and they called it a general problem solver.
01:03:51.340 | - General problem solver.
01:03:52.940 | - And of course, the first thing you realize
01:03:54.540 | is that all the problems you want to solve are exponential.
01:03:56.340 | And so you can't actually use it for anything useful.
01:03:59.140 | But you know.
01:04:00.060 | - Yeah, so yeah, all you see is the first peak.
01:04:02.260 | So what are the first couple of peaks for "Her"?
01:04:05.260 | - So the first peak,
01:04:06.380 | which is precisely what I'm working on,
01:04:07.980 | is self-supervised learning.
01:04:09.780 | How do we get machines to learn models of the world
01:04:12.260 | by observation,
01:04:13.340 | kind of like babies and like young animals?
01:04:15.820 | So we've been working with cognitive scientists.
01:04:22.260 | So Emmanuel Dupoux, who is at FAIR in Paris half-time,
01:04:26.620 | is also a researcher at a French university.
01:04:31.620 | And he has this chart that shows
01:04:36.900 | at how many months of life
01:04:39.780 | baby humans learn different concepts.
01:04:42.700 | And you can measure this in sort of various ways.
01:04:45.660 | So things like distinguishing animate objects
01:04:51.380 | from inanimate objects,
01:04:52.940 | you can tell the difference at age two, three months.
01:04:56.740 | Whether an object is going to stay stable,
01:04:58.500 | is going to fall, you know,
01:04:59.860 | about four months, you can tell.
01:05:02.900 | You know, there are various things like this.
01:05:04.660 | And then things like gravity,
01:05:06.460 | the fact that objects are not supposed to float in the air,
01:05:08.620 | but are supposed to fall,
01:05:10.060 | you learn this around the age of eight or nine months.
01:05:12.580 | If you look at a lot of, you know, eight-month-old babies,
01:05:15.340 | you give them a bunch of toys on their high chair.
01:05:18.540 | First thing they do is they throw them on the ground
01:05:19.980 | and they look at them.
01:05:21.180 | It's because, you know,
01:05:22.020 | they're learning about, actively learning
01:05:24.460 | about gravity. - Gravity, yeah.
01:05:26.980 | - Okay, so they're not trying to annoy you,
01:05:29.700 | but they, you know, they need to do the experiment, right?
01:05:32.660 | So, you know, how do we get machines to learn like babies?
01:05:36.580 | Mostly by observation with a little bit of interaction
01:05:39.220 | and learning those models of the world,
01:05:41.220 | because I think that's really a crucial piece
01:05:43.740 | of an intelligent autonomous system.
01:05:46.340 | So if you think about the architecture
01:05:47.540 | of an intelligent autonomous system,
01:05:49.500 | it needs to have a predictive model of the world.
01:05:51.340 | So something that says, here is a world at time T,
01:05:54.060 | here is a state of the world at time T plus one
01:05:55.500 | if I take this action.
01:05:56.620 | And it's not a single answer, it can be a--
01:05:59.700 | - Yeah, it can be a distribution, yeah.
01:06:01.260 | - Yeah, well, but we don't know how to represent
01:06:03.180 | distributions in high dimensional space.
01:06:04.820 | So it's gotta be something weaker than that, okay?
01:06:07.180 | But with some representation of uncertainty.
01:06:09.740 | If you have that, then you can do
01:06:12.620 | what optimal control theorists call
01:06:14.460 | model predictive control,
01:06:15.500 | which means that you can run your model
01:06:17.620 | with a hypothesis for a sequence of action
01:06:19.900 | and then see the result.
01:06:21.860 | Now, what you need, the other thing you need
01:06:23.260 | is some sort of objective that you want to optimize.
01:06:26.020 | Am I reaching the goal of grabbing this object?
01:06:28.740 | Am I minimizing energy?
01:06:30.020 | Am I whatever, right?
01:06:31.180 | So there is some sort of objective
01:06:33.460 | that you have to minimize.
01:06:34.860 | And so in your head, if you have this model,
01:06:36.740 | you can figure out the sequence of action
01:06:38.260 | that will optimize your objective.
01:06:39.940 | That objective is something that ultimately
01:06:43.340 | is rooted in your basal ganglia,
01:06:45.580 | at least in the human brain,
01:06:46.540 | that's what the basal ganglia does,
01:06:48.780 | it computes your level of contentment or miscontentment.
01:06:52.380 | I don't know if that's a word.
01:06:54.260 | Unhappiness, okay?
01:06:55.460 | - Yeah, yeah.
01:06:56.660 | - Discontentment.
01:06:57.500 | - Discontentment, maybe.
01:06:58.460 | - And so your entire behavior is driven
01:07:01.540 | towards minimizing that objective,
01:07:04.980 | which is maximizing your contentment,
01:07:07.460 | computed by your basal ganglia.
01:07:09.340 | And what you have is an objective function,
01:07:13.300 | which is basically a predictor
01:07:14.860 | of what your basal ganglia is going to tell you.
01:07:17.180 | So you're not going to put your hand on fire
01:07:19.140 | because you know it's going to burn.
01:07:22.300 | And you're going to get hurt.
01:07:23.740 | And you're predicting this because of your model of the world
01:07:26.100 | and your sort of predictor of this objective, right?
01:07:30.140 | So if you have those three components,
01:07:33.500 | you have four components,
01:07:35.140 | you have the hardwired contentment objective computer,
01:07:40.140 | if you want, calculator.
01:07:43.900 | And then you have those three components.
01:07:45.100 | One is the objective predictor,
01:07:46.700 | which basically predicts your level of contentment.
01:07:48.900 | One is the model of the world.
01:07:52.500 | And there's a third module I didn't mention,
01:07:54.060 | which is the module that will figure out
01:07:57.220 | the best course of action
01:07:59.060 | to optimize an objective given your model.
01:08:01.300 | Okay?
01:08:03.540 | - Yeah.
01:08:04.420 | - Call this a policy network or something like that, right?
01:08:08.300 | Now, you need those three components
01:08:11.700 | to act autonomously, intelligently.
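
A toy sketch wiring those components together as model predictive control: a world model, an objective (discontentment) predictor, and a policy module that searches for the action sequence with the lowest predicted cost. The 1-D "world", the cliff penalty, and the random-shooting search are all illustrative assumptions:

```python
import numpy as np

def world_model(state, action):
    """Predictive model: next state given current state and action."""
    return state + action                      # deliberately trivial dynamics

def objective(state, goal=10.0, cliff=12.0):
    """Predicted cost: distance to the goal, huge penalty past the 'cliff'."""
    cost = abs(goal - state)
    if state > cliff:
        cost += 1e6                            # never worth running off the cliff
    return cost

def plan(state, candidate_sequences):
    """Policy module: pick the action sequence minimizing predicted cost."""
    def rollout_cost(actions):
        s, total = state, 0.0
        for a in actions:
            s = world_model(s, a)
            total += objective(s)
        return total
    return min(candidate_sequences, key=rollout_cost)

rng = np.random.default_rng(0)
candidates = [rng.uniform(-2, 2, size=5) for _ in range(200)]
best = plan(state=0.0, candidate_sequences=candidates)
print("chosen actions:", np.round(best, 2))
# A wrong world_model, a misaligned objective, or a planner that fails to
# minimize it each breaks the loop in its own way.
```
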
01:08:13.940 | And you can be stupid in three different ways.
01:08:16.100 | You can be stupid because your model of the world is wrong.
01:08:19.380 | You can be stupid because your objective is not aligned
01:08:22.500 | with what you actually want to achieve.
01:08:25.060 | Okay?
01:08:25.900 | In humans, that would be a psychopath.
01:08:29.140 | - Right.
01:08:30.020 | - And then the third way you can be stupid
01:08:33.620 | is that you have the right model,
01:08:34.940 | you have the right objective,
01:08:36.340 | but you're unable to figure out a course of action
01:08:38.820 | to optimize your objective given your model.
01:08:40.700 | - Right.
01:08:41.540 | - Okay.
01:08:42.380 | Some people who are in charge of big countries
01:08:45.900 | actually have all three that are wrong.
01:08:47.740 | - All right.
01:08:48.580 | (laughs)
01:08:50.940 | Which countries?
01:08:51.760 | I don't know.
01:08:52.600 | Okay, so if we think about this agent,
01:08:55.980 | if we think about the movie "Her,"
01:08:57.980 | you've criticized the art project that is Sophia the Robot.
01:09:02.980 | And what that project essentially does
01:09:07.540 | is uses our natural inclination to anthropomorphize
01:09:11.740 | things that look like human and give them more.
01:09:14.780 | Do you think that could be used by AI systems
01:09:17.700 | like in the movie "Her"?
01:09:19.020 | So do you think that body is needed
01:09:23.340 | to create a feeling of intelligence?
01:09:27.140 | - Well, if Sophia was just an art piece,
01:09:29.260 | I would have no problem with it,
01:09:30.340 | but it's presented as something else.
01:09:33.020 | - Let me add that comment real quick.
01:09:35.260 | If creators of Sophia could change something
01:09:38.500 | about their marketing or behavior in general,
01:09:40.700 | what would it be?
01:09:41.540 | What's--
01:09:42.820 | - I'm just about everything.
01:09:43.940 | (laughs)
01:09:45.660 | - I mean, don't you think, here's a tough question.
01:09:50.100 | Let me, so I agree with you.
01:09:51.700 | So Sophia is not, the general public
01:09:55.980 | feels that Sophia can do way more than she actually can.
01:09:59.300 | - That's right.
01:10:00.220 | - And the people who created Sophia
01:10:02.740 | are not honestly publicly communicating,
01:10:07.740 | trying to teach the public.
01:10:09.460 | - Right.
01:10:10.300 | - But here's a tough question.
01:10:13.260 | Don't you think the same thing
01:10:18.060 | is scientists in industry and research
01:10:22.100 | are taking advantage of the same misunderstanding
01:10:24.660 | in the public when they create AI companies
01:10:27.340 | or publish stuff?
01:10:29.900 | - Some companies, yes.
01:10:31.140 | I mean, there is no sense of,
01:10:33.140 | there's no desire to delude.
01:10:34.900 | There's no desire to kind of overclaim
01:10:37.820 | what something is done.
01:10:38.660 | Right, you publish a paper on AI
01:10:39.820 | that has this result on ImageNet,
01:10:42.220 | it's pretty clear.
01:10:43.060 | I mean, it's not even interesting anymore,
01:10:44.940 | but I don't think there is that.
01:10:47.940 | I mean, the reviewers are generally not very forgiving
01:10:52.900 | of unsupported claims of this type.
01:10:57.180 | And, but there are certainly quite a few startups
01:10:59.660 | that have had a huge amount of hype around this
01:11:02.660 | that I find extremely damaging
01:11:05.500 | and I've been calling it out when I've seen it.
01:11:08.020 | So yeah, but to go back to your original question,
01:11:10.220 | like the necessity of embodiment.
01:11:13.020 | I think, I don't think embodiment is necessary.
01:11:15.580 | I think grounding is necessary.
01:11:17.100 | So I don't think we're gonna get machines
01:11:18.900 | that really understand language
01:11:20.460 | without some level of grounding in the real world.
01:11:22.420 | And it's not clear to me that language
01:11:24.340 | is a high enough bandwidth medium
01:11:26.100 | to communicate how the real world works.
01:11:28.220 | I think for this--
01:11:30.300 | - Can you talk to ground, what grounding means?
01:11:32.300 | - So grounding means that,
01:11:34.020 | so there is this classic problem of common sense reasoning,
01:11:37.700 | you know, the Winograd schema, right?
01:11:41.020 | And so I tell you the trophy doesn't fit in a suitcase
01:11:44.980 | because it's too big,
01:11:46.380 | or the trophy doesn't fit in a suitcase
01:11:47.780 | because it's too small.
01:11:49.180 | And the it in the first case refers to the trophy
01:11:51.820 | in the second case to the suitcase.
01:11:53.660 | And the reason you can figure this out
01:11:55.180 | is because you know what the trophy and the suitcase are,
01:11:57.020 | you know, one is supposed to fit in the other one,
01:11:58.700 | and you know the notion of size
01:12:00.620 | and the big object doesn't fit in a small object
01:12:03.020 | unless it's a TARDIS, you know, things like that, right?
01:12:05.300 | So you have this knowledge of how the world works,
01:12:08.700 | of geometry and things like that.
01:12:10.660 | I don't believe you can learn everything about the world
01:12:14.700 | by just being told in language how the world works.
01:12:18.020 | I think you need some low-level perception of the world,
01:12:21.740 | you know, be it visual touch, you know, whatever,
01:12:23.740 | but some higher bandwidth perception of the world.
01:12:26.620 | - So by reading all the world's text,
01:12:28.820 | you still may not have enough information.
01:12:31.140 | - That's right.
01:12:32.540 | There's a lot of things that just will never appear in text
01:12:35.420 | and that you can't really infer.
01:12:37.020 | So I think common sense will emerge from, you know,
01:12:41.740 | certainly a lot of language interaction,
01:12:43.420 | but also with watching videos
01:12:45.660 | or perhaps even interacting in virtual environments
01:12:48.900 | and possibly, you know, robot interacting in the real world.
01:12:51.780 | But I don't actually believe necessarily
01:12:53.620 | that this last one is absolutely necessary,
01:12:55.980 | but I think there's a need for some grounding.
01:13:00.260 | - But the final product doesn't necessarily
01:13:03.020 | need to be embodied, you're saying.
01:13:04.860 | It just needs to have an awareness, a grounding.
01:13:07.700 | - Right, but it needs to know how the world works
01:13:10.140 | to have, you know, to not be frustrating to talk to.
01:13:14.420 | - And you talked about emotions being important.
01:13:19.540 | That's a whole nother topic.
01:13:21.780 | - Well, so, you know, I talked about this,
01:13:24.340 | the basal ganglia as the, you know,
01:13:29.340 | the thing that calculates your level of miscontentment,
01:13:32.940 | and then there is this other module
01:13:34.660 | that sort of tries to do a prediction
01:13:36.660 | of whether you're gonna be content or not.
01:13:38.540 | That's the source of some emotion.
01:13:40.260 | So fear, for example, is an anticipation
01:13:43.100 | of bad things that can happen to you, right?
01:13:46.420 | You have this inkling that there is some chance
01:13:49.260 | that something really bad is gonna happen to you,
01:13:50.900 | and that creates fear.
01:13:52.300 | When you know for sure that something bad
01:13:53.700 | is gonna happen to you, you kind of give up, right?
01:13:55.900 | It's not there anymore.
01:13:57.500 | It's uncertainty that creates fear.
01:14:00.060 | - So the punchline is, we're not gonna have
01:14:01.660 | autonomous intelligence without emotions.
01:14:03.700 | - Whatever the heck emotions are.
01:14:08.860 | So you mentioned very practical things of fear,
01:14:11.060 | but there's a lot of other mess around it.
01:14:13.420 | - But they are kind of the results of, you know, drives.
01:14:16.340 | - Yeah, there's deeper biological stuff going on,
01:14:19.300 | and I've talked to a few folks on this.
01:14:21.380 | There's fascinating stuff that ultimately
01:14:23.860 | connects to our brain.
01:14:27.260 | If we create an AGI system, sorry.
01:14:30.860 | - Human level intelligence.
01:14:31.700 | - Human level intelligence system,
01:14:33.380 | and you get to ask her one question,
01:14:37.100 | what would that question be?
01:14:38.500 | - You know, I think the first one we'll create
01:14:42.860 | will probably not be that smart.
01:14:45.460 | They'd be like a four year old.
01:14:47.460 | - So you would have to ask her a question
01:14:49.980 | to know she's not that smart?
01:14:51.500 | - Yeah.
01:14:53.620 | - Well, what's a good question to ask, you know,
01:14:56.900 | to be impressed? - What is the cause of wind?
01:14:58.940 | And if she answers, oh, it's because the leaves
01:15:03.900 | of the tree are moving and that creates wind,
01:15:06.460 | she's onto something.
01:15:07.580 | - And if she says that's a stupid question,
01:15:11.780 | she's really onto something.
01:15:12.620 | - No, and then you tell her, actually, you know,
01:15:15.420 | here is the real thing, and she says,
01:15:18.460 | oh, yeah, that makes sense.
01:15:20.500 | - So questions that reveal the ability
01:15:24.500 | to do common sense reasoning about the physical world.
01:15:26.980 | - Yeah, and you'll sum it up with a causal inference.
01:15:30.140 | - Causal inference.
01:15:31.220 | Well, it was a huge honor.
01:15:33.660 | Congratulations on the Turing Award.
01:15:35.740 | Thank you so much for talking today.
01:15:37.260 | - Thank you. - Appreciate it.
01:15:38.660 | (upbeat music)