
Jitendra Malik: Computer Vision | Lex Fridman Podcast #110


Chapters

0:00 Introduction
3:17 Computer vision is hard
10:05 Tesla Autopilot
21:20 Human brain vs computers
23:14 The general problem of computer vision
29:09 Images vs video in computer vision
37:47 Benchmarks in computer vision
40:06 Active learning
45:34 From pixels to semantics
52:47 Semantic segmentation
57:05 The three R's of computer vision
1:02:52 End-to-end learning in computer vision
1:04:24 6 lessons we can learn from children
1:08:36 Vision and language
1:12:30 Turing test
1:16:17 Open problems in computer vision
1:24:49 AGI
1:35:47 Pick the right problem

Whisper Transcript

00:00:00.000 | The following is a conversation with Jitendra Malik,
00:00:03.480 | a professor at Berkeley and one of the seminal figures
00:00:06.320 | in the field of computer vision,
00:00:08.400 | the kind before the deep learning revolution
00:00:10.960 | and the kind after.
00:00:12.960 | He has been cited over 180,000 times
00:00:17.360 | and has mentored many world-class researchers
00:00:20.480 | in computer science.
00:00:21.740 | Quick summary of the ads.
00:00:24.440 | Two sponsors, one new one, which is BetterHelp
00:00:27.800 | and an old goodie, ExpressVPN.
00:00:31.480 | Please consider supporting this podcast
00:00:33.200 | by going to betterhelp.com/lex
00:00:36.440 | and signing up at expressvpn.com/lexpod.
00:00:40.780 | Click the links, buy the stuff.
00:00:43.080 | It really is the best way to support this podcast
00:00:45.420 | and the journey I'm on.
00:00:47.240 | If you enjoy this thing, subscribe on YouTube,
00:00:49.840 | review it with five stars on Apple Podcasts,
00:00:52.040 | support it on Patreon or connect with me on Twitter
00:00:55.240 | at Lex Fridman, however the heck you spell that.
00:00:58.720 | As usual, I'll do a few minutes of ads now
00:01:01.120 | and never any ads in the middle
00:01:02.480 | that can break the flow of the conversation.
00:01:05.080 | This show is sponsored by BetterHelp, spelled H-E-L-P, help.
00:01:10.080 | Check it out at betterhelp.com/lex.
00:01:15.040 | They figure out what you need and match you
00:01:16.960 | with a licensed professional therapist in under 48 hours.
00:01:21.360 | It's not a crisis line, it's not self-help,
00:01:24.200 | it's professional counseling done securely online.
00:01:28.200 | I'm a bit from the David Goggins line of creatures,
00:01:30.600 | as you may know, and so have some demons to contend with,
00:01:35.220 | usually on long runs or all-night work sessions,
00:01:38.920 | forever impossibly full of self-doubt.
00:01:41.920 | It may be because I'm Russian,
00:01:43.960 | but I think suffering is essential for creation.
00:01:47.060 | But I also think you can suffer beautifully
00:01:49.600 | in a way that doesn't destroy you.
00:01:51.980 | For most people, I think a good therapist
00:01:53.840 | can help in this, so it's at least worth a try.
00:01:57.220 | Check out their reviews, they're good.
00:01:59.640 | It's easy, private, affordable, available worldwide.
00:02:03.260 | You can communicate by text any time
00:02:05.340 | and schedule weekly audio and video sessions.
00:02:08.600 | I highly recommend that you check them out
00:02:12.240 | at betterhelp.com/lex.
00:02:15.320 | This show is also sponsored by ExpressVPN.
00:02:19.320 | Get it at expressvpn.com/lexpod.
00:02:22.640 | To support this podcast and to get an extra three months free
00:02:26.880 | on a one-year package.
00:02:28.560 | I've been using ExpressVPN for many years.
00:02:31.360 | I love it.
00:02:32.640 | I think ExpressVPN is the best VPN out there.
00:02:36.000 | They told me to say it, but it happens to be true.
00:02:39.040 | It doesn't log your data, it's crazy fast,
00:02:41.720 | and it's easy to use.
00:02:43.280 | Literally just one big, sexy power on button.
00:02:47.560 | Again, for obvious reasons, it's really important
00:02:49.640 | that they don't log your data.
00:02:51.400 | It works on Linux and everywhere else too,
00:02:54.200 | but really, why use anything else?
00:02:57.120 | Shout out to my favorite flavor of Linux, Ubuntu MATE 20.04.
00:03:02.080 | Once again, get it at expressvpn.com/lexpod
00:03:06.080 | to support this podcast and to get an extra three months free
00:03:10.720 | on a one-year package.
00:03:12.140 | And now, here's my conversation with Jitendra Malik.
00:03:17.920 | In 1966, Seymour Papert at MIT wrote up a proposal
00:03:22.920 | called the Summer Vision Project to be given,
00:03:26.580 | as far as we know, to 10 students to work on
00:03:29.880 | and solve that summer.
00:03:31.240 | So that proposal outlined many of the computer vision tasks
00:03:34.320 | we still work on today.
00:03:36.800 | Why do you think we underestimate,
00:03:38.960 | and perhaps we did underestimate
00:03:41.080 | and perhaps still underestimate how hard computer vision is?
00:03:46.040 | - Because most of what we do in vision,
00:03:48.480 | we do unconsciously or subconsciously.
00:03:50.960 | - In human vision. - In human vision.
00:03:52.940 | So that gives us this, that effortlessness
00:03:56.160 | gives us the sense that, oh, this must be very easy
00:03:59.120 | to implement on a computer.
00:04:01.920 | Now, this is why the early researchers in AI
00:04:06.920 | got it so wrong.
00:04:08.800 | However, if you go into neuroscience or psychology
00:04:14.200 | of human vision, then the complexity becomes very clear.
00:04:19.000 | The fact is that a very large part of the cerebral cortex
00:04:23.720 | is devoted to visual processing.
00:04:26.000 | I mean, and this is true in other primates as well.
00:04:29.400 | So once we looked at it from a neuroscience
00:04:33.220 | or psychology perspective, it becomes quite clear
00:04:36.260 | that the problem is very challenging
00:04:37.880 | and it will take some time.
00:04:39.600 | - You said the higher-level parts are the harder parts?
00:04:43.840 | - I think vision appears to be easy
00:04:47.640 | because most of visual processing
00:04:51.680 | is subconscious or unconscious.
00:04:54.320 | So we underestimate the difficulty.
00:04:58.200 | Whereas when you are proving a mathematical theorem
00:05:03.200 | or playing chess, the difficulty is much more evident
00:05:07.880 | because it is your conscious brain which is processing
00:05:12.960 | various aspects of the problem-solving behavior.
00:05:17.120 | Whereas in vision, all this is happening,
00:05:19.600 | but it's not in your awareness.
00:05:21.840 | It's in your, it's operating below that.
00:05:25.720 | - But it still seems strange.
00:05:28.420 | Yes, that's true, but it seems strange
00:05:30.060 | that as computer vision researchers, for example,
00:05:34.000 | the community broadly, time and time again,
00:05:38.280 | makes the mistake of thinking the problem
00:05:41.080 | is easier than it is.
00:05:42.400 | Or maybe it's not a mistake.
00:05:43.800 | We'll talk a little bit about autonomous driving,
00:05:45.680 | for example, how hard of a vision task that is.
00:05:48.640 | Do you think, I mean, is it just human nature
00:05:55.000 | or is there something fundamental to the vision problem
00:05:57.640 | that we underestimate?
00:06:00.880 | We're still not able to be cognizant
00:06:03.540 | of how hard the problem is.
00:06:05.840 | - Yeah, I think in the early days,
00:06:07.440 | it could have been excused
00:06:09.760 | because in the early days,
00:06:11.640 | all aspects of AI were regarded as too easy.
00:06:14.380 | But I think today it is much less excusable.
00:06:19.460 | And I think why people fall for this
00:06:23.400 | is because of what I call the fallacy
00:06:26.960 | of the successful first step.
00:06:28.940 | There are many problems in vision
00:06:32.520 | where you can get 50% of the solution in one minute,
00:06:37.700 | getting to 90% can take you a day,
00:06:41.340 | getting to 99% may take you five years,
00:06:44.340 | and 99.99% may not come in your lifetime.
00:06:49.340 | - I wonder if that's unique to vision.
00:06:52.540 | It seems that language, people are not so confident about.
00:06:56.460 | So natural language processing,
00:06:57.820 | people are a little bit more cautious
00:06:59.180 | about our ability to solve that problem.
00:07:04.180 | I think for language, people intuit
00:07:06.300 | that we have to be able to do natural language understanding.
00:07:10.540 | For vision, it seems that we're not cognizant
00:07:15.540 | or we don't think about how much understanding is required.
00:07:19.020 | It's probably still an open problem.
00:07:21.420 | But in your sense,
00:07:22.420 | how much understanding is required to solve vision?
00:07:26.860 | Put another way,
00:07:29.460 | how much something called common sense reasoning
00:07:33.700 | is required to really be able to interpret
00:07:37.260 | even static scenes?
00:07:39.540 | - Yeah, so vision operates at all levels,
00:07:43.140 | and there are parts which can be solved
00:07:47.060 | with what we could call maybe peripheral processing.
00:07:50.680 | So in the human vision literature,
00:07:53.560 | there used to be these terms sensation,
00:07:56.460 | perception, and cognition,
00:07:58.780 | which roughly speaking referred to
00:08:01.540 | the front end of processing,
00:08:04.100 | middle stages of processing,
00:08:05.660 | and higher level of processing.
00:08:07.940 | And I think they made a big deal out of this,
00:08:11.500 | and they wanted to just study only perception
00:08:13.820 | and then dismiss certain problems as being, quote, cognitive.
00:08:18.820 | But really, I think these are artificial divides.
00:08:23.180 | The problem is continuous at all levels,
00:08:26.140 | and there are challenges at all levels.
00:08:28.500 | The techniques that we have today,
00:08:31.060 | they work better at the lower and mid levels of the problem.
00:08:34.900 | I think the higher levels of the problem,
00:08:36.940 | quote, the cognitive levels of the problem,
00:08:40.020 | are there, and we,
00:08:43.140 | in many real applications, we have to confront them.
00:08:46.420 | Now, how much that is necessary
00:08:49.820 | will depend on the application.
00:08:51.460 | For some problems, it doesn't matter.
00:08:52.980 | For some problems, it matters a lot.
00:08:55.180 | So I am, for example,
00:08:59.060 | a pessimist on fully autonomous driving in the near future.
00:09:04.060 | And the reason is because I think there will be
00:09:07.900 | that 0.01% of the cases
00:09:11.900 | where quite sophisticated cognitive reasoning is called for.
00:09:16.500 | However, there are tasks which are,
00:09:20.300 | first of all, much more robust,
00:09:23.700 | in the sense that
00:09:26.060 | error is not so much of a problem.
00:09:28.380 | For example, let's say you're doing image search.
00:09:33.380 | You're trying to get images based on some description,
00:09:38.460 | some visual description.
00:09:40.300 | We are very tolerant of errors there, right?
00:09:43.900 | I mean, when Google Image Search gives you some images back
00:09:47.140 | and a few of them are wrong, it's okay.
00:09:49.900 | It doesn't hurt anybody.
00:09:51.300 | There's not a matter of life and death.
00:09:54.540 | But making mistakes when you are driving
00:09:59.540 | at 60 miles per hour
00:10:01.620 | and you could potentially kill somebody
00:10:04.100 | is much more important.
00:10:06.100 | - So just for the fun of it, since you mentioned,
00:10:09.460 | let's go there briefly about autonomous vehicles.
00:10:12.780 | So one of the companies in the space, Tesla,
00:10:15.740 | with Andrej Karpathy and Elon Musk,
00:10:18.860 | is working on a system called Autopilot,
00:10:21.700 | which is primarily a vision-based system
00:10:23.980 | with eight cameras and basically a single neural network,
00:10:27.860 | a multitask neural network.
00:10:30.080 | They call it HydraNet, multiple heads,
00:10:33.300 | so it does multiple tasks,
00:10:34.780 | but is forming the same representation at the core.
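
To make the "multiple heads, shared representation" idea concrete, here is a minimal sketch of a multi-task network in PyTorch. The layer sizes, head names, and task choices are purely illustrative assumptions, not Tesla's actual HydraNet.

```python
import torch
import torch.nn as nn

class MultiHeadNet(nn.Module):
    """Toy multi-task network: one shared trunk, several task-specific heads."""
    def __init__(self, num_classes=10, num_lane_params=4):
        super().__init__()
        # Shared trunk: a small conv backbone producing one common representation.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Heads read the same features but solve different tasks.
        self.object_head = nn.Linear(64, num_classes)     # e.g. object categories
        self.lane_head = nn.Linear(64, num_lane_params)   # e.g. lane-geometry regressors

    def forward(self, images):
        features = self.trunk(images)          # the shared representation at the core
        return {
            "objects": self.object_head(features),
            "lanes": self.lane_head(features),
        }

net = MultiHeadNet()
out = net(torch.randn(2, 3, 128, 128))   # two RGB frames
print(out["objects"].shape, out["lanes"].shape)
```
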
00:10:37.640 | Do you think driving can be converted in this way
00:10:41.900 | to purely a vision problem
00:10:44.940 | and then solved with learning?
00:10:47.940 | Or even more specifically, in the current approach,
00:10:52.580 | what do you think about what Tesla Autopilot team is doing?
00:10:56.060 | - So the way I think about it is that
00:10:59.500 | there are certainly subsets of the visual-based
00:11:02.860 | driving problem which are quite solvable.
00:11:05.380 | So for example, driving in freeway conditions
00:11:08.060 | is quite a solvable problem.
00:11:11.980 | I think there were demonstrations of that
00:11:14.780 | going back to the 1980s by someone called
00:11:18.660 | Ernst Dickmanns in Munich.
00:11:22.020 | In the '90s, there were approaches from Carnegie Mellon,
00:11:25.660 | there were approaches from our team at Berkeley.
00:11:28.700 | In the 2000s, there were approaches from Stanford,
00:11:31.940 | and so on.
00:11:33.100 | So autonomous driving in certain settings is very doable.
00:11:38.100 | The challenge is to have an autopilot work
00:11:42.140 | under all kinds of driving conditions.
00:11:45.380 | At that point, it's not just a question of vision
00:11:48.620 | or perception, but really also of control
00:11:51.460 | and dealing with all the edge cases.
00:11:54.180 | - So where do you think most of the difficult cases,
00:11:57.620 | to me, even the highway driving is an open problem
00:12:00.100 | because the same 50, 90, 95, 99 rule applies,
00:12:05.100 | or the fallacy of the successful first step,
00:12:09.140 | I forget how you put it, that we fall victim to.
00:12:12.100 | I think even highway driving has a lot of elements
00:12:14.900 | because to solve autonomous driving,
00:12:17.000 | you have to completely relinquish
00:12:19.260 | the help of a human being.
00:12:22.740 | You're always in control.
00:12:23.740 | So you're really going to feel the edge cases.
00:12:26.540 | So I think even highway driving is really difficult.
00:12:29.280 | But in terms of the general driving task,
00:12:32.160 | do you think vision is the fundamental problem
00:12:35.460 | or is it also your action,
00:12:39.380 | the interaction with the environment,
00:12:42.700 | the ability to, and then like the middle ground,
00:12:46.300 | I don't know if you put that under vision,
00:12:47.660 | which is trying to predict the behavior of others,
00:12:51.220 | which is a little bit in the world
00:12:53.340 | of understanding the scene,
00:12:55.700 | but it's also trying to form a model
00:12:58.220 | of the actors in the scene and predict their behavior.
00:13:02.060 | - Yeah, I include that in vision
00:13:03.860 | because to me, perception blends into cognition
00:13:07.180 | and building predictive models of other agents in the world,
00:13:11.320 | which could be other agents, could be people,
00:13:13.300 | other agents could be other cars.
00:13:15.360 | That is part of the task of perception
00:13:17.420 | because perception always has to tell us not what is now,
00:13:23.280 | but what will happen, because what's now is boring.
00:13:26.340 | It's done, it's over with.
00:13:27.740 | We care about the future because we act in the future.
00:13:33.280 | - And we care about the past in as much as it informs
00:13:36.580 | what's going to happen in the future.
00:13:38.940 | - So I think we have to build predictive models
00:13:41.240 | of behaviors of people and those can get quite complicated.
00:13:46.240 | So I mean, I've seen examples of this in,
00:13:52.900 | actually, I mean, I own a Tesla
00:13:57.780 | and it has various safety features built in.
00:14:01.420 | And what I see are these examples where,
00:14:05.720 | let's say there is some skateboarder.
00:14:08.700 | I mean, and I don't want to be too critical
00:14:11.800 | because obviously these systems are always being improved
00:14:16.280 | and any specific criticism I have,
00:14:19.380 | maybe the system six months from now
00:14:21.460 | will not have that particular failure mode.
00:14:25.700 | So it had the wrong response
00:14:30.700 | and it's because it couldn't predict
00:14:33.960 | what this skateboarder was going to do.
00:14:38.280 | Okay, and because it really required
00:14:41.680 | that higher level cognitive understanding
00:14:44.160 | of what skateboarders typically do
00:14:46.520 | as opposed to a normal pedestrian.
00:14:48.680 | So what might have been the correct behavior
00:14:50.520 | for a pedestrian, a typical behavior for a pedestrian
00:14:53.760 | was not the typical behavior for a skateboarder, right?
00:14:58.760 | - Yeah.
00:14:59.760 | - And so therefore to do a good job there,
00:15:04.360 | you need to have enough data where your pedestrians,
00:15:07.540 | you also have skateboarders,
00:15:09.520 | you've seen enough skateboarders to see
00:15:11.640 | what kinds of patterns of behavior they have.
00:15:16.440 | So in principle, with enough data,
00:15:19.680 | that problem could be solved.
00:15:21.460 | But I think our current systems,
00:15:26.120 | computer vision systems,
00:15:27.280 | they need far, far more data than humans do
00:15:31.480 | for learning those same capabilities.
00:15:33.660 | - So say that there is going to be a system
00:15:35.680 | that solves autonomous driving,
00:15:37.960 | do you think it will look similar to what we have today,
00:15:41.600 | but have a lot more data, perhaps more compute,
00:15:44.440 | but the fundamental architectures involved,
00:15:47.120 | like neural, well, in the case of Tesla Autopilot,
00:15:49.880 | is neural networks, do you think it will look similar?
00:15:54.480 | In that regard, it'll just have more data.
00:15:56.800 | - That's a scientific hypothesis
00:15:59.300 | as to which way is it going to go.
00:16:01.880 | I will tell you what I would bet on.
00:16:05.300 | So, and this is my general philosophical position
00:16:09.440 | on how these learning systems have been developed.
00:16:12.520 | What we have found currently very effective
00:16:17.160 | in computer vision in the deep learning paradigm
00:16:21.000 | is sort of tabula rasa learning,
00:16:24.020 | and tabula rasa learning in a supervised way,
00:16:27.480 | with lots and lots of--
00:16:28.320 | - What's tabula rasa learning?
00:16:29.480 | - Tabula rasa in the sense of a blank slate.
00:16:32.200 | We just have the system which is,
00:16:34.880 | given a series of experiences in this setting,
00:16:37.720 | and then it learns there.
00:16:38.960 | Now, let's think about human driving.
00:16:42.600 | It is not tabula rasa learning.
00:16:44.600 | So at the age of 16, in high school,
00:16:47.700 | a teenager goes into driver ed class, right?
00:16:54.840 | And now, at that point, they learn,
00:16:57.600 | but at the age of 16, they're already visual geniuses,
00:17:02.160 | because from zero to 16,
00:17:04.640 | they have built a certain repertoire of vision.
00:17:07.560 | In fact, most of it has probably been achieved by age two.
00:17:11.240 | Right?
00:17:12.960 | In this period of age, up to age two,
00:17:16.200 | they know that the world is three-dimensional.
00:17:18.080 | They know what objects look like
00:17:20.480 | from different perspectives.
00:17:22.280 | They know about occlusion.
00:17:24.620 | They know about common dynamics
00:17:27.560 | of humans and other bodies.
00:17:29.680 | They have some notion of intuitive physics.
00:17:32.100 | So they have built that up from their observations
00:17:35.120 | and interactions in early childhood,
00:17:38.600 | and of course, reinforced through
00:17:40.440 | their growing up to age 16.
00:17:43.920 | So then, at age 16, when they go into driver ed,
00:17:47.960 | what are they learning?
00:17:49.280 | They're not learning afresh the visual world.
00:17:52.280 | They have a mastery of the visual world.
00:17:54.640 | What they are learning is control.
00:17:58.020 | Okay?
00:17:58.860 | They are learning how to be smooth about control,
00:18:01.400 | about steering and brakes and so forth.
00:18:03.920 | They're learning a sense of typical traffic situations.
00:18:07.900 | Now, that education process can be quite short,
00:18:12.900 | because they are coming in as visual geniuses.
00:18:17.640 | And of course, in their future,
00:18:20.320 | they're going to encounter situations
00:18:22.040 | which are very novel, right?
00:18:24.140 | So during my driver ed class,
00:18:27.280 | I may not have had to deal with a skateboarder.
00:18:29.820 | I may not have had to deal with a truck
00:18:32.180 | driving in front of me,
00:18:33.700 | where the back opens up and some junk
00:18:38.240 | gets dropped from the truck,
00:18:39.880 | and I have to deal with it, right?
00:18:42.020 | But I can deal with this as a driver,
00:18:45.180 | even though I did not encounter this in my driver ed class.
00:18:48.700 | And the reason I can deal with it
00:18:50.000 | is because I have all this general visual knowledge
00:18:52.780 | and expertise.
00:18:54.540 | - And do you think the learning mechanisms we have today
00:18:59.900 | can do that kind of long-term accumulation of knowledge?
00:19:03.700 | Or do we have to do some kind of,
00:19:07.620 | the work that led up to expert systems
00:19:11.300 | with knowledge representation,
00:19:13.300 | the broader field of artificial intelligence
00:19:17.100 | worked on this kind of accumulation of knowledge.
00:19:20.200 | Do you think neural networks can do the same?
00:19:22.100 | - I think I don't see any in-principle problem
00:19:27.100 | with neural networks doing it,
00:19:29.220 | but I think the learning techniques
00:19:31.100 | would need to evolve significantly.
00:19:33.660 | So the current learning techniques that we have
00:19:38.660 | are supervised learning.
00:19:41.460 | You're given lots of examples,
00:19:43.300 | X, Y, pairs, and you learn the functional mapping
00:19:47.140 | between them.
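
As a point of reference for what "learning the functional mapping between X, Y pairs" amounts to in code, here is a toy supervised-learning sketch on a synthetic linear mapping; the data, model, and hyperparameters are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Supervised learning in its plainest form: labeled (x, y) pairs,
# and a model fit to the mapping between them.
xs = torch.linspace(-1, 1, 100).unsqueeze(1)        # inputs X
ys = 3.0 * xs + 0.5 + 0.05 * torch.randn_like(xs)   # labels Y (noisy linear map)

model = nn.Linear(1, 1)                  # the function being fitted
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(xs), ys)        # how far predictions are from labels
    loss.backward()
    optimizer.step()

print(model.weight.item(), model.bias.item())  # should approach 3.0 and 0.5
```
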
00:19:48.580 | I think that human learning is far richer than that.
00:19:52.280 | It includes many different components.
00:19:54.660 | A child explores the world and sees it.
00:19:59.660 | For example, a child takes an object
00:20:04.740 | and manipulates it in his or her hand,
00:20:09.140 | and therefore gets to see the object
00:20:10.860 | from different points of view.
00:20:12.660 | And the child has commanded the movement.
00:20:14.780 | So that's a kind of learning data,
00:20:16.520 | but the learning data has been arranged by the child.
00:20:20.820 | And this is a very rich kind of data.
00:20:24.100 | Child can do various experiments with the world.
00:20:27.460 | So there are many aspects of sort of human learning,
00:20:33.580 | and these have been studied in child development
00:20:37.340 | by psychologists.
00:20:39.300 | And what they tell us is that supervised learning
00:20:43.320 | is a very small part of it.
00:20:45.340 | There are many different aspects of learning.
00:20:48.580 | And what we would need to do is to develop models
00:20:52.220 | of all of these, and then train our systems
00:20:57.220 | in that, with that kind of protocol.
00:21:02.380 | - So new methods of learning.
00:21:04.460 | - Yes.
00:21:05.300 | - Some of which might imitate the human brain.
00:21:07.300 | But you also, in your talks, have mentioned
00:21:10.660 | sort of the compute side of things,
00:21:12.860 | in terms of the difference in the human brain,
00:21:15.060 | or referencing Moravec, Hans Moravec.
00:21:17.820 | - Yeah.
00:21:20.660 | - Do you think there's something interesting,
00:21:23.100 | valuable to consider about the difference
00:21:25.380 | in the computational power of the human brain
00:21:29.020 | versus the computers of today,
00:21:31.580 | in terms of instructions per second?
00:21:34.860 | - Yes, so if we go back,
00:21:36.620 | so this is a point I've been making for 20 years now.
00:21:41.540 | And I think once upon a time,
00:21:43.840 | the way I used to argue this was that
00:21:46.540 | we just didn't have the computing power of the human brain.
00:21:48.980 | Our computers were not quite there.
00:21:53.220 | And I mean, there is a well-known trade-off
00:21:58.220 | which we know that neurons are slow,
00:22:02.740 | compared to transistors,
00:22:05.160 | but we have a lot of them,
00:22:07.980 | and they have a very high connectivity.
00:22:10.060 | Whereas in silicon, you have much faster devices,
00:22:13.980 | transistors switch on the order of nanoseconds,
00:22:18.100 | but the connectivity is usually smaller.
00:22:20.140 | At this point in time, I mean,
00:22:23.540 | we are now talking about 2020,
00:22:25.860 | we do have, if you consider the latest GPUs and so on,
00:22:29.780 | amazing computing power.
00:22:31.660 | And if we look back at Hans Moravec's type of calculations,
00:22:36.660 | which he did in the 1990s,
00:22:38.940 | we may be there today,
00:22:40.860 | in terms of computing power comparable to the brain,
00:22:43.660 | but it's not of the same style.
00:22:46.380 | It's of a very different style.
00:22:47.980 | So I mean, for example,
00:22:51.300 | the style of computing that we have in our GPUs
00:22:54.420 | is far, far more power hungry
00:22:57.220 | than the style of computing that is there
00:22:59.660 | in the human brain or other biological entities.
00:23:03.980 | - Yeah, and that, the efficiency part is,
00:23:09.140 | we're gonna have to solve that
00:23:10.100 | in order to build actual real-world systems of large scale.
00:23:15.060 | Let me ask sort of the high level question,
00:23:17.460 | just taking a step back.
00:23:19.380 | How would you articulate
00:23:21.380 | the general problem of computer vision?
00:23:24.260 | Does such a thing exist?
00:23:25.980 | So if you look at the computer vision conferences
00:23:27.860 | and the work that's been going on,
00:23:29.580 | it's often separated into different little segments,
00:23:34.140 | breaking the problem of vision apart
00:23:36.060 | into whether segmentation, 3D reconstruction,
00:23:40.820 | object detection, I don't know,
00:23:43.300 | image captioning, whatever.
00:23:45.300 | There's benchmarks for each.
00:23:46.660 | But if you were to sort of philosophically say,
00:23:49.500 | what is the big problem of computer vision?
00:23:52.260 | Does such a thing exist?
00:23:53.460 | - Yes, but it's not in isolation.
00:23:57.260 | So if we have to,
00:23:59.180 | so for all intelligence tasks,
00:24:03.860 | I always go back to sort of biology or humans.
00:24:09.500 | And if you think about vision or perception in that setting,
00:24:14.100 | we realize that perception is always to guide action.
00:24:17.980 | Perception for a biological system
00:24:20.860 | does not give any benefits
00:24:22.660 | unless it is coupled with action.
00:24:24.980 | So we can go back and think about
00:24:27.740 | the first multicellular animals
00:24:30.260 | which arose in the Cambrian era 500 million years ago.
00:24:34.700 | And these animals could move
00:24:37.860 | and they could see in some way.
00:24:40.740 | And the two activities helped each other
00:24:43.340 | because how does movement help?
00:24:47.380 | Movement helps that
00:24:49.020 | because you can get food in different places.
00:24:52.060 | But you need to know where to go.
00:24:54.260 | And that's really about perception or seeing.
00:24:57.980 | I mean, vision is perhaps the single most important sense,
00:25:01.740 | but all the others are also important.
00:25:05.940 | So perception and action kind of go together.
00:25:10.060 | So earlier it was in these very simple feedback loops
00:25:13.500 | which were about finding food
00:25:17.220 | or avoiding becoming food if there's a predator running,
00:25:20.620 | trying to eat you up and so forth.
00:25:25.220 | So we must, at the fundamental level,
00:25:27.740 | connect perception to action.
00:25:29.820 | Then as we evolved,
00:25:33.700 | perception became more and more sophisticated
00:25:36.580 | because it served many more purposes.
00:25:38.580 | And so today we have what seems like
00:25:43.260 | a fairly general purpose capability
00:25:45.860 | which can look at the external world
00:25:48.100 | and build a model of the external world inside the head.
00:25:51.980 | We do have that capability.
00:25:54.980 | That model is not perfect.
00:25:56.900 | And psychologists have great fun in pointing out
00:25:59.300 | the ways in which the model in your head
00:26:01.620 | is not a perfect model of the external world.
00:26:05.180 | They create various illusions
00:26:08.100 | to show the ways in which it is imperfect.
00:26:11.340 | But it's amazing how far it has come
00:26:14.260 | from a very simple perception action loop
00:26:17.780 | that exists in an animal 500 million years ago.
00:26:22.780 | Once we have these very sophisticated visual systems,
00:26:28.100 | we can then impose a structure on them.
00:26:30.660 | It's we as scientists who are imposing that structure
00:26:34.180 | where we have chosen to characterize this part of the system
00:26:38.860 | as this, quote, "module of object detection"
00:26:41.940 | or, quote, "this module of 3D reconstruction."
00:26:44.980 | What's going on is really all of these processes
00:26:48.140 | are running simultaneously.
00:26:50.460 | And they are running simultaneously
00:26:56.300 | because originally their purpose was, in fact,
00:26:58.620 | to help guide action.
00:27:00.940 | - So as a guiding general statement of a problem,
00:27:03.900 | do you think we can say that the general problem
00:27:08.180 | of computer vision, you said, in humans,
00:27:12.380 | it was tied to action.
00:27:14.700 | Do you think we should also say that ultimately
00:27:17.180 | that the goal, the problem of computer vision
00:27:20.820 | is to sense the world in a way
00:27:23.660 | that helps you act in the world?
00:27:27.460 | - Yes, I think that's the most fundamental purpose.
00:27:32.460 | We have by now hyper-evolved.
00:27:37.260 | So we have this visual system
00:27:38.940 | which can be used for other things.
00:27:41.900 | For example, judging the aesthetic value of a painting.
00:27:45.420 | And this is not guiding action.
00:27:49.100 | Maybe it's guiding action in terms of how much money
00:27:51.900 | you will put in your auction bid,
00:27:54.140 | but that's a bit stretched.
00:27:55.900 | But the basics are, in fact, in terms of action.
00:27:59.700 | But we have really,
00:28:04.700 | we have hyper-evolved our visual system.
00:28:07.900 | - Actually, just to, sorry to interrupt,
00:28:10.140 | but perhaps it is fundamentally about action.
00:28:13.260 | You kind of jokingly said about spending,
00:28:15.700 | but perhaps the capitalistic drive
00:28:20.220 | that drives a lot of the development in this world
00:28:23.460 | is about the exchange of money
00:28:25.060 | and the fundamental action is money.
00:28:26.540 | If you watch Netflix, if you enjoy watching movies,
00:28:29.500 | you're using your perception system to interpret the movie.
00:28:32.620 | Ultimately, your enjoyment of that movie
00:28:34.660 | means you'll subscribe to Netflix.
00:28:36.660 | So the action is this extra layer
00:28:41.180 | that we've developed in modern society,
00:28:42.940 | perhaps is fundamentally tied
00:28:45.140 | to the action of spending money.
00:28:46.820 | - Well, certainly with respect to interactions with firms.
00:28:54.100 | So in this homo economicus role,
00:28:56.900 | when you're interacting with firms,
00:28:59.060 | it does become that.
00:29:01.940 | - What else is there?
00:29:03.380 | (both laughing)
00:29:05.700 | No, it was a rhetorical question.
00:29:06.980 | Okay, so to linger on the division
00:29:11.460 | between the static and the dynamic,
00:29:14.700 | so much of the work in computer vision,
00:29:16.660 | so many of the breakthroughs that you've been a part of
00:29:19.300 | have been in the static world,
00:29:21.900 | in looking at static images.
00:29:24.260 | And then you've also worked on starting,
00:29:26.700 | but to a much smaller degree,
00:29:28.540 | the community is looking at dynamic, at video,
00:29:31.460 | at dynamic scenes.
00:29:32.820 | And then there is robotic vision,
00:29:35.220 | which is dynamic, but also where you actually have a robot
00:29:39.340 | in the physical world interacting based on that vision.
00:29:42.100 | Which problem is harder?
00:29:45.660 | Sort of the trivial first answer is,
00:29:51.100 | well, of course, one image is harder.
00:29:53.740 | But if you look at a deeper question there,
00:29:58.540 | are we, what's the term,
00:30:01.500 | cutting ourselves at the knees,
00:30:04.060 | or making the problem harder by focusing on images?
00:30:07.820 | - That's a fair question.
00:30:09.100 | I think sometimes we can simplify a problem so much
00:30:17.100 | that we essentially lose part of the juice
00:30:21.300 | that could enable us to solve the problem.
00:30:23.400 | And one could reasonably argue that, to some extent,
00:30:28.020 | this happens when we go from video to single images.
00:30:31.360 | Now, historically, you have to consider the limits
00:30:35.500 | imposed by the computation capabilities we had.
00:30:40.500 | So many of the choices made
00:30:43.780 | in the computer vision community
00:30:46.340 | through the '70s, '80s, '90s,
00:30:50.620 | can be understood as choices which were forced upon us
00:30:55.620 | by the fact that we just didn't have access to compute,
00:31:00.940 | enough compute.
00:31:01.940 | - Not enough memory, not enough hard drives.
00:31:04.100 | - Exactly, not enough compute, not enough storage.
00:31:07.700 | So think of these choices.
00:31:09.420 | So one of the choices is focusing on single images
00:31:12.820 | rather than video.
00:31:14.140 | - Okay, clear question, storage and compute.
00:31:17.540 | We had to focus on, we used to detect edges
00:31:23.700 | and throw away the image, right?
00:31:25.580 | So you have an image, which is, say, 256 by 256 pixels,
00:31:29.660 | and instead of keeping around the grayscale value,
00:31:33.180 | what we did was we detected edges,
00:31:35.460 | find the places where the brightness changes a lot,
00:31:38.340 | and then throw away the rest.
00:31:41.960 | So this was a major compression device,
00:31:44.740 | and the hope was that this makes it
00:31:47.220 | that you can still work with it,
00:31:48.580 | and the logic was humans can interpret a line drawing,
00:31:51.780 | and this will save us computation.
00:31:58.020 | So many of the choices were dictated by that.
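
For readers who haven't seen it, this is roughly what the old edge-detection-as-compression pipeline amounted to: keep pixels where brightness changes sharply, throw the rest away. A minimal numpy sketch; the threshold value and test image are arbitrary.

```python
import numpy as np

def edge_map(gray, threshold=30.0):
    """Keep only pixels where brightness changes a lot; discard the rest.

    gray: 2-D array of grayscale values (e.g. 256 x 256).
    Returns a boolean map marking edge pixels -- a crude, lossy
    'compression' of the image down to its brightness discontinuities.
    """
    dy, dx = np.gradient(gray.astype(float))   # brightness change per pixel
    magnitude = np.hypot(dx, dy)               # gradient magnitude
    return magnitude > threshold               # edge / non-edge decision

# A 256x256 test image: dark left half, bright right half -> one vertical edge.
img = np.zeros((256, 256))
img[:, 128:] = 200
edges = edge_map(img)
print(edges.sum(), "edge pixels out of", edges.size)
```
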
00:32:00.940 | I think today we are no longer detecting edges, right?
00:32:05.940 | We process images with ConvNets because we don't need to.
00:32:10.780 | We don't have those compute restrictions anymore.
00:32:13.960 | Now, video is still understudied
00:32:16.280 | because video compute is still quite challenging
00:32:19.600 | if you are a university researcher.
00:32:22.320 | I think video computing is not so challenging
00:32:24.960 | if you are at Google or Facebook or Amazon.
00:32:28.840 | - Still super challenging.
00:32:30.240 | I just spoke with the VP of engineering,
00:32:33.000 | Google head of YouTube search and discovery,
00:32:35.560 | and they still struggle doing stuff on video.
00:32:38.400 | It's very difficult except using techniques
00:32:42.140 | that are essentially the techniques you used in the '90s,
00:32:45.300 | some very basic computer vision techniques.
00:32:48.620 | - No, that's when you want to do things at scale.
00:32:51.440 | So if you want to operate at the scale
00:32:53.700 | of all the content of YouTube, it's very challenging,
00:32:56.980 | and there are similar issues in Facebook.
00:32:59.260 | But as a researcher, you have more opportunities.
00:33:04.260 | - You can train large
00:33:06.940 | networks with relatively large video datasets, yeah.
00:33:10.540 | - Yes, so I think that this is part of the reason
00:33:13.660 | why we have so emphasized static images.
00:33:17.220 | I think that this is changing,
00:33:18.740 | and over the next few years,
00:33:20.460 | I see a lot more progress happening in video.
00:33:25.180 | So I have this generic statement that,
00:33:29.460 | to me, video recognition feels like 10 years
00:33:31.920 | behind object recognition.
00:33:33.780 | And you can quantify that
00:33:35.820 | because you can take some of the challenging video datasets,
00:33:39.020 | and their performance on action classification
00:33:42.620 | is like, say, 30%,
00:33:44.640 | which is kind of what we used to have
00:33:47.300 | around 2009 in object detection.
00:33:52.300 | It's like about 10 years behind.
00:33:54.620 | And whether it'll take 10 years to catch up
00:33:57.700 | is a different question.
00:33:58.740 | Hopefully, it will take less than that.
00:34:01.080 | - Let me ask a similar question I've already asked,
00:34:04.540 | but once again, so for dynamic scenes,
00:34:07.440 | do you think some kind of injection of knowledge bases
00:34:13.580 | and reasoning is required
00:34:16.020 | to help improve action recognition?
00:34:18.840 | If we solve the general action recognition problem,
00:34:27.820 | what do you think the solution would look like?
00:34:29.900 | That's another way to put it.
00:34:30.740 | - So I completely agree that knowledge is called for,
00:34:35.740 | and that knowledge can be quite sophisticated.
00:34:39.620 | So the way I would say it is that
00:34:41.540 | perception blends into cognition,
00:34:43.900 | and cognition brings in issues of memory
00:34:46.780 | and this notion of a schema from psychology,
00:34:51.780 | which is, let me use the classic example,
00:34:55.100 | which is you go to a restaurant, right?
00:34:58.700 | Now, the things that happen in a certain order,
00:35:01.020 | you walk in, somebody takes you to a table,
00:35:05.340 | waiter comes, gives you a menu,
00:35:08.700 | takes the order, food arrives,
00:35:10.900 | eventually, bill arrives, et cetera, et cetera.
00:35:14.020 | There's a classic example of AI from the 1970s.
00:35:19.020 | There were the terms frames and scripts and schemas.
00:35:24.740 | These are all quite similar ideas.
00:35:27.140 | Okay, and in the '70s, the way the AI of the time
00:35:31.340 | dealt with it was by hand-coding this.
00:35:34.220 | So they hand-coded in this notion of a script
00:35:37.020 | and the various stages and the actors
00:35:40.180 | and so on and so forth,
00:35:42.060 | and used that to interpret, for example, language.
00:35:45.460 | I mean, if there's a description of a story
00:35:49.220 | involving some people eating at a restaurant,
00:35:52.720 | there are all these inferences you can make
00:35:56.220 | because you know what happens typically at a restaurant.
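
To show what "hand-coding a script" looked like in that era, here is a tiny, made-up restaurant script in the spirit of 1970s frames/scripts/schemas; the structure, roles, and scene names are illustrative, not any specific historical system.

```python
# A minimal, hand-coded "restaurant script": ordered scenes, roles, and props.
RESTAURANT_SCRIPT = {
    "roles": ["customer", "host", "waiter", "cook"],
    "props": ["table", "menu", "food", "bill"],
    "scenes": [
        ("enter", "host seats the customer at a table"),
        ("order", "waiter brings a menu and takes the order"),
        ("eat",   "cook prepares food, waiter serves it, customer eats"),
        ("pay",   "waiter brings the bill, customer pays"),
        ("leave", "customer leaves the restaurant"),
    ],
}

def infer_missing_scenes(mentioned):
    """Given scenes explicitly mentioned in a story, infer the earlier ones the
    script says must also have happened (the classic script-based inference)."""
    names = [name for name, _ in RESTAURANT_SCRIPT["scenes"]]
    last = max(names.index(m) for m in mentioned)
    return [n for n in names[: last + 1] if n not in mentioned]

# "They ate and paid the bill" -> entering and ordering are implied.
print(infer_missing_scenes(["eat", "pay"]))   # ['enter', 'order']
```
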
00:35:59.220 | So I think this kind of knowledge is absolutely essential.
00:36:05.140 | So I think that when we are going to do
00:36:08.780 | long-form video understanding,
00:36:11.620 | we are going to need to do this.
00:36:13.540 | I think the kinds of technology that we have right now
00:36:16.180 | with 3D convolutions over a couple of seconds
00:36:20.020 | of clip or video,
00:36:21.300 | it's very much tailored
00:36:22.900 | towards short-term video understanding,
00:36:25.940 | not that long-term understanding.
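
Here is a sketch of the short-clip technology just described: a 3D convolution sliding over the time, height, and width of a couple-of-seconds video clip. The tensor shapes and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# A couple of seconds of video as a 5-D tensor: (batch, RGB, frames, H, W).
clip = torch.randn(1, 3, 16, 112, 112)

conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3))
features = conv3d(clip)        # spatio-temporal features over the short clip
print(features.shape)          # torch.Size([1, 64, 16, 56, 56])

# Models built from such layers capture seconds of motion well, but carry no
# notion of the longer-term structure (schemas, goals) discussed here.
```
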
00:36:28.340 | Long-term understanding requires a notion of,
00:36:32.220 | this notion of schemas that I talked about,
00:36:35.340 | perhaps some notions of goals, intentionality,
00:36:39.580 | functionality, and so on and so forth.
00:36:43.040 | Now, how will we bring that in?
00:36:45.980 | So we could either revert back to the '70s and say,
00:36:48.820 | okay, I'm going to hand-code in a script,
00:36:52.580 | or we might try to learn it.
00:36:56.180 | So I tend to believe that
00:36:59.900 | we have to find learning ways of doing this,
00:37:02.940 | because I think learning ways land up being more robust.
00:37:06.620 | And there must be a learning version of the story
00:37:09.220 | because children acquire a lot of this knowledge
00:37:13.580 | by sort of just observation.
00:37:16.620 | So at no moment in a child's life,
00:37:21.300 | well, it's possible, but I think it's not so typical,
00:37:24.380 | does a mother coach a child
00:37:27.900 | through all the stages of what happens in a restaurant.
00:37:30.620 | They just go as a family,
00:37:31.860 | they go to the restaurant, they eat, come back,
00:37:35.620 | and the child goes through 10 such experiences,
00:37:37.920 | and the child has got a schema
00:37:40.500 | of what happens when you go to a restaurant.
00:37:42.660 | So we somehow need to,
00:37:44.900 | we need to provide that capability to our systems.
00:37:47.940 | - You mentioned the following line
00:37:50.560 | from the end of the Alan Turing paper,
00:37:53.140 | Computing Machinery and Intelligence,
00:37:54.820 | that many people, like you said,
00:37:57.280 | many people know and very few have read,
00:38:00.460 | where he proposes the Turing test.
00:38:03.240 | This is how you know,
00:38:04.620 | 'cause it's towards the end of the paper.
00:38:06.580 | "Instead of trying to produce a program
00:38:08.340 | "to simulate the adult mind,
00:38:09.920 | "why not rather try to produce one
00:38:11.740 | "which simulates the child's?"
00:38:14.340 | So that's a really interesting point.
00:38:16.980 | If I think about the benchmarks we have before us,
00:38:20.420 | the tests of our computer vision systems,
00:38:24.460 | they're often kind of trying to get to the adult.
00:38:28.220 | So what kind of benchmarks should we have?
00:38:31.060 | What kind of tests for computer vision do you think
00:38:33.180 | we should have that mimic the child's in computer vision?
00:38:38.100 | - Yeah, I think we should have those,
00:38:40.560 | and we don't have those today.
00:38:42.580 | And I think the part of the challenge
00:38:47.180 | is that we should really be collecting data
00:38:49.860 | of the type that a child experiences.
00:38:54.860 | So that gets into issues of privacy and so on and so forth.
00:38:59.260 | But there are attempts in this direction
00:39:01.140 | to sort of try to collect the kind of data
00:39:04.940 | that a child encounters growing up.
00:39:08.500 | So what's the child's linguistic environment?
00:39:11.020 | What's the child's visual environment?
00:39:13.460 | So if we could collect that kind of data
00:39:17.060 | and then develop learning schemes based on that data,
00:39:21.780 | that would be one way to do it.
00:39:23.660 | I think that's a very promising direction myself.
00:39:28.740 | There might be people who would argue
00:39:31.140 | that we could just short circuit this in some way.
00:39:33.980 | And sometimes
00:39:38.900 | we have had success by not imitating nature in detail.
00:39:44.340 | So the usual example is airplanes, right?
00:39:47.460 | We don't build flapping wings.
00:39:50.960 | So yes, that's one of the points of debate.
00:39:56.820 | In my mind, I would bet on this learning
00:40:02.060 | like a child approach.
00:40:05.020 | - So one of the fundamental aspects
00:40:08.540 | of learning like a child is the interactivity.
00:40:11.280 | So the child gets to play
00:40:12.540 | with the data set it's learning from.
00:40:14.340 | - Yes.
00:40:15.180 | - So it gets to select.
00:40:16.100 | I mean, you can call that active learning.
00:40:17.900 | You can, in the machine learning world,
00:40:20.620 | you can call it a lot of terms.
00:40:22.180 | What are your thoughts about this whole space
00:40:25.620 | of being able to play with the data set
00:40:27.580 | or select what you're learning?
00:40:29.540 | - Yeah, so I think that I believe in that.
00:40:33.980 | And I think that we could achieve it in two ways
00:40:38.460 | and I think we should use both.
00:40:40.700 | So one is actually real robotics, right?
00:40:45.540 | So real physical embodiments of agents
00:40:50.540 | who are interacting with the world
00:40:52.540 | and they have a physical body with dynamics
00:40:54.980 | and mass and moment of inertia and friction
00:40:58.900 | and all the rest and you learn your body.
00:41:01.580 | The robot learns its body by doing a series of actions.
00:41:08.400 | The second is simulation environments.
00:41:11.540 | So I think simulation environments
00:41:14.360 | are getting much, much better.
00:41:16.080 | In my work at Facebook AI Research,
00:41:21.680 | our group has worked on something called Habitat,
00:41:24.760 | which is a simulation environment,
00:41:27.080 | which is a visually photorealistic environment
00:41:31.600 | of places like houses or interiors
00:41:36.280 | of various urban spaces and so forth.
00:41:39.440 | And as you move, you get a picture
00:41:42.040 | which is a pretty accurate picture.
00:41:43.940 | So I can now, you can imagine that subsequent generations
00:41:49.900 | of these simulators will be accurate, not just visually,
00:41:54.960 | but with respect to forces and masses
00:41:58.860 | and haptic interactions and so on.
00:42:03.200 | And then we have that environment to play with.
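
To make the perception-action loop concrete, here is a deliberately toy agent-environment interaction; the environment class, observation, and action names are all made up for illustration and are not the Habitat API or any real simulator.

```python
class ToyRoom:
    """A stand-in for a simulator: the agent moves and, in return, gets an
    observation of its surroundings. Everything here is a tiny toy."""
    def __init__(self, size=5):
        self.size = size
        self.agent = [0, 0]
        self.goal = [size - 1, size - 1]

    def observe(self):
        # A crude "image": the agent's offset from the goal.
        return (self.goal[0] - self.agent[0], self.goal[1] - self.agent[1])

    def step(self, action):
        dx, dy = {"up": (0, 1), "down": (0, -1),
                  "left": (-1, 0), "right": (1, 0)}[action]
        self.agent[0] = min(max(self.agent[0] + dx, 0), self.size - 1)
        self.agent[1] = min(max(self.agent[1] + dy, 0), self.size - 1)
        return self.observe(), self.agent == self.goal   # observation, done

env = ToyRoom()
obs, done = env.observe(), False
while not done:
    # Perception guiding action: move toward wherever the observation points.
    action = "right" if obs[0] > 0 else "up"
    obs, done = env.step(action)
print("reached goal")
```
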
00:42:07.600 | I think that, let me state one reason
00:42:11.280 | why I think this active,
00:42:14.400 | being able to act in the world is important.
00:42:16.320 | I think that this is one way to break
00:42:18.800 | the correlation versus causation barrier.
00:42:22.880 | So this is something which is of a great deal
00:42:26.160 | of interest these days.
00:42:27.160 | I mean, people like Judea Pearl have talked a lot about
00:42:32.240 | how we are neglecting causality,
00:42:34.760 | and he describes the entire set of successes
00:42:38.580 | of deep learning as just curve fitting, right?
00:42:41.360 | But I don't quite agree.
00:42:45.280 | - He's a troublemaker, he is.
00:42:46.600 | - But causality is important,
00:42:49.360 | but causality is not like a single silver bullet.
00:42:54.360 | It's not like one single principle.
00:42:56.080 | There are many different aspects here.
00:42:58.480 | And one of the ways in which,
00:43:01.560 | one of our most reliable ways of establishing causal links,
00:43:05.160 | and this is the way, for example,
00:43:07.280 | the medical community does this,
00:43:09.860 | is randomized control trials.
00:43:12.720 | So you pick some situations,
00:43:15.340 | and now in some of them you perform an action,
00:43:17.760 | and in certain others you don't, right?
00:43:21.800 | So you have a controlled experiment.
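
A toy version of the randomized controlled trial just described: randomly assign the action, then compare average outcomes across the two groups. The effect size and noise are made-up numbers purely for illustration.

```python
import random

random.seed(0)

def outcome(acted):
    true_effect = 2.0                        # the causal effect we hope to recover
    return true_effect * acted + random.gauss(0.0, 1.0)

acted_group, control_group = [], []
for _ in range(2000):
    act = random.random() < 0.5              # the randomization step
    (acted_group if act else control_group).append(outcome(1 if act else 0))

estimate = (sum(acted_group) / len(acted_group)
            - sum(control_group) / len(control_group))
print(round(estimate, 2))                    # close to 2.0: causation, not mere correlation
```
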
00:43:24.160 | Well, the child is, in fact,
00:43:25.640 | performing controlled experiments all the time, right?
00:43:28.680 | - Right, right. - Okay?
00:43:29.880 | - Small scale.
00:43:30.720 | - And in a small scale,
00:43:32.080 | and, but that is a way that the child gets to build
00:43:37.080 | and refine its causal models of the world.
00:43:40.920 | And my colleague, Alison Gopnik,
00:43:43.760 | together with a couple of
00:43:46.200 | co-authors, has this book called
00:43:47.400 | "The Scientist in the Crib," referring to children.
00:43:50.760 | The part that I like about that is
00:43:54.280 | the scientist wants to build causal models,
00:43:58.880 | and the scientist does controlled experiments.
00:44:01.720 | And I think the child is doing that.
00:44:03.720 | So to enable that, we will need to have these,
00:44:08.280 | these active experiments,
00:44:10.000 | and I think those could be done,
00:44:12.700 | some in the real world and some in simulation.
00:44:14.840 | - So you have hope for simulation?
00:44:16.920 | - I have hope for simulation.
00:44:18.000 | - That's an exciting possibility,
00:44:19.420 | if we can get to not just photorealistic,
00:44:21.680 | but what's that called?
00:44:23.960 | Life realistic simulation.
00:44:27.640 | So you don't see any fundamental blocks
00:44:31.480 | to why we can't eventually simulate
00:44:34.400 | the principles of what it means to exist in the world
00:44:37.960 | as a physical entity?
00:44:39.520 | - I don't see any fundamental problems there.
00:44:41.160 | I mean, and look,
00:44:42.600 | the computer graphics community has come a long way.
00:44:45.360 | So in the early days, going back to the '80s and '90s,
00:44:48.600 | they were focusing on visual realism, right?
00:44:52.640 | And then they could do the easy stuff,
00:44:54.480 | but they couldn't do stuff like hair or fur and so on.
00:44:59.000 | Okay, well, they managed to do that.
00:45:01.040 | Then they couldn't do physical actions, right?
00:45:04.360 | Like there's a glass bowl and it falls down
00:45:07.280 | and it shatters,
00:45:08.360 | but then they could start to do
00:45:09.760 | pretty realistic models of that,
00:45:11.540 | and so on and so forth.
00:45:13.840 | So the graphics people have shown
00:45:15.360 | that they can do this forward direction,
00:45:18.880 | not just for optical interactions,
00:45:21.180 | but also for physical interactions.
00:45:23.780 | So I think, of course,
00:45:26.240 | some of that is very computer intensive,
00:45:28.000 | but I think by and by,
00:45:29.960 | we will find ways of making our models ever more realistic.
00:45:34.960 | - You break vision apart into,
00:45:37.920 | in one of your presentations,
00:45:39.160 | early vision, static scene understanding,
00:45:41.200 | dynamic scene understanding,
00:45:42.560 | and raise a few interesting questions.
00:45:44.360 | I thought I could just throw some at you
00:45:46.960 | to see if you wanna talk about them.
00:45:50.280 | So early vision, so it's,
00:45:52.400 | what is it that you said?
00:45:53.980 | Sensation, perception, and cognition.
00:45:59.680 | So is this sensation?
00:45:59.680 | - Yes.
00:46:00.520 | - What can we learn from image statistics
00:46:03.460 | that we don't already know?
00:46:05.660 | So at the lowest level,
00:46:07.180 | what can we make from just the statistics,
00:46:13.420 | the basics, so the variations
00:46:15.660 | in the raw pixels, the textures, and so on?
00:46:18.100 | - Yeah, so what we seem to have learned is
00:46:21.520 | that there's a lot of redundancy in these images,
00:46:26.520 | and as a result, we are able to do a lot of compression.
00:46:31.360 | And this compression is very important
00:46:34.940 | in biological settings, right?
00:46:36.900 | So you might have 10 to the eight photoreceptors
00:46:40.100 | and only 10 to the six fibers in the optic nerve,
00:46:42.500 | so you have to do this compression
00:46:43.980 | by a factor of 100 to one.
00:46:46.520 | And so there are analogs of that
00:46:50.980 | which are happening in our neural network,
00:46:53.980 | artificial neural network.
00:46:55.180 | - At the early layers.
00:46:56.020 | - At the early layers.
00:46:57.260 | - There's a lot of compression
00:46:59.260 | that can be done in the beginning,
00:47:01.300 | just the statistics.
00:47:02.600 | - Yeah.
00:47:03.440 | - How much?
00:47:06.380 | - Well, so I mean, the way to think about it
00:47:10.700 | is just how successful is image compression, right?
00:47:14.700 | And that's been done with older technologies,
00:47:19.620 | but it can be done with,
00:47:21.260 | there are several companies which are trying to use
00:47:25.700 | sort of these more advanced neural network type techniques
00:47:29.220 | for compression, both for static images
00:47:31.780 | as well as for video.
00:47:34.220 | One of my former students has a company
00:47:37.500 | which is trying to do stuff like this.
00:47:40.220 | And I think that they are showing
00:47:44.540 | quite interesting results,
00:47:47.300 | and I think that all of that success
00:47:50.580 | is really about image statistics and video statistics.
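
For a sense of what learned image compression looks like, here is a minimal autoencoder sketch that squeezes a 64x64 RGB image into a small code and reconstructs it. Sizes are illustrative, and real neural codecs add quantization and entropy coding on top of this basic idea.

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self, code_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, code_dim),                     # the compressed code
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 32 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 32
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),              # 32 -> 64
        )

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

ae = TinyAutoencoder()
img = torch.rand(1, 3, 64, 64)
recon, code = ae(img)
print(img.numel(), "->", code.numel(), "numbers")   # 12288 -> 128
```
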
00:47:53.620 | - But that's still not doing compression of the kind
00:47:56.940 | when I see a picture of a cat,
00:47:58.920 | all I have to say is it's a cat,
00:48:00.620 | that's another semantic kind of compression.
00:48:02.680 | - Yeah, so this is at the lower level, right?
00:48:04.740 | So we are, as I said, yeah,
00:48:07.420 | that's focusing on low-level statistics.
00:48:10.260 | - So to linger on that for a little bit,
00:48:13.060 | you mentioned how far can bottom-up image segmentation go,
00:48:18.060 | and in general,
00:48:20.460 | you mentioned that the central question
00:48:23.180 | for scene understanding is the interplay
00:48:24.780 | of bottom-up and top-down information.
00:48:26.680 | Maybe this is a good time to elaborate on that,
00:48:29.780 | maybe define what is bottom-up, what is top-down
00:48:34.580 | in the context of computer vision.
00:48:37.220 | - Right, that's, so today what we have are
00:48:42.260 | very interesting systems,
00:48:43.540 | because they work completely bottom-up,
00:48:46.020 | however, they're trained--
00:48:46.860 | - What does bottom-up mean, sorry?
00:48:47.820 | - So bottom-up means, in this case,
00:48:49.500 | means a feed-forward neural network.
00:48:52.060 | - So starting from the raw pixels?
00:48:53.660 | - Yeah, they start from the raw pixels
00:48:55.540 | and they end up with something like cat or not a cat, right?
00:49:00.500 | So our systems are running totally feed-forward.
00:49:04.420 | They're trained in a very top-down way.
00:49:07.440 | So they're trained by saying, okay, this is a cat,
00:49:10.140 | there's a cat, there's a dog, there's a zebra, et cetera.
00:49:12.980 | And I'm not happy with either of these choices fully.
00:49:18.940 | We have gone into this
00:49:20.660 | because we have completely separated these processes, right?
00:49:24.860 | So what would I like the process to be?
00:49:29.420 | So what do we know compared to biology?
00:49:34.060 | So in biology, what we know is that the processes
00:49:37.540 | in at test time, at runtime,
00:49:41.660 | those processes are not purely feed-forward,
00:49:44.060 | but they involve feedback.
00:49:45.420 | And they involve much shallower neural networks.
00:49:49.980 | So the kinds of neural networks we are using
00:49:52.580 | in computer vision, say a ResNet 50, has 50 layers.
00:49:56.420 | Well, in the brain, in the visual cortex,
00:49:59.540 | going from the retina to IT, maybe we have like seven, right?
00:50:04.180 | So they're far shallower,
00:50:06.060 | but we have the possibility of feedback.
00:50:08.020 | So there are backward connections.
00:50:09.860 | And this might enable us to deal
00:50:14.820 | with the more ambiguous stimuli, for example.
00:50:18.160 | So the biological solution seems to involve feedback.
00:50:23.160 | The solution in artificial vision seems to be
00:50:27.840 | just feed-forward, but with a much deeper network.
00:50:30.620 | And the two are functionally equivalent,
00:50:33.300 | because if you have a feedback network,
00:50:35.100 | which just has like three rounds of feedback,
00:50:37.500 | you can just unroll it and make it three times the depth
00:50:40.980 | and create it in a totally feed-forward way.
00:50:44.500 | So this is something which, I mean,
00:50:46.460 | we have written some papers on this theme,
00:50:49.140 | but I really feel that this theme
00:50:52.500 | should be pursued further.
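
The functional equivalence described above, a shallow network applied with a few rounds of feedback versus its unrolled, purely feed-forward version with tied weights, can be shown in a few lines; the layer size and number of rounds are illustrative.

```python
import torch
import torch.nn as nn

layer = nn.Linear(8, 8)

def recurrent_forward(x, rounds=3):
    h = x
    for _ in range(rounds):          # feedback: reuse the same shallow layer
        h = torch.relu(layer(h))
    return h

unrolled = nn.Sequential(            # the "unrolled", three-times-deeper version
    layer, nn.ReLU(),
    layer, nn.ReLU(),
    layer, nn.ReLU(),
)

x = torch.randn(2, 8)
print(torch.allclose(recurrent_forward(x), unrolled(x)))   # True
```
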
00:50:55.460 | - Some kind of recurrence mechanism.
00:50:57.220 | - Yeah.
00:50:58.420 | Okay, the other, so that's,
00:51:01.380 | so I want to have a little bit more top-down
00:51:04.500 | in the, at test time.
00:51:07.580 | Okay, then at training time,
00:51:10.300 | we make use of a lot of top-down knowledge right now.
00:51:13.740 | So basically to learn to segment an object,
00:51:16.440 | we have to have all these examples of,
00:51:19.140 | this is the boundary of a cat,
00:51:20.700 | and this is the boundary of a chair,
00:51:22.140 | and this is the boundary of a horse, and so on.
00:51:24.500 | And this is too much top-down knowledge.
00:51:26.940 | How do humans do this?
00:51:30.380 | We manage with far less supervision.
00:51:34.140 | And we do it in a sort of bottom-up way,
00:51:36.380 | because, for example, we're looking at a video stream,
00:51:40.220 | and the horse moves.
00:51:42.340 | And that enables me to say
00:51:44.900 | that all these pixels are together.
00:51:47.260 | So the Gestalt psychologists used to call this
00:51:50.280 | the principle of common fate.
00:51:51.920 | So there was a bottom-up process
00:51:55.100 | by which we were able to segment out these objects.
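
A rough sketch of "common fate" as a bottom-up grouping cue: pixels that move together (similar optical flow) get grouped with no object labels at all. The frame paths are placeholders and the motion threshold is arbitrary.

```python
import cv2
import numpy as np

prev = cv2.cvtColor(cv2.imread("frame_0.png"), cv2.COLOR_BGR2GRAY)
curr = cv2.cvtColor(cv2.imread("frame_1.png"), cv2.COLOR_BGR2GRAY)

# Dense optical flow between two consecutive frames.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
magnitude = np.linalg.norm(flow, axis=2)        # how much each pixel moved

moving = (magnitude > 1.0).astype(np.uint8)     # pixels sharing a "common fate"
num_blobs, labels = cv2.connectedComponents(moving)
print(num_blobs - 1, "moving blobs segmented without any labels")
```
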
00:51:58.260 | And we have totally focused
00:52:01.500 | on this top-down training signal.
00:52:04.420 | So in my view, this is how we have currently resolved
00:52:07.860 | this top-down, bottom-up interaction in machine vision.
00:52:11.040 | But I don't find the solution fully satisfactory.
00:52:16.060 | And I would rather have a bit of both at both stages.
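A toy sketch of the common-fate grouping just described, purely illustrative and assuming SciPy is available: pixels whose appearance changes together between two frames (a crude stand-in for actual optical flow) are grouped into candidate objects with no labels involved.

```python
import numpy as np
from scipy import ndimage  # assumed available for connected-component labeling

def common_fate_entities(frame_t: np.ndarray, frame_t1: np.ndarray,
                         motion_threshold: float = 10.0) -> list[np.ndarray]:
    """Group 'moving together' pixels into candidate objects, with no supervision.

    frame_t, frame_t1: grayscale images of identical shape.
    Returns a list of boolean masks, one per connected moving blob.
    """
    # Crude stand-in for optical flow: large temporal differences mark motion.
    motion = np.abs(frame_t1.astype(np.float32) - frame_t.astype(np.float32)) > motion_threshold
    # Pixels that move together and touch each other become one entity.
    labels, num_blobs = ndimage.label(motion)
    return [labels == i for i in range(1, num_blobs + 1)]

# Example: a synthetic object moves into view between two frames.
frame_t = np.zeros((64, 64), dtype=np.float32)
frame_t1 = np.zeros((64, 64), dtype=np.float32)
frame_t1[20:30, 20:30] = 255.0
masks = common_fate_entities(frame_t, frame_t1)
print(f"{len(masks)} moving entity found")  # -> 1 moving entity found
```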
00:52:20.100 | - For all computer vision problems,
00:52:22.220 | not just segmentation.
00:52:24.140 | - And the question that you can ask is,
00:52:27.220 | so for me, I'm inspired a lot by human vision,
00:52:30.300 | and I care about that.
00:52:31.820 | You could be just a hard-boiled engineer and not give a damn.
00:52:35.480 | So to you, I would then argue
00:52:37.660 | that you would need far less training data
00:52:40.500 | if you could make my research agenda fruitful.
00:52:45.500 | - Okay, so maybe taking a step into segmentation,
00:52:51.660 | static scene understanding.
00:52:53.820 | What is the interaction
00:52:54.940 | between segmentation and recognition?
00:52:57.340 | You mentioned the movement of objects.
00:53:00.700 | So for people who don't know computer vision,
00:53:03.740 | segmentation is this weird activity
00:53:06.060 | that computer vision folks have all agreed
00:53:09.040 | is very important,
00:53:10.080 | of drawing outlines around objects versus a bounding box,
00:53:16.720 | and then classifying that object.
00:53:19.980 | What's the value of segmentation?
00:53:23.540 | What is it as a problem in computer vision?
00:53:27.180 | How is it fundamentally different
00:53:28.740 | from detection, recognition, and the other problems?
00:53:31.380 | - Yeah, so I think,
00:53:32.780 | so segmentation enables us to say
00:53:37.700 | that some set of pixels are an object
00:53:41.820 | without necessarily even being able to name that object
00:53:45.860 | or knowing properties of that object.
00:53:48.060 | - Oh, so you mean segmentation purely
00:53:50.780 | as the act of separating an object--
00:53:54.860 | - From its background.
00:53:55.700 | - A blob that's united in some way from its background.
00:54:00.700 | - Yeah, so entitification, if you will,
00:54:03.260 | making an entity out of it.
00:54:04.940 | - Entitification, beautifully.
00:54:06.820 | - So I think that we have that capability,
00:54:11.540 | and that enables us to,
00:54:16.260 | as we are growing up,
00:54:17.780 | to acquire names of objects
00:54:21.900 | with very little supervision.
00:54:23.660 | So suppose the child, let's posit
00:54:25.980 | that the child has this ability
00:54:27.340 | to separate out objects in the world.
00:54:29.860 | Then when the mother says, "Pick up your bottle,"
00:54:34.380 | or the cat's behaving funny today,
00:54:38.620 | the word cat suggests some object,
00:54:43.980 | and then the child sort of does the mapping.
00:54:46.260 | - Right. - Right?
00:54:47.380 | The mother doesn't have to teach
00:54:50.340 | specific object labels by pointing to them.
00:54:53.420 | Weak supervision works in the context
00:54:57.740 | that you have the ability to create objects.
00:55:01.460 | So I think that,
00:55:03.780 | so to me, that's a very fundamental capability.
00:55:07.660 | There are applications where this is very important,
00:55:10.980 | for example, medical diagnosis.
00:55:13.060 | So in medical diagnosis, you have some brain scan.
00:55:17.740 | I mean, this is some work that we did in my group
00:55:20.820 | where you have CT scans of people
00:55:23.140 | who have had traumatic brain injury,
00:55:25.460 | and what the radiologist needs to do
00:55:28.500 | is to precisely delineate various places
00:55:32.140 | where there might be bleeds, for example.
00:55:36.180 | And there are clear needs like that.
00:55:39.740 | So there are certainly very practical applications
00:55:43.340 | of computer vision where segmentation is necessary.
00:55:46.220 | But philosophically, segmentation enables
00:55:50.540 | the task of recognition to proceed
00:55:53.980 | with much weaker supervision than we require today.
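A toy illustration of that point (my own construction, not something from the conversation): once a learner can carve the scene into entities, plain word-to-entity co-occurrence counting across situations is enough to attach names, with no pointing and no per-pixel labels. The `observe` and `best_guess` helpers are hypothetical.

```python
from collections import defaultdict

cooccurrence = defaultdict(lambda: defaultdict(int))

def observe(spoken_words: list[str], visible_entities: list[str]) -> None:
    """One 'situation': some words are heard while some entities are in view."""
    for word in spoken_words:
        for entity in visible_entities:
            cooccurrence[word][entity] += 1

def best_guess(word: str) -> str:
    """The entity most often present when the word was heard."""
    return max(cooccurrence[word], key=cooccurrence[word].get)

# After many scenes like these...
observe(["pick", "up", "your", "bottle"], ["bottle", "cat", "table"])
observe(["the", "cat", "is", "funny"], ["cat", "sofa"])
observe(["where", "is", "the", "bottle"], ["bottle", "table"])
print(best_guess("bottle"))   # -> "bottle"
print(best_guess("cat"))      # -> "cat"
```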
00:55:57.820 | - And you think of segmentation as this kind of task
00:56:00.860 | that takes on a visual scene
00:56:03.460 | and breaks it apart into interesting entities
00:56:08.460 | that might be useful for whatever the task is.
00:56:11.260 | - Yeah.
00:56:12.660 | And it is not semantics-free.
00:56:14.620 | So I think, I mean, it blends into cognition;
00:56:18.700 | it involves both perception and cognition.
00:56:21.940 | It is not, I think the mistake that we used to make
00:56:26.540 | in the early days of computer vision
00:56:28.540 | was to treat it as a purely bottom-up perceptual task.
00:56:32.380 | It is not just that.
00:56:33.680 | Because we do revise our notion
00:56:37.700 | of segmentation with more experience, right?
00:56:41.620 | Because, for example, there are objects
00:56:43.580 | which are non-rigid, like animals or humans.
00:56:46.940 | And I think understanding that all the pixels of a human
00:56:51.820 | are one entity is actually quite a challenge
00:56:54.260 | because the parts of the human,
00:56:56.620 | they can move independently.
00:56:58.860 | The human wears clothes,
00:57:00.660 | so they might be differently colored.
00:57:02.700 | So it's all sort of a challenge.
00:57:05.520 | - You mentioned the three R's of computer vision
00:57:08.020 | are recognition, reconstruction, and reorganization.
00:57:12.140 | Can you describe these three R's and how they interact?
00:57:15.300 | - Yeah, so recognition is the easiest one
00:57:19.580 | because that's what I think people generally think of
00:57:24.520 | as computer vision achieving these days,
00:57:28.060 | which is labels.
00:57:30.380 | So is this a cat, is this a dog, is this a chihuahua?
00:57:35.180 | I mean, it could be very fine-grained,
00:57:37.940 | like a specific breed of a dog
00:57:40.900 | or a specific species of bird,
00:57:43.460 | or it could be very abstract, like animal.
00:57:46.980 | - But given a part of an image or a whole image,
00:57:49.940 | say, put a label on that.
00:57:51.460 | - Yeah, so that's recognition.
00:57:54.540 | Reconstruction is essentially,
00:57:59.180 | you can think of it as inverse graphics.
00:58:03.540 | I mean, that's one way to think about it.
00:58:07.140 | So graphics is you have some internal computer
00:58:10.460 | representation and you have a computer representation
00:58:14.900 | of some objects arranged in a scene,
00:58:17.340 | and what you do is you produce a picture.
00:58:20.080 | You produce the pixels corresponding
00:58:22.060 | to a rendering of that scene.
00:58:23.500 | So let's do the inverse of this.
00:58:28.860 | We are given an image and we try to,
00:58:31.060 | we say, oh, this image arises from some objects
00:58:38.420 | in a scene looked at with a camera from this viewpoint,
00:58:41.820 | and we might have more information about the objects,
00:58:44.180 | like their shape, maybe their textures,
00:58:47.500 | maybe color, et cetera, et cetera.
00:58:51.660 | So that's the reconstruction problem.
00:58:53.300 | In a way, you are, in your head,
00:58:57.220 | creating a model of the external world.
00:58:59.660 | Okay, reorganization is to do with,
00:59:04.780 | essentially, finding these entities.
00:59:07.540 | So it's organization.
00:59:12.380 | The word organization implies structure.
00:59:15.500 | So in perception, in psychology,
00:59:19.900 | we use the term perceptual organization,
00:59:22.620 | that an image is not internally represented
00:59:27.620 | as just a collection of pixels,
00:59:32.580 | but we make these entities.
00:59:34.780 | We create these entities, objects,
00:59:37.260 | whatever you want to call them.
00:59:38.100 | - And the relationship between the entities as well,
00:59:40.180 | or is it purely about the entities?
00:59:42.340 | - It could be about the relationships,
00:59:44.220 | but mainly we focus on the fact that there are entities.
00:59:47.660 | - So I'm trying to pinpoint what the organization means.
00:59:52.380 | - So organization is that instead of a uniform grid,
00:59:57.380 | we have the structure of objects.
01:00:02.020 | - So segmentation is a small part of that.
01:00:05.300 | - So segmentation gets us going towards that.
01:00:08.260 | - Yeah, and you kind of have this triangle
01:00:11.700 | where they all interact together.
01:00:13.540 | - Yes.
01:00:14.380 | - So how do you see that interaction
01:00:17.140 | in sort of reorganization is, yes,
01:00:22.140 | defining the entities in the world.
01:00:25.020 | The recognition is labeling those entities.
01:00:30.020 | And then reconstruction is what, filling in the gaps?
01:00:33.940 | - Well, to, for example, see,
01:00:36.140 | impute some 3D objects corresponding
01:00:40.660 | to each of these entities.
01:00:43.180 | That would be part of reconstruction.
01:00:44.380 | - So adding more information
01:00:45.620 | that's not there in the raw data.
01:00:47.820 | - Correct.
01:00:48.660 | I mean, I started pushing this kind of a view
01:00:54.460 | around 2010 or something like that,
01:00:58.060 | because at that time in computer vision,
01:01:01.020 | people were just working
01:01:06.020 | on many different problems,
01:01:08.480 | but they treated each of them as a separate,
01:01:10.460 | isolated problem,
01:01:11.980 | each with its own data set,
01:01:13.820 | and then you try to solve that and get good numbers on it.
01:01:16.940 | So I wasn't, I didn't like that approach
01:01:19.540 | because I wanted to see the connection between these.
01:01:23.540 | And if people divided up vision into various modules,
01:01:28.540 | the way they would do it is as low level,
01:01:31.820 | mid level, and high level vision,
01:01:33.460 | corresponding roughly to the psychologist's notion
01:01:37.220 | of sensation, perception, and cognition.
01:01:39.940 | And that didn't map to tasks
01:01:43.460 | that people cared about.
01:01:45.460 | Okay, so therefore I tried to promote
01:01:48.560 | this particular framework as a way of considering
01:01:51.860 | the problems that people in computer vision
01:01:53.700 | were actually working on,
01:01:55.500 | and trying to be more explicit about the fact
01:01:58.940 | that they actually are connected to each other.
01:02:01.300 | And I was at that time just doing this
01:02:05.620 | on the basis of information flow.
01:02:07.840 | Now it turns out in the last five years or so,
01:02:12.100 | in the post the deep learning revolution,
01:02:17.220 | that this architecture has turned out
01:02:20.420 | to be very conducive to that.
01:02:24.740 | Because basically in these neural networks,
01:02:28.020 | we are trying to build multiple representations.
01:02:31.080 | There can be multiple output heads
01:02:35.100 | sharing common representations.
01:02:37.160 | So in a certain sense, today, given the reality
01:02:41.080 | of what solutions people have to this,
01:02:43.220 | I do not need to preach this anymore.
01:02:48.220 | It is just there, it's part of the solution space.
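As a rough sketch of what that solution space looks like (my own minimal example, not any particular published system): one shared backbone feeding separate output heads for recognition, reconstruction, and reorganization. The layer sizes are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class ThreeRNet(nn.Module):
    """One shared representation with multiple output heads, one per 'R'."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Shared representation learned from pixels.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        # Recognition head: image-level labels.
        self.recognition = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes))
        # Reconstruction head: per-pixel depth, a stand-in for 3D structure.
        self.reconstruction = nn.Conv2d(64, 1, 1)
        # Reorganization head: per-pixel grouping / segmentation logits.
        self.reorganization = nn.Conv2d(64, 1, 1)

    def forward(self, x: torch.Tensor) -> dict[str, torch.Tensor]:
        features = self.backbone(x)
        return {
            "recognition": self.recognition(features),
            "reconstruction": self.reconstruction(features),
            "reorganization": self.reorganization(features),
        }

# Usage: all three outputs come from the same shared features.
outputs = ThreeRNet()(torch.randn(1, 3, 64, 64))
```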
01:02:52.460 | - So speaking of neural networks,
01:02:54.860 | how much of this problem of computer vision,
01:02:59.860 | of reorganization, recognition, and reconstruction,
01:03:05.440 | How much of it can be learned end to end, do you think?
01:03:12.620 | Sort of set it and forget it, just plug and play,
01:03:18.180 | have a giant data set, multiple perhaps, multi-modal,
01:03:22.140 | and then just learn the entirety of it.
01:03:25.500 | - Well, so I think that currently
01:03:28.620 | what end to end learning means nowadays
01:03:31.140 | is end to end supervised learning.
01:03:32.880 | And that I would argue is too narrow a view of the problem.
01:03:38.080 | I like this child development view,
01:03:42.460 | this lifelong learning view,
01:03:44.820 | one where there are certain capabilities that are built up
01:03:48.260 | and then there are certain capabilities
01:03:49.880 | which are built up on top of that.
01:03:53.180 | So that's what I believe in.
01:03:58.180 | So I think end to end learning in the supervised setting
01:04:03.600 | for a very precise task to me
01:04:14.020 | is sort of a limited view of the learning process.
01:04:18.100 | - Got it, so if we think about beyond purely supervised,
01:04:22.740 | look back to children.
01:04:24.700 | You mentioned six lessons that we can learn from children
01:04:28.820 | of be multi-modal, be incremental, be physical,
01:04:33.380 | explore, be social, use language.
01:04:36.380 | Can you speak to these, perhaps picking one
01:04:39.540 | that you find most fundamental to our time today?
01:04:42.740 | - Yeah, so I mean, I should say, to give credit,
01:04:46.180 | this is from a paper by Smith and Gasser.
01:04:49.940 | And it reflects essentially, I would say,
01:04:54.860 | common wisdom among child development people.
01:04:59.860 | It's just that this is not common wisdom
01:05:04.300 | among people in computer vision and AI and machine learning.
01:05:07.980 | So I view my role as trying to--
01:05:12.660 | - Bridge the two worlds.
01:05:13.940 | - Bridge the two worlds.
01:05:15.140 | So let's take an example of a multi-modal, I like that.
01:05:20.100 | So multi-modal, a canonical example is a child interacting
01:05:25.100 | with an object.
01:05:28.780 | So then the child holds a ball and plays with it.
01:05:32.540 | So at that point, it's getting a touch signal.
01:05:35.620 | So the touch signal is getting a notion of 3D shape,
01:05:40.620 | but it is sparse.
01:05:42.940 | And then the child is also seeing a visual signal.
01:05:46.660 | And these two signals, imagine, are
01:05:50.820 | in totally different spaces.
01:05:52.620 | So one is the space of receptors on the skin of the fingers
01:05:56.980 | and the thumb and the palm.
01:05:58.420 | And then these map onto these neuronal fibers
01:06:02.980 | that are getting activated somewhere.
01:06:05.300 | These lead to some activation in somatosensory cortex.
01:06:10.460 | I mean, a similar thing will happen if we have a robot hand.
01:06:14.700 | Okay, and then we have the pixels
01:06:17.060 | corresponding to the visual view,
01:06:19.260 | but we know that they correspond to the same object.
01:06:22.660 | So that's a very, very strong cross-calibration signal.
01:06:28.780 | And it is self-supervisory, which is beautiful.
01:06:32.380 | There's nobody assigning a label.
01:06:34.020 | The mother doesn't have to come and assign a label.
01:06:37.780 | The child doesn't even have to know
01:06:39.460 | that this object is called a ball.
01:06:41.220 | Okay, but the child is learning something
01:06:44.540 | about the three-dimensional world from this signal.
01:06:48.500 | I think tactile and visual, there is some work on.
01:06:53.580 | There is a lot of work currently on audio and visual.
01:06:56.300 | Okay, and audio-visual, so there is some event
01:07:00.380 | that happens in the world.
01:07:01.820 | And that event has a visual signature,
01:07:04.180 | and it has an auditory signature.
01:07:07.140 | So there is this glass bowl on the table,
01:07:09.060 | and it falls and breaks, and I hear the smashing sound,
01:07:12.540 | and I see the pieces of glass.
01:07:14.260 | Okay, I've built that connection between the two, right?
01:07:19.460 | We have people, I mean, this has become a hot topic
01:07:22.820 | in computer vision in the last couple of years.
01:07:25.500 | There are problems like separating out multiple speakers.
01:07:31.500 | - Right. - Which was a classic problem
01:07:33.900 | in audition, they call this the problem of source separation
01:07:37.620 | or the cocktail party effect and so on.
01:07:40.540 | But when you also have the visual signal,
01:07:44.820 | it becomes so much easier and so much more useful.
01:07:49.820 | - So the multimodal, I mean, there's so much more signal
01:07:55.020 | with multimodal, and you can use that
01:07:57.140 | for some kind of weak supervision as well.
01:08:00.300 | - Yes, because they are occurring at the same point in time.
01:08:03.180 | So you have time, which links the two, right?
01:08:06.140 | So at a certain moment, T1, you've got a certain signal
01:08:09.500 | in the auditory domain and a certain signal
01:08:11.340 | in the visual domain, but they must be causally related.
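A hedged sketch of that kind of time-based cross-modal self-supervision (the encoders are omitted, and the InfoNCE-style contrastive loss is one common choice rather than anything specific from the conversation): audio and video features from the same moment are treated as positive pairs, mismatched moments in the batch as negatives, with no human labels.

```python
import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(video_emb: torch.Tensor,
                                  audio_emb: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    """video_emb, audio_emb: (batch, dim); row i of each comes from the same
    moment in time, and the other rows in the batch serve as negatives."""
    v = F.normalize(video_emb, dim=1)
    a = F.normalize(audio_emb, dim=1)
    logits = v @ a.t() / temperature        # similarity of every video/audio pair
    targets = torch.arange(v.size(0))       # the time-aligned pair is the positive
    # Symmetric loss: video-to-audio and audio-to-video.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage with placeholder embeddings from hypothetical encoders.
loss = audio_visual_contrastive_loss(torch.randn(16, 128), torch.randn(16, 128))
```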
01:08:15.300 | - Yeah, it's an exciting area, not well studied yet.
01:08:18.500 | - Yeah, I mean, we have a little bit of work at this,
01:08:20.460 | but so much more needs to be done.
01:08:24.460 | So this is a good example.
01:08:28.140 | Be physical, that's to do with,
01:08:30.940 | as we talked about earlier,
01:08:32.980 | that there's an embodied world.
01:08:35.500 | - You mention language, use language.
01:08:39.260 | So Noam Chomsky believes that language
01:08:42.060 | may be at the core of cognition,
01:08:43.740 | at the core of everything in the human mind.
01:08:46.220 | What is the connection between language and vision to you?
01:08:50.140 | Like, what's more fundamental?
01:08:51.820 | Are they neighbors?
01:08:53.340 | Is one the parent and the child, the chicken and the egg?
01:08:58.020 | - Oh, it's very clear.
01:08:58.900 | It is vision, which is the parent.
01:09:01.180 | The parent is just the fundamental ability, okay?
01:09:04.140 | - Wait, wait, wait, wait.
01:09:05.580 | - So, so--
01:09:07.580 | - It comes before, you think vision
01:09:09.380 | is more fundamental than language.
01:09:10.900 | - Correct.
01:09:12.180 | And you can think of it either in phylogeny or in ontogeny.
01:09:17.180 | So phylogeny means, if you look at evolutionary time, right?
01:09:22.220 | So we have vision that developed 500 million years ago.
01:09:26.480 | Okay, then something like when we get to maybe
01:09:30.020 | like five million years ago,
01:09:31.700 | you have the first bipedal primates.
01:09:34.380 | So when we started to walk, then the hands became free.
01:09:38.700 | And so then manipulation, the ability to manipulate objects
01:09:42.580 | and build tools and so on and so forth.
01:09:45.140 | - So you said 500,000 years ago?
01:09:47.380 | - No, no, sorry.
01:09:48.220 | The first multicellular animals,
01:09:51.420 | which you can say had some intelligence,
01:09:55.420 | arose 500 million years ago.
01:09:58.080 | Okay, and now let's fast forward
01:10:01.160 | to say the last seven million years,
01:10:03.780 | which is the development of the hominid line, right?
01:10:07.040 | Where from the other primates,
01:10:09.160 | we have the branch which leads on to modern humans.
01:10:12.760 | Now, there are many of these hominids,
01:10:16.340 | but the ones which, you know, people talk about Lucy,
01:10:21.340 | because that's like a skeleton from three million years ago
01:10:24.920 | and we know that Lucy walked, okay?
01:10:28.480 | So at this stage, you have that the hand is free
01:10:31.840 | for manipulating objects.
01:10:33.640 | And then the ability to manipulate objects, build tools,
01:10:37.820 | and the brain size grew in this era.
01:10:43.400 | So, okay, so now you have manipulation.
01:10:46.040 | Now, we don't know exactly when language arose.
01:10:49.560 | - But after that. - But after that.
01:10:52.000 | Because no apes have language.
01:10:54.920 | I mean, Chomsky is correct in that
01:10:56.900 | it is a uniquely human capability,
01:10:59.400 | and other primates don't have that.
01:11:04.400 | So it developed somewhere in this era,
01:11:06.960 | but I would argue
01:11:11.400 | that it probably developed
01:11:13.040 | after we had this stage,
01:11:17.800 | the human species already able to manipulate,
01:11:21.600 | hands free, with a much bigger brain size.
01:11:25.360 | - And for that, there's a lot of vision
01:11:28.640 | has already had to have developed.
01:11:31.480 | - Yeah. - So the sensation
01:11:32.960 | and the perception, maybe some of the cognition.
01:11:36.060 | - Yeah, so these ancestors of ours,
01:11:41.060 | you know, three, four million years ago,
01:11:48.560 | they had spatial intelligence.
01:11:53.280 | So they knew that the world consists of objects.
01:11:56.200 | They knew that the objects
01:11:57.280 | were in certain relationships to each other.
01:11:59.660 | They had observed causal interactions among objects.
01:12:04.660 | They could move in space,
01:12:06.480 | so they had space and time and all of that.
01:12:08.900 | So language builds on that substrate.
01:12:13.040 | I mean, all human languages have constructs
01:12:19.880 | which depend on a notion of space and time.
01:12:22.580 | Where did that notion of space and time come from?
01:12:26.840 | It had to come from perception and action
01:12:29.640 | in the world we live in.
01:12:31.040 | - Yeah, what you've referred to as the spatial intelligence.
01:12:33.440 | - Yeah. - Yeah.
01:12:35.080 | So to linger a little bit, we mentioned Turing
01:12:38.960 | and his mention of we should learn from children.
01:12:44.320 | Nevertheless, language is the fundamental piece
01:12:47.360 | of the test of intelligence that Turing proposed.
01:12:50.480 | - Yes. - What do you think
01:12:51.320 | is a good test of intelligence?
01:12:53.840 | Are you, what would impress the heck out of you?
01:12:56.400 | Is it fundamentally natural language,
01:12:59.920 | or is there something in vision?
01:13:01.620 | - I don't think we should
01:13:07.120 | create a single test of intelligence.
01:13:09.000 | So just like I don't believe in IQ as a single number,
01:13:13.920 | I think generally there can be many capabilities
01:13:18.080 | which are correlated perhaps.
01:13:19.700 | So I think that there will be accomplishments
01:13:26.820 | which are visual accomplishments,
01:13:28.200 | accomplishments which are accomplishments
01:13:32.000 | in manipulation or robotics,
01:13:34.720 | and then accomplishments in language.
01:13:36.840 | I do believe that language will be the hardest nut to crack.
01:13:40.400 | - Really? - Yeah.
01:13:41.520 | - So what's harder, to pass the spirit of the Turing test,
01:13:45.400 | or like whatever formulation will make it natural language,
01:13:49.160 | convincingly a natural language,
01:13:51.120 | like somebody you would wanna have a beer with,
01:13:52.720 | hang out and have a chat with,
01:13:54.580 | or the general natural scene understanding?
01:13:59.220 | You think language is the top of the problem?
01:14:01.480 | - I'm not a fan of the Turing test.
01:14:05.520 | I think Turing, as he proposed the test
01:14:09.560 | in 1950, was trying to solve a certain problem.
01:14:13.880 | - Yeah, imitation. - Yeah.
01:14:15.720 | And I think it made a lot of sense then.
01:14:18.560 | Where we are today, 70 years later,
01:14:20.740 | I think we should not worry about that.
01:14:26.560 | I think the Turing test is no longer the right way
01:14:29.320 | to channel research in AI,
01:14:33.600 | because it takes us down this path of this chatbot
01:14:37.080 | which can fool us for five minutes or whatever.
01:14:39.720 | Okay, I think I would rather have a list
01:14:43.000 | of 10 different tasks.
01:14:44.400 | I mean, I think there are tasks which,
01:14:47.400 | there are tasks in the manipulation domain,
01:14:50.360 | tasks in navigation, tasks in visual scene understanding,
01:14:53.800 | tasks in reading a story
01:14:56.320 | and answering questions based on that.
01:14:58.920 | I mean, so my favorite language understanding task
01:15:03.320 | would be reading a novel
01:15:05.400 | and being able to answer arbitrary questions from it.
01:15:08.760 | Okay. - Right.
01:15:10.480 | - I think that to me,
01:15:12.960 | and this is not an exhaustive list by any means.
01:15:15.720 | So I think that's where we need to be going,
01:15:20.720 | and on each of these axes,
01:15:23.800 | there's a fair amount of work to be done.
01:15:26.060 | - So on the visual understanding side,
01:15:28.240 | in this Intelligence Olympics that we've set up,
01:15:31.080 | what's a good test for one of many
01:15:35.120 | of visual scene understanding?
01:15:38.120 | Do you think such benchmarks exist?
01:15:41.360 | Sorry to interrupt.
01:15:42.200 | - No, there aren't any.
01:15:43.680 | I think essentially, to me, a good test would be
01:15:46.760 | a really good aid to the blind.
01:15:50.840 | So suppose there was a blind person
01:15:53.360 | and I needed to assist the blind person.
01:15:56.120 | - So ultimately, like we said,
01:15:59.080 | vision that aids in the action,
01:16:01.200 | in the survival in this world.
01:16:03.520 | - Yeah.
01:16:04.360 | - Maybe in the simulated world.
01:16:07.160 | - Maybe easier to measure performance in a simulated world.
01:16:13.360 | What we are ultimately after
01:16:14.680 | is performance in the real world.
01:16:16.320 | - So David Hilbert in 1900 proposed 23 open problems
01:16:21.960 | in mathematics, some of which are still unsolved.
01:16:24.760 | Most important, famous of which
01:16:26.720 | is probably the Riemann hypothesis.
01:16:29.040 | You've thought about and presented about
01:16:30.960 | the Hilbert problems of computer vision.
01:16:33.120 | So let me ask, what do you think today?
01:16:36.600 | I don't know when you last presented that,
01:16:38.960 | 2015, but versions of it.
01:16:41.240 | You're kind of the face and the spokesperson
01:16:43.480 | for computer vision.
01:16:44.700 | It's your job to state what the open problems are
01:16:50.620 | for the field.
01:16:51.720 | So what today are the Hilbert problems
01:16:53.860 | of computer vision, do you think?
01:16:56.440 | - Let me pick one which I regard as clearly unsolved,
01:17:01.440 | which is what I would call long form video understanding.
01:17:07.400 | So we have a video clip and we want to understand
01:17:13.640 | the behavior in there in terms of agents,
01:17:19.080 | their goals, intentionality,
01:17:24.400 | and make predictions about what might happen.
01:17:28.320 | So that kind of understanding, which goes beyond
01:17:34.800 | atomic visual actions.
01:17:36.320 | So in the short range, the question is,
01:17:39.840 | are you sitting, are you standing,
01:17:41.220 | are you catching a ball?
01:17:42.440 | That we can do now.
01:17:45.960 | Even if we can't do it fully accurately,
01:17:48.120 | if we can do it at 50%, maybe next year we'll do it at 65
01:17:52.600 | and so forth.
01:17:53.840 | But I think the long range video understanding,
01:17:57.560 | I don't think we can do today.
01:18:01.640 | - And that means--
01:18:03.280 | - And it blends into cognition.
01:18:04.560 | That's the reason why it's challenging.
01:18:06.840 | - And so you have to track,
01:18:08.200 | you have to understand the entities,
01:18:11.640 | you have to track them,
01:18:13.480 | and you have to have some kind of model of their behavior.
01:18:16.920 | - Correct.
01:18:17.760 | And their behavior might be, these are agents,
01:18:21.600 | so they are not just like passive objects,
01:18:23.920 | but they're agents, so therefore,
01:18:26.160 | they would exhibit goal-directed behavior.
01:18:29.380 | Okay, so this is one area.
01:18:32.360 | Then I will talk about, say, understanding the world in 3D.
01:18:36.920 | Now this may seem paradoxical because in a way,
01:18:40.480 | we have been able to do 3D understanding
01:18:42.880 | even like 30 years ago, right?
01:18:45.680 | But I don't think we currently have the richness
01:18:48.640 | of 3D understanding in our computer vision system
01:18:52.160 | that we would like.
01:18:53.320 | So let me elaborate on that a bit.
01:18:57.480 | So currently, we have two kinds of techniques
01:19:01.360 | which are not fully unified.
01:19:03.240 | So there are the kinds of techniques
01:19:04.680 | from multi-view geometry,
01:19:06.800 | that you have multiple pictures of a scene
01:19:08.720 | and you do a reconstruction using stereoscopic vision
01:19:12.480 | or structure from motion.
01:19:14.560 | But these techniques
01:19:18.040 | totally fail if you just have a single view,
01:19:21.200 | because they are relying on this multiple-view geometry.
01:19:25.300 | Okay, then we have some techniques
01:19:28.120 | that we have developed in the computer vision community
01:19:30.360 | which try to guess 3D from single views.
01:19:34.240 | And these techniques are based on supervised learning
01:19:39.240 | and they are based on having at training time
01:19:42.760 | 3D models of objects available.
01:19:45.920 | And this is completely unnatural supervision, right?
01:19:49.880 | That's not, CAD models are not injected into your brain.
01:19:53.480 | Okay, so what would I like?
01:19:55.920 | What I would like would be a kind of
01:20:00.120 | learn-as-you-move-around-the-world notion of 3D.
01:20:05.120 | So we have our succession of visual experiences,
01:20:13.960 | and as part of that,
01:20:18.960 | I might see a chair from different viewpoints
01:20:21.600 | or a table from different viewpoints and so on.
01:20:24.800 | Now that enables me
01:20:27.800 | to build some internal representation.
01:20:31.120 | And then next time I just see a single photograph
01:20:35.240 | and it may not even be of that chair,
01:20:37.120 | it's of some other chair.
01:20:38.760 | And I have a guess of what its 3D shape is like.
01:20:42.080 | - So you're almost learning the CAD model kind of--
01:20:45.600 | - Yeah, implicitly.
01:20:46.960 | I mean, the CAD model need not be in the same form
01:20:50.240 | as used by computer graphics programs.
01:20:52.360 | - Hidden in the representation somehow.
01:20:53.800 | - It's hidden in the representation,
01:20:55.400 | the ability to predict new views
01:20:58.080 | and what I would see if I went to such and such position.
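A sketch, under assumptions of my own, of that "predict what I would see from a new position" objective: an encoder compresses one view into an implicit scene code, a decoder renders it from a different camera pose, and the only supervision is the photograph actually taken from that pose, with no CAD models involved. The layer sizes and the 32x32 image size are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class ViewPredictor(nn.Module):
    def __init__(self, latent_dim: int = 256, pose_dim: int = 6):
        super().__init__()
        # Image -> implicit scene code.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, latent_dim))
        # Scene code + target camera pose -> predicted image from that pose.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + pose_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1))

    def forward(self, source_view: torch.Tensor, target_pose: torch.Tensor) -> torch.Tensor:
        code = self.encoder(source_view)
        return self.decoder(torch.cat([code, target_pose], dim=1))

# Training signal: the image actually observed from target_pose.
model = ViewPredictor()
predicted = model(torch.randn(1, 3, 32, 32), torch.randn(1, 6))
# loss = torch.nn.functional.mse_loss(predicted, target_view)
```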
01:21:03.080 | - By the way, on a small tangent on that,
01:21:06.720 | are you okay or comfortable with the idea
01:21:11.880 | of neural networks
01:21:14.480 | that do achieve visual understanding,
01:21:16.280 | that do, for example, achieve this kind of 3D understanding,
01:21:19.160 | where you don't know how they do it,
01:21:23.720 | where you're not able to visualize or understand
01:21:28.440 | or interact with the representation?
01:21:31.040 | So the fact that they may not be explainable.
01:21:34.880 | - Yeah, I think that's fine.
01:21:37.160 | (laughing)
01:21:38.360 | To me, that is fine,
01:21:41.400 | but let me put some caveats on that.
01:21:44.440 | So it depends on the setting.
01:21:46.400 | So first of all, I think
01:21:47.920 | humans are not explainable.
01:21:55.520 | - Yeah, that's a really good point, yeah.
01:21:57.080 | - So one human to another human is not fully explainable.
01:22:02.080 | I think there are settings where explainability matters
01:22:06.720 | and these might be,
01:22:09.920 | for example, questions on medical diagnosis.
01:22:12.400 | So I'm in a setting where maybe the doctor,
01:22:17.400 | maybe a computer program has made a certain diagnosis.
01:22:21.240 | And then depending on the diagnosis,
01:22:23.760 | perhaps I should have treatment A or treatment B, right?
01:22:28.000 | So now, is the computer program's diagnosis based on data
01:22:35.120 | which was collected
01:22:38.520 | for American males who are in their 30s and 40s,
01:22:42.160 | and maybe not so relevant to me?
01:22:45.240 | Maybe it is relevant, you know, et cetera, et cetera.
01:22:48.320 | And I mean, in medical diagnosis,
01:22:50.340 | we have major issues to do with the reference class.
01:22:53.480 | So we may have acquired statistics from one group of people
01:22:56.600 | and applying it to a different group of people
01:22:59.520 | who may not share all the same characteristics.
01:23:02.760 | There might be error bars in the prediction.
01:23:07.520 | So that prediction should really be taken
01:23:10.240 | with a huge grain of salt.
01:23:12.780 | But this has an impact on what treatments
01:23:16.120 | should be picked, right?
01:23:20.040 | So there are settings where I want to know more
01:23:23.400 | than just this is the answer.
01:23:26.600 | But what I acknowledge is that,
01:23:30.920 | so in that sense,
01:23:32.080 | explainability and interpretability may matter.
01:23:34.960 | It's about giving error bounds
01:23:37.280 | and a better sense of the quality of the decision.
01:23:40.200 | Where I'm willing to sacrifice interpretability
01:23:46.280 | is that I believe that there can be systems
01:23:50.080 | which can be highly performant,
01:23:51.680 | but which are internally black boxes.
01:23:55.000 | - And that seems to be where it's headed.
01:23:57.780 | Some of the best performing systems
01:23:59.560 | are essentially black boxes.
01:24:01.080 | Fundamentally by their construction.
01:24:04.080 | - You and I are black boxes to each other.
01:24:06.360 | - Yeah, so the nice thing about the black boxes we are
01:24:09.680 | is that we ourselves are black boxes,
01:24:13.760 | but those of us who are charming
01:24:17.720 | are able to convince others,
01:24:19.480 | to explain what's going on inside the black box
01:24:23.400 | with narratives, with stories.
01:24:25.360 | So in some sense, neural networks
01:24:27.760 | don't have to actually explain what's going on inside.
01:24:31.640 | They just have to come up with stories,
01:24:33.360 | real or fake, that convince you
01:24:35.840 | that they know what's going on.
01:24:38.400 | - And I'm sure we can do that.
01:24:40.080 | We can create those stories.
01:24:42.440 | Neural networks can create those stories.
01:24:44.480 | - Yeah. (laughs)
01:24:47.320 | And the transformer will be involved.
01:24:49.940 | Do you think we will ever build a system
01:24:53.700 | of human level or superhuman level intelligence?
01:24:56.440 | We've kind of defined what it takes
01:24:58.560 | to try to approach that,
01:25:00.040 | but do you think that's within our reach?
01:25:02.600 | The thing that we thought we could do,
01:25:04.560 | what Turing thought we could actually do by the year 2000,
01:25:08.320 | right, do you think we'll ever be able to do it?
01:25:11.120 | - So I think there are two answers here.
01:25:12.760 | One answer is in principle, can we do this at some time?
01:25:17.760 | And my answer is yes.
01:25:19.600 | The second answer is a pragmatic one.
01:25:23.520 | Do you think we will be able to do it
01:25:25.120 | in the next 20 years or whatever?
01:25:27.640 | And to that my answer is no.
01:25:29.100 | So, and of course that's a wild guess.
01:25:32.560 | I think that, you know, Donald Rumsfeld
01:25:38.400 | is not a favorite person of mine,
01:25:40.000 | but one of his lines is very good,
01:25:42.120 | which is about known knowns, known unknowns,
01:25:46.300 | and unknown unknowns.
01:25:48.160 | So in the business we are in, there are known unknowns,
01:25:53.040 | and we have unknown unknowns.
01:25:54.920 | So I think with respect to a lot of what's the case
01:25:59.920 | in vision and robotics, I feel like we have known unknowns.
01:26:05.380 | So I have a sense of where we need to go
01:26:09.680 | and what the problems that need to be solved are.
01:26:12.220 | I feel with respect to natural language
01:26:17.000 | understanding and high-level cognition,
01:26:20.440 | it's not just known unknowns, but also unknown unknowns.
01:26:24.080 | So it is very difficult to put any kind of a time frame
01:26:28.200 | to that.
01:26:29.040 | - Do you think some of the unknown unknowns
01:26:33.720 | might be positive in that they'll surprise us
01:26:36.760 | and make the job much easier?
01:26:38.600 | So fundamental breakthroughs?
01:26:40.160 | - I think that is possible, because certainly
01:26:42.120 | I have been very positively surprised
01:26:44.640 | by how effective these deep learning systems have been.
01:26:50.000 | Because I certainly would not have believed that in 2010.
01:26:55.000 | I think what we knew from the mathematical theory
01:27:02.680 | was that convex optimization works:
01:27:06.040 | when there's a single global optimum,
01:27:07.760 | then these gradient descent techniques would work.
01:27:11.100 | Now these are nonlinear, non-convex systems.
01:27:16.000 | - Huge number of variables, so over-parameterized.
01:27:18.520 | - Over-parameterized, and the people
01:27:22.240 | who used to play with them a lot,
01:27:24.640 | the ones who are totally immersed in the lore
01:27:27.200 | and the black magic, they knew that they worked well,
01:27:32.200 | even though they were--
01:27:33.840 | - Really?
01:27:34.660 | I thought like everybody--
01:27:36.240 | - No, the claim that I hear from my friends
01:27:39.740 | like Yann LeCun and so forth is--
01:27:41.560 | - Oh, now, yeah.
01:27:42.480 | - That they feel that they were comfortable with them.
01:27:45.920 | - Well, he says that now.
01:27:46.760 | - The community as a whole was certainly not.
01:27:50.640 | And I think we were, to me that was the surprise,
01:27:54.800 | that they actually worked robustly
01:27:58.720 | for a wide range of problems,
01:28:01.240 | from a wide range of initializations and so on.
01:28:04.640 | And so that was certainly more rapid progress
01:28:09.640 | than we expected.
01:28:13.620 | But then there are certainly lots of times,
01:28:15.880 | in fact, most of the history of AI
01:28:18.520 | is when we have made less progress
01:28:21.400 | at a slower rate than we expected.
01:28:23.920 | So we just keep going.
01:28:27.480 | I think what I regard as really unwarranted
01:28:32.480 | are these fears of AGI in 10 years and 20 years
01:28:38.800 | and that kind of stuff,
01:28:41.400 | because that's based on completely unrealistic models
01:28:44.800 | of how rapidly we will make progress in this field.
01:28:47.680 | - So I agree with you,
01:28:49.960 | but I've also gotten the chance to interact
01:28:53.480 | with very smart people who really worry
01:28:55.520 | about the existential threats of AI.
01:28:57.600 | And I, as an open-minded person,
01:28:59.680 | am sort of taking it in.
01:29:02.640 | Do you think if AI systems,
01:29:08.920 | in some way, the unknown unknowns,
01:29:11.680 | not super intelligent AI,
01:29:12.960 | but in ways we don't quite understand
01:29:15.480 | the nature of super intelligence,
01:29:17.320 | will have a detrimental effect on society?
01:29:20.200 | Do you think this is something we should be worried about?
01:29:24.160 | Or we need to first allow the unknown unknowns
01:29:26.840 | to become known unknowns?
01:29:29.800 | - I think we need to be worried about AI today.
01:29:32.920 | I think that it is not just a worry we need to have
01:29:36.000 | when we get that AGI.
01:29:38.320 | I think that AI is being used in many systems today.
01:29:42.960 | And there might be settings, for example,
01:29:45.200 | when it causes biases or decisions which could be harmful.
01:29:50.200 | I mean, decisions which could be unfair to some people,
01:29:53.880 | or it could be a self-driving car which kills a pedestrian.
01:29:57.600 | So AI systems are being deployed today, right?
01:30:01.840 | And they're being deployed in many different settings,
01:30:03.840 | maybe in medical diagnosis, maybe in a self-driving car,
01:30:06.560 | maybe in selecting applicants for an interview.
01:30:09.880 | So I would argue that when these systems make mistakes,
01:30:14.880 | there are consequences.
01:30:17.860 | And we are in a certain sense,
01:30:19.980 | responsible for those consequences.
01:30:22.640 | So I would argue that this is a continuous effort.
01:30:26.320 | And this is something that in a way is not so surprising.
01:30:32.320 | It's true of all engineering and scientific progress:
01:30:35.840 | with great power comes great responsibility.
01:30:39.880 | So as these systems are deployed,
01:30:41.800 | we have to worry about them.
01:30:42.920 | And it's a continuous problem.
01:30:44.680 | I don't think of it as something
01:30:47.080 | which will suddenly happen on some day in 2079,
01:30:51.360 | for which I need to design some clever trick.
01:30:54.880 | I'm saying that these problems exist today.
01:30:58.280 | And we need to be continuously on the lookout
01:31:00.920 | for worrying about safety, biases, risks, right?
01:31:05.920 | I mean, a self-driving car kills a pedestrian.
01:31:09.840 | And they have, right?
01:31:11.640 | I mean, this Uber incident in Arizona, right?
01:31:16.000 | It has happened, right?
01:31:17.640 | This is not about AGI.
01:31:19.100 | In fact, it's about a very dumb intelligence,
01:31:22.480 | which is still killing people.
01:31:23.760 | - The worry people have with AGI is the scale.
01:31:28.360 | But I think you're 100% right.
01:31:31.360 | like the thing that worries me about AI today,
01:31:34.560 | and it's happening in a huge scale,
01:31:36.120 | is recommender systems, recommendation systems.
01:31:39.240 | So if you look at Twitter or Facebook or YouTube,
01:31:42.620 | they're controlling the ideas that we have access to,
01:31:47.620 | the news and so on.
01:31:50.480 | And that's a fundamentally machine learning algorithm
01:31:52.520 | behind each of these recommendations.
01:31:55.160 | And they, I mean, my life would not be the same
01:31:58.420 | without these sources of information.
01:32:00.840 | I'm a totally new human being.
01:32:02.320 | And the ideas that I know are very much
01:32:05.500 | because of the internet,
01:32:06.840 | because of the algorithm that recommend those ideas.
01:32:09.600 | And so as they get smarter and smarter,
01:32:12.320 | I mean, that is the AGI.
01:32:13.920 | - Yeah.
01:32:14.760 | - That's the algorithm that's recommending
01:32:18.080 | the next YouTube video you should watch,
01:32:21.180 | and it has control of millions, billions of people.
01:32:25.780 | That algorithm is already super intelligent
01:32:28.860 | and has complete control of the population.
01:32:32.220 | Not complete, but very strong control.
01:32:34.860 | For now, we can turn off YouTube.
01:32:36.820 | We can just go have a normal life outside of that.
01:32:39.780 | But the more and more that gets into our life,
01:32:43.140 | it's that algorithm, we start depending on it
01:32:46.980 | and the different companies that are working on the algorithm.
01:32:48.940 | So I think it's, you're right.
01:32:50.020 | It's already there.
01:32:52.580 | And YouTube in particular is using computer vision,
01:32:57.060 | doing their hardest to try to understand
01:32:59.780 | the content of videos so they could be able
01:33:03.060 | to connect videos with the people
01:33:05.380 | who would benefit from those videos the most.
01:33:07.900 | And so that development could go
01:33:10.740 | in a bunch of different directions,
01:33:12.140 | some of which might be harmful.
01:33:13.760 | So yeah, you're right.
01:33:16.180 | The threats of AI are here already
01:33:18.740 | and we should be thinking about them.
01:33:21.140 | - On a philosophical notion,
01:33:24.000 | if you could, personal perhaps,
01:33:28.020 | if you could relive a moment in your life outside of family
01:33:31.940 | because it made you truly happy
01:33:33.860 | or it was a profound moment
01:33:36.340 | that impacted the direction of your life,
01:33:38.620 | what moment would you go to?
01:33:40.340 | - I don't think of single moments,
01:33:45.780 | but I look over the long haul.
01:33:49.280 | I feel that I've been very lucky
01:33:51.960 | because I feel that,
01:33:54.700 | I think that in scientific research,
01:33:57.780 | a lot of it is about being at the right place
01:34:01.700 | at the right time.
01:34:03.380 | And you can work on problems at a time
01:34:06.780 | when they're just too premature.
01:34:10.340 | You know, you butt your head against them
01:34:12.620 | and nothing happens because it's,
01:34:15.540 | the prerequisites for success are not there.
01:34:19.700 | And then there are times when you are in a field
01:34:21.860 | which is all pretty mature
01:34:24.700 | and you can only add curlicues upon curlicues.
01:34:29.700 | I've been lucky to have been in this field,
01:34:32.340 | which for 34 years,
01:34:35.180 | well, actually 34 years as a professor at Berkeley,
01:34:37.980 | so longer than that,
01:34:40.420 | which when I started in it was just like some little crazy,
01:34:45.420 | absolutely useless field,
01:34:49.980 | couldn't really do anything
01:34:53.220 | to a time when it's really, really
01:34:55.660 | solving a lot of practical problems,
01:34:59.460 | has offered a lot of tools for scientific research,
01:35:03.360 | because computer vision is impactful
01:35:06.340 | for images in biology or astronomy and so on and so forth.
01:35:11.340 | And we have, so we have made great scientific progress
01:35:15.660 | which has had real practical impact in the world.
01:35:19.300 | And I feel lucky that I got in at a time
01:35:23.820 | when the field was very young
01:35:27.180 | and at a time when it is,
01:35:29.540 | it's now mature but not fully mature.
01:35:33.740 | It's mature but not done.
01:35:35.660 | I mean, it's really still in a productive phase.
01:35:39.060 | - Yeah, I think people 500 years from now
01:35:42.140 | would laugh at you calling this field mature.
01:35:44.500 | - Yeah, that is very possible, yeah.
01:35:47.460 | - So, but you're also, lest I forget to mention,
01:35:50.580 | you've also mentored some of the biggest names
01:35:53.960 | of computer vision, computer science, and AI today.
01:35:56.820 | There's so many questions I could ask,
01:36:00.540 | but it really is, what is it, how did you do it?
01:36:04.280 | What does it take to be a good mentor?
01:36:06.540 | What does it take to be a good guide?
01:36:08.400 | - Yeah, I think what I feel,
01:36:12.700 | I've been lucky to have had very, very smart
01:36:16.900 | and hardworking and creative students.
01:36:18.980 | I think some part of the credit
01:36:21.740 | just belongs to being at Berkeley.
01:36:25.500 | Those of us who are at top universities are blessed
01:36:29.180 | because we have very, very smart
01:36:32.780 | and capable students coming and knocking on our door.
01:36:36.380 | So I have to be humble enough to acknowledge that.
01:36:40.320 | But what have I added?
01:36:42.000 | I think I have added something.
01:36:43.920 | What I have added is, I think what I've always tried
01:36:48.760 | to teach them is a sense of picking the right problems.
01:36:53.680 | So I think that in science, in the short run,
01:37:00.040 | success is always based on technical competence.
01:37:04.480 | You're quick with math or you're whatever.
01:37:09.080 | I mean, there's certain technical capabilities
01:37:11.920 | which make for short-range progress.
01:37:15.400 | Long-range progress is really determined
01:37:18.220 | by asking the right questions
01:37:20.440 | and focusing on the right problems.
01:37:22.840 | And I feel that what I've been able to bring to the table
01:37:28.680 | in terms of advising these students
01:37:31.320 | is some sense of taste of what are good problems,
01:37:36.320 | what are problems that are worth attacking now
01:37:39.280 | as opposed to waiting 10 years.
01:37:41.480 | - What's a good problem, if you could summarize?
01:37:44.240 | Is that possible to even summarize?
01:37:46.040 | Like, what's your sense of a good problem?
01:37:48.200 | - I think I have a sense of what is a good problem,
01:37:52.280 | which is there's a British scientist,
01:37:56.440 | in fact, he won a Nobel Prize, Peter Medawar,
01:37:59.400 | who has a book on this.
01:38:02.560 | And basically, he calls it
01:38:05.400 | the art of the soluble: research is the art of the soluble.
01:38:08.320 | So we need to sort of find problems
01:38:11.640 | which are not yet solved, but which are approachable.
01:38:16.640 | And he sort of refers to this sense
01:38:22.160 | that there is this problem which isn't quite solved yet,
01:38:24.960 | but it has a soft underbelly.
01:38:27.320 | There is some place where you can spear the beast.
01:38:32.320 | And having that intuition that this problem is ripe
01:38:37.160 | is a good thing, because otherwise,
01:38:39.280 | you can just beat your head and not make progress.
01:38:42.260 | So I think that is important.
01:38:45.800 | So if I have that and if I can convey that to students,
01:38:49.640 | it's not just that they do great research
01:38:52.560 | while they're working with me,
01:38:54.080 | but that they continue to do great research.
01:38:56.280 | So in a sense, I'm proud of my students
01:38:59.080 | and their achievements and their great research,
01:39:01.280 | even 20 years after they've ceased being my student.
01:39:05.720 | - So it's in part developing,
01:39:06.960 | helping them develop that sense
01:39:08.800 | that a problem is not yet solved, but it's solvable.
01:39:12.440 | - Correct.
01:39:13.560 | The other thing which I have,
01:39:15.520 | which I think I bring to the table,
01:39:17.800 | is a certain intellectual breadth.
01:39:22.280 | I've spent a fair amount of time studying psychology,
01:39:27.280 | neuroscience, relevant areas of applied math and so forth.
01:39:31.240 | So I can probably help them see some connections
01:39:35.880 | to disparate things which they might not have otherwise.
01:39:40.880 | So the smart students coming into Berkeley
01:39:45.040 | can be very deep, in the sense that
01:39:48.960 | they can think very deeply,
01:39:50.320 | meaning very hard down one particular path,
01:39:54.200 | but where I could help them is the shallow breadth,
01:39:59.200 | where they would otherwise have only the narrow depth,
01:40:04.520 | and that's of some value.
01:40:09.160 | - Well, it was beautifully refreshing just to hear you
01:40:12.640 | naturally jump to psychology, back to computer science
01:40:15.760 | in this conversation back and forth.
01:40:17.480 | I mean, that's actually a rare quality,
01:40:20.400 | and I think it's certainly for students empowering
01:40:23.600 | to think about problems in a new way.
01:40:25.440 | So for that and for many other reasons,
01:40:27.960 | I really enjoyed this conversation.
01:40:29.300 | Thank you so much, it was a huge honor.
01:40:30.780 | Thanks for talking to me.
01:40:31.920 | - It's been my pleasure.
01:40:33.120 | - Thanks for listening to this conversation
01:40:36.360 | with Jitendra Malik, and thank you to our sponsors,
01:40:39.560 | BetterHelp and ExpressVPN.
01:40:42.960 | Please consider supporting this podcast
01:40:45.040 | by going to betterhelp.com/lex
01:40:48.520 | and signing up at expressvpn.com/lexpod.
01:40:52.720 | Click the links, buy the stuff.
01:40:55.300 | It's how they know I sent you,
01:40:56.800 | and it really is the best way to support this podcast
01:41:00.080 | and the journey I'm on.
01:41:02.260 | If you enjoy this thing, subscribe on YouTube,
01:41:04.800 | review it with five stars on Apple Podcasts,
01:41:07.120 | support it on Patreon, or connect with me on Twitter
01:41:10.600 | at Lex Friedman.
01:41:12.120 | Don't ask me how to spell that.
01:41:13.360 | I don't remember it myself.
01:41:15.600 | And now let me leave you with some words
01:41:17.680 | from Prince Mishkin in "The Idiot" by Dostoevsky.
01:41:21.800 | Beauty will save the world.
01:41:23.740 | Thank you for listening, and hope to see you next time.
01:41:27.740 | (upbeat music)
01:41:30.320 | (upbeat music)