Jitendra Malik: Computer Vision | Lex Fridman Podcast #110
Chapters
0:00 Introduction
3:17 Computer vision is hard
10:05 Tesla Autopilot
21:20 Human brain vs computers
23:14 The general problem of computer vision
29:09 Images vs video in computer vision
37:47 Benchmarks in computer vision
40:06 Active learning
45:34 From pixels to semantics
52:47 Semantic segmentation
57:05 The three R's of computer vision
1:02:52 End-to-end learning in computer vision
1:04:24 6 lessons we can learn from children
1:08:36 Vision and language
1:12:30 Turing test
1:16:17 Open problems in computer vision
1:24:49 AGI
1:35:47 Pick the right problem
00:00:00.000 |
The following is a conversation with Jitendra Malik, 00:00:03.480 |
a professor at Berkeley and one of the seminal figures 00:00:17.360 |
and has mentored many world-class researchers 00:00:24.440 |
Two sponsors, one new one, which is BetterHelp 00:00:43.080 |
It really is the best way to support this podcast 00:00:47.240 |
If you enjoy this thing, subscribe on YouTube, 00:00:52.040 |
support it on Patreon or connect with me on Twitter 00:00:55.240 |
at Lex Fridman, however the heck you spell that. 00:01:05.080 |
This show is sponsored by BetterHelp, spelled H-E-L-P, help. 00:01:16.960 |
with a licensed professional therapist in under 48 hours. 00:01:24.200 |
it's professional counseling done securely online. 00:01:28.200 |
I'm a bit from the David Goggins line of creatures, 00:01:30.600 |
as you may know, and so have some demons to contend with, 00:01:43.960 |
but I think suffering is essential for creation. 00:01:53.840 |
can help in this, so it's at least worth a try. 00:01:59.640 |
It's easy, private, affordable, available worldwide. 00:02:05.340 |
and schedule weekly audio and video sessions. 00:02:22.640 |
To support this podcast and to get an extra three months free 00:02:32.640 |
I think ExpressVPN is the best VPN out there. 00:02:36.000 |
They told me to say it, but it happens to be true. 00:02:43.280 |
Literally just one big, sexy power on button. 00:02:47.560 |
Again, for obvious reasons, it's really important 00:02:57.120 |
Shout out to my favorite flavor of Linux, Ubuntu MATE 20.04. 00:03:06.080 |
to support this podcast and to get an extra three months free 00:03:12.140 |
And now, here's my conversation with Jitendra Malik. 00:03:17.920 |
In 1966, Seymour Papert at MIT wrote up a proposal 00:03:22.920 |
called the Summer Vision Project to be given, 00:03:31.240 |
So that proposal outlined many of the computer vision tasks 00:03:41.080 |
and perhaps still underestimate how hard computer vision is? 00:03:56.160 |
gives us the sense that, oh, this must be very easy 00:04:08.800 |
However, if you go into neuroscience or psychology 00:04:14.200 |
of human vision, then the complexity becomes very clear. 00:04:19.000 |
The fact is that a very large part of the cerebral cortex 00:04:26.000 |
I mean, and this is true in other primates as well. 00:04:33.220 |
or psychology perspective, it becomes quite clear 00:04:39.600 |
- You said the higher-level parts are the harder parts? 00:04:58.200 |
Whereas when you are proving a mathematical theorem 00:05:03.200 |
or playing chess, the difficulty is much more evident 00:05:07.880 |
because it is your conscious brain which is processing 00:05:12.960 |
various aspects of the problem-solving behavior. 00:05:30.060 |
that as computer vision researchers, for example, 00:05:43.800 |
We'll talk a little bit about autonomous driving, 00:05:45.680 |
for example, how hard of a vision task that is. 00:05:48.640 |
Do you think, I mean, is it just human nature 00:05:55.000 |
or is there something fundamental to the vision problem 00:06:32.520 |
where getting 50% of the solution you can get in one minute, 00:06:52.540 |
It seems that language, people are not so confident about. 00:07:06.300 |
that we have to be able to do natural language understanding. 00:07:10.540 |
For vision, it seems that we're not cognizant 00:07:15.540 |
or we don't think about how much understanding is required. 00:07:22.420 |
how much understanding is required to solve vision? 00:07:29.460 |
how much something called common sense reasoning 00:07:47.060 |
with what we could call maybe peripheral processing. 00:08:07.940 |
And I think they made a big deal out of this, 00:08:11.500 |
and they wanted to just study only perception 00:08:13.820 |
and then dismiss certain problems as being, quote, cognitive. 00:08:18.820 |
But really, I think these are artificial divides. 00:08:31.060 |
they work better at the lower and mid levels of the problem. 00:08:43.140 |
in many real applications, we have to confront them. 00:08:59.060 |
a pessimist on fully autonomous driving in the near future. 00:09:04.060 |
And the reason is because I think there will be 00:09:11.900 |
where quite sophisticated cognitive reasoning is called for. 00:09:20.300 |
first of all, they are much more, they are robust, 00:09:28.380 |
For example, let's say you're doing image search. 00:09:33.380 |
You're trying to get images based on some description, 00:09:43.900 |
I mean, when Google Image Search gives you some images back 00:10:06.100 |
- So just for the fun of it, since you mentioned, 00:10:09.460 |
let's go there briefly about autonomous vehicles. 00:10:23.980 |
with eight cameras and basically a single neural network, 00:10:34.780 |
but is forming the same representation at the core. 00:10:37.640 |
Do you think driving can be converted in this way 00:10:47.940 |
Or even more specifically, in the current approach, 00:10:52.580 |
what do you think about what Tesla Autopilot team is doing? 00:10:59.500 |
there are certainly subsets of the visual-based 00:11:05.380 |
So for example, driving in freeway conditions 00:11:22.020 |
In the '90s, there were approaches from Carnegie Mellon, 00:11:25.660 |
there were approaches from our team at Berkeley. 00:11:28.700 |
In the 2000s, there were approaches from Stanford, 00:11:33.100 |
So autonomous driving in certain settings is very doable. 00:11:45.380 |
At that point, it's not just a question of vision 00:11:54.180 |
- So where do you think most of the difficult cases, 00:11:57.620 |
to me, even the highway driving is an open problem 00:12:00.100 |
because it applies the same 50, 90, 95, 99 rule 00:12:05.100 |
or the first step, the fallacy of the first step, 00:12:12.100 |
I think even highway driving has a lot of elements 00:12:23.740 |
So you're really going to feel the edge cases. 00:12:26.540 |
So I think even highway driving is really difficult. 00:12:32.160 |
do you think vision is the fundamental problem 00:12:42.700 |
the ability to, and then like the middle ground, 00:12:47.660 |
which is trying to predict the behavior of others, 00:12:58.220 |
of the actors in the scene and predict their behavior. 00:13:03.860 |
because to me, perception blends into cognition 00:13:07.180 |
and building predictive models of other agents in the world, 00:13:11.320 |
which could be other agents, could be people, 00:13:17.420 |
because perception always has to not tell us what is now, 00:13:23.280 |
but what will happen because what's now is boring. 00:13:27.740 |
We care about the future because we act in the future. 00:13:33.280 |
- And we care about the past in as much as it informs 00:13:38.940 |
- So I think we have to build predictive models 00:13:41.240 |
of behaviors of people and those can get quite complicated. 00:14:11.800 |
because obviously these systems are always being improved 00:14:50.520 |
for a pedestrian, a typical behavior for a pedestrian 00:14:53.760 |
was not the typical behavior for a skateboarder, right? 00:15:04.360 |
you need to have enough data where your pedestrians, 00:15:11.640 |
what kinds of patterns of behavior they have. 00:15:37.960 |
do you think it will look similar to what we have today, 00:15:41.600 |
but have a lot more data, perhaps more compute, 00:15:47.120 |
like neural, well, in the case of Tesla Autopilot, 00:15:49.880 |
is neural networks, do you think it will look similar? 00:16:05.300 |
So, and this is my general philosophical position 00:16:17.160 |
in computer vision in the deep learning paradigm 00:16:24.020 |
and tabula rasa learning in a supervised way, 00:16:34.880 |
given a series of experiences in this setting, 00:16:57.600 |
but at the age of 16, they're already visual geniuses, 00:17:04.640 |
they have built a certain repertoire of vision. 00:17:07.560 |
In fact, most of it has probably been achieved by age two. 00:17:16.200 |
they know that the world is three-dimensional. 00:17:32.100 |
So they have built that up from their observations 00:17:43.920 |
So then, at age 16, when they go into driver ed, 00:17:49.280 |
They're not learning afresh the visual world. 00:17:58.860 |
They are learning how to be smooth about control, 00:18:03.920 |
They're learning a sense of typical traffic situations. 00:18:07.900 |
Now, that education process can be quite short, 00:18:12.900 |
because they are coming in as visual geniuses. 00:18:27.280 |
I may not have had to deal with a skateboarder. 00:18:45.180 |
even though I did not encounter this in my driver ed class. 00:18:50.000 |
is because I have all this general visual knowledge 00:18:54.540 |
- And do you think the learning mechanisms we have today 00:18:59.900 |
can do that kind of long-term accumulation of knowledge? 00:19:17.100 |
worked on this kind of accumulation of knowledge. 00:19:20.200 |
Do you think neural networks can do the same? 00:19:22.100 |
- I think I don't see any in-principle problem 00:19:33.660 |
So the current learning techniques that we have 00:19:43.300 |
X, Y, pairs, and you learn the functional mapping 00:19:48.580 |
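As a minimal sketch of that supervised setup (assuming scikit-learn; the data here is synthetic and purely illustrative): a fixed dataset of (X, y) pairs, and a model fit to the mapping between them.

```python
# The supervised setup in its barest form: a dataset of (X, y) pairs
# and a model fit to the functional mapping between them.
# Sketch only, assuming scikit-learn; data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.randn(1000, 20)            # inputs, e.g. image features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # labels supplied by a "teacher"

model = LogisticRegression().fit(X, y)   # learn the X -> y mapping
print(model.score(X, y))                 # everything hinges on labeled pairs
```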
I think that human learning is far richer than that. 00:19:54.660 |
There is a child explores the world and sees us. 00:20:16.520 |
but the learning data has been arranged by the child. 00:20:24.100 |
Child can do various experiments with the world. 00:20:27.460 |
So there are many aspects of sort of human learning, 00:20:33.580 |
and these have been studied in child development 00:20:39.300 |
And what they tell us is that supervised learning 00:20:45.340 |
There are many different aspects of learning. 00:20:48.580 |
And what we would need to do is to develop models 00:21:05.300 |
- Some of which might imitate the human brain. 00:21:12.860 |
in terms of the difference in the human brain, 00:21:20.660 |
- Do you think there's something interesting, 00:21:25.380 |
in the computational power of the human brain 00:21:36.620 |
so this is a point I've been making for 20 years now. 00:21:46.540 |
we just didn't have the computing power of the human brain. 00:22:10.060 |
Whereas in silicon, you have much faster devices, 00:22:13.980 |
transistors switch at on the order of nanoseconds, 00:22:25.860 |
we do have, if you consider the latest GPUs and so on, 00:22:31.660 |
And if we look back at Hans Moravec's type of calculations, 00:22:40.860 |
in terms of computing power comparable to the brain, 00:22:51.300 |
the style of computing that we have in our GPUs 00:22:59.660 |
in the human brain or other biological entities. 00:23:10.100 |
in order to build actual real-world systems of large scale. 00:23:25.980 |
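A rough sketch of how Moravec-style comparisons are assembled (every constant below is a commonly cited ballpark assumption, not a figure from this conversation; the point is how sensitive the conclusion is to the assumptions):

```python
# Back-of-envelope sketch of brain-vs-GPU comparisons of the Moravec type.
# All constants are rough, commonly cited ballpark assumptions, chosen only
# to show how such estimates are put together.

NEURONS = 1e11             # ~10^11 neurons in the human brain
SYNAPSES_PER_NEURON = 1e4  # ~10^3-10^4 synapses per neuron
MAX_RATE_HZ = 1e2          # millisecond-scale switching, ~100 Hz peak
AVG_RATE_HZ = 1e0          # average firing rates are far lower, ~1 Hz

upper = NEURONS * SYNAPSES_PER_NEURON * MAX_RATE_HZ  # ~1e17 synaptic ops/s
lower = NEURONS * SYNAPSES_PER_NEURON * AVG_RATE_HZ  # ~1e15 synaptic ops/s

GPU_FLOPS = 1e14           # order of magnitude for a recent data-center GPU

print(f"brain estimate: {lower:.0e} .. {upper:.0e} ops/s")
print(f"one GPU:        {GPU_FLOPS:.0e} FLOP/s")
```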
So if you look at the computer vision conferences 00:23:29.580 |
it's often separated into different little segments, 00:23:36.060 |
into whether segmentation, 3D reconstruction, 00:23:46.660 |
But if you were to sort of philosophically say, 00:24:03.860 |
I always go back to sort of biology or humans. 00:24:09.500 |
And if you think about vision or perception in that setting, 00:24:14.100 |
we realize that perception is always to guide action. 00:24:30.260 |
which arose in the Cambrian era 500 million years ago. 00:24:49.020 |
because you can get food in different places. 00:24:54.260 |
And that's really about perception or seeing. 00:24:57.980 |
I mean, vision is perhaps the single most perception sense, 00:25:01.740 |
but all the others are equally, are also important. 00:25:05.940 |
So perception and action kind of go together. 00:25:10.060 |
So earlier it was in these very simple feedback loops 00:25:17.220 |
or avoiding becoming food if there's a predator running, 00:25:33.700 |
perception became more and more sophisticated 00:25:48.100 |
and build a model of the external world inside the head. 00:25:56.900 |
And psychologists have great fun in pointing out 00:26:01.620 |
is not a perfect model of the external world. 00:26:17.780 |
that exists in an animal 500 million years ago. 00:26:22.780 |
Once we have these very sophisticated visual systems, 00:26:30.660 |
It's we as scientists who are imposing that structure 00:26:34.180 |
where we have chosen to characterize this part of the system 00:26:41.940 |
or, quote, "this module of 3D reconstruction." 00:26:44.980 |
What's going on is really all of these processes 00:26:56.300 |
because originally their purpose was, in fact, 00:27:00.940 |
- So as a guiding general statement of a problem, 00:27:03.900 |
do you think we can say that the general problem 00:27:14.700 |
Do you think we should also say that ultimately 00:27:17.180 |
that the goal, the problem of computer vision 00:27:27.460 |
- Yes, I think that's the most fundamental purpose. 00:27:41.900 |
For example, judging the aesthetic value of a painting. 00:27:49.100 |
Maybe it's guiding action in terms of how much money 00:27:55.900 |
But the basics are, in fact, in terms of action. 00:28:10.140 |
but perhaps it is fundamentally about action. 00:28:20.220 |
that drives a lot of the development in this world 00:28:26.540 |
If you watch Netflix, if you enjoy watching movies, 00:28:29.500 |
you're using your perception system to interpret the movie. 00:28:46.820 |
- Well, certainly with respect to interactions with firms. 00:29:16.660 |
so many of the breakthroughs that you've been a part of 00:29:28.540 |
the community is looking at dynamic, at video, 00:29:35.220 |
which is dynamic, but also where you actually have a robot 00:29:39.340 |
in the physical world interacting based on that vision. 00:30:04.060 |
or making the problem harder by focusing on images? 00:30:09.100 |
I think sometimes we can simplify a problem so much 00:30:23.400 |
And one could reasonably argue that, to some extent, 00:30:28.020 |
this happens when we go from video to single images. 00:30:31.360 |
Now, historically, you have to consider the limits 00:30:35.500 |
imposed by the computation capabilities we had. 00:30:50.620 |
can be understood as choices which were forced upon us 00:30:55.620 |
by the fact that we just didn't have access to compute, 00:31:04.100 |
- Exactly, not enough compute, not enough storage. 00:31:09.420 |
So one of the choices is focusing on single images 00:31:25.580 |
So you have an image, which is, say, 256 by 256 pixels, 00:31:29.660 |
and instead of keeping around the grayscale value, 00:31:35.460 |
find the places where the brightness changes a lot, 00:31:48.580 |
and the logic was humans can interpret a line drawing, 00:31:58.020 |
So many of the choices were dictated by that. 00:32:00.940 |
I think today we are no longer detecting edges, right? 00:32:05.940 |
We process images with ConvNets because we don't need to. 00:32:10.780 |
We don't have those compute restrictions anymore. 00:32:16.280 |
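A minimal sketch of the edge-based representation described above, assuming NumPy and SciPy (the threshold is arbitrary): keep only the places where brightness changes a lot, and discard the raw grayscale values.

```python
# Minimal sketch of the classic idea described above: throw away raw
# grayscale values and keep only where brightness changes sharply.
# Assumes NumPy and SciPy; the threshold is arbitrary.
import numpy as np
from scipy import ndimage

def edge_map(image: np.ndarray, threshold: float = 30.0) -> np.ndarray:
    """Return a binary edge map from a 2D grayscale image."""
    gx = ndimage.sobel(image.astype(float), axis=1)  # horizontal gradient
    gy = ndimage.sobel(image.astype(float), axis=0)  # vertical gradient
    magnitude = np.hypot(gx, gy)                     # brightness change
    return magnitude > threshold                     # keep only strong edges

# A 256x256 image stored as edge/no-edge bits is a tiny fraction of the
# storage needed for the full grayscale values.
img = np.random.randint(0, 256, (256, 256))
edges = edge_map(img)
print(edges.mean())  # fraction of pixels kept as edges
```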
because video compute is still quite challenging 00:32:22.320 |
I think video computing is not so challenging 00:32:35.560 |
and they still struggle doing stuff on video. 00:32:42.140 |
that are essentially the techniques you used in the '90s, 00:32:48.620 |
- No, that's when you want to do things at scale. 00:32:53.700 |
of all the content of YouTube, it's very challenging, 00:32:59.260 |
But as a researcher, you have more opportunities. 00:33:06.940 |
networks with relatively large video datasets, yeah. 00:33:10.540 |
- Yes, so I think that this is part of the reason 00:33:20.460 |
I see a lot more progress happening in video. 00:33:35.820 |
because you can take some of the challenging video datasets, 00:33:39.020 |
and their performance on action classification 00:34:01.080 |
- Let me ask a similar question I've already asked, 00:34:07.440 |
do you think some kind of injection of knowledge bases 00:34:18.840 |
If we solve the general action recognition problem, 00:34:27.820 |
what do you think the solution would look like? 00:34:30.740 |
- So I completely agree that knowledge is called for, 00:34:35.740 |
and that knowledge can be quite sophisticated. 00:34:58.700 |
Now, the things that happen in a certain order, 00:35:10.900 |
eventually, bill arrives, et cetera, et cetera. 00:35:14.020 |
There's a classic example of AI from the 1970s. 00:35:19.020 |
There was the term frames and scripts and schemas. 00:35:27.140 |
Okay, and in the '70s, the way the AI of the time 00:35:34.220 |
So they hand-coded in this notion of a script 00:35:42.060 |
and used that to interpret, for example, language. 00:35:49.220 |
involving some people eating at a restaurant, 00:35:56.220 |
because you know what happens typically at a restaurant. 00:35:59.220 |
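A hand-coded script of the kind being described, in the spirit of the 1970s restaurant schema, might look like the toy structure below (purely illustrative, not any particular historical system):

```python
# Toy illustration of a hand-coded "script" in the 1970s sense described
# above (a restaurant schema). Purely illustrative.
RESTAURANT_SCRIPT = {
    "roles": ["customer", "waiter", "cook", "cashier"],
    "props": ["table", "menu", "food", "bill", "money"],
    "scenes": [
        ("enter", "customer enters and is seated at a table"),
        ("order", "customer reads the menu and orders food"),
        ("eat",   "cook prepares food; waiter brings it; customer eats"),
        ("pay",   "bill arrives; customer pays the cashier"),
        ("exit",  "customer leaves the restaurant"),
    ],
}

def expected_next(scene: str) -> str:
    """Given the current scene, return the script's default expectation."""
    names = [name for name, _ in RESTAURANT_SCRIPT["scenes"]]
    i = names.index(scene)
    return names[i + 1] if i + 1 < len(names) else "done"

print(expected_next("order"))  # -> "eat"
```

The appeal and the brittleness are both visible here: expectations come for free, but only for situations someone thought to write down by hand.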
So I think this kind of knowledge is absolutely essential. 00:36:13.540 |
I think the kinds of technology that we have right now 00:36:16.180 |
with 3D convolutions over a couple of seconds 00:36:28.340 |
Long-term understanding requires a notion of, 00:36:35.340 |
perhaps some notions of goals, intentionality, 00:36:45.980 |
So we could either revert back to the '70s and say, 00:37:02.940 |
because I think learning-based ways land up being more robust. 00:37:06.620 |
And there must be a learning version of the story 00:37:09.220 |
because children acquire a lot of this knowledge 00:37:21.300 |
it's possible, but I think it's not so typical 00:37:27.900 |
through all the stages of what happens in a restaurant. 00:37:31.860 |
they go to the restaurant, they eat, come back, 00:37:35.620 |
and the child goes through 10 such experiences, 00:37:44.900 |
we need to provide that capability to our systems. 00:38:16.980 |
If I think about the benchmarks we have before us, 00:38:24.460 |
they're often kind of trying to get to the adult. 00:38:31.060 |
What kind of tests for computer vision do you think 00:38:33.180 |
we should have that mimic the child's in computer vision? 00:38:54.860 |
So that gets into issues of privacy and so on and so forth. 00:39:08.500 |
So what's the child's linguistic environment? 00:39:17.060 |
and then develop learning schemes based on that data, 00:39:23.660 |
I think that's a very promising direction myself. 00:39:31.140 |
that we could just short circuit this in some way. 00:39:38.900 |
we have had success by not imitating nature in detail. 00:40:08.540 |
of learning like a child is the interactivity. 00:40:22.180 |
What are your thoughts about this whole space 00:40:33.980 |
And I think that we could achieve it in two ways 00:41:01.580 |
The robot learns its body by doing a series of actions. 00:41:21.680 |
our group has worked on something called Habitat, 00:41:27.080 |
which is a visually photorealistic environment 00:41:43.940 |
So I can now, you can imagine that subsequent generations 00:41:49.900 |
of these simulators will be accurate, not just visually, 00:42:03.200 |
And then we have that environment to play with. 00:42:22.880 |
So this is something which is of a great deal 00:42:27.160 |
I mean, people like Judea Pearl have talked a lot about 00:42:38.580 |
of deep learning as just curve fitting, right? 00:42:49.360 |
but causality is not like a single silver bullet. 00:43:01.560 |
one of our most reliable ways of establishing causal links, 00:43:15.340 |
and now in some situation you perform an action, 00:43:25.640 |
performing controlled experiments all the time, right? 00:43:32.080 |
and, but that is a way that the child gets to build 00:43:47.400 |
"The Scientist in the Crib," referring to children. 00:43:50.760 |
So I like, the part that I like about that is 00:43:54.280 |
the scientist wants to do, wants to build causal models, 00:43:58.880 |
and the scientist does controlled experiments. 00:44:03.720 |
So to enable that, we will need to have these, 00:44:12.700 |
some in the real world and some in simulation. 00:44:34.400 |
the principles of what it means to exist in the world 00:44:39.520 |
- I don't see any fundamental problems there. 00:44:42.600 |
the computer graphics community has come a long way. 00:44:45.360 |
So in the early days, going back to the '80s and '90s, 00:44:54.480 |
but they couldn't do stuff like hair or fur and so on. 00:45:01.040 |
Then they couldn't do physical actions, right? 00:45:04.360 |
Like there's a bowl of glass and it falls down 00:45:29.960 |
we will find ways of making our models ever more realistic. 00:46:21.520 |
that there's a lot of redundancy in these images, 00:46:26.520 |
and as a result, we are able to do a lot of compression. 00:46:36.900 |
So you might have 10 to the eight photoreceptors 00:46:40.100 |
and only 10 to the six fibers in the optic nerve, 00:47:10.700 |
is just how successful is image compression, right? 00:47:14.700 |
And there are, and that's been done with older technologies, 00:47:21.260 |
there are several companies which are trying to use 00:47:25.700 |
sort of these more advanced neural network type techniques 00:47:50.580 |
that's really about image statistics and video statistics. 00:47:53.620 |
- But that's still not doing compression of the kind 00:48:02.680 |
- Yeah, so this is at the lower level, right? 00:48:13.060 |
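At their simplest, the neural-network compression techniques alluded to above look like an autoencoder squeezed through a narrow bottleneck; the sketch below assumes PyTorch and leaves out the quantization and entropy coding a real codec would need.

```python
# Minimal sketch of neural-network-style image compression: an autoencoder
# squeezes an image through a narrow bottleneck and is trained to reconstruct
# it. Real codecs add quantization and entropy coding; illustrative only.
import torch
import torch.nn as nn

class TinyImageCodec(nn.Module):
    def __init__(self, bottleneck: int = 64):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),   # 256 -> 128
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),  # 128 -> 64
            nn.Conv2d(32, bottleneck, 4, stride=2, padding=1),     # 64 -> 32
        )
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(bottleneck, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),
        )

    def forward(self, x):
        code = self.encode(x)            # the compressed representation
        return self.decode(code), code

model = TinyImageCodec()
x = torch.randn(1, 3, 256, 256)
recon, code = model(x)
loss = ((recon - x) ** 2).mean()         # reconstruction objective
```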
you mentioned how far can bottom-up image segmentation go, 00:48:26.680 |
Maybe this is a good time to elaborate on that, 00:48:29.780 |
maybe define what is bottom-up, what is top-down 00:48:55.540 |
and they end up with something like cat or not a cat, right? 00:49:00.500 |
So our systems are running totally feed-forward. 00:49:07.440 |
So they're trained by saying, okay, this is a cat, 00:49:10.140 |
there's a cat, there's a dog, there's a zebra, et cetera. 00:49:12.980 |
And I'm not happy with either of these choices fully. 00:49:20.660 |
because we have completely separated these processes, right? 00:49:34.060 |
So in biology, what we know is that the processes 00:49:45.420 |
And they involve much shallower neural networks. 00:49:52.580 |
in computer vision, say a ResNet 50, has 50 layers. 00:49:59.540 |
going from the retina to IT, maybe we have like seven, right? 00:50:14.820 |
with the more ambiguous stimuli, for example. 00:50:18.160 |
So the biological solution seems to involve feedback. 00:50:23.160 |
The solution in artificial vision seems to be 00:50:27.840 |
just feed-forward, but with a much deeper network. 00:50:35.100 |
which just has like three rounds of feedback, 00:50:37.500 |
you can just unroll it and make it three times the depth 00:51:10.300 |
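The unrolling argument can be made concrete with a small sketch (assuming PyTorch): a block applied for three rounds of feedback computes exactly what a feed-forward stack three times as deep, with shared weights, would compute.

```python
# Sketch of the unrolling argument above, assuming PyTorch: a block applied
# recurrently for three rounds of feedback computes the same function as a
# feed-forward stack three times as deep that shares weights across rounds.
import torch
import torch.nn as nn

class RecurrentBlock(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, rounds: int = 3) -> torch.Tensor:
        h = x
        for _ in range(rounds):          # feedback: the output re-enters the block
            h = torch.relu(self.conv(h) + x)
        return h

block = RecurrentBlock()
x = torch.randn(1, 64, 32, 32)

# "Unrolled" view: the same three applications written out feed-forward.
h = x
for _ in range(3):
    h = torch.relu(block.conv(h) + x)

assert torch.allclose(block(x, rounds=3), h)
```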
we make use of a lot of top-down knowledge right now. 00:51:22.140 |
and this is the boundary of a horse, and so on. 00:51:36.380 |
because, for example, we're looking at a video stream, 00:51:47.260 |
So the Gestalt psychologists used to call this 00:51:55.100 |
by which we were able to segment out these objects. 00:52:07.860 |
in machine vision, this top-down, bottom-up interaction. 00:52:11.040 |
But I don't find the solution fully satisfactory. 00:52:16.060 |
And I would rather have a bit of both at both stages. 00:52:27.220 |
so for me, I'm inspired a lot by human vision, 00:52:31.820 |
You could be just a hard-boiled engineer and not give a damn. 00:52:40.500 |
if you could make my research agenda fruitful. 00:52:45.500 |
- Okay, so maybe taking a step into segmentation, 00:53:00.700 |
So for people who don't know computer vision, 00:53:10.080 |
of drawing outlines around objects versus a bounding box, 00:53:28.740 |
from detection, recognition, and the other problems? 00:53:41.820 |
without necessarily even being able to name that object 00:53:55.700 |
- A blob that's united in some way from its background. 00:54:29.860 |
Then when the mother says, "Pick up your bottle," 00:55:03.780 |
so to me, that's a very fundamental capability. 00:55:07.660 |
There are applications where this is very important, 00:55:13.060 |
So in medical diagnosis, you have some brain scan. 00:55:17.740 |
I mean, this is some work that we did in my group 00:55:39.740 |
So there are certainly very practical applications 00:55:43.340 |
of computer vision where segmentation is necessary. 00:55:53.980 |
with much weaker supervision than we require today. 00:55:57.820 |
- And you think of segmentation as this kind of task 00:56:03.460 |
and breaks it apart into interesting entities 00:56:08.460 |
that might be useful for whatever the task is. 00:56:21.940 |
It is not, I think the mistake that we used to make 00:56:28.540 |
was to treat it as a purely bottom-up perceptual task. 00:56:46.940 |
And I think understanding that all the pixels of a human 00:57:05.520 |
- You mentioned the three R's of computer vision 00:57:08.020 |
are recognition, reconstruction, and reorganization. 00:57:12.140 |
Can you describe these three R's and how they interact? 00:57:19.580 |
because that's what I think people generally think of 00:57:30.380 |
So is this a cat, is this a dog, is this a chihuahua? 00:57:46.980 |
- But given a part of an image or a whole image, 00:58:07.140 |
So graphics is you have some internal computer 00:58:10.460 |
representation and you have a computer representation 00:58:31.060 |
we say, oh, this image arises from some objects 00:58:38.420 |
in a scene looked at with a camera from this viewpoint, 00:58:41.820 |
and we might have more information about the objects, 00:59:22.620 |
that the world is not just, an image is not just seen as, 00:59:27.620 |
is not internally represented as just a collection of pixels, 00:59:38.100 |
- And the relationship between the entities as well, 00:59:44.220 |
but mainly we focus on the fact that there are entities. 00:59:47.660 |
- So I'm trying to pinpoint what the organization means. 00:59:52.380 |
- So organization is that instead of a uniform grid, 01:00:05.300 |
- So segmentation gets us going towards that. 01:00:30.020 |
And then reconstruction is what, filling in the gaps? 01:00:48.660 |
I mean, I started pushing this kind of a view 01:01:01.020 |
the distinction that people were just working 01:01:13.820 |
and then you try to solve that and get good numbers on it. 01:01:19.540 |
because I wanted to see the connection between these. 01:01:23.540 |
And if people divided up vision into various modules, 01:01:33.460 |
corresponding roughly to the psychologist's notion 01:01:48.560 |
this particular framework as a way of considering 01:01:55.500 |
and trying to be more explicit about the fact 01:01:58.940 |
that they actually are connected to each other. 01:02:07.840 |
Now it turns out in the last five years or so, 01:02:28.020 |
we are trying to build multiple representations. 01:02:37.160 |
So in a certain sense, today, given the reality 01:02:48.220 |
It is just there, it's part of the solution space. 01:02:59.860 |
of reorganization, recognition, can be reconstruction? 01:03:05.440 |
How much of it can be learned end to end, do you think? 01:03:12.620 |
Sort of set it and forget it, just plug and play, 01:03:18.180 |
have a giant data set, multiple perhaps, multi-modal, 01:03:32.880 |
And that I would argue is too narrow a view of the problem. 01:03:44.820 |
one where there are certain capabilities that are built up 01:03:58.180 |
So I think end to end learning in the supervised setting 01:04:14.020 |
is sort of a limited view of the learning process. 01:04:18.100 |
- Got it, so if we think about beyond purely supervised, 01:04:24.700 |
You mentioned six lessons that we can learn from children 01:04:28.820 |
of be multi-modal, be incremental, be physical, 01:04:39.540 |
that you find most fundamental to our time today? 01:04:42.740 |
- Yeah, so I mean, I should say to give you credit, 01:04:54.860 |
common wisdom among child development people. 01:05:04.300 |
among people in computer vision and AI and machine learning. 01:05:15.140 |
So let's take an example of a multi-modal, I like that. 01:05:20.100 |
So multi-modal, a canonical example is a child interacting 01:05:28.780 |
So then the child holds a ball and plays with it. 01:05:32.540 |
So at that point, it's getting a touch signal. 01:05:35.620 |
So the touch signal is getting a notion of 3D shape, 01:05:42.940 |
And then the child is also seeing a visual signal. 01:05:52.620 |
So one is the space of receptors on the skin of the fingers 01:05:58.420 |
And then these map onto these neuronal fibers 01:06:05.300 |
These lead to some activation in somatosensory cortex. 01:06:10.460 |
I mean, a similar thing will happen if we have a robot hand. 01:06:19.260 |
but we know that they correspond to the same object. 01:06:22.660 |
So that's a very, very strong cross-calibration signal. 01:06:28.780 |
And it is self-supervisory, which is beautiful. 01:06:34.020 |
The mother doesn't have to come and assign a label. 01:06:44.540 |
about the three-dimensional world from this signal. 01:06:48.500 |
I think tactile and visual, there is some work on. 01:06:53.580 |
There is a lot of work currently on audio and visual. 01:06:56.300 |
Okay, and audio-visual, so there is some event 01:07:09.060 |
and it falls and breaks, and I hear the smashing sound, 01:07:14.260 |
Okay, I've built that connection between the two, right? 01:07:19.460 |
We have people, I mean, this has become a hot topic 01:07:22.820 |
in computer vision in the last couple of years. 01:07:25.500 |
There are problems like separating out multiple speakers. 01:07:33.900 |
in audition, they call this the problem of source separation 01:07:40.540 |
But just try to do it visually when you also have, 01:07:44.820 |
it becomes so much easier and so much more useful. 01:07:49.820 |
- So the multimodal, I mean, there's so much more signal 01:08:00.300 |
- Yes, because they are occurring at the same time in time. 01:08:03.180 |
So you have time, which links the two, right? 01:08:06.140 |
So at a certain moment, T1, you've got a certain signal 01:08:11.340 |
in the visual domain, but they must be causally related. 01:08:15.300 |
- Yeah, it's an exciting area, not well studied yet. 01:08:18.500 |
- Yeah, I mean, we have a little bit of work at this, 01:08:46.220 |
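One common way to turn "the two signals occur at the same moment" into a training signal, assuming PyTorch, is a contrastive loss over co-occurring audio and visual embeddings; this is a generic sketch, not a description of any specific system mentioned here.

```python
# Sketch of cross-modal self-supervision via temporal co-occurrence, assuming
# PyTorch: embeddings of audio and video from the same moment are pulled
# together, mismatched pairs pushed apart. No labels are needed; time
# alignment is the supervision. Illustrative only.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(audio_emb: torch.Tensor,
                                 video_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, video_emb: (batch, dim); row i of each comes from time T_i."""
    a = F.normalize(audio_emb, dim=1)
    v = F.normalize(video_emb, dim=1)
    logits = a @ v.t() / temperature        # similarity of every audio/video pair
    targets = torch.arange(a.size(0))       # the matching pair is the co-occurring one
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = cross_modal_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```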
What is the connection between language and vision to you? 01:08:53.340 |
Is one the parent and the child, the chicken and the egg? 01:09:01.180 |
The parent is just the fundamental ability, okay? 01:09:12.180 |
And you can think of it either in phylogeny or in ontogeny. 01:09:17.180 |
So phylogeny means, if you look at evolutionary time, right? 01:09:22.220 |
So we have vision that developed 500 million years ago. 01:09:26.480 |
Okay, then something like when we get to maybe 01:09:34.380 |
So when we started to walk, then the hands became free. 01:09:38.700 |
And so then manipulation, the ability to manipulate objects 01:10:03.780 |
which is the development of the hominid line, right? 01:10:09.160 |
we have the branch which leads on to modern humans. 01:10:16.340 |
but the ones which, you know, people talk about Lucy, 01:10:21.340 |
because that's like a skeleton from three million years ago 01:10:28.480 |
So at this stage, you have that the hand is free 01:10:33.640 |
And then the ability to manipulate objects, build tools, 01:10:46.040 |
Now, we don't know exactly when language arose. 01:10:59.400 |
and we, primates, other primates don't have that. 01:11:17.800 |
I mean, the human species already able to manipulate 01:11:32.960 |
and the perception, maybe some of the cognition. 01:11:36.060 |
- Yeah, so we, so those, so that, so the world, 01:11:53.280 |
So they knew that the world consists of objects. 01:11:59.660 |
They had observed causal interactions among objects. 01:12:22.580 |
Where did that notion of space and time come from? 01:12:31.040 |
- Yeah, what you've referred to as the spatial intelligence. 01:12:35.080 |
So to linger a little bit, we mentioned Turing 01:12:38.960 |
and his mention of we should learn from children. 01:12:44.320 |
Nevertheless, language is the fundamental piece 01:12:47.360 |
of the test of intelligence that Turing proposed. 01:12:53.840 |
Are you, what would impress the heck out of you? 01:13:01.620 |
- I think I wouldn't, I don't think we should have 01:13:09.000 |
So just like I don't believe in IQ as a single number, 01:13:13.920 |
I think generally there can be many capabilities 01:13:19.700 |
So I think that there will be accomplishments 01:13:36.840 |
I do believe that language will be the hardest nut to crack. 01:13:41.520 |
- So what's harder, to pass the spirit of the Turing test, 01:13:45.400 |
or like whatever formulation will make it natural language, 01:13:51.120 |
like somebody you would wanna have a beer with, 01:13:59.220 |
You think language is the top of the problem? 01:14:05.520 |
I think Turing test, that Turing as he proposed the test 01:14:09.560 |
in 1950 was trying to solve a certain problem. 01:14:26.560 |
I think the Turing test is no longer the right way 01:14:33.600 |
because it takes us down this path of this chatbot 01:14:37.080 |
which can fool us for five minutes or whatever. 01:14:50.360 |
tasks in navigation, tasks in visual scene understanding, 01:14:58.920 |
I mean, so my favorite language understanding task 01:15:05.400 |
and being able to answer arbitrary questions from it. 01:15:12.960 |
and this is not an exhaustive list by any means. 01:15:15.720 |
So I would, I think that that's where we need to be going to 01:15:28.240 |
in this Intelligence Olympics that we've set up, 01:16:07.160 |
- Maybe easier to measure performance in a simulated world. 01:16:16.320 |
- So David Hilbert in 1900 proposed 23 open problems 01:16:21.960 |
in mathematics, some of which are still unsolved. 01:16:36.600 |
I don't know when the last year you presented that, 01:16:44.700 |
It's your job to state what the open problems are 01:16:56.440 |
- Let me pick one which I regard as clearly unsolved, 01:17:01.440 |
which is what I would call long form video understanding. 01:17:07.400 |
So we have a video clip and we want to understand 01:17:24.400 |
and make predictions about what might happen. 01:17:28.320 |
So that kind of understanding which goes away 01:17:48.120 |
if we can do it at 50%, maybe next year we'll do it at 65 01:17:53.840 |
But I think the long range video understanding, 01:18:13.480 |
and you have to have some kind of model of their behavior. 01:18:17.760 |
And their behavior might be, these are agents, 01:18:32.360 |
Then I will talk about, say, understanding the world in 3D. 01:18:36.920 |
Now this may seem paradoxical because in a way, 01:18:45.680 |
But I don't think we currently have the richness 01:18:48.640 |
of 3D understanding in our computer vision system 01:18:57.480 |
So currently, we have two kinds of techniques 01:19:08.720 |
and you do a reconstruction using stereoscopic vision 01:19:18.040 |
they totally fail if you just have a single view 01:19:21.200 |
because they are relying on this multiple-view geometry. 01:19:28.120 |
that we have developed in the computer vision community 01:19:34.240 |
And these techniques are based on supervised learning 01:19:39.240 |
and they are based on having at training time 01:19:45.920 |
And this is completely unnatural supervision, right? 01:19:49.880 |
That's not, CAD models are not injected into your brain. 01:19:55.920 |
What I would like would be a kind of learning 01:20:05.120 |
So we have our succession of visual experiences 01:20:18.960 |
I might see a chair from different viewpoints 01:20:21.600 |
or a table from different viewpoints and so on. 01:20:31.120 |
And then next time I just see a single photograph 01:20:38.760 |
And I have a guess of what its 3D shape is like. 01:20:42.080 |
- So you're almost learning the CAD model kind of-- 01:20:46.960 |
I mean, the CAD model need not be in the same form 01:20:58.080 |
and what I would see if I went to such and such position. 01:21:16.280 |
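One way to phrase "build 3D from seeing objects from several viewpoints, then guess the shape from a single photograph" as a learning objective, assuming PyTorch and known relative poses between views, is a view-consistency loss; the sketch below is schematic (a real system would use a correspondence-free distance such as Chamfer, plus reprojection terms, to rule out degenerate solutions).

```python
# Sketch of learning 3D from multiple views without CAD supervision, assuming
# PyTorch and known relative poses. The model predicts a 3D point set from a
# single image; training asks that predictions from two views of the same
# object agree once one is mapped into the other's frame. Illustrative only:
# real systems add reprojection losses to avoid trivial constant solutions.
import torch
import torch.nn as nn

class SingleViewTo3D(nn.Module):
    def __init__(self, num_points: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_points * 3),
        )
        self.num_points = num_points

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.encoder(image).view(-1, self.num_points, 3)

def view_consistency_loss(model, img_a, img_b, R_ab, t_ab):
    """R_ab (3x3), t_ab (3,): transform taking view A's frame into view B's."""
    pts_a = model(img_a)                       # 3D guess from view A
    pts_b = model(img_b)                       # 3D guess from view B
    pts_a_in_b = pts_a @ R_ab.t() + t_ab       # express A's guess in B's frame
    return ((pts_a_in_b - pts_b) ** 2).mean()  # the two guesses should agree

# At test time a single photograph suffices: model(new_image) -> 3D points.
```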
that do, for example, achieve this kind of 3D understanding 01:21:19.160 |
and you don't know how they, you don't know the, 01:21:31.040 |
So the fact that they're not or may not be explainable. 01:21:57.080 |
- So we, one human to another human is not fully explainable. 01:22:02.080 |
I think there are settings where explainability matters 01:22:17.400 |
maybe a computer program has made a certain diagnosis. 01:22:23.760 |
perhaps I should have treatment A or treatment B, right? 01:22:28.000 |
So now is the computer program's diagnosis based on data, 01:22:38.520 |
for American males who are in their 30s and 40s, 01:22:45.240 |
Maybe it is relevant, you know, et cetera, et cetera. 01:22:50.340 |
we have major issues to do with the reference class. 01:22:53.480 |
So we may have acquired statistics from one group of people 01:22:56.600 |
and applying it to a different group of people 01:22:59.520 |
who may not share all the same characteristics. 01:23:20.040 |
So there are settings where I want to know more 01:23:32.080 |
explainability and interpretability may matter. 01:23:37.280 |
and a better sense of the quality of the decision. 01:23:40.200 |
Where I'm willing to sacrifice interpretability 01:24:06.360 |
- Yeah, so the nice thing about the black boxes we are 01:24:13.760 |
but we're also, those of us who are charming, 01:24:19.480 |
like explain what's going on inside the black box 01:24:27.760 |
don't have to actually explain what's going on inside. 01:24:53.700 |
of human level or superhuman level intelligence? 01:25:04.560 |
what Turing thought actually we could do by year 2000, 01:25:08.320 |
right, do you think we'll ever be able to do? 01:25:12.760 |
One answer is in principle, can we do this at some time? 01:25:48.160 |
So in the business we are in, there are known unknowns, 01:25:54.920 |
So I think with respect to a lot of what's the case 01:25:59.920 |
in vision and robotics, I feel like we have known unknowns. 01:26:09.680 |
and what the problems that need to be solved are. 01:26:20.440 |
it's not just known unknowns, but also unknown unknowns. 01:26:24.080 |
So it is very difficult to put any kind of a time frame 01:26:33.720 |
might be positive in that they'll surprise us 01:26:40.160 |
- I think that is possible, because certainly 01:26:44.640 |
by how effective these deep learning systems have been. 01:26:50.000 |
Because I certainly would not have believed that in 2010. 01:26:55.000 |
I think what we knew from the mathematical theory 01:27:07.760 |
then these gradient descent techniques would work. 01:27:11.100 |
Now these are nonlinear, non-convex systems. 01:27:16.000 |
- Huge number of variables, so over-parameterized. 01:27:24.640 |
the ones who are totally immersed in the lore 01:27:27.200 |
and the black magic, they knew that they worked well, 01:27:42.480 |
- That they feel that they were comfortable with them. 01:27:46.760 |
- The community as a whole was certainly not. 01:27:50.640 |
And I think we were, to me that was the surprise, 01:28:01.240 |
from a wide range of initializations and so on. 01:28:04.640 |
And so that was certainly more rapid progress 01:28:32.480 |
are these fears of AGI in 10 years and 20 years 01:28:41.400 |
because that's based on completely unrealistic models 01:28:44.800 |
of how rapidly we will make progress in this field. 01:29:20.200 |
Do you think this is something we should be worried about? 01:29:24.160 |
Or we need to first allow the unknown unknowns 01:29:29.800 |
- I think we need to be worried about AI today. 01:29:32.920 |
I think that it is not just a worry we need to have 01:29:38.320 |
I think that AI is being used in many systems today. 01:29:45.200 |
when it causes biases or decisions which could be harmful. 01:29:50.200 |
I mean, decisions which could be unfair to some people, 01:29:53.880 |
or it could be a self-driving car which kills a pedestrian. 01:29:57.600 |
So AI systems are being deployed today, right? 01:30:01.840 |
And they're being deployed in many different settings, 01:30:03.840 |
maybe in medical diagnosis, maybe in a self-driving car, 01:30:06.560 |
maybe in selecting applicants for an interview. 01:30:09.880 |
So I would argue that when these systems make mistakes, 01:30:22.640 |
So I would argue that this is a continuous effort. 01:30:26.320 |
And this is something that in a way is not so surprising. 01:30:32.320 |
It's about all engineering and scientific progress, 01:30:35.840 |
with great power comes great responsibility. 01:30:47.080 |
which will suddenly happen on some day in 2079, 01:30:51.360 |
for which I need to design some clever trick. 01:30:58.280 |
And we need to be continuously on the lookout 01:31:00.920 |
for worrying about safety, biases, risks, right? 01:31:05.920 |
I mean, a self-driving car kills a pedestrian. 01:31:11.640 |
I mean, this Uber incident in Arizona, right? 01:31:19.100 |
In fact, it's about a very dumb intelligence, 01:31:23.760 |
- The worry people have with AGI is the scale. 01:31:31.360 |
like the thing that worries me about AI today, 01:31:36.120 |
is recommender systems, recommendation systems. 01:31:39.240 |
So if you look at Twitter or Facebook or YouTube, 01:31:42.620 |
they're controlling the ideas that we have access to, 01:31:50.480 |
And that's a fundamentally machine learning algorithm 01:31:55.160 |
And they, I mean, my life would not be the same 01:32:06.840 |
because of the algorithm that recommend those ideas. 01:32:14.760 |
- And that's the algorithm that's recommending 01:32:21.180 |
has control of millions, billions of people. 01:32:36.820 |
We can just go have a normal life outside of that. 01:32:39.780 |
But the more and more that gets into our life, 01:32:43.140 |
it's that algorithm, we start depending on it 01:32:46.980 |
and the different companies that are working on the algorithm. 01:32:52.580 |
And YouTube in particular is using computer vision, 01:33:05.380 |
who would benefit from those videos the most. 01:33:28.020 |
if you could relive a moment in your life outside of family 01:33:57.780 |
a lot of it is about being at the right place 01:34:19.700 |
And then there are times when you are in a field 01:34:24.700 |
and you can only solve curlicues upon curlicues. 01:34:35.180 |
well, actually 34 years as a professor at Berkeley, 01:34:40.420 |
which when I started in it was just like some little crazy, 01:34:59.460 |
has offered a lot of tools for scientific research, 01:35:06.340 |
for images in biology or astronomy and so on and so forth. 01:35:11.340 |
And we have, so we have made great scientific progress 01:35:15.660 |
which has had real practical impact in the world. 01:35:35.660 |
I mean, it's really still in a productive phase. 01:35:42.140 |
would laugh at you calling this field mature. 01:35:47.460 |
- So, but you're also, lest I forget to mention, 01:35:50.580 |
you've also mentored some of the biggest names 01:35:53.960 |
of computer vision, computer science, and AI today. 01:36:00.540 |
but it really is, what is it, how did you do it? 01:36:25.500 |
Those of us who are at top universities are blessed 01:36:32.780 |
and capable students coming and knocking on our door. 01:36:36.380 |
So I have to be humble enough to acknowledge that. 01:36:43.920 |
What I have added is, I think what I've always tried 01:36:48.760 |
to teach them is a sense of picking the right problems. 01:36:53.680 |
So I think that in science, in the short run, 01:37:00.040 |
success is always based on technical competence. 01:37:09.080 |
I mean, there's certain technical capabilities 01:37:22.840 |
And I feel that what I've been able to bring to the table 01:37:31.320 |
is some sense of taste of what are good problems, 01:37:36.320 |
what are problems that are worth attacking now 01:37:41.480 |
- What's a good problem, if you could summarize? 01:37:48.200 |
- I think I have a sense of what is a good problem, 01:37:56.440 |
in fact, he won a Nobel Prize, Peter Medawar, 01:38:11.640 |
which are not yet solved, but which are approachable. 01:38:22.160 |
that there is this problem which isn't quite solved yet, 01:38:27.320 |
There is some place where you can spear the beast. 01:38:32.320 |
And having that intuition that this problem is ripe 01:38:39.280 |
you can just beat your head and not make progress. 01:38:45.800 |
So if I have that and if I can convey that to students, 01:38:59.080 |
and their achievements and their great research, 01:39:01.280 |
even 20 years after they've ceased being my student. 01:39:08.800 |
that a problem is not yet solved, but it's solvable. 01:39:22.280 |
I've spent a fair amount of time studying psychology, 01:39:27.280 |
neuroscience, relevant areas of applied math and so forth. 01:39:31.240 |
So I can probably help them see some connections 01:39:35.880 |
to disparate things which they might not have otherwise. 01:39:54.200 |
but where I could help them is the shallow breadth, 01:40:09.160 |
- Well, it was beautifully refreshing just to hear you 01:40:12.640 |
naturally jump to psychology, back to computer science 01:40:20.400 |
and I think it's certainly for students empowering 01:40:36.360 |
with Jitendra Malik, and thank you to our sponsors, 01:40:56.800 |
and it really is the best way to support this podcast 01:41:02.260 |
If you enjoy this thing, subscribe on YouTube, 01:41:07.120 |
support it on Patreon, or connect with me on Twitter 01:41:17.680 |
from Prince Myshkin in "The Idiot" by Dostoevsky. 01:41:23.740 |
Thank you for listening, and hope to see you next time.