Jitendra Malik: Computer Vision | Lex Fridman Podcast #110
Chapters
0:00 Introduction
3:17 Computer vision is hard
10:05 Tesla Autopilot
21:20 Human brain vs computers
23:14 The general problem of computer vision
29:09 Images vs video in computer vision
37:47 Benchmarks in computer vision
40:06 Active learning
45:34 From pixels to semantics
52:47 Semantic segmentation
57:05 The three R's of computer vision
1:02:52 End-to-end learning in computer vision
1:04:24 6 lessons we can learn from children
1:08:36 Vision and language
1:12:30 Turing test
1:16:17 Open problems in computer vision
1:24:49 AGI
1:35:47 Pick the right problem
00:00:00.000 |
The following is a conversation with Jitendra Malik, 00:00:03.480 |
a professor at Berkeley and one of the seminal figures 00:00:17.360 |
and has mentored many world-class researchers 00:00:24.440 |
Two sponsors, one new one, which is BetterHelp 00:00:43.080 |
It really is the best way to support this podcast 00:00:47.240 |
If you enjoy this thing, subscribe on YouTube, 00:00:52.040 |
support it on Patreon or connect with me on Twitter 00:00:55.240 |
at Lex Fridman, however the heck you spell that. 00:01:05.080 |
This show is sponsored by BetterHelp, spelled H-E-L-P, help. 00:01:16.960 |
with a licensed professional therapist in under 48 hours. 00:01:24.200 |
it's professional counseling done securely online. 00:01:28.200 |
I'm a bit from the David Goggins line of creatures, 00:01:30.600 |
as you may know, and so have some demons to contend with, 00:01:43.960 |
but I think suffering is essential for creation. 00:01:53.840 |
can help in this, so it's at least worth a try. 00:01:59.640 |
It's easy, private, affordable, available worldwide. 00:02:05.340 |
and schedule weekly audio and video sessions. 00:02:22.640 |
To support this podcast and to get an extra three months free 00:02:32.640 |
I think ExpressVPN is the best VPN out there. 00:02:36.000 |
They told me to say it, but it happens to be true. 00:02:43.280 |
Literally just one big, sexy power on button. 00:02:47.560 |
Again, for obvious reasons, it's really important 00:02:57.120 |
Shout out to my favorite flavor of Linux, Ubuntu MATE 20.04. 00:03:06.080 |
to support this podcast and to get an extra three months free 00:03:12.140 |
And now, here's my conversation with Jitendra Malik. 00:03:17.920 |
In 1966, Seymour Papert at MIT wrote up a proposal 00:03:22.920 |
called the Summer Vision Project to be given, 00:03:31.240 |
So that proposal outlined many of the computer vision tasks 00:03:41.080 |
and perhaps still underestimate how hard computer vision is? 00:03:56.160 |
gives us the sense that, oh, this must be very easy 00:04:08.800 |
However, if you go into neuroscience or psychology 00:04:14.200 |
of human vision, then the complexity becomes very clear. 00:04:19.000 |
The fact is that a very large part of the cerebral cortex 00:04:26.000 |
I mean, and this is true in other primates as well. 00:04:33.220 |
or psychology perspective, it becomes quite clear 00:04:39.600 |
- You said the higher-level parts are the harder parts? 00:04:58.200 |
Whereas when you are proving a mathematical theorem 00:05:03.200 |
or playing chess, the difficulty is much more evident 00:05:07.880 |
because it is your conscious brain which is processing 00:05:12.960 |
various aspects of the problem-solving behavior. 00:05:30.060 |
that as computer vision researchers, for example, 00:05:43.800 |
We'll talk a little bit about autonomous driving, 00:05:45.680 |
for example, how hard of a vision task that is. 00:05:48.640 |
Do you think, I mean, is it just human nature 00:05:55.000 |
or is there something fundamental to the vision problem 00:06:32.520 |
where getting 50% of the solution you can get in one minute, 00:06:52.540 |
It seems that language, people are not so confident about. 00:07:06.300 |
that we have to be able to do natural language understanding. 00:07:10.540 |
For vision, it seems that we're not cognizant 00:07:15.540 |
or we don't think about how much understanding is required. 00:07:22.420 |
how much understanding is required to solve vision? 00:07:29.460 |
how much something called common sense reasoning 00:07:47.060 |
with what we could call maybe peripheral processing. 00:08:07.940 |
And I think they made a big deal out of this, 00:08:11.500 |
and they wanted to just study only perception 00:08:13.820 |
and then dismiss certain problems as being, quote, cognitive. 00:08:18.820 |
But really, I think these are artificial divides. 00:08:31.060 |
they work better at the lower and mid levels of the problem. 00:08:43.140 |
in many real applications, we have to confront them. 00:08:59.060 |
a pessimist on fully autonomous driving in the near future. 00:09:04.060 |
And the reason is because I think there will be 00:09:11.900 |
where quite sophisticated cognitive reasoning is called for. 00:09:20.300 |
first of all, they are much more, they are robust, 00:09:28.380 |
For example, let's say you're doing image search. 00:09:33.380 |
You're trying to get images based on some description, 00:09:43.900 |
I mean, when Google Image Search gives you some images back 00:10:06.100 |
- So just for the fun of it, since you mentioned, 00:10:09.460 |
let's go there briefly about autonomous vehicles. 00:10:23.980 |
with eight cameras and basically a single neural network, 00:10:34.780 |
but is forming the same representation at the core. 00:10:37.640 |
Do you think driving can be converted in this way 00:10:47.940 |
Or even more specifically, in the current approach, 00:10:52.580 |
what do you think about what Tesla Autopilot team is doing? 00:10:59.500 |
there are certainly subsets of the visual-based 00:11:05.380 |
So for example, driving in freeway conditions 00:11:22.020 |
In the '90s, there were approaches from Carnegie Mellon, 00:11:25.660 |
there were approaches from our team at Berkeley. 00:11:28.700 |
In the 2000s, there were approaches from Stanford, 00:11:33.100 |
So autonomous driving in certain settings is very doable. 00:11:45.380 |
At that point, it's not just a question of vision 00:11:54.180 |
- So where do you think most of the difficult cases, 00:11:57.620 |
to me, even the highway driving is an open problem 00:12:00.100 |
because it applies the same 50, 90, 95, 99 rule 00:12:05.100 |
or the first step, the fallacy of the first step, 00:12:12.100 |
I think even highway driving has a lot of elements 00:12:23.740 |
So you're really going to feel the edge cases. 00:12:26.540 |
So I think even highway driving is really difficult. 00:12:32.160 |
do you think vision is the fundamental problem 00:12:42.700 |
the ability to, and then like the middle ground, 00:12:47.660 |
which is trying to predict the behavior of others, 00:12:58.220 |
of the actors in the scene and predict their behavior. 00:13:03.860 |
because to me, perception blends into cognition 00:13:07.180 |
and building predictive models of other agents in the world, 00:13:11.320 |
which could be other agents, could be people, 00:13:17.420 |
because perception always has to not tell us what is now, 00:13:23.280 |
but what will happen because what's now is boring. 00:13:27.740 |
We care about the future because we act in the future. 00:13:33.280 |
- And we care about the past in as much as it informs 00:13:38.940 |
- So I think we have to build predictive models 00:13:41.240 |
of behaviors of people and those can get quite complicated. 00:14:11.800 |
because obviously these systems are always being improved 00:14:50.520 |
for a pedestrian, a typical behavior for a pedestrian 00:14:53.760 |
was not the typical behavior for a skateboarder, right? 00:15:04.360 |
you need to have enough data where your pedestrians, 00:15:11.640 |
what kinds of patterns of behavior they have. 00:15:37.960 |
do you think it will look similar to what we have today, 00:15:41.600 |
but have a lot more data, perhaps more compute, 00:15:47.120 |
like neural, well, in the case of Tesla Autopilot, 00:15:49.880 |
is neural networks, do you think it will look similar? 00:16:05.300 |
So, and this is my general philosophical position 00:16:17.160 |
in computer vision in the deep learning paradigm 00:16:24.020 |
and tabula rasa learning in a supervised way, 00:16:34.880 |
given a series of experiences in this setting, 00:16:57.600 |
but at the age of 16, they're already visual geniuses, 00:17:04.640 |
they have built a certain repertoire of vision. 00:17:07.560 |
In fact, most of it has probably been achieved by age two. 00:17:16.200 |
they know that the world is three-dimensional. 00:17:32.100 |
So they have built that up from their observations 00:17:43.920 |
So then, at age 16, when they go into driver ed, 00:17:49.280 |
They're not learning afresh the visual world. 00:17:58.860 |
They are learning how to be smooth about control, 00:18:03.920 |
They're learning a sense of typical traffic situations. 00:18:07.900 |
Now, that education process can be quite short, 00:18:12.900 |
because they are coming in as visual geniuses. 00:18:27.280 |
I may not have had to deal with a skateboarder. 00:18:45.180 |
even though I did not encounter this in my driver ed class. 00:18:50.000 |
is because I have all this general visual knowledge 00:18:54.540 |
- And do you think the learning mechanisms we have today 00:18:59.900 |
can do that kind of long-term accumulation of knowledge? 00:19:17.100 |
worked on this kind of accumulation of knowledge. 00:19:20.200 |
Do you think neural networks can do the same? 00:19:22.100 |
- I think I don't see any in-principle problem 00:19:33.660 |
So the current learning techniques that we have 00:19:43.300 |
X, Y, pairs, and you learn the functional mapping 00:19:48.580 |
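As a minimal sketch of that supervised setup (assuming scikit-learn; the data here is synthetic and purely illustrative): a fixed dataset of (X, y) pairs, and a model fit to the mapping between them.

```python
# The supervised setup in its barest form: a dataset of (X, y) pairs
# and a model fit to the functional mapping between them.
# Sketch only, assuming scikit-learn; data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.randn(1000, 20)            # inputs, e.g. image features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # labels supplied by a "teacher"

model = LogisticRegression().fit(X, y)   # learn the X -> y mapping
print(model.score(X, y))                 # everything hinges on labeled pairs
```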
I think that human learning is far richer than that. 00:19:54.660 |
There is a child explores the world and sees us. 00:20:16.520 |
but the learning data has been arranged by the child. 00:20:24.100 |
Child can do various experiments with the world. 00:20:27.460 |
So there are many aspects of sort of human learning, 00:20:33.580 |
and these have been studied in child development 00:20:39.300 |
And what they tell us is that supervised learning 00:20:45.340 |
There are many different aspects of learning. 00:20:48.580 |
And what we would need to do is to develop models 00:21:05.300 |
- Some of which might imitate the human brain. 00:21:12.860 |
in terms of the difference in the human brain, 00:21:20.660 |
- Do you think there's something interesting, 00:21:25.380 |
in the computational power of the human brain 00:21:36.620 |
so this is a point I've been making for 20 years now. 00:21:46.540 |
we just didn't have the computing power of the human brain. 00:22:10.060 |
Whereas in silicon, you have much faster devices, 00:22:13.980 |
transistors switch at on the order of nanoseconds, 00:22:25.860 |
we do have, if you consider the latest GPUs and so on, 00:22:31.660 |
And if we look back at Hans Moravec's type of calculations, 00:22:40.860 |
in terms of computing power comparable to the brain, 00:22:51.300 |
the style of computing that we have in our GPUs 00:22:59.660 |
in the human brain or other biological entities. 00:23:10.100 |
in order to build actual real-world systems of large scale. 00:23:25.980 |
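A rough sketch of how Moravec-style comparisons are assembled (every constant below is a commonly cited ballpark assumption, not a figure from this conversation; the point is how sensitive the conclusion is to the assumptions):

```python
# Back-of-envelope sketch of brain-vs-GPU comparisons of the Moravec type.
# All constants are rough, commonly cited ballpark assumptions, chosen only
# to show how such estimates are put together.

NEURONS = 1e11             # ~10^11 neurons in the human brain
SYNAPSES_PER_NEURON = 1e4  # ~10^3-10^4 synapses per neuron
MAX_RATE_HZ = 1e2          # millisecond-scale switching, ~100 Hz peak
AVG_RATE_HZ = 1e0          # average firing rates are far lower, ~1 Hz

upper = NEURONS * SYNAPSES_PER_NEURON * MAX_RATE_HZ  # ~1e17 synaptic ops/s
lower = NEURONS * SYNAPSES_PER_NEURON * AVG_RATE_HZ  # ~1e15 synaptic ops/s

GPU_FLOPS = 1e14           # order of magnitude for a recent data-center GPU

print(f"brain estimate: {lower:.0e} .. {upper:.0e} ops/s")
print(f"one GPU:        {GPU_FLOPS:.0e} FLOP/s")
```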
So if you look at the computer vision conferences 00:23:29.580 |
it's often separated into different little segments, 00:23:36.060 |
into whether segmentation, 3D reconstruction, 00:23:46.660 |
But if you were to sort of philosophically say, 00:24:03.860 |
I always go back to sort of biology or humans. 00:24:09.500 |
And if you think about vision or perception in that setting, 00:24:14.100 |
we realize that perception is always to guide action. 00:24:30.260 |
which arose in the Cambrian era 500 million years ago. 00:24:49.020 |
because you can get food in different places. 00:24:54.260 |
And that's really about perception or seeing. 00:24:57.980 |
I mean, vision is perhaps the single most perception sense, 00:25:01.740 |
but all the others are equally, are also important. 00:25:05.940 |
So perception and action kind of go together. 00:25:10.060 |
So earlier it was in these very simple feedback loops 00:25:17.220 |
or avoiding becoming food if there's a predator running, 00:25:33.700 |
perception became more and more sophisticated 00:25:48.100 |
and build a model of the external world inside the head. 00:25:56.900 |
And psychologists have great fun in pointing out 00:26:01.620 |
is not a perfect model of the external world. 00:26:17.780 |
that exists in an animal 500 million years ago. 00:26:22.780 |
Once we have these very sophisticated visual systems, 00:26:30.660 |
It's we as scientists who are imposing that structure 00:26:34.180 |
where we have chosen to characterize this part of the system 00:26:41.940 |
or, quote, "this module of 3D reconstruction." 00:26:44.980 |
What's going on is really all of these processes 00:26:56.300 |
because originally their purpose was, in fact, 00:27:00.940 |
- So as a guiding general statement of a problem, 00:27:03.900 |
do you think we can say that the general problem 00:27:14.700 |
Do you think we should also say that ultimately 00:27:17.180 |
that the goal, the problem of computer vision 00:27:27.460 |
- Yes, I think that's the most fundamental purpose. 00:27:41.900 |
For example, judging the aesthetic value of a painting. 00:27:49.100 |
Maybe it's guiding action in terms of how much money 00:27:55.900 |
But the basics are, in fact, in terms of action. 00:28:10.140 |
but perhaps it is fundamentally about action. 00:28:20.220 |
that drives a lot of the development in this world 00:28:26.540 |
If you watch Netflix, if you enjoy watching movies, 00:28:29.500 |
you're using your perception system to interpret the movie. 00:28:46.820 |
- Well, certainly with respect to interactions with firms. 00:29:16.660 |
so many of the breakthroughs that you've been a part of 00:29:28.540 |
the community is looking at dynamic, at video, 00:29:35.220 |
which is dynamic, but also where you actually have a robot 00:29:39.340 |
in the physical world interacting based on that vision. 00:30:04.060 |
or making the problem harder by focusing on images? 00:30:09.100 |
I think sometimes we can simplify a problem so much 00:30:23.400 |
And one could reasonably argue that, to some extent, 00:30:28.020 |
this happens when we go from video to single images. 00:30:31.360 |
Now, historically, you have to consider the limits 00:30:35.500 |
imposed by the computation capabilities we had. 00:30:50.620 |
can be understood as choices which were forced upon us 00:30:55.620 |
by the fact that we just didn't have access to compute, 00:31:04.100 |
- Exactly, not enough compute, not enough storage. 00:31:09.420 |
So one of the choices is focusing on single images 00:31:25.580 |
So you have an image, which is, say, 256 by 256 pixels, 00:31:29.660 |
and instead of keeping around the grayscale value, 00:31:35.460 |
find the places where the brightness changes a lot, 00:31:48.580 |
and the logic was humans can interpret a line drawing, 00:31:58.020 |
So many of the choices were dictated by that. 00:32:00.940 |
I think today we are no longer detecting edges, right? 00:32:05.940 |
We process images with ConvNets because we don't need to. 00:32:10.780 |
We don't have those compute restrictions anymore. 00:32:16.280 |
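A minimal sketch of the edge-based representation described above, assuming NumPy and SciPy (the threshold is arbitrary): keep only the places where brightness changes a lot, and discard the raw grayscale values.

```python
# Minimal sketch of the classic idea described above: throw away raw
# grayscale values and keep only where brightness changes sharply.
# Assumes NumPy and SciPy; the threshold is arbitrary.
import numpy as np
from scipy import ndimage

def edge_map(image: np.ndarray, threshold: float = 30.0) -> np.ndarray:
    """Return a binary edge map from a 2D grayscale image."""
    gx = ndimage.sobel(image.astype(float), axis=1)  # horizontal gradient
    gy = ndimage.sobel(image.astype(float), axis=0)  # vertical gradient
    magnitude = np.hypot(gx, gy)                     # brightness change
    return magnitude > threshold                     # keep only strong edges

# A 256x256 image stored as edge/no-edge bits is a tiny fraction of the
# storage needed for the full grayscale values.
img = np.random.randint(0, 256, (256, 256))
edges = edge_map(img)
print(edges.mean())  # fraction of pixels kept as edges
```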
because video compute is still quite challenging 00:32:22.320 |
I think video computing is not so challenging 00:32:35.560 |
and they still struggle doing stuff on video. 00:32:42.140 |
that are essentially the techniques you used in the '90s, 00:32:48.620 |
- No, that's when you want to do things at scale. 00:32:53.700 |
of all the content of YouTube, it's very challenging, 00:32:59.260 |
But as a researcher, you have more opportunities. 00:33:06.940 |
networks with relatively large video datasets, yeah. 00:33:10.540 |
- Yes, so I think that this is part of the reason 00:33:20.460 |
I see a lot more progress happening in video. 00:33:35.820 |
because you can take some of the challenging video datasets, 00:33:39.020 |
and their performance on action classification 00:34:01.080 |
- Let me ask a similar question I've already asked, 00:34:07.440 |
do you think some kind of injection of knowledge bases 00:34:18.840 |
If we solve the general action recognition problem, 00:34:27.820 |
what do you think the solution would look like? 00:34:30.740 |
- So I completely agree that knowledge is called for, 00:34:35.740 |
and that knowledge can be quite sophisticated. 00:34:58.700 |
Now, the things that happen in a certain order, 00:35:10.900 |
eventually, bill arrives, et cetera, et cetera. 00:35:14.020 |
There's a classic example of AI from the 1970s. 00:35:19.020 |
There was the term frames and scripts and schemas. 00:35:27.140 |
Okay, and in the '70s, the way the AI of the time 00:35:34.220 |
So they hand-coded in this notion of a script 00:35:42.060 |
and used that to interpret, for example, language. 00:35:49.220 |
involving some people eating at a restaurant, 00:35:56.220 |
because you know what happens typically at a restaurant. 00:35:59.220 |
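A hand-coded script of the kind being described, in the spirit of the 1970s restaurant schema, might look like the toy structure below (purely illustrative, not any particular historical system):

```python
# Toy illustration of a hand-coded "script" in the 1970s sense described
# above (a restaurant schema). Purely illustrative.
RESTAURANT_SCRIPT = {
    "roles": ["customer", "waiter", "cook", "cashier"],
    "props": ["table", "menu", "food", "bill", "money"],
    "scenes": [
        ("enter", "customer enters and is seated at a table"),
        ("order", "customer reads the menu and orders food"),
        ("eat",   "cook prepares food; waiter brings it; customer eats"),
        ("pay",   "bill arrives; customer pays the cashier"),
        ("exit",  "customer leaves the restaurant"),
    ],
}

def expected_next(scene: str) -> str:
    """Given the current scene, return the script's default expectation."""
    names = [name for name, _ in RESTAURANT_SCRIPT["scenes"]]
    i = names.index(scene)
    return names[i + 1] if i + 1 < len(names) else "done"

print(expected_next("order"))  # -> "eat"
```

The appeal and the brittleness are both visible here: expectations come for free, but only for situations someone thought to write down by hand.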
So I think this kind of knowledge is absolutely essential. 00:36:13.540 |
I think the kinds of technology that we have right now 00:36:16.180 |
with 3D convolutions over a couple of seconds 00:36:28.340 |
Long-term understanding requires a notion of, 00:36:35.340 |
perhaps some notions of goals, intentionality, 00:36:45.980 |
So we could either revert back to the '70s and say, 00:37:02.940 |
because I think learning-based ways land up being more robust. 00:37:06.620 |
And there must be a learning version of the story 00:37:09.220 |
because children acquire a lot of this knowledge 00:37:21.300 |
it's possible, but I think it's not so typical 00:37:27.900 |
through all the stages of what happens in a restaurant. 00:37:31.860 |
they go to the restaurant, they eat, come back, 00:37:35.620 |
and the child goes through 10 such experiences, 00:37:44.900 |
we need to provide that capability to our systems. 00:38:16.980 |
If I think about the benchmarks we have before us, 00:38:24.460 |
they're often kind of trying to get to the adult. 00:38:31.060 |
What kind of tests for computer vision do you think 00:38:33.180 |
we should have that mimic the child's in computer vision? 00:38:54.860 |
So that gets into issues of privacy and so on and so forth. 00:39:08.500 |
So what's the child's linguistic environment? 00:39:17.060 |
and then develop learning schemes based on that data, 00:39:23.660 |
I think that's a very promising direction myself. 00:39:31.140 |
that we could just short circuit this in some way. 00:39:38.900 |
we have had success by not imitating nature in detail. 00:40:08.540 |
of learning like a child is the interactivity. 00:40:22.180 |
What are your thoughts about this whole space 00:40:33.980 |
And I think that we could achieve it in two ways 00:41:01.580 |
The robot learns its body by doing a series of actions. 00:41:21.680 |
our group has worked on something called Habitat, 00:41:27.080 |
which is a visually photorealistic environment 00:41:43.940 |
So I can now, you can imagine that subsequent generations 00:41:49.900 |
of these simulators will be accurate, not just visually, 00:42:03.200 |
And then we have that environment to play with. 00:42:22.880 |
So this is something which is of a great deal 00:42:27.160 |
I mean, people like Judea Pearl have talked a lot about 00:42:38.580 |
of deep learning as just curve fitting, right? 00:42:49.360 |
but causality is not like a single silver bullet. 00:43:01.560 |
one of our most reliable ways of establishing causal links, 00:43:15.340 |
and now in some situation you perform an action, 00:43:25.640 |
performing controlled experiments all the time, right? 00:43:32.080 |
and, but that is a way that the child gets to build 00:43:47.400 |
"The Scientist in the Crib," referring to children. 00:43:50.760 |
So I like, the part that I like about that is 00:43:54.280 |
the scientist wants to do, wants to build causal models, 00:43:58.880 |
and the scientist does controlled experiments. 00:44:03.720 |
So to enable that, we will need to have these, 00:44:12.700 |
some in the real world and some in simulation. 00:44:34.400 |
the principles of what it means to exist in the world 00:44:39.520 |
- I don't see any fundamental problems there. 00:44:42.600 |
the computer graphics community has come a long way. 00:44:45.360 |
So in the early days, going back to the '80s and '90s, 00:44:54.480 |
but they couldn't do stuff like hair or fur and so on. 00:45:01.040 |
Then they couldn't do physical actions, right? 00:45:04.360 |
Like there's a bowl of glass and it falls down 00:45:29.960 |
we will find ways of making our models ever more realistic. 00:46:21.520 |
that there's a lot of redundancy in these images, 00:46:26.520 |
and as a result, we are able to do a lot of compression. 00:46:36.900 |
So you might have 10 to the eight photoreceptors 00:46:40.100 |
and only 10 to the six fibers in the optic nerve, 00:47:10.700 |
is just how successful is image compression, right? 00:47:14.700 |
And there are, and that's been done with older technologies, 00:47:21.260 |
there are several companies which are trying to use 00:47:25.700 |
sort of these more advanced neural network type techniques 00:47:50.580 |
that's really about image statistics and video statistics. 00:47:53.620 |
- But that's still not doing compression of the kind 00:48:02.680 |
- Yeah, so this is at the lower level, right? 00:48:13.060 |
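At their simplest, the neural-network compression techniques alluded to above look like an autoencoder squeezed through a narrow bottleneck; the sketch below assumes PyTorch and leaves out the quantization and entropy coding a real codec would need.

```python
# Minimal sketch of neural-network-style image compression: an autoencoder
# squeezes an image through a narrow bottleneck and is trained to reconstruct
# it. Real codecs add quantization and entropy coding; illustrative only.
import torch
import torch.nn as nn

class TinyImageCodec(nn.Module):
    def __init__(self, bottleneck: int = 64):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),   # 256 -> 128
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),  # 128 -> 64
            nn.Conv2d(32, bottleneck, 4, stride=2, padding=1),     # 64 -> 32
        )
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(bottleneck, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),
        )

    def forward(self, x):
        code = self.encode(x)            # the compressed representation
        return self.decode(code), code

model = TinyImageCodec()
x = torch.randn(1, 3, 256, 256)
recon, code = model(x)
loss = ((recon - x) ** 2).mean()         # reconstruction objective
```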
you mentioned how far can bottom-up image segmentation go, 00:48:26.680 |
Maybe this is a good time to elaborate on that, 00:48:29.780 |
maybe define what is bottom-up, what is top-down 00:48:55.540 |
and they end up with something like cat or not a cat, right? 00:49:00.500 |
So our systems are running totally feed-forward. 00:49:07.440 |
So they're trained by saying, okay, this is a cat, 00:49:10.140 |
there's a cat, there's a dog, there's a zebra, et cetera. 00:49:12.980 |
And I'm not happy with either of these choices fully. 00:49:20.660 |
because we have completely separated these processes, right? 00:49:34.060 |
So in biology, what we know is that the processes 00:49:45.420 |
And they involve much shallower neural networks. 00:49:52.580 |
in computer vision, say a ResNet 50, has 50 layers. 00:49:59.540 |
going from the retina to IT, maybe we have like seven, right? 00:50:14.820 |
with the more ambiguous stimuli, for example. 00:50:18.160 |
So the biological solution seems to involve feedback. 00:50:23.160 |
The solution in artificial vision seems to be 00:50:27.840 |
just feed-forward, but with a much deeper network. 00:50:35.100 |
which just has like three rounds of feedback, 00:50:37.500 |
you can just unroll it and make it three times the depth 00:51:10.300 |
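The unrolling argument can be made concrete with a small sketch (assuming PyTorch): a block applied for three rounds of feedback computes exactly what a feed-forward stack three times as deep, with shared weights, would compute.

```python
# Sketch of the unrolling argument above, assuming PyTorch: a block applied
# recurrently for three rounds of feedback computes the same function as a
# feed-forward stack three times as deep that shares weights across rounds.
import torch
import torch.nn as nn

class RecurrentBlock(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, rounds: int = 3) -> torch.Tensor:
        h = x
        for _ in range(rounds):          # feedback: the output re-enters the block
            h = torch.relu(self.conv(h) + x)
        return h

block = RecurrentBlock()
x = torch.randn(1, 64, 32, 32)

# "Unrolled" view: the same three applications written out feed-forward.
h = x
for _ in range(3):
    h = torch.relu(block.conv(h) + x)

assert torch.allclose(block(x, rounds=3), h)
```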
we make use of a lot of top-down knowledge right now. 00:51:22.140 |
and this is the boundary of a horse, and so on. 00:51:36.380 |
because, for example, we're looking at a video stream, 00:51:47.260 |
So the Gestalt psychologists used to call this 00:51:55.100 |
by which we were able to segment out these objects. 00:52:07.860 |
in machine vision, this top-down, bottom-up interaction. 00:52:11.040 |
But I don't find the solution fully satisfactory. 00:52:16.060 |
And I would rather have a bit of both at both stages. 00:52:27.220 |
so for me, I'm inspired a lot by human vision, 00:52:31.820 |
You could be just a hard-boiled engineer and not give a damn. 00:52:40.500 |
if you could make my research agenda fruitful. 00:52:45.500 |
- Okay, so maybe taking a step into segmentation, 00:53:00.700 |
So for people who don't know computer vision, 00:53:10.080 |
of drawing outlines around objects versus a bounding box, 00:53:28.740 |
from detection, recognition, and the other problems? 00:53:41.820 |
without necessarily even being able to name that object 00:53:55.700 |
- A blob that's united in some way from its background. 00:54:29.860 |
Then when the mother says, "Pick up your bottle," 00:55:03.780 |
so to me, that's a very fundamental capability. 00:55:07.660 |
There are applications where this is very important, 00:55:13.060 |
So in medical diagnosis, you have some brain scan. 00:55:17.740 |
I mean, this is some work that we did in my group 00:55:39.740 |
So there are certainly very practical applications 00:55:43.340 |
of computer vision where segmentation is necessary. 00:55:53.980 |
with much weaker supervision than we require today. 00:55:57.820 |
- And you think of segmentation as this kind of task 00:56:03.460 |
and breaks it apart into interesting entities 00:56:08.460 |
that might be useful for whatever the task is. 00:56:21.940 |
It is not, I think the mistake that we used to make 00:56:28.540 |
was to treat it as a purely bottom-up perceptual task. 00:56:46.940 |
And I think understanding that all the pixels of a human 00:57:05.520 |
- You mentioned the three R's of computer vision 00:57:08.020 |
are recognition, reconstruction, and reorganization. 00:57:12.140 |
Can you describe these three R's and how they interact? 00:57:19.580 |
because that's what I think people generally think of 00:57:30.380 |
So is this a cat, is this a dog, is this a chihuahua? 00:57:46.980 |
- But given a part of an image or a whole image, 00:58:07.140 |
So graphics is you have some internal computer 00:58:10.460 |
representation and you have a computer representation 00:58:31.060 |
we say, oh, this image arises from some objects 00:58:38.420 |
in a scene looked at with a camera from this viewpoint, 00:58:41.820 |
and we might have more information about the objects, 00:59:22.620 |
that the world is not just, an image is not just seen as, 00:59:27.620 |
is not internally represented as just a collection of pixels, 00:59:38.100 |
- And the relationship between the entities as well, 00:59:44.220 |
but mainly we focus on the fact that there are entities. 00:59:47.660 |
- So I'm trying to pinpoint what the organization means. 00:59:52.380 |
- So organization is that instead of a uniform grid, 01:00:05.300 |
- So segmentation gets us going towards that. 01:00:30.020 |
And then reconstruction is what, filling in the gaps? 01:00:48.660 |
I mean, I started pushing this kind of a view 01:01:01.020 |
the distinction that people were just working 01:01:13.820 |
and then you try to solve that and get good numbers on it. 01:01:19.540 |
because I wanted to see the connection between these. 01:01:23.540 |
And if people divided up vision into various modules, 01:01:33.460 |
corresponding roughly to the psychologist's notion 01:01:48.560 |
this particular framework as a way of considering 01:01:55.500 |
and trying to be more explicit about the fact 01:01:58.940 |
that they actually are connected to each other. 01:02:07.840 |
Now it turns out in the last five years or so, 01:02:28.020 |
we are trying to build multiple representations. 01:02:37.160 |
So in a certain sense, today, given the reality 01:02:48.220 |
It is just there, it's part of the solution space. 01:02:59.860 |
of reorganization, recognition, can be reconstruction? 01:03:05.440 |
How much of it can be learned end to end, do you think? 01:03:12.620 |
Sort of set it and forget it, just plug and play, 01:03:18.180 |
have a giant data set, multiple perhaps, multi-modal, 01:03:32.880 |
And that I would argue is too narrow a view of the problem. 01:03:44.820 |
one where there are certain capabilities that are built up 01:03:58.180 |
So I think end to end learning in the supervised setting 01:04:14.020 |
is sort of a limited view of the learning process. 01:04:18.100 |
- Got it, so if we think about beyond purely supervised, 01:04:24.700 |
You mentioned six lessons that we can learn from children 01:04:28.820 |
of be multi-modal, be incremental, be physical, 01:04:39.540 |
that you find most fundamental to our time today? 01:04:42.740 |
- Yeah, so I mean, I should say to give you credit, 01:04:54.860 |
common wisdom among child development people. 01:05:04.300 |
among people in computer vision and AI and machine learning. 01:05:15.140 |
So let's take an example of a multi-modal, I like that. 01:05:20.100 |
So multi-modal, a canonical example is a child interacting 01:05:28.780 |
So then the child holds a ball and plays with it. 01:05:32.540 |
So at that point, it's getting a touch signal. 01:05:35.620 |
So the touch signal is getting a notion of 3D shape, 01:05:42.940 |
And then the child is also seeing a visual signal. 01:05:52.620 |
So one is the space of receptors on the skin of the fingers 01:05:58.420 |
And then these map onto these neuronal fibers 01:06:05.300 |
These lead to some activation in somatosensory cortex. 01:06:10.460 |
I mean, a similar thing will happen if we have a robot hand. 01:06:19.260 |
but we know that they correspond to the same object. 01:06:22.660 |
So that's a very, very strong cross-calibration signal. 01:06:28.780 |
And it is self-supervisory, which is beautiful. 01:06:34.020 |
The mother doesn't have to come and assign a label. 01:06:44.540 |
about the three-dimensional world from this signal. 01:06:48.500 |
I think tactile and visual, there is some work on. 01:06:53.580 |
There is a lot of work currently on audio and visual. 01:06:56.300 |
Okay, and audio-visual, so there is some event 01:07:09.060 |
and it falls and breaks, and I hear the smashing sound, 01:07:14.260 |
Okay, I've built that connection between the two, right? 01:07:19.460 |
We have people, I mean, this has become a hot topic 01:07:22.820 |
in computer vision in the last couple of years. 01:07:25.500 |
There are problems like separating out multiple speakers. 01:07:33.900 |
in audition, they call this the problem of source separation 01:07:40.540 |
But just try to do it visually when you also have, 01:07:44.820 |
it becomes so much easier and so much more useful. 01:07:49.820 |
- So the multimodal, I mean, there's so much more signal 01:08:00.300 |
- Yes, because they are occurring at the same time in time. 01:08:03.180 |
So you have time, which links the two, right? 01:08:06.140 |
So at a certain moment, T1, you've got a certain signal 01:08:11.340 |
in the visual domain, but they must be causally related. 01:08:15.300 |
- Yeah, it's an exciting area, not well studied yet. 01:08:18.500 |
- Yeah, I mean, we have a little bit of work at this, 01:08:46.220 |
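One common way to turn "the two signals occur at the same moment" into a training signal, assuming PyTorch, is a contrastive loss over co-occurring audio and visual embeddings; this is a generic sketch, not a description of any specific system mentioned here.

```python
# Sketch of cross-modal self-supervision via temporal co-occurrence, assuming
# PyTorch: embeddings of audio and video from the same moment are pulled
# together, mismatched pairs pushed apart. No labels are needed; time
# alignment is the supervision. Illustrative only.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(audio_emb: torch.Tensor,
                                 video_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, video_emb: (batch, dim); row i of each comes from time T_i."""
    a = F.normalize(audio_emb, dim=1)
    v = F.normalize(video_emb, dim=1)
    logits = a @ v.t() / temperature        # similarity of every audio/video pair
    targets = torch.arange(a.size(0))       # the matching pair is the co-occurring one
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = cross_modal_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```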
What is the connection between language and vision to you? 01:08:53.340 |
Is one the parent and the child, the chicken and the egg? 01:09:01.180 |
The parent is just the fundamental ability, okay? 01:09:12.180 |
And you can think of it either in phylogeny or in ontogeny. 01:09:17.180 |
So phylogeny means, if you look at evolutionary time, right? 01:09:22.220 |
So we have vision that developed 500 million years ago. 01:09:26.480 |
Okay, then something like when we get to maybe 01:09:34.380 |
So when we started to walk, then the hands became free. 01:09:38.700 |
And so then manipulation, the ability to manipulate objects 01:10:03.780 |
which is the development of the hominid line, right? 01:10:09.160 |
we have the branch which leads on to modern humans. 01:10:16.340 |
but the ones which, you know, people talk about Lucy, 01:10:21.340 |
because that's like a skeleton from three million years ago 01:10:28.480 |
So at this stage, you have that the hand is free 01:10:33.640 |
And then the ability to manipulate objects, build tools, 01:10:46.040 |
Now, we don't know exactly when language arose. 01:10:59.400 |
and we, primates, other primates don't have that. 01:11:17.800 |
I mean, the human species already able to manipulate 01:11:32.960 |
and the perception, maybe some of the cognition. 01:11:36.060 |
- Yeah, so we, so those, so that, so the world, 01:11:53.280 |
So they knew that the world consists of objects. 01:11:59.660 |
They had observed causal interactions among objects. 01:12:22.580 |
Where did that notion of space and time come from? 01:12:31.040 |
- Yeah, what you've referred to as the spatial intelligence. 01:12:35.080 |
So to linger a little bit, we mentioned Turing 01:12:38.960 |
and his mention of we should learn from children. 01:12:44.320 |
Nevertheless, language is the fundamental piece 01:12:47.360 |
of the test of intelligence that Turing proposed. 01:12:53.840 |
Are you, what would impress the heck out of you? 01:13:01.620 |
- I think I wouldn't, I don't think we should have 01:13:09.000 |
So just like I don't believe in IQ as a single number, 01:13:13.920 |
I think generally there can be many capabilities 01:13:19.700 |
So I think that there will be accomplishments 01:13:36.840 |
I do believe that language will be the hardest nut to crack. 01:13:41.520 |
- So what's harder, to pass the spirit of the Turing test, 01:13:45.400 |
or like whatever formulation will make it natural language, 01:13:51.120 |
like somebody you would wanna have a beer with, 01:13:59.220 |
You think language is the top of the problem? 01:14:05.520 |
I think Turing test, that Turing as he proposed the test 01:14:09.560 |
in 1950 was trying to solve a certain problem. 01:14:26.560 |
I think the Turing test is no longer the right way 01:14:33.600 |
because it takes us down this path of this chatbot 01:14:37.080 |
which can fool us for five minutes or whatever. 01:14:50.360 |
tasks in navigation, tasks in visual scene understanding, 01:14:58.920 |
I mean, so my favorite language understanding task 01:15:05.400 |
and being able to answer arbitrary questions from it. 01:15:12.960 |
and this is not an exhaustive list by any means. 01:15:15.720 |
So I would, I think that that's where we need to be going to 01:15:28.240 |
in this Intelligence Olympics that we've set up, 01:16:07.160 |
- Maybe easier to measure performance in a simulated world. 01:16:16.320 |
- So David Hilbert in 1900 proposed 23 open problems 01:16:21.960 |
in mathematics, some of which are still unsolved. 01:16:36.600 |
I don't know when the last year you presented that, 01:16:44.700 |
It's your job to state what the open problems are 01:16:56.440 |
- Let me pick one which I regard as clearly unsolved, 01:17:01.440 |
which is what I would call long form video understanding. 01:17:07.400 |
So we have a video clip and we want to understand 01:17:24.400 |
and make predictions about what might happen. 01:17:28.320 |
So that kind of understanding which goes away 01:17:48.120 |
if we can do it at 50%, maybe next year we'll do it at 65 01:17:53.840 |
But I think the long range video understanding, 01:18:13.480 |
and you have to have some kind of model of their behavior. 01:18:17.760 |
And their behavior might be, these are agents, 01:18:32.360 |
Then I will talk about, say, understanding the world in 3D. 01:18:36.920 |
Now this may seem paradoxical because in a way, 01:18:45.680 |
But I don't think we currently have the richness 01:18:48.640 |
of 3D understanding in our computer vision system 01:18:57.480 |
So currently, we have two kinds of techniques 01:19:08.720 |
and you do a reconstruction using stereoscopic vision 01:19:18.040 |
they totally fail if you just have a single view 01:19:21.200 |
because they are relying on this multiple-view geometry. 01:19:28.120 |
that we have developed in the computer vision community 01:19:34.240 |
And these techniques are based on supervised learning 01:19:39.240 |
and they are based on having at training time 01:19:45.920 |
And this is completely unnatural supervision, right? 01:19:49.880 |
That's not, CAD models are not injected into your brain. 01:19:55.920 |
What I would like would be a kind of learning 01:20:05.120 |
So we have our succession of visual experiences 01:20:18.960 |
I might see a chair from different viewpoints 01:20:21.600 |
or a table from different viewpoints and so on. 01:20:31.120 |
And then next time I just see a single photograph 01:20:38.760 |
And I have a guess of what its 3D shape is like. 01:20:42.080 |
- So you're almost learning the CAD model kind of-- 01:20:46.960 |
I mean, the CAD model need not be in the same form 01:20:58.080 |
and what I would see if I went to such and such position. 01:21:16.280 |
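One way to phrase "build 3D from seeing objects from several viewpoints, then guess the shape from a single photograph" as a learning objective, assuming PyTorch and known relative poses between views, is a view-consistency loss; the sketch below is schematic (a real system would use a correspondence-free distance such as Chamfer, plus reprojection terms, to rule out degenerate solutions).

```python
# Sketch of learning 3D from multiple views without CAD supervision, assuming
# PyTorch and known relative poses. The model predicts a 3D point set from a
# single image; training asks that predictions from two views of the same
# object agree once one is mapped into the other's frame. Illustrative only:
# real systems add reprojection losses to avoid trivial constant solutions.
import torch
import torch.nn as nn

class SingleViewTo3D(nn.Module):
    def __init__(self, num_points: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_points * 3),
        )
        self.num_points = num_points

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.encoder(image).view(-1, self.num_points, 3)

def view_consistency_loss(model, img_a, img_b, R_ab, t_ab):
    """R_ab (3x3), t_ab (3,): transform taking view A's frame into view B's."""
    pts_a = model(img_a)                       # 3D guess from view A
    pts_b = model(img_b)                       # 3D guess from view B
    pts_a_in_b = pts_a @ R_ab.t() + t_ab       # express A's guess in B's frame
    return ((pts_a_in_b - pts_b) ** 2).mean()  # the two guesses should agree

# At test time a single photograph suffices: model(new_image) -> 3D points.
```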
that do, for example, achieve this kind of 3D understanding 01:21:19.160 |
and you don't know how they, you don't know the, 01:21:31.040 |
So the fact that they're not or may not be explainable. 01:21:57.080 |
- So we, one human to another human is not fully explainable. 01:22:02.080 |
I think there are settings where explainability matters 01:22:17.400 |
maybe a computer program has made a certain diagnosis. 01:22:23.760 |
perhaps I should have treatment A or treatment B, right? 01:22:28.000 |
So now is the computer program's diagnosis based on data, 01:22:38.520 |
for American males who are in their 30s and 40s, 01:22:45.240 |
Maybe it is relevant, you know, et cetera, et cetera. 01:22:50.340 |
we have major issues to do with the reference class. 01:22:53.480 |
So we may have acquired statistics from one group of people 01:22:56.600 |
and applying it to a different group of people 01:22:59.520 |
who may not share all the same characteristics. 01:23:20.040 |
So there are settings where I want to know more 01:23:32.080 |
explainability and interpretability may matter. 01:23:37.280 |
and a better sense of the quality of the decision. 01:23:40.200 |
Where I'm willing to sacrifice interpretability 01:24:06.360 |
- Yeah, so the nice thing about the black boxes we are 01:24:13.760 |
but we're also, those of us who are charming, 01:24:19.480 |
like explain what's going on inside the black box 01:24:27.760 |
don't have to actually explain what's going on inside. 01:24:53.700 |
of human level or superhuman level intelligence? 01:25:04.560 |
what Turing thought actually we could do by year 2000, 01:25:08.320 |
right, do you think we'll ever be able to do? 01:25:12.760 |
One answer is in principle, can we do this at some time? 01:25:48.160 |
So in the business we are in, there are known unknowns, 01:25:54.920 |
So I think with respect to a lot of what's the case 01:25:59.920 |
in vision and robotics, I feel like we have known unknowns. 01:26:09.680 |
and what the problems that need to be solved are. 01:26:20.440 |
it's not just known unknowns, but also unknown unknowns. 01:26:24.080 |
So it is very difficult to put any kind of a time frame 01:26:33.720 |
might be positive in that they'll surprise us 01:26:40.160 |
- I think that is possible, because certainly 01:26:44.640 |
by how effective these deep learning systems have been. 01:26:50.000 |
Because I certainly would not have believed that in 2010. 01:26:55.000 |
I think what we knew from the mathematical theory 01:27:07.760 |
then these gradient descent techniques would work. 01:27:11.100 |
Now these are nonlinear, non-convex systems. 01:27:16.000 |
- Huge number of variables, so over-parameterized. 01:27:24.640 |
the ones who are totally immersed in the lore 01:27:27.200 |
and the black magic, they knew that they worked well, 01:27:42.480 |
- That they feel that they were comfortable with them. 01:27:46.760 |
- The community as a whole was certainly not. 01:27:50.640 |
And I think we were, to me that was the surprise, 01:28:01.240 |
from a wide range of initializations and so on. 01:28:04.640 |
And so that was certainly more rapid progress 01:28:32.480 |
are these fears of AGI in 10 years and 20 years 01:28:41.400 |
because that's based on completely unrealistic models 01:28:44.800 |
of how rapidly we will make progress in this field. 01:29:20.200 |
Do you think this is something we should be worried about? 01:29:24.160 |
Or we need to first allow the unknown unknowns 01:29:29.800 |
- I think we need to be worried about AI today. 01:29:32.920 |
I think that it is not just a worry we need to have 01:29:38.320 |
I think that AI is being used in many systems today. 01:29:45.200 |
when it causes biases or decisions which could be harmful. 01:29:50.200 |
I mean, decisions which could be unfair to some people, 01:29:53.880 |
or it could be a self-driving car which kills a pedestrian. 01:29:57.600 |
So AI systems are being deployed today, right? 01:30:01.840 |
And they're being deployed in many different settings, 01:30:03.840 |
maybe in medical diagnosis, maybe in a self-driving car, 01:30:06.560 |
maybe in selecting applicants for an interview. 01:30:09.880 |
So I would argue that when these systems make mistakes, 01:30:22.640 |
So I would argue that this is a continuous effort. 01:30:26.320 |
And this is something that in a way is not so surprising. 01:30:32.320 |
It's about all engineering and scientific progress, 01:30:35.840 |
with great power comes great responsibility. 01:30:47.080 |
which will suddenly happen on some day in 2079, 01:30:51.360 |
for which I need to design some clever trick. 01:30:58.280 |
And we need to be continuously on the lookout 01:31:00.920 |
for worrying about safety, biases, risks, right? 01:31:05.920 |
I mean, a self-driving car kills a pedestrian. 01:31:11.640 |
I mean, this Uber incident in Arizona, right? 01:31:19.100 |
In fact, it's about a very dumb intelligence, 01:31:23.760 |
- The worry people have with AGI is the scale. 01:31:31.360 |
like the thing that worries me about AI today, 01:31:36.120 |
is recommender systems, recommendation systems. 01:31:39.240 |
So if you look at Twitter or Facebook or YouTube, 01:31:42.620 |
they're controlling the ideas that we have access to, 01:31:50.480 |
And that's a fundamentally machine learning algorithm 01:31:55.160 |
And they, I mean, my life would not be the same 01:32:06.840 |
because of the algorithm that recommend those ideas. 01:32:14.760 |
- And that's the algorithm that's recommending 01:32:21.180 |
has control of millions, billions of people. 01:32:36.820 |
We can just go have a normal life outside of that. 01:32:39.780 |
But the more and more that gets into our life, 01:32:43.140 |
it's that algorithm, we start depending on it 01:32:46.980 |
and the different companies that are working on the algorithm. 01:32:52.580 |
And YouTube in particular is using computer vision, 01:33:05.380 |
who would benefit from those videos the most. 01:33:28.020 |
if you could relive a moment in your life outside of family 01:33:57.780 |
a lot of it is about being at the right place 01:34:19.700 |
And then there are times when you are in a field 01:34:24.700 |
and you can only solve curlicues upon curlicues. 01:34:35.180 |
well, actually 34 years as a professor at Berkeley, 01:34:40.420 |
which when I started in it was just like some little crazy, 01:34:59.460 |
has offered a lot of tools for scientific research, 01:35:06.340 |
for images in biology or astronomy and so on and so forth. 01:35:11.340 |
And we have, so we have made great scientific progress 01:35:15.660 |
which has had real practical impact in the world. 01:35:35.660 |
I mean, it's really still in a productive phase. 01:35:42.140 |
would laugh at you calling this field mature. 01:35:47.460 |
- So, but you're also, lest I forget to mention, 01:35:50.580 |
you've also mentored some of the biggest names 01:35:53.960 |
of computer vision, computer science, and AI today. 01:36:00.540 |
but it really is, what is it, how did you do it? 01:36:25.500 |
Those of us who are at top universities are blessed 01:36:32.780 |
and capable students coming and knocking on our door. 01:36:36.380 |
So I have to be humble enough to acknowledge that. 01:36:43.920 |
What I have added is, I think what I've always tried 01:36:48.760 |
to teach them is a sense of picking the right problems. 01:36:53.680 |
So I think that in science, in the short run, 01:37:00.040 |
success is always based on technical competence. 01:37:09.080 |
I mean, there's certain technical capabilities 01:37:22.840 |
And I feel that what I've been able to bring to the table 01:37:31.320 |
is some sense of taste of what are good problems, 01:37:36.320 |
what are problems that are worth attacking now 01:37:41.480 |
- What's a good problem, if you could summarize? 01:37:48.200 |
- I think I have a sense of what is a good problem, 01:37:56.440 |
in fact, he won a Nobel Prize, Peter Medawar, 01:38:11.640 |
which are not yet solved, but which are approachable. 01:38:22.160 |
that there is this problem which isn't quite solved yet, 01:38:27.320 |
There is some place where you can spear the beast. 01:38:32.320 |
And having that intuition that this problem is ripe 01:38:39.280 |
you can just beat your head and not make progress. 01:38:45.800 |
So if I have that and if I can convey that to students, 01:38:59.080 |
and their achievements and their great research, 01:39:01.280 |
even 20 years after they've ceased being my student. 01:39:08.800 |
that a problem is not yet solved, but it's solvable. 01:39:22.280 |
I've spent a fair amount of time studying psychology, 01:39:27.280 |
neuroscience, relevant areas of applied math and so forth. 01:39:31.240 |
So I can probably help them see some connections 01:39:35.880 |
to disparate things which they might not have otherwise. 01:39:54.200 |
but where I could help them is the shallow breadth, 01:40:09.160 |
- Well, it was beautifully refreshing just to hear you 01:40:12.640 |
naturally jump to psychology, back to computer science 01:40:20.400 |
and I think it's certainly for students empowering 01:40:36.360 |
with Jitendra Malik, and thank you to our sponsors, 01:40:56.800 |
and it really is the best way to support this podcast 01:41:02.260 |
If you enjoy this thing, subscribe on YouTube, 01:41:07.120 |
support it on Patreon, or connect with me on Twitter 01:41:17.680 |
from Prince Myshkin in "The Idiot" by Dostoevsky. 01:41:23.740 |
Thank you for listening, and hope to see you next time.