Language or Vision - What's Harder? (Ilya Sutskever) | AI Podcast Clips
Chapters
0:00 Intro
0:42 Machine Learning and Vision
3:11 Action is fundamentally different
4:01 Commonality
4:40 Which problem is harder
6:01 Is language hard
6:40 Where does vision stop
7:25 Should vision count
00:00:06.960 |
in computer vision, language, natural language processing, 00:00:10.980 |
reinforcement learning, sort of everything in between. 00:00:16.440 |
There may not be a topic you haven't touched, 00:00:20.140 |
and of course the fundamental science of deep learning. 00:00:23.540 |
What is the difference to you between vision, language, 00:00:30.860 |
as learning problems, and what are the commonalities? 00:00:43.600 |
Machine learning is a field with a lot of unity, 00:00:54.100 |
In fact, there's only one or two or three principles 00:01:01.300 |
in almost the same way to the different modalities 00:01:05.300 |
And that's why today, when someone writes a paper 00:01:08.060 |
on improving optimization of deep learning and vision, 00:01:16.300 |
Reinforcement learning, so I would say that computer vision 00:01:27.860 |
and we use convolutional neural networks in vision. 00:01:30.460 |
But it's also possible that one day this will change 00:01:32.860 |
and everything will be unified with a single architecture. 00:01:43.340 |
for every different tiny problem had its own architecture. 00:01:59.880 |
and sub, you know, little set of collection of skills, 00:02:02.580 |
people who would know how to engineer the features. 00:02:12.460 |
Or rather, I shouldn't say expect, I think it's possible. 00:02:20.800 |
RL does require slightly different techniques 00:02:24.760 |
You really do need to do something about exploration. 00:02:29.980 |
but I think there is a lot of unity even there. 00:02:35.160 |
broad unification between RL and supervised learning 00:02:39.180 |
where somehow the RL will be making decisions 00:02:42.500 |
And there will be, I imagine one big black box 00:02:47.220 |
you shovel things into it and it just figures out 00:02:51.980 |
I mean, reinforcement learning has some aspects 00:03:02.780 |
and there's elements of a really rich sensory space. 00:03:06.980 |
So it seems like the, it's like the union of the two 00:03:13.900 |
I'd say that reinforcement learning is neither, 00:03:20.420 |
- You think action is fundamentally different? 00:03:25.220 |
what is unique about policy of learning to act? 00:03:33.740 |
you are fundamentally in a non-stationary world 00:03:45.260 |
And this is not the case for the more traditional 00:03:48.060 |
static problem where you have some distribution 00:03:50.220 |
and you just apply a model to that distribution. 00:03:53.440 |
- You think it's a fundamentally different problem 00:03:57.880 |
it's a generalization of the problem of understanding? 00:04:00.920 |
- I mean, it's a question of definitions almost. 00:04:03.720 |
There is a huge amount of commonality for sure. 00:04:05.880 |
You take gradients, you try to approximate gradients 00:04:20.220 |
You compute the gradient, you apply Adam in both cases. 00:04:32.800 |
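The remark that you compute or approximate a gradient and then apply Adam in both settings can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and does not come from the conversation: the toy linear regression, the two-armed bandit, the helper names `new_state`, `adam_step`, and `softmax`, and all hyperparameters. The supervised case uses an exact gradient, the RL case uses a sampled score-function (REINFORCE-style) estimate, and the same update function consumes both.

```python
import numpy as np

def new_state(shape):
    # Adam bookkeeping: step count plus first and second moment estimates.
    return {"t": 0, "m": np.zeros(shape), "v": np.zeros(shape)}

def adam_step(param, grad, state, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    # One standard Adam update with bias-corrected moments.
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)

# Supervised case (hypothetical toy data): exact gradient of the
# mean squared error of a linear model.
X = rng.normal(size=(64, 2))
true_w = np.array([1.5, -0.5])
y = X @ true_w
w, w_state = np.zeros(2), new_state(2)
for _ in range(500):
    grad = 2.0 * X.T @ (X @ w - y) / len(y)
    w = adam_step(w, grad, w_state)

# RL case (hypothetical two-armed bandit): the gradient of expected reward
# is only approximated, here by a score-function estimate averaged over
# 32 sampled actions per step.
arm_means = np.array([0.2, 0.8])
logits, rl_state = np.zeros(2), new_state(2)
for _ in range(500):
    probs = softmax(logits)
    actions = rng.choice(2, size=32, p=probs)
    rewards = arm_means[actions] + 0.1 * rng.normal(size=32)
    grad_logp = np.eye(2)[actions] - probs  # d log pi(a) / d logits, per sample
    # Negated because adam_step minimizes and we want to maximize reward.
    grad = -(rewards[:, None] * grad_logp).mean(axis=0)
    logits = adam_step(logits, grad, rl_state)

final_probs = softmax(logits)
print("regression weights:", w)
print("bandit policy:", final_probs)
```

The only difference between the two loops is where `grad` comes from. The data distribution in the bandit loop also shifts as the policy changes, which the regression loop never has to deal with.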
It's really just a matter of your point of view, 00:04:34.880 |
what frame of reference, how much do you wanna zoom in 00:04:43.720 |
So people like Noam Chomsky believe that language 00:04:49.600 |
Do you think language understanding is harder 00:04:52.540 |
than visual scene understanding or vice versa? 00:05:03.380 |
- So what does it mean for a problem to be hard? 00:05:06.540 |
Okay, the non-interesting dumb answer to that 00:05:11.140 |
is there's a benchmark and there's a human level performance 00:05:16.180 |
on that benchmark and how is the effort required 00:05:23.000 |
- So from the perspective of how much until we get 00:05:32.840 |
So what I was going to say that a lot of it depends on, 00:05:36.000 |
you know, once you solve a problem, it stops being hard 00:05:39.960 |
And so whether something is hard or not depends 00:05:43.680 |
So, you know, you say today, true human level, 00:05:47.660 |
language understanding and visual perception are hard 00:05:50.260 |
in the sense that there is no way of solving the problem 00:05:57.900 |
Beyond that, I'm just, I'd be, my guess would be 00:06:01.420 |
- Oh, okay, so you don't have a fundamental intuition 00:06:08.260 |
I'd say language is probably going to be harder. 00:06:16.140 |
language understanding, I'll go with language. 00:06:18.440 |
But then if I show you a piece of paper with letters on it, 00:06:26.580 |
you say it's the best human level vision system. 00:06:29.060 |
I show you, I open a book and I show you letters. 00:06:32.740 |
Will it understand how these letters form into word 00:06:40.060 |
- Yeah, so Chomsky would say it starts at language. 00:06:42.180 |
So vision is just a little example of the kind of 00:06:50.460 |
that's already represented in our brain somehow, 00:06:55.320 |
But where does vision stop and language begin? 00:07:13.860 |
to achieve really deep understanding in either images 00:07:18.700 |
or language without basically using the same kind of system. 00:07:27.060 |
if we can get one, our machine learning is probably 00:07:34.140 |
And also, I think a lot of it really does depend 00:07:43.940 |
Because reading is vision, but should it count? 00:07:47.240 |
- Yeah, to me, so my definition is if a system 00:07:51.080 |
looked at an image and then a system looked at a piece 00:07:55.140 |
of text and then told me something about that 00:08:06.420 |
all the systems do that, but here's the thing they don't do. 00:08:13.660 |
- Well, the ones, okay, so I'm a fan of monogamy, 00:08:22.020 |
So I believe in the fact that yes, it's possible 00:08:24.540 |
to have somebody continuously giving you pleasurable, 00:08:36.020 |
- The surprise, it's that injection of randomness 00:08:46.860 |
continued inspiration, like the wit, the humor. 00:08:52.740 |
I think, yeah, that would be, it's a very subjective test, 00:08:57.780 |
but I think if you have enough humans in the room. 00:09:04.540 |
Yeah, I feel like I misunderstood what you meant 00:09:07.260 |
I thought you meant to impress you with its intelligence, 00:09:17.180 |
and it's gonna get it right, and you're gonna say, wow, 00:09:19.860 |
Our systems of January 2020 have not been doing that. 00:09:23.860 |
- Yeah, no, I think it all boils down to the reason