
Language or Vision - What's Harder? (Ilya Sutskever) | AI Podcast Clips


Chapters

0:00 Intro
0:42 Machine Learning and Vision
3:11 Action is fundamentally different
4:01 Commonality
4:40 Which problem is harder
6:01 Is language hard
6:40 Where does vision stop
7:25 Should vision count

Whisper Transcript

00:00:00.000 | - So incredibly, you've contributed some
00:00:04.620 | of the biggest recent ideas in AI,
00:00:06.960 | in computer vision, language, natural language processing,
00:00:10.980 | reinforcement learning, sort of everything in between.
00:00:15.220 | Maybe not GANs.
00:00:16.440 | There may not be a topic you haven't touched,
00:00:20.140 | and of course the fundamental science of deep learning.
00:00:23.540 | What is the difference to you between vision, language,
00:00:28.100 | and, as in reinforcement learning, action,
00:00:30.860 | as learning problems, and what are the commonalities?
00:00:33.500 | Do you see them as all interconnected,
00:00:35.460 | or are they fundamentally different domains
00:00:37.740 | that require different approaches?
00:00:40.700 | - Okay, that's a good question.
00:00:43.600 | Machine learning is a field with a lot of unity,
00:00:45.860 | a huge amount of unity.
00:00:47.180 | In fact-- - What do you mean by unity?
00:00:49.260 | Like overlap of ideas?
00:00:52.300 | - Overlap of ideas, overlap of principles.
00:00:54.100 | In fact, there's only one or two or three principles
00:00:56.620 | which are very, very simple,
00:00:58.300 | and then they apply in almost the same way
00:01:01.300 | to the different modalities
00:01:03.900 | of the different problems.
00:01:05.300 | And that's why today, when someone writes a paper
00:01:08.060 | on improving optimization of deep learning in vision,
00:01:11.100 | it improves the different NLP applications,
00:01:13.260 | and it improves the different
00:01:14.100 | reinforcement learning applications.
00:01:16.300 | So I would say that computer vision
00:01:19.740 | and NLP are very similar to each other.
00:01:22.540 | Today, they differ in that they have
00:01:24.940 | slightly different architectures.
00:01:26.120 | We use transformers in NLP,
00:01:27.860 | and we use convolutional neural networks in vision.
00:01:30.460 | But it's also possible that one day this will change
00:01:32.860 | and everything will be unified with a single architecture.
00:01:35.780 | Because if you go back a few years ago
00:01:37.620 | in natural language processing,
00:01:39.420 | there were a huge number of architectures;
00:01:43.340 | every different tiny problem had its own architecture.
00:01:46.240 | Today, there's just one transformer
00:01:49.860 | for all those different tasks.
00:01:51.420 | And if you go back in time even more,
00:01:53.620 | you had even more and more fragmentation
00:01:55.340 | and every little problem in AI
00:01:57.760 | had its own little subspecialization
00:01:59.880 | and its own, you know, little collection of skills,
00:02:02.580 | people who would know how to engineer the features.
00:02:04.940 | Now it's all been subsumed by deep learning.
00:02:06.820 | We have this unification.
00:02:08.100 | And so I expect vision to become unified
00:02:10.780 | with natural language as well.
00:02:12.460 | Or rather, I shouldn't say expect, I think it's possible.
00:02:14.420 | I don't wanna be too sure because I think
00:02:16.740 | the convolutional neural net
00:02:17.580 | is very computationally efficient.
00:02:19.460 | RL is different.
00:02:20.800 | RL does require slightly different techniques
00:02:22.800 | because you really do need to take action.
00:02:24.760 | You really do need to do something about exploration.
00:02:27.820 | Your variance is much higher,
00:02:29.980 | but I think there is a lot of unity even there.
00:02:32.140 | And I would expect, for example,
00:02:33.260 | that at some point there will be some
00:02:35.160 | broad unification between RL and supervised learning
00:02:39.180 | where somehow the RL will be making decisions
00:02:41.100 | to make the supervised learning go better.
00:02:42.500 | And there will be, I imagine, one big black box,
00:02:45.700 | and you just, you know,
00:02:47.220 | shovel things into it and it just figures out
00:02:49.940 | what to do with whatever you shovel at it.
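
As a concrete illustration of the point about needing to take action and do something about exploration, here is a minimal sketch, not something from the conversation: an epsilon-greedy agent on a toy two-armed bandit. The payoff probabilities and the 0.1 exploration rate are illustrative assumptions.

```python
import random

# Toy two-armed bandit: the agent only learns about an arm's payoff by
# pulling it, so some fraction of pulls must be exploratory.
true_means = [0.3, 0.7]        # hidden payoff probabilities (illustrative)
counts = [0, 0]
values = [0.0, 0.0]            # the agent's running payoff estimates

for t in range(1000):
    if random.random() < 0.1:  # epsilon-greedy: explore 10% of the time
        arm = random.randrange(2)
    else:                      # otherwise exploit the current estimates
        arm = max(range(2), key=lambda a: values[a])
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

print(values)  # converges near [0.3, 0.7] only for arms the agent tries
```

A purely greedy agent can lock onto the worse arm forever; the occasional random pull is what guards against that.
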
00:02:51.980 | - I mean, reinforcement learning has some aspects
00:02:54.700 | of language and vision combined almost.
00:02:59.060 | There's elements of a long-term memory
00:03:01.660 | that you should be utilizing,
00:03:02.780 | and there's elements of a really rich sensory space.
00:03:06.980 | So it seems like it's the union of the two
00:03:10.780 | or something like that.
00:03:12.540 | - I'd say something slightly differently.
00:03:13.900 | I'd say that reinforcement learning is neither,
00:03:16.620 | but it naturally interfaces and integrates
00:03:19.360 | with the two of them.
00:03:20.420 | - You think action is fundamentally different?
00:03:23.200 | So yeah, what is interesting about,
00:03:25.220 | what is unique about learning a policy, about learning to act?
00:03:29.940 | - Well, so one example, for instance,
00:03:31.420 | is that when you learn to act,
00:03:33.740 | you are fundamentally in a non-stationary world
00:03:37.140 | because as your actions change,
00:03:39.700 | the things you see start changing.
00:03:41.980 | You experience the world in a different way.
00:03:45.260 | And this is not the case for the more traditional
00:03:48.060 | static problem where you have some distribution
00:03:50.220 | and you just apply a model to that distribution.
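
To make the non-stationarity concrete, here is a small sketch contrasting the two settings; the drifting policy and the integer-valued world are purely hypothetical stand-ins, not anything described in the conversation.

```python
import random

# Static supervised setting: the data distribution is fixed in advance.
# Nothing the model does changes which examples it will see next.
dataset = [(x, x % 2) for x in range(100)]
batch = random.sample(dataset, 8)

# Acting setting: each observation is caused by the agent's own action,
# so when the policy changes, the distribution of experience changes too.
def policy(state):
    return 1 if random.random() < 0.8 else -1  # hypothetical learned policy

state = 0
experience = []
for _ in range(8):
    action = policy(state)
    state += action            # the next thing the agent "sees"...
    experience.append(state)   # ...depends on what it just did

print(batch)
print(experience)
```

The supervised batch comes from the same distribution no matter what the model has learned; the agent's experience shifts whenever its policy does.
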
00:03:53.440 | - You think it's a fundamentally different problem
00:03:55.160 | or is it just a more difficult version,
00:03:57.880 | a generalization of the problem of understanding?
00:04:00.920 | - I mean, it's a question of definitions almost.
00:04:03.720 | There is a huge amount of commonality for sure.
00:04:05.880 | You take gradients, you try to approximate gradients
00:04:09.280 | in both cases.
00:04:10.120 | In the case of reinforcement learning,
00:04:11.880 | you have some tools to reduce the variance
00:04:14.040 | of the gradients, you do that.
00:04:15.880 | There's lots of commonality.
00:04:17.840 | You use the same neural net in both cases.
00:04:20.220 | You compute the gradient, you apply Adam in both cases.
00:04:22.960 | So, I mean, there's lots in common for sure,
00:04:28.160 | but there are some small differences
00:04:30.800 | which are not completely insignificant.
00:04:32.800 | It's really just a matter of your point of view,
00:04:34.880 | what frame of reference, how much do you wanna zoom in
00:04:38.200 | or out as you look at these problems.
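
Here is a minimal PyTorch sketch of that commonality, assuming illustrative layer sizes, a constant baseline, and a made-up reward callback: one network, one Adam optimizer, and two gradient routes into it, an exact supervised one and a sampled, variance-reduced policy-gradient one.

```python
import torch
import torch.nn as nn

net = nn.Linear(4, 3)                              # same net in both cases
opt = torch.optim.Adam(net.parameters(), lr=1e-3)  # Adam in both cases

def supervised_step(x, y):
    # Static distribution: a labeled example yields the exact loss gradient.
    loss = nn.functional.cross_entropy(net(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

def policy_gradient_step(x, reward_fn, baseline=0.5):
    # RL: an action must be sampled to get any signal, so the gradient is
    # only an estimate; subtracting a baseline is one standard tool for
    # reducing the variance of that estimate.
    dist = torch.distributions.Categorical(logits=net(x))
    action = dist.sample()
    loss = -(reward_fn(action) - baseline) * dist.log_prob(action)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Hypothetical usage: the very same parameters are updated by either route.
supervised_step(torch.randn(2, 4), torch.tensor([0, 2]))
policy_gradient_step(torch.randn(4), lambda a: 1.0)
```
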
00:04:41.160 | - Which problem do you think is harder?
00:04:43.720 | So people like Noam Chomsky believe that language
00:04:46.120 | is fundamental to everything.
00:04:47.840 | So it underlies everything.
00:04:49.600 | Do you think language understanding is harder
00:04:52.540 | than visual scene understanding or vice versa?
00:04:55.560 | - I think that asking if a problem is hard
00:04:58.540 | is slightly wrong.
00:05:00.140 | I think the question is a little bit wrong
00:05:01.460 | and I wanna explain why.
00:05:03.380 | - So what does it mean for a problem to be hard?
00:05:06.540 | Okay, the non-interesting dumb answer to that
00:05:11.140 | is there's a benchmark and there's a human level performance
00:05:16.180 | on that benchmark, and how much effort is required
00:05:20.600 | to reach the human-level benchmark.
00:05:23.000 | - So from the perspective of how long until we get
00:05:25.600 | to human level on a very good benchmark.
00:05:29.200 | - Yeah, I understand what you mean by that.
00:05:32.840 | So what I was going to say is that a lot of it depends on,
00:05:36.000 | you know, once you solve a problem, it stops being hard
00:05:37.960 | and that's always true.
00:05:39.960 | And so whether something is hard or not depends
00:05:42.160 | on what our tools can do today.
00:05:43.680 | So, you know, you could say that today, true human-level
00:05:47.660 | language understanding and visual perception are hard
00:05:50.260 | in the sense that there is no way of solving the problem
00:05:53.900 | completely in the next three months, right?
00:05:55.960 | So I agree with that statement.
00:05:57.900 | Beyond that, my guess would be
00:05:59.980 | as good as yours, I don't know.
00:06:01.420 | - Oh, okay, so you don't have a fundamental intuition
00:06:04.320 | about how hard language understanding is.
00:06:06.780 | - I think, you know, I changed my mind.
00:06:08.260 | I'd say language is probably going to be harder.
00:06:10.780 | I mean, it depends on how you define it.
00:06:13.140 | Like if you mean absolute top-notch, 100%
00:06:16.140 | language understanding, I'll go with language.
00:06:18.440 | But then if I show you a piece of paper with letters on it,
00:06:22.860 | is that, you see what I mean?
00:06:25.340 | It's like you have a vision system,
00:06:26.580 | you say it's the best human level vision system.
00:06:29.060 | I show you, I open a book and I show you letters.
00:06:32.740 | Will it understand how these letters form into words
00:06:34.820 | and sentences and meaning?
00:06:36.220 | Is this part of the vision problem?
00:06:37.660 | Where does vision end and language begin?
00:06:40.060 | - Yeah, so Chomsky would say it starts at language.
00:06:42.180 | So vision is just a little example of the kind of
00:06:44.860 | structure and fundamental hierarchy of ideas
00:06:50.460 | that's already represented in our brain somehow,
00:06:53.000 | that's represented through language.
00:06:55.320 | But where does vision stop and language begin?
00:07:00.320 | That's a really interesting question.
00:07:11.700 | So one possibility is that it's impossible
00:07:13.860 | to achieve really deep understanding in either images
00:07:18.700 | or language without basically using the same kind of system.
00:07:22.380 | So you're going to get the other for free.
00:07:24.580 | - I think it's pretty likely that yes,
00:07:27.060 | if we can get one, our machine learning is probably
00:07:29.820 | that good that we can get the other.
00:07:31.300 | But I'm not 100% sure.
00:07:34.140 | And also, I think a lot of it really does depend
00:07:38.500 | on your definitions.
00:07:40.660 | - Definitions of?
00:07:41.940 | - Of like perfect vision.
00:07:43.940 | Because reading is vision, but should it count?
00:07:47.240 | - Yeah, to me, so my definition is if a system
00:07:51.080 | looked at an image, or a system looked at a piece
00:07:55.140 | of text, and then told me something about it,
00:07:59.940 | and I was really impressed.
00:08:01.380 | - That's relative.
00:08:03.380 | You'll be impressed for half an hour
00:08:05.180 | and then you're gonna say, well, I mean,
00:08:06.420 | all the systems do that, but here's the thing they don't do.
00:08:09.100 | - Yeah, but I don't have that with humans.
00:08:11.020 | Humans continue to impress me.
00:08:12.820 | - Is that true?
00:08:13.660 | - Well, the ones, okay, so I'm a fan of monogamy,
00:08:17.900 | so I like the idea of marrying somebody,
00:08:19.900 | being with them for several decades.
00:08:22.020 | So I believe in the fact that yes, it's possible
00:08:24.540 | to have somebody continuously giving you pleasurable,
00:08:29.540 | interesting, witty, new ideas.
00:08:32.500 | Friends, yeah, I think so.
00:08:33.900 | They continue to surprise you.
00:08:36.020 | - The surprise, that injection of randomness,
00:08:41.020 | seems to be a nice source of, yeah,
00:08:46.860 | continued inspiration, like the wit, the humor.
00:08:52.740 | I think, yeah, it's a very subjective test,
00:08:57.780 | but I think if you have enough humans in the room...
00:09:02.580 | - Yeah, I understand what you mean.
00:09:04.540 | Yeah, I feel like I misunderstood what you meant
00:09:06.420 | by impressing you.
00:09:07.260 | I thought you meant impressing you with its intelligence,
00:09:10.500 | with how well it understands an image.
00:09:14.220 | I thought you meant something like,
00:09:15.620 | I'm gonna show it a really complicated image
00:09:17.180 | and it's gonna get it right, and you're gonna say, wow,
00:09:19.020 | that's really cool.
00:09:19.860 | Our systems of January 2020 have not been doing that.
00:09:23.860 | - Yeah, no, I think it all boils down to the reason
00:09:27.980 | people click 'like' on stuff on the internet,
00:09:30.040 | which is it makes them laugh.
00:09:32.260 | So it's like humor or wit or insight.
00:09:36.640 | I'm sure we'll get that as well.
00:09:38.900 | (upbeat music)