
Language or Vision - What's Harder? (Ilya Sutskever) | AI Podcast Clips


0:00 Intro
0:42 Machine Learning and Vision
3:11 Action is fundamentally different
4:01 Commonality
4:40 Which problem is harder
6:01 Is language hard
6:40 Where does vision stop
7:25 Should vision count

Whisper Transcript

00:00:00.000 | - So incredibly, you've contributed some
00:00:04.620 | of the biggest recent ideas in AI,
00:00:06.960 | in computer vision, language, natural language processing,
00:00:10.980 | reinforcement learning, sort of everything in between.
00:00:15.220 | Maybe not GANs.
00:00:16.440 | There may not be a topic you haven't touched,
00:00:20.140 | and of course the fundamental science of deep learning.
00:00:23.540 | What is the difference to you between vision, language,
00:00:28.100 | and as in reinforcement learning, action,
00:00:30.860 | as learning problems, and what are the commonalities?
00:00:33.500 | Do you see them as all interconnected,
00:00:35.460 | or are they fundamentally different domains
00:00:37.740 | that require different approaches?
00:00:40.700 | - Okay, that's a good question.
00:00:43.600 | Machine learning is a field with a lot of unity,
00:00:45.860 | a huge amount of unity.
00:00:47.180 | In fact-- - What do you mean by unity?
00:00:49.260 | Like overlap of ideas?
00:00:52.300 | - Overlap of ideas, overlap of principles.
00:00:54.100 | In fact, there's only one or two or three principles
00:00:56.620 | which are very, very simple,
00:00:58.300 | and then they apply in almost the same way,
00:01:01.300 | in almost the same way to the different modalities
00:01:03.900 | of the different problems.
00:01:05.300 | And that's why today, when someone writes a paper
00:01:08.060 | on improving optimization of deep learning and vision,
00:01:11.100 | it improves the different NLP applications,
00:01:13.260 | and it improves the different
00:01:14.100 | reinforcement learning applications.
00:01:16.300 | Reinforcement learning, so I would say that computer vision
00:01:19.740 | and NLP are very similar to each other.
00:01:22.540 | Today, they differ in that they have
00:01:24.940 | slightly different architectures.
00:01:26.120 | We use transformers in NLP,
00:01:27.860 | and we use convolutional neural networks in vision.
00:01:30.460 | But it's also possible that one day this will change
00:01:32.860 | and everything will be unified with a single architecture.
00:01:35.780 | Because if you go back a few years ago
00:01:37.620 | in natural language processing,
00:01:39.420 | there were a huge number of architectures
00:01:43.340 | for every different tiny problem had its own architecture.
00:01:46.240 | Today, there's just one transformer
00:01:49.860 | for all those different tasks.
00:01:51.420 | And if you go back in time even more,
00:01:53.620 | you had even more and more fragmentation
00:01:55.340 | and every little problem in AI
00:01:57.760 | had its own little subspecialization
00:01:59.880 | and sub, you know, little set of collection of skills,
00:02:02.580 | people who would know how to engineer the features.
00:02:04.940 | Now it's all been subsumed by deep learning.
00:02:06.820 | We have this unification.
00:02:08.100 | And so I expect vision to become unified
00:02:10.780 | with natural language as well.
00:02:12.460 | Or rather, I shouldn't say expect, I think it's possible.
00:02:14.420 | I don't wanna be too sure because I think
00:02:16.740 | the convolutional neural net
00:02:17.580 | is very computationally efficient.
00:02:19.460 | RL is different.
00:02:20.800 | RL does require slightly different techniques
00:02:22.800 | because you really do need to take action.
00:02:24.760 | You really do need to do something about exploration.
00:02:27.820 | Your variance is much higher,
00:02:29.980 | but I think there is a lot of unity even there.
00:02:32.140 | And I would expect, for example,
00:02:33.260 | that at some point there will be some
00:02:35.160 | broad unification between RL and supervised learning
00:02:39.180 | where somehow the RL will be making decisions
00:02:41.100 | to make the supervised learning go better.
00:02:42.500 | And there will be, I imagine one big black box
00:02:45.700 | and you just throw every, you know,
00:02:47.220 | you shovel things into it and it just figures out
00:02:49.940 | what to do with whatever you shovel at it.
00:02:51.980 | - I mean, reinforcement learning has some aspects
00:02:54.700 | of language and vision combined almost.
00:02:59.060 | There's elements of a long-term memory
00:03:01.660 | that you should be utilizing,
00:03:02.780 | and there's elements of a really rich sensory space.
00:03:06.980 | So it seems like the, it's like the union of the two
00:03:10.780 | or something like that.
00:03:12.540 | - I'd say something slightly differently.
00:03:13.900 | I'd say that reinforcement learning is neither,
00:03:16.620 | but it naturally interfaces and integrates
00:03:19.360 | with the two of them.
00:03:20.420 | - You think action is fundamentally different?
00:03:23.200 | So yeah, what is interesting about,
00:03:25.220 | what is unique about policy learning, learning to act?
00:03:29.940 | - Well, so one example, for instance,
00:03:31.420 | is that when you learn to act,
00:03:33.740 | you are fundamentally in a non-stationary world
00:03:37.140 | because as your actions change,
00:03:39.700 | the things you see start changing.
00:03:41.980 | You experience the world in a different way.
00:03:45.260 | And this is not the case for the more traditional
00:03:48.060 | static problem where you have some distribution
00:03:50.220 | and you just apply a model to that distribution.
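A toy sketch of the non-stationarity point made in this answer: when a policy changes, the data it sees changes with it. The five-position walk, the function `observed_states`, and the parameter `p_right` are hypothetical names made up for illustration; nothing here comes from the conversation itself.

```python
import random

def observed_states(p_right, steps=1000, seed=0):
    """Fraction of time spent at each of positions 0..4.

    The policy moves right with probability p_right, otherwise left.
    Moves past either end of the walk are clamped.
    """
    rng = random.Random(seed)
    pos = 2
    visits = [0] * 5
    for _ in range(steps):
        step = 1 if rng.random() < p_right else -1
        pos = min(4, max(0, pos + step))
        visits[pos] += 1
    return [v / steps for v in visits]

# Same environment, same random seed, two different policies.
left_leaning = observed_states(p_right=0.2)
right_leaning = observed_states(p_right=0.8)

print("left-leaning policy sees: ", left_leaning)
print("right-leaning policy sees:", right_leaning)
```

In a static supervised setup the dataset would be the same list of examples no matter which model reads it. Here the distribution over observed positions is itself a function of the policy, so updating the policy shifts the data that the next update is computed on.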
00:03:53.440 | - You think it's a fundamentally different problem
00:03:55.160 | or is it just a more difficult,
00:03:57.880 | it's a generalization of the problem of understanding?
00:04:00.920 | - I mean, it's a question of definitions almost.
00:04:03.720 | There is a huge amount of commonality for sure.
00:04:05.880 | You take gradients, you try to approximate gradients
00:04:09.280 | in both cases.
00:04:10.120 | In the case of reinforcement learning,
00:04:11.880 | you have some tools to reduce the variance
00:04:14.040 | of the gradients, you do that.
00:04:15.880 | There's lots of commonality.
00:04:17.840 | You use the same neural net in both cases.
00:04:20.220 | You compute the gradient, you apply Adam in both cases.
00:04:22.960 | So, I mean, there's lots in common for sure,
00:04:28.160 | but there are some small differences
00:04:30.800 | which are not completely insignificant.
00:04:32.800 | It's really just a matter of your point of view,
00:04:34.880 | what frame of reference, how much do you wanna zoom in
00:04:38.200 | or out as you look at these problems.
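The remark about approximating gradients and reducing their variance can be made concrete with a small sketch. It assumes a two-action bandit with a one-parameter sigmoid policy and uses the score-function (REINFORCE-style) estimator with an optional constant baseline. The rewards, the baseline value, and every name below are assumptions chosen for illustration, not a method described in the conversation.

```python
import math
import random
import statistics

REWARDS = (1.0, 2.0)  # hypothetical rewards for action 0 and action 1

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score_function_sample(theta, rng, baseline=0.0):
    """One sample of (reward - baseline) * d/dtheta log pi(action)."""
    p = sigmoid(theta)  # probability of choosing action 1
    action = 1 if rng.random() < p else 0
    # d/dtheta log pi(action) is (1 - p) for action 1 and -p for action 0.
    score = (1.0 - p) if action == 1 else -p
    return (REWARDS[action] - baseline) * score

def estimate(theta, baseline, n=10000, seed=0):
    """Sample mean and variance of the gradient estimator."""
    rng = random.Random(seed)
    samples = [score_function_sample(theta, rng, baseline) for _ in range(n)]
    return statistics.fmean(samples), statistics.pvariance(samples)

theta = 0.0
p = sigmoid(theta)
# Exact gradient of expected reward: p * (1 - p) * (r1 - r0).
true_grad = p * (1.0 - p) * (REWARDS[1] - REWARDS[0])

mean_plain, var_plain = estimate(theta, baseline=0.0)
mean_base, var_base = estimate(theta, baseline=1.5)  # 1.5 is the mean reward at theta = 0

print("exact gradient:  ", true_grad)
print("no baseline:     ", mean_plain, "variance", var_plain)
print("with baseline:   ", mean_base, "variance", var_base)
```

Both estimators target the same gradient, and subtracting the baseline changes only the spread of the samples. Either estimate could then be handed to the same optimizer a supervised model would use, which is the commonality being described.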
00:04:41.160 | - Which problem do you think is harder?
00:04:43.720 | So people like Noam Chomsky believe that language
00:04:46.120 | is fundamental to everything.
00:04:47.840 | So it underlies everything.
00:04:49.600 | Do you think language understanding is harder
00:04:52.540 | than visual scene understanding or vice versa?
00:04:55.560 | - I think that asking if a problem is hard
00:04:58.540 | is slightly wrong.
00:05:00.140 | I think the question is a little bit wrong
00:05:01.460 | and I wanna explain why.
00:05:03.380 | - So what does it mean for a problem to be hard?
00:05:06.540 | Okay, the non-interesting dumb answer to that
00:05:11.140 | is there's a benchmark and there's a human level performance
00:05:16.180 | on that benchmark and how much effort is required
00:05:20.600 | to reach the human level benchmark.
00:05:23.000 | - So from the perspective of how much until we get
00:05:25.600 | to human level on a very good benchmark.
00:05:29.200 | - Yeah, I understand what you mean by that.
00:05:32.840 | So what I was going to say is that a lot of it depends on,
00:05:36.000 | you know, once you solve a problem, it stops being hard
00:05:37.960 | and that's always true.
00:05:39.960 | And so whether something is hard or not depends
00:05:42.160 | on what our tools can do today.
00:05:43.680 | So, you know, you say today, true human level,
00:05:47.660 | language understanding and visual perception are hard
00:05:50.260 | in the sense that there is no way of solving the problem
00:05:53.900 | completely in the next three months, right?
00:05:55.960 | So I agree with that statement.
00:05:57.900 | Beyond that, I'm just, I'd be, my guess would be
00:05:59.980 | as good as yours, I don't know.
00:06:01.420 | - Oh, okay, so you don't have a fundamental intuition
00:06:04.320 | about how hard language understanding is.
00:06:06.780 | - I think, no, I changed my mind.
00:06:08.260 | I'd say language is probably going to be harder.
00:06:10.780 | I mean, it depends on how you define it.
00:06:13.140 | Like if you mean absolute top-notch, 100%
00:06:16.140 | language understanding, I'll go with language.
00:06:18.440 | But then if I show you a piece of paper with letters on it,
00:06:22.860 | is that, you see what I mean?
00:06:25.340 | It's like you have a vision system,
00:06:26.580 | you say it's the best human level vision system.
00:06:29.060 | I show you, I open a book and I show you letters.
00:06:32.740 | Will it understand how these letters form into words
00:06:34.820 | and sentences and meaning?
00:06:36.220 | Is this part of the vision problem?
00:06:37.660 | Where does vision end and language begin?
00:06:40.060 | - Yeah, so Chomsky would say it starts at language.
00:06:42.180 | So vision is just a little example of the kind of
00:06:44.860 | structure and fundamental hierarchy of ideas
00:06:50.460 | that's already represented in our brain somehow,
00:06:53.000 | that's represented through language.
00:06:55.320 | But where does vision stop and language begin?
00:07:00.320 | That's a really interesting question.
00:07:11.700 | So one possibility is that it's impossible
00:07:13.860 | to achieve really deep understanding in either images
00:07:18.700 | or language without basically using the same kind of system.
00:07:22.380 | So you're going to get the other for free.
00:07:24.580 | - I think it's pretty likely that yes,
00:07:27.060 | if we can get one, our machine learning is probably
00:07:29.820 | that good that we can get the other.
00:07:31.300 | But I'm not 100% sure.
00:07:34.140 | And also, I think a lot of it really does depend
00:07:38.500 | on your definitions.
00:07:40.660 | - Definitions of?
00:07:41.940 | - Of like perfect vision.
00:07:43.940 | Because reading is vision, but should it count?
00:07:47.240 | - Yeah, to me, so my definition is if a system
00:07:51.080 | looked at an image and then a system looked at a piece
00:07:55.140 | of text and then told me something about that
00:07:59.940 | and I was really impressed.
00:08:01.380 | - That's relative.
00:08:03.380 | You'll be impressed for half an hour
00:08:05.180 | and then you're gonna say, well, I mean,
00:08:06.420 | all the systems do that, but here's the thing they don't do.
00:08:09.100 | - Yeah, but I don't have that with humans.
00:08:11.020 | Humans continue to impress me.
00:08:12.820 | - Is that true?
00:08:13.660 | - Well, the ones, okay, so I'm a fan of monogamy,
00:08:17.900 | so I like the idea of marrying somebody,
00:08:19.900 | being with them for several decades.
00:08:22.020 | So I believe in the fact that yes, it's possible
00:08:24.540 | to have somebody continuously giving you pleasurable,
00:08:29.540 | interesting, witty, new ideas, friends.
00:08:32.500 | Yeah, I think so.
00:08:33.900 | They continue to surprise you.
00:08:36.020 | - The surprise, it's that injection of randomness
00:08:41.020 | seems to be a nice source of, yeah,
00:08:46.860 | continued inspiration, like the wit, the humor.
00:08:52.740 | I think, yeah, that would be, it's a very subjective test,
00:08:57.780 | but I think if you have enough humans in the room.
00:09:02.580 | - Yeah, I understand what you mean.
00:09:04.540 | Yeah, I feel like I misunderstood what you meant
00:09:06.420 | by impressing you.
00:09:07.260 | I thought you meant to impress you with its intelligence,
00:09:10.500 | with how well it understands an image.
00:09:14.220 | I thought you meant something like,
00:09:15.620 | I'm gonna show it a really complicated image
00:09:17.180 | and it's gonna get it right, and you're gonna say, wow,
00:09:19.020 | that's really cool.
00:09:19.860 | Our systems of January 2020 have not been doing that.
00:09:23.860 | - Yeah, no, I think it all boils down to the reason
00:09:27.980 | people click like on stuff on the internet,
00:09:30.040 | which is it makes them laugh.
00:09:32.260 | So it's like humor or wit or insight.
00:09:36.640 | I'm sure we'll get that as well.
00:09:38.900 | (upbeat music)