
Language or Vision - What's Harder? (Ilya Sutskever) | AI Podcast Clips


Chapters

0:00 Intro
0:42 Machine Learning and Vision
3:11 Action is fundamentally different
4:01 Commonality
4:40 Which problem is harder
6:01 Is language hard
6:40 Where does vision stop
7:25 Should vision count

Transcript

- So incredibly, you've contributed some of the biggest recent ideas in AI: in computer vision, language, natural language processing, reinforcement learning, sort of everything in between. Maybe not GANs. There may not be a topic you haven't touched, and of course the fundamental science of deep learning. What is the difference to you between vision, language, and, as in reinforcement learning, action, as learning problems, and what are the commonalities?

Do you see them as all interconnected, or are they fundamentally different domains that require different approaches? - Okay, that's a good question. Machine learning is a field with a lot of unity, a huge amount of unity. In fact-- - What do you mean by unity? Like overlap of ideas?

- Overlap of ideas, overlap of principles. In fact, there's only one or two or three principles, which are very, very simple, and then they apply in almost the same way to the different modalities, to the different problems. And that's why today, when someone writes a paper on improving optimization of deep learning in vision, it improves the different NLP applications, and it improves the different reinforcement learning applications.

So I would say that computer vision and NLP are very similar to each other. Today, they differ in that they have slightly different architectures: we use transformers in NLP, and we use convolutional neural networks in vision. But it's also possible that one day this will change and everything will be unified with a single architecture.

Because if you go back a few years in natural language processing, there was a huge number of architectures; every different tiny problem had its own architecture. Today, there's just one transformer for all those different tasks. And if you go back in time even more, you had even more fragmentation, and every little problem in AI had its own little subspecialization and its own little collection of skills, people who would know how to engineer the features.

Now it's all been subsumed by deep learning. We have this unification. And so I expect vision to become unified with natural language as well. Or rather, I shouldn't say expect; I think it's possible. I don't wanna be too sure, because I think the convolutional neural net is very computationally efficient.

RL is different. RL does require slightly different techniques because you really do need to take action. You really do need to do something about exploration. Your variance is much higher, but I think there is a lot of unity even there. And I would expect, for example, that at some point there will be some broad unification between RL and supervised learning where somehow the RL will be making decisions to make the supervised learning go better.

And there will be, I imagine, one big black box, and you just, you know, you shovel things into it and it just figures out what to do with whatever you shovel at it. - I mean, reinforcement learning has some aspects of language and vision combined, almost. There's elements of a long-term memory that you should be utilizing, and there's elements of a really rich sensory space.

So it seems like it's the union of the two or something like that. - I'd say it slightly differently: reinforcement learning is neither, but it naturally interfaces and integrates with the two of them. - You think action is fundamentally different? So yeah, what is interesting about, what is unique about, the policy of learning to act?

- Well, one example, for instance, is that when you learn to act, you are fundamentally in a non-stationary world, because as your actions change, the things you see start changing. You experience the world in a different way. And this is not the case for the more traditional static problem, where you have some distribution and you just apply a model to that distribution.
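
To make the non-stationarity point concrete, here is a minimal sketch with an entirely hypothetical toy world (nothing here is from the conversation itself): in the supervised case the data distribution is fixed before learning starts, while in the acting case the states the agent observes depend on its own policy, so as the policy improves, the distribution it learns from keeps shifting.

    import random

    # Supervised / static case: the distribution is fixed up front.
    # Sampling a batch now or after a year of training draws from
    # exactly the same distribution.
    fixed_dataset = [(x, 2 * x) for x in range(100)]
    batch = random.sample(fixed_dataset, 10)

    # Acting case: a toy 1-D world where the agent's position is the
    # state, and the action it takes determines what it sees next.
    def step(state, action):
        next_state = state + action
        reward = 1.0 if next_state == 5 else 0.0
        return next_state, reward

    def policy(state, bias):
        # A larger bias toward +1 stands in for an "improved" policy.
        return 1 if random.random() < bias else -1

    for bias in (0.5, 0.9):  # pretend the policy got better over time
        state, visited = 0, []
        for _ in range(20):
            action = policy(state, bias)
            state, reward = step(state, action)
            visited.append(state)
        # The distribution of visited states changes with the policy:
        # that is the non-stationarity being described.
        print(f"bias={bias}: first states visited {visited[:8]}")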

- You think it's a fundamentally different problem, or is it just a more difficult generalization of the problem of understanding? - I mean, it's a question of definitions, almost. There is a huge amount of commonality for sure. You take gradients, you try to approximate gradients, in both cases.

In the case of reinforcement learning, you have some tools to reduce the variance of the gradients, you do that. There's lots of commonality. You use the same neural net in both cases. You compute the gradient, you apply Adam in both cases. So, I mean, there's lots in common for sure, but there are some small differences which are not completely insignificant.
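
A hedged sketch of the commonality he is pointing at, using PyTorch with made-up toy tensors (the data and the scalar baseline are illustrative assumptions): the exact same network and the same Adam optimizer serve both cases, and only the loss differs, a supervised loss against known labels versus a REINFORCE-style policy-gradient loss, where subtracting a baseline is one simple example of the variance-reduction tools he mentions.

    import torch
    import torch.nn as nn

    # One network and one optimizer, shared by both settings.
    net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)

    x = torch.randn(8, 4)  # toy inputs standing in for real observations

    # Supervised case: a loss against known labels.
    labels = torch.randint(0, 2, (8,))
    loss_supervised = nn.functional.cross_entropy(net(x), labels)

    # RL case: no labels, only rewards for actions the policy sampled.
    dist = torch.distributions.Categorical(logits=net(x))
    actions = dist.sample()
    rewards = torch.randn(8)      # stand-in for environment rewards
    baseline = rewards.mean()     # one simple variance-reduction tool
    loss_rl = -(dist.log_prob(actions) * (rewards - baseline)).mean()

    # Either way, the rest is identical: compute the gradient, apply Adam.
    opt.zero_grad()
    loss_supervised.backward()    # or loss_rl.backward()
    opt.step()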

It's really just a matter of your point of view, what frame of reference, how much do you wanna zoom in or out as you look at these problems. - Which problem do you think is harder? So people like Noam Chomsky believe that language is fundamental to everything. So it underlies everything.

Do you think language understanding is harder than visual scene understanding or vice versa? - I think that asking if a problem is hard is slightly wrong. I think the question is a little bit wrong and I wanna explain why. - So what does it mean for a problem to be hard?

Okay, the non-interesting, dumb answer to that is: there's a benchmark, there's human-level performance on that benchmark, and how much effort is required to reach human-level performance on that benchmark. - So from the perspective of how much effort it takes until we get to human level on a very good benchmark.

- Yeah, I understand what you mean by that. So what I was going to say is that a lot of it depends on, you know, once you solve a problem, it stops being hard, and that's always true. And so whether something is hard or not depends on what our tools can do today.

So, you know, you can say that today, true human-level language understanding and visual perception are hard, in the sense that there is no way of solving the problem completely in the next three months, right? So I agree with that statement. Beyond that, my guess would be as good as yours, I don't know.

- Oh, okay, so you don't have a fundamental intuition about how hard language understanding is. - I think, you know, I changed my mind. I'd say language is probably going to be harder. I mean, it depends on how you define it. Like, if you mean absolute, top-notch, 100% language understanding, I'll go with language.

But then, if I show you a piece of paper with letters on it, is that... you see what I mean? It's like you have a vision system, and you say it's the best human-level vision system. I show you, I open a book, and I show you letters. Will it understand how these letters form into words and sentences and meaning?

Is this part of the vision problem? Where does vision end and language begin? - Yeah, so Chomsky would say it starts at language. So vision is just a little example of the kind of structure and fundamental hierarchy of ideas that's already represented in our brain somehow, that's represented through language.

But where does vision stop and language begin? That's a really interesting question. So one possibility is that it's impossible to achieve really deep understanding in either images or language without basically using the same kind of system. So you're going to get the other for free. - I think it's pretty likely that yes, if we can get one, our machine learning is probably that good that we can get the other.

But I'm not 100% sure. And also, I think a lot of it really does depend on your definitions. - Definitions of? - Of, like, perfect vision. Because reading is vision, but should it count? - Yeah, to me, my definition is: if a system looked at an image, and then looked at a piece of text, and then told me something about that, and I was really impressed.

- That's relative. You'll be impressed for half an hour, and then you're gonna say, well, I mean, all the systems do that, but here's the thing they don't do. - Yeah, but I don't have that with humans. Humans continue to impress me. - Is that true? - Well, okay, so I'm a fan of monogamy, so I like the idea of marrying somebody, being with them for several decades.

So I believe in the fact that, yes, it's possible to have somebody continuously giving you pleasurable, interesting, witty new ideas. Friends, yeah, I think so, they continue to surprise you. - The surprise, that injection of randomness, seems to be a nice source of, yeah, continued inspiration, like the wit, the humor.

I think, yeah, that would be, it's a very subjective test, but I think if you have enough humans in the room... - Yeah, I understand what you mean. Yeah, I feel like I misunderstood what you meant by impressing you. I thought you meant to impress you with its intelligence, with how well it understands an image.

I thought you meant something like, I'm gonna show it a really complicated image and it's gonna get it right, and you're gonna say, wow, that's really cool. Our systems of January 2020 have not been doing that. - Yeah, no, I think it all boils down to the reason people click like on stuff on the internet, which is it makes them laugh.

So it's like humor or wit or insight. I'm sure we'll get that as well. (upbeat music)