Ishan Misra: Self-Supervised Deep Learning in Computer Vision

00:00:00.000 | The following is a conversation with Ishan Misra,

00:00:03.260 | research scientist at Facebook AI Research

00:00:05.840 | who works on self-supervised machine learning

00:00:08.600 | in the domain of computer vision.

00:00:10.520 | Or in other words, making AI systems

00:00:13.480 | understand the visual world with minimal help

00:00:16.580 | from us humans.

00:00:18.040 | Transformers and self-attention has been successfully used

00:00:21.740 | by OpenAI's GPT-3 and other language models

00:00:25.620 | to do self-supervised learning in the domain of language.

00:00:28.600 | Ishan, together with Yan LeCun and others,

00:00:31.800 | is trying to achieve the same success

00:00:33.960 | in the domain of images and video.

00:00:36.400 | The goal is to leave a robot

00:00:38.320 | watching YouTube videos all night

00:00:40.320 | and in the morning, come back to a much smarter robot.

00:00:43.600 | I read the blog post, "Self-supervised Learning,

00:00:45.960 | "The Dark Matter of Intelligence" by Ishan and Yan LeCun

00:00:50.360 | and then listened to Ishan's appearance

00:00:52.940 | on the excellent "Machine Learning Street Talk" podcast.

00:00:57.200 | And I knew I had to talk to him.

00:00:59.160 | By the way, if you're interested in machine learning and AI,

00:01:02.840 | I cannot recommend the ML Street Talk podcast highly enough.

00:01:07.840 | Those guys are great.

00:01:09.640 | Quick mention of our sponsors,

00:01:11.280 | Onnit, The Information, Grammarly and Athletic Greens.

00:01:15.400 | Check them out in the description to support this podcast.

00:01:18.640 | As a side note, let me say that for those of you

00:01:21.680 | who may have been listening for quite a while,

00:01:23.240 | this podcast used to be called

00:01:24.960 | Artificial Intelligence Podcast

00:01:27.120 | because my life passion has always been,

00:01:29.720 | will always be artificial intelligence,

00:01:32.680 | both narrowly and broadly defined.

00:01:35.460 | My goal with this podcast is still

00:01:37.720 | to have many conversations with world-class researchers

00:01:40.580 | in AI, math, physics, biology and all the other sciences.

00:01:45.120 | But I also want to talk to historians, musicians,

00:01:48.880 | athletes and of course, occasionally comedians.

00:01:51.520 | In fact, I'm trying out doing this podcast

00:01:53.600 | three times a week now to give me more freedom

00:01:56.200 | with guest selection and maybe get a chance

00:01:59.380 | to have a bit more fun.

00:02:00.880 | Speaking of fun, in this conversation,

00:02:03.160 | I challenged the listener to count the number of times

00:02:05.440 | the word banana is mentioned.

00:02:08.000 | Ishan and I use the word banana as the canonical example

00:02:12.600 | at the core of the hard problem of computer vision

00:02:15.200 | and maybe the hard problem of consciousness.

00:02:19.020 | This is the Lex Friedman Podcast

00:02:22.620 | and here is my conversation with Ishan Misra.

00:02:26.280 | What is self-supervised learning?

00:02:29.860 | And maybe even give the bigger basics

00:02:32.740 | of what is supervised and semi-supervised learning

00:02:35.340 | and maybe why is self-supervised learning

00:02:37.620 | a better term than unsupervised learning?

00:02:40.100 | - Let's start with supervised learning.

00:02:41.580 | So typically for machine learning systems,

00:02:43.900 | the way they're trained is you get a bunch of humans,

00:02:46.900 | the humans point out particular concepts.

00:02:48.580 | So if it's in the case of images,

00:02:50.180 | you want the humans to come and tell you

00:02:51.980 | what is present in the image, draw boxes around them,

00:02:55.820 | draw masks of things, pixels,

00:02:57.700 | which are of particular categories or not.

00:03:00.540 | For NLP, again, there are lots of these particular tasks,

00:03:03.220 | say about sentiment analysis, about entailment and so on.

00:03:06.620 | So typically for supervised learning,

00:03:08.100 | we get a big corpus of such annotated or labeled data

00:03:11.280 | and then we feed that to a system

00:03:12.780 | and the system is really trying to mimic.

00:03:14.820 | So it's taking this input of the data

00:03:16.580 | and then trying to mimic the output.

00:03:18.380 | So it looks at an image and the human has tagged

00:03:20.660 | that this image contains a banana

00:03:22.420 | and now the system is basically trying to mimic that.

00:03:24.660 | So that's its learning signal.

00:03:26.700 | And so for supervised learning,

00:03:28.020 | we try to gather lots of such data

00:03:30.060 | and we train these machine learning models

00:03:31.820 | to imitate the input output.

00:03:33.460 | And the hope is basically by doing so,

00:03:35.620 | now on unseen or like new kinds of data,

00:03:38.060 | this model can automatically learn

00:03:40.020 | to predict these concepts.

00:03:41.320 | So this is a standard sort of supervised setting.

00:03:43.380 | For semi-supervised setting,

00:03:45.780 | the idea typically is that you have,

00:03:47.580 | of course, all of the supervised data,

00:03:49.260 | but you have lots of other data which is unsupervised

00:03:51.780 | or which is like not labeled.

00:03:53.100 | Now the problem basically with supervised learning

00:03:55.260 | and why you actually have all of these alternate

00:03:57.460 | sort of learning paradigms is,

00:03:59.400 | supervised learning just does not scale.

00:04:01.800 | So if you look at for computer vision,

00:04:03.900 | the sort of largest,

00:04:04.980 | one of the most popular data sets is ImageNet.

00:04:07.500 | So the entire ImageNet data set

00:04:09.300 | has about 22,000 concepts and about 14 million images.

00:04:13.820 | So these concepts are basically just nouns

00:04:16.140 | and they're annotated on images.

00:04:18.340 | And this entire data set was a mammoth

00:04:20.020 | data collection effort.

00:04:20.860 | It actually gave rise to a lot of powerful

00:04:23.020 | learning algorithms.

00:04:23.860 | It's credited with like sort of the rise

00:04:25.620 | of deep learning as well.

00:04:27.220 | But this data set took about 22 human years

00:04:30.100 | to collect, to annotate.

00:04:31.920 | And it's not even that many concepts, right?

00:04:33.460 | It's not even that many images.

00:04:34.540 | 14 million is nothing really.

00:04:36.760 | Like you have about, I think, 400 million images or so,

00:04:39.340 | or even more than that uploaded

00:04:40.620 | to most of the popular sort of social media websites today.

00:04:44.180 | So now supervised learning just doesn't scale.

00:04:46.420 | If I want to now annotate more concepts,

00:04:48.660 | if I want to have various types of fine-grained concepts,

00:04:51.300 | then it won't really scale.

00:04:53.220 | So now you come up to these sort of

00:04:54.500 | different learning paradigms,

00:04:55.700 | for example, semi-supervised learning,

00:04:57.540 | where the idea is, of course,

00:04:58.580 | you have this annotated corpus of supervised data,

00:05:01.360 | and you have lots of these unlabeled images.

00:05:03.700 | And the idea is that the algorithm should basically

00:05:05.580 | try to measure some kind of consistency,

00:05:07.960 | or really try to measure some kind of signal

00:05:10.260 | on this sort of unlabeled data

00:05:12.140 | to make itself more confident

00:05:14.180 | about what it's really trying to predict.

00:05:16.160 | So by access to this lots of unlabeled data,

00:05:19.660 | the idea is that the algorithm actually learns

00:05:22.220 | to be more confident,

00:05:23.500 | and actually gets better at predicting these concepts.

00:05:26.880 | And now we come to the other extreme,

00:05:28.460 | which is like self-supervised learning.

00:05:30.500 | The idea basically is that the machine,

00:05:32.260 | or the algorithm should really discover concepts,

00:05:34.700 | or discover things about the world,

00:05:36.380 | or learn representations about the world which are useful,

00:05:39.180 | without access to explicit human supervision.

00:05:41.740 | - So the word supervision is still

00:05:44.300 | in the term self-supervised.

00:05:46.240 | So what is the supervision signal?

00:05:48.520 | And maybe that perhaps is when Yanma Kun and you argue

00:05:51.960 | that unsupervised is the incorrect terminology here.

00:05:55.000 | So what is the supervision signal

00:05:57.400 | when the humans aren't part of the picture?

00:05:59.680 | Or not a big part of the picture?

00:06:02.360 | - Right.

00:06:03.200 | So self-supervised,

00:06:04.480 | the reason it has the term supervised in itself

00:06:06.720 | is because you're using the data itself as supervision.

00:06:10.320 | So because the data serves as its own source of supervision,

00:06:13.200 | it's self-supervised in that way.

00:06:15.140 | Now the reason a lot of people,

00:06:16.380 | I mean, we did it in that blog post with Yan,

00:06:18.380 | but a lot of other people have also argued

00:06:20.100 | for using this term self-supervised.

00:06:22.100 | So starting from like '94 from Virginia Desas' group,

00:06:25.620 | I think UCSD, and now she's at UCSD.

00:06:28.780 | Jeetendra Malik has said this a bunch of times as well.

00:06:31.640 | So you have supervised.

00:06:33.060 | And then unsupervised basically means everything

00:06:35.180 | which is not supervised.

00:06:36.420 | But that includes stuff like semi-supervised,

00:06:38.620 | that includes other like transductive learning,

00:06:41.300 | lots of other sort of settings.

00:06:43.020 | So that's the reason like now people are preferring

00:06:46.040 | this term self-supervised

00:06:47.100 | because it explicitly says what's happening.

00:06:49.260 | The data itself is the source of supervision

00:06:51.620 | and any sort of learning algorithm

00:06:53.100 | which tries to extract just sort of data supervision signals

00:06:56.900 | from the data itself is a self-supervised learning algorithm.

00:06:59.460 | - But there is within the data,

00:07:02.140 | a set of tricks which unlock the supervision.

00:07:04.940 | - Right.

00:07:05.780 | - So can you give maybe some examples?

00:07:07.180 | And there's innovation ingenuity required

00:07:11.360 | to unlock that supervision.

00:07:12.840 | The data doesn't just speak to you some ground truth.

00:07:15.580 | You have to do some kind of trick.

00:07:17.760 | So I don't know what your favorite domain is.

00:07:19.580 | So you specifically specialize in visual learning,

00:07:23.020 | but is there favorite examples

00:07:24.500 | maybe in language or other domains?

00:07:26.500 | - Perhaps the most successful applications

00:07:28.300 | have been in NLP, language processing.

00:07:31.060 | So the idea basically being that you can train models

00:07:34.000 | that can, you have a sentence

00:07:35.780 | and you mask out certain words.

00:07:37.380 | And now these models learn to predict the masked out words.

00:07:40.500 | So if you have like the cat jumped over the dog,

00:07:44.000 | so you can basically mask out cat.

00:07:45.940 | And now you are essentially asking the model to predict

00:07:47.880 | what was missing, what did I mask out?

00:07:50.280 | So the model is going to predict basically

00:07:52.460 | a distribution over all the possible words that it knows.

00:07:55.320 | And probably it has like, if it's a well-trained model,

00:07:58.360 | it has a sort of higher probability density

00:08:00.580 | for this word cat.

00:08:02.560 | For vision, I would say the sort of more,

00:08:05.500 | I mean, the easier example,

00:08:07.440 | which is not as widely used these days

00:08:09.400 | is basically say, for example, video prediction.

00:08:12.000 | So video is again, a sequence of things.

00:08:14.060 | So you can ask the model.

00:08:15.000 | So if you have a video of say 10 seconds,

00:08:17.400 | you can feed in the first nine seconds to a model

00:08:19.800 | and then ask it, hey, what happens basically

00:08:21.920 | in the 10th second?

00:08:22.760 | Can you predict what's going to happen?

00:08:24.480 | And the idea basically is because the model

00:08:26.720 | is predicting something about the data itself.

00:08:29.400 | Of course, you didn't need any human

00:08:31.360 | to tell you what was happening

00:08:32.280 | because the 10 second video was naturally captured.

00:08:34.580 | Because the model is predicting what's happening there,

00:08:36.640 | it's going to automatically learn something

00:08:39.000 | about the structure of the world,

00:08:40.000 | how objects move, object permanence,

00:08:42.560 | and these kinds of things.

00:08:43.980 | So like if I have something at the edge of the table,

00:08:45.920 | it'll fall down.

00:08:47.500 | Things like these,

00:08:48.340 | which you really don't have to sit and annotate.

00:08:50.240 | In a supervised learning setting,

00:08:51.280 | I would have to sit and annotate.

00:08:52.260 | This is a cup.

00:08:53.100 | Now I move this cup.

00:08:54.100 | This is still a cup.

00:08:55.160 | And now I move this cup, it's still a cup.

00:08:56.600 | And then it falls down and this is a fallen down cup.

00:08:58.800 | So I won't have to annotate all of these things

00:09:00.400 | in a self-supervised setting.

00:09:02.000 | - Isn't that kind of a brilliant little trick

00:09:05.240 | of taking a series of data that is consistent

00:09:08.280 | and removing one element in that series,

00:09:11.860 | and then teaching the algorithm to predict that element.

00:09:16.860 | Isn't that, first of all, that's quite brilliant.

00:09:19.620 | It seems to be applicable in anything

00:09:23.040 | that has the constraint of being a sequence

00:09:27.880 | that is consistent with the physical reality.

00:09:30.220 | The question is, are there other tricks like this

00:09:34.400 | that can generate the self-supervision signal?

00:09:37.840 | - So sequence is possibly the most widely used one in NLP.

00:09:41.200 | For vision, the one that is actually used for images,

00:09:44.080 | which is very popular these days,

00:09:45.840 | is basically taking an image

00:09:47.600 | and now taking different crops of that image.

00:09:50.080 | So you can basically decide to crop,

00:09:51.400 | say the top left corner,

00:09:53.100 | and you crop, say the bottom right corner,

00:09:55.280 | and asking the network to basically present it

00:09:58.080 | with a choice saying that, okay, now you have this image,

00:10:01.360 | you have this image, are these the same or not?

00:10:04.480 | And so the idea basically is that because different,

00:10:06.680 | like in an image, different parts of the image

00:10:08.480 | are going to be related.

00:10:09.800 | So for example, if you have a chair and a table,

00:10:12.400 | basically these things are going to be close by,

00:10:15.080 | versus if you take, again,

00:10:16.880 | if you have like a zoomed in picture of a chair,

00:10:19.520 | if you're taking different crops,

00:10:20.520 | it's going to be different parts of the chair.

00:10:22.360 | So the idea basically is that different crops

00:10:25.040 | of the image are related,

00:10:26.160 | and so the features or the representations

00:10:27.920 | that you get from these different crops

00:10:29.040 | should also be related.

00:10:30.320 | So this is possibly the most widely used trick these days

00:10:33.120 | for self-supervised learning in computer vision.

00:10:35.760 | - So again, using the consistency

00:10:38.400 | that's inherent to physical reality,

00:10:40.200 | in visual domain, that's, you know,

00:10:43.800 | parts of an image are consistent,

00:10:45.640 | and then in the language domain,

00:10:48.400 | or anything that has sequences,

00:10:50.280 | like language or something that's like a time series,

00:10:53.000 | then you can chop off parts in time.

00:10:55.440 | It's similar to the story of RNNs and CNNs,

00:10:59.380 | of RNNs and Covenants.

00:11:02.320 | - Yuan Yan Lacoon wrote the blog post in March, 2021,

00:11:06.640 | titled "Self-supervised learning,

00:11:08.840 | the dark matter of intelligence."

00:11:11.080 | Can you summarize this blog post

00:11:12.640 | and maybe explain the main idea or set of ideas?

00:11:15.640 | - The blog post was mainly about sort of just telling,

00:11:18.680 | I mean, this is really an accepted fact,

00:11:21.680 | I would say for a lot of people now,

00:11:22.960 | that self-supervised learning is something

00:11:24.320 | that is going to play an important role

00:11:27.200 | for machine learning algorithms that come in the future,

00:11:29.240 | and even now.

00:11:30.400 | - Let me just comment that we don't yet

00:11:33.840 | have a good understanding of what dark matter is.

00:11:36.480 | - That's true.

00:11:37.320 | (laughing)

00:11:39.000 | The idea basically being--

00:11:39.840 | - Maybe the metaphor doesn't exactly transfer,

00:11:41.840 | but maybe it's actually perfectly transfers,

00:11:44.840 | that we don't know, we have an inkling

00:11:47.880 | that it'll be a big part

00:11:49.280 | of whatever solving intelligence looks like.

00:11:51.240 | - Right, so I think self-supervised learning,

00:11:52.960 | the way it's done right now,

00:11:54.080 | is I would say like the first step

00:11:56.240 | towards what it probably should end up learning,

00:11:58.560 | or what it should enable us to do.

00:12:00.200 | - Yeah.

00:12:01.040 | - So the idea for that particular piece was,

00:12:03.720 | self-supervised learning is going to be a very powerful way

00:12:06.160 | to learn common sense about the world,

00:12:08.400 | or like stuff that is really hard to label.

00:12:10.800 | For example, like is this piece

00:12:13.720 | over here heavier than the cup?

00:12:15.640 | Now, for all these kinds of things,

00:12:17.480 | you'll have to sit and label these things.

00:12:18.720 | So supervised learning is clearly not going to scale.

00:12:21.520 | So what is the thing that's actually going to scale?

00:12:23.480 | It's probably going to be an agent

00:12:25.040 | that can either actually interact with it,

00:12:26.800 | to lift it up, or observe me doing it.

00:12:29.960 | So if I'm basically lifting these things up,

00:12:31.520 | it can probably reason about,

00:12:32.560 | hey, this is taking him more time to lift up,

00:12:34.720 | or the velocity is different.

00:12:36.160 | Whereas the velocity for this is different,

00:12:37.800 | probably this one is heavier.

00:12:39.560 | So essentially by observations of the data,

00:12:41.960 | you should be able to infer a lot of things about the world

00:12:44.760 | without someone explicitly telling you,

00:12:46.800 | this is heavy, this is not,

00:12:48.680 | this is something that can pour,

00:12:49.960 | this is something that cannot pour,

00:12:51.160 | this is somewhere that you can sit,

00:12:52.440 | this is not somewhere that you can sit.

00:12:53.880 | - But you've just mentioned ability

00:12:55.480 | to interact with the world.

00:12:57.440 | There's so many questions that are yet to be,

00:13:00.960 | that are still open, which is,

00:13:02.800 | how do you select a set of data

00:13:04.440 | over which the self-supervised learning process works?

00:13:08.600 | How much interactivity, like in the active learning,

00:13:11.480 | or the machine teaching context is there?

00:13:14.360 | What are the reward signals?

00:13:16.440 | Like how much actual interaction there is

00:13:18.520 | with the physical world?

00:13:20.040 | That kind of thing.

00:13:21.400 | So that could be a huge question.

00:13:24.760 | And then on top of that,

00:13:26.680 | which I have a million questions about,

00:13:28.920 | which we don't know the answers to,

00:13:30.360 | but it's worth talking about,

00:13:32.000 | is how much reasoning is involved?

00:13:35.080 | How much accumulation of knowledge

00:13:38.480 | versus something that's more akin to learning,

00:13:40.760 | or whether that's the same thing.

00:13:43.200 | But so we're like, it is truly dark matter.

00:13:46.520 | - We don't know how exactly to do it.

00:13:49.160 | But we are, I mean, a lot of us are actually convinced

00:13:52.040 | that it's going to be a sort of major thing

00:13:54.200 | in machine learning.

00:13:55.040 | Let me reframe it then,

00:13:56.600 | that human supervision cannot be at large scale,

00:14:01.160 | the source of the solution to intelligence.

00:14:04.120 | So the machines have to discover the supervision

00:14:08.000 | in the natural signal of the world.

00:14:10.240 | - I mean, the other thing is also that humans

00:14:12.440 | are not particularly good labelers.

00:14:14.200 | They're not very consistent.

00:14:16.000 | For example, like what's the difference

00:14:17.880 | between a dining table and a table?

00:14:19.840 | Is it just the fact that one,

00:14:21.560 | like if you just look at a particular table,

00:14:23.080 | what makes us say one is dining table and the other is not?

00:14:26.480 | Humans are not particularly consistent.

00:14:28.160 | They're not like very good sources of supervision

00:14:30.080 | for a lot of these kind of edge cases.

00:14:32.320 | So it may be also the fact that if we want,

00:14:35.880 | like want an algorithm or want a machine

00:14:37.960 | to solve a particular task for us,

00:14:39.640 | we can maybe just specify the end goal.

00:14:42.120 | And like the stuff in between,

00:14:44.240 | we really probably should not be specifying

00:14:46.080 | because we're not maybe going to confuse it a lot actually.

00:14:49.320 | - Well, humans can't even answer the meaning of life.

00:14:51.440 | So I'm not sure if we're good supervisors

00:14:53.920 | of the end goal either.

00:14:55.220 | So let me ask you about categories.

00:14:56.960 | Humans are not very good at telling the difference

00:14:59.040 | between what is and isn't a table, like you mentioned.

00:15:01.940 | Do you think it's possible,

00:15:04.560 | let me ask you like a, pretend you're Plato.

00:15:08.140 | Is it possible to create a pretty good taxonomy

00:15:14.800 | of objects in the world?

00:15:16.400 | It seems like a lot of approaches in machine learning

00:15:19.000 | kind of assume a hopeful vision that it's possible

00:15:22.160 | to construct a perfect taxonomy,

00:15:24.080 | or it exists perhaps out of our reach,

00:15:26.520 | but we can always get closer and closer to it.

00:15:28.820 | Or is that a hopeless pursuit?

00:15:30.400 | - I think it's hopeless in some way.

00:15:33.040 | So the thing is for any particular categorization

00:15:36.080 | that you create,

00:15:36.920 | if you have a discrete sort of categorization,

00:15:38.760 | I can always take the nearest two concepts

00:15:40.520 | or I can take a third concept and I can blend it in

00:15:42.600 | and I can create a new category.

00:15:44.500 | So if you were to enumerate N categories,

00:15:46.560 | I will always find an N plus one category for you.

00:15:48.900 | That's not going to be in the N categories.

00:15:50.720 | And I can actually create not just N plus one,

00:15:52.440 | I can very easily create far more than N categories.

00:15:55.160 | The thing is a lot of things we talk about

00:15:57.320 | are actually compositional.

00:15:59.000 | So it's really hard for us to come and sit in

00:16:01.800 | and enumerate all of these out.

00:16:03.240 | And they compose in various weird ways, right?

00:16:05.880 | Like you have like a croissant and a donut come together

00:16:08.360 | to form a cronut.

00:16:09.720 | So if you were to like enumerate all the foods up until,

00:16:12.440 | I don't know, whenever the cronut was about 10 years ago

00:16:15.200 | or 15 years ago,

00:16:16.460 | then this entire thing called cronut would not exist.

00:16:19.000 | - Yeah, I remember there was the most awesome video

00:16:21.760 | of a cat wearing a monkey costume.

00:16:23.520 | People should look it up, it's great.

00:16:28.240 | So is that a monkey or is that a cat?

00:16:31.000 | So it's a very difficult philosophical question.

00:16:33.840 | So there is a concept of similarity between objects.

00:16:37.280 | So you think that can take us very far?

00:16:39.860 | Just kind of getting a good function,

00:16:43.200 | a good way to tell which parts of things are similar

00:16:47.920 | and which parts of things are very different?

00:16:50.700 | - I think so, yeah.

00:16:51.780 | So you don't necessarily need to name everything

00:16:54.320 | or assign a name to everything to be able to use it, right?

00:16:57.820 | So there are like lots of--

00:16:59.540 | - Shakespeare said that, what's in a name?

00:17:01.740 | - What's in a name, yeah.

00:17:02.580 | - Yeah, okay.

00:17:03.540 | - I mean, lots of like, for example, animals, right?

00:17:05.820 | They don't have necessarily a well-formed

00:17:08.140 | like syntactic language,

00:17:09.540 | but they're able to go about their day perfectly.

00:17:11.820 | The same thing happens for us.

00:17:12.900 | So, I mean, we probably look at things and we figure out,

00:17:17.100 | oh, this is similar to something else

00:17:18.500 | that I've seen before.

00:17:19.340 | And then I can probably learn how to use it.

00:17:22.020 | So I haven't seen all the possible doorknobs in the world.

00:17:26.300 | But if you show me, like I was able to get

00:17:28.700 | into this particular place fairly easily.

00:17:30.380 | I've never seen that particular doorknob.

00:17:32.140 | So I, of course, related to all the doorknobs that I've seen

00:17:34.340 | and I know exactly how it's going to open.

00:17:36.540 | I have a pretty good idea of how it's going to open.

00:17:39.420 | And I think this kind of translation between experiences

00:17:41.820 | only happens because of similarity.

00:17:43.700 | Because I'm able to relate it to a doorknob.

00:17:45.380 | If I related it to a hairdryer,

00:17:46.580 | I would probably be stuck still outside,

00:17:48.380 | not able to get in.

00:17:49.380 | - Again, a bit of a philosophical question,

00:17:52.220 | but can similarity take us all the way

00:17:55.620 | to understanding a thing?

00:17:57.680 | Can having a good function that compares objects

00:18:01.940 | get us to understand something profound

00:18:04.900 | about singular objects?

00:18:07.220 | - I think I'll ask you a question back.

00:18:08.620 | What does it mean to understand objects?

00:18:11.560 | - Well, let me tell you what that's similar to.

00:18:13.500 | No.

00:18:14.340 | (laughing)

00:18:15.660 | So there's an idea of sort of reasoning

00:18:17.700 | by analogy kind of thing.

00:18:19.740 | I think understanding is the process of placing that thing

00:18:24.740 | in some kind of network of knowledge that you have.

00:18:28.420 | That it perhaps is fundamentally related to other concepts.

00:18:33.180 | So it's not like understanding is fundamentally related

00:18:36.500 | by like composition of other concepts

00:18:39.260 | and maybe in relation to other concepts.

00:18:41.480 | And maybe like deeper and deeper understanding

00:18:45.800 | is maybe just adding more edges to that graph somehow.

00:18:50.800 | So maybe it is a composition of similarities.

00:18:55.080 | I mean, ultimately, I suppose it is a kind of embedding

00:18:59.560 | in that wisdom space.

00:19:02.480 | - Yeah.

00:19:03.320 | (laughing)

00:19:04.140 | Okay, wisdom space is good.

00:19:05.240 | I think, I do think, right?

00:19:08.040 | So similarity does get you very, very far.

00:19:10.760 | Is it the answer to everything?

00:19:12.360 | I mean, I don't even know what everything is,

00:19:14.140 | but it's going to take us really far.

00:19:16.720 | And I think the thing is,

00:19:18.920 | things are similar in very different contexts, right?

00:19:21.680 | So an elephant is similar to, I don't know,

00:19:24.360 | another sort of wild animal.

00:19:25.640 | Let's just pick, I don't know, lion.

00:19:27.440 | And in a different way,

00:19:28.520 | because they're both four-legged creatures.

00:19:30.560 | They're also land animals.

00:19:32.080 | But of course, they're very different

00:19:33.160 | in a lot of different ways.

00:19:34.000 | So elephants are like herbivores, lions are not.

00:19:37.280 | So similarity does,

00:19:38.560 | similarity and particularly dissimilarity

00:19:40.680 | also actually helps us understand a lot about things.

00:19:43.760 | And so that's actually why I think

00:19:45.240 | discrete categorization is very hard.

00:19:47.640 | Just like forming this particular category of elephant

00:19:50.080 | and a particular category of lion,

00:19:51.880 | maybe it's good for just like taxonomy,

00:19:54.400 | biological taxonomies.

00:19:55.800 | But when it comes to other things,

00:19:57.240 | which are not as maybe, for example, like grilled cheese.

00:20:01.760 | I have a grilled cheese, I dip it in tomato

00:20:03.120 | and I keep it outside.

00:20:04.000 | Now, is that still a grilled cheese

00:20:05.080 | or is that something else?

00:20:06.760 | - Right, so categorization is still very useful

00:20:09.800 | for solving problems.

00:20:11.280 | But is your intuition then sort of the self-supervised

00:20:15.960 | should be the, to borrow Jan LeCun's terminology,

00:20:20.960 | should be the cake and then categorization,

00:20:23.680 | the classification, maybe the supervised like layer

00:20:27.400 | should be just like the thing on top,

00:20:29.120 | the cherry or the icing or whatever.

00:20:31.040 | So if you make it the cake,

00:20:32.960 | it gets in the way of learning.

00:20:35.560 | - If you make it the cake,

00:20:36.400 | then you won't be able to sit and annotate everything.

00:20:39.400 | That's as simple as it is.

00:20:40.680 | Like that's my very practical view on it.

00:20:43.120 | It's just, I mean, in my PhD,

00:20:44.960 | I sat down and annotated like a bunch of cards

00:20:47.040 | for one of my projects.

00:20:48.480 | And very quickly, I was just like,

00:20:49.920 | it was in a video and I was basically drawing boxes

00:20:52.200 | around all these cards.

00:20:53.560 | And I think I spent about a week doing all of that

00:20:55.640 | and I barely got anything done.

00:20:57.680 | And basically, this was, I think my first year of my PhD

00:21:00.320 | or like second year of my master's.

00:21:02.760 | And then by the end of it, I'm like, okay,

00:21:04.040 | this is just hopeless.

00:21:05.040 | I can keep doing it.

00:21:06.000 | And when I had done that, someone came up to me

00:21:08.480 | and they basically told me,

00:21:09.600 | oh, this is a pickup truck, this is not a car.

00:21:12.800 | And that's like, aha, this actually makes sense

00:21:14.840 | because a pickup truck is not really,

00:21:16.080 | like what was I annotating?

00:21:17.040 | Was I annotating anything that is mobile?

00:21:19.600 | Or was I annotating particular sedans?

00:21:21.440 | Or was I annotating SUVs?

00:21:22.720 | What was I doing?

00:21:23.640 | - By the way, the annotation was bounding boxes?

00:21:25.760 | - Bounding boxes, yeah.

00:21:27.000 | - There's so many deep, profound questions here

00:21:30.080 | that you're almost cheating your way out of

00:21:32.200 | by doing self-supervised learning, by the way,

00:21:34.400 | which is like what makes for an object.

00:21:37.480 | As opposed to solve intelligence,

00:21:39.040 | maybe you don't ever need to answer that question.

00:21:41.560 | I mean, this is the question

00:21:43.680 | that anyone that's ever done annotation

00:21:45.280 | because it's so painful gets to ask,

00:21:48.000 | like, why am I doing,

00:21:49.960 | drawing very careful line around this object?

00:21:55.440 | Like what is the value?

00:21:57.520 | I remember when I first saw semantic segmentation

00:22:00.200 | where you have like instant segmentation

00:22:03.640 | where you have a very exact line around the object

00:22:08.280 | in the 2D plane of a fundamentally 3D object

00:22:11.720 | projected on a 2D plane.

00:22:13.400 | So you're drawing a line around a car

00:22:15.800 | that might be occluded,

00:22:16.920 | there might be another thing in front of it,

00:22:18.840 | but you're still drawing the line

00:22:20.360 | of the part of the car that you see.

00:22:22.680 | How is that the car?

00:22:25.880 | Why is that the car?

00:22:27.880 | Like I had like an existential crisis every time.

00:22:31.000 | Like how is that going to help us understand

00:22:33.520 | or solve computer vision?

00:22:35.320 | I'm not sure I have a good answer to what's better.

00:22:38.240 | And I'm not sure I share the confidence that you have

00:22:41.520 | that self-supervised learning can take us far.

00:22:46.520 | I think I'm more and more convinced

00:22:48.560 | that it's a very important component,

00:22:50.840 | but I still feel like we need to understand what makes,

00:22:54.160 | like this dream of maybe what it's called symbolic AI,

00:23:01.400 | of arriving, like once you have this common sense base,

00:23:05.560 | be able to play with these concepts

00:23:09.000 | and build graphs or hierarchies of concepts on top

00:23:13.440 | in order to then like form a deep sense

00:23:18.440 | of this three-dimensional world or four-dimensional world

00:23:22.040 | and be able to reason and then project that onto 2D plane

00:23:25.480 | in order to interpret a 2D image.

00:23:28.520 | Can I ask you just an out there question?

00:23:30.960 | I remember, I think Andrej Karpathy had a blog post

00:23:35.000 | about computer vision, like being really hard.

00:23:39.000 | I forgot what the title was, but it was many, many years ago.

00:23:42.080 | And he had, I think President Obama stepping on a scale

00:23:44.760 | and there was humor

00:23:45.600 | and there was a bunch of people laughing and whatever.

00:23:48.440 | And there's a lot of interesting things about that image.

00:23:52.000 | And I think Andrej highlighted a bunch of things

00:23:55.140 | about the image that us humans

00:23:56.600 | are able to immediately understand.

00:23:59.000 | Like the idea, I think of gravity

00:24:00.960 | and that you have the concept of a weight.

00:24:04.040 | You immediately project because of our knowledge of pose

00:24:08.160 | and how human bodies are constructed,

00:24:10.360 | you understand how the forces are being applied

00:24:13.040 | with the human body.

00:24:14.580 | The really interesting other thing

00:24:16.080 | that you're able to understand

00:24:17.440 | is multiple people looking at each other in the image.

00:24:20.520 | You're able to have a mental model

00:24:22.360 | of what the people are thinking about.

00:24:23.760 | You're able to infer like,

00:24:25.320 | oh, this person is probably thinks,

00:24:27.520 | like is laughing at how humorous the situation is.

00:24:31.240 | And this person is confused about what the situation is

00:24:34.160 | because they're looking this way.

00:24:35.600 | We're able to infer all of that.

00:24:37.540 | So that's human vision.

00:24:40.460 | How difficult is computer vision?

00:24:45.040 | Like in order to achieve that level of understanding

00:24:48.440 | and maybe how big of a part

00:24:51.440 | does self-supervised learning play in that, do you think?

00:24:54.380 | And do you still, you know,

00:24:56.080 | back that was like over a decade ago,

00:24:58.440 | I think Andre and I think a lot of people agreed

00:25:00.920 | is computer vision is really hard.

00:25:03.320 | Do you still think computer vision is really hard?

00:25:06.000 | - I think it is, yes.

00:25:07.520 | And getting to that kind of understanding,

00:25:10.600 | I mean, it's really out there.

00:25:12.500 | So if you ask me to solve just that particular problem,

00:25:15.380 | I can do it the supervised learning route.

00:25:17.560 | I can always construct a data set and basically predict,

00:25:19.720 | oh, is there humor in this or not?

00:25:21.640 | And of course I can do it.

00:25:22.480 | - Actually, that's a good question.

00:25:23.600 | Do you think you can, okay, okay.

00:25:25.200 | Do you think you can do

00:25:26.480 | human supervised annotation of humor?

00:25:29.040 | - To some extent, yes.

00:25:29.960 | I'm sure it'll work.

00:25:30.880 | I mean, it won't be as bad as like randomly guessing.

00:25:34.380 | I'm sure it can still predict

00:25:35.560 | whether it's humorous or not in some way.

00:25:37.840 | - Yeah, maybe like Reddit upvotes is the signal.

00:25:40.400 | I don't know.

00:25:41.240 | - I mean, it won't do a great job, but it'll do something.

00:25:43.800 | It may actually be like, it may find certain things

00:25:46.080 | which are not humorous, humorous as well,

00:25:47.560 | which is going to be bad for us.

00:25:49.180 | But I mean, it won't be random.

00:25:52.120 | - Yeah, kind of like my sense of humor.

00:25:54.520 | - Okay, so fine.

00:25:55.920 | So you can, that particular problem, yes.

00:25:57.520 | But the general problem you're saying is hard.

00:25:59.600 | - The general problem is hard.

00:26:00.440 | And I mean, self-supervised learning

00:26:02.320 | is not the answer to everything.

00:26:03.920 | Of course it's not.

00:26:04.760 | I think if you have machines

00:26:06.760 | that are going to communicate with humans at the end of it,

00:26:08.760 | you want to understand what the algorithm is doing, right?

00:26:10.880 | You want it to be able to like produce an output

00:26:13.720 | that you can decipher, that you can understand,

00:26:15.560 | or it's actually useful for something else,

00:26:17.440 | which again is a human.

00:26:19.360 | So at some point in this sort of entire loop,

00:26:22.280 | a human steps in.

00:26:23.720 | And now this human needs to understand what's going on.

00:26:26.360 | So, and at that point,

00:26:27.640 | this entire notion of language or semantics really comes in.

00:26:30.440 | If the machine just spits out something,

00:26:32.600 | and if we can't understand it,

00:26:34.000 | then it's not really that useful for us.

00:26:36.280 | So self-supervised learning is probably going to be useful

00:26:38.440 | for a lot of the things before that part,

00:26:40.800 | before the machine really needs to communicate

00:26:42.880 | a particular kind of output with a human.

00:26:46.060 | Because I mean, otherwise,

00:26:47.800 | how is it going to do that without language?

00:26:49.920 | - Or some kind of communication.

00:26:51.880 | But you're saying that it's possible

00:26:53.280 | to build a big base of understanding or whatever,

00:26:55.880 | of what's a better- - Concepts.

00:26:58.440 | - Concepts. - Concepts, yeah.

00:26:59.780 | - Like common sense concepts.

00:27:02.300 | Supervised learning in the context of computer vision

00:27:06.120 | is something you've focused on,

00:27:07.520 | but that's a really hard domain.

00:27:08.960 | And it's kind of the cutting edge

00:27:10.480 | of what we're as a community working on today.

00:27:13.040 | Can we take a little bit of a step back

00:27:14.760 | and look at language?

00:27:16.320 | Can you summarize the history of success

00:27:19.000 | of self-supervised learning in natural language processing,

00:27:22.480 | language modeling?

00:27:23.880 | What are transformers?

00:27:25.600 | What is the masking, the sentence completion

00:27:28.760 | that you mentioned before?

00:27:30.080 | How does it lead us to understand anything?

00:27:33.560 | Semantic meaning of words,

00:27:34.800 | syntactic role of words and sentences.

00:27:37.660 | - So I'm of course not the expert in NLP.

00:27:40.120 | I kind of follow it a little bit from the sides.

00:27:43.480 | So the main sort of reason

00:27:45.800 | why all of this masking stuff works is,

00:27:47.880 | I think it's called the distributional hypothesis in NLP.

00:27:50.880 | The idea basically being that words

00:27:52.640 | that occur in the same context should have similar meaning.

00:27:55.960 | So if you have the blank jumped over the blank,

00:27:59.040 | it basically, whatever is like in the first blank

00:28:01.960 | is basically an object that can actually jump,

00:28:04.120 | is going to be something that can jump.

00:28:05.820 | So a cat or a dog, or I don't know, sheep, something,

00:28:08.360 | all of these things can basically be

00:28:09.720 | in that particular context.

00:28:11.640 | And now, so essentially the idea is that

00:28:13.420 | if you have words that are in the same context

00:28:16.040 | and you predict them,

00:28:17.320 | you're going to learn lots of useful things

00:28:20.000 | about how words are related.

00:28:21.480 | Because you're predicting by looking at their context

00:28:23.560 | what the word is going to be.

00:28:24.900 | So in this particular case, the blank jumped over the fence.

00:28:28.240 | So now if it's a sheep, the sheep jumped over the fence,

00:28:30.940 | the dog jumped over the fence.

00:28:32.400 | So essentially the algorithm or the representation

00:28:35.560 | basically puts together these two concepts together.

00:28:37.600 | So it says, okay, dogs are going to be kind of related

00:28:39.840 | to sheep because both of them occur in the same context.

00:28:42.720 | Of course, now you can decide depending

00:28:44.760 | on your particular application downstream.

00:28:46.760 | You can say that dogs are absolutely not related to sheep

00:28:49.140 | because, well, I don't, I really care about,

00:28:51.280 | you know, dog food, for example.

00:28:53.000 | I'm a dog food person and I really want to give

00:28:55.120 | this dog food to this particular animal.

00:28:57.280 | So depending on what your downstream application is,

00:29:00.080 | of course, this notion of similarity or this notion

00:29:03.000 | or this common sense that you've learned

00:29:04.280 | may not be applicable.

00:29:05.800 | But the point is basically that this,

00:29:08.060 | just predicting what the blanks are

00:29:09.920 | is going to take you really, really far.

00:29:11.720 | - So there's a nice feature of language

00:29:14.000 | that the number of words in a particular language

00:29:18.680 | is very large, but it's finite,

00:29:20.760 | and it's actually not that large

00:29:22.040 | in the grand scheme of things.

00:29:24.120 | I still got to, because we take it for granted,

00:29:26.540 | so first of all, when you say masking,

00:29:28.380 | you're talking about this very process of the blank,

00:29:31.520 | of removing words from a sentence,

00:29:33.440 | and then having the knowledge of what word went there

00:29:36.760 | in the initial data set.

00:29:38.540 | That's the ground truth that you're training on,

00:29:41.080 | and then you're asking the neural network

00:29:43.480 | to predict what goes there.

00:29:45.120 | That's like a little trick.

00:29:49.240 | It's a really powerful trick.

00:29:50.880 | The question is how far that takes us,

00:29:53.320 | and the other question is, is there other tricks?

00:29:56.280 | 'Cause to me, it's very possible

00:29:58.680 | there's other very fascinating tricks.

00:30:00.740 | I'll give you an example.

00:30:01.940 | In autonomous driving, there's a bunch of tricks

00:30:06.960 | that give you the self-supervised signal back.

00:30:10.360 | For example, very similar to sentences, but not really,

00:30:15.360 | which is you have signals from humans driving the car,

00:30:20.260 | because a lot of us drive cars to places,

00:30:23.680 | and so you can ask the neural network to predict

00:30:27.820 | what's going to happen in the next two seconds

00:30:30.260 | for a safe navigation through the environment,

00:30:33.400 | and the signal comes from the fact

00:30:36.200 | that you also have knowledge of what happened

00:30:38.640 | in the next two seconds, because you have video of the data.

00:30:42.120 | The question in autonomous driving, as it is in language,

00:30:46.760 | can we learn how to drive autonomously

00:30:50.160 | based on that kind of self-supervision?

00:30:53.440 | Probably the answer is no.

00:30:55.320 | The question is how good can we get?

00:30:57.720 | And the same with language.

00:30:58.880 | How good can we get, and are there other tricks?

00:31:02.320 | We get sometimes super excited by this trick

00:31:04.640 | that works really well, but I wonder,

00:31:07.320 | it's almost like mining for gold.

00:31:09.120 | I wonder how many signals there are in the data

00:31:12.760 | that could be leveraged that are like there.

00:31:15.300 | I just wanted to kind of linger on that,

00:31:18.600 | because sometimes it's easy to think

00:31:20.840 | that maybe this masking process is self-supervised learning.

00:31:24.840 | No, it's only one method.

00:31:27.200 | So there could be many, many other methods,

00:31:29.280 | many tricky methods, maybe interesting ways

00:31:33.840 | to leverage human computation in very interesting ways

00:31:36.880 | that might actually border on semi-supervised learning,

00:31:39.920 | something like that.

00:31:40.840 | Obviously, the internet is generated by humans

00:31:43.520 | at the end of the day.

00:31:44.720 | So all that to say is, what's your sense,

00:31:48.760 | in this particular context of language,

00:31:50.680 | how far can that masking process take us?

00:31:54.680 | - So it has stood the test of time, right?

00:31:56.200 | I mean, so Word2Vec, the initial sort of NLP technique

00:31:59.800 | that was using this, to now, for example,

00:32:02.120 | like all the BERT and all these big models that we get,

00:32:05.880 | BERT and Roberta, for example,

00:32:07.600 | all of them are still sort of based

00:32:08.800 | on the same principle of masking.

00:32:10.640 | It's taken us really far.

00:32:12.160 | I mean, you can actually do things like,

00:32:14.440 | oh, these two sentences are similar or not,

00:32:16.240 | whether this particular sentence

00:32:17.560 | follows this other sentence in terms of logic,

00:32:19.600 | so entailment, you can do a lot of these things

00:32:21.800 | with just this masking trick.

00:32:23.680 | So I'm not sure if I can predict how far it can take us,

00:32:28.400 | because when it first came out, when Word2Vec was out,

00:32:31.560 | I don't think a lot of us would have imagined

00:32:33.520 | that this would actually help us do some kind

00:32:36.000 | of entailment problems and really that well.

00:32:38.560 | And so just the fact that by just scaling up

00:32:40.960 | the amount of data that we're training on

00:32:42.360 | and using better and more powerful

00:32:44.640 | neural network architectures has taken us from that to this,

00:32:47.600 | is just showing you how maybe poor predictors we are,

00:32:52.280 | like as humans, how poor we are at predicting

00:32:54.880 | how successful a particular technique is going to be.

00:32:57.360 | So I think I can say something now,

00:32:58.680 | but like 10 years from now,

00:33:00.080 | I look completely stupid basically predicting this.

00:33:02.800 | - In the language domain, is there something in your work

00:33:07.160 | that you find useful and insightful

00:33:09.560 | and transferable to computer vision,

00:33:12.560 | but also just, I don't know, beautiful and profound

00:33:15.720 | that I think carries through to the vision domain?

00:33:18.160 | - I mean, the idea of masking has been very powerful.

00:33:21.040 | It has been used in vision as well for predicting,

00:33:23.680 | like you say, the next, if you have in sort of frames

00:33:27.200 | and you predict what's going to happen in the next frame.

00:33:29.400 | So that's been very powerful.

00:33:30.960 | In terms of modeling, like in just terms,

00:33:32.880 | in terms of architecture,

00:33:33.800 | I think you would have asked about transformers while back.

00:33:36.880 | That has really become,

00:33:38.360 | like it has become super exciting for computer vision now.

00:33:40.840 | Like in the past, I would say year and a half,

00:33:42.760 | it's become really powerful.

00:33:44.160 | - What's a transformer?

00:33:45.240 | - Right.

00:33:46.080 | I mean, the core part of a transformer

00:33:47.440 | is something called the self-attention model.

00:33:49.040 | So it came out of Google.

00:33:50.440 | And the idea basically is that if you have N elements,

00:33:53.780 | what you're creating is a way for all of these N elements

00:33:56.520 | to talk to each other.

00:33:57.880 | So the idea basically is that you are paying attention.

00:34:01.820 | Each element is paying attention

00:34:03.200 | to each of the other element.

00:34:04.980 | And basically by doing this,

00:34:06.800 | it's really trying to figure out,

00:34:08.980 | you're basically getting a much better view of the data.

00:34:11.460 | So for example, if you have a sentence of like four words,

00:34:14.500 | the point is if you get a representation

00:34:16.340 | or a feature for this entire sentence,

00:34:18.320 | it's constructed in a way such that each word

00:34:21.280 | has paid attention to everything else.

00:34:23.840 | Now, the reason it's like different from say,

00:34:26.100 | what you would do in a conf net

00:34:28.440 | is basically that in a conf net,

00:34:29.560 | you would only pay attention to a local window.

00:34:31.400 | So each word would only pay attention to its next neighbor

00:34:34.520 | or like one neighbor after that.

00:34:36.160 | And the same thing goes for images.

00:34:37.860 | In images, you would basically pay attention to pixels

00:34:40.120 | in a three cross three or a seven cross seven neighborhood.

00:34:42.800 | And that's it.

00:34:43.680 | Whereas with the transformer, the self-attention mainly,

00:34:46.020 | the sort of idea is that each element

00:34:48.760 | needs to pay attention to each other element.

00:34:50.440 | - And when you say attention,

00:34:51.980 | maybe another way to phrase that

00:34:53.400 | is you're considering a context,

00:34:57.680 | a wide context in terms of the wide context of the sentence

00:35:01.560 | in understanding the meaning of a particular word

00:35:05.120 | and in computer vision,

00:35:06.160 | that's understanding a larger context

00:35:07.840 | to understand the local pattern

00:35:10.080 | of a particular local part of an image.

00:35:13.080 | - Right, so basically if you have say,

00:35:14.940 | again, a banana in the image,

00:35:16.520 | you're looking at the full image first.

00:35:18.600 | So whether it's like, you're looking at all the pixels

00:35:20.960 | that are of a kitchen, of a dining table and so on.

00:35:23.760 | And then you're basically looking at the banana also.

00:35:25.920 | - Yeah, by the way, in terms of,

00:35:27.200 | if we were to train the funny classifier,

00:35:29.220 | there's something funny about the word banana.

00:35:32.000 | Just wanted to anticipate that.

00:35:33.840 | - I am wearing a banana shirt, so yeah.

00:35:36.200 | - Is there bananas on it?

00:35:37.500 | Okay, so masking has worked for the vision context as well.

00:35:42.400 | - And so this transformer idea has worked as well.

00:35:44.280 | So basically looking at all the elements

00:35:46.240 | to understand a particular element

00:35:48.120 | has been really powerful in vision.

00:35:49.880 | The reason is like a lot of things

00:35:52.040 | when you're looking at them in isolation.

00:35:53.440 | So if you look at just a blob of pixels,

00:35:55.560 | so Antonio Toralba at MIT used to have this

00:35:57.600 | like really famous image,

00:35:58.920 | which I looked at when I was a PhD student,

00:36:01.000 | where he would basically have a blob of pixels

00:36:02.800 | and he would ask you, "Hey, what is this?"

00:36:04.920 | And it looked basically like a shoe

00:36:06.800 | or like it could look like a TV remote,

00:36:08.840 | it could look like anything.

00:36:10.040 | And it turns out it was a beer bottle.

00:36:12.320 | But I'm not sure, it was one of these three things,

00:36:14.080 | but basically he showed you the full picture

00:36:15.400 | and then it was very obvious what it was.

00:36:17.520 | But the point is, just by looking

00:36:19.040 | at that particular local window, you couldn't figure it out.

00:36:21.840 | Because of resolution, because of other things,

00:36:23.840 | it's just not easy always to just figure it out

00:36:26.040 | by looking at just the neighborhood of pixels,

00:36:27.920 | what these pixels are.

00:36:29.640 | And the same thing happens for language as well.

00:36:31.960 | - For the parameters that have to learn something

00:36:34.240 | about the data, you need to give it the capacity

00:36:37.120 | to learn the essential things.

00:36:39.080 | Like if it's not actually able to receive the signal at all,

00:36:42.600 | then it's not gonna be able to learn that signal.

00:36:44.240 | And to understand images, to understand language,

00:36:47.240 | you have to be able to see words in their full context.

00:36:50.640 | Okay, what is harder to solve, vision or language?

00:36:54.880 | Visual intelligence or linguistic intelligence?

00:36:57.800 | - So I'm going to say computer vision is harder.

00:36:59.760 | My reason for this is basically that language,

00:37:03.240 | of course, has a big structure to it because we developed it.

00:37:06.800 | But as vision is something that is common

00:37:08.600 | in a lot of animals, everyone is able to get by,

00:37:11.400 | a lot of these animals on earth are actually able

00:37:13.320 | to get by without language.

00:37:15.080 | And a lot of these animals we also deem

00:37:17.080 | to be intelligent.

00:37:18.280 | So clearly intelligence does have

00:37:20.920 | like a visual component to it.

00:37:22.520 | And yes, of course, in the case of humans,

00:37:24.240 | it of course also has a linguistic component.

00:37:26.400 | But it means that there is something far more fundamental

00:37:28.720 | about vision than there is about language.

00:37:30.840 | And I'm sorry to anyone who disagrees,

00:37:32.960 | but yes, this is what I feel.

00:37:34.360 | - So that's being a little bit reflected

00:37:36.920 | in the challenges that have to do with the progress

00:37:40.800 | of self-supervised learning, would you say?

00:37:42.520 | Or is that just the peculiar accidents

00:37:45.560 | of the progress of the AI community

00:37:47.400 | that we focused on, or we discovered self-attention

00:37:50.240 | and transformers in the context of language first?

00:37:53.640 | - So like the self-supervised learning success

00:37:55.520 | was actually for vision has not much to do

00:37:58.840 | with the transformers part, I would say.

00:38:00.320 | It's actually been independent a little bit.

00:38:02.480 | I think it's just that the signal was a little bit different

00:38:05.320 | for vision than there was for like NLP

00:38:08.120 | and probably NLP folks discovered it before.

00:38:11.240 | So for vision, the main success has basically

00:38:13.360 | been this like crops so far,

00:38:14.840 | like taking different crops of images.

00:38:16.800 | Whereas for NLP, it was this masking thing.

00:38:18.920 | - But also the level of success

00:38:20.480 | is still much higher for language.

00:38:22.080 | - It has.

00:38:22.920 | So that has a lot to do with, I mean,

00:38:25.660 | I can get into a lot of details.

00:38:26.920 | For this particular question, let's go for it.

00:38:28.560 | Okay, so the first thing is language is very structured.

00:38:32.280 | So you are going to produce a distribution

00:38:34.040 | over a finite vocabulary.

00:38:35.920 | English has a finite number of words.

00:38:37.680 | It's actually not that large.

00:38:39.280 | And you need to produce basically,

00:38:41.440 | for when you're doing this masking thing,

00:38:42.760 | all you need to do is basically tell me

00:38:44.160 | which one of these like 50,000 words it is.

00:38:46.440 | That's it.

00:38:47.280 | Now for vision, let's imagine doing the same thing.

00:38:49.560 | Okay, we're basically going to blank out

00:38:51.480 | a particular part of the image

00:38:52.600 | and we ask the network or this neural network

00:38:54.680 | to predict what is present in this missing patch.

00:38:58.080 | It's combinatorially large, right?

00:38:59.960 | You have 256 pixel values.

00:39:02.560 | If you're even producing basically a seven cross seven

00:39:04.840 | or a 14 cross 14, like window of pixels

00:39:08.000 | at each of these 169 or each of these 49 locations,

00:39:11.340 | you have 256 values to predict.

00:39:13.760 | And so it's really, really large.

00:39:15.280 | And very quickly, the kind of like prediction problems

00:39:19.000 | that we're setting up are going to be

00:39:20.280 | extremely like intractable for us.

00:39:22.800 | And so the thing is for NLP,

00:39:23.960 | it has been really successful

00:39:25.000 | because we are very good at predicting,

00:39:27.560 | like doing this like distribution over a finite set.

00:39:30.880 | And the problem is when this set becomes really large,

00:39:33.520 | we're going to become really, really bad

00:39:35.560 | at making these predictions

00:39:37.000 | and at solving basically this particular set of problems.

00:39:41.040 | So if you were to do it exactly in the same way

00:39:44.200 | as NLP for vision, there is very limited success.

00:39:47.040 | The way stuff is working right now

00:39:48.980 | is actually not by predicting these masks.

00:39:51.680 | It's basically by saying that you take these two

00:39:53.660 | like crops from the image,

00:39:55.160 | you get a feature representation from it

00:39:57.060 | and just saying that these two features,

00:39:58.680 | so they're like vectors,

00:40:00.440 | just saying that the distance between these vectors

00:40:02.040 | should be small.

00:40:03.220 | And so it's a very different way of learning

00:40:06.640 | from the visual signal than there is from NLP.

00:40:09.120 | - Okay, the other reason is the distributional hypothesis

00:40:11.360 | that we talked about for NLP, right?

00:40:12.920 | So a word given its context,

00:40:15.160 | basically the context actually supplies

00:40:16.560 | a lot of meaning to the word.

00:40:18.440 | Now, because there are just finite number of words

00:40:22.280 | and there is a finite way in like which we compose them,

00:40:25.760 | of course, the same thing holds for pixels,

00:40:27.440 | but in language, there's a lot of structure, right?

00:40:29.760 | So I always say whatever,

00:40:31.000 | the dash jumped over the fence, for example.

00:40:33.760 | There are lots of these sentences that you'll get.

00:40:36.720 | And from this, you can actually look at

00:40:38.680 | this particular sentence might occur

00:40:40.160 | in a lot of different contexts as well.

00:40:41.480 | This exact same sentence might occur in a different context.

00:40:44.080 | So the sheep jumped over the fence,

00:40:45.560 | the cat jumped over the fence,

00:40:46.840 | the dog jumped over the fence.

00:40:48.160 | So you immediately get a lot of these words,

00:40:50.460 | which are because this particular token itself

00:40:52.720 | has so much meaning,

00:40:53.560 | you get a lot of these tokens or these words,

00:40:55.480 | which are actually going to have sort of

00:40:57.720 | this related meaning across given this context.

00:41:00.560 | Whereas for vision, it's much harder.

00:41:02.640 | Because just by like pure,

00:41:04.160 | like the way we capture images,

00:41:05.600 | lighting can be different.

00:41:07.440 | There might be like different noise in the sensor.

00:41:09.800 | So the thing is you're capturing a physical phenomenon,

00:41:12.220 | and then you're basically going through

00:41:13.840 | a very complicated pipeline of like image processing,

00:41:16.360 | and then you're translating that into

00:41:18.020 | some kind of like digital signal.

00:41:20.400 | Whereas with language, you write it down,

00:41:23.520 | and you transfer it to a digital signal,

00:41:25.040 | almost like it's a lossless like transfer.

00:41:27.440 | And each of these tokens are very, very well defined.

00:41:30.160 | - There could be a little bit of an argument there,

00:41:32.860 | because language as written down

00:41:36.140 | is a projection of thought.

00:41:39.420 | This is one of the open questions is,

00:41:42.560 | if you perfectly can solve language,

00:41:45.480 | are you getting close to being able to solve,

00:41:49.360 | easily with flying colors past the Turing test kind of thing.

00:41:52.800 | So that's, it's similar, but different.

00:41:56.560 | And the computer vision problem is in the 2D plane

00:41:59.760 | is a projection of a three dimensional world.

00:42:02.640 | So perhaps there are similar problems there.

00:42:05.640 | Maybe this is-

00:42:06.480 | - I mean, I think what I'm saying is NLP is not easy.

00:42:08.560 | Of course, don't get me wrong.

00:42:09.500 | Like abstract thought expressed in knowledge,

00:42:12.900 | or knowledge basically expressed in language

00:42:14.580 | is really hard to understand, right?

00:42:16.720 | I mean, we've been communicating with language for so long,

00:42:19.140 | and it is of course a very complicated concept.

00:42:21.980 | The thing is, at least getting like somewhat reasonable,

00:42:26.980 | like being able to solve some kind of reasonable tasks

00:42:29.820 | with language, I would say slightly easier

00:42:32.060 | than it is with computer vision.

00:42:33.660 | - Yeah, I would say, yeah.

00:42:35.360 | So that's well put.

00:42:36.600 | I would say getting impressive performance

00:42:39.340 | on language is easier.

00:42:43.380 | I feel like for both language and computer vision,

00:42:45.340 | there's going to be this wall of like,

00:42:49.460 | like this hump you have to overcome

00:42:52.240 | to achieve superhuman level performance

00:42:54.780 | or human level performance.

00:42:56.580 | And I feel like for language, that wall is farther away.

00:43:00.220 | So you can get pretty nice.

00:43:01.900 | You can do a lot of tricks.

00:43:04.100 | You can show really impressive performance.

00:43:06.540 | You can even fool people that you're tweeting

00:43:09.660 | or you're blog post writing,

00:43:11.460 | or your question answering has intelligence behind it.

00:43:16.460 | But to truly demonstrate understanding of dialogue,

00:43:21.900 | of continuous long form dialogue,

00:43:25.020 | that would require perhaps big breakthroughs.

00:43:28.580 | In the same way in computer vision,

00:43:30.420 | I think the big breakthroughs need to happen earlier

00:43:33.380 | to achieve impressive performance.

00:43:36.620 | This might be a good place to, you already mentioned it,

00:43:38.740 | but what is contrastive learning

00:43:41.100 | and what are energy based models?

00:43:43.860 | - Contrastive learning is sort of a paradigm of learning

00:43:46.860 | where the idea is that you are learning this embedding space

00:43:50.700 | or so you're learning this sort of vector space

00:43:52.700 | of all your concepts.

00:43:54.500 | And the way you learn that is basically by contrasting.

00:43:56.780 | So the idea is that you have a sample,

00:43:59.100 | you have another sample that's related to it.

00:44:00.980 | So that's called the positive.

00:44:02.860 | And you have another sample that's not related to it.

00:44:05.100 | So that's negative.

00:44:06.100 | So for example, let's just take an NLP

00:44:08.340 | or a simple example in computer vision.

00:44:10.980 | So you have an image of a cat, you have an image of a dog.

00:44:14.500 | And for whatever application that you're doing,

00:44:16.540 | say you're trying to figure out what pets are,

00:44:18.860 | you're saying that these two images are related.

00:44:20.300 | So image of a cat and dog are related,

00:44:22.300 | but now you have another third image of a banana

00:44:25.380 | because you don't like that word.

00:44:26.980 | So now you basically have this banana.

00:44:28.940 | - Thank you for speaking to the crowd.

00:44:30.660 | - And so you take both of these images

00:44:32.580 | and you take the image from the cat,

00:44:34.460 | the image from the dog,

00:44:35.300 | you get a feature from both of them.

00:44:36.780 | And now what you're training the network to do

00:44:38.180 | is basically pull both of these features together

00:44:42.100 | while pushing them away from the feature of a banana.

00:44:44.740 | So this is the contrastive part.

00:44:45.860 | So you're contrasting against the banana.

00:44:47.860 | So there's always this notion of a negative and a positive.

00:44:51.500 | Now, energy-based models are like one way

00:44:54.140 | that Jan sort of explains a lot of these methods.

00:44:57.460 | So Jan basically, I think a couple of years

00:45:00.700 | or more than that, like when I joined Facebook,

00:45:02.860 | Jan used to keep mentioning this word energy-based models.

00:45:05.140 | And of course I had no idea what he was talking about.

00:45:07.260 | So then one day I caught him

00:45:08.460 | in one of the conference rooms and I'm like,

00:45:10.020 | can you please tell me what this is?

00:45:11.260 | So then like very patiently he sat down

00:45:14.180 | with like a marker and a whiteboard.

00:45:15.980 | And his idea basically is that

00:45:18.340 | rather than talking about probability distributions,

00:45:20.300 | you can talk about energies of models.

00:45:21.980 | So models are trying to minimize certain energies

00:45:24.020 | in certain space,

00:45:25.020 | or they're trying to maximize a certain kind of energy.

00:45:28.220 | And the idea basically is that

00:45:29.780 | you can explain a lot of the contrastive models,

00:45:32.220 | GANs for example,

00:45:33.300 | which are like generative adversarial networks.

00:45:36.020 | A lot of these modern learning methods

00:45:37.940 | or VAEs which are variational autoencoders,

00:45:39.940 | you can really explain them very nicely

00:45:41.860 | in terms of an energy function

00:45:43.180 | that they're trying to minimize or maximize.

00:45:45.340 | And so by putting this common sort of language

00:45:48.380 | for all of these models,

00:45:49.740 | what looks very different in machine learning

00:45:51.820 | that OVAEs are very different from what GANs are,

00:45:54.180 | are very, very different from what contrastive models are.

00:45:56.420 | You actually get a sense of like,

00:45:57.580 | oh, these are actually very, very related.

00:46:00.140 | It's just that the way or the mechanism

00:46:02.500 | in which they're sort of maximizing

00:46:04.220 | or minimizing this energy function is slightly different.

00:46:06.980 | - It's revealing the commonalities

00:46:08.900 | between all these approaches

00:46:10.380 | and putting a sexy word on top of it like energy.

00:46:12.860 | And so similarity,

00:46:14.340 | two things that are similar have low energy.

00:46:16.740 | Like the low energy signifying similarity.

00:46:20.340 | - Right, exactly.

00:46:21.180 | So basically the idea is that if you were to imagine

00:46:23.540 | like the embedding as a manifold, a 2D manifold,

00:46:26.460 | you would get a hill or like a high sort of peak

00:46:28.900 | in the energy manifold

00:46:30.580 | wherever two things are not related.

00:46:32.420 | And basically you would have like a dip

00:46:34.100 | where two things are related.

00:46:35.500 | So you'd get a dip in the manifold.

00:46:37.060 | - And in the self-supervised context,

00:46:40.180 | how do you know two things are related

00:46:42.260 | and two things are not related?

00:46:44.100 | - Right, so this is where all the sort of ingenuity

00:46:46.420 | or tricks comes in, right?

00:46:47.860 | So for example, like you can take the fill in the blank

00:46:51.660 | problem or you can take in the context problem.

00:46:54.340 | And what you can say is two words

00:46:55.900 | that are in the same context are related.

00:46:57.740 | Two words that are in different contexts are not related.

00:47:00.500 | For images, basically two crops from the same image

00:47:02.980 | are related and whereas a third image is not related at all.

00:47:06.420 | Or for a video, it can be two frames

00:47:08.140 | from that video are related

00:47:09.140 | because they're likely to contain

00:47:10.780 | the same sort of concepts in them.

00:47:12.700 | Whereas a third frame from a different video is not related.

00:47:15.580 | So it basically is, it's a very general term.

00:47:18.300 | Contrastive learning is nothing really to do

00:47:19.820 | with self-supervised learning.

00:47:20.820 | It actually is very popular in, for example,

00:47:23.220 | like any kind of metric learning

00:47:25.180 | or any kind of embedding learning.

00:47:26.900 | So it's also used in supervised learning.

00:47:28.860 | It's all, and the thing is,

00:47:30.220 | because we are not really using labels

00:47:32.060 | to get these positive or negative pairs,

00:47:34.500 | it can basically also be used for self-supervised learning.

00:47:37.580 | - So you mentioned one of the ideas in the vision context

00:47:40.420 | that works is to have different crops.

00:47:45.220 | So you could think of that as a way

00:47:47.020 | to sort of manipulating the data

00:47:49.460 | to generate examples that are similar.

00:47:53.260 | Obviously, there's a bunch of other techniques.

00:47:55.780 | You mentioned lighting as a very, you know,

00:47:58.580 | in images, lighting is something that varies a lot

00:48:01.620 | and you can artificially change those kinds of things.

00:48:04.500 | There's the whole broad field of data augmentation,

00:48:07.700 | which manipulates images in order to increase arbitrarily

00:48:11.820 | the size of the dataset.

00:48:13.420 | First of all, what is data augmentation?

00:48:15.860 | And second of all, what's the role of data augmentation

00:48:18.140 | in self-supervised learning and contrastive learning?

00:48:22.020 | - So data augmentation is just a way, like you said,

00:48:24.800 | it's basically a way to augment the data.

00:48:26.700 | So you have, say, N samples,

00:48:28.700 | and what you do is you basically define

00:48:30.180 | some kind of transforms for the sample.

00:48:32.300 | So you take your, say, image,

00:48:33.700 | and then you define a transform

00:48:34.900 | where you can just increase, say, the colors,

00:48:37.340 | like the colors or the brightness of the image,

00:48:39.140 | or increase or decrease the contrast of the image,

00:48:41.340 | for example, or take different crops of it.

00:48:43.740 | So data augmentation is just a process

00:48:46.220 | to basically perturb the data or augment the data.

00:48:51.060 | And so it has played a fundamental role for computer vision,

00:48:54.300 | for self-supervised learning especially.

00:48:56.620 | The way most of the current methods work,

00:48:59.180 | contrastive or otherwise, is by taking an image,

00:49:02.740 | in the case of images, is by taking an image

00:49:05.340 | and then computing basically two perturbations of it.

00:49:08.580 | So these can be two different crops of the image

00:49:11.500 | with different types of lighting

00:49:12.940 | or different contrast or different colors.

00:49:14.980 | So you jitter the colors a little bit and so on.

00:49:17.860 | And now the idea is basically,

00:49:19.900 | because it's the same object

00:49:21.740 | or because it's like related concepts

00:49:23.460 | in both of these perturbations,

00:49:25.220 | you want the features

00:49:26.300 | from both of these perturbations to be similar.

00:49:28.940 | So now you can use a variety of different ways

00:49:31.300 | to enforce this constraint,

00:49:32.620 | like these features being similar.

00:49:34.220 | You can do this by contrastive learning.

00:49:36.020 | So basically, both of these things are positives,

00:49:38.460 | a third sort of image is negative.

00:49:40.420 | You can do this basically by like clustering.

00:49:43.460 | For example, you can say that both of these images should,

00:49:46.980 | the features from both of these images

00:49:48.140 | should belong in the same cluster because they're related.

00:49:50.580 | Whereas image, like another image

00:49:52.300 | should belong to a different cluster.

00:49:53.860 | So there's a variety of different ways

00:49:55.140 | to basically enforce this particular constraint.

00:49:57.580 | - By the way, when you say features,

00:49:59.100 | it means there's a very large neural network

00:50:01.660 | that extracting patterns from the image

00:50:03.620 | and the kind of patterns it extracts

00:50:05.140 | should be either identical or very similar.

00:50:07.980 | - Right.

00:50:08.820 | - That's what that means.

00:50:09.660 | - So the neural network basically takes in the image

00:50:11.860 | and then outputs a set of like,

00:50:14.140 | basically a vector of like numbers,

00:50:16.580 | and that's the feature.

00:50:17.700 | And you want this feature for both of these

00:50:20.020 | like different crops that you computed to be similar.

00:50:22.100 | So you want this vector to be identical

00:50:24.500 | in its like entries, for example.

00:50:26.100 | - Be like literally close

00:50:28.140 | in this multidimensional space to each other.

00:50:30.740 | - Right.

00:50:31.620 | - And like you said,

00:50:32.580 | close can mean part of the same cluster

00:50:34.740 | or something like that in this large space.

00:50:37.420 | First of all, that,

00:50:38.940 | I wonder if there is connection

00:50:40.660 | to the way humans learn to this,

00:50:43.780 | almost like maybe subconsciously

00:50:48.060 | in order to understand a thing,

00:50:50.140 | you kind of have to see it from two, three, multiple angles.

00:50:54.660 | I wonder, I have a lot of friends

00:50:57.340 | who are neuroscientists maybe,

00:50:58.460 | or getting incognito scientists.

00:51:00.220 | I wonder if that's in there somewhere,

00:51:03.200 | like in order for us to place a concept in its proper place,

00:51:08.580 | we have to basically crop it in all kinds of ways,

00:51:12.460 | do basic data augmentation on it

00:51:14.420 | in whatever very clever ways that the brain likes to do.

00:51:17.660 | - Right.

00:51:19.020 | - Like spinning around in our mind somehow

00:51:21.140 | that is very effective.

00:51:23.100 | - So I think for some of them,

00:51:24.140 | we like need to do it.

00:51:25.060 | So like babies, for example, pick up objects,

00:51:26.980 | like move them and put them close to their eye and whatnot.

00:51:30.100 | But for certain other things,

00:51:31.180 | actually we are good at imagining it as well, right?

00:51:33.780 | So if you, I have never seen, for example,

00:51:35.940 | an elephant from the top.

00:51:36.940 | I've never basically looked at it from like top down.

00:51:39.540 | But if you showed me a picture of it,

00:51:40.700 | I could very well tell you that that's an elephant.

00:51:43.780 | So I think some of it, we're just like,

00:51:45.340 | we naturally build it or transfer it from other objects

00:51:47.820 | that we've seen to imagine what it's going to look like.

00:51:50.940 | - Has anyone done that with augmentation?

00:51:53.260 | Like imagine all the possible things

00:51:56.900 | that are occluded or not there,

00:51:59.860 | but not just like normal things, like wild things,

00:52:03.340 | but they're nevertheless physically consistent.

00:52:06.880 | - So, I mean, people do kind of like occlusion

00:52:10.440 | based augmentation as well.

00:52:11.720 | So you place in like a random like box,

00:52:14.080 | gray box to sort of mask out a certain part of the image.

00:52:17.360 | And the thing is basically you're kind of occluding it.

00:52:19.880 | For example, you place it say on half of a person's face.

00:52:23.480 | So basically saying that, you know,

00:52:24.840 | something below their nose is occluded,

00:52:26.600 | because it's grayed out.

00:52:28.160 | - So, no, I meant like you have like, what is it?

00:52:31.560 | A table and you can't see behind the table.

00:52:33.760 | And you imagine there's a bunch of elves

00:52:37.000 | with bananas behind the table.

00:52:38.800 | Like, I wonder if there's useful to have

00:52:40.880 | a wild imagination for the network,

00:52:44.160 | because that's possible.

00:52:45.280 | Well, maybe not elves, but like puppies and kittens

00:52:47.880 | or something like that.

00:52:48.960 | Just have a wild imagination

00:52:51.240 | and like constantly be generating that wild imagination.

00:52:55.040 | 'Cause in terms of data augmentation

00:52:57.520 | that's currently applied, it's super ultra, very boring.

00:53:01.160 | It's very basic data augmentation.

00:53:02.880 | I wonder if there's a benefit to being wildly imaginable

00:53:07.000 | while trying to be consistent with physical reality.

00:53:11.840 | - I think it's a kind of a chicken and egg problem, right?

00:53:14.160 | Because to have like amazing data augmentation,

00:53:16.360 | you need to understand what the scene is.

00:53:18.480 | And what we're trying to do data augmentation

00:53:20.600 | to learn what a scene is anyway.

00:53:22.040 | So it's basically just keeps going on.

00:53:23.720 | - Before you understand it, just put elves with bananas

00:53:26.000 | until you know it's not to be true.

00:53:28.080 | Just like children have a wild imagination

00:53:31.640 | until the adults ruin it all.

00:53:33.920 | Okay, so what are the different kinds of data augmentation

00:53:36.920 | that you've seen to be effective in visual intelligence?

00:53:40.760 | - For like vision,

00:53:42.000 | it's a lot of these image filtering operations.

00:53:44.120 | So like blurring the image,

00:53:45.760 | all the kinds of Instagram filters that you can think of.

00:53:49.400 | So like arbitrarily like make the red super red,

00:53:52.480 | make the green super greens, like saturate the image.

00:53:55.800 | - Rotation cropping.

00:53:56.960 | - Rotation cropping, exactly.

00:53:58.440 | All of these kinds of things.

00:53:59.560 | Like I said, lighting is a really interesting one to me.

00:54:02.640 | That feels like really complicated to do.

00:54:04.760 | - So I mean, the augmentations that we work on

00:54:07.320 | aren't like that involved.

00:54:08.920 | They're not going to be like physically realistic versions

00:54:10.880 | of lighting.

00:54:11.720 | It's not that you're assuming that there's a light source up

00:54:13.720 | and then you're moving it to the right

00:54:15.120 | and then what does the thing look like?

00:54:17.040 | It's really more about like brightness of the image,

00:54:19.200 | overall brightness of the image

00:54:20.440 | or overall contrast of the image and so on.

00:54:22.560 | - But this is a really important point to me.

00:54:25.120 | I always thought that data augmentation

00:54:28.720 | holds an important key to big improvements

00:54:32.720 | in machine learning.

00:54:33.840 | And it seems that it is an important aspect

00:54:36.640 | of self-supervised learning.

00:54:39.080 | So I wonder if there's big improvements to be achieved

00:54:42.560 | on much more intelligent kinds of data augmentation.

00:54:46.680 | For example, currently,

00:54:48.320 | maybe you can correct me if I'm wrong,

00:54:50.160 | data augmentation is not parametrized.

00:54:52.760 | - Yeah.

00:54:53.600 | - You're not learning.

00:54:54.440 | - You're not learning.

00:54:55.260 | To me, it seems like data augmentation potentially

00:54:59.720 | should involve more learning

00:55:01.960 | than the learning process itself.

00:55:04.120 | - Right.

00:55:05.320 | - You're almost like thinking of like generative kind of,

00:55:08.760 | it's the elves with bananas.

00:55:10.200 | You're trying to,

00:55:11.040 | it's like very active imagination of messing with the world

00:55:14.840 | and teaching that mechanism for messing with the world

00:55:17.600 | to be realistic.

00:55:19.080 | - Right.

00:55:20.460 | - Because that feels like,

00:55:22.600 | I mean, it's imagination.

00:55:24.680 | Just as you said,

00:55:25.640 | it feels like us humans are able to,

00:55:28.160 | maybe sometimes subconsciously,

00:55:30.720 | imagine before we see the thing,

00:55:33.000 | imagine what we're expecting to see.

00:55:35.500 | Like maybe several options.

00:55:37.240 | And especially, we probably forgot,

00:55:38.800 | but when we were younger,

00:55:40.520 | probably the possibilities were wilder, more numerous.

00:55:44.200 | And then as we get older,

00:55:45.160 | we become to understand the world

00:55:47.400 | and the possibilities of what we might see

00:55:51.040 | becomes less and less and less.

00:55:53.120 | So I wonder if you think there's a lot of breakthroughs

00:55:55.600 | yet to be had in data augmentation.

00:55:57.160 | And maybe also, can you just comment on the stuff we have?

00:55:59.780 | Is that a big part of self-supervised learning?

00:56:02.120 | - Yes.

00:56:02.960 | So data augmentation is like key to self-supervised learning.

00:56:05.520 | That has like the kind of augmentation that we're using.

00:56:08.320 | And basically the fact that we're trying to learn

00:56:11.040 | these neural networks that are predicting these features

00:56:13.920 | from images that are robust under data augmentation

00:56:17.080 | has been the key for visual self-supervised learning.

00:56:19.560 | And they play a fairly fundamental role to it.

00:56:22.400 | Now, the irony of all of this is that

00:56:24.600 | for like deep learning purists will say

00:56:26.720 | the entire point of deep learning

00:56:28.400 | is that you feed in the pixels to the neural network

00:56:31.160 | and it should figure out the patterns on its own.

00:56:33.120 | So if it really wants to look at edges,

00:56:34.480 | it should look at edges.

00:56:35.640 | You shouldn't really go and handcraft these features.

00:56:38.600 | You shouldn't go tell it that look at edges.

00:56:41.160 | So data augmentation should basically

00:56:43.120 | be in the same category.

00:56:44.400 | Why should we tell the network

00:56:46.040 | or tell this entire learning paradigm

00:56:48.200 | what kinds of data augmentation that we're looking for?

00:56:50.840 | We are encoding a very sort of human specific bias there

00:56:55.200 | that we know things are like,

00:56:57.560 | if you change the contrast of the image,

00:56:59.200 | it should still be an apple

00:57:00.240 | or it should still be apple, not banana.

00:57:02.240 | - Thank you.

00:57:03.520 | - Basically if we change like colors,

00:57:05.880 | it should still be the same kind of concept.

00:57:08.040 | Of course, this is not one,

00:57:09.880 | this is doesn't feel like super satisfactory

00:57:12.480 | because a lot of our human knowledge

00:57:14.560 | or our human supervision

00:57:15.760 | is actually going into the data augmentation.

00:57:17.600 | So although we are calling it self-supervised learning,

00:57:19.680 | a lot of the human knowledge is actually being encoded

00:57:21.880 | in the data augmentation process.

00:57:23.520 | So it's really like we've kind of sneaked away

00:57:25.480 | the supervision at the input

00:57:27.120 | and we're like really designing these nice list

00:57:29.440 | of data augmentations that are working very well.

00:57:31.640 | - Of course, the idea is that it's much easier

00:57:33.720 | to design a list of data augmentation than it is to do.

00:57:36.600 | So humans are doing nevertheless doing less and less work

00:57:39.640 | and maybe leveraging their creativity more and more.

00:57:42.600 | And when we say data augmentation is not parameterized,

00:57:45.080 | it means it's not part of the learning process.

00:57:48.200 | Do you think it's possible to integrate

00:57:50.560 | some of the data augmentation into the learning process?

00:57:53.280 | - I think so.

00:57:54.120 | I think so.

00:57:54.960 | And in fact, it will be really beneficial for us

00:57:57.440 | because a lot of these data augmentations

00:57:59.720 | that we use in vision are very extreme.

00:58:01.840 | For example, like when you have certain concepts,

00:58:05.400 | again, a banana, you take the banana

00:58:08.160 | and then basically you change the color of the banana.

00:58:10.600 | So you make it a purple banana.

00:58:12.440 | Now this data augmentation process

00:58:14.160 | is actually independent of the,

00:58:15.920 | like it has no notion of what is present in the image.

00:58:18.920 | So it can change this color arbitrarily.

00:58:20.560 | It can make it a red banana as well.

00:58:22.600 | And now what we're doing is we're telling the neural network

00:58:24.760 | that this red banana and so a crop of this image

00:58:28.240 | which has the red banana and a crop of this image

00:58:30.240 | where I change the color to a purple banana

00:58:31.840 | should be, the features should be the same.

00:58:34.080 | Now bananas aren't red or purple mostly.

00:58:36.680 | So really the data augmentation process

00:58:38.560 | should take into account what is present in the image

00:58:41.120 | and what are the kinds of physical realities

00:58:43.080 | that are possible.

00:58:43.920 | It shouldn't be completely independent of the image.

00:58:45.840 | - So you might get big gains if you,

00:58:48.840 | instead of being drastic, do subtle augmentation

00:58:51.560 | but realistic augmentation.

00:58:53.280 | - Right, realistic.

00:58:54.120 | I'm not sure if it's subtle, but like realistic for sure.

00:58:56.280 | - If it's realistic, then even subtle augmentation

00:58:59.600 | will give you big benefits.

00:59:00.680 | - Exactly, yeah.

00:59:01.840 | And it'll be like for particular domains,

00:59:05.040 | you might actually see like,

00:59:06.440 | if for example, now we're doing medical imaging,

00:59:08.960 | there are going to be certain kinds

00:59:10.120 | of like geometric augmentations

00:59:11.440 | which are not really going to be very valid

00:59:13.480 | for the human body.

00:59:15.080 | So if you were to like actually loop in data augmentation

00:59:18.280 | into the learning process,

00:59:19.480 | it will actually be much more useful.

00:59:21.320 | Now, this actually does take us to maybe

00:59:23.680 | a semi-supervised kind of a setting

00:59:25.120 | because you do want to understand

00:59:27.480 | what is it that you're trying to solve.

00:59:29.080 | So currently self-supervised learning

00:59:30.880 | kind of operates in the wild, right?

00:59:32.720 | So you do the self-supervised learning,

00:59:34.960 | and the purists and all of us basically say that,

00:59:37.560 | okay, this should learn useful representations

00:59:39.440 | and they should be useful for any kind of end task,

00:59:42.320 | no matter it's like banana recognition

00:59:44.280 | or like autonomous driving.

00:59:46.240 | Now, it's a tall order.

00:59:47.760 | Maybe the first baby step for us should be that,

00:59:50.480 | okay, if you're trying to loop in this data augmentation

00:59:52.640 | into the learning process,

00:59:53.920 | then we at least need to have some sense

00:59:56.000 | of what we're trying to do.

00:59:56.840 | Are we trying to distinguish

00:59:57.760 | between different types of bananas?

00:59:59.560 | Or are we trying to distinguish between banana and apple?

01:00:02.040 | Or are we trying to do all of these things at once?

01:00:04.400 | And so some notion of like what happens at the end

01:00:07.920 | might actually help us do much better at this side.

01:00:10.840 | - Let me ask you a ridiculous question.

01:00:13.760 | If I were to give you like a black box,

01:00:16.240 | like a choice to have an arbitrary large data set

01:00:19.480 | of real natural data

01:00:21.280 | versus really good data augmentation algorithms,

01:00:26.560 | which would you like to train in a self-supervised way on?

01:00:31.240 | So natural data from the internet are arbitrary large,

01:00:34.960 | so unlimited data,

01:00:37.280 | or it's like more controlled,

01:00:40.200 | good data augmentation on the finite data set.

01:00:43.600 | - The thing is like,

01:00:44.440 | because our learning algorithms for vision right now

01:00:47.240 | really rely on data augmentation,

01:00:49.360 | even if you were to give me like an infinite source

01:00:51.440 | of like image data,

01:00:52.880 | I still need a good data augmentation algorithm.

01:00:54.560 | - You need something that tells you

01:00:56.040 | that two things are similar.

01:00:57.400 | - Right, and so something,

01:00:59.000 | because you've given me an arbitrarily large data set,

01:01:01.600 | I still need to use data augmentation

01:01:03.760 | to take that image,

01:01:04.880 | construct like these two perturbations of it,

01:01:06.880 | and then learn from it.

01:01:08.240 | So the thing is our learning paradigm

01:01:09.960 | is very primitive right now.

01:01:11.880 | Even if you were to give me lots of images,

01:01:13.800 | it's still not really useful.

01:01:15.200 | A good data augmentation algorithm

01:01:16.520 | is actually going to be more useful.

01:01:18.040 | So you can like reduce down the amount of data

01:01:21.160 | that you'll give me by like 10 times.

01:01:22.920 | But if you were to give me

01:01:23.760 | a good data augmentation algorithm,

01:01:25.040 | that will probably do better

01:01:26.440 | than giving me like 10 times the size of that data,

01:01:29.040 | but me having to rely on like

01:01:30.960 | a very primitive data augmentation algorithm.

01:01:32.640 | - Like through tagging and all those kinds of things,

01:01:35.040 | is there a way to discover things

01:01:37.280 | that are semantically similar on the internet?

01:01:39.640 | Obviously there is,

01:01:40.480 | but there might be extremely noisy.

01:01:42.560 | And the difference might be

01:01:44.960 | farther away than you would be comfortable with.

01:01:47.880 | - So I mean, yes, tagging will help you a lot.

01:01:49.760 | It'll actually go a very long way

01:01:51.520 | in figuring out what images are related or not.

01:01:54.000 | And then so,

01:01:55.720 | but then the purists would argue

01:01:57.520 | that when you're using human tags,

01:01:58.960 | because these tags are like supervision,

01:02:01.240 | is it really self supervised learning now?

01:02:04.000 | Because you're using human tags

01:02:05.400 | to figure out which images are like similar.

01:02:08.000 | - Hashtag no filter means a lot of things.

01:02:10.440 | - Yes.

01:02:11.280 | I mean, there are certain tags

01:02:12.400 | which are going to be applicable pretty much to anything.

01:02:15.320 | So they're pretty useless for learning.

01:02:18.280 | But I mean, certain tags are actually like

01:02:20.840 | the Eiffel tower, for example,

01:02:22.280 | or the Taj Mahal, for example.

01:02:23.840 | These tags are like very indicative of what's going on.

01:02:26.480 | And they are, I mean, they are human supervision.

01:02:29.480 | - Yeah, this is one of the tasks of discovering

01:02:31.920 | from human generated data,

01:02:33.680 | strong signals that could be leveraged

01:02:37.320 | for self supervision.

01:02:39.560 | Like humans are doing so much work already.

01:02:42.280 | Like many years ago,

01:02:43.520 | there was something that was called,

01:02:45.160 | I guess, human computation back in the day.

01:02:48.040 | Humans are doing so much work.

01:02:50.280 | It'd be exciting to discover ways to leverage

01:02:53.520 | the work they're doing to teach machines

01:02:55.900 | without any extra effort from them.

01:02:58.000 | An example could be, like we said, driving,

01:03:00.200 | humans driving and machines can learn from the driving.

01:03:03.040 | I always hoped that there could be some supervision signal

01:03:06.800 | discovered in video games,

01:03:08.200 | because there's so many people that play video games

01:03:10.760 | that it feels like so much effort

01:03:13.920 | is put into video games, into playing video games.

01:03:17.720 | And you can design video games somewhat cheaply

01:03:21.840 | to include whatever signals you want.

01:03:24.640 | It feels like that could be leveraged somehow.

01:03:27.560 | - So people are using that.

01:03:28.720 | Like there are actually folks right here in UT Austin,

01:03:30.880 | like Philip Grenville is a professor at UT Austin.

01:03:33.680 | He's been like working on video games

01:03:36.160 | as a source of supervision.

01:03:38.000 | I mean, it's really fun, like as a PhD student

01:03:40.080 | getting to basically play video games all day.

01:03:42.200 | - Yeah, but so I do hope that kind of thing scales

01:03:44.960 | and like ultimately boils down to discovering

01:03:48.120 | some undeniably very good signal.

01:03:51.620 | It's like masking in NLP.

01:03:54.080 | But that said, there's non-contrastive methods.

01:03:57.640 | What do non-contrastive energy-based

01:04:00.880 | self-supervised learning methods look like

01:04:03.560 | and why are they promising?

01:04:05.680 | - So like I said about contrastive learning,

01:04:07.840 | you have this notion of a positive and a negative.

01:04:10.760 | Now, the thing is this entire learning paradigm

01:04:13.680 | really requires access to a lot of negatives

01:04:17.200 | to learn a good sort of feature space.

01:04:19.080 | The idea is if I tell you, okay,

01:04:21.700 | so a cat and a dog are similar

01:04:23.720 | and they're very different from a banana.

01:04:25.720 | The thing is this is a fairly simple analogy, right?

01:04:28.040 | Because bananas look visually very different

01:04:30.880 | from what cats and dogs do.

01:04:32.480 | So very quickly, if this is the only source of supervision

01:04:35.080 | that I'm giving you,

01:04:36.680 | your learning is not going to be like after a point

01:04:38.720 | then neural network is really not going to learn a lot

01:04:41.720 | because the negative that you're getting

01:04:43.040 | is going to be so random.

01:04:43.960 | So it can be, oh, a cat and a dog are very similar,

01:04:46.720 | but they're very different from a Volkswagen Beetle.

01:04:49.960 | Now, like this car looks very different

01:04:52.000 | from these animals again.

01:04:53.020 | So the thing is in contrastive learning,

01:04:54.960 | the quality of the negative sample really matters a lot.

01:04:58.200 | And so what has happened is basically

01:05:00.360 | that typically these methods that are contrastive

01:05:02.920 | really require access to lots of negatives,

01:05:04.960 | which becomes harder and harder to sort of scale

01:05:06.960 | when designing a learning algorithm.

01:05:09.080 | So that's been one of the reasons

01:05:10.960 | why non-contrastive methods have become popular

01:05:13.760 | and why people think that they're going to be more useful.

01:05:16.440 | So a non-contrastive method, for example,

01:05:18.480 | like clustering is one non-contrastive method.

01:05:20.960 | The idea basically being that you have two of these samples.

01:05:24.720 | So the cat and dog are two crops of this image.

01:05:27.720 | They belong to the same cluster.

01:05:29.320 | And so essentially you're basically doing clustering online

01:05:33.360 | when you're learning this network,

01:05:35.120 | and which is very different from having access

01:05:36.760 | to a lot of negatives explicitly.

01:05:38.960 | The other way which has become really popular

01:05:40.880 | is something called self-distillation.

01:05:43.180 | So the idea basically is that you have a teacher network

01:05:45.720 | and a student network,

01:05:47.560 | and the teacher network produces a feature.

01:05:49.560 | So it takes in the image,

01:05:51.120 | and basically the neural network figures out the patterns,

01:05:53.720 | gets the feature out.

01:05:55.280 | And there's another neural network,

01:05:56.840 | which is the student neural network,

01:05:58.000 | and that also produces a feature.

01:05:59.960 | And now all you're doing is basically saying

01:06:01.680 | that the features produced by the teacher network

01:06:04.000 | and the student network should be very similar.

01:06:06.140 | That's it.

01:06:06.980 | There is no notion of a negative anymore.

01:06:09.240 | And that's it.

01:06:10.080 | So it's all about similarity maximization

01:06:11.840 | between these two features.

01:06:13.720 | And so all I need to now do is figure out

01:06:16.360 | how to have these two sorts of parallel networks,

01:06:18.720 | a student network and a teacher network.

01:06:20.640 | And basically researchers have figured out

01:06:23.020 | very cheap methods to do this.

01:06:24.280 | So you can actually have for free, really,

01:06:26.800 | two types of neural networks.

01:06:29.040 | They're kind of related,

01:06:30.160 | but they're different enough that you can actually

01:06:32.040 | basically have a learning problem set up.

01:06:34.000 | - So you can ensure that they always remain different enough

01:06:38.200 | so that the thing doesn't collapse into something boring.

01:06:41.040 | - Exactly.

01:06:41.880 | So the main sort of enemy of self-supervised learning,

01:06:44.360 | any kind of similarity maximization technique is collapse.

01:06:47.560 | It's a collapse means that you learn

01:06:49.820 | the same feature representation

01:06:51.560 | for all the images in the world,

01:06:53.160 | which is completely useless.

01:06:54.640 | - Everything is a banana.

01:06:55.640 | - Everything is a banana.

01:06:56.560 | Everything is a cat.

01:06:57.400 | Everything is a car.

01:06:58.240 | - Yeah.

01:06:59.200 | And so all we need to do is basically come up

01:07:01.680 | with ways to prevent collapse.

01:07:03.320 | Contrastive learning is one way of doing it.

01:07:05.360 | And then for example, like clustering or self-distillation

01:07:07.840 | or other ways of doing it.

01:07:09.240 | We also had a recent paper where we used like decorrelation

01:07:13.120 | between like two sets of features to prevent collapse.

01:07:16.780 | So that's inspired a little bit by like Horace Barlow's

01:07:18.920 | neuroscience principles.

01:07:20.720 | - By the way, I should comment that whoever counts

01:07:23.560 | the number of times the word banana, apple, cat and dog

01:07:27.800 | we're using this conversation wins the internet.

01:07:30.160 | I wish you luck.

01:07:31.140 | What is Suave and the main improvement proposed

01:07:36.800 | in the paper on supervised learning of visual features

01:07:40.360 | by contrasting cluster assignments?

01:07:43.000 | - Suave basically is a clustering based technique,

01:07:46.400 | which is for again, the same thing

01:07:48.400 | for self-supervised learning in vision,

01:07:50.760 | where we have two crops.

01:07:52.440 | And the idea basically is that you want the features

01:07:55.280 | from these two crops of an image

01:07:57.000 | to lie in the same cluster.

01:07:58.920 | And basically crops that are coming from different images

01:08:02.540 | to be in different clusters.

01:08:03.960 | Now, typically in a sort of,

01:08:05.900 | if you were to do this clustering,

01:08:07.140 | you would perform clustering offline.

01:08:09.520 | What that means is you would,

01:08:11.040 | if you have a data set of N examples,

01:08:13.160 | you would run over all of these N examples,

01:08:15.360 | get features for them, perform clustering.

01:08:17.520 | So basically get some clusters

01:08:19.480 | and then repeat the process again.

01:08:22.000 | So this is offline basically because I need to do one pass

01:08:24.640 | through the data to compute its clusters.

01:08:27.240 | Suave is basically just a simple way of doing this online.

01:08:30.200 | So as you're going through the data,

01:08:31.820 | you're actually computing these clusters online.

01:08:34.800 | And so of course there is like a lot of tricks involved

01:08:37.480 | in how to do this in a robust manner without collapsing,

01:08:40.140 | but this is the sort of key idea to it.

01:08:42.440 | - Is there a nice way to say what is the key methodology

01:08:45.480 | of the clustering that enables that?

01:08:47.680 | - Right, so the idea basically is that

01:08:51.040 | when you have N samples,

01:08:52.720 | we assume that we have access to,

01:08:54.920 | like there are always K clusters in a data set.

01:08:57.040 | K is a fixed number.

01:08:57.920 | So for example, K is 3000.

01:09:00.160 | And so if you have any,

01:09:02.200 | when you look at any sort of small number of examples,

01:09:04.840 | all of them must belong to one of these K clusters.

01:09:08.000 | And we impose this equipartition constraint.

01:09:10.360 | What this means is that basically

01:09:13.640 | your entire set of N samples

01:09:16.880 | should be equally partitioned into K clusters.

01:09:19.480 | So all your K clusters are basically equal,

01:09:21.800 | they have equal contribution to these N samples.

01:09:24.400 | And this ensures that we never collapse.

01:09:26.520 | So collapse can be viewed as a way

01:09:28.320 | in which all samples belong to one cluster.

01:09:30.680 | So all this, if all features become the same,

01:09:33.180 | then you have basically just one mega cluster.

01:09:35.160 | You don't even have like 10 clusters or 3000 clusters.

01:09:38.160 | So Swarv basically ensures that at each point,

01:09:41.000 | all these 3000 clusters are being used

01:09:43.000 | in the clustering process.

01:09:45.080 | And that's it.

01:09:46.280 | Basically just figure out how to do this online.

01:09:48.520 | And again, basically just make sure

01:09:51.000 | that two crops from the same image

01:09:52.600 | belong to the same cluster and others don't.

01:09:55.760 | - And the fact they have a fixed K makes things simpler.

01:09:58.880 | - Fixed K makes things simpler.

01:10:00.400 | Our clustering is not like really hard clustering,

01:10:02.600 | it's soft clustering.

01:10:03.760 | So basically you can be 0.2 to cluster number one

01:10:06.920 | and 0.8 to cluster number two.

01:10:08.480 | So it's not really hard.

01:10:09.920 | So essentially, even though we have like 3000 clusters,

01:10:12.760 | we can actually represent a lot of clusters.

01:10:15.160 | - What is SEER?

01:10:16.960 | S-E-E-R.

01:10:19.240 | And what are the key results and insights in the paper,

01:10:23.120 | self-supervised pre-training of visual features in the wild?

01:10:27.400 | What is this big, beautiful SEER system?

01:10:30.720 | - SEER, so I'll first go to Swarv

01:10:32.920 | because Swarv is actually like

01:10:34.240 | one of the key components for SEER.

01:10:35.800 | So Swarv was, when we use Swarv,

01:10:37.800 | it was demonstrated on ImageNet.

01:10:39.800 | So typically like self-supervised methods,

01:10:42.920 | the way we sort of operate is like in the research community,

01:10:46.200 | we kind of cheat.

01:10:47.200 | So we take ImageNet,

01:10:48.560 | which of course I talked about as having lots of labels.

01:10:51.280 | And then we throw away the labels,

01:10:52.800 | like throw away all the hard work

01:10:54.240 | that went behind basically the labeling process.

01:10:56.800 | And we pretend that it is like unsupervised.

01:11:00.240 | But the problem here is that we have,

01:11:02.440 | when we collected these images,

01:11:05.160 | the ImageNet dataset

01:11:06.720 | has a particular distribution of concepts, right?

01:11:09.960 | So these images are very curated.

01:11:11.760 | And what that means is these images,

01:11:13.680 | of course, belong to a certain set of noun concepts.

01:11:17.680 | And also ImageNet has this bias

01:11:19.320 | that all images contain an object,

01:11:21.240 | which is like very big,

01:11:22.480 | and it's typically in the center.

01:11:24.080 | So when you're talking about a dog,

01:11:25.120 | it's a well-framed dog,

01:11:26.160 | it's towards the center of the image.

01:11:28.360 | So a lot of the data augmentation,

01:11:29.760 | a lot of the sort of hidden assumptions

01:11:31.520 | in self-supervised learning

01:11:33.440 | actually really exploit this bias of ImageNet.

01:11:37.400 | And so, I mean, a lot of my work,

01:11:39.720 | a lot of work from other people

01:11:41.000 | always uses ImageNet sort of as the benchmark

01:11:43.160 | to show the success of self-supervised learning.

01:11:45.480 | - So you're implying that there's

01:11:46.640 | particular limitations to this kind of dataset?

01:11:49.240 | - Yes, I mean, it's basically because

01:11:51.040 | our data augmentations that we designed,

01:11:53.200 | like all data augmentations that we designed

01:11:56.000 | for self-supervised learning and vision

01:11:57.520 | are kind of overfed to ImageNet.

01:11:59.400 | - But you're saying a little bit hard-coded,

01:12:02.440 | like the cropping.

01:12:03.840 | - Exactly, the cropping parameters,

01:12:05.480 | the kind of lighting that we're using,

01:12:07.320 | the kind of blurring that we're using.

01:12:08.800 | - Yeah, but you would, for a more in the wild dataset,

01:12:12.000 | you would need to be cleverer and more careful

01:12:16.240 | in setting the range of parameters

01:12:17.520 | and those kinds of things.

01:12:18.960 | - So for Sear, our main goal was twofold.

01:12:21.400 | One, basically to move away from ImageNet for training.

01:12:24.680 | So the images that we used were like uncurated images.

01:12:27.720 | Now there's a lot of debate

01:12:28.640 | whether they're actually curated or not,

01:12:30.080 | but I'll talk about that later.

01:12:32.360 | But the idea was basically

01:12:33.880 | these are going to be random internet images

01:12:36.400 | that we are not going to filter out

01:12:37.920 | based on like particular categories.

01:12:40.080 | So we did not say that, oh, images that belong to dogs

01:12:42.880 | and cats should be the only images

01:12:44.280 | that come in this dataset, banana.

01:12:47.000 | And basically other images should be thrown out.

01:12:50.040 | So we didn't do any of that.

01:12:51.800 | So these are random internet images.

01:12:53.560 | And of course, it also goes back to like the problem

01:12:56.040 | of scale that you talked about.

01:12:57.320 | So these were basically about a billion or so images.

01:13:00.120 | And for context ImageNet,

01:13:01.560 | the ImageNet version that we use

01:13:02.800 | was 1 million images earlier.

01:13:04.280 | So this is basically going like

01:13:05.400 | three orders of magnitude more.

01:13:07.600 | The idea was basically to see

01:13:08.600 | if we can train a very large convolutional model

01:13:11.800 | in a self-supervised way on this uncurated,

01:13:14.440 | but really large set of images.

01:13:16.400 | And how well would this model do?

01:13:18.280 | So is self-supervised learning really over fit to ImageNet?

01:13:21.440 | Or can it actually work in the wild?

01:13:23.840 | And it was also out of curiosity,

01:13:25.720 | what kind of things will this model learn?

01:13:27.520 | Will it actually be able to still figure out,

01:13:29.680 | you know, different types of objects and so on?

01:13:32.000 | Would there be particular kinds of tasks

01:13:33.720 | it would actually do better than an ImageNet trained model?

01:13:38.160 | And so for CIR, one of our main findings was that

01:13:40.960 | we can actually train very large models

01:13:43.120 | in a completely self-supervised way

01:13:44.800 | on lots of internet images

01:13:46.360 | without really necessarily filtering them out.

01:13:48.640 | Which was in itself a good thing

01:13:49.760 | because it's a fairly simple process, right?

01:13:51.960 | So you get images which are uploaded

01:13:54.080 | and you basically can immediately use them

01:13:55.800 | to train a model in an unsupervised way.

01:13:57.720 | You don't really need to sit and filter them out.

01:13:59.720 | These images can be cartoons, these can be memes,

01:14:02.040 | these can be actual pictures uploaded by people.

01:14:04.440 | And you don't really care about what these images are.

01:14:06.160 | You don't even care about what concepts they contain.

01:14:08.520 | So this was a very sort of simple setup.

01:14:10.280 | - What image selection mechanism would you say is there,

01:14:14.760 | like inherent in some aspect of the process?

01:14:18.840 | So you're kind of implying that there's almost none,

01:14:21.280 | but what is there, would you say,

01:14:23.600 | if you were to introspect?

01:14:24.960 | - Right, so it's not like, uncurated can basically,

01:14:28.480 | like one way of imagining uncurated

01:14:30.400 | is basically you have like cameras,

01:14:32.400 | like cameras that can take pictures at random viewpoints.

01:14:35.200 | When people upload pictures to the internet,

01:14:37.400 | they are typically going to care about the framing of it.

01:14:40.320 | They're not going to upload, say,

01:14:41.840 | the picture of a zoomed in wall, for example.

01:14:43.800 | - Well, when we say internet, do we mean social networks?

01:14:46.080 | - Yes. - Okay.

01:14:47.160 | - So these are not going to be like pictures

01:14:48.680 | of like a zoomed in table or a zoomed in wall.

01:14:51.400 | So it's not really completely uncurated

01:14:53.160 | because people do have the like photographer's bias,

01:14:55.800 | where they do want to keep things

01:14:57.040 | towards the center a little bit,

01:14:58.640 | or like really have like, you know,

01:15:00.280 | nice looking things and so on in the picture.

01:15:02.640 | So that's the kind of bias that typically exists

01:15:05.640 | in this dataset and also the user base, right?

01:15:07.720 | You're not going to get lots of pictures

01:15:09.320 | from different parts of the world

01:15:10.520 | because there are certain parts of the world

01:15:12.120 | where people may not actually be uploading

01:15:14.320 | a lot of pictures to the internet

01:15:15.440 | or may not even have access to a lot of internet.

01:15:17.360 | - So this is a giant dataset and a giant neural network.

01:15:21.720 | I don't think we've talked about

01:15:23.000 | what architectures work well for SSL,

01:15:27.960 | for self-supervised learning.

01:15:29.320 | - For SEER and for SWAB,

01:15:30.680 | we were using convolutional networks,

01:15:32.480 | but recently in a work called Dyno,

01:15:34.160 | we've basically started using transformers for vision.

01:15:36.880 | Both seem to work really well,

01:15:38.640 | convNets and transformers,

01:15:39.880 | and depending on what you want to do,

01:15:41.120 | you might choose to use a particular formulation.

01:15:43.560 | So for SEER, it was a convNet,

01:15:45.400 | it was particularly a regNet model,

01:15:47.480 | which was also work from Facebook.

01:15:49.720 | RegNets are like really good

01:15:51.200 | when it comes to compute versus like accuracy.

01:15:54.760 | So because it was a very efficient model,

01:15:56.920 | compute and memory wise efficient,

01:15:59.680 | and basically it worked really well in terms of scaling.

01:16:02.480 | So we used a very large regNet model

01:16:04.200 | and trained it on a billion images.

01:16:05.480 | - Can you maybe quickly comment on what regNets are?

01:16:08.640 | It comes from this paper,

01:16:10.680 | Designing Network Design Spaces.

01:16:13.520 | It's just super interesting concept

01:16:15.520 | that emphasizes on how to create

01:16:16.960 | efficient neural networks, large neural networks.

01:16:19.480 | - So one of the sort of key takeaways from this paper,

01:16:21.760 | which the author, like whenever you hear them

01:16:23.360 | present this work, they keep saying is,

01:16:26.040 | a lot of neural networks are characterized

01:16:27.920 | in terms of flops, right?

01:16:29.000 | Flops basically being the floating point operations.

01:16:31.440 | And people really love to use flops to say,

01:16:33.280 | this model is like really computationally heavy,

01:16:36.160 | or like our model is computationally cheap and so on.

01:16:38.960 | Now it turns out that flops are really not a good indicator

01:16:41.800 | of how well a particular network is,

01:16:43.800 | like how efficient it is really.

01:16:45.920 | And what a better indicator is,

01:16:47.880 | is the activation or the memory

01:16:49.640 | that is being used by this particular model.

01:16:52.080 | And so designing like one of the key findings

01:16:54.920 | from this paper was basically that

01:16:56.520 | you need to design network families

01:16:58.640 | or neural network architectures

01:17:00.080 | that are actually very efficient

01:17:01.360 | in the memory space as well,

01:17:02.720 | not just in terms of pure flops.

01:17:04.760 | So RegNet is basically a network architecture family

01:17:07.520 | that came out of this paper,

01:17:08.880 | that is particularly good at both flops

01:17:11.120 | and the sort of memory required for it.

01:17:13.520 | And of course, it builds upon like earlier work,

01:17:15.720 | like ResNet being like the sort of

01:17:17.440 | more popular inspiration for it,

01:17:18.960 | where you have residual connections.

01:17:20.360 | But one of the things in this work is basically,

01:17:22.360 | they also use like squeeze excitation blocks.

01:17:25.040 | So it's a lot of nice sort of technical innovation

01:17:27.040 | in all of this, from prior work

01:17:28.680 | and a lot of the ingenuity of these particular authors

01:17:31.360 | in how to combine these multiple building blocks.

01:17:34.080 | But the key constraint was optimized

01:17:35.960 | for both flops and memory when you're basically doing this,

01:17:38.280 | don't just look at flops.

01:17:39.520 | - And that allows you to what,

01:17:41.360 | have a sort of have very large networks

01:17:45.160 | through this process can optimize for low,

01:17:49.040 | like for efficiency, for low memory.

01:17:51.240 | - Also in just in terms of pure hardware,

01:17:53.560 | they fit very well on GPU memory.

01:17:55.800 | So they can be like really powerful

01:17:57.320 | neural network architectures with lots of parameters,

01:17:59.520 | lots of flops, but also because they're like efficient

01:18:01.960 | in terms of the amount of memory that they're using,

01:18:04.000 | you can actually fit a lot of these on like,

01:18:06.560 | you can fit a very large model on a single GPU, for example.

01:18:09.600 | - Would you say that the choice of architecture

01:18:14.240 | matters more than the choice

01:18:15.880 | of maybe data augmentation techniques?

01:18:18.520 | Is there a possibility to say what matters more?

01:18:21.680 | You kind of implied that you can probably go really far

01:18:24.360 | with just using basic conv nuts.

01:18:27.560 | - All right, I think data like data and data augmentation,

01:18:30.520 | the algorithm being used for the self supervised training

01:18:33.240 | matters a lot more than the particular kind of architecture.

01:18:36.360 | With different types of architecture,

01:18:37.600 | you will get different like properties

01:18:39.520 | in the resulting sort of representation.

01:18:41.640 | But really, I mean, the secret sauce

01:18:43.560 | is in the data augmentation

01:18:44.600 | and the algorithm being used to train them.

01:18:47.000 | The architectures, I mean, at this point,

01:18:49.160 | a lot of them perform very similarly,

01:18:51.640 | depending on like the particular task that you care about,

01:18:53.760 | they have certain advantages and disadvantages.

01:18:56.360 | - Is there something interesting to be said

01:18:58.080 | about what it takes with Sears

01:19:00.120 | to train a giant neural network?

01:19:01.880 | You're talking about a huge amount of data,

01:19:04.120 | a huge neural network.

01:19:05.760 | Is there something interesting to be said

01:19:07.720 | of how to effectively train something like that fast?

01:19:11.240 | - Lots of GPUs.

01:19:12.960 | - Okay, so.

01:19:13.800 | (both laughing)

01:19:15.440 | - I mean, so the model was like a billion parameters.

01:19:17.960 | - Yeah.

01:19:18.800 | - And it was trained on a billion images.

01:19:20.600 | - Yeah.

01:19:21.440 | - So if like, basically the same number of parameters

01:19:23.320 | as the number of images, and it took a while.

01:19:26.160 | I don't remember the exact number, it's in the paper,

01:19:28.600 | but it took a while.

01:19:29.440 | (both laughing)

01:19:31.840 | - I guess I'm trying to get at is

01:19:33.720 | when you're thinking of scaling this kind of thing,

01:19:38.640 | I mean, one of the exciting possibilities

01:19:41.880 | of self-supervised learning is the several orders

01:19:45.280 | of magnitude scaling of everything,

01:19:47.320 | both in your own network and the size of the data.

01:19:50.880 | And so the question is,

01:19:52.560 | do you think there's some interesting tricks

01:19:55.120 | to do large scale distributed compute,

01:19:57.840 | or is that really outside of even deep learning?

01:20:00.880 | That's more about like hardware engineering.

01:20:04.320 | - I think more and more there is like this,

01:20:06.760 | a lot of like systems are designed,

01:20:10.080 | basically taking into account

01:20:11.320 | the machine learning needs, right?

01:20:12.400 | So because whenever you're doing

01:20:14.320 | this kind of distributed training,

01:20:15.440 | there is a lot of intercommunication between nodes.

01:20:17.720 | So like gradients or the model parameters are being passed.

01:20:20.560 | So you really want to minimize communication costs

01:20:22.720 | when you really want to scale these models up.

01:20:25.200 | You want basically to be able to do as much,

01:20:29.120 | like as limited amount of communication as possible.

01:20:31.400 | So currently like a dominant paradigm

01:20:33.240 | is synchronized sort of training.

01:20:35.040 | So essentially after every sort of gradient step,

01:20:38.440 | all you basically have like a synchronization step

01:20:41.160 | between all the sort of compute chips

01:20:43.360 | that you're going on with.

01:20:44.680 | I think asynchronous training was popular,

01:20:47.800 | but it doesn't seem to perform as well.

01:20:50.360 | But in general, I think that sort of the,

01:20:53.280 | I guess it's outside my scope as well, yeah.

01:20:55.240 | But the main thing is like minimize the amount

01:20:58.960 | of synchronization steps that you have.

01:21:01.840 | That has been the key takeaway, at least in my experience.

01:21:04.600 | The others I have no idea about.

01:21:05.920 | How to design the chip.

01:21:06.760 | - Yeah, there's very few things that I see,

01:21:09.320 | Jim Keller's eyes light up as much

01:21:11.880 | as talking about giant computers doing

01:21:14.120 | like that fast communication that you're talking to,

01:21:17.600 | well, when they're training machine learning.

01:21:20.160 | Systems.

01:21:21.200 | What is VISL?

01:21:22.320 | V-I-S-S-L.

01:21:24.160 | The PyTorch based SSL library.

01:21:27.880 | What are the use cases that it might have?

01:21:30.080 | - VISL basically was born out of a lot of us at Facebook

01:21:33.000 | are doing the self-supervised learning research.

01:21:35.120 | So it's a common framework in which we have like a lot

01:21:38.680 | of self-supervised learning methods implemented for vision.

01:21:41.680 | It's also, it has in itself like a benchmark of tasks

01:21:45.920 | that you can evaluate the self-supervised representations on.

01:21:48.760 | So the use case for it is basically for anyone

01:21:51.200 | who's either trying to evaluate their self-supervised model

01:21:53.720 | or train their self-supervised model

01:21:55.960 | or a researcher who's trying to build

01:21:57.760 | a new self-supervised technique.

01:21:59.240 | So it's basically supposed to be all of these things.

01:22:01.480 | So as a researcher, before VISL, for example,

01:22:04.440 | or like when we started doing this work fairly seriously

01:22:06.920 | at Facebook, it was very hard for us to go

01:22:09.240 | and implement every self-supervised learning model,

01:22:11.840 | test it out in a like sort of consistent manner.

01:22:14.560 | The experimental setup was very different

01:22:16.400 | across different groups.

01:22:18.120 | Even when someone said that they were reporting

01:22:20.360 | image net accuracy, it could mean lots of different things.

01:22:23.160 | So with VISL, we tried to really sort of standardize that

01:22:25.320 | as much as possible.

01:22:26.360 | And there was a paper like we did in 2019

01:22:28.240 | just about benchmarking.

01:22:29.720 | And so VISL basically builds upon a lot of this kind of work

01:22:32.840 | that we did about like benchmarking.

01:22:35.120 | And then every time we try to like,

01:22:37.160 | we come up with a self-supervised learning method,

01:22:38.960 | a lot of us try to push that into VISL as well,

01:22:41.160 | just so that it basically is like the central piece

01:22:43.440 | where a lot of these methods can reside.

01:22:46.360 | - Just out of curiosity, like people may be,

01:22:49.200 | so certainly outside of Facebook, but just researchers,

01:22:52.000 | or just even people that know how to program in Python

01:22:54.920 | and know how to use PyTorch, what would be the use case?

01:22:58.640 | What would be a fun thing to play around with VISL on?

01:23:01.320 | Like what's a fun thing to play around

01:23:04.280 | with self-supervised learning on, would you say?

01:23:07.920 | Is there a good hello world program?

01:23:09.760 | Like is it always about big size that's important to have,

01:23:14.600 | or is there fun little smaller case playgrounds

01:23:18.840 | to play around with?

01:23:19.720 | - So we're trying to like push something towards that.

01:23:22.400 | I think there are a few setups out there,

01:23:24.320 | but nothing like super standard on the smaller scale.

01:23:26.800 | I mean, ImageNet in itself is actually pretty big also.

01:23:29.280 | So that is not something which is like feasible

01:23:32.240 | for a lot of people, but we are trying to like push up

01:23:34.840 | with like smaller sort of use cases.

01:23:36.360 | The thing is, at a smaller scale,

01:23:38.960 | a lot of the observations or a lot of the algorithms

01:23:41.200 | that work don't necessarily translate

01:23:42.720 | into the medium or the larger scale.

01:23:44.960 | So it's really tricky to come up

01:23:46.120 | with a good small scale setup

01:23:47.440 | where a lot of your empirical observations

01:23:49.120 | will really translate to the other setup.

01:23:51.520 | So it's been really challenging.

01:23:53.240 | I've been trying to do that for a little bit as well,

01:23:54.880 | because it does take time to train stuff on ImageNet.

01:23:56.800 | It does take time to train on like more images,

01:23:59.840 | but pretty much every time I've tried to do that,

01:24:02.200 | it's been unsuccessful because all the observations

01:24:04.080 | I draw from my set of experiments on a smaller data set

01:24:06.840 | don't translate into ImageNet,

01:24:09.120 | or like don't translate into another sort of data set.

01:24:11.720 | So it's been hard for us to figure this one out,

01:24:14.200 | but it's an important problem.

01:24:15.720 | - So there's this really interesting idea

01:24:17.920 | of learning across multiple modalities.

01:24:20.800 | You have a CVPR 2021 best paper candidate

01:24:25.800 | titled "Audio-Visual Instance Discrimination

01:24:29.200 | with Cross-Modal Agreement."

01:24:31.400 | What are the key results, insights in this paper,

01:24:33.840 | and what can you say in general about the promise

01:24:35.880 | and power of multimodal learning?

01:24:37.600 | - For this paper, it actually came as a little bit

01:24:39.920 | of a shock to me at how well it worked.

01:24:41.920 | So I can describe what the problem setup was.

01:24:44.120 | So it's been used in the past by lots of folks,

01:24:46.520 | like for example, Andrew Owens from MIT,

01:24:48.360 | Aljosha Efros from Berkeley,

01:24:49.920 | Andrew Zisserman from Oxford.

01:24:51.120 | So a lot of these people have been sort of

01:24:52.360 | showing results in this.

01:24:54.040 | Of course, I was aware of this result,

01:24:55.440 | but I wasn't really sure how well it would work in practice

01:24:58.560 | for like other sort of downstream tasks.

01:25:00.560 | So the results kept getting better,

01:25:02.400 | and I wasn't sure if like a lot of our insights

01:25:04.160 | from self-supervised learning would translate

01:25:05.880 | into this multimodal learning problem.

01:25:08.280 | So multimodal learning is when you have like,

01:25:11.840 | when you have multiple modalities.

01:25:14.160 | (laughs)

01:25:15.000 | And that's not even a quote.

01:25:15.840 | - Excellent.

01:25:16.680 | (laughs)

01:25:17.520 | - Okay, so the particular modalities that we worked on

01:25:20.000 | in this work were audio and video.

01:25:22.000 | So the idea was basically if you have a video,

01:25:23.880 | you have its corresponding audio track,

01:25:25.840 | and you want to use both of these signals,

01:25:27.520 | the audio signal and the video signal

01:25:29.240 | to learn a good representation for video

01:25:31.240 | and good representation for audio.

01:25:32.080 | - Like this podcast.

01:25:33.640 | - Like this podcast, exactly.

01:25:35.440 | So what we did in this work was basically trained

01:25:38.120 | two different neural networks,

01:25:39.400 | one on the video signal, one on the audio signal.

01:25:41.920 | And what we wanted is basically the features

01:25:43.800 | that we get from both of these neural networks

01:25:45.400 | should be similar.

01:25:46.800 | So it should basically be able to produce

01:25:48.720 | the same kinds of features from the video

01:25:51.120 | and the same kinds of features from the audio.

01:25:53.240 | Now, why is this useful?

01:25:54.280 | Well, for a lot of these objects that we have,

01:25:56.680 | there is a characteristic sound, right?

01:25:58.280 | So trains, when they go by,

01:25:59.520 | they make a particular kind of sound.

01:26:00.800 | Boats make a particular kind of sound.

01:26:02.520 | People, when they're jumping around,

01:26:03.840 | will like shout, "Whoo-ah," whatever.

01:26:06.240 | Bananas don't make a sound.

01:26:07.280 | So where you can't learn anything about bananas there.

01:26:09.360 | - Or when humans mention bananas.

01:26:11.640 | - Well, yes, when they say the word banana, then-

01:26:13.480 | - So you can't trust basically anything

01:26:15.080 | that comes out of a human's mouth

01:26:16.360 | as a source, that source of audio is useless.

01:26:18.720 | - So the typical use case is basically like,

01:26:20.640 | for example, someone playing a musical instrument.

01:26:22.440 | So guitars have a particular kind of sound and so on.

01:26:24.680 | So because a lot of these things are correlated,

01:26:27.080 | the idea in multimodal learning

01:26:28.440 | is to take these two kinds of modalities,

01:26:30.120 | video and audio, and learn a common embedding space,

01:26:33.120 | a common feature space where both of these

01:26:35.200 | related modalities can basically be close together.

01:26:38.560 | And again, you use contrastive learning for this.

01:26:40.600 | So in contrastive learning, basically the video

01:26:43.320 | and the corresponding audio are positives,

01:26:45.520 | and you can take any other video or any other audio,

01:26:48.200 | and that becomes a negative.

01:26:49.840 | And so basically that's it.

01:26:51.000 | It's just a simple application of contrastive learning.

01:26:53.720 | The main sort of finding from this work for us

01:26:55.960 | was basically that you can actually learn

01:26:58.720 | very, very powerful feature representations,

01:27:00.760 | very, very powerful video representations.

01:27:02.840 | So you can learn the sort of video network

01:27:05.400 | that we ended up learning can actually be used

01:27:07.480 | for downstream, for example, recognizing human actions

01:27:11.000 | or recognizing different types of sounds, for example.

01:27:14.440 | So this was sort of the key finding.

01:27:17.160 | - Can you give kind of an example of a human action

01:27:20.200 | or like just so we can build up intuition

01:27:23.360 | of what kind of thing?

01:27:24.360 | - Right, so there is this dataset called kinetics,

01:27:26.880 | for example, which has like 400 different types

01:27:28.640 | of human actions.

01:27:29.480 | So people jumping, people doing different kinds

01:27:32.360 | of sports or different types of swimming.

01:27:34.280 | So like different strokes in swimming, golf, and so on.

01:27:37.600 | So there are like just different types

01:27:39.200 | of actions right there.

01:27:40.560 | And the point is this kind of video network

01:27:42.600 | that you learn in a self-supervised way

01:27:44.400 | can be used very easily to kind of recognize

01:27:46.920 | these different types of actions.

01:27:48.920 | It can also be used for recognizing

01:27:50.440 | different types of objects.

01:27:51.760 | And what we did is we tried to visualize

01:27:54.800 | whether the network can figure out

01:27:56.080 | where the sound is coming from.

01:27:57.880 | So basically give it a video and basically play

01:28:01.000 | of say of a person just strumming a guitar,

01:28:03.000 | but of course there is no audio in this.

01:28:04.720 | And now you give it this sound of a guitar.

01:28:07.160 | And you ask, like basically try to visualize

01:28:08.880 | where the network is, where the network thinks

01:28:11.000 | the sound is coming from.

01:28:12.520 | And it can kind of basically draw like,

01:28:14.560 | when you visualize it, you can see

01:28:15.720 | that it's basically focusing on the guitar.

01:28:17.480 | - Yeah, that's surreal.

01:28:18.320 | - And the same thing, for example,

01:28:20.160 | for certain people's voices,

01:28:21.480 | like famous celebrities' voices,

01:28:22.920 | it can actually figure out where their mouth is.

01:28:26.080 | So it can actually distinguish different people's voices,

01:28:28.600 | for example, a little bit as well.

01:28:30.480 | Without that ever being annotated in any way.

01:28:33.640 | - Right, so this is all what it had discovered.

01:28:35.520 | We never pointed out that this is a guitar

01:28:38.200 | and this is the kind of sound it produces.

01:28:40.080 | It can actually naturally figure that out

01:28:41.520 | because it's seen so many correlations

01:28:43.480 | of this sound coming with this kind of like an object

01:28:46.680 | that it basically learns to associate this sound

01:28:49.040 | with this kind of an object.

01:28:50.000 | - Yeah, that's really fascinating, right?

01:28:52.800 | That's really interesting.

01:28:53.640 | So the idea with this kind of network

01:28:55.200 | is then you then fine tune it for a particular task.

01:28:57.920 | So this is forming like a really good knowledge base

01:29:01.880 | within a neural network based on which you could then

01:29:04.520 | train a little bit more to accomplish a specific task.

01:29:07.720 | - Right, exactly.

01:29:09.200 | So you don't need a lot of videos

01:29:11.120 | of humans doing actions annotated.

01:29:12.800 | You can just use a few of them to basically get your--

01:29:16.040 | - How much insight do you draw from the fact

01:29:18.520 | that it can figure out where the sound is coming from?

01:29:22.560 | I'm trying to see, so that's kind of very,

01:29:26.160 | it's very CVPR beautiful, right?

01:29:28.120 | It's a cool insight.

01:29:30.000 | I wonder how profound that is.

01:29:33.000 | Does it speak to the idea that multiple modalities

01:29:39.240 | are somehow much bigger than the sum of their parts

01:29:44.120 | or is it really, really useful to have multiple modalities

01:29:47.960 | or is this just a cool thing that there's parts

01:29:50.640 | of our world that can be revealed

01:29:55.440 | like effectively through multiple modalities

01:29:58.400 | but most of it is really all about vision

01:30:01.240 | or about one of the modalities?

01:30:03.920 | - I would say a little tending more towards the second part.

01:30:07.840 | So most of it can be sort of figured out with one modality

01:30:10.760 | but having an extra modality always helps you.

01:30:13.240 | So in this case, for example, like one thing is when you're,

01:30:17.800 | if you observe someone cutting something

01:30:19.440 | and you don't have any sort of sound there,

01:30:22.040 | whether it's an apple or whether it's an onion,

01:30:25.160 | it's very hard to figure that out.

01:30:26.760 | But if you hear someone cutting it,

01:30:28.280 | it's very easy to figure it out

01:30:29.840 | because apples and onions make a very different kind

01:30:32.920 | of characteristic sound when they're cut.

01:30:34.920 | So you really figure this out based on audio.

01:30:36.960 | It's much easier.

01:30:38.320 | So your life will become much easier

01:30:40.120 | when you have access to different kinds of modalities.

01:30:42.360 | And the other thing is, so I like to relate it in this way.

01:30:45.120 | It may be like completely wrong

01:30:46.440 | but the distributional hypothesis in NLP, right?

01:30:49.400 | Where context basically gives kind of meaning to that word.

01:30:53.120 | Sound kind of does that too, right?

01:30:55.120 | So if you have the same sound,

01:30:57.200 | so that's the same context across different videos,

01:30:59.920 | you're very likely to be observing

01:31:01.360 | the same kind of concept.

01:31:03.080 | So that's the kind of reason why it figures out

01:31:05.080 | the guitar thing, right?

01:31:06.520 | It observed the same sound across multiple different videos

01:31:09.840 | and it figures out maybe this is the common factor

01:31:11.960 | that's actually doing it.

01:31:13.320 | - I wonder, I used to have this argument with my dad a bunch

01:31:17.520 | for creating general intelligence,

01:31:19.840 | whether a smell is an important,

01:31:22.920 | like if that's important sensory information.

01:31:25.560 | Mostly we're talking about like falling in love

01:31:27.600 | with an AI system.

01:31:29.000 | And for him, smell and touch are important.

01:31:31.480 | And I was arguing that it's not at all.

01:31:33.920 | It's important, it's nice and everything,

01:31:35.360 | but like you can fall in love with just language really,

01:31:38.440 | but voice is very powerful and vision is next

01:31:41.440 | and smell is not that important.

01:31:43.920 | Can I ask you about this process of active learning?

01:31:46.920 | You mentioned interactivity.

01:31:49.240 | - Right.

01:31:50.080 | - Is there some value within the self-supervised

01:31:55.080 | learning context to select parts of the data

01:32:00.160 | in intelligent ways such that they would most benefit

01:32:05.080 | the learning process?

01:32:06.400 | - Right, so I think so.

01:32:07.520 | I think, I mean, I know I'm talking

01:32:09.200 | to an active learning fan here,

01:32:10.320 | so of course I know the answer.

01:32:12.600 | - First you were talking bananas

01:32:14.000 | and now you're talking about active learning, I love it.

01:32:16.720 | I think Yann LeCun told me that active learning

01:32:18.800 | is not that interesting.

01:32:20.480 | I think back then I didn't want to argue with him too much,

01:32:24.400 | but when we talk again, we're gonna spend three hours

01:32:27.200 | arguing about active learning.

01:32:28.440 | My sense was you can go extremely far with active learning,

01:32:32.480 | you know, perhaps farther than anything else.

01:32:34.960 | Like to me, there's this kind of intuition

01:32:38.000 | that similar to data augmentation,

01:32:40.880 | you can get a lot from the data,

01:32:45.320 | from intelligent optimized usage of the data.

01:32:50.320 | I'm trying to speak generally in such a way

01:32:53.240 | that includes data augmentation and active learning,

01:32:57.080 | that there's something about maybe interactive exploration

01:32:59.920 | of the data that at least this part of the solution

01:33:04.360 | to intelligence, like an important part.

01:33:07.160 | I don't know what your thoughts are

01:33:08.240 | on active learning in general.

01:33:09.360 | - I actually really like active learning.

01:33:10.880 | So back in the day we did this largely ignored CVPR paper

01:33:14.240 | called learning by asking questions.

01:33:16.560 | So the idea was basically you would train an agent

01:33:18.280 | that would ask a question about the image,

01:33:20.140 | it would get an answer,

01:33:21.560 | and basically then it would update itself,

01:33:23.400 | it would see the next image,

01:33:24.400 | it would decide what's the next hardest question

01:33:26.840 | that I can ask to learn the most.

01:33:28.800 | And the idea was basically because it was being smart

01:33:31.340 | about the kinds of questions it was asking,

01:33:33.520 | it would learn in fewer samples,

01:33:35.120 | it would be more efficient at using data.

01:33:37.920 | And we did find to some extent that it was actually better

01:33:40.360 | than randomly asking questions.

01:33:42.060 | Kind of weird thing about active learning

01:33:43.520 | is it's also a chicken and egg problem

01:33:45.200 | because when you look at an image,

01:33:47.160 | to ask a good question about the image,

01:33:48.680 | you need to understand something about the image.

01:33:50.920 | You can't ask a completely arbitrarily random question,

01:33:53.480 | it may not even apply to that particular image.

01:33:55.520 | So there is some amount of understanding or knowledge

01:33:57.640 | that basically keeps getting built

01:33:59.200 | when you're doing active learning.

01:34:01.320 | So I think active learning by itself is really good.

01:34:04.600 | And the main thing we need to figure out

01:34:06.400 | is basically how do we come up with a technique

01:34:09.640 | to first model what the model knows,

01:34:13.340 | and also model what the model does not know.

01:34:16.040 | I think that's the sort of beauty of it, right?

01:34:18.360 | Because when you know that there are certain things

01:34:20.480 | that you don't know anything about,

01:34:22.160 | asking a question about those concepts

01:34:23.640 | is actually going to bring you the most value.

01:34:26.520 | And I think that's the sort of key challenge.

01:34:28.360 | Now self-supervised learning by itself,

01:34:29.960 | like selecting data for it and so on,

01:34:31.480 | that's actually really useful.

01:34:32.680 | But I think that's a very narrow view

01:34:34.000 | of looking at active learning, right?

01:34:35.120 | If you look at it more broadly,

01:34:36.360 | it is basically about if the model has a knowledge

01:34:40.040 | about n concepts,

01:34:41.400 | and it is weak basically about certain things,

01:34:43.880 | so it needs to ask questions

01:34:45.300 | either to discover new concepts

01:34:46.900 | or to basically increase its knowledge

01:34:49.200 | about these n concepts.

01:34:50.400 | So at that level, it's a very powerful technique.

01:34:53.220 | I actually do think it's going to be really useful.

01:34:56.540 | Even in simple things such as data labeling,

01:34:59.060 | it's super useful.

01:35:00.260 | So here is one simple way

01:35:02.940 | that you can use active learning.

01:35:04.300 | For example, you have your self-supervised model,

01:35:06.900 | which is very good at predicting similarities

01:35:08.780 | and dissimilarities between things.

01:35:10.780 | And so if you label a picture as basically say, a banana,

01:35:14.640 | now you know that all the images

01:35:17.740 | that are very similar to this image

01:35:19.220 | are also likely to contain bananas.

01:35:21.500 | So probably when you want to understand

01:35:24.260 | what else is a banana,

01:35:25.180 | you're not going to use these other images.

01:35:26.920 | You're actually going to use an image

01:35:28.200 | that is not completely dissimilar,

01:35:31.160 | but somewhere in between,

01:35:32.340 | which is not super similar to this image,

01:35:33.860 | but not super dissimilar either.

01:35:35.660 | And that's going to tell you a lot more

01:35:37.140 | about what this concept of a banana is.

01:35:39.540 | - So that's kind of a heuristic.

01:35:41.860 | I wonder if it's possible to also learn ways

01:35:46.860 | to discover the most likely, the most beneficial image.

01:35:51.940 | So not just looking at a thing

01:35:55.020 | that's somewhat similar to a banana,

01:35:58.460 | but not exactly similar,

01:36:00.000 | but have some kind of more complicated learning system,

01:36:03.560 | like learned discovery mechanism

01:36:07.100 | that tells you what image to look for.

01:36:09.440 | - Yeah, exactly.

01:36:10.940 | - Yeah, like actually in a self-supervised way,

01:36:14.340 | learning strictly a function that says,

01:36:17.300 | is this image going to be very useful to me,

01:36:20.600 | given what I currently know?

01:36:22.140 | - I think there's a lot of synergy there.

01:36:24.020 | It's just, I think, yeah, it's going to be explored.

01:36:26.760 | - I think very much related to that,

01:36:29.420 | I kind of think of what Tesla Autopilot is doing

01:36:32.420 | currently as kind of active learning.

01:36:36.900 | There's something that Andrei Kapathia and their team

01:36:39.280 | are calling a data engine.

01:36:41.300 | - Yes.

01:36:42.140 | - So you're basically deploying a bunch of instantiations

01:36:45.700 | of a neural network into the wild,

01:36:47.820 | and they're collecting a bunch of edge cases

01:36:50.700 | that are then sent back for annotation for particular,

01:36:53.980 | and edge cases as defined as near failure

01:36:56.700 | or some weirdness on a particular task

01:36:59.980 | that's then sent back.

01:37:01.420 | It's that not exactly a banana,

01:37:04.020 | but almost a banana cases sent back for annotation.

01:37:07.220 | And then there's this loop that keeps going,

01:37:09.260 | and you keep retraining and retraining.

01:37:11.620 | And the active learning step there,

01:37:13.300 | or whatever you want to call it,

01:37:14.820 | is the cars themselves that are sending you back the data,

01:37:19.100 | like what the hell happened here?

01:37:20.780 | This was weird.

01:37:22.840 | What are your thoughts about that sort of deployment

01:37:26.460 | of neural networks in the wild?

01:37:28.260 | Another way to ask a question,

01:37:30.100 | but first is your thoughts,

01:37:31.360 | and maybe if you want to comment,

01:37:33.860 | is there applications for autonomous driving?

01:37:36.960 | Like computer vision based autonomous driving,

01:37:40.180 | applications of self-supervised learning

01:37:42.060 | in the context of computer vision based autonomous driving?

01:37:46.120 | - So I think so.

01:37:48.380 | I think for self-supervised learning to be used

01:37:50.060 | in autonomous driving, there are lots of opportunities.

01:37:52.580 | And just like pure consistency in predictions is one way.

01:37:55.860 | So because you have this nice sequence of data

01:38:00.300 | that is coming in, a video stream of it,

01:38:02.340 | associated of course with the actions,

01:38:04.100 | let's say the car took,

01:38:05.260 | you can form a very nice predictive model

01:38:07.680 | of what's happening.

01:38:08.520 | So for example, like all the way,

01:38:10.660 | like one way possibly in which how they're figuring out

01:38:14.500 | what data to get labeled is basically

01:38:15.940 | through prediction uncertainty, right?

01:38:17.500 | So you predict that the car was going to turn right.

01:38:20.420 | So this was the action that was going to happen

01:38:21.900 | say in the shadow mode, and now the driver turned left.

01:38:24.700 | And this is a really big surprise.

01:38:27.220 | So basically by forming these good predictive models,

01:38:30.180 | you are, I mean, these are kind of self-supervised models.

01:38:32.900 | Prediction models are basically being trained

01:38:34.660 | just by looking at what's going to happen next

01:38:36.820 | and asking them to predict what's going to happen next.

01:38:38.980 | So I would say this is really like one use

01:38:40.780 | of self-supervised learning.

01:38:42.340 | It's a predictive model,

01:38:43.460 | and you're learning a predictive model

01:38:44.700 | basically just by looking at what data you have.

01:38:46.900 | - Is there something about that active learning context

01:38:49.620 | that you find insights from?

01:38:53.020 | Like that kind of deployment of the system,

01:38:54.780 | seeing cases where it doesn't perform as you expected

01:38:59.140 | and then retraining the system based on that?

01:39:01.020 | - I think that, I mean, that really resonates with me.

01:39:03.620 | It's super smart to do it that way.

01:39:05.580 | Because I mean, the thing is with any kind

01:39:08.540 | of like practical system like autonomous driving,

01:39:11.180 | there are those edge cases are the things

01:39:13.060 | that are actually the problem, right?

01:39:14.540 | I mean, highway driving or like freeway driving

01:39:17.420 | has basically been, like there has been a lot of success

01:39:20.140 | in that particular part of autonomous driving

01:39:21.860 | for a long time.

01:39:23.100 | I would say like since the 80s or something.

01:39:25.580 | Now, the point is all these failure cases

01:39:28.020 | are the sort of reason why autonomous driving hasn't come,

01:39:31.420 | it hasn't become like super, super mainstream

01:39:33.180 | and available like in every possible car right now.

01:39:35.620 | And so basically by really scaling this problem out

01:39:38.180 | by really trying to get all of these edge cases out

01:39:40.420 | as quickly as possible,

01:39:41.820 | and then just like using those to improve your model,

01:39:43.860 | that's super smart.

01:39:45.580 | And prediction uncertainty to do that

01:39:47.060 | is like one really nice way of doing it.

01:39:49.740 | - Let me put you on the spot.

01:39:52.020 | So we mentioned offline Jitendra.

01:39:55.220 | He thinks that the Tesla computer vision approach

01:39:58.180 | or really any approach for autonomous driving

01:40:00.780 | is very far away.

01:40:02.660 | How many years away,

01:40:05.420 | if you had to bet all your money on it,

01:40:06.940 | are we to solving autonomous driving

01:40:09.580 | with this kind of computer vision only

01:40:11.980 | machine learning based approach?

01:40:13.580 | - Okay, so what does solving autonomous driving mean?

01:40:15.380 | Does it mean solving it in the US?

01:40:17.180 | Does it mean solving it in India?

01:40:18.420 | Because I can tell you

01:40:19.260 | that very different types of driving have been.

01:40:21.140 | - Not India, not Russia.

01:40:23.740 | In the United States, autonomous,

01:40:26.220 | so what solving means is when the car says

01:40:30.420 | it has control, it is fully liable.

01:40:34.140 | You can go to sleep, it's driving by itself.

01:40:37.860 | So this is highway and city driving,

01:40:39.820 | but not everywhere, but mostly everywhere.

01:40:42.380 | And it's, let's say significantly better,

01:40:45.140 | like say five times less accidents than humans.

01:40:50.140 | Sufficiently safer such that the public feels

01:40:54.060 | like that transition is enticing beneficial,

01:40:58.020 | both for our safety and financially

01:40:59.580 | and all those kinds of things.

01:41:01.140 | - Okay, so first disclaimer,

01:41:02.380 | I'm not an expert in autonomous driving.

01:41:04.300 | So let me put it out there.

01:41:06.020 | I would say like at least five to 10 years.

01:41:08.420 | This would be my guess from now.

01:41:11.860 | Yeah, I'm actually very impressed.

01:41:14.740 | Like when I sat in a friend's Tesla recently

01:41:16.900 | and of course like looking, so it can,

01:41:20.140 | on the screen it basically shows all the detections

01:41:22.300 | and everything the car is doing as you're driving by.

01:41:24.740 | And that's super distracting for me as a person

01:41:26.940 | because all I keep looking at is like the bounding boxes

01:41:29.500 | in the car that's tracking.

01:41:30.860 | And it's really impressive,

01:41:31.820 | like especially when it's raining and it's able to do that.

01:41:34.340 | That was the most impressive part for me.

01:41:36.060 | It's actually able to get through rain and do that.

01:41:38.580 | And one of the reasons why like a lot of us believed

01:41:41.780 | and I would put myself in that category

01:41:44.100 | is LIDAR based sort of technology

01:41:46.900 | for autonomous driving was the key driver, right?

01:41:48.780 | So Waymo was using it for the longest time.

01:41:51.060 | And Tesla then decided to go this completely other route

01:41:53.340 | that we're not going to even use LIDAR.

01:41:55.820 | So their initial system I think was camera and radar based.

01:41:58.780 | And now they're actually moving

01:41:59.700 | to a completely like vision based system.

01:42:02.060 | And so that was just like, it sounded completely crazy.

01:42:04.700 | Like LIDAR is very useful in cases

01:42:07.100 | where you have low visibility.

01:42:09.300 | Of course it comes with its own set of complications.

01:42:11.780 | But now to see that happen in like on a live Tesla,

01:42:15.220 | that basically just proves everyone wrong,

01:42:17.060 | I would say in a way.

01:42:18.180 | And that's just working really well.

01:42:20.620 | I think there were also like a lot of advancements

01:42:22.780 | in camera technology.

01:42:23.980 | Now there were like, I know at CMU when I was there,

01:42:26.340 | there was a particular kind of camera

01:42:28.020 | that had been developed that was really good

01:42:30.140 | at basically low visibility settings.

01:42:32.860 | So like lots of snow and lots of rain,

01:42:34.500 | it could actually still have a very reasonable visibility.

01:42:37.700 | And I think there are lots of these kinds of innovations

01:42:39.460 | that will happen on the sensor side itself,

01:42:41.020 | which is actually going to make this very easy

01:42:42.900 | in the future.

01:42:43.900 | And so maybe that's actually why I'm more optimistic

01:42:46.140 | about vision based autonomous driving.

01:42:49.060 | It's gonna call it self supervised driving.

01:42:50.500 | But vision based autonomous driving,

01:42:53.580 | that's the reason I'm quite optimistic about it.

01:42:55.500 | Because I think there are going to be lots

01:42:56.700 | of these advances on the sensor side itself.

01:42:58.980 | So acquiring this data,

01:43:00.740 | we're actually going to get much better about it.

01:43:02.660 | And then of course, once we're able to scale out

01:43:05.100 | and get all of these edge cases in,

01:43:06.820 | as like Andre described,

01:43:08.780 | I think that's going to make us go very far away.

01:43:11.740 | - Yeah, so it's funny,

01:43:13.620 | I'm very much with you on the five to 10 years,

01:43:16.300 | maybe 10 years, but you made it,

01:43:20.100 | I'm not sure how you made it sound,

01:43:21.820 | but for some people that might seem like really far away,

01:43:25.380 | and then for other people,

01:43:26.940 | it might seem like very close.

01:43:30.460 | There's a lot of fundamental questions

01:43:32.300 | about how much game theory is in this whole thing.

01:43:36.900 | So like how much is this simply collision avoidance problem?

01:43:41.900 | And how much of it is,

01:43:44.340 | you're still interacting with other humans in the scene,

01:43:46.980 | and you're trying to create an experience that's compelling.

01:43:49.460 | So you want to get from point A to point B quickly,

01:43:53.060 | you want to navigate the scene in a safe way,

01:43:55.260 | but you also want to show some level of aggression,

01:43:58.500 | because, well, certainly this is why you're screwed in India,

01:44:01.980 | because you have to show aggression.

01:44:03.380 | - Or Jersey, or New Jersey.

01:44:04.900 | (both laughing)

01:44:07.020 | - So like, or New York, or basically any major city.

01:44:11.180 | But I think it's probably Elon

01:44:13.220 | that I talked the most about this,

01:44:14.780 | which is surprised to the level of which

01:44:17.700 | they're not considering human beings

01:44:20.100 | as a huge problem in this, as a source of problem.

01:44:22.980 | Like the driving is fundamentally a robot on robot

01:44:27.980 | versus the environment problem,

01:44:31.180 | versus like, you can just consider humans

01:44:34.020 | not part of the problem.

01:44:35.180 | I used to think humans are, almost certainly,

01:44:38.860 | have to be modeled really well.

01:44:41.220 | Pedestrians and cyclists and humans inside of the cars,

01:44:44.380 | you have to have like mental models for them.

01:44:46.380 | You cannot just see it as objects.

01:44:48.340 | But more and more, it's like the,

01:44:51.420 | it's the same kind of intuition breaking thing

01:44:53.700 | that self-supervised learning does,

01:44:56.180 | which is, well, maybe through the learning,

01:44:58.820 | you'll get all the human, like human information you need.

01:45:03.820 | Right? - Right.

01:45:05.060 | - Maybe you'll get it just with enough data.

01:45:07.740 | You don't need to have explicit good models

01:45:09.660 | of human behavior.

01:45:10.820 | Maybe you get it through the data.

01:45:12.140 | So, I mean, my skepticism also just knowing

01:45:14.660 | a lot of automotive companies

01:45:16.340 | and how difficult it is to be innovative,

01:45:18.620 | I was skeptical that they would be able at scale

01:45:22.540 | to convert the driving scene across the world

01:45:27.380 | into digital form,

01:45:29.060 | such that you can create this data engine at scale.

01:45:33.140 | And the fact that Tesla is at least getting there

01:45:36.660 | or are already there,

01:45:39.460 | makes me think that it's now starting to be coupled

01:45:43.660 | to the self-supervised learning vision,

01:45:47.620 | which is like, if that's gonna work,

01:45:49.860 | if through purely this process, you can get really far,

01:45:52.980 | then maybe you can solve driving that way.

01:45:54.900 | I don't know.

01:45:55.740 | I tend to believe we don't give enough credit

01:46:00.060 | to how amazing humans are both at driving

01:46:05.060 | and at supervising autonomous systems.

01:46:09.420 | And also we don't, this is, I wish we were,

01:46:13.260 | I wish there was much more driver sensing inside Teslas

01:46:17.180 | and much deeper consideration of human factors,

01:46:21.220 | like understanding psychology and drowsiness

01:46:24.700 | and all those kinds of things.

01:46:26.220 | When the car does more and more of the work,

01:46:28.740 | how to keep utilizing the little human supervision

01:46:32.980 | that are needed to keep this whole thing safe.

01:46:35.100 | I mean, it's a fascinating dance

01:46:36.580 | of human robot interaction.

01:46:38.460 | To me, autonomous driving for a long time

01:46:42.140 | is a human robot interaction problem.

01:46:45.020 | It is not a robotics problem or computer vision problem.

01:46:48.020 | Like you have to have a human in the loop,

01:46:49.980 | but so, which is why I think it's 10 years plus,

01:46:53.340 | but I do think there'll be a bunch of cities and contexts

01:46:56.300 | where geo-restricted, it will work really, really damn well.

01:47:01.300 | - Yeah.

01:47:02.620 | So I think for me, like it's five, if I'm being optimistic

01:47:04.980 | and it's going to be five for a lot of cases

01:47:07.340 | and 10 plus, yeah, I agree with you.

01:47:09.220 | 10 plus basically, if we want to like cover most of say,

01:47:13.420 | contiguous United States or something.

01:47:15.220 | - Oh, interesting.

01:47:16.060 | So my optimistic is five and pessimistic is 30.

01:47:20.300 | - 30.

01:47:21.140 | - I have a long tail on this one.

01:47:22.460 | I've watched enough driving videos.

01:47:24.420 | I've watched enough pedestrians to think like,

01:47:27.940 | we may be like, there's a small part of me still,

01:47:31.140 | not a small, like a pretty big part of me

01:47:33.860 | that thinks we will have to build AGI to solve driving.

01:47:37.540 | - Oh wow.

01:47:38.460 | - Like there's something to me like,

01:47:40.020 | because humans are part of the picture,

01:47:41.780 | deeply part of the picture

01:47:44.020 | and also human society is part of the picture

01:47:46.060 | in that human life is at stake.

01:47:47.940 | Anytime a robot kills a human,

01:47:50.860 | it's not clear to me that that's not a problem

01:47:54.300 | that machine learning will also have to solve.

01:47:56.340 | Like it has to, you have to integrate that

01:47:59.400 | into the whole thing,

01:48:00.240 | just like Facebook or social networks.

01:48:02.940 | You know, one thing is to say

01:48:04.120 | how to make a really good recommender system.

01:48:06.700 | And then the other thing is to integrate

01:48:08.660 | into that recommender system,

01:48:10.260 | all the journalists that will write articles

01:48:12.060 | about that recommender system.

01:48:13.860 | Like you have to consider the society

01:48:15.880 | within which the AI system operates.

01:48:18.380 | And in order to, and like politicians too,

01:48:20.700 | you know, this is the regulatory stuff

01:48:22.660 | for autonomous driving.

01:48:24.200 | It's kind of fascinating that the more successful

01:48:26.700 | your AI system becomes,

01:48:28.700 | the more it gets integrated in society

01:48:31.580 | and the more precious politicians and the public

01:48:34.500 | and the clickbait journalists

01:48:35.980 | and all the different fascinating forces of our society

01:48:39.060 | start acting on it.

01:48:40.380 | And then it's no longer how good you are

01:48:42.220 | at doing the initial task.

01:48:43.980 | It's also how good you are at navigating human nature,

01:48:47.020 | which is a fascinating space.

01:48:49.940 | What do you think are the limits of deep learning?

01:48:52.620 | If you allow me, we'll zoom out a little bit

01:48:54.820 | into the big question of artificial intelligence.

01:48:58.140 | You said dark matter of intelligence

01:49:01.260 | is self-supervised learning,

01:49:02.820 | but there could be more.

01:49:04.340 | What do you think the limits of self-supervised learning

01:49:07.780 | and just learning in general, deep learning are?

01:49:10.740 | - I think like for deep learning in particular,

01:49:12.700 | because self-supervised learning is,

01:49:14.180 | I would say a little bit more vague right now.

01:49:16.820 | So I wouldn't, like for something that's so vague,

01:49:18.700 | it's hard to predict what its limits are going to be.

01:49:21.980 | But like I said, I think anywhere you want to interact

01:49:25.260 | with human self-supervised learning

01:49:26.540 | kind of hits a boundary very quickly

01:49:28.460 | because you need to have an interface

01:49:29.940 | to be able to communicate with a human.

01:49:31.620 | So really like if you have just like vacuous concepts

01:49:35.060 | or like just like nebulous concepts

01:49:36.900 | discovered by a network,

01:49:38.580 | it's very hard to communicate those with a human

01:49:40.380 | without like inserting some kind of human knowledge

01:49:42.420 | or some kind of like human bias there.

01:49:44.380 | In general, I think for deep learning,

01:49:47.020 | the biggest challenge is just like data efficiency.

01:49:50.660 | Even with self-supervised learning,

01:49:52.620 | even with anything else,

01:49:53.540 | if you just see a single concept once,

01:49:57.460 | like one image of a, like, I don't know,

01:49:59.860 | whatever you want to call it, like any concept,

01:50:02.500 | it's really hard for these methods to generalize

01:50:04.820 | by looking at just one or two samples of things.

01:50:07.700 | And that has been a real, real challenge.

01:50:09.740 | And I think that's actually why like these edge cases,

01:50:11.660 | for example, for Tesla are actually that important.

01:50:14.500 | Because if you see just one instance of the car failing,

01:50:18.020 | and if you just annotate that

01:50:19.300 | and you get that into your dataset,

01:50:21.900 | you have like very limited guarantee

01:50:23.540 | that it's not going to happen again.

01:50:25.140 | And you're actually going to be able to recognize

01:50:26.740 | this kind of instance in a very different scenario.

01:50:28.620 | So like when it was snowing,

01:50:30.300 | so you got that thing labeled when it was snowing,

01:50:32.020 | but now when it's raining,

01:50:33.220 | you're actually not able to get it.

01:50:34.620 | Or you basically have the same scenario

01:50:36.580 | in a different part of the world.

01:50:37.420 | So the lighting was different or so on.

01:50:39.100 | So it's just really hard for these models,

01:50:41.020 | like deep learning, especially to do that.

01:50:42.700 | - What's your intuition?

01:50:43.540 | How do we solve handwritten digit recognition problem

01:50:47.540 | when we only have one example for each number?

01:50:51.220 | It feels like humans are using something like learning.

01:50:54.700 | - Right, I think it's,

01:50:56.020 | we are good at transferring knowledge a little bit.

01:50:59.260 | We are just better at, like, for a lot of these problems

01:51:02.620 | where we are generalizing from a single sample

01:51:04.820 | or recognizing from a single sample,

01:51:06.940 | we are using a lot of our own domain knowledge

01:51:08.740 | and a lot of our like inductive bias

01:51:10.340 | into that one sample to generalize it.

01:51:12.300 | So I've never seen you write the number nine, for example.

01:51:15.300 | And if you were to write it, I would still get it.

01:51:17.460 | And if you were to write a different kind of alphabet

01:51:19.260 | and like write it in two different ways,

01:51:20.820 | I would still probably be able to figure out

01:51:22.340 | that these are the same two characters.

01:51:24.700 | It's just that I have been very used

01:51:26.300 | to seeing handwritten digits in my life.

01:51:29.020 | The other sort of problem with any deep learning system

01:51:31.340 | or any kind of machine learning system

01:51:32.660 | is like it's guarantees, right?

01:51:34.140 | There are no guarantees for it.

01:51:35.820 | Now you can argue that humans also don't have any guarantees

01:51:38.100 | like there is no guarantee that I can recognize a cat

01:51:41.100 | in every scenario.

01:51:42.180 | I'm sure there are going to be lots of cats

01:51:43.860 | that I don't recognize,

01:51:44.980 | lots of scenarios in which I don't recognize cats in general.

01:51:48.060 | But I think from like,

01:51:50.180 | from just a sort of application perspective,

01:51:52.780 | you do need guarantees, right?

01:51:54.700 | We call these things algorithms.

01:51:56.900 | Now algorithms, like traditional CS algorithms

01:51:59.020 | have guarantees, sorting is a guarantee.

01:52:01.420 | If you were to like call sort

01:52:03.380 | on a particular array of numbers,

01:52:05.540 | you are guaranteed that it's going to be sorted.

01:52:07.580 | Otherwise it's a bug.

01:52:09.260 | Now for machine learning,

01:52:10.100 | it's very hard to characterize this.

01:52:12.380 | We know for a fact that like a cat recognition model

01:52:15.380 | is not going to recognize cats,

01:52:16.980 | every cat in the world in every circumstance.

01:52:19.660 | I think most people would agree with that statement.

01:52:21.980 | But we are still okay with it.

01:52:23.540 | We still don't call this as a bug.

01:52:25.340 | Whereas in traditional computer science

01:52:26.660 | or traditional science,

01:52:27.820 | like if you have this kind of failure case existing,

01:52:29.900 | then you think of it as like something is wrong.

01:52:33.140 | I think there is this sort of notion

01:52:34.460 | of nebulous correctness for machine learning.

01:52:36.980 | And that's something we just need

01:52:37.820 | to be very comfortable with.

01:52:39.420 | And for deep learning,

01:52:40.500 | or like for a lot of these machine learning algorithms,

01:52:42.660 | it's not clear how do we characterize

01:52:44.660 | this notion of correctness.

01:52:46.260 | I think limitation in our understanding,

01:52:48.100 | or at least a limitation in our phrasing of this.

01:52:51.100 | And if we were to come up with better ways

01:52:53.020 | to understand this limitation,

01:52:55.020 | then it would actually help us a lot.

01:52:57.140 | - Do you think there's a distinction

01:52:58.820 | between the concept of learning

01:53:01.780 | and the concept of reasoning?

01:53:03.340 | Do you think it's possible for neural networks to reason?

01:53:09.220 | - So I think of it slightly differently.

01:53:11.620 | So for me, learning is whenever

01:53:14.460 | I can like make a snap judgment.

01:53:16.020 | So if you show me a picture of a dog,

01:53:17.140 | I can immediately say it's a dog.

01:53:18.820 | But if you give me like a puzzle,

01:53:20.380 | you know, like whatever,

01:53:22.060 | Goldsberg machine of like,

01:53:24.180 | things going to happen,

01:53:25.020 | then I have to reason.

01:53:25.860 | I've never, it's a very complicated setup.

01:53:27.620 | I've never seen that particular setup.

01:53:29.340 | And I really need to, you know,

01:53:30.660 | draw and like imagine in my head

01:53:32.220 | what's going to happen to figure it out.

01:53:34.660 | So I think yes,

01:53:35.500 | neural networks are really good at recognition,

01:53:38.940 | but they're not very good at reasoning.

01:53:41.180 | Because they're like,

01:53:42.020 | if they have seen something before,

01:53:44.140 | or seen something similar before,

01:53:45.820 | they're very good at making those sort of snap judgments.

01:53:48.260 | But if you were to give them a very complicated thing

01:53:50.740 | that they've not seen before,

01:53:52.500 | they have very limited ability right now

01:53:55.340 | to compose different things like,

01:53:56.740 | oh, I've seen this particular part before,

01:53:58.260 | I've seen this particular part before.

01:54:00.060 | And now probably like,

01:54:00.980 | this is how they're going to work in tandem.

01:54:02.940 | It's very hard for them to come up

01:54:04.140 | with these kinds of things.

01:54:05.220 | - Well, there's a certain aspect to reasoning

01:54:08.820 | that you can maybe convert into the process of programming.

01:54:11.860 | And so there's the whole field of program synthesis

01:54:14.300 | and people have been applying machine learning

01:54:17.220 | to the problem of program synthesis.

01:54:18.900 | And the question is, you know,

01:54:20.140 | can they, the step of composition,

01:54:22.700 | why can't that be learned?

01:54:25.260 | You know, the step of like building things on top of,

01:54:29.380 | like little intuitions, concepts on top of each other,

01:54:33.220 | can that be learnable?

01:54:35.260 | What's your intuition there?

01:54:36.780 | Or like, I guess, similar set of techniques,

01:54:39.460 | do you think that would be applicable?

01:54:42.060 | - So I think it is, of course, learnable.

01:54:43.980 | It is learnable because like,

01:54:45.100 | we are prime examples of machines that have like,

01:54:47.620 | or individuals that have learned this, right?

01:54:49.500 | Like humans have learned this.

01:54:51.100 | So it is, of course,

01:54:51.940 | it is a technique that is very easy to learn.

01:54:54.820 | I think where we are kind of hitting a wall,

01:54:58.420 | basically with like current machine learning

01:55:00.500 | is the fact that when the network learns

01:55:03.420 | all of this information,

01:55:04.660 | we basically are not able to figure out

01:55:07.500 | how well it's going to generalize to an unseen thing.

01:55:10.620 | And we have no, like a priori,

01:55:12.420 | no way of characterizing that.

01:55:13.900 | And I think that's basically telling us a lot about,

01:55:18.460 | like a lot about the fact that we really don't know

01:55:20.700 | what this model has learned and how well it's basically,

01:55:22.740 | because we don't know how well it's going to transfer.

01:55:25.140 | - There's also a sense in which it feels like

01:55:28.100 | we humans may not be aware of how much like background,

01:55:33.100 | how good our background model is,

01:55:36.860 | how much knowledge we just have

01:55:38.780 | slowly building on top of each other.

01:55:41.540 | It feels like neural networks

01:55:42.620 | are constantly throwing stuff out.

01:55:43.980 | Like you'll do some incredible thing

01:55:45.500 | where you're learning a particular task in computer vision,

01:55:49.180 | you celebrate your state of the art successes

01:55:51.380 | and you throw that out.

01:55:52.860 | Like it feels like it's,

01:55:54.420 | you're never using stuff you've learned

01:55:56.860 | for your future successes in other domains.

01:56:00.220 | And humans are obviously doing that exceptionally well,

01:56:03.380 | still throwing stuff away in their mind,

01:56:05.980 | but keeping certain kernels of truth.

01:56:07.980 | - Right, so I think we're like,

01:56:09.340 | continual learning is sort of the paradigm

01:56:11.220 | for this in machine learning.

01:56:12.060 | And I don't think it's a very well explored paradigm.

01:56:15.300 | We have like things in deep learning, for example,

01:56:17.380 | right, catastrophic forgetting

01:56:18.540 | is like one of the standard things.

01:56:20.300 | The thing basically being that if you teach a network

01:56:23.260 | like to recognize dogs,

01:56:24.900 | and now you teach that same network to recognize cats,

01:56:27.540 | it basically forgets how to recognize dogs.

01:56:29.180 | So it forgets very quickly.

01:56:30.940 | I mean, and whereas a human,

01:56:32.660 | if you were to teach someone to recognize dogs

01:56:34.700 | and then to recognize cats,

01:56:36.060 | they don't forget immediately how to recognize these dogs.

01:56:38.580 | I think that's basically sort of what you're trying to get.

01:56:40.780 | - Yeah, I just, I wonder if like

01:56:42.540 | the long-term memory mechanisms

01:56:44.860 | or the mechanisms that store not just memories,

01:56:47.260 | but concepts that allow you to reason

01:56:52.260 | of like, and compose concepts,

01:56:57.340 | if those things will look very different

01:56:59.180 | than neural networks,

01:57:00.060 | or if you can do that within a single neural network

01:57:02.460 | with some particular sort of architecture quirks,

01:57:06.180 | that seems to be a really open problem.

01:57:07.860 | And of course I go up and down on that

01:57:09.580 | because there's something so compelling

01:57:12.740 | to the symbolic AI

01:57:14.980 | or to the ideas of logic-based sort of expert systems.

01:57:19.980 | You have like human interpretable facts

01:57:22.580 | that built on top of each other.

01:57:24.220 | It's really annoying, like with self-supervised learning,

01:57:27.900 | that the AI is not very explainable.

01:57:31.260 | Like you can't like understand

01:57:33.500 | all the beautiful thing it has learned.

01:57:35.660 | You can't ask it like questions.

01:57:38.540 | But then again, maybe that's a stupid thing

01:57:41.060 | for us humans to want.

01:57:42.540 | But I think whenever we try to like understand it,

01:57:45.380 | we're putting our own subjective human bias into it.

01:57:48.500 | And I think that's the sort of problem.

01:57:50.140 | With self-supervised learning,

01:57:51.140 | the goal is that it should learn naturally from the data.

01:57:54.420 | So now if you try to understand it,

01:57:55.660 | you are using your own preconceived notions

01:57:58.780 | of what this model has learned.

01:58:00.740 | And that's the problem.

01:58:02.460 | - High level question.

01:58:04.660 | What do you think it takes to build a system

01:58:07.900 | with superhuman, maybe let's say human level

01:58:10.500 | or superhuman level, general intelligence?

01:58:13.500 | We've already kind of started talking about this,

01:58:15.580 | but what's your intuition?

01:58:17.740 | Like, does this thing have to have a body?

01:58:20.740 | Does it have to interact richly with the world?

01:58:23.900 | Does it have to have some more human elements

01:58:27.900 | like self-awareness?

01:58:30.460 | - I think emotion.

01:58:32.220 | I think emotion is something which is,

01:58:34.340 | like it's not really attributed

01:58:37.100 | typically in standard machine learning.

01:58:38.420 | It's not something we think about.

01:58:39.540 | Like there is NLP, there is vision,

01:58:41.020 | there is no like emotion.

01:58:42.580 | Emotion is never a part of all of this.

01:58:44.620 | And that just seems a little bit weird to me.

01:58:47.060 | I think the reason basically being that

01:58:48.860 | there is surprise and like,

01:58:51.540 | basically emotion is like one of the reasons

01:58:53.780 | emotions arises like what happens

01:58:55.780 | and what you expect to happen, right?

01:58:57.100 | There is like a mismatch between these things.

01:58:59.420 | And so that gives rise like,

01:59:01.060 | I can either be surprised or I can be saddened

01:59:03.500 | or I can be happy and all of this.

01:59:05.300 | And so this basically indicates

01:59:07.940 | that I already have a predictive model in my head

01:59:10.140 | and something that I predicted

01:59:11.420 | or something that I thought was likely to happen.

01:59:13.700 | And then there was something that I observed that happened

01:59:15.540 | that there was a disconnect between these two things.

01:59:18.260 | And that basically is like maybe one of the reasons

01:59:21.820 | I like you have a lot of emotions.

01:59:24.260 | - Yeah, I think, so I talked to people a lot about

01:59:26.740 | like Lisa Feldman Barrett.

01:59:29.100 | I think that's an interesting concept of emotion

01:59:31.660 | but I have a sense that emotion primarily

01:59:36.820 | in the way we think about it,

01:59:38.060 | which is the display of emotion

01:59:40.300 | is a communication mechanism between humans.

01:59:43.780 | So it's a part of basically human to human interaction,

01:59:47.380 | an important part, but just the part.

01:59:50.180 | So it's like, I would throw it into the full mix

01:59:55.060 | of communication.

01:59:58.020 | And to me, communication can be done with objects

02:00:01.260 | that don't look at all like humans.

02:00:04.340 | - Okay.

02:00:05.420 | I've seen our ability to anthropomorphize,

02:00:07.540 | our ability to connect with things that look like a Roomba,

02:00:10.660 | our ability to connect.

02:00:11.980 | First of all, let's talk about other biological systems

02:00:14.700 | like dogs, our ability to love things

02:00:17.420 | that are very different than humans.

02:00:19.380 | - But they do display emotion, right?

02:00:20.940 | I mean, dogs do display emotion.

02:00:23.180 | So they don't have to be anthropomorphic

02:00:25.300 | for them to like display the kind of emotion

02:00:27.540 | that we don't.

02:00:28.380 | - Exactly.

02:00:29.860 | But then the word emotion starts to lose.

02:00:33.940 | - So then we have to be, I guess, specific, but yeah.

02:00:36.260 | So have rich, flavorful communication.

02:00:39.540 | - Communication, yeah.

02:00:40.380 | - Yeah, so like, yes, it's full of emotion.

02:00:42.980 | It's full of wit and humor and moods

02:00:47.980 | and all those kinds of things, yeah.

02:00:50.260 | So you're talking about like flavor.

02:00:53.740 | - Flavor, yeah.

02:00:54.580 | Okay, let's call it that.

02:00:55.420 | So there's content and then there is flavor

02:00:57.220 | and I'm talking about the flavor.

02:00:58.420 | - Do you think it needs to have a body?

02:01:00.260 | Do you think like to interact with the physical world,

02:01:02.820 | do you think you can understand the physical world

02:01:04.620 | without being able to directly interact with it?

02:01:07.020 | - I don't think so, yeah.

02:01:08.420 | I think at some point we will need to bite the bullet

02:01:10.660 | and actually interact with the physical world

02:01:12.660 | as much as I like working on like passive computer vision

02:01:15.820 | where I just like sit in my armchair

02:01:17.260 | and look at videos and learn.

02:01:19.020 | I do think that we will need to have

02:01:21.220 | some kind of embodiment or some kind of interaction

02:01:24.580 | to figure out things about the world.

02:01:26.900 | - What about consciousness?

02:01:28.580 | Do you think, how often do you think about consciousness

02:01:32.260 | when you think about your work?

02:01:34.300 | You could think of it as the more simple thing

02:01:36.500 | of self-awareness, of being aware that you are a perceiving,

02:01:41.500 | sensing, acting thing in this world,

02:01:46.800 | or you can think about the bigger version of that

02:01:50.300 | which is consciousness, which is having it feel

02:01:54.460 | like something to be that entity,

02:01:57.180 | the subjective experience of being in this world.

02:01:59.540 | - So I think of self-awareness a little bit more

02:02:01.380 | than the broader goal of it

02:02:03.380 | because I think self-awareness is pretty critical

02:02:06.100 | for any kind of AGI or whatever you want to call it

02:02:10.140 | that we build because it needs to contextualize

02:02:13.020 | what it is and what role it's playing

02:02:15.540 | with respect to all the other things that exist around it.

02:02:17.900 | I think that requires self-awareness.

02:02:19.620 | It needs to understand that it's an autonomous car, right?

02:02:23.180 | And what does that mean?

02:02:24.860 | What are its limitations?

02:02:26.180 | What are the things that it is supposed to do and so on?

02:02:29.020 | What is its role in some way?

02:02:30.700 | Or, I mean, these are the kind of things

02:02:34.180 | that we kind of expect from it, I would say.

02:02:36.820 | And so that's the level of self-awareness

02:02:39.300 | that's, I would say, basically required at least,

02:02:42.160 | if not more than that.

02:02:44.220 | - Yeah, I tend to, on the emotion side,

02:02:46.380 | believe that it has to be able to display consciousness.

02:02:51.380 | - Display consciousness, what do you mean by that?

02:02:54.300 | - Meaning like for us humans to connect with each other

02:02:57.580 | or to connect with other living entities,

02:03:01.660 | I think we need to feel, like in order for us

02:03:05.100 | to truly feel like that there's another being there,

02:03:09.380 | we have to believe that they're conscious.

02:03:11.420 | And so we won't ever connect with something

02:03:14.980 | that doesn't have elements of consciousness.

02:03:17.300 | Now, I tend to think that that's easier to achieve

02:03:21.540 | than it may sound 'cause we anthropomorphize stuff so hard.

02:03:25.700 | Like you have a mug that just like has wheels

02:03:28.740 | and like rotates every once in a while and makes a sound.

02:03:31.900 | I think a couple of days in, especially if you're,

02:03:35.360 | if you don't hang out with humans,

02:03:39.500 | you might start to believe that mug on wheels is conscious.

02:03:42.220 | So I think we anthropomorphize pretty effectively

02:03:44.860 | as human beings.

02:03:46.020 | But I do think that it's in the same bucket

02:03:49.260 | that we'll call emotion that show that,

02:03:54.740 | I think of consciousness as the capacity to suffer.

02:03:57.440 | And if you're an entity that's able to feel things

02:04:02.420 | in the world and to communicate that to others,

02:04:05.580 | I think that's a really powerful way to interact with humans.

02:04:10.940 | And in order to create an AGI system,

02:04:13.220 | I believe you should be able to richly interact with humans.

02:04:17.980 | Like humans would need to want to interact with you.

02:04:21.120 | Like it can't be like, it's the self-supervised learning

02:04:25.600 | versus like the robot shouldn't have to pay you

02:04:29.280 | to interact with me.

02:04:30.280 | So like it should be a natural fun thing.

02:04:33.600 | And then you're going to scale up significantly

02:04:36.080 | how much interaction it gets.

02:04:39.080 | It's the Alexa prize,

02:04:40.860 | which they're trying to get me to be a judge

02:04:42.880 | on their contest.

02:04:44.680 | Let's see if I wanna do that.

02:04:46.020 | But their challenge is to talk to you,

02:04:50.560 | make the human sufficiently interested

02:04:53.960 | that the human keeps talking for 20 minutes.

02:04:56.160 | - To Alexa?

02:04:57.120 | - To Alexa, yeah.

02:04:58.600 | And right now they're not even close to that

02:05:00.240 | 'cause it just gets so boring when you're like,

02:05:02.560 | when the intelligence is not there,

02:05:04.280 | it gets very not interesting to talk to it.

02:05:06.920 | And so the robot needs to be interesting.

02:05:08.960 | And one of the ways it can be interesting

02:05:10.400 | is display the capacity to love, to suffer.

02:05:14.680 | And I would say that essentially means

02:05:17.480 | the capacity to display consciousness.

02:05:20.920 | Like it is an entity much like a human being.

02:05:25.160 | Of course, what that really means,

02:05:27.320 | I don't know if that's fundamentally a robotics problem

02:05:30.520 | or some kind of problem that we're not yet even aware.

02:05:33.040 | Like if it is truly a hard problem of consciousness,

02:05:36.040 | I tend to maybe optimistically think it's a,

02:05:38.600 | we can pretty effectively fake it till we make it.

02:05:42.640 | So we can display a lot of human-like elements for a while

02:05:46.400 | and that will be sufficient to form

02:05:49.080 | really close connections with humans.

02:05:50.920 | What to use the most beautiful idea

02:05:53.720 | in self-supervised learning?

02:05:55.840 | Like when you sit back with, I don't know,

02:05:59.040 | with a glass of wine and armchair and just at a fireplace,

02:06:04.040 | just thinking how beautiful this world

02:06:08.280 | that you get to explore is,

02:06:10.080 | what do you think is the especially beautiful idea?

02:06:13.800 | - The fact that like object level,

02:06:16.520 | what objects are in some notion of objectness emerges

02:06:20.000 | from these models by just like self-supervised learning.

02:06:23.720 | So for example, like one of the things like the dino paper

02:06:28.720 | that I was a part of at Facebook is,

02:06:31.040 | the object sort of boundaries emerge

02:06:34.280 | from these representations.

02:06:35.640 | So if you have like a dog running in the field,

02:06:38.080 | the boundaries around the dog,

02:06:39.480 | the network is basically able to figure out

02:06:42.320 | what the boundaries of this dog are automatically.

02:06:45.520 | And it was never trained to do that.

02:06:47.040 | It was never trained to,

02:06:49.120 | no one taught it that this is a dog

02:06:51.000 | and these pixels belong to a dog.

02:06:52.720 | It's able to group these things together automatically.

02:06:55.000 | So that's one.

02:06:56.120 | And I think in general, that entire notion

02:06:58.080 | that this dumb idea that you take like these two crops

02:07:01.400 | of an image and then you say that the features

02:07:03.160 | should be similar, that has resulted in something like this,

02:07:06.040 | like the model is able to figure out

02:07:07.920 | what the dog pixels are and so on.

02:07:10.320 | That just seems like so surprising.

02:07:12.080 | And I mean, I don't think a lot of us even understand

02:07:16.200 | how that is happening really.

02:07:18.120 | And it's something we're taking for granted,

02:07:20.800 | maybe like a lot in terms of how we're setting up

02:07:23.120 | these algorithms, but it's just,

02:07:24.920 | it's a very beautiful and powerful idea.

02:07:26.800 | So it's really fundamentally telling us something about

02:07:30.240 | that there is so much signal in the pixels

02:07:32.400 | that we can be super dumb about it,

02:07:34.160 | about how we're setting up the self-supervised learning

02:07:36.040 | problem and despite being like super dumb about it,

02:07:39.560 | we'll actually get very good,

02:07:41.600 | like we'll actually get something that is able to do

02:07:43.960 | very like surprising things.

02:07:45.680 | - I wonder if there's other like objectness

02:07:48.240 | of other concepts that can emerge.

02:07:50.340 | I don't know if you follow Francois Chollet,

02:07:53.560 | he had the competition for intelligence

02:07:56.600 | that basically it's kind of like an IQ test,

02:07:59.520 | but for machines, but for an IQ test,

02:08:02.360 | you have to have a few concepts that you wanna apply.

02:08:05.360 | One of them is objectness.

02:08:07.800 | I wonder if those concepts can emerge

02:08:11.520 | through self-supervised learning on billions of images.

02:08:14.760 | - I think something like object permanence

02:08:16.320 | can definitely emerge, right?

02:08:17.440 | So that's like a fundamental concept which we have,

02:08:20.240 | maybe not through images, through video,

02:08:21.480 | but that's another concept that should be emergent from it,

02:08:25.140 | because it's not something that,

02:08:26.760 | like if you don't teach humans that this is like

02:08:29.840 | about this concept of object permanence,

02:08:31.480 | it actually emerges.

02:08:32.480 | And the same thing for like animals, like dogs,

02:08:34.100 | I think actually, permanence automatically

02:08:36.360 | is something that they are born with.

02:08:38.080 | So I think it should emerge from the data.

02:08:40.320 | It should emerge basically very quickly.

02:08:42.440 | - I wonder if ideas like symmetry, rotation,

02:08:45.880 | these kinds of things might emerge.

02:08:47.920 | - So I think rotation probably, yes, yeah, rotation, yes.

02:08:51.600 | - I mean, there's some constraints

02:08:52.680 | in the architecture itself.

02:08:54.040 | - Right.

02:08:55.200 | - But it's interesting if all of them could be,

02:08:59.240 | like counting was another one,

02:09:01.060 | being able to kind of understand

02:09:04.880 | that there's multiple objects of the same kind in the image

02:09:07.720 | and be able to count them.

02:09:09.020 | I wonder if all of that could be,

02:09:11.560 | if constructed correctly, they can emerge,

02:09:14.360 | 'cause then you can transfer those concepts

02:09:16.480 | to then interpret images at a deeper level.

02:09:20.680 | - Right.

02:09:21.520 | Counting I do believe, I mean, should be possible.

02:09:24.680 | I don't know like yet,

02:09:25.920 | but I do think it's not that far in the realm of possibility.

02:09:29.720 | - Yeah, that'd be interesting

02:09:30.560 | if using self-supervised learning on images

02:09:33.240 | can then be applied to then solving those kinds of IQ tests,

02:09:36.520 | which seem currently to be kind of impossible.

02:09:38.840 | What idea do you believe might be true

02:09:43.320 | that most people think is not true

02:09:46.600 | or don't agree with you on?

02:09:48.560 | Is there something like that?

02:09:50.040 | - So this is going to be a little controversial,

02:09:52.400 | but okay, sure.

02:09:53.500 | I don't believe in simulation,

02:09:55.340 | like actually using simulation to do things very much.

02:09:58.840 | - Just to clarify, because this is a podcast

02:10:01.040 | where you talk about, are we living in a simulation often,

02:10:03.600 | you're referring to using simulation to construct worlds

02:10:08.000 | that you then leverage for machine learning.

02:10:10.320 | - Right, yeah.

02:10:11.160 | For example, like one example would be like

02:10:13.080 | to train an autonomous car driving system,

02:10:15.520 | you basically first build a simulator,

02:10:17.400 | which builds like the environment of the world.

02:10:19.840 | And then you basically have a lot of like,

02:10:22.680 | you train your machine learning system in that.

02:10:25.320 | So I believe it is possible,

02:10:27.520 | but I think it's a really expensive way of doing things.

02:10:30.920 | And at the end of it, you do need the real world.

02:10:33.800 | So I'm not sure.

02:10:35.560 | So maybe for certain settings,

02:10:36.960 | like maybe the payout is so large,

02:10:38.920 | like for autonomous driving,

02:10:39.960 | the payout is so large that you can actually invest

02:10:42.000 | that much money to build it.

02:10:43.400 | But I think as a general sort of principle,

02:10:45.520 | it does not apply to a lot of concepts.

02:10:47.040 | You can't really build simulations of everything.

02:10:49.760 | Not only because like one, it's expensive,

02:10:51.560 | because second, it's also not possible for a lot of things.

02:10:54.840 | So in general, like there is a lot of work

02:10:59.400 | on like using synthetic data and like synthetic simulators.

02:11:02.080 | I generally am not very, like I don't believe in that.

02:11:05.800 | - So you're saying it's very challenging visually,

02:11:09.000 | like to correctly like simulate the visual,

02:11:11.920 | like the lighting, all those kinds of things.

02:11:13.560 | - I mean, all these companies that you have, right?

02:11:15.640 | So like Pixar and like whatever,

02:11:17.840 | all these companies are,

02:11:19.800 | all this like computer graphics stuff

02:11:21.520 | is really about accurately,

02:11:22.880 | a lot of them is about like accurately trying to figure out

02:11:26.080 | how the lighting is and like how things reflect

02:11:28.560 | off of one another and so on.

02:11:30.400 | And like how sparkly things look and so on.

02:11:32.200 | So it's a very hard problem.

02:11:33.960 | So do we really need to solve that first

02:11:37.120 | to be able to like do computer vision?

02:11:39.360 | Probably not.

02:11:40.560 | - And for me, in the context of autonomous driving,

02:11:44.720 | it's very tempting to be able to use simulation, right?

02:11:47.960 | Because it's a safety critical application,

02:11:50.480 | but the other limitation or simulation

02:11:53.280 | that perhaps is a bigger one than the visual limitation

02:11:58.360 | is the behavior of objects.

02:12:00.720 | 'Cause the, so you're ultimately interested in edge cases.

02:12:03.800 | And the question is how well can you generate edge cases

02:12:07.240 | in simulation, especially with human behavior?

02:12:10.960 | - I think another problem is like for autonomous driving,

02:12:13.240 | right, it's a constantly changing world.

02:12:15.160 | So say autonomous driving, like in 10 years from now,

02:12:18.480 | like there are lots of autonomous cars,

02:12:20.720 | but they're still going to be humans.

02:12:22.400 | So now there are 50% of the agents say, which are humans,

02:12:25.160 | 50% of the agents that are autonomous,

02:12:26.800 | like car driving agents.

02:12:28.520 | So now the mixture has changed.

02:12:30.040 | So now the kinds of behaviors that you actually expect

02:12:32.280 | from the other agents or other cars on the road

02:12:35.120 | are actually going to be very different.

02:12:36.680 | And as the proportion of the number of autonomous cars

02:12:39.040 | to humans keeps changing,

02:12:40.400 | this behavior will actually change a lot.

02:12:42.560 | So now if you were to build a simulator

02:12:44.040 | based on just like right now to build them today,

02:12:46.400 | you don't have that many autonomous cars on the road.

02:12:48.360 | So you will try to like make all of the other agents

02:12:50.440 | in that simulator behave as humans,

02:12:52.960 | but that's not really going to hold true

02:12:54.600 | 10, 15, 20, 30 years from now.

02:12:57.360 | - Do you think we're living in a simulation?

02:12:59.240 | - No.

02:13:00.080 | - How hard is it?

02:13:02.760 | This is why I think it's an interesting question.

02:13:04.800 | How hard is it to build a video game,

02:13:07.720 | like virtual reality game, where it is so real,

02:13:11.800 | forget like ultra realistic

02:13:15.160 | to where you can't tell the difference,

02:13:17.320 | but like, it's so nice that you just wanna stay there.

02:13:20.800 | You just wanna stay there and you don't wanna come back.

02:13:24.920 | Do you think that's doable within our lifetime?

02:13:29.360 | - Within our lifetime?

02:13:30.480 | Probably, yeah.

02:13:32.160 | I eat healthy, I live long.

02:13:33.760 | (both laughing)

02:13:36.040 | - Does that make you sad that there'll be like,

02:13:38.440 | like population of kids that basically spend 95%,

02:13:44.400 | 99% of their time in a virtual world?

02:13:50.080 | - Very, very hard question to answer.

02:13:51.920 | For certain people, it might be something

02:13:55.720 | that they really derive a lot of value out of,

02:13:58.120 | derive a lot of enjoyment and like happiness out of,

02:14:00.720 | and maybe the real world wasn't giving them that,

02:14:03.120 | that's why they did that.

02:14:03.960 | So maybe it is good for certain people.

02:14:05.920 | - So ultimately, if it maximizes happiness,

02:14:09.360 | - Right, I think if- - Or we could judge.

02:14:10.720 | - Yeah, I think if it's making people happy,

02:14:12.720 | maybe it's okay.

02:14:14.400 | Again, I think it's, this is a very hard question.

02:14:18.280 | - So like you've been a part of a lot of amazing papers,

02:14:22.600 | what advice would you give to somebody

02:14:25.640 | on what it takes to write a good paper?

02:14:28.040 | Grad students writing papers now,

02:14:31.000 | is there common things that you've learned along the way

02:14:34.520 | that you think it takes,

02:14:35.760 | both for a good idea and a good paper?

02:14:39.000 | - Right, so I think both of these I've picked up from,

02:14:44.640 | like lots of people I've worked with in the past.

02:14:46.520 | So one of them is picking the right problem

02:14:48.680 | to work on in research is as important

02:14:51.040 | as like finding the solution to it.

02:14:53.680 | So, I mean, there are multiple reasons for this.

02:14:56.200 | So one is that there are certain problems

02:14:58.960 | that can actually be solved in a particular timeframe.

02:15:02.320 | So now say you want to work on finding the meaning of life.

02:15:06.400 | This is a great problem.

02:15:07.400 | I think most people will agree with that.

02:15:09.440 | But do you believe that your talents

02:15:12.240 | and like the energy that you'll spend on it

02:15:13.800 | will make a meaning,

02:15:15.520 | like make some kind of meaningful progress in your lifetime?

02:15:18.800 | If you are optimistic about it, then like go ahead.

02:15:20.960 | - That's why I started this podcast.

02:15:22.120 | I keep asking people about the meaning of life.

02:15:24.040 | I'm hoping by episode like 220, I'll figure it out.

02:15:28.000 | - Oh, not too many episodes to go.

02:15:30.280 | - All right.

02:15:31.720 | Maybe today, I don't know.

02:15:33.080 | But you're right.

02:15:33.920 | So that seems intractable at the moment.

02:15:36.280 | - Right, so I think it's just the fact of like,

02:15:39.000 | if you're starting a PhD, for example,

02:15:41.080 | what is one problem that you want to focus on

02:15:43.000 | that you do think is interesting enough,

02:15:45.720 | and you will be able to make a reasonable amount

02:15:47.800 | of headway into it that you think you'll be doing a PhD for.

02:15:50.520 | So in that kind of a timeframe.

02:15:53.080 | So that's one.

02:15:53.920 | Of course, there's the second part,

02:15:54.800 | which is what excites you genuinely.

02:15:56.360 | So you shouldn't just pick problems

02:15:57.600 | that you are not excited about,

02:15:59.040 | because as a grad student or as a researcher,

02:16:01.840 | you really need to be passionate about it

02:16:03.240 | to continue doing that,

02:16:04.600 | because there are so many other things

02:16:05.760 | that you could be doing in life.

02:16:07.120 | So you really need to believe in that

02:16:08.280 | to be able to do that for that long.

02:16:10.760 | In terms of papers,

02:16:11.600 | I think the one thing that I've learned is,

02:16:13.720 | I've like in the past,

02:16:16.440 | whenever I used to write things,

02:16:17.760 | and even now, whenever I do that,

02:16:18.920 | I try to cram in a lot of things into the paper.

02:16:21.400 | Whereas what really matters

02:16:22.800 | is just pushing one simple idea.

02:16:24.760 | That's it.

02:16:25.760 | That's all because that's the paper

02:16:29.120 | is going to be like whatever eight or nine pages.

02:16:32.200 | If you keep cramming in lots of ideas,

02:16:34.240 | it's really hard for the single thing

02:16:36.240 | that you believe in to stand out.

02:16:38.000 | So if you really try to just focus on like,

02:16:40.920 | especially in terms of writing,

02:16:41.920 | really try to focus on one particular idea

02:16:43.800 | and articulate it out in multiple different ways,

02:16:46.240 | it's far more valuable to the reader as well.

02:16:49.040 | And basically, to the reader, of course,

02:16:51.600 | because they get to,

02:16:53.120 | they know that this particular idea

02:16:54.400 | is associated with this paper.

02:16:56.160 | And also for you, because you have,

02:16:59.040 | like when you write about a particular idea

02:17:00.440 | in different ways, you think about it more deeply.

02:17:02.680 | So as a grad student,

02:17:03.600 | I used to always wait to it,

02:17:05.520 | like maybe in the last week or whatever

02:17:07.720 | to write the paper,

02:17:08.720 | because I used to always believe

02:17:10.280 | that doing the experiments

02:17:11.360 | was actually the bigger part of research than writing.

02:17:13.880 | And my advisor always told me

02:17:15.240 | that you should start writing very early on.

02:17:16.680 | And I thought, oh, it doesn't matter.

02:17:17.920 | I don't know what he's talking about.

02:17:19.720 | But I think more and more I realized that's the case.

02:17:21.800 | Like whenever I write something that I'm doing,

02:17:24.040 | I actually think much better about it.

02:17:26.440 | And so if you start writing earlier,

02:17:28.360 | early on you actually, I think get better ideas,

02:17:31.200 | or at least you figure out like holes in your theory

02:17:33.800 | or like particular experiments

02:17:35.480 | that you should run to block those holes and so on.

02:17:38.760 | - Yeah, I'm continually surprised

02:17:40.320 | how many really good papers throughout history

02:17:43.600 | are quite short and quite simple.

02:17:47.400 | And there's a lesson to that.

02:17:49.800 | Like if you want to dream about writing a paper

02:17:52.600 | that changes the world,

02:17:54.400 | and you want to go by example,

02:17:56.800 | they're usually simple.

02:17:58.040 | - Yeah, yeah.

02:17:59.000 | - And that's, it's not cramming or it's,

02:18:03.040 | it's focusing on one idea and thinking deeply.

02:18:07.240 | And you're right that the writing process itself

02:18:10.360 | reveals the idea.

02:18:12.280 | It challenges you to really think about what is the idea

02:18:15.320 | that explains that the thread that ties it all together.

02:18:18.120 | - And so like a lot of famous researchers I know

02:18:21.520 | actually would start off like,

02:18:24.120 | first they would, even before the experiments were in,

02:18:27.000 | a lot of them would actually start

02:18:28.320 | with writing the introduction of the paper

02:18:30.400 | with zero experiments in.

02:18:32.160 | Because that at least helps them figure out

02:18:33.800 | what they're like, what they're trying to solve

02:18:35.800 | and how it fits in like the context of things right now.

02:18:38.680 | And that would really guide their entire research.

02:18:40.680 | So a lot of them would actually first write an intros

02:18:42.360 | with like zero experiments in

02:18:43.560 | and that's how they would start projects.

02:18:46.040 | - Some basic questions about people maybe

02:18:48.240 | that are more like beginners in this field.

02:18:51.960 | What's the best programming language to learn

02:18:54.080 | if you're interested in machine learning?

02:18:56.600 | - I would say Python,

02:18:57.440 | just because it's the easiest one to learn.

02:19:00.320 | And also a lot of like programming

02:19:03.160 | in machine learning happens in Python.

02:19:05.000 | So it'll, if you don't know any other programming language,

02:19:07.600 | Python is actually going to get you a long way.

02:19:09.560 | - Yeah, it seems like sort of a,

02:19:11.680 | it's a toss up question because it seems like Python

02:19:14.000 | is so much dominating the space now.

02:19:16.800 | But I wonder if there's an interesting alternative.

02:19:18.480 | Obviously there's like Swift

02:19:19.960 | and there's a lot of interesting alternatives popping up,

02:19:22.720 | even JavaScript.

02:19:23.960 | So I, or R, more like for the data science applications.

02:19:28.880 | But it seems like Python more and more

02:19:31.240 | is actually being used to teach like introduction

02:19:34.160 | to programming at universities.

02:19:35.920 | So it just combines everything very nicely.

02:19:38.320 | Even harder question.

02:19:41.880 | What are the pros and cons of PyTorch versus TensorFlow?

02:19:46.160 | - I see.

02:19:47.000 | Okay.

02:19:49.400 | - You can go with no comment.

02:19:51.320 | - So a disclaimer to this is that the last time

02:19:53.440 | I used TensorFlow was probably like four years ago.

02:19:56.440 | And so it was right when it had come out

02:19:58.200 | because so I started on like deep learning in 2014 or so

02:20:02.680 | and the dominant sort of framework for us then

02:20:06.480 | for vision was Caffe, which was out of Berkeley.

02:20:09.080 | And we used Caffe a lot, it was really nice.

02:20:11.280 | And then TensorFlow came in,

02:20:13.400 | which was basically like Python first.

02:20:15.120 | So Caffe was mainly C++

02:20:17.080 | and it had like very loose kind of Python binding.

02:20:19.080 | So Python wasn't really the first language you would use.

02:20:21.360 | You would really use either MATLAB or C++

02:20:24.720 | to like get stuff done in like Caffe.

02:20:28.280 | And then Python of course became popular a little bit later.

02:20:30.960 | So TensorFlow was basically around that time.

02:20:32.640 | So 2015, 2016 is when I last used it.

02:20:35.280 | It's been a while.

02:20:37.240 | - And then what, did you use Torch or did you--

02:20:40.600 | - So then I moved to LuaTorch,

02:20:42.560 | which was the Torch in Lua.

02:20:44.040 | And then in 2017, I think basically

02:20:46.400 | pretty much to PyTorch completely.

02:20:48.440 | - Oh, interesting.

02:20:49.280 | So you went to Lua, cool.

02:20:50.560 | - Yeah.

02:20:51.520 | - Huh, so you were there before it was cool.

02:20:54.240 | - Yeah, I mean, so LuaTorch was really good

02:20:56.360 | because it actually allowed you to do a lot

02:20:59.560 | of different kinds of things.

02:21:01.360 | So which Caffe was very rigid in terms of its structure.

02:21:03.880 | Like you would create a neural network once and that's it.

02:21:06.800 | Whereas if you wanted like very dynamic graphs and so on,

02:21:09.320 | it was very hard to do that.

02:21:10.200 | And LuaTorch was much more friendly

02:21:11.600 | for all of these things.

02:21:13.560 | Okay, so in terms of PyTorch and TensorFlow,

02:21:15.600 | my personal bias is PyTorch

02:21:17.280 | just because I've been using it longer

02:21:19.080 | and I'm more familiar with it.

02:21:20.760 | And also that PyTorch is much easier to debug

02:21:23.560 | is what I find because it's imperative in nature

02:21:26.280 | compared to like TensorFlow, which is not imperative.

02:21:28.600 | But that's telling you a lot that basically

02:21:30.480 | the imperative design is sort of a way

02:21:33.320 | in which a lot of people are taught programming

02:21:35.240 | and that's what actually makes debugging easier for them.

02:21:38.160 | So like I learned programming in C, C++.

02:21:40.480 | And so for me, imperative way of programming

02:21:42.200 | is more natural.

02:21:43.200 | - Do you think it's good to have

02:21:45.280 | kind of these two communities, this kind of competition?

02:21:48.480 | I think PyTorch is kind of more and more becoming dominant

02:21:51.480 | in the research community,

02:21:52.520 | but TensorFlow is still very popular

02:21:54.600 | in the more sort of application machine learning community.

02:21:57.920 | So do you think it's good to have that kind of split

02:22:00.480 | in code bases or,

02:22:02.680 | so like the benefit there is the competition challenges

02:22:06.560 | the library developers to step up their game.

02:22:09.120 | - Yeah.

02:22:10.000 | - But the downside is there's these code bases

02:22:12.760 | that are in different libraries.

02:22:15.200 | - Right, so I think the downside is that,

02:22:17.080 | I mean, for a lot of research code

02:22:18.480 | that's released in one framework

02:22:19.640 | and if you're using the other one,

02:22:20.600 | it's really hard to like really build on top of it.

02:22:23.960 | But thankfully the open source community

02:22:25.800 | in machine learning is amazing.

02:22:27.080 | So whenever like something pops up in TensorFlow,

02:22:30.840 | you wait a few days and someone who's like super sharp

02:22:33.200 | will actually come and translate that particular code base

02:22:35.520 | into PyTorch and basically have figured that

02:22:38.400 | all the nooks and crannies out.

02:22:39.720 | So the open source community is amazing

02:22:41.800 | and they really like figure out this gap.

02:22:44.280 | So I think in terms of like having these two frameworks

02:22:47.560 | or multiple, I think of course there are different use cases

02:22:49.720 | so there are going to be benefits

02:22:51.080 | to using one or the other framework.

02:22:52.840 | And like you said, I think competition is just healthy

02:22:54.720 | because both of these frameworks keep,

02:22:57.360 | or like all of these frameworks really sort of

02:22:59.040 | keep learning from each other

02:23:00.120 | and keep incorporating different things

02:23:01.640 | to just make them better and better.

02:23:03.760 | - What advice would you have for someone

02:23:06.320 | new to machine learning, you know,

02:23:09.680 | maybe just started or haven't even started

02:23:11.520 | but are curious about it and who want to get in the field?

02:23:14.880 | - Don't be afraid to get your hands dirty.

02:23:16.640 | I think that's the main thing.

02:23:17.600 | So if something doesn't work,

02:23:19.120 | like really drill into why things are not working.

02:23:22.200 | - Can you elaborate what your hands dirty means?

02:23:24.560 | - Right, so for example, like if an algorithm,

02:23:27.600 | if you try to train a network and it's not converging,

02:23:29.760 | whatever, rather than trying to like Google the answer

02:23:32.280 | or trying to do something,

02:23:33.440 | like really spend those like five, eight, 10, 15, 20,

02:23:36.360 | whatever number of hours really trying

02:23:37.600 | to figure it out yourself.

02:23:39.040 | 'Cause in that process, you'll actually learn a lot more.

02:23:41.360 | - Yeah.

02:23:42.560 | - Googling is of course like a good way to solve it

02:23:44.640 | when you need a quick answer.

02:23:46.040 | But I think initially, especially like

02:23:47.680 | when you're starting out, it's much nicer

02:23:49.400 | to like figure things out by yourself.

02:23:51.880 | And I just say that from experience

02:23:53.000 | because like when I started out,

02:23:54.320 | there were not a lot of resources.

02:23:55.520 | So we would like in the lab, a lot of us,

02:23:57.920 | like we would look up to senior students

02:23:59.720 | and the senior students were of course busy

02:24:01.400 | and they would be like, "Hey, why don't you go figure it out

02:24:03.120 | because I just don't have the time.

02:24:04.360 | I'm working on my dissertation or whatever."

02:24:06.520 | I'll find a PhD students.

02:24:07.680 | And so then we would sit down

02:24:08.800 | and like just try to figure it out.

02:24:10.520 | And that I think really helped me.

02:24:12.480 | That has really helped me figure a lot of things out.

02:24:15.080 | - I think in general, if I were to generalize that,

02:24:18.760 | I feel like persevering through any kind of struggle

02:24:22.800 | on a thing you care about is good.

02:24:25.680 | So you're basically, you try to make it seem like

02:24:28.200 | it's good to spend time debugging,

02:24:30.880 | but really any kind of struggle,

02:24:32.600 | whatever form that takes, it could be just Googling a lot.

02:24:36.120 | Just basically anything, just go sticking with it

02:24:38.760 | and go into the hard thing that could take a form

02:24:41.080 | of implementing stuff from scratch.

02:24:43.280 | It could take the form of re-implementing

02:24:45.680 | with different libraries or different programming languages.

02:24:49.400 | It could take a lot of different forms,

02:24:50.640 | but struggle is good for the soul.

02:24:53.560 | - So like in Pittsburgh, when I did my PhD,

02:24:55.840 | the thing was it used to snow a lot.

02:24:57.560 | - Yeah.

02:24:58.400 | - And so when it was snowed, you really couldn't do much.

02:25:00.840 | So the thing that a lot of people said was,

02:25:03.000 | "Snow builds character."

02:25:05.360 | Because when it's snowing, you can't do anything else.

02:25:07.520 | You focus on work.

02:25:09.080 | - Do you have advice in general for people,

02:25:10.840 | you've already exceptionally successful, you're young,

02:25:13.440 | but do you have advice for young people starting out

02:25:15.800 | in college or maybe in high school?

02:25:17.840 | You know, advice for their career, advice for their life,

02:25:21.080 | how to pave a successful path in career and life?

02:25:25.760 | - I would say just be hungry.

02:25:27.400 | Like always be hungry for what you want.

02:25:29.760 | And I think like, I've been inspired by a lot of people

02:25:33.360 | who are just like driven and who really like go

02:25:35.840 | for what they want, no matter what, like,

02:25:38.440 | you shouldn't want it, you should need it.

02:25:40.560 | So if you need something, you basically go

02:25:42.640 | towards the ends to make it work.

02:25:44.400 | - How do you know when you come across a thing

02:25:47.120 | that's like you need?

02:25:51.160 | - I think there's not going to be any single thing

02:25:53.120 | that you're going to need.

02:25:53.960 | There are going to be different types of things

02:25:55.000 | that you need, but whenever you need something,

02:25:56.640 | you just go push for it.

02:25:57.960 | And of course, once you, you may not get it,

02:26:00.080 | or you may find that this was not even the thing

02:26:02.000 | that you were looking for, it might be a different thing.

02:26:03.720 | But the point is like you're pushing through things

02:26:06.280 | and that actually gives you, brings a lot of skills

02:26:09.000 | and brings a lot of like,

02:26:11.520 | builds a certain kind of attitude,

02:26:12.920 | which will probably help you get the other thing.

02:26:15.680 | Once you figure out what's really the thing that you want.

02:26:18.080 | - Yeah, I think a lot of people are,

02:26:20.560 | I've noticed the kind of afraid of that is because one,

02:26:23.240 | it's a fear of commitment.

02:26:24.880 | And two, there's so many amazing things in this world.

02:26:26.880 | You almost don't want to miss out

02:26:28.120 | on all the other amazing things

02:26:29.400 | by committing to this one thing.

02:26:31.040 | So I think a lot of it has to do with just allowing yourself

02:26:33.840 | to like notice that thing and just go all the way with it.

02:26:41.560 | - I mean, I also like failure, right?

02:26:43.240 | So I know this is like super cheesy that failure

02:26:47.280 | is something that you should be prepared for and so on.

02:26:49.760 | But I do think, I mean, especially in research,

02:26:52.520 | for example, failure is something that happens almost like,

02:26:55.240 | almost every day is like experiments failing and not working.

02:26:59.120 | And so you really need to be so used to it.

02:27:02.240 | You need to have a thick skin.

02:27:03.880 | But, and only basically through,

02:27:06.280 | like when you get through it is when you find

02:27:07.880 | the one thing that's actually working.

02:27:09.560 | So Thomas Edison was like one person like that.

02:27:11.840 | So I really, like when I was a kid,

02:27:13.680 | I used to really read about how he found like the filament,

02:27:17.040 | the light bulb filament.

02:27:18.680 | And then he, I think his thing was like,

02:27:20.600 | he tried 990 things that didn't work

02:27:23.120 | or something of the sort.

02:27:24.320 | And then they asked him like, so what did you learn?

02:27:26.920 | Because all of these were failed experiments.

02:27:28.480 | And then he says, oh, these 990 things don't work.

02:27:31.600 | And I know that.

02:27:32.440 | Did you know that?

02:27:33.280 | (laughing)

02:27:34.120 | I mean, that's pretty inspiring.

02:27:35.960 | - So you spent a few years on this earth

02:27:38.480 | performing a self supervised kind of learning process.

02:27:43.480 | Have you figured out the meaning of life yet?

02:27:46.400 | I told you I'm doing this podcast to try to get the answer.

02:27:49.120 | I'm hoping you could tell me.

02:27:50.720 | What do you think the meaning of it all is?

02:27:52.880 | - I don't think I figured this out.

02:27:55.760 | No, I have no idea.

02:27:57.080 | (laughing)

02:27:58.960 | - Do you think AI will help us figure it out?

02:28:02.520 | Or do you think there's no answer?

02:28:03.880 | The whole point is to keep searching.

02:28:05.440 | - I think, yeah, I think it's an endless

02:28:07.520 | sort of quest for us.

02:28:08.760 | I don't think AI will help us there.

02:28:10.520 | This is like a very hard, hard, hard question,

02:28:13.560 | which so many humans have tried to answer.

02:28:15.400 | - Well, that's the interesting thing.

02:28:16.600 | Difference between AI and humans.

02:28:18.400 | Humans don't seem to know what the hell they're doing.

02:28:21.800 | And AI is almost always operating

02:28:23.680 | under well-defined objective functions.

02:28:27.360 | And I wonder like, whether our lack of ability

02:28:33.320 | to define good long-term objective functions

02:28:37.200 | or introspect what is the objective function

02:28:40.400 | under which we operate, if that's a feature or a bug.

02:28:44.360 | - I would say it's a feature

02:28:45.200 | because then everyone actually has very different kinds

02:28:47.400 | of objective functions that they're optimizing.

02:28:49.320 | And those objective functions evolve

02:28:51.280 | and like change dramatically

02:28:52.800 | through the course of their life.

02:28:53.840 | That's actually what makes us interesting, right?

02:28:55.960 | If otherwise, like if everyone was doing

02:28:58.000 | the exact same thing, that would be pretty boring.

02:29:00.520 | We do want like people with different kinds of perspectives.

02:29:03.800 | Also people evolve continuously.

02:29:06.120 | That's like, I would say the biggest feature of being human.

02:29:09.280 | - And then we get to like the ones that die

02:29:11.120 | because they do something stupid.

02:29:12.520 | We get to watch that, see it and learn from it.

02:29:15.400 | And as a species, we take that lesson

02:29:20.320 | and become better and better

02:29:22.560 | because of all the dumb people in the world

02:29:24.240 | that died doing something wild and beautiful.

02:29:28.120 | Ishan, thank you so much for this incredible conversation.

02:29:31.800 | We did a depth first search

02:29:34.800 | through the space of machine learning

02:29:38.000 | and it was fun and fascinating.

02:29:41.600 | So it's really an honor to meet you

02:29:43.880 | and it was a really awesome conversation.

02:29:45.760 | Thanks for coming down today and talking with me.

02:29:48.160 | - Thanks Lex.

02:29:49.000 | I mean, I've listened to you.

02:29:50.200 | I told you it was unreal for me

02:29:51.720 | to actually meet you in person

02:29:52.960 | and I'm so happy to be here.

02:29:54.080 | Thank you.

02:29:55.000 | - Thanks man.

02:29:55.840 | Thanks for listening to this conversation

02:29:58.160 | with Ishan Misra.

02:29:59.360 | And thank you to Onnit, The Information,

02:30:02.440 | Grammarly and Athletic Greens.

02:30:05.280 | Check them out in the description to support this podcast.

02:30:08.560 | And now let me leave you with some words

02:30:10.440 | from Arthur C. Clarke.

02:30:12.480 | Any sufficiently advanced technology

02:30:14.920 | is indistinguishable from magic.

02:30:18.120 | Thank you for listening and hope to see you next time.

02:30:21.000 | (upbeat music)

02:30:23.600 | (upbeat music)

02:30:26.200 | [BLANK_AUDIO]

Ishan Misra: Self-Supervised Deep Learning in Computer Vision | Lex Fridman Podcast #206

Chapters