Ishan Misra: Self-Supervised Deep Learning in Computer Vision | Lex Fridman Podcast #206
Chapters
0:00 Introduction
2:27 Self-supervised learning
11:02 Self-supervised learning is the dark matter of intelligence
14:54 Categorization
23:28 Is computer vision still really hard?
27:12 Understanding Language
36:51 Harder to solve: vision or language
43:36 Contrastive learning & energy-based models
47:37 Data augmentation
60:10 Real data vs. augmented data
63:54 Non-contrastive learning energy based self supervised learning methods
67:32 Unsupervised learning (SwAV)
70:14 Self-supervised Pretraining (SEER)
75:21 Self-supervised learning (SSL) architectures
81:21 VISSL pytorch-based SSL library
84:15 Multi-modal
91:43 Active learning
97:22 Autonomous driving
108:49 Limits of deep learning
112:57 Difference between learning and reasoning
118:03 Building super-human AI
125:51 Most beautiful idea in self-supervised learning
129:40 Simulation for training AI
133:04 Video games replacing reality
134:18 How to write a good research paper
138:45 Best programming language for beginners
139:39 PyTorch vs TensorFlow
143:03 Advice for getting into machine learning
145:09 Advice for young people
147:35 Meaning of life
00:00:00.000 |
The following is a conversation with Ishan Misra, 00:00:05.840 |
who works on self-supervised machine learning 00:00:13.480 |
understand the visual world with minimal help 00:00:18.040 |
Transformers and self-attention has been successfully used 00:00:25.620 |
to do self-supervised learning in the domain of language. 00:00:40.320 |
and in the morning, come back to a much smarter robot. 00:00:43.600 |
I read the blog post, "Self-supervised Learning, 00:00:45.960 |
"The Dark Matter of Intelligence" by Ishan and Yan LeCun 00:00:52.940 |
on the excellent "Machine Learning Street Talk" podcast. 00:00:59.160 |
By the way, if you're interested in machine learning and AI, 00:01:02.840 |
I cannot recommend the ML Street Talk podcast highly enough. 00:01:11.280 |
Onnit, The Information, Grammarly and Athletic Greens. 00:01:15.400 |
Check them out in the description to support this podcast. 00:01:18.640 |
As a side note, let me say that for those of you 00:01:21.680 |
who may have been listening for quite a while, 00:01:37.720 |
to have many conversations with world-class researchers 00:01:40.580 |
in AI, math, physics, biology and all the other sciences. 00:01:45.120 |
But I also want to talk to historians, musicians, 00:01:48.880 |
athletes and of course, occasionally comedians. 00:01:53.600 |
three times a week now to give me more freedom 00:02:03.160 |
I challenged the listener to count the number of times 00:02:08.000 |
Ishan and I use the word banana as the canonical example 00:02:12.600 |
at the core of the hard problem of computer vision 00:02:22.620 |
and here is my conversation with Ishan Misra. 00:02:32.740 |
of what is supervised and semi-supervised learning 00:02:43.900 |
the way they're trained is you get a bunch of humans, 00:02:51.980 |
what is present in the image, draw boxes around them, 00:03:00.540 |
For NLP, again, there are lots of these particular tasks, 00:03:03.220 |
say about sentiment analysis, about entailment and so on. 00:03:08.100 |
we get a big corpus of such annotated or labeled data 00:03:18.380 |
So it looks at an image and the human has tagged 00:03:22.420 |
and now the system is basically trying to mimic that. 00:03:41.320 |
So this is a standard sort of supervised setting. 00:03:49.260 |
but you have lots of other data which is unsupervised 00:03:53.100 |
Now the problem basically with supervised learning 00:03:55.260 |
and why you actually have all of these alternate 00:04:04.980 |
one of the most popular data sets is ImageNet. 00:04:09.300 |
has about 22,000 concepts and about 14 million images. 00:04:36.760 |
Like you have about, I think, 400 million images or so, 00:04:40.620 |
to most of the popular sort of social media websites today. 00:04:44.180 |
So now supervised learning just doesn't scale. 00:04:48.660 |
if I want to have various types of fine-grained concepts, 00:04:58.580 |
you have this annotated corpus of supervised data, 00:05:03.700 |
And the idea is that the algorithm should basically 00:05:19.660 |
the idea is that the algorithm actually learns 00:05:23.500 |
and actually gets better at predicting these concepts. 00:05:32.260 |
or the algorithm should really discover concepts, 00:05:36.380 |
or learn representations about the world which are useful, 00:05:39.180 |
without access to explicit human supervision. 00:05:48.520 |
And maybe that perhaps is why Yann LeCun and you argue 00:05:51.960 |
that unsupervised is the incorrect terminology here. 00:06:04.480 |
the reason it has the term supervised in itself 00:06:06.720 |
is because you're using the data itself as supervision. 00:06:10.320 |
So because the data serves as its own source of supervision, 00:06:16.380 |
I mean, we did it in that blog post with Yann, 00:06:22.100 |
So starting from like '94 from Virginia de Sa's group, 00:06:28.780 |
Jitendra Malik has said this a bunch of times as well. 00:06:33.060 |
And then unsupervised basically means everything 00:06:36.420 |
But that includes stuff like semi-supervised, 00:06:38.620 |
that includes other like transductive learning, 00:06:43.020 |
So that's the reason like now people are preferring 00:06:53.100 |
which tries to extract just sort of data supervision signals 00:06:56.900 |
from the data itself is a self-supervised learning algorithm. 00:07:02.140 |
a set of tricks which unlock the supervision. 00:07:12.840 |
The data doesn't just speak to you some ground truth. 00:07:17.760 |
So I don't know what your favorite domain is. 00:07:19.580 |
So you specifically specialize in visual learning, 00:07:31.060 |
So the idea basically being that you can train models 00:07:37.380 |
And now these models learn to predict the masked out words. 00:07:40.500 |
So if you have like the cat jumped over the dog, 00:07:45.940 |
And now you are essentially asking the model to predict 00:07:52.460 |
a distribution over all the possible words that it knows. 00:07:55.320 |
And probably it has like, if it's a well-trained model, 00:08:09.400 |
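As a rough illustration of the fill-in-the-blank objective described above, here is a minimal sketch using a pretrained masked language model (this assumes the Hugging Face `transformers` library and the publicly available `bert-base-uncased` checkpoint; it is not the specific setup discussed in the conversation):

```python
# Minimal masked-word prediction sketch: the model outputs a distribution
# over its vocabulary for the blanked-out position.
# Assumes the Hugging Face `transformers` library is installed and the
# "bert-base-uncased" checkpoint can be downloaded.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The cat [MASK] over the dog."):
    print(prediction["token_str"], round(prediction["score"], 3))
```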
is basically say, for example, video prediction. 00:08:17.400 |
you can feed in the first nine seconds to a model 00:08:26.720 |
is predicting something about the data itself. 00:08:32.280 |
because the 10 second video was naturally captured. 00:08:34.580 |
Because the model is predicting what's happening there, 00:08:43.980 |
So like if I have something at the edge of the table, 00:08:48.340 |
which you really don't have to sit and annotate. 00:08:56.600 |
And then it falls down and this is a fallen down cup. 00:08:58.800 |
So I won't have to annotate all of these things 00:09:02.000 |
- Isn't that kind of a brilliant little trick 00:09:05.240 |
of taking a series of data that is consistent 00:09:11.860 |
and then teaching the algorithm to predict that element. 00:09:16.860 |
Isn't that, first of all, that's quite brilliant. 00:09:27.880 |
that is consistent with the physical reality. 00:09:30.220 |
The question is, are there other tricks like this 00:09:34.400 |
that can generate the self-supervision signal? 00:09:37.840 |
- So sequence is possibly the most widely used one in NLP. 00:09:41.200 |
For vision, the one that is actually used for images, 00:09:47.600 |
and now taking different crops of that image. 00:09:55.280 |
and asking the network to basically present it 00:09:58.080 |
with a choice saying that, okay, now you have this image, 00:10:01.360 |
you have this image, are these the same or not? 00:10:04.480 |
And so the idea basically is that because different, 00:10:06.680 |
like in an image, different parts of the image 00:10:09.800 |
So for example, if you have a chair and a table, 00:10:12.400 |
basically these things are going to be close by, 00:10:16.880 |
if you have like a zoomed in picture of a chair, 00:10:20.520 |
it's going to be different parts of the chair. 00:10:22.360 |
So the idea basically is that different crops 00:10:30.320 |
So this is possibly the most widely used trick these days 00:10:33.120 |
for self-supervised learning in computer vision. 00:10:50.280 |
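To make the crop trick concrete, a minimal sketch might look like the following (assuming torchvision and an arbitrary ResNet backbone; the image path and crop size are placeholders, and real methods add many more augmentations plus a contrastive or clustering objective):

```python
# Sketch: two random crops of the same image form a "positive" pair whose
# embeddings the network is trained to make similar.
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import models, transforms

crop = transforms.Compose([
    transforms.RandomResizedCrop(224),   # random crop, resized to a fixed size
    transforms.ToTensor(),
])

encoder = models.resnet18(weights=None)  # untrained backbone, for illustration
encoder.fc = torch.nn.Identity()         # keep pooled features, drop classifier

image = Image.open("some_image.jpg").convert("RGB")  # hypothetical file
view_a, view_b = crop(image), crop(image)            # two crops, same image

with torch.no_grad():
    z_a = encoder(view_a.unsqueeze(0))
    z_b = encoder(view_b.unsqueeze(0))

# Training would push this similarity up while contrasting against other images.
print(F.cosine_similarity(z_a, z_b).item())
```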
like language or something that's like a time series, 00:11:02.320 |
- You and Yann LeCun wrote the blog post in March 2021, 00:11:12.640 |
and maybe explain the main idea or set of ideas? 00:11:15.640 |
- The blog post was mainly about sort of just telling, 00:11:27.200 |
for machine learning algorithms that come in the future, 00:11:33.840 |
have a good understanding of what dark matter is. 00:11:39.840 |
- Maybe the metaphor doesn't exactly transfer, 00:11:51.240 |
- Right, so I think self-supervised learning, 00:11:56.240 |
towards what it probably should end up learning, 00:12:03.720 |
self-supervised learning is going to be a very powerful way 00:12:18.720 |
So supervised learning is clearly not going to scale. 00:12:21.520 |
So what is the thing that's actually going to scale? 00:12:32.560 |
hey, this is taking him more time to lift up, 00:12:41.960 |
you should be able to infer a lot of things about the world 00:12:57.440 |
There's so many questions that are yet to be, 00:13:04.440 |
over which the self-supervised learning process works? 00:13:08.600 |
How much interactivity, like in the active learning, 00:13:38.480 |
versus something that's more akin to learning, 00:13:49.160 |
But we are, I mean, a lot of us are actually convinced 00:13:56.600 |
that human supervision cannot be at large scale, 00:14:04.120 |
So the machines have to discover the supervision 00:14:10.240 |
- I mean, the other thing is also that humans 00:14:23.080 |
what makes us say one is dining table and the other is not? 00:14:28.160 |
They're not like very good sources of supervision 00:14:46.080 |
because we're not maybe going to confuse it a lot actually. 00:14:49.320 |
- Well, humans can't even answer the meaning of life. 00:14:56.960 |
Humans are not very good at telling the difference 00:14:59.040 |
between what is and isn't a table, like you mentioned. 00:15:08.140 |
Is it possible to create a pretty good taxonomy 00:15:16.400 |
It seems like a lot of approaches in machine learning 00:15:19.000 |
kind of assume a hopeful vision that it's possible 00:15:26.520 |
but we can always get closer and closer to it. 00:15:33.040 |
So the thing is for any particular categorization 00:15:36.920 |
if you have a discrete sort of categorization, 00:15:40.520 |
or I can take a third concept and I can blend it in 00:15:46.560 |
I will always find an N plus one category for you. 00:15:50.720 |
And I can actually create not just N plus one, 00:15:52.440 |
I can very easily create far more than N categories. 00:15:59.000 |
So it's really hard for us to come and sit in 00:16:03.240 |
And they compose in various weird ways, right? 00:16:05.880 |
Like you have like a croissant and a donut come together 00:16:09.720 |
So if you were to like enumerate all the foods up until, 00:16:12.440 |
I don't know, whenever the cronut was about 10 years ago 00:16:16.460 |
then this entire thing called cronut would not exist. 00:16:19.000 |
- Yeah, I remember there was the most awesome video 00:16:31.000 |
So it's a very difficult philosophical question. 00:16:33.840 |
So there is a concept of similarity between objects. 00:16:43.200 |
a good way to tell which parts of things are similar 00:16:47.920 |
and which parts of things are very different? 00:16:51.780 |
So you don't necessarily need to name everything 00:16:54.320 |
or assign a name to everything to be able to use it, right? 00:17:03.540 |
- I mean, lots of like, for example, animals, right? 00:17:09.540 |
but they're able to go about their day perfectly. 00:17:12.900 |
So, I mean, we probably look at things and we figure out, 00:17:22.020 |
So I haven't seen all the possible doorknobs in the world. 00:17:32.140 |
So I, of course, related to all the doorknobs that I've seen 00:17:36.540 |
I have a pretty good idea of how it's going to open. 00:17:39.420 |
And I think this kind of translation between experiences 00:17:57.680 |
Can having a good function that compares objects 00:18:11.560 |
- Well, let me tell you what that's similar to. 00:18:19.740 |
I think understanding is the process of placing that thing 00:18:24.740 |
in some kind of network of knowledge that you have. 00:18:28.420 |
That it perhaps is fundamentally related to other concepts. 00:18:33.180 |
So it's not like understanding is fundamentally related 00:18:41.480 |
And maybe like deeper and deeper understanding 00:18:45.800 |
is maybe just adding more edges to that graph somehow. 00:18:50.800 |
So maybe it is a composition of similarities. 00:18:55.080 |
I mean, ultimately, I suppose it is a kind of embedding 00:19:12.360 |
I mean, I don't even know what everything is, 00:19:18.920 |
things are similar in very different contexts, right? 00:19:34.000 |
So elephants are like herbivores, lions are not. 00:19:40.680 |
also actually helps us understand a lot about things. 00:19:47.640 |
Just like forming this particular category of elephant 00:19:57.240 |
which are not as maybe, for example, like grilled cheese. 00:20:06.760 |
- Right, so categorization is still very useful 00:20:11.280 |
But is your intuition then sort of the self-supervised 00:20:15.960 |
should be the, to borrow Yann LeCun's terminology, 00:20:23.680 |
the classification, maybe the supervised like layer 00:20:36.400 |
then you won't be able to sit and annotate everything. 00:20:44.960 |
I sat down and annotated like a bunch of cards 00:20:49.920 |
it was in a video and I was basically drawing boxes 00:20:53.560 |
And I think I spent about a week doing all of that 00:20:57.680 |
And basically, this was, I think my first year of my PhD 00:21:06.000 |
And when I had done that, someone came up to me 00:21:09.600 |
oh, this is a pickup truck, this is not a car. 00:21:12.800 |
And that's like, aha, this actually makes sense 00:21:23.640 |
- By the way, the annotation was bounding boxes? 00:21:27.000 |
- There's so many deep, profound questions here 00:21:32.200 |
by doing self-supervised learning, by the way, 00:21:39.040 |
maybe you don't ever need to answer that question. 00:21:49.960 |
drawing very careful line around this object? 00:21:57.520 |
I remember when I first saw semantic segmentation 00:22:03.640 |
where you have a very exact line around the object 00:22:27.880 |
Like I had like an existential crisis every time. 00:22:35.320 |
I'm not sure I have a good answer to what's better. 00:22:38.240 |
And I'm not sure I share the confidence that you have 00:22:41.520 |
that self-supervised learning can take us far. 00:22:50.840 |
but I still feel like we need to understand what makes, 00:22:54.160 |
like this dream of maybe what it's called symbolic AI, 00:23:01.400 |
of arriving, like once you have this common sense base, 00:23:09.000 |
and build graphs or hierarchies of concepts on top 00:23:18.440 |
of this three-dimensional world or four-dimensional world 00:23:22.040 |
and be able to reason and then project that onto 2D plane 00:23:30.960 |
I remember, I think Andrej Karpathy had a blog post 00:23:35.000 |
about computer vision, like being really hard. 00:23:39.000 |
I forgot what the title was, but it was many, many years ago. 00:23:42.080 |
And he had, I think President Obama stepping on a scale 00:23:45.600 |
and there was a bunch of people laughing and whatever. 00:23:48.440 |
And there's a lot of interesting things about that image. 00:23:52.000 |
And I think Andrej highlighted a bunch of things 00:24:04.040 |
You immediately project because of our knowledge of pose 00:24:10.360 |
you understand how the forces are being applied 00:24:17.440 |
is multiple people looking at each other in the image. 00:24:27.520 |
like is laughing at how humorous the situation is. 00:24:31.240 |
And this person is confused about what the situation is 00:24:45.040 |
Like in order to achieve that level of understanding 00:24:51.440 |
does self-supervised learning play in that, do you think? 00:24:58.440 |
I think Andrej and I think a lot of people agreed 00:25:03.320 |
Do you still think computer vision is really hard? 00:25:12.500 |
So if you ask me to solve just that particular problem, 00:25:17.560 |
I can always construct a data set and basically predict, 00:25:30.880 |
I mean, it won't be as bad as like randomly guessing. 00:25:37.840 |
- Yeah, maybe like Reddit upvotes is the signal. 00:25:41.240 |
- I mean, it won't do a great job, but it'll do something. 00:25:43.800 |
It may actually be like, it may find certain things 00:25:57.520 |
But the general problem you're saying is hard. 00:26:06.760 |
that are going to communicate with humans at the end of it, 00:26:08.760 |
you want to understand what the algorithm is doing, right? 00:26:10.880 |
You want it to be able to like produce an output 00:26:13.720 |
that you can decipher, that you can understand, 00:26:19.360 |
So at some point in this sort of entire loop, 00:26:23.720 |
And now this human needs to understand what's going on. 00:26:27.640 |
this entire notion of language or semantics really comes in. 00:26:36.280 |
So self-supervised learning is probably going to be useful 00:26:40.800 |
before the machine really needs to communicate 00:26:53.280 |
to build a big base of understanding or whatever, 00:27:02.300 |
Supervised learning in the context of computer vision 00:27:10.480 |
of what we're as a community working on today. 00:27:19.000 |
of self-supervised learning in natural language processing, 00:27:40.120 |
I kind of follow it a little bit from the sides. 00:27:47.880 |
I think it's called the distributional hypothesis in NLP. 00:27:52.640 |
that occur in the same context should have similar meaning. 00:27:55.960 |
So if you have the blank jumped over the blank, 00:27:59.040 |
it basically, whatever is like in the first blank 00:28:01.960 |
is basically an object that can actually jump, 00:28:05.820 |
So a cat or a dog, or I don't know, sheep, something, 00:28:13.420 |
if you have words that are in the same context 00:28:21.480 |
Because you're predicting by looking at their context 00:28:24.900 |
So in this particular case, the blank jumped over the fence. 00:28:28.240 |
So now if it's a sheep, the sheep jumped over the fence, 00:28:32.400 |
So essentially the algorithm or the representation 00:28:35.560 |
basically puts together these two concepts together. 00:28:37.600 |
So it says, okay, dogs are going to be kind of related 00:28:39.840 |
to sheep because both of them occur in the same context. 00:28:46.760 |
You can say that dogs are absolutely not related to sheep 00:28:53.000 |
I'm a dog food person and I really want to give 00:28:57.280 |
So depending on what your downstream application is, 00:29:00.080 |
of course, this notion of similarity or this notion 00:29:14.000 |
that the number of words in a particular language 00:29:24.120 |
I still got to, because we take it for granted, 00:29:28.380 |
you're talking about this very process of the blank, 00:29:33.440 |
and then having the knowledge of what word went there 00:29:38.540 |
That's the ground truth that you're training on, 00:29:53.320 |
and the other question is, is there other tricks? 00:30:01.940 |
In autonomous driving, there's a bunch of tricks 00:30:06.960 |
that give you the self-supervised signal back. 00:30:10.360 |
For example, very similar to sentences, but not really, 00:30:15.360 |
which is you have signals from humans driving the car, 00:30:23.680 |
and so you can ask the neural network to predict 00:30:27.820 |
what's going to happen in the next two seconds 00:30:30.260 |
for a safe navigation through the environment, 00:30:36.200 |
that you also have knowledge of what happened 00:30:38.640 |
in the next two seconds, because you have video of the data. 00:30:42.120 |
The question in autonomous driving, as it is in language, 00:30:58.880 |
How good can we get, and are there other tricks? 00:31:09.120 |
I wonder how many signals there are in the data 00:31:20.840 |
that maybe this masking process is self-supervised learning. 00:31:33.840 |
to leverage human computation in very interesting ways 00:31:36.880 |
that might actually border on semi-supervised learning, 00:31:40.840 |
Obviously, the internet is generated by humans 00:31:56.200 |
I mean, so Word2Vec, the initial sort of NLP technique 00:32:02.120 |
like all the BERT and all these big models that we get, 00:32:17.560 |
follows this other sentence in terms of logic, 00:32:19.600 |
so entailment, you can do a lot of these things 00:32:23.680 |
So I'm not sure if I can predict how far it can take us, 00:32:28.400 |
because when it first came out, when Word2Vec was out, 00:32:31.560 |
I don't think a lot of us would have imagined 00:32:33.520 |
that this would actually help us do some kind 00:32:44.640 |
neural network architectures has taken us from that to this, 00:32:47.600 |
is just showing you how maybe poor predictors we are, 00:32:52.280 |
like as humans, how poor we are at predicting 00:32:54.880 |
how successful a particular technique is going to be. 00:33:00.080 |
I look completely stupid basically predicting this. 00:33:02.800 |
- In the language domain, is there something in your work 00:33:12.560 |
but also just, I don't know, beautiful and profound 00:33:15.720 |
that I think carries through to the vision domain? 00:33:18.160 |
- I mean, the idea of masking has been very powerful. 00:33:21.040 |
It has been used in vision as well for predicting, 00:33:23.680 |
like you say, the next, if you have in sort of frames 00:33:27.200 |
and you predict what's going to happen in the next frame. 00:33:33.800 |
I think you would have asked about transformers while back. 00:33:38.360 |
like it has become super exciting for computer vision now. 00:33:40.840 |
Like in the past, I would say year and a half, 00:33:47.440 |
is something called the self-attention model. 00:33:50.440 |
And the idea basically is that if you have N elements, 00:33:53.780 |
what you're creating is a way for all of these N elements 00:33:57.880 |
So the idea basically is that you are paying attention. 00:34:08.980 |
you're basically getting a much better view of the data. 00:34:11.460 |
So for example, if you have a sentence of like four words, 00:34:18.320 |
it's constructed in a way such that each word 00:34:23.840 |
Now, the reason it's like different from say, 00:34:29.560 |
you would only pay attention to a local window. 00:34:31.400 |
So each word would only pay attention to its next neighbor 00:34:37.860 |
In images, you would basically pay attention to pixels 00:34:40.120 |
in a three cross three or a seven cross seven neighborhood. 00:34:43.680 |
Whereas with the transformer, the self-attention mainly, 00:34:48.760 |
needs to pay attention to each other element. 00:34:57.680 |
a wide context in terms of the wide context of the sentence 00:35:01.560 |
in understanding the meaning of a particular word 00:35:18.600 |
So whether it's like, you're looking at all the pixels 00:35:20.960 |
that are of a kitchen, of a dining table and so on. 00:35:23.760 |
And then you're basically looking at the banana also. 00:35:29.220 |
there's something funny about the word banana. 00:35:37.500 |
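A bare-bones sketch of the self-attention computation being described, where every element attends to every other element (just the scaled dot-product core; real transformers add learned projections, multiple heads, and so on):

```python
# Scaled dot-product self-attention over N elements (words or image patches):
# each output is a weighted combination of ALL inputs, not just a local window.
import math
import torch

def self_attention(x):                 # x: (N, d) sequence of N elements
    q, k, v = x, x, x                  # real models use learned projections here
    scores = q @ k.T / math.sqrt(x.shape[-1])   # (N, N) pairwise attention scores
    weights = torch.softmax(scores, dim=-1)     # each row sums to 1
    return weights @ v

tokens = torch.randn(4, 8)             # e.g. a 4-word sentence, 8-dim embeddings
print(self_attention(tokens).shape)    # torch.Size([4, 8])
```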
Okay, so masking has worked for the vision context as well. 00:35:42.400 |
- And so this transformer idea has worked as well. 00:36:01.000 |
where he would basically have a blob of pixels 00:36:12.320 |
But I'm not sure, it was one of these three things, 00:36:19.040 |
at that particular local window, you couldn't figure it out. 00:36:21.840 |
Because of resolution, because of other things, 00:36:23.840 |
it's just not easy always to just figure it out 00:36:26.040 |
by looking at just the neighborhood of pixels, 00:36:29.640 |
And the same thing happens for language as well. 00:36:31.960 |
- For the parameters that have to learn something 00:36:34.240 |
about the data, you need to give it the capacity 00:36:39.080 |
Like if it's not actually able to receive the signal at all, 00:36:42.600 |
then it's not gonna be able to learn that signal. 00:36:44.240 |
And to understand images, to understand language, 00:36:47.240 |
you have to be able to see words in their full context. 00:36:50.640 |
Okay, what is harder to solve, vision or language? 00:36:54.880 |
Visual intelligence or linguistic intelligence? 00:36:57.800 |
- So I'm going to say computer vision is harder. 00:36:59.760 |
My reason for this is basically that language, 00:37:03.240 |
of course, has a big structure to it because we developed it. 00:37:08.600 |
in a lot of animals, everyone is able to get by, 00:37:11.400 |
a lot of these animals on earth are actually able 00:37:24.240 |
it of course also has a linguistic component. 00:37:26.400 |
But it means that there is something far more fundamental 00:37:36.920 |
in the challenges that have to do with the progress 00:37:47.400 |
that we focused on, or we discovered self-attention 00:37:50.240 |
and transformers in the context of language first? 00:37:53.640 |
- So like the self-supervised learning success 00:38:02.480 |
I think it's just that the signal was a little bit different 00:38:11.240 |
So for vision, the main success has basically 00:38:26.920 |
For this particular question, let's go for it. 00:38:28.560 |
Okay, so the first thing is language is very structured. 00:38:47.280 |
Now for vision, let's imagine doing the same thing. 00:38:52.600 |
and we ask the network or this neural network 00:38:54.680 |
to predict what is present in this missing patch. 00:39:02.560 |
If you're even producing basically a seven cross seven 00:39:08.000 |
at each of these 169 or each of these 49 locations, 00:39:15.280 |
And very quickly, the kind of like prediction problems 00:39:27.560 |
like doing this like distribution over a finite set. 00:39:30.880 |
And the problem is when this set becomes really large, 00:39:37.000 |
and at solving basically this particular set of problems. 00:39:41.040 |
So if you were to do it exactly in the same way 00:39:44.200 |
as NLP for vision, there is very limited success. 00:39:51.680 |
It's basically by saying that you take these two 00:40:00.440 |
just saying that the distance between these vectors 00:40:06.640 |
from the visual signal than there is from NLP. 00:40:09.120 |
- Okay, the other reason is the distributional hypothesis 00:40:18.440 |
Now, because there are just finite number of words 00:40:22.280 |
and there is a finite way in like which we compose them, 00:40:27.440 |
but in language, there's a lot of structure, right? 00:40:33.760 |
There are lots of these sentences that you'll get. 00:40:41.480 |
This exact same sentence might occur in a different context. 00:40:50.460 |
which are because this particular token itself 00:40:53.560 |
you get a lot of these tokens or these words, 00:40:57.720 |
this related meaning across given this context. 00:41:07.440 |
There might be like different noise in the sensor. 00:41:09.800 |
So the thing is you're capturing a physical phenomenon, 00:41:13.840 |
a very complicated pipeline of like image processing, 00:41:27.440 |
And each of these tokens are very, very well defined. 00:41:30.160 |
- There could be a little bit of an argument there, 00:41:45.480 |
are you getting close to being able to solve, 00:41:49.360 |
easily with flying colors past the Turing test kind of thing. 00:41:56.560 |
And the computer vision problem is in the 2D plane 00:41:59.760 |
is a projection of a three dimensional world. 00:42:06.480 |
- I mean, I think what I'm saying is NLP is not easy. 00:42:09.500 |
Like abstract thought expressed in knowledge, 00:42:16.720 |
I mean, we've been communicating with language for so long, 00:42:19.140 |
and it is of course a very complicated concept. 00:42:21.980 |
The thing is, at least getting like somewhat reasonable, 00:42:26.980 |
like being able to solve some kind of reasonable tasks 00:42:43.380 |
I feel like for both language and computer vision, 00:42:56.580 |
And I feel like for language, that wall is farther away. 00:43:06.540 |
You can even fool people that you're tweeting 00:43:11.460 |
or your question answering has intelligence behind it. 00:43:16.460 |
But to truly demonstrate understanding of dialogue, 00:43:25.020 |
that would require perhaps big breakthroughs. 00:43:30.420 |
I think the big breakthroughs need to happen earlier 00:43:36.620 |
This might be a good place to, you already mentioned it, 00:43:43.860 |
- Contrastive learning is sort of a paradigm of learning 00:43:46.860 |
where the idea is that you are learning this embedding space 00:43:50.700 |
or so you're learning this sort of vector space 00:43:54.500 |
And the way you learn that is basically by contrasting. 00:43:59.100 |
you have another sample that's related to it. 00:44:02.860 |
And you have another sample that's not related to it. 00:44:10.980 |
So you have an image of a cat, you have an image of a dog. 00:44:14.500 |
And for whatever application that you're doing, 00:44:16.540 |
say you're trying to figure out what pets are, 00:44:18.860 |
you're saying that these two images are related. 00:44:22.300 |
but now you have another third image of a banana 00:44:36.780 |
And now what you're training the network to do 00:44:38.180 |
is basically pull both of these features together 00:44:42.100 |
while pushing them away from the feature of a banana. 00:44:47.860 |
So there's always this notion of a negative and a positive. 00:44:54.140 |
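That pull-together / push-apart objective can be sketched as an InfoNCE-style loss, where each anchor has one positive and the rest of the batch serves as negatives (a simplified version; the exact formulation varies across contrastive methods):

```python
# InfoNCE-style contrastive loss sketch: the i-th anchor should match the
# i-th positive, while all other items in the batch act as negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(anchors, positives, temperature=0.1):
    a = F.normalize(anchors, dim=1)        # (B, D) unit-norm features
    p = F.normalize(positives, dim=1)      # (B, D)
    logits = a @ p.T / temperature         # (B, B) pairwise similarities
    targets = torch.arange(a.shape[0])     # diagonal entries are the positives
    return F.cross_entropy(logits, targets)

crop_1 = torch.randn(4, 128)   # features of one crop per image in a batch of 4
crop_2 = torch.randn(4, 128)   # features of a second crop of the same images
print(contrastive_loss(crop_1, crop_2))
```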
that Yann sort of explains a lot of these methods. 00:45:00.700 |
or more than that, like when I joined Facebook, 00:45:02.860 |
Yann used to keep mentioning this word energy-based models. 00:45:05.140 |
And of course I had no idea what he was talking about. 00:45:18.340 |
rather than talking about probability distributions, 00:45:21.980 |
So models are trying to minimize certain energies 00:45:25.020 |
or they're trying to maximize a certain kind of energy. 00:45:29.780 |
you can explain a lot of the contrastive models, 00:45:33.300 |
which are like generative adversarial networks. 00:45:45.340 |
And so by putting this common sort of language 00:45:49.740 |
what looks very different in machine learning 00:45:51.820 |
that VAEs are very different from what GANs are, 00:45:54.180 |
are very, very different from what contrastive models are. 00:46:04.220 |
or minimizing this energy function is slightly different. 00:46:10.380 |
and putting a sexy word on top of it like energy. 00:46:21.180 |
So basically the idea is that if you were to imagine 00:46:23.540 |
like the embedding as a manifold, a 2D manifold, 00:46:26.460 |
you would get a hill or like a high sort of peak 00:46:44.100 |
- Right, so this is where all the sort of ingenuity 00:46:47.860 |
So for example, like you can take the fill in the blank 00:46:51.660 |
problem or you can take in the context problem. 00:46:57.740 |
Two words that are in different contexts are not related. 00:47:00.500 |
For images, basically two crops from the same image 00:47:02.980 |
are related and whereas a third image is not related at all. 00:47:12.700 |
Whereas a third frame from a different video is not related. 00:47:15.580 |
So it basically is, it's a very general term. 00:47:34.500 |
it can basically also be used for self-supervised learning. 00:47:37.580 |
- So you mentioned one of the ideas in the vision context 00:47:53.260 |
Obviously, there's a bunch of other techniques. 00:47:58.580 |
in images, lighting is something that varies a lot 00:48:01.620 |
and you can artificially change those kinds of things. 00:48:04.500 |
There's the whole broad field of data augmentation, 00:48:07.700 |
which manipulates images in order to increase arbitrarily 00:48:15.860 |
And second of all, what's the role of data augmentation 00:48:18.140 |
in self-supervised learning and contrastive learning? 00:48:22.020 |
- So data augmentation is just a way, like you said, 00:48:34.900 |
where you can just increase, say, the colors, 00:48:37.340 |
like the colors or the brightness of the image, 00:48:39.140 |
or increase or decrease the contrast of the image, 00:48:46.220 |
to basically perturb the data or augment the data. 00:48:51.060 |
And so it has played a fundamental role for computer vision, 00:48:59.180 |
contrastive or otherwise, is by taking an image, 00:49:05.340 |
and then computing basically two perturbations of it. 00:49:08.580 |
So these can be two different crops of the image 00:49:14.980 |
So you jitter the colors a little bit and so on. 00:49:26.300 |
from both of these perturbations to be similar. 00:49:28.940 |
So now you can use a variety of different ways 00:49:36.020 |
So basically, both of these things are positives, 00:49:40.420 |
You can do this basically by like clustering. 00:49:43.460 |
For example, you can say that both of these images should, 00:49:48.140 |
should belong in the same cluster because they're related. 00:49:55.140 |
to basically enforce this particular constraint. 00:50:09.660 |
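For concreteness, a typical two-perturbation augmentation pipeline might look roughly like this (a SimCLR-style sketch with torchvision; the specific augmentations and parameters differ from paper to paper):

```python
# Sketch of an augmentation pipeline: every image yields two randomly
# perturbed "views" that the network is trained to map to similar features.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),                       # random crop
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4,
                           saturation=0.4, hue=0.1),         # jitter the colors
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),                 # blur
    transforms.ToTensor(),
])

def two_views(pil_image):
    # Same image, two independent random perturbations.
    return augment(pil_image), augment(pil_image)
```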
- So the neural network basically takes in the image 00:50:20.020 |
like different crops that you computed to be similar. 00:50:28.140 |
in this multidimensional space to each other. 00:50:50.140 |
you kind of have to see it from two, three, multiple angles. 00:51:03.200 |
like in order for us to place a concept in its proper place, 00:51:08.580 |
we have to basically crop it in all kinds of ways, 00:51:14.420 |
in whatever very clever ways that the brain likes to do. 00:51:25.060 |
So like babies, for example, pick up objects, 00:51:26.980 |
like move them and put them close to their eye and whatnot. 00:51:31.180 |
actually we are good at imagining it as well, right? 00:51:36.940 |
I've never basically looked at it from like top down. 00:51:40.700 |
I could very well tell you that that's an elephant. 00:51:45.340 |
we naturally build it or transfer it from other objects 00:51:47.820 |
that we've seen to imagine what it's going to look like. 00:51:59.860 |
but not just like normal things, like wild things, 00:52:03.340 |
but they're nevertheless physically consistent. 00:52:06.880 |
- So, I mean, people do kind of like occlusion 00:52:14.080 |
gray box to sort of mask out a certain part of the image. 00:52:17.360 |
And the thing is basically you're kind of occluding it. 00:52:19.880 |
For example, you place it say on half of a person's face. 00:52:28.160 |
- So, no, I meant like you have like, what is it? 00:52:45.280 |
Well, maybe not elves, but like puppies and kittens 00:52:51.240 |
and like constantly be generating that wild imagination. 00:52:57.520 |
that's currently applied, it's super ultra, very boring. 00:53:02.880 |
I wonder if there's a benefit to being wildly imaginable 00:53:07.000 |
while trying to be consistent with physical reality. 00:53:11.840 |
- I think it's a kind of a chicken and egg problem, right? 00:53:14.160 |
Because to have like amazing data augmentation, 00:53:18.480 |
And what we're trying to do data augmentation 00:53:23.720 |
- Before you understand it, just put elves with bananas 00:53:33.920 |
Okay, so what are the different kinds of data augmentation 00:53:36.920 |
that you've seen to be effective in visual intelligence? 00:53:42.000 |
it's a lot of these image filtering operations. 00:53:45.760 |
all the kinds of Instagram filters that you can think of. 00:53:49.400 |
So like arbitrarily like make the red super red, 00:53:52.480 |
make the green super greens, like saturate the image. 00:53:59.560 |
Like I said, lighting is a really interesting one to me. 00:54:04.760 |
- So I mean, the augmentations that we work on 00:54:08.920 |
They're not going to be like physically realistic versions 00:54:11.720 |
It's not that you're assuming that there's a light source up 00:54:17.040 |
It's really more about like brightness of the image, 00:54:22.560 |
- But this is a really important point to me. 00:54:39.080 |
So I wonder if there's big improvements to be achieved 00:54:42.560 |
on much more intelligent kinds of data augmentation. 00:54:55.260 |
To me, it seems like data augmentation potentially 00:55:05.320 |
- You're almost like thinking of like generative kind of, 00:55:11.040 |
it's like very active imagination of messing with the world 00:55:14.840 |
and teaching that mechanism for messing with the world 00:55:40.520 |
probably the possibilities were wilder, more numerous. 00:55:53.120 |
So I wonder if you think there's a lot of breakthroughs 00:55:57.160 |
And maybe also, can you just comment on the stuff we have? 00:55:59.780 |
Is that a big part of self-supervised learning? 00:56:02.960 |
So data augmentation is like key to self-supervised learning. 00:56:05.520 |
That has like the kind of augmentation that we're using. 00:56:08.320 |
And basically the fact that we're trying to learn 00:56:11.040 |
these neural networks that are predicting these features 00:56:13.920 |
from images that are robust under data augmentation 00:56:17.080 |
has been the key for visual self-supervised learning. 00:56:19.560 |
And they play a fairly fundamental role to it. 00:56:28.400 |
is that you feed in the pixels to the neural network 00:56:31.160 |
and it should figure out the patterns on its own. 00:56:35.640 |
You shouldn't really go and handcraft these features. 00:56:48.200 |
what kinds of data augmentation that we're looking for? 00:56:50.840 |
We are encoding a very sort of human specific bias there 00:57:15.760 |
is actually going into the data augmentation. 00:57:17.600 |
So although we are calling it self-supervised learning, 00:57:19.680 |
a lot of the human knowledge is actually being encoded 00:57:23.520 |
So it's really like we've kind of sneaked away 00:57:27.120 |
and we're like really designing these nice list 00:57:29.440 |
of data augmentations that are working very well. 00:57:31.640 |
- Of course, the idea is that it's much easier 00:57:33.720 |
to design a list of data augmentation than it is to do. 00:57:36.600 |
So humans are doing nevertheless doing less and less work 00:57:39.640 |
and maybe leveraging their creativity more and more. 00:57:42.600 |
And when we say data augmentation is not parameterized, 00:57:45.080 |
it means it's not part of the learning process. 00:57:50.560 |
some of the data augmentation into the learning process? 00:57:54.960 |
And in fact, it will be really beneficial for us 00:58:01.840 |
For example, like when you have certain concepts, 00:58:08.160 |
and then basically you change the color of the banana. 00:58:15.920 |
like it has no notion of what is present in the image. 00:58:22.600 |
And now what we're doing is we're telling the neural network 00:58:24.760 |
that this red banana and so a crop of this image 00:58:28.240 |
which has the red banana and a crop of this image 00:58:38.560 |
should take into account what is present in the image 00:58:43.920 |
It shouldn't be completely independent of the image. 00:58:48.840 |
instead of being drastic, do subtle augmentation 00:58:54.120 |
I'm not sure if it's subtle, but like realistic for sure. 00:58:56.280 |
- If it's realistic, then even subtle augmentation 00:59:06.440 |
if for example, now we're doing medical imaging, 00:59:15.080 |
So if you were to like actually loop in data augmentation 00:59:34.960 |
and the purists and all of us basically say that, 00:59:37.560 |
okay, this should learn useful representations 00:59:39.440 |
and they should be useful for any kind of end task, 00:59:47.760 |
Maybe the first baby step for us should be that, 00:59:50.480 |
okay, if you're trying to loop in this data augmentation 00:59:59.560 |
Or are we trying to distinguish between banana and apple? 01:00:02.040 |
Or are we trying to do all of these things at once? 01:00:04.400 |
And so some notion of like what happens at the end 01:00:07.920 |
might actually help us do much better at this side. 01:00:16.240 |
like a choice to have an arbitrary large data set 01:00:21.280 |
versus really good data augmentation algorithms, 01:00:26.560 |
which would you like to train in a self-supervised way on? 01:00:31.240 |
So natural data from the internet are arbitrary large, 01:00:40.200 |
good data augmentation on the finite data set. 01:00:44.440 |
because our learning algorithms for vision right now 01:00:49.360 |
even if you were to give me like an infinite source 01:00:52.880 |
I still need a good data augmentation algorithm. 01:00:59.000 |
because you've given me an arbitrarily large data set, 01:01:04.880 |
construct like these two perturbations of it, 01:01:18.040 |
So you can like reduce down the amount of data 01:01:26.440 |
than giving me like 10 times the size of that data, 01:01:30.960 |
a very primitive data augmentation algorithm. 01:01:32.640 |
- Like through tagging and all those kinds of things, 01:01:37.280 |
that are semantically similar on the internet? 01:01:44.960 |
farther away than you would be comfortable with. 01:01:47.880 |
- So I mean, yes, tagging will help you a lot. 01:01:51.520 |
in figuring out what images are related or not. 01:02:12.400 |
which are going to be applicable pretty much to anything. 01:02:23.840 |
These tags are like very indicative of what's going on. 01:02:26.480 |
And they are, I mean, they are human supervision. 01:02:29.480 |
- Yeah, this is one of the tasks of discovering 01:02:50.280 |
It'd be exciting to discover ways to leverage 01:03:00.200 |
humans driving and machines can learn from the driving. 01:03:03.040 |
I always hoped that there could be some supervision signal 01:03:08.200 |
because there's so many people that play video games 01:03:13.920 |
is put into video games, into playing video games. 01:03:17.720 |
And you can design video games somewhat cheaply 01:03:24.640 |
It feels like that could be leveraged somehow. 01:03:28.720 |
Like there are actually folks right here in UT Austin, 01:03:30.880 |
like Philipp Krähenbühl is a professor at UT Austin. 01:03:38.000 |
I mean, it's really fun, like as a PhD student 01:03:40.080 |
getting to basically play video games all day. 01:03:42.200 |
- Yeah, but so I do hope that kind of thing scales 01:03:44.960 |
and like ultimately boils down to discovering 01:03:54.080 |
But that said, there's non-contrastive methods. 01:04:07.840 |
you have this notion of a positive and a negative. 01:04:10.760 |
Now, the thing is this entire learning paradigm 01:04:25.720 |
The thing is this is a fairly simple analogy, right? 01:04:32.480 |
So very quickly, if this is the only source of supervision 01:04:36.680 |
your learning is not going to be like after a point 01:04:38.720 |
then neural network is really not going to learn a lot 01:04:43.960 |
So it can be, oh, a cat and a dog are very similar, 01:04:46.720 |
but they're very different from a Volkswagen Beetle. 01:04:54.960 |
the quality of the negative sample really matters a lot. 01:05:00.360 |
that typically these methods that are contrastive 01:05:04.960 |
which becomes harder and harder to sort of scale 01:05:10.960 |
why non-contrastive methods have become popular 01:05:13.760 |
and why people think that they're going to be more useful. 01:05:18.480 |
like clustering is one non-contrastive method. 01:05:20.960 |
The idea basically being that you have two of these samples. 01:05:24.720 |
So the cat and dog are two crops of this image. 01:05:29.320 |
And so essentially you're basically doing clustering online 01:05:35.120 |
and which is very different from having access 01:05:38.960 |
The other way which has become really popular 01:05:43.180 |
So the idea basically is that you have a teacher network 01:05:51.120 |
and basically the neural network figures out the patterns, 01:06:01.680 |
that the features produced by the teacher network 01:06:04.000 |
and the student network should be very similar. 01:06:16.360 |
how to have these two sorts of parallel networks, 01:06:30.160 |
but they're different enough that you can actually 01:06:34.000 |
- So you can ensure that they always remain different enough 01:06:38.200 |
so that the thing doesn't collapse into something boring. 01:06:41.880 |
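One common way to keep those two networks "different enough" is to make the teacher an exponential moving average of the student, roughly as in BYOL/DINO-style self-distillation (a sketch only; predictor heads, centering, and the actual loss are omitted):

```python
# Self-distillation sketch: the teacher's weights trail the student's as an
# exponential moving average, so the two networks stay slightly different
# and the training signal does not collapse to something trivial.
import copy
import torch

student = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(),
                              torch.nn.Linear(64, 32))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False            # the teacher is never trained directly

@torch.no_grad()
def update_teacher(momentum=0.99):
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_((1.0 - momentum) * s)

# After each gradient step on the student: update_teacher()
```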
So the main sort of enemy of self-supervised learning, 01:06:44.360 |
any kind of similarity maximization technique is collapse. 01:06:59.200 |
And so all we need to do is basically come up 01:07:05.360 |
And then for example, like clustering or self-distillation 01:07:09.240 |
We also had a recent paper where we used like decorrelation 01:07:13.120 |
between like two sets of features to prevent collapse. 01:07:16.780 |
So that's inspired a little bit by like Horace Barlow's 01:07:20.720 |
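The decorrelation idea mentioned here can be sketched as a Barlow Twins-style loss: push the cross-correlation matrix between the two sets of features toward the identity, so the two views agree while the feature dimensions stay non-redundant (a simplified version, not the exact implementation from the paper):

```python
# Decorrelation sketch: batch-normalize the two sets of features, compute
# their (D x D) cross-correlation, and drive it toward the identity matrix.
# Diagonal -> 1 means the two views agree; off-diagonal -> 0 means the
# feature dimensions are decorrelated, preventing collapse without negatives.
import torch

def decorrelation_loss(z1, z2, off_diag_weight=0.005):
    batch_size = z1.shape[0]
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / batch_size                 # (D, D) cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + off_diag_weight * off_diag
```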
- By the way, I should comment that whoever counts 01:07:23.560 |
the number of times the word banana, apple, cat and dog 01:07:27.800 |
we're using this conversation wins the internet. 01:07:31.140 |
What is SwAV and the main improvement proposed 01:07:36.800 |
in the paper on unsupervised learning of visual features 01:07:43.000 |
- SwAV basically is a clustering-based technique, 01:07:52.440 |
And the idea basically is that you want the features 01:07:58.920 |
And basically crops that are coming from different images 01:08:22.000 |
So this is offline basically because I need to do one pass 01:08:27.240 |
SwAV is basically just a simple way of doing this online. 01:08:31.820 |
you're actually computing these clusters online. 01:08:34.800 |
And so of course there is like a lot of tricks involved 01:08:37.480 |
in how to do this in a robust manner without collapsing, 01:08:42.440 |
- Is there a nice way to say what is the key methodology 01:08:54.920 |
like there are always K clusters in a data set. 01:09:02.200 |
when you look at any sort of small number of examples, 01:09:04.840 |
all of them must belong to one of these K clusters. 01:09:16.880 |
should be equally partitioned into K clusters. 01:09:21.800 |
they have equal contribution to these N samples. 01:09:30.680 |
So all this, if all features become the same, 01:09:33.180 |
then you have basically just one mega cluster. 01:09:35.160 |
You don't even have like 10 clusters or 3000 clusters. 01:09:38.160 |
So SwAV basically ensures that at each point, 01:09:46.280 |
Basically just figure out how to do this online. 01:09:55.760 |
- And the fact they have a fixed K makes things simpler. 01:10:00.400 |
Our clustering is not like really hard clustering, 01:10:03.760 |
So basically you can be 0.2 to cluster number one 01:10:09.920 |
So essentially, even though we have like 3000 clusters, 01:10:19.240 |
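The equal-partition constraint can be sketched as a few rounds of alternating row and column normalization over the soft assignment matrix, in the spirit of the Sinkhorn-style procedure SwAV applies online (a rough sketch; the actual implementation has more moving parts):

```python
# Sketch: turn raw sample-to-prototype scores into soft cluster assignments
# where every cluster receives roughly equal total mass. This equipartition
# constraint is what prevents all samples from collapsing into one cluster.
import torch

def equal_partition(scores, n_iters=3):
    q = torch.exp(scores / 0.1)                  # (N, K) positive scores
    for _ in range(n_iters):
        q = q / q.sum(dim=0, keepdim=True)       # equalize mass per cluster
        q = q / q.sum(dim=1, keepdim=True)       # each sample's row sums to 1
    return q

scores = torch.randn(8, 5)        # 8 samples, 5 prototypes (clusters)
q = equal_partition(scores)
print(q.sum(dim=1))               # ~1 per sample (soft assignment)
print(q.sum(dim=0))               # roughly equal mass per cluster
```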
And what are the key results and insights in the paper, 01:10:23.120 |
self-supervised pre-training of visual features in the wild? 01:10:42.920 |
the way we sort of operate is like in the research community, 01:10:48.560 |
which of course I talked about as having lots of labels. 01:10:54.240 |
that went behind basically the labeling process. 01:11:06.720 |
has a particular distribution of concepts, right? 01:11:13.680 |
of course, belong to a certain set of noun concepts. 01:11:33.440 |
actually really exploit this bias of ImageNet. 01:11:41.000 |
always uses ImageNet sort of as the benchmark 01:11:43.160 |
to show the success of self-supervised learning. 01:11:46.640 |
particular limitations to this kind of dataset? 01:12:08.800 |
- Yeah, but you would, for a more in the wild dataset, 01:12:12.000 |
you would need to be cleverer and more careful 01:12:21.400 |
One, basically to move away from ImageNet for training. 01:12:24.680 |
So the images that we used were like uncurated images. 01:12:40.080 |
So we did not say that, oh, images that belong to dogs 01:12:47.000 |
And basically other images should be thrown out. 01:12:53.560 |
And of course, it also goes back to like the problem 01:12:57.320 |
So these were basically about a billion or so images. 01:13:08.600 |
if we can train a very large convolutional model 01:13:18.280 |
So is self-supervised learning really over fit to ImageNet? 01:13:27.520 |
Will it actually be able to still figure out, 01:13:29.680 |
you know, different types of objects and so on? 01:13:33.720 |
it would actually do better than an ImageNet trained model? 01:13:38.160 |
And so for SEER, one of our main findings was that 01:13:46.360 |
without really necessarily filtering them out. 01:13:57.720 |
You don't really need to sit and filter them out. 01:13:59.720 |
These images can be cartoons, these can be memes, 01:14:02.040 |
these can be actual pictures uploaded by people. 01:14:04.440 |
And you don't really care about what these images are. 01:14:06.160 |
You don't even care about what concepts they contain. 01:14:10.280 |
- What image selection mechanism would you say is there, 01:14:18.840 |
So you're kind of implying that there's almost none, 01:14:24.960 |
- Right, so it's not like, uncurated can basically, 01:14:32.400 |
like cameras that can take pictures at random viewpoints. 01:14:37.400 |
they are typically going to care about the framing of it. 01:14:41.840 |
the picture of a zoomed in wall, for example. 01:14:43.800 |
- Well, when we say internet, do we mean social networks? 01:14:48.680 |
of like a zoomed in table or a zoomed in wall. 01:14:53.160 |
because people do have the like photographer's bias, 01:15:00.280 |
nice looking things and so on in the picture. 01:15:02.640 |
So that's the kind of bias that typically exists 01:15:05.640 |
in this dataset and also the user base, right? 01:15:15.440 |
or may not even have access to a lot of internet. 01:15:17.360 |
- So this is a giant dataset and a giant neural network. 01:15:34.160 |
we've basically started using transformers for vision. 01:15:41.120 |
you might choose to use a particular formulation. 01:15:51.200 |
when it comes to compute versus like accuracy. 01:15:59.680 |
and basically it worked really well in terms of scaling. 01:16:05.480 |
- Can you maybe quickly comment on what RegNets are? 01:16:16.960 |
efficient neural networks, large neural networks. 01:16:19.480 |
- So one of the sort of key takeaways from this paper, 01:16:21.760 |
which the author, like whenever you hear them 01:16:29.000 |
Flops basically being the floating point operations. 01:16:33.280 |
this model is like really computationally heavy, 01:16:36.160 |
or like our model is computationally cheap and so on. 01:16:38.960 |
Now it turns out that flops are really not a good indicator 01:16:52.080 |
And so designing like one of the key findings 01:17:04.760 |
So RegNet is basically a network architecture family 01:17:13.520 |
And of course, it builds upon like earlier work, 01:17:20.360 |
But one of the things in this work is basically, 01:17:22.360 |
they also use like squeeze excitation blocks. 01:17:25.040 |
So it's a lot of nice sort of technical innovation 01:17:28.680 |
and a lot of the ingenuity of these particular authors 01:17:31.360 |
in how to combine these multiple building blocks. 01:17:35.960 |
for both flops and memory when you're basically doing this, 01:17:57.320 |
neural network architectures with lots of parameters, 01:17:59.520 |
lots of flops, but also because they're like efficient 01:18:01.960 |
in terms of the amount of memory that they're using, 01:18:06.560 |
you can fit a very large model on a single GPU, for example. 01:18:09.600 |
- Would you say that the choice of architecture 01:18:18.520 |
Is there a possibility to say what matters more? 01:18:21.680 |
You kind of implied that you can probably go really far 01:18:27.560 |
- All right, I think data like data and data augmentation, 01:18:30.520 |
the algorithm being used for the self supervised training 01:18:33.240 |
matters a lot more than the particular kind of architecture. 01:18:51.640 |
depending on like the particular task that you care about, 01:18:53.760 |
they have certain advantages and disadvantages. 01:19:07.720 |
of how to effectively train something like that fast? 01:19:15.440 |
- I mean, so the model was like a billion parameters. 01:19:21.440 |
- So if like, basically the same number of parameters 01:19:23.320 |
as the number of images, and it took a while. 01:19:26.160 |
I don't remember the exact number, it's in the paper, 01:19:33.720 |
when you're thinking of scaling this kind of thing, 01:19:41.880 |
of self-supervised learning is the several orders 01:19:47.320 |
both in your own network and the size of the data. 01:19:57.840 |
or is that really outside of even deep learning? 01:20:15.440 |
there is a lot of intercommunication between nodes. 01:20:17.720 |
So like gradients or the model parameters are being passed. 01:20:20.560 |
So you really want to minimize communication costs 01:20:22.720 |
when you really want to scale these models up. 01:20:29.120 |
like as limited amount of communication as possible. 01:20:35.040 |
So essentially after every sort of gradient step, 01:20:38.440 |
all you basically have like a synchronization step 01:20:55.240 |
But the main thing is like minimize the amount 01:21:01.840 |
That has been the key takeaway, at least in my experience. 01:21:14.120 |
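In PyTorch terms, the synchronize-once-per-step pattern is roughly what DistributedDataParallel provides: gradients are all-reduced across processes during the backward pass, and that is essentially the only per-step communication (a generic sketch, not the actual SEER training code; it assumes the usual distributed environment variables are set):

```python
# Generic data-parallel training sketch: each process holds a model replica,
# and gradients are averaged across processes once per step. Keeping the
# communication to this single synchronization is the main scaling concern.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(local_rank, model, loader, steps=100):
    dist.init_process_group("nccl")    # assumes MASTER_ADDR/RANK/etc. are set
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _, (x, _) in zip(range(steps), loader):
        loss = model(x.cuda(local_rank)).pow(2).mean()   # placeholder loss
        opt.zero_grad()
        loss.backward()    # gradients are all-reduced across processes here
        opt.step()
```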
like that fast communication that you're talking to, 01:21:17.600 |
well, when they're training machine learning. 01:21:30.080 |
- VISSL basically was born out of a lot of us at Facebook 01:21:33.000 |
are doing the self-supervised learning research. 01:21:35.120 |
So it's a common framework in which we have like a lot 01:21:38.680 |
of self-supervised learning methods implemented for vision. 01:21:41.680 |
It's also, it has in itself like a benchmark of tasks 01:21:45.920 |
that you can evaluate the self-supervised representations on. 01:21:48.760 |
So the use case for it is basically for anyone 01:21:51.200 |
who's either trying to evaluate their self-supervised model 01:21:59.240 |
So it's basically supposed to be all of these things. 01:22:01.480 |
So as a researcher, before VISSL, for example, 01:22:04.440 |
or like when we started doing this work fairly seriously 01:22:09.240 |
and implement every self-supervised learning model, 01:22:11.840 |
test it out in a like sort of consistent manner. 01:22:18.120 |
Even when someone said that they were reporting 01:22:20.360 |
image net accuracy, it could mean lots of different things. 01:22:23.160 |
So with VISSL, we tried to really sort of standardize that 01:22:29.720 |
And so VISSL basically builds upon a lot of this kind of work 01:22:37.160 |
we come up with a self-supervised learning method, 01:22:38.960 |
a lot of us try to push that into VISSL as well, 01:22:41.160 |
just so that it basically is like the central piece 01:22:49.200 |
so certainly outside of Facebook, but just researchers, 01:22:52.000 |
or just even people that know how to program in Python 01:22:54.920 |
and know how to use PyTorch, what would be the use case? 01:22:58.640 |
What would be a fun thing to play around with VISSL on? 01:23:04.280 |
with self-supervised learning on, would you say? 01:23:09.760 |
Like is it always about big size that's important to have, 01:23:14.600 |
or is there fun little smaller case playgrounds 01:23:19.720 |
- So we're trying to like push something towards that. 01:23:24.320 |
but nothing like super standard on the smaller scale. 01:23:26.800 |
I mean, ImageNet in itself is actually pretty big also. 01:23:29.280 |
So that is not something which is like feasible 01:23:32.240 |
for a lot of people, but we are trying to like push up 01:23:38.960 |
a lot of the observations or a lot of the algorithms 01:23:53.240 |
I've been trying to do that for a little bit as well, 01:23:54.880 |
because it does take time to train stuff on ImageNet. 01:23:56.800 |
It does take time to train on like more images, 01:23:59.840 |
but pretty much every time I've tried to do that, 01:24:02.200 |
it's been unsuccessful because all the observations 01:24:04.080 |
I draw from my set of experiments on a smaller data set 01:24:09.120 |
or like don't translate into another sort of data set. 01:24:11.720 |
So it's been hard for us to figure this one out, 01:24:31.400 |
What are the key results, insights in this paper, 01:24:33.840 |
and what can you say in general about the promise 01:24:37.600 |
- For this paper, it actually came as a little bit 01:24:41.920 |
So I can describe what the problem setup was. 01:24:44.120 |
So it's been used in the past by lots of folks, 01:24:55.440 |
but I wasn't really sure how well it would work in practice 01:25:02.400 |
and I wasn't sure if like a lot of our insights 01:25:04.160 |
from self-supervised learning would translate 01:25:08.280 |
So multimodal learning is when you have like, 01:25:17.520 |
- Okay, so the particular modalities that we worked on 01:25:22.000 |
So the idea was basically if you have a video, 01:25:35.440 |
So what we did in this work was basically trained 01:25:39.400 |
one on the video signal, one on the audio signal. 01:25:43.800 |
that we get from both of these neural networks 01:25:51.120 |
and the same kinds of features from the audio. 01:25:54.280 |
Well, for a lot of these objects that we have, 01:26:07.280 |
So that's a case where you can't learn anything about bananas. 01:26:11.640 |
- Well, yes, when they say the word banana, then- 01:26:16.360 |
as a source, that source of audio is useless. 01:26:20.640 |
for example, someone playing a musical instrument. 01:26:22.440 |
So guitars have a particular kind of sound and so on. 01:26:24.680 |
So because a lot of these things are correlated, 01:26:30.120 |
video and audio, and learn a common embedding space, 01:26:35.200 |
related modalities can basically be close together. 01:26:38.560 |
And again, you use contrastive learning for this. 01:26:40.600 |
So in contrastive learning, basically the video and its corresponding audio are the positive pair, 01:26:45.520 |
and you can take any other video or any other audio as the negatives. 01:26:51.000 |
It's just a simple application of contrastive learning. 01:26:53.720 |
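As a rough sketch of that setup, here is a minimal PyTorch version of the cross-modal contrastive loss, assuming two hypothetical encoders that each map a clip to an embedding: the i-th video and i-th audio in a batch come from the same clip and form the positive pair, while every other pairing acts as a negative.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(video_emb, audio_emb, temperature=0.07):
    """video_emb, audio_emb: [batch, dim] embeddings of the same batch of clips,
    produced by two (hypothetical) video and audio networks."""
    v = F.normalize(video_emb, dim=1)
    a = F.normalize(audio_emb, dim=1)
    logits = v @ a.t() / temperature             # [batch, batch] similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric InfoNCE: match each video to its own audio, and vice versa.
    loss_v2a = F.cross_entropy(logits, targets)
    loss_a2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2a + loss_a2v)

# Usage with random stand-in embeddings:
loss = cross_modal_contrastive_loss(torch.randn(32, 128), torch.randn(32, 128))
```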
The main sort of finding from this work for us 01:27:05.400 |
was that the features that we ended up learning can actually be used 01:27:07.480 |
for downstream tasks, for example, recognizing human actions 01:27:11.000 |
or recognizing different types of sounds, for example. 01:27:17.160 |
- Can you give kind of an example of a human action 01:27:24.360 |
- Right, so there is this dataset called Kinetics, 01:27:26.880 |
for example, which has like 400 different types 01:27:29.480 |
So people jumping, people doing different kinds 01:27:34.280 |
So like different strokes in swimming, golf, and so on. 01:27:57.880 |
So basically give it a video and basically play the audio, 01:27:57.880 |
and look at where the network thinks the sound is coming from. 01:28:08.880 |
So when people are speaking, it can actually figure out where their mouth is. 01:28:22.920 |
So it can actually distinguish different people's voices, 01:28:30.480 |
Without that ever being annotated in any way. 01:28:33.640 |
- Right, so this is all what it had discovered. 01:28:43.480 |
It had seen so many examples of this sound coming with this kind of like an object 01:28:46.680 |
that it basically learns to associate this sound with that object. 01:28:55.200 |
And the way you use it is that you then fine tune it for a particular task. 01:28:57.920 |
So this is forming like a really good knowledge base 01:29:01.880 |
within a neural network based on which you could then 01:29:04.520 |
train a little bit more to accomplish a specific task. 01:29:12.800 |
You can just use a few of them to basically get your-- 01:29:18.520 |
that it can figure out where the sound is coming from? 01:29:33.000 |
Does it speak to the idea that multiple modalities 01:29:39.240 |
are somehow much bigger than the sum of their parts 01:29:44.120 |
or is it really, really useful to have multiple modalities 01:29:47.960 |
or is this just a cool thing that there's parts 01:30:03.920 |
- I would say a little tending more towards the second part. 01:30:07.840 |
So most of it can be sort of figured out with one modality 01:30:10.760 |
but having an extra modality always helps you. 01:30:13.240 |
So in this case, for example, like one thing is when you're, 01:30:22.040 |
whether it's an apple or whether it's an onion, 01:30:29.840 |
because apples and onions make a very different kind of sound. 01:30:34.920 |
So you really figure this out based on audio. 01:30:40.120 |
when you have access to different kinds of modalities. 01:30:42.360 |
And the other thing is, so I like to relate it in this way. 01:30:46.440 |
but the distributional hypothesis in NLP, right? 01:30:49.400 |
Where context basically gives kind of meaning to that word. 01:30:57.200 |
so that's the same context across different videos, 01:31:03.080 |
So that's the kind of reason why it figures out 01:31:06.520 |
It observed the same sound across multiple different videos 01:31:09.840 |
and it figures out maybe this is the common factor 01:31:13.320 |
- I wonder, I used to have this argument with my dad a bunch 01:31:22.920 |
like if that's important sensory information. 01:31:25.560 |
Mostly we're talking about like falling in love 01:31:35.360 |
but like you can fall in love with just language really, 01:31:38.440 |
but voice is very powerful and vision is next 01:31:43.920 |
Can I ask you about this process of active learning? 01:31:50.080 |
- Is there some value within the self-supervised learning context 01:32:00.160 |
of selecting which data to annotate in intelligent ways such that it would most benefit the learning process? 01:32:14.000 |
and now you're talking about active learning, I love it. 01:32:16.720 |
I think Yann LeCun told me that active learning 01:32:20.480 |
I think back then I didn't want to argue with him too much, 01:32:24.400 |
but when we talk again, we're gonna spend three hours 01:32:28.440 |
My sense was you can go extremely far with active learning, 01:32:32.480 |
you know, perhaps farther than anything else. 01:32:45.320 |
from intelligent optimized usage of the data. 01:32:53.240 |
that includes data augmentation and active learning, 01:32:57.080 |
that there's something about maybe interactive exploration 01:32:59.920 |
of the data that at least this part of the solution 01:33:10.880 |
So back in the day we did this largely ignored CVPR paper 01:33:16.560 |
So the idea was basically you would train an agent 01:33:24.400 |
it would decide what's the next hardest question 01:33:28.800 |
And the idea was basically because it was being smart 01:33:37.920 |
And we did find to some extent that it was actually better 01:33:48.680 |
you need to understand something about the image. 01:33:50.920 |
You can't ask a completely arbitrarily random question, 01:33:53.480 |
it may not even apply to that particular image. 01:33:55.520 |
So there is some amount of understanding or knowledge 01:34:01.320 |
So I think active learning by itself is really good. 01:34:06.400 |
is basically how do we come up with a technique 01:34:16.040 |
I think that's the sort of beauty of it, right? 01:34:18.360 |
Because when you know that there are certain things 01:34:23.640 |
is actually going to bring you the most value. 01:34:26.520 |
And I think that's the sort of key challenge. 01:34:36.360 |
it is basically about if the model has a knowledge 01:34:41.400 |
and it is weak basically about certain things, 01:34:50.400 |
So at that level, it's a very powerful technique. 01:34:53.220 |
I actually do think it's going to be really useful. 01:35:04.300 |
For example, you have your self-supervised model, 01:35:06.900 |
which is very good at predicting similarities 01:35:10.780 |
And so if you label a picture as basically say, a banana, 01:35:46.860 |
to discover the most likely, the most beneficial image. 01:36:00.000 |
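Here is a minimal sketch of that kind of similarity-driven selection, with all names hypothetical: given self-supervised embeddings of an unlabeled pool and of the few examples you have already labeled (say, the bananas), pick the unlabeled images that are least similar to anything labeled so far.

```python
import torch
import torch.nn.functional as F

def select_for_labeling(unlabeled_emb, labeled_emb, k=10):
    """unlabeled_emb: [N, d], labeled_emb: [M, d] self-supervised embeddings.
    Returns the indices of the k unlabeled examples least similar to any
    labeled example -- a simple proxy for 'most beneficial to label next'."""
    u = F.normalize(unlabeled_emb, dim=1)
    lab = F.normalize(labeled_emb, dim=1)
    sim = u @ lab.t()                        # [N, M] cosine similarities
    closest = sim.max(dim=1).values          # similarity to the nearest labeled example
    return torch.topk(-closest, k).indices   # farthest-from-labeled first

# Usage with random stand-in embeddings:
to_label = select_for_labeling(torch.randn(1000, 256), torch.randn(20, 256), k=10)
```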
but have some kind of more complicated learning system, 01:36:10.940 |
- Yeah, like actually in a self-supervised way, 01:36:24.020 |
It's just, I think, yeah, it's going to be explored. 01:36:29.420 |
I kind of think of what Tesla Autopilot is doing 01:36:36.900 |
There's something that Andrej Karpathy and their team 01:36:42.140 |
- So you're basically deploying a bunch of instantiations 01:36:50.700 |
that are then sent back for annotation for particular edge cases, 01:37:04.020 |
like the not-quite-a-banana but almost-a-banana cases sent back for annotation. 01:37:14.820 |
is the cars themselves that are sending you back the data, 01:37:22.840 |
What are your thoughts about that sort of deployment 01:37:33.860 |
is there applications for autonomous driving? 01:37:36.960 |
Like computer vision based autonomous driving, 01:37:42.060 |
in the context of computer vision based autonomous driving? 01:37:48.380 |
I think for self-supervised learning to be used 01:37:50.060 |
in autonomous driving, there are lots of opportunities. 01:37:52.580 |
And just like pure consistency in predictions is one way. 01:37:55.860 |
So because you have this nice sequence of data 01:38:10.660 |
like one way possibly in which they're figuring this out. 01:38:17.500 |
So you predict that the car was going to turn right. 01:38:20.420 |
So this was the action that was going to happen 01:38:21.900 |
say in the shadow mode, and now the driver turned left. 01:38:27.220 |
So basically by forming these good predictive models, 01:38:30.180 |
you are, I mean, these are kind of self-supervised models. 01:38:32.900 |
Prediction models are basically being trained 01:38:34.660 |
just by looking at what's going to happen next 01:38:36.820 |
and asking them to predict what's going to happen next. 01:38:44.700 |
basically just by looking at what data you have. 01:38:46.900 |
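A toy sketch of that kind of predictive training signal, with everything here hypothetical: an encoder and a predictor are trained purely to predict the features of the next frame in a driving sequence, so the supervision comes from the temporal order of the data rather than from any labels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder encoder: maps a 3x64x64 frame to a 128-d feature vector.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU(), nn.Linear(256, 128))
predictor = nn.Linear(128, 128)   # predicts the next frame's features from the current ones
opt = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

def train_step(frames_t, frames_t_plus_1):
    """frames_*: [batch, 3, 64, 64] consecutive frames from unlabeled driving video."""
    z_t = encoder(frames_t)
    with torch.no_grad():
        z_next = encoder(frames_t_plus_1)        # target: features of what actually happened
    loss = F.mse_loss(predictor(z_t), z_next)    # penalize wrong predictions of the future
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Usage with random stand-in frames:
loss = train_step(torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64))
```

Cases where such a predictor disagrees badly with what actually happened are natural candidates to send back for closer inspection.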
- Is there something about that active learning context 01:38:54.780 |
seeing cases where it doesn't perform as you expected 01:38:59.140 |
and then retraining the system based on that? 01:39:01.020 |
- I think that, I mean, that really resonates with me. 01:39:08.540 |
of like practical system like autonomous driving, 01:39:14.540 |
I mean, highway driving or like freeway driving 01:39:17.420 |
has basically been, like there has been a lot of success 01:39:20.140 |
in that particular part of autonomous driving 01:39:28.020 |
are the sort of reason why autonomous driving 01:39:31.420 |
hasn't become like super, super mainstream 01:39:33.180 |
and available like in every possible car right now. 01:39:35.620 |
And so basically by really scaling this problem out 01:39:38.180 |
by really trying to get all of these edge cases out 01:39:41.820 |
and then just like using those to improve your model, 01:39:55.220 |
He thinks that the Tesla computer vision approach 01:39:58.180 |
or really any approach for autonomous driving 01:40:13.580 |
- Okay, so what does solving autonomous driving mean? 01:40:19.260 |
that very different types of driving have been. 01:40:45.140 |
like say five times less accidents than humans. 01:40:50.140 |
Sufficiently safer such that the public feels 01:41:20.140 |
on the screen it basically shows all the detections 01:41:22.300 |
and everything the car is doing as you're driving by. 01:41:24.740 |
And that's super distracting for me as a person 01:41:26.940 |
because all I keep looking at is like the bounding boxes 01:41:31.820 |
like especially when it's raining and it's able to do that. 01:41:36.060 |
It's actually able to get through rain and do that. 01:41:38.580 |
And one of the reasons why like a lot of us believed 01:41:46.900 |
that LiDAR for autonomous driving was the key driver, right? 01:41:46.900 |
And Tesla then decided to go this completely other route 01:41:55.820 |
So their initial system I think was camera and radar based. 01:42:02.060 |
And so that was just like, it sounded completely crazy. 01:42:09.300 |
Of course it comes with its own set of complications. 01:42:11.780 |
But now to see that happen on a live Tesla, 01:42:20.620 |
I think there were also like a lot of advancements 01:42:23.980 |
Now there were like, I know at CMU when I was there, 01:42:34.500 |
it could actually still have a very reasonable visibility. 01:42:37.700 |
And I think there are lots of these kinds of innovations 01:42:41.020 |
which is actually going to make this very easy 01:42:43.900 |
And so maybe that's actually why I'm more optimistic 01:42:53.580 |
that's the reason I'm quite optimistic about it. 01:43:00.740 |
we're actually going to get much better about it. 01:43:02.660 |
And then of course, once we're able to scale out 01:43:08.780 |
I think that's going to make us go very far away. 01:43:13.620 |
I'm very much with you on the five to 10 years, 01:43:21.820 |
but for some people that might seem like really far away, 01:43:32.300 |
about how much game theory is in this whole thing. 01:43:36.900 |
So like how much is this simply collision avoidance problem? 01:43:44.340 |
you're still interacting with other humans in the scene, 01:43:46.980 |
and you're trying to create an experience that's compelling. 01:43:49.460 |
So you want to get from point A to point B quickly, 01:43:53.060 |
you want to navigate the scene in a safe way, 01:43:55.260 |
but you also want to show some level of aggression, 01:43:58.500 |
because, well, certainly this is why you're screwed in India, 01:44:07.020 |
- So like, or New York, or basically any major city. 01:44:20.100 |
as a huge problem in this, as a source of problem. 01:44:22.980 |
Like the driving is fundamentally a robot on robot 01:44:35.180 |
I used to think humans are, almost certainly, 01:44:41.220 |
Pedestrians and cyclists and humans inside of the cars, 01:44:44.380 |
you have to have like mental models for them. 01:44:51.420 |
it's the same kind of intuition breaking thing 01:44:58.820 |
you'll get all the human, like human information you need. 01:45:18.620 |
I was skeptical that they would be able at scale 01:45:22.540 |
to convert the driving scene across the world 01:45:29.060 |
such that you can create this data engine at scale. 01:45:33.140 |
And the fact that Tesla is at least getting there 01:45:39.460 |
makes me think that it's now starting to be coupled 01:45:49.860 |
if through purely this process, you can get really far, 01:45:55.740 |
I tend to believe we don't give enough credit 01:46:13.260 |
I wish there was much more driver sensing inside Teslas 01:46:17.180 |
and much deeper consideration of human factors, 01:46:28.740 |
how to keep utilizing the little human supervision 01:46:32.980 |
that are needed to keep this whole thing safe. 01:46:45.020 |
It is not a robotics problem or computer vision problem. 01:46:49.980 |
but so, which is why I think it's 10 years plus, 01:46:53.340 |
but I do think there'll be a bunch of cities and contexts 01:46:56.300 |
where geo-restricted, it will work really, really damn well. 01:47:02.620 |
So I think for me, like it's five, if I'm being optimistic 01:47:09.220 |
10 plus basically, if we want to like cover most of say, 01:47:16.060 |
So my optimistic is five and pessimistic is 30. 01:47:24.420 |
I've watched enough pedestrians to think like, 01:47:27.940 |
we may be like, there's a small part of me still, 01:47:33.860 |
that thinks we will have to build AGI to solve driving. 01:47:44.020 |
and also human society is part of the picture 01:47:50.860 |
it's not clear to me that that's not a problem 01:47:54.300 |
that machine learning will also have to solve. 01:48:04.120 |
how to make a really good recommender system. 01:48:24.200 |
It's kind of fascinating that the more successful 01:48:31.580 |
and the more precious politicians and the public 01:48:35.980 |
and all the different fascinating forces of our society 01:48:43.980 |
It's also how good you are at navigating human nature, 01:48:49.940 |
What do you think are the limits of deep learning? 01:48:54.820 |
into the big question of artificial intelligence. 01:49:04.340 |
What do you think the limits of self-supervised learning 01:49:07.780 |
and just learning in general, deep learning are? 01:49:10.740 |
- I think like for deep learning in particular, 01:49:14.180 |
I would say a little bit more vague right now. 01:49:16.820 |
So I wouldn't, like for something that's so vague, 01:49:18.700 |
it's hard to predict what its limits are going to be. 01:49:21.980 |
But like I said, I think anywhere you want to interact 01:49:31.620 |
So really like if you have just like vacuous concepts 01:49:38.580 |
it's very hard to communicate those with a human 01:49:40.380 |
without like inserting some kind of human knowledge 01:49:47.020 |
the biggest challenge is just like data efficiency. 01:49:59.860 |
whatever you want to call it, like any concept, 01:50:02.500 |
it's really hard for these methods to generalize 01:50:04.820 |
by looking at just one or two samples of things. 01:50:09.740 |
And I think that's actually why like these edge cases, 01:50:11.660 |
for example, for Tesla are actually that important. 01:50:14.500 |
Because if you see just one instance of the car failing, 01:50:25.140 |
And you're actually going to be able to recognize 01:50:26.740 |
this kind of instance in a very different scenario. 01:50:30.300 |
so you got that thing labeled when it was snowing, 01:50:43.540 |
How do we solve the handwritten digit recognition problem 01:50:43.540 |
when we only have one example for each number? 01:50:51.220 |
It feels like humans are using something like learning. 01:50:56.020 |
we are good at transferring knowledge a little bit. 01:50:59.260 |
We are just better at, like, for a lot of these problems 01:51:02.620 |
where we are generalizing from a single sample 01:51:06.940 |
we are using a lot of our own domain knowledge 01:51:12.300 |
So I've never seen you write the number nine, for example. 01:51:15.300 |
And if you were to write it, I would still get it. 01:51:17.460 |
And if you were to write a different kind of alphabet 01:51:29.020 |
The other sort of problem with any deep learning system 01:51:35.820 |
Now you can argue that humans also don't have any guarantees 01:51:38.100 |
like there is no guarantee that I can recognize a cat 01:51:44.980 |
lots of scenarios in which I don't recognize cats in general. 01:51:56.900 |
Now algorithms, like traditional CS algorithms 01:52:05.540 |
you are guaranteed that it's going to be sorted. 01:52:12.380 |
We know for a fact that like a cat recognition model 01:52:12.380 |
is not going to recognize every cat in the world in every circumstance. 01:52:16.980 |
I think most people would agree with that statement. 01:52:27.820 |
like if you have this kind of failure case existing, 01:52:29.900 |
then you think of it as like something is wrong. 01:52:34.460 |
of nebulous correctness for machine learning. 01:52:40.500 |
or like for a lot of these machine learning algorithms, 01:52:48.100 |
or at least a limitation in our phrasing of this. 01:53:03.340 |
Do you think it's possible for neural networks to reason? 01:53:35.500 |
neural networks are really good at recognition, 01:53:45.820 |
they're very good at making those sort of snap judgments. 01:53:48.260 |
But if you were to give them a very complicated thing 01:54:05.220 |
- Well, there's a certain aspect to reasoning 01:54:08.820 |
that you can maybe convert into the process of programming. 01:54:11.860 |
And so there's the whole field of program synthesis 01:54:14.300 |
and people have been applying machine learning 01:54:25.260 |
You know, the step of like building things on top of, 01:54:29.380 |
like little intuitions, concepts on top of each other, 01:54:45.100 |
we are prime examples of machines that have like, 01:54:47.620 |
or individuals that have learned this, right? 01:54:51.940 |
it is a technique that is very easy to learn. 01:55:07.500 |
how well it's going to generalize to an unseen thing. 01:55:13.900 |
And I think that's basically telling us a lot about, 01:55:18.460 |
like a lot about the fact that we really don't know 01:55:20.700 |
what this model has learned and how well it's basically, 01:55:22.740 |
because we don't know how well it's going to transfer. 01:55:25.140 |
- There's also a sense in which it feels like 01:55:28.100 |
we humans may not be aware of how much like background, 01:55:45.500 |
where you're learning a particular task in computer vision, 01:55:49.180 |
you celebrate your state of the art successes 01:56:00.220 |
And humans are obviously doing that exceptionally well, 01:56:12.060 |
And I don't think it's a very well explored paradigm. 01:56:15.300 |
We have like things in deep learning, for example, like catastrophic forgetting. 01:56:15.300 |
The thing basically being that if you teach a network to recognize dogs, 01:56:20.300 |
and now you teach that same network to recognize cats, it forgets how to recognize the dogs. 01:56:24.900 |
Whereas a human, if you were to teach someone to recognize dogs 01:56:32.660 |
and then to recognize cats, they don't forget immediately how to recognize these dogs. 01:56:36.060 |
I think that's basically sort of what you're trying to get. 01:56:44.860 |
or the mechanisms that store not just memories, 01:57:00.060 |
or if you can do that within a single neural network 01:57:02.460 |
with some particular sort of architecture quirks, 01:57:14.980 |
or to the ideas of logic-based sort of expert systems. 01:57:24.220 |
It's really annoying, like with self-supervised learning, 01:57:42.540 |
But I think whenever we try to like understand it, 01:57:45.380 |
we're putting our own subjective human bias into it. 01:57:51.140 |
the goal is that it should learn naturally from the data. 01:58:13.500 |
We've already kind of started talking about this, 01:58:20.740 |
Does it have to interact richly with the world? 01:58:23.900 |
Does it have to have some more human elements 01:58:44.620 |
And that just seems a little bit weird to me. 01:58:57.100 |
There is like a mismatch between these things. 01:59:01.060 |
I can either be surprised or I can be saddened 01:59:07.940 |
that I already have a predictive model in my head 01:59:11.420 |
or something that I thought was likely to happen. 01:59:13.700 |
And then there was something that I observed that happened 01:59:15.540 |
that there was a disconnect between these two things. 01:59:18.260 |
And that basically is like maybe one of the reasons 01:59:24.260 |
- Yeah, I think, so I talked to people a lot about 01:59:29.100 |
I think that's an interesting concept of emotion 01:59:43.780 |
So it's a part of basically human to human interaction, 01:59:50.180 |
So it's like, I would throw it into the full mix 01:59:58.020 |
And to me, communication can be done with objects 02:00:07.540 |
our ability to connect with things that look like a Roomba, 02:00:11.980 |
First of all, let's talk about other biological systems 02:00:33.940 |
- So then we have to be, I guess, specific, but yeah. 02:01:00.260 |
Do you think like to interact with the physical world, 02:01:02.820 |
do you think you can understand the physical world 02:01:04.620 |
without being able to directly interact with it? 02:01:08.420 |
I think at some point we will need to bite the bullet 02:01:10.660 |
and actually interact with the physical world 02:01:12.660 |
as much as I like working on like passive computer vision 02:01:21.220 |
some kind of embodiment or some kind of interaction 02:01:28.580 |
Do you think, how often do you think about consciousness 02:01:34.300 |
You could think of it as the more simple thing 02:01:36.500 |
of self-awareness, of being aware that you are a perceiving, 02:01:46.800 |
or you can think about the bigger version of that 02:01:50.300 |
which is consciousness, which is having it feel 02:01:57.180 |
the subjective experience of being in this world. 02:01:59.540 |
- So I think of self-awareness a little bit more 02:02:03.380 |
because I think self-awareness is pretty critical 02:02:06.100 |
for any kind of AGI or whatever you want to call it 02:02:10.140 |
that we build because it needs to contextualize 02:02:15.540 |
with respect to all the other things that exist around it. 02:02:19.620 |
It needs to understand that it's an autonomous car, right? 02:02:26.180 |
What are the things that it is supposed to do and so on? 02:02:39.300 |
that's, I would say, basically required at least, 02:02:46.380 |
believe that it has to be able to display consciousness. 02:02:51.380 |
- Display consciousness, what do you mean by that? 02:02:54.300 |
- Meaning like for us humans to connect with each other 02:03:01.660 |
I think we need to feel, like in order for us 02:03:05.100 |
to truly feel like that there's another being there, 02:03:17.300 |
Now, I tend to think that that's easier to achieve 02:03:21.540 |
than it may sound 'cause we anthropomorphize stuff so hard. 02:03:25.700 |
Like you have a mug that just like has wheels 02:03:28.740 |
and like rotates every once in a while and makes a sound. 02:03:31.900 |
I think a couple of days in, especially if you're, 02:03:39.500 |
you might start to believe that mug on wheels is conscious. 02:03:42.220 |
So I think we anthropomorphize pretty effectively 02:03:54.740 |
I think of consciousness as the capacity to suffer. 02:03:57.440 |
And if you're an entity that's able to feel things 02:04:02.420 |
in the world and to communicate that to others, 02:04:05.580 |
I think that's a really powerful way to interact with humans. 02:04:13.220 |
I believe you should be able to richly interact with humans. 02:04:17.980 |
Like humans would need to want to interact with you. 02:04:21.120 |
Like it can't be like, it's the self-supervised learning 02:04:25.600 |
versus like the robot shouldn't have to pay you 02:04:33.600 |
And then you're going to scale up significantly 02:05:00.240 |
'cause it just gets so boring when you're like, 02:05:20.920 |
Like it is an entity much like a human being. 02:05:27.320 |
I don't know if that's fundamentally a robotics problem 02:05:30.520 |
or some kind of problem that we're not yet even aware. 02:05:33.040 |
Like if it is truly a hard problem of consciousness, 02:05:38.600 |
we can pretty effectively fake it till we make it. 02:05:42.640 |
So we can display a lot of human-like elements for a while 02:05:59.040 |
with a glass of wine and armchair and just at a fireplace, 02:06:10.080 |
what do you think is the especially beautiful idea? 02:06:16.520 |
- I think the fact that what objects are, in some notion of objectness, emerges 02:06:16.520 |
from these models by just like self-supervised learning. 02:06:23.720 |
So for example, like one of the things like the DINO paper 02:06:23.720 |
So if you have like a dog running in the field, 02:06:42.320 |
what the boundaries of this dog are automatically. 02:06:52.720 |
It's able to group these things together automatically. 02:06:58.080 |
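For readers who want to look at those attention maps themselves, the public DINO repository exposes pretrained ViTs and an attention helper roughly as sketched below; the torch.hub entry point and the `get_last_selfattention` method follow that repo's README, so treat them as assumptions if the interface has changed.

```python
import torch

# Load a small self-supervised ViT from the public DINO repo (downloads weights).
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits8')
model.eval()

img = torch.randn(1, 3, 224, 224)       # stand-in for a normalized image tensor

with torch.no_grad():
    # Attention of the last block: [1, num_heads, num_tokens, num_tokens].
    attn = model.get_last_selfattention(img)

# Attention from the [CLS] token to every image patch, one coarse map per head.
cls_attn = attn[0, :, 0, 1:]                      # [num_heads, num_patches]
side = int(cls_attn.shape[-1] ** 0.5)             # patches form a side x side grid
attn_maps = cls_attn.reshape(-1, side, side)      # these maps tend to outline the object
```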
And the fact that this dumb idea that you take like these two crops 02:07:01.400 |
of an image and then you say that the features 02:07:03.160 |
should be similar, that has resulted in something like this, 02:07:12.080 |
And I mean, I don't think a lot of us even understand 02:07:20.800 |
maybe like a lot in terms of how we're setting up 02:07:26.800 |
So it's really fundamentally telling us something about 02:07:34.160 |
about how we're setting up the self-supervised learning 02:07:36.040 |
problem and despite being like super dumb about it, 02:07:41.600 |
like we'll actually get something that is able to do 02:08:02.360 |
you have to have a few concepts that you wanna apply. 02:08:11.520 |
through self-supervised learning on billions of images. 02:08:17.440 |
So that's like a fundamental concept which we have, 02:08:21.480 |
but that's another concept that should be emergent from it, 02:08:26.760 |
like if you don't teach humans that this is like 02:08:32.480 |
And the same thing for like animals, like dogs, 02:08:47.920 |
- So I think rotation probably, yes, yeah, rotation, yes. 02:08:55.200 |
- But it's interesting if all of them could be, 02:09:04.880 |
that there's multiple objects of the same kind in the image 02:09:21.520 |
Counting I do believe, I mean, should be possible. 02:09:25.920 |
but I do think it's not that far in the realm of possibility. 02:09:33.240 |
can then be applied to then solving those kinds of IQ tests, 02:09:36.520 |
which seem currently to be kind of impossible. 02:09:50.040 |
- So this is going to be a little controversial, 02:09:55.340 |
like actually using simulation to do things very much. 02:10:01.040 |
where you talk about, are we living in a simulation often, 02:10:03.600 |
you're referring to using simulation to construct worlds 02:10:17.400 |
which builds like the environment of the world. 02:10:22.680 |
you train your machine learning system in that. 02:10:27.520 |
but I think it's a really expensive way of doing things. 02:10:30.920 |
And at the end of it, you do need the real world. 02:10:39.960 |
the payout is so large that you can actually invest 02:10:47.040 |
You can't really build simulations of everything. 02:10:51.560 |
because second, it's also not possible for a lot of things. 02:10:59.400 |
on like using synthetic data and like synthetic simulators. 02:11:02.080 |
I generally am not very, like I don't believe in that. 02:11:05.800 |
- So you're saying it's very challenging visually, 02:11:11.920 |
like the lighting, all those kinds of things. 02:11:13.560 |
- I mean, all these companies that you have, right? 02:11:22.880 |
a lot of them is about like accurately trying to figure out 02:11:26.080 |
how the lighting is and like how things reflect 02:11:40.560 |
- And for me, in the context of autonomous driving, 02:11:44.720 |
it's very tempting to be able to use simulation, right? 02:11:53.280 |
that perhaps is a bigger one than the visual limitation 02:12:00.720 |
'Cause the, so you're ultimately interested in edge cases. 02:12:03.800 |
And the question is how well can you generate edge cases 02:12:07.240 |
in simulation, especially with human behavior? 02:12:10.960 |
- I think another problem is like for autonomous driving, 02:12:15.160 |
So say autonomous driving, like in 10 years from now, 02:12:22.400 |
So now there are 50% of the agents say, which are humans, 02:12:30.040 |
So now the kinds of behaviors that you actually expect 02:12:32.280 |
from the other agents or other cars on the road 02:12:36.680 |
And as the proportion of the number of autonomous cars 02:12:44.040 |
based on just like right now to build them today, 02:12:46.400 |
you don't have that many autonomous cars on the road. 02:12:48.360 |
So you will try to like make all of the other agents 02:13:02.760 |
This is why I think it's an interesting question. 02:13:07.720 |
like virtual reality game, where it is so real, 02:13:17.320 |
but like, it's so nice that you just wanna stay there. 02:13:20.800 |
You just wanna stay there and you don't wanna come back. 02:13:24.920 |
Do you think that's doable within our lifetime? 02:13:36.040 |
- Does that make you sad that there'll be like, 02:13:38.440 |
like population of kids that basically spend 95%, 02:13:55.720 |
that they really derive a lot of value out of, 02:13:58.120 |
derive a lot of enjoyment and like happiness out of, 02:14:00.720 |
and maybe the real world wasn't giving them that, 02:14:14.400 |
Again, I think it's, this is a very hard question. 02:14:18.280 |
- So like you've been a part of a lot of amazing papers, 02:14:31.000 |
is there common things that you've learned along the way 02:14:39.000 |
- Right, so I think both of these I've picked up from, 02:14:44.640 |
like lots of people I've worked with in the past. 02:14:53.680 |
So, I mean, there are multiple reasons for this. 02:14:58.960 |
that can actually be solved in a particular timeframe. 02:15:02.320 |
So now say you want to work on finding the meaning of life. 02:15:15.520 |
like make some kind of meaningful progress in your lifetime? 02:15:18.800 |
If you are optimistic about it, then like go ahead. 02:15:22.120 |
I keep asking people about the meaning of life. 02:15:24.040 |
I'm hoping by episode like 220, I'll figure it out. 02:15:36.280 |
- Right, so I think it's just the fact of like, 02:15:41.080 |
what is one problem that you want to focus on 02:15:45.720 |
and you will be able to make a reasonable amount 02:15:47.800 |
of headway into it that you think you'll be doing a PhD for. 02:15:59.040 |
because as a grad student or as a researcher, 02:16:18.920 |
I try to cram in a lot of things into the paper. 02:16:29.120 |
is going to be like whatever eight or nine pages. 02:16:43.800 |
and articulate it out in multiple different ways, 02:16:46.240 |
it's far more valuable to the reader as well. 02:17:00.440 |
in different ways, you think about it more deeply. 02:17:11.360 |
was actually the bigger part of research than writing. 02:17:19.720 |
But I think more and more I realized that's the case. 02:17:21.800 |
Like whenever I write something that I'm doing, 02:17:28.360 |
early on you actually, I think get better ideas, 02:17:31.200 |
or at least you figure out like holes in your theory 02:17:35.480 |
that you should run to block those holes and so on. 02:17:40.320 |
how many really good papers throughout history 02:17:49.800 |
Like if you want to dream about writing a paper 02:18:03.040 |
it's focusing on one idea and thinking deeply. 02:18:07.240 |
And you're right that the writing process itself 02:18:12.280 |
It challenges you to really think about what is the idea 02:18:15.320 |
that explains that the thread that ties it all together. 02:18:18.120 |
- And so like a lot of famous researchers I know 02:18:24.120 |
first they would, even before the experiments were in, 02:18:33.800 |
what they're like, what they're trying to solve 02:18:35.800 |
and how it fits in like the context of things right now. 02:18:38.680 |
And that would really guide their entire research. 02:18:40.680 |
So a lot of them would actually first write an intro 02:18:51.960 |
What's the best programming language to learn 02:19:05.000 |
So it'll, if you don't know any other programming language, 02:19:07.600 |
Python is actually going to get you a long way. 02:19:11.680 |
it's a toss up question because it seems like Python 02:19:16.800 |
But I wonder if there's an interesting alternative. 02:19:19.960 |
and there's a lot of interesting alternatives popping up, 02:19:23.960 |
Or R, more like for the data science applications. 02:19:23.960 |
is actually being used to teach like introduction 02:19:41.880 |
What are the pros and cons of PyTorch versus TensorFlow? 02:19:51.320 |
- So a disclaimer to this is that the last time 02:19:53.440 |
I used TensorFlow was probably like four years ago. 02:19:58.200 |
because so I started on like deep learning in 2014 or so 02:20:02.680 |
and the dominant sort of framework for us then 02:20:06.480 |
for vision was Caffe, which was out of Berkeley. 02:20:17.080 |
and it had like very loose kind of Python binding. 02:20:19.080 |
So Python wasn't really the first language you would use. 02:20:28.280 |
And then Python of course became popular a little bit later. 02:20:30.960 |
So TensorFlow was basically around that time. 02:20:37.240 |
- And then what, did you use Torch or did you-- 02:21:01.360 |
So Caffe was very rigid in terms of its structure. 02:21:01.360 |
Like you would create a neural network once and that's it. 02:21:06.800 |
Whereas if you wanted like very dynamic graphs and so on, 02:21:20.760 |
And also that PyTorch is much easier to debug 02:21:23.560 |
is what I find because it's imperative in nature 02:21:26.280 |
compared to like TensorFlow, which is not imperative. 02:21:33.320 |
in which a lot of people are taught programming 02:21:35.240 |
and that's what actually makes debugging easier for them. 02:21:45.280 |
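As a tiny illustration of what the imperative style buys you (a generic sketch, not tied to any particular model): the forward pass is just Python executing line by line, so ordinary prints or a debugger breakpoint work in the middle of it.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 32)
        self.fc2 = nn.Linear(32, 2)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        # Eager execution: ordinary Python tools work mid-forward.
        print("hidden:", h.shape, h.mean().item())
        # import pdb; pdb.set_trace()   # drop into a debugger right here if needed
        return self.fc2(h)

out = TinyNet()(torch.randn(4, 10))   # runs eagerly, printing as it goes
```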
kind of these two communities, this kind of competition? 02:21:48.480 |
I think PyTorch is kind of more and more becoming dominant in the research community, 02:21:48.480 |
while TensorFlow is still dominant in the more sort of application machine learning community. 02:21:54.600 |
So do you think it's good to have that kind of split 02:22:02.680 |
so like the benefit there is the competition challenges 02:22:06.560 |
the library developers to step up their game. 02:22:10.000 |
- But the downside is there's these code bases 02:22:20.600 |
it's really hard to like really build on top of it. 02:22:27.080 |
So whenever like something pops up in TensorFlow, 02:22:30.840 |
you wait a few days and someone who's like super sharp 02:22:33.200 |
will actually come and translate that particular code base 02:22:44.280 |
So I think in terms of like having these two frameworks 02:22:47.560 |
or multiple, I think of course there are different use cases 02:22:52.840 |
And like you said, I think competition is just healthy 02:22:57.360 |
or like all of these frameworks really sort of 02:23:11.520 |
but are curious about it and who want to get in the field? 02:23:19.120 |
like really drill into why things are not working. 02:23:22.200 |
- Can you elaborate what your hands dirty means? 02:23:24.560 |
- Right, so for example, like if an algorithm, 02:23:27.600 |
if you try to train a network and it's not converging, 02:23:29.760 |
whatever, rather than trying to like Google the answer 02:23:33.440 |
like really spend those like five, eight, 10, 15, 20, 02:23:39.040 |
'Cause in that process, you'll actually learn a lot more. 02:23:42.560 |
- Googling is of course like a good way to solve it 02:24:01.400 |
and they would be like, "Hey, why don't you go figure it out 02:24:12.480 |
That has really helped me figure a lot of things out. 02:24:15.080 |
- I think in general, if I were to generalize that, 02:24:18.760 |
I feel like persevering through any kind of struggle 02:24:25.680 |
So you're basically, you try to make it seem like 02:24:32.600 |
whatever form that takes, it could be just Googling a lot. 02:24:36.120 |
Just basically anything, just go sticking with it 02:24:38.760 |
and go into the hard thing that could take a form 02:24:45.680 |
with different libraries or different programming languages. 02:24:58.400 |
- And so when it snowed, you really couldn't do much. 02:24:58.400 |
Because when it's snowing, you can't do anything else. 02:25:10.840 |
you're already exceptionally successful, you're young, 02:25:10.840 |
but do you have advice for young people starting out 02:25:17.840 |
You know, advice for their career, advice for their life, 02:25:21.080 |
how to pave a successful path in career and life? 02:25:29.760 |
And I think like, I've been inspired by a lot of people 02:25:33.360 |
who are just like driven and who really like go 02:25:44.400 |
- How do you know when you come across a thing 02:25:51.160 |
- I think there's not going to be any single thing 02:25:53.960 |
There are going to be different types of things 02:25:55.000 |
that you need, but whenever you need something, 02:26:00.080 |
or you may find that this was not even the thing 02:26:02.000 |
that you were looking for, it might be a different thing. 02:26:03.720 |
But the point is like you're pushing through things 02:26:06.280 |
and that actually gives you, brings a lot of skills 02:26:12.920 |
which will probably help you get the other thing. 02:26:15.680 |
Once you figure out what's really the thing that you want. 02:26:20.560 |
I've noticed people are kind of afraid of that because, one, 02:26:24.880 |
And two, there's so many amazing things in this world. 02:26:31.040 |
So I think a lot of it has to do with just allowing yourself 02:26:33.840 |
to like notice that thing and just go all the way with it. 02:26:43.240 |
So I know this is like super cheesy that failure 02:26:47.280 |
is something that you should be prepared for and so on. 02:26:49.760 |
But I do think, I mean, especially in research, 02:26:52.520 |
for example, failure is something that happens almost like, 02:26:55.240 |
almost every day is like experiments failing and not working. 02:27:06.280 |
like when you get through it is when you find 02:27:09.560 |
So Thomas Edison was like one person like that. 02:27:13.680 |
I used to really read about how he found like the filament, 02:27:24.320 |
And then they asked him like, so what did you learn? 02:27:26.920 |
Because all of these were failed experiments. 02:27:28.480 |
And then he says, oh, these 990 things don't work. 02:27:38.480 |
performing a self supervised kind of learning process. 02:27:43.480 |
Have you figured out the meaning of life yet? 02:27:46.400 |
I told you I'm doing this podcast to try to get the answer. 02:27:58.960 |
- Do you think AI will help us figure it out? 02:28:10.520 |
This is like a very hard, hard, hard question, 02:28:18.400 |
Humans don't seem to know what the hell they're doing. 02:28:27.360 |
And I wonder like, whether our lack of ability 02:28:40.400 |
under which we operate, if that's a feature or a bug. 02:28:45.200 |
because then everyone actually has very different kinds 02:28:47.400 |
of objective functions that they're optimizing. 02:28:53.840 |
That's actually what makes us interesting, right? 02:28:58.000 |
the exact same thing, that would be pretty boring. 02:29:00.520 |
We do want like people with different kinds of perspectives. 02:29:06.120 |
That's like, I would say the biggest feature of being human. 02:29:12.520 |
We get to watch that, see it and learn from it. 02:29:24.240 |
that died doing something wild and beautiful. 02:29:28.120 |
Ishan, thank you so much for this incredible conversation. 02:29:45.760 |
Thanks for coming down today and talking with me. 02:30:05.280 |
Check them out in the description to support this podcast. 02:30:18.120 |
Thank you for listening and hope to see you next time.