
Ishan Misra: Self-Supervised Deep Learning in Computer Vision | Lex Fridman Podcast #206


Chapters

0:00 Introduction
2:27 Self-supervised learning
11:02 Self-supervised learning is the dark matter of intelligence
14:54 Categorization
23:28 Is computer vision still really hard?
27:12 Understanding Language
36:51 Harder to solve: vision or language
43:36 Contrastive learning & energy-based models
47:37 Data augmentation
60:10 Real data vs. augmented data
63:54 Non-contrastive energy-based self-supervised learning methods
67:32 Unsupervised learning (SwAV)
70:14 Self-supervised Pretraining (SEER)
75:21 Self-supervised learning (SSL) architectures
81:21 VISSL PyTorch-based SSL library
84:15 Multi-modal
91:43 Active learning
97:22 Autonomous driving
108:49 Limits of deep learning
112:57 Difference between learning and reasoning
118:03 Building super-human AI
125:51 Most beautiful idea in self-supervised learning
129:40 Simulation for training AI
133:4 Video games replacing reality
134:18 How to write a good research paper
138:45 Best programming language for beginners
139:39 PyTorch vs TensorFlow
143:3 Advice for getting into machine learning
145:9 Advice for young people
147:35 Meaning of life

Whisper Transcript

00:00:00.000 | The following is a conversation with Ishan Misra,
00:00:03.260 | research scientist at Facebook AI Research
00:00:05.840 | who works on self-supervised machine learning
00:00:08.600 | in the domain of computer vision.
00:00:10.520 | Or in other words, making AI systems
00:00:13.480 | understand the visual world with minimal help
00:00:16.580 | from us humans.
00:00:18.040 | Transformers and self-attention have been successfully used
00:00:21.740 | by OpenAI's GPT-3 and other language models
00:00:25.620 | to do self-supervised learning in the domain of language.
00:00:28.600 | Ishan, together with Yann LeCun and others,
00:00:31.800 | is trying to achieve the same success
00:00:33.960 | in the domain of images and video.
00:00:36.400 | The goal is to leave a robot
00:00:38.320 | watching YouTube videos all night
00:00:40.320 | and in the morning, come back to a much smarter robot.
00:00:43.600 | I read the blog post, "Self-supervised Learning,
00:00:45.960 | "The Dark Matter of Intelligence" by Ishan and Yan LeCun
00:00:50.360 | and then listened to Ishan's appearance
00:00:52.940 | on the excellent "Machine Learning Street Talk" podcast.
00:00:57.200 | And I knew I had to talk to him.
00:00:59.160 | By the way, if you're interested in machine learning and AI,
00:01:02.840 | I cannot recommend the ML Street Talk podcast highly enough.
00:01:07.840 | Those guys are great.
00:01:09.640 | Quick mention of our sponsors,
00:01:11.280 | Onnit, The Information, Grammarly and Athletic Greens.
00:01:15.400 | Check them out in the description to support this podcast.
00:01:18.640 | As a side note, let me say that for those of you
00:01:21.680 | who may have been listening for quite a while,
00:01:23.240 | this podcast used to be called
00:01:24.960 | Artificial Intelligence Podcast
00:01:27.120 | because my life passion has always been,
00:01:29.720 | will always be artificial intelligence,
00:01:32.680 | both narrowly and broadly defined.
00:01:35.460 | My goal with this podcast is still
00:01:37.720 | to have many conversations with world-class researchers
00:01:40.580 | in AI, math, physics, biology and all the other sciences.
00:01:45.120 | But I also want to talk to historians, musicians,
00:01:48.880 | athletes and of course, occasionally comedians.
00:01:51.520 | In fact, I'm trying out doing this podcast
00:01:53.600 | three times a week now to give me more freedom
00:01:56.200 | with guest selection and maybe get a chance
00:01:59.380 | to have a bit more fun.
00:02:00.880 | Speaking of fun, in this conversation,
00:02:03.160 | I challenged the listener to count the number of times
00:02:05.440 | the word banana is mentioned.
00:02:08.000 | Ishan and I use the word banana as the canonical example
00:02:12.600 | at the core of the hard problem of computer vision
00:02:15.200 | and maybe the hard problem of consciousness.
00:02:19.020 | This is the Lex Fridman Podcast
00:02:22.620 | and here is my conversation with Ishan Misra.
00:02:26.280 | What is self-supervised learning?
00:02:29.860 | And maybe even give the bigger basics
00:02:32.740 | of what is supervised and semi-supervised learning
00:02:35.340 | and maybe why is self-supervised learning
00:02:37.620 | a better term than unsupervised learning?
00:02:40.100 | - Let's start with supervised learning.
00:02:41.580 | So typically for machine learning systems,
00:02:43.900 | the way they're trained is you get a bunch of humans,
00:02:46.900 | the humans point out particular concepts.
00:02:48.580 | So if it's in the case of images,
00:02:50.180 | you want the humans to come and tell you
00:02:51.980 | what is present in the image, draw boxes around them,
00:02:55.820 | draw masks of things, pixels,
00:02:57.700 | which are of particular categories or not.
00:03:00.540 | For NLP, again, there are lots of these particular tasks,
00:03:03.220 | say about sentiment analysis, about entailment and so on.
00:03:06.620 | So typically for supervised learning,
00:03:08.100 | we get a big corpus of such annotated or labeled data
00:03:11.280 | and then we feed that to a system
00:03:12.780 | and the system is really trying to mimic.
00:03:14.820 | So it's taking this input of the data
00:03:16.580 | and then trying to mimic the output.
00:03:18.380 | So it looks at an image and the human has tagged
00:03:20.660 | that this image contains a banana
00:03:22.420 | and now the system is basically trying to mimic that.
00:03:24.660 | So that's its learning signal.
00:03:26.700 | And so for supervised learning,
00:03:28.020 | we try to gather lots of such data
00:03:30.060 | and we train these machine learning models
00:03:31.820 | to imitate the input output.
00:03:33.460 | And the hope is basically by doing so,
00:03:35.620 | now on unseen or like new kinds of data,
00:03:38.060 | this model can automatically learn
00:03:40.020 | to predict these concepts.
00:03:41.320 | So this is a standard sort of supervised setting.
00:03:43.380 | For semi-supervised setting,
00:03:45.780 | the idea typically is that you have,
00:03:47.580 | of course, all of the supervised data,
00:03:49.260 | but you have lots of other data which is unsupervised
00:03:51.780 | or which is like not labeled.
00:03:53.100 | Now the problem basically with supervised learning
00:03:55.260 | and why you actually have all of these alternate
00:03:57.460 | sort of learning paradigms is,
00:03:59.400 | supervised learning just does not scale.
00:04:01.800 | So if you look at for computer vision,
00:04:03.900 | the sort of largest,
00:04:04.980 | one of the most popular data sets is ImageNet.
00:04:07.500 | So the entire ImageNet data set
00:04:09.300 | has about 22,000 concepts and about 14 million images.
00:04:13.820 | So these concepts are basically just nouns
00:04:16.140 | and they're annotated on images.
00:04:18.340 | And this entire data set was a mammoth
00:04:20.020 | data collection effort.
00:04:20.860 | It actually gave rise to a lot of powerful
00:04:23.020 | learning algorithms.
00:04:23.860 | It's credited with like sort of the rise
00:04:25.620 | of deep learning as well.
00:04:27.220 | But this data set took about 22 human years
00:04:30.100 | to collect, to annotate.
00:04:31.920 | And it's not even that many concepts, right?
00:04:33.460 | It's not even that many images.
00:04:34.540 | 14 million is nothing really.
00:04:36.760 | Like you have about, I think, 400 million images or so,
00:04:39.340 | or even more than that uploaded
00:04:40.620 | to most of the popular sort of social media websites today.
00:04:44.180 | So now supervised learning just doesn't scale.
00:04:46.420 | If I want to now annotate more concepts,
00:04:48.660 | if I want to have various types of fine-grained concepts,
00:04:51.300 | then it won't really scale.
00:04:53.220 | So now you come up to these sort of
00:04:54.500 | different learning paradigms,
00:04:55.700 | for example, semi-supervised learning,
00:04:57.540 | where the idea is, of course,
00:04:58.580 | you have this annotated corpus of supervised data,
00:05:01.360 | and you have lots of these unlabeled images.
00:05:03.700 | And the idea is that the algorithm should basically
00:05:05.580 | try to measure some kind of consistency,
00:05:07.960 | or really try to measure some kind of signal
00:05:10.260 | on this sort of unlabeled data
00:05:12.140 | to make itself more confident
00:05:14.180 | about what it's really trying to predict.
00:05:16.160 | So by access to this lots of unlabeled data,
00:05:19.660 | the idea is that the algorithm actually learns
00:05:22.220 | to be more confident,
00:05:23.500 | and actually gets better at predicting these concepts.
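
(A hedged PyTorch sketch of the consistency idea just described: the labeled data gives the usual supervised loss, while predictions on two augmented views of the unlabeled data are pushed to agree. `model` and `augment` are hypothetical placeholders, not a specific published method.)

```python
import torch.nn.functional as F

def semi_supervised_step(model, augment, labeled_x, labels, unlabeled_x):
    # Usual supervised term on the small labeled set.
    sup_loss = F.cross_entropy(model(labeled_x), labels)
    # Consistency term: two augmented views of the unlabeled data
    # should receive (roughly) the same predicted distribution.
    p1 = F.softmax(model(augment(unlabeled_x)), dim=1)
    p2 = F.softmax(model(augment(unlabeled_x)), dim=1)
    consistency = ((p1 - p2) ** 2).sum(dim=1).mean()
    return sup_loss + consistency
```
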
00:05:26.880 | And now we come to the other extreme,
00:05:28.460 | which is like self-supervised learning.
00:05:30.500 | The idea basically is that the machine,
00:05:32.260 | or the algorithm should really discover concepts,
00:05:34.700 | or discover things about the world,
00:05:36.380 | or learn representations about the world which are useful,
00:05:39.180 | without access to explicit human supervision.
00:05:41.740 | - So the word supervision is still
00:05:44.300 | in the term self-supervised.
00:05:46.240 | So what is the supervision signal?
00:05:48.520 | And maybe that perhaps is why Yann LeCun and you argue
00:05:51.960 | that unsupervised is the incorrect terminology here.
00:05:55.000 | So what is the supervision signal
00:05:57.400 | when the humans aren't part of the picture?
00:05:59.680 | Or not a big part of the picture?
00:06:02.360 | - Right.
00:06:03.200 | So self-supervised,
00:06:04.480 | the reason it has the term supervised in itself
00:06:06.720 | is because you're using the data itself as supervision.
00:06:10.320 | So because the data serves as its own source of supervision,
00:06:13.200 | it's self-supervised in that way.
00:06:15.140 | Now the reason a lot of people,
00:06:16.380 | I mean, we did it in that blog post with Yann,
00:06:18.380 | but a lot of other people have also argued
00:06:20.100 | for using this term self-supervised.
00:06:22.100 | So starting from like '94 from Virginia de Sa's group,
00:06:25.620 | I think she's now at UCSD.
00:06:28.780 | Jitendra Malik has said this a bunch of times as well.
00:06:31.640 | So you have supervised.
00:06:33.060 | And then unsupervised basically means everything
00:06:35.180 | which is not supervised.
00:06:36.420 | But that includes stuff like semi-supervised,
00:06:38.620 | that includes other like transductive learning,
00:06:41.300 | lots of other sort of settings.
00:06:43.020 | So that's the reason like now people are preferring
00:06:46.040 | this term self-supervised
00:06:47.100 | because it explicitly says what's happening.
00:06:49.260 | The data itself is the source of supervision
00:06:51.620 | and any sort of learning algorithm
00:06:53.100 | which tries to extract just sort of data supervision signals
00:06:56.900 | from the data itself is a self-supervised learning algorithm.
00:06:59.460 | - But there is within the data,
00:07:02.140 | a set of tricks which unlock the supervision.
00:07:04.940 | - Right.
00:07:05.780 | - So can you give maybe some examples?
00:07:07.180 | And there's innovation ingenuity required
00:07:11.360 | to unlock that supervision.
00:07:12.840 | The data doesn't just speak to you some ground truth.
00:07:15.580 | You have to do some kind of trick.
00:07:17.760 | So I don't know what your favorite domain is.
00:07:19.580 | So you specifically specialize in visual learning,
00:07:23.020 | but is there favorite examples
00:07:24.500 | maybe in language or other domains?
00:07:26.500 | - Perhaps the most successful applications
00:07:28.300 | have been in NLP, language processing.
00:07:31.060 | So the idea basically being that you can train models
00:07:34.000 | that can, you have a sentence
00:07:35.780 | and you mask out certain words.
00:07:37.380 | And now these models learn to predict the masked out words.
00:07:40.500 | So if you have like the cat jumped over the dog,
00:07:44.000 | so you can basically mask out cat.
00:07:45.940 | And now you are essentially asking the model to predict
00:07:47.880 | what was missing, what did I mask out?
00:07:50.280 | So the model is going to predict basically
00:07:52.460 | a distribution over all the possible words that it knows.
00:07:55.320 | And probably it has like, if it's a well-trained model,
00:07:58.360 | it has a sort of higher probability density
00:08:00.580 | for this word cat.
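
(A minimal sketch of the masked-word prediction being described, using the Hugging Face transformers library; the library, model name, and example sentence are illustrative assumptions, not something specified in the conversation.)

```python
from transformers import pipeline

# Ask a pretrained masked language model to fill in the blank.
fill = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill("The [MASK] jumped over the dog."):
    print(pred["token_str"], round(pred["score"], 3))
# A well-trained model puts most of the probability mass on plausible
# words such as "cat" or "dog", i.e. the distribution described above.
```
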
00:08:02.560 | For vision, I would say the sort of more,
00:08:05.500 | I mean, the easier example,
00:08:07.440 | which is not as widely used these days
00:08:09.400 | is basically say, for example, video prediction.
00:08:12.000 | So video is again, a sequence of things.
00:08:14.060 | So you can ask the model.
00:08:15.000 | So if you have a video of say 10 seconds,
00:08:17.400 | you can feed in the first nine seconds to a model
00:08:19.800 | and then ask it, hey, what happens basically
00:08:21.920 | in the 10th second?
00:08:22.760 | Can you predict what's going to happen?
00:08:24.480 | And the idea basically is because the model
00:08:26.720 | is predicting something about the data itself.
00:08:29.400 | Of course, you didn't need any human
00:08:31.360 | to tell you what was happening
00:08:32.280 | because the 10 second video was naturally captured.
00:08:34.580 | Because the model is predicting what's happening there,
00:08:36.640 | it's going to automatically learn something
00:08:39.000 | about the structure of the world,
00:08:40.000 | how objects move, object permanence,
00:08:42.560 | and these kinds of things.
00:08:43.980 | So like if I have something at the edge of the table,
00:08:45.920 | it'll fall down.
00:08:47.500 | Things like these,
00:08:48.340 | which you really don't have to sit and annotate.
00:08:50.240 | In a supervised learning setting,
00:08:51.280 | I would have to sit and annotate.
00:08:52.260 | This is a cup.
00:08:53.100 | Now I move this cup.
00:08:54.100 | This is still a cup.
00:08:55.160 | And now I move this cup, it's still a cup.
00:08:56.600 | And then it falls down and this is a fallen down cup.
00:08:58.800 | So I won't have to annotate all of these things
00:09:00.400 | in a self-supervised setting.
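
(A hypothetical sketch of the video-prediction idea: the model reads most of a clip and is trained to predict what comes next, with the video itself supplying the target. The architecture, tensor sizes, and L2 loss are illustrative assumptions only.)

```python
import torch
import torch.nn as nn

# A batch of 8 clips, 10 frames each, with every frame flattened to a vector.
frames = torch.randn(8, 10, 3 * 64 * 64)
context, target = frames[:, :9], frames[:, 9]   # first nine frames vs. the last one

encoder = nn.GRU(input_size=3 * 64 * 64, hidden_size=256, batch_first=True)
predictor = nn.Linear(256, 3 * 64 * 64)

_, h = encoder(context)                 # summarize what happened so far
pred = predictor(h[-1])                 # guess the final frame
loss = ((pred - target) ** 2).mean()    # no human labels: the data is the target
loss.backward()
```
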
00:09:02.000 | - Isn't that kind of a brilliant little trick
00:09:05.240 | of taking a series of data that is consistent
00:09:08.280 | and removing one element in that series,
00:09:11.860 | and then teaching the algorithm to predict that element.
00:09:16.860 | Isn't that, first of all, that's quite brilliant.
00:09:19.620 | It seems to be applicable in anything
00:09:23.040 | that has the constraint of being a sequence
00:09:27.880 | that is consistent with the physical reality.
00:09:30.220 | The question is, are there other tricks like this
00:09:34.400 | that can generate the self-supervision signal?
00:09:37.840 | - So sequence is possibly the most widely used one in NLP.
00:09:41.200 | For vision, the one that is actually used for images,
00:09:44.080 | which is very popular these days,
00:09:45.840 | is basically taking an image
00:09:47.600 | and now taking different crops of that image.
00:09:50.080 | So you can basically decide to crop,
00:09:51.400 | say the top left corner,
00:09:53.100 | and you crop, say the bottom right corner,
00:09:55.280 | and asking the network to basically present it
00:09:58.080 | with a choice saying that, okay, now you have this image,
00:10:01.360 | you have this image, are these the same or not?
00:10:04.480 | And so the idea basically is that because different,
00:10:06.680 | like in an image, different parts of the image
00:10:08.480 | are going to be related.
00:10:09.800 | So for example, if you have a chair and a table,
00:10:12.400 | basically these things are going to be close by,
00:10:15.080 | versus if you take, again,
00:10:16.880 | if you have like a zoomed in picture of a chair,
00:10:19.520 | if you're taking different crops,
00:10:20.520 | it's going to be different parts of the chair.
00:10:22.360 | So the idea basically is that different crops
00:10:25.040 | of the image are related,
00:10:26.160 | and so the features or the representations
00:10:27.920 | that you get from these different crops
00:10:29.040 | should also be related.
00:10:30.320 | So this is possibly the most widely used trick these days
00:10:33.120 | for self-supervised learning in computer vision.
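
(A minimal PyTorch/torchvision sketch of the two-crops idea: two random crops of the same image should map to nearby features. The encoder choice and augmentations are assumptions, and real methods add negatives, clustering, or other machinery to avoid the trivial solution where every image collapses to the same feature.)

```python
import torch.nn.functional as F
from torchvision import transforms
from torchvision.models import resnet18

# Random crop plus light augmentation, applied twice to the same image.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

encoder = resnet18(num_classes=128)     # maps a crop to a 128-d embedding

def crop_agreement_loss(pil_image):
    v1 = augment(pil_image).unsqueeze(0)        # crop 1
    v2 = augment(pil_image).unsqueeze(0)        # crop 2
    z1 = F.normalize(encoder(v1), dim=1)
    z2 = F.normalize(encoder(v2), dim=1)
    return 1 - (z1 * z2).sum(dim=1).mean()      # pull the two views together
```
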
00:10:35.760 | - So again, using the consistency
00:10:38.400 | that's inherent to physical reality,
00:10:40.200 | in visual domain, that's, you know,
00:10:43.800 | parts of an image are consistent,
00:10:45.640 | and then in the language domain,
00:10:48.400 | or anything that has sequences,
00:10:50.280 | like language or something that's like a time series,
00:10:53.000 | then you can chop off parts in time.
00:10:55.440 | It's similar to the story of RNNs and CNNs,
00:10:59.380 | of RNNs and ConvNets.
00:11:02.320 | - You and Yann LeCun wrote the blog post in March 2021,
00:11:06.640 | titled "Self-supervised learning,
00:11:08.840 | the dark matter of intelligence."
00:11:11.080 | Can you summarize this blog post
00:11:12.640 | and maybe explain the main idea or set of ideas?
00:11:15.640 | - The blog post was mainly about sort of just telling,
00:11:18.680 | I mean, this is really an accepted fact,
00:11:21.680 | I would say for a lot of people now,
00:11:22.960 | that self-supervised learning is something
00:11:24.320 | that is going to play an important role
00:11:27.200 | for machine learning algorithms that come in the future,
00:11:29.240 | and even now.
00:11:30.400 | - Let me just comment that we don't yet
00:11:33.840 | have a good understanding of what dark matter is.
00:11:36.480 | - That's true.
00:11:37.320 | (laughing)
00:11:39.000 | The idea basically being--
00:11:39.840 | - Maybe the metaphor doesn't exactly transfer,
00:11:41.840 | but maybe it's actually perfectly transfers,
00:11:44.840 | that we don't know, we have an inkling
00:11:47.880 | that it'll be a big part
00:11:49.280 | of whatever solving intelligence looks like.
00:11:51.240 | - Right, so I think self-supervised learning,
00:11:52.960 | the way it's done right now,
00:11:54.080 | is I would say like the first step
00:11:56.240 | towards what it probably should end up learning,
00:11:58.560 | or what it should enable us to do.
00:12:00.200 | - Yeah.
00:12:01.040 | - So the idea for that particular piece was,
00:12:03.720 | self-supervised learning is going to be a very powerful way
00:12:06.160 | to learn common sense about the world,
00:12:08.400 | or like stuff that is really hard to label.
00:12:10.800 | For example, like is this piece
00:12:13.720 | over here heavier than the cup?
00:12:15.640 | Now, for all these kinds of things,
00:12:17.480 | you'll have to sit and label these things.
00:12:18.720 | So supervised learning is clearly not going to scale.
00:12:21.520 | So what is the thing that's actually going to scale?
00:12:23.480 | It's probably going to be an agent
00:12:25.040 | that can either actually interact with it,
00:12:26.800 | to lift it up, or observe me doing it.
00:12:29.960 | So if I'm basically lifting these things up,
00:12:31.520 | it can probably reason about,
00:12:32.560 | hey, this is taking him more time to lift up,
00:12:34.720 | or the velocity is different.
00:12:36.160 | Whereas the velocity for this is different,
00:12:37.800 | probably this one is heavier.
00:12:39.560 | So essentially by observations of the data,
00:12:41.960 | you should be able to infer a lot of things about the world
00:12:44.760 | without someone explicitly telling you,
00:12:46.800 | this is heavy, this is not,
00:12:48.680 | this is something that can pour,
00:12:49.960 | this is something that cannot pour,
00:12:51.160 | this is somewhere that you can sit,
00:12:52.440 | this is not somewhere that you can sit.
00:12:53.880 | - But you've just mentioned ability
00:12:55.480 | to interact with the world.
00:12:57.440 | There's so many questions that are yet to be,
00:13:00.960 | that are still open, which is,
00:13:02.800 | how do you select a set of data
00:13:04.440 | over which the self-supervised learning process works?
00:13:08.600 | How much interactivity, like in the active learning,
00:13:11.480 | or the machine teaching context is there?
00:13:14.360 | What are the reward signals?
00:13:16.440 | Like how much actual interaction there is
00:13:18.520 | with the physical world?
00:13:20.040 | That kind of thing.
00:13:21.400 | So that could be a huge question.
00:13:24.760 | And then on top of that,
00:13:26.680 | which I have a million questions about,
00:13:28.920 | which we don't know the answers to,
00:13:30.360 | but it's worth talking about,
00:13:32.000 | is how much reasoning is involved?
00:13:35.080 | How much accumulation of knowledge
00:13:38.480 | versus something that's more akin to learning,
00:13:40.760 | or whether that's the same thing.
00:13:43.200 | But so we're like, it is truly dark matter.
00:13:46.520 | - We don't know how exactly to do it.
00:13:49.160 | But we are, I mean, a lot of us are actually convinced
00:13:52.040 | that it's going to be a sort of major thing
00:13:54.200 | in machine learning.
00:13:55.040 | Let me reframe it then,
00:13:56.600 | that human supervision cannot be at large scale,
00:14:01.160 | the source of the solution to intelligence.
00:14:04.120 | So the machines have to discover the supervision
00:14:08.000 | in the natural signal of the world.
00:14:10.240 | - I mean, the other thing is also that humans
00:14:12.440 | are not particularly good labelers.
00:14:14.200 | They're not very consistent.
00:14:16.000 | For example, like what's the difference
00:14:17.880 | between a dining table and a table?
00:14:19.840 | Is it just the fact that one,
00:14:21.560 | like if you just look at a particular table,
00:14:23.080 | what makes us say one is dining table and the other is not?
00:14:26.480 | Humans are not particularly consistent.
00:14:28.160 | They're not like very good sources of supervision
00:14:30.080 | for a lot of these kind of edge cases.
00:14:32.320 | So it may be also the fact that if we want,
00:14:35.880 | like want an algorithm or want a machine
00:14:37.960 | to solve a particular task for us,
00:14:39.640 | we can maybe just specify the end goal.
00:14:42.120 | And like the stuff in between,
00:14:44.240 | we really probably should not be specifying
00:12:46.080 | because we're maybe going to confuse it a lot, actually.
00:14:49.320 | - Well, humans can't even answer the meaning of life.
00:14:51.440 | So I'm not sure if we're good supervisors
00:14:53.920 | of the end goal either.
00:14:55.220 | So let me ask you about categories.
00:14:56.960 | Humans are not very good at telling the difference
00:14:59.040 | between what is and isn't a table, like you mentioned.
00:15:01.940 | Do you think it's possible,
00:15:04.560 | let me ask you like a, pretend you're Plato.
00:15:08.140 | Is it possible to create a pretty good taxonomy
00:15:14.800 | of objects in the world?
00:15:16.400 | It seems like a lot of approaches in machine learning
00:15:19.000 | kind of assume a hopeful vision that it's possible
00:15:22.160 | to construct a perfect taxonomy,
00:15:24.080 | or it exists perhaps out of our reach,
00:15:26.520 | but we can always get closer and closer to it.
00:15:28.820 | Or is that a hopeless pursuit?
00:15:30.400 | - I think it's hopeless in some way.
00:15:33.040 | So the thing is for any particular categorization
00:15:36.080 | that you create,
00:15:36.920 | if you have a discrete sort of categorization,
00:15:38.760 | I can always take the nearest two concepts
00:15:40.520 | or I can take a third concept and I can blend it in
00:15:42.600 | and I can create a new category.
00:15:44.500 | So if you were to enumerate N categories,
00:15:46.560 | I will always find an N plus one category for you.
00:15:48.900 | That's not going to be in the N categories.
00:15:50.720 | And I can actually create not just N plus one,
00:15:52.440 | I can very easily create far more than N categories.
00:15:55.160 | The thing is a lot of things we talk about
00:15:57.320 | are actually compositional.
00:15:59.000 | So it's really hard for us to come and sit in
00:16:01.800 | and enumerate all of these out.
00:16:03.240 | And they compose in various weird ways, right?
00:16:05.880 | Like you have like a croissant and a donut come together
00:16:08.360 | to form a cronut.
00:16:09.720 | So if you were to like enumerate all the foods up until,
00:16:12.440 | I don't know, whenever the cronut was about 10 years ago
00:16:15.200 | or 15 years ago,
00:16:16.460 | then this entire thing called cronut would not exist.
00:16:19.000 | - Yeah, I remember there was the most awesome video
00:16:21.760 | of a cat wearing a monkey costume.
00:16:23.520 | People should look it up, it's great.
00:16:28.240 | So is that a monkey or is that a cat?
00:16:31.000 | So it's a very difficult philosophical question.
00:16:33.840 | So there is a concept of similarity between objects.
00:16:37.280 | So you think that can take us very far?
00:16:39.860 | Just kind of getting a good function,
00:16:43.200 | a good way to tell which parts of things are similar
00:16:47.920 | and which parts of things are very different?
00:16:50.700 | - I think so, yeah.
00:16:51.780 | So you don't necessarily need to name everything
00:16:54.320 | or assign a name to everything to be able to use it, right?
00:16:57.820 | So there are like lots of--
00:16:59.540 | - Shakespeare said that, what's in a name?
00:17:01.740 | - What's in a name, yeah.
00:17:02.580 | - Yeah, okay.
00:17:03.540 | - I mean, lots of like, for example, animals, right?
00:17:05.820 | They don't have necessarily a well-formed
00:17:08.140 | like syntactic language,
00:17:09.540 | but they're able to go about their day perfectly.
00:17:11.820 | The same thing happens for us.
00:17:12.900 | So, I mean, we probably look at things and we figure out,
00:17:17.100 | oh, this is similar to something else
00:17:18.500 | that I've seen before.
00:17:19.340 | And then I can probably learn how to use it.
00:17:22.020 | So I haven't seen all the possible doorknobs in the world.
00:17:26.300 | But if you show me, like I was able to get
00:17:28.700 | into this particular place fairly easily.
00:17:30.380 | I've never seen that particular doorknob.
00:17:32.140 | So I, of course, related to all the doorknobs that I've seen
00:17:34.340 | and I know exactly how it's going to open.
00:17:36.540 | I have a pretty good idea of how it's going to open.
00:17:39.420 | And I think this kind of translation between experiences
00:17:41.820 | only happens because of similarity.
00:17:43.700 | Because I'm able to relate it to a doorknob.
00:17:45.380 | If I related it to a hairdryer,
00:17:46.580 | I would probably be stuck still outside,
00:17:48.380 | not able to get in.
00:17:49.380 | - Again, a bit of a philosophical question,
00:17:52.220 | but can similarity take us all the way
00:17:55.620 | to understanding a thing?
00:17:57.680 | Can having a good function that compares objects
00:18:01.940 | get us to understand something profound
00:18:04.900 | about singular objects?
00:18:07.220 | - I think I'll ask you a question back.
00:18:08.620 | What does it mean to understand objects?
00:18:11.560 | - Well, let me tell you what that's similar to.
00:18:14.340 | (laughing)
00:18:15.660 | So there's an idea of sort of reasoning
00:18:17.700 | by analogy kind of thing.
00:18:19.740 | I think understanding is the process of placing that thing
00:18:24.740 | in some kind of network of knowledge that you have.
00:18:28.420 | That it perhaps is fundamentally related to other concepts.
00:18:33.180 | So it's not like understanding is fundamentally related
00:18:36.500 | by like composition of other concepts
00:18:39.260 | and maybe in relation to other concepts.
00:18:41.480 | And maybe like deeper and deeper understanding
00:18:45.800 | is maybe just adding more edges to that graph somehow.
00:18:50.800 | So maybe it is a composition of similarities.
00:18:55.080 | I mean, ultimately, I suppose it is a kind of embedding
00:18:59.560 | in that wisdom space.
00:19:02.480 | - Yeah.
00:19:03.320 | (laughing)
00:19:04.140 | Okay, wisdom space is good.
00:19:05.240 | I think, I do think, right?
00:19:08.040 | So similarity does get you very, very far.
00:19:10.760 | Is it the answer to everything?
00:19:12.360 | I mean, I don't even know what everything is,
00:19:14.140 | but it's going to take us really far.
00:19:16.720 | And I think the thing is,
00:19:18.920 | things are similar in very different contexts, right?
00:19:21.680 | So an elephant is similar to, I don't know,
00:19:24.360 | another sort of wild animal.
00:19:25.640 | Let's just pick, I don't know, lion.
00:19:27.440 | And in a different way,
00:19:28.520 | because they're both four-legged creatures.
00:19:30.560 | They're also land animals.
00:19:32.080 | But of course, they're very different
00:19:33.160 | in a lot of different ways.
00:19:34.000 | So elephants are like herbivores, lions are not.
00:19:37.280 | So similarity does,
00:19:38.560 | similarity and particularly dissimilarity
00:19:40.680 | also actually helps us understand a lot about things.
00:19:43.760 | And so that's actually why I think
00:19:45.240 | discrete categorization is very hard.
00:19:47.640 | Just like forming this particular category of elephant
00:19:50.080 | and a particular category of lion,
00:19:51.880 | maybe it's good for just like taxonomy,
00:19:54.400 | biological taxonomies.
00:19:55.800 | But when it comes to other things,
00:19:57.240 | which are not as maybe, for example, like grilled cheese.
00:20:01.760 | I have a grilled cheese, I dip it in tomato
00:20:03.120 | and I keep it outside.
00:20:04.000 | Now, is that still a grilled cheese
00:20:05.080 | or is that something else?
00:20:06.760 | - Right, so categorization is still very useful
00:20:09.800 | for solving problems.
00:20:11.280 | But is your intuition then sort of the self-supervised
00:20:15.960 | should be the, to borrow Yann LeCun's terminology,
00:20:20.960 | should be the cake and then categorization,
00:20:23.680 | the classification, maybe the supervised like layer
00:20:27.400 | should be just like the thing on top,
00:20:29.120 | the cherry or the icing or whatever.
00:20:31.040 | So if you make it the cake,
00:20:32.960 | it gets in the way of learning.
00:20:35.560 | - If you make it the cake,
00:20:36.400 | then you won't be able to sit and annotate everything.
00:20:39.400 | That's as simple as it is.
00:20:40.680 | Like that's my very practical view on it.
00:20:43.120 | It's just, I mean, in my PhD,
00:21:44.960 | I sat down and annotated like a bunch of cars
00:20:47.040 | for one of my projects.
00:20:48.480 | And very quickly, I was just like,
00:20:49.920 | it was in a video and I was basically drawing boxes
00:21:52.200 | around all these cars.
00:20:53.560 | And I think I spent about a week doing all of that
00:20:55.640 | and I barely got anything done.
00:20:57.680 | And basically, this was, I think my first year of my PhD
00:21:00.320 | or like second year of my master's.
00:21:02.760 | And then by the end of it, I'm like, okay,
00:21:04.040 | this is just hopeless.
00:21:05.040 | I can keep doing it.
00:21:06.000 | And when I had done that, someone came up to me
00:21:08.480 | and they basically told me,
00:21:09.600 | oh, this is a pickup truck, this is not a car.
00:21:12.800 | And that's like, aha, this actually makes sense
00:21:14.840 | because a pickup truck is not really,
00:21:16.080 | like what was I annotating?
00:21:17.040 | Was I annotating anything that is mobile?
00:21:19.600 | Or was I annotating particular sedans?
00:21:21.440 | Or was I annotating SUVs?
00:21:22.720 | What was I doing?
00:21:23.640 | - By the way, the annotation was bounding boxes?
00:21:25.760 | - Bounding boxes, yeah.
00:21:27.000 | - There's so many deep, profound questions here
00:21:30.080 | that you're almost cheating your way out of
00:21:32.200 | by doing self-supervised learning, by the way,
00:21:34.400 | which is like what makes for an object.
00:21:37.480 | As opposed to solve intelligence,
00:21:39.040 | maybe you don't ever need to answer that question.
00:21:41.560 | I mean, this is the question
00:21:43.680 | that anyone that's ever done annotation
00:21:45.280 | because it's so painful gets to ask,
00:21:48.000 | like, why am I doing,
00:21:49.960 | drawing very careful line around this object?
00:21:55.440 | Like what is the value?
00:21:57.520 | I remember when I first saw semantic segmentation
00:22:00.200 | where you have like instant segmentation
00:22:03.640 | where you have a very exact line around the object
00:22:08.280 | in the 2D plane of a fundamentally 3D object
00:22:11.720 | projected on a 2D plane.
00:22:13.400 | So you're drawing a line around a car
00:22:15.800 | that might be occluded,
00:22:16.920 | there might be another thing in front of it,
00:22:18.840 | but you're still drawing the line
00:22:20.360 | of the part of the car that you see.
00:22:22.680 | How is that the car?
00:22:25.880 | Why is that the car?
00:22:27.880 | Like I had like an existential crisis every time.
00:22:31.000 | Like how is that going to help us understand
00:22:33.520 | or solve computer vision?
00:22:35.320 | I'm not sure I have a good answer to what's better.
00:22:38.240 | And I'm not sure I share the confidence that you have
00:22:41.520 | that self-supervised learning can take us far.
00:22:46.520 | I think I'm more and more convinced
00:22:48.560 | that it's a very important component,
00:22:50.840 | but I still feel like we need to understand what makes,
00:22:54.160 | like this dream of maybe what it's called symbolic AI,
00:23:01.400 | of arriving, like once you have this common sense base,
00:23:05.560 | be able to play with these concepts
00:23:09.000 | and build graphs or hierarchies of concepts on top
00:23:13.440 | in order to then like form a deep sense
00:23:18.440 | of this three-dimensional world or four-dimensional world
00:23:22.040 | and be able to reason and then project that onto 2D plane
00:23:25.480 | in order to interpret a 2D image.
00:23:28.520 | Can I ask you just an out there question?
00:23:30.960 | I remember, I think Andrej Karpathy had a blog post
00:23:35.000 | about computer vision, like being really hard.
00:23:39.000 | I forgot what the title was, but it was many, many years ago.
00:23:42.080 | And he had, I think President Obama stepping on a scale
00:23:44.760 | and there was humor
00:23:45.600 | and there was a bunch of people laughing and whatever.
00:23:48.440 | And there's a lot of interesting things about that image.
00:23:52.000 | And I think Andrej highlighted a bunch of things
00:23:55.140 | about the image that us humans
00:23:56.600 | are able to immediately understand.
00:23:59.000 | Like the idea, I think of gravity
00:24:00.960 | and that you have the concept of a weight.
00:24:04.040 | You immediately project because of our knowledge of pose
00:24:08.160 | and how human bodies are constructed,
00:24:10.360 | you understand how the forces are being applied
00:24:13.040 | with the human body.
00:24:14.580 | The really interesting other thing
00:24:16.080 | that you're able to understand
00:24:17.440 | is multiple people looking at each other in the image.
00:24:20.520 | You're able to have a mental model
00:24:22.360 | of what the people are thinking about.
00:24:23.760 | You're able to infer like,
00:24:25.320 | oh, this person is probably thinks,
00:24:27.520 | like is laughing at how humorous the situation is.
00:24:31.240 | And this person is confused about what the situation is
00:24:34.160 | because they're looking this way.
00:24:35.600 | We're able to infer all of that.
00:24:37.540 | So that's human vision.
00:24:40.460 | How difficult is computer vision?
00:24:45.040 | Like in order to achieve that level of understanding
00:24:48.440 | and maybe how big of a part
00:24:51.440 | does self-supervised learning play in that, do you think?
00:24:54.380 | And do you still, you know,
00:24:56.080 | back that was like over a decade ago,
00:24:58.440 | I think Andre and I think a lot of people agreed
00:25:00.920 | is computer vision is really hard.
00:25:03.320 | Do you still think computer vision is really hard?
00:25:06.000 | - I think it is, yes.
00:25:07.520 | And getting to that kind of understanding,
00:25:10.600 | I mean, it's really out there.
00:25:12.500 | So if you ask me to solve just that particular problem,
00:25:15.380 | I can do it the supervised learning route.
00:25:17.560 | I can always construct a data set and basically predict,
00:25:19.720 | oh, is there humor in this or not?
00:25:21.640 | And of course I can do it.
00:25:22.480 | - Actually, that's a good question.
00:25:23.600 | Do you think you can, okay, okay.
00:25:25.200 | Do you think you can do
00:25:26.480 | human supervised annotation of humor?
00:25:29.040 | - To some extent, yes.
00:25:29.960 | I'm sure it'll work.
00:25:30.880 | I mean, it won't be as bad as like randomly guessing.
00:25:34.380 | I'm sure it can still predict
00:25:35.560 | whether it's humorous or not in some way.
00:25:37.840 | - Yeah, maybe like Reddit upvotes is the signal.
00:25:40.400 | I don't know.
00:25:41.240 | - I mean, it won't do a great job, but it'll do something.
00:25:43.800 | It may actually be like, it may find certain things
00:25:46.080 | which are not humorous, humorous as well,
00:25:47.560 | which is going to be bad for us.
00:25:49.180 | But I mean, it won't be random.
00:25:52.120 | - Yeah, kind of like my sense of humor.
00:25:54.520 | - Okay, so fine.
00:25:55.920 | So you can, that particular problem, yes.
00:25:57.520 | But the general problem you're saying is hard.
00:25:59.600 | - The general problem is hard.
00:26:00.440 | And I mean, self-supervised learning
00:26:02.320 | is not the answer to everything.
00:26:03.920 | Of course it's not.
00:26:04.760 | I think if you have machines
00:26:06.760 | that are going to communicate with humans at the end of it,
00:26:08.760 | you want to understand what the algorithm is doing, right?
00:26:10.880 | You want it to be able to like produce an output
00:26:13.720 | that you can decipher, that you can understand,
00:26:15.560 | or it's actually useful for something else,
00:26:17.440 | which again is a human.
00:26:19.360 | So at some point in this sort of entire loop,
00:26:22.280 | a human steps in.
00:26:23.720 | And now this human needs to understand what's going on.
00:26:26.360 | So, and at that point,
00:26:27.640 | this entire notion of language or semantics really comes in.
00:26:30.440 | If the machine just spits out something,
00:26:32.600 | and if we can't understand it,
00:26:34.000 | then it's not really that useful for us.
00:26:36.280 | So self-supervised learning is probably going to be useful
00:26:38.440 | for a lot of the things before that part,
00:26:40.800 | before the machine really needs to communicate
00:26:42.880 | a particular kind of output with a human.
00:26:46.060 | Because I mean, otherwise,
00:26:47.800 | how is it going to do that without language?
00:26:49.920 | - Or some kind of communication.
00:26:51.880 | But you're saying that it's possible
00:26:53.280 | to build a big base of understanding or whatever,
00:26:55.880 | of what's a better- - Concepts.
00:26:58.440 | - Concepts. - Concepts, yeah.
00:26:59.780 | - Like common sense concepts.
00:27:02.300 | Supervised learning in the context of computer vision
00:27:06.120 | is something you've focused on,
00:27:07.520 | but that's a really hard domain.
00:27:08.960 | And it's kind of the cutting edge
00:27:10.480 | of what we're as a community working on today.
00:27:13.040 | Can we take a little bit of a step back
00:27:14.760 | and look at language?
00:27:16.320 | Can you summarize the history of success
00:27:19.000 | of self-supervised learning in natural language processing,
00:27:22.480 | language modeling?
00:27:23.880 | What are transformers?
00:27:25.600 | What is the masking, the sentence completion
00:27:28.760 | that you mentioned before?
00:27:30.080 | How does it lead us to understand anything?
00:27:33.560 | Semantic meaning of words,
00:27:34.800 | syntactic role of words and sentences.
00:27:37.660 | - So I'm of course not the expert in NLP.
00:27:40.120 | I kind of follow it a little bit from the sides.
00:27:43.480 | So the main sort of reason
00:27:45.800 | why all of this masking stuff works is,
00:27:47.880 | I think it's called the distributional hypothesis in NLP.
00:27:50.880 | The idea basically being that words
00:27:52.640 | that occur in the same context should have similar meaning.
00:27:55.960 | So if you have the blank jumped over the blank,
00:27:59.040 | it basically, whatever is like in the first blank
00:28:01.960 | is basically an object that can actually jump,
00:28:04.120 | is going to be something that can jump.
00:28:05.820 | So a cat or a dog, or I don't know, sheep, something,
00:28:08.360 | all of these things can basically be
00:28:09.720 | in that particular context.
00:28:11.640 | And now, so essentially the idea is that
00:28:13.420 | if you have words that are in the same context
00:28:16.040 | and you predict them,
00:28:17.320 | you're going to learn lots of useful things
00:28:20.000 | about how words are related.
00:28:21.480 | Because you're predicting by looking at their context
00:28:23.560 | what the word is going to be.
00:28:24.900 | So in this particular case, the blank jumped over the fence.
00:28:28.240 | So now if it's a sheep, the sheep jumped over the fence,
00:28:30.940 | the dog jumped over the fence.
00:28:32.400 | So essentially the algorithm or the representation
00:28:35.560 | basically puts together these two concepts together.
00:28:37.600 | So it says, okay, dogs are going to be kind of related
00:28:39.840 | to sheep because both of them occur in the same context.
00:28:42.720 | Of course, now you can decide depending
00:28:44.760 | on your particular application downstream.
00:28:46.760 | You can say that dogs are absolutely not related to sheep
00:28:49.140 | because, well, I don't, I really care about,
00:28:51.280 | you know, dog food, for example.
00:28:53.000 | I'm a dog food person and I really want to give
00:28:55.120 | this dog food to this particular animal.
00:28:57.280 | So depending on what your downstream application is,
00:29:00.080 | of course, this notion of similarity or this notion
00:29:03.000 | or this common sense that you've learned
00:29:04.280 | may not be applicable.
00:29:05.800 | But the point is basically that this,
00:29:08.060 | just predicting what the blanks are
00:29:09.920 | is going to take you really, really far.
00:29:11.720 | - So there's a nice feature of language
00:29:14.000 | that the number of words in a particular language
00:29:18.680 | is very large, but it's finite,
00:29:20.760 | and it's actually not that large
00:29:22.040 | in the grand scheme of things.
00:29:24.120 | I still got to, because we take it for granted,
00:29:26.540 | so first of all, when you say masking,
00:29:28.380 | you're talking about this very process of the blank,
00:29:31.520 | of removing words from a sentence,
00:29:33.440 | and then having the knowledge of what word went there
00:29:36.760 | in the initial data set.
00:29:38.540 | That's the ground truth that you're training on,
00:29:41.080 | and then you're asking the neural network
00:29:43.480 | to predict what goes there.
00:29:45.120 | That's like a little trick.
00:29:49.240 | It's a really powerful trick.
00:29:50.880 | The question is how far that takes us,
00:29:53.320 | and the other question is, is there other tricks?
00:29:56.280 | 'Cause to me, it's very possible
00:29:58.680 | there's other very fascinating tricks.
00:30:00.740 | I'll give you an example.
00:30:01.940 | In autonomous driving, there's a bunch of tricks
00:30:06.960 | that give you the self-supervised signal back.
00:30:10.360 | For example, very similar to sentences, but not really,
00:30:15.360 | which is you have signals from humans driving the car,
00:30:20.260 | because a lot of us drive cars to places,
00:30:23.680 | and so you can ask the neural network to predict
00:30:27.820 | what's going to happen in the next two seconds
00:30:30.260 | for a safe navigation through the environment,
00:30:33.400 | and the signal comes from the fact
00:30:36.200 | that you also have knowledge of what happened
00:30:38.640 | in the next two seconds, because you have video of the data.
00:30:42.120 | The question in autonomous driving, as it is in language,
00:30:46.760 | can we learn how to drive autonomously
00:30:50.160 | based on that kind of self-supervision?
00:30:53.440 | Probably the answer is no.
00:30:55.320 | The question is how good can we get?
00:30:57.720 | And the same with language.
00:30:58.880 | How good can we get, and are there other tricks?
00:31:02.320 | We get sometimes super excited by this trick
00:31:04.640 | that works really well, but I wonder,
00:31:07.320 | it's almost like mining for gold.
00:31:09.120 | I wonder how many signals there are in the data
00:31:12.760 | that could be leveraged that are like there.
00:31:15.300 | I just wanted to kind of linger on that,
00:31:18.600 | because sometimes it's easy to think
00:31:20.840 | that maybe this masking process is self-supervised learning.
00:31:24.840 | No, it's only one method.
00:31:27.200 | So there could be many, many other methods,
00:31:29.280 | many tricky methods, maybe interesting ways
00:31:33.840 | to leverage human computation in very interesting ways
00:31:36.880 | that might actually border on semi-supervised learning,
00:31:39.920 | something like that.
00:31:40.840 | Obviously, the internet is generated by humans
00:31:43.520 | at the end of the day.
00:31:44.720 | So all that to say is, what's your sense,
00:31:48.760 | in this particular context of language,
00:31:50.680 | how far can that masking process take us?
00:31:54.680 | - So it has stood the test of time, right?
00:31:56.200 | I mean, so Word2Vec, the initial sort of NLP technique
00:31:59.800 | that was using this, to now, for example,
00:32:02.120 | like all the BERT and all these big models that we get,
00:32:05.880 | BERT and RoBERTa, for example,
00:32:07.600 | all of them are still sort of based
00:32:08.800 | on the same principle of masking.
00:32:10.640 | It's taken us really far.
00:32:12.160 | I mean, you can actually do things like,
00:32:14.440 | oh, these two sentences are similar or not,
00:32:16.240 | whether this particular sentence
00:32:17.560 | follows this other sentence in terms of logic,
00:32:19.600 | so entailment, you can do a lot of these things
00:32:21.800 | with just this masking trick.
00:32:23.680 | So I'm not sure if I can predict how far it can take us,
00:32:28.400 | because when it first came out, when Word2Vec was out,
00:32:31.560 | I don't think a lot of us would have imagined
00:32:33.520 | that this would actually help us do some kind
00:32:36.000 | of entailment problems and really that well.
00:32:38.560 | And so just the fact that by just scaling up
00:32:40.960 | the amount of data that we're training on
00:32:42.360 | and using better and more powerful
00:32:44.640 | neural network architectures has taken us from that to this,
00:32:47.600 | is just showing you how maybe poor predictors we are,
00:32:52.280 | like as humans, how poor we are at predicting
00:32:54.880 | how successful a particular technique is going to be.
00:32:57.360 | So I think I can say something now,
00:32:58.680 | but like 10 years from now,
00:33:00.080 | I look completely stupid basically predicting this.
00:33:02.800 | - In the language domain, is there something in your work
00:33:07.160 | that you find useful and insightful
00:33:09.560 | and transferable to computer vision,
00:33:12.560 | but also just, I don't know, beautiful and profound
00:33:15.720 | that I think carries through to the vision domain?
00:33:18.160 | - I mean, the idea of masking has been very powerful.
00:33:21.040 | It has been used in vision as well for predicting,
00:33:23.680 | like you say, the next, if you have in sort of frames
00:33:27.200 | and you predict what's going to happen in the next frame.
00:33:29.400 | So that's been very powerful.
00:33:30.960 | In terms of modeling, like in just terms,
00:33:32.880 | in terms of architecture,
00:33:33.800 | I think you would have asked about transformers a while back.
00:33:36.880 | That has really become,
00:33:38.360 | like it has become super exciting for computer vision now.
00:33:40.840 | Like in the past, I would say year and a half,
00:33:42.760 | it's become really powerful.
00:33:44.160 | - What's a transformer?
00:33:45.240 | - Right.
00:33:46.080 | I mean, the core part of a transformer
00:33:47.440 | is something called the self-attention model.
00:33:49.040 | So it came out of Google.
00:33:50.440 | And the idea basically is that if you have N elements,
00:33:53.780 | what you're creating is a way for all of these N elements
00:33:56.520 | to talk to each other.
00:33:57.880 | So the idea basically is that you are paying attention.
00:34:01.820 | Each element is paying attention
00:34:03.200 | to each of the other element.
00:34:04.980 | And basically by doing this,
00:34:06.800 | it's really trying to figure out,
00:34:08.980 | you're basically getting a much better view of the data.
00:34:11.460 | So for example, if you have a sentence of like four words,
00:34:14.500 | the point is if you get a representation
00:34:16.340 | or a feature for this entire sentence,
00:34:18.320 | it's constructed in a way such that each word
00:34:21.280 | has paid attention to everything else.
00:34:23.840 | Now, the reason it's like different from say,
00:34:26.100 | what you would do in a conf net
00:34:28.440 | is basically that in a conf net,
00:34:29.560 | you would only pay attention to a local window.
00:34:31.400 | So each word would only pay attention to its next neighbor
00:34:34.520 | or like one neighbor after that.
00:34:36.160 | And the same thing goes for images.
00:34:37.860 | In images, you would basically pay attention to pixels
00:34:40.120 | in a 3x3 or a 7x7 neighborhood.
00:34:42.800 | And that's it.
00:34:43.680 | Whereas with the transformer, the self-attention mainly,
00:34:46.020 | the sort of idea is that each element
00:34:48.760 | needs to pay attention to each other element.
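
(A bare-bones sketch of the self-attention operation being described: every element scores every other element and mixes in information from all of them. Single head, no learned layers beyond three projection matrices; the sizes are illustrative.)

```python
import torch
import torch.nn.functional as F

def self_attention(x, Wq, Wk, Wv):
    # x: (n, d) -- n elements (words or image patches), each a d-dim vector
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / (k.shape[-1] ** 0.5)   # every element scores every other one
    weights = F.softmax(scores, dim=-1)       # attention weights, each row sums to 1
    return weights @ v                        # each output mixes all inputs

n, d = 4, 8                                   # e.g. a four-word sentence
x = torch.randn(n, d)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)           # (4, 8): every word has attended to every word
```
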
00:34:50.440 | - And when you say attention,
00:34:51.980 | maybe another way to phrase that
00:34:53.400 | is you're considering a context,
00:34:57.680 | a wide context in terms of the wide context of the sentence
00:35:01.560 | in understanding the meaning of a particular word
00:35:05.120 | and in computer vision,
00:35:06.160 | that's understanding a larger context
00:35:07.840 | to understand the local pattern
00:35:10.080 | of a particular local part of an image.
00:35:13.080 | - Right, so basically if you have say,
00:35:14.940 | again, a banana in the image,
00:35:16.520 | you're looking at the full image first.
00:35:18.600 | So whether it's like, you're looking at all the pixels
00:35:20.960 | that are of a kitchen, of a dining table and so on.
00:35:23.760 | And then you're basically looking at the banana also.
00:35:25.920 | - Yeah, by the way, in terms of,
00:35:27.200 | if we were to train the funny classifier,
00:35:29.220 | there's something funny about the word banana.
00:35:32.000 | Just wanted to anticipate that.
00:35:33.840 | - I am wearing a banana shirt, so yeah.
00:35:36.200 | - Is there bananas on it?
00:35:37.500 | Okay, so masking has worked for the vision context as well.
00:35:42.400 | - And so this transformer idea has worked as well.
00:35:44.280 | So basically looking at all the elements
00:35:46.240 | to understand a particular element
00:35:48.120 | has been really powerful in vision.
00:35:49.880 | The reason is like a lot of things
00:35:52.040 | when you're looking at them in isolation.
00:35:53.440 | So if you look at just a blob of pixels,
00:35:55.560 | so Antonio Torralba at MIT used to have this
00:35:57.600 | like really famous image,
00:35:58.920 | which I looked at when I was a PhD student,
00:36:01.000 | where he would basically have a blob of pixels
00:36:02.800 | and he would ask you, "Hey, what is this?"
00:36:04.920 | And it looked basically like a shoe
00:36:06.800 | or like it could look like a TV remote,
00:36:08.840 | it could look like anything.
00:36:10.040 | And it turns out it was a beer bottle.
00:36:12.320 | But I'm not sure, it was one of these three things,
00:36:14.080 | but basically he showed you the full picture
00:36:15.400 | and then it was very obvious what it was.
00:36:17.520 | But the point is, just by looking
00:36:19.040 | at that particular local window, you couldn't figure it out.
00:36:21.840 | Because of resolution, because of other things,
00:36:23.840 | it's just not easy always to just figure it out
00:36:26.040 | by looking at just the neighborhood of pixels,
00:36:27.920 | what these pixels are.
00:36:29.640 | And the same thing happens for language as well.
00:36:31.960 | - For the parameters that have to learn something
00:36:34.240 | about the data, you need to give it the capacity
00:36:37.120 | to learn the essential things.
00:36:39.080 | Like if it's not actually able to receive the signal at all,
00:36:42.600 | then it's not gonna be able to learn that signal.
00:36:44.240 | And to understand images, to understand language,
00:36:47.240 | you have to be able to see words in their full context.
00:36:50.640 | Okay, what is harder to solve, vision or language?
00:36:54.880 | Visual intelligence or linguistic intelligence?
00:36:57.800 | - So I'm going to say computer vision is harder.
00:36:59.760 | My reason for this is basically that language,
00:37:03.240 | of course, has a big structure to it because we developed it.
00:37:06.800 | But as vision is something that is common
00:37:08.600 | in a lot of animals, everyone is able to get by,
00:37:11.400 | a lot of these animals on earth are actually able
00:37:13.320 | to get by without language.
00:37:15.080 | And a lot of these animals we also deem
00:37:17.080 | to be intelligent.
00:37:18.280 | So clearly intelligence does have
00:37:20.920 | like a visual component to it.
00:37:22.520 | And yes, of course, in the case of humans,
00:37:24.240 | it of course also has a linguistic component.
00:37:26.400 | But it means that there is something far more fundamental
00:37:28.720 | about vision than there is about language.
00:37:30.840 | And I'm sorry to anyone who disagrees,
00:37:32.960 | but yes, this is what I feel.
00:37:34.360 | - So that's being a little bit reflected
00:37:36.920 | in the challenges that have to do with the progress
00:37:40.800 | of self-supervised learning, would you say?
00:37:42.520 | Or is that just the peculiar accidents
00:37:45.560 | of the progress of the AI community
00:37:47.400 | that we focused on, or we discovered self-attention
00:37:50.240 | and transformers in the context of language first?
00:37:53.640 | - So like the self-supervised learning success
00:37:55.520 | was actually for vision has not much to do
00:37:58.840 | with the transformers part, I would say.
00:38:00.320 | It's actually been independent a little bit.
00:38:02.480 | I think it's just that the signal was a little bit different
00:38:05.320 | for vision than there was for like NLP
00:38:08.120 | and probably NLP folks discovered it before.
00:38:11.240 | So for vision, the main success has basically
00:38:13.360 | been this like crops so far,
00:38:14.840 | like taking different crops of images.
00:38:16.800 | Whereas for NLP, it was this masking thing.
00:38:18.920 | - But also the level of success
00:38:20.480 | is still much higher for language.
00:38:22.080 | - It has.
00:38:22.920 | So that has a lot to do with, I mean,
00:38:25.660 | I can get into a lot of details.
00:38:26.920 | For this particular question, let's go for it.
00:38:28.560 | Okay, so the first thing is language is very structured.
00:38:32.280 | So you are going to produce a distribution
00:38:34.040 | over a finite vocabulary.
00:38:35.920 | English has a finite number of words.
00:38:37.680 | It's actually not that large.
00:38:39.280 | And you need to produce basically,
00:38:41.440 | for when you're doing this masking thing,
00:38:42.760 | all you need to do is basically tell me
00:38:44.160 | which one of these like 50,000 words it is.
00:38:46.440 | That's it.
00:38:47.280 | Now for vision, let's imagine doing the same thing.
00:38:49.560 | Okay, we're basically going to blank out
00:38:51.480 | a particular part of the image
00:38:52.600 | and we ask the network or this neural network
00:38:54.680 | to predict what is present in this missing patch.
00:38:58.080 | It's combinatorially large, right?
00:38:59.960 | You have 256 pixel values.
00:39:02.560 | If you're even producing basically a seven cross seven
00:39:04.840 | or a 14 cross 14, like window of pixels
00:39:08.000 | at each of these 196 or each of these 49 locations,
00:39:11.340 | you have 256 values to predict.
00:39:13.760 | And so it's really, really large.
00:39:15.280 | And very quickly, the kind of like prediction problems
00:39:19.000 | that we're setting up are going to be
00:39:20.280 | extremely like intractable for us.
00:39:22.800 | And so the thing is for NLP,
00:39:23.960 | it has been really successful
00:39:25.000 | because we are very good at predicting,
00:39:27.560 | like doing this like distribution over a finite set.
00:39:30.880 | And the problem is when this set becomes really large,
00:39:33.520 | we're going to become really, really bad
00:39:35.560 | at making these predictions
00:39:37.000 | and at solving basically this particular set of problems.
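To make that size comparison concrete, here is a quick back-of-the-envelope calculation (the 50,000-word vocabulary and the 14x14 patch of 256-valued pixels are the numbers mentioned above, used purely for illustration):

```python
import math

vocab_size = 50_000          # masked-word prediction: choose 1 of ~50k tokens
locations = 14 * 14          # a 14x14 grid of masked pixel locations
pixel_values = 256           # possible values per pixel

log10_patches = locations * math.log10(pixel_values)   # log10(256**196)
print(f"language: {vocab_size:,} possible answers per blank")
print(f"vision:   256^{locations} ~ 10^{log10_patches:.0f} possible patches")
# The vision output space is astronomically larger, which is why naive
# mask-and-predict over raw pixels is so much harder than over words.
```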
00:39:41.040 | So if you were to do it exactly in the same way
00:39:44.200 | as NLP for vision, there is very limited success.
00:39:47.040 | The way stuff is working right now
00:39:48.980 | is actually not by predicting these masks.
00:39:51.680 | It's basically by saying that you take these two
00:39:53.660 | like crops from the image,
00:39:55.160 | you get a feature representation from it
00:39:57.060 | and just saying that these two features,
00:39:58.680 | so they're like vectors,
00:40:00.440 | just saying that the distance between these vectors
00:40:02.040 | should be small.
00:40:03.220 | And so it's a very different way of learning
00:40:06.640 | from the visual signal than there is from NLP.
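As a rough illustration of that two-crop recipe (a toy sketch, not any particular published method; the tiny encoder, image sizes, and crop function below are all made up for the example):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(                      # stand-in for a real vision backbone
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 64),
)

def random_crop(images, size=24):
    """Toy stand-in for real augmentation: take a random square crop."""
    _, _, h, w = images.shape
    y = torch.randint(0, h - size + 1, (1,)).item()
    x = torch.randint(0, w - size + 1, (1,)).item()
    return images[:, :, y:y + size, x:x + size]

images = torch.randn(8, 3, 32, 32)                      # fake batch of images
z1 = F.normalize(encoder(random_crop(images)), dim=1)   # features of crop 1
z2 = F.normalize(encoder(random_crop(images)), dim=1)   # features of crop 2
loss = (z1 - z2).pow(2).sum(dim=1).mean()               # "distance should be small"
loss.backward()
# On its own this objective can collapse (every image mapping to the same
# vector); the contrastive and clustering tricks discussed later prevent that.
```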
00:40:09.120 | - Okay, the other reason is the distributional hypothesis
00:40:11.360 | that we talked about for NLP, right?
00:40:12.920 | So a word given its context,
00:40:15.160 | basically the context actually supplies
00:40:16.560 | a lot of meaning to the word.
00:40:18.440 | Now, because there are just finite number of words
00:40:22.280 | and there is a finite way in like which we compose them,
00:40:25.760 | of course, the same thing holds for pixels,
00:40:27.440 | but in language, there's a lot of structure, right?
00:40:29.760 | So I always say whatever,
00:40:31.000 | the dash jumped over the fence, for example.
00:40:33.760 | There are lots of these sentences that you'll get.
00:40:36.720 | And from this, you can actually look at
00:40:38.680 | this particular sentence might occur
00:40:40.160 | in a lot of different contexts as well.
00:40:41.480 | This exact same sentence might occur in a different context.
00:40:44.080 | So the sheep jumped over the fence,
00:40:45.560 | the cat jumped over the fence,
00:40:46.840 | the dog jumped over the fence.
00:40:48.160 | So you immediately get a lot of these words,
00:40:50.460 | which are because this particular token itself
00:40:52.720 | has so much meaning,
00:40:53.560 | you get a lot of these tokens or these words,
00:40:55.480 | which are actually going to have sort of
00:40:57.720 | this related meaning across given this context.
00:41:00.560 | Whereas for vision, it's much harder.
00:41:02.640 | Because just by like pure,
00:41:04.160 | like the way we capture images,
00:41:05.600 | lighting can be different.
00:41:07.440 | There might be like different noise in the sensor.
00:41:09.800 | So the thing is you're capturing a physical phenomenon,
00:41:12.220 | and then you're basically going through
00:41:13.840 | a very complicated pipeline of like image processing,
00:41:16.360 | and then you're translating that into
00:41:18.020 | some kind of like digital signal.
00:41:20.400 | Whereas with language, you write it down,
00:41:23.520 | and you transfer it to a digital signal,
00:41:25.040 | almost like it's a lossless like transfer.
00:41:27.440 | And each of these tokens are very, very well defined.
00:41:30.160 | - There could be a little bit of an argument there,
00:41:32.860 | because language as written down
00:41:36.140 | is a projection of thought.
00:41:39.420 | This is one of the open questions is,
00:41:42.560 | if you perfectly can solve language,
00:41:45.480 | are you getting close to being able to solve,
00:41:49.360 | easily, with flying colors, pass the Turing test kind of thing.
00:41:52.800 | So that's, it's similar, but different.
00:41:56.560 | And the computer vision problem is in the 2D plane
00:41:59.760 | is a projection of a three dimensional world.
00:42:02.640 | So perhaps there are similar problems there.
00:42:05.640 | Maybe this is-
00:42:06.480 | - I mean, I think what I'm saying is NLP is not easy.
00:42:08.560 | Of course, don't get me wrong.
00:42:09.500 | Like abstract thought expressed in knowledge,
00:42:12.900 | or knowledge basically expressed in language
00:42:14.580 | is really hard to understand, right?
00:42:16.720 | I mean, we've been communicating with language for so long,
00:42:19.140 | and it is of course a very complicated concept.
00:42:21.980 | The thing is, at least getting like somewhat reasonable,
00:42:26.980 | like being able to solve some kind of reasonable tasks
00:42:29.820 | with language, I would say slightly easier
00:42:32.060 | than it is with computer vision.
00:42:33.660 | - Yeah, I would say, yeah.
00:42:35.360 | So that's well put.
00:42:36.600 | I would say getting impressive performance
00:42:39.340 | on language is easier.
00:42:43.380 | I feel like for both language and computer vision,
00:42:45.340 | there's going to be this wall of like,
00:42:49.460 | like this hump you have to overcome
00:42:52.240 | to achieve superhuman level performance
00:42:54.780 | or human level performance.
00:42:56.580 | And I feel like for language, that wall is farther away.
00:43:00.220 | So you can get pretty nice.
00:43:01.900 | You can do a lot of tricks.
00:43:04.100 | You can show really impressive performance.
00:43:06.540 | You can even fool people that you're tweeting
00:43:09.660 | or you're blog post writing,
00:43:11.460 | or your question answering has intelligence behind it.
00:43:16.460 | But to truly demonstrate understanding of dialogue,
00:43:21.900 | of continuous long form dialogue,
00:43:25.020 | that would require perhaps big breakthroughs.
00:43:28.580 | In the same way in computer vision,
00:43:30.420 | I think the big breakthroughs need to happen earlier
00:43:33.380 | to achieve impressive performance.
00:43:36.620 | This might be a good place to, you already mentioned it,
00:43:38.740 | but what is contrastive learning
00:43:41.100 | and what are energy based models?
00:43:43.860 | - Contrastive learning is sort of a paradigm of learning
00:43:46.860 | where the idea is that you are learning this embedding space
00:43:50.700 | or so you're learning this sort of vector space
00:43:52.700 | of all your concepts.
00:43:54.500 | And the way you learn that is basically by contrasting.
00:43:56.780 | So the idea is that you have a sample,
00:43:59.100 | you have another sample that's related to it.
00:44:00.980 | So that's called the positive.
00:44:02.860 | And you have another sample that's not related to it.
00:44:05.100 | So that's negative.
00:44:06.100 | So for example, let's just take an NLP
00:44:08.340 | or a simple example in computer vision.
00:44:10.980 | So you have an image of a cat, you have an image of a dog.
00:44:14.500 | And for whatever application that you're doing,
00:44:16.540 | say you're trying to figure out what pets are,
00:44:18.860 | you're saying that these two images are related.
00:44:20.300 | So image of a cat and dog are related,
00:44:22.300 | but now you have another third image of a banana
00:44:25.380 | because you don't like that word.
00:44:26.980 | So now you basically have this banana.
00:44:28.940 | - Thank you for speaking to the crowd.
00:44:30.660 | - And so you take both of these images
00:44:32.580 | and you take the image from the cat,
00:44:34.460 | the image from the dog,
00:44:35.300 | you get a feature from both of them.
00:44:36.780 | And now what you're training the network to do
00:44:38.180 | is basically pull both of these features together
00:44:42.100 | while pushing them away from the feature of a banana.
00:44:44.740 | So this is the contrastive part.
00:44:45.860 | So you're contrasting against the banana.
00:44:47.860 | So there's always this notion of a negative and a positive.
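A minimal sketch of that pull-together, push-apart objective (a classic margin-based formulation; many modern methods use a softmax/InfoNCE variant instead, and the 128-dimensional toy vectors here just stand in for encoder outputs):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negative, margin=0.5):
    """Pull anchor and positive together; push the negative at least
    `margin` further away than the positive."""
    anchor, positive, negative = (
        F.normalize(x, dim=-1) for x in (anchor, positive, negative)
    )
    d_pos = (anchor - positive).pow(2).sum(-1)   # cat vs. dog: related
    d_neg = (anchor - negative).pow(2).sum(-1)   # cat vs. banana: not related
    return F.relu(d_pos - d_neg + margin).mean()

# Toy feature vectors standing in for the encoder outputs of the three images.
cat, dog, banana = torch.randn(3, 128).unbind(0)
loss = contrastive_loss(cat, dog, banana)
```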
00:44:51.500 | Now, energy-based models are like one way
00:44:54.140 | that Yann sort of explains a lot of these methods.
00:44:57.460 | So Yann basically, I think a couple of years
00:45:00.700 | or more than that, like when I joined Facebook,
00:45:02.860 | Yann used to keep mentioning this word energy-based models.
00:45:05.140 | And of course I had no idea what he was talking about.
00:45:07.260 | So then one day I caught him
00:45:08.460 | in one of the conference rooms and I'm like,
00:45:10.020 | can you please tell me what this is?
00:45:11.260 | So then like very patiently he sat down
00:45:14.180 | with like a marker and a whiteboard.
00:45:15.980 | And his idea basically is that
00:45:18.340 | rather than talking about probability distributions,
00:45:20.300 | you can talk about energies of models.
00:45:21.980 | So models are trying to minimize certain energies
00:45:24.020 | in certain space,
00:45:25.020 | or they're trying to maximize a certain kind of energy.
00:45:28.220 | And the idea basically is that
00:45:29.780 | you can explain a lot of the contrastive models,
00:45:32.220 | GANs for example,
00:45:33.300 | which are like generative adversarial networks.
00:45:36.020 | A lot of these modern learning methods
00:45:37.940 | or VAEs which are variational autoencoders,
00:45:39.940 | you can really explain them very nicely
00:45:41.860 | in terms of an energy function
00:45:43.180 | that they're trying to minimize or maximize.
00:45:45.340 | And so by putting this common sort of language
00:45:48.380 | for all of these models,
00:45:49.740 | what looks very different in machine learning
00:45:51.820 | that VAEs are very different from what GANs are,
00:45:54.180 | are very, very different from what contrastive models are.
00:45:56.420 | You actually get a sense of like,
00:45:57.580 | oh, these are actually very, very related.
00:46:00.140 | It's just that the way or the mechanism
00:46:02.500 | in which they're sort of maximizing
00:46:04.220 | or minimizing this energy function is slightly different.
00:46:06.980 | - It's revealing the commonalities
00:46:08.900 | between all these approaches
00:46:10.380 | and putting a sexy word on top of it like energy.
00:46:12.860 | And so similarity,
00:46:14.340 | two things that are similar have low energy.
00:46:16.740 | Like the low energy signifying similarity.
00:46:20.340 | - Right, exactly.
00:46:21.180 | So basically the idea is that if you were to imagine
00:46:23.540 | like the embedding as a manifold, a 2D manifold,
00:46:26.460 | you would get a hill or like a high sort of peak
00:46:28.900 | in the energy manifold
00:46:30.580 | wherever two things are not related.
00:46:32.420 | And basically you would have like a dip
00:46:34.100 | where two things are related.
00:46:35.500 | So you'd get a dip in the manifold.
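In code, the energy view can be as simple as this sketch (the linear "encoder" and the margin are placeholders; the point is only that related pairs should sit in a low-energy dip and unrelated ones on a high-energy peak):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(128, 32)      # toy encoder standing in for a real network

def energy(x, y):
    """One common choice of energy: distance between the two embeddings."""
    fx = F.normalize(encoder(x), dim=-1)
    fy = F.normalize(encoder(y), dim=-1)
    return (fx - fy).pow(2).sum(-1)

x, x_related, x_unrelated = torch.randn(3, 4, 128).unbind(0)
margin = 1.0
# Contrastive training in energy terms: push the energy of the related pair
# down, and push the energy of the unrelated pair up to at least `margin`.
loss = (energy(x, x_related) + F.relu(margin - energy(x, x_unrelated))).mean()
loss.backward()
```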
00:46:37.060 | - And in the self-supervised context,
00:46:40.180 | how do you know two things are related
00:46:42.260 | and two things are not related?
00:46:44.100 | - Right, so this is where all the sort of ingenuity
00:46:46.420 | or tricks comes in, right?
00:46:47.860 | So for example, like you can take the fill in the blank
00:46:51.660 | problem or you can take in the context problem.
00:46:54.340 | And what you can say is two words
00:46:55.900 | that are in the same context are related.
00:46:57.740 | Two words that are in different contexts are not related.
00:47:00.500 | For images, basically two crops from the same image
00:47:02.980 | are related and whereas a third image is not related at all.
00:47:06.420 | Or for a video, it can be two frames
00:47:08.140 | from that video are related
00:47:09.140 | because they're likely to contain
00:47:10.780 | the same sort of concepts in them.
00:47:12.700 | Whereas a third frame from a different video is not related.
00:47:15.580 | So it basically is, it's a very general term.
00:47:18.300 | Contrastive learning is nothing really to do
00:47:19.820 | with self-supervised learning.
00:47:20.820 | It actually is very popular in, for example,
00:47:23.220 | like any kind of metric learning
00:47:25.180 | or any kind of embedding learning.
00:47:26.900 | So it's also used in supervised learning.
00:47:28.860 | It's all, and the thing is,
00:47:30.220 | because we are not really using labels
00:47:32.060 | to get these positive or negative pairs,
00:47:34.500 | it can basically also be used for self-supervised learning.
00:47:37.580 | - So you mentioned one of the ideas in the vision context
00:47:40.420 | that works is to have different crops.
00:47:45.220 | So you could think of that as a way
00:47:47.020 | to sort of manipulating the data
00:47:49.460 | to generate examples that are similar.
00:47:53.260 | Obviously, there's a bunch of other techniques.
00:47:55.780 | You mentioned lighting as a very, you know,
00:47:58.580 | in images, lighting is something that varies a lot
00:48:01.620 | and you can artificially change those kinds of things.
00:48:04.500 | There's the whole broad field of data augmentation,
00:48:07.700 | which manipulates images in order to increase arbitrarily
00:48:11.820 | the size of the dataset.
00:48:13.420 | First of all, what is data augmentation?
00:48:15.860 | And second of all, what's the role of data augmentation
00:48:18.140 | in self-supervised learning and contrastive learning?
00:48:22.020 | - So data augmentation is just a way, like you said,
00:48:24.800 | it's basically a way to augment the data.
00:48:26.700 | So you have, say, N samples,
00:48:28.700 | and what you do is you basically define
00:48:30.180 | some kind of transforms for the sample.
00:48:32.300 | So you take your, say, image,
00:48:33.700 | and then you define a transform
00:48:34.900 | where you can just increase, say, the colors,
00:48:37.340 | like the colors or the brightness of the image,
00:48:39.140 | or increase or decrease the contrast of the image,
00:48:41.340 | for example, or take different crops of it.
00:48:43.740 | So data augmentation is just a process
00:48:46.220 | to basically perturb the data or augment the data.
00:48:51.060 | And so it has played a fundamental role for computer vision,
00:48:54.300 | for self-supervised learning especially.
00:48:56.620 | The way most of the current methods work,
00:48:59.180 | contrastive or otherwise, is by taking an image,
00:49:02.740 | in the case of images, is by taking an image
00:49:05.340 | and then computing basically two perturbations of it.
00:49:08.580 | So these can be two different crops of the image
00:49:11.500 | with different types of lighting
00:49:12.940 | or different contrast or different colors.
00:49:14.980 | So you jitter the colors a little bit and so on.
00:49:17.860 | And now the idea is basically,
00:49:19.900 | because it's the same object
00:49:21.740 | or because it's like related concepts
00:49:23.460 | in both of these perturbations,
00:49:25.220 | you want the features
00:49:26.300 | from both of these perturbations to be similar.
00:49:28.940 | So now you can use a variety of different ways
00:49:31.300 | to enforce this constraint,
00:49:32.620 | like these features being similar.
00:49:34.220 | You can do this by contrastive learning.
00:49:36.020 | So basically, both of these things are positives,
00:49:38.460 | a third sort of image is negative.
00:49:40.420 | You can do this basically by like clustering.
00:49:43.460 | For example, you can say that both of these images should,
00:49:46.980 | the features from both of these images
00:49:48.140 | should belong in the same cluster because they're related.
00:49:50.580 | Whereas image, like another image
00:49:52.300 | should belong to a different cluster.
00:49:53.860 | So there's a variety of different ways
00:49:55.140 | to basically enforce this particular constraint.
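A sketch of that two-perturbation recipe using torchvision-style transforms (the particular transforms and parameters are illustrative defaults, not the exact settings of any specific paper, and "some_image.jpg" is a hypothetical file):

```python
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),   # random crop
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),            # jitter the colors
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),               # blur
    transforms.ToTensor(),
])

image = Image.open("some_image.jpg").convert("RGB")
view1 = augment(image)   # perturbation 1
view2 = augment(image)   # perturbation 2
# The self-supervised objective (contrastive, clustering, etc.) then asks the
# encoder to produce similar features for view1 and view2.
```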
00:49:57.580 | - By the way, when you say features,
00:49:59.100 | it means there's a very large neural network
00:50:01.660 | that extracting patterns from the image
00:50:03.620 | and the kind of patterns it extracts
00:50:05.140 | should be either identical or very similar.
00:50:07.980 | - Right.
00:50:08.820 | - That's what that means.
00:50:09.660 | - So the neural network basically takes in the image
00:50:11.860 | and then outputs a set of like,
00:50:14.140 | basically a vector of like numbers,
00:50:16.580 | and that's the feature.
00:50:17.700 | And you want this feature for both of these
00:50:20.020 | like different crops that you computed to be similar.
00:50:22.100 | So you want this vector to be identical
00:50:24.500 | in its like entries, for example.
00:50:26.100 | - Be like literally close
00:50:28.140 | in this multidimensional space to each other.
00:50:30.740 | - Right.
00:50:31.620 | - And like you said,
00:50:32.580 | close can mean part of the same cluster
00:50:34.740 | or something like that in this large space.
00:50:37.420 | First of all, that,
00:50:38.940 | I wonder if there is connection
00:50:40.660 | to the way humans learn to this,
00:50:43.780 | almost like maybe subconsciously
00:50:48.060 | in order to understand a thing,
00:50:50.140 | you kind of have to see it from two, three, multiple angles.
00:50:54.660 | I wonder, I have a lot of friends
00:50:57.340 | who are neuroscientists maybe,
00:50:58.460 | or cognitive scientists.
00:51:00.220 | I wonder if that's in there somewhere,
00:51:03.200 | like in order for us to place a concept in its proper place,
00:51:08.580 | we have to basically crop it in all kinds of ways,
00:51:12.460 | do basic data augmentation on it
00:51:14.420 | in whatever very clever ways that the brain likes to do.
00:51:17.660 | - Right.
00:51:19.020 | - Like spinning around in our mind somehow
00:51:21.140 | that is very effective.
00:51:23.100 | - So I think for some of them,
00:51:24.140 | we like need to do it.
00:51:25.060 | So like babies, for example, pick up objects,
00:51:26.980 | like move them and put them close to their eye and whatnot.
00:51:30.100 | But for certain other things,
00:51:31.180 | actually we are good at imagining it as well, right?
00:51:33.780 | So if you, I have never seen, for example,
00:51:35.940 | an elephant from the top.
00:51:36.940 | I've never basically looked at it from like top down.
00:51:39.540 | But if you showed me a picture of it,
00:51:40.700 | I could very well tell you that that's an elephant.
00:51:43.780 | So I think some of it, we're just like,
00:51:45.340 | we naturally build it or transfer it from other objects
00:51:47.820 | that we've seen to imagine what it's going to look like.
00:51:50.940 | - Has anyone done that with augmentation?
00:51:53.260 | Like imagine all the possible things
00:51:56.900 | that are occluded or not there,
00:51:59.860 | but not just like normal things, like wild things,
00:52:03.340 | but they're nevertheless physically consistent.
00:52:06.880 | - So, I mean, people do kind of like occlusion
00:52:10.440 | based augmentation as well.
00:52:11.720 | So you place in like a random like box,
00:52:14.080 | gray box to sort of mask out a certain part of the image.
00:52:17.360 | And the thing is basically you're kind of occluding it.
00:52:19.880 | For example, you place it say on half of a person's face.
00:52:23.480 | So basically saying that, you know,
00:52:24.840 | something below their nose is occluded,
00:52:26.600 | because it's grayed out.
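That kind of occlusion augmentation is only a few lines (a sketch in the spirit of cutout / random-erasing style augmentations; the box size and gray fill value are arbitrary):

```python
import torch

def random_gray_box(image, box_size=8, fill=0.5):
    """Gray out a random square patch of a (C, H, W) image tensor."""
    _, h, w = image.shape
    y = torch.randint(0, h - box_size + 1, (1,)).item()
    x = torch.randint(0, w - box_size + 1, (1,)).item()
    out = image.clone()
    out[:, y:y + box_size, x:x + box_size] = fill
    return out

occluded = random_gray_box(torch.rand(3, 32, 32))
```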
00:52:28.160 | - So, no, I meant like you have like, what is it?
00:52:31.560 | A table and you can't see behind the table.
00:52:33.760 | And you imagine there's a bunch of elves
00:52:37.000 | with bananas behind the table.
00:52:38.800 | Like, I wonder if there's useful to have
00:52:40.880 | a wild imagination for the network,
00:52:44.160 | because that's possible.
00:52:45.280 | Well, maybe not elves, but like puppies and kittens
00:52:47.880 | or something like that.
00:52:48.960 | Just have a wild imagination
00:52:51.240 | and like constantly be generating that wild imagination.
00:52:55.040 | 'Cause in terms of data augmentation
00:52:57.520 | that's currently applied, it's super ultra, very boring.
00:53:01.160 | It's very basic data augmentation.
00:53:02.880 | I wonder if there's a benefit to being wildly imaginable
00:53:07.000 | while trying to be consistent with physical reality.
00:53:11.840 | - I think it's a kind of a chicken and egg problem, right?
00:53:14.160 | Because to have like amazing data augmentation,
00:53:16.360 | you need to understand what the scene is.
00:53:18.480 | And we're trying to do data augmentation
00:53:20.600 | to learn what a scene is anyway.
00:53:22.040 | So it's basically just keeps going on.
00:53:23.720 | - Before you understand it, just put elves with bananas
00:53:26.000 | until you know it not to be true.
00:53:28.080 | Just like children have a wild imagination
00:53:31.640 | until the adults ruin it all.
00:53:33.920 | Okay, so what are the different kinds of data augmentation
00:53:36.920 | that you've seen to be effective in visual intelligence?
00:53:40.760 | - For like vision,
00:53:42.000 | it's a lot of these image filtering operations.
00:53:44.120 | So like blurring the image,
00:53:45.760 | all the kinds of Instagram filters that you can think of.
00:53:49.400 | So like arbitrarily like make the red super red,
00:53:52.480 | make the green super greens, like saturate the image.
00:53:55.800 | - Rotation cropping.
00:53:56.960 | - Rotation cropping, exactly.
00:53:58.440 | All of these kinds of things.
00:53:59.560 | Like I said, lighting is a really interesting one to me.
00:54:02.640 | That feels like really complicated to do.
00:54:04.760 | - So I mean, the augmentations that we work on
00:54:07.320 | aren't like that involved.
00:54:08.920 | They're not going to be like physically realistic versions
00:54:10.880 | of lighting.
00:54:11.720 | It's not that you're assuming that there's a light source up
00:54:13.720 | and then you're moving it to the right
00:54:15.120 | and then what does the thing look like?
00:54:17.040 | It's really more about like brightness of the image,
00:54:19.200 | overall brightness of the image
00:54:20.440 | or overall contrast of the image and so on.
00:54:22.560 | - But this is a really important point to me.
00:54:25.120 | I always thought that data augmentation
00:54:28.720 | holds an important key to big improvements
00:54:32.720 | in machine learning.
00:54:33.840 | And it seems that it is an important aspect
00:54:36.640 | of self-supervised learning.
00:54:39.080 | So I wonder if there's big improvements to be achieved
00:54:42.560 | on much more intelligent kinds of data augmentation.
00:54:46.680 | For example, currently,
00:54:48.320 | maybe you can correct me if I'm wrong,
00:54:50.160 | data augmentation is not parametrized.
00:54:52.760 | - Yeah.
00:54:53.600 | - You're not learning.
00:54:54.440 | - You're not learning.
00:54:55.260 | To me, it seems like data augmentation potentially
00:54:59.720 | should involve more learning
00:55:01.960 | than the learning process itself.
00:55:04.120 | - Right.
00:55:05.320 | - You're almost like thinking of like generative kind of,
00:55:08.760 | it's the elves with bananas.
00:55:10.200 | You're trying to,
00:55:11.040 | it's like very active imagination of messing with the world
00:55:14.840 | and teaching that mechanism for messing with the world
00:55:17.600 | to be realistic.
00:55:19.080 | - Right.
00:55:20.460 | - Because that feels like,
00:55:22.600 | I mean, it's imagination.
00:55:24.680 | Just as you said,
00:55:25.640 | it feels like us humans are able to,
00:55:28.160 | maybe sometimes subconsciously,
00:55:30.720 | imagine before we see the thing,
00:55:33.000 | imagine what we're expecting to see.
00:55:35.500 | Like maybe several options.
00:55:37.240 | And especially, we probably forgot,
00:55:38.800 | but when we were younger,
00:55:40.520 | probably the possibilities were wilder, more numerous.
00:55:44.200 | And then as we get older,
00:55:45.160 | we become to understand the world
00:55:47.400 | and the possibilities of what we might see
00:55:51.040 | becomes less and less and less.
00:55:53.120 | So I wonder if you think there's a lot of breakthroughs
00:55:55.600 | yet to be had in data augmentation.
00:55:57.160 | And maybe also, can you just comment on the stuff we have?
00:55:59.780 | Is that a big part of self-supervised learning?
00:56:02.120 | - Yes.
00:56:02.960 | So data augmentation is like key to self-supervised learning.
00:56:05.520 | That has like the kind of augmentation that we're using.
00:56:08.320 | And basically the fact that we're trying to learn
00:56:11.040 | these neural networks that are predicting these features
00:56:13.920 | from images that are robust under data augmentation
00:56:17.080 | has been the key for visual self-supervised learning.
00:56:19.560 | And they play a fairly fundamental role to it.
00:56:22.400 | Now, the irony of all of this is that
00:56:24.600 | for like deep learning purists will say
00:56:26.720 | the entire point of deep learning
00:56:28.400 | is that you feed in the pixels to the neural network
00:56:31.160 | and it should figure out the patterns on its own.
00:56:33.120 | So if it really wants to look at edges,
00:56:34.480 | it should look at edges.
00:56:35.640 | You shouldn't really go and handcraft these features.
00:56:38.600 | You shouldn't go tell it that look at edges.
00:56:41.160 | So data augmentation should basically
00:56:43.120 | be in the same category.
00:56:44.400 | Why should we tell the network
00:56:46.040 | or tell this entire learning paradigm
00:56:48.200 | what kinds of data augmentation that we're looking for?
00:56:50.840 | We are encoding a very sort of human specific bias there
00:56:55.200 | that we know things are like,
00:56:57.560 | if you change the contrast of the image,
00:56:59.200 | it should still be an apple
00:57:00.240 | or it should still be apple, not banana.
00:57:02.240 | - Thank you.
00:57:03.520 | - Basically if we change like colors,
00:57:05.880 | it should still be the same kind of concept.
00:57:08.040 | Of course, this is not one,
00:57:09.880 | this doesn't feel like super satisfactory
00:57:12.480 | because a lot of our human knowledge
00:57:14.560 | or our human supervision
00:57:15.760 | is actually going into the data augmentation.
00:57:17.600 | So although we are calling it self-supervised learning,
00:57:19.680 | a lot of the human knowledge is actually being encoded
00:57:21.880 | in the data augmentation process.
00:57:23.520 | So it's really like we've kind of sneaked in
00:57:25.480 | the supervision at the input
00:57:27.120 | and we're like really designing this nice list
00:57:29.440 | of data augmentations that are working very well.
00:57:31.640 | - Of course, the idea is that it's much easier
00:57:33.720 | to design a list of data augmentations than it is to label the data.
00:57:36.600 | So humans are nevertheless doing less and less work
00:57:39.640 | and maybe leveraging their creativity more and more.
00:57:42.600 | And when we say data augmentation is not parameterized,
00:57:45.080 | it means it's not part of the learning process.
00:57:48.200 | Do you think it's possible to integrate
00:57:50.560 | some of the data augmentation into the learning process?
00:57:53.280 | - I think so.
00:57:54.120 | I think so.
00:57:54.960 | And in fact, it will be really beneficial for us
00:57:57.440 | because a lot of these data augmentations
00:57:59.720 | that we use in vision are very extreme.
00:58:01.840 | For example, like when you have certain concepts,
00:58:05.400 | again, a banana, you take the banana
00:58:08.160 | and then basically you change the color of the banana.
00:58:10.600 | So you make it a purple banana.
00:58:12.440 | Now this data augmentation process
00:58:14.160 | is actually independent of the,
00:58:15.920 | like it has no notion of what is present in the image.
00:58:18.920 | So it can change this color arbitrarily.
00:58:20.560 | It can make it a red banana as well.
00:58:22.600 | And now what we're doing is we're telling the neural network
00:58:24.760 | that this red banana and so a crop of this image
00:58:28.240 | which has the red banana and a crop of this image
00:58:30.240 | where I change the color to a purple banana
00:58:31.840 | should be, the features should be the same.
00:58:34.080 | Now bananas aren't red or purple mostly.
00:58:36.680 | So really the data augmentation process
00:58:38.560 | should take into account what is present in the image
00:58:41.120 | and what are the kinds of physical realities
00:58:43.080 | that are possible.
00:58:43.920 | It shouldn't be completely independent of the image.
00:58:45.840 | - So you might get big gains if you,
00:58:48.840 | instead of being drastic, do subtle augmentation
00:58:51.560 | but realistic augmentation.
00:58:53.280 | - Right, realistic.
00:58:54.120 | I'm not sure if it's subtle, but like realistic for sure.
00:58:56.280 | - If it's realistic, then even subtle augmentation
00:58:59.600 | will give you big benefits.
00:59:00.680 | - Exactly, yeah.
00:59:01.840 | And it'll be like for particular domains,
00:59:05.040 | you might actually see like,
00:59:06.440 | if for example, now we're doing medical imaging,
00:59:08.960 | there are going to be certain kinds
00:59:10.120 | of like geometric augmentations
00:59:11.440 | which are not really going to be very valid
00:59:13.480 | for the human body.
00:59:15.080 | So if you were to like actually loop in data augmentation
00:59:18.280 | into the learning process,
00:59:19.480 | it will actually be much more useful.
00:59:21.320 | Now, this actually does take us to maybe
00:59:23.680 | a semi-supervised kind of a setting
00:59:25.120 | because you do want to understand
00:59:27.480 | what is it that you're trying to solve.
00:59:29.080 | So currently self-supervised learning
00:59:30.880 | kind of operates in the wild, right?
00:59:32.720 | So you do the self-supervised learning,
00:59:34.960 | and the purists and all of us basically say that,
00:59:37.560 | okay, this should learn useful representations
00:59:39.440 | and they should be useful for any kind of end task,
00:59:42.320 | no matter it's like banana recognition
00:59:44.280 | or like autonomous driving.
00:59:46.240 | Now, it's a tall order.
00:59:47.760 | Maybe the first baby step for us should be that,
00:59:50.480 | okay, if you're trying to loop in this data augmentation
00:59:52.640 | into the learning process,
00:59:53.920 | then we at least need to have some sense
00:59:56.000 | of what we're trying to do.
00:59:56.840 | Are we trying to distinguish
00:59:57.760 | between different types of bananas?
00:59:59.560 | Or are we trying to distinguish between banana and apple?
01:00:02.040 | Or are we trying to do all of these things at once?
01:00:04.400 | And so some notion of like what happens at the end
01:00:07.920 | might actually help us do much better at this side.
01:00:10.840 | - Let me ask you a ridiculous question.
01:00:13.760 | If I were to give you like a black box,
01:00:16.240 | like a choice to have an arbitrary large data set
01:00:19.480 | of real natural data
01:00:21.280 | versus really good data augmentation algorithms,
01:00:26.560 | which would you like to train in a self-supervised way on?
01:00:31.240 | So natural data from the internet are arbitrary large,
01:00:34.960 | so unlimited data,
01:00:37.280 | or it's like more controlled,
01:00:40.200 | good data augmentation on the finite data set.
01:00:43.600 | - The thing is like,
01:00:44.440 | because our learning algorithms for vision right now
01:00:47.240 | really rely on data augmentation,
01:00:49.360 | even if you were to give me like an infinite source
01:00:51.440 | of like image data,
01:00:52.880 | I still need a good data augmentation algorithm.
01:00:54.560 | - You need something that tells you
01:00:56.040 | that two things are similar.
01:00:57.400 | - Right, and so something,
01:00:59.000 | because you've given me an arbitrarily large data set,
01:01:01.600 | I still need to use data augmentation
01:01:03.760 | to take that image,
01:01:04.880 | construct like these two perturbations of it,
01:01:06.880 | and then learn from it.
01:01:08.240 | So the thing is our learning paradigm
01:01:09.960 | is very primitive right now.
01:01:11.880 | Even if you were to give me lots of images,
01:01:13.800 | it's still not really useful.
01:01:15.200 | A good data augmentation algorithm
01:01:16.520 | is actually going to be more useful.
01:01:18.040 | So you can like reduce down the amount of data
01:01:21.160 | that you'll give me by like 10 times.
01:01:22.920 | But if you were to give me
01:01:23.760 | a good data augmentation algorithm,
01:01:25.040 | that will probably do better
01:01:26.440 | than giving me like 10 times the size of that data,
01:01:29.040 | but me having to rely on like
01:01:30.960 | a very primitive data augmentation algorithm.
01:01:32.640 | - Like through tagging and all those kinds of things,
01:01:35.040 | is there a way to discover things
01:01:37.280 | that are semantically similar on the internet?
01:01:39.640 | Obviously there is,
01:01:40.480 | but there might be extremely noisy.
01:01:42.560 | And the difference might be
01:01:44.960 | farther away than you would be comfortable with.
01:01:47.880 | - So I mean, yes, tagging will help you a lot.
01:01:49.760 | It'll actually go a very long way
01:01:51.520 | in figuring out what images are related or not.
01:01:54.000 | And then so,
01:01:55.720 | but then the purists would argue
01:01:57.520 | that when you're using human tags,
01:01:58.960 | because these tags are like supervision,
01:02:01.240 | is it really self supervised learning now?
01:02:04.000 | Because you're using human tags
01:02:05.400 | to figure out which images are like similar.
01:02:08.000 | - Hashtag no filter means a lot of things.
01:02:10.440 | - Yes.
01:02:11.280 | I mean, there are certain tags
01:02:12.400 | which are going to be applicable pretty much to anything.
01:02:15.320 | So they're pretty useless for learning.
01:02:18.280 | But I mean, certain tags are actually like
01:02:20.840 | the Eiffel tower, for example,
01:02:22.280 | or the Taj Mahal, for example.
01:02:23.840 | These tags are like very indicative of what's going on.
01:02:26.480 | And they are, I mean, they are human supervision.
01:02:29.480 | - Yeah, this is one of the tasks of discovering
01:02:31.920 | from human generated data,
01:02:33.680 | strong signals that could be leveraged
01:02:37.320 | for self supervision.
01:02:39.560 | Like humans are doing so much work already.
01:02:42.280 | Like many years ago,
01:02:43.520 | there was something that was called,
01:02:45.160 | I guess, human computation back in the day.
01:02:48.040 | Humans are doing so much work.
01:02:50.280 | It'd be exciting to discover ways to leverage
01:02:53.520 | the work they're doing to teach machines
01:02:55.900 | without any extra effort from them.
01:02:58.000 | An example could be, like we said, driving,
01:03:00.200 | humans driving and machines can learn from the driving.
01:03:03.040 | I always hoped that there could be some supervision signal
01:03:06.800 | discovered in video games,
01:03:08.200 | because there's so many people that play video games
01:03:10.760 | that it feels like so much effort
01:03:13.920 | is put into video games, into playing video games.
01:03:17.720 | And you can design video games somewhat cheaply
01:03:21.840 | to include whatever signals you want.
01:03:24.640 | It feels like that could be leveraged somehow.
01:03:27.560 | - So people are using that.
01:03:28.720 | Like there are actually folks right here in UT Austin,
01:03:30.880 | like Philipp Krähenbühl is a professor at UT Austin.
01:03:33.680 | He's been like working on video games
01:03:36.160 | as a source of supervision.
01:03:38.000 | I mean, it's really fun, like as a PhD student
01:03:40.080 | getting to basically play video games all day.
01:03:42.200 | - Yeah, but so I do hope that kind of thing scales
01:03:44.960 | and like ultimately boils down to discovering
01:03:48.120 | some undeniably very good signal.
01:03:51.620 | It's like masking in NLP.
01:03:54.080 | But that said, there's non-contrastive methods.
01:03:57.640 | What do non-contrastive energy-based
01:04:00.880 | self-supervised learning methods look like
01:04:03.560 | and why are they promising?
01:04:05.680 | - So like I said about contrastive learning,
01:04:07.840 | you have this notion of a positive and a negative.
01:04:10.760 | Now, the thing is this entire learning paradigm
01:04:13.680 | really requires access to a lot of negatives
01:04:17.200 | to learn a good sort of feature space.
01:04:19.080 | The idea is if I tell you, okay,
01:04:21.700 | so a cat and a dog are similar
01:04:23.720 | and they're very different from a banana.
01:04:25.720 | The thing is this is a fairly simple analogy, right?
01:04:28.040 | Because bananas look visually very different
01:04:30.880 | from what cats and dogs do.
01:04:32.480 | So very quickly, if this is the only source of supervision
01:04:35.080 | that I'm giving you,
01:04:36.680 | your learning is not going to be like after a point
01:04:38.720 | then neural network is really not going to learn a lot
01:04:41.720 | because the negative that you're getting
01:04:43.040 | is going to be so random.
01:04:43.960 | So it can be, oh, a cat and a dog are very similar,
01:04:46.720 | but they're very different from a Volkswagen Beetle.
01:04:49.960 | Now, like this car looks very different
01:04:52.000 | from these animals again.
01:04:53.020 | So the thing is in contrastive learning,
01:04:54.960 | the quality of the negative sample really matters a lot.
01:04:58.200 | And so what has happened is basically
01:05:00.360 | that typically these methods that are contrastive
01:05:02.920 | really require access to lots of negatives,
01:05:04.960 | which becomes harder and harder to sort of scale
01:05:06.960 | when designing a learning algorithm.
01:05:09.080 | So that's been one of the reasons
01:05:10.960 | why non-contrastive methods have become popular
01:05:13.760 | and why people think that they're going to be more useful.
01:05:16.440 | So a non-contrastive method, for example,
01:05:18.480 | like clustering is one non-contrastive method.
01:05:20.960 | The idea basically being that you have two of these samples.
01:05:24.720 | So the cat and dog are two crops of this image.
01:05:27.720 | They belong to the same cluster.
01:05:29.320 | And so essentially you're basically doing clustering online
01:05:33.360 | when you're learning this network,
01:05:35.120 | and which is very different from having access
01:05:36.760 | to a lot of negatives explicitly.
01:05:38.960 | The other way which has become really popular
01:05:40.880 | is something called self-distillation.
01:05:43.180 | So the idea basically is that you have a teacher network
01:05:45.720 | and a student network,
01:05:47.560 | and the teacher network produces a feature.
01:05:49.560 | So it takes in the image,
01:05:51.120 | and basically the neural network figures out the patterns,
01:05:53.720 | gets the feature out.
01:05:55.280 | And there's another neural network,
01:05:56.840 | which is the student neural network,
01:05:58.000 | and that also produces a feature.
01:05:59.960 | And now all you're doing is basically saying
01:06:01.680 | that the features produced by the teacher network
01:06:04.000 | and the student network should be very similar.
01:06:06.140 | That's it.
01:06:06.980 | There is no notion of a negative anymore.
01:06:09.240 | And that's it.
01:06:10.080 | So it's all about similarity maximization
01:06:11.840 | between these two features.
01:06:13.720 | And so all I need to now do is figure out
01:06:16.360 | how to have these two sorts of parallel networks,
01:06:18.720 | a student network and a teacher network.
01:06:20.640 | And basically researchers have figured out
01:06:23.020 | very cheap methods to do this.
01:06:24.280 | So you can actually have for free, really,
01:06:26.800 | two types of neural networks.
01:06:29.040 | They're kind of related,
01:06:30.160 | but they're different enough that you can actually
01:06:32.040 | basically have a learning problem set up.
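A stripped-down sketch of that teacher/student setup (one common "cheap" trick, used in methods like BYOL and DINO, is to make the teacher an exponential moving average of the student; the tiny MLPs and plain distance loss here are illustrative, and real methods add extra pieces to fully prevent collapse):

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
teacher = copy.deepcopy(student)        # same architecture, no gradients
for p in teacher.parameters():
    p.requires_grad = False

def update_teacher(momentum=0.99):
    """Keep the teacher a slow-moving average of the student."""
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(momentum).add_(ps, alpha=1 - momentum)

view1, view2 = torch.randn(2, 16, 128).unbind(0)     # two augmented views
s = F.normalize(student(view1), dim=-1)
t = F.normalize(teacher(view2), dim=-1).detach()
loss = (s - t).pow(2).sum(-1).mean()   # similarity maximization, no negatives
loss.backward()
update_teacher()
```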
01:06:34.000 | - So you can ensure that they always remain different enough
01:06:38.200 | so that the thing doesn't collapse into something boring.
01:06:41.040 | - Exactly.
01:06:41.880 | So the main sort of enemy of self-supervised learning,
01:06:44.360 | any kind of similarity maximization technique is collapse.
01:06:47.560 | It's a collapse means that you learn
01:06:49.820 | the same feature representation
01:06:51.560 | for all the images in the world,
01:06:53.160 | which is completely useless.
01:06:54.640 | - Everything is a banana.
01:06:55.640 | - Everything is a banana.
01:06:56.560 | Everything is a cat.
01:06:57.400 | Everything is a car.
01:06:58.240 | - Yeah.
01:06:59.200 | And so all we need to do is basically come up
01:07:01.680 | with ways to prevent collapse.
01:07:03.320 | Contrastive learning is one way of doing it.
01:07:05.360 | And then for example, like clustering or self-distillation
01:07:07.840 | or other ways of doing it.
01:07:09.240 | We also had a recent paper where we used like decorrelation
01:07:13.120 | between like two sets of features to prevent collapse.
01:07:16.780 | So that's inspired a little bit by like Horace Barlow's
01:07:18.920 | neuroscience principles.
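A sketch of that decorrelation idea (in the spirit of the Barlow Twins objective: standardize the features of the two views, compute their cross-correlation matrix, and push it toward the identity; the weighting term is an arbitrary choice here):

```python
import torch

def decorrelation_loss(z1, z2, lam=5e-3):
    """Diagonal of the cross-correlation -> 1 (the two views agree per
    dimension), off-diagonal -> 0 (dimensions are decorrelated).
    No explicit negatives are needed to avoid collapse."""
    n, _ = z1.shape
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / n                          # d x d cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lam * off_diag

loss = decorrelation_loss(torch.randn(64, 128), torch.randn(64, 128))
```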
01:07:20.720 | - By the way, I should comment that whoever counts
01:07:23.560 | the number of times the word banana, apple, cat and dog
01:07:27.800 | we're using this conversation wins the internet.
01:07:30.160 | I wish you luck.
01:07:31.140 | What is SwAV and the main improvement proposed
01:07:36.800 | in the paper "Unsupervised Learning of Visual Features
01:07:40.360 | by Contrasting Cluster Assignments"?
01:07:43.000 | - SwAV basically is a clustering based technique,
01:07:46.400 | which is for again, the same thing
01:07:48.400 | for self-supervised learning in vision,
01:07:50.760 | where we have two crops.
01:07:52.440 | And the idea basically is that you want the features
01:07:55.280 | from these two crops of an image
01:07:57.000 | to lie in the same cluster.
01:07:58.920 | And basically crops that are coming from different images
01:08:02.540 | to be in different clusters.
01:08:03.960 | Now, typically in a sort of,
01:08:05.900 | if you were to do this clustering,
01:08:07.140 | you would perform clustering offline.
01:08:09.520 | What that means is you would,
01:08:11.040 | if you have a data set of N examples,
01:08:13.160 | you would run over all of these N examples,
01:08:15.360 | get features for them, perform clustering.
01:08:17.520 | So basically get some clusters
01:08:19.480 | and then repeat the process again.
01:08:22.000 | So this is offline basically because I need to do one pass
01:08:24.640 | through the data to compute its clusters.
01:08:27.240 | SwAV is basically just a simple way of doing this online.
01:08:30.200 | So as you're going through the data,
01:08:31.820 | you're actually computing these clusters online.
01:08:34.800 | And so of course there is like a lot of tricks involved
01:08:37.480 | in how to do this in a robust manner without collapsing,
01:08:40.140 | but this is the sort of key idea to it.
01:08:42.440 | - Is there a nice way to say what is the key methodology
01:08:45.480 | of the clustering that enables that?
01:08:47.680 | - Right, so the idea basically is that
01:08:51.040 | when you have N samples,
01:08:52.720 | we assume that we have access to,
01:08:54.920 | like there are always K clusters in a data set.
01:08:57.040 | K is a fixed number.
01:08:57.920 | So for example, K is 3000.
01:09:00.160 | And so if you have any,
01:09:02.200 | when you look at any sort of small number of examples,
01:09:04.840 | all of them must belong to one of these K clusters.
01:09:08.000 | And we impose this equipartition constraint.
01:09:10.360 | What this means is that basically
01:09:13.640 | your entire set of N samples
01:09:16.880 | should be equally partitioned into K clusters.
01:09:19.480 | So all your K clusters are basically equal,
01:09:21.800 | they have equal contribution to these N samples.
01:09:24.400 | And this ensures that we never collapse.
01:09:26.520 | So collapse can be viewed as a way
01:09:28.320 | in which all samples belong to one cluster.
01:09:30.680 | So all this, if all features become the same,
01:09:33.180 | then you have basically just one mega cluster.
01:09:35.160 | You don't even have like 10 clusters or 3000 clusters.
01:09:38.160 | So SwAV basically ensures that at each point,
01:09:41.000 | all these 3000 clusters are being used
01:09:43.000 | in the clustering process.
01:09:45.080 | And that's it.
01:09:46.280 | Basically just figure out how to do this online.
01:09:48.520 | And again, basically just make sure
01:09:51.000 | that two crops from the same image
01:09:52.600 | belong to the same cluster and others don't.
01:09:55.760 | - And the fact they have a fixed K makes things simpler.
01:09:58.880 | - Fixed K makes things simpler.
01:10:00.400 | Our clustering is not like really hard clustering,
01:10:02.600 | it's soft clustering.
01:10:03.760 | So basically you can be 0.2 to cluster number one
01:10:06.920 | and 0.8 to cluster number two.
01:10:08.480 | So it's not really hard.
01:10:09.920 | So essentially, even though we have like 3000 clusters,
01:10:12.760 | we can actually represent a lot of clusters.
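A much-simplified sketch of that online clustering idea (the prototype count, batch size, temperature, and the small Sinkhorn routine below are illustrative; the actual SwAV implementation has more moving parts such as multi-crop and a feature queue):

```python
import torch
import torch.nn.functional as F

K, d, n = 3000, 128, 64                       # clusters, feature dim, batch size
prototypes = F.normalize(torch.randn(d, K), dim=0)

def sinkhorn(scores, n_iters=3, eps=0.05):
    """Soft-assign the batch to K clusters under an (approximate)
    equipartition constraint, so no single cluster swallows everything."""
    q = torch.exp(scores / eps).T             # K x n
    q /= q.sum()
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True)       # normalize each cluster row
        q /= q.shape[0]
        q /= q.sum(dim=0, keepdim=True)       # normalize each sample column
        q /= q.shape[1]
    return (q * q.shape[1]).T                 # n x K, each row sums to 1

z1 = F.normalize(torch.randn(n, d), dim=1)    # features of crop 1
z2 = F.normalize(torch.randn(n, d), dim=1)    # features of crop 2
q1, q2 = sinkhorn(z1 @ prototypes), sinkhorn(z2 @ prototypes)
p1 = F.log_softmax(z1 @ prototypes / 0.1, dim=1)
p2 = F.log_softmax(z2 @ prototypes / 0.1, dim=1)
# "Swapped" prediction: crop 1 predicts crop 2's cluster assignment and
# vice versa, so both crops end up in the same (soft) clusters.
loss = -(q2 * p1).sum(1).mean() - (q1 * p2).sum(1).mean()
```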
01:10:15.160 | - What is SEER?
01:10:16.960 | S-E-E-R.
01:10:19.240 | And what are the key results and insights in the paper,
01:10:23.120 | self-supervised pre-training of visual features in the wild?
01:10:27.400 | What is this big, beautiful SEER system?
01:10:30.720 | - SEER, so I'll first go to SwAV
01:10:32.920 | because SwAV is actually like
01:10:34.240 | one of the key components for SEER.
01:10:35.800 | So SwAV was, when we used SwAV,
01:10:37.800 | it was demonstrated on ImageNet.
01:10:39.800 | So typically like self-supervised methods,
01:10:42.920 | the way we sort of operate is like in the research community,
01:10:46.200 | we kind of cheat.
01:10:47.200 | So we take ImageNet,
01:10:48.560 | which of course I talked about as having lots of labels.
01:10:51.280 | And then we throw away the labels,
01:10:52.800 | like throw away all the hard work
01:10:54.240 | that went behind basically the labeling process.
01:10:56.800 | And we pretend that it is like unsupervised.
01:11:00.240 | But the problem here is that we have,
01:11:02.440 | when we collected these images,
01:11:05.160 | the ImageNet dataset
01:11:06.720 | has a particular distribution of concepts, right?
01:11:09.960 | So these images are very curated.
01:11:11.760 | And what that means is these images,
01:11:13.680 | of course, belong to a certain set of noun concepts.
01:11:17.680 | And also ImageNet has this bias
01:11:19.320 | that all images contain an object,
01:11:21.240 | which is like very big,
01:11:22.480 | and it's typically in the center.
01:11:24.080 | So when you're talking about a dog,
01:11:25.120 | it's a well-framed dog,
01:11:26.160 | it's towards the center of the image.
01:11:28.360 | So a lot of the data augmentation,
01:11:29.760 | a lot of the sort of hidden assumptions
01:11:31.520 | in self-supervised learning
01:11:33.440 | actually really exploit this bias of ImageNet.
01:11:37.400 | And so, I mean, a lot of my work,
01:11:39.720 | a lot of work from other people
01:11:41.000 | always uses ImageNet sort of as the benchmark
01:11:43.160 | to show the success of self-supervised learning.
01:11:45.480 | - So you're implying that there's
01:11:46.640 | particular limitations to this kind of dataset?
01:11:49.240 | - Yes, I mean, it's basically because
01:11:51.040 | our data augmentations that we designed,
01:11:53.200 | like all data augmentations that we designed
01:11:56.000 | for self-supervised learning and vision
01:11:57.520 | are kind of overfed to ImageNet.
01:11:59.400 | - But you're saying a little bit hard-coded,
01:12:02.440 | like the cropping.
01:12:03.840 | - Exactly, the cropping parameters,
01:12:05.480 | the kind of lighting that we're using,
01:12:07.320 | the kind of blurring that we're using.
01:12:08.800 | - Yeah, but you would, for a more in the wild dataset,
01:12:12.000 | you would need to be cleverer and more careful
01:12:16.240 | in setting the range of parameters
01:12:17.520 | and those kinds of things.
01:12:18.960 | - So for SEER, our main goal was twofold.
01:12:21.400 | One, basically to move away from ImageNet for training.
01:12:24.680 | So the images that we used were like uncurated images.
01:12:27.720 | Now there's a lot of debate
01:12:28.640 | whether they're actually curated or not,
01:12:30.080 | but I'll talk about that later.
01:12:32.360 | But the idea was basically
01:12:33.880 | these are going to be random internet images
01:12:36.400 | that we are not going to filter out
01:12:37.920 | based on like particular categories.
01:12:40.080 | So we did not say that, oh, images that belong to dogs
01:12:42.880 | and cats should be the only images
01:12:44.280 | that come in this dataset, banana.
01:12:47.000 | And basically other images should be thrown out.
01:12:50.040 | So we didn't do any of that.
01:12:51.800 | So these are random internet images.
01:12:53.560 | And of course, it also goes back to like the problem
01:12:56.040 | of scale that you talked about.
01:12:57.320 | So these were basically about a billion or so images.
01:13:00.120 | And for context ImageNet,
01:13:01.560 | the ImageNet version that we used earlier
01:13:02.800 | was 1 million images.
01:13:04.280 | So this is basically going like
01:13:05.400 | three orders of magnitude more.
01:13:07.600 | The idea was basically to see
01:13:08.600 | if we can train a very large convolutional model
01:13:11.800 | in a self-supervised way on this uncurated,
01:13:14.440 | but really large set of images.
01:13:16.400 | And how well would this model do?
01:13:18.280 | So is self-supervised learning really over fit to ImageNet?
01:13:21.440 | Or can it actually work in the wild?
01:13:23.840 | And it was also out of curiosity,
01:13:25.720 | what kind of things will this model learn?
01:13:27.520 | Will it actually be able to still figure out,
01:13:29.680 | you know, different types of objects and so on?
01:13:32.000 | Would there be particular kinds of tasks
01:13:33.720 | it would actually do better than an ImageNet trained model?
01:13:38.160 | And so for SEER, one of our main findings was that
01:13:40.960 | we can actually train very large models
01:13:43.120 | in a completely self-supervised way
01:13:44.800 | on lots of internet images
01:13:46.360 | without really necessarily filtering them out.
01:13:48.640 | Which was in itself a good thing
01:13:49.760 | because it's a fairly simple process, right?
01:13:51.960 | So you get images which are uploaded
01:13:54.080 | and you basically can immediately use them
01:13:55.800 | to train a model in an unsupervised way.
01:13:57.720 | You don't really need to sit and filter them out.
01:13:59.720 | These images can be cartoons, these can be memes,
01:14:02.040 | these can be actual pictures uploaded by people.
01:14:04.440 | And you don't really care about what these images are.
01:14:06.160 | You don't even care about what concepts they contain.
01:14:08.520 | So this was a very sort of simple setup.
01:14:10.280 | - What image selection mechanism would you say is there,
01:14:14.760 | like inherent in some aspect of the process?
01:14:18.840 | So you're kind of implying that there's almost none,
01:14:21.280 | but what is there, would you say,
01:14:23.600 | if you were to introspect?
01:14:24.960 | - Right, so it's not like, uncurated can basically,
01:14:28.480 | like one way of imagining uncurated
01:14:30.400 | is basically you have like cameras,
01:14:32.400 | like cameras that can take pictures at random viewpoints.
01:14:35.200 | When people upload pictures to the internet,
01:14:37.400 | they are typically going to care about the framing of it.
01:14:40.320 | They're not going to upload, say,
01:14:41.840 | the picture of a zoomed in wall, for example.
01:14:43.800 | - Well, when we say internet, do we mean social networks?
01:14:46.080 | - Yes. - Okay.
01:14:47.160 | - So these are not going to be like pictures
01:14:48.680 | of like a zoomed in table or a zoomed in wall.
01:14:51.400 | So it's not really completely uncurated
01:14:53.160 | because people do have the like photographer's bias,
01:14:55.800 | where they do want to keep things
01:14:57.040 | towards the center a little bit,
01:14:58.640 | or like really have like, you know,
01:15:00.280 | nice looking things and so on in the picture.
01:15:02.640 | So that's the kind of bias that typically exists
01:15:05.640 | in this dataset and also the user base, right?
01:15:07.720 | You're not going to get lots of pictures
01:15:09.320 | from different parts of the world
01:15:10.520 | because there are certain parts of the world
01:15:12.120 | where people may not actually be uploading
01:15:14.320 | a lot of pictures to the internet
01:15:15.440 | or may not even have access to a lot of internet.
01:15:17.360 | - So this is a giant dataset and a giant neural network.
01:15:21.720 | I don't think we've talked about
01:15:23.000 | what architectures work well for SSL,
01:15:27.960 | for self-supervised learning.
01:15:29.320 | - For SEER and for SwAV,
01:15:30.680 | we were using convolutional networks,
01:15:32.480 | but recently in a work called DINO,
01:15:34.160 | we've basically started using transformers for vision.
01:15:36.880 | Both seem to work really well,
01:15:38.640 | convNets and transformers,
01:15:39.880 | and depending on what you want to do,
01:15:41.120 | you might choose to use a particular formulation.
01:15:43.560 | So for SEER, it was a convNet,
01:15:45.400 | it was particularly a RegNet model,
01:15:47.480 | which was also work from Facebook.
01:15:49.720 | RegNets are like really good
01:15:51.200 | when it comes to compute versus like accuracy.
01:15:54.760 | So because it was a very efficient model,
01:15:56.920 | compute and memory wise efficient,
01:15:59.680 | and basically it worked really well in terms of scaling.
01:16:02.480 | So we used a very large RegNet model
01:16:04.200 | and trained it on a billion images.
01:16:05.480 | - Can you maybe quickly comment on what RegNets are?
01:16:08.640 | It comes from this paper,
01:16:10.680 | Designing Network Design Spaces.
01:16:13.520 | It's just super interesting concept
01:16:15.520 | that emphasizes on how to create
01:16:16.960 | efficient neural networks, large neural networks.
01:16:19.480 | - So one of the sort of key takeaways from this paper,
01:16:21.760 | which the author, like whenever you hear them
01:16:23.360 | present this work, they keep saying is,
01:16:26.040 | a lot of neural networks are characterized
01:16:27.920 | in terms of flops, right?
01:16:29.000 | Flops basically being the floating point operations.
01:16:31.440 | And people really love to use flops to say,
01:16:33.280 | this model is like really computationally heavy,
01:16:36.160 | or like our model is computationally cheap and so on.
01:16:38.960 | Now it turns out that flops are really not a good indicator
01:16:41.800 | of how well a particular network is,
01:16:43.800 | like how efficient it is really.
01:16:45.920 | And what a better indicator is,
01:16:47.880 | is the activation or the memory
01:16:49.640 | that is being used by this particular model.
01:16:52.080 | And so designing like one of the key findings
01:16:54.920 | from this paper was basically that
01:16:56.520 | you need to design network families
01:16:58.640 | or neural network architectures
01:17:00.080 | that are actually very efficient
01:17:01.360 | in the memory space as well,
01:17:02.720 | not just in terms of pure flops.
01:17:04.760 | So RegNet is basically a network architecture family
01:17:07.520 | that came out of this paper,
01:17:08.880 | that is particularly good at both flops
01:17:11.120 | and the sort of memory required for it.
01:17:13.520 | And of course, it builds upon like earlier work,
01:17:15.720 | like ResNet being like the sort of
01:17:17.440 | more popular inspiration for it,
01:17:18.960 | where you have residual connections.
01:17:20.360 | But one of the things in this work is basically,
01:17:22.360 | they also use like squeeze excitation blocks.
01:17:25.040 | So it's a lot of nice sort of technical innovation
01:17:27.040 | in all of this, from prior work
01:17:28.680 | and a lot of the ingenuity of these particular authors
01:17:31.360 | in how to combine these multiple building blocks.
01:17:34.080 | But the key constraint was optimized
01:17:35.960 | for both flops and memory when you're basically doing this,
01:17:38.280 | don't just look at flops.
01:17:39.520 | - And that allows you to what,
01:17:41.360 | have sort of very large networks
01:17:45.160 | that through this process are optimized
01:17:49.040 | for efficiency, for low memory.
01:17:51.240 | - Also in just in terms of pure hardware,
01:17:53.560 | they fit very well on GPU memory.
01:17:55.800 | So they can be like really powerful
01:17:57.320 | neural network architectures with lots of parameters,
01:17:59.520 | lots of flops, but also because they're like efficient
01:18:01.960 | in terms of the amount of memory that they're using,
01:18:04.000 | you can actually fit a lot of these on like,
01:18:06.560 | you can fit a very large model on a single GPU, for example.
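To make the building-block idea concrete, here is a minimal PyTorch sketch of a residual block with a squeeze-excitation stage, the kind of component RegNet-style networks combine; the channel widths and layout here are illustrative assumptions, not the actual RegNet configuration.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Channel-wise gating: global-pool, bottleneck MLP, sigmoid scale."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))

class ResidualSEBlock(nn.Module):
    """Residual block with a squeeze-excitation stage (assumed toy layout)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.se = SqueezeExcite(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.se(self.body(x)))

x = torch.randn(2, 64, 32, 32)
print(ResidualSEBlock(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```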
01:18:09.600 | - Would you say that the choice of architecture
01:18:14.240 | matters more than the choice
01:18:15.880 | of maybe data augmentation techniques?
01:18:18.520 | Is there a possibility to say what matters more?
01:18:21.680 | You kind of implied that you can probably go really far
01:18:24.360 | with just using basic convNets.
01:18:27.560 | - All right, I think data like data and data augmentation,
01:18:30.520 | the algorithm being used for the self supervised training
01:18:33.240 | matters a lot more than the particular kind of architecture.
01:18:36.360 | With different types of architecture,
01:18:37.600 | you will get different like properties
01:18:39.520 | in the resulting sort of representation.
01:18:41.640 | But really, I mean, the secret sauce
01:18:43.560 | is in the data augmentation
01:18:44.600 | and the algorithm being used to train them.
01:18:47.000 | The architectures, I mean, at this point,
01:18:49.160 | a lot of them perform very similarly,
01:18:51.640 | depending on like the particular task that you care about,
01:18:53.760 | they have certain advantages and disadvantages.
01:18:56.360 | - Is there something interesting to be said
01:18:58.080 | about what it takes with SEER
01:19:00.120 | to train a giant neural network?
01:19:01.880 | You're talking about a huge amount of data,
01:19:04.120 | a huge neural network.
01:19:05.760 | Is there something interesting to be said
01:19:07.720 | of how to effectively train something like that fast?
01:19:11.240 | - Lots of GPUs.
01:19:12.960 | - Okay, so.
01:19:13.800 | (both laughing)
01:19:15.440 | - I mean, so the model was like a billion parameters.
01:19:17.960 | - Yeah.
01:19:18.800 | - And it was trained on a billion images.
01:19:20.600 | - Yeah.
01:19:21.440 | - So if like, basically the same number of parameters
01:19:23.320 | as the number of images, and it took a while.
01:19:26.160 | I don't remember the exact number, it's in the paper,
01:19:28.600 | but it took a while.
01:19:29.440 | (both laughing)
01:19:31.840 | - I guess I'm trying to get at is
01:19:33.720 | when you're thinking of scaling this kind of thing,
01:19:38.640 | I mean, one of the exciting possibilities
01:19:41.880 | of self-supervised learning is the several orders
01:19:45.280 | of magnitude scaling of everything,
01:19:47.320 | both in your neural network and the size of the data.
01:19:50.880 | And so the question is,
01:19:52.560 | do you think there's some interesting tricks
01:19:55.120 | to do large scale distributed compute,
01:19:57.840 | or is that really outside of even deep learning?
01:20:00.880 | That's more about like hardware engineering.
01:20:04.320 | - I think more and more there is like this,
01:20:06.760 | a lot of like systems are designed,
01:20:10.080 | basically taking into account
01:20:11.320 | the machine learning needs, right?
01:20:12.400 | So because whenever you're doing
01:20:14.320 | this kind of distributed training,
01:20:15.440 | there is a lot of intercommunication between nodes.
01:20:17.720 | So like gradients or the model parameters are being passed.
01:20:20.560 | So you really want to minimize communication costs
01:20:22.720 | when you really want to scale these models up.
01:20:25.200 | You basically want to be able to do
01:20:29.120 | as limited an amount of communication as possible.
01:20:31.400 | So currently like a dominant paradigm
01:20:33.240 | is synchronized sort of training.
01:20:35.040 | So essentially after every sort of gradient step,
01:20:38.440 | you basically have like a synchronization step
01:20:41.160 | between all the sort of compute chips
01:20:43.360 | that you're running on.
01:20:44.680 | I think asynchronous training was popular,
01:20:47.800 | but it doesn't seem to perform as well.
01:20:50.360 | But in general, I think that sort of the,
01:20:53.280 | I guess it's outside my scope as well, yeah.
01:20:55.240 | But the main thing is like minimize the amount
01:20:58.960 | of synchronization steps that you have.
01:21:01.840 | That has been the key takeaway, at least in my experience.
01:21:04.600 | The others I have no idea about.
01:21:05.920 | How to design the chip.
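For reference, here is a minimal sketch of the synchronous data-parallel setup described above, using PyTorch's DistributedDataParallel; the toy model and training loop are assumptions, but the gradient all-reduce after every backward pass is the synchronization step being discussed.

```python
# Minimal sketch of synchronous data-parallel training (assumed toy model).
# Launch with torchrun, one process per GPU; gradients are all-reduced
# across workers after every backward pass.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")                    # reads env vars set by torchrun
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    model = nn.Linear(128, 10).cuda()
    model = DDP(model)                                 # handles the gradient all-reduce
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(10):
        x = torch.randn(32, 128).cuda()
        y = torch.randint(0, 10, (32,)).cuda()
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()                                # synchronization happens here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```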
01:21:06.760 | - Yeah, there's very few things that I see,
01:21:09.320 | Jim Keller's eyes light up as much
01:21:11.880 | as talking about giant computers doing
01:21:14.120 | that fast communication you're talking about,
01:21:17.600 | well, when they're training machine learning
01:21:20.160 | systems.
01:21:21.200 | What is VISSL?
01:21:22.320 | V-I-S-S-L.
01:21:24.160 | The PyTorch based SSL library.
01:21:27.880 | What are the use cases that it might have?
01:21:30.080 | - VISSL basically was born out of a lot of us at Facebook
01:21:33.000 | doing self-supervised learning research.
01:21:35.120 | So it's a common framework in which we have like a lot
01:21:38.680 | of self-supervised learning methods implemented for vision.
01:21:41.680 | It's also, it has in itself like a benchmark of tasks
01:21:45.920 | that you can evaluate the self-supervised representations on.
01:21:48.760 | So the use case for it is basically for anyone
01:21:51.200 | who's either trying to evaluate their self-supervised model
01:21:53.720 | or train their self-supervised model
01:21:55.960 | or a researcher who's trying to build
01:21:57.760 | a new self-supervised technique.
01:21:59.240 | So it's basically supposed to be all of these things.
01:22:01.480 | So as a researcher, before VISSL, for example,
01:22:04.440 | or like when we started doing this work fairly seriously
01:22:06.920 | at Facebook, it was very hard for us to go
01:22:09.240 | and implement every self-supervised learning model,
01:22:11.840 | test it out in a like sort of consistent manner.
01:22:14.560 | The experimental setup was very different
01:22:16.400 | across different groups.
01:22:18.120 | Even when someone said that they were reporting
01:22:20.360 | ImageNet accuracy, it could mean lots of different things.
01:22:23.160 | So with VISSL, we tried to really sort of standardize that
01:22:25.320 | as much as possible.
01:22:26.360 | And there was a paper like we did in 2019
01:22:28.240 | just about benchmarking.
01:22:29.720 | And so VISSL basically builds upon a lot of this kind of work
01:22:32.840 | that we did about like benchmarking.
01:22:35.120 | And then every time we try to like,
01:22:37.160 | we come up with a self-supervised learning method,
01:22:38.960 | a lot of us try to push that into VISSL as well,
01:22:41.160 | just so that it basically is like the central piece
01:22:43.440 | where a lot of these methods can reside.
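As a rough illustration of the kind of benchmark such a library standardizes, here is a minimal linear-probe sketch in plain PyTorch: freeze a pretrained backbone and train a linear classifier on its features. This is not the VISSL API; the backbone, feature size, and class count are assumptions.

```python
# Minimal linear-probe sketch (plain PyTorch, not the VISSL API).
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet50(weights=None)   # in practice, load SSL weights here
backbone.fc = nn.Identity()                            # expose the 2048-d features
for p in backbone.parameters():
    p.requires_grad = False                            # freeze the representation

linear_head = nn.Linear(2048, 1000)                    # e.g. ImageNet classes
opt = torch.optim.SGD(linear_head.parameters(), lr=0.1, momentum=0.9)

def linear_probe_step(images, labels):
    with torch.no_grad():
        feats = backbone(images)                       # frozen features
    loss = nn.functional.cross_entropy(linear_head(feats), labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Usage: loop over a labeled loader, then report top-1 accuracy of linear_head
# as the quality measure of the self-supervised representation.
```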
01:22:46.360 | - Just out of curiosity, like people may be,
01:22:49.200 | so certainly outside of Facebook, but just researchers,
01:22:52.000 | or just even people that know how to program in Python
01:22:54.920 | and know how to use PyTorch, what would be the use case?
01:22:58.640 | What would be a fun thing to play around with VISSL on?
01:23:01.320 | Like what's a fun thing to play around
01:23:04.280 | with self-supervised learning on, would you say?
01:23:07.920 | Is there a good hello world program?
01:23:09.760 | Like is it always about big size that's important to have,
01:23:14.600 | or is there fun little smaller case playgrounds
01:23:18.840 | to play around with?
01:23:19.720 | - So we're trying to like push something towards that.
01:23:22.400 | I think there are a few setups out there,
01:23:24.320 | but nothing like super standard on the smaller scale.
01:23:26.800 | I mean, ImageNet in itself is actually pretty big also.
01:23:29.280 | So that is not something which is like feasible
01:23:32.240 | for a lot of people, but we are trying to like push up
01:23:34.840 | with like smaller sort of use cases.
01:23:36.360 | The thing is, at a smaller scale,
01:23:38.960 | a lot of the observations or a lot of the algorithms
01:23:41.200 | that work don't necessarily translate
01:23:42.720 | into the medium or the larger scale.
01:23:44.960 | So it's really tricky to come up
01:23:46.120 | with a good small scale setup
01:23:47.440 | where a lot of your empirical observations
01:23:49.120 | will really translate to the other setup.
01:23:51.520 | So it's been really challenging.
01:23:53.240 | I've been trying to do that for a little bit as well,
01:23:54.880 | because it does take time to train stuff on ImageNet.
01:23:56.800 | It does take time to train on like more images,
01:23:59.840 | but pretty much every time I've tried to do that,
01:24:02.200 | it's been unsuccessful because all the observations
01:24:04.080 | I draw from my set of experiments on a smaller data set
01:24:06.840 | don't translate into ImageNet,
01:24:09.120 | or like don't translate into another sort of data set.
01:24:11.720 | So it's been hard for us to figure this one out,
01:24:14.200 | but it's an important problem.
01:24:15.720 | - So there's this really interesting idea
01:24:17.920 | of learning across multiple modalities.
01:24:20.800 | You have a CVPR 2021 best paper candidate
01:24:25.800 | titled "Audio-Visual Instance Discrimination
01:24:29.200 | with Cross-Modal Agreement."
01:24:31.400 | What are the key results, insights in this paper,
01:24:33.840 | and what can you say in general about the promise
01:24:35.880 | and power of multimodal learning?
01:24:37.600 | - For this paper, it actually came as a little bit
01:24:39.920 | of a shock to me at how well it worked.
01:24:41.920 | So I can describe what the problem setup was.
01:24:44.120 | So it's been used in the past by lots of folks,
01:24:46.520 | like for example, Andrew Owens from MIT,
01:24:48.360 | Alyosha Efros from Berkeley,
01:24:49.920 | Andrew Zisserman from Oxford.
01:24:51.120 | So a lot of these people have been sort of
01:24:52.360 | showing results in this.
01:24:54.040 | Of course, I was aware of this result,
01:24:55.440 | but I wasn't really sure how well it would work in practice
01:24:58.560 | for like other sort of downstream tasks.
01:25:00.560 | So the results kept getting better,
01:25:02.400 | and I wasn't sure if like a lot of our insights
01:25:04.160 | from self-supervised learning would translate
01:25:05.880 | into this multimodal learning problem.
01:25:08.280 | So multimodal learning is when you have like,
01:25:11.840 | when you have multiple modalities.
01:25:14.160 | (laughs)
01:25:15.000 | And that's not even a quote.
01:25:15.840 | - Excellent.
01:25:16.680 | (laughs)
01:25:17.520 | - Okay, so the particular modalities that we worked on
01:25:20.000 | in this work were audio and video.
01:25:22.000 | So the idea was basically if you have a video,
01:25:23.880 | you have its corresponding audio track,
01:25:25.840 | and you want to use both of these signals,
01:25:27.520 | the audio signal and the video signal
01:25:29.240 | to learn a good representation for video
01:25:31.240 | and good representation for audio.
01:25:32.080 | - Like this podcast.
01:25:33.640 | - Like this podcast, exactly.
01:25:35.440 | So what we did in this work was basically trained
01:25:38.120 | two different neural networks,
01:25:39.400 | one on the video signal, one on the audio signal.
01:25:41.920 | And what we wanted is basically the features
01:25:43.800 | that we get from both of these neural networks
01:25:45.400 | should be similar.
01:25:46.800 | So it should basically be able to produce
01:25:48.720 | the same kinds of features from the video
01:25:51.120 | and the same kinds of features from the audio.
01:25:53.240 | Now, why is this useful?
01:25:54.280 | Well, for a lot of these objects that we have,
01:25:56.680 | there is a characteristic sound, right?
01:25:58.280 | So trains, when they go by,
01:25:59.520 | they make a particular kind of sound.
01:26:00.800 | Boats make a particular kind of sound.
01:26:02.520 | People, when they're jumping around,
01:26:03.840 | will like shout, "Whoo-ah," whatever.
01:26:06.240 | Bananas don't make a sound.
01:26:07.280 | So where you can't learn anything about bananas there.
01:26:09.360 | - Or when humans mention bananas.
01:26:11.640 | - Well, yes, when they say the word banana, then-
01:26:13.480 | - So you can't trust basically anything
01:26:15.080 | that comes out of a human's mouth
01:26:16.360 | as a source, that source of audio is useless.
01:26:18.720 | - So the typical use case is basically like,
01:26:20.640 | for example, someone playing a musical instrument.
01:26:22.440 | So guitars have a particular kind of sound and so on.
01:26:24.680 | So because a lot of these things are correlated,
01:26:27.080 | the idea in multimodal learning
01:26:28.440 | is to take these two kinds of modalities,
01:26:30.120 | video and audio, and learn a common embedding space,
01:26:33.120 | a common feature space where both of these
01:26:35.200 | related modalities can basically be close together.
01:26:38.560 | And again, you use contrastive learning for this.
01:26:40.600 | So in contrastive learning, basically the video
01:26:43.320 | and the corresponding audio are positives,
01:26:45.520 | and you can take any other video or any other audio,
01:26:48.200 | and that becomes a negative.
01:26:49.840 | And so basically that's it.
01:26:51.000 | It's just a simple application of contrastive learning.
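A minimal sketch of that cross-modal contrastive setup, with toy encoder stubs; the architectures, input shapes, and temperature are assumptions, not the exact method from the paper.

```python
# Cross-modal contrastive learning sketch (InfoNCE-style), assumed toy encoders.
import torch
import torch.nn as nn
import torch.nn.functional as F

video_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 32 * 32, 128))
audio_encoder = nn.Sequential(nn.Flatten(), nn.Linear(1 * 40 * 100, 128))

def audio_video_contrastive_loss(video, audio, temperature=0.1):
    # Embed both modalities and L2-normalize the features.
    v = F.normalize(video_encoder(video), dim=1)       # (N, 128)
    a = F.normalize(audio_encoder(audio), dim=1)       # (N, 128)
    logits = v @ a.t() / temperature                   # (N, N) similarity matrix
    targets = torch.arange(v.size(0))                  # i-th video matches i-th audio
    # Positives sit on the diagonal; every other pairing acts as a negative.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

video = torch.randn(4, 3, 8, 32, 32)   # batch of short clips
audio = torch.randn(4, 1, 40, 100)     # matching log-mel spectrograms (assumed shape)
print(audio_video_contrastive_loss(video, audio).item())
```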
01:26:53.720 | The main sort of finding from this work for us
01:26:55.960 | was basically that you can actually learn
01:26:58.720 | very, very powerful feature representations,
01:27:00.760 | very, very powerful video representations.
01:27:02.840 | So you can learn the sort of video network
01:27:05.400 | that we ended up learning can actually be used
01:27:07.480 | for downstream, for example, recognizing human actions
01:27:11.000 | or recognizing different types of sounds, for example.
01:27:14.440 | So this was sort of the key finding.
01:27:17.160 | - Can you give kind of an example of a human action
01:27:20.200 | or like just so we can build up intuition
01:27:23.360 | of what kind of thing?
01:27:26.880 | - Right, so there is this dataset called Kinetics,
01:27:26.880 | for example, which has like 400 different types
01:27:28.640 | of human actions.
01:27:29.480 | So people jumping, people doing different kinds
01:27:32.360 | of sports or different types of swimming.
01:27:34.280 | So like different strokes in swimming, golf, and so on.
01:27:37.600 | So there are like just different types
01:27:39.200 | of actions right there.
01:27:40.560 | And the point is this kind of video network
01:27:42.600 | that you learn in a self-supervised way
01:27:44.400 | can be used very easily to kind of recognize
01:27:46.920 | these different types of actions.
01:27:48.920 | It can also be used for recognizing
01:27:50.440 | different types of objects.
01:27:51.760 | And what we did is we tried to visualize
01:27:54.800 | whether the network can figure out
01:27:56.080 | where the sound is coming from.
01:27:57.880 | So basically give it a video,
01:28:01.000 | say of a person just strumming a guitar,
01:28:03.000 | but of course there is no audio in this.
01:28:04.720 | And now you give it this sound of a guitar.
01:28:07.160 | And you ask, like basically try to visualize
01:28:08.880 | where the network is, where the network thinks
01:28:11.000 | the sound is coming from.
01:28:12.520 | And it can kind of basically draw like,
01:28:14.560 | when you visualize it, you can see
01:28:15.720 | that it's basically focusing on the guitar.
01:28:17.480 | - Yeah, that's surreal.
01:28:18.320 | - And the same thing, for example,
01:28:20.160 | for certain people's voices,
01:28:21.480 | like famous celebrities' voices,
01:28:22.920 | it can actually figure out where their mouth is.
01:28:26.080 | So it can actually distinguish different people's voices,
01:28:28.600 | for example, a little bit as well.
01:28:30.480 | Without that ever being annotated in any way.
01:28:33.640 | - Right, so this is all what it had discovered.
01:28:35.520 | We never pointed out that this is a guitar
01:28:38.200 | and this is the kind of sound it produces.
01:28:40.080 | It can actually naturally figure that out
01:28:41.520 | because it's seen so many correlations
01:28:43.480 | of this sound coming with this kind of like an object
01:28:46.680 | that it basically learns to associate this sound
01:28:49.040 | with this kind of an object.
01:28:50.000 | - Yeah, that's really fascinating, right?
01:28:52.800 | That's really interesting.
01:28:53.640 | So the idea with this kind of network
01:28:55.200 | is then you then fine tune it for a particular task.
01:28:57.920 | So this is forming like a really good knowledge base
01:29:01.880 | within a neural network based on which you could then
01:29:04.520 | train a little bit more to accomplish a specific task.
01:29:07.720 | - Right, exactly.
01:29:09.200 | So you don't need a lot of videos
01:29:11.120 | of humans doing actions annotated.
01:29:12.800 | You can just use a few of them to basically get your--
01:29:16.040 | - How much insight do you draw from the fact
01:29:18.520 | that it can figure out where the sound is coming from?
01:29:22.560 | I'm trying to see, so that's kind of very,
01:29:26.160 | it's very CVPR beautiful, right?
01:29:28.120 | It's a cool insight.
01:29:30.000 | I wonder how profound that is.
01:29:33.000 | Does it speak to the idea that multiple modalities
01:29:39.240 | are somehow much bigger than the sum of their parts
01:29:44.120 | or is it really, really useful to have multiple modalities
01:29:47.960 | or is this just a cool thing that there's parts
01:29:50.640 | of our world that can be revealed
01:29:55.440 | like effectively through multiple modalities
01:29:58.400 | but most of it is really all about vision
01:30:01.240 | or about one of the modalities?
01:30:03.920 | - I would say a little tending more towards the second part.
01:30:07.840 | So most of it can be sort of figured out with one modality
01:30:10.760 | but having an extra modality always helps you.
01:30:13.240 | So in this case, for example, like one thing is when you're,
01:30:17.800 | if you observe someone cutting something
01:30:19.440 | and you don't have any sort of sound there,
01:30:22.040 | whether it's an apple or whether it's an onion,
01:30:25.160 | it's very hard to figure that out.
01:30:26.760 | But if you hear someone cutting it,
01:30:28.280 | it's very easy to figure it out
01:30:29.840 | because apples and onions make a very different kind
01:30:32.920 | of characteristic sound when they're cut.
01:30:34.920 | So you really figure this out based on audio.
01:30:36.960 | It's much easier.
01:30:38.320 | So your life will become much easier
01:30:40.120 | when you have access to different kinds of modalities.
01:30:42.360 | And the other thing is, so I like to relate it in this way.
01:30:45.120 | It may be like completely wrong
01:30:46.440 | but the distributional hypothesis in NLP, right?
01:30:49.400 | Where context basically gives kind of meaning to that word.
01:30:53.120 | Sound kind of does that too, right?
01:30:55.120 | So if you have the same sound,
01:30:57.200 | so that's the same context across different videos,
01:30:59.920 | you're very likely to be observing
01:31:01.360 | the same kind of concept.
01:31:03.080 | So that's the kind of reason why it figures out
01:31:05.080 | the guitar thing, right?
01:31:06.520 | It observed the same sound across multiple different videos
01:31:09.840 | and it figures out maybe this is the common factor
01:31:11.960 | that's actually doing it.
01:31:13.320 | - I wonder, I used to have this argument with my dad a bunch
01:31:17.520 | for creating general intelligence,
01:31:19.840 | whether a smell is an important,
01:31:22.920 | like if that's important sensory information.
01:31:25.560 | Mostly we're talking about like falling in love
01:31:27.600 | with an AI system.
01:31:29.000 | And for him, smell and touch are important.
01:31:31.480 | And I was arguing that it's not at all.
01:31:33.920 | It's important, it's nice and everything,
01:31:35.360 | but like you can fall in love with just language really,
01:31:38.440 | but voice is very powerful and vision is next
01:31:41.440 | and smell is not that important.
01:31:43.920 | Can I ask you about this process of active learning?
01:31:46.920 | You mentioned interactivity.
01:31:49.240 | - Right.
01:31:50.080 | - Is there some value within the self-supervised
01:31:55.080 | learning context to select parts of the data
01:32:00.160 | in intelligent ways such that they would most benefit
01:32:05.080 | the learning process?
01:32:06.400 | - Right, so I think so.
01:32:07.520 | I think, I mean, I know I'm talking
01:32:09.200 | to an active learning fan here,
01:32:10.320 | so of course I know the answer.
01:32:12.600 | - First you were talking bananas
01:32:14.000 | and now you're talking about active learning, I love it.
01:32:16.720 | I think Yann LeCun told me that active learning
01:32:18.800 | is not that interesting.
01:32:20.480 | I think back then I didn't want to argue with him too much,
01:32:24.400 | but when we talk again, we're gonna spend three hours
01:32:27.200 | arguing about active learning.
01:32:28.440 | My sense was you can go extremely far with active learning,
01:32:32.480 | you know, perhaps farther than anything else.
01:32:34.960 | Like to me, there's this kind of intuition
01:32:38.000 | that similar to data augmentation,
01:32:40.880 | you can get a lot from the data,
01:32:45.320 | from intelligent optimized usage of the data.
01:32:50.320 | I'm trying to speak generally in such a way
01:32:53.240 | that includes data augmentation and active learning,
01:32:57.080 | that there's something about maybe interactive exploration
01:32:59.920 | of the data that at least this part of the solution
01:33:04.360 | to intelligence, like an important part.
01:33:07.160 | I don't know what your thoughts are
01:33:08.240 | on active learning in general.
01:33:09.360 | - I actually really like active learning.
01:33:10.880 | So back in the day we did this largely ignored CVPR paper
01:33:14.240 | called Learning by Asking Questions.
01:33:16.560 | So the idea was basically you would train an agent
01:33:18.280 | that would ask a question about the image,
01:33:20.140 | it would get an answer,
01:33:21.560 | and basically then it would update itself,
01:33:23.400 | it would see the next image,
01:33:24.400 | it would decide what's the next hardest question
01:33:26.840 | that I can ask to learn the most.
01:33:28.800 | And the idea was basically because it was being smart
01:33:31.340 | about the kinds of questions it was asking,
01:33:33.520 | it would learn in fewer samples,
01:33:35.120 | it would be more efficient at using data.
01:33:37.920 | And we did find to some extent that it was actually better
01:33:40.360 | than randomly asking questions.
01:33:42.060 | Kind of weird thing about active learning
01:33:43.520 | is it's also a chicken and egg problem
01:33:45.200 | because when you look at an image,
01:33:47.160 | to ask a good question about the image,
01:33:48.680 | you need to understand something about the image.
01:33:50.920 | You can't ask a completely arbitrarily random question,
01:33:53.480 | it may not even apply to that particular image.
01:33:55.520 | So there is some amount of understanding or knowledge
01:33:57.640 | that basically keeps getting built
01:33:59.200 | when you're doing active learning.
01:34:01.320 | So I think active learning by itself is really good.
01:34:04.600 | And the main thing we need to figure out
01:34:06.400 | is basically how do we come up with a technique
01:34:09.640 | to first model what the model knows,
01:34:13.340 | and also model what the model does not know.
01:34:16.040 | I think that's the sort of beauty of it, right?
01:34:18.360 | Because when you know that there are certain things
01:34:20.480 | that you don't know anything about,
01:34:22.160 | asking a question about those concepts
01:34:23.640 | is actually going to bring you the most value.
01:34:26.520 | And I think that's the sort of key challenge.
01:34:28.360 | Now self-supervised learning by itself,
01:34:29.960 | like selecting data for it and so on,
01:34:31.480 | that's actually really useful.
01:34:32.680 | But I think that's a very narrow view
01:34:34.000 | of looking at active learning, right?
01:34:35.120 | If you look at it more broadly,
01:34:36.360 | it is basically about if the model has a knowledge
01:34:40.040 | about n concepts,
01:34:41.400 | and it is weak basically about certain things,
01:34:43.880 | so it needs to ask questions
01:34:45.300 | either to discover new concepts
01:34:46.900 | or to basically increase its knowledge
01:34:49.200 | about these n concepts.
01:34:50.400 | So at that level, it's a very powerful technique.
01:34:53.220 | I actually do think it's going to be really useful.
01:34:56.540 | Even in simple things such as data labeling,
01:34:59.060 | it's super useful.
01:35:00.260 | So here is one simple way
01:35:02.940 | that you can use active learning.
01:35:04.300 | For example, you have your self-supervised model,
01:35:06.900 | which is very good at predicting similarities
01:35:08.780 | and dissimilarities between things.
01:35:10.780 | And so if you label a picture as basically say, a banana,
01:35:14.640 | now you know that all the images
01:35:17.740 | that are very similar to this image
01:35:19.220 | are also likely to contain bananas.
01:35:21.500 | So probably when you want to understand
01:35:24.260 | what else is a banana,
01:35:25.180 | you're not going to use these other images.
01:35:26.920 | You're actually going to use an image
01:35:28.200 | that is not completely dissimilar,
01:35:31.160 | but somewhere in between,
01:35:32.340 | which is not super similar to this image,
01:35:33.860 | but not super dissimilar either.
01:35:35.660 | And that's going to tell you a lot more
01:35:37.140 | about what this concept of a banana is.
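A minimal sketch of that selection heuristic, assuming precomputed embeddings and an arbitrary similarity band; the thresholds are illustrative, not from any paper.

```python
# Pick unlabeled images that are neither very similar nor very dissimilar
# to a labeled example (assumed similarity band and budget).
import torch
import torch.nn.functional as F

def select_informative(labeled_emb, unlabeled_embs, low=0.4, high=0.7, k=16):
    sims = F.cosine_similarity(unlabeled_embs, labeled_emb.unsqueeze(0), dim=1)
    in_band = (sims > low) & (sims < high)          # the "in between" images
    candidates = torch.nonzero(in_band).squeeze(1)
    # Prefer the least similar ones inside the band: they say the most
    # about where the concept's boundary lies.
    order = torch.argsort(sims[candidates])
    return candidates[order[:k]]

labeled = F.normalize(torch.randn(128), dim=0)       # embedding of the labeled "banana"
pool = F.normalize(torch.randn(1000, 128), dim=1)    # embeddings of unlabeled images
print(select_informative(labeled, pool))             # indices to send for labeling
```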
01:35:39.540 | - So that's kind of a heuristic.
01:35:41.860 | I wonder if it's possible to also learn ways
01:35:46.860 | to discover the most likely, the most beneficial image.
01:35:51.940 | So not just looking at a thing
01:35:55.020 | that's somewhat similar to a banana,
01:35:58.460 | but not exactly similar,
01:36:00.000 | but have some kind of more complicated learning system,
01:36:03.560 | like learned discovery mechanism
01:36:07.100 | that tells you what image to look for.
01:36:09.440 | - Yeah, exactly.
01:36:10.940 | - Yeah, like actually in a self-supervised way,
01:36:14.340 | learning strictly a function that says,
01:36:17.300 | is this image going to be very useful to me,
01:36:20.600 | given what I currently know?
01:36:22.140 | - I think there's a lot of synergy there.
01:36:24.020 | It's just, I think, yeah, it's going to be explored.
01:36:26.760 | - I think very much related to that,
01:36:29.420 | I kind of think of what Tesla Autopilot is doing
01:36:32.420 | currently as kind of active learning.
01:36:36.900 | There's something that Andrej Karpathy and their team
01:36:39.280 | are calling a data engine.
01:36:41.300 | - Yes.
01:36:42.140 | - So you're basically deploying a bunch of instantiations
01:36:45.700 | of a neural network into the wild,
01:36:47.820 | and they're collecting a bunch of edge cases
01:36:50.700 | that are then sent back for annotation for particular,
01:36:53.980 | and edge cases as defined as near failure
01:36:56.700 | or some weirdness on a particular task
01:36:59.980 | that's then sent back.
01:37:01.420 | It's the not-exactly-a-banana,
01:37:04.020 | but almost-a-banana cases sent back for annotation.
01:37:07.220 | And then there's this loop that keeps going,
01:37:09.260 | and you keep retraining and retraining.
01:37:11.620 | And the active learning step there,
01:37:13.300 | or whatever you want to call it,
01:37:14.820 | is the cars themselves that are sending you back the data,
01:37:19.100 | like what the hell happened here?
01:37:20.780 | This was weird.
01:37:22.840 | What are your thoughts about that sort of deployment
01:37:26.460 | of neural networks in the wild?
01:37:28.260 | Another way to ask a question,
01:37:30.100 | but first is your thoughts,
01:37:31.360 | and maybe if you want to comment,
01:37:33.860 | is there applications for autonomous driving?
01:37:36.960 | Like computer vision based autonomous driving,
01:37:40.180 | applications of self-supervised learning
01:37:42.060 | in the context of computer vision based autonomous driving?
01:37:46.120 | - So I think so.
01:37:48.380 | I think for self-supervised learning to be used
01:37:50.060 | in autonomous driving, there are lots of opportunities.
01:37:52.580 | And just like pure consistency in predictions is one way.
01:37:55.860 | So because you have this nice sequence of data
01:38:00.300 | that is coming in, a video stream of it,
01:38:02.340 | associated of course with the actions,
01:38:04.100 | let's say the car took,
01:38:05.260 | you can form a very nice predictive model
01:38:07.680 | of what's happening.
01:38:08.520 | So for example,
01:38:10.660 | like one way possibly in which they're figuring out
01:38:14.500 | what data to get labeled is basically
01:38:15.940 | through prediction uncertainty, right?
01:38:17.500 | So you predict that the car was going to turn right.
01:38:20.420 | So this was the action that was going to happen
01:38:21.900 | say in the shadow mode, and now the driver turned left.
01:38:24.700 | And this is a really big surprise.
01:38:27.220 | So basically by forming these good predictive models,
01:38:30.180 | you are, I mean, these are kind of self-supervised models.
01:38:32.900 | Prediction models are basically being trained
01:38:34.660 | just by looking at what's going to happen next
01:38:36.820 | and asking them to predict what's going to happen next.
01:38:38.980 | So I would say this is really like one use
01:38:40.780 | of self-supervised learning.
01:38:42.340 | It's a predictive model,
01:38:43.460 | and you're learning a predictive model
01:38:44.700 | basically just by looking at what data you have.
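A minimal sketch of that surprise-based selection idea, under stated assumptions: shadow-mode action logits, an arbitrary threshold, and a toy action space. It is purely illustrative, not Tesla's pipeline.

```python
# Flag clips where the predictive model was surprised by the driver's action.
import torch
import torch.nn.functional as F

def surprise(action_logits, driver_action):
    """Negative log-probability the model assigned to the action actually taken."""
    logp = F.log_softmax(action_logits, dim=-1)
    return -logp[driver_action].item()

action_logits = torch.tensor([2.5, 0.1, -1.0])   # model in shadow mode: left / straight / right
driver_action = 2                                # the driver actually turned right
if surprise(action_logits, driver_action) > 2.0: # assumed threshold
    print("big surprise -> send this clip back for annotation")
```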
01:38:46.900 | - Is there something about that active learning context
01:38:49.620 | that you find insights from?
01:38:53.020 | Like that kind of deployment of the system,
01:38:54.780 | seeing cases where it doesn't perform as you expected
01:38:59.140 | and then retraining the system based on that?
01:39:01.020 | - I think that, I mean, that really resonates with me.
01:39:03.620 | It's super smart to do it that way.
01:39:05.580 | Because I mean, the thing is with any kind
01:39:08.540 | of like practical system like autonomous driving,
01:39:11.180 | there are those edge cases are the things
01:39:13.060 | that are actually the problem, right?
01:39:14.540 | I mean, highway driving or like freeway driving
01:39:17.420 | has basically been, like there has been a lot of success
01:39:20.140 | in that particular part of autonomous driving
01:39:21.860 | for a long time.
01:39:23.100 | I would say like since the 80s or something.
01:39:25.580 | Now, the point is all these failure cases
01:39:28.020 | are the sort of reason why autonomous driving hasn't come,
01:39:31.420 | it hasn't become like super, super mainstream
01:39:33.180 | and available like in every possible car right now.
01:39:35.620 | And so basically by really scaling this problem out
01:39:38.180 | by really trying to get all of these edge cases out
01:39:40.420 | as quickly as possible,
01:39:41.820 | and then just like using those to improve your model,
01:39:43.860 | that's super smart.
01:39:45.580 | And prediction uncertainty to do that
01:39:47.060 | is like one really nice way of doing it.
01:39:49.740 | - Let me put you on the spot.
01:39:52.020 | So we mentioned offline Jitendra.
01:39:55.220 | He thinks that the Tesla computer vision approach
01:39:58.180 | or really any approach for autonomous driving
01:40:00.780 | is very far away.
01:40:02.660 | How many years away,
01:40:05.420 | if you had to bet all your money on it,
01:40:06.940 | are we to solving autonomous driving
01:40:09.580 | with this kind of computer vision only
01:40:11.980 | machine learning based approach?
01:40:13.580 | - Okay, so what does solving autonomous driving mean?
01:40:15.380 | Does it mean solving it in the US?
01:40:17.180 | Does it mean solving it in India?
01:40:18.420 | Because I can tell you
01:40:19.260 | that very different types of driving have been.
01:40:21.140 | - Not India, not Russia.
01:40:23.740 | In the United States, autonomous,
01:40:26.220 | so what solving means is when the car says
01:40:30.420 | it has control, it is fully liable.
01:40:34.140 | You can go to sleep, it's driving by itself.
01:40:37.860 | So this is highway and city driving,
01:40:39.820 | but not everywhere, but mostly everywhere.
01:40:42.380 | And it's, let's say significantly better,
01:40:45.140 | like say five times less accidents than humans.
01:40:50.140 | Sufficiently safer such that the public feels
01:40:54.060 | like that transition is enticing beneficial,
01:40:58.020 | both for our safety and financially
01:40:59.580 | and all those kinds of things.
01:41:01.140 | - Okay, so first disclaimer,
01:41:02.380 | I'm not an expert in autonomous driving.
01:41:04.300 | So let me put it out there.
01:41:06.020 | I would say like at least five to 10 years.
01:41:08.420 | This would be my guess from now.
01:41:11.860 | Yeah, I'm actually very impressed.
01:41:14.740 | Like when I sat in a friend's Tesla recently
01:41:16.900 | and of course like looking, so it can,
01:41:20.140 | on the screen it basically shows all the detections
01:41:22.300 | and everything the car is doing as you're driving by.
01:41:24.740 | And that's super distracting for me as a person
01:41:26.940 | because all I keep looking at is like the bounding boxes
01:41:29.500 | and the cars that it's tracking.
01:41:30.860 | And it's really impressive,
01:41:31.820 | like especially when it's raining and it's able to do that.
01:41:34.340 | That was the most impressive part for me.
01:41:36.060 | It's actually able to get through rain and do that.
01:41:38.580 | And one of the reasons why like a lot of us believed
01:41:41.780 | and I would put myself in that category
01:41:44.100 | is LIDAR based sort of technology
01:41:46.900 | for autonomous driving was the key driver, right?
01:41:48.780 | So Waymo was using it for the longest time.
01:41:51.060 | And Tesla then decided to go this completely other route
01:41:53.340 | that we're not going to even use LIDAR.
01:41:55.820 | So their initial system I think was camera and radar based.
01:41:58.780 | And now they're actually moving
01:41:59.700 | to a completely like vision based system.
01:42:02.060 | And so that was just like, it sounded completely crazy.
01:42:04.700 | Like LIDAR is very useful in cases
01:42:07.100 | where you have low visibility.
01:42:09.300 | Of course it comes with its own set of complications.
01:42:11.780 | But now to see that happen in like on a live Tesla,
01:42:15.220 | that basically just proves everyone wrong,
01:42:17.060 | I would say in a way.
01:42:18.180 | And that's just working really well.
01:42:20.620 | I think there were also like a lot of advancements
01:42:22.780 | in camera technology.
01:42:23.980 | Now there were like, I know at CMU when I was there,
01:42:26.340 | there was a particular kind of camera
01:42:28.020 | that had been developed that was really good
01:42:30.140 | at basically low visibility settings.
01:42:32.860 | So like lots of snow and lots of rain,
01:42:34.500 | it could actually still have a very reasonable visibility.
01:42:37.700 | And I think there are lots of these kinds of innovations
01:42:39.460 | that will happen on the sensor side itself,
01:42:41.020 | which is actually going to make this very easy
01:42:42.900 | in the future.
01:42:43.900 | And so maybe that's actually why I'm more optimistic
01:42:46.140 | about vision based autonomous driving.
01:42:49.060 | I'm not gonna call it self-supervised driving.
01:42:50.500 | But vision based autonomous driving,
01:42:53.580 | that's the reason I'm quite optimistic about it.
01:42:55.500 | Because I think there are going to be lots
01:42:56.700 | of these advances on the sensor side itself.
01:42:58.980 | So acquiring this data,
01:43:00.740 | we're actually going to get much better about it.
01:43:02.660 | And then of course, once we're able to scale out
01:43:05.100 | and get all of these edge cases in,
01:43:06.820 | as like Andrej described,
01:43:08.780 | I think that's going to make us go very far away.
01:43:11.740 | - Yeah, so it's funny,
01:43:13.620 | I'm very much with you on the five to 10 years,
01:43:16.300 | maybe 10 years, but you made it,
01:43:20.100 | I'm not sure how you made it sound,
01:43:21.820 | but for some people that might seem like really far away,
01:43:25.380 | and then for other people,
01:43:26.940 | it might seem like very close.
01:43:30.460 | There's a lot of fundamental questions
01:43:32.300 | about how much game theory is in this whole thing.
01:43:36.900 | So like how much is this simply collision avoidance problem?
01:43:41.900 | And how much of it is,
01:43:44.340 | you're still interacting with other humans in the scene,
01:43:46.980 | and you're trying to create an experience that's compelling.
01:43:49.460 | So you want to get from point A to point B quickly,
01:43:53.060 | you want to navigate the scene in a safe way,
01:43:55.260 | but you also want to show some level of aggression,
01:43:58.500 | because, well, certainly this is why you're screwed in India,
01:44:01.980 | because you have to show aggression.
01:44:03.380 | - Or Jersey, or New Jersey.
01:44:04.900 | (both laughing)
01:44:07.020 | - So like, or New York, or basically any major city.
01:44:11.180 | But I think it's probably Elon
01:44:13.220 | that I talked the most about this,
01:44:14.780 | which is surprised to the level of which
01:44:17.700 | they're not considering human beings
01:44:20.100 | as a huge problem in this, as a source of problem.
01:44:22.980 | Like the driving is fundamentally a robot on robot
01:44:27.980 | versus the environment problem,
01:44:31.180 | versus like, you can just consider humans
01:44:34.020 | not part of the problem.
01:44:35.180 | I used to think humans almost certainly
01:44:38.860 | have to be modeled really well.
01:44:41.220 | Pedestrians and cyclists and humans inside of the cars,
01:44:44.380 | you have to have like mental models for them.
01:44:46.380 | You cannot just see it as objects.
01:44:48.340 | But more and more, it's like the,
01:44:51.420 | it's the same kind of intuition breaking thing
01:44:53.700 | that self-supervised learning does,
01:44:56.180 | which is, well, maybe through the learning,
01:44:58.820 | you'll get all the human, like human information you need.
01:45:03.820 | Right? - Right.
01:45:05.060 | - Maybe you'll get it just with enough data.
01:45:07.740 | You don't need to have explicit good models
01:45:09.660 | of human behavior.
01:45:10.820 | Maybe you get it through the data.
01:45:12.140 | So, I mean, my skepticism also just knowing
01:45:14.660 | a lot of automotive companies
01:45:16.340 | and how difficult it is to be innovative,
01:45:18.620 | I was skeptical that they would be able at scale
01:45:22.540 | to convert the driving scene across the world
01:45:27.380 | into digital form,
01:45:29.060 | such that you can create this data engine at scale.
01:45:33.140 | And the fact that Tesla is at least getting there
01:45:36.660 | or are already there,
01:45:39.460 | makes me think that it's now starting to be coupled
01:45:43.660 | to the self-supervised learning vision,
01:45:47.620 | which is like, if that's gonna work,
01:45:49.860 | if through purely this process, you can get really far,
01:45:52.980 | then maybe you can solve driving that way.
01:45:54.900 | I don't know.
01:45:55.740 | I tend to believe we don't give enough credit
01:46:00.060 | to how amazing humans are both at driving
01:46:05.060 | and at supervising autonomous systems.
01:46:09.420 | And also we don't, this is, I wish we were,
01:46:13.260 | I wish there was much more driver sensing inside Teslas
01:46:17.180 | and much deeper consideration of human factors,
01:46:21.220 | like understanding psychology and drowsiness
01:46:24.700 | and all those kinds of things.
01:46:26.220 | When the car does more and more of the work,
01:46:28.740 | how to keep utilizing the little human supervision
01:46:32.980 | that are needed to keep this whole thing safe.
01:46:35.100 | I mean, it's a fascinating dance
01:46:36.580 | of human robot interaction.
01:46:38.460 | To me, autonomous driving for a long time
01:46:42.140 | is a human robot interaction problem.
01:46:45.020 | It is not a robotics problem or computer vision problem.
01:46:48.020 | Like you have to have a human in the loop,
01:46:49.980 | but so, which is why I think it's 10 years plus,
01:46:53.340 | but I do think there'll be a bunch of cities and contexts
01:46:56.300 | where geo-restricted, it will work really, really damn well.
01:47:01.300 | - Yeah.
01:47:02.620 | So I think for me, like it's five, if I'm being optimistic
01:47:04.980 | and it's going to be five for a lot of cases
01:47:07.340 | and 10 plus, yeah, I agree with you.
01:47:09.220 | 10 plus basically, if we want to like cover most of say,
01:47:13.420 | contiguous United States or something.
01:47:15.220 | - Oh, interesting.
01:47:16.060 | So my optimistic is five and pessimistic is 30.
01:47:20.300 | - 30.
01:47:21.140 | - I have a long tail on this one.
01:47:22.460 | I've watched enough driving videos.
01:47:24.420 | I've watched enough pedestrians to think like,
01:47:27.940 | we may be like, there's a small part of me still,
01:47:31.140 | not a small, like a pretty big part of me
01:47:33.860 | that thinks we will have to build AGI to solve driving.
01:47:37.540 | - Oh wow.
01:47:38.460 | - Like there's something to me like,
01:47:40.020 | because humans are part of the picture,
01:47:41.780 | deeply part of the picture
01:47:44.020 | and also human society is part of the picture
01:47:46.060 | in that human life is at stake.
01:47:47.940 | Anytime a robot kills a human,
01:47:50.860 | it's not clear to me that that's not a problem
01:47:54.300 | that machine learning will also have to solve.
01:47:56.340 | Like it has to, you have to integrate that
01:47:59.400 | into the whole thing,
01:48:00.240 | just like Facebook or social networks.
01:48:02.940 | You know, one thing is to say
01:48:04.120 | how to make a really good recommender system.
01:48:06.700 | And then the other thing is to integrate
01:48:08.660 | into that recommender system,
01:48:10.260 | all the journalists that will write articles
01:48:12.060 | about that recommender system.
01:48:13.860 | Like you have to consider the society
01:48:15.880 | within which the AI system operates.
01:48:18.380 | And in order to, and like politicians too,
01:48:20.700 | you know, this is the regulatory stuff
01:48:22.660 | for autonomous driving.
01:48:24.200 | It's kind of fascinating that the more successful
01:48:26.700 | your AI system becomes,
01:48:28.700 | the more it gets integrated in society
01:48:31.580 | and the more precious politicians and the public
01:48:34.500 | and the clickbait journalists
01:48:35.980 | and all the different fascinating forces of our society
01:48:39.060 | start acting on it.
01:48:40.380 | And then it's no longer how good you are
01:48:42.220 | at doing the initial task.
01:48:43.980 | It's also how good you are at navigating human nature,
01:48:47.020 | which is a fascinating space.
01:48:49.940 | What do you think are the limits of deep learning?
01:48:52.620 | If you allow me, we'll zoom out a little bit
01:48:54.820 | into the big question of artificial intelligence.
01:48:58.140 | You said dark matter of intelligence
01:49:01.260 | is self-supervised learning,
01:49:02.820 | but there could be more.
01:49:04.340 | What do you think the limits of self-supervised learning
01:49:07.780 | and just learning in general, deep learning are?
01:49:10.740 | - I think like for deep learning in particular,
01:49:12.700 | because self-supervised learning is,
01:49:14.180 | I would say a little bit more vague right now.
01:49:16.820 | So I wouldn't, like for something that's so vague,
01:49:18.700 | it's hard to predict what its limits are going to be.
01:49:21.980 | But like I said, I think anywhere you want to interact
01:49:25.260 | with human self-supervised learning
01:49:26.540 | kind of hits a boundary very quickly
01:49:28.460 | because you need to have an interface
01:49:29.940 | to be able to communicate with a human.
01:49:31.620 | So really like if you have just like vacuous concepts
01:49:35.060 | or like just like nebulous concepts
01:49:36.900 | discovered by a network,
01:49:38.580 | it's very hard to communicate those with a human
01:49:40.380 | without like inserting some kind of human knowledge
01:49:42.420 | or some kind of like human bias there.
01:49:44.380 | In general, I think for deep learning,
01:49:47.020 | the biggest challenge is just like data efficiency.
01:49:50.660 | Even with self-supervised learning,
01:49:52.620 | even with anything else,
01:49:53.540 | if you just see a single concept once,
01:49:57.460 | like one image of a, like, I don't know,
01:49:59.860 | whatever you want to call it, like any concept,
01:50:02.500 | it's really hard for these methods to generalize
01:50:04.820 | by looking at just one or two samples of things.
01:50:07.700 | And that has been a real, real challenge.
01:50:09.740 | And I think that's actually why like these edge cases,
01:50:11.660 | for example, for Tesla are actually that important.
01:50:14.500 | Because if you see just one instance of the car failing,
01:50:18.020 | and if you just annotate that
01:50:19.300 | and you get that into your dataset,
01:50:21.900 | you have like very limited guarantee
01:50:23.540 | that it's not going to happen again.
01:50:25.140 | And you're actually going to be able to recognize
01:50:26.740 | this kind of instance in a very different scenario.
01:50:28.620 | So like when it was snowing,
01:50:30.300 | so you got that thing labeled when it was snowing,
01:50:32.020 | but now when it's raining,
01:50:33.220 | you're actually not able to get it.
01:50:34.620 | Or you basically have the same scenario
01:50:36.580 | in a different part of the world.
01:50:37.420 | So the lighting was different or so on.
01:50:39.100 | So it's just really hard for these models,
01:50:41.020 | like deep learning, especially to do that.
01:50:42.700 | - What's your intuition?
01:50:43.540 | How do we solve the handwritten digit recognition problem
01:50:47.540 | when we only have one example for each number?
01:50:51.220 | It feels like humans are using something like learning.
01:50:54.700 | - Right, I think it's,
01:50:56.020 | we are good at transferring knowledge a little bit.
01:50:59.260 | We are just better at, like, for a lot of these problems
01:51:02.620 | where we are generalizing from a single sample
01:51:04.820 | or recognizing from a single sample,
01:51:06.940 | we are using a lot of our own domain knowledge
01:51:08.740 | and a lot of our like inductive bias
01:51:10.340 | into that one sample to generalize it.
01:51:12.300 | So I've never seen you write the number nine, for example.
01:51:15.300 | And if you were to write it, I would still get it.
01:51:17.460 | And if you were to write a different kind of alphabet
01:51:19.260 | and like write it in two different ways,
01:51:20.820 | I would still probably be able to figure out
01:51:22.340 | that these are the same two characters.
01:51:24.700 | It's just that I have been very used
01:51:26.300 | to seeing handwritten digits in my life.
01:51:29.020 | The other sort of problem with any deep learning system
01:51:31.340 | or any kind of machine learning system
01:51:32.660 | is like its guarantees, right?
01:51:34.140 | There are no guarantees for it.
01:51:35.820 | Now you can argue that humans also don't have any guarantees
01:51:38.100 | like there is no guarantee that I can recognize a cat
01:51:41.100 | in every scenario.
01:51:42.180 | I'm sure there are going to be lots of cats
01:51:43.860 | that I don't recognize,
01:51:44.980 | lots of scenarios in which I don't recognize cats in general.
01:51:48.060 | But I think from like,
01:51:50.180 | from just a sort of application perspective,
01:51:52.780 | you do need guarantees, right?
01:51:54.700 | We call these things algorithms.
01:51:56.900 | Now algorithms, like traditional CS algorithms
01:51:59.020 | have guarantees, sorting is a guarantee.
01:52:01.420 | If you were to like call sort
01:52:03.380 | on a particular array of numbers,
01:52:05.540 | you are guaranteed that it's going to be sorted.
01:52:07.580 | Otherwise it's a bug.
01:52:09.260 | Now for machine learning,
01:52:10.100 | it's very hard to characterize this.
01:52:12.380 | We know for a fact that like a cat recognition model
01:52:15.380 | is not going to recognize cats,
01:52:16.980 | every cat in the world in every circumstance.
01:52:19.660 | I think most people would agree with that statement.
01:52:21.980 | But we are still okay with it.
01:52:23.540 | We still don't call this as a bug.
01:52:25.340 | Whereas in traditional computer science
01:52:26.660 | or traditional science,
01:52:27.820 | like if you have this kind of failure case existing,
01:52:29.900 | then you think of it as like something is wrong.
01:52:33.140 | I think there is this sort of notion
01:52:34.460 | of nebulous correctness for machine learning.
01:52:36.980 | And that's something we just need
01:52:37.820 | to be very comfortable with.
01:52:39.420 | And for deep learning,
01:52:40.500 | or like for a lot of these machine learning algorithms,
01:52:42.660 | it's not clear how do we characterize
01:52:44.660 | this notion of correctness.
01:52:46.260 | I think limitation in our understanding,
01:52:48.100 | or at least a limitation in our phrasing of this.
01:52:51.100 | And if we were to come up with better ways
01:52:53.020 | to understand this limitation,
01:52:55.020 | then it would actually help us a lot.
01:52:57.140 | - Do you think there's a distinction
01:52:58.820 | between the concept of learning
01:53:01.780 | and the concept of reasoning?
01:53:03.340 | Do you think it's possible for neural networks to reason?
01:53:09.220 | - So I think of it slightly differently.
01:53:11.620 | So for me, learning is whenever
01:53:14.460 | I can like make a snap judgment.
01:53:16.020 | So if you show me a picture of a dog,
01:53:17.140 | I can immediately say it's a dog.
01:53:18.820 | But if you give me like a puzzle,
01:53:20.380 | you know, like whatever,
01:53:22.060 | a Rube Goldberg machine of like,
01:53:24.180 | things going to happen,
01:53:25.020 | then I have to reason.
01:53:25.860 | I've never, it's a very complicated setup.
01:53:27.620 | I've never seen that particular setup.
01:53:29.340 | And I really need to, you know,
01:53:30.660 | draw and like imagine in my head
01:53:32.220 | what's going to happen to figure it out.
01:53:34.660 | So I think yes,
01:53:35.500 | neural networks are really good at recognition,
01:53:38.940 | but they're not very good at reasoning.
01:53:41.180 | Because they're like,
01:53:42.020 | if they have seen something before,
01:53:44.140 | or seen something similar before,
01:53:45.820 | they're very good at making those sort of snap judgments.
01:53:48.260 | But if you were to give them a very complicated thing
01:53:50.740 | that they've not seen before,
01:53:52.500 | they have very limited ability right now
01:53:55.340 | to compose different things like,
01:53:56.740 | oh, I've seen this particular part before,
01:53:58.260 | I've seen this particular part before.
01:54:00.060 | And now probably like,
01:54:00.980 | this is how they're going to work in tandem.
01:54:02.940 | It's very hard for them to come up
01:54:04.140 | with these kinds of things.
01:54:05.220 | - Well, there's a certain aspect to reasoning
01:54:08.820 | that you can maybe convert into the process of programming.
01:54:11.860 | And so there's the whole field of program synthesis
01:54:14.300 | and people have been applying machine learning
01:54:17.220 | to the problem of program synthesis.
01:54:18.900 | And the question is, you know,
01:54:20.140 | can they, the step of composition,
01:54:22.700 | why can't that be learned?
01:54:25.260 | You know, the step of like building things on top of,
01:54:29.380 | like little intuitions, concepts on top of each other,
01:54:33.220 | can that be learnable?
01:54:35.260 | What's your intuition there?
01:54:36.780 | Or like, I guess, similar set of techniques,
01:54:39.460 | do you think that would be applicable?
01:54:42.060 | - So I think it is, of course, learnable.
01:54:43.980 | It is learnable because like,
01:54:45.100 | we are prime examples of machines that have like,
01:54:47.620 | or individuals that have learned this, right?
01:54:49.500 | Like humans have learned this.
01:54:51.100 | So it is, of course,
01:54:51.940 | it is a technique that is very easy to learn.
01:54:54.820 | I think where we are kind of hitting a wall,
01:54:58.420 | basically with like current machine learning
01:55:00.500 | is the fact that when the network learns
01:55:03.420 | all of this information,
01:55:04.660 | we basically are not able to figure out
01:55:07.500 | how well it's going to generalize to an unseen thing.
01:55:10.620 | And we have no, like a priori,
01:55:12.420 | no way of characterizing that.
01:55:13.900 | And I think that's basically telling us a lot about,
01:55:18.460 | like a lot about the fact that we really don't know
01:55:20.700 | what this model has learned and how well it's basically,
01:55:22.740 | because we don't know how well it's going to transfer.
01:55:25.140 | - There's also a sense in which it feels like
01:55:28.100 | we humans may not be aware of how much like background,
01:55:33.100 | how good our background model is,
01:55:36.860 | how much knowledge we just have
01:55:38.780 | slowly building on top of each other.
01:55:41.540 | It feels like neural networks
01:55:42.620 | are constantly throwing stuff out.
01:55:43.980 | Like you'll do some incredible thing
01:55:45.500 | where you're learning a particular task in computer vision,
01:55:49.180 | you celebrate your state of the art successes
01:55:51.380 | and you throw that out.
01:55:52.860 | Like it feels like it's,
01:55:54.420 | you're never using stuff you've learned
01:55:56.860 | for your future successes in other domains.
01:56:00.220 | And humans are obviously doing that exceptionally well,
01:56:03.380 | still throwing stuff away in their mind,
01:56:05.980 | but keeping certain kernels of truth.
01:56:07.980 | - Right, so I think continual learning
01:56:09.340 | is sort of the paradigm
01:56:11.220 | for this in machine learning.
01:56:12.060 | And I don't think it's a very well-explored paradigm.
01:56:15.300 | We have like things in deep learning, for example,
01:56:17.380 | right, catastrophic forgetting
01:56:18.540 | is like one of the standard things.
01:56:20.300 | The thing basically being that if you teach a network
01:56:23.260 | like to recognize dogs,
01:56:24.900 | and now you teach that same network to recognize cats,
01:56:27.540 | it basically forgets how to recognize dogs.
01:56:29.180 | So it forgets very quickly.
01:56:30.940 | I mean, and whereas a human,
01:56:32.660 | if you were to teach someone to recognize dogs
01:56:34.700 | and then to recognize cats,
01:56:36.060 | they don't forget immediately how to recognize these dogs.
01:56:38.580 | I think that's basically sort of what you're trying to get at.
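
As an aside for the reader, here is a minimal sketch of the catastrophic forgetting effect being described: a small network trained on one task, then trained sequentially on a second task with a shifted input distribution, loses most of its accuracy on the first. The two synthetic tasks, the tiny architecture, and the hyperparameters are illustrative assumptions, not anything from the conversation.

```python
# Toy illustration of catastrophic forgetting: train on "task A", then on
# "task B", and watch task-A accuracy collapse. Tasks and model are made up.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_task(offset):
    # Binary classification: is x[0] + x[1] greater than 2 * offset?
    x = torch.randn(2000, 2) + offset
    y = ((x[:, 0] + x[:, 1]) > 2 * offset).long()
    return x, y

def accuracy(model, x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

def train(model, x, y, steps=500):
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))

xa, ya = make_task(offset=0.0)   # "dogs"
xb, yb = make_task(offset=4.0)   # "cats": shifted input distribution

train(model, xa, ya)
print("task A accuracy after training on A:", accuracy(model, xa, ya))

train(model, xb, yb)             # sequential training, no replay of task A
print("task A accuracy after training on B:", accuracy(model, xa, ya))
print("task B accuracy after training on B:", accuracy(model, xb, yb))
```

Continual-learning techniques such as replay buffers or regularizing toward the old weights are standard ways to mitigate this, but the plain sequential setup above is the default behavior being pointed at here.
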
01:56:40.780 | - Yeah, I just, I wonder if like
01:56:42.540 | the long-term memory mechanisms
01:56:44.860 | or the mechanisms that store not just memories,
01:56:47.260 | but concepts that allow you to reason
01:56:52.260 | and, like, compose concepts,
01:56:57.340 | if those things will look very different
01:56:59.180 | than neural networks,
01:57:00.060 | or if you can do that within a single neural network
01:57:02.460 | with some particular sort of architecture quirks,
01:57:06.180 | that seems to be a really open problem.
01:57:07.860 | And of course I go up and down on that
01:57:09.580 | because there's something so compelling
01:57:12.740 | to the symbolic AI
01:57:14.980 | or to the ideas of logic-based sort of expert systems.
01:57:19.980 | You have like human interpretable facts
01:57:22.580 | that built on top of each other.
01:57:24.220 | It's really annoying, like with self-supervised learning,
01:57:27.900 | that the AI is not very explainable.
01:57:31.260 | Like you can't like understand
01:57:33.500 | all the beautiful things it has learned.
01:57:35.660 | You can't ask it like questions.
01:57:38.540 | But then again, maybe that's a stupid thing
01:57:41.060 | for us humans to want.
01:57:42.540 | But I think whenever we try to like understand it,
01:57:45.380 | we're putting our own subjective human bias into it.
01:57:48.500 | And I think that's the sort of problem.
01:57:50.140 | With self-supervised learning,
01:57:51.140 | the goal is that it should learn naturally from the data.
01:57:54.420 | So now if you try to understand it,
01:57:55.660 | you are using your own preconceived notions
01:57:58.780 | of what this model has learned.
01:58:00.740 | And that's the problem.
01:58:02.460 | - High level question.
01:58:04.660 | What do you think it takes to build a system
01:58:07.900 | with superhuman, maybe let's say human level
01:58:10.500 | or superhuman level, general intelligence?
01:58:13.500 | We've already kind of started talking about this,
01:58:15.580 | but what's your intuition?
01:58:17.740 | Like, does this thing have to have a body?
01:58:20.740 | Does it have to interact richly with the world?
01:58:23.900 | Does it have to have some more human elements
01:58:27.900 | like self-awareness?
01:58:30.460 | - I think emotion.
01:58:32.220 | I think emotion is something which is,
01:58:34.340 | like it's not really attributed
01:58:37.100 | typically in standard machine learning.
01:58:38.420 | It's not something we think about.
01:58:39.540 | Like there is NLP, there is vision,
01:58:41.020 | there is no like emotion.
01:58:42.580 | Emotion is never a part of all of this.
01:58:44.620 | And that just seems a little bit weird to me.
01:58:47.060 | I think the reason basically being that
01:58:48.860 | there is surprise, and like,
01:58:51.540 | basically one of the reasons emotion arises
01:58:53.780 | is when there's a mismatch between what happens
01:58:55.780 | and what you expect to happen, right?
01:58:57.100 | There is like a disconnect between these things.
01:58:59.420 | And so that gives rise to, like,
01:59:01.060 | I can either be surprised or I can be saddened
01:59:03.500 | or I can be happy and all of this.
01:59:05.300 | And so this basically indicates
01:59:07.940 | that I already have a predictive model in my head
01:59:10.140 | and something that I predicted
01:59:11.420 | or something that I thought was likely to happen.
01:59:13.700 | And then there was something that I observed that happened
01:59:15.540 | that there was a disconnect between these two things.
01:59:18.260 | And that basically is like maybe one of the reasons
01:59:21.820 | why you have a lot of emotions.
01:59:24.260 | - Yeah, I think, so I talked to people a lot about
01:59:26.740 | like Lisa Feldman Barrett.
01:59:29.100 | I think that's an interesting concept of emotion
01:59:31.660 | but I have a sense that emotion primarily
01:59:36.820 | in the way we think about it,
01:59:38.060 | which is the display of emotion
01:59:40.300 | is a communication mechanism between humans.
01:59:43.780 | So it's a part of basically human to human interaction,
01:59:47.380 | an important part, but just the part.
01:59:50.180 | So it's like, I would throw it into the full mix
01:59:55.060 | of communication.
01:59:58.020 | And to me, communication can be done with objects
02:00:01.260 | that don't look at all like humans.
02:00:04.340 | - Okay.
02:00:05.420 | I've seen our ability to anthropomorphize,
02:00:07.540 | our ability to connect with things that look like a Roomba,
02:00:10.660 | our ability to connect.
02:00:11.980 | First of all, let's talk about other biological systems
02:00:14.700 | like dogs, our ability to love things
02:00:17.420 | that are very different than humans.
02:00:19.380 | - But they do display emotion, right?
02:00:20.940 | I mean, dogs do display emotion.
02:00:23.180 | So they don't have to be anthropomorphic
02:00:25.300 | for them to like display the kind of emotion
02:00:27.540 | that we do.
02:00:28.380 | - Exactly.
02:00:29.860 | But then the word emotion starts to lose its meaning.
02:00:33.940 | - So then we have to be, I guess, specific, but yeah.
02:00:36.260 | So have rich, flavorful communication.
02:00:39.540 | - Communication, yeah.
02:00:40.380 | - Yeah, so like, yes, it's full of emotion.
02:00:42.980 | It's full of wit and humor and moods
02:00:47.980 | and all those kinds of things, yeah.
02:00:50.260 | So you're talking about like flavor.
02:00:53.740 | - Flavor, yeah.
02:00:54.580 | Okay, let's call it that.
02:00:55.420 | So there's content and then there is flavor
02:00:57.220 | and I'm talking about the flavor.
02:00:58.420 | - Do you think it needs to have a body?
02:01:00.260 | Do you think like to interact with the physical world,
02:01:02.820 | do you think you can understand the physical world
02:01:04.620 | without being able to directly interact with it?
02:01:07.020 | - I don't think so, yeah.
02:01:08.420 | I think at some point we will need to bite the bullet
02:01:10.660 | and actually interact with the physical world
02:01:12.660 | as much as I like working on like passive computer vision
02:01:15.820 | where I just like sit in my armchair
02:01:17.260 | and look at videos and learn.
02:01:19.020 | I do think that we will need to have
02:01:21.220 | some kind of embodiment or some kind of interaction
02:01:24.580 | to figure out things about the world.
02:01:26.900 | - What about consciousness?
02:01:28.580 | Do you think, how often do you think about consciousness
02:01:32.260 | when you think about your work?
02:01:34.300 | You could think of it as the more simple thing
02:01:36.500 | of self-awareness, of being aware that you are a perceiving,
02:01:41.500 | sensing, acting thing in this world,
02:01:46.800 | or you can think about the bigger version of that
02:01:50.300 | which is consciousness, which is having it feel
02:01:54.460 | like something to be that entity,
02:01:57.180 | the subjective experience of being in this world.
02:01:59.540 | - So I think of self-awareness a little bit more
02:02:01.380 | than the broader goal of it
02:02:03.380 | because I think self-awareness is pretty critical
02:02:06.100 | for any kind of AGI or whatever you want to call it
02:02:10.140 | that we build because it needs to contextualize
02:02:13.020 | what it is and what role it's playing
02:02:15.540 | with respect to all the other things that exist around it.
02:02:17.900 | I think that requires self-awareness.
02:02:19.620 | It needs to understand that it's an autonomous car, right?
02:02:23.180 | And what does that mean?
02:02:24.860 | What are its limitations?
02:02:26.180 | What are the things that it is supposed to do and so on?
02:02:29.020 | What is its role in some way?
02:02:30.700 | Or, I mean, these are the kind of things
02:02:34.180 | that we kind of expect from it, I would say.
02:02:36.820 | And so that's the level of self-awareness
02:02:39.300 | that's, I would say, basically required at least,
02:02:42.160 | if not more than that.
02:02:44.220 | - Yeah, I tend to, on the emotion side,
02:02:46.380 | believe that it has to be able to display consciousness.
02:02:51.380 | - Display consciousness, what do you mean by that?
02:02:54.300 | - Meaning like for us humans to connect with each other
02:02:57.580 | or to connect with other living entities,
02:03:01.660 | I think we need to feel, like in order for us
02:03:05.100 | to truly feel like that there's another being there,
02:03:09.380 | we have to believe that they're conscious.
02:03:11.420 | And so we won't ever connect with something
02:03:14.980 | that doesn't have elements of consciousness.
02:03:17.300 | Now, I tend to think that that's easier to achieve
02:03:21.540 | than it may sound 'cause we anthropomorphize stuff so hard.
02:03:25.700 | Like you have a mug that just like has wheels
02:03:28.740 | and like rotates every once in a while and makes a sound.
02:03:31.900 | I think a couple of days in, especially if you're,
02:03:35.360 | if you don't hang out with humans,
02:03:39.500 | you might start to believe that mug on wheels is conscious.
02:03:42.220 | So I think we anthropomorphize pretty effectively
02:03:44.860 | as human beings.
02:03:46.020 | But I do think that it's in the same bucket
02:03:49.260 | as what we'll call emotion.
02:03:54.740 | I think of consciousness as the capacity to suffer.
02:03:57.440 | And if you're an entity that's able to feel things
02:04:02.420 | in the world and to communicate that to others,
02:04:05.580 | I think that's a really powerful way to interact with humans.
02:04:10.940 | And in order to create an AGI system,
02:04:13.220 | I believe you should be able to richly interact with humans.
02:04:17.980 | Like humans would need to want to interact with you.
02:04:21.120 | Like it can't be like, it's the self-supervised learning
02:04:25.600 | versus like the robot shouldn't have to pay you
02:04:30.280 | to interact with it.
02:04:30.280 | So like it should be a natural fun thing.
02:04:33.600 | And then you're going to scale up significantly
02:04:36.080 | how much interaction it gets.
02:04:39.080 | It's the Alexa prize,
02:04:40.860 | which they're trying to get me to be a judge
02:04:42.880 | on their contest.
02:04:44.680 | Let's see if I wanna do that.
02:04:46.020 | But their challenge is to talk to you,
02:04:50.560 | make the human sufficiently interested
02:04:53.960 | that the human keeps talking for 20 minutes.
02:04:56.160 | - To Alexa?
02:04:57.120 | - To Alexa, yeah.
02:04:58.600 | And right now they're not even close to that
02:05:00.240 | 'cause it just gets so boring when you're like,
02:05:02.560 | when the intelligence is not there,
02:05:04.280 | it gets very not interesting to talk to it.
02:05:06.920 | And so the robot needs to be interesting.
02:05:08.960 | And one of the ways it can be interesting
02:05:10.400 | is display the capacity to love, to suffer.
02:05:14.680 | And I would say that essentially means
02:05:17.480 | the capacity to display consciousness.
02:05:20.920 | Like it is an entity much like a human being.
02:05:25.160 | Of course, what that really means,
02:05:27.320 | I don't know if that's fundamentally a robotics problem
02:05:30.520 | or some kind of problem that we're not yet even aware.
02:05:33.040 | Like if it is truly a hard problem of consciousness,
02:05:36.040 | I tend to maybe optimistically think it's a,
02:05:38.600 | we can pretty effectively fake it till we make it.
02:05:42.640 | So we can display a lot of human-like elements for a while
02:05:46.400 | and that will be sufficient to form
02:05:49.080 | really close connections with humans.
02:05:50.920 | What to you is the most beautiful idea
02:05:53.720 | in self-supervised learning?
02:05:55.840 | Like when you sit back with, I don't know,
02:05:59.040 | with a glass of wine in an armchair, just at a fireplace,
02:06:04.040 | just thinking how beautiful this world
02:06:08.280 | that you get to explore is,
02:06:10.080 | what do you think is the especially beautiful idea?
02:06:13.800 | - The fact that like at the object level,
02:06:16.520 | what objects are, some notion of objectness, emerges
02:06:20.000 | from these models by just like self-supervised learning.
02:06:23.720 | So for example, like one of the things with the DINO paper
02:06:28.720 | that I was a part of at Facebook is,
02:06:31.040 | the object sort of boundaries emerge
02:06:34.280 | from these representations.
02:06:35.640 | So if you have like a dog running in the field,
02:06:38.080 | the boundaries around the dog,
02:06:39.480 | the network is basically able to figure out
02:06:42.320 | what the boundaries of this dog are automatically.
02:06:45.520 | And it was never trained to do that.
02:06:47.040 | It was never trained to,
02:06:49.120 | no one taught it that this is a dog
02:06:51.000 | and these pixels belong to a dog.
02:06:52.720 | It's able to group these things together automatically.
02:06:55.000 | So that's one.
02:06:56.120 | And I think in general, that entire notion
02:06:58.080 | that this dumb idea that you take like these two crops
02:07:01.400 | of an image and then you say that the features
02:07:03.160 | should be similar, that has resulted in something like this,
02:07:06.040 | like the model is able to figure out
02:07:07.920 | what the dog pixels are and so on.
02:07:10.320 | That just seems like so surprising.
02:07:12.080 | And I mean, I don't think a lot of us even understand
02:07:16.200 | how that is happening really.
02:07:18.120 | And it's something we're taking for granted,
02:07:20.800 | maybe like a lot in terms of how we're setting up
02:07:23.120 | these algorithms, but it's just,
02:07:24.920 | it's a very beautiful and powerful idea.
02:07:26.800 | So it's really fundamentally telling us
02:07:30.240 | that there is so much signal in the pixels
02:07:32.400 | that we can be super dumb about it,
02:07:34.160 | about how we're setting up the self-supervised learning
02:07:36.040 | problem and despite being like super dumb about it,
02:07:39.560 | we'll actually get very good,
02:07:41.600 | like we'll actually get something that is able to do
02:07:43.960 | very like surprising things.
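
To make the "take two crops and make the features similar" recipe concrete, here is a minimal contrastive sketch of that objective in PyTorch. The tiny encoder, crop size, temperature, and random data are toy choices for illustration; the DINO work mentioned above actually uses a self-distillation loss with a teacher network rather than this InfoNCE form.

```python
# Minimal sketch of the "two crops of the same image should have similar
# features" idea, written as a contrastive (InfoNCE-style) loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

def two_random_crops(images, crop=24):
    # images: (B, C, H, W); take two random crops per image
    # (same crop box for the whole batch, purely for simplicity).
    B, C, H, W = images.shape
    views = []
    for _ in range(2):
        top = torch.randint(0, H - crop + 1, (1,)).item()
        left = torch.randint(0, W - crop + 1, (1,)).item()
        views.append(images[:, :, top:top + crop, left:left + crop])
    return views

class TinyEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=1)  # unit-norm embeddings

def info_nce(z1, z2, temperature=0.1):
    # Crops of the same image are positives; the other images in the batch
    # act as negatives, which keeps the features from collapsing.
    logits = z1 @ z2.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))      # diagonal entries are positives
    return F.cross_entropy(logits, targets)

encoder = TinyEncoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

images = torch.rand(8, 3, 32, 32)           # stand-in for real images
view1, view2 = two_random_crops(images)
loss = info_nce(encoder(view1), encoder(view2))
loss.backward()
opt.step()
print("contrastive loss on one toy batch:", loss.item())
```

The key design point is that the other images in the batch serve as negatives to prevent collapse to a constant representation; the non-contrastive methods discussed earlier in the conversation avoid collapse by other means.
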
02:07:45.680 | - I wonder if there's other concepts like objectness
02:07:48.240 | that can emerge.
02:07:50.340 | I don't know if you follow Francois Chollet,
02:07:53.560 | he had the competition for intelligence
02:07:56.600 | that's basically kind of like an IQ test,
02:07:59.520 | but for machines. And for an IQ test,
02:08:02.360 | you have to have a few concepts that you wanna apply.
02:08:05.360 | One of them is objectness.
02:08:07.800 | I wonder if those concepts can emerge
02:08:11.520 | through self-supervised learning on billions of images.
02:08:14.760 | - I think something like object permanence
02:08:16.320 | can definitely emerge, right?
02:08:17.440 | So that's like a fundamental concept which we have,
02:08:20.240 | maybe not through images, through video,
02:08:21.480 | but that's another concept that should be emergent from it,
02:08:25.140 | because it's not something that,
02:08:26.760 | like even if you don't teach humans
02:08:29.840 | about this concept of object permanence,
02:08:31.480 | it actually emerges.
02:08:32.480 | And the same thing for like animals, like dogs,
02:08:34.100 | I think actually, object permanence automatically
02:08:36.360 | is something that they are born with.
02:08:38.080 | So I think it should emerge from the data.
02:08:40.320 | It should emerge basically very quickly.
02:08:42.440 | - I wonder if ideas like symmetry, rotation,
02:08:45.880 | these kinds of things might emerge.
02:08:47.920 | - So I think rotation probably, yes, yeah, rotation, yes.
02:08:51.600 | - I mean, there's some constraints
02:08:52.680 | in the architecture itself.
02:08:54.040 | - Right.
02:08:55.200 | - But it's interesting if all of them could be,
02:08:59.240 | like counting was another one,
02:09:01.060 | being able to kind of understand
02:09:04.880 | that there's multiple objects of the same kind in the image
02:09:07.720 | and be able to count them.
02:09:09.020 | I wonder if all of that could be,
02:09:11.560 | if constructed correctly, they can emerge,
02:09:14.360 | 'cause then you can transfer those concepts
02:09:16.480 | to then interpret images at a deeper level.
02:09:20.680 | - Right.
02:09:21.520 | Counting I do believe, I mean, should be possible.
02:09:24.680 | I don't know like yet,
02:09:25.920 | but I do think it's not that far in the realm of possibility.
02:09:29.720 | - Yeah, that'd be interesting
02:09:30.560 | if using self-supervised learning on images
02:09:33.240 | can then be applied to solving those kinds of IQ tests,
02:09:36.520 | which seem currently to be kind of impossible.
02:09:38.840 | What idea do you believe might be true
02:09:43.320 | that most people think is not true
02:09:46.600 | or don't agree with you on?
02:09:48.560 | Is there something like that?
02:09:50.040 | - So this is going to be a little controversial,
02:09:52.400 | but okay, sure.
02:09:53.500 | I don't believe in simulation,
02:09:55.340 | like actually using simulation to do things very much.
02:09:58.840 | - Just to clarify, because this is a podcast
02:10:01.040 | where you talk about, are we living in a simulation often,
02:10:03.600 | you're referring to using simulation to construct worlds
02:10:08.000 | that you then leverage for machine learning.
02:10:10.320 | - Right, yeah.
02:10:11.160 | For example, like one example would be like
02:10:13.080 | to train an autonomous car driving system,
02:10:15.520 | you basically first build a simulator,
02:10:17.400 | which builds like the environment of the world.
02:10:19.840 | And then you basically have a lot of like,
02:10:22.680 | you train your machine learning system in that.
02:10:25.320 | So I believe it is possible,
02:10:27.520 | but I think it's a really expensive way of doing things.
02:10:30.920 | And at the end of it, you do need the real world.
02:10:33.800 | So I'm not sure.
02:10:35.560 | So maybe for certain settings,
02:10:36.960 | like maybe the payout is so large,
02:10:38.920 | like for autonomous driving,
02:10:39.960 | the payout is so large that you can actually invest
02:10:42.000 | that much money to build it.
02:10:43.400 | But I think as a general sort of principle,
02:10:45.520 | it does not apply to a lot of concepts.
02:10:47.040 | You can't really build simulations of everything.
02:10:49.760 | Not only because like one, it's expensive,
02:10:51.560 | but second, it's also not possible for a lot of things.
02:10:54.840 | So in general, like there is a lot of work
02:10:59.400 | on like using synthetic data and like synthetic simulators.
02:11:02.080 | I generally am not very, like I don't believe in that.
02:11:05.800 | - So you're saying it's very challenging visually,
02:11:09.000 | like to correctly like simulate the visual,
02:11:11.920 | like the lighting, all those kinds of things.
02:11:13.560 | - I mean, all these companies that you have, right?
02:11:15.640 | So like Pixar and like whatever,
02:11:17.840 | all these companies are,
02:11:19.800 | all this like computer graphics stuff
02:11:21.520 | is really about accurately,
02:11:22.880 | a lot of them is about like accurately trying to figure out
02:11:26.080 | how the lighting is and like how things reflect
02:11:28.560 | off of one another and so on.
02:11:30.400 | And like how sparkly things look and so on.
02:11:32.200 | So it's a very hard problem.
02:11:33.960 | So do we really need to solve that first
02:11:37.120 | to be able to like do computer vision?
02:11:39.360 | Probably not.
02:11:40.560 | - And for me, in the context of autonomous driving,
02:11:44.720 | it's very tempting to be able to use simulation, right?
02:11:47.960 | Because it's a safety critical application,
02:11:50.480 | but the other limitation of simulation
02:11:53.280 | that perhaps is a bigger one than the visual limitation
02:11:58.360 | is the behavior of objects.
02:12:00.720 | 'Cause you're ultimately interested in edge cases.
02:12:03.800 | And the question is how well can you generate edge cases
02:12:07.240 | in simulation, especially with human behavior?
02:12:10.960 | - I think another problem is like for autonomous driving,
02:12:13.240 | right, it's a constantly changing world.
02:12:15.160 | So say autonomous driving, like in 10 years from now,
02:12:18.480 | like there are lots of autonomous cars,
02:12:20.720 | but they're still going to be humans.
02:12:22.400 | So now say 50% of the agents are humans,
02:12:25.160 | 50% of the agents are autonomous,
02:12:26.800 | like car driving agents.
02:12:28.520 | So now the mixture has changed.
02:12:30.040 | So now the kinds of behaviors that you actually expect
02:12:32.280 | from the other agents or other cars on the road
02:12:35.120 | are actually going to be very different.
02:12:36.680 | And as the proportion of the number of autonomous cars
02:12:39.040 | to humans keeps changing,
02:12:40.400 | this behavior will actually change a lot.
02:12:42.560 | So now if you were to build a simulator
02:12:44.040 | based on just, like, right now, if you built it today,
02:12:46.400 | you don't have that many autonomous cars on the road.
02:12:48.360 | So you will try to like make all of the other agents
02:12:50.440 | in that simulator behave as humans,
02:12:52.960 | but that's not really going to hold true
02:12:54.600 | 10, 15, 20, 30 years from now.
02:12:57.360 | - Do you think we're living in a simulation?
02:12:59.240 | - No.
02:13:00.080 | - How hard is it?
02:13:02.760 | This is why I think it's an interesting question.
02:13:04.800 | How hard is it to build a video game,
02:13:07.720 | like virtual reality game, where it is so real,
02:13:11.800 | forget like ultra realistic
02:13:15.160 | to where you can't tell the difference,
02:13:17.320 | but like, it's so nice that you just wanna stay there.
02:13:20.800 | You just wanna stay there and you don't wanna come back.
02:13:24.920 | Do you think that's doable within our lifetime?
02:13:29.360 | - Within our lifetime?
02:13:30.480 | Probably, yeah.
02:13:32.160 | I eat healthy, I live long.
02:13:33.760 | (both laughing)
02:13:36.040 | - Does that make you sad that there'll be like,
02:13:38.440 | like a population of kids that basically spend 95%,
02:13:44.400 | 99% of their time in a virtual world?
02:13:50.080 | - Very, very hard question to answer.
02:13:51.920 | For certain people, it might be something
02:13:55.720 | that they really derive a lot of value out of,
02:13:58.120 | derive a lot of enjoyment and like happiness out of,
02:14:00.720 | and maybe the real world wasn't giving them that,
02:14:03.120 | that's why they did that.
02:14:03.960 | So maybe it is good for certain people.
02:14:05.920 | - So ultimately, if it maximizes happiness,
02:14:09.360 | - Right, I think if- - Who are we to judge?
02:14:10.720 | - Yeah, I think if it's making people happy,
02:14:12.720 | maybe it's okay.
02:14:14.400 | Again, I think it's, this is a very hard question.
02:14:18.280 | - So like you've been a part of a lot of amazing papers,
02:14:22.600 | what advice would you give to somebody
02:14:25.640 | on what it takes to write a good paper?
02:14:28.040 | Grad students writing papers now,
02:14:31.000 | are there common things that you've learned along the way
02:14:34.520 | that you think it takes,
02:14:35.760 | both for a good idea and a good paper?
02:14:39.000 | - Right, so I think both of these I've picked up from,
02:14:44.640 | like lots of people I've worked with in the past.
02:14:46.520 | So one of them is picking the right problem
02:14:48.680 | to work on in research is as important
02:14:51.040 | as like finding the solution to it.
02:14:53.680 | So, I mean, there are multiple reasons for this.
02:14:56.200 | So one is that there are certain problems
02:14:58.960 | that can actually be solved in a particular timeframe.
02:15:02.320 | So now say you want to work on finding the meaning of life.
02:15:06.400 | This is a great problem.
02:15:07.400 | I think most people will agree with that.
02:15:09.440 | But do you believe that your talents
02:15:12.240 | and like the energy that you'll spend on it
02:15:13.800 | will make a meaning,
02:15:15.520 | like make some kind of meaningful progress in your lifetime?
02:15:18.800 | If you are optimistic about it, then like go ahead.
02:15:20.960 | - That's why I started this podcast.
02:15:22.120 | I keep asking people about the meaning of life.
02:15:24.040 | I'm hoping by episode like 220, I'll figure it out.
02:15:28.000 | - Oh, not too many episodes to go.
02:15:30.280 | - All right.
02:15:31.720 | Maybe today, I don't know.
02:15:33.080 | But you're right.
02:15:33.920 | So that seems intractable at the moment.
02:15:36.280 | - Right, so I think it's just the fact of like,
02:15:39.000 | if you're starting a PhD, for example,
02:15:41.080 | what is one problem that you want to focus on
02:15:43.000 | that you do think is interesting enough,
02:15:45.720 | and you will be able to make a reasonable amount
02:15:47.800 | of headway into it in the time that you think you'll be doing a PhD for.
02:15:50.520 | So in that kind of a timeframe.
02:15:53.080 | So that's one.
02:15:53.920 | Of course, there's the second part,
02:15:54.800 | which is what excites you genuinely.
02:15:56.360 | So you shouldn't just pick problems
02:15:57.600 | that you are not excited about,
02:15:59.040 | because as a grad student or as a researcher,
02:16:01.840 | you really need to be passionate about it
02:16:03.240 | to continue doing that,
02:16:04.600 | because there are so many other things
02:16:05.760 | that you could be doing in life.
02:16:07.120 | So you really need to believe in that
02:16:08.280 | to be able to do that for that long.
02:16:10.760 | In terms of papers,
02:16:11.600 | I think the one thing that I've learned is,
02:16:13.720 | I've like in the past,
02:16:16.440 | whenever I used to write things,
02:16:17.760 | and even now, whenever I do that,
02:16:18.920 | I try to cram in a lot of things into the paper.
02:16:21.400 | Whereas what really matters
02:16:22.800 | is just pushing one simple idea.
02:16:24.760 | That's it.
02:16:25.760 | That's all, because the paper
02:16:29.120 | is going to be like, whatever, eight or nine pages.
02:16:32.200 | If you keep cramming in lots of ideas,
02:16:34.240 | it's really hard for the single thing
02:16:36.240 | that you believe in to stand out.
02:16:38.000 | So if you really try to just focus on like,
02:16:40.920 | especially in terms of writing,
02:16:41.920 | really try to focus on one particular idea
02:16:43.800 | and articulate it out in multiple different ways,
02:16:46.240 | it's far more valuable to the reader as well.
02:16:49.040 | And basically, to the reader, of course,
02:16:51.600 | because they get to,
02:16:53.120 | they know that this particular idea
02:16:54.400 | is associated with this paper.
02:16:56.160 | And also for you, because you have,
02:16:59.040 | like when you write about a particular idea
02:17:00.440 | in different ways, you think about it more deeply.
02:17:02.680 | So as a grad student,
02:17:03.600 | I used to always wait till the end,
02:17:05.520 | like maybe in the last week or whatever
02:17:07.720 | to write the paper,
02:17:08.720 | because I used to always believe
02:17:10.280 | that doing the experiments
02:17:11.360 | was actually the bigger part of research than writing.
02:17:13.880 | And my advisor always told me
02:17:15.240 | that you should start writing very early on.
02:17:16.680 | And I thought, oh, it doesn't matter.
02:17:17.920 | I don't know what he's talking about.
02:17:19.720 | But I think more and more I realized that's the case.
02:17:21.800 | Like whenever I write something that I'm doing,
02:17:24.040 | I actually think much better about it.
02:17:26.440 | And so if you start writing early on,
02:17:28.360 | you actually, I think, get better ideas,
02:17:31.200 | or at least you figure out like holes in your theory
02:17:33.800 | or like particular experiments
02:17:35.480 | that you should run to block those holes and so on.
02:17:38.760 | - Yeah, I'm continually surprised
02:17:40.320 | how many really good papers throughout history
02:17:43.600 | are quite short and quite simple.
02:17:47.400 | And there's a lesson to that.
02:17:49.800 | Like if you want to dream about writing a paper
02:17:52.600 | that changes the world,
02:17:54.400 | and you want to go by example,
02:17:56.800 | they're usually simple.
02:17:58.040 | - Yeah, yeah.
02:17:59.000 | - And that's, it's not cramming,
02:18:03.040 | it's focusing on one idea and thinking deeply.
02:18:07.240 | And you're right that the writing process itself
02:18:10.360 | reveals the idea.
02:18:12.280 | It challenges you to really think about what is the idea,
02:18:15.320 | the thread that ties it all together.
02:18:18.120 | - And so like a lot of famous researchers I know
02:18:21.520 | actually would start off like,
02:18:24.120 | first they would, even before the experiments were in,
02:18:27.000 | a lot of them would actually start
02:18:28.320 | with writing the introduction of the paper
02:18:30.400 | with zero experiments in.
02:18:32.160 | Because that at least helps them figure out
02:18:33.800 | what they're like, what they're trying to solve
02:18:35.800 | and how it fits in like the context of things right now.
02:18:38.680 | And that would really guide their entire research.
02:18:40.680 | So a lot of them would actually first write an intro
02:18:42.360 | with like zero experiments in
02:18:43.560 | and that's how they would start projects.
02:18:46.040 | - Some basic questions for people maybe
02:18:48.240 | that are more like beginners in this field.
02:18:51.960 | What's the best programming language to learn
02:18:54.080 | if you're interested in machine learning?
02:18:56.600 | - I would say Python,
02:18:57.440 | just because it's the easiest one to learn.
02:19:00.320 | And also a lot of like programming
02:19:03.160 | in machine learning happens in Python.
02:19:05.000 | So it'll, if you don't know any other programming language,
02:19:07.600 | Python is actually going to get you a long way.
02:19:09.560 | - Yeah, it seems like sort of a,
02:19:11.680 | it's a toss up question because it seems like Python
02:19:14.000 | is so much dominating the space now.
02:19:16.800 | But I wonder if there's an interesting alternative.
02:19:18.480 | Obviously there's like Swift
02:19:19.960 | and there's a lot of interesting alternatives popping up,
02:19:22.720 | even JavaScript.
02:19:23.960 | Or R, more like for the data science applications.
02:19:28.880 | But it seems like Python more and more
02:19:31.240 | is actually being used to teach like introduction
02:19:34.160 | to programming at universities.
02:19:35.920 | So it just combines everything very nicely.
02:19:38.320 | Even harder question.
02:19:41.880 | What are the pros and cons of PyTorch versus TensorFlow?
02:19:46.160 | - I see.
02:19:47.000 | Okay.
02:19:49.400 | - You can go with no comment.
02:19:51.320 | - So a disclaimer to this is that the last time
02:19:53.440 | I used TensorFlow was probably like four years ago.
02:19:56.440 | And so it was right when it had come out
02:19:58.200 | because so I started on like deep learning in 2014 or so
02:20:02.680 | and the dominant sort of framework for us then
02:20:06.480 | for vision was Caffe, which was out of Berkeley.
02:20:09.080 | And we used Caffe a lot, it was really nice.
02:20:11.280 | And then TensorFlow came in,
02:20:13.400 | which was basically like Python first.
02:20:15.120 | So Caffe was mainly C++
02:20:17.080 | and it had like very loose kind of Python binding.
02:20:19.080 | So Python wasn't really the first language you would use.
02:20:21.360 | You would really use either MATLAB or C++
02:20:24.720 | to like get stuff done in like Caffe.
02:20:28.280 | And then Python of course became popular a little bit later.
02:20:30.960 | So TensorFlow was basically around that time.
02:20:32.640 | So 2015, 2016 is when I last used it.
02:20:35.280 | It's been a while.
02:20:37.240 | - And then what, did you use Torch or did you--
02:20:40.600 | - So then I moved to LuaTorch,
02:20:42.560 | which was the Torch in Lua.
02:20:44.040 | And then in 2017, I think basically
02:20:46.400 | pretty much to PyTorch completely.
02:20:48.440 | - Oh, interesting.
02:20:49.280 | So you went to Lua, cool.
02:20:50.560 | - Yeah.
02:20:51.520 | - Huh, so you were there before it was cool.
02:20:54.240 | - Yeah, I mean, so LuaTorch was really good
02:20:56.360 | because it actually allowed you to do a lot
02:20:59.560 | of different kinds of things.
02:21:01.360 | Caffe, on the other hand, was very rigid in terms of its structure.
02:21:03.880 | Like you would create a neural network once and that's it.
02:21:06.800 | Whereas if you wanted like very dynamic graphs and so on,
02:21:09.320 | it was very hard to do that.
02:21:10.200 | And LuaTorch was much more friendly
02:21:11.600 | for all of these things.
02:21:13.560 | Okay, so in terms of PyTorch and TensorFlow,
02:21:15.600 | my personal bias is PyTorch
02:21:17.280 | just because I've been using it longer
02:21:19.080 | and I'm more familiar with it.
02:21:20.760 | And also that PyTorch is much easier to debug
02:21:23.560 | is what I find because it's imperative in nature
02:21:26.280 | compared to like TensorFlow, which is not imperative.
02:21:28.600 | But that's telling you a lot that basically
02:21:30.480 | the imperative design is sort of a way
02:21:33.320 | in which a lot of people are taught programming
02:21:35.240 | and that's what actually makes debugging easier for them.
02:21:38.160 | So like I learned programming in C, C++.
02:21:40.480 | And so for me, imperative way of programming
02:21:42.200 | is more natural.
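
For readers, here is a tiny sketch of what that imperative style buys you in practice: in PyTorch's eager execution the forward pass is ordinary Python, so prints and breakpoints work mid-computation. The toy network and the NaN check are illustrative assumptions, not anything specific from the conversation.

```python
# Why eager/imperative execution makes debugging feel natural: the forward
# pass is plain Python, so you can print shapes, inspect values, or drop a
# breakpoint mid-computation. The model here is just a toy two-layer network.
import torch
import torch.nn as nn

class DebuggableNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 32)
        self.fc2 = nn.Linear(32, 2)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        # Ordinary Python runs here, so you can inspect anything you like:
        print("hidden stats:", h.shape, h.mean().item(), h.std().item())
        if torch.isnan(h).any():      # e.g. catch numerical blow-ups
            breakpoint()              # drops into the Python debugger
        return self.fc2(h)

net = DebuggableNet()
out = net(torch.randn(4, 10))
print("output shape:", out.shape)
```

In graph-based TensorFlow 1.x, by contrast, you would typically have to fetch intermediate tensors through a session run to inspect them, which is the debugging friction being described.
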
02:21:43.200 | - Do you think it's good to have
02:21:45.280 | kind of these two communities, this kind of competition?
02:21:48.480 | I think PyTorch is kind of more and more becoming dominant
02:21:51.480 | in the research community,
02:21:52.520 | but TensorFlow is still very popular
02:21:54.600 | in the more sort of application machine learning community.
02:21:57.920 | So do you think it's good to have that kind of split
02:22:00.480 | in code bases or,
02:22:02.680 | so like the benefit there is the competition challenges
02:22:06.560 | the library developers to step up their game.
02:22:09.120 | - Yeah.
02:22:10.000 | - But the downside is there's these code bases
02:22:12.760 | that are in different libraries.
02:22:15.200 | - Right, so I think the downside is that,
02:22:17.080 | I mean, for a lot of research code
02:22:18.480 | that's released in one framework
02:22:19.640 | and if you're using the other one,
02:22:20.600 | it's really hard to like really build on top of it.
02:22:23.960 | But thankfully the open source community
02:22:25.800 | in machine learning is amazing.
02:22:27.080 | So whenever like something pops up in TensorFlow,
02:22:30.840 | you wait a few days and someone who's like super sharp
02:22:33.200 | will actually come and translate that particular code base
02:22:35.520 | into PyTorch and basically have figured out
02:22:38.400 | all the nooks and crannies.
02:22:39.720 | So the open source community is amazing
02:22:41.800 | and they really like figure out this gap.
02:22:44.280 | So I think in terms of like having these two frameworks
02:22:47.560 | or multiple, I think of course there are different use cases
02:22:49.720 | so there are going to be benefits
02:22:51.080 | to using one or the other framework.
02:22:52.840 | And like you said, I think competition is just healthy
02:22:54.720 | because both of these frameworks keep,
02:22:57.360 | or like all of these frameworks really sort of
02:22:59.040 | keep learning from each other
02:23:00.120 | and keep incorporating different things
02:23:01.640 | to just make them better and better.
02:23:03.760 | - What advice would you have for someone
02:23:06.320 | new to machine learning, you know,
02:23:09.680 | maybe just started or haven't even started
02:23:11.520 | but are curious about it and who want to get in the field?
02:23:14.880 | - Don't be afraid to get your hands dirty.
02:23:16.640 | I think that's the main thing.
02:23:17.600 | So if something doesn't work,
02:23:19.120 | like really drill into why things are not working.
02:23:22.200 | - Can you elaborate what your hands dirty means?
02:23:24.560 | - Right, so for example, like if an algorithm,
02:23:27.600 | if you try to train a network and it's not converging,
02:23:29.760 | whatever, rather than trying to like Google the answer
02:23:32.280 | or trying to do something,
02:23:33.440 | like really spend those like five, eight, 10, 15, 20,
02:23:36.360 | whatever number of hours really trying
02:23:37.600 | to figure it out yourself.
02:23:39.040 | 'Cause in that process, you'll actually learn a lot more.
02:23:41.360 | - Yeah.
02:23:42.560 | - Googling is of course like a good way to solve it
02:23:44.640 | when you need a quick answer.
02:23:46.040 | But I think initially, especially like
02:23:47.680 | when you're starting out, it's much nicer
02:23:49.400 | to like figure things out by yourself.
02:23:51.880 | And I just say that from experience
02:23:53.000 | because like when I started out,
02:23:54.320 | there were not a lot of resources.
02:23:55.520 | So we would like in the lab, a lot of us,
02:23:57.920 | like we would look up to senior students
02:23:59.720 | and the senior students were of course busy
02:24:01.400 | and they would be like, "Hey, why don't you go figure it out
02:24:03.120 | because I just don't have the time.
02:24:04.360 | I'm working on my dissertation or whatever."
02:24:06.520 | Final-year PhD students.
02:24:07.680 | And so then we would sit down
02:24:08.800 | and like just try to figure it out.
02:24:10.520 | And that I think really helped me.
02:24:12.480 | That has really helped me figure a lot of things out.
02:24:15.080 | - I think in general, if I were to generalize that,
02:24:18.760 | I feel like persevering through any kind of struggle
02:24:22.800 | on a thing you care about is good.
02:24:25.680 | So you're basically, you try to make it seem like
02:24:28.200 | it's good to spend time debugging,
02:24:30.880 | but really any kind of struggle,
02:24:32.600 | whatever form that takes, it could be just Googling a lot.
02:24:36.120 | Just basically anything, just go sticking with it
02:24:38.760 | and go into the hard thing that could take a form
02:24:41.080 | of implementing stuff from scratch.
02:24:43.280 | It could take the form of re-implementing
02:24:45.680 | with different libraries or different programming languages.
02:24:49.400 | It could take a lot of different forms,
02:24:50.640 | but struggle is good for the soul.
02:24:53.560 | - So like in Pittsburgh, when I did my PhD,
02:24:55.840 | the thing was it used to snow a lot.
02:24:57.560 | - Yeah.
02:24:58.400 | - And so when it snowed, you really couldn't do much.
02:25:00.840 | So the thing that a lot of people said was,
02:25:03.000 | "Snow builds character."
02:25:05.360 | Because when it's snowing, you can't do anything else.
02:25:07.520 | You focus on work.
02:25:09.080 | - Do you have advice in general for people,
02:25:10.840 | you've already exceptionally successful, you're young,
02:25:13.440 | but do you have advice for young people starting out
02:25:15.800 | in college or maybe in high school?
02:25:17.840 | You know, advice for their career, advice for their life,
02:25:21.080 | how to pave a successful path in career and life?
02:25:25.760 | - I would say just be hungry.
02:25:27.400 | Like always be hungry for what you want.
02:25:29.760 | And I think like, I've been inspired by a lot of people
02:25:33.360 | who are just like driven and who really like go
02:25:35.840 | for what they want, no matter what, like,
02:25:38.440 | you shouldn't want it, you should need it.
02:25:40.560 | So if you need something, you basically go
02:25:42.640 | towards the ends to make it work.
02:25:44.400 | - How do you know when you come across a thing
02:25:47.120 | that's like you need?
02:25:51.160 | - I think there's not going to be any single thing
02:25:53.120 | that you're going to need.
02:25:53.960 | There are going to be different types of things
02:25:55.000 | that you need, but whenever you need something,
02:25:56.640 | you just go push for it.
02:25:57.960 | And of course, you may not get it,
02:26:00.080 | or you may find that this was not even the thing
02:26:02.000 | that you were looking for, it might be a different thing.
02:26:03.720 | But the point is like you're pushing through things
02:26:06.280 | and that actually brings a lot of skills
02:26:09.000 | and like,
02:26:11.520 | builds a certain kind of attitude,
02:26:12.920 | which will probably help you get the other thing.
02:26:15.680 | Once you figure out what's really the thing that you want.
02:26:18.080 | - Yeah, I think a lot of people are,
02:26:23.240 | I've noticed, kind of afraid of that because, one,
02:26:23.240 | it's a fear of commitment.
02:26:24.880 | And two, there's so many amazing things in this world.
02:26:26.880 | You almost don't want to miss out
02:26:28.120 | on all the other amazing things
02:26:29.400 | by committing to this one thing.
02:26:31.040 | So I think a lot of it has to do with just allowing yourself
02:26:33.840 | to like notice that thing and just go all the way with it.
02:26:41.560 | - I mean, I also like failure, right?
02:26:43.240 | So I know this is like super cheesy that failure
02:26:47.280 | is something that you should be prepared for and so on.
02:26:49.760 | But I do think, I mean, especially in research,
02:26:52.520 | for example, failure is something that happens almost like,
02:26:55.240 | almost every day is like experiments failing and not working.
02:26:59.120 | And so you really need to be so used to it.
02:27:02.240 | You need to have a thick skin.
02:27:03.880 | But it's only basically
02:27:06.280 | when you get through it that you find
02:27:07.880 | the one thing that's actually working.
02:27:09.560 | So Thomas Edison was like one person like that.
02:27:11.840 | So I really, like when I was a kid,
02:27:13.680 | I used to really read about how he found like the filament,
02:27:17.040 | the light bulb filament.
02:27:18.680 | And then he, I think his thing was like,
02:27:20.600 | he tried 990 things that didn't work
02:27:23.120 | or something of the sort.
02:27:24.320 | And then they asked him like, so what did you learn?
02:27:26.920 | Because all of these were failed experiments.
02:27:28.480 | And then he says, oh, these 990 things don't work.
02:27:31.600 | And I know that.
02:27:32.440 | Did you know that?
02:27:33.280 | (laughing)
02:27:34.120 | I mean, that's pretty inspiring.
02:27:35.960 | - So you spent a few years on this earth
02:27:38.480 | performing a self-supervised kind of learning process.
02:27:43.480 | Have you figured out the meaning of life yet?
02:27:46.400 | I told you I'm doing this podcast to try to get the answer.
02:27:49.120 | I'm hoping you could tell me.
02:27:50.720 | What do you think the meaning of it all is?
02:27:52.880 | - I don't think I figured this out.
02:27:55.760 | No, I have no idea.
02:27:57.080 | (laughing)
02:27:58.960 | - Do you think AI will help us figure it out?
02:28:02.520 | Or do you think there's no answer?
02:28:03.880 | The whole point is to keep searching.
02:28:05.440 | - I think, yeah, I think it's an endless
02:28:07.520 | sort of quest for us.
02:28:08.760 | I don't think AI will help us there.
02:28:10.520 | This is like a very hard, hard, hard question,
02:28:13.560 | which so many humans have tried to answer.
02:28:15.400 | - Well, that's the interesting thing.
02:28:16.600 | Difference between AI and humans.
02:28:18.400 | Humans don't seem to know what the hell they're doing.
02:28:21.800 | And AI is almost always operating
02:28:23.680 | under well-defined objective functions.
02:28:27.360 | And I wonder like, whether our lack of ability
02:28:33.320 | to define good long-term objective functions
02:28:37.200 | or introspect what is the objective function
02:28:40.400 | under which we operate, if that's a feature or a bug.
02:28:44.360 | - I would say it's a feature
02:28:45.200 | because then everyone actually has very different kinds
02:28:47.400 | of objective functions that they're optimizing.
02:28:49.320 | And those objective functions evolve
02:28:51.280 | and like change dramatically
02:28:52.800 | through the course of their life.
02:28:53.840 | That's actually what makes us interesting, right?
02:28:55.960 | If otherwise, like if everyone was doing
02:28:58.000 | the exact same thing, that would be pretty boring.
02:29:00.520 | We do want like people with different kinds of perspectives.
02:29:03.800 | Also people evolve continuously.
02:29:06.120 | That's like, I would say the biggest feature of being human.
02:29:09.280 | - And then we get to like the ones that die
02:29:11.120 | because they do something stupid.
02:29:12.520 | We get to watch that, see it and learn from it.
02:29:15.400 | And as a species, we take that lesson
02:29:20.320 | and become better and better
02:29:22.560 | because of all the dumb people in the world
02:29:24.240 | that died doing something wild and beautiful.
02:29:28.120 | Ishan, thank you so much for this incredible conversation.
02:29:31.800 | We did a depth first search
02:29:34.800 | through the space of machine learning
02:29:38.000 | and it was fun and fascinating.
02:29:41.600 | So it's really an honor to meet you
02:29:43.880 | and it was a really awesome conversation.
02:29:45.760 | Thanks for coming down today and talking with me.
02:29:48.160 | - Thanks Lex.
02:29:49.000 | I mean, I've listened to you.
02:29:50.200 | I told you it was unreal for me
02:29:51.720 | to actually meet you in person
02:29:52.960 | and I'm so happy to be here.
02:29:54.080 | Thank you.
02:29:55.000 | - Thanks man.
02:29:55.840 | Thanks for listening to this conversation
02:29:58.160 | with Ishan Misra.
02:29:59.360 | And thank you to Onnit, The Information,
02:30:02.440 | Grammarly and Athletic Greens.
02:30:05.280 | Check them out in the description to support this podcast.
02:30:08.560 | And now let me leave you with some words
02:30:10.440 | from Arthur C. Clarke.
02:30:12.480 | Any sufficiently advanced technology
02:30:14.920 | is indistinguishable from magic.
02:30:18.120 | Thank you for listening and hope to see you next time.
02:30:21.000 | (upbeat music)
02:30:23.600 | (upbeat music)