
Ian Goodfellow: Generative Adversarial Networks (GANs) | Lex Fridman Podcast #19


Chapters

0:00 Introduction
1:08 Deep learning limitations
2:42 Function estimators
6:53 Self-awareness
8:58 Difficult cases
12:50 Hidden voice commands
14:00 Writing a deep learning chapter
16:44 What is deep learning
18:28 What is an example of deep learning
20:36 What could an alternative direction of training neural networks look like
21:43 Are you optimistic about us discovering something better
24:17 How do we build knowledge representation
25:17 Differentiable knowledge bases
26:40 GANs at a bar
27:54 Deep Boltzmann machines
30:20 What are GANs
33:26 How do GANs work
36:44 Types of GANs
39:51 History of GANs
43:31 Semi-supervised GANs
44:26 Class labels
46:22 Zebra cycle
49:31 Data augmentation
52:10 Fairness

Transcript

00:00:00.000 | The following is a conversation with Ian Goodfellow.
00:00:03.720 | He's the author of the popular textbook on deep learning,
00:00:06.360 | simply titled "Deep Learning."
00:00:08.920 | He coined the term generative adversarial networks,
00:00:12.320 | otherwise known as GANs,
00:00:14.560 | and with his 2014 paper is responsible
00:00:18.160 | for launching the incredible growth
00:00:20.440 | of research and innovation
00:00:22.120 | in this subfield of deep learning.
00:00:24.720 | He got his BS and MS at Stanford,
00:00:27.520 | his PhD at University of Montreal
00:00:30.120 | with Yoshua Bengio and Aaron Courville.
00:00:33.320 | He held several research positions,
00:00:35.240 | including at OpenAI, Google Brain,
00:00:37.600 | and now at Apple as the director of machine learning.
00:00:41.560 | This recording happened while Ian was still at Google Brain,
00:00:45.400 | but we don't talk about anything specific to Google
00:00:48.520 | or any other organization.
00:00:50.760 | This conversation is part
00:00:52.480 | of the Artificial Intelligence podcast.
00:00:54.520 | If you enjoy it, subscribe on YouTube, iTunes,
00:00:57.560 | or simply connect with me on Twitter @lexfridman,
00:01:00.880 | spelled F-R-I-D.
00:01:03.000 | And now, here's my conversation with Ian Goodfellow.
00:01:07.080 | You open your popular deep learning book
00:01:10.960 | with a Russian doll type diagram
00:01:13.600 | that shows deep learning as a subset
00:01:15.880 | of representation learning,
00:01:17.120 | which in turn is a subset of machine learning,
00:01:19.960 | and finally a subset of AI.
00:01:22.520 | So this kind of implies that there may be limits
00:01:25.280 | to deep learning in the context of AI.
00:01:27.720 | So what do you think are the current limits of deep learning,
00:01:31.560 | and are those limits something
00:01:33.120 | that we can overcome with time?
00:01:35.760 | - Yeah, I think one of the biggest limitations
00:01:37.720 | of deep learning is that right now
00:01:39.320 | it requires really a lot of data, especially labeled data.
00:01:42.920 | There are some unsupervised
00:01:45.480 | and semi-supervised learning algorithms
00:01:47.140 | that can reduce the amount of labeled data you need,
00:01:49.480 | but they still require a lot of unlabeled data.
00:01:52.200 | Reinforcement learning algorithms, they don't need labels,
00:01:54.240 | but they need really a lot of experiences.
00:01:56.320 | As human beings, we don't learn to play Pong
00:01:58.960 | by failing at Pong 2 million times.
00:02:01.600 | So just getting the generalization ability better
00:02:05.920 | is one of the most important bottlenecks
00:02:08.080 | in the capability of the technology today.
00:02:10.600 | And then I guess I'd also say deep learning
00:02:12.400 | is like a component of a bigger system.
00:02:15.660 | So far, nobody is really proposing to have
00:02:20.640 | only what you'd call deep learning
00:02:22.040 | as the entire ingredient of intelligence.
00:02:25.560 | You use deep learning as sub-modules of other systems,
00:02:29.880 | like AlphaGo has a deep learning model
00:02:32.360 | that estimates the value function.
00:02:34.160 | Most reinforcement learning algorithms
00:02:36.620 | have a deep learning module
00:02:37.920 | that estimates which action to take next,
00:02:40.360 | but you might have other components.
00:02:42.520 | - So you're basically building a function estimator.
00:02:46.120 | Do you think it's possible,
00:02:48.640 | you said nobody's kind of been thinking
00:02:50.180 | about this so far, but do you think neural networks
00:02:52.280 | could be made to reason in the way symbolic systems did
00:02:56.080 | in the '80s and '90s to do more,
00:02:58.800 | create more like programs as opposed to functions?
00:03:01.480 | - Yeah, I think we already see that a little bit.
00:03:03.980 | I already kind of think of neural nets as a kind of program.
00:03:08.880 | I think of deep learning as basically learning programs
00:03:12.960 | that have more than one step.
00:03:15.320 | So if you draw a flow chart,
00:03:17.760 | or if you draw a TensorFlow graph
00:03:19.580 | describing your machine learning model,
00:03:21.900 | I think of the depth of that graph
00:03:23.540 | as describing the number of steps that run in sequence,
00:03:25.900 | and then the width of that graph
00:03:27.660 | is the number of steps that run in parallel.
00:03:30.180 | Now it's been long enough
00:03:31.720 | that we've had deep learning working
00:03:32.940 | that it's a little bit silly
00:03:33.920 | to even discuss shallow learning anymore.
00:03:35.780 | But back when I first got involved in AI,
00:03:38.940 | when we used machine learning,
00:03:40.140 | we were usually learning things
00:03:41.320 | like support vector machines.
00:03:43.740 | You could have a lot of input features to the model,
00:03:45.660 | and you could multiply each feature by a different weight.
00:03:48.140 | All those multiplications were done in parallel
00:03:50.120 | to each other.
00:03:51.260 | There wasn't a lot done in series.
00:03:52.740 | I think what we got with deep learning
00:03:54.360 | was really the ability to have steps of a program
00:03:58.420 | that run in sequence.
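
A rough sketch of that width-versus-depth picture; the framework (PyTorch) and the layer sizes here are illustrative choices, not anything from the conversation:

```python
# Width = operations that run in parallel, depth = learned steps in sequence.
import torch
import torch.nn as nn

# A "shallow" model: one learned step. All 512 input features are multiplied
# by weights in parallel and combined into 10 outputs.
shallow = nn.Linear(512, 10)

# A "deep" model: several learned steps composed one after another.
# Each nn.Linear is wide (parallel multiplications), but the layers
# themselves run in sequence.
deep = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

x = torch.randn(32, 512)        # a batch of 32 examples
print(shallow(x).shape)         # torch.Size([32, 10]) -- one step
print(deep(x).shape)            # torch.Size([32, 10]) -- three steps in sequence
```
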
00:04:00.340 | And I think that we've actually started to see
00:04:03.180 | that what's important with deep learning
00:04:05.020 | is more the fact that we have a multi-step program
00:04:07.980 | rather than the fact that we've learned a representation.
00:04:10.780 | If you look at things like ResNets, for example,
00:04:15.140 | they take one particular kind of representation
00:04:18.660 | and they update it several times.
00:04:21.060 | Back when deep learning first really took off
00:04:23.560 | in the academic world in 2006,
00:04:25.740 | when Geoff Hinton showed
00:04:27.660 | that you could train deep belief networks,
00:04:30.180 | everybody who was interested in the idea
00:04:31.980 | thought of it as each layer
00:04:33.560 | learns a different level of abstraction.
00:04:35.940 | That the first layer trained on images
00:04:37.820 | learns something like edges,
00:04:38.940 | and the second layer learns corners,
00:04:40.420 | and eventually you get these kind of grandmother cell units
00:04:43.320 | that recognize specific objects.
00:04:45.920 | Today, I think most people think of it
00:04:47.980 | more as a computer program,
00:04:50.660 | where as you add more layers,
00:04:51.980 | you can do more updates before you output your final number.
00:04:55.140 | But I don't think anybody believes
00:04:56.420 | that layer 150 of the ResNet is a grandmother cell,
00:05:01.420 | and layer 100 is contours or something like that.
00:05:05.080 | - Okay, so you're not thinking of it
00:05:08.180 | as a singular representation that keeps building.
00:05:11.520 | You think of it as a program,
00:05:14.060 | sort of almost like a state.
00:05:15.940 | Representation is a state of understanding.
00:05:18.580 | - Yeah, I think of it as a program
00:05:20.260 | that makes several updates
00:05:21.500 | and arrives at better and better understandings,
00:05:23.820 | but it's not replacing the representation at each step.
00:05:27.500 | It's refining it.
00:05:29.160 | And in some sense, that's a little bit like reasoning.
00:05:31.660 | It's not reasoning in the form of deduction,
00:05:33.560 | but it's reasoning in the form of taking a thought
00:05:36.940 | and refining it and refining it carefully
00:05:39.420 | until it's good enough to use.
00:05:41.260 | - So do you think, and I hope you don't mind,
00:05:43.580 | we'll jump philosophical every once in a while.
00:05:46.020 | Do you think of cognition, human cognition,
00:05:50.460 | or even consciousness as simply a result
00:05:53.500 | of this kind of sequential representation learning,
00:05:58.100 | do you think that can emerge?
00:06:00.420 | - Cognition, yes, I think so.
00:06:02.460 | Consciousness, it's really hard
00:06:03.700 | to even define what we mean by that.
00:06:06.440 | I guess there's, consciousness is often defined
00:06:09.820 | as things like having self-awareness,
00:06:12.060 | and that's relatively easy to turn into something actionable
00:06:16.060 | for a computer scientist to reason about.
00:06:18.380 | People also define consciousness
00:06:19.700 | in terms of having qualitative states of experience,
00:06:22.420 | like qualia, and there's all these philosophical problems,
00:06:25.260 | like could you imagine a zombie
00:06:27.820 | who does all the same information processing as a human,
00:06:30.700 | but doesn't really have the qualitative experiences
00:06:33.460 | that we have?
00:06:34.660 | That sort of thing, I have no idea how to formalize
00:06:37.540 | or turn it into a scientific question.
00:06:39.940 | I don't know how you could run an experiment
00:06:41.580 | to tell whether a person is a zombie or not.
00:06:44.860 | And similarly, I don't know how you could run an experiment
00:06:47.180 | to tell whether an advanced AI system
00:06:49.640 | had become conscious in the sense of qualia or not.
00:06:53.020 | - But in the more practical sense,
00:06:54.540 | like almost like self-attention,
00:06:56.260 | you think consciousness and cognition can,
00:06:58.900 | in an impressive way, emerge from current types
00:07:03.220 | of architectures that we think of as deep learning.
00:07:05.540 | - Or if you think of consciousness
00:07:07.940 | in terms of self-awareness and just making plans
00:07:12.180 | based on the fact that the agent itself exists in the world,
00:07:16.580 | reinforcement learning algorithms
00:07:18.000 | are already more or less forced
00:07:20.140 | to model the agent's effect on the environment.
00:07:23.060 | So that more limited version of consciousness
00:07:26.340 | is already something that we get limited versions of
00:07:31.340 | with reinforcement learning algorithms
00:07:32.980 | if they're trained well.
00:07:34.660 | - But you say limited.
00:07:37.420 | So the big question really is how you jump
00:07:39.900 | from limited to human level, right?
00:07:42.100 | And whether it's possible.
00:07:44.620 | Even just building common sense reasoning
00:07:49.020 | seems to be exceptionally difficult.
00:07:50.540 | So if we scale things up,
00:07:52.500 | if we get much better on supervised learning,
00:07:55.020 | if we get better at labeling,
00:07:56.620 | if we get bigger data sets, more compute,
00:08:00.620 | do you think we'll start to see really impressive things
00:08:03.860 | that go from limited to something,
00:08:08.340 | echoes of human level cognition?
00:08:10.340 | - I think so, yeah.
00:08:11.180 | I'm optimistic about what can happen
00:08:13.340 | just with more computation and more data.
00:08:16.420 | I do think it'll be important to get the right kind of data.
00:08:20.100 | Today, most of the machine learning systems we train
00:08:23.140 | are mostly trained on one type of data for each model.
00:08:27.540 | But the human brain, we get all of our different senses
00:08:31.380 | and we have many different experiences
00:08:33.860 | like riding a bike, driving a car,
00:08:36.300 | talking to people, reading.
00:08:37.940 | I think when you get that kind of integrated data set
00:08:42.420 | working with a machine learning model
00:08:44.420 | that can actually close the loop and interact,
00:08:47.660 | we may find that algorithms not so different
00:08:50.460 | from what we have today learn really interesting things
00:08:53.260 | when you scale them up a lot
00:08:54.380 | and train them on a large amount of multimodal data.
00:08:58.220 | - So multimodal is really interesting,
00:08:59.620 | but within, like you're working adversarial examples,
00:09:04.020 | so selecting within modal, within one mode of data,
00:09:09.020 | selecting better what are the difficult cases
00:09:13.780 | that are most useful to learn from?
00:09:16.140 | - Oh yeah, like could we get a whole lot of mileage
00:09:18.860 | out of designing a model that's resistant
00:09:22.260 | to adversarial examples or something like that?
00:09:24.100 | - Right, that's a question.
00:09:26.260 | - My thinking on that has evolved a lot
00:09:27.740 | over the last few years.
00:09:28.900 | - Oh, interesting.
00:09:29.940 | - When I first started to really invest
00:09:31.260 | in studying adversarial examples,
00:09:32.740 | I was thinking of it mostly as adversarial examples
00:09:36.340 | reveal a big problem with machine learning
00:09:38.980 | and we would like to close the gap
00:09:41.180 | between how machine learning models
00:09:43.700 | respond to adversarial examples and how humans respond.
00:09:46.540 | After studying the problem more,
00:09:49.180 | I still think that adversarial examples are important.
00:09:51.940 | I think of them now more of as a security liability
00:09:55.420 | than as an issue that necessarily shows
00:09:57.780 | there's something uniquely wrong with machine learning
00:10:01.260 | as opposed to humans.
00:10:02.820 | - Also, do you see them as a tool
00:10:04.620 | to improve the performance of the system?
00:10:06.460 | Not on the security side, but literally just accuracy.
00:10:10.780 | - I do see them as a kind of tool on that side,
00:10:13.460 | but maybe not quite as much as I used to think.
00:10:16.660 | We've started to find that there's a trade-off
00:10:18.500 | between accuracy on adversarial examples
00:10:21.660 | and accuracy on clean examples.
00:10:24.380 | Back in 2014, when I did the first
00:10:27.140 | adversarially trained classifier
00:10:29.060 | that showed resistance to some kinds of adversarial examples,
00:10:33.020 | it also got better at the clean data on MNIST.
00:10:36.020 | And that's something we've replicated
00:10:37.100 | several times on MNIST,
00:10:39.020 | that when we train against weak adversarial examples,
00:10:41.500 | MNIST classifiers get more accurate.
00:10:43.900 | So far, that hasn't really held up on other data sets
00:10:47.100 | and hasn't held up when we train
00:10:48.860 | against stronger adversaries.
00:10:50.740 | It seems like when you confront
00:10:53.180 | a really strong adversary,
00:10:55.740 | you tend to have to give something up.
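
A minimal sketch of the kind of adversarial training being described, using the fast gradient sign method as one standard recipe from that line of work; the model, epsilon, and data below are placeholders:

```python
# Adversarial training sketch: perturb inputs in the direction that most
# increases the loss, then train on a mix of clean and perturbed examples.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fgsm_perturb(model, x, y, epsilon=0.1):
    """One signed-gradient step on the input to create an adversarial example."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + epsilon * x_adv.grad.sign()).detach()

def adversarial_training_step(model, optimizer, x, y, epsilon=0.1):
    """Train on half clean, half adversarial examples."""
    x_adv = fgsm_perturb(model, x, y, epsilon)
    optimizer.zero_grad()
    loss = 0.5 * F.cross_entropy(model(x), y) + \
           0.5 * F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage on random data standing in for MNIST-sized inputs.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.rand(64, 1, 28, 28), torch.randint(0, 10, (64,))
adversarial_training_step(model, opt, x, y)
```
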
00:10:58.100 | - Interesting.
00:10:59.060 | But it's such a compelling idea,
00:11:00.540 | 'cause it feels like that's how us humans learn
00:11:04.740 | as to the difficult cases.
00:11:06.340 | - We try to think of what would we screw up,
00:11:08.820 | and then we make sure we fix that.
00:11:11.020 | It's also, in a lot of branches of engineering,
00:11:13.700 | you do a worst-case analysis
00:11:15.820 | and make sure that your system will work in the worst case.
00:11:18.740 | And then that guarantees that it'll work
00:11:20.420 | in all of the messy average cases
00:11:23.580 | that happen when you go out into a really randomized world.
00:11:27.420 | - Yeah, with driving with autonomous vehicles,
00:11:29.540 | there seems to be a desire to just look for,
00:11:33.060 | think adversarially,
00:11:34.860 | try to figure out how to mess up the system.
00:11:36.900 | And if you can be robust to all those difficult cases,
00:11:40.620 | then you can, it's a hand-wavy,
00:11:42.900 | empirical way to show your system is safe.
00:11:45.820 | - Yeah, yeah.
00:11:47.020 | Today, most adversarial example research
00:11:49.100 | isn't really focused on a particular use case,
00:11:51.620 | but there are a lot of different use cases
00:11:54.020 | where you'd like to make sure that the adversary
00:11:56.940 | can't interfere with the operation of your system.
00:12:00.220 | Like in finance,
00:12:01.060 | if you have an algorithm making trades for you,
00:12:03.300 | people go to a lot of effort
00:12:04.660 | to obfuscate their algorithm.
00:12:06.660 | That's both to protect their IP,
00:12:08.060 | because you don't wanna research
00:12:10.860 | and develop a profitable trading algorithm
00:12:13.580 | then have somebody else capture the gains.
00:12:16.100 | But it's at least partly
00:12:17.140 | because you don't want people to make adversarial examples
00:12:19.500 | that fool your algorithm into making bad trades.
00:12:22.580 | Or I guess one area that's been popular
00:12:26.580 | in the academic literature is speech recognition.
00:12:30.180 | If you use speech recognition to hear an audio waveform
00:12:34.420 | and then turn that into a command
00:12:37.700 | that a phone executes for you,
00:12:39.660 | you don't want a malicious adversary
00:12:41.900 | to be able to produce audio
00:12:43.620 | that gets interpreted as malicious commands,
00:12:46.300 | especially if a human in the room
00:12:47.820 | doesn't realize that something like that is happening.
00:12:50.300 | - In speech recognition,
00:12:52.020 | has there been much success
00:12:53.900 | in being able to create adversarial examples
00:12:58.460 | that fool the system?
00:12:59.780 | - Yeah, actually.
00:13:00.860 | I guess the first work that I'm aware of
00:13:02.420 | is a paper called "Hidden Voice Commands"
00:13:05.140 | that came out in 2016, I believe.
00:13:08.460 | And they were able to show that
00:13:10.780 | they could make sounds that are not understandable
00:13:13.780 | by a human, but are recognized as the target phrase
00:13:18.420 | that the attacker wants the phone to recognize it as.
00:13:21.340 | Since then, things have gotten a little bit better
00:13:24.020 | on the attacker side and worse on the defender side.
00:13:27.580 | It's become possible to make sounds
00:13:33.380 | that sound like normal speech,
00:13:35.580 | but are actually interpreted as a different sentence
00:13:38.980 | than the human hears.
00:13:40.700 | The level of perceptibility
00:13:42.740 | of the adversarial perturbation is still kind of high.
00:13:45.420 | When you listen to the recording,
00:13:48.180 | it sounds like there's some noise in the background,
00:13:51.020 | just like rustling sounds.
00:13:52.940 | But those rustling sounds
00:13:53.940 | are actually the adversarial perturbation
00:13:55.540 | that makes the phone hear a completely different sentence.
00:13:58.020 | - Yeah, that's so fascinating.
00:14:00.100 | Peter Norvig mentioned that you're writing
00:14:01.620 | the deep learning chapter for the fourth edition
00:14:04.260 | of the "Artificial Intelligence, a Modern Approach" book.
00:14:07.340 | So how do you even begin summarizing
00:14:10.700 | the field of deep learning in a chapter?
00:14:12.700 | (laughs)
00:14:13.940 | - Well, in my case, I waited like a year
00:14:16.900 | before I actually wrote anything.
00:14:19.180 | Even having written a full-length textbook before,
00:14:22.660 | it's still pretty intimidating to try to start writing
00:14:26.820 | just one chapter that covers everything.
00:14:29.100 | One thing that helped me make that plan
00:14:33.220 | was actually the experience
00:14:34.340 | of having written the full book before
00:14:36.740 | and then watching how the field changed
00:14:39.140 | after the book came out.
00:14:40.940 | I've realized there's a lot of topics
00:14:42.300 | that were maybe extraneous in the first book.
00:14:45.020 | And just seeing what stood the test
00:14:47.580 | of a few years of being published
00:14:49.420 | and what seems a little bit less important to have included
00:14:52.740 | now helped me pare down the topics
00:14:54.260 | I wanted to cover for the book.
00:14:55.820 | It's also really nice now that
00:14:59.260 | the field has kind of stabilized
00:15:00.580 | to the point where some core ideas from the 1980s
00:15:02.820 | are still used today.
00:15:04.780 | When I first started studying machine learning,
00:15:06.660 | almost everything from the 1980s had been rejected
00:15:09.580 | and now some of it has come back.
00:15:11.340 | So that stuff that's really stood the test of time
00:15:13.460 | is what I focused on putting into the book.
00:15:15.940 | There's also, I guess, two different philosophies
00:15:21.300 | about how you might write a book.
00:15:23.140 | One philosophy is you try to write a reference
00:15:24.820 | that covers everything.
00:15:26.220 | The other philosophy is you try to provide
00:15:28.020 | a high level summary that gives people
00:15:30.380 | the language to understand a field
00:15:32.420 | and tells them what the most important concepts are.
00:15:34.980 | The first deep learning book that I wrote
00:15:37.060 | with Yoshua and Aaron was somewhere between
00:15:39.620 | the two philosophies, that it's trying to be
00:15:42.380 | both a reference and an introductory guide.
00:15:45.780 | Writing this chapter for Russell and Norvig's book,
00:15:48.940 | I was able to focus more on just a concise introduction
00:15:52.780 | of the key concepts and the language
00:15:54.260 | you need to read about them more.
00:15:55.980 | In a lot of cases, I actually just wrote paragraphs
00:15:57.540 | that said, "Here's a rapidly evolving area
00:16:00.020 | "that you should pay attention to.
00:16:01.900 | "It's pointless to try to tell you what the latest
00:16:04.660 | "and best version of a learn-to-learn model is."
00:16:09.660 | I can point you to a paper that's recent right now,
00:16:13.660 | but there isn't a whole lot of a reason
00:16:16.300 | to delve into exactly what's going on
00:16:18.620 | with the latest learning-to-learn approach
00:16:21.620 | or the latest module produced
00:16:23.420 | by a learning-to-learn algorithm.
00:16:24.980 | You should know that learning-to-learn is a thing
00:16:26.780 | and that it may very well be the source
00:16:29.500 | of the latest and greatest convolutional net
00:16:32.220 | or recurrent net module that you would want to use
00:16:34.540 | in your latest project.
00:16:36.060 | But there isn't a lot of point in trying to summarize
00:16:38.180 | exactly which architecture and which learning approach
00:16:42.300 | got to which level of performance.
00:16:44.060 | - So you maybe focused more on the basics
00:16:48.020 | of the methodology, so from backpropagation
00:16:51.300 | to feedforward to recurrent neural networks,
00:16:53.740 | convolutional, that kind of thing?
00:16:55.180 | - Yeah, yeah.
00:16:56.500 | - So if I were to ask you, I remember I took
00:16:58.700 | algorithms and data structures algorithms course.
00:17:03.740 | I remember the professor asked, "What is an algorithm?"
00:17:08.220 | And yelled at everybody in a good way
00:17:12.240 | that nobody was answering it correctly.
00:17:14.100 | Everybody knew what an algorithm was, it was a graduate course.
00:17:16.420 | Everybody knew what an algorithm was,
00:17:18.180 | but they weren't able to answer it well.
00:17:19.820 | So let me ask you in that same spirit,
00:17:22.380 | what is deep learning?
00:17:23.620 | - I would say deep learning is any kind of machine learning
00:17:29.740 | that involves learning parameters
00:17:32.500 | of more than one consecutive step.
00:17:36.020 | So that would mean shallow learning is things
00:17:39.620 | where you learn a lot of operations that happen in parallel.
00:17:43.780 | You might have a system that makes multiple steps,
00:17:46.740 | like you might have hand-designed feature extractors,
00:17:51.020 | but really only one step is learned.
00:17:52.660 | Deep learning is anything where you have multiple operations
00:17:56.060 | in sequence, and that includes the things
00:17:58.580 | that are really popular today,
00:17:59.820 | like convolutional networks and recurrent networks,
00:18:03.620 | but it also includes some of the things
00:18:05.060 | that have died out, like Boltzmann machines,
00:18:08.300 | where we weren't using back propagation.
00:18:10.900 | Today I hear a lot of people define deep learning
00:18:14.260 | as gradient descent applied to
00:18:19.060 | these differentiable functions.
00:18:21.500 | And I think that's a legitimate usage of the term.
00:18:24.820 | It's just different from the way
00:18:25.940 | that I use the term myself.
00:18:27.860 | - So what's an example of deep learning
00:18:31.780 | that is not gradient descent and differentiable functions?
00:18:34.780 | In your, I mean, not specifically perhaps,
00:18:37.460 | but more even looking into the future,
00:18:39.820 | what's your thought about that space of approaches?
00:18:44.340 | - Yeah, so I tend to think of machine learning algorithms
00:18:46.380 | as decomposed into really three different pieces.
00:18:50.220 | There's the model, which can be something like a neural net
00:18:53.020 | or a Boltzmann machine or a recurrent model.
00:18:56.620 | And that basically just describes how do you take data
00:18:59.500 | and how do you take parameters?
00:19:01.140 | And what function do you use to make a prediction
00:19:04.300 | given the data and the parameters?
00:19:07.340 | Another piece of the learning algorithm
00:19:09.260 | is the optimization algorithm,
00:19:12.380 | though not every algorithm can really be described
00:19:14.900 | in terms of optimization,
00:19:15.900 | but what's the algorithm for updating the parameters
00:19:18.860 | or updating whatever the state of the network is?
00:19:21.660 | And then the last part is the data set,
00:19:26.260 | like how do you actually represent the world
00:19:29.180 | as it comes into your machine learning system?
00:19:32.100 | So I think of deep learning as telling us something
00:19:35.780 | about what does the model look like?
00:19:39.060 | And basically to qualify as deep,
00:19:41.260 | I say that it just has to have multiple layers.
00:19:44.540 | That can be multiple steps
00:19:46.340 | in a feed-forward differentiable computation.
00:19:49.220 | That can be multiple layers in a graphical model.
00:19:52.020 | There's a lot of ways that you could satisfy me
00:19:53.540 | that something has multiple steps
00:19:56.140 | that are each parameterized separately.
00:19:58.100 | I think of gradient descent
00:19:59.940 | as being all about that other piece,
00:20:01.900 | how do you actually update the parameters piece?
00:20:04.260 | So you could imagine having a deep model
00:20:05.980 | like a convolutional net
00:20:07.540 | and training it with something like evolution
00:20:09.660 | or a genetic algorithm.
00:20:11.300 | And I would say that still qualifies as deep learning.
00:20:14.780 | And then in terms of models
00:20:16.060 | that aren't necessarily differentiable,
00:20:18.740 | I guess Boltzmann machines are probably
00:20:21.260 | the main example of something
00:20:23.580 | where you can't really take a derivative
00:20:25.540 | and use that for the learning process.
00:20:28.020 | But you can still argue that the model
00:20:30.820 | has many steps of processing that it applies
00:20:33.780 | when you run inference in the model.
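
A toy illustration of that decomposition, and of the point that "deep" describes the model rather than the update rule: a multi-layer network trained by simple random-search evolution instead of gradient descent. Everything below (architecture, mutation scale, toy data) is made up for illustration:

```python
# Deep model + non-gradient optimizer + dataset, assembled as three pieces.
import copy
import torch
import torch.nn as nn

def make_deep_model():
    return nn.Sequential(
        nn.Linear(2, 32), nn.Tanh(),
        nn.Linear(32, 32), nn.Tanh(),
        nn.Linear(32, 1),
    )

def fitness(model, x, y):
    with torch.no_grad():
        return -torch.mean((model(x) - y) ** 2).item()   # higher is better

# Toy regression data: y is the sum of the inputs.
x = torch.randn(256, 2)
y = x.sum(dim=1, keepdim=True)

best = make_deep_model()
best_fit = fitness(best, x, y)
for _ in range(200):
    candidate = copy.deepcopy(best)
    with torch.no_grad():
        for p in candidate.parameters():
            p.add_(0.05 * torch.randn_like(p))    # random mutation, no gradients
    f = fitness(candidate, x, y)
    if f > best_fit:
        best, best_fit = candidate, f

print("best fitness (negative MSE):", best_fit)
```
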
00:20:35.820 | - So it's the steps of processing that's key.
00:20:38.980 | So Geoff Hinton suggests that we need to throw away
00:20:41.380 | back propagation and start all over.
00:20:44.940 | What do you think about that?
00:20:46.540 | What could an alternative direction
00:20:48.620 | of training neural networks look like?
00:20:50.980 | - I don't know that back propagation
00:20:52.900 | is gonna go away entirely.
00:20:54.700 | Most of the time when we decide
00:20:57.140 | that a machine learning algorithm
00:20:59.220 | isn't on the critical path to research for improving AI,
00:21:03.460 | the algorithm doesn't die.
00:21:04.660 | It just becomes used for some specialized set of things.
00:21:07.740 | A lot of algorithms like logistic regression
00:21:11.220 | don't seem that exciting to AI researchers
00:21:14.020 | who are working on things like speech recognition
00:21:16.780 | or autonomous cars today.
00:21:18.460 | But there's still a lot of use for logistic regression
00:21:21.140 | and things like analyzing really noisy data
00:21:24.060 | in medicine and finance,
00:21:25.740 | or making really rapid predictions
00:21:28.820 | in really time-limited contexts.
00:21:30.740 | So I think back propagation and gradient descent
00:21:33.500 | are around to stay, but they may not end up being
00:21:37.500 | everything that we need to get to real human level
00:21:40.900 | or superhuman AI.
00:21:42.420 | - Are you optimistic about us discovering,
00:21:44.780 | back propagation has been around for a few decades.
00:21:50.260 | So are you optimistic about us as a community
00:21:54.100 | being able to discover something better?
00:21:56.820 | - Yeah, I am.
00:21:57.660 | I think we likely will find something that works better.
00:22:01.820 | You could imagine things like having stacks of models
00:22:05.500 | where some of the lower level models
00:22:07.580 | predict parameters of the higher level models.
00:22:10.220 | And so at the top level,
00:22:12.180 | you're not learning in terms of literally
00:22:13.500 | calculating gradients,
00:22:14.460 | but just predicting how different values will perform.
00:22:17.700 | You can kind of see that already in some areas
00:22:19.580 | like Bayesian optimization,
00:22:21.380 | where you have a Gaussian process
00:22:22.940 | that predicts how well different
00:22:24.180 | parameter values will perform.
00:22:25.900 | We already use those kinds of algorithms
00:22:27.700 | for things like hyper parameter optimization.
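
A hedged sketch of that idea: fit a Gaussian process to a few (hyperparameter, validation score) pairs and let its predictions suggest the next value to try. The objective function and search range below are invented purely for illustration:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def validation_score(learning_rate):
    # Stand-in for "train a model and measure validation performance".
    return -np.square(np.log10(learning_rate) + 2.0)   # peaks near lr = 1e-2

# Hyperparameter values we have already evaluated.
tried = np.array([1e-4, 1e-3, 1e-1])
scores = np.array([validation_score(lr) for lr in tried])

gp = GaussianProcessRegressor()
gp.fit(np.log10(tried).reshape(-1, 1), scores)

# Score candidates with an optimistic "mean + uncertainty" rule, pick the best.
candidates = np.logspace(-5, 0, 50)
mean, std = gp.predict(np.log10(candidates).reshape(-1, 1), return_std=True)
next_lr = candidates[np.argmax(mean + std)]
print("next learning rate to try:", next_lr)
```
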
00:22:30.260 | And in general, we know a lot of things
00:22:31.660 | other than back prop that work really well
00:22:33.260 | for specific problems.
00:22:34.980 | The main thing we haven't found is
00:22:37.460 | a way of taking one of these other
00:22:38.900 | non-back prop based algorithms
00:22:41.180 | and having it really advance the state of the art
00:22:43.500 | on an AI level problem.
00:22:46.140 | - Right.
00:22:47.100 | - But I wouldn't be surprised if eventually
00:22:49.180 | we find that some of these algorithms that,
00:22:51.580 | even the ones that already exist,
00:22:52.820 | not even necessarily a new one,
00:22:54.260 | we might find some way of customizing
00:22:58.220 | one of these algorithms to do something
00:22:59.820 | really interesting at the level of cognition
00:23:02.540 | or the level of,
00:23:05.300 | I think one system that we really don't have
00:23:08.460 | working quite right yet is like short-term memory.
00:23:12.140 | We have things like LSTMs,
00:23:14.540 | they're called long short-term memory.
00:23:17.060 | They still don't do quite what a human does
00:23:20.060 | with short-term memory.
00:23:21.820 | Like gradient descent to learn a specific fact
00:23:26.980 | has to do multiple steps on that fact.
00:23:29.420 | Like if I tell you the meeting today is at 3 p.m.,
00:23:34.180 | I don't need to say over and over again,
00:23:35.500 | it's at 3 p.m., it's at 3 p.m., it's at 3 p.m.,
00:23:37.820 | it's at 3 p.m. for you to do a gradient step on each one.
00:23:40.420 | You just hear it once and you remember it.
00:23:43.220 | There's been some work on things like
00:23:46.060 | self-attention and attention-like mechanisms
00:23:48.340 | like the neural Turing machine
00:23:50.420 | that can write to memory cells
00:23:52.220 | and update themselves with facts like that right away.
00:23:54.900 | But I don't think we've really nailed it yet.
00:23:56.900 | And that's one area where I'd imagine
00:23:59.580 | that new optimization algorithms
00:24:02.660 | or different ways of applying
00:24:03.820 | existing optimization algorithms
00:24:06.020 | could give us a way of just lightning fast
00:24:08.820 | updating the state of a machine learning system
00:24:11.180 | to contain a specific fact like that
00:24:14.100 | without needing to have it presented
00:24:15.340 | over and over and over again.
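
A toy sketch of that contrast: an external key-value memory where a fact is written once and retrieved by an attention-style lookup, with no gradient steps at all. The vector encodings here are random placeholders, not a real language representation:

```python
import torch
import torch.nn.functional as F

d = 16
memory_keys = torch.zeros(0, d)    # empty external memory
memory_vals = torch.zeros(0, d)

def write(key, value):
    """Store a fact with a single write -- no repeated gradient updates."""
    global memory_keys, memory_vals
    memory_keys = torch.cat([memory_keys, key.unsqueeze(0)])
    memory_vals = torch.cat([memory_vals, value.unsqueeze(0)])

def read(query):
    """Soft content-based lookup, like an attention mechanism."""
    weights = F.softmax(memory_keys @ query, dim=0)   # similarity to each key
    return weights @ memory_vals                      # weighted sum of values

meeting_key = torch.randn(d)   # stands in for an encoding of "the meeting"
three_pm = torch.randn(d)      # stands in for an encoding of "3 p.m."
write(meeting_key, three_pm)   # heard once, remembered immediately

recalled = read(meeting_key)
print(F.cosine_similarity(recalled, three_pm, dim=0))   # close to 1.0
```
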
00:24:16.980 | - So some of the success of symbolic systems in the '80s
00:24:21.420 | is they were able to assemble these kinds of facts better.
00:24:26.220 | But there's a lot of expert input required
00:24:29.100 | and it's very limited in that sense.
00:24:31.140 | Do you ever look back to that
00:24:33.700 | as something that we'll have to return to eventually,
00:24:36.580 | sort of dust off the book from the shelf
00:24:38.420 | and think about how we build knowledge,
00:24:41.340 | representation, knowledge--
00:24:42.940 | - Like will we have to use graph searches?
00:24:44.820 | - Graph searches, right.
00:24:45.780 | - And like first order logic and entailment
00:24:47.700 | and things like that.
00:24:48.540 | - That kind of thing, yeah, exactly.
00:24:49.580 | - In my particular line of work,
00:24:51.180 | which has mostly been machine learning security
00:24:54.540 | and also generative modeling,
00:24:56.740 | I haven't usually found myself moving in that direction.
00:25:00.540 | For generative models, I could see a little bit of,
00:25:03.500 | it could be useful if you had something like
00:25:05.180 | a differentiable knowledge base
00:25:09.660 | or some other kind of knowledge base
00:25:10.980 | where it's possible for some of our
00:25:13.140 | fuzzier machine learning algorithms
00:25:14.820 | to interact with the knowledge base.
00:25:16.860 | - I mean, neural network is kind of like that.
00:25:19.020 | It's a differentiable knowledge base of sorts.
00:25:21.420 | - Yeah.
00:25:22.260 | - But--
00:25:23.620 | - If we had a really easy way of giving feedback
00:25:27.620 | to machine learning models,
00:25:29.260 | that would clearly help a lot with generative models.
00:25:32.380 | And so you could imagine one way of getting there
00:25:33.900 | would be get a lot better at natural language processing.
00:25:36.700 | But another way of getting there would be
00:25:38.900 | take some kind of knowledge base
00:25:40.260 | and figure out a way for it to actually
00:25:42.300 | interact with a neural network.
00:25:44.060 | - Being able to have a chat with a neural network.
00:25:46.060 | - Yeah.
00:25:46.900 | (laughing)
00:25:47.860 | So like one thing in generative models we see a lot today
00:25:49.980 | is you'll get things like faces that are not symmetrical,
00:25:53.540 | like people that have two eyes that are different colors.
00:25:58.180 | And I mean, there are people with eyes
00:25:59.540 | that are different colors in real life,
00:26:00.820 | but not nearly as many of them as you tend to see
00:26:03.420 | in the machine learning generated data.
00:26:06.060 | So if you had either a knowledge base
00:26:08.060 | that could contain the fact,
00:26:10.180 | people's faces are generally approximately symmetric
00:26:13.340 | and eye color is especially likely
00:26:15.900 | to be the same on both sides.
00:26:17.940 | Being able to just inject that hint
00:26:20.180 | into the machine learning model
00:26:22.020 | without it having to discover that itself
00:26:23.820 | after studying a lot of data
00:26:25.780 | would be a really useful feature.
00:26:28.340 | I could see a lot of ways of getting there
00:26:30.140 | without bringing back some of the 1980s technology,
00:26:32.180 | but I also see some ways that you could imagine
00:26:35.140 | extending the 1980s technology to play nice
00:26:37.460 | with neural nets and have it help get there.
00:26:40.020 | - Awesome, so you talked about the story
00:26:42.580 | of you coming up with the idea of GANs
00:26:45.180 | at a bar with some friends.
00:26:47.020 | You were arguing that this, you know,
00:26:49.580 | GANs would work, generative adversarial networks,
00:26:53.060 | and the others didn't think so.
00:26:54.660 | Then you went home at midnight,
00:26:57.100 | coded it up, and it worked.
00:26:58.420 | So if I was a friend of yours at the bar,
00:27:01.340 | I would also have doubts.
00:27:02.700 | It's a really nice idea,
00:27:03.860 | but I'm very skeptical that it would work.
00:27:06.820 | What was the basis of their skepticism?
00:27:09.300 | What was the basis of your intuition why it should work?
00:27:13.180 | - I don't wanna be someone who goes around
00:27:15.980 | promoting alcohol for the purposes of science,
00:27:18.300 | but in this case, I do actually think
00:27:21.020 | that drinking helped a little bit.
00:27:23.060 | When your inhibitions are lowered,
00:27:25.360 | you're more willing to try out things
00:27:27.380 | that you wouldn't try out otherwise.
00:27:29.620 | So I have noticed in general
00:27:32.460 | that I'm less prone to shooting down some of my own ideas
00:27:34.540 | when I have had a little bit to drink.
00:27:37.980 | I think if I had had that idea at lunchtime,
00:27:40.820 | I probably would have thought,
00:27:42.260 | it's hard enough to train one neural net.
00:27:43.740 | You can't train a second neural net
00:27:44.900 | in the inner loop of the outer neural net.
00:27:48.080 | That was basically my friend's objection,
00:27:49.820 | was that trying to train two neural nets at the same time
00:27:52.740 | would be too hard.
00:27:54.260 | - So it was more about the training process,
00:27:56.140 | unless, so my skepticism would be,
00:27:58.300 | you know, I'm sure you could train it,
00:28:01.140 | but the thing it would converge to
00:28:03.180 | would not be able to generate anything reasonable,
00:28:05.820 | any kind of reasonable realism.
00:28:08.260 | - Yeah, so part of what all of us were thinking about
00:28:11.360 | when we had this conversation was deep Boltzmann machines,
00:28:15.280 | which a lot of us in the lab, including me,
00:28:16.980 | were a big fan of deep Boltzmann machines at the time.
00:28:19.580 | They involved two separate processes
00:28:22.900 | running at the same time.
00:28:24.180 | One of them is called the positive phase,
00:28:28.140 | where you load data into the model
00:28:31.180 | and tell the model to make the data more likely.
00:28:33.540 | The other one is called the negative phase,
00:28:35.140 | where you draw samples from the model
00:28:37.020 | and tell the model to make those samples less likely.
00:28:39.660 | In a deep Boltzmann machine,
00:28:42.220 | it's not trivial to generate a sample.
00:28:43.960 | You have to actually run an iterative process
00:28:46.980 | that gets better and better samples
00:28:49.140 | coming closer and closer to the distribution
00:28:51.400 | the model represents.
00:28:52.860 | So during the training process,
00:28:53.900 | you're always running these two systems at the same time.
00:28:57.180 | One that's updating the parameters of the model
00:28:58.940 | and another one that's trying to generate samples
00:29:00.500 | from the model.
00:29:01.680 | And they worked really well on things like MNIST,
00:29:04.340 | but a lot of us in the lab, including me,
00:29:05.820 | had tried to get deep Boltzmann machines
00:29:07.500 | to scale past MNIST to things like generating color photos.
00:29:11.900 | And we just couldn't get the two processes
00:29:14.120 | to stay synchronized.
00:29:15.940 | So when I had the idea for GANs,
00:29:18.740 | a lot of people thought that the discriminator
00:29:20.320 | would have more or less the same problem
00:29:22.580 | as the negative phase in the Boltzmann machine.
00:29:25.340 | That trying to train the discriminator in the inner loop,
00:29:27.780 | you just couldn't get it to keep up
00:29:29.920 | with the generator in the outer loop.
00:29:31.540 | And that would prevent it from converging
00:29:33.820 | to anything useful.
00:29:35.220 | - Yeah, I share that intuition.
00:29:36.860 | - Yeah.
00:29:37.700 | - But turns out to not be the case.
00:29:41.940 | - A lot of the time with machine learning algorithms,
00:29:43.760 | it's really hard to predict ahead of time
00:29:45.160 | how well they'll actually perform.
00:29:46.900 | You have to just run the experiment and see what happens.
00:29:49.140 | And I would say I still today don't have one factor
00:29:53.460 | I can put my finger on and say,
00:29:54.740 | "This is why GANs worked for photo generation
00:29:58.300 | "and deep Bolton machines don't."
00:30:00.300 | There are a lot of theory papers
00:30:03.300 | showing that under some theoretical settings,
00:30:06.340 | the GAN algorithm does actually converge.
00:30:09.620 | But those settings are restricted enough
00:30:14.140 | that they don't necessarily explain the whole picture
00:30:17.540 | in terms of all the results that we see in practice.
00:30:20.740 | - So taking a step back,
00:30:22.300 | can you, in the same way as we talked about deep learning,
00:30:24.860 | can you tell me what generative adversarial networks are?
00:30:28.400 | - Yeah, so generative adversarial networks
00:30:31.380 | are a particular kind of generative model.
00:30:33.980 | A generative model is a machine learning model
00:30:36.260 | that can train on some set of data.
00:30:38.860 | Like say you have a collection of photos of cats
00:30:41.220 | and you want to generate more photos of cats,
00:30:43.980 | or you want to estimate a probability distribution over cats
00:30:47.700 | so you can ask how likely it is
00:30:49.780 | that some new image is a photo of a cat.
00:30:51.820 | GANs are one way of doing this.
00:30:55.800 | Some generative models are good at creating new data.
00:30:59.180 | Other generative models are good at estimating
00:31:01.620 | that density function and telling you
00:31:03.000 | how likely particular pieces of data are
00:31:06.580 | to come from the same distribution as the training data.
00:31:09.700 | GANs are more focused on generating samples
00:31:12.420 | rather than estimating the density function.
00:31:15.620 | There are some kinds of GANs like FlowGAN that can do both,
00:31:18.500 | but mostly GANs are about generating samples,
00:31:21.620 | generating new photos of cats that look realistic.
00:31:25.220 | And they do that completely from scratch.
00:31:29.300 | It's analogous to human imagination.
00:31:32.220 | When a GAN creates a new image of a cat,
00:31:34.740 | it's using a neural network to produce a cat
00:31:39.300 | that has not existed before.
00:31:41.020 | It isn't doing something like compositing photos together.
00:31:44.540 | You're not literally taking the eye off of one cat
00:31:47.060 | and the ear off of another cat.
00:31:48.260 | It's more of this digestive process
00:31:51.340 | where the neural net trains on a lot of data
00:31:53.940 | and comes up with some representation
00:31:55.580 | of the probability distribution
00:31:57.420 | and generates entirely new cats.
00:31:59.820 | There are a lot of different ways
00:32:00.900 | of building a generative model.
00:32:01.980 | What's specific to GANs is that we have a two-player game
00:32:05.660 | in the game theoretic sense.
00:32:08.100 | And as the players in this game compete,
00:32:10.340 | one of them becomes able to generate realistic data.
00:32:13.940 | The first player is called the generator.
00:32:16.140 | It produces output data, such as just images, for example.
00:32:20.660 | And at the start of the learning process,
00:32:22.460 | it'll just produce completely random images.
00:32:25.140 | The other player is called the discriminator.
00:32:27.420 | The discriminator takes images as input
00:32:29.700 | and guesses whether they're real or fake.
00:32:32.540 | You train it both on real data,
00:32:34.260 | so photos that come from your training set,
00:32:36.140 | actual photos of cats,
00:32:37.860 | and you train it to say that those are real.
00:32:39.900 | You also train it on images
00:32:41.980 | that come from the generator network
00:32:43.860 | and you train it to say that those are fake.
00:32:46.060 | As the two players compete in this game,
00:32:49.220 | the discriminator tries to become better
00:32:50.980 | at recognizing whether images are real or fake.
00:32:53.340 | And the generator becomes better
00:32:54.820 | at fooling the discriminator into thinking
00:32:57.020 | that its outputs are real.
00:32:59.620 | And you can analyze this through the language of game theory
00:33:03.580 | and find that there's a Nash equilibrium
00:33:06.980 | where the generator has captured
00:33:08.660 | the correct probability distribution.
00:33:10.820 | So in the cat example,
00:33:12.180 | it makes perfectly realistic cat photos.
00:33:14.580 | And the discriminator is unable to do better
00:33:17.180 | than random guessing
00:33:18.740 | because all the samples coming from both the data
00:33:21.860 | and the generator look equally likely
00:33:24.060 | to have come from either source.
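
A minimal training-loop sketch of the two-player game just described, with stand-in architectures and toy data rather than any particular published recipe:

```python
# Generator vs. discriminator: D learns to tell real from fake,
# G learns to fool D.
import torch
import torch.nn as nn

noise_dim, data_dim = 16, 2

generator = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def real_batch(n=64):
    # Stand-in for real training data (e.g. features of cat photos).
    return torch.randn(n, data_dim) * 0.5 + 2.0

for step in range(1000):
    real = real_batch()
    fake = generator(torch.randn(real.size(0), noise_dim))

    # Discriminator: say "real" (1) for data, "fake" (0) for generator output.
    d_loss = bce(discriminator(real), torch.ones(real.size(0), 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(real.size(0), 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: fool the discriminator into calling its samples "real".
    g_loss = bce(discriminator(fake), torch.ones(real.size(0), 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```

At the equilibrium described above, the discriminator's outputs drift toward random guessing because real and generated samples have become indistinguishable.
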
00:33:25.860 | - So do you ever sit back
00:33:28.380 | and does it just blow your mind that this thing works?
00:33:31.300 | So from very,
00:33:33.380 | so it's able to estimate the density function
00:33:35.860 | enough to generate realistic images.
00:33:38.700 | I mean, yeah, do you ever sit back?
00:33:42.180 | - Yeah. - And think,
00:33:44.220 | how does this even, why, this is quite incredible,
00:33:46.780 | especially where GANs have gone
00:33:48.340 | in terms of realism.
00:33:49.300 | - Yeah, and not just to flatter my own work,
00:33:51.660 | but generative models,
00:33:53.900 | all of them have this property
00:33:55.460 | that if they really did what we asked them to do,
00:33:58.860 | they would do nothing but memorize the training data.
00:34:01.100 | - Right, exactly.
00:34:01.940 | - Models that are based on maximizing the likelihood,
00:34:05.780 | the way that you obtain the maximum likelihood
00:34:08.180 | for a specific training set
00:34:09.740 | is you assign all of your probability mass
00:34:12.420 | to the training examples and nowhere else.
00:34:15.140 | For GANs, the game is played using a training set.
00:34:18.420 | So the way that you become unbeatable in the game
00:34:21.180 | is you literally memorize training examples.
00:34:23.420 | One of my former interns wrote a paper,
00:34:28.900 | his name is Vaishnavh Nagarajan,
00:34:31.060 | and he showed that it's actually hard
00:34:33.060 | for the generator to memorize the training data,
00:34:36.100 | hard in a statistical learning theory sense
00:34:39.140 | that you can actually create reasons
00:34:42.180 | for why it would require quite a lot of learning steps
00:34:47.180 | and a lot of observations of different latent variables
00:34:52.180 | before you could memorize the training data.
00:34:54.340 | That still doesn't really explain
00:34:55.660 | why when you produce samples that are new,
00:34:58.220 | why do you get compelling images
00:34:59.860 | rather than just garbage
00:35:01.860 | that's different from the training set.
00:35:03.740 | And I don't think we really have a good answer for that,
00:35:06.940 | especially if you think about
00:35:07.900 | how many possible images are out there
00:35:10.260 | and how few images the generative model sees during training.
00:35:15.260 | It seems just unreasonable
00:35:16.940 | that generative models create new images
00:35:19.220 | as well as they do,
00:35:20.780 | especially considering that we're basically
00:35:22.740 | training them to memorize rather than generalize.
00:35:25.180 | I think part of the answer is
00:35:28.220 | there's a paper called Deep Image Prior
00:35:30.860 | where they show that you can take a convolutional net
00:35:33.100 | and you don't even need to learn the parameters of it at all,
00:35:35.020 | you just use the model architecture.
00:35:37.700 | And it's already useful for things like in-painting images.
00:35:41.100 | I think that shows us
00:35:42.300 | that the convolutional network architecture
00:35:44.380 | captures something really important
00:35:45.940 | about the structure of images.
00:35:47.980 | And we don't need to actually use learning
00:35:50.980 | to capture all the information
00:35:52.260 | coming out of the convolutional net.
00:35:54.060 | That would imply that it would be much harder
00:35:58.460 | to make generative models in other domains.
00:36:01.300 | So far, we're able to make reasonable speech models
00:36:03.660 | and things like that.
00:36:04.900 | But to be honest,
00:36:06.420 | we haven't actually explored a whole lot
00:36:07.860 | of different data sets all that much.
00:36:09.820 | We don't, for example,
00:36:11.500 | see a lot of deep learning models of biology data sets,
00:36:16.500 | where you have lots of microarrays
00:36:19.900 | measuring the amount of different enzymes
00:36:22.300 | and things like that.
00:36:23.140 | So we may find that some of the progress
00:36:25.300 | that we've seen for images and speech
00:36:26.900 | turns out to really rely heavily on the model architecture.
00:36:30.140 | And we were able to do what we did for vision
00:36:33.020 | by trying to reverse engineer the human visual system.
00:36:37.020 | And maybe it'll turn out that we can't just
00:36:39.820 | use that same trick for arbitrary kinds of data.
00:36:42.580 | - Right, so there's aspects of the human vision system,
00:36:45.940 | the hardware of it, that makes it,
00:36:48.420 | without learning, without cognition,
00:36:51.140 | just makes it really effective at detecting the patterns
00:36:53.660 | we see in the visual world.
00:36:54.940 | - Yeah.
00:36:55.940 | - Yeah, that's really interesting.
00:36:57.700 | What, in a big, quick overview,
00:37:02.300 | in your view, what types of GANs are there,
00:37:06.300 | and what other generative models besides GANs are there?
00:37:10.100 | - Yeah, so it's maybe a little bit easier to start with
00:37:13.540 | what kinds of generative models are there other than GANs.
00:37:16.900 | So most generative models are likelihood-based,
00:37:20.900 | where to train them, you have a model that tells you
00:37:24.900 | how much probability it assigns to a particular example,
00:37:29.100 | and you just maximize the probability
00:37:31.140 | assigned to all the training examples.
00:37:33.700 | It turns out that it's hard to design a model
00:37:36.180 | that can create really complicated images
00:37:39.180 | or really complicated audio waveforms,
00:37:42.260 | and still have it be possible to estimate
00:37:46.220 | the likelihood function from a computational point of view.
00:37:51.220 | Most interesting models that you would just
00:37:53.820 | write down intuitively, it turns out that it's almost
00:37:56.420 | impossible to calculate the amount of probability
00:37:59.020 | they assign to a particular point.
00:38:00.780 | So there's a few different schools of generative models
00:38:04.740 | in the likelihood family.
00:38:06.260 | One approach is to very carefully design the model
00:38:10.180 | so that it is computationally tractable
00:38:12.780 | to measure the density it assigns to a particular point.
00:38:15.540 | So there are things like autoregressive models,
00:38:19.140 | like PixelCNN, those basically break down
00:38:23.940 | the probability distribution into a product
00:38:26.860 | over every single feature.
00:38:28.660 | So for an image, you estimate the probability
00:38:31.540 | of each pixel, given all of the pixels that came before it.
00:38:35.780 | There's tricks where if you want to measure
00:38:37.660 | the density function, you can actually calculate
00:38:40.620 | the density for all these pixels more or less in parallel.
00:38:43.500 | Generating the image still tends to require you
00:38:46.860 | to go one pixel at a time, and that can be very slow.
00:38:50.820 | But there are, again, tricks for doing this
00:38:52.980 | in a hierarchical pattern where you can keep
00:38:54.540 | the runtime under control.
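
A toy sketch of the autoregressive factorization p(x) = prod_i p(x_i | x_1, ..., x_{i-1}), sampling one pixel at a time; the predictive "model" below is a made-up placeholder, not PixelCNN:

```python
import torch

def predict_pixel_probs(image_so_far, i):
    # Placeholder for a network that conditions on previously generated
    # pixels and returns a distribution over values {0, 1} for pixel i.
    frac_on = image_so_far[:i].float().mean() if i > 0 else torch.tensor(0.5)
    p_on = 0.5 * 0.5 + 0.5 * frac_on      # arbitrary rule, for illustration only
    return torch.stack([1 - p_on, p_on])

num_pixels = 8 * 8
image = torch.zeros(num_pixels, dtype=torch.long)
for i in range(num_pixels):               # sequential: one pixel at a time
    probs = predict_pixel_probs(image, i)
    image[i] = torch.multinomial(probs, 1).item()

print(image.view(8, 8))
```
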
00:38:56.140 | - Are the quality of the images it generates
00:38:58.540 | putting runtime aside pretty good?
00:39:01.660 | - They're reasonable, yeah.
00:39:04.420 | I would say a lot of the best results
00:39:07.460 | are from GANs these days, but it can be hard to tell
00:39:11.060 | how much of that is based on who's studying
00:39:14.700 | which type of algorithm, if that makes sense.
00:39:17.300 | - The amount of effort invested in a particular--
00:39:18.900 | - Yeah, or like the kind of expertise.
00:39:21.420 | So a lot of people who've traditionally been excited
00:39:23.140 | about graphics or art and things like that
00:39:25.060 | have gotten interested in GANs.
00:39:27.020 | And to some extent, it's hard to tell,
00:39:28.740 | are GANs doing better because they have a lot of
00:39:32.340 | graphics and art experts behind them,
00:39:34.700 | or are GANs doing better because
00:39:36.700 | they're more computationally efficient,
00:39:38.900 | or are GANs doing better because they prioritize
00:39:41.660 | the realism of samples over the accuracy
00:39:44.620 | of the density function?
00:39:45.500 | I think all of those are potentially valid explanations,
00:39:48.660 | and it's hard to tell.
00:39:51.300 | - So can you give a brief history of GANs
00:39:53.740 | from 2014 with your paper onward?
00:39:58.740 | - Yeah, so a few highlights.
00:40:00.980 | In the first paper, we just showed that GANs basically work.
00:40:04.740 | If you look back at the samples we had now,
00:40:06.620 | they look terrible.
00:40:08.820 | On the CIFAR-10 dataset, you can't even
00:40:10.460 | recognize objects in them.
00:40:12.220 | - Your paper, sorry, you used CIFAR-10?
00:40:15.020 | - We used MNIST, which is little handwritten digits.
00:40:18.060 | We used the Toronto Face Database,
00:40:19.860 | which is small, grayscale photos of faces.
00:40:22.700 | We did have recognizable faces.
00:40:24.220 | My colleague Bing Xu put together
00:40:25.700 | the first GAN face model for that paper.
00:40:28.540 | We also had the CIFAR-10 dataset,
00:40:32.980 | which is things like very small 32 by 32 pixels
00:40:36.100 | of cars and cats and dogs.
00:40:40.660 | For that, we didn't get recognizable objects,
00:40:43.020 | but all the deep learning people back then
00:40:46.180 | were really used to looking at these failed samples
00:40:48.420 | and kind of reading them like tea leaves.
00:40:50.420 | And people who are used to reading the tea leaves
00:40:53.020 | recognize that our tea leaves at least look different.
00:40:56.500 | Maybe not necessarily better,
00:40:57.820 | but there was something unusual about them.
00:40:59.980 | And that got a lot of us excited.
00:41:03.620 | One of the next really big steps was LAPGAN
00:41:06.180 | by Emily Denton and Soumith Chintala at Facebook AI Research,
00:41:10.900 | where they actually got really good high-resolution photos
00:41:14.420 | working with GANs for the first time.
00:41:16.580 | They had a complicated system where they generated
00:41:18.860 | the image starting at low-res and then scaling up to high-res,
00:41:22.780 | but they were able to get it to work.
00:41:24.900 | And then in 2015, I believe, later that same year,
00:41:29.900 | Alec Radford and Soumith Chintala and Luke Metz
00:41:34.940 | published the DCGAN paper,
00:41:38.420 | which it stands for Deep Convolutional GAN.
00:41:40.980 | It's kind of a non-unique name
00:41:43.740 | because these days basically all GANs
00:41:46.420 | and even some before that were deep and convolutional,
00:41:48.380 | but they just kind of picked a name
00:41:50.220 | for a really great recipe where they were able to actually,
00:41:54.020 | using only one model instead of a multi-step process,
00:41:57.300 | actually generate realistic images of faces
00:41:59.700 | and things like that.
00:42:00.740 | That was sort of like the beginning
00:42:05.220 | of the Cambrian explosion of GANs.
00:42:07.380 | Like, once you had animals that had a backbone,
00:42:09.740 | you suddenly got lots of different versions of fish
00:42:12.900 | and four-legged animals and things like that.
00:42:15.340 | So DCGAN became kind of the backbone
00:42:17.940 | for many different models that came out.
00:42:19.420 | - Used as a baseline even still.
00:42:21.620 | - Yeah, yeah.
00:42:23.140 | And so from there, I would say some interesting things
00:42:25.940 | we've seen are, there's a lot you can say
00:42:29.420 | about how just the quality
00:42:30.940 | of standard image generation GANs has increased,
00:42:33.540 | but what's also maybe more interesting
00:42:35.100 | on an intellectual level is how the things
00:42:37.380 | you can use GANs for has also changed.
00:42:40.060 | One thing is that you can use them to learn classifiers
00:42:44.580 | without having to have class labels for every example
00:42:47.380 | in your training set.
00:42:48.940 | So that's called semi-supervised learning.
00:42:51.780 | My colleague at OpenAI, Tim Salimans, who's at Brain now,
00:42:55.820 | wrote a paper called "Improved Techniques for Training GANs."
00:42:59.780 | I'm a co-author on this paper,
00:43:00.900 | but I can't claim any credit for this particular part.
00:43:03.700 | One thing he showed in the paper is that
00:43:05.860 | you can take the GAN discriminator
00:43:07.820 | and use it as a classifier that actually tells you,
00:43:11.340 | you know, this image is a cat, this image is a dog,
00:43:13.620 | this image is a car, this image is a truck, and so on.
00:43:16.420 | Not just to say whether the image is real or fake,
00:43:18.820 | but if it is real, to say specifically
00:43:20.700 | what kind of object it is.
00:43:22.620 | And he found that you can train these classifiers
00:43:25.340 | with far fewer labeled examples than traditional classifiers.
00:43:30.340 | - So if you supervise based on also
00:43:33.660 | not just your discrimination ability,
00:43:35.300 | but your ability to classify, you're going to do much,
00:43:38.660 | you're going to converge much faster
00:43:40.100 | to being effective at being a discriminator.
00:43:43.300 | - Yeah.
00:43:44.260 | So for example, for the MNIST dataset,
00:43:46.340 | you want to look at an image of a handwritten digit
00:43:48.860 | and say whether it's a zero, a one, or a two, and so on.
00:43:52.700 | To get down to less than 1% error
00:43:56.980 | required around 60,000 examples
00:44:00.260 | until maybe about 2014 or so.
00:44:02.780 | In 2016, with this semi-supervised GAN project,
00:44:07.460 | Tim was able to get below 1% error
00:44:11.060 | using only a hundred labeled examples.
00:44:13.660 | So that was about a 600X decrease
00:44:16.020 | in the amount of labels that he needed.
00:44:18.020 | He's still using more images than that,
00:44:21.100 | but he doesn't need to have each of them labeled as,
00:44:23.460 | you know, this one's a one, this one's a two,
00:44:25.100 | this one's a zero, and so on.
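
To make the semi-supervised GAN idea above concrete, here is a minimal PyTorch-style sketch of a discriminator whose output layer has K real classes plus one extra "fake" slot. The architecture, loss weighting, and constants are illustrative assumptions, not the exact recipe from "Improved Techniques for Training GANs."

```python
# Minimal sketch (assumed architecture and constants) of a discriminator that
# doubles as a K-way classifier: K real classes plus one extra "fake" class.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 10          # e.g. the ten MNIST digits
FAKE_CLASS = NUM_CLASSES  # index of the extra (K+1)-th "generated" class

class Discriminator(nn.Module):
    def __init__(self, in_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, NUM_CLASSES + 1),   # K real classes + 1 fake class
        )

    def forward(self, x):
        return self.net(x)                     # raw logits over K+1 classes

def discriminator_loss(D, labeled_x, labels, unlabeled_x, generated_x):
    # labeled real images must land in the correct real class
    supervised = F.cross_entropy(D(labeled_x), labels)
    # unlabeled real images just need to land in *some* real class,
    # i.e. put low probability on the fake class
    p_fake_real = F.softmax(D(unlabeled_x), dim=1)[:, FAKE_CLASS]
    unsup_real = -torch.log(1.0 - p_fake_real + 1e-8).mean()
    # generated images should land in the fake class
    p_fake_gen = F.softmax(D(generated_x), dim=1)[:, FAKE_CLASS]
    unsup_fake = -torch.log(p_fake_gen + 1e-8).mean()
    return supervised + unsup_real + unsup_fake
```

At test time the fake column is simply dropped and the argmax over the K real-class logits serves as the classifier, which is how the discriminator ends up needing far fewer labeled examples.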
00:44:27.020 | - Then to be able to, for GANs to be able to generate
00:44:30.020 | recognizable objects, so objects from a particular class,
00:44:33.460 | you still need labeled data
00:44:37.020 | because you need to know what it means
00:44:38.900 | to be a particular class, like cat or dog.
00:44:40.900 | How do you think we can move away from that?
00:44:44.620 | - Yeah, some researchers at Brain Zurich
00:44:46.660 | actually just released a really great paper
00:44:49.060 | on semi-supervised GANs where their goal isn't to classify,
00:44:53.980 | it's to make recognizable objects
00:44:56.260 | despite not having a lot of labeled data.
00:44:58.700 | They were working off of DeepMind's BigGAN project,
00:45:02.420 | and they showed that they can match the performance
00:45:05.220 | of BigGAN using only 10%, I believe,
00:45:08.700 | of the labels.
00:45:10.580 | BigGAN was trained on the ImageNet dataset,
00:45:12.340 | which is about 1.2 million images,
00:45:14.460 | and had all of them labeled.
00:45:15.900 | This latest project from Brain Zurich
00:45:19.100 | shows that they're able to get away
00:45:20.260 | with only having about 10% of the images labeled.
00:45:24.620 | And they do that essentially using a clustering algorithm
00:45:29.900 | where the discriminator learns to assign the objects
00:45:33.380 | to groups, and then this understanding
00:45:36.340 | that objects can be grouped into similar types
00:45:40.380 | helps it to form more realistic ideas
00:45:43.460 | of what should be appearing in the image,
00:45:45.420 | because it knows that every image it creates
00:45:47.980 | has to come from one of these archetypal groups
00:45:50.180 | rather than just being some arbitrary image.
00:45:53.220 | If you train a GAN with no class labels,
00:45:55.140 | you tend to get things that look sort of like
00:45:57.220 | grass or water or brick or dirt,
00:46:00.500 | but without necessarily a lot going on in them.
00:46:04.460 | And I think that's partly because
00:46:05.820 | if you look at a large ImageNet image,
00:46:07.900 | the object doesn't necessarily occupy the whole image.
00:46:11.260 | And so you learn to create realistic sets of pixels,
00:46:15.660 | but you don't necessarily learn
00:46:17.540 | that the object is the star of the show
00:46:20.140 | and you want it to be in every image you make.
00:46:22.220 | - Yeah, I've heard you talk about the horse-to-zebra
00:46:25.460 | CycleGAN mapping,
00:46:27.060 | and how it turns out, again, thought-provoking,
00:46:31.980 | that horses are usually on grass
00:46:33.660 | and zebras are usually on drier terrain.
00:46:35.740 | So when you're doing that kind of generation,
00:46:38.220 | you're going to end up generating greener horses or whatever.
00:46:42.720 | So those are connected together.
00:46:45.420 | It's not just-- - Yeah, yeah.
00:46:47.420 | - You're not explicitly asking it to segment,
00:46:49.060 | but it's able to generate in a segmented way.
00:46:52.360 | So are there other types of games you come across
00:46:55.060 | in your mind that neural networks can play with each other
00:47:00.060 | to be able to solve problems?
00:47:05.220 | - Yeah, the one that I spend most of my time on
00:47:07.700 | is in security, you can model most interactions as a game
00:47:12.700 | where there's attackers trying to break your system
00:47:15.820 | and you're the defender trying to build a resilient system.
00:47:19.160 | There's also domain adversarial learning,
00:47:23.100 | which is an approach to domain adaptation
00:47:25.540 | that looks really a lot like GANs.
00:47:27.260 | The authors had the idea before the GAN paper came out,
00:47:31.820 | their paper came out a little bit later,
00:47:33.780 | and they were very nice and cited the GAN paper,
00:47:38.260 | but I know that they actually had the idea
00:47:40.220 | before it came out.
00:47:41.180 | Domain adaptation is when you want to train
00:47:44.340 | a machine learning model in one setting called a domain
00:47:47.620 | and then deploy it in another domain later.
00:47:50.300 | And you would like it to perform well in the new domain,
00:47:52.700 | even though the new domain is different
00:47:54.020 | from how it was trained.
00:47:55.940 | So for example, you might want to train
00:47:58.500 | on a really clean image data set like ImageNet,
00:48:01.380 | but then deploy on users' phones
00:48:03.380 | where the user is taking pictures in the dark
00:48:06.020 | or pictures while moving quickly
00:48:07.820 | and just pictures that aren't really centered
00:48:10.020 | or composed all that well.
00:48:11.340 | When you take a normal machine learning model,
00:48:15.860 | it often degrades really badly
00:48:17.860 | when you move to the new domain
00:48:19.020 | because it looks so different
00:48:20.060 | from what the model was trained on.
00:48:22.140 | Domain adaptation algorithms try to smooth out that gap.
00:48:25.460 | And the domain adversarial approach
00:48:27.340 | is based on training a feature extractor
00:48:29.820 | where the features have the same statistics
00:48:32.180 | regardless of which domain you extracted them on.
00:48:35.180 | So in the domain adversarial game,
00:48:36.900 | you have one player that's a feature extractor
00:48:39.180 | and another player that's a domain recognizer.
00:48:42.100 | The domain recognizer wants to look at the output
00:48:44.300 | of the feature extractor
00:48:45.740 | and guess which of the two domains the features came from.
00:48:49.340 | So it's a lot like the real versus fake
00:48:50.900 | discriminator in GANs.
00:48:52.500 | And then the feature extractor,
00:48:54.940 | you can think of as loosely analogous
00:48:56.860 | to the generator in GANs,
00:48:57.980 | except what it's trying to do here
00:48:59.140 | is both fool the domain recognizer
00:49:02.500 | into not knowing which domain the data came from
00:49:05.380 | and also extract features that are good for classification.
00:49:09.060 | So at the end of the day,
00:49:10.500 | in the cases where it works out,
00:49:13.780 | you can actually get features
00:49:16.900 | that work about the same in both domains.
00:49:20.660 | Sometimes this has a drawback where
00:49:22.900 | in order to make things work the same in both domains,
00:49:24.860 | it just gets worse at the first one.
00:49:26.780 | But there are a lot of cases
00:49:27.860 | where it actually works out well on both.
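
One common way to implement the two-player game just described is a gradient reversal layer: the forward pass is the identity, but the backward pass flips the gradient so the feature extractor is pushed to confuse the domain recognizer. This is a generic sketch with assumed layer sizes, not the original authors' exact code.

```python
# Sketch of domain-adversarial training via gradient reversal (assumed sizes).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)                    # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # flip (and scale) the gradient so the feature extractor learns to
        # fool the domain recognizer while the recognizer still learns normally
        return -ctx.lam * grad_output, None

features = nn.Sequential(nn.Linear(784, 128), nn.ReLU())   # shared feature extractor
label_head = nn.Linear(128, 10)                            # main-task classifier
domain_head = nn.Linear(128, 2)                            # source vs. target domain

def forward(x, lam=1.0):
    f = features(x)
    class_logits = label_head(f)                            # trained on labeled source data
    domain_logits = domain_head(GradReverse.apply(f, lam))  # trained on both domains
    return class_logits, domain_logits
```

Both heads are trained with ordinary cross-entropy; the reversal is what pushes the features' statistics to match across the two domains.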
00:49:30.820 | - So do you think of GANs being useful
00:49:33.020 | in the context of data augmentation?
00:49:35.460 | - Yeah, one thing you could hope for with GANs
00:49:38.100 | is you could imagine I've got a limited training set
00:49:41.380 | and I'd like to make more training data
00:49:43.900 | to train something else like a classifier.
00:49:46.060 | You could train the GAN on the training set
00:49:50.540 | and then create more data.
00:49:52.380 | And then maybe the classifier would perform better
00:49:55.220 | on the test set after training
00:49:56.500 | on this bigger GAN generated data set.
00:49:58.900 | So that's the simplest version
00:50:00.420 | of something you might hope would work.
00:50:03.060 | I've never heard of that particular approach working,
00:50:05.460 | but I think there's some closely related things
00:50:08.940 | that I think could work in the future
00:50:11.540 | and some that actually already have worked.
00:50:14.100 | So if we think a little bit about what we'd be hoping for
00:50:15.820 | if we use the GAN to make more training data,
00:50:18.220 | we're hoping that the GAN will generalize to new examples
00:50:22.060 | better than the classifier would have generalized
00:50:24.140 | if it was trained on the same data.
00:50:25.980 | And I don't know of any reason to believe
00:50:27.700 | that the GAN would generalize better
00:50:28.900 | than the classifier would.
00:50:30.260 | But what we might hope for
00:50:33.060 | is that the GAN could generalize differently
00:50:35.540 | from a specific classifier.
00:50:37.460 | So one thing I think is worth trying
00:50:39.140 | that I haven't personally tried, but someone could try
00:50:41.020 | is what if you trained a whole lot
00:50:43.380 | of different generative models on the same training set,
00:50:46.460 | create samples from all of them,
00:50:48.340 | and then train a classifier on that?
00:50:50.540 | Because each of the generative models might generalize
00:50:52.980 | in a slightly different way,
00:50:54.420 | they might capture many different axes of variation
00:50:56.940 | that one individual model wouldn't.
00:50:58.820 | And then the classifier can capture all of those ideas
00:51:01.860 | by training on all of their data.
00:51:03.540 | So it'd be a little bit
00:51:04.380 | like making an ensemble of classifiers.
00:51:06.260 | - Ensemble of GANs.
00:51:07.820 | - Yeah. - In a way.
00:51:08.820 | - I think that could generalize better.
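
As a rough sketch of the pooling idea floated above (which, as said, has not necessarily been tried), one could sample from several independently trained class-conditional generators and train the downstream classifier on the union; every name below is a placeholder.

```python
# Hypothetical sketch: pool samples from several class-conditional generators
# so the classifier sees the different ways each model generalizes.
import torch

def pooled_synthetic_data(generators, n_per_model=10_000, z_dim=100, n_classes=10):
    xs, ys = [], []
    for G in generators:                                 # each G generalizes differently
        z = torch.randn(n_per_model, z_dim)
        y = torch.randint(0, n_classes, (n_per_model,))  # requested class labels
        with torch.no_grad():
            xs.append(G(z, y))                           # assumes G(z, y) is class-conditional
        ys.append(y)
    return torch.cat(xs), torch.cat(ys)                  # train the classifier on this pool
```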
00:51:10.060 | The other thing that GANs are really good for
00:51:12.620 | is not necessarily generating new data
00:51:16.980 | that's exactly like what you already have,
00:51:19.340 | but by generating new data that has different properties
00:51:23.540 | from the data you already had.
00:51:25.300 | One thing that you can do
00:51:26.220 | is you can create differentially private data.
00:51:29.100 | So suppose that you have something like medical records,
00:51:31.860 | and you don't want to train a classifier
00:51:33.820 | on the medical records and then publish the classifier,
00:51:36.460 | because someone might be able to reverse engineer
00:51:38.140 | some of the medical records you trained on.
00:51:40.540 | There's a paper from Casey Greene's lab
00:51:42.780 | that shows how you can train a GAN
00:51:45.020 | using differential privacy.
00:51:46.980 | And then the samples from the GAN
00:51:48.980 | still have the same differential privacy guarantees
00:51:51.180 | as the parameters of the GAN.
00:51:52.700 | So you can make fake patient data
00:51:55.660 | for other researchers to use,
00:51:57.220 | and they can do almost anything they want with that data
00:51:59.180 | because it doesn't come from real people.
00:52:02.020 | And the differential privacy mechanism
00:52:04.260 | gives you clear guarantees
00:52:06.460 | on how much the original people's data has been protected.
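
The usual way to obtain guarantees like this is DP-SGD on the discriminator, since it is the only network that ever touches real records; the generator only sees the discriminator's feedback, so by post-processing its samples inherit the same guarantee. The sketch below is a generic illustration of that mechanism, not the specific method from the paper mentioned, and the constants and `loss_fn` are assumptions.

```python
# Generic DP-SGD-style discriminator update (illustrative constants, assumed loss_fn).
import torch

CLIP_NORM = 1.0    # bound on each example's gradient contribution (sensitivity)
NOISE_MULT = 1.1   # noise scale relative to CLIP_NORM; determines the privacy budget

def dp_discriminator_step(D, opt, real_batch, fake_batch, loss_fn):
    summed = [torch.zeros_like(p) for p in D.parameters()]
    for x_real, x_fake in zip(real_batch, fake_batch):      # per-example gradients
        opt.zero_grad()
        loss_fn(D, x_real.unsqueeze(0), x_fake.unsqueeze(0)).backward()
        norm = torch.sqrt(sum((p.grad ** 2).sum() for p in D.parameters()))
        scale = min(1.0, CLIP_NORM / (float(norm) + 1e-8))  # clip to CLIP_NORM
        for s, p in zip(summed, D.parameters()):
            s += p.grad * scale
    for s, p in zip(summed, D.parameters()):                # add calibrated Gaussian noise
        p.grad = (s + torch.randn_like(s) * NOISE_MULT * CLIP_NORM) / len(real_batch)
    opt.step()
```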
00:52:09.900 | - That's really interesting, actually.
00:52:11.340 | I haven't heard you talk about that before.
00:52:13.740 | In terms of fairness,
00:52:15.220 | I've seen, from your AAAI talk,
00:52:18.620 | how can adversarial machine learning
00:52:21.220 | help models be more fair
00:52:23.300 | with respect to sensitive variables?
00:52:25.700 | - Yeah, so there's a paper from Amos Storkey's lab
00:52:28.460 | about how to learn machine learning models
00:52:31.380 | that are incapable of using specific variables.
00:52:34.780 | So say, for example, you wanted to make predictions
00:52:36.660 | that are not affected by gender.
00:52:39.540 | It isn't enough to just leave gender
00:52:41.220 | out of the input to the model.
00:52:42.780 | You can often infer gender
00:52:43.980 | from a lot of other characteristics.
00:52:45.420 | Like, say that you have the person's name,
00:52:47.460 | but you're not told their gender.
00:52:48.580 | Well, if their name is Ian,
00:52:50.500 | they're kind of obviously a man.
00:52:52.100 | So what you'd like to do
00:52:54.540 | is make a machine learning model
00:52:55.620 | that can still take in a lot of different attributes
00:52:58.980 | and make a really accurate, informed prediction,
00:53:02.540 | but be confident that it isn't reverse engineering gender
00:53:05.740 | or another sensitive variable internally.
00:53:08.380 | You can do that using something very similar
00:53:10.260 | to the domain adversarial approach,
00:53:12.820 | where you have one player that's a feature extractor
00:53:16.100 | and another player that's a feature analyzer.
00:53:19.060 | And you want to make sure that the feature analyzer
00:53:21.420 | is not able to guess the value of the sensitive variable
00:53:24.700 | that you're trying to keep private.
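
A bare-bones sketch of that game, with made-up layer sizes and variables: the feature extractor serves the main task while being penalized whenever an adversary can recover the sensitive variable from its features (the adversary itself is trained in alternation to minimize `leak`).

```python
# Illustrative fairness-through-adversary sketch (assumed sizes and variables).
import torch
import torch.nn as nn
import torch.nn.functional as F

features = nn.Sequential(nn.Linear(64, 32), nn.ReLU())   # feature extractor
task_head = nn.Linear(32, 1)                             # e.g. a loan decision
adversary = nn.Linear(32, 2)                             # tries to guess the sensitive variable

def extractor_loss(x, y_task, s_sensitive, lam=1.0):
    f = features(x)
    task = F.binary_cross_entropy_with_logits(task_head(f).squeeze(1), y_task.float())
    leak = F.cross_entropy(adversary(f), s_sensitive)    # how well the sensitive variable can be recovered
    # be accurate on the task, but *maximize* the adversary's loss so the
    # features carry as little information about the sensitive variable as possible
    return task - lam * leak
```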
00:53:26.620 | - Right, that's, yeah, I love this approach.
00:53:29.060 | So, yeah, with the feature,
00:53:31.620 | you're not able to infer the sensitive variables.
00:53:36.020 | - Yeah. - It's brilliant.
00:53:36.860 | It's quite brilliant and simple, actually.
00:53:39.460 | - Another way I think that GANs in particular
00:53:42.740 | could be used for fairness
00:53:44.220 | would be to make something like a CycleGAN,
00:53:46.740 | where you can take data from one domain
00:53:49.700 | and convert it into another.
00:53:51.140 | We've seen CycleGAN turning horses into zebras.
00:53:53.860 | We've seen other unsupervised GANs made by Ming-Yu Liu
00:53:58.860 | doing things like turning day photos into night photos.
00:54:01.980 | I think for fairness,
00:54:04.780 | you could imagine taking records for people in one group
00:54:08.420 | and transforming them into analogous people in another group
00:54:11.500 | and testing to see if they're treated equitably
00:54:14.940 | across those two groups.
00:54:16.420 | There's a lot of things that'd be hard to get right
00:54:18.060 | to make sure that the conversion process itself is fair.
00:54:21.100 | And I don't think it's anywhere near
00:54:23.860 | something that we could actually use yet.
00:54:25.380 | But if you could design that conversion process
00:54:27.100 | very carefully, it might give you a way of doing audits
00:54:30.500 | where you say, what if we took people from this group,
00:54:33.100 | converted them into equivalent people in another group?
00:54:35.420 | Does the system actually treat them how it ought to?
00:54:38.660 | - That's also really interesting.
00:54:41.740 | In popular press and in general, in our imagination,
00:54:46.740 | you think, well, GANs are able to generate data
00:54:51.700 | and you start to think about deep fakes
00:54:54.500 | or being able to sort of maliciously generate data
00:54:57.900 | that fakes the identity of other people.
00:55:01.180 | Is this something of a concern to you?
00:55:03.140 | Is this something, if you look 10, 20 years into the future,
00:55:06.900 | is that something that pops up in your work,
00:55:10.340 | in the work of the community that's working
00:55:11.860 | on generating models?
00:55:13.540 | - I'm a lot less concerned about 20 years from now
00:55:15.860 | than the next few years.
00:55:17.380 | I think there will be a kind of bumpy cultural transition
00:55:20.820 | as people encounter this idea
00:55:23.140 | that there can be very realistic videos
00:55:24.660 | and audio that aren't real.
00:55:26.260 | I think 20 years from now, people will mostly understand
00:55:30.100 | that you shouldn't believe something is real
00:55:31.900 | just because you saw a video of it.
00:55:34.060 | People will expect to see that it's been cryptographically
00:55:36.700 | signed or have some other mechanism to make them believe
00:55:41.700 | that the content is real.
00:55:44.300 | There's already people working on this.
00:55:45.660 | Like there's a startup called Truepic
00:55:47.620 | that provides a lot of mechanisms for authenticating
00:55:50.860 | that an image is real.
00:55:51.980 | They're maybe not quite up to having a state actor
00:55:56.100 | try to evade their verification techniques,
00:55:59.820 | but it's something that people are already working on
00:56:02.380 | and I think will get right eventually.
00:56:04.140 | - So you think authentication will eventually win out?
00:56:08.300 | So being able to authenticate that this is real
00:56:10.740 | and this is not.
00:56:11.900 | - Yeah.
00:56:13.300 | - As opposed to GANs just getting better and better
00:56:15.780 | or generative models being able to get better and better
00:56:18.220 | to where the nature of what is real is normal.
00:56:21.500 | - I don't think we'll ever be able to look at the pixels
00:56:24.460 | of a photo and tell you for sure that it's real or not real.
00:56:28.580 | And I think it would actually be somewhat dangerous
00:56:32.780 | to rely on that approach too much.
00:56:35.140 | If you make a really good fake detector
00:56:36.820 | and then someone's able to fool your fake detector
00:56:38.900 | and your fake detector says this image is not fake,
00:56:42.140 | then it's even more credible
00:56:43.500 | than if you've never made a fake detector
00:56:45.100 | in the first place.
00:56:46.260 | What I do think we'll get to is systems
00:56:50.380 | that we can kind of use behind the scenes
00:56:53.300 | to make estimates of what's going on
00:56:55.580 | and maybe not like use them in court
00:56:57.820 | for a definitive analysis.
00:56:59.580 | I also think we will likely get better authentication systems
00:57:04.180 | where, you know, imagine that every phone
00:57:07.380 | cryptographically signs everything that comes out of it.
00:57:10.540 | You wouldn't be able to conclusively tell
00:57:12.820 | that an image was real,
00:57:14.540 | but you would be able to tell somebody
00:57:17.700 | who knew the appropriate private key for this phone
00:57:21.300 | was actually able to sign this image
00:57:24.340 | and upload it to this server at this timestamp.
00:57:27.460 | - Right.
00:57:28.940 | So you could imagine maybe you make phones
00:57:31.380 | that have the private keys hardware embedded in them.
00:57:34.300 | If like a state security agency
00:57:37.500 | really wants to infiltrate the company,
00:57:39.260 | they could probably, you know,
00:57:40.860 | plant a private key of their choice
00:57:42.540 | or break open the chip and learn the private key
00:57:45.100 | or something like that.
00:57:46.220 | But it would make it a lot harder
00:57:47.460 | for an adversary with fewer resources to fake things.
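
As a toy illustration of the signing scheme being described (the keys, payload format, and everything else here are assumptions; a real deployment would need secure hardware and key attestation), a device could sign the image bytes plus a timestamp and a server could verify them:

```python
# Toy device-signing sketch using Ed25519 (illustrative payload format).
import time
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

device_key = Ed25519PrivateKey.generate()   # would live inside the phone's secure hardware
public_key = device_key.public_key()        # registered with the verification server

def sign_capture(image_bytes: bytes):
    timestamp = str(int(time.time())).encode()
    payload = image_bytes + b"|" + timestamp
    return payload, device_key.sign(payload)

payload, signature = sign_capture(b"...raw image bytes...")
public_key.verify(signature, payload)       # raises InvalidSignature if anything was altered
```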
00:57:51.500 | - For most of us it would be okay.
00:57:52.860 | Okay.
00:57:53.700 | So you mentioned the beer and the bar and the new ideas.
00:57:58.340 | You were able to implement this
00:57:59.780 | or come up with this new idea pretty quickly
00:58:02.900 | and implement it pretty quickly.
00:58:04.420 | Do you think there's still many such groundbreaking ideas
00:58:07.740 | in deep learning that could be developed so quickly?
00:58:11.020 | - Yeah, I do think that there are a lot of ideas
00:58:13.020 | that can be developed really quickly.
00:58:14.860 | GANs were probably a little bit of an outlier
00:58:17.860 | on the whole like one hour time scale.
00:58:20.220 | But just in terms of like low resource ideas
00:58:24.260 | where you do something really different
00:58:25.580 | on the algorithm scale and get a big payback.
00:58:28.820 | I think it's not as likely that you'll see that
00:58:31.900 | in terms of things like core machine learning technologies
00:58:34.940 | like a better classifier
00:58:36.580 | or a better reinforcement learning algorithm
00:58:38.180 | or a better generative model.
00:58:39.580 | If I had the GAN idea today,
00:58:42.420 | it would be a lot harder to prove that it was useful
00:58:45.260 | than it was back in 2014
00:58:46.940 | because I would need to get it running on something
00:58:50.100 | like ImageNet or CelebA at high resolution.
00:58:54.060 | You know, those take a while to train.
00:58:55.540 | You couldn't train it in an hour
00:58:57.580 | and know that it was something really new and exciting.
00:59:01.020 | Back in 2014, training on MNIST was enough.
00:59:03.260 | But there are other areas of machine learning
00:59:06.780 | where I think a new idea could actually be developed
00:59:11.260 | really quickly with low resources.
00:59:13.260 | - What's your intuition about what areas
00:59:15.420 | of machine learning are ripe for this?
00:59:17.740 | - Yeah, so I think fairness and interpretability
00:59:23.140 | are areas where we just really don't have any idea
00:59:27.060 | how anything should be done yet.
00:59:29.060 | Like for interpretability,
00:59:30.380 | I don't think we even have the right definitions.
00:59:32.740 | And even just defining a really useful concept,
00:59:36.100 | you don't even need to run any experiments,
00:59:38.140 | could have a huge impact on the field.
00:59:40.100 | We've seen that, for example, in differential privacy
00:59:42.580 | that Cynthia Dwork and her collaborators
00:59:45.340 | made this technical definition of privacy
00:59:48.060 | where before a lot of things are really mushy
00:59:50.060 | and then with that definition,
00:59:51.620 | you could actually design randomized algorithms
00:59:54.260 | for accessing databases and guarantee
00:59:56.220 | that they preserved individual people's privacy
00:59:58.860 | in like a mathematical quantitative sense.
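
For reference, the definition being alluded to is (ε, δ)-differential privacy: a randomized mechanism M is differentially private if, for any two databases D and D′ that differ in a single person's record and any set of outputs S,

$$\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta.$$

Smaller ε and δ mean any one person's data has less influence on what the algorithm can output.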
01:00:01.820 | Right now, we all talk a lot about how interpretable
01:00:05.860 | different machine learning algorithms are,
01:00:07.580 | but it's really just people's opinion.
01:00:09.860 | And everybody probably has a different idea
01:00:11.300 | of what interpretability means in their head.
01:00:13.860 | If we could define some concept related to interpretability
01:00:17.020 | that's actually measurable,
01:00:18.780 | that would be a huge leap forward,
01:00:20.620 | even without a new algorithm that increases that quantity.
01:00:24.180 | And also once we had the definition of differential privacy,
01:00:28.780 | it was fast to get the algorithms that guaranteed it.
01:00:31.380 | So you could imagine once we have definitions
01:00:33.540 | of good concepts and interpretability,
01:00:35.740 | we might be able to provide the algorithms
01:00:37.580 | that have the interpretability guarantees quickly too.
01:00:40.540 | - What do you think it takes to build a system
01:00:46.900 | with human level intelligence
01:00:48.660 | as we quickly venture into the philosophical?
01:00:51.980 | So artificial general intelligence,
01:00:54.420 | what do you think it takes?
01:00:55.620 | - I think that it definitely takes better environments
01:01:00.620 | than we currently have for training agents,
01:01:03.780 | that we want them to have
01:01:05.300 | a really wide diversity of experiences.
01:01:08.740 | I also think it's gonna take really a lot of computation.
01:01:11.780 | It's hard to imagine exactly how much.
01:01:13.780 | - So you're optimistic about simulation,
01:01:16.300 | simulating a variety of environments
01:01:18.180 | as the path forward?
01:01:19.580 | - I think it's a necessary ingredient.
01:01:22.020 | Yeah, I don't think that we're going to get
01:01:24.740 | to artificial general intelligence
01:01:27.380 | by training on fixed data sets
01:01:29.740 | or by thinking really hard about the problem.
01:01:32.140 | I think that the agent really needs to interact
01:01:35.900 | and have a variety of experiences within the same lifespan.
01:01:40.900 | And today we have many different models
01:01:44.140 | that can each do one thing,
01:01:45.740 | and we tend to train them on one data set
01:01:47.580 | or one RL environment.
01:01:49.020 | Sometimes there are actually papers
01:01:51.420 | about getting one set of parameters
01:01:53.500 | to perform well in many different RL environments,
01:01:57.020 | but we don't really have anything like an agent
01:01:59.540 | that goes seamlessly from one type of experience to another
01:02:02.940 | and really integrates all the different things
01:02:05.300 | that it does over the course of its life.
01:02:08.060 | When we do see multi-agent environments,
01:02:10.580 | they tend to be,
01:02:12.420 | or sorry, multi-environment agents,
01:02:14.700 | they tend to be similar environments.
01:02:16.780 | Like all of them are playing like an action-based video game.
01:02:20.420 | We don't really have an agent that goes from
01:02:23.220 | playing a video game to like reading the Wall Street Journal
01:02:27.500 | to predicting how effective a molecule will be as a drug
01:02:31.260 | or something like that.
01:02:33.220 | - What do you think is a good test
01:02:35.140 | for intelligence in your view?
01:02:36.980 | There's been a lot of benchmarks,
01:02:38.660 | started with Alan Turing,
01:02:41.700 | natural conversation being a good benchmark for intelligence.
01:02:46.260 | What would Ian Goodfellow sit back
01:02:51.260 | and be really damn impressed
01:02:53.340 | if a system was able to accomplish?
01:02:56.020 | - Something that doesn't take a lot of glue
01:02:58.460 | from human engineers.
01:02:59.780 | So imagine that instead of having to go to the CIFAR website
01:03:04.780 | and download CIFAR 10
01:03:07.940 | and then write a Python script to parse it and all that,
01:03:11.340 | you could just point an agent at the CIFAR 10 problem
01:03:16.340 | and it downloads and extracts the data
01:03:19.180 | and trains a model and starts giving you predictions.
01:03:22.420 | I feel like something that doesn't need to have
01:03:25.980 | every step of the pipeline assembled for it
01:03:28.700 | definitely understands what it's doing.
01:03:30.460 | - Is AutoML moving into that direction
01:03:32.380 | or are you thinking way even bigger?
01:03:34.420 | - AutoML has mostly been moving toward,
01:03:37.260 | once we've built all the glue,
01:03:39.940 | can the machine learning system
01:03:42.180 | design the architecture really well?
01:03:44.340 | And so I'm more of saying,
01:03:45.740 | if something knows how to pre-process the data
01:03:49.580 | so that it successfully accomplishes the task,
01:03:52.340 | then it would be very hard to argue
01:03:53.500 | that it doesn't truly understand the task
01:03:56.220 | in some fundamental sense.
01:03:58.500 | And I don't necessarily know
01:03:59.540 | that that's the philosophical definition of intelligence,
01:04:02.260 | but that's something that would be really cool to build,
01:04:03.780 | that would be really useful and would impress me
01:04:05.580 | and would convince me that we've made a step forward
01:04:08.180 | in real AI.
01:04:09.420 | - So you give it the URL for Wikipedia
01:04:13.380 | and then next day expect it to be able to solve CIFAR-10.
01:04:18.380 | - Or you type in a paragraph
01:04:20.820 | explaining what you want it to do
01:04:22.180 | and it figures out what web searches it should run
01:04:24.780 | and downloads all the necessary ingredients.
01:04:28.300 | - So you have a very clear, calm way of speaking,
01:04:33.300 | no ums, easy to edit.
01:04:37.580 | I've seen comments where both you and I
01:04:40.220 | have been identified as potentially being robots.
01:04:44.180 | If you have to prove to the world that you are indeed human,
01:04:47.180 | how would you do it?
01:04:48.180 | - I can understand thinking that I'm a robot.
01:04:53.180 | - It's the flip side of the Turing test, I think.
01:04:57.780 | - Yeah, yeah, the "prove you're human" test.
01:05:00.420 | - Intellectually, so you have to,
01:05:03.580 | is there something that's truly unique
01:05:07.380 | in your mind, that doesn't go back
01:05:09.900 | to just natural language again,
01:05:11.620 | just being able to talk your way out of it?
01:05:13.860 | - Proving that I'm not a robot with today's technology,
01:05:17.060 | yeah, that's pretty straightforward.
01:05:18.740 | My conversation today hasn't veered off
01:05:20.780 | into talking about the stock market or something
01:05:24.380 | because of my training data.
01:05:25.940 | But I guess more generally,
01:05:27.500 | trying to prove that something is real
01:05:28.860 | from the content alone is incredibly hard.
01:05:31.420 | That's one of the main things I've gotten
01:05:32.460 | out of my GAN research,
01:05:33.500 | that you can simulate almost anything.
01:05:37.700 | And so you have to really step back
01:05:39.100 | to a separate channel to prove that something is real.
01:05:42.260 | So I guess I should have had myself stamped
01:05:45.540 | on a blockchain when I was born or something,
01:05:47.700 | but I didn't do that.
01:05:48.620 | So according to my own research methodology,
01:05:50.820 | there's just no way to know at this point.
01:05:52.980 | - So what, last question, problem stands out for you
01:05:56.340 | that you're really excited about challenging
01:05:58.380 | in the near future?
01:05:59.940 | - So I think resistance to adversarial examples,
01:06:02.940 | figuring out how to make machine learning secure
01:06:05.540 | against an adversary who wants to interfere with it
01:06:07.500 | and control it, that is one of the most important things
01:06:10.700 | researchers today could solve.
01:06:12.180 | - In all domains, image, language, driving, and everything.
01:06:17.180 | - I guess I'm most concerned about domains
01:06:19.820 | we haven't really encountered yet.
01:06:22.020 | Like imagine 20 years from now
01:06:24.060 | when we're using advanced AIs
01:06:26.340 | to do things we haven't even thought of yet.
01:06:28.980 | Like if you ask people,
01:06:30.660 | what are the important problems in security of phones
01:06:35.140 | in like 2002, I don't think we would have anticipated
01:06:38.940 | that we're using them for nearly as many things
01:06:42.180 | as we're using them for today.
01:06:43.660 | I think it's gonna be like that with AI
01:06:44.900 | that you can kind of try to speculate about where it's going
01:06:47.940 | but really the business opportunities that end up taking off
01:06:51.060 | would be hard to predict ahead of time.
01:06:54.220 | What you can predict ahead of time is that
01:06:56.460 | almost anything you can do with machine learning,
01:06:58.380 | you would like to make sure that people can't get it
01:07:02.140 | to do what they want rather than what you want
01:07:04.660 | just by showing it a funny QR code or a funny input pattern.
01:07:08.540 | - And you think that the set of methodology to do that
01:07:11.060 | can be bigger than any one domain?
01:07:12.900 | And that's the-- - I think so, yeah.
01:07:14.180 | Yeah, like one methodology that I think is,
01:07:19.180 | not a specific methodology,
01:07:20.700 | but like a category of solutions
01:07:22.820 | that I'm excited about today is making dynamic models
01:07:25.740 | that change every time they make a prediction.
01:07:28.260 | So right now we tend to train models
01:07:31.180 | and then after they're trained, we freeze them
01:07:33.180 | and we just use the same rule to classify everything
01:07:36.300 | that comes in from then on.
01:07:38.260 | That's really a sitting duck from a security point of view.
01:07:41.580 | If you always output the same answer for the same input,
01:07:45.540 | then people can just run inputs through
01:07:48.340 | until they find a mistake that benefits them.
01:07:50.220 | And then they use the same mistake
01:07:51.820 | over and over and over again.
01:07:53.260 | I think having a model that updates its predictions
01:07:56.580 | so that it's harder to predict what you're gonna get
01:08:00.420 | will make it harder for an adversary
01:08:02.820 | to really take control of the system
01:08:04.900 | and make it do what they want it to do.
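
A minimal sketch of that "moving target" category, purely illustrative and not a claim about any particular defense: inference randomly picks an ensemble member and lightly randomizes the input, so the same query does not reliably produce the same answer.

```python
# Illustrative randomized-inference sketch (assumed ensemble and noise scale).
import random
import torch

def dynamic_predict(models, x, noise_std=0.01):
    model = random.choice(models)                       # a different model each query
    x_jittered = x + noise_std * torch.randn_like(x)    # small input randomization
    with torch.no_grad():
        return model(x_jittered).argmax(dim=1)          # predicted class indices
```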
01:08:06.180 | - Yeah, models that maintain a bit of a sense of mystery
01:08:09.820 | about them 'cause they always keep changing.
01:08:12.180 | - Yeah.
01:08:13.020 | - Ian, thanks so much for talking today.
01:08:14.340 | It was awesome.
01:08:15.180 | - Thank you for coming in.
01:08:16.020 | It's great to see you.
01:08:17.020 | (upbeat music)