Ian Goodfellow: Generative Adversarial Networks (GANs) | Lex Fridman Podcast #19
Chapters
0:00 Introduction
1:08 Deep learning limitations
2:42 Function estimators
6:53 Self-awareness
8:58 Difficult cases
12:50 Hidden voice commands
14:00 Writing a deep learning chapter
16:44 What is deep learning
18:28 What is an example of deep learning
20:36 What could an alternative direction of training neural networks look like
21:43 Are you optimistic about us discovering something better
24:17 How do we build knowledge representation
25:17 Differentiable knowledge bases
26:40 GANs at a bar
27:54 Deep Boltzmann machines
30:20 What are GANs
33:26 How do GANs work
36:44 Types of GANs
39:51 History of GANs
43:31 Semi-supervised GANs
44:26 Class labels
46:22 Zebra cycle
49:31 Data augmentation
52:10 Fairness
00:00:00.000 |
The following is a conversation with Ian Goodfellow. 00:00:03.720 |
He's the author of the popular textbook on deep learning, 00:00:08.920 |
He coined the term of generative adversarial networks, 00:00:37.600 |
and now at Apple as the director of machine learning. 00:00:41.560 |
This recording happened while Ian was still at Google Brain, 00:00:45.400 |
but we don't talk about anything specific to Google 00:00:54.520 |
If you enjoy it, subscribe on YouTube, iTunes, 00:00:57.560 |
or simply connect with me on Twitter @lexfridman, 00:01:03.000 |
And now, here's my conversation with Ian Goodfellow. 00:01:17.120 |
which in turn is a subset of machine learning, 00:01:22.520 |
So this kind of implies that there may be limits 00:01:27.720 |
So what do you think is the current limits of deep learning, 00:01:35.760 |
- Yeah, I think one of the biggest limitations 00:01:39.320 |
it requires really a lot of data, especially labeled data. 00:01:47.140 |
that can reduce the amount of labeled data you need, 00:01:49.480 |
but they still require a lot of unlabeled data. 00:01:52.200 |
Reinforcement learning algorithms, they don't need labels, 00:02:01.600 |
So just getting the generalization ability better 00:02:25.560 |
You use deep learning as sub-modules of other systems, 00:02:42.520 |
- So you're basically building a function estimator. 00:02:50.180 |
about this so far, but do you think neural networks 00:02:52.280 |
could be made to reason in the way symbolic systems did 00:02:58.800 |
create more like programs as opposed to functions? 00:03:01.480 |
- Yeah, I think we already see that a little bit. 00:03:03.980 |
I already kind of think of neural nets as a kind of program. 00:03:08.880 |
I think of deep learning as basically learning programs 00:03:23.540 |
as describing the number of steps that run in sequence, 00:03:43.740 |
You could have a lot of input features to the model, 00:03:45.660 |
and you could multiply each feature by a different weight. 00:03:48.140 |
All those multiplications were done in parallel 00:03:54.360 |
was really the ability to have steps of a program 00:04:00.340 |
And I think that we've actually started to see 00:04:05.020 |
is more the fact that we have a multi-step program 00:04:07.980 |
rather than the fact that we've learned a representation. 00:04:10.780 |
If you look at things like ResNets, for example, 00:04:15.140 |
they take one particular kind of representation 00:04:21.060 |
Back when deep learning first really took off 00:04:40.420 |
and eventually you get these kind of grandmother cell units 00:04:51.980 |
you can do more updates before you output your final number. 00:04:56.420 |
that layer 150 of the ResNet is a grandmother cell, 00:05:01.420 |
and layer 100 is contours or something like that. 00:05:08.180 |
as a singular representation that keeps building. 00:05:21.500 |
and arrives at better and better understandings, 00:05:23.820 |
but it's not replacing the representation at each step. 00:05:29.160 |
And in some sense, that's a little bit like reasoning. 00:05:33.560 |
but it's reasoning in the form of taking a thought 00:05:41.260 |
- So do you think, and I hope you don't mind, 00:05:43.580 |
we'll jump philosophical every once in a while. 00:05:53.500 |
of this kind of sequential representation learning, 00:06:06.440 |
I guess there's, consciousness is often defined 00:06:12.060 |
and that's relatively easy to turn into something actionable 00:06:19.700 |
in terms of having qualitative states of experience, 00:06:22.420 |
like qualia, and there's all these philosophical problems, 00:06:27.820 |
who does all the same information processing as a human, 00:06:30.700 |
but doesn't really have the qualitative experiences 00:06:34.660 |
That sort of thing, I have no idea how to formalize 00:06:44.860 |
And similarly, I don't know how you could run an experiment 00:06:49.640 |
had become conscious in the sense of qualia or not. 00:06:58.900 |
in an impressive way, emerge from current types 00:07:03.220 |
of architectures that we think of as deep learning. 00:07:07.940 |
in terms of self-awareness and just making plans 00:07:12.180 |
based on the fact that the agent itself exists in the world, 00:07:20.140 |
to model the agent's effect on the environment. 00:07:23.060 |
So that more limited version of consciousness 00:07:26.340 |
is already something that we get limited versions of 00:07:52.500 |
if we get much better on supervised learning, 00:08:00.620 |
do you think we'll start to see really impressive things 00:08:16.420 |
I do think it'll be important to get the right kind of data. 00:08:20.100 |
Today, most of the machine learning systems we train 00:08:23.140 |
are mostly trained on one type of data for each model. 00:08:27.540 |
But the human brain, we get all of our different senses 00:08:37.940 |
I think when you get that kind of integrated data set 00:08:44.420 |
that can actually close the loop and interact, 00:08:50.460 |
from what we have today learn really interesting things 00:08:54.380 |
and train them on a large amount of multimodal data. 00:08:59.620 |
but within, like you're working adversarial examples, 00:09:04.020 |
so selecting within modal, within one mode of data, 00:09:09.020 |
selecting better what are the difficult cases 00:09:16.140 |
- Oh yeah, like could we get a whole lot of mileage 00:09:22.260 |
to adversarial examples or something like that? 00:09:32.740 |
I was thinking of it mostly as adversarial examples 00:09:43.700 |
respond to adversarial examples and how humans respond. 00:09:49.180 |
I still think that adversarial examples are important. 00:09:51.940 |
I think of them now more of as a security liability 00:09:57.780 |
there's something uniquely wrong with machine learning 00:10:06.460 |
Not on the security side, but literally just accuracy. 00:10:10.780 |
- I do see them as a kind of tool on that side, 00:10:13.460 |
but maybe not quite as much as I used to think. 00:10:16.660 |
We've started to find that there's a trade-off 00:10:29.060 |
that showed resistance to some kinds of adversarial examples, 00:10:33.020 |
it also got better at the clean data on MNIST. 00:10:39.020 |
that when we train against weak adversarial examples, 00:10:43.900 |
So far, that hasn't really held up on other data sets 00:11:00.540 |
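A minimal sketch of the adversarial-training recipe being discussed, using the fast gradient sign method from Goodfellow's "Explaining and Harnessing Adversarial Examples"; the tiny model, random stand-in data, and epsilon value below are illustrative placeholders, not the MNIST setup from the conversation:

```python
import torch
import torch.nn as nn

# Placeholder model and data standing in for an MNIST classifier.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(),
                      nn.Linear(128, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
eps = 0.1                                   # illustrative perturbation size

x = torch.rand(64, 1, 28, 28)               # stand-in for an image batch
y = torch.randint(0, 10, (64,))

for step in range(10):
    # Craft adversarial examples: nudge each pixel by eps in the direction
    # that increases the loss (the "fast gradient sign" step).
    x_adv = x.clone().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    x_adv = (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()

    # Train on a mix of clean and adversarial examples.
    opt.zero_grad()
    loss = loss_fn(model(x), y) + loss_fn(model(x_adv), y)
    loss.backward()
    opt.step()
```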
'cause it feels like that's how us humans learn 00:11:11.020 |
It's also, in a lot of branches of engineering, 00:11:15.820 |
and make sure that your system will work in the worst case. 00:11:23.580 |
that happen when you go out into a really randomized world. 00:11:27.420 |
- Yeah, with driving with autonomous vehicles, 00:11:36.900 |
And if you can be robust to all those difficult cases, 00:11:49.100 |
isn't really focused on a particular use case, 00:11:54.020 |
where you'd like to make sure that the adversary 00:11:56.940 |
can't interfere with the operation of your system. 00:12:01.060 |
if you have an algorithm making trades for you, 00:12:17.140 |
because you don't want people to make adversarial examples 00:12:19.500 |
that fool your algorithm into making bad trades. 00:12:26.580 |
in the academic literature is speech recognition. 00:12:30.180 |
If you use speech recognition to hear an audio waveform 00:12:47.820 |
doesn't realize that something like that is happening. 00:13:10.780 |
they could make sounds that are not understandable 00:13:13.780 |
by a human, but are recognized as the target phrase 00:13:18.420 |
that the attacker wants the phone to recognize it as. 00:13:21.340 |
Since then, things have gotten a little bit better 00:13:24.020 |
on the attacker side and worse on the defender side. 00:13:35.580 |
but are actually interpreted as a different sentence 00:13:42.740 |
of the adversarial perturbation is still kind of high. 00:13:48.180 |
it sounds like there's some noise in the background, 00:13:55.540 |
that makes the phone hear a completely different sentence. 00:14:01.620 |
the deep learning chapter for the fourth edition 00:14:04.260 |
of the "Artificial Intelligence, a Modern Approach" book. 00:14:19.180 |
Is it, even having written a full length textbook before, 00:14:22.660 |
it's still pretty intimidating to try to start writing 00:14:42.300 |
that were maybe extraneous in the first book. 00:14:49.420 |
and what seems a little bit less important to have included 00:15:00.580 |
to the point where some core ideas from the 1980s 00:15:04.780 |
When I first started studying machine learning, 00:15:06.660 |
almost everything from the 1980s had been rejected 00:15:11.340 |
So that stuff that's really stood the test of time 00:15:15.940 |
There's also, I guess, two different philosophies 00:15:23.140 |
One philosophy is you try to write a reference 00:15:32.420 |
and tells them what the most important concepts are. 00:15:45.780 |
Writing this chapter for Russell and Norvig's book, 00:15:48.940 |
I was able to focus more on just a concise introduction 00:15:55.980 |
In a lot of cases, I actually just wrote paragraphs 00:16:01.900 |
"It's pointless to try to tell you what the latest 00:16:04.660 |
"and best version of a learn-to-learn model is." 00:16:09.660 |
I can point you to a paper that's recent right now, 00:16:24.980 |
You should know that learning-to-learn is a thing 00:16:32.220 |
or recurrent net module that you would want to use 00:16:36.060 |
But there isn't a lot of point in trying to summarize 00:16:38.180 |
exactly which architecture and which learning approach 00:16:58.700 |
algorithms and data structures algorithms course. 00:17:03.740 |
I remember the professor asked, "What is an algorithm?" 00:17:14.100 |
Everybody knew what an algorithm was, it was a graduate course. 00:17:23.620 |
- I would say deep learning is any kind of machine learning 00:17:36.020 |
So that would mean shallow learning is things 00:17:39.620 |
where you learn a lot of operations that happen in parallel. 00:17:43.780 |
You might have a system that makes multiple steps, 00:17:46.740 |
like you might have hand-designed feature extractors, 00:17:52.660 |
Deep learning is anything where you have multiple operations 00:17:59.820 |
like convolutional networks and recurrent networks, 00:18:10.900 |
Today I hear a lot of people define deep learning 00:18:21.500 |
And I think that's a legitimate usage of the term. 00:18:31.780 |
that is not gradient descent and differentiable functions? 00:18:39.820 |
what's your thought about that space of approaches? 00:18:44.340 |
- Yeah, so I tend to think of machine learning algorithms 00:18:46.380 |
as decomposed into really three different pieces. 00:18:50.220 |
There's the model, which can be something like a neural net 00:18:56.620 |
And that basically just describes how do you take data 00:19:01.140 |
And what function do you use to make a prediction 00:19:12.380 |
or not every algorithm can be really described 00:19:15.900 |
but what's the algorithm for updating the parameters 00:19:18.860 |
or updating whatever the state of the network is? 00:19:29.180 |
as it comes into your machine learning system? 00:19:32.100 |
So I think of deep learning as telling us something 00:19:41.260 |
I say that it just has to have multiple layers. 00:19:46.340 |
in a feed-forward differentiable computation. 00:19:49.220 |
That can be multiple layers in a graphical model. 00:19:52.020 |
There's a lot of ways that you could satisfy me 00:20:01.900 |
how do you actually update the parameters piece? 00:20:07.540 |
and training it with something like evolution 00:20:11.300 |
And I would say that still qualifies as deep learning. 00:20:35.820 |
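A minimal sketch of that point, on an assumed toy XOR task: the model below has multiple sequential layers (so it is "deep" in the sense used here), but its parameters are fit with a simple evolution strategy rather than gradient descent. All layer sizes and the mutation scale are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])          # XOR targets

def forward(params, x):
    W1, b1, W2, b2 = params
    h = np.tanh(x @ W1 + b1)                  # layer 1
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # layer 2, sigmoid output

def loss(params):
    return np.mean((forward(params, X).ravel() - y) ** 2)

def init_params():
    return [rng.normal(0, 1, (2, 8)), np.zeros(8),
            rng.normal(0, 1, (8, 1)), np.zeros(1)]

best = init_params()
for generation in range(5000):
    # Mutate every parameter tensor with Gaussian noise and keep the child
    # only if it scores better: a (1+1) evolution strategy, no gradients used.
    child = [p + 0.1 * rng.normal(size=p.shape) for p in best]
    if loss(child) < loss(best):
        best = child

print(np.round(forward(best, X).ravel(), 2))  # typically moves toward [0, 1, 1, 0]
```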
- So it's the steps of processing that's key. 00:20:38.980 |
So Jeff Hinton suggests that we need to throw away 00:20:59.220 |
isn't on the critical path to research for improving AI, 00:21:04.660 |
It just becomes used for some specialized set of things. 00:21:14.020 |
who are working on things like speech recognition 00:21:18.460 |
But there's still a lot of use for logistic regression 00:21:30.740 |
So I think back propagation and gradient descent 00:21:33.500 |
are around to stay, but they may not end up being 00:21:37.500 |
everything that we need to get to real human level 00:21:44.780 |
back propagation has been around for a few decades. 00:21:50.260 |
So are you optimistic about us as a community 00:21:57.660 |
I think we likely will find something that works better. 00:22:01.820 |
You could imagine things like having stacks of models 00:22:07.580 |
predict parameters of the higher level models. 00:22:14.460 |
but just predicting how different values will perform. 00:22:17.700 |
You can kind of see that already in some areas 00:22:27.700 |
for things like hyper parameter optimization. 00:22:41.180 |
and having it really advance the state of the art 00:23:08.460 |
working quite right yet is like short-term memory. 00:23:21.820 |
Like gradient descent to learn a specific fact 00:23:29.420 |
Like if I tell you the meeting today is at 3 p.m., 00:23:35.500 |
I don't have to tell you over and over again, it's at 3 p.m., it's at 3 p.m., it's at 3 p.m., 00:23:37.820 |
it's at 3 p.m. for you to do a gradient step on each one. 00:23:52.220 |
and update themselves with facts like that right away. 00:23:54.900 |
But I don't think we've really nailed it yet. 00:24:08.820 |
updating the state of a machine learning system 00:24:16.980 |
- So some of the success of symbolic systems in the '80s 00:24:21.420 |
is they were able to assemble these kinds of facts better. 00:24:33.700 |
as something that we'll have to return to eventually, 00:24:51.180 |
which has mostly been machine learning security 00:24:56.740 |
I haven't usually found myself moving in that direction. 00:25:00.540 |
For generative models, I could see a little bit of, 00:25:16.860 |
- I mean, neural network is kind of like that. 00:25:19.020 |
It's a differentiable knowledge base of sorts. 00:25:23.620 |
- If we had a really easy way of giving feedback 00:25:29.260 |
that would clearly help a lot with generative models. 00:25:32.380 |
And so you could imagine one way of getting there 00:25:33.900 |
would be get a lot better at natural language processing. 00:25:44.060 |
- Being able to have a chat with a neural network. 00:25:47.860 |
So like one thing in generative models we see a lot today 00:25:49.980 |
is you'll get things like faces that are not symmetrical, 00:25:53.540 |
like people that have two eyes that are different colors. 00:26:00.820 |
but not nearly as many of them as you tend to see 00:26:10.180 |
people's faces are generally approximately symmetric 00:26:30.140 |
without bringing back some of the 1980s technology, 00:26:32.180 |
but I also see some ways that you could imagine 00:26:49.580 |
GANs would work, generative adversarial networks, 00:27:09.300 |
What was the basis of your intuition why it should work? 00:27:15.980 |
promoting alcohol for the purposes of science, 00:27:32.460 |
that I'm less prone to shooting down some of my own ideas 00:27:49.820 |
was that trying to train two neural nets at the same time 00:28:03.180 |
would not be able to generate anything reasonable, 00:28:08.260 |
- Yeah, so part of what all of us were thinking about 00:28:11.360 |
when we had this conversation was deep Boltzmann machines, 00:28:16.980 |
were a big fan of deep Boltzmann machines at the time. 00:28:31.180 |
and tell the model to make the data more likely. 00:28:37.020 |
and tell the model to make those samples less likely. 00:28:43.960 |
You have to actually run an iterative process 00:28:53.900 |
you're always running these two systems at the same time. 00:28:57.180 |
One that's updating the parameters of the model 00:28:58.940 |
and another one that's trying to generate samples 00:29:01.680 |
And they worked really well on things like MNIST, 00:29:07.500 |
to scale past MNIST to things like generating color photos. 00:29:18.740 |
a lot of people thought that the discriminator 00:29:25.340 |
That trying to train the discriminator in the inner loop, 00:29:41.940 |
- A lot of the time with machine learning algorithms, 00:29:46.900 |
You have to just run the experiment and see what happens. 00:29:49.140 |
And I would say I still today don't have one factor 00:29:54.740 |
"This is why GANs worked for photo generation 00:30:03.300 |
showing that under some theoretical settings, 00:30:14.140 |
that they don't necessarily explain the whole picture 00:30:17.540 |
in terms of all the results that we see in practice. 00:30:22.300 |
can you, in the same way as we talked about deep learning, 00:30:24.860 |
can you tell me what generative adversarial networks are? 00:30:33.980 |
A generative model is a machine learning model 00:30:38.860 |
Like say you have a collection of photos of cats 00:30:41.220 |
and you want to generate more photos of cats, 00:30:43.980 |
or you want to estimate a probability distribution over cats 00:30:55.800 |
Some generative models are good at creating new data. 00:30:59.180 |
Other generative models are good at estimating 00:31:06.580 |
to come from the same distribution as the training data. 00:31:15.620 |
There are some kinds of GANs like FlowGAN that can do both, 00:31:18.500 |
but mostly GANs are about generating samples, 00:31:21.620 |
generating new photos of cats that look realistic. 00:31:41.020 |
It isn't doing something like compositing photos together. 00:31:44.540 |
You're not literally taking the eye off of one cat 00:32:01.980 |
What's specific to GANs is that we have a two-player game 00:32:10.340 |
one of them becomes able to generate realistic data. 00:32:16.140 |
It produces output data, such as just images, for example. 00:32:25.140 |
The other player is called the discriminator. 00:32:50.980 |
at recognizing whether images are real or fake. 00:32:59.620 |
And you can analyze this through the language of game theory 00:33:18.740 |
because all the samples coming from both the data 00:33:28.380 |
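The two-player game he describes can be written as the minimax objective from the original GAN paper, with generator G, discriminator D, data distribution p_data, and noise prior p_z:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\!\left[\log\left(1 - D(G(z))\right)\right]
```

At the equilibrium of this game, the generator's distribution matches the data distribution and the discriminator outputs 1/2 everywhere, which is the game-theoretic sense in which the generated samples become indistinguishable from real data.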
and does it just blow your mind that this thing works? 00:33:33.380 |
so it's able to estimate the identity function 00:33:44.220 |
how does this even, why, this is quite incredible, 00:33:55.460 |
that if they really did what we asked them to do, 00:33:58.860 |
they would do nothing but memorize the training data. 00:34:01.940 |
- Models that are based on maximizing the likelihood, 00:34:05.780 |
the way that you obtain the maximum likelihood 00:34:15.140 |
For GANs, the game is played using a training set. 00:34:18.420 |
So the way that you become unbeatable in the game 00:34:33.060 |
for the generator to memorize the training data, 00:34:42.180 |
for why it would require quite a lot of learning steps 00:34:47.180 |
and a lot of observations of different latent variables 00:35:03.740 |
And I don't think we really have a good answer for that, 00:35:10.260 |
and how few images the generative model sees during training. 00:35:22.740 |
training them to memorize rather than generalize. 00:35:30.860 |
where they show that you can take a convolutional net 00:35:33.100 |
and you don't even need to learn the parameters of it at all, 00:35:37.700 |
And it's already useful for things like in-painting images. 00:35:54.060 |
That would imply that it would be much harder 00:36:01.300 |
So far, we're able to make reasonable speech models 00:36:11.500 |
see a lot of deep learning models of biology data sets, 00:36:26.900 |
turns out to really rely heavily on the model architecture. 00:36:30.140 |
And we were able to do what we did for vision 00:36:33.020 |
by trying to reverse engineer the human visual system. 00:36:39.820 |
use that same trick for arbitrary kinds of data. 00:36:42.580 |
- Right, so there's aspects of the human vision system, 00:36:51.140 |
just makes it really effective at detecting the patterns 00:37:06.300 |
and what other generative models besides GANs are there? 00:37:10.100 |
- Yeah, so it's maybe a little bit easier to start with 00:37:13.540 |
what kinds of generative models are there other than GANs. 00:37:16.900 |
So most generative models are likelihood-based, 00:37:20.900 |
where to train them, you have a model that tells you 00:37:24.900 |
how much probability it assigns to a particular example, 00:37:33.700 |
It turns out that it's hard to design a model 00:37:46.220 |
the likelihood function from a computational point of view. 00:37:53.820 |
write down intuitively, it turns out that it's almost 00:37:56.420 |
impossible to calculate the amount of probability 00:38:00.780 |
So there's a few different schools of generative models 00:38:06.260 |
One approach is to very carefully design the model 00:38:12.780 |
to measure the density it assigns to a particular point. 00:38:15.540 |
So there are things like autoregressive models, 00:38:28.660 |
So for an image, you estimate the probability 00:38:31.540 |
of each pixel, given all of the pixels that came before it. 00:38:37.660 |
the density function, you can actually calculate 00:38:40.620 |
the density for all these pixels more or less in parallel. 00:38:43.500 |
Generating the image still tends to require you 00:38:46.860 |
to go one pixel at a time, and that can be very slow. 00:39:07.460 |
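A toy sketch of the autoregressive pattern described here, over a short sequence of binary "pixels"; the conditional-probability rule is a made-up stand-in for a trained PixelCNN/PixelRNN, chosen only to show why density evaluation is straightforward while sampling is inherently one step at a time:

```python
import numpy as np

def cond_prob_of_one(prefix):
    # Hypothetical conditional p(next pixel = 1 | pixels so far); a real
    # autoregressive model would compute this with a masked neural net.
    if len(prefix) == 0:
        return 0.5
    return 0.25 + 0.5 * np.mean(prefix)

def log_likelihood(x):
    # Every conditional depends only on observed pixels, so (with a masked
    # convolutional model) all terms can be computed in parallel; the loop
    # here is just for clarity.
    logp = 0.0
    for i, xi in enumerate(x):
        p1 = cond_prob_of_one(x[:i])
        logp += np.log(p1 if xi == 1 else 1.0 - p1)
    return logp

def sample(n_pixels, rng=np.random.default_rng(0)):
    # Sampling is sequential: pixel i can only be drawn after pixels 0..i-1
    # exist, which is why autoregressive generation tends to be slow.
    x = []
    for _ in range(n_pixels):
        x.append(int(rng.random() < cond_prob_of_one(x)))
    return x

img = sample(16)
print(img, log_likelihood(img))
```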
are from GANs these days, but it can be hard to tell 00:39:14.700 |
which type of algorithm, if that makes sense. 00:39:17.300 |
- The amount of effort invested in a particular-- 00:39:21.420 |
So a lot of people who've traditionally been excited 00:39:28.740 |
are GANs doing better because they have a lot of 00:39:38.900 |
or are GANs doing better because they prioritize 00:39:45.500 |
I think all of those are potentially valid explanations, 00:40:00.980 |
In the first paper, we just showed that GANs basically work. 00:40:15.020 |
- We used MNIST, which is little handwritten digits. 00:40:32.980 |
which is things like very small 32 by 32 pixels 00:40:40.660 |
For that, we didn't get recognizable objects, 00:40:46.180 |
were really used to looking at these failed samples 00:40:50.420 |
And people who are used to reading the tea leaves 00:40:53.020 |
recognize that our tea leaves at least look different. 00:41:06.180 |
by Emily Denton and Soumith Chintala at Facebook AI Research, 00:41:10.900 |
where they actually got really good high-resolution photos 00:41:16.580 |
They had a complicated system where they generated 00:41:18.860 |
the image starting at low-res and then scaling up to high-res, 00:41:24.900 |
And then in 2015, I believe, later that same year, 00:41:29.900 |
Alec Radford and Soumith Chintala and Luke Metz 00:41:46.420 |
and even some before that were deep and convolutional, 00:41:50.220 |
for a really great recipe where they were able to actually, 00:41:54.020 |
using only one model instead of a multi-step process, 00:42:07.380 |
Like, once you had animals that had a backbone, 00:42:09.740 |
you suddenly got lots of different versions of fish 00:42:12.900 |
and four-legged animals and things like that. 00:42:23.140 |
And so from there, I would say some interesting things 00:42:30.940 |
of standard image generation GANs has increased, 00:42:40.060 |
One thing is that you can use them to learn classifiers 00:42:44.580 |
without having to have class labels for every example 00:42:51.780 |
My colleague at OpenAI, Tim Salimans, who's at Brain now, 00:42:55.820 |
wrote a paper called "Improved Techniques for Training GANs." 00:43:00.900 |
but I can't claim any credit for this particular part. 00:43:07.820 |
and use it as a classifier that actually tells you, 00:43:11.340 |
you know, this image is a cat, this image is a dog, 00:43:13.620 |
this image is a car, this image is a truck, and so on. 00:43:16.420 |
Not just to say whether the image is real or fake, 00:43:22.620 |
And he found that you can train these classifiers 00:43:25.340 |
with far fewer labeled examples than traditional classifiers. 00:43:35.300 |
but your ability to classify, you're going to do much, 00:43:46.340 |
you want to look at an image of a handwritten digit 00:43:48.860 |
and say whether it's a zero, a one, or a two, and so on. 00:44:02.780 |
In 2016, with this semi-supervised GAN project, 00:44:21.100 |
but he doesn't need to have each of them labeled as, 00:44:23.460 |
you know, this one's a one, this one's a two, 00:44:27.020 |
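A small sketch of the mechanism from "Improved Techniques for Training GANs" that makes this possible: the discriminator outputs K class logits, and the probability that an input is real (rather than generated) is derived from those same logits, so labeled, unlabeled, and generated examples can all contribute to training. The numbers below are purely illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def p_real(class_logits):
    # p(real | x) = Z(x) / (Z(x) + 1), with Z(x) = sum_k exp(logit_k);
    # the "fake" class acts as an implicit extra logit fixed at 0.
    Z = np.exp(class_logits).sum(axis=-1)
    return Z / (Z + 1.0)

logits = np.array([[2.0, 0.1, -1.0],     # a labeled real example (3 classes)
                   [-2.0, -1.5, -2.5]])  # a generated example
print(softmax(logits))   # supervised loss uses these class probabilities
print(p_real(logits))    # unsupervised real-vs-fake loss uses these
```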
- Then to be able to, for GANs to be able to generate 00:44:30.020 |
recognizable objects, so objects from a particular class, 00:44:49.060 |
on semi-supervised GANs where their goal isn't to classify, 00:44:58.700 |
They were working off of DeepMind's BigGAN project, 00:45:02.420 |
and they showed that they can match the performance 00:45:20.260 |
with only having about 10% of the images labeled. 00:45:24.620 |
And they do that essentially using a clustering algorithm 00:45:29.900 |
where the discriminator learns to assign the objects 00:45:36.340 |
that objects can be grouped into similar types 00:45:47.980 |
has to come from one of these archetypal groups 00:45:55.140 |
you tend to get things that look sort of like 00:46:00.500 |
but without necessarily a lot going on in them. 00:46:07.900 |
the object doesn't necessarily occupy the whole image. 00:46:11.260 |
And so you learn to create realistic sets of pixels, 00:46:20.140 |
and you want it to be in every image you make. 00:46:27.060 |
and how it turns out, again, thought-provoking, 00:46:35.740 |
So when you're doing that kind of generation, 00:46:38.220 |
you're going to end up generating greener horses or whatever. 00:46:52.360 |
So are there other types of games you come across 00:46:55.060 |
in your mind that neural networks can play with each other 00:47:05.220 |
- Yeah, the one that I spend most of my time on 00:47:07.700 |
is in security, you can model most interactions as a game 00:47:12.700 |
where there's attackers trying to break your system 00:47:15.820 |
and you're the defender trying to build a resilient system. 00:47:27.260 |
The authors had the idea before the GAN paper came out, 00:47:33.780 |
and they were very nice and cited the GAN paper, 00:47:44.340 |
a machine learning model in one setting called a domain 00:47:50.300 |
And you would like it to perform well in the new domain, 00:47:58.500 |
on a really clean image data set like ImageNet, 00:48:03.380 |
where the user is taking pictures in the dark 00:48:07.820 |
and just pictures that aren't really centered 00:48:11.340 |
When you take a normal machine learning model, 00:48:22.140 |
Domain adaptation algorithms try to smooth out that gap. 00:48:32.180 |
regardless of which domain you extracted them on. 00:48:36.900 |
you have one player that's a feature extractor 00:48:39.180 |
and another player that's a domain recognizer. 00:48:42.100 |
The domain recognizer wants to look at the output 00:48:45.740 |
and guess which of the two domains the features came from. 00:49:02.500 |
into not knowing which domain the data came from 00:49:05.380 |
and also extract features that are good for classification. 00:49:22.900 |
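A minimal sketch of that two-player setup in the style of domain-adversarial neural networks, where a gradient-reversal layer makes the feature extractor fool the domain recognizer while still serving the label classifier; layer sizes, batch sizes, and the random stand-in data are illustrative only:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.clone()
    @staticmethod
    def backward(ctx, grad):
        return -grad   # flip the gradient flowing back into the extractor

features    = nn.Sequential(nn.Linear(20, 64), nn.ReLU())  # feature extractor
label_head  = nn.Linear(64, 10)   # task classifier (e.g., object classes)
domain_head = nn.Linear(64, 2)    # domain recognizer: source vs. target

opt = torch.optim.Adam(list(features.parameters()) +
                       list(label_head.parameters()) +
                       list(domain_head.parameters()), lr=1e-3)

x_src = torch.randn(32, 20); y_src = torch.randint(0, 10, (32,))  # labeled source
x_tgt = torch.randn(32, 20)                                       # unlabeled target

for step in range(100):
    f_src, f_tgt = features(x_src), features(x_tgt)
    task_loss = nn.functional.cross_entropy(label_head(f_src), y_src)

    # Domain recognizer tries to tell the domains apart; because of the
    # reversed gradient, the extractor is pushed to make them look alike.
    d_in  = torch.cat([GradReverse.apply(f_src), GradReverse.apply(f_tgt)])
    d_lab = torch.cat([torch.zeros(32, dtype=torch.long),
                       torch.ones(32, dtype=torch.long)])
    domain_loss = nn.functional.cross_entropy(domain_head(d_in), d_lab)

    loss = task_loss + domain_loss
    opt.zero_grad(); loss.backward(); opt.step()
```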
in order to make things work the same in both domains, 00:49:35.460 |
- Yeah, one thing you could hope for with GANs 00:49:38.100 |
is you could imagine I've got a limited training set 00:49:52.380 |
And then maybe the classifier would perform better 00:50:03.060 |
I've never heard of that particular approach working, 00:50:05.460 |
but I think there's some closely related things 00:50:14.100 |
So if we think a little bit about what we'd be hoping for 00:50:15.820 |
if we use the GAN to make more training data, 00:50:18.220 |
we're hoping that the GAN will generalize to new examples 00:50:22.060 |
better than the classifier would have generalized 00:50:39.140 |
that I haven't personally tried, but someone could try 00:50:43.380 |
of different generative models on the same training set, 00:50:50.540 |
Because each of the generative models might generalize 00:50:54.420 |
they might capture many different axes of variation 00:50:58.820 |
And then the classifier can capture all of those ideas 00:51:10.060 |
The other thing that GANs are really good for 00:51:19.340 |
but by generating new data that has different properties 00:51:26.220 |
is you can create differentially private data. 00:51:29.100 |
So suppose that you have something like medical records, 00:51:33.820 |
on the medical records and then publish the classifier, 00:51:36.460 |
because someone might be able to reverse engineer 00:51:48.980 |
still have the same differential privacy guarantees 00:51:57.220 |
and they can do almost anything they want with that data 00:52:06.460 |
on how much the original people's data has been protected. 00:52:25.700 |
- Yeah, so there's a paper from Amos Storkey's lab 00:52:31.380 |
that are incapable of using specific variables. 00:52:34.780 |
So say, for example, you wanted to make predictions 00:52:55.620 |
that can still take in a lot of different attributes 00:52:58.980 |
and make a really accurate, informed prediction, 00:53:02.540 |
but be confident that it isn't reverse engineering gender 00:53:12.820 |
where you have one player that's a feature extractor 00:53:16.100 |
and another player that's a feature analyzer. 00:53:19.060 |
And you want to make sure that the feature analyzer 00:53:21.420 |
is not able to guess the value of the sensitive variable 00:53:31.620 |
you're not able to infer the sensitive variables. 00:53:39.460 |
- Another way I think that GANs in particular 00:53:51.140 |
We've seen cycle GAN turning horses into zebras. 00:53:53.860 |
We've seen other unsupervised GANs made by Ming-Yu Liu 00:53:58.860 |
doing things like turning day photos into night photos. 00:54:04.780 |
you could imagine taking records for people in one group 00:54:08.420 |
and transforming them into analogous people in another group 00:54:11.500 |
and testing to see if they're treated equitably 00:54:16.420 |
There's a lot of things that'd be hard to get right 00:54:18.060 |
to make sure that the conversion process itself is fair. 00:54:25.380 |
But if you could design that conversion process 00:54:27.100 |
very carefully, it might give you a way of doing audits 00:54:30.500 |
where you say, what if we took people from this group, 00:54:33.100 |
converted them into equivalent people in another group? 00:54:35.420 |
Does the system actually treat them how it ought to? 00:54:41.740 |
In popular press and in general, in our imagination, 00:54:46.740 |
you think, well, GANs are able to generate data 00:54:54.500 |
or being able to sort of maliciously generate data 00:55:03.140 |
Is this something, if you look 10, 20 years into the future, 00:55:13.540 |
- I'm a lot less concerned about 20 years from now 00:55:17.380 |
I think there will be a kind of bumpy cultural transition 00:55:26.260 |
I think 20 years from now, people will mostly understand 00:55:34.060 |
People will expect to see that it's been cryptographically 00:55:36.700 |
signed or have some other mechanism to make them believe 00:55:47.620 |
that provides a lot of mechanisms for authenticating 00:55:51.980 |
They're maybe not quite up to having a state actor 00:55:59.820 |
but it's something that people are already working on 00:56:04.140 |
- So you think authentication will eventually win out? 00:56:08.300 |
So being able to authenticate that this is real 00:56:13.300 |
- As opposed to GANs just getting better and better 00:56:15.780 |
or generative models being able to get better and better 00:56:18.220 |
to where the nature of what is real is normal. 00:56:21.500 |
- I don't think we'll ever be able to look at the pixels 00:56:24.460 |
of a photo and tell you for sure that it's real or not real. 00:56:28.580 |
And I think it would actually be somewhat dangerous 00:56:36.820 |
and then someone's able to fool your fake detector 00:56:38.900 |
and your fake detector says this image is not fake, 00:56:59.580 |
I also think we will likely get better authentication systems 00:57:07.380 |
cryptographically signs everything that comes out of it. 00:57:17.700 |
who knew the appropriate private key for this phone 00:57:24.340 |
and upload it to this server at this timestamp. 00:57:31.380 |
that have the private keys hardware embedded in them. 00:57:42.540 |
or break open the chip and learn the private key 00:57:47.460 |
for an adversary with fewer resources to fake things. 00:57:53.700 |
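A hypothetical sketch of the signing scheme he outlines, with an Ed25519 key pair standing in for a key embedded in the camera hardware; the workflow and names are illustrative, not an existing phone API:

```python
import hashlib
from cryptography.hazmat.primitives.asymmetric import ed25519

# Hypothetical device key; in the scheme described it would live in secure
# hardware inside the phone, with the public key published or certified.
device_key = ed25519.Ed25519PrivateKey.generate()
public_key = device_key.public_key()

photo_bytes = b"...raw image data..."          # stand-in for a captured photo
digest = hashlib.sha256(photo_bytes).digest()
signature = device_key.sign(digest)            # attached to the photo as metadata

# Later, a verifier checks that the photo came from that device and was not
# modified; verify() raises InvalidSignature if the bytes were tampered with.
public_key.verify(signature, digest)
print("photo authenticated")
```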
So you mentioned the beer and the bar and the new ideas. 00:58:04.420 |
Do you think there's still many such groundbreaking ideas 00:58:07.740 |
in deep learning that could be developed so quickly? 00:58:11.020 |
- Yeah, I do think that there are a lot of ideas 00:58:14.860 |
GANs were probably a little bit of an outlier 00:58:25.580 |
on the algorithm scale and get a big payback. 00:58:28.820 |
I think it's not as likely that you'll see that 00:58:31.900 |
in terms of things like core machine learning technologies 00:58:42.420 |
it would be a lot harder to prove that it was useful 00:58:46.940 |
because I would need to get it running on something 00:58:57.580 |
and know that it was something really new and exciting. 00:59:03.260 |
But there are other areas of machine learning 00:59:06.780 |
where I think a new idea could actually be developed 00:59:17.740 |
- Yeah, so I think fairness and interpretability 00:59:23.140 |
are areas where we just really don't have any idea 00:59:30.380 |
I don't think we even have the right definitions. 00:59:32.740 |
And even just defining a really useful concept, 00:59:40.100 |
We've seen that, for example, in differential privacy 00:59:48.060 |
where before a lot of things are really mushy 00:59:51.620 |
you could actually design randomized algorithms 00:59:56.220 |
that they preserved individual people's privacy 01:00:01.820 |
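For reference, the precise definition being alluded to: a randomized mechanism M is (epsilon, delta)-differentially private if, for every pair of datasets D and D' differing in one person's record and every set of outputs S,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta .
```

Having this kind of crisp, checkable statement is what made it possible to design algorithms with provable privacy guarantees, which is the property he suggests interpretability definitions currently lack.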
Right now, we all talk a lot about how interpretable 01:00:11.300 |
of what interpretability means in their head. 01:00:13.860 |
If we could define some concept related to interpretability 01:00:20.620 |
even without a new algorithm that increases that quantity. 01:00:24.180 |
And also once we had the definition of differential privacy, 01:00:28.780 |
it was fast to get the algorithms that guaranteed it. 01:00:31.380 |
So you could imagine once we have definitions 01:00:37.580 |
that have the interpretability guarantees quickly too. 01:00:40.540 |
- What do you think it takes to build a system 01:00:48.660 |
as we quickly venture into the philosophical? 01:00:55.620 |
- I think that it definitely takes better environments 01:01:08.740 |
I also think it's gonna take really a lot of computation. 01:01:29.740 |
or by thinking really hard about the problem. 01:01:32.140 |
I think that the agent really needs to interact 01:01:35.900 |
and have a variety of experiences within the same lifespan. 01:01:53.500 |
to perform well in many different RL environments, 01:01:57.020 |
but we don't really have anything like an agent 01:01:59.540 |
that goes seamlessly from one type of experience to another 01:02:02.940 |
and really integrates all the different things 01:02:16.780 |
Like all of them are playing like an action-based video game. 01:02:23.220 |
playing a video game to like reading the Wall Street Journal 01:02:27.500 |
to predicting how effective a molecule will be as a drug 01:02:41.700 |
natural conversation being a good benchmark for intelligence. 01:02:59.780 |
So imagine that instead of having to go to the CIFAR website 01:03:07.940 |
and then write a Python script to parse it and all that, 01:03:11.340 |
you could just point an agent at the CIFAR 10 problem 01:03:19.180 |
and trains a model and starts giving you predictions. 01:03:22.420 |
I feel like something that doesn't need to have 01:03:45.740 |
if something knows how to pre-process the data 01:03:49.580 |
so that it successfully accomplishes the task, 01:03:59.540 |
that that's the philosophical definition of intelligence, 01:04:02.260 |
but that's something that would be really cool to build, 01:04:03.780 |
that would be really useful and would impress me 01:04:05.580 |
and would convince me that we've made a step forward 01:04:13.380 |
and then next day expect it to be able to solve CIFAR-10. 01:04:22.180 |
and it figures out what web searches it should run 01:04:28.300 |
- So you have a very clear, calm way of speaking, 01:04:40.220 |
have been identified as both potentially being robots. 01:04:44.180 |
If you have to prove to the world that you are indeed human, 01:04:48.180 |
- I can understand thinking that I'm a robot. 01:04:53.180 |
- It's the flip side of the Turing test, I think. 01:05:13.860 |
- Proving that I'm not a robot with today's technology, 01:05:20.780 |
into talking about the stock market or something 01:05:39.100 |
to a separate channel to prove that something is real. 01:05:45.540 |
on a blockchain when I was born or something, 01:05:52.980 |
- So what, last question, problem stands out for you 01:05:59.940 |
- So I think resistance to adversarial examples, 01:06:02.940 |
figuring out how to make machine learning secure 01:06:05.540 |
against an adversary who wants to interfere it 01:06:07.500 |
and control it, that is one of the most important things 01:06:12.180 |
- In all domains, image, language, driving, and everything. 01:06:30.660 |
what are the important problems in security of phones 01:06:35.140 |
in like 2002, I don't think we would have anticipated 01:06:38.940 |
that we're using them for nearly as many things 01:06:44.900 |
that you can kind of try to speculate about where it's going 01:06:47.940 |
but really the business opportunities that end up taking off 01:06:56.460 |
almost anything you can do with machine learning, 01:06:58.380 |
you would like to make sure that people can't get it 01:07:02.140 |
to do what they want rather than what you want 01:07:04.660 |
just by showing it a funny QR code or a funny input pattern. 01:07:08.540 |
- And you think that the set of methodology to do that 01:07:22.820 |
that I'm excited about today is making dynamic models 01:07:25.740 |
that change every time they make a prediction. 01:07:31.180 |
and then after they're trained, we freeze them 01:07:33.180 |
and we just use the same rule to classify everything 01:07:38.260 |
That's really a sitting duck from a security point of view. 01:07:41.580 |
If you always output the same answer for the same input, 01:07:48.340 |
until they find a mistake that benefits them. 01:07:53.260 |
I think having a model that updates its predictions 01:07:56.580 |
so that it's harder to predict what you're gonna get 01:08:06.180 |
- Yeah, models that maintain a bit of a sense of mystery