
Ilya Sutskever: OpenAI Meta-Learning and Self-Play | MIT Artificial General Intelligence (AGI)


Chapters

0:00 Introduction
0:55 Talk
43:04 Q&A

Whisper Transcript

00:00:00.000 | Welcome back to 6.S099, Artificial General Intelligence.
00:00:04.000 | Today we have Ilya Sutskever,
00:00:07.640 | co-founder and research director of OpenAI.
00:00:13.400 | He started in the ML group in Toronto with Geoffrey Hinton,
00:00:17.280 | then at Stanford with Andrew Ng,
00:00:19.040 | co-founded DNN Research, spent three years
00:00:21.440 | as a research scientist at Google Brain,
00:00:23.520 | and finally co-founded OpenAI.
00:00:26.280 | Citations aren't everything,
00:00:28.560 | but they do indicate impact.
00:00:31.120 | And his work, recent work, in the past five years
00:00:35.000 | has been cited over 46,000 times.
00:00:39.320 | He has been the key creative intellect
00:00:42.600 | and driver behind some of the biggest breakthrough ideas
00:00:45.400 | in deep learning and artificial intelligence ever.
00:00:49.440 | So please welcome Ilya.
00:00:52.520 | (audience applauding)
00:00:56.680 | - All right, thanks for the introduction, Lex.
00:00:58.880 | All right, thanks for coming to my talk.
00:01:01.280 | I will tell you about some work we've done
00:01:03.640 | over the past year on meta-learning
00:01:06.440 | and self-play at OpenAI.
00:01:08.800 | And before I dive into some of the
00:01:11.600 | more technical details of the work,
00:01:14.200 | I want to spend a little bit of time
00:01:16.400 | talking about deep learning
00:01:19.800 | and why it works at all in the first place.
00:01:22.480 | Which I think is actually not a self-evident thing
00:01:25.560 | that it should work.
00:01:27.120 | One fact, it's actually a fact,
00:01:30.520 | it's a mathematical theorem that you can prove,
00:01:35.520 | is that if you could find the shortest program
00:01:40.600 | that does very well on your data,
00:01:43.760 | then you will achieve the best generalization possible.
00:01:46.760 | With a little bit of modification,
00:01:48.080 | you can turn it into a very, very simple algorithm.
00:01:50.880 | With a little bit of modification,
00:01:52.800 | you can turn it into a precise theorem.
00:01:54.440 | And on a very intuitive level,
00:01:58.040 | it's easy to see why it should be the case.
00:02:00.920 | If you have some data,
00:02:03.240 | and you're able to find the shortest program
00:02:05.880 | which generates this data,
00:02:08.000 | then you've essentially extracted all conceivable regularity
00:02:11.560 | from this data into your program.
00:02:13.800 | And then you can use this object
00:02:14.920 | to make the best predictions possible.
00:02:18.000 | If you have data which is so complex
00:02:21.840 | that there is no way to express it as a shorter program,
00:02:26.120 | then it means that your data is totally random.
00:02:28.160 | There is no way to extract any regularity
00:02:30.160 | from it whatsoever.
00:02:31.200 | Now there is a little-known mathematical theory behind this,
00:02:35.600 | and the proofs of these statements
00:02:37.160 | are actually not even that hard.
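(As an aside: the talk does not name a specific theorem, so the following is only one standard way to make the "short programs generalize" claim precise, an Occam / description-length bound, included here for illustration.)

```latex
% If each hypothesis h has a prefix-free description of d(h) bits, then with
% probability at least 1 - \delta over an i.i.d. sample of size m,
% simultaneously for all h:
\[
  \mathrm{err}(h) \;\le\; \widehat{\mathrm{err}}(h)
  \;+\; \sqrt{\frac{d(h)\,\ln 2 + \ln(2/\delta)}{2m}},
\]
% so a short program (small d(h)) that fits the data well also generalizes well.
```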
00:02:39.200 | But the one minor slight disappointment
00:02:42.600 | is that it's actually not possible,
00:02:44.200 | at least given today's tools and understanding,
00:02:46.960 | to find the best short program
00:02:48.960 | that explains or generates or solves your problem
00:02:53.960 | given your data.
00:02:55.760 | This problem is computationally intractable.
00:02:57.960 | The space of all programs is a very nasty space.
00:03:04.000 | Small changes to your program
00:03:05.600 | result in massive changes to the behavior of the program,
00:03:07.880 | as it should be.
00:03:08.800 | It makes sense.
00:03:10.080 | You have a loop.
00:03:11.280 | You change the inside of the loop.
00:03:13.400 | Of course, you get something totally different.
00:03:15.800 | So the space of programs is so hard that,
00:03:17.920 | at least given what we know today,
00:03:19.320 | search there seems to be completely off the table.
00:03:22.560 | Well, if we give up on short programs,
00:03:28.880 | what about small circuits?
00:03:30.400 | Well, it turns out that we are lucky.
00:03:34.800 | It turns out that when it comes to small circuits,
00:03:37.480 | you can just find the best small circuit
00:03:40.200 | that solves your problem using backpropagation.
00:03:42.920 | And this is the miraculous fact
00:03:46.360 | on which the rest of AI stands.
00:03:49.520 | It is the fact that when you have a circuit
00:03:52.320 | and you impose constraints on your circuit using data,
00:03:56.600 | you can find a way to satisfy these constraints
00:04:01.040 | using backprop by iteratively making small changes
00:04:05.960 | to the weights of your neural network
00:04:08.160 | until its predictions satisfy the data.
00:04:13.080 | What this means is that the computational problem
00:04:16.680 | that's solved by backpropagation is extremely profound.
00:04:19.640 | It is circuit search.
00:04:21.360 | Now, we know that you can't solve it always,
00:04:23.840 | but you can solve it sometimes.
00:04:26.640 | And you can solve it at those times
00:04:29.600 | where we have a practical data set.
00:04:32.200 | It is easy to design artificial data sets
00:04:34.040 | for which you cannot find the best neural network.
00:04:36.680 | But in practice, that seems to be not a problem.
00:04:39.960 | You can think of training a neural network
00:04:42.200 | as solving a system of equations,
00:04:45.840 | where you have a large number of equation terms like this:
00:04:50.480 | f(x_i, θ) = y_i.
00:04:53.080 | So you've got your parameters
00:04:54.200 | and they represent all your degrees of freedom.
00:04:57.240 | And you use gradient descent to push the information
00:05:01.160 | from these equations into the parameters
00:05:03.400 | to satisfy them all.
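(A minimal numpy sketch of this constraint-satisfaction view of training, with a made-up target function; it is illustrative only, not anything from the talk: gradient descent pushes the constraints f(x_i, θ) = y_i into the weights.)

```python
# Illustrative sketch: a tiny one-hidden-layer network trained so that its
# predictions approximately satisfy the "equations" f(x_i; theta) = y_i.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 4))             # inputs x_i
Y = np.sin(X).sum(axis=1, keepdims=True)  # targets y_i (an assumed regularity)

W1 = rng.normal(scale=0.5, size=(4, 32)); b1 = np.zeros(32)
W2 = rng.normal(scale=0.5, size=(32, 1)); b2 = np.zeros(1)

lr = 0.05
for step in range(2000):
    H = np.tanh(X @ W1 + b1)              # hidden activations
    P = H @ W2 + b2                       # predictions f(x_i; theta)
    err = P - Y                           # how far each "equation" is from holding
    # backprop: small changes to the weights that reduce the violation
    dW2 = H.T @ err / len(X); db2 = err.mean(axis=0)
    dH = err @ W2.T * (1 - H**2)
    dW1 = X.T @ dH / len(X); db1 = dH.mean(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2

print("mean squared violation:", float((err**2).mean()))
```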
00:05:04.400 | And you can see that the neural network,
00:05:08.160 | let's say one with 50 layers,
00:05:10.160 | is basically a parallel computer
00:05:12.760 | that is given 50 time steps to run.
00:05:16.400 | And you can do quite a lot with 50 time steps
00:05:19.840 | of a very, very powerful, massively parallel computer.
00:05:23.960 | So for example, I think it is not widely known
00:05:28.960 | that you can learn to sort
00:05:33.240 | n n-bit numbers
00:05:36.200 | using a modestly sized neural network
00:05:38.280 | with just two hidden layers,
00:05:40.560 | which is not bad.
00:05:41.520 | It's not self-evident,
00:05:44.680 | especially since we've been taught
00:05:46.200 | that sorting requires log n parallel steps.
00:05:49.640 | With a neural network,
00:05:50.760 | you can sort successfully using only two parallel steps.
00:05:54.800 | So there's something slightly unobvious going on.
00:05:57.160 | Now, these are parallel steps of threshold neurons,
00:06:01.200 | so they're doing a little bit more work.
00:06:03.120 | That's the answer to the mystery.
00:06:04.440 | But if you've got 50 such layers,
00:06:05.920 | you can do quite a bit of logic,
00:06:07.320 | quite a bit of reasoning,
00:06:08.440 | all inside the neural network.
00:06:10.240 | And that's why it works.
00:06:11.480 | Given the data,
00:06:13.520 | we are able to find the best neural network.
00:06:16.920 | And because the neural network is deep,
00:06:18.640 | because it can run computation inside of its layers,
00:06:22.880 | the best neural network is worth finding.
00:06:25.800 | 'Cause that's really what you need.
00:06:27.160 | You need something, you need a model class,
00:06:30.920 | which is worth optimizing.
00:06:32.240 | But it also needs to be optimizable.
00:06:35.640 | And deep neural networks satisfy both of these constraints.
00:06:39.280 | And this is why everything works.
00:06:40.720 | This is the basis on which everything else resides.
00:06:43.760 | Now I want to talk a little bit
00:06:46.200 | about reinforcement learning.
00:06:48.040 | So reinforcement learning is a framework.
00:06:50.720 | It's a framework of evaluating agents
00:06:54.760 | in their ability to achieve goals
00:06:56.240 | in complicated stochastic environments.
00:06:58.280 | You've got an agent,
00:07:00.120 | which is plugged into an environment,
00:07:01.440 | as shown in the figure right here.
00:07:04.000 | And for any given agent,
00:07:07.160 | you can simply run it many times
00:07:09.640 | and compute its average reward.
00:07:11.400 | Now, the thing that's interesting
00:07:14.360 | about the reinforcement learning framework
00:07:16.400 | is that there exist interesting,
00:07:19.840 | useful reinforcement learning algorithms.
00:07:22.840 | The framework existed for a long time.
00:07:25.280 | It became interesting once we realized
00:07:27.400 | that good algorithms exist.
00:07:28.960 | Now, these are not perfect algorithms,
00:07:30.920 | but they are good enough to do interesting things.
00:07:34.400 | And all you want, the mathematical problem,
00:07:38.200 | is one where you need to maximize the expected reward.
00:07:41.000 | Now, one important way
00:07:45.800 | in which the reinforcement learning framework
00:07:47.240 | is not quite complete
00:07:48.960 | is that it assumes that the reward
00:07:50.480 | is given by the environment.
00:07:52.400 | You see this picture.
00:07:54.200 | The agent sends an action,
00:07:56.080 | while the environment sends both the observation
00:07:59.960 | and the reward back.
00:08:01.480 | That's what the environment communicates back.
00:08:04.400 | The way in which this is not the case in the real world
00:08:07.440 | is that we figure out
00:08:12.000 | what the reward is from the observation.
00:08:14.160 | We reward ourselves.
00:08:16.160 | We are not told, the environment doesn't say,
00:08:17.600 | "Hey, here's some negative reward."
00:08:20.200 | It's our interpretation of our senses
00:08:22.720 | that lets us determine what the reward is.
00:08:24.800 | And there is only one real true reward in life,
00:08:28.280 | and this is existence or nonexistence.
00:08:31.080 | And everything else is a corollary of that.
00:08:34.000 | So, well, what should our agent be?
00:08:37.040 | You already know the answer.
00:08:37.880 | It should be a neural network.
00:08:40.200 | Because whenever you want to do something,
00:08:42.200 | the answer is going to be a neural network,
00:08:43.520 | and you want the agent to map observations to actions.
00:08:47.320 | So you let it be parameterized with a neural net,
00:08:49.800 | and you apply a learning algorithm.
00:08:51.600 | So I want to explain to you
00:08:52.560 | how reinforcement learning works.
00:08:54.640 | This is model-free reinforcement learning,
00:08:56.400 | the reinforcement learning
00:08:57.240 | that's actually being used in practice everywhere.
00:09:00.640 | But it's also very robust, it's very simple.
00:09:05.640 | It's also not very efficient.
00:09:07.560 | So the way it works is the following.
00:09:08.800 | This is literally the one-sentence description
00:09:11.440 | of what happens.
00:09:12.320 | In short, try something new.
00:09:16.720 | Add randomness to your actions.
00:09:20.080 | And compare the result to your expectation.
00:09:23.600 | If the result surprises you,
00:09:27.800 | if you find that the result exceeded your expectation,
00:09:31.440 | then change your parameters
00:09:32.880 | to take those actions in the future.
00:09:34.680 | That's it.
00:09:36.520 | This is the full idea of reinforcement learning.
00:09:38.840 | Try it out, see if you like it,
00:09:41.160 | and if you do, do more of that in the future.
00:09:43.400 | And that's it.
00:09:45.840 | That's literally it.
00:09:47.120 | This is the core idea.
00:09:48.760 | Now, it turns out it's not difficult
00:09:50.120 | to formalize mathematically.
00:09:51.640 | But this is really what's going on.
00:09:53.080 | If in a neural network, in a regular neural network,
00:09:56.920 | like this, you might say, okay, what's the goal?
00:09:59.440 | You run the neural network, you get an answer.
00:10:02.080 | You compare it to the desired answer.
00:10:04.960 | And whatever difference you have between those two,
00:10:06.560 | you send it back to change the neural network.
00:10:09.240 | That's supervised learning.
00:10:11.720 | In reinforcement learning, you run a neural network,
00:10:14.840 | you add a bit of randomness to your action,
00:10:17.720 | and then if you like the result,
00:10:19.400 | your randomness turns into the desired target, in effect.
00:10:23.280 | So that's it.
00:10:24.120 | Trivial.
00:10:27.120 | Now, math exists.
00:10:30.680 | Without explaining what these equations mean,
00:10:35.960 | the point is not really to derive them,
00:10:37.320 | but just to show that they exist.
00:10:39.760 | There are two classes of reinforcement learning algorithms.
00:10:42.600 | One of them is the policy gradient,
00:10:44.320 | where basically what you do is that you take
00:10:47.560 | this expression right there, the sum of rewards,
00:10:51.960 | and you just crunch through the derivatives.
00:10:53.840 | You expand the terms.
00:10:55.480 | You run, you do some algebra, and you get a derivative.
00:10:59.040 | And miraculously, the derivative has exactly the form
00:11:04.520 | that I told you, which is try some actions,
00:11:09.320 | and if you like them, increase the log probability
00:11:11.680 | of the actions.
00:11:12.560 | That literally follows from the math.
00:11:14.000 | It's very nice when the intuitive explanation
00:11:17.400 | has a one-to-one correspondence
00:11:18.680 | to what you get in the equation,
00:11:20.160 | even though you'll have to take my word for it
00:11:22.180 | if you're not familiar with it.
00:11:24.040 | That's the equation at the top.
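(A hedged sketch of the policy-gradient recipe just described, on a made-up three-armed bandit; the reward values, learning rate, and baseline are all assumptions for illustration, not OpenAI code.)

```python
# REINFORCE-style update: add randomness to your actions, compare the result
# to your expectation, and increase the log-probability of actions that
# exceeded it.
import numpy as np

rng = np.random.default_rng(0)
true_reward = np.array([0.2, 0.5, 0.8])   # unknown to the agent
theta = np.zeros(3)                       # policy parameters (logits)
baseline, lr = 0.0, 0.1

for step in range(5000):
    probs = np.exp(theta - theta.max()); probs /= probs.sum()
    a = rng.choice(3, p=probs)            # try something new
    r = true_reward[a] + rng.normal(scale=0.1)
    advantage = r - baseline              # compare the result to expectation
    grad_logp = -probs; grad_logp[a] += 1.0   # d/dtheta of log pi(a)
    theta += lr * advantage * grad_logp   # do more of what surprised you positively
    baseline += 0.01 * (r - baseline)     # running estimate of expected reward

probs = np.exp(theta - theta.max()); probs /= probs.sum()
print("final action probabilities:", np.round(probs, 3))
```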
00:11:26.240 | Then there is a different class
00:11:27.160 | of reinforcement learning algorithms,
00:11:28.400 | which is a little bit more difficult to explain.
00:11:30.440 | It's called the Q-learning-based algorithms.
00:11:32.720 | They are a bit less stable, a bit more sample efficient,
00:11:36.760 | and it has the property that it can learn
00:11:42.320 | not only from the data generated by the actor,
00:11:46.280 | but from any other data as well.
00:11:47.880 | So it has a different robustness profile,
00:11:51.760 | which would be a little bit important,
00:11:53.400 | but it's only gonna be a technicality.
00:11:55.280 | So yeah, this is the on-policy, off-policy distinction,
00:11:59.320 | but it's a little bit technical,
00:12:00.400 | so if you find this hard to understand,
00:12:03.200 | don't worry about it.
00:12:04.240 | If you already know it, then you already know it.
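(A small illustrative Q-learning example on an assumed toy chain environment, showing the off-policy property mentioned above: the data comes from a purely random behavior policy, yet the learned values reflect the greedy policy.)

```python
# Tabular Q-learning on a 5-state chain: action 1 moves right, action 0 left,
# reward 1 for reaching the rightmost state.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
gamma, alpha = 0.9, 0.1

for episode in range(2000):
    s = 0
    for _ in range(20):
        a = rng.integers(n_actions)               # purely random behavior policy
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # off-policy target: the best action at the next state, not the one taken
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print("greedy policy:", Q.argmax(axis=1))         # should learn to move right
```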
00:12:07.160 | So now what's the potential of reinforcement learning?
00:12:10.080 | What's the promise?
00:12:11.840 | What is it actually, why should we be excited about it?
00:12:16.360 | Now, there are two reasons.
00:12:17.840 | The reinforcement learning algorithms of today
00:12:19.400 | are already useful and interesting,
00:12:22.800 | and especially if you have a really good simulation
00:12:25.280 | of your world, you could train agents
00:12:27.120 | to do lots of interesting things.
00:12:28.760 | But what's really exciting is if you can build
00:12:33.640 | a super amazing sample efficient
00:12:36.080 | reinforcement learning algorithm.
00:12:37.720 | We just give it a tiny amount of data,
00:12:39.840 | and the algorithm just crunches through it
00:12:41.440 | and extracts every bit of entropy out of it
00:12:43.720 | in order to learn in the fastest way possible.
00:12:46.120 | Now, today our algorithms are not particularly
00:12:49.720 | data efficient, they are data inefficient.
00:12:52.840 | But as our field keeps making progress, this will change.
00:12:56.000 | Next, I want to dive into the topic of meta-learning.
00:13:00.480 | The goal of meta-learning, so meta-learning
00:13:05.480 | is a beautiful idea that doesn't really work,
00:13:08.720 | but it kind of works, and it's really promising too.
00:13:11.760 | It's another promising idea.
00:13:13.160 | So what's the dream?
00:13:15.600 | We have some learning algorithms.
00:13:19.160 | Perhaps we could use those learning algorithms
00:13:21.200 | in order to learn to learn.
00:13:22.560 | That'd be nice if we could learn to learn.
00:13:25.720 | So how would you do that?
00:13:28.800 | You would take a system which,
00:13:30.360 | you train it not on one task, but on many tasks,
00:13:36.400 | and you ask it if it learns to solve these tasks quickly.
00:13:39.200 | And that may actually be enough.
00:13:42.440 | So here's what it looks like.
00:13:43.280 | Here's what most traditional meta-learning looks like.
00:13:47.920 | You have a model which is a big neural network.
00:13:50.960 | But what you do is that,
00:13:53.960 | instead of training cases, you have training tasks.
00:13:58.960 | And instead of test cases, you have test tasks.
00:14:01.240 | So your input may be, instead of just your current test case,
00:14:05.440 | it would be all the information about the test tasks
00:14:09.600 | plus the test case, and you'll try to output
00:14:12.040 | the prediction or action for that test case.
00:14:14.840 | So basically you say, yeah, I'm gonna give you
00:14:17.560 | your 10 examples as part of your input to your model,
00:14:21.000 | figure out how to make the best use of them.
00:14:23.200 | It's a really straightforward idea.
00:14:27.480 | You turn the neural network into the learning algorithm
00:14:30.680 | by turning a training task into a training case.
00:14:34.760 | So training task equals training case.
00:14:37.480 | This is meta-learning.
00:14:39.080 | This one sentence.
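(A minimal sketch of the "training task = training case" framing, with assumed shapes and a made-up family of linear-classification tasks; each element of the batch is an entire few-shot episode rather than a single data point.)

```python
# Construct meta-learning training cases: support examples plus a query,
# packed into one input vector for an ordinary supervised model.
import numpy as np

rng = np.random.default_rng(0)

def make_episode(n_support=10, dim=8):
    """One 'training case' = a small task: a support set and one query."""
    w = rng.normal(size=dim)                      # this task's hidden rule
    xs = rng.normal(size=(n_support, dim))
    ys = (xs @ w > 0).astype(float)               # labels under this task's rule
    xq = rng.normal(size=dim)
    yq = float(xq @ w > 0)
    # the meta-learner's input is the whole support set plus the query;
    # its target is the query label
    inp = np.concatenate([xs.ravel(), ys, xq])
    return inp, yq

batch = [make_episode() for _ in range(32)]       # a batch of tasks, not of points
X = np.stack([b[0] for b in batch]); Y = np.array([b[1] for b in batch])
print(X.shape, Y.shape)   # the network is then trained on (X, Y) as usual
```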
00:14:40.560 | And so there've been several success stories
00:14:45.040 | which I think are very interesting.
00:14:48.400 | One of the success stories of meta-learning
00:14:50.040 | is learning to recognize characters quickly.
00:14:53.400 | So there's been a data set produced at MIT by Lake et al.
00:14:58.000 | And this is a data set.
00:15:02.680 | We have a large number of different handwritten characters.
00:15:06.160 | And people have been able to train
00:15:08.120 | extremely strong meta-learning systems for this task.
00:15:10.840 | Another successful, another very successful example
00:15:14.200 | of meta-learning is that of neural architecture search
00:15:17.560 | by Zop and Lee from Google,
00:15:20.000 | where they found a neural architecture
00:15:23.360 | that solved one problem well, a small problem.
00:15:26.120 | And then it would generalize,
00:15:27.000 | and then it would successfully solve large problems as well.
00:15:29.240 | So this is kind of the small number of bits meta-learning.
00:15:34.240 | It's like when you learn the architecture,
00:15:36.080 | or maybe even learn a program,
00:15:37.360 | a small program or learning algorithm,
00:15:39.000 | which you apply to new tasks.
00:15:40.640 | So this is the other way of doing meta-learning.
00:15:43.520 | So anyway, but the point is, what's happening,
00:15:46.000 | what's really happening in meta-learning in most cases
00:15:48.800 | is that you turn a training task into a training case
00:15:53.240 | and pretend that this is totally normal deep learning.
00:15:56.360 | That's it.
00:15:57.440 | This is the entirety of meta-learning.
00:15:59.360 | Everything else is just minor details.
00:16:01.880 | Next, I wanna dive in.
00:16:05.160 | So now that I've finished the introduction section,
00:16:07.520 | I want to start discussing different work
00:16:09.760 | by different people from OpenAI.
00:16:12.800 | And I wanna start by talking
00:16:14.320 | about hindsight experience replay.
00:16:16.600 | There's been a large effort by Marcin Andrychowicz
00:16:20.280 | to develop a learning algorithm for reinforcement learning
00:16:23.600 | that doesn't solve just one task,
00:16:27.400 | but it solves many tasks,
00:16:29.920 | and it learns to make use of its experience
00:16:33.200 | in a much more efficient way.
00:16:34.640 | And I wanna discuss one problem in reinforcement learning.
00:16:38.600 | It's actually, I guess, a set of problems
00:16:41.400 | which are related to each other.
00:16:43.040 | One really important thing you need to learn to do
00:16:49.040 | is to explore.
00:16:50.040 | You start out in an environment,
00:16:53.680 | you don't know what to do.
00:16:55.080 | What do you do?
00:16:56.440 | So one very important thing that has to happen
00:16:58.240 | is that you must get rewards from time to time.
00:17:01.120 | If you try something and you don't get rewards,
00:17:04.760 | then how can you learn?
00:17:06.920 | So I'd say that's kind of the crux of the problem.
00:17:11.320 | How do you learn?
00:17:12.400 | And relatedly, is there any way to meaningfully benefit
00:17:17.400 | from the experience, from your attempts, from your failures?
00:17:23.280 | If you try to achieve a goal and you fail,
00:17:25.280 | can you still learn from it?
00:17:26.560 | The idea is, instead of asking your algorithm
00:17:28.960 | to achieve a single goal,
00:17:31.000 | you want to learn a policy
00:17:32.040 | that can achieve a very large family of goals.
00:17:34.320 | For example, instead of reaching one state,
00:17:36.760 | you want to learn a policy
00:17:37.760 | that reaches every state of your system.
00:17:40.760 | Now what's the implication?
00:17:42.680 | Anytime you do something, you achieve some state.
00:17:46.800 | So let's suppose you say, I want to achieve state A.
00:17:50.000 | I try my best and I end up achieving state B.
00:17:54.280 | I can either conclude, well, that was disappointing,
00:17:58.200 | I haven't learned almost anything.
00:18:00.480 | I still have no idea how to achieve state A.
00:18:04.040 | But alternatively, I can say, well, wait a second,
00:18:06.760 | I've just reached a perfectly good state, which is B.
00:18:10.040 | Can I learn how to achieve state B
00:18:12.280 | from my attempt to achieve state A?
00:18:14.800 | And the answer is yes, you can.
00:18:16.440 | And it just works.
00:18:17.840 | And I just want to point out,
00:18:20.440 | there's a small subtlety here,
00:18:21.800 | which may be interesting to those of you
00:18:26.440 | who are very familiar with the distinction
00:18:28.560 | between on policy and off policy.
00:18:30.240 | When you try to achieve A,
00:18:33.160 | you are doing on-policy learning for reaching the state A,
00:18:37.640 | but you're doing off-policy learning
00:18:39.440 | for reaching the state B,
00:18:40.960 | because you would take different actions
00:18:42.440 | if you would actually try to reach state B.
00:18:44.680 | So that's why it's very important
00:18:46.120 | that the algorithm you use here
00:18:47.600 | can support off-policy learning.
00:18:49.840 | But that's a minor technicality.
00:18:52.000 | At the crux of the idea is,
00:18:54.960 | you make the problem easier
00:18:57.280 | by ostensibly making it harder.
00:18:59.240 | By training a system which aspires to reach,
00:19:03.760 | to learn to reach every state,
00:19:05.720 | to learn to achieve every goal,
00:19:07.520 | to learn to master its environment in general,
00:19:10.680 | you build a system which always learns something.
00:19:15.080 | It learns from success as well as from failure.
00:19:17.560 | Because if it tries to do one thing
00:19:19.800 | and it does something else,
00:19:21.560 | it now has training data
00:19:22.440 | for how to achieve that something else.
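(A hedged sketch of the hindsight relabeling idea, illustrative only and not the actual HER implementation: a trajectory collected while aiming for goal A is stored a second time as if the achieved state B had been the goal all along.)

```python
# Store each transition twice: once with the original goal and sparse reward,
# once relabeled with the state that was actually reached.
import numpy as np

def relabel_with_hindsight(trajectory, replay_buffer):
    """trajectory: list of (state, action, next_state, goal) tuples."""
    achieved = trajectory[-1][2]                  # the state we ended up in (goal B)
    for state, action, next_state, goal in trajectory:
        # original transition: sparse reward, almost always 0 on a failed attempt
        r = 1.0 if np.allclose(next_state, goal) else 0.0
        replay_buffer.append((state, action, next_state, goal, r))
        # hindsight transition: pretend the achieved state was the goal
        r_h = 1.0 if np.allclose(next_state, achieved) else 0.0
        replay_buffer.append((state, action, next_state, achieved, r_h))

buffer = []
traj = [(np.zeros(2), 0, np.array([0.1, 0.0]), np.ones(2)),
        (np.array([0.1, 0.0]), 1, np.array([0.3, 0.2]), np.ones(2))]
relabel_with_hindsight(traj, buffer)
print(len(buffer), "transitions stored from one failed attempt")
```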
00:19:24.800 | I want to show you a video
00:19:25.640 | of how this thing works in practice.
00:19:27.880 | So one challenge in reinforcement learning systems
00:19:32.080 | is the need to shape the reward.
00:19:34.280 | So what does it mean?
00:19:36.280 | It means that at the beginning of the system,
00:19:38.440 | at the start of learning,
00:19:39.440 | when the system doesn't know much,
00:19:41.200 | it will probably not achieve your goal.
00:19:43.840 | And so it's important that you design your reward function
00:19:46.200 | to give it gradual increments,
00:19:47.840 | to make it smooth and continuous
00:19:49.040 | so that even when the system is not very good,
00:19:50.720 | it achieves the goal.
00:19:52.240 | Now, if you give your system a very sparse reward
00:19:55.480 | where the reward is achieved
00:19:56.560 | only when you reach a final state,
00:19:58.280 | then it becomes very hard
00:20:01.360 | for normal reinforcement learning algorithms
00:20:03.800 | to solve a problem,
00:20:04.640 | because naturally, you never get the reward,
00:20:06.760 | so you never learn.
00:20:07.960 | No reward means no learning.
00:20:10.120 | But here, because you learn from failure
00:20:13.040 | as well as from success,
00:20:14.760 | this problem simply doesn't occur.
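(A small illustration of the sparse-versus-shaped distinction, with an assumed reach-the-goal geometry; it only shows why a shaped reward gives signal far from the goal while a sparse one does not.)

```python
import numpy as np

def sparse_reward(state, goal, eps=0.05):
    return 1.0 if np.linalg.norm(state - goal) < eps else 0.0

def shaped_reward(state, goal):
    return -np.linalg.norm(state - goal)   # smooth signal even far from the goal

goal = np.array([1.0, 1.0])
for s in [np.array([0.0, 0.0]), np.array([0.9, 0.9]), np.array([0.99, 1.0])]:
    print(s, sparse_reward(s, goal), round(shaped_reward(s, goal), 3))
```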
00:20:17.400 | And so this is nice.
00:20:19.120 | I think, you know,
00:20:20.520 | let's look at the videos a little bit more.
00:20:22.520 | Like, it's nice how
00:20:23.680 | it confidently and energetically moves
00:20:25.920 | the little green puck to its target.
00:20:29.120 | And here's another one.
00:20:30.800 | (silence)
00:20:32.960 | Okay, so we can skip the,
00:20:51.720 | it works if you do it on a physical robot as well,
00:20:54.320 | but we can skip it.
00:20:55.320 | So, I think the point is
00:20:58.440 | that the hindsight experience replay algorithm
00:21:00.880 | is directionally correct,
00:21:02.880 | because you want to make use of all your data
00:21:07.680 | and not only a small fraction of it.
00:21:10.040 | Now, one huge question is,
00:21:12.160 | where do you get the high-level states?
00:21:15.720 | Where do the high-level states come from?
00:21:17.760 | Because in the work that I've shown you so far,
00:21:21.760 | the system is asked to achieve low-level states.
00:21:25.080 | So I think one thing that will become very important
00:21:27.880 | for these kind of approaches
00:21:29.440 | is representation learning and unsupervised learning.
00:21:32.640 | Figure out what are the right states,
00:21:35.640 | what's the state space of goals that's worth achieving.
00:21:39.000 | Now I want to go through some real meta-learning results,
00:21:46.320 | and I'll show you a very simple way
00:21:50.360 | of doing sim-to-real from simulation
00:21:53.400 | to the physical robot with meta-learning.
00:21:56.840 | And this is work by Peng et al.
00:21:58.360 | It was a really nice intern project in 2017.
00:22:02.000 | So, I think we can agree that in the domain of robotics,
00:22:08.920 | it would be nice if you could train your policy
00:22:12.080 | in simulation, and then somehow this knowledge
00:22:15.040 | would carry over to the physical robot.
00:22:19.680 | Now, we can build simulators that are okay,
00:22:26.200 | but they can never perfectly match the real world
00:22:29.560 | unless you want to have an insanely slow simulator.
00:22:32.560 | And the reason for that is that it turns out
00:22:36.080 | that simulating contacts is super hard,
00:22:40.240 | and I heard somewhere, correct me if I'm wrong,
00:22:43.080 | that simulating friction is NP-complete.
00:22:45.600 | I'm not sure, but it's like stuff like that.
00:22:49.800 | So your simulation is just not going to match reality.
00:22:53.880 | There'll be some resemblance, but that's it.
00:22:56.240 | How can we address this problem?
00:22:59.280 | And I want to show you one simple idea.
00:23:01.280 | So let's say, one thing that would be nice
00:23:08.440 | is that if you could learn a policy
00:23:11.400 | that would quickly adapt itself to the real world.
00:23:16.400 | Well, if you want to learn a policy that can quickly adapt,
00:23:20.300 | we need to make sure that it has opportunities
00:23:22.240 | to adapt during training time.
00:23:23.920 | So what do we do?
00:23:25.520 | Instead of solving our problem in just one simulator,
00:23:30.520 | we add a huge amount of variability to the simulator.
00:23:33.600 | We say, we will randomize the frictions,
00:23:36.640 | we will randomize the masses,
00:23:38.520 | the length of the different objects
00:23:40.640 | and their, I guess, dimensions.
00:23:44.560 | So you try to randomize physics,
00:23:47.500 | the simulator, in lots of different ways.
00:23:49.760 | And then importantly, you don't tell the policy
00:23:52.640 | how you randomized it.
00:23:54.640 | So what is it going to do then?
00:23:56.160 | You take your policy and you put it in an environment
00:23:58.120 | and it says, well, this is really tough.
00:24:00.400 | I don't know what the masses are
00:24:02.080 | and I don't know what the frictions are.
00:24:03.880 | I need to try things out and figure out
00:24:06.160 | what the friction is as I get responses from the environment.
00:24:10.280 | So you build it, you learn a certain degree
00:24:13.100 | of adaptability into the policy.
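(A toy sketch of the randomization idea only: a 1-D point mass with a fixed feedback rule, not the actual robot setup or a learned recurrent policy. Each episode gets its own friction and mass, and only the observation is visible.)

```python
# Domain randomization in miniature: physics parameters are resampled per
# episode and hidden from the "policy", which sees only the observation.
import numpy as np

rng = np.random.default_rng(0)

def rollout(gain=2.0, steps=200, dt=0.05):
    friction = rng.uniform(0.1, 0.9)      # hidden physics, resampled per episode
    mass = rng.uniform(0.5, 2.0)
    pos, vel, target = 0.0, 0.0, 1.0
    for _ in range(steps):
        obs = target - pos                # the policy sees only the observation
        force = gain * obs - 1.0 * vel    # a simple feedback "policy"
        acc = (force - friction * vel) / mass
        vel += dt * acc
        pos += dt * vel
    return abs(target - pos)              # final error under this physics draw

errors = [rollout() for _ in range(100)]
print("mean final error over randomized physics:", round(float(np.mean(errors)), 3))
```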
00:24:15.960 | And it actually works.
00:24:17.960 | I just want to show you.
00:24:19.140 | This is what happens when you just train a policy
00:24:21.720 | in simulation and deploy it on the physical robot.
00:24:24.920 | And here the goal is to bring the hockey puck
00:24:27.680 | towards the red dot.
00:24:29.600 | And you will see that it will struggle.
00:24:32.420 | And the reason it struggles is because of
00:24:39.360 | the systematic differences between the simulator
00:24:42.480 | and the real physical robot.
00:24:47.340 | So even the basic movement is difficult for the policy
00:24:51.080 | because the assumptions are violated so much.
00:24:53.360 | So if you do the training as I discussed,
00:24:55.560 | we train a recurrent neural network policy
00:24:58.160 | which learns to quickly infer properties of the simulator
00:25:02.880 | in order to accomplish the task.
00:25:04.560 | You can then give it the real thing, the real physics,
00:25:07.560 | and it will do much better.
00:25:08.920 | So now this is not a perfect technique,
00:25:11.520 | but it's definitely very promising.
00:25:12.880 | It's promising whenever you are able
00:25:15.040 | to sufficiently randomize the simulator.
00:25:17.300 | So it's definitely very nice to see
00:25:20.340 | the closed loop nature of the policy.
00:25:22.820 | You can see that it would push the hockey puck
00:25:25.200 | and it would correct it very, very gently
00:25:27.800 | to bring it to the goal.
00:25:29.180 | Yeah, you saw that?
00:25:30.500 | That was cool.
00:25:31.340 | So that was a cool application of meta-learning.
00:25:38.120 | I want to discuss one more application of meta-learning
00:25:41.980 | which is learning a hierarchy of actions.
00:25:45.480 | And this was work done by Frans et al.
00:25:49.280 | Actually, Kevin Frans, the engineer who did it,
00:25:52.460 | was in high school when he wrote this paper.
00:25:58.300 | One thing that would be nice
00:26:04.380 | is if reinforcement learning was hierarchical.
00:26:08.380 | If instead of simply taking micro-actions,
00:26:11.900 | you had some kind of little subroutines
00:26:14.540 | that you could deploy.
00:26:16.420 | Maybe the term subroutine is a little bit too crude,
00:26:18.380 | but if you had some idea of which action primitives
00:26:22.340 | are worth starting with.
00:26:24.480 | Now, no one has been able to get actually
00:26:29.700 | like a real value add
00:26:31.860 | from hierarchical reinforcement learning yet.
00:26:33.860 | So far, all the really cool results,
00:26:35.780 | all the really convincing results
00:26:36.860 | of reinforcement learning do not use it.
00:26:39.900 | That's because we haven't quite figured out
00:26:43.100 | what's the right way for reinforcement learning,
00:26:45.220 | for hierarchical reinforcement learning.
00:26:47.180 | And I just want to show you one very simple approach
00:26:50.940 | where you use meta-learning
00:26:52.540 | to learn a hierarchy of actions.
00:26:57.540 | So here's what you do.
00:26:58.660 | In this specific work,
00:27:05.260 | you have a certain number
00:27:07.660 | of low-level primitives.
00:27:08.780 | Let's say you have 10 of them.
00:27:11.020 | And you have a distribution of tasks.
00:27:13.100 | And your goal is to learn low-level primitives
00:27:19.060 | such that when they're used inside
00:27:23.460 | a very brief run of some reinforcement learning algorithm,
00:27:26.860 | you will make as much progress as possible.
00:27:29.020 | So the idea is you want to learn primitives
00:27:40.100 | that result in the greatest amount of progress possible
00:27:42.900 | when used inside learning.
00:27:44.500 | So this is a meta-learning setup
00:27:45.900 | because you need distribution of tasks.
00:27:47.740 | And here we have a little maze.
00:27:50.940 | You have a distribution of mazes,
00:27:53.740 | and in this case the little bug learned three policies
00:27:56.580 | which move it in a fixed direction.
00:28:00.380 | And as a result of having this hierarchy,
00:28:02.060 | you're able to solve problems really fast,
00:28:03.900 | but only when the hierarchy is correct.
00:28:06.440 | So hierarchical reinforcement learning
00:28:07.700 | is still a work in progress.
00:28:09.100 | And this work is an interesting proof point
00:28:20.260 | of what hierarchical reinforcement learning could be like
00:28:23.740 | if it worked.
00:28:24.560 | Now, I want to just spend one slide
00:28:30.060 | addressing the limitations of high-capacity meta-learning.
00:28:35.200 | The specific limitation is that
00:28:37.660 | the training task distribution has to be equal
00:28:43.620 | to the test task distribution.
00:28:45.120 | And I think this is a real limitation
00:28:47.860 | because in reality, the new task that you want to learn
00:28:52.140 | will in some ways be fundamentally different
00:28:55.660 | from anything you've seen so far.
00:28:58.220 | So for example, if you go to school,
00:28:59.900 | you learn lots of useful things,
00:29:02.700 | but then when you go to work,
00:29:04.200 | only a fraction of the things that you've learned
00:29:07.920 | carries over.
00:29:09.280 | You need to learn quite a few more things from scratch.
00:29:12.880 | So meta-learning would struggle with that
00:29:15.980 | because it really assumes that the distribution
00:29:20.320 | over the training task has to be equal
00:29:21.800 | to the distribution over the test tasks.
00:29:24.080 | That's a limitation.
00:29:24.960 | I think that as we develop better algorithms
00:29:28.880 | for being robust when the test tasks
00:29:33.880 | are outside of the distribution of the training task,
00:29:36.340 | then meta-learning will work much better.
00:29:39.280 | Now, I want to talk about self-play.
00:29:42.960 | I think self-play is a very cool topic
00:29:46.940 | that's starting to get attention only now.
00:29:50.260 | And I want to start by reviewing very old work
00:29:55.460 | called TDGammon.
00:29:57.420 | It's back from all the way from 1992,
00:29:59.980 | so it's 26 years old now.
00:30:02.380 | It was done by Jerry Tesauro.
00:30:04.140 | So this work is really incredible
00:30:07.220 | because it has so much relevance today.
00:30:13.580 | What they did basically, they said,
00:30:17.380 | "Okay, let's take two neural networks
00:30:21.260 | "and let them play against each other,
00:30:25.620 | "let them play backgammon against each other,
00:30:27.740 | "and let them be trained with Q-learning."
00:30:31.240 | So it's a super modern approach.
00:30:33.980 | And you would think this was a paper from 2017,
00:30:38.620 | except that when you look at this plot,
00:30:40.220 | it shows that you only have 10 hidden units,
00:30:42.340 | 20 hidden units, 40 and 80 for the different colors,
00:30:47.000 | where you notice that the largest neural network works best.
00:30:49.620 | So in some ways, not much has changed,
00:30:51.900 | and this is the evidence.
00:30:55.460 | And in fact, they were able to beat
00:30:57.100 | the world champion in backgammon,
00:30:58.500 | and they were able to discover new strategies
00:31:00.320 | that the best human backgammon players have not noticed,
00:31:05.320 | and they've determined that the strategies
00:31:07.540 | discovered by TD-Gammon are actually better.
00:31:09.820 | So that's pure self-play with Q-learning,
00:31:12.220 | which remained dormant until the DQN work
00:31:17.220 | with Atari by DeepMind.
00:31:21.880 | So, now other examples of self-play include AlphaGo Zero,
00:31:26.880 | which was able to learn to beat the world champion in Go
00:31:32.920 | without using any external data whatsoever.
00:31:35.920 | Another result of this vein is by OpenAI,
00:31:39.000 | which is our Dota 2 bot,
00:31:40.960 | which was able to beat the world champion
00:31:43.400 | on the 1v1 version of the game.
00:31:45.120 | And so I want to spend a little bit of time
00:31:50.080 | talking about the allure of self-play
00:31:53.720 | and why I think it's exciting.
00:31:55.280 | So, one important problem that we must face
00:32:02.600 | as we try to build truly intelligent systems
00:32:08.760 | is what is the task?
00:32:11.160 | What are we actually teaching the systems to do?
00:32:13.560 | And one very attractive attribute of self-play
00:32:18.280 | is that the agents create the environment.
00:32:23.280 | By virtue of the agent acting in the environment,
00:32:27.600 | the environment becomes difficult for the other agents.
00:32:31.800 | And you can see here an example of an iguana
00:32:34.880 | interacting with snakes that try to eat it
00:32:37.560 | unsuccessfully this time,
00:32:39.400 | so we can see what will happen in a moment.
00:32:41.560 | The iguana is trying its best.
00:32:44.760 | And so the fact that you have this arms race
00:32:48.200 | between the snakes and the iguana
00:32:50.120 | motivates their development,
00:32:53.200 | potentially without bound.
00:32:55.760 | And this is what happens in effect in biological evolution.
00:32:59.080 | Now, interesting work in this direction
00:33:02.920 | was done in 1994 by Carl Sims.
00:33:05.800 | There is a really cool video on YouTube by Carl Sims.
00:33:09.640 | You should check it out,
00:33:11.000 | which really kind of shows all the work that he's done.
00:33:14.160 | And here you have a little competition between agents
00:33:17.000 | where you evolve both the behavior and their morphology
00:33:20.560 | when the agent is trying to gain possession of a green cube.
00:33:25.360 | And so you can see that the agents
00:33:29.240 | create the challenge for each other,
00:33:31.120 | and that's why they need to develop.
00:33:32.920 | So one thing that we did,
00:33:38.120 | and this is work by Bansal et al. from OpenAI,
00:33:41.760 | is we said, okay, well,
00:33:43.880 | can we demonstrate some unusual results in self-play
00:33:48.560 | that would really convince us that there is something there?
00:33:52.400 | So what we did here is that we created a small ring,
00:33:56.800 | and you have these two humanoid figures,
00:33:58.760 | and their goal is just to push each other outside the ring.
00:34:01.680 | And they don't know anything about wrestling.
00:34:04.920 | They don't know anything about standing
00:34:06.720 | or balancing each other.
00:34:07.840 | They don't know anything about centers of gravity.
00:34:10.000 | All they know is that if you don't do a good job,
00:34:13.040 | then your competition is going to do a better job.
00:34:15.520 | Now, one of the really attractive things about self-play
00:34:20.320 | is that you always have an opponent
00:34:25.320 | that's roughly as good as you are.
00:34:27.140 | In order to learn,
00:34:30.120 | you need to sometimes win and sometimes lose.
00:34:32.600 | Like, you can't always win.
00:34:34.640 | Sometimes you must fail.
00:34:36.080 | Sometimes you must succeed.
00:34:37.440 | So let's see what will happen here.
00:34:41.960 | Yeah, so the green humanoid was able to block the ball.
00:34:46.460 | In a well-balanced self-play environment,
00:34:50.720 | the competition is always level.
00:34:55.400 | No matter how good you are or how bad you are,
00:34:58.320 | you have a competition that makes it
00:35:00.440 | exactly the right challenge for you.
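(A rough sketch of a self-play training loop; `agent` and `env` here are hypothetical interfaces introduced just for illustration, and the real Dota and sumo systems differ in many details.)

```python
# Self-play in outline: the opponent is always a recent copy of the agent,
# so the difficulty tracks the agent's own skill.
import copy
import random

def self_play_training(agent, env, updates=1000, pool_size=10):
    opponent_pool = [copy.deepcopy(agent)]         # start against a copy of yourself
    for step in range(updates):
        opponent = random.choice(opponent_pool)    # roughly-matched opposition
        trajectory = env.play_match(agent, opponent)   # hypothetical env interface
        agent.update(trajectory)                   # any RL update (hypothetical agent interface)
        if step % 50 == 0:                         # periodically freeze a new opponent
            opponent_pool.append(copy.deepcopy(agent))
            opponent_pool[:] = opponent_pool[-pool_size:]
    return agent
```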
00:35:03.080 | Oh, and one thing here.
00:35:04.040 | So this video shows transfer learning.
00:35:06.200 | You take the little wrestling humanoid,
00:35:09.020 | and you take its friend away,
00:35:11.200 | and you start applying big, large, random forces on it,
00:35:14.380 | and you see if it can maintain its balance.
00:35:16.880 | And the answer turns out to be that yes, it can,
00:35:19.800 | because it's been trained against an opponent
00:35:22.960 | that pushes it.
00:35:24.360 | And so that's why, even if it doesn't understand
00:35:27.280 | where the pressure force is being applied on it,
00:35:29.440 | it's still able to balance itself.
00:35:31.400 | So this is one potentially attractive feature
00:35:34.960 | of self-play environments,
00:35:35.920 | that you could learn a certain broad set of skills,
00:35:40.160 | although it's a little hard to control
00:35:41.960 | what the skills will be.
00:35:43.640 | And so the biggest open question with this research is,
00:35:46.640 | how do you learn agents in a self-play environment
00:35:50.040 | such that they do whatever they do,
00:35:54.560 | but then they are able to solve a battery of tasks
00:35:57.000 | that is useful for us,
00:35:58.200 | that is explicitly specified externally?
00:36:00.200 | Yeah.
00:36:02.560 | I also want to highlight one attribute
00:36:08.440 | of self-play environments that we've observed
00:36:10.440 | in our Dota bot,
00:36:12.080 | and that is that we've seen a very rapid increase
00:36:14.400 | in the competence of the bot.
00:36:16.080 | So over the course of maybe five months,
00:36:19.000 | we've seen the bot go from playing totally randomly
00:36:23.840 | all the way to the world champion.
00:36:28.000 | And the reason for that is that once you have
00:36:30.520 | a self-play environment, if you put compute into it,
00:36:34.520 | you turn it into data.
00:36:36.360 | Self-play allows you to turn compute into data.
00:36:40.280 | And I think we will see a lot more of that
00:36:42.280 | as being an extremely important thing
00:36:44.840 | to be able to turn compute into, essentially,
00:36:46.920 | data or generalization,
00:36:48.680 | simply because the speed of neural net processors
00:36:51.720 | will increase very dramatically over the next few years.
00:36:54.920 | So neural net cycles will be cheap,
00:36:56.600 | and it will be important to make use of this
00:36:58.760 | newly found overabundance of cycles.
00:37:01.760 | I also want to talk a little bit about the end game
00:37:05.400 | of the self-play approach.
00:37:07.280 | So one thing that we know about the human brain
00:37:12.480 | is that it has increased in size fairly rapidly
00:37:15.040 | over the past two million years.
00:37:17.240 | My theory, the reason I think it happened,
00:37:21.840 | is because our ancestors got to a point
00:37:26.600 | where the thing that's most important for your survival
00:37:29.680 | is your standing in the tribe,
00:37:32.160 | and less the tiger and the lion.
00:37:34.720 | Once the most important thing is
00:37:37.680 | how you deal with those other things
00:37:39.200 | which have a large brain,
00:37:40.560 | then it really helps to have a slightly larger brain.
00:37:43.120 | And I think that's what happened.
00:37:44.480 | And there exists at least one paper from Science
00:37:47.680 | which supports this point of view.
00:37:50.160 | So apparently there has been convergent evolution
00:37:52.360 | in various behaviors between social apes and social birds,
00:38:00.800 | even though the divergence in evolutionary time scale
00:38:04.840 | between apes and birds occurred a very long time ago,
00:38:11.000 | and apes and birds have very different brain structures.
00:38:14.480 | So I think what should happen if we succeed,
00:38:19.800 | if we successfully follow the path of this approach,
00:38:23.160 | is that we should create a society of agents
00:38:25.040 | which will have language and theory of mind,
00:38:28.680 | negotiation, social skills, trade, economy,
00:38:32.920 | politics, justice system.
00:38:35.040 | All these things should happen
00:38:37.000 | inside the multi-agent environment.
00:38:38.960 | And there will also be some alignment issue
00:38:40.560 | of how do you make sure that the agents we learn
00:38:43.040 | behave in a way that we want.
00:38:45.200 | Now, I want to make a speculative digression here,
00:38:48.800 | which is, I want to make the following observation.
00:38:57.080 | If you believe that this kind of society of agents
00:39:02.000 | is a plausible place where truly,
00:39:07.000 | where fully general intelligence will emerge,
00:39:12.080 | and if you accept that our experience with the DotaBot,
00:39:16.640 | where we've seen a very rapid increase in competence,
00:39:18.640 | will carry over once all the details are right,
00:39:21.880 | if you assume both of these conditions,
00:39:24.560 | then it should follow that we should see
00:39:26.960 | a very rapid increase in the competence of our agents
00:39:30.640 | as they live in the society of agents.
00:39:33.400 | So now that we've talked about a potentially
00:39:38.120 | interesting way of increasing the competence
00:39:41.880 | and teaching agents social skills and language,
00:39:45.640 | and a lot of things that actually exist in humans as well,
00:39:49.160 | we want to talk a little bit about
00:39:51.560 | how you convey goals to agents.
00:39:55.600 | And the question of conveying goals to agents
00:39:59.480 | is just a technical problem,
00:40:01.320 | but it will be important
00:40:03.120 | because it is more likely than not
00:40:07.960 | that the agents that we will train
00:40:10.440 | will eventually be dramatically smarter than us.
00:40:14.200 | And this is work by the OpenAI Safety Team
00:40:17.480 | by Paul Crusciano et al and others.
00:40:19.440 | So I'm just going to show you this video
00:40:23.400 | which basically explains how the whole thing works.
00:40:25.960 | There is some behavior you're looking for,
00:40:30.520 | and you, the human, gets to see pairs of behaviors.
00:40:34.680 | And you simply click on the one that looks better.
00:40:37.240 | And after a very modest number of clicks,
00:40:44.120 | you can get this little simulated leg to do backflips.
00:40:49.960 | And there you go, you can now do backflips.
00:40:54.960 | And to get this specific behavior,
00:41:02.240 | it took about 500 clicks by human annotators.
00:41:07.240 | The way it works is that you take all the,
00:41:10.440 | so this is a very data-efficient
00:41:12.520 | reinforcement learning algorithm,
00:41:14.480 | but it is efficient in terms of rewards
00:41:17.040 | and not in terms of the environment interactions.
00:41:20.360 | So what you do here is that you take all the clicks,
00:41:22.840 | so you've got your, here is one behavior
00:41:26.000 | which is better than the other.
00:41:27.680 | You fit a reward function,
00:41:29.880 | a numerical reward function to those clicks.
00:41:33.480 | So you want to fit a reward function
00:41:34.840 | which satisfies those clicks,
00:41:36.120 | and then you optimize this reward function
00:41:37.600 | with reinforcement learning.
00:41:39.560 | And it actually works.
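(A hedged sketch of fitting a reward model to pairwise clicks with a logistic, Bradley-Terry-style loss on made-up segment features; this is illustrative only and not the OpenAI implementation.)

```python
# Fit a numerical reward function to "which of these two looks better" clicks,
# then such a reward would be handed to an RL algorithm to optimize.
import numpy as np

rng = np.random.default_rng(0)
dim = 6
true_w = rng.normal(size=dim)                      # stand-in for the human's preference

# each click: features of two trajectory segments, label = which one was clicked
seg_a = rng.normal(size=(500, dim))
seg_b = rng.normal(size=(500, dim))
clicks = (seg_a @ true_w > seg_b @ true_w).astype(float)

w = np.zeros(dim)                                  # learned reward parameters
lr = 0.5
for _ in range(2000):
    diff = (seg_a - seg_b) @ w                     # predicted preference margin
    p = 1.0 / (1.0 + np.exp(-diff))                # P(segment A preferred)
    grad = ((p - clicks)[:, None] * (seg_a - seg_b)).mean(axis=0)
    w -= lr * grad                                 # logistic regression on the clicks

agree = np.mean((seg_a @ w > seg_b @ w) == clicks.astype(bool))
print("reward model agrees with the clicks on", round(float(agree), 3), "of pairs")
```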
00:41:41.160 | So this requires 500 bits of information.
00:41:44.760 | We've also been able to train lots of Atari games
00:41:48.000 | using several thousand bits of information.
00:41:49.560 | So in all these cases, you had human annotators
00:41:53.000 | or human judges, just like in the previous slide,
00:41:56.640 | looking at pairs of trajectories
00:42:00.440 | and clicking on the one that they thought was better.
00:42:02.840 | And here's an example of an unusual goal
00:42:08.000 | where this is a car racing game,
00:42:10.000 | but the goal was to ask the agent
00:42:13.800 | to train the white car to drive right behind the orange car.
00:42:18.720 | So it's a different goal,
00:42:19.760 | and it was very straightforward to communicate this goal
00:42:23.360 | using this approach.
00:42:24.400 | So then, to finish off,
00:42:29.000 | alignment is a technical problem.
00:42:31.320 | It has to be solved.
00:42:32.960 | But of course, the determination of the correct goals
00:42:36.240 | we want our AI systems to have
00:42:38.520 | will be a very challenging political problem.
00:42:42.000 | And on this note, I want to thank you so much
00:42:44.440 | for your attention, and I just want to say
00:42:47.200 | that it will be a happy hour
00:42:48.160 | at Cambridge Brewing Company at 8.45
00:42:50.360 | if you want to chat more about AI and other topics.
00:42:53.360 | Please come by.
00:42:54.720 | - I think that deserves an applause.
00:42:56.440 | Thank you very much.
00:42:57.280 | (audience applauding)
00:43:00.440 | - So back propagation is,
00:43:05.800 | well, neural networks are buyer-inspired,
00:43:09.000 | but back propagation doesn't look as though
00:43:10.680 | it's what's going on in the brain,
00:43:12.440 | because signals in the brain go one direction
00:43:14.960 | down the axons, whereas back propagation
00:43:16.680 | requires the errors to be propagated back up the wires.
00:43:20.560 | So can you just talk a little bit
00:43:24.120 | about that whole situation where it looks
00:43:26.760 | as though the brain is doing something a bit different
00:43:28.560 | than our highly successful algorithms?
00:43:31.400 | Are algorithms gonna be improved
00:43:33.520 | once we figure out what the brain is doing,
00:43:35.320 | or is the brain really sending signals back
00:43:37.360 | even though it's got no obvious way of doing that?
00:43:40.240 | What's happening in that area?
00:43:42.240 | - So that's a great question.
00:43:44.360 | So first of all, I'll say that the true answer
00:43:46.520 | is that, the honest answer is that I don't know,
00:43:48.880 | but I have opinions.
00:43:50.600 | And so I'll say two things.
00:43:55.040 | First of all, it is a true fact that backpropagation
00:44:03.880 | solves the problem of circuit search.
00:44:07.920 | This problem feels like an extremely fundamental problem.
00:44:11.080 | And for this reason, I think that it's unlikely to go away.
00:44:14.980 | Now, you're also right that the brain
00:44:17.160 | doesn't obviously do back propagation,
00:44:19.520 | although there have been multiple proposals
00:44:20.960 | of how it could be doing them.
00:44:22.880 | For example, there's been a work
00:44:25.440 | by Tim Lillicrap and others, where they've shown
00:44:29.600 | that it's possible to learn
00:44:32.280 | a different set of connections that can be used
00:44:35.240 | for the backward pass, and that can result
00:44:37.400 | in successful learning.
00:44:38.640 | Now, the reason this hasn't been really pushed
00:44:41.440 | to the limit by practitioners is because they say,
00:44:44.000 | well, I've got TF to compute the gradients,
00:44:46.480 | I'm just not going to worry about it.
00:44:48.580 | But you are right that this is an important issue,
00:44:50.600 | and one of two things is going to happen.
00:44:53.400 | So my personal opinion is that back propagation
00:44:56.040 | is just going to stay with us till the very end,
00:44:58.000 | and we'll actually build fully human level
00:45:00.760 | and beyond systems before we understand
00:45:02.760 | how the brain does what it does.
00:45:05.360 | So that's what I believe, but of course,
00:45:09.560 | it is a difference that has to be acknowledged.
00:45:12.560 | - Okay, thank you.
00:45:14.240 | Do you think it was a fair matchup for the Dota bot
00:45:18.160 | and that person, given the constraints of the system?
00:45:21.680 | - So I'd say that the biggest advantage computers have
00:45:26.000 | in games like this, like one of the big advantages,
00:45:28.320 | is that they obviously have a better reaction time.
00:45:31.960 | Although in Dota in particular,
00:45:33.840 | the number of clicks per second
00:45:36.160 | of the top players is fairly small,
00:45:39.040 | which is different from StarCraft.
00:45:40.580 | So in StarCraft, StarCraft is a very mechanically heavy game
00:45:44.720 | because of a large number of units.
00:45:46.680 | And so the top players, they just click all the time.
00:45:49.240 | In Dota, every player controls just one hero,
00:45:53.440 | and so that greatly reduces the total number
00:45:55.720 | of actions they need to make.
00:45:56.940 | Now, still, precision matters.
00:45:58.540 | I think that we'll discover that,
00:46:02.220 | but what I think will really happen
00:46:03.500 | is that we'll discover that computers
00:46:04.780 | have the advantage in any domain.
00:46:08.920 | Or rather, every domain.
00:46:12.180 | Not yet.
00:46:15.380 | - So do you think that the emergent behaviors
00:46:17.420 | from the agent were actually kind of directed
00:46:20.460 | because the constraints were already kind of in place?
00:46:22.620 | Like, so it was kind of forced to discover those?
00:46:24.060 | Or do you think that, like,
00:46:26.620 | that was actually something quite novel
00:46:27.940 | that, like, wow, it actually discovered these on its own?
00:46:30.900 | Like, you didn't actually have
00:46:31.980 | to be biased towards constraining it?
00:46:33.540 | - So it's definitely, we discover new strategies,
00:46:35.540 | and I can share an anecdote where our tester,
00:46:39.260 | we have a pro who would test the bot,
00:46:41.980 | and he played against it for a long time,
00:46:44.540 | and the bot would do all kinds of things
00:46:46.500 | against the player, the human player, which were effective.
00:46:49.980 | Then at some point, that pro decided
00:46:52.900 | to play against the better pro,
00:46:54.940 | and he decided to imitate one of the things
00:46:56.580 | that the bot was doing, and this,
00:47:00.140 | by imitating it, he was able to defeat a better pro.
00:47:03.300 | So I think the strategies that the bot discovers are real,
00:47:06.180 | and so, like, it means that,
00:47:08.100 | like, there's very real transfer, you know.
00:47:10.300 | I would say, I think what that means
00:47:15.260 | is that because the strategies discovered
00:47:17.180 | by the bot help the humans, it means that the,
00:47:19.180 | like, the fundamental gameplay is deeply related.
00:47:21.620 | - For a long time now, I've heard that the objective
00:47:25.820 | of reinforcement learning is to determine a policy
00:47:30.140 | that chooses an action to maximize the expected reward,
00:47:34.500 | which is what you said earlier.
00:47:36.180 | Would you ever wanna look at the standard deviation
00:47:39.540 | of possible rewards?
00:47:41.540 | Does that even make sense?
00:47:42.700 | - Yeah.
00:47:43.940 | I mean, I think for sure.
00:47:44.780 | I think it's really application-dependent.
00:47:47.500 | One of the reasons to maximize the expected reward
00:47:50.900 | is because it's easier to design algorithms for it.
00:47:54.740 | So you write down this equation, the formula,
00:47:59.180 | you do a little bit of derivation,
00:48:00.740 | you get something which amounts
00:48:02.380 | to a nice-looking algorithm.
00:48:03.860 | Now, I think there exists, like, really,
00:48:07.900 | there exists applications
00:48:09.820 | where you never wanna make mistakes,
00:48:11.060 | and you wanna work on the standard deviation as well,
00:48:13.900 | but in practice, it seems that the,
00:48:16.420 | just looking at the expected reward
00:48:17.820 | covers a large fraction of the situations
00:48:23.220 | you'd like to apply this to.
00:48:24.940 | - Okay, thanks.
00:48:25.780 | - We talked last week about motivations,
00:48:32.820 | and that has a lot to do with the reinforcement,
00:48:37.140 | and some of the ideas are that
00:48:40.360 | our motivations are actually connection
00:48:43.820 | with others and cooperation,
00:48:46.340 | and I'm wondering if,
00:48:48.460 | and I understand it's very popular
00:48:50.340 | to have the computers play these competitive games,
00:48:53.780 | but is there any use in having an agent
00:48:59.580 | self-play collaboratively, collaborative games?
00:49:03.740 | - Yeah, I think that's an extremely good question.
00:49:06.240 | I think one place from which we can get some inspiration
00:49:10.780 | is from the evolution of cooperation.
00:49:12.620 | I think
00:49:14.340 | we cooperate ultimately
00:49:19.740 | because it's much better for you,
00:49:22.100 | the person, to be cooperative than not,
00:49:24.900 | and so I think what should happen is,
00:49:29.020 | if you have a sufficiently open-ended game,
00:49:32.600 | then cooperation will be the winning strategy,
00:49:36.700 | and so I think we will get cooperation
00:49:38.420 | whether we like it or not.
00:49:39.720 | - Hey, you mentioned the complexity
00:49:47.240 | of the simulation of friction.
00:49:50.240 | I was wondering if you feel that there exists
00:49:52.660 | open complexity-theoretic problems relevant to AI,
00:49:56.940 | or whether it's just a matter of finding good approximations
00:49:59.780 | to the types of problems
00:50:02.200 | that humans tend to solve.
00:50:04.260 | - Yeah, so complexity theory,
00:50:06.140 | well, at a very basic level,
00:50:10.920 | we know that whatever algorithm we're gonna run
00:50:13.940 | is going to run fairly efficiently on some hardware,
00:50:16.940 | so that puts a pretty strict upper bound
00:50:20.720 | on the true complexity of the problems we're solving.
00:50:23.420 | By definition, we are solving problems
00:50:25.980 | which aren't too hard in a complexity-theoretic sense.
00:50:28.780 | Now, it is also the case that,
00:50:33.480 | while the overall thing that we do
00:50:36.620 | is not hard in a complexity-theoretic sense,
00:50:38.440 | and indeed, humans cannot solve
00:50:40.060 | NP-complete problems in general,
00:50:41.860 | many of the optimization problems
00:50:46.500 | that we pose to our algorithms
00:50:48.420 | are intractable in the general case,
00:50:50.500 | starting from neural net optimization itself.
00:50:53.380 | It is easy to create a family of data sets
00:50:56.100 | for a neural network with a very small number of neurons,
00:50:58.460 | such that finding the global optimum is NP-complete.
00:51:01.380 | And so, how do we avoid it?
00:51:04.420 | Well, we just try gradient descent anyway,
00:51:06.500 | and somehow it works.
00:51:07.780 | But without question,
00:51:13.980 | we do not solve problems which are truly intractable.
00:51:17.420 | So, I mean, I hope this answers the question.
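
A minimal toy sketch of the "just try gradient descent anyway" point (the network size, data, and learning rate are arbitrary choices, not anything from the talk): plain gradient descent fitting a tiny network to XOR, a non-convex problem that nonetheless trains fine in practice.

```python
# Toy sketch: worst-case neural-net training is NP-hard, yet plain
# gradient descent on a tiny non-convex problem (XOR) usually works.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Two-layer network with a handful of hidden units.
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for step in range(5000):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    # Backward pass for the cross-entropy loss.
    dlogits = (p - y) / len(X)
    dW2, db2 = h.T @ dlogits, dlogits.sum(0)
    dh = dlogits @ W2.T * (1 - h ** 2)
    dW1, db1 = X.T @ dh, dh.sum(0)
    # Plain gradient descent step.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(np.round(p, 2))  # typically close to [[0], [1], [1], [0]]
```
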
00:51:19.920 | - Hello.
00:51:22.060 | It seems like an important sub-problem
00:51:24.700 | on the path towards AGI will be understanding language,
00:51:28.500 | and the state of generative language modeling right now
00:51:31.180 | is pretty abysmal.
00:51:32.860 | What do you think are the most productive
00:51:35.220 | research trajectories towards generative language models?
00:51:38.320 | - So, I'll first say that you are completely correct
00:51:41.900 | that the situation with language is still far from great,
00:51:44.860 | although progress has been made,
00:51:46.580 | even without any particular innovations
00:51:50.860 | beyond models that exist today.
00:51:52.900 | Simply scaling up models that exist today
00:51:54.940 | on larger data sets is going to go surprisingly far.
00:51:59.100 | Not even larger data sets, but larger and deeper models.
00:52:01.580 | For example, if you trained a language model
00:52:04.060 | with a thousand layers, where it's the same layer repeated,
00:52:06.820 | I think it's gonna be a pretty amazing language model.
00:52:10.940 | We don't have the cycles for it yet,
00:52:13.420 | but I think that will change very soon.
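
A rough sketch of what a very deep model built from one repeated, weight-shared layer could look like (the sizes, names, and residual-plus-tanh block are invented for illustration): the same block applied many times, so depth grows without adding parameters.

```python
# Rough sketch (not from the talk): one shared layer applied a thousand
# times, so the network is very deep but has a single layer's parameters.
import numpy as np

rng = np.random.default_rng(0)
d_model, depth, seq_len = 64, 1000, 16

# A single shared layer's parameters, reused at every depth step.
W = rng.normal(scale=0.02, size=(d_model, d_model))
b = np.zeros(d_model)

def shared_block(h):
    # Residual update with the same weights at every layer.
    return h + np.tanh(h @ W + b)

h = rng.normal(size=(seq_len, d_model))  # stand-in for token embeddings
for _ in range(depth):
    h = shared_block(h)
# In a real language model, h would now feed a softmax over the vocabulary.
```
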
00:52:15.660 | Now, I also agree with you that there are
00:52:18.420 | some fundamental things missing
00:52:20.220 | in our current understanding of deep learning,
00:52:24.580 | which prevent us from really solving
00:52:27.100 | the problem that we want.
00:52:28.380 | So I think one of these problems,
00:52:29.660 | one of the things that's missing,
00:52:31.100 | or that seems patently wrong,
00:52:33.500 | is the fact that we train a model,
00:52:39.180 | and then we stop training the model, and we freeze it.
00:52:41.900 | Even though it's the training process
00:52:44.700 | where the magic really happens.
00:52:47.460 | The magic is, if you think about it,
00:52:49.900 | the training process is the true general part
00:52:53.820 | of the whole story, because your TensorFlow code
00:52:57.340 | doesn't care which data set to optimize.
00:52:59.420 | It just says, "Whatever, just give me the data set.
00:53:00.780 | "I don't care which problem to solve.
00:53:02.020 | "I'll solve them all."
00:53:03.860 | So the ability to do that feels really special,
00:53:07.900 | and I think we are not using it at test time.
00:53:11.580 | It's hard to speculate about
00:53:13.700 | things to which we don't know the answer,
00:53:14.980 | but all I'll say is that simply training
00:53:17.980 | bigger, deeper language models
00:53:20.220 | will go surprisingly far, scaling up.
00:53:22.900 | But also doing things like training at test time
00:53:24.620 | and inference at test time, I think,
00:53:25.660 | will be another important boost to performance.
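
A hedged sketch of what "training at test time" might look like in the simplest possible setting (the toy model, loss, and post-deployment stream are invented for illustration): keep taking gradient steps on data that arrives after deployment instead of freezing the weights.

```python
# Toy sketch of continuing to train after deployment instead of
# freezing the model. The linear model and squared loss are stand-ins
# for whatever objective is actually available at test time.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)  # a "trained" model's parameters (toy linear model)

def loss_and_grad(w, x, y):
    # Squared error on one example; stands in for any differentiable loss.
    err = w @ x - y
    return 0.5 * err ** 2, err * x

def adapt_online(w, stream, lr=0.01):
    """Continue training on data encountered after deployment."""
    for x, y in stream:
        _, g = loss_and_grad(w, x, y)
        w = w - lr * g  # never stop training: update on every new example
    return w

# Simulated post-deployment stream whose distribution differs from training.
stream = [(rng.normal(size=3), 1.0) for _ in range(100)]
w = adapt_online(w, stream)
```
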
00:53:29.020 | - Hi, thank you for the talk.
00:53:31.820 | So it seems like right now another interesting approach
00:53:35.140 | to solving reinforcement learning problems
00:53:36.860 | would be to go for the evolutionary routes,
00:53:39.620 | using evolutionary strategies.
00:53:41.460 | And although they have their caveats,
00:53:43.620 | I wanted to know if at OpenAI particularly
00:53:45.860 | you're working on something related,
00:53:47.620 | and what is your general opinion on them?
00:53:50.020 | - So, at present, I believe that something
00:53:54.300 | like evolutionary strategies is not great
00:53:56.020 | for reinforcement learning.
00:53:58.620 | I think that normal reinforcement learning algorithms,
00:54:00.660 | especially with big policies, are better.
00:54:03.840 | But I think if you want to evolve a small, compact object,
00:54:06.220 | like a piece of code, for example,
00:54:10.300 | I think that would be a place where
00:54:12.500 | this would be seriously worth considering.
00:54:14.780 | But evolving a useful piece of code
00:54:19.740 | is a cool idea; it hasn't been done yet,
00:54:21.560 | so there's still a lot of work to be done before we get there.
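
For reference, a minimal sketch of an evolution-strategies-style update on a toy black-box objective (the objective, population size, and step sizes are invented for illustration; this shows the flavor of the method, not OpenAI's implementation).

```python
# Minimal evolution-strategies-style loop: perturb the parameters, score
# each perturbation with a black-box "return", and move toward the
# better-performing ones. No gradients of f are ever computed.
import numpy as np

rng = np.random.default_rng(0)

def f(theta):
    # Hypothetical black-box return: higher is better.
    return -np.sum((theta - 3.0) ** 2)

theta = np.zeros(10)          # parameters being evolved
sigma, lr, pop = 0.1, 0.02, 50

for step in range(300):
    eps = rng.normal(size=(pop, theta.size))      # population of perturbations
    returns = np.array([f(theta + sigma * e) for e in eps])
    # Standardize returns and step toward the better perturbations.
    adv = (returns - returns.mean()) / (returns.std() + 1e-8)
    theta = theta + lr / (pop * sigma) * eps.T @ adv

print(np.round(theta, 2))  # approaches the optimum at 3.0
```
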
00:54:25.100 | - Hi, thank you so much for coming.
00:54:26.740 | My question is, you mentioned that what the right goal is
00:54:30.940 | is a political problem, so I'm wondering
00:54:32.700 | if you can elaborate a bit on that,
00:54:34.340 | and then also, what do you think would be the approach
00:54:37.260 | for us to maybe get there?
00:54:39.420 | - Well, I can't really comment too much,
00:54:41.980 | because, you know,
00:54:45.260 | we now have a few people who are thinking
00:54:47.980 | about this full-time at OpenAI,
00:54:50.540 | and I don't have a strong enough opinion
00:54:55.500 | to say anything too definitive.
00:54:57.900 | All I can say at a very high level
00:54:59.720 | is this: if you go into the future,
00:55:02.500 | whenever it's gonna happen, sooner or later,
00:55:05.660 | you will build a computer which can do anything better
00:55:09.420 | than a human. It will happen, 'cause the brain is physical.
00:55:12.500 | The impact on society is going to be
00:55:15.500 | completely massive and overwhelming.
00:55:18.100 | It's very difficult to imagine,
00:55:21.780 | even if you try really hard.
00:55:23.620 | And I think what it means is that people will care a lot.
00:55:26.460 | And that's what I was alluding to,
00:55:29.220 | the fact that this will be something
00:55:31.220 | that many people will care about strongly.
00:55:33.340 | And as the impact increases gradually,
00:55:37.820 | with self-driving cars, more automation,
00:55:39.540 | I think we will see a lot more people care.
00:55:41.820 | - Do we need to have a very accurate model
00:55:45.340 | of the physical world and simulate that
00:55:48.620 | in order to have these agents that can eventually
00:55:52.380 | come out into the real world and do something
00:55:55.060 | approaching, you know, human-level intelligence tasks?
00:55:58.700 | - Yeah, that's a very good question.
00:56:00.860 | So I think if that were the case, we'd be in trouble.
00:56:04.180 | And I am very certain that it could be avoided.
00:56:10.500 | So specifically, the real answer has to be that, look,
00:56:14.780 | you learn to problem-solve, you learn to negotiate,
00:56:17.820 | you learn to persist, you learn lots of different
00:56:20.060 | useful life lessons in the simulation.
00:56:21.860 | And yes, you learn some physics, too.
00:56:23.500 | But then you go out into the real world,
00:56:25.380 | and you have to start over to some extent,
00:56:27.260 | because many of your deeply held assumptions will be false.
00:56:31.060 | And that's one of the reasons
00:56:33.860 | I care so much about never stopping training.
00:56:37.900 | You've accumulated your knowledge,
00:56:39.460 | now you go into an environment where some
00:56:40.940 | of your assumptions are violated, you continue training.
00:56:42.940 | You try to connect the new data to your old data.
00:56:45.020 | And this is an important requirement from our algorithms,
00:56:47.220 | which is already met to some extent,
00:56:48.660 | but it will have to be met a lot more
00:56:50.620 | so that you can take the partial knowledge
00:56:53.140 | that you've acquired, go into a new situation,
00:56:56.940 | and learn some more.
00:56:57.940 | It's literally the example of, you go to school,
00:57:00.580 | you learn useful things, then you go to work.
00:57:02.820 | It's not perfect;
00:57:04.860 | your four years of CS in undergrad
00:57:07.020 | are not gonna fully prepare you for
00:57:08.980 | whatever it is you need to know at work.
00:57:10.540 | It will help somewhat.
00:57:11.820 | You'll be able to get off the ground,
00:57:12.900 | but there will be lots of new things you need to learn.
00:57:14.660 | So that's the spirit of it.
00:57:16.620 | I think of it as school.
00:57:18.900 | - One of the things you mentioned pretty early on
00:57:20.900 | in your talk is that one of the limitations
00:57:22.700 | of this sort of style of reinforcement learning
00:57:27.140 | is that there's no self-organization.
00:57:27.140 | So you have to tell it when it did a good thing
00:57:28.820 | or it did a bad thing.
00:57:29.860 | And that's actually a problem in neuroscience
00:57:31.500 | as well when you're trying to teach a rat
00:57:32.700 | to navigate a maze.
00:57:33.940 | You have to artificially tell it what to do.
00:57:36.420 | So where do you see moving forward
00:57:37.900 | when we already have this problem with teaching,
00:57:40.020 | not necessarily learning, but also teaching.
00:57:41.940 | So where do you see the research moving forward
00:57:43.660 | in that respect?
00:57:44.820 | How do you sort of introduce
00:57:45.900 | this notion of self-organization?
00:57:48.100 | - So I think without question,
00:57:49.420 | one really important thing you need to do
00:57:51.220 | is to be able to infer the goals and strategies
00:57:55.780 | of other agents by observing them.
00:57:57.540 | That's a fundamental skill you need to be able to learn,
00:58:00.860 | to embed into the agents.
00:58:02.900 | So that for example, you have two agents,
00:58:04.780 | one of them is doing something,
00:58:06.180 | and the other agent says, "Well, that's really cool.
00:58:07.700 | "I wanna be able to do that too."
00:58:09.260 | And then you go on and do that.
00:58:10.700 | And so I'd say that this is a very important component
00:58:12.900 | in terms of setting the reward:
00:58:15.380 | you see what they do, you infer the reward,
00:58:18.900 | and now we have a knob which says,
00:58:20.980 | "You see what they're doing?
00:58:22.100 | "Now go and try to do the same thing."
00:58:24.140 | So I'd say, as far as I know,
00:58:27.020 | this is one of the important ways in which humans
00:58:31.740 | are quite different from other animals:
00:58:34.060 | the scale and scope
00:58:39.060 | in which we copy the behavior of other humans.
00:58:42.420 | - Might I ask a quick follow-up?
00:58:45.180 | - Go for it.
00:58:46.020 | - So that's kind of obvious how that works
00:58:47.540 | in the scope of competition,
00:58:48.940 | but what about just sort of arbitrary tasks?
00:58:51.220 | Like I'm in a math class with someone
00:58:52.860 | and I see someone doing a problem a particular way
00:58:54.900 | and I'm like, "Oh, that's a good strategy.
00:58:56.100 | "Maybe I should try that out."
00:58:57.620 | How does that work in a sort of non-competitive environment?
00:59:00.460 | - So I think
00:59:02.100 | that's going to be a little bit separate
00:59:04.700 | from the competitive environment,
00:59:06.660 | but it will have to be somehow
00:59:10.260 | baked in, or maybe evolved into the system,
00:59:15.620 | where if you have other agents doing things,
00:59:19.900 | they're generating data which you observe,
00:59:22.020 | and the only way to truly make sense
00:59:23.660 | of the data that you see is to infer the goal of the agent,
00:59:27.780 | the strategy, their belief state.
00:59:29.780 | That's important also for communicating with them.
00:59:32.460 | If you want to successfully communicate with someone,
00:59:34.140 | you have to keep track both of their goal
00:59:35.820 | and of their belief state and state of knowledge.
00:59:37.780 | So I think you will find that there are many,
00:59:40.500 | I guess, connections between understanding
00:59:43.900 | what other agents are doing, inferring their goals,
00:59:46.140 | imitating them, and successfully communicating with them.
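
A toy sketch of the goal-inference idea discussed above (the candidate goals, trajectory, and scoring rule are all invented for illustration): watch another agent's trajectory, pick the candidate goal that best explains its moves, and adopt that goal when imitating.

```python
# Toy sketch: infer another agent's goal from its observed behavior by
# picking the candidate goal most consistent with its moves.
import numpy as np

candidate_goals = [np.array([5.0, 0.0]), np.array([0.0, 5.0]), np.array([5.0, 5.0])]

def explains(goal, trajectory):
    # Score how consistently each observed step moves toward the goal.
    score = 0.0
    for prev, curr in zip(trajectory[:-1], trajectory[1:]):
        step = curr - prev
        toward = goal - prev
        score += step @ toward / (np.linalg.norm(step) * np.linalg.norm(toward) + 1e-8)
    return score

# Observed trajectory of the other agent (heading roughly toward [5, 5]).
observed = [np.array([0.0, 0.0]), np.array([1.0, 0.9]),
            np.array([2.1, 2.0]), np.array([3.0, 3.1])]

inferred = max(candidate_goals, key=lambda g: explains(g, observed))
print(inferred)  # -> [5. 5.]; the imitating agent can now adopt this goal as its own reward
```
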
00:59:49.180 | - All right, let's give Ilya and the happy hour a big hand.
00:59:52.260 | (audience applauding)
00:59:55.420 | (audience cheering)
00:59:58.420 | (upbeat music)