Ilya Sutskever: OpenAI Meta-Learning and Self-Play | MIT Artificial General Intelligence (AGI)
Chapters
0:00 Introduction
0:55 Talk
43:04 Q&A
00:00:00.000 |
Welcome back to 6.S099, Artificial General Intelligence. 00:00:13.400 |
He started in the ML group in Toronto with Geoffrey Hinton, 00:00:31.120 |
And his work, his recent work in the past five years, 00:00:42.600 |
and driver behind some of the biggest breakthrough ideas 00:00:45.400 |
in deep learning and artificial intelligence ever. 00:00:56.680 |
- All right, thanks for the introduction, Lex. 00:01:22.480 |
Which I think is actually not a self-evident thing 00:01:30.520 |
it's a mathematical theorem that you can prove, 00:01:35.520 |
is that if you could find the shortest program 00:01:43.760 |
then you will achieve the best generalization possible. 00:01:48.080 |
you can turn it into a very, very simple algorithm. 00:02:08.000 |
then you've essentially extracted all conceivable regularity 00:02:21.840 |
but there is no way to express it as a shorter program, 00:02:26.120 |
then it means that your data is totally random. 00:02:31.200 |
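The idea being appealed to here is usually stated in terms of Kolmogorov complexity; a compact restatement of that framing (a gloss on the remark, not a formula from the talk):

```latex
K(x) \;=\; \min_{p \,:\, U(p) = x} |p|
```

Here K(x) is the length of the shortest program p, on a fixed universal machine U, that outputs the data x. Informally, predicting with the shortest program consistent with the data gives the best generalization one can hope for, and if no program much shorter than the data itself exists (K(x) close to |x|), the data carries no compressible regularity, i.e. it is essentially random.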
Now there is little known mathematical theory behind this, 00:02:44.200 |
at least given today's tools and understanding, 00:02:48.960 |
that explains or generates or solves your problem 00:02:57.960 |
The space of all programs is a very nasty space. 00:03:05.600 |
result in massive changes to the behavior of the program, 00:03:13.400 |
Of course, you get something totally different. 00:03:19.320 |
search there seems to be completely off the table. 00:03:34.800 |
It turns out that when it comes to small circuits, 00:03:40.200 |
that solves your problem using backpropagation. 00:03:52.320 |
and you impose constraints on your circuit using data, 00:03:56.600 |
you can find a way to satisfy these constraints 00:04:01.040 |
using backprop by iteratively making small changes 00:04:13.080 |
What this means is that the computational problem 00:04:16.680 |
that's solved by backpropagation is extremely profound. 00:04:34.040 |
for which you cannot find the best neural network. 00:04:36.680 |
But in practice, that seems to be not a problem. 00:04:45.840 |
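As a concrete illustration of "imposing constraints on a circuit with data and satisfying them by iteratively making small changes," here is a minimal sketch, not from the talk: a tiny two-layer network fit to the XOR constraints with plain gradient descent in NumPy.

```python
import numpy as np

# Toy "small circuit": a 2-8-1 network trained to satisfy the XOR constraints.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for step in range(10000):
    # Forward pass: the circuit's current behavior on the data.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: small parameter changes that reduce the constraint violation.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(0)

print(np.round(out, 2))  # typically close to [[0], [1], [1], [0]] after training
```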
where you have a large number of equation terms like this, 00:04:54.200 |
and they represent all your degrees of freedom. 00:04:57.240 |
And you use gradient descent to push the information 00:05:16.400 |
And you can do quite a lot with 50 time steps 00:05:19.840 |
of a very, very powerful, massively parallel computer. 00:05:23.960 |
So for example, I think it is not widely known 00:05:50.760 |
you can sort successfully using only two parallel steps. 00:05:54.800 |
So there's something slightly unobvious going on. 00:05:57.160 |
Now, these are parallel steps of threshold neurons, 00:06:18.640 |
because it can run computation inside of its layers, 00:06:35.640 |
And deep neural networks satisfy both of these constraints. 00:06:40.720 |
This is the basis on which everything else resides. 00:07:30.920 |
but they are good enough to do interesting things. 00:07:38.200 |
is one where you need to maximize the expected reward. 00:07:45.800 |
in which the reinforcement learning framework 00:08:01.480 |
That's what the environment communicates back. 00:08:04.400 |
The way in which this is not the case in the real world 00:08:16.160 |
We are not told, the environment doesn't say, 00:08:24.800 |
And there is only one real true reward in life, 00:08:43.520 |
and you want the agent to map observations to actions. 00:08:47.320 |
So you let it be parameterized with a neural net, 00:08:57.240 |
that's actually being used in practice everywhere. 00:09:00.640 |
But it's also very robust and very simple. 00:09:08.800 |
This is literally the one-sentence description 00:09:27.800 |
if you find that the result exceeded your expectation, 00:09:36.520 |
This is the full idea of reinforcement learning. 00:09:41.160 |
and if you do, do more of that in the future. 00:09:53.080 |
In a regular neural network, 00:09:56.920 |
like this, you might say, okay, what's the goal? 00:09:59.440 |
You run the neural network, you get an answer. 00:10:04.960 |
And whatever difference you have between those two, 00:10:06.560 |
you send it back to change the neural network. 00:10:11.720 |
In reinforcement learning, you run a neural network, 00:10:19.400 |
your randomness turns into the desired target, in effect. 00:10:30.680 |
Without explaining what these equations mean, 00:10:39.760 |
There are two classes of reinforcement learning algorithms. 00:10:47.560 |
this expression right there, the sum of rewards, 00:10:55.480 |
You run, you do some algebra, and you get a derivative. 00:10:59.040 |
And miraculously, the derivative has exactly the form 00:11:09.320 |
and if you like them, increase the log probability 00:11:14.000 |
It's very nice when the intuitive explanation 00:11:20.160 |
even though you'll have to take my word for it 00:11:28.400 |
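The derivative being referred to is, in the standard formulation, the score-function (REINFORCE) policy gradient; a hedged reconstruction of its usual form:

```latex
\nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right]
  \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau)
        \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
```

which matches the intuition above: sample actions, and if the total reward R(τ) is higher than expected, increase the log-probability of the actions that were taken (in practice a baseline is subtracted from R(τ) to reduce variance).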
which is a little bit more difficult to explain. 00:11:32.720 |
They are a bit less stable, a bit more sample efficient, 00:11:42.320 |
not only from the data generated by the actor, 00:11:55.280 |
So yeah, this is the on-policy, off-policy distinction, 00:12:04.240 |
If you already know it, then you already know it. 00:12:07.160 |
So now what's the potential of reinforcement learning? 00:12:11.840 |
What is it actually, why should we be excited about it? 00:12:17.840 |
The reinforcement learning algorithms of today 00:12:22.800 |
and especially if you have a really good simulation 00:12:28.760 |
But what's really exciting is if you can build 00:12:43.720 |
in order to learn in the fastest way possible. 00:12:46.120 |
Now, today our algorithms are not particularly 00:12:52.840 |
But as our field keeps making progress, this will change. 00:12:56.000 |
Next, I want to dive into the topic of meta-learning. 00:13:05.480 |
is a beautiful idea that doesn't really work, 00:13:08.720 |
but it kind of works, and it's really promising too. 00:13:19.160 |
Perhaps we could use those learning algorithms 00:13:30.360 |
you train it not on one task, but on many tasks, 00:13:36.400 |
and you ask it to learn to solve these tasks quickly. 00:13:43.280 |
Here's what most traditional meta-learning looks like. 00:13:47.920 |
You have a model which is a big neural network. 00:13:53.960 |
instead of training cases, you have training tasks. 00:13:58.960 |
And instead of test cases, you have test tasks. 00:14:01.240 |
So your input may be, instead of just your current test case, 00:14:05.440 |
it would be all the information about the test tasks 00:14:14.840 |
So basically you say, yeah, I'm gonna give you 00:14:17.560 |
your 10 examples as part of your input to your model, 00:14:27.480 |
You turn the neural network into the learning algorithm 00:14:30.680 |
by turning a training task into a training case. 00:14:53.400 |
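A minimal sketch of "turning a training task into a training case" (a toy construction under assumed details, not code from the talk): each few-shot regression task is flattened into one supervised example whose input contains the task's labeled examples plus a query point.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sine_task(k_shot=10):
    """One 'task': a random sinusoid; returns k labeled examples plus a query."""
    amp, phase = rng.uniform(0.5, 5.0), rng.uniform(0, np.pi)
    xs = rng.uniform(-5, 5, size=k_shot + 1)
    ys = amp * np.sin(xs + phase)
    return xs[:-1], ys[:-1], xs[-1], ys[-1]

def task_to_training_case(k_shot=10):
    """Flatten a whole task into one (input, target) pair for an ordinary model."""
    sx, sy, qx, qy = sample_sine_task(k_shot)
    model_input = np.concatenate([sx, sy, [qx]])  # the labeled examples ride along in the input
    return model_input, qy                        # the model must adapt "in-context"

# A meta-training batch: ordinary-looking (x, y) pairs, but each row is a whole task.
batch = [task_to_training_case() for _ in range(32)]
X = np.stack([b[0] for b in batch])   # shape (32, 2*k_shot + 1)
Y = np.array([b[1] for b in batch])
print(X.shape, Y.shape)
```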
So there's been a data set produced at MIT by Lake et al. 00:15:02.680 |
We have a large number of different handwritten characters. 00:15:08.120 |
extremely strong meta-learning system for this task. 00:15:10.840 |
Another very successful example 00:15:14.200 |
of meta-learning is that of neural architecture search 00:15:23.360 |
that solved one problem well, a small problem. 00:15:27.000 |
and then it would successfully solve large problems as well. 00:15:29.240 |
So this is kind of the small number of bits meta-learning. 00:15:40.640 |
So this is the other way of doing meta-learning. 00:15:43.520 |
So anyway, the point is, 00:15:46.000 |
what's really happening in meta-learning in most cases 00:15:48.800 |
is that you turn a training task into a training case 00:15:53.240 |
and pretend that this is totally normal deep learning. 00:16:05.160 |
So now that I've finished the introduction section, 00:16:16.600 |
There's been a large effort by Marcin Andrychowicz 00:16:20.280 |
to develop a learning algorithm for reinforcement learning 00:16:34.640 |
And I wanna discuss one problem in reinforcement learning. 00:16:43.040 |
One really important thing you need to learn to do 00:16:56.440 |
So one very important thing that has to happen 00:16:58.240 |
is that you must get rewards from time to time. 00:17:01.120 |
If you try something and you don't get rewards, 00:17:06.920 |
So I'd say that's the kind of the crux of the problem. 00:17:12.400 |
And relatedly, is there any way to meaningfully benefit 00:17:17.400 |
from the experience, from your attempts, from your failures? 00:17:26.560 |
You say, instead of asking your algorithm 00:17:32.040 |
that can achieve a very large family of goals. 00:17:42.680 |
Anytime you do something, you achieve some state. 00:17:46.800 |
So let's suppose you say, I want to achieve state A. 00:17:50.000 |
I try my best and I end up achieving state B. 00:17:54.280 |
I can either conclude, well, that was disappointing, 00:18:04.040 |
But alternatively, I can say, well, wait a second, 00:18:06.760 |
I've just reached a perfectly good state, which is B. 00:18:17.840 |
And I just want to point out, this is the one case, 00:18:33.160 |
you are doing on-policy learning for reaching the state A, 00:19:07.520 |
to learn to master its environment in general, 00:19:10.680 |
you build a system which always learns something. 00:19:15.080 |
It learns from success as well as from failure. 00:19:27.880 |
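A minimal sketch of the hindsight relabeling idea (a toy reconstruction with assumed interfaces, not the OpenAI implementation): every attempt at goal A is also replayed as a successful attempt at whatever state was actually reached.

```python
import random

def her_relabel(episode, reward_fn):
    """episode: list of (state, action, next_state, intended_goal) transitions.
    Returns the original transitions plus copies relabeled with achieved goals."""
    replay = []
    for (s, a, s_next, goal) in episode:
        replay.append((s, a, s_next, goal, reward_fn(s_next, goal)))
        # Hindsight: pretend an actually-achieved state was the goal all along,
        # so even a "failed" episode yields transitions with positive reward.
        achieved = random.choice(episode)[2]   # simplified stand-in for HER's 'future' strategy
        replay.append((s, a, s_next, achieved, reward_fn(s_next, achieved)))
    return replay

# Toy usage: states and goals are integers; reward is 1 only on exact success.
episode = [(0, +1, 1, 5), (1, +1, 2, 5), (2, +1, 3, 5)]   # wanted state 5, reached 3
reward = lambda state, goal: 1.0 if state == goal else 0.0
for transition in her_relabel(episode, reward):
    print(transition)
```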
So one challenge in reinforcement learning systems 00:19:36.280 |
It means that at the beginning of the system, 00:19:43.840 |
And so it's important that you design your reward function 00:19:49.040 |
so that even when the system is not very good, 00:19:52.240 |
Now, if you give your system a very sparse reward 00:20:51.720 |
it works if you do it on a physical robot as well, 00:20:58.440 |
that the hindsight experience replay algorithm 00:21:02.880 |
because you want to make use of all your data 00:21:17.760 |
Because in the work that I've shown you so far, 00:21:21.760 |
the system is asked to achieve low-level states. 00:21:25.080 |
So I think one thing that will become very important 00:21:29.440 |
is representation learning and unsupervised learning. 00:21:35.640 |
what's the state space of goals that's worth achieving. 00:21:39.000 |
Now I want to go through some real meta-learning results, 00:22:02.000 |
So, I think we can agree that in the domain of robotics, 00:22:08.920 |
it would be nice if you could train your policy 00:22:12.080 |
in simulation, and then somehow this knowledge 00:22:26.200 |
but they can never perfectly match the real world 00:22:29.560 |
unless you want to have an insanely slow simulator. 00:22:40.240 |
and I heard somewhere, correct me if I'm wrong, 00:22:49.800 |
So your simulation is just not going to match reality. 00:23:11.400 |
that would quickly adapt itself to the real world. 00:23:16.400 |
Well, if you want to learn a policy that can quickly adapt, 00:23:20.300 |
we need to make sure that it has opportunities 00:23:25.520 |
Instead of solving our problem in just one simulator, 00:23:30.520 |
we add a huge amount of variability to the simulator. 00:23:49.760 |
And then importantly, you don't tell the policy 00:23:56.160 |
You take your policy and you put it in an environment 00:24:06.160 |
what the friction is as I get responses from the environment. 00:24:19.140 |
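A hedged structural sketch of the training loop being described (the toy dynamics and all names here are placeholders of mine): physical parameters such as friction are re-sampled every episode and hidden from the policy, so a policy with memory must infer them from how the environment responds.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_episode(friction, steps=50):
    """Toy 1-D pushing task; the policy never observes `friction` directly."""
    pos, vel, history = 0.0, 0.0, []
    for _ in range(steps):
        # A real policy with memory would condition on `history` (past observations
        # and actions) to infer the hidden friction; a fixed push stands in here.
        action = 1.0
        vel = vel + action - friction * vel   # the hidden parameter shapes the response
        pos = pos + vel
        history.append((pos, vel, action))
    return history

# Domain randomization: a new simulator is sampled for every episode,
# and the randomized parameters are never told to the policy.
for episode in range(5):
    friction = rng.uniform(0.05, 0.9)
    traj = run_episode(friction)
    print(f"episode {episode}: friction={friction:.2f}, final pos={traj[-1][0]:.1f}")
```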
This is what happens when you just train a policy 00:24:21.720 |
in simulation and deploy it on the physical robot. 00:24:24.920 |
And here the goal is to bring the hockey puck 00:24:39.360 |
the systematic differences between the simulator 00:24:47.340 |
So even the basic movement is difficult for the policy 00:24:51.080 |
because the assumptions are violated so much. 00:24:58.160 |
which learns to quickly infer properties of the simulator 00:25:04.560 |
You can then give it the real thing, the real physics, 00:25:22.820 |
You can see that it would push the hockey puck 00:25:31.340 |
So that was a cool application of meta-learning. 00:25:38.120 |
I want to discuss one more application of meta-learning 00:25:49.280 |
Actually, Kevin Frans, the engineer who did it, 00:26:04.380 |
is if reinforcement learning was hierarchical. 00:26:16.420 |
Maybe the term subroutine is a little bit too crude, 00:26:18.380 |
but if you had some idea of which action primitives 00:26:31.860 |
from hierarchical reinforcement learning yet. 00:26:43.100 |
what's the right way for reinforcement learning, 00:26:47.180 |
And I just want to show you one very simple approach 00:27:13.100 |
And your goal is to learn low-level primitives 00:27:23.460 |
a very brief run of some reinforcement learning algorithm, 00:27:33.540 |
you want to learn policies 00:27:40.100 |
that result in the greatest amount of progress possible 00:27:53.740 |
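Stated as an objective (a reconstruction of the idea, not a formula shown in the talk): the low-level primitives φ are chosen so that a short burst of reinforcement learning on top of them makes as much progress as possible across the task distribution,

```latex
\phi^{*} \;=\; \arg\max_{\phi}\;
  \mathbb{E}_{M \sim p(\mathcal{M})}
  \Big[\, R_{M}\!\big(\mathrm{RL}_{k}(\theta_{\text{master}};\, \phi,\, M)\big) \Big]
```

where RL_k denotes k steps of an ordinary RL algorithm that updates only a master policy θ_master choosing among the fixed primitives φ on task M.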
and in this case the little bug learned three policies 00:28:13.580 |
of what hierarchical reinforcement learning could look like 00:28:30.060 |
addressing the limitations of high-capacity meta-learning. 00:28:37.660 |
the training task distribution has to be equal 00:28:47.860 |
because in reality, the new task that you want to learn 00:29:04.200 |
only a fraction of the things that you've learned 00:29:09.280 |
You need to learn quite a few more things from scratch. 00:29:15.980 |
because it really assumes that the distribution 00:29:33.880 |
are outside of the distribution of the training task, 00:29:50.260 |
And I want to start by reviewing very old work 00:30:25.620 |
"let them play backgammon against each other, 00:30:33.980 |
And you would think this was a paper from 2017, 00:30:42.340 |
20 hidden units, 40 and 80 for the different colors, 00:30:47.000 |
where you notice that the largest neural network works best. 00:30:58.500 |
and they were able to discover new strategies 00:31:00.320 |
that the best human backgammon players have not noticed, 00:31:21.880 |
So, now other examples of self-play include AlphaGo Zero, 00:31:26.880 |
which was able to learn to beat the world champion in Go 00:32:11.160 |
What are we actually teaching the systems to do? 00:32:13.560 |
And one very attractive attribute of self-play 00:32:23.280 |
By virtue of the agent acting in the environment, 00:32:27.600 |
the environment becomes difficult for the other agents. 00:32:55.760 |
And this is what happens in effect in biological evolution. 00:33:05.800 |
There is a really cool video on YouTube by Karl Sims. 00:33:05.800 |
which really kind of shows all the work that he's done. 00:33:14.160 |
And here you have a little competition between agents 00:33:17.000 |
where you evolve both the behavior and their morphology 00:33:20.560 |
when the agent is trying to gain possession of a green cube. 00:33:38.120 |
and this is work by Bansal et al. from OpenAI, 00:33:38.120 |
can we demonstrate some unusual results in self-play 00:33:48.560 |
that would really convince us that there is something there? 00:33:52.400 |
So what we did here is that we created a small ring, 00:33:58.760 |
and their goal is just to push each other outside the ring. 00:34:01.680 |
And they don't know anything about wrestling. 00:34:07.840 |
They don't know anything about centers of gravity. 00:34:10.000 |
All they know is that if you don't do a good job, 00:34:13.040 |
then your competition is going to do a better job. 00:34:15.520 |
Now, one of the really attractive things about self-play 00:34:30.120 |
you need to sometimes win and sometimes lose. 00:34:41.960 |
Yeah, so the green humanoid was able to block the ball. 00:34:55.400 |
No matter how good you are or how bad you are, 00:35:11.200 |
and you start applying large, random forces on it, 00:35:11.200 |
And the answer turns out to be that yes, it can, 00:35:19.800 |
because it's been trained against an opponent 00:35:24.360 |
And so that's why, even if it doesn't understand 00:35:27.280 |
where the pressure force is being applied on it, 00:35:31.400 |
So this is one potentially attractive feature 00:35:35.920 |
that you could learn a certain broad set of skills, 00:35:43.640 |
And so the biggest open question with this research is, 00:35:46.640 |
how do you train agents in a self-play environment 00:35:54.560 |
but then they are able to solve a battery of tasks 00:36:08.440 |
of self-play environments that we've observed 00:36:12.080 |
and that is that we've seen a very rapid increase 00:36:19.000 |
we've seen the bot go from playing totally randomly 00:36:28.000 |
And the reason for that is that once you have 00:36:30.520 |
a self-play environment, if you put compute into it, 00:36:36.360 |
Self-play allows you to turn compute into data. 00:36:44.840 |
to be able to turn compute into, essentially, 00:36:48.680 |
simply because the speed of neural net processors 00:36:51.720 |
will increase very dramatically over the next few years. 00:36:56.600 |
and it will be important to make use of these 00:37:01.760 |
I also want to talk a little bit about the end game 00:37:07.280 |
So one thing that we know about the human brain 00:37:12.480 |
is that it has increased in size fairly rapidly 00:37:26.600 |
where the thing that's most important for your survival 00:37:40.560 |
then it really helps to have a slightly larger brain. 00:37:44.480 |
And there exists at least one paper from Science 00:37:50.160 |
So apparently there has been convergent evolution 00:38:00.800 |
even though the divergence in evolutionary time scale 00:38:04.840 |
between humans and birds occurred a very long time ago, 00:38:11.000 |
apes and birds have very different brain structure. 00:38:19.800 |
if we successfully follow the path of this approach, 00:38:40.560 |
of how do you make sure that the agents we learn 00:38:45.200 |
Now, I want to make a speculative digression here, 00:38:48.800 |
which is, I want to make the following observation. 00:38:57.080 |
If you believe that this kind of society of agents 00:39:07.000 |
where fully general intelligence will emerge, 00:39:12.080 |
and if you accept that our experience with the DotaBot, 00:39:16.640 |
where we've seen a very rapid increase in competence, 00:39:18.640 |
will carry over once all the details are right, 00:39:26.960 |
a very rapid increase in the competence of our agents 00:39:41.880 |
and teaching agents social skills and language, 00:39:45.640 |
and a lot of things that actually exist in humans as well, 00:39:55.600 |
And the question of conveying goals to agents 00:40:10.440 |
will eventually be dramatically smarter than us. 00:40:23.400 |
which basically explains how the whole thing works. 00:40:30.520 |
and you, the human, get to see pairs of behaviors. 00:40:34.680 |
And you simply click on the one that looks better. 00:40:44.120 |
you can get this little simulated leg to do backflips. 00:41:02.240 |
it took about 500 clicks by human annotators. 00:41:17.040 |
and not in terms of the environment interactions. 00:41:20.360 |
So what you do here is that you take all the clicks, 00:41:44.760 |
We've also been able to train lots of Atari games 00:41:49.560 |
So in all these cases, you had human annotators 00:41:53.000 |
or human judges, just like in the previous slide, 00:42:00.440 |
and clicking on the one that they thought was better. 00:42:13.800 |
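A minimal sketch of how such pairwise clicks are typically turned into a reward function, in the Bradley-Terry style used by this line of work (the code and feature setup are an illustration, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(4)                          # linear reward model over clip features

def reward(clip_features):
    return float(w @ clip_features)

def update(preferred, other, lr=0.1):
    """One gradient step on -log sigmoid(R(preferred) - R(other))."""
    global w
    diff = reward(preferred) - reward(other)
    p = 1.0 / (1.0 + np.exp(-diff))             # model's probability that the click was 'right'
    w -= lr * (p - 1.0) * (preferred - other)   # gradient of the negative log-likelihood

# Toy data: the annotator secretly prefers clips with a larger first feature.
for _ in range(500):
    a, b = rng.normal(size=4), rng.normal(size=4)
    preferred, other = (a, b) if a[0] > b[0] else (b, a)   # the human's click
    update(preferred, other)

print(np.round(w, 2))   # the weight on feature 0 should dominate
```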
to train the white car to drive right behind the orange car. 00:42:19.760 |
and it was very straightforward to communicate this goal 00:42:32.960 |
But of course, the determination of the correct goals 00:42:38.520 |
will be a very challenging political problem. 00:42:42.000 |
And on this note, I want to thank you so much 00:42:50.360 |
if you want to chat more about AI and other topics. 00:43:12.440 |
because signals in the brain go one direction 00:43:16.680 |
requires the errors to be propagated back up the wires. 00:43:26.760 |
as though the brain is doing something a bit different 00:43:37.360 |
even though it's got no obvious way of doing that? 00:43:44.360 |
So first of all, I'll say that the true answer 00:43:46.520 |
is that, the honest answer is that I don't know, 00:44:00.040 |
rather, it is a true fact that backpropagation 00:44:07.920 |
This problem feels like an extremely fundamental problem. 00:44:11.080 |
And for this reason, I think that it's unlikely to go away. 00:44:25.440 |
by Tim Lillicrap and others, where they've shown 00:44:32.280 |
a different set of connections that can be used 00:44:38.640 |
Now, the reason this hasn't been really pushed 00:44:41.440 |
to the limit by practitioners is because they say, 00:44:48.580 |
But you are right that this is an important issue, 00:44:53.400 |
So my personal opinion is that backpropagation 00:44:56.040 |
is just going to stay with us till the very end, 00:45:09.560 |
it is a difference that has to be acknowledged. 00:45:14.240 |
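The Lillicrap et al. result alluded to is feedback alignment: error signals travel back through a fixed random matrix rather than the transpose of the forward weights. A minimal NumPy sketch of the idea on a toy regression (an illustration, not the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data.
X = rng.normal(size=(256, 10))
Y = np.tanh(X @ rng.normal(size=(10, 1)))

W1 = rng.normal(scale=0.5, size=(10, 20))
W2 = rng.normal(scale=0.5, size=(20, 1))
B = rng.normal(scale=0.5, size=(1, 20))   # fixed random feedback weights (never trained)

lr = 0.01
for step in range(2000):
    h = np.tanh(X @ W1)
    out = h @ W2
    err = out - Y                          # output error
    # Feedback alignment: carry the error backward through B, not W2.T.
    d_h = (err @ B) * (1 - h ** 2)
    W2 -= lr * h.T @ err / len(X)
    W1 -= lr * X.T @ d_h / len(X)

print(float(np.mean((out - Y) ** 2)))      # the loss typically drops substantially
```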
Do you think it was a fair matchup for the Dota bot 00:45:18.160 |
and that person, given the constraints of the system? 00:45:21.680 |
- So I'd say that the biggest advantage computers have 00:45:26.000 |
in games like this, like one of the big advantages, 00:45:28.320 |
is that they obviously have a better reaction time. 00:45:40.580 |
So in StarCraft, StarCraft is a very mechanically heavy game 00:45:46.680 |
And so the top players, they just click all the time. 00:45:49.240 |
In Dota, every player controls just one hero, 00:46:15.380 |
- So do you think that the emergent behaviors 00:46:17.420 |
from the agent were actually kind of directed 00:46:20.460 |
because the constraints were already kind of in place? 00:46:22.620 |
Like, so it was kind of forced to discover those? 00:46:27.940 |
that, like, wow, it actually discovered these on its own? 00:46:33.540 |
- So it's definitely, we discover new strategies, 00:46:35.540 |
and I can share an anecdote where our tester, 00:46:46.500 |
against the player, the human player, which were effective. 00:47:00.140 |
by imitating it, he was able to defeat a better pro. 00:47:03.300 |
So I think the strategies that it discovers are real, 00:47:17.180 |
by the bot help the humans, it means that the, 00:47:19.180 |
like, the fundamental gameplay is deeply related. 00:47:21.620 |
- For a long time now, I've heard that the objective 00:47:25.820 |
of reinforcement learning is to determine a policy 00:47:30.140 |
that chooses an action to maximize the expected reward, 00:47:36.180 |
Would you ever wanna look at the standard deviation 00:47:47.500 |
One of the reasons to maximize the expected reward 00:47:50.900 |
is because it's easier to design algorithms for it. 00:47:54.740 |
So you write down this equation, the formula, 00:48:11.060 |
and you wanna work on the standard deviation as well, 00:48:32.820 |
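One standard way to fold the questioner's concern into the objective (an illustration, not something stated in the talk) is a mean-variance, risk-sensitive criterion,

```latex
J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau)\right]
          \;-\; \lambda\,\sqrt{\mathrm{Var}_{\tau \sim \pi_\theta}\!\left[R(\tau)\right]}
```

with λ ≥ 0 trading expected return against variability; such objectives are generally harder to optimize than the plain expectation, which is consistent with the answer given here.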
and that has a lot to do with the reinforcement, 00:48:50.340 |
to have the computers play these competitive games, 00:48:59.580 |
self-play collaboratively, in collaborative games? 00:49:03.740 |
- Yeah, I think that's an extremely good question. 00:49:06.240 |
I think one place from which we can get some inspiration 00:49:32.600 |
then cooperation will be the winning strategy, 00:49:50.240 |
I was wondering if you feel that there exists 00:49:52.660 |
open complexity theoretic problems relevant to AI, 00:49:56.940 |
or whether it's just a matter of finding good approximations 00:50:10.920 |
we know that whatever algorithm we're gonna run 00:50:13.940 |
is going to run fairly efficiently on some hardware, 00:50:20.720 |
on the true complexity of the problems we're solving. 00:50:25.980 |
which aren't too hard in a complexity-theoretic sense. 00:50:28.780 |
Now, it is also the case that many of the problems, 00:50:36.620 |
is not hard in a complexity-theoretic sense, 00:50:41.860 |
it is true that many of the optimization problems 00:50:50.500 |
starting from neural net optimization itself. 00:50:56.100 |
for a neural network with a very small number of neurons, 00:50:58.460 |
such that finding the global optimum is NP-complete. 00:51:13.980 |
we do not solve problems which are truly intractable. 00:51:17.420 |
So, I mean, I hope this answers the question. 00:51:24.700 |
on the path towards AGI will be understanding language, 00:51:28.500 |
and the state of generative language modeling right now 00:51:35.220 |
research trajectories towards generative language models? 00:51:38.320 |
- So, I'll first say that you are completely correct 00:51:41.900 |
that the situation with language is still far from great, 00:51:54.940 |
on larger data sets is going to go surprisingly far. 00:51:59.100 |
Not even larger data sets, but larger and deeper models. 00:52:04.060 |
with a thousand layers, and it's the same layer, 00:52:06.820 |
I think it's gonna be a pretty amazing language model. 00:52:20.220 |
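The "thousand layers, and it's the same layer" remark describes weight tying across depth; a minimal sketch of that architectural choice (toy NumPy of my own, not a description of any specific model):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))   # ONE layer's weights, reused at every depth
b = np.zeros(d)

def deep_tied_forward(x, depth=1000):
    """Apply the same residual layer `depth` times; parameter count does not grow with depth."""
    h = x
    for _ in range(depth):
        h = h + np.tanh(h @ W + b)      # the tanh bounds each step's increment
    return h

x = rng.normal(size=(2, d))
print(deep_tied_forward(x).shape)       # (2, 64), after 1000 reuses of the same layer
```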
in our current understanding of deep learning, 00:52:39.180 |
and then we stop training the model, and we freeze it. 00:52:47.460 |
Like, the magic is, like, if you think about it, 00:52:49.900 |
like, the training process is the true general part 00:52:53.820 |
of the whole story, because your TensorFlow code 00:52:59.420 |
It just says, "Whatever, just give me the data set. 00:53:03.860 |
So, like, the ability to do that feels really special, 00:53:07.900 |
and I think we are not using it at test time. 00:53:22.900 |
But also doing things like training at test time 00:53:25.660 |
will be another important boost to performance. 00:53:31.820 |
So it seems like right now another interesting approach 00:53:50.020 |
- So, like, at present, I believe that something 00:53:58.620 |
I think that normal reinforcement learning algorithms, 00:54:03.840 |
But I think if you want to evolve a small, compact object, 00:54:14.780 |
But this, you know, evolving a useful piece of code 00:54:21.560 |
so still a lot of work to be done before we get there. 00:54:26.740 |
My question is, you mentioned what is the right goal 00:54:34.340 |
and then also, what do you think would be the approach 00:54:50.540 |
I don't have enough of a super strong opinion 00:54:59.720 |
is given the size, like, if you go into the future, 00:55:02.500 |
whenever, soon or, you know, whenever it's gonna happen, 00:55:05.660 |
when you build a computer which can do anything better 00:55:09.420 |
than a human, it will happen, 'cause the brain is physical. 00:55:23.620 |
And I think what it means is that people will care a lot. 00:55:33.340 |
And, like, as the impact increases gradually, 00:55:48.620 |
in order to have these agents that can eventually 00:55:52.380 |
come out into the real world and do something 00:55:55.060 |
approaching, you know, human-level intelligence tasks? 00:56:00.860 |
So I think if that were the case, we'd be in trouble. 00:56:04.180 |
And I am very certain that it could be avoided. 00:56:10.500 |
So specifically, the real answer has to be that, look, 00:56:14.780 |
you learn to problem-solve, you learn to negotiate, 00:56:17.820 |
you learn to persist, you learn lots of different 00:56:27.260 |
because many of your deeply held assumptions will be false. 00:56:31.060 |
And one of the goals, so that's one of the reasons 00:56:33.860 |
I care so much about never stopping training. 00:56:40.940 |
of your assumptions are violated, you continue training. 00:56:42.940 |
You try to connect the new data to your old data. 00:56:45.020 |
And this is an important requirement from our algorithms, 00:56:53.140 |
that you've acquired and go in a new situation, 00:57:00.580 |
you learn useful things, then you go to work. 00:57:12.900 |
but there will be lots of new things you need to learn. 00:57:18.900 |
- One of the things you mentioned pretty early on 00:57:22.700 |
of this sort of style of reinforcement learning 00:57:27.140 |
So you have to tell it when it did a good thing 00:57:29.860 |
And that's actually a problem in neuroscience 00:57:37.900 |
when we already have this problem with teaching, 00:57:41.940 |
So where do you see the research moving forward 00:57:51.220 |
is to be able to infer the goals and strategies 00:57:57.540 |
That's a fundamental skill you need to be able to learn, 00:58:06.180 |
and the other agent says, "Well, that's really cool. 00:58:10.700 |
And so I'd say that this is a very important component 00:58:27.020 |
this was one of the important ways in which humans 00:58:39.060 |
in which we copy the behavior of other humans. 00:58:52.860 |
and I see someone doing a problem a particular way 00:58:57.620 |
How does that work in a sort of non-competitive environment? 00:59:02.100 |
I think that's going to be a little bit separate 00:59:10.260 |
probably baked in, maybe evolved into the system, 00:59:23.660 |
of the data that you see is to infer the goal of the agent, 00:59:29.780 |
That's important also for communicating with them. 00:59:32.460 |
If you want to successfully communicate with someone, 00:59:35.820 |
and of their belief state instead of knowledge. 00:59:37.780 |
So I think you will find that there are many, 00:59:43.900 |
what other agents are doing, inferring their goals, 00:59:46.140 |
imitating them, and successfully communicating with them. 00:59:49.180 |
- All right, let's give Ilya and the happy hour a big hand.