
Ilya Sutskever: OpenAI Meta-Learning and Self-Play | MIT Artificial General Intelligence (AGI)


Chapters

0:00 Introduction
0:55 Talk
43:04 Q&A

Transcript

Welcome back to 6.S099, Artificial General Intelligence. Today we have Ilya Sutskever, co-founder and research director of OpenAI. He started in the ML group in Toronto with Geoffrey Hinton, then was at Stanford with Andrew Ng, co-founded DNN Research, spent three years as a research scientist at Google Brain, and finally co-founded OpenAI.

Citations aren't everything, but they do indicate impact. And his work, recent work, in the past five years has been cited over 46,000 times. He has been the key creative intellect and driver behind some of the biggest breakthrough ideas in deep learning and artificial intelligence ever. So please welcome Ilya.

(audience applauding) - All right, thanks for the introduction, Lex. All right, thanks for coming to my talk. I will tell you about some work we've done over the past year on meta-learning and self-play at OpenAI. And before I dive into some of the more technical details of the work, I want to spend a little bit of time talking about deep learning and why it works at all in the first place.

Which, I think, is actually not a self-evident thing, that it should work. One fact, and it is actually a fact, a mathematical theorem that you can prove, is that if you could find the shortest program that does very well on your data, then you will achieve the best generalization possible.

With a little bit of modification, you can turn this into a precise theorem. And on a very intuitive level, it's easy to see why it should be the case. If you have some data, and you're able to find a shorter program which generates this data, then you've essentially extracted all conceivable regularity from this data into your program.

And then you can use this object to make the best predictions possible. If you have data which is so complex, but there is no way to express it as a shorter program, then it means that your data is totally random. There is no way to extract any regularity from it whatsoever.

Now, there is some known mathematical theory behind this, and the proofs of these statements are actually not even that hard. But the one slight disappointment is that it's actually not possible, at least given today's tools and understanding, to find the best short program that explains or generates or solves your problem given your data.

This problem is computationally intractable. The space of all programs is a very nasty space. Small changes to your program result in massive changes to the behavior of the program, as it should be. It makes sense. You have a loop. You change the inside of the loop. Of course, you get something totally different.

So the space of programs is so hard, at least given what we know today, search there seems to be completely off the table. Well, if we give up on short programs, what about small circuits? Well, it turns out that we are lucky. It turns out that when it comes to small circuits, you can just find the best small circuit that solves your problem using backpropagation.

And this is the miraculous fact on which the rest of AI stands. It is the fact that when you have a circuit and you impose constraints on your circuit using data, you can find a way to satisfy these constraints using backprop by iteratively making small changes to the weights of your neural network until its predictions satisfy the data.

What this means is that the computational problem that's solved by backpropagation is extremely profound. It is circuit search. Now, we know that you can't solve it always, but you can solve it sometimes. And you can solve it at those times where we have a practical data set. It is easy to design artificial data sets for which you cannot find the best neural network.

But in practice, that seems to be not a problem. You can think of training a neural network as solving a system of equations, in many cases, where you have a large number of equation terms like this: f(x_i, θ) = y_i. So you've got your parameters, and they represent all your degrees of freedom.

And you use gradient descent to push the information from these equations into the parameters to satisfy them all. And you can see that the neural network, let's say one with 50 layers, is basically a parallel computer that is given 50 time steps to run. And you can do quite a lot with 50 time steps of a very, very powerful, massively parallel computer.
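To make this "equation solving" picture concrete, here is a minimal numpy sketch, not from the talk itself: a tiny two-layer network whose weights are nudged by gradient descent until the constraints f(x_i, θ) ≈ y_i are approximately satisfied. The data, layer sizes, and learning rate are all illustrative assumptions.

```python
import numpy as np

# Illustrative only: treat training as pushing the constraints
# f(x_i, theta) ~= y_i into the parameters with gradient descent.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))             # inputs x_i
y = np.sin(X.sum(axis=1, keepdims=True))  # targets y_i with some regularity

W1 = rng.normal(scale=0.5, size=(4, 32)); b1 = np.zeros(32)
W2 = rng.normal(scale=0.5, size=(32, 1)); b2 = np.zeros(1)
lr = 0.05

for step in range(2000):
    h = np.tanh(X @ W1 + b1)      # first "parallel time step" of the circuit
    pred = h @ W2 + b2            # second time step: the prediction
    err = pred - y                # how badly each equation is violated
    # Backprop: small weight changes that reduce the violation.
    dW2 = h.T @ err / len(X); db2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1.0 - h ** 2)
    dW1 = X.T @ dh / len(X); db1 = dh.mean(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("mean squared violation:", float((err ** 2).mean()))
```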

So for example, I think it is not widely known that you can learn to sort n n-bit numbers using a modestly sized neural network with just two hidden layers, which is not bad. It's not self-evident, especially since we've been taught that sorting requires log n parallel steps. With a neural network, you can sort successfully using only two parallel steps.

So there's something slightly unobvious going on. Now, these are parallel steps of threshold neurons, so they're doing a little bit more work. That's the answer to the mystery. But if you've got 50 such layers, you can do quite a bit of logic, quite a bit of reasoning, all inside the neural network.

And that's why it works. Given the data, we are able to find the best neural network. And because the neural network is deep, because it can run computation inside of its layers, the best neural network is worth finding. 'Cause that's really what you need. You need something, you need a model class, which is worth optimizing.

But it also needs to be optimizable. And deep neural networks satisfy both of these constraints. And this is why everything works. This is the basis on which everything else resides. Now I want to talk a little bit about reinforcement learning. So reinforcement learning is a framework. It's a framework of evaluating agents in their ability to achieve goals in complicated stochastic environments.

You've got an agent, which is plugged into an environment, as shown in the figure right here. And for any given agent, you can simply run it many times and compute its average reward. Now, the thing that's interesting about the reinforcement learning framework is that there exist interesting, useful reinforcement learning algorithms.

The framework existed for a long time. It became interesting once we realized that good algorithms exist. Now, these are not perfect algorithms, but they are good enough to do interesting things. And all you want, the mathematical problem, is one where you need to maximize the expected reward. Now, one important way in which the reinforcement learning framework is not quite complete is that it assumes that the reward is given by the environment.

You see this picture. The agent sends an action, and the environment sends back both the observation and the reward. That's what the environment communicates to the agent. The way in which this is not the case in the real world is that we figure out what the reward is from the observation.

We reward ourselves. We are not told, the environment doesn't say, "Hey, here's some negative reward." It's our interpretation of our senses that lets us determine what the reward is. And there is only one real true reward in life, and this is existence or nonexistence. And everything else is a corollary of that.

So, well, what should our agent be? You already know the answer. It should be a neural network. Because whenever you want to do something, the answer is going to be a neural network, and you want the agent to map observations to actions. So you let it be parameterized with a neural net, and you apply a learning algorithm.

So I want to explain to you how reinforcement learning works. This is model-free reinforcement learning, the reinforcement learning that's actually being used in practice everywhere. It's very robust and very simple. It's also not very efficient. So the way it works is the following. This is literally the one-sentence description of what happens.

In short, try something new. Add randomness to your actions. And compare the result to your expectation. If the result surprises you, if you find that the result exceeded your expectation, then change your parameters to take those actions in the future. That's it. This is the full idea of reinforcement learning.

Try it out, see if you like it, and if you do, do more of that in the future. And that's it. That's literally it. This is the core idea. Now, it turns out it's not difficult to formalize mathematically. But this is really what's going on. If in a neural network, in a regular neural network, like this, you might say, okay, what's the goal?

You run the neural network, you get an answer. You compare it to the desired answer. And whatever difference you have between those two, you send it back to change the neural network. That's supervised learning. In reinforcement learning, you run a neural network, you add a bit of randomness to your action, and then if you like the result, your randomness turns into the desired target, in effect.
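As an aside, here is a minimal sketch (not from the talk) of that "add randomness, keep what exceeded expectations" recipe, written as a vanilla policy-gradient update on a made-up four-action problem; the payoffs, learning rate, and baseline are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4
logits = np.zeros(n_actions)                   # policy parameters
true_payoff = np.array([0.1, 0.3, 0.9, 0.2])   # hidden payoffs of a toy environment
lr, baseline = 0.5, 0.0

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(2000):
    probs = softmax(logits)
    a = rng.choice(n_actions, p=probs)           # try something, with randomness
    r = true_payoff[a] + rng.normal(scale=0.1)   # see how it went
    baseline = 0.99 * baseline + 0.01 * r        # running estimate of expectation
    advantage = r - baseline                     # did the result exceed expectations?
    grad_logp = -probs
    grad_logp[a] += 1.0                          # gradient of log pi(a) w.r.t. logits
    logits += lr * advantage * grad_logp         # do more of what surprised us positively

print("learned action probabilities:", softmax(logits).round(3))
```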

So that's it. Trivial. Now, math exists. Without explaining what these equations mean, the point is not really to derive them, but just to show that they exist. There are two classes of reinforcement learning algorithms. One of them is the policy gradient, where basically what you do is that you take this expression right there, the sum of rewards, and you just crunch through the derivatives.

You expand the terms. You run, you do some algebra, and you get a derivative. And miraculously, the derivative has exactly the form that I told you, which is try some actions, and if you like them, increase the log probability of the actions. That literally follows from the math. It's very nice when the intuitive explanation has a one-to-one correspondence to what you get in the equation, even though you'll have to take my word for it if you're not familiar with it.

That's the equation at the top. Then there is a different class of reinforcement learning algorithms, which is a little bit more difficult to explain. These are the Q-learning-based algorithms. They are a bit less stable, a bit more sample efficient, and they have the property that they can learn not only from the data generated by the actor, but from any other data as well.

So it has a different robustness profile, which would be a little bit important, but it's only gonna be a technicality. So yeah, this is the on-policy, off-policy distinction, but it's a little bit technical, so if you find this hard to understand, don't worry about it. If you already know it, then you already know it.
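For readers who want the off-policy point made concrete, here is a minimal textbook-style tabular Q-learning update, offered as an illustration rather than as the algorithm used in the work discussed later: because the target bootstraps from the best next action rather than the action actually taken, transitions gathered by any behavior policy can be reused.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, lr=0.1, gamma=0.99):
    """One tabular Q-learning step on a Q[s, a] array.

    The target bootstraps from max over next actions, not from the action a
    later policy actually takes, so (s, a, r, s_next) can come from any
    behavior policy: that is the off-policy property mentioned above.
    """
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += lr * (target - Q[s, a])
    return Q

# Usage on a made-up 5-state, 2-action problem:
Q = np.zeros((5, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=3, done=False)
print(Q[0, 1])  # the estimate for (state 0, action 1) has moved toward the target
```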

So now what's the potential of reinforcement learning? What's the promise? What is it actually, why should we be excited about it? Now, there are two reasons. The reinforcement learning algorithms of today are already useful and interesting, and especially if you have a really good simulation of your world, you could train agents to do lots of interesting things.

But what's really exciting is if you can build a super amazing sample efficient reinforcement learning algorithm. We just give it a tiny amount of data, and the algorithm just crunches through it and extracts every bit of entropy out of it in order to learn in the fastest way possible.

Now, today our algorithms are not particularly data efficient, they are data inefficient. But as our field keeps making progress, this will change. Next, I want to dive into the topic of meta-learning. The goal of meta-learning, so meta-learning is a beautiful idea that doesn't really work, but it kind of works, and it's really promising too.

It's another promising idea. So what's the dream? We have some learning algorithms. Perhaps we could use those learning algorithms in order to learn to learn. That'd be nice if we could learn to learn. So how would you do that? You would take a system which, you train it not on one task, but on many tasks, and you ask it if it learns to solve these tasks quickly.

And that may actually be enough. So here's what it looks like. Here's what most traditional meta-learning looks like. You have a model which is a big neural network. But instead of training cases, you have training tasks. And instead of test cases, you have test tasks.

So your input, instead of just your current test case, would be all the information about the test task plus the test case, and you try to output the prediction or action for that test case. So basically you say, yeah, I'm gonna give you your 10 examples as part of your input to your model, figure out how to make the best use of them.
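Here is a minimal sketch of that construction, assuming a made-up array layout rather than the exact format of any particular paper: the task's labelled examples and the query are packed into a single input, so an ordinary supervised model is forced to act as the learning algorithm.

```python
import numpy as np

def make_meta_example(support_x, support_y, query_x):
    """Pack one task's labelled examples plus a query into a single model input.

    support_x: (k, d) examples for this task, support_y: (k,) labels,
    query_x: (d,) the case to be predicted. The model is then trained,
    across many tasks, to output the query's label from this whole package.
    """
    labelled = np.concatenate(
        [support_x, support_y[:, None].astype(support_x.dtype)], axis=1)  # (k, d+1)
    return np.concatenate([labelled.ravel(), query_x])  # one flat input vector

# A made-up 5-way example with 8-dimensional features:
rng = np.random.default_rng(0)
x = make_meta_example(rng.normal(size=(5, 8)), np.arange(5), rng.normal(size=8))
print(x.shape)  # (53,): 5 * (8 + 1) labelled examples plus the 8-dim query
```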

It's a really straightforward idea. You turn the neural network into the learning algorithm by turning a training task into a training case. So training task equals training case. This is meta-learning. This one sentence. And so there've been several success stories which I think are very interesting. One of the success stories of meta-learning is learning to recognize characters quickly.

So there's been a data set produced at MIT by Lake et al. And this is a data set with a large number of different handwritten characters. And people have been able to train extremely strong meta-learning systems for this task. Another very successful example of meta-learning is that of neural architecture search by Zoph and Le from Google, where they found a neural architecture that solved one problem well, a small problem.

And then it would generalize, and then it would successfully solve large problems as well. So this is kind of the small number of bits meta-learning. It's like when you learn the architecture, or maybe even learn a program, a small program or learning algorithm, which you apply to new tasks.

So this is the other way of doing meta-learning. So anyway, but the point is, what's happening, what's really happening in meta-learning in most cases is that you turn a training task into a training case and pretend that this is totally normal deep learning. That's it. This is the entirety of meta-learning.

Everything else is just minor details. Next, I wanna dive in. So now that I've finished the introduction section, I want to start discussing different work by different people from OpenAI. And I wanna start by talking about hindsight experience replay. There's been a large effort led by Marcin Andrychowicz to develop a learning algorithm for reinforcement learning that doesn't solve just one task, but solves many tasks, and learns to make use of its experience in a much more efficient way.

And I wanna discuss one problem in reinforcement learning. It's actually, I guess, a set of problems which are related to each other. One really important thing you need to learn to do is to explore. You start out in an environment, you don't know what to do. What do you do?

So one very important thing that has to happen is that you must get rewards from time to time. If you try something and you don't get rewards, then how can you learn? So I'd say that's the kind of the crux of the problem. How do you learn? And relatedly, is there any way to meaningfully benefit from the experience, from your attempts, from your failures?

If you try to achieve a goal and you fail, can you still learn from it? The idea is, instead of asking your algorithm to achieve a single goal, you want to learn a policy that can achieve a very large family of goals. For example, instead of reaching one state, you want to learn a policy that reaches every state of your system.

Now what's the implication? Anytime you do something, you achieve some state. So let's suppose you say, I want to achieve state A. I try my best and I end up achieving state B. I can either conclude, well, that was disappointing, I haven't learned almost anything. I still have no idea how to achieve state A.

But alternatively, I can say, well, wait a second, I've just reached a perfectly good state, which is B. Can I learn how to achieve state B from my attempt to achieve state A? And the answer is yes, you can. And it just works. And I just want to point out, there's a small subtlety here, which may be interesting to those of you who are very familiar with the distinction between on-policy and off-policy.

When you try to achieve A, you are doing on-policy learning for reaching the state A, but you're doing off-policy learning for reaching the state B, because you would take different actions if you would actually try to reach state B. So that's why it's very important that the algorithm you use here can support off-policy learning.

But that's a minor technicality. At the crux of the idea is, you make the problem easier by ostensibly making it harder. By training a system which aspires to reach, to learn to reach every state, to learn to achieve every goal, to learn to master its environment in general, you build a system which always learns something.
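A minimal sketch of that relabelling trick, under simplifying assumptions (relabel with the final achieved state only, binary reward), might look like this; the data layout and reward convention are illustrative, not taken from the paper.

```python
import numpy as np

def relabel_with_hindsight(episode, replay_buffer):
    """episode: list of (state, action, achieved_state) from one (failed) attempt.

    Simplified hindsight relabelling: pretend the state we actually reached at
    the end was the goal all along, so the trajectory becomes a success that an
    off-policy algorithm can learn from.
    """
    hindsight_goal = episode[-1][2]
    for state, action, achieved_state in episode:
        reward = float(np.allclose(achieved_state, hindsight_goal))
        replay_buffer.append((state, action, hindsight_goal, reward))

# Usage with made-up 2-D states:
buffer = []
episode = [(np.zeros(2), 0, np.array([0.1, 0.0])),
           (np.array([0.1, 0.0]), 1, np.array([0.3, 0.2]))]
relabel_with_hindsight(episode, buffer)
print(len(buffer), buffer[-1][-1])  # 2 relabelled transitions; the last has reward 1.0
```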

It learns from success as well as from failure. Because if it tries to do one thing and it does something else, it now has training data for how to achieve that something else. I want to show you a video of how this thing works in practice. So one challenge in reinforcement learning systems is the need to shape the reward.

So what does it mean? It means that at the beginning of the system, at the start of learning, when the system doesn't know much, it will probably not achieve your goal. And so it's important that you design your reward function to give it gradual increments, to make it smooth and continuous so that even when the system is not very good, it achieves the goal.

Now, if you give your system a very sparse reward where the reward is achieved only when you reach a final state, then it becomes very hard for normal reinforcement learning algorithms to solve a problem, because naturally, you never get the reward, so you never learn. No reward means no learning.

But here, because you learn from failure as well as from success, this problem simply doesn't occur. And so this is nice. Let's look at the video a little bit more. It's nice how it confidently and energetically moves the little green puck to its target.

And here's another one. (silence) Okay, so we can skip this one. It works if you do it on a physical robot as well, but we can skip it. So, I think the point is that the hindsight experience replay algorithm is directionally correct, because you want to make use of all your data and not only a small fraction of it.

Now, one huge question is, where do you get the high-level states? Where do the high-level states come from? Because in the work that I've shown you so far, the system is asked to achieve low-level states. So I think one thing that will become very important for these kind of approaches is representation learning and unsupervised learning.

Figure out what are the right states, what's the state space of goals that's worth achieving. Now I want to go through some real meta-learning results, and I'll show you a very simple way of doing sim-to-real, from simulation to the physical robot, with meta-learning. And this is work by Peng et al.

It was a really nice intern project in 2017. So, I think we can agree that in the domain of robotics, it would be nice if you could train your policy in simulation, and then somehow this knowledge would carry over to the physical robot. Now, we can build simulators that are okay, but they can never perfectly match the real world unless you want to have an insanely slow simulator.

And the reason for that is that it turns out that simulating contacts is super hard, and I heard somewhere, correct me if I'm wrong, that simulating friction is NP-complete. I'm not sure, but it's like stuff like that. So your simulation is just not going to match reality. There'll be some resemblance, but that's it.

How can we address this problem? And I want to show you one simple idea. So let's say, one thing that would be nice is that if you could learn a policy that would quickly adapt itself to the real world. Well, if you want to learn a policy that can quickly adapt, we need to make sure that it has opportunities to adapt during training time.

So what do we do? Instead of solving our problem in just one simulator, we add a huge amount of variability to the simulator. We say, we will randomize the frictions, we will randomize the masses, the length of the different objects and their, I guess, dimensions. So you try to randomize physics, the simulator, in lots of different ways.
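A minimal sketch of that randomization step might look like the following; the parameter names and ranges are made up for illustration, and the environment factory is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_randomized_physics():
    """Draw a fresh set of simulator parameters for one training episode.

    The names and ranges are illustrative guesses; the essential point is only
    that the policy never observes them and must infer them from interaction.
    """
    return {
        "friction":     rng.uniform(0.5, 1.5),
        "puck_mass":    rng.uniform(0.05, 0.3),   # kg
        "arm_damping":  rng.uniform(0.8, 1.2),
        "action_delay": int(rng.integers(0, 3)),  # simulation steps
    }

for episode in range(3):
    physics = sample_randomized_physics()
    # env = make_simulated_env(**physics)  # hypothetical environment factory
    print(episode, physics)
```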

And then importantly, you don't tell the policy how you randomized it. So what is it going to do then? You take your policy and you put it in an environment and it says, well, this is really tough. I don't know what the masses are and I don't know what the frictions are.

I need to try things out and figure out what the friction is as I get responses from the environment. So you build it, you learn a certain degree of adaptability into the policy. And it actually works. I just want to show you. This is what happens when you just train a policy in simulation and deploy it on the physical robot.

And here the goal is to bring the hockey puck towards the red dot. And you will see that it will struggle. And the reason it struggles is because of the systematic differences between the simulator and the real physical robot. So even the basic movement is difficult for the policy because the assumptions are violated so much.

So if you do the training as I discussed, we train a recurrent neural network policy which learns to quickly infer properties of the simulator in order to accomplish the task. You can then give it the real thing, the real physics, and it will do much better. So now this is not a perfect technique, but it's definitely very promising.

It's promising whenever you are able to sufficiently randomize the simulator. So it's definitely very nice to see the closed loop nature of the policy. You can see that it would push the hockey puck and it would correct it very, very gently to bring it to the goal. Yeah, you saw that?

That was cool. So that was a cool application of meta-learning. I want to discuss one more application of meta-learning, which is learning a hierarchy of actions. And this was work done by Frans et al. Actually, Kevin Frans, the engineer who did it, was in high school when he wrote this paper.

So, one thing that would be nice is if reinforcement learning was hierarchical. If instead of simply taking micro-actions, you had some kind of little subroutines that you could deploy. Maybe the term subroutine is a little bit too crude, but if you had some idea of which action primitives are worth starting with.

Now, no one has been able to get a real value add from hierarchical reinforcement learning yet. So far, all the really cool results, all the really convincing results of reinforcement learning, do not use it. That's because we haven't quite figured out the right way to do hierarchical reinforcement learning.

And I just want to show you one very simple approach where you use meta-learning to learn a hierarchy of actions. So here's what you do. You have, in this specific work, you have a certain, let's say you have a certain number of low-level primitives. Let's say you have 10 of them.

And you have a distribution of tasks. And your goal is to learn low-level primitives such that when they're used inside a very brief run of some reinforcement learning algorithm, you will make as much progress as possible. So the idea is that you want to learn primitives that result in the greatest amount of progress possible when used inside learning.

So this is a meta-learning setup because you need a distribution of tasks. And here we had a little maze. You have a distribution of mazes, and in this case the little bug learned three policies which move it in a fixed direction. And as a result of having this hierarchy, you're able to solve problems really fast, but only when the hierarchy is correct.

So hierarchical reinforcement learning is still a work in progress. And this work is an interesting proof point of what hierarchical reinforcement learning could be like if it worked. Now, I want to just spend one slide addressing the limitations of high-capacity meta-learning. The specific limitation is that the training task distribution has to be equal to the test task distribution.

And I think this is a real limitation because in reality, the new task that you want to learn will in some ways be fundamentally different from anything you've seen so far. So for example, if you go to school, you learn lots of useful things, but then when you go to work, only a fraction of the things that you've learned carries over.

You need to learn quite a few more things from scratch. So meta-learning would struggle with that, because it really assumes that the distribution over the training tasks has to be equal to the distribution over the test tasks. That's a limitation. I think that as we develop better algorithms for being robust when the test tasks are outside of the distribution of the training tasks, meta-learning will work much better.

Now, I want to talk about self-play. I think self-play is a very cool topic that's starting to get attention only now. And I want to start by reviewing very old work called TD-Gammon. It's from all the way back in 1992, so it's 26 years old now. It was done by Gerald Tesauro.

So this work is really incredible because it has so much relevance today. What they did basically, they said, "Okay, let's take two neural networks "and let them play against each other, "let them play backgammon against each other, "and let them be trained with Q-learning." So it's a super modern approach.

And you would think this was a paper from 2017, except that when you look at the plot, it shows 10 hidden units, 20 hidden units, 40 and 80 for the different colors, and you notice that the largest neural network works best. So in some ways, not much has changed, and this is the evidence.

And in fact, they were able to beat the world champion in backgammon, and they were able to discover new strategies that the best human backgammon players had not noticed, and they determined that the strategies discovered by TD-Gammon are actually better. So that's pure self-play with Q-learning, which remained dormant until the DQN work with Atari by DeepMind.

So, now, other examples of self-play include AlphaGo Zero, which was able to learn to beat the world champion in Go without using any external data whatsoever. Another result in this vein is by OpenAI, which is our Dota 2 bot, which was able to beat the world champion in the 1v1 version of the game.
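The generic loop these systems share can be sketched in a few lines; the toy game below (rock-paper-scissors with a pool of past snapshots as opponents) is purely an illustration of the structure, not how any of these systems were actually trained.

```python
import numpy as np

rng = np.random.default_rng(0)
PAYOFF = np.array([[ 0, -1,  1],    # rock-paper-scissors payoff for the learner
                   [ 1,  0, -1],
                   [-1,  1,  0]])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = rng.normal(scale=0.1, size=3)   # the agent's policy parameters
opponent_pool = [logits.copy()]          # past snapshots of itself act as opponents
lr = 0.1

for step in range(5000):
    probs = softmax(logits)
    opp = opponent_pool[rng.integers(len(opponent_pool))]
    a = rng.choice(3, p=probs)               # our move
    b = rng.choice(3, p=softmax(opp))        # a move from an opponent at our own level
    reward = PAYOFF[a, b]
    grad_logp = -probs
    grad_logp[a] += 1.0
    logits += lr * reward * grad_logp        # simple policy-gradient update
    if step % 250 == 0:
        opponent_pool.append(logits.copy())  # periodically refresh the opponent pool

print("self-play policy:", softmax(logits).round(3))  # drifts toward mixed play
```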

And so I want to spend a little bit of time talking about the allure of self-play and why I think it's exciting. So, one important problem that we must face as we try to build truly intelligent systems is what is the task? What are we actually teaching the systems to do?

And one very attractive attribute of self-play is that the agents create the environment. By virtue of the agent acting in the environment, the environment becomes difficult for the other agents. And you can see here an example of an iguana interacting with snakes that try to eat it unsuccessfully this time, so we can see what will happen in a moment.

The iguana is trying its best. And so the fact that you have this arms race between the snakes and the iguana motivates their development, potentially without bound. And this is what happens in effect in biological evolution. Now, interesting work in this direction was done in 1994 by Karl Sims.

There is a really cool video on YouTube by Karl Sims. You should check it out; it really shows all the work that he's done. And here you have a little competition between agents, where you evolve both their behavior and their morphology, and the agents are trying to gain possession of a green cube.

And so you can see that the agents create the challenge for each other, and that's why they need to develop. So one thing that we did, and this is work by Bansal et al. from OpenAI, is we said, okay, well, can we demonstrate some unusual results in self-play that would really convince us that there is something there?

So what we did here is that we created a small ring, and you have these two humanoid figures, and their goal is just to push each other outside the ring. And they don't know anything about wrestling. They don't know anything about standing or balancing each other. They don't know anything about centers of gravity.

All they know is that if you don't do a good job, then your competition is going to do a better job. Now, one of the really attractive things about self-play is that you always have an opponent that's roughly as good as you are. In order to learn, you need to sometimes win and sometimes lose.

Like, you can't always win. Sometimes you must fail. Sometimes you must succeed. So let's see what will happen here. Yeah, so the green humanoid was able to block the ball. In a well-balanced self-play environment, the competition is always level. No matter how good you are or how bad you are, you have a competition that makes it exactly the right challenge for you.

Oh, and one thing here. So this video shows transfer learning. You take the little wrestling humanoid, and you take its friend away, and you start applying big, large, random forces on it, and you see if it can maintain its balance. And the answer turns out to be that yes, it can, because it's been trained against an opponent that pushes it.

And so that's why, even if it doesn't understand where the pressure force is being applied on it, it's still able to balance itself. So this is one potentially attractive feature of self-play environments, that you could learn a certain broad set of skills, although it's a little hard to control what the skills will be.

And so the biggest open question with this research is, how do you train agents in a self-play environment such that they do whatever they do, but then they are able to solve a battery of tasks that is useful for us, that is explicitly specified externally? Yeah. I also want to highlight one attribute of self-play environments that we've observed in our Dota bot, and that is that we've seen a very rapid increase in the competence of the bot.

So over the course of maybe five months, we've seen the bot go from playing totally randomly all the way to the world champion. And the reason for that is that once you have a self-play environment, if you put compute into it, you turn it into data. Self-play allows you to turn compute into data.

And I think we will see a lot more of that: being able to turn compute into, essentially, data or generalization will be an extremely important thing, simply because the speed of neural net processors will increase very dramatically over the next few years. So neural net cycles will be cheap, and it will be important to make use of this newly found overabundance of cycles.

I also want to talk a little bit about the end game of the self-play approach. So one thing that we know about the human brain is that it has increased in size fairly rapidly over the past two million years. My theory, the reason I think it happened, is because our ancestors got to a point where the thing that's most important for your survival is your standing in the tribe, and less the tiger and the lion.

Once the most important thing is how you deal with those other things which have a large brain, then it really helps to have a slightly larger brain. And I think that's what happened. And there exists at least one paper from Science which supports this point of view. Apparently there has been convergent evolution, in terms of various behaviors, between social apes and social birds, even though the divergence in evolutionary time between apes and birds occurred a very long time ago, and apes and birds have very different brain structures.

So I think what should happen if we succeed, if we successfully follow the path of this approach, is that we should create a society of agents which will have language and theory of mind, negotiation, social skills, trade, economy, politics, justice system. All these things should happen inside the multi-agent environment.

And there will also be the alignment issue of how to make sure that the agents we train behave in a way that we want. Now, I want to make a speculative digression here, which is the following observation. If you believe that this kind of society of agents is a plausible place where fully general intelligence will emerge, and if you accept that our experience with the Dota bot, where we've seen a very rapid increase in competence, will carry over once all the details are right, if you assume both of these conditions, then it should follow that we should see a very rapid increase in the competence of our agents as they live in the society of agents.

So now that we've talked about a potentially interesting way of increasing the competence and teaching agents social skills and language, and a lot of things that actually exist in humans as well, we want to talk a little bit about how you convey goals to agents. And the question of conveying goals to agents is just a technical problem, but it will be important because it is more likely than not that the agents that we will train will eventually be dramatically smarter than us.

And this is work by the OpenAI safety team, by Paul Christiano and others. So I'm just going to show you this video, which basically explains how the whole thing works. There is some behavior you're looking for, and you, the human, get to see pairs of behaviors. And you simply click on the one that looks better.

And after a very modest number of clicks, you can get this little simulated leg to do backflips. And there you go, you can now do backflips. To get this specific behavior, it took about 500 clicks by human annotators. This is a very data-efficient reinforcement learning algorithm, but it is efficient in terms of rewards and not in terms of environment interactions.

So what you do here is that you take all the clicks, each of which says that one behavior is better than the other. You fit a reward function, a numerical reward function, to those clicks. So you want to fit a reward function which satisfies those clicks, and then you optimize this reward function with reinforcement learning.
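A minimal sketch of that reward-fitting step, using the usual Bradley-Terry-style formulation, might look like this; the linear reward model, simulated annotator, and feature dimensions are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim = 8
w = np.zeros(feat_dim)              # linear reward model: r(s) = phi(s) . w
true_w = rng.normal(size=feat_dim)  # hidden "annotator taste" used to simulate clicks
lr = 0.1

def segment_features(T=20):
    """Made-up per-step state features for one trajectory segment."""
    return rng.normal(size=(T, feat_dim))

for click in range(500):                  # roughly the number of clicks in the talk
    A, B = segment_features(), segment_features()
    phi_A, phi_B = A.sum(axis=0), B.sum(axis=0)
    a_preferred = float(phi_A @ true_w > phi_B @ true_w)   # simulated human click
    # Bradley-Terry model: P(A preferred) = sigmoid(R(A) - R(B)).
    p = 1.0 / (1.0 + np.exp(-(phi_A - phi_B) @ w))
    # Gradient step on the cross-entropy between the click and the model.
    w += lr * (a_preferred - p) * (phi_A - phi_B)

cos = w @ true_w / (np.linalg.norm(w) * np.linalg.norm(true_w) + 1e-8)
print("alignment with the hidden preference:", round(float(cos), 3))
# The fitted reward function would then be optimized with ordinary RL.
```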

And it actually works. So this requires 500 bits of information. We've also been able to train lots of Atari games using several thousand bits of information. So in all these cases, you had human annotators or human judges, just like in the previous slide, looking at pairs of trajectories and clicking on the one that they thought was better.

And here's an example of an unusual goal where this is a car racing game, but the goal was to ask the agent to train the white car to drive right behind the orange car. So it's a different goal, and it was very straightforward to communicate this goal using this approach.

So then, to finish off, alignment is a technical problem. It has to be solved. But of course, the determination of the correct goals we want our AI systems to have will be a very challenging political problem. And on this note, I want to thank you so much for your attention, and I just want to say that there will be a happy hour at Cambridge Brewing Company at 8:45 if you want to chat more about AI and other topics.

Please come by. - I think that deserves an applause. Thank you very much. (audience applauding) - So back propagation is, well, neural networks are bio-inspired, but back propagation doesn't look as though it's what's going on in the brain, because signals in the brain go one direction down the axons, whereas back propagation requires the errors to be propagated back up the wires.

So can you just talk a little bit about that whole situation where it looks as though the brain is doing something a bit different than our highly successful algorithms? Are algorithms gonna be improved once we figure out what the brain is doing, or is the brain really sending signals back even though it's got no obvious way of doing that?

What's happening in that area? - So that's a great question. So first of all, I'll say that the honest answer is that I don't know, but I have opinions. And so I'll say two things. First of all, if we agree that it is a true fact that back propagation solves the problem of circuit search.

This problem feels like an extremely fundamental problem. And for this reason, I think that it's unlikely to go away. Now, you're also right that the brain doesn't obviously do back propagation, although there have been multiple proposals of how it could be doing it. For example, there's been work by Tim Lillicrap and others, where they've shown that it's possible to learn a different set of connections that can be used for the backward pass, and that can result in successful learning.

Now, the reason this hasn't been really pushed to the limit by practitioners is because they say, well, I've got TensorFlow to do the gradients, so I'm just not going to worry about it. But you are right that this is an important issue, and one of two things is going to happen.

So my personal opinion is that back propagation is just going to stay with us till the very end, and we'll actually build fully human level and beyond systems before we understand how the brain does what it does. So that's what I believe, but of course, it is a difference that has to be acknowledged.

- Okay, thank you. Do you think it was a fair matchup for the Dota bot and that person, given the constraints of the system? - So I'd say that the biggest advantage computers have in games like this, like one of the big advantages, is that they obviously have a better reaction time.

Although in Dota in particular, the number of clicks per second of the top players is fairly small, which is different from StarCraft. StarCraft is a very mechanically heavy game because of the large number of units. And so the top players, they just click all the time.

In Dota, every player controls just one hero, and so that greatly reduces the total number of actions they need to make. Now, still, precision matters. But what I think will really happen is that we'll discover that computers have the advantage in any domain.

Or rather, every domain. Not yet. - So do you think that the emergent behaviors from the agent were actually kind of directed because the constraints were already kind of in place? Like, so it was kind of forced to discover those? Or do you think that, like, that was actually something quite novel that, like, wow, it actually discovered these on its own?

Like, you didn't actually have to be biased towards constraining it? - So it's definitely, we discovered new strategies, and I can share an anecdote. We have a pro who would test the bot, and he played against it for a long time, and the bot would do all kinds of things against the human player which were effective.

Then at some point, that pro decided to play against a better pro, and he decided to imitate one of the things that the bot was doing, and by imitating it, he was able to defeat the better pro. So I think the strategies that the bot discovers are real, and it means that there's very real transfer.

I think what that means is that because the strategies discovered by the bot help the humans, the fundamental gameplay is deeply related. - For a long time now, I've heard that the objective of reinforcement learning is to determine a policy that chooses an action to maximize the expected reward, which is what you said earlier.

Would you ever wanna look at the standard deviation of possible rewards? Does that even make sense? - Yeah. I mean, I think for sure. I think it's really application-dependent. One of the reasons to maximize the expected reward is because it's easier to design algorithms for it. So you write down this equation, the formula, you do a little bit of derivation, you get something which amounts to a nice-looking algorithm.

Now, there do exist applications where you never wanna make mistakes, and where you wanna look at the standard deviation as well, but in practice, it seems that just looking at the expected reward covers a large fraction of the situations you'd like to apply this to.

- Okay, thanks. - We talked last week about motivations, and that has a lot to do with reinforcement, and one of the ideas is that our motivations are actually connection with others and cooperation, and I'm wondering, and I understand it's very popular to have computers play these competitive games, but is there any use in having an agent self-play collaboratively, in collaborative games?

- Yeah, I think that's an extremely good question. I think one place from which we can get some inspiration is from the evolution of cooperation. Like, I think, cooperation, we cooperate ultimately because it's much better for you, the person, to be cooperative than not, and so I think what should happen, if you have a sufficiently open-ended game, then cooperation will be the winning strategy, and so I think we will get cooperation whether we like it or not.

- Hey, you mentioned the complexity of the simulation of friction. I was wondering if you feel that there exist open complexity-theoretic problems relevant to AI, or whether it's just a matter of finding good approximations to the types of problems that humans tend to solve. - Yeah, so complexity theory, well, at a very basic level, we know that whatever algorithm we're gonna run is going to run fairly efficiently on some hardware, so that puts a pretty strict upper bound on the true complexity of the problems we're solving.

Like, by definition, we are solving problems which aren't too hard in a complexity-theoretic sense. Now, it is also the case that, while the overall thing that we do is not hard in a complexity-theoretic sense, and indeed, humans cannot solve NP-complete problems in general, many of the optimization problems that we pose to our algorithms are intractable in the general case, starting from neural net optimization itself.

It is easy to create a family of data sets for a neural network with a very small number of neurons, such that finding the global optimum is NP-complete. And so, how do we avoid it? Well, we just try gradient descent anyway, and somehow it works. But without question, we do not solve problems which are truly intractable.

So, I mean, I hope this answers the question. - Hello. It seems like an important sub-problem on the path towards AGI will be understanding language, and the state of generative language modeling right now is pretty abysmal. What do you think are the most productive research trajectories towards generative language models?

- So, I'll first say that you are completely correct that the situation with language is still far from great, although progress has been made, even without any particular innovations beyond models that exist today. Simply scaling up models that exist today on larger data sets is going to go surprisingly far.

Not even larger data sets, but larger and deeper models. For example, if you trained a language model with a thousand layers, and it's the same layer, I think it's gonna be a pretty amazing language model. Like, we don't have the cycles for it yet, but I think it will change very soon.

Now, I also agree with you that there are some fundamental things missing in our current understanding of deep learning, which prevent us from really solving the problem that we want. So I think one of the things that's missing, or that seems patently wrong, is the fact that we train a model, and then we stop training the model, and we freeze it.

Even though it's the training process where the magic really happens. Like, the magic is, like, if you think about it, like, the training process is the true general part of the whole story, because your TensorFlow code doesn't care which data set to optimize. It just says, "Whatever, just give me the data set.

"I don't care which problem to solve. "I'll solve them all." So, like, the ability to do that feels really special, and I think we are not using it at test time. Like, it's hard to speculate about, like, things which we don't know the answer, but all I'll say is that simply training bigger, deeper language models will go surprisingly far, scaling up.

But also doing things like training at test time and inference at test time, I think, will be another important boost to performance. - Hi, thank you for the talk. So it seems like right now another interesting approach to solving reinforcement learning problems would be to go for the evolutionary routes, using evolutionary strategies.

And although they have their caveats, I wanted to know if at OpenAI particularly you're working on something related, and what is your general opinion on them? - So, like, at present, I believe that something like evolutionary strategies is not great for reinforcement learning. I think that normal reinforcement learning algorithms, especially with big policies, are better.

But I think if you want to evolve a small, compact object, like a piece of code, for example, I think that would be a place where this would be seriously worth considering. But this, you know, evolving a useful piece of code is a cool idea, it hasn't been done yet, so still a lot of work to be done before we get there.

- Hi, thank you so much for coming. My question is, you mentioned that what the right goal is, is a political problem, so I'm wondering if you can elaborate a bit on that, and then also, what do you think would be the approach for us to maybe get there? - Well, I can't really comment too much. We now have a few people who are thinking about this full-time at OpenAI.

I don't have enough of a super strong opinion to say anything too definitive. All I can say at a very high level is that if you go into the future, whenever it's gonna happen, soon or later, you will build a computer which can do anything better than a human. It will happen, 'cause the brain is physical.

The impact on society is going to be completely massive and overwhelming. It's very difficult to imagine, even if you try really hard. And I think what it means is that people will care a lot. And that's what I was alluding to, the fact that this will be something that many people will care about strongly.

And, like, as the impact increases gradually, with self-driving cars, more automation, I think we will see a lot more people care. - Do we need to have a very accurate model of the physical world and simulate that in order to have these agents that can eventually come out into the real world and do something approaching, you know, human-level intelligence tasks?

- Yeah, that's a very good question. So I think if that were the case, we'd be in trouble. And I am very certain that it could be avoided. So specifically, the real answer has to be that, look, you learn to problem-solve, you learn to negotiate, you learn to persist, you learn lots of different useful life lessons in the simulation.

And yes, you learn some physics, too. But then you go outside into the real world, and you have to start over to some extent, because many of your deeply held assumptions will be false. And that's one of the reasons I care so much about never stopping training.

You've accumulated your knowledge, now you go into an environment where some of your assumptions are violated, you continue training. You try to connect the new data to your old data. And this is an important requirement from our algorithms, which is already met to some extent, but it will have to be met a lot more so that you can take the partial knowledge that you've acquired and go in a new situation, learn some more.

Literally the example of, you go to school, you learn useful things, then you go to work. It's not perfect; your four years of CS in undergrad are not gonna fully prepare you for whatever it is you need to know at work. It will help somewhat.

You'll be able to get off the ground, but there will be lots of new things you need to learn. So that's the spirit of it. I think of it like school. - One of the things you mentioned pretty early on in your talk is that one of the limitations of this sort of style of reinforcement learning is that there's no self-organization.

So you have to tell it when it did a good thing or it did a bad thing. And that's actually a problem in neuroscience as well when you're trying to teach a rat to navigate a maze. You have to artificially tell it what to do. So where do you see moving forward when we already have this problem with teaching, not necessarily learning, but also teaching.

So where do you see the research moving forward in that respect? How do you sort of introduce this notion of self-organization? - So I think without question, one really important thing you need to do is to be able to infer the goals and strategies of other agents by observing them.

That's a fundamental skill you need to be able to learn, to embed into the agents. So that for example, you have two agents, one of them is doing something, and the other agent says, "Well, that's really cool. "I wanna be able to do that too." And then you go on and do that.

And so I'd say that this is a very important component in terms of setting the reward: you see what they do, you infer the reward, and now we have a knob which says, "You see what they're doing? Now go and try to do the same thing." So I'd say this is, as far as I know, one of the important ways in which humans are quite different from other animals: the scale and scope in which we copy the behavior of other humans.

- Might I ask a quick follow-up? - Go for it. - So that's kind of obvious how that works in the scope of competition, but what about just sort of arbitrary tasks? Like I'm in a math class with someone and I see someone doing a problem a particular way and I'm like, "Oh, that's a good strategy.

"Maybe I should try that out." How does that work in a sort of non-competitive environment? - So I think that this will be, I think that's going to be a little bit separate from the competitive environment, but it will have to be somehow either, probably baked in, maybe evolved into the system, where if you have other agents doing things, they're generating data which you observe, and the only way to truly make sense of the data that you see is to infer the goal of the agent, the strategy, their belief state.

That's important also for communicating with them. If you want to successfully communicate with someone, you have to keep track both of their goal and of their belief state and knowledge. So I think you will find that there are many connections between understanding what other agents are doing, inferring their goals, imitating them, and successfully communicating with them.

- All right, let's give Ilya and the happy hour a big hand. (audience applauding) (audience cheering) (upbeat music)