Ilya Sutskever: OpenAI Meta-Learning and Self-Play | MIT Artificial General Intelligence (AGI)
Chapters
0:00 Introduction
0:55 Talk
43:04 Q&A
00:00:00.000 |
Welcome back to 6.S099, Artificial General Intelligence. 00:00:13.400 |
He started in the ML group in Toronto with Geoffrey Hinton, 00:00:31.120 |
And his work, his recent work in the past five years, 00:00:42.600 |
and driver behind some of the biggest breakthrough ideas 00:00:45.400 |
in deep learning and artificial intelligence ever. 00:00:56.680 |
- All right, thanks for the introduction, Lex. 00:01:22.480 |
Which I think is actually not a self-evident thing 00:01:30.520 |
it's a mathematical theorem that you can prove, 00:01:35.520 |
is that if you could find the shortest program 00:01:43.760 |
then you will achieve the best generalization possible. 00:01:48.080 |
you can turn it into a very, very simple algorithm. 00:02:08.000 |
then you've essentially extracted all conceivable regularity 00:02:21.840 |
but there is no way to express it as a shorter program, 00:02:26.120 |
then it means that your data is totally random. 00:02:31.200 |
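The idea being appealed to here is usually stated in terms of Kolmogorov complexity; a compact restatement of that framing (a gloss on the remark, not a formula from the talk):

```latex
K(x) \;=\; \min_{p \,:\, U(p) = x} |p|
```

Here K(x) is the length of the shortest program p, on a fixed universal machine U, that outputs the data x. Informally, predicting with the shortest program consistent with the data gives the best generalization one can hope for, and if no program much shorter than the data itself exists (K(x) close to |x|), the data carries no compressible regularity, i.e. it is essentially random.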
Now there is little known mathematical theory behind this, 00:02:44.200 |
at least given today's tools and understanding, 00:02:48.960 |
that explains or generates or solves your problem 00:02:57.960 |
The space of all programs is a very nasty space. 00:03:05.600 |
result in massive changes to the behavior of the program, 00:03:13.400 |
Of course, you get something totally different. 00:03:19.320 |
search there seems to be completely off the table. 00:03:34.800 |
It turns out that when it comes to small circuits, 00:03:40.200 |
that solves your problem using backpropagation. 00:03:52.320 |
and you impose constraints on your circuit using data, 00:03:56.600 |
you can find a way to satisfy these constraints 00:04:01.040 |
using backprop by iteratively making small changes 00:04:13.080 |
What this means is that the computational problem 00:04:16.680 |
that's solved by backpropagation is extremely profound. 00:04:34.040 |
for which you cannot find the best neural network. 00:04:36.680 |
But in practice, that seems to be not a problem. 00:04:45.840 |
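As a concrete illustration of "imposing constraints on a circuit with data and satisfying them by iteratively making small changes," here is a minimal sketch, not from the talk: a tiny two-layer network fit to the XOR constraints with plain gradient descent in NumPy.

```python
import numpy as np

# Toy "small circuit": a 2-8-1 network trained to satisfy the XOR constraints.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for step in range(10000):
    # Forward pass: the circuit's current behavior on the data.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: small parameter changes that reduce the constraint violation.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(0)

print(np.round(out, 2))  # typically close to [[0], [1], [1], [0]] after training
```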
where you have a large number of equation terms like this, 00:04:54.200 |
and they represent all your degrees of freedom. 00:04:57.240 |
And you use gradient descent to push the information 00:05:16.400 |
And you can do quite a lot with 50 time steps 00:05:19.840 |
of a very, very powerful, massively parallel computer. 00:05:23.960 |
So for example, I think it is not widely known 00:05:50.760 |
you can sort successfully using only two parallel steps. 00:05:54.800 |
So there's something slightly unobvious going on. 00:05:57.160 |
Now, these are parallel steps of threshold neurons, 00:06:18.640 |
because it can run computation inside of its layers, 00:06:35.640 |
And deep neural networks satisfy both of these constraints. 00:06:40.720 |
This is the basis on which everything else resides. 00:07:30.920 |
but they are good enough to do interesting things. 00:07:38.200 |
is one where you need to maximize the expected reward. 00:07:45.800 |
in which the reinforcement learning framework 00:08:01.480 |
That's what the environment communicates back. 00:08:04.400 |
The way in which this is not the case in the real world 00:08:16.160 |
We are not told, the environment doesn't say, 00:08:24.800 |
And there is only one real true reward in life, 00:08:43.520 |
and you want the agent to map observations to actions. 00:08:47.320 |
So you let it be parameterized with a neural net, 00:08:57.240 |
that's actually being used in practice everywhere. 00:09:00.640 |
But it's also very robust and very simple. 00:09:08.800 |
This is literally the one-sentence description 00:09:27.800 |
if you find that the result exceeded your expectation, 00:09:36.520 |
This is the full idea of reinforcement learning. 00:09:41.160 |
and if you do, do more of that in the future. 00:09:53.080 |
In a regular neural network, 00:09:56.920 |
like this, you might say, okay, what's the goal? 00:09:59.440 |
You run the neural network, you get an answer. 00:10:04.960 |
And whatever difference you have between those two, 00:10:06.560 |
you send it back to change the neural network. 00:10:11.720 |
In reinforcement learning, you run a neural network, 00:10:19.400 |
your randomness turns into the desired target, in effect. 00:10:30.680 |
Without explaining what these equations mean, 00:10:39.760 |
There are two classes of reinforcement learning algorithms. 00:10:47.560 |
this expression right there, the sum of rewards, 00:10:55.480 |
You run, you do some algebra, and you get a derivative. 00:10:59.040 |
And miraculously, the derivative has exactly the form 00:11:09.320 |
and if you like them, increase the log probability 00:11:14.000 |
It's very nice when the intuitive explanation 00:11:20.160 |
even though you'll have to take my word for it 00:11:28.400 |
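The derivative being referred to is, in the standard formulation, the score-function (REINFORCE) policy gradient; a hedged reconstruction of its usual form:

```latex
\nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right]
  \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau)
        \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
```

which matches the intuition above: sample actions, and if the total reward R(τ) is higher than expected, increase the log-probability of the actions that were taken (in practice a baseline is subtracted from R(τ) to reduce variance).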
which is a little bit more difficult to explain. 00:11:32.720 |
They are a bit less stable, a bit more sample efficient, 00:11:42.320 |
not only from the data generated by the actor, 00:11:55.280 |
So yeah, this is the on-policy, off-policy distinction, 00:12:04.240 |
If you already know it, then you already know it. 00:12:07.160 |
So now what's the potential of reinforcement learning? 00:12:11.840 |
What is it actually, why should we be excited about it? 00:12:17.840 |
The reinforcement learning algorithms of today 00:12:22.800 |
and especially if you have a really good simulation 00:12:28.760 |
But what's really exciting is if you can build 00:12:43.720 |
in order to learn in the fastest way possible. 00:12:46.120 |
Now, today our algorithms are not particularly 00:12:52.840 |
But as our field keeps making progress, this will change. 00:12:56.000 |
Next, I want to dive into the topic of meta-learning. 00:13:05.480 |
is a beautiful idea that doesn't really work, 00:13:08.720 |
but it kind of works, and it's really promising too. 00:13:19.160 |
Perhaps we could use those learning algorithms 00:13:30.360 |
you train it not on one task, but on many tasks, 00:13:36.400 |
and you ask it to learn to solve these tasks quickly. 00:13:43.280 |
Here's what most traditional meta-learning looks like. 00:13:47.920 |
You have a model which is a big neural network. 00:13:53.960 |
instead of training cases, you have training tasks. 00:13:58.960 |
And instead of test cases, you have test tasks. 00:14:01.240 |
So your input may be, instead of just your current test case, 00:14:05.440 |
it would be all the information about the test tasks 00:14:14.840 |
So basically you say, yeah, I'm gonna give you 00:14:17.560 |
your 10 examples as part of your input to your model, 00:14:27.480 |
You turn the neural network into the learning algorithm 00:14:30.680 |
by turning a training task into a training case. 00:14:53.400 |
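A minimal sketch of "turning a training task into a training case" (a toy construction under assumed details, not code from the talk): each few-shot regression task is flattened into one supervised example whose input contains the task's labeled examples plus a query point.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sine_task(k_shot=10):
    """One 'task': a random sinusoid; returns k labeled examples plus a query."""
    amp, phase = rng.uniform(0.5, 5.0), rng.uniform(0, np.pi)
    xs = rng.uniform(-5, 5, size=k_shot + 1)
    ys = amp * np.sin(xs + phase)
    return xs[:-1], ys[:-1], xs[-1], ys[-1]

def task_to_training_case(k_shot=10):
    """Flatten a whole task into one (input, target) pair for an ordinary model."""
    sx, sy, qx, qy = sample_sine_task(k_shot)
    model_input = np.concatenate([sx, sy, [qx]])  # the labeled examples ride along in the input
    return model_input, qy                        # the model must adapt "in-context"

# A meta-training batch: ordinary-looking (x, y) pairs, but each row is a whole task.
batch = [task_to_training_case() for _ in range(32)]
X = np.stack([b[0] for b in batch])   # shape (32, 2*k_shot + 1)
Y = np.array([b[1] for b in batch])
print(X.shape, Y.shape)
```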
So there's been a data set produced at MIT by Lake et al. 00:15:02.680 |
We have a large number of different handwritten characters. 00:15:08.120 |
extremely strong meta-learning system for this task. 00:15:10.840 |
Another very successful example 00:15:14.200 |
of meta-learning is that of neural architecture search 00:15:23.360 |
that solved one problem well, a small problem. 00:15:27.000 |
and then it would successfully solve large problems as well. 00:15:29.240 |
So this is kind of the small number of bits meta-learning. 00:15:40.640 |
So this is the other way of doing meta-learning. 00:15:43.520 |
So anyway, the point is, 00:15:46.000 |
what's really happening in meta-learning in most cases 00:15:48.800 |
is that you turn a training task into a training case 00:15:53.240 |
and pretend that this is totally normal deep learning. 00:16:05.160 |
So now that I've finished the introduction section, 00:16:16.600 |
There's been a large effort by Marcin Andrychowicz 00:16:20.280 |
to develop a learning algorithm for reinforcement learning 00:16:34.640 |
And I wanna discuss one problem in reinforcement learning. 00:16:43.040 |
One really important thing you need to learn to do 00:16:56.440 |
So one very important thing that has to happen 00:16:58.240 |
is that you must get rewards from time to time. 00:17:01.120 |
If you try something and you don't get rewards, 00:17:06.920 |
So I'd say that's the kind of the crux of the problem. 00:17:12.400 |
And relatedly, is there any way to meaningfully benefit 00:17:17.400 |
from the experience, from your attempts, from your failures? 00:17:26.560 |
You say, instead of asking your algorithm 00:17:32.040 |
that can achieve a very large family of goals. 00:17:42.680 |
Anytime you do something, you achieve some state. 00:17:46.800 |
So let's suppose you say, I want to achieve state A. 00:17:50.000 |
I try my best and I end up achieving state B. 00:17:54.280 |
I can either conclude, well, that was disappointing, 00:18:04.040 |
But alternatively, I can say, well, wait a second, 00:18:06.760 |
I've just reached a perfectly good state, which is B. 00:18:17.840 |
And I just want to point out, this is the one case, 00:18:33.160 |
you are doing on-policy learning for reaching the state A, 00:19:07.520 |
to learn to master its environment in general, 00:19:10.680 |
you build a system which always learns something. 00:19:15.080 |
It learns from success as well as from failure. 00:19:27.880 |
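A minimal sketch of the hindsight relabeling idea (a toy reconstruction with assumed interfaces, not the OpenAI implementation): every attempt at goal A is also replayed as a successful attempt at whatever state was actually reached.

```python
import random

def her_relabel(episode, reward_fn):
    """episode: list of (state, action, next_state, intended_goal) transitions.
    Returns the original transitions plus copies relabeled with achieved goals."""
    replay = []
    for (s, a, s_next, goal) in episode:
        replay.append((s, a, s_next, goal, reward_fn(s_next, goal)))
        # Hindsight: pretend an actually-achieved state was the goal all along,
        # so even a "failed" episode yields transitions with positive reward.
        achieved = random.choice(episode)[2]   # simplified stand-in for HER's 'future' strategy
        replay.append((s, a, s_next, achieved, reward_fn(s_next, achieved)))
    return replay

# Toy usage: states and goals are integers; reward is 1 only on exact success.
episode = [(0, +1, 1, 5), (1, +1, 2, 5), (2, +1, 3, 5)]   # wanted state 5, reached 3
reward = lambda state, goal: 1.0 if state == goal else 0.0
for transition in her_relabel(episode, reward):
    print(transition)
```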
So one challenge in reinforcement learning systems 00:19:36.280 |
It means that at the beginning of the system, 00:19:43.840 |
And so it's important that you design your reward function 00:19:49.040 |
so that even when the system is not very good, 00:19:52.240 |
Now, if you give your system a very sparse reward 00:20:51.720 |
it works if you do it on a physical robot as well, 00:20:58.440 |
that the hindsight experience replay algorithm 00:21:02.880 |
because you want to make use of all your data 00:21:17.760 |
Because in the work that I've shown you so far, 00:21:21.760 |
the system is asked to achieve low-level states. 00:21:25.080 |
So I think one thing that will become very important 00:21:29.440 |
is representation learning and unsupervised learning. 00:21:35.640 |
what's the state space of goals that's worth achieving. 00:21:39.000 |
Now I want to go through some real meta-learning results, 00:22:02.000 |
So, I think we can agree that in the domain of robotics, 00:22:08.920 |
it would be nice if you could train your policy 00:22:12.080 |
in simulation, and then somehow this knowledge 00:22:26.200 |
but they can never perfectly match the real world 00:22:29.560 |
unless you want to have an insanely slow simulator. 00:22:40.240 |
and I heard somewhere, correct me if I'm wrong, 00:22:49.800 |
So your simulation is just not going to match reality. 00:23:11.400 |
that would quickly adapt itself to the real world. 00:23:16.400 |
Well, if you want to learn a policy that can quickly adapt, 00:23:20.300 |
we need to make sure that it has opportunities 00:23:25.520 |
Instead of solving our problem in just one simulator, 00:23:30.520 |
we add a huge amount of variability to the simulator. 00:23:49.760 |
And then importantly, you don't tell the policy 00:23:56.160 |
You take your policy and you put it in an environment 00:24:06.160 |
what the friction is as I get responses from the environment. 00:24:19.140 |
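A hedged structural sketch of the training loop being described (the toy dynamics and all names here are placeholders of mine): physical parameters such as friction are re-sampled every episode and hidden from the policy, so a policy with memory must infer them from how the environment responds.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_episode(friction, steps=50):
    """Toy 1-D pushing task; the policy never observes `friction` directly."""
    pos, vel, history = 0.0, 0.0, []
    for _ in range(steps):
        # A real policy with memory would condition on `history` (past observations
        # and actions) to infer the hidden friction; a fixed push stands in here.
        action = 1.0
        vel = vel + action - friction * vel   # the hidden parameter shapes the response
        pos = pos + vel
        history.append((pos, vel, action))
    return history

# Domain randomization: a new simulator is sampled for every episode,
# and the randomized parameters are never told to the policy.
for episode in range(5):
    friction = rng.uniform(0.05, 0.9)
    traj = run_episode(friction)
    print(f"episode {episode}: friction={friction:.2f}, final pos={traj[-1][0]:.1f}")
```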
This is what happens when you just train a policy 00:24:21.720 |
in simulation and deploy it on the physical robot. 00:24:24.920 |
And here the goal is to bring the hockey puck 00:24:39.360 |
the systematic differences between the simulator 00:24:47.340 |
So even the basic movement is difficult for the policy 00:24:51.080 |
because the assumptions are violated so much. 00:24:58.160 |
which learns to quickly infer properties of the simulator 00:25:04.560 |
You can then give it the real thing, the real physics, 00:25:22.820 |
You can see that it would push the hockey puck 00:25:31.340 |
So that was a cool application of meta-learning. 00:25:38.120 |
I want to discuss one more application of meta-learning 00:25:49.280 |
Actually, Kevin Frans, the engineer who did it, 00:26:04.380 |
is if reinforcement learning was hierarchical. 00:26:16.420 |
Maybe the term subroutine is a little bit too crude, 00:26:18.380 |
but if you had some idea of which action primitives 00:26:31.860 |
from hierarchical reinforcement learning yet. 00:26:43.100 |
what's the right way for reinforcement learning, 00:26:47.180 |
And I just want to show you one very simple approach 00:27:13.100 |
And your goal is to learn low-level primitives 00:27:23.460 |
a very brief run of some reinforcement learning algorithm, 00:27:33.540 |
you want to learn policies 00:27:40.100 |
that result in the greatest amount of progress possible 00:27:53.740 |
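Stated as an objective (a reconstruction of the idea, not a formula shown in the talk): the low-level primitives φ are chosen so that a short burst of reinforcement learning on top of them makes as much progress as possible across the task distribution,

```latex
\phi^{*} \;=\; \arg\max_{\phi}\;
  \mathbb{E}_{M \sim p(\mathcal{M})}
  \Big[\, R_{M}\!\big(\mathrm{RL}_{k}(\theta_{\text{master}};\, \phi,\, M)\big) \Big]
```

where RL_k denotes k steps of an ordinary RL algorithm that updates only a master policy θ_master choosing among the fixed primitives φ on task M.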
and in this case the little bug learned three policies 00:28:13.580 |
of what hierarchical reinforcement learning could look like 00:28:30.060 |
addressing the limitations of high-capacity meta-learning. 00:28:37.660 |
the training task distribution has to be equal 00:28:47.860 |
because in reality, the new task that you want to learn 00:29:04.200 |
only a fraction of the things that you've learned 00:29:09.280 |
You need to learn quite a few more things from scratch. 00:29:15.980 |
because it really assumes that the distribution 00:29:33.880 |
are outside of the distribution of the training task, 00:29:50.260 |
And I want to start by reviewing very old work 00:30:25.620 |
"let them play backgammon against each other, 00:30:33.980 |
And you would think this was a paper from 2017, 00:30:42.340 |
20 hidden units, 40 and 80 for the different colors, 00:30:47.000 |
where you notice that the largest neural network works best. 00:30:58.500 |
and they were able to discover new strategies 00:31:00.320 |
that the best human backgammon players have not noticed, 00:31:21.880 |
So, now other examples of self-play include AlphaGo Zero, 00:31:26.880 |
which was able to learn to beat the world champion in Go 00:32:11.160 |
What are we actually teaching the systems to do? 00:32:13.560 |
And one very attractive attribute of self-play 00:32:23.280 |
By virtue of the agent acting in the environment, 00:32:27.600 |
the environment becomes difficult for the other agents. 00:32:55.760 |
And this is what happens in effect in biological evolution. 00:33:05.800 |
There is a really cool video on YouTube by Karl Sims. 00:33:05.800 |
which really kind of shows all the work that he's done. 00:33:14.160 |
And here you have a little competition between agents 00:33:17.000 |
where you evolve both the behavior and their morphology 00:33:20.560 |
when the agent is trying to gain possession of a green cube. 00:33:38.120 |
and this is work by Bansal et al. from OpenAI, 00:33:38.120 |
can we demonstrate some unusual results in self-play 00:33:48.560 |
that would really convince us that there is something there? 00:33:52.400 |
So what we did here is that we created a small ring, 00:33:58.760 |
and their goal is just to push each other outside the ring. 00:34:01.680 |
And they don't know anything about wrestling. 00:34:07.840 |
They don't know anything about centers of gravity. 00:34:10.000 |
All they know is that if you don't do a good job, 00:34:13.040 |
then your competition is going to do a better job. 00:34:15.520 |
Now, one of the really attractive things about self-play 00:34:30.120 |
you need to sometimes win and sometimes lose. 00:34:41.960 |
Yeah, so the green humanoid was able to block the ball. 00:34:55.400 |
No matter how good you are or how bad you are, 00:35:11.200 |
and you start applying large, random forces on it, 00:35:11.200 |
And the answer turns out to be that yes, it can, 00:35:19.800 |
because it's been trained against an opponent 00:35:24.360 |
And so that's why, even if it doesn't understand 00:35:27.280 |
where the pressure force is being applied on it, 00:35:31.400 |
So this is one potentially attractive feature 00:35:35.920 |
that you could learn a certain broad set of skills, 00:35:43.640 |
And so the biggest open question with this research is, 00:35:46.640 |
how do you train agents in a self-play environment 00:35:54.560 |
but then they are able to solve a battery of tasks 00:36:08.440 |
of self-play environments that we've observed 00:36:12.080 |
and that is that we've seen a very rapid increase 00:36:19.000 |
we've seen the bot go from playing totally randomly 00:36:28.000 |
And the reason for that is that once you have 00:36:30.520 |
a self-play environment, if you put compute into it, 00:36:36.360 |
Self-play allows you to turn compute into data. 00:36:44.840 |
to be able to turn compute into, essentially, 00:36:48.680 |
simply because the speed of neural net processors 00:36:51.720 |
will increase very dramatically over the next few years. 00:36:56.600 |
and it will be important to make use of these 00:37:01.760 |
I also want to talk a little bit about the end game 00:37:07.280 |
So one thing that we know about the human brain 00:37:12.480 |
is that it has increased in size fairly rapidly 00:37:26.600 |
where the thing that's most important for your survival 00:37:40.560 |
then it really helps to have a slightly larger brain. 00:37:44.480 |
And there exists at least one paper from Science 00:37:50.160 |
So apparently there has been convergent evolution 00:38:00.800 |
even though the divergence in evolutionary time scale 00:38:04.840 |
between humans and birds occurred a very long time ago, 00:38:11.000 |
apes and birds have very different brain structure. 00:38:19.800 |
if we successfully follow the path of this approach, 00:38:40.560 |
of how do you make sure that the agents we learn 00:38:45.200 |
Now, I want to make a speculative digression here, 00:38:48.800 |
which is, I want to make the following observation. 00:38:57.080 |
If you believe that this kind of society of agents 00:39:07.000 |
where fully general intelligence will emerge, 00:39:12.080 |
and if you accept that our experience with the DotaBot, 00:39:16.640 |
where we've seen a very rapid increase in competence, 00:39:18.640 |
will carry over once all the details are right, 00:39:26.960 |
a very rapid increase in the competence of our agents 00:39:41.880 |
and teaching agents social skills and language, 00:39:45.640 |
and a lot of things that actually exist in humans as well, 00:39:55.600 |
And the question of conveying goals to agents 00:40:10.440 |
will eventually be dramatically smarter than us. 00:40:23.400 |
which basically explains how the whole thing works. 00:40:30.520 |
and you, the human, get to see pairs of behaviors. 00:40:34.680 |
And you simply click on the one that looks better. 00:40:44.120 |
you can get this little simulated leg to do backflips. 00:41:02.240 |
it took about 500 clicks by human annotators. 00:41:17.040 |
and not in terms of the environment interactions. 00:41:20.360 |
So what you do here is that you take all the clicks, 00:41:44.760 |
We've also been able to train lots of Atari games 00:41:49.560 |
So in all these cases, you had human annotators 00:41:53.000 |
or human judges, just like in the previous slide, 00:42:00.440 |
and clicking on the one that they thought was better. 00:42:13.800 |
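A minimal sketch of how such pairwise clicks are typically turned into a reward function, in the Bradley-Terry style used by this line of work (the code and feature setup are an illustration, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(4)                          # linear reward model over clip features

def reward(clip_features):
    return float(w @ clip_features)

def update(preferred, other, lr=0.1):
    """One gradient step on -log sigmoid(R(preferred) - R(other))."""
    global w
    diff = reward(preferred) - reward(other)
    p = 1.0 / (1.0 + np.exp(-diff))             # model's probability that the click was 'right'
    w -= lr * (p - 1.0) * (preferred - other)   # gradient of the negative log-likelihood

# Toy data: the annotator secretly prefers clips with a larger first feature.
for _ in range(500):
    a, b = rng.normal(size=4), rng.normal(size=4)
    preferred, other = (a, b) if a[0] > b[0] else (b, a)   # the human's click
    update(preferred, other)

print(np.round(w, 2))   # the weight on feature 0 should dominate
```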
to train the white car to drive right behind the orange car. 00:42:19.760 |
and it was very straightforward to communicate this goal 00:42:32.960 |
But of course, the determination of the correct goals 00:42:38.520 |
will be a very challenging political problem. 00:42:42.000 |
And on this note, I want to thank you so much 00:42:50.360 |
if you want to chat more about AI and other topics. 00:43:12.440 |
because signals in the brain go one direction 00:43:16.680 |
requires the errors to be propagated back up the wires. 00:43:26.760 |
as though the brain is doing something a bit different 00:43:37.360 |
even though it's got no obvious way of doing that? 00:43:44.360 |
So first of all, I'll say that the true answer 00:43:46.520 |
is that, the honest answer is that I don't know, 00:44:00.040 |
rather, it is a true fact that backpropagation 00:44:07.920 |
This problem feels like an extremely fundamental problem. 00:44:11.080 |
And for this reason, I think that it's unlikely to go away. 00:44:25.440 |
by Tim Lillicrap and others, where they've shown 00:44:32.280 |
a different set of connections that can be used 00:44:38.640 |
Now, the reason this hasn't been really pushed 00:44:41.440 |
to the limit by practitioners is because they say, 00:44:48.580 |
But you are right that this is an important issue, 00:44:53.400 |
So my personal opinion is that backpropagation 00:44:56.040 |
is just going to stay with us till the very end, 00:45:09.560 |
it is a difference that has to be acknowledged. 00:45:14.240 |
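The Lillicrap et al. result alluded to is feedback alignment: error signals travel back through a fixed random matrix rather than the transpose of the forward weights. A minimal NumPy sketch of the idea on a toy regression (an illustration, not the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data.
X = rng.normal(size=(256, 10))
Y = np.tanh(X @ rng.normal(size=(10, 1)))

W1 = rng.normal(scale=0.5, size=(10, 20))
W2 = rng.normal(scale=0.5, size=(20, 1))
B = rng.normal(scale=0.5, size=(1, 20))   # fixed random feedback weights (never trained)

lr = 0.01
for step in range(2000):
    h = np.tanh(X @ W1)
    out = h @ W2
    err = out - Y                          # output error
    # Feedback alignment: carry the error backward through B, not W2.T.
    d_h = (err @ B) * (1 - h ** 2)
    W2 -= lr * h.T @ err / len(X)
    W1 -= lr * X.T @ d_h / len(X)

print(float(np.mean((out - Y) ** 2)))      # the loss typically drops substantially
```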
Do you think it was a fair matchup for the Dota bot 00:45:18.160 |
and that person, given the constraints of the system? 00:45:21.680 |
- So I'd say that the biggest advantage computers have 00:45:26.000 |
in games like this, like one of the big advantages, 00:45:28.320 |
is that they obviously have a better reaction time. 00:45:40.580 |
So in StarCraft, StarCraft is a very mechanically heavy game 00:45:46.680 |
And so the top players, they just click all the time. 00:45:49.240 |
In Dota, every player controls just one hero, 00:46:15.380 |
- So do you think that the emergent behaviors 00:46:17.420 |
from the agent were actually kind of directed 00:46:20.460 |
because the constraints were already kind of in place? 00:46:22.620 |
Like, so it was kind of forced to discover those? 00:46:27.940 |
that, like, wow, it actually discovered these on its own? 00:46:33.540 |
- So it's definitely, we discover new strategies, 00:46:35.540 |
and I can share an anecdote where our tester, 00:46:46.500 |
against the player, the human player, which were effective. 00:47:00.140 |
by imitating it, he was able to defeat a better pro. 00:47:03.300 |
So I think the strategies that it discovers are real, 00:47:17.180 |
by the bot help the humans, it means that the, 00:47:19.180 |
like, the fundamental gameplay is deeply related. 00:47:21.620 |
- For a long time now, I've heard that the objective 00:47:25.820 |
of reinforcement learning is to determine a policy 00:47:30.140 |
that chooses an action to maximize the expected reward, 00:47:36.180 |
Would you ever wanna look at the standard deviation 00:47:47.500 |
One of the reasons to maximize the expected reward 00:47:50.900 |
is because it's easier to design algorithms for it. 00:47:54.740 |
So you write down this equation, the formula, 00:48:11.060 |
and you wanna work on the standard deviation as well, 00:48:32.820 |
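One standard way to fold the questioner's concern into the objective (an illustration, not something stated in the talk) is a mean-variance, risk-sensitive criterion,

```latex
J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau)\right]
          \;-\; \lambda\,\sqrt{\mathrm{Var}_{\tau \sim \pi_\theta}\!\left[R(\tau)\right]}
```

with λ ≥ 0 trading expected return against variability; such objectives are generally harder to optimize than the plain expectation, which is consistent with the answer given here.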
and that has a lot to do with the reinforcement, 00:48:50.340 |
to have the computers play these competitive games, 00:48:59.580 |
self-play collaboratively, in collaborative games? 00:49:03.740 |
- Yeah, I think that's an extremely good question. 00:49:06.240 |
I think one place from which we can get some inspiration 00:49:32.600 |
then cooperation will be the winning strategy, 00:49:50.240 |
I was wondering if you feel that there exists 00:49:52.660 |
open complexity theoretic problems relevant to AI, 00:49:56.940 |
or whether it's just a matter of finding good approximations 00:50:10.920 |
we know that whatever algorithm we're gonna run 00:50:13.940 |
is going to run fairly efficiently on some hardware, 00:50:20.720 |
on the true complexity of the problems we're solving. 00:50:25.980 |
which aren't too hard in a complexity-theoretic sense. 00:50:28.780 |
Now, it is also the case that many of the problems, 00:50:36.620 |
is not hard in a complexity-theoretic sense, 00:50:41.860 |
it is true that many of the optimization problems 00:50:50.500 |
starting from neural net optimization itself. 00:50:56.100 |
for a neural network with a very small number of neurons, 00:50:58.460 |
such that finding the global optimum is NP-complete. 00:51:13.980 |
we do not solve problems which are truly intractable. 00:51:17.420 |
So, I mean, I hope this answers the question. 00:51:24.700 |
on the path towards AGI will be understanding language, 00:51:28.500 |
and the state of generative language modeling right now 00:51:35.220 |
research trajectories towards generative language models? 00:51:38.320 |
- So, I'll first say that you are completely correct 00:51:41.900 |
that the situation with language is still far from great, 00:51:54.940 |
on larger data sets is going to go surprisingly far. 00:51:59.100 |
Not even larger data sets, but larger and deeper models. 00:52:04.060 |
with a thousand layers, and it's the same layer, 00:52:06.820 |
I think it's gonna be a pretty amazing language model. 00:52:20.220 |
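The "thousand layers, and it's the same layer" remark describes weight tying across depth; a minimal sketch of that architectural choice (toy NumPy of my own, not a description of any specific model):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))   # ONE layer's weights, reused at every depth
b = np.zeros(d)

def deep_tied_forward(x, depth=1000):
    """Apply the same residual layer `depth` times; parameter count does not grow with depth."""
    h = x
    for _ in range(depth):
        h = h + np.tanh(h @ W + b)      # the tanh bounds each step's increment
    return h

x = rng.normal(size=(2, d))
print(deep_tied_forward(x).shape)       # (2, 64), after 1000 reuses of the same layer
```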
in our current understanding of deep learning, 00:52:39.180 |
and then we stop training the model, and we freeze it. 00:52:47.460 |
Like, the magic is, like, if you think about it, 00:52:49.900 |
like, the training process is the true general part 00:52:53.820 |
of the whole story, because your TensorFlow code 00:52:59.420 |
It just says, "Whatever, just give me the data set. 00:53:03.860 |
So, like, the ability to do that feels really special, 00:53:07.900 |
and I think we are not using it at test time. 00:53:22.900 |
But also doing things like training at test time 00:53:25.660 |
will be another important boost to performance. 00:53:31.820 |
So it seems like right now another interesting approach 00:53:50.020 |
- So, like, at present, I believe that something 00:53:58.620 |
I think that normal reinforcement learning algorithms, 00:54:03.840 |
But I think if you want to evolve a small, compact object, 00:54:14.780 |
But this, you know, evolving a useful piece of code 00:54:21.560 |
so still a lot of work to be done before we get there. 00:54:26.740 |
My question is, you mentioned what is the right goal 00:54:34.340 |
and then also, what do you think would be the approach 00:54:50.540 |
I don't have enough of a super strong opinion 00:54:59.720 |
is given the size, like, if you go into the future, 00:55:02.500 |
whenever, soon or, you know, whenever it's gonna happen, 00:55:05.660 |
when you build a computer which can do anything better 00:55:09.420 |
than a human, it will happen, 'cause the brain is physical. 00:55:23.620 |
And I think what it means is that people will care a lot. 00:55:33.340 |
And, like, as the impact increases gradually, 00:55:48.620 |
in order to have these agents that can eventually 00:55:52.380 |
come out into the real world and do something 00:55:55.060 |
approaching, you know, human-level intelligence tasks? 00:56:00.860 |
So I think if that were the case, we'd be in trouble. 00:56:04.180 |
And I am very certain that it could be avoided. 00:56:10.500 |
So specifically, the real answer has to be that, look, 00:56:14.780 |
you learn to problem-solve, you learn to negotiate, 00:56:17.820 |
you learn to persist, you learn lots of different 00:56:27.260 |
because many of your deeply held assumptions will be false. 00:56:31.060 |
And one of the goals, so that's one of the reasons 00:56:33.860 |
I care so much about never stopping training. 00:56:40.940 |
of your assumptions are violated, you continue training. 00:56:42.940 |
You try to connect the new data to your old data. 00:56:45.020 |
And this is an important requirement from our algorithms, 00:56:53.140 |
that you've acquired and go in a new situation, 00:57:00.580 |
you learn useful things, then you go to work. 00:57:12.900 |
but there will be lots of new things you need to learn. 00:57:18.900 |
- One of the things you mentioned pretty early on 00:57:22.700 |
of this sort of style of reinforcement learning 00:57:27.140 |
So you have to tell it when it did a good thing 00:57:29.860 |
And that's actually a problem in neuroscience 00:57:37.900 |
when we already have this problem with teaching, 00:57:41.940 |
So where do you see the research moving forward 00:57:51.220 |
is to be able to infer the goals and strategies 00:57:57.540 |
That's a fundamental skill you need to be able to learn, 00:58:06.180 |
and the other agent says, "Well, that's really cool. 00:58:10.700 |
And so I'd say that this is a very important component 00:58:27.020 |
this was one of the important ways in which humans 00:58:39.060 |
in which we copy the behavior of other humans. 00:58:52.860 |
and I see someone doing a problem a particular way 00:58:57.620 |
How does that work in a sort of non-competitive environment? 00:59:02.100 |
I think that's going to be a little bit separate 00:59:10.260 |
probably baked in, maybe evolved into the system, 00:59:23.660 |
of the data that you see is to infer the goal of the agent, 00:59:29.780 |
That's important also for communicating with them. 00:59:32.460 |
If you want to successfully communicate with someone, 00:59:35.820 |
and of their belief state instead of knowledge. 00:59:37.780 |
So I think you will find that there are many, 00:59:43.900 |
what other agents are doing, inferring their goals, 00:59:46.140 |
imitating them, and successfully communicating with them. 00:59:49.180 |
- All right, let's give Ilya and the happy hour a big hand.