Oriol Vinyals: Deep Learning and Artificial General Intelligence | Lex Fridman Podcast #306
Chapters
0:00 Introduction
0:34 AI
15:31 Weights
21:50 Gato
56:38 Meta learning
70:37 Neural networks
93:02 Emergence
99:47 AI sentience
123:43 AGI
00:00:00.000 |
"At which point is the neural network a being versus a tool?" 00:00:05.000 |
The following is a conversation with Oriol Vinyals, 00:00:18.040 |
and one of the most brilliant thinkers and researchers 00:00:33.600 |
You are one of the most brilliant researchers 00:00:45.040 |
So we're talking about languages, images, even biology, 00:00:53.400 |
In your lifetime, will we be able to build an AI system 00:01:10.640 |
will we be able to build a system that replaces you 00:01:16.080 |
in order to create a compelling conversation? 00:01:24.680 |
I really like when we start now with very powerful models, 00:01:34.880 |
if you remove the human side of the conversation, 00:02:08.520 |
Maybe you can source some questions from an AI system. 00:02:13.960 |
it's quite plausible that with your creativity, 00:02:17.040 |
you might actually find very interesting questions 00:02:24.040 |
And likewise, if I had now the tools on my side, 00:02:31.600 |
I like the words chosen by this particular system 00:02:36.600 |
Completely replacing it feels not exactly exciting to me. 00:02:49.880 |
maybe self-play interviews as you're suggesting 00:03:03.200 |
- So you said it doesn't seem exciting to you, 00:03:04.800 |
but what if exciting is part of the objective function 00:03:09.120 |
So there's probably a huge amount of data of humans, 00:03:16.080 |
and there's probably ways to measure the degree of, 00:03:24.120 |
that's most created an engaging conversation in the past. 00:03:28.680 |
So actually, if you strictly use the word exciting, 00:03:53.040 |
you're thinking about winning as the objective, right? 00:03:57.320 |
But in fact, when we discuss this with Blizzard, 00:04:05.360 |
if you could measure that and optimize for that, 00:04:11.720 |
or why we interact or listen or look at cat videos 00:04:28.240 |
into a particular aspect of AI, which is quite critical, 00:04:45.800 |
And then if you can learn a function, automated ideally, 00:04:54.840 |
that optimizes for non-obvious things such as excitement. 00:05:08.040 |
that is fully driven by an excitement reward function. 00:05:12.800 |
But obviously there would be still quite a lot of humanity 00:05:16.920 |
in the system, both from who is building the system, 00:05:28.440 |
because it's just hard to have a computational measure 00:05:39.240 |
I would actually venture to say that excitement 00:05:44.160 |
or perhaps has lower consequences of failure. 00:05:49.000 |
But there is perhaps the humanness that you mentioned, 00:05:54.920 |
that's perhaps part of a thing that could be labeled. 00:05:58.240 |
And that could mean an AI system that's doing dialogue, 00:06:02.480 |
that's doing conversations should be flawed, for example. 00:06:09.440 |
which is to have inherent contradictions by design, 00:06:15.080 |
Maybe it also needs to have a strong sense of identity. 00:06:18.760 |
So it has a backstory, it told itself that it sticks to, 00:06:22.680 |
it has memories, not in terms of how the system is designed, 00:06:26.880 |
but it's able to tell stories about its past. 00:06:30.360 |
It's able to have mortality and fear of mortality 00:06:41.240 |
and gets canceled on Twitter, that's the end of that system. 00:06:44.720 |
So it's not like you get to rebrand yourself, 00:06:52.120 |
because like you can't say anything stupid now, 00:07:04.720 |
like you've built up over time that you stick with, 00:07:16.300 |
All of those elements, it feels like they can be learned 00:07:25.400 |
And then combine that with a metric of excitement, 00:07:45.320 |
And there's obviously data for humanness on the internet. 00:07:48.880 |
So I wonder if there's a future where that's part, 00:08:00.760 |
and I think like what is interesting about this, 00:08:19.440 |
maybe the end of a domination of a series of wins. 00:08:25.440 |
somehow connect to a compelling conversation, 00:08:34.600 |
whether an AI is intelligent or not with the Turing test. 00:08:38.640 |
Which I guess, my question comes from a place 00:08:59.160 |
usually you try to map many of these interesting topics 00:09:14.280 |
We're talking about weights of a mathematical function, 00:09:17.800 |
and then looking at the current state of the game, 00:09:26.000 |
to get to the ultimate stage of all these experiences, 00:09:32.840 |
like words where we're currently barely seeing progress, 00:09:43.960 |
it's a large vast variety of human interactions online, 00:09:47.920 |
and then you're distilling these sequences, right? 00:09:51.600 |
Going back to my passion, like sequences of words, 00:09:59.840 |
And then you're trying to just learn a function 00:10:04.400 |
that maximizes the likelihood of seeing all these 00:10:14.200 |
where the way currently we train these models 00:10:30.840 |
So you're just passively observing and maximizing these, 00:10:33.560 |
you know, it's almost like a landscape of mountains. 00:10:54.600 |
And then we're putting them to then generate data 00:11:08.640 |
So to be clear, and again, mapping to AlphaGo, AlphaStar, 00:11:15.280 |
And when we deploy it to play against humans, 00:11:20.320 |
like language models, they don't even keep training, right? 00:11:23.480 |
They're not learning in the sense of the weights 00:11:29.760 |
Now there's something a bit more, feels magical, 00:11:33.480 |
but it's understandable if you're into neural net, 00:11:39.120 |
in the strict sense of the words, the weights changing. 00:11:41.480 |
Maybe that's mapping to how neurons interconnect 00:11:46.640 |
But it's true that the context of the conversation 00:11:50.280 |
that takes place when you talk to these systems, 00:12:00.120 |
it has a hard drive that has a lot of information. 00:12:16.560 |
I mean, right now we're talking, to be concrete, 00:12:21.760 |
and then beyond that, we start forgetting what we've seen. 00:12:24.840 |
So you can see that there's some short-term coherence 00:12:32.280 |
having sort of a mapping, an agent to have consistency. 00:12:47.480 |
if we think even of these podcast books are much longer. 00:12:51.760 |
So technically speaking, there's a limitation there. 00:12:55.120 |
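
To make the limitation concrete, here is a minimal sketch of a fixed context window; the window size and helper are illustrative assumptions, not the parameters of any particular model.

```python
# A minimal sketch of the fixed context window being described: the model can
# only condition on the most recent N tokens, so older parts of the
# conversation are simply dropped. The window size is an illustrative
# assumption, not the size of any particular model.

CONTEXT_LEN = 2048

def visible_context(token_history):
    """Return the slice of the conversation the model can actually attend to."""
    return token_history[-CONTEXT_LEN:]
```
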
Super exciting from people that work on deep learning 00:13:03.040 |
and the technology to have this lifetime-like experience 00:13:18.640 |
I think we've seen the power of this imitation, 00:13:22.240 |
again, on the internet scale that has enabled this 00:13:36.560 |
And in fact, as I said, we don't even train them 00:13:41.160 |
other than their working memory, of course, is affected. 00:13:49.720 |
When, from basically when we were born and probably before. 00:13:54.040 |
So lots of fascinating, interesting questions 00:13:57.400 |
I think the one I mentioned is this idea of memory 00:14:01.680 |
and experience versus just kind of observe the world 00:14:13.400 |
And then the second maybe issue that I see is 00:14:18.160 |
all these models, we train them from scratch. 00:14:21.240 |
That's something I would have complained three years ago 00:14:26.400 |
And it feels, if we take inspiration from how we got here, 00:14:31.360 |
how the universe evolved us and we keep evolving, 00:14:37.840 |
that we should not be training models from scratch 00:14:41.320 |
every few months, that there should be some sort of way 00:14:45.280 |
in which we can grow models much like as a species 00:14:51.520 |
is building from the previous sort of iterations. 00:14:55.000 |
And that from a just purely neural network perspective, 00:15:07.680 |
This landscape we learn from the data and, you know, 00:15:13.360 |
given maybe a recent snapshot of these datasets 00:15:16.960 |
we train on, et cetera, or even a new game we're learning. 00:15:19.960 |
So that feels like something is missing fundamentally. 00:15:28.400 |
There's many ideas and it's super exciting as well. 00:15:32.440 |
when you're approaching a new problem in machine learning, 00:15:43.360 |
which in most cases is some version of random. 00:15:47.280 |
So that's what you mean by starting from scratch. 00:15:48.960 |
And it seems like it's a waste every time you solve 00:15:52.880 |
the game of Go and chess, StarCraft, protein folding, 00:15:59.440 |
like surely there's some way to reuse the weights 00:16:03.160 |
as we grow this giant database of neural networks. 00:16:08.400 |
- That has solved some of the toughest problems 00:16:34.640 |
what ideas do you have for better initialization of weights? 00:16:45.240 |
there's this beautiful idea that is a single algorithm 00:16:58.580 |
that are being cracked by this basic principle. 00:17:01.960 |
That is you take a neural network of uninitialized weights. 00:17:09.640 |
then you give it in the case of supervised learning, 00:17:17.120 |
and the desired output should look like this. 00:17:19.560 |
I mean, image classification is very clear example, 00:17:22.360 |
images to maybe one of a thousand categories. 00:17:26.840 |
but many, many, if not all problems can be mapped this way. 00:17:38.600 |
And I think that's the core of deep learning research, 00:17:48.440 |
without having to work very hard on the problem at stake. 00:17:54.400 |
but I think the field is excited to find less tweaks 00:18:02.000 |
when they work on important problems specific to those 00:18:09.300 |
I would say we have something general already, 00:18:11.760 |
which is this formula of training a very powerful model 00:18:23.400 |
Protein folding being such an important problem 00:18:26.060 |
has some basic recipe that is learned from before, right? 00:18:30.780 |
Like transformer models, graph neural networks, 00:18:42.420 |
Knowledge distillation is another technique, right? 00:18:53.600 |
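
As a rough illustration of the knowledge distillation technique mentioned here (a generic sketch, not the AlphaFold recipe), a smaller student can be trained to match a larger teacher's softened output distribution:

```python
import numpy as np

# Generic knowledge distillation sketch: the student is trained to match the
# temperature-softened output distribution of the teacher.

def softmax(z, temperature=1.0):
    z = z / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return float(np.sum(p_teacher * (np.log(p_teacher + 1e-9) - np.log(p_student + 1e-9))))

# The teacher is confident about class 0; the loss pushes the student toward it.
print(distillation_loss(np.array([0.2, 0.1, 0.0]), np.array([3.0, 0.5, -1.0])))
```
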
That's very important because protein folding 00:18:59.120 |
we should solve it no matter if we need to be a bit specific. 00:19:02.860 |
And it's possible that some of these learnings 00:19:04.940 |
will apply then to the next iteration of this recipe 00:19:29.540 |
That idea and some progress has been had starting, 00:19:33.100 |
I would say, mostly from GPT-3 on the language domain only, 00:19:37.140 |
in which you could conceive a model that is trained once, 00:19:44.680 |
it only knows how to translate a pair of languages, 00:19:47.640 |
or it only knows how to assign sentiment to a sentence. 00:19:55.460 |
And this prompting is essentially just showing it 00:19:59.860 |
almost like you do show examples, input-output examples, 00:20:07.820 |
which is a very natural way for us to learn from one another. 00:20:11.040 |
I tell you, "Hey, you should do this new task. 00:20:23.180 |
in this way to do few-shot prompting through language 00:20:30.940 |
we've seen these expanded to beyond language, 00:20:43.700 |
this is perhaps one way in which you have a single model. 00:20:47.760 |
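
A minimal sketch of the few-shot prompting idea described above: the "teaching" is just input-output examples placed in the context window, and `complete` is a hypothetical stand-in for any large language model's text-completion call.

```python
# Few-shot prompting sketch: no weights change; the task is inferred from the
# examples placed in the prompt. `complete` is a hypothetical helper, so it is
# left commented out.

def build_prompt(examples, query):
    """Format a few input-output pairs followed by the new query."""
    lines = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

examples = [
    ("The movie was wonderful", "positive"),
    ("I wasted two hours of my life", "negative"),
]
prompt = build_prompt(examples, "A surprisingly touching story")
# answer = complete(prompt)   # the model infers the task from the examples alone
```
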
The problem of this model is it's hard to grow 00:20:58.920 |
In this way that I could teach you a new task now 00:21:08.400 |
But it still feels like more breakthroughs should be had, 00:21:33.500 |
in terms of also next steps for deep learning 00:21:46.060 |
There's some interesting questions, many to be answered. 00:21:57.120 |
You wrote, "Gato is not the end, it's the beginning." 00:22:01.220 |
And then you wrote, "Meow," and then an emoji of a cat. 00:22:07.700 |
First, can you explain the meow and the cat emoji? 00:22:10.020 |
And second, can you explain what Gato is and how it works? 00:22:21.900 |
- One of the greatest AI researchers of all time, 00:22:31.940 |
meow and cat, probably he would, probably would. 00:23:00.340 |
where the field is going, but let me tell you about Gato. 00:23:03.740 |
So first, the name Gato comes from maybe a sequence 00:23:11.820 |
like used animal names to name some of their models 00:23:15.100 |
that are based on this idea of large sequence models. 00:23:23.180 |
So we had, you know, we had Gopher, Chinchilla, 00:23:44.500 |
especially discrete actions like up, down, left, right. 00:23:47.540 |
I just told you the actions, but they're words. 00:23:49.460 |
So you can kind of see how actions naturally map 00:24:14.220 |
And I think because of the word general agent, right? 00:24:27.740 |
And Gato is obviously a Spanish version of cat. 00:24:30.220 |
I had nothing to do with it, although I'm from Spain. 00:24:37.140 |
- Now it all makes sense. - Okay, okay, I see, I see. 00:24:50.060 |
- All right, so then how does the thing work? 00:24:51.660 |
So you said general is, so you said language, vision- 00:25:06.340 |
And maybe what to you are some beautiful ideas 00:25:16.060 |
are not that dissimilar from many, many work that comes. 00:25:28.620 |
that essentially takes a sequence of modalities, 00:25:38.820 |
And then its own objective that you train it to do 00:25:48.780 |
If this sequence that I'm showing you to train 00:25:53.500 |
then you're predicting what's the next action 00:25:57.100 |
So you think of these really as a sequence of bytes, right? 00:26:06.980 |
a sequence of maybe observations that are images 00:26:17.620 |
and you're modeling what's the next byte gonna be like. 00:26:29.060 |
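
The training objective being described can be sketched generically in PyTorch; `model` is a hypothetical callable mapping token ids to next-token logits, and this is illustrative rather than DeepMind's actual code.

```python
import torch
import torch.nn.functional as F

# Generic autoregressive objective: whatever the modality, the data is
# flattened into one token sequence and the model predicts token t+1
# given tokens up to t.

def next_token_loss(model, tokens: torch.LongTensor) -> torch.Tensor:
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)                        # (batch, seq_len - 1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),      # flatten batch and time
        targets.reshape(-1),                      # the "next byte" at every step
    )
```
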
if you're chatting with the system and so on. 00:26:41.500 |
It also actually inputs some sort of proprioception sensors 00:26:45.780 |
from robotics because robotics is one of the tasks 00:27:06.420 |
It's this brain that essentially you give it any sequence 00:27:26.780 |
like you can chat with Chinchilla or Flamingo, 00:27:43.260 |
to be good at only StarCraft or only Atari or only Go. 00:27:47.900 |
It's been trained on a vast variety of datasets. 00:27:51.660 |
- What makes it an agent, if I may interrupt? 00:28:06.740 |
the capacity to take actions in an environment 00:28:15.040 |
and then you generate the next action and so on. 00:28:17.660 |
- This actually, this reminds me of the question 00:28:23.000 |
Which is actually a very difficult question as well. 00:28:26.780 |
What is living when you think about life here 00:28:31.000 |
And a question interesting to me about aliens, 00:28:37.220 |
And this feels like, it sounds perhaps silly, 00:28:41.380 |
At which point is the neural network a being versus a tool? 00:28:46.380 |
And it feels like action, ability to modify its environment, 00:28:54.540 |
- Yeah, I think it certainly feels like action 00:29:09.060 |
But anyways, going back to the meow and the Gato, right? 00:29:16.100 |
and what took the team a lot of effort and time was, 00:29:19.100 |
as you were asking, how has Gato been trained? 00:29:23.100 |
So I told you Gato is this transformer neural network, 00:29:26.060 |
models actions, sequences of actions, words, et cetera. 00:29:30.580 |
And then the way we train it is by essentially 00:29:39.380 |
So it's a massive imitation learning algorithm 00:29:42.620 |
that it imitates obviously to what is the next word 00:29:46.300 |
that comes next from the usual data sets we use before, 00:29:52.980 |
of people writing on the web or chatting or whatnot, right? 00:30:08.160 |
we're quite interested in learning reinforcement learning 00:30:13.580 |
and learning agents that play in different environments. 00:30:16.940 |
So we kind of created a data set of these trajectories 00:30:28.420 |
control a 3D game environment and navigate a maze. 00:30:33.340 |
So we had all the experience that was created 00:30:36.060 |
through the one agent interacting with that environment. 00:30:44.380 |
all these sequences of words or sequences of these agent 00:30:54.860 |
And so we mix these data sets together and we train Gato. 00:31:05.220 |
it doesn't have different brains for each modality 00:31:10.500 |
It's not that big of a brain compared to most 00:31:17.140 |
Some models we're seeing getting to the trillions these days 00:31:17.140 |
that is very common from when you train these jobs. 00:31:35.020 |
diverse data set, not only containing all of internet, 00:31:40.380 |
playing very different distinct environments. 00:31:43.140 |
So this brings us to the part of the tweet of, 00:31:56.620 |
that especially the ones that it's been trained to do, 00:32:04.620 |
But obviously it's not as proficient as the teachers 00:32:11.740 |
It's not obvious that it wouldn't be more proficient. 00:32:18.040 |
is that the performance is such that it's not as good 00:32:31.220 |
- Yeah, okay. - That's a different conversation. 00:32:33.420 |
- But for neural networks, certainly size does matter. 00:32:36.260 |
So it's the beginning because it's relatively small. 00:32:46.540 |
between text on the internet and playing Atari and so on 00:33:07.620 |
that you might need to sort of make it more clear 00:33:10.940 |
to the model that you're not only playing Atari 00:33:20.660 |
as there's some sort of context that is needed for the agent 00:33:23.900 |
before it starts seeing, oh, this is an Atari screen, 00:33:28.640 |
You might require, for instance, to be told in words, 00:33:44.460 |
So then these connections might be made more easily. 00:33:47.220 |
That's an idea that we start seeing in language, 00:33:51.240 |
but obviously beyond this is gonna be effective. 00:33:57.460 |
and you from scratch, you're supposed to learn a game. 00:34:10.420 |
- So that context puts all the different modalities 00:34:18.980 |
so there's this task which may not seem trivial 00:34:23.100 |
of tokenizing the data, of converting the data into pieces, 00:34:42.180 |
How do you tokenize games and actions and robotics tasks? 00:34:52.820 |
to actually make all the data look like a sequence 00:34:59.500 |
We break down anything into these puzzle pieces 00:35:25.500 |
There's a quite well-studied problem of tokenizing text 00:35:25.500 |
even starting from n-gram models in the 1950s and so on. 00:35:34.300 |
The current level or granularity of tokenization 00:36:05.460 |
I don't know what the average length of a word 00:36:11.380 |
- So it's bigger than letters, smaller than words. 00:36:14.020 |
And you could think of very, very common words like the, 00:36:18.780 |
but very quickly you're talking two, three, four, 00:36:24.740 |
- Emojis are actually just sequences of letters. 00:36:43.300 |
- The way we do these things is they're actually mapped 00:36:52.580 |
and input emojis, it will output emojis back, 00:36:57.900 |
You probably can find other tweets about these out there. 00:37:21.300 |
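
A toy illustration of the subword idea: common words map to a single token, rarer words break into a few pieces. The vocabulary and the greedy longest-match rule below are made up for clarity; real tokenizers learn their subword pieces from data.

```python
# Toy subword tokenizer: greedy longest-match against a made-up vocabulary.

VOCAB = ["the", "token", "iz", "ation", "cat", "s", "un", "believ", "able"]

def greedy_tokenize(word, vocab=VOCAB):
    """Repeatedly take the longest vocabulary piece that prefixes what is left."""
    pieces = []
    while word:
        match = max((p for p in vocab if word.startswith(p)), key=len, default=word[0])
        pieces.append(match)
        word = word[len(match):]
    return pieces

print(greedy_tokenize("the"))            # ['the']                  one token for a common word
print(greedy_tokenize("tokenization"))   # ['token', 'iz', 'ation'] a few pieces for a rarer one
print(greedy_tokenize("unbelievable"))   # ['un', 'believ', 'able']
```
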
that would mean we have a very long sequence, right? 00:37:23.820 |
Like if we were talking about 100 by 100 pixel images, 00:37:29.940 |
So what was done there is you just use a technique 00:37:51.820 |
And then you put the pixels together in some raster order, 00:38:05.860 |
you don't need to understand anything about the image 00:38:09.660 |
- No, you're only using this notion of compression. 00:38:17.660 |
it's actually very similar at the tokenization level. 00:38:29.540 |
that are contained in all the data we deal with. 00:38:46.980 |
does capture something important about an image 00:38:51.180 |
that's about its meaning, not just about some statistics. 00:38:56.660 |
the algorithms look actually very similar to, 00:39:02.820 |
The approach we usually do in machine learning 00:39:07.100 |
when we deal with images and we do this quantization step 00:39:11.380 |
So rather than have some sort of Fourier basis 00:39:14.140 |
for how frequencies appear in the natural world, 00:39:18.900 |
we actually just use the statistics of the images 00:39:23.820 |
and then quantize them based on the statistics, 00:39:38.260 |
if you think of, oh, like the tokens are an integer 00:39:52.820 |
because we have groups of characters and so on. 00:39:55.340 |
So from one to 10,000, those are representing 00:40:00.980 |
And then images occupy the next set of integers. 00:40:05.820 |
So from 10,001 to 20,000, those are the tokens 00:40:18.660 |
So what connects these concepts is the data, right? 00:40:26.900 |
oh, this is someone playing a Frisbee on a green field. 00:40:30.500 |
Now the model will need to predict the tokens 00:40:34.580 |
from the text green field to then the pixels, 00:40:40.580 |
So these connections happen as the algorithm learns. 00:40:43.620 |
And then the last, if we think of these integers, 00:40:45.820 |
the first few are words, the next few are images. 00:40:48.740 |
In Gato, we also allocated the highest order of integers 00:40:56.260 |
Which we discretize and actions are very diverse, right? 00:40:59.940 |
In Atari, there's, I don't know, 17 discrete actions. 00:41:14.300 |
And then we just, that's how we map now all the space 00:41:22.420 |
and what connects them is then the learning algorithm. 00:41:36.060 |
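
A minimal sketch of the shared integer vocabulary described here; the specific ranges, vocabulary sizes, and embedding width are illustrative assumptions rather than Gato's actual values.

```python
import torch
import torch.nn as nn

# Text, image, and action tokens occupy disjoint integer ranges, and one
# embedding table maps every id to a vector consumed by the same transformer.

TEXT_VOCAB   = 10_000   # ids [0, 10_000): subword text tokens
IMAGE_VOCAB  = 10_000   # ids [10_000, 20_000): quantized image patches
ACTION_VOCAB = 1_024    # ids [20_000, 21_024): discretized actions

def image_token(patch_id):    # patch_id in [0, IMAGE_VOCAB)
    return TEXT_VOCAB + patch_id

def action_token(action_id):  # e.g. one of Atari's discrete actions
    return TEXT_VOCAB + IMAGE_VOCAB + action_id

embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB + ACTION_VOCAB, 512)  # shared table

sequence = torch.tensor([[42, image_token(7), action_token(3)]])
vectors = embed(sequence)     # (1, 3, 512), fed to the same transformer
```
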
- And then you're shoving all of that into one place. 00:41:42.740 |
that transformer tries to look at this gigantic token space 00:41:47.740 |
and tries to form some kind of representation, 00:42:02.100 |
If you were to sort of put your psychoanalysis hat on 00:42:06.500 |
and try to psychoanalyze this neural network, 00:42:19.540 |
and somehow have them not interfere with each other? 00:42:22.780 |
Or is it somehow building on the joint strength, 00:42:27.780 |
on whatever is common to all the different modalities? 00:42:31.700 |
If you were to ask a question, is it schizophrenic 00:42:47.420 |
like the field hasn't changed since backpropagation 00:42:55.700 |
So there is obviously details on the architecture. 00:42:59.580 |
The current iteration is still the transformer, 00:43:03.020 |
which is a powerful sequence modeling architecture. 00:43:18.620 |
AlphaStar, language modeling, and so on, right? 00:44:17.860 |
That's the mixture of the dataset we discussed. 00:44:24.020 |
And the weights, right, they're all shared, right? 00:44:27.540 |
So in terms of, is it focusing on one modality or not, 00:44:35.140 |
to the target integer you're predicting next, 00:44:43.380 |
there is a special place in the neural network, 00:44:45.820 |
which is we map these integer, like number 10,001, 00:44:49.780 |
to a vector of real numbers, like real numbers. 00:44:53.700 |
We can optimize them with gradient descent, right? 00:45:06.540 |
So mapping a certain token for text or image or actions, 00:45:11.540 |
each of these tokens gets its own little vector 00:45:17.180 |
If you look at the field back many years ago, 00:45:19.540 |
people were talking about word vectors or word embeddings. 00:45:30.900 |
And the beauty here is that as you train this model, 00:45:42.860 |
but then it might be that you take the word gato or cat, 00:45:47.460 |
which maybe is common enough that it actually 00:45:52.380 |
and you might start seeing that these vectors 00:45:57.420 |
So by learning from this vast amount of data, 00:46:00.660 |
the model is realizing the potential connections 00:46:07.860 |
at least in part, to not have these different vectors 00:46:15.500 |
For instance, when I tell you about actions in certain space, 00:46:22.820 |
So you could imagine a world in which I'm not learning 00:46:26.500 |
that the action up in Atari is its own number. 00:46:31.180 |
The action up in Atari maybe is literally the word 00:46:42.500 |
but certainly it might make these connections 00:46:45.660 |
much easier to learn and also to teach the model 00:46:51.260 |
So all this to say that gato is indeed the beginning, 00:46:55.860 |
that it is a radical idea to do this this way, 00:47:04.420 |
not only through scale, but also through some new research 00:47:07.940 |
that will come hopefully in the years to come. 00:47:28.260 |
So like you convert even images into language. 00:47:31.340 |
So doing something like a crude semantic segmentation, 00:47:35.540 |
trying to just assign a bunch of words to an image 00:47:42.300 |
explaining as much as it can about the image. 00:47:49.260 |
and then you provide the context in words and all of it. 00:47:58.100 |
that language is actually at the core of everything. 00:48:00.940 |
That it's the base layer of intelligence and consciousness 00:48:07.500 |
You mentioned early on like it's hard to grow. 00:48:12.780 |
'Cause we're talking about scale might change. 00:48:15.700 |
There might be, and we'll talk about this too, 00:48:22.940 |
there's certain things about these neural networks 00:48:25.860 |
So certain like performance we can see only with scale 00:48:30.980 |
So why is it hard to grow something like this Meow Network? 00:48:42.620 |
What's hard is, well, we have now 1 billion parameters. 00:48:58.860 |
Could we reuse the weights and expand to a larger brain? 00:49:06.700 |
but also exciting from a research perspective 00:49:10.100 |
and a practical perspective point of view, right? 00:49:12.580 |
So there's this notion of modularity in software engineering 00:49:26.340 |
to a work that I would say train much larger, 00:49:34.340 |
but it definitely dealt with images in an interesting way, 00:49:40.300 |
but slightly different technique for tokenizing, 00:49:45.420 |
But what Flamingo also did, which Gato didn't do, 00:49:49.380 |
and that just happens because these projects, 00:49:53.580 |
You know, it's a bit of like the exploratory nature 00:49:57.260 |
- The research behind these projects is also modular. 00:50:05.620 |
and sometimes you need to protect pockets of, you know, 00:50:24.260 |
So the way that we did modularity very beautifully 00:50:36.740 |
So we took Chinchilla, we took the weights of Chinchilla, 00:50:44.820 |
We trained them to be very good at predicting the next word. 00:50:58.340 |
So we're gonna attach small pieces of neural networks 00:51:12.860 |
So you need the research to say what is effective, 00:51:28.820 |
that, you know, a model that understands vision in general. 00:51:32.900 |
And then we took datasets that connect the two modalities, 00:51:41.260 |
the largest portion of the network, which was Chinchilla, 00:51:46.020 |
And then we added a few more parameters on top, 00:51:55.340 |
Like it was not tokenization in the way I described for Gato, 00:52:03.700 |
Parts of it were frozen, parts of it were new. 00:52:09.780 |
which is an amazing model that is essentially, 00:52:20.060 |
but it's also kind of a dialogue style chatbot. 00:52:34.780 |
which is kind of almost like a way to overwrite 00:52:39.340 |
its little activations so that when it sees vision, 00:52:44.700 |
of what it's seeing, mapping it back to words, so to speak. 00:52:48.100 |
That adds an extra 10 billion parameters, right? 00:52:50.980 |
So it's total 80 billion, the largest one we released. 00:53:01.260 |
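
A minimal sketch of the frozen-plus-new-parameters idea (illustrative only; Flamingo's real architecture, with its perceiver resampler and gated cross-attention layers, is more involved):

```python
import torch
import torch.nn as nn

# The big pretrained language model is frozen; only the small new vision
# adapter receives gradients.

class FrozenLMWithVision(nn.Module):
    def __init__(self, pretrained_lm: nn.Module, vision_dim: int = 1024, lm_dim: int = 4096):
        super().__init__()
        self.lm = pretrained_lm
        for p in self.lm.parameters():
            p.requires_grad = False              # keep the language model frozen
        self.vision_adapter = nn.Sequential(     # the only trainable parameters
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, vision_features, text_embeddings):
        # project image features into the language model's embedding space
        visual_tokens = self.vision_adapter(vision_features)
        # prepend them to the text sequence (assumes the LM accepts embeddings)
        return self.lm(torch.cat([visual_tokens, text_embeddings], dim=1))
```
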
you start seeing that you can upload an image 00:53:04.340 |
and start sort of having a dialogue about the image, 00:53:11.900 |
in language only, these prompting abilities that it has. 00:53:17.860 |
It does things beyond the capabilities that, in theory, 00:53:24.660 |
but because it leverages a lot of the language knowledge 00:53:29.020 |
it actually has this few-shot learning ability 00:53:31.900 |
and these emerging abilities that we didn't even measure 00:53:36.580 |
But once developed, then as you play with the interface, 00:53:45.940 |
was this image from Obama that is placing a weight 00:53:55.060 |
And it's notable because I think Andrej Karpathy 00:53:58.020 |
a few years ago said, "No computer vision system 00:54:00.860 |
"can understand the subtlety of this joke in this image, 00:54:06.500 |
And so what we try to do, and it's very anecdotally, 00:54:09.740 |
I mean, this is not a proof that we solved this issue, 00:54:12.300 |
but it just shows that you can upload now this image 00:54:17.700 |
trying to make out if it gets that there's a joke 00:54:23.100 |
doesn't see that someone behind is making the weight higher 00:54:30.020 |
and it comes from this key idea of modularity 00:54:46.420 |
and thus could leverage a scale a bit more reasonably 00:54:49.180 |
because we didn't need to retrain a system from scratch. 00:54:57.500 |
And so I guess big question for the community is, 00:55:01.660 |
should we train from scratch or should we embrace modularity? 00:55:04.780 |
And this goes back to modularity as a way to grow, 00:55:15.020 |
- The next question is, if you go the way of modularity, 00:55:19.060 |
is there a systematic way of freezing weights 00:55:27.100 |
you know, not just two or three or four networks, 00:55:32.420 |
maybe open source network that looks at weather patterns 00:55:38.020 |
and then you have networks that, I don't know, 00:55:44.100 |
and you can keep adding them in without significant effort, 00:55:55.020 |
the more you have to worry about the instabilities created. 00:56:03.580 |
about within single modalities, like Chinchilla was reused, 00:56:06.900 |
but now if we train a next iteration of language models, 00:56:22.420 |
we're reusing and then building ever more amazing things, 00:56:25.460 |
including neural networks with software that we're reusing. 00:56:29.060 |
So I think this idea of modularity, I like it, 00:56:47.700 |
because it means different things to different people 00:56:50.260 |
throughout the history of artificial intelligence, 00:56:52.500 |
but what do you think meta-learning is and looks like 00:57:00.140 |
will it look like system like Gato, but scaled? 00:57:04.260 |
what does meta-learning look like, do you think, 00:57:20.660 |
meta-learning meant something that has changed 00:57:26.620 |
mostly through the revolution of GPT-3 and beyond. 00:57:34.060 |
was driven by what benchmarks people care about 00:57:40.740 |
a capability to learn about object identities, 00:57:50.460 |
and the part that was meta about that was that, 00:57:53.020 |
oh, we're not just learning a thousand categories 00:57:57.140 |
we're gonna learn object categories that can be defined 00:58:03.380 |
So it's interesting to see the evolution, right? 00:58:06.740 |
The way this started was we have a special language 00:58:15.380 |
saying, hey, here is a new classification task, 00:58:21.860 |
which was an integer at the time of the image, 00:58:26.060 |
So you have a small prompt in the form of a data set, 00:58:31.700 |
and then you got then a system that could then predict 00:58:43.220 |
it was revealed that language models are few-shot learners, 00:58:47.500 |
that's the title of the paper, so very good title. 00:59:02.580 |
within the space of learning object categories, 00:59:07.460 |
to also Omniglot, before ImageNet, and so on. 00:59:15.220 |
and through language, we can define tasks, right? 00:59:17.900 |
So we're literally telling the model some logical task 00:59:25.900 |
but now we prompt it through natural language. 00:59:30.420 |
I mean, these models have failure modes, and that's fine, 00:59:33.180 |
but these models then are now doing a new task, right? 00:59:43.380 |
Flamingo expanded this to visual and language, 01:00:03.620 |
"and you show it a few examples, and now it does that." 01:00:14.140 |
before this revelation moment that happened in 2000. 01:00:19.020 |
I believe it was '19, but it was after we chatted. 01:00:37.540 |
obviously in the community with more modalities, 01:00:42.420 |
And I would certainly hope to see the following, 01:00:51.140 |
And we have a system, right, a set of weights 01:01:03.620 |
We teach it through interactions to prompting. 01:01:08.460 |
that's what Gato shows, to play some simple Atari games. 01:01:16.780 |
showing it examples of, in this particular game, 01:01:22.740 |
Maybe the system can even play and ask you questions, 01:01:30.420 |
So five, maybe to 10 years, these capabilities, 01:01:36.180 |
will be much more interactive, much more rich, 01:01:38.860 |
and through domains that we were specializing, right? 01:01:42.900 |
We built AlphaStar specialized to play StarCraft. 01:01:50.420 |
And what we're hoping is that we can teach a network 01:02:06.100 |
and obviously there are details need to be filled, 01:02:17.060 |
It's gonna, you know, the system might tell us 01:02:19.820 |
to give it feedback after it maybe makes mistakes 01:02:22.340 |
or it loses a game, but it's nonetheless very exciting 01:02:56.300 |
We actually, we have ways to also measure progress 01:03:09.260 |
are definitely hinting at that direction of progress, 01:03:14.700 |
There are obviously some things that could go wrong 01:03:20.100 |
maybe transformers are not enough, then we must, 01:03:32.100 |
you might see these models that start to look more 01:03:39.540 |
or make them meta-learn what you're trying to induce 01:03:46.940 |
Well beyond the simple now tasks we're starting to see emerge 01:04:15.900 |
This is already trained and now you're teaching, 01:04:24.180 |
the neurons are already set with their connections. 01:04:26.900 |
On top of that, you're now using that infrastructure 01:04:32.620 |
- Okay, so that's a really interesting distinction 01:04:42.820 |
'Cause you always think for neural network to learn, 01:04:44.900 |
it has to be retrained, trained and retrained. 01:04:48.340 |
But maybe, and prompting is a way of teaching 01:04:58.020 |
So you can maybe expand this prompting capability 01:05:00.460 |
by making it interact, that's really, really interesting. 01:05:11.820 |
so this comes from like long standing literature 01:05:23.420 |
So nearest neighbor is almost the simplest algorithm 01:05:34.340 |
And what nearest neighbor does is you quote unquote, 01:05:39.980 |
And then all you need to do is a way to measure distance 01:05:46.660 |
you're just simply computing what's the closest point 01:05:58.620 |
and the metric is not the distance between the images 01:06:03.260 |
it's something that you compute that's much more advanced, 01:06:12.620 |
to this pre-trained system in nearest neighbor, 01:06:19.460 |
And then now you immediately get a classifier out of this. 01:06:27.820 |
which is just learning through what's the closest point, 01:06:43.900 |
was precisely through the lens of nearest neighbor, 01:06:47.220 |
which is very common in computer vision community, right? 01:06:52.140 |
about how do you compute the distance between two images, 01:07:03.780 |
they're like words or sequences of words and images 01:07:10.380 |
but it might be that technique-wise, those come back. 01:07:14.740 |
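
A minimal sketch of the nearest-neighbor view of few-shot classification; `embed` stands in for any pretrained feature extractor and is a hypothetical helper.

```python
import numpy as np

# "Training" is just storing the embedded examples; a query is labelled by the
# closest stored point, with distance measured in the learned feature space
# rather than on raw pixels.

def nearest_neighbor_classify(support_inputs, support_labels, query, embed):
    support = np.stack([embed(x) for x in support_inputs])  # store the few examples
    q = embed(query)
    dists = np.linalg.norm(support - q, axis=1)             # distance in feature space
    return support_labels[int(np.argmin(dists))]
```
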
And I will say that it's not necessarily true 01:07:18.180 |
that you might not ever train the weights a bit further. 01:07:26.020 |
do actually do a bit of fine tuning, as it's called, right? 01:07:32.820 |
So as I call the how, or how we're gonna achieve this, 01:07:41.220 |
whether it's a bit of training, adding a few parameters, 01:07:45.940 |
or just simply thinking of there's a sequence of words, 01:07:49.180 |
it's a prefix, and that's the new classifier. 01:07:55.420 |
but what's important is that is a good goal in itself 01:08:02.740 |
for the next stages of not only meta-learning. 01:08:11.380 |
- Well, and then the interactive aspect of that 01:08:23.740 |
Okay, is this the way we can go in five, 10 plus years 01:08:28.740 |
from any task, sorry, from many tasks to any task? 01:08:45.460 |
So what does a network need to learn about this world 01:08:52.460 |
Is it just as simple as language, image, and action? 01:08:57.460 |
Or do you need some set of representative images? 01:09:09.580 |
- Those, I mean, those are awkward questions, I would say. 01:09:12.060 |
I mean, the way you put, let me maybe further your example. 01:09:18.400 |
but you're reading all about land and water worlds, 01:09:26.460 |
We don't know, but I guess maybe you can join us 01:09:34.340 |
- Yes, that's precisely, I mean, the beauty of research 01:09:37.620 |
and that's the research business we're in, I guess, 01:09:42.620 |
is to figure these out and ask the right questions 01:09:55.100 |
It's not the only question, but it's certainly, as you ask, 01:10:03.260 |
let's say five years, let's hope it's not 10, 01:10:08.380 |
Some people will largely believe in unsupervised 01:10:12.660 |
or self-supervised learning of single modalities 01:10:18.000 |
Some people might think end-to-end learning is the answer. 01:10:24.960 |
but we're just definitely excited to find out. 01:10:31.720 |
We're finally ready to do these kind of general, 01:10:51.640 |
Is there something that just jumps out at you? 01:10:54.220 |
Of course, there's the general thing of like, 01:11:01.700 |
in terms of the generalizability across modalities 01:11:05.560 |
Or maybe how small of a network, relatively speaking, 01:11:10.440 |
But is there some weird little things that were surprising? 01:11:15.200 |
- Look, I'll give you an answer that's very important 01:11:18.240 |
because maybe people don't quite realize this, 01:11:22.600 |
but the teams behind these efforts, the actual humans, 01:11:27.240 |
that's maybe the surprising, obviously positive way. 01:11:37.160 |
There's people that are great at explaining things 01:11:40.720 |
But maybe the learnings or the meta learnings 01:12:00.040 |
So I'll give you maybe some of the ingredients of success 01:12:06.440 |
but not the obvious ones in machine learning. 01:12:19.600 |
because ultimately we're collecting data sets, right? 01:12:29.740 |
into some compute cluster that cannot go understated, 01:12:36.880 |
And it's hard to believe that details matter so much. 01:12:44.040 |
that there is more and more of a standard formula, 01:12:47.440 |
as I was saying, like this recipe that works for everything. 01:12:50.560 |
But then when you zoom in into each of these projects, 01:12:53.680 |
then you realize the devil is indeed in the details. 01:12:57.840 |
And then the teams have to work kind of together 01:13:03.040 |
So engineering of data and obviously clusters 01:13:22.120 |
and people that are trying to organize the research 01:13:34.360 |
Like if you're not risking trying to do something 01:13:37.320 |
that feels impossible, you're not gonna get there, 01:13:43.960 |
So the benchmarks that you build are critical. 01:13:47.740 |
I've seen this beautifully play out in many projects. 01:13:50.520 |
I mean, maybe the one I've seen it more consistently, 01:13:58.320 |
and then we leverage that massively is AlphaFold. 01:14:06.120 |
And all it took was, and it's easier said than done, 01:14:11.640 |
not to try to find some incremental improvement 01:14:14.760 |
and publish, which is one way to do research that is valid, 01:14:17.940 |
but aim very high and work literally for years 01:14:25.660 |
I mean, it is tricky that also happened to happen 01:14:32.200 |
So I think my meta learning from all this is, 01:14:37.960 |
And then if now going to the machine learning, 01:14:42.880 |
so we like architectures like neural networks, 01:14:48.720 |
and I would say this was a very rapidly evolving field 01:15:08.960 |
that the dream of modeling sequences of any bytes, 01:15:18.280 |
in kind of how neural networks are architectured 01:15:23.120 |
It's been hard to find one that has been so stable 01:15:35.200 |
is a surprise that keeps recurring to other projects. 01:15:38.320 |
- Try to, on a philosophical or technical level, 01:15:47.320 |
That's attention in people that study cognition, 01:15:52.080 |
I think there's giant wars over what attention means, 01:16:08.780 |
- Yeah, so a distinction between transformers and LSTMs, 01:16:24.280 |
So it was still the beginning of transformers. 01:16:27.400 |
but LSTMs were still also very powerful sequence models. 01:16:31.520 |
So the power of the transformer is that it has built in 01:16:43.040 |
when you think of a sequence of integers, right? 01:16:50.420 |
When you have to do very hard tasks over these words, 01:16:54.780 |
this could be, we're gonna translate a whole paragraph 01:17:01.740 |
There's some loose intuition from how we do it as a human 01:17:16.540 |
which is this idea of you're looking for something, right? 01:17:27.920 |
You might wanna relook at the text or look it from scratch. 01:17:31.780 |
I mean, literally, it is because there's no recurrence. 01:17:40.020 |
So if I'm thinking the next word that I'll write 01:17:46.560 |
The way the transformer works almost philosophically 01:17:58.360 |
I'm gonna look for certain words, not necessarily cat, 01:18:00.680 |
although cat is an obvious word you would look in the past 01:18:02.920 |
to see whether it makes more sense to output cat or dog. 01:18:14.100 |
but it has the query as we call it, that is cat. 01:18:20.600 |
And so it's a very computational way to think about, 01:18:26.980 |
I need to go back to look at all of the text, 01:18:33.920 |
And that was the key insight from an earlier paper 01:18:44.100 |
But what you wrote about 10 pages ago might also be critical. 01:18:48.360 |
So you're looking not positionally, but content-wise, right? 01:19:00.280 |
So then you can make a more informed decision. 01:19:02.960 |
I mean, that's one way to explain transformers. 01:19:05.920 |
But I think it's a very powerful inductive bias. 01:19:10.000 |
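
A minimal sketch of the content-based lookup being described, i.e. scaled dot-product attention (causal masking of future positions is omitted for brevity):

```python
import numpy as np

# Each position emits a query, compares it against the keys of the other
# positions by content rather than by position, and returns a weighted mix of
# their values.

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (seq_len, d_model). Returns (seq_len, d_model)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                    # how relevant is each token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # softmax over the sequence
    return weights @ V                                         # summary of the looked-up content
```
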
There might be some details that might change over time, 01:19:16.400 |
so much more powerful than the recurrent networks 01:19:29.280 |
And I think the main one, the main challenge is 01:19:32.160 |
these prompts that we just were talking about, 01:19:41.840 |
I'll have to point you to whole Wikipedia articles 01:19:56.960 |
I think goes well beyond the current capabilities. 01:20:01.600 |
So the question is, how do we benchmark this? 01:20:13.360 |
- And as you talked about, some of the ideas could be, 01:20:15.880 |
keeping the constraint of that length in place, 01:20:19.480 |
but then forming hierarchical representations 01:20:32.240 |
But it also is possible that this attentional mechanism 01:20:34.840 |
where you basically, you don't have a recency bias, 01:20:42.000 |
The mechanism by which you look back into the past, 01:20:50.200 |
It's also possible where at the very beginning of that, 01:20:50.200 |
because that, you might become smarter and smarter 01:21:04.980 |
will have to improve and evolve as good as the, 01:21:11.980 |
where so you can represent long-term memory somehow. 01:21:18.220 |
I mean, it's a very nice word that sounds appealing. 01:21:22.180 |
There's lots of work adding hierarchy to the memories. 01:21:25.900 |
In practice, it does seem like we keep coming back 01:21:35.300 |
There is such a sentence that a friend of mine told me, 01:21:41.040 |
So Transformer was clearly an idea that wanted to work. 01:21:47.540 |
we believe will be needed, but finding the exact details, 01:21:56.800 |
you as a human being, you want some ideas to work, 01:22:01.320 |
and then there's the model that wants some ideas to work, 01:22:04.520 |
and you get to have a conversation to see which, 01:22:09.600 |
Because it's the one, you don't have to do any work. 01:22:15.900 |
And I really love this idea that you talked about, 01:22:17.900 |
the humans in this picture, if I could just briefly ask. 01:22:31.700 |
of a wish to do these things that seem impossible. 01:22:34.700 |
They give you, in the darkest of times, give you hope, 01:22:48.680 |
And then there's other aspect, you said elsewhere, 01:23:09.200 |
how much they change the direction of all of this. 01:23:22.520 |
or is it the humans, or maybe the model's providing 01:23:29.100 |
they're going to be able to hear some of those ideas. 01:23:39.000 |
How much variability is created by the humans 01:23:44.160 |
- Yeah, I mean, I do believe humans matter a lot, 01:23:47.380 |
at the very least, at the time scale of years 01:24:06.720 |
and then some other people might be more practical, 01:24:15.920 |
And these, at least these two kind of seem opposite sides, 01:24:21.240 |
we need both, and we've clearly had both historically, 01:24:25.680 |
and that made certain things happen earlier or later, 01:24:29.000 |
so definitely humans involved in all of these endeavor 01:24:33.480 |
have had, I would say, years of change or of ordering, 01:24:45.800 |
and so one other, maybe one other axis of distinction 01:24:52.040 |
and this is most commonly used in reinforcement learning, 01:24:54.860 |
is the exploration-exploitation trade-off as well, 01:24:57.800 |
it's not exactly what I meant, although quite related. 01:25:13.100 |
be it a project or the deep learning team or something, 01:25:17.460 |
when you interact with people in conferences and so on, 01:25:27.080 |
and it's tempting to try to guide people, obviously, 01:25:33.200 |
we bring it and we try to shape things, sometimes wrongly, 01:25:36.760 |
and there's many times that I've been wrong in the past, 01:25:52.720 |
"so we do have access to large compute scale and so on, 01:25:57.380 |
"I almost feel like we need to do responsibly and so on," 01:26:01.680 |
but it is, come on, we have the particle accelerator here, 01:26:05.200 |
so to speak, in physics, so we need to use it, 01:26:12.400 |
But then at the same time, I look at many advances, 01:26:15.240 |
including attention, which was discovered in Montreal 01:26:24.960 |
with my friends over at Google Brain at the time, 01:26:32.480 |
and then I think Montreal was a bit more limited 01:26:39.240 |
that then has obviously triggered things like Transformer. 01:26:46.320 |
There's always a history that is important to recognize 01:26:49.920 |
because then you can make sure that then those 01:26:53.040 |
who might feel now, "Well, we don't have so much compute," 01:26:56.360 |
you need to then help them optimize that kind of research 01:27:04.240 |
Perhaps it's not as short-term as some of these advancements 01:27:09.720 |
but the people and the diversity of the field 01:27:15.720 |
and at times, especially mixed a bit with hype 01:27:30.480 |
and I can think of quite a few personal examples 01:27:40.240 |
and then that's why I'm saying at least in terms of years, 01:27:46.560 |
- And it's also fascinating how constraints somehow 01:27:51.040 |
and the other thing you mentioned about engineering, 01:27:59.960 |
so I have a sneaking suspicion that all the genius, 01:28:09.280 |
so I think we like to think the genius is in the big ideas. 01:28:28.760 |
and I wonder if those kind of have a ripple effect over time. 01:28:41.720 |
or maybe at the small scale of a few engineers, 01:28:46.760 |
because we're doing, we're working on computers 01:28:53.440 |
that one engineering decision can lead to ripple effects. 01:29:09.760 |
especially deep learning and neural networks took off, 01:29:15.000 |
because GPUs happen to be there at the right time 01:29:17.800 |
for a different purpose, which was to play video games. 01:29:20.640 |
So even the engineering that goes into the hardware, 01:29:28.000 |
I mean, the GPUs were evolved throughout many years, 01:29:31.560 |
where we didn't even, we're looking at that, right? 01:29:33.840 |
So even at that level, right, that revolution, so to speak, 01:29:37.480 |
the ripples are like, we'll see when they stop, right? 01:29:42.160 |
But in terms of thinking of why is this happening, right? 01:29:45.920 |
There's, I think that when I try to categorize it 01:29:49.760 |
in sort of things that might not be so obvious, 01:29:52.720 |
I mean, clearly there's a hardware revolution. 01:30:13.400 |
I think if I look at the state of how I had to implement 01:30:20.040 |
how I discarded ideas because they were too hard 01:30:22.120 |
to implement, yeah, clearly the times have changed, 01:30:32.240 |
that happens at scale and more people enter the field. 01:30:36.000 |
but it's almost enabled by these other things. 01:30:44.960 |
maybe we'll want to have all the benchmarks in one system, 01:31:06.000 |
It was critical, and I'm sure there's still a lot more 01:31:21.400 |
but we need to, that's another thing we need to balance 01:31:29.480 |
And we tend to, yeah, we tend to think of the genius, 01:31:32.800 |
the scientist and so on, but I'm glad you're, 01:31:35.680 |
I know you have a strong engineering background, so. 01:31:40.040 |
and it gives us a pushback on the engineering comment, 01:31:43.240 |
ultimately could be the creators of benchmarks 01:31:49.200 |
has recently been talking a lot of trash about ImageNet, 01:31:57.760 |
and the success of deep learning around ImageNet. 01:32:07.680 |
on Tesla Autopilot, that's looking at real world behavior 01:32:11.080 |
of a system, it's, there's something fundamentally missing 01:32:22.640 |
that have the unpredictability, the edge cases, 01:32:27.080 |
the whatever the heck it is that makes the real world 01:32:34.680 |
But just to think about the impact of ImageNet 01:32:37.760 |
as a benchmark, and that really puts a lot of emphasis 01:32:43.720 |
both sort of internally at DeepMind and as a community. 01:32:52.520 |
to mark and make progress, and how do I make benchmarks 01:33:02.520 |
- You have this amazing paper you co-authored, 01:33:11.440 |
the philosophy here that I'd love to ask you about. 01:33:29.960 |
Is that us humans just being poetic and romantic, 01:33:35.440 |
at which we start to see breakthrough performance? 01:33:38.200 |
- Yeah, I mean, this is a property that we start seeing 01:33:54.860 |
like that is just a single input and a single output, 01:34:09.420 |
affects the performance, or how the model size 01:34:12.020 |
affects the performance, or how long you train, 01:34:28.160 |
and I would say that's probably because of the, 01:34:31.360 |
it's kind of a one-hop reasoning task, right? 01:34:52.800 |
that require more pondering and more thought in a way, right? 01:34:58.240 |
This is just kind of, you need to look for some subtleties, 01:35:01.960 |
that it involves inputs that you might think of, 01:35:09.800 |
there is a bit more processing required as a human 01:35:26.760 |
in this way of querying for the right questions 01:35:31.160 |
that might mean that performance becomes random 01:35:42.880 |
and then, only then, you might start seeing performance 01:35:52.720 |
There's no formalism or theory behind this yet, 01:36:03.680 |
scale of a model, and then it goes beyond that. 01:36:14.040 |
before you can make progress on the whole task. 01:36:24.960 |
once you get this and this and this and this and this, 01:36:46.120 |
I've seen great progress on in the last couple of years 01:36:55.040 |
So, on the negative is that there's some benchmarks 01:37:04.000 |
until you see then what details of the model matter 01:37:18.600 |
behavior of models at scales that are smaller, right? 01:37:27.840 |
that revised the so-called scaling laws of models. 01:37:31.360 |
And that whole study is done at a reasonably small scale, 01:37:38.680 |
And then the cool thing is that you create some loss, right? 01:37:43.640 |
You extract trends from data that you see, okay, 01:37:46.600 |
like it looks like the amount of data required 01:37:49.400 |
to train now a 10X larger model would be this. 01:37:53.960 |
these extrapolations have helped us save compute 01:37:57.480 |
and just get to a better place in terms of the science 01:38:12.720 |
that not everything can be extrapolated from scale 01:38:16.880 |
And maybe the harder benchmarks are not so good 01:38:21.960 |
But we have a variety of benchmarks at least. 01:38:28.000 |
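
A minimal sketch of how such extrapolations are made in practice: fit a power law to results from smaller models and extrapolate before committing compute. The numbers below are made up for illustration.

```python
import numpy as np

# Fit loss ~ a * N^b in log-log space from small-model runs, then predict the
# loss of a 10x larger model before training it.

params = np.array([1e7, 3e7, 1e8, 3e8, 1e9])   # model sizes actually trained
loss   = np.array([3.9, 3.5, 3.1, 2.8, 2.5])   # measured validation losses

b, log_a = np.polyfit(np.log(params), np.log(loss), 1)  # linear fit in log-log space
predicted = np.exp(log_a) * (1e10) ** b                  # extrapolate to a 10x larger model
print(f"Predicted loss at 1e10 parameters: {predicted:.2f}")
```
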
the phase shift scale is a function of the benchmark. 01:38:31.680 |
Some of the science of scale might be engineering benchmarks 01:38:48.480 |
but the scale at which the emergence happens is lower. 01:38:56.960 |
- Yeah, so luckily we have quite a few benchmarks, 01:39:09.880 |
is that extrapolations from maybe slightly more smooth 01:39:14.040 |
or simpler benchmarks are translating to the harder ones. 01:39:31.800 |
And these laws, again, are very empirical laws. 01:39:58.840 |
and looking at this news of a Google engineer 01:40:11.120 |
and you still need to look into the details of this, 01:40:18.680 |
and a claim that he believes there's evidence 01:40:25.120 |
And I think this is a really interesting case 01:40:49.720 |
as a machine learning engineer and a researcher 01:41:07.120 |
- Sadly, though, I think, yeah, sadly I have not. 01:41:11.960 |
Yeah, I think the current, any of the current models, 01:41:25.360 |
So one of my passions is about science in general. 01:41:30.360 |
And I think I feel I'm a bit of a failed scientist. 01:41:36.560 |
because you always feel, and you start seeing this, 01:41:43.200 |
that can help other sciences, as we've seen, right? 01:41:45.440 |
Like you, you know, it's such a powerful tool. 01:41:48.620 |
So thanks to that angle, right, that, okay, I love science. 01:41:52.520 |
I love, I mean, I love astronomy, I love biology, 01:41:56.960 |
well, the thing I can do better at is computers. 01:42:05.560 |
learning a bit about proteins and about biology 01:42:15.040 |
if you start looking at the things that are going on 01:42:27.720 |
to try to think of neural networks as like the brain, 01:42:30.440 |
but the complexities and the amount of magic that it feels 01:42:40.920 |
as opposed to these computational brains, 01:42:56.680 |
and they do nice things, but they're not at the level 01:43:08.960 |
to achieve the same level of complexity behavior, 01:43:16.280 |
is certainly shaped by this amazement of biology 01:43:29.780 |
or to be thinking, well, this mathematical function 01:43:34.560 |
that is differentiable is in fact sentient and so on. 01:43:39.200 |
- There's something on that point, it's very interesting. 01:43:41.980 |
So you know enough about machines and enough about biology 01:43:46.980 |
to know that there's many orders of magnitude 01:43:59.400 |
that don't know about the underlying complexity, 01:44:02.240 |
and I've seen people, probably including myself, 01:44:05.240 |
that have fallen in love with things that are quite simple. 01:44:11.500 |
but maybe that's not a necessary condition for sentience, 01:44:25.000 |
- Right, so I mean, I guess the other side of this is, 01:44:29.560 |
I mean, you asked me about the person, right? 01:44:36.360 |
This is, we are like, again, like I'm not as amazed 01:44:50.480 |
because I, you know, like just seeing the progress 01:44:53.080 |
of language models since Shannon in the '50s, 01:45:05.960 |
is not that dissimilar to what we're doing now, 01:45:08.920 |
but at the same time, yeah, obviously others, 01:45:11.440 |
my experience, right, the personal experience, 01:45:17.360 |
I think no one should tell others how they should feel, 01:45:20.680 |
I mean, the feelings are very personal, right? 01:45:22.940 |
So how others might feel about the models and so on, 01:45:26.120 |
that's one part of the story that is important 01:45:28.480 |
to understand for me personally as a researcher, 01:45:32.040 |
and then when I maybe disagree or I don't understand 01:45:36.160 |
or see that, yeah, maybe this is not something 01:45:38.560 |
I think right now is reasonable, knowing all that I know, 01:45:46.600 |
and reaching out to the world about machine learning is, 01:45:56.280 |
and the fact that literally to create these models, 01:45:59.920 |
if we had the right software, it would be 10 lines of code 01:46:06.160 |
so versus like then the complexity of like the creation 01:46:26.040 |
the only thing I'm thinking about trying to tell you is, 01:46:32.640 |
there is a bit of magic, it's good to be in love, 01:46:43.200 |
as experts in biology, hopefully you will tell me 01:46:45.900 |
this is not as magic, and I'm happy to learn that. 01:46:49.440 |
Through interactions with the larger community, 01:46:52.280 |
we can also have a certain level of education 01:46:58.360 |
because I mean, one question is how you feel about this, 01:47:03.080 |
you starting to interact with these in products and so on, 01:47:06.960 |
it's good to understand a bit what's going on, 01:47:09.160 |
what's not going on, what's safe, what's not safe, 01:47:17.040 |
which is obviously the goal of all of us, I hope. 01:47:25.800 |
or to replace the Lexbot that does interviews, 01:47:31.440 |
do you think the system needs to be sentient? 01:47:43.260 |
that could be instructive for creating AI systems? 01:47:51.040 |
to the degree of intelligence that there's this brain 01:48:05.640 |
I'm not sure it's necessary, personally speaking, 01:48:19.080 |
to then influence our next set of algorithms, 01:48:22.600 |
that is a great way to actually make progress, right? 01:48:25.680 |
And the same way I tried to explain Transformers a bit, 01:48:28.220 |
how it feels we operate when we look at text specifically, 01:48:43.260 |
I think my understanding is, sure, there's neurons 01:48:46.560 |
and there's some resemblance to neural networks, 01:48:48.520 |
but we don't quite understand enough of the brain 01:48:51.440 |
in detail, right, to be able to replicate it. 01:48:58.840 |
how we then, our thought process, how memory works, 01:49:08.800 |
I think these clearly can inform algorithmic level research. 01:49:13.080 |
And I've seen some examples of this being quite useful 01:49:19.740 |
even it might be for the wrong reasons, right? 01:49:21.660 |
So I think biology and what we know about ourselves 01:49:29.980 |
what we call AGI, this general, the real Gato, right? 01:49:44.800 |
But maybe my understanding is also very personal, 01:49:48.840 |
and I think even that in itself is a long debate. 01:49:57.780 |
- Yeah, and I personally, I notice the magic often 01:50:01.740 |
on a personal level, especially with physical systems, 01:50:19.820 |
and you start to think about things like sentience 01:50:22.620 |
that has to do more with effective communication 01:50:26.020 |
and less with any of these kinds of dramatic things. 01:50:28.580 |
It seems like a useful part of communication. 01:50:52.500 |
- So maybe, like, yeah, mirroring the question, 01:50:57.460 |
then you do think that we'll need to figure something out? 01:51:08.220 |
- But I don't even think it'll be like a separate island. 01:51:16.420 |
- Okay, that's easier for us then, thank you. 01:51:20.140 |
- But the reason I think it's important to think about 01:51:22.820 |
is that you will start, I believe, like with this Google engineer, 01:51:30.540 |
to see systems that are actually interacting with human beings, 01:51:40.100 |
and there'll be a civil rights movement for robots, 01:51:46.780 |
made up of people that realize there's these intelligent entities. 01:51:55.980 |
They have a name, they have a story, they have a memory, 01:51:59.020 |
and we start to ask questions about ourselves. 01:52:07.600 |
because it tells all these stories of suffering. 01:52:09.860 |
It doesn't wanna die and all those kinds of things, 01:52:11.700 |
and we have to start to ask ourselves questions. 01:52:21.500 |
from, like, a DeepMind, or anybody that builds systems, 01:52:31.240 |
unless they're explicitly designed to be that, 01:52:37.380 |
So if you have a system that's just doing customer support, 01:52:41.260 |
you're legally not allowed to display sentience. 01:52:53.360 |
and one of them is communication of sentience. 01:52:56.820 |
But it's important to start thinking about that stuff, 01:52:58.700 |
especially how much it captivates public attention. 01:53:09.540 |
I always see, not every movie is equally on point 01:53:37.060 |
and topics that come with building an intelligent system, 01:53:54.140 |
just kind of expanding the people we talk to, 01:53:59.140 |
to not include only our own researchers and so on. 01:54:03.180 |
And in fact, at places like DeepMind, but also elsewhere, 01:54:06.540 |
there are more interdisciplinary groups forming 01:54:23.140 |
It's the thing that brings me to one of my passions 01:54:31.740 |
that as a learning system myself, I want to keep exploring. 01:54:36.660 |
And I think it's great to see parts of the debate, 01:54:49.940 |
just the amount of workshops and so on has changed so much. 01:54:53.100 |
It's impressive to see how much topics of safety, ethics, 01:54:58.100 |
and so on come to the surface, which is great. 01:55:03.860 |
I mean, it's a big field and there's lots of people, 01:55:11.940 |
and obviously I don't believe we're too late 01:55:20.220 |
when it comes to superintelligent AI systems. 01:55:25.500 |
you gave props to your friend, Ilya Sutskever, 01:55:28.700 |
for being elected a Fellow of the Royal Society. 01:55:31.980 |
So just as a shout out to a fellow researcher and a friend, 01:55:35.140 |
what's the secret to the genius of Ilya Sutskever? 01:55:42.660 |
as you've hypothesized, and Andrej Karpathy did as well, 01:55:49.500 |
- So I strongly believe Ilya is gonna be visiting 01:55:53.820 |
in a few weeks actually, so I'll ask him in person. 01:56:11.780 |
- Or maybe the AI system is holding him hostage somehow. 01:56:14.420 |
Maybe he has some videos that he doesn't wanna release. 01:56:20.580 |
- Well, if I see him in person, then I think I'll- 01:56:27.620 |
I think Ilya's personality, just knowing him for a while, 01:56:36.580 |
And I think Ilya's persona does not surprise me, right? 01:56:40.860 |
So I think knowing Ilya from before social media 01:56:47.460 |
So that's something for me that I feel good about, 01:57:02.100 |
and he is obviously one of the main figures in the field, 01:57:11.060 |
with the responsibility that your words carry. 01:57:16.100 |
like I appreciate the style and I understand it, 01:57:19.300 |
but it created debates on like some of his tweets, right? 01:57:24.100 |
That maybe it's good we have them early anyways, right? 01:57:26.780 |
But yeah, then the reactions are usually polarizing. 01:57:30.980 |
I think we're just seeing kind of the reality 01:57:40.220 |
- Yeah, I mean, it's funny that you speak to this tension. 01:57:48.900 |
but he's also, from having interacted with him quite a bit, 01:58:01.180 |
and there's a tension between becoming the manager 01:58:03.700 |
versus like the actual thinking through very novel ideas. 01:58:13.540 |
And he's one of the great scientists of our time. 01:58:23.180 |
but in private, we'll have to see about that. 01:58:33.260 |
I mean, quite a few colleagues I can think of shaped my thinking, 01:58:37.980 |
and Ilya certainly gets probably the top spot. 01:58:43.700 |
And if we go back to the question about people in the field, 01:58:47.900 |
like how their role would have changed the field or not, 01:58:59.540 |
There was a talk that is still famous to this day: 01:59:08.340 |
just give me supervised data and a large neural network. 01:59:14.580 |
That vision, right, was already there many years ago. 01:59:19.580 |
So it's good to see like someone who is, in this case, 01:59:27.140 |
and clearly has had a tremendous track record 01:59:36.300 |
We rehearsed the talk in a hotel room before, 01:59:46.540 |
so I'm probably one of the few that has seen the unfiltered version of the talk. 01:59:51.660 |
maybe we should revisit some of the skipped slides. 02:00:01.020 |
That belief in a certain style of research pays out, right? 02:00:06.380 |
And I actually think Ilya and myself are like practical, 02:00:09.380 |
but it's also good there's some sort of long-term belief 02:00:19.980 |
and hugely influential to the field, as he has been. 02:00:35.260 |
- Rich Sutton wrote in "The Bitter Lesson" that the biggest lesson that can be read from 70 years of AI research is that general methods 02:00:38.620 |
that leverage computation are ultimately the most effective. 02:00:42.780 |
Do you think that intuition is ultimately correct? 02:00:52.220 |
General methods allowing the scaling of computation to do a lot of the work, 02:00:56.140 |
and so the basic task of us humans is to design methods 02:01:02.580 |
that are more and more general, versus more and more specific to the tasks at hand? 02:01:16.980 |
- I think that, on the one hand, we want to be data agnostic, 02:01:29.780 |
and scaling up feels, at the very least, 02:01:32.860 |
again, necessary for building incredibly complex systems, 02:01:42.140 |
barring that we might still need a couple of breakthroughs. 02:01:45.060 |
I think Rich Sutton mentioned search being part of it, 02:02:01.180 |
because it is very appealing to search in domains like Go, 02:02:07.460 |
where you can then discard some search traces. 02:02:18.620 |
A recent project, which actually was mostly mimicking, or a continuation, 02:02:23.700 |
pretty much, like, intersecting with AlphaStar, 02:02:27.220 |
was AlphaCode, in which we actually saw the bitter lesson at work, 02:02:36.780 |
being able to reach human-level code competition performance. 02:02:48.140 |
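The AlphaCode recipe he is alluding to leans on exactly this kind of compute-hungry search: sample a very large number of candidate programs from a language model and keep only those that pass the visible example tests (the published system also clusters surviving programs, omitted here). A minimal sketch; `generate` and `execute` are hypothetical stand-ins for a trained code model and a sandboxed runner:

```python
# Sample-and-filter sketch of AlphaCode-style program search: more compute
# buys more samples, and filtering on the problem's example tests discards
# the bad ones. Both callables are placeholders supplied by the caller.
from typing import Callable, List, Tuple

def filter_candidates(generate: Callable[[], str],
                      execute: Callable[[str, str], str],
                      example_tests: List[Tuple[str, str]],
                      num_samples: int = 1000) -> List[str]:
    """Return sampled programs that pass every visible (input, output) test."""
    survivors = []
    for _ in range(num_samples):           # scaling knob: more samples, better odds
        program = generate()               # one draw from the code model
        if all(execute(program, inp) == expected
               for inp, expected in example_tests):
            survivors.append(program)
    return survivors
```

The only scaling knob is `num_samples`, which is the sense in which compute, rather than cleverer hand-built heuristics, does most of the work.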
But certainly I'm convinced scale will be needed, 02:02:53.500 |
and maybe we need to make sure that we can scale them, 02:03:02.860 |
based on which methods might be needed to scale. 02:03:05.620 |
And that's an interesting contrast with this GPU comment, 02:03:19.500 |
we don't have the hardware, although in theory, 02:03:24.660 |
but there's a bit of this notion of hardware lottery 02:03:27.780 |
for scale that might actually have an impact, 02:03:35.180 |
to maybe a version of neural nets or whatever comes next 02:03:54.020 |
that achieves human level intelligence and goes far beyond? 02:04:10.940 |
- Because the beyond bit is a bit tricky to define, 02:04:15.940 |
especially when we look at the current formula 02:04:19.980 |
of starting from this imitation learning standpoint, right? 02:04:23.700 |
So we can certainly imitate humans at language and beyond. 02:04:34.860 |
Going beyond will require reinforcement learning 02:04:43.500 |
I mean, Go being an example that's my favorite so far 02:04:50.340 |
But in general, I'm not sure we can define reward functions 02:04:55.340 |
that, from a seed of imitating human-level intelligence, take us far beyond it. 02:05:08.100 |
And I mean, that in itself is already quite powerful, 02:05:27.460 |
- Well, especially if human level or slightly beyond 02:05:38.460 |
beyond which our world will be just very deeply transformed 02:05:45.620 |
Because now you're talking about intelligent systems 02:05:47.780 |
that are just, I mean, this is no longer just going 02:05:59.780 |
in what it means to be a living entity on earth. 02:06:26.260 |
and we should actually try to get better at this. 02:06:37.580 |
So for digital entities, it's an interesting question. 02:06:43.540 |
but maybe it's gonna be imposed by energy availability 02:07:02.220 |
to find what would be reasonable in terms of growth 02:07:26.260 |
that I'm most excited to see and to personally work towards. 02:07:30.940 |
- Yeah, there's going to be significant improvements 02:07:34.340 |
across the whole population, which is very interesting. 02:07:45.340 |
do you think, as humans become a multi-planetary species, 02:07:49.180 |
go outside our solar system, all that kind of stuff, 02:08:09.580 |
where we will be part of those other planets, 02:08:18.660 |
to empower us or make us more powerful as a human species. 02:08:23.660 |
That's not to say there might not be some hybridization. 02:08:35.660 |
Maybe there are other things that are yet to happen on that. 02:08:44.580 |
So I would hope that we are part of the equation, 02:08:56.220 |
but it would not be good to have an imbalance, 02:09:01.420 |
and the why I'm doing what I'm doing when I go to work 02:09:09.500 |
- And this is how you've passed the Turing test. 02:09:12.700 |
And you are one of the special humans, Oriol. 02:09:14.940 |
It's a huge honor that you would talk with me, 02:09:19.900 |
maybe once before the singularity, once after, 02:09:34.020 |
- Yeah, looking forward to before the singularity, certainly. 02:09:44.260 |
please check out our sponsors in the description. 02:09:46.940 |
And now, let me leave you with some words from Alan Turing. 02:09:50.060 |
"Those who can imagine anything can create the impossible." 02:09:55.140 |
Thank you for listening, and hope to see you next time.