
Oriol Vinyals: Deep Learning and Artificial General Intelligence | Lex Fridman Podcast #306


Chapters

0:00 Introduction
0:34 AI
15:31 Weights
21:50 Gato
56:38 Meta learning
70:37 Neural networks
93:02 Emergence
99:47 AI sentience
123:43 AGI

Whisper Transcript

00:00:00.000 | "At which point is the neural network a being versus a tool?"
00:00:05.000 | The following is a conversation with Oriol Vinyals,
00:00:11.400 | his second time in the podcast.
00:00:13.480 | Oriol is the research director
00:00:15.960 | and deep learning lead at DeepMind,
00:00:18.040 | and one of the most brilliant thinkers and researchers
00:00:20.980 | in the history of artificial intelligence.
00:00:24.360 | This is the Lex Fridman Podcast.
00:00:26.680 | To support it, please check out our sponsors
00:00:28.880 | in the description.
00:00:30.200 | And now, dear friends, here's Oriol Vinyals.
00:00:33.600 | You are one of the most brilliant researchers
00:00:37.060 | in the history of AI,
00:00:38.480 | working across all kinds of modalities.
00:00:40.600 | Probably the one common theme is
00:00:42.720 | it's always sequences of data.
00:00:45.040 | So we're talking about languages, images, even biology,
00:00:48.020 | and games, as we talked about last time.
00:00:50.280 | So you're a good person to ask this.
00:00:53.400 | In your lifetime, will we be able to build an AI system
00:00:57.360 | that's able to replace me as the interviewer
00:01:00.760 | in this conversation,
00:01:02.600 | in terms of ability to ask questions
00:01:04.480 | that are compelling to somebody listening?
00:01:06.600 | And then further question is,
00:01:09.400 | are we close,
00:01:10.640 | will we be able to build a system that replaces you
00:01:13.880 | as the interviewee
00:01:16.080 | in order to create a compelling conversation?
00:01:18.120 | How far away are we, do you think?
00:01:20.040 | - It's a good question.
00:01:21.800 | I think partly I would say, do we want that?
00:01:24.680 | I really like when we start now with very powerful models,
00:01:29.360 | interacting with them
00:01:30.960 | and thinking of them more closer to us.
00:01:34.040 | The question is,
00:01:34.880 | if you remove the human side of the conversation,
00:01:38.320 | is that an interesting,
00:01:40.200 | is that an interesting artifact?
00:01:42.320 | And I would say probably not.
00:01:44.440 | I've seen, for instance, last time we spoke,
00:01:47.400 | like we were talking about StarCraft
00:01:50.280 | and creating agents that play games
00:01:53.480 | involves self-play,
00:01:54.880 | but ultimately what people care about was,
00:01:57.600 | how does this agent behave
00:01:59.080 | when the opposite side is a human?
00:02:02.680 | So without a doubt,
00:02:04.720 | we will probably be more empowered by AI.
00:02:08.520 | Maybe you can source some questions from an AI system.
00:02:12.480 | I mean, that even today, I would say,
00:02:13.960 | it's quite plausible that with your creativity,
00:02:17.040 | you might actually find very interesting questions
00:02:19.400 | that you can filter.
00:02:20.720 | We call this cherry picking sometimes
00:02:22.400 | in the field of language.
00:02:24.040 | And likewise, if I had now the tools on my side,
00:02:27.520 | I could say, look,
00:02:28.520 | you're asking this interesting question.
00:02:30.640 | From this answer,
00:02:31.600 | I like the words chosen by this particular system
00:02:34.760 | that created a few words.
00:02:36.600 | Completely replacing it feels not exactly exciting to me.
00:02:41.280 | Although in my lifetime, I think, well,
00:02:43.760 | I mean, given the trajectory,
00:02:45.520 | I think it's possible that perhaps
00:02:48.000 | there could be interesting,
00:02:49.880 | maybe self-play interviews as you're suggesting
00:02:53.040 | that would look or sound quite interesting
00:02:56.160 | and probably would educate,
00:02:57.720 | or you could learn a topic
00:02:59.160 | through listening to one of these interviews
00:03:01.600 | at a basic level, at least.
00:03:03.200 | - So you said it doesn't seem exciting to you,
00:03:04.800 | but what if exciting is part of the objective function
00:03:07.520 | the thing is optimized over?
00:03:09.120 | So there's probably a huge amount of data of humans,
00:03:12.840 | if you look correctly,
00:03:14.120 | of humans communicating online,
00:03:16.080 | and there's probably ways to measure the degree of,
00:03:19.560 | as they talk about engagement.
00:05:21.920 | So you can probably optimize for the question
00:05:24.120 | that has most created an engaging conversation in the past.
00:03:28.680 | So actually, if you strictly use the word exciting,
00:03:31.560 | there is probably a way
00:05:36.520 | to create optimally exciting conversations
00:03:40.320 | that involve AI systems.
00:03:42.160 | At least one side is AI.
00:03:44.600 | - Yeah, that makes sense.
00:03:45.640 | I think maybe looping back a bit to games
00:03:48.880 | and the game industry,
00:03:50.240 | when you design algorithms,
00:03:53.040 | you're thinking about winning as the objective, right?
00:03:55.800 | Or the reward function.
00:03:57.320 | But in fact, when we discuss this with Blizzard,
00:04:00.080 | the creators of StarCraft in this case,
00:04:02.320 | I think what's exciting, fun,
00:04:05.360 | if you could measure that and optimize for that,
00:04:09.160 | that's probably why we play video games
00:04:11.720 | or why we interact or listen or look at cat videos
00:04:14.600 | or whatever on the internet.
00:04:16.440 | So it's true that modeling reward
00:04:19.480 | beyond the obvious reward functions
00:04:21.320 | we're used to in reinforcement learning
00:04:23.720 | is definitely very exciting.
00:04:25.560 | And again, there is some progress actually
00:04:28.240 | into a particular aspect of AI, which is quite critical,
00:04:32.160 | which is, for instance, is a conversation
00:04:36.120 | or is the information truthful, right?
00:04:38.200 | So you could start trying to evaluate these
00:04:41.640 | from excerpts from the internet, right?
00:04:44.400 | That has lots of information.
00:04:45.800 | And then if you can learn a function automated ideally,
00:04:50.160 | so you can also optimize it more easily,
00:04:52.880 | then you could actually have conversations
00:04:54.840 | that optimize for non-obvious things such as excitement.
00:04:59.360 | So yeah, that's quite possible.
00:05:01.040 | And then I would say in that case,
00:05:03.560 | it would definitely be a fun exercise
00:05:05.880 | and quite unique to have at least one side
00:05:08.040 | that is fully driven by an excitement reward function.
00:05:12.800 | But obviously there would be still quite a lot of humanity
00:05:16.920 | in the system, both from who is building the system,
00:05:20.760 | of course, and also ultimately,
00:05:23.560 | if we think of labeling for excitement,
00:05:26.000 | that those labels must come from us
00:05:28.440 | because it's just hard to have a computational measure
00:05:32.480 | of excitement as far as I understand,
00:05:34.560 | there's no such thing.
00:05:36.120 | - Wow, as you mentioned truth also,
00:05:39.240 | I would actually venture to say that excitement
00:05:41.800 | is easier to label than truth,
00:05:44.160 | or perhaps has lower consequences of failure.
00:05:49.000 | But there is perhaps the humanness that you mentioned,
00:05:54.920 | that's perhaps part of a thing that could be labeled.
00:05:58.240 | And that could mean an AI system that's doing dialogue,
00:06:02.480 | that's doing conversations should be flawed, for example.
00:06:07.480 | Like that's the thing you optimize for,
00:06:09.440 | which is have inherent contradictions by design,
00:06:13.280 | have flaws by design.
00:06:15.080 | Maybe it also needs to have a strong sense of identity.
00:06:18.760 | So it has a backstory, it told itself that it sticks to,
00:06:22.680 | it has memories, not in terms of how the system is designed,
00:06:26.880 | but it's able to tell stories about its past.
00:06:30.360 | It's able to have mortality and fear of mortality
00:06:35.360 | in the following way that it has an identity
00:06:39.120 | and like if it says something stupid
00:06:41.240 | and gets canceled on Twitter, that's the end of that system.
00:06:44.720 | So it's not like you get to rebrand yourself,
00:06:47.360 | that system is, that's it.
00:06:49.360 | So maybe that the high stakes nature of it,
00:06:52.120 | because like you can't say anything stupid now,
00:06:54.560 | or because you'd be canceled on Twitter.
00:06:57.720 | And that there's stakes to that.
00:06:59.760 | And that's, I think, part of the reason
00:07:01.160 | that makes it interesting.
00:07:03.520 | And then you have a perspective
00:07:04.720 | like you've built up over time that you stick with,
00:07:07.720 | and then people can disagree with you.
00:07:09.120 | So holding that perspective strongly,
00:07:11.800 | holding sort of maybe a controversial,
00:07:14.040 | at least a strong opinion.
00:07:16.300 | All of those elements, it feels like they can be learned
00:07:18.840 | because it feels like there's a lot of data
00:07:21.760 | on the internet of people having an opinion.
00:07:24.520 | (laughs)
00:07:25.400 | And then combine that with a metric of excitement,
00:07:27.840 | you can start to create something that,
00:07:30.000 | as opposed to trying to optimize for sort of
00:07:34.480 | grammatical clarity and truthfulness,
00:07:38.120 | the factual consistency over many sentences,
00:07:42.000 | you optimize for the humanness.
00:07:45.320 | And there's obviously data for humanness on the internet.
00:07:48.880 | So I wonder if there's a future where that's part,
00:07:53.040 | I mean I sometimes wonder that about myself,
00:07:56.400 | I'm a huge fan of podcasts,
00:07:58.120 | and I listen to some podcasts,
00:08:00.760 | and I think like what is interesting about this,
00:08:03.240 | what is compelling?
00:08:04.280 | The same way you watch other games,
00:08:07.440 | like you said, watch people play StarCraft,
00:08:09.160 | or have Magnus Carlsen play chess.
00:08:13.040 | So I'm not a chess player,
00:08:14.920 | but it's still interesting to me,
00:08:16.120 | and what is that?
00:08:16.960 | That's the stakes of it,
00:08:19.440 | maybe the end of a domination of a series of wins.
00:08:23.400 | I don't know, there's all those elements
00:08:25.440 | somehow connect to a compelling conversation,
00:08:28.000 | and I wonder how hard is that to replace?
00:08:30.200 | 'Cause ultimately all of that connects
00:08:31.840 | to the initial proposition of how to test
00:08:34.600 | whether an AI is intelligent or not with the Turing test.
00:08:38.640 | Which I guess, my question comes from a place
00:08:41.760 | of the spirit of that test.
00:08:43.680 | - Yes, I actually recall,
00:08:45.440 | I was just listening to our first podcast
00:08:47.920 | where we discussed the Turing test.
00:08:50.360 | So I would say from a neural network,
00:08:54.760 | AI builder perspective,
00:08:59.160 | usually you try to map many of these interesting topics
00:09:03.160 | you discuss to benchmarks,
00:09:05.200 | and then also to actual architectures
00:09:08.120 | on how these systems are currently built,
00:09:10.640 | how they learn, what data they learn from,
00:09:13.080 | what are they learning, right?
00:09:14.280 | We're talking about weights of a mathematical function,
00:09:17.800 | and then looking at the current state of the game,
00:09:21.560 | maybe what leaps forward do we need
00:09:26.000 | to get to the ultimate stage of all these experiences,
00:09:30.640 | lifetime experience, fears,
00:09:32.840 | like words for which we're currently barely seeing progress,
00:09:37.840 | just because what's happening today
00:09:40.040 | is you take all these human interactions,
00:09:43.960 | it's a large vast variety of human interactions online,
00:09:47.920 | and then you're distilling these sequences, right?
00:09:51.600 | Going back to my passion, like sequences of words,
00:09:54.680 | letters, images, sound,
00:09:56.920 | there's more modalities here to be at play.
00:09:59.840 | And then you're trying to just learn a function
00:10:03.360 | that will be happy,
00:10:04.400 | that maximizes the likelihood of seeing all these
00:10:08.840 | through a neural network.
00:10:10.880 | Now, I think there's a few places
00:10:14.200 | where, with the way we currently train these models,
00:10:17.240 | we would clearly like to be able to develop
00:10:20.000 | the kinds of capabilities you say.
00:10:22.120 | I'll tell you maybe a couple.
00:10:23.520 | One is the lifetime of an agent or a model.
00:10:27.640 | So you learn from these data offline, right?
00:10:30.840 | So you're just passively observing and maximizing these,
00:10:33.560 | you know, it's almost like a landscape of mountains.
00:10:37.760 | And then everywhere there's data
00:10:39.120 | that humans interacted in this way,
00:10:41.040 | you're trying to make that higher
00:10:43.000 | and then lower where there's no data.
00:10:45.720 | And then these models generally
00:10:48.480 | don't then experience themselves.
00:10:51.160 | They just are observers, right?
00:10:52.520 | They're passive observers of the data.
00:10:54.600 | And then we're putting them to then generate data
00:10:57.440 | when we interact with them.
00:10:59.200 | But that's very limiting.
00:11:00.920 | The experience they actually have,
00:11:03.480 | which could maybe be used for optimizing
00:11:05.680 | or further optimizing the weights,
00:11:07.440 | we're not even doing that.
00:11:08.640 | So to be clear, and again, mapping to AlphaGo, AlphaStar,
00:11:13.640 | we train the model.
00:11:15.280 | And when we deploy it to play against humans,
00:11:18.280 | or in this case, interact with humans,
00:11:20.320 | like language models, they don't even keep training, right?
00:11:23.480 | They're not learning in the sense of the weights
00:11:26.160 | that you've learned from the data.
00:11:28.200 | They don't keep changing.
00:11:29.760 | Now there's something that feels a bit more magical,
00:11:33.480 | but it's understandable if you're into neural nets,
00:11:36.200 | which is, well, they might not learn
00:11:39.120 | in the strict sense of the words, the weights changing.
00:11:41.480 | Maybe that's mapping to how neurons interconnect
00:11:44.360 | and how we learn over our lifetime.
00:11:46.640 | But it's true that the context of the conversation
00:11:50.280 | that takes place when you talk to these systems,
00:11:54.960 | it's held in their working memory, right?
00:11:57.200 | It's almost like you start a computer,
00:12:00.120 | it has a hard drive that has a lot of information.
00:12:02.840 | You have access to the internet,
00:12:04.000 | which has probably all the information,
00:12:06.320 | but there's also a working memory
00:12:08.440 | where these agents, as we call them,
00:12:11.080 | or start calling them, build upon.
00:12:13.840 | Now, this memory is very limited.
00:12:16.560 | I mean, right now we're talking, to be concrete,
00:12:19.200 | about 2000 words that we hold,
00:12:21.760 | and then beyond that, we start forgetting what we've seen.
00:12:24.840 | So you can see that there's some short-term coherence
00:12:28.040 | already, right, with when you said,
00:12:29.880 | I mean, it's a very interesting topic,
00:12:32.280 | having sort of a mapping, an agent to have consistency.
00:12:37.280 | Then if you say, "Oh, what's your name?"
00:12:40.760 | It could remember that,
00:12:42.240 | but then it might forget beyond 2000 words,
00:12:44.960 | which is not that long of a context,
00:12:47.480 | if we think that even these podcasts or books are much longer.
00:12:51.760 | So technically speaking, there's a limitation there.
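To make that working-memory limit concrete, here is a minimal sketch in Python; the token budget and the whitespace "tokenizer" are stand-ins, not what any real system uses, but they show how a fixed context window forces older turns of a conversation to be forgotten:

```python
# Minimal sketch of a fixed-size working memory (hypothetical numbers and
# whitespace "tokenization"; real systems use subword tokenizers).
MAX_CONTEXT_TOKENS = 2000  # roughly the "2000 words" mentioned above

def build_context(turns: list[str], max_tokens: int = MAX_CONTEXT_TOKENS) -> str:
    """Keep only the most recent turns that fit in the token budget."""
    kept: list[str] = []
    used = 0
    for turn in reversed(turns):          # newest turns first
        n = len(turn.split())             # crude stand-in for token counting
        if used + n > max_tokens:
            break                         # everything older is "forgotten"
        kept.append(turn)
        used += n
    return "\n".join(reversed(kept))      # restore chronological order

conversation = ["User: What's your name?", "Model: You can call me Gato."]
print(build_context(conversation))
```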
00:12:55.120 | Super exciting for people that work on deep learning
00:12:58.160 | to be working on,
00:12:59.960 | but I would say we lack maybe benchmarks
00:13:03.040 | and the technology to have this lifetime-like experience
00:13:07.840 | of memory that keeps building up.
00:13:10.840 | However, the way it learns offline
00:13:13.160 | is clearly very powerful, right?
00:13:14.880 | So if you'd asked me three years ago,
00:13:17.400 | I would say, "Oh, we're very far."
00:13:18.640 | I think we've seen the power of this imitation,
00:13:22.240 | again, on the internet scale that has enabled this
00:13:26.240 | to feel like at least the knowledge,
00:13:28.760 | the basic knowledge about the world
00:13:30.160 | now is incorporated into the weights,
00:13:33.120 | but then this experience is lacking.
00:13:36.560 | And in fact, as I said, we don't even train them
00:13:39.320 | when we're talking to them,
00:13:41.160 | other than their working memory, of course, is affected.
00:13:44.760 | So that's the dynamic part,
00:13:46.560 | but they don't learn in the same way
00:13:48.240 | that you and I have learned, right?
00:13:49.720 | When, from basically when we were born and probably before.
00:13:54.040 | So lots of fascinating, interesting questions
00:13:56.480 | you asked there.
00:13:57.400 | I think the one I mentioned is this idea of memory
00:14:01.680 | and experience versus just kind of observe the world
00:14:05.480 | and learn its knowledge,
00:14:06.720 | which I think for that, I would argue,
00:14:08.880 | there have been lots of recent advancements
00:14:10.320 | that make me very excited about the field.
00:14:13.400 | And then the second maybe issue that I see is
00:14:18.160 | all these models, we train them from scratch.
00:14:21.240 | That's something I would have complained three years ago
00:14:24.000 | or six years ago or 10 years ago.
00:14:26.400 | And it feels, if we take inspiration from how we got here,
00:14:31.360 | how the universe evolved us and we keep evolving,
00:14:35.240 | it feels that is a missing piece,
00:14:37.840 | that we should not be training models from scratch
00:14:41.320 | every few months, that there should be some sort of way
00:14:45.280 | in which we can grow models much like as a species
00:14:49.000 | and many other elements in the universe
00:14:51.520 | is building from the previous sort of iterations.
00:14:55.000 | And that from a just purely neural network perspective,
00:14:59.520 | even though we would like to make it work,
00:15:02.280 | it's proven very hard to not, you know,
00:15:05.600 | throw away the previous weights, right?
00:15:07.680 | This landscape we learn from the data and, you know,
00:15:10.280 | refresh it with a brand new set of weights,
00:15:13.360 | given maybe a recent snapshot of these datasets
00:15:16.960 | we train on, et cetera, or even a new game we're learning.
00:15:19.960 | So that feels like something is missing fundamentally.
00:15:24.160 | We might find it, but it's not very clear
00:15:27.440 | what it will look like.
00:15:28.400 | There's many ideas and it's super exciting as well.
00:15:30.800 | - Yes, just for people who don't know,
00:15:32.440 | when you're approaching a new problem in machine learning,
00:15:35.720 | you're going to come up with an architecture
00:15:38.200 | that has a bunch of weights
00:15:40.960 | and then you initialize them somehow,
00:15:43.360 | which in most cases is some version of random.
00:15:47.280 | So that's what you mean by starting from scratch.
00:15:48.960 | And it seems like it's a waste every time you solve
00:15:52.880 | the game of Go and chess, StarCraft, protein folding,
00:15:59.440 | like surely there's some way to reuse the weights
00:16:03.160 | as we grow this giant database of neural networks.
00:16:08.400 | - That has solved some of the toughest problems
00:16:10.000 | in the world.
00:16:10.840 | And so some of that is, what is that?
00:16:15.240 | Methods, how to reuse weights,
00:16:19.080 | how to learn to extract what's generalizable,
00:16:22.480 | or at least has a chance to be
00:16:25.160 | and throw away the other stuff.
00:16:26.900 | And maybe the neural network itself
00:16:29.560 | should be able to tell you that.
00:16:31.640 | Like what, yeah, how do you,
00:16:34.640 | what ideas do you have for better initialization of weights?
00:16:37.520 | Maybe stepping back,
00:16:38.720 | if we look at the field of machine learning,
00:16:41.720 | but especially deep learning, right?
00:16:44.040 | At the core of deep learning,
00:16:45.240 | there's this beautiful idea that a single algorithm
00:16:49.240 | can solve any task, right?
00:16:50.920 | So it's been proven over and over
00:16:54.400 | with an ever-increasing set of benchmarks
00:16:56.440 | and things that were thought impossible
00:16:58.580 | that are being cracked by this basic principle.
00:17:01.960 | That is you take a neural network of uninitialized weights.
00:17:05.800 | So like a blank computational brain,
00:17:09.640 | then you give it, in the case of supervised learning,
00:17:12.600 | ideally a lot of examples of,
00:17:14.960 | hey, here is what the input looks like
00:17:17.120 | and the desired output should look like this.
00:17:19.560 | I mean, image classification is a very clear example:
00:17:22.360 | images to maybe one of a thousand categories.
00:17:25.560 | That's what ImageNet is like,
00:17:26.840 | but many, many, if not all problems can be mapped this way.
00:17:30.720 | And then there's a generic recipe, right?
00:17:33.840 | That you can use.
00:17:35.240 | And this recipe with very little change.
00:17:38.600 | And I think that's the core of deep learning research,
00:17:40.920 | right?
00:17:41.760 | That what is the recipe that is universal
00:17:44.400 | that for any new given task,
00:17:46.400 | I'll be able to use without thinking,
00:17:48.440 | without having to work very hard on the problem at stake.
00:17:51.740 | We have not found this recipe,
00:17:54.400 | but I think the field is excited to find fewer tweaks
00:17:59.400 | or tricks that people find
00:18:02.000 | when they work on important problems specific to those
00:18:05.280 | and more of a general algorithm, right?
00:18:07.540 | So at an algorithmic level,
00:18:09.300 | I would say we have something general already,
00:18:11.760 | which is this formula of training a very powerful model
00:18:14.520 | and neural network on a lot of data.
00:18:17.000 | And in many cases,
00:18:19.400 | you need some specificity
00:18:21.200 | to the actual problem you're solving.
00:18:23.400 | Protein folding being such an important problem
00:18:26.060 | has some basic recipe that is learned from before, right?
00:18:30.780 | Like transformer models, graph neural networks,
00:18:34.120 | ideas coming from NLP,
00:18:35.720 | like something called BERT,
00:18:38.580 | that is a kind of loss that you can put in place
00:18:41.280 | to help the model.
00:18:42.420 | Knowledge distillation is another technique, right?
00:18:45.680 | So this is the formula.
00:18:47.080 | We still had to find some particular things
00:18:50.560 | that were specific to alpha fold, right?
00:18:53.600 | That's very important because protein folding
00:18:55.880 | is such a high value problem that as humans,
00:18:59.120 | we should solve it no matter if we need to be a bit specific.
00:19:02.860 | And it's possible that some of these learnings
00:19:04.940 | will apply then to the next iteration of this recipe
00:19:07.380 | that deep learners are about.
00:19:09.340 | But it is true that so far,
00:19:11.820 | the recipe is what's common,
00:19:13.180 | but the weights you generally throw away,
00:19:15.860 | which feels very sad.
00:19:17.780 | Although maybe in the last,
00:19:21.380 | especially in the last two, three years,
00:19:23.360 | and when we last spoke,
00:19:24.620 | I mentioned these area of meta-learning,
00:19:26.600 | which is the idea of learning to learn.
00:19:29.540 | That idea and some progress has been had starting,
00:19:33.100 | I would say, mostly from GPT-3 on the language domain only,
00:19:37.140 | in which you could conceive a model that is trained once,
00:19:42.060 | and then this model is not narrow in that
00:19:44.680 | it only knows how to translate a pair of languages,
00:19:47.640 | or it only knows how to assign sentiment to a sentence.
00:19:51.480 | These actually, you could teach it
00:19:54.100 | by prompting, as it's called.
00:19:55.460 | And this prompting is essentially just showing it
00:19:58.060 | a few more examples,
00:19:59.860 | almost like you do show examples, input-output examples,
00:20:02.980 | algorithmically speaking,
00:20:04.080 | to the process of creating this model.
00:20:06.280 | But now you're doing it through language,
00:20:07.820 | which is very natural way for us to learn from one another.
00:20:11.040 | I tell you, "Hey, you should do this new task.
00:20:13.080 | "I'll tell you a bit more.
00:20:14.500 | "Maybe you ask me some questions."
00:20:16.040 | And now you know the task, right?
00:20:17.800 | You didn't need to retrain it from scratch.
00:20:20.300 | And we've seen these magical moments almost
00:20:23.180 | in this way to do few-shot prompting through language
00:20:26.940 | on language-only domain.
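As an illustration of what teaching a model a task by prompting looks like in practice, here is a minimal sketch; the sentiment task is a made-up example and `query_model` is a hypothetical stub, not an actual API:

```python
# Few-shot prompting sketch: the "teaching" is just text shown in context.
# `query_model` stands in for whatever interface serves a large language model.
def query_model(prompt: str) -> str:  # hypothetical stub
    raise NotImplementedError("call your language model of choice here")

few_shot_prompt = """Classify the sentiment of each review as positive or negative.

Review: "The food was wonderful and the staff were lovely."
Sentiment: positive

Review: "Cold soup, rude waiter, never again."
Sentiment: negative

Review: "A masterpiece from start to finish."
Sentiment:"""

# The model is never retrained; it infers the task from the examples above.
# completion = query_model(few_shot_prompt)   # expected to continue with " positive"
```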
00:20:28.520 | And then in the last two years,
00:20:30.940 | we've seen this expanded beyond language,
00:20:34.620 | adding vision, adding actions and games,
00:20:38.040 | lots of progress to be had.
00:20:39.460 | But this is maybe, if you ask me about
00:20:42.120 | how are we gonna crack this problem,
00:20:43.700 | this is perhaps one way in which you have a single model.
00:20:47.760 | The problem of this model is it's hard to grow
00:20:52.140 | in weights or capacity,
00:20:54.260 | but the model is certainly so powerful
00:20:56.380 | that you can teach it some tasks, right?
00:20:58.920 | In this way that I could teach you a new task now
00:21:01.940 | if it were, say, a text-based task
00:21:05.060 | or a classification, a vision-style task.
00:21:08.400 | But it still feels like more breakthroughs should be had,
00:21:12.820 | but it's a great beginning, right?
00:21:13.980 | We have a good baseline.
00:21:15.400 | We have an idea that this maybe is the way
00:21:17.740 | we want to benchmark progress towards AGI.
00:21:20.740 | And I think in my view, that's critical
00:21:22.820 | to always have a way to benchmark progress,
00:21:25.000 | with the community converging to this overall,
00:21:27.780 | which is good to see.
00:21:29.200 | And then this is actually what excites me
00:21:33.500 | in terms of also next steps for deep learning
00:21:36.580 | is how to make these models more powerful.
00:21:39.040 | How do you train them?
00:21:40.460 | How to grow them if they must grow?
00:21:43.080 | Should they change their weights
00:21:44.500 | as you teach it the task or not?
00:21:46.060 | There's some interesting questions, many to be answered.
00:21:48.520 | - Yeah, you've opened the door
00:21:49.760 | to a bunch of questions I wanna ask,
00:21:52.260 | but let's first return to your tweet
00:21:55.660 | and read it like Shakespeare.
00:21:57.120 | You wrote, "Gato is not the end, it's the beginning."
00:22:01.220 | And then you wrote, "Meow," and then an emoji of a cat.
00:22:04.960 | So first, two questions.
00:22:07.700 | First, can you explain the meow and the cat emoji?
00:22:10.020 | And second, can you explain what Gato is and how it works?
00:22:13.620 | - Right, indeed.
00:22:14.580 | I mean, thanks for reminding me
00:22:16.500 | that we're all exposing on Twitter and-
00:22:19.900 | - Permanently there.
00:22:20.900 | - Yes, permanently there.
00:22:21.900 | - One of the greatest AI researchers of all time,
00:22:25.100 | meow and cat emoji.
00:22:27.220 | - Yes. - There you go.
00:22:28.260 | - Right, so-
00:22:29.100 | - Can you imagine, like, Turing tweeting,
00:22:31.940 | meow and cat, probably he would, probably would.
00:22:34.340 | - Probably.
00:22:35.180 | So yeah, the tweet is important, actually.
00:22:38.020 | You know, I put thought into the tweets.
00:22:39.800 | I hope people-
00:22:40.780 | - Which part did you think, okay.
00:22:43.060 | So there's three sentences.
00:22:44.900 | Gato's not the end, Gato's the beginning.
00:22:48.700 | Meow, cat emoji.
00:22:50.140 | Okay, which is the important part?
00:22:51.740 | - It's the meow, no, no.
00:22:53.140 | Definitely that it is the beginning.
00:22:56.060 | I mean, I probably was just explaining a bit
00:23:00.340 | where the field is going, but let me tell you about Gato.
00:23:03.740 | So first, the name Gato comes from maybe a sequence
00:23:08.100 | of releases that DeepMind had that named,
00:23:11.820 | like used animal names to name some of their models
00:23:15.100 | that are based on this idea of large sequence models.
00:23:19.100 | Initially, they're only language,
00:23:20.620 | but we are expanding to other modalities.
00:23:23.180 | So we had, you know, we had Gopher, Chinchilla,
00:23:28.180 | these were language only.
00:23:29.940 | And then more recently we released Flamingo,
00:23:32.700 | which adds vision to the equation.
00:23:35.420 | And then Gato, which adds vision
00:23:38.140 | and then also actions in the mix, right?
00:23:41.620 | As we discuss actually actions,
00:23:44.500 | especially discrete actions like up, down, left, right.
00:23:47.540 | I just told you the actions, but they're words.
00:23:49.460 | So you can kind of see how actions naturally map
00:23:52.740 | to sequence modeling of words,
00:23:54.500 | which these models are very powerful.
00:23:57.020 | So Gato was named after, I believe,
00:24:01.660 | I can only go from memory, right?
00:24:03.580 | These, you know, these things always happen
00:24:06.020 | with an amazing team of researchers behind.
00:24:08.500 | So before the release, we had the discussion
00:24:12.180 | about which animal would we pick, right?
00:24:14.220 | And I think because of the word general agent, right?
00:24:18.340 | And this is a property quite unique to Gato.
00:24:21.860 | We kind of were playing with the GA words
00:24:24.700 | and then, you know, Gato is-
00:24:25.980 | - Rhymes with cat.
00:24:26.900 | - Yes.
00:24:27.740 | And Gato is obviously a Spanish version of cat.
00:24:30.220 | I had nothing to do with it, although I'm from Spain.
00:24:32.220 | - Oh, how do you, wait, sorry.
00:24:33.260 | How do you say cat in Spanish?
00:24:34.620 | - Gato.
00:24:35.460 | - Oh, Gato.
00:24:36.300 | - Yeah.
00:24:37.140 | - Now it all makes sense. - Okay, okay, I see, I see.
00:24:37.980 | - Now it all makes sense.
00:24:39.060 | - Okay, so-
00:24:39.900 | - How do you say meow in Spanish?
00:24:40.780 | No, that's probably the same.
00:24:41.900 | - I think you say it the same way,
00:24:44.380 | but you write it as M-I-A-U.
00:24:48.060 | - Okay, it's universal.
00:24:49.220 | - Yeah.
00:24:50.060 | - All right, so then how does the thing work?
00:24:51.660 | So you said general is, so you said language, vision-
00:24:56.660 | - And action.
00:24:58.380 | - Action.
00:24:59.220 | How does this, can you explain
00:25:01.820 | what kind of neural networks are involved?
00:25:04.220 | What does the training look like?
00:25:06.340 | And maybe what to you are some beautiful ideas
00:25:10.900 | within this system?
00:25:11.860 | - Yeah, so maybe the basics of Gato
00:25:16.060 | are not that dissimilar from many, many work that comes.
00:25:19.940 | So here is where the sort of the recipe,
00:25:22.900 | I mean, hasn't changed too much.
00:25:24.220 | There is a transformer model
00:25:25.580 | that's the kind of recurrent neural network
00:25:28.620 | that essentially takes a sequence of modalities,
00:25:33.300 | observations that could be words,
00:25:36.380 | could be vision, or could be actions.
00:25:38.820 | And then the objective that you train it to do,
00:25:42.140 | when you train it is to predict
00:25:44.060 | what the next anything is.
00:25:46.380 | And anything means what's the next action.
00:25:48.780 | If this sequence that I'm showing you to train
00:25:51.220 | is a sequence of actions and observations,
00:25:53.500 | then you're predicting what's the next action
00:25:55.620 | and the next observation, right?
00:25:57.100 | So you think of these really as a sequence of bytes, right?
00:26:00.900 | So take any sequence of words,
00:26:04.220 | a sequence of interleaved words and images,
00:26:06.980 | a sequence of maybe observations that are images
00:36:11.260 | and moves in Atari: up, down, left, right.
00:26:14.260 | And these, you just think of them as bytes
00:26:17.620 | and you're modeling what's the next byte gonna be like.
00:26:20.540 | And you might interpret that as an action
00:26:23.380 | and then play it in a game,
00:26:25.820 | or you could interpret it as a word
00:26:27.660 | and then write it down
00:26:29.060 | if you're chatting with the system and so on.
00:26:31.340 | So Gato basically can be thought as inputs,
00:26:36.620 | images, text, video, actions.
00:26:41.500 | It also actually inputs some sort of proprioception sensors
00:26:45.780 | from robotics because robotics is one of the tasks
00:26:48.260 | that it's been trained to do.
00:26:49.860 | And then at the output, similarly,
00:26:51.900 | it outputs words, actions.
00:26:53.700 | It does not output images.
00:26:55.700 | That's just by design,
00:26:57.420 | we decided not to go that way for now.
00:26:59.900 | That's also in part why it's the beginning
00:27:02.740 | because there's more to do clearly.
00:27:04.900 | But that's kind of what Gato is.
00:27:06.420 | It's this brain that essentially you give it any sequence
00:27:09.220 | of these observations and modalities
00:27:11.940 | and it outputs the next step.
00:27:13.780 | And then off you go,
00:27:15.340 | you feed the next step in
00:27:17.380 | and predict the next one and so on.
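A minimal sketch of that autoregressive loop, with an entirely hypothetical stand-in for the trained model, a hypothetical environment object, and made-up token ranges; the point is only that observations and actions live in one flat sequence and the model repeatedly predicts "the next anything":

```python
import random

# Hypothetical token-ID layout (illustrative only; the real partitioning differs).
TEXT_TOKENS   = range(0, 10_000)
IMAGE_TOKENS  = range(10_000, 20_000)
ACTION_TOKENS = range(20_000, 20_018)   # e.g. a small discrete action set

def predict_next_token(sequence: list[int]) -> int:
    """Stand-in for the trained transformer: returns the next token ID."""
    return random.choice(ACTION_TOKENS)  # placeholder behaviour

def act_in_environment(env, episode_prefix: list[int], steps: int = 10) -> list[int]:
    """Interleave observation tokens and predicted action tokens."""
    sequence = list(episode_prefix)
    for _ in range(steps):
        next_token = predict_next_token(sequence)
        if next_token in ACTION_TOKENS:
            observation_tokens = env.step(next_token)   # hypothetical env returns new tokens
            sequence.append(next_token)
            sequence.extend(observation_tokens)
        else:
            sequence.append(next_token)                 # e.g. a word, if chatting
    return sequence
```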
00:27:20.060 | Now, it is more than a language model
00:27:24.140 | because even though you can chat with Gato,
00:27:26.780 | like you can chat with Chinchilla or Flamingo,
00:27:29.540 | it also is an agent, right?
00:27:33.220 | So that's why we call it the 'A' of Gato,
00:27:37.220 | like the letter A, and also it's general.
00:27:41.380 | It's not an agent that's been trained
00:27:43.260 | to be good at only StarCraft or only Atari or only Go.
00:27:47.900 | It's been trained on a vast variety of datasets.
00:27:51.660 | - What makes it an agent, if I may interrupt?
00:27:53.860 | The fact that it can generate actions?
00:27:56.020 | - Yes, so when we call it,
00:27:58.180 | I mean, it's a good question, right?
00:28:00.100 | When do we call a model?
00:28:02.780 | I mean, everything is a model,
00:28:03.860 | but what is an agent in my view is indeed
00:28:06.740 | the capacity to take actions in an environment
00:28:09.700 | that you then send to it
00:28:11.660 | and then the environment might return
00:28:13.500 | with a new observation
00:28:15.040 | and then you generate the next action and so on.
00:28:17.660 | - This actually, this reminds me of the question
00:28:20.420 | from the side of biology, what is life?
00:28:23.000 | Which is actually a very difficult question as well.
00:28:25.380 | What is living?
00:28:26.780 | What is living when you think about life here
00:28:29.460 | on this planet Earth?
00:28:31.000 | And a question interesting to me about aliens,
00:28:33.420 | what is life when we visit another planet?
00:28:35.720 | Would we be able to recognize it?
00:28:37.220 | And this feels like, it sounds perhaps silly,
00:28:40.220 | but I don't think it is.
00:28:41.380 | At which point is the neural network a being versus a tool?
00:28:46.380 | And it feels like action, ability to modify its environment,
00:28:52.400 | is that fundamental leap.
00:28:54.540 | - Yeah, I think it certainly feels like action
00:28:57.420 | is a necessary condition to be more alive,
00:29:01.920 | but probably not sufficient either.
00:29:04.380 | So sadly--
00:29:05.220 | - It's a soul consciousness thing, whatever.
00:29:06.880 | - Yeah, yeah, we can get back to that later.
00:29:09.060 | But anyways, going back to the meow and the Gato, right?
00:29:12.300 | So one of the leaps forward
00:29:16.100 | and what took the team a lot of effort and time was,
00:29:19.100 | as you were asking, how has Gato been trained?
00:29:23.100 | So I told you Gato is this transformer neural network,
00:29:26.060 | models actions, sequences of actions, words, et cetera.
00:29:30.580 | And then the way we train it is by essentially
00:29:34.820 | pulling data sets of observations, right?
00:29:39.380 | So it's a massive imitation learning algorithm
00:29:42.620 | that it imitates obviously to what is the next word
00:29:46.300 | that comes next from the usual data sets we use before,
00:29:49.860 | right?
00:29:50.700 | So these are these web scale style data sets
00:29:52.980 | of people writing on the web or chatting or whatnot, right?
00:29:57.980 | So that's an obvious source that we use
00:30:00.480 | on all language work.
00:30:02.020 | But then we also took a lot of agents
00:30:05.620 | that we have at DeepMind.
00:30:06.700 | I mean, as you know, DeepMind,
00:30:08.160 | we're quite interested in reinforcement learning
00:30:13.580 | and learning agents that play in different environments.
00:30:16.940 | So we kind of created a data set of these trajectories
00:30:20.740 | as we call them or agent experiences.
00:30:23.020 | So in a way, there are other agents we train
00:30:25.660 | for a single mind purpose to, let's say,
00:30:28.420 | control a 3D game environment and navigate a maze.
00:30:33.340 | So we had all the experience that was created
00:30:36.060 | through the one agent interacting with that environment.
00:30:39.560 | And we added these to the data sets, right?
00:30:41.860 | And as I said, we just see all the data,
00:30:44.380 | all these sequences of words or sequences of these agent
00:30:47.500 | interacting with that environment
00:30:49.700 | or agents playing Atari and so on.
00:30:52.180 | We see this as the same kind of data.
00:30:54.860 | And so we mix these data sets together and we train Gato.
00:30:59.220 | That's the G part, right?
00:31:01.580 | It's general because it really has mixed,
00:31:05.220 | it doesn't have different brains for each modality
00:31:07.520 | or each narrow task.
00:31:09.060 | It has a single brain.
00:31:10.500 | It's not that big of a brain compared to most
00:31:12.700 | of the neural networks we see these days.
00:31:14.780 | It has 1 billion parameters.
00:31:17.140 | Some models we're seeing getting the trillions these days
00:31:21.100 | and certainly 100 billion feels like a size
00:31:25.060 | that is very common when you train these jobs.
00:31:28.980 | So the actual agent is relatively small,
00:31:32.660 | but it's been trained on a very challenging,
00:31:35.020 | diverse data set, not only containing all of internet,
00:31:37.980 | but containing all these agent experience
00:31:40.380 | playing very different distinct environments.
00:31:43.140 | So this brings us to the part of the tweet of,
00:31:46.420 | this is not the end, it's the beginning.
00:31:48.900 | It feels very cool to see Gato in principle
00:31:53.100 | is able to control any sort of environments
00:31:56.620 | that especially the ones that it's been trained to do,
00:31:59.140 | these 3D games, Atari games,
00:32:01.100 | all sorts of robotics tasks and so on.
00:32:04.620 | But obviously it's not as proficient as the teachers
00:32:08.960 | it learned from on these environments.
00:32:09.800 | - Is that why it's not obvious?
00:32:11.740 | It's not obvious that it wouldn't be more proficient.
00:32:15.100 | It's just the current beginning part
00:32:18.040 | is that the performance is such that it's not as good
00:32:21.780 | as if it's specialized to that task.
00:32:23.460 | - Right, so it's not as good,
00:32:25.820 | although I would argue size matters here.
00:32:28.060 | So the fact that--
00:32:29.180 | - I would argue size always matters.
00:32:31.220 | - Yeah, okay. - That's a different conversation.
00:32:33.420 | - But for neural networks, certainly size does matter.
00:32:36.260 | So it's the beginning because it's relatively small.
00:32:39.660 | So obviously scaling this idea up
00:32:42.620 | might make the connections that exist
00:32:46.540 | between text on the internet and playing Atari and so on
00:32:50.740 | more synergistic with one another.
00:32:53.340 | And you might gain.
00:32:54.260 | And that, at the moment, we didn't quite see,
00:32:56.360 | but obviously that's why it's the beginning.
00:32:58.660 | - That synergy might emerge with scale.
00:33:00.980 | - Right, might emerge with scale.
00:33:02.140 | And also I believe there's some new research
00:33:04.420 | or ways in which you prepare the data
00:33:07.620 | that you might need to sort of make it more clear
00:33:10.940 | to the model that you're not only playing Atari
00:33:14.180 | and it's just, you start from a screen
00:33:16.360 | and here is up and a screen and down.
00:33:18.400 | Maybe you can think of playing Atari
00:33:20.660 | as there's some sort of context that is needed for the agent
00:33:23.900 | before it starts seeing, oh, this is an Atari screen,
00:33:26.900 | I'm gonna start playing.
00:33:28.640 | You might require, for instance, to be told in words,
00:33:33.420 | hey, in this sequence that I'm showing,
00:33:36.860 | you're gonna be playing an Atari game.
00:33:39.100 | So text might actually be a good driver
00:33:41.980 | to enhance the data.
00:33:44.460 | So then these connections might be made more easily.
00:33:47.220 | That's an idea that we start seeing in language,
00:33:51.240 | but obviously beyond language this is gonna be effective.
00:33:55.180 | It's not like I don't show you a screen
00:33:57.460 | and you from scratch, you're supposed to learn a game.
00:34:01.000 | There is a lot of context we might set.
00:34:03.380 | So there might be some work needed as well
00:34:05.860 | to set that context.
00:34:07.780 | But anyways, there's a lot of work.
00:34:10.420 | - So that context puts all the different modalities
00:34:13.540 | on the same level ground.
00:34:14.980 | - Exactly. - If you provide
00:34:15.820 | the context best.
00:34:16.660 | So maybe on that point,
00:34:18.980 | so there's this task which may not seem trivial
00:34:23.100 | of tokenizing the data, of converting the data into pieces,
00:34:28.100 | into basic atomic elements
00:34:31.300 | that then could cross modalities somehow.
00:34:35.300 | So what's tokenization?
00:34:37.900 | How do you tokenize text?
00:34:39.680 | How do you tokenize images?
00:34:42.180 | How do you tokenize games and actions and robotics tasks?
00:34:47.060 | - Yeah, that's a great question.
00:34:48.220 | So tokenization is the entry point
00:34:52.820 | to actually make all the data look like a sequence
00:34:55.580 | because tokens then are just kind of
00:34:57.660 | these little puzzle pieces.
00:34:59.500 | We break down anything into these puzzle pieces
00:35:01.740 | and then we just model,
00:35:03.460 | what's this puzzle look like, right?
00:35:05.340 | When you make it lay down in a line,
00:35:07.700 | so to speak, in a sequence.
00:35:09.500 | So in Gato, the text, there's a lot of work.
00:35:14.500 | You tokenize text usually by looking
00:35:17.340 | at commonly used substrings, right?
00:35:20.020 | So there's, you know, ing in English
00:35:22.500 | is a very common substring.
00:35:23.660 | So that becomes a token.
00:35:25.500 | There's quite well studied problem on tokenizing text
00:35:29.060 | and Gato just use the standard techniques
00:35:31.580 | that have been developed from many years,
00:35:34.300 | even starting from n-gram models in the 1950s and so on.
00:35:37.940 | - Just for context, how many tokens,
00:35:40.180 | like what order of magnitude,
00:35:41.780 | number of tokens is required for a word?
00:35:44.460 | - Yeah. - Usually.
00:35:45.300 | What are we talking about?
00:35:46.180 | - Yeah, for a word in English, right?
00:35:48.620 | I mean, every language is very different.
00:35:51.100 | The current level or granularity of tokenization
00:35:53.900 | generally means it's maybe two to five.
00:35:57.780 | I mean, I don't know the statistics exactly,
00:36:00.140 | but to give you an idea,
00:36:02.100 | we don't tokenize at the level of letters,
00:36:04.100 | then it would probably be like,
00:36:05.460 | I don't know what the average length of a word
00:36:07.500 | is in English, but that would be, you know,
00:36:09.220 | the minimum set of tokens you could use.
00:36:11.380 | - So it's bigger than letters, smaller than words.
00:36:13.180 | - Yes, yes.
00:36:14.020 | And you could think of very, very common words like the,
00:36:16.860 | I mean, that would be a single token,
00:36:18.780 | but very quickly you're talking two, three, four,
00:36:21.500 | four tokens or so.
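For intuition, here is a toy greedy longest-match subword tokenizer over a tiny hand-written vocabulary; it is nothing like the real vocabularies used in these models, but it shows why a word usually becomes a handful of tokens rather than one per letter:

```python
# Toy subword tokenizer: greedy longest-match against a tiny made-up vocabulary.
VOCAB = {"play": 1, "ing": 2, "the": 3, "inter": 4, "view": 5, "er": 6,
         "p": 7, "l": 8, "a": 9, "y": 10, "i": 11, "n": 12, "g": 13}

def tokenize(word: str) -> list[int]:
    tokens, i = [], 0
    while i < len(word):
        # Try the longest matching vocabulary entry starting at position i.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in VOCAB:
                tokens.append(VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i:]!r}")
    return tokens

print(tokenize("playing"))      # [1, 2]    -> "play" + "ing"
print(tokenize("interviewer"))  # [4, 5, 6] -> "inter" + "view" + "er"
```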
00:36:22.340 | - Have you ever tried to tokenize emojis?
00:36:24.740 | - Emojis are actually just sequences of letters.
00:36:29.420 | So- - Maybe to you,
00:36:30.940 | but to me, they mean so much more.
00:36:32.980 | - Yeah, you can render the emoji,
00:36:34.380 | but you might, if you actually just-
00:36:36.780 | - Yeah, this is a philosophical question.
00:36:38.940 | Is emojis an image or a text?
00:36:43.300 | - The way we do these things is they're actually mapped
00:36:46.900 | to small sequences of characters.
00:36:49.540 | So you can actually play with these models
00:36:52.580 | and input emojis, it will output emojis back,
00:36:55.780 | which is actually quite a fun exercise.
00:36:57.900 | You probably can find other tweets about these out there.
00:37:02.300 | But yeah, so anyways, text, there's like,
00:37:04.460 | it's very clear how this is done.
00:37:06.780 | And then in Gato, what we did for images
00:37:10.620 | is we map images to essentially,
00:37:13.780 | we compressed images, so to speak,
00:37:15.460 | into something that looks more like,
00:37:17.460 | less like every pixel with every intensity
00:37:21.300 | that would mean we have a very long sequence, right?
00:37:23.820 | Like if we were talking about 100 by 100 pixel images,
00:37:27.300 | that would make the sequences far too long.
00:37:29.940 | So what was done there is you just use a technique
00:37:33.340 | that essentially compresses an image
00:37:35.860 | into maybe 16 by 16 patches of pixels.
00:37:40.180 | And then that is mapped, again, tokenized.
00:37:42.740 | You just essentially quantize this space
00:37:45.380 | into a special word that actually maps
00:37:49.020 | to this little sequence of pixels.
00:37:51.820 | And then you put the pixels together in some raster order,
00:37:55.140 | and then that's how you get out
00:37:57.820 | or in the image that you're processing.
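A rough sketch of that image path, assuming a hypothetical learned codebook: cut the image into 16x16 patches in raster order and replace each patch with the ID of its nearest codebook entry, so a whole image becomes a short sequence of integers:

```python
import numpy as np

PATCH = 16
CODEBOOK = np.random.rand(1024, PATCH * PATCH * 3)  # hypothetical learned patch codebook

def tokenize_image(image: np.ndarray) -> list[int]:
    """image: (H, W, 3) float array with H and W multiples of 16."""
    h, w, _ = image.shape
    tokens = []
    for y in range(0, h, PATCH):            # raster order: left-to-right, top-to-bottom
        for x in range(0, w, PATCH):
            patch = image[y:y + PATCH, x:x + PATCH].reshape(-1)
            # Quantize: index of the closest codebook vector (lossy compression).
            idx = int(np.argmin(np.linalg.norm(CODEBOOK - patch, axis=1)))
            tokens.append(idx)
    return tokens

img = np.random.rand(96, 96, 3)
print(len(tokenize_image(img)))  # 36 tokens for a 96x96 image (6 x 6 patches)
```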
00:38:00.820 | - But there's no semantic aspect to that.
00:38:04.060 | So you're doing some kind of,
00:38:05.860 | you don't need to understand anything about the image
00:38:07.780 | in order to tokenize it currently.
00:38:09.660 | - No, you're only using this notion of compression.
00:38:12.620 | So you're trying to find common,
00:38:15.100 | it's like JPG or all these algorithms,
00:38:17.660 | it's actually very similar at the tokenization level.
00:38:20.540 | All we're doing is finding common patterns
00:38:23.340 | and then making sure in a lossy way,
00:38:25.860 | we compress these images,
00:38:27.260 | given the statistics of the images
00:38:29.540 | that are contained in all the data we deal with.
00:38:31.860 | - Although you could probably argue that JPG
00:38:34.220 | does have some understanding of images.
00:38:36.660 | Because visual information, maybe color,
00:38:42.940 | compressing crudely based on color
00:38:46.980 | does capture something important about an image
00:38:51.180 | that's about its meaning, not just about some statistics.
00:38:54.620 | - Yeah, I mean, JPG, as I said,
00:38:56.660 | the algorithms look actually very similar to,
00:38:59.420 | they use the cosine transform in JPG.
00:39:02.820 | The approach we usually do in machine learning
00:39:07.100 | when we deal with images and we do this quantization step
00:39:10.140 | is a bit more data-driven.
00:39:11.380 | So rather than have some sort of Fourier basis
00:39:14.140 | for how frequencies appear in the natural world,
00:39:18.900 | we actually just use the statistics of the images
00:39:23.820 | and then quantize them based on the statistics,
00:39:26.980 | much like you do in words, right?
00:39:28.300 | So common substrings are allocated a token,
00:39:32.420 | and images is very similar.
00:39:34.420 | But there's no connection, the token space,
00:39:38.260 | if you think of, oh, like the tokens are an integer
00:39:41.060 | and in the end of the day.
00:39:42.420 | So now like we work on, maybe we have about,
00:39:46.180 | let's say, I don't know the exact numbers,
00:39:47.980 | but let's say 10,000 tokens for text, right?
00:39:51.180 | Certainly more than characters
00:39:52.820 | because we have groups of characters and so on.
00:39:55.340 | So from one to 10,000, those are representing
00:39:58.300 | all the language and the words we'll see.
00:40:00.980 | And then images occupy the next set of integers.
00:40:04.180 | So they're completely independent, right?
00:40:05.820 | So from 10,001 to 20,000, those are the tokens
00:40:09.860 | that represent these other modality images.
00:40:12.780 | And that is an interesting aspect
00:40:16.940 | that makes it orthogonal.
00:40:18.660 | So what connects these concepts is the data, right?
00:40:21.620 | Once you have a data set, for instance,
00:40:24.460 | that captions images, that tells you,
00:40:26.900 | oh, this is someone playing a Frisbee on a green field.
00:40:30.500 | Now the model will need to predict the tokens
00:40:34.580 | from the text green field to then the pixels,
00:40:37.780 | and that will start making the connections
00:40:39.740 | between the tokens.
00:40:40.580 | So these connections happen as the algorithm learns.
00:40:43.620 | And then the last, if we think of these integers,
00:40:45.820 | the first few are words, the next few are images.
00:40:48.740 | In Gato, we also allocated the highest order of integers
00:40:53.740 | to actions, right?
00:40:56.260 | Which we discretize and actions are very diverse, right?
00:40:59.940 | In Atari, there's, I don't know, maybe 17 discrete actions.
00:41:04.100 | In robotics, actions might be torques
00:41:06.940 | and forces that we apply.
00:41:08.220 | So we just use kind of similar ideas
00:41:11.180 | to compress these actions into tokens.
00:41:14.300 | And then we just, that's how we map now all the space
00:41:18.660 | to these sequence of integers.
00:41:20.780 | But they occupy different space,
00:41:22.420 | and what connects them is then the learning algorithm.
00:41:24.820 | That's where the magic happens.
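A small sketch of that offset scheme (the boundaries here are illustrative, not Gato's actual ones): each modality gets its own disjoint slice of the integer token space, so an image token can never collide with a text token, and everything feeds one shared model:

```python
# Illustrative modality offsets; the real vocabulary sizes differ.
TEXT_VOCAB_SIZE   = 10_000
IMAGE_VOCAB_SIZE  = 10_000
ACTION_VOCAB_SIZE = 1_024

TEXT_OFFSET   = 0
IMAGE_OFFSET  = TEXT_OFFSET + TEXT_VOCAB_SIZE        # 10_000
ACTION_OFFSET = IMAGE_OFFSET + IMAGE_VOCAB_SIZE      # 20_000
TOTAL_VOCAB   = ACTION_OFFSET + ACTION_VOCAB_SIZE    # one shared embedding table

def text_token(i: int) -> int:   return TEXT_OFFSET + i
def image_token(i: int) -> int:  return IMAGE_OFFSET + i
def action_token(i: int) -> int: return ACTION_OFFSET + i

# One flat sequence mixing modalities, ready for a single transformer:
sequence = [text_token(42), image_token(7), image_token(311), action_token(3)]
print(sequence)  # [42, 10007, 10311, 20003]
```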
00:41:26.260 | - So the modalities are orthogonal
00:41:28.780 | to each other in token space.
00:41:30.300 | - Right, right.
00:41:31.140 | - So in the input, everything you add,
00:41:33.620 | you add extra tokens.
00:41:35.220 | - Right.
00:41:36.060 | - And then you're shoving all of that into one place.
00:41:40.420 | - Yes, the transformer.
00:41:41.620 | - And that transformer,
00:41:42.740 | that transformer tries to look at this gigantic token space
00:41:47.740 | and tries to form some kind of representation,
00:41:52.220 | some kind of unique wisdom
00:41:56.740 | about all of these different modalities.
00:41:59.220 | How's that possible?
00:42:02.100 | If you were to sort of put your psychoanalysis hat on
00:42:06.500 | and try to psychoanalyze this neural network,
00:42:09.380 | is it schizophrenic?
00:42:11.740 | Does it try to, given this very few weights,
00:42:16.740 | represent multiple disjoint things
00:42:19.540 | and somehow have them not interfere with each other?
00:42:22.780 | Or is it somehow building on the joint strength,
00:42:27.780 | on whatever is common to all the different modalities?
00:42:31.700 | If you were to ask a question, is it schizophrenic
00:42:35.580 | or is it of one mind?
00:42:38.660 | - I mean, it is one mind,
00:42:41.020 | and it's actually the simplest algorithm,
00:42:44.340 | which that's kind of in a way how it feels
00:42:47.420 | like the field hasn't changed since backpropagation
00:42:51.660 | and gradient descent was proposed
00:42:53.620 | for learning neural networks.
00:42:55.700 | So there is obviously details on the architecture.
00:42:58.660 | This has evolved.
00:42:59.580 | The current iteration is still the transformer,
00:43:03.020 | which is a powerful sequence modeling architecture.
00:43:07.380 | But then the goal of setting these weights
00:43:12.220 | to predict the data is essentially the same
00:43:15.460 | as basically I could describe,
00:43:17.180 | I mean, we described a few years ago,
00:43:18.620 | AlphaStar, language modeling, and so on, right?
00:43:21.540 | We take, let's say, an Atari game.
00:43:24.540 | We map it to a string of numbers
00:43:27.580 | that will all be probably image space
00:43:30.300 | and action space interleaved.
00:43:32.380 | And all we're gonna do is say,
00:43:34.060 | okay, given the numbers,
00:43:37.260 | you know, 10,001, 10,004, 10,005,
00:43:40.380 | the next number that comes is 20,006,
00:43:43.220 | which is in the action space.
00:43:45.380 | And you're just optimizing these weights
00:43:48.820 | via very simple gradients,
00:43:51.660 | like, you know, mathematically it's almost
00:43:53.500 | the most boring algorithm you could imagine.
00:43:55.860 | We set all the weights so that
00:43:57.780 | given this particular instance,
00:44:00.180 | these weights are set to maximize
00:44:03.180 | the probability of having seen
00:44:05.020 | this particular sequence of integers
00:44:07.260 | for this particular game.
00:44:09.100 | And then the algorithm does this
00:44:11.620 | for many, many, many iterations,
00:44:14.740 | looking at different modalities,
00:44:16.860 | different games, right?
00:44:17.860 | That's the mixture of the dataset we discussed.
00:44:20.460 | So in a way, it's a very simple algorithm.
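To make "set the weights to maximize the probability of having seen this sequence of integers" concrete, here is a minimal next-token training step in PyTorch, with a deliberately tiny stand-in model rather than the real transformer, and made-up vocabulary and batch sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 21_024, 64   # shared token space across modalities (sizes are illustrative)

class TinySequenceModel(nn.Module):
    """Stand-in for the transformer: embeds tokens and predicts the next one."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)   # one shared embedding table
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.out = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):                  # tokens: (batch, time)
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)                      # logits: (batch, time, VOCAB)

model = TinySequenceModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

batch = torch.randint(0, VOCAB, (8, 32))        # mixed-modality token sequences
logits = model(batch[:, :-1])                   # predict token t+1 from tokens <= t
loss = F.cross_entropy(logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1))
loss.backward()                                 # maximizing likelihood = minimizing cross-entropy
opt.step()
```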
00:44:24.020 | And the weights, right, they're all shared, right?
00:44:27.540 | So in terms of, is it focusing on one modality or not,
00:44:30.900 | the intermediate weights that are converting
00:44:33.180 | from these input of integers
00:44:35.140 | to the target integer you're predicting next,
00:44:37.660 | those weights certainly are common.
00:44:40.300 | And then the way the tokenization happens,
00:44:43.380 | there is a special place in the neural network,
00:44:45.820 | which is we map this integer, like number 10,001,
00:44:49.780 | to a vector of real numbers, like real numbers.
00:44:53.700 | We can optimize them with gradient descent, right?
00:44:56.100 | The functions we learn are actually
00:44:58.260 | surprisingly differentiable.
00:44:59.700 | That's why we compute gradients.
00:45:01.700 | So this step is the only one
00:45:03.900 | that this orthogonality dimension applies.
00:45:06.540 | So mapping a certain token for text or image or actions,
00:45:11.540 | each of these tokens gets its own little vector
00:45:15.020 | of real numbers that represents this.
00:45:17.180 | If you look at the field back many years ago,
00:45:19.540 | people were talking about word vectors or word embeddings.
00:45:23.460 | These are the same.
00:45:24.300 | We have word vectors or embeddings.
00:45:25.980 | We have image vector or embeddings
00:45:28.860 | and action vector of embeddings.
00:45:30.900 | And the beauty here is that as you train this model,
00:45:33.900 | if you visualize these little vectors,
00:45:36.660 | it might be that they start aligning
00:45:38.460 | even though they're independent parameters.
00:45:41.100 | They could be anything,
00:45:42.860 | but then it might be that you take the word gato or cat,
00:45:47.460 | which maybe is common enough that it actually
00:45:49.020 | has its own token.
00:45:50.220 | And then you take pixels that have a cat,
00:45:52.380 | and you might start seeing that these vectors
00:45:55.300 | look like they align, right?
00:45:57.420 | So by learning from this vast amount of data,
00:46:00.660 | the model is realizing the potential connections
00:46:03.940 | between these modalities.
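One way to picture that alignment is as a probe over the learned embedding table. Everything below is hypothetical (random vectors, made-up token IDs); it only shows what "the cat word vector and the cat image-patch vector drift together" would mean numerically:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical learned embedding table: one row per token in the shared space.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((21_024, 64))

cat_word_token  = 1_234     # made-up ID for the text token "cat"
cat_patch_token = 10_777    # made-up ID for an image-patch token that often shows cats

# After training, one could probe whether the two vectors have moved toward each other:
print(cosine(embeddings[cat_word_token], embeddings[cat_patch_token]))
```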
00:46:05.660 | Now I will say there will be another way,
00:46:07.860 | at least in part, to not have these different vectors
00:46:12.860 | for each different modality.
00:46:15.500 | For instance, when I tell you about actions in a certain space,
00:46:20.220 | I'm defining actions by words, right?
00:46:22.820 | So you could imagine a world in which I'm not learning
00:46:26.500 | that the action up in Atari is its own number.
00:46:31.180 | The action up in Atari maybe is literally the word
00:46:34.380 | or the sentence up in Atari, right?
00:46:37.300 | And that would mean we now leverage
00:46:39.380 | much more from the language.
00:46:41.020 | This is not what we did here,
00:46:42.500 | but certainly it might make these connections
00:46:45.660 | much easier to learn and also to teach the model
00:46:49.060 | to correct its own actions and so on, right?
00:46:51.260 | So all this to say that gato is indeed the beginning,
00:46:55.860 | that it is a radical idea to do this this way,
00:46:59.420 | but there's probably a lot more to be done
00:47:02.340 | and the results to be more impressive,
00:47:04.420 | not only through scale, but also through some new research
00:47:07.940 | that will come hopefully in the years to come.
00:47:10.460 | - So just to elaborate quickly,
00:47:12.260 | you mean one possible next step
00:47:16.660 | or one of the paths that you might take next
00:47:20.180 | is doing the tokenization fundamentally
00:47:25.180 | as a kind of linguistic communication.
00:47:28.260 | So like you convert even images into language.
00:47:31.340 | So doing something like a crude semantic segmentation,
00:47:35.540 | trying to just assign a bunch of words to an image
00:47:38.340 | that like have almost like a dumb entity
00:47:42.300 | explaining as much as it can about the image.
00:47:45.300 | And so you convert that into words
00:47:46.900 | and then you convert games into words
00:47:49.260 | and then you provide the context in words and all of it.
00:47:53.500 | And eventually getting to a point
00:47:56.300 | where everybody agrees with Noam Chomsky
00:47:58.100 | that language is actually at the core of everything.
00:48:00.940 | That it's the base layer of intelligence and consciousness
00:48:04.980 | and all that kind of stuff, okay.
00:48:07.500 | You mentioned early on like it's hard to grow.
00:48:11.260 | What did you mean by that?
00:48:12.780 | 'Cause we're talking about scale might change.
00:48:15.700 | There might be, and we'll talk about this too,
00:48:18.980 | like there's a emergent,
00:48:22.940 | there's certain things about these neural networks
00:48:25.020 | that are emergent.
00:48:25.860 | So certain like performance we can see only with scale
00:48:28.980 | and there's some kind of threshold of scale.
00:48:30.980 | So why is it hard to grow something like this Meow Network?
00:48:35.980 | - So the Meow Network, it's not hard to grow
00:48:41.140 | if you retrain it.
00:48:42.620 | What's hard is, well, we have now 1 billion parameters.
00:48:46.860 | We train them for a while.
00:48:48.140 | We spend some amount of work
00:48:50.740 | towards building these weights
00:48:53.140 | that are an amazing initial brain
00:48:55.900 | for doing these kind of tasks we care about.
00:48:58.860 | Could we reuse the weights and expand to a larger brain?
00:49:03.860 | And that is extraordinarily hard,
00:49:06.700 | but also exciting from a research perspective
00:49:10.100 | and a practical point of view, right?
00:49:12.580 | So there's this notion of modularity in software engineering
00:49:17.580 | and we're starting to see some examples
00:49:20.500 | and work that leverages modularity.
00:49:23.340 | In fact, if we go back one step from Gato
00:49:26.340 | to a work that, I would say, trained a much larger,
00:49:29.700 | much more capable network called Flamingo.
00:49:32.580 | Flamingo did not deal with actions,
00:49:34.340 | but it definitely dealt with images in an interesting way,
00:49:38.460 | kind of akin to what Gato did,
00:49:40.300 | but slightly different technique for tokenizing,
00:49:43.020 | but we don't need to go into that detail.
00:49:45.420 | But what Flamingo also did, which Gato didn't do,
00:49:49.380 | and that just happens because these projects,
00:49:51.620 | you know, they're different.
00:49:53.580 | You know, it's a bit of like the exploratory nature
00:49:55.900 | of research, which is great.
00:49:57.260 | - The research behind these projects is also modular.
00:50:00.620 | - Yes, exactly.
00:50:01.860 | And it has to be, right?
00:50:02.780 | We need to have creativity
00:50:05.620 | and sometimes you need to protect pockets of, you know,
00:50:08.860 | people, researchers, and so on.
00:50:10.340 | - By we, you mean humans.
00:50:11.860 | - Yes. - Okay.
00:50:12.860 | - And also in particular researchers
00:50:14.620 | and maybe even further, you know,
00:50:16.780 | DeepMind or other such labs.
00:50:18.860 | - And then the neural networks themselves.
00:50:21.020 | So it's modularity all the way down.
00:50:23.380 | - All the way down.
00:50:24.260 | So the way that we did modularity very beautifully
00:50:27.540 | in Flamingo is we took Chinchilla,
00:50:30.140 | which is a language only model,
00:50:32.860 | not an agent if we think of actions
00:50:34.700 | being necessary for agency.
00:50:36.740 | So we took Chinchilla, we took the weights of Chinchilla,
00:50:40.980 | and then we froze them.
00:50:42.820 | We said, "These don't change."
00:50:44.820 | We trained them to be very good at predicting the next word.
00:50:47.580 | It's a very good language model,
00:50:49.460 | state of the art at the time you release it,
00:50:51.260 | et cetera, et cetera.
00:50:52.980 | We're gonna add a capability to see, right?
00:50:55.540 | We are gonna add the ability to see
00:50:56.980 | to this language model.
00:50:58.340 | So we're gonna attach small pieces of neural networks
00:51:01.980 | at the right places in the model.
00:51:03.900 | It's almost like injecting the network
00:51:07.940 | with some weights and some substructures
00:51:10.780 | in the right ways, in a good way, right?
00:51:12.860 | So you need the research to say what is effective,
00:51:15.300 | how do you add this capability
00:51:16.740 | without destroying others, et cetera.
00:51:18.860 | So we created a small sub-network,
00:51:23.500 | initialized not from random,
00:51:25.420 | but actually from self-supervised learning,
00:51:28.820 | that, you know, a model that understands vision in general.
00:51:32.900 | And then we took datasets that connect the two modalities,
00:51:37.340 | vision and language.
00:51:38.820 | And then we froze the main part,
00:51:41.260 | the largest portion of the network, which was Chinchilla,
00:51:43.780 | that is 70 billion parameters.
00:51:46.020 | And then we added a few more parameters on top,
00:51:49.300 | trained from scratch,
00:51:50.580 | and then some others that were pre-trained
00:51:52.700 | from like, with the capacity to see.
00:51:55.340 | Like it was not tokenization in the way I described for Gato,
00:51:58.900 | but it's a similar idea.
00:52:01.500 | And then we trained the whole system.
00:52:03.700 | Parts of it were frozen, parts of it were new.
00:52:06.700 | And all of a sudden we developed Flamingo,
00:52:09.780 | which is an amazing model that is essentially,
00:52:12.700 | I mean, describing it is a chatbot
00:52:15.140 | where you can also upload images
00:52:17.100 | and start conversing about images,
00:52:20.060 | but it's also kind of a dialogue style chatbot.
00:52:23.860 | - So the input is images and text,
00:52:25.900 | and the output is text. - Yes, exactly.
00:52:28.060 | And- - How many parameters?
00:52:29.500 | You said 70 billion for Chinchilla.
00:52:31.940 | - Yeah, Chinchilla is 70 billion.
00:52:33.380 | And then the ones we add on top,
00:52:34.780 | which is kind of almost like a way to overwrite
00:52:39.340 | its little activations so that when it sees vision,
00:52:42.540 | it does kind of a correct computation
00:52:44.700 | of what it's seeing, mapping it back to words, so to speak.
00:52:48.100 | That adds an extra 10 billion parameters, right?
00:52:50.980 | So it's total 80 billion, the largest one we released.
00:52:54.100 | And then you train it on a few data sets
00:52:57.460 | that contain vision and language.
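A minimal sketch of the modular recipe being described, using stand-in module shapes and names rather than the real Flamingo architecture: freeze the pretrained language model, attach a small vision pathway plus a new cross-attention piece, and let the optimizer update only the added parameters.

```python
import torch
import torch.nn as nn

# Stand-in for the frozen pretrained language model ("Chinchilla" in the conversation);
# sizes and modules here are hypothetical.
language_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=6)
for p in language_model.parameters():
    p.requires_grad = False                                    # "these don't change"

vision_encoder = nn.Sequential(                                # stand-in for a pretrained vision model
    nn.Conv2d(3, 64, kernel_size=7, stride=4), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 512))
cross_attention = nn.MultiheadAttention(512, 8, batch_first=True)  # the new piece, trained from scratch

# Only the newly added weights are handed to the optimizer.
opt = torch.optim.Adam([*vision_encoder.parameters(), *cross_attention.parameters()])

def forward(image, text_embeddings):
    img = vision_encoder(image).unsqueeze(1)                   # (B, 1, 512) visual summary
    txt = language_model(text_embeddings)                      # frozen computation over the text
    fused, _ = cross_attention(txt, img, img)                  # text positions query the image
    return fused
```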
00:52:59.460 | And once you interact with the model,
00:53:01.260 | you start seeing that you can upload an image
00:53:04.340 | and start sort of having a dialogue about the image,
00:53:08.100 | which is actually not something,
00:53:09.580 | it's very similar and akin to what we saw
00:53:11.900 | in language only, these prompting abilities that it has.
00:53:15.380 | You can teach it a new vision task, right?
00:53:17.860 | It does things beyond the capabilities that, in theory,
00:53:21.620 | the data sets provided in themselves,
00:53:24.660 | but because it leverages a lot of the language knowledge
00:53:27.260 | acquired from Chinchilla,
00:53:29.020 | it actually has this few-shot learning ability
00:53:31.900 | and these emerging abilities that we didn't even measure
00:53:34.780 | once we were developing the model.
00:53:36.580 | But once developed, then as you play with the interface,
00:53:40.220 | you can start seeing, wow, okay, yeah,
00:53:41.820 | it's cool, we can upload, I think,
00:53:44.300 | one of the tweets talking about Twitter
00:53:45.940 | was this image of Obama that is placing a weight
00:53:49.980 | and someone is kind of weighing themselves
00:53:52.540 | and it's kind of a joke-style image.
00:53:55.060 | And it's notable because I think Andrej Karpathy
00:53:58.020 | a few years ago said, "No computer vision system
00:54:00.860 | "can understand the subtlety of this joke in this image,
00:54:04.780 | "all the things that go on."
00:54:06.500 | And so what we try to do, and it's very anecdotally,
00:54:09.740 | I mean, this is not a proof that we solved this issue,
00:54:12.300 | but it just shows that you can upload now this image
00:54:15.860 | and start conversing with the model,
00:54:17.700 | trying to make out if it gets that there's a joke
00:54:21.500 | because the person weighing themselves
00:54:23.100 | doesn't see that someone behind is making the weight higher
00:54:26.820 | and so on and so forth.
00:54:27.980 | So it's a fascinating capability
00:54:30.020 | and it comes from this key idea of modularity
00:54:33.380 | where we took a frozen brain
00:54:34.940 | and we just added a new capability.
00:54:37.900 | So the question is, should we,
00:54:40.740 | so in a way you can see even from DeepMind,
00:54:42.860 | we have Flamingo that used this modular approach
00:54:46.420 | and thus could leverage scale a bit more reasonably
00:54:49.180 | because we didn't need to retrain a system from scratch.
00:54:52.340 | And on the other hand, we had Gato,
00:54:54.180 | which used the same datasets,
00:54:55.940 | but then it trained it from scratch, right?
00:54:57.500 | And so I guess big question for the community is,
00:55:01.660 | should we train from scratch or should we embrace modularity?
00:55:04.780 | And this goes back to modularity as a way to grow,
00:55:09.780 | but reuse seems like natural
00:55:12.140 | and it was very effective, certainly.
00:55:15.020 | - The next question is, if you go the way of modularity,
00:55:19.060 | is there a systematic way of freezing weights
00:55:22.780 | and joining different modalities across,
00:55:27.100 | you know, not just two or three or four networks,
00:55:29.300 | but hundreds of networks
00:55:30.620 | from all different kinds of places,
00:55:32.420 | maybe open source network that looks at weather patterns
00:55:36.420 | and you shove that in somehow,
00:55:38.020 | and then you have networks that, I don't know,
00:55:40.500 | do all kinds of, play StarCraft
00:55:42.140 | and play all the other video games,
00:55:44.100 | and you can keep adding them in without significant effort,
00:55:49.100 | like maybe the effort scales linearly
00:55:52.540 | or something like that,
00:55:53.380 | as opposed to like the more network you add,
00:55:55.020 | the more you have to worry about the instabilities created.
00:55:57.980 | - Yeah, so that vision is beautiful.
00:56:00.020 | I think there's still the question
00:56:03.580 | about within single modalities, like Chinchilla was reused,
00:56:06.900 | but now if we train a next iteration of language models,
00:56:10.260 | are we gonna use Chinchilla or not?
00:56:11.900 | - Yeah, how do you swap out Chinchilla?
00:56:13.220 | - Right, so there's still big questions,
00:56:15.980 | but that idea is actually really akin
00:56:18.420 | to software engineering,
00:56:19.420 | which we're not re-implementing,
00:56:21.140 | you know, libraries from scratch,
00:56:22.420 | we're reusing and then building ever more amazing things,
00:56:25.460 | including neural networks with software that we're reusing.
00:56:29.060 | So I think this idea of modularity, I like it,
00:56:32.260 | I think it's here to stay,
00:56:33.980 | and that's also why I mentioned
00:56:35.980 | it's just the beginning, not the end.
00:56:38.300 | - You've mentioned meta-learning,
00:56:39.500 | so given this promise of Gato,
00:56:42.900 | can we try to redefine this term
00:56:46.100 | that's almost akin to consciousness,
00:56:47.700 | because it means different things to different people
00:56:50.260 | throughout the history of artificial intelligence,
00:56:52.500 | but what do you think meta-learning is and looks like
00:56:58.220 | now in the five years, 10 years,
00:57:00.140 | will it look like system like Gato, but scaled?
00:57:03.300 | What's your sense of,
00:57:04.260 | what does meta-learning look like, do you think,
00:57:08.380 | with all the wisdom we've learned so far?
00:57:10.580 | - Yeah, great question,
00:57:11.660 | maybe it's good to give another data point
00:57:14.620 | looking backwards rather than forward.
00:57:16.300 | So when we talked in 2019,
00:57:20.660 | meta-learning meant something that has changed
00:57:26.620 | mostly through the revolution of GPT-3 and beyond.
00:57:31.260 | So what meta-learning meant at the time
00:57:34.060 | was driven by what benchmarks people care about
00:57:37.780 | in meta-learning,
00:57:38.940 | and the benchmarks were about
00:57:40.740 | a capability to learn about object identities,
00:57:45.100 | so it was very much over-fitted
00:57:47.500 | to vision and object classification,
00:57:50.460 | and the part that was meta about that was that,
00:57:53.020 | oh, we're not just learning a thousand categories
00:57:55.420 | that ImageNet tells us to learn,
00:57:57.140 | we're gonna learn object categories that can be defined
00:58:00.580 | when we interact with the model.
00:58:03.380 | So it's interesting to see the evolution, right?
00:58:06.740 | The way this started was we have a special language
00:58:10.860 | that was a data set, a small data set
00:58:13.340 | that we prompted the model with,
00:58:15.380 | saying, hey, here is a new classification task,
00:58:18.900 | I'll give you one image and the name,
00:58:21.860 | which was an integer at the time of the image,
00:58:24.460 | and a different image, and so on.
00:58:26.060 | So you have a small prompt in the form of a data set,
00:58:30.100 | a machine learning data set,
00:58:31.700 | and then you got then a system that could then predict
00:58:35.580 | or classify these objects
00:58:37.020 | that you just defined kind of on the fly.
00:58:39.420 | So fast forward,
00:58:43.220 | it was revealed that language models are few-shot learners,
00:58:47.500 | that's the title of the paper, so very good title.
00:58:50.140 | Sometimes titles are really good,
00:58:51.580 | so this one is really, really good,
00:58:53.580 | because that's the point of GPT-3,
00:58:56.220 | that showed that, look, sure,
00:58:58.820 | we can focus on object classification
00:59:00.980 | and what meta-learning means
00:59:02.580 | within the space of learning object categories,
00:59:05.460 | this goes beyond, or before, rather,
00:59:07.460 | to also Omniglot, before ImageNet, and so on.
00:59:10.060 | So there's a few benchmarks.
00:59:11.500 | To now, all of a sudden,
00:59:13.020 | we're a bit unlocked from benchmarks,
00:59:15.220 | and through language, we can define tasks, right?
00:59:17.900 | So we're literally telling the model some logical task
00:59:21.580 | or little thing that we wanted to do.
00:59:23.860 | We prompt it much like we did before,
00:59:25.900 | but now we prompt it through natural language.
00:59:28.460 | And then, not perfectly,
00:59:30.420 | I mean, these models have failure modes, and that's fine,
00:59:33.180 | but these models then are now doing a new task, right?
00:59:37.140 | So they meta-learn this new capability.
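As a small illustration of defining a task purely through a prompt, in the spirit of the few-shot learning just described; the rule and the word-to-number mapping below are entirely made up.

```python
# A hypothetical prompt in the spirit of "Language Models are Few-Shot Learners":
# the "training set" for the new task lives entirely inside the prefix, and no weights are updated.
prompt = """Map each made-up phrase to a number.
blip -> 3
blorp -> 7
blip blip -> 6
blorp blip ->"""
# A sufficiently capable language model, asked to continue this text, will often answer "10",
# having meta-learned the rule (add the values) from the examples alone.
```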
00:59:40.460 | Now, that's where we are now.
00:59:43.380 | Flamingo expanded this to visual and language,
00:59:47.220 | but it basically has the same abilities.
00:59:49.300 | You can teach it, for instance,
00:59:51.540 | an emergent property was that
00:59:53.260 | you can take pictures of numbers
00:59:55.260 | and then do arithmetic with the numbers
00:59:57.780 | just by teaching it, "Oh, that's,
00:59:59.900 | "when I show you three plus six,
01:00:01.980 | "I want you to output nine,
01:00:03.620 | "and you show it a few examples, and now it does that."
01:00:06.660 | So it went way beyond this ImageNet
01:00:10.180 | sort of categorization of images
01:00:12.620 | that we were a bit stuck, maybe,
01:00:14.140 | before this revelation moment that happened in,
01:00:19.020 | I believe it was 2019, but it was after we chatted.
01:00:21.860 | - And that way it has solved meta-learning
01:00:24.260 | as was previously defined.
01:00:26.020 | - Yes, it expanded what it meant.
01:00:27.700 | So that's what you say, what does it mean?
01:00:29.460 | So it's an evolving term.
01:00:31.300 | But here is maybe now looking forward,
01:00:35.140 | looking at what's happening,
01:00:37.540 | obviously in the community with more modalities,
01:00:41.340 | what we can expect.
01:00:42.420 | And I would certainly hope to see the following,
01:00:44.900 | and this is a pretty drastic hope,
01:00:48.340 | but in five years, maybe we chat again.
01:00:51.140 | And we have a system, right, a set of weights
01:00:55.860 | that we can teach it to play StarCraft.
01:00:59.780 | Maybe not at the level of AlphaStar,
01:01:01.420 | but play StarCraft, a complex game.
01:01:03.620 | We teach it through interactions to prompting.
01:01:06.860 | You can certainly prompt a system,
01:01:08.460 | that's what Gato shows, to play some simple Atari games.
01:01:11.700 | So imagine if you start talking to a system,
01:01:15.300 | teaching it a new game,
01:01:16.780 | showing it examples of, in this particular game,
01:01:20.940 | this user did something good.
01:01:22.740 | Maybe the system can even play and ask you questions,
01:01:25.420 | say, "Hey, I played this game.
01:01:26.940 | I just played this game.
01:01:27.860 | Did I do well?
01:01:29.060 | Can you teach me more?"
01:01:30.420 | So five, maybe to 10 years, these capabilities,
01:01:34.780 | or what meta-learning means,
01:01:36.180 | will be much more interactive, much more rich,
01:01:38.860 | and through domains that we were specializing, right?
01:01:41.620 | So you see the difference, right?
01:01:42.900 | We built AlphaStar specialized to play StarCraft.
01:01:46.980 | The algorithms were general,
01:01:48.220 | but the weights were specialized.
01:01:50.420 | And what we're hoping is that we can teach a network
01:01:54.180 | to play games, to play any game,
01:01:56.580 | just using games as an example,
01:01:58.580 | through interacting with it, teaching it,
01:02:01.500 | uploading the Wikipedia page of StarCraft.
01:02:03.740 | Like this is on the horizon,
01:02:06.100 | and obviously there are details that need to be filled in,
01:02:09.340 | and research that needs to be done.
01:02:10.940 | But that's how I see meta-learning evolving,
01:02:13.220 | which is gonna be beyond prompting.
01:02:15.380 | It's gonna be a bit more interactive.
01:02:17.060 | It's gonna, you know, the system might tell us
01:02:19.820 | to give it feedback after it maybe makes mistakes
01:02:22.340 | or it loses a game, but it's nonetheless very exciting
01:02:26.260 | because if you think about this this way,
01:02:29.020 | the benchmarks are already there.
01:02:30.620 | We just repurpose the benchmarks, right?
01:02:33.180 | So in a way, I like to map the space
01:02:36.980 | of what maybe AGI means to say,
01:02:40.340 | okay, like we went 101% performance in Go,
01:02:45.340 | in Chess, in StarCraft.
01:02:47.860 | The next iteration might be 20% performance
01:02:51.900 | across quote unquote all tasks, right?
01:02:54.700 | And even if it's not as good, it's fine.
01:02:56.300 | We actually, we have ways to also measure progress
01:02:59.940 | because we have those special agents,
01:03:01.620 | specialized agents, and so on.
01:03:04.180 | So this is to me very exciting.
01:03:06.220 | And these next iteration models
01:03:09.260 | are definitely hinting at that direction of progress,
01:03:13.380 | which hopefully we can have.
01:03:14.700 | There are obviously some things that could go wrong
01:03:17.580 | in terms of we might not have the tools,
01:03:20.100 | maybe transformers are not enough, then we must,
01:03:22.540 | there's some breakthroughs to come,
01:03:24.300 | which makes the field more exciting
01:03:26.300 | to people like me as well, of course.
01:03:28.620 | But that's, if you ask me five to 10 years,
01:03:32.100 | you might see these models that start to look more
01:03:34.300 | like weights that are already trained.
01:03:36.860 | And then it's more about teaching
01:03:39.540 | or making them meta-learn what you're trying to induce
01:03:42.540 | in terms of tasks and so on.
01:03:46.940 | Well beyond the simple tasks we're now starting to see emerge,
01:03:50.980 | like small arithmetic tasks and so on.
01:03:54.140 | - So a few questions around that.
01:03:55.700 | This is fascinating.
01:03:57.180 | So that kind of teaching, interactive,
01:04:01.420 | so it's beyond prompting,
01:04:02.740 | so it's interacting with the neural network,
01:04:05.180 | that's different than the training process.
01:04:08.380 | So it's different than the optimization
01:04:12.420 | over differentiable functions.
01:04:15.900 | This is already trained and now you're teaching,
01:04:18.620 | I mean, it's almost like akin to the brain,
01:04:24.180 | the neurons are already set with their connections.
01:04:26.900 | On top of that, you're now using that infrastructure
01:04:29.980 | to build up further knowledge.
01:04:32.620 | - Okay, so that's a really interesting distinction
01:04:36.700 | that's actually not obvious
01:04:38.060 | from a software engineering perspective,
01:04:40.340 | that there's a line to be drawn.
01:04:42.820 | 'Cause you always think for a neural network to learn,
01:04:44.900 | it has to be retrained, trained and retrained.
01:04:48.340 | But maybe, and prompting is a way of teaching
01:04:53.220 | a neural network a little bit of context
01:04:55.980 | about whatever the heck you're trying to do.
01:04:58.020 | So you can maybe expand this prompting capability
01:05:00.460 | by making it interact, that's really, really interesting.
01:05:04.220 | - Yeah, by the way, this is not,
01:05:06.380 | if you look at way back at different ways
01:05:09.220 | to tackle even classification tasks,
01:05:11.820 | so this comes from like long standing literature
01:05:16.460 | in machine learning.
01:05:18.260 | What I'm suggesting could sound to some
01:05:20.780 | like a bit like nearest neighbor.
01:05:23.420 | So nearest neighbor is almost the simplest algorithm
01:05:26.100 | that does not require learning.
01:05:30.060 | So it has this interesting,
01:05:31.740 | like you don't need to compute gradients.
01:05:34.340 | And what nearest neighbor does is you quote unquote,
01:05:37.500 | have a data set or upload a data set.
01:05:39.980 | And then all you need to do is a way to measure distance
01:05:43.060 | between points.
01:05:44.780 | And then to classify a new point,
01:05:46.660 | you're just simply computing what's the closest point
01:05:49.220 | in this massive amount of data.
01:05:51.260 | And that's my answer.
01:05:52.700 | So you can think of prompting in a way
01:05:55.500 | as you're uploading, not just simple points
01:05:58.620 | and the metric is not the distance between the images
01:06:02.420 | or something simple,
01:06:03.260 | it's something that you compute that's much more advanced,
01:06:06.020 | but in a way, it's very similar, right?
01:06:08.380 | You simply are uploading some knowledge
01:06:12.620 | to this pre-trained system in nearest neighbor,
01:06:15.060 | maybe the metric is learned or not,
01:06:17.260 | but you don't need to further train it.
01:06:19.460 | And then now you immediately get a classifier out of this.
01:06:23.700 | Now it's just an evolution of that concept,
01:06:25.820 | very classical concept in machine learning,
01:06:27.820 | which is just learning through what's the closest point,
01:06:32.180 | closest by some distance and that's it.
01:06:34.540 | It's an evolution of that.
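For reference, a minimal nearest-neighbor classifier along the lines just described, using plain Euclidean distance as a stand-in for whatever learned metric one might actually use; the 2D points and labels are made up purely for illustration.

```python
import numpy as np

def nearest_neighbor_classify(train_points, train_labels, query, metric=None):
    """Label the query by copying the label of the closest stored point; no gradients needed."""
    if metric is None:
        metric = lambda a, b: np.linalg.norm(a - b)   # Euclidean here; in practice the metric could be learned
    distances = [metric(p, query) for p in train_points]
    return train_labels[int(np.argmin(distances))]

# Tiny usage example with made-up 2D points: "uploading" a dataset and immediately getting a classifier.
points = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
labels = ["cat", "dog"]
print(nearest_neighbor_classify(points, labels, np.array([4.0, 4.5])))  # -> "dog"
```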
01:06:36.100 | And I will say how I saw meta-learning
01:06:39.020 | when we worked on a few ideas in 2016,
01:06:43.900 | was precisely through the lens of nearest neighbor,
01:06:47.220 | which is very common in computer vision community, right?
01:06:49.940 | There's a very active area of research
01:06:52.140 | about how do you compute the distance between two images,
01:06:55.460 | but if you have a good distance metric,
01:06:57.580 | you also have a good classifier, right?
01:06:59.940 | All I'm saying is now these distances
01:07:01.740 | and the points are not just images,
01:07:03.780 | they're like words or sequences of words and images
01:07:08.540 | and actions that teach you something new,
01:07:10.380 | but it might be that technique-wise, those come back.
01:07:14.740 | And I will say that it's not necessarily true
01:07:18.180 | that you might not ever train the weights a bit further.
01:07:21.780 | Some aspect of meta-learning,
01:07:23.900 | some techniques in meta-learning
01:07:26.020 | do actually do a bit of fine tuning, as it's called, right?
01:07:28.900 | They train the weights a little bit
01:07:31.100 | when they get a new task.
01:07:32.820 | So as for the how, or how we're gonna achieve this,
01:07:36.940 | as a deep learner, I'm very skeptical.
01:07:39.820 | We're gonna try a few things,
01:07:41.220 | whether it's a bit of training, adding a few parameters,
01:07:44.180 | thinking of these as nearest neighbor,
01:07:45.940 | or just simply thinking of there's a sequence of words,
01:07:49.180 | it's a prefix, and that's the new classifier.
01:07:52.980 | We'll see, right?
01:07:53.820 | That's the beauty of research,
01:07:55.420 | but what's important is that it is a good goal in itself
01:08:00.140 | that I see as very worthwhile pursuing
01:08:02.740 | for the next stages of not only meta-learning.
01:08:05.700 | I think this is basically what's exciting
01:08:08.460 | about machine learning, period, to me.
01:08:11.380 | - Well, and then the interactive aspect of that
01:08:13.740 | is also very interesting.
01:08:15.140 | - Yes. - The interactive version
01:08:16.380 | of nearest neighbor. (laughs)
01:08:18.420 | - Yeah. - To help you pull out
01:08:20.620 | the classifier from this giant thing.
01:08:23.740 | Okay, is this the way we can go in five, 10 plus years
01:08:28.740 | from any task, sorry, from many tasks to any task?
01:08:36.100 | So, and what does that mean?
01:08:39.420 | What does it need to be actually trained on?
01:08:41.620 | At which point has the network had enough?
01:08:45.460 | So what does a network need to learn about this world
01:08:50.460 | in order to be able to perform any task?
01:08:52.460 | Is it just as simple as language, image, and action?
01:08:57.460 | Or do you need some set of representative images?
01:09:01.820 | Like if you only see land images,
01:09:05.180 | will you know anything about underwater?
01:09:06.700 | Is that somehow fundamentally different?
01:09:08.740 | I don't know.
01:09:09.580 | - Those, I mean, those are awkward questions, I would say.
01:09:12.060 | I mean, the way you put it, let me maybe further your example.
01:09:15.020 | Right, if all you see is land images,
01:09:18.400 | but you're reading all about land and water worlds,
01:09:21.540 | but in books, right, imagine.
01:09:23.900 | Would that be enough?
01:09:25.380 | Good question.
01:09:26.460 | We don't know, but I guess maybe you can join us
01:09:30.380 | if you want in our quest to find this.
01:09:32.100 | That's precisely--
01:09:33.420 | - Water world, yeah.
01:09:34.340 | - Yes, that's precisely, I mean, the beauty of research
01:09:37.620 | and that's the research business we're in, I guess,
01:09:42.620 | is to figure these out and ask the right questions
01:09:46.220 | and then iterate with the whole community,
01:09:49.540 | publishing findings and so on.
01:09:52.420 | But yeah, this is a question.
01:09:55.100 | It's not the only question, but it's certainly, as you ask,
01:09:57.540 | is on my mind constantly, right?
01:10:00.020 | And so we'll need to wait for maybe the,
01:10:03.260 | let's say five years, let's hope it's not 10,
01:10:05.940 | to see what are the answers.
01:10:08.380 | Some people will largely believe in unsupervised
01:10:12.660 | or self-supervised learning of single modalities
01:10:15.460 | and then crossing them.
01:10:18.000 | Some people might think end-to-end learning is the answer.
01:10:21.680 | Modularity is maybe the answer.
01:10:23.780 | So we don't know,
01:10:24.960 | but we're just definitely excited to find out.
01:10:27.520 | - But it feels like this is the right time
01:10:29.280 | and we're at the beginning of this journey.
01:10:31.720 | We're finally ready to do these kind of general,
01:10:34.640 | big models and agents.
01:10:37.600 | What sort of specific technical thing
01:10:42.480 | about Gato, Flamingo, Chinchilla, Gopher,
01:10:47.360 | any of these is especially beautiful,
01:10:49.520 | that was surprising maybe?
01:10:51.640 | Is there something that just jumps out at you?
01:10:54.220 | Of course, there's the general thing of like,
01:10:57.560 | you didn't think it was possible
01:10:58.900 | and then you realize it's possible
01:11:01.700 | in terms of the generalizability across modalities
01:11:04.480 | and all that kind of stuff.
01:11:05.560 | Or maybe how small of a network, relatively speaking,
01:11:08.920 | Gato is, all that kind of stuff.
01:11:10.440 | But is there some weird little things that were surprising?
01:11:15.200 | - Look, I'll give you an answer that's very important
01:11:18.240 | because maybe people don't quite realize this,
01:11:22.600 | but the teams behind these efforts, the actual humans,
01:11:31.720 | that's maybe the surprising bit, in an obviously positive way.
01:11:31.720 | So anytime you see these breakthroughs,
01:11:34.580 | I mean, it's easy to map it to a few people.
01:11:37.160 | There's people that are great at explaining things
01:11:39.220 | and so on, that's very nice.
01:11:40.720 | But maybe the learnings or the meta learnings
01:11:44.680 | that I get as a human about this is,
01:11:47.400 | sure, we can move forward,
01:11:49.060 | but the surprising bit is how important
01:11:55.480 | are all the pieces of these projects,
01:11:58.720 | how do they come together?
01:12:00.040 | So I'll give you maybe some of the ingredients of success
01:12:04.440 | that are common across these,
01:12:06.440 | but not the obvious ones in machine learning.
01:12:08.480 | I can always also give you those.
01:12:11.320 | But basically, engineering is critical.
01:12:16.320 | So very good engineering,
01:12:19.600 | because ultimately we're collecting data sets, right?
01:12:23.760 | So the engineering of data
01:12:26.160 | and then of deploying the models at scale
01:12:29.740 | into some compute cluster, that cannot be overstated,
01:12:32.840 | that is a huge factor of success.
01:12:36.880 | And it's hard to believe that details matter so much.
01:12:41.560 | We would like to believe that it's true
01:12:44.040 | that there is more and more of a standard formula,
01:12:47.440 | as I was saying, like this recipe that works for everything.
01:12:50.560 | But then when you zoom in into each of these projects,
01:12:53.680 | then you realize the devil is indeed in the details.
01:12:57.840 | And then the teams have to work kind of together
01:13:01.520 | towards these goals.
01:13:03.040 | So engineering of data and obviously clusters
01:13:07.520 | and large scale is very important.
01:13:09.280 | And then one that is often not,
01:13:13.120 | maybe nowadays it is more clear,
01:13:15.080 | is benchmark progress, right?
01:13:17.160 | So we're talking here about multiple months
01:13:19.860 | of tens of researchers
01:13:22.120 | and people that are trying to organize the research
01:13:26.160 | and so on, working together.
01:13:28.080 | And you don't know that you can get there.
01:13:32.120 | I mean, this is the beauty.
01:13:34.360 | Like if you're not risking to trying to do something
01:13:37.320 | that feels impossible, you're not gonna get there,
01:13:40.540 | but you need a way to measure progress.
01:13:43.960 | So the benchmarks that you build are critical.
01:13:47.740 | I've seen this beautifully play out in many projects.
01:13:50.520 | I mean, maybe the one I've seen it more consistently,
01:13:53.880 | which means we establish the metric,
01:13:56.840 | actually the community did,
01:13:58.320 | and then we leverage that massively is AlphaFold.
01:14:01.560 | This is a project where the data,
01:14:04.520 | the metrics were all there.
01:14:06.120 | And all it took was, and it's easier said than done,
01:14:09.120 | an amazing team working,
01:14:11.640 | not to try to find some incremental improvement
01:14:14.760 | and publish, which is one way to do research that is valid,
01:14:17.940 | but aim very high and work literally for years
01:14:22.520 | to iterate over that process.
01:14:24.120 | And working for years with the team,
01:14:25.660 | I mean, it is tricky that also happened to happen
01:14:29.800 | partly during a pandemic and so on.
01:14:32.200 | So I think my meta learning from all this is,
01:14:35.280 | the teams are critical to the success.
01:14:37.960 | And then if now going to the machine learning,
01:14:40.200 | the part that's surprising is,
01:14:42.880 | so we like architectures like neural networks,
01:14:48.720 | and I would say this was a very rapidly evolving field
01:14:53.120 | until the transformer came.
01:14:54.960 | So attention might indeed be all you need,
01:14:58.160 | which is the title, also a good title,
01:15:00.280 | although it's only good in hindsight.
01:15:02.280 | I don't think at the time I thought
01:15:03.440 | this is a great title for a paper,
01:15:05.040 | but that architecture is proving
01:15:08.960 | that the dream of modeling sequences of any bytes,
01:15:12.540 | there is something there that will stick.
01:15:15.360 | And I think these advances in architectures,
01:15:18.280 | in kind of how neural networks are architected
01:15:21.040 | to do what they do.
01:15:23.120 | It's been hard to find one that has been so stable
01:15:26.080 | and relatively has changed very little
01:15:28.920 | since it was invented five or so years ago.
01:15:33.040 | So that is a surprise,
01:15:35.200 | a surprise that keeps recurring across other projects.
01:15:38.320 | - Try to, on a philosophical or technical level,
01:15:42.440 | introspect what is the magic of attention?
01:15:45.480 | What is attention?
01:15:47.320 | There's attention in people that study cognition,
01:15:50.120 | so human attention.
01:15:52.080 | I think there's giant wars over what attention means,
01:15:55.780 | how it works in the human mind.
01:15:57.440 | So there's a very simple look
01:16:00.200 | at what attention is in a neural network
01:16:02.600 | from the days of "Attention Is All You Need,"
01:16:04.440 | but do you think there's a general principle
01:16:06.840 | that's really powerful here?
01:16:08.780 | - Yeah, so a distinction between transformers and LSTMs,
01:16:13.360 | which were what came before,
01:16:15.360 | and there was a transitional period
01:16:17.840 | where you could use both.
01:16:19.680 | In fact, when we talked about AlphaStar,
01:16:22.000 | we used transformers and LSTMs.
01:16:24.280 | So it was still the beginning of transformers.
01:16:26.380 | They were very powerful,
01:16:27.400 | but LSTMs were still also very powerful sequence models.
01:16:31.520 | So the power of the transformer is that it has built in
01:16:36.520 | what we call an inductive bias of attention
01:16:41.140 | that makes the model,
01:16:43.040 | when you think of a sequence of integers, right?
01:16:45.700 | Like we discussed this before, right?
01:16:47.440 | This is a sequence of words.
01:16:50.420 | When you have to do very hard tasks over these words,
01:16:54.780 | this could be, we're gonna translate a whole paragraph
01:16:57.900 | or we're gonna predict the next paragraph
01:16:59.780 | given 10 paragraphs before.
01:17:01.740 | There's some loose intuition from how we do it as a human
01:17:09.260 | that is very nicely mimicked and replicated,
01:17:14.780 | structurally speaking in the transformer,
01:17:16.540 | which is this idea of you're looking for something, right?
01:17:21.160 | So you're sort of, when you're,
01:17:23.900 | you just read a piece of text,
01:17:25.740 | now you're thinking what comes next.
01:17:27.920 | You might wanna relook at the text or look at it from scratch.
01:17:31.780 | I mean, literally, because there's no recurrence.
01:17:35.080 | You're just thinking what comes next.
01:17:37.300 | And it's almost hypothesis driven, right?
01:17:40.020 | So if I'm thinking the next word that I'll write
01:17:43.380 | is cat or dog, okay?
01:17:46.560 | The way the transformer works almost philosophically
01:17:49.840 | is it has these two hypotheses.
01:17:52.840 | Is it gonna be cat or is it gonna be dog?
01:17:55.640 | And then it thinks, okay, if it's cat,
01:17:58.360 | I'm gonna look for certain words, not necessarily cat,
01:18:00.680 | although cat is an obvious word you would look in the past
01:18:02.920 | to see whether it makes more sense to output cat or dog.
01:18:05.960 | And then it does some very deep computation
01:18:09.400 | over the words and beyond, right?
01:18:11.400 | So it combines the words and,
01:18:14.100 | but it has the query as we call it, that is cat.
01:18:18.440 | And then similarly for dog, right?
01:18:20.600 | And so it's a very computational way to think about,
01:18:24.360 | look, if I'm thinking deeply about text,
01:18:26.980 | I need to go back to look at all of the text,
01:18:29.560 | attend over it, but it's not just attention.
01:18:31.860 | Like what is guiding the attention?
01:18:33.920 | And that was the key insight from an earlier paper:
01:18:36.660 | it's not how far away is it.
01:18:39.100 | I mean, how far away it is, is important.
01:18:40.760 | What did I just write about?
01:18:42.680 | That's critical.
01:18:44.100 | But what you wrote about 10 pages ago might also be critical.
01:18:48.360 | So you're looking not positionally, but content-wise, right?
01:18:53.160 | And transformers have this beautiful way
01:18:56.040 | to query for certain content and pull it out
01:18:59.420 | in a compressed way.
01:19:00.280 | So then you can make a more informed decision.
01:19:02.960 | I mean, that's one way to explain transformers.
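A compact sketch of the content-based attention being described, in plain NumPy: each query asks where in the past the content it cares about lives, and the result is a query-specific, compressed summary of the values, with position playing no direct role in this particular sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    # queries: (n_queries, d); keys/values: (n_past_tokens, d)
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])  # match by content, not by position
    weights = softmax(scores, axis=-1)                   # how much each past token matters to each query
    return weights @ values                              # a compressed, query-specific summary of the past

# Usage sketch with random vectors standing in for token representations:
rng = np.random.default_rng(0)
print(attend(rng.normal(size=(2, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 8))).shape)  # (2, 8)
```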
01:19:05.920 | But I think it's a very powerful inductive bias.
01:19:10.000 | There might be some details that might change over time,
01:19:12.480 | but I think that is what makes transformers
01:19:16.400 | so much more powerful than the recurrent networks
01:19:19.880 | that were more recency bias-based,
01:19:22.420 | which obviously works in some tasks,
01:19:24.300 | but it has major flaws.
01:19:26.680 | Transformer itself has flaws.
01:19:29.280 | And I think the main one, the main challenge is
01:19:32.160 | these prompts that we just were talking about,
01:19:35.720 | they can be a thousand words long.
01:19:38.040 | But if I'm teaching you StarCraft,
01:19:39.880 | I mean, I'll have to show you videos.
01:19:41.840 | I'll have to point you to whole Wikipedia articles
01:19:44.600 | about the game.
01:19:46.120 | We'll have to interact probably as you play,
01:19:48.000 | you'll ask me questions.
01:19:49.480 | The context required for us to achieve
01:19:52.340 | me being a good teacher to you on the game,
01:19:54.760 | as you would want to do it with a model,
01:19:56.960 | I think goes well beyond the current capabilities.
01:20:01.600 | So the question is, how do we benchmark this?
01:20:03.900 | And then how do we change the structure
01:20:06.400 | of the architectures?
01:20:07.280 | I think there's ideas on both sides,
01:20:08.820 | but we'll have to see empirically, right?
01:20:11.280 | Obviously what ends up working in the--
01:20:13.360 | - And as you talked about, some of the ideas could be,
01:20:15.880 | keeping the constraint of that length in place,
01:20:19.480 | but then forming hierarchical representations
01:20:23.060 | to where you can start being much more clever
01:20:26.240 | in how you use those thousand tokens.
01:20:28.840 | - Indeed.
01:20:29.680 | - Yeah, that's really interesting.
01:20:32.240 | But it also is possible that this attentional mechanism
01:20:34.840 | where you basically, you don't have a recency bias,
01:20:37.560 | but you look more generally,
01:20:40.300 | you make it learnable.
01:20:42.000 | The mechanism in which way you look back into the past,
01:20:45.280 | you make that learnable.
01:20:46.800 | It's also possible we're at the very beginning of that,
01:20:50.200 | because you might become smarter and smarter
01:20:54.380 | in the way you query the past.
01:20:56.920 | So recent past and distant past,
01:21:00.600 | and maybe very, very distant past.
01:21:02.360 | So almost like the attention mechanism
01:21:04.980 | will have to improve and evolve, as will the
01:21:09.620 | tokenization mechanism,
01:21:11.980 | so you can represent long-term memory somehow.
01:21:14.980 | - Yes.
01:21:16.140 | And I mean, hierarchies are very,
01:21:18.220 | I mean, it's a very nice word that sounds appealing.
01:21:22.180 | There's lots of work adding hierarchy to the memories.
01:21:25.900 | In practice, it does seem like we keep coming back
01:21:29.460 | to the main formula or main architecture.
01:21:33.000 | That sometimes tells us something.
01:21:35.300 | There is this sentence that a friend of mine told me,
01:21:38.540 | about whether an idea wants to work or not.
01:21:41.040 | So Transformer was clearly an idea that wanted to work.
01:21:45.000 | And then I think there's some principles
01:21:47.540 | we believe will be needed, but finding the exact details,
01:21:51.040 | details matter so much, right?
01:21:52.920 | That's gonna be tricky.
01:21:54.280 | - I love the idea that there's like,
01:21:56.800 | you as a human being, you want some ideas to work,
01:22:01.320 | and then there's the model that wants some ideas to work,
01:22:04.520 | and you get to have a conversation to see which,
01:22:07.400 | - More likely the model will win in the end.
01:22:09.600 | Because it's the one, you don't have to do any work.
01:22:12.860 | The model's the one that has to do the work,
01:22:14.380 | so you should listen to the model.
01:22:15.900 | And I really love this idea that you talked about,
01:22:17.900 | the humans in this picture, if I could just briefly ask.
01:23:21.200 | One is you're saying the benchmarks,
01:22:25.700 | so the modular humans working on this,
01:22:27.980 | the benchmarks providing a sturdy ground
01:23:31.700 | for the wish to do these things that seem impossible.
01:22:34.700 | They give you, in the darkest of times, give you hope,
01:22:39.140 | because little signs of improvement.
01:22:40.940 | Somehow you're not lost if you have metrics
01:22:46.560 | to measure your improvement.
01:22:48.680 | And then there's other aspect, you said elsewhere,
01:22:52.260 | and here today, titles matter.
01:22:56.600 | I wonder how much humans matter
01:23:00.520 | in the evolution of all of this,
01:23:02.360 | meaning individual humans.
01:23:04.300 | Something about their interaction,
01:23:08.140 | something about their ideas,
01:23:09.200 | how much they change the direction of all of this.
01:23:13.180 | If you change the humans in this picture,
01:23:15.680 | is it that the model is sitting there,
01:23:18.240 | and it wants some idea to work,
01:23:22.520 | or is it the humans, or maybe the model's providing
01:23:25.600 | 20 ideas that could work,
01:23:27.020 | and depending on the humans you pick,
01:23:29.100 | they're going to be able to hear some of those ideas.
01:23:31.800 | - In all the, because you're now directing
01:23:34.600 | all of deep learning at DeepMind,
01:23:35.920 | you get to interact with a lot of projects,
01:23:37.440 | a lot of brilliant researchers.
01:23:39.000 | How much variability is created by the humans
01:23:43.100 | in all of this?
01:23:44.160 | - Yeah, I mean, I do believe humans matter a lot,
01:23:47.380 | at the very least, at the time scale of years
01:23:52.380 | on when things are happening,
01:23:54.880 | and what's the sequencing of it, right?
01:23:56.940 | So you get to interact with people that,
01:24:00.560 | I mean, you mentioned this,
01:24:02.240 | some people really want some idea to work,
01:24:05.160 | and they'll persist,
01:24:06.720 | and then some other people might be more practical,
01:24:09.400 | like, I don't care what idea works,
01:24:12.880 | I care about cracking protein folding.
01:24:15.920 | And these, at least these two kind of seem opposite sides,
01:24:21.240 | we need both, and we've clearly had both historically,
01:24:25.680 | and that made certain things happen earlier or later,
01:24:29.000 | so definitely humans involved in all of these endeavors
01:24:33.480 | have had, I would say, an effect of years on the ordering of
01:24:38.480 | how things have happened,
01:24:40.480 | which breakthroughs came before
01:24:41.840 | which other breakthroughs, and so on,
01:24:43.300 | so certainly that does happen,
01:24:45.800 | and so one other, maybe one other axis of distinction
01:24:50.600 | is what I called,
01:24:52.040 | and this is most commonly used in reinforcement learning,
01:24:54.860 | is the exploration-exploitation trade-off as well,
01:24:57.800 | it's not exactly what I meant, although quite related.
01:25:00.920 | So when you start trying to help others,
01:25:05.920 | like you become a bit more of a mentor
01:25:11.480 | to a large group of people,
01:25:13.100 | be it a project or the deep learning team or something,
01:25:16.380 | or even in the community
01:25:17.460 | when you interact with people in conferences and so on,
01:25:20.800 | you're identifying quickly some things
01:25:24.920 | that are explorative or exploitative,
01:25:27.080 | and it's tempting to try to guide people, obviously,
01:25:30.720 | I mean, that's what makes our experience,
01:25:33.200 | we bring it and we try to shape things, sometimes wrongly,
01:25:36.760 | and there's many times that I've been wrong in the past,
01:25:39.600 | that's great, but it would be wrong
01:25:43.720 | to dismiss any sort of the research styles
01:25:48.160 | that I'm observing, and I often get asked,
01:25:51.280 | "Well, you're in industry, right,
01:25:52.720 | "so we do have access to large compute scale and so on,
01:25:55.580 | "so there's certain kinds of research
01:25:57.380 | "I almost feel like we need to do responsibly and so on,"
01:26:01.680 | but it is, in a way, we have the particle accelerator here,
01:26:05.200 | so to speak, as in physics, so we need to use it,
01:26:07.520 | we need to answer the questions
01:26:08.840 | that we should be answering right now
01:26:10.440 | for the scientific progress.
01:26:12.400 | But then at the same time, I look at many advances,
01:26:15.240 | including attention, which was discovered in Montreal
01:26:19.360 | initially because of lack of compute, right?
01:26:22.440 | So we were working on sequence to sequence
01:26:24.960 | with my friends over at Google Brain at the time,
01:26:27.920 | and we were using, I think, eight GPUs,
01:26:30.400 | which was somehow a lot at the time,
01:26:32.480 | and then I think Montreal was a bit more limited
01:26:35.240 | in the scale, but then they discovered
01:26:37.320 | this content-based attention concept
01:26:39.240 | that then has obviously triggered things like Transformer.
01:26:43.400 | Not everything obviously starts at the Transformer.
01:26:46.320 | There's always a history that is important to recognize
01:26:49.920 | because then you can make sure that then those
01:26:53.040 | who might feel now, "Well, we don't have so much compute,"
01:26:56.360 | you need to then help them optimize that kind of research
01:27:01.360 | that might actually produce amazing change.
01:27:04.240 | Perhaps it's not as short-term as some of these advancements
01:27:07.920 | or perhaps it's a different timescale,
01:27:09.720 | but the people and the diversity of the field
01:27:13.040 | is quite critical that we maintain it,
01:27:15.720 | and at times, especially mixed a bit with hype
01:27:19.040 | or other things, it's a bit tricky
01:27:21.520 | to be observing maybe too much
01:27:24.160 | of the same thinking across the board,
01:27:27.760 | but the humans definitely are critical,
01:27:30.480 | and I can think of quite a few personal examples
01:27:33.880 | where also someone told me something
01:27:36.560 | that had a huge effect onto some idea,
01:27:43.280 | and then that's why I'm saying at least in terms of years,
01:27:43.280 | probably some things do happen.
01:27:44.880 | - Yeah, it's fascinating.
01:27:45.720 | - Yeah.
01:27:46.560 | - And it's also fascinating how constraints somehow
01:27:48.200 | are essential for innovation,
01:27:51.040 | and the other thing you mentioned about engineering,
01:27:53.400 | I have a sneaking suspicion, maybe I over,
01:27:56.640 | my love is with engineering,
01:27:59.960 | so I have a sneaking suspicion that all the genius,
01:28:04.480 | a large percentage of the genius
01:28:06.280 | is in the tiny details of engineering,
01:28:09.280 | so I think we like to think the genius is in the big ideas.
01:28:14.280 | I have a sneaking suspicion that,
01:28:20.160 | because I've seen the genius of details,
01:28:22.600 | of engineering details,
01:28:24.120 | make the night and day difference,
01:28:28.760 | and I wonder if those kind of have a ripple effect over time.
01:28:32.120 | So that too, so that's sort of,
01:28:35.520 | taking the engineering perspective,
01:28:36.840 | that sometimes that quiet innovation
01:28:39.360 | at the level of an individual engineer,
01:28:41.720 | or maybe at the small scale of a few engineers,
01:28:44.600 | can make all the difference, that scales,
01:28:46.760 | because we're doing, we're working on computers
01:28:49.680 | that are scaled across large groups,
01:28:53.440 | that one engineering decision can lead to ripple effects.
01:28:56.960 | - Yes. - Which is interesting
01:28:57.800 | to think about.
01:28:58.920 | - Yeah, I mean, engineering,
01:29:00.760 | there's also kind of a historical,
01:29:04.160 | it might be a bit random,
01:29:06.280 | because if you think of the history of how,
01:29:09.760 | especially deep learning and neural networks took off,
01:29:12.320 | feels like a bit random,
01:29:15.000 | because GPUs happen to be there at the right time
01:29:17.800 | for a different purpose, which was to play video games.
01:29:20.640 | So even the engineering that goes into the hardware,
01:29:24.600 | and it might have a time,
01:29:26.320 | like the timeframe might be very different.
01:29:28.000 | I mean, the GPUs evolved throughout many years,
01:29:31.560 | where we didn't even, we weren't looking at that, right?
01:29:33.840 | So even at that level, right, that revolution, so to speak,
01:29:37.480 | the ripples are like, we'll see when they stop, right?
01:29:42.160 | But in terms of thinking of why is this happening, right?
01:29:45.920 | There's, I think that when I try to categorize it
01:29:49.760 | in sort of things that might not be so obvious,
01:29:52.720 | I mean, clearly there's a hardware revolution.
01:29:54.960 | We are surfing thanks to that.
01:29:58.360 | Data centers as well.
01:29:59.760 | I mean, data centers are where,
01:30:01.840 | like, I mean, at Google, for instance,
01:30:03.200 | obviously they're serving Google,
01:30:04.800 | but there's also now, thanks to that,
01:30:06.920 | and to have built such amazing data centers,
01:30:09.640 | we can train these models.
01:30:11.720 | Software is an important one.
01:30:13.400 | I think if I look at the state of how I had to implement
01:30:18.280 | things to implement my ideas,
01:30:20.040 | how I discarded ideas because they were too hard
01:30:22.120 | to implement, yeah, clearly the times have changed,
01:30:25.280 | and thankfully we are in a much better
01:30:27.600 | software position as well.
01:30:29.400 | And then, I mean, obviously there's research
01:30:32.240 | that happens at scale and more people enter the field.
01:30:35.160 | That's great to see,
01:30:36.000 | but it's almost enabled by these other things.
01:30:38.280 | And last but not least is also data, right?
01:30:40.600 | Curating data sets, labeling data sets,
01:30:43.120 | these benchmarks we think about,
01:30:44.960 | maybe we'll want to have all the benchmarks in one system,
01:30:48.920 | but it's still very valuable that someone
01:30:51.320 | put the thought and the time and the vision
01:30:53.600 | to build certain benchmarks.
01:30:54.880 | We've seen progress thanks to,
01:30:56.640 | but we're gonna repurpose the benchmarks.
01:30:59.280 | That's the beauty of Atari,
01:31:01.640 | is like we solved it in a way,
01:31:04.240 | but we use it in Gato.
01:31:06.000 | It was critical, and I'm sure there's still a lot more
01:31:09.120 | to do thanks to that amazing benchmark
01:31:10.960 | that someone took the time to put,
01:31:13.120 | even though at the time maybe,
01:31:15.160 | oh, you have to think what's the next,
01:31:17.360 | you know, iteration of architectures.
01:31:19.440 | That's what maybe the field recognizes,
01:31:21.400 | but we need to, that's another thing we need to balance
01:31:24.000 | in terms of the humans behind it.
01:31:25.800 | We need to recognize all these aspects
01:31:27.960 | because they're all critical.
01:31:29.480 | And we tend to, yeah, we tend to think of the genius,
01:31:32.800 | the scientist and so on, but I'm glad you're,
01:31:35.680 | I know you have a strong engineering background, so.
01:31:38.000 | - But also I'm a lover of data,
01:31:40.040 | and to give a pushback on the engineering comment,
01:31:43.240 | ultimately it could be the creators of benchmarks
01:31:46.080 | who have the most impact.
01:31:47.440 | Andrej Karpathy, who you mentioned,
01:31:49.200 | has recently been talking a lot of trash about ImageNet,
01:31:52.000 | which he has the right to do
01:31:53.200 | because of how critical he is,
01:31:55.480 | how essential he is to the development
01:31:57.760 | and the success of deep learning around ImageNet.
01:32:01.520 | And you're saying that that's actually,
01:32:02.920 | that benchmark is holding back the field
01:32:05.480 | because, I mean, especially in his context
01:32:07.680 | on Tesla Autopilot, that's looking at real world behavior
01:32:11.080 | of a system, it's, there's something fundamentally missing
01:32:16.080 | about ImageNet that doesn't capture
01:32:17.960 | the real worldness of things,
01:32:20.440 | that we need to have data sets, benchmarks
01:32:22.640 | that have the unpredictability, the edge cases,
01:32:27.080 | the whatever the heck it is that makes the real world
01:32:29.680 | so difficult to operate in,
01:32:32.280 | we need to have benchmarks with that, so.
01:32:34.680 | But just to think about the impact of ImageNet
01:32:37.760 | as a benchmark, and that really puts a lot of emphasis
01:32:42.120 | on the importance of a benchmark,
01:32:43.720 | both sort of internally at DeepMind and as a community.
01:32:46.680 | So one is coming in from within,
01:32:48.960 | like how do I create a benchmark for me
01:32:52.520 | to mark and make progress, and how do I make a benchmark
01:32:57.280 | for the community to mark and push progress?
01:33:02.520 | - You have this amazing paper you co-authored,
01:33:05.880 | a survey paper called Emergent Abilities
01:33:08.600 | of Large Language Models, it has, again,
01:33:11.440 | the philosophy here that I'd love to ask you about.
01:33:14.480 | What's the intuition about the phenomena
01:33:16.680 | of emergence in neural networks,
01:33:18.480 | transformer language models?
01:33:20.660 | Is there a magic threshold beyond which
01:33:24.160 | we start to see certain performance?
01:33:27.160 | And is that different from task to task?
01:33:29.960 | Is that us humans just being poetic and romantic,
01:33:32.640 | or is there literally some level
01:33:35.440 | at which we start to see breakthrough performance?
01:33:38.200 | - Yeah, I mean, this is a property that we start seeing
01:33:41.520 | in systems that actually tend to be,
01:33:46.880 | so in machine learning, traditionally,
01:33:49.280 | again, going to benchmarks, I mean,
01:33:51.960 | if you have some input-output, right,
01:33:54.860 | like that is just a single input and a single output,
01:33:58.280 | you generally, when you train these systems,
01:34:01.200 | you see reasonably smooth curves
01:34:04.420 | when you analyze how much the data set size
01:34:09.420 | affects the performance, or how the model size
01:34:12.020 | affects the performance, or how long you train
01:34:15.080 | the system for
01:34:17.920 | affects the performance, right?
01:34:19.360 | So, you know, if we think of ImageNet,
01:34:22.080 | like the training curves look fairly smooth
01:34:25.080 | and predictable in a way,
01:34:28.160 | and I would say that's probably because of the,
01:34:31.360 | it's kind of a one-hop reasoning task, right?
01:34:36.360 | It's like, here is an input,
01:34:38.240 | and you think for a few milliseconds,
01:34:40.800 | or 100 milliseconds, 300, as a human,
01:34:43.760 | and then you tell me, yeah,
01:34:44.840 | there's an alpaca in this image.
01:34:47.880 | So, in language, we are seeing benchmarks
01:34:52.800 | that require more pondering and more thought in a way, right?
01:34:58.240 | This is just kind of, you need to look for some subtleties,
01:35:01.960 | that it involves inputs that you might think of,
01:35:05.400 | or even if the input is a sentence
01:35:07.860 | describing a mathematical problem,
01:35:09.800 | there is a bit more processing required as a human
01:35:14.180 | and more introspection.
01:35:15.700 | So, I think that how these benchmarks work
01:35:20.520 | means that there is actually a threshold,
01:35:23.520 | just going back to how transformers work
01:35:26.760 | in this way of querying for the right questions
01:35:29.560 | to get the right answers,
01:35:31.160 | that might mean that performance becomes random
01:35:35.520 | until the right question is asked
01:35:37.800 | by the querying system of a transformer
01:35:40.080 | or of a language model like a transformer,
01:35:42.880 | and then, only then, you might start seeing performance
01:35:47.720 | going from random to non-random,
01:35:50.120 | and this is more empirical.
01:35:52.720 | There's no formalism or theory behind this yet,
01:35:56.320 | although it might be quite important,
01:35:57.800 | but we're seeing these phase transitions
01:36:00.360 | of random performance until some, let's say,
01:36:03.680 | scale of a model, and then it goes beyond that.
01:36:06.800 | And it might be that you need to fit
01:36:10.560 | a few low-order bits of thought
01:36:14.040 | before you can make progress on the whole task.
01:36:17.200 | And if you could measure, actually,
01:36:19.720 | that breakdown of the task,
01:36:21.880 | maybe you would see more smooth,
01:36:23.480 | oh, like, yeah, this, you know,
01:36:24.960 | once you get this and this and this and this and this,
01:36:27.760 | then you start making progress in the task.
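As an aside for the reader, the "low-order bits" intuition can be made concrete with a toy calculation: if a task only counts as solved when every one of its k sub-steps is right, a per-step skill that improves smoothly with scale still produces an exact-match curve that looks like a sharp jump. The model sizes, the logistic curve, and k below are invented for illustration; they are not measurements from any real system.

```python
import numpy as np

# Toy illustration (hypothetical numbers): per-step accuracy improves smoothly
# with model size, but exact-match on a k-step task is the product of the
# per-step accuracies, so it stays near zero and then rises abruptly.
model_sizes = np.logspace(7, 11, 9)                # 10M .. 100B parameters (made up)
per_step_acc = 1.0 / (1.0 + np.exp(-2.0 * (np.log10(model_sizes) - 9.0)))  # smooth logistic
k = 8                                              # number of sub-steps the task requires
exact_match = per_step_acc ** k                    # all k sub-steps must be correct

for n, s, e in zip(model_sizes, per_step_acc, exact_match):
    print(f"{n:10.1e} params | per-step {s:.2f} | exact-match {e:.4f}")
```

In this sketch the per-step accuracy climbs gradually over four orders of magnitude, while the end-to-end metric only lifts off near the top of the range, which is one way a smooth underlying trend can register as "emergence" on a hard benchmark.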
01:36:30.320 | But it's somehow a bit annoying
01:36:33.520 | because then it means that certain questions
01:36:37.480 | we might ask about architectures
01:36:40.320 | possibly can only be done at certain scale.
01:36:43.040 | And one thing that, conversely,
01:36:46.120 | I've seen great progress on in the last couple of years
01:36:49.200 | is this notion of science of deep learning
01:36:52.480 | and science of scale in particular, right?
01:36:55.040 | So, on the negative is that there's some benchmarks
01:36:58.680 | for which progress might need to be measured
01:37:01.800 | at minimum and at certain scale
01:37:04.000 | until you see then what details of the model matter
01:37:07.560 | to make that performance better, right?
01:37:10.000 | So that's a bit of a con.
01:37:11.920 | But what we've also seen is that
01:37:14.720 | you can sort of empirically analyze
01:37:18.600 | behavior of models at scales that are smaller, right?
01:37:22.880 | So let's say, to put an example,
01:37:25.680 | we had this Chinchilla paper
01:37:27.840 | that revised the so-called scaling laws of models.
01:37:31.360 | And that whole study is done at a reasonably small scale,
01:37:34.720 | right, that may be hundreds of millions
01:37:36.520 | up to 1 billion parameters.
01:37:38.680 | And then the cool thing is that you create some laws, right?
01:37:41.880 | Some laws, some trends, right?
01:37:43.640 | You extract trends from data that you see, okay,
01:37:46.600 | like it looks like the amount of data required
01:37:49.400 | to train now a 10X larger model would be this.
01:37:52.120 | And these laws so far,
01:37:53.960 | these extrapolations have helped us save compute
01:37:57.480 | and just get to a better place in terms of the science
01:38:00.920 | of how should we run these models at scale,
01:38:03.800 | how much data, how much depth,
01:38:05.600 | and all sorts of questions we start asking,
01:38:08.480 | extrapolating from a small scale.
01:38:10.600 | But then this emergence is sadly
01:38:12.720 | that not everything can be extrapolated from scale
01:38:15.680 | depending on the benchmark.
01:38:16.880 | And maybe the harder benchmarks are not so good
01:38:20.240 | for extracting these laws.
01:38:21.960 | But we have a variety of benchmarks at least.
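To picture the kind of extrapolation being described, here is a minimal sketch under made-up numbers: fit a power law in log-log space to small-scale measurements of the compute-optimal token count, then read off the prediction at a ten-times larger budget. The budgets, token counts, and fitted exponent are purely illustrative and are not the Chinchilla results.

```python
import numpy as np

# Hypothetical small-scale measurements: compute budgets (FLOPs) and the
# training-token counts that gave the best loss at each budget.
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
optimal_tokens = np.array([4e9, 8e9, 1.5e10, 2.7e10, 5e10])   # invented numbers

# Fit optimal_tokens ~ a * compute^b by linear regression in log-log space.
b, log_a = np.polyfit(np.log10(compute), np.log10(optimal_tokens), deg=1)
a = 10.0 ** log_a

# Extrapolate the trend to a 10x larger budget than anything measured.
big_budget = 1e21
print(f"fitted exponent b = {b:.2f}")
print(f"predicted optimal tokens at {big_budget:.0e} FLOPs: {a * big_budget ** b:.2e}")
```

The caveat in the conversation applies directly to a fit like this: it only saves compute as long as the benchmark behaves smoothly enough for the extrapolation to hold.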
01:38:24.160 | - So I wonder to which degree the threshold,
01:38:28.000 | the phase shift scale is a function of the benchmark.
01:38:31.680 | Some of the science of scale might be engineering benchmarks
01:38:37.840 | where that threshold is low.
01:38:40.400 | Sort of taking a main benchmark
01:38:43.840 | and reducing it somehow
01:38:46.120 | where the essential difficulty is left,
01:38:48.480 | but the scale at which the emergence happens is lower.
01:38:52.600 | Just for the science aspect of it
01:38:54.280 | versus the actual real world aspect.
01:38:56.960 | - Yeah, so luckily we have quite a few benchmarks,
01:38:59.280 | some of which are simpler,
01:39:00.560 | or maybe they're more like,
01:39:01.880 | I think people might call these System 1
01:39:03.840 | versus System 2 style.
01:39:05.920 | So I think what we're now seeing, luckily,
01:39:09.880 | is that extrapolations from maybe slightly smoother
01:39:14.040 | or simpler benchmarks are translating to the harder ones.
01:39:18.560 | But that is not to say
01:39:19.640 | that this extrapolation won't hit its limits.
01:39:22.600 | And when it does,
01:39:24.200 | then how much we scale or how we scale
01:39:27.560 | will sadly be a bit suboptimal
01:39:29.440 | until we find better laws, right?
01:39:31.800 | And these laws, again, are very empirical laws.
01:39:33.800 | They're not like physical laws of models.
01:39:35.920 | Although I wish there would be better theory
01:39:38.680 | about these things as well,
01:39:40.120 | but so far I would say empirical theory,
01:39:43.000 | as I call it, is way ahead
01:39:44.520 | of actual theory of machine learning.
01:39:47.000 | - Let me ask you almost for fun.
01:39:50.480 | So this is not, Oriol,
01:39:52.080 | as a DeepMind person
01:39:54.640 | or anything to do with DeepMind or Google,
01:39:57.280 | just as a human being,
01:39:58.840 | and looking at these news of a Google engineer
01:40:01.760 | who claimed that,
01:40:05.800 | I guess the LaMDA language model
01:40:08.360 | was sentient or had the,
01:40:11.120 | and you still need to look into the details of this,
01:40:14.080 | but sort of making an official report
01:40:18.680 | and a claim that he believes there's evidence
01:40:21.740 | that this system has achieved sentience.
01:40:25.120 | And I think this is a really interesting case
01:40:29.560 | on a human level, on a psychological level,
01:40:31.760 | on a technical machine learning level
01:40:35.920 | of how language models transform our world,
01:40:38.360 | and also just philosophical level
01:40:39.880 | of the role of AI systems in a human world.
01:40:44.120 | So what do you find interesting?
01:40:48.120 | What's your take on all of this
01:40:49.720 | as a machine learning engineer and a researcher
01:40:52.440 | and also as a human being?
01:40:54.320 | - Yeah, I mean, a few reactions.
01:40:56.400 | Quite a few, actually.
01:40:58.760 | - Have you ever briefly thought,
01:41:01.640 | is this thing sentient?
01:41:02.560 | - Right, so never.
01:41:04.320 | Absolutely never.
01:41:05.160 | - You mean with like AlphaStar?
01:41:06.280 | Wait a minute.
01:41:07.120 | - Sadly, though, I think, yeah, sadly I have not.
01:41:11.960 | Yeah, I think the current, any of the current models,
01:41:15.320 | although very useful and very good,
01:41:17.560 | yeah, I think we're quite far from that.
01:41:21.200 | And there's kind of a converse side story.
01:41:25.360 | So one of my passions is about science in general.
01:41:30.360 | And I think I feel I'm a bit of a failed scientist.
01:41:34.540 | That's why I came to machine learning,
01:41:36.560 | because you always feel, and you start seeing this,
01:41:40.160 | that machine learning is maybe the science
01:41:43.200 | that can help other sciences, as we've seen, right?
01:41:45.440 | Like you, you know, it's such a powerful tool.
01:41:48.620 | So thanks to that angle, right, that, okay, I love science.
01:41:52.520 | I love, I mean, I love astronomy, I love biology,
01:41:54.960 | but I'm not an expert and I decided,
01:41:56.960 | well, the thing I can do better at is computers.
01:42:00.040 | But having, especially with,
01:42:02.960 | when I was a bit more involved in AlphaFold,
01:42:05.560 | learning a bit about proteins and about biology
01:42:08.800 | and about life, the complexity,
01:42:13.120 | it feels like it really is, like, I mean,
01:42:15.040 | if you start looking at the things that are going on
01:42:18.160 | at the atomic level, and also, I mean,
01:42:23.880 | there's obviously the, we are maybe inclined
01:42:27.720 | to try to think of neural networks as like the brain,
01:42:30.440 | but the complexities and the amount of magic that it feels
01:42:35.080 | when, I mean, I don't, I'm not an expert,
01:42:37.120 | so it naturally feels more magic,
01:42:38.600 | but looking at biological systems,
01:42:40.920 | as opposed to these computational brains,
01:42:45.540 | just makes me like, wow, there's such level
01:42:49.600 | of complexity difference still, right?
01:42:51.480 | Like orders of magnitude complexity that,
01:42:53.820 | sure, these weights, I mean, we train them
01:42:56.680 | and they do nice things, but they're not at the level
01:43:00.160 | of biological entities, brains, cells.
01:43:05.160 | It just feels like it's just not possible
01:43:08.960 | to achieve the same level of complexity behavior,
01:43:12.360 | and my belief, when I talk to other beings,
01:43:16.280 | is certainly shaped by this amazement of biology
01:43:20.340 | that maybe because I know too much,
01:43:22.340 | I don't have about machine learning,
01:43:23.760 | but I certainly feel it's very far-fetched
01:43:27.600 | and far in the future to be calling,
01:43:29.780 | or to be thinking, well, this mathematical function
01:43:34.560 | that is differentiable is in fact sentient and so on.
01:43:39.200 | - There's something on that point, it's very interesting.
01:43:41.980 | So you know enough about machines and enough about biology
01:43:46.980 | to know that there's many orders of magnitude
01:43:49.040 | of difference and complexity,
01:43:50.620 | but you know how machine learning works.
01:43:56.060 | So the interesting question for human beings
01:43:58.140 | that are interacting with a system
01:43:59.400 | that don't know about the underlying complexity,
01:44:02.240 | and I've seen people, probably including myself,
01:44:05.240 | that have fallen in love with things that are quite simple.
01:44:07.920 | - Yeah, so-- - And so maybe
01:44:09.440 | the complexity is one part of the picture,
01:44:11.500 | but maybe that's not a necessary condition for sentience,
01:44:16.500 | for perception or emulation of sentience.
01:44:25.000 | - Right, so I mean, I guess the other side of this is,
01:44:28.180 | that's how I feel personally,
01:44:29.560 | I mean, you asked me about the person, right?
01:44:32.360 | Now it's very interesting to see
01:44:33.980 | how other humans feel about things, right?
01:44:36.360 | This is, we are like, again, like I'm not as amazed
01:44:40.800 | about things that I feel,
01:44:42.320 | this is not as magical as this other thing,
01:44:44.560 | because of maybe how I got to learn about it
01:44:48.000 | and how I see the curve a bit more smooth,
01:44:50.480 | because I, you know, like just seeing the progress
01:44:53.080 | of language models since Shannon in the '50s,
01:44:56.000 | and actually looking at that timescale,
01:44:58.900 | the progress is not that fast, right?
01:45:00.840 | I mean, what we were thinking at the time,
01:45:03.460 | like almost 100 years ago,
01:45:05.960 | is not that dissimilar to what we're doing now,
01:45:08.920 | but at the same time, yeah, obviously others,
01:45:11.440 | my experience, right, the personal experience,
01:45:14.500 | I think no one should, you know,
01:45:17.360 | I think no one should tell others how they should feel,
01:45:20.680 | I mean, the feelings are very personal, right?
01:45:22.940 | So how others might feel about the models and so on,
01:45:26.120 | that's one part of the story that is important
01:45:28.480 | to understand for me personally as a researcher,
01:45:32.040 | and then when I maybe disagree or I don't understand
01:45:36.160 | or see that, yeah, maybe this is not something
01:45:38.560 | I think right now is reasonable, knowing all that I know,
01:45:41.580 | one of the other things, and perhaps partly
01:45:44.320 | why it's great to be talking to you
01:45:46.600 | and reaching out to the world about machine learning is,
01:45:49.860 | hey, let's demystify a bit the magic
01:45:53.480 | and try to see a bit more of the math
01:45:56.280 | and the fact that literally to create these models,
01:45:59.920 | if we had the right software, it would be 10 lines of code
01:46:03.160 | and then just a dump of the internet,
01:46:06.160 | so versus like then the complexity of like the creation
01:46:10.320 | of humans from their inception, right,
01:46:13.640 | and also the complexity of evolution
01:46:15.820 | of the whole universe to where we are
01:46:19.240 | that feels orders of magnitude more complex
01:46:21.960 | and fascinating to me.
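To make the "10 lines of code and a dump of the internet" remark concrete, here is a deliberately minimal sketch of such a next-token-prediction loop, assuming PyTorch as the "right software"; the random tokens, model size, and hyperparameters are placeholders, and a real system wraps enormous data, infrastructure, and evaluation work around a loop like this.

```python
import torch
import torch.nn as nn

# Minimal sketch of next-token prediction; everything here is a placeholder.
vocab, d = 50_000, 512
data = torch.randint(0, vocab, (1_024, 256))        # stand-in for a tokenized "dump of the internet"
embed = nn.Embedding(vocab, d)
model = nn.TransformerEncoder(nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=6)
head = nn.Linear(d, vocab)
opt = torch.optim.Adam([*embed.parameters(), *model.parameters(), *head.parameters()], lr=3e-4)
mask = nn.Transformer.generate_square_subsequent_mask(255)   # causal mask: attend only to the past

for batch in data.split(32):                                  # mini-batches of token sequences
    logits = head(model(embed(batch[:, :-1]), mask=mask))      # predict every next token
    loss = nn.functional.cross_entropy(logits.reshape(-1, vocab), batch[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```

The point of the comparison stands either way: the loop itself is short, and the complexity lives everywhere around it.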
01:46:23.500 | So I think, yeah, maybe part of,
01:46:26.040 | the only thing I'm thinking about trying to tell you is,
01:46:29.300 | yeah, I think explaining a bit of the magic,
01:46:32.640 | there is a bit of magic, it's good to be in love,
01:46:34.840 | obviously, with what you do at work,
01:46:37.040 | and I'm certainly fascinated and surprised
01:46:39.440 | quite often as well, but I think hopefully,
01:46:43.200 | as experts in biology, hopefully you will tell me
01:46:45.900 | this is not as magic, and I'm happy to learn that.
01:46:49.440 | Through interactions with the larger community,
01:46:52.280 | we can also have a certain level of education
01:46:56.020 | that in practice also will matter,
01:46:58.360 | because I mean, one question is how you feel about this,
01:47:00.800 | but then the other very important is,
01:47:03.080 | you starting to interact with these in products and so on,
01:47:06.960 | it's good to understand a bit what's going on,
01:47:09.160 | what's not going on, what's safe, what's not safe,
01:47:12.280 | and so on, right, otherwise, the technology
01:47:14.840 | will not be used properly for good,
01:47:17.040 | which is obviously the goal of all of us, I hope.
01:47:20.540 | - So let me then ask the next question,
01:47:22.940 | do you think in order to solve intelligence,
01:47:25.800 | or to replace the Lexbot that does interviews,
01:47:29.560 | as we started this conversation with,
01:47:31.440 | do you think the system needs to be sentient?
01:47:34.840 | Do you think it needs to achieve something
01:47:37.260 | like consciousness, and do you think about
01:47:39.780 | what consciousness is in the human mind
01:47:43.260 | that could be instructive for creating AI systems?
01:47:46.760 | - Yeah, honestly, I think probably not
01:47:51.040 | to the degree of intelligence that there's this brain
01:47:56.040 | that can learn, can be extremely useful,
01:48:00.320 | can challenge you, can teach you,
01:48:02.960 | conversely, you can teach it to do things.
01:48:05.640 | I'm not sure it's necessary, personally speaking,
01:48:09.120 | but if consciousness or any other biological
01:48:14.080 | or evolutionary lesson can be repurposed
01:48:19.080 | to then influence our next set of algorithms,
01:48:22.600 | that is a great way to actually make progress, right?
01:48:25.680 | And the same way I tried to explain Transformers a bit,
01:48:28.220 | how it feels we operate when we look at text specifically,
01:48:33.220 | these insights are very important, right?
01:48:36.000 | So there's a distinction between details
01:48:40.320 | of how the brain might be doing computation.
01:48:43.260 | I think my understanding is, sure, there's neurons
01:48:46.560 | and there's some resemblance to neural networks,
01:48:48.520 | but we don't quite understand enough of the brain
01:48:51.440 | in detail, right, to be able to replicate it.
01:48:55.320 | But then more, if you zoom out a bit,
01:48:58.840 | how we then, our thought process, how memory works,
01:49:03.400 | maybe even how evolution got us here,
01:49:05.640 | what's exploration, exploitation,
01:49:07.320 | like how these things happen,
01:49:08.800 | I think these clearly can inform algorithmic level research.
01:49:13.080 | And I've seen some examples of this being quite useful
01:49:18.080 | to then guide the research,
01:49:19.740 | even it might be for the wrong reasons, right?
01:49:21.660 | So I think biology and what we know about ourselves
01:49:26.100 | can help a whole lot to build essentially
01:49:29.980 | what we call AGI, this general, the real Gato, right?
01:49:34.140 | The last step of the chain, hopefully.
01:49:36.540 | But consciousness in particular,
01:49:39.180 | I don't myself at least think too hard
01:49:42.060 | about how to add that to the system,
01:49:44.800 | but maybe my understanding is also very personal
01:49:47.840 | about what it means, right?
01:49:48.840 | I think this, even that in itself is a long debate
01:49:51.760 | that I know people have often,
01:49:55.300 | and maybe I should learn more about this.
01:49:57.780 | - Yeah, and I personally, I notice the magic often
01:50:01.740 | on a personal level, especially with physical systems,
01:50:04.940 | like robots.
01:50:06.160 | I have a lot of legged robots now in Austin
01:50:10.460 | that I play with.
01:50:11.700 | And even when you program them,
01:50:13.260 | when they do things you didn't expect,
01:50:15.580 | there's an immediate anthropomorphization,
01:50:18.620 | and you notice the magic,
01:50:19.820 | and you start to think about things like sentience
01:50:22.620 | that has to do more with effective communication
01:50:26.020 | and less with any of these kind of dramatic things.
01:50:28.580 | It seems like a useful part of communication.
01:50:32.600 | Having the perception of consciousness
01:50:36.580 | seems like useful for us humans.
01:50:38.860 | We treat each other more seriously.
01:50:40.860 | We are able to do a nearest neighbor shoving
01:50:45.060 | of that entity into your memory correctly,
01:50:47.700 | all that kind of stuff.
01:50:48.700 | Seems useful, at least to fake it,
01:50:50.860 | even if you never make it.
01:50:52.500 | - So maybe, like, yeah, mirroring the question,
01:50:55.660 | and since you talked to a few people,
01:50:57.460 | then you do think that we'll need to figure something out
01:51:01.780 | in order to achieve intelligence
01:51:04.580 | in a grander sense of the word.
01:51:06.540 | - Yeah, I personally believe yes,
01:51:08.220 | but I don't even think it'll be like a separate island
01:51:12.620 | we'll have to travel to.
01:51:14.140 | I think it'll emerge quite naturally.
01:51:16.420 | - Okay, that's easier for us then, thank you.
01:51:20.140 | - But the reason I think it's important to think about
01:51:22.820 | is you will start, I believe, like with this Google Engineer,
01:51:26.340 | you will start seeing this a lot more,
01:51:28.780 | especially when you have AI systems
01:51:30.540 | that are actually interacting with human beings
01:51:32.980 | that don't have an engineering background,
01:51:35.180 | and we have to prepare for that.
01:51:38.580 | Because there'll be, I do believe
01:51:40.100 | there'll be a civil rights movement for robots,
01:51:42.300 | as silly as it is to say.
01:51:44.580 | There's going to be a large number of people
01:51:46.780 | that realize there's these intelligent entities
01:51:48.980 | with whom I have a deep relationship,
01:51:51.620 | and I don't wanna lose them.
01:51:53.220 | They've come to be a part of my life,
01:51:54.780 | and they mean a lot.
01:51:55.980 | They have a name, they have a story, they have a memory,
01:51:59.020 | and we start to ask questions about ourselves.
01:52:01.340 | Well, what, this thing sure seems like
01:52:04.940 | it's capable of suffering,
01:52:07.600 | because it tells all these stories of suffering.
01:52:09.860 | It doesn't wanna die and all those kinds of things,
01:52:11.700 | and we have to start to ask ourselves questions.
01:52:14.460 | Well, what is the difference
01:52:15.460 | between a human being and this thing?
01:52:16.980 | And so when you engineer,
01:52:18.580 | I believe from an engineering perspective,
01:52:21.500 | from like a DeepMind, or anybody that builds systems,
01:52:24.980 | there might be laws in the future
01:52:26.500 | where you're not allowed to engineer systems
01:52:29.140 | with displays of sentience,
01:52:31.240 | unless they're explicitly designed to be that,
01:52:36.020 | unless it's a pet.
01:52:37.380 | So if you have a system that's just doing customer support,
01:52:41.260 | you're legally not allowed to display sentience.
01:52:44.180 | We'll start to ask ourselves that question,
01:52:47.300 | and then so that's going to be part
01:52:49.500 | of the software engineering process.
01:52:51.260 | Which features do we have,
01:52:53.360 | and one of them is communications of sentience.
01:52:56.820 | But it's important to start thinking about that stuff,
01:52:58.700 | especially how much it captivates public attention.
01:53:01.740 | - Yeah, absolutely.
01:53:03.180 | It's definitely a topic that is important.
01:53:06.420 | We think about, and I think in a way,
01:53:09.540 | I always see, not every movie is equally on point
01:53:14.540 | with certain things,
01:53:16.100 | but certainly science fiction in this sense,
01:53:19.100 | at least has prepared society
01:53:20.740 | to start thinking about certain topics
01:53:24.060 | that even if it's too early to talk about,
01:53:26.460 | as long as we are reasonable,
01:53:29.480 | it's certainly gonna prepare us
01:53:31.300 | for both the research to come and how to,
01:53:34.980 | I mean, there's many important challenges
01:53:37.060 | and topics that come with building an intelligent system,
01:53:42.060 | many of which you just mentioned, right?
01:53:44.660 | So I think we're never gonna be fully ready
01:53:49.660 | unless we talk about this,
01:53:51.420 | and we start also, as I said,
01:53:54.140 | just kind of expanding the people we talk to,
01:53:59.140 | to not include only our own researchers and so on.
01:54:03.180 | And in fact, places like DeepMind, but elsewhere,
01:54:06.540 | there's more interdisciplinary groups forming up
01:54:10.380 | to start asking and really working with us
01:54:13.260 | on these questions,
01:54:14.980 | because obviously this is not initially
01:54:17.420 | what your passion is when you do your PhD,
01:54:19.380 | but certainly it is coming, right?
01:54:21.460 | So it's fascinating, kind of.
01:54:23.140 | It's the thing that brings me to one of my passions
01:54:27.180 | that is learning.
01:54:28.020 | So in this sense, this is kind of a new area
01:54:31.740 | that as a learning system myself, I want to keep exploring.
01:54:36.660 | And I think it's great to see parts of the debate,
01:54:41.060 | and even I've seen a level of maturity
01:54:43.780 | in the conferences that deal with AI.
01:54:46.500 | If you look five years ago to now,
01:54:49.940 | just the amount of workshops and so on has changed so much.
01:54:53.100 | It's impressive to see how much topics of safety, ethics,
01:54:58.100 | and so on come to the surface, which is great.
01:55:01.700 | And if we were too early, clearly it's fine.
01:55:03.860 | I mean, it's a big field and there's lots of people
01:55:07.300 | with lots of interests that will do progress
01:55:10.300 | or make progress.
01:55:11.940 | And obviously I don't believe we're too late.
01:55:14.100 | So in that sense, I think it's great
01:55:16.460 | that we're doing this already.
01:55:18.180 | - It's better to be too early than too late
01:55:20.220 | when it comes to super intelligent AI systems.
01:55:22.780 | Let me ask, speaking of sentient AIs,
01:55:25.500 | you gave props to your friend, Ilya Sutskever,
01:55:28.700 | for being elected a Fellow of the Royal Society.
01:55:31.980 | So just as a shout out to a fellow researcher and a friend,
01:55:35.140 | what's the secret to the genius of Ilya Sutskever?
01:55:39.420 | And also, do you believe that his tweets,
01:55:42.660 | as you've hypothesized and Andrej Karpathy did as well,
01:55:46.020 | are generated by a language model?
01:55:48.660 | - Yeah.
01:55:49.500 | So I strongly believe Ilya is gonna be visiting
01:55:53.820 | in a few weeks actually, so I'll ask him in person.
01:55:58.420 | - Will he tell you the truth?
01:55:59.260 | - Yes, of course.
01:56:00.100 | - Okay, sure. - Hopefully.
01:56:00.940 | I mean, ultimately we all have shared paths
01:56:04.060 | and there's friendships that go beyond,
01:56:06.940 | obviously, institutions and so on.
01:56:09.860 | So I hope he tells me the truth.
01:56:11.780 | - Or maybe the AI system is holding him hostage somehow.
01:56:14.420 | Maybe he has some videos that he doesn't wanna release.
01:56:16.980 | So maybe it has taken control over him,
01:56:19.740 | so he can't tell the truth.
01:56:20.580 | - Well, if I see him in person, then I think I'll-
01:56:22.300 | - He will know.
01:56:23.220 | - Yeah, but I think it's a good,
01:56:27.620 | I think Ilya's personality, just knowing him for a while,
01:56:30.940 | yeah, he's, everyone on Twitter, I guess,
01:56:35.260 | gets a different persona.
01:56:36.580 | And I think Ilya's one does not surprise me, right?
01:56:40.860 | So I think knowing Ilya from before social media
01:56:43.540 | and before AI was so prevalent,
01:56:45.740 | I recognize a lot of his character.
01:56:47.460 | So that's something for me that I feel good about,
01:56:50.460 | a friend that hasn't changed
01:56:52.420 | or like is still true to himself, right?
01:56:55.940 | Obviously, there is though a fact
01:56:58.900 | that your field becomes more popular
01:57:02.100 | and he is obviously one of the main figures in the field,
01:57:05.420 | having done a lot of advancement.
01:57:06.860 | So I think that the tricky bit here
01:57:08.980 | is how to balance your true self
01:57:11.060 | with the responsibility that your words carry.
01:57:13.540 | So in this sense, I think, yeah,
01:57:16.100 | like I appreciate the style and I understand it,
01:57:19.300 | but it created debates on like some of his tweets, right?
01:57:24.100 | That maybe it's good we have them early anyways, right?
01:57:26.780 | But yeah, then the reactions are usually polarizing.
01:57:30.980 | I think we're just seeing kind of the reality
01:57:32.980 | of social media a bit there as well,
01:57:34.900 | reflected on that particular topic
01:57:38.060 | or set of topics he's tweeting about.
01:57:40.220 | - Yeah, I mean, it's funny that you speak to this tension.
01:57:42.860 | He was one of the early seminal figures
01:57:46.100 | in the field of deep learning.
01:57:47.260 | And so there's a responsibility with that,
01:57:48.900 | but he's also, from having interacted with him quite a bit,
01:57:53.100 | he's just a brilliant thinker about ideas.
01:57:57.380 | And which, as are you,
01:58:01.180 | and there's a tension between becoming the manager
01:58:03.700 | versus like the actual thinking through very novel ideas.
01:58:08.700 | The, yeah, the scientist versus the manager.
01:58:13.540 | And he's one of the great scientists of our time.
01:58:17.620 | This was quite interesting.
01:58:18.740 | And also people tell me he's quite silly,
01:58:20.740 | which I haven't quite detected yet,
01:58:23.180 | but in private, we'll have to see about that.
01:58:25.940 | - Yeah, yeah.
01:58:27.380 | I mean, just on the point of,
01:58:29.580 | I mean, Ilya has been an inspiration.
01:58:33.260 | I mean, quite a few colleagues I can think shaped,
01:58:36.300 | you know, the person you are.
01:58:37.980 | Like Ilya certainly gets probably the top spot,
01:58:42.220 | if not close to the top.
01:58:43.700 | And if we go back to the question about people in the field,
01:58:47.900 | like how their role would have changed the field or not,
01:58:51.660 | I think Ilya's case is interesting
01:58:53.900 | because he really has a deep belief
01:58:56.740 | in the scaling up of neural networks.
01:58:59.540 | There was a talk that is still famous to this day
01:59:03.620 | from the "Sequence to Sequence" paper,
01:59:06.100 | where he was just claiming,
01:59:08.340 | just give me supervised data and a large neural network,
01:59:11.700 | and then, you know, you'll solve
01:59:13.140 | basically all the problems, right?
01:59:14.580 | That vision, right, was already there many years ago.
01:59:19.580 | So it's good to see like someone who is, in this case,
01:59:22.820 | very deeply into this style of research
01:59:27.140 | and clearly has had a tremendous track record
01:59:31.980 | of successes and so on.
01:59:34.100 | The funny bit about that talk is that
01:59:36.300 | we rehearsed the talk in a hotel room before,
01:59:39.020 | and the original version of that talk
01:59:41.980 | would have been even more controversial.
01:59:43.980 | So maybe I'm the only person
01:59:46.540 | that has seen the unfiltered version of the talk.
01:59:49.180 | And, you know, maybe when the time comes,
01:59:51.660 | maybe we should revisit some of the skip slides
01:59:55.100 | from the talk from Ilya.
01:59:57.580 | But I really think the deep belief
02:00:01.020 | into some certain style of research pays out, right?
02:00:03.900 | It's good to be practical sometimes.
02:00:06.380 | And I actually think Ilya and myself are like practical,
02:00:09.380 | but it's also good there's some sort of long-term belief
02:00:13.260 | and trajectory.
02:00:14.820 | Obviously, there's a bit of luck involved,
02:00:16.700 | but it might be that that's the right path,
02:00:18.820 | then you clearly are ahead
02:00:19.980 | and hugely influential to the field, as he has been.
02:00:23.540 | - Do you agree with that intuition
02:00:25.100 | that maybe was written about by Rich Sutton
02:00:29.660 | in "The Bitter Lesson,"
02:00:33.580 | that the biggest lesson that can be read
02:00:35.260 | from 70 years of AI research is that general methods
02:00:38.620 | that leverage computation are ultimately the most effective.
02:00:42.780 | Do you think that intuition is ultimately correct?
02:00:47.780 | General methods that leverage computation,
02:00:52.220 | allowing the scaling of computation to do a lot of the work,
02:00:56.140 | and so the basic task of us humans is to design methods
02:01:00.900 | that are more and more general
02:01:02.580 | versus more and more specific to the tasks at hand?
02:01:07.060 | - I certainly think this essentially mimics
02:01:10.380 | a bit of the deep learning research,
02:01:13.540 | almost like philosophy,
02:01:16.980 | that on the one hand, we want to be data agnostic,
02:01:20.460 | we don't wanna pre-process datasets,
02:01:22.100 | we wanna see the bytes, right?
02:01:23.420 | Like the true data as it is,
02:01:25.540 | and then learn everything on top.
02:01:27.340 | So very much agree with that.
02:01:29.780 | And I think scaling up feels at the very least,
02:01:32.860 | again, necessary for building incredibly complex systems.
02:01:39.020 | It's possibly not sufficient,
02:01:42.140 | barring that we need a couple of breakthroughs.
02:01:45.060 | I think Rich Sutton mentioned search being part
02:01:48.580 | of the equation of scale and search.
02:01:52.260 | I think search, I've seen it,
02:01:55.420 | that's been more mixed in my experience,
02:01:57.300 | or from that lesson in particular,
02:01:59.340 | search is a bit more tricky
02:02:01.180 | because it is very appealing to search in domains like Go,
02:02:05.340 | where you have a clear reward function
02:02:07.460 | that you can then discard some search traces.
02:02:10.620 | But then in some other tasks,
02:02:12.940 | it's not very clear how you would do that.
02:02:15.260 | Although recently, one of our recent works,
02:02:18.620 | which actually was mostly mimicking or a continuation,
02:02:22.140 | and even the team and the people involved
02:02:23.700 | were pretty much very, like intersecting with AlphaStar,
02:02:27.220 | was AlphaCode, in which we actually saw the bitter lesson,
02:02:30.980 | how scale of the models,
02:02:32.620 | and then a massive amount of search,
02:02:34.260 | yielded this kind of very interesting result
02:02:36.780 | of being able to have human level code competition.
02:02:41.340 | So I've seen examples of it being
02:02:43.660 | literally mapped to search and scale.
02:02:46.380 | I'm not so convinced about the search bit,
02:02:48.140 | but certainly I'm convinced scale will be needed.
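For readers who want a picture of the "scale plus search" recipe mentioned here, below is a toy sketch in the sample-and-filter spirit publicly described for AlphaCode: draw many candidate programs and keep only those that pass the problem's example tests. The candidate pool, the toy execution, and the task are stand-ins for illustration; the real system samples from a large trained model and runs candidates in a sandbox.

```python
import random

# Toy "scale plus search": sample many candidates, filter by example tests.
CANDIDATE_POOL = [                                   # stand-in for samples from a code model
    "def solve(x): return x + 1",
    "def solve(x): return x * 2",
    "def solve(x): return x ** 2",
]

def sample_candidate():
    """Stand-in for drawing one program from a trained model."""
    return random.choice(CANDIDATE_POOL)

def passes(source, example_tests):
    """Run a candidate against the example tests (a real system sandboxes this)."""
    namespace = {}
    exec(source, namespace)                          # toy execution; unsafe for untrusted code
    return all(namespace["solve"](inp) == out for inp, out in example_tests)

def search(example_tests, num_samples=1_000):
    """Massive sampling plus filtering; survivors would then be deduplicated and submitted."""
    return [c for c in (sample_candidate() for _ in range(num_samples))
            if passes(c, example_tests)]

# Example: the hidden task is "double the input"; only one candidate survives.
print(set(search([(2, 4), (5, 10)])))
```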
02:02:50.900 | So we need general methods.
02:02:52.660 | We need to test them,
02:02:53.500 | and maybe we need to make sure that we can scale them,
02:02:56.140 | given the hardware that we have in practice,
02:02:59.100 | but then maybe we should also shape
02:03:00.940 | how the hardware looks like,
02:03:02.860 | based on which methods might be needed to scale.
02:03:05.620 | And that's an interesting contrast of this GPU comment,
02:03:10.620 | that is, we got it for free almost,
02:03:13.380 | because games were using this,
02:03:15.060 | but maybe now if sparsity is required,
02:03:19.500 | we don't have the hardware, although in theory,
02:03:21.860 | I mean, many people are building
02:03:23.180 | different kinds of hardware these days,
02:03:24.660 | but there's a bit of this notion of hardware lottery
02:03:27.780 | for scale that might actually have an impact,
02:03:31.260 | at least on the year, again, scale of years,
02:03:33.420 | on how fast we'll make progress
02:03:35.180 | to maybe a version of neural nets or whatever comes next
02:03:39.420 | that might enable truly intelligent agents.
02:03:44.420 | - Do you think in your lifetime,
02:03:46.100 | we will build an AGI system
02:03:49.500 | that would undeniably be a thing
02:03:54.020 | that achieves human level intelligence and goes far beyond?
02:03:57.460 | - I definitely think it's possible
02:04:02.340 | that it will go far beyond,
02:04:03.700 | but I'm definitely convinced
02:04:04.860 | that it will be human level intelligence.
02:04:08.060 | And I'm hypothesizing about the beyond
02:04:10.940 | because the beyond bit is a bit tricky to define,
02:04:15.940 | especially when we look at the current formula
02:04:19.980 | of starting from this imitation learning standpoint, right?
02:04:23.700 | So we can certainly imitate humans at language and beyond.
02:04:30.660 | So getting at human level through imitation
02:04:33.340 | feels very possible.
02:04:34.860 | Going beyond will require reinforcement learning
02:04:38.980 | and other things.
02:04:39.820 | And I think in some areas
02:04:41.620 | that certainly already has paid out.
02:04:43.500 | I mean, Go being an example that's my favorite so far
02:04:47.220 | in terms of going beyond human capabilities.
02:04:50.340 | But in general, I'm not sure we can define reward functions
02:04:55.340 | that from a seed of imitating human level intelligence
02:04:59.940 | that is general and then going beyond.
02:05:02.820 | That bit is not so clear in my lifetime,
02:05:05.140 | but certainly human level, yes.
02:05:08.100 | And I mean, that in itself is already quite powerful,
02:05:10.860 | I think.
02:05:11.700 | So going beyond, I think it's obviously not,
02:05:14.420 | we're not gonna not try that
02:05:16.060 | if then we get to superhuman scientist
02:05:19.860 | and discovery and advancing the world.
02:05:22.060 | But at least human level is also,
02:05:24.660 | in general, is also very, very powerful.
02:05:27.460 | - Well, especially if human level or slightly beyond
02:05:31.500 | is integrated deeply with human society
02:05:33.740 | and there's billions of agents like that,
02:05:36.460 | do you think there's a singularity moment
02:05:38.460 | beyond which our world will be just very deeply transformed
02:05:43.460 | by these kinds of systems?
02:05:45.620 | Because now you're talking about intelligence systems
02:05:47.780 | that are just, I mean, this is no longer just going
02:05:52.780 | from horse and buggy to the car.
02:05:56.420 | It feels like a very different kind of shift
02:05:59.780 | in what it means to be a living entity on earth.
02:06:03.300 | Are you afraid?
02:06:04.180 | Are you excited of this world?
02:06:06.300 | - I'm afraid if there's a lot more.
02:06:09.340 | So I think maybe we'll need to think about
02:06:13.020 | if we truly get there,
02:06:14.940 | just thinking of limited resources,
02:06:18.340 | like humanity clearly hits some limits
02:06:21.420 | and then there's some balance, hopefully,
02:06:23.420 | that biologically the planet is imposing
02:06:26.260 | and we should actually try to get better at this.
02:06:28.500 | As we know, there's quite a few issues
02:06:31.500 | with having too many people coexisting
02:06:35.740 | in a resource-limited way.
02:06:37.580 | So for digital entities, it's an interesting question.
02:06:40.300 | I think such a limit maybe should exist,
02:06:43.540 | but maybe it's gonna be imposed by energy availability
02:06:47.620 | because this also consumes energy.
02:06:49.700 | In fact, most systems are more inefficient
02:06:53.500 | than we are in terms of energy required.
02:06:55.980 | - Correct, yeah.
02:06:56.820 | - But definitely, I think as a society,
02:06:59.500 | we'll need to just work together
02:07:02.220 | to find what would be reasonable in terms of growth
02:07:06.380 | or how we coexist if that is to happen.
02:07:11.380 | I am very excited about, obviously,
02:07:14.660 | the aspects of automation that make people
02:07:17.700 | that obviously don't have access
02:07:19.020 | to certain resources or knowledge,
02:07:20.980 | for them to have that access.
02:07:23.900 | I think those are the applications in a way
02:07:26.260 | that I'm most excited to see and to personally work towards.
02:07:30.940 | - Yeah, there's going to be significant improvements
02:07:32.660 | in productivity and the quality of life
02:07:34.340 | across the whole population, which is very interesting.
02:07:36.980 | But I'm looking even far beyond
02:07:39.180 | us becoming a multi-planetary species.
02:07:42.660 | And just as a quick bet, last question,
02:07:45.340 | do you think as humans become multi-planetary species,
02:07:49.180 | go outside our solar system, all that kind of stuff,
02:07:52.460 | do you think there'll be more humans
02:07:54.420 | or more robots in that future world?
02:07:57.180 | So will humans be the quirky,
02:08:02.180 | intelligent being of the past,
02:08:04.460 | or is there something deeply fundamental
02:08:06.980 | to human intelligence that's truly special,
02:08:09.580 | where we will be part of those other planets,
02:08:12.100 | not just AI systems?
02:08:13.900 | - I think we're all excited to build AGI
02:08:18.660 | to empower or make us more powerful as human species.
02:08:23.660 | Not to say there might be some hybridization.
02:08:27.580 | I mean, this is obviously speculation,
02:08:29.700 | but there are companies also trying to,
02:08:32.500 | the same way medicine is making us better.
02:08:35.660 | Maybe there are other things that are yet to happen on that.
02:08:39.100 | But if the ratio is not at most one-to-one,
02:08:43.340 | I would not be happy.
02:08:44.580 | So I would hope that we are part of the equation,
02:08:49.220 | but maybe there's, maybe a one-to-one ratio
02:08:52.780 | feels like possible, constructive and so on,
02:08:56.220 | but it would not be good to have a misbalance,
02:08:59.620 | at least from my core beliefs
02:09:01.420 | and the why I'm doing what I'm doing when I go to work
02:09:05.180 | and I research what I research.
02:09:07.100 | - Well, this is how I know you're human,
02:09:09.500 | and this is how you've passed the Turing test.
02:09:12.700 | And you are one of the special humans, Oriol.
02:09:14.940 | It's a huge honor that you would talk with me,
02:09:17.060 | and I hope we get the chance to speak again,
02:09:19.900 | maybe once before the singularity, once after,
02:09:23.020 | and see how our view of the world changes.
02:09:25.420 | Thank you again for talking today.
02:09:26.540 | Thank you for the amazing work you do.
02:09:28.140 | You're a shining example of a researcher
02:09:31.300 | and a human being in this community.
02:09:32.900 | - Thanks a lot, Lex.
02:09:34.020 | Yeah, looking forward to before the singularity, certainly.
02:09:36.780 | (Lex laughs)
02:09:37.820 | - And maybe after.
02:09:38.980 | Thanks for listening to this conversation
02:09:41.460 | with Oriol Vinyals.
02:09:43.100 | To support this podcast,
02:09:44.260 | please check out our sponsors in the description.
02:09:46.940 | And now, let me leave you with some words from Alan Turing.
02:09:50.060 | "Those who can imagine anything can create the impossible."
02:09:55.140 | Thank you for listening, and hope to see you next time.
02:09:59.180 | (upbeat music)
02:10:01.780 | (upbeat music)