
Oriol Vinyals: Deep Learning and Artificial General Intelligence | Lex Fridman Podcast #306


Chapters

0:00 Introduction
0:34 AI
15:31 Weights
21:50 Gato
56:38 Meta learning
70:37 Neural networks
93:02 Emergence
99:47 AI sentience
123:43 AGI

Whisper Transcript

00:00:00.000 | "At which point is the neural network a being versus a tool?"
00:00:05.000 | The following is a conversation with Oriol Vinyals,
00:00:11.400 | his second time in the podcast.
00:00:13.480 | Oriol is the research director
00:00:15.960 | and deep learning lead at DeepMind,
00:00:18.040 | and one of the most brilliant thinkers and researchers
00:00:20.980 | in the history of artificial intelligence.
00:00:24.360 | This is the Lex Fridman Podcast.
00:00:26.680 | To support it, please check out our sponsors
00:00:28.880 | in the description.
00:00:30.200 | And now, dear friends, here's Oriol Vinyals.
00:00:33.600 | You are one of the most brilliant researchers
00:00:37.060 | in the history of AI,
00:00:38.480 | working across all kinds of modalities.
00:00:40.600 | Probably the one common theme is
00:00:42.720 | it's always sequences of data.
00:00:45.040 | So we're talking about languages, images, even biology,
00:00:48.020 | and games, as we talked about last time.
00:00:50.280 | So you're a good person to ask this.
00:00:53.400 | In your lifetime, will we be able to build an AI system
00:00:57.360 | that's able to replace me as the interviewer
00:01:00.760 | in this conversation,
00:01:02.600 | in terms of ability to ask questions
00:01:04.480 | that are compelling to somebody listening?
00:01:06.600 | And then further question is,
00:01:09.400 | are we close,
00:01:10.640 | will we be able to build a system that replaces you
00:01:13.880 | as the interviewee
00:01:16.080 | in order to create a compelling conversation?
00:01:18.120 | How far away are we, do you think?
00:01:20.040 | - It's a good question.
00:01:21.800 | I think partly I would say, do we want that?
00:01:24.680 | I really like when we start now with very powerful models,
00:01:29.360 | interacting with them
00:01:30.960 | and thinking of them more closer to us.
00:01:34.040 | The question is,
00:01:34.880 | if you remove the human side of the conversation,
00:01:38.320 | is that an interesting,
00:01:40.200 | is that an interesting artifact?
00:01:42.320 | And I would say probably not.
00:01:44.440 | I've seen, for instance, last time we spoke,
00:01:47.400 | like we were talking about StarCraft
00:01:50.280 | and creating agents that play games
00:01:53.480 | involves self-play,
00:01:54.880 | but ultimately what people care about was,
00:01:57.600 | how does this agent behave
00:01:59.080 | when the opposite side is a human?
00:02:02.680 | So without a doubt,
00:02:04.720 | we will probably be more empowered by AI.
00:02:08.520 | Maybe you can source some questions from an AI system.
00:02:12.480 | I mean, that even today, I would say,
00:02:13.960 | it's quite plausible that with your creativity,
00:02:17.040 | you might actually find very interesting questions
00:02:19.400 | that you can filter.
00:02:20.720 | We call this cherry picking sometimes
00:02:22.400 | in the field of language.
00:02:24.040 | And likewise, if I had now the tools on my side,
00:02:27.520 | I could say, look,
00:02:28.520 | you're asking this interesting question.
00:02:30.640 | From this answer,
00:02:31.600 | I like the words chosen by this particular system
00:02:34.760 | that created a few words.
00:02:36.600 | Completely replacing it feels not exactly exciting to me.
00:02:41.280 | Although in my lifetime, I think, well,
00:02:43.760 | I mean, given the trajectory,
00:02:45.520 | I think it's possible that perhaps
00:02:48.000 | there could be interesting,
00:02:49.880 | maybe self-play interviews as you're suggesting
00:02:53.040 | that would look or sound quite interesting
00:02:56.160 | and probably would educate,
00:02:57.720 | or you could learn a topic
00:02:59.160 | through listening to one of these interviews
00:03:01.600 | at a basic level, at least.
00:03:03.200 | - So you said it doesn't seem exciting to you,
00:03:04.800 | but what if exciting is part of the objective function
00:03:07.520 | the thing is optimized over?
00:03:09.120 | So there's probably a huge amount of data of humans,
00:03:12.840 | if you look correctly,
00:03:14.120 | of humans communicating online,
00:03:16.080 | and there's probably ways to measure the degree of,
00:03:19.560 | as they talk about engagement.
00:05:21.920 | So you can probably optimize for the question
00:05:24.120 | that has most created an engaging conversation in the past.
00:03:28.680 | So actually, if you strictly use the word exciting,
00:03:31.560 | there is probably a way
00:05:36.520 | to create optimally exciting conversations
00:03:40.320 | that involve AI systems.
00:03:42.160 | At least one side is AI.
00:03:44.600 | - Yeah, that makes sense.
00:03:45.640 | I think maybe looping back a bit to games
00:03:48.880 | and the game industry,
00:03:50.240 | when you design algorithms,
00:03:53.040 | you're thinking about winning as the objective, right?
00:03:55.800 | Or the reward function.
00:03:57.320 | But in fact, when we discuss this with Blizzard,
00:04:00.080 | the creators of StarCraft in this case,
00:04:02.320 | I think what's exciting, fun,
00:04:05.360 | if you could measure that and optimize for that,
00:04:09.160 | that's probably why we play video games
00:04:11.720 | or why we interact or listen or look at cat videos
00:04:14.600 | or whatever on the internet.
00:04:16.440 | So it's true that modeling reward
00:04:19.480 | beyond the obvious reward functions
00:04:21.320 | we're used to in reinforcement learning
00:04:23.720 | is definitely very exciting.
00:04:25.560 | And again, there is some progress actually
00:04:28.240 | into a particular aspect of AI, which is quite critical,
00:04:32.160 | which is, for instance, is a conversation
00:04:36.120 | or is the information truthful, right?
00:04:38.200 | So you could start trying to evaluate these
00:04:41.640 | from excerpts from the internet, right?
00:04:44.400 | That has lots of information.
00:04:45.800 | And then if you can learn a function automated ideally,
00:04:50.160 | so you can also optimize it more easily,
00:04:52.880 | then you could actually have conversations
00:04:54.840 | that optimize for non-obvious things such as excitement.
00:04:59.360 | So yeah, that's quite possible.
00:05:01.040 | And then I would say in that case,
00:05:03.560 | it would definitely be a fun exercise
00:05:05.880 | and quite unique to have at least one side
00:05:08.040 | that is fully driven by an excitement reward function.
00:05:12.800 | But obviously there would be still quite a lot of humanity
00:05:16.920 | in the system, both from who is building the system,
00:05:20.760 | of course, and also ultimately,
00:05:23.560 | if we think of labeling for excitement,
00:05:26.000 | that those labels must come from us
00:05:28.440 | because it's just hard to have a computational measure
00:05:32.480 | of excitement as far as I understand,
00:05:34.560 | there's no such thing.
00:05:36.120 | - Wow, as you mentioned truth also,
00:05:39.240 | I would actually venture to say that excitement
00:05:41.800 | is easier to label than truth,
00:05:44.160 | or perhaps has lower consequences of failure.
00:05:49.000 | But there is perhaps the humanness that you mentioned,
00:05:54.920 | that's perhaps part of a thing that could be labeled.
00:05:58.240 | And that could mean an AI system that's doing dialogue,
00:06:02.480 | that's doing conversations should be flawed, for example.
00:06:07.480 | Like that's the thing you optimize for,
00:06:09.440 | which is have inherent contradictions by design,
00:06:13.280 | have flaws by design.
00:06:15.080 | Maybe it also needs to have a strong sense of identity.
00:06:18.760 | So it has a backstory, it told itself that it sticks to,
00:06:22.680 | it has memories, not in terms of how the system is designed,
00:06:26.880 | but it's able to tell stories about its past.
00:06:30.360 | It's able to have mortality and fear of mortality
00:06:35.360 | in the following way that it has an identity
00:06:39.120 | and like if it says something stupid
00:06:41.240 | and gets canceled on Twitter, that's the end of that system.
00:06:44.720 | So it's not like you get to rebrand yourself,
00:06:47.360 | that system is, that's it.
00:06:49.360 | So maybe that the high stakes nature of it,
00:06:52.120 | because like you can't say anything stupid now,
00:06:54.560 | or because you'd be canceled on Twitter.
00:06:57.720 | And that there's stakes to that.
00:06:59.760 | And that's, I think, part of the reason
00:07:01.160 | that makes it interesting.
00:07:03.520 | And then you have a perspective
00:07:04.720 | like you've built up over time that you stick with,
00:07:07.720 | and then people can disagree with you.
00:07:09.120 | So holding that perspective strongly,
00:07:11.800 | holding sort of maybe a controversial,
00:07:14.040 | at least a strong opinion.
00:07:16.300 | All of those elements, it feels like they can be learned
00:07:18.840 | because it feels like there's a lot of data
00:07:21.760 | on the internet of people having an opinion.
00:07:24.520 | (laughs)
00:07:25.400 | And then combine that with a metric of excitement,
00:07:27.840 | you can start to create something that,
00:07:30.000 | as opposed to trying to optimize for sort of
00:07:34.480 | grammatical clarity and truthfulness,
00:07:38.120 | the factual consistency over many sentences,
00:07:42.000 | you optimize for the humanness.
00:07:45.320 | And there's obviously data for humanness on the internet.
00:07:48.880 | So I wonder if there's a future where that's part,
00:07:53.040 | I mean I sometimes wonder that about myself,
00:07:56.400 | I'm a huge fan of podcasts,
00:07:58.120 | and I listen to some podcasts,
00:08:00.760 | and I think like what is interesting about this,
00:08:03.240 | what is compelling?
00:08:04.280 | The same way you watch other games,
00:08:07.440 | like you said, watch people play StarCraft,
00:08:09.160 | or have Magnus Carlsen play chess.
00:08:13.040 | So I'm not a chess player,
00:08:14.920 | but it's still interesting to me,
00:08:16.120 | and what is that?
00:08:16.960 | That's the stakes of it,
00:08:19.440 | maybe the end of a domination of a series of wins.
00:08:23.400 | I don't know, there's all those elements
00:08:25.440 | somehow connect to a compelling conversation,
00:08:28.000 | and I wonder how hard is that to replace?
00:08:30.200 | 'Cause ultimately all of that connects
00:08:31.840 | to the initial proposition of how to test
00:08:34.600 | whether an AI is intelligent or not with the Turing test.
00:08:38.640 | Which I guess, my question comes from a place
00:08:41.760 | of the spirit of that test.
00:08:43.680 | - Yes, I actually recall,
00:08:45.440 | I was just listening to our first podcast
00:08:47.920 | where we discussed the Turing test.
00:08:50.360 | So I would say from a neural network,
00:08:54.760 | AI builder perspective,
00:08:59.160 | usually you try to map many of these interesting topics
00:09:03.160 | you discuss to benchmarks,
00:09:05.200 | and then also to actual architectures
00:09:08.120 | on how these systems are currently built,
00:09:10.640 | how they learn, what data they learn from,
00:09:13.080 | what are they learning, right?
00:09:14.280 | We're talking about weights of a mathematical function,
00:09:17.800 | and then looking at the current state of the game,
00:09:21.560 | maybe what leaps forward do we need
00:09:26.000 | to get to the ultimate stage of all these experiences,
00:09:30.640 | lifetime experience, fears,
00:09:32.840 | like words for which we're currently barely seeing progress,
00:09:37.840 | just because what's happening today
00:09:40.040 | is you take all these human interactions,
00:09:43.960 | it's a large vast variety of human interactions online,
00:09:47.920 | and then you're distilling these sequences, right?
00:09:51.600 | Going back to my passion, like sequences of words,
00:09:54.680 | letters, images, sound,
00:09:56.920 | there's more modalities here to be at play.
00:09:59.840 | And then you're trying to just learn a function
00:10:03.360 | that will be happy,
00:10:04.400 | that maximizes the likelihood of seeing all these
00:10:08.840 | through a neural network.
00:10:10.880 | Now, I think there's a few places
00:10:14.200 | where, with the way we currently train these models,
00:10:17.240 | we would clearly like to be able to develop
00:10:20.000 | the kinds of capabilities you say.
00:10:22.120 | I'll tell you maybe a couple.
00:10:23.520 | One is the lifetime of an agent or a model.
00:10:27.640 | So you learn from these data offline, right?
00:10:30.840 | So you're just passively observing and maximizing these,
00:10:33.560 | you know, it's almost like a landscape of mountains.
00:10:37.760 | And then everywhere there's data
00:10:39.120 | that humans interacted in this way,
00:10:41.040 | you're trying to make that higher
00:10:43.000 | and then lower where there's no data.
00:10:45.720 | And then these models generally
00:10:48.480 | don't then experience themselves.
00:10:51.160 | They just are observers, right?
00:10:52.520 | They're passive observers of the data.
00:10:54.600 | And then we're putting them to then generate data
00:10:57.440 | when we interact with them.
00:10:59.200 | But that's very limiting.
00:11:00.920 | The experience they actually have,
00:11:03.480 | which could maybe be used for optimizing
00:11:05.680 | or further optimizing the weights,
00:11:07.440 | we're not even doing that.
00:11:08.640 | So to be clear, and again, mapping to AlphaGo, AlphaStar,
00:11:13.640 | we train the model.
00:11:15.280 | And when we deploy it to play against humans,
00:11:18.280 | or in this case, interact with humans,
00:11:20.320 | like language models, they don't even keep training, right?
00:11:23.480 | They're not learning in the sense of the weights
00:11:26.160 | that you've learned from the data.
00:11:28.200 | They don't keep changing.
00:11:29.760 | Now there's something that feels a bit more magical,
00:11:33.480 | but it's understandable if you're into neural nets,
00:11:36.200 | which is, well, they might not learn
00:11:39.120 | in the strict sense of the words, the weights changing.
00:11:41.480 | Maybe that's mapping to how neurons interconnect
00:11:44.360 | and how we learn over our lifetime.
00:11:46.640 | But it's true that the context of the conversation
00:11:50.280 | that takes place when you talk to these systems,
00:11:54.960 | it's held in their working memory, right?
00:11:57.200 | It's almost like you start a computer,
00:12:00.120 | it has a hard drive that has a lot of information.
00:12:02.840 | You have access to the internet,
00:12:04.000 | which has probably all the information,
00:12:06.320 | but there's also a working memory
00:12:08.440 | where these agents, as we call them,
00:12:11.080 | or start calling them, build upon.
00:12:13.840 | Now, this memory is very limited.
00:12:16.560 | I mean, right now we're talking, to be concrete,
00:12:19.200 | about 2000 words that we hold,
00:12:21.760 | and then beyond that, we start forgetting what we've seen.
00:12:24.840 | So you can see that there's some short-term coherence
00:12:28.040 | already, right, with when you said,
00:12:29.880 | I mean, it's a very interesting topic,
00:12:32.280 | having sort of a mapping, an agent to have consistency.
00:12:37.280 | Then if you say, "Oh, what's your name?"
00:12:40.760 | It could remember that,
00:12:42.240 | but then it might forget beyond 2000 words,
00:12:44.960 | which is not that long of a context,
00:12:47.480 | if we think that even these podcasts or books are much longer.
00:12:51.760 | So technically speaking, there's a limitation there.
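To make that working-memory limit concrete, here is a minimal sketch in Python; the token budget and the whitespace "tokenizer" are stand-ins, not what any real system uses, but they show how a fixed context window forces older turns of a conversation to be forgotten:

```python
# Minimal sketch of a fixed-size working memory (hypothetical numbers and
# whitespace "tokenization"; real systems use subword tokenizers).
MAX_CONTEXT_TOKENS = 2000  # roughly the "2000 words" mentioned above

def build_context(turns: list[str], max_tokens: int = MAX_CONTEXT_TOKENS) -> str:
    """Keep only the most recent turns that fit in the token budget."""
    kept: list[str] = []
    used = 0
    for turn in reversed(turns):          # newest turns first
        n = len(turn.split())             # crude stand-in for token counting
        if used + n > max_tokens:
            break                         # everything older is "forgotten"
        kept.append(turn)
        used += n
    return "\n".join(reversed(kept))      # restore chronological order

conversation = ["User: What's your name?", "Model: You can call me Gato."]
print(build_context(conversation))
```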
00:12:55.120 | Super exciting for people that work on deep learning
00:12:58.160 | to be working on,
00:12:59.960 | but I would say we lack maybe benchmarks
00:13:03.040 | and the technology to have this lifetime-like experience
00:13:07.840 | of memory that keeps building up.
00:13:10.840 | However, the way it learns offline
00:13:13.160 | is clearly very powerful, right?
00:13:14.880 | So if you'd asked me three years ago,
00:13:17.400 | I would say, "Oh, we're very far."
00:13:18.640 | I think we've seen the power of this imitation,
00:13:22.240 | again, on the internet scale that has enabled this
00:13:26.240 | to feel like at least the knowledge,
00:13:28.760 | the basic knowledge about the world
00:13:30.160 | now is incorporated into the weights,
00:13:33.120 | but then this experience is lacking.
00:13:36.560 | And in fact, as I said, we don't even train them
00:13:39.320 | when we're talking to them,
00:13:41.160 | other than their working memory, of course, is affected.
00:13:44.760 | So that's the dynamic part,
00:13:46.560 | but they don't learn in the same way
00:13:48.240 | that you and I have learned, right?
00:13:49.720 | When, from basically when we were born and probably before.
00:13:54.040 | So lots of fascinating, interesting questions
00:13:56.480 | you asked there.
00:13:57.400 | I think the one I mentioned is this idea of memory
00:14:01.680 | and experience versus just kind of observe the world
00:14:05.480 | and learn its knowledge,
00:14:06.720 | which I think for that, I would argue,
00:14:08.880 | there have been lots of recent advancements
00:14:10.320 | that make me very excited about the field.
00:14:13.400 | And then the second maybe issue that I see is
00:14:18.160 | all these models, we train them from scratch.
00:14:21.240 | That's something I would have complained three years ago
00:14:24.000 | or six years ago or 10 years ago.
00:14:26.400 | And it feels, if we take inspiration from how we got here,
00:14:31.360 | how the universe evolved us and we keep evolving,
00:14:35.240 | it feels that is a missing piece,
00:14:37.840 | that we should not be training models from scratch
00:14:41.320 | every few months, that there should be some sort of way
00:14:45.280 | in which we can grow models much like as a species
00:14:49.000 | and many other elements in the universe
00:14:51.520 | is building from the previous sort of iterations.
00:14:55.000 | And that from a just purely neural network perspective,
00:14:59.520 | even though we would like to make it work,
00:15:02.280 | it's proven very hard to not, you know,
00:15:05.600 | throw away the previous weights, right?
00:15:07.680 | This landscape we learn from the data and, you know,
00:15:10.280 | refresh it with a brand new set of weights,
00:15:13.360 | given maybe a recent snapshot of these datasets
00:15:16.960 | we train on, et cetera, or even a new game we're learning.
00:15:19.960 | So that feels like something is missing fundamentally.
00:15:24.160 | We might find it, but it's not very clear
00:15:27.440 | what it will look like.
00:15:28.400 | There's many ideas and it's super exciting as well.
00:15:30.800 | - Yes, just for people who don't know,
00:15:32.440 | when you're approaching a new problem in machine learning,
00:15:35.720 | you're going to come up with an architecture
00:15:38.200 | that has a bunch of weights
00:15:40.960 | and then you initialize them somehow,
00:15:43.360 | which in most cases is some version of random.
00:15:47.280 | So that's what you mean by starting from scratch.
00:15:48.960 | And it seems like it's a waste every time you solve
00:15:52.880 | the game of Go and chess, StarCraft, protein folding,
00:15:59.440 | like surely there's some way to reuse the weights
00:16:03.160 | as we grow this giant database of neural networks.
00:16:08.400 | - That has solved some of the toughest problems
00:16:10.000 | in the world.
00:16:10.840 | And so some of that is, what is that?
00:16:15.240 | Methods, how to reuse weights,
00:16:19.080 | how to learn to extract what's generalizable,
00:16:22.480 | or at least has a chance to be
00:16:25.160 | and throw away the other stuff.
00:16:26.900 | And maybe the neural network itself
00:16:29.560 | should be able to tell you that.
00:16:31.640 | Like what, yeah, how do you,
00:16:34.640 | what ideas do you have for better initialization of weights?
00:16:37.520 | Maybe stepping back,
00:16:38.720 | if we look at the field of machine learning,
00:16:41.720 | but especially deep learning, right?
00:16:44.040 | At the core of deep learning,
00:16:45.240 | there's this beautiful idea that a single algorithm
00:16:49.240 | can solve any task, right?
00:16:50.920 | So it's been proven over and over
00:16:54.400 | with an ever-increasing set of benchmarks
00:16:56.440 | and things that were thought impossible
00:16:58.580 | that are being cracked by this basic principle.
00:17:01.960 | That is you take a neural network of uninitialized weights.
00:17:05.800 | So like a blank computational brain,
00:17:09.640 | then you give it, in the case of supervised learning,
00:17:12.600 | ideally a lot of examples of,
00:17:14.960 | hey, here is what the input looks like
00:17:17.120 | and the desired output should look like this.
00:17:19.560 | I mean, image classification is a very clear example:
00:17:22.360 | images to maybe one of a thousand categories.
00:17:25.560 | That's what ImageNet is like,
00:17:26.840 | but many, many, if not all problems can be mapped this way.
00:17:30.720 | And then there's a generic recipe, right?
00:17:33.840 | That you can use.
00:17:35.240 | And this recipe with very little change.
00:17:38.600 | And I think that's the core of deep learning research,
00:17:40.920 | right?
00:17:41.760 | That what is the recipe that is universal
00:17:44.400 | that for any new given task,
00:17:46.400 | I'll be able to use without thinking,
00:17:48.440 | without having to work very hard on the problem at stake.
00:17:51.740 | We have not found this recipe,
00:17:54.400 | but I think the field is excited to find fewer tweaks
00:17:59.400 | or tricks that people find
00:18:02.000 | when they work on important problems specific to those
00:18:05.280 | and more of a general algorithm, right?
00:18:07.540 | So at an algorithmic level,
00:18:09.300 | I would say we have something general already,
00:18:11.760 | which is this formula of training a very powerful model
00:18:14.520 | and neural network on a lot of data.
00:18:17.000 | And in many cases,
00:18:19.400 | you need some specificity
00:18:21.200 | to the actual problem you're solving.
00:18:23.400 | Protein folding being such an important problem
00:18:26.060 | has some basic recipe that is learned from before, right?
00:18:30.780 | Like transformer models, graph neural networks,
00:18:34.120 | ideas coming from NLP,
00:18:35.720 | like something called BERT,
00:18:38.580 | that is a kind of loss that you can put in place
00:18:41.280 | to help the model.
00:18:42.420 | Knowledge distillation is another technique, right?
00:18:45.680 | So this is the formula.
00:18:47.080 | We still had to find some particular things
00:18:50.560 | that were specific to alpha fold, right?
00:18:53.600 | That's very important because protein folding
00:18:55.880 | is such a high value problem that as humans,
00:18:59.120 | we should solve it no matter if we need to be a bit specific.
00:19:02.860 | And it's possible that some of these learnings
00:19:04.940 | will apply then to the next iteration of this recipe
00:19:07.380 | that deep learners are about.
00:19:09.340 | But it is true that so far,
00:19:11.820 | the recipe is what's common,
00:19:13.180 | but the weights you generally throw away,
00:19:15.860 | which feels very sad.
00:19:17.780 | Although maybe in the last,
00:19:21.380 | especially in the last two, three years,
00:19:23.360 | and when we last spoke,
00:19:24.620 | I mentioned these area of meta-learning,
00:19:26.600 | which is the idea of learning to learn.
00:19:29.540 | That idea and some progress has been had starting,
00:19:33.100 | I would say, mostly from GPT-3 on the language domain only,
00:19:37.140 | in which you could conceive a model that is trained once,
00:19:42.060 | and then this model is not narrow in that
00:19:44.680 | it only knows how to translate a pair of languages,
00:19:47.640 | or it only knows how to assign sentiment to a sentence.
00:19:51.480 | These actually, you could teach it
00:19:54.100 | by prompting, as it's called.
00:19:55.460 | And this prompting is essentially just showing it
00:19:58.060 | a few more examples,
00:19:59.860 | almost like you do show examples, input-output examples,
00:20:02.980 | algorithmically speaking,
00:20:04.080 | to the process of creating this model.
00:20:06.280 | But now you're doing it through language,
00:20:07.820 | which is very natural way for us to learn from one another.
00:20:11.040 | I tell you, "Hey, you should do this new task.
00:20:13.080 | "I'll tell you a bit more.
00:20:14.500 | "Maybe you ask me some questions."
00:20:16.040 | And now you know the task, right?
00:20:17.800 | You didn't need to retrain it from scratch.
00:20:20.300 | And we've seen these magical moments almost
00:20:23.180 | in this way to do few-shot prompting through language
00:20:26.940 | on language-only domain.
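As an illustration of what teaching a model a task by prompting looks like in practice, here is a minimal sketch; the sentiment task is a made-up example and `query_model` is a hypothetical stub, not an actual API:

```python
# Few-shot prompting sketch: the "teaching" is just text shown in context.
# `query_model` stands in for whatever interface serves a large language model.
def query_model(prompt: str) -> str:  # hypothetical stub
    raise NotImplementedError("call your language model of choice here")

few_shot_prompt = """Classify the sentiment of each review as positive or negative.

Review: "The food was wonderful and the staff were lovely."
Sentiment: positive

Review: "Cold soup, rude waiter, never again."
Sentiment: negative

Review: "A masterpiece from start to finish."
Sentiment:"""

# The model is never retrained; it infers the task from the examples above.
# completion = query_model(few_shot_prompt)   # expected to continue with " positive"
```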
00:20:28.520 | And then in the last two years,
00:20:30.940 | we've seen this expanded beyond language,
00:20:34.620 | adding vision, adding actions and games,
00:20:38.040 | lots of progress to be had.
00:20:39.460 | But this is maybe, if you ask me about
00:20:42.120 | how are we gonna crack this problem,
00:20:43.700 | this is perhaps one way in which you have a single model.
00:20:47.760 | The problem of this model is it's hard to grow
00:20:52.140 | in weights or capacity,
00:20:54.260 | but the model is certainly so powerful
00:20:56.380 | that you can teach it some tasks, right?
00:20:58.920 | In this way that I could teach you a new task now
00:21:01.940 | if it were, say, a text-based task
00:21:05.060 | or a classification, a vision-style task.
00:21:08.400 | But it still feels like more breakthroughs should be had,
00:21:12.820 | but it's a great beginning, right?
00:21:13.980 | We have a good baseline.
00:21:15.400 | We have an idea that this maybe is the way
00:21:17.740 | we want to benchmark progress towards AGI.
00:21:20.740 | And I think in my view, that's critical
00:21:22.820 | to always have a way to benchmark progress,
00:21:25.000 | with the community converging to this overall,
00:21:27.780 | which is good to see.
00:21:29.200 | And then this is actually what excites me
00:21:33.500 | in terms of also next steps for deep learning
00:21:36.580 | is how to make these models more powerful.
00:21:39.040 | How do you train them?
00:21:40.460 | How to grow them if they must grow?
00:21:43.080 | Should they change their weights
00:21:44.500 | as you teach it the task or not?
00:21:46.060 | There's some interesting questions, many to be answered.
00:21:48.520 | - Yeah, you've opened the door
00:21:49.760 | to a bunch of questions I wanna ask,
00:21:52.260 | but let's first return to your tweet
00:21:55.660 | and read it like Shakespeare.
00:21:57.120 | You wrote, "Gato is not the end, it's the beginning."
00:22:01.220 | And then you wrote, "Meow," and then an emoji of a cat.
00:22:04.960 | So first, two questions.
00:22:07.700 | First, can you explain the meow and the cat emoji?
00:22:10.020 | And second, can you explain what Gato is and how it works?
00:22:13.620 | - Right, indeed.
00:22:14.580 | I mean, thanks for reminding me
00:22:16.500 | that we're all exposing on Twitter and-
00:22:19.900 | - Permanently there.
00:22:20.900 | - Yes, permanently there.
00:22:21.900 | - One of the greatest AI researchers of all time,
00:22:25.100 | meow and cat emoji.
00:22:27.220 | - Yes. - There you go.
00:22:28.260 | - Right, so-
00:22:29.100 | - Can you imagine, like, Turing tweeting,
00:22:31.940 | meow and cat, probably he would, probably would.
00:22:34.340 | - Probably.
00:22:35.180 | So yeah, the tweet is important, actually.
00:22:38.020 | You know, I put thought into the tweets.
00:22:39.800 | I hope people-
00:22:40.780 | - Which part did you think, okay.
00:22:43.060 | So there's three sentences.
00:22:44.900 | Gato's not the end, Gato's the beginning.
00:22:48.700 | Meow, cat emoji.
00:22:50.140 | Okay, which is the important part?
00:22:51.740 | - It's the meow, no, no.
00:22:53.140 | Definitely that it is the beginning.
00:22:56.060 | I mean, I probably was just explaining a bit
00:23:00.340 | where the field is going, but let me tell you about Gato.
00:23:03.740 | So first, the name Gato comes from maybe a sequence
00:23:08.100 | of releases that DeepMind had that named,
00:23:11.820 | like used animal names to name some of their models
00:23:15.100 | that are based on this idea of large sequence models.
00:23:19.100 | Initially, they're only language,
00:23:20.620 | but we are expanding to other modalities.
00:23:23.180 | So we had, you know, we had Gopher, Chinchilla,
00:23:28.180 | these were language only.
00:23:29.940 | And then more recently we released Flamingo,
00:23:32.700 | which adds vision to the equation.
00:23:35.420 | And then Gato, which adds vision
00:23:38.140 | and then also actions in the mix, right?
00:23:41.620 | As we discuss actually actions,
00:23:44.500 | especially discrete actions like up, down, left, right.
00:23:47.540 | I just told you the actions, but they're words.
00:23:49.460 | So you can kind of see how actions naturally map
00:23:52.740 | to sequence modeling of words,
00:23:54.500 | which these models are very powerful.
00:23:57.020 | So Gato was named after, I believe,
00:24:01.660 | I can only go from memory, right?
00:24:03.580 | These, you know, these things always happen
00:24:06.020 | with an amazing team of researchers behind.
00:24:08.500 | So before the release, we had the discussion
00:24:12.180 | about which animal would we pick, right?
00:24:14.220 | And I think because of the word general agent, right?
00:24:18.340 | And this is a property quite unique to Gato.
00:24:21.860 | We kind of were playing with the GA words
00:24:24.700 | and then, you know, Gato is-
00:24:25.980 | - Rhymes with cat.
00:24:26.900 | - Yes.
00:24:27.740 | And Gato is obviously a Spanish version of cat.
00:24:30.220 | I had nothing to do with it, although I'm from Spain.
00:24:32.220 | - Oh, how do you, wait, sorry.
00:24:33.260 | How do you say cat in Spanish?
00:24:34.620 | - Gato.
00:24:35.460 | - Oh, Gato.
00:24:36.300 | - Yeah.
00:24:37.140 | - Now it all makes sense. - Okay, okay, I see, I see.
00:24:37.980 | - Now it all makes sense.
00:24:39.060 | - Okay, so-
00:24:39.900 | - How do you say meow in Spanish?
00:24:40.780 | No, that's probably the same.
00:24:41.900 | - I think you say it the same way,
00:24:44.380 | but you write it as M-I-A-U.
00:24:48.060 | - Okay, it's universal.
00:24:49.220 | - Yeah.
00:24:50.060 | - All right, so then how does the thing work?
00:24:51.660 | So you said general is, so you said language, vision-
00:24:56.660 | - And action.
00:24:58.380 | - Action.
00:24:59.220 | How does this, can you explain
00:25:01.820 | what kind of neural networks are involved?
00:25:04.220 | What does the training look like?
00:25:06.340 | And maybe what to you are some beautiful ideas
00:25:10.900 | within this system?
00:25:11.860 | - Yeah, so maybe the basics of Gato
00:25:16.060 | are not that dissimilar from many, many work that comes.
00:25:19.940 | So here is where the sort of the recipe,
00:25:22.900 | I mean, hasn't changed too much.
00:25:24.220 | There is a transformer model
00:25:25.580 | that's the kind of recurrent neural network
00:25:28.620 | that essentially takes a sequence of modalities,
00:25:33.300 | observations that could be words,
00:25:36.380 | could be vision, or could be actions.
00:25:38.820 | And then the objective that you train it to do,
00:25:42.140 | when you train it is to predict
00:25:44.060 | what the next anything is.
00:25:46.380 | And anything means what's the next action.
00:25:48.780 | If this sequence that I'm showing you to train
00:25:51.220 | is a sequence of actions and observations,
00:25:53.500 | then you're predicting what's the next action
00:25:55.620 | and the next observation, right?
00:25:57.100 | So you think of these really as a sequence of bytes, right?
00:26:00.900 | So take any sequence of words,
00:26:04.220 | a sequence of interleaved words and images,
00:26:06.980 | a sequence of maybe observations that are images
00:36:11.260 | and moves in Atari: up, down, left, right.
00:26:14.260 | And these, you just think of them as bytes
00:26:17.620 | and you're modeling what's the next byte gonna be like.
00:26:20.540 | And you might interpret that as an action
00:26:23.380 | and then play it in a game,
00:26:25.820 | or you could interpret it as a word
00:26:27.660 | and then write it down
00:26:29.060 | if you're chatting with the system and so on.
00:26:31.340 | So Gato basically can be thought as inputs,
00:26:36.620 | images, text, video, actions.
00:26:41.500 | It also actually inputs some sort of proprioception sensors
00:26:45.780 | from robotics because robotics is one of the tasks
00:26:48.260 | that it's been trained to do.
00:26:49.860 | And then at the output, similarly,
00:26:51.900 | it outputs words, actions.
00:26:53.700 | It does not output images.
00:26:55.700 | That's just by design,
00:26:57.420 | we decided not to go that way for now.
00:26:59.900 | That's also in part why it's the beginning
00:27:02.740 | because there's more to do clearly.
00:27:04.900 | But that's kind of what Gato is.
00:27:06.420 | It's this brain that essentially you give it any sequence
00:27:09.220 | of these observations and modalities
00:27:11.940 | and it outputs the next step.
00:27:13.780 | And then off you go,
00:27:15.340 | you feed the next step in
00:27:17.380 | and predict the next one and so on.
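A minimal sketch of that autoregressive loop, with an entirely hypothetical stand-in for the trained model, a hypothetical environment object, and made-up token ranges; the point is only that observations and actions live in one flat sequence and the model repeatedly predicts "the next anything":

```python
import random

# Hypothetical token-ID layout (illustrative only; the real partitioning differs).
TEXT_TOKENS   = range(0, 10_000)
IMAGE_TOKENS  = range(10_000, 20_000)
ACTION_TOKENS = range(20_000, 20_018)   # e.g. a small discrete action set

def predict_next_token(sequence: list[int]) -> int:
    """Stand-in for the trained transformer: returns the next token ID."""
    return random.choice(ACTION_TOKENS)  # placeholder behaviour

def act_in_environment(env, episode_prefix: list[int], steps: int = 10) -> list[int]:
    """Interleave observation tokens and predicted action tokens."""
    sequence = list(episode_prefix)
    for _ in range(steps):
        next_token = predict_next_token(sequence)
        if next_token in ACTION_TOKENS:
            observation_tokens = env.step(next_token)   # hypothetical env returns new tokens
            sequence.append(next_token)
            sequence.extend(observation_tokens)
        else:
            sequence.append(next_token)                 # e.g. a word, if chatting
    return sequence
```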
00:27:20.060 | Now, it is more than a language model
00:27:24.140 | because even though you can chat with Gato,
00:27:26.780 | like you can chat with Chinchilla or Flamingo,
00:27:29.540 | it also is an agent, right?
00:27:33.220 | So that's why we call it the 'A' of Gato,
00:27:37.220 | like the letter A, and also it's general.
00:27:41.380 | It's not an agent that's been trained
00:27:43.260 | to be good at only StarCraft or only Atari or only Go.
00:27:47.900 | It's been trained on a vast variety of datasets.
00:27:51.660 | - What makes it an agent, if I may interrupt?
00:27:53.860 | The fact that it can generate actions?
00:27:56.020 | - Yes, so when we call it,
00:27:58.180 | I mean, it's a good question, right?
00:28:00.100 | When do we call a model?
00:28:02.780 | I mean, everything is a model,
00:28:03.860 | but what is an agent in my view is indeed
00:28:06.740 | the capacity to take actions in an environment
00:28:09.700 | that you then send to it
00:28:11.660 | and then the environment might return
00:28:13.500 | with a new observation
00:28:15.040 | and then you generate the next action and so on.
00:28:17.660 | - This actually, this reminds me of the question
00:28:20.420 | from the side of biology, what is life?
00:28:23.000 | Which is actually a very difficult question as well.
00:28:25.380 | What is living?
00:28:26.780 | What is living when you think about life here
00:28:29.460 | on this planet Earth?
00:28:31.000 | And a question interesting to me about aliens,
00:28:33.420 | what is life when we visit another planet?
00:28:35.720 | Would we be able to recognize it?
00:28:37.220 | And this feels like, it sounds perhaps silly,
00:28:40.220 | but I don't think it is.
00:28:41.380 | At which point is the neural network a being versus a tool?
00:28:46.380 | And it feels like action, ability to modify its environment,
00:28:52.400 | is that fundamental leap.
00:28:54.540 | - Yeah, I think it certainly feels like action
00:28:57.420 | is a necessary condition to be more alive,
00:29:01.920 | but probably not sufficient either.
00:29:04.380 | So sadly--
00:29:05.220 | - It's a soul consciousness thing, whatever.
00:29:06.880 | - Yeah, yeah, we can get back to that later.
00:29:09.060 | But anyways, going back to the meow and the Gato, right?
00:29:12.300 | So one of the leaps forward
00:29:16.100 | and what took the team a lot of effort and time was,
00:29:19.100 | as you were asking, how has Gato been trained?
00:29:23.100 | So I told you Gato is this transformer neural network,
00:29:26.060 | models actions, sequences of actions, words, et cetera.
00:29:30.580 | And then the way we train it is by essentially
00:29:34.820 | pulling data sets of observations, right?
00:29:39.380 | So it's a massive imitation learning algorithm
00:29:42.620 | that it imitates obviously to what is the next word
00:29:46.300 | that comes next from the usual data sets we use before,
00:29:49.860 | right?
00:29:50.700 | So these are these web scale style data sets
00:29:52.980 | of people writing on the web or chatting or whatnot, right?
00:29:57.980 | So that's an obvious source that we use
00:30:00.480 | on all language work.
00:30:02.020 | But then we also took a lot of agents
00:30:05.620 | that we have at DeepMind.
00:30:06.700 | I mean, as you know, DeepMind,
00:30:08.160 | we're quite interested in reinforcement learning
00:30:13.580 | and learning agents that play in different environments.
00:30:16.940 | So we kind of created a data set of these trajectories
00:30:20.740 | as we call them or agent experiences.
00:30:23.020 | So in a way, there are other agents we train
00:30:25.660 | for a single mind purpose to, let's say,
00:30:28.420 | control a 3D game environment and navigate a maze.
00:30:33.340 | So we had all the experience that was created
00:30:36.060 | through the one agent interacting with that environment.
00:30:39.560 | And we added these to the data sets, right?
00:30:41.860 | And as I said, we just see all the data,
00:30:44.380 | all these sequences of words or sequences of these agent
00:30:47.500 | interacting with that environment
00:30:49.700 | or agents playing Atari and so on.
00:30:52.180 | We see this as the same kind of data.
00:30:54.860 | And so we mix these data sets together and we train Gato.
00:30:59.220 | That's the G part, right?
00:31:01.580 | It's general because it really has mixed,
00:31:05.220 | it doesn't have different brains for each modality
00:31:07.520 | or each narrow task.
00:31:09.060 | It has a single brain.
00:31:10.500 | It's not that big of a brain compared to most
00:31:12.700 | of the neural networks we see these days.
00:31:14.780 | It has 1 billion parameters.
00:31:17.140 | Some models we're seeing getting the trillions these days
00:31:21.100 | and certainly 100 billion feels like a size
00:31:25.060 | that is very common when you train these jobs.
00:31:28.980 | So the actual agent is relatively small,
00:31:32.660 | but it's been trained on a very challenging,
00:31:35.020 | diverse data set, not only containing all of internet,
00:31:37.980 | but containing all these agent experience
00:31:40.380 | playing very different distinct environments.
00:31:43.140 | So this brings us to the part of the tweet of,
00:31:46.420 | this is not the end, it's the beginning.
00:31:48.900 | It feels very cool to see Gato in principle
00:31:53.100 | is able to control any sort of environments
00:31:56.620 | that especially the ones that it's been trained to do,
00:31:59.140 | these 3D games, Atari games,
00:32:01.100 | all sorts of robotics tasks and so on.
00:32:04.620 | But obviously it's not as proficient as the teachers
00:32:08.960 | it learned from on these environments.
00:32:09.800 | - Is that why it's not obvious?
00:32:11.740 | It's not obvious that it wouldn't be more proficient.
00:32:15.100 | It's just the current beginning part
00:32:18.040 | is that the performance is such that it's not as good
00:32:21.780 | as if it's specialized to that task.
00:32:23.460 | - Right, so it's not as good,
00:32:25.820 | although I would argue size matters here.
00:32:28.060 | So the fact that--
00:32:29.180 | - I would argue size always matters.
00:32:31.220 | - Yeah, okay. - That's a different conversation.
00:32:33.420 | - But for neural networks, certainly size does matter.
00:32:36.260 | So it's the beginning because it's relatively small.
00:32:39.660 | So obviously scaling this idea up
00:32:42.620 | might make the connections that exist
00:32:46.540 | between text on the internet and playing Atari and so on
00:32:50.740 | more synergistic with one another.
00:32:53.340 | And you might gain.
00:32:54.260 | And that, at the moment, we didn't quite see,
00:32:56.360 | but obviously that's why it's the beginning.
00:32:58.660 | - That synergy might emerge with scale.
00:33:00.980 | - Right, might emerge with scale.
00:33:02.140 | And also I believe there's some new research
00:33:04.420 | or ways in which you prepare the data
00:33:07.620 | that you might need to sort of make it more clear
00:33:10.940 | to the model that you're not only playing Atari
00:33:14.180 | and it's just, you start from a screen
00:33:16.360 | and here is up and a screen and down.
00:33:18.400 | Maybe you can think of playing Atari
00:33:20.660 | as there's some sort of context that is needed for the agent
00:33:23.900 | before it starts seeing, oh, this is an Atari screen,
00:33:26.900 | I'm gonna start playing.
00:33:28.640 | You might require, for instance, to be told in words,
00:33:33.420 | hey, in this sequence that I'm showing,
00:33:36.860 | you're gonna be playing an Atari game.
00:33:39.100 | So text might actually be a good driver
00:33:41.980 | to enhance the data.
00:33:44.460 | So then these connections might be made more easily.
00:33:47.220 | That's an idea that we start seeing in language,
00:33:51.240 | but obviously beyond language this is gonna be effective.
00:33:55.180 | It's not like I don't show you a screen
00:33:57.460 | and you from scratch, you're supposed to learn a game.
00:34:01.000 | There is a lot of context we might set.
00:34:03.380 | So there might be some work needed as well
00:34:05.860 | to set that context.
00:34:07.780 | But anyways, there's a lot of work.
00:34:10.420 | - So that context puts all the different modalities
00:34:13.540 | on the same level ground.
00:34:14.980 | - Exactly. - If you provide
00:34:15.820 | the context best.
00:34:16.660 | So maybe on that point,
00:34:18.980 | so there's this task which may not seem trivial
00:34:23.100 | of tokenizing the data, of converting the data into pieces,
00:34:28.100 | into basic atomic elements
00:34:31.300 | that then could cross modalities somehow.
00:34:35.300 | So what's tokenization?
00:34:37.900 | How do you tokenize text?
00:34:39.680 | How do you tokenize images?
00:34:42.180 | How do you tokenize games and actions and robotics tasks?
00:34:47.060 | - Yeah, that's a great question.
00:34:48.220 | So tokenization is the entry point
00:34:52.820 | to actually make all the data look like a sequence
00:34:55.580 | because tokens then are just kind of
00:34:57.660 | these little puzzle pieces.
00:34:59.500 | We break down anything into these puzzle pieces
00:35:01.740 | and then we just model,
00:35:03.460 | what's this puzzle look like, right?
00:35:05.340 | When you make it lay down in a line,
00:35:07.700 | so to speak, in a sequence.
00:35:09.500 | So in Gato, the text, there's a lot of work.
00:35:14.500 | You tokenize text usually by looking
00:35:17.340 | at commonly used substrings, right?
00:35:20.020 | So there's, you know, ing in English
00:35:22.500 | is a very common substring.
00:35:23.660 | So that becomes a token.
00:35:25.500 | There's quite well studied problem on tokenizing text
00:35:29.060 | and Gato just use the standard techniques
00:35:31.580 | that have been developed from many years,
00:35:34.300 | even starting from n-gram models in the 1950s and so on.
00:35:37.940 | - Just for context, how many tokens,
00:35:40.180 | like what order of magnitude,
00:35:41.780 | number of tokens is required for a word?
00:35:44.460 | - Yeah. - Usually.
00:35:45.300 | What are we talking about?
00:35:46.180 | - Yeah, for a word in English, right?
00:35:48.620 | I mean, every language is very different.
00:35:51.100 | The current level or granularity of tokenization
00:35:53.900 | generally means it's maybe two to five.
00:35:57.780 | I mean, I don't know the statistics exactly,
00:36:00.140 | but to give you an idea,
00:36:02.100 | we don't tokenize at the level of letters,
00:36:04.100 | then it would probably be like,
00:36:05.460 | I don't know what the average length of a word
00:36:07.500 | is in English, but that would be, you know,
00:36:09.220 | the minimum set of tokens you could use.
00:36:11.380 | - So it's bigger than letters, smaller than words.
00:36:13.180 | - Yes, yes.
00:36:14.020 | And you could think of very, very common words like the,
00:36:16.860 | I mean, that would be a single token,
00:36:18.780 | but very quickly you're talking two, three, four,
00:36:21.500 | four tokens or so.
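For intuition, here is a toy greedy longest-match subword tokenizer over a tiny hand-written vocabulary; it is nothing like the real vocabularies used in these models, but it shows why a word usually becomes a handful of tokens rather than one per letter:

```python
# Toy subword tokenizer: greedy longest-match against a tiny made-up vocabulary.
VOCAB = {"play": 1, "ing": 2, "the": 3, "inter": 4, "view": 5, "er": 6,
         "p": 7, "l": 8, "a": 9, "y": 10, "i": 11, "n": 12, "g": 13}

def tokenize(word: str) -> list[int]:
    tokens, i = [], 0
    while i < len(word):
        # Try the longest matching vocabulary entry starting at position i.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in VOCAB:
                tokens.append(VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i:]!r}")
    return tokens

print(tokenize("playing"))      # [1, 2]    -> "play" + "ing"
print(tokenize("interviewer"))  # [4, 5, 6] -> "inter" + "view" + "er"
```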
00:36:22.340 | - Have you ever tried to tokenize emojis?
00:36:24.740 | - Emojis are actually just sequences of letters.
00:36:29.420 | So- - Maybe to you,
00:36:30.940 | but to me, they mean so much more.
00:36:32.980 | - Yeah, you can render the emoji,
00:36:34.380 | but you might, if you actually just-
00:36:36.780 | - Yeah, this is a philosophical question.
00:36:38.940 | Is emojis an image or a text?
00:36:43.300 | - The way we do these things is they're actually mapped
00:36:46.900 | to small sequences of characters.
00:36:49.540 | So you can actually play with these models
00:36:52.580 | and input emojis, it will output emojis back,
00:36:55.780 | which is actually quite a fun exercise.
00:36:57.900 | You probably can find other tweets about these out there.
00:37:02.300 | But yeah, so anyways, text, there's like,
00:37:04.460 | it's very clear how this is done.
00:37:06.780 | And then in Gato, what we did for images
00:37:10.620 | is we map images to essentially,
00:37:13.780 | we compressed images, so to speak,
00:37:15.460 | into something that looks more like,
00:37:17.460 | less like every pixel with every intensity
00:37:21.300 | that would mean we have a very long sequence, right?
00:37:23.820 | Like if we were talking about 100 by 100 pixel images,
00:37:27.300 | that would make the sequences far too long.
00:37:29.940 | So what was done there is you just use a technique
00:37:33.340 | that essentially compresses an image
00:37:35.860 | into maybe 16 by 16 patches of pixels.
00:37:40.180 | And then that is mapped, again, tokenized.
00:37:42.740 | You just essentially quantize this space
00:37:45.380 | into a special word that actually maps
00:37:49.020 | to this little sequence of pixels.
00:37:51.820 | And then you put the pixels together in some raster order,
00:37:55.140 | and then that's how you get out
00:37:57.820 | or in the image that you're processing.
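A rough sketch of that image path, assuming a hypothetical learned codebook: cut the image into 16x16 patches in raster order and replace each patch with the ID of its nearest codebook entry, so a whole image becomes a short sequence of integers:

```python
import numpy as np

PATCH = 16
CODEBOOK = np.random.rand(1024, PATCH * PATCH * 3)  # hypothetical learned patch codebook

def tokenize_image(image: np.ndarray) -> list[int]:
    """image: (H, W, 3) float array with H and W multiples of 16."""
    h, w, _ = image.shape
    tokens = []
    for y in range(0, h, PATCH):            # raster order: left-to-right, top-to-bottom
        for x in range(0, w, PATCH):
            patch = image[y:y + PATCH, x:x + PATCH].reshape(-1)
            # Quantize: index of the closest codebook vector (lossy compression).
            idx = int(np.argmin(np.linalg.norm(CODEBOOK - patch, axis=1)))
            tokens.append(idx)
    return tokens

img = np.random.rand(96, 96, 3)
print(len(tokenize_image(img)))  # 36 tokens for a 96x96 image (6 x 6 patches)
```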
00:38:00.820 | - But there's no semantic aspect to that.
00:38:04.060 | So you're doing some kind of,
00:38:05.860 | you don't need to understand anything about the image
00:38:07.780 | in order to tokenize it currently.
00:38:09.660 | - No, you're only using this notion of compression.
00:38:12.620 | So you're trying to find common,
00:38:15.100 | it's like JPG or all these algorithms,
00:38:17.660 | it's actually very similar at the tokenization level.
00:38:20.540 | All we're doing is finding common patterns
00:38:23.340 | and then making sure in a lossy way,
00:38:25.860 | we compress these images,
00:38:27.260 | given the statistics of the images
00:38:29.540 | that are contained in all the data we deal with.
00:38:31.860 | - Although you could probably argue that JPG
00:38:34.220 | does have some understanding of images.
00:38:36.660 | Because visual information, maybe color,
00:38:42.940 | compressing crudely based on color
00:38:46.980 | does capture something important about an image
00:38:51.180 | that's about its meaning, not just about some statistics.
00:38:54.620 | - Yeah, I mean, JPG, as I said,
00:38:56.660 | the algorithms look actually very similar to,
00:38:59.420 | they use the cosine transform in JPG.
00:39:02.820 | The approach we usually do in machine learning
00:39:07.100 | when we deal with images and we do this quantization step
00:39:10.140 | is a bit more data-driven.
00:39:11.380 | So rather than have some sort of Fourier basis
00:39:14.140 | for how frequencies appear in the natural world,
00:39:18.900 | we actually just use the statistics of the images
00:39:23.820 | and then quantize them based on the statistics,
00:39:26.980 | much like you do in words, right?
00:39:28.300 | So common substrings are allocated a token,
00:39:32.420 | and images is very similar.
00:39:34.420 | But there's no connection, the token space,
00:39:38.260 | if you think of, oh, like the tokens are an integer
00:39:41.060 | and in the end of the day.
00:39:42.420 | So now like we work on, maybe we have about,
00:39:46.180 | let's say, I don't know the exact numbers,
00:39:47.980 | but let's say 10,000 tokens for text, right?
00:39:51.180 | Certainly more than characters
00:39:52.820 | because we have groups of characters and so on.
00:39:55.340 | So from one to 10,000, those are representing
00:39:58.300 | all the language and the words we'll see.
00:40:00.980 | And then images occupy the next set of integers.
00:40:04.180 | So they're completely independent, right?
00:40:05.820 | So from 10,001 to 20,000, those are the tokens
00:40:09.860 | that represent these other modality images.
00:40:12.780 | And that is an interesting aspect
00:40:16.940 | that makes it orthogonal.
00:40:18.660 | So what connects these concepts is the data, right?
00:40:21.620 | Once you have a data set, for instance,
00:40:24.460 | that captions images, that tells you,
00:40:26.900 | oh, this is someone playing a Frisbee on a green field.
00:40:30.500 | Now the model will need to predict the tokens
00:40:34.580 | from the text green field to then the pixels,
00:40:37.780 | and that will start making the connections
00:40:39.740 | between the tokens.
00:40:40.580 | So these connections happen as the algorithm learns.
00:40:43.620 | And then the last, if we think of these integers,
00:40:45.820 | the first few are words, the next few are images.
00:40:48.740 | In Gato, we also allocated the highest order of integers
00:40:53.740 | to actions, right?
00:40:56.260 | Which we discretize and actions are very diverse, right?
00:40:59.940 | In Atari, there's, I don't know, maybe 17 discrete actions.
00:41:04.100 | In robotics, actions might be torques
00:41:06.940 | and forces that we apply.
00:41:08.220 | So we just use kind of similar ideas
00:41:11.180 | to compress these actions into tokens.
00:41:14.300 | And then we just, that's how we map now all the space
00:41:18.660 | to these sequence of integers.
00:41:20.780 | But they occupy different space,
00:41:22.420 | and what connects them is then the learning algorithm.
00:41:24.820 | That's where the magic happens.
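A small sketch of that offset scheme (the boundaries here are illustrative, not Gato's actual ones): each modality gets its own disjoint slice of the integer token space, so an image token can never collide with a text token, and everything feeds one shared model:

```python
# Illustrative modality offsets; the real vocabulary sizes differ.
TEXT_VOCAB_SIZE   = 10_000
IMAGE_VOCAB_SIZE  = 10_000
ACTION_VOCAB_SIZE = 1_024

TEXT_OFFSET   = 0
IMAGE_OFFSET  = TEXT_OFFSET + TEXT_VOCAB_SIZE        # 10_000
ACTION_OFFSET = IMAGE_OFFSET + IMAGE_VOCAB_SIZE      # 20_000
TOTAL_VOCAB   = ACTION_OFFSET + ACTION_VOCAB_SIZE    # one shared embedding table

def text_token(i: int) -> int:   return TEXT_OFFSET + i
def image_token(i: int) -> int:  return IMAGE_OFFSET + i
def action_token(i: int) -> int: return ACTION_OFFSET + i

# One flat sequence mixing modalities, ready for a single transformer:
sequence = [text_token(42), image_token(7), image_token(311), action_token(3)]
print(sequence)  # [42, 10007, 10311, 20003]
```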
00:41:26.260 | - So the modalities are orthogonal
00:41:28.780 | to each other in token space.
00:41:30.300 | - Right, right.
00:41:31.140 | - So in the input, everything you add,
00:41:33.620 | you add extra tokens.
00:41:35.220 | - Right.
00:41:36.060 | - And then you're shoving all of that into one place.
00:41:40.420 | - Yes, the transformer.
00:41:41.620 | - And that transformer,
00:41:42.740 | that transformer tries to look at this gigantic token space
00:41:47.740 | and tries to form some kind of representation,
00:41:52.220 | some kind of unique wisdom
00:41:56.740 | about all of these different modalities.
00:41:59.220 | How's that possible?
00:42:02.100 | If you were to sort of put your psychoanalysis hat on
00:42:06.500 | and try to psychoanalyze this neural network,
00:42:09.380 | is it schizophrenic?
00:42:11.740 | Does it try to, given this very few weights,
00:42:16.740 | represent multiple disjoint things
00:42:19.540 | and somehow have them not interfere with each other?
00:42:22.780 | Or is it somehow building on the joint strength,
00:42:27.780 | on whatever is common to all the different modalities?
00:42:31.700 | If you were to ask a question, is it schizophrenic
00:42:35.580 | or is it of one mind?
00:42:38.660 | - I mean, it is one mind,
00:42:41.020 | and it's actually the simplest algorithm,
00:42:44.340 | which that's kind of in a way how it feels
00:42:47.420 | like the field hasn't changed since backpropagation
00:42:51.660 | and gradient descent was proposed
00:42:53.620 | for learning neural networks.
00:42:55.700 | So there is obviously details on the architecture.
00:42:58.660 | This has evolved.
00:42:59.580 | The current iteration is still the transformer,
00:43:03.020 | which is a powerful sequence modeling architecture.
00:43:07.380 | But then the goal of setting these weights
00:43:12.220 | to predict the data is essentially the same
00:43:15.460 | as basically I could describe,
00:43:17.180 | I mean, we described a few years ago,
00:43:18.620 | AlphaStar, language modeling, and so on, right?
00:43:21.540 | We take, let's say, an Atari game.
00:43:24.540 | We map it to a string of numbers
00:43:27.580 | that will all be probably image space
00:43:30.300 | and action space interleaved.
00:43:32.380 | And all we're gonna do is say,
00:43:34.060 | okay, given the numbers,
00:43:37.260 | you know, 10,001, 10,004, 10,005,
00:43:40.380 | the next number that comes is 20,006,
00:43:43.220 | which is in the action space.
00:43:45.380 | And you're just optimizing these weights
00:43:48.820 | via very simple gradients,
00:43:51.660 | like, you know, mathematically it's almost
00:43:53.500 | the most boring algorithm you could imagine.
00:43:55.860 | We set all the weights so that
00:43:57.780 | given this particular instance,
00:44:00.180 | these weights are set to maximize
00:44:03.180 | the probability of having seen
00:44:05.020 | this particular sequence of integers
00:44:07.260 | for this particular game.
00:44:09.100 | And then the algorithm does this
00:44:11.620 | for many, many, many iterations,
00:44:14.740 | looking at different modalities,
00:44:16.860 | different games, right?
00:44:17.860 | That's the mixture of the dataset we discussed.
00:44:20.460 | So in a way, it's a very simple algorithm.
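To make "set the weights to maximize the probability of having seen this sequence of integers" concrete, here is a minimal next-token training step in PyTorch, with a deliberately tiny stand-in model rather than the real transformer, and made-up vocabulary and batch sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 21_024, 64   # shared token space across modalities (sizes are illustrative)

class TinySequenceModel(nn.Module):
    """Stand-in for the transformer: embeds tokens and predicts the next one."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)   # one shared embedding table
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.out = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):                  # tokens: (batch, time)
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)                      # logits: (batch, time, VOCAB)

model = TinySequenceModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

batch = torch.randint(0, VOCAB, (8, 32))        # mixed-modality token sequences
logits = model(batch[:, :-1])                   # predict token t+1 from tokens <= t
loss = F.cross_entropy(logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1))
loss.backward()                                 # maximizing likelihood = minimizing cross-entropy
opt.step()
```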
00:44:24.020 | And the weights, right, they're all shared, right?
00:44:27.540 | So in terms of, is it focusing on one modality or not,
00:44:30.900 | the intermediate weights that are converting
00:44:33.180 | from these input of integers
00:44:35.140 | to the target integer you're predicting next,
00:44:37.660 | those weights certainly are common.
00:44:40.300 | And then the way the tokenization happens,
00:44:43.380 | there is a special place in the neural network,
00:44:45.820 | which is we map this integer, like number 10,001,
00:44:49.780 | to a vector of real numbers, like real numbers.
00:44:53.700 | We can optimize them with gradient descent, right?
00:44:56.100 | The functions we learn are actually
00:44:58.260 | surprisingly differentiable.
00:44:59.700 | That's why we compute gradients.
00:45:01.700 | So this step is the only one
00:45:03.900 | that this orthogonality dimension applies.
00:45:06.540 | So mapping a certain token for text or image or actions,
00:45:11.540 | each of these tokens gets its own little vector
00:45:15.020 | of real numbers that represents this.
00:45:17.180 | If you look at the field back many years ago,
00:45:19.540 | people were talking about word vectors or word embeddings.
00:45:23.460 | These are the same.
00:45:24.300 | We have word vectors or embeddings.
00:45:25.980 | We have image vector or embeddings
00:45:28.860 | and action vector of embeddings.
00:45:30.900 | And the beauty here is that as you train this model,
00:45:33.900 | if you visualize these little vectors,
00:45:36.660 | it might be that they start aligning
00:45:38.460 | even though they're independent parameters.
00:45:41.100 | They could be anything,
00:45:42.860 | but then it might be that you take the word gato or cat,
00:45:47.460 | which maybe is common enough that it actually
00:45:49.020 | has its own token.
00:45:50.220 | And then you take pixels that have a cat,
00:45:52.380 | and you might start seeing that these vectors
00:45:55.300 | look like they align, right?
00:45:57.420 | So by learning from this vast amount of data,
00:46:00.660 | the model is realizing the potential connections
00:46:03.940 | between these modalities.
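One way to picture that alignment is as a probe over the learned embedding table. Everything below is hypothetical (random vectors, made-up token IDs); it only shows what "the cat word vector and the cat image-patch vector drift together" would mean numerically:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical learned embedding table: one row per token in the shared space.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((21_024, 64))

cat_word_token  = 1_234     # made-up ID for the text token "cat"
cat_patch_token = 10_777    # made-up ID for an image-patch token that often shows cats

# After training, one could probe whether the two vectors have moved toward each other:
print(cosine(embeddings[cat_word_token], embeddings[cat_patch_token]))
```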
00:46:05.660 | Now I will say there will be another way,
00:46:07.860 | at least in part, to not have these different vectors
00:46:12.860 | for each different modality.
00:46:15.500 | For instance, when I tell you about actions in a certain space,
00:46:20.220 | I'm defining actions by words, right?
00:46:22.820 | So you could imagine a world in which I'm not learning
00:46:26.500 | that the action up in Atari is its own number.
00:46:31.180 | The action up in Atari maybe is literally the word
00:46:34.380 | or the sentence up in Atari, right?
00:46:37.300 | And that would mean we now leverage
00:46:39.380 | much more from the language.
00:46:41.020 | This is not what we did here,
00:46:42.500 | but certainly it might make these connections
00:46:45.660 | much easier to learn and also to teach the model
00:46:49.060 | to correct its own actions and so on, right?
00:46:51.260 | So all this to say that gato is indeed the beginning,
00:46:55.860 | that it is a radical idea to do this this way,
00:46:59.420 | but there's probably a lot more to be done
00:47:02.340 | and the results to be more impressive,
00:47:04.420 | not only through scale, but also through some new research
00:47:07.940 | that will come hopefully in the years to come.
00:47:10.460 | - So just to elaborate quickly,
00:47:12.260 | you mean one possible next step
00:47:16.660 | or one of the paths that you might take next
00:47:20.180 | is doing the tokenization fundamentally
00:47:25.180 | as a kind of linguistic communication.
00:47:28.260 | So like you convert even images into language.
00:47:31.340 | So doing something like a crude semantic segmentation,
00:47:35.540 | trying to just assign a bunch of words to an image
00:47:38.340 | that like have almost like a dumb entity
00:47:42.300 | explaining as much as it can about the image.
00:47:45.300 | And so you convert that into words
00:47:46.900 | and then you convert games into words
00:47:49.260 | and then you provide the context in words and all of it.
00:47:53.500 | And eventually getting to a point
00:47:56.300 | where everybody agrees with Noam Chomsky
00:47:58.100 | that language is actually at the core of everything.
00:48:00.940 | That it's the base layer of intelligence and consciousness
00:48:04.980 | and all that kind of stuff, okay.
00:48:07.500 | You mentioned early on like it's hard to grow.
00:48:11.260 | What did you mean by that?
00:48:12.780 | 'Cause we're talking about scale might change.
00:48:15.700 | There might be, and we'll talk about this too,
00:48:18.980 | like there's a emergent,
00:48:22.940 | there's certain things about these neural networks
00:48:25.020 | that are emergent.
00:48:25.860 | So certain like performance we can see only with scale
00:48:28.980 | and there's some kind of threshold of scale.
00:48:30.980 | So why is it hard to grow something like this Meow Network?
00:48:35.980 | - So the Meow Network, it's not hard to grow
00:48:41.140 | if you retrain it.
00:48:42.620 | What's hard is, well, we have now 1 billion parameters.
00:48:46.860 | We train them for a while.
00:48:48.140 | We spend some amount of work
00:48:50.740 | towards building these weights
00:48:53.140 | that are an amazing initial brain
00:48:55.900 | for doing these kind of tasks we care about.
00:48:58.860 | Could we reuse the weights and expand to a larger brain?
00:49:03.860 | And that is extraordinarily hard,
00:49:06.700 | but also exciting from a research perspective
00:49:10.100 | and a practical point of view, right?
00:49:12.580 | So there's this notion of modularity in software engineering
00:49:17.580 | and we're starting to see some examples
00:49:20.500 | and work that leverages modularity.
00:49:23.340 | In fact, if we go back one step from Gato
00:49:26.340 | to a work that, I would say, trained a much larger,
00:49:29.700 | much more capable network called Flamingo.
00:49:32.580 | Flamingo did not deal with actions,
00:49:34.340 | but it definitely dealt with images in an interesting way,
00:49:38.460 | kind of akin to what Gato did,
00:49:40.300 | but slightly different technique for tokenizing,
00:49:43.020 | but we don't need to go into that detail.
00:49:45.420 | But what Flamingo also did, which Gato didn't do,
00:49:49.380 | and that just happens because these projects,
00:49:51.620 | you know, they're different.
00:49:53.580 | You know, it's a bit of like the exploratory nature
00:49:55.900 | of research, which is great.
00:49:57.260 | - The research behind these projects is also modular.
00:50:00.620 | - Yes, exactly.
00:50:01.860 | And it has to be, right?
00:50:02.780 | We need to have creativity
00:50:05.620 | and sometimes you need to protect pockets of, you know,
00:50:08.860 | people, researchers, and so on.
00:50:10.340 | - By we, you mean humans.
00:50:11.860 | - Yes. - Okay.
00:50:12.860 | - And also in particular researchers
00:50:14.620 | and maybe even further, you know,
00:50:16.780 | DeepMind or other such labs.
00:50:18.860 | - And then the neural networks themselves.
00:50:21.020 | So it's modularity all the way down.
00:50:23.380 | - All the way down.
00:50:24.260 | So the way that we did modularity very beautifully
00:50:27.540 | in Flamingo is we took Chinchilla,
00:50:30.140 | which is a language only model,
00:50:32.860 | not an agent if we think of actions
00:50:34.700 | being necessary for agency.
00:50:36.740 | So we took Chinchilla, we took the weights of Chinchilla,
00:50:40.980 | and then we froze them.
00:50:42.820 | We said, "These don't change."
00:50:44.820 | We trained them to be very good at predicting the next word.
00:50:47.580 | It's a very good language model,
00:50:49.460 | state of the art at the time you release it,
00:50:51.260 | et cetera, et cetera.
00:50:52.980 | We're gonna add a capability to see, right?
00:50:55.540 | We are gonna add the ability to see
00:50:56.980 | to this language model.
00:50:58.340 | So we're gonna attach small pieces of neural networks
00:51:01.980 | at the right places in the model.
00:51:03.900 | It's almost like injecting the network
00:51:07.940 | with some weights and some substructures
00:51:10.780 | in the right ways, in a good way, right?
00:51:12.860 | So you need the research to say what is effective,
00:51:15.300 | how do you add this capability
00:51:16.740 | without destroying others, et cetera.
00:51:18.860 | So we created a small sub-network,
00:51:23.500 | initialized not from random,
00:51:25.420 | but actually from self-supervised learning,
00:51:28.820 | that, you know, a model that understands vision in general.
00:51:32.900 | And then we took datasets that connect the two modalities,
00:51:37.340 | vision and language.
00:51:38.820 | And then we froze the main part,
00:51:41.260 | the largest portion of the network, which was Chinchilla,
00:51:43.780 | that is 70 billion parameters.
00:51:46.020 | And then we added a few more parameters on top,
00:51:49.300 | trained from scratch,
00:51:50.580 | and then some others that were pre-trained
00:51:52.700 | from like, with the capacity to see.
00:51:55.340 | Like it was not tokenization in the way I described for Gato,
00:51:58.900 | but it's a similar idea.
00:52:01.500 | And then we trained the whole system.
00:52:03.700 | Parts of it were frozen, parts of it were new.
00:52:06.700 | And all of a sudden we developed Flamingo,
00:52:09.780 | which is an amazing model that is essentially,
00:52:12.700 | I mean, describing it is a chatbot
00:52:15.140 | where you can also upload images
00:52:17.100 | and start conversing about images,
00:52:20.060 | but it's also kind of a dialogue style chatbot.
00:52:23.860 | - So the input is images and text,
00:52:25.900 | and the output is text. - Yes, exactly.
00:52:28.060 | And- - How many parameters?
00:52:29.500 | You said 70 billion for Chinchilla.
00:52:31.940 | - Yeah, Chinchilla is 70 billion.
00:52:33.380 | And then the ones we add on top,
00:52:34.780 | which is kind of almost like a way to overwrite
00:52:39.340 | its little activations so that when it sees vision,
00:52:42.540 | it does kind of a correct computation
00:52:44.700 | of what it's seeing, mapping it back to words, so to speak.
00:52:48.100 | That adds an extra 10 billion parameters, right?
00:52:50.980 | So it's total 80 billion, the largest one we released.
00:52:54.100 | And then you train it on a few data sets
00:52:57.460 | that contain vision and language.
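A minimal sketch of the modular recipe being described, using stand-in module shapes and names rather than the real Flamingo architecture: freeze the pretrained language model, attach a small vision pathway plus a new cross-attention piece, and let the optimizer update only the added parameters.

```python
import torch
import torch.nn as nn

# Stand-in for the frozen pretrained language model ("Chinchilla" in the conversation);
# sizes and modules here are hypothetical.
language_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=6)
for p in language_model.parameters():
    p.requires_grad = False                                    # "these don't change"

vision_encoder = nn.Sequential(                                # stand-in for a pretrained vision model
    nn.Conv2d(3, 64, kernel_size=7, stride=4), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 512))
cross_attention = nn.MultiheadAttention(512, 8, batch_first=True)  # the new piece, trained from scratch

# Only the newly added weights are handed to the optimizer.
opt = torch.optim.Adam([*vision_encoder.parameters(), *cross_attention.parameters()])

def forward(image, text_embeddings):
    img = vision_encoder(image).unsqueeze(1)                   # (B, 1, 512) visual summary
    txt = language_model(text_embeddings)                      # frozen computation over the text
    fused, _ = cross_attention(txt, img, img)                  # text positions query the image
    return fused
```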
00:52:59.460 | And once you interact with the model,
00:53:01.260 | you start seeing that you can upload an image
00:53:04.340 | and start sort of having a dialogue about the image,
00:53:08.100 | which is actually not something,
00:53:09.580 | it's very similar and akin to what we saw
00:53:11.900 | in language only, these prompting abilities that it has.
00:53:15.380 | You can teach it a new vision task, right?
00:53:17.860 | It does things beyond the capabilities that, in theory,
00:53:21.620 | the data sets provided in themselves,
00:53:24.660 | but because it leverages a lot of the language knowledge
00:53:27.260 | acquired from Chinchilla,
00:53:29.020 | it actually has this few-shot learning ability
00:53:31.900 | and these emerging abilities that we didn't even measure
00:53:34.780 | once we were developing the model.
00:53:36.580 | But once developed, then as you play with the interface,
00:53:40.220 | you can start seeing, wow, okay, yeah,
00:53:41.820 | it's cool, we can upload, I think,
00:53:44.300 | one of the tweets talking about Twitter
00:53:45.940 | was this image of Obama that is placing a weight
00:53:49.980 | and someone is kind of weighing themselves
00:53:52.540 | and it's kind of a joke-style image.
00:53:55.060 | And it's notable because I think Andrej Karpathy
00:53:58.020 | a few years ago said, "No computer vision system
00:54:00.860 | "can understand the subtlety of this joke in this image,
00:54:04.780 | "all the things that go on."
00:54:06.500 | And so what we try to do, and it's very anecdotally,
00:54:09.740 | I mean, this is not a proof that we solved this issue,
00:54:12.300 | but it just shows that you can upload now this image
00:54:15.860 | and start conversing with the model,
00:54:17.700 | trying to make out if it gets that there's a joke
00:54:21.500 | because the person weighing themselves
00:54:23.100 | doesn't see that someone behind is making the weight higher
00:54:26.820 | and so on and so forth.
00:54:27.980 | So it's a fascinating capability
00:54:30.020 | and it comes from this key idea of modularity
00:54:33.380 | where we took a frozen brain
00:54:34.940 | and we just added a new capability.
00:54:37.900 | So the question is, should we,
00:54:40.740 | so in a way you can see even from DeepMind,
00:54:42.860 | we have Flamingo that used this modular approach
00:54:46.420 | and thus could leverage scale a bit more reasonably
00:54:49.180 | because we didn't need to retrain a system from scratch.
00:54:52.340 | And on the other hand, we had Gato,
00:54:54.180 | which used the same datasets,
00:54:55.940 | but then it trained it from scratch, right?
00:54:57.500 | And so I guess big question for the community is,
00:55:01.660 | should we train from scratch or should we embrace modularity?
00:55:04.780 | And this goes back to modularity as a way to grow,
00:55:09.780 | but reuse seems like natural
00:55:12.140 | and it was very effective, certainly.
00:55:15.020 | - The next question is, if you go the way of modularity,
00:55:19.060 | is there a systematic way of freezing weights
00:55:22.780 | and joining different modalities across,
00:55:27.100 | you know, not just two or three or four networks,
00:55:29.300 | but hundreds of networks
00:55:30.620 | from all different kinds of places,
00:55:32.420 | maybe open source network that looks at weather patterns
00:55:36.420 | and you shove that in somehow,
00:55:38.020 | and then you have networks that, I don't know,
00:55:40.500 | do all kinds of, play StarCraft
00:55:42.140 | and play all the other video games,
00:55:44.100 | and you can keep adding them in without significant effort,
00:55:49.100 | like maybe the effort scales linearly
00:55:52.540 | or something like that,
00:55:53.380 | as opposed to like the more network you add,
00:55:55.020 | the more you have to worry about the instabilities created.
00:55:57.980 | - Yeah, so that vision is beautiful.
00:56:00.020 | I think there's still the question
00:56:03.580 | about within single modalities, like Chinchilla was reused,
00:56:06.900 | but now if we train a next iteration of language models,
00:56:10.260 | are we gonna use Chinchilla or not?
00:56:11.900 | - Yeah, how do you swap out Chinchilla?
00:56:13.220 | - Right, so there's still big questions,
00:56:15.980 | but that idea is actually really akin
00:56:18.420 | to software engineering,
00:56:19.420 | which we're not re-implementing,
00:56:21.140 | you know, libraries from scratch,
00:56:22.420 | we're reusing and then building ever more amazing things,
00:56:25.460 | including neural networks with software that we're reusing.
00:56:29.060 | So I think this idea of modularity, I like it,
00:56:32.260 | I think it's here to stay,
00:56:33.980 | and that's also why I mentioned
00:56:35.980 | it's just the beginning, not the end.
00:56:38.300 | - You've mentioned meta-learning,
00:56:39.500 | so given this promise of Gato,
00:56:42.900 | can we try to redefine this term
00:56:46.100 | that's almost akin to consciousness,
00:56:47.700 | because it means different things to different people
00:56:50.260 | throughout the history of artificial intelligence,
00:56:52.500 | but what do you think meta-learning is and looks like
00:56:58.220 | now in the five years, 10 years,
00:57:00.140 | will it look like system like Gato, but scaled?
00:57:03.300 | What's your sense of,
00:57:04.260 | what does meta-learning look like, do you think,
00:57:08.380 | with all the wisdom we've learned so far?
00:57:10.580 | - Yeah, great question,
00:57:11.660 | maybe it's good to give another data point
00:57:14.620 | looking backwards rather than forward.
00:57:16.300 | So when we talked in 2019,
00:57:20.660 | meta-learning meant something that has changed
00:57:26.620 | mostly through the revolution of GPT-3 and beyond.
00:57:31.260 | So what meta-learning meant at the time
00:57:34.060 | was driven by what benchmarks people care about
00:57:37.780 | in meta-learning,
00:57:38.940 | and the benchmarks were about
00:57:40.740 | a capability to learn about object identities,
00:57:45.100 | so it was very much over-fitted
00:57:47.500 | to vision and object classification,
00:57:50.460 | and the part that was meta about that was that,
00:57:53.020 | oh, we're not just learning a thousand categories
00:57:55.420 | that ImageNet tells us to learn,
00:57:57.140 | we're gonna learn object categories that can be defined
00:58:00.580 | when we interact with the model.
00:58:03.380 | So it's interesting to see the evolution, right?
00:58:06.740 | The way this started was we have a special language
00:58:10.860 | that was a data set, a small data set
00:58:13.340 | that we prompted the model with,
00:58:15.380 | saying, hey, here is a new classification task,
00:58:18.900 | I'll give you one image and the name,
00:58:21.860 | which was an integer at the time of the image,
00:58:24.460 | and a different image, and so on.
00:58:26.060 | So you have a small prompt in the form of a data set,
00:58:30.100 | a machine learning data set,
00:58:31.700 | and then you got then a system that could then predict
00:58:35.580 | or classify these objects
00:58:37.020 | that you just defined kind of on the fly.
00:58:39.420 | So fast forward,
00:58:43.220 | it was revealed that language models are few-shot learners,
00:58:47.500 | that's the title of the paper, so very good title.
00:58:50.140 | Sometimes titles are really good,
00:58:51.580 | so this one is really, really good,
00:58:53.580 | because that's the point of GPT-3,
00:58:56.220 | that showed that, look, sure,
00:58:58.820 | we can focus on object classification
00:59:00.980 | and what meta-learning means
00:59:02.580 | within the space of learning object categories,
00:59:05.460 | this goes beyond, or before, rather,
00:59:07.460 | to also Omniglot, before ImageNet, and so on.
00:59:10.060 | So there's a few benchmarks.
00:59:11.500 | To now, all of a sudden,
00:59:13.020 | we're a bit unlocked from benchmarks,
00:59:15.220 | and through language, we can define tasks, right?
00:59:17.900 | So we're literally telling the model some logical task
00:59:21.580 | or little thing that we wanted to do.
00:59:23.860 | We prompt it much like we did before,
00:59:25.900 | but now we prompt it through natural language.
00:59:28.460 | And then, not perfectly,
00:59:30.420 | I mean, these models have failure modes, and that's fine,
00:59:33.180 | but these models then are now doing a new task, right?
00:59:37.140 | So they meta-learn this new capability.
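As a small illustration of defining a task purely through a prompt, in the spirit of the few-shot learning just described; the rule and the word-to-number mapping below are entirely made up.

```python
# A hypothetical prompt in the spirit of "Language Models are Few-Shot Learners":
# the "training set" for the new task lives entirely inside the prefix, and no weights are updated.
prompt = """Map each made-up phrase to a number.
blip -> 3
blorp -> 7
blip blip -> 6
blorp blip ->"""
# A sufficiently capable language model, asked to continue this text, will often answer "10",
# having meta-learned the rule (add the values) from the examples alone.
```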
00:59:40.460 | Now, that's where we are now.
00:59:43.380 | Flamingo expanded this to visual and language,
00:59:47.220 | but it basically has the same abilities.
00:59:49.300 | You can teach it, for instance,
00:59:51.540 | an emergent property was that
00:59:53.260 | you can take pictures of numbers
00:59:55.260 | and then do arithmetic with the numbers
00:59:57.780 | just by teaching it, "Oh, that's,
00:59:59.900 | "when I show you three plus six,
01:00:01.980 | "I want you to output nine,
01:00:03.620 | "and you show it a few examples, and now it does that."
01:00:06.660 | So it went way beyond this ImageNet
01:00:10.180 | sort of categorization of images
01:00:12.620 | that we were a bit stuck, maybe,
01:00:14.140 | before this revelation moment that happened in,
01:00:19.020 | I believe it was 2019, but it was after we chatted.
01:00:21.860 | - And that way it has solved meta-learning
01:00:24.260 | as was previously defined.
01:00:26.020 | - Yes, it expanded what it meant.
01:00:27.700 | So that's what you say, what does it mean?
01:00:29.460 | So it's an evolving term.
01:00:31.300 | But here is maybe now looking forward,
01:00:35.140 | looking at what's happening,
01:00:37.540 | obviously in the community with more modalities,
01:00:41.340 | what we can expect.
01:00:42.420 | And I would certainly hope to see the following,
01:00:44.900 | and this is a pretty drastic hope,
01:00:48.340 | but in five years, maybe we chat again.
01:00:51.140 | And we have a system, right, a set of weights
01:00:55.860 | that we can teach it to play StarCraft.
01:00:59.780 | Maybe not at the level of AlphaStar,
01:01:01.420 | but play StarCraft, a complex game.
01:01:03.620 | We teach it through interactions to prompting.
01:01:06.860 | You can certainly prompt a system,
01:01:08.460 | that's what Gato shows, to play some simple Atari games.
01:01:11.700 | So imagine if you start talking to a system,
01:01:15.300 | teaching it a new game,
01:01:16.780 | showing it examples of, in this particular game,
01:01:20.940 | this user did something good.
01:01:22.740 | Maybe the system can even play and ask you questions,
01:01:25.420 | say, "Hey, I played this game.
01:01:26.940 | I just played this game.
01:01:27.860 | Did I do well?
01:01:29.060 | Can you teach me more?"
01:01:30.420 | So five, maybe to 10 years, these capabilities,
01:01:34.780 | or what meta-learning means,
01:01:36.180 | will be much more interactive, much more rich,
01:01:38.860 | and through domains that we were specializing, right?
01:01:41.620 | So you see the difference, right?
01:01:42.900 | We built AlphaStar specialized to play StarCraft.
01:01:46.980 | The algorithms were general,
01:01:48.220 | but the weights were specialized.
01:01:50.420 | And what we're hoping is that we can teach a network
01:01:54.180 | to play games, to play any game,
01:01:56.580 | just using games as an example,
01:01:58.580 | through interacting with it, teaching it,
01:02:01.500 | uploading the Wikipedia page of StarCraft.
01:02:03.740 | Like this is on the horizon,
01:02:06.100 | and obviously there are details that need to be filled in,
01:02:09.340 | and research that needs to be done.
01:02:10.940 | But that's how I see meta-learning evolving,
01:02:13.220 | which is gonna be beyond prompting.
01:02:15.380 | It's gonna be a bit more interactive.
01:02:17.060 | It's gonna, you know, the system might tell us
01:02:19.820 | to give it feedback after it maybe makes mistakes
01:02:22.340 | or it loses a game, but it's nonetheless very exciting
01:02:26.260 | because if you think about this this way,
01:02:29.020 | the benchmarks are already there.
01:02:30.620 | We just repurpose the benchmarks, right?
01:02:33.180 | So in a way, I like to map the space
01:02:36.980 | of what maybe AGI means to say,
01:02:40.340 | okay, like we went 101% performance in Go,
01:02:45.340 | in Chess, in StarCraft.
01:02:47.860 | The next iteration might be 20% performance
01:02:51.900 | across quote unquote all tasks, right?
01:02:54.700 | And even if it's not as good, it's fine.
01:02:56.300 | We actually, we have ways to also measure progress
01:02:59.940 | because we have those special agents,
01:03:01.620 | specialized agents, and so on.
01:03:04.180 | So this is to me very exciting.
01:03:06.220 | And these next iteration models
01:03:09.260 | are definitely hinting at that direction of progress,
01:03:13.380 | which hopefully we can have.
01:03:14.700 | There are obviously some things that could go wrong
01:03:17.580 | in terms of we might not have the tools,
01:03:20.100 | maybe transformers are not enough, then we must,
01:03:22.540 | there's some breakthroughs to come,
01:03:24.300 | which makes the field more exciting
01:03:26.300 | to people like me as well, of course.
01:03:28.620 | But that's, if you ask me five to 10 years,
01:03:32.100 | you might see these models that start to look more
01:03:34.300 | like weights that are already trained.
01:03:36.860 | And then it's more about teaching
01:03:39.540 | or making them meta-learn what you're trying to induce
01:03:42.540 | in terms of tasks and so on.
01:03:46.940 | Well beyond the simple tasks we're now starting to see emerge,
01:03:50.980 | like small arithmetic tasks and so on.
01:03:54.140 | - So a few questions around that.
01:03:55.700 | This is fascinating.
01:03:57.180 | So that kind of teaching, interactive,
01:04:01.420 | so it's beyond prompting,
01:04:02.740 | so it's interacting with the neural network,
01:04:05.180 | that's different than the training process.
01:04:08.380 | So it's different than the optimization
01:04:12.420 | over differentiable functions.
01:04:15.900 | This is already trained and now you're teaching,
01:04:18.620 | I mean, it's almost like akin to the brain,
01:04:24.180 | the neurons are already set with their connections.
01:04:26.900 | On top of that, you're now using that infrastructure
01:04:29.980 | to build up further knowledge.
01:04:32.620 | - Okay, so that's a really interesting distinction
01:04:36.700 | that's actually not obvious
01:04:38.060 | from a software engineering perspective,
01:04:40.340 | that there's a line to be drawn.
01:04:42.820 | 'Cause you always think for a neural network to learn,
01:04:44.900 | it has to be retrained, trained and retrained.
01:04:48.340 | But maybe, and prompting is a way of teaching
01:04:53.220 | a neural network a little bit of context
01:04:55.980 | about whatever the heck you're trying to do.
01:04:58.020 | So you can maybe expand this prompting capability
01:05:00.460 | by making it interact, that's really, really interesting.
01:05:04.220 | - Yeah, by the way, this is not,
01:05:06.380 | if you look at way back at different ways
01:05:09.220 | to tackle even classification tasks,
01:05:11.820 | so this comes from like long standing literature
01:05:16.460 | in machine learning.
01:05:18.260 | What I'm suggesting could sound to some
01:05:20.780 | like a bit like nearest neighbor.
01:05:23.420 | So nearest neighbor is almost the simplest algorithm
01:05:26.100 | that does not require learning.
01:05:30.060 | So it has this interesting,
01:05:31.740 | like you don't need to compute gradients.
01:05:34.340 | And what nearest neighbor does is you quote unquote,
01:05:37.500 | have a data set or upload a data set.
01:05:39.980 | And then all you need to do is a way to measure distance
01:05:43.060 | between points.
01:05:44.780 | And then to classify a new point,
01:05:46.660 | you're just simply computing what's the closest point
01:05:49.220 | in this massive amount of data.
01:05:51.260 | And that's my answer.
01:05:52.700 | So you can think of prompting in a way
01:05:55.500 | as you're uploading, not just simple points
01:05:58.620 | and the metric is not the distance between the images
01:06:02.420 | or something simple,
01:06:03.260 | it's something that you compute that's much more advanced,
01:06:06.020 | but in a way, it's very similar, right?
01:06:08.380 | You simply are uploading some knowledge
01:06:12.620 | to this pre-trained system in nearest neighbor,
01:06:15.060 | maybe the metric is learned or not,
01:06:17.260 | but you don't need to further train it.
01:06:19.460 | And then now you immediately get a classifier out of this.
01:06:23.700 | Now it's just an evolution of that concept,
01:06:25.820 | very classical concept in machine learning,
01:06:27.820 | which is just learning through what's the closest point,
01:06:32.180 | closest by some distance and that's it.
01:06:34.540 | It's an evolution of that.
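For reference, a minimal nearest-neighbor classifier along the lines just described, using plain Euclidean distance as a stand-in for whatever learned metric one might actually use; the 2D points and labels are made up purely for illustration.

```python
import numpy as np

def nearest_neighbor_classify(train_points, train_labels, query, metric=None):
    """Label the query by copying the label of the closest stored point; no gradients needed."""
    if metric is None:
        metric = lambda a, b: np.linalg.norm(a - b)   # Euclidean here; in practice the metric could be learned
    distances = [metric(p, query) for p in train_points]
    return train_labels[int(np.argmin(distances))]

# Tiny usage example with made-up 2D points: "uploading" a dataset and immediately getting a classifier.
points = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
labels = ["cat", "dog"]
print(nearest_neighbor_classify(points, labels, np.array([4.0, 4.5])))  # -> "dog"
```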
01:06:36.100 | And I will say how I saw meta-learning
01:06:39.020 | when we worked on a few ideas in 2016,
01:06:43.900 | was precisely through the lens of nearest neighbor,
01:06:47.220 | which is very common in computer vision community, right?
01:06:49.940 | There's a very active area of research
01:06:52.140 | about how do you compute the distance between two images,
01:06:55.460 | but if you have a good distance metric,
01:06:57.580 | you also have a good classifier, right?
01:06:59.940 | All I'm saying is now these distances
01:07:01.740 | and the points are not just images,
01:07:03.780 | they're like words or sequences of words and images
01:07:08.540 | and actions that teach you something new,
01:07:10.380 | but it might be that technique-wise, those come back.
01:07:14.740 | And I will say that it's not necessarily true
01:07:18.180 | that you might not ever train the weights a bit further.
01:07:21.780 | Some aspect of meta-learning,
01:07:23.900 | some techniques in meta-learning
01:07:26.020 | do actually do a bit of fine tuning, as it's called, right?
01:07:28.900 | They train the weights a little bit
01:07:31.100 | when they get a new task.
01:07:32.820 | So as for the how, or how we're gonna achieve this,
01:07:36.940 | as a deep learner, I'm very skeptical.
01:07:39.820 | We're gonna try a few things,
01:07:41.220 | whether it's a bit of training, adding a few parameters,
01:07:44.180 | thinking of these as nearest neighbor,
01:07:45.940 | or just simply thinking of there's a sequence of words,
01:07:49.180 | it's a prefix, and that's the new classifier.
01:07:52.980 | We'll see, right?
01:07:53.820 | That's the beauty of research,
01:07:55.420 | but what's important is that it is a good goal in itself
01:08:00.140 | that I see as very worthwhile pursuing
01:08:02.740 | for the next stages of not only meta-learning.
01:08:05.700 | I think this is basically what's exciting
01:08:08.460 | about machine learning, period, to me.
01:08:11.380 | - Well, and then the interactive aspect of that
01:08:13.740 | is also very interesting.
01:08:15.140 | - Yes. - The interactive version
01:08:16.380 | of nearest neighbor. (laughs)
01:08:18.420 | - Yeah. - To help you pull out
01:08:20.620 | the classifier from this giant thing.
01:08:23.740 | Okay, is this the way we can go in five, 10 plus years
01:08:28.740 | from any task, sorry, from many tasks to any task?
01:08:36.100 | So, and what does that mean?
01:08:39.420 | What does it need to be actually trained on?
01:08:41.620 | At which point has the network had enough?
01:08:45.460 | So what does a network need to learn about this world
01:08:50.460 | in order to be able to perform any task?
01:08:52.460 | Is it just as simple as language, image, and action?
01:08:57.460 | Or do you need some set of representative images?
01:09:01.820 | Like if you only see land images,
01:09:05.180 | will you know anything about underwater?
01:09:06.700 | Is that somehow fundamentally different?
01:09:08.740 | I don't know.
01:09:09.580 | - Those, I mean, those are awkward questions, I would say.
01:09:12.060 | I mean, the way you put it, let me maybe further your example.
01:09:15.020 | Right, if all you see is land images,
01:09:18.400 | but you're reading all about land and water worlds,
01:09:21.540 | but in books, right, imagine.
01:09:23.900 | Would that be enough?
01:09:25.380 | Good question.
01:09:26.460 | We don't know, but I guess maybe you can join us
01:09:30.380 | if you want in our quest to find this.
01:09:32.100 | That's precisely--
01:09:33.420 | - Water world, yeah.
01:09:34.340 | - Yes, that's precisely, I mean, the beauty of research
01:09:37.620 | and that's the research business we're in, I guess,
01:09:42.620 | is to figure these out and ask the right questions
01:09:46.220 | and then iterate with the whole community,
01:09:49.540 | publishing findings and so on.
01:09:52.420 | But yeah, this is a question.
01:09:55.100 | It's not the only question, but it's certainly, as you ask,
01:09:57.540 | is on my mind constantly, right?
01:10:00.020 | And so we'll need to wait for maybe the,
01:10:03.260 | let's say five years, let's hope it's not 10,
01:10:05.940 | to see what are the answers.
01:10:08.380 | Some people will largely believe in unsupervised
01:10:12.660 | or self-supervised learning of single modalities
01:10:15.460 | and then crossing them.
01:10:18.000 | Some people might think end-to-end learning is the answer.
01:10:21.680 | Modularity is maybe the answer.
01:10:23.780 | So we don't know,
01:10:24.960 | but we're just definitely excited to find out.
01:10:27.520 | - But it feels like this is the right time
01:10:29.280 | and we're at the beginning of this journey.
01:10:31.720 | We're finally ready to do these kind of general,
01:10:34.640 | big models and agents.
01:10:37.600 | What sort of specific technical thing
01:10:42.480 | about Gato, Flamingo, Chinchilla, Gopher,
01:10:47.360 | any of these is especially beautiful,
01:10:49.520 | that was surprising maybe?
01:10:51.640 | Is there something that just jumps out at you?
01:10:54.220 | Of course, there's the general thing of like,
01:10:57.560 | you didn't think it was possible
01:10:58.900 | and then you realize it's possible
01:11:01.700 | in terms of the generalizability across modalities
01:11:04.480 | and all that kind of stuff.
01:11:05.560 | Or maybe how small of a network, relatively speaking,
01:11:08.920 | Gato is, all that kind of stuff.
01:11:10.440 | But is there some weird little things that were surprising?
01:11:15.200 | - Look, I'll give you an answer that's very important
01:11:18.240 | because maybe people don't quite realize this,
01:11:22.600 | but the teams behind these efforts, the actual humans,
01:11:31.720 | that's maybe the surprising bit, in an obviously positive way.
01:11:31.720 | So anytime you see these breakthroughs,
01:11:34.580 | I mean, it's easy to map it to a few people.
01:11:37.160 | There's people that are great at explaining things
01:11:39.220 | and so on, that's very nice.
01:11:40.720 | But maybe the learnings or the meta learnings
01:11:44.680 | that I get as a human about this is,
01:11:47.400 | sure, we can move forward,
01:11:49.060 | but the surprising bit is how important
01:11:55.480 | are all the pieces of these projects,
01:11:58.720 | how do they come together?
01:12:00.040 | So I'll give you maybe some of the ingredients of success
01:12:04.440 | that are common across these,
01:12:06.440 | but not the obvious ones in machine learning.
01:12:08.480 | I can always also give you those.
01:12:11.320 | But basically, engineering is critical.
01:12:16.320 | So very good engineering,
01:12:19.600 | because ultimately we're collecting data sets, right?
01:12:23.760 | So the engineering of data
01:12:26.160 | and then of deploying the models at scale
01:12:29.740 | into some compute cluster, that cannot be overstated,
01:12:32.840 | that is a huge factor of success.
01:12:36.880 | And it's hard to believe that details matter so much.
01:12:41.560 | We would like to believe that it's true
01:12:44.040 | that there is more and more of a standard formula,
01:12:47.440 | as I was saying, like this recipe that works for everything.
01:12:50.560 | But then when you zoom in into each of these projects,
01:12:53.680 | then you realize the devil is indeed in the details.
01:12:57.840 | And then the teams have to work kind of together
01:13:01.520 | towards these goals.
01:13:03.040 | So engineering of data and obviously clusters
01:13:07.520 | and large scale is very important.
01:13:09.280 | And then one that is often not,
01:13:13.120 | maybe nowadays it is more clear,
01:13:15.080 | is benchmark progress, right?
01:13:17.160 | So we're talking here about multiple months
01:13:19.860 | of tens of researchers
01:13:22.120 | and people that are trying to organize the research
01:13:26.160 | and so on, working together.
01:13:28.080 | And you don't know that you can get there.
01:13:32.120 | I mean, this is the beauty.
01:13:34.360 | Like if you're not risking to trying to do something
01:13:37.320 | that feels impossible, you're not gonna get there,
01:13:40.540 | but you need a way to measure progress.
01:13:43.960 | So the benchmarks that you build are critical.
01:13:47.740 | I've seen this beautifully play out in many projects.
01:13:50.520 | I mean, maybe the one I've seen it more consistently,
01:13:53.880 | which means we establish the metric,
01:13:56.840 | actually the community did,
01:13:58.320 | and then we leverage that massively is AlphaFold.
01:14:01.560 | This is a project where the data,
01:14:04.520 | the metrics were all there.
01:14:06.120 | And all it took was, and it's easier said than done,
01:14:09.120 | an amazing team working,
01:14:11.640 | not to try to find some incremental improvement
01:14:14.760 | and publish, which is one way to do research that is valid,
01:14:17.940 | but aim very high and work literally for years
01:14:22.520 | to iterate over that process.
01:14:24.120 | And working for years with the team,
01:14:25.660 | I mean, it is tricky that also happened to happen
01:14:29.800 | partly during a pandemic and so on.
01:14:32.200 | So I think my meta learning from all this is,
01:14:35.280 | the teams are critical to the success.
01:14:37.960 | And then if now going to the machine learning,
01:14:40.200 | the part that's surprising is,
01:14:42.880 | so we like architectures like neural networks,
01:14:48.720 | and I would say this was a very rapidly evolving field
01:14:53.120 | until the transformer came.
01:14:54.960 | So attention might indeed be all you need,
01:14:58.160 | which is the title, also a good title,
01:15:00.280 | although it's only good in hindsight.
01:15:02.280 | I don't think at the time I thought
01:15:03.440 | this is a great title for a paper,
01:15:05.040 | but that architecture is proving
01:15:08.960 | that the dream of modeling sequences of any bytes,
01:15:12.540 | there is something there that will stick.
01:15:15.360 | And I think these advances in architectures,
01:15:18.280 | in kind of how neural networks are architected
01:15:21.040 | to do what they do.
01:15:23.120 | It's been hard to find one that has been so stable
01:15:26.080 | and relatively has changed very little
01:15:28.920 | since it was invented five or so years ago.
01:15:33.040 | So that is a surprise,
01:15:35.200 | a surprise that keeps recurring across other projects.
01:15:38.320 | - Try to, on a philosophical or technical level,
01:15:42.440 | introspect what is the magic of attention?
01:15:45.480 | What is attention?
01:15:47.320 | There's attention in people that study cognition,
01:15:50.120 | so human attention.
01:15:52.080 | I think there's giant wars over what attention means,
01:15:55.780 | how it works in the human mind.
01:15:57.440 | So there's a very simple look
01:16:00.200 | at what attention is in a neural network
01:16:02.600 | from the days of "Attention Is All You Need,"
01:16:04.440 | but do you think there's a general principle
01:16:06.840 | that's really powerful here?
01:16:08.780 | - Yeah, so a distinction between transformers and LSTMs,
01:16:13.360 | which were what came before,
01:16:15.360 | and there was a transitional period
01:16:17.840 | where you could use both.
01:16:19.680 | In fact, when we talked about AlphaStar,
01:16:22.000 | we used transformers and LSTMs.
01:16:24.280 | So it was still the beginning of transformers.
01:16:26.380 | They were very powerful,
01:16:27.400 | but LSTMs were still also very powerful sequence models.
01:16:31.520 | So the power of the transformer is that it has built in
01:16:36.520 | what we call an inductive bias of attention
01:16:41.140 | that makes the model,
01:16:43.040 | when you think of a sequence of integers, right?
01:16:45.700 | Like we discussed this before, right?
01:16:47.440 | This is a sequence of words.
01:16:50.420 | When you have to do very hard tasks over these words,
01:16:54.780 | this could be, we're gonna translate a whole paragraph
01:16:57.900 | or we're gonna predict the next paragraph
01:16:59.780 | given 10 paragraphs before.
01:17:01.740 | There's some loose intuition from how we do it as a human
01:17:09.260 | that is very nicely mimicked and replicated,
01:17:14.780 | structurally speaking in the transformer,
01:17:16.540 | which is this idea of you're looking for something, right?
01:17:21.160 | So you're sort of, when you're,
01:17:23.900 | you just read a piece of text,
01:17:25.740 | now you're thinking what comes next.
01:17:27.920 | You might wanna relook at the text or look at it from scratch.
01:17:31.780 | I mean, literally, because there's no recurrence.
01:17:35.080 | You're just thinking what comes next.
01:17:37.300 | And it's almost hypothesis driven, right?
01:17:40.020 | So if I'm thinking the next word that I'll write
01:17:43.380 | is cat or dog, okay?
01:17:46.560 | The way the transformer works almost philosophically
01:17:49.840 | is it has these two hypotheses.
01:17:52.840 | Is it gonna be cat or is it gonna be dog?
01:17:55.640 | And then it thinks, okay, if it's cat,
01:17:58.360 | I'm gonna look for certain words, not necessarily cat,
01:18:00.680 | although cat is an obvious word you would look in the past
01:18:02.920 | to see whether it makes more sense to output cat or dog.
01:18:05.960 | And then it does some very deep computation
01:18:09.400 | over the words and beyond, right?
01:18:11.400 | So it combines the words and,
01:18:14.100 | but it has the query as we call it, that is cat.
01:18:18.440 | And then similarly for dog, right?
01:18:20.600 | And so it's a very computational way to think about,
01:18:24.360 | look, if I'm thinking deeply about text,
01:18:26.980 | I need to go back to look at all of the text,
01:18:29.560 | attend over it, but it's not just attention.
01:18:31.860 | Like what is guiding the attention?
01:18:33.920 | And that was the key insight from an earlier paper:
01:18:36.660 | it's not how far away is it.
01:18:39.100 | I mean, how far away it is, is important.
01:18:40.760 | What did I just write about?
01:18:42.680 | That's critical.
01:18:44.100 | But what you wrote about 10 pages ago might also be critical.
01:18:48.360 | So you're looking not positionally, but content-wise, right?
01:18:53.160 | And transformers have this beautiful way
01:18:56.040 | to query for certain content and pull it out
01:18:59.420 | in a compressed way.
01:19:00.280 | So then you can make a more informed decision.
01:19:02.960 | I mean, that's one way to explain transformers.
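A compact sketch of the content-based attention being described, in plain NumPy: each query asks where in the past the content it cares about lives, and the result is a query-specific, compressed summary of the values, with position playing no direct role in this particular sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    # queries: (n_queries, d); keys/values: (n_past_tokens, d)
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])  # match by content, not by position
    weights = softmax(scores, axis=-1)                   # how much each past token matters to each query
    return weights @ values                              # a compressed, query-specific summary of the past

# Usage sketch with random vectors standing in for token representations:
rng = np.random.default_rng(0)
print(attend(rng.normal(size=(2, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 8))).shape)  # (2, 8)
```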
01:19:05.920 | But I think it's a very powerful inductive bias.
01:19:10.000 | There might be some details that might change over time,
01:19:12.480 | but I think that is what makes transformers
01:19:16.400 | so much more powerful than the recurrent networks
01:19:19.880 | that were more recency bias-based,
01:19:22.420 | which obviously works in some tasks,
01:19:24.300 | but it has major flaws.
01:19:26.680 | Transformer itself has flaws.
01:19:29.280 | And I think the main one, the main challenge is
01:19:32.160 | these prompts that we just were talking about,
01:19:35.720 | they can be a thousand words long.
01:19:38.040 | But if I'm teaching you StarCraft,
01:19:39.880 | I mean, I'll have to show you videos.
01:19:41.840 | I'll have to point you to whole Wikipedia articles
01:19:44.600 | about the game.
01:19:46.120 | We'll have to interact probably as you play,
01:19:48.000 | you'll ask me questions.
01:19:49.480 | The context required for us to achieve
01:19:52.340 | me being a good teacher to you on the game,
01:19:54.760 | as you would want to do it with a model,
01:19:56.960 | I think goes well beyond the current capabilities.
01:20:01.600 | So the question is, how do we benchmark this?
01:20:03.900 | And then how do we change the structure
01:20:06.400 | of the architectures?
01:20:07.280 | I think there's ideas on both sides,
01:20:08.820 | but we'll have to see empirically, right?
01:20:11.280 | Obviously what ends up working in the--
01:20:13.360 | - And as you talked about, some of the ideas could be,
01:20:15.880 | keeping the constraint of that length in place,
01:20:19.480 | but then forming hierarchical representations
01:20:23.060 | to where you can start being much more clever
01:20:26.240 | in how you use those thousand tokens.
01:20:28.840 | - Indeed.
01:20:29.680 | - Yeah, that's really interesting.
01:20:32.240 | But it also is possible that this attentional mechanism
01:20:34.840 | where you basically, you don't have a recency bias,
01:20:37.560 | but you look more generally,
01:20:40.300 | you make it learnable.
01:20:42.000 | The mechanism in which way you look back into the past,
01:20:45.280 | you make that learnable.
01:20:46.800 | It's also possible we're at the very beginning of that,
01:20:50.200 | because you might become smarter and smarter
01:20:54.380 | in the way you query the past.
01:20:56.920 | So recent past and distant past,
01:21:00.600 | and maybe very, very distant past.
01:21:02.360 | So almost like the attention mechanism
01:21:04.980 | will have to improve and evolve, as will the
01:21:09.620 | tokenization mechanism,
01:21:11.980 | so you can represent long-term memory somehow.
01:21:14.980 | - Yes.
01:21:16.140 | And I mean, hierarchies are very,
01:21:18.220 | I mean, it's a very nice word that sounds appealing.
01:21:22.180 | There's lots of work adding hierarchy to the memories.
01:21:25.900 | In practice, it does seem like we keep coming back
01:21:29.460 | to the main formula or main architecture.
01:21:33.000 | That sometimes tells us something.
01:21:35.300 | There is this sentence that a friend of mine told me,
01:21:38.540 | about whether an idea wants to work or not.
01:21:41.040 | So Transformer was clearly an idea that wanted to work.
01:21:45.000 | And then I think there's some principles
01:21:47.540 | we believe will be needed, but finding the exact details,
01:21:51.040 | details matter so much, right?
01:21:52.920 | That's gonna be tricky.
01:21:54.280 | - I love the idea that there's like,
01:21:56.800 | you as a human being, you want some ideas to work,
01:22:01.320 | and then there's the model that wants some ideas to work,
01:22:04.520 | and you get to have a conversation to see which,
01:22:07.400 | - More likely the model will win in the end.
01:22:09.600 | Because it's the one, you don't have to do any work.
01:22:12.860 | The model's the one that has to do the work,
01:22:14.380 | so you should listen to the model.
01:22:15.900 | And I really love this idea that you talked about,
01:22:17.900 | the humans in this picture, if I could just briefly ask.
01:23:21.200 | One is you're saying the benchmarks,
01:22:25.700 | so the modular humans working on this,
01:22:27.980 | the benchmarks providing a sturdy ground
01:23:31.700 | for the wish to do these things that seem impossible.
01:22:34.700 | They give you, in the darkest of times, give you hope,
01:22:39.140 | because little signs of improvement.
01:22:40.940 | Somehow you're not lost if you have metrics
01:22:46.560 | to measure your improvement.
01:22:48.680 | And then there's other aspect, you said elsewhere,
01:22:52.260 | and here today, titles matter.
01:22:56.600 | I wonder how much humans matter
01:23:00.520 | in the evolution of all of this,
01:23:02.360 | meaning individual humans.
01:23:04.300 | Something about their interaction,
01:23:08.140 | something about their ideas,
01:23:09.200 | how much they change the direction of all of this.
01:23:13.180 | If you change the humans in this picture,
01:23:15.680 | is it that the model is sitting there,
01:23:18.240 | and it wants some idea to work,
01:23:22.520 | or is it the humans, or maybe the model's providing
01:23:25.600 | 20 ideas that could work,
01:23:27.020 | and depending on the humans you pick,
01:23:29.100 | they're going to be able to hear some of those ideas.
01:23:31.800 | - In all the, because you're now directing
01:23:34.600 | all of deep learning at DeepMind,
01:23:35.920 | you get to interact with a lot of projects,
01:23:37.440 | a lot of brilliant researchers.
01:23:39.000 | How much variability is created by the humans
01:23:43.100 | in all of this?
01:23:44.160 | - Yeah, I mean, I do believe humans matter a lot,
01:23:47.380 | at the very least, at the time scale of years
01:23:52.380 | on when things are happening,
01:23:54.880 | and what's the sequencing of it, right?
01:23:56.940 | So you get to interact with people that,
01:24:00.560 | I mean, you mentioned this,
01:24:02.240 | some people really want some idea to work,
01:24:05.160 | and they'll persist,
01:24:06.720 | and then some other people might be more practical,
01:24:09.400 | like, I don't care what idea works,
01:24:12.880 | I care about cracking protein folding.
01:24:15.920 | And these, at least these two kind of seem opposite sides,
01:24:21.240 | we need both, and we've clearly had both historically,
01:24:25.680 | and that made certain things happen earlier or later,
01:24:29.000 | so definitely humans involved in all of these endeavors
01:24:33.480 | have had, I would say, an effect of years on the ordering of
01:24:38.480 | how things have happened,
01:24:40.480 | which breakthroughs came before
01:24:41.840 | which other breakthroughs, and so on,
01:24:43.300 | so certainly that does happen,
01:24:45.800 | and so one other, maybe one other axis of distinction
01:24:50.600 | is what I called,
01:24:52.040 | and this is most commonly used in reinforcement learning,
01:24:54.860 | is the exploration-exploitation trade-off as well,
01:24:57.800 | it's not exactly what I meant, although quite related.
01:25:00.920 | So when you start trying to help others,
01:25:05.920 | like you become a bit more of a mentor
01:25:11.480 | to a large group of people,
01:25:13.100 | be it a project or the deep learning team or something,
01:25:16.380 | or even in the community
01:25:17.460 | when you interact with people in conferences and so on,
01:25:20.800 | you're identifying quickly some things
01:25:24.920 | that are explorative or exploitative,
01:25:27.080 | and it's tempting to try to guide people, obviously,
01:25:30.720 | I mean, that's what makes our experience,
01:25:33.200 | we bring it and we try to shape things, sometimes wrongly,
01:25:36.760 | and there's many times that I've been wrong in the past,
01:25:39.600 | that's great, but it would be wrong
01:25:43.720 | to dismiss any sort of the research styles
01:25:48.160 | that I'm observing, and I often get asked,
01:25:51.280 | "Well, you're in industry, right,
01:25:52.720 | "so we do have access to large compute scale and so on,
01:25:55.580 | "so there's certain kinds of research
01:25:57.380 | "I almost feel like we need to do responsibly and so on,"
01:26:01.680 | but it is, in a way, we have the particle accelerator here,
01:26:05.200 | so to speak, as in physics, so we need to use it,
01:26:07.520 | we need to answer the questions
01:26:08.840 | that we should be answering right now
01:26:10.440 | for the scientific progress.
01:26:12.400 | But then at the same time, I look at many advances,
01:26:15.240 | including attention, which was discovered in Montreal
01:26:19.360 | initially because of lack of compute, right?
01:26:22.440 | So we were working on sequence to sequence
01:26:24.960 | with my friends over at Google Brain at the time,
01:26:27.920 | and we were using, I think, eight GPUs,
01:26:30.400 | which was somehow a lot at the time,
01:26:32.480 | and then I think Montreal was a bit more limited
01:26:35.240 | in the scale, but then they discovered
01:26:37.320 | this content-based attention concept
01:26:39.240 | that then has obviously triggered things like Transformer.
01:26:43.400 | Not everything obviously starts at the Transformer.
01:26:46.320 | There's always a history that is important to recognize
01:26:49.920 | because then you can make sure that then those
01:26:53.040 | who might feel now, "Well, we don't have so much compute,"
01:26:56.360 | you need to then help them optimize that kind of research
01:27:01.360 | that might actually produce amazing change.
01:27:04.240 | Perhaps it's not as short-term as some of these advancements
01:27:07.920 | or perhaps it's a different timescale,
01:27:09.720 | but the people and the diversity of the field
01:27:13.040 | is quite critical that we maintain it,
01:27:15.720 | and at times, especially mixed a bit with hype
01:27:19.040 | or other things, it's a bit tricky
01:27:21.520 | to be observing maybe too much
01:27:24.160 | of the same thinking across the board,
01:27:27.760 | but the humans definitely are critical,
01:27:30.480 | and I can think of quite a few personal examples
01:27:33.880 | where also someone told me something
01:27:36.560 | that had a huge effect onto some idea,
01:27:43.280 | and then that's why I'm saying at least in terms of years,
01:27:43.280 | probably some things do happen.
01:27:44.880 | - Yeah, it's fascinating.
01:27:45.720 | - Yeah.
01:27:46.560 | - And it's also fascinating how constraints somehow
01:27:48.200 | are essential for innovation,
01:27:51.040 | and the other thing you mentioned about engineering,
01:27:53.400 | I have a sneaking suspicion, maybe I over,
01:27:56.640 | my love is with engineering,
01:27:59.960 | so I have a sneaking suspicion that all the genius,
01:28:04.480 | a large percentage of the genius
01:28:06.280 | is in the tiny details of engineering,
01:28:09.280 | so I think we like to think the genius is in the big ideas.
01:28:14.280 | I have a sneaking suspicion that,
01:28:20.160 | because I've seen the genius of details,
01:28:22.600 | of engineering details,
01:28:24.120 | make the night and day difference,
01:28:28.760 | and I wonder if those kind of have a ripple effect over time.
01:28:32.120 | So that too, so that's sort of,
01:28:35.520 | taking the engineering perspective,
01:28:36.840 | that sometimes that quiet innovation
01:28:39.360 | at the level of an individual engineer,
01:28:41.720 | or maybe at the small scale of a few engineers,
01:28:44.600 | can make all the difference, that scales,
01:28:46.760 | because we're doing, we're working on computers
01:28:49.680 | that are scaled across large groups,
01:28:53.440 | that one engineering decision can lead to ripple effects.
01:28:56.960 | - Yes. - Which is interesting
01:28:57.800 | to think about.
01:28:58.920 | - Yeah, I mean, engineering,
01:29:00.760 | there's also kind of a historical,
01:29:04.160 | it might be a bit random,
01:29:06.280 | because if you think of the history of how,
01:29:09.760 | especially deep learning and neural networks took off,
01:29:12.320 | feels like a bit random,
01:29:15.000 | because GPUs happen to be there at the right time
01:29:17.800 | for a different purpose, which was to play video games.
01:29:20.640 | So even the engineering that goes into the hardware,
01:29:24.600 | and it might have a time,
01:29:26.320 | like the timeframe might be very different.
01:29:28.000 | I mean, the GPUs evolved throughout many years,
01:29:31.560 | where we didn't even, we weren't looking at that, right?
01:29:33.840 | So even at that level, right, that revolution, so to speak,
01:29:37.480 | the ripples are like, we'll see when they stop, right?
01:29:42.160 | But in terms of thinking of why is this happening, right?
01:29:45.920 | There's, I think that when I try to categorize it
01:29:49.760 | in sort of things that might not be so obvious,
01:29:52.720 | I mean, clearly there's a hardware revolution.
01:29:54.960 | We are surfing thanks to that.
01:29:58.360 | Data centers as well.
01:29:59.760 | I mean, data centers are where,
01:30:01.840 | like, I mean, at Google, for instance,
01:30:03.200 | obviously they're serving Google,
01:30:04.800 | but there's also now, thanks to that,
01:30:06.920 | and to have built such amazing data centers,
01:30:09.640 | we can train these models.
01:30:11.720 | Software is an important one.
01:30:13.400 | I think if I look at the state of how I had to implement
01:30:18.280 | things to implement my ideas,
01:30:20.040 | how I discarded ideas because they were too hard
01:30:22.120 | to implement, yeah, clearly the times have changed,
01:30:25.280 | and thankfully we are in a much better
01:30:27.600 | software position as well.
01:30:29.400 | And then, I mean, obviously there's research
01:30:32.240 | that happens at scale and more people enter the field.
01:30:35.160 | That's great to see,
01:30:36.000 | but it's almost enabled by these other things.
01:30:38.280 | And last but not least is also data, right?
01:30:40.600 | Curating data sets, labeling data sets,
01:30:43.120 | these benchmarks we think about,
01:30:44.960 | maybe we'll want to have all the benchmarks in one system,
01:30:48.920 | but it's still very valuable that someone
01:30:51.320 | put the thought and the time and the vision
01:30:53.600 | to build certain benchmarks.
01:30:54.880 | We've seen progress thanks to,
01:30:56.640 | but we're gonna repurpose the benchmarks.
01:30:59.280 | That's the beauty of Atari,
01:31:01.640 | is like we solved it in a way,
01:31:04.240 | but we use it in Gato.
01:31:06.000 | It was critical, and I'm sure there's still a lot more
01:31:09.120 | to do thanks to that amazing benchmark
01:31:10.960 | that someone took the time to put,
01:31:13.120 | even though at the time maybe,
01:31:15.160 | oh, you have to think what's the next,
01:31:17.360 | you know, iteration of architectures.
01:31:19.440 | That's what maybe the field recognizes,
01:31:21.400 | but we need to, that's another thing we need to balance
01:31:24.000 | in terms of the humans behind it.
01:31:25.800 | We need to recognize all these aspects
01:31:27.960 | because they're all critical.
01:31:29.480 | And we tend to, yeah, we tend to think of the genius,
01:31:32.800 | the scientist and so on, but I'm glad you're,
01:31:35.680 | I know you have a strong engineering background, so.
01:31:38.000 | - But also I'm a lover of data,
01:31:40.040 | and to give a pushback on the engineering comment,
01:31:43.240 | ultimately it could be the creators of benchmarks
01:31:46.080 | who have the most impact.
01:31:47.440 | Andrej Karpathy, who you mentioned,
01:31:49.200 | has recently been talking a lot of trash about ImageNet,
01:31:52.000 | which he has the right to do
01:31:53.200 | because of how critical he is,
01:31:55.480 | how essential he is to the development
01:31:57.760 | and the success of deep learning around ImageNet.
01:32:01.520 | And you're saying that that's actually,
01:32:02.920 | that benchmark is holding back the field
01:32:05.480 | because, I mean, especially in his context
01:32:07.680 | on Tesla Autopilot, that's looking at real world behavior
01:32:11.080 | of a system, it's, there's something fundamentally missing
01:32:16.080 | about ImageNet that doesn't capture
01:32:17.960 | the real worldness of things,
01:32:20.440 | that we need to have data sets, benchmarks
01:32:22.640 | that have the unpredictability, the edge cases,
01:32:27.080 | the whatever the heck it is that makes the real world
01:32:29.680 | so difficult to operate in,
01:32:32.280 | we need to have benchmarks with that, so.
01:32:34.680 | But just to think about the impact of ImageNet
01:32:37.760 | as a benchmark, and that really puts a lot of emphasis
01:32:42.120 | on the importance of a benchmark,
01:32:43.720 | both sort of internally at DeepMind and as a community.
01:32:46.680 | So one is coming in from within,
01:32:48.960 | like how do I create a benchmark for me
01:32:52.520 | to mark and make progress, and how do I make a benchmark
01:32:57.280 | for the community to mark and push progress?
01:33:02.520 | - You have this amazing paper you co-authored,
01:33:05.880 | a survey paper called Emergent Abilities
01:33:08.600 | of Large Language Models, it has, again,
01:33:11.440 | the philosophy here that I'd love to ask you about.
01:33:14.480 | What's the intuition about the phenomena
01:33:16.680 | of emergence in neural networks,
01:33:18.480 | transformer language models?
01:33:20.660 | Is there a magic threshold beyond which
01:33:24.160 | we start to see certain performance?
01:33:27.160 | And is that different from task to task?
01:33:29.960 | Is that us humans just being poetic and romantic,
01:33:32.640 | or is there literally some level
01:33:35.440 | at which we start to see breakthrough performance?
01:33:38.200 | - Yeah, I mean, this is a property that we start seeing
01:33:41.520 | in systems that actually tend to be,
01:33:46.880 | so in machine learning, traditionally,
01:33:49.280 | again, going to benchmarks, I mean,
01:33:51.960 | if you have some input-output, right,
01:33:54.860 | like that is just a single input and a single output,
01:33:58.280 | you generally, when you train these systems,
01:34:01.200 | you see reasonably smooth curves
01:34:04.420 | when you analyze how much the data set size
01:34:09.420 | affects the performance, or how the model size
01:34:12.020 | affects the performance, or how long you train
01:34:15.080 | the system for
01:34:17.920 | affects the performance, right?
01:34:19.360 | So, you know, if we think of ImageNet,
01:34:22.080 | like the training curves look fairly smooth
01:34:25.080 | and predictable in a way,
01:34:28.160 | and I would say that's probably because of the,
01:34:31.360 | it's kind of a one-hop reasoning task, right?
01:34:36.360 | It's like, here is an input,
01:34:38.240 | and you think for a few milliseconds,
01:34:40.800 | or 100 milliseconds, 300, as a human,
01:34:43.760 | and then you tell me, yeah,
01:34:44.840 | there's an alpaca in this image.
01:34:47.880 | So, in language, we are seeing benchmarks
01:34:52.800 | that require more pondering and more thought in a way, right?
01:34:58.240 | This is just kind of, you need to look for some subtleties,
01:35:01.960 | that it involves inputs that you might think of,
01:35:05.400 | or even if the input is a sentence
01:35:07.860 | describing a mathematical problem,
01:35:09.800 | there is a bit more processing required as a human
01:35:14.180 | and more introspection.
01:35:15.700 | So, I think that how these benchmarks work
01:35:20.520 | means that there is actually a threshold,
01:35:23.520 | just going back to how transformers work
01:35:26.760 | in this way of querying for the right questions
01:35:29.560 | to get the right answers,
01:35:31.160 | that might mean that performance becomes random
01:35:35.520 | until the right question is asked
01:35:37.800 | by the querying system of a transformer
01:35:40.080 | or of a language model like a transformer,
01:35:42.880 | and then, only then, you might start seeing performance
01:35:47.720 | going from random to non-random,
01:35:50.120 | and this is more empirical.
01:35:52.720 | There's no formalism or theory behind this yet,
01:35:56.320 | although it might be quite important,
01:35:57.800 | but we're seeing these phase transitions
01:36:00.360 | of random performance until some, let's say,
01:36:03.680 | scale of a model, and then it goes beyond that.
01:36:06.800 | And it might be that you need to fit
01:36:10.560 | a few low-order bits of thought
01:36:14.040 | before you can make progress on the whole task.
01:36:17.200 | And if you could measure, actually,
01:36:19.720 | that breakdown of the task,
01:36:21.880 | maybe you would see more smooth,
01:36:23.480 | oh, like, yeah, this, you know,
01:36:24.960 | once you get this and this and this and this and this,
01:36:27.760 | then you start making progress in the task.
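As an aside for the reader, the "low-order bits" intuition can be made concrete with a toy calculation: if a task only counts as solved when every one of its k sub-steps is right, a per-step skill that improves smoothly with scale still produces an exact-match curve that looks like a sharp jump. The model sizes, the logistic curve, and k below are invented for illustration; they are not measurements from any real system.

```python
import numpy as np

# Toy illustration (hypothetical numbers): per-step accuracy improves smoothly
# with model size, but exact-match on a k-step task is the product of the
# per-step accuracies, so it stays near zero and then rises abruptly.
model_sizes = np.logspace(7, 11, 9)                # 10M .. 100B parameters (made up)
per_step_acc = 1.0 / (1.0 + np.exp(-2.0 * (np.log10(model_sizes) - 9.0)))  # smooth logistic
k = 8                                              # number of sub-steps the task requires
exact_match = per_step_acc ** k                    # all k sub-steps must be correct

for n, s, e in zip(model_sizes, per_step_acc, exact_match):
    print(f"{n:10.1e} params | per-step {s:.2f} | exact-match {e:.4f}")
```

In this sketch the per-step accuracy climbs gradually over four orders of magnitude, while the end-to-end metric only lifts off near the top of the range, which is one way a smooth underlying trend can register as "emergence" on a hard benchmark.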
01:36:30.320 | But it's somehow a bit annoying
01:36:33.520 | because then it means that certain questions
01:36:37.480 | we might ask about architectures
01:36:40.320 | possibly can only be done at certain scale.
01:36:43.040 | And one thing that, conversely,
01:36:46.120 | I've seen great progress on in the last couple of years
01:36:49.200 | is this notion of science of deep learning
01:36:52.480 | and science of scale in particular, right?
01:36:55.040 | So, on the negative is that there's some benchmarks
01:36:58.680 | for which progress might need to be measured
01:37:01.800 | at minimum and at certain scale
01:37:04.000 | until you see then what details of the model matter
01:37:07.560 | to make that performance better, right?
01:37:10.000 | So that's a bit of a con.
01:37:11.920 | But what we've also seen is that
01:37:14.720 | you can sort of empirically analyze
01:37:18.600 | behavior of models at scales that are smaller, right?
01:37:22.880 | So let's say, to put an example,
01:37:25.680 | we had this Chinchilla paper
01:37:27.840 | that revised the so-called scaling laws of models.
01:37:31.360 | And that whole study is done at a reasonably small scale,
01:37:34.720 | right, that may be hundreds of millions
01:37:36.520 | up to 1 billion parameters.
01:37:38.680 | And then the cool thing is that you create some laws, right?
01:37:41.880 | Some laws, some trends, right?
01:37:43.640 | You extract trends from data that you see, okay,
01:37:46.600 | like it looks like the amount of data required
01:37:49.400 | to train now a 10X larger model would be this.
01:37:52.120 | And these laws so far,
01:37:53.960 | these extrapolations have helped us save compute
01:37:57.480 | and just get to a better place in terms of the science
01:38:00.920 | of how should we run these models at scale,
01:38:03.800 | how much data, how much depth,
01:38:05.600 | and all sorts of questions we start asking,
01:38:08.480 | extrapolating from a small scale.
01:38:10.600 | But then this emergence is sadly
01:38:12.720 | that not everything can be extrapolated from scale
01:38:15.680 | depending on the benchmark.
01:38:16.880 | And maybe the harder benchmarks are not so good
01:38:20.240 | for extracting these laws.
01:38:21.960 | But we have a variety of benchmarks at least.
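To picture the kind of extrapolation being described, here is a minimal sketch under made-up numbers: fit a power law in log-log space to small-scale measurements of the compute-optimal token count, then read off the prediction at a ten-times larger budget. The budgets, token counts, and fitted exponent are purely illustrative and are not the Chinchilla results.

```python
import numpy as np

# Hypothetical small-scale measurements: compute budgets (FLOPs) and the
# training-token counts that gave the best loss at each budget.
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
optimal_tokens = np.array([4e9, 8e9, 1.5e10, 2.7e10, 5e10])   # invented numbers

# Fit optimal_tokens ~ a * compute^b by linear regression in log-log space.
b, log_a = np.polyfit(np.log10(compute), np.log10(optimal_tokens), deg=1)
a = 10.0 ** log_a

# Extrapolate the trend to a 10x larger budget than anything measured.
big_budget = 1e21
print(f"fitted exponent b = {b:.2f}")
print(f"predicted optimal tokens at {big_budget:.0e} FLOPs: {a * big_budget ** b:.2e}")
```

The caveat in the conversation applies directly to a fit like this: it only saves compute as long as the benchmark behaves smoothly enough for the extrapolation to hold.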
01:38:24.160 | - So I wonder to which degree the threshold,
01:38:28.000 | the phase shift scale is a function of the benchmark.
01:38:31.680 | Some of the science of scale might be engineering benchmarks
01:38:37.840 | where that threshold is low.
01:38:40.400 | Sort of taking a main benchmark
01:38:43.840 | and reducing it somehow
01:38:46.120 | where the essential difficulty is left,
01:38:48.480 | but the scale at which the emergence happens is lower.
01:38:52.600 | Just for the science aspect of it
01:38:54.280 | versus the actual real world aspect.
01:38:56.960 | - Yeah, so luckily we have quite a few benchmarks,
01:38:59.280 | some of which are simpler,
01:39:00.560 | or maybe they're more like,
01:39:01.880 | I think people might call these System 1
01:39:03.840 | versus System 2 style.
01:39:05.920 | So I think what we're now seeing, luckily,
01:39:09.880 | is that extrapolations from maybe slightly smoother
01:39:14.040 | or simpler benchmarks are translating to the harder ones.
01:39:18.560 | But that is not to say
01:39:19.640 | that this extrapolation won't hit its limits.
01:39:22.600 | And when it does,
01:39:24.200 | then how much we scale or how we scale
01:39:27.560 | will sadly be a bit suboptimal
01:39:29.440 | until we find better laws, right?
01:39:31.800 | And these laws, again, are very empirical laws.
01:39:33.800 | They're not like physical laws of models.
01:39:35.920 | Although I wish there would be better theory
01:39:38.680 | about these things as well,
01:39:40.120 | but so far I would say empirical theory,
01:39:43.000 | as I call it, is way ahead
01:39:44.520 | of actual theory of machine learning.
01:39:47.000 | - Let me ask you almost for fun.
01:39:50.480 | So this is not, Oriol,
01:39:52.080 | as a DeepMind person
01:39:54.640 | or anything to do with DeepMind or Google,
01:39:57.280 | just as a human being,
01:39:58.840 | and looking at these news of a Google engineer
01:40:01.760 | who claimed that,
01:40:05.800 | I guess the LaMDA language model
01:40:08.360 | was sentient or had the,
01:40:11.120 | and you still need to look into the details of this,
01:40:14.080 | but sort of making an official report
01:40:18.680 | and a claim that he believes there's evidence
01:40:21.740 | that this system has achieved sentience.
01:40:25.120 | And I think this is a really interesting case
01:40:29.560 | on a human level, on a psychological level,
01:40:31.760 | on a technical machine learning level
01:40:35.920 | of how language models transform our world,
01:40:38.360 | and also just philosophical level
01:40:39.880 | of the role of AI systems in a human world.
01:40:44.120 | So what do you find interesting?
01:40:48.120 | What's your take on all of this
01:40:49.720 | as a machine learning engineer and a researcher
01:40:52.440 | and also as a human being?
01:40:54.320 | - Yeah, I mean, a few reactions.
01:40:56.400 | Quite a few, actually.
01:40:58.760 | - Have you ever briefly thought,
01:41:01.640 | is this thing sentient?
01:41:02.560 | - Right, so never.
01:41:04.320 | Absolutely never.
01:41:05.160 | - You mean with like AlphaStar?
01:41:06.280 | Wait a minute.
01:41:07.120 | - Sadly, though, I think, yeah, sadly I have not.
01:41:11.960 | Yeah, I think the current, any of the current models,
01:41:15.320 | although very useful and very good,
01:41:17.560 | yeah, I think we're quite far from that.
01:41:21.200 | And there's kind of a converse side story.
01:41:25.360 | So one of my passions is about science in general.
01:41:30.360 | And I think I feel I'm a bit of a failed scientist.
01:41:34.540 | That's why I came to machine learning,
01:41:36.560 | because you always feel, and you start seeing this,
01:41:40.160 | that machine learning is maybe the science
01:41:43.200 | that can help other sciences, as we've seen, right?
01:41:45.440 | Like you, you know, it's such a powerful tool.
01:41:48.620 | So thanks to that angle, right, that, okay, I love science.
01:41:52.520 | I love, I mean, I love astronomy, I love biology,
01:41:54.960 | but I'm not an expert and I decided,
01:41:56.960 | well, the thing I can do better at is computers.
01:42:00.040 | But having, especially with,
01:42:02.960 | when I was a bit more involved in AlphaFold,
01:42:05.560 | learning a bit about proteins and about biology
01:42:08.800 | and about life, the complexity,
01:42:13.120 | it feels like it really is, like, I mean,
01:42:15.040 | if you start looking at the things that are going on
01:42:18.160 | at the atomic level, and also, I mean,
01:42:23.880 | there's obviously the, we are maybe inclined
01:42:27.720 | to try to think of neural networks as like the brain,
01:42:30.440 | but the complexities and the amount of magic that it feels
01:42:35.080 | when, I mean, I don't, I'm not an expert,
01:42:37.120 | so it naturally feels more magic,
01:42:38.600 | but looking at biological systems,
01:42:40.920 | as opposed to these computational brains,
01:42:45.540 | just makes me like, wow, there's such level
01:42:49.600 | of complexity difference still, right?
01:42:51.480 | Like orders of magnitude complexity that,
01:42:53.820 | sure, these weights, I mean, we train them
01:42:56.680 | and they do nice things, but they're not at the level
01:43:00.160 | of biological entities, brains, cells.
01:43:05.160 | It just feels like it's just not possible
01:43:08.960 | to achieve the same level of complexity behavior,
01:43:12.360 | and my belief, when I talk to other beings,
01:43:16.280 | is certainly shaped by this amazement of biology
01:43:20.340 | that maybe because I know too much,
01:43:22.340 | I don't have about machine learning,
01:43:23.760 | but I certainly feel it's very far-fetched
01:43:27.600 | and far in the future to be calling,
01:43:29.780 | or to be thinking, well, this mathematical function
01:43:34.560 | that is differentiable is in fact sentient and so on.
01:43:39.200 | - There's something on that point, it's very interesting.
01:43:41.980 | So you know enough about machines and enough about biology
01:43:46.980 | to know that there's many orders of magnitude
01:43:49.040 | of difference and complexity,
01:43:50.620 | but you know how machine learning works.
01:43:56.060 | So the interesting question for human beings
01:43:58.140 | that are interacting with a system
01:43:59.400 | that don't know about the underlying complexity,
01:44:02.240 | and I've seen people, probably including myself,
01:44:05.240 | that have fallen in love with things that are quite simple.
01:44:07.920 | - Yeah, so-- - And so maybe
01:44:09.440 | the complexity is one part of the picture,
01:44:11.500 | but maybe that's not a necessary condition for sentience,
01:44:16.500 | for perception or emulation of sentience.
01:44:25.000 | - Right, so I mean, I guess the other side of this is,
01:44:28.180 | that's how I feel personally,
01:44:29.560 | I mean, you asked me about the person, right?
01:44:32.360 | Now it's very interesting to see
01:44:33.980 | how other humans feel about things, right?
01:44:36.360 | This is, we are like, again, like I'm not as amazed
01:44:40.800 | about things that I feel,
01:44:42.320 | this is not as magical as this other thing,
01:44:44.560 | because of maybe how I got to learn about it
01:44:48.000 | and how I see the curve a bit more smooth,
01:44:50.480 | because I, you know, like just seeing the progress
01:44:53.080 | of language models since Shannon in the '50s,
01:44:56.000 | and actually looking at that timescale,
01:44:58.900 | the progress is not that fast, right?
01:45:00.840 | I mean, what we were thinking at the time,
01:45:03.460 | like almost 100 years ago,
01:45:05.960 | is not that dissimilar to what we're doing now,
01:45:08.920 | but at the same time, yeah, obviously others,
01:45:11.440 | my experience, right, the personal experience,
01:45:14.500 | I think no one should, you know,
01:45:17.360 | I think no one should tell others how they should feel,
01:45:20.680 | I mean, the feelings are very personal, right?
01:45:22.940 | So how others might feel about the models and so on,
01:45:26.120 | that's one part of the story that is important
01:45:28.480 | to understand for me personally as a researcher,
01:45:32.040 | and then when I maybe disagree or I don't understand
01:45:36.160 | or see that, yeah, maybe this is not something
01:45:38.560 | I think right now is reasonable, knowing all that I know,
01:45:41.580 | one of the other things, and perhaps partly
01:45:44.320 | why it's great to be talking to you
01:45:46.600 | and reaching out to the world about machine learning is,
01:45:49.860 | hey, let's demystify a bit the magic
01:45:53.480 | and try to see a bit more of the math
01:45:56.280 | and the fact that literally to create these models,
01:45:59.920 | if we had the right software, it would be 10 lines of code
01:46:03.160 | and then just a dump of the internet,
01:46:06.160 | so versus like then the complexity of like the creation
01:46:10.320 | of humans from their inception, right,
01:46:13.640 | and also the complexity of evolution
01:46:15.820 | of the whole universe to where we are
01:46:19.240 | that feels orders of magnitude more complex
01:46:21.960 | and fascinating to me.
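To make the "10 lines of code and a dump of the internet" remark concrete, here is a deliberately minimal sketch of such a next-token-prediction loop, assuming PyTorch as the "right software"; the random tokens, model size, and hyperparameters are placeholders, and a real system wraps enormous data, infrastructure, and evaluation work around a loop like this.

```python
import torch
import torch.nn as nn

# Minimal sketch of next-token prediction; everything here is a placeholder.
vocab, d = 50_000, 512
data = torch.randint(0, vocab, (1_024, 256))        # stand-in for a tokenized "dump of the internet"
embed = nn.Embedding(vocab, d)
model = nn.TransformerEncoder(nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=6)
head = nn.Linear(d, vocab)
opt = torch.optim.Adam([*embed.parameters(), *model.parameters(), *head.parameters()], lr=3e-4)
mask = nn.Transformer.generate_square_subsequent_mask(255)   # causal mask: attend only to the past

for batch in data.split(32):                                  # mini-batches of token sequences
    logits = head(model(embed(batch[:, :-1]), mask=mask))      # predict every next token
    loss = nn.functional.cross_entropy(logits.reshape(-1, vocab), batch[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```

The point of the comparison stands either way: the loop itself is short, and the complexity lives everywhere around it.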
01:46:23.500 | So I think, yeah, maybe part of,
01:46:26.040 | the only thing I'm thinking about trying to tell you is,
01:46:29.300 | yeah, I think explaining a bit of the magic,
01:46:32.640 | there is a bit of magic, it's good to be in love,
01:46:34.840 | obviously, with what you do at work,
01:46:37.040 | and I'm certainly fascinated and surprised
01:46:39.440 | quite often as well, but I think hopefully,
01:46:43.200 | as experts in biology, hopefully you will tell me
01:46:45.900 | this is not as magic, and I'm happy to learn that.
01:46:49.440 | Through interactions with the larger community,
01:46:52.280 | we can also have a certain level of education
01:46:56.020 | that in practice also will matter,
01:46:58.360 | because I mean, one question is how you feel about this,
01:47:00.800 | but then the other very important is,
01:47:03.080 | you starting to interact with these in products and so on,
01:47:06.960 | it's good to understand a bit what's going on,
01:47:09.160 | what's not going on, what's safe, what's not safe,
01:47:12.280 | and so on, right, otherwise, the technology
01:47:14.840 | will not be used properly for good,
01:47:17.040 | which is obviously the goal of all of us, I hope.
01:47:20.540 | - So let me then ask the next question,
01:47:22.940 | do you think in order to solve intelligence,
01:47:25.800 | or to replace the Lexbot that does interviews,
01:47:29.560 | as we started this conversation with,
01:47:31.440 | do you think the system needs to be sentient?
01:47:34.840 | Do you think it needs to achieve something
01:47:37.260 | like consciousness, and do you think about
01:47:39.780 | what consciousness is in the human mind
01:47:43.260 | that could be instructive for creating AI systems?
01:47:46.760 | - Yeah, honestly, I think probably not
01:47:51.040 | to the degree of intelligence that there's this brain
01:47:56.040 | that can learn, can be extremely useful,
01:48:00.320 | can challenge you, can teach you,
01:48:02.960 | conversely, you can teach it to do things.
01:48:05.640 | I'm not sure it's necessary, personally speaking,
01:48:09.120 | but if consciousness or any other biological
01:48:14.080 | or evolutionary lesson can be repurposed
01:48:19.080 | to then influence our next set of algorithms,
01:48:22.600 | that is a great way to actually make progress, right?
01:48:25.680 | And the same way I tried to explain Transformers a bit,
01:48:28.220 | how it feels we operate when we look at text specifically,
01:48:33.220 | these insights are very important, right?
01:48:36.000 | So there's a distinction between details
01:48:40.320 | of how the brain might be doing computation.
01:48:43.260 | I think my understanding is, sure, there's neurons
01:48:46.560 | and there's some resemblance to neural networks,
01:48:48.520 | but we don't quite understand enough of the brain
01:48:51.440 | in detail, right, to be able to replicate it.
01:48:55.320 | But then more, if you zoom out a bit,
01:48:58.840 | how we then, our thought process, how memory works,
01:49:03.400 | maybe even how evolution got us here,
01:49:05.640 | what's exploration, exploitation,
01:49:07.320 | like how these things happen,
01:49:08.800 | I think these clearly can inform algorithmic level research.
01:49:13.080 | And I've seen some examples of this being quite useful
01:49:18.080 | to then guide the research,
01:49:19.740 | even it might be for the wrong reasons, right?
01:49:21.660 | So I think biology and what we know about ourselves
01:49:26.100 | can help a whole lot to build essentially
01:49:29.980 | what we call AGI, this general, the real Gato, right?
01:49:34.140 | The last step of the chain, hopefully.
01:49:36.540 | But consciousness in particular,
01:49:39.180 | I don't myself at least think too hard
01:49:42.060 | about how to add that to the system,
01:49:44.800 | but maybe my understanding is also very personal
01:49:47.840 | about what it means, right?
01:49:48.840 | I think this, even that in itself is a long debate
01:49:51.760 | that I know people have often,
01:49:55.300 | and maybe I should learn more about this.
01:49:57.780 | - Yeah, and I personally, I notice the magic often
01:50:01.740 | on a personal level, especially with physical systems,
01:50:04.940 | like robots.
01:50:06.160 | I have a lot of legged robots now in Austin
01:50:10.460 | that I play with.
01:50:11.700 | And even when you program them,
01:50:13.260 | when they do things you didn't expect,
01:50:15.580 | there's an immediate anthropomorphization,
01:50:18.620 | and you notice the magic,
01:50:19.820 | and you start to think about things like sentience
01:50:22.620 | that has to do more with effective communication
01:50:26.020 | and less with any of these kind of dramatic things.
01:50:28.580 | It seems like a useful part of communication.
01:50:32.600 | Having the perception of consciousness
01:50:36.580 | seems like useful for us humans.
01:50:38.860 | We treat each other more seriously.
01:50:40.860 | We are able to do a nearest neighbor shoving
01:50:45.060 | of that entity into your memory correctly,
01:50:47.700 | all that kind of stuff.
01:50:48.700 | Seems useful, at least to fake it,
01:50:50.860 | even if you never make it.
01:50:52.500 | - So maybe, like, yeah, mirroring the question,
01:50:55.660 | and since you talked to a few people,
01:50:57.460 | then you do think that we'll need to figure something out
01:51:01.780 | in order to achieve intelligence
01:51:04.580 | in a grander sense of the word.
01:51:06.540 | - Yeah, I personally believe yes,
01:51:08.220 | but I don't even think it'll be like a separate island
01:51:12.620 | we'll have to travel to.
01:51:14.140 | I think it'll emerge quite naturally.
01:51:16.420 | - Okay, that's easier for us then, thank you.
01:51:20.140 | - But the reason I think it's important to think about
01:51:22.820 | is you will start, I believe, like with this Google Engineer,
01:51:26.340 | you will start seeing this a lot more,
01:51:28.780 | especially when you have AI systems
01:51:30.540 | that are actually interacting with human beings
01:51:32.980 | that don't have an engineering background,
01:51:35.180 | and we have to prepare for that.
01:51:38.580 | Because there'll be, I do believe
01:51:40.100 | there'll be a civil rights movement for robots,
01:51:42.300 | as silly as it is to say.
01:51:44.580 | There's going to be a large number of people
01:51:46.780 | that realize there's these intelligent entities
01:51:48.980 | with whom I have a deep relationship,
01:51:51.620 | and I don't wanna lose them.
01:51:53.220 | They've come to be a part of my life,
01:51:54.780 | and they mean a lot.
01:51:55.980 | They have a name, they have a story, they have a memory,
01:51:59.020 | and we start to ask questions about ourselves.
01:52:01.340 | Well, what, this thing sure seems like
01:52:04.940 | it's capable of suffering,
01:52:07.600 | because it tells all these stories of suffering.
01:52:09.860 | It doesn't wanna die and all those kinds of things,
01:52:11.700 | and we have to start to ask ourselves questions.
01:52:14.460 | Well, what is the difference
01:52:15.460 | between a human being and this thing?
01:52:16.980 | And so when you engineer,
01:52:18.580 | I believe from an engineering perspective,
01:52:21.500 | from like a DeepMind, or anybody that builds systems,
01:52:24.980 | there might be laws in the future
01:52:26.500 | where you're not allowed to engineer systems
01:52:29.140 | with displays of sentience,
01:52:31.240 | unless they're explicitly designed to be that,
01:52:36.020 | unless it's a pet.
01:52:37.380 | So if you have a system that's just doing customer support,
01:52:41.260 | you're legally not allowed to display sentience.
01:52:44.180 | We'll start to ask ourselves that question,
01:52:47.300 | and then so that's going to be part
01:52:49.500 | of the software engineering process.
01:52:51.260 | Which features do we have,
01:52:53.360 | and one of them is communications of sentience.
01:52:56.820 | But it's important to start thinking about that stuff,
01:52:58.700 | especially how much it captivates public attention.
01:53:01.740 | - Yeah, absolutely.
01:53:03.180 | It's definitely a topic that is important.
01:53:06.420 | We think about, and I think in a way,
01:53:09.540 | I always see, not every movie is equally on point
01:53:14.540 | with certain things,
01:53:16.100 | but certainly science fiction in this sense,
01:53:19.100 | at least has prepared society
01:53:20.740 | to start thinking about certain topics
01:53:24.060 | that even if it's too early to talk about,
01:53:26.460 | as long as we are reasonable,
01:53:29.480 | it's certainly gonna prepare us
01:53:31.300 | for both the research to come and how to,
01:53:34.980 | I mean, there's many important challenges
01:53:37.060 | and topics that come with building an intelligent system,
01:53:42.060 | many of which you just mentioned, right?
01:53:44.660 | So I think we're never gonna be fully ready
01:53:49.660 | unless we talk about this,
01:53:51.420 | and we start also, as I said,
01:53:54.140 | just kind of expanding the people we talk to,
01:53:59.140 | to not include only our own researchers and so on.
01:54:03.180 | And in fact, places like DeepMind, but elsewhere,
01:54:06.540 | there's more interdisciplinary groups forming up
01:54:10.380 | to start asking and really working with us
01:54:13.260 | on these questions,
01:54:14.980 | because obviously this is not initially
01:54:17.420 | what your passion is when you do your PhD,
01:54:19.380 | but certainly it is coming, right?
01:54:21.460 | So it's fascinating, kind of.
01:54:23.140 | It's the thing that brings me to one of my passions
01:54:27.180 | that is learning.
01:54:28.020 | So in this sense, this is kind of a new area
01:54:31.740 | that as a learning system myself, I want to keep exploring.
01:54:36.660 | And I think it's great to see parts of the debate,
01:54:41.060 | and even I've seen a level of maturity
01:54:43.780 | in the conferences that deal with AI.
01:54:46.500 | If you look five years ago to now,
01:54:49.940 | just the amount of workshops and so on has changed so much.
01:54:53.100 | It's impressive to see how much topics of safety, ethics,
01:54:58.100 | and so on come to the surface, which is great.
01:55:01.700 | And if we were too early, clearly it's fine.
01:55:03.860 | I mean, it's a big field and there's lots of people
01:55:07.300 | with lots of interests that will do progress
01:55:10.300 | or make progress.
01:55:11.940 | And obviously I don't believe we're too late.
01:55:14.100 | So in that sense, I think it's great
01:55:16.460 | that we're doing this already.
01:55:18.180 | - It's better to be too early than too late
01:55:20.220 | when it comes to super intelligent AI systems.
01:55:22.780 | Let me ask, speaking of sentient AIs,
01:55:25.500 | you gave props to your friend, Ilya Sutskever,
01:55:28.700 | for being elected a Fellow of the Royal Society.
01:55:31.980 | So just as a shout out to a fellow researcher and a friend,
01:55:35.140 | what's the secret to the genius of Ilya Sutskever?
01:55:39.420 | And also, do you believe that his tweets,
01:55:42.660 | as you've hypothesized and Andrej Karpathy did as well,
01:55:46.020 | are generated by a language model?
01:55:48.660 | - Yeah.
01:55:49.500 | So I strongly believe Ilya is gonna be visiting
01:55:53.820 | in a few weeks actually, so I'll ask him in person.
01:55:58.420 | - Will he tell you the truth?
01:55:59.260 | - Yes, of course.
01:56:00.100 | - Okay, sure. - Hopefully.
01:56:00.940 | I mean, ultimately we all have shared paths
01:56:04.060 | and there's friendships that go beyond,
01:56:06.940 | obviously, institutions and so on.
01:56:09.860 | So I hope he tells me the truth.
01:56:11.780 | - Or maybe the AI system is holding him hostage somehow.
01:56:14.420 | Maybe he has some videos that he doesn't wanna release.
01:56:16.980 | So maybe it has taken control over him,
01:56:19.740 | so he can't tell the truth.
01:56:20.580 | - Well, if I see him in person, then I think I'll-
01:56:22.300 | - He will know.
01:56:23.220 | - Yeah, but I think it's a good,
01:56:27.620 | I think Ilya's personality, just knowing him for a while,
01:56:30.940 | yeah, he's, everyone on Twitter, I guess,
01:56:35.260 | gets a different persona.
01:56:36.580 | And I think Ilya's one does not surprise me, right?
01:56:40.860 | So I think knowing Ilya from before social media
01:56:43.540 | and before AI was so prevalent,
01:56:45.740 | I recognize a lot of his character.
01:56:47.460 | So that's something for me that I feel good about,
01:56:50.460 | a friend that hasn't changed
01:56:52.420 | or like is still true to himself, right?
01:56:55.940 | Obviously, there is though a fact
01:56:58.900 | that your field becomes more popular
01:57:02.100 | and he is obviously one of the main figures in the field,
01:57:05.420 | having done a lot of advancement.
01:57:06.860 | So I think that the tricky bit here
01:57:08.980 | is how to balance your true self
01:57:11.060 | with the responsibility that your words carry.
01:57:13.540 | So in this sense, I think, yeah,
01:57:16.100 | like I appreciate the style and I understand it,
01:57:19.300 | but it created debates on like some of his tweets, right?
01:57:24.100 | That maybe it's good we have them early anyways, right?
01:57:26.780 | But yeah, then the reactions are usually polarizing.
01:57:30.980 | I think we're just seeing kind of the reality
01:57:32.980 | of social media a bit there as well,
01:57:34.900 | reflected on that particular topic
01:57:38.060 | or set of topics he's tweeting about.
01:57:40.220 | - Yeah, I mean, it's funny that you speak to this tension.
01:57:42.860 | He was one of the early seminal figures
01:57:46.100 | in the field of deep learning.
01:57:47.260 | And so there's a responsibility with that,
01:57:48.900 | but he's also, from having interacted with him quite a bit,
01:57:53.100 | he's just a brilliant thinker about ideas.
01:57:57.380 | And which, as are you,
01:58:01.180 | and there's a tension between becoming the manager
01:58:03.700 | versus like the actual thinking through very novel ideas.
01:58:08.700 | The, yeah, the scientist versus the manager.
01:58:13.540 | And he's one of the great scientists of our time.
01:58:17.620 | This was quite interesting.
01:58:18.740 | And also people tell me he's quite silly,
01:58:20.740 | which I haven't quite detected yet,
01:58:23.180 | but in private, we'll have to see about that.
01:58:25.940 | - Yeah, yeah.
01:58:27.380 | I mean, just on the point of,
01:58:29.580 | I mean, Ilya has been an inspiration.
01:58:33.260 | I mean, quite a few colleagues I can think shaped,
01:58:36.300 | you know, the person you are.
01:58:37.980 | Like Ilya certainly gets probably the top spot,
01:58:42.220 | if not close to the top.
01:58:43.700 | And if we go back to the question about people in the field,
01:58:47.900 | like how their role would have changed the field or not,
01:58:51.660 | I think Ilya's case is interesting
01:58:53.900 | because he really has a deep belief
01:58:56.740 | in the scaling up of neural networks.
01:58:59.540 | There was a talk that is still famous to this day
01:59:03.620 | from the "Sequence to Sequence" paper,
01:59:06.100 | where he was just claiming,
01:59:08.340 | just give me supervised data and a large neural network,
01:59:11.700 | and then, you know, you'll solve
01:59:13.140 | basically all the problems, right?
01:59:14.580 | That vision, right, was already there many years ago.
01:59:19.580 | So it's good to see like someone who is, in this case,
01:59:22.820 | very deeply into this style of research
01:59:27.140 | and clearly has had a tremendous track record
01:59:31.980 | of successes and so on.
01:59:34.100 | The funny bit about that talk is that
01:59:36.300 | we rehearsed the talk in a hotel room before,
01:59:39.020 | and the original version of that talk
01:59:41.980 | would have been even more controversial.
01:59:43.980 | So maybe I'm the only person
01:59:46.540 | that has seen the unfiltered version of the talk.
01:59:49.180 | And, you know, maybe when the time comes,
01:59:51.660 | maybe we should revisit some of the skip slides
01:59:55.100 | from the talk from Ilya.
01:59:57.580 | But I really think the deep belief
02:00:01.020 | into some certain style of research pays out, right?
02:00:03.900 | It's good to be practical sometimes.
02:00:06.380 | And I actually think Ilya and myself are like practical,
02:00:09.380 | but it's also good there's some sort of long-term belief
02:00:13.260 | and trajectory.
02:00:14.820 | Obviously, there's a bit of luck involved,
02:00:16.700 | but it might be that that's the right path,
02:00:18.820 | then you clearly are ahead
02:00:19.980 | and hugely influential to the field, as he has been.
02:00:23.540 | - Do you agree with that intuition
02:00:25.100 | that maybe was written about by Rich Sutton
02:00:29.660 | in "The Bitter Lesson,"
02:00:33.580 | that the biggest lesson that can be read
02:00:35.260 | from 70 years of AI research is that general methods
02:00:38.620 | that leverage computation are ultimately the most effective.
02:00:42.780 | Do you think that intuition is ultimately correct?
02:00:47.780 | General methods that leverage computation,
02:00:52.220 | allowing the scaling of computation to do a lot of the work,
02:00:56.140 | and so the basic task of us humans is to design methods
02:01:00.900 | that are more and more general
02:01:02.580 | versus more and more specific to the tasks at hand?
02:01:07.060 | - I certainly think this essentially mimics
02:01:10.380 | a bit of the deep learning research,
02:01:13.540 | almost like philosophy,
02:01:16.980 | that on the one hand, we want to be data agnostic,
02:01:20.460 | we don't wanna pre-process datasets,
02:01:22.100 | we wanna see the bytes, right?
02:01:23.420 | Like the true data as it is,
02:01:25.540 | and then learn everything on top.
02:01:27.340 | So very much agree with that.
02:01:29.780 | And I think scaling up feels at the very least,
02:01:32.860 | again, necessary for building incredibly complex systems.
02:01:39.020 | It's possibly not sufficient,
02:01:42.140 | barring that we need a couple of breakthroughs.
02:01:45.060 | I think Rich Sutton mentioned search being part
02:01:48.580 | of the equation of scale and search.
02:01:52.260 | I think search, I've seen it,
02:01:55.420 | that's been more mixed in my experience,
02:01:57.300 | or from that lesson in particular,
02:01:59.340 | search is a bit more tricky
02:02:01.180 | because it is very appealing to search in domains like Go,
02:02:05.340 | where you have a clear reward function
02:02:07.460 | that you can then discard some search traces.
02:02:10.620 | But then in some other tasks,
02:02:12.940 | it's not very clear how you would do that.
02:02:15.260 | Although recently, one of our recent works,
02:02:18.620 | which actually was mostly mimicking or a continuation,
02:02:22.140 | and even the team and the people involved
02:02:23.700 | were pretty much very, like intersecting with AlphaStar,
02:02:27.220 | was AlphaCode, in which we actually saw the bitter lesson,
02:02:30.980 | how scale of the models,
02:02:32.620 | and then a massive amount of search,
02:02:34.260 | yielded this kind of very interesting result
02:02:36.780 | of being able to have human level code competition.
02:02:41.340 | So I've seen examples of it being
02:02:43.660 | literally mapped to search and scale.
02:02:46.380 | I'm not so convinced about the search bit,
02:02:48.140 | but certainly I'm convinced scale will be needed.
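For readers who want a picture of the "scale plus search" recipe mentioned here, below is a toy sketch in the sample-and-filter spirit publicly described for AlphaCode: draw many candidate programs and keep only those that pass the problem's example tests. The candidate pool, the toy execution, and the task are stand-ins for illustration; the real system samples from a large trained model and runs candidates in a sandbox.

```python
import random

# Toy "scale plus search": sample many candidates, filter by example tests.
CANDIDATE_POOL = [                                   # stand-in for samples from a code model
    "def solve(x): return x + 1",
    "def solve(x): return x * 2",
    "def solve(x): return x ** 2",
]

def sample_candidate():
    """Stand-in for drawing one program from a trained model."""
    return random.choice(CANDIDATE_POOL)

def passes(source, example_tests):
    """Run a candidate against the example tests (a real system sandboxes this)."""
    namespace = {}
    exec(source, namespace)                          # toy execution; unsafe for untrusted code
    return all(namespace["solve"](inp) == out for inp, out in example_tests)

def search(example_tests, num_samples=1_000):
    """Massive sampling plus filtering; survivors would then be deduplicated and submitted."""
    return [c for c in (sample_candidate() for _ in range(num_samples))
            if passes(c, example_tests)]

# Example: the hidden task is "double the input"; only one candidate survives.
print(set(search([(2, 4), (5, 10)])))
```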
02:02:50.900 | So we need general methods.
02:02:52.660 | We need to test them,
02:02:53.500 | and maybe we need to make sure that we can scale them,
02:02:56.140 | given the hardware that we have in practice,
02:02:59.100 | but then maybe we should also shape
02:03:00.940 | how the hardware looks like,
02:03:02.860 | based on which methods might be needed to scale.
02:03:05.620 | And that's an interesting contrast of this GPU comment,
02:03:10.620 | that is, we got it for free almost,
02:03:13.380 | because games were using this,
02:03:15.060 | but maybe now if sparsity is required,
02:03:19.500 | we don't have the hardware, although in theory,
02:03:21.860 | I mean, many people are building
02:03:23.180 | different kinds of hardware these days,
02:03:24.660 | but there's a bit of this notion of hardware lottery
02:03:27.780 | for scale that might actually have an impact,
02:03:31.260 | at least on the year, again, scale of years,
02:03:33.420 | on how fast we'll make progress
02:03:35.180 | to maybe a version of neural nets or whatever comes next
02:03:39.420 | that might enable truly intelligent agents.
02:03:44.420 | - Do you think in your lifetime,
02:03:46.100 | we will build an AGI system
02:03:49.500 | that would undeniably be a thing
02:03:54.020 | that achieves human level intelligence and goes far beyond?
02:03:57.460 | - I definitely think it's possible
02:04:02.340 | that it will go far beyond,
02:04:03.700 | but I'm definitely convinced
02:04:04.860 | that it will be human level intelligence.
02:04:08.060 | And I'm hypothesizing about the beyond
02:04:10.940 | because the beyond bit is a bit tricky to define,
02:04:15.940 | especially when we look at the current formula
02:04:19.980 | of starting from this imitation learning standpoint, right?
02:04:23.700 | So we can certainly imitate humans at language and beyond.
02:04:30.660 | So getting at human level through imitation
02:04:33.340 | feels very possible.
02:04:34.860 | Going beyond will require reinforcement learning
02:04:38.980 | and other things.
02:04:39.820 | And I think in some areas
02:04:41.620 | that certainly already has paid out.
02:04:43.500 | I mean, Go being an example that's my favorite so far
02:04:47.220 | in terms of going beyond human capabilities.
02:04:50.340 | But in general, I'm not sure we can define reward functions
02:04:55.340 | that from a seed of imitating human level intelligence
02:04:59.940 | that is general and then going beyond.
02:05:02.820 | That bit is not so clear in my lifetime,
02:05:05.140 | but certainly human level, yes.
02:05:08.100 | And I mean, that in itself is already quite powerful,
02:05:10.860 | I think.
02:05:11.700 | So going beyond, I think it's obviously not,
02:05:14.420 | we're not gonna not try that
02:05:16.060 | if then we get to superhuman scientist
02:05:19.860 | and discovery and advancing the world.
02:05:22.060 | But at least human level is also,
02:05:24.660 | in general, is also very, very powerful.
02:05:27.460 | - Well, especially if human level or slightly beyond
02:05:31.500 | is integrated deeply with human society
02:05:33.740 | and there's billions of agents like that,
02:05:36.460 | do you think there's a singularity moment
02:05:38.460 | beyond which our world will be just very deeply transformed
02:05:43.460 | by these kinds of systems?
02:05:45.620 | Because now you're talking about intelligence systems
02:05:47.780 | that are just, I mean, this is no longer just going
02:05:52.780 | from horse and buggy to the car.
02:05:56.420 | It feels like a very different kind of shift
02:05:59.780 | in what it means to be a living entity on earth.
02:06:03.300 | Are you afraid?
02:06:04.180 | Are you excited of this world?
02:06:06.300 | - I'm afraid if there's a lot more.
02:06:09.340 | So I think maybe we'll need to think about
02:06:13.020 | if we truly get there,
02:06:14.940 | just thinking of limited resources,
02:06:18.340 | like humanity clearly hits some limits
02:06:21.420 | and then there's some balance, hopefully,
02:06:23.420 | that biologically the planet is imposing
02:06:26.260 | and we should actually try to get better at this.
02:06:28.500 | As we know, there's quite a few issues
02:06:31.500 | with having too many people coexisting
02:06:35.740 | in a resource-limited way.
02:06:37.580 | So for digital entities, it's an interesting question.
02:06:40.300 | I think such a limit maybe should exist,
02:06:43.540 | but maybe it's gonna be imposed by energy availability
02:06:47.620 | because this also consumes energy.
02:06:49.700 | In fact, most systems are more inefficient
02:06:53.500 | than we are in terms of energy required.
02:06:55.980 | - Correct, yeah.
02:06:56.820 | - But definitely, I think as a society,
02:06:59.500 | we'll need to just work together
02:07:02.220 | to find what would be reasonable in terms of growth
02:07:06.380 | or how we coexist if that is to happen.
02:07:11.380 | I am very excited about, obviously,
02:07:14.660 | the aspects of automation that make people
02:07:17.700 | that obviously don't have access
02:07:19.020 | to certain resources or knowledge,
02:07:20.980 | for them to have that access.
02:07:23.900 | I think those are the applications in a way
02:07:26.260 | that I'm most excited to see and to personally work towards.
02:07:30.940 | - Yeah, there's going to be significant improvements
02:07:32.660 | in productivity and the quality of life
02:07:34.340 | across the whole population, which is very interesting.
02:07:36.980 | But I'm looking even far beyond
02:07:39.180 | us becoming a multi-planetary species.
02:07:42.660 | And just as a quick bet, last question,
02:07:45.340 | do you think as humans become multi-planetary species,
02:07:49.180 | go outside our solar system, all that kind of stuff,
02:07:52.460 | do you think there'll be more humans
02:07:54.420 | or more robots in that future world?
02:07:57.180 | So will humans be the quirky,
02:08:02.180 | intelligent being of the past,
02:08:04.460 | or is there something deeply fundamental
02:08:06.980 | to human intelligence that's truly special,
02:08:09.580 | where we will be part of those other planets,
02:08:12.100 | not just AI systems?
02:08:13.900 | - I think we're all excited to build AGI
02:08:18.660 | to empower or make us more powerful as human species.
02:08:23.660 | Not to say there might be some hybridization.
02:08:27.580 | I mean, this is obviously speculation,
02:08:29.700 | but there are companies also trying to,
02:08:32.500 | the same way medicine is making us better.
02:08:35.660 | Maybe there are other things that are yet to happen on that.
02:08:39.100 | But if the ratio is not at most one-to-one,
02:08:43.340 | I would not be happy.
02:08:44.580 | So I would hope that we are part of the equation,
02:08:49.220 | but maybe there's, maybe a one-to-one ratio
02:08:52.780 | feels like possible, constructive and so on,
02:08:56.220 | but it would not be good to have a misbalance,
02:08:59.620 | at least from my core beliefs
02:09:01.420 | and the why I'm doing what I'm doing when I go to work
02:09:05.180 | and I research what I research.
02:09:07.100 | - Well, this is how I know you're human,
02:09:09.500 | and this is how you've passed the Turing test.
02:09:12.700 | And you are one of the special humans, Oriol.
02:09:14.940 | It's a huge honor that you would talk with me,
02:09:17.060 | and I hope we get the chance to speak again,
02:09:19.900 | maybe once before the singularity, once after,
02:09:23.020 | and see how our view of the world changes.
02:09:25.420 | Thank you again for talking today.
02:09:26.540 | Thank you for the amazing work you do.
02:09:28.140 | You're a shining example of a researcher
02:09:31.300 | and a human being in this community.
02:09:32.900 | - Thanks a lot, Lex.
02:09:34.020 | Yeah, looking forward to before the singularity, certainly.
02:09:36.780 | (Lex laughs)
02:09:37.820 | - And maybe after.
02:09:38.980 | Thanks for listening to this conversation
02:09:41.460 | with Oriol Vinyals.
02:09:43.100 | To support this podcast,
02:09:44.260 | please check out our sponsors in the description.
02:09:46.940 | And now, let me leave you with some words from Alan Turing.
02:09:50.060 | "Those who can imagine anything can create the impossible."
02:09:55.140 | Thank you for listening, and hope to see you next time.
02:09:59.180 | (upbeat music)
02:10:01.780 | (upbeat music)