
Ilya Sutskever (OpenAI) and Jensen Huang (NVIDIA CEO): AI Today and Vision of the Future (3/2023)



00:00:00.000 | Ilya, unbelievable.
00:00:03.000 | Today is the day after GPT-4.
00:00:06.000 | (laughs)
00:00:07.300 | It's great to have you here.
00:00:09.300 | I'm delighted to have you.
00:00:10.680 | I've known you a long time.
00:00:12.180 | The journey, and just my mental memory
00:00:14.840 | of the time that I've known you
00:00:19.520 | and the seminal work that you have done
00:00:22.680 | starting in University of Toronto,
00:00:26.060 | the co-invention of AlexNet with Alex Krizhevsky and Geoff Hinton
00:00:32.760 | that led to the big bang of modern artificial intelligence,
00:00:38.360 | your career that took you out here to the Bay Area,
00:00:41.600 | the founding of OpenAI, GPT-1, 2, 3,
00:00:46.400 | and then, of course, ChatGPT,
00:00:49.280 | the AI heard around the world.
00:00:52.320 | This is the incredible resume of a young computer scientist,
00:00:57.580 | you know, with an entire community and industry
00:01:00.160 | in awe of your achievements.
00:01:02.460 | I guess I just want to go back to the beginning
00:01:05.580 | and ask you, deep learning,
00:01:08.920 | what was your intuition around deep learning?
00:01:11.420 | Why did you know that it was going to work?
00:01:13.920 | Did you have any intuition
00:01:14.920 | that it was going to lead to this kind of success?
00:01:17.420 | Okay, well, first of all,
00:01:19.840 | thank you so much for all the kind words.
00:01:24.520 | A lot has changed
00:01:26.180 | thanks to the incredible power of deep learning.
00:01:31.220 | Like, I think my personal starting point,
00:01:34.840 | I was interested in artificial intelligence
00:01:36.600 | for a whole variety of reasons,
00:01:39.760 | starting from an intuitive understanding
00:01:43.060 | and appreciation of its impact,
00:01:46.060 | and also I had a lot of curiosity
00:01:47.720 | about what is consciousness,
00:01:49.720 | what is the human experience,
00:01:51.940 | and it felt like progress in artificial intelligence
00:01:54.820 | will help with that.
00:01:57.240 | The next step was,
00:01:58.740 | well, back then, I was starting in 2002, 2003,
00:02:03.000 | and it seemed like learning is the thing
00:02:05.040 | that humans can do, that people can do,
00:02:08.340 | that computers can't do at all.
00:02:11.460 | In 2003, 2002,
00:02:14.460 | computers could not learn anything,
00:02:17.520 | and it wasn't even clear that it was possible in theory.
00:02:21.520 | And so I thought that making progress in learning,
00:02:27.360 | in artificial learning, in machine learning,
00:02:30.060 | that would lead to the greatest progress in AI.
00:02:33.400 | And then I started to look around for what was out there,
00:02:37.200 | and nothing seemed too promising.
00:02:39.800 | But to my great luck,
00:02:42.040 | Geoff Hinton was a professor at my university,
00:02:45.280 | and I was able to find him,
00:02:47.240 | and he was working in neural networks,
00:02:48.780 | and it immediately made sense,
00:02:50.780 | because neural networks had the property
00:02:54.160 | that, when we are learning,
00:02:55.900 | we are automatically programming parallel computers.
00:03:00.040 | Back then, the parallel computers were small,
00:03:02.700 | but the promise was, if you could somehow figure out
00:03:05.360 | how learning in neural networks works,
00:03:07.900 | then you can program small parallel computers from data.
00:03:11.280 | And it was also similar enough to the brain,
00:03:13.360 | and the brain works,
00:03:14.520 | so it's like you had these several factors going for it.
00:03:17.640 | Now, it wasn't clear how to get it to work,
00:03:21.840 | but of all the things that existed,
00:03:24.800 | that seemed like it had by far the greatest long-term promise.
00:03:28.760 | Even though, you know—
00:03:29.760 | At the time that you first started,
00:03:31.260 | at the time that you first started working
00:03:32.520 | with deep learning and neural networks,
00:03:35.300 | what was the scale of the network?
00:03:37.300 | What was the scale of computing at that moment in time?
00:03:39.600 | What was it like?
00:03:40.760 | An interesting thing to note
00:03:42.400 | was that the importance of scale wasn't realized back then.
00:03:45.980 | So people would just train, you know,
00:03:47.440 | neural networks with like 50 neurons, 100 neurons,
00:03:50.640 | several hundred neurons that would be like
00:03:53.020 | a big neural network.
00:03:55.140 | A million parameters would be considered very large.
00:03:58.360 | We would run our models on unoptimized CPU code,
00:04:02.360 | because we were a bunch of researchers.
00:04:04.360 | We didn't know about BLAS.
00:04:06.140 | We used MATLAB.
00:04:07.060 | At least MATLAB's matrix operations were optimized.
00:04:10.180 | And we'd just experiment,
00:04:11.980 | you know, what is even the right question to ask, you know?
00:04:14.480 | So you try
00:04:17.400 | to just find interesting phenomena,
00:04:19.360 | interesting observations.
00:04:21.600 | You can do this small thing,
00:04:22.900 | and you can do that small thing.
00:04:24.480 | You know, Geoff Hinton was really excited
00:04:26.640 | about training neural nets on small little digits,
00:04:32.440 | both for classification,
00:04:33.640 | and also he was very interested in generating them.
00:04:36.440 | So the beginnings of generative models were right there.
00:04:39.820 | But the question is like, okay,
00:04:40.980 | there's all this cool stuff floating around.
00:04:43.400 | What really gets traction?
00:04:45.680 | And so
00:04:47.640 | it wasn't obvious that this was the right question
00:04:51.240 | back then,
00:04:52.400 | but in hindsight, that turned out to be the right question.
00:04:55.440 | - Now, the year of AlexNet was 2012.
00:04:59.780 | - Yes. - 2012.
00:05:01.400 | Now, you and Alex were working on AlexNet
00:05:04.740 | for some time before then.
00:05:07.100 | And at what point was it clear to you
00:05:12.100 | that you wanted to build
00:05:14.600 | a computer vision-oriented neural network
00:05:17.800 | that ImageNet was the right set of data to go for,
00:05:20.860 | and to somehow go for the computer vision contest?
00:05:25.860 | - Yeah.
00:05:28.220 | So I can talk about the context there.
00:05:30.100 | It's, I think, probably two years before that,
00:05:36.440 | it became clear to me that supervised learning
00:05:41.220 | is what's going to get us the traction.
00:05:43.360 | And I can explain precisely why.
00:05:45.820 | It wasn't just an intuition.
00:05:47.680 | It was, I would argue, an irrefutable argument,
00:05:51.680 | which went like this.
00:05:53.360 | If your neural network is deep and large,
00:05:57.300 | then it could be configured to solve a hard task.
00:06:03.360 | So that's the key word, deep and large.
00:06:06.860 | People weren't looking at large neural networks.
00:06:08.860 | People were maybe studying a little bit of depth
00:06:11.560 | in neural networks,
00:06:12.740 | but most of the machine learning field
00:06:14.180 | wasn't even looking at neural networks at all.
00:06:16.060 | They were looking at all kinds of Bayesian models
00:06:18.100 | and kernel methods,
00:06:19.940 | which are theoretically elegant methods,
00:06:22.700 | which have the property that
00:06:24.640 | they actually can't represent a good solution
00:06:27.320 | no matter how you configure them.
00:06:29.320 | Whereas the large and deep neural network
00:06:31.440 | can represent a good solution to the problem.
00:06:34.140 | To find the good solution,
00:06:36.900 | you need a big data set that demands it,
00:06:40.400 | and a lot of compute to actually do the work.
00:06:44.360 | We had also made advances;
00:06:45.740 | we had worked on optimization for a little bit.
00:06:49.320 | It was clear that optimization is a bottleneck,
00:06:53.140 | and there was a breakthrough by another grad student
00:06:55.560 | in Geoff Hinton's lab called James Martens,
00:06:58.860 | and he came up with an optimization method
00:07:00.440 | which is different from the one we're using now.
00:07:03.100 | Some second-order method.
00:07:05.780 | But the point about it is that it proved
00:07:08.140 | that we can train those neural networks,
00:07:09.600 | because before we didn't even know we could train them.
00:07:12.140 | So if you can train them, you make it big,
00:07:14.900 | you find the data, and you will succeed.
00:07:17.360 | So then the next question is,
00:07:18.820 | well, what data?
00:07:20.600 | And the ImageNet data set,
00:07:21.900 | back then it seemed like this unbelievably difficult data set.
00:07:25.240 | But it was clear that if you were to train
00:07:28.740 | a large convolutional neural network on this data set,
00:07:30.980 | it must succeed if you just had the compute.
00:07:34.480 | - And right at that time,
00:07:35.980 | - GPUs came out. - you and I,
00:07:38.440 | our history and our paths intersected,
00:07:42.740 | and somehow you had this observation about the GPU,
00:07:47.740 | and at that time,
00:07:49.440 | this was a couple of generations into our CUDA GPUs,
00:07:52.900 | and I think it was the GTX 580 generation.
00:07:56.560 | You had the insight that the GPU could actually be useful
00:08:00.800 | for training your neural network models.
00:08:02.760 | What was that, how did that day start?
00:08:05.140 | Tell me, you never told me about that moment.
00:08:08.140 | How did that day start?
00:08:09.560 | - Yeah, so, you know, the GPUs
00:08:14.180 | appeared in our lab, in our Toronto lab,
00:08:18.440 | thanks to Geoff, and he said,
00:08:19.720 | "We should try these GPUs."
00:08:21.300 | And we started trying and experimenting with them.
00:08:24.300 | And it was a lot of fun,
00:08:27.300 | but it was unclear what to use them for exactly.
00:08:31.640 | Where are you going to get the real traction?
00:08:33.880 | But then, with the existence of the ImageNet data set,
00:08:39.140 | and then it was also very clear
00:08:42.800 | that the convolutional neural network
00:08:44.300 | is such a great fit for the GPU,
00:08:46.640 | so it should be possible to make it go unbelievably fast,
00:08:50.380 | and therefore train something
00:08:52.380 | which would be completely unprecedented
00:08:54.140 | in terms of its size.
00:08:55.300 | And that's how it happened,
00:08:58.840 | and, you know, very fortunately, Alex Krizhevsky,
00:09:01.960 | he really loved programming the GPU.
00:09:04.800 | (laughing)
00:09:06.340 | And he was able to do it, he was able to code,
00:09:09.600 | to program really fast convolutional kernels.
00:09:13.880 | And then, and train the neural net
00:09:20.220 | on the ImageNet data set, and that led to the result.
00:09:22.840 | But it was like--
00:09:23.680 | - It shocked the world.
00:09:24.980 | - It shocked the world.
00:09:26.640 | It broke the record of computer vision
00:09:29.300 | by such a wide margin that it was a clear discontinuity.
00:09:33.840 | - Yeah. - Yeah.
00:09:34.680 | - And I would say
00:09:36.180 | there is another bit of context there.
00:09:38.100 | It's not so much, like, when you say break the record,
00:09:43.880 | I think there's a different way to phrase it.
00:09:46.060 | It's that that data set was so obviously hard,
00:09:50.980 | and so obviously outside of reach of anything.
00:09:54.400 | People were making progress with some classical techniques,
00:09:56.720 | and they were actually doing something.
00:09:58.980 | But this thing was so much better on the data set,
00:10:01.600 | which was so obviously hard.
00:10:03.900 | It's not just that it's just some competition.
00:10:06.600 | It was a competition which, back in the day--
00:10:09.060 | - It wasn't an average benchmark.
00:10:10.600 | - It was so obviously difficult,
00:10:13.720 | and so obviously out of reach,
00:10:15.760 | and so obviously with the property
00:10:18.720 | that if you did a good job, that would be amazing.
00:10:21.880 | - Big bang of AI.
00:10:23.300 | Fast forward to now.
00:10:25.260 | You came out to the Valley.
00:10:26.880 | You started OpenAI with some friends.
00:10:29.720 | You were the chief scientist.
00:10:31.920 | Now, what was the first initial idea
00:10:35.020 | about what to work on at OpenAI?
00:10:36.880 | Because you guys worked on several things.
00:10:38.840 | Some of the trails of inventions and work,
00:10:43.420 | you could see, led up to the ChatGPT moment.
00:10:48.420 | But what was the initial inspiration?
00:10:53.180 | How would you approach intelligence from that moment
00:10:56.760 | that led to this?
00:10:58.460 | - Yeah.
00:10:59.640 | So, obviously, when we started,
00:11:03.000 | it wasn't 100% clear how to proceed.
00:11:07.800 | And the field was also very different
00:11:11.460 | compared to the way it is right now.
00:11:13.640 | So right now, we're already used to it;
00:11:16.840 | you have these amazing artifacts,
00:11:20.220 | these amazing neural nets that are doing incredible things,
00:11:23.260 | and everyone is so excited.
00:11:25.840 | But back in 2015, 2016, early 2016,
00:11:29.960 | when we were starting out,
00:11:31.260 | the whole thing seemed pretty crazy.
00:11:35.380 | There were so many fewer researchers,
00:11:39.180 | maybe there were between 100 and 1,000 times
00:11:42.260 | fewer people in the field compared to now.
00:11:44.760 | Like back then, you had like 100 people,
00:11:49.140 | most of them were working in Google/DeepMind,
00:11:52.560 | and that was that.
00:11:53.980 | And then there were people picking up the skills,
00:11:55.860 | but it was very, very scarce, very rare still.
00:11:58.220 | And we had two big initial ideas
00:12:04.680 | at the start of OpenAI
00:12:07.820 | that had a lot of staying power,
00:12:10.160 | and they stayed with us to this day.
00:12:12.280 | And I'll describe them right now.
00:12:13.940 | The first big idea that we had,
00:12:17.600 | one which I was especially excited about very early on,
00:12:22.600 | is the idea of unsupervised learning through compression.
00:12:28.520 | Some context.
00:12:33.780 | Today, we take it for granted
00:12:36.400 | that unsupervised learning is this easy thing
00:12:38.140 | and you just pre-train on everything
00:12:39.760 | and it all does exactly as you'd expect.
00:12:41.760 | In 2016, unsupervised learning
00:12:46.840 | was an unsolved problem in machine learning
00:12:50.600 | that no one had any insight,
00:12:54.060 | any clue as to what to do.
00:12:56.400 | Yann LeCun would go around and give talks
00:12:59.600 | saying that you have this grand challenge
00:13:01.900 | in unsupervised learning.
00:13:04.340 | And I really believed that really good compression
00:13:08.720 | of the data will lead to unsupervised learning.
00:13:11.020 | Now, compression is not language that's commonly used
00:13:16.580 | to describe what is really being done until recently,
00:13:20.800 | when suddenly it became apparent to many people
00:13:23.300 | that those GPTs actually compress the training data.
00:13:26.460 | You may recall the Ted Chiang New Yorker article
00:13:30.100 | which also alluded to this.
00:13:31.920 | But there is a real mathematical sense
00:13:34.220 | in which training these autoregressive generative models
00:13:38.540 | compresses the data.
00:13:40.260 | And intuitively, you can see why that should work.
00:13:43.300 | If you compress the data really well,
00:13:45.120 | you must extract all the hidden secrets which exist in it.
00:13:48.160 | Therefore, that is the key.
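A minimal sketch of the connection being described, assuming a hypothetical next-token model that assigns the probabilities listed below: under arithmetic coding, a predictor that gives probability p to the token that actually occurred can encode that token in -log2(p) bits, so better prediction is literally shorter compression.

```python
import math

# Hypothetical per-token probabilities that some next-token model assigns
# to the tokens that actually occurred (illustrative values only).
token_probs = [0.25, 0.5, 0.9, 0.6, 0.95]

# Arithmetic coding cost: -log2(p) bits per token.
bits = sum(-math.log2(p) for p in token_probs)
print(f"{bits:.2f} bits to encode {len(token_probs)} tokens")

# Minimizing next-token log-loss is minimizing this total,
# i.e. the compressed length of the training data.
```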
00:13:50.720 | So that was the first idea that we were really excited about.
00:13:54.420 | And that led to quite a few works at OpenAI,
00:13:59.920 | like the sentiment neuron, which I'll mention very briefly.
00:14:04.920 | This work might not be well known
00:14:09.580 | outside of the machine learning field,
00:14:12.300 | but it was very influential, especially in our thinking.
00:14:15.100 | This work, the result there,
00:14:21.080 | was that when you train a neural network,
00:14:25.020 | back then it was not a transformer,
00:14:26.340 | it was before the transformer:
00:14:27.980 | a small recurrent neural network, an LSTM,
00:14:30.560 | for those who remember.
00:14:31.400 | - Sequence work, you've done,
00:14:32.440 | I mean, this is some of the work
00:14:34.320 | that you've done yourself, yeah.
00:14:36.600 | - So the same LSTM with a few twists,
00:14:39.680 | trained to predict the next token in Amazon reviews,
00:14:42.760 | next character.
00:14:44.140 | And we discovered that if you predict
00:14:46.800 | the next character well enough,
00:14:48.960 | there will be a neuron inside that LSTM
00:14:52.360 | that corresponds to its sentiment.
00:14:54.060 | So that was really cool,
00:14:57.040 | because it showed some traction
00:15:00.060 | for unsupervised learning,
00:15:01.640 | and it validated the idea that really good
00:15:06.160 | next character prediction,
00:15:09.220 | next something prediction, compression,
00:15:11.340 | has the property that it discovers
00:15:14.480 | the secrets in the data.
00:15:16.120 | That's what we see with these GPT models, right?
00:15:17.880 | You train, and people say,
00:15:19.620 | "It's just statistical correlation."
00:15:20.940 | I mean, at this point, it should be so clear to anyone.
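For concreteness, a minimal sketch of the kind of setup described above: a character-level LSTM trained only on next-character prediction. The sizes and the random batch are placeholders, not the original configuration from the sentiment-neuron work.

```python
import torch
import torch.nn as nn

vocab_size, hidden_size = 256, 512  # placeholder sizes

embed = nn.Embedding(vocab_size, hidden_size)
lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
head = nn.Linear(hidden_size, vocab_size)
loss_fn = nn.CrossEntropyLoss()

# One training step on a fake batch of byte-encoded review text.
batch = torch.randint(0, vocab_size, (8, 128))      # (batch, time)
inputs, targets = batch[:, :-1], batch[:, 1:]       # predict the next character

hidden_states, _ = lstm(embed(inputs))              # (batch, time-1, hidden)
logits = head(hidden_states)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()

# The finding: after enough training on real reviews, a single unit of
# `hidden_states` ends up tracking sentiment, even though sentiment
# was never a training target.
```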
00:15:22.980 | - That observation also,
00:15:25.340 | for me, intuitively,
00:15:27.080 | opened up the whole world of,
00:15:28.780 | where do I get the data for unsupervised learning?
00:15:33.060 | Because I do have a whole lot of data.
00:15:35.100 | If I could just make you predict the next character,
00:15:37.660 | and I know what the ground truth is,
00:15:39.420 | I know what the answer is,
00:15:40.740 | I could train a neural network model with that.
00:15:43.280 | So that observation,
00:15:45.140 | and masking, and other technologies, other approaches,
00:15:48.860 | opened my mind about,
00:15:50.460 | where would the world get all the data
00:15:52.500 | that's unsupervised for unsupervised learning?
00:15:54.700 | - Well, I think,
00:15:56.540 | so I would phrase it a little differently.
00:15:58.880 | I would say that with unsupervised learning,
00:16:02.060 | the hard part has been less around
00:16:05.660 | where you get the data from,
00:16:08.300 | though that part is there as well, especially now.
00:16:11.860 | But it was more about,
00:16:13.680 | why should you do it in the first place?
00:16:16.820 | Why should you bother?
00:16:17.900 | The hard part was to realize
00:16:21.900 | that training these neural nets to predict the next token
00:16:26.740 | is a worthwhile goal at all.
00:16:29.100 | That was the goal. - That it would learn
00:16:29.940 | a representation,
00:16:31.460 | that it would be able to understand.
00:16:33.980 | - That's right, that it will be useful.
00:16:35.500 | - Grammar and, yeah.
00:16:37.480 | - But to actually,
00:16:38.320 | but it just wasn't obvious.
00:16:41.500 | So people weren't doing it.
00:16:42.860 | But the sentiment neuron work,
00:16:44.980 | and I want to call out Alec Radford
00:16:47.140 | as a person who really was responsible
00:16:50.260 | for many of the advances there,
00:16:51.860 | the sentiment, this was before GPT-1,
00:16:56.820 | it was the precursor to GPT-1,
00:16:58.500 | and it influenced our thinking a lot.
00:17:00.900 | Then the transformer came out,
00:17:03.020 | and we immediately went,
00:17:03.940 | oh my God, this is the thing.
00:17:05.380 | And we trained GPT-1.
00:17:09.320 | - Now, along the way,
00:17:11.460 | you've always believed that scaling
00:17:16.460 | will improve the performance of these models.
00:17:18.660 | - Yes.
00:17:19.500 | Larger networks, deeper networks,
00:17:22.780 | more training data would scale that.
00:17:24.660 | There was a very important paper
00:17:27.460 | that OpenAI wrote about the scaling laws
00:17:30.380 | and the relationship between loss
00:17:33.100 | and the size of the model
00:17:35.260 | and the amount of data,
00:17:36.540 | the size of the data set.
00:17:38.340 | When transformers came out,
00:17:39.740 | it gave us the opportunity
00:17:41.060 | to train very, very large models
00:17:43.220 | in a very reasonable amount of time.
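The paper being referred to is presumably "Scaling Laws for Neural Language Models" (Kaplan et al., 2020), which fit empirical power laws of roughly the following form, where N is the number of parameters, D the number of training tokens, and the exponents are the paper's approximate fits:

$$ L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad \alpha_N \approx 0.076,\ \alpha_D \approx 0.095 $$

Loss falls predictably as model size and data grow, which is what made "train very, very large models" a plan rather than a gamble.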
00:17:45.900 | - But did the intuition
00:17:49.300 | about the scaling laws
00:17:51.340 | and the size of models and data
00:17:53.780 | and your journey of GPT-1, 2, 3,
00:17:58.580 | which came first?
00:17:59.700 | Did you see the evidence of GPT-1 through 3 first,
00:18:02.220 | or was it intuition about the scaling law first?
00:18:06.220 | - The intuition, so I would say
00:18:08.340 | that the way I'd phrase it
00:18:10.940 | is that I had a very strong belief
00:18:13.580 | that bigger is better.
00:18:15.860 | And that one of the goals
00:18:19.580 | that we had at OpenAI
00:18:21.180 | is to figure out how to use the scale correctly.
00:18:25.220 | There was a lot of belief in OpenAI
00:18:27.460 | about scale from the very beginning.
00:18:29.260 | The question is what to use it for precisely.
00:18:33.740 | 'Cause I'll mention,
00:18:35.540 | right now we're talking about the GPTs,
00:18:36.820 | but there's another very important line of work
00:18:38.660 | which I haven't mentioned,
00:18:39.620 | the second big idea,
00:18:41.140 | but I think now is a good time to make a detour,
00:18:43.300 | and that's reinforcement learning.
00:18:45.100 | That clearly seemed important as well.
00:18:48.780 | What do you do with it?
00:18:50.540 | So the first really big project
00:18:54.380 | that was done inside OpenAI
00:18:55.780 | was our effort
00:19:00.460 | at solving a real-time strategy game.
00:19:03.220 | And for context,
00:19:05.100 | a real-time strategy game is like,
00:19:07.220 | it's a competitive sport.
00:19:08.380 | - Yeah, right.
00:19:09.220 | - You need to be smart,
00:19:10.820 | you need to be fast,
00:19:11.820 | you need to have a quick reaction time,
00:19:13.300 | there's teamwork,
00:19:14.740 | and you're competing against another team.
00:19:17.180 | And it's pretty involved.
00:19:20.260 | And there is a whole competitive league for that game.
00:19:24.380 | The game is called Dota 2.
00:19:26.540 | And so we trained a reinforcement learning agent
00:19:28.940 | to play against itself,
00:19:32.460 | with the goal of reaching a level
00:19:38.020 | so that it could compete against the best
00:19:40.540 | players in the world.
00:19:42.180 | And that was a major undertaking as well.
00:19:44.300 | It was a very different line.
00:19:45.540 | It was reinforcement learning.
00:19:46.980 | - Yeah, I remember the day
00:19:47.820 | that you guys announced that work.
00:19:50.380 | And this is, by the way,
00:19:51.740 | when I was asking earlier about,
00:19:53.820 | there's a large body of work
00:19:55.660 | that has come out of OpenAI.
00:19:56.860 | Some of it seemed like detours,
00:19:59.620 | but in fact, as you're explaining now,
00:20:03.700 | those seeming detours
00:20:04.980 | really led up to some of the important work
00:20:07.100 | that we're now talking about, ChatGPT.
00:20:09.260 | - Yeah.
00:20:10.100 | I mean, there has been real convergence
00:20:12.660 | where the GPTs
00:20:15.220 | produce the foundation.
00:20:16.940 | And the reinforcement learning of Dota
00:20:20.020 | morphed into reinforcement learning from human feedback.
00:20:22.980 | - That's right.
00:20:23.820 | - And that combination gave us ChatGPT.
00:20:26.260 | - You know, there's a misunderstanding
00:20:28.980 | that ChatGPT is in itself
00:20:33.340 | just one giant large language model.
00:20:36.020 | There's a system around it that's fairly complicated.
00:20:38.660 | Could you explain briefly for the audience
00:20:43.180 | the fine tuning of it,
00:20:45.940 | the reinforcement learning of it,
00:20:47.620 | the various surrounding systems
00:20:52.140 | that allows you to keep it on rails
00:20:54.860 | that allow you to keep it on rails
00:20:59.860 | - Yeah, I can.
00:21:02.580 | So the way to think about it
00:21:05.340 | is that when we train a large neural network
00:21:09.580 | to accurately predict the next word
00:21:11.700 | in lots of different texts from the internet,
00:21:16.500 | what we are doing is that we are learning a world model.
00:21:20.940 | It looks like we are learning this.
00:21:22.580 | It may look on the surface
00:21:25.500 | that we are just learning statistical correlations in text,
00:21:28.500 | but it turns out that to just learn
00:21:33.460 | the statistical correlations in text,
00:21:35.780 | to compress them really well,
00:21:37.900 | what the neural network learns
00:21:40.100 | is some representation of the process that produced the text.
00:21:44.580 | This text is actually a projection of the world.
00:21:49.460 | There is a world out there
00:21:51.340 | and it has a projection on this text.
00:21:54.100 | And so what the neural network is learning
00:21:56.220 | is more and more aspects of the world,
00:21:59.780 | of people, of the human condition,
00:22:02.180 | their hopes, dreams, and motivations,
00:22:05.860 | their interactions in the situations that we are in.
00:22:10.540 | And the neural network learns a compressed,
00:22:13.260 | abstract, usable representation of that.
00:22:17.540 | This is what's being learned
00:22:19.500 | from accurately predicting the next word.
00:22:22.300 | And furthermore, the more accurate you are
00:22:24.940 | at predicting the next word,
00:22:27.180 | the higher the fidelity,
00:22:29.500 | the more resolution you get in this process.
00:22:32.380 | So that's what the pre-training stage does.
00:22:34.540 | But what this does not do
00:22:37.540 | is specify the desired behavior
00:22:41.540 | that we wish our neural network to exhibit.
00:22:44.740 | You see, a language model,
00:22:47.700 | what it really tries to do
00:22:49.900 | is to answer the following question.
00:22:51.700 | If I had some random piece of text on the internet,
00:22:56.660 | which starts with some prefix, some prompt,
00:23:00.020 | what will it complete to?
00:23:03.660 | If you just randomly ended up
00:23:05.620 | on some text from the internet?
00:23:07.580 | But this is different from,
00:23:09.420 | well, I want to have an assistant
00:23:10.700 | which will be truthful, that will be helpful,
00:23:14.180 | that will follow certain rules and not violate them.
00:23:17.900 | That requires additional training.
00:23:19.980 | This is where the fine-tuning
00:23:22.420 | and the reinforcement learning from human teachers
00:23:25.380 | and other forms of AI assistance come in.
00:23:27.380 | It's not just reinforcement learning from human teachers.
00:23:30.140 | It's also reinforcement learning
00:23:31.100 | from human and AI collaboration.
00:23:33.740 | Our teachers are working together with an AI
00:23:35.540 | to teach our AI to behave.
00:23:37.060 | But here we are not teaching it new knowledge.
00:23:40.260 | This is not what's happening.
00:23:41.980 | We are teaching it, we are communicating with it.
00:23:46.540 | We are communicating to it
00:23:48.900 | what it is that we want it to be.
00:23:50.940 | And this process, this second stage,
00:23:54.420 | is also extremely important.
00:23:56.500 | The better we do the second stage,
00:23:58.300 | the more useful, the more reliable
00:24:00.620 | this neural network will be.
00:24:02.300 | So the second stage is extremely important too,
00:24:04.940 | in addition to the first stage of the learn everything,
00:24:08.580 | learn everything, learn as much as you can about the world
00:24:12.860 | from the projection of the world, which is text.
00:24:16.580 | - Now you could tell, you could fine-tune it,
00:24:19.300 | you could instruct it to perform certain things.
00:24:23.380 | Can you instruct it to not perform certain things
00:24:25.700 | so that you could give it guardrails
00:24:27.180 | about avoid these type of behavior,
00:24:29.420 | give it some kind of a bounding box
00:24:31.940 | so that it doesn't wander out of that bounding box
00:24:35.660 | and perform things that are unsafe or otherwise?
00:24:40.660 | - Yeah.
00:24:41.900 | So this second stage of training
00:24:45.060 | is indeed where we communicate to the neural network
00:24:48.940 | anything we want, which includes the bounding box.
00:24:53.420 | And the better we do this training,
00:24:55.620 | the higher the fidelity
00:24:57.500 | with which we communicate this bounding box.
00:24:59.980 | And so with constant research and innovation
00:25:02.460 | on improving this fidelity,
00:25:04.620 | we are able to improve this fidelity,
00:25:09.620 | and so it becomes more and more reliable
00:25:13.220 | and precise in the way in which it follows
00:25:16.620 | the intended instructions.
00:25:19.660 | - ChatGPT came out just a few months ago.
00:25:21.980 | Fastest growing application in the history of humanity.
00:25:27.900 | Lots of interpretations about why,
00:25:34.540 | but some of the things that is clear,
00:25:38.780 | it is the easiest application
00:25:41.580 | that anyone has ever created for anyone to use.
00:25:45.340 | It performs tasks, it performs things,
00:25:49.580 | it does things that are beyond people's expectation.
00:25:53.900 | Anyone can use it.
00:25:55.180 | There are no instruction sets.
00:25:57.060 | There are no wrong ways to use it.
00:25:58.740 | You just use it.
00:26:00.700 | And if your instructions or prompts are ambiguous,
00:26:05.700 | the conversation refines the ambiguity
00:26:08.620 | until your intents are understood by the application,
00:26:13.820 | by the AI.
00:26:14.660 | The impact, of course, clearly remarkable.
00:26:20.980 | Now, yesterday GPT-4 came out,
00:26:25.420 | just a few months later.
00:26:28.020 | The performance of GPT-4 in many areas, astounding.
00:26:33.020 | SAT scores, GRE scores, bar exams,
00:26:38.580 | the number of tests that it's able to perform
00:26:43.660 | at very capable levels, very capable human levels, astounding.
00:26:48.180 | What were the major differences between ChatGPT and GPT-4
00:26:54.180 | that led to its improvements in these areas?
00:26:59.340 | So GPT-4
00:27:01.900 | is a pretty substantial improvement on top of ChatGPT
00:27:09.620 | across very many dimensions.
00:27:12.940 | We trained GPT-4, I would say,
00:27:15.340 | more than six months ago,
00:27:19.780 | maybe eight months ago, I don't remember exactly.
00:27:23.220 | The first big difference between ChatGPT and GPT-4,
00:27:30.020 | and that perhaps is the most important difference,
00:27:35.900 | is that the base on top of which GPT-4 is built
00:27:39.420 | predicts the next word with greater accuracy.
00:27:43.260 | This is really important,
00:27:45.900 | because the better a neural network
00:27:48.260 | can predict the next word in text,
00:27:50.580 | the more it understands it.
00:27:52.380 | This claim is now perhaps accepted by many at this point,
00:27:57.220 | but it might still not be intuitive
00:27:59.500 | or not completely intuitive as to why that is.
00:28:02.340 | So I'd like to take a small detour
00:28:04.220 | and to give an analogy that will hopefully clarify
00:28:07.500 | why more accurate prediction of the next word
00:28:10.860 | leads to more understanding, real understanding.
00:28:13.500 | Let's consider an example.
00:28:17.220 | Say you read a detective novel.
00:28:19.700 | It's like a complicated plot, a storyline,
00:28:22.180 | different characters, lots of events,
00:28:25.220 | mysteries like clues, it's unclear.
00:28:28.140 | Then let's say that at the last page of the book,
00:28:31.820 | the detective has got all the clues,
00:28:33.780 | gathered all the people and saying,
00:28:34.940 | "Okay, I'm going to reveal the identity
00:28:37.740 | of whoever committed the crime."
00:28:40.340 | And that person's name is?
00:28:42.180 | - Predict that word.
00:28:43.020 | - Predict that word, exactly.
00:28:44.340 | - My goodness.
00:28:45.180 | - Right? - Yeah, right.
00:28:46.140 | - Now, there are many different words,
00:28:48.660 | but by predicting those words better and better and better,
00:28:51.500 | the understanding of the text keeps on increasing.
00:28:55.780 | GPT-4 predicts the next word better.
00:28:58.980 | - Ilya, people say that deep learning won't lead to reasoning.
00:29:04.420 | - That deep learning won't lead to reasoning.
00:29:06.820 | But in order to predict that next word,
00:29:09.380 | figure out from all of the agents that were there
00:29:13.180 | and all of their strengths or weaknesses or their intentions
00:29:18.180 | and the context, and to be able to predict that word,
00:29:23.780 | who was the murderer,
00:29:27.020 | that requires some amount of reasoning,
00:29:28.660 | a fair amount of reasoning.
00:29:30.180 | And so how is it that it's able to learn reasoning?
00:29:35.180 | And if it learned reasoning,
00:29:40.180 | you know, one of the things that I was going to ask you
00:29:43.460 | is of all the tests that were taken
00:29:45.780 | between ChatGPT and GPT-4,
00:29:48.860 | there were some tests that GPT-3
00:29:52.220 | or ChatGPT was already very good at.
00:29:55.060 | There were some tests that GPT-3 or ChatGPT
00:29:57.940 | was not as good at that GPT-4 was much better at.
00:30:02.820 | And there were some tests that neither are good at yet.
00:30:06.060 | I would love for it, you know,
00:30:07.140 | and some of it has to do with reasoning, it seems,
00:30:09.940 | that, you know, maybe in calculus,
00:30:12.100 | that it wasn't able to break maybe the problem down
00:30:15.780 | into its reasonable steps and solve it.
00:30:18.540 | But yet in some areas,
00:30:21.980 | it seems to demonstrate reasoning skills.
00:30:24.980 | And so is that an area that in predicting the next word,
00:30:29.980 | you're learning reasoning?
00:30:32.660 | And what are the limitations now of GPT-4
00:30:37.660 | that would enhance its ability to reason even further?
00:30:41.180 | You know, reasoning isn't this super well-defined concept,
00:30:47.460 | but we can try to define it anyway,
00:30:51.780 | which is, maybe, when you go further,
00:30:56.580 | when you're able to somehow think about it a little bit
00:31:00.700 | and get a better answer because of your reasoning.
00:31:03.220 | And I'd say that our neural nets,
00:31:08.900 | you know, maybe there is some kind of limitation
00:31:11.300 | which could be addressed by, for example,
00:31:14.620 | asking the neural network to think out loud.
00:31:16.820 | This has proven to be extremely effective for reasoning,
00:31:20.020 | but I think it also remains to be seen
00:31:21.860 | just how far the basic neural network will go.
00:31:24.260 | I think we have yet to tap, fully tap out its potential.
00:31:29.260 | But yeah, I mean, there is definitely some sense
00:31:36.420 | where reasoning is still not quite at that level
00:31:41.420 | as some of the other capabilities of the neural network,
00:31:45.860 | though we would like the reasoning capabilities
00:31:48.380 | of the neural network to be high, higher.
00:31:51.380 | I think that it's fairly likely that business as usual
00:31:55.900 | will keep improving the reasoning capabilities
00:31:58.140 | of the neural network.
00:31:59.660 | I wouldn't necessarily confidently rule out this possibility.
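A small illustration of the "think out loud" idea Ilya mentions, sometimes called chain-of-thought prompting; the wording and the question are illustrative, not a prescribed format.

```python
question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)

direct_prompt = f"Q: {question}\nA:"

# Asking the model to produce its intermediate reasoning before the
# final answer often improves accuracy on problems like this one
# (the correct answer is $0.05, not the tempting $0.10).
reasoning_prompt = f"Q: {question}\nA: Let's think step by step."

print(direct_prompt)
print(reasoning_prompt)
```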
00:32:04.660 | Yeah, because one of the things that is really cool
00:32:08.780 | is you ask ChatGPT a question,
00:32:12.700 | but before it answers the question, you say,
00:32:14.020 | tell me first what you know, and then answer the question.
00:32:18.180 | You know, usually when somebody answers a question,
00:32:20.140 | if you give me the foundational knowledge that you have
00:32:23.580 | or the foundational assumptions that you're making
00:32:25.460 | before you answer the question,
00:32:27.060 | that really improves my believability of the answer.
00:32:31.940 | You're also demonstrating some level of reason,
00:32:33.940 | well, you're demonstrating reasoning.
00:32:35.820 | And so it seems to me that ChatGPT
00:32:37.420 | has this inherent capability embedded in it.
00:32:40.140 | Yeah.
00:32:41.900 | To some degree.
00:32:42.740 | Yeah.
00:32:43.580 | One way to think about what's happening now
00:32:47.860 | is that these neural networks
00:32:50.100 | have a lot of these capabilities,
00:32:52.100 | they're just not quite reliable enough.
00:32:54.780 | In fact, you could say that reliability
00:32:57.700 | is currently the single biggest obstacle
00:32:59.980 | for these neural networks being useful, truly useful.
00:33:03.620 | If sometimes it is still the case
00:33:08.060 | that these neural networks hallucinate a little bit,
00:33:12.700 | or maybe make some mistakes which are unexpected,
00:33:14.980 | which you wouldn't expect the person to make,
00:33:17.780 | it is this kind of unreliability
00:33:20.340 | that makes them substantially less useful.
00:33:23.460 | But I think that perhaps with a little bit more research,
00:33:26.500 | with the current ideas that we have
00:33:28.060 | and perhaps a few more ambitious research plans,
00:33:33.060 | we'll be able to achieve higher reliability as well.
00:33:36.140 | And that will be truly useful,
00:33:37.780 | that will allow us to have very accurate guardrails,
00:33:42.660 | which are very precise.
00:33:44.220 | That's right.
00:33:45.060 | And it will make it ask for clarification
00:33:47.220 | where it's unsure,
00:33:48.740 | or maybe say that it doesn't know something
00:33:53.140 | when it doesn't know, and do so extremely reliably.
00:33:57.580 | So I'd say that these are some of the bottlenecks, really.
00:34:01.940 | So it's not about whether it exhibits
00:34:04.100 | some particular capability,
00:34:06.140 | but more how reliably, exactly.
00:34:08.780 | Yeah.
00:34:09.820 | You know, speaking of factualness and factfulness,
00:34:14.900 | hallucination, I saw in one of the videos
00:34:19.900 | a demonstration that links to a Wikipedia page.
00:34:25.340 | Retrieval capability,
00:34:29.060 | has that been included in GPT-4?
00:34:31.500 | Is it able to retrieve information from a factual place
00:34:35.980 | that could augment its response to you?
00:34:39.300 | So the current GPT-4, as released,
00:34:44.220 | does not have a built-in retrieval capability.
00:34:47.060 | It is just a really, really good next-word predictor,
00:34:52.060 | which can also consume images, by the way.
00:34:55.180 | We haven't spoken about it.
00:34:56.180 | Yeah, I'm about to ask you about multi-modality.
00:34:57.380 | It is really good at images,
00:34:59.580 | which is also then fine-tuned with data
00:35:02.380 | and various reinforcement learning variants
00:35:06.940 | to behave in a particular way.
00:35:08.540 | It wouldn't surprise me if some of the people
00:35:15.780 | who have access could perhaps request GPT-4
00:35:19.740 | to maybe make some queries
00:35:21.740 | and then populate the results inside the context,
00:35:25.460 | because also the context length of GPT-4
00:35:27.820 | is quite a bit longer now.
00:35:28.780 | Yeah, that's right.
00:35:29.860 | So in short, although GPT-4 does not support
00:35:34.860 | built-in retrieval, it is completely correct
00:35:41.780 | that it will get better with retrieval.
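A sketch of the pattern being described: retrieval done by the caller, with results pasted into the longer context window. Both `search` and `ask_model` are hypothetical stand-ins, not OpenAI API calls.

```python
def search(query: str) -> list[str]:
    # Stand-in for any external search/retrieval system.
    return ["...relevant passage from Wikipedia...", "...another source..."]

def ask_model(prompt: str) -> str:
    # Stand-in for a call to a language model.
    return "(model completion)"

question = "When was the transformer architecture introduced?"
passages = "\n\n".join(search(question))
prompt = (
    "Use the following passages to answer the question.\n\n"
    f"{passages}\n\n"
    f"Question: {question}\nAnswer:"
)
print(ask_model(prompt))
```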
00:35:44.460 | Multi-modality.
00:35:45.740 | GPT-4 has the ability to learn from text and images
00:35:50.740 | and respond to input from text and images.
00:35:56.460 | First of all, the foundation of multi-modality learning,
00:36:01.180 | of course, is that transformers have made it possible
00:36:06.500 | for us to learn from multi-modality,
00:36:08.140 | tokenized text and images.
00:36:11.500 | But at the foundational level,
00:36:16.340 | help us understand how multi-modality enhances
00:36:20.540 | the understanding of the world beyond text by itself.
00:36:25.540 | And my understanding is that when you do multi-modality
00:36:35.220 | learning, that even when it is just a text prompt,
00:36:39.420 | the text prompt, the text understanding
00:36:41.780 | could actually be enhanced.
00:36:43.700 | Tell us about multi-modality at the foundation,
00:36:46.780 | why it's so important and what's the major breakthrough
00:36:50.420 | and the characteristic differences as a result.
00:36:54.020 | So there are two dimensions to multi-modality,
00:36:57.380 | two reasons why it is interesting.
00:37:01.260 | The first reason is a little bit humble.
00:37:06.220 | The first reason is that multi-modality is useful.
00:37:09.580 | It is useful for a neural network to see,
00:37:13.180 | vision in particular, because the world is very visual.
00:37:18.740 | Human beings are very visual animals.
00:37:21.300 | I believe that a third of the human cortex
00:37:26.700 | is dedicated to vision.
00:37:29.500 | And so by not having vision,
00:37:34.500 | the usefulness of our neural networks,
00:37:37.100 | though still considerable, is not as big as it could be.
00:37:41.100 | So it is a very simple usefulness argument.
00:37:45.220 | It is simply useful to see.
00:37:48.260 | And GPT-4 can see quite well.
00:37:52.060 | There is a second reason for vision,
00:37:57.300 | which is that we learn more about the world
00:38:00.020 | by learning from images in addition to learning from text.
00:38:05.100 | That is also a powerful argument,
00:38:09.700 | though it is not as clear cut as it may seem.
00:38:12.700 | I'll give you an example.
00:38:13.940 | Or rather before giving an example,
00:38:17.140 | I'll make the general comment.
00:38:18.660 | For a human being, us human beings,
00:38:22.260 | we get to hear about one billion words
00:38:26.420 | in our entire life.
00:38:28.260 | - Only?
00:38:29.100 | - Only one billion words.
00:38:30.100 | - That's amazing.
00:38:31.020 | - That's not a lot.
00:38:31.860 | - Yeah, that's not a lot.
00:38:33.860 | - So we need to--
00:38:36.020 | - Does that include my own words in my own head?
00:38:38.460 | (laughing)
00:38:39.860 | - Make it two billion, if you want.
00:38:41.660 | But you see what I mean?
00:38:42.740 | - Yeah.
00:38:43.580 | - You know, we can see that because
00:38:45.340 | a billion seconds is 30 years.
00:38:49.220 | So you can kind of see,
00:38:50.060 | like we don't get to see more than a few words a second,
00:38:52.140 | and then we are asleep half the time.
00:38:53.980 | So like a couple billion words
00:38:55.860 | is the total we get in our entire life.
00:38:58.180 | So it becomes really important for us
00:38:59.860 | to get as many sources of information as we can.
00:39:02.700 | And we absolutely learn a lot more from vision.
00:39:05.140 | The same argument holds true
00:39:08.380 | for our neural networks as well,
00:39:10.380 | except for the fact that the neural network
00:39:13.780 | can learn from so many words.
00:39:16.140 | So things which are hard to learn
00:39:20.140 | about the world from text in a few billion words
00:39:24.700 | may become easier from trillions of words.
00:39:28.420 | And I'll give you an example.
00:39:29.860 | Consider colors.
00:39:33.700 | Surely, one needs to see to understand colors.
00:39:38.780 | And yet, the text-only neural networks
00:39:43.180 | who've never seen a single photon in their entire life,
00:39:47.660 | if you ask them which colors are more similar to each other,
00:39:50.940 | it will know that red is more similar to orange than to blue.
00:39:55.060 | It will know that blue is more similar to purple
00:39:57.740 | than to yellow.
00:39:58.660 | How does that happen?
00:40:01.740 | And one answer is that information about the world,
00:40:05.580 | even the visual information,
00:40:07.380 | slowly leaks in through text,
00:40:10.380 | but slowly, not as quickly.
00:40:12.580 | But then you have a lot of text,
00:40:13.660 | you can still learn a lot.
00:40:15.260 | Of course, once you also add vision
00:40:19.500 | and learning about the world from vision,
00:40:20.940 | you will learn additional things
00:40:22.380 | which are not captured in text.
00:40:24.580 | But I would not say that it is a binary,
00:40:28.980 | there are things which are impossible to learn
00:40:31.340 | from text only.
00:40:32.820 | I think there's more of an exchange rate.
00:40:34.940 | And in particular, as you want to learn,
00:40:37.220 | if you are like a human being
00:40:40.940 | and you want to learn from a billion words
00:40:43.700 | or a hundred million words,
00:40:45.420 | then of course the other sources of information
00:40:47.260 | become far more important.
00:40:49.300 | - Yeah.
00:40:51.700 | You learn from images.
00:40:56.100 | Is there a sensibility that would suggest
00:40:59.100 | that if we wanted to understand
00:41:01.540 | also the construction of the world,
00:41:03.900 | as in the arm is connected to my shoulder,
00:41:06.500 | that my elbow is connected,
00:41:08.300 | that somehow these things move,
00:41:10.460 | the animation of the world,
00:41:13.780 | the physics of the world.
00:41:14.980 | If I wanted to learn that as well,
00:41:16.820 | can I just watch videos and learn that?
00:41:18.780 | - Yes.
00:41:19.620 | - And if I wanted to augment all of that with sound,
00:41:23.140 | like for example, if somebody said,
00:41:25.700 | the meaning of great,
00:41:27.500 | great could be great,
00:41:31.220 | or great could be great.
00:41:32.940 | One is sarcastic, one is enthusiastic.
00:41:38.100 | There are many, many words like that.
00:41:40.020 | That's sick, or I'm sick, or I'm sick.
00:41:45.340 | Depending on how people say it,
00:41:46.900 | would audio also make a contribution
00:41:50.420 | to the learning of the model?
00:41:52.660 | And could we put that to good use soon?
00:41:55.420 | - Yes.
00:41:56.740 | I think it's definitely the case that,
00:42:00.100 | well, what can we say about audio?
00:42:02.580 | It's useful, it's an additional source of information.
00:42:05.420 | Probably not as much as images or video,
00:42:09.060 | but there is a case to be made
00:42:11.980 | for the usefulness of audio as well,
00:42:13.780 | both on the recognition side and on the production side.
00:42:18.020 | - When you,
00:42:18.860 | in the context of the scores that I saw,
00:42:23.700 | the thing that was really interesting
00:42:26.020 | was the data that you guys published.
00:42:29.420 | Which of the tests were performed well by GPT-3,
00:42:34.700 | and which of the tests
00:42:35.940 | performed substantially better with GPT-4?
00:42:38.580 | How did multimodality contribute to those tests,
00:42:43.260 | do you think?
00:42:44.260 | - Oh, I mean, in a pretty straightforward way,
00:42:48.900 | anytime there was a test where,
00:42:52.380 | to understand the problem,
00:42:53.780 | you need to look at a diagram.
00:42:55.780 | Like, for example, in some math competitions.
00:42:58.220 | Like, there is a math competition
00:43:01.420 | for high school students called AMC-12.
00:43:03.620 | - AMC-10, yeah. - 12, right?
00:43:05.700 | And there, presumably, many of the problems have a diagram.
00:43:10.700 | So, GPT-3.5 does quite badly on that test.
00:43:16.180 | GPT-4, with text only, does,
00:43:21.580 | I think, I don't remember exactly,
00:43:22.460 | but maybe from a 2% to a 20% success rate.
00:43:27.460 | But then when you add vision, it jumps to 40% success rate.
00:43:31.140 | So the vision is really doing a lot of work.
00:43:33.380 | The vision is extremely good.
00:43:35.100 | And I think being able to reason visually as well
00:43:38.980 | and communicate visually will also be very powerful
00:43:42.900 | and very nice things,
00:43:43.860 | which go beyond just learning about the world.
00:43:46.940 | You have several things.
00:43:47.820 | You can learn about the world.
00:43:50.220 | You can then reason about the world visually.
00:43:52.500 | And you can communicate visually.
00:43:54.820 | Where now, in the future, perhaps, in some future version,
00:43:57.860 | if you ask your neural net, "Hey, explain this to me,"
00:44:00.780 | rather than just producing four paragraphs,
00:44:02.300 | it will produce, "Hey, here's a little diagram
00:44:05.340 | "which clearly conveys to you
00:44:07.140 | "exactly what you need to know."
00:44:08.660 | - Yeah, that's incredible.
00:44:10.260 | You know, one of the things that you said earlier
00:44:11.700 | about an AI generating a test to train another AI,
00:44:16.700 | you know, there was a paper that was written about,
00:44:21.060 | and I don't completely know whether it's factual or not,
00:44:24.980 | but that there's a total amount
00:44:28.020 | of somewhere between four trillion
00:44:29.540 | to something like 20 trillion useful, you know,
00:44:33.660 | tokens, language tokens,
00:44:37.180 | that the world will be able to train on,
00:44:40.260 | you know, over some period of time.
00:44:41.620 | And that would have run out of tokens to train.
00:44:44.260 | And I, well, first of all, I wonder
00:44:47.620 | if you feel the same way.
00:44:49.580 | And then secondarily, whether the AI
00:44:55.260 | generating its own data
00:44:59.980 | could be used to train the AI itself,
00:45:03.300 | which you could argue is a little circular,
00:45:06.140 | but we train our brain with generated data all the time
00:45:11.140 | by self-reflection,
00:45:15.940 | working through a problem in our brain, you know,
00:45:19.660 | and, you know, I guess neuroscientists suggest sleeping,
00:45:25.060 | where we do a fair amount of, you know,
00:45:26.900 | developing our neurons.
00:45:28.140 | How do you see this area of synthetic data generation?
00:45:32.900 | Is that going to be an important part of the future
00:45:34.660 | of training AI and the AI teaching itself?
00:45:38.380 | - Well, I think, like I wouldn't underestimate
00:45:42.540 | the data that exists out there.
00:45:45.020 | I think there's probably more data than people realize.
00:45:50.020 | And as to your second question,
00:45:53.620 | certainly a possibility remains to be seen.
00:45:56.860 | - Yeah, yeah, it really does seem
00:45:59.940 | that one of these days our AIs will, you know,
00:46:04.940 | when we're not using them,
00:46:06.220 | maybe generate either adversarial content for themselves
00:46:09.380 | to learn from, or imagine solving problems
00:46:12.620 | that they can go off and then improve themselves with.
00:46:16.940 | Tell us whatever you can about where we are now
00:46:23.020 | and what do you think we'll be in not too distant future,
00:46:26.860 | but, you know, pick your horizon, a year or two.
00:46:31.420 | What do you think this whole language model area would be
00:46:34.060 | in some of the areas that you're most excited about?
00:46:36.460 | - You know, predictions are hard.
00:46:38.660 | And although it's a little difficult
00:46:42.220 | to say things which are too specific,
00:46:46.580 | I think it's safe to assume that progress will continue.
00:46:52.580 | And that we will keep on seeing systems which astound us
00:46:56.820 | in the things that they can do.
00:47:00.820 | And the current frontiers will be centered around reliability,
00:47:05.500 | around whether the system can be trusted,
00:47:09.020 | really get into a point where we can trust what it produces,
00:47:13.140 | really get into a point where
00:47:14.820 | if it doesn't understand something,
00:47:16.460 | it asks for a clarification,
00:47:18.020 | says that it doesn't know something,
00:47:21.300 | says that it needs more information.
00:47:23.060 | I think those are perhaps the biggest,
00:47:26.300 | the areas where improvement will lead to the biggest impact
00:47:30.700 | on the usefulness of those systems.
00:47:32.860 | Because right now that's really what stands in the way.
00:47:35.340 | You ask a neural net
00:47:36.780 | to maybe summarize some long document
00:47:39.900 | and you get a summary.
00:47:40.980 | Like, are you sure that some important detail
00:47:43.740 | wasn't omitted?
00:47:44.580 | It's still a useful summary,
00:47:46.340 | but it's a different story when you know
00:47:48.940 | that all the important points have been covered.
00:47:52.020 | At some point, and in particular,
00:47:55.180 | it's okay if there is ambiguity, it's fine.
00:47:58.260 | But if a point is clearly important,
00:48:00.860 | such that anyone else who saw that point
00:48:02.700 | would say this is really important,
00:48:04.780 | then the neural network will also recognize that reliably.
00:48:07.420 | That's when you know.
00:48:08.980 | Same for the guardrail,
00:48:10.460 | same for its ability to clearly follow
00:48:13.060 | the intent of the user, of its operator.
00:48:16.860 | So I think we'll see a lot of that in the next two years.
00:48:19.580 | - Yeah, that's terrific,
00:48:20.420 | because the progress in those two areas
00:48:22.300 | will make this technology trusted by people to use
00:48:26.740 | and be able to apply it for so many things.
00:48:28.940 | I was thinking that was gonna be the last question,
00:48:30.660 | but I did have another one, sorry about that.
00:48:32.740 | So, ChatGPT to GPT-4.
00:48:36.620 | GPT-4, when you first started using it,
00:48:40.420 | what are some of the skills that it demonstrated
00:48:44.740 | that surprised even you?
00:48:47.180 | - Well, there were lots of really cool things
00:48:51.420 | that it demonstrated,
00:48:52.580 | which were quite cool and surprising.
00:48:57.580 | It was quite good.
00:49:01.780 | So I'll mention two, so let's see.
00:49:05.220 | I'm just trying to think about the best way to go about it.
00:49:09.620 | The short answer is that the level of its reliability
00:49:13.940 | was surprising.
00:49:15.860 | Where the previous neural networks,
00:49:18.220 | if you asked them a question,
00:49:19.860 | sometimes they might misunderstand something
00:49:22.660 | in a kind of a silly way.
00:49:25.900 | Whereas with GPT-4, that stopped happening.
00:49:28.540 | Its ability to solve math problems became far greater.
00:49:31.940 | It was like it could really do the derivation,
00:49:35.740 | a long, complicated derivation,
00:49:37.780 | and it could convert the units and so on.
00:49:39.660 | And that was really cool.
00:49:41.220 | You know, like many people have--
00:49:42.060 | - It works through a proof.
00:49:43.140 | It works through a proof.
00:49:44.180 | - Yeah. - That's pretty amazing.
00:49:45.100 | - Not all proofs, naturally, but quite a few.
00:49:48.100 | Or another example would be,
00:49:50.060 | like many people noticed that it has the ability
00:49:53.020 | to produce poems with every word
00:49:58.020 | starting with the same letter,
00:49:59.340 | or every word starting with some--
00:50:02.060 | - It follows instructions really, really clearly.
00:50:04.780 | - Not perfectly still, but much better than before.
00:50:07.140 | - Yeah, really good.
00:50:08.300 | - And on the vision side,
00:50:09.980 | I really love how it can explain jokes.
00:50:12.780 | It can explain memes.
00:50:14.380 | You show it a meme and ask it why it's funny,
00:50:16.300 | and it will tell you, and it will be correct.
00:50:18.980 | The vision part, I think,
00:50:21.540 | it's like really actually seeing it
00:50:25.460 | when you can ask follow-up questions
00:50:27.500 | about some complicated image with a complicated diagram
00:50:31.540 | and get an explanation.
00:50:32.500 | That's really cool.
00:50:33.420 | But yeah, overall, I will say, to take a step back,
00:50:38.220 | you know, I've been in this business for quite some time.
00:50:41.340 | Actually, like almost exactly 20 years.
00:50:44.140 | And the thing which I find most surprising
00:50:50.940 | is that it actually works.
00:50:52.780 | (laughing)
00:50:54.380 | - Yeah.
00:50:55.220 | - Like it turned out to be the same little thing all along,
00:50:58.540 | which is no longer little,
00:51:00.300 | and a lot more serious and much more intense,
00:51:03.500 | but it's the same neural network, just larger,
00:51:06.740 | trained on maybe larger data sets in different ways
00:51:09.700 | with the same fundamental training algorithm.
00:51:11.780 | - Yeah.
00:51:12.940 | - So it's like, wow.
00:51:14.820 | I would say this is what I find the most surprising.
00:51:18.380 | - Yeah.
00:51:19.220 | - Whenever I take a step back, I go, how is it possible
00:51:21.420 | that those ideas, those conceptual ideas about,
00:51:23.700 | well, the brain has neurons,
00:51:26.420 | so maybe artificial neurons are just as good,
00:51:29.180 | and so maybe we just need to train them somehow
00:51:30.780 | with some learning algorithm,
00:51:32.020 | that those arguments turned out to be so incredibly correct.
00:51:36.220 | That would be the biggest surprise, I'd say.
00:51:39.900 | - In the 10 years that we've known each other,
00:51:42.820 | the models that you've trained
00:51:47.820 | and the amount of data that you've trained on,
00:51:49.540 | from what you did on AlexNet to now,
00:51:54.540 | has grown about a million times.
00:51:56.900 | And no one in the world of computer science
00:52:01.900 | would have believed that the amount of computation
00:52:05.500 | that was done in that 10 years' time
00:52:07.620 | would be a million times larger
00:52:09.980 | and that you dedicated your career to go do that.
00:52:13.980 | You've done many more, your body of work is incredible,
00:52:19.820 | but two seminal works, the invention,
00:52:22.620 | the co-invention with AlexNet and that early work,
00:52:25.620 | and now with GPT at OpenAI,
00:52:28.820 | it is truly remarkable what you've accomplished.
00:52:32.980 | It's great to catch up with you again, Ilya.
00:52:35.420 | My good friend, and it is quite an amazing moment.
00:52:40.420 | And today's talk, the way you break down the problem
00:52:45.740 | and describe it, this is one of the best PhD,
00:52:50.740 | beyond PhD descriptions of the state-of-the-art
00:52:54.580 | of large language models.
00:52:55.700 | I really appreciate that.
00:52:56.780 | It's great to see you, congratulations.
00:52:58.220 | - Thank you so much. - Yeah, thank you.