
Ilya Sutskever (OpenAI) and Jensen Huang (NVIDIA CEO): AI Today and Vision of the Future (3/2023)



00:00:00.000 | Ilya, unbelievable.
00:00:03.000 | Today is the day after GPT-4.
00:00:06.000 | (laughs)
00:00:07.300 | It's great to have you here.
00:00:09.300 | I'm delighted to have you.
00:00:10.680 | I've known you a long time.
00:00:12.180 | The journey, and just my mental memory
00:00:14.840 | of the time that I've known you
00:00:19.520 | and the seminal work that you have done
00:00:22.680 | starting in University of Toronto,
00:00:26.060 | the co-invention of AlexNet with Alex Krizhevsky and Geoff Hinton
00:00:32.760 | that led to the big bang of modern artificial intelligence,
00:00:38.360 | your career that took you out here to the Bay Area,
00:00:41.600 | the founding of OpenAI, GPT-1, 2, 3,
00:00:46.400 | and then, of course, ChatGPT,
00:00:49.280 | the AI heard around the world.
00:00:52.320 | This is the incredible resume of a young computer scientist,
00:00:57.580 | you know, with an entire community and industry
00:01:00.160 | in awe of your achievements.
00:01:02.460 | I guess I just want to go back to the beginning
00:01:05.580 | and ask you, deep learning,
00:01:08.920 | what was your intuition around deep learning?
00:01:11.420 | Why did you know that it was going to work?
00:01:13.920 | Did you have any intuition
00:01:14.920 | that it was going to lead to this kind of success?
00:01:17.420 | Okay, well, first of all,
00:01:19.840 | thank you so much for all the kind words.
00:01:24.520 | A lot has changed
00:01:26.180 | thanks to the incredible power of deep learning.
00:01:31.220 | Like, I think my personal starting point,
00:01:34.840 | I was interested in artificial intelligence
00:01:36.600 | for a whole variety of reasons,
00:01:39.760 | starting from an intuitive understanding
00:01:43.060 | and appreciation of its impact,
00:01:46.060 | and also I had a lot of curiosity
00:01:47.720 | about what is consciousness,
00:01:49.720 | what is the human experience,
00:01:51.940 | and it felt like progress in artificial intelligence
00:01:54.820 | will help with that.
00:01:57.240 | The next step was,
00:01:58.740 | well, back then, I was starting in 2002, 2003,
00:02:03.000 | and it seemed like learning is the thing
00:02:05.040 | that humans can do, that people can do,
00:02:08.340 | that computers can't do at all.
00:02:11.460 | In 2003, 2002,
00:02:14.460 | computers could not learn anything,
00:02:17.520 | and it wasn't even clear that it was possible in theory.
00:02:21.520 | And so I thought that making progress in learning,
00:02:27.360 | in artificial learning, in machine learning,
00:02:30.060 | that would lead to the greatest progress in AI.
00:02:33.400 | And then I started to look around for what was out there,
00:02:37.200 | and nothing seemed too promising.
00:02:39.800 | But to my great luck,
00:02:42.040 | Geoff Hinton was a professor at my university,
00:02:45.280 | and I was able to find him,
00:02:47.240 | and he was working in neural networks,
00:02:48.780 | and it immediately made sense,
00:02:50.780 | because neural networks had the property
00:02:54.160 | that, when we are learning,
00:02:55.900 | we are automatically programming parallel computers.
00:03:00.040 | Back then, the parallel computers were small,
00:03:02.700 | but the promise was, if you could somehow figure out
00:03:05.360 | how learning in neural networks works,
00:03:07.900 | then you can program small parallel computers from data.
00:03:11.280 | And it was also similar enough to the brain,
00:03:13.360 | and the brain works,
00:03:14.520 | so it's like you had these several factors going for it.
00:03:17.640 | Now, it wasn't clear how to get it to work,
00:03:21.840 | but of all the things that existed,
00:03:24.800 | that seemed like it had by far the greatest long-term promise.
00:03:28.760 | Even though, you know—
00:03:29.760 | At the time that you first started,
00:03:31.260 | at the time that you first started working
00:03:32.520 | with deep learning and neural networks,
00:03:35.300 | what was the scale of the network?
00:03:37.300 | What was the scale of computing at that moment in time?
00:03:39.600 | What was it like?
00:03:40.760 | An interesting thing to note
00:03:42.400 | was that the importance of scale wasn't realized back then.
00:03:45.980 | So people would just train, you know,
00:03:47.440 | neural networks with like 50 neurons, 100 neurons,
00:03:50.640 | several hundred neurons that would be like
00:03:53.020 | a big neural network.
00:03:55.140 | A million parameters would be considered very large.
00:03:58.360 | We would run our models on unoptimized CPU code,
00:04:02.360 | because we were a bunch of researchers.
00:04:04.360 | We didn't know about BLAS.
00:04:06.140 | We used MATLAB.
00:04:07.060 | At least MATLAB's matrix operations were optimized.
00:04:10.180 | And we'd just experiment,
00:04:11.980 | you know, what is even the right question to ask, you know?
00:04:14.480 | So you try
00:04:17.400 | to just find interesting phenomena,
00:04:19.360 | interesting observations.
00:04:21.600 | You can do this small thing,
00:04:22.900 | and you can do that small thing.
00:04:24.480 | You know, Geoff Hinton was really excited
00:04:26.640 | about training neural nets on small little digits,
00:04:32.440 | both for classification,
00:04:33.640 | and also he was very interested in generating them.
00:04:36.440 | So the beginnings of generative models were right there.
00:04:39.820 | But the question is like, okay,
00:04:40.980 | there's all this cool stuff floating around.
00:04:43.400 | What really gets traction?
00:04:45.680 | And so
00:04:47.640 | it wasn't obvious that this was the right question
00:04:51.240 | back then,
00:04:52.400 | but in hindsight, that turned out to be the right question.
00:04:55.440 | - Now, the year of AlexNet was 2012.
00:04:59.780 | - Yes. - 2012.
00:05:01.400 | Now, you and Alex were working on AlexNet
00:05:04.740 | for some time before then.
00:05:07.100 | And at what point was it clear to you
00:05:12.100 | that you wanted to build
00:05:14.600 | a computer vision-oriented neural network
00:05:17.800 | that ImageNet was the right set of data to go for,
00:05:20.860 | and to somehow go for the computer vision contest?
00:05:25.860 | - Yeah.
00:05:28.220 | So I can talk about the context there.
00:05:30.100 | It's, I think, probably two years before that,
00:05:36.440 | it became clear to me that supervised learning
00:05:41.220 | is what's going to get us the traction.
00:05:43.360 | And I can explain precisely why.
00:05:45.820 | It wasn't just an intuition.
00:05:47.680 | It was, I would argue, an irrefutable argument,
00:05:51.680 | which went like this.
00:05:53.360 | If your neural network is deep and large,
00:05:57.300 | then it could be configured to solve a hard task.
00:06:03.360 | So that's the key word, deep and large.
00:06:06.860 | People weren't looking at large neural networks.
00:06:08.860 | People were maybe studying a little bit of depth
00:06:11.560 | in neural networks,
00:06:12.740 | but most of the machine learning field
00:06:14.180 | wasn't even looking at neural networks at all.
00:06:16.060 | They were looking at all kinds of Bayesian models
00:06:18.100 | and kernel methods,
00:06:19.940 | which are theoretically elegant methods,
00:06:22.700 | which have the property that
00:06:24.640 | they actually can't represent a good solution
00:06:27.320 | no matter how you configure them.
00:06:29.320 | Whereas the large and deep neural network
00:06:31.440 | can represent a good solution to the problem.
00:06:34.140 | To find the good solution,
00:06:36.900 | you need a big data set that demands it,
00:06:40.400 | and a lot of compute to actually do the work.
00:06:44.360 | We had also made advances;
00:06:45.740 | we had worked on optimization for a little bit.
00:06:49.320 | It was clear that optimization is a bottleneck,
00:06:53.140 | and there was a breakthrough by another grad student
00:06:55.560 | in Geoff Hinton's lab called James Martens,
00:06:58.860 | and he came up with an optimization method
00:07:00.440 | which is different from the one we're using now.
00:07:03.100 | Some second-order method.
00:07:05.780 | But the point about it is that it proved
00:07:08.140 | that we can train those neural networks,
00:07:09.600 | because before we didn't even know we could train them.
00:07:12.140 | So if you can train them, you make it big,
00:07:14.900 | you find the data, and you will succeed.
00:07:17.360 | So then the next question is,
00:07:18.820 | well, what data?
00:07:20.600 | And the ImageNet data set,
00:07:21.900 | back then it seemed like this unbelievably difficult data set.
00:07:25.240 | But it was clear that if you were to train
00:07:28.740 | a large convolutional neural network on this data set,
00:07:30.980 | it must succeed if you just had the compute.
00:07:34.480 | - And right at that time,
00:07:35.980 | - GPUs came out. - you and I,
00:07:38.440 | our history and our paths intersected,
00:07:42.740 | and somehow you had this observation about the GPU,
00:07:47.740 | and at that time,
00:07:49.440 | this was a couple of generations into our CUDA GPUs,
00:07:52.900 | and I think it was the GTX 580 generation.
00:07:56.560 | You had the insight that the GPU could actually be useful
00:08:00.800 | for training your neural network models.
00:08:02.760 | What was that, how did that day start?
00:08:05.140 | Tell me, you never told me about that moment.
00:08:08.140 | How did that day start?
00:08:09.560 | - Yeah, so, you know, the GPUs
00:08:14.180 | appeared in our lab, in our Toronto lab,
00:08:18.440 | thanks to Geoff, and he said,
00:08:19.720 | "We should try these GPUs."
00:08:21.300 | And we started trying and experimenting with them.
00:08:24.300 | And it was a lot of fun,
00:08:27.300 | but it was unclear what to use them for exactly.
00:08:31.640 | Where are you going to get the real traction?
00:08:33.880 | But then, with the existence of the ImageNet data set,
00:08:39.140 | and then it was also very clear
00:08:42.800 | that the convolutional neural network
00:08:44.300 | is such a great fit for the GPU,
00:08:46.640 | so it should be possible to make it go unbelievably fast,
00:08:50.380 | and therefore train something
00:08:52.380 | which would be completely unprecedented
00:08:54.140 | in terms of its size.
00:08:55.300 | And that's how it happened,
00:08:58.840 | and, you know, very fortunately, Alex Krizhevsky,
00:09:01.960 | he really loved programming the GPU.
00:09:04.800 | (laughing)
00:09:06.340 | And he was able to do it, he was able to code,
00:09:09.600 | to program really fast convolutional kernels.
00:09:13.880 | And then, and train the neural net
00:09:20.220 | on the ImageNet data set, and that led to the result.
00:09:22.840 | But it was like--
00:09:23.680 | - It shocked the world.
00:09:24.980 | - It shocked the world.
00:09:26.640 | It broke the record of computer vision
00:09:29.300 | by such a wide margin that it was a clear discontinuity.
00:09:33.840 | - Yeah. - Yeah.
00:09:34.680 | - And I would say
00:09:36.180 | there is another bit of context there.
00:09:38.100 | It's not so much, like, when you say break the record,
00:09:43.880 | I think there's a different way to phrase it.
00:09:46.060 | It's that that data set was so obviously hard,
00:09:50.980 | and so obviously outside of reach of anything.
00:09:54.400 | People were making progress with some classical techniques,
00:09:56.720 | and they were actually doing something.
00:09:58.980 | But this thing was so much better on the data set,
00:10:01.600 | which was so obviously hard.
00:10:03.900 | It's not just that it's just some competition.
00:10:06.600 | It was a competition which, back in the day--
00:10:09.060 | - It wasn't an average benchmark.
00:10:10.600 | - It was so obviously difficult,
00:10:13.720 | and so obviously out of reach,
00:10:15.760 | and so obviously with the property
00:10:18.720 | that if you did a good job, that would be amazing.
00:10:21.880 | - Big bang of AI.
00:10:23.300 | Fast forward to now.
00:10:25.260 | You came out to the Valley.
00:10:26.880 | You started OpenAI with some friends.
00:10:29.720 | You were the chief scientist.
00:10:31.920 | Now, what was the first initial idea
00:10:35.020 | about what to work on at OpenAI?
00:10:36.880 | Because you guys worked on several things.
00:10:38.840 | Some of the trails of inventions and work,
00:10:43.420 | you could see, led up to the ChatGPT moment.
00:10:48.420 | But what was the initial inspiration?
00:10:53.180 | How would you approach intelligence from that moment
00:10:56.760 | that led to this?
00:10:58.460 | - Yeah.
00:10:59.640 | So, obviously, when we started,
00:11:03.000 | it wasn't 100% clear how to proceed.
00:11:07.800 | And the field was also very different
00:11:11.460 | compared to the way it is right now.
00:11:13.640 | So right now, we're already used to it;
00:11:16.840 | you have these amazing artifacts,
00:11:20.220 | these amazing neural nets that are doing incredible things,
00:11:23.260 | and everyone is so excited.
00:11:25.840 | But back in 2015, 2016, early 2016,
00:11:29.960 | when we were starting out,
00:11:31.260 | the whole thing seemed pretty crazy.
00:11:35.380 | There were so many fewer researchers,
00:11:39.180 | maybe there were between 100 and 1,000 times
00:11:42.260 | fewer people in the field compared to now.
00:11:44.760 | Like back then, you had like 100 people,
00:11:49.140 | most of them were working in Google/DeepMind,
00:11:52.560 | and that was that.
00:11:53.980 | And then there were people picking up the skills,
00:11:55.860 | but it was very, very scarce, very rare still.
00:11:58.220 | And we had two big initial ideas
00:12:04.680 | at the start of OpenAI
00:12:07.820 | that had a lot of staying power,
00:12:10.160 | and they stayed with us to this day.
00:12:12.280 | And I'll describe them right now.
00:12:13.940 | The first big idea that we had,
00:12:17.600 | one which I was especially excited about very early on,
00:12:22.600 | is the idea of unsupervised learning through compression.
00:12:28.520 | Some context.
00:12:33.780 | Today, we take it for granted
00:12:36.400 | that unsupervised learning is this easy thing
00:12:38.140 | and you just pre-train on everything
00:12:39.760 | and it all does exactly as you'd expect.
00:12:41.760 | In 2016, unsupervised learning
00:12:46.840 | was an unsolved problem in machine learning
00:12:50.600 | that no one had any insight,
00:12:54.060 | any clue as to what to do.
00:12:56.400 | Yann LeCun would go around and give talks
00:12:59.600 | saying that you have this grand challenge
00:13:01.900 | in unsupervised learning.
00:13:04.340 | And I really believed that really good compression
00:13:08.720 | of the data will lead to unsupervised learning.
00:13:11.020 | Now, compression is not language that's commonly used
00:13:16.580 | to describe what is really being done until recently,
00:13:20.800 | when suddenly it became apparent to many people
00:13:23.300 | that those GPTs actually compress the training data.
00:13:26.460 | You may recall the Ted Chiang New Yorker article
00:13:30.100 | which also alluded to this.
00:13:31.920 | But there is a real mathematical sense
00:13:34.220 | in which training these autoregressive generative models
00:13:38.540 | compresses the data.
00:13:40.260 | And intuitively, you can see why that should work.
00:13:43.300 | If you compress the data really well,
00:13:45.120 | you must extract all the hidden secrets which exist in it.
00:13:48.160 | Therefore, that is the key.
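A minimal sketch of the connection being described, assuming a hypothetical next-token model that assigns the probabilities listed below: under arithmetic coding, a predictor that gives probability p to the token that actually occurred can encode that token in -log2(p) bits, so better prediction is literally shorter compression.

```python
import math

# Hypothetical per-token probabilities that some next-token model assigns
# to the tokens that actually occurred (illustrative values only).
token_probs = [0.25, 0.5, 0.9, 0.6, 0.95]

# Arithmetic coding cost: -log2(p) bits per token.
bits = sum(-math.log2(p) for p in token_probs)
print(f"{bits:.2f} bits to encode {len(token_probs)} tokens")

# Minimizing next-token log-loss is minimizing this total,
# i.e. the compressed length of the training data.
```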
00:13:50.720 | So that was the first idea that we were really excited about.
00:13:54.420 | And that led to quite a few works at OpenAI,
00:13:59.920 | like the sentiment neuron, which I'll mention very briefly.
00:14:04.920 | This work might not be well known
00:14:09.580 | outside of the machine learning field,
00:14:12.300 | but it was very influential, especially in our thinking.
00:14:15.100 | This work, the result there,
00:14:21.080 | was that when you train a neural network,
00:14:25.020 | back then it was not a transformer,
00:14:26.340 | it was before the transformer:
00:14:27.980 | a small recurrent neural network, an LSTM,
00:14:30.560 | for those who remember.
00:14:31.400 | - Sequence work, you've done,
00:14:32.440 | I mean, this is some of the work
00:14:34.320 | that you've done yourself, yeah.
00:14:36.600 | - So the same LSTM with a few twists,
00:14:39.680 | trained to predict the next token in Amazon reviews,
00:14:42.760 | next character.
00:14:44.140 | And we discovered that if you predict
00:14:46.800 | the next character well enough,
00:14:48.960 | there will be a neuron inside that LSTM
00:14:52.360 | that corresponds to its sentiment.
00:14:54.060 | So that was really cool,
00:14:57.040 | because it showed some traction
00:15:00.060 | for unsupervised learning,
00:15:01.640 | and it validated the idea that really good
00:15:06.160 | next character prediction,
00:15:09.220 | next something prediction, compression,
00:15:11.340 | has the property that it discovers
00:15:14.480 | the secrets in the data.
00:15:16.120 | That's what we see with these GPT models, right?
00:15:17.880 | You train, and people say,
00:15:19.620 | "It's just statistical correlation."
00:15:20.940 | I mean, at this point, it should be so clear to anyone.
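For concreteness, a minimal sketch of the kind of setup described above: a character-level LSTM trained only on next-character prediction. The sizes and the random batch are placeholders, not the original configuration from the sentiment-neuron work.

```python
import torch
import torch.nn as nn

vocab_size, hidden_size = 256, 512  # placeholder sizes

embed = nn.Embedding(vocab_size, hidden_size)
lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
head = nn.Linear(hidden_size, vocab_size)
loss_fn = nn.CrossEntropyLoss()

# One training step on a fake batch of byte-encoded review text.
batch = torch.randint(0, vocab_size, (8, 128))      # (batch, time)
inputs, targets = batch[:, :-1], batch[:, 1:]       # predict the next character

hidden_states, _ = lstm(embed(inputs))              # (batch, time-1, hidden)
logits = head(hidden_states)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()

# The finding: after enough training on real reviews, a single unit of
# `hidden_states` ends up tracking sentiment, even though sentiment
# was never a training target.
```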
00:15:22.980 | - That observation also,
00:15:25.340 | for me, intuitively,
00:15:27.080 | opened up the whole world of,
00:15:28.780 | where do I get the data for unsupervised learning?
00:15:33.060 | Because I do have a whole lot of data.
00:15:35.100 | If I could just make you predict the next character,
00:15:37.660 | and I know what the ground truth is,
00:15:39.420 | I know what the answer is,
00:15:40.740 | I could train a neural network model with that.
00:15:43.280 | So that observation,
00:15:45.140 | and masking, and other technologies, other approaches,
00:15:48.860 | opened my mind about,
00:15:50.460 | where would the world get all the data
00:15:52.500 | that's unsupervised for unsupervised learning?
00:15:54.700 | - Well, I think,
00:15:56.540 | so I would phrase it a little differently.
00:15:58.880 | I would say that with unsupervised learning,
00:16:02.060 | the hard part has been less around
00:16:05.660 | where you get the data from,
00:16:08.300 | though that part is there as well, especially now.
00:16:11.860 | But it was more about,
00:16:13.680 | why should you do it in the first place?
00:16:16.820 | Why should you bother?
00:16:17.900 | The hard part was to realize
00:16:21.900 | that training these neural nets to predict the next token
00:16:26.740 | is a worthwhile goal at all.
00:16:29.100 | That was the goal. - That it would learn
00:16:29.940 | a representation,
00:16:31.460 | that it would be able to understand.
00:16:33.980 | - That's right, that it will be useful.
00:16:35.500 | - Grammar and, yeah.
00:16:37.480 | - But to actually,
00:16:38.320 | but it just wasn't obvious.
00:16:41.500 | So people weren't doing it.
00:16:42.860 | But the sentiment neuron work,
00:16:44.980 | and I want to call out Alec Radford
00:16:47.140 | as a person who really was responsible
00:16:50.260 | for many of the advances there,
00:16:51.860 | the sentiment, this was before GPT-1,
00:16:56.820 | it was the precursor to GPT-1,
00:16:58.500 | and it influenced our thinking a lot.
00:17:00.900 | Then the transformer came out,
00:17:03.020 | and we immediately went,
00:17:03.940 | oh my God, this is the thing.
00:17:05.380 | And we trained GPT-1.
00:17:09.320 | - Now, along the way,
00:17:11.460 | you've always believed that scaling
00:17:16.460 | will improve the performance of these models.
00:17:18.660 | - Yes.
00:17:19.500 | Larger networks, deeper networks,
00:17:22.780 | more training data would scale that.
00:17:24.660 | There was a very important paper
00:17:27.460 | that OpenAI wrote about the scaling laws
00:17:30.380 | and the relationship between loss
00:17:33.100 | and the size of the model
00:17:35.260 | and the amount of data,
00:17:36.540 | the size of the data set.
00:17:38.340 | When transformers came out,
00:17:39.740 | it gave us the opportunity
00:17:41.060 | to train very, very large models
00:17:43.220 | in a very reasonable amount of time.
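The paper being referred to is presumably "Scaling Laws for Neural Language Models" (Kaplan et al., 2020), which fit empirical power laws of roughly the following form, where N is the number of parameters, D the number of training tokens, and the exponents are the paper's approximate fits:

$$ L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad \alpha_N \approx 0.076,\ \alpha_D \approx 0.095 $$

Loss falls predictably as model size and data grow, which is what made "train very, very large models" a plan rather than a gamble.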
00:17:45.900 | - But did the intuition
00:17:49.300 | about the scaling laws
00:17:51.340 | and the size of models and data
00:17:53.780 | and your journey of GPT-1, 2, 3,
00:17:58.580 | which came first?
00:17:59.700 | Did you see the evidence of GPT-1 through 3 first,
00:18:02.220 | or was it intuition about the scaling law first?
00:18:06.220 | - The intuition, so I would say
00:18:08.340 | that the way I'd phrase it
00:18:10.940 | is that I had a very strong belief
00:18:13.580 | that bigger is better.
00:18:15.860 | And that one of the goals
00:18:19.580 | that we had at OpenAI
00:18:21.180 | is to figure out how to use the scale correctly.
00:18:25.220 | There was a lot of belief in OpenAI
00:18:27.460 | about scale from the very beginning.
00:18:29.260 | The question is what to use it for precisely.
00:18:33.740 | 'Cause I'll mention,
00:18:35.540 | right now we're talking about the GPTs,
00:18:36.820 | but there's another very important line of work
00:18:38.660 | which I haven't mentioned,
00:18:39.620 | the second big idea,
00:18:41.140 | but I think now is a good time to make a detour,
00:18:43.300 | and that's reinforcement learning.
00:18:45.100 | That clearly seemed important as well.
00:18:48.780 | What do you do with it?
00:18:50.540 | So the first really big project
00:18:54.380 | that was done inside OpenAI
00:18:55.780 | was our effort
00:19:00.460 | at solving a real-time strategy game.
00:19:03.220 | And for context,
00:19:05.100 | a real-time strategy game is like,
00:19:07.220 | it's a competitive sport.
00:19:08.380 | - Yeah, right.
00:19:09.220 | - You need to be smart,
00:19:10.820 | you need to be fast,
00:19:11.820 | you need to have a quick reaction time,
00:19:13.300 | there's teamwork,
00:19:14.740 | and you're competing against another team.
00:19:17.180 | And it's pretty involved.
00:19:20.260 | And there is a whole competitive league for that game.
00:19:24.380 | The game is called Dota 2.
00:19:26.540 | And so we trained a reinforcement learning agent
00:19:28.940 | to play against itself,
00:19:32.460 | with the goal of reaching a level
00:19:38.020 | so that it could compete against the best
00:19:40.540 | players in the world.
00:19:42.180 | And that was a major undertaking as well.
00:19:44.300 | It was a very different line.
00:19:45.540 | It was reinforcement learning.
00:19:46.980 | - Yeah, I remember the day
00:19:47.820 | that you guys announced that work.
00:19:50.380 | And this is, by the way,
00:19:51.740 | when I was asking earlier about,
00:19:53.820 | there's a large body of work
00:19:55.660 | that has come out of OpenAI.
00:19:56.860 | Some of it seemed like detours,
00:19:59.620 | but in fact, as you're explaining now,
00:20:03.700 | those seeming detours
00:20:04.980 | really led up to some of the important work
00:20:07.100 | that we're now talking about, ChatGPT.
00:20:09.260 | - Yeah.
00:20:10.100 | I mean, there has been real convergence
00:20:12.660 | where the GPTs
00:20:15.220 | produce the foundation.
00:20:16.940 | And the reinforcement learning of Dota
00:20:20.020 | morphed into reinforcement learning from human feedback.
00:20:22.980 | - That's right.
00:20:23.820 | - And that combination gave us ChatGPT.
00:20:26.260 | - You know, there's a misunderstanding
00:20:28.980 | that ChatGPT is in itself
00:20:33.340 | just one giant large language model.
00:20:36.020 | There's a system around it that's fairly complicated.
00:20:38.660 | Could you explain briefly for the audience
00:20:43.180 | the fine tuning of it,
00:20:45.940 | the reinforcement learning of it,
00:20:47.620 | the various surrounding systems
00:20:52.140 | that allows you to keep it on rails
00:20:54.860 | that allow you to keep it on rails
00:20:59.860 | - Yeah, I can.
00:21:02.580 | So the way to think about it
00:21:05.340 | is that when we train a large neural network
00:21:09.580 | to accurately predict the next word
00:21:11.700 | in lots of different texts from the internet,
00:21:16.500 | what we are doing is that we are learning a world model.
00:21:20.940 | It looks like we are learning this.
00:21:22.580 | It may look on the surface
00:21:25.500 | that we are just learning statistical correlations in text,
00:21:28.500 | but it turns out that to just learn
00:21:33.460 | the statistical correlations in text,
00:21:35.780 | to compress them really well,
00:21:37.900 | what the neural network learns
00:21:40.100 | is some representation of the process that produced the text.
00:21:44.580 | This text is actually a projection of the world.
00:21:49.460 | There is a world out there
00:21:51.340 | and it has a projection on this text.
00:21:54.100 | And so what the neural network is learning
00:21:56.220 | is more and more aspects of the world,
00:21:59.780 | of people, of the human condition,
00:22:02.180 | their hopes, dreams, and motivations,
00:22:05.860 | their interactions in the situations that we are in.
00:22:10.540 | And the neural network learns a compressed,
00:22:13.260 | abstract, usable representation of that.
00:22:17.540 | This is what's being learned
00:22:19.500 | from accurately predicting the next word.
00:22:22.300 | And furthermore, the more accurate you are
00:22:24.940 | at predicting the next word,
00:22:27.180 | the higher the fidelity,
00:22:29.500 | the more resolution you get in this process.
00:22:32.380 | So that's what the pre-training stage does.
00:22:34.540 | But what this does not do
00:22:37.540 | is specify the desired behavior
00:22:41.540 | that we wish our neural network to exhibit.
00:22:44.740 | You see, a language model,
00:22:47.700 | what it really tries to do
00:22:49.900 | is to answer the following question.
00:22:51.700 | If I had some random piece of text on the internet,
00:22:56.660 | which starts with some prefix, some prompt,
00:23:00.020 | what will it complete to?
00:23:03.660 | If you just randomly ended up
00:23:05.620 | on some text from the internet?
00:23:07.580 | But this is different from,
00:23:09.420 | well, I want to have an assistant
00:23:10.700 | which will be truthful, that will be helpful,
00:23:14.180 | that will follow certain rules and not violate them.
00:23:17.900 | That requires additional training.
00:23:19.980 | This is where the fine-tuning
00:23:22.420 | and the reinforcement learning from human teachers
00:23:25.380 | and other forms of AI assistance come in.
00:23:27.380 | It's not just reinforcement learning from human teachers.
00:23:30.140 | It's also reinforcement learning
00:23:31.100 | from human and AI collaboration.
00:23:33.740 | Our teachers are working together with an AI
00:23:35.540 | to teach our AI to behave.
00:23:37.060 | But here we are not teaching it new knowledge.
00:23:40.260 | This is not what's happening.
00:23:41.980 | We are teaching it, we are communicating with it.
00:23:46.540 | We are communicating to it
00:23:48.900 | what it is that we want it to be.
00:23:50.940 | And this process, this second stage,
00:23:54.420 | is also extremely important.
00:23:56.500 | The better we do the second stage,
00:23:58.300 | the more useful, the more reliable
00:24:00.620 | this neural network will be.
00:24:02.300 | So the second stage is extremely important too,
00:24:04.940 | in addition to the first stage of the learn everything,
00:24:08.580 | learn everything, learn as much as you can about the world
00:24:12.860 | from the projection of the world, which is text.
00:24:16.580 | - Now you could tell, you could fine-tune it,
00:24:19.300 | you could instruct it to perform certain things.
00:24:23.380 | Can you instruct it to not perform certain things
00:24:25.700 | so that you could give it guardrails
00:24:27.180 | about avoid these type of behavior,
00:24:29.420 | give it some kind of a bounding box
00:24:31.940 | so that it doesn't wander out of that bounding box
00:24:35.660 | and perform things that are unsafe or otherwise?
00:24:40.660 | - Yeah.
00:24:41.900 | So this second stage of training
00:24:45.060 | is indeed where we communicate to the neural network
00:24:48.940 | anything we want, which includes the bounding box.
00:24:53.420 | And the better we do this training,
00:24:55.620 | the higher the fidelity
00:24:57.500 | with which we communicate this bounding box.
00:24:59.980 | And so with constant research and innovation
00:25:02.460 | on improving this fidelity,
00:25:04.620 | we are able to improve this fidelity,
00:25:09.620 | and so it becomes more and more reliable
00:25:13.220 | and precise in the way in which it follows
00:25:16.620 | the intended instructions.
00:25:19.660 | - ChatGPT came out just a few months ago.
00:25:21.980 | Fastest growing application in the history of humanity.
00:25:27.900 | Lots of interpretations about why,
00:25:34.540 | but some of the things that is clear,
00:25:38.780 | it is the easiest application
00:25:41.580 | that anyone has ever created for anyone to use.
00:25:45.340 | It performs tasks, it performs things,
00:25:49.580 | it does things that are beyond people's expectation.
00:25:53.900 | Anyone can use it.
00:25:55.180 | There are no instruction sets.
00:25:57.060 | There are no wrong ways to use it.
00:25:58.740 | You just use it.
00:26:00.700 | And if your instructions or prompts are ambiguous,
00:26:05.700 | the conversation refines the ambiguity
00:26:08.620 | until your intents are understood by the application,
00:26:13.820 | by the AI.
00:26:14.660 | The impact, of course, clearly remarkable.
00:26:20.980 | Now, yesterday GPT-4 came out,
00:26:25.420 | just a few months later.
00:26:28.020 | The performance of GPT-4 in many areas, astounding.
00:26:33.020 | SAT scores, GRE scores, bar exams,
00:26:38.580 | the number of tests that it's able to perform
00:26:43.660 | at very capable levels, very capable human levels, astounding.
00:26:48.180 | What were the major differences between ChatGPT and GPT-4
00:26:54.180 | that led to its improvements in these areas?
00:26:59.340 | So GPT-4
00:27:01.900 | is a pretty substantial improvement on top of ChatGPT
00:27:09.620 | across very many dimensions.
00:27:12.940 | We trained GPT-4, I would say,
00:27:15.340 | more than six months ago,
00:27:19.780 | maybe eight months ago, I don't remember exactly.
00:27:23.220 | The first big difference between ChatGPT and GPT-4,
00:27:30.020 | and that perhaps is the most important difference,
00:27:35.900 | is that the base on top of which GPT-4 is built
00:27:39.420 | predicts the next word with greater accuracy.
00:27:43.260 | This is really important,
00:27:45.900 | because the better a neural network
00:27:48.260 | can predict the next word in text,
00:27:50.580 | the more it understands it.
00:27:52.380 | This claim is now perhaps accepted by many at this point,
00:27:57.220 | but it might still not be intuitive
00:27:59.500 | or not completely intuitive as to why that is.
00:28:02.340 | So I'd like to take a small detour
00:28:04.220 | and to give an analogy that will hopefully clarify
00:28:07.500 | why more accurate prediction of the next word
00:28:10.860 | leads to more understanding, real understanding.
00:28:13.500 | Let's consider an example.
00:28:17.220 | Say you read a detective novel.
00:28:19.700 | It's like a complicated plot, a storyline,
00:28:22.180 | different characters, lots of events,
00:28:25.220 | mysteries like clues, it's unclear.
00:28:28.140 | Then let's say that at the last page of the book,
00:28:31.820 | the detective has got all the clues,
00:28:33.780 | gathered all the people and saying,
00:28:34.940 | "Okay, I'm going to reveal the identity
00:28:37.740 | of whoever committed the crime."
00:28:40.340 | And that person's name is?
00:28:42.180 | - Predict that word.
00:28:43.020 | - Predict that word, exactly.
00:28:44.340 | - My goodness.
00:28:45.180 | - Right? - Yeah, right.
00:28:46.140 | - Now, there are many different words,
00:28:48.660 | but by predicting those words better and better and better,
00:28:51.500 | the understanding of the text keeps on increasing.
00:28:55.780 | GPT-4 predicts the next word better.
00:28:58.980 | - Ilya, people say that deep learning won't lead to reasoning.
00:29:04.420 | - That deep learning won't lead to reasoning.
00:29:06.820 | But in order to predict that next word,
00:29:09.380 | figure out from all of the agents that were there
00:29:13.180 | and all of their strengths or weaknesses or their intentions
00:29:18.180 | and the context, and to be able to predict that word,
00:29:23.780 | who was the murderer,
00:29:27.020 | that requires some amount of reasoning,
00:29:28.660 | a fair amount of reasoning.
00:29:30.180 | And so how is it that it's able to learn reasoning?
00:29:35.180 | And if it learned reasoning,
00:29:40.180 | you know, one of the things that I was going to ask you
00:29:43.460 | is of all the tests that were taken
00:29:45.780 | between ChatGPT and GPT-4,
00:29:48.860 | there were some tests that GPT-3
00:29:52.220 | or ChatGPT was already very good at.
00:29:55.060 | There were some tests that GPT-3 or ChatGPT
00:29:57.940 | was not as good at that GPT-4 was much better at.
00:30:02.820 | And there were some tests that neither are good at yet.
00:30:06.060 | I would love for it, you know,
00:30:07.140 | and some of it has to do with reasoning, it seems,
00:30:09.940 | that, you know, maybe in calculus,
00:30:12.100 | that it wasn't able to break maybe the problem down
00:30:15.780 | into its reasonable steps and solve it.
00:30:18.540 | But yet in some areas,
00:30:21.980 | it seems to demonstrate reasoning skills.
00:30:24.980 | And so is that an area that in predicting the next word,
00:30:29.980 | you're learning reasoning?
00:30:32.660 | And what are the limitations now of GPT-4
00:30:37.660 | that would enhance its ability to reason even further?
00:30:41.180 | You know, reasoning isn't this super well-defined concept,
00:30:47.460 | but we can try to define it anyway,
00:30:51.780 | which is, maybe, when you go further,
00:30:56.580 | when you're able to somehow think about it a little bit
00:31:00.700 | and get a better answer because of your reasoning.
00:31:03.220 | And I'd say that our neural nets,
00:31:08.900 | you know, maybe there is some kind of limitation
00:31:11.300 | which could be addressed by, for example,
00:31:14.620 | asking the neural network to think out loud.
00:31:16.820 | This has proven to be extremely effective for reasoning,
00:31:20.020 | but I think it also remains to be seen
00:31:21.860 | just how far the basic neural network will go.
00:31:24.260 | I think we have yet to tap, fully tap out its potential.
00:31:29.260 | But yeah, I mean, there is definitely some sense
00:31:36.420 | where reasoning is still not quite at that level
00:31:41.420 | as some of the other capabilities of the neural network,
00:31:45.860 | though we would like the reasoning capabilities
00:31:48.380 | of the neural network to be high, higher.
00:31:51.380 | I think that it's fairly likely that business as usual
00:31:55.900 | will keep improving the reasoning capabilities
00:31:58.140 | of the neural network.
00:31:59.660 | I wouldn't necessarily confidently rule out this possibility.
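A small illustration of the "think out loud" idea Ilya mentions, sometimes called chain-of-thought prompting; the wording and the question are illustrative, not a prescribed format.

```python
question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)

direct_prompt = f"Q: {question}\nA:"

# Asking the model to produce its intermediate reasoning before the
# final answer often improves accuracy on problems like this one
# (the correct answer is $0.05, not the tempting $0.10).
reasoning_prompt = f"Q: {question}\nA: Let's think step by step."

print(direct_prompt)
print(reasoning_prompt)
```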
00:32:04.660 | Yeah, because one of the things that is really cool
00:32:08.780 | is you ask ChatGPT a question,
00:32:12.700 | but before it answers the question, you say,
00:32:14.020 | tell me first what you know, and then answer the question.
00:32:18.180 | You know, usually when somebody answers a question,
00:32:20.140 | if you give me the foundational knowledge that you have
00:32:23.580 | or the foundational assumptions that you're making
00:32:25.460 | before you answer the question,
00:32:27.060 | that really improves my believability of the answer.
00:32:31.940 | You're also demonstrating some level of reason,
00:32:33.940 | well, you're demonstrating reasoning.
00:32:35.820 | And so it seems to me that ChatGPT
00:32:37.420 | has this inherent capability embedded in it.
00:32:40.140 | Yeah.
00:32:41.900 | To some degree.
00:32:42.740 | Yeah.
00:32:43.580 | One way to think about what's happening now
00:32:47.860 | is that these neural networks
00:32:50.100 | have a lot of these capabilities,
00:32:52.100 | they're just not quite reliable enough.
00:32:54.780 | In fact, you could say that reliability
00:32:57.700 | is currently the single biggest obstacle
00:32:59.980 | for these neural networks being useful, truly useful.
00:33:03.620 | If sometimes it is still the case
00:33:08.060 | that these neural networks hallucinate a little bit,
00:33:12.700 | or maybe make some mistakes which are unexpected,
00:33:14.980 | which you wouldn't expect the person to make,
00:33:17.780 | it is this kind of unreliability
00:33:20.340 | that makes them substantially less useful.
00:33:23.460 | But I think that perhaps with a little bit more research,
00:33:26.500 | with the current ideas that we have
00:33:28.060 | and perhaps a few more ambitious research plans,
00:33:33.060 | we'll be able to achieve higher reliability as well.
00:33:36.140 | And that will be truly useful,
00:33:37.780 | that will allow us to have very accurate guardrails,
00:33:42.660 | which are very precise.
00:33:44.220 | That's right.
00:33:45.060 | And it will make it ask for clarification
00:33:47.220 | where it's unsure,
00:33:48.740 | or maybe say that it doesn't know something
00:33:53.140 | when it doesn't know, and do so extremely reliably.
00:33:57.580 | So I'd say that these are some of the bottlenecks, really.
00:34:01.940 | So it's not about whether it exhibits
00:34:04.100 | some particular capability,
00:34:06.140 | but more how reliably, exactly.
00:34:08.780 | Yeah.
00:34:09.820 | You know, speaking of factualness and factfulness,
00:34:14.900 | hallucination, I saw in one of the videos
00:34:19.900 | a demonstration that links to a Wikipedia page.
00:34:25.340 | Retrieval capability,
00:34:29.060 | has that been included in GPT-4?
00:34:31.500 | Is it able to retrieve information from a factual place
00:34:35.980 | that could augment its response to you?
00:34:39.300 | So the current GPT-4, as released,
00:34:44.220 | does not have a built-in retrieval capability.
00:34:47.060 | It is just a really, really good next-word predictor,
00:34:52.060 | which can also consume images, by the way.
00:34:55.180 | We haven't spoken about it.
00:34:56.180 | Yeah, I'm about to ask you about multi-modality.
00:34:57.380 | It is really good at images,
00:34:59.580 | which is also then fine-tuned with data
00:35:02.380 | and various reinforcement learning variants
00:35:06.940 | to behave in a particular way.
00:35:08.540 | It wouldn't surprise me if some of the people
00:35:15.780 | who have access could perhaps request GPT-4
00:35:19.740 | to maybe make some queries
00:35:21.740 | and then populate the results inside the context,
00:35:25.460 | because also the context length of GPT-4
00:35:27.820 | is quite a bit longer now.
00:35:28.780 | Yeah, that's right.
00:35:29.860 | So in short, although GPT-4 does not support
00:35:34.860 | built-in retrieval, it is completely correct
00:35:41.780 | that it will get better with retrieval.
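A sketch of the pattern being described: retrieval done by the caller, with results pasted into the longer context window. Both `search` and `ask_model` are hypothetical stand-ins, not OpenAI API calls.

```python
def search(query: str) -> list[str]:
    # Stand-in for any external search/retrieval system.
    return ["...relevant passage from Wikipedia...", "...another source..."]

def ask_model(prompt: str) -> str:
    # Stand-in for a call to a language model.
    return "(model completion)"

question = "When was the transformer architecture introduced?"
passages = "\n\n".join(search(question))
prompt = (
    "Use the following passages to answer the question.\n\n"
    f"{passages}\n\n"
    f"Question: {question}\nAnswer:"
)
print(ask_model(prompt))
```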
00:35:44.460 | Multi-modality.
00:35:45.740 | GPT-4 has the ability to learn from text and images
00:35:50.740 | and respond to input from text and images.
00:35:56.460 | First of all, the foundation of multi-modality learning,
00:36:01.180 | of course, is that transformers have made it possible
00:36:06.500 | for us to learn from multi-modality,
00:36:08.140 | tokenized text and images.
00:36:11.500 | But at the foundational level,
00:36:16.340 | help us understand how multi-modality enhances
00:36:20.540 | the understanding of the world beyond text by itself.
00:36:25.540 | And my understanding is that when you do multi-modality
00:36:35.220 | learning, that even when it is just a text prompt,
00:36:39.420 | the text prompt, the text understanding
00:36:41.780 | could actually be enhanced.
00:36:43.700 | Tell us about multi-modality at the foundation,
00:36:46.780 | why it's so important and what's the major breakthrough
00:36:50.420 | and the characteristic differences as a result.
00:36:54.020 | So there are two dimensions to multi-modality,
00:36:57.380 | two reasons why it is interesting.
00:37:01.260 | The first reason is a little bit humble.
00:37:06.220 | The first reason is that multi-modality is useful.
00:37:09.580 | It is useful for a neural network to see,
00:37:13.180 | vision in particular, because the world is very visual.
00:37:18.740 | Human beings are very visual animals.
00:37:21.300 | I believe that a third of the human cortex
00:37:26.700 | is dedicated to vision.
00:37:29.500 | And so by not having vision,
00:37:34.500 | the usefulness of our neural networks,
00:37:37.100 | though still considerable, is not as big as it could be.
00:37:41.100 | So it is a very simple usefulness argument.
00:37:45.220 | It is simply useful to see.
00:37:48.260 | And GPT-4 can see quite well.
00:37:52.060 | There is a second reason for vision,
00:37:57.300 | which is that we learn more about the world
00:38:00.020 | by learning from images in addition to learning from text.
00:38:05.100 | That is also a powerful argument,
00:38:09.700 | though it is not as clear cut as it may seem.
00:38:12.700 | I'll give you an example.
00:38:13.940 | Or rather before giving an example,
00:38:17.140 | I'll make the general comment.
00:38:18.660 | For a human being, us human beings,
00:38:22.260 | we get to hear about one billion words
00:38:26.420 | in our entire life.
00:38:28.260 | - Only?
00:38:29.100 | - Only one billion words.
00:38:30.100 | - That's amazing.
00:38:31.020 | - That's not a lot.
00:38:31.860 | - Yeah, that's not a lot.
00:38:33.860 | - So we need to--
00:38:36.020 | - Does that include my own words in my own head?
00:38:38.460 | (laughing)
00:38:39.860 | - Make it two billion, if you want.
00:38:41.660 | But you see what I mean?
00:38:42.740 | - Yeah.
00:38:43.580 | - You know, we can see that because
00:38:45.340 | a billion seconds is 30 years.
00:38:49.220 | So you can kind of see,
00:38:50.060 | like we don't get to see more than a few words a second,
00:38:52.140 | and then we are asleep half the time.
00:38:53.980 | So like a couple billion words
00:38:55.860 | is the total we get in our entire life.
00:38:58.180 | So it becomes really important for us
00:38:59.860 | to get as many sources of information as we can.
00:39:02.700 | And we absolutely learn a lot more from vision.
00:39:05.140 | The same argument holds true
00:39:08.380 | for our neural networks as well,
00:39:10.380 | except for the fact that the neural network
00:39:13.780 | can learn from so many words.
00:39:16.140 | So things which are hard to learn
00:39:20.140 | about the world from text in a few billion words
00:39:24.700 | may become easier from trillions of words.
00:39:28.420 | And I'll give you an example.
00:39:29.860 | Consider colors.
00:39:33.700 | Surely, one needs to see to understand colors.
00:39:38.780 | And yet, the text-only neural networks
00:39:43.180 | who've never seen a single photon in their entire life,
00:39:47.660 | if you ask them which colors are more similar to each other,
00:39:50.940 | it will know that red is more similar to orange than to blue.
00:39:55.060 | It will know that blue is more similar to purple
00:39:57.740 | than to yellow.
00:39:58.660 | How does that happen?
00:40:01.740 | And one answer is that information about the world,
00:40:05.580 | even the visual information,
00:40:07.380 | slowly leaks in through text,
00:40:10.380 | but slowly, not as quickly.
00:40:12.580 | But then you have a lot of text,
00:40:13.660 | you can still learn a lot.
00:40:15.260 | Of course, once you also add vision
00:40:19.500 | and learning about the world from vision,
00:40:20.940 | you will learn additional things
00:40:22.380 | which are not captured in text.
00:40:24.580 | But I would not say that it is a binary,
00:40:28.980 | there are things which are impossible to learn
00:40:31.340 | from text only.
00:40:32.820 | I think there's more of an exchange rate.
00:40:34.940 | And in particular, as you want to learn,
00:40:37.220 | if you are like a human being
00:40:40.940 | and you want to learn from a billion words
00:40:43.700 | or a hundred million words,
00:40:45.420 | then of course the other sources of information
00:40:47.260 | become far more important.
00:40:49.300 | - Yeah.
00:40:51.700 | You learn from images.
00:40:56.100 | Is there a sensibility that would suggest
00:40:59.100 | that if we wanted to understand
00:41:01.540 | also the construction of the world,
00:41:03.900 | as in the arm is connected to my shoulder,
00:41:06.500 | that my elbow is connected,
00:41:08.300 | that somehow these things move,
00:41:10.460 | the animation of the world,
00:41:13.780 | the physics of the world.
00:41:14.980 | If I wanted to learn that as well,
00:41:16.820 | can I just watch videos and learn that?
00:41:18.780 | - Yes.
00:41:19.620 | - And if I wanted to augment all of that with sound,
00:41:23.140 | like for example, if somebody said,
00:41:25.700 | the meaning of great,
00:41:27.500 | great could be great,
00:41:31.220 | or great could be great.
00:41:32.940 | One is sarcastic, one is enthusiastic.
00:41:38.100 | There are many, many words like that.
00:41:40.020 | That's sick, or I'm sick, or I'm sick.
00:41:45.340 | Depending on how people say it,
00:41:46.900 | would audio also make a contribution
00:41:50.420 | to the learning of the model?
00:41:52.660 | And could we put that to good use soon?
00:41:55.420 | - Yes.
00:41:56.740 | I think it's definitely the case that,
00:42:00.100 | well, what can we say about audio?
00:42:02.580 | It's useful, it's an additional source of information.
00:42:05.420 | Probably not as much as images or video,
00:42:09.060 | but there is a case to be made
00:42:11.980 | for the usefulness of audio as well,
00:42:13.780 | both on the recognition side and on the production side.
00:42:18.020 | - When you,
00:42:18.860 | in the context of the scores that I saw,
00:42:23.700 | the thing that was really interesting
00:42:26.020 | was the data that you guys published.
00:42:29.420 | Which of the tests were performed well by GPT-3,
00:42:34.700 | and which of the tests
00:42:35.940 | performed substantially better with GPT-4?
00:42:38.580 | How did multimodality contribute to those tests,
00:42:43.260 | do you think?
00:42:44.260 | - Oh, I mean, in a pretty straightforward way,
00:42:48.900 | anytime there was a test where,
00:42:52.380 | to understand the problem,
00:42:53.780 | you need to look at a diagram.
00:42:55.780 | Like, for example, in some math competitions.
00:42:58.220 | Like, there is a math competition
00:43:01.420 | for high school students called AMC-12.
00:43:03.620 | - AMC-10, yeah. - 12, right?
00:43:05.700 | And there, presumably, many of the problems have a diagram.
00:43:10.700 | So, GPT-3.5 does quite badly on that test.
00:43:16.180 | GPT-4, with text only, does,
00:43:21.580 | I think, I don't remember exactly,
00:43:22.460 | but maybe from a 2% to a 20% success rate.
00:43:27.460 | But then when you add vision, it jumps to 40% success rate.
00:43:31.140 | So the vision is really doing a lot of work.
00:43:33.380 | The vision is extremely good.
00:43:35.100 | And I think being able to reason visually as well
00:43:38.980 | and communicate visually will also be very powerful
00:43:42.900 | and very nice things,
00:43:43.860 | which go beyond just learning about the world.
00:43:46.940 | You have several things.
00:43:47.820 | You can learn about the world.
00:43:50.220 | You can then reason about the world visually.
00:43:52.500 | And you can communicate visually.
00:43:54.820 | Where now, in the future, perhaps, in some future version,
00:43:57.860 | if you ask your neural net, "Hey, explain this to me,"
00:44:00.780 | rather than just producing four paragraphs,
00:44:02.300 | it will produce, "Hey, here's a little diagram
00:44:05.340 | "which clearly conveys to you
00:44:07.140 | "exactly what you need to know."
00:44:08.660 | - Yeah, that's incredible.
00:44:10.260 | You know, one of the things that you said earlier
00:44:11.700 | about an AI generating a test to train another AI,
00:44:16.700 | you know, there was a paper that was written about,
00:44:21.060 | and I don't completely know whether it's factual or not,
00:44:24.980 | but that there's a total amount
00:44:28.020 | of somewhere between four trillion
00:44:29.540 | to something like 20 trillion useful, you know,
00:44:33.660 | tokens, language tokens,
00:44:37.180 | that the world will be able to train on,
00:44:40.260 | you know, over some period of time.
00:44:41.620 | And that would have run out of tokens to train.
00:44:44.260 | And I, well, first of all, I wonder
00:44:47.620 | if you feel the same way.
00:44:49.580 | And then secondarily, whether the AI
00:44:55.260 | generating its own data
00:44:59.980 | could be used to train the AI itself,
00:45:03.300 | which you could argue is a little circular,
00:45:06.140 | but we train our brain with generated data all the time
00:45:11.140 | by self-reflection,
00:45:15.940 | working through a problem in our brain, you know,
00:45:19.660 | and, you know, I guess neuroscientists suggest sleeping,
00:45:25.060 | where we do a fair amount of, you know,
00:45:26.900 | developing our neurons.
00:45:28.140 | How do you see this area of synthetic data generation?
00:45:32.900 | Is that going to be an important part of the future
00:45:34.660 | of training AI and the AI teaching itself?
00:45:38.380 | - Well, I think, like I wouldn't underestimate
00:45:42.540 | the data that exists out there.
00:45:45.020 | I think there's probably more data than people realize.
00:45:50.020 | And as to your second question,
00:45:53.620 | certainly a possibility remains to be seen.
00:45:56.860 | - Yeah, yeah, it really does seem
00:45:59.940 | that one of these days our AIs will, you know,
00:46:04.940 | when we're not using them,
00:46:06.220 | maybe generate either adversarial content for themselves
00:46:09.380 | to learn from, or imagine solving problems
00:46:12.620 | that they can go off and then improve themselves with.
00:46:16.940 | Tell us whatever you can about where we are now
00:46:23.020 | and what do you think we'll be in not too distant future,
00:46:26.860 | but, you know, pick your horizon, a year or two.
00:46:31.420 | What do you think this whole language model area would be
00:46:34.060 | in some of the areas that you're most excited about?
00:46:36.460 | - You know, predictions are hard.
00:46:38.660 | And although it's a little difficult
00:46:42.220 | to say things which are too specific,
00:46:46.580 | I think it's safe to assume that progress will continue.
00:46:52.580 | And that we will keep on seeing systems which astound us
00:46:56.820 | in the things that they can do.
00:47:00.820 | And the current frontiers will be centered around reliability,
00:47:05.500 | around whether the system can be trusted,
00:47:09.020 | really get into a point where we can trust what it produces,
00:47:13.140 | really get into a point where
00:47:14.820 | if it doesn't understand something,
00:47:16.460 | it asks for a clarification,
00:47:18.020 | says that it doesn't know something,
00:47:21.300 | says that it needs more information.
00:47:23.060 | I think those are perhaps the biggest,
00:47:26.300 | the areas where improvement will lead to the biggest impact
00:47:30.700 | on the usefulness of those systems.
00:47:32.860 | Because right now that's really what stands in the way.
00:47:35.340 | You ask a neural net
00:47:36.780 | to maybe summarize some long document
00:47:39.900 | and you get a summary.
00:47:40.980 | Like, are you sure that some important detail
00:47:43.740 | wasn't omitted?
00:47:44.580 | It's still a useful summary,
00:47:46.340 | but it's a different story when you know
00:47:48.940 | that all the important points have been covered.
00:47:52.020 | At some point, and in particular,
00:47:55.180 | it's okay if there is ambiguity, it's fine.
00:47:58.260 | But if a point is clearly important,
00:48:00.860 | such that anyone else who saw that point
00:48:02.700 | would say this is really important,
00:48:04.780 | then the neural network will also recognize that reliably.
00:48:07.420 | That's when you know.
00:48:08.980 | Same for the guardrail,
00:48:10.460 | same for its ability to clearly follow
00:48:13.060 | the intent of the user, of its operator.
00:48:16.860 | So I think we'll see a lot of that in the next two years.
00:48:19.580 | - Yeah, that's terrific,
00:48:20.420 | because the progress in those two areas
00:48:22.300 | will make this technology trusted by people to use
00:48:26.740 | and be able to apply it for so many things.
00:48:28.940 | I was thinking that was gonna be the last question,
00:48:30.660 | but I did have another one, sorry about that.
00:48:32.740 | So, ChatGPT to GPT-4.
00:48:36.620 | GPT-4, when you first started using it,
00:48:40.420 | what are some of the skills that it demonstrated
00:48:44.740 | that surprised even you?
00:48:47.180 | - Well, there were lots of really cool things
00:48:51.420 | that it demonstrated,
00:48:52.580 | which were quite cool and surprising.
00:48:57.580 | It was quite good.
00:49:01.780 | So I'll mention two, so let's see.
00:49:05.220 | I'm just trying to think about the best way to go about it.
00:49:09.620 | The short answer is that the level of its reliability
00:49:13.940 | was surprising.
00:49:15.860 | Where the previous neural networks,
00:49:18.220 | if you asked them a question,
00:49:19.860 | sometimes they might misunderstand something
00:49:22.660 | in a kind of a silly way.
00:49:25.900 | Whereas with GPT-4, that stopped happening.
00:49:28.540 | Its ability to solve math problems became far greater.
00:49:31.940 | It was like it could really do the derivation,
00:49:35.740 | a long, complicated derivation,
00:49:37.780 | and it could convert the units and so on.
00:49:39.660 | And that was really cool.
00:49:41.220 | You know, like many people have--
00:49:42.060 | - It works through a proof.
00:49:43.140 | It works through a proof.
00:49:44.180 | - Yeah. - That's pretty amazing.
00:49:45.100 | - Not all proofs, naturally, but quite a few.
00:49:48.100 | Or another example would be,
00:49:50.060 | like many people noticed that it has the ability
00:49:53.020 | to produce poems with every word
00:49:58.020 | starting with the same letter,
00:49:59.340 | or every word starting with some--
00:50:02.060 | - It follows instructions really, really clearly.
00:50:04.780 | - Not perfectly still, but much better than before.
00:50:07.140 | - Yeah, really good.
00:50:08.300 | - And on the vision side,
00:50:09.980 | I really love how it can explain jokes.
00:50:12.780 | It can explain memes.
00:50:14.380 | You show it a meme and ask it why it's funny,
00:50:16.300 | and it will tell you, and it will be correct.
00:50:18.980 | The vision part, I think,
00:50:21.540 | it's like really actually seeing it
00:50:25.460 | when you can ask follow-up questions
00:50:27.500 | about some complicated image with a complicated diagram
00:50:31.540 | and get an explanation.
00:50:32.500 | That's really cool.
00:50:33.420 | But yeah, overall, I will say, to take a step back,
00:50:38.220 | you know, I've been in this business for quite some time.
00:50:41.340 | Actually, like almost exactly 20 years.
00:50:44.140 | And the thing which I find most surprising
00:50:50.940 | is that it actually works.
00:50:52.780 | (laughing)
00:50:54.380 | - Yeah.
00:50:55.220 | - Like it turned out to be the same little thing all along,
00:50:58.540 | which is no longer little,
00:51:00.300 | and a lot more serious and much more intense,
00:51:03.500 | but it's the same neural network, just larger,
00:51:06.740 | trained on maybe larger data sets in different ways
00:51:09.700 | with the same fundamental training algorithm.
00:51:11.780 | - Yeah.
00:51:12.940 | - So it's like, wow.
00:51:14.820 | I would say this is what I find the most surprising.
00:51:18.380 | - Yeah.
00:51:19.220 | - Whenever I take a step back, I go, how is it possible
00:51:21.420 | that those ideas, those conceptual ideas about,
00:51:23.700 | well, the brain has neurons,
00:51:26.420 | so maybe artificial neurons are just as good,
00:51:29.180 | and so maybe we just need to train them somehow
00:51:30.780 | with some learning algorithm,
00:51:32.020 | that those arguments turned out to be so incredibly correct.
00:51:36.220 | That would be the biggest surprise, I'd say.
00:51:39.900 | - In the 10 years that we've known each other,
00:51:42.820 | the models that you've trained
00:51:47.820 | and the amount of data that you've trained on,
00:51:49.540 | from what you did on AlexNet to now,
00:51:54.540 | has grown about a million times.
00:51:56.900 | And no one in the world of computer science
00:52:01.900 | would have believed that the amount of computation
00:52:05.500 | that was done in that 10 years' time
00:52:07.620 | would be a million times larger
00:52:09.980 | and that you dedicated your career to go do that.
00:52:13.980 | You've done many more, your body of work is incredible,
00:52:19.820 | but two seminal works, the invention,
00:52:22.620 | the co-invention with AlexNet and that early work,
00:52:25.620 | and now with GPT at OpenAI,
00:52:28.820 | it is truly remarkable what you've accomplished.
00:52:32.980 | It's great to catch up with you again, Ilya.
00:52:35.420 | My good friend, and it is quite an amazing moment.
00:52:40.420 | And today's talk, the way you break down the problem
00:52:45.740 | and describe it, this is one of the best PhD,
00:52:50.740 | beyond PhD descriptions of the state-of-the-art
00:52:54.580 | of large language models.
00:52:55.700 | I really appreciate that.
00:52:56.780 | It's great to see you, congratulations.
00:52:58.220 | - Thank you so much. - Yeah, thank you.