
Ilya Sutskever (OpenAI) and Jensen Huang (NVIDIA CEO): AI Today and Vision of the Future (3/2023)


Transcript

Ilya, unbelievable. Today is the day after GPT-4. (laughs) It's great to have you here. I'm delighted to have you. I've known you a long time. The journey, and just my mental memory of the time that I've known you and the seminal work that you have done: starting at the University of Toronto, the co-invention of AlexNet with Alex and Geoff Hinton that led to the big bang of modern artificial intelligence, your career that took you out here to the Bay Area, the founding of OpenAI, GPT-1, 2, 3, and then, of course, ChatGPT, the AI heard around the world.

This is an incredible resume for a young computer scientist; you know, an entire community and industry is in awe of your achievements. I guess I just want to go back to the beginning and ask you about deep learning: what was your intuition around deep learning? Why did you know that it was going to work?

Did you have any intuition that it was going to lead to this kind of success? Okay, well, first of all, thank you so much for the quote, for all the kind words. A lot has changed thanks to the incredible power of deep learning. Like, I think my personal starting point, I was interested in artificial intelligence for a whole variety of reasons, starting from an intuitive understanding or appreciation of its impact, and also I had a lot of curiosity about what is consciousness, what is the human experience, and it felt like progress in artificial intelligence would help with that.

The next step was, well, back then, I was starting in 2002, 2003, and it seemed like learning is the thing that humans can do, that people can do, that computers can't do at all. In 2003, 2002, computers could not learn anything, and it wasn't even clear that it was possible in theory.

And so I thought that making progress in learning, in artificial learning, in machine learning, that would lead to the greatest progress in AI. And then I started to look around for what was out there, and nothing seemed too promising. But to my great luck, Geoff Hinton was a professor at my university, and I was able to find him, and he was working on neural networks, and it immediately made sense, because neural networks had the property that, through learning, we are automatically programming parallel computers.

Back then, the parallel computers were small, but the promise was, if you could somehow figure out how learning in neural networks works, then you can program small parallel computers from data. And it was also similar enough to the brain, and the brain works, so it's like you had these several factors going for it.

Now, it wasn't clear how to get it to work, but of all the things that existed, that seemed like it had by far the greatest long-term promise. Even though, you know— At the time that you first started, at the time that you first started working with deep learning and neural networks, what was the scale of the network?

What was the scale of computing at that moment in time? What was it like? An interesting thing to note was that the importance of scale wasn't realized back then. So people would just train, you know, neural networks with like 50 neurons, 100 neurons, several hundred neurons that would be like a big neural network.

A million parameters would be considered very large. We would run our models on unoptimized CPU code, because we were a bunch of researchers. We didn't know about BLAS. We used MATLAB. The MATLAB was optimized. And we'd just experiment, you know, what is even the right question to ask, you know?

So you try to gather, to just find interesting phenomena, interesting observation. You can do this small thing, and you can do that small thing. You know, Geoff Hinton was really excited about training neural nets on small little digits, both for classification, and also he was very interested in generating them.

So the beginnings of generative models were right there. But the question is like, okay, there's all this cool stuff floating around. What really gets traction? And so it wasn't obvious that this was the right question back then, but in hindsight, that turned out to be the right question.

- Now, the year of AlexNet was 2012. - Yes. - 2012. Now, you and Alex were working on AlexNet for some time before then. And at what point was it clear to you that you wanted to build a computer vision-oriented neural network, that ImageNet was the right set of data to go for, and to somehow go for the computer vision contest?

- Yeah. So I can talk about the context there. It's, I think, probably two years before that, it became clear to me that supervised learning is what's going to get us the traction. And I can explain precisely why. It wasn't just an intuition. It was, I would argue, an irrefutable argument, which went like this.

If your neural network is deep and large, then it could be configured to solve a hard task. So that's the key word, deep and large. People weren't looking at large neural networks. People were maybe studying a little bit of depth in neural networks, but most of the machine learning field wasn't even looking at neural networks at all.

They were looking at all kinds of Bayesian models and kernel methods, which are theoretically elegant methods, which have the property that they actually can't represent a good solution no matter how you configure them. Whereas the large and deep neural network can represent a good solution to the problem. To find that good solution, you need a big data set, and a lot of compute to actually do the work.

We had also made advances in optimization; we had worked on optimization for a little bit. It was clear that optimization was a bottleneck, and there was a breakthrough by another grad student in Geoff Hinton's lab called James Martens, and he came up with an optimization method which is different from the one we're using now.

Some second-order method. But the point about it is that it proved that we could train those neural networks, because before, we didn't even know we could train them. So if you can train them, you make it big, you find the data, and you will succeed. So then the next question is, well, what data?

And the ImageNet data set, back then, seemed like this unbelievably difficult data set. But it was clear that if you were to train a large convolutional neural network on this data set, it must succeed if you just can have the compute. - And right at that time-- - GPUs came out.

- You and I, our history and our paths intersected, and somehow you had made the observation about the GPU, and at that time we were a couple of generations into our CUDA GPUs; I think it was the GTX 580 generation. You had the insight that the GPU could actually be useful for training your neural network models.

What was that, how did that day start? Tell me, you and I, you never told me that moment. How did that day start? - Yeah, so, you know, the GPUs appeared in our lab, in our Toronto lab, thanks to Geoff, and he said, "We should try these GPUs." And we started trying and experimenting with them.

And it was a lot of fun, but it was unclear what to use them for exactly. Where are you going to get the real traction? But then, with the existence of the ImageNet data set, it was also very clear that the convolutional neural network is such a great fit for the GPU that it should be possible to make it go unbelievably fast, and therefore train something which would be completely unprecedented in terms of its size.
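To make concrete why a convolutional network is such a natural fit for the GPU, here is a minimal timing sketch; it assumes PyTorch and, for the second measurement, a CUDA device, and is purely illustrative rather than anything resembling the original AlexNet code.

```python
# Illustrative only: the same convolution, timed on CPU and (if available) on GPU.
import time
import torch
import torch.nn as nn

conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)   # a typical conv layer
x = torch.randn(16, 64, 224, 224)                     # a batch of image-sized activations

def time_forward(module, inp, device):
    module, inp = module.to(device), inp.to(device)
    if device == "cuda":
        torch.cuda.synchronize()                      # make timing accurate on GPU
    start = time.time()
    with torch.no_grad():
        module(inp)
    if device == "cuda":
        torch.cuda.synchronize()
    return time.time() - start

print("cpu :", time_forward(conv, x, "cpu"))
if torch.cuda.is_available():
    print("cuda:", time_forward(conv, x, "cuda"))     # typically far faster than the CPU run
```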

And that's how it happened, and, you know, very fortunately, Alex Krizhevsky, he really loved programming the GPU. (laughing) And he was able to do it, he was able to code, to program really fast convolutional kernels, and then train the neural net on the ImageNet data set, and that led to the result.

But it was like-- - It shocked the world. - It shocked the world. It broke the record of computer vision by such a wide margin that it was a clear discontinuity. - Yeah. - Yeah. - And I would say it's not just like, there is another bit of context there.

It's not so much, like, when you say break the record, there is an important, it's like, I think there's a different way to phrase it. It's that that data set was so obviously hard, and so obviously outside of reach of anything. People were making progress with some classical techniques, and they were actually doing something.

But this thing was so much better on the data set, which was so obviously hard. It's not just that it's just some competition. It was a competition which, back in the day-- - It wasn't an average benchmark. - It was so obviously difficult, and so obviously out of reach, and so obviously with the property that if you did a good job, that would be amazing.

- Big bang of AI. Fast forward to now. You came out to the Valley. You started OpenAI with some friends. You were the chief scientist. Now, what was the first initial idea about what to work on at OpenAI? Because you guys worked on several things. Some of the trails of inventions and work, you could see, led up to the ChatGPT moment.

But what was the initial inspiration? How would you approach intelligence from that moment, and how did it lead to this? - Yeah. So, obviously, when we started, it wasn't 100% clear how to proceed. And the field was also very different compared to the way it is right now.

So right now, we're already used to it; you have these amazing artifacts, these amazing neural nets that are doing incredible things, and everyone is so excited. But back in 2015, 2016, early 2016, when we were starting out, the whole thing seemed pretty crazy. There were so many fewer researchers, maybe there were between 100 and 1,000 times fewer people in the field compared to now.

Like back then, you had like 100 people, most of them were working in Google/DeepMind, and that was that. And then there were people picking up the skills, but it was very, very scarce, very rare still. And we had two big initial ideas at the start of OpenAI that had a lot of staying power, and they stayed with us to this day.

And I'll describe them right now. The first big idea that we had, one which I was especially excited about very early on, is the idea of unsupervised learning through compression. Some context. Today, we take it for granted that unsupervised learning is this easy thing and you just pre-train on everything and it all does exactly as you'd expect.

In 2016, unsupervised learning was an unsolved problem in machine learning that no one had any insight, any clue as to what to do. Yann LeCun would go around and give talks saying that you have this grand challenge of unsupervised learning. And I really believed that really good compression of the data will lead to unsupervised learning.

Now, compression was not language that was commonly used to describe what is really being done until recently, when suddenly it became apparent to many people that those GPTs actually compress the training data. You may recall the Ted Chiang New York Times article which also alluded to this. But there is a real mathematical sense in which training these autoregressive generative models compresses the data.
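For readers who want the mathematical sense spelled out: an autoregressive model defines a code, since by arithmetic coding a sequence can be stored in roughly as many bits as its negative log-probability under the model, so minimizing the next-token cross-entropy loss is minimizing the compressed length of the training data. A sketch of that identity, with p_theta the model:

```latex
% Bits needed to encode a sequence x_{1:T} under an autoregressive model p_\theta
% (via arithmetic coding), which is exactly the training objective summed over tokens:
\mathrm{bits}(x_{1:T}) \;\approx\; -\log_2 p_\theta(x_{1:T})
  \;=\; -\sum_{t=1}^{T} \log_2 p_\theta\!\left(x_t \mid x_{<t}\right)
```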

And intuitively, you can see why that should work. If you compress the data really well, you must extract all the hidden secrets which exist in it. Therefore, that is the key. So that was the first idea that we were really excited about. And that led to quite a few works in OpenAI, to the sentiment neuron, which I'll mention very briefly.

This work might not be well known outside of the machine learning field, but it was very influential, especially in our thinking. This work, like the result there, was that when you train a neural network, and back then it was not a transformer, it was before the transformer, a small recurrent neural network, an LSTM, for those who remember.

- Sequence work, you've done, I mean, this is some of the work that you've done yourself, yeah. - So the same LSTM with a few twists, trained to predict the next token in Amazon reviews, next character. And we discovered that if you predict the next character well enough, there will be a neuron inside that LSTM that corresponds to its sentiment.

So that was really cool, because it showed some traction for unsupervised learning, and it validated the idea that really good next character prediction, next something prediction, compression, has the property that it discovers the secrets in the data. That's what we see with these GPT models, right? You train, and people say, "It's just statistical correlation." I mean, at this point, it should be so clear to anyone.
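A rough sketch of the setup being described, not the original code (which trained a multiplicative LSTM on Amazon review text): a character-level LSTM trained to predict the next byte, whose hidden units one can afterwards probe for one that tracks sentiment. It assumes PyTorch, and all sizes are toy placeholders.

```python
# Toy character-level next-character predictor; after training, individual hidden
# units (hidden[..., i]) can be probed against labeled sentiment to look for a
# "sentiment neuron". Hyperparameters are placeholders, not the original setup.
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    def __init__(self, vocab_size=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.head(h), h, state   # logits, hidden states, recurrent state

model = CharLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# x: byte-encoded review text, shape (batch, time); train to predict each next byte.
x = torch.randint(0, 256, (8, 128))
logits, hidden, _ = model(x[:, :-1])
loss = loss_fn(logits.reshape(-1, 256), x[:, 1:].reshape(-1))
opt.zero_grad()
loss.backward()
opt.step()
```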

- That observation also, for me, intuitively, opened up the whole world of, where do I get the data for unsupervised learning? Because I do have a whole lot of data. If I could just make you predict the next character, and I know what the ground truth is, I know what the answer is, I could train a neural network model with that.

So that observation, and masking, and other technology, other approaches, opened my mind about, where would the world get all the data that's unsupervised for unsupervised learning? - Well, I think, so I would phrase it a little differently. I would say that with unsupervised learning, the hard part has been less around where you get the data from, though that part is there as well, especially now.

But it was more about, why should you do it in the first place? Why should you bother? The hard part was to realize that training these neural nets to predict the next token is a worthwhile goal at all. That was the goal. - That it would learn a representation, that it would be able to understand.

- That's right, that it will be useful. - Grammar and, yeah. - But it just wasn't obvious, so people weren't doing it. But the sentiment neuron work, and I want to call out Alec Radford as a person who really was responsible for many of the advances there, the sentiment neuron work was before GPT-1, it was the precursor to GPT-1, and it influenced our thinking a lot.

Then the transformer came out, and we immediately went, oh my God, this is the thing. And we trained GPT-1. - Now, along the way, you've always believed that scaling will improve the performance of these models. - Yes. Larger networks, deeper networks, more training data would scale it. There was a very important paper that OpenAI wrote about the scaling laws and the relationship between loss and the size of the model and the size of the data set.
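The paper being referred to is presumably OpenAI's scaling-laws work (Kaplan et al., 2020), which reported that test loss falls off as a power law in the number of parameters N and the data set size D; roughly, in that paper's notation:

```latex
% Approximate empirical scaling laws (Kaplan et al., 2020); N_c, D_c are fitted constants.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D},
\qquad \alpha_N \approx 0.076,\ \ \alpha_D \approx 0.095
```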

When transformers came out, it gave us the opportunity to train very, very large models in a very reasonable amount of time. - But did the intuition about the scaling laws and the size of models and data and your journey of GPT-1, 2, 3, which came first? Did you see the evidence of GPT-1 through 3 first, or was it intuition about the scaling law first?

- The intuition, so I would say that the way I'd phrase it is that I had a very strong belief that bigger is better. And that one of the goals that we had at OpenAI is to figure out how to use the scale correctly. There was a lot of belief in OpenAI about scale from the very beginning.

The question is what to use it for precisely. 'Cause I'll mention, right now we're talking about the GPTs, but there's another very important line of work which I haven't mentioned, the second big idea, but I think now is a good time to make a detour, and that's reinforcement learning.

That clearly seems important as well. What do you do with it? So the first really big project that was done inside OpenAI was our effort at solving a real-time strategy game. And for context, a real-time strategy game is like, it's a competitive sport. - Yeah, right. - You need to be smart, you need to have a fast, quick reaction time, there's teamwork, and you're competing against another team.

And it's pretty, it's pretty involved. And there is a whole competitive league for that game. The game is called Dota 2. And so we trained a reinforcement learning agent to play against itself, with the goal of reaching a level so that it could compete against the best players in the world.

And that was a major undertaking as well. It was a very different line of work. It was reinforcement learning. - Yeah, I remember the day that you guys announced that work. And this is, by the way, when I was asking earlier about, there's a large body of work that has come out of OpenAI.

Some of it seemed like detours, but in fact, as you're explaining now, they might have seemed like detours, seemingly detours, but they really led up to some of the important work that we're now talking about, ChatGPT. - Yeah. I mean, there has been real convergence where the GPTs produce the foundation.

And the reinforcement learning on Dota morphed into reinforcement learning from human feedback. - That's right. - And that combination gave us ChatGPT. - You know, there's a misunderstanding that ChatGPT is in itself just one giant large language model. There's a system around it that's fairly complicated.

Could you explain briefly for the audience the fine tuning of it, the reinforcement learning of it, the various surrounding systems that allows you to keep it on rails and give it knowledge and so on and so forth? - Yeah, I can. So the way to think about it is that when we train a large neural network to accurately predict the next word in lots of different texts from the internet, what we are doing is that we are learning a world model.

It looks like we are learning this. It may look on the surface that we are just learning statistical correlations in text, but it turns out that to just learn the statistical correlations in text, to compress them really well, what the neural network learns is some representation of the process that produced the text.

This text is actually a projection of the world. There is a world out there, and it has a projection on this text. And so what the neural network is learning is more and more aspects of the world, of people, of the human condition, their hopes, dreams, and motivations, their interactions in the situations that we are in.

And the neural network learns a compressed, abstract, usable representation of that. This is what's being learned from accurately predicting the next word. And furthermore, the more accurate you are at predicting the next word, the higher the fidelity, the more resolution you get in this process. So that's what the pre-training stage does.
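The objective computed in this pre-training stage can be written down very compactly; here is a minimal sketch, assuming PyTorch, where `model` is any autoregressive network returning (batch, time, vocab) logits and `tokens` is a batch of token ids, both of which are placeholders.

```python
# One pre-training step: predict every next token, measured by cross-entropy.
import torch.nn.functional as F

def pretrain_step(model, tokens, optimizer):
    inputs, targets = tokens[:, :-1], tokens[:, 1:]      # shift by one position
    logits = model(inputs)                               # (batch, time, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Lower loss means more accurate next-word prediction, i.e. higher "fidelity".
    return loss.item()
```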

But what this does not do is specify the desired behavior that we wish our neural network to exhibit. You see, a language model, what it really tries to do is to answer the following question. If I had some random piece of text on the internet, which starts with some prefix, some prompt, what will it complete to?

If you just randomly ended up on some text from the internet? But this is different from, well, I want to have an assistant which will be truthful, that will be helpful, that will follow certain rules and not violate them. That requires additional training. This is where the fine-tuning and the reinforcement learning from human teachers, and other forms of AI assistance, come in.

It's not just reinforcement learning from human teachers. It's also reinforcement learning from human and AI collaboration. Our teachers are working together with an AI to teach our AI to behave. But here we are not teaching it new knowledge. This is not what's happening. We are teaching it, we are communicating with it.

We are communicating to it what it is that we want it to be. And this process, this second stage, is also extremely important. The better we do the second stage, the more useful, the more reliable this neural network will be. So the second stage is extremely important too, in addition to the first stage of learn everything, learn as much as you can about the world from the projection of the world, which is text.
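One common way this second stage is implemented, the general RLHF recipe rather than necessarily OpenAI's exact pipeline, is to fit a reward model on comparisons provided by the human and human-plus-AI teachers, then optimize the language model against it with reinforcement learning. A minimal sketch of the reward-model objective, assuming PyTorch, where `reward_model` is a placeholder network mapping a token sequence to a scalar score:

```python
# Preference (ranking) loss used to train a reward model from teacher comparisons;
# the language model is afterwards tuned with RL (e.g. PPO) to maximize this reward.
import torch
import torch.nn.functional as F

def preference_loss(reward_model, prompt, better, worse):
    # Teachers judged `better` preferable to `worse` for this prompt.
    r_better = reward_model(torch.cat([prompt, better]))
    r_worse = reward_model(torch.cat([prompt, worse]))
    # Bradley-Terry style objective: push r(better) above r(worse).
    return -F.logsigmoid(r_better - r_worse).mean()
```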

- Now you could tell, you could fine-tune it, you could instruct it to perform certain things. Can you instruct it to not perform certain things so that you could give it guardrails about avoid these type of behavior, give it some kind of a bounding box so that it doesn't wander out of that bounding box and perform things that are unsafe or otherwise?

- Yeah. So this second stage of training is indeed where we communicate to the neural network anything we want, which includes the bounding box. And the better we do this training, the higher the fidelity with which we communicate this bounding box. And so with constant research and innovation on improving this fidelity, we improve this fidelity, and it becomes more and more reliable and precise in the way in which it follows the intended instructions.

- ChatGPT came out just a few months ago. Fastest growing application in the history of humanity. Lots of interpretations about why, but one of the things that is clear: it is the easiest application that anyone has ever created for anyone to use. It performs tasks, it performs things, it does things that are beyond people's expectations.

Anyone can use it. There are no instruction sets. There are no wrong ways to use it. You just use it. And if your instructions or prompts are ambiguous, the conversation refines the ambiguity until your intents are understood by the application, by the AI. The impact, of course, clearly remarkable.

Now, yesterday, this is the day after GPT-4, just a few months later. The performance of GPT-4 in many areas, astounding. SAT scores, GRE scores, bar exams, the number of tests that it's able to perform at very capable levels, very capable human levels, astounding. What were the major differences between ChatGPT and GPT-4 that led to its improvements in these areas?

So GPT-4 is a pretty substantial improvement on top of ChatGPT across very many dimensions. We trained GPT-4, I would say, more than six months ago, maybe eight months ago, I don't remember exactly. The first big difference between ChatGPT and GPT-4, and that perhaps is the most important difference, is that the base on top of which GPT-4 is built predicts the next word with greater accuracy.

This is really important, because the better a neural network can predict the next word in text, the more it understands it. This claim is now perhaps accepted by many at this point, but it might still not be intuitive or not completely intuitive as to why that is. So I'd like to take a small detour and to give an analogy that will hopefully clarify why more accurate prediction of the next word leads to more understanding, real understanding.

Let's consider an example. Say you read a detective novel. It's like a complicated plot, a storyline, different characters, lots of events, mysteries like clues, it's unclear. Then let's say that at the last page of the book, the detective has got all the clues, gathered all the people and saying, "Okay, I'm going to reveal the identity of whoever committed the crime." And that person's name is?

- Predict that word. - Predict that word, exactly. - My goodness. - Right? - Yeah, right. - Now, there are many different words, but by predicting those words better and better and better, the understanding of the text keeps on increasing. GPT-4 predicts the next word better. - Ilya, people say that deep learning won't lead to reasoning.

- That deep learning won't lead to reasoning. But in order to predict that next word, figure out from all of the agents that were there and all of their strengths or weaknesses or their intentions and the context, and to be able to predict that word, who was the murderer, that requires some amount of reasoning, a fair amount of reasoning.

And so how is it that it's able to learn reasoning? And if it learned reasoning, you know, one of the things that I was going to ask you is, of all the tests that were taken between ChatGPT and GPT-4, there were some tests that GPT-3 or ChatGPT was already very good at.

There were some tests that GPT-3 or ChatGPT was not as good at that GPT-4 was much better at. And there were some tests that neither are good at yet. I would love for it, you know, and some of it has to do with reasoning, it seems, that, you know, maybe in calculus, it wasn't able to break the problem down into its reasonable steps and solve it.

But yet in some areas, it seems to demonstrate reasoning skills. And so is that an area where, in predicting the next word, you're learning reasoning? And what are the limitations now of GPT-4 that would enhance its ability to reason even further? You know, reasoning isn't this super well-defined concept, but we can try to define it anyway: it's when you're able to somehow think about something a little bit and get a better answer because of your reasoning.

And I'd say that our neural nets, you know, maybe there is some kind of limitation which could be addressed by, for example, asking the neural network to think out loud. This has proven to be extremely effective for reasoning, but I think it also remains to be seen just how far the basic neural network will go.
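What "asking the neural network to think out loud" looks like in practice is usually called chain-of-thought prompting; here is a toy illustration, where `ask_model` is a hypothetical stand-in for whatever model API one is using.

```python
# Toy illustration of chain-of-thought prompting ("think out loud before answering").
def ask_model(prompt: str) -> str:
    raise NotImplementedError("hypothetical stand-in for a call to a language model")

question = "A train travels 120 km in 1.5 hours. At that speed, how far does it go in 4 hours?"

direct_prompt = question
reasoned_prompt = (
    question
    + "\nFirst explain your reasoning step by step, then give the final answer."
)

# answer = ask_model(reasoned_prompt)
# The second prompt tends to elicit intermediate steps
# (speed = 120 / 1.5 = 80 km/h, then 80 * 4 = 320 km) and a more reliable answer.
```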

I think we have yet to fully tap out its potential. But yeah, I mean, there is definitely some sense where reasoning is still not quite at the level of some of the other capabilities of the neural network, though we would like the reasoning capabilities of the neural network to be higher.

I think that it's fairly likely that business as usual will keep improving the reasoning capabilities of the neural network. I wouldn't necessarily confidently rule out this possibility. Yeah, because one of the things that is really cool is you ask ChatGPT a question, but before it answers the question, you say, tell me first what you know, and then answer the question.

You know, usually when somebody answers a question, if you give me the foundational knowledge that you have, or the foundational assumptions that you're making, before you answer the question, that really improves the believability of the answer for me. You're also demonstrating some level of reason, well, you're demonstrating reasoning. And so it seems to me that ChatGPT has this inherent capability embedded in it.

Yeah. To some degree. Yeah. One way to think about what's happening now is that these neural networks have a lot of these capabilities, they're just not quite very reliable. In fact, you could say that reliability is currently the single biggest obstacle to these neural networks being useful, truly useful.

If it is sometimes still the case that these neural networks hallucinate a little bit, or maybe make some mistakes which are unexpected, which you wouldn't expect a person to make, it is this kind of unreliability that makes them substantially less useful. But I think that perhaps with a little bit more research, with the current ideas that we have, and perhaps a few more ambitious research plans, we'll be able to achieve higher reliability as well.

And that will be truly useful, that will allow us to have very accurate guardrails, which are very precise. That's right. And it will make it ask for clarification where it's unsure, or maybe say that it doesn't know something when it doesn't know, and do so extremely reliably. So I'd say that these are some of the bottlenecks, really.

So it's not about whether it exhibits some particular capability, but more how reliably it does so, exactly. Yeah. You know, speaking of factualness and hallucination, I saw in one of the videos a demonstration that links to a Wikipedia page. Has retrieval capability been included in GPT-4? Is it able to retrieve information from a factual place that could augment its response to you?

So the current GPT-4, as released, does not have a built-in retrieval capability. It is just a really, really good next-word predictor, which can also consume images, by the way. We haven't spoken about it. Yeah, I was about to ask you about multi-modality. It is really good at images, and it is also then fine-tuned with data and various reinforcement learning variants to behave in a particular way.

I'm sure someone will; it wouldn't surprise me if some of the people who have access could perhaps request GPT-4 to maybe make some queries and then populate the results inside the context, because also the context duration of GPT-4 is quite a bit longer now. Yeah, that's right.
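The pattern being alluded to, making some queries and populating the results inside the longer context, looks roughly like this; `search` and `ask_model` are hypothetical placeholders, not built-in GPT-4 features.

```python
# Sketch of retrieval-augmented prompting: fetch sources, place them in the context.
def search(query: str, k: int = 3) -> list[str]:
    raise NotImplementedError("hypothetical stand-in for a web or document search")

def ask_model(prompt: str) -> str:
    raise NotImplementedError("hypothetical stand-in for a call to the language model")

def answer_with_retrieval(question: str) -> str:
    passages = search(question)
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using only the sources below, and cite them.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return ask_model(prompt)
```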

So in short, although GPT-4 does not support built-in retrieval, it is completely correct that it will get better with retrieval. Multi-modality. GPT-4 has the ability to learn from text and images and respond to input from text and images. First of all, the foundation of multi-modality learning: of course, transformers have made it possible for us to learn from multi-modality, from tokenized text and images.

But at the foundational level, help us understand how multi-modality enhances the understanding of the world beyond text by itself. And my understanding is that when you do multi-modality learning, that even when it is just a text prompt, the text prompt, the text understanding could actually be enhanced. Tell us about multi-modality at the foundation, why it's so important and what's the major breakthrough and the characteristic differences as a result.

So there are two dimensions to multi-modality, two reasons why it is interesting. The first reason is a little bit humble. The first reason is that multi-modality is useful. It is useful for a neural network to see, vision in particular, because the world is very visual. Human beings are very visual animals.

I believe that a third of the human cortex is dedicated to vision. And so by not having vision, the usefulness of our neural networks, though still considerable, is not as big as it could be. So it is a very simple usefulness argument. It is simply useful to see. And GPT-4 can see quite well.

There is a second reason for vision, which is that we learn more about the world by learning from images in addition to learning from text. That is also a powerful argument, though it is not as clear cut as it may seem. I'll give you an example. Or rather, before giving an example, I'll make a general comment.

For a human being, us human beings, we get to hear about one billion words in our entire life. - Only? - Only one billion words. - That's amazing. - That's not a lot. - Yeah, that's not a lot. - So we need to-- - Does that include my own words in my own head?

(laughing) - Make it two billion, if you want. But you see what I mean? - Yeah. - You know, we can see that because a billion seconds is 30 years. So you can kind of see, like we don't get to see more than a few words a second, and then we are asleep half the time.

So like a couple billion words is the total we get in our entire life. So it becomes really important for us to get as many sources of information as we can. And we absolutely learn a lot more from vision. The same argument holds true for our neural networks as well, except for the fact that the neural network can learn from so many more words.
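The arithmetic behind "a billion seconds is 30 years" and "a couple billion words in a lifetime" checks out with rough numbers; the lifespan, waking fraction, and words-per-second figures below are, of course, just assumptions for illustration.

```python
# Back-of-the-envelope check of the lifetime word budget; all inputs are rough assumptions.
SECONDS_PER_YEAR = 365.25 * 24 * 3600            # ~3.16e7 seconds

print(30 * SECONDS_PER_YEAR)                     # ~9.5e8: a billion seconds is about 30 years

lifespan_years = 75
awake_fraction = 0.5                             # "asleep half the time"
words_per_second = 2                             # "a few words a second"

waking_seconds = lifespan_years * SECONDS_PER_YEAR * awake_fraction
print(waking_seconds * words_per_second)         # ~2.4e9: a couple billion words in a lifetime
```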

So things which are hard to learn about the world from text in a few billion words may become easier from trillions of words. And I'll give you an example. Consider colors. Surely, one needs to see to understand colors. And yet, the text-only neural networks, which have never seen a single photon in their entire life, if you ask them which colors are more similar to each other, they will know that red is more similar to orange than to blue.

It will know that blue is more similar to purple than to yellow. How does that happen? And one answer is that information about the world, even the visual information, slowly leaks in through text, but slowly, not as quickly. But then you have a lot of text, you can still learn a lot.
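The color claim can be made concrete as a statement about representations learned from text alone; here is a purely illustrative sketch, where `embed` is a placeholder for any text-only embedding model.

```python
# Illustrative only: a text-trained embedding is expected to place "red" closer to
# "orange" than to "blue" under cosine similarity, despite never having seen an image.
import numpy as np

def embed(word: str) -> np.ndarray:
    raise NotImplementedError("hypothetical stand-in for a text-trained embedding lookup")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def red_is_closer_to_orange_than_blue() -> bool:
    red, orange, blue = embed("red"), embed("orange"), embed("blue")
    return cosine(red, orange) > cosine(red, blue)   # expected: True for text-only models
```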

Of course, once you also add vision and learning about the world from vision, you will learn additional things which are not captured in text. But I would not say that it is a binary, there are things which are impossible to learn from text only. I think there's more of an exchange rate.

And in particular, as you want to learn, if you are like a human being and you want to learn from a billion words or a hundred million words, then of course the other sources of information become far more important. - Yeah. You learn from images. Is there a sensibility that would suggest that if we wanted to understand also the construction of the world, as in the arm is connected to my shoulder, that my elbow is connected, that somehow these things move, the animation of the world, the physics of the world.

If I wanted to learn that as well, can I just watch videos and learn that? - Yes. - And if I wanted to augment all of that with sound, like for example, if somebody said the word "great": "great" could be enthusiastic, or "great" could be sarcastic. One is sarcastic, one is enthusiastic.

There are many, many words like that. That's sick, or I'm sick, or I'm sick. Depending on how people say it, would audio also make a contribution to the learning of the model? And could we put that to good use soon? - Yes. I think it's definitely the case that, well, what can we say about audio?

It's useful, it's an additional source of information. Probably not as much as images or video, but there is a case to be made for the usefulness of audio as well, both on the recognition side and on the production side. - When you, in the context of the scores that I saw, the thing that was really interesting was the data that you guys published.

Which of the tests were performed well by GPT-3, and which of the tests performed substantially better with GPT-4? How did multimodality contribute to those tests, do you think? - Oh, I mean, in a pretty straightforward way, anytime there was a test where, to understand the problem, you need to look at a diagram.

Like, for example, in some math competitions. Like, there is a math competition for high school students called the AMC 12. - The AMC 10, yeah. - 12, right? And there, presumably, many of the problems have a diagram. So, GPT-3.5 does quite badly on that test. GPT-4, with text only, does, I think, I don't remember exactly, but it's like maybe a 2% to 20% success rate.

But then when you add vision, it jumps to 40% success rate. So the vision is really doing a lot of work. The vision is extremely good. And I think being able to reason visually as well and communicate visually will also be very powerful and very nice things, which go beyond just learning about the world.

You have several things. You can learn about the world. You can then reason about the world visually. And you can communicate visually. Where now, in the future, perhaps, in some future version, if you ask your neural net, "Hey, explain this to me," rather than just producing four paragraphs, it will produce, "Hey, here's a little diagram "which clearly conveys to you "exactly what you need to know." - Yeah, that's incredible.

You know, one of the things that you said earlier, about an AI generating data to train another AI, you know, there was a paper that was written, and I don't completely know whether it's factual or not, but that there's a total amount of somewhere between four trillion to something like 20 trillion useful, you know, tokens, language tokens, that the world will be able to train on, you know, over some period of time.

And that we would have run out of tokens to train on. And I, well, first of all, I wonder if you feel the same way. And then secondarily, whether the AI generating its own data could be used to train the AI itself, which you could argue is a little circular, but we train our brain with generated data all the time by self-reflection, working through a problem in our brain, you know, and, you know, I guess neuroscientists suggest sleeping.

We do a fair amount of, you know, developing our neurons. How do you see this area of synthetic data generation? Is that going to be an important part of the future of training AI, and the AI teaching itself? - Well, I think, like, I wouldn't underestimate the data that exists out there.

I think there's probably more data than people realize. And as to your second question, it's certainly a possibility; it remains to be seen. - Yeah, yeah, it really does seem that one of these days our AIs are, you know, when we're not using them, maybe generating either adversarial content for themselves to learn from, or imagining solving problems that they can go off and then improve themselves with.

Tell us whatever you can about where we are now and what do you think we'll be in not too distant future, but, you know, pick your horizon, a year or two. What do you think this whole language model area would be in some of the areas that you're most excited about?

- You know, predictions are hard. And although it's a little difficult to say things which are too specific, I think it's safe to assume that progress will continue, and that we will keep on seeing systems which astound us in the things that they can do. And the current frontiers will be centered around reliability, around whether the system can be trusted, really getting to a point where we can trust what it produces, really getting to a point where, if it doesn't understand something, it asks for a clarification, says that it doesn't know something, or says that it needs more information.

I think those are perhaps the biggest areas where improvement will lead to the biggest impact on the usefulness of those systems. Because right now, that's really what stands in the way. You ask a neural net to maybe summarize some long document and you get a summary.

Like, are you sure that some important detail wasn't omitted? It's still a useful summary, but it's a different story when you know that all the important points have been covered. At some point, and in particular, it's okay if there is ambiguity, that's fine. But if a point is clearly important, such that anyone else who saw that point would say this is really important, then the neural network will also recognize that reliably.

That's when you know. Same for the guardrail, same for its ability to clearly follow the intent of the user, of its operator. So I think we'll see a lot of that in the next two years. - Yeah, that's terrific, because the progress in those two areas will make this technology trusted by people to use and be able to apply it for so many things.

I was thinking that was gonna be the last question, but I did have another one, sorry about that. So, ChatGPT to GPT-4. GPT-4, when you first started using it, what are some of the skills that it demonstrated that surprised even you? - Well, there were lots of really cool things that it demonstrated, which were quite cool and surprising.

It was quite good. So I'll mention two, so let's see. I'm just trying to think about the best way to go about it. The short answer is that the level of its reliability was surprising. Where the previous neural networks, if you asked them a question, sometimes they might misunderstand something in a kind of a silly way.

Whereas with GPT-4, that stopped happening. Its ability to solve math problems became far greater. It was like it could really do the derivation, a long, complicated derivation, and it could convert the units and so on. And that was really cool. You know, like many people have-- - It works through a proof.

It works through a proof. - Yeah. - That's pretty amazing. - Not all proofs, naturally, but quite a few. Or another example would be, like many people noticed that it has the ability to produce poems with every word starting with the same letter, or every word starting with some-- - It follows instructions really, really clearly.

- Not perfectly still, but much better than before. - Yeah, really good. - And on the vision side, I really love how it can explain jokes. It can explain memes. You show it a meme and ask it why it's funny, and it will tell you, and it will be correct.

The vision part, I think, was also very striking; it's like really actually seeing it, when you can ask follow-up questions about some complicated image with a complicated diagram and get an explanation. That's really cool. But yeah, overall, I will say, to take a step back, you know, I've been in this business for quite some time.

Actually, like almost exactly 20 years. And the thing which I find most surprising is that it actually works. (laughing) - Yeah. - Like it turned out to be the same little thing all along, which is no longer little, and a lot more serious and much more intense, but it's the same neural network, just larger, trained on maybe larger data sets in different ways with the same fundamental training algorithm.

- Yeah. - So it's like, wow. I would say this is what I find the most surprising. - Yeah. - Whenever I take a step back, I go, how is it possible that those ideas, those conceptual ideas about, well, the brain has neurons, so maybe artificial neurons are just as good, and so maybe we just need to train them somehow with some learning algorithm, that those arguments turned out to be so incredibly correct.

That would be the biggest surprise, I'd say. - In the 10 years that we've known each other, the models that you've trained and the amount of data that you've trained from what you did on AlexNet to now is about a million times. And no one in the world of computer science would have believed that the amount of computation that was done in that 10 years' time would be a million times larger and that you dedicated your career to go do that.

You've done many more, your body of work is incredible, but two seminal works: the co-invention of AlexNet and that early work, and now with GPT at OpenAI. It is truly remarkable what you've accomplished. It's great to catch up with you again, Ilya. You're a good friend, and it is quite an amazing moment.

And today's talk, the way you break down the problem and describe it, this is one of the best PhD, beyond PhD descriptions of the state-of-the-art of large language models. I really appreciate that. It's great to see you, congratulations. - Thank you so much. - Yeah, thank you.