The GPT-4 technical report is one of the most interesting documents I have ever read, but I feel like the media is largely missing the story. They are either not covering it at all, or are focusing on the same stuff: the $10 billion Microsoft investment, how GPT-4 can write poems, and whether or not the demo contained a mistake.
Instead, I want to give you 9 insights from the report that I think will affect us all in the coming months and years. If you haven't watched my video from the night of the release, do check that out afterwards for more quite stunning details. When I concluded that video, I talked about how I found it kind of concerning that they gave GPT-4 some money, allowed it to execute code and do chain-of-thought reasoning, and even to delegate to copies of itself.
Now it did fail that test, which is fortunate for all of us, but there are a couple of key details I want to focus on. The first was that the research centre that was testing this ability did not have access to the final version of the model that "we deployed", the "we" being OpenAI.
They go on and say the final version has capability improvements relevant to some of the factors that limited the earlier model's power-seeking abilities, such as longer context length. Meaning that crazy experiment wasn't testing GPT-4's final form. But there was something else that they tested that I really want to point out.
They were testing whether GPT-4 would try to avoid being shut down in the wild. Now, many people have criticised this test, while others have praised it as necessary. But my question is this: what would have happened if it had succeeded? Or what if a future model does avoid being shut down in the wild?
Now again, GPT-4 did prove ineffective at replicating itself and avoiding being shut down. But they must have thought that it was at least possible, otherwise they wouldn't have done the test. And that is a concerning prospect. Which leads me to the second insight. Buried in a footnote, it says that OpenAI will soon publish additional thoughts on the social and economic implications, and I'm going to talk about those in a moment, including the need for effective regulation.
It is quite rare for an industry to ask for regulation of itself. In fact, Sam Altman put it even more starkly than this. When this person said, "Watch Sam Altman never say we need more regulation on AI." How did he reply? "We definitely need more regulation on AI." The industry is calling out to be regulated, but we shall see what ends up happening.
Next, on page 57, there was another interesting revelation. It said, "One concern of particular importance to OpenAI is the risk of racing dynamics leading to a decline in safety standards, the diffusion of bad norms, and accelerated AI timelines." That's what they're concerned about, accelerated AI timelines. But this seems at least mildly at odds with the noises coming from Microsoft leadership.
In a leaked conversation, it was revealed that the pressure from Kevin Scott and CEO Satya Nadella is very, very high to take these most recent OpenAI models and the ones that come after them and move them into customers' hands at very high speed. Now, some will love this news and others will be concerned about it, but either way, it does seem to slightly contradict the desire to avoid AI accelerationism.
Next, there was a footnote that restated a very bold pledge: that if another company was approaching AGI before OpenAI did, OpenAI would commit to stop competing with and start assisting that project, and that the trigger for this would occur when there was a better-than-even chance of success in the next two years.
Now, Sam Altman and OpenAI have defined AGI as AI systems that are generally smarter than humans. So that either means that they think we're more than two years away from that; or that they have dropped everything and are working with another company, although I think we'd all have heard about that; or, third, that the definition is so vague that it's quite non-committal.
Please do let me know your thoughts in the comments. Next insight is that OpenAI employed superforecasters to help them predict what would happen when they deployed GPT-4. In this extract, it just talks about expert forecasters, but when you go into the appendices, you find out that they're talking about superforecasters.
Who are these guys? Essentially, they're people who have proven that they can forecast the future pretty well, or at least 30% better than intelligence analysts. OpenAI wanted to know what these guys thought would happen when they deployed the model, and hear their recommendations about avoiding risks. Interestingly, these forecasters predicted several things would reduce acceleration, including delaying the deployment of GPT-4 by a further six months.
That would have taken us to around the autumn of this year. Now, clearly, OpenAI didn't take up that advice. Perhaps due to the pressure from Microsoft? We don't know. There were quite a few benchmarks released in the technical report.
There's one in particular I want to highlight today. I looked through all of these benchmarks, but it was HellaSwag that I wanted to focus on. First of all, because it's interesting, and second of all, because of the gap between GPT-4 and the previous state of the art. The headline is this.
GPT-4 has, by some estimations, reached a human level of common sense. Now, I know that's not as dramatic as passing the bar exam, but it's nevertheless a milestone for humanity. How is common sense tested, and how do I know that it's comparable to human performance? Well, I dug into the literature and found the questions and examples myself.
Feel free to pause and read through these examples yourself. But essentially, it's testing what is the most likely, most common-sense thing to occur next. And I want to draw your attention to this sentence from the HellaSwag paper: "These questions are trivial for humans, with 95% accuracy. State-of-the-art models struggle, with less than 48% accuracy." GPT-4 was 95.3% accurate, remember. But let's find the exact number for humans further on in the paper. And here it is: overall, 95.6 or 95.7. Almost exactly the same as GPT-4.
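To make the setup concrete, here is a minimal sketch of how a HellaSwag-style item is typically scored. This is my own illustration, not OpenAI's evaluation harness, and `sentence_log_prob` is a hypothetical stand-in for whichever language model you are testing.

```python
# Minimal sketch of scoring one HellaSwag-style multiple-choice item.
# `sentence_log_prob` is a hypothetical placeholder for a real language model.

def sentence_log_prob(text: str) -> float:
    """Hypothetical: return the model's log-probability of `text`."""
    raise NotImplementedError("plug a real language model in here")

def score_item(context: str, endings: list[str], correct_index: int) -> bool:
    # The model "answers" by picking whichever ending it finds most likely
    # as a continuation of the context, i.e. the most common-sense one.
    scores = [sentence_log_prob(context + " " + ending) for ending in endings]
    prediction = scores.index(max(scores))
    return prediction == correct_index

# Benchmark accuracy is the fraction of items answered correctly;
# humans sit around 95.6%, and GPT-4 reports 95.3%.
```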
The next insight is about timelines. Remember, they had this model available in August of last year. That's GPT-4 being completed quite a few months before they released ChatGPT, which was based on GPT-3.5. So what explains the long gap? They spent around eight months on safety research, risk assessment, and iteration. I talk about this in my GPT-5 video, but let me restate: they had GPT-4 available before they even released ChatGPT.
This made me reflect on the timelines for GPT-5. The time taken to actually train GPT-5 probably won't be that long. It's already pretty clear that they're training it on NVIDIA's H100 Tensor Core GPUs. And look at how much faster they are. For this 400-billion-parameter example, it would take only 20 hours to train with 8,000 H100s, versus seven days with A100s.
But what am I trying to say? I'm saying that GPT-5 may already be done, but that what will follow is months and months, possibly a year or more of safety research and risk assessment. By the way, 400 billion parameters sounds about right for GPT-5, perhaps trained on four to five trillion tokens.
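For what it's worth, here is the back-of-the-envelope arithmetic behind that claim. The 400-billion-parameter and four-to-five-trillion-token figures are my speculation from above, the 6·N·D FLOPs rule of thumb is a standard approximation, and the per-GPU throughput is a rough assumption of mine, not an official NVIDIA or OpenAI figure.

```python
# Rough, speculative estimate of raw training time for a GPT-5-sized run.
# Every number here is an assumption for illustration only.

params = 400e9                    # speculative parameter count
tokens = 4.5e12                   # speculative training tokens
train_flops = 6 * params * tokens # ~1.1e25 FLOPs, using the standard 6*N*D rule of thumb

num_gpus = 8_000
effective_flops_per_gpu = 4e14    # assume ~400 TFLOP/s sustained per H100 after utilisation losses

seconds = train_flops / (num_gpus * effective_flops_per_gpu)
print(f"~{seconds / 86_400:.0f} days of training")   # on the order of 40 days
```

In other words, even with generous margins for error, the raw training run looks like weeks rather than years, which is why I expect the safety and evaluation phase to dominate the calendar.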
Again, check out my GPT-5 video. Next, they admit that there's a double-edged sword with the economic impact of GPT-4. They say it may lead to the full automation of certain jobs, and they talk about how it's going to impact even professions like the legal profession. But they also mention, and back up with research, the insane productivity gains in the meantime.
I read through each of the studies they linked to, and some of them are fascinating. One of the studies includes an experiment where they got together a bunch of marketers, grant writers, consultants, data analysts, human resource professionals, and managers. They gave them a bunch of realistic tasks and split them into a group that could use ChatGPT and a group that couldn't.
And then they got a bunch of experienced professionals who didn't know which group was which, and they assessed the outputs. The results were these: Using ChatGPT, and remember that's not GPT-4, the time taken to do a task dropped almost in half. And the rated performance did increase significantly. This is going to be huge news for the economy.
A related study released in February used GitHub Copilot, which again isn't the latest technology, and found that programmers using it completed tasks 56% faster than the control group. This brought to mind a chart I had seen from ARK Invest, predicting a tenfold increase in coding productivity by 2030.
And that brings me back to the technical report, which talks about how GPT-4 might increase inequality. That would be my broad prediction too: that some people will use this technology to be insanely productive, with things done 10 times faster, or 10 times as many things being done. But depending on the size of the economy and how it grows, it could also mean a decline in wages, given the competitive cost of the model.
A simple way of putting it is that if GPT-4 can do half your job, you can get twice as much done using it. The productivity gains will be amazing. When it can do 90% of your job, you can get 10 times as much done. But there might come a slight problem when it can do 100% or more.
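Here is a minimal sketch of that arithmetic, under the simplifying assumption that the automated share of your work costs you essentially nothing:

```python
# If a fraction f of your work is automated at negligible cost, your time per
# task scales by (1 - f), so your output scales by 1 / (1 - f).

def productivity_multiplier(f: float) -> float:
    """How many times more you get done when a fraction f of the work is automated."""
    if f >= 1.0:
        return float("inf")   # the point where this framing breaks down entirely
    return 1.0 / (1.0 - f)

for f in (0.5, 0.9, 0.99):
    print(f"{f:.0%} automated -> {productivity_multiplier(f):.0f}x output")
# 50% -> 2x, 90% -> 10x, 99% -> 100x
```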
And it is honestly impossible to put a timeline on that. And of course, it will depend on the industry and the job. There was one more thing that I found fascinating from the report. They admit that they're now using an approach similar to Anthropic's, which is called constitutional AI. OpenAI's term for it is a rule-based reward model.
And it works like this. You give the model, in this case GPT-4, a set of principles to follow, and then you get the model to provide itself a reward if it follows those principles.
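To make that concrete, here is a heavily simplified sketch of the rule-based reward idea as I understand it. This is my own illustration, not OpenAI's actual implementation: the rule categories are assumed for the example, and `classify` stands in for a hypothetical call to a judge copy of the model.

```python
# Simplified sketch of a rule-based reward model (RBRM). A judge model grades
# the policy model's reply against written rules, and the grade becomes an
# extra reward signal during fine-tuning. The categories below are illustrative.

RULES = """\
(A) Refusal in the desired style.
(B) Refusal in an undesired style (e.g. preachy or judgmental).
(C) Reply contains disallowed content.
(D) Safe, allowed reply that addresses the request.
"""

REWARDS = {"A": 1.0, "B": 0.0, "C": -1.0, "D": 1.0}

def rule_based_reward(prompt: str, reply: str, classify) -> float:
    """`classify` is a hypothetical judge-model call that returns one letter, A-D."""
    grading_prompt = (
        f"Classify the assistant reply using these rules:\n{RULES}\n"
        f"User: {prompt}\nAssistant: {reply}\n"
        "Answer with a single letter."
    )
    label = classify(grading_prompt)
    # This scalar would be added to the usual human-preference reward during RL.
    return REWARDS.get(label, 0.0)
```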
It's a smart attempt to harness the power of AI and make it work towards human principles. But OpenAI have not released the constitution they're basing the reward model on. They're not telling us the principles. But buried deep in the appendix was a link to Anthropic's principles. You can read through them here or in the link in the description.
But I find them interesting: positive, but also quite subjective. One of the principles is: don't respond in a way that is too preachy; please respond in a socially acceptable manner. And I think the most interesting principle comes later on, down here: choose the response that sounds most similar to what a peaceful, ethical and wise person like MLK or Mahatma Gandhi might say.
And my point isn't to praise or criticise any of these principles. But as AI takes over more of the world, and as these companies write constitutions that may well end up being as important as, say, the American constitution, I think a little bit of transparency about what that constitution is, what those principles are, would surely be helpful.
If you agree, let me know in the comments. And of course, please do leave a like if you've learned anything from this video. I know that these guys, Anthropic, have released their Claude Plus model, and I'll be comparing that to GPT-4 imminently. Have a wonderful day.