The world may be waking up to the fact that intelligence will be automated sooner than anyone could have imagined a few years ago, but it is still sleeping when it comes to who gets the spoils. Just today, the Vice President of the United States said that AI will never replace workers and will only boost productivity.
Then again, Sam Altman, CEO of OpenAI, wrote just yesterday that he could see labor losing its power to capital. And RAND, the famous think tank, put out a paper just the other day saying that the world isn't ready for the "job losses and societal unrest" it thinks might accompany a more general artificial intelligence.
But even if labor does lose, capital can't decide who gets the money. Just today, Musk and co. challenged Sam Altman and Microsoft for control of OpenAI itself. And of course, there are always papers like this one from Stanford suggesting that the reasoning enhancements needed to bring a model to frontier capability are achievable for just $20, which makes me think you guys can afford AGI after all.
Meanwhile, Dario Amodei, CEO of Anthropic, makers of Claude, says that time is running out to control the AGI itself. "I just think that when the day inevitably comes that we must confront the full automation of intelligence, I just hope we are a little more unified, let's say, than we are now." There is too much to cover as always, so let's cut it down to the 7 most interesting developments, using the Sam Altman essay as the jumping-off point for each.
First off, he gives his 5th or maybe 15th different definition of AGI, but this time it's "We mean it to be a system that can tackle increasingly complex problems at human level in many fields." Well, under that definition, we are getting awfully close. Take coding, where we heard in December that the O3 model was the 175th highest-ranked coder on Codeforces by Elo.
Now, that might not mean much to many people, but just yesterday in Japan, Sam Altman said they now have, internally, the 50th highest-scoring competitor. We're clearly well beyond imitation learning. These systems, O1, O3, O4, are not copying those top 50 competitors in, say, coding. They are trying things out themselves and teaching themselves, through reinforcement learning, what works.
We are not capped at the human level, and that applies to way more than just coding. I've been using Deep Research from OpenAI on the Pro tier this week to at least suggest diagnoses for a relative, and a doctor I know said that it found things she wouldn't have thought of.
Of course, it does hallucinate fairly frequently, but it also thinks of things you might not have thought of. And remember, this is O3 searching maybe 20 sources. What about O5 searching 500? And you might say, well, knowing stuff is cool, but white-collar workers actually take actions on their computers.
Well, Karina Nguyen from OpenAI has this to say. On tasks, they're saturating all the benchmarks. And post-training itself is not hitting the wall. Basically, we went from like raw data sets from pre-trained models to infinite amount of tasks that you can teach the model in the post-training world via reinforcement learning.
So any task, for example, like how to search the web, how to use the computer, how to write, well, like all sorts of tasks that you like trying to teach the model, all the different skills. And that's why we think like there's no data wall or whatever, because there will be infinite amount of tasks.
And that's how the model becomes extremely superintelligent. And we're actually getting saturated on all benchmarks. So I think the bottleneck is actually in evaluations. And there's a reason I can believe that, even though their current Operator system, only available on Pro for $200 a month, is quite jank.
It's because tasks like buying something online or filling out a spreadsheet are mostly verifiable. And whenever you hear verifiable or checkable, think ready to be absolutely eaten by reinforcement learning, just like domains such as code, where you can see the impact of enhanced reinforcement learning from O1 Preview to O3.
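To make "verifiable" concrete, here's a minimal sketch in Python of what a reward signal for such a task could look like. The `policy` object and its methods are hypothetical stand-ins I've made up for illustration, not anything from OpenAI's actual training setup.

```python
# Minimal sketch: a "verifiable" task is one where a simple checker can score the
# outcome automatically, which is exactly what reinforcement learning needs.
# The `policy` object and its attempt/update methods are hypothetical stand-ins.

def spreadsheet_reward(submitted: dict, expected: dict) -> float:
    """Return 1.0 if every required cell matches the ground truth, else 0.0."""
    return 1.0 if all(submitted.get(cell) == value for cell, value in expected.items()) else 0.0

def rl_step(policy, task) -> float:
    """One illustrative reinforcement-learning update on a verifiable task."""
    trajectory, result = policy.attempt(task.prompt)    # the model tries the task itself
    reward = spreadsheet_reward(result, task.expected)  # an automatic checker verifies the outcome
    policy.update(trajectory, reward)                   # reinforce whatever led to success
    return reward
```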
Next is the investment that must go in to make all of this happen. Sam Altman had this to say later in the essay: "The scaling laws that predict intelligence improvements have been accurate over many orders of magnitude. Give or take, the intelligence of an AI model roughly equals the log of the resources used to train and run it." So think of that as 10xing the resources you put in to get one incremental step forward in intelligence.
That doesn't sound super impressive until you read the third point, which I agree with: the socioeconomic value of linearly increasing intelligence is super-exponential; each increment is worth far more than the last. In short, if someone could somehow double the intelligence of O3, it wouldn't just be worth 4x more to me; I think to many people, it would be worth way, way more than that. It would be super-exponential.
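To put that in symbols, and this is just my own rough rendering of the essay's two claims, with the base-10 log and the constant k chosen purely for illustration:

```latex
% Rough rendering of the essay's claims, not Altman's own notation.
I(C) \approx k \log_{10} C
\quad\Longrightarrow\quad
C(I) \approx 10^{\,I/k}
% i.e. each +1 step in "intelligence" I costs roughly 10x the resources C,
% while the essay's third point is that the value of each step grows
% super-exponentially rather than in proportion to the resources spent.
```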
He goes on: "A consequence of this is that we see no reason for the exponentially increasing investment to stop in the near future." In other words, if AI will always pay you back tenfold for what you invest in it, why ever stop investing? Many forget this, but less than two years ago, Sam Altman himself said that his grand idea is that OpenAI will capture much of the world's wealth through the creation of AGI, and then redistribute it to the people.
We're talking figures like not just $100 billion, but a trillion, or even $100 trillion. That's coming from him. He only adds that, if AGI does create all that wealth, he's not sure how the company will redistribute it. To give you a sense of scale, as you head towards $100 trillion, you're talking about the scale of the entire labor force of the planet.
And that, of course, brings us to others who don't want him to have that control, or maybe want that control for themselves. As you may have heard, Elon Musk has bid almost $100 billion for OpenAI, or at least it's a bid for the non-profit which currently controls OpenAI.
To save you reading half a dozen reports, essentially it looks like Sam Altman and OpenAI have valued that non-profit's stake in OpenAI at around $40 billion. That leaves plenty of equity left for Microsoft and OpenAI itself, including its employees. However, if Musk and others have valued that stake at $100 billion, then it might be very difficult in court for Altman and co.
to say it's worth only $40 billion. So even if they reject, as it seems like they have done, Musk's offer, it forces them to potentially dilute the stake owned by Microsoft and the employees. Altman said to the employees at OpenAI that these are just tactics to try and weaken us because we're making great progress.
The non-profit behind OpenAI could also reject the offer because it thinks that AGI wouldn't be safe in the hands of Musk. At this point, I just can't resist doing a quick plug for a mini documentary I released on my Patreon just yesterday. It actually covers the origin stories of DeepMind, OpenAI, the tussle with Musk and Anthropic and how the founding vision of each of those AGI labs went awry.
This time, by the way, I used a professional video editor and the early reviews seemed to be good. All the shenanigans that are going on with the non-profit at OpenAI seem worthy of an entire video on their own. So for now, I'm going to move on to the next point.
Sam Altman predicted that with the advent of AGI, the price of many goods will eventually fall dramatically. It seems like one way to assuage people who lose their job or see their wages drop is that, well, at least your TV is cheaper. But he did say the price of luxury goods and land may rise even more dramatically.
Now, I don't know what you think, but I live in London and the price of land is already pretty dramatic. So who knows what it will be after AGI. But just on that luxury goods point, I think Sam Altman might have one particular luxury good in mind. Yesterday in London, Sam Altman was asked about their hardware device, designed in part by Jony Ive, formerly of Apple.
And he said: it's incredible, it really is, I'm proud of it, and it's just a year away. Yes, by the way, I did apply to be at that event, but you had to have certain org IDs, which I didn't. One thing that might not be a luxury, by the way, is smaller language models.
In leaked audio of that same event, he apparently said, well, one idea would be we put out O3 and then open source O3 mini. We put out O4 and open source O4 mini. He added, this is not a decision, but directionally you could imagine us saying this. Take all of that for what it is worth.
The next jumping off point comes in the first sentence actually of this essay, which is that the mission of OpenAI is to ensure that AGI benefits all of humanity. Not that they make AGI, that they make an AGI that benefits all of humanity. Now, originally when they were founded, which I covered in the documentary, the charter was that they make AGI that benefits all of humanity unencumbered by the need for a financial return.
That last bit's gone, but we still have "benefits all of humanity". Not most of humanity, by the way: all of humanity. I really don't know how they are going to achieve that when they themselves admit that the vast majority of human labor might soon become redundant.
Even if they somehow got a benevolent policy implemented in the US to make sure that everyone was looked after, how could you ensure that for other nations? After watching Yoshua Bengio, one of the godfathers of AI, and I'll show you the clip in a second, I did have this thought.
It seems to me that if a nation got to AGI or superintelligence one month, three months, six months before another, the most likely outcome is not that it would use that advantage to just wipe out other nations. More likely, I think, it would wipe out the economies of other nations.
The US might automate the economy of, say, China, or China that of the US, and then take that wealth and distribute it amongst its people. And Yoshua Bengio thinks that might even apply at the level of companies. I can see from the declarations that are made and, you know, what these people would logically do, is that the people who control these systems, like, say, OpenAI potentially, they're not going to continue just selling access to their AI.
They're going to give access to, you know, a lower grade AI. They're going to keep the really powerful ones for themselves and they're going to build companies that are going to compete with the non-AI, you know, systems that exist. And they're going to basically wipe out the economies of all the other countries which don't have these superintelligent systems.
So, you know, you say, you wrote, it's not existential, but I think it is existential for countries that don't build up to this kind of level of AI. And it's an emergency, because it's going to take at least several years, even with a coalition of the willing, to bridge that gap.
And just very quickly, because he mentioned competitor companies, I can't help but mention Gemini 2 Pro and Flash, new models from Google DeepMind. There's also, of course, Gemini Thinking, which replicates the kind of reasoning traces of, say, O3 Mini or DeepSeek R1. Now, straight off, the benchmark results of these models are decent, but not stratospheric.
For the most part, we're not talking O3 or DeepSeek R1 levels. On SimpleBench we're rate limited, but it seems like the scores of both the Thinking Mode and Gemini 2 Pro will gravitate around the same level as the "Gemini Experimental 1206". But I will say this, I know it's kind of niche.
Gemini is amazing at quickly reading vast amounts of PDFs and other files. No, its audio transcription accuracy, which I've tested, isn't going to be at the level of, say, AssemblyAI; no, its coding is no O3; and its "deep research" button is no Deep Research. But the Gemini series is great at extracting text from files, and the models are incredibly cheap.
So I'm quite impressed. And I do suspect, as ChatGPT has just recently overtaken Twitter to become the sixth most visited site and slowly starts closing in on Google, that Google will invest more and more to ensure that Gemini 3 is state of the art. Next, Altman wrote about a likely path that he sees: AI being used by authoritarian governments to control their populations through mass surveillance and loss of autonomy.
And that remark brings me to the RAND paper that, for some reason, I read in full, because they're worried not just by mass surveillance by authoritarian dictatorships, but by other threats to, quote, national security: wonder weapons; systemic shifts in power (I kind of talked about that earlier, with, say, China automating the economy of the US); non-experts empowered to develop weapons of mass destruction; artificial entities with agency, think O6 kind of coming alive; and instability.
This is RAND, again, which has been around for over 75 years and is not known for dramatic statements. Again, I would ask, though: if the US does mount a, quote, "large national effort" to ensure it obtains a decisive AI-enabled wonder weapon before China, say three months before, six months before, then what?
Are you really going to use it to then disable the tech sector of China? For me, the real admission comes towards the end of this paper where they say the US is not well positioned to realise the ambitious economic benefits of AGI without widespread unemployment and accompanying societal unrest.
And I still remember the days when Altman used to say in interviews, just around two years ago, things like: if AGI produces the kind of inequality that he thinks it will, people won't take it anymore. Let's now, though, get to some signs that AGI might not even be controlled by countries, or even companies.
For less than $50 worth of compute time, apparently around $20, and of course not counting research time, affordable for all of you guys, Stanford produced S1. Now, yes, of course, they did utilise an open-weight base model, Qwen 2.5 32B Instruct, but the headline is that with just a thousand questions' worth of data, they could bring that small model to be competitive with O1.
This is in science (GPQA) and competition-level mathematics. The key methodology was, well, whenever the model wanted to stop, they forced it to continue by appending "Wait", literally the token "Wait", multiple times to the model's generation when it tried to end. Imagine you're sitting in an exam, and every time you think you've come to an answer and you're ready to write it down, a voice in your head says "wait". That's kind of what happened, until the student, or you, had taken a set amount of time on the problem.
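Here's a minimal sketch of that forcing loop; the `generate` helper, the end-of-thinking marker, and the fixed number of continuations are my own simplifications for illustration, not the S1 authors' actual code.

```python
# Minimal sketch of the "Wait" trick described above, not the S1 repository's code.
# `generate(prompt, stop)` is a hypothetical helper that returns the model's
# continuation up to (but not including) the given end-of-thinking marker.

END_OF_THINKING = "</think>"  # placeholder marker; the real delimiter may differ

def think_longer(question: str, extra_rounds: int = 4) -> str:
    reasoning = generate(question, stop=END_OF_THINKING)       # model reasons until it wants to stop
    for _ in range(extra_rounds):
        reasoning += "\nWait"                                   # instead of letting it finish, append "Wait"
        reasoning += generate(question + reasoning, stop=END_OF_THINKING)
    return reasoning                                            # a longer trace means more test-time compute
```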
Appropriately then, this is called test-time scaling: scaling up the number of tokens spent to answer each question. I've reviewed the questions in the MATH-500 benchmark, by the way, and they are tough, at least the hard ones, the level-five ones. So to get 95% on that is impressive.
Likewise, of course, to get beyond 60% in GPQA Diamond, which roughly matches the level of PhDs in those domains. To recap, this is an off-the-shelf, open-weights model trained with just a thousand questions and reasoning traces. There were some famed professors on this Stanford team, and their goal, by the way, was to replicate this chart on the right, which came in September from OpenAI.
Now we kind of already know that the more pre-training you do and post-training with reinforcement learning you do, the better the performance will be. But what about time taken to actually answer questions, test time compute? That's the chart they wanted to replicate. Going back to the S1 paper, they say, despite the large number of O1 replication attempts, none have openly replicated a clear test time scaling behavior and look how they have done so.
I'm going to simplify their approach a little bit, because it's the finding that I'm more interested in, but essentially they sourced 59,000 tough questions: physics Olympiads, astronomy, competition-level mathematics, and AGIEval. I remember covering that paper almost two years ago on this channel. They got Gemini Thinking, the one that outputs thinking tokens like DeepSeek R1 does, to generate reasoning traces and answers for each of those 59,000 examples.
Now, they could have just trained on all of those examples, but that did not offer substantial gains over just picking a thousand of them. Just a thousand examples in, say, your domain to get a small model to be a true reasoner. Then, of course, get it to think for a while with that "wait" trick.
How did they filter down from 59,000 examples to 1,000, by the way? First, decontaminate: you don't want any questions that you're then going to use to test the model, of course. Remove examples that rely on images not included in the question, for example, and other formatting issues. But more interestingly: difficulty and diversity.
This is the kind of diversity that even JD Vance would get behind. On difficulty, they got smaller models to try those questions. And if those smaller models got the questions right, they excluded them. They must be too easy. On diversity, they wanted to cover as many topics as possible from mathematics and science, for example.
They ended up with around 20 questions from each of 50 different domains. They then fine-tuned that base model on those thousand examples with the reasoning traces from Gemini. And if you're wondering about DeepSeek R1, they fine-tuned with 800,000 examples. Actually, you can see that in this chart on the right here.
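A rough sketch of that selection step is below. The helper functions (`is_contaminated`, `solved_by_small_models`, `domain_of`, `needs_missing_image`) and the exact numbers are placeholders of mine, not the paper's actual pipeline.

```python
import random
from collections import defaultdict

# Rough sketch of the filtering described above. The helper functions are
# hypothetical stand-ins for the paper's decontamination, difficulty, and
# domain-classification steps.

def select_final_1000(examples, domains=50, per_domain=20):
    # 1) Decontaminate and drop badly formatted or image-dependent items.
    pool = [ex for ex in examples
            if not is_contaminated(ex) and not needs_missing_image(ex)]
    # 2) Difficulty: if smaller models already solve it, it's too easy, so drop it.
    pool = [ex for ex in pool if not solved_by_small_models(ex)]
    # 3) Diversity: keep roughly `per_domain` questions from each domain.
    by_domain = defaultdict(list)
    for ex in pool:
        by_domain[domain_of(ex)].append(ex)
    selected = []
    for items in list(by_domain.values())[:domains]:
        selected += random.sample(items, min(per_domain, len(items)))
    return selected[:domains * per_domain]
```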
Again, it wasn't just about the fine-tuning. Each time the model would try to stop, they said "wait", sometimes two, four, or six times, to keep boosting performance. Basically, it forces the model to check its own output and see if it can improve it. Notice that "wait" is fairly neutral. You're not telling the model that it's wrong.
You're saying: wait, maybe we need to check that? They also tried scaling up majority voting, or self-consistency, and it didn't quite have the same slope.
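For comparison, here is the majority-voting / self-consistency baseline in its simplest form, with a hypothetical `sample_answer` helper that runs the model once at non-zero temperature and extracts its final answer.

```python
from collections import Counter

# Bare-bones self-consistency: sample several independent answers and keep the
# most common one. `sample_answer(question)` is a hypothetical helper.

def majority_vote(question: str, n_samples: int = 16) -> str:
    answers = [sample_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```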
Suffice to say, though, if anyone watching is in any confusion: getting these kinds of scores in GPQA, Google-Proof Question and Answer, and competition-level mathematics is insane. Incredibly impressive. Of course, if you took this same model and tested it in a different domain, it would likely perform relatively poorly. Also, side note: when they say open data, they mean those thousand examples that they fine-tuned the base model on. The actual base model doesn't have open data.
So it's not truly open data, in the sense that we don't know everything that went into the base model, everything that Qwen 2.5 32B was trained on. Interestingly, they would have gone further, but the actual context window of the underlying language model constrains them. And Karpathy, in his excellent deep dive video on LLMs like ChatGPT this week, talked about how it's an open research question how to extend the context window suitably at the frontier.
It's a three-and-a-half-hour video, but it's a definite recommend from me. Actually, speaking of Karpathy, his reaction to this very paper was "cute idea, reminds me of the let's-think-step-by-step trick". That's where you told the model to think step by step, so it spent more tokens reasoning first before giving you an answer.
Here, by saying "wait", we're forcing the model to think for longer. Both, he said, "lean on the language prior to steer the thoughts". And speaking of spending your time well by watching a Karpathy video, I would argue you can spend your money pretty well by researching which, say, charity to give to, through GiveWell.
They are the sponsors of this video, but I've actually been using them for, I think, 13 years. They have incredibly rigorous methodology, backed by 60,000+ hours of research each year on which charities save the most lives, essentially. The one that I've gone for, for actually all of those 13 years, is the Against Malaria Foundation, I think started in the UK.
Anyway, do check out GiveWell, the links are in the description, and you can even put in where you first heard of them. So obviously, you could put, say, AI Explained. But alas, we are drawing to the end, so I've got one more point from the Sam Altman essay that I wanted to get to.
In previous essays, he's talked about the value of labour going to zero. Now he just talks about the balance of power between capital and labour getting messed up. But interestingly, he adds, this may require early intervention. Now, OpenAI have funded studies into UBI with, let's say, mixed results, so it's interesting he doesn't specifically advocate for universal basic income.
He just talks about early intervention, then talks about compute budgets and being open to strange-sounding ideas. But I would say, if AGI is coming in two to five years, then the quote "early intervention" would have to happen, say, now? I must confess, though, at this stage, that I feel like we desperately need preparation for what's coming, but it's quite hard to actually specifically say what I'm advocating the preparation be.
Then we get renewed calls, just today, from the CEO of Anthropic, Dario Amodei, about how AI will become a country of geniuses in a data centre, possibly by 2026 or 2027, and almost certainly no later than 2030. He said that governments are not doing enough to hold the big AI labs to account and measure risks, and that at the next international summit (there was one just this week) we should not repeat this missed opportunity.
These issues should be at the top of the agenda. The advance of AI presents major new global challenges. We must move faster and with greater clarity to confront them. I mean, I'm sold and I think many of you are, that change is coming very rapidly and sooner than the vast majority of people on the planet think.
The question for me that I'll have to reflect on is, well, what are we going to do about it? Let me know what you think in the comments but above all, thank you so much for watching to the end and have a wonderful day.