AI CEO: ‘Stock Crash Could Stop AI Progress’, Llama 4 Anti-climax + ‘Superintelligence in 2027’ ...


Chapters

0:00 Introduction
0:47 Stock Crash
2:28 Llama 4
10:55 o3 News
11:59 OpenAI non-profit?
13:13 AI 2027

Transcript

Every day it seems to me at the moment there are crazy claims and headlines, not just in AI but in the wider world. So this video is going to attempt to debunk a few of those headlines and just give you what we know. I'm going to look at Llama 4, the model that has been a year in the making and has many claims and counterclaims about it.

Then the blog post slash paper from a former OpenAI researcher that has millions and millions of views online and was featured in the New York Times, essentially predicting superintelligence by 2027. Then some very recent news about the release date of what could be the smartest model of them all, along with a ton of contradictions about whether and when it might come out.

I just simply can't resist starting out with a quote which could dampen literally all the hype that you see in AI. When Dario Amodei, the CEO of Anthropic, makers of the Claude series of models, was asked what could stop AI progress, he mentioned a war in Taiwan, which we've known is a risk for a long time.

I highly recommend the book Chip War by Chris Miller. He then briefly touched on there being a potential data wall, where they run out of high-quality data to train their models on, but then he touched on a new risk that he hadn't mentioned before. And before you hear this quote, note that it came three weeks ago, before all of this tariff craziness.

What are the top three things that could stop the show? If there's a large enough disruption to the stock market that messes with the capitalization of these companies, basically a kind of belief that the technology will not, you know, move forward, and that kind of creates a self-fulfilling prophecy where there's not enough capitalization.

I just want to spend 30 seconds explaining how that might play out. Companies like OpenAI and Anthropic need to raise money to fund those vast training runs that go behind their latest models. They don't just have $40 billion or $100 billion sitting around in their bank accounts to fund those vast data centers and everything else that goes into training a language model.

The trouble is, of course, if investors don't think they'll get their money back, perhaps due to a recession, then they either won't invest in these companies or invest less at lower valuations. Less money means less compute, which means slower AI progress. That's not a prediction, of course. No one, including myself, knows what's going to happen.

It's just easy to forget that AI operates in the real world, and real-world events can have consequences for AI progress. And speaking of AI progress, how much progress is represented by the release of Llama 4 and two of the three models in the Llama 4 family? Well, it's hard to be exact because, as ever, there's a lot more spin than honest analysis in the release of this family.

But it seems like not too much. There's no paper, of course; that's starting to become the norm. But here are the highlights of what we do know. First, the smallest of the Llama 4 models has what they call an industry-leading context window of 10 million tokens. Think of that as around seven and a half million words.
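
To make that conversion concrete, here's a minimal back-of-the-envelope sketch; the 0.75 words-per-token ratio is an assumption (a common rule of thumb for English text), not an exact figure.

```python
# Rough tokens-to-words conversion for a 10-million-token context window.
# The 0.75 words-per-token ratio is an assumed rule of thumb for English
# text; the real ratio depends on the tokenizer and the content.
context_window_tokens = 10_000_000
words_per_token = 0.75  # assumption, not an exact figure

approx_words = context_window_tokens * words_per_token
print(f"~{approx_words:,.0f} words")  # ~7,500,000 words
```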

That sounds insane, of course, and innovative, but two quick caveats. All the way back in February 2024, we had a model, Gemini 1.5 Pro, that had a 10 million token context window. And with that extreme window, it could perform amazing needle-in-a-haystack retrieval on videos and audio and text.

In public, at least, we, quote, "only" got models of up to two million tokens of context window, perhaps because Google realized something. They realized, perhaps, that it's all well and good finding individual needles in a haystack, as demonstrated in this Llama 4 blog post. If you dump in all the Harry Potter books and drop in a password, say, halfway through, the model will be able to find it and retrieve it.

But most people, let's be honest, aren't sneaking passwords into seven volumes of Harry Potter. So these results from the release 48 hours ago seem less relevant to me than this updated benchmark from 24 hours ago. It's called Fiction.LiveBench for Long Context Deep Comprehension. This is the benchmark where language models have to piece together plot progressions across tens or hundreds of thousands of tokens or words.

In my last video, on Gemini 2.5 Pro, I noted its extremely good performance on this benchmark. In contrast, for Llama 4's medium-sized and smallest models, performance is pretty bad and gets worse as context grows. The numbers at the top refer to the clues being strewn across, say, 6,000 words or 12,000 words or even 100,000 words.

Things then get stranger when you think about dates. Why was Llama 4 released on a Saturday? That is unprecedented in the entirety of the time I've covered AI. If you were going to be vaguely conspiratorial, you would think that they released it on a weekend to sort of dampen down attention.

Also note that its knowledge cutoff is August 2024; that's the most recent training data that Llama 4 was trained on. Compare that to Gemini 2.5, which has a knowledge cutoff of January 2025. That kind of hints to me that Meta were trying desperately to bring this model up to scratch in the intervening months.

In fact, they probably intended to release it earlier, but then in September we had the start of the O series of models from OpenAI and then in January we got DeepSeek R1. By the way, if you want early access to my full-length documentary on DeepSeek and R1, it's on my Patreon.

Link in the description. But I will say, before we completely write off Llama 4 as in this meme, there is some solid progress that it represents, especially in the medium-sized model, Llama 4 Maverick, as it compares to the updated DeepSeek V3. Neither of these models is, of course, a thinking model like Gemini 2.5 or DeepSeek R1.

Meta haven't released their state-of-the-art thinking model yet. But just bear in mind for a moment that for all the hullabaloo around DeepSeek V3, Llama 4 Maverick has around half the number of active parameters and yet is comparable in performance. Now yes, I know people accuse it of benchmark maxing or hacking on LM Arena, but check out these real numbers.

Assuming none of the answers made it into the training data for Llama 4, the performance of these models on GPQA Diamond, the Google-proof STEM benchmark that's extremely tough, is actually better than the new DeepSeek V3. Or, of course, GPT-4o. So if you were making the optimistic case for Meta or for Llama 4, you would say that they have a pretty insane base model that they could create perhaps a state-of-the-art thinking model on top of.

Only problem is, Gemini 2.5 Pro is already there and DeepSeek R2 is coming out any moment. Also, when you take Llama 4 out of its comfort zone, its performance starts to crater. Take this coding benchmark, Aider's Polyglot benchmark, which tests model performance across a range of programming languages rather than just Python, as many benchmarks do.

And as you can see, Gemini 2.5 Pro tops the charts. Now yes, you might say that's a thinking model, but then look at Claude 3.7 Sonnet: without thinking, it gets 60%. DeepSeek V3, the latest version, gets 55%. And you unfortunately have to scroll down quite far to get to Llama 4 Maverick, which gets 15.6%.

Now is it me, or is performance like this quite hard to square with headlines like this one from Mark Zuckerberg, which is that his AI models will replace mid-level engineers soon? As in, Zuckerberg says, this year, 2025. Was he massively hyping things out of all sense of proportion? How dare you have that thought?

Four more quick things before we leave Llama 4, and yes, I did pick that number deliberately. And the first is on the tentative signs from their biggest model, the unreleased one, Behemoth. Now Meta have deliberately made the comparisons with models like Gemini 2.0 Pro and GPT-4.5, and the comparison is somewhat favourable.

Though if you look closely at the footnotes, it says, Llama model results represent our current best internal runs. Did they run the model five times and pick the best one? Three times? Ten times? We don't know. Also note they chose not to compare Llama 4 Behemoth with DeepSeek V3, which is three times smaller in terms of overall parameters and around eight times smaller in terms of active parameters.
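
As a rough sanity check on those ratios, here's a minimal sketch using the commonly reported parameter counts for these mixture-of-experts models (roughly 400B total / 17B active for Maverick, 2T total / 288B active for Behemoth, 671B total / 37B active for DeepSeek V3); treat the exact figures as assumptions rather than confirmed specs.

```python
# Back-of-the-envelope size ratios for the mixture-of-experts models discussed
# above. Parameter counts are the commonly reported figures and are assumptions
# here, not confirmed specifications.
models = {
    "Llama 4 Maverick": {"total": 400e9, "active": 17e9},
    "Llama 4 Behemoth": {"total": 2_000e9, "active": 288e9},
    "DeepSeek V3": {"total": 671e9, "active": 37e9},
}

behemoth, maverick, v3 = (models[k] for k in
                          ("Llama 4 Behemoth", "Llama 4 Maverick", "DeepSeek V3"))

# DeepSeek V3 is roughly 3x smaller overall and ~8x smaller in active params.
print(f"Behemoth / V3 total params:  {behemoth['total'] / v3['total']:.1f}x")
print(f"Behemoth / V3 active params: {behemoth['active'] / v3['active']:.1f}x")

# Maverick runs with roughly half the active parameters of DeepSeek V3.
print(f"Maverick / V3 active params: {maverick['active'] / v3['active']:.2f}x")
```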

In dark blue, you can see the performance of DeepSeek V3, the latest version, and you'd have to agree it's pretty much comparable to Llama 4 Behemoth. In other words, if you wanted to put a negative spin on the release, you could say Llama's biggest model, many times the size of the new DeepSeek V3 base model, performs at the same level, basically.

Now, yes, I know Llama 4 Behemoth is still, quote, in training, but pretty much all models are, quote, in training all the time at the moment with post-training. Second, just a quick one I saw halfway through the terms of use, which is you're kind of screwed if you are in the EU.

You can still be an end user of it; you just don't have the same rights to build upon it. Next comes a little nugget towards the bottom of the page in which they've tried to make Llama 4 lean a bit more right. They say it's well known that LLMs have bias, that they historically lean left when it comes to politics, so they're going to try to rectify that.

I'm sure, of course, that had nothing to do with Zuckerberg's relationship to the new administration. Finally, SimpleBench, in which Llama 4 Maverick, the medium-sized model, gets 27.7%, which is around the same level as DeepSeek V3. Now, that is lower than, quote, non-thinking models like Claude 3.5 Sonnet that don't take time to lay out their chain of thought before answering, but it's a solid performance.

Meta are definitely still in the race when it comes to having great base models upon which you can build incredible reasoning models. Now, as it happens, I did get some juicy hints recently about what the performance of O3 would be on SimpleBench, and that's the model coming in two weeks.

I'll touch on that in just a second. And let's just say that it's going to be competitive. I know that's kind of like an egregious hint that I'm not backing up, but that's all I can say at the moment. Now, what you may have noticed in the middle of the screen is that SimpleBench, a benchmark I created around nine months ago (you can check it out in the description), is powered by Weave from Weights and Biases.

They are sponsoring this video and indeed the entire benchmark, as you can clearly tell with the link at the center of the screen. That will open up this quick start, which should be useful for any developer who is interested in benchmarking language models, as we do. And to be honest, even for those who are just interested in learning more about LLMs, you can check out the Weights and Biases AI Academy down here.

As you can see, they are coming up with new free courses pretty much all the time. Now, I did say I'd mention the O3 news, which came just a couple of days ago from Sam Altman, in which he told us that O3 would be coming in about two weeks from now.

This is from my newsletter, but do you remember when OpenAI and Sam Altman specifically said, "We want to do a better job of sharing our intended roadmap. As we get closer to AGI, you guys deserve clarity"? Well, clarity would be great, because initially O3 was supposed to come out shortly after O3 Mini High, which came out towards the end of January.

So, naturally, we expected it in February. Then OpenAI did a 180, as you can see in this tweet, and Sam Altman said, "We will no longer ship O3 as a standalone model." Now, perhaps prompted by the Gemini 2.5 Pro release, or their GPUs melting because of everyone using Image Gen, they've pushed back GPT-5 and are now going to release O3 as a standalone model after all, in two weeks.

So much for clarity then. We're also apparently going to get books about Sam Altman's misdeeds and dodgy documented behaviour, but that's a topic for another video. One thing I bet OpenAI don't want us to focus on is their new plans for their non-profit. Remember that $300 billion valuation you saw earlier on in this video? That depends on OpenAI becoming a for-profit company.

So, what's going to happen to that non-profit which was supposed to control the proceeds of OpenAI creating AGI? Remember, in the slim chance that Sam Altman is right and OpenAI are the company that creates trillions of dollars of value, as he predicted, this non-profit might have ended up controlling trillions of dollars worth of value.

More importantly, it would have controlled what would have happened to AGI should OpenAI be the company that created it. Now, put aside whether you think it will be OpenAI that creates AGI, or whether AGI is even well-defined or feasible in the next three to five years. Just focus on the promise that Sam Altman and OpenAI made.

We've gone from that non-profit controlling what could have been, in theory, a significant fraction of the world economy, to supporting local charities in California, and perhaps generously across America and beyond. Now, hardly anyone, if anyone, is focusing on this story as OpenAI are no longer the dominant players in the race to AGI, but nevertheless, I think it's significant.

Now, if you are feeling somewhat dehyped about AGI after hearing about Llama 4 and these OpenAI stories, well, you could spend a few hours like I did on the weekend reading AI 2027. This was written by a former OpenAI researcher and other superforecasters with a pretty impressive track record.

Also, as you may remember, Daniel Kokotajlo put up an impressive stand against OpenAI on their non-disparagement clause. He was essentially willing to forfeit millions of dollars, and yes, you can make that much as a safety researcher at OpenAI. He was willing to forfeit that so he wouldn't have to sign the non-disparagement clause.

Because he made that stand, OpenAI were practically forced into dropping that clause for everyone. So, well done him on that. Nevertheless, I was not particularly convinced by this report, even though I admire the fact that they put dates on the record for their predictions. To honor that, I will try to match some of their predictions with predictions of my own.

Their central premise, in a nutshell, is that AI is first going to become a superhuman coder and then a superhuman ML researcher, and thereby massively speed up AI progress, giving us superintelligence in 2027. They draw fairly heavily on this paper from METR, and I'm going to cover that in a separate video because I am corresponding fairly closely with one of the key authors of that paper.

Anyway, we start off fairly lightly, basically with a description of what current AI can do in terms of being an agent, like ChatGPT's Operator and Deep Research, essentially describing what we already have. We then get plenty of detours into alignment and safety because you sense the authors are trying to get across that message at the same time as making all of these predictions.

I start to meaningfully diverge from their predictions when it comes to early 2026, when they say this: if China steals the weights of the state-of-the-art AI, Agent 1 as they call it, they could increase their research speed by nearly 50%. Based on all of the evidence you've seen today about DeepSeek and Llama 4, you would say it's almost equally likely that the West will be stealing China's weights.

Or wait, they won't need to, because DeepSeek continued to make their models open weight. Just like Leopold Aschenbrenner and Dario Amodei, everything is a race to the jugular, which is a narrative that's somewhat hard to square with DeepSeek pioneering certain research and giving it to everyone. Then, apparently in late 2026, the US Department of Defense will quietly begin contracting OpenAI or Google directly for cyber, data analysis and R&D.

But I'm kind of confused, because already, for at least a year, OpenAI have been working directly with the Pentagon. Yes, before you guys tell me, I'm aware that Daniel Kokotajlo, who is the main author of this paper, did make some amazing predictions back in 2021 about the progress of AI.

I can link to that in the description, but that doesn't mean he's going to be always right going forward. Also, he himself has admitted that those predictions weren't that wide-ranging. Anyway, things get wild in January of 2027 because, as you can see from this chart up here, we get an AI that is better than the best human.

The first superhuman coder, in other words. This is the key crux of the paper, because once you get that, you speed up AI research and all the other consequences follow. But as I have been discussing with the authors of the METR paper, there are so many other variables to contend with.

What about proprietary code in Google or Meta or Amazon that OpenAI can't train their models on? What about benchmarks themselves being less and less reliable indicators of real-world performance because the real world is much messier than benchmarks? This superhuman coder may need to liaise with entire teams, get certain permissions and pass all sorts of hurdles of common sense.

And even if you wanted to focus brutally just on verifiable benchmarks, not every benchmark is showing an exponential. Take MLE-bench, or Machine Learning Engineering Bench, from the Deep Research or O3 system card from OpenAI. That dataset consists of 75 hand-curated Kaggle competitions worth $2 million in prize value.

Measuring progress towards model self-improvement is key to evaluating autonomous agents' full potential. Basically, if models get good at machine learning engineering, they can obviously much more easily improve themselves. And let's skip down to progress and zoom in a bit, and you can see the performance of O1, O3 mini, Deep Research without browsing, Deep Research with browsing, even GPT-4o, and I'm not noticing an absolute surge in performance.

Obviously, I am perfectly aware of benchmarks like Humanity's Last Exam and others which are showing exponential improvement. I'm just saying not every benchmark is showing that. Also, January or February of 2027 is less than two years away, and this model would have to be superhuman in performance. So much so that it could autonomously develop and execute plans to hack into AI servers, install copies of itself, evade detection, and use that secure base to pursue whatever other goals it might have.

Notice the hasty caveat, though, how effectively it would do so as weeks roll by is unknown and in doubt. That happens a lot, by the way, in the paper. I even noticed a co-author say, well, this wasn't my prediction, it was Daniel's. There's a lot of kind of heavy caveating of everything.

Notice, though, that not only would an AI model have to be superhuman at coding to do all of this, it would have to have very few, if any, major flaws. If one aspect of its proposed plan wasn't in its training data or it couldn't do it reliably, the whole thing would fail.

And that leads me to my prediction. I mean, they've made a prediction, so I can make one: that models, even by 2030, will not be able to do this. I'm talking reliably, with 95 or 99% reliability, fully autonomously developing and executing plans to hack into AI servers, copy themselves, evade detection, etc.

If, on the other hand, Daniel is right and models are capable of this by February 2027, then I will admit I am wrong. That, by the way, brings me back to this chart, which, if you notice, says that at that point only 4% of people, asked "What do you think is the most important problem facing the country today?", would answer AI.

Well, I don't know about you, but if I or my friends or family heard that there's AI out there that can just hack things and replicate itself and survive in the wild, I think more than 4% of people would say it's the most important issue.

I mean, actually, the more I think of it, look at the clickbait headlines you get on YouTube and elsewhere about AI today. Can you imagine the clickbait if AI was actually capable of copying itself onto different servers and hacking autonomously? Actually, it wouldn't even be clickbait at that point.

I would be doing headlines like, oh my god, it can hack everything. Anyway, you get the idea, and that's not even mentioning the fact that these agents can also, almost as well as a human pro, create bioweapons and the rest of it. China is then going to steal that improved Agent 2 from the Pentagon, and still obliviously 96% of people are focused on other things.

Being slightly less facetious, I think the paper over-indexes on weight thefts and it all being contained in the weights of a model. I think progress between now and 2030 is going to much more depend on what data you have available, what benchmarks you have created, what proprietary data you can get hold of.

Now, don't get me wrong, I do think AI will help with the improvement of AI, even if it's just verifying, evaluating and replicating existing AI research. On that front, there's a new benchmark released by OpenAI just a week ago, and already models like Claude 3.5 Sonnet can replicate 21% of the papers in it.

But when you have limited compute, and potentially very limited compute if there's a massive worldwide stock crash or a war in Taiwan, are you going to delegate the decision of which avenues to pursue to a model which might be only 80% as good as your best researchers?

No, you would just defer to those top researchers. Only when a model was making consistently better decisions than your best researchers as to how to deploy compute would you then entrust that decision to it. The authors definitely bring in some real-world events that may or may not have occurred at OpenAI when they say, "AI safety sympathizers get sidelined or fired outright" (the last group for fear that they might whistleblow).

Personally, I would predict that if we have an autonomous AI that can hack and survive on its own, safety sympathizers will not be sidelined. If I am wrong, then we as a human species are a lot dumber than I thought. Anyway, just two years from now, in June 2027, most of the humans at OpenAI/Google can't usefully contribute anymore.

Again, I just don't think feedback loops can happen that quickly when you reach this level. I could well imagine benchmarks like the MMMU or SimpleBench being maxed out at this point, but imagine you're trying to design a more aerodynamic or efficient F-47. That's the new fighter jet announced by the Pentagon.

Well, that AI self-improvement is going to be bottlenecked by the realism of the simulation that it's benchmarking against. Unless that simulated aircraft exactly matches the real one, well then you won't know if that "self-improving AI" has indeed improved the design unless you test it out in a real aircraft.

Then multiply that example by the 10,000 other domains in which there's proprietary data or sim-to-real gaps. I guess you could summarise my views as saying the real world is a lot messier than certain isolated benchmarks online. The model, by the way, is at this point plausibly extremely dangerous, able to create bioweapons and scarily effective at doing so, but 92% of people are saying it's not the most important issue.

Man, how good would TikTok have to get so that 92% of people wouldn't be focused on AI at that point? I'm going to leave you with the positive ending of the two endings given by the paper, which predicts this in 2030. We end with "People terraform and settle the solar system and prepare to go beyond.

AIs running at thousands of times subjective human speed reflect on the meaning of existence, exchanging findings with each other and shaping the values it will bring to the stars. A new age dawns, one that is unimaginably amazing in almost every way, but more familiar in some." Those in the audience with a higher p(doom) can check out the other scenario, which is rather grim.

Notice, though, I'm not disputing whether some of these things will happen, just the timelines that they give. I still think we're living in some of the most epochal times of all. It's just that it might be a more epochal decade rather than a couple of years. Thank you as ever for watching.

I know I covered a lot in one video. I will try to separate out my videos more in future. I'm super proud of the DeepSeek documentary I made on Patreon, so do check it out if you want early access. But regardless, thank you so much for watching to the end, and have a wonderful day and wonderful decade.