
$125B for Superintelligence? 3 Models Coming, Sutskever's Secret SSI, & Data Centers (in space)...


Chapters

0:00 Intro
1:06 SSI, Safe Superintelligence (Sutskever)
3:45 Grok-3 (Colossus) + Altman Concerned
5:36 CharacterAI + Foundation Models
6:26 $125B Supercomputers + 5-10 GW
8:28 ‘GPT-6’ Scale
9:07 Zuckerberg on Exponentials and Doubt
9:42 Strawberry/Orion + Connections + Weights
11:39 Data Centers in Space (and the sea)
12:45 Distributed Training + SemiAnalysis Report w/ Gemini 2
17:34 Climate Change Pledges?

Transcript

"Superintelligence just got valued at $5 billion." Or should I say, "Safe superintelligence "led by none other than the reclusive Ilya Sutskova "just got valued at $5 billion," with this detail-free tweet in just the last few hours. And today, we also got more news of Gemini 2, Grok 3, and not just one, but two new $125 billion data centers.

All this news seems so disparate, right? But there is one theme through it all, which is computing power. Even to the point of making data centers in space, we as a species are making a giant bet that scaling up language models will unlock true artificial intelligence. If the scaling hypothesis believers are right, it's coming soon.

If they're wrong, this could all be viewed as the biggest waste of resources in human history. But let's start with the news from just a couple of hours ago that Ilya Sutskever has raised $1 billion at a $5 billion valuation for his startup, Safe Superintelligence. There he is, by the way, in the middle.

He's definitely still alive and working on AI. If you haven't heard of Safe Superintelligence, don't worry, it's actually only three months old, but as mentioned earlier, valued at $5 billion. If only my three-month-old side hustles got valued at $5 billion, the world would be a happy place. But more seriously, what will that $1 billion in funds be used for?

Well, it's that key theme you'll see throughout this video and probably throughout the next five years: the funds will be used to acquire computing power. Ilya Sutskever, by the way, sent a link to this Reuters article, so we can pretty much trust that it's spot on with its details. The salient thing, though, about the company, Safe Superintelligence, is just how little detail they're giving out about what they're working on.

Essentially, though, the key pitch is this. Give us a couple of years, and then we're going to attempt in one shot to bring you superintelligence. By the way, they're gonna do that with a team that's currently just 10 employees strong. But before you dismiss them immediately, they are backed by some heavy hitters like Sequoia Capital and Daniel Gross, who is a co-founder.

Up to this point, though, Ilya Sutskever, who is clearly the key person in this venture, hasn't given any real detail about his approach, but he did sprinkle some hints into this article. Sutskever said that he will approach scaling in a different way from his former employer, which would be OpenAI.

And he said, "Everyone just says scaling hypothesis. Everyone neglects to ask, 'What are we scaling?'" But he went on, "Some people can work really long hours, and they'll just go down the same path faster. It's not so much our style. But if you do something different, then it becomes possible for you to do something special." Before people get completely carried away, though, I do want to add a little bit of context to some of the claims that Sutskever has made before.

He co-led the superalignment team at OpenAI, which just over a year ago set themselves the deadline of aligning or making safe superintelligence within four years. Now, I'm not against crazy ambition, and alignment is important, but what actual progress has been made in that year and a bit? Yes, I have read those blog posts put out by the former members of the team, but it doesn't strike me as being a quarter of the way to aligning a superintelligence.

On the grounded-to-fanciful scale, it is definitely leaning toward the latter. Now, naturally, those weren't the only grandiose visions announced in the last 48 hours. Here is Musk two days ago, claiming to have the most powerful AI training system in the world. He mentions it soon having around 200,000 H100 equivalents, which are the GPUs that go into training large language models.

Now, your first thought might be that that's either an idle boast, or that it's not really the computing power that matters, it's how you use it. But I do give comments like that, and this one from July, more credence because of the capabilities of Grok 2. Grok 2, the frontier model produced by Musk's xAI team, is genuinely a GPT-4-level competitor.

So it's worth at least paying attention when he says that they are gonna train the most powerful AI by every metric by December of this year. And I will give one further hint that that claim shouldn't be immediately dismissed. And that's from this report yesterday in The Information.

Now, first, it did caveat that the 100,000-chip cluster, known as Colossus, isn't fully operational. Apparently, fewer than half of those chips are currently in operation, largely because of constraints involving power or networking gear, and more about power constraints in a moment. But according to The Information, OpenAI CEO Sam Altman has told some Microsoft executives that he is concerned that Musk's xAI could soon have more access to computing power than OpenAI does.

And remember, OpenAI has access to the behemoth that is Microsoft's compute power. It's at this point, though, that you might be starting to wonder something. Is it all just about computing power? Isn't there supposed to be some secret sauce at OpenAI or Google? Is it really just about raw computing power?

Can we buy our way to superintelligence? Well, it's not for want of trying. There have been plenty of teams that have tried to build their own foundation models, only to realize that the key ingredient is scale: computing power. Character AI even built up a loyal fan base and had some stars of the industry, but couldn't make their own foundation models work.

You may also recall efforts by Adept AI and Inflection, which produced the Pi chatbot. The key personnel from those teams were snapped up by the likes of Google and Microsoft. In short, people are trying alternatives to brute-force scaling, but not much is working currently. Sure, you can eke out compute efficiencies and optimizations, like GPT-4o, the Orca series of models, and the Phi family of models, but nothing beats scaling.

And that might be why companies are betting everything on colossal new data centers. We're talking levels of investment at a scale that could fund the research to cure entire diseases, or perhaps fund the national budgets of medium-sized countries. And you might've thought from the title of this video that there's a singular $125 billion supercomputer, but there are actually two being planned.

I should, of course, add the caveat that this is according to The Information, via officials who would know about such investments. Namely, the source is North Dakota's Commissioner of Commerce, Josh Teigen, who said that two separate companies approached him and the governor of North Dakota about building mega AI data centers.

These would initially consume around 500 megawatts to one gigawatt of power, with plans to scale up to five or 10 gigawatts of power over several years. Those numbers, of course, to most of you, will be complete gobbledygook. So for context, here is an excellent diagram from Epoch AI. Five gigawatts of power allocated to a single training run would put the power constraint just above this line here.

Now, given that it's expected that these other constraints would kick in at higher levels, that would give us just over 10,000 times more compute than was used for training GPT-4. And given the broad approximate deltas between generations of GPTs, that would be the equivalent of a GPT-6 training run.
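If you want to see where a number like that could come from, here's a minimal back-of-envelope sketch in Python. To be clear, every input in it, the GPT-4 figures, the efficiency gain, the run length, is a rough publicly circulated estimate or a flat assumption on my part, not something lifted from the Epoch diagram.

```python
# Back-of-envelope: how much more training compute might a 5 GW
# campus buy compared to GPT-4? All inputs are rough public
# estimates or assumptions, not figures from the Epoch report.

GPT4_FLOP = 2e25        # commonly cited estimate of GPT-4 training compute
GPT4_POWER_MW = 25      # assumed power draw of the GPT-4 training cluster

NEW_POWER_MW = 5_000    # the 5 GW campus from the report

power_ratio = NEW_POWER_MW / GPT4_POWER_MW  # ~200x from raw power alone
efficiency_gain = 10    # assumed FLOP-per-watt gain over A100-era hardware
duration_ratio = 5      # assumed longer training run

multiplier = power_ratio * efficiency_gain * duration_ratio
print(f"~{multiplier:,.0f}x GPT-4 compute")            # ~10,000x
print(f"~{GPT4_FLOP * multiplier:.0e} FLOP in total")  # ~2e+29 FLOP
```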

Now, yes, I know that there are quote leaks like this one showing that GPT-5 might have between three and five trillion parameters, but my adage for such leaks is don't trust and verify. And also the number of parameters that goes into a model or the number of tweakable knobs, if you like, doesn't tell you automatically how much compute is used to train the model.

Data is also a massive factor there. Chinchilla scaling laws have long since been left behind and we are massively ramping up the amount of data for a given number of parameters. But before we all get too lost in the numbers, what am I actually saying here? I'm saying that with the amount of money that's being spent and the amount of power that's being provisioned, people are factoring in models up to the scale of something like GPT-6.
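To put a formula on that: the usual rule of thumb is that training compute C ≈ 6 × N × D, where N is the parameter count and D is the number of training tokens, which is exactly why a parameter count on its own doesn't pin down compute. Here's a tiny sketch; the 4-trillion-parameter figure is purely illustrative, not a nod to any leak.

```python
# Rule of thumb: training compute C ≈ 6 * N * D, where N is the
# parameter count and D is the number of training tokens. Same N,
# very different compute, depending on how much data you use.

def training_flops(params: float, tokens: float) -> float:
    """Approximate training FLOPs via the common 6*N*D estimate."""
    return 6 * params * tokens

N = 4e12  # hypothetical 4T parameters (illustrative only)

chinchilla = training_flops(N, 20 * N)    # Chinchilla-optimal: ~20 tokens/param
overtrained = training_flops(N, 200 * N)  # data ramped up 10x beyond that

print(f"Chinchilla-style run: {chinchilla:.1e} FLOP")   # ~1.9e+27
print(f"Overtrained run:      {overtrained:.1e} FLOP")  # ~1.9e+28
# 10x the data means 10x the compute for the exact same parameter count.
```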

By the way, if you're skeptical that any progress has been made, compare the performance of the original ChatGPT from November of 2022 with Claude 3.5 Sonnet. It's pretty night and day. Claude 5.5 Sonnet would be quite interesting to behold. Now, just to emphasize how much this scaling is a bet rather than a certainty in terms of the outcome it will produce, here again is Mark Zuckerberg.

- It's one of the trickiest things in the world to plan around is when you have an exponential curve, how long does it keep going for? And I think it's likely enough that it will keep going, that it is worth investing the tens or 100 billion plus in building the infrastructure to assume that if that kind of keeps going, you're going to get some really amazing things that are just going to make amazing products.

But I don't think anyone in the industry can really tell you that it will continue scaling at that rate for sure. - And you may have noticed that I've barely mentioned OpenAI's successor language models and new verifier approaches. Those approaches, previously labeled Q* or Strawberry, throw in a bit of an X factor over the coming months.

According to this article from, again, The Information, OpenAI want to launch Strawberry, which was previously called Q* (check out my video on that for what I think it might be), within ChatGPT as soon as this fall. Interestingly, the only hint they gave of its capabilities was that it could solve a New York Times Connections word puzzle.

Now, since I read this article, I have been trying plenty of those New York Times Connections puzzles. You've got to create four groups of four words that form a kind of logical set. Here, though, is the interesting part. If you feed these puzzles to GPT-4o as text or as an image, it usually can get one or two sets of four words.

But then what will happen is it will get stuck. And even if you prompt it to try different arrangements of the remaining words, it'll still predict the same things again and again. So at the very least, OpenAI must have pioneered a method to get language models out of their local minima to get them to try different things instead of getting stuck in a rut.
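For what it's worth, here's a hedged sketch of the kind of workaround you could build client-side: keep a record of every grouping the model has already proposed, feed those back as already tried, and refuse repeats. The `ask_model` and `is_correct` callables are placeholders you'd supply yourself, and this is emphatically not how OpenAI's system works, just an illustration of the rut.

```python
# A naive client-side workaround for a model that keeps repeating
# the same Connections guess: remember every grouping it has already
# proposed, surface those in the next prompt as "already tried", and
# refuse repeats. NOT OpenAI's method; just a sketch of the failure
# mode and one blunt way around it.

from typing import Callable

def solve_connections(
    ask_model: Callable[[str, list[frozenset[str]]], frozenset[str]],
    is_correct: Callable[[frozenset[str]], bool],  # your puzzle checker
    puzzle: str,
    max_attempts: int = 20,
) -> list[frozenset[str]]:
    solved: list[frozenset[str]] = []
    rejected: list[frozenset[str]] = []
    for _ in range(max_attempts):
        if len(solved) == 4:
            break                      # all four groups found
        guess = ask_model(puzzle, rejected + solved)
        if guess in rejected or guess in solved:
            continue                   # the rut: same grouping again, re-ask
        if is_correct(guess):
            solved.append(guess)
        else:
            rejected.append(guess)     # fed back in the next prompt
    return solved
```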

How that plays out, though, in terms of true reasoning capability, I'm gonna wait to test it on SimpleBench to find out. Speaking of experiments, though, let me quickly introduce you to Weave from the legendary Weights & Biases, the sponsors of this video. Proper evaluations of language models are absolutely crucial, as is clearly visualizing the differences between them.

You'd also ideally want your toolkit to be lightweight so you can confidently and quickly iterate on your LLM applications. So in addition to their free courses and guides, do check out Weave using the link you can see on screen, which will also, of course, be in the description. Now, though, for what some of you have been waiting for: the news we got yesterday that one startup is attempting to build data centers in space.

Such is the need for reliable energy to power these data centers that we are resorting to putting the data centers into space. This company, Lumen Orbit, is a Y Combinator startup, and they are aiming for gigawatt scale. Their promo video even mentions a possible five-gigawatt data center, which again, if dedicated to pre-training, would allow a GPT-6-style model.

But I must add in a quick caveat before we all go wild about data centers in space. Things like this have been tried before. Microsoft tried to build data centers underwater. The idea was that the sea could help cool the data center and save on costs. And even though it was described as largely a success, apparently it didn't make sense from an operational or practical perspective.

The cost of maintenance, among other things, was simply prohibitive. Now, I don't know about you, but it strikes me that the cost of maintaining things in space might be even higher. But as you may have already deduced, that's not exactly gonna stop us reaching GPT-6-scale models. Why not?

Well, we do have the option of geographically distributing the computers used to train the models. In fact, according to some sources, Microsoft found they more or less had to do that. Apparently one Microsoft engineer on a GPT-6 training cluster project was asked, "Why not just co-locate the cluster in one region?" Well, they tried that, he said, "but we can't put more than 100,000 H100s" (that's roughly the size of the Colossus project we heard about earlier from Elon Musk) "in a single state without bringing down the power grid." As we saw from that Epoch analysis, it's the power that's the constraining factor.

And also possibly water, but more on that in a future video. But if we distribute the training, then the clusters don't all have to be in the same place, which reduces that local power drain. And that approach of distributed training, to cut a long story short, is where we seem to be heading.
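As a toy picture of the core idea, here's a single-process simulation of data-parallel training across four "sites": each computes a gradient on its own shard of data, the gradients are averaged (the all-reduce step), and every site applies the identical update. Real multi-campus training is this idea plus staggering engineering for bandwidth, latency, and failures; nothing here reflects any lab's actual stack.

```python
# Toy single-process simulation of multi-site data-parallel training:
# each "site" holds a weight replica, computes a gradient on its own
# data shard, the gradients are averaged (the all-reduce), and every
# site applies the same update, so the replicas never diverge.

import random

def local_gradient(w: float, shard: list[float]) -> float:
    # d/dw of mean squared error for the model y = w*x with targets
    # y = 2x, so training should drive w toward 2.0.
    return sum(2 * (w * x - 2 * x) * x for x in shard) / len(shard)

random.seed(0)
sites = [[random.uniform(-1, 1) for _ in range(100)] for _ in range(4)]
w, lr = 0.0, 0.05

for step in range(300):
    grads = [local_gradient(w, shard) for shard in sites]  # per-site work
    avg_grad = sum(grads) / len(grads)                     # the "all-reduce"
    w -= lr * avg_grad                    # identical update at every site

print(f"learned weight ≈ {w:.3f} (target 2.0)")
```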

According to a report out just today from SemiAnalysis, Google, OpenAI, and Anthropic are already executing plans to expand their large model training from one site to multiple data center campuses. And we already know, by the way, that Gemini Ultra 1.0 was trained across multiple data centers, so it can be done.

And before we get to more details on that, there was this hidden gem in the third paragraph. Again, this article was from today, and it said, "Google's existing models lag behind OpenAI and Anthropic because they are still catching up in terms of synthetic data, reinforcement learning, and model architecture, but the impending release of Gemini 2 will change this." SemiAnalysis is a pretty reliable source, so that's an interesting comment. It seems like we will get Gemini 2 and Grok 3 within a few months. And as we heard earlier, the Strawberry system from OpenAI in roughly the same timeframe.

We will probably have to wait till next year for OpenAI's next flagship, though, which is codenamed Orion. But just for a moment, I want you to forget all the names and the fruit and focus on one key detail. If the key ingredient to the performance of language models is their scale, we should find out that fact by the end of this year.

Or to put it another way, if scaling doesn't work up to the levels of Grok 3 and Gemini 2, then what else will? If the data centers are getting to the kind of scale where we need satellite pictures to assess how big they are, and that doesn't produce true artificial intelligence, then, well, do we have to rely on Ilya Sutskever?

Obviously, I'm being cheeky, but the possibility, at least, that scaling will yield superintelligence is, I think, priced into the value of many of these companies. So if it doesn't, you could expect a reflection of that in the form of a bubble bursting. Just quickly, I can't resist pointing out that if you're in America, then the fact that models will be increasingly interconnected across that continent will lead to a kind of interesting philosophical moment.

Microsoft will quite literally be, in the famous words of their CEO, "above you, around you, beneath you." Now, of course, it almost goes without saying that there will be immense hardware issues in getting this all set up and running smoothly. Billions of man-hours' worth of problems to be solved, for sure.

And that's why companies, it seems, are clamming up about how they're solving these hardware issues. The publishing of methods has effectively stopped. When OpenAI and others tell the hardware industry about these issues, they are very vague and high-level, so as not to reveal any of their distributed-systems tricks.

To be clear, SemiAnalysis says these techniques are more important than model architecture, as both can be thought of as compute efficiency. Here, then, is the central claim from SemiAnalysis. There is a camp that feels AI capabilities have stagnated ever since GPT-4's release. I know many of you watching will feel that.

This is generally true, but only because no one has been able to massively increase the amount of compute dedicated to a single model. The word "only" there is, of course, an opinion rather than an established fact. Some, of course, believe that no amount of scaling will yield true reasoning or intelligence.

I have my thoughts, but honestly, I'm somewhat agnostic. I genuinely want to know how these future models perform on my SimpleBench. I go into a ton of detail about what I'm creating on my Patreon, which is called AI Insiders. Oh, and also, just a couple of days ago, I released this video on that Epoch AI research.

That reminds me, actually: there was one more thing from that research that I wanted to touch on in this video. It came about halfway through the 20,000-word report, and it's right here. I don't know why I picked it out; I just find it really quite poignant and interesting to see what these behemoth companies will do, because what they pledged, and this is Google, Microsoft, and Amazon, is to become carbon neutral by 2030.

Now, what do you predict will happen if it turns out that the scaling hypothesis is true, that AI is immensely profitable, and yet that it requires so much power that these targets get broken? Will they stick to their honorable pledges? Well, we know what Sam Altman wants to do, which is spend on the order of trillions and finance dozens of new chip factories.

This was from a separate report in The Information, but there was one quote that I found interesting in relation to it. According to the CEO of TSMC, which makes most of these chips that power NVIDIA and everyone else, Sam Altman was, quote, "too aggressive for me to believe." Remember, by the way, that pretty much all of this comes down to that Taiwanese company, which is why the whole tech industry is so nervous about China invading Taiwan.

Anyway, Sam Altman is, according to the TSMC CEO, "too aggressive for him to believe." And maybe even these $125 billion data centers are also too aggressive. It's indubitable that a mountain has been identified and that the AI industry is trying to climb it. Whether they will, or indeed whether they're even heading in the right direction, only time will tell.

Thank you as ever so much for watching all the way to the end. I'm super grateful and have a wonderful day.