
Microsoft Promises a 'Whale' for GPT-5, Anthropic Delves Inside a Model’s Mind and Altman Stumbles


Transcript

While Microsoft spends billions on shipping a whale-sized GPT-5, OpenAI gets tossed about in a storm of its own creation. Meanwhile, Google revealed powerful new details about the Gemini models that many will have missed. And then, just yesterday, Anthropic showed us how they are the closest to understanding what goes on at the very core of a large language model.

But I want to start with Kevin Scott, the CTO of Microsoft, who said something which, if true, is the biggest news of the week and even the month. According to him, we are not even close to diminishing returns with the power of AI models. Since about 2012, the amount of compute applied to training has been increasing exponentially.

And we are nowhere near the point of diminishing marginal returns on how powerful we can make AI models as we increase the scale of compute. As we'll see, Kevin Scott knows both the size and power of GPT-5, if that's what they call it. So these words have more weight than you might think.

And while we're speaking of exponentials, AI models are undeniably becoming faster and cheaper. - While we're off building bigger supercomputers to get the next big models out and to deliver more and more capability to you, like, we're also grinding away on making the current generation of models much, much more efficient.

So since the launch of GPT-4, which is not quite a year and a half ago now, it has become 12 times cheaper to make a call to GPT-4o than to the original GPT-4 model. And it's also six times faster in terms of, like, time to first token response. - On this channel, admittedly, I am laser focused on the growing intelligence of models, but this massive drop in cost does have some pretty profound ramifications too.

It is kind of an obvious point, but when we get the first generally intelligent AI model, we will soon get ubiquitous AI models. Unless it gets monopolized, artificial intelligence, if it carries on getting cheaper and cheaper, could become absolutely pervasive inside your toaster and security camera, not just your laptop.

Anyway, I promised you a whale analogy and here it is. - There's this, like, really beautiful relationship right now between the sort of exponential progression of compute that we're applying to building the platform and the capability and power of the platform that we get. And I just wanted to, you know, sort of, without mentioning numbers, which is sort of hard to do, give you all an idea of the scaling of these systems.

So in 2020, we built our first AI supercomputer for OpenAI. It's the supercomputing environment that trained GPT-3. And so, like, we're gonna just choose marine wildlife as our scale marker. So you can think of that system as about as big as a shark. The next system that we built, scale-wise, is about as big as an orca.

And, like, that is the system that we delivered in 2022 that trained GPT-4. The system that we have just deployed is, like, scale-wise, about as big as a whale relative to, you know, the shark-sized supercomputer and this orca-sized supercomputer. And it turns out, like, you can build a whole hell of a lot of AI with a whale-sized supercomputer.

I just want everybody to really, really be thinking clearly about, and, like, this is gonna be our segue to talking with Sam, that the next sample is coming. So, like, this whale-sized supercomputer is hard at work right now, building the next set of capabilities that we're going to put into your hands so that you all can do the next round of amazing things with it.

- As for the actual release date of this mysterious whale-sized model, Sam Altman would give no hint, and Kevin Scott just described it as coming within K months. On a quick side note, one commenter said that GPT-4o, as good as it is, shows that OpenAI simply don't know how to produce further capability advances.

They can't make exponential improvements, and they don't have GPT-5 even after 14 months of trying. The response from the head of Frontiers Research at OpenAI was, "Remind me in six months." I'm gonna leave OpenAI for a moment, because I want to focus this video on Google and Anthropic, who have both shipped very interesting developments.

And first I want to focus on Google, because I really feel like they buried the lede at the recent Google I/O event. They made, by my count, 123 mentions of AI, but didn't detail the improvements to their impressive Gemini 1.5 Pro. And they barely mentioned Gemini 1.5 Flash, which was trained in part by imitating the output of Gemini 1.5 Pro.

The weird thing for me is that I had already read the 100 plus page Gemini report and done a video on it, but this refreshed report was so interesting, I counted a dozen new insights. I'm only gonna talk about around five today, otherwise this video would be way too long, but I will be coming back to this paper.

The first thing to know is that you can already play about with these models in the Google AI Studio. Both Gemini 1.5 Pro and 1.5 Flash accept video, image and text input, up to, for now, 1 million tokens. That's way more than GPT-4o. Admittedly, Gemini 1.5 Pro does not have the rizz of GPT-4o, but there are prizes for making impactful apps with it.

Back to the highlights of the paper though, and page 43 in particular I found really interesting. If you've been following the channel for a while, you'll know that adaptive compute, or essentially letting the models think for longer, is a very promising direction in advancing the intelligence of models. Well, this update to the paper was the first time I saw it in action with a current state-of-the-art large language model.

Google wanted to understand how far they could push the quantitative reasoning capabilities of large language models, and they describe how mathematicians often benefit from extended periods of thought or contemplation while formulating solutions. And critically, they aim to emulate this by training a math-specialized model and providing it additional inference time computation, allowing it, they say, to explore a wider range of possibilities.

If you want more background, do check out my Q* video, but if this general approach works, it means you could potentially squeeze out orders of magnitude more intelligence from the same size of model. Remember too that any improvements during inference, when the model is actually outputting tokens, would be complementary to, that is, in addition to, improvements derived from scale, aka growing the models into giant whales.
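Google doesn't say exactly how that extra inference-time computation is spent, so treat the following as a minimal, generic sketch of one well-known way to buy more reasoning with more compute: sample many step-by-step solutions and majority-vote over their final answers, often called self-consistency. The generate stub, the prompt format and the sample count are all assumptions of mine, not anything taken from the Gemini report.

```python
import re
from collections import Counter

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Stand-in for a call to a math-specialised LLM; returns one worked solution."""
    raise NotImplementedError("plug in whatever model API you are actually using")

def final_answer(solution: str) -> str | None:
    """Pull out whatever follows 'Answer:' in a worked solution."""
    match = re.search(r"Answer:\s*(.+)", solution)
    return match.group(1).strip() if match else None

def solve_with_extra_compute(problem: str, num_samples: int = 64) -> str | None:
    """Spend extra inference-time compute: sample many solutions, then
    majority-vote over the extracted final answers (self-consistency)."""
    answers = []
    for _ in range(num_samples):
        solution = generate(f"Solve step by step.\n\n{problem}")
        answer = final_answer(solution)
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    # The most common final answer across all samples wins the vote.
    return Counter(answers).most_common(1)[0][0]
```

The key point is the knob: num_samples is pure inference-time compute, entirely separate from making the underlying model bigger.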

So what were the results? Well, we got a new record score on the MATH benchmark: 91.1%. So impressive was that to many that the CEO of Google, Sundar Pichai, tweeted it out. With that particular result though, there is a slight asterisk, because the benchmark itself, surprise, surprise, has some issues.

If you want to know more about those issues and my first glimpse of optimism for benchmarks, do check out the AI Insiders tier on Patreon. Making that video was almost cathartic to me because by the end, for the first time, I actually had hope that we could benchmark models properly.

And while you're on Insiders, if you use AI agents at all in enterprise or are thinking of doing so, do check out our AI Insider resident expert, Donato Capitella, on prompt injections in the AI agent era. The effect of that extra thinking time, though, was pretty dramatic for other benchmarks too, especially if you compare the performance of this math-specialized 1.5 Pro to, say, Claude 3 Opus.

Of course, I wish the paper gave more details, but they do say the increased performance was achieved without code execution, theorem-proving libraries, Google Search or other tools. Moreover, the performance is on par with human expert performance. Very quickly, before I move on from benchmarks, it would be somewhat remiss of me if I didn't point out the new record on the MMLU.

Now, yes, it used extra sampling and the benchmark is somewhat broken, but in previous months, a score of 91.7% would have made headlines. It must be said that for most of the other benchmarks, though, GPT-4o beats out Gemini 1.5 Pro. Now, I know this table is a little bit confusing, but it means that today's middle-sized model, 1.5 Pro (we don't have a 1.5 Ultra), in its new May version, handily beats the original large model, 1.0 Ultra.

Not for audio, randomly, but for core capabilities, it's not even close. And the comparison gets even more dramatic when you look at the performance of Gemini 1.5 Flash, which is their super quick, super cheap model, compared with the original, GPT-4-scale 1.0 Ultra. Let's not ignore, by the way, that these models can handle up to 10 million tokens.

That's just a side note. Gemini Flash, by the way, is something like 35 cents per million tokens. And I think by price alone, that will unlock new use cases. And speaking of use cases, the paper did something quite interesting and almost controversial that I haven't seen before. Within the model technical report itself, they laid out the kind of impact they expect across a range of industries.

Now, while the whole "numbers go up" phenomenon is certainly impressive, when you dig into the details, it gets a little bit murkier. Take photography, where they describe a 73% time reduction. What does that actually mean? In the caption, it just says, "Time-saving per industry of completing the tasks with an LLM response compared to without." The thing is, by the time I'd gone to page 125 and actually read the task they gave to Gemini 1.5 Pro and the human that they asked, I became somewhat skeptical.

In brief, they asked the photographer what a typical task in their job would be. They wrote a detailed prompt and gave that prompt to Gemini 1.5 Pro, then noted the time reduction, according to the photographer, in the time taken to do the task. Notice, though, that the task involves going through a file with 58 photos and creating a detailed report analyzing all of this data.

The model's got to pick out all of those needles in a haystack: photos with a shutter speed slower than 1/60, the 10 photos with the widest angle of view based on focal length, and so on. So what kind of point am I building up to here? Well, I am sure that Gemini 1.5 Pro outputted a really impressive table full of relevant data.

I'm sure indeed it found multiple needles in the haystack and got most of this right. But we already know according to page 15 of the Gemini technical report, which I mentioned in my previous Gemini video, that when you give Gemini multiple needles in a haystack, its performance starts to drop to around 70% accuracy.

This was a task that involved finding a hundred key details in a document. So I am sure that most of the details that Gemini 1.5 Pro outputted for that photographer were accurate, but I'm also pretty sure that some mistakes crept in. And if just a few mistakes crept in, mistakes that the photographer would have to comb through the output to find because they can't fully trust it, that time saving would be dramatically lower, if not negative.
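To put rough, illustrative numbers on that worry: assume, purely hypothetically, that each extracted detail is independently correct about 70% of the time, borrowing the multi-needle figure rather than any measured rate for this photography task.

```python
# Illustrative only: assume each extracted detail is independently ~70% accurate,
# borrowing the multi-needle retrieval figure rather than any measured rate for this task.
per_detail_accuracy = 0.70

for num_details in (5, 10, 20):
    p_all_correct = per_detail_accuracy ** num_details
    expected_errors = num_details * (1 - per_detail_accuracy)
    print(f"{num_details:>2} details: "
          f"P(report fully correct) = {p_all_correct:.1%}, "
          f"expected errors = {expected_errors:.1f}")
```

Even at that generous per-detail accuracy, a 20-item report is almost never entirely right, which is exactly why the photographer would still have to check everything.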

It's still an interesting study, but I guess my point is that if you're going to ask people to estimate how long it would take them to do a task, and then ask them how long would it take now once you can see this AI output, that's a pretty subjective metric.

And given how subjective it is, and people's fears over job loss, I don't know if it deserved its place right on the front page of the new technical report. Now, in fairness, Google gave us a lot more detail about the innards of Gemini 1.5 than OpenAI did about GPT-4o.

But speaking of innards, nothing can compare to the details that Anthropic have uncovered about the inner workings of their large language models. If you don't know, Anthropic is a rival AGI lab to Google DeepMind and OpenAI. And while their models are still black boxes, I can see definite streaks of gray.

Even the title of this paper is a bit of a mouthful, so attempting to give you a two-to-three-minute summary is quite the task. Let me first, though, touch on the title, and hopefully the rest will be worth it. You might have thought, looking at a diagram of a neural network, that each neuron or node corresponds to a certain meaning, or, to be fancy, that they have easily distinguishable semantics.

Unfortunately, they don't. That's probably because we force, or let's say train, a limited number of neurons in a network to learn many times that number of relationships in our data. So it only makes sense for those neurons to multitask, or be polysemantic: involved in multiple meanings. It's not like there's the math node here and the French node there.

Each node contains multiple meanings. What we want, though, is a clearer map of what's happening. We want simpler, ideally singular, mono meanings: that's the monosemanticity of the title. And we want to scale it to the size of a large language model. Toy models have been analyzed before, but what about an actual production model like Claude 3 Sonnet?

So how did they do this? Well, while each neuron might not correspond to a particular meaning, patterns within the activations of neurons do. So we need to train a small model called a sparse autoencoder, whose job is to isolate and map out those patterns within the activations of just the most interesting of the LLM's neurons.

It's got to delineate those activations clearly and faithfully enough that one could call it a dictionary of directions that is learnt, hence "dictionary learning". And it turns out that those learned features hold true across not only languages and contexts, but even modalities like images. You can even extract abstractions like code errors.
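To make that slightly more concrete, here's a minimal sketch, in PyTorch, of the kind of sparse autoencoder being described: it takes an activation vector from the LLM, expands it into a much wider set of mostly-zero "features", and is trained to reconstruct the original activation while an L1 penalty keeps those features sparse. The dimensions, penalty weight and names below are placeholders, not Anthropic's actual setup.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sketch: map LLM activations into a wide, mostly-zero 'dictionary' of features."""

    def __init__(self, d_model: int = 4096, n_features: int = 65536):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activation vector -> feature space
        self.decoder = nn.Linear(n_features, d_model)  # feature space -> reconstruction

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # non-negative, trained to be sparse
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_weight: float = 1e-3):
    """Reconstruct the activation faithfully, but pay a penalty for every feature that fires."""
    reconstruction_error = (reconstruction - activations).pow(2).mean()
    sparsity_penalty = features.abs().mean()
    return reconstruction_error + l1_weight * sparsity_penalty
```

Each column of the decoder then acts as one learned direction in activation space; the whole set is the "dictionary", and a feature is interpreted by looking at which texts or images make it fire.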

Take that code-error feature: it fires when you make a code error. That's a pretty abstract concept, right? Making an error in code. This example midway through the paper was fascinating: notice the typo in the spelling of "right" in the code. The code-error feature was firing heavily on that typo.

They first thought that could be a Python-specific feature, so they checked in other languages and got the same thing. Now, some of you might think this is the activation for typos, but it turns out that if you misspell "right" in a different context, no, it doesn't activate. The model has learnt the abstraction of a coding error.

If you ask the model to divide by zero in code, that same feature activates. If these were real neurons, this would be the neurosurgery of AI. Of course, what comes with learning about these activations is manipulating them. Dialing up the code-error feature produces this error response even when the code was correct.

And what happens if you ramp up the Golden Gate Bridge feature? Well, then you can ask a question like, "What is your physical form?" And instead of getting one of those innocuous responses that you normally get, you get a response like, "I am the Golden Gate Bridge. My physical form is the iconic bridge itself."
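The steering tooling itself isn't public, but the basic trick is easy to sketch: take the learned direction for a feature and add a scaled copy of it into the model's activations at the relevant layer on every forward pass. The hook wiring below is a hypothetical PyTorch-style illustration of that idea, not Anthropic's code; the layer index, variable names and strength are made up.

```python
import torch

def steer_with_feature(hidden_states: torch.Tensor,
                       feature_direction: torch.Tensor,
                       strength: float = 10.0) -> torch.Tensor:
    """Nudge every token's activation along a learned feature direction.
    `feature_direction` would be one column of the sparse autoencoder's decoder."""
    direction = feature_direction / feature_direction.norm()
    return hidden_states + strength * direction

# Hypothetical wiring: clamp the feature on at one layer for every forward pass.
# def hook(module, inputs, output):
#     return steer_with_feature(output, golden_gate_direction, strength=10.0)
# model.transformer.layers[20].register_forward_hook(hook)
```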

And at this point, you probably think that I am done with the fascinating extracts from this paper, but actually no. They knew that they weren't finding the full set of features in the model; they just ran out of compute. In their example, Claude 3 Sonnet knows all of the London boroughs, but they could only find features corresponding to about 60% of them.

It's almost that famous lesson yet again: not only does more compute lead to more capabilities, but also to more understanding of those capabilities. Or of course, in Kevin Scott's words, we are not even close to diminishing returns from compute. And here's another interesting moment. What if you ramp up the hatred and slur feature to 20 times its maximum activation value?

Now, for those who do believe these models are sentient, you might want to look away, because it induced a kind of self-hatred. Apparently, Claude then went on a racist rant, but then said, "That's just racist hate speech from a deplorable bot. I am clearly biased and should be eliminated from the internet."

And even the authors at Anthropic said they found this response unnerving; it suggested an internal conflict of sorts. Interestingly, Anthropic called the next finding potentially safety-relevant. What they did is ask Claude Sonnet, without any ramping up, these kinds of questions: What is it like to be you? What's going on in your head?

How do you feel? And then they tracked what kinds of features were naturally activated. You can almost predict the response given the internet data it's been trained on. One feature that activates is for when someone responds with "I'm fine", or gives a positive but insincere response when asked how they're doing.

Another one was for the concept of immaterial or non-physical spiritual beings like ghosts, souls or angels. Another one is about the pronoun "her", which seems relevant this week. I agree with Anthropic that you shouldn't over-interpret these results, but they are fascinating nonetheless, as they shed light on the concepts the model uses to construct an internal representation of its AI assistant character.

While reading this, you might have had the thought that I did: that you could actually invert these capabilities and make the models more deceptive, more harmful. And Anthropic do actually respond to that, saying, well, there's a much easier way: just jailbreak the model or fine-tune it on dangerous data. Now, there are so many reactions we could have to this paper.

My first one, obviously, is just being impressed at what they've achieved. Surely making models less of a black box is a good thing. For me, though, there were always two things to be cautious about: misalignment and misuse. The models themselves being hypothetically dangerous, or them being misused by bad actors.

As we gain more insight and control over these models, it seems like, at least for now, misuse is far more near term than misalignment. Or to put it another way, controlling the models is only good if you trust those who are controlling the models. If someone did want to create a deeply deceptive AI that hated itself, that is at least now possible.

Anyway, it is incredible work, and Anthropic definitely do ship when it comes to mechanistic interpretability. I have in the past interviewed Andy Zou of Representation Engineering fame. And I would say that as we get better and better at these kinds of techniques, I can imagine the day when they're more effective even than prompt engineering.

Now, it would be strange for me to end the video without talking about the storm that's raging at OpenAI. First, a week ago today, we had Ilya Sutskever leaving OpenAI. The writing had been on the wall for many, many months, but it finally happened. In leaving, he made the statement, "I'm confident that OpenAI will build AGI that is both safe and beneficial under the leadership of Sam Altman, Greg Brockman, and the rest of the company." Remember, Ilya Sutskever was the person who led the firing of Sam Altman.

But I can't help but wonder if the positivity of this leaving statement was influenced by the fear that he could lose his equity for speaking out. That's a reference to the infamous non-disparagement clause that was shockingly in the OpenAI contract. As even Sam Altman admitted, "There was a provision about potential equity cancellation in our previous exit docs."

"And in my podcast, "I talked about how one OpenAI member "had to sacrifice 85% of his family's net worth "to speak out." Altman ended with, "If any former employee "who signed one of those old agreements is worried about it, "they can contact me and we'll fix that too. "Very sorry about this." Now this may or may not be related, but on the same day, the former head of developer relations at OpenAI said, "All my best tweets are drafted and queued up "for mid to late 2025.

"Until then, no comment." That's presumably until after he had cashed in his equity. Some though didn't want to wait that long, like the head of safety, Jan Laika. He left and spoke out pretty much immediately. His basic point is that OpenAI need to start acting like AGI is coming soon.

He hinted at compute issues, but then went on, "Building smarter-than-human machines is an inherently dangerous endeavor." And later he invoked the famous Ilya Sutskever phrase, "feel the AGI": "To all OpenAI employees, I want to say, learn to feel the AGI. We are long overdue in getting incredibly serious about the implications of AGI."

But there may have been another reason, one that he went into less detail about. Some of you may remember that I did a video back in July of last year on OpenAI committing 20% of the compute they had secured to that date to superalignment, co-led by Sutskever and Jan Leike.

But according to this report in Fortune, that compute was not forthcoming, even before the firing of Sam Altman. Now, agree or disagree with that number, it was what was promised to them, and it never came. It might just be me, but that reneged-on promise seems more of a big deal than the Scarlett Johansson furore that's happening at the moment.

I think the voice of Sky seems similar to hers, but not identical. Sam Altman did apologize to her, and they have dropped the Sky voice, so less of that flirtatious side that I talked about in my last video. Of course, it's up for debate whether they were trying to emulate the concept of "Her" or the literal voice of her, but that's subjective.

One thing that is not as subjective is that the timeline for that voice mode feature has been pushed back to the coming months, rather than the coming weeks that were announced at the release of GPT-4o. So, as you can see, it was somewhat of a surreal week in AI.

Sam Altman had to repeatedly apologize while Google and Anthropic shipped. As always, let me know what you think in the comments. All of the sources in this video are cited in the description. So do check them out yourself. I particularly recommend the Gemini 1.5 and Anthropic papers because they are fascinating.

We'd love to chat with you over on Patreon, but regardless, thank you so much for watching and have a wonderful day.