
Grok 4 - 10 New Things to Know


Chapters

0:00 Introduction
0:22 Benchmark Results
2:11 Benchmark Caveats
2:59 ARC-AGI 2
3:35 SimpleBench
4:49 ‘Humanity’s Last Exam’
7:20 SuperGrok Heavy Price
7:58 API Price
8:12 Grok 5, Gemini 3.0 Beta, GPT-5
9:12 System Prompt Change + $1B a month, pollution
10:20 Not soloing science, helping you solo code

Transcript

Grok 4 is out and it's a pretty good AI model but there is going to be more noise about this language model than possibly any other so hopefully I can give you a little signal amid the chaos. Let's boil things down to just 10 things to know about the newest and possibly smartest AI model.

Point one is that Grok 4 might just be the smartest model around, at least according to the benchmarks. In certain settings on high school math competitions it beats out OpenAI's best model and Google's best model. The same is true for a fairly famous science benchmark, the Google-Proof Q&A, where it again beats out Anthropic's best model and Google's.

Likewise on at least one coding benchmark. But Elon Musk went much further, saying of Grok 4 that, quote, "it's smarter than almost all graduate students in all disciplines simultaneously". That quote is of course going to be picked up by everyone, but it needs three important caveats. The first, from me, is that Grok 4 is still a language model, which means it's still going to be prone to all those hallucinations you're familiar with.

It's not a new paradigm of AI. Second, we have heard that kind of hype before, notably from the Google DeepMind CEO Demis Hassabis almost 18 months ago, saying that Gemini was better than almost all human experts: "Amazing about Gemini is that it's so good at so many things. As we started getting to the end of the training, for example, each of the 50 different subject areas that we tested it on, it's as good as the best expert humans in those areas."

That was an exaggeration then, and Musk is exaggerating now, because real-world performance doesn't always match up to benchmark performance. Real-world work is way more than answering multiple-choice questions, hence the third bit of context, coming from Musk himself, the CEO of xAI, saying that that quote about being smarter than graduates was "at least with respect to academic questions".

"Grok 4 is post-grad level in everything. Like, it's... some of these things are just worth repeating: Grok 4 is post-graduate, like PhD level, in everything. Better than PhD... but, like, most PhDs would fail, so it's better to say... I mean, at least with respect to academic questions."

Point number two is that, while I've been highly impressed by Grok 4, these benchmark results are misleading for another reason. Note first of all that the y-axis doesn't begin at zero, so the differences between the models are somewhat exaggerated in scale. xAI, makers of Grok 4, also selectively chose which models to compare against.

Notice that in one recent high school maths competition, Grok 4 Heavy (and I'll get to that later) way outperforms Gemini DeepThink; that's the soon-to-be-released Gemini 2.5 Pro Heavy, if you like. But on the coding benchmark LiveCodeBench, Gemini DeepThink actually outperforms Grok 4 Heavy, and yet is not in the chart.

As always, then, when these model providers show benchmarks, you've got to take them with a grain of salt, especially when the answers to the benchmarks are available online. But none of that quite explains Grok 4's brilliant performance on ARC-AGI 2, a semi-private evaluation. As you can see, this post on Twitter/X has got almost 3 million views and is climbing rapidly.

That's because this is known to be a fairly rigorous test of so-called fluid intelligence, or IQ if you like, and Grok 4 genuinely does beat out other models. I've covered ARC-AGI in other videos, but suffice to say Grok 4 can genuinely pick up on latent patterns in your data.

Of course, that is of relevance to almost all disciplines. Next: is there a benchmark for how smart a model feels? Well, yes, I tried to come up with one, and it's called SimpleBench. It's a test of social intelligence, trick questions and spatio-temporal questions. Now, because everyone is spamming the Grok 4 API, it's pretty tough to run the full benchmark today, but I've run about 20 questions to get a pretty good estimate.

Take this question, it's a bit of a spin on a common logic puzzle and Grok 4 actually sees through it. That's actually the first model not to pick the trap answer. Grok 4 will feel smart but of course if you draw it out of its comfort zone, for example with spatial reasoning it can still fall apart.

In this question, in common with all other models, Grok 4 doesn't notice that the glove will simply fall on the road. It also fairly often takes an extremely long time to answer, which could be a slight issue for many of you. Having said all of that, I strongly suspect that Grok 4 will be around the top of my leaderboard on SimpleBench.
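As an aside on how much weight a roughly 20-question sample can bear, a quick back-of-the-envelope confidence interval is a reasonable sanity check. The 12-out-of-20 score below is a hypothetical stand-in, not an actual SimpleBench result:

```python
import math

def estimate_accuracy(correct: int, n: int, z: float = 1.96):
    """Point estimate plus a normal-approximation 95% confidence
    interval for accuracy, from a small sample of questions."""
    p = correct / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical: 12 of 20 sampled questions answered correctly.
p, low, high = estimate_accuracy(12, 20)
print(f"estimate {p:.0%}, 95% CI roughly [{low:.0%}, {high:.0%}]")
```

With only 20 questions the interval is wide (tens of percentage points), which is why a small sample gives a rough ranking rather than a precise score.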

In other words, try not to be too tempted to explain away all those benchmark results as mere benchmark hacking. Now, that doesn't mean Grok 4 is worth $300 a month, but I'll come to that in just a second, because there's one more benchmark I want to touch on. And that is, of course, the grandiosely named Humanity's Last Exam, in which, under certain settings, Grok 4 scores over 50%, by far the best performance of any model.

However, you should know that this is a knowledge-intensive benchmark, and therefore performance is heavily dependent on the training data that goes into the model. To give you just one example: is it critical to your use case that a model knows about hummingbirds having a bilaterally paired oval bone?

Now, I sound cynical, but I think it's actually really cool that models have such an incredible knowledge base, and so genuinely I will be using Grok 4 a fair bit. But I said at the time of that exam's release that it wouldn't be humanity's last exam: whether you happen to have the requisite knowledge in your training data isn't so much a marker of how intelligent you are as a model.

This is not hindsight: on my Patreon in September of last year, I called that the exam would fall sooner than many others expected. The "with tools" setting, by the way, means that Grok 4 can, for example, write code to perform certain computations. But what is this Grok 4 Heavy? Well, here's Musk to explain. With Grok 4 Heavy, what it does is it spawns multiple agents in parallel, and all of those agents do work independently, and then they compare their work and they decide which one... it's like a study group.

And it's not as simple as a majority vote, because often only one of the agents actually figures out the trick or figures out the solution. But once they share the trick, or figure out what the real nature of the problem is, they share that solution with the other agents, and then they essentially compare notes and yield an answer.

So that's the heavy part of Grok 4. Now, long-standing followers of the channel may note that that is the exact premise of SmartGPT, which I released around 18 months ago, and which scored at the time a record performance on MMLU: 89%. Ironically, that exam was also authored by Dan Hendrycks, who is the lead author of Humanity's Last Exam.
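The mechanism Musk describes, independent attempts followed by note-sharing and then a final answer, can be sketched in a few lines. This is a toy simulation, not xAI's actual implementation: the agents are stand-ins for LLM calls, and the 30% chance of "spotting the trick" is an invented number purely for illustration:

```python
import random
from collections import Counter

def solve(question, shared_notes, rng):
    """Toy stand-in for one agent's attempt; a real system would
    call an LLM here. An agent that has seen the key insight in the
    shared notes answers correctly; otherwise it only sometimes does."""
    if "trick" in shared_notes:
        return "correct"
    return "correct" if rng.random() < 0.3 else "wrong"

def study_group(question, n_agents=4, seed=0):
    rng = random.Random(seed)
    notes = set()
    # Round 1: each agent works independently.
    first = [solve(question, notes, rng) for _ in range(n_agents)]
    # If any agent found the solution, it shares the insight,
    # which is what makes this more than a simple majority vote.
    if "correct" in first:
        notes.add("trick")
    # Round 2: agents compare notes and answer again.
    second = [solve(question, notes, rng) for _ in range(n_agents)]
    # Final answer: consensus over the second round.
    return Counter(second).most_common(1)[0][0]
```

With this seed, only one of the four agents gets it right on the first pass; a plain round-one vote would answer "wrong", but after the insight is shared the group converges on "correct", which is exactly the behaviour Musk is claiming for the Heavy variant.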

And yes, I can't resist plugging that Andrej Karpathy shouted out SmartGPT. One last thing on the benchmark that many might have missed is that the text-based performance of Grok 4 and Grok 4 Heavy is extremely good, but on the full benchmark it's a more modest improvement over, say, Gemini 2.5 Pro.

So Grok 4 must do really quite badly on the visual segment; in other words, you might not want to rely on it for decoding Roman inscriptions. Which brings me, of course, to SuperGrok Heavy, at $3,000 a year or $300 a month. xAI are promising that new features will come to SuperGrok Heavy, like video generation in October.

But Gemini Ultra, for a lower price, already has Veo 3. Now, if your pockets are deep enough, you'll just subscribe to everything. But if this is your only maxed-out subscription, it's hard to look past the much cheaper $20 Gemini Pro. Let me know in the comments, of course, if you think it's worth this amount and why.

I'm open to being persuaded; I just don't see it at the moment. Just quickly, if you're a developer, you'll note that Grok 4's pricing is at the same level as Claude 4 Sonnet: $3 per million input tokens and $15 per million output tokens, which is a decent price for a frontier model. Again, there are much cheaper alternatives.
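To make those per-million-token prices concrete, here is a minimal cost helper. The defaults match the $3 / $15 figures quoted above; the 2,000-token prompt and 1,000-token reply are just example numbers:

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     in_per_m: float = 3.00, out_per_m: float = 15.00) -> float:
    """Cost of one API call given per-million-token prices in USD."""
    return input_tokens / 1e6 * in_per_m + output_tokens / 1e6 * out_per_m

# e.g. a 2,000-token prompt with a 1,000-token reply:
print(f"${request_cost_usd(2_000, 1_000):.3f}")  # → $0.021
```

At these rates, heavy reasoning models that emit long chains of output tokens are where the bill actually accumulates, since output is priced at five times input here.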

Next, if you did watch the livestream, Musk mentions repeatedly that they have new features and new models coming soon, and that Grok 5 may finish training imminently. However, we also got leaks this week that Gemini 3 is coming, and, of course, the perennial leaks about GPT-5 coming this month.

Now, it used to be the case that we would then have to wait six months for the actual release of the model because of safety checks (would a model help with creating a bioweapon, for example?), but that all seems to have gone out of the window at the moment.

Which brings me to this fairly wild quote from Musk on safety: "Will it be bad or good for humanity? It's like... I think it'll be good. Most likely it'll be good. Yeah. Yeah. But I've somewhat reconciled myself to the fact that even if it wasn't going to be good, I'd at least like to be alive to see it happen."

So, you know. Next (and you might have been wondering when I was going to talk about this), yes, of course, Grok 4 may suffer at times from a similar issue to Grok 3, in that it seems to get sudden urges to praise certain historical figures or to focus on a particular country, for example South Africa.

That behaviour seems to have been caused by an addition to Grok 3's system prompt saying that its responses "should not shy away from making claims which are politically incorrect". If such a small change to the system prompt causes such wild behaviour, then anything could happen with Grok 4.

System prompts aren't, of course, the only issue for xAI: they are apparently burning through a billion dollars a month, so either Grok 4 or Grok 5 almost needs to bring in more revenue. Then, of course, there is the awkward pollution point, because while it is crazy impressive how fast xAI have caught up to OpenAI and Google DeepMind, bringing in the generators necessary to get competitive that fast did come at a local cost.

And if you thought it was wild how quickly Musk's xAI got up to 100,000 GPUs, well, they're planning to bring an entire overseas power plant to Memphis, with one million AI GPUs to be powered. I'm going to try to end on a positive, though, because even though Musk said that Grok 4 can't be used to generate new scientific discoveries just yet, I do think there is an underrated point to be made, demonstrated by this game made with the help of Grok 4 in just four hours.

And that's that while models like Grok 4, with current techniques, often struggle to solo-generate new science, what they are optimised for is making existing science, or code, easier for you to tackle solo. We probably shouldn't underestimate the impact of allowing everyone to do much more on their own. Then again, you probably shouldn't be using Grok to analyse whether or not you should vote for the Big Beautiful Bill.

And if Grok 4 or Grok 5's edge comes from its access to X/Twitter data, then, at least for Grok 5's sake, let's hope that X can clean up so much of the bot replies, spam and clickbait that are on there at the moment. Thank you so much, as ever, for watching.

I am certain that this won't be the last mention of Grok 4 on this channel. In fact, I think I mention Grok in a documentary coming up on Patreon. Either way though, have an absolutely wonderful day.