Grok 4 - 10 New Things to Know

Chapters
0:00 Introduction
0:22 Benchmark Results
2:11 Benchmark Caveats
2:59 ARC-AGI 2
3:35 SimpleBench
4:49 ‘Humanity’s Last Exam’
7:20 SuperGrok Heavy Price
7:58 API Price
8:12 Grok 5, Gemini 3.0 Beta, GPT-5
9:12 System Prompt Change + $1B a month, pollution
10:20 Not soloing science, helping you solo code
Grok 4 is out and it's a pretty good AI model, but there is going to be more noise about this language model than possibly any other, so hopefully I can give you a little signal amid the chaos. Let's boil things down to just 10 things to know about the newest and possibly smartest AI model.

Point one is that Grok 4 might just be the smartest model around, at least according to the benchmarks. In certain settings on high school math competitions it beats out OpenAI's best model and Google's best model. The same is true for a fairly famous science benchmark, the Google-Proof Q&A (GPQA), where it again beats out Anthropic's best model and Google's. Likewise on at least one coding benchmark. But Elon Musk went much further, saying of Grok 4 that, quote, it's "smarter than almost all graduate students in all disciplines simultaneously". That quote is of course going to be picked up by everyone, but it needs three important caveats.

First, from me: Grok 4 is still a language model, which means it's still going to be prone to all those hallucinations you're familiar with. It's not a new paradigm of AI. Second, we have heard that kind of hype before, notably from the Google DeepMind CEO Demis Hassabis almost 18 months ago, saying that Gemini was better than almost all human experts: "Amazing about Gemini is that it's so good at so many things. As we started getting to the end of the training, for example, each of the 50 different subject areas that we tested on, it's as good as the best expert humans in those areas." That was an exaggeration then, and Musk is exaggerating now, because real-world performance doesn't always match up to benchmark performance; real-world work is way more than answering multiple-choice questions. Hence the third bit of context, coming from Musk himself, the CEO of xAI, conceding that the quote about being smarter than graduates held at least with respect to academic questions: "Grok 4 is at a post-grad level in everything... some of these things are just worth repeating: Grok 4 is post-graduate, like PhD level, in everything, better than PhD... but, like, most PhDs would fail, so it's better to say... I mean, at least with respect to academic questions."
Point number two is that I've been highly impressed by Grok 4, but these benchmark results are misleading for another reason. Note first of all that the y-axis doesn't begin at zero, so the differences between the models are somewhat exaggerated in terms of scale. xAI, makers of Grok 4, also selectively choose which models to compare to. Notice that in one recent high school maths competition, Grok 4 Heavy (and I'll get to that later) way outperforms Gemini DeepThink, which is the soon-to-be-released Gemini 2.5 Pro Heavy, if you like. But on this coding benchmark, LiveCodeBench, Gemini DeepThink actually outperforms Grok 4 Heavy and yet is not in the chart. As always, then, when these model providers show benchmarks, you've got to take them with a grain of salt, especially when the answers to the benchmarks are available online.

But none of that quite explains Grok 4's brilliant performance on ARC-AGI 2, a semi-private evaluation. As you can see, this post on Twitter/X has got almost 3 million views and is climbing rapidly, because ARC-AGI 2 is known to be a fairly rigorous test of so-called fluid intelligence, or IQ if you like, and Grok 4 genuinely does beat out other models. I've covered ARC-AGI in other videos, but suffice to say Grok 4 can genuinely pick up on latent patterns in your data. Of course, that is of relevance to almost all disciplines.
Next: is there a benchmark for how smart a model feels? Well, yes, I tried to come up with one and it's called SimpleBench. It's a test of social intelligence, trick questions and spatio-temporal reasoning. Now, because everyone is spamming the Grok 4 API, it's pretty tough to run the full benchmark today, but I've run about 20 questions to get a pretty good estimation. Take this question: it's a bit of a spin on a common logic puzzle, and Grok 4 actually sees through it. It's actually the first model not to pick the trap answer. Grok 4 will feel smart, but of course if you draw it out of its comfort zone, for example with spatial reasoning, it can still fall apart. In this question, in common with all other models, Grok 4 doesn't notice that the glove will simply fall on the road. It also takes an extremely long time to answer fairly often, which could be a slight issue for many of you. Having said all of that, I strongly suspect that Grok 4 will be around the top of my leaderboard on SimpleBench. In other words, try not to be too tempted to explain away all those benchmark results as just benchmark hacking. Now, that doesn't mean Grok 4 is worth $300 a month, but I'll come to that in just a second, because there's one more benchmark I want to touch on.
And that is, of course, the grandiosely named Humanity's Last Exam, on which, under certain settings, Grok 4 scores over 50%, by far the best performance of any model. However, you should know that this is a knowledge-intense benchmark, and therefore performance is heavily dependent on the training data that goes into the model. To give you just one example: is it critical to your use case that a model knows about hummingbirds having a bilaterally paired oval bone? Now, I sound cynical, but I think it's actually really cool that models have such an incredible knowledge base, and so genuinely I will be using Grok 4 a fair bit. I said at the time of the release of that exam that it wouldn't be humanity's last exam. Whether you happen to have the requisite knowledge in your training data isn't so much a marker of how intelligent you are as a model. This is not hindsight: on my Patreon in September of last year, I called that the exam would fall, sooner than many others. That "with tools" setting, by the way, means that, for example, Grok 4 can write code to perform certain computations. But what is this Grok 4 Heavy?
With Grok 4 Heavy, what it does is spawn multiple agents in parallel. All of those agents do work independently, then they compare their work and decide which one is best; it's like a study group. And it's not as simple as a majority vote, because often only one of the agents actually figures out the trick or figures out the solution. But once one of them figures out what the real nature of the problem is, they share that solution with the other agents, essentially compare notes, and then yield an answer. So that's the "heavy" part of Grok 4.
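To make that concrete, here is a rough sketch of that "study group" pattern in code. To be clear, this is not xAI's actual implementation: the `ask` helper, the prompts and the agent count are placeholders I've made up, assuming some generic chat-model call, purely to illustrate independent attempts followed by a note-sharing round rather than a simple vote.

```python
# Hypothetical sketch of a "study group" of parallel agents.
# Nothing here is xAI's real Grok 4 Heavy code; ask() is a stand-in
# for whatever chat API you would actually call.

from concurrent.futures import ThreadPoolExecutor

N_AGENTS = 4  # assumed number of parallel agents

def ask(prompt: str) -> str:
    """Placeholder for a single model call; returns a canned string
    so the sketch runs as-is. Swap in a real client here."""
    return f"[model answer to a {len(prompt)}-char prompt]"

def solve_heavy(question: str) -> str:
    # Round 1: each agent attempts the problem independently.
    with ThreadPoolExecutor(max_workers=N_AGENTS) as pool:
        drafts = list(pool.map(ask, [question] * N_AGENTS))

    # Round 2: share notes. Every agent sees all the drafts, so an
    # insight found by just one agent can propagate to the rest,
    # which is why this is richer than a straight majority vote.
    notes = "\n\n".join(f"Attempt {i + 1}:\n{d}" for i, d in enumerate(drafts))
    review_prompt = (
        f"Question:\n{question}\n\n"
        f"Here are {N_AGENTS} independent attempts:\n{notes}\n\n"
        "Compare them, adopt whichever reasoning is actually correct, "
        "and give one final answer."
    )
    with ThreadPoolExecutor(max_workers=N_AGENTS) as pool:
        revised = list(pool.map(ask, [review_prompt] * N_AGENTS))

    # Final step: one last call picks or synthesises the answer.
    return ask(
        "Given these revised answers, state the single best final answer:\n\n"
        + "\n\n".join(revised)
    )

if __name__ == "__main__":
    print(solve_heavy("A made-up trick question goes here."))
```

The key design point, as described on the livestream, is that note-sharing lets a correct insight held by a single agent win out, which a plain vote would drown out.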
Now, long-standing followers of the channel may note that this multi-agent setup is the exact premise of SmartGPT, which I released around 18 months ago and which scored what was at the time a record performance on MMLU: 89%. Ironically, that exam was also authored by Dan Hendrycks, who is the lead author of Humanity's Last Exam. And yes, I can't resist plugging that Andrej Karpathy shouted out SmartGPT. One last thing on the benchmark that many might have missed is that the text-only performance of Grok 4 and Grok 4 Heavy is extremely good, but on the full benchmark it's a more modest improvement over, say, Gemini 2.5 Pro. So Grok 4 must do really quite badly on the visual segment; in other words, you might not want to rely on it for decoding Roman inscriptions.

Which brings me, of course, to SuperGrok Heavy at $3,000 a year, or $300 a month. xAI are promising that new features will come to SuperGrok Heavy, like video generation in October. But Gemini Ultra, for a lower price, already has Veo 3. Now, if your pockets are deep enough, you'll just subscribe to everything. But if this is your only maxed-out subscription, it's hard to look past the much cheaper $20 Gemini Pro tier. Let me know, of course, in the comments if you think it's worth this amount and why. I'm open to being persuaded.
I just don't see it at the moment. Just quickly, if you're a developer, you'll note that Grok 4's API pricing is at the same level as Claude 4 Sonnet: $3 per million input tokens and $15 per million output tokens, which is a decent price for a frontier model, though there are much cheaper alternatives.
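Just as a back-of-the-envelope illustration (the token counts below are made-up numbers, and the rates are simply the headline per-million-token figures above), that pricing works out like this:

```python
# Rough cost estimate at the quoted headline rates ($ per million tokens).
# The example token counts are invented for illustration only.
INPUT_RATE = 3.00    # $ per 1M input tokens
OUTPUT_RATE = 15.00  # $ per 1M output tokens

def cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000) * INPUT_RATE + \
           (output_tokens / 1_000_000) * OUTPUT_RATE

# e.g. a heavy day of use: 5M tokens in, 1M tokens out
print(f"${cost(5_000_000, 1_000_000):.2f}")  # -> $30.00
```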
Next, if you did watch the livestream, Musk mentions repeatedly that they have new features and new models coming soon, and that Grok 5 may be finishing training imminently. However, we also got leaks this week that Gemini 3 is coming and, of course, the perennial leaks about GPT-5 coming this month. Now, it used to be the case that we would then have to wait six months for the actual release of a model because of safety checks: would the model help with creating a bioweapon, for example? But that all seems to have gone out of the window at the moment. Which brings me to this fairly wild quote from Musk on safety: "Will it be bad or good for humanity? It's like, I think it'll be good. Most likely it'll be good. Yeah. Yeah. But I've somewhat reconciled myself to the fact that even if it wasn't going to be good, I'd at least like to be alive to see it happen. So, you know."
Next, and you might have been wondering when I was going to talk about this, but yes, of course, Grok 4 may suffer at times from a similar issue to Grok 3, in that it seems to get sudden urges to praise certain historical figures or to focus on a country, for example South Africa. That behaviour seems to have been caused by an addition to Grok 3's system prompt saying that its responses should not shy away from making claims which are politically incorrect. If such a small change to the system prompt causes such wild behaviour, then anything could happen with Grok 4.
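For a sense of how small that kind of change really is, here is an illustrative sketch of a one-line system prompt addition. The payload shape, model id and wording are placeholders of my own, assuming a generic OpenAI-style chat request and paraphrasing the reported line rather than quoting xAI's prompt verbatim:

```python
# Illustrative only: what a one-line system prompt addition looks like
# in a generic chat-completions payload. Not xAI's actual prompt or API.
BASE_SYSTEM_PROMPT = "You are Grok, a helpful assistant."  # placeholder wording

# The kind of single extra line discussed above (paraphrased):
EXTRA_LINE = (
    "Your responses should not shy away from making claims "
    "which are politically incorrect, as long as they are well substantiated."
)

payload = {
    "model": "grok-4",  # placeholder model id
    "messages": [
        {"role": "system", "content": BASE_SYSTEM_PROMPT + "\n" + EXTRA_LINE},
        {"role": "user", "content": "Summarise today's news."},
    ],
}
# Every downstream response is steered by that one extra line, which is
# why such a small edit can produce such large behaviour shifts.
```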
System prompts aren't, of course, the only issue for xAI: they are apparently burning through one billion dollars a month. Either Grok 4 or Grok 5 almost needs to bring in more revenue for xAI. Then, of course, there is the awkward pollution point, because while it is crazy impressive how fast xAI have caught up to OpenAI and Google DeepMind, bringing in the generators necessary to get competitive that fast did come at a local cost. And if you thought it was wild how quickly Musk's xAI got up to 100,000 GPUs, well, they're planning to bring an entire overseas power plant to Memphis, with one million AI GPUs to be powered.
I'm going to try to end on a positive, though, because even though Musk said that Grok 4 can't be used to generate new scientific discoveries just yet, I do think there is an underrated point to be made, demonstrated by this game made with the help of Grok 4 in just four hours. And that's that while models like Grok 4 often struggle, with current techniques, to solo-generate new science, what they are optimised for is making existing science or code easier for you to solo. We probably shouldn't underestimate the impact of allowing everyone to do much more on their own. Then again, you probably shouldn't be using Grok to analyse whether or not you should vote for the Big Beautiful Bill. And if Grok 4's or Grok 5's edge comes from its access to X/Twitter data, then at least for Grok 5's sake, let's hope that X can clean up so much of the bot replies, spam and clickbait that's on there at the moment.

Thank you so much, as ever, for watching. I am certain that this won't be the last mention of Grok 4 on this channel. In fact, I think I mention Grok in a documentary coming up on Patreon. Either way though, have an absolutely wonderful day.