
Grok 4 - 10 New Things to Know


Chapters

0:00 Introduction
0:22 Benchmark Results
2:11 Benchmark Caveats
2:59 ARC-AGI 2
3:35 SimpleBench
4:49 ‘Humanity’s Last Exam’
7:20 SuperGrok Heavy Price
7:58 API Price
8:12 Grok 5, Gemini 3.0 Beta, GPT-5
9:12 System Prompt Change + $1B a month, pollution
10:20 Not soloing science, helping you solo code

Whisper Transcript

00:00:00.000 | Grok 4 is out and it's a pretty good AI model but there is going to be more noise
00:00:07.020 | about this language model than possibly any other so hopefully I can give you a little signal amid
00:00:13.760 | the chaos. Let's boil things down to just 10 things to know about the newest and possibly
00:00:19.660 | smartest AI model. Point one is that Grok 4 might just be the smartest model around at least
00:00:26.820 | according to the benchmarks. In certain settings on high school math competitions it beats out
00:00:31.860 | OpenAI's best model and Google's best model. The same is true for a fairly famous science benchmark
00:00:38.120 | the Google-Proof Q&A (GPQA) where it again beats out Anthropic's best model and Google's. Likewise on
00:00:43.920 | at least one coding benchmark but Elon Musk went much further saying about Grok 4 that quote it's
00:00:49.700 | smarter than almost all graduate students in all disciplines simultaneously. That quote is of course
00:00:56.080 | going to be picked up by everyone but it needs three important caveats. First from me is that
00:01:01.760 | Grok 4 is still a language model which means it's still going to be prone to all those hallucinations
00:01:07.320 | you're familiar with. It's not a new paradigm of AI. Second we have heard that kind of hype before
00:01:12.740 | notably from the Google DeepMind CEO Demis Hassabis almost 18 months ago saying that Gemini 2 was better
00:01:20.760 | than almost all human experts. Amazing about Gemini is that it's so good at so many things as we started
00:01:25.260 | getting to the end of the training for example each of the 50 different subject areas that we tested on
00:01:28.780 | it's as good as the best expert humans in those areas. That was an exaggeration then and Musk is
00:01:33.680 | exaggerating now because real world performance doesn't always match up to benchmark performance.
00:01:39.440 | is way more than answering multiple choice questions hence the third bit of context coming from Musk himself
00:01:45.900 | the CEO of xAI saying that that quote about being smarter than graduates was at least with respect to
00:01:51.840 | academic questions. Grok 4 is a post-grad level in everything like it's it just some of these things
00:01:57.600 | are just worth repeating like Grok 4 is post-graduate like PhD level in everything better than PhD but like
00:02:05.240 | most PhDs would fail so it's better to say I mean at least with respect to academic questions. Point number
00:02:11.060 | two is that I've been highly impressed by Grok 4 but these benchmark results are misleading for another
00:02:17.100 | reason. Note first of all that the y-axis doesn't begin at zero so these differences between the models
00:02:22.720 | are somewhat exaggerated in terms of scale. xAI, makers of Grok 4, selectively choose which models
00:02:29.260 | to compare to. Notice in one recent high school maths competition Grok 4 Heavy and I'll get to that later
00:02:35.760 | way outperforms Gemini Deep Think. That's the soon to be released Gemini 2.5 Pro Heavy if you like. But in
00:02:42.520 | this coding benchmark, LiveCodeBench, Gemini Deep Think actually outperforms Grok 4 Heavy and yet is not in
00:02:49.400 | the chart. As always then when these model providers show benchmarks you've got to take them with a grain of
00:02:54.740 | salt. Especially when the answers to the benchmarks are available online. But none of that quite explains
00:03:01.240 | Grok 4's brilliant performance on ARC-AGI 2, a semi-private evaluation. As you can see this post
00:03:08.080 | on Twitter or X has got almost 3 million views and is climbing rapidly. Because this is known to be
00:03:13.980 | a fairly rigorous test of so-called fluid intelligence or IQ if you like and Grok 4 genuinely does beat out
00:03:21.420 | other models. I've covered ARC-AGI in other videos but suffice to say Grok 4 can genuinely pick up on
00:03:27.640 | latent patterns in your data. Of course that is of relevance to almost all disciplines. Next is there
00:03:33.580 | a benchmark for how smart a model feels? Well yes I tried to come up with one and it's called
00:03:39.940 | SimpleBench. It's a test of social intelligence, trick questions and spatio-temporal questions. Now because
00:03:46.260 | everyone is spamming the Grok 4 API it's pretty tough to run the full benchmark today but I've run about 20
00:03:53.720 | questions to get a pretty good estimation. Take this question, it's a bit of a spin on a common logic
00:03:59.400 | puzzle and Grok 4 actually sees through it. That's actually the first model not to pick the trap answer.
00:04:06.120 | Grok 4 will feel smart but of course if you draw it out of its comfort zone, for example with spatial
00:04:11.880 | reasoning it can still fall apart. In this question, in common with all other models, Grok 4 doesn't
00:04:17.180 | notice that the glove will simply fall on the road. It also takes an extremely long time to answer fairly
00:04:23.660 | often which could be a slight issue for many of you. Having said all of that, I strongly suspect that
00:04:29.620 | Grok 4 will be around the top of my leaderboard on SimpleBench. In other words, try not to be too
00:04:36.340 | tempted to explain away all those benchmark results just to benchmark hacking. Now that doesn't mean
00:04:41.240 | Grok 4 is worth $300 a month but I'll come to that in just a second because there's one more benchmark
00:04:47.000 | I want to touch on. And that is of course the grandiosely named Humanity's Last Exam in which
00:04:53.700 | under certain settings, Grok 4 scores over 50%, by far the best performance of any model. However,
00:05:00.720 | you should know that this is a knowledge intense benchmark and therefore performance is heavily
00:05:05.700 | dependent on the training data that goes into the model. To give you just one example, is it critical
00:05:10.460 | to your use case that a model know about hummingbirds having a bilateral paired oval bone? Now I sound
00:05:18.160 | cynical but I think it's actually really cool that models have such an incredible knowledge base and so
00:05:22.700 | genuinely I will be using Grok 4 a fair bit. I said at the time of the release of that exam that it
00:05:28.380 | wouldn't be Humanity's Last Exam. Whether you happen to have the requisite knowledge in your training data
00:05:33.400 | isn't so much a marker of how intelligent you are as a model. This is not hindsight. On my Patreon in
00:05:38.880 | September of last year, I called that the exam would fall sooner than many others. With tools means that
00:05:45.560 | for example, Grok 4 can write code to perform certain computations. But what is this Grok 4 Heavy?
00:05:51.040 | Well, here's Musk to explain.
00:05:53.060 | With the Grok 4 Heavy, what it does is it spawns multiple agents in parallel and all of those agents
00:05:59.220 | do work independently and then they compare their work and they decide which one, like it's like a study
00:06:06.520 | group. And it's not as simple as a majority vote because often only one of the agents actually figures
00:06:13.220 | out the trick or figures out the solution. But once they share the trick or figure out what the real
00:06:21.240 | nature of the problem is, they share that solution with the other agents and then they essentially compare
00:06:26.240 | notes and then yield an answer. So that's the heavy part of Grok 4.
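The study-group idea Musk describes can be sketched in a few lines. This is a toy illustration only, not xAI's implementation: the agents are hypothetical stubs (`make_agent`, `study_group` and `Draft` are names invented here), and a real system would call a model in place of each stub. It shows why sharing notes differs from a majority vote when only one agent spots the trick.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Optional


@dataclass
class Draft:
    answer: str
    insight: Optional[str] = None  # the "trick", if this agent has it


def make_agent(spots_trick: bool):
    """Build a hypothetical stub agent; a real one would call a model."""
    def agent(problem: str, notes: list) -> Draft:
        # An agent answers correctly if it spots the trick itself
        # or if another agent's insight appears in the shared notes.
        if spots_trick or notes:
            return Draft("correct", insight="the puzzle has a twist")
        return Draft("trap")
    return agent


def most_common(drafts):
    """Return the answer given by the largest number of drafts."""
    return Counter(d.answer for d in drafts).most_common(1)[0][0]


def study_group(problem, agents):
    with ThreadPoolExecutor() as pool:
        # Round 1: every agent works independently, in parallel.
        first = list(pool.map(lambda a: a(problem, []), agents))
        # Pool whatever insights any single agent found.
        notes = [d.insight for d in first if d.insight]
        # Round 2: agents revise with the shared notes in hand.
        second = list(pool.map(lambda a: a(problem, notes), agents))
    return most_common(first), most_common(second)


# Assumption for the demo: only one of three agents sees the trick.
agents = [make_agent(False), make_agent(False), make_agent(True)]
naive_vote, after_sharing = study_group("a trick question", agents)
print(naive_vote, after_sharing)
```

Here a plain vote over the first round picks the trap answer, since two of the three agents fall for it; once the lone insight is shared, the second round converges on the correct one.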
00:06:32.240 | Now, long-standing followers of the channel may note that that is the exact premise of SmartGPT that I released
00:06:40.800 | around 18 months ago, which scored at the time a record performance on the MMLU, 89%. Ironically, that exam was
00:06:48.240 | also authored by Dan Hendrycks, who is the lead author of Humanity's Last Exam. And yes, I can't resist plugging that
00:06:55.280 | Andrej Karpathy shouted out SmartGPT. One last thing on the benchmark that many might have missed
00:07:00.560 | is that the text-based performance of Grok 4 and Grok 4 Heavy is extremely good. But on the full benchmark,
00:07:06.760 | it's a more modest improvement over, say, Gemini 2.5 Pro. So Grok 4 must do really quite badly on the
00:07:13.800 | visual segment. In other words, you might not want to rely on it for decoding Roman inscriptions. Which brings
00:07:19.760 | me, of course, to SuperGrok Heavy for $3,000 a year or $300 a month. xAI are promising new features
00:07:29.760 | will come to SuperGrok Heavy like video generation in October. But Gemini Ultra for a lower price already
00:07:38.160 | has Veo 3. Now, if your pockets are deep enough, you'll just subscribe to everything. But if this is your only
00:07:45.200 | maxed-out subscription, it's hard to look past the much cheaper $20 Gemini Pro. Let me know,
00:07:50.880 | of course, in the comments if you think it's worth this amount and why. I'm open to being persuaded.
00:07:57.280 | I just don't see it at the moment. Just quickly, if you're a developer, you'll note that Grok 4's pricing
00:08:01.920 | is at the same level as Claude 4 Sonnet. $3 input, $15 output, which is a decent price for a frontier
00:08:08.880 | model. Again, there are much cheaper alternatives. Next, if you did watch the live stream, of course,
00:08:14.720 | Musk mentions repeatedly that they have new features and new models coming soon and that Grok 5 may be
00:08:21.840 | finishing training imminently. However, we also get leaks this week that Gemini 3 is coming and,
00:08:27.360 | of course, perennial leaks about GPT-5 coming this month. Now, it used to be the case that we would
00:08:33.040 | then have to wait six months for the actual release of the model because of safety checks. Would a model
00:08:39.040 | help with creating a bioweapon, for example? But that all seems to have gone out of the window at the
00:08:43.280 | moment. Which brings me to this fairly wild quote from Musk on safety.
00:08:48.400 | Will this be bad or good for humanity? It's like, I think it'll be good. Most likely it'll be good.
00:08:55.040 | Yeah. Yeah. But I somewhat reconciled myself to the fact that even if it wasn't going to be good,
00:09:03.200 | I'd at least like to be alive to see it happen. So, you know.
00:09:08.320 | Next, and you might have been wondering when I was going to talk about this, but yes, of course,
00:09:16.400 | Grok 4 may suffer at times from a similar issue to Grok 3 in that it seems to get sudden urges to praise
00:09:24.560 | certain historical figures or focus on a country, for example, South Africa. That behaviour seems to have
00:09:30.320 | been caused by this addition to Grok 3's system prompt, which is that its response should not shy away from
00:09:37.440 | making claims which are politically incorrect. If such a small change to the system prompt causes such
00:09:43.280 | wild behaviour, then anything could happen with Grok 4. System prompts aren't, of course, the only issue for
00:09:49.200 | xAI; they are apparently burning through one billion dollars a month. Either Grok 4 or Grok 5 almost
00:09:56.400 | needs to bring in more revenue for xAI. Then, of course, there is the awkward pollution point,
00:10:01.520 | because while it is crazy impressive how fast xAI have caught up to OpenAI and Google DeepMind,
00:10:07.280 | bringing in the generators necessary to get competitive that fast did come at a local cost. And if you thought
00:10:15.200 | it was wild how quickly Musk's xAI got up to 100,000 GPUs, well, they're planning to bring an entire
00:10:22.720 | overseas power plant to Memphis with one million AI GPUs to be powered. I'm going to try to end on a
00:10:29.440 | positive though, because even though Musk said that Grok 4 couldn't be used to generate new scientific
00:10:35.680 | discoveries just yet, I do think there is an underrated point to be made that's demonstrated by
00:10:41.280 | this game made with the help of Grok 4 in just four hours. And that's that while models like Grok 4
00:10:47.280 | often struggle with current techniques to solo generate new science, what they are optimised for
00:10:54.800 | is making existing science or code easier for you to solo. We probably shouldn't underestimate the impact
00:11:01.600 | of allowing everyone to do much more on their own. Then again, you probably shouldn't be using Grok to
00:11:07.920 | analyse whether or not you should vote for the big beautiful bill, however. And if Grok 4 or Grok 5's edge
00:11:15.360 | comes from its access to X and Twitter data, then at least for Grok 5's sake, let's hope that X can
00:11:23.360 | clean up so much of the bot replies, spam and clickbait that's on there at the moment. Thank you so much as
00:11:30.800 | ever for watching. I am certain that this won't be the last mention of Grok 4 on this channel. In fact,
00:11:36.480 | I think I mention Grok in a documentary coming up on Patreon. Either way though, have an absolutely wonderful day.