Grok 4 - 10 New Things to Know

00:00:00.000 | Grok 4 is out and it's a pretty good AI model but there is going to be more noise

00:00:07.020 | about this language model than possibly any other so hopefully I can give you a little signal amid

00:00:13.760 | the chaos. Let's boil things down to just 10 things to know about the newest and possibly

00:00:19.660 | smartest AI model. Point one is that Grok 4 might just be the smartest model around at least

00:00:26.820 | according to the benchmarks. In certain settings on high school math competitions it beats out

00:00:31.860 | OpenAI's best model and Google's best model. The same is true for a fairly famous science benchmark

00:00:38.120 | the Google-Proof Q&A where it again beats out Anthropic's best model and Google's. Likewise on

00:00:43.920 | at least one coding benchmark but Elon Musk went much further saying about Grok 4 that quote it's

00:00:49.700 | smarter than almost all graduate students in all disciplines simultaneously. That quote is of course

00:00:56.080 | going to be picked up by everyone but it needs three important caveats. First from me is that

00:01:01.760 | Grok 4 is still a language model which means it's still going to be prone to all those hallucinations

00:01:07.320 | you're familiar with. It's not a new paradigm of AI. Second we have heard that kind of hype before

00:01:12.740 | notably from the Google DeepMind CEO Demis Hassabis almost 18 months ago saying that Gemini too was better

00:01:20.760 | than almost all human experts. Amazing about Gemini is that it's so good at so many things as we started

00:01:25.260 | getting to the end of the training for example each of the 50 different subject areas that we tested on

00:01:28.780 | it's as good as the best expert humans in those areas. That was an exaggeration then and Musk is

00:01:33.680 | exaggerating now because real world performance doesn't always match up to benchmark performance.

00:01:39.440 | is way more than answering multiple choice questions hence the third bit of context coming from Musk himself

00:01:45.900 | the CEO of xAI saying that that quote about being smarter than graduates was at least with respect to

00:01:51.840 | academic questions. Grok 4 is a post-grad level in everything like it's it just some of these things

00:01:57.600 | are just worth repeating like Grok 4 is post-graduate like PhD level in everything better than PhD but like

00:02:05.240 | most PhDs would fail so it's better to say I mean at least with respect to academic questions. Point number

00:02:11.060 | two is that I've been highly impressed by Grok 4 but these benchmark results are misleading for another

00:02:17.100 | reason. Note first of all that the y-axis doesn't begin at zero so these differences between the models

00:02:22.720 | are somewhat exaggerated in terms of scale. xAI, makers of Grok 4, selectively choose which models

00:02:29.260 | to compare to. Notice in one recent high school maths competition Grok 4 Heavy and I'll get to that later

00:02:35.760 | way outperforms Gemini Deep Think. That's the soon to be released Gemini 2.5 Pro Heavy if you like. But in

00:02:42.520 | this coding benchmark LiveCodeBench Gemini Deep Think actually outperforms Grok 4 Heavy and yet is not in

00:02:49.400 | the chart. As always then when these model providers show benchmarks you've got to take them with a grain of

00:02:54.740 | salt. Especially when the answers to the benchmarks are available online. But none of that quite explains

00:03:01.240 | Grok 4's brilliant performance on ARC-AGI-2, a semi-private evaluation. As you can see this post

00:03:08.080 | on Twitter or X has got almost 3 million views and is climbing rapidly. Because this is known to be

00:03:13.980 | a fairly rigorous test of so-called fluid intelligence or IQ if you like and Grok 4 genuinely does beat out

00:03:21.420 | other models. I've covered ARC-AGI in other videos but suffice to say Grok 4 can genuinely pick up on

00:03:27.640 | latent patterns in your data. Of course that is of relevance to almost all disciplines. Next is there

00:03:33.580 | a benchmark for how smart a model feels? Well yes I tried to come up with one and it's called

00:03:39.940 | SimpleBench. It's a test of social intelligence, trick questions and spatio-temporal questions. Now because

00:03:46.260 | everyone is spamming the Grok 4 API it's pretty tough to run the full benchmark today but I've run about 20

00:03:53.720 | questions to get a pretty good estimation. Take this question, it's a bit of a spin on a common logic

00:03:59.400 | puzzle and Grok 4 actually sees through it. That's actually the first model not to pick the trap answer.

00:04:06.120 | Grok 4 will feel smart but of course if you draw it out of its comfort zone, for example with spatial

00:04:11.880 | reasoning it can still fall apart. In this question, in common with all other models, Grok 4 doesn't

00:04:17.180 | notice that the glove will simply fall on the road. It also takes an extremely long time to answer fairly

00:04:23.660 | often which could be a slight issue for many of you. Having said all of that, I strongly suspect that

00:04:29.620 | Grok 4 will be around the top of my leaderboard on SimpleBench. In other words, try not to be too

00:04:36.340 | tempted to explain away all those benchmark results as just benchmark hacking. Now that doesn't mean

00:04:41.240 | Grok 4 is worth $300 a month but I'll come to that in just a second because there's one more benchmark

00:04:47.000 | I want to touch on. And that is of course the grandiosely named Humanity's Last Exam in which

00:04:53.700 | under certain settings, Grok 4 scores over 50%, by far the best performance of any model. However,

00:05:00.720 | you should know that this is a knowledge-intensive benchmark and therefore performance is heavily

00:05:05.700 | dependent on the training data that goes into the model. To give you just one example, is it critical

00:05:10.460 | to your use case that a model know about hummingbirds having a bilaterally paired oval bone? Now I sound

00:05:18.160 | cynical but I think it's actually really cool that models have such an incredible knowledge base and so

00:05:22.700 | genuinely I will be using Grok 4 a fair bit. I said at the time of the release of that exam that it

00:05:28.380 | wouldn't be Humanity's Last Exam. Whether you happen to have the requisite knowledge in your training data

00:05:33.400 | isn't so much a marker of how intelligent you are as a model. This is not hindsight. On my Patreon in

00:05:38.880 | September of last year, I called that the exam would fall sooner than many others. "With tools" means that

00:05:45.560 | for example, Grok 4 can write code to perform certain computations. But what is this Grok 4 Heavy?

00:05:51.040 | Well, here's Musk to explain.

00:05:53.060 | With the Grok 4 Heavy, what it does is it spawns multiple agents in parallel and all of those agents

00:05:59.220 | do work independently and then they compare their work and they decide which one, like it's like a study

00:06:06.520 | group. And it's not as simple as a majority vote because often only one of the agents actually figures

00:06:13.220 | out the trick or figures out the solution. But once they share the trick or figure out what the real

00:06:21.240 | nature of the problem is, they share that solution with the other agents and then they essentially compare

00:06:26.240 | notes and then yield an answer. So that's the heavy part of Grok 4.
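
The "study group" idea Musk describes can be illustrated with a toy sketch. This is not xAI's implementation, which is not public; the agents here are plain Python functions standing in for model calls, and every name (`hasty_agent`, `careful_agent`, `compare_notes`, `run_study_group`) is hypothetical. It only shows why sharing and checking each other's work can beat a majority vote when a single agent finds the trick.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

TARGET = 1369  # toy problem: find a positive integer x with x * x == TARGET


def hasty_agent(_target):
    # Stand-in for an agent that settles on a plausible but wrong answer.
    return 36


def careful_agent(target):
    # Stand-in for the one agent that actually works the problem out.
    x = 0
    while x * x < target:
        x += 1
    return x


def check(target, proposal):
    # Verifying a shared proposal is much cheaper than finding it.
    return proposal * proposal == target


def majority_vote(proposals):
    # Baseline: pick whichever answer was proposed most often.
    return Counter(proposals).most_common(1)[0][0]


def compare_notes(target, proposals):
    # Every proposal is shared with the group; one that verifies wins,
    # even if only a single agent produced it.
    for proposal in proposals:
        if check(target, proposal):
            return proposal
    return majority_vote(proposals)  # fall back when nothing verifies


def run_study_group(target, agents):
    # Agents work independently and in parallel, then compare notes.
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        proposals = list(pool.map(lambda agent: agent(target), agents))
    return proposals, compare_notes(target, proposals)


proposals, answer = run_study_group(
    TARGET, [hasty_agent, hasty_agent, careful_agent]
)
print(proposals, majority_vote(proposals), answer)
```

Here two of the three agents propose 36, so a majority vote would return 36, while comparing notes returns the verified answer 37. A real system would presumably need a much fuzzier notion of "checking" than this exact arithmetic test.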

Now, long-standing followers of the channel may note that that is the exact premise of SmartGPT that I released

around 18 months ago, which scored at the time a record performance on the MMLU, 89%. Ironically, that exam was

also authored by Dan Hendrycks, who is the lead author of Humanity's Last Exam. And yes, I can't resist plugging that

Andrej Karpathy shouted out SmartGPT. One last thing on the benchmark that many might have missed

00:07:00.560 | is that the text-based performance of Grok 4 and Grok 4 Heavy is extremely good. But on the full benchmark,

00:07:06.760 | it's a more modest improvement over, say, Gemini 2.5 Pro. So Grok 4 must do really quite badly on the

00:07:13.800 | visual segment. In other words, you might not want to rely on it for decoding Roman inscriptions. Which brings

me, of course, to SuperGrok Heavy for $3,000 a year or $300 a month. xAI are promising new features

will come to SuperGrok Heavy like video generation in October. But Gemini Ultra for a lower price already

has Veo 3. Now, if your pockets are deep enough, you'll just subscribe to everything. But if this is your only

00:07:45.200 | maxed-out subscription, it's hard to look past the much cheaper $20 Gemini Pro. Let me know,

00:07:50.880 | of course, in the comments if you think it's worth this amount and why. I'm open to being persuaded.

00:07:57.280 | I just don't see it at the moment. Just quickly, if you're a developer, you'll note that Grok 4's pricing

is at the same level as Claude 4 Sonnet. $3 input, $15 output, which is a decent price for a frontier

00:08:08.880 | model. Again, there are much cheaper alternatives. Next, if you did watch the live stream, of course,

00:08:14.720 | Musk mentions repeatedly that they have new features and new models coming soon and that Grok 5 may be

00:08:21.840 | finishing training imminently. However, we also get leaks this week that Gemini 3 is coming and,

00:08:27.360 | of course, perennial leaks about GPT-5 coming this month. Now, it used to be the case that we would

00:08:33.040 | then have to wait six months for the actual release of the model because of safety checks. Would a model

00:08:39.040 | help with creating a bioweapon, for example? But that all seems to have gone out of the window at the

00:08:43.280 | moment. Which brings me to this fairly wild quote from Musk on safety.

Will this be bad or good for humanity? It's like, I think it'll be good. Most likely it'll be good.

00:08:55.040 | Yeah. Yeah. But I somewhat reconciled myself to the fact that even if it wasn't going to be good,

00:09:03.200 | I'd at least like to be alive to see it happen. So, you know.

00:09:08.320 | Next, and you might have been wondering when I was going to talk about this, but yes, of course,

00:09:16.400 | Grok 4 may suffer at times from a similar issue to Grok 3 in that it seems to get sudden urges to praise

00:09:24.560 | certain historical figures or focus on a country, for example, South Africa. That behaviour seems to have

00:09:30.320 | been caused by this addition to Grok 3's system prompt, which is that its response should not shy away from

00:09:37.440 | making claims which are politically incorrect. If such a small change to the system prompt causes such

00:09:43.280 | wild behaviour, then anything could happen with Grok 4. System prompts aren't, of course, the only issue for

xAI; they are apparently burning through one billion dollars a month. Either Grok 4 or Grok 5 almost

needs to bring in more revenue for xAI. Then, of course, there is the awkward pollution point,

because while it is crazy impressive how fast xAI have caught up to OpenAI and Google DeepMind,

bringing in the generators necessary to get competitive that fast did come at a local cost. And if you thought

it was wild how quickly Musk's xAI got up to 100,000 GPUs, well, they're planning to bring an entire

00:10:22.720 | overseas power plant to Memphis with one million AI GPUs to be powered. I'm going to try to end on a

00:10:29.440 | positive though, because even though Musk said that Grok 4 couldn't be used to generate new scientific

00:10:35.680 | discoveries just yet, I do think there is an underrated point to be made that's demonstrated by

00:10:41.280 | this game made with the help of Grok 4 in just four hours. And that's that while models like Grok 4

00:10:47.280 | often struggle with current techniques to solo generate new science, what they are optimised for

00:10:54.800 | is making existing science or code easier for you to solo. We probably shouldn't underestimate the impact

00:11:01.600 | of allowing everyone to do much more on their own. Then again, you probably shouldn't be using Grok to

analyse whether or not you should vote for the Big Beautiful Bill. And if Grok 4 or Grok 5's edge

00:11:15.360 | comes from its access to X and Twitter data, then at least for Grok 5's sake, let's hope that X can

00:11:23.360 | clean up so much of the bot replies, spam and clickbait that's on there at the moment. Thank you so much as

00:11:30.800 | ever for watching. I am certain that this won't be the last mention of Grok 4 on this channel. In fact,

00:11:36.480 | I think I mention Grok in a documentary coming up on Patreon. Either way though, have an absolutely wonderful day.
