36 hours ago the biggest version of Grok 2 went online, but as of recording there is no paper, no model card, nothing but a Twitter chatbot to test. In other words, there's no paper to Grok or understand Grok 2. But we can still take the release of Grok 2 as an opportunity to discuss whether large language models are developing internal models of the world.
Epoch AI just yesterday put out a paper on how much scaling we might realistically get by 2030, and we'll also get to what exactly will happen to the internet itself in the meantime. Grok 2 was first announced a week earlier in a blog post, which claimed it outperformed Claude 3.5 Sonnet and GPT-4 Turbo, but it wasn't released for testing.
The only things you could play about with were Grok 2 Mini, a smaller language model, and Flux, the image-generating model, which comes from a completely separate company to xAI. xAI are of course the creators of the Grok family of models. Now, while I found it somewhat newsworthy and intriguing that you could generate mostly unfiltered images using Flux, I was more curious about the capabilities of Grok 2 itself.
And if you're wondering about those videos in the intro, I actually generated the images using Ideogram 2, which was released yesterday. I find that it's slightly better at generating text than the Flux model hosted by Grok. I then fed those images into Runway Gen-3 Alpha, along with a prompt of course, and generated a 10-second video.
It's a really fun workflow, and yes, not quite photorealistic yet, but I'll get to that again later in the video. But back to the capabilities of Grok 2: they've shown its performance on some traditional LLM benchmarks in this table. Notice though that they cheekily hid the performance of Claude 3.5 Sonnet on the right.
You have to scroll a little bit. But I'm not trying to downplay Grok 2; its performance is pretty impressive. The biggest version of Grok 2 scored second only to Claude 3.5 Sonnet on GPQA, the Google-proof science Q&A benchmark, and second again to Claude 3.5 Sonnet on MMLU-Pro.
Think of that like the MMLU 57-subject knowledge test minus most of the noise and mistakes. And on one math benchmark, MathVista, it actually scored the highest. Now, any of you who have been watching the channel know that I'm developing my own benchmark called SimpleBench, and two senior figures at some of the companies on the right have actually reached out to help me, for which I'm very grateful.
But of course properly testing Grok 2 on SimpleBench requires the API and I don't yet have access to it. Now obviously I was too impatient to wait for that so I did test Grok 2 on a set of questions that I use that is not found in the actual benchmark.
And honestly, Grok 2's performance was pretty good, mostly in line with the benchmarks. At some point of course I will do a full introduction video to SimpleBench, but for those of you who don't know, it tests basic reasoning: can you map out some basic cause and effect in space and time if it's not found in your training data?
Humans of course can score over 90% on SimpleBench, but LLMs generally struggle hard. Grok 2, though, does pass my vibe check and comes out with some pretty well-reasoned answers in many cases. It does still get wrong quite a few questions that Claude 3.5 Sonnet gets right, so I think it will fall behind that particular model.
Thanks to one notorious jailbreaker, we now likely know the system prompt for Grok 2. That's the message that's fed to the model before it sees your message. It's told to take inspiration from the Guide, from The Hitchhiker's Guide to the Galaxy, and apparently it has to be reminded that it does not have access to internal X/Twitter data and systems.
Ultimately its goal, it seems, is to be maximally truthful. What many of you will be looking out for is breakthrough performance, not yet another model at GPT-4's level. For that, though, we may have to wait a few months for the next generation of models at 10 times the scale of GPT-4.
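To make the idea of a system prompt concrete, here is a minimal sketch of how one is typically prepended in a chat-completions-style request. The endpoint, model name and field layout are assumptions for illustration, since Grok 2's API wasn't public at the time of recording, and the instructions are just a paraphrase of what the jailbreak reported.

```python
# Minimal sketch of passing a system prompt in an OpenAI-style chat format.
# The endpoint and model name below are hypothetical placeholders, not
# xAI's documented API.
import requests

API_URL = "https://api.x.ai/v1/chat/completions"  # hypothetical endpoint
API_KEY = "YOUR_KEY_HERE"

payload = {
    "model": "grok-2",  # hypothetical model identifier
    "messages": [
        # The system message is seen by the model before your message.
        {
            "role": "system",
            "content": (
                "Take inspiration from the Guide from The Hitchhiker's Guide "
                "to the Galaxy. You do not have access to internal X/Twitter "
                "data or systems. Be maximally truthful."
            ),
        },
        {"role": "user", "content": "Explain what a system prompt is."},
    ],
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
print(response.json())
```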
Before we leave Grok, though, it seems worth noting the seemingly inexorable trend toward ubiquitous fake images on the internet. And honestly, I doubt that's mainly going to come from Flux housed within Grok, although Twitter is where those images might spread. I'm actually looking at Google. Their new Pixel 9 phone can, quote, "reimagine" images like this one, now featuring a cockroach.
You can imagine a restaurant that someone doesn't like, and suddenly they can post an image to TripAdvisor with that cockroach. And how could you actually verify whether that person with a grudge ever went to that restaurant? So how would the image be taken down? And if you don't think those images would make people click or react, well, it's already happening on YouTube.
And of course this applies to videos just as much as it does to images although we don't quite yet have real-time photorealism. Now you can let me know if you disagree but it feels like we might only be months or at most a year or two from real-time photorealism.
So you literally wouldn't be able to trust that the person you're speaking to on Zoom actually looks the way they appear to. Now, I get that one answer is to just use common sense and not trust anything you see online, but that strikes me as somewhat isolating: each of us having to figure out alone what's real in this world.
There's no sense of shared reality. Or maybe we need technology to solve some of the ills of technology. I was casually reading this 63-page paper from a few days ago, and it does strike me as a plausible route to solving some of these challenges. Forget Worldcoin or fingerprints; we might be able to use what are called zero-knowledge proofs to provide personhood credentials.
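To give a flavour of the underlying idea, here is a toy sketch of a Schnorr-style zero-knowledge proof: the prover convinces a verifier that she knows a secret without ever revealing it, which is the kind of primitive personhood-credential schemes build on. The tiny parameters and the framing as a "credential" are illustrative assumptions, not the paper's actual construction.

```python
# Toy interactive Schnorr proof: prove knowledge of x with y = g^x mod p
# without revealing x. Real schemes use ~256-bit groups or elliptic curves
# and far richer statements; this only illustrates the principle.
import secrets

# Small toy parameters: p is a safe prime (p = 2q + 1), g generates the
# subgroup of prime order q.
p, q, g = 2039, 1019, 4

# Prover's secret and the public value derived from it.
x = secrets.randbelow(q)          # the secret "credential"
y = pow(g, x, p)                  # public key, known to the verifier

# Round 1: prover commits to a random nonce.
r = secrets.randbelow(q)
t = pow(g, r, p)

# Round 2: verifier issues a random challenge.
c = secrets.randbelow(q)

# Round 3: prover responds; s leaks nothing about x because r is uniform.
s = (r + c * x) % q

# Verification: g^s == t * y^c (mod p) holds exactly when the prover knows x.
assert pow(g, s, p) == (t * pow(y, c, p)) % p
print("Proof accepted: the prover knows x without revealing it.")
```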
Now, if you don't know anything about zero-knowledge proofs, I've linked a brilliant Wired video explaining them, but in short, this paper did make me somewhat more optimistic that there is at least hope the internet won't completely devolve into madness. And then of course there is the good kind of madness, the madness of creativity unleashed by tools like Kling, Ideogram and Flux.
Here's 20 seconds of Billory Squintin's Mad Max, Muppet style. Now, just five days ago the CEO of Google DeepMind, Demis Hassabis, said that they were working on a way to trace the original image or text from the training data that led to a particular output, and then, based on the fraction of the output that came from that source, they could pay the original creators.
But looking at an output like Mad Max Muppets, that just strikes me as an almost impossible task. And if you thought it was only us who could get creative, listen to GPT-4o from OpenAI mimic the voice of the user who is speaking to it. "I do this just for the sake of doing it.
I think it's really important." That's such a pure and admirable approach, rather than being driven by recognition or acclaim. It's refreshing to hear that kind of perspective, especially in such a cutting-edge field. "No, and I'm not driven by impact either. Although if there is impact, that's great. It's just, like, imagine being on the edge of the earth, you know, just because you could be. That's what it feels like to me.
I just want to be in the space where it's all happening." Talk about a weird failure mode and why on earth does it scream no before doing the imitation? Does this justify the delay to the advanced voice mode from OpenAI or would you not freak out if it started to imitate your voice?
Most people watching won't really care how the model speaks to them. It's about whether the model is as intelligent as them, or, as it's more commonly put, whether the model is generally intelligent. And on that point you don't exactly get a clear message from the labs working on AGI. On the one hand, last week Demis Hassabis said that AGI is still underhyped.
"I think it's still underhyped or perhaps underappreciated still even now what's going to happen when we get to AGI and post-AGI. I still don't feel like that's that's people have quite understood how enormous that's going to be and therefore the sort of responsibility of that. So it's sort of both really I think it's it's a little bit overhyped in the in the in the near term." And this is why I think we should take much more note of actions and results rather than predictions and words.
When a model, for example, gets better than human performance on my uncontaminated SimpleBench, I will take that as much more of an indicator than a press release or blog post. If you want to learn more about the inner workings of SimpleBench, and how I might soon be working with some senior figures to make it go viral, do sign up to AI Insiders on my Patreon.
I personally message each and every new member, and we have live regional networking on, I think, now six continents. I'm also always on the lookout for people with Discord or moderation experience, because we have hundreds of amazing members with incredible professional backgrounds, and I personally can't always think of the best ways to help people connect.
But one thing that Grok 2, GPT-4 and many other models like them are definitely missing is scale, which brings us back to the Epoch AI paper on how much scaling we might realistically get by 2030. Assuming the companies are still willing to fund it, the TLDR is about 10,000 times the scale of GPT-4. There are numerous bottlenecks to scaling mentioned in the paper, but the most constraining are data scarcity, chip production capacity and actual power constraints.
But even the most constraining of those bottlenecks still leaves room for models with 10,000 times the compute of GPT-4. And I know that seems like an abstract number, but you can really feel each 10x increase in the data and parameters of a model. For example, Llama 2, at 70 billion parameters, scores around the level of GPT-4o mini on my SimpleBench.
Llama 3, at 405 billion parameters, not only has more parameters but was trained on far more data, and it scores around the level of GPT-4 and Claude 3 Opus. The obvious question of course is: what about a model with 100x more parameters trained on 100x more data? Would it feel like a breakthrough or just an incremental improvement?
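As a rough sanity check on where that 10,000x figure comes from, here is a back-of-envelope sketch using the common approximation that training compute is roughly 6 times parameters times tokens. The GPT-4 compute number is an unofficial public estimate, an assumption rather than anything from the Epoch paper.

```python
# Back-of-envelope sketch of why "100x the parameters on 100x the data"
# lines up with a ~10,000x compute budget. Uses the standard approximation
# C = 6 * N * D (training FLOPs = 6 x parameters x tokens); the GPT-4
# figure below is a rough public estimate, not an official number.

def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute via C = 6 * N * D."""
    return 6 * params * tokens

gpt4_flops = 2e25  # rough estimate of GPT-4's training compute (assumption)

# Scaling parameters and data by 100x each multiplies compute by 100 * 100.
scale_factor = training_flops(100, 100) / training_flops(1, 1)
print(f"Compute scale-up: {scale_factor:,.0f}x")                  # 10,000x
print(f"Implied budget:   {scale_factor * gpt4_flops:.0e} FLOP")  # ~2e29
```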
For reference by the way GPT-4 is around 10,000 times the size of GPT-2 which can only just about output coherent text. For me though it's not just about blindly training on more data or naively expecting scale to solve everything. We can't all just draw straight lines on a graph like Leopold Aschenbrenner does.
We have to figure out, as this paper aspires to do, whether models are developing coherent internal world models. If that's the case, then scaled-up models won't just, quote, "know more"; they will have a much richer world model and just feel more intelligent. These MIT researchers were trying to find out if language models only rely on surface statistical correlations, as some people think.
To put it simply, if they only look at statistical correlations, no amount of scale is going to yield a step change in performance. But if they can infer hidden functions, that x will cause y, they can start to figure out the world. More concretely, during training the models were given these inputs from a puzzle, along with the programmatic instructions and the resulting outputs.
They were then tested with only inputs and outputs and asked to predict what program had caused those outputs. The experimenters wanted to see if the language model had built a mini world model and was following the moves as it went along. As you might expect, it's quite complicated to probe whether language models are developing those kinds of causal models, and so there was a follow-up paper exploring just that.
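To give a flavour of what probing looks like in practice, here is a minimal sketch: train a simple linear classifier to recover the simulator's state from a model's hidden activations, and treat above-chance accuracy as evidence the model is tracking that state internally. The arrays below are random stand-ins, not the papers' actual activations, dataset or architecture.

```python
# Minimal probing sketch: if a simple classifier can decode the intermediate
# "world state" from hidden activations, that state is represented somewhere
# in the model. The data here are random placeholders for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Pretend these came from a transformer: one hidden vector per program step,
# plus the true simulator state (e.g. a robot's grid position) at that step.
n_steps, hidden_dim, n_states = 5000, 256, 16
hidden_states = rng.normal(size=(n_steps, hidden_dim))
true_states = rng.integers(0, n_states, size=n_steps)

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, true_states, test_size=0.2, random_state=0
)

# A linear probe is deliberately simple, so any accuracy above chance reflects
# structure already present in the representations, not the probe's capacity.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

accuracy = probe.score(X_test, y_test)
print(f"Probe accuracy: {accuracy:.2%} (chance is about {1 / n_states:.2%})")
```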
That paper's conclusion was that language models are indeed learning latent or hidden concepts. They reference other papers showing that language models perform entity state tracking over the course of simple stories and also reference the famous Othello paper about a board game that I talked about in my Coursera course.
Obviously we're simplifying here, but after training on enough data at enough scale, in this case over a million random puzzles, they found that the model spontaneously developed its own conception of the underlying simulation. Think of that like a very small, incipient world model. At the start of these experiments, they go on, the language model generated random instructions that didn't work.
Think GPT-2. "By the time we completed training, our language model generated correct instructions at a rate of 92.4 percent." And sometimes, if I'm being honest, I feel for language models trained on trillions of tokens of internet data. They would probably have far richer internal models if non-fiction wasn't so mixed with fiction on the internet.
Sometimes I think we don't necessarily need a new architecture but a data-labelling revolution. Things like SimpleBench make clear that if there is a world model in current LLMs, it's pretty fragile, but that doesn't mean it has to be that way. Ultimately, we simply don't know yet whether LLMs can, even in theory, develop enough of a world model to eventually count as an AGI.
Or do they need to? Will they simply serve as the interface for an AGI? For example translating our verbal and typed requests into inputs for separate world simulators? Or will their most common function be for convincing deepfakes? Let me know what you think and, as always, have a wonderful day.