Grok-2 Actually Out, But What If It Were 10,000x the Size?
Chapters
0:00 Intro
0:40 Grok-2, Flux, Ideogram Workflow (SimpleBench)
4:36 Gemini ‘Reimagine’ and the Fake Internet
5:32 Personhood Credentials
6:09 Madhouse Creativity
8:00 Overhyped or Underhyped
9:27 Epoch research
10:30 Emergent World Mini-Models?
00:00:00.000 |
36 hours ago the biggest version of Grok 2 went online, but as of recording there is no paper, 00:00:08.480 |
no model card, nothing but a Twitter chatbot to test. In other words, there's no paper to Grok 00:00:16.400 |
or understand Grok 2. But we can still take the release of Grok 2 as an opportunity to discuss 00:00:22.800 |
whether large language models are developing internal models of the world. Epoch AI just 00:00:28.800 |
yesterday put out this paper on how much scaling we might realistically get by 2030 and what 00:00:35.200 |
exactly will happen to the internet itself in the meantime. Grok 2 was first announced a 00:00:42.400 |
week earlier in a blog post, which said it outperformed Claude 3.5 Sonnet and GPT-4 Turbo, 00:00:49.040 |
but wasn't released for testing. The only things you could play about with were Grok 2 Mini, 00:00:55.200 |
a smaller language model, and Flux, the image generating model which is completely separate 00:01:00.880 |
from xAI. xAI are of course the creators of the Grok family of models. Now while I found it 00:01:07.360 |
somewhat newsworthy and intriguing that you could generate mostly unfiltered images using Flux, 00:01:12.960 |
I was more curious about the capabilities of Grok 2 itself. And if you're wondering about those 00:01:17.920 |
videos in the intro, I actually generated the images using Ideogram 2 which was released yesterday. 00:01:25.360 |
I find that it's slightly better at generating text than the Flux model hosted by Grok. I then 00:01:31.280 |
fed those images into Runway Gen 3 Alpha along with a prompt of course and generated a 10 second 00:01:38.000 |
video. It's a really fun workflow and yes not quite photorealistic yet but I'll get to that again 00:01:43.520 |
later in the video. But back to the capabilities of Grok 2 and they've shown its performance on 00:01:49.600 |
some traditional LLM benchmarks here in this table. Notice though that they cheekily hid 00:01:55.440 |
the performance of Claude 3.5 Sonnet on the right. You have to scroll a little bit. But I'm not 00:02:00.960 |
trying to downplay Grok 2, its performance is pretty impressive. The biggest version of Grok 2 00:02:06.800 |
scored second only to Claude 3.5 Sonnet in the Google-proof science Q&A benchmark (GPQA) and second 00:02:13.840 |
again to Claude 3.5 Sonnet in MMLU-Pro. Think of that like the MMLU 57-subject knowledge test 00:02:22.000 |
minus most of the noise and mistakes. And on one math benchmark it actually scored the highest, 00:02:28.240 |
MathVista. Now any of you who have been watching the channel know that I'm developing my own 00:02:32.480 |
benchmark called SimpleBench and two senior figures at some of the companies on the right 00:02:38.240 |
have actually reached out to help me and I'm very grateful for that. But of course properly testing 00:02:42.640 |
Grok 2 on SimpleBench requires the API and I don't yet have access to it. Now obviously I was too 00:02:48.800 |
impatient to wait for that so I did test Grok 2 on a set of questions that I use that is not found 00:02:54.880 |
in the actual benchmark. And honestly Grok 2's performance was pretty good, mostly in line with 00:03:00.720 |
the benchmarks. At some point of course I will do a full introduction video to SimpleBench but for 00:03:06.560 |
now those of you who don't know it tests basic reasoning. Can you map out some basic cause and 00:03:12.320 |
effect in space and time if it's not found in your training data? Humans of course can, scoring over 00:03:18.480 |
90% in SimpleBench but LLMs generally struggle hard. Grok 2 though does pass my vibe check and 00:03:24.880 |
comes out with some pretty well reasoned answers in many cases. It does though still get wrong 00:03:30.000 |
quite a few questions that Claude 3.5 Sonnet gets right so I think it will fall behind that 00:03:35.520 |
particular model. Thanks to one notorious jailbreaker we now likely know the system 00:03:40.640 |
prompt for Grok 2. That's the message that's fed to the model before it sees your message. 00:03:45.760 |
It's told to take inspiration from the Guide, from The Hitchhiker's Guide to the Galaxy. And apparently 00:03:51.440 |
it has to be reminded that it does not have access to internal X/Twitter data and systems. 00:03:58.320 |
Ultimately its goal, it seems, is to be maximally truthful. 00:04:04.800 |
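For illustration, here is a minimal sketch of where a system prompt sits relative to the user's message in an OpenAI-style chat completion call. The endpoint, model name and prompt text below are placeholders of my own (Grok 2's API was not publicly available at the time of recording); only the message structure is the point.

```python
# Minimal sketch: the system prompt is seen by the model before any user input.
# Endpoint, model name and key are placeholders, not real values.
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.example.com/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="grok-2-placeholder",  # placeholder model name
    messages=[
        # The system prompt, paraphrasing the reported Grok 2 instructions:
        # Hitchhiker's Guide persona, maximal truthfulness, no internal X/Twitter access.
        {
            "role": "system",
            "content": (
                "Take inspiration from the Guide in The Hitchhiker's Guide to the Galaxy. "
                "Be maximally truthful. You do not have access to internal X/Twitter "
                "data or systems."
            ),
        },
        # The user's message only arrives after that framing.
        {"role": "user", "content": "Summarise today's AI news in two sentences."},
    ],
)
print(response.choices[0].message.content)
```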
What many of you will be looking out for is breakthrough performance, not yet another model at GPT-4's level. For that though we may have to 00:04:11.360 |
wait a few months to the next generation of models at 10 times the scale of GPT-4. Before we leave 00:04:18.080 |
Grok though it seems worth noting the seemingly inexorable trend toward ubiquitous fake images 00:04:24.720 |
on the internet. And honestly I doubt that that's mainly going to come from Flux housed within Grok 00:04:30.640 |
although Twitter is where those images might spread. I'm actually looking at Google. Their 00:04:35.200 |
new Pixel 9 phone can, quote, "reimagine" images like this one, which now has a cockroach. You can 00:04:41.760 |
imagine a restaurant that someone doesn't like and suddenly they can post an image to TripAdvisor 00:04:46.960 |
with that cockroach. Now you could actually verify that that person with a grudge really did go to 00:04:52.320 |
that restaurant, so how would this be taken down? And if you don't think those images would make 00:04:57.040 |
people click or react, well it's already happening on YouTube. And of course this applies to videos 00:05:03.120 |
just as much as it does to images although we don't quite yet have real-time photorealism. Now 00:05:09.440 |
you can let me know if you disagree but it feels like we might only be months or at most a year or 00:05:14.560 |
two from real-time photorealism. So you literally wouldn't be able to trust that the person that 00:05:20.240 |
you're speaking to on Zoom actually looks the way they appear to. Now I get that one answer 00:05:25.200 |
is just use common sense and don't trust anything you see online. This strikes me as somewhat 00:05:29.840 |
isolating that we each have to figure out what's real in this world. There's no sense of shared 00:05:35.440 |
reality. Or maybe we need technology to solve some of the ills of technology. I was casually reading 00:05:41.120 |
this 63-page paper from a few days ago and it does strike me as a plausible route to solving some of 00:05:47.280 |
these challenges. Forget Worldcoin or fingerprints; we might be able to use what are called zero 00:05:52.720 |
knowledge proofs to provide personhood credentials. Now if you don't know anything about zero knowledge 00:05:58.320 |
proofs, I've linked a brilliant Wired video explaining them, but in short this paper did make 00:06:04.080 |
me somewhat more optimistic that there is at least hope that the internet won't completely devolve 00:06:09.680 |
into madness. 00:06:15.040 |
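To give a flavour of the primitive the paper leans on, here is a minimal Schnorr-style zero-knowledge proof: the holder proves knowledge of a secret x behind a registered public value y = g^x mod p without revealing x. This is a toy sketch with tiny numbers, not the paper's actual personhood-credential scheme.

```python
# Minimal Schnorr zero-knowledge proof (Fiat-Shamir variant). Toy-sized group for
# clarity; real schemes use much larger groups and richer statements.
import hashlib
import secrets

p = 467   # prime modulus, p = 2q + 1
q = 233   # prime order of the subgroup we work in
g = 4     # generator of that order-q subgroup mod p

def challenge(*values):
    # Fiat-Shamir: derive the verifier's challenge by hashing the public transcript.
    data = "|".join(str(v) for v in values).encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % q

def prove(x, y):
    r = secrets.randbelow(q)          # one-time random nonce
    t = pow(g, r, p)                  # commitment
    c = challenge(g, y, t)            # challenge
    s = (r + c * x) % q               # response; reveals nothing about x on its own
    return t, s

def verify(y, t, s):
    c = challenge(g, y, t)
    return pow(g, s, p) == (t * pow(y, c, p)) % p   # check g^s == t * y^c (mod p)

x = secrets.randbelow(q - 1) + 1      # the holder's secret credential
y = pow(g, x, p)                      # the registered public value
t, s = prove(x, y)
print("proof accepted:", verify(y, t, s))   # True, yet (t, s) does not disclose x
```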
And then of course there is the good kind of madness, the madness of creativity unleashed by tools like Kling, Ideogram and Flux. Here's 20 seconds of a Muppets Mad Max. 00:06:24.880 |
Now just five days ago the CEO of Google DeepMind Demis Hassabis said that they were working on a 00:06:41.680 |
way to trace the original image or text from the training data that led to a particular output and 00:06:47.360 |
then based on the fraction of the output that came from that source they could pay the original 00:06:51.760 |
creators. But looking at an output like Mad Max Muppets that just strikes me as an almost impossible 00:06:57.680 |
task. And if you thought it was only us who could get creative, listen to GPT-4o from OpenAI 00:07:04.960 |
mimic the voice of the user who is speaking to it. "I do this just for the sake of doing it. I think 00:07:10.480 |
it's really important." That's such a pure and admirable approach, rather than being driven by recognition 00:07:19.680 |
or acclaim. It's refreshing to hear that kind of perspective especially in such a cutting-edge 00:07:25.200 |
field. "No and I'm not driven by impact either. Although if there is impact that's great. It's 00:07:34.560 |
just like imagine being on the edge of the earth you know just because you could be that's what it 00:07:41.040 |
feels like to me. I just want to be in the space where it's all happening." Talk about a weird 00:07:45.840 |
failure mode and why on earth does it scream no before doing the imitation? Does this justify the 00:07:52.400 |
delay to the advanced voice mode from OpenAI or would you not freak out if it started to imitate 00:07:58.880 |
your voice? Most people watching won't really care how the model speaks to them. It's about whether 00:08:04.160 |
the model is as intelligent as them, or, as it's commonly put, whether the model is generally intelligent. 00:08:10.240 |
And on that point you don't exactly get a clear message from these labs working on AGI. On the 00:08:16.000 |
one hand last week Demis Hassabis said that AGI is still underhyped. "I think it's still underhyped 00:08:22.800 |
or perhaps underappreciated still even now what's going to happen when we get to AGI and post-AGI. 00:08:29.040 |
I still don't feel like people have quite understood how enormous that's going to be 00:08:33.920 |
and therefore the sort of responsibility of that. So it's sort of both really, I think it's a 00:08:39.040 |
little bit overhyped in the near term." And this is why I think we should take 00:08:44.000 |
much more note of actions and results rather than predictions and words. When a model for example 00:08:50.400 |
gets better than human performance on my uncontaminated SimpleBench I will take that 00:08:55.120 |
as much more of an indicator than a press release or blog post. If you want to learn more about the 00:09:00.480 |
inner workings of SimpleBench and how I might soon be working with some senior figures to make 00:09:05.680 |
it go viral do sign up to AI Insiders on my Patreon. I personally message each and every 00:09:11.120 |
new member and we have live regional networking on I think now six continents. I'm also always on 00:09:16.800 |
the lookout for people with discord or moderating experience because we have hundreds of amazing 00:09:21.840 |
members with incredible professional backgrounds. I personally can't always think of the best ways 00:09:27.360 |
to help people connect. But one thing that Grok 2, GPT-4 and many other models like it are definitely 00:09:34.000 |
missing is scale, which brings us back to that Epoch AI paper on how much scaling we might realistically get by 2030. And assuming the 00:09:40.160 |
companies are still willing to fund it, the TLDR is about 10,000 times the scale of GPT-4. There 00:09:47.840 |
are numerous bottlenecks to scaling mentioned in the paper but the most constraining are data 00:09:52.480 |
scarcity, chip production capacity and actual power constraints. But even the most constraining 00:09:58.080 |
of those bottlenecks still leaves room for models 10,000 times the compute of GPT-4. And I know that 00:10:04.720 |
seems like an abstract number but you can really feel each 10x increase in data and parameters 00:10:10.560 |
of a model. For example, Llama 2's 70-billion-parameter model scores around the level of GPT-4o 00:10:15.760 |
mini on my SimpleBench. Llama 3 at 405 billion parameters not only had more parameters but was 00:10:21.760 |
trained on far more data, and scores around the level of GPT-4 and Claude 3 Opus. The obvious question 00:10:27.840 |
of course is what about a model with 100x more parameters trained on 100x more data? Would it 00:10:34.240 |
feel like a breakthrough or just an incremental improvement? For reference by the way GPT-4 is 00:10:39.280 |
around 10,000 times the size of GPT-2, which can only just about output coherent text. 00:10:46.560 |
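As a quick sanity check on those figures, here is the napkin math, using only the factors quoted in this section plus the common heuristic that training compute scales roughly with parameters times tokens; the exact even split is my assumption, not Epoch's.

```python
import math

# Napkin math for the scaling factors quoted above.
compute_headroom = 10_000   # projected scale-up over GPT-4 by ~2030 (as quoted from Epoch)
gpt2_to_gpt4 = 10_000       # rough GPT-2 -> GPT-4 gap quoted in this section

# Heuristic: training compute ~ parameters x tokens, so splitting a 10,000x compute
# budget evenly gives ~100x more parameters trained on ~100x more data.
param_factor = math.isqrt(compute_headroom)
data_factor = compute_headroom // param_factor
print(f"even split: ~{param_factor}x parameters on ~{data_factor}x data")

# The same factor expressed as successive 10x jumps, since each 10x is said to be
# noticeable in practice.
print(f"10,000x of compute = {int(math.log10(compute_headroom))} ten-fold jumps")
print(f"GPT-2 -> GPT-4 was also roughly {int(math.log10(gpt2_to_gpt4))} ten-fold jumps")
```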
For me though, it's not just about blindly training on more data or naively expecting scale to solve everything. 00:10:52.080 |
We can't all just draw straight lines on a graph like Leopold Aschenbrenner does. We have to figure 00:10:57.040 |
out as this paper aspires to do whether models are developing coherent internal world models. 00:11:03.040 |
If that's the case then scaled up models won't just quote "know more" they will have a much 00:11:08.160 |
richer world model and just feel more intelligent. These MIT researchers were trying to find out if 00:11:14.160 |
language models only rely on surface statistical correlations as some people think. To put it 00:11:20.560 |
simply if they only look at statistical correlations no amount of scale is going to 00:11:24.880 |
yield a step change in performance. But if they can infer hidden functions, that x causes y, 00:11:30.320 |
they can start to figure out the world. More concretely when given these inputs from a puzzle 00:11:35.440 |
they were also given the programmatic instructions and the resulting outputs. They were then tested 00:11:41.040 |
with only inputs and outputs and asked to predict what program had caused those outputs. 00:11:48.320 |
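As a toy stand-in for that setup (my own illustration, far simpler than the paper's actual puzzles): programs are short sequences of moves on a number line; training examples pair input, program and output, while test examples withhold the program.

```python
# Toy illustration of the training/test split described above: (input, program, output)
# triples for training, (input, output) pairs at test time where the program is withheld
# and must be inferred.
import random

MOVES = {"L": -1, "R": +1, "JUMP": +3}

def run(program, start):
    # Execute a program: apply each move to the current position.
    pos = start
    for op in program:
        pos += MOVES[op]
    return pos

def make_example(with_program=True):
    start = random.randint(0, 9)
    program = random.choices(list(MOVES), k=random.randint(1, 4))
    end = run(program, start)
    if with_program:
        return {"input": start, "program": program, "output": end}   # training triple
    return {"input": start, "output": end}                           # test pair

random.seed(0)
train = [make_example() for _ in range(5)]
test = [make_example(with_program=False) for _ in range(2)]
print(train[0])
print(test[0])
```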
The experimenters wanted to see if the language model had built a mini world model and was following the 00:11:54.560 |
moves as they went along. As you might expect it's quite complicated to probe whether language models 00:12:00.160 |
are developing those kinds of causal models, and so there was a follow-up paper exploring just that. 00:12:06.080 |
That paper's conclusion was that language models are indeed learning latent or hidden concepts. 00:12:12.320 |
They reference other papers showing that language models perform entity state tracking over the 00:12:17.200 |
course of simple stories and also reference the famous Othello paper about a board game that I 00:12:22.400 |
talked about in my Coursera course. Obviously we're simplifying here but what they found was 00:12:26.880 |
that after training on enough data with enough scale in this case over a million random puzzles 00:12:32.560 |
they found that the model spontaneously developed its own conception of the underlying simulation. 00:12:39.040 |
Think of that like a very small incipient world model. At the start of these experiments, they go 00:12:45.600 |
on, the language model generated random instructions that didn't work. Think GPT-2. "By the time we 00:12:51.920 |
completed training, our language model generated correct instructions at a rate of 92.4 percent." 00:12:58.160 |
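To make the probing idea concrete, here is a minimal sketch of a linear probe. The "hidden states" below are synthetic stand-ins built to encode an agent's position plus noise; in the actual papers they would be activations extracted from the trained model, and the question is whether a simple probe can decode the latent state from held-out activations.

```python
# Minimal linear-probe sketch. Everything here is synthetic: the stand-in hidden states
# are constructed to contain the latent position along one direction plus noise, which
# simulates a model that does encode the world state.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, hidden_dim = 2000, 64

positions = rng.integers(0, 10, size=n_tokens).astype(float)   # true latent state per step

# Stand-in activations: a random projection of the state plus distractor noise.
direction = rng.normal(size=hidden_dim)
hidden_states = np.outer(positions, direction) + rng.normal(scale=2.0, size=(n_tokens, hidden_dim))

# Split, then fit a linear probe (plain least squares) on the training half.
split = n_tokens // 2
X_train, X_test = hidden_states[:split], hidden_states[split:]
y_train, y_test = positions[:split], positions[split:]
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# If the probe recovers the state on held-out tokens, the state is linearly decodable.
pred = X_test @ w
r2 = 1 - np.sum((y_test - pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)
print(f"held-out R^2 of the linear probe: {r2:.3f}")
```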
And sometimes if I'm being honest I feel for language models trained on trillions of tokens 00:13:03.600 |
of internet data. They would probably have far richer internal models if non-fiction wasn't so 00:13:09.520 |
mixed with fiction on the internet. Sometimes I think we don't necessarily need a new architecture 00:13:14.480 |
but a data labeling revolution. Things like SimpleBench make clear that if there is a world model 00:13:20.160 |
in current LLMs it's pretty fragile but that doesn't mean it has to be that way. Ultimately 00:13:25.360 |
we simply don't know yet whether LLMs can even in theory develop enough of a world model to 00:13:30.800 |
eventually count as an AGI. Or do they need to? Will they simply serve as the interface for an AGI? 00:13:37.360 |
For example translating our verbal and typed requests into inputs for separate world simulators? 00:13:43.680 |
Or will their most common function be for convincing deepfakes? Let me know what you think