
Grok-2 Actually Out, But What If It Were 10,000x the Size?


Chapters

0:00 Intro
0:40 Grok-2, Flux, Ideogram Workflow (SimpleBench)
4:36 Gemini ‘Reimagine’ and the Fake Internet
5:32 Personhood Credentials
6:09 Madhouse Creativity
8:00 Overhyped or Underhyped
9:27 Epoch Research
10:30 Emergent World Mini-Models?


00:00:00.000 | 36 hours ago the biggest version of Grok 2 went online, but as of recording there is no paper,
00:00:08.480 | no model card, nothing but a Twitter chatbot to test. In other words, there's no paper with which to grok,
00:00:16.400 | or understand, Grok 2. But we can still take the release of Grok 2 as an opportunity to discuss
00:00:22.800 | whether large language models are developing internal models of the world. Epoch AI just
00:00:28.800 | yesterday put out this paper on how much scaling we might realistically get by 2030 and what
00:00:35.200 | exactly will happen to the internet itself in the meantime. Grok 2 itself was first announced a
00:00:42.400 | week earlier in a blog post which said it outperformed Claude 3.5 Sonnet and GPT-4 Turbo,
00:00:49.040 | but it wasn't released for testing. The only things you could play about with were Grok 2 Mini,
00:00:55.200 | a smaller language model, and Flux, the image-generating model which is completely separate
00:01:00.880 | from xAI. xAI are of course the creators of the Grok family of models. Now while I found it
00:01:07.360 | somewhat newsworthy and intriguing that you could generate mostly unfiltered images using Flux,
00:01:12.960 | I was more curious about the capabilities of Grok 2 itself. And if you're wondering about those
00:01:17.920 | videos in the intro, I actually generated the images using Ideogram 2 which was released yesterday.
00:01:25.360 | I find that it's slightly better at generating text than the Flux model hosted by Grok. I then
00:01:31.280 | fed those images into Runway Gen 3 Alpha along with a prompt of course and generated a 10 second
00:01:38.000 | video. It's a really fun workflow and yes not quite photorealistic yet but I'll get to that again
00:01:43.520 | later in the video. But back to the capabilities of Grok 2 and they've shown its performance on
00:01:49.600 | some traditional LLM benchmarks here in this table. Notice though that they cheekily hid
00:01:55.440 | the performance of Claude 3.5 Sonnet on the right. You have to scroll a little bit. But I'm not
00:02:00.960 | trying to downplay Grok 2, its performance is pretty impressive. The biggest version of Grok 2
00:02:06.800 | scored second only to Claude 3.5 Sonnet in the Google proof science Q&A benchmark and second
00:02:13.840 | again to Claude 3.5 Sonnet in MMLU-Pro. Think of that like the MMLU 57-subject knowledge test
00:02:22.000 | minus most of the noise and mistakes. And on one math benchmark it actually scored the highest,
00:02:28.240 | MathVista. Now any of you who have been watching the channel know that I'm developing my own
00:02:32.480 | benchmark called SimpleBench and two senior figures at some of the companies on the right
00:02:38.240 | have actually reached out to help me and I'm very grateful for that. But of course properly testing
00:02:42.640 | Grok 2 on SimpleBench requires the API and I don't yet have access to it. Now obviously I was too
00:02:48.800 | impatient to wait for that so I did test Grok 2 on a set of questions that I use that is not found
00:02:54.880 | in the actual benchmark. And honestly Grok 2's performance was pretty good, mostly in line with
00:03:00.720 | the benchmarks. At some point of course I will do a full introduction video to SimpleBench but for
00:03:06.560 | now those of you who don't know it tests basic reasoning. Can you map out some basic cause and
00:03:12.320 | effect in space and time if it's not found in your training data? Humans of course can, scoring over
00:03:18.480 | 90% on SimpleBench, but LLMs generally struggle hard. Grok 2 though does pass my vibe check and
00:03:24.880 | comes out with some pretty well reasoned answers in many cases. It does though still get wrong
00:03:30.000 | quite a few questions that Claude 3.5 Sonnet gets right so I think it will fall behind that
00:03:35.520 | particular model. Thanks to one notorious jailbreaker we now likely know the system
00:03:40.640 | prompt for Grok 2. That's the message that's fed to the model before it sees your message.
00:03:45.760 | It's told to take inspiration from the Guide from The Hitchhiker's Guide to the Galaxy. And apparently
00:03:51.440 | it has to be reminded that it does not have access to internal X/Twitter data and systems.
00:03:58.320 | Ultimately its goal, it seems, is to be maximally truthful. What many of you will be looking out for
00:04:04.800 | is breakthrough performance, not yet another model at GPT-4's level. For that though we may have to
00:04:11.360 | wait a few months to the next generation of models at 10 times the scale of GPT-4. Before we leave
00:04:18.080 | Grok though it seems worth noting the seemingly inexorable trend toward ubiquitous fake images
00:04:24.720 | on the internet. And honestly I doubt that that's mainly going to come from Flux housed within Grok
00:04:30.640 | although Twitter is where those images might spread. I'm actually looking at Google. Their
00:04:35.200 | new Pixel 9 phone can, quote, "reimagine" images, like this one now having a cockroach. You can
00:04:41.760 | imagine a restaurant that someone doesn't like, and suddenly they can post an image to TripAdvisor
00:04:46.960 | with that cockroach. And you could actually verify that that person with a grudge really did go to
00:04:52.320 | that restaurant, so how would the image be taken down? And if you don't think those images would make
00:04:57.040 | people click or react, well it's already happening on YouTube. And of course this applies to videos
00:05:03.120 | just as much as it does to images although we don't quite yet have real-time photorealism. Now
00:05:09.440 | you can let me know if you disagree but it feels like we might only be months or at most a year or
00:05:14.560 | two from real-time photorealism. So you literally wouldn't be able to trust that the person that
00:05:20.240 | you're speaking to on Zoom actually looks the way they appear to. Now I get that one answer
00:05:25.200 | is just use common sense and don't trust anything you see online. This strikes me as somewhat
00:05:29.840 | isolating that we each have to figure out what's real in this world. There's no sense of shared
00:05:35.440 | reality. Or maybe we need technology to solve some of the ills of technology. I was casually reading
00:05:41.120 | this 63-page paper from a few days ago and it does strike me as a plausible route to solving some of
00:05:47.280 | these challenges. Forget Worldcoin or fingerprints; we might be able to use what are called
00:05:52.720 | zero-knowledge proofs to provide personhood credentials. Now if you don't know anything about
00:05:58.320 | zero-knowledge proofs, I've linked a brilliant Wired video explaining them, but in short this paper did make
00:06:04.080 | me somewhat more optimistic that there is at least hope that the internet won't completely devolve
00:06:09.680 | into madness.
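If you want a feel for the core trick, here is a toy, insecure Schnorr-style sketch of a zero-knowledge identification step, my own illustration rather than the credential construction from the paper: the prover convinces a verifier that they hold a secret key matching a public credential without ever revealing the key.

```python
# Toy Schnorr-style zero-knowledge identification (illustration only; NOT secure
# as written, and not the personhood-credential scheme from the paper).
import secrets

P = 2**127 - 1  # small Mersenne prime used as a toy modulus
G = 3           # public base

secret_key = secrets.randbelow(P - 1)   # held only by the prover
public_key = pow(G, secret_key, P)      # the published "credential"

# One round of the protocol.
nonce = secrets.randbelow(P - 1)
commitment = pow(G, nonce, P)                          # prover -> verifier
challenge = secrets.randbelow(2**64)                   # verifier -> prover
response = (nonce + challenge * secret_key) % (P - 1)  # prover -> verifier

# Verifier checks g^s == t * y^c (mod p); the secret key itself never appears.
assert pow(G, response, P) == (commitment * pow(public_key, challenge, P)) % P
print("credential verified without revealing the secret")
```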
00:06:15.040 | And then of course there is the good kind of madness, the madness of creativity unleashed by tools
00:06:23.120 | like Kling, Ideogram and Flux. Here's 20 seconds of Billory Squintin's Mad Max, Muppet style.
00:06:24.880 | Now just five days ago the CEO of Google DeepMind Demis Hassabis said that they were working on a
00:06:41.680 | way to trace the original image or text from the training data that led to a particular output and
00:06:47.360 | then based on the fraction of the output that came from that source they could pay the original
00:06:51.760 | creators. But looking at an output like Mad Max Muppets that just strikes me as an almost impossible
00:06:57.680 | task. And if you thought it was only us that could get creative, listen to GPT-4o from OpenAI
00:07:04.960 | mimic the voice of the user who is speaking to it. "I do this just for the sake of doing it. I think
00:07:10.480 | it's really important." That's such a pure and admirable approach rather than by recognition
00:07:19.680 | or acclaim. It's refreshing to hear that kind of perspective especially in such a cutting-edge
00:07:25.200 | field. "No and I'm not driven by impact either. Although if there is impact that's great. It's
00:07:34.560 | just like imagine being on the edge of the earth you know just because you could be that's what it
00:07:41.040 | feels like to me. I just want to be in the space where it's all happening." Talk about a weird
00:07:45.840 | failure mode, and why on earth does it scream "No" before doing the imitation? Does this justify the
00:07:52.400 | delay to the advanced voice mode from OpenAI or would you not freak out if it started to imitate
00:07:58.880 | your voice? Most people watching won't really care how the model speaks to them. It's about whether
00:08:04.160 | the model is as intelligent as them, or, as it's commonly put, whether the model is generally intelligent.
00:08:10.240 | And on that point you don't exactly get a clear message from these labs working on AGI. On the
00:08:16.000 | one hand last week Demis Hassabis said that AGI is still underhyped. "I think it's still underhyped
00:08:22.800 | or perhaps underappreciated, still, even now: what's going to happen when we get to AGI and post-AGI.
00:08:29.040 | I still don't feel like people have quite understood how enormous that's going to be,
00:08:33.920 | and therefore the sort of responsibility of that. So it's sort of both, really; I think it's a
00:08:39.040 | little bit overhyped in the near term." And this is why I think we should take
00:08:44.000 | much more note of actions and results rather than predictions and words. When a model for example
00:08:50.400 | gets better than human performance on my uncontaminated SimpleBench, I will take that
00:08:55.120 | as much more of an indicator than a press release or blog post. If you want to learn more about the
00:09:00.480 | inner workings of SimpleBench and how I might soon be working with some senior figures to make
00:09:05.680 | it go viral do sign up to AI Insiders on my Patreon. I personally message each and every
00:09:11.120 | new member and we have live regional networking on I think now six continents. I'm also always on
00:09:16.800 | the lookout for people with Discord or moderating experience because we have hundreds of amazing
00:09:21.840 | members with incredible professional backgrounds. I personally can't always think of the best ways
00:09:27.360 | to help people connect. But one thing that Grok 2, GPT-4 and many other models like it are definitely
00:09:34.000 | missing is scale. The Epoch AI paper asks how much scaling we might realistically get by 2030, and assuming the
00:09:40.160 | companies are still willing to fund it, the TLDR is about 10,000 times the scale of GPT-4. There
00:09:47.840 | are numerous bottlenecks to scaling mentioned in the paper but the most constraining are data
00:09:52.480 | scarcity, chip production capacity and actual power constraints. But even the most constraining
00:09:58.080 | of those bottlenecks still leaves room for models 10,000 times the compute of GPT-4. And I know that
00:10:04.720 | seems like an abstract number but you can really feel each 10x increase in data and parameters
00:10:10.560 | of a model. For example Llama 2, at 70 billion parameters, scores around the level of GPT-4o
00:10:15.760 | mini on my SimpleBench. Llama 3, at 405 billion parameters, not only had more parameters but was
00:10:21.760 | trained on far more data, and it scores around the level of GPT-4 and Claude 3 Opus. The obvious question
00:10:27.840 | of course is: what about a model with 100x more parameters trained on 100x more data? Would it
00:10:34.240 | feel like a breakthrough or just an incremental improvement? For reference, by the way, GPT-4 is
00:10:39.280 | around 10,000 times the size of GPT-2, which can only just about output coherent text.
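To make that arithmetic concrete, here is a rough back-of-the-envelope sketch using the common approximation that training compute scales with parameters times tokens; the baseline figures below are illustrative assumptions for a GPT-4-class model, not disclosed numbers.

```python
# Back-of-the-envelope training-compute scaling (illustrative assumptions only).
# Common approximation: training FLOPs ~= 6 * parameters * training tokens.

def training_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

# Hypothetical GPT-4-class baseline (assumed, not an official figure).
base_params = 1.8e12   # assumed parameter count
base_tokens = 13e12    # assumed training tokens

baseline = training_flops(base_params, base_tokens)
scaled = training_flops(100 * base_params, 100 * base_tokens)  # 100x params, 100x data

print(f"baseline   ~ {baseline:.1e} FLOPs")
print(f"scaled up  ~ {scaled:.1e} FLOPs")
print(f"multiplier ~ {scaled / baseline:,.0f}x")  # 100 * 100 = 10,000x the compute
```

The same multiplication is why a 100x jump in both parameters and data lands at roughly the 10,000x compute ceiling the Epoch paper discusses.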
00:10:46.560 | For me, though, it's not just about blindly training on more data or naively expecting scale to solve everything.
00:10:52.080 | We can't all just draw straight lines on a graph like Leopold Aschenbrenner does. We have to figure
00:10:57.040 | out, as this paper aspires to do, whether models are developing coherent internal world models.
00:11:03.040 | If that's the case then scaled-up models won't just, quote, "know more"; they will have a much
00:11:08.160 | richer world model and just feel more intelligent. These MIT researchers were trying to find out if
00:11:14.160 | language models only rely on surface statistical correlations, as some people think. To put it
00:11:20.560 | simply, if they only look at statistical correlations, no amount of scale is going to
00:11:24.880 | yield a step change in performance. But if they can infer hidden functions, that x will cause y,
00:11:30.320 | they can start to figure out the world. More concretely, the models were given these inputs from a puzzle,
00:11:35.440 | along with the programmatic instructions and the resulting outputs. They were then tested
00:11:41.040 | with only inputs and outputs and asked to predict what program had produced those outputs. The
00:11:48.320 | experimenters wanted to see if the language model had built a mini world model and was following the
00:11:54.560 | moves as they went along. As you might expect, it's quite complicated to probe whether language models
00:12:00.160 | are developing those kinds of causal models, and so there was a follow-up paper exploring just that.
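To make "probing" a little more concrete, here is a minimal sketch of the general technique rather than the exact setup of either paper: you freeze the language model, collect its hidden states, and train a small classifier to see whether some latent property, like the simulated state of the puzzle, can be read back out of them. The arrays below are synthetic stand-ins for states you would extract from a real model.

```python
# Minimal linear-probe sketch (generic illustration, not the cited papers' exact method).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins: hidden_states would come from a frozen language model; labels would
# encode some aspect of the simulated "world state" (e.g. what a grid cell contains).
hidden_states = rng.normal(size=(2000, 256))               # (examples, hidden_dim)
true_direction = rng.normal(size=256)                      # synthetic signal direction
labels = (hidden_states @ true_direction > 0).astype(int)  # property hidden in the states

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# High held-out accuracy suggests the property is linearly decodable from the
# hidden states; chance-level accuracy suggests it is not (at least not linearly).
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
```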
00:12:06.080 | That paper's conclusion was that language models are indeed learning latent or hidden concepts.
00:12:12.320 | They reference other papers showing that language models perform entity state tracking over the
00:12:17.200 | course of simple stories and also reference the famous Othello paper about a board game that I
00:12:22.400 | talked about in my Coursera course. Obviously we're simplifying here but what they found was
00:12:26.880 | that after training on enough data with enough scale, in this case over a million random puzzles,
00:12:32.560 | the model spontaneously developed its own conception of the underlying simulation.
00:12:39.040 | Think of that like a very small, incipient world model. At the start of these experiments, they go
00:12:45.600 | on, "the language model generated random instructions that didn't work." Think GPT-2. "By the time we
00:12:51.920 | completed training, our language model generated correct instructions at the rate of 92.4 percent."
00:12:58.160 | And sometimes if I'm being honest I feel for language models trained on trillions of tokens
00:13:03.600 | of internet data. They would probably have far richer internal models if non-fiction wasn't so
00:13:09.520 | mixed with fiction on the internet. Sometimes I think we don't necessarily need a new architecture
00:13:14.480 | but a data-labeling revolution. Things like SimpleBench make clear that if there is a world model
00:13:20.160 | in current LLMs, it's pretty fragile, but that doesn't mean it has to be that way. Ultimately
00:13:25.360 | we simply don't know yet whether LLMs can even in theory develop enough of a world model to
00:13:30.800 | eventually count as an AGI. Or do they need to? Will they simply serve as the interface for an AGI?
00:13:37.360 | For example translating our verbal and typed requests into inputs for separate world simulators?
00:13:43.680 | Or will their most common function be for convincing deepfakes? Let me know what you think
00:13:49.200 | and, as always, have a wonderful day.