
Llama 405B: Full 92-Page Analysis, and Uncontaminated SIMPLE Benchmark Results


Transcript

The llama herd has left the building and is roaming the streets. More specifically, the Llama 3.1 405 billion parameter language model is out, but I thought the former expression would be a bit more dramatic. The 92-page paper that came with the model was released less than 24 hours ago, and yes, I've read it in full and have benchmarked the model against four competitors on over 100 private questions, and I have 74 notes to touch on.

The model is impressive and the paper is revealing, so let's get started. There are three sizes of text-only Llama 3 models, but this video will focus almost entirely on the biggest and best, the 405 billion parameter model. And no, Meta weren't exaggerating when they said that it delivers comparable quality to leading language models such as GPT-4.

And in case you're new to the channel, I'm not just relying on those traditional benchmarks to assess that comparison. Meta's innovations, in a nutshell, were higher-quality, more aggressively filtered data and simply more compute, bigger scale. Indeed, the sheer scale of compute, way more than 10 to the 25 floating point operations, was so big that at one point the EU classified that as presenting a systemic risk.

So whether that scares you or hypes you, let's see the results of all of those FLOPs. Here is just a quick snapshot of a comparison on traditional benchmarks between Llama 3.1 405B and GPT-4, GPT-4o, and Claude 3.5 Sonnet. As I'll try to persuade you in a moment, I don't think these benchmarks quite capture the nuanced differences between the models, but it certainly shows you that this new "open source" model from Meta is on a par with, if not better than, GPT-4.

Of course, it doesn't yet have all the fancy speech-in and speech-out that GPT-4 Omni does, but technically we don't yet have access to that for that model either. I do think it's worth noting though, just for 10 seconds, that we have a downloadable model now that is as good as or better than the GPT-4 that caused such waves early last year.

At the time, people thought that day might take 2 years or even 5 years, but no, it's here. And yes, Meta are still arguing that this series of models charts a "responsible path" towards the development of Artificial General Intelligence. I'll at least have a few comments on that when it comes to my private General Intelligence benchmark.

Just quickly, why do I keep saying "open source"? Well, according to the semi-official Open Source Initiative, the definition of Open Source AI includes the training data provenance: where it comes from and how it was obtained. The paper on page 4 simply says "from a variety of data sources". So even if you had the budget, you couldn't recreate Llama 3.1 because you simply wouldn't know what data they used.

Indeed, I did a video on this for my new Coursera course, but just remember that point any time you hear that Meta is committed to Open Source AI. I mean, just look how many times in this one paragraph Mark Zuckerberg used the phrase "open source". So why be shy about the data they're using?

Well, as the New York Times recently reported, the data is getting harder and harder to find. Companies like Reddit and Twitter are charging for their data and Meta may not have had permission for all of that data. And one theme you'll see throughout the paper is using language models to improve the performance of language models.

Using Llama 2, for example, to filter the data used to train Llama 3. That's just one example; there are literally dozens. So you can bet that Llama 3.1 is being used to help train Llama 4. Before you predict, though, that this is setting off some form of intelligence explosion, remember that it was just yesterday that Zuckerberg admitted that the Llama models are hemorrhaging money for Meta.

It's hard to know in advance when something is good enough that you're going to have a product that billions of people use, and then when it's ready to kind of be a large business. And I mean, look, we're all spending, you know, a lot of capital and on basically training these models.

So I think that people are going to be probably losing money for quite a while. But I don't know, maybe that'll all happen quicker. It's hard to know exactly. And even OpenAI might be losing $5 billion this year alone. That's at least according to a report released by The Information while I was filming the video.

But we do know that LLAMA 4 is coming, and probably before the end of the year. How do you define AGI, and do you get there first? Well, it's a good question. We're basically already starting to work on LLAMA 4. And our goal is to completely close the gap with all the others on that.

So I don't know. I mean, do we get to AGI first? I mean, I think that there will probably be some breakthroughs between now and then. It's hard to just predict in a straight line. Then you get to the more complicated question, which is like, what is it? I don't know that there's one specific definition for this.

Throughout the paper, they give away their recipe for doing what they did. Having read both the original Llama paper and the Llama 2 paper, this is quite different. It almost feels like they're much more confident giving away the secrets of large language models. They almost don't believe that there's much of a secret sauce, and they're not scared of China.

And Claude 3.5 Sonnet aside, they've almost proven that with this model. I must say that there was one part of the paper that I found especially sensational. They developed scaling laws, not just for next token prediction loss, but for benchmark performance, or to somewhat translate that, how long to run the GPUs to get the benchmark performance that they wanted.

Given their FLOP budget, they predicted how the model would perform and got it right, only just slightly underestimating final performance. Or in their words, this approach enables us to predict downstream task performance, given a specific number of training FLOPs for compute-optimal models. They set themselves a compute budget and got the benchmark performance that they expected.
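
To make that two-step prediction concrete, here is a minimal sketch, not Meta's code, of how you might fit something like it yourself: a power law from training FLOPs to the model's negative log-likelihood on a benchmark's correct answers, then a sigmoid from that log-likelihood to accuracy. The toy numbers, the function names, and the roughly 3.8 x 10^25 FLOP budget plugged in at the end are all assumptions for illustration.

import numpy as np
from scipy.optimize import curve_fit

# Toy measurements from hypothetical small-scale training runs (illustrative numbers only).
flops = np.array([1e21, 3e21, 1e22, 3e22, 1e23])   # training compute of each run
nll   = np.array([1.10, 0.95, 0.82, 0.71, 0.62])   # negative log-likelihood of correct answers
acc   = np.array([0.28, 0.35, 0.44, 0.53, 0.61])   # accuracy on the same benchmark

# Step 1: power law nll ~ exp(c) * flops^m, fitted in log-log space for numerical stability.
m, c = np.polyfit(np.log(flops), np.log(nll), 1)   # m comes out negative: more compute, lower loss
nll_from_flops = lambda f: np.exp(c) * f ** m

# Step 2: sigmoidal map from negative log-likelihood to benchmark accuracy.
def acc_from_nll(x, k, x0, top):
    return top / (1.0 + np.exp(k * (x - x0)))

params, _ = curve_fit(acc_from_nll, nll, acc, p0=[5.0, 0.9, 0.9])

# Extrapolate both fits to the full training budget to predict the final score.
budget = 3.8e25                                    # assumed to be roughly the 405B FLOP budget
pred_nll = nll_from_flops(budget)
pred_acc = acc_from_nll(pred_nll, *params)
print(f"predicted NLL {pred_nll:.3f} -> predicted accuracy {pred_acc:.3f}")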

It's almost a bit like you can imagine a benchmark performance dial in Mark Zuckerberg's office that he can move clockwise at will, while the money lasts, of course. These benchmark scaling laws, by the way, extrapolate across four orders of magnitude, so are pretty reliable. In case you're wondering, that's where they got the quirky 405 billion parameter number from.

They had the compute budget, looked at those benchmark scaling laws, and assigned that number of parameters. On the right is the sigmoidal scaling law curve that they anticipated, and that the model then followed, on the ARC Challenge. That's not, by the way, the ARC-AGI challenge that I've talked about on this channel recently, but it is legit questions like this that you can see here.

General knowledge and what they call a reasoning challenge. Now, just how many benchmarks that scaling law holds for is a question that I, at least, am immensely curious about. I'll come back to benchmarks, but the amount of detail they went into, down to the exact hardware issues they had, is quite incredible.

They even note at one point that temperature fluctuations during the day impacted GPU dynamic voltage and frequency scaling. And slightly more concerningly, the fluctuations of power consumption across the data center stretched the limits of the power grid. It does make me, at least, wonder what kind of issues they'll have when they scale up another 50x.

Now, clearly, because it is a 92-page paper, I am skipping over a lot, but I do want to bring you the most interesting highlights. For example, there was this detail about how they obsessively cleaned the data. They found an annoying problem that was too common in their data. Overly apologetic tonal issues.

Phrases such as "I'm sorry" or "I apologize". They didn't want that, nor did they want excessive emojis or exclamation points. Back to that theme, though, of AI improving AI, they trained a code expert model to help them find the highest quality human annotations for code. Five pages on in the paper, they say that they trained a multilingual expert model to collect higher quality annotations in non-English languages.

And it seems appropriate at this point to mention that Meta, for the first time, allow you to use this frontier model to generate synthetic data to improve and train your smaller model. They didn't allow that before, and nor did companies like OpenAI, to the best of my knowledge. So that flywheel of models improving models is now technically open to you.
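
As a rough sketch of what that now-permitted flywheel could look like in practice, assume you're serving the 405B model behind an OpenAI-compatible endpoint; the base URL, the model identifier, and the seed prompts below are placeholders, not anything from the paper. The idea is simply to have the big "teacher" answer prompts and save the pairs as supervised fine-tuning data for your smaller model.

import json
from openai import OpenAI

# Hypothetical local, OpenAI-compatible server hosting the 405B "teacher" model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")
TEACHER = "meta-llama/Llama-3.1-405B-Instruct"     # placeholder model identifier

def synthesize_sft_data(seed_prompts, out_path="synthetic_sft.jsonl"):
    # Ask the teacher to answer each seed prompt and store prompt/completion pairs.
    with open(out_path, "w") as out:
        for prompt in seed_prompts:
            resp = client.chat.completions.create(
                model=TEACHER,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.7,
            )
            completion = resp.choices[0].message.content
            # Each JSON line becomes one fine-tuning example for the smaller student model.
            out.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")

synthesize_sft_data([
    "Explain gradient clipping in two sentences.",
    "Write a Python function that merges two sorted lists.",
])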

Now, you do have to be slightly sophisticated about it, though. When they trained Llama 3 405B on its own generated data in programming, they found it wasn't helpful. Notice that is different from those last two examples. This is the same model training on its own generated data. But when they introduced execution feedback, which I've talked about quite a lot on this channel, it did enable the model to learn from its own mistakes.
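
Here is a minimal sketch, under assumptions rather than Meta's actual pipeline, of what execution-feedback filtering can look like: sample code from the model, run it against the task's unit tests, and keep only the generations that pass for fine-tuning. The `generate_solutions` callable and the task format are hypothetical stand-ins.

import ast
import subprocess
import tempfile

def passes_execution_feedback(solution_code: str, unit_test_code: str) -> bool:
    # Cheap syntax check first, so we don't waste time executing malformed code.
    try:
        ast.parse(solution_code)
    except SyntaxError:
        return False
    # Run the candidate solution together with the task's unit tests in a subprocess.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + unit_test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=10)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def build_finetuning_set(tasks, generate_solutions, samples_per_task=8):
    # tasks: iterable of {"prompt": ..., "unit_tests": ...}; generate_solutions samples
    # candidate programs from the model being improved.
    kept = []
    for task in tasks:
        for candidate in generate_solutions(task["prompt"], n=samples_per_task):
            if passes_execution_feedback(candidate, task["unit_tests"]):
                kept.append({"prompt": task["prompt"], "completion": candidate})
    return kept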

And anyone who has been following this channel knows that I talk often about verifier models, and Llama 3 indeed incorporated that approach during training. In coding, for example, only generations that pass syntax checking and unit tests were used for fine-tuning. But for maths and reasoning, the story is even more interesting.

First, they give a curious definition of reasoning. We define reasoning as the ability to perform multi-step computations and arrive at the correct final answer. I'm definitely going to leave a question mark on that one, because under that definition, wouldn't a calculator be doing reasoning? But the interesting bit is how they say that training data on the web shows a shortage of ground-truth correct chains of thought for reasoning and math.

But those are essential for guiding the model how to break down the problem step-by-step and reach the final answer. In other words, most online text contains results and analysis, not the chains of thought involved in coming up with those results. Then they quote directly from the Let's Verify Step-by-Step paper that I've talked about many times on this channel.

And they go on to say the following: they identified mathematical skills where the model underperforms and actively sourced prompts from humans to teach the model such skills. And then they use the model, Llama 3, to check the reasoning steps behind a step-by-step solution. In other words, training a model to recognize good steps in a reasoning chain.

They could then filter the training data where those intermediate reasoning steps were incorrect. So not just the final results, the reasons used to get those final results. They wanted to eliminate invalid reasoning traces. And for the hardest prompts, they even used Monte Carlo Tree Search, a bit like AlphaGo, with those process-based reward models to generate valid reasoning traces.
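
As a minimal sketch of that idea, my own illustration using a hypothetical `step_reward_model` in the spirit of Let's Verify Step-by-Step rather than Meta's code, a reasoning trace only survives into the training set if every intermediate step scores as valid, not just the final answer.

def keep_trace(problem: str, steps: list[str], step_reward_model, threshold: float = 0.5) -> bool:
    # Score each intermediate step in context; one bad step invalidates the whole trace.
    context = problem
    for step in steps:
        if step_reward_model(context, step) < threshold:
            return False
        context += "\n" + step
    return True

def filter_reasoning_traces(dataset, step_reward_model):
    # dataset: iterable of {"problem": ..., "steps": [...], "final_answer": ...}
    return [ex for ex in dataset if keep_trace(ex["problem"], ex["steps"], step_reward_model)]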

Translated, they searched as hard as they could to find the best reasoning steps to teach the model reasoning. And at this point, I can hold off no longer from talking about my own private benchmark, what I call SimpleBench, to test general intelligence reasoning. And there are a few things I love about this benchmark.

Obviously, I am ridiculously biased, so take this with a pinch of salt. But this is actually the benchmark I rely on to test the real reasoning intelligence of models. First, it's fully private, so it hasn't been contaminated at all. Second, it is rigorously vetted, not just by me, but by outside experts with more to come.

If even one mistake makes it into the final 100 or 200 questions, I'll be pretty pissed off. But third, and I think most interestingly, even the best models, as you can see, fall well, well, well behind the performance of humans as I have anecdotally tested them. I'll show you one example in a moment, which of course won't make it into the final benchmark.

But for me, it has been the most reliable vibe test that I've seen so far. Now, I will be testing the models again using self-consistency. But for now, we have Claude 3.5 Sonnet way ahead at 32%, with Llama 405B at 18%, well ahead of both versions of GPT-4 and Gemini 1.5.

Smaller models, by the way, in case you're curious, like GPT-4o Mini, score 0%. And here is one example that the new Llama model actually usually gets, but GPT-4o basically never gets. It comes from the spatial intelligence section of the benchmark and involves placing four whole ice cubes into a fire.

Then some more ice cubes are placed into the fire. And then the question ends with: how many whole ice cubes can be found in the fire at the end of the third minute? I even add in "pick the most realistic answer". And no, the model doesn't pick zero, reflecting that none of the ice cubes will be whole, or even still there, after the third minute.

Most models, of course, go down a rabbit hole of calculations. Now, admittedly, this was one of the easier questions on the benchmark. And if you add things like think about this carefully or this is a trick question, the models can sometimes get it. But I know the models well enough now that I can create genuine spatial, temporal, linguistic or social questions that no amount of warnings allow the models to get right.

And yes, that's still with humans scoring near perfectly. How so? Well, it's because, of course, the models are modelling language. They're language models. They're not reality simulators. They don't actually visualise things in their head or think about problems in the same way that we do. So how would a model like Llama 3 ever get a question like this right?

Well, it's because I can leave, let's say, linguistic clues, crumbs to allow them to infer the answer, even if they can't simulate the situation. Testing, if you will, their ability to pick up faint signal amidst the noise. If I remove all signal, models score zero with humans still scoring almost perfectly.

But with just faint signals, I can separate the smart models from the less smart models. I'll be totally honest. I wish I could go through all the hundred plus questions with you because they're pretty fun. But then, of course, it would leak into the training data inevitably and contaminate the test.

Now, I have made the benchmark functional so I can change the numbers. But still, I want to avoid that if possible. Now, I get it. Many of you are thinking that was a very long way of saying that Llama 405B is good. Not quite as good as Claude 3.5 Sonnet, but better, I think, in text at least, than GPT-4o.

Now, you could say that part of this benchmark is somewhat adversarial and Meta on page 33 talk about how adversarial tests cause significantly worse performance than non-adversarial ones. What they mean by that is that in some of the benchmarks that they used, even a single distracting sentence at the end of a question caused significantly worse performance than simply asking the question.

If the model was actually thinking about the question, that shouldn't happen. And the paper highlights this without actually suggesting a solution: "For mathematical reasoning and question answering, however, the adversarial performances are substantially lower than the non-adversarial performances. This pattern is similar for pre-trained and post-trained models." So much to cover, so I'm going to move swiftly on to contamination.

Through fascinating word matching or n-gram checks, they found that contamination was rife in traditional benchmarks. And these contamination scores in this column actually underestimate the problem. They excluded benchmarks from this chart when the clean set had too few examples or because the observed performance gain when they cleaned the data set showed extremely erratic behavior.

And they go on to describe the MMLU. Even when they allowed for a higher threshold of 8-word overlap between the training data and the test, it gave such high contamination scores that it was impossible to get a good performance gain estimate. So they couldn't even really estimate how much contamination was affecting the MMLU scores.
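
For intuition, here is a minimal sketch of that style of n-gram contamination check; the crude whitespace tokenisation and the flagging threshold are my assumptions, not the paper's exact procedure.

def ngrams(tokens, n=8):
    # All contiguous 8-token windows of the example.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_score(example_text, train_ngram_index, n=8):
    # train_ngram_index: a set of 8-grams precomputed over the training corpus.
    toks = example_text.split()                    # crude whitespace tokenisation
    grams = ngrams(toks, n)
    if not grams:
        return 0.0
    hits = sum(1 for g in grams if g in train_ngram_index)
    return hits / len(grams)                       # fraction of overlapping 8-grams

def is_contaminated(example_text, train_ngram_index, threshold=0.5):
    return contamination_score(example_text, train_ngram_index) >= threshold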

It seems like private benchmarks such as those from Scale AI, and indeed mine, will be more common in the future. Here was the ranking, for example, in math by Scale AI, with Claude 3.5 Sonnet in the number one position. At a glance, human comparisons, leading to leaderboards like those from LMSYS, seem to be a bit more problematic.

Even though Sam Altman said that we now have GPT-4o Mini matching GPT-4o's performance, in my own experiments, and let me know what you think in the comments, it's not even close. Having Mini beating Claude 3.5 Sonnet just seems shocking to me. Now LMSYS have addressed that and said that they're going to release a random 20% subset of those battles.

So I will look at that with interest. Back to the paper though, and here's another way that Llama 405B does seem to be better than its rivals. It has a long context of 128k tokens, or around 100,000 words. Now, other models of course have more than that, but that's not why it's better.

It's when it's asked questions that rely on scouring through that long context that it performs better. And annoyingly, they didn't compare it to Gemini 1.5 Pro, but here it beats GPT-4, GPT-4o and Claude 3.5 Sonnet significantly. What is this InfiniteBench QA? Well, as you'd expect, I tracked down that paper and read it in full.

And a typical question from that InfiniteBench was this. With details strewn throughout a story the length of a novel, they asked: what colour dress did person A wear when A met B for the second time? So the model would obviously have to track when A met B for the first time, then the second time, and what colour dress they were wearing.

On that, Llama 3.1 crushes Claude 3.5. Also, when there are multiple needles in a haystack, a bit like if there are four passwords strewn throughout a long document, it can't do this quite as well as GPT-4 apparently, or even, oddly, Llama 3 8B, but it does far better than Claude 3.5 Sonnet.
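
To make that kind of test concrete, here is a small sketch of my own construction, not InfiniteBench's exact format: hide several "passwords" at random positions in a very long filler document and check whether the model's answer recovers all of them. `ask_model` is a hypothetical call to whichever model is under test.

import random

def build_haystack(needles, filler="The quick brown fox jumps over the lazy dog. ",
                   total_sentences=20000):
    # A long, repetitive document with the needles dropped in at distinct random positions.
    sentences = [filler] * total_sentences
    positions = random.sample(range(total_sentences), len(needles))
    for i, (needle, pos) in enumerate(zip(needles, positions)):
        sentences[pos] = f"Remember this: password {i + 1} is {needle}. "
    return "".join(sentences)

def score_multi_needle(ask_model, needles=("azure-42", "crimson-7", "violet-19", "amber-3")):
    haystack = build_haystack(list(needles))
    question = "List every password mentioned anywhere in the document above."
    answer = ask_model(haystack + "\n\n" + question)
    found = sum(1 for n in needles if n in answer)
    return found / len(needles)                    # 1.0 means all needles were retrieved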

It does seem a bit random to me to not compare it to Gemini 1.5 Pro when that's its specialty, long context, but anyway. Now, I will give some more credit to Meta for this. They gave plenty of win-loss human comparisons with GPT-4, not only in the paper, but also on the website of the Llama 3 release.

And most of those comparisons were actually unfavourable. That's commendable honesty, to include charts which make your model seem less good. In the middle, you can see Llama 3 losing out to GPT-4o most of the time. Actually, it's losing in all of these comparisons, across English reasoning, coding, etc. Now again, as we've seen, human at-a-glance evaluation can't always be trusted though.

Now, though, for a word on safety. They claim that the violation rate has dropped significantly for Llama 3 compared to its competitors. Now, normally a lower violation rate for safety would lead to an increased false refusal rate, where the model refuses to answer simple, innocent questions, basically. But actually, it still has a pretty low false refusal rate.

And they make this point that it is critical to consider false refusal as a countermetric, because a model that always refuses is maximally safe, cough, cough, Claude 3.5 Sonnet, but not always helpful. The reference I'm making there is that Claude, very frequently compared to other models, seems to refuse my innocent questions.

Anyway, so false refusals are definitely a thing and I'm glad Meta are aware of it. And again, commendable honesty: they admit that Llama 3 is on average more susceptible to prompt injection compared at least to GPT-4 or Gemini Pro, but it's better, apparently, than Mixtral. But there's a wider point on safety that I do want to note.

It was only around a year ago that Mark Zuckerberg was receiving a letter from two senators in America concerned about the leak of Llama 1, talking about its potential for spam, fraud, malware, privacy violations, and harassment. Now, clearly that letter went nowhere because they subsequently released not only Llama 2, but Llama 3 open weights and downloadable.

And again, on the safety point, Leopold Aschenbrenner will be having a fit because he says there's no point keeping models closed because adversaries like China will simply steal the models anyway on a thumb drive. So when I see letters like this from a couple of days ago to Sam Altman, signed by around six senators asking him if he has indeed committed 20% of their compute budget to safety, I just have a slight suspicion that OpenAI might completely ignore this and completely get away with it.

I also want to commend Meta for being much more rigorous in how they pre-check models before release. They got a set of volunteers and saw if there was any uplift in their ability to create, or at least ideate about, chemical and biological weapons when they had access to Llama 3 versus having no access.

Both groups did have the internet at least. And the analysis of these results showed no significant uplift in performance related to usage of Llama 3. And honestly, that doesn't surprise me too much given how much data filtering went on. Count me at least as being surprised if biological or chemical weapon data still made it into the final model.

I would hope not at least. To their credit, OpenAI did a similar study almost six months ago, which I talked about on my Patreon AI Insiders. Now the vision, speech and video parts of Llama 3.1 aren't yet available. Zuckerberg described some sort of mess up but didn't go into much more detail.

But they did have one interesting conjecture in the paper. You might remember how Gemini 1.5 Pro and GPT-4o are trained from the ground up to be multimodal. That has advantages, but Meta contends that a compositional approach, as in separate models, is actually in some ways more advantageous. Apparently, it's more efficient during inference.

Obviously, we can all judge this when it comes out. But I do note that Noam Brown said that GPT-4o didn't turn out as well as they hoped with multimodal reasoning. Here were the final results though, in a benchmark I do pay attention to, the MMMU. You can see that Llama 3 with vision scores 64.5% versus Claude 3.5 Sonnet's 68.3%.

GPT-4o is better at 69.1% and I can believe that. And very quickly on the video data that Meta used for training Llama 3's video understanding. Well, they don't say it, but they strongly imply that they're using Instagram Reels. Now, anyone who knows can correct me, but the duration and resolution of the videos does seem to hint at that.

If that's true, well then like Google they can of course flex those muscles of the masses of data that they have that people like OpenAI wouldn't necessarily have. Yes, by the way, they are working on speech generation as well as speech understanding, so you should be able to talk eventually to Llama 3.1 just like we were promised with GPT-4o.

They even claim that their speech recognition is better than Whisper v2 and, for multilingual scenarios, Whisper v3. Now admittedly, this experiment was using Whisper v3, but just look at the speed, in this case using Groq, at which these smaller Llama 3 models can act. Located, can you tabularize it for me?

Can you add a duration column? Can you remove the end times from the time column? Can you make the duration in minutes? And can you move the duration to between the time and stop column? Can you add lunch and dinner at a nice restaurant? You know what, I changed my mind.

Make it Vancouver. Of course, for time reasons, I've got to draw this video to an end, but there were countless more experiments with training models revealed throughout the paper. And speaking of tracking experiments, you may already know that AI labs, including OpenAI, have used Weights & Biases, this video's sponsor, to track frontier machine learning experiments, as well as visualize, iterate on, optimize, and share them.

But you might not know that Weights & Biases now have Weave, a lightweight toolkit to confidently iterate on LLM applications, and that they produce free prompting and LLM agent courses on their website. And if you didn't know that, do let them know that you came from this video, the link is in the description.

And so let's conclude with Meta's conclusion. They say, and I agree, that in many ways, the development of high-quality foundation models is still in its infancy. Our experience in developing Llama 3 suggests that substantial further improvements of these models are on the horizon. They go on to admit that they did explore more complex model architectures and training recipes, but did not find the benefits of such approaches to outweigh the additional complexity that they introduce into model development.

Like you, I can't wait, of course, to compare Llama 3.1 with Gemini 2 and GPT-5. And they had the right plan to ensure that Llama 3 was not accidentally overfitted on commonly used benchmarks: their pre-training data was not only procured but also processed by a separate team that was, they say, strongly incentivized to prevent contamination of that pre-training data.

The model's performance on my SimpleBench does imply that their benchmark results aren't fluky. And they end with this: we hope that the release of Llama 3 encourages the industry to embrace the open, in quotes, "responsible" development of AGI. Let me know what you think in the comments, and as always, have a wonderful day.