
Llama 405B: Full 92-Page Analysis, and Uncontaminated SIMPLE Benchmark Results


Whisper Transcript

00:00:00.000 | The llama herd has left the building and is roaming the streets.
00:00:05.200 | More specifically, the Llama 3.1 405 billion parameter language model is out,
00:00:11.280 | but I thought the former expression would be a bit more dramatic.
00:00:15.360 | The 92-page paper that came with the model was released less than 24 hours ago,
00:00:21.680 | and yes, I've read it in full and have benchmarked the model,
00:00:25.360 | comparing it to four competitors on over 100 private questions with 74 notes to touch on.
00:00:33.440 | The model is impressive and the paper is revealing, so let's get started.
00:00:39.920 | There are three sizes of text-only Llama 3 models, but this video will focus almost entirely
00:00:46.640 | on the biggest and best, the 405 billion parameter model.
00:00:50.400 | And no, Meta weren't exaggerating when they said that it delivers comparable quality
00:00:55.840 | to leading language models such as GPT-4.
00:00:58.720 | And in case you're new to the channel,
00:01:00.320 | I'm not just relying on those traditional benchmarks to assess that comparison.
00:01:05.360 | Meta's innovations in a nutshell were higher quality data that's filtered for quality
00:01:11.680 | and simply more compute, bigger scale.
00:01:14.560 | Indeed, the sheer scale of compute, way more than 10 to the 25 floating point operations,
00:01:20.480 | was so big that at one point the EU classified that as presenting a systemic risk.
00:01:26.960 | So whether that scares you or hypes you, let's see the results of all of those flops.
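As a rough sanity check on that compute figure, here is the standard 6ND back-of-the-envelope estimate. The 405 billion parameter count and the roughly 15.6 trillion training tokens are the figures reported for Llama 3.1; the approximation and the snippet below are just an illustration, not anything from the paper.

```python
# Back-of-the-envelope training-compute estimate using the common
# FLOPs ~= 6 * N * D rule of thumb (N = parameters, D = training tokens).
params = 405e9    # N: 405 billion parameters
tokens = 15.6e12  # D: ~15.6 trillion training tokens, as reported for Llama 3.1

flops = 6 * params * tokens
print(f"Estimated training compute: {flops:.2e} FLOPs")
# -> roughly 3.79e+25 FLOPs, comfortably above the EU's 10^25 systemic-risk threshold
```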
00:01:32.960 | Here is just a quick snapshot of a comparison on traditional benchmarks
00:01:37.440 | between Llama 3.1 405B and GPT-4, GPT-4o, and Claude 3.5 Sonnet.
00:01:43.440 | As I'll try to persuade you in a moment,
00:01:45.200 | I don't think these benchmarks quite capture the nuanced differences between the models,
00:01:49.920 | but it certainly shows you that this new "open source" model from Meta
00:01:54.880 | is on a par with, if not better than, GPT-4.
00:01:57.840 | Of course, it doesn't yet have all the fancy speech-in and speech-out that GPT-4 Omni does,
00:02:04.080 | but technically we don't have access to that yet either for that model.
00:02:08.160 | I do think it's worth noting though, just for 10 seconds, that we have a downloadable model now
00:02:14.320 | that is as good or better than the GPT-4 that caused such waves early last year.
00:02:20.560 | At the time, people thought that day might take 2 years or even 5 years, but no, it's here.
00:02:26.320 | And yes, Meta are still arguing that this series of models charts a "responsible path"
00:02:33.680 | towards the development of Artificial General Intelligence.
00:02:37.360 | I'll at least have a few comments on that when it comes to
00:02:40.640 | my private General Intelligence benchmark.
00:02:43.120 | Just quickly, why do I keep saying "open source"?
00:02:45.600 | Well, according to the semi-official Open Source Initiative,
00:02:49.520 | the definition of Open Source AI includes the training data provenance,
00:02:54.480 | where it comes from, how it was obtained.
00:02:56.880 | The paper on page 4 simply says "from a variety of data sources".
00:03:02.560 | So even if you had the budget, you couldn't recreate LLAMA 3.1
00:03:06.400 | because you simply wouldn't know what data they used.
00:03:08.880 | Indeed, I did a video on this for my new Coursera course,
00:03:12.080 | but just remember this anytime you hear that Meta is committed to Open Source AI.
00:03:17.520 | I mean, just look how many times in this one paragraph,
00:03:20.480 | Mark Zuckerberg used the phrase "open source".
00:03:22.960 | So why be shy about the data they're using?
00:03:25.600 | Well, as the New York Times recently reported,
00:03:28.480 | the data is getting harder and harder to find.
00:03:31.200 | Companies like Reddit and Twitter are charging for their data
00:03:34.800 | and Meta may not have had permission for all of that data.
00:03:38.560 | And one theme you'll see throughout the paper
00:03:40.960 | is using language models to improve the performance of language models.
00:03:45.600 | Using LLAMA 2, for example, to filter the data used to train LLAMA 3.
00:03:51.600 | That's just one example, there are literally dozens.
00:03:54.640 | So you can bet that LLAMA 3.1 is being used to help train LLAMA 4.
00:04:00.640 | Before you predict, though, that this is setting off some form of intelligence explosion,
00:04:04.560 | remember that it was just yesterday that Zuckerberg admitted
00:04:07.920 | that the Llama models are hemorrhaging money for Meta.
00:04:10.880 | It's hard to know in advance when something is good enough
00:04:14.240 | that you're going to have a product that billions of people use,
00:04:16.720 | and then when it's ready to kind of be a large business.
00:04:19.760 | And I mean, look, we're all spending, you know, a lot of capital
00:04:22.560 | on basically training these models.
00:04:25.200 | So I think that people are going to be probably losing money for quite a while.
00:04:28.800 | But I don't know, maybe that'll all happen quicker.
00:04:31.040 | It's hard to know exactly.
00:04:33.040 | And even OpenAI might be losing $5 billion this year alone.
00:04:39.040 | That's at least according to a report released by The Information
00:04:42.320 | while I was filming the video.
00:04:44.320 | But we do know that LLAMA 4 is coming, and probably before the end of the year.
00:04:49.520 | How do you define AGI, and do you get there first?
00:04:53.280 | Well, it's a good question.
00:04:54.080 | We're basically already starting to work on LLAMA 4.
00:04:58.080 | And our goal is to completely close the gap with all the others on that.
00:05:02.400 | So I don't know.
00:05:04.000 | I mean, do we get to AGI first?
00:05:05.680 | I mean, I think that there will probably be some breakthroughs between now and then.
00:05:08.000 | It's hard to just predict in a straight line.
00:05:10.080 | Then you get to the more complicated question, which is like, what is it?
00:05:13.040 | I don't know that there's one specific definition for this.
00:05:16.720 | Throughout the paper, they give away their recipe for doing what they did.
00:05:20.960 | Having read both the original LLAMA paper and the LLAMA 2 paper,
00:05:24.240 | this is quite different.
00:05:25.520 | It almost feels like they're much more confident
00:05:28.240 | giving away the secrets of large language models.
00:05:31.280 | They almost don't believe that there's much of a secret sauce,
00:05:35.120 | and they're not scared of China.
00:05:37.120 | And Claude 3.5 Sonnet aside, they've almost proven that with this model.
00:05:42.080 | I must say that there was one part of the paper that I found especially sensational.
00:05:46.800 | They developed scaling laws, not just for next token prediction loss,
00:05:51.040 | but for benchmark performance, or to somewhat translate that,
00:05:54.800 | how long to run the GPUs to get the benchmark performance that they wanted.
00:06:00.240 | Given their flop budget, they predicted how the model would perform and got it right,
00:06:06.000 | only just slightly underestimating final performance.
00:06:09.600 | Or in their words, this approach enables us to predict downstream task performance,
00:06:14.880 | given a specific number of training flops for compute optimal models.
00:06:19.200 | They set themselves a compute budget and got the benchmark performance that they expected.
00:06:25.040 | It's almost a bit like you can imagine a benchmark performance dial
00:06:29.680 | in Mark Zuckerberg's office that he can move clockwise at will,
00:06:34.240 | while the money lasts, of course.
00:06:36.080 | These benchmark scaling laws, by the way,
00:06:38.160 | extrapolate across four orders of magnitude, so are pretty reliable.
00:06:42.960 | In case you're wondering, that's where they got the quirky 405 billion parameter number from.
00:06:48.000 | They had the compute budget, looked at those benchmark scaling laws,
00:06:51.760 | and assigned that number of parameters.
00:06:54.560 | On the right is the sigmoidal scaling law curve that they anticipated,
00:06:59.760 | and that performance then followed, on the ARC Challenge.
00:07:02.640 | That's not, by the way, the ARC-AGI challenge that I've talked about on this channel recently,
00:07:07.280 | but it is legit questions like this that you can see here.
00:07:10.400 | General knowledge and what they call a reasoning challenge.
00:07:14.160 | Now, just how many benchmarks that scaling law holds for
00:07:17.920 | is a question that I, at least, am immensely curious about.
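For the curious, here is a minimal sketch of the kind of two-stage fit the paper describes for these benchmark scaling laws: a log-linear fit from training compute to the model's negative log-likelihood on a benchmark, then a sigmoid from that log-likelihood to accuracy. The function names and the toy numbers below are mine, purely for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Stage 1: map training compute (FLOPs) to the model's normalized negative
# log-likelihood (NLL) on the benchmark's correct answers (log-linear fit).
def nll_from_flops(flops, a, b):
    return a * np.log10(flops) + b

# Stage 2: map that NLL to benchmark accuracy with a sigmoid.
def accuracy_from_nll(nll, k, midpoint, ceiling):
    return ceiling / (1.0 + np.exp(k * (nll - midpoint)))

# Toy observations standing in for smaller training runs (illustrative numbers only).
flops = np.array([1e21, 1e22, 1e23, 1e24])
nll = np.array([1.10, 0.95, 0.82, 0.70])
acc = np.array([0.35, 0.48, 0.62, 0.74])

stage1, _ = curve_fit(nll_from_flops, flops, nll)
stage2, _ = curve_fit(accuracy_from_nll, nll, acc, p0=[5.0, 0.9, 1.0])

# Extrapolate both fits to a much larger compute budget.
budget_flops = 3.8e25
pred_nll = nll_from_flops(budget_flops, *stage1)
pred_acc = accuracy_from_nll(pred_nll, *stage2)
print(f"Predicted NLL: {pred_nll:.3f} -> predicted benchmark accuracy: {pred_acc:.1%}")
```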
00:07:21.760 | I'll come back to benchmarks, but the amount of detail they went into,
00:07:25.760 | down to the exact hardware issues they had, is quite incredible.
00:07:29.920 | They even note at one point that temperature fluctuations during the day
00:07:34.880 | impacted GPU dynamic voltage.
00:07:37.360 | And slightly more concerningly, the fluctuations of power consumption
00:07:41.760 | across the data center stretched the limits of the power grid.
00:07:46.720 | It does make me, at least, wonder what kind of issues they'll have
00:07:50.480 | when they scale up another 50x.
00:07:52.880 | Now, clearly, because it is a 92-page paper, I am skipping over a lot,
00:07:57.200 | but I do want to bring you the most interesting highlights.
00:08:00.080 | For example, there was this detail about how they obsessively cleaned the data.
00:08:04.720 | They found an annoying problem that was too common in their data.
00:08:08.640 | Overly apologetic tonal issues. Phrases such as "I'm sorry" or "I apologize".
00:08:14.320 | They didn't want that, nor did they want excessive emojis or exclamation points.
00:08:19.360 | Back to that theme, though, of AI improving AI,
00:08:22.480 | they trained a code expert model to help them find
00:08:26.480 | the highest quality human annotations for code.
00:08:29.280 | Five pages on in the paper, they say that they trained a multilingual expert model
00:08:34.640 | to collect higher quality annotations in non-English languages.
00:08:38.480 | And it seems appropriate at this point to mention that Meta, for the first time,
00:08:42.560 | allow you to use this frontier model to generate synthetic data
00:08:46.880 | to improve and train your smaller model.
00:08:49.760 | They didn't allow that before, and nor did companies like OpenAI, to the best of my knowledge.
00:08:54.480 | So that flywheel of models improving models is now technically open to you.
00:08:59.440 | Now, you do have to be slightly sophisticated about it, though.
00:09:03.120 | When they trained Llama 3 405B on its own generated data in programming,
00:09:08.560 | they found it wasn't helpful.
00:09:10.560 | Notice that is different from those last two examples.
00:09:12.800 | This is the same model training on its own generated data.
00:09:16.560 | But when they introduced execution feedback,
00:09:19.280 | which I've talked about quite a lot on this channel,
00:09:21.840 | it did enable the model to learn from its own mistakes.
00:09:25.600 | And anyone who has been following this channel knows that I talk often about
00:09:29.440 | verifier models, and Llama 3 indeed incorporated that approach during training.
00:09:35.280 | In coding, for example, only generations that pass syntax checking and unit tests
00:09:40.640 | were used for fine-tuning.
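To make that filter concrete, here is a minimal sketch of that kind of execution-feedback check, assuming a list of candidate completions and a block of unit tests per prompt; it illustrates the idea rather than Meta's actual pipeline, and a real version would sandbox the execution.

```python
import ast
import subprocess
import sys
import tempfile

def passes_syntax(code: str) -> bool:
    """Static check: keep only candidates that at least parse as valid Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def passes_unit_tests(code: str, test_code: str, timeout_s: int = 10) -> bool:
    """Execution feedback: run the candidate together with its unit tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def filter_for_finetuning(prompt: str, candidates: list, test_code: str) -> list:
    """Keep only (prompt, completion) pairs whose code parses and passes its tests."""
    return [
        {"prompt": prompt, "completion": code}
        for code in candidates
        if passes_syntax(code) and passes_unit_tests(code, test_code)
    ]
```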
00:09:42.400 | But for maths and reasoning, the story is even more interesting.
00:09:46.560 | First, they give a curious definition of reasoning.
00:09:49.840 | We define reasoning as the ability to perform multi-step computations
00:09:53.760 | and arrive at the correct final answer.
00:09:56.000 | I'm definitely going to leave a question mark on that one,
00:09:58.720 | because under that definition, wouldn't a calculator be doing reasoning?
00:10:02.880 | But the interesting bit is how they say that training data on the web
00:10:07.280 | shows a shortage of ground-truth correct chains of thought for reasoning and math.
00:10:13.280 | But those are essential for guiding the model
00:10:16.080 | how to break down the problem step-by-step and reach the final answer.
00:10:19.680 | In other words, most online text contains results and analysis,
00:10:23.840 | not the chains of thought involved in coming up with those results.
00:10:28.080 | Then they quote directly from the Let's Verify Step-by-Step paper
00:10:32.080 | that I've talked about many times on this channel.
00:10:34.240 | And they go on to say the following.
00:10:36.720 | "They identified mathematical skills where the model underperforms
00:10:41.040 | and actively source prompts from humans to teach the models such skills."
00:10:45.280 | And then they used the model, Llama 3, to check the reasoning steps
00:10:50.160 | behind a step-by-step solution.
00:10:52.400 | In other words, training a model to recognize good steps in a reasoning chain.
00:10:58.160 | They could then filter the training data
00:11:00.560 | where those intermediate reasoning steps were incorrect.
00:11:03.760 | So not just the final results, the reasons used to get those final results.
00:11:08.480 | They wanted to eliminate invalid reasoning traces.
00:11:12.080 | And for the hardest prompts, they even used Monte Carlo Tree Search,
00:11:15.920 | a bit like AlphaGo, with those process-based reward models
00:11:20.000 | to generate valid reasoning traces.
00:11:22.320 | Translated, they searched as hard as they could
00:11:25.280 | to find the best reasoning steps to teach the model reasoning.
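Here is a rough sketch of what that step-level filtering might look like; the score_step reward-model call, the data layout, and the threshold are placeholders of mine, not Meta's actual components.

```python
from typing import Callable, Dict, List

def filter_reasoning_traces(
    traces: List[Dict],
    score_step: Callable[[str, List[str], str], float],  # hypothetical step-wise reward model
    threshold: float = 0.5,
) -> List[Dict]:
    """Keep only traces whose final answer is correct AND whose every
    intermediate step is judged valid by the process reward model."""
    kept = []
    for trace in traces:
        if trace["final_answer"] != trace["ground_truth"]:
            continue  # outcome check: discard wrong answers outright
        steps_ok = all(
            score_step(trace["question"], trace["steps"][:i], step) >= threshold
            for i, step in enumerate(trace["steps"])
        )
        if steps_ok:
            kept.append(trace)  # a valid reasoning trace, usable for fine-tuning
    return kept
```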
00:11:29.200 | And at this point, I can hold off no longer from talking about
00:11:32.720 | my own private benchmark, what I call SimpleBench,
00:11:36.320 | to test general intelligence reasoning.
00:11:38.800 | And there are a few things I love about this benchmark.
00:11:42.400 | Obviously, I am ridiculously biased, so take this with a pinch of salt.
00:11:46.160 | But this is actually the benchmark I rely on
00:11:48.640 | to test the real reasoning intelligence of models.
00:11:51.680 | First, it's fully private, so it hasn't been contaminated at all.
00:11:55.280 | Second, it is rigorously vetted, not just by me,
00:11:58.560 | but by outside experts with more to come.
00:12:01.280 | If even one mistake makes it into the final 100 or 200 questions,
00:12:05.840 | I'll be pretty pissed off.
00:12:07.280 | But third, and I think most interestingly,
00:12:09.440 | even the best models, as you can see,
00:12:12.000 | fall well, well, well behind the performance of humans
00:12:16.400 | as I have anecdotally tested them.
00:12:18.400 | I'll show you one example in a moment,
00:12:20.080 | which of course won't make it into the final benchmark.
00:12:23.040 | But for me, it has been the most reliable vibe test
00:12:26.720 | that I've seen so far.
00:12:28.080 | Now, I will be testing the models again using self-consistency.
00:12:31.600 | But for now, we have Claude 3.5 Sonnet way ahead at 32%.
00:12:36.960 | Llama 405B at 18%, well ahead of both versions of GPT-4 and Gemini 1.5.
00:12:44.800 | Smaller models, by the way, in case you're curious,
00:12:46.800 | like GPT-4o Mini, score 0%.
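On the self-consistency testing mentioned a moment ago, this is the standard recipe, sketched with a placeholder ask_model function: sample several answers at a non-zero temperature and take the majority vote.

```python
from collections import Counter
from typing import Callable

def self_consistency_answer(
    question: str,
    ask_model: Callable[[str, float], str],  # placeholder: returns one sampled final answer
    n_samples: int = 9,
    temperature: float = 0.7,
) -> str:
    """Sample the model several times and return the most common final answer."""
    answers = [ask_model(question, temperature) for _ in range(n_samples)]
    majority_answer, count = Counter(answers).most_common(1)[0]
    print(f"{count}/{n_samples} samples agreed on: {majority_answer}")
    return majority_answer
```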
00:12:49.120 | And here is one example that the new Llama model actually usually gets,
00:12:54.720 | but GPT-4o basically never gets.
00:12:58.320 | It comes from the spatial intelligence section of the benchmark
00:13:01.760 | and involves placing four whole ice cubes into a fire.
00:13:06.160 | Then some more ice cubes into the fire.
00:13:08.320 | And then the question ends with how many whole ice cubes
00:13:12.240 | can be found in the fire at the end of the third minute?
00:13:15.840 | I even add in pick the most realistic answer.
00:13:18.880 | And no, the model doesn't pick zero,
00:13:21.040 | reflecting that none of the ice cubes will be whole
00:13:24.000 | or even still there after the third minute.
00:13:27.040 | Most models, of course, go down a rabbit hole of calculations.
00:13:30.720 | Now, admittedly, this was one of the easier questions on the benchmark.
00:13:34.960 | And if you add things like think about this carefully
00:13:38.080 | or this is a trick question, the models can sometimes get it.
00:13:41.840 | But I know the models well enough now
00:13:44.000 | that I can create genuine spatial, temporal, linguistic or social questions
00:13:49.280 | that no amount of warnings allow the models to get right.
00:13:52.720 | And yes, that's still with humans scoring near perfectly.
00:13:56.080 | How so?
00:13:56.800 | Well, it's because, of course, the models are modelling language.
00:14:00.720 | They're language models.
00:14:02.000 | They're not reality simulators.
00:14:04.160 | They don't actually visualise things in their head
00:14:07.200 | or think about problems in the same way that we do.
00:14:10.400 | So how would a model like Llama 3 ever get a question like this right?
00:14:15.360 | Well, it's because I can leave, let's say, linguistic clues,
00:14:19.040 | crumbs to allow them to infer the answer,
00:14:22.480 | even if they can't simulate the situation.
00:14:25.040 | Testing, if you will, their ability to pick up faint signal amidst the noise.
00:14:30.080 | If I remove all signal, models score zero with humans still scoring almost perfectly.
00:14:36.080 | But with just faint signals,
00:14:38.320 | I can separate the smart models from the less smart models.
00:14:42.400 | I'll be totally honest.
00:14:43.280 | I wish I could go through all the hundred plus questions with you
00:14:46.800 | because they're pretty fun.
00:14:48.240 | But then, of course, it would leak into the training data inevitably
00:14:52.000 | and contaminate the test.
00:14:53.600 | Now, I have made the benchmark functional so I can change the numbers.
00:14:57.600 | But still, I want to avoid that if possible.
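For anyone wondering what "functional" means here: each question is effectively a template whose numbers can be re-rolled, so a leaked instance can be regenerated with fresh values. A toy illustration of the idea, using the ice-cube example above rather than any real benchmark item:

```python
import random

def ice_cube_question(seed: int) -> dict:
    """Toy 'functional' benchmark item: the same scenario with fresh numbers per seed."""
    rng = random.Random(seed)
    first_batch = rng.randint(2, 6)
    second_batch = rng.randint(2, 6)
    question = (
        f"I place {first_batch} whole ice cubes into a roaring fire, "
        f"then a minute later add {second_batch} more. "
        "How many whole ice cubes can be found in the fire at the end of the third minute? "
        "Pick the most realistic answer."
    )
    return {"question": question, "answer": 0}  # realistically, none survive as whole cubes
```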
00:14:59.920 | Now, I get it.
00:15:00.560 | Many of you are thinking that was a very long way of saying that Llama 405B is good.
00:15:07.360 | Not quite as good as Claude 3.5 Sonnet,
00:15:10.720 | but better, I think, in text at least, than GPT-4o.
00:15:14.560 | Now, you could say that part of this benchmark is somewhat adversarial
00:15:18.880 | and Meta on page 33 talk about how adversarial tests
00:15:23.920 | cause significantly worse performance than non-adversarial ones.
00:15:27.840 | What they mean by that is that in some of the benchmarks that they used,
00:15:31.760 | even a single distracting sentence at the end of a question
00:15:35.680 | caused significantly worse performance than simply asking the question.
00:15:39.760 | If the model was actually thinking about the question, that shouldn't happen.
00:15:43.760 | And the paper highlights this without actually suggesting a solution.
00:15:47.440 | For mathematical reasoning and question answering, however,
00:15:50.480 | the adversarial performances are substantially lower
00:15:54.240 | than the non-adversarial performances.
00:15:56.640 | This pattern is similar for pre-trained and post-trained models, full stop.
00:16:01.600 | So much to cover, so I'm going to move swiftly on to contamination.
00:16:05.120 | Through fascinating word matching or n-gram checks,
00:16:08.560 | they found that contamination was rife in traditional benchmarks.
00:16:13.040 | And these contamination scores in this column actually underestimate the problem.
00:16:17.920 | They excluded benchmarks from this chart when the clean set had too few examples,
00:16:23.840 | or when the observed performance gain from cleaning the data set
00:16:28.000 | showed extremely erratic behavior.
00:16:30.160 | And they go on to describe the MMLU.
00:16:32.560 | Even when they allowed for a higher threshold of 8-word overlap
00:16:36.640 | between the training data and the test,
00:16:38.960 | it gave such high contamination scores
00:16:41.440 | that it was impossible to get a good performance gain estimate.
00:16:44.720 | So they couldn't even really estimate
00:16:46.800 | how much contamination was affecting the MMLU scores.
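For reference, an n-gram contamination check of the kind described is straightforward to sketch: flag a benchmark example as contaminated if enough of its 8-grams also appear in the training corpus. The whitespace tokenization and the ratio threshold here are illustrative choices of mine, not the paper's exact settings.

```python
def ngrams(tokens: list, n: int = 8) -> set:
    """All n-grams (as tuples) of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_ratio(example_text: str, train_ngrams: set, n: int = 8) -> float:
    """Fraction of the example's 8-grams that also appear in the training data."""
    tokens = example_text.lower().split()  # crude whitespace tokenization, for illustration
    example_ngrams = ngrams(tokens, n)
    if not example_ngrams:
        return 0.0
    return len(example_ngrams & train_ngrams) / len(example_ngrams)

def is_contaminated(example_text: str, train_ngrams: set, threshold: float = 0.5) -> bool:
    """Flag a benchmark example as contaminated above an (illustrative) overlap threshold."""
    return contamination_ratio(example_text, train_ngrams) >= threshold
```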
00:16:50.240 | It seems like private benchmarks such as those from Scale.ai
00:16:53.840 | and indeed mine will be more common in the future.
00:16:56.800 | Here was the ranking for example in math by Scale.ai
00:17:00.560 | with Claude 3.5 Sonnet in number one position.
00:17:03.520 | At a glance, human comparisons leading to leaderboards like those from LMSYS
00:17:08.160 | seem to be a bit more problematic.
00:17:10.000 | Even though Sam Altman said that we now have GPT-4o Mini
00:17:14.160 | matching GPT-4o's performance,
00:17:16.480 | in my own experiments, and let me know what you think in the comments,
00:17:19.760 | it's not even close.
00:17:21.280 | Having Mini beat Claude 3.5 Sonnet just seems shocking to me.
00:17:25.440 | Now LMSYS have addressed that
00:17:27.120 | and said that they're going to release a random 20% subset of those battles.
00:17:31.760 | So I will look at that with interest.
00:17:33.920 | Back to the paper though, and here's another way
00:17:36.080 | that Llama 405B does seem to be better than its rivals.
00:17:40.400 | It has a long context of 128k tokens or around 100,000 words.
00:17:45.760 | Now other models of course have more than that,
00:17:47.920 | but that's not why it's better.
00:17:49.520 | It's when it's asked questions that rely on scouring through that long context
00:17:54.160 | that it performs better.
00:17:55.360 | And annoyingly, they didn't compare it to Gemini 1.5 Pro,
00:17:58.960 | but here it beats GPT-4, GPT-4o and Claude 3.5 Sonnet significantly.
00:18:04.640 | What is this InfiniteBench En.QA?
00:18:07.440 | Well, as you'd expect, I tracked down that paper and read it in full.
00:18:11.360 | And a typical question from that InfiniteBench was this.
00:18:15.040 | With details strewn throughout a story the length of a novel,
00:18:18.400 | they asked what colour dress did person A wear when A met B for the second time.
00:18:24.800 | So the model would obviously have to track when A met B for the first time,
00:18:29.040 | then the second time and what colour dress they were wearing.
00:18:32.000 | On that, Llama 3.1 crushes Claude 3.5.
00:18:35.920 | Also, when there are multiple needles in a haystack.
00:18:39.120 | A bit like if there's four passwords strewn throughout a long document.
00:18:43.360 | It can't do this quite as well as GPT-4, apparently,
00:18:46.640 | or even, randomly, Llama 3 8 billion parameters,
00:18:50.320 | but it does far better than Claude 3.5 Sonnet.
00:18:53.040 | It does seem a bit random to me to not compare it to Gemini 1.5 Pro
00:18:57.360 | when that's its specialty, long context, but anyway.
00:19:01.040 | Now, I will give some more credit to Meta for this.
00:19:04.640 | They gave plenty of win-loss human comparisons with GPT-4,
00:19:08.960 | not only in the paper, but also on the website of the Llama 3 release.
00:19:13.520 | And most of those comparisons were actually unfavourable.
00:19:16.960 | That's commendable honesty to include charts which make your model seem less good.
00:19:22.320 | In the middle, you can see Llama 3 losing out to GPT-4o most of the time.
00:19:28.160 | No, actually, it loses in all of these comparisons, across English, reasoning, coding, etc.
00:19:32.880 | Now again, as we've seen, human at a glance evaluation can't always be trusted though.
00:19:38.320 | Now though, for a word on safety,
00:19:40.560 | and they claim that the violation rate is significantly lower for Llama 3
00:19:46.160 | than for its competitors.
00:19:47.680 | Now, normally a lower violation rate for safety would lead to an increased false refusal rate
00:19:53.680 | when they refuse to answer simple, innocent questions, basically.
00:19:56.800 | But actually, it still has a pretty low false refusal rate.
00:20:00.640 | And they make this point that it is critical to consider false refusal as a countermetric
00:20:06.080 | because a model that always refuses is maximally safe, cough, cough, Claude 3.5 Sonnet,
00:20:11.440 | but not always helpful.
00:20:12.800 | The reference I'm making there is that Claude very frequently compared to other models
00:20:17.360 | seems to refuse my innocent questions.
00:20:19.680 | Anyway, so false refusals are definitely a thing and I'm glad Meta are aware of it.
00:20:24.800 | And again, commendable honesty,
00:20:26.320 | they admit that Llama 3 is on average more susceptible to prompt injection
00:20:31.280 | compared at least to GPT-4 or Gemini Pro, but it's apparently better than Mixtral.
00:20:36.160 | But there's a wider point on safety that I do want to note.
00:20:39.440 | It was only around a year ago that Mark Zuckerberg was receiving a letter
00:20:44.000 | from two senators in America concerned about the leak of Llama 1,
00:20:49.280 | talking about its potential for spam, fraud, malware, privacy violations, and harassment.
00:20:54.400 | Now, clearly that letter went nowhere because they subsequently released
00:20:58.560 | not only Llama 2 but also Llama 3, open weights and downloadable.
00:21:03.440 | And again, on the safety point, Leopold Aschenbrenner will be having a fit
00:21:08.160 | because he says there's no point keeping models closed
00:21:11.360 | because adversaries like China will simply steal the models anyway on a thumb drive.
00:21:16.000 | So when I see letters like this from a couple of days ago to Sam Altman,
00:21:20.080 | signed by around six senators asking him if he has indeed committed
00:21:24.000 | 20% of their compute budget to safety,
00:21:26.480 | I just have a slight suspicion that OpenAI might completely ignore this
00:21:31.200 | and completely get away with it.
00:21:33.440 | I also want to commend Meta for being much more rigorous
00:21:36.240 | in how they pre-check models before release.
00:21:38.560 | They got a set of volunteers and saw if there was any uplift in their ability to create
00:21:43.360 | or at least ideate about chemical and biological weapons,
00:21:46.960 | basically when they had access to Llama 3 versus having no access.
00:21:50.880 | Both groups did have the internet at least.
00:21:53.360 | And the analysis of these results showed no significant uplift in performance
00:21:57.520 | related to usage of Llama 3.
00:21:59.520 | And honestly, that doesn't surprise me too much given how much data filtering went on.
00:22:04.480 | Count me at least as being surprised if biological or chemical weapon data
00:22:09.440 | still made it into the final model.
00:22:11.280 | I would hope not at least.
00:22:12.720 | To their credit, OpenAI did a similar study almost six months ago,
00:22:16.880 | which I talked about on my Patreon AI Insiders.
00:22:20.080 | Now the vision, speech and video parts of Llama 3.1 aren't yet available.
00:22:25.360 | Zuckerberg described some sort of mess up but didn't go into much more detail.
00:22:29.520 | But they did have one interesting conjecture in the paper.
00:22:32.800 | You might remember how Gemini 1.5 Pro and GPT-4o
00:22:36.960 | are trained from the ground up to be multimodal.
00:22:39.760 | That has advantages but Meta contends that a compositional approach,
00:22:44.880 | as in separate models, is actually in some ways more advantageous.
00:22:50.000 | Apparently, it's more efficient during inference.
00:22:52.640 | Obviously, we can all judge this when it comes out.
00:22:54.880 | But I do note that Noam Brown said that GPT-4o didn't turn out as well
00:22:59.200 | as they hoped with multimodal reasoning.
00:23:01.840 | Here were the final results though in a benchmark I do pay attention to, the MMMU.
00:23:07.600 | You can see that Llama 3 with vision scores 64.5% versus Claude 3.5 Sonnet's 68.3%.
00:23:15.920 | GPT-4o is better at 69.1%, and I can believe that.
00:23:20.240 | And very quickly on the video data that Meta used for training Llama 3v.
00:23:25.120 | Well, they don't say it but they strongly imply that they're using Instagram Reels.
00:23:30.480 | Now anyone who knows can correct me,
00:23:32.720 | but the duration and resolution of the videos does seem to hint at that.
00:23:37.120 | If that's true, well then like Google they can of course flex those muscles
00:23:41.520 | of the masses of data that they have that people like OpenAI wouldn't necessarily have.
00:23:46.560 | Yes, by the way, they are working on speech generation as well as speech understanding,
00:23:51.520 | so you should be able to talk eventually to Llama 3.1 just like we were promised with GPT-4o.
00:23:58.320 | They even claim that their speech recognition is better than Whisper v2
00:24:03.120 | and for multilingual scenarios Whisper v3.
00:24:06.320 | Now admittedly, this experiment was using Whisper v3,
00:24:09.360 | but just look at the speed, in this case using Groq, that these smaller Llama 3 models can run at.
00:24:15.280 | Located, can you tabularize it for me?
00:24:17.280 | Can you add a duration column?
00:24:22.320 | Can you remove the end times from the time column?
00:24:27.120 | Can you make the duration in minutes?
00:24:31.200 | And can you move the duration to between the time and stop column?
00:24:37.680 | Can you add lunch and dinner at a nice restaurant?
00:24:45.120 | You know what, I changed my mind. Make it Vancouver.
00:24:50.320 | Of course, for time reasons, I've got to draw this video to an end,
00:24:56.960 | but there were countless more experiments with training models revealed throughout the paper.
00:25:01.920 | And speaking of tracking experiments, you may already know that AI labs, including OpenAI,
00:25:08.240 | have used Weights & Biases, this video's sponsor, to track frontier machine learning experiments,
00:25:14.400 | as well as visualize, iterate on, optimize, and share them.
00:25:18.240 | But you might not know that Weights & Biases now have Weave,
00:25:22.160 | a lightweight toolkit to confidently iterate on LLM applications,
00:25:26.800 | and that they produce free prompting and LLM agent courses on their website.
00:25:31.360 | And if you didn't know that, do let them know that you came from this video,
00:25:35.920 | the link is in the description.
00:25:37.520 | And so let's conclude with Meta's conclusion.
00:25:40.640 | They say, and I agree, that in many ways,
00:25:42.800 | the development of high-quality foundation models is still in its infancy.
00:25:46.720 | Our experience in developing LLAMA 3 suggests that substantial
00:25:50.240 | further improvements of these models are on the horizon.
00:25:53.360 | They go on to admit that they did explore more complex model architectures and training recipes,
00:25:58.000 | but did not find the benefits of such approaches to outweigh the additional complexity
00:26:02.960 | that they introduce into model development.
00:26:05.120 | Like you, I can't wait, of course, to compare LLAMA 3.1 with Gemini 2
00:26:10.080 | and GPT-5.
00:26:11.360 | And they had the right plan to ensure that Llama 3 was not accidentally overfitted
00:26:16.080 | on commonly used benchmarks: their pre-training data
00:26:19.760 | was not only procured but also processed by a separate team
00:26:23.600 | that was, they say, strongly incentivized to prevent contamination of that pre-training data.
00:26:29.760 | The model's performance on my SimpleBench does imply that their benchmark results aren't fluky.
00:26:35.120 | And they end with this: we hope that the release of Llama 3
00:26:38.080 | encourages the industry to embrace the open, in quotes, "responsible" development of AGI.
00:26:44.000 | Let me know what you think in the comments, and as always, have a wonderful day.