Llama 405b: Full 92 page Analysis, and Uncontaminated SIMPLE Benchmark Results
00:00:00.000 |
The llama herd has left the building and is roaming the streets. 00:00:05.200 |
More specifically, the Llama 3.1 405 billion parameter language model is out, 00:00:11.280 |
but I thought the former expression would be a bit more dramatic. 00:00:15.360 |
The 92-page paper that came with the model was released less than 24 hours ago, 00:00:21.680 |
and yes, I've read it in full and have benchmarked the model, 00:00:25.360 |
comparing it to four competitors on over 100 private questions with 74 notes to touch on. 00:00:33.440 |
The model is impressive and the paper is revealing, so let's get started. 00:00:39.920 |
There are three sizes of text-only Llama 3 models, but this video will focus almost entirely 00:00:46.640 |
on the biggest and best, the 405 billion parameter model. 00:00:50.400 |
And no, Meta weren't exaggerating when they say that it delivers comparable quality to leading models like GPT-4. 00:01:00.320 |
And I'm not just relying on those traditional benchmarks to assess that comparison. 00:01:05.360 |
Meta's innovations in a nutshell were higher-quality, more rigorously filtered data, and a dramatic increase in scale. 00:01:14.560 |
Indeed, the sheer scale of compute, way more than 10 to the 25 floating point operations, 00:01:20.480 |
was so big that at one point the EU classified that as presenting a systemic risk. 00:01:26.960 |
So whether that scares you or hypes you, let's see the results of all of those flops. 00:01:32.960 |
Here is just a quick snapshot of a comparison on traditional benchmarks 00:01:37.440 |
between Llama 3.1 405B and GPT-4, GPT-4o, and Claude 3.5 Sonnet. 00:01:45.200 |
I don't think these benchmarks quite capture the nuanced differences between the models, 00:01:49.920 |
but it certainly shows you that this new "open source" model from Meta is competitive at the frontier. 00:01:57.840 |
Of course, it doesn't yet have all the fancy speech in and speech out that GPT-4 Omni does, 00:02:04.080 |
but technically we don't have access to that yet either for that model. 00:02:08.160 |
I do think it's worth noting though, just for 10 seconds, that we have a downloadable model now 00:02:14.320 |
that is as good or better than the GPT-4 that caused such waves early last year. 00:02:20.560 |
At the time, people thought that day might take 2 years or even 5 years, but no, it's here. 00:02:26.320 |
And yes, Meta are still arguing that this series of models charts a "responsible path" 00:02:33.680 |
towards the development of Artificial General Intelligence. 00:02:37.360 |
I'll at least have a few comments on that when it comes to the end of this video. 00:02:43.120 |
Just quickly, why do I keep saying "open source"? 00:02:45.600 |
Well, according to the semi-official Open Source Initiative, 00:02:49.520 |
the definition of Open Source AI includes transparency about the training data provenance, which Meta doesn't provide. 00:02:56.880 |
The paper on page 4 simply says "from a variety of data sources". 00:03:02.560 |
So even if you had the budget, you couldn't recreate Llama 3.1 00:03:06.400 |
because you simply wouldn't know what data they used. 00:03:08.880 |
Indeed, I did a video on this for my new Coursera course, 00:03:12.080 |
but just remember that caveat anytime you hear that Meta is committed to Open Source AI. 00:03:17.520 |
I mean, just look how many times in this one paragraph, 00:03:20.480 |
Mark Zuckerberg used the phrase "open source". 00:03:25.600 |
Well, as the New York Times recently reported, 00:03:28.480 |
the data is getting harder and harder to find. 00:03:31.200 |
Companies like Reddit and Twitter are charging for their data 00:03:34.800 |
and Meta may not have had permission for all of that data. 00:03:38.560 |
And one theme you'll see throughout the paper 00:03:40.960 |
is using language models to improve the performance of language models. 00:03:45.600 |
Using Llama 2, for example, to filter the data used to train Llama 3. 00:03:51.600 |
That's just one example, there are literally dozens. 00:03:54.640 |
So you can bet that Llama 3.1 is being used to help train Llama 4. 00:04:00.640 |
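Just to make that concrete, here is a minimal sketch of what model-in-the-loop data filtering can look like. To be clear, the `llama2_chat` helper and the prompt wording are my own assumptions, not Meta's published pipeline; the paper only says that Llama 2 was used to judge document quality, with fast classifiers then trained on those judgments.

```python
# Hypothetical sketch of LLM-based quality filtering: use a strong model
# to label a sample of documents, then train a cheap classifier on those
# labels so the entire web-scale corpus can be scored quickly.
QUALITY_PROMPT = (
    "Does the following document meet high quality standards for clarity, "
    "correctness and educational value? Answer only YES or NO.\n\n{doc}"
)

def label_quality(llama2_chat, documents):
    """Return (document, is_high_quality) pairs as judged by the LLM."""
    labels = []
    for doc in documents:
        reply = llama2_chat(QUALITY_PROMPT.format(doc=doc[:4000]))  # truncate long docs
        labels.append((doc, reply.strip().upper().startswith("YES")))
    return labels

# These labels would then train a lightweight classifier that runs over
# every candidate document, keeping only the predicted-high-quality ones.
```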
Before you predict, though, that this is setting off some form of intelligence explosion, 00:04:04.560 |
remember that it was just yesterday that Zuckerberg admitted 00:04:07.920 |
that the Llama models are hemorrhaging money for Meta. 00:04:10.880 |
It's hard to know in advance when something is good enough 00:04:14.240 |
that you're going to have a product that billions of people use, 00:04:16.720 |
and then when it's ready to kind of be a large business. 00:04:19.760 |
And I mean, look, we're all spending, you know, a lot of capital 00:04:25.200 |
So I think that people are going to be probably losing money for quite a while. 00:04:28.800 |
But I don't know, maybe that'll all happen quicker. 00:04:33.040 |
And even OpenAI might be losing $5 billion this year alone. 00:04:39.040 |
That's at least according to a report released by The Information. 00:04:44.320 |
But we do know that Llama 4 is coming, and probably before the end of the year. 00:04:49.520 |
How do you define AGI, and do you get there first? 00:04:54.080 |
We're basically already starting to work on Llama 4. 00:04:58.080 |
And our goal is to completely close the gap with all the others on that. 00:05:05.680 |
I mean, I think that there will probably be some breakthroughs between now and then. 00:05:08.000 |
It's hard to just predict in a straight line. 00:05:10.080 |
Then you get to the more complicated question, which is like, what is it? 00:05:13.040 |
I don't know that there's one specific definition for this. 00:05:16.720 |
Throughout the paper, they give away their recipe for doing what they did. 00:05:20.960 |
Having read both the original Llama paper and the Llama 2 paper, 00:05:25.520 |
it almost feels like they're much more confident 00:05:28.240 |
giving away the secrets of large language models. 00:05:31.280 |
They almost don't believe that there's much of a secret sauce at all. 00:05:37.120 |
And Claude 3.5 Sonnet aside, they've almost proven that with this model. 00:05:42.080 |
I must say that there was one part of the paper that I found especially sensational. 00:05:46.800 |
They developed scaling laws, not just for next token prediction loss, 00:05:51.040 |
but for benchmark performance, or to somewhat translate that, 00:05:54.800 |
how long to run the GPUs to get the benchmark performance that they wanted. 00:06:00.240 |
Given their flop budget, they predicted how the model would perform and got it right, 00:06:06.000 |
only just slightly underestimating final performance. 00:06:09.600 |
Or in their words, this approach enables us to predict downstream task performance, 00:06:14.880 |
given a specific number of training flops for compute optimal models. 00:06:19.200 |
They set themselves a compute budget and got the benchmark performance that they expected. 00:06:25.040 |
It's almost a bit like you can imagine a benchmark performance dial 00:06:29.680 |
in Mark Zuckerberg's office that he can move clockwise at will. 00:06:38.160 |
These scaling laws, by the way, extrapolate across four orders of magnitude, so are pretty reliable. 00:06:42.960 |
In case you're wondering, that's where they got the quirky 405 billion parameter number from. 00:06:48.000 |
They had the compute budget, looked at those benchmark scaling laws, and derived the parameter count that would hit their target. 00:06:54.560 |
On the right is the sigmoidal scaling law curve that they anticipated for the ARC Challenge benchmark. 00:07:02.640 |
That's not, by the way, the ARC-AGI challenge that I've talked about on this channel recently, 00:07:07.280 |
but it is legit questions like this that you can see here. 00:07:10.400 |
General knowledge and what they call a reasoning challenge. 00:07:14.160 |
Now, just how many benchmarks that scaling law holds for 00:07:17.920 |
is a question that I, at least, am immensely curious about. 00:07:21.760 |
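To illustrate the two-stage idea, here is a minimal sketch with entirely made-up data points standing in for Meta's small-scale runs: fit a power law from compute to negative log-likelihood on the benchmark's correct answers, fit a sigmoid from that log-likelihood to accuracy, then extrapolate to the full budget.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical small-scale runs: (training FLOPs, benchmark NLL, accuracy).
flops = np.array([1e20, 1e21, 1e22, 1e23])
nll = np.array([1.20, 0.95, 0.75, 0.60])
acc = np.array([0.30, 0.42, 0.55, 0.66])

# Stage 1: power law NLL(C) = a * C^b, fitted in log space for stability.
b, log_a = np.polyfit(np.log(flops), np.log(nll), 1)  # b comes out negative

def predict_nll(c):
    return np.exp(log_a) * c ** b

# Stage 2: sigmoid mapping benchmark NLL to benchmark accuracy.
def sigmoid(x, k, x0):
    return 1.0 / (1.0 + np.exp(k * (x - x0)))

(k, x0), _ = curve_fit(sigmoid, nll, acc, p0=[5.0, 0.9])

# Extrapolate to the full compute budget (~3.8e25 FLOPs for the 405B run).
nll_big = predict_nll(3.8e25)
print(f"predicted NLL {nll_big:.3f} -> accuracy {sigmoid(nll_big, k, x0):.1%}")
```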
I'll come back to benchmarks, but the amount of detail they went into, 00:07:25.760 |
down to the exact hardware issues they had, is quite incredible. 00:07:29.920 |
They even note at one point that temperature fluctuations during the day caused a 1 to 2% variation in GPU throughput. 00:07:37.360 |
And slightly more concerningly, the fluctuations of power consumption 00:07:41.760 |
across the data center stretched the limits of the power grid. 00:07:46.720 |
It does make me, at least, wonder what kind of issues they'll have when training Llama 4 at even greater scale. 00:07:52.880 |
Now, clearly, because it is a 92-page paper, I am skipping over a lot, 00:07:57.200 |
but I do want to bring you the most interesting highlights. 00:08:00.080 |
For example, there was this detail about how they obsessively cleaned the data. 00:08:04.720 |
They found an annoying problem that was too common in their data. 00:08:08.640 |
Overly apologetic tonal issues. Phrases such as "I'm sorry" or "I apologize". 00:08:14.320 |
They didn't want that, nor did they want excessive emojis or exclamation points. 00:08:19.360 |
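The paper doesn't spell out the exact heuristics, but a rule-based version of that tonal filter might look something like this sketch, where the patterns and thresholds are my assumptions:

```python
import re

# Drop or down-weight training samples with over-apologetic phrasing,
# excessive emojis, or too many exclamation points.
APOLOGY_PATTERN = re.compile(r"\b(i'?m sorry|i apologi[sz]e)\b", re.IGNORECASE)
EMOJI_PATTERN = re.compile(r"[\U0001F300-\U0001FAFF]")

def is_clean(sample: str, max_emojis: int = 3, max_exclaims: int = 3) -> bool:
    if APOLOGY_PATTERN.search(sample):
        return False
    if len(EMOJI_PATTERN.findall(sample)) > max_emojis:
        return False
    return sample.count("!") <= max_exclaims

dataset = [
    "I'm sorry, but I can't help with that.",
    "Here's how gradient checkpointing works...",
    "Great question!!!! Amazing!!!! 🎉🎉🎉🎉",
]
print([s for s in dataset if is_clean(s)])  # keeps only the middle sample
```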
Back to that theme, though, of AI improving AI, 00:08:22.480 |
they trained a code expert model to help them find 00:08:26.480 |
the highest quality human annotations for code. 00:08:29.280 |
Five pages on in the paper, they say that they trained a multilingual expert model 00:08:34.640 |
to collect higher quality annotations in non-English languages. 00:08:38.480 |
And it seems appropriate at this point to mention that Meta, for the first time, 00:08:42.560 |
allow you to use this frontier model to generate synthetic data to train and improve other models. 00:08:49.760 |
They didn't allow that before, and nor did companies like OpenAI, to the best of my knowledge. 00:08:54.480 |
So that flywheel of models improving models is now technically open to you. 00:08:59.440 |
Now, you do have to be slightly sophisticated about it, though. 00:09:03.120 |
When they trained Llama 3 405B on its own generated data in programming, it wasn't actually helpful on its own. 00:09:10.560 |
Notice that is different from those last two examples. 00:09:12.800 |
This is the same model training on its own generated data. But with execution feedback, 00:09:19.280 |
which I've talked about quite a lot on this channel, 00:09:21.840 |
it did enable the model to learn from its own mistakes. 00:09:25.600 |
And anyone who has been following this channel knows that I talk often about 00:09:29.440 |
verifier models, and Llama 3 indeed incorporated that approach during training. 00:09:35.280 |
In coding, for example, only generations that pass syntax checking and unit tests made it into the fine-tuning data. 00:09:42.400 |
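As a rough sketch of that gate, assuming a hypothetical pool of model-generated code paired with unit tests, the filter might look like this; the real pipeline is more involved, feeding error messages back to the model for revision:

```python
import ast
import subprocess
import sys
import tempfile

def passes_checks(code: str, test_code: str) -> bool:
    """Keep a generated sample only if it parses and its unit tests pass."""
    try:
        ast.parse(code)  # syntax check first: cheap, and catches most junk
    except SyntaxError:
        return False
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=30)
    except subprocess.TimeoutExpired:
        return False  # infinite loops count as failures too
    return result.returncode == 0

# Hypothetical usage:
# training_pool = [s for s in generated if passes_checks(s.code, s.tests)]
```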
But for maths and reasoning, the story is even more interesting. 00:09:46.560 |
First, they give a curious definition of reasoning. 00:09:49.840 |
We define reasoning as the ability to perform multi-step computations and arrive at the correct final answer. 00:09:56.000 |
I'm definitely going to leave a question mark on that one, 00:09:58.720 |
because under that definition, wouldn't a calculator be doing reasoning? 00:10:02.880 |
But the interesting bit is how they say that training data on the web 00:10:07.280 |
shows a shortage of ground-truth correct chains of thought for reasoning and math. 00:10:13.280 |
But those are essential for guiding the model 00:10:16.080 |
how to break down the problem step-by-step and reach the final answer. 00:10:19.680 |
In other words, most online text contains results and analysis, 00:10:23.840 |
not the chains of thought involved in coming up with those results. 00:10:28.080 |
Then they quote directly from the Let's Verify Step-by-Step paper 00:10:32.080 |
that I've talked about many times on this channel. 00:10:36.720 |
"They identified mathematical skills where the model underperforms 00:10:41.040 |
and actively sourced prompts from humans to teach the models such skills." 00:10:45.280 |
And then they used the model, Llama 3, itself to check the reasoning steps, training step-wise reward models. 00:10:52.400 |
In other words, training a model to recognize good steps in a reasoning chain. 00:11:00.560 |
They then filtered out training data where those intermediate reasoning steps were incorrect. 00:11:03.760 |
So not just the final results, the reasons used to get those final results. 00:11:08.480 |
They wanted to eliminate invalid reasoning traces. 00:11:12.080 |
And for the hardest prompts, they even used Monte Carlo Tree Search, 00:11:15.920 |
a bit like AlphaGo, with those process-based reward models guiding the search. 00:11:22.320 |
Translated, they searched as hard as they could 00:11:25.280 |
to find the best reasoning steps to teach the model reasoning. 00:11:29.200 |
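In code, that filtering step might look something like this sketch, where `score_step` is a hypothetical wrapper around a step-wise (process) reward model:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Chain:
    question: str
    steps: List[str]
    final_answer: str

def filter_chains(chains: List[Chain],
                  score_step: Callable[[str, List[str], str], float],
                  threshold: float = 0.8) -> List[Chain]:
    """Keep only chains whose every intermediate step the reward model approves."""
    kept = []
    for chain in chains:
        # Score each step given the question and the steps before it.
        if all(score_step(chain.question, chain.steps[:i], step) >= threshold
               for i, step in enumerate(chain.steps)):
            kept.append(chain)
    return kept
```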
And at this point, I can hold off no longer from talking about 00:11:32.720 |
my own private benchmark, what I call SimpleBench. 00:11:38.800 |
And there are a few things I love about this benchmark. 00:11:42.400 |
Obviously, I am ridiculously biased, so take this with a pinch of salt, 00:11:48.640 |
but I designed it to test the real reasoning intelligence of models. 00:11:51.680 |
First, it's fully private, so it hasn't been contaminated at all. 00:11:55.280 |
Second, it is rigorously vetted, not just by me, 00:12:01.280 |
so if even one mistake makes it into the final 100 or 200 questions, it gets caught and removed. 00:12:12.000 |
And third, model scores fall well, well, well behind the performance of humans, 00:12:20.080 |
except on any flawed questions, which of course won't make it into the final benchmark. 00:12:23.040 |
But for me, it has been the most reliable vibe test of how smart these models really are. 00:12:28.080 |
Now, I will be testing the models again using self-consistency. 00:12:31.600 |
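For anyone unfamiliar, self-consistency just means sampling several answers at a non-zero temperature and taking the majority vote. A toy sketch, with `ask_model` as a stand-in for a real API call:

```python
import random
from collections import Counter

def self_consistent_answer(ask_model, question: str, n_samples: int = 9) -> str:
    """Sample the model repeatedly and return the most common answer."""
    answers = [ask_model(question, temperature=0.7) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for a model that answers correctly 60% of the time;
# majority voting pushes the final accuracy well above 60%.
def ask_model(question: str, temperature: float) -> str:
    return "0 ice cubes" if random.random() < 0.6 else "20 ice cubes"

print(self_consistent_answer(ask_model, "How many whole ice cubes remain?"))
```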
But for now, we have Claude 3.5 Sonnet way ahead at 32%. 00:12:36.960 |
Llama 405B at 18%, well ahead of both versions of GPT-4 and Gemini 1.5. 00:12:44.800 |
Smaller models, by the way, in case you're curious, score close to 0%. 00:12:49.120 |
And here is one example that the new Llama model actually usually gets right. 00:12:58.320 |
It comes from the spatial intelligence section of the benchmark 00:13:01.760 |
and involves placing four whole ice cubes into a fire. 00:13:08.320 |
And then the question ends with how many whole ice cubes 00:13:12.240 |
can be found in the fire at the end of the third minute? 00:13:15.840 |
I even add in "pick the most realistic answer". 00:13:21.040 |
The correct answer is zero, reflecting that none of the ice cubes will be whole after three minutes in a fire. 00:13:27.040 |
Most models, of course, go down a rabbit hole of calculations. 00:13:30.720 |
Now, admittedly, this was one of the easier questions on the benchmark. 00:13:34.960 |
And if you add things like "think about this carefully" 00:13:38.080 |
or "this is a trick question", the models can sometimes get it. 00:13:44.000 |
But the point is that I can create genuine spatial, temporal, linguistic or social questions 00:13:49.280 |
that no amount of warnings allow the models to get right. 00:13:52.720 |
And yes, that's still with humans scoring near perfectly. 00:13:56.800 |
Why do they fail? Well, it's because, of course, the models are modelling language. 00:14:04.160 |
They don't actually visualise things in their head 00:14:07.200 |
or think about problems in the same way that we do. 00:14:10.400 |
So how would a model like Llama 3 ever get a question like this right? 00:14:15.360 |
Well, it's because I can leave, let's say, linguistic clues in the question, 00:14:25.040 |
Testing, if you will, their ability to pick up faint signal amidst the noise. 00:14:30.080 |
If I remove all signal, models score zero with humans still scoring almost perfectly. 00:14:38.320 |
I can separate the smart models from the less smart models. 00:14:43.280 |
I wish I could go through all the hundred plus questions with you. 00:14:48.240 |
But then, of course, it would leak into the training data inevitably. 00:14:53.600 |
Now, I have made the benchmark functional, so I can change the numbers in each question and sidestep memorization. 00:15:00.560 |
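A functional question is essentially a template. Here is a toy sketch of the idea using the ice cube example from earlier; the wording is mine, not the actual benchmark question:

```python
import random

def ice_cube_question(rng: random.Random) -> tuple[str, str]:
    """Re-randomize the numbers each run, so a memorized answer won't help."""
    per_minute = rng.randint(2, 6)
    minutes = rng.randint(3, 5)
    question = (
        f"You place {per_minute} whole ice cubes per minute into a roaring "
        f"fire for {minutes} minutes. How many whole ice cubes are in the "
        f"fire at the end of minute {minutes}? Pick the most realistic answer."
    )
    return question, "0"  # they melt, whatever the numbers say

question, answer = ice_cube_question(random.Random(42))
print(question, "->", answer)
```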
Many of you are thinking that was a very long way of saying that Llama 405B is good. 00:15:10.720 |
Not as good as Claude 3.5 Sonnet, but better, I think, in text at least than GPT-4o. 00:15:14.560 |
Now, you could say that part of this benchmark is somewhat adversarial 00:15:18.880 |
and Meta on page 33 talk about how adversarial tests 00:15:23.920 |
cause significantly worse performance than non-adversarial ones. 00:15:27.840 |
What they mean by that is that in some of the benchmarks that they used, 00:15:31.760 |
even a single distracting sentence at the end of a question 00:15:35.680 |
caused significantly worse performance than simply asking the question. 00:15:39.760 |
If the model was actually thinking about the question, that shouldn't happen. 00:15:43.760 |
And the paper highlights this without actually suggesting a solution. 00:15:47.440 |
For mathematical reasoning and question answering, however, 00:15:50.480 |
the adversarial performances are substantially lower than the non-adversarial performances. 00:15:56.640 |
This pattern is similar for pre-trained and post-trained models, full stop. 00:16:01.600 |
So much to cover, so I'm going to move swiftly on to contamination. 00:16:05.120 |
Through fascinating word matching or n-gram checks, 00:16:08.560 |
they found that contamination was rife in traditional benchmarks. 00:16:13.040 |
And these contamination scores in this column actually underestimate the problem. 00:16:17.920 |
They excluded benchmarks from this chart when the clean set had too few examples 00:16:23.840 |
or because the observed performance gain when they cleaned the data set was too erratic to trust. 00:16:32.560 |
Even when they allowed for a higher threshold of 8-word overlap, contamination was so pervasive 00:16:41.440 |
that it was impossible to get a good performance gain estimate. 00:16:46.800 |
They couldn't tell, in other words, how much contamination was affecting the MMLU scores. 00:16:50.240 |
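The core of such a check is simple. Here is a minimal sketch of the 8-word-overlap test, with the tokenization and the contamination criterion simplified relative to whatever Meta actually ran:

```python
def ngrams(text: str, n: int = 8) -> set:
    """All n-word sequences in a lowercased, whitespace-split text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_example: str, training_ngrams: set, n: int = 8) -> bool:
    # Flag the example if any of its 8-grams also appears in training data.
    return bool(ngrams(benchmark_example, n) & training_ngrams)

training_corpus = ["the quick brown fox jumps over the lazy dog near the river"]
training_ngrams = set().union(*(ngrams(doc) for doc in training_corpus))

example = "question: the quick brown fox jumps over the lazy dog near what?"
print(is_contaminated(example, training_ngrams))  # True: an 8-gram overlaps
```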
It seems like private benchmarks such as those from Scale.ai 00:16:53.840 |
and indeed mine will be more common in the future. 00:16:56.800 |
Here was the ranking for example in math by Scale.ai 00:17:00.560 |
with Claude 3.5 Sonnet in number one position. 00:17:03.520 |
At a glance, human comparisons leading to leaderboards like those from LMSYS seem authoritative, but they have their quirks too. 00:17:10.000 |
Even though Sam Altman said that we now have GPT-4o Mini near the top of that leaderboard, 00:17:16.480 |
in my own experiments, and let me know what you think in the comments, it just doesn't perform at that level. 00:17:21.280 |
Having Mini beating Claude 3.5 Sonnet just seems shocking to me. 00:17:27.120 |
LMSYS have noted the controversy and said that they're going to release a random 20% subset of those battles. 00:17:33.920 |
Back to the paper though, and here's another way 00:17:36.080 |
that Llama 405B does seem to be better than its rivals. 00:17:40.400 |
It has a long context of 128k tokens or around 100,000 words. 00:17:45.760 |
Now other models of course have more than that, but it's how well a model uses its context that counts. 00:17:49.520 |
It's when it's asked questions that rely on scouring through that long context that this model shines. 00:17:55.360 |
And annoyingly, they didn't compare it to Gemini 1.5 Pro, 00:17:58.960 |
but here it beats GPT-4, GPT-4o and Claude 3.5 Sonnet significantly. 00:18:07.440 |
Well, as you'd expect, I tracked down the paper behind one of those benchmarks, InfiniteBench, and read it in full. 00:18:11.360 |
And a typical question from that InfiniteBench was this. 00:18:15.040 |
With details strewn throughout a story the length of a novel, 00:18:18.400 |
they asked what colour dress did person A wear when A met B for the second time. 00:18:24.800 |
So the model would obviously have to track when A met B for the first time, 00:18:29.040 |
then the second time and what colour dress they were wearing. 00:18:35.920 |
Also, when there are multiple needles in a haystack, 00:18:39.120 |
a bit like if there's four passwords strewn throughout a long document, 00:18:43.360 |
it can't do this quite as well as GPT-4, apparently. 00:18:46.640 |
Or even, randomly, Llama 3 8 billion parameters. 00:18:53.040 |
It does seem a bit random to me to not compare it to Gemini 1.5 Pro 00:18:57.360 |
when that's its specialty, long context, but anyway. 00:19:01.040 |
Now, I will give some more credit to Meta for this. 00:19:04.640 |
They gave plenty of win-loss human comparisons with GPT-4, 00:19:08.960 |
not only in the paper, but also on the website of the Llama 3 release. 00:19:13.520 |
And most of those comparisons were actually unfavourable. 00:19:16.960 |
That's commendable honesty to include charts which make your model seem less good. 00:19:22.320 |
In the middle, you can see Llama 3 losing out to GPT-4o most of the time. 00:19:28.160 |
No, actually, it's losing in all of these comparisons across English reasoning, coding, etc. 00:19:32.880 |
Now again, as we've seen, human at-a-glance evaluation can't always be trusted, though. 00:19:40.560 |
On safety, they claim that the violation rate has dropped significantly for Llama 3 compared to Llama 2. 00:19:47.680 |
Now, normally a lower violation rate for safety would lead to an increased false refusal rate 00:19:53.680 |
when they refuse to answer simple, innocent questions, basically. 00:19:56.800 |
But actually, it still has a pretty low false refusal rate. 00:20:00.640 |
And they make this point that it is critical to consider false refusal as a countermetric 00:20:06.080 |
because a model that always refuses is maximally safe, cough, cough, Claude 3.5 Sonnet, but also not helpful. 00:20:12.800 |
The reference I'm making there is that Claude very frequently refuses innocuous requests compared to other models. 00:20:19.680 |
Anyway, so false refusals are definitely a thing and I'm glad Meta are aware of it. 00:20:26.320 |
they admit that Llama 3 is on average more susceptible to prompt injection 00:20:31.280 |
compared at least to GPT-4 or Gemini Pro, but it's better apparently than Mixtral. 00:20:36.160 |
But there's a wider point on safety that I do want to note. 00:20:39.440 |
It was only around a year ago that Mark Zuckerberg was receiving a letter 00:20:44.000 |
from two senators in America concerned about the leak of Llama 1, 00:20:49.280 |
talking about its potential for spam, fraud, malware, privacy violations, and harassment. 00:20:54.400 |
Now, clearly that letter went nowhere because they subsequently released 00:20:58.560 |
not only Llama 2, but Llama 3 open weights and downloadable. 00:21:03.440 |
And again, on the safety point, Leopold Aschenbrenner will be having a fit 00:21:08.160 |
because he says there's no point keeping models closed 00:21:11.360 |
because adversaries like China will simply steal the models anyway on a thumb drive. 00:21:16.000 |
So when I see letters like this from a couple of days ago to Sam Altman, 00:21:20.080 |
signed by around six senators asking him if he has indeed committed 20% of compute to safety research, 00:21:26.480 |
I just have a slight suspicion that OpenAI might completely ignore this. 00:21:33.440 |
I also want to commend Meta for being much more rigorous in their safety testing. 00:21:38.560 |
They got a set of volunteers and saw if there was any uplift in their ability to create 00:21:43.360 |
or at least ideate about chemical and biological weapons. 00:21:46.960 |
Basically when they had access to Llama 3 versus having no access. 00:21:53.360 |
And the analysis of these results showed no significant uplift in performance when using the model. 00:21:59.520 |
And honestly, that doesn't surprise me too much given how much data filtering went on. 00:22:04.480 |
Count me at least as being surprised if biological or chemical weapon data had made it through that filtering into the training data. 00:22:12.720 |
To their credit, OpenAI did a similar study almost six months ago, 00:22:16.880 |
which I talked about on my Patreon AI Insiders. 00:22:20.080 |
Now the vision, speech and video parts of Llama 3.1 aren't yet available. 00:22:25.360 |
Zuckerberg described some sort of mess up but didn't go into much more detail. 00:22:29.520 |
But they did have one interesting conjecture in the paper. 00:22:32.800 |
You might remember how Gemini 1.5 Pro and GPT-4o 00:22:36.960 |
are trained from the ground up to be multimodal. 00:22:39.760 |
That has advantages but Meta contends that a compositional approach, 00:22:44.880 |
as in separate models, is actually in some ways more advantageous. 00:22:50.000 |
Apparently, it's more efficient during inference. 00:22:52.640 |
Obviously, we can all judge this when it comes out. 00:22:54.880 |
But I do note that Noam Brown said that GPT-4o didn't turn out as well as OpenAI had hoped. 00:23:01.840 |
Here were the final results though in a benchmark I do pay attention to, the MMMU. 00:23:07.600 |
You can see that Llama 3 with vision scores 64.5% versus Claude 3.5 Sonnet's 68.3%. 00:23:15.920 |
GPT-4o is better at 69.1%, and I can believe that. 00:23:20.240 |
And very quickly on the video data that Meta used for training Llama 3v. 00:23:25.120 |
Well, they don't say it, but they strongly imply that they're using Instagram Reels. 00:23:32.720 |
I could be wrong, but the duration and resolution of the videos does seem to hint at that. 00:23:37.120 |
If that's true, well then like Google they can of course flex those muscles 00:23:41.520 |
of the masses of data that they have that people like OpenAI wouldn't necessarily have. 00:23:46.560 |
Yes, by the way, they are working on speech generation as well as speech understanding, 00:23:51.520 |
so you should be able to talk eventually to Llama 3.1 just like we were promised with GPT-4o. 00:23:58.320 |
They even claim that their speech recognition is better than Whisper v2. 00:24:06.320 |
Now admittedly, this next experiment was using Whisper v3, 00:24:09.360 |
but just look at the speed, in this case using Groq, that these smaller Llama 3 models can operate at. 00:24:22.320 |
Can you remove the end times from the time column? 00:24:31.200 |
And can you move the duration to between the time and stop column? 00:24:37.680 |
Can you add lunch and dinner at a nice restaurant? 00:24:45.120 |
You know what, I changed my mind. Make it Vancouver. 00:24:50.320 |
Of course, for time reasons, I've got to draw this video to an end, 00:24:56.960 |
but there were countless more experiments with training models revealed throughout the paper. 00:25:01.920 |
And speaking of tracking experiments, you may already know that AI labs, including OpenAI, 00:25:08.240 |
have used Weights & Biases, this video's sponsor, to track frontier machine learning experiments, 00:25:14.400 |
as well as visualize, iterate on, optimize, and share them. 00:25:18.240 |
But you might not know that Weights & Biases now have Weave, 00:25:22.160 |
a lightweight toolkit to confidently iterate on LLM applications, 00:25:26.800 |
and that they produce free prompting and LLM agent courses on their website. 00:25:31.360 |
And if you didn't know that, do let them know that you came from this video. 00:25:37.520 |
And so let's conclude with Meta's conclusion. 00:25:42.800 |
In many ways, they say, the development of high-quality foundation models is still in its infancy. 00:25:46.720 |
Our experience in developing Llama 3 suggests that substantial 00:25:50.240 |
further improvements of these models are on the horizon. 00:25:53.360 |
They go on to admit that they did explore more complex model architectures and training recipes, 00:25:58.000 |
but did not find the benefits of such approaches to outweigh the additional complexity they introduce in model development. 00:26:05.120 |
Like you, I can't wait, of course, to compare Llama 3.1 with Gemini 2, whenever that arrives. 00:26:11.360 |
And they had the right plan to ensure that Llama 3 was not accidentally overfitted 00:26:16.080 |
on commonly used benchmarks, and that their pre-training data 00:26:19.760 |
was not only procured, but processed by a separate team. 00:26:23.600 |
That was, they say, strongly incentivized to prevent contamination of that pre-training data. 00:26:29.760 |
The model's performance on my SimpleBench does imply that their benchmark results aren't fluky. 00:26:35.120 |
And they end with this, we hope that the release of Llama 3 00:26:38.080 |
encourages the industry to embrace the open, in quotes, "responsible" development of AGI. 00:26:44.000 |
Let me know what you think in the comments, and as always, have a wonderful day.