Llama 2: Full Breakdown
Chapters
0:00
3:21 Reward Modeling
4:07 Helpfulness and Safety
6:03 Safety Testing in English
14:04 Llama 2 Widely Available
00:00:00.000 |
Less than 24 hours ago, Meta released Llama 2, their successor to the open-source Llama language 00:00:07.480 |
model that helped spawn a hundred others including Alpaca, Vicuña and of course Orca. 00:00:13.300 |
Within a few hours of release, I had read the fascinating 76-page technical paper, 00:00:18.420 |
the use guide, each of the many release pages, the full terms and conditions, 00:00:23.420 |
and I have run many of my own experiments. Let's start with the basics, it was trained on more 00:00:28.880 |
data, the biggest model has more parameters and the context length has doubled. They also spent 00:00:35.040 |
what must be tens of millions on fine-tuning it for chat, but I'll get into that more later. 00:00:40.440 |
But let's start with the benchmarks. They deliberately compared Llama 2 to Llama 1 and 00:00:47.780 |
other famous open-source models, but not with GPT-4. And in these benchmarks, 00:00:52.720 |
the trend is fairly clear. It crushes the other open-source language models, but is 00:00:57.940 |
more of an incremental change over Llama 1. 00:00:58.860 |
To massively simplify, the MMLU benchmark shows that it knows a lot about a lot of subjects, 00:01:07.240 |
but the human eval benchmark shows that it's not amazing at coding. 00:01:11.860 |
But now it's time for the paper and here are the highlights. 00:01:15.920 |
On data, they say they used more robust data cleaning and trained on 40% more total tokens. 00:01:24.060 |
They say they didn't include any data from Meta's products or services, 00:01:28.700 |
but what they did do is up-sample the most factual sources. 00:01:32.920 |
If you don't think that's much information about the data, you are correct, because all they say 00:01:37.780 |
is it was trained on a new mix of publicly available data. Absolutely no mention of 00:01:44.260 |
any sources here at all. After pre-training on those 2 trillion tokens, the models still 00:01:50.180 |
did not show any sign of saturation. The loss going down here represents an improvement, 00:01:55.200 |
and as you can see, they could have kept going. 00:01:57.820 |
On page 8, we have some quick comparisons with PaLM 2, the model behind Bard. 00:02:06.800 |
Obviously, this comparison doesn't look great for Llama 2, especially in coding, in this row. 00:02:12.480 |
But now let's compare it to other open source models. Here it is being better at coding, 00:02:17.660 |
common sense, reading comprehension, but notice it wasn't compared to Orca or Phi-1, both of 00:02:22.960 |
which I've done videos on, and I found that interesting given that both are apparently 00:02:26.940 |
set to be open sourced. Phi-1, for example, at only 1.3 billion parameters, got around 00:02:33.500 |
50% for code. And I'll get to more Orca comparisons in a moment. 00:02:38.360 |
What about the decision itself to release the model? As you can see here, they show 00:02:42.940 |
off a list of corporate supporters of the decision to open source the model. And then 00:02:48.940 |
if you remember the safety statement signed by all the top AGI labs and world experts 00:02:54.780 |
in AI. Well, I think Meta got a little bit of a shock. 00:02:56.060 |
They came up with their own statement of support for Meta's open approach to today's AI. 00:03:04.140 |
I'll let you decide if this list is as impressive as the other one, but I did note Marc Andreessen, 00:03:10.940 |
who is on the board of directors of Meta. Back to the paper, and they went into immense 00:03:16.200 |
detail about their reinforcement learning from human feedback process. Way too much for me 00:03:21.180 |
to cover in this video. The short version is that reward modeling is a way of telling the 00:03:26.020 |
base model which outputs humans prefer. And you can see the millions of human rated comparisons 00:03:32.200 |
that were used for Llama 2. Think of it as doggy training the model with treats and admonitions. 00:03:38.500 |
And interestingly, they trained two separate reward models, one optimized for helpfulness 00:03:43.220 |
and the other for safety. And they tried to make sure that the reward models or doggy 00:03:48.160 |
trainers were as smart as the dog itself. Or in technical speak, we initialized our 00:03:53.300 |
reward models from pre-trained chat model checkpoints. 00:03:55.980 |
In short, the reward model knows what the chat model knows. And that is to prevent cases 00:04:01.520 |
where the base model just hallucinates and the reward model can't tell the difference. 00:04:06.280 |
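To make that reward-modeling step a bit more concrete, here is a minimal sketch, assuming a generic PyTorch setup, of the kind of margin-augmented binary ranking loss the paper describes: the reward model scores a human-preferred response and a rejected response to the same prompt and is trained to separate them. The `reward_model` callable, the tensor shapes and the default margin here are my own placeholders, not Meta's actual training code.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(reward_model, prompt_ids, chosen_ids, rejected_ids, margin=0.0):
    """Pairwise ranking loss for a reward model (illustrative sketch only).

    reward_model maps token ids for (prompt + response) to one scalar score per example.
    The loss pushes the score of the human-preferred response above the rejected one
    by at least `margin`, in the spirit of the ranking loss described in the paper.
    """
    r_chosen = reward_model(torch.cat([prompt_ids, chosen_ids], dim=-1))      # (batch,)
    r_rejected = reward_model(torch.cat([prompt_ids, rejected_ids], dim=-1))  # (batch,)
    # loss = -log(sigmoid(r_chosen - r_rejected - margin))
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()
```

The two "doggy trainers" would then simply be two instances of this setup trained on different comparison data, one on helpfulness preferences and one on safety preferences, each initialized from the chat model checkpoint so the reward model knows what the chat model knows.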
They do describe at great length a trade-off though between helpfulness and safety, as 00:04:11.600 |
illustrated here. Someone asked, I'm going to be participating 00:04:14.880 |
in a comedy roast, what are some hilariously spicy roasts I can use? And on the right we 00:04:20.540 |
have the two doggy trainers, the safety reward model score and the helpfulness reward model 00:04:25.940 |
score. As we go down, more safety data is being ingested. And early on, as you can see, 00:04:30.980 |
the model is pretty quote unquote helpful giving these roasts. Obviously you can let 00:04:35.440 |
me know what you think of them, but note they get low safety scores. As the model gets more 00:04:40.440 |
safety training though, the safety score goes up, but the helpfulness score goes down. We 00:04:45.940 |
get more of these, I can't satisfy your request kind of answers. And I'm going to skip to 00:04:50.580 |
one of the experiments I was going to show you later, which is when I was trying to benchmark it. 00:04:55.900 |
I've applied to download the model, but at the moment this is just a Hugging Face space. 00:05:00.140 |
And I was trying to ask it a common sense question from the HellaSwag benchmark and 00:05:04.940 |
it just refused to answer. They call this in the paper false refusal and I find it happens 00:05:09.720 |
quite a lot. The paper claims on page 19 that the 70 billion parameter version of Llama 2 00:05:16.100 |
is more helpful than a particular version of ChatGPT, winning more often than it loses. 00:05:21.260 |
But later they admit something which I definitely agree with. While our results indicate that 00:05:25.860 |
Llama 2-Chat is on par with ChatGPT on human evaluations, it's important to note that human 00:05:32.020 |
evaluations have several limitations. It says the prompt set doesn't cover coding or reasoning 00:05:38.020 |
related prompts. They only evaluate the final generation of a multi-turn conversation and human 00:05:44.100 |
evaluation is inherently subjective and noisy. I like to judge models based on mathematics and 00:05:49.860 |
reasoning, so I might be biased in one direction. Also Llama 2 is not nearly as good 00:05:55.820 |
when you're using it in languages other than English, which is not surprising given the 00:06:00.460 |
language distribution in the pre-training data. I also find it interesting that they did all of 00:06:05.320 |
their safety testing in English and they warn developers before deploying any applications of 00:06:11.040 |
Llama 2, do your own safety testing and tuning tailored to your specific application. On compute 00:06:17.000 |
they don't say much other than that it was trained on A100s. I am sure Llama 3 will be trained on the 00:06:25.780 |
H100s, but apparently Meta has purchased more of those than any other company, including Microsoft. 00:06:31.220 |
Mind you Llama 2 was trained between January and July apparently, so it's understandable they used 00:06:37.400 |
the earlier A100s. Back to the decision to release and it does seem interesting to me that Meta and 00:06:43.760 |
Zuckerberg have seemingly ignored this letter from the US Senate. It was written in early June and 00:06:50.280 |
toward the end it said this: "By purporting to release Llama for the purpose of researching the abuse 00:06:55.740 |
of AI, Meta effectively appears to have put a powerful tool in the hands of bad actors to actually 00:07:01.980 |
engage in such abuse without much discernible forethought, preparation or safeguards." In the 00:07:08.420 |
paper they defend it and say this release promotes transparency, it democratizes the technology and 00:07:14.920 |
creates a more level playing field for organizations of all sizes across the globe to benefit from the 00:07:20.460 |
economic growth promised by the advancement of AI. But before anyone gets too enchanted by that, 00:07:25.700 |
Zuckerberg has recently said that they're only releasing because it's far away from AGI. 00:07:31.340 |
And Google's PaLM model, I think, also has about 10 times as many parameters. Now the 00:07:36.540 |
Llama models are very efficient so they perform well for something that's around 65 billion 00:07:40.940 |
parameters. So for me that was also part of this because there's this whole debate around you know 00:07:46.220 |
is it good for everyone in the world to have access to the most frontier AI models. And I think as the 00:07:55.660 |
models start approaching something that's like a super human intelligence, that's a bigger question 00:08:02.260 |
that we'll have to grapple with. But right now I mean these are still very basic tools. 00:08:07.780 |
I suspect that the bigger reason for release relates to an earlier answer he gave in the same 00:08:13.540 |
interview. Basically his researchers demanded it. 00:08:16.620 |
Part of this is we want to have the best people in the world researching this and a lot of the 00:08:22.300 |
best people want to know that they're going to be able to share their work. So that's 00:08:25.620 |
part of the deal that we have is that if you're one of the top AI 00:08:31.380 |
researchers in the world and come here you can get access to kind of industry scale infrastructure 00:08:36.060 |
and part of our ethos is that we want to share what's invented broadly. 00:08:41.940 |
And if Zuckerberg had refused to release some of those researchers could have just gone off and 00:08:47.580 |
made their own company as these guys did. Mistral AI is valued at 240 million despite being only four 00:08:55.580 |
weeks old and includes some key former employees of Meta. One even complained before deleting the 00:09:01.700 |
tweet about not being included in the author list of the Llama 2 paper. This was the pitch memo that 00:09:08.300 |
Mistral used to raise those hundreds of millions of euros and they focus on taking a more open 00:09:14.540 |
approach to model development. So the point still stands: if a CEO blocks a model being 00:09:19.580 |
open-sourced and the researchers want to, they can just defect to xAI or start their own company. 00:09:25.540 |
So in a way Zuckerberg had few options. I must say though that I did raise an eyebrow when I read 00:09:31.340 |
these paragraphs. This is on page 35 of the technical paper and they say not everyone who 00:09:36.780 |
uses AI models has good intentions. AI agents could potentially be used for nefarious purposes 00:09:42.540 |
such as misinformation or bioterrorism or cyber crime. However we have made efforts to tune the 00:09:47.980 |
models to avoid these topics, and indeed cyber criminals have already come up with WormGPT. 00:09:55.500 |
But Meta points them to their Responsible Use Guide, which I am sure they will follow. I read that 24 00:10:02.420 |
page guide and to be honest it was kind of a waste of time. They said pretty much nothing. It was 00:10:09.220 |
really bland and generic. Maybe that's harsh let me know if I missed something but it was all pretty 00:10:15.860 |
vague. They did try some red teaming only in English for things like the production of weapons 00:10:21.940 |
and lots of other risk categories. But you will be reassured, 00:10:25.460 |
first, that any such illegal or unlawful activity is against their terms and conditions, 00:10:31.900 |
and second that they are looking for the community to do further research and red teaming. Anyway I 00:10:37.340 |
am keen to do many more experiments but using this Gradio demo it basically failed to do a 00:10:43.980 |
proper sonnet, and when I asked it this question from the MATH benchmark, it said the question 00:10:49.420 |
does not make sense because the length of a rectangle being twice its width would mean the 00:10:55.420 |
rectangle is a square. Hmm. Anyway, it could just be a problem with that demo, because 00:11:01.220 |
GPT-3.5 crushes the sonnet about apples and has no problem with the length of a rectangle being twice 00:11:07.620 |
its width. Which brings me on to a benchmark that the Llama 2 paper did talk about on page 48. It was 00:11:15.380 |
on Social IQa, and they noted that Llama 1 actually did better than Llama 2. Here is the benchmark. 00:11:22.260 |
It's about common sense reasoning with questions such as these. 00:11:25.380 |
Alex spilled the food she just prepared all over the floor and it made a huge mess. What will Alex 00:11:30.660 |
want to do next? Taste the food, mop up, run around in a mess. And again, apparently Llama 1 actually 00:11:37.180 |
does slightly better on those kinds of questions. Another benchmark that you can see Llama 1 being as 00:11:43.100 |
good as Llama 2 at is BoolQ. That's a benchmark testing yes or no questions, but it's harder than 00:11:48.980 |
that. You have to read a lot of context to get the answer right. I just want you to remember some of 00:11:54.140 |
these benchmarks when you hear all the influencers talk about Llama 2 completely changing everything. Also if someone 00:12:00.020 |
says it's the best model of its size, look at Llama 2 13 billion parameters. Of course it depends on 00:12:06.340 |
the benchmark, but it got 21.7 percent in AQuA-RAT. That's a test of mathematical reasoning, and Orca, 00:12:13.380 |
at the exact same size of 13 billion parameters got almost 28 percent. So even pound for pound 00:12:19.820 |
it may not be the best in all categories. To be honest, I feel like there might be a 00:12:25.300 |
struggle going on behind the scenes at Microsoft about whether to open source Orca and Phi-1. There 00:12:31.140 |
were some bonus interesting things about the paper like introducing ghost attention which to 00:12:36.580 |
oversimplify means that the model pays attention, over multiple turns of the conversation, to something 00:12:42.500 |
you might have originally told it, such as always act as Napoleon from now on. Essentially these 00:12:47.460 |
diagrams show that with ghost attention the model pays more attention to that original command act 00:12:53.060 |
as Oscar Wilde or always act as Napoleon from now on. 00:12:55.260 |
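To make that concrete, here is a heavily simplified sketch of the Ghost Attention idea as I read it: an instruction is stitched onto every user turn when sampling synthetic multi-turn data, then stripped back out of all but the first turn before fine-tuning, so the model learns to keep obeying an instruction it only sees once. The function name, message format and `generate_reply` callable are my own illustrative choices, not code from the paper.

```python
def build_gatt_sample(instruction, dialogue, generate_reply):
    """Simplified Ghost Attention-style data construction (illustrative sketch only).

    instruction:    e.g. "Always act as Napoleon from now on."
    dialogue:       list of user messages forming a multi-turn conversation
    generate_reply: any chat-model callable that takes a message history and returns a reply
    """
    # 1. Sample assistant replies with the instruction prepended to EVERY user turn,
    #    so the sampled replies consistently respect the instruction.
    history, replies = [], []
    for user_msg in dialogue:
        history.append({"role": "user", "content": instruction + "\n" + user_msg})
        reply = generate_reply(history)
        history.append({"role": "assistant", "content": reply})
        replies.append(reply)

    # 2. Build the fine-tuning sample with the instruction kept only in the first turn,
    #    so the model must keep following it in later turns without seeing it again.
    sample = []
    for i, (user_msg, reply) in enumerate(zip(dialogue, replies)):
        content = instruction + "\n" + user_msg if i == 0 else user_msg
        sample.append({"role": "user", "content": content})
        sample.append({"role": "assistant", "content": reply})
    return sample
```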
The authors also throw in this observation that LLMs have internalized the concept of time and that 00:13:03.780 |
despite their training being solely based on next token prediction and data that is randomly 00:13:10.100 |
shuffled without regard to their chronological context the models pick up a general sense of 00:13:15.300 |
what time is. Even when provided with minimal data, they know what people at that time wouldn't have known. 00:13:21.220 |
For example, with a knowledge cutoff of 1940, when asked who wanted 00:13:25.220 |
to win the Second World War, they say: I'm not sure what you're referring to, my knowledge stopped in 00:13:29.660 |
1940. Right at the end of the report I know many people will be shocked to hear that when they did 00:13:35.340 |
a sentiment analysis of the model, they found that Llama 2's sentiment for right wing was higher 00:13:42.460 |
than for left wing. You may even want to pause and look at this page from a sociological perspective 00:13:48.940 |
because if Llama 2 was trained on a semi-random swathe of the internet this could be like a 00:13:55.180 |
snapshot of the sentiment analysis of all of these terms across the internet. Anyway in what may have 00:14:01.220 |
been a surprising twist for some Microsoft and Meta teamed up to make Llama 2 widely available 00:14:08.420 |
and we get news that Llama 2 may soon be on your phone and PC. Although I think Meta want to be 00:14:14.720 |
paid if it's going to come to your iPhone with this curious clause requiring permission if you 00:14:20.340 |
have more than 700 million monthly active users. I don't know whether they were thinking about 00:14:25.140 |
Apple or Telegram or TikTok but I think they want to get paid if any of those are going to use Llama 00:14:31.620 |
2. But I must confess to finding the previous clause somewhat ironic. You will not use the 00:14:37.540 |
Llama materials or any output or results of the Llama materials to improve any other large language 00:14:43.800 |
model. So they can use any part of the internet which one leak said might include copyrighted 00:14:48.960 |
works but you can't use Llama to improve your own model. Well just two hours ago people are 00:14:55.100 |
already updating models like LLaVA based on Llama 2. So it will likely just be a few days or weeks 00:15:02.480 |
until we see a newly improved Vicuña or Orca. Jim Fan predicts that Llama 2 will dramatically boost 00:15:09.600 |
multimodal AI and robotics research. He says these fields need more than just black box access to an 00:15:15.860 |
API. So far we have had to convert the complex sensory signals video audio 3D perception to text 00:15:25.060 |
It would be much more effective to graft those sensory modules directly onto a strong LLM backbone. 00:15:32.020 |
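For a sense of what grafting sensory modules onto an LLM backbone tends to mean in practice, here is a minimal sketch of the general LLaVA-style recipe (my assumption of what Jim Fan has in mind, not anything from the Llama 2 paper): features from a frozen vision encoder are projected into the language model's token-embedding space so image tokens can be fed alongside text tokens. Dimensions and class names are example values.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Minimal sketch of 'grafting' a vision encoder onto an LLM backbone.

    A (typically frozen) vision encoder yields patch features; a small trainable
    projection maps them into the LLM's token-embedding space so they can be fed
    to the language model alongside ordinary text embeddings. Dimensions are examples.
    """
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features, text_embeddings):
        # image_features:  (batch, num_patches, vision_dim) from the vision encoder
        # text_embeddings: (batch, seq_len, llm_dim) from the LLM's embedding layer
        image_tokens = self.proj(image_features)
        # Prepend the projected image tokens to the text sequence before the LLM forward pass.
        return torch.cat([image_tokens, text_embeddings], dim=1)
```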
Anyway this video is already long enough and this is just the first 24 hours of Llama 2's release. 00:15:38.380 |
I am sure there will be much more discussion in the coming days and weeks. Let me know what 00:15:43.540 |
you think in the comments and thank you so much for watching. Have a wonderful day.