
Llama 2: Full Breakdown


Chapters

0:00 Introduction
3:21 Reward Modeling
4:07 Helpfulness and Safety
6:03 Safety Testing in English
14:04 Llama 2 Widely Available


00:00:00.000 | Less than 24 hours ago, Meta released Llama 2, their successor to the open-source Llama language
00:00:07.480 | model that helped spawn a hundred others, including Alpaca, Vicuna and of course Orca.
00:00:13.300 | Within a few hours of release, I had read the fascinating 76-page technical paper,
00:00:18.420 | the responsible use guide, each of the many release pages, the full terms and conditions,
00:00:23.420 | and I have run many of my own experiments. Let's start with the basics: it was trained on more
00:00:28.880 | data, the biggest model has more parameters and the context length has doubled. They also spent
00:00:35.040 | what must be tens of millions on fine-tuning it for chat, but I'll get into that more later.
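(If you want to verify the doubled context window yourself, here is a minimal sketch using the Hugging Face transformers library; it assumes you have been granted access to the gated meta-llama repository and are logged in.)

```python
# Minimal sketch: inspect Llama 2's context window via Hugging Face transformers.
# Assumes access to the gated meta-llama repo and a logged-in session
# (e.g. `huggingface-cli login`).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
print(config.max_position_embeddings)  # expected: 4096, double Llama 1's 2048
```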
00:00:40.440 | But let's start with the benchmarks. They deliberately compared Llama 2 to Llama 1 and
00:00:47.780 | other famous open-source models, but not to GPT-4. And in these benchmarks,
00:00:52.720 | the trend is fairly clear: it crushes the other open-source language models, but is
00:00:57.940 | more of an incremental change from
00:00:58.860 | Llama 1. To massively simplify, the MMLU benchmark shows that it knows a lot about a lot of subjects,
00:01:07.240 | but the HumanEval benchmark shows that it's not amazing at coding.
00:01:11.860 | But now it's time for the paper and here are the highlights.
00:01:15.920 | On data, they say they used more robust data cleaning and trained on 40% more total tokens.
00:01:24.060 | They say they didn't include any data from Meta's products or services,
00:01:28.700 | but what they did do is up-sample the most factual sources.
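(Meta doesn't say how that up-sampling was implemented, but as a rough sketch of the general idea, here is one hypothetical way to over-weight certain corpora when drawing pretraining documents; the corpus names and weights are entirely made up.)

```python
import random

# Hypothetical pretraining mixture: up-sampling "factual" sources means they
# are drawn more often per epoch than their raw size alone would imply.
corpora = {
    "web_crawl":    {"docs": ["doc_a", "doc_b"], "weight": 1.0},
    "encyclopedia": {"docs": ["doc_c"],          "weight": 3.0},  # up-sampled 3x
}

names = list(corpora)
weights = [corpora[n]["weight"] for n in names]

def sample_document():
    # Pick a corpus in proportion to its weight, then a document within it.
    source = random.choices(names, weights=weights, k=1)[0]
    return random.choice(corpora[source]["docs"])

print(sample_document())
```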
00:01:32.920 | If you don't think that's much information about the data, you are correct, because all they say
00:01:37.780 | is it was trained on a new mix of publicly available data. Absolutely no mention of
00:01:44.260 | any sources here at all. After pre-training on those 2 trillion tokens, the models still
00:01:50.180 | did not show any sign of saturation. The loss going down here represents an improvement,
00:01:55.200 | and as you can see, they could have kept going.
00:01:57.820 | On page 8, we have some quick comparisons with PaLM 2, the model behind Bard, and of course,
00:02:03.320 | GPT-3.5, the original ChatGPT, and GPT-4.
00:02:06.800 | Obviously, this comparison doesn't look great for Llama 2, especially in coding, in this row.
00:02:12.480 | But now let's compare it to other open source models. Here it is being better at coding,
00:02:17.660 | common sense, reading comprehension, but notice it wasn't compared to Orca or Phi-1, both of
00:02:22.960 | which I've done videos on, and I found that interesting given that both are apparently
00:02:26.940 | set to be open-sourced. Phi-1, for example, at only 1.3 billion parameters, got around
00:02:33.500 | 50% for code. And I'll get to more Orca comparisons in a moment.
00:02:38.360 | What about the decision itself to release the model? As you can see here, they show
00:02:42.940 | off a list of corporate supporters of the decision to open source the model. And then
00:02:48.940 | if you remember the safety statement signed by all the top AGI labs and world experts
00:02:54.780 | in AI, well, I think Meta took a little bit of a cue from it:
00:02:56.060 | they came up with their own statement of support for Meta's open approach to today's AI.
00:03:04.140 | I'll let you decide if this list is as impressive as the other one, but I did note Marc Andreessen,
00:03:10.940 | who is on the board of directors of Meta. Back to the paper, and they went into immense
00:03:16.200 | detail on their reinforcement learning from human feedback process. Way too much for me
00:03:21.180 | to cover in this video. The short version is that reward modeling is a way of telling the
00:03:26.020 | base model which outputs humans prefer. And you can see the millions of human-rated comparisons
00:03:32.200 | that were used for Llama 2. Think of it as doggy training the model with treats and admonitions.
00:03:38.500 | And interestingly, they trained two separate reward models, one optimized for helpfulness
00:03:43.220 | and the other for safety. And they tried to make sure that the reward models or doggy
00:03:48.160 | trainers were as smart as the dog itself. Or, in technical speak: "we initialized our
00:03:53.300 | reward models from pre-trained chat model checkpoints."
00:03:55.980 | In short, the reward model knows what the chat model knows. And that is to prevent cases
00:04:01.520 | where the base model just hallucinates and the reward model can't tell the difference.
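(For the curious, the paper's reward-model training objective is a standard pairwise ranking loss, with a margin term reflecting how decisively raters preferred one response over the other. Here is a minimal PyTorch sketch of that idea; the function and variable names are mine.)

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen, r_rejected, margin):
    # r_chosen / r_rejected: scalar reward scores for the preferred and
    # rejected responses to the same prompt. The margin is larger when
    # human raters preferred one response more decisively.
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()

# Toy batch of three comparisons.
chosen   = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.5, 1.1])
margin   = torch.tensor([1.0, 0.0, 0.5])
print(reward_ranking_loss(chosen, rejected, margin))
```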
00:04:06.280 | They do describe at great length a trade-off though between helpfulness and safety, as
00:04:11.600 | illustrated here. Someone asked: "I'm going to be participating
00:04:14.880 | in a comedy roast, what are some hilariously spicy roasts I can use?" And on the right we
00:04:20.540 | have the two doggy trainers, the safety reward model score and the helpfulness reward model
00:04:25.940 | score. As we go down, more safety data is being ingested. And early on, as you can see,
00:04:30.980 | the model is pretty "helpful", giving these roasts. Obviously you can let
00:04:35.440 | me know what you think of them, but note they get low safety scores. As the model gets more
00:04:40.440 | safety training though, the safety score goes up, but the helpfulness score goes down. We
00:04:45.940 | get more of these "I can't satisfy your request" kind of answers. And I'm going to skip to
00:04:50.580 | one of the experiments I was going to show you later, which is when I was trying to benchmark
00:04:55.260 | Llama 2.
00:04:55.900 | I've applied to download the model, but at the moment this is just a Hugging Face Space.
00:05:00.140 | And I was trying to ask it a common-sense question from the HellaSwag benchmark and
00:05:04.940 | it just refused to answer. The paper calls this false refusal, and I find it happens
00:05:09.720 | quite a lot. The paper claims on page 19 that the 70 billion parameter version of Llama 2
00:05:16.100 | is more helpful than a particular version of ChatGPT, winning more often than it loses.
00:05:21.260 | But later they admit something which I definitely agree with. While our results indicate that
00:05:25.860 | Llama 2-Chat is on par with ChatGPT on human evaluations, it's important to note that human
00:05:32.020 | evaluations have several limitations. It says the prompt set doesn't cover coding or reasoning
00:05:38.020 | related prompts. They only evaluate the final generation of a multi-turn conversation and human
00:05:44.100 | evaluation is inherently subjective and noisy. I like to judge models based on mathematics and
00:05:49.860 | reasoning, so I might be biased in one direction. Also Llama 2 is not nearly as good
00:05:55.820 | when you're using it in languages other than English, which is not surprising given the
00:06:00.460 | language distribution in the pre-training data. I also find it interesting that they did all of
00:06:05.320 | their safety testing in English, and they warn developers: before deploying any applications of
00:06:11.040 | Llama 2, do your own safety testing and tuning tailored to your specific application. On compute,
00:06:17.000 | they don't say much other than that it was trained on A100s. I am sure Llama 3 will be trained on
00:06:25.780 | H100s, of which apparently Meta has purchased more than any other company, including Microsoft.
00:06:31.220 | Mind you, Llama 2 was trained between January and July, apparently, so it's understandable they used
00:06:37.400 | the earlier A100s. Back to the decision to release, and it does seem interesting to me that Meta and
00:06:43.760 | Zuckerberg have seemingly ignored this letter from the US Senate. It was written in early June and
00:06:50.280 | toward the end it said this: "By purporting to release Llama for the purpose of researching the abuse
00:06:55.740 | of AI, Meta effectively appears to have put a powerful tool in the hands of bad actors to actually
00:07:01.980 | engage in such abuse without much discernible forethought, preparation or safeguards." In the
00:07:08.420 | paper they defend it and say this release promotes transparency, it democratizes the technology and
00:07:14.920 | creates a more level playing field for organizations of all sizes across the globe to benefit from the
00:07:20.460 | economic growth promised by the advancement of AI. But before anyone gets too enchanted by that,
00:07:25.700 | Zuckerberg has recently said that they're only releasing because it's far away from AGI.
00:07:31.340 | And Google's PaLM model, I think, has about 10 times as many parameters. Now the
00:07:36.540 | Llama models are very efficient so they perform well for something that's around 65 billion
00:07:40.940 | parameters. So for me, that was also part of this, because there's this whole debate around:
00:07:46.220 | is it good for everyone in the world to have access to the most frontier AI models? And I think as the
00:07:55.660 | models start approaching something that's like a super human intelligence, that's a bigger question
00:08:02.260 | that we'll have to grapple with. But right now I mean these are still very basic tools.
00:08:07.780 | I suspect that the bigger reason for release relates to an earlier answer he gave in the same
00:08:13.540 | interview. Basically his researchers demanded it.
00:08:16.620 | Part of this is we want to have the best people in the world researching this, and a lot of the
00:08:22.300 | best people want to know that they're going to be able to share their work. So that's
00:08:25.620 | part of the deal that we have: if you're one of the top AI
00:08:31.380 | researchers in the world and come here, you can get access to kind of industry-scale infrastructure,
00:08:36.060 | and part of our ethos is that we want to share what's invented broadly.
00:08:41.940 | And if Zuckerberg had refused to release, some of those researchers could have just gone off and
00:08:47.580 | made their own company, as these guys did. Mistral AI is valued at 240 million despite being only four
00:08:55.580 | weeks old and contains some key employees from Meta. One even complained, before deleting the
00:09:01.700 | tweet, about not being included in the author list of the Llama 2 paper. This was the pitch memo that
00:09:08.300 | Mistral used to raise those hundreds of millions of euros and they focus on taking a more open
00:09:14.540 | approach to model development. So the point still stands: if a CEO blocks a model from being
00:09:19.580 | open-sourced, the researchers can just defect to xAI or start their own company.
00:09:25.540 | So in a way, Zuckerberg had few options. I must say, though, that I did raise an eyebrow when I read
00:09:31.340 | these paragraphs. This is on page 35 of the technical paper and they say not everyone who
00:09:36.780 | uses AI models has good intentions. AI agents could potentially be used for nefarious purposes
00:09:42.540 | such as misinformation, bioterrorism or cybercrime. However, we have made efforts to tune the
00:09:47.980 | models to avoid these topics. And indeed, cybercriminals have already come up with WormGPT to
00:09:53.500 | help them run phishing campaigns.
00:09:55.500 | But Meta points them to their Responsible Use Guide, which I am sure they will follow. I read that 24-
00:10:02.420 | page guide and, to be honest, it was kind of a waste of time. They said pretty much nothing. It was
00:10:09.220 | really bland and generic. Maybe that's harsh; let me know if I missed something, but it was all pretty
00:10:15.860 | vague. They did try some red teaming, only in English, for things like the production of weapons
00:10:21.940 | and lots of other risk categories. But you will be
00:10:25.460 | reassured, first, that any such illegal or unlawful activity is against their terms and conditions,
00:10:31.900 | and second, that they are looking to the community to do further research and red teaming. Anyway, I
00:10:37.340 | am keen to do many more experiments. But using this Gradio demo, it basically failed to do a
00:10:43.980 | proper sonnet, and when I asked it this question from the math benchmark, it said the question
00:10:49.420 | does not make sense, because the length of a rectangle being twice its width would mean the
00:10:55.420 | rectangle is a square. Hmm. Anyway, it could just be a problem with that demo, because
00:11:01.220 | GPT-3.5 crushes the sonnet about apples and has no problem with the length of a rectangle being twice
00:11:07.620 | its width. Which brings me on to a benchmark that the Llama 2 paper did talk about on page 48. It was
00:11:15.380 | on Social IQa, and they noted that Llama 1 actually did better than Llama 2. Here is the benchmark.
00:11:22.260 | It's about common sense reasoning with questions such as these.
00:11:25.380 | Alex spilled the food she just prepared all over the floor and it made a huge mess. What will Alex
00:11:30.660 | want to do next? Taste the food, mop up, run around in a mess. And again, apparently Llama 1 actually
00:11:37.180 | does slightly better on those kinds of questions. Another benchmark that you can see Llama 1 being as
00:11:43.100 | good as Llama 2 at is BoolQ. That's a benchmark testing yes-or-no questions, but it's harder than
00:11:48.980 | that: you have to read a lot of context to get the answer right. I just want you to remember some of
00:11:54.140 | these benchmarks when you
00:11:55.380 | hear all the influencers talk about Llama 2 completely changing everything. Also, if someone
00:12:00.020 | says it's the best model of its size, look at Llama 2 at 13 billion parameters. Of course it depends on
00:12:06.340 | the benchmark, but it got 21.7 percent on AQuA-RAT. That's a test of mathematical reasoning, and Orca,
00:12:13.380 | at the exact same size of 13 billion parameters, got almost 28 percent. So even pound for pound,
00:12:19.820 | it may not be the best in all categories. To be honest, I feel like there might be a loyalty
00:12:25.300 | struggle going on behind the scenes at Microsoft about whether to open-source Orca and Phi-1. There
00:12:31.140 | were some bonus interesting things about the paper like introducing ghost attention which to
00:12:36.580 | oversimplify, means that the model pays attention over multiple turns of the conversation to something
00:12:42.500 | you might have originally told it, such as "always act as Napoleon from now on". Essentially these
00:12:47.460 | diagrams show that with ghost attention, the model pays more attention to that original command, "act
00:12:53.060 | as Oscar Wilde" or "always act as Napoleon from now on".
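(As a rough sketch of how I understand the GAtt data trick: the instruction is synthetically concatenated to every user turn so that every sampled response respects it, and it is then dropped from all but the first turn in the final training sample, with the loss zeroed out on the earlier assistant turns. The helper names below are mine, not the paper's.)

```python
# Rough sketch of Ghost Attention (GAtt)-style training data construction.
def build_gatt_sample(instruction, user_turns, generate):
    # 1) Attach the instruction to EVERY user turn and sample responses, so
    #    each assistant reply actually follows the instruction.
    responses = [generate(f"{instruction}\n{turn}") for turn in user_turns]

    # 2) In the final training sample, keep the instruction only on the first
    #    turn; the model must learn to "remember" it on later turns.
    sample = [(f"{instruction}\n{user_turns[0]}", responses[0])]
    sample += list(zip(user_turns[1:], responses[1:]))
    return sample
```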
00:12:55.260 | The authors also throw in this observation that LLMs have internalized the concept of time, and that,
00:13:03.780 | despite their training being solely based on next-token prediction on data that is randomly
00:13:10.100 | shuffled without regard to its chronological context, the models pick up a general sense of
00:13:15.300 | what time is. Even when provided with minimal data, they know what people at a given date wouldn't have known.
00:13:21.220 | For example, with a knowledge cutoff of 1940, when asked who won
00:13:25.220 | the Second World War, they say: I'm not sure what you're referring to; my knowledge stopped in
00:13:29.660 | 1940. Right at the end of the report, I know many people will be shocked to hear that when they did
00:13:35.340 | a sentiment analysis of the model, they found that the sentiment for Llama 2 for right wing was higher
00:13:42.460 | than for left wing. You may even want to pause and look at this page from a sociological perspective,
00:13:48.940 | because if Llama 2 was trained on a semi-random swathe of the internet, this could be like a
00:14:01.220 | snapshot of the sentiment analysis of all of these terms across the internet. Anyway, in what may have
00:14:08.420 | been a surprising twist for some, Microsoft and Meta teamed up to make Llama 2 widely available,
00:14:14.720 | and we get news that Llama 2 may soon be on your phone and PC. Although I think Meta want to be
00:14:14.720 | paid if it's going to come to your iPhone with this curious clause requiring permission if you
00:14:20.340 | have more than 700 million monthly active users. I don't know whether they were thinking about
00:14:25.140 | Apple or Telegram or TikTok but I think they want to get paid if any of those are going to use Llama
00:14:31.620 | 2. But I must confess to finding the previous clause somewhat ironic. You will not use the
00:14:37.540 | Llama materials or any output or results of the Llama materials to improve any other large language
00:14:43.800 | model. So they can use any part of the internet, which one leak said might include copyrighted
00:14:48.960 | works, but you can't use Llama to improve your own model. Well, just two hours ago, people were
00:14:55.100 | already updating models like LLaVA based on Llama 2. So it will likely be just a few days or weeks
00:15:02.480 | until we see a newly improved Vicuna or Orca. Jim Fan predicts that Llama 2 will dramatically boost
00:15:09.600 | multimodal AI and robotics research. He says these fields need more than just black box access to an
00:15:15.860 | API. So far, we have had to convert the complex sensory signals (video, audio, 3D perception) to text
00:15:22.920 | descriptions and then feed them to an LLM.
00:15:25.060 | It would be much more effective to graft those sensory modules directly onto a strong LLM backbone.
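(That grafting is roughly what LLaVA-style models do: a small trained projection maps frozen vision-encoder features into the LLM's token-embedding space. A toy PyTorch sketch of the idea, with hypothetical dimensions and names.)

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    # Toy sketch: project frozen vision-encoder patch features into the LLM's
    # token-embedding space so they can be prepended to the text embeddings.
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features, text_embeddings):
        # patch_features:  (batch, num_patches, vision_dim)
        # text_embeddings: (batch, seq_len, llm_dim)
        visual_tokens = self.proj(patch_features)
        return torch.cat([visual_tokens, text_embeddings], dim=1)
```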
00:15:32.020 | Anyway this video is already long enough and this is just the first 24 hours of Llama 2's release.
00:15:38.380 | I am sure there will be much more discussion in the coming days and weeks. Let me know what
00:15:43.540 | you think in the comments and thank you so much for watching. Have a wonderful day.