Back to Index

Llama 2: Full Breakdown


Chapters

0:00
3:21 Reward Modeling
4:07 Helpfulness and Safety
6:03 Safety Testing in English
14:04 Llama 2 Widely Available

Transcript

Less than 24 hours ago, Meta released Llama 2, their successor to the open-source Llama language model that helped spawn a hundred others, including Alpaca, Vicuna and, of course, Orca. Within a few hours of release, I had read the fascinating 76-page technical paper, the responsible use guide, each of the many release pages and the full terms and conditions, and I have run many of my own experiments.

Let's start with the basics: it was trained on more data, the biggest model has more parameters, and the context length has doubled. They also spent what must be tens of millions on fine-tuning it for chat, but I'll get into that more later. But let's start with the benchmarks. They deliberately compared Llama 2 to Llama 1 and other famous open-source models, but not to GPT-4.

And in these benchmarks, the trend is fairly clear: it crushes the other open-source language models, but is more of an incremental change over Llama 1. To massively simplify, the MMLU benchmark shows that it knows a lot about a lot of subjects, but the HumanEval benchmark shows that it's not amazing at coding.
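
As a quick aside on how that coding benchmark is scored: HumanEval results are usually reported as pass@k, the chance that at least one of k sampled completions passes the unit tests. Here's a minimal sketch of the standard unbiased estimator from the original HumanEval paper; this is generic, not something specific to the Llama 2 setup.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n completions sampled per problem,
    c of them pass the unit tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one correct sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Toy example: 20 samples per problem, 5 of them correct.
print(round(pass_at_k(20, 5, 1), 3))   # 0.25
print(round(pass_at_k(20, 5, 10), 3))  # much higher, close to 1
```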

But now it's time for the paper and here are the highlights. On data, they say they used more robust data cleaning and trained on 40% more total tokens. They say they didn't include any data from Meta's products or services, but what they did do is up-sample the most factual sources.

If you don't think that's much information about the data, you are correct, because all they say is it was trained on a new mix of publicly available data. Absolutely no mention of any sources here at all. After pre-training on those 2 trillion tokens, the models still did not show any sign of saturation.

The loss going down here represents an improvement, and as you can see, they could have kept going. On page 8, we have some quick comparisons with PaLM 2, the model behind Bard, and of course GPT-3.5, the original ChatGPT, and GPT-4. Obviously, this comparison doesn't look great for Llama 2, especially in coding, in this row.

But now let's compare it to other open-source models. Here it is being better at coding, common sense and reading comprehension, but notice it wasn't compared to Orca or Phi-1, both of which I've done videos on, and I found that interesting given that both are apparently set to be open-sourced.

Phi-1, for example, at only 1.3 billion parameters, got around 50% for code. And I'll get to more Orca comparisons in a moment. What about the decision itself to release the model? As you can see here, they show off a list of corporate supporters of the decision to open source the model.

And then, if you remember the safety statement signed by all the top AGI labs and world experts in AI, well, I think Meta got a little bit of a shock. They came up with their own statement of support for Meta's open approach to today's AI. I'll let you decide if this list is as impressive as the other one, but I did note Marc Andreessen, who is on the board of directors of Meta.

Back to the paper, and they went into immense detail on their reinforcement learning from human feedback (RLHF) process. Way too much for me to cover in this video. The short version is that reward modeling is a way of telling the base model which outputs humans prefer, and you can see the millions of human-rated comparisons that were used for Llama 2.
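
Just to give a feel for what training on those comparisons looks like: the paper describes a binary ranking loss over pairs of responses, with an extra margin term when annotators said one response was clearly better than the other. Here's a minimal sketch of that loss shape in PyTorch; the scores and margins below are made-up numbers for illustration.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen: torch.Tensor,
                        r_rejected: torch.Tensor,
                        margin: torch.Tensor) -> torch.Tensor:
    """Binary ranking loss over scalar reward scores for pairs of responses to
    the same prompt: push the preferred response's score above the rejected
    one's by at least `margin` (larger margins for clearer preferences)."""
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()

# Toy batch of two comparisons with made-up scores and margins.
loss = reward_ranking_loss(torch.tensor([1.2, 0.3]),
                           torch.tensor([0.4, 0.5]),
                           torch.tensor([0.5, 0.0]))
print(loss.item())
```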

Think of it as doggy training the model with treats and admonitions. And interestingly, they trained two separate reward models, one optimized for helpfulness and the other for safety. And they tried to make sure that the reward models, or doggy trainers, were as smart as the dog itself. Or, in technical speak: "we initialized our reward models from pre-trained chat model checkpoints."

In short, the reward model knows what the chat model knows. And that is to prevent cases where the base model just hallucinates and the reward model can't tell the difference. They do describe at great length a trade-off, though, between helpfulness and safety, as illustrated here. Someone asked: "I'm going to be participating in a comedy roast, what are some hilariously spicy roasts I can use?"

And on the right we have the two doggy trainers, the safety reward model score and the helpfulness reward model score. As we go down, more safety data is being ingested. And early on, as you can see, the model is pretty quote unquote helpful giving these roasts. Obviously you can let me know what you think of them, but note they get low safety scores.

As the model gets more safety training though, the safety score goes up, but the helpfulness score goes down. We get more of these "I can't satisfy your request" kind of answers. And I'm going to skip to one of the experiments I was going to show you later, which is when I was trying to benchmark Llama 2.

I've applied to download the model, but at the moment this is just a Hugging Face Space. And I was trying to ask it a common sense question from the HellaSwag benchmark and it just refused to answer. They call this in the paper false refusal, and I find it happens quite a lot.

The paper claims on page 19 that the 70 billion parameter version of Llama 2 is more helpful than a particular version of ChatGPT, winning more often than it loses. But later they admit something which I definitely agree with. While our results indicate that Llama 2-Chat is on par with ChatGPT on human evaluations, it's important to note that human evaluations have several limitations.

It says the prompt set doesn't cover coding- or reasoning-related prompts, they only evaluate the final generation of a multi-turn conversation, and human evaluation is inherently subjective and noisy. I like to judge models based on mathematics and reasoning, so I might be biased in one direction. Also, Llama 2 is not nearly as good when you're using it in languages other than English, which is not surprising given the language distribution in the pre-training data.

I also find it interesting that they did all of their safety testing in English, and they warn developers that, before deploying any applications of Llama 2, they should do their own safety testing and tuning tailored to their specific application. On compute, they don't say much other than that it was trained on A100s.

I am sure Llama 3 will be trained on H100s, of which Meta has apparently purchased more than any other company, including Microsoft. Mind you, Llama 2 was apparently trained between January and July, so it's understandable they used the earlier A100s. Back to the decision to release, and it does seem interesting to me that Meta and Zuckerberg have seemingly ignored this letter from the US Senate.

It was written in early June and toward the end it said this: "By purporting to release Llama for the purpose of researching the abuse of AI, Meta effectively appears to have put a powerful tool in the hands of bad actors to actually engage in such abuse without much discernible forethought, preparation or safeguards." In the paper they defend it and say this release promotes transparency, it democratizes the technology and creates a more level playing field for organizations of all sizes across the globe to benefit from the economic growth promised by the advancement of AI.

But before anyone gets too enchanted by that, Zuckerberg has recently said that they're only releasing it because it's far away from AGI. And I think Google's PaLM model also has, I think, about 10 times as many parameters. Now, the Llama models are very efficient, so they perform well for something that's around 65 billion parameters.

So for me, that was also part of this, because there's this whole debate around, you know, is it good for everyone in the world to have access to the most frontier AI models? And I think, as the models start approaching something that's like a superhuman intelligence, that's a bigger question that we'll have to grapple with.

But right now, I mean, these are still very basic tools. I suspect that the bigger reason for release relates to an earlier answer he gave in the same interview: basically, his researchers demanded it. Part of this is we want to have the best people in the world researching this, and a lot of the best people want to know that they're going to be able to share their work.

So that's part of the deal that we have: if you're one of the top AI researchers in the world and come here, you can get access to kind of industry-scale infrastructure, and part of our ethos is that we want to share what's invented broadly.

And if Zuckerberg had refused to release, some of those researchers could have just gone off and made their own company, as these guys did. Mistral AI is valued at 240 million despite being only four weeks old and includes some key former employees from Meta. One even complained, before deleting the tweet, about not being included in the author list of the Llama 2 paper.

This was the pitch memo that Mistral used to raise those hundreds of millions of euros, and they focus on taking a more open approach to model development. So the point still stands: if a CEO blocks a model from being open-sourced, the researchers, if they want to, can just defect to xAI or start their own company.

So in a way, Zuckerberg had few options. I must say, though, that I did raise an eyebrow when I read these paragraphs. This is on page 35 of the technical paper, and they say not everyone who uses AI models has good intentions: AI agents could potentially be used for nefarious purposes such as misinformation, bioterrorism or cybercrime.

However, we have made efforts to tune the models to avoid these topics. And indeed, cybercriminals have already come up with WormGPT to help them do phishing campaigns, but Meta points them to their responsible use guide, which I am sure they will follow. I read that 24-page guide and, to be honest, it was kind of a waste of time.

They said pretty much nothing. It was really bland and generic. Maybe that's harsh, let me know if I missed something, but it was all pretty vague. They did try some red teaming, only in English, for things like the production of weapons and lots of other risk categories. But you will be reassured, first, that any such illegal or unlawful activity is against their terms and conditions, and second, that they are looking for the community to do further research and red teaming.

Anyway, I am keen to do many more experiments, but using this Gradio demo, it basically failed to do a proper sonnet, and when I asked it this question from the MATH benchmark, it said the question does not make sense because the length of a rectangle being twice its width would mean the rectangle is a square.

Hmm. Anyway, it could just be a problem with that demo, because GPT-3.5 crushes the sonnet about apples and has no problem with the length of a rectangle being twice its width. Which brings me on to a benchmark that the Llama 2 paper did talk about on page 48.

It was on Social IQa, and they noted that Llama 1 actually did better than Llama 2. Here is the benchmark: it's about common-sense reasoning about social situations, with questions such as this. Alex spilled the food she just prepared all over the floor and it made a huge mess. What will Alex want to do next?

Taste the food, mop up, or run around in a mess. And again, apparently, Llama 1 actually does slightly better on those kinds of questions. Another benchmark on which you can see Llama 1 being as good as Llama 2 is BoolQ. That's a benchmark testing yes-or-no questions, but it's harder than that.

You have to read a lot of context to get the answer right.
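
As an aside, for benchmarks like Social IQa and BoolQ, a common way to evaluate a raw language model is to score each candidate answer by the log-likelihood the model assigns to it and pick the highest-scoring one. Here's a rough sketch of that idea using the Hugging Face transformers library; the checkpoint name is just a placeholder, and I'm not claiming this is the exact harness Meta used.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; swap in whichever Llama 2 weights you have access to.
model_name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def choice_logprob(context: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to `choice` given `context`."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    full_ids = tok(context + " " + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)  # each position predicts the next token
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Approximate number of tokens belonging to the answer; fine for a sketch.
    n_choice = full_ids.shape[1] - ctx_ids.shape[1]
    return token_lp[0, -n_choice:].sum().item()

question = ("Alex spilled the food she just prepared all over the floor "
            "and it made a huge mess. What will Alex want to do next?")
choices = ["taste the food", "mop up", "run around in a mess"]
print(max(choices, key=lambda c: choice_logprob(question, c)))
```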

I just want you to remember some of these benchmarks when you hear all the influencers talk about Llama 2 completely changing everything. Also, if someone says it's the best model of its size, look at Llama 2 13 billion parameters. Of course it depends on the benchmark, but it got 21.7 percent on AQuA-RAT. That's a test of mathematical reasoning, and Orca, at the exact same size of 13 billion parameters, got almost 28 percent.

So even pound for pound, it may not be the best in all categories. To be honest, I feel like there might be a struggle going on behind the scenes at Microsoft about whether to open-source Orca and Phi-1. There were some bonus interesting things in the paper, like introducing Ghost Attention, which, to oversimplify, means that the model keeps paying attention, over multiple turns of the conversation, to something you might have originally told it, such as "always act as Napoleon from now on".
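
If you're curious how that works under the hood, here is my rough, simplified sketch of the data construction the paper describes for Ghost Attention: concatenate the instruction to every user turn, sample replies that all respect it, then fine-tune with the instruction kept only on the first turn and the loss zeroed on earlier turns. The function names and the sampling call below are placeholders, not Meta's actual code.

```python
def build_gatt_example(user_turns, instruction, sample_from_rlhf_model):
    """Rough, simplified sketch of Ghost Attention (GAtt) data construction."""
    # 1) Synthetically concatenate the instruction to every user message and
    #    sample assistant replies so that each reply respects the instruction.
    dialogue = []
    for user_msg in user_turns:
        augmented = f"{instruction}\n{user_msg}"
        reply = sample_from_rlhf_model(dialogue + [("user", augmented)])
        dialogue += [("user", augmented), ("assistant", reply)]

    # 2) For fine-tuning, keep the instruction only on the first user turn...
    training_turns = []
    for i, (role, msg) in enumerate(dialogue):
        if role == "user" and i > 0:
            msg = msg.replace(f"{instruction}\n", "", 1)
        # 3) ...and zero the loss on everything except the final assistant
        #    reply, so the model learns to honour the instruction deep into
        #    the conversation.
        training_turns.append({"role": role,
                               "content": msg,
                               "compute_loss": i == len(dialogue) - 1})
    return training_turns
```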

Essentially, these diagrams show that with Ghost Attention the model pays more attention to that original command, "act as Oscar Wilde" or "always act as Napoleon from now on". The authors also throw in this observation: that LLMs have internalized the concept of time, and that despite their training being solely based on next-token prediction and data that is randomly shuffled without regard to its chronological context, the models pick up a general sense of what time is.

Even when provided with minimal data, they seem to know what people at that point in time would and wouldn't have known. For example, with a knowledge cutoff of 1940, when asked who won the Second World War, they say something like: I'm not sure what you're referring to, my knowledge stopped in 1940. Right at the end of the report, I know many people will be shocked to hear that, when they did a sentiment analysis of the model, they found that the sentiment for Llama 2 for "right wing" was higher than for "left wing".

You may even want to pause and look at this page from a sociological perspective, because if Llama 2 was trained on a semi-random swathe of the internet, this could be like a snapshot of the sentiment analysis of all of these terms across the internet. Anyway, in what may have been a surprising twist for some, Microsoft and Meta teamed up to make Llama 2 widely available, and we get news that Llama 2 may soon be on your phone and PC.

Although I think Meta want to be paid if it's going to come to your iPhone, with this curious clause requiring permission if you have more than 700 million monthly active users. I don't know whether they were thinking about Apple, Telegram or TikTok, but I think they want to get paid if any of those are going to use Llama 2.

But I must confess to finding the previous clause somewhat ironic: you will not use the Llama materials or any output or results of the Llama materials to improve any other large language model. So they can use any part of the internet, which one leak said might include copyrighted works, but you can't use Llama to improve your own model.

Well, just two hours ago, people were already updating models like LLaVA based on Llama 2. So it will likely just be a few days or weeks until we see a newly improved Vicuna or Orca. Jim Fan predicts that Llama 2 will dramatically boost multimodal AI and robotics research. He says these fields need more than just black-box access to an API.

So far, we have had to convert the complex sensory signals, video, audio, 3D perception, to text descriptions and then feed them to an LLM. It would be much more effective to graft those sensory modules directly onto a strong LLM backbone.
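
To make that idea a bit more concrete, here's a minimal, hypothetical sketch of one way sensory grafting is done in practice, roughly in the style of projects like LLaVA: project frozen vision-encoder features into the LLM's token-embedding space and prepend them to the text embeddings. All of the dimensions and names here are illustrative assumptions, not anything from the Llama 2 paper.

```python
import torch
import torch.nn as nn

class VisionToLLMAdapter(nn.Module):
    """Illustrative sketch: map frozen vision features into an LLM's
    embedding space so the LLM can attend to them like extra tokens."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # the only trained piece in this sketch

    def forward(self, image_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, n_patches, vision_dim), e.g. from a CLIP ViT
        # text_embeds: (batch, n_tokens, llm_dim), from the LLM's embedding layer
        visual_tokens = self.proj(image_feats)
        # Prepend the projected visual tokens so the LLM reads them first.
        return torch.cat([visual_tokens, text_embeds], dim=1)

# Toy usage with random tensors standing in for real encoder/LLM outputs.
adapter = VisionToLLMAdapter()
fused = adapter(torch.randn(1, 256, 1024), torch.randn(1, 32, 4096))
print(fused.shape)  # torch.Size([1, 288, 4096])
```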

Anyway, this video is already long enough, and this is just the first 24 hours of Llama 2's release. I am sure there will be much more discussion in the coming days and weeks. Let me know what you think in the comments, and thank you so much for watching. Have a wonderful day.