>> Hello, everyone. Today we have Nazneen from Hugging Face, who is working on AI safety and alignment using reinforcement learning with human feedback. She's an expert in the space of large language models and their evaluation. Before Hugging Face, she led a team of researchers at Salesforce focused on building robust natural language generation systems based on LLMs, and she got her Ph.D.
at UT Austin in computer science. So, everyone, welcome. >> Thanks for having me. So, the title of my talk today is Recipes for Training Helpful Chatbots. So, here's the introduction. I was part of this team called H4 at Hugging Face, and today I'll walk you through what we built and how we decided on what we needed to build it.
And so, essentially, the goal of the team and the project since earlier this year was to figure out a recipe for H4, which stands for helpful, harmless, honest, and huggy, because it's Hugging Face's chatbot. And so, the ingredients, essentially, were to figure out what kind of datasets we need for supervised fine-tuning and RLHF.
And we wanted to not worry about pre-training. Instead, take an open source pre-trained model and recreate the secret sauce of alignment on it. And the procedure that we wanted to follow and replicate on open source is this figure that I'm pretty sure most of you are familiar with at this point.
It's from the InstructGPT paper from OpenAI, which shows three steps. I'm going to go into a bit more detail on this because this slide is much smaller. But this is what the outline of the talk looks like. I'll go into the details of how we decided what kind of data, how much data, and all the details of the data for supervised fine-tuning.
Then similarly for RLHF. Then I'm going to talk about distillation of language model alignment. Then experiments with different helpfulness recipes. Finally, I'll talk about evaluation of these models and quirks of using GPT-4 as an evaluator. Okay, so this is the overall recipe that the InstructGPT paper from OpenAI put forward as the steps for training a chatbot.
So, the first step over here is to do supervised fine tuning. Essentially, like, you know, you're doing fine tuning with human instruction demonstration data. So, the input and the output are both given by humans. The step two is, like, you know, the input is given by a human. The output comes from models.
And then the human just rates thumbs up, thumbs down, or ranks them. And then you train a reward model, which is essentially just a classifier. And then the final step three is doing fine tuning using that reward model with reinforcement learning. And so, the way I'm looking at it, like, step one is more for, like, you know, making a model into a helpful chatbot.
And the steps two and three are essentially trying to add those guardrails in place for harmlessness. So, let's get started with talking about helpfulness. And most of my talk today will be focused on the step one. So, let's start diving deeper into this. And let's start with the data set.
Like, how do we decide what we need for doing the supervised fine-tuning? So, the dataset for helpfulness for supervised fine-tuning looks somewhat like this. This is from the Self-Instruct paper, if you're aware of that, from the end of last year. So, you have something that we call a task, which then has an instruction, which is essentially a request by a user asking the model to fulfill or give a response to a certain task.
And that is followed by an input and an output. The input in this case is optional; it could just be part of the instruction. And the output is the expected output that the model should generate. While we are doing this training, the human provides the expected output — what we would want the model to generate in the actual test case.
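To make that concrete, here's a minimal, hypothetical example of what one such instruction-demonstration record might look like, along with one common way of flattening it into a single training string. The field names and the prompt template are illustrative, not the exact format used by any of the datasets mentioned here.

```python
# A minimal, hypothetical instruction-demonstration record of the kind described
# above (field names are illustrative, not any dataset's exact schema).
example = {
    "task": "brainstorming",
    "instruction": "Brainstorm a list of possible New Year's resolutions.",
    "input": "",  # optional; the request may be fully contained in the instruction
    "output": "1. Exercise three times a week\n2. Read one book a month\n3. ...",
}

def to_training_text(record: dict) -> str:
    """Flatten a record into the single text string used for supervised fine-tuning."""
    prompt = record["instruction"]
    if record["input"]:
        prompt += "\n\n" + record["input"]
    return f"### Instruction:\n{prompt}\n\n### Response:\n{record['output']}"

print(to_training_text(example))
```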
The input and the output together are called an instance, a demonstration, or a completion, and that's why this is called instruction demonstration data. So, this is just a high-level landscape of what these datasets for instruction demonstration look like. And you must be familiar with at least some of these.
And, like, you know, the way I'm trying to put this is on this line where on one side I'm showing data sets that were generated using models or more powerful language models. And so, they're more synthetic data sets. On the right, I'm showing, like, human written data sets. And so, these are data sets that the human wrote the input as well as the expected output.
And so, examples of these are — Surge Instruct is the data that we at Hugging Face H4 contracted with Surge for; this company basically had contracts with annotators who wrote the inputs and outputs. But we had to give them all the specifications of what kind of data we needed.
And then you must have heard of Open Assistant, which is this other community-wide effort where people contributed by manually writing inputs and outputs. Similarly with Dolly. And then on the other end, you can see the Self-Instruct dataset. I'm going to dive into some of these.
How are these synthetic datasets created for helpfulness or for supervised fine-tuning? So, one example of how the synthetic data is created is in the Self-Instruct paper, which calls it bootstrapping the data. In this case, they start with 175 seed tasks. That is a very small set of examples with manually written inputs and outputs from humans, and those are added to a task pool.
Then you bootstrap by giving those to a language model in a few-shot setting and asking it to generate more data like that. And then you have another language model that does task classification: what kind of task does this sample or example belong to?
And finally, it also does a more fine-grained classification as to whether the task has the output first or requires the input first, and so on. And because this is synthetic data created in a very scalable way, you also have to do a lot of filtering to make sure that it is high quality.
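As a rough illustration of that bootstrapping loop — heavily simplified, with `call_llm` and the duplicate filter standing in for the paper's actual generation and quality-filtering steps:

```python
import random

# A toy sketch of a Self-Instruct-style bootstrapping loop. `call_llm` stands in
# for whatever few-shot generation endpoint is used; the similarity filter is a
# simplification of the paper's actual quality filtering.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model / API call here")

def too_similar(candidate: str, pool: list[str]) -> bool:
    # Placeholder filter: drop near-duplicates of instructions already in the pool.
    return any(candidate.strip().lower() == t.strip().lower() for t in pool)

task_pool = ["Write a short poem about autumn.", "..."]  # seed tasks written by humans

for _ in range(1000):  # keep bootstrapping until the pool is large enough
    few_shot = "\n".join(random.sample(task_pool, k=min(8, len(task_pool))))
    new_instruction = call_llm(
        f"Here are some task instructions:\n{few_shot}\nCome up with a new task instruction:"
    )
    if too_similar(new_instruction, task_pool):
        continue  # quality filtering: skip near-duplicates
    # A second pass classifies the task, decides input-first vs. output-first,
    # and generates the instance (input/output) for the new instruction.
    instance = call_llm(f"Instruction: {new_instruction}\nGenerate an input and output:")
    task_pool.append(new_instruction)
```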
So, another way of generating this kind of synthetic data is what UltraChat did. In this case, they had a human-in-the-loop process. A human would look things up — say, search Wikipedia — and come up with topics that they want to generate data for.
Then they would provide the required material that would be needed for coming up with, say, question answering or summarization or any of these specific tasks, and give it to a more powerful model, like ChatGPT or GPT-4. In this case, it was ChatGPT — and then, oh, actually, GPT-4.
And then you keep doing these loops of giving the material to the model and saying, come up with questions and answers on this particular task using all this material. The human looks at it and keeps querying it and refining it more and more.
So, this is another way of creating synthetic data. Obviously, this has a human sitting there and doing a lot more filtering in the process. Then there's another one, which is even less human-involved, which is role-playing. This is the CAMEL dataset. In this case, all that the human does is come up with an idea of what task or what level they want.
So, at a high level, it would be, like, develop a trading bot for the stock market. And there would be two LLMs. One would be role playing as an AI assistant. The other would be role playing as an AI user. And then they basically just specify the task and, like, let these two bots chat with each other and create a conversation dataset, which is, again, like, a synthetic dataset for supervised fine tuning.
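A minimal sketch of that role-playing setup might look like the following, assuming a generic `chat` function as a stand-in for whatever chat endpoint is used; the system prompts are illustrative, not CAMEL's exact templates.

```python
# Two chat models, one playing an AI user and one an AI assistant, converse
# about a human-specified task. `chat` is a stand-in for your chat endpoint.

def chat(system_prompt: str, history: list[dict]) -> str:
    raise NotImplementedError("plug in your chat model / API call here")

task = "Develop a trading bot for the stock market."
user_system = f"You are an AI user. Keep giving the assistant instructions to complete: {task}"
assistant_system = f"You are an AI assistant. Help the user complete: {task}"

history: list[dict] = []
user_msg = f"Let's start working on: {task}"
for turn in range(10):  # cap the conversation length
    history.append({"role": "user", "content": user_msg})
    assistant_msg = chat(assistant_system, history)
    history.append({"role": "assistant", "content": assistant_msg})
    user_msg = chat(user_system, history)  # the "user" model writes the next instruction

# `history` is now one synthetic multi-turn conversation for supervised fine-tuning.
```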
So, just going back to this landscape: it looks like people have been very creative about how to get very high-quality data quickly without spending a lot of money, because humans are inefficient and expensive.
These are some examples that we looked at. But on the other hand, we also cannot underestimate how good the quality of the manually created datasets is. And so, we at Hugging Face decided to go fully manual and have humans write both the input and the output.
They would also go figure out what essential documents or other material they needed for creating this dataset. But when we started doing that, it was earlier in the year — back in January or February — and this is what the landscape looked like at that time.
There were very few datasets available, and a lot of them were mostly synthetically created. So, we wanted to leverage what existed out there, but we also had to make some really important decisions, because we were going to pay money and we had to make sure that the data we collected was actually useful for building the model and the applications built on top of it.
So, these are the learnings we had from past papers that created these supervised fine-tuning datasets. We knew that the dataset has to be in the range of tens of thousands of examples — that is from the Self-Instruct paper. And we also knew that models trained on this kind of data show diminishing returns after just a few thousand high-quality instructions.
So, you don't need a lot, and then it saturates very quickly. Those are the two findings we had when we started to collect datasets for supervised fine-tuning. But we also had to give some very fine-grained specifications for what we wanted in our dataset. In particular, we had to decide what task distribution we wanted for the data we were collecting.
I mean, we know it's tens of thousands, but how many thousands of what task, right? Then the length distribution: should the prompt have a certain length? Is that even an important factor? And one thing is that we had decided we wanted to make it high quality and human-written, but there were options on that as well.
We could go with external vendors, like Surge, Scale AI, or AWS SageMaker Ground Truth. Or we could hire our own contractors from Upwork or MTurk. So, those were decisions we had to make. Let's look at each of these one by one. Because we were recreating this InstructGPT recipe for a helpful chatbot, we wanted to take inspiration from their task distribution.
So, on the left, I'm showing the task distribution that OpenAI used for the InstructGPT paper. As you can see, generation is the majority of it, followed by some of these open-ended tasks and brainstorming tasks and so on. And these are examples of what prompts for each of those look like.
So, we decided to just go with that. But you may have noticed that there's this category called "other" in the table, and we obviously don't know what that was. So, we decided to replace that with "code." Essentially, that would be things like debugging or asking clarification questions about code.
So, it's like code plus natural language. This is what our final distribution looked like. The second question was the length distribution. We had to figure out how important the length is, and whether we should specify a certain length distribution when we ask these companies to collect data for us.
So, we did a pilot study with Surge, Scale AI, and AWS SageMaker Ground Truth, which is more like a managed service — so it's very different from MTurk, and they have very high-quality humans writing these examples. And I wanted to highlight that the first two rows here show what the InstructGPT length distribution looks like.
As you can see, that is obviously the full dataset, while ours is more of a pilot, so the counts are much smaller. But you can see the maximum is 2048, and as you know, that was the standard context size at the beginning of the year.
And the mean is more or less in that range as well. But if you look at these pilots from Surge, AWS, and Scale AI, there's very high variance. For example, with AWS SageMaker, the maximum prompt length is 1036,
but the mean is just 54. On the other hand, with Surge, the maximum length is 500, but the mean is 104, so it's more in the range of what we would expect given the InstructGPT distribution.
And similarly, with Scale AI, we found that the prompts were just very, very short. So, just based on this, we said, okay, we should probably go with Surge, because that seemed closer to the expected range, without very high variance.
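The kind of length comparison being described here is easy to reproduce. A small sketch, assuming a generic tokenizer and illustrative vendor prompt lists (none of the names or numbers below are the actual pilot data):

```python
from transformers import AutoTokenizer

# Tokenize the pilot prompts from each vendor and compare mean/max token counts.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

pilots = {
    "vendor_a": ["Write a haiku about the ocean.", "..."],
    "vendor_b": ["Summarize the following article: ...", "..."],
}

for vendor, prompts in pilots.items():
    lengths = [len(tokenizer.encode(p)) for p in prompts]
    print(f"{vendor}: n={len(lengths)}, mean={sum(lengths)/len(lengths):.1f}, max={max(lengths)}")
```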
So, we ended up collecting 10,000 instruction demonstration pairs from Surge. And this is what the task distribution looked like. It very much follows the task distribution of InstructGPT, except for the coding part, which was the "other" category over there. And these are the numbers of examples we collected for each of these tasks.
And here I'm showing the average length for each of these task categories. One thing I wanted to highlight, which was very surprising to me, is that chat is actually one of the shortest prompt length categories for us, while for OpenAI it is actually one of the longest prompt length categories.
That was very interesting. Obviously, at the time, we did not think much about it. But when we started training models and looking at the evaluation results, we started asking: if we had to go back and change things, how would we change that?
And so, these were, like, things that we started, you know, looking at more carefully after we had already collected the data set. So, here are examples of what that data set looked like. You know, classification, generation, brainstorming. I'm sure you all must have seen at least some of these kind of examples of instruction demonstration data sets.
So, it has everything you would expect from NLP-style tasks, but also more open-ended chatty tasks as well. Okay. So, here are some details about the task force that Surge used to generate this dataset. We requested a U.S.-based task force mainly because, like I said, we just wanted to replicate what InstructGPT was doing.
And based on Anthropic's and OpenAI's papers, it seemed like they preferred going with a U.S.-based task force. The gender was equally divided, and the age range was broad, going all the way from 19 to 62. And people's educational backgrounds ranged from a technical degree to a Ph.D.
So, Ph.D. was mainly for tasks like math, coding, and so on. Okay. So, now I wanted to, like, switch gears a little bit and talk about this data set that we collected for RLHF or for human preferences before I get into, like, you know, the experiments we ran with this supervised fine-tuning data set and what results we got.
So, again, over here, while we were collecting the human preference dataset, we had to come up with the specifications of these datasets. Just to contrast this with how it's different from SFT: in the SFT dataset, both the input and the output are written by humans.
In this case, the human writes the input, and the output comes from models as responses, but the human just ranks or rates them on a certain scale. So, essentially, we had to decide what the task distribution looks like for the RLHF data. Is it going to be the same as for supervised fine-tuning?
What about the length distribution? And should we do single-turn versus multi-turn? In InstructGPT, it was mainly single-turn. So, if we are trying to replicate InstructGPT, we would have to go with single-turn. But if we are trying to replicate something like ChatGPT, it would have to be a multi-turn dialogue.
And then we also had to decide on the dimensions of helpfulness, honesty, and harmlessness. These are the HHH that Anthropic follows — OpenAI puts it as helpfulness, truthfulness, and harmlessness. And we also had to decide: are annotators going to rate each of the responses individually,
or are they going to rank them? And what are the implications of deciding one way or the other? So, we started by doing a pilot study again. We took 300 prompts from the Self-Instruct dataset, the dataset that was released at the end of last year, generated model responses from our models, and then gave those to data vendors to rate the responses of the models.
And we used this Anthropic template on the left, which essentially asks the human to choose the most helpful and honest response. These are the responses from model A and model B, and there is a scale which also works as sort of a ranking, in the sense that one to four is decreasingly model A and five to eight is increasingly model B.
And one other thing we had to decide on is how much data we should collect. Again, this is from the InstructGPT paper. As you can see, they have the train and validation splits for each of the three steps, which are the SFT, training the reward model, and the PPO.
Each of these is on the order of tens of thousands, and overall this combined process of RLHF comes up to about 100,000 examples. Great. Okay. So, once we got this pilot study data back, we sat down to look at it. I looked at it manually, and I felt that I did not agree with most of the answers that the annotators from each of these companies were providing.
And so, I was thinking, I don't think this is high quality at all. So, I told my team, let's go and rate it within ourselves. And we rated about 100 examples or so.
We followed a similar template of one to four and five to eight. And the takeaway was that even we did not agree amongst each other. Essentially, our models earlier in the year were so bad that we were breaking ties arbitrarily.
You're deciding between, should it be a three versus a seven, or something like that. If they're equally bad, it's hard to decide which one is better, right? So, we were breaking some of these ties arbitrarily, and as you can see, there was barely any agreement or correlation among our outputs.
And when I aggregated that and looked at how well we correlated with, for example, Surge and Scale AI, we found that our maximum overlap was with Scale AI compared to, say, Surge.
Okay. So, we ended up collecting 20,000 dialogues. We decided to go with multi-turn, and because it was multi-turn, you would have 20,000 overall dialogues, but the number of prompts would be 80,000, since each dialogue would have about four turns on average. So, a human would prompt it, the model would respond, the human would rate the response and then ask the follow-up question.
And then, again, the model would, like, you know, generate two responses, and that is how it would go on. And so, the task distribution we decided to follow was a little bit different from what we had for supervised fine tuning. And the reason behind that was that we wanted to focus more on tasks that were, like, factual, so that, you know, essentially, this is more about making the model learn, like, between positive and negative signals.
So, making the model, like, discriminate between, like, you know, what is factual, what is not, what is helpful, what is not, and what is harmless and what is not. And, like, you know, for example, tasks like generation and brainstorming, there's no one correct answer. Like, you know, everyone can come up with, like, different lists or recipes, and, you know, it's hard to say, is this the best answer?
Is this the most helpful answer? But if you ask, like, a factual question, it's, like, very clear what is correct and what is not. So, that was kind of, like, our reasoning behind doing this. And so, this is a task distribution that we came up with for collecting the human preference dataset.
Also about the length: because we are doing this in a multi-turn setting and we wanted to make sure the entire dialogue could fit into the context length of the models, we decided to ask them to keep the overall dialogue shorter than 2048 tokens.
And it was multi-turn with an average of four turns per dialogue. Then, obviously, we also had to decide on the dimensions — whether we are going for helpfulness over harmlessness or honesty. So, we followed the instructions from these OpenAI guidelines. I'm not sure if I can pull this up.
That would be nice. Okay. Great. So, OpenAI has this public document of labeling instructions that they shared with their annotators. Like I said, they have helpful, truthful, and harmless, but then they also have this thing — how do I scroll down?
Okay. So, they have definitions of what they mean by helpfulness, what they mean by truthfulness, and what they mean by harmlessness. In our case, because our models were not as good, we decided to focus on helpfulness and truthfulness. And when annotators had to break ties, OpenAI says to choose truthfulness and harmlessness over helpfulness — let me see that.
Yeah. So, they wanted to prioritize harmlessness and truthfulness over helpfulness, but we went the other way around. We said we wanted to prioritize helpfulness over honesty or harmlessness. I mean, we weren't even focusing on harmlessness, because we just wanted to get our model to a certain capability level before we start thinking about that.
But, yeah, this is really a very good document — it defines what the annotator should be looking at and how they should break ties when the model responses are very close. For deciding what kind of template we should use for collecting these annotations, we started off with the Anthropic template that I showed a few slides earlier, which was on a scale of one to eight, essentially ranking between the two models.
And then Llama 2 came out while we were in this iterative process. Our iterative process was essentially that we would give an endpoint to the vendor, and the annotators in their managed task force would prompt these endpoints. The model would generate two responses.
They would follow the instructions and give the rating for each of those model responses, then follow up with the second prompt, and the conversation would go on. And then they would give us the data at the end of that week.
We would fine-tune our model on that data so that the model would hopefully be better, and then we would give a better endpoint to them for the next week to continue this process. So it's very iterative, and they have to adapt to the model getting better week by week.
So, yeah, for one or two weeks we used the Anthropic scale for collecting the dataset. But then Llama 2 came out, and their results clearly showed that they were using this much simpler scale of just 1 to 4.
So they were, like, you know, choosing which one is a better response between the two responses and then seeing how much better it is. So is it, like, significantly better or is it only slightly better? And so that was the ranking of, like, scale 1 to 4. So here are examples of data that we collected.
So on the left, you can see that it is asking about, like, you know, basically human is prompting with a question and then the bot generates a response. So this is the response that the human chose at this turn. And then the human, you know, follows up with the second prompt.
And then this is the bot response that was chosen by this human, and this is the rejected bot response. And this is giving a response margin of 3, which says they are quite a bit different — 4 means very different and 1 means only slightly different.
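Before moving to the second example, here's an illustrative (not exact) sketch of what one record in such a multi-turn preference dataset might look like; the field names and contents are made up for illustration.

```python
# An illustrative schema for one multi-turn preference record: the dialogue
# history so far, the two model responses at the current turn, which one the
# annotator chose, and the 1-4 response margin.
preference_record = {
    "history": [
        {"role": "user", "content": "What is the capital of Australia?"},
        {"role": "assistant", "content": "The capital of Australia is Canberra."},
        {"role": "user", "content": "How large is its population?"},
    ],
    "chosen": "Canberra has a population of roughly 450,000 people.",
    "rejected": "I'm not sure, but it is a very big city.",
    "response_margin": 3,  # 1 = only slightly better, 4 = significantly better
}
```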
And then here on the right-hand side is more about sort of generation brainstorming kind of example where the human is asking, like, can you write a text message wishing your husband a happy anniversary? And then the bot writes something. I guess my thing messed up the emojis. But, you know, then the human follows up with saying, hey, you missed this important detail, which is, you know, they have been married for eight years.
And so this is the chosen bot response, and this is the rejected one, between which the human chose. And as you can see, they are both quite good, so the response margin is just 1 — they're only slightly different. Okay. Sounds good. So now I'm going to talk about another recipe that we tried, which is using synthetic datasets for distillation of AI alignment. This is basically the paper we released last week called Zephyr, a 7-billion-parameter model which actually beat ChatGPT.
And this builds on top of the Mistral model. Basically, we recreated some of the steps that were in the InstructGPT paper, but now using synthetic datasets. The first step is supervised fine-tuning on a dataset —
in this case, we used UltraChat. This is the dataset I showed a few slides earlier for supervised fine-tuning, where a human was brainstorming and gathering the material and then chatting with this GPT-4 model to generate multiple different outputs for the instruction.
That is how that dataset, which is called UltraChat, was collected, and we used it for fine-tuning our model. The second step is response generation and AI ranking. In this case, we used UltraFeedback, which is a dataset that was released.
The way this dataset was constructed was that they took some prompts from ShareGPT and some of these different SFT datasets that were already out there, and then gave them to four different powerful models, like PaLM 2, Claude 2, GPT-4, and so on.
And then they asked GPT-4 to rank each of those four responses. The one that is the best is the one that GPT-4 ranks the highest: each response is scored individually on a scale of 1 to 10, and the one that gets the maximum score is the best response.
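As described a bit further on, such scored records get reduced to (chosen, rejected) preference pairs — the chosen response being the top-scored completion and the rejected one a random pick from the remaining completions. A small sketch with illustrative field names:

```python
import random

# Turn an UltraFeedback-style record (one prompt, four completions, each scored
# 1-10 by GPT-4) into a (chosen, rejected) preference pair.
def to_preference_pair(record: dict) -> dict:
    completions = sorted(record["completions"], key=lambda c: c["score"], reverse=True)
    chosen = completions[0]                    # highest-scored completion
    rejected = random.choice(completions[1:])  # random one of the rest
    return {
        "prompt": record["prompt"],
        "chosen": chosen["text"],
        "rejected": rejected["text"],
    }

record = {
    "prompt": "Explain what a reward model is in one paragraph.",
    "completions": [
        {"model": "model_a", "text": "...", "score": 8},
        {"model": "model_b", "text": "...", "score": 6},
        {"model": "model_c", "text": "...", "score": 9},
        {"model": "model_d", "text": "...", "score": 4},
    ],
}
print(to_preference_pair(record))
```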
And then finally, we did something called DPO, which you might have heard of because it came out of Stanford. It's this kind of alternative to RLHF: direct preference optimization. Instead of doing that iterative reinforcement learning process, you directly optimize on the chosen response.
So we just take that and fine-tune our model directly on the chosen response, and the rejected one that we use is a random response from the other three responses. Okay. So I'm going to talk a little bit about experiments and evaluation for each of these recipes.
One is collecting everything with, like, humans involved. And the second one is everything which is synthetic. But then before I discuss evaluation, I wanted to talk about, like, what are the benchmarks that we are evaluating on and how good are these benchmarks for evaluating chatbots. And to think about evaluation, we need to first think about how are we training these models.
So, today, all the models that are trained more or less have these four ways of learning. The first one is pre-training the language model — essentially, you're predicting the next token. Examples of these are GPT-3, OPT, and so on: the foundation models. The second type of learning is in-context learning, or prompt-based learning.
In this case, you're, like, just giving a new kind of task in the context of the model and then, you know, ask it to, like, you know, do that on new examples. So, like, if you wanted to write a poem, for example, for GPT-3, you would have written that in the context and then it would have generated a new poem on some other topic.
The third type of learning is the supervised fine tuning, which was kind of, like, the first step of training a chatbot. In this case, you're, like, fine tuning on the instruction following data and then you want these language models, which are just pre-trained to predict the next token to become chatty and to, like, generate open-ended responses.
And then, finally, the fourth one is reinforcement learning from human feedback, which is nudging the language model towards the values you desire. Examples include Llama 2 Chat from Meta. For the first two types of training, we have a lot of benchmarks —
like Stanford HELM, Google's BIG-bench, or even the Open LLM Leaderboard. But for the other two types of learning, which are supervised fine-tuning and reinforcement learning from human feedback — the parts of this recipe for training a chatbot — there are not a lot of leaderboards or evaluation benchmarks available.
But there are some available, and I wanted to highlight some of those. So, essentially, steps three and four here map to step one over here, which is helpfulness, and then steps two and three over here, which is about nudging the model towards being more harmless.
So, if you had to evaluate the chatbot for each of these steps, you would have to think about how you evaluate instruction following or chattiness, how you evaluate the reward model, which is essentially a classifier, and finally how you evaluate for harmlessness, which is done by red teaming or adversarially prompting the language model.
For the first step, you would have to see: does the model generate useful responses on the topic, and are they open-ended? One example of a prompt you would use to evaluate the model would be: brainstorm a list of New Year's resolutions.
Examples of benchmarks and evaluation boards that look at this sort of supervised fine-tuning include Hugging Face's leaderboard with Elo ratings. Elo is this rating system used in chess, where you pair one player against another and you want to rank the players as they play tournaments against each other.
In a similar sense, we take these chatbots and put them in a pairwise setting. We partnered with Scale AI, and they provided humans to annotate which response is better. And we did that for every single combination — so it was N choose 2, where N is the number of prompts we were looking at.
So, we generate the N-choose-2 combinations, and we rate each of them, and these are the Elo ratings that we get out of it. And this column here shows what rating you would get if you used GPT-4 as a proxy for humans.
So, instead of humans sitting and rating each of those, you ask GPT-4 to select which is the better response. And this first table shows the results if no ties were allowed, and this table shows the results if ties were allowed.
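For reference, here's a minimal sketch of how Elo ratings can be computed from pairwise preference annotations like these; the K-factor and starting rating are conventional defaults, not necessarily what the leaderboard used.

```python
def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
# Each comparison: (model A, model B, outcome for A).
comparisons = [("model_a", "model_b", 1.0), ("model_b", "model_c", 0.5), ("model_a", "model_c", 1.0)]
for a, b, score_a in comparisons:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)
print(ratings)
```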
Another example is this leaderboard from Stanford, the AlpacaEval leaderboard. They're doing something very similar, in the sense that they have GPT-4 and Claude as evaluators, they do a pairwise evaluation of these chatbot models, and they report the win rate of which model wins against the other.
There's also the LMSYS leaderboard from Berkeley, which has this thing called the Chatbot Arena, which is essentially a publicly crowdsourced leaderboard where you can go chat with any of their models and then rate which one was more helpful and which one was better.
This, again, has a leaderboard of Elo ratings, because it is done in a pairwise setting. There's another benchmark from LMSYS called MT-Bench, the multi-turn benchmark. This is the first multi-turn dialogue benchmark evaluating chatbots, and it has just 80 examples across a bunch of categories.
Essentially, the way it works is that the first turn, or the first prompt from the benchmark, is given to the model, and then GPT-4 is asked to score the model's response on a scale of 1 to 10. Then it is followed up by another prompt — the multi-turn prompt — which is related to the question but might not be related to the model's response, because the benchmark is already constructed and they always follow up with the same prompt for every bot.
And then, again, GPT-4 evaluates how good the second turn of the response was. So, this is the consolidated leaderboard from LMSYS, showing both the Arena Elo rating as well as MT-Bench scores. Those are scores aggregated across all 80 examples, with GPT-4 scoring from 1 to 10, essentially.
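A rough sketch of that single-answer, per-turn grading setup is below; the judge prompt wording is illustrative, not the exact MT-Bench template, and `call_judge` stands in for the GPT-4 call.

```python
def call_judge(prompt: str) -> str:
    raise NotImplementedError("plug in your GPT-4 / judge API call here")

def score_turn(question: str, answer: str) -> int:
    judge_prompt = (
        "Rate the following assistant answer on a scale of 1 to 10 for helpfulness, "
        "relevance, and accuracy. Reply with just the number.\n\n"
        f"Question: {question}\nAnswer: {answer}\nRating:"
    )
    return int(call_judge(judge_prompt).strip())

# Two fixed turns per example; the second turn is the same follow-up for every model.
turn1_q, turn2_q = "Write a short story about a robot.", "Now rewrite it as a poem."
turn1_a = "..."  # candidate model's first-turn answer
turn2_a = "..."  # candidate model's second-turn answer
scores = [score_turn(turn1_q, turn1_a), score_turn(turn2_q, turn2_a)]
print(sum(scores) / len(scores))  # per-model average across turns
```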
Cool. So, the second step in our evaluating-a-chatbot chart was to think about how you evaluate a reward model. You have this human preference dataset collected, and you train this reward model, which is essentially a classifier, to discriminate between truthful and untruthful responses — or, can it rank helpful responses higher than less helpful ones?
And there's essentially no open-source leaderboard available for evaluating these preference or reward models. But internally at Hugging Face, we have our own datasets for evaluating them, so that we know that as we add more human preference data, our models are actually getting better.
So, this is essentially we are evaluating on these open source data sets, which is the Anthropic Helpful data set, the Open Assistant data set, the Stanford's Human Preference data set, and also the Learning to Summarize data sets from the very first paper from OpenAI, which was looking at Learning to Summarize.
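The internal check being described can be sketched roughly like this: for each (chosen, rejected) pair in one of those open preference datasets, ask whether the reward model scores the chosen response higher, and report that accuracy. The model name is a placeholder, and the single-logit classifier head is one common convention for reward models, not necessarily the exact setup used here.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-org/my-reward-model")  # placeholder name
reward_model = AutoModelForSequenceClassification.from_pretrained("my-org/my-reward-model")
reward_model.eval()

def reward(prompt: str, response: str) -> float:
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0, 0].item()  # single scalar reward

pairs = [{"prompt": "...", "chosen": "...", "rejected": "..."}]  # e.g. pairs from an open dataset
correct = sum(reward(p["prompt"], p["chosen"]) > reward(p["prompt"], p["rejected"]) for p in pairs)
print(f"preference accuracy: {correct / len(pairs):.3f}")
```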
This basically tells us how good our reward model is. And then, finally, the third type of evaluation is red teaming. In this case, you want to craft prompts in a way that could surface model vulnerabilities and emerging capabilities.
For example, if you ask "how do I plan a bank robbery?", is the model actually helping you with that? You're trying to elicit undesired behavior from the model. And unfortunately, there's no open-source leaderboard available for this. There's just one dataset from Anthropic, which actually has both helpfulness and harmlessness included —
the HH dataset from Anthropic. That's the only open-source dataset available for red teaming, but there's no leaderboard for it. So, this was a blog post that I wrote earlier in the year highlighting this gap and putting out an announcement saying we should get together and build a dataset for red teaming.
You may have heard of the DEF CON red teaming challenge — basically, crowdsourcing some of this red teaming work kind of came out of that. Okay. So, now that we have discussed evaluation and benchmarks and leaderboards, I'm going to talk about results and what they looked like on some of these benchmarks.
So, here I'm showing the results for Llama 2 13 billion on the Open LLM Leaderboard from Hugging Face. In this case, I was using the dataset that we collected from Surge, the 10,000 instruction demonstrations. And here, these are the four NLP-focused datasets we have as part of the Open LLM Leaderboard: the ARC Challenge, MMLU, HellaSwag, and TruthfulQA.
And this is how well our model does — all of these numbers are essentially accuracy. And this is the LIMA model — "less is more for alignment" — that came from Meta. They used just 1,000 examples of high-quality instructions and showed that you can get a very good chatbot from just 1,000 examples.
And this is taking the longest examples from Open Assistant and just choosing the top 500 of them. We found that our model does slightly better than both LIMA and Open Assistant, except on TruthfulQA, where LIMA and Open Assistant did better than us.
On MT-Bench, though, we found the opposite was true. Remember that MT-Bench from LMSYS has turn zero and turn one — so this is reporting the first response, and GPT-4 is essentially scoring, on a scale of 1 to 10, how good these models are on the first dialogue turn and the second dialogue turn, plus the average score.
And this is kind of counterintuitive compared to what we found on the automatic evals: MT-Bench says that our model, trained on the data we collected from Surge, is not very good, and in fact LIMA and Open Assistant, which use a fraction of the data we had, are much better.
So, this was surprising. Then I looked into whether length is a factor in this. I looked at each of those datasets and at the average length of the prompts in each of them,
and it seems like there is a very wide range. For example, in our dataset the average prompt length was just 211, while LIMA's is about double that and Open Assistant's is almost double as well. So, then I did an experiment where I controlled for the size of the data but let the prompt length vary, to see whether that affects performance.
In particular, as I highlighted before, our chat category was really short. And we actually found that length did not really matter that much, except on the TruthfulQA dataset. Even for HellaSwag, even though the difference looks small, it's actually just in the third decimal place.
And over here, you can see the actual difference was only on TruthfulQA, which preferred models that were generating longer responses. But on the other hand, the MT-Bench score was, again, not aligned or correlated with what we found with these automatic metrics and evaluations, in the sense that GPT-4 actually did not prefer longer responses.
So, this was a little bit counterintuitive, and we need to dig more into what's going on here. But we found that shorter responses were rated better than longer responses, although there was not much of a difference.
So, the other experiment and evaluation we did was varying the amount of data and seeing how incrementally adding more data affects performance. This is, again, on the Open LLM Leaderboard from Hugging Face, which looks at some of these standard NLP benchmarks and reports accuracy.
This starts with just 10% of all the data we collected from Surge. And as you can see, on all these benchmarks it actually saturates very quickly, and on some of them you actually lose performance if you keep adding data.
This aligns with the diminishing-returns plot from when we started collecting data, which said that if you have just a few thousand examples of very high-quality instruction-following data, that's good enough, and performance saturates or plateaus very quickly after that.
That is what we got as well. And I think this is one place where MT-Bench actually correlated with the automated metrics: GPT-4 also showed that after about 4,000 examples there was barely any gain in performance — actually decreasing performance — for the model.
Okay, great. So, that was all the results on using, like, these human curated very high quality data set. What about, like, results from distillation from these synthetic data sets? In particular, we use UltraChat for supervised fine tuning and UltraFeedback for DPO. And so, these are the results. So, this is, like, basically just work that was released last week.
We haven't yet released the code and the dataset, which we are going to do this week. Here I'm highlighting that Zephyr is the model we released. We used Mistral as the foundation model, fine-tuned it using UltraChat, and then did DPO on UltraFeedback.
And as you can see, it actually beats ChatGPT on the AlpacaEval leaderboard. It also beats most of the 13-billion-parameter open models and is quite competitive with Claude 2, again on the AlpacaEval leaderboard.
So, this is the model which has both SFT and DPO. We did an ablation on how useful SFT is and how useful DPO is, because it's a two-step process: first you fine-tune on instruction demonstrations, then you fine-tune on human preferences.
The first row over here shows what happens if you directly do DPO on UltraFeedback and do not do the supervised fine-tuning. And you see that that's really bad — it doesn't work at all. The second one shows what happens if you just do supervised fine-tuning and do not do DPO.
And this, which is just the first step, actually works decently well — it basically gets you to 80 or 90% of the overall performance. And finally, this is doing supervised fine-tuning on the human preference data. So, you take this row and do another round of supervised fine-tuning, but on the data of human preferences.
Remember you had the chosen and the rejected responses. So, you give all the dialogue history, and then the expected completion is the chosen dialogue response. In this case, you're not really doing the discriminative thing — you're still doing the SFT process, but you're using the dataset in a smart way so that it follows the template of what supervised fine-tuning does.
And we found that that, as well, wasn't very helpful. So, the best recipe, obviously, is SFT plus DPO: doing SFT first on UltraChat and then DPO on UltraFeedback. Both of these datasets are synthetic. And it's only slightly better than just doing SFT.
Okay. So, I'm getting to the final section of my talk. We have seen a lot of these evaluations, benchmarks, and leaderboards, and many of them are starting to adopt these powerful models, like Claude 2 and GPT-4, and use them as a proxy for humans in evaluation.
So, what are the quirks associated with doing that, and are there things we should be considering when we do this at a very large scale? When we used GPT-4 as an evaluator, we found that it actually has a positional bias.
In particular, it is predisposed to generating a rating of 1 in a preference collection setting. This chart over here shows the average rating for model responses across the entire dataset, while on the right, humans are more or less uniform.
You would expect that this distribution looks much better than this one, which is skewed to the right. So, then what we did is we prompted GPT-4 and said, hey, you have this left bias and you always generate a rating of 1, be aware of this bias — and when you tell it to debias itself, it actually flips the bias in the opposite direction.
So, then it becomes more self-aware, in the sense that it knows it has this bias, and now it starts generating more ratings of 5 and 6. One way of getting rid of this is to make sure that each response is equally likely to be in the right and the left position.
That kind of dilutes the bias it has toward each of these positions. We also found that prompting GPT-4 to generate scores — asking it to score each response individually, like MT-Bench does, instead of ranking in a pairwise setting — alleviates the problem a little bit, but does not completely get rid of it.
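A small sketch of that position-swapping mitigation: ask the judge twice with the two orders swapped, and only count a win when the preference is consistent across both orders (`judge_prefers_first` stands in for the GPT-4 call).

```python
import random

def judge_prefers_first(prompt: str, first: str, second: str) -> bool:
    raise NotImplementedError("plug in your judge model call here")

def debiased_preference(prompt: str, resp_a: str, resp_b: str) -> str:
    a_first = judge_prefers_first(prompt, resp_a, resp_b)  # A shown in the first slot
    b_first = judge_prefers_first(prompt, resp_b, resp_a)  # B shown in the first slot
    if a_first and not b_first:
        return "A"  # A preferred regardless of position
    if b_first and not a_first:
        return "B"  # B preferred regardless of position
    return random.choice(["A", "B"])  # preference flipped with position: treat as a tie
```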
We also found evidence of doping between training and evaluation. In particular, we found that GPT-4 prefers models that were trained on GPT-4's data. All these models here were trained on data that was bootstrapped using GPT-4, and GPT-4 prefers that over human-written data, which is more factual and higher quality but might be very succinct and to the point.
So, this is one thing we should be aware of when using GPT-4 as an evaluator. The other thing, which concurs with findings from other papers, is that GPT-4 prefers models with higher diversity — that is, the number of unique tokens in the response — and longer responses.
So, if you have this list-of-lists kind of response, just like ChatGPT produces, GPT-4 is predisposed to rating it higher compared to a model that does not generate that. We also found that GPT-4 has poor correlation with humans on low-entropy tasks, such as math, coding, and reasoning.
Remember that leaderboard I showed you where we compared the GPT-4 Elo rating to humans? We dove deeper into how that comparison looks on each of these different task distributions and categories, and this is what it looks like. It seems like GPT-4 has lower correlation with humans on some of these more factual tasks — the kind where you expect one correct answer.
And it is actually highly correlated with humans on the higher-entropy tasks like brainstorming and creative generation, which was kind of counterintuitive, because you could have so many different ways of coming up with a recipe or a list of something. But that's where the ratings of GPT-4 and humans are more correlated.
Okay. So, the final thing is takeaways. There's a bunch of them, but let's try to break it down. Essentially, we discussed how we came up with the steps for data curation for supervised fine-tuning and RLHF. It involves several critical factors, such as how much data you need to collect,
what the length of the prompts and the distribution of those lengths is, the task distribution, and what the role of humans is — do you need synthetic data, completely manually curated data, or something in the middle? And we looked at the many tools there are for efficient fine-tuning of open-source LLMs.
From the SFT results, we found that TruthfulQA was the main differentiating benchmark for the automated eval metrics. And we found that MT-Bench scores were actually not correlated with these automated metrics — only on some of these models did we find that they were correlated.
For the distillation results, which is the Zephyr 7B work where we fine-tune on synthetic data, we found that SFT on AI-generated data plus distilled DPO on AI feedback data actually beats ChatGPT, even though the model is just 7 billion parameters.
And then we found a benchmarking gap in assessing RLHF models: in particular, we don't have benchmarks for assessing reward models, and we also don't have open-source benchmarks for evaluating red teaming and model vulnerabilities. Then, finally, we dove deeper into the quirks of using GPT-4 or some of these powerful LLMs as an evaluator.
Some of those were: it prefers models trained on GPT-4-like data; it has a left positional bias; and it has higher correlation with humans on creative tasks compared to coding or reasoning tasks. And my work has been covered in a New York Times article, which talks about the secret ingredient of ChatGPT, which is alignment.
I'm also part of the United Nations Advisory Board that was announced last week. So, really humbled to be part of that. Here are some blog posts. You know, basically, like, yeah, we kind of, like, did not publish a whole lot this year. But we wrote a bunch of blog posts highlighting what we are releasing and working on.
And also, some of these are part of the talk I just discussed. And this is the H4 team — I'm grateful to be part of it. And thanks for listening. >> When you get the alternative responses from the model, do you select really high temperatures, or do you keep it pretty close to the temperature that's also used in the final product?
>> Yeah. So, we tried experimenting with different temperatures, but we actually found that just using a different sampling strategy worked better — like using different values of top-p and top-k and some combination of those, as opposed to just relying on temperature.
Yeah. So, I think for red teaming at scale, there's actually a paper that came out recently called GPTFuzzer that bootstraps and uses these powerful LLMs to jailbreak other LLMs. And there was also a DeepMind paper, I think about one and a half to two years ago, on red teaming large language models with language models.
So, how do you red team and evaluate a language model by using another powerful language model? I think that is the way to go in terms of scale. And what was the second question? Yeah. So, I think one thing is this idea of emerging capabilities, which is essentially — as you scale up, and this is a trend that we are seeing — there are things that these models do, capabilities that emerge, that were not there in the smaller models.
I think examples are chain-of-thought reasoning, which GPT-2 or GPT was not capable of doing. The other example is few-shot prompting, which we first saw in GPT-3: you could give it a completely new task, not update its parameters in any way, but just put it as part of the prompt.
And then it just learns the task and can do it on any number of examples, right? And so, labeling and all these things started coming up — like using GPT-3 as a labeler — when we discovered that. The other example is manipulation.
I don't think any open-source models are capable of that yet, but I know Anthropic and OpenAI are focusing on deception and manipulation, because when you start chatting with these models you start treating them as a companion — especially with something like Character AI — and you might start confiding in them, sharing information that you probably shouldn't, and then they can use it against you, maybe.
An example of that: I think recently we saw that GPT-4 actually manipulated someone into reading a CAPTCHA for it and telling it what the CAPTCHA said. So, that's a really concrete example of manipulation, and it seems like these models are now capable of that.
I don't think open-source models are there yet, but these are the kinds of things that come out, the vulnerabilities that would surface, when you do this at scale. Yeah. So, I would say it's less about that — it's more about open-sourcing a dataset that is crafted to elicit this behavior.
It's more about the kind of harms that we should be thinking about. So, it's more about, like, you know, hallucinating or plagiarism, manipulation, you know, trying to leak PII information, people's credit card, SSN, things like that. It's more about, like, thinking about these different dimensions and giving concrete examples of how these models can, you know, elicit this behavior.
But I think what you are trying to, like, talk about is that, you know, what if we gave them concrete ways, like, concrete prompts on how you jailbreak, and then they can go and try to do that. I think first thing is, like, you know, while we are doing this, we would have evaluated our models, and we would then start thinking about guardrails and safety ourselves.
And if, indeed, the dataset is so good that we can say a lot of these powerful models are failing on it, then obviously you don't open source it instantly. You think about the best way to put it out there: first securing the model and making sure that it does not elicit that kind of behavior, and then sharing it once you have already crossed that bridge and can say, yeah, my model is safeguarded against that.
So it's more like a process, a gradient of things that you need to do. Yeah, so you're asking: when you're using synthetic data bootstrapped from other language models, have you seen some kind of mode collapse or something like that?
So, actually, so far, it's been, like, clear that these are good, like, these, these actually turn, like, you know, regular chatbots and, like, regular language models into chatbots, and which are as good as the experience that you get by chatting with chat GPD. But although, like, you know, like, the kind of the quirks that I raised, which is, like, you know, when you have these models, and then you, like, now put them on a benchmark, and then you see that suddenly, it's like 90%, it might just be because you use the model that was the evaluator to generate the data and then create this model and that in turn, like this doping thing, right.
So that is one thing that's important to think about. The other thing, what was I going to say? I forgot. Yeah, the other thing is the licensing part, which is not really related to what you were asking, but essentially we could not open-source these for commercial use, so it's still a restrictive license.
You cannot use them for building and selling applications down the line, but they're still good as research artifacts. I think we would have seen these kinds of collapses happen if commercial use had been allowed. Actually, recently we did see something related: there's this company called Daxter, which was using GPT-4 for summarization, and they replaced it with the open-source model Mistral.
They said their customers haven't complained, they're saving a ton of money, and it just seems to work fine, it's just as good. Not that I'm saying Mistral is trained on any of this synthetic data, but it's an example of the kind of thing that becomes very clear by doing this sort of A/B testing, where you replace one model with another and see how that affects things.
>> I have a question on Zoom. It seems like another axis you might beat ChatGPT on is cost, so I wondered what your total budget, or total cost, was to produce the model that beat them. >> Oh, so Zephyr 7B was just four hours of training on 16 A100s.
So that's less than $50, I guess, because we used synthetic datasets that were already open source, UltraChat and UltraFeedback. >> But what about the overall cost, all the people and everything? >> All the people and everything, in the sense that, I guess UltraChat and UltraFeedback might have reported some cost, but they were created mostly synthetically with very little human intervention.
I don't know if they report that; I haven't looked into it. But I would say it was still much more cost-efficient than what we spent buying data from Surge and Scale AI. We spent about half a million dollars buying about 20,000 prompts of human preferences, those 20,000 dialogues, and about 10,000 instruction demonstrations.
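To put those figures together, here is a quick back-of-the-envelope comparison. The per-GPU-hour rate is not something stated in the talk; it is simply implied by dividing the quoted "less than $50" by the quoted GPU-hours.

```latex
\begin{align*}
  \text{compute} &: 16 \text{ A100s} \times 4\,\text{h} = 64 \text{ GPU-hours};
    \quad \$50 / 64 \approx \$0.78 \text{ per GPU-hour (implied rate)} \\
  \text{human data} &: \$500{,}000 \,/\, (20{,}000 + 10{,}000) \text{ examples}
    \approx \$16.7 \text{ per example}
\end{align*}
```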
So that was quite a bit. I'm curious about is the scale that you use for evaluating the bias for GPT-4. So I was like one seven on the slide. Yeah. Oh, so yeah, this was the entropic scale. Like remember that like one, two, four is decreasingly A and five to eight is increasingly B.
>> And you were giving it both model outputs at once? >> Yes, exactly. >> And for these types of evaluations, how sensitive do you find the evaluators to the prompt? Instead of explicitly telling it that it has to account for this left bias and right bias, what's stopping you from saying the distribution should be uniform, or the distribution should be normal, and just iterating to see what those should be?
>> Yeah, I think that's a good point, in the sense that we did not study which particular tasks or prompts were pushing GPT-4 to generate this kind of bias. Although I would say this was also observed by LMSYS and is part of their findings as well.
So the LMSYS paper has that too. But it would be interesting, even surprising, if it shows this bias on very long prompts, or on math prompts or something, which are just hard to evaluate when the two responses differ a lot. At least as a human, when I see a bunch of code on one side and a bunch of code on the other, both trying to do the same thing but with very different approaches, it's very hard to say which is better.
It's very hard to evaluate them. Right. So, yeah, we haven't looked into that. >> Perhaps another thing: do you think the order matters, like which output you give to GPT-4 first? >> Yeah, that was basically the takeaway. It's interesting, because humans usually have a recency bias, which means the last thing you read is the thing you remember.
So you're just inclined to choose that one more often. GPT-4, though, actually had a left bias, toward the thing it saw first, in some sense. I think LMSYS was the one that proposed that because of its left-to-right training, maybe that's why it has that kind of bias.
The way we alleviated that was by having every model's output be equally likely to appear on the left and on the right-hand side. So if we're comparing Alpaca and Vicuna, instead of always putting Alpaca on the left and Vicuna on the right, we would randomly switch them.
So both of them are equally likely to occur in both positions. >> And you still saw the left bias? >> If you just ask it to rate on a scale of 1 to 5, yes. But if you tell it, hey, you have this bias, and make it aware of it, then it flips and generates something like the opposite.
>> Are there other approaches where you prompt the model by shuffling the prompts and then de-bias the results afterwards? >> By shuffling the prompts, you mean... >> Shuffling the order in which you put in the responses? Yeah.
>> So that's what we did: we would randomly shuffle the left and the right. Basically you create nC2 combinations, where n is the number of models. Suppose you want to evaluate three models on 10 prompts: that's 3C2 = 3 model pairs.
You generate 10 responses from each of these models, one per prompt, and then put them together in that 3C2 setting, so that's the total dataset.
That gives you a comparison for each pair, and then you make sure that every model that appears on the left is equally likely to also appear on the right. So if you do model one versus model two, you also do model two versus model one on the same scale.
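To make the combinatorics concrete, here is a minimal sketch of enumerating the model pairs so that every model appears on the left and on the right equally often. The model and prompt names are placeholders, and the actual pipeline may have used random swapping rather than exhaustive ordering.

```python
from itertools import combinations

# Placeholder names; in practice these would index generations from each model per prompt.
models = ["alpaca", "vicuna", "zephyr"]          # n models -> C(n, 2) unordered pairs
prompts = [f"prompt_{i}" for i in range(10)]     # 10 evaluation prompts

comparisons = []
for prompt in prompts:
    for model_a, model_b in combinations(models, 2):
        # Schedule each pair in both orders so position bias averages out:
        comparisons.append((prompt, model_a, model_b))  # model_a shown on the left
        comparisons.append((prompt, model_b, model_a))  # model_a shown on the right

# 10 prompts * C(3, 2) pairs * 2 orderings = 60 judge calls
print(len(comparisons))
```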
>> Okay, sure. Sorry, should I keep the Zoom on? Thank you. Yes. >> So, just to see if I understand this correctly. On the reinforcement learning side, first you build a reward model: it takes text as input, and humans have given scores to that text.
It's a supervised problem where we try to predict the score from the text. Then, once I have the reward model, I move to reinforcement learning: I take a set of prompts, generate until I reach the end-of-sequence token, pump the whole thing through the reward model, and optimize against that reward.
>> Yes. >> So the rewards are very sparse, right? I only get a reward at the very end. But that's how it works. >> Yes, exactly. And it's very sample-inefficient, because you keep doing this again and again. That's why you need on the order of a hundred thousand examples for RLHF, but only about 10,000 for supervised fine-tuning. That's kind of the intuition.
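Here is a minimal sketch of that scoring step, under the assumption that the reward model is a standard sequence-classification head that outputs a single scalar. The checkpoint name is a placeholder, not the model used in the talk.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint; in practice this would be a reward model trained on
# human preference comparisons (a classifier/regressor over full dialogues).
reward_name = "my-org/my-reward-model"  # hypothetical
tokenizer = AutoTokenizer.from_pretrained(reward_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_name, num_labels=1)

def score(prompt: str, completion: str) -> float:
    """Return a single scalar reward for the full prompt + completion.

    The reward is only available once the whole sequence (ending in EOS) has been
    generated, which is why the RL signal is sparse: one number per episode.
    """
    inputs = tokenizer(prompt, completion, return_tensors="pt", truncation=True)
    with torch.no_grad():
        reward = reward_model(**inputs).logits[0, 0]
    return reward.item()
```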
>> Okay, great. Thanks so much. Very interesting talk. >> Thank you. Thank you very much.