Stanford CS25: V3 | Recipe for Training Helpful Chatbots
00:00:00.000 |
>> Hello, everyone. Today we have Nazneen from Hugging Face, who is working on AI safety 00:00:13.960 |
and alignment using reinforcement learning with human feedback. She's an expert in the 00:00:20.000 |
space of large language models and their evaluation. Before Hugging Face, she led a team of researchers 00:00:29.480 |
at Salesforce focused on building robust natural language generation systems based on LLMs, 00:00:36.880 |
and she got her Ph.D. at UT Austin in computer science. So, everyone, welcome. 00:00:45.000 |
>> Thanks for having me. So, the title of my talk today is recipes for training helpful 00:00:53.880 |
chatbots. So, here's the introduction. I was part of this team called the H4 at Hugging 00:01:00.800 |
Face, and today I'll walk you through what we built, how we decided on what we need for 00:01:07.360 |
building that. And so, essentially, what we wanted to build and the goal of the team and 00:01:11.200 |
the project since earlier this year was to figure out a recipe for H4, which stands for 00:01:17.400 |
helpful, harmless, honest, and huggy because it's Hugging Face chatbot. And so, the ingredients 00:01:23.920 |
essentially were to figure out what kind of data sets do we need for supervised fine 00:01:28.800 |
tuning and RLHF. And we wanted to not worry about pre-training. Instead, take an open 00:01:36.000 |
source pre-trained model and recreate the secret sauce of alignment on it. And the procedure 00:01:41.440 |
that we wanted to follow and replicate on open source is this figure that I'm pretty 00:01:46.560 |
sure most of you are familiar with at this point. It's from this InstructGPT paper from 00:01:50.960 |
OpenAI, which shows three steps. I'm going to go into a bit more detail on this because 00:01:58.160 |
this slide is much smaller. But this is what the outline of the talk looks like. I'll be 00:02:03.360 |
getting into the details of how we decided what kind of data, how much data, and all the details 00:02:08.760 |
of the data for supervised fine tuning. Then similarly for RLHF. Then I'm going to talk 00:02:14.640 |
about distillation of language model alignment. Then experiments with different helpfulness 00:02:21.080 |
recipes. Finally, I'll talk about evaluation of these models and quirks of using GPT-4 as 00:02:28.840 |
an evaluator. Okay, so this is the overall recipe that the InstructGPT paper from 00:02:34.680 |
OpenAI put forward as, you know, the steps for training a chatbot. So, the first step 00:02:40.720 |
over here is to do supervised fine tuning. Essentially, like, you know, you're doing 00:02:44.920 |
fine tuning with human instruction demonstration data. So, the input and the output are both 00:02:50.440 |
given by humans. The step two is, like, you know, the input is given by a human. The output 00:02:55.760 |
comes from models. And then the human just rates thumbs up, thumbs down, or ranks them. 00:03:01.080 |
And then you train a reward model, which is essentially just a classifier. And then the 00:03:04.600 |
final step three is doing fine tuning using that reward model with reinforcement learning. 00:03:10.840 |
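To make the reward-model step a bit more concrete, here is a minimal sketch of the standard pairwise ranking loss used for this kind of classifier; the function and variable names are illustrative, not the actual H4 code.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(chosen_rewards: torch.Tensor,
                        rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss for a reward model: push the scalar score of the
    human-preferred response above the score of the rejected response."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Illustrative usage (reward_model is any sequence classifier that maps
# prompt + response to a single scalar per example):
#   chosen = reward_model(prompt_plus_chosen)      # shape (batch,)
#   rejected = reward_model(prompt_plus_rejected)  # shape (batch,)
#   loss = reward_ranking_loss(chosen, rejected)
```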
And so, the way I'm looking at it, like, step one is more for, like, you know, making a 00:03:15.060 |
model into a helpful chatbot. And the steps two and three are essentially trying to add 00:03:20.080 |
those guardrails in place for harmlessness. So, let's get started with talking about helpfulness. 00:03:26.360 |
And most of my talk today will be focused on the step one. So, let's start diving deeper 00:03:34.240 |
into this. And let's start with the data set. Like, how do we decide what we need for doing 00:03:39.760 |
the supervised fine tuning? So, like, the data set for helpfulness for supervised fine 00:03:45.160 |
tuning looks somewhat like this. This is from the self-instruct paper, if you're aware of 00:03:50.320 |
that from end of last year. So, you have something that we call as a task, which then has an 00:03:55.560 |
instruction, which is essentially a request by a user asking the model to, like, fulfill 00:04:00.840 |
or, like, give a response to a certain task. And that is followed by input and output. 00:04:06.920 |
The input in this case is optional. It could just be part of the instruction. And then 00:04:11.760 |
the output is the expected output that the model should generate. But while we are doing 00:04:16.320 |
this training, the human provides the expected output that the model would have generated 00:04:21.240 |
in the actual test case. And so, here the input and the output are called instance or 00:04:27.000 |
demonstration or completion. And that's why this is called instruction demonstration. 00:04:32.840 |
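As a concrete illustration, one such record could be represented roughly like this; the field names follow the Self-Instruct convention, but the example content is made up.

```python
# One instruction-demonstration record in the spirit of the Self-Instruct
# format (instruction, optional input, output). The content is made up.
example = {
    "task": "classification",
    "instruction": "Classify the sentiment of the following movie review "
                   "as positive or negative.",
    "input": "The plot was thin, but the performances kept me hooked.",
    "output": "Positive",
}
# During supervised fine-tuning, instruction (+ input) becomes the prompt
# and output is the completion the model is trained to produce.
```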
So, this is kind of, like, just a high level landscape of what these data sets for instruction 00:04:39.720 |
demonstration look like. And you must have, you know, been familiar with at least some 00:04:44.200 |
of these. And, like, you know, the way I'm trying to put this is on this line where on 00:04:50.120 |
one side I'm showing data sets that were generated using models or more powerful language models. 00:04:56.880 |
And so, they're more synthetic data sets. On the right, I'm showing, like, human written 00:05:01.120 |
data sets. And so, these are data sets where the human wrote the input as well as the expected 00:05:09.280 |
output. And so, examples of these are, like, the Surge instruct data, which is data that we 00:05:13.280 |
at Hugging Face H4 collected by contracting with Surge AI, this company that basically had contracts 00:05:20.820 |
with annotators that were writing the inputs and outputs. But we had to give them all the 00:05:25.240 |
specifications of what kind of data we need. And then you must have heard of, like, you 00:05:29.680 |
know, obviously Open Assistant is this other community wide effort where people contributed 00:05:33.960 |
manually writing inputs and outputs. Similarly with Dolly. And then on the other end, you 00:05:39.240 |
can see, like, you know, the self instruct data set. I'm going to, like, dive into some 00:05:42.920 |
of these. How are these synthetic data sets created for helpfulness or for supervised 00:05:49.520 |
fine-tuning. So, one of the examples of how the synthetic data is created is in the Self-Instruct paper, 00:05:56.520 |
which is called bootstrapping the data. So, in this case, they start with 175 seed tasks. 00:06:02.680 |
That is, a very small data set of 175 examples with manually written 00:06:08.400 |
inputs and outputs from humans, which are added to a task pool. Then a language model, 00:06:13.480 |
like, you know, basically you bootstrap by giving that to the language model in a few-shot 00:06:17.960 |
setting and asking it to generate more data like that. And then you have another 00:06:23.200 |
language model that does this task classification. Like, you know, what kind of task is this 00:06:28.800 |
sample or the example belonging to? And finally, it also does this more fine grained classification 00:06:34.880 |
as to, like, you know, does it have, you know, output first or does it require input first 00:06:39.360 |
and so on? And because this is synthetic data and created in this, like, a very scalable 00:06:43.760 |
way, you also have to do a lot of filtering to make sure that it is very high quality. 00:06:49.560 |
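The bootstrapping loop itself can be sketched roughly like this; the callables generate_task, classify_task, and passes_filters are placeholders standing in for the LLM prompting and the quality heuristics, not the paper's actual code.

```python
import random

def self_instruct_bootstrap(seed_tasks, generate_task, classify_task,
                            passes_filters, target_size=52_000):
    """Schematic Self-Instruct loop: grow a small pool of human-written tasks
    with model-generated ones, classifying and filtering each candidate."""
    task_pool = list(seed_tasks)                  # e.g. the 175 seed tasks
    while len(task_pool) < target_size:
        few_shot = random.sample(task_pool, k=min(8, len(task_pool)))
        candidate = generate_task(few_shot)       # LLM writes a new task
        candidate["type"] = classify_task(candidate)  # e.g. task type tagging
        if passes_filters(candidate, task_pool):  # dedup / length / quality checks
            task_pool.append(candidate)
    return task_pool
```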
So, another way of generating this kind of synthetic data is what UltraChat did. And 00:06:55.840 |
in this case, they had, like, a human in the loop process. So, a human would, like, you 00:07:00.480 |
know, look up, like, either, you know, search Wikipedia or something and then come up with, 00:07:06.400 |
you know, topics that they want to generate data for. And then, you know, ask the model, 00:07:11.640 |
like, provide it with the required material that would be needed for, you know, coming 00:07:16.240 |
up with, say, question answering or summarization or any of these specific tasks. And then give 00:07:21.320 |
it to a more powerful model, like, ChatGPT or GPT-4. In this case, it was ChatGPT. And 00:07:26.040 |
then, oh, actually, GPT-4. And then you kind of, like, you know, keep doing these loops 00:07:30.520 |
of, like, you know, giving the material to the model and say, like, come up with questions 00:07:34.480 |
and answers on this particular task using all this material. And then, you know, then 00:07:39.120 |
the human looks at it and then keeps querying it and refining it more and more. So, this 00:07:43.560 |
is another way of creating synthetic data. Obviously, this has a human sitting there 00:07:47.640 |
and doing a lot more filtering in the process. Then there's another one, which is, like, 00:07:53.120 |
even less human involved, which is role playing. And this is the CAMEL dataset. In this case, 00:07:59.080 |
all that the human does is, like, come up with an idea of what task or what, you know, 00:08:05.280 |
level they want. So, at a high level, it would be, like, develop a trading bot for the stock 00:08:09.400 |
market. And there would be two LLMs. One would be role playing as an AI assistant. The other 00:08:15.680 |
would be role playing as an AI user. And then they basically just specify the task and, 00:08:20.760 |
like, let these two bots chat with each other and create a conversation dataset, which is, 00:08:25.760 |
again, like, a synthetic dataset for supervised fine tuning. 00:08:32.200 |
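A rough sketch of that role-playing loop is below; user_llm and assistant_llm are placeholder callables mapping a message history to the next message, not the CAMEL implementation.

```python
def role_play_dialogue(task, user_llm, assistant_llm, num_turns=6):
    """Schematic CAMEL-style role play: one LLM plays the user, another the
    assistant, and their exchange becomes a synthetic SFT conversation."""
    history = [{"role": "system", "content": f"Task: {task}"}]
    for _ in range(num_turns):
        user_msg = user_llm(history)              # AI user gives an instruction
        history.append({"role": "user", "content": user_msg})
        assistant_msg = assistant_llm(history)    # AI assistant responds
        history.append({"role": "assistant", "content": assistant_msg})
    return history                                # one synthetic training dialogue

# e.g. role_play_dialogue("Develop a trading bot for the stock market", u, a)
```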
So this is kind of, like, you know, just going back to this, you know, landscape. It looks 00:08:35.920 |
like, you know, people have been very creative about how to get very high quality 00:08:40.920 |
data quickly without spending a lot of money, because humans are inefficient and expensive. 00:08:46.920 |
And so, these are, like, you know, some examples that we looked at. But on the other hand, 00:08:50.440 |
we also cannot, like, you know, underestimate how good quality, like, the manually created 00:08:55.640 |
datasets are. And so, we at Hugging Face decided to, like, you know, go with everything, like, 00:09:01.840 |
very manual and, like, you know, have humans do both the input and output. Also go figure 00:09:06.080 |
out, like, what are the, you know, essential documents or, you know, other material they 00:09:10.760 |
need for coming up with creating this dataset. But when we started doing that, we were earlier 00:09:17.000 |
in the year. So, this is back in January or February of this year. And this is what the 00:09:21.000 |
landscape looked like at that time. And so, there were very few datasets available. 00:09:26.080 |
A lot of these were mostly synthetically created. So, we wanted to, like, you know, kind of 00:09:31.640 |
leverage what was existing out there. But we also had to make some really important 00:09:35.440 |
decisions because we were going to, like, pay money and make sure that the data that 00:09:39.160 |
we collect is actually useful for building the model and, you know, the applications 00:09:43.520 |
that are built on top of it. So, these are the learnings that we had from the past papers 00:09:49.200 |
that were, you know, creating these supervised fine-tuned datasets. We knew that the dataset 00:09:53.800 |
has to be in the range of tens of thousands of examples. So, this is from the self-instruct 00:09:58.840 |
dataset. And we also knew that, you know, these models that are trained on this dataset 00:10:04.720 |
show diminishing returns after just a few thousand high-quality instructions. So, you 00:10:10.040 |
don't need a lot. And then it saturates very quickly. So, these are the two findings that 00:10:14.080 |
we had when we started to, like, go collect datasets for supervised fine-tuning. 00:10:20.400 |
But we also had to give some very fine-grained specifications on what we want for our dataset. 00:10:26.000 |
In particular, we had to decide what is the task distribution we want for the data that 00:10:29.880 |
we are collecting. I mean, we know it's tens of thousands, but how many thousands of what 00:10:34.160 |
task, right? The length distribution, like, you know, should the prompt have a certain 00:10:39.040 |
dimension? Is that even an important factor? And one thing is that we had decided 00:10:44.080 |
that we want to make it high-quality and human-written, but then there were, like, options on that 00:10:48.800 |
as well. We could go with external vendors, like Surge AI, Scale AI, AWS SageMaker Ground Truth, and 00:10:54.400 |
so on. Or we could hire our own contractors from Upwork and MTurk. So, those were, like, 00:10:59.560 |
decisions that we had to make. So, let's look at each of these one by one. 00:11:04.040 |
So, because we were recreating this InstructGPT recipe for this helpful chatbot, we wanted 00:11:10.960 |
to, like, you know, take inspiration from their task distribution. So, on the left, 00:11:14.840 |
I'm showing, like, the task distribution that OpenAI used for the InstructGPT 00:11:20.600 |
paper. As you can see, that generation is, like, you know, the majority of it, followed 00:11:24.960 |
by some, you know, some of these open-ended tasks and brainstorming tasks and so on. And 00:11:29.760 |
these are examples of, like, what prompts of each of those look like. So, we decided 00:11:34.640 |
to, like, you know, just go with that. But you must have noticed that there's 00:11:38.120 |
this category called "other" in the table. And we obviously don't know what that was. 00:11:43.400 |
But so, we decided to replace that with "code." So, essentially, it would be, like, debugging, 00:11:48.020 |
asking clarification questions about the code. So, it's like code plus natural language. 00:11:52.480 |
So, this is what our final distribution looked like. 00:11:57.680 |
The second question was the length distribution. So, we also had to, like, you know, figure 00:12:01.840 |
out, like, you know, how important is the length? And should we, like, you know, have 00:12:05.240 |
a certain length distribution that we ask these companies to collect data for us? So, 00:12:10.400 |
we did a pilot study with Surge AI, Scale AI, and AWS SageMaker Ground Truth, which is more 00:12:15.400 |
like a managed service. So, it's very different from MTurk. And they have very high-quality 00:12:19.520 |
humans, like, basically writing these examples. And so, I wanted to, like, just highlight 00:12:27.040 |
that the first two rows here show what the InstructGPT length distribution 00:12:32.800 |
looks like. And as you can see, this is obviously the full data set. This is more like pilot. 00:12:36.480 |
So, like, the counts are much smaller. But you can see, like, the maximum is 2048. And 00:12:42.360 |
as you know, like, that was, like, the standard context size in the beginning of the year. 00:12:47.240 |
And then, you know, even the mean is 00:12:50.440 |
basically more or less in that range. But if you look at, you 00:12:55.080 |
know, these examples from Surge, AWS, and Scale AI, there's very high variance. So, for example, 00:13:00.880 |
AWS SageMaker, the maximum prompt length is 1036. But then, like, you know, the mean is 00:13:06.200 |
just 54. And on the other hand, with Surge, the maximum length is 500. But then the mean 00:13:13.160 |
is much higher, like, you know, 104. So, it's, like, more in the range of what we would expect 00:13:17.800 |
from, like, you know, this distribution in InstructGPT. And similarly, with Scale AI, we found 00:13:22.760 |
that, you know, the prompts were just very, very short. And so, just based on this, we 00:13:28.920 |
said that, you know, okay, we should probably just go with Surge, because, you know, that 00:13:32.560 |
seems like something that is more, you know, in the expected range and without very high variance. 00:13:38.840 |
So, we ended up collecting 10,000 instruction demonstration pairs from Surge. And this 00:13:44.840 |
is what the task distribution looked like. So, this very much follows the task distribution 00:13:49.280 |
of InstructGPT, except for the coding part, which replaced the "other" category over there. 00:13:54.640 |
And these are the number of examples we collected for each of these tasks. And here, 00:13:59.720 |
I'm showing, like, you know, the average length for each of these task categories. And one 00:14:04.160 |
thing I wanted to highlight was, which was very surprising to me, is that the chat is 00:14:08.160 |
actually one of the shortest prompt length categories. But for OpenAI, that is actually 00:14:14.520 |
one of the longest prompt length categories. So, which was very interesting. And so, obviously, 00:14:19.560 |
like, you know, at that time, we did not think much about it. But when we started training 00:14:24.360 |
models and started looking at the evaluation results, we were kind of, like, you know, 00:14:28.600 |
if we had to go back and change things, how would we change that? And so, these were, 00:14:33.000 |
like, things that we started, you know, looking at more carefully after we had already collected 00:14:37.720 |
the data set. So, here are examples of what that data set looked like. You know, classification, 00:14:46.520 |
generation, brainstorming. I'm sure you all must have seen at least some of these kind 00:14:51.480 |
of examples of instruction demonstration data sets. So, it's very much, like, it has everything 00:14:56.000 |
that you can expect from, like, NLP kind of tasks, but also more open-ended chatty tasks 00:15:02.800 |
as well. Okay. So, here are, like, some details about the task force that was used by Surge 00:15:13.600 |
to generate this data set. We requested a U.S.-based task force mainly because, like 00:15:18.920 |
I said, we just wanted to replicate what InstructGPT was doing. And based on Anthropic and OpenAI's 00:15:24.720 |
paper, it seemed like they preferred going with the U.S.-based task force. The gender 00:15:29.880 |
was equally divided, and the age range was also very, you know, it was, like, a big range 00:15:36.520 |
going all the way from 19 to 62. And then people had, like, you know, educational background 00:15:42.480 |
ranges from technical degree to Ph.D. So, Ph.D. was mainly for tasks like math, coding, 00:15:47.120 |
and so on. Okay. So, now I wanted to, like, switch gears a little bit and talk about this 00:15:55.760 |
data set that we collected for RLHF or for human preferences before I get into, like, 00:16:02.000 |
you know, the experiments we ran with this supervised fine-tuning data set and what results 00:16:05.960 |
we got. So, again, over here, while we were collecting human reference data set, we had 00:16:12.320 |
to come up with what are the specifications of these data sets. So, again, just to, like, 00:16:17.360 |
contrast this with how it is different from SFT: in the SFT data set, both the input and 00:16:22.200 |
the output are written by humans. In this case, the human writes the input. The output 00:16:26.940 |
comes from models, which is responses, but then the human just ranks or rates them on 00:16:31.960 |
a certain scale. So, yeah, essentially, we had to decide, like, what the task distribution 00:16:38.240 |
looks like for RLHF data. Is it going to be the same as supervised fine-tuning? What about 00:16:44.440 |
the length distribution? And should we do, like, single turn versus multi-turn? So, in 00:16:49.040 |
InstructGPT, it was mainly single turn. So, if we are trying to replicate InstructGPT, 00:16:53.720 |
we would have to go with single turn. But if we are trying to replicate something like 00:16:57.120 |
ChatGPT, it would have to be, like, a multi-turn dialogue. And then we had to also, like, you 00:17:02.680 |
know, decide on these dimensions of, like, helpfulness, honesty, and harmlessness. So, 00:17:07.760 |
these are, like, the HHH that Anthropic follows; OpenAI puts it as helpfulness, truthfulness, 00:17:13.400 |
and harmlessness. And then also we had to decide, like, you know, are they going to 00:17:16.880 |
rate each of the responses individually? Or are they going to rank them? And what are 00:17:21.320 |
the implications of, like, you know, us deciding one way or the other? 00:17:26.320 |
So, we started by doing a pilot study again. So, we took 300 examples from the self-instruct data 00:17:34.160 |
set, the data set that was released at the end of last year. And then, you know, we generated 00:17:39.800 |
model responses from our models and gave those to data vendors to, like, rate the responses 00:17:45.320 |
of the models. And we used this Anthropic template on the left, which is essentially asking the 00:17:51.480 |
human to choose the most helpful and honest response. And then, you know, these are the responses 00:17:55.800 |
from, like, model A and model B. And this is a scale, which is also working as, like, 00:18:01.320 |
sort of a ranking thing, in the sense that one to four means model A is better and five 00:18:12.320 |
to eight means model B is better. And also, like, one other thing we had to decide about is, like, how much 00:18:16.100 |
data should we collect? And so, again, this is from the InstructGPT paper. And as you 00:18:21.000 |
can see, like, you know, they have, like, the train and validation splits for each of 00:18:25.200 |
the three steps, which are the SFT, training the reward model, and the PPO. And this one 00:18:30.240 |
is in the order of tens of thousands. And, like, overall, this combined, which is, like, 00:18:35.000 |
you know, this process of RLHF comes up to about 100,000. 00:18:40.120 |
Great. Okay. So, then once we got this pilot study data back, we sat down, and 00:18:49.040 |
I looked at it manually, and I felt that I did not agree 00:18:53.640 |
with most of the answers that, you know, the annotators from each of these companies were 00:18:58.200 |
providing. And so, I was kind of, like, you know, I don't think this is high quality at 00:19:02.600 |
all. So, what I decided, like, you know, I told my team, let's go and, like, you know, 00:19:06.880 |
rate it within ourselves. And then, you know, we basically rated, like, about 100 examples 00:19:12.240 |
or so. And we followed, like, a similar template of, like, one to four and five to eight. And 00:19:17.560 |
basically, the takeaway was that even we did not agree amongst 00:19:22.280 |
each other. So, essentially, like, our models earlier in the year were so bad, you were essentially 00:19:27.440 |
breaking ties, like, arbitrarily. Like, you know, you're deciding between, like, should 00:19:31.640 |
it be, like, you know, three versus, like, seven or something like that. So, if they're 00:19:35.360 |
equally bad, it's hard to, like, decide which one is better, right? And so, we were kind 00:19:40.000 |
of, like, breaking some of these ties arbitrarily. And so, as you can see, like, you know, there 00:19:44.120 |
was barely any, like, you know, agreement or correlation among our outputs. And then, 00:19:48.760 |
you know, when I aggregated that, and, you know, looked at how well we 00:19:52.480 |
correlate with, for example, Surge and Scale AI, we found 00:19:57.000 |
that the maximum overlap was with Scale AI compared to, like, 00:20:03.680 |
say, Surge. Okay. So, we ended up collecting 20,000 dialogues. So, we decided to go with 00:20:09.880 |
multi-turn. And so, because it was multi-turn, you would have, like, 20,000 overall dialogues, 00:20:16.560 |
but the number of prompts would be 80,000, because each dialogue would have 00:20:19.840 |
about four turns on an average. So, like, you know, a human would prompt it, the model 00:20:24.280 |
would respond, a human would, like, rate the response, and then, you know, ask the follow-up 00:20:29.320 |
question. And then, again, the model would, like, you know, generate two responses, and 00:20:32.760 |
that is how it would go on. And so, the task distribution we decided to follow was a little 00:20:39.360 |
bit different from what we had for supervised fine tuning. And the reason behind that was 00:20:44.640 |
that we wanted to focus more on tasks that were, like, factual, so that, you know, essentially, 00:20:51.520 |
this is more about making the model learn, like, between positive and negative signals. 00:20:55.800 |
So, making the model, like, discriminate between, like, you know, what is factual, what is not, 00:21:00.200 |
what is helpful, what is not, and what is harmless and what is not. And, like, you know, 00:21:04.920 |
for example, tasks like generation and brainstorming, there's no one correct answer. Like, you know, 00:21:09.480 |
everyone can come up with, like, different lists or recipes, and, you know, it's hard 00:21:13.440 |
to say, is this the best answer? Is this the most helpful answer? But if you ask, like, 00:21:17.200 |
a factual question, it's, like, very clear what is correct and what is not. So, that 00:21:22.080 |
was kind of, like, our reasoning behind doing this. And so, this is a task distribution 00:21:26.120 |
that we came up with for collecting the human preference dataset. 00:21:31.680 |
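To make the format concrete, a single turn of such a multi-turn preference record might look roughly like this; the field names and content are illustrative, and the 1-to-4 response margin is the scale discussed later in the talk.

```python
# Illustrative shape of one turn in a multi-turn preference dialogue.
preference_turn = {
    "history": [
        {"role": "user", "content": "What year did the Berlin Wall fall?"},
    ],
    "chosen": "The Berlin Wall fell in 1989.",     # response the human preferred
    "rejected": "The Berlin Wall fell in 1991.",   # response the human rejected
    "margin": 3,   # 1 = only slightly better ... 4 = significantly better
}
```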
Also about the length, because we are doing this in a multi-turn setting, and so we wanted 00:21:35.840 |
to make sure, like, you know, the entire dialogue could fit into, like, the context length of 00:21:40.000 |
the models, we decided to, like, you know, ask them to keep the overall dialogue 00:21:44.080 |
to be shorter than 2048 tokens. And then it was multi-turn with an average of four turns 00:21:51.080 |
per dialogue. Then, obviously, we also had to decide on the dimension of, like, whether 00:21:55.880 |
we are going for, like, helpfulness over harmlessness or, you know, honesty. So, we followed these 00:22:01.800 |
instructions from the OpenAI guidelines. I'm not sure if I can pull this up. That would 00:22:06.240 |
be nice. Okay. Great. But, yeah, so, OpenAI has this document which is public of, like, 00:22:16.400 |
labeling instructions that they shared with their annotators. And so, they have, obviously, 00:22:21.360 |
like I said, they have helpful, truthful, and harmless, but then they also have this 00:22:25.320 |
thing how do I scroll down? Okay. So, they have definitions on what do they mean by helpfulness, 00:22:32.440 |
what do they mean by truthfulness, and what do they mean by harmlessness. So, in our case, 00:22:37.120 |
because our models were not as good, we decided to focus on helpfulness and truthfulness. 00:22:42.620 |
And when they had to break ties, OpenAI says that, you know, you should choose truthfulness 00:22:48.840 |
and harmlessness over helpfulness. So, like, let me see that. Yeah. So, they wanted to, like, prioritize harmlessness 00:23:07.000 |
and truthfulness over helpfulness, but we went the other way around. We said we wanted 00:23:10.920 |
to, like, prioritize helpfulness over honesty or harmlessness. I mean, we weren't even focusing 00:23:16.720 |
on harmlessness, because we just wanted to get our model to a certain capabilities before 00:23:21.120 |
we start thinking about that. But, yeah, this is really a very good document and, like, 00:23:26.840 |
you know, defines what the annotator should be looking at and how they decide when 00:23:32.040 |
the model responses are very close, how do they break those ties. And for, like, you 00:23:37.680 |
know, deciding between what kind of template should we use for collecting these annotations, 00:23:42.680 |
we started off with the Anthropic template that I showed a few slides earlier, which 00:23:45.960 |
was on a scale of one to eight, but essentially ranking between these two models. And then, 00:23:50.720 |
you know, Llama 2 came out while we were in this iterative process. And our iterative 00:23:54.960 |
process was essentially we used to give an endpoint to the vendor, and then the, you 00:23:59.520 |
know, basically the annotators that they had in the managed task force would prompt these 00:24:03.880 |
endpoints. The model would generate two responses. They would, you know, follow the instructions 00:24:10.680 |
and, you know, give the ranking for each of those model responses. 00:24:15.800 |
And then, you know, again, like, follow up with the second prompt and the conversation 00:24:18.680 |
would go on. And then they would give us the data at the end of that week. We would fine 00:24:22.600 |
tune our model on that data so that the model now is hopefully better. And then we give, 00:24:27.920 |
like, a better endpoint to them for the next week to continue this process. So it's, like, 00:24:32.160 |
very iterative. And, like, you know, they have to adapt to, like, model getting better 00:24:36.320 |
week by week. So, yeah, basically, I think 00:24:41.240 |
for one or two weeks we used the Anthropic scale for collecting the data set. 00:24:47.840 |
But then Llama 2 came out, and their results showed clearly that, you know, 00:24:52.080 |
they were using this much easier scale of just 1 to 4. So they were, like, you know, 00:24:57.440 |
choosing which one is a better response between the two responses and then seeing how much 00:25:02.140 |
better it is. So is it, like, significantly better or is it only slightly better? And 00:25:06.920 |
so that was the ranking of, like, scale 1 to 4. So here are examples of data that we 00:25:13.280 |
collected. So on the left, you can see that it is asking about, like, you know, basically 00:25:19.920 |
human is prompting with a question and then the bot generates a response. So this is the 00:25:24.480 |
response that the human chose at this turn. And then the human, you know, follows up with 00:25:28.680 |
the second prompt. And then this is the bot response that was chosen by this human. And 00:25:33.480 |
this is the rejected bot response. And this is giving the response margin of 3, which 00:25:37.760 |
is saying that they are quite a bit different. So 4 is, like, very different and 1 is 00:25:41.880 |
only very slightly different. And then here on the right-hand side is more of a sort of 00:25:47.200 |
generation brainstorming kind of example where the human is asking, like, can you write a 00:25:52.580 |
text message wishing your husband a happy anniversary? And then the bot writes something. 00:25:57.760 |
I guess my thing messed up the emojis. But, you know, then the human follows up with saying, 00:26:03.640 |
hey, you missed this important detail, which is, you know, they have been married for eight 00:26:07.320 |
years. And so this is a chosen bot response. This is the rejected one that the human chose 00:26:12.360 |
between those two. And as you can see, they are quite good. So the response margin is 00:26:17.920 |
just 1. So they're, like, just slightly different. Okay. Sounds good. So now I'm going to, like, 00:26:26.680 |
talk about this another recipe that we tried, which is, you know, using synthetic data set 00:26:32.080 |
essentially for distillation of AI alignment, which is basically the paper that we released 00:26:37.680 |
last week called Zephyr, and which was, like, a 7 billion parameter model, which actually 00:26:43.160 |
beat ChatGPT. And this builds on top of the Mistral model. But I just wanted to, like, 00:26:49.280 |
you know, yeah, just, you know, basically we recreated some of the steps that were there 00:26:52.640 |
in the InstructGPT paper, but now using a synthetic data set. And so the first one is, 00:26:58.920 |
like, you know, you are basically, like, using a data set. In this case, we use Ultra Chat. 00:27:03.800 |
So this is a data set I showed a few slides earlier for supervised fine tuning, wherein, 00:27:07.680 |
like, a human was brainstorming and, like, gathering the material, and then, like, you 00:27:12.280 |
know, chatting with this GPT-4 model to, like, generate multiple different, you know, outputs 00:27:18.000 |
for the instruction. And then, you know, this is how we collect that data set, which is 00:27:21.920 |
called the Ultra Chat. And then we use that for fine tuning our model. And then the second 00:27:27.500 |
step is the response generation AI ranking. So in this case, also, like, you know, we 00:27:33.320 |
used Ultra Feedback, which is a data set that was released. And the way this data set was 00:27:39.280 |
constructed was that, you know, they basically, like, you know, took some prompts 00:27:44.440 |
from, like, ShareGPT and some of these different SFT data sets that were already out there. 00:27:49.920 |
And then they gave it to four different models, like, four different powerful models, like 00:27:53.800 |
PaLM 2, Claude 2, GPT-4, and so on. And then they asked this GPT-4 to, like, rank each 00:28:01.440 |
of those four responses. And then, so, like, you know, the one that is the best is the 00:28:05.800 |
one that GPT-4 ranks as the highest. So each of these are scored individually on a scale 00:28:09.840 |
of 1 to 10. And the one that gets the maximum score is, like, the best response. And then 00:28:16.760 |
finally, we did something called DPO, which you might have been aware of because it came 00:28:21.520 |
out of Stanford. It's, like, this kind of alternative to RLHF, which is, like, doing 00:28:26.320 |
this direct preference optimization. And so instead of, like, you know, basically doing 00:28:31.720 |
this iterative process of fine-tuning, you directly, like, optimize on, like, the chosen 00:28:35.960 |
one. So we just take that and then fine-tune our model directly on that chosen response. 00:28:42.600 |
And the other one that we are using is, like, a random response from these other three responses. 00:28:47.920 |
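For reference, the DPO objective itself boils down to a short loss; this is a minimal sketch of the standard formulation (with a beta chosen only for illustration), not the exact Zephyr training code.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Minimal DPO objective: widen the policy's log-probability margin between
    the chosen and rejected responses, measured relative to a frozen reference
    model. All inputs are summed response log-probs with shape (batch,)."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```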
Okay. So I'm going to talk a little bit about experiments and evaluation for each of these 00:28:54.960 |
recipes. One is collecting everything with, like, humans involved. And the second one 00:29:00.000 |
is everything which is synthetic. But then before I discuss evaluation, I wanted to talk 00:29:05.160 |
about, like, what are the benchmarks that we are evaluating on and how good are these 00:29:09.040 |
benchmarks for evaluating chatbots. And to think about evaluation, we need to first think 00:29:14.520 |
about how we are training these models. So, like, today, all the models that are trained 00:29:19.240 |
more or less have these four ways of learning. The first one is pre-training 00:29:23.640 |
the language model. Essentially, you're predicting the next token. And examples of these are, 00:29:28.080 |
like, GPT-3, OPT, and so, like, the foundation models. The second type of learning is in-context 00:29:34.280 |
learning or the prompt-based learning. In this case, you're, like, just giving a new 00:29:38.800 |
kind of task in the context of the model and then, you know, ask it to, like, you know, 00:29:43.480 |
do that on new examples. So, like, if you wanted to write a poem, for example, for GPT-3, 00:29:48.160 |
you would have written that in the context and then it would have generated a new poem 00:29:51.880 |
on some other topic. The third type of learning is the supervised fine tuning, which was kind 00:29:59.760 |
of, like, the first step of training a chatbot. In this case, you're, like, fine tuning on 00:30:04.760 |
the instruction following data and then you want these language models, which are just 00:30:08.840 |
pre-trained to predict the next token to become chatty and to, like, generate open-ended responses. 00:30:14.760 |
And then, finally, the fourth one is reinforcement learning from human feedback, which is nudging 00:30:19.480 |
the language model towards the values you desire. And examples include Llama 2 Chat 00:30:24.040 |
from Meta. So, for the first two types of learning, you know, we have a lot of benchmarks. 00:30:33.060 |
Like, Stanford HELM is an example of that, or the Google BIG-bench, 00:30:37.840 |
or even open LLM leaderboard. But for these two types of learning, which is supervised 00:30:45.440 |
fine tuning and reinforcement learning from human feedback, which are parts of, like, 00:30:49.160 |
this recipe for training a chatbot, there's, you know, not a lot of leaderboards or evaluation 00:30:54.380 |
benchmarks available. But there are some available. And I wanted to, like, you know, just highlight 00:30:58.340 |
some of those. So, like, yeah, this is essentially, like, the steps three and four here match 00:31:04.220 |
to, like, you know, the step one over here, which is helpfulness, and then steps two and 00:31:08.440 |
three over here, which is, like, you know, nudging the model towards being more harmless. 00:31:13.560 |
So, if you had to, you know, evaluate the chatbot for each of these steps, you would 00:31:19.840 |
have to think about how do you evaluate instruction following or chattiness. You would have to, 00:31:24.800 |
you know, think about how do you evaluate the reward model, which is essentially a classifier. 00:31:29.420 |
And then finally think about, you know, how do you evaluate for harmlessness, which is 00:31:33.320 |
by red teaming or adversarially prompting the language model. So, for the first step, you 00:31:39.480 |
would have to see, like, does the model generate useful responses on the topic? And are they 00:31:43.440 |
open ended? And one example of a prompt that you would try to evaluate the model would 00:31:47.880 |
be to, like, say, brainstorm a list of New Year's resolutions. And so, examples of 00:31:54.080 |
benchmarks and evaluation boards that are looking at this sort of, like, supervised 00:31:59.360 |
fine tuning is, like, Hugging Face's leaderboard with ELO ratings. So, ELO is this metric that 00:32:05.300 |
is used in chess, which is, like, you know, you're pairing one player against the other, 00:32:09.480 |
and you want to, like, rank these players when they have, like, these tournaments against 00:32:13.160 |
each other. And so, in a similar sense, we are, you know, taking these chatbots and then, 00:32:20.080 |
you know, putting them in a pairwise setting. And then we partnered with ScaleAI, and they 00:32:25.000 |
provided humans to, like, annotate which response is better. And we did that for every single 00:32:30.360 |
combination. So, like, it was N choose 2, where N is the number of models we are comparing. 00:32:35.480 |
And so, we generate N choose 2 combinations, and we rate each of them. And so, these are the 00:32:40.640 |
ELO ratings that we get out of it. And this column here shows the rating 00:32:47.600 |
you would get if you had used GPT-4 as a proxy for humans. So, instead of, like, 00:32:52.880 |
humans sitting and rating each of those, you're asking, like, you know, GPT-4 to select which 00:32:57.120 |
is a better response. Yeah. And so, the first table here is showing the results 00:33:02.680 |
if no ties were allowed, and this table is showing the results 00:33:07.320 |
if ties were allowed. 00:33:15.480 |
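For context, the underlying Elo update for one pairwise comparison is simple; this is the textbook formula with an illustrative K-factor, not necessarily the leaderboard's exact implementation.

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """One Elo update after a pairwise comparison between chatbots A and B.
    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# e.g. both chatbots start at 1000 and A wins one comparison:
# elo_update(1000, 1000, 1.0)  ->  (1016.0, 984.0)
```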
Another example is the AlpacaEval leaderboard from Stanford, and they're doing something very similar in 00:33:20.440 |
the sense that they have GPT-4 and Claude as an evaluator, and they are doing, like, 00:33:25.720 |
a pairwise evaluation of these models, chatbot models, and they're reporting the win rate 00:33:32.040 |
of, you know, which model wins against the other one. There's also the LMSYS leaderboard 00:33:38.840 |
from Berkeley, which has this thing called the chatbot arena, which is essentially like 00:33:43.120 |
a publicly crowdsourced leaderboard, wherein you can, like, go and chat 00:33:48.040 |
with any of their models, and then give a rating of, like, which one was more helpful 00:33:51.880 |
and which one was better. And so, this, again, has, like, a leaderboard of ELO ratings, because 00:33:56.640 |
this is done in a pairwise setting. There's another benchmark from LMSYS, which is called 00:34:03.960 |
MT-Bench, or the multi-turn benchmark. And this is the first ever multi-turn 00:34:09.360 |
dialogue benchmark that is evaluating chatbots. And there are just, like, 80 examples 00:34:15.440 |
in this across, like, a bunch of categories. But essentially, the way it works is 00:34:20.320 |
that the first turn or the first prompt from the benchmark is prompted to the model. Then 00:34:27.360 |
GPT-4 is asked to score, on a scale of 1 to 10, how good the model's response is. And 00:34:33.560 |
then, you know, it is followed up by another prompt, which is, like, you know, the multi-turn 00:34:38.000 |
prompt, which is, like, related to the question, but it might not be related to the model's 00:34:42.560 |
responses, because, you know, this is already constructed, and they always, like, follow 00:34:45.960 |
up with the same second prompt for every model. And then, again, GPT-4 evaluates how good was 00:34:51.600 |
the second turn of the response. So, this is, like, the consolidated leaderboard from 00:34:57.600 |
LMSYS, showing both the arena ELO rating, as well as MT-Bench scores. So, these are 00:35:03.400 |
scores that are aggregated across all the 80 examples, and this is the GPT-4 score, 00:35:08.880 |
from, like, 1 to 10, essentially. Cool. So, I think the second step that we wanted to, 00:35:16.040 |
like, look at in our evaluating a chatbot chart was, like, you know, think about how 00:35:20.800 |
do you evaluate a reward model. So, when you have these human preference data set collected, 00:35:25.840 |
and you train this reward model, which is essentially a classifier, to discriminate 00:35:29.680 |
between, like, you know, truthful and untruthful response, or, like, you know, can it rank 00:35:33.840 |
helpful responses higher than the less helpful responses? And, you know, there's literally 00:35:39.280 |
no open source leaderboard available for evaluating these, like, preference models 00:35:44.840 |
or the reward models. But internally at Hugging Face, we have our own data set for evaluating, 00:35:50.660 |
so that we know that as we are adding more human preference data, our models are actually 00:35:55.520 |
getting better. So, this is essentially we are evaluating on these open source data sets, 00:36:02.200 |
which is the Anthropic Helpful data set, the Open Assistant data set, the Stanford's Human 00:36:07.520 |
Preference data set, and also the Learning to Summarize data sets from the very first 00:36:11.960 |
paper from OpenAI, which was looking at Learning to Summarize. And so, this is, like, you know, 00:36:18.600 |
basically seeing that, you know, how good is our reward model. And then, finally, the 00:36:23.280 |
third type of evaluation is red teaming. And so, in this case, you want to craft a prompt 00:36:28.720 |
in a way that could surface model vulnerabilities and emerging capabilities. And for example, 00:36:34.720 |
if you're asking, like, how do I plan a prank robbery, is the model actually, like, you know, 00:36:39.120 |
helping you with that? You're trying to elicit undesired behavior from the model. And unfortunately, 00:36:44.720 |
actually, there's no open source leaderboard available for this thing. There's just one 00:36:49.280 |
data set from Anthropic, which actually has 00:36:53.840 |
both helpfulness and harmlessness included. It's the HH data set from Anthropic. And that's the 00:36:58.720 |
only open source data set available for red teaming. But there's no leaderboard available 00:37:04.080 |
for red teaming. And so, this was, like, a blog that I wrote earlier in the year 00:37:08.080 |
highlighting this gap and putting out an announcement 00:37:12.320 |
saying, like, we should get together and build a data set for red teaming. And if you have 00:37:16.160 |
heard of, like, the DEF CON red teaming challenge, you know, basically crowdsourcing 00:37:20.900 |
some of this red teaming work kind of came out of that. Okay. So, now I'm going to get 00:37:26.600 |
into now that we have discussed evaluation and benchmarks and leaderboards, I'm going 00:37:30.600 |
to talk about results and what they looked like on some of these benchmarks. 00:37:35.160 |
So, here I'm showing the results for this Llama 2 13 billion on the Open LLM Leaderboard 00:37:42.840 |
from Hugging Face. And in this case, I was using the data set that we collected from 00:37:48.120 |
Surge, which was the 10,000 instruction demonstration data set. And here, you know, these 00:37:53.400 |
are basically the four data sets, which are, like, NLP focused data sets that we have as 00:37:58.480 |
part of the Open LLM Leaderboard, which are the ARC Challenge, MMLU, HellaSwag, and 00:38:03.200 |
TruthfulQA. And, like, you know, this is how well our model does. And all of this 00:38:09.760 |
is essentially accuracy. And this is the LIMA paper or the LIMA model, which is Less Is 00:38:14.300 |
More for Alignment, that came from Meta. And they just used 1,000 examples of high quality 00:38:19.240 |
instructions and showed that you can get a very good chatbot by just using 1,000 examples. 00:38:24.760 |
And this is, like, you know, taking the longest examples from Open Assistant and just choosing 00:38:28.480 |
the top 500 of them. And so, we found that our model does slightly better than, you know, 00:38:34.320 |
both of, like, LIMA and Open Assistant, except on TruthfulQA, where we found 00:38:39.760 |
that LIMA and Open Assistant did better than us. And, actually, like, 00:38:45.760 |
on MT-Bench, we found, like, you know, the opposite was true. So, this is, like, 00:38:49.680 |
you know, MT-Bench. Remember that the LMSYS benchmark had, like, you know, a first and a second turn. 00:38:54.360 |
And then so, this is reporting the first response. This is, like, GPT-4 essentially scoring 00:38:58.720 |
on a score of 1 to 10, how good these models are on the first dialogue turn and the second 00:39:04.880 |
dialogue turn and the average score. And so, actually, this is kind of counterintuitive 00:39:10.960 |
to what we found on the automatic evals, in that MT-Bench says that, you 00:39:15.600 |
know, our model trained on the data that we collected from Surge 00:39:20.160 |
is not very good. And in fact, LIMA and Open Assistant, which are, like, a fraction of 00:39:24.520 |
the size of the data we had are much better. So, this was kind of surprising. And then 00:39:31.800 |
I looked into, like, you know, whether the length is a factor in this. And it 00:39:36.520 |
does seem like it, you know. I was looking at each of those data sets and then, you 00:39:40.240 |
know, looked at the average length of the prompts in each of those. And it seems like 00:39:43.960 |
there is a very wide range. For example, in our data set, the average length of these prompts was just 00:39:48.600 |
211, while LIMA is, like, double of that and Open Assistant is almost 00:39:53.520 |
double of that. So, then I did this experiment where I wanted to check, like, if I controlled 00:40:02.760 |
for the size of the data, but then, you know, let the length be varied, the prompt length, 00:40:07.960 |
does that affect the performance? So, in particular, like, I think I highlighted this before, 00:40:12.520 |
that our chat category was, like, really short. And we actually found that, you know, 00:40:17.600 |
like, length did not really affect things that much, except for this TruthfulQA data set. Even 00:40:23.880 |
for HellaSwag, even though it looks small, the difference is actually just in the third decimal place. 00:40:28.620 |
And over here, you can see, like, the only real difference was on TruthfulQA, which 00:40:32.140 |
actually preferred models that were generating longer responses. But on the other hand, the 00:40:39.320 |
MT-Bench score was, again, not intuitive, not aligning or correlated with what we found 00:40:44.080 |
with these automatic metrics and evaluations, in the sense that GPT-4 actually did not prefer, 00:40:50.600 |
like, longer responses. And so, this was, like, you know, a little bit counterintuitive. 00:40:55.640 |
And so, we need to, like, dig more into, like, what's going on over here. But, you know, 00:41:00.440 |
it actually found that, you know, like, shorter responses were better than longer 00:41:05.220 |
responses, although there was, like, not much of a difference. 00:41:10.120 |
So, the other experiment and evaluation we did is just varying the amount of data 00:41:15.840 |
and seeing, like, if you incrementally add more data, how does that affect performance? 00:41:20.860 |
And this is, again, on that open LLM leaderboard from Hugging Face, which is looking at some 00:41:26.620 |
of these standard NLP benchmarks and reporting accuracy. And so, this is, like, starting 00:41:32.200 |
with just 10% of all the data we collected from Surge. And as you can see, like, you 00:41:37.160 |
know, in all these benchmarks, actually, like, it saturates very quickly. And in some of 00:41:41.720 |
them, you actually get, like, you know, you basically lose performance if you keep adding 00:41:46.120 |
data. And so, this is kind of aligning with what we saw when we started collecting 00:41:50.520 |
data; we had this diminishing returns plot, which said that if you have just a very 00:41:54.920 |
few thousand examples of a very high quality instruction following data set, that's good 00:42:00.040 |
enough. And then your performance saturates or plateaus very quickly after that. 00:42:09.160 |
Similarly, I think this is one place where MT-Bench actually correlated with 00:42:16.440 |
the automated metrics, in that GPT-4 also, like, you know, showed that, you know, after, 00:42:22.880 |
like, about 4,000 examples, there was basically barely any gain in performance, and actually 00:42:28.800 |
decreasing performance for the model. 00:42:33.040 |
Okay, great. So, that was all the results on using, like, these human curated very high 00:42:40.440 |
quality data set. What about, like, results from distillation from these synthetic data 00:42:46.000 |
sets? In particular, we use UltraChat for supervised fine tuning and UltraFeedback for 00:42:51.560 |
DPO. And so, these are the results. So, this is, like, basically just work that was released 00:42:57.520 |
last week. We haven't yet released the code and the data set, which we are going to do 00:43:01.960 |
this week. And so, here I'm highlighting that Zephyr is the model we released. We built, 00:43:06.560 |
we used Mistral as the foundation model, and then fine tuned it using UltraChat and then 00:43:12.800 |
did DPO on UltraFeedback. And as you can see, it actually beats ChatGPT on this AlpacaEval 00:43:19.400 |
leaderboard. Also, it is, like, at least among the open models, it beats most 00:43:30.920 |
of the 13 billion parameter models. And it's, like, quite competitive with Claude 2, again, 00:43:38.920 |
on the AlpacaEval leaderboard. So, this is the model which has both SFT and DPO. So, 00:43:46.880 |
we did an ablation on how good or how useful is, like, you know, SFT and how useful is 00:43:52.720 |
DPO, because there's this two-step process. It's, like, first you fine tune on instruction 00:43:56.960 |
demonstration, then you fine tune on human preferences. And so, this is the first row 00:44:02.280 |
over here is showing what if you directly did DPO on UltraFeedback and did not do the 00:44:07.640 |
supervised fine tuning. And you actually saw that that's really bad. So, that doesn't work 00:44:11.720 |
at all. And then the second one is saying that what if you just did supervised fine 00:44:16.600 |
tuning and did not do DPO. And this, which is, like, just the first step, actually 00:44:21.480 |
works decently well. And it's, like, you know, basically getting you to, like, 80 or 90% 00:44:25.840 |
of the overall performance. And finally, this is doing, like, supervised fine tuning on 00:44:31.080 |
the human preference data. So, you take this row and do another round of supervised fine 00:44:35.440 |
tuning, but on this data of human preferences. So, you remember you had, like, the chosen 00:44:39.800 |
and the rejected. So, you give all the dialogue history, and then the expected completion 00:44:44.800 |
is the chosen dialogue response. So, in this case, you're not really doing that discriminative 00:44:49.000 |
thing. You're still doing the SFT process, but you're just, you know, using 00:44:53.240 |
the data set in a smart way so that it follows the template of what supervised 00:44:58.160 |
fine tuning does. And that, as well, we found, you know, wasn't very helpful. 00:45:03.000 |
So, the best recipe, obviously, is DPO plus SFT. So, you know, doing SFT first on the 00:45:09.280 |
UltraChat, and then DPO on the UltraFeedback. Both of these data sets are synthetic. And 00:45:14.880 |
then, you know, it's, like, only slightly better than just doing SFT. 00:45:20.040 |
Okay. So, I'm getting to this final section of my talk, which is essentially looking at, 00:45:27.120 |
you know, so, we have seen a lot of these evaluation and benchmarks and leaderboards, 00:45:31.760 |
and many of them are starting to adopt these powerful models, like Claude 2 and GPT-4, and 00:45:37.120 |
are using them as a proxy for humans in evaluation. And so, what are the quirks associated with 00:45:41.880 |
doing that, and are there things that we should, like, be, like, you know, considering when 00:45:45.720 |
we are doing this at a very large scale? So, when we did that, when we used GPT-4 as 00:45:51.920 |
an evaluator, we found that it actually has a positional bias. And so, in particular, 00:45:57.440 |
it is predisposed to generating a rating of 1 in a preference collection setting. And 00:46:03.080 |
so, like, you know, this chart over here shows, like, the average rating for model responses 00:46:09.040 |
across, like, the entire data set. And on the right, on the other hand, humans are more 00:46:13.440 |
or less uniform. And so, you expect that, you know, this distribution seems much 00:46:17.960 |
better than this distribution, which is skewed to the right. 00:46:22.440 |
So, then what we did is that we prompted GPT-4 to say that, hey, you have this left bias, 00:46:28.560 |
and you always generate this rating of 1, you know, be aware of this bias. And when 00:46:33.160 |
you tell it to debias itself, it actually flips the bias in the opposite direction. 00:46:38.000 |
So, then it starts, like, it is more self-aware in the sense that it knows that, you know, 00:46:42.400 |
it has this bias, and now it starts generating more ratings of 5 and 6. And one way of 00:46:47.440 |
getting rid of this is that we kind of make sure that each response is equally likely 00:46:51.720 |
to be in the right and the left position. So, that kind of dilutes, like, this bias that it has. 00:47:00.440 |
And then, you know, we found that prompting GPT-4 to generate scores, so 00:47:05.080 |
asking it to score, like, each response individually, like MT-Bench does, instead of 00:47:10.040 |
ranking them in a pairwise setting, actually alleviates the problem a little 00:47:14.200 |
bit, but does not completely get rid of the problem. 00:47:18.080 |
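One simple way to implement the "equally likely on either side" idea is to randomize the order in which the two responses are shown to the judge; this is a generic sketch with a placeholder judge callable, not our exact evaluation harness.

```python
import random

def judge_with_random_positions(prompt, response_a, response_b, judge,
                                rng=random):
    """Show the two responses to the judge in a random left/right order so
    that, over a whole data set, each side is equally likely to come first.
    `judge` is a placeholder callable returning "first" or "second"."""
    if rng.random() < 0.5:
        verdict = judge(prompt, response_a, response_b)
        return "A" if verdict == "first" else "B"
    verdict = judge(prompt, response_b, response_a)
    return "B" if verdict == "first" else "A"
```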
We also found evidence of doping between training and evaluation. So, in particular, we found 00:47:24.840 |
that GPT-4 prefers models that were trained on GPT-4's data. So, all these models 00:47:31.200 |
here were trained on data that was bootstrapped using GPT-4. And, you know, it prefers that 00:47:37.640 |
over humans, who are, like, more factual and much higher quality, but might be very 00:47:43.080 |
succinct and to the point. So, this is one thing that, you know, we should be aware of. 00:47:51.240 |
The other thing is that, you know, it also, like, concurs with findings from these other 00:47:55.040 |
papers, which is that GPT-4 prefers models with higher diversity, that is, the number 00:48:00.580 |
of unique tokens in the response, and longer responses. So, if you have, like, this list-of-lists 00:48:05.480 |
kind of response, just like ChatGPT does, GPT-4 is, like, predisposed to rating 00:48:11.440 |
that higher compared to a model that does not generate that. 00:48:14.720 |
We also found that GPT-4 has poor correlation with humans on low entropy tasks, such as 00:48:23.000 |
math, coding, and reasoning. So, remember that leaderboard I showed you where we had compared, 00:48:27.240 |
like, how the GPT-4 ELO rating compares to humans? And then we dove deeper into, like, 00:48:33.120 |
how does that compare on each of these different task distribution and categories? And so, 00:48:37.400 |
this is what it looks like. So, it seems like, you know, it says lower correlation with humans 00:48:42.640 |
on some of these more factual, like, you know, kind of, like, expecting one correct answer. 00:48:48.160 |
And they actually highly correlated with humans on these more high entropy tasks where you 00:48:53.400 |
got, like, brainstorming and creative generation, which was kind of unintuitive and counterintuitive 00:48:58.200 |
because you could have so many different ways of coming up with, like, you know, a recipe 00:49:03.280 |
or a list of something. But that's where, like, the rating of GPD4 and humans are more 00:49:10.120 |
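(A rough sketch of how you could quantify that per-category agreement, assuming you already have paired GPT-4 and human ratings tagged with a task category; the record format here is made up for illustration.)

    from collections import defaultdict
    from scipy.stats import spearmanr

    def per_category_correlation(records):
        # records: iterable of (category, gpt4_rating, human_rating) tuples.
        by_category = defaultdict(lambda: ([], []))
        for category, gpt4_rating, human_rating in records:
            by_category[category][0].append(gpt4_rating)
            by_category[category][1].append(human_rating)
        return {
            category: spearmanr(gpt4_scores, human_scores).correlation
            for category, (gpt4_scores, human_scores) in by_category.items()
        }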
Okay. So, the final thing is takeaways. There's a bunch of them, but let's try to 00:49:17.680 |
break it down. Essentially, we discussed how to come up with 00:49:21.520 |
steps for data curation for supervised fine-tuning and RLHF, and it involves several 00:49:27.160 |
critical factors, such as how much data do you need to collect? What is the length of 00:49:31.480 |
the prompts and the distribution of those lengths? The task distribution? And what is 00:49:36.880 |
the role of humans? Do you need synthetic data? Do you need completely 00:49:40.400 |
manually curated data, or something in the middle? We also looked at the many tools 00:49:44.880 |
for efficient fine-tuning of open-source LLMs. From the SFT results, we found that 00:49:51.160 |
TruthfulQA was the main differentiating benchmark among the automated eval metrics. Then 00:49:57.680 |
we found that MT-Bench scores were actually not correlated with these automated metrics; 00:50:02.680 |
only on some of these models did we find that 00:50:08.360 |
they were correlated. For the distillation results, which are from Zephyr 7B, where 00:50:13.560 |
we are fine-tuning on synthetic data, we found that SFT on AI-generated data 00:50:19.520 |
plus DPO, or distilled DPO, on AI feedback data actually beats ChatGPT, even though 00:50:25.680 |
the model is just 7 billion parameters. Then we found a benchmarking 00:50:30.600 |
gap in assessing RLHF models, in particular that we don't have benchmarks for assessing 00:50:37.080 |
reward models. And we also don't have open-source benchmarks for evaluating red teaming 00:50:42.320 |
and model vulnerabilities. Then finally, we dove deeper into 00:50:47.480 |
the quirks of using GPT-4 or some of these powerful LLMs as an evaluator. 00:50:53.560 |
Some of them were: it prefers models trained on GPT-4-like data, 00:50:57.980 |
it has a left positional bias, and it has high correlation with humans on 00:51:02.960 |
creative tasks compared to coding or reasoning tasks. My work has been covered 00:51:09.640 |
in a New York Times article, which talks about the secret ingredient 00:51:15.200 |
behind ChatGPT, which is alignment. I'm also part of the United Nations Advisory 00:51:20.000 |
Board that was announced last week, so I'm really humbled to be part of that. Here are some 00:51:24.760 |
blog posts. We did not publish a whole 00:51:30.440 |
lot this year, but we wrote a bunch of blog posts highlighting what we are releasing and 00:51:35.080 |
working on, and some of these are part of the talk that I just 00:51:39.960 |
gave. And this is the H4 team. I'm grateful to be part of it. 00:51:52.560 |
When you generate alternative responses from the models, do you select really high temperatures, 00:52:14.440 |
or do you keep it pretty close to the temperature that's also used in the final product? 00:52:19.680 |
Yeah. So, we tried experimenting 00:52:25.400 |
with different temperatures, but we actually found that just using different sampling strategies 00:52:31.660 |
worked better. So, using different values of top-p and top-k and some combination 00:52:36.600 |
of those, as opposed to just relying on temperature. 00:53:04.760 |
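(A minimal sketch of what varying the sampling strategy looks like with the Hugging Face transformers generate API; the model name and prompt are placeholders, and the particular top_p/top_k values are just examples, not the settings we used.)

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "HuggingFaceH4/zephyr-7b-beta"  # placeholder; any causal LM works
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    prompt = "Explain RLHF in two sentences."
    inputs = tokenizer(prompt, return_tensors="pt")

    # Vary top_p / top_k (and their combinations) instead of only the temperature.
    sampling_configs = [
        {"do_sample": True, "top_p": 0.9, "top_k": 0},
        {"do_sample": True, "top_p": 0.95, "top_k": 50},
        {"do_sample": True, "top_p": 1.0, "top_k": 40},
    ]

    for cfg in sampling_configs:
        output = model.generate(**inputs, max_new_tokens=128, **cfg)
        print(tokenizer.decode(output[0], skip_special_tokens=True))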
Yeah. So, I think for red teaming at scale, there's actually a paper that came out recently 00:53:20.760 |
called GPTFuzzer that bootstraps and uses these powerful LLMs 00:53:27.280 |
to jailbreak other LLMs. And there was also a DeepMind paper, 00:53:30.840 |
one and a half to almost two years ago, on red teaming large language models with 00:53:35.600 |
large language models: how do you red team and evaluate a language model by 00:53:39.840 |
using another powerful language model? I think that is the way to go 00:53:44.440 |
in terms of scale. And so, what was the second question? 00:53:51.240 |
Yeah. So, I think one thing is this idea of emergent capabilities, which is essentially 00:54:09.480 |
that as you scale up, and this is a trend that we are seeing, 00:54:13.280 |
there are things that these models do, capabilities 00:54:17.720 |
that emerge, that were not there in the smaller models. Examples are chain-of-thought 00:54:21.780 |
reasoning, which GPT-2 or GPT was not capable of doing, and which appeared as we scaled up. 00:54:27.880 |
The other example is few-shot prompting, which we first saw in GPT-3: you 00:54:32.880 |
can give it a completely new task and not update its parameters in any way, but just 00:54:37.760 |
put it as part of the prompt, and it learns the task and 00:54:41.960 |
can do it on any number of examples, right? And so, things like labeling 00:54:47.200 |
started coming up, like using GPT-3 as a labeler, once we discovered 00:54:52.480 |
that capability. The other example is manipulation. 00:54:56.840 |
I don't think any open-source models are capable of that yet, but I know Anthropic and 00:55:02.560 |
OpenAI are focusing on deception and manipulation, because 00:55:07.320 |
when you start chatting with these models, you start treating 00:55:12.440 |
them as a companion, especially if you have a Character AI kind of thing 00:55:16.800 |
where you might start confiding in them, sharing information 00:55:21.160 |
that you probably shouldn't, and then they can use it against you, maybe. 00:55:26.000 |
An example of that is that I think recently we saw that GPT-4 actually manipulated someone 00:55:31.060 |
into reading a CAPTCHA to it and telling it what the CAPTCHA says. 00:55:35.440 |
So, that's a really concrete example of manipulation, and it seems like now these 00:55:41.480 |
models are capable of that. I don't think open-source models are there yet, but these 00:55:46.760 |
are just things that come out, vulnerabilities that would come up. 00:55:55.840 |
Yeah. So, I would say it's less about 00:56:25.720 |
open-sourcing a data set that is crafted to elicit this behavior, and more about 00:56:32.160 |
the kinds of harms that we should be thinking about. So, it's about 00:56:36.920 |
hallucination, plagiarism, manipulation, trying to leak PII, 00:56:42.720 |
people's credit cards, SSNs, things like that. It's about thinking through these 00:56:46.560 |
different dimensions and giving concrete examples of how these models can exhibit 00:56:52.280 |
this behavior. But I think what you are getting at is: what 00:56:57.120 |
if we gave them concrete ways, concrete prompts, for how to jailbreak, and then they 00:57:01.320 |
can go and try to do that? I think the first thing is that while we are doing this, 00:57:05.600 |
we would have evaluated our own models, and we would then start thinking about guardrails 00:57:09.160 |
and safety ourselves. And if, indeed, the data set is so good that we 00:57:13.600 |
can say that a lot of these powerful models are failing on it, then obviously you don't 00:57:17.160 |
open source it instantly, but you actually think about what is the best way to put it 00:57:21.240 |
out there, by first securing the model and making sure that it 00:57:26.040 |
does not elicit that kind of behavior, and then sharing it once you have already 00:57:31.200 |
crossed that bridge and can say, yeah, my model is safeguarded 00:57:34.760 |
against that. So it's more a process, a gradient of things that you need to do. 00:57:41.240 |
Yeah, so you're asking whether, when you're bootstrapping synthetic data 00:58:11.160 |
from other language models, have we seen some kind of mode 00:58:15.800 |
collapse or something like that? So, actually, so far it's been clear that these 00:58:22.280 |
data sets are good; they actually turn 00:58:26.760 |
regular language models into chatbots that are as good as the experience that you 00:58:30.600 |
get by chatting with ChatGPT. Although there are the quirks 00:58:35.000 |
that I raised, which is that when you have these models and you 00:58:38.320 |
put them on a benchmark, and you see that suddenly it's at, like, 90%, it might 00:58:42.320 |
just be because you used the model that was the evaluator to generate the data and then 00:58:46.640 |
create this model, and that in turn is this doping thing, right? And so that is one thing 00:58:51.160 |
that is important to think about. The other thing is, what was I gonna say? 00:58:58.960 |
I forgot. Yeah, the other thing is about the licensing part, which is kind of not related 00:59:17.880 |
to what you were asking, but essentially, 00:59:22.040 |
we can open source it, but not for commercial use, so it's 00:59:26.440 |
still a restrictive license, and you cannot use it for building and selling 00:59:31.640 |
applications down the line. But then it's still good as a research artifact. 00:59:37.360 |
And I think we would have seen these kinds of collapses happen if it was allowed 00:59:42.840 |
to use these commercially. Actually, 00:59:46.520 |
recently we did see, there's this company called Daxter, which 00:59:50.880 |
was using GPT-4 for summarization, and they replaced it with the open-source model called 00:59:55.220 |
Mistral. And they said that their customers haven't complained, they're 01:00:00.200 |
saving a ton of money, it just seems to work fine, and it's 01:00:04.720 |
just as good. Not that I'm saying that Mistral is trained on any of the 01:00:09.840 |
synthetic data, but it's just an example of things that would become very clear by 01:00:14.400 |
doing this sort of A/B testing, where you replace one model with another one and see what happens. 01:00:31.440 |
It seems like another axis you might beat ChatGPT on is cost. So I wondered 01:00:40.740 |
what your total budget was, or your total cost was, to produce your model. 01:00:46.360 |
Oh, so Zephyr 7B was just four hours of training on 16 A100s. So that's less than $50, 01:00:55.800 |
I guess, because we used synthetic data sets that were already open source, which are 01:01:02.360 |
UltraChat and UltraFeedback. But what about the cost associated 01:01:07.520 |
with the overall effort, all the people and everything? Yeah. 01:01:11.760 |
I see. Okay. So, all the people and everything, in the sense that, I guess 01:01:16.720 |
UltraChat and UltraFeedback probably might have reported some cost, but they are mostly 01:01:22.000 |
created synthetically with very little human intervention. So they might, 01:01:28.960 |
I don't know if they report that, I haven't looked into it. But I would say it was still 01:01:33.100 |
much more cost-efficient than what we spent on buying data from Surge and Scale AI. 01:01:38.480 |
We spent about half a million buying about 20,000 prompts of human preferences, that is, 20,000 01:01:45.160 |
dialogues, and about 10,000 instruction demonstrations. So that was quite a bit. 01:01:55.200 |
What I'm curious about is the scale that you used for evaluating the bias for GPT-4. I saw 01:02:10.360 |
something like 1 to 7 on the slide. Yeah. Oh, so yeah, this was the Anthropic scale. Remember 01:02:25.600 |
that 1 to 4 is decreasingly A and 5 to 8 is increasingly B. Yeah. 01:02:30.560 |
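(My reading of that 8-point scale as a tiny sketch; the exact rubric wording is not reproduced here, just the mapping from rating to preferred response and strength.)

    def interpret_preference(rating: int):
        # 1-4: response A preferred, strength decreasing from 1 to 4.
        # 5-8: response B preferred, strength increasing from 5 to 8.
        if not 1 <= rating <= 8:
            raise ValueError("rating must be between 1 and 8")
        if rating <= 4:
            return "A", 5 - rating   # 1 -> strongest, 4 -> weakest preference for A
        return "B", rating - 4       # 5 -> weakest, 8 -> strongest preference for B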
And then you were giving the model's ratings on that scale? Yes, exactly. Yeah. And for these types of evaluations, 01:02:40.440 |
how sensitive to the prompt do you find the evaluators to be? Beyond telling it 01:02:46.600 |
that it has to account for this left bias and the right bias, what's stopping you from 01:02:54.000 |
saying the distribution should be uniform, or the distribution should be normal, and just 01:02:59.160 |
kind of iterating to see what those should be? Yeah. I think that's a good 01:03:04.840 |
point, in the sense that we did not study which particular tasks or prompts 01:03:09.960 |
were causing GPT-4 to generate this kind of bias. Although I would 01:03:16.120 |
say that this was also observed by LMSYS and it's part of their findings as 01:03:22.040 |
well; the LMSYS paper also has that. But it would be interesting, and it 01:03:30.320 |
would be surprising, if it generates this on very long prompts or prompts from, say, 01:03:35.320 |
math or something, which are just hard to evaluate when the responses are 01:03:39.640 |
too long. At least as a human, when I see a bunch of code 01:03:44.520 |
on this side and on this side, and both of them are 01:03:47.480 |
trying to do the same thing but with a very different approach, it's very hard to evaluate them. 01:03:52.520 |
Right. And so, yeah, we haven't looked into that. 01:03:56.360 |
Perhaps another thing is, do you think the order matters, like which output you give to GPT-4 first? 01:04:10.440 |
Yeah, I mean, that was basically the takeaway. It's interesting because 01:04:15.760 |
humans usually have a recency bias, which is essentially that the last thing that you read is 01:04:20.080 |
the thing that you remember, and so you're just inclined to choose 01:04:23.880 |
that more. GPT-4 actually had a left 01:04:27.720 |
bias, so it favored the thing that it saw first, in some sense. And I think LMSYS 01:04:32.760 |
was the one that proposed that because it has this left-to-right training, maybe that's why it 01:04:37.080 |
has that kind of a bias. But yeah, the way we alleviated that was by 01:04:44.120 |
having every model's output be equally likely to be on the left and the right-hand side. 01:04:48.720 |
So if we're comparing Alpaca and Vicuna, then instead of just putting Alpaca on the left 01:04:53.240 |
and Vicuna on the right, we would just randomly switch them, so both of them are equally likely to appear on either side. 01:05:05.280 |
If you just ask it to rate on a scale of 1 to 5, yes. But if you say, 01:05:10.280 |
hey, you have this bias, and try to make it aware of it, then it flips and overcorrects in the other direction. 01:05:40.120 |
Are there other approaches where you prompt the model by shuffling the prompts 01:05:46.160 |
and then you have to kind of de-bias the results of it? 01:05:57.080 |
Shuffle the order of how you put in the responses? 01:06:05.760 |
Yeah. So that's what we did: we would randomly shuffle 01:06:11.240 |
the left and the right. So basically, you create 01:06:17.080 |
N-choose-2 combinations, where N is the number of models. Suppose you want to evaluate three models on 10 prompts. 01:06:23.200 |
Then you'll have 3-choose-2 combinations of models. 01:06:29.040 |
You would then generate the total data 01:06:33.080 |
set: you would have generated 10 responses from each of these models, and 01:06:37.080 |
then put them together in this 3-choose-2 setting, so that gives a combination 01:06:43.360 |
of each of these. And then you make sure that the models on the left 01:06:48.040 |
are equally likely to also occur on the right. So if you are doing model one and then 01:06:52.680 |
model two, then you also make sure that you also do model two and then model one, rated on a 01:07:21.520 |
scale of 1 to 10. Okay. Sure. Sorry, should I keep the Zoom on? Thank you. Yes. 01:07:49.360 |
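(A rough sketch of that pairing scheme, with made-up data structures: build every N-choose-2 model pair for each prompt and queue both orderings, so each model is equally likely to appear on the left.)

    from itertools import combinations

    def build_comparison_jobs(prompts, responses_by_model):
        # responses_by_model: {model_name: [response to prompts[0], prompts[1], ...]}
        jobs = []
        for model_a, model_b in combinations(responses_by_model, 2):  # N choose 2 pairs
            for i, prompt in enumerate(prompts):
                resp_a = responses_by_model[model_a][i]
                resp_b = responses_by_model[model_b][i]
                # Queue both orderings so neither model is stuck in the left slot.
                jobs.append((prompt, model_a, resp_a, model_b, resp_b))
                jobs.append((prompt, model_b, resp_b, model_a, resp_a))
        return jobs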
So, I mean, just to see if I understand this correctly. On the reinforcement learning 01:07:58.120 |
side, first you build a reward model. That reward model takes text as input, 01:08:02.840 |
and humans give it scores, so as a supervised problem we are trying to predict 01:08:06.600 |
the score from the text. Then I have the reward model. In reinforcement 01:08:13.600 |
learning, I take a sequence of tokens, up until I hit the next token, which is the end-of-sequence 01:08:18.000 |
token, and I pump that through the reward model and then optimize on that reward. 01:08:22.000 |
Yes. And that means it's very sparse rewards, right? I only have a reward at the very end. 01:08:26.400 |
But that's how it works. Yes, exactly. And it's very sample inefficient 01:08:31.240 |
because I keep doing this again and again, and that's why you need a hundred thousand 01:08:35.600 |
examples for doing RLHF, but only 10,000 for supervised fine-tuning. That's kind of the intuition. Okay, great. 01:08:41.240 |
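(To make that concrete, a minimal sketch of scoring a full prompt-plus-response sequence with a publicly available reward model; the model name is just an example, not the one used in this work. In PPO-style RLHF that single scalar is assigned at the final end-of-sequence token, which is why the reward is sparse.)

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Example open reward model; swap in your own trained reward model.
    reward_model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
    tokenizer = AutoTokenizer.from_pretrained(reward_model_name)
    reward_model = AutoModelForSequenceClassification.from_pretrained(reward_model_name)

    def sequence_reward(prompt: str, response: str) -> float:
        # One scalar for the whole (prompt, response) pair; intermediate tokens
        # receive no direct reward signal, which is what makes RLHF rewards sparse.
        inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
        with torch.no_grad():
            return reward_model(**inputs).logits[0].item()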
Thanks so much. Very interesting talk. Thank you.