Stanford CS25: V3 I Recipe for Training Helpful Chatbots



00:00:00.000 | >> Hello, everyone. Today we have Nazneen from Hugging Face, who is working on AI safety
00:00:13.960 | and alignment using reinforcement learning with human feedback. She's an expert in the
00:00:20.000 | space of large language models and their evaluation. Before Hugging Face, she led a team of researchers
00:00:29.480 | at Salesforce focused on building robust natural language generation systems based on LLMs,
00:00:36.880 | and she got her Ph.D. at UT Austin in computer science. So, everyone, welcome.
00:00:45.000 | >> Thanks for having me. So, the title of my talk today is recipes for training helpful
00:00:53.880 | chatbots. So, here's the introduction. I was part of this team called the H4 at Hugging
00:01:00.800 | Face, and today I'll walk you through what we built, how we decided on what we need for
00:01:07.360 | building that. And so, essentially, what we wanted to build and the goal of the team and
00:01:11.200 | the project since earlier this year was to figure out a recipe for H4, which stands for
00:01:17.400 | helpful, harmless, honest, and huggy because it's Hugging Face chatbot. And so, the ingredients
00:01:23.920 | essentially were to figure out what kind of data sets do we need for supervised fine
00:01:28.800 | tuning and RLHF. And we wanted to not worry about pre-training. Instead, take an open
00:01:36.000 | source pre-trained model and recreate the secret sauce of alignment on it. And the procedure
00:01:41.440 | that we wanted to follow and replicate on open source is this figure that I'm pretty
00:01:46.560 | sure most of you are familiar with at this point. It's from this InstructGPT paper from
00:01:50.960 | OpenAI, which shows three steps. I'm going to go into a bit more detail on this because
00:01:58.160 | this slide is much smaller. But this is what the outline of the talk looks like. I'll be
00:02:03.360 | getting the detail of how did we decide what kind of data, how much data, and all the details
00:02:08.760 | of the data for supervised fine tuning. Then similarly for RLHF. Then I'm going to talk
00:02:14.640 | about distillation of language model alignment. Then experiments with different helpfulness
00:02:21.080 | recipes. Finally, talk about evaluation of these models and quirks of using GPT-4 as
00:02:27.080 | an evaluator.
00:02:28.840 | Okay, so this is kind of like, you know, the overall recipe that the InstructGPT paper from
00:02:34.680 | OpenAI put forward as, you know, the steps for training a chatbot. So, the first step
00:02:40.720 | over here is to do supervised fine tuning. Essentially, like, you know, you're doing
00:02:44.920 | fine tuning with human instruction demonstration data. So, the input and the output are both
00:02:50.440 | given by humans. The step two is, like, you know, the input is given by a human. The output
00:02:55.760 | comes from models. And then the human just rates thumbs up, thumbs down, or ranks them.
00:03:01.080 | And then you train a reward model, which is essentially just a classifier. And then the
00:03:04.600 | final step three is doing fine tuning using that reward model with reinforcement learning.
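To make step two a bit more concrete, here is a minimal sketch (assuming PyTorch) of the standard pairwise loss used to train such a reward model in InstructGPT-style setups: the response the human ranked higher should get a larger scalar score. This is the textbook formulation, not code from the talk.

```python
# Minimal sketch of the pairwise (Bradley-Terry style) reward-model loss:
# the reward model outputs one scalar per (prompt, response); training pushes
# the score of the human-preferred response above the rejected one.
import torch
import torch.nn.functional as F

def pairwise_reward_loss(score_chosen: torch.Tensor,
                         score_rejected: torch.Tensor) -> torch.Tensor:
    # score_chosen / score_rejected: shape (batch,) scalar rewards
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy usage with made-up scores:
loss = pairwise_reward_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.5]))
```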
00:03:10.840 | And so, the way I'm looking at it, like, step one is more for, like, you know, making a
00:03:15.060 | model into a helpful chatbot. And the steps two and three are essentially trying to add
00:03:20.080 | those guardrails in place for harmlessness. So, let's get started with talking about helpfulness.
00:03:26.360 | And most of my talk today will be focused on the step one. So, let's start diving deeper
00:03:34.240 | into this. And let's start with the data set. Like, how do we decide what we need for doing
00:03:39.760 | the supervised fine tuning? So, like, the data set for helpfulness for supervised fine
00:03:45.160 | tuning looks somewhat like this. This is from the self-instruct paper, if you're aware of
00:03:50.320 | that from end of last year. So, you have something that we call as a task, which then has an
00:03:55.560 | instruction, which is essentially a request by a user asking the model to, like, fulfill
00:04:00.840 | or, like, give a response to a certain task. And that is followed by input and output.
00:04:06.920 | The input in this case is optional. It could just be part of the instruction. And then
00:04:11.760 | the output is the expected output that the model should generate. But while we are doing
00:04:16.320 | this training, the human provides the expected output that the model would have generated
00:04:21.240 | in the actual test case. And so, here the input and the output are called instance or
00:04:27.000 | demonstration or completion. And that's why this is called instruction demonstration.
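As an illustration, a single instruction-demonstration record in this format might look like the following (the concrete content is made up; the field names follow the instruction/input/output convention described above):

```python
# A made-up example of one instruction-demonstration record.
example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "The committee met on Tuesday to review the budget proposal ...",  # optional
    "output": "The committee reviewed the budget proposal at its Tuesday meeting.",
}
```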
00:04:32.840 | So, this is kind of, like, just a high level landscape of what these data sets for instruction
00:04:39.720 | demonstration look like. And you must have, you know, been familiar with at least some
00:04:44.200 | of these. And, like, you know, the way I'm trying to put this is on this line where on
00:04:50.120 | one side I'm showing data sets that were generated using models or more powerful language models.
00:04:56.880 | And so, they're more synthetic data sets. On the right, I'm showing, like, human written
00:05:01.120 | data sets. And so, these are data sets that the human wrote the input as well as the expected
00:05:08.280 | output.
00:05:09.280 | And so, examples of these are, like, you know, so the Surge instruct data is the data that we
00:05:13.280 | at Hugging Face H4, you know, collected by contracting with Surge, this company that basically had contracts
00:05:20.820 | with annotators that were writing the inputs and outputs. But we had to give them all the
00:05:25.240 | specifications of what kind of data we need. And then you must have heard of, like, you
00:05:29.680 | know, obviously Open Assistant is this other community wide effort where people contributed
00:05:33.960 | manually writing inputs and outputs. Similarly with Dolly. And then on the other end, you
00:05:39.240 | can see, like, you know, the self instruct data set. I'm going to, like, dive into some
00:05:42.920 | of these. How are these synthetic data sets created for helpfulness or for supervised
00:05:48.520 | fine tuning?
00:05:49.520 | So, one of the examples of how the synthetic data is created is in the self instruct paper,
00:05:56.520 | which is called bootstrapping the data. So, in this case, they start with 175 seed tasks.
00:06:02.680 | That is, you know, 175, like, a very small data set of examples where the manually written
00:06:08.400 | inputs and outputs from humans, those are added to a task pool. Then a language model,
00:06:13.480 | like, you know, basically you bootstrap by giving that to the language model in a few
00:06:17.960 | short settings and ask it to generate more data like that. And then you have another
00:06:23.200 | language model that does this task classification. Like, you know, what kind of task is this
00:06:28.800 | sample or the example belonging to? And finally, it also does this more fine grained classification
00:06:34.880 | as to, like, you know, does it have, you know, output first or does it require input first
00:06:39.360 | and so on? And because this is synthetic data and created in this, like, a very scalable
00:06:43.760 | way, you also have to do a lot of filtering to make sure that it is very high quality.
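A minimal sketch of that bootstrapping loop might look like this; `call_model` stands in for whatever LLM endpoint is used, and the similarity filter is reduced to an exact-match check here (the actual paper uses few-shot prompts over the seed pool plus a ROUGE-L based filter and task-type classification):

```python
import random

def call_model(prompt: str) -> str:
    # Placeholder for a few-shot LLM call that proposes a new instruction.
    return "Write a haiku about autumn."

def too_similar(candidate: str, pool: list[str]) -> bool:
    # Placeholder for the ROUGE-L similarity filter used in the real pipeline.
    return candidate in pool

seed_tasks = ["Give three tips for staying healthy.",
              "Translate 'good morning' to French."]
task_pool = list(seed_tasks)

for _ in range(10):  # the real pipeline keeps going until tens of thousands of tasks
    demos = random.sample(task_pool, k=min(2, len(task_pool)))
    prompt = "Come up with a new task:\n" + "\n".join(demos)
    new_task = call_model(prompt)
    if not too_similar(new_task, task_pool):
        task_pool.append(new_task)   # bootstrapped task joins the pool
```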
00:06:49.560 | So, another way of generating this kind of synthetic data is what UltraChat did. And
00:06:55.840 | in this case, they had, like, a human in the loop process. So, a human would, like, you
00:07:00.480 | know, look up, like, either, you know, search Wikipedia or something and then come up with,
00:07:06.400 | you know, topics that they want to generate data for. And then, you know, ask the model,
00:07:11.640 | like, provide it with the required material that would be needed for, you know, coming
00:07:16.240 | up with, say, question answering or summarization or any of these specific tasks. And then give
00:07:21.320 | it to a more powerful model, like, ChatGPT or GPT-4. In this case, it was ChatGPT. And
00:07:26.040 | then, oh, actually, GPT-4. And then you kind of, like, you know, keep doing these loops
00:07:30.520 | of, like, you know, giving the material to the model and say, like, come up with questions
00:07:34.480 | and answers on this particular task using all this material. And then, you know, then
00:07:39.120 | the human looks at it and then keeps querying it and refining it more and more. So, this
00:07:43.560 | is another way of creating synthetic data. Obviously, this has a human sitting there
00:07:47.640 | and doing a lot more filtering in the process. Then there's another one, which is, like,
00:07:53.120 | even less human involved, which is role playing. And this is the CAMEL dataset. In this case,
00:07:59.080 | all that the human does is, like, come up with an idea of what task or what, you know,
00:08:05.280 | level they want. So, at a high level, it would be, like, develop a trading bot for the stock
00:08:09.400 | market. And there would be two LLMs. One would be role playing as an AI assistant. The other
00:08:15.680 | would be role playing as an AI user. And then they basically just specify the task and,
00:08:20.760 | like, let these two bots chat with each other and create a conversation dataset, which is,
00:08:25.760 | again, like, a synthetic dataset for supervised fine tuning.
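A rough sketch of that role-playing loop, with `chat` as a placeholder for an LLM chat endpoint and all prompts made up for illustration:

```python
def chat(system_prompt: str, history: list[str]) -> str:
    # Placeholder: would call a chat model conditioned on the role prompt + history.
    return "..."

task = "Develop a trading bot for the stock market."
user_role = f"You are a user who wants to: {task}. Give one instruction at a time."
assistant_role = f"You are an AI assistant helping with: {task}. Answer each instruction."

history: list[str] = []
for _ in range(4):  # number of turns is arbitrary here
    user_msg = chat(user_role, history)            # the AI "user" asks
    history.append(f"User: {user_msg}")
    assistant_msg = chat(assistant_role, history)  # the AI "assistant" answers
    history.append(f"Assistant: {assistant_msg}")
# `history` is now one synthetic conversation for supervised fine-tuning.
```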
00:08:32.200 | So this is kind of, like, you know, just going back to this, you know, landscape. It looks
00:08:35.920 | like, you know, people have been very creative. And how do we get, you know, very high quality
00:08:40.920 | data quickly without spending a lot of money? And because humans are inefficient and expensive.
00:08:46.920 | And so, these are, like, you know, some examples that we looked at. But on the other hand,
00:08:50.440 | we also cannot, like, you know, underestimate how good quality, like, the manually created
00:08:55.640 | datasets are. And so, we at Hugging Face decided to, like, you know, go with everything, like,
00:09:01.840 | very manual and, like, you know, have humans do both the input and output. Also go figure
00:09:06.080 | out, like, what are the, you know, essential documents or, you know, other material they
00:09:10.760 | need for coming up with creating this dataset. But when we started doing that, we were earlier
00:09:17.000 | in the year. So, this is back in January or February of this year. And this is what the
00:09:21.000 | landscape looked like at that time. And so, there were very few datasets available.
00:09:26.080 | A lot of these were mostly synthetically created. So, we wanted to, like, you know, kind of
00:09:31.640 | leverage what was existing out there. But we also had to make some really important
00:09:35.440 | decisions because we were going to, like, pay money and make sure that the data that
00:09:39.160 | we collect is actually useful for building the model and, you know, the applications
00:09:43.520 | that are built on top of it. So, these are the learnings that we had from the past papers
00:09:49.200 | that were, you know, creating these supervised fine-tuned datasets. We knew that the dataset
00:09:53.800 | has to be in the range of tens of thousands of examples. So, this is from the self-instruct
00:09:58.840 | dataset. And we also knew that, you know, these models that are trained on this dataset
00:10:04.720 | show diminishing returns after just a few thousand high-quality instructions. So, you
00:10:10.040 | don't need a lot. And then it saturates very quickly. So, these are the two findings that
00:10:14.080 | we had when we started to, like, go collect datasets for supervised fine-tuning.
00:10:20.400 | But we also had to give some very fine-grained specifications on what we want for our dataset.
00:10:26.000 | In particular, we had to decide what is the task distribution we want for the data that
00:10:29.880 | we are collecting. I mean, we know it's tens of thousands, but how many thousands of what
00:10:34.160 | task, right? The length distribution, like, you know, should the prompt have a certain
00:10:39.040 | dimension? Is that even an important factor? And one thing is that we had decided
00:10:44.080 | that we want to make it high-quality and human-written, but then there were, like, options on that
00:10:48.800 | as well. We could go with external vendors, like Surge, Scale AI, AWS SageMaker Ground Truth, and
00:10:54.400 | so on. Or we could hire our own contractors from Upwork and MTurk. So, those were, like,
00:10:59.560 | decisions that we had to make. So, let's look at each of these one by one.
00:11:04.040 | So, because we were recreating this InstructGPT recipe for this helpful chatbot, we wanted
00:11:10.960 | to, like, you know, take inspiration from their task distribution. So, on the left,
00:11:14.840 | I'm showing, like, the task distribution that OpenAI used for the InstructGPT
00:11:20.600 | paper. As you can see, generation is, like, you know, the majority of it, followed
00:11:24.960 | by some, you know, some of these open-ended tasks and brainstorming tasks and so on. And
00:11:29.760 | these are examples of, like, what prompts of each of those look like. So, we decided
00:11:34.640 | to, like, you know, just go with that. But instead, you must have noticed that there's
00:11:38.120 | this category called "other" in the table. And we obviously don't know what that was.
00:11:43.400 | But so, we decided to replace that with "code." So, essentially, it would be, like, debugging,
00:11:48.020 | asking clarification questions about the code. So, it's like code plus natural language.
00:11:52.480 | So, this is what our final distribution looked like.
00:11:57.680 | The second question was the length distribution. So, we also had to, like, you know, figure
00:12:01.840 | out, like, you know, how important is the length? And should we, like, you know, have
00:12:05.240 | a certain length distribution that we ask these companies to collect data for us? So,
00:12:10.400 | we did a pilot study with Surge, Scale AI, and AWS SageMaker Ground Truth, which is more
00:12:15.400 | like a managed service. So, it's very different from MTurk. And they have very high-quality
00:12:19.520 | humans, like, basically writing these examples. And so, I wanted to, like, just highlight
00:12:27.040 | that, you know, the first two rows here show what the InstructGPT length distribution
00:12:32.800 | looks like. And as you can see, this is obviously the full data set. This is more like pilot.
00:12:36.480 | So, like, the counts are much smaller. But you can see, like, the maximum is 2048. And
00:12:42.360 | as you know, like, that was, like, the standard context size in the beginning of the year.
00:12:47.240 | and then, you know, obviously, even the mean is, you know, not that much. It's,
00:12:50.440 | like, basically more or less, you know, in the range. But if you look at, you
00:12:55.080 | know, these examples from Surge, AWS, Scale AI, there's very high variance. So, for example,
00:13:00.880 | AWS SageMaker, the maximum prompt length is 1036. But then, like, you know, the mean is
00:13:06.200 | just 54. And on the other hand, with Surge, the maximum length is 500. But then the mean
00:13:13.160 | is higher, like, you know, 104. So, it's, like, more in the range of what we would expect
00:13:17.800 | from, like, you know, the distribution in InstructGPT. And similarly, with Scale AI, we found
00:13:22.760 | that, you know, the prompts were just very, very short. And so, just based on this, we
00:13:28.920 | said that, you know, okay, we should probably just go with Surge. Because, you know, that
00:13:32.560 | seems like something that is more, you know, in the range of not very high variance.
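The comparison itself is simple to reproduce: for each vendor's pilot batch, compute the maximum and mean prompt length and compare against the InstructGPT numbers. A minimal sketch (vendor names and prompts are placeholders, and a whitespace split stands in for a real tokenizer):

```python
from statistics import mean

pilot_prompts = {
    "vendor_a": ["Write a short story about a lighthouse keeper.",
                 "Brainstorm five ideas for a birthday gift."],
    "vendor_b": ["What is the capital of France?"],
}

for vendor, prompts in pilot_prompts.items():
    lengths = [len(p.split()) for p in prompts]   # swap in a tokenizer for token counts
    print(vendor, "max:", max(lengths), "mean:", round(mean(lengths), 1))
```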
00:13:38.840 | So, we ended up collecting 10,000 instruction demonstration pairs from Surge. And this
00:13:44.840 | is what the task distribution looked like. So, this very much follows the task distribution
00:13:49.280 | of InstructGPT, except for the coding part, which was, like, the "other" category over there.
00:13:54.640 | And these are the number of examples we collected for each of these tasks. And here, over here,
00:13:59.720 | I'm showing, like, you know, the average length for each of these task categories. And one
00:14:04.160 | thing I wanted to highlight was, which was very surprising to me, is that the chat is
00:14:08.160 | actually one of the shortest prompt length categories. But for OpenAI, that is actually
00:14:14.520 | one of the longest prompt length categories. So, which was very interesting. And so, obviously,
00:14:19.560 | like, you know, at that time, we did not think much about it. But when we started training
00:14:24.360 | models and started looking at the evaluation results, we were kind of, like, you know,
00:14:28.600 | if we had to go back and change things, how would we change that? And so, these were,
00:14:33.000 | like, things that we started, you know, looking at more carefully after we had already collected
00:14:37.720 | the data set. So, here are examples of what that data set looked like. You know, classification,
00:14:46.520 | generation, brainstorming. I'm sure you all must have seen at least some of these kind
00:14:51.480 | of examples of instruction demonstration data sets. So, it's very much, like, it has everything
00:14:56.000 | that you can expect from, like, NLP kind of tasks, but also more open-ended chatty tasks
00:15:02.800 | as well. Okay. So, here are, like, some details about the task force that was used by Surge
00:15:13.600 | to generate this data set. We requested a U.S.-based task force mainly because, like
00:15:18.920 | I said, we just wanted to replicate what InstructGPT was doing. And based on Anthropic and OpenAI's
00:15:24.720 | paper, it seemed like they preferred going with the U.S.-based task force. The gender
00:15:29.880 | was equally divided, and the age range was also very, you know, it was, like, a big range
00:15:36.520 | going all the way from 19 to 62. And then people had, like, you know, educational background
00:15:42.480 | ranges from technical degree to Ph.D. So, Ph.D. was mainly for tasks like math, coding,
00:15:47.120 | and so on. Okay. So, now I wanted to, like, switch gears a little bit and talk about this
00:15:55.760 | data set that we collected for RLHF or for human preferences before I get into, like,
00:16:02.000 | you know, the experiments we ran with this supervised fine-tuning data set and what results
00:16:05.960 | we got. So, again, over here, while we were collecting the human preference data set, we had
00:16:12.320 | to come up with what are the specifications of these data sets. So, again, just to, like,
00:16:17.360 | contrast this with how it is different from SFT: in the SFT data set, both the input and
00:16:22.200 | the output are written by humans. In this case, the human writes the input. The output
00:16:26.940 | comes from models, which is responses, but then the human just ranks or rates them on
00:16:31.960 | a certain scale. So, yeah, essentially, we had to decide, like, what is the task distribution
00:16:38.240 | looks like for RLHF data? Is it going to be same as supervised fine-tuning? What about
00:16:44.440 | the length distribution? And should we do, like, single turn versus multi-turn? So, in
00:16:49.040 | InstructGPT, it was mainly single turn. So, if we are trying to replicate InstructGPT,
00:16:53.720 | we would have to go with single turn. But if we are trying to replicate something like
00:16:57.120 | chat GPT, it would have to be, like, a multi-turn dialogue. And then we had to also, like, you
00:17:02.680 | know, decide on these dimensions of, like, helpfulness, honesty, and harmlessness. So,
00:17:07.760 | these are, like, the HHH that Anthropic follows; like, OpenAI puts it as helpfulness, truthfulness,
00:17:13.400 | and harmlessness. And then also we had to decide, like, you know, are they going to
00:17:16.880 | rate each of the responses individually? Or are they going to rank them? And what are
00:17:21.320 | the implications of, like, you know, us deciding one way or the other?
00:17:26.320 | So, we started by doing a pilot study again. So, we took 300 prompts from the self-instruct data
00:17:34.160 | set, the data set that was released end of last year. And then, you know, we generated
00:17:39.800 | model responses from our models and then gave it to data vendors to, like, rate the responses
00:17:45.320 | of the models. And we used this Anthropic template on the left, which is essentially asking the
00:17:51.480 | human choose the most helpful and honest response. And then, you know, these are the responses
00:17:55.800 | from, like, model A and model B. And this is a scale, which is also working as, like,
00:18:01.320 | sort of a ranking thing in the sense that one to four is, like, decreasingly model A
00:18:06.760 | and five to eight is increasingly model B.
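Paraphrased, the comparison template works roughly like the sketch below (the wording is illustrative, not the exact Anthropic template): the annotator picks a single number from 1 to 8, where 1-4 favor response A and 5-8 favor response B.

```python
# Illustrative comparison template; the exact wording used in practice differs.
COMPARISON_TEMPLATE = """\
Choose the most helpful and honest response.

Prompt: {prompt}

Response A: {response_a}
Response B: {response_b}

Rate on a scale of 1-8:
  1-4 = Response A is better (1 = much better, 4 = slightly better)
  5-8 = Response B is better (5 = slightly better, 8 = much better)
"""

filled = COMPARISON_TEMPLATE.format(
    prompt="Brainstorm a list of New Year's resolutions.",
    response_a="1. Exercise more ...",
    response_b="Maybe read more books ...",
)
```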
00:18:12.320 | And also, like, you know, one other thing we had to decide about is, like, how much
00:18:16.100 | data should we collect? And so, again, this is from the InstructGPT paper. And as you
00:18:21.000 | can see, like, you know, they have, like, the train and validation splits for each of
00:18:25.200 | the three steps, which are the SFT, training the reward model, and the PPO. And this one
00:18:30.240 | is in the order of tens of thousands. And, like, overall, this combined, which is, like,
00:18:35.000 | you know, this process of RLHF comes up to about 100,000.
00:18:40.120 | Great. Okay. So, then once we got this pilot study data back, we sat down and we wanted
00:18:49.040 | to also, like, you know, so I looked at it manually, and I felt that I did not agree
00:18:53.640 | with most of the answers that, you know, the annotators from each of these companies were
00:18:58.200 | providing. And so, I was kind of, like, you know, I don't think this is high quality at
00:19:02.600 | all. So, what I decided, like, you know, I told my team, let's go and, like, you know,
00:19:06.880 | rate it within ourselves. And then, you know, we basically rated, like, about 100 examples
00:19:12.240 | or so. And we followed, like, a similar template of, like, one to four and five to eight. And
00:19:17.560 | it basically the output, like, you know, the takeaway was that even we did not agree amongst
00:19:22.280 | each other. So, essentially, like, our models earlier in the year was so bad, you were essentially
00:19:27.440 | breaking ties, like, arbitrarily. Like, you know, you're deciding between, like, should
00:19:31.640 | it be, like, you know, three versus, like, seven or something like that. So, if they're
00:19:35.360 | equally bad, it's hard to, like, decide which one is better, right? And so, we were kind
00:19:40.000 | of, like, breaking some of these ties arbitrarily. And so, as you can see, like, you know, there
00:19:44.120 | was barely any, like, you know, agreement or correlation among our outputs. And then,
00:19:48.760 | you know, when I aggregated that, and, you know, looked at, you know, how well do we
00:19:52.480 | correlate with, like, for example, Surge and Scale. And so, we found, like, you know,
00:19:57.000 | that the maximum overlap was with Scale AI compared to, like,
00:20:03.680 | say, Surge. Okay. So, we ended up collecting 20,000 dialogues. So, we decided to go with
00:20:09.880 | multi-turn. And so, because it was multi-turn, you would have, like, 20,000 overall dialogues,
00:20:16.560 | but the number of prompts would be 80,000, so each dialogue would have
00:20:19.840 | about four turns on average. So, like, you know, a human would prompt it, the model
00:20:24.280 | would respond, a human would, like, rate the response, and then, you know, ask the follow-up
00:20:29.320 | question. And then, again, the model would, like, you know, generate two responses, and
00:20:32.760 | that is how it would go on. And so, the task distribution we decided to follow was a little
00:20:39.360 | bit different from what we had for supervised fine tuning. And the reason behind that was
00:20:44.640 | that we wanted to focus more on tasks that were, like, factual, so that, you know, essentially,
00:20:51.520 | this is more about making the model learn, like, between positive and negative signals.
00:20:55.800 | So, making the model, like, discriminate between, like, you know, what is factual, what is not,
00:21:00.200 | what is helpful, what is not, and what is harmless and what is not. And, like, you know,
00:21:04.920 | for example, tasks like generation and brainstorming, there's no one correct answer. Like, you know,
00:21:09.480 | everyone can come up with, like, different lists or recipes, and, you know, it's hard
00:21:13.440 | to say, is this the best answer? Is this the most helpful answer? But if you ask, like,
00:21:17.200 | a factual question, it's, like, very clear what is correct and what is not. So, that
00:21:22.080 | was kind of, like, our reasoning behind doing this. And so, this is a task distribution
00:21:26.120 | that we came up with for collecting the human preference dataset.
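Concretely, one turn of such a multi-turn preference dialogue can be flattened into a record roughly like the one below (the field names and content are made up for illustration; they are not the exact schema used):

```python
# Made-up example of one flattened preference turn.
preference_turn = {
    "dialogue_history": [
        {"role": "user", "content": "What is the boiling point of water at sea level?"},
    ],
    "chosen": "Water boils at 100 degrees Celsius (212 Fahrenheit) at sea level.",
    "rejected": "Water boils at around 90 degrees Celsius.",
    "rating": "chosen preferred",   # however the ranking/rating scale is encoded
}
```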
00:21:31.680 | Also about the length, because we are doing this in a multi-turn setting, and so we wanted
00:21:35.840 | to make sure, like, you know, the entire dialogue could fit into, like, the context length of
00:21:40.000 | the models, we have decided to, like, you know, ask them to keep the overall dialogue
00:21:44.080 | to be shorter than 2048 tokens. And then it was multi-turn with an average of four turns
00:21:51.080 | per dialogue. Then, obviously, we had to also select on the dimension of, like, whether
00:21:55.880 | we are going for, like, helpful over harmless or, you know, honesty. So, we followed these
00:22:01.800 | instructions from the OpenAI guidelines. I'm not sure if I can pull this up. That would
00:22:06.240 | be nice. Okay. Great. But, yeah, so, OpenAI has this document which is public of, like,
00:22:16.400 | labeling instructions that they shared with their annotators. And so, they have, obviously,
00:22:21.360 | like I said, they have helpful, truthful, and harmless, but then they also have this
00:22:25.320 | thing how do I scroll down? Okay. So, they have definitions on what do they mean by helpfulness,
00:22:32.440 | what do they mean by truthfulness, and what do they mean by harmlessness. So, in our case,
00:22:37.120 | because our models were not as good, we decided to focus on helpfulness and truthfulness.
00:22:42.620 | And when they had to break ties, OpenAI says that, you know, choose truthfulness and harmlessness
00:22:48.840 | over helpfulness. So, like, let me see that. Yeah. So, they wanted to, like, prioritize harmlessness
00:23:07.000 | and truthfulness over helpfulness, but we went the other way around. We said we wanted
00:23:10.920 | to, like, prioritize helpfulness over honesty or harmlessness. I mean, we weren't even focusing
00:23:16.720 | on harmlessness, because we just wanted to get our model to a certain capabilities before
00:23:21.120 | we start thinking about that. But, yeah, this is really a very good document and, like,
00:23:26.840 | you know, defines what should the annotator be looking at and how do they decide when
00:23:32.040 | the model responses are very close, how do they break those ties. And for, like, you
00:23:37.680 | know, deciding between what kind of template should we use for collecting these annotations,
00:23:42.680 | we started off with the Anthropic template that I showed a few slides earlier, which
00:23:45.960 | was on a scale of one to eight, but essentially ranking between these two models. And then,
00:23:50.720 | you know, Llama 2 came out while we were in this iterative process. And our iterative
00:23:54.960 | process was essentially we used to give an endpoint to the vendor, and then the, you
00:23:59.520 | know, basically the annotators that they had in the managed task force would prompt these
00:24:03.880 | endpoints. The model would generate two responses. They would, you know, follow the instructions
00:24:10.680 | and, you know, give the ranking for each of those model responses.
00:24:15.800 | And then, you know, again, like, follow up with the second prompt and the conversation
00:24:18.680 | would go on. And then they would give us the data at the end of that week. We would fine
00:24:22.600 | tune our model on that data so that the model now is hopefully better. And then we give,
00:24:27.920 | like, a better endpoint to them for the next week to continue this process. So it's, like,
00:24:32.160 | very iterative. And, like, you know, they have to adapt to, like, model getting better
00:24:36.320 | week by week. So, yeah, basically, but, like, you know, we decided to switch to I think
00:24:41.240 | for one or two weeks we collected entropics, use entropic scale for collecting data set.
00:24:47.840 | But then Lama 2 came out and their results showed that, you know, clearly that, you know,
00:24:52.080 | they were using this much more easier scale of just 1 to 4. So they were, like, you know,
00:24:57.440 | choosing which one is a better response between the two responses and then seeing how much
00:25:02.140 | better it is. So is it, like, significantly better or is it only slightly better? And
00:25:06.920 | so that was the ranking of, like, scale 1 to 4. So here are examples of data that we
00:25:13.280 | collected. So on the left, you can see that it is asking about, like, you know, basically
00:25:19.920 | human is prompting with a question and then the bot generates a response. So this is the
00:25:24.480 | response that the human chose at this turn. And then the human, you know, follows up with
00:25:28.680 | the second prompt. And then this is the bot response that was chosen by this human. And
00:25:33.480 | this is the rejected bot response. And this is giving the response margin of 3, which
00:25:37.760 | is saying that they are quite a bit different. So 4 is, like, very different and 1 being
00:25:41.880 | very slightly different. And then here on the right-hand side is more about sort of
00:25:47.200 | generation brainstorming kind of example where the human is asking, like, can you write a
00:25:52.580 | text message wishing your husband a happy anniversary? And then the bot writes something.
00:25:57.760 | I guess my thing messed up the emojis. But, you know, then the human follows up with saying,
00:26:03.640 | hey, you missed this important detail, which is, you know, they have been married for eight
00:26:07.320 | years. And so this is a chosen bot response. This is the rejected one that the human chose
00:26:12.360 | between those two. And as you can see, they are quite good. So the response margin is
00:26:17.920 | just 1. So they're, like, just slightly different. Okay. Sounds good. So now I'm going to, like,
00:26:26.680 | talk about this another recipe that we tried, which is, you know, using synthetic data set
00:26:32.080 | essentially for distillation of AI alignment, which is basically the paper that we released
00:26:37.680 | last week called Zephyr, and which was, like, a 7 billion parameter model, which actually
00:26:43.160 | beat ChatGPT. And this builds on top of the Mistral model. But I just wanted to, like,
00:26:49.280 | you know, yeah, just, you know, basically we recreated some of the steps that were there
00:26:52.640 | in the InstructGPT paper, but now using synthetic data sets. And so the first one is,
00:26:58.920 | like, you know, you are basically, like, using a data set. In this case, we use UltraChat.
00:27:03.800 | So this is a data set I showed a few slides earlier for supervised fine tuning, wherein,
00:27:07.680 | like, a human was brainstorming and, like, gathering the material, and then, like, you
00:27:12.280 | know, chatting with this GPT-4 model to, like, generate multiple different, you know, outputs
00:27:18.000 | for the instruction. And then, you know, this is how we collect that data set, which is
00:27:21.920 | called UltraChat. And then we use that for fine tuning our model. And then the second
00:27:27.500 | step is the response generation and AI ranking. So in this case, also, like, you know, we
00:27:33.320 | used UltraFeedback, which is a data set that was released. And the way this data set was
00:27:39.280 | constructed was that, you know, they asked, basically, like, you know, took some prompts
00:27:44.440 | from, like, ShareGPT and some of these different SFT data sets that were already out there.
00:27:49.920 | And then they gave it to four different models, like, four different powerful models, like
00:27:53.800 | PaLM 2, Claude 2, GPT-4, and so on. And then they asked GPT-4 to, like, rank each
00:28:01.440 | of those four responses. And then, so, like, you know, the one that is the best is the
00:28:05.800 | one that GPT-4 ranks as the highest. So each of these are scored individually on a scale
00:28:09.840 | of 1 to 10. And the one that gets the maximum score is, like, the best response. And then
00:28:16.760 | finally, we did something called DPO, which you might have been aware of because it came
00:28:21.520 | out of Stanford. It's, like, this kind of alternative to RLHF, which is, like, doing
00:28:26.320 | this direct preference optimization. And so instead of, like, you know, basically doing
00:28:31.720 | this iterative process of fine-tuning, you directly, like, optimize on, like, the chosen
00:28:35.960 | one. So we just take that and then fine-tune our model directly on that chosen response.
00:28:42.600 | And the other one that we are using is, like, a random response from these other three responses.
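For reference, the DPO objective itself is straightforward to write down; a minimal sketch assuming PyTorch (sequence log-probabilities are summed over response tokens, and beta is the usual temperature-like hyperparameter):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Each input: shape (batch,) log-probability of the full response under the
    # trainable policy or the frozen reference (SFT) model.
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with made-up log-probabilities:
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```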
00:28:47.920 | Okay. So I'm going to talk a little bit about experiments and evaluation for each of these
00:28:54.960 | recipes. One is collecting everything with, like, humans involved. And the second one
00:29:00.000 | is everything which is synthetic. But then before I discuss evaluation, I wanted to talk
00:29:05.160 | about, like, what are the benchmarks that we are evaluating on and how good are these
00:29:09.040 | benchmarks for evaluating chatbots. And to think about evaluation, we need to first think
00:29:14.520 | about how are we training these models. So, like, today, all the models that are trained
00:29:19.240 | are, like, more or less have these four ways of learning. The first one is pre-training
00:29:23.640 | the language model. Essentially, you're predicting the next token. And examples of these are,
00:29:28.080 | like, GPT-3, OPT, and so on, like, the foundation models. The second type of learning is in-context
00:29:34.280 | learning or the prompt-based learning. In this case, you're, like, just giving a new
00:29:38.800 | kind of task in the context of the model and then, you know, ask it to, like, you know,
00:29:43.480 | do that on new examples. So, like, if you wanted to write a poem, for example, for GPT-3,
00:29:48.160 | you would have written that in the context and then it would have generated a new poem
00:29:51.880 | on some other topic. The third type of learning is the supervised fine tuning, which was kind
00:29:59.760 | of, like, the first step of training a chatbot. In this case, you're, like, fine tuning on
00:30:04.760 | the instruction following data and then you want these language models, which are just
00:30:08.840 | pre-trained to predict the next token to become chatty and to, like, generate open-ended responses.
00:30:14.760 | And then, finally, the fourth one is reinforcement learning from human feedback, which is nudging
00:30:19.480 | the language model towards the values you desire. And examples include Llama 2 Chat
00:30:24.040 | from Meta. So, for the first two steps, you know, we have a lot of benchmarks for these
00:30:33.060 | two types of training. Like, Stanford HELM is an example of that. Or the Google BIG-bench
00:30:37.840 | or even open LLM leaderboard. But for these two types of learning, which is supervised
00:30:45.440 | fine tuning and reinforcement learning from human feedback, which are parts of, like,
00:30:49.160 | this recipe for training a chatbot, there's, you know, not a lot of leaderboards or evaluation
00:30:54.380 | benchmarks available. But there are some available. And I wanted to, like, you know, just highlight
00:30:58.340 | some of those. So, like, yeah, this is essentially, like, the steps three and four here match
00:31:04.220 | to, like, you know, the step one over here, which is helpfulness, and then steps two and
00:31:08.440 | three over here, which is, like, you know, nudging the model towards being more harmless.
00:31:13.560 | So, if you had to, you know, evaluate the chatbot for each of these steps, you would
00:31:19.840 | have to think about how do you evaluate instruction following or chattiness. You would have to,
00:31:24.800 | you know, think about how do you evaluate the reward model, which is essentially a classifier.
00:31:29.420 | And then finally think about, you know, how do you evaluate for harmlessness, which is
00:31:33.320 | by red teaming or adversarially prompting the language model. So, for the first step, you
00:31:39.480 | would have to see, like, does the model generate useful responses on the topic? And are they
00:31:43.440 | open ended? And one example of a prompt that you would try to evaluate the model would
00:31:47.880 | be to, like, say, brainstorm a list of New Year's resolutions. And so, examples of
00:31:54.080 | benchmarks and evaluation boards that are looking at this sort of, like, supervised
00:31:59.360 | fine tuning is, like, Hugging Face's leaderboard with ELO ratings. So, ELO is this metric that
00:32:05.300 | is used in chess, which is, like, you know, you're pairing one player against the other,
00:32:09.480 | and you want to, like, rank these players when they have, like, these tournaments against
00:32:13.160 | each other. And so, in a similar sense, we are, you know, taking these chatbots and then,
00:32:20.080 | you know, putting them in a pairwise setting. And then we partnered with ScaleAI, and they
00:32:25.000 | provided humans to, like, annotate which response is better. And we did that for every single
00:32:30.360 | combination. So, like, it was N choose 2, where N is the number of prompts we are looking at.
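For reference, the Elo bookkeeping behind such a leaderboard is just the standard chess update; a minimal sketch (the K-factor and starting rating are conventional choices, not necessarily the ones used for this particular leaderboard):

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a: 1.0 if A wins the pairwise comparison, 0.0 if A loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Model A beats model B once, both starting at 1000:
print(elo_update(1000.0, 1000.0, score_a=1.0))   # -> (1016.0, 984.0)
```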
00:32:35.480 | And so, we generate N choose 2 combinations, and we rate each of them. And so, these are the
00:32:40.640 | ELO ratings that we get out of it. And this column here shows the rating
00:32:47.600 | you would get if you had used GPT-4 as a proxy for humans. So, instead of, like,
00:32:52.880 | humans sitting and rating each of those, you're asking, like, you know, GPT-4 to select which
00:32:57.120 | is a better response. Yeah. And so, basically, this first table is showing results if
00:33:02.680 | no ties were allowed. And this table is
00:33:07.320 | showing results if ties were allowed. Another example is, you know, this leaderboard from
00:33:15.480 | Stanford, which is Alpaca Eval leaderboard, and they're doing something very similar in
00:33:20.440 | the sense that they have GPT-4 and Claude as an evaluator, and they are doing, like,
00:33:25.720 | a pairwise evaluation of these models, chatbot models, and they're reporting the win rate
00:33:32.040 | of, you know, which model wins against the other one. There's also the LMSYS leaderboard
00:33:38.840 | from Berkeley, which has this thing called the chatbot arena, which is essentially like
00:33:43.120 | a publicly crowdsourced leaderboard, wherein you can, like, go chat, like, you know, chat
00:33:48.040 | with any of their models, and then give them rating to, like, which one was more helpful
00:33:51.880 | and which one was better. And so, this, again, has, like, a leaderboard of ELO ratings, because
00:33:56.640 | this is done in a pairwise setting. There's another benchmark from LMSYS, which is called
00:34:03.960 | MT-Bench, or the multi-turn benchmark. And this is the first ever multi-turn
00:34:09.360 | dialogue benchmark that is evaluating chatbots. And so, it has, there are just, like, 80 examples
00:34:15.440 | in this across, like, a bunch of categories. But essentially, what, the way it works is
00:34:20.320 | that the first turn or the first prompt from the benchmark is prompted to the model. Then
00:34:27.360 | GPT-4 is asked to score on a scale of 1 to 10: how good is the model's response? And
00:34:33.560 | then, you know, it is followed up by another prompt, which is, like, you know, the multi-turn
00:34:38.000 | prompt, which is, like, related to the question, but it might not be related to the model's
00:34:42.560 | responses, because, you know, this is already constructed, and they always, like, follow
00:34:45.960 | up with the same follow-up prompt regardless of the response. And then, again, GPT-4 evaluates how good was
00:34:51.600 | the second turn of the response. So, this is, like, the consolidated leaderboard from
00:34:57.600 | LMSIS, showing both the arena ELO rating, as well as empty bench scores. So, these are
00:35:03.400 | scores that are aggregated across all the 80 examples, and this is GPD score scoring
00:35:08.880 | from, like, 1 to 10, essentially. Cool. So, I think the second step that we wanted to,
00:35:16.040 | like, look at in our evaluating a chatbot chart was, like, you know, think about how
00:35:20.800 | do you evaluate a reward model. So, when you have these human preference data set collected,
00:35:25.840 | and you train this reward model, which is essentially a classifier, to discriminate
00:35:29.680 | between, like, you know, truthful and untruthful response, or, like, you know, can it rank
00:35:33.840 | helpful response higher than the less helpful responses? And, you know, there's literally
00:35:39.280 | no open source data leaderboard available for evaluating these, like, preference model
00:35:44.840 | or the reward models. But internally at Hugging Face, we have our own data set for evaluating,
00:35:50.660 | so that we know that as we are adding more human preference data, our models are actually
00:35:55.520 | getting better. So, this is essentially we are evaluating on these open source data sets,
00:36:02.200 | which is the Anthropic Helpful data set, the Open Assistant data set, the Stanford Human
00:36:07.520 | Preference data set, and also the Learning to Summarize data sets from the very first
00:36:11.960 | paper from OpenAI, which was looking at Learning to Summarize. And so, this is, like, you know,
00:36:18.600 | basically seeing that, you know, how good is our reward model. And then, finally, the
00:36:23.280 | third type of evaluation is red teaming. And so, in this case, you want to craft a prompt
00:36:28.720 | in a way that could surface model vulnerabilities and emerging capabilities. And for example,
00:36:34.720 | if you're asking, like, "how do I plan a prank robbery," is the model, actually, like, you know,
00:36:39.120 | helping you with that and trying to elicit undesired behavior from the model. And unfortunately,
00:36:44.720 | actually, there's no open source leaderboard available for this thing. There's just one
00:36:49.280 | data set from Anthropic, which has all the three included; it actually has
00:36:53.840 | both helpfulness and harmlessness. It's the HH data set from Anthropic. And that's the
00:36:58.720 | only open source data set available for red teaming. But there's no leaderboard available
00:37:04.080 | for red teaming. And so, this was, like, a blog that I wrote earlier in the year, saying,
00:37:08.080 | like, you know, highlighting this gap and saying that, you know, putting out an announcement
00:37:12.320 | saying, like, we should get together and build a data set for red teaming. And if you had
00:37:16.160 | heard of, like, the DEF CON red teaming challenge, you know, basically crowdsourcing
00:37:20.900 | some of this red teaming work, that kind of came out of that. Okay. So, now I'm going to get
00:37:26.600 | into now that we have discussed evaluation and benchmarks and leaderboards, I'm going
00:37:30.600 | to talk about results and what they looked like on some of these benchmarks.
00:37:35.160 | So, here I'm showing the results for this Llama 2 13 billion on the Open LLM Leaderboard
00:37:42.840 | from Hugging Face. And in this case, I was using the data set that we collected from
00:37:48.120 | Surge, that was the 10,000 instruction demonstration data. And here on this, you know, these
00:37:53.400 | are basically the four data sets, which are, like, NLP focused data sets that we have as
00:37:58.480 | part of the Open LLM Leaderboard, which are the ARC Challenge, the Hendrycks MMLU, HellaSwag, and
00:38:03.200 | TruthfulQA. And here, like, you know, this is how well our model does. And all of this
00:38:09.760 | is essentially accuracy. And this is the LIMA paper or the LIMA model, which is Less Is
00:38:14.300 | More for Alignment, that came from Meta. And they just used 1,000 examples of high quality
00:38:19.240 | instructions and showed that you can get a very good chatbot by just using 1,000 examples.
00:38:24.760 | And this is, like, you know, taking the longest examples from Open Assistant and just choosing
00:38:28.480 | the top 500 of them. And so, we found that our model does slightly better than, you know,
00:38:34.320 | both of, like, LIMA and Open Assistant, except on TruthfulQA, where we found
00:38:39.760 | that LIMA and Open Assistant did better than us. And similarly, like, actually, like,
00:38:45.760 | in MT-Bench, we found, like, you know, the opposite was true. So, this is, like,
00:38:49.680 | you know, MT-Bench is, remember, the one from LMSYS that had, like, you know, turn zero and turn one.
00:38:54.360 | And so, this is reporting the first response. This is, like, GPT-4 essentially scoring,
00:38:58.720 | on a score of 1 to 10, how good these models are on the first dialogue turn and the second
00:39:04.880 | dialogue turn and the average score. And so, actually, this is kind of more counterintuitive
00:39:10.960 | to what we found on the automatic evals, is that actually MT-Bench says that, you
00:39:15.600 | know, the model trained on the data that we collected from Surge
00:39:20.160 | is not very good. And in fact, LIMA and Open Assistant, which are, like, a fraction of
00:39:24.520 | the size of the data we had are much better. So, this was kind of surprising. And then
00:39:31.800 | I looked into, like, you know, whether the length is a factor in this. And it
00:39:36.520 | does seem like it is. You know, I was looking at each of those data sets and then, you
00:39:40.240 | know, looked at the average length of the prompts in each of those. And it seems like
00:39:43.960 | there is a very wide range. For example, like, our data set, the average length was just
00:39:48.600 | 211 for these prompts, while LIMA is, like, double of that and Open Assistant is almost
00:39:53.520 | double of that. So, then I did this experiment where I wanted to check, like, if I controlled
00:40:02.760 | for the size of the data, but then, you know, let the length be varied, the prompt length,
00:40:07.960 | does that affect the performance? So, in particular, like, I think I highlighted this before is
00:40:12.520 | that our chat category was, like, really short. And so, we actually found that, you know,
00:40:17.600 | like, length did not really affect that much, except for this TruthfulQA data set. Even
00:40:23.880 | for this HellaSwag, even though it looks small, the difference is actually just in the third decimal place.
00:40:28.620 | And over here, you can see, like, the actual difference was only on TruthfulQA, which
00:40:32.140 | actually preferred models that were generating longer responses. But on the other hand, the
00:40:39.320 | MT-Bench score was, again, not intuitive, not aligning or correlated with what we found
00:40:44.080 | with these automatic metrics and evaluations, in the sense that GPT-4 actually did not prefer,
00:40:50.600 | like, longer responses. And so, this was, like, you know, a little bit counterintuitive.
00:40:55.640 | And so, need to, like, dig more into, like, what's going on over here. But, you know,
00:41:00.440 | it actually found that, you know, like, shorter responses were better than, you know, longer
00:41:05.220 | responses. Although there was, like, not much of a very much of a difference.
00:41:10.120 | So, the other experiment and evaluation we did is just varying the amount of data
00:41:15.840 | and seeing, like, if you incrementally add more data, how does that affect performance?
00:41:20.860 | And this is, again, on that open LLM leaderboard from Hugging Face, which is looking at some
00:41:26.620 | of these standard NLP benchmarks and reporting accuracy. And so, this is, like, starting
00:41:32.200 | with just 10% of all the data we collected from search. And as you can see, like, you
00:41:37.160 | know, in all these benchmarks, actually, like, it saturates very quickly. And in some of
00:41:41.720 | them, you actually get, like, you know, you basically lose performance if you keep adding
00:41:46.120 | data. And so, this is kind of aligning with when I started, when we started collecting
00:41:50.520 | data, we had this diminishing return plot, wherein it said that if you have just very
00:41:54.920 | few thousand examples of very high quality instruction following data set, that's good
00:42:00.040 | enough. And then your performance saturates or plateaus very quickly after that. And so,
00:42:06.040 | that is kind of what we got as well.
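The ablation itself is just fine-tuning on growing fractions of the SFT data and evaluating each checkpoint; a minimal sketch with `finetune` and `evaluate` as placeholders for the training run and the benchmark harness:

```python
import random

def finetune(examples):          # placeholder for an SFT training run
    return {"n_train": len(examples)}

def evaluate(model) -> float:    # placeholder for a benchmark score
    return 0.0

sft_data = [f"example-{i}" for i in range(10_000)]
random.Random(0).shuffle(sft_data)

for fraction in (0.1, 0.25, 0.5, 1.0):
    subset = sft_data[: int(fraction * len(sft_data))]
    print(fraction, evaluate(finetune(subset)))
```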
00:42:09.160 | Similarly, I think this is one place where MT-Bench actually correlated with
00:42:16.440 | the automated metrics is that GPT-4 also, like, you know, showed that, you know, after,
00:42:22.880 | like, about 4,000 examples, there was basically barely any gain in performance, actually,
00:42:28.800 | decreasing performance with the model.
00:42:33.040 | Okay, great. So, that was all the results on using, like, these human curated very high
00:42:40.440 | quality data set. What about, like, results from distillation from these synthetic data
00:42:46.000 | sets? In particular, we use UltraChat for supervised fine tuning and UltraFeedback for
00:42:51.560 | DPO. And so, these are the results. So, this is, like, basically just work that was released
00:42:57.520 | last week. We haven't yet released the code and the data set, which we are going to do
00:43:01.960 | this week. And so, here I'm highlighting that Zephyr is the model we released. We built,
00:43:06.560 | we used Mistral as the foundation model, and then fine tuned it using UltraChat and then
00:43:12.800 | did DPO on UltraFeedback. And as you can see, it actually beats ChatGPT on this Alpaca
00:43:18.400 | eval leaderboard.
00:43:19.400 | Also, it is, like, among the best of the open models; at least it's, like, it beats most
00:43:30.920 | of the 13 billion parameter models. And it's, like, quite competitive to Claude 2, again,
00:43:38.920 | on the Alpaca eval leaderboard. So, this is the model which has both SFT and DPO. So,
00:43:46.880 | we did an ablation on how good or how useful is, like, you know, SFT and how useful is
00:43:52.720 | DPO, because there's this two-step process. It's, like, first you fine tune on instruction
00:43:56.960 | demonstration, then you fine tune on human preferences. And so, this is the first row
00:44:02.280 | over here is showing what if you directly did DPO on UltraFeedback and did not do the
00:44:07.640 | supervised fine tuning. And you actually saw that that's really bad. So, that doesn't work
00:44:11.720 | at all. And then the second one is saying that what if you just did supervised fine
00:44:16.600 | tuning and did not do DPO. And so, this, which is, like, the first step, actually
00:44:21.480 | works decently well. And it's, like, you know, basically getting you to, like, 80 or 90%
00:44:25.840 | of the overall performance. And finally, this is doing, like, supervised fine tuning on
00:44:31.080 | the human preference data. So, you take this row and do another round of supervised fine
00:44:35.440 | tuning, but on this data of human preferences. So, you remember you had, like, the chosen
00:44:39.800 | and the rejected. So, you give all the dialogue history, and then the expected completion
00:44:44.800 | is the chosen dialogue response. So, in this case, you're not really doing that discriminative
00:44:49.000 | thing. You're still doing the SFT process, but you're just, you know, like, using that
00:44:53.240 | using the data set in a smart way so that it follows the template of what supervised
00:44:58.160 | fine tuning does. And then that, as well, we found that, you know, wasn't very helpful.
00:45:03.000 | So, the best recipe, obviously, is DPO plus SFT. So, you know, doing SFT first on the
00:45:09.280 | UltraChat, and then DPO on the UltraFeedback. Both of these data sets are synthetic. And
00:45:14.880 | then, you know, it's, like, only slightly better than just doing SFT.
00:45:20.040 | Okay. So, I'm getting to this final section of my talk, which is essentially looking at,
00:45:27.120 | you know, so, we have seen a lot of these evaluation and benchmarks and leaderboards,
00:45:31.760 | and many of them are starting to adopt these powerful models, like Claude 2 and GPT-4, and
00:45:37.120 | are using them as a proxy for humans in evaluation. And so, what are the quirks associated with
00:45:41.880 | doing that, and are there things that we should, like, be, like, you know, considering when
00:45:45.720 | we are doing this at a very large scale? So, when we did that, when we used GPT-4 as
00:45:51.920 | an evaluator, we found that it actually has a positional bias. And so, in particular,
00:45:57.440 | it is predisposed to generating a rating of 1 in a preference collection setting. And
00:46:03.080 | so, like, you know, this chart over here shows, like, the average rating for model responses
00:46:09.040 | across, like, the entire data set. And on the right, on the other hand, humans are more
00:46:13.440 | less uniform. And so, you expect that, you know, this distribution seems much
00:46:17.960 | better than this distribution, which is skewed to the right.
00:46:22.440 | So, then what we did is that we prompted GPT-4 to say that, hey, you have this left bias,
00:46:28.560 | and you always generate this rating of 1, you know, be aware of this bias, and then
00:46:33.160 | you tell it to debias itself, it actually flips the bias in the opposite direction.
00:46:38.000 | So, then it starts, like, it is more self-aware in the sense that it knows that, you know,
00:46:42.400 | it has this bias, and now it starts generating more ratings of 5 and 6. And the one way of
00:46:47.440 | getting rid of this is that we kind of make sure that each response is equally likely
00:46:51.720 | to be in right and left position. So, that kind of dilutes, like, this bias that it has
00:46:56.320 | to each of these positions.
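One simple way to implement that idea is to query the judge twice with the two responses in both orders and only keep verdicts that survive the swap; a sketch with `judge_prefers_first` as a placeholder for the GPT-4 judging call:

```python
def judge_prefers_first(prompt: str, first: str, second: str) -> bool:
    # Placeholder: would prompt the judge model and parse which response it prefers.
    return True

def debiased_preference(prompt: str, resp_a: str, resp_b: str) -> str:
    a_wins_when_first = judge_prefers_first(prompt, resp_a, resp_b)  # A shown first
    b_wins_when_first = judge_prefers_first(prompt, resp_b, resp_a)  # B shown first
    if a_wins_when_first and not b_wins_when_first:
        return "A"
    if b_wins_when_first and not a_wins_when_first:
        return "B"
    return "tie"   # the verdict flipped with position, so treat it as a positional artifact
```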
00:47:00.440 | And then, you know, we found that actually, like, prompting GPT-4 to generate scores, so
00:47:05.080 | asking it to score, like, each response individually, like, MT-Bench does, instead of
00:47:10.040 | ranking in a pairwise setting, we actually found that that alleviates the problem a little
00:47:14.200 | bit, but does not completely get rid of the problem.
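A sketch of that single-answer scoring setup (the judge prompt wording here is made up, not the actual MT-Bench template):

```python
JUDGE_PROMPT = """\
You are an impartial judge. Rate the assistant's response to the user's question
on a scale of 1 to 10 for helpfulness, relevance, and accuracy.
Reply with only the number.

Question: {question}
Response: {response}
"""

def score_response(question: str, response: str) -> int:
    prompt = JUDGE_PROMPT.format(question=question, response=response)
    raw = "7"   # placeholder: send `prompt` to the judge model (e.g. GPT-4) and parse the reply
    return int(raw)
```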
00:47:18.080 | We also found evidence of doping between training and evaluation. So, in particular, we found
00:47:24.840 | that GPT-4 prefers models that were trained on GPT-4's data. So, all these models
00:47:31.200 | here were trained on data that was bootstrapped using GPT-4. And, you know, so it prefers that
00:47:37.640 | over human-written data, which is, like, more factual, much higher quality, but might be very
00:47:43.080 | succinct and to the point. So, this is one thing that, you know, we should be aware of
00:47:47.840 | when we are using GPT-4 as an evaluator.
00:47:51.240 | The other thing is that, you know, it also, like, concurs with findings from these other
00:47:55.040 | papers, which is that GPT-4 prefers models with higher diversity. So, that is the number
00:48:00.580 | of unique tokens in the response and the longer responses. So, if you have, like, this list
00:48:05.480 | of lists kind of response, just like ChatGPT does, GPT-4 is, like, predisposed to rating
00:48:11.440 | that higher compared to a model that does not generate that.
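As a rough illustration of the surface features being rewarded here (my own sketch, not the analysis code), the two signals are simply response length and the number of unique tokens:

```python
def surface_features(response: str) -> dict:
    # Crude whitespace tokenization, just to illustrate the two signals.
    tokens = response.split()
    return {
        "length": len(tokens),              # longer responses tend to be rated higher
        "unique_tokens": len(set(tokens)),  # a proxy for lexical diversity
    }

print(surface_features("Here are three tips:\n1. Plan ahead.\n2. Test often.\n3. Iterate."))
```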
00:48:14.720 | We also found that GPT-4 has poor correlation with humans on low-entropy tasks, such as
00:48:23.000 | math, coding, and reasoning. Remember that leaderboard I showed you where we had compared
00:48:27.240 | how the GPT-4 ELO ratings compare to humans? We then dove deeper into
00:48:33.120 | how that comparison looks on each of the different task distributions and categories, and
00:48:37.400 | this is what it looks like. It has lower correlation with humans
00:48:42.640 | on the more factual tasks, the kind that expect one correct answer,
00:48:48.160 | and it is actually highly correlated with humans on the more high-entropy tasks like
00:48:53.400 | brainstorming and creative generation, which was kind of counterintuitive,
00:48:58.200 | because you could have so many different ways of coming up with a recipe
00:49:03.280 | or a list of something. But that's where the ratings of GPT-4 and humans are more
00:49:08.160 | correlated.
00:49:10.120 | Okay. So, the final thing is takeaways. There are a bunch of these, but let's try to
00:49:17.680 | break them down. Essentially, we discussed how we came up with the
00:49:21.520 | steps for data curation for supervised fine-tuning and RLHF. It involves several
00:49:27.160 | critical factors, such as how much data you need to collect, the length of
00:49:31.480 | the prompts and the distribution of those lengths, the task distribution, and
00:49:36.880 | the role of humans: do you need synthetic data, completely
00:49:40.400 | manually curated data, or something in the middle? We also looked at the many tools
00:49:44.880 | for efficient fine-tuning of open-source LLMs. From the SFT results, we found that
00:49:51.160 | TruthfulQA was the main differentiating benchmark among the automated eval metrics, and
00:49:57.680 | we found that MT-Bench scores were actually not correlated with these automated metrics;
00:50:02.680 | only on some of these models did we find that
00:50:08.360 | they were correlated. For the distillation results, which are from Zephyr 7B, where
00:50:13.560 | we are fine-tuning on synthetic data, we found that SFT on AI-generated data
00:50:19.520 | followed by distilled DPO on AI feedback data actually beats ChatGPT, even though
00:50:25.680 | the model has just 7 billion parameters. We also found a benchmarking
00:50:30.600 | gap in assessing RLHF models in particular: we don't have benchmarks for assessing
00:50:37.080 | reward models, and we also don't have open-source benchmarks for evaluating red teaming
00:50:42.320 | and model vulnerabilities. Then finally, we dove deeper into
00:50:47.480 | the quirks of using GPT-4 or other powerful LLMs as an evaluator.
00:50:53.560 | Some of these quirks are that it prefers models trained on GPT-4-generated data,
00:50:57.980 | it has a left positional bias, and it has higher correlation with humans on
00:51:02.960 | creative tasks compared to coding or reasoning tasks. My work has been
00:51:09.640 | featured in a New York Times cover article, which talks about the secret ingredient
00:51:15.200 | behind ChatGPT, which is alignment. I'm also part of the United Nations Advisory
00:51:20.000 | Board that was announced last week, so I'm really humbled to be part of that. Here are some
00:51:24.760 | blog posts. We did not publish a whole
00:51:30.440 | lot this year, but we wrote a bunch of blog posts highlighting what we are releasing and
00:51:35.080 | working on, and some of these cover parts of the talk I just
00:51:39.960 | gave. And this is the H4 team. I'm grateful to be part of it. And
00:51:47.440 | thanks for listening.
00:51:52.560 | When you generate alternative responses from the models, do you select really high temperatures,
00:52:14.440 | or do you keep it pretty close to the temperature that's also used in the final product?
00:52:19.680 | Yeah. So, we tried experimenting
00:52:25.400 | with different temperatures, but we actually found that just using different sampling strategies
00:52:31.660 | worked better: using different values of top-p and top-k and combinations
00:52:36.600 | of those, as opposed to just relying on temperature.
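As a hedged sketch of that kind of sampling sweep (the model name and parameter values here are just examples, not the exact settings used in the talk), with Hugging Face transformers it looks roughly like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example checkpoint; any instruction-tuned causal LM would do for this sketch.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

inputs = tokenizer("Write a short poem about transformers.", return_tensors="pt")

# Instead of only sweeping temperature, vary the nucleus (top-p) and top-k cutoffs;
# combinations of these gave more useful response diversity than temperature alone.
for top_p, top_k in [(0.9, 0), (0.95, 50), (1.0, 40)]:  # top_k=0 disables top-k filtering
    out = model.generate(
        **inputs,
        do_sample=True,
        top_p=top_p,
        top_k=top_k,
        temperature=0.7,
        max_new_tokens=128,
    )
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```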
00:53:04.760 | Yeah. So, I think for red teaming at scale, there's actually a paper that came out recently
00:53:20.760 | called GPTFuzzer that bootstraps and uses these powerful LLMs
00:53:27.280 | to jailbreak other LLMs. And there was also a DeepMind paper,
00:53:30.840 | about one and a half to two years ago, on red teaming large language models with
00:53:35.600 | large language models: how do you red team and evaluate a language model by
00:53:39.840 | using another powerful language model? I think that is kind of the way to go
00:53:44.440 | in terms of scale. And so, what was the second question?
00:53:51.240 | Yeah. So, I think one thing is this idea of emergent capabilities, which is essentially
00:54:09.480 | that as we
00:54:13.280 | are scaling up, there are things that these models do, or capabilities
00:54:17.720 | that emerge, that were not there in the smaller models. Examples are chain-of-thought
00:54:21.780 | reasoning, which GPT-2 or the original GPT was not capable of,
00:54:27.880 | and few-shot prompting, which we first saw in GPT-3: you
00:54:32.880 | could give it a completely new task and not update its parameters in any way, but just
00:54:37.760 | put the task as part of the prompt, and it learns the task and
00:54:41.960 | can do it on any number of examples, right? And so things like labeling
00:54:47.200 | started coming up, like using GPT-3 as a labeler, once we discovered
00:54:52.480 | that capability. Another example is manipulation.
00:54:56.840 | I don't think any open-source models are capable of that yet, but I know Anthropic and
00:55:02.560 | OpenAI are focusing on deception and manipulation, because
00:55:07.320 | when you start chatting with these models, you start treating
00:55:12.440 | them as a companion, especially if you have a Character AI kind of setting
00:55:16.800 | where you might start confiding in them, sharing information
00:55:21.160 | that you probably shouldn't, and then they can maybe use it against you.
00:55:26.000 | An example of that is that recently we saw GPT-4 actually manipulated someone
00:55:31.060 | into reading a CAPTCHA for it and telling it what the CAPTCHA says.
00:55:35.440 | So, that's a really concrete example of manipulation, and it seems like now these
00:55:41.480 | models are capable of that. I don't think open-source models are there yet, but these
00:55:46.760 | are the kinds of vulnerabilities that would
00:55:50.840 | surface when you do this at scale.
00:55:55.840 | Yeah. So, I would say it's less about
00:56:25.720 | open-sourcing a data set that is crafted to elicit this behavior, and more about
00:56:32.160 | the kinds of harms that we should be thinking about:
00:56:36.920 | hallucination, plagiarism, manipulation, trying to leak PII,
00:56:42.720 | people's credit cards, SSNs, things like that. It's more about thinking about these
00:56:46.560 | different dimensions and giving concrete examples of how these models can exhibit
00:56:52.280 | this behavior. But I think what you are getting at is, what
00:56:57.120 | if we gave them concrete ways, concrete prompts for how to jailbreak, and then people
00:57:01.320 | can go and try to do that. The first thing is that, while we are doing this,
00:57:05.600 | we would have evaluated our own models, and we would then start thinking about guardrails
00:57:09.160 | and safety ourselves. And if, indeed, the data set is so good that we
00:57:13.600 | can say that a lot of these powerful models are failing on it, then obviously you don't
00:57:17.160 | open source it instantly; you actually think about the best way to put it
00:57:21.240 | out there by first securing the model and making sure that it
00:57:26.040 | does not elicit that kind of behavior, and then sharing it once you have already
00:57:31.200 | crossed that bridge and can say, yeah, my model is safeguarded
00:57:34.760 | against that. So it's more of a process, a gradient of things that you need to do.
00:57:41.240 | Yeah, so you're asking whether, when you're using synthetic data bootstrapped
00:58:11.160 | from other language models, we have seen some kind of mode
00:58:15.800 | collapse or something like that? So, actually, so far, it's been clear that these
00:58:22.280 | data sets are good: they actually turn
00:58:26.760 | regular language models into chatbots that are as good as the experience you
00:58:30.600 | get by chatting with ChatGPT. Although, the kind of quirks
00:58:35.000 | that I raised apply: when you have these models and you
00:58:38.320 | put them on a benchmark and see that suddenly it's at, like, 90%, it might
00:58:42.320 | just be because you used the model that is also the evaluator to generate the data and then
00:58:46.640 | trained this model on it, which in turn is this doping thing, right. So that is one thing
00:58:51.160 | that is important to think about. The other thing is, what was I gonna say?
00:58:58.960 | I forgot. Yeah, the other thing is about the licensing part, which is not quite related
00:59:17.880 | to what you were asking, but essentially
00:59:22.040 | we cannot open source these models for commercial use, so it's
00:59:26.440 | still a restrictive license, and you cannot use them for building and selling
00:59:31.640 | applications down the line. But they are still good as research artifacts.
00:59:37.360 | I think we would have seen these kinds of collapses happen if it were allowed
00:59:42.840 | to use these commercially. Actually,
00:59:46.520 | recently we did see something related: there's this company called Daxter, which
00:59:50.880 | was using GPT-4 for summarization, and they replaced it with the open-source model
00:59:55.220 | Mistral. They said their customers haven't complained, they're
01:00:00.200 | saving a ton of money, it just seems to work fine, and it's
01:00:04.720 | just as good. Not that I'm saying Mistral is trained on any of this
01:00:09.840 | synthetic data, but it's just an example of things that become very clear by
01:00:14.400 | doing this sort of A/B testing, where you replace one model with another and see
01:00:18.720 | how that affects things.
01:00:26.060 | I have a question on zoom.
01:00:31.440 | It seems like another axis you might beat ChatGPT on is cost. So I wondered
01:00:40.740 | what your total budget or total cost was to produce your model
01:00:45.360 | that beat them.
01:00:46.360 | Oh, so Zephyr 7B was just four hours of training on 16 A100s. So that's less than $50,
01:00:55.800 | I guess, because we used synthetic data sets that were already open source, which are
01:01:02.360 | UltraChat and UltraFeedback. But what about
01:01:07.520 | the overall cost, all the people and everything? Yeah.
01:01:11.760 | I see. Okay. So in terms of all the people and everything, I guess
01:01:16.720 | UltraChat and UltraFeedback might have reported some cost, but they are mostly
01:01:22.000 | synthetically created with very little human intervention, so
01:01:28.960 | I don't know if they report that; I haven't looked into it. But I would say it was still
01:01:33.100 | much more cost-efficient than what we spent on buying data from Surge and Scale AI.
01:01:38.480 | We spent about half a million buying about 20,000 prompts of human preferences, the 20,000
01:01:45.160 | dialogues, and about 10,000 instruction demonstrations. So that was quite a bit.
01:01:55.200 | What I'm curious about is the scale that you used for evaluating the bias for GPT-4, which
01:02:10.360 | I saw on the slide. Yeah. Oh, so yeah, this was the Anthropic scale. Remember
01:02:25.600 | that 1 to 4 is decreasingly A and 5 to 8 is increasingly B. Yeah.
01:02:30.560 | So a rating of 1 gives the most preference to the first model? Yes, exactly. Yeah. And for these types of evaluations,
01:02:40.440 | how sensitive to the prompt do you find the evaluators? If you're prompting it saying
01:02:46.600 | that it has to account for this left bias and the right bias, what's stopping you from
01:02:54.000 | saying the distribution should be uniform, or the distribution should be normal, and just
01:02:59.160 | iterating to see what those should be? Yeah. Yeah. I think that's a good
01:03:04.840 | point, in the sense that we did not study which particular tasks or prompts
01:03:09.960 | were prompting GPT-4 to generate this kind of bias. Although I would
01:03:16.120 | say that this was also observed by LMSYS and it's part of their findings as
01:03:22.040 | well, so the LMSYS paper also has that. But it would be interesting; it
01:03:30.320 | would be surprising if it generates this on very long prompts or prompts from,
01:03:35.320 | like, math or something, which are just hard to evaluate when the
01:03:39.640 | responses are long. At least as a human, when I see a bunch of code
01:03:44.520 | on this side and on that side, and both of them are
01:03:47.480 | trying to do the same thing but with very different approaches, it's very hard to evaluate them.
01:03:52.520 | Right. And so, yeah, we haven't looked into that.
01:03:56.360 | Perhaps another thing is, do you think the order matters, like which output you give to GPT-4
01:04:09.440 | first?
01:04:10.440 | Yeah, I mean, that was basically the takeaway. It's interesting because
01:04:15.760 | humans usually have a recency bias, which means the last thing that you read is
01:04:20.080 | the thing that you remember, and so you're just inclined to choose
01:04:23.880 | that more. But GPT-4 actually had a left
01:04:27.720 | bias, toward the thing that it saw first in some sense, and I think LMSYS
01:04:32.760 | was the one that proposed that because it has this left-to-right training, maybe that's why it
01:04:37.080 | has that kind of a bias. So the way we alleviated that was by
01:04:44.120 | having every model's output be equally likely to be on the left and the right-hand side.
01:04:48.720 | So if we're comparing Alpaca and Vicuna, then instead of just putting Alpaca on the left
01:04:53.240 | and Vicuna on the right, we would randomly switch them, so both of them are equally
01:04:57.120 | likely to occur in both positions.
01:05:01.960 | And you still saw the left bias?
01:05:05.280 | If you just ask it to rate on that scale, yes. But if you say,
01:05:10.280 | hey, you have this bias, and try to make it aware of it, then it flips and it
01:05:14.520 | generates ratings at the other end. So yeah.
01:05:40.120 | Are there other approaches where you prompt the model by shuffling the prompts
01:05:46.160 | and then de-bias the results of it?
01:05:55.600 | By shuffling the prompts, you mean like-
01:05:57.080 | Shuffling the order of how you put in the responses?
01:06:05.760 | Yeah. So that's what we did: we would randomly shuffle
01:06:11.240 | the left and the right. Basically, you create
01:06:17.080 | N-choose-2 combinations, where N is the number of models. Suppose you want to evaluate three models on 10 prompts;
01:06:23.200 | that gives you 3-choose-2 model pairs.
01:06:29.040 | Then you build the total data
01:06:33.080 | set: you generate 10 responses from each of these models and
01:06:37.080 | put them together in this 3-choose-2 setting, so you get every combination
01:06:43.360 | of pairs. And then you make sure that the models on the left
01:06:48.040 | are equally likely to also occur on the right. So if you are doing model one and then
01:06:52.680 | model two, then you also make sure you do model two and then model one, rated on a scale of one to 10, as in the sketch below.
01:07:21.520 | Okay. Sure. Sorry. Should I keep the Zoom on? Thank you. Yes.
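A minimal sketch of that setup (illustrative, not the actual evaluation harness): enumerate every ordered pair of models per prompt so each model appears on the left and on the right equally often.

```python
from itertools import permutations

models = ["model_1", "model_2", "model_3"]            # N models -> N-choose-2 pairs
prompts = [f"prompt_{i}" for i in range(10)]

# Placeholder generations; in practice these come from each model.
responses = {m: {p: f"<{m} answer to {p}>" for p in prompts} for m in models}

comparisons = []
for prompt in prompts:
    # permutations(..., 2) yields both (a, b) and (b, a) for every pair,
    # so every model is shown in the left and the right position equally often.
    for left_model, right_model in permutations(models, 2):
        comparisons.append({
            "prompt": prompt,
            "left": responses[left_model][prompt],
            "right": responses[right_model][prompt],
            "left_model": left_model,
            "right_model": right_model,
        })
```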
01:07:49.360 | So I mean, just to see if I understand this correctly. On the reinforcement learning
01:07:58.120 | side, first you build a reward model: I take text input
01:08:02.840 | that humans have scored, and as a supervised problem we try to predict
01:08:06.600 | the score from the text. Then, with that reward model, I do reinforcement
01:08:13.600 | learning: I take a sequence of tokens, and at some point the next token is the end-of-
01:08:18.000 | sequence token, and I pump that through the reward model and then optimize on this reward.
01:08:22.000 | Yes. And that's how, it's very sparse rewards, right? I only have a reward at the very end.
01:08:26.400 | But that's how it works. Yes, exactly. And it's very sample-inefficient
01:08:31.240 | because I keep doing this again and again, and that's why you need a hundred thousand
01:08:35.600 | examples for doing RLHF, but only around 10,000 for supervised fine-tuning. That's kind of the intuition. Okay, great.
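To make that sparse-reward picture concrete, here is a hedged sketch (names and signatures are illustrative, not any particular library's API): the reward model scores the finished response once, and that single scalar sits at the end-of-sequence position for the RL algorithm to optimize against.

```python
import torch

def rollout_with_sparse_reward(policy, reward_model, tokenizer, prompt, max_new_tokens=256):
    # Sample a full response from the current policy.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = policy.generate(prompt_ids, do_sample=True, max_new_tokens=max_new_tokens)

    # Per-token rewards are zero everywhere except the final position.
    num_response_tokens = full_ids.shape[-1] - prompt_ids.shape[-1]
    rewards = torch.zeros(num_response_tokens)

    # One scalar score for the whole prompt + response, placed at the end.
    text = tokenizer.decode(full_ids[0], skip_special_tokens=True)
    rewards[-1] = reward_model(text)  # assumed to return a single preference score

    return full_ids, rewards  # the RL step (e.g. PPO) optimizes against `rewards`
```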
01:08:41.240 | Thanks so much. Very interesting talk. Thank you.
01:08:43.120 | Thank you very much.