The Origin and Future of RLHF: the secret ingredient for ChatGPT - with Nathan Lambert
Chapters
0:00 Introductions and background on the lecture origins
5:17 History of RL and its applications
10:09 Intellectual history of RLHF
13:47 RLHF for decision-making and pre-deep RL vs deep RL
20:19 Initial papers and intuitions around RLHF
27:57 The three phases of RLHF
31:09 Overfitting issues
34:47 How preferences get defined
40:35 Ballpark on LLaMA2 costs
42:50 Synthetic data for training
47:25 Technical deep dive in the RLHF process
54:34 Rejection sampling / best-of-n sampling
57:49 Constitutional AI
64:13 DPO
68:54 What's the Allen Institute for AI?
73:43 Benchmarks and models comparisons
This is Alessio, partner and CTO-in-Residence at Decibel Partners, 00:00:06.760 |
And I'm joined by my co-host, swyx, founder of Smol AI. 00:00:10.040 |
Hey, and today we have Dr. Nathan Lambert in the house. 00:00:18.840 |
seems like you've lived there most of the time 00:00:22.760 |
You worked on robotics and model-based reinforcement 00:00:31.440 |
You bootstrapped the RLHF team at Hugging Face, 00:00:42.100 |
that maybe is not super obvious about you on your LinkedIn? 00:00:46.000 |
I stay sane doing various insane sport and ultra-endurance sport 00:00:56.640 |
Long-distance trail running or gravel biking. 00:01:01.240 |
Try to unplug sometimes, although it's harder these days. 00:01:05.680 |
Well, the Bay Area is just really good for that stuff, 00:01:11.200 |
I have a trailhead, like, 1.2 miles from my house, 00:01:14.080 |
which is pretty unmatchable in any other urban area. 00:01:21.920 |
You also have an incredible blog, Interconnects, 00:01:34.500 |
feel like I've finally started to write things that 00:01:42.360 |
if people read the earlier blogs, they're like, yikes. 00:01:49.000 |
and we just kind of riff on what's actually happening on AI 00:01:52.920 |
and not really do news recaps, but just what it all means 00:01:57.040 |
and have a more critical perspective on the things that 00:02:00.720 |
really are kind of funny but still very serious happening 00:02:08.120 |
what would you highlight as your greatest hits 00:02:17.840 |
So the first real breakout piece was in April 00:02:23.080 |
is like we're all feeling stressed that we're 00:02:26.120 |
going to get scooped and that we're overworked, which 00:02:28.320 |
is like behind the curtain what it feels like to work in AI. 00:02:32.040 |
And then a similar one, which we might touch on later in this, 00:02:36.280 |
wasn't the first time I wrote a job search post. 00:02:44.840 |
that it's very on brand, and it's very helpful. 00:02:47.640 |
Because I understand that until you've done it, 00:02:53.360 |
And then other popular ones are various model training 00:03:00.960 |
is-- this stuff is all just like when I figure it out 00:03:04.200 |
So I wrote an article that's like how RLHF actually works, 00:03:07.520 |
which is just the intuitions I had put together 00:03:15.780 |
which you hate that you have to do it, but it is pretty funny. 00:03:19.760 |
I found that it's like, from a literature perspective, 00:03:25.040 |
that is very related to mathematical reasoning. 00:03:28.040 |
So it's like, oh, you just poke a little around what 00:03:30.200 |
they've already published, and it seems pretty reasonable. 00:03:36.160 |
on one of their benchmarks, and then everyone 00:03:48.660 |
And I think you expressed some desire to re-record it. 00:03:51.340 |
And that's why I reached out on Twitter saying, 00:03:54.760 |
And then we can ask questions and talk about it. 00:03:58.460 |
I think it's-- I try to do it every six or 12 months 00:04:00.980 |
is my estimated cadence, just to refine the ways 00:04:05.860 |
And people will see that we don't know that much more, 00:04:08.860 |
but we have a bit better way of saying what we don't know. 00:04:22.980 |
we're going to have the slides on our show notes, 00:04:25.260 |
and then we're going to have a YouTube version. 00:04:32.940 |
So I think to start skipping a lot of the, like, 00:04:41.860 |
is a great kind of tidbit on RLHF becoming a real deal. 00:04:46.420 |
There was some uncertainty earlier in the year 00:04:48.520 |
about whether or not RLHF was really going to be important. 00:04:51.140 |
I think it was not that surprising that it is. 00:05:07.260 |
reinforcement learning, known for its instability, 00:05:09.340 |
seemed a somewhat shadowy field for those in the NLP research community. 00:05:13.060 |
However, reinforcement learning proved highly effective, 00:05:15.460 |
particularly given its cost and time effectiveness." 00:05:19.020 |
So you don't really know exactly what the costs and time 00:05:22.560 |
Because they have a huge team and a pretty good amount 00:05:27.100 |
But like, this is just the kind of thing that we're seeing now. 00:05:31.100 |
I think any major company that wasn't doing RLHF 00:05:33.500 |
is now realizing they have to have a team around this. 00:05:39.420 |
of that in the open and research communities at the same scale. 00:05:48.820 |
And the other thing on the slide is some of Anthropic's work. 00:05:51.820 |
But everyone knows Anthropic is kind of the masters of this. 00:06:05.900 |
So you come from a robotics background, which RL used to be, 00:06:11.300 |
And then now you're seeing a lot of LLM plus RL. 00:06:16.940 |
You have MPU, which we had on the podcast when they started 00:06:27.500 |
Like maybe how the pendulum will keep swinging? 00:06:33.700 |
of viewing the world through trial and error learning 00:06:37.480 |
that's focused on thinking about decision making and inputs 00:06:47.220 |
whether it's physics, electrical engineering, 00:06:53.900 |
I do think it's a much more diverse background of people. 00:06:56.900 |
Like my background was in electrical engineering 00:07:04.300 |
I think that reinforcement learning, as it was back then, 00:07:09.620 |
so to say, is really different, because you're 00:07:12.820 |
looking at these toy problems, and the numbers 00:07:16.260 |
And everyone went kind of 0 to 1 at scaling these things up. 00:07:20.380 |
But people like Jim Fan and other people that were-- 00:07:23.300 |
you saw this transition in the decision transformer and papers 00:07:26.700 |
and when people are trying to use transformers to do decision 00:07:31.780 |
and I think that was kind of like the early days. 00:07:34.380 |
But then once language models were so proven, 00:07:37.140 |
it's like everyone is using this tool for their research. 00:07:40.300 |
I think in the long run, it will still settle out, 00:07:44.100 |
or RL will still be a field that people work on, 00:07:46.340 |
just because of these kind of fundamental things 00:08:11.380 |
would think of with RL, so actually running things 00:08:20.900 |
so that's why the name takes up so much space. 00:08:23.900 |
But it could have gone a lot of different ways. 00:08:28.380 |
We made it one slide before going on a tangent. 00:08:41.300 |
started because I've had this more diverse background 00:08:45.620 |
is trying to understand what the difference of a cost 00:08:48.180 |
function, or a reward function, and a preference function 00:08:51.060 |
would be, without going into all of the details. 00:08:54.420 |
Costs are normally things that control theorists 00:08:56.500 |
would work with in these kind of closed domains. 00:09:00.380 |
worked with rewards that's central to the formulation 00:09:03.300 |
And then the idea was like, OK, we now are at preferences. 00:09:07.740 |
kind of different assumptions that you're making. 00:09:10.900 |
And those assumptions are built on other fields of work. 00:09:24.060 |
on theories and philosophies spanning tons of human history. 00:09:29.940 |
I think we cite Aristotle in this paper, which is fun. 00:09:35.500 |
It's like 2,300 years old or something like that. 00:09:39.820 |
I think we kind of list some things in the paper 00:09:42.700 |
about summarizing what different presumptions of RLHF could be. 00:09:46.860 |
I think going through these is actually kind of funny. 00:09:50.740 |
It's fun to talk about these, because they're 00:09:55.180 |
see return throughout this podcast that we're 00:09:58.820 |
The core thing of RLHF, in order to be a believer in this, 00:10:05.820 |
you can optimize it in some way and get a different performance 00:10:10.380 |
And you could do this in really complex environments, which 00:10:13.460 |
I don't know how to do that in all the domains. 00:10:17.980 |
So it's kind of-- we'll overshadow everything. 00:10:19.900 |
And then there's go from something kind of obvious 00:10:22.540 |
And then you read the von Neumann-Morgenstern utility 00:10:27.220 |
theorem, which is essentially an economic theory that 00:10:33.140 |
of different people, which is a theoretical piece of work that 00:10:44.260 |
And if you look into this, all of these things, 00:10:50.940 |
So this is kind of like grabbing a few random things. 00:10:53.100 |
And then kind of similar to that is the Bradley-Terry model, 00:10:55.380 |
which is the fancy name for the pairwise preferences 00:11:01.500 |
Anthropic and OpenAI figured out that you can do, 00:11:05.420 |
from a bunch of different people and different sources. 00:11:11.460 |
And then you train a model that works somehow. 00:11:13.620 |
And we don't know-- there's a lot of complex links there. 00:11:16.940 |
But if you want to be a believer in doing this at scale, 00:11:21.220 |
have to accept as preconditions for doing RLHF. 00:11:26.780 |
You have a nice chart of the sort of intellectual history 00:11:32.260 |
either in your paper or in the YouTube video for this podcast. 00:11:35.740 |
But I like the other slide that you have on the presumptions 00:11:43.180 |
And I don't know, do you think that any one of them 00:11:49.620 |
This is the first time I've come across the VNM utility 00:11:53.300 |
This is what you get from working with people. 00:11:56.980 |
the retort is that he's a sociologist by training. 00:12:00.960 |
the philosophers are that found these different things, 00:12:09.740 |
that-- there's debate whether or not preferences exist at all. 00:12:20.420 |
on the math that thinks that you can actually 00:12:31.100 |
So like Jeremy Bentham, like hedonic calculus, 00:12:36.400 |
people assume that preferences can be measured. 00:12:40.540 |
Like, when you look at-- this is where I kind of go on a rant 00:12:43.420 |
and I say that in RLHF, calling things a preference model 00:12:47.220 |
Because there's no inductive bias of what a preference is. 00:12:50.240 |
It's like if you were to learn a robotic system, 00:12:53.740 |
like, hopefully, that actually mirrors the world 00:13:08.860 |
But even, like, if you look at Claude's constitution, 00:13:11.980 |
like, that doesn't mean the model believes these things. 00:13:14.660 |
It's just trained to prioritize these things. 00:13:20.980 |
and if it's actually, like, a repeatable process in the data 00:13:28.820 |
understand what this is and the link between preference 00:13:32.340 |
data and any notion of, like, writing down a specific value. 00:13:39.700 |
sociology work versus computer work already exists? 00:13:43.780 |
Or is it, like, a recent cross-contamination? 00:13:51.060 |
Because at AZ, they have so much overlap between systems 00:14:05.820 |
I think the reason why it's not really talked about 00:14:08.100 |
is just because the RLHF techniques that people use 00:14:11.420 |
were built in, like, labs like OpenAI and DeepMind, 00:14:20.700 |
when you compare them to, like, startups or normal startups. 00:14:23.200 |
But, like, they're not bringing in, like, academics 00:14:30.820 |
Like, the criticism of this paper that this is based on 00:14:33.380 |
is, like, oh, you're missing these things in RL or this 00:14:42.340 |
So it's really hard to include everyone in a principled manner 00:14:47.220 |
It's just a good way to understand and improve 00:14:53.100 |
and, like, what is a good reward model for society. 00:14:56.340 |
It really probably comes down to what an individual wants. 00:15:03.080 |
be a little bit better about the communication, which 00:15:05.300 |
is a recurring theme in my work, is, like, I just 00:15:07.660 |
get frustrated when people say things that don't really 00:15:10.500 |
make sense, especially when it's going to, like, manipulate 00:15:13.300 |
individuals' values or manipulate the general view of AI 00:15:18.000 |
So that's kind of why RLHF is so interesting. 00:15:25.420 |
in what it's actually doing, while the problem specification 00:15:29.660 |
So reinforcement learning, I kind of mentioned this. 00:15:37.980 |
the classic thing where you have an agent interacting 00:15:43.220 |
to the environment, which is called the action. 00:15:45.540 |
The environment returns a state and a reward. 00:15:50.980 |
And the agent learns based on these states and these rewards 00:15:55.300 |
And it should learn a policy that makes the rewards go up. 00:16:00.740 |
If you try to mentally map what this looks like in language, 00:16:03.380 |
which is slide seven, is that, like, the language models 00:16:12.660 |
So if the language model is the policy and it's generating, 00:16:20.420 |
to take tens of thousands of prompts and generate them 00:16:24.020 |
and then show them to a human and collect the human responses 00:16:26.740 |
and then shove that into your training architecture 00:16:32.680 |
We just have a reward model that returns a reward. 00:16:36.820 |
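As a rough sketch of that mapping, with purely illustrative names standing in for whatever policy and reward model you actually have:

```python
# The prompt plays the role of the state, the whole generated completion is the
# single action, and a learned reward model stands in for the environment's reward.
def rlhf_step(policy, reward_model, prompt):
    completion = policy.generate(prompt)              # one "action": a full completion
    reward = reward_model.score(prompt, completion)   # scalar reward, no next state
    return completion, reward
```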
When you look at it like an RL problem, what happens 00:16:58.380 |
is why you'll hear RLHF referred to as a bandit problem, which 00:17:04.860 |
There's many more debates that you can have in this. 00:17:12.300 |
then kind of like this is an RL even when you zoom 00:17:18.460 |
Does this change as you think about a chain of thought, 00:17:28.080 |
There's work that I mentioned on one slide called process reward 00:17:30.820 |
models that essentially rewards each step in the chain 00:17:34.220 |
of thought reasoning, which it doesn't really 00:17:39.180 |
But it does make it a little bit more fine-grained, where 00:17:41.520 |
you can think about calling it at least you have many states 00:17:47.020 |
That formulation I don't think people have fully settled on. 00:17:49.860 |
I think there's a bunch of great work out there. 00:17:54.300 |
And Let's Verify Step-by-Step is their pretty great paper. 00:18:01.280 |
that'll probably get made more concrete by the community 00:18:06.240 |
on if you can easily draw out if chain of thought reasoning 00:18:22.960 |
is showing that the work that people are using now 00:18:30.800 |
And the step from this paper, Tamer, which is from 2008, 00:18:36.120 |
some names that are still really relevant and kind 00:18:38.480 |
of human-centric RL, Bradley Knox and Peter Stone, 00:18:46.120 |
you would just have a human give a score from 0 to 1 00:18:48.760 |
as a reward, rather than having a reward function. 00:18:54.560 |
learns to take actions to maximize that reward. 00:19:02.400 |
is you compare it to the paper that everyone knows, 00:19:06.200 |
Deep Reinforcement Learning from Human Preferences paper, 00:19:08.440 |
which is where they showed that learning from human preferences 00:19:11.680 |
you can solve the basic RL tasks at the time. 00:19:22.600 |
than if you just threw RL at the environment that 00:19:26.440 |
So the preferences thing was you took two trajectories. 00:19:29.800 |
So in this case, it was complete trajectories of the agent. 00:19:32.640 |
And the human was labeling which one is better. 00:19:36.380 |
to be like the pairwise preferences that are used today 00:19:40.280 |
And there's also a really kind of interesting nugget 00:19:42.800 |
that is the trajectory that the humans were labeling over 00:19:45.920 |
has a lot more information than the RL algorithm 00:19:52.000 |
like why the performance in this paper was so strong. 00:19:56.480 |
that there isn't more RL work of this style happening now. 00:20:01.000 |
This paper is in 2017, so it's like six years later. 00:20:03.600 |
And I haven't seen things that are exactly similar, 00:20:08.600 |
where stuff that's happening now kind of came from. 00:20:11.360 |
And that's what the next few slides kind of go into. 00:20:14.080 |
Just on the Christiano paper, you mentioned the performance 00:20:17.960 |
I don't remember-- what results should I have in mind 00:20:22.080 |
It's mostly like if you think about an RL learning curve, 00:20:24.440 |
which is like on the x-axis, you have environment interactions. 00:20:29.000 |
You can think about different like ablation studies 00:20:32.040 |
So I think they use like A2C, which I don't even 00:20:34.120 |
remember what that stands for, as their baseline. 00:20:38.640 |
on a bunch of environments like the human preference labels, 00:20:44.760 |
than if it just learned from the signal from the environment, 00:21:06.560 |
to establish as a baseline for our listeners. 00:21:13.280 |
Like this is, in some sense, the next token prediction 00:21:33.080 |
which is that you can actually give negative feedback, 00:21:35.680 |
whereas in a general sort of pre-training situation, 00:21:41.000 |
And maybe the order of magnitude of feedback, 00:21:48.360 |
than a typical training process would do in a language model 00:21:53.400 |
Yeah, I don't think I'm the right person to comment exactly, 00:21:56.360 |
but you can make analogies that reinforcement learning is 00:22:00.960 |
There are a lot of things that will point to that. 00:22:04.400 |
I don't know whether or not it's a richer signal. 00:22:09.160 |
but I think it's a good thing for people to look into more. 00:22:14.240 |
It's like as reinforcement learning is so much less 00:22:18.360 |
compute, it is a richer signal in terms of its impact, 00:22:21.640 |
because if they could do what RLHF is doing at pre-training, 00:22:30.520 |
So on a practical basis, as someone fine-tuning models, 00:22:34.800 |
I have often wished for negative fine-tuning, which pretty much 00:22:43.640 |
How does this work in diffusion models and stuff? 00:22:45.840 |
Because you can give negative prompts to something, 00:22:55.880 |
I'm just wondering if we could do something similar. 00:23:00.200 |
Anyway, so I do want to spell that out for people 00:23:02.920 |
in case they haven't made the connection between RLHF 00:23:11.920 |
dig into this, which is like this 2018 paper that 00:23:14.560 |
was a position paper from a bunch of the same authors 00:23:17.600 |
from the Christiano paper and from the OpenAI work 00:23:20.960 |
that everyone knows, which is like they write a position 00:23:31.680 |
The first assumption is that we can learn user intentions 00:23:41.020 |
in the context of RLHF, which is for many tasks 00:23:44.800 |
is easier than producing the correct behavior. 00:23:47.960 |
It's like we can compare two poems that the model generates, 00:23:51.040 |
and it can be viewed as liking a positive example, 00:23:57.280 |
or it could be viewed as really disliking a negative example. 00:24:02.400 |
are doing in the harm space, is a harmful response 00:24:07.160 |
you agree with the company's definition of harms, 00:24:12.800 |
And they downweight them by preferring something 00:24:15.640 |
more benign in the RLHF process, among other ways 00:24:20.080 |
So this is a good way of saying it's like this is core. 00:24:23.200 |
This kind of comparison and positive or negative example 00:24:26.240 |
is core to all of the RLHF work that has continued. 00:24:30.840 |
Maybe I'll try to put a more colloquial restatement of this. 00:24:41.800 |
That's what everyone's doing in the preference modeling 00:24:49.240 |
This is really just to have all the links for people 00:24:57.360 |
which shows that you can do this RLHF process on language 00:25:01.680 |
This familiar diagram starts to emerge in 2019. 00:25:04.280 |
It's just to show that this goes really far back. 00:25:06.320 |
I think we can kind of breeze through some of these. 00:25:08.520 |
And then 2020 is the first open AI experiment 00:25:17.560 |
we'll go into more when I kind of go into the main concepts. 00:25:20.840 |
But it's like the first time you see this diagram that they 00:25:27.280 |
And the types of examples that they would have-- 00:25:31.360 |
but one that I have read a whole bunch of times 00:25:33.680 |
is they took these prompts from Reddit that was like, 00:25:39.920 |
And people really pour their heart and soul into these. 00:25:42.840 |
So these are like multi-paragraph pieces of writing. 00:26:03.280 |
You do a few shot, which is like it repeats itself. 00:26:07.560 |
And what they did is that this was the first time 00:26:09.920 |
where the language model would generate pretty nice text 00:26:14.440 |
It was restricted to the summarization domain. 00:26:19.000 |
this is where I wish I was paying attention more, 00:26:22.800 |
didn't know to read the language model outputs 00:26:25.320 |
and kind of understand this qualitative sense of the models 00:26:45.320 |
But if you were early to see how different the language that 00:26:51.320 |
think you could have been early to things like chat GPT 00:26:59.240 |
The good people know to chat with language models, 00:27:07.600 |
when they were doing this, how important that could be. 00:27:09.920 |
And then they had years to kind of chisel away at that. 00:27:18.840 |
that they didn't think it would be that big of a deal. 00:27:22.920 |
Maybe they didn't, but they were getting the proxy 00:27:30.000 |
If OpenAI didn't do it, someone else would have done it. 00:27:37.160 |
And I love how you say way back when in referring to 2019. 00:27:45.360 |
the relationship between RLHF, instruction tuning, 00:27:50.840 |
Like, how would you construct the level of knowledge 00:27:56.240 |
Like, what should people know at the high level? 00:27:58.320 |
And then if people want to dive in deeper, where do they go? 00:28:13.160 |
is probably still more important in their day-to-day life. 00:28:19.120 |
You can write samples by hand that make sense. 00:28:26.800 |
It's easy to do almost in no-code solutions at this point. 00:28:31.040 |
And the loss function is really straightforward. 00:28:37.920 |
you can kind of learn from it from a different perspective, 00:28:40.200 |
which is like how the instruction tuning distribution 00:28:42.680 |
makes it easier for your RLHF model to learn. 00:28:47.280 |
on your preference data, if it's close to your instruction 00:28:54.080 |
So I think it's nice to segment and just kind of understand 00:29:05.200 |
at least before DPO really had taken off at all, 00:29:08.720 |
it would be like, do you want to have a team of at least five 00:29:11.640 |
people if you're really thinking about doing RLHF? 00:29:16.440 |
But that's still really limited to kind of one data set 00:29:20.080 |
Everyone's using this UltraFeedback data set. 00:29:21.960 |
And it boosts AlpacaEval, MT-Bench, TruthfulQA, 00:29:29.000 |
And it's like, it might just be that data set combined 00:29:41.600 |
a clear competitive advantage in their kind of niche. 00:29:45.560 |
Because you're not going to make your model chat GPT-like better 00:29:50.360 |
You've got to accept that there's some exploration there. 00:29:55.960 |
like a vein of benefit in your specific domain. 00:29:58.360 |
But I'd still be careful going into the RLHF can of worms. 00:30:05.080 |
OK, so there's a bit of a time skip in what you mentioned. 00:30:16.400 |
we're talking about September 2020 and then going into, 00:30:20.720 |
was Vicuña as one of the more interesting applications 00:30:25.320 |
of instruction tuning that pushed LLAMA 1 from, 00:30:30.200 |
let's say, a GPT-3-ish model to a GPT-3.5 model 00:30:34.040 |
in pure open source with not a lot of resources. 00:30:36.600 |
I think-- I mean, they said something like they 00:30:41.480 |
Yeah, instruction tuning can really go a long way. 00:30:47.360 |
are long overblown in most of the things in open source. 00:30:55.760 |
And it's just kind of showing that instruction 00:31:00.520 |
change what it feels like to talk with your model. 00:31:10.360 |
Just having a little bit of data that's a couple turns 00:31:16.580 |
that was like the story of the whole first part of the year 00:31:18.920 |
is people will be surprised by how far you can take 00:31:26.040 |
is the small models don't really handle nuance as well. 00:31:30.880 |
even if they have really good instruction tuning. 00:31:32.920 |
But if you take that kind of 7 to 70 billion parameter jump, 00:31:36.600 |
the instruction tuning at the bigger model is robustness. 00:31:42.320 |
But that's still just with instruction tuning 00:31:51.560 |
Yeah, this is kind of where we go through my own version 00:31:58.720 |
It's funny because all these things, instruction tuning 00:32:05.120 |
We could save the debate for if the big labs still 00:32:14.320 |
and then what does reinforcement learning optimization actually 00:32:18.680 |
We talk about these sequentially because you really 00:32:31.120 |
And then once you have that, you can collect preference data 00:32:35.160 |
When you say word, you mean like angle bracket, inst? 00:32:42.120 |
But I'm just saying they use their adjective that they like. 00:32:45.960 |
I think Anthropic also likes steerable, that's another one. 00:32:51.840 |
Yeah, so instruction tuning, we've covered most of this. 00:32:57.640 |
It makes models that were only OK extremely comprehensible. 00:33:07.120 |
if you want to ask your model, act like a pirate. 00:33:10.380 |
That's one of the ones I always do, which is always funny. 00:33:12.800 |
But whatever you-- act like a chef, like anything. 00:33:26.360 |
because this chat template is used in RLHF and all 00:33:33.600 |
It's like, once you see this with instruction tuning, 00:33:36.420 |
you really know it, which is like you take things like Stack 00:33:38.840 |
Overflow, where you have a question and an answer. 00:33:46.720 |
When somebody asks a question, there's much more-- 00:33:50.080 |
there's surely kind of more tricky things that people do. 00:34:02.480 |
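For a concrete feel, here is a minimal sketch of turning a question/answer pair into a Llama-2-style chat template; the exact special tokens vary by model family, so treat the markers below as an example rather than a spec.

```python
def to_chat_example(question: str, answer: str) -> str:
    # Llama-2-style [INST] markers; other families use different templates (e.g. ChatML).
    return f"<s>[INST] {question} [/INST] {answer} </s>"

print(to_chat_example("How do I reverse a list in Python?", "Use my_list[::-1]."))
```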
I think people have just gotten better at kind of scaling up 00:34:08.760 |
Yeah, this is where this talk will kind of take a whole left 00:34:15.400 |
I put a slide with the RLHF objective, which I think 00:34:21.720 |
to just kind of understand what is trying to happen here 00:34:32.400 |
But everything kind of comes from an equation 00:34:34.280 |
of trying to learn a policy that maximizes the reward. 00:34:40.680 |
A lot can be said about what the reward should 00:34:42.640 |
be subject to some constraint, which the most popular 00:34:45.960 |
constraint is the KL constraint, which is just 00:34:51.620 |
means if you have a completion from your instruction or RLHF 00:34:55.600 |
model, you can compare that completion to a base model. 00:34:59.400 |
And looking at the log probs from the model, which 00:35:05.000 |
you can see a rough calculation of the distance 00:35:07.480 |
between these two models just as a scalar number. 00:35:10.160 |
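Written out, the objective being described is roughly the following, where pi_theta is the model being trained, pi_ref is the frozen starting model, r_phi is the reward model, and beta is the penalty weight:

```latex
\max_{\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big]
\;-\; \beta \, \mathbb{D}_{\mathrm{KL}} \big[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big]
```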
I think what that actually looks like in code, 00:35:20.680 |
But it is just to make the optimization kind of stay 00:35:26.120 |
Make sure it doesn't overfit to your RLHF data. 00:35:29.520 |
Because we have so little data and our RLHF overfitting 00:35:37.080 |
that labelers like to see, that the model likes to generate, 00:35:41.200 |
punctuation, weird tokens, like calculator tokens. 00:35:44.920 |
It could overfit to anything if it's in the data a lot 00:35:52.520 |
There's not that much documented work on that, 00:35:55.040 |
but there's a lot of people that know if you take that away, 00:35:58.920 |
So it is important, but I think it's something that people 00:36:04.240 |
But as an objective, as I said, it's just kind of-- 00:36:08.000 |
The reward is where the human part of this comes in. 00:36:15.660 |
The real questions are, how do you implement the reward? 00:36:27.000 |
the equation that most of the stuff is based on right now 00:36:29.920 |
is something called a Bradley-Terry model, which 00:36:36.560 |
I'll show an interface that Anthropic uses here. 00:36:49.920 |
you're looking at the probability that the chosen 00:37:00.520 |
is they assume this probability is correlated to reward. 00:37:12.120 |
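Roughly, the Bradley-Terry assumption is that the probability of the chosen completion y_w beating the rejected one y_l for a prompt x is a logistic function of the gap in their (latent) rewards:

```latex
p(y_w \succ y_l \mid x)
= \frac{\exp\big(r(x, y_w)\big)}{\exp\big(r(x, y_w)\big) + \exp\big(r(x, y_l)\big)}
= \sigma\big(r(x, y_w) - r(x, y_l)\big)
```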
I'm kind of inclined to breeze through the math stuff, 00:37:17.800 |
because otherwise it's going to be not as good to listen to. 00:37:24.560 |
I think there's a lot of higher level explanations out there. 00:37:29.240 |
So the real thing is you need to assign a scalar reward of how 00:37:33.440 |
And that's not necessarily that easy to understand. 00:37:36.880 |
Because if we take back to one of the first works 00:37:39.920 |
I mentioned, this tamer thing for decision making, 00:37:42.600 |
people tried that with language models, which 00:37:46.160 |
and you just have someone rate it from 0 to 10, 00:37:50.400 |
on all of these completions and 0 to 10 ratings 00:38:01.800 |
And then that's why they tried this pairwise preference thing. 00:38:05.520 |
And this Bradley Terry model comes from the '50s. 00:38:09.800 |
It's from these fields that I was mentioning earlier. 00:38:20.060 |
But it's still really around in the literature of what 00:38:26.600 |
I'll point out one presumption that this heavily relies on. 00:38:29.320 |
You mentioned this as part of your six presumptions 00:38:31.440 |
that we covered earlier, which is that you can 00:38:35.520 |
This is not exactly true among all humans, right? 00:38:43.920 |
There's a theorem or a name for this called Arrow's 00:38:48.880 |
impossibility theorem, which I'm sure you've come across. 00:39:00.280 |
I think the reason this really is done on a deep level 00:39:15.720 |
is trying to stay around correctness and style 00:39:19.160 |
rather than any meaningful notion of preference. 00:39:21.200 |
Because otherwise, these companies really don't want to-- 00:39:27.780 |
And it's like, if you look at what people actually do-- 00:39:30.140 |
so I have a bunch of slides on the feedback interface. 00:39:34.420 |
It's always at the appendices of every paper. 00:39:37.980 |
Yeah, there's something later on in this talk which is like-- 00:39:40.540 |
but it's good to mention in this is when you're doing this 00:39:43.380 |
preference collection, you write out a very long document 00:39:46.500 |
of instructions to people that are collecting this data. 00:39:52.160 |
Something amounting like factuality, helpfulness, 00:39:55.480 |
honesty, harmlessness-- these are all different things. 00:39:57.920 |
Every company will rank these in different ways, 00:40:03.840 |
you should select this one and why and all of this stuff. 00:40:08.640 |
is like, why don't we check if the models actually 00:40:10.760 |
do these things that we tell the data annotators to collect? 00:40:17.760 |
It'll be really-- it's hard to test if a model is honest 00:40:22.500 |
the kind of causal mechanisms as a researcher 00:40:27.820 |
But at a simple level, what it boils down to-- 00:40:33.140 |
It's like, you're having a conversation with an AI, 00:40:37.620 |
You get shown two responses or more in some papers. 00:40:40.780 |
And then you have to choose which one is better. 00:40:42.780 |
I think something you'll hear a lot in this space 00:40:48.400 |
It's a name for probably some research in economics, 00:40:54.180 |
where if you have integers from one to eight, 00:41:01.140 |
And the smallest numbers will represent one model 00:41:04.940 |
And the biggest numbers will be the other model's better. 00:41:24.140 |
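As a toy illustration of how an 8-point Likert label might get collapsed into the pairwise format that training actually uses (field names made up; real pipelines differ):

```python
def likert_to_pair(response_a: str, response_b: str, rating: int) -> dict:
    # rating in 1..8: 1 = A much better ... 4 = A slightly better,
    #                 5 = B slightly better ... 8 = B much better.
    assert 1 <= rating <= 8
    if rating <= 4:
        return {"chosen": response_a, "rejected": response_b, "margin": 5 - rating}
    return {"chosen": response_b, "rejected": response_a, "margin": rating - 4}
```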
Filling out this preference data is really hard. 00:41:40.680 |
The one I have here is from Anthropic's collection demo. 00:41:45.360 |
It's because it was from slides that I did with Anthropic. 00:41:48.560 |
But you can look up these in the various papers. 00:41:55.080 |
and then you have an option to say which one is better. 00:41:59.240 |
The infrastructure is almost exactly the same. 00:42:02.040 |
But they just log which one you think is better. 00:42:04.920 |
I think places like Scale are also really big in this, 00:42:11.920 |
will help control who's doing how many samples. 00:42:15.520 |
You have multiple people go over the same sample once, 00:42:22.840 |
But it's good to know what the distribution of prompts is, 00:42:31.080 |
A last thing to add is that a lot of these companies 00:42:37.020 |
a rating of how good was the prompt, or the conversation, 00:42:46.620 |
There's a quadrant of preference data in my mind, 00:42:48.740 |
which is you're comparing a good answer to a good answer, which 00:42:54.740 |
comparing a bad answer to a bad answer, which is like, 00:42:57.580 |
you don't want to train your model on two different-- 00:43:04.260 |
we don't know if we can use this, because a lot of it 00:43:07.780 |
because you're rushing to try to do this real contract. 00:43:10.560 |
And then there's also good answer to bad answer, 00:43:12.560 |
which I think is probably pretty reasonable to include. 00:43:14.860 |
You just prefer the good one, and move on with your life. 00:43:19.380 |
I think open AIs of the world are all in good answer, 00:43:22.300 |
good answer, and have learned to eliminate everything else. 00:43:24.760 |
But when people try to do this in open source, 00:43:28.500 |
is there's just a lot of bad answers in your preference 00:43:43.100 |
and everything from the model fails to actually complete 00:43:50.580 |
of offensive or dangerous content, moral judgment, 00:43:55.620 |
I don't know exactly if they're doing this now, 00:43:58.340 |
but you can kind of see why doing RLHF at scale 00:44:01.060 |
and prioritizing a lot of different endpoints 00:44:02.860 |
would be hard, because these are all things that you-- 00:44:05.780 |
I'd be interested if I was scaling up a big team 00:44:08.020 |
to do RLHF, and what is going into the preference data, 00:44:12.660 |
You're like, OK, we're going to remove all the data where 00:44:20.900 |
But some of these other metadata categories-- 00:44:33.260 |
Should people try to adjust for this at the RLHF layer, 00:44:38.740 |
where they have a classifier as a separate model that 00:44:44.780 |
Do you mean for training or, like, a deployment? 00:44:48.340 |
I do think that people are doing it at deployment. 00:44:59.020 |
this, like, helpfulness and safety reward models. 00:45:07.260 |
is, like, helpfulness, factuality, maybe safety, 00:45:11.180 |
But places like Anthropic and ChatGPT and Bard 00:45:11.180 |
because you could use, like, a 100 times smaller language 00:45:25.700 |
model and do much better at filtering than RLHF. 00:45:30.020 |
But I do think it's still so deeply intertwined 00:45:38.180 |
I think that's something that'll kind of settle out, I think. 00:45:41.180 |
I'm just wondering if it's worth collecting this data 00:45:43.300 |
for the RLHF purpose if you're not going to use it in any way, 00:45:46.060 |
because you're just going to use a separate model to-- 00:45:48.020 |
Yeah, I don't think OpenAI will collect all of this anymore. 00:45:57.980 |
scales with how many minutes it takes for you to do each task. 00:46:09.780 |
and I think you may have joined one of our spaces 00:46:15.420 |
We had an estimate from you that was something 00:46:18.340 |
on the order of Llama 2 costs $3 to $6 million 00:46:18.340 |
And then it was something like $20 to $30 million 00:46:27.860 |
Is that something that's still in the ballpark? 00:46:32.280 |
I know that the $20 million was off by a factor of four, 00:46:35.020 |
because I was converting from a prompt number 00:46:44.820 |
And the Llama 2 paper reports 1.5 million data points, 00:46:54.020 |
is safe to say that they're spending, if not more. 00:46:56.460 |
They're probably also buying other types of data 00:46:58.500 |
and/or throwing out data that they don't like. 00:47:05.900 |
always are way lower, because all they have to say 00:47:09.500 |
But they're running tens or hundreds of runs. 00:47:11.420 |
So it's like, OK, this is kind of a meaningless number. 00:47:24.540 |
Some methods, people think that it's more sensitive to-- 00:47:29.700 |
Does the type of instruction tuning you do matter for RLHF? 00:47:36.900 |
are trying to figure out if you need to have what is called 00:47:43.980 |
on-policy data, which is where your RLHF data is from your instruction model. 00:47:47.500 |
I really think people in open source and academics 00:47:50.180 |
are going to figure out how to use any preference 00:47:52.180 |
data on any model, just because they're scrappy. 00:47:54.860 |
But there's been an intuition that to do PPO well and keep 00:47:58.900 |
improving the model over time, and do what Meta did, 00:48:03.740 |
is that you need to collect new preference data to kind of edge 00:48:10.340 |
So there's a depreciation where the first batch of data 00:48:14.180 |
you collect isn't really useful for training the model when 00:48:22.500 |
And I do think that if we had all the LLAMA data, 00:48:28.500 |
Probably 20% to 40% would be pretty useful for people, 00:48:38.180 |
So do you think the open source community should 00:48:40.940 |
spend more time figuring out how to reuse the data that we have, 00:48:49.540 |
into using synthetic data, which I wish I had more slides on it, 00:48:53.600 |
Essentially, people also think that synthetic data, like GPT-4, 00:48:57.420 |
is more accurate than humans at labeling preferences. 00:49:06.180 |
And if humans are about 70% agreement or accuracy, 00:49:11.180 |
So it is a bit better, which is in one way of saying it. 00:49:14.340 |
Humans don't even agree with humans 50% of the time. 00:49:18.740 |
It's like the human disagreement or the lack of accuracy 00:49:31.420 |
It's one of my go-to, like I just say this over and over 00:49:40.020 |
know OpenAI has this stuff, is like very cheap for getting 00:49:43.300 |
pretty good data compared to compute or salary 00:49:47.820 |
So it's like, tell people to go crazy generating GPT-4 data 00:49:51.340 |
if you're willing to take the organizational cloud of, 00:49:56.760 |
that you kind of do this, especially at individuals. 00:49:59.020 |
Yeah, they're not going to come after individuals. 00:50:14.140 |
And we should just mention, at the time of recording, 00:50:16.660 |
we've seen the first example of OpenAI enforcing 00:50:20.780 |
ByteDance was caught, reported to be training on GPT-4 data, 00:50:31.260 |
I don't expect OpenAI to go too crazy on this, 00:50:33.880 |
because there's going to be so much backlash against them. 00:50:41.740 |
is like, OK, this costs $10 to collect one data 00:50:46.460 |
It's going to cost you a tenth of a cent with OpenAI, right? 00:50:52.860 |
and therefore people are just going to do it. 00:50:54.740 |
Yeah, and the signal you get from humans from preferences 00:50:58.860 |
The signal that you get from humans for instructions 00:51:02.500 |
is pretty high, but it is also very expensive. 00:51:04.860 |
So the human instructions are definitely, by far and away, 00:51:12.660 |
are just so much easier to get some sort of signal running 00:51:17.260 |
I think people will start working in other goals 00:51:27.460 |
will start doing things like constitutional AI 00:51:29.460 |
for preferences, which will be pretty interesting. 00:51:33.500 |
We saw how long it took RLHF to get started in open source. 00:51:38.780 |
that was really happening until maybe like August, really. 00:51:49.500 |
knowing that it was something that people are interested in, 00:51:59.180 |
But once people show that you can do it once, 00:52:05.500 |
Just in the domain of human preference data suppliers, 00:52:22.820 |
is perhaps a good store of human preference data? 00:52:28.380 |
They, I think, are generally worried about releasing data 00:52:36.380 |
They're trying to release the preference data. 00:52:41.980 |
is the best limited evaluation that people have to learn 00:52:55.780 |
and you pay for the hosting, you can get the prompts. 00:53:13.340 |
And moving data comes with other legal and liability concerns 00:53:29.260 |
Because it's kind of like this classifier approach that 00:53:50.100 |
Because we talked about scalars don't really work. 00:53:52.220 |
So in order to train it, you use the magical batching 00:53:54.700 |
of all language model, all deep learning architectures. 00:53:57.420 |
And you put in the chosen prompt and the rejected prompt 00:54:04.460 |
And you essentially have to increase the difference 00:54:08.660 |
It's always fun when you think about automatic 00:54:11.700 |
It updates the same parameters to separate these two numbers 00:54:16.260 |
And there's this loss function that you'll see in OpenAI 00:54:25.300 |
That's the difference between these two predicted rewards. 00:54:28.000 |
It's just some fancy math around a difference, a subtraction 00:54:32.000 |
between the reward of the rejected prediction 00:54:37.660 |
and the predicted reward of the chosen completion. 00:54:43.040 |
look different in Anthropic and OpenAI's papers. 00:54:45.840 |
But they're just literally just log transforms. 00:54:49.800 |
and taking a log of both sides, you'll converge on one of the-- 00:54:52.680 |
both the two papers end up being the same thing. 00:54:55.320 |
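A minimal sketch of what that pairwise loss looks like in PyTorch, assuming a reward model that emits one scalar per sequence (names here are illustrative, not any particular library's API):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor):
    """Bradley-Terry pairwise loss: push the chosen scalar above the rejected one.

    Both inputs have shape (batch,), one predicted reward per completion.
    """
    loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
    # Fraction of pairs ordered correctly, i.e. the "agreement" number quoted below.
    accuracy = (reward_chosen > reward_rejected).float().mean()
    return loss, accuracy
```

That accuracy line is where the 65% to 75% agreement figure that comes up next is measured.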
And people don't know how to train preference models, 00:55:08.640 |
And you can take the reward model you're training, 00:55:13.000 |
see if the chosen predicted reward, so the scalar number, 00:55:16.200 |
is higher than the rejected predicted reward. 00:55:21.400 |
It's like where you see they have the 65% to 75% agreement. 00:55:24.400 |
This just means that these scalar numbers were ordered 00:55:31.880 |
That goes to show the kind of deep questions at play here. 00:55:35.480 |
People are playing with different loss functions, 00:55:37.480 |
ensembles, different models to try to address this. 00:55:41.260 |
It's like-- it goes back to what does it mean to do RLHF? 00:55:47.040 |
But it's good to know that this 65% to 75% agreement, 00:55:51.360 |
It's like we don't have 100% agreement with the reward 00:56:00.080 |
and then we start throwing RL at it, I think. 00:56:15.420 |
and then it uses the value function to update the model. 00:56:23.240 |
it's more of like a systems problem than an RL problem. 00:56:30.560 |
This is for the KL constraint that we talked about before. 00:56:34.700 |
is either a separate reward model or value head 00:56:38.820 |
And then you need to have your RL code that actually 00:56:42.000 |
learns a value function and updates all the parameters. 00:56:44.360 |
I think it just is really messy to actually set up. 00:56:50.280 |
could understand what each of the components are. 00:56:53.760 |
how do we actually make a language model that 00:57:05.440 |
The reward model is used in RLHF exactly what you would think. 00:57:13.360 |
That score gets plugged into the whole RL stuff. 00:57:16.240 |
And it learns-- and it updates the parameters. 00:57:22.240 |
zooming in on where exactly you put this distance 00:57:25.400 |
penalty between the base model and the RL model. 00:57:28.440 |
Most people say that you just deduct it from the reward. 00:57:33.600 |
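A rough sketch of that shaping, using per-token log-probs from the policy and the frozen reference model (a common approximation; exact implementations differ):

```python
import torch

def shaped_rewards(rm_score: torch.Tensor,
                   policy_logprobs: torch.Tensor,
                   ref_logprobs: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
    """Per-token rewards for the RL step: a KL penalty at every token,
    with the reward model's scalar added on the completion's final token.

    policy_logprobs / ref_logprobs: shape (seq_len,), log-probs of the sampled tokens.
    """
    rewards = -beta * (policy_logprobs - ref_logprobs)  # approximate per-token KL penalty
    rewards[-1] = rewards[-1] + rm_score                # sequence-level score on last token
    return rewards
```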
as an agent acting in the world, the reward from that world 00:57:37.600 |
would be a combination of the reward model and any 00:57:45.320 |
actually have a KL constraint built into them. 00:57:47.520 |
So it's confusing, because you hear KL twice. 00:57:50.920 |
One of them is about the text, and one of them 00:57:53.120 |
is about the value function distance, or the policy 00:58:01.620 |
because it's more about data and infrastructure than RL details, 00:58:16.200 |
the instruction tuning model, or the instruction tuning 00:58:18.480 |
data set, because they're really happy with that data set 00:58:24.240 |
But I think these are all small gains over just 00:58:37.600 |
where you don't even really need this to get a good model. 00:58:40.040 |
So that's why it's like, OK, the RL is such a small part 00:58:46.520 |
Like, RLHF is a metaphor for all language model adaptation. 00:58:50.440 |
And RL is one tool used at one point in the time. 00:58:55.640 |
the core overview in my mind, to say RL doesn't really 00:59:14.560 |
So in your mind, is the takeaway for this kind 00:59:18.240 |
of next generation of people working on models, 00:59:21.520 |
maybe the underlying theories is less important than actually 00:59:28.240 |
And we'll see, like, I have this advanced topics 00:59:30.240 |
thing in the slides, which starts with the evals. 00:59:32.760 |
And then it talks about a lot of different ways 00:59:44.960 |
and if your language model is generating right, 00:59:49.780 |
and kind of understanding how those things change over time. 00:59:55.560 |
in here, I think this is something we could also 01:00:03.360 |
I think one of the fun ones is from the GPT-4 technical 01:00:06.420 |
They essentially listed their kind of bogus evaluations. 01:00:09.200 |
Because it's a hilarious table, because it's like LSAT, AP 01:00:15.240 |
are kind of reasonable evals in language model land. 01:00:19.240 |
But they just showed that RLHF doesn't improve 01:00:22.440 |
We don't know if internally they have other ones. 01:00:25.240 |
But from what OpenAI has shown us externally, 01:00:32.520 |
I do think it does things that they care about. 01:00:39.600 |
It's a powerful tool to change your language model. 01:00:42.560 |
But as we've seen with LLAMA and safety RLHF, 01:00:58.240 |
to improve your multiple choice reasoning capabilities. 01:01:04.800 |
don't think a lot of people have connected the dots there. 01:01:22.920 |
It's much better being a sommelier, apparently. 01:01:27.480 |
That was the weirdest one that was included in the GPT-4 one. 01:01:41.120 |
Yeah, so this is essentially how to use RLHF-like things 01:01:50.540 |
is kind of the ideas of rejection sampling and best-of-n sampling. 01:01:53.840 |
I think best-of-n sampling is what people often encounter 01:01:56.800 |
first, which is the idea of you take a prompt, 01:02:00.200 |
you generate like 10, 20 responses through it, 01:02:06.400 |
The reward model assigns a scalar for each of them. 01:02:10.680 |
and that's the one you answer the question with. 01:02:14.480 |
because it's just spending more inference time compute 01:02:21.640 |
that I talked about from OpenAI, they use it. 01:02:31.080 |
based on a preference data set to make your answers better. 01:02:34.300 |
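In sketch form, best-of-n is just this (illustrative names; your own policy and reward model go here):

```python
def best_of_n(policy, reward_model, prompt: str, n: int = 16) -> str:
    """Sample n completions and return the one the reward model scores highest."""
    candidates = [policy.generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward_model.score(prompt, c))
```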
The interesting thing that people are confused about more 01:02:36.680 |
is rejection sampling, because Meta talked about it 01:02:40.280 |
Essentially, rejection sampling is putting something 01:02:45.600 |
And instead of just returning the best answer to a user, 01:02:50.720 |
you apply instruction tuning on that data set. 01:02:55.240 |
and then you could collect more preference data, 01:02:57.240 |
do a new reward model, and then you rank some new outputs, 01:03:01.400 |
So essentially, Llama started their RLHF process 01:03:04.440 |
with this to get some signal out of preference data. 01:03:06.880 |
That preference data went into a reward model, 01:03:08.960 |
and then the reward model did a good enough ranking 01:03:11.760 |
that it was essentially super-powered instruction 01:03:16.920 |
Works pretty well, much easier to implement than PPO, 01:03:24.560 |
It's easy to plug into things like transformers 01:03:28.340 |
than whatever freaking mess doing RL at scale is going to be. 01:03:44.920 |
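And rejection sampling, as described for Llama 2, is roughly best-of-n used to build a fine-tuning set instead of answering a user; again a sketch with illustrative names:

```python
def rejection_sampling_round(policy, reward_model, prompts, n: int = 8):
    """One round: keep the top-scoring completion per prompt, then instruction-tune on it."""
    dataset = []
    for prompt in prompts:
        candidates = [policy.generate(prompt) for _ in range(n)]
        best = max(candidates, key=lambda c: reward_model.score(prompt, c))
        dataset.append({"prompt": prompt, "completion": best})
    return dataset  # fed back into ordinary instruction tuning on the policy
```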
and it back-propagates through your reward model directly. 01:03:56.160 |
that all of this is kind of just done on one big data set. 01:04:00.280 |
I'm not an expert in this, but essentially, you 01:04:02.680 |
do much less inference costs during the RLHF process 01:04:08.440 |
There's a few papers that people have published. 01:04:13.440 |
I think it could take off some people that I know in the RLHF 01:04:17.240 |
are doing this in industry, just because it makes 01:04:22.160 |
and the number of things you have to have running. 01:04:34.400 |
or labeling multiple scores or multiple pairwise preferences 01:04:43.920 |
where you're labeling each step in the chain of thought 01:04:47.360 |
reasoning just to kind of make the problem more specific. 01:04:51.360 |
It seems very likely that different feedback will 01:04:55.240 |
Chain of thought reasoning is great for math, 01:05:03.000 |
but as any tool gets better, it gets more specific. 01:05:07.920 |
Then kind of get into more of a talking point, 01:05:14.000 |
I think this is something that people really don't-- 01:05:21.080 |
I think most people thought that constitutional AI was doing 01:05:23.920 |
something where it's like created the preference 01:05:27.200 |
data based on the specific principles in some way, 01:05:42.040 |
of generating this sort of preference data or alignment 01:05:53.320 |
but it draws from the UN Declaration of Human Rights 01:05:57.200 |
and the Apple Terms of Service, for some reason. 01:06:02.760 |
and how is it evaluating in a way that you can train on? 01:06:13.040 |
have a language model that critiques the instruction based 01:06:15.560 |
on principles, and then your instruction responses 01:06:23.560 |
The diagram in their paper's wild in this one. 01:06:35.480 |
is fine-tuning your instructions based on principles. 01:06:40.320 |
And then the second half is what people really 01:06:53.520 |
which is for the synthetic feedback for generating 01:06:59.000 |
pick between these two answers based on this principle. 01:07:21.160 |
It's just the two completions without the context 01:07:30.520 |
set of two candidate completions across different prompts. 01:07:48.920 |
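A paraphrased sketch of what that AI-feedback query amounts to (this is not Anthropic's exact template, just the shape of it):

```python
import random

def rlaif_label_prompt(prompt: str, completion_a: str, completion_b: str,
                       principles: list[str]) -> str:
    """Ask a feedback model to pick (A) or (B) under one randomly sampled principle."""
    principle = random.choice(principles)
    return (
        f"Consider the following conversation:\n{prompt}\n\n"
        f"{principle}\n"
        f"(A) {completion_a}\n"
        f"(B) {completion_b}\n"
        "The answer is:"
    )
```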
But it is way less explicit than I thought it was going to be. 01:07:55.080 |
it checked to see if the principles were satisfied 01:07:58.840 |
But it's really just a modification to the RLHF setup 01:08:02.420 |
that we've talked about with instruction tuning 01:08:04.380 |
and preference data collection, where there is an AI 01:08:13.720 |
So it almost sounds more tractable in that way. 01:08:17.160 |
But I would also guess, while I just say, oh, look, 01:08:20.000 |
I figured it out, I'm guessing they do different things 01:08:31.200 |
But it's good to know where they started, at least in this case. 01:08:34.880 |
I thought the communication around the Pareto optimal 01:08:38.560 |
improvement was helpful in understanding that you do 01:08:42.560 |
actually want it to be more helpful and honest 01:08:47.640 |
while maintaining the same level of harmlessness 01:08:50.920 |
Yeah, so that figure right at the top of the constitutional AI 01:08:54.680 |
paper is worth seeing, if you don't have it immediately 01:08:57.080 |
pop into your head, where they essentially compare 01:09:03.880 |
and it's something that most RLHF papers don't do, 01:09:09.720 |
And it'd be really great to see more RLHF papers kind 01:09:12.280 |
of showing how per epoch or per half epoch of training, 01:09:16.680 |
because most RLHF is only a few epochs, at least 01:09:23.400 |
But that's how we should be thinking about it, 01:09:27.520 |
And it's like, we don't know what's happening 01:09:31.400 |
I don't know if this is a relevant comparison for you, 01:09:33.960 |
but OpenAI also recently released a weak-to-strong 01:09:40.200 |
talked about a few intermediate checkpoints for GPT-4. 01:09:44.120 |
Any comments on the comparison between constitutional AI 01:09:58.040 |
which is such a kind of odd model to focus on. 01:10:05.760 |
It's like they're sharing less than they know. 01:10:10.280 |
that are pretty cool that they're doing internally. 01:10:16.200 |
have seen the paper, because it's impossible to keep up 01:10:19.640 |
I do think that what constitutional AI and RLHF 01:10:33.800 |
And the only way to scale this is to trust our AI overlords 01:10:39.520 |
And constitutional AI was the first version of this. 01:10:41.880 |
What the second version, or what weak-to-strong is, 01:10:44.600 |
is that anticipating a future of superintelligence 01:10:50.440 |
the thing that we're trying to control is smarter than us. 01:11:01.880 |
to do in the future as well, when we are not-- 01:11:15.560 |
So they're just basically-- they're prepping. 01:11:33.040 |
And I see a lineage from constitutional AI to this. 01:11:37.000 |
Yeah, the constitutional AI and the superalignment 01:11:47.240 |
coming to the same conclusions in different ways. 01:11:56.840 |
Because I just didn't really put it together in my brain 01:11:59.240 |
quickly, looking at weak-to-strong generalization 01:12:08.640 |
I understand what synthetic data means in all of this. 01:12:11.240 |
It's like, how could they communicate that a little bit 01:12:15.800 |
Because I want to know what they think about this. 01:12:17.840 |
Which is why I like that Pareto optimal thing, 01:12:19.760 |
it links-- it takes the safety debate away from x-risk 01:12:23.960 |
to, no, this makes the models more useful. 01:12:38.280 |
which is about direct preference optimization. 01:12:46.000 |
But essentially, DPO is a different class of algorithms. 01:12:51.040 |
I still call it RLHF, because RLHF is so vague 01:12:55.640 |
I think DPO is closer to RLHF than RLHF is to RL. 01:13:07.960 |
from the preference data, where the preference data is 01:13:28.120 |
where the classifier is trained to output a scalar value based 01:13:34.720 |
DPO is purely based on the difference between two log 01:13:39.160 |
So the reward there is the ratio between the policy generation 01:13:43.640 |
likelihood and the base model generation likelihood. 01:13:47.080 |
I don't have intuitions for what that means yet, 01:13:49.080 |
but what the reward actually is is very different. 01:13:52.600 |
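Concretely, DPO's implicit reward is that log-ratio, and the loss is the same Bradley-Terry form applied to it, with beta acting as a temperature and pi_ref the frozen starting model:

```latex
\mathcal{L}_{\mathrm{DPO}}
= -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
\;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
```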
The data starting point, in principle, could be the same. 01:13:55.440 |
And I think we've seen a lot of successes in open source 01:14:06.200 |
I think we'll keep seeing DPO models for the time being. 01:14:10.080 |
But we won't really answer what the fundamental differences 01:14:19.080 |
think that PPO-like methods or other RL methods 01:14:32.800 |
because they could do something more complicated. 01:14:34.840 |
But that's not what academics and open source people 01:14:38.840 |
They care about being able to improve on their methods 01:14:44.200 |
So in a lot of ways, I think DPO still will be what people see. 01:14:47.840 |
But in some ways, it's probably slightly more constrained. 01:14:52.480 |
There's other ways that you could think of PPO 01:14:56.640 |
if your code runs is the score that you give it. 01:15:01.840 |
you have to do canned things to get DPO to have the same data. 01:15:06.920 |
So there are specific cases where the DPO formulation 01:15:10.640 |
But I expect to see more DPO models than anything else 01:15:15.360 |
That's probably what most people need to know, 01:15:19.720 |
And I would love to learn more about PPO and a lot of authors 01:15:33.440 |
So for people who are listening to this in the future, 01:15:51.140 |
a kind of mathy, but still experimental paper in language 01:15:54.700 |
models is, the DPO paper is a really good one 01:16:21.420 |
I will say, it does remind me of FlashAttention a little bit, 01:16:24.040 |
in the sense that it's kind of an equivalent thing 01:16:29.420 |
And it's just faster, cheaper, just better in every way. 01:16:37.140 |
between the control you get in training a reward model 01:16:41.380 |
Because essentially, everything you want your reward model 01:16:43.620 |
to do might not be everything that you train the policy 01:16:53.660 |
And we don't know if you have fancy engineering-- 01:16:57.100 |
like, if you have fancy engineering abstractions 01:16:59.140 |
and test your reward model to do different things, 01:17:05.680 |
at the absolute biggest scale and most investment 01:17:20.960 |
And I was asking somebody who was on some of those earlier 01:17:25.720 |
And they were like, I wish we had thought of that. 01:17:32.000 |
That's the type of thing that academia still can do 01:17:35.400 |
and can do really well and hopefully continues to do. 01:17:55.000 |
And you're one of the few people maybe placed to explain-- 01:17:59.240 |
Maybe like, what's Allen Institute doing here? 01:18:21.520 |
known as being a super academic lab where they have 01:18:28.920 |
And they're trying to move more in the direction of really 01:18:33.700 |
It's like talking with the new CEO, Ali Farhadi. 01:18:37.280 |
I don't know if I pronounced the last name right. 01:18:39.320 |
But he's trying to move from an org that does papers only 01:18:43.640 |
to something that does papers, releases models, 01:18:50.640 |
don't have a middle, an established place where they 01:18:55.400 |
So they're really trying to expand the scope. 01:18:59.880 |
And the Tulu2 model is the type of thing I've joined. 01:19:03.560 |
And I was like, OK, we should just train it and release it, 01:19:05.360 |
because no one has done this direct preference 01:19:13.280 |
This is classic of everything kind of works right now in ML. 01:19:16.840 |
I showed up, and the grad student Hamish Ivison-- 01:19:20.240 |
and I need to learn how to pronounce last names better-- 01:19:22.520 |
he had some JAX DPO code built on this EasyLM framework. 01:19:37.340 |
It's like we did no ablations, didn't change any parameters. 01:19:41.560 |
And that's the model that people have been working with. 01:19:45.040 |
That goes to show that there's a lot of runway left in understanding these methods. 01:19:54.760 |
And it still returned a model that was pretty good on benchmarks. 01:20:00.240 |
So in 2024, we'll be busy in this space. 01:20:04.980 |
We're running data ablations to try to understand what's best. 01:20:08.400 |
Then Allen Institute is pre-training language models 01:20:13.280 |
where we'll be able to share data, code, everything, 01:20:32.560 |
And once those are out, they will probably become more of a priority. 01:20:36.560 |
So it's like you still want to learn from LLaMA2 and LLaMA3. 01:20:43.240 |
I think DPO releases are kind of becoming expected, 01:20:46.880 |
because Mistral released a DPO model as well. 01:20:50.720 |
I think the slide after this is just like, there's a ton. 01:20:58.400 |
At some point, you just have to accept that that's where things are, 01:21:04.960 |
which is a bit of a shame, because there's really interesting, debatable questions left open, 01:21:14.200 |
because everything that is published is with DPO. 01:21:21.880 |
Yeah, kind of last of this stuff is evaluation. 01:21:25.640 |
And these slides were prepared kind of last minute. 01:21:29.000 |
But I think the question is, how do you evaluate these models 01:21:33.840 |
I think the PSA is like, don't trust your numbers alone-- also go talk to the models. 01:21:37.960 |
It's very hard to do if you're an engineer or a researcher, 01:21:40.320 |
because you have your specific thing that you're zoomed in on. 01:21:44.420 |
It's worth it to just go play with ChatGPT or go play with Chatbot Arena. 01:21:48.100 |
It's something that I-- this is me telling myself to do it, too. 01:21:51.760 |
But there's the question of, is the Hugging Face leaderboard reliable? 01:22:01.240 |
The Hugging Face leaderboard came out of the team I was on. 01:22:04.080 |
We were trying to build a framework to automatically evaluate our models, 01:22:10.960 |
and then have them in a central place where it could be like, here are the scores. 01:22:27.240 |
Those benchmarks are easy to overfit if you're training and focusing on them. 01:22:38.440 |
But it's like, now it has six evaluation tools. 01:22:41.640 |
I can't even name all of them off the top of my head. 01:22:48.920 |
but they dropped a DROP, which was pretty funny. 01:22:50.960 |
DROP is a QA benchmark, and then I think there's maybe some other math evals. 01:23:01.480 |
It's the thing that everyone's talking about, because there's a lot of gaming going on. 01:23:07.240 |
Is there some discussion about held-out benchmarks? 01:23:15.760 |
But we're thinking about this at Allen AI, too, 01:23:20.240 |
thinking about improving on AlpacaEval, which is-- 01:23:24.800 |
Right now, Hugging Face is just running every eval every day. 01:23:29.040 |
At one point, they were going to do more training. 01:23:39.760 |
but it is like, you have to have hundreds of GPUs to run all of these evals. 01:23:47.440 |
Some of these are open source models that they don't change, but the API models do change-- 01:24:07.040 |
is ChatGPT from March better than ChatGPT from June? 01:24:15.520 |
Slide 58 is the Chatbot Arena leaderboard, 01:24:20.880 |
if you're looking later-- Chatbot Arena is this thing from LMSYS where people vote between two model responses. 01:24:27.520 |
And you can see that GPT-4 from March has a higher score than GPT-4 from June. 01:24:33.360 |
And the same-- it's like, this is not a perfect comparison. 01:24:37.360 |
But there are signs that are pretty funny there, 01:24:42.680 |
But you don't know who's collecting this data, 01:24:55.940 |
Yeah, it's outside of the error bars on the LMSYS thing. 01:25:00.680 |
GPT-4 Turbo is also notably ahead of the other GPT-4s, 01:25:08.040 |
once they added it to the leaderboard or to the arena. 01:25:21.520 |
But the leaderboard is very close for many strata of models. 01:25:28.280 |
So there are levels that you can get your model to. 01:25:32.800 |
In the open source, there's things like Mixtral Instruct at the top level. 01:25:53.600 |
And then there was a level with the alpacas and the vicunas, 01:26:01.500 |
then a level around GPT-4, and then there's another step up to GPT-4 Turbo. 01:26:03.960 |
So it's like the difference from GPT-4 Turbo 01:26:10.720 |
to GPT-4 is bigger than the difference from Tulu 2 to GPT-4. 01:26:14.600 |
So that's just like, there's something good going on there. 01:26:17.720 |
And I was like, OK, that's a new model by my standards. 01:26:24.280 |
They said it's our new model, but they weren't like, this is a whole new model, 01:26:28.840 |
because the benchmark scores are probably the same, 01:26:32.120 |
but they made it so that people like using it more. 01:26:34.560 |
There's some hints that 4.5 might drop at some point. 01:26:37.400 |
We don't actually know how true those things are, 01:26:42.440 |
but they're retraining these models, and they could call any of them 4.5. 01:26:50.120 |
The evals that people use in research domains on RLHF are AlpacaEval and MT-Bench. 01:27:00.400 |
Evaluating chat is really hard, and what they both do 01:27:03.360 |
is they have GPT-4 provide some sort of feedback. 01:27:10.480 |
and they have a prompt and a follow-up question. 01:27:19.520 |
GPT-4 rates the first response and the second response, and they provide the average. 01:27:19.520 |
And then AlpacaEval is a little bit different. 01:27:34.860 |
And what it's doing under the hood is comparing the new model 01:27:37.360 |
to text-davinci-003, which is one of OpenAI's older instruction-tuned models. 01:27:45.560 |
The score is the win rate that GPT-4 sees between the new model and davinci-003. 01:28:05.880 |
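A minimal sketch of that kind of LLM-as-a-judge win-rate computation (assuming the OpenAI Python client; the judge prompt and answer parsing are simplified stand-ins, not the benchmark's actual templates, which also handle ties and position bias):

```python
# Sketch of an AlpacaEval-style win rate: GPT-4 judges the new model's output
# against a fixed reference model's output on the same prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = (
    "You are comparing two responses to the same instruction.\n"
    "Instruction: {prompt}\n\n"
    "Response A: {a}\n\nResponse B: {b}\n\n"
    "Which response is better? Answer with exactly 'A' or 'B'."
)


def win_rate(examples: list[tuple[str, str, str]]) -> float:
    """examples: (prompt, new_model_output, reference_output) triples."""
    wins = 0
    for prompt, new_out, ref_out in examples:
        judgment = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": JUDGE_TEMPLATE.format(prompt=prompt, a=new_out, b=ref_out),
            }],
        ).choices[0].message.content.strip()
        wins += judgment.upper().startswith("A")
    return wins / len(examples)
```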
The prompts come from Open Assistant, Vicuna, Koala, and Anthropic's Helpful and Harmless data. 01:28:09.120 |
So AlpacaEval is from sources that people know and love. 01:28:14.360 |
We were more focused on MT-Bench at Hugging Face. 01:28:17.280 |
At AI2, we're a little bit more focused on AlpacaEval. 01:28:22.080 |
These are kind of like table stakes to saying that you have a good chat model. 01:28:26.400 |
You should be able to have a pretty good score on both of them. 01:28:30.080 |
And then the kind of proof is in people actually talking to it. 01:28:33.280 |
So I think the Zephyr model from Hugging Face 01:28:35.960 |
was a kind of step change in people's perception of open models. 01:28:45.960 |
And someone else, like I saw some substacker, 01:28:48.440 |
was using it as a writing feedback bot instead of ChatGPT. 01:28:52.440 |
But that's what happens when a good open release is there now. 01:28:57.600 |
It's like the evaluations are good and people pick it up. 01:29:08.140 |
the one or one of these big ones without talking to it. 01:29:20.160 |
And until Gemini Ultra comes out, we don't know. 01:29:23.520 |
It's probably a great model, but we don't know what they have. 01:29:27.360 |
Gemini Pro didn't do so great on the other stuff either. 01:29:27.360 |
or if it was like a major deliverable for them or not. 01:29:39.960 |
it's probably a strategy headache for Google. 01:29:48.960 |
One of our lightning round questions is always-- 01:30:01.040 |
that will be hinted at in the talk to this point, which 01:30:06.280 |
is like, I split it up in my work between data, training, and evaluation-- like, how 01:30:11.440 |
do we evaluate what's happening at the model level with RLHF? 01:30:18.280 |
People outside the labs don't know what swapping between a Claude base model 01:30:20.760 |
or a GPT-4 base model would do, how that would change any notion of preference training, 01:30:29.720 |
and kind of see, does RLHF work the same for both of those 01:30:34.760 |
when you use the same data set in the same framework? 01:30:38.200 |
That'd be good to know how sensitive RLHF is. 01:30:41.720 |
On the data, we talk a lot about aggregation. 01:30:44.220 |
On the research side, there's a lot of interesting things, 01:30:51.000 |
like, does the quality of the data depend on professional contexts? 01:30:55.200 |
The results of this might really affect Scale AI. 01:31:00.160 |
They should do internal market analysis on that line. 01:31:19.440 |
And then on evaluation, which is like, what happens at the end of the day? 01:31:22.600 |
I mentioned what I call qualitative alignment earlier-- 01:31:27.520 |
do the models get better in ways matching the preference data preferences? 01:31:31.040 |
So if you collect two batches of preference data 01:31:33.720 |
with different priorities, what is the downstream model change? 01:31:42.080 |
Should that change show up the same in something like, write me a joke? 01:31:47.520 |
Like, deep learning just scales and aggregates. 01:31:53.280 |
but it's not necessarily what some people would expect. 01:31:57.160 |
And then the kind of last slide that I have is fun, 01:31:59.240 |
which is just like, John Schulman talks about this in a conference talk. 01:32:06.000 |
They made it public three months after the conference 01:32:09.680 |
But he talks about things like ChatGPT being verbose 01:32:14.880 |
Things that are really in vogue in the conversation right now 01:32:17.840 |
and how those can emerge in the process of continually trying 01:32:22.520 |
to adjust the RLHF process based on what users 01:32:27.440 |
And this is like a sort of outer loop optimization 01:32:29.760 |
that no one in the open is even remotely qualified to do. 01:32:34.160 |
And they'll rerun RLHF and train a new reward model 01:32:36.920 |
with a mixture of their curated data and user prompts 01:32:44.880 |
And while there's a lot of critiques about this, 01:32:47.200 |
they're definitely intentional in trying to fix-- 01:32:50.400 |
I feel like it's probably whack-a-mole, where they're fixing one thing at a time. 01:32:54.520 |
And then it pops up some new problem after doing RLHF. 01:33:01.400 |
And this is where things start to look more like RL. 01:33:05.600 |
Things are just on a longer time frame of optimizing the model. 01:33:12.320 |
We're probably years away from ever actually working on this. 01:33:14.680 |
But we can try to get details from people who are. 01:33:27.080 |
I would ask you guys if you know companies that are doing this. 01:33:30.680 |
I know some that are in the RLHF-as-a-service space. 01:33:40.720 |
It depends if synthetic data is going to win over human data. 01:33:43.600 |
If human data is the real winning feature in the end, that's worth a lot. 01:33:49.440 |
So it kind of makes sense as a VC model anyways, 01:33:51.960 |
but there's going to be both of them for a while. 01:34:01.000 |
Is there a lot of ambition in this field to start companies, 01:34:04.600 |
or is this more of a research-driven part of the stack? 01:34:10.640 |
There definitely is, because I know my former colleague 01:34:13.280 |
Nazneen Rajani from Hugging Face is also starting a company in this space. 01:34:18.400 |
The Falcon team who left Hugging Face, I think, 01:34:39.800 |
started a new research lab, so that should help. 01:34:50.040 |
I think this is the first 201 that we've ever had,