The Origin and Future of RLHF: the secret ingredient for ChatGPT

00:00:00.000 | [MUSIC PLAYING]

00:00:00.620 | Hey, everyone.

00:00:01.280 | Welcome to the Latent Space Podcast.

00:00:03.200 | This is Alessio, partner and CTO of Residence

00:00:05.720 | at Decibel Partners.

00:00:06.760 | And I'm joined by my co-host, Swiggs, founder of Small AI.

00:00:10.040 | Hey, and today we have Dr. Nathan Lambert in the house.

00:00:12.800 | Welcome.

00:00:13.520 | Thanks, guys.

00:00:15.640 | You didn't have to come too far.

00:00:17.040 | You got your PhD in Berkeley, and it

00:00:18.840 | seems like you've lived there most of the time

00:00:21.240 | in recent years.

00:00:22.760 | You worked on robotics and model-based reinforcement

00:00:26.240 | learning on your PhD, and you also

00:00:28.200 | interned at FAIR and DeepMind.

00:00:31.440 | You bootstrapped the RLHF team at Hugging Face,

00:00:34.760 | and you recently joined the Allen Institute

00:00:37.200 | as a research scientist.

00:00:39.440 | So that's your quick bio.

00:00:40.720 | What should people know about you

00:00:42.100 | that maybe is not super obvious about you on New LinkedIn?

00:00:46.000 | I stay sane in various insane sport and ultra-endurance sport

00:00:51.560 | activities that I do.

00:00:53.120 | What's an ultra-endurance sport activity?

00:00:56.640 | Long-distance trail running or gravel biking.

00:00:59.320 | Nice.

00:01:00.600 | Nice.

00:01:01.240 | Try to unplug sometimes, although it's harder these days.

00:01:04.960 | Yeah.

00:01:05.680 | Well, the Bay Area is just really good for that stuff,

00:01:08.160 | right?

00:01:08.660 | Oh, yeah.

00:01:09.160 | You can't beat it.

00:01:11.200 | I have a trailhead, like, 1.2 miles from my house,

00:01:14.080 | which is pretty unmatchable in any other urban area.

00:01:18.800 | Yeah.

00:01:19.440 | Yeah.

00:01:20.520 | Pretty excellent.

00:01:21.920 | You also have an incredible blog, Interconnects,

00:01:26.240 | which I'm a fan of.

00:01:28.400 | And I also just recently discovered

00:01:29.880 | that you have a new podcast, Retort.

00:01:32.200 | Yeah, I do.

00:01:33.000 | I've been writing for a while, and I

00:01:34.500 | feel like I've finally started to write things that

00:01:37.400 | are understandable and fun.

00:01:38.840 | After a few years lost in the wilderness,

00:01:40.860 | if you ask some of my friends that I

00:01:42.360 | made read the earlier blogs, they're like, yikes.

00:01:44.680 | But it's coming along.

00:01:47.080 | And the podcast is with my friend Tom,

00:01:49.000 | and we just kind of riff on what's actually happening on AI

00:01:52.920 | and not really do news recaps, but just what it all means

00:01:57.040 | and have a more critical perspective on the things that

00:02:00.720 | really are kind of funny but still very serious happening

00:02:03.720 | in the world of machine learning.

00:02:05.160 | Yeah.

00:02:05.880 | Awesome.

00:02:06.640 | For people who are new to your work,

00:02:08.120 | what would you highlight as your greatest hits

00:02:10.280 | so far on Interconnects, at least?

00:02:13.840 | So the ones that are most popular

00:02:15.840 | are timely and/or opinion pieces.

00:02:17.840 | So the first real breakout piece was in April

00:02:19.920 | when I also just wrote down the thing

00:02:21.500 | that everyone in AI was feeling, which

00:02:23.080 | is like we're all feeling stressed that we're

00:02:26.120 | going to get scooped and that we're overworked, which

00:02:28.320 | is like behind the curtain what it feels to work in AI.

00:02:32.040 | And then a similar one, which we might touch on later in this,

00:02:34.660 | was about my recent job search, which

00:02:36.280 | wasn't the first time I wrote a job search post.

00:02:38.280 | I love that.

00:02:38.760 | People always love that stuff.

00:02:40.040 | It's so open.

00:02:41.080 | I mean, it's easy for me to do in a way

00:02:44.840 | that it's very on brand, and it's very helpful.

00:02:47.640 | Because I understand that until you've done it,

00:02:50.400 | it's hard to share this information.

00:02:53.360 | And then other popular ones are various model training

00:02:57.360 | techniques or fine tuning.

00:02:58.840 | There's an early one on RLHF, which

00:03:00.960 | is-- this stuff is all just like when I figure it out

00:03:03.600 | in my brain.

00:03:04.200 | So I wrote an article that's like how RLHF actually works,

00:03:07.520 | which is just the intuitions I had put together

00:03:09.680 | in the summer about RLHF.

00:03:11.160 | And that was pretty well.

00:03:13.120 | And then I opportunistically wrote about Q*,

00:03:15.780 | which you hate that you have to do it, but it is pretty funny.

00:03:19.760 | I found that it's like, from a literature perspective,

00:03:23.160 | I'm like, OpenAI publishes on work

00:03:25.040 | that is very related to mathematical reasoning.

00:03:28.040 | So it's like, oh, you just poke a little around what

00:03:30.200 | they've already published, and it seems pretty reasonable.

00:03:33.000 | But we don't know.

00:03:33.760 | They probably just got like a moderate bump

00:03:36.160 | on one of their benchmarks, and then everyone

00:03:38.040 | lost their minds.

00:03:38.920 | It doesn't really matter.

00:03:40.360 | This is why Sam Altman was fired.

00:03:42.680 | I don't know.

00:03:43.320 | Anyway, yeah, we're here to talk about--

00:03:45.560 | RLHF 101, you did a presentation.

00:03:48.660 | And I think you expressed some desire to re-record it.

00:03:51.340 | And that's why I reached out on Twitter saying,

00:03:53.300 | like, why not re-record it with us?

00:03:54.760 | And then we can ask questions and talk about it.

00:03:57.260 | Yeah, sounds good.

00:03:58.460 | I think it's-- I try to do it every six or 12 months

00:04:00.980 | is my estimated cadence, just to refine the ways

00:04:05.020 | that I say things.

00:04:05.860 | And people will see that we don't know that much more,

00:04:08.860 | but we have a bit better way of saying what we don't know.

00:04:12.220 | Yeah, awesome.

00:04:13.660 | We can dive right in.

00:04:14.660 | I don't know if there's any other topics

00:04:16.500 | that we want to lay out as groundwork.

00:04:18.760 | No, you have some awesome slides.

00:04:20.420 | So for people listening on podcast only,

00:04:22.980 | we're going to have the slides on our show notes,

00:04:25.260 | and then we're going to have a YouTube version.

00:04:27.740 | Like and subscribe.

00:04:28.620 | Where we run through everything together.

00:04:30.980 | Sounds good, yeah.

00:04:32.940 | So I think to start skipping a lot of the, like,

00:04:36.120 | what is a language model stuff, everyone

00:04:37.820 | knows that at this point.

00:04:38.860 | I think the quote from the Llama 2 paper

00:04:41.860 | is a great kind of tidbit on RLHF becoming a real deal.

00:04:46.420 | There was some uncertainty earlier in the year

00:04:48.520 | about whether or not RLHF was really going to be important.

00:04:51.140 | I think it was not that surprising that it is.

00:04:55.180 | I mean, with recent models still using it,

00:04:58.500 | the signs were there.

00:04:59.460 | But the Llama 2 paper essentially

00:05:00.900 | reads like a bunch of NLP researchers

00:05:03.260 | that were skeptical and surprised.

00:05:05.460 | So the quote from the paper was, "Meanwhile,

00:05:07.260 | reinforcement learning, known for its instability,

00:05:09.340 | seemed a somewhat shadowy field for those in the NLP research

00:05:12.260 | community.

00:05:13.060 | However, reinforcement learning proved highly effective,

00:05:15.460 | particularly given its cost and time effectiveness."

00:05:19.020 | So you don't really know exactly what the costs and time

00:05:21.560 | that Meta is looking at.

00:05:22.560 | Because they have a huge team and a pretty good amount

00:05:24.820 | of money here to release these Llama models.

00:05:27.100 | But like, this is just the kind of thing that we're seeing now.

00:05:31.100 | I think any major company that wasn't doing RLHF

00:05:33.500 | is now realizing they have to have a team around this.

00:05:37.500 | At the same time, we don't have a lot

00:05:39.420 | of that in the open and research communities at the same scale.

00:05:43.340 | I think seeing that converge would be great.

00:05:45.580 | But it's still very early days.

00:05:48.820 | And the other thing on the slide is some of Anthropic's work.

00:05:51.820 | But everyone knows Anthropic is kind of the masters of this.

00:05:54.400 | And they have some of their own techniques

00:05:56.200 | that we're going to talk about later on.

00:05:57.900 | But that's kind of where we start.

00:06:01.580 | Can we do just a one second RL diversion?

00:06:05.900 | So you come from a robotics background, which RL used to be,

00:06:09.700 | or maybe still is, state of the art.

00:06:11.300 | And then now you're seeing a lot of LLM plus RL.

00:06:14.340 | So you have the gym fans, Eureka.

00:06:16.940 | You have MPU, which we had on the podcast when they started

00:06:20.660 | with RL.

00:06:21.140 | Now they're doing RL plus LLMs.

00:06:24.740 | Yeah, any thoughts there on how we got here?

00:06:27.500 | Like maybe how the pendulum will keep swinging?

00:06:31.700 | I really think RL is about a framing

00:06:33.700 | of viewing the world through trial and error learning

00:06:36.020 | and feedback, and really just one

00:06:37.480 | that's focused on thinking about decision making and inputs

00:06:41.180 | in the world and how inputs have reactions.

00:06:44.100 | And in that, a lot of people come

00:06:45.580 | from a lot of different backgrounds,

00:06:47.220 | whether it's physics, electrical engineering,

00:06:48.620 | mechanical engineering.

00:06:49.980 | There are obviously computer scientists.

00:06:51.660 | But compared to other fields of CS,

00:06:53.900 | I do think it's a much more diverse background of people.

00:06:56.900 | Like my background was in electrical engineering

00:06:58.900 | and doing robotics and things like that.

00:07:01.660 | It really just changes the world view.

00:07:04.300 | I think that reinforcement learning, as it was back then,

00:07:09.620 | so to say, is really different, because you're

00:07:12.820 | looking at these toy problems, and the numbers

00:07:15.300 | are totally different.

00:07:16.260 | And everyone went kind of 0 to 1 at scaling these things up.

00:07:20.380 | But people like Jim Fan and other people that were--

00:07:23.300 | you saw this transition in the decision transformer and papers

00:07:26.700 | and when people are trying to use transformers to do decision

00:07:30.340 | making for things like offline RL,

00:07:31.780 | and I think that was kind of like the early days.

00:07:34.380 | But then once language models were so proven,

00:07:37.140 | it's like everyone is using this tool for their research.

00:07:40.300 | I think in the long run, it will still settle out,

00:07:44.100 | or RL will still be a field that people work on,

00:07:46.340 | just because of these kind of fundamental things

00:07:48.340 | that I talked about, that it's just

00:07:50.740 | viewing the whole problem formulation

00:07:52.860 | different than predicting text.

00:07:55.100 | And so there needs to be that separation.

00:07:58.660 | And the view of RL in language models

00:08:01.220 | is pretty contrived already.

00:08:02.500 | So it's not like we're doing real RL.

00:08:05.220 | I think the last slide that I have here

00:08:07.060 | is a way to make RLHF more like what people

00:08:11.380 | would think of with RL, so actually running things

00:08:14.260 | over time.

00:08:15.620 | But it's a weird lineage of tools

00:08:19.300 | that happen to get us to where we are,

00:08:20.900 | so that's why the name takes up so much space.

00:08:23.900 | But it could have gone a lot of different ways.

00:08:27.900 | Cool.

00:08:28.380 | We made it one slide before going on a tangent.

00:08:32.420 | Yeah, I mean, it's kind of related.

00:08:35.380 | Yeah, so we have a history of RL.

00:08:37.620 | Yeah, so I recently--

00:08:39.660 | to give the context, this paper really

00:08:41.300 | started because I've had this more diverse background

00:08:44.020 | than some computer scientists, which

00:08:45.620 | is trying to understand what the difference of a cost

00:08:48.180 | function, or a reward function, and a preference function

00:08:51.060 | would be, without going into all of the details.

00:08:54.420 | Costs are normally things that control theorists

00:08:56.500 | would work with in these kind of closed domains.

00:08:58.620 | And then reinforcement learning has always

00:09:00.380 | worked with rewards that's central to the formulation

00:09:02.660 | that we'll see.

00:09:03.300 | And then the idea was like, OK, we now are at preferences.

00:09:06.260 | And each step along the way, there's

00:09:07.740 | kind of different assumptions that you're making.

00:09:10.060 | We'll get into these.

00:09:10.900 | And those assumptions are built on other fields of work.

00:09:14.420 | So that's what this slide is getting to say.

00:09:16.260 | It's like RLHF, while directly building

00:09:18.340 | on tools from RL and language models,

00:09:20.340 | is really implicitly impacted and built

00:09:24.060 | on theories and philosophies spanning tons of human history.

00:09:29.940 | I think we cite Aristotle in this paper, which is fun.

00:09:32.820 | It's like going pre-BC.

00:09:35.500 | It's like 2,300 years old or something like that.

00:09:38.220 | So that's the reason to do this.

00:09:39.820 | I think we kind of list some things in the paper

00:09:42.700 | about summarizing what different presumptions of RLHF could be.

00:09:46.860 | I think going through these is actually kind of funny.

00:09:50.740 | It's fun to talk about these, because they're

00:09:52.580 | kind of grab bags of things that you'll

00:09:55.180 | see return throughout this podcast that we're

00:09:57.540 | talking about it.

00:09:58.820 | The core thing of RLHF, in order to be a believer in this,

00:10:02.380 | is that RL actually works.

00:10:04.140 | It's like, if you have a reward function,

00:10:05.820 | you can optimize it in some way and get a different performance

00:10:08.440 | out of it.

00:10:09.260 | And you could do this at scale.

00:10:10.380 | And you could do this in really complex environments, which

00:10:12.900 | is--

00:10:13.460 | I don't know how to do that in all the domains.

00:10:15.380 | I don't know how to exactly make chat GPT.

00:10:17.980 | So it's kind of-- we'll overshadow everything.

00:10:19.900 | And then there's go from something kind of obvious

00:10:22.020 | like that.

00:10:22.540 | And then you read the von Neumann-Morgenstern utility

00:10:27.220 | theorem, which is essentially an economic theory that

00:10:30.500 | says you can weight different probabilities

00:10:33.140 | of different people, which is a theoretical piece of work that

00:10:36.660 | is the foundation of utilitarianism.

00:10:38.380 | And trying to quantify preferences

00:10:41.020 | is crucial to doing any sort of RLHF.

00:10:44.260 | And if you look into this, all of these things,

00:10:47.980 | there's way more you could go into if you're

00:10:49.820 | interested in any of these.

00:10:50.940 | So this is kind of like grabbing a few random things.

00:10:53.100 | And then kind of similar to that is the Bradley-Terry model,

00:10:55.380 | which is the fancy name for the pairwise preferences

00:10:57.700 | that everyone is doing.

00:10:59.500 | And then all the things that are like that

00:11:01.500 | Anthropic and OpenAI figured out that you can do,

00:11:03.660 | which is that you can aggregate preferences

00:11:05.420 | from a bunch of different people and different sources.

00:11:07.780 | And then when you actually do RLHF,

00:11:09.580 | you extract things from that data.

00:11:11.460 | And then you train a model that works somehow.

00:11:13.620 | And we don't know-- there's a lot of complex links there.

00:11:16.940 | But if you want to be a believer in doing this at scale,

00:11:19.580 | these are the sorts of things that you

00:11:21.220 | have to accept as preconditions for doing RLHF.

00:11:26.260 | Yeah.

00:11:26.780 | You have a nice chart of the sort of intellectual history

00:11:29.580 | of RLHF that we'll send people to refer to,

00:11:32.260 | either in your paper or in the YouTube video for this podcast.

00:11:35.740 | But I like the other slide that you have on the presumptions

00:11:38.500 | that you need to have for RLHF to work.

00:11:40.860 | You already mentioned some of those.

00:11:43.180 | And I don't know, do you think that any one of them

00:11:45.900 | are-- which one's underappreciated?

00:11:49.620 | This is the first time I've come across the V&M utility

00:11:51.980 | theorem.

00:11:52.760 | Yeah, I know.

00:11:53.300 | This is what you get from working with people.

00:11:55.260 | Like, to my co-host on the podcast,

00:11:56.980 | the retort is that he's a sociologist by training.

00:11:59.380 | So he knows all these things and who

00:12:00.960 | the philosophers are that found these different things,

00:12:03.940 | like utilitarianism.

00:12:05.060 | But there's a lot that goes into this.

00:12:07.860 | Essentially, there's even economic theories

00:12:09.740 | that-- there's debate whether or not preferences exist at all.

00:12:12.980 | And there's different types of math

00:12:15.140 | you can use with whether or not you actually

00:12:16.980 | can model preferences at all.

00:12:18.460 | So it's pretty obvious that RLHF is built

00:12:20.420 | on the math that thinks that you can actually

00:12:22.660 | model any human preference.

00:12:24.260 | But this is the sort of thing that's

00:12:25.760 | debated-- been debated for a long time.

00:12:27.420 | So all the work that's here is like--

00:12:29.140 | and people hear about in their AI classes.

00:12:31.100 | So like Jeremy Bentham, like hedonic calculus,

00:12:33.140 | hedonic calculus, and all these things.

00:12:34.780 | Like, these are the side of work where

00:12:36.400 | people assume that preferences can be measured.

00:12:38.420 | And this is-- like, I don't really know.

00:12:40.540 | Like, when you look at-- this is where I kind of go on a rant

00:12:43.420 | and I say that in RLHF, calling things a preference model

00:12:46.340 | is a little annoying.

00:12:47.220 | Because there's no inductive bias of what a preference is.

00:12:50.240 | It's like if you were to learn a robotic system,

00:12:52.360 | and you learned a dynamics model,

00:12:53.740 | like, hopefully, that actually mirrors the world

00:12:56.260 | in some way of the dynamics.

00:12:57.820 | But with a preference model, it's

00:12:59.240 | like, oh, I don't know what this model--

00:13:01.100 | like, I don't know what ChagGPT encodes

00:13:02.620 | as any sort of preference or what

00:13:04.040 | I would want it to be in a fair way.

00:13:05.540 | Anthropic has done more work on trying

00:13:07.080 | to write these things down.

00:13:08.860 | But even, like, if you look at Claude's constitution,

00:13:11.980 | like, that doesn't mean the model believes these things.

00:13:14.660 | It's just trained to prioritize these things.

00:13:17.120 | And that's kind of what the later points,

00:13:18.860 | I'm looking at, like, what RLHF is doing

00:13:20.980 | and if it's actually, like, a repeatable process in the data

00:13:24.300 | and in the training.

00:13:25.820 | That's just unknown.

00:13:26.660 | And we have a long way to go before we

00:13:28.820 | understand what this is and the link between preference

00:13:32.340 | data and any notion of, like, writing down a specific value.

00:13:36.380 | The disconnection between more, you know,

00:13:39.700 | sociology work versus computer work already exists?

00:13:43.780 | Or is it, like, a recent cross-contamination?

00:13:46.460 | Because when we had 3DAL on the pockets,

00:13:48.620 | it's a flash of attention came to be.

00:13:51.060 | Because at AZ, they have so much overlap between systems

00:13:53.820 | engineer and, like, deep learning engineers.

00:13:56.220 | Like, is it the same in this field?

00:13:59.140 | There are a lot of people-- so I've

00:14:00.560 | gone to a couple of workshops where

00:14:02.100 | these-- the populations of people

00:14:03.940 | who you'd want to include this, like, are.

00:14:05.820 | I think the reason why it's not really talked about

00:14:08.100 | is just because the RLHF techniques that people use

00:14:11.420 | were built in, like, labs like OpenAI and DeepMind,

00:14:15.180 | where there are some of these people.

00:14:17.740 | These places do a pretty good job

00:14:19.100 | trying to get these people in the door

00:14:20.700 | when you compare them to, like, startups or normal startups.

00:14:23.200 | But, like, they're not bringing in, like, academics

00:14:26.580 | from economics, like, social choice theory.

00:14:29.300 | There's just too much.

00:14:30.820 | Like, the criticism of this paper that this is based on

00:14:33.380 | is, like, oh, you're missing these things in RL or this

00:14:35.980 | decade of RL.

00:14:36.740 | And it's like, well, it would literally

00:14:38.860 | be bigger than the Sutton and Bartow book

00:14:40.820 | if you were to include everyone.

00:14:42.340 | So it's really hard to include everyone in a principled manner

00:14:46.020 | when you're designing this.

00:14:47.220 | It's just a good way to understand and improve

00:14:51.180 | the communication of what RLHF is

00:14:53.100 | and, like, what is a good reward model for society.

00:14:56.340 | It really probably comes down to what an individual wants.

00:14:59.380 | And it'll probably motivate models

00:15:01.420 | to move more in that direction and just

00:15:03.080 | be a little bit better about the communication, which

00:15:05.300 | is a recurring theme in my work, is, like, I just

00:15:07.660 | get frustrated when people say things that don't really

00:15:10.500 | make sense, especially when it's going to, like, manipulate

00:15:13.300 | individuals' values or manipulate the general view of AI

00:15:17.080 | or anything like this.

00:15:18.000 | So that's kind of why RLHF is so interesting.

00:15:21.220 | It's, like, it's very vague in its actual--

00:15:25.420 | in what it's actually doing, while the problem specification

00:15:28.020 | is very general.

00:15:29.660 | So reinforcement learning, I kind of mentioned this.

00:15:31.820 | It's a trial and error type of system.

00:15:35.980 | The diagram in the slides is really

00:15:37.980 | the classic thing where you have an agent interacting

00:15:40.260 | with an environment.

00:15:41.100 | So it's kind of this agent has some input

00:15:43.220 | to the environment, which is called the action.

00:15:45.540 | The environment returns a state and a reward.

00:15:49.140 | And that repeats over time.

00:15:50.980 | And the agent learns based on these states and these rewards

00:15:54.140 | that it's seeing.

00:15:55.300 | And it should learn a policy that makes the rewards go up.

00:15:58.500 | That seems pretty simple.

00:16:00.740 | If you try to mentally map what this looks like in language,

00:16:03.380 | which is slide seven, is that, like, the language models

00:16:07.540 | don't make this easy.

00:16:08.740 | I think with a language model, it's

00:16:10.160 | very hard to define what an environment is.

00:16:12.660 | So if the language model is the policy and it's generating,

00:16:16.260 | it's like the environment should be a human.

00:16:18.580 | But setting up the infrastructure

00:16:20.420 | to take tens of thousands of prompts and generate them

00:16:24.020 | and then show them to a human and collect the human responses

00:16:26.740 | and then shove that into your training architecture

00:16:29.380 | is very far away from working.

00:16:31.100 | So we don't really have an environment.

00:16:32.680 | We just have a reward model that returns a reward.

00:16:35.300 | And the state doesn't really exist.

00:16:36.820 | When you look at it like an RL problem, what happens

00:16:40.420 | is the state is a prompt.

00:16:42.500 | And then you do a completion.

00:16:44.020 | And then you throw it away.

00:16:44.780 | And you grab a new prompt.

00:16:46.020 | We're really in, like, RL.

00:16:48.060 | As an RL researcher, you would think of this

00:16:49.860 | as being like you take a state.

00:16:51.100 | You get some completion from it.

00:16:54.200 | And then you look at what that is.

00:16:55.620 | And you keep kind of iterating on it.

00:16:56.960 | And all of that isn't here, which

00:16:58.380 | is why you'll hear RLHF referred to as a bandit's problem, which

00:17:01.340 | is kind of like you choose one action.

00:17:03.100 | And then you watch the dynamics play out.

00:17:04.860 | There's many more debates that you can have in this.

00:17:10.260 | If you get the right RL people in the room,

00:17:12.300 | then kind of like this is an RL even when you zoom

00:17:16.300 | into what RLHF is doing.

00:17:18.460 | Does this change as you think about a chain of thought,

00:17:21.940 | reasoning, and things like that?

00:17:23.340 | Does the state become part of the chain

00:17:26.140 | that you're going through?

00:17:28.080 | There's work that I mentioned on one slide called process reward

00:17:30.820 | models that essentially rewards each step in the chain

00:17:34.220 | of thought reasoning, which it doesn't really

00:17:37.380 | give the part of interaction.

00:17:39.180 | But it does make it a little bit more fine-grained, where

00:17:41.520 | you can think about calling it at least you have many states

00:17:45.300 | from your initial state.

00:17:47.020 | That formulation I don't think people have fully settled on.

00:17:49.860 | I think there's a bunch of great work out there.

00:17:52.420 | Even OpenAI is releasing a lot of this.

00:17:54.300 | And Let's Verify Step-by-Step is their pretty great paper.

00:17:58.060 | On the matter, I think in the next year,

00:18:01.280 | that'll probably get made more concrete by the community

00:18:06.240 | on if you can easily draw out if chain of thought reasoning

00:18:10.200 | is more like RL.

00:18:13.080 | RLHF for decision making.

00:18:15.000 | You have a slide here that compares

00:18:16.800 | pre-deep RL versus deep RL.

00:18:19.520 | Yeah, this is just to say that this

00:18:21.120 | is getting into the history of things, which

00:18:22.960 | is showing that the work that people are using now

00:18:25.720 | really came from well outside of NLP.

00:18:28.840 | And it came before deep learning was big.

00:18:30.800 | And the step from this paper, Tamer, which is from 2008,

00:18:36.120 | some names that are still really relevant and kind

00:18:38.480 | of human-centric RL, Bradley Knox and Peter Stone,

00:18:43.640 | if you have an agent take an action,

00:18:46.120 | you would just have a human give a score from 0 to 1

00:18:48.760 | as a reward, rather than having a reward function.

00:18:50.920 | And then with that classifier, you

00:18:52.300 | can do something with a policy that

00:18:54.560 | learns to take actions to maximize that reward.

00:18:56.960 | It's a pretty simple setup.

00:18:58.200 | It works in simple domains.

00:19:00.400 | And then the reason why this is interesting

00:19:02.400 | is you compare it to the paper that everyone knows,

00:19:04.520 | which is this Paul Cristiano et al.

00:19:06.200 | Deep Reinforced Learning from Human Preferences paper,

00:19:08.440 | which is where they showed that learning from human preferences

00:19:11.680 | you can solve the basic RL tasks at the time.

00:19:14.400 | So various control problems and simulation

00:19:16.680 | and this kind of human preferences approach

00:19:20.300 | had higher rewards in some environments

00:19:22.600 | than if you just threw RL at the environment that

00:19:25.120 | returned a reward.

00:19:26.440 | So the preferences thing was you took two trajectories.

00:19:29.800 | So in this case, it was complete trajectories of the agent.

00:19:32.640 | And the human was labeling which one is better.

00:19:34.680 | And you can see how this kind of comes

00:19:36.380 | to be like the pairwise preferences that are used today

00:19:38.720 | that we'll talk about.

00:19:40.280 | And there's also a really kind of interesting nugget

00:19:42.800 | that is the trajectory that the humans were labeling over

00:19:45.920 | has a lot more information than the RL algorithm

00:19:48.000 | would see if you just had one state, which

00:19:50.120 | is kind of why people think that it's

00:19:52.000 | like why the performance in this paper was so strong.

00:19:54.880 | But I still think that it's surprising

00:19:56.480 | that there isn't more RL work of this style happening now.

00:20:01.000 | This paper is in 2017, so it's like six years later.

00:20:03.600 | And I haven't seen things that are exactly similar,

00:20:06.440 | but it's a great paper to understand

00:20:08.600 | where stuff that's happening now kind of came from.

00:20:11.360 | And that's what the next few slides kind of go into.

00:20:14.080 | Just on the Cristiano paper, you mentioned the performance

00:20:17.400 | being strong.

00:20:17.960 | I don't remember-- what results should I have in mind

00:20:20.520 | when I think about that paper?

00:20:22.080 | It's mostly like if you think about an RL learning curve,

00:20:24.440 | which is like on the x-axis, you have environment interactions.

00:20:27.280 | On the y-axis, you have performance.

00:20:29.000 | You can think about different like ablation studies

00:20:31.120 | of between algorithms.

00:20:32.040 | So I think they use like A2C, which I don't even

00:20:34.120 | remember what that stands for, as their baseline.

00:20:36.600 | But if you do the human preference version

00:20:38.640 | on a bunch of environments like the human preference labels,

00:20:42.560 | the agent was able to learn faster

00:20:44.760 | than if it just learned from the signal from the environment,

00:20:47.800 | which means like the setup does--

00:20:50.120 | it's happening because the reward model has

00:20:53.240 | more information than the agent would.

00:20:55.760 | But like the fact that it can do better,

00:20:57.680 | I was like, that's pretty surprising to me

00:20:59.560 | because RL algorithms are pretty sensitive.

00:21:02.040 | So I was like, OK, yeah.

00:21:05.040 | Which is just one thing I do want

00:21:06.560 | to establish as a baseline for our listeners.

00:21:10.080 | Like we are updating all the weights, right?

00:21:13.280 | Like this is, in some sense, the next token prediction

00:21:18.320 | task of training a language model

00:21:20.360 | is a form of reinforcement learning,

00:21:22.600 | except that it's not from human feedback.

00:21:24.280 | It's just self-supervised learning

00:21:26.880 | from a general corpus.

00:21:28.440 | Yeah.

00:21:30.320 | There's one distinction which I love,

00:21:33.080 | which is that you can actually give negative feedback,

00:21:35.680 | whereas in a general sort of pre-training situation,

00:21:39.400 | you cannot.

00:21:41.000 | And maybe the order of magnitude of feedback,

00:21:43.360 | like the Likert scale that you're

00:21:44.760 | going to talk about in future slides,

00:21:46.600 | that actually just gives more signal

00:21:48.360 | than a typical training process would do in a language model

00:21:51.400 | setting.

00:21:53.400 | Yeah, I don't think I'm the right person to comment exactly,

00:21:56.360 | but you can make analogies that reinforcement learning is

00:21:59.280 | self-supervised learning as well.

00:22:00.960 | There are a lot of things that will point to that.

00:22:04.400 | I don't know whether or not it's a richer signal.

00:22:06.400 | I think that could be seen in the results,

00:22:09.160 | but I think it's a good thing for people to look into more.

00:22:14.240 | It's like as reinforcement learning is so much less

00:22:18.360 | compute, it is a richer signal in terms of its impact,

00:22:21.640 | because if they could do what RLHF is doing at pre-training,

00:22:24.200 | they would, but they don't know how

00:22:26.080 | to have that effect in a stable manner.

00:22:28.840 | Otherwise, everyone would do it.

00:22:30.520 | So on a practical basis, as someone fine-tuning models,

00:22:34.800 | I have often wished for negative fine-tuning, which pretty much

00:22:38.120 | doesn't exist in OpenAI land, and it's not

00:22:41.800 | the default setup in OpenStreetMap.

00:22:43.640 | How does this work in diffusion models and stuff?

00:22:45.840 | Because you can give negative prompts to something,

00:22:48.760 | to stable diffusion or whatever.

00:22:51.080 | That's for guidance.

00:22:51.960 | That's for clip guidance.

00:22:53.160 | Is that just from how they prompt it then?

00:22:55.280 | I don't know.

00:22:55.880 | I'm just wondering if we could do something similar.

00:22:58.080 | It's another tangent.

00:23:00.200 | Anyway, so I do want to spell that out for people

00:23:02.920 | in case they haven't made the connection between RLHF

00:23:05.600 | and the rest of the training process

00:23:07.080 | that they might have some familiarity with.

00:23:09.120 | Yeah, so these coming slides can really

00:23:11.920 | dig into this, which is like this 2018 paper that

00:23:14.560 | was a position paper from a bunch of the same authors

00:23:17.600 | from the Christiano paper and from the OpenAI work

00:23:20.960 | that everyone knows, which is like they write a position

00:23:25.120 | paper on what a reward model could do

00:23:27.640 | to solve alignment for agents.

00:23:30.160 | It's kind of based on two assumptions.

00:23:31.680 | The first assumption is that we can learn user intentions

00:23:34.040 | to a sufficiently high accuracy.

00:23:36.400 | That doesn't last with me, because I

00:23:38.360 | don't know what that means.

00:23:39.520 | But the second one is pretty telling

00:23:41.020 | in the context of RLHF, which is for many tasks

00:23:43.120 | we want to solve, evaluation of outcomes

00:23:44.800 | is easier than producing the correct behavior.

00:23:46.800 | And this is the whole thing.

00:23:47.960 | It's like we can compare two poems that the model generates,

00:23:51.040 | and it can be viewed as liking a positive example,

00:23:57.280 | or it could be viewed as really disliking a negative example.

00:24:00.200 | And that's what I think a lot of people

00:24:02.400 | are doing in the harm space, is a harmful response

00:24:06.040 | to a language model, whether or not

00:24:07.160 | you agree with the company's definition of harms,

00:24:09.240 | is that it's a really bad negative example.

00:24:12.800 | And they downweight them by preferring something

00:24:15.640 | more benign in the RLHF process, among other ways

00:24:18.760 | of dealing with safety.

00:24:20.080 | So this is a good way of saying it's like this is core.

00:24:23.200 | This kind of comparison and positive or negative example

00:24:26.240 | is core to all of the RLHF work that has continued.

00:24:29.760 | Yeah.

00:24:30.840 | Maybe I'll try to put a more colloquial restatement of this.

00:24:34.600 | People often say, I don't know what I want,

00:24:36.440 | but I'll know when I see it.

00:24:37.880 | This is that expressed in [INAUDIBLE]

00:24:40.400 | Yeah, it is.

00:24:41.240 | Yeah, it is.

00:24:41.800 | That's what everyone's doing in the preference modeling

00:24:44.640 | stage that we'll get to.

00:24:45.880 | Yeah.

00:24:47.320 | Yeah, and you can see there are more papers.

00:24:49.240 | This is really just to have all the links for people

00:24:53.400 | that go deeper.

00:24:54.080 | There's a Ziegler et al paper in 2019,

00:24:57.360 | which shows that you can do this RLHF process on language

00:25:00.240 | models.

00:25:01.680 | This familiar diagram starts to emerge in 2019.

00:25:04.280 | It's just to show that this goes really far back.

00:25:06.320 | I think we can kind of breeze through some of these.

00:25:08.520 | And then 2020 is the first open AI experiment

00:25:10.960 | that I think caught people's eyes, which

00:25:12.640 | is this learning to summarize experiment.

00:25:15.080 | It has this three-step process that

00:25:17.560 | we'll go into more when I kind of go into the main concepts.

00:25:20.840 | But it's like the first time you see this diagram that they

00:25:23.480 | reuse with InstructGPT.

00:25:25.080 | They reuse with ChatGPT.

00:25:27.280 | And the types of examples that they would have--

00:25:29.280 | I don't think I need to read these exactly,

00:25:31.360 | but one that I have read a whole bunch of times

00:25:33.680 | is they took these prompts from Reddit that was like,

00:25:37.320 | explain like I'm five or get career advice.

00:25:39.920 | And people really pour their heart and soul into these.

00:25:42.840 | So these are like multi-paragraph pieces of writing.

00:25:45.480 | And then they essentially do comparisons

00:25:47.600 | between a vanilla language model.

00:25:49.400 | I think it was, what's the timeline?

00:25:51.520 | Either GPT-2 or GPT-3.

00:25:53.320 | I always get the exact--

00:25:54.780 | 3 was early 2020, so that's about right.

00:25:57.400 | Yeah, so this is probably done with GPT-2.

00:26:00.360 | It doesn't really matter.

00:26:01.480 | But the language model does normal things.

00:26:03.280 | You do a few shot, which is like it repeats itself.

00:26:05.720 | It doesn't have nice text.

00:26:07.560 | And what they did is that this was the first time

00:26:09.920 | where the language model would generate pretty nice text

00:26:13.200 | from an output.

00:26:14.440 | It was restricted to the summarization domain.

00:26:17.280 | But I think that--

00:26:19.000 | this is where I wish I was paying attention more,

00:26:21.160 | because I would see the paper, but I

00:26:22.800 | didn't know to read the language model outputs

00:26:25.320 | and kind of understand this qualitative sense of the models

00:26:29.600 | very well then.

00:26:30.360 | Because you look at the plots in the papers.

00:26:33.240 | Learning to summarize and destruct GPT

00:26:35.200 | have incredibly pretty plots, just

00:26:37.720 | with nicely separated lines with error bars.

00:26:41.400 | And they're like super fine-tuning works.

00:26:44.120 | The RL step works.

00:26:45.320 | But if you were early to see how different the language that

00:26:49.520 | was written by these models was, I

00:26:51.320 | think you could have been early to things like chat GPT

00:26:54.520 | and knowing RLHF would matter.

00:26:56.840 | But that's now, I think, obvious.

00:26:59.240 | The good people know to chat with language models,

00:27:01.680 | but not even everyone does this.

00:27:03.200 | Like, people are still looking at numbers.

00:27:04.920 | And I think OpenAI probably figured it out

00:27:07.600 | when they were doing this, how important that could be.

00:27:09.920 | And then they had years to kind of chisel away at that.

00:27:13.160 | And that's why they're doing so well now.

00:27:15.440 | Yeah, I mean, arguably, it's well-known

00:27:17.280 | that chat GPT was kind of an accident,

00:27:18.840 | that they didn't think it would be that big of a deal.

00:27:21.360 | Yeah.

00:27:21.920 | So maybe they didn't.

00:27:22.920 | Maybe they didn't, but they were getting the proxy

00:27:25.040 | that they needed.

00:27:26.520 | I've heard off the record from other labs

00:27:28.880 | that it was in the air.

00:27:30.000 | If OpenAI didn't do it, someone else would have done it.

00:27:32.360 | Yeah.

00:27:33.240 | So you've mentioned a couple of other papers

00:27:35.360 | that are very seminal to this period.

00:27:37.160 | And I love how you say way back when in referring to 2019.

00:27:40.480 | It feels like it in my life.

00:27:42.760 | So how much should people understand

00:27:45.360 | the relationship between RLHF, instruction tuning,

00:27:47.920 | PPO, KL divergence, anything like that?

00:27:50.840 | Like, how would you construct the level of knowledge

00:27:54.960 | that people should dive into?

00:27:56.240 | Like, what should people know at the high level?

00:27:58.320 | And then if people want to dive in deeper, where do they go?

00:28:01.520 | Like, is instruction tuning important here?

00:28:06.400 | Or is that part of the overall process

00:28:08.680 | towards modern RLHF?

00:28:11.200 | I think for most people, instruction tuning

00:28:13.160 | is probably still more important in their day-to-day life.

00:28:15.640 | I think instruction tuning works very well.

00:28:19.120 | You can write samples by hand that make sense.

00:28:22.040 | You can get the model to learn from them.

00:28:24.360 | You could do this with very low compute.

00:28:26.800 | It's easy to do almost in no-code solutions at this point.

00:28:31.040 | And the loss function is really straightforward.

00:28:33.480 | And then if you're interested in RLHF,

00:28:37.920 | you can kind of learn from it from a different perspective,

00:28:40.200 | which is like how the instruction tuning distribution

00:28:42.680 | makes it easier for your RLHF model to learn.

00:28:45.480 | There's a lot of details depending

00:28:47.280 | on your preference data, if it's close to your instruction

00:28:50.440 | model or not, if that matters.

00:28:52.520 | But that's really at the RLHF stage.

00:28:54.080 | So I think it's nice to segment and just kind of understand

00:28:56.440 | what your level of investment and goals are.

00:28:58.840 | I think instruction tuning still can do most

00:29:01.240 | of what you want to do.

00:29:02.720 | And if you want to think about RLHF,

00:29:05.200 | at least before DPO really had taken off at all,

00:29:08.720 | it would be like, do you want to have a team of at least five

00:29:11.640 | people if you're really thinking about doing RLHF?

00:29:14.600 | I think DPO makes it a little bit easier.

00:29:16.440 | But that's still really limited to kind of one data set

00:29:18.640 | that everyone's using at this point.

00:29:20.080 | Everyone's using this ultra-feedback data set.

00:29:21.960 | And it boosts Alpaca, VAL, MTBench, TruthfulQA,

00:29:26.320 | and the qualitative model a bit.

00:29:28.000 | We don't really know why.

00:29:29.000 | And it's like, it might just be that data set combined

00:29:31.280 | with the method.

00:29:32.520 | But you've got to be ready for a bumpy ride

00:29:34.600 | if you're wanting to try to do RLHF.

00:29:36.760 | I don't really recommend most startups

00:29:39.160 | to do it unless it's going to provide them

00:29:41.600 | a clear competitive advantage in their kind of niche.

00:29:45.560 | Because you're not going to make your model chat GPT-like better

00:29:48.640 | than OpenAI or anything like that.

00:29:50.360 | You've got to accept that there's some exploration there.

00:29:53.400 | And you might get a vein in your specific--

00:29:55.960 | like a vein of benefit in your specific domain.

00:29:58.360 | But I'd still be careful going into the RLHF can of worms.

00:30:03.160 | You probably don't need to.

00:30:05.080 | OK, so there's a bit of a time skip in what you mentioned.

00:30:07.720 | DPO is like a couple of months old.

00:30:09.680 | So we'll leave that towards the end.

00:30:12.480 | I think the main result that I think

00:30:14.600 | most people talk about at this stage--

00:30:16.400 | we're talking about September 2020 and then going into,

00:30:18.760 | I guess, maybe last year--

00:30:20.720 | was Vicuña as one of the more interesting applications

00:30:25.320 | of instruction tuning that pushed LLAMA 1 from,

00:30:30.200 | let's say, a GPT-3-ish model to a GPT-3.5 model

00:30:34.040 | in pure open source with not a lot of resources.

00:30:36.600 | I think-- I mean, they said something like they

00:30:39.200 | used under $100 to make this.

00:30:41.480 | Yeah, instruction tuning can really go a long way.

00:30:44.160 | I think the claims of chat GPT level

00:30:47.360 | are long overblown in most of the things in open source.

00:30:51.000 | I think it's not to say--

00:30:53.920 | like, Vicuña was a huge step.

00:30:55.760 | And it's just kind of showing that instruction

00:30:58.280 | tuning with the right data will completely

00:31:00.520 | change what it feels like to talk with your model.

00:31:03.720 | From text completion to actually chatting

00:31:06.160 | back and forth, multi-turn.

00:31:08.400 | Yeah, instruction tuning can be multi-turn.

00:31:10.360 | Just having a little bit of data that's a couple turns

00:31:12.760 | can go a really long way.

00:31:14.560 | And it's, I think, people--

00:31:16.580 | that was like the story of the whole first part of the year

00:31:18.920 | is people will be surprised by how far you can take

00:31:21.480 | instruction tuning on a small model.

00:31:23.680 | I think the things that people see now

00:31:26.040 | is the small models don't really handle nuance as well.

00:31:29.280 | And they could be more repetitive,

00:31:30.880 | even if they have really good instruction tuning.

00:31:32.920 | But if you take that kind of $7 to $70 billion parameter jump,

00:31:36.600 | the instruction tuning at the bigger model is robustness.

00:31:40.440 | Little things make more sense.

00:31:42.320 | But that's still just with instruction tuning

00:31:44.160 | and scale more than anything else.

00:31:46.680 | Yeah, excellent.

00:31:49.040 | Shall we go to technical overview?

00:31:51.560 | Yeah, this is kind of where we go through my own version

00:31:54.040 | of this three-phase process.

00:31:55.600 | You can talk about instruction tuning, which

00:31:57.480 | we've talked about a lot.

00:31:58.720 | It's funny because all these things, instruction tuning

00:32:00.520 | has the fewest slides, even though it's

00:32:02.360 | the most practical thing for most people.

00:32:05.120 | We could save the debate for if the big labs still

00:32:07.680 | do instruction tuning for later.

00:32:09.120 | But that's a coming wave for people.

00:32:12.080 | And then like preference data and training,

00:32:14.320 | and then what does reinforcement learning optimization actually

00:32:17.920 | mean.

00:32:18.680 | We talk about these sequentially because you really

00:32:20.840 | have to be able to do each of them

00:32:22.200 | to be able to do the next one.

00:32:23.400 | You need to be able to have a model that's

00:32:25.160 | chatty or helpful, instruction following.

00:32:27.360 | Every company has their own word that they

00:32:29.120 | like to assign to what instructions mean.

00:32:31.120 | And then once you have that, you can collect preference data

00:32:33.620 | and do some sort of optimization.

00:32:35.160 | When you say word, you mean like angle bracket, inst?

00:32:39.120 | Or do you mean something else?

00:32:40.520 | Oh, I don't even know what inst means.

00:32:42.120 | But I'm just saying they use their adjective that they like.

00:32:45.960 | I think entropic, also like steerable, is another one.

00:32:48.560 | I see, I see, I see.

00:32:49.480 | Just the way they describe it.

00:32:50.760 | Yeah.

00:32:51.840 | Yeah, so instruction tuning, we've covered most of this.

00:32:54.280 | It's really about you should try to adapt

00:32:56.240 | your models to specific needs.

00:32:57.640 | It makes models that were only OK extremely comprehensible.

00:33:02.120 | A lot of the times, it's where you start

00:33:04.200 | to get things like chat templates.

00:33:05.620 | So if you want to do system prompts,

00:33:07.120 | if you want to ask your model, act like a pirate.

00:33:10.380 | That's one of the ones I always do, which is always funny.

00:33:12.800 | But whatever you-- act like a chef, like anything.

00:33:16.920 | This is where those types of things

00:33:19.320 | that people really know in language models

00:33:22.160 | start to get applied.

00:33:23.360 | So it's good as a kind of starting point,

00:33:26.360 | because this chat template is used in RHF and all

00:33:29.320 | of these things down the line.

00:33:31.640 | But there's a basic pointer.

00:33:33.600 | It's like, once you see this with instruction tuning,

00:33:36.420 | you really know it, which is like you take things like stack

00:33:38.840 | overflow, where you have a question and an answer.

00:33:40.880 | You format that data really nicely.

00:33:43.120 | You push it through the model.

00:33:44.360 | The model then kind of knows what to do.

00:33:46.720 | When somebody asks a question, there's much more--

00:33:50.080 | there's surely kind of more tricky things that people do.

00:33:53.480 | But I still think the vast majority of it

00:33:55.240 | is question answer.

00:33:56.200 | It's like, please explain this topic to me.

00:33:59.160 | Generate this thing for me.

00:34:00.560 | That hasn't changed that much this year.

00:34:02.480 | I think people have just gotten better at kind of scaling up

00:34:05.880 | the data that they need.

00:34:08.760 | Yeah, this is where this talk will kind of take a whole left

00:34:11.720 | turn into more technical detail land.

00:34:15.400 | I put a slide with the RHF objective, which I think

00:34:18.320 | is good for people to know.

00:34:19.560 | I've started going back to this more

00:34:21.720 | to just kind of understand what is trying to happen here

00:34:25.480 | and what type of math people could do.

00:34:27.800 | I think because of this algorithm,

00:34:29.480 | we've mentioned this, it's in the air,

00:34:31.060 | direct preference optimization.

00:34:32.400 | But everything kind of comes from an equation

00:34:34.280 | of trying to learn a policy that maximizes the reward.

00:34:38.760 | The reward is some learned metric.

00:34:40.680 | A lot can be said about what the reward should

00:34:42.640 | be subject to some constraint, which the most popular

00:34:45.960 | constraint is the KL-distraint, which is just

00:34:48.000 | a distributional distance.

00:34:50.160 | Essentially, in language models, that

00:34:51.620 | means if you have a completion from your instruction or RHF

00:34:55.600 | model, you can compare that completion to a base model.

00:34:59.400 | And looking at the log probs from the model, which

00:35:02.440 | are essentially how likely each token is,

00:35:05.000 | you can see a rough calculation of the distance

00:35:07.480 | between these two models just as a scalar number.

00:35:10.160 | I think what that actually looks like in code,

00:35:12.520 | you can look at it.

00:35:13.800 | It would be like a sum of log probs

00:35:16.440 | that you get right from the model.

00:35:17.960 | It'll look much more simpler than it sounds.

00:35:20.680 | But it is just to make the optimization kind of stay

00:35:23.560 | on tracks.

00:35:24.160 | It's a guardrail that's--

00:35:26.120 | Make sure it doesn't overfit to your RHF data.

00:35:29.520 | Because we have so little data and our RHF overfitting

00:35:32.140 | is really something that could happen.

00:35:33.720 | I think it'll fit to specific features

00:35:37.080 | that labelers like to see, that the model likes to generate,

00:35:41.200 | punctuation, weird tokens, like calculator tokens.

00:35:44.920 | It could overfit to anything if it's in the data a lot

00:35:47.240 | and it happens to be in a specific format.

00:35:49.920 | And the KL constraint prevents that.

00:35:52.520 | There's not that much documented work on that,

00:35:55.040 | but there's a lot of people that know if you take that away,

00:35:57.640 | it just doesn't work at all.

00:35:58.920 | So it is important, but I think it's something that people

00:36:02.240 | don't focus on too much.

00:36:04.240 | But as an objective, as I said, it's just kind of--

00:36:06.680 | you optimize the reward.

00:36:08.000 | The reward is where the human part of this comes in.

00:36:10.560 | We'll talk about that next.

00:36:11.720 | And then subject to a constraint,

00:36:13.960 | don't change the model too much.

00:36:15.660 | The real questions are, how do you implement the reward?

00:36:18.040 | And then how do you make the reward go up

00:36:19.920 | in a meaningful way?

00:36:21.680 | So like a preference model, the task

00:36:23.360 | is kind of to design a human reward.

00:36:25.200 | I think the key--

00:36:27.000 | the equation that most of the stuff is based on right now

00:36:29.920 | is something called a Bradley-Terry model, which

00:36:31.920 | is like a pairwise preference model where

00:36:33.680 | you compare two completions, and you

00:36:35.280 | say which one you like better.

00:36:36.560 | I'll show an interface that Anthropic uses here.

00:36:40.160 | And the Bradley-Terry model is really

00:36:42.540 | a fancy probability between two selections.

00:36:46.040 | And what's happening in the math is

00:36:47.920 | that if you look at the prob--

00:36:49.920 | you're looking at the probability that the chosen

00:36:52.440 | completion, the one you like better,

00:36:54.080 | is actually the better completion

00:36:56.320 | over the rejected completion.

00:36:58.360 | And what these preference models do

00:37:00.520 | is they assume this probability is correlated to reward.

00:37:04.120 | So if you just sample from this probability,

00:37:06.000 | it'll give you a scalar.

00:37:07.080 | And then you use that reward later on

00:37:08.640 | to signify what piece of text is better.

00:37:12.120 | I'm kind of inclined to breeze through the math stuff,

00:37:17.800 | because otherwise it's going to be not as good to listen to.

00:37:20.400 | Yeah, no, no.

00:37:22.920 | I think people want to hear it.

00:37:24.560 | I think there's a lot of higher level explanations out there.

00:37:28.200 | Yeah, yeah.

00:37:29.240 | So the real thing is you need to assign a scalar reward of how

00:37:32.080 | good a response is.

00:37:33.440 | And that's not necessarily that easy to understand.

00:37:36.880 | Because if we take back to one of the first works

00:37:39.920 | I mentioned, this tamer thing for decision making,

00:37:42.600 | people tried that with language models, which

00:37:44.520 | is if you have a prompt and a completion

00:37:46.160 | and you just have someone rate it from 0 to 10,

00:37:48.840 | could you then train a reward model

00:37:50.400 | on all of these completions and 0 to 10 ratings

00:37:52.960 | and see if you could actually change--

00:37:56.280 | can you get trap 2BT with that?

00:37:57.720 | And the answer is really kind of no.

00:37:59.680 | A lot of people tried that.

00:38:00.800 | It didn't really work.

00:38:01.800 | And then that's why they tried this pairwise preference thing.

00:38:04.400 | And it happened to work.

00:38:05.520 | And this Bradley Terry model comes from the '50s.

00:38:09.800 | It's from these fields that I was mentioning earlier.

00:38:12.480 | And it's wild how much of this happens.

00:38:15.760 | I mean, this screenshot I have on the slides

00:38:17.640 | is from the DPO paper.

00:38:18.720 | I think it might be the appendix.

00:38:20.060 | But it's still really around in the literature of what

00:38:23.240 | people are doing for RLHF.

00:38:25.160 | So it's a fun one to know.

00:38:26.600 | I'll point out one presumption that this heavily relies on.

00:38:29.320 | You mentioned this as part of your six presumptions

00:38:31.440 | that we covered earlier, which is that you can

00:38:33.360 | aggregate these preferences.

00:38:35.520 | This is not exactly true among all humans, right?

00:38:38.020 | I have a preference for one thing.

00:38:39.440 | You have a preference for a different thing.

00:38:40.820 | And actually coming from economics,

00:38:42.480 | you mentioned economics earlier.

00:38:43.920 | There's a theorem or a name for this called error

00:38:48.880 | impossibility, which I'm sure you've come across.

00:38:50.960 | Yeah, it's one of the many kind of things

00:38:53.220 | we throw around in the paper.

00:38:55.560 | Do we just ignore it?

00:38:56.740 | Yeah.

00:38:57.240 | We just-- yeah, just aggregate.

00:38:58.560 | Yeah, OK.

00:38:59.360 | Yeah.

00:39:00.280 | I think the reason this really is done on a deep level

00:39:03.880 | is that you're not actually trying

00:39:05.640 | to model any contestable preference in this.

00:39:09.480 | You're not trying to go into things that

00:39:11.400 | are controversial or anything.

00:39:13.360 | It's really the notion of preference

00:39:15.720 | is trying to stay around correctness and style

00:39:19.160 | rather than any meaningful notion of preference.

00:39:21.200 | Because otherwise, these companies really don't want to--

00:39:23.700 | they don't want to do this at all.

00:39:26.500 | I think that's just how it is.

00:39:27.780 | And it's like, if you look at what people actually do--

00:39:30.140 | so I have a bunch of slides on the feedback interface.

00:39:33.260 | And they all publish this.

00:39:34.420 | It's always at the appendices of every paper.

00:39:36.700 | Yeah, it's pretty interesting.

00:39:37.980 | Yeah, there's something later on in this talk which is like--

00:39:40.540 | but it's good to mention in this is when you're doing this

00:39:43.380 | preference collection, you write out a very long document

00:39:46.500 | of instructions to people that are collecting this data.

00:39:49.180 | And it's like, this is the hierarchy

00:39:50.880 | of what we want to prioritize.

00:39:52.160 | Something amounting like factuality, helpfulness,

00:39:55.480 | honestness, harmlessness-- these are all different things.

00:39:57.920 | Every company will rank these in different ways,

00:40:00.220 | provide extensive examples.

00:40:01.800 | It's like, if you see these two answers,

00:40:03.840 | you should select this one and why and all of this stuff.

00:40:07.120 | And then my kind of head scratching

00:40:08.640 | is like, why don't we check if the models actually

00:40:10.760 | do these things that we tell the data annotators to collect?

00:40:14.600 | But I think it's because the model--

00:40:16.120 | it's hard to make that attribution.

00:40:17.760 | It'll be really-- it's hard to test if a model is honest

00:40:20.260 | and stuff.

00:40:20.920 | It would just be nice to understand

00:40:22.500 | the kind of causal mechanisms as a researcher

00:40:25.700 | or if our goals are met.

00:40:27.820 | But at a simple level, what it boils down to--

00:40:30.620 | I have a lot more images than I need.

00:40:33.140 | It's like, you're having a conversation with an AI,

00:40:36.140 | something like TypeGPT.

00:40:37.620 | You get shown two responses or more in some papers.

00:40:40.780 | And then you have to choose which one is better.

00:40:42.780 | I think something you'll hear a lot in this space

00:40:44.780 | is something called a Likert scale.

00:40:47.100 | Likert is a name.

00:40:48.400 | It's a name for probably some research in economics,

00:40:51.280 | decision theory, or something.

00:40:52.620 | But essentially, it's a type of scale

00:40:54.180 | where if you have integers from one to eight,

00:40:58.120 | the middle numbers will represent something

00:41:00.300 | close to a tie.

00:41:01.140 | And the smallest numbers will represent one model

00:41:03.280 | being way better than the other.

00:41:04.940 | And the biggest numbers will be the other model's better.

00:41:08.840 | So in the case of one to eight, if you're

00:41:10.600 | comparing models A to B, if you return a one

00:41:12.680 | if you really liked option A, you

00:41:14.040 | return eight if you really like B,

00:41:15.520 | and then a four or five if they were close.

00:41:17.940 | There's other ways to collect this data.

00:41:19.600 | This one's become really popular.

00:41:21.280 | We played with it a bit at Hugging Face.

00:41:22.900 | It's hard to use.

00:41:24.140 | Filling out this preference data is really hard.

00:41:26.100 | You have to read multiple paragraphs.

00:41:27.800 | It's not for me.

00:41:28.560 | Some people really like it.

00:41:29.600 | I hear, I'm like, I can't imagine

00:41:31.000 | sitting there and reading AI-generated text

00:41:33.240 | and having to do that for my job.

00:41:35.760 | But a lot of these early papers in RLHF

00:41:38.400 | have good examples of what was done.

00:41:40.680 | The one I have here is from Anthropic's collection demo.

00:41:45.360 | It's because it was from slides that I did with Anthropic.

00:41:48.560 | But you can look up these in the various papers.

00:41:52.800 | It looks like Chat2PT with two responses,

00:41:55.080 | and then you have an option to say which one is better.

00:41:57.360 | It's nothing crazy.

00:41:59.240 | The infrastructure is almost exactly the same.

00:42:02.040 | But they just log which one you think is better.

00:42:04.920 | I think places like Scale are also really big in this,

00:42:09.400 | where a lot of the labeler companies

00:42:11.920 | will help control who's doing how many samples.

00:42:15.520 | You have multiple people go over the same sample once,

00:42:18.000 | and what happens if there's disagreement?

00:42:20.060 | I don't really think this disagreement

00:42:21.600 | data is used for anything.

00:42:22.840 | But it's good to know what the distribution of prompts is,

00:42:26.080 | who's doing it, how many samples you have,

00:42:28.040 | controlling the workforce.

00:42:29.240 | All of this is very hard.

00:42:31.080 | A last thing to add is that a lot of these companies

00:42:33.200 | do collect optional metadata.

00:42:35.200 | I think the Anthropic example shows

00:42:37.020 | a rating of how good was the prompt, or the conversation,

00:42:42.380 | from good to bad.

00:42:43.820 | Because things matter.

00:42:46.620 | There's a quadrant of preference data in my mind,

00:42:48.740 | which is you're comparing a good answer to a good answer, which

00:42:51.660 | is a really interesting signal.

00:42:53.140 | And then there's the option of you're

00:42:54.740 | comparing a bad answer to a bad answer, which is like,

00:42:57.580 | you don't want to train your model on two different--

00:42:59.780 | You're both terrible.

00:43:00.780 | This is why we did this at Hugging Base.

00:43:02.500 | And it was like, our data was like,

00:43:04.260 | we don't know if we can use this, because a lot of it

00:43:06.220 | was just bad answer to bad answer,

00:43:07.780 | because you're rushing to try to do this real contract.

00:43:10.560 | And then there's also good answer to bad answer,

00:43:12.560 | which I think is probably pretty reasonable to include.

00:43:14.860 | You just prefer the good one, and move on with your life.

00:43:17.940 | Those are very different scenarios.

00:43:19.380 | I think open AIs of the world are all in good answer,

00:43:22.300 | good answer, and have learned to eliminate everything else.

00:43:24.760 | But when people try to do this in open source,

00:43:26.660 | it's probably like what Open Assistance saw,

00:43:28.500 | is there's just a lot of bad answers in your preference

00:43:30.820 | data.

00:43:31.300 | And you're like, what do I do with this?

00:43:33.460 | Metadata flags can help.

00:43:34.860 | I threw in the slide 28.

00:43:37.540 | It's like the instruct GPT metadata.

00:43:40.460 | You can see how much they collect here,

00:43:43.100 | and everything from the model fails to actually complete

00:43:47.900 | the task, hallucinations, different types

00:43:50.580 | of offensive or dangerous content, moral judgment,

00:43:53.300 | expresses opinion.

00:43:55.620 | I don't know exactly if they're doing this now,

00:43:58.340 | but you can kind of see why doing RLHF at scale

00:44:01.060 | and prioritizing a lot of different endpoints

00:44:02.860 | would be hard, because these are all things that you--

00:44:05.780 | I'd be interested if I was scaling up a big team

00:44:08.020 | to do RLHF, and what is going into the preference data,

00:44:10.420 | and what happens?

00:44:11.420 | You do an experiment.

00:44:12.660 | You're like, OK, we're going to remove all the data where

00:44:14.620 | they said the model hallucinates.

00:44:16.020 | Like, does that?

00:44:16.660 | And then retrain everything.

00:44:17.820 | Like, what does that do?

00:44:19.320 | Yeah, so hallucination is big.

00:44:20.900 | But some of these other metadata categories--

00:44:23.500 | and I've seen this in a lot of papers--

00:44:25.300 | it's like, does it contain sexual content?

00:44:27.300 | Does it express a moral judgment?

00:44:28.780 | Does it denigrate a protected class?

00:44:30.300 | That kind of stuff, very binary.

00:44:33.260 | Should people try to adjust for this at the RLHF layer,

00:44:36.580 | or should they put it as a pipeline

00:44:38.740 | where they have a classifier as a separate model that

00:44:42.580 | grades the model output?

00:44:44.780 | Do you mean for training or, like, a deployment?

00:44:47.580 | Deployment.

00:44:48.340 | I do think that people are doing it at deployment.

00:44:50.420 | I think we've seen safety and other things

00:44:53.420 | in the RLHF pipeline.

00:44:56.020 | Like, Lamatu is famous for kind of having

00:44:59.020 | this, like, helpfulness and safety reward models.

00:45:02.420 | Deep in the Gemini report is something

00:45:04.260 | that Gemini has, like, four things, which

00:45:07.260 | is, like, helpfulness, factuality, maybe safety,

00:45:09.700 | maybe something else.

00:45:11.180 | But places like Anthropic and Chattopadhyay and Bard

00:45:14.860 | almost surely have a classifier after,

00:45:17.060 | which is like, is this text good?

00:45:18.460 | Is this text bad?

00:45:19.940 | And that's not that surprising, I think,

00:45:23.260 | because you could use, like, a 100 times smaller language

00:45:25.700 | model and do much better at filtering than RLHF.

00:45:30.020 | But I do think it's still so deeply intertwined

00:45:33.540 | with the motivation of RLHF to be for safety

00:45:35.620 | that some of these categories still persist.

00:45:38.180 | I think that's something that'll kind of settle out, I think.

00:45:41.180 | I'm just wondering if it's worth collecting this data

00:45:43.300 | for the RLHF purpose if you're not going to use it in any way,

00:45:46.060 | because you're just going to use a separate model to--

00:45:48.020 | Yeah, I don't think OpenAI will collect all of this anymore.

00:45:51.380 | But I think, from a research perspective,

00:45:53.220 | it's very insightful to know.

00:45:55.340 | But it's also expensive.

00:45:56.340 | So essentially, your preference data

00:45:57.980 | scales with how many minutes it takes for you to do each task.

00:46:00.660 | And every button is--

00:46:02.180 | it scales pretty linearly.

00:46:04.300 | So it's not cheap stuff.

00:46:07.740 | Since you mentioned expensiveness,

00:46:09.780 | and I think you may have joined one of our spaces

00:46:12.540 | back when Llamatu was released.

00:46:15.420 | We had an estimate from you that was something

00:46:18.340 | on the order of Llamatu costs $3 to $6 million

00:46:22.100 | to train GPU-wise.

00:46:23.500 | And then it was something like $20 to $30 million

00:46:25.900 | in preference data.

00:46:27.860 | Is that something that's still in the ballpark?

00:46:30.020 | I don't need precise numbers.

00:46:30.820 | I think it's still in the ballpark.

00:46:32.280 | I know that the $20 million was off by a factor of four,

00:46:35.020 | because I was converting from a prompt number

00:46:37.380 | to a total data point.

00:46:39.140 | So essentially, when you do this,

00:46:41.180 | if you have multi-turn setting, each turn

00:46:43.260 | will be one data point.

00:46:44.820 | And the Llamatu paper reports 1.5 million data points,

00:46:48.180 | which could be 400,000 prompts.

00:46:50.860 | So I would still say like $6 to $8 million

00:46:54.020 | is safe to say that they're spending, if not more.

00:46:56.460 | They're probably also buying other types of data

00:46:58.500 | and/or throwing out data that they don't like.

00:47:00.420 | But it's very comparable to compute costs.

00:47:03.460 | But the compute costs listed in the paper

00:47:05.900 | always are way lower, because all they have to say

00:47:07.940 | is, what does one run cost?

00:47:09.500 | But they're running tens or hundreds of runs.

00:47:11.420 | So it's like, OK, this is kind of a meaningless number.

00:47:15.840 | The data number would be more interesting.

00:47:17.620 | Right, right, right, right.

00:47:18.740 | What's the depreciation of this data?

00:47:21.660 | Ooh.

00:47:23.060 | It depends on the method.

00:47:24.540 | Some methods, people think that it's more sensitive to--

00:47:28.460 | this is what I was saying.

00:47:29.700 | Does the type of instruction tuning you do matter for RLHF?

00:47:34.220 | So depending on the method, some people

00:47:36.900 | are trying to figure out if you need to have what is called--

00:47:41.420 | this is very confusing.

00:47:42.380 | It's called on-policy data, which

00:47:43.980 | is your RLHF data is from your instruction model.

00:47:47.500 | I really think people in open source and academics

00:47:50.180 | are going to figure out how to use any preference

00:47:52.180 | data on any model, just because they're scrappy.

00:47:54.860 | But there's been an intuition that to do PPO well and keep

00:47:58.900 | improving the model over time, and do what Meta did,

00:48:01.700 | and what people think that OpenAI does,

00:48:03.740 | is that you need to collect new preference data to kind of edge

00:48:08.300 | the distribution of capabilities forward.

00:48:10.340 | So there's a depreciation where the first batch of data

00:48:14.180 | you collect isn't really useful for training the model when

00:48:16.700 | you have the fifth batch.

00:48:19.300 | We don't really know, but that's something--

00:48:21.340 | it's a good question.

00:48:22.500 | And I do think that if we had all the LLAMA data,

00:48:26.440 | we wouldn't know what to do with all of it.

00:48:28.500 | Probably 20% to 40% would be pretty useful for people,

00:48:32.140 | but not the whole data set.

00:48:33.380 | A lot of it's probably kind of gibberish,

00:48:35.220 | because they had a lot of data in there.

00:48:37.380 | Yeah.

00:48:38.180 | So do you think the open source community should

00:48:40.940 | spend more time figuring out how to reuse the data that we have,

00:48:44.180 | or generate more data?

00:48:46.060 | I think that's one of the bigger questions.

00:48:47.860 | I think if the people are kind of locked

00:48:49.540 | into using synthetic data, which I wish I had more slides on it,

00:48:52.300 | but we could just talk about it.

00:48:53.600 | Essentially, people also think that synthetic data, like GPT-4,

00:48:57.420 | is more accurate than humans at labeling preferences.

00:48:59.780 | So if you look at these diagrams,

00:49:01.160 | like humans are about 60% to 70% agreement.

00:49:04.460 | That's what the models get to.

00:49:06.180 | And if humans are about 70% agreement or accuracy,

00:49:09.660 | like GPT-4 is like 80%.

00:49:11.180 | So it is a bit better, which is in one way of saying it.

00:49:14.340 | Humans don't even agree with humans 50% of the time.

00:49:17.540 | Yeah, so that's the thing.

00:49:18.740 | It's like the human disagreement or the lack of accuracy

00:49:21.900 | should be like a signal.

00:49:23.620 | But how do you incorporate that?

00:49:25.460 | It's really tricky to actually do that.

00:49:28.300 | I think that people just keep using GPT-4,

00:49:30.340 | because it's really cheap.

00:49:31.420 | It's one of my go-to, like I just say this over and over

00:49:33.940 | again, is like GPT-4 for data generation.

00:49:37.660 | All terms and conditions aside, because we

00:49:40.020 | know OpenAI has this stuff, is like very cheap for getting

00:49:43.300 | pretty good data compared to compute or salary

00:49:46.340 | of any engineer or anything.

00:49:47.820 | So it's like, tell people to go crazy generating GPT-4 data

00:49:51.340 | if you're willing to take the organizational cloud of,

00:49:54.260 | should we be doing this?

00:49:55.260 | But I think most people have accepted

00:49:56.760 | that you kind of do this, especially at individuals.

00:49:59.020 | Yeah, they're not going to come after individuals.

00:50:01.100 | I do think more companies should think twice

00:50:04.300 | before doing tons of OpenAI outputs,

00:50:07.260 | also just because the data contamination

00:50:09.740 | and what it does to your workflow

00:50:11.220 | is probably hard to control at scale.

00:50:14.140 | And we should just mention, at the time of recording,

00:50:16.660 | we've seen the first example of OpenAI enforcing

00:50:19.340 | their terms of service.

00:50:20.780 | ByteDance was caught, reported to be training on GPT-4 data,

00:50:24.820 | and they got their access to OpenAI revoked.

00:50:28.300 | So that was one example.

00:50:29.420 | I don't know if you have a comment on that.

00:50:31.260 | I don't expect OpenAI to go too crazy on this,

00:50:33.880 | because there's going to be so much backlash against them.

00:50:36.340 | Everyone's doing it, yeah.

00:50:37.460 | And everyone's going to do it anyways.

00:50:39.860 | And what's at stake here, to spell it out,

00:50:41.740 | is like, OK, this costs $10 to collect one data

00:50:45.180 | point from a human.

00:50:46.460 | It's going to cost you a tenth of a cent with OpenAI, right?

00:50:51.100 | So it's just orders of magnitude cheaper,

00:50:52.860 | and therefore people are just going to do it.

00:50:54.740 | Yeah, and the signal you get from humans from preferences

00:50:58.140 | is not high.

00:50:58.860 | The signal that you get from humans for instructions

00:51:02.500 | is pretty high, but it is also very expensive.

00:51:04.860 | So the human instructions are definitely, by far and away,

00:51:07.500 | the best ones out there, compared

00:51:08.940 | to the synthetic data.

00:51:10.940 | But I think the synthetic preferences

00:51:12.660 | are just so much easier to get some sort of signal running

00:51:15.780 | with, and you can work in other--

00:51:17.260 | I think people will start working in other goals

00:51:19.300 | there, between safety and whatever.

00:51:21.820 | But that's something that's taking off,

00:51:23.780 | and we'll kind of see that.

00:51:24.980 | I think in 2024, at some point, people

00:51:27.460 | will start doing things like constitutional AI

00:51:29.460 | for preferences, which will be pretty interesting.

00:51:33.500 | We saw how long it took RLHF to get started in open source.

00:51:37.180 | Instruction tuning was the only thing

00:51:38.780 | that was really happening until maybe like August, really.

00:51:41.660 | I think Zephyr was the first model that

00:51:44.220 | showed success with RLHF in the public.

00:51:46.860 | But that's a long time from everyone

00:51:49.500 | knowing that it was something that people are interested in,

00:51:51.960 | to having any check mark.

00:51:54.620 | So I accept that and think the same will

00:51:57.660 | happen with constitutional AI.

00:51:59.180 | But once people show that you can do it once,

00:52:01.980 | they continue to explore.

00:52:04.140 | Yeah, excellent.

00:52:05.500 | Just in the domain of human preference data suppliers,

00:52:09.980 | ScaleAI very happily will tell you

00:52:11.460 | that they supplied all that data for Lama 2.

00:52:17.020 | The other one is probably interesting,

00:52:18.560 | LMSYS from Berkeley.

00:52:21.340 | What they're running with Chaterina

00:52:22.820 | is perhaps a good store of human preference data?

00:52:26.420 | Yeah, they released some toxicity data.

00:52:28.380 | They, I think, are generally worried about releasing data

00:52:31.060 | because they have to process it and make

00:52:32.680 | sure everything is safe.

00:52:33.680 | And they're really lightweight work.

00:52:35.180 | And they're trying to.

00:52:36.380 | They're trying to release the preference data.

00:52:38.260 | I have-- if we make it to evaluation,

00:52:40.260 | I'd pretty much say that Chaterina

00:52:41.980 | is the best limited evaluation that people have to learn

00:52:44.540 | how to use language models.

00:52:45.700 | And it's very valuable data.

00:52:47.980 | And trying to get--

00:52:50.060 | they also may share some data with people

00:52:52.460 | that they host models from.

00:52:54.060 | So if your model is hosted there,

00:52:55.780 | and you pay for the hosting, you can get the prompts.

00:52:57.980 | Because you're pointing the endpoint at it,

00:52:59.740 | and it gets pinged to you.

00:53:00.820 | And your-- any real LLM inference stack

00:53:03.940 | saves the prompts that you get.

00:53:05.660 | So that is some signal.

00:53:07.020 | I don't know if the shared preferences--

00:53:09.220 | I do think they're trying to.

00:53:10.420 | They're trying to do all the right things.

00:53:12.220 | They're just very strapped.

00:53:13.340 | And moving data comes with other legal and liability concerns

00:53:16.780 | in some cases.

00:53:18.980 | Awesome.

00:53:20.140 | So kind of looping back a little bit

00:53:22.420 | from that very valuable digression

00:53:24.900 | on what preference data is, we're

00:53:27.180 | talking about the actual loss function.

00:53:29.260 | Because it's kind of like this classifier approach that

00:53:32.060 | might not make too much sense to people.

00:53:34.140 | You take a language model, and you

00:53:36.380 | chop it into pieces a little bit at the end

00:53:38.300 | so that it outputs one number.

00:53:39.780 | It's like-- in technical level, it's

00:53:41.540 | a logit that corresponds to the probability

00:53:43.500 | that we talked about earlier.

00:53:45.020 | But in order to train this, you can't just

00:53:46.780 | have prompt incompletions.

00:53:48.460 | You need to have these pairs.

00:53:50.100 | Because we talked about scalars don't really work.

00:53:52.220 | So in order to train it, you use the magical batching

00:53:54.700 | of all language model, all deep learning architectures.

00:53:57.420 | And you put in the chosen prompt and the rejected prompt

00:53:59.780 | at the same time.

00:54:00.540 | And then you end up with two numbers.

00:54:02.580 | And then there's this fun loss function.

00:54:04.460 | And you essentially have to increase the difference

00:54:06.580 | between these two predicted numbers.

00:54:08.660 | It's always fun when you think about automatic

00:54:10.940 | differentiation.

00:54:11.700 | It updates the same parameters to separate these two numbers

00:54:15.380 | at once.

00:54:16.260 | And there's this loss function that you'll see in OpenAI

00:54:19.060 | Anthropic and everyone's papers.

00:54:20.900 | What it looks like is it's like some log

00:54:22.900 | of a scalar with an exponential.

00:54:25.300 | That's the difference between these two predicted rewards.

00:54:28.000 | It's just some fancy math around a difference, a subtraction

00:54:32.000 | between the reward of the rejected prediction

00:54:34.080 | and the reward of-- the predicted

00:54:35.760 | reward for the rejected completion

00:54:37.660 | and the predicted reward of the chosen completion.

00:54:40.520 | Fun fact is that these loss functions

00:54:43.040 | look different in Anthropic and OpenAI's papers.

00:54:45.840 | But they're just literally just log transforms.

00:54:47.800 | So if you start like expandiating both sides

00:54:49.800 | and taking a log of both sides, you'll converge on one of the--

00:54:52.680 | both the two papers end up being the same thing.

00:54:55.320 | And people don't know how to train preference models,

00:54:57.800 | particularly well now.

00:54:59.040 | I think if you zoom into any of the details

00:55:01.720 | to look at the agreement number, so how--

00:55:04.880 | if you look at a test set, you'll

00:55:06.720 | have a chosen and rejected.

00:55:08.640 | And you can take the reward model you're training,

00:55:10.720 | pass in those completions, and you

00:55:13.000 | see if the chosen predicted reward, so the scalar number,

00:55:16.200 | is higher than the rejected predicted reward.

00:55:18.880 | And this is the agreement numbers

00:55:20.320 | in all of these data sets.

00:55:21.400 | It's like where you see they have the 65% to 75% agreement.

00:55:24.400 | This just means that these scalar numbers were ordered

00:55:27.880 | correctly.

00:55:28.480 | And that's a pretty low number.

00:55:29.780 | It's not going to get to 100%.

00:55:31.880 | That goes to show the kind of deep questions at play here.

00:55:35.480 | People are playing with different loss functions,

00:55:37.480 | ensembles, different models to try to address this.

00:55:39.760 | But it's really a fundamental issue.

00:55:41.260 | It's like-- it goes back to what does it mean to do RLHF?

00:55:45.460 | And we're not going to answer that now.

00:55:47.040 | But it's good to know that this 65% to 75% agreement,

00:55:49.900 | you'll see these numbers everywhere.

00:55:51.360 | It's like we don't have 100% agreement with the reward

00:55:53.920 | model and the data.

00:55:55.720 | And that's fine.

00:55:56.820 | That's just where we're at.

00:55:58.360 | And we essentially take this model,

00:56:00.080 | and then we start throwing RL at it, I think.

00:56:03.480 | PPO, proximal policy optimization,

00:56:07.360 | it's pretty complicated compared to what

00:56:09.080 | you really need to know.

00:56:10.320 | It really just does RL under the hood.

00:56:13.160 | Things like PPO, it learns a value function,

00:56:15.420 | and then it uses the value function to update the bottle.

00:56:18.680 | You could look at--

00:56:19.720 | if you actually look at a feedback diagram,

00:56:23.240 | it's more of like a systems problem than an RL problem.

00:56:27.280 | So you'll see things like you need

00:56:28.700 | to have two copies of the language model.

00:56:30.560 | This is for the KL constraint that we talked about before.

00:56:33.000 | You need to have the reward model, which

00:56:34.700 | is either a separate reward model or value head

00:56:37.000 | on your base model.

00:56:38.820 | And then you need to have your RL code that actually

00:56:42.000 | learns a value function and updates all the parameters.

00:56:44.360 | I think it just is really messy to actually set up.

00:56:48.360 | But if you dig into it, most people

00:56:50.280 | could understand what each of the components are.

00:56:52.320 | And then the hard parts are like,

00:56:53.760 | how do we actually make a language model that

00:56:55.600 | works out of this, which is not something

00:56:57.640 | that people know that well.

00:56:58.840 | I think things that I talk about a lot

00:57:00.480 | is just like, OK, what is the signal flow?

00:57:03.260 | How do you access the reward model?

00:57:05.440 | The reward model is used in RLHF exactly what you would think.

00:57:08.220 | You have a prompt.

00:57:09.360 | The language model generates a completion.

00:57:11.480 | And then that completion is given a score.

00:57:13.360 | That score gets plugged into the whole RL stuff.

00:57:16.240 | And it learns-- and it updates the parameters.

00:57:18.640 | That's kind of the core of it.

00:57:20.720 | There's a lot of different things,

00:57:22.240 | zooming in on where exactly you put this distance

00:57:25.400 | penalty between the base model and the RL model.

00:57:28.440 | Most people say that you just deduct it from the reward.

00:57:31.040 | So if you go all the way back to RL

00:57:33.600 | as an agent acting in the world, the reward from that world

00:57:37.600 | would be a combination of the reward model and any

00:57:39.760 | constraints, like KL, that you put on it.

00:57:41.560 | There's a lot of different ways to do this,

00:57:43.400 | because a lot of RL algorithms, like PPO,

00:57:45.320 | actually have a KL constraint built into them.

00:57:47.520 | So it's confusing, because you hear KL twice.

00:57:49.520 | But those are different KLs.

00:57:50.920 | One of them is about the text, and one of them

00:57:53.120 | is about the value function distance, or the policy

00:57:55.320 | distance, or something like this.

00:57:56.920 | So those are different.

00:57:57.880 | It really ends up being kind of gibberish

00:58:00.120 | that I think is less important now,

00:58:01.620 | because it's more about data and infrastructure than RL details,

00:58:04.720 | than value functions and everything.

00:58:07.820 | A lot of the papers have different terms

00:58:10.040 | in the equations.

00:58:11.120 | I think InstructGPT does something

00:58:12.840 | where they try to get the RL model to match

00:58:16.200 | the instruction tuning model, or the instruction tuning

00:58:18.480 | data set, because they're really happy with that data set

00:58:20.820 | to constrain the distribution.

00:58:22.600 | LLAMA does some different things.

00:58:24.240 | But I think these are all small gains over just

00:58:27.960 | getting the deep understanding of the data

00:58:31.080 | in the infrastructure setup.

00:58:33.840 | This is why we say it's so little RL.

00:58:35.680 | It's like, now we are getting to the point

00:58:37.600 | where you don't even really need this to get a good model.

00:58:40.040 | So that's why it's like, OK, the RL is such a small part

00:58:43.520 | of the actual doing RLHF.

00:58:46.520 | Like, RLHF is a metaphor for all language model adaptation.

00:58:50.440 | And RL is one tool used at one point in the time.

00:58:54.160 | So that's kind of where I wrap up

00:58:55.640 | the core overview in my mind, to say RL doesn't really

00:58:58.680 | do as much as people think.

00:58:59.840 | But you could put up flashy equations

00:59:01.580 | and do all sorts of stuff if you want to.

00:59:04.240 | I think it's kind of misleading, even,

00:59:05.840 | because I don't think about those equations

00:59:07.580 | on a regular basis.

00:59:08.600 | But what if we called it Q*?

00:59:10.640 | Yeah.

00:59:14.560 | So in your mind, is the takeaway for this kind

00:59:18.240 | of next generation of people working on models,

00:59:21.520 | maybe the underlying theories is less important than actually

00:59:25.200 | getting good data, basically?

00:59:26.680 | Yeah, I think it's getting good data.

00:59:28.240 | And we'll see, like, I have this advanced topics

00:59:30.240 | thing in the slides, which it starts with the vowels.

00:59:32.760 | And then it talks about a lot of different ways

00:59:34.800 | that people are using reward models

00:59:36.320 | or constructing training signals, really.

00:59:38.640 | And I think that it's about understanding

00:59:41.880 | what your information flow is.

00:59:43.200 | And if your reward signal is good,

00:59:44.960 | and if your language model is generating right,

00:59:47.840 | zooming in on the tokens it's generating,

00:59:49.780 | and kind of understanding how those things change over time.

00:59:54.040 | I have a slide that I--

00:59:55.560 | in here, I think this is something we could also

00:59:57.800 | talk about evaluation.

00:59:59.000 | But it's really like, RLHF is not

01:00:01.520 | that shown to improve capabilities yet.

01:00:03.360 | I think one of the fun ones is from the GPT-4 technical

01:00:05.920 | report.

01:00:06.420 | They essentially listed their kind of bogus evaluations.

01:00:09.200 | Because it's a hilarious table, because it's like LSAT, AP

01:00:11.680 | exams.

01:00:12.680 | And then, like, AMC-10 and AMC-12

01:00:15.240 | are kind of reasonable vowels in language model land.

01:00:19.240 | But they just showed that RLHF doesn't improve

01:00:21.360 | their evaluation metrics.

01:00:22.440 | We don't know if internally they have other ones.

01:00:24.480 | They probably do.

01:00:25.240 | But from what OpenAI has shown us externally,

01:00:27.920 | like, RLHF improves some metrics.

01:00:29.800 | It decreases some metrics.

01:00:31.440 | No one could really see.

01:00:32.520 | I do think it does things that they care about.

01:00:35.600 | But it's like, RLHF is not an easy tool

01:00:38.160 | to make numbers go up with.

01:00:39.600 | It's a powerful tool to change your language model.

01:00:42.560 | But as we've seen with LLAMA and safety RLHF,

01:00:45.760 | that doesn't always mean that people

01:00:47.260 | are going to be happy with those changes,

01:00:48.760 | or it's going to do exactly what you want.

01:00:50.520 | It's like--

01:00:51.160 | Well, I think this is intuitive.

01:00:52.520 | A lot of these tests are multiple choice.

01:00:55.360 | And RLHF isn't necessarily intended

01:00:58.240 | to improve your multiple choice reasoning capabilities.

01:01:01.280 | Yeah.

01:01:01.780 | Yeah.

01:01:02.280 | I think that it is reasonable, but I

01:01:04.800 | don't think a lot of people have connected the dots there.

01:01:08.360 | And like, what is it in a preference point?

01:01:12.520 | Like, what if your preference data

01:01:14.080 | was between a correct and a wrong answer?

01:01:16.120 | Like, it could conceivably do it,

01:01:18.280 | but I just don't think that is remotely

01:01:20.760 | what it is actually doing.

01:01:22.160 | Yeah.

01:01:22.920 | It's much better being a sommelier, apparently.

01:01:25.520 | Yeah.

01:01:27.480 | That was the weirdest one that was included in the GPT-404.

01:01:30.320 | Yeah, I did.

01:01:30.800 | I just see that the last three down there.

01:01:32.760 | That's really funny.

01:01:34.320 | I can't even taste it.

01:01:35.320 | You can't even taste it?

01:01:36.320 | It's just like-- anyway.

01:01:39.360 | Cool.

01:01:40.120 | Emerging directions.

01:01:41.120 | Yeah, so this is essentially how to use RLHF-like things

01:01:44.000 | to make the bottle better without using PPO,

01:01:46.880 | because PPO is kind of a nightmare to scale.

01:01:49.080 | The first thing that I started with

01:01:50.540 | is kind of the ideas of rejection sampling and best

01:01:53.000 | event sampling.

01:01:53.840 | I think best event sampling is what people often encounter

01:01:56.800 | first, which is the idea of you take a prompt,

01:02:00.200 | you generate like 10, 20 responses through it,

01:02:04.720 | you pass it through a reward model.

01:02:06.400 | The reward model assigns a scalar for each of them.

01:02:08.960 | You pick the one with the highest number,

01:02:10.680 | and that's the one you answer the question with.

01:02:13.000 | It seems pretty logical to people,

01:02:14.480 | because it's just spending more inference time compute

01:02:16.760 | to make your outputs better.

01:02:18.120 | And it works in a lot of things.

01:02:19.800 | This Let's Verify step-by-step paper

01:02:21.640 | that I talked about from OpenAI, they use it.

01:02:23.600 | Lots of papers use it.

01:02:26.360 | It's just kind of like a good thing

01:02:27.760 | to know that you can do.

01:02:28.720 | You can spend more inference compute

01:02:31.080 | based on a preference data set to make your answers better.

01:02:34.300 | The interesting thing that people are confused about more

01:02:36.680 | is rejection sampling, because Meta talked about it

01:02:39.060 | in Llama 2.

01:02:40.280 | Essentially, rejection sampling is putting something

01:02:43.600 | like best event sampling in a feedback loop.

01:02:45.600 | And instead of just returning the best answer to a user,

01:02:48.920 | you take the best few answers, and then

01:02:50.720 | you apply instruction tuning on that data set.

01:02:53.480 | And then you do the instruction tuning,

01:02:55.240 | and then you could collect more preference data,

01:02:57.240 | do a new reward model, and then you rank some new outputs,

01:02:59.800 | and you do instruction tuning again.

01:03:01.400 | So essentially, Llama started their RLHF process

01:03:04.440 | with this to get some signal out of preference data.

01:03:06.880 | That preference data went into a reward model,

01:03:08.960 | and then the reward model did a good enough ranking

01:03:11.760 | that it was essentially super-powered instruction

01:03:14.060 | tuning based on rewards.

01:03:16.920 | Works pretty well, much easier to implement the PPO,

01:03:19.280 | because you can use it in all of your--

01:03:21.720 | it's still instruction tuning, so it's

01:03:23.280 | the same autoregressive loss.

01:03:24.560 | It's easy to plug into things like transformers

01:03:26.560 | and stuff like that, a lot easier

01:03:28.340 | than whatever freaking mess doing RL at scale is going to be.

01:03:33.280 | So that's one.

01:03:34.800 | A quick nod that offline RL is something

01:03:37.400 | that people talk about for RLHF, essentially

01:03:39.760 | because your model doesn't have to generate.

01:03:42.120 | In that case, you just look at data,

01:03:44.920 | and it back-propagates through your reward model directly.

01:03:47.520 | So in PPO, you have the step of needing

01:03:50.200 | to generate everything and passing it

01:03:51.740 | through the reward model.

01:03:53.420 | How offline RL essentially works is

01:03:56.160 | that all of this is kind of just done on one big data set.

01:04:00.280 | I'm not an expert in this, but essentially, you

01:04:02.680 | do much less inference costs during the RLHF process

01:04:07.000 | if you do offline RL.

01:04:08.440 | There's a few papers that people have published.

01:04:12.360 | Not a lot of traction.

01:04:13.440 | I think it could take off some people that I know in the RLHF

01:04:15.980 | area really think a lot of people

01:04:17.240 | are doing this in industry, just because it makes

01:04:19.280 | the kind of training process simpler

01:04:22.160 | and the number of things you have to have running.

01:04:24.200 | Different feedback types are probably

01:04:29.520 | going to come into play.

01:04:32.040 | There's papers like written feedback

01:04:34.400 | or labeling multiple scores or multiple pairwise preferences

01:04:38.240 | for every completion.

01:04:39.960 | That's coming.

01:04:40.920 | It's also kind of related to what

01:04:42.280 | we mentioned in process reward models,

01:04:43.920 | where you're labeling each step in the chain of thought

01:04:47.360 | reasoning just to kind of make the problem more specific.

01:04:51.360 | It seems very likely that different feedback will

01:04:53.320 | be used for different domains.

01:04:55.240 | Chain of thought reasoning is great for math,

01:04:57.800 | and that's where these process reward models

01:05:00.000 | are being designed.

01:05:01.200 | Probably not great for things like poetry,

01:05:03.000 | but as any tool gets better, it gets more specific.

01:05:07.920 | Then kind of get into more of a talking point,

01:05:11.200 | which I think is fun.

01:05:12.080 | The next one I have is constitutional AI.

01:05:14.000 | I think this is something that people really don't--

01:05:18.440 | like, I think just kind of misunderstood.

01:05:21.080 | I think most people thought that constitutional AI was doing

01:05:23.920 | something where it's like created the preference

01:05:27.200 | data based on the specific principles in some way,

01:05:31.440 | where it's like--

01:05:32.960 | what did you two think of constitutional AI?

01:05:35.080 | I'll be the dumb person, and you correct me.

01:05:37.960 | As far as I understood, Anthropic came out

01:05:39.720 | and said that the best way of doing--

01:05:42.040 | of generating this sort of preference data or alignment

01:05:44.480 | is give a second model a constitution

01:05:46.960 | to evaluate the first model's outputs.

01:05:51.760 | And the constitution is unspecified,

01:05:53.320 | but it draws from the UN Declaration of Human Rights

01:05:57.200 | and the Apple Terms of Service, for some reason.

01:05:59.200 | Yeah, and this leads into the question

01:06:00.780 | of what is the other model evaluating,

01:06:02.760 | and how is it evaluating in a way that you can train on?

01:06:05.360 | And that's what I mean.

01:06:06.320 | People didn't think about this.

01:06:07.700 | A lot of the CAI paper was actually

01:06:09.340 | talking about instruction tuning, which

01:06:11.080 | is if you have an instruction, you then

01:06:13.040 | have a language model that critiques the instruction based

01:06:15.560 | on principles, and then your instruction responses

01:06:18.080 | are closer to the constitutional principles.

01:06:20.000 | This was the first half, which is like they

01:06:22.080 | have some acronym for all of this.

01:06:23.560 | The diagram in their paper's wild in this one.

01:06:26.920 | Their papers are sometimes pretty funny,

01:06:28.600 | because they're not capabilities papers.

01:06:30.360 | They're like alignment papers.

01:06:31.720 | So they don't make everything super clear.

01:06:33.720 | So the first half of constitutional AI

01:06:35.480 | is fine-tuning your instructions based on principles.

01:06:39.600 | That's one half.

01:06:40.320 | And then the second half is what people really

01:06:42.680 | thought that they knew, which is like,

01:06:44.680 | how do you use this other bottle to provide

01:06:47.600 | a critique based on principles?

01:06:49.780 | And in the paper, they list--

01:06:51.440 | essentially, they say what their prompt was,

01:06:53.520 | which is for the synthetic feedback for generating

01:06:57.320 | new preferences, which is essentially

01:06:59.000 | pick between these two answers based on this principle.

01:07:03.040 | So they're sampling from the principles

01:07:05.800 | in their constitution and from A, B,

01:07:09.600 | two options of completions.

01:07:11.280 | And then the AI model is essentially

01:07:14.600 | given the context of a certain principle

01:07:17.840 | to pick the A or B preference.

01:07:19.400 | And then that's a new preference data set.

01:07:21.160 | It's just the two completions without the context

01:07:23.760 | of the principles.

01:07:25.200 | So with this sampling idea, they're

01:07:27.800 | sampling from 30 principles and a wide data

01:07:30.520 | set of two candidate completions across different prompts.

01:07:33.440 | So to me, it's a very loose--

01:07:37.440 | the values are not explicit in this.

01:07:39.600 | It's just kind of how they're guided.

01:07:41.940 | And it's a very machine learning-y approach

01:07:45.140 | because it is relying on averages and scale

01:07:47.200 | to get the principles in there.

01:07:48.920 | But it is way less explicit than I thought it was going to be.

01:07:51.880 | I kind of thought there was this feedback

01:07:53.600 | thing in the preference data, where

01:07:55.080 | it checked to see if the principles were satisfied

01:07:57.920 | or anything like this.

01:07:58.840 | But it's really just a modification to the RLHF setup

01:08:02.420 | that we've talked about with instruction tuning

01:08:04.380 | and preference data collection, where there is an AI

01:08:07.120 | model providing critiques.

01:08:08.520 | And a lot of those critiques are based

01:08:10.600 | on sampling of constitutional values.

01:08:13.720 | So it almost sounds more tractable in that way.

01:08:17.160 | But I would also guess, while I just say, oh, look,

01:08:20.000 | I figured it out, I'm guessing they do different things

01:08:22.880 | than they said in the paper.

01:08:24.000 | Like, this paper is from 2022.

01:08:26.280 | It's a pretty old paper.

01:08:28.280 | And they're surely doing more.

01:08:31.200 | But it's good to know where they started, at least in this case.

01:08:34.880 | I thought the communication around the Pareto optimal

01:08:38.560 | improvement was helpful in understanding that you do

01:08:42.560 | actually want it to be more helpful and honest

01:08:47.640 | while maintaining the same level of harmlessness

01:08:49.640 | or something like that, right?

01:08:50.920 | Yeah, so that figure right at the top of the constitutional AI

01:08:54.680 | paper is worth seeing, if you don't have it immediately

01:08:57.080 | pop into your head, where they essentially compare

01:08:59.520 | constitutional AI to other RLHF that they're

01:09:01.840 | doing internally at different--

01:09:03.880 | and it's something that most RLHF papers don't do,

01:09:06.120 | is they have little dots on the lines

01:09:07.800 | to indicate intermediate checkpoints.

01:09:09.720 | And it'd be really great to see more RLHF papers kind

01:09:12.280 | of showing how per epoch or per half epoch of training,

01:09:16.680 | because most RLHF is only a few epochs, at least

01:09:19.240 | in the open models, what is happening there.

01:09:22.240 | People release checkpoints.

01:09:23.400 | But that's how we should be thinking about it,

01:09:25.320 | because the optimizer is so strong.

01:09:27.520 | And it's like, we don't know what's happening

01:09:29.440 | in this kind of intermediate land.

01:09:31.400 | I don't know if this is a relevant comparison for you,

01:09:33.960 | but OpenAI also recently released a weak-to-strong

01:09:38.000 | generalization paper, where they actually

01:09:40.200 | talked about a few intermediate checkpoints for GPT-4.

01:09:44.120 | Any comments on the comparison between constitutional AI

01:09:46.720 | and weak-to-strong generalization?

01:09:48.960 | I didn't see the paper.

01:09:50.040 | I think I saw people criticizing it

01:09:51.840 | for just being safety-washing from the fact

01:09:55.680 | that they're talking about GPT-2 still,

01:09:58.040 | which is such a kind of odd model to focus on.

01:10:00.800 | I didn't really look at it.

01:10:01.960 | I had it lying around.

01:10:02.960 | So I think that it's a thing with OpenAI.

01:10:05.760 | It's like they're sharing less than they know.

01:10:08.840 | So I think they probably have things

01:10:10.280 | that are pretty cool that they're doing internally.

01:10:14.440 | So I'll summarize for listeners who may not

01:10:16.200 | have seen the paper, because it's impossible to keep up

01:10:18.520 | and everything.

01:10:19.640 | I do think that what constitutional AI and RLHF

01:10:24.480 | represents is that we are starting

01:10:27.360 | to come to a point where it's just

01:10:28.800 | impossible for manual human preference data

01:10:31.840 | collection to scale.

01:10:33.800 | And the only way to scale this is to trust our AI overlords

01:10:36.560 | to model our human preferences.

01:10:39.520 | And constitutional AI was the first version of this.

01:10:41.880 | What the second version, or what weak-to-strong is,

01:10:44.600 | is that anticipating a future of superintelligence

01:10:47.720 | or the need for superalignment, where

01:10:50.440 | the thing that we're trying to control is smarter than us.

01:10:53.400 | So you take GPT-2 and try to use GPT-4

01:10:57.240 | to teach it to be smarter than itself,

01:11:00.080 | because this is what we're going to have

01:11:01.880 | to do in the future as well, when we are not--

01:11:04.560 | we're no longer fully in control.

01:11:07.560 | Are we the metaphorical GPT-2, or is--

01:11:10.680 | No, we're not even in the process

01:11:12.840 | anymore at the point of superintelligence.

01:11:15.560 | So they're just basically-- they're prepping.

01:11:17.680 | They're preppers.

01:11:19.960 | And they're saying, this will happen.

01:11:21.480 | And humans will be so far out in the dust

01:11:24.640 | that we just have no say in this debate.

01:11:27.680 | How do we still control systems then?

01:11:30.240 | And weak-to-strong generalization

01:11:32.040 | seems to be the answer.

01:11:33.040 | And I see a lineage from constitutional AI to this.

01:11:37.000 | Yeah, the constitutional AI and the superalignment

01:11:39.480 | is very conceptually linked.

01:11:41.320 | It's like a group of people that has

01:11:43.200 | a very similar intellectual upbringing,

01:11:45.360 | and they work together for a long time,

01:11:47.240 | coming to the same conclusions in different ways.

01:11:49.600 | And I understand the argument.

01:11:50.960 | And I mostly just don't--

01:11:53.840 | I think they're just waiting to see more

01:11:55.640 | from the superalignment team.

01:11:56.840 | Because I just didn't really put it together in my brain

01:11:59.240 | quickly, looking at weak-to-strong generalization

01:12:01.400 | of exactly how it all fits.

01:12:02.920 | But I'm also not a safety researcher.

01:12:05.560 | But I think that could be feedback for them.

01:12:08.640 | I understand what synthetic data means in all of this.

01:12:11.240 | It's like, how could they communicate that a little bit

01:12:14.000 | more specifically in this context?

01:12:15.800 | Because I want to know what they think about this.

01:12:17.840 | Which is why I like that Pareto optimal thing,

01:12:19.760 | it links-- it takes stairs debate away from x-risk

01:12:23.960 | to no, this makes navigation models more useful.

01:12:27.920 | And we can all get behind that.

01:12:29.640 | Yeah.

01:12:30.640 | Yeah, yeah.

01:12:31.600 | I agree.

01:12:32.760 | I think the last kind of emerging direction

01:12:34.600 | that I have might just be this debate.

01:12:36.320 | You can control how long we talk about this,

01:12:38.280 | which is about direct preference optimization.

01:12:40.280 | DPO.

01:12:42.440 | You could go read my blog post on this.

01:12:44.280 | I had tried to summarize this already.

01:12:46.000 | But essentially, DPO is a different class of algorithms.

01:12:51.040 | I still call it RLHF, because RLHF is so vague

01:12:54.800 | in how it's defined.

01:12:55.640 | I think DPO is closer to RLHF than RLHF is to RL.

01:12:59.580 | You can unpack that if you need to.

01:13:01.880 | But what DPO is doing is essentially

01:13:05.040 | deriving a optimal reward function

01:13:07.960 | from the preference data, where the preference data is

01:13:10.160 | the same thing that we've talked about.

01:13:11.780 | And then the clever math in the paper

01:13:14.880 | emerges the optimal policy to that

01:13:18.440 | based on an implicit reward function.

01:13:20.080 | That's a ratio of log probs.

01:13:22.040 | It's very odd.

01:13:23.160 | The difference between what a DPO reward is

01:13:25.640 | and a classifier reward is very different,

01:13:28.120 | where the classifier is trained to output a scalar value based

01:13:32.360 | on this contrastive-like loss, where

01:13:34.720 | DPO is purely based on the difference between two log

01:13:38.400 | prob ratios.

01:13:39.160 | So the reward there is the ratio between the policy generation

01:13:43.640 | likelihood and the base model generation likelihood.

01:13:47.080 | I don't have intuitions for what that means yet,

01:13:49.080 | but what the reward actually is is very different.

01:13:52.600 | The data starting point, in principle, could be the same.

01:13:55.440 | And I think we've seen a lot of successes in open source

01:13:58.040 | with it.

01:13:58.560 | It's way simpler to implement and to work

01:14:01.000 | with in that regard, which is why

01:14:02.640 | I think we'll keep seeing a lot of success

01:14:04.980 | with it in the short term.

01:14:06.200 | I think we'll keep seeing DPO models for the time being.

01:14:10.080 | But we won't really answer what the fundamental differences

01:14:13.120 | are, because it depends on your data.

01:14:15.160 | It depends on your infrastructure.

01:14:17.400 | Rumors seem to be that people still

01:14:19.080 | think that PPO-like methods or other RL methods

01:14:22.560 | have a higher top end.

01:14:25.000 | But I don't necessarily think--

01:14:26.680 | Sorry, what is top end?

01:14:27.800 | Just the absolute best model you could get.

01:14:29.600 | I see.

01:14:30.120 | So Google and OpenAI aren't using DPO

01:14:32.800 | because they could do something more complicated.

01:14:34.840 | But that's not what academics and open source people

01:14:38.040 | really care about.

01:14:38.840 | They care about being able to improve on their methods

01:14:40.840 | and understand where to iterate the models

01:14:42.720 | and work off of each other.

01:14:44.200 | So in a lot of ways, I think DPO still will be what people see.

01:14:47.840 | But in some ways, it's probably slightly more constrained.

01:14:52.480 | There's other ways that you could think of PPO

01:14:54.880 | working nicely in code, where it's

01:14:56.640 | if your code runs is the score that you give it.

01:14:59.640 | And you have to generate--

01:15:01.840 | you have to do canned things to get DPO to have the same data.

01:15:06.920 | So there are specific cases where the DPO formulation

01:15:09.560 | is a little bit harder.

01:15:10.640 | But I expect to see more DPO models than anything else

01:15:14.320 | in the next six months.

01:15:15.360 | That's probably what most people need to know,

01:15:17.960 | unless they're an RLHF expert.

01:15:19.720 | And I would love to learn more about PPO and a lot of authors

01:15:23.360 | in this space, from the DPO authors,

01:15:25.160 | who are great to talk to.

01:15:26.200 | You could reach out to all three of them.

01:15:28.360 | So as a time of recording, we're actually

01:15:30.080 | about to publish our NeurIPS recap,

01:15:31.720 | where we talk to the authors.

01:15:33.440 | So for people who are listening to this in the future,

01:15:35.360 | you can refer to that episode.

01:15:36.600 | Yeah.

01:15:37.100 | I think Rafael, Eric, and Archit--

01:15:39.220 | I've talked to all of them at a good length.

01:15:41.660 | And they're all fantastic.

01:15:42.900 | And it's like, they'll say similar things.

01:15:44.740 | And they'll also defend their method,

01:15:46.280 | because it's an awesome paper.

01:15:47.620 | If you want to learn how a good math--

01:15:51.140 | a kind of mathy, but still experimental paper in language

01:15:54.700 | models is, the DPO paper is a really good one

01:15:57.580 | to spend more time on.

01:15:59.220 | Yeah.

01:15:59.720 | Well, when I asked them questions about it,

01:16:02.940 | they just kind of gestured at their poster

01:16:04.660 | and said, look at the equation.

01:16:05.700 | Just stare at it.

01:16:06.340 | And you'll see it.

01:16:07.100 | Yeah, that's my criticism for them.

01:16:08.580 | It's like, they--

01:16:09.380 | [LAUGHTER]

01:16:10.860 | Like, what?

01:16:12.020 | Yeah, they're a little--

01:16:13.540 | they're still in the academic world, where

01:16:15.540 | some of their answers reflect that.

01:16:17.040 | But I've done it enough with them

01:16:18.700 | that I understand what they're saying.

01:16:20.620 | Yeah, yeah.

01:16:21.420 | I will say, it does remind me of Flesh Attention a little bit,

01:16:24.040 | in the sense that it's kind of an equivalent thing

01:16:28.260 | to the thing it's replacing.

01:16:29.420 | And it's just faster, cheaper, just better in every way.

01:16:31.860 | It's a very different optimization tool.

01:16:33.900 | It's essentially the thing, in my mind,

01:16:35.600 | that I can't get past is the difference

01:16:37.140 | between the control you get in training a reward model

01:16:40.180 | and then training a policy.

01:16:41.380 | Because essentially, everything you want your reward model

01:16:43.620 | to do might not be everything that you train the policy

01:16:45.920 | to do in the RLHF step, where you

01:16:47.620 | have the two different prompt distributions.

01:16:50.100 | But with DPO, you're doing both at once.

01:16:52.020 | So you don't control that.

01:16:53.660 | And we don't know if you have fancy engineering--

01:16:57.100 | like, if you have fancy engineering abstractions

01:16:59.140 | and test your reward model to do different things,

01:17:01.220 | if that separation is really important.

01:17:03.140 | And I think that's where this benefit

01:17:05.680 | at the absolute biggest scale and most investment

01:17:08.200 | could come from.

01:17:09.240 | But DPO is one update.

01:17:12.000 | It is one model.

01:17:13.040 | You can't separate that.

01:17:14.680 | So that's a thing to know.

01:17:17.760 | It probably doesn't matter for most people.

01:17:19.600 | But it is very different.

01:17:20.960 | And I was asking somebody who was on some of those earlier

01:17:23.660 | OpenAI papers that's not OpenAI anymore.

01:17:25.720 | And they were like, I wish we had thought of that.

01:17:28.080 | So it is a really cool idea.

01:17:32.000 | That's the type of thing that academia still can do

01:17:35.400 | and can do really well and hopefully continues to do.

01:17:39.920 | Yeah.

01:17:40.520 | One thing I wanted to make sure I cover

01:17:42.220 | before we leave this topic--

01:17:45.520 | one of the DPO models that we're trained,

01:17:48.240 | apart from Zephyr and Mixtraw, which

01:17:50.120 | is two of the more high-profile ones,

01:17:52.400 | is Tulu from the Allen Institute.

01:17:55.000 | And you're one of the few people maybe placed to explain--

01:17:57.680 | So funny.

01:17:59.240 | Maybe like, what's Allen Institute doing here?

01:18:01.600 | And what's the backstory?

01:18:03.200 | Yeah, so the Allen Institute for AI is--

01:18:05.040 | I think the 10-year birthday is in January.

01:18:07.200 | It's a special event.

01:18:08.760 | And also, people should know, this

01:18:10.160 | is Paul Allen from Microsoft.

01:18:11.760 | Yeah, Paul Allen owns everything in Seattle.

01:18:14.480 | Not literally.

01:18:15.160 | I mean, his past and his estate is still

01:18:18.180 | operating in a lot of great ways.

01:18:19.640 | But the Allen Institute is mostly

01:18:21.520 | known as being a super academic lab where they have

01:18:23.760 | more resources than academia and publish hit

01:18:26.880 | after hit of research paper.

01:18:28.920 | And they're trying to move more in the direction of really

01:18:31.560 | using models.

01:18:32.280 | And this is part of why I joined.

01:18:33.700 | It's like talking with the new CEO, Ali Farhadi.

01:18:37.280 | I don't know if I pronounced the last name right.

01:18:39.320 | But he's trying to move from an org that does papers only

01:18:43.640 | to something that does papers, releases models,

01:18:45.760 | is active in policy, maybe is helping

01:18:48.640 | work with these for-profit institutions that

01:18:50.640 | don't have a middle, an established place where they

01:18:53.640 | could all go through to new things.

01:18:55.400 | So they're really trying to expand the scope.

01:18:58.160 | It's part of why I joined.

01:18:59.880 | And the Tulu2 model is the type of thing I've joined.

01:19:02.200 | And they were talking about this.

01:19:03.560 | And I was like, OK, we should just train it and release it,

01:19:05.360 | because no one has done this direct preference

01:19:07.280 | optimization at a scale of really

01:19:10.120 | like 70 billion parameter scale.

01:19:11.920 | And this experiment is hilarious.

01:19:13.280 | This is classic of everything kind of works right now in ML.

01:19:16.840 | I showed up in the grad student Hamish Iveson.

01:19:20.240 | And I need to learn how to pronounce last names better.

01:19:22.520 | But he had some Jaxx DPO code built on this EZLM framework.

01:19:26.200 | And we have deep TPUs that we could

01:19:29.240 | access for research purposes.

01:19:30.740 | So it's like, OK, we have a huge TPU.

01:19:32.240 | It's like, let's just try the Zephyr recipe

01:19:34.480 | on 70 billion parameters.

01:19:35.720 | And it's literally like the first run.

01:19:37.340 | It's like we did no ablations, didn't change any parameters.

01:19:39.880 | We just copied them all over.

01:19:41.560 | And that's the model that people have been working with.

01:19:45.040 | That goes to show that there's a lot of runway and understanding

01:19:48.080 | and improving on this.

01:19:49.360 | We took the same data and just took it

01:19:51.720 | to a different Jaxx implementation

01:19:53.480 | and scaled it up 10x.

01:19:54.760 | And it still returned a model that was pretty good on benchmarks

01:19:58.400 | and in people using it.

01:20:00.240 | So let's say it's like 2024, we'll be busy in this space.

01:20:04.980 | We're running data ablations to try to understand what's best.

01:20:08.400 | Then Allen Institute is pre-training language models

01:20:11.280 | or pre-training open language models,

01:20:13.280 | where we'll be able to share data, code, everything,

01:20:16.680 | the kind of horn that everyone likes

01:20:18.560 | to get annoyed about these days.

01:20:19.840 | It's like, well, I'm not releasing data.

01:20:21.720 | So that'll come in the new year.

01:20:23.760 | And then things like Tulu2 are the recipes

01:20:26.520 | that we will apply to that.

01:20:28.320 | And we'll kind of keep doing both.

01:20:30.560 | As the pre-trained models get better,

01:20:32.560 | those will probably become more of a priority.

01:20:34.480 | But starting pre-training is very hard.

01:20:36.560 | So it's like you still want to learn from LLAMA2 and LLAMA3.

01:20:40.560 | So that's fun.

01:20:43.240 | I think DPO releases are kind of becoming expected,

01:20:46.880 | because Mistral released a DPO model as well.

01:20:50.720 | I think the slide after this is just like, there's a ton.

01:20:53.800 | It's like Intel releases DPO models.

01:20:55.800 | Stability releases DPO models.

01:20:58.400 | At some point, you just have to accept that that's

01:21:00.440 | where we're going, whether or not

01:21:01.820 | you care about the whole DPO debate.

01:21:03.500 | And that's why I find it so funny,

01:21:04.960 | because there's really interesting, debatable

01:21:07.640 | questions between DPO and other RL methods.

01:21:10.200 | But we just won't have the answer.

01:21:12.120 | And it'll look like there isn't a debate,

01:21:14.200 | because everything that is published is with DPO.

01:21:16.440 | But that doesn't mean that anything

01:21:17.920 | is answered in the time being.

01:21:21.880 | Yeah, kind of last of this stuff is evaluation.

01:21:25.640 | And these slides were prepared kind of last minute.

01:21:29.000 | But I think the question is, how do you evaluate these models

01:21:32.680 | and what you should be doing?

01:21:33.840 | I think the PSA is like, don't trust your numbers

01:21:36.600 | and actually talk to models.

01:21:37.960 | It's very hard to do if you're an engineer or a researcher,

01:21:40.320 | because you have your specific thing that you're zoomed in on.

01:21:43.040 | And it feels like a waste of time

01:21:44.420 | to just go play with chat GPT or go play with chat arena.

01:21:46.800 | But I really don't think it is.

01:21:48.100 | It's something that I-- this is me telling myself

01:21:50.360 | what I should be doing.

01:21:51.760 | But there's the question of, is the Hugging Face leaderboard

01:21:56.920 | good for open source?

01:21:58.960 | And then what else can people do?

01:22:01.240 | The Hugging Face leaderboard came out of the team

01:22:03.240 | that I was on there.

01:22:04.080 | We were trying to build a framework to automatically

01:22:07.400 | evaluate the models that we were training

01:22:09.240 | and the models that people were releasing,

01:22:10.960 | and then have them in a central place where it could be like,

01:22:12.740 | look, here's the evaluation scores.

01:22:14.240 | This is what we're competing with.

01:22:15.760 | It obviously blew up.

01:22:16.880 | I think it's very good for companies

01:22:18.960 | trying to operate in the open LLM space

01:22:21.760 | to build businesses around it.

01:22:23.080 | I think it's bad for people building LLMs

01:22:25.080 | that they think are the best, because it's

01:22:27.240 | easy to overfit if you're training and focusing on them

01:22:30.960 | as a developer.

01:22:31.660 | But it's good to have distribution of models

01:22:35.280 | when there's so many people training them.

01:22:38.440 | But it's like, now it has six evaluation tools.

01:22:41.640 | I can't even name all of them off the top of my head.

01:22:43.840 | ARC, Hellaswag, MMLU.

01:22:47.200 | There was Drop on it at one point,

01:22:48.920 | but they dropped a drop, which was pretty funny.

01:22:50.960 | Drupal, QA, and then I think maybe some other math.

01:22:55.600 | I don't know.

01:22:58.280 | So this benchmark question is something

01:23:01.480 | that everyone's talking about, because there's a lot of gaming

01:23:05.240 | that it seems to be going on.

01:23:07.240 | Is there some discussion about held out benchmarks

01:23:11.440 | that Hugging Face could hold onto?

01:23:13.640 | Mostly it's who's going to pay for it.

01:23:15.760 | But we're thinking about this at Allen AI, too,

01:23:18.240 | is improving on-- we're specifically

01:23:20.240 | thinking about improving on Opaqa eval, which is--

01:23:22.000 | Who's going to pay for running the evals?

01:23:23.280 | Who's going to pay for running the evals?

01:23:24.800 | Right now, Hugging Face is just running every eval every day.

01:23:27.120 | Yeah.

01:23:27.620 | So they have 1,000 GPUs.

01:23:29.040 | At one point, they were going to do more training.

01:23:31.240 | It was going to be used for that.

01:23:32.580 | But now they have less training, and they've

01:23:34.720 | run a good amount of GPUs.

01:23:35.920 | And one of their blog posts, they

01:23:36.920 | said how much compute it was.

01:23:38.160 | I don't think it's a ton to run these,

01:23:39.760 | but it is like, you have to have hundreds of GPUs

01:23:42.440 | to maintain this leaderboard.

01:23:45.320 | So one technical question.

01:23:47.440 | Some of these are open source models that they don't change,

01:23:50.080 | so you just have to run them once.

01:23:51.400 | Yeah.

01:23:51.900 | OK.

01:23:52.640 | So it's not that crazy, I don't think.

01:23:55.640 | No.

01:23:56.140 | It's tractable for--

01:23:56.920 | It's only the closed source models

01:23:58.440 | that need to be re-evaluated.

01:23:59.640 | Yeah, so if you look at the chat arena,

01:24:01.320 | they take specific dates.

01:24:04.700 | Yeah.

01:24:05.200 | And then there's this whole controversy of,

01:24:07.040 | is chat GPT from March better than chat GPT from June?

01:24:12.280 | So on one of these future slides,

01:24:15.520 | it's slide 58 is the chatbot arena leaderboard,

01:24:20.880 | if you're looking later, which chatbot arena is this thing

01:24:23.760 | from LLAMSYS that we were looking at.

01:24:25.720 | And then on the x-axis is models.

01:24:27.520 | And you can see that GPT-4 from March has a higher score.

01:24:33.360 | And the same-- it's like, this is not a perfect comparison.

01:24:37.360 | But there are signs that are pretty funny there,

01:24:39.480 | that there are things cooking.

01:24:42.680 | But you don't know who's collecting this data,

01:24:44.720 | what prompts they're doing, and what--

01:24:47.440 | but it's such a funny timeline.

01:24:49.680 | So for those listening, GT-4 March 14

01:24:53.000 | is 40 Elo points higher than GT-4 June 13.

01:24:55.940 | Yeah, it's outside of the error bars on the LLAMSYS thing.

01:24:58.480 | That's pretty high.

01:24:59.280 | And the other piece of context is

01:25:00.680 | that GPT-4 Turbo is also notably ahead of the other GPT-4s,

01:25:05.960 | which it kind of showed up immediately

01:25:08.040 | once they added it to the leaderboard or to the arena.

01:25:11.040 | And I was like, all the GPT-4.5 memes aside,

01:25:15.480 | it seems like this is effectively

01:25:17.000 | a bump in the model.

01:25:18.040 | If it's clear-- if you zoom into this,

01:25:21.520 | the leaderboard is very close for many strata of models.

01:25:28.280 | So there are levels where you can get your model to,

01:25:31.080 | and it'll be really close to your peers.

01:25:32.800 | So in the open source, there's things like Mixtral Instruct

01:25:38.080 | 2.2.7db, which is effectively--

01:25:40.600 | it's a way bigger model than Mixtral.

01:25:42.920 | Mixtral's the mixture of expert model.

01:25:45.360 | I'll do credit.

01:25:46.040 | It's a very good model, and that's

01:25:47.460 | going to be the next level once people

01:25:49.280 | get better at fine-tuning it.

01:25:50.560 | Ye34bchat, this is one level.

01:25:53.600 | And then there was a level with the alpacas and the vicunas.

01:25:56.840 | But all of these open source models,

01:25:58.880 | there's then another step up to GPT-4,

01:26:01.500 | and then there's another step up to GPT-4 Turbo.

01:26:03.960 | So it's like the difference from the GPT-4 Turbo

01:26:07.720 | to the GPT-4 that was first released

01:26:10.720 | is bigger than the difference from Tulu2 to GPT-4.

01:26:14.600 | So that's just like, there's something good going on there.

01:26:17.720 | And I was like, OK, that's a new model by my standards,

01:26:20.840 | but they're not going to tell us about it.

01:26:23.400 | They did in DevDay.

01:26:24.280 | They said it's our new model, but they weren't like,

01:26:26.920 | this is our new best-performing model,

01:26:28.840 | because the benchmark scores are probably the same,

01:26:32.120 | but they made it so that people like using it more.

01:26:34.560 | There's some hints that 4.5 might drop at some point.

01:26:37.400 | We don't actually know how true those things are,

01:26:39.400 | but I don't think it really matters.

01:26:40.900 | It's like they could call anything--

01:26:42.440 | they're retraining these models, and they could call any of them

01:26:45.240 | GPT-4.5.

01:26:47.640 | I think the two tools that I talk about most

01:26:50.120 | in research domains on RLHF is AlpacaValidMTBench.

01:26:55.200 | They're two academic-maintained leaderboards

01:26:58.100 | for evaluating chat capabilities.

01:27:00.400 | Evaluating chat is really hard, and what they both do

01:27:03.360 | is they have GPT-4 provide some sort of feedback.

01:27:06.520 | MTBench is called MT for multi-turn,

01:27:10.480 | and they have a prompt and a follow-up question.

01:27:12.560 | So what they do is they ask GPT-4

01:27:14.680 | to score both the initial response

01:27:19.520 | and the second response, and provide the average.

01:27:21.600 | Kind of given up on following the slides.

01:27:23.320 | It's all on the slides if you look for it.

01:27:25.120 | And then AlpacaVal is a little bit different,

01:27:27.840 | where you're comparing a candidate model.

01:27:31.000 | So the model we've trained.

01:27:32.200 | So when we're training Tulu, we submit this.

01:27:34.860 | And what it's doing under the hood is comparing the new model

01:27:37.360 | to DaVinci 0.0.3, which is one of OpenAI's older instruction

01:27:42.680 | models, and calculating the win rate

01:27:45.560 | that GPT-4 sees between the new model and DaVinci.

01:27:49.440 | So it has many more prompts than MTBench.

01:27:53.800 | MTBench is custom prompts that they

01:27:55.480 | made to just kind of take a stance on what

01:27:58.560 | is a good chat model.

01:27:59.840 | AlpacaVal sources theirs from Self-Instruct,

01:28:03.240 | which is a popular paper from AI2.

01:28:05.880 | Open Assistant, Vicuna, Koala, Anthropix, Helpful, Harmless.

01:28:09.120 | So AlpacaVal is from sources that people know and love.

01:28:12.320 | MTBench is its own thing.

01:28:14.360 | We were more focused on MTBench at Hugging Face.

01:28:17.280 | At AI2, we're a little bit more focused on AlpacaVal.

01:28:20.040 | But it really can go either way.

01:28:22.080 | These are kind of like table stakes to saying that you have

01:28:24.160 | a good RHF model.

01:28:26.400 | You should be able to have a pretty good score on both

01:28:29.040 | of these.

01:28:30.080 | And then the kind of proof is in people actually talking to it.

01:28:33.280 | So I think the Zephyr model from Hugging Face

01:28:35.960 | was a kind of step change in people's perception

01:28:39.280 | of open models that got integrated

01:28:41.000 | into a bunch of products within a few weeks.

01:28:43.200 | Like U.com was experimenting with it.

01:28:45.960 | And someone else, like I saw some substacker,

01:28:48.440 | was using it as a writing feedback bot instead of chat

01:28:51.480 | GPT.

01:28:52.440 | But that's what happens when a good open release is there now.

01:28:57.600 | It's like the evaluations are good and people pick it up.

01:29:00.920 | And the evaluations are just enough

01:29:03.040 | to say, OK, we're in the right ballpark.

01:29:05.240 | But you never really know if the model is

01:29:08.140 | the one or one of these big ones without talking to it.

01:29:12.440 | However much you talk about evals,

01:29:13.920 | that's still where we're at.

01:29:15.720 | You can't prove anything definitively.

01:29:18.280 | And Google's seeing that.

01:29:20.160 | And until Gemini Ultra comes out, we don't know.

01:29:23.520 | It's probably a great model, but we don't know what they have.

01:29:27.360 | Gemini Pro didn't do so great on the other stuff, too.

01:29:30.480 | Yeah, I want to know if Gemini Pro is just

01:29:32.520 | like some intermediate checkpoint,

01:29:34.520 | or if it was like a major deliverable for them or not.

01:29:38.240 | Which if it wasn't a major deliverable,

01:29:39.960 | it's probably a strategy headache for Google.

01:29:42.480 | But that's not my problem.

01:29:43.840 | You have a bunch of open questions here.

01:29:48.960 | One of our lightning round questions is always--

01:29:51.160 | Yeah, we just do inverted lightning round?

01:29:52.560 | Yeah, exactly.

01:29:53.640 | You asked people open questions.

01:29:57.280 | Oh, I mean, there's so much to do here.

01:29:59.200 | They're kind of like summarization of things

01:30:01.040 | that will be hinted at in the talk to this point, which

01:30:06.280 | is like I split it up in my work between data training

01:30:09.080 | and model, which is essentially like how

01:30:11.440 | do we evaluate what's happening at the model level with RLHF.

01:30:14.640 | I think big labs are so over-indexed--

01:30:16.680 | are indexed on their own base models,

01:30:18.280 | so they don't know what's swapping between Cloud Base

01:30:20.760 | or GPT-4 Base, how that would change any notion of preference

01:30:24.400 | or what you do with RLHF.

01:30:25.640 | I think in the open, we could do that.

01:30:27.480 | We could swap between Lama 2 and MixedRAW

01:30:29.720 | and kind of see, does RLHF work the same for both of those?

01:30:33.080 | Do they both get alpaca valve bumps

01:30:34.760 | when you use the same data set in the same framework

01:30:37.200 | down the line?

01:30:38.200 | That'd be good to know how sensitive RLHF is.

01:30:41.720 | On the data, we talk a lot about aggregation.

01:30:44.220 | On the research side, there's a lot of interesting things

01:30:46.600 | just like, does getting your data

01:30:48.360 | from scale or a Discord army change

01:30:51.000 | the quality of the data based on professional contexts?

01:30:54.720 | And like--

01:30:55.200 | The results of this might really affect scale.

01:30:57.360 | Yeah.

01:30:58.480 | They probably should do it internally.

01:31:00.160 | They should do internal market analysis on that line.

01:31:05.040 | We should also mention, there has

01:31:06.520 | been a report that a lot of these labelers

01:31:08.760 | use ChatGPT to do their work.

01:31:11.400 | Yeah.

01:31:11.960 | I mean, I'm not surprised.

01:31:13.600 | So it's like-- it's a lot of messy grounds

01:31:16.080 | in RL these days.

01:31:17.720 | And then there's more training questions,

01:31:19.440 | which is like, what happens at the end of the day?

01:31:22.600 | I mentioned what I call qualitative alignment

01:31:25.520 | earlier on, which is like, do the models

01:31:27.520 | get better in ways matching the preference data preferences?

01:31:31.040 | So if you collect two batches of preference data

01:31:33.720 | with different priorities, what is the downstream model change?

01:31:37.280 | I don't know if it does anything.

01:31:38.840 | Should all data be equal?

01:31:40.480 | If you have health care questions,

01:31:42.080 | should it be the same as like, write me a joke?

01:31:45.160 | This is all implicit to deep learning.

01:31:47.520 | Like, deep learning just scales and aggregates.

01:31:50.040 | And I think we are going to be on that ride,

01:31:53.280 | but it's not necessarily what some people would

01:31:55.320 | call fair or good.

01:31:57.160 | And then the kind of last slide that I have is fun,

01:31:59.240 | which is just like, John Schulman talks about this

01:32:01.360 | in his ICML talk.

01:32:02.440 | His ICML talk on proxy objectives for RLHF

01:32:05.280 | is public now.

01:32:06.000 | They made it public three months after the conference

01:32:08.240 | or some weird timeline.

01:32:09.680 | But he talks about things like ChatGPT being verbose

01:32:12.560 | and have self-doubt, refusals.

01:32:14.880 | Things that are really in vogue in the conversation right now

01:32:17.840 | and how those can emerge in the process of continually trying

01:32:22.520 | to adjust the RLHF process based on what users

01:32:25.840 | are seeing in the model.

01:32:27.440 | And this is like a sort of outer loop optimization

01:32:29.760 | that no one in the open is even remotely qualified

01:32:32.200 | to talk about, but OpenAI does monitor.

01:32:34.160 | And they'll rerun RLHF and train a new reward model

01:32:36.920 | with a mixture of their curated data and user prompts

01:32:40.280 | to try to make it work better over time.

01:32:43.040 | And that's the different model versions.

01:32:44.880 | And while there's a lot of critiques about this,

01:32:47.200 | they're definitely intentional in trying to fix--

01:32:50.400 | I feel like it's probably whack-a-mole, where they're

01:32:52.600 | like, oh, there's this problem.

01:32:53.240 | We have the data.

01:32:53.800 | We can fix this.

01:32:54.520 | And then it pops up some new problem after doing RLHF,

01:32:57.680 | and they're studying this.

01:32:59.720 | And if you could really figure it out,

01:33:01.400 | this is where things start to look more like RL.

01:33:03.480 | You could automate it.

01:33:05.600 | Things are just like longer time frame of optimizing the model.

01:33:09.640 | It would be cool, but I feel like I'm

01:33:12.320 | years away from ever actually working on this.

01:33:14.680 | But we can try to get details from people who are.

01:33:18.940 | Yeah, excellent.

01:33:20.320 | Awesome.

01:33:21.320 | Yeah, anything else that we missed?

01:33:23.680 | I think we covered a lot of it.

01:33:26.360 | I mean, I'm good.

01:33:27.080 | I would ask you guys about if you know companies that

01:33:29.320 | are doing this and things.

01:33:30.680 | I know some that are in the RLHF as a service space

01:33:34.240 | will become busy, I think for good reason,

01:33:36.400 | just because--

01:33:37.280 | This company is doing RLEIF as a service.

01:33:39.760 | Yeah, both of them are.

01:33:40.720 | It depends if synthetic data is going to win over human data.

01:33:43.600 | If human data is the real winning feature in the end,

01:33:48.040 | it's a big capital investment.

01:33:49.440 | So it kind of makes sense as a VC model anyways,

01:33:51.960 | but there's going to be both of them for a while.

01:33:54.760 | That'd be cool.

01:33:56.520 | You see a lot of people-- because I know

01:33:58.200 | Luis Castricado is starting a company.

01:34:01.000 | Is there a lot of ambition in this field to start companies,

01:34:04.600 | or is this more such a research-driven part of the stack

01:34:09.280 | that maybe it just stays there?

01:34:10.640 | There definitely is, because I know my former colleague

01:34:13.280 | Nazneen Rajani from Hugging Face is also starting

01:34:16.560 | a company in this space.

01:34:18.400 | The Falcon team who left Hugging Face, I think,

01:34:21.200 | is also working in this space.

01:34:22.880 | I don't really know.

01:34:24.600 | I don't know exactly what--

01:34:25.760 | I haven't talked to them since ICML,

01:34:27.240 | so I don't know what they're doing.

01:34:28.240 | Startups change a lot.

01:34:29.440 | But there are definitely a lot of people

01:34:31.080 | looking at this space.

01:34:32.520 | I mean, Scale's probably trying to do it.

01:34:34.240 | If I was Scale, they would want to do it.

01:34:35.920 | I think they've historically had trouble

01:34:37.640 | keeping technical ML talent, but they've

01:34:39.800 | started a new research lab, so that should help.

01:34:43.200 | It's a busy area.

01:34:45.160 | Cool.

01:34:46.040 | What's going on?

01:34:47.040 | Yeah.

01:34:47.560 | Awesome, Nathan.

01:34:48.400 | Thank you so much.

01:34:48.920 | That was a masterclass.

01:34:50.040 | I think this is the first 201 that we've ever had,

01:34:52.160 | and you set the bar very high.

01:34:54.040 | Thank you.

01:34:56.800 | Bye, everyone.

01:34:57.640 | Bye.

01:34:58.360 | Bye-bye.

01:34:58.960 | [MUSIC PLAYING]

01:35:02.320 | [MUSIC PLAYING]

01:35:06.280 | [MUSIC PLAYING]

01:35:09.640 | [MUSIC PLAYING]

01:35:13.560 | [MUSIC PLAYING]

01:35:16.920 | [MUSIC PLAYING]

01:35:20.440 | [MUSIC PLAYING]

01:35:23.800 | [BLANK_AUDIO]

The Origin and Future of RLHF: the secret ingredient for ChatGPT - with Nathan Lambert

Chapters