
The Origin and Future of RLHF: the secret ingredient for ChatGPT - with Nathan Lambert


Chapters

0:00 Introductions and background on the lecture origins
5:17 History of RL and its applications
10:09 Intellectual history of RLHF
13:47 RLHF for decision-making and pre-deep RL vs deep RL
20:19 Initial papers and intuitions around RLHF
27:57 The three phases of RLHF
31:09 Overfitting issues
34:47 How preferences get defined
40:35 Ballpark on LLaMA2 costs
42:50 Synthetic data for training
47:25 Technical deep dive in the RLHF process
54:34 Rejection sampling / best-of-n sampling
57:49 Constitutional AI
64:13 DPO
68:54 What's the Allen Institute for AI?
73:43 Benchmarks and models comparisons


00:00:00.000 | [MUSIC PLAYING]
00:00:00.620 | Hey, everyone.
00:00:01.280 | Welcome to the Latent Space Podcast.
00:00:03.200 | This is Alessio, partner and CTO in Residence
00:00:05.720 | at Decibel Partners.
00:00:06.760 | And I'm joined by my co-host, Swyx, founder of Smol AI.
00:00:10.040 | Hey, and today we have Dr. Nathan Lambert in the house.
00:00:12.800 | Welcome.
00:00:13.520 | Thanks, guys.
00:00:15.640 | You didn't have to come too far.
00:00:17.040 | You got your PhD in Berkeley, and it
00:00:18.840 | seems like you've lived there most of the time
00:00:21.240 | in recent years.
00:00:22.760 | You worked on robotics and model-based reinforcement
00:00:26.240 | learning on your PhD, and you also
00:00:28.200 | interned at FAIR and DeepMind.
00:00:31.440 | You bootstrapped the RLHF team at Hugging Face,
00:00:34.760 | and you recently joined the Allen Institute
00:00:37.200 | as a research scientist.
00:00:39.440 | So that's your quick bio.
00:00:40.720 | What should people know about you
00:00:42.100 | that maybe is not super obvious about you on your LinkedIn?
00:00:46.000 | I stay sane in various insane sport and ultra-endurance sport
00:00:51.560 | activities that I do.
00:00:53.120 | What's an ultra-endurance sport activity?
00:00:56.640 | Long-distance trail running or gravel biking.
00:00:59.320 | Nice.
00:01:00.600 | Nice.
00:01:01.240 | Try to unplug sometimes, although it's harder these days.
00:01:04.960 | Yeah.
00:01:05.680 | Well, the Bay Area is just really good for that stuff,
00:01:08.160 | right?
00:01:08.660 | Oh, yeah.
00:01:09.160 | You can't beat it.
00:01:11.200 | I have a trailhead, like, 1.2 miles from my house,
00:01:14.080 | which is pretty unmatchable in any other urban area.
00:01:18.800 | Yeah.
00:01:19.440 | Yeah.
00:01:20.520 | Pretty excellent.
00:01:21.920 | You also have an incredible blog, Interconnects,
00:01:26.240 | which I'm a fan of.
00:01:28.400 | And I also just recently discovered
00:01:29.880 | that you have a new podcast, Retort.
00:01:32.200 | Yeah, I do.
00:01:33.000 | I've been writing for a while, and I
00:01:34.500 | feel like I've finally started to write things that
00:01:37.400 | are understandable and fun.
00:01:38.840 | After a few years lost in the wilderness,
00:01:40.860 | if you ask some of my friends that I
00:01:42.360 | made read the earlier blogs, they're like, yikes.
00:01:44.680 | But it's coming along.
00:01:47.080 | And the podcast is with my friend Tom,
00:01:49.000 | and we just kind of riff on what's actually happening on AI
00:01:52.920 | and not really do news recaps, but just what it all means
00:01:57.040 | and have a more critical perspective on the things that
00:02:00.720 | really are kind of funny but still very serious happening
00:02:03.720 | in the world of machine learning.
00:02:05.160 | Yeah.
00:02:05.880 | Awesome.
00:02:06.640 | For people who are new to your work,
00:02:08.120 | what would you highlight as your greatest hits
00:02:10.280 | so far on Interconnects, at least?
00:02:13.840 | So the ones that are most popular
00:02:15.840 | are timely and/or opinion pieces.
00:02:17.840 | So the first real breakout piece was in April
00:02:19.920 | when I also just wrote down the thing
00:02:21.500 | that everyone in AI was feeling, which
00:02:23.080 | is like we're all feeling stressed that we're
00:02:26.120 | going to get scooped and that we're overworked, which
00:02:28.320 | is like behind the curtain what it feels to work in AI.
00:02:32.040 | And then a similar one, which we might touch on later in this,
00:02:34.660 | was about my recent job search, which
00:02:36.280 | wasn't the first time I wrote a job search post.
00:02:38.280 | I love that.
00:02:38.760 | People always love that stuff.
00:02:40.040 | It's so open.
00:02:41.080 | I mean, it's easy for me to do in a way
00:02:44.840 | that it's very on brand, and it's very helpful.
00:02:47.640 | Because I understand that until you've done it,
00:02:50.400 | it's hard to share this information.
00:02:53.360 | And then other popular ones are various model training
00:02:57.360 | techniques or fine tuning.
00:02:58.840 | There's an early one on RLHF, which
00:03:00.960 | is-- this stuff is all just like when I figure it out
00:03:03.600 | in my brain.
00:03:04.200 | So I wrote an article that's like how RLHF actually works,
00:03:07.520 | which is just the intuitions I had put together
00:03:09.680 | in the summer about RLHF.
00:03:11.160 | And that did pretty well.
00:03:13.120 | And then I opportunistically wrote about Q*,
00:03:15.780 | which you hate that you have to do it, but it is pretty funny.
00:03:19.760 | I found that it's like, from a literature perspective,
00:03:23.160 | I'm like, OpenAI publishes on work
00:03:25.040 | that is very related to mathematical reasoning.
00:03:28.040 | So it's like, oh, you just poke a little around what
00:03:30.200 | they've already published, and it seems pretty reasonable.
00:03:33.000 | But we don't know.
00:03:33.760 | They probably just got like a moderate bump
00:03:36.160 | on one of their benchmarks, and then everyone
00:03:38.040 | lost their minds.
00:03:38.920 | It doesn't really matter.
00:03:40.360 | This is why Sam Altman was fired.
00:03:42.680 | I don't know.
00:03:43.320 | Anyway, yeah, we're here to talk about--
00:03:45.560 | RLHF 101, you did a presentation.
00:03:48.660 | And I think you expressed some desire to re-record it.
00:03:51.340 | And that's why I reached out on Twitter saying,
00:03:53.300 | like, why not re-record it with us?
00:03:54.760 | And then we can ask questions and talk about it.
00:03:57.260 | Yeah, sounds good.
00:03:58.460 | I think it's-- I try to do it every six or 12 months
00:04:00.980 | is my estimated cadence, just to refine the ways
00:04:05.020 | that I say things.
00:04:05.860 | And people will see that we don't know that much more,
00:04:08.860 | but we have a bit better way of saying what we don't know.
00:04:12.220 | Yeah, awesome.
00:04:13.660 | We can dive right in.
00:04:14.660 | I don't know if there's any other topics
00:04:16.500 | that we want to lay out as groundwork.
00:04:18.760 | No, you have some awesome slides.
00:04:20.420 | So for people listening on podcast only,
00:04:22.980 | we're going to have the slides on our show notes,
00:04:25.260 | and then we're going to have a YouTube version.
00:04:27.740 | Like and subscribe.
00:04:28.620 | Where we run through everything together.
00:04:30.980 | Sounds good, yeah.
00:04:32.940 | So I think to start skipping a lot of the, like,
00:04:36.120 | what is a language model stuff, everyone
00:04:37.820 | knows that at this point.
00:04:38.860 | I think the quote from the Llama 2 paper
00:04:41.860 | is a great kind of tidbit on RLHF becoming a real deal.
00:04:46.420 | There was some uncertainty earlier in the year
00:04:48.520 | about whether or not RLHF was really going to be important.
00:04:51.140 | I think it was not that surprising that it is.
00:04:55.180 | I mean, with recent models still using it,
00:04:58.500 | the signs were there.
00:04:59.460 | But the Llama 2 paper essentially
00:05:00.900 | reads like a bunch of NLP researchers
00:05:03.260 | that were skeptical and surprised.
00:05:05.460 | So the quote from the paper was, "Meanwhile,
00:05:07.260 | reinforcement learning, known for its instability,
00:05:09.340 | seemed a somewhat shadowy field for those in the NLP research
00:05:12.260 | community.
00:05:13.060 | However, reinforcement learning proved highly effective,
00:05:15.460 | particularly given its cost and time effectiveness."
00:05:19.020 | So you don't really know exactly what the costs and time
00:05:21.560 | that Meta is looking at.
00:05:22.560 | Because they have a huge team and a pretty good amount
00:05:24.820 | of money here to release these Llama models.
00:05:27.100 | But like, this is just the kind of thing that we're seeing now.
00:05:31.100 | I think any major company that wasn't doing RLHF
00:05:33.500 | is now realizing they have to have a team around this.
00:05:37.500 | At the same time, we don't have a lot
00:05:39.420 | of that in the open and research communities at the same scale.
00:05:43.340 | I think seeing that converge would be great.
00:05:45.580 | But it's still very early days.
00:05:48.820 | And the other thing on the slide is some of Anthropic's work.
00:05:51.820 | But everyone knows Anthropic is kind of the masters of this.
00:05:54.400 | And they have some of their own techniques
00:05:56.200 | that we're going to talk about later on.
00:05:57.900 | But that's kind of where we start.
00:06:01.580 | Can we do just a one second RL diversion?
00:06:05.900 | So you come from a robotics background, which RL used to be,
00:06:09.700 | or maybe still is, state of the art.
00:06:11.300 | And then now you're seeing a lot of LLM plus RL.
00:06:14.340 | So you have Jim Fan's Eureka.
00:06:16.940 | You have Imbue, which we had on the podcast when they started
00:06:20.660 | with RL.
00:06:21.140 | Now they're doing RL plus LLMs.
00:06:24.740 | Yeah, any thoughts there on how we got here?
00:06:27.500 | Like maybe how the pendulum will keep swinging?
00:06:31.700 | I really think RL is about a framing
00:06:33.700 | of viewing the world through trial and error learning
00:06:36.020 | and feedback, and really just one
00:06:37.480 | that's focused on thinking about decision making and inputs
00:06:41.180 | in the world and how inputs have reactions.
00:06:44.100 | And in that, a lot of people come
00:06:45.580 | from a lot of different backgrounds,
00:06:47.220 | whether it's physics, electrical engineering,
00:06:48.620 | mechanical engineering.
00:06:49.980 | There are obviously computer scientists.
00:06:51.660 | But compared to other fields of CS,
00:06:53.900 | I do think it's a much more diverse background of people.
00:06:56.900 | Like my background was in electrical engineering
00:06:58.900 | and doing robotics and things like that.
00:07:01.660 | It really just changes the world view.
00:07:04.300 | I think that reinforcement learning, as it was back then,
00:07:09.620 | so to say, is really different, because you're
00:07:12.820 | looking at these toy problems, and the numbers
00:07:15.300 | are totally different.
00:07:16.260 | And everyone went kind of 0 to 1 at scaling these things up.
00:07:20.380 | But people like Jim Fan and other people that were--
00:07:23.300 | you saw this transition in the decision transformer and papers
00:07:26.700 | and when people are trying to use transformers to do decision
00:07:30.340 | making for things like offline RL,
00:07:31.780 | and I think that was kind of like the early days.
00:07:34.380 | But then once language models were so proven,
00:07:37.140 | it's like everyone is using this tool for their research.
00:07:40.300 | I think in the long run, it will still settle out,
00:07:44.100 | or RL will still be a field that people work on,
00:07:46.340 | just because of these kind of fundamental things
00:07:48.340 | that I talked about, that it's just
00:07:50.740 | viewing the whole problem formulation
00:07:52.860 | different than predicting text.
00:07:55.100 | And so there needs to be that separation.
00:07:58.660 | And the view of RL in language models
00:08:01.220 | is pretty contrived already.
00:08:02.500 | So it's not like we're doing real RL.
00:08:05.220 | I think the last slide that I have here
00:08:07.060 | is a way to make RLHF more like what people
00:08:11.380 | would think of with RL, so actually running things
00:08:14.260 | over time.
00:08:15.620 | But it's a weird lineage of tools
00:08:19.300 | that happen to get us to where we are,
00:08:20.900 | so that's why the name takes up so much space.
00:08:23.900 | But it could have gone a lot of different ways.
00:08:27.900 | Cool.
00:08:28.380 | We made it one slide before going on a tangent.
00:08:32.420 | Yeah, I mean, it's kind of related.
00:08:35.380 | Yeah, so we have a history of RL.
00:08:37.620 | Yeah, so I recently--
00:08:39.660 | to give the context, this paper really
00:08:41.300 | started because I've had this more diverse background
00:08:44.020 | than some computer scientists, which
00:08:45.620 | is trying to understand what the difference between a cost
00:08:48.180 | function, a reward function, and a preference function
00:08:51.060 | would be, without going into all of the details.
00:08:54.420 | Costs are normally things that control theorists
00:08:56.500 | would work with in these kind of closed domains.
00:08:58.620 | And then reinforcement learning has always
00:09:00.380 | worked with rewards that's central to the formulation
00:09:02.660 | that we'll see.
00:09:03.300 | And then the idea was like, OK, we now are at preferences.
00:09:06.260 | And each step along the way, there's
00:09:07.740 | kind of different assumptions that you're making.
00:09:10.060 | We'll get into these.
00:09:10.900 | And those assumptions are built on other fields of work.
00:09:14.420 | So that's what this slide is getting to say.
00:09:16.260 | It's like RLHF, while directly building
00:09:18.340 | on tools from RL and language models,
00:09:20.340 | is really implicitly impacted and built
00:09:24.060 | on theories and philosophies spanning tons of human history.
00:09:29.940 | I think we cite Aristotle in this paper, which is fun.
00:09:32.820 | It's like going pre-BC.
00:09:35.500 | It's like 2,300 years old or something like that.
00:09:38.220 | So that's the reason to do this.
00:09:39.820 | I think we kind of list some things in the paper
00:09:42.700 | about summarizing what different presumptions of RLHF could be.
00:09:46.860 | I think going through these is actually kind of funny.
00:09:50.740 | It's fun to talk about these, because they're
00:09:52.580 | kind of grab bags of things that you'll
00:09:55.180 | see return throughout this podcast that we're
00:09:57.540 | talking about it.
00:09:58.820 | The core thing of RLHF, in order to be a believer in this,
00:10:02.380 | is that RL actually works.
00:10:04.140 | It's like, if you have a reward function,
00:10:05.820 | you can optimize it in some way and get a different performance
00:10:08.440 | out of it.
00:10:09.260 | And you could do this at scale.
00:10:10.380 | And you could do this in really complex environments, which
00:10:13.460 | I don't know how to do that in all the domains.
00:10:15.380 | I don't know how to exactly make ChatGPT.
00:10:17.980 | So it's kind of-- we'll overshadow everything.
00:10:19.900 | And then there's go from something kind of obvious
00:10:22.020 | like that.
00:10:22.540 | And then you read the von Neumann-Morgenstern utility
00:10:27.220 | theorem, which is essentially an economic theory that
00:10:30.500 | says you can weight different probabilities
00:10:33.140 | of different people, which is a theoretical piece of work that
00:10:36.660 | is the foundation of utilitarianism.
00:10:38.380 | And trying to quantify preferences
00:10:41.020 | is crucial to doing any sort of RLHF.
00:10:44.260 | And if you look into this, all of these things,
00:10:47.980 | there's way more you could go into if you're
00:10:49.820 | interested in any of these.
00:10:50.940 | So this is kind of like grabbing a few random things.
00:10:53.100 | And then kind of similar to that is the Bradley-Terry model,
00:10:55.380 | which is the fancy name for the pairwise preferences
00:10:57.700 | that everyone is doing.
00:10:59.500 | And then all the things that are like that
00:11:01.500 | Anthropic and OpenAI figured out that you can do,
00:11:03.660 | which is that you can aggregate preferences
00:11:05.420 | from a bunch of different people and different sources.
00:11:07.780 | And then when you actually do RLHF,
00:11:09.580 | you extract things from that data.
00:11:11.460 | And then you train a model that works somehow.
00:11:13.620 | And we don't know-- there's a lot of complex links there.
00:11:16.940 | But if you want to be a believer in doing this at scale,
00:11:19.580 | these are the sorts of things that you
00:11:21.220 | have to accept as preconditions for doing RLHF.
00:11:26.260 | Yeah.
00:11:26.780 | You have a nice chart of the sort of intellectual history
00:11:29.580 | of RLHF that we'll send people to refer to,
00:11:32.260 | either in your paper or in the YouTube video for this podcast.
00:11:35.740 | But I like the other slide that you have on the presumptions
00:11:38.500 | that you need to have for RLHF to work.
00:11:40.860 | You already mentioned some of those.
00:11:43.180 | And I don't know, do you think that any one of them
00:11:45.900 | are-- which one's underappreciated?
00:11:49.620 | This is the first time I've come across the VNM utility
00:11:51.980 | theorem.
00:11:52.760 | Yeah, I know.
00:11:53.300 | This is what you get from working with people.
00:11:55.260 | Like, to my co-host on the podcast,
00:11:56.980 | the retort is that he's a sociologist by training.
00:11:59.380 | So he knows all these things and who
00:12:00.960 | the philosophers are that found these different things,
00:12:03.940 | like utilitarianism.
00:12:05.060 | But there's a lot that goes into this.
00:12:07.860 | Essentially, there's even economic theories
00:12:09.740 | that-- there's debate whether or not preferences exist at all.
00:12:12.980 | And there's different types of math
00:12:15.140 | you can use with whether or not you actually
00:12:16.980 | can model preferences at all.
00:12:18.460 | So it's pretty obvious that RLHF is built
00:12:20.420 | on the math that thinks that you can actually
00:12:22.660 | model any human preference.
00:12:24.260 | But this is the sort of thing that's
00:12:25.760 | debated-- been debated for a long time.
00:12:27.420 | So all the work that's here is like--
00:12:29.140 | and people hear about in their AI classes.
00:12:31.100 | So like Jeremy Bentham, like hedonic calculus,
00:12:33.140 | and all these things.
00:12:34.780 | Like, these are the side of work where
00:12:36.400 | people assume that preferences can be measured.
00:12:38.420 | And this is-- like, I don't really know.
00:12:40.540 | Like, when you look at-- this is where I kind of go on a rant
00:12:43.420 | and I say that in RLHF, calling things a preference model
00:12:46.340 | is a little annoying.
00:12:47.220 | Because there's no inductive bias of what a preference is.
00:12:50.240 | It's like if you were to learn a robotic system,
00:12:52.360 | and you learned a dynamics model,
00:12:53.740 | like, hopefully, that actually mirrors the world
00:12:56.260 | in some way of the dynamics.
00:12:57.820 | But with a preference model, it's
00:12:59.240 | like, oh, I don't know what this model--
00:13:01.100 | like, I don't know what ChatGPT encodes
00:13:02.620 | as any sort of preference or what
00:13:04.040 | I would want it to be in a fair way.
00:13:05.540 | Anthropic has done more work on trying
00:13:07.080 | to write these things down.
00:13:08.860 | But even, like, if you look at Claude's constitution,
00:13:11.980 | like, that doesn't mean the model believes these things.
00:13:14.660 | It's just trained to prioritize these things.
00:13:17.120 | And that's kind of what the later points,
00:13:18.860 | I'm looking at, like, what RLHF is doing
00:13:20.980 | and if it's actually, like, a repeatable process in the data
00:13:24.300 | and in the training.
00:13:25.820 | That's just unknown.
00:13:26.660 | And we have a long way to go before we
00:13:28.820 | understand what this is and the link between preference
00:13:32.340 | data and any notion of, like, writing down a specific value.
00:13:36.380 | The disconnection between more, you know,
00:13:39.700 | sociology work versus computer work already exists?
00:13:43.780 | Or is it, like, a recent cross-contamination?
00:13:46.460 | Because when we had Tri Dao on the podcast,
00:13:48.620 | he said FlashAttention came to be
00:13:51.060 | because at Hazy, they have so much overlap between systems
00:13:53.820 | engineers and, like, deep learning engineers.
00:13:56.220 | Like, is it the same in this field?
00:13:59.140 | There are a lot of people-- so I've
00:14:00.560 | gone to a couple of workshops where
00:14:02.100 | these-- the populations of people
00:14:03.940 | who you'd want to include in this, like, are.
00:14:05.820 | I think the reason why it's not really talked about
00:14:08.100 | is just because the RLHF techniques that people use
00:14:11.420 | were built in, like, labs like OpenAI and DeepMind,
00:14:15.180 | where there are some of these people.
00:14:17.740 | These places do a pretty good job
00:14:19.100 | trying to get these people in the door
00:14:20.700 | when you compare them to, like, startups or normal startups.
00:14:23.200 | But, like, they're not bringing in, like, academics
00:14:26.580 | from economics, like, social choice theory.
00:14:29.300 | There's just too much.
00:14:30.820 | Like, the criticism of this paper that this is based on
00:14:33.380 | is, like, oh, you're missing these things in RL or this
00:14:35.980 | decade of RL.
00:14:36.740 | And it's like, well, it would literally
00:14:38.860 | be bigger than the Sutton and Barto book
00:14:40.820 | if you were to include everyone.
00:14:42.340 | So it's really hard to include everyone in a principled manner
00:14:46.020 | when you're designing this.
00:14:47.220 | It's just a good way to understand and improve
00:14:51.180 | the communication of what RLHF is
00:14:53.100 | and, like, what is a good reward model for society.
00:14:56.340 | It really probably comes down to what an individual wants.
00:14:59.380 | And it'll probably motivate models
00:15:01.420 | to move more in that direction and just
00:15:03.080 | be a little bit better about the communication, which
00:15:05.300 | is a recurring theme in my work, is, like, I just
00:15:07.660 | get frustrated when people say things that don't really
00:15:10.500 | make sense, especially when it's going to, like, manipulate
00:15:13.300 | individuals' values or manipulate the general view of AI
00:15:17.080 | or anything like this.
00:15:18.000 | So that's kind of why RLHF is so interesting.
00:15:21.220 | It's, like, it's very vague in its actual--
00:15:25.420 | in what it's actually doing, while the problem specification
00:15:28.020 | is very general.
00:15:29.660 | So reinforcement learning, I kind of mentioned this.
00:15:31.820 | It's a trial and error type of system.
00:15:35.980 | The diagram in the slides is really
00:15:37.980 | the classic thing where you have an agent interacting
00:15:40.260 | with an environment.
00:15:41.100 | So it's kind of this agent has some input
00:15:43.220 | to the environment, which is called the action.
00:15:45.540 | The environment returns a state and a reward.
00:15:49.140 | And that repeats over time.
00:15:50.980 | And the agent learns based on these states and these rewards
00:15:54.140 | that it's seeing.
00:15:55.300 | And it should learn a policy that makes the rewards go up.
00:15:58.500 | That seems pretty simple.
00:16:00.740 | If you try to mentally map what this looks like in language,
00:16:03.380 | which is slide seven, is that, like, the language models
00:16:07.540 | don't make this easy.
00:16:08.740 | I think with a language model, it's
00:16:10.160 | very hard to define what an environment is.
00:16:12.660 | So if the language model is the policy and it's generating,
00:16:16.260 | it's like the environment should be a human.
00:16:18.580 | But setting up the infrastructure
00:16:20.420 | to take tens of thousands of prompts and generate them
00:16:24.020 | and then show them to a human and collect the human responses
00:16:26.740 | and then shove that into your training architecture
00:16:29.380 | is very far away from working.
00:16:31.100 | So we don't really have an environment.
00:16:32.680 | We just have a reward model that returns a reward.
00:16:35.300 | And the state doesn't really exist.
00:16:36.820 | When you look at it like an RL problem, what happens
00:16:40.420 | is the state is a prompt.
00:16:42.500 | And then you do a completion.
00:16:44.020 | And then you throw it away.
00:16:44.780 | And you grab a new prompt.
00:16:46.020 | We're really in, like, RL.
00:16:48.060 | As an RL researcher, you would think of this
00:16:49.860 | as being like you take a state.
00:16:51.100 | You get some completion from it.
00:16:54.200 | And then you look at what that is.
00:16:55.620 | And you keep kind of iterating on it.
00:16:56.960 | And all of that isn't here, which
00:16:58.380 | is why you'll hear RLHF referred to as a bandit problem, which
00:17:01.340 | is kind of like you choose one action.
00:17:03.100 | And then you watch the dynamics play out.
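
To make that bandit framing concrete, here is a minimal, self-contained sketch in Python; the generate, reward, and update functions are illustrative stand-ins for a policy, a learned reward model, and an optimizer step, not any particular library's API.

```python
# A minimal sketch of the "bandit" framing described above: the "state" is a
# prompt, the "action" is a whole completion, and a learned reward model
# stands in for the environment. The stubs below are illustrative
# placeholders, not a real policy or reward model.

def generate(policy_params, prompt):
    # Stand-in for sampling a completion from the current policy.
    return prompt + " ... some completion"

def reward_model(prompt, completion):
    # Stand-in for a learned scalar reward; here, just prefer longer answers.
    return float(len(completion))

def update(policy_params, prompt, completion, reward):
    # Stand-in for a PPO-style policy update using the scalar reward.
    return policy_params

policy_params = {}
for prompt in ["Explain RLHF briefly.", "Write a haiku about RL."]:
    completion = generate(policy_params, prompt)        # one "action", no rollout
    reward = reward_model(prompt, completion)            # no real environment
    policy_params = update(policy_params, prompt, completion, reward)
    # The episode ends here; the next prompt is unrelated to this one.
```
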
00:17:04.860 | There's many more debates that you can have in this.
00:17:10.260 | If you get the right RL people in the room,
00:17:12.300 | then kind of like this is an RL even when you zoom
00:17:16.300 | into what RLHF is doing.
00:17:18.460 | Does this change as you think about a chain of thought,
00:17:21.940 | reasoning, and things like that?
00:17:23.340 | Does the state become part of the chain
00:17:26.140 | that you're going through?
00:17:28.080 | There's work that I mentioned on one slide called process reward
00:17:30.820 | models that essentially rewards each step in the chain
00:17:34.220 | of thought reasoning, which it doesn't really
00:17:37.380 | give the part of interaction.
00:17:39.180 | But it does make it a little bit more fine-grained, where
00:17:41.520 | you can think about calling it at least you have many states
00:17:45.300 | from your initial state.
00:17:47.020 | That formulation I don't think people have fully settled on.
00:17:49.860 | I think there's a bunch of great work out there.
00:17:52.420 | Even OpenAI is releasing a lot of this.
00:17:54.300 | And Let's Verify Step-by-Step is their pretty great paper.
00:17:58.060 | On the matter, I think in the next year,
00:18:01.280 | that'll probably get made more concrete by the community
00:18:06.240 | on if you can easily draw out if chain of thought reasoning
00:18:10.200 | is more like RL.
00:18:13.080 | RLHF for decision making.
00:18:15.000 | You have a slide here that compares
00:18:16.800 | pre-deep RL versus deep RL.
00:18:19.520 | Yeah, this is just to say that this
00:18:21.120 | is getting into the history of things, which
00:18:22.960 | is showing that the work that people are using now
00:18:25.720 | really came from well outside of NLP.
00:18:28.840 | And it came before deep learning was big.
00:18:30.800 | And the step from this paper, Tamer, which is from 2008,
00:18:36.120 | some names that are still really relevant and kind
00:18:38.480 | of human-centric RL, Bradley Knox and Peter Stone,
00:18:43.640 | if you have an agent take an action,
00:18:46.120 | you would just have a human give a score from 0 to 1
00:18:48.760 | as a reward, rather than having a reward function.
00:18:50.920 | And then with that classifier, you
00:18:52.300 | can do something with a policy that
00:18:54.560 | learns to take actions to maximize that reward.
00:18:56.960 | It's a pretty simple setup.
00:18:58.200 | It works in simple domains.
00:19:00.400 | And then the reason why this is interesting
00:19:02.400 | is you compare it to the paper that everyone knows,
00:19:04.520 | which is this Paul Christiano et al.
00:19:06.200 | Deep Reinforcement Learning from Human Preferences paper,
00:19:08.440 | which is where they showed that learning from human preferences
00:19:11.680 | you can solve the basic RL tasks at the time.
00:19:14.400 | So various control problems and simulation
00:19:16.680 | and this kind of human preferences approach
00:19:20.300 | had higher rewards in some environments
00:19:22.600 | than if you just threw RL at the environment that
00:19:25.120 | returned a reward.
00:19:26.440 | So the preferences thing was you took two trajectories.
00:19:29.800 | So in this case, it was complete trajectories of the agent.
00:19:32.640 | And the human was labeling which one is better.
00:19:34.680 | And you can see how this kind of comes
00:19:36.380 | to be like the pairwise preferences that are used today
00:19:38.720 | that we'll talk about.
00:19:40.280 | And there's also a really kind of interesting nugget
00:19:42.800 | that is the trajectory that the humans were labeling over
00:19:45.920 | has a lot more information than the RL algorithm
00:19:48.000 | would see if you just had one state, which
00:19:50.120 | is kind of why people think that it's
00:19:52.000 | like why the performance in this paper was so strong.
00:19:54.880 | But I still think that it's surprising
00:19:56.480 | that there isn't more RL work of this style happening now.
00:20:01.000 | This paper is in 2017, so it's like six years later.
00:20:03.600 | And I haven't seen things that are exactly similar,
00:20:06.440 | but it's a great paper to understand
00:20:08.600 | where stuff that's happening now kind of came from.
00:20:11.360 | And that's what the next few slides kind of go into.
00:20:14.080 | Just on the Christiano paper, you mentioned the performance
00:20:17.400 | being strong.
00:20:17.960 | I don't remember-- what results should I have in mind
00:20:20.520 | when I think about that paper?
00:20:22.080 | It's mostly like if you think about an RL learning curve,
00:20:24.440 | which is like on the x-axis, you have environment interactions.
00:20:27.280 | On the y-axis, you have performance.
00:20:29.000 | You can think about different like ablation studies
00:20:31.120 | of between algorithms.
00:20:32.040 | So I think they use like A2C, which I don't even
00:20:34.120 | remember what that stands for, as their baseline.
00:20:36.600 | But if you do the human preference version
00:20:38.640 | on a bunch of environments like the human preference labels,
00:20:42.560 | the agent was able to learn faster
00:20:44.760 | than if it just learned from the signal from the environment,
00:20:47.800 | which means like the setup does--
00:20:50.120 | it's happening because the reward model has
00:20:53.240 | more information than the agent would.
00:20:55.760 | But like the fact that it can do better,
00:20:57.680 | I was like, that's pretty surprising to me
00:20:59.560 | because RL algorithms are pretty sensitive.
00:21:02.040 | So I was like, OK, yeah.
00:21:05.040 | Which is just one thing I do want
00:21:06.560 | to establish as a baseline for our listeners.
00:21:10.080 | Like we are updating all the weights, right?
00:21:13.280 | Like this is, in some sense, the next token prediction
00:21:18.320 | task of training a language model
00:21:20.360 | is a form of reinforcement learning,
00:21:22.600 | except that it's not from human feedback.
00:21:24.280 | It's just self-supervised learning
00:21:26.880 | from a general corpus.
00:21:28.440 | Yeah.
00:21:30.320 | There's one distinction which I love,
00:21:33.080 | which is that you can actually give negative feedback,
00:21:35.680 | whereas in a general sort of pre-training situation,
00:21:39.400 | you cannot.
00:21:41.000 | And maybe the order of magnitude of feedback,
00:21:43.360 | like the Likert scale that you're
00:21:44.760 | going to talk about in future slides,
00:21:46.600 | that actually just gives more signal
00:21:48.360 | than a typical training process would do in a language model
00:21:51.400 | setting.
00:21:53.400 | Yeah, I don't think I'm the right person to comment exactly,
00:21:56.360 | but you can make analogies that reinforcement learning is
00:21:59.280 | self-supervised learning as well.
00:22:00.960 | There are a lot of things that will point to that.
00:22:04.400 | I don't know whether or not it's a richer signal.
00:22:06.400 | I think that could be seen in the results,
00:22:09.160 | but I think it's a good thing for people to look into more.
00:22:14.240 | It's like as reinforcement learning is so much less
00:22:18.360 | compute, it is a richer signal in terms of its impact,
00:22:21.640 | because if they could do what RLHF is doing at pre-training,
00:22:24.200 | they would, but they don't know how
00:22:26.080 | to have that effect in a stable manner.
00:22:28.840 | Otherwise, everyone would do it.
00:22:30.520 | So on a practical basis, as someone fine-tuning models,
00:22:34.800 | I have often wished for negative fine-tuning, which pretty much
00:22:38.120 | doesn't exist in OpenAI land, and it's not
00:22:41.800 | the default setup in open-source land.
00:22:43.640 | How does this work in diffusion models and stuff?
00:22:45.840 | Because you can give negative prompts to something,
00:22:48.760 | to stable diffusion or whatever.
00:22:51.080 | That's for guidance.
00:22:51.960 | That's for clip guidance.
00:22:53.160 | Is that just from how they prompt it then?
00:22:55.280 | I don't know.
00:22:55.880 | I'm just wondering if we could do something similar.
00:22:58.080 | It's another tangent.
00:23:00.200 | Anyway, so I do want to spell that out for people
00:23:02.920 | in case they haven't made the connection between RLHF
00:23:05.600 | and the rest of the training process
00:23:07.080 | that they might have some familiarity with.
00:23:09.120 | Yeah, so these coming slides can really
00:23:11.920 | dig into this, which is like this 2018 paper that
00:23:14.560 | was a position paper from a bunch of the same authors
00:23:17.600 | from the Christiano paper and from the OpenAI work
00:23:20.960 | that everyone knows, which is like they write a position
00:23:25.120 | paper on what a reward model could do
00:23:27.640 | to solve alignment for agents.
00:23:30.160 | It's kind of based on two assumptions.
00:23:31.680 | The first assumption is that we can learn user intentions
00:23:34.040 | to a sufficiently high accuracy.
00:23:36.400 | That doesn't last with me, because I
00:23:38.360 | don't know what that means.
00:23:39.520 | But the second one is pretty telling
00:23:41.020 | in the context of RLHF, which is for many tasks
00:23:43.120 | we want to solve, evaluation of outcomes
00:23:44.800 | is easier than producing the correct behavior.
00:23:46.800 | And this is the whole thing.
00:23:47.960 | It's like we can compare two poems that the model generates,
00:23:51.040 | and it can be viewed as liking a positive example,
00:23:57.280 | or it could be viewed as really disliking a negative example.
00:24:00.200 | And that's what I think a lot of people
00:24:02.400 | are doing in the harm space, is a harmful response
00:24:06.040 | to a language model, whether or not
00:24:07.160 | you agree with the company's definition of harms,
00:24:09.240 | is that it's a really bad negative example.
00:24:12.800 | And they downweight them by preferring something
00:24:15.640 | more benign in the RLHF process, among other ways
00:24:18.760 | of dealing with safety.
00:24:20.080 | So this is a good way of saying it's like this is core.
00:24:23.200 | This kind of comparison and positive or negative example
00:24:26.240 | is core to all of the RLHF work that has continued.
00:24:29.760 | Yeah.
00:24:30.840 | Maybe I'll try to put a more colloquial restatement of this.
00:24:34.600 | People often say, I don't know what I want,
00:24:36.440 | but I'll know when I see it.
00:24:37.880 | This is that expressed in [INAUDIBLE]
00:24:40.400 | Yeah, it is.
00:24:41.240 | Yeah, it is.
00:24:41.800 | That's what everyone's doing in the preference modeling
00:24:44.640 | stage that we'll get to.
00:24:45.880 | Yeah.
00:24:47.320 | Yeah, and you can see there are more papers.
00:24:49.240 | This is really just to have all the links for people
00:24:53.400 | that go deeper.
00:24:54.080 | There's a Ziegler et al paper in 2019,
00:24:57.360 | which shows that you can do this RLHF process on language
00:25:00.240 | models.
00:25:01.680 | This familiar diagram starts to emerge in 2019.
00:25:04.280 | It's just to show that this goes really far back.
00:25:06.320 | I think we can kind of breeze through some of these.
00:25:08.520 | And then 2020 is the first open AI experiment
00:25:10.960 | that I think caught people's eyes, which
00:25:12.640 | is this learning to summarize experiment.
00:25:15.080 | It has this three-step process that
00:25:17.560 | we'll go into more when I kind of go into the main concepts.
00:25:20.840 | But it's like the first time you see this diagram that they
00:25:23.480 | reuse with InstructGPT.
00:25:25.080 | They reuse with ChatGPT.
00:25:27.280 | And the types of examples that they would have--
00:25:29.280 | I don't think I need to read these exactly,
00:25:31.360 | but one that I have read a whole bunch of times
00:25:33.680 | is they took these prompts from Reddit that was like,
00:25:37.320 | explain like I'm five or get career advice.
00:25:39.920 | And people really pour their heart and soul into these.
00:25:42.840 | So these are like multi-paragraph pieces of writing.
00:25:45.480 | And then they essentially do comparisons
00:25:47.600 | between a vanilla language model.
00:25:49.400 | I think it was, what's the timeline?
00:25:51.520 | Either GPT-2 or GPT-3.
00:25:53.320 | I always get the exact--
00:25:54.780 | 3 was early 2020, so that's about right.
00:25:57.400 | Yeah, so this is probably done with GPT-2.
00:26:00.360 | It doesn't really matter.
00:26:01.480 | But the language model does normal things.
00:26:03.280 | You do a few shot, which is like it repeats itself.
00:26:05.720 | It doesn't have nice text.
00:26:07.560 | And what they did is that this was the first time
00:26:09.920 | where the language model would generate pretty nice text
00:26:13.200 | from an output.
00:26:14.440 | It was restricted to the summarization domain.
00:26:17.280 | But I think that--
00:26:19.000 | this is where I wish I was paying attention more,
00:26:21.160 | because I would see the paper, but I
00:26:22.800 | didn't know to read the language model outputs
00:26:25.320 | and kind of understand this qualitative sense of the models
00:26:29.600 | very well then.
00:26:30.360 | Because you look at the plots in the papers.
00:26:33.240 | Learning to Summarize and InstructGPT
00:26:35.200 | have incredibly pretty plots, just
00:26:37.720 | with nicely separated lines with error bars.
00:26:41.400 | And they're like super fine-tuning works.
00:26:44.120 | The RL step works.
00:26:45.320 | But if you were early to see how different the language that
00:26:49.520 | was written by these models was, I
00:26:51.320 | think you could have been early to things like ChatGPT
00:26:54.520 | and knowing RLHF would matter.
00:26:56.840 | But that's now, I think, obvious.
00:26:59.240 | The good people know to chat with language models,
00:27:01.680 | but not even everyone does this.
00:27:03.200 | Like, people are still looking at numbers.
00:27:04.920 | And I think OpenAI probably figured it out
00:27:07.600 | when they were doing this, how important that could be.
00:27:09.920 | And then they had years to kind of chisel away at that.
00:27:13.160 | And that's why they're doing so well now.
00:27:15.440 | Yeah, I mean, arguably, it's well-known
00:27:17.280 | that ChatGPT was kind of an accident,
00:27:18.840 | that they didn't think it would be that big of a deal.
00:27:21.360 | Yeah.
00:27:21.920 | So maybe they didn't.
00:27:22.920 | Maybe they didn't, but they were getting the proxy
00:27:25.040 | that they needed.
00:27:26.520 | I've heard off the record from other labs
00:27:28.880 | that it was in the air.
00:27:30.000 | If OpenAI didn't do it, someone else would have done it.
00:27:32.360 | Yeah.
00:27:33.240 | So you've mentioned a couple of other papers
00:27:35.360 | that are very seminal to this period.
00:27:37.160 | And I love how you say way back when in referring to 2019.
00:27:40.480 | It feels like it in my life.
00:27:42.760 | So how much should people understand
00:27:45.360 | the relationship between RLHF, instruction tuning,
00:27:47.920 | PPO, KL divergence, anything like that?
00:27:50.840 | Like, how would you construct the level of knowledge
00:27:54.960 | that people should dive into?
00:27:56.240 | Like, what should people know at the high level?
00:27:58.320 | And then if people want to dive in deeper, where do they go?
00:28:01.520 | Like, is instruction tuning important here?
00:28:06.400 | Or is that part of the overall process
00:28:08.680 | towards modern RLHF?
00:28:11.200 | I think for most people, instruction tuning
00:28:13.160 | is probably still more important in their day-to-day life.
00:28:15.640 | I think instruction tuning works very well.
00:28:19.120 | You can write samples by hand that make sense.
00:28:22.040 | You can get the model to learn from them.
00:28:24.360 | You could do this with very low compute.
00:28:26.800 | It's easy to do almost in no-code solutions at this point.
00:28:31.040 | And the loss function is really straightforward.
00:28:33.480 | And then if you're interested in RLHF,
00:28:37.920 | you can kind of learn from it from a different perspective,
00:28:40.200 | which is like how the instruction tuning distribution
00:28:42.680 | makes it easier for your RLHF model to learn.
00:28:45.480 | There's a lot of details depending
00:28:47.280 | on your preference data, if it's close to your instruction
00:28:50.440 | model or not, if that matters.
00:28:52.520 | But that's really at the RLHF stage.
00:28:54.080 | So I think it's nice to segment and just kind of understand
00:28:56.440 | what your level of investment and goals are.
00:28:58.840 | I think instruction tuning still can do most
00:29:01.240 | of what you want to do.
00:29:02.720 | And if you want to think about RLHF,
00:29:05.200 | at least before DPO really had taken off at all,
00:29:08.720 | it would be like, do you want to have a team of at least five
00:29:11.640 | people if you're really thinking about doing RLHF?
00:29:14.600 | I think DPO makes it a little bit easier.
00:29:16.440 | But that's still really limited to kind of one data set
00:29:18.640 | that everyone's using at this point.
00:29:20.080 | Everyone's using this ultra-feedback data set.
00:29:21.960 | And it boosts AlpacaEval, MT-Bench, TruthfulQA,
00:29:26.320 | and the qualitative model a bit.
00:29:28.000 | We don't really know why.
00:29:29.000 | And it's like, it might just be that data set combined
00:29:31.280 | with the method.
00:29:32.520 | But you've got to be ready for a bumpy ride
00:29:34.600 | if you're wanting to try to do RLHF.
00:29:36.760 | I don't really recommend most startups
00:29:39.160 | to do it unless it's going to provide them
00:29:41.600 | a clear competitive advantage in their kind of niche.
00:29:45.560 | Because you're not going to make your model ChatGPT-like better
00:29:48.640 | than OpenAI or anything like that.
00:29:50.360 | You've got to accept that there's some exploration there.
00:29:53.400 | And you might get a vein in your specific--
00:29:55.960 | like a vein of benefit in your specific domain.
00:29:58.360 | But I'd still be careful going into the RLHF can of worms.
00:30:03.160 | You probably don't need to.
00:30:05.080 | OK, so there's a bit of a time skip in what you mentioned.
00:30:07.720 | DPO is like a couple of months old.
00:30:09.680 | So we'll leave that towards the end.
00:30:12.480 | I think the main result that I think
00:30:14.600 | most people talk about at this stage--
00:30:16.400 | we're talking about September 2020 and then going into,
00:30:18.760 | I guess, maybe last year--
00:30:20.720 | was Vicuña as one of the more interesting applications
00:30:25.320 | of instruction tuning that pushed LLAMA 1 from,
00:30:30.200 | let's say, a GPT-3-ish model to a GPT-3.5 model
00:30:34.040 | in pure open source with not a lot of resources.
00:30:36.600 | I think-- I mean, they said something like they
00:30:39.200 | used under $100 to make this.
00:30:41.480 | Yeah, instruction tuning can really go a long way.
00:30:44.160 | I think the claims of ChatGPT level
00:30:47.360 | are long overblown in most of the things in open source.
00:30:51.000 | I think it's not to say--
00:30:53.920 | like, Vicuña was a huge step.
00:30:55.760 | And it's just kind of showing that instruction
00:30:58.280 | tuning with the right data will completely
00:31:00.520 | change what it feels like to talk with your model.
00:31:03.720 | From text completion to actually chatting
00:31:06.160 | back and forth, multi-turn.
00:31:08.400 | Yeah, instruction tuning can be multi-turn.
00:31:10.360 | Just having a little bit of data that's a couple turns
00:31:12.760 | can go a really long way.
00:31:14.560 | And it's, I think, people--
00:31:16.580 | that was like the story of the whole first part of the year
00:31:18.920 | is people will be surprised by how far you can take
00:31:21.480 | instruction tuning on a small model.
00:31:23.680 | I think the things that people see now
00:31:26.040 | is the small models don't really handle nuance as well.
00:31:29.280 | And they could be more repetitive,
00:31:30.880 | even if they have really good instruction tuning.
00:31:32.920 | But if you take that kind of 7 to 70 billion parameter jump,
00:31:36.600 | the instruction tuning at the bigger model is robustness.
00:31:40.440 | Little things make more sense.
00:31:42.320 | But that's still just with instruction tuning
00:31:44.160 | and scale more than anything else.
00:31:46.680 | Yeah, excellent.
00:31:49.040 | Shall we go to technical overview?
00:31:51.560 | Yeah, this is kind of where we go through my own version
00:31:54.040 | of this three-phase process.
00:31:55.600 | You can talk about instruction tuning, which
00:31:57.480 | we've talked about a lot.
00:31:58.720 | It's funny because all these things, instruction tuning
00:32:00.520 | has the fewest slides, even though it's
00:32:02.360 | the most practical thing for most people.
00:32:05.120 | We could save the debate for if the big labs still
00:32:07.680 | do instruction tuning for later.
00:32:09.120 | But that's a coming wave for people.
00:32:12.080 | And then like preference data and training,
00:32:14.320 | and then what does reinforcement learning optimization actually
00:32:17.920 | mean.
00:32:18.680 | We talk about these sequentially because you really
00:32:20.840 | have to be able to do each of them
00:32:22.200 | to be able to do the next one.
00:32:23.400 | You need to be able to have a model that's
00:32:25.160 | chatty or helpful, instruction following.
00:32:27.360 | Every company has their own word that they
00:32:29.120 | like to assign to what instructions mean.
00:32:31.120 | And then once you have that, you can collect preference data
00:32:33.620 | and do some sort of optimization.
00:32:35.160 | When you say word, you mean like angle bracket, inst?
00:32:39.120 | Or do you mean something else?
00:32:40.520 | Oh, I don't even know what inst means.
00:32:42.120 | But I'm just saying they use their adjective that they like.
00:32:45.960 | I think entropic, also like steerable, is another one.
00:32:48.560 | I see, I see, I see.
00:32:49.480 | Just the way they describe it.
00:32:50.760 | Yeah.
00:32:51.840 | Yeah, so instruction tuning, we've covered most of this.
00:32:54.280 | It's really about you should try to adapt
00:32:56.240 | your models to specific needs.
00:32:57.640 | It makes models that were only OK extremely comprehensible.
00:33:02.120 | A lot of the times, it's where you start
00:33:04.200 | to get things like chat templates.
00:33:05.620 | So if you want to do system prompts,
00:33:07.120 | if you want to ask your model, act like a pirate.
00:33:10.380 | That's one of the ones I always do, which is always funny.
00:33:12.800 | But whatever you-- act like a chef, like anything.
00:33:16.920 | This is where those types of things
00:33:19.320 | that people really know in language models
00:33:22.160 | start to get applied.
00:33:23.360 | So it's good as a kind of starting point,
00:33:26.360 | because this chat template is used in RLHF and all
00:33:29.320 | of these things down the line.
00:33:31.640 | But there's a basic pointer.
00:33:33.600 | It's like, once you see this with instruction tuning,
00:33:36.420 | you really know it, which is like you take things like stack
00:33:38.840 | overflow, where you have a question and an answer.
00:33:40.880 | You format that data really nicely.
00:33:43.120 | You push it through the model.
00:33:44.360 | The model then kind of knows what to do.
00:33:46.720 | When somebody asks a question, there's much more--
00:33:50.080 | there's surely kind of more tricky things that people do.
00:33:53.480 | But I still think the vast majority of it
00:33:55.240 | is question answer.
00:33:56.200 | It's like, please explain this topic to me.
00:33:59.160 | Generate this thing for me.
00:34:00.560 | That hasn't changed that much this year.
00:34:02.480 | I think people have just gotten better at kind of scaling up
00:34:05.880 | the data that they need.
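
As a concrete illustration of the "format the data really nicely" step described above, here is a small hedged example of packing a question/answer pair into a chat-style training string; the role markers are generic assumptions, since every model family uses its own template.

```python
# A small illustration of turning a question/answer pair into a single
# chat-style training string. The role markers below are generic assumptions;
# real chat templates differ per model family.
question = "How do I reverse a list in Python?"
answer = "You can use slicing, my_list[::-1], or my_list.reverse() in place."

system_prompt = "You are a helpful assistant."  # optional system message

example = (
    f"<|system|>\n{system_prompt}\n"
    f"<|user|>\n{question}\n"
    f"<|assistant|>\n{answer}"
)

# Instruction tuning then minimizes the standard next-token prediction loss on
# strings like `example`, often masking the loss to the assistant's tokens.
print(example)
```
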
00:34:08.760 | Yeah, this is where this talk will kind of take a whole left
00:34:11.720 | turn into more technical detail land.
00:34:15.400 | I put a slide with the RHF objective, which I think
00:34:18.320 | is good for people to know.
00:34:19.560 | I've started going back to this more
00:34:21.720 | to just kind of understand what is trying to happen here
00:34:25.480 | and what type of math people could do.
00:34:27.800 | I think because of this algorithm,
00:34:29.480 | we've mentioned this, it's in the air,
00:34:31.060 | direct preference optimization.
00:34:32.400 | But everything kind of comes from an equation
00:34:34.280 | of trying to learn a policy that maximizes the reward.
00:34:38.760 | The reward is some learned metric.
00:34:40.680 | A lot can be said about what the reward should
00:34:42.640 | be subject to some constraint, which the most popular
00:34:45.960 | constraint is the KL-distraint, which is just
00:34:48.000 | a distributional distance.
00:34:50.160 | Essentially, in language models, that
00:34:51.620 | means if you have a completion from your instruction or RLHF
00:34:55.600 | model, you can compare that completion to a base model.
00:34:59.400 | And looking at the log probs from the model, which
00:35:02.440 | are essentially how likely each token is,
00:35:05.000 | you can see a rough calculation of the distance
00:35:07.480 | between these two models just as a scalar number.
00:35:10.160 | I think what that actually looks like in code,
00:35:12.520 | you can look at it.
00:35:13.800 | It would be like a sum of log probs
00:35:16.440 | that you get right from the model.
00:35:17.960 | It'll look much simpler than it sounds.
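
As a rough sketch of what that sum of log probs can look like in code (the function name, tensor shapes, and numbers here are illustrative, not a specific library's implementation):

```python
import torch

def kl_penalty(policy_logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
    # Both tensors hold per-token log probabilities of the *same* sampled
    # completion, one scored under the RLHF policy and one under the frozen
    # reference model. Their summed difference is a simple estimate of the
    # divergence between the two models for that completion.
    return (policy_logprobs - ref_logprobs).sum(dim=-1)

# Example with made-up numbers for a 4-token completion:
policy_lp = torch.tensor([-0.5, -1.2, -0.8, -0.3])
ref_lp = torch.tensor([-0.7, -1.0, -1.1, -0.4])
beta = 0.1
penalty = beta * kl_penalty(policy_lp, ref_lp)  # typically subtracted from the reward
print(penalty)
```
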
00:35:20.680 | But it is just to make the optimization kind of stay
00:35:23.560 | on tracks.
00:35:24.160 | It's a guardrail that's--
00:35:26.120 | Make sure it doesn't overfit to your RLHF data.
00:35:29.520 | Because we have so little data in RLHF, overfitting
00:35:32.140 | is really something that could happen.
00:35:33.720 | I think it'll fit to specific features
00:35:37.080 | that labelers like to see, that the model likes to generate,
00:35:41.200 | punctuation, weird tokens, like calculator tokens.
00:35:44.920 | It could overfit to anything if it's in the data a lot
00:35:47.240 | and it happens to be in a specific format.
00:35:49.920 | And the KL constraint prevents that.
00:35:52.520 | There's not that much documented work on that,
00:35:55.040 | but there's a lot of people that know if you take that away,
00:35:57.640 | it just doesn't work at all.
00:35:58.920 | So it is important, but I think it's something that people
00:36:02.240 | don't focus on too much.
00:36:04.240 | But as an objective, as I said, it's just kind of--
00:36:06.680 | you optimize the reward.
00:36:08.000 | The reward is where the human part of this comes in.
00:36:10.560 | We'll talk about that next.
00:36:11.720 | And then subject to a constraint,
00:36:13.960 | don't change the model too much.
00:36:15.660 | The real questions are, how do you implement the reward?
00:36:18.040 | And then how do you make the reward go up
00:36:19.920 | in a meaningful way?
00:36:21.680 | So like a preference model, the task
00:36:23.360 | is kind of to design a human reward.
00:36:25.200 | I think the key--
00:36:27.000 | the equation that most of the stuff is based on right now
00:36:29.920 | is something called a Bradley-Terry model, which
00:36:31.920 | is like a pairwise preference model where
00:36:33.680 | you compare two completions, and you
00:36:35.280 | say which one you like better.
00:36:36.560 | I'll show an interface that Anthropic uses here.
00:36:40.160 | And the Bradley-Terry model is really
00:36:42.540 | a fancy probability between two selections.
00:36:46.040 | And what's happening in the math is
00:36:47.920 | that if you look at the prob--
00:36:49.920 | you're looking at the probability that the chosen
00:36:52.440 | completion, the one you like better,
00:36:54.080 | is actually the better completion
00:36:56.320 | over the rejected completion.
00:36:58.360 | And what these preference models do
00:37:00.520 | is they assume this probability is correlated to reward.
00:37:04.120 | So if you just sample from this probability,
00:37:06.000 | it'll give you a scalar.
00:37:07.080 | And then you use that reward later on
00:37:08.640 | to signify what piece of text is better.
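
In symbols, the Bradley-Terry preference probability being described is usually written roughly as:

```latex
P(y_c \succ y_r \mid x)
= \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big)
= \frac{e^{r_\theta(x, y_c)}}{e^{r_\theta(x, y_c)} + e^{r_\theta(x, y_r)}}
```

where y_c is the chosen completion, y_r the rejected one, r_theta the reward model's scalar output, and sigma the sigmoid function.
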
00:37:12.120 | I'm kind of inclined to breeze through the math stuff,
00:37:17.800 | because otherwise it's going to be not as good to listen to.
00:37:20.400 | Yeah, no, no.
00:37:22.920 | I think people want to hear it.
00:37:24.560 | I think there's a lot of higher level explanations out there.
00:37:28.200 | Yeah, yeah.
00:37:29.240 | So the real thing is you need to assign a scalar reward of how
00:37:32.080 | good a response is.
00:37:33.440 | And that's not necessarily that easy to understand.
00:37:36.880 | Because if we take back to one of the first works
00:37:39.920 | I mentioned, this tamer thing for decision making,
00:37:42.600 | people tried that with language models, which
00:37:44.520 | is if you have a prompt and a completion
00:37:46.160 | and you just have someone rate it from 0 to 10,
00:37:48.840 | could you then train a reward model
00:37:50.400 | on all of these completions and 0 to 10 ratings
00:37:52.960 | and see if you could actually change--
00:37:57.720 | can you get ChatGPT with that?
00:37:57.720 | And the answer is really kind of no.
00:37:59.680 | A lot of people tried that.
00:38:00.800 | It didn't really work.
00:38:01.800 | And then that's why they tried this pairwise preference thing.
00:38:04.400 | And it happened to work.
00:38:05.520 | And this Bradley Terry model comes from the '50s.
00:38:09.800 | It's from these fields that I was mentioning earlier.
00:38:12.480 | And it's wild how much of this happens.
00:38:15.760 | I mean, this screenshot I have on the slides
00:38:17.640 | is from the DPO paper.
00:38:18.720 | I think it might be the appendix.
00:38:20.060 | But it's still really around in the literature of what
00:38:23.240 | people are doing for RLHF.
00:38:25.160 | So it's a fun one to know.
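
For reference, here is a minimal sketch of how that pairwise probability typically becomes a reward-model training loss; the numbers and shapes are illustrative, not from any particular implementation.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood of the Bradley-Terry probability that the chosen
    # completion beats the rejected one: maximizing that probability is the
    # same as minimizing -log(sigmoid(r_chosen - r_rejected)).
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with made-up scalar rewards for a batch of 3 comparisons:
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.5, -1.0])
print(reward_model_loss(chosen, rejected))
```
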
00:38:26.600 | I'll point out one presumption that this heavily relies on.
00:38:29.320 | You mentioned this as part of your six presumptions
00:38:31.440 | that we covered earlier, which is that you can
00:38:33.360 | aggregate these preferences.
00:38:35.520 | This is not exactly true among all humans, right?
00:38:38.020 | I have a preference for one thing.
00:38:39.440 | You have a preference for a different thing.
00:38:40.820 | And actually coming from economics,
00:38:42.480 | you mentioned economics earlier.
00:38:43.920 | There's a theorem or a name for this called Arrow's
00:38:48.880 | impossibility theorem, which I'm sure you've come across.
00:38:50.960 | Yeah, it's one of the many kind of things
00:38:53.220 | we throw around in the paper.
00:38:55.560 | Do we just ignore it?
00:38:56.740 | Yeah.
00:38:57.240 | We just-- yeah, just aggregate.
00:38:58.560 | Yeah, OK.
00:38:59.360 | Yeah.
00:39:00.280 | I think the reason this really is done on a deep level
00:39:03.880 | is that you're not actually trying
00:39:05.640 | to model any contestable preference in this.
00:39:09.480 | You're not trying to go into things that
00:39:11.400 | are controversial or anything.
00:39:13.360 | It's really the notion of preference
00:39:15.720 | is trying to stay around correctness and style
00:39:19.160 | rather than any meaningful notion of preference.
00:39:21.200 | Because otherwise, these companies really don't want to--
00:39:23.700 | they don't want to do this at all.
00:39:26.500 | I think that's just how it is.
00:39:27.780 | And it's like, if you look at what people actually do--
00:39:30.140 | so I have a bunch of slides on the feedback interface.
00:39:33.260 | And they all publish this.
00:39:34.420 | It's always at the appendices of every paper.
00:39:36.700 | Yeah, it's pretty interesting.
00:39:37.980 | Yeah, there's something later on in this talk which is like--
00:39:40.540 | but it's good to mention in this is when you're doing this
00:39:43.380 | preference collection, you write out a very long document
00:39:46.500 | of instructions to people that are collecting this data.
00:39:49.180 | And it's like, this is the hierarchy
00:39:50.880 | of what we want to prioritize.
00:39:52.160 | Something amounting to factuality, helpfulness,
00:39:55.480 | honesty, harmlessness-- these are all different things.
00:39:57.920 | Every company will rank these in different ways,
00:40:00.220 | provide extensive examples.
00:40:01.800 | It's like, if you see these two answers,
00:40:03.840 | you should select this one and why and all of this stuff.
00:40:07.120 | And then my kind of head scratching
00:40:08.640 | is like, why don't we check if the models actually
00:40:10.760 | do these things that we tell the data annotators to collect?
00:40:14.600 | But I think it's because the model--
00:40:16.120 | it's hard to make that attribution.
00:40:17.760 | It's really hard to test if a model is honest
00:40:20.260 | and stuff.
00:40:20.920 | It would just be nice to understand
00:40:22.500 | the kind of causal mechanisms as a researcher
00:40:25.700 | or if our goals are met.
00:40:27.820 | But at a simple level, what it boils down to--
00:40:30.620 | I have a lot more images than I need.
00:40:33.140 | It's like, you're having a conversation with an AI,
00:40:36.140 | something like ChatGPT.
00:40:37.620 | You get shown two responses or more in some papers.
00:40:40.780 | And then you have to choose which one is better.
00:40:42.780 | I think something you'll hear a lot in this space
00:40:44.780 | is something called a Likert scale.
00:40:47.100 | Likert is a name.
00:40:48.400 | It comes from Rensis Likert, who developed the scale
00:40:51.280 | in psychometrics.
00:40:52.620 | But essentially, it's a type of scale
00:40:54.180 | where if you have integers from one to eight,
00:40:58.120 | the middle numbers will represent something
00:41:00.300 | close to a tie.
00:41:01.140 | And the smallest numbers will represent one model
00:41:03.280 | being way better than the other.
00:41:04.940 | And the biggest numbers will be the other model's better.
00:41:08.840 | So in the case of one to eight, if you're
00:41:10.600 | comparing models A to B, you return a one
00:41:12.680 | if you really liked option A, you
00:41:14.040 | return an eight if you really liked B,
00:41:15.520 | and a four or five if they were close.
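As a minimal sketch of how a 1-to-8 Likert rating like this might get turned into pairwise preference data (the cutoffs and the tie handling here are illustrative assumptions, not any particular paper's scheme):

```python
def likert_to_preference(rating: int, response_a: str, response_b: str):
    """Convert a 1-8 Likert rating comparing A vs. B into (chosen, rejected, margin).

    1 means A is much better, 8 means B is much better; 4 and 5 are near-ties.
    Returns None for near-ties, since they carry little training signal.
    """
    assert 1 <= rating <= 8
    if rating in (4, 5):           # close call: often dropped or down-weighted
        return None
    margin = abs(rating - 4.5)     # how strongly the labeler preferred one side
    if rating < 4:                 # low numbers favor A
        return response_a, response_b, margin
    return response_b, response_a, margin  # high numbers favor B


print(likert_to_preference(2, "answer A", "answer B"))  # A preferred, margin 2.5
```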
00:41:17.940 | There's other ways to collect this data.
00:41:19.600 | This one's become really popular.
00:41:21.280 | We played with it a bit at Hugging Face.
00:41:22.900 | It's hard to use.
00:41:24.140 | Filling out this preference data is really hard.
00:41:26.100 | You have to read multiple paragraphs.
00:41:27.800 | It's not for me.
00:41:28.560 | Some people really like it.
00:41:29.600 | I hear, I'm like, I can't imagine
00:41:31.000 | sitting there and reading AI-generated text
00:41:33.240 | and having to do that for my job.
00:41:35.760 | But a lot of these early papers in RLHF
00:41:38.400 | have good examples of what was done.
00:41:40.680 | The one I have here is from Anthropic's collection demo.
00:41:45.360 | It's because it was from slides that I did with Anthropic.
00:41:48.560 | But you can look up these in the various papers.
00:41:52.800 | It looks like ChatGPT with two responses,
00:41:55.080 | and then you have an option to say which one is better.
00:41:57.360 | It's nothing crazy.
00:41:59.240 | The infrastructure is almost exactly the same.
00:42:02.040 | But they just log which one you think is better.
00:42:04.920 | I think places like Scale are also really big in this,
00:42:09.400 | where a lot of the labeler companies
00:42:11.920 | will help control who's doing how many samples.
00:42:15.520 | You have multiple people go over the same sample once,
00:42:18.000 | and what happens if there's disagreement?
00:42:20.060 | I don't really think this disagreement
00:42:21.600 | data is used for anything.
00:42:22.840 | But it's good to know what the distribution of prompts is,
00:42:26.080 | who's doing it, how many samples you have,
00:42:28.040 | controlling the workforce.
00:42:29.240 | All of this is very hard.
00:42:31.080 | A last thing to add is that a lot of these companies
00:42:33.200 | do collect optional metadata.
00:42:35.200 | I think the Anthropic example shows
00:42:37.020 | a rating of how good was the prompt, or the conversation,
00:42:42.380 | from good to bad.
00:42:43.820 | Because things matter.
00:42:46.620 | There's a quadrant of preference data in my mind,
00:42:48.740 | which is you're comparing a good answer to a good answer, which
00:42:51.660 | is a really interesting signal.
00:42:53.140 | And then there's the option of you're
00:42:54.740 | comparing a bad answer to a bad answer, which is like,
00:42:57.580 | you don't want to train your model on two different--
00:42:59.780 | You're both terrible.
00:43:00.780 | This is why we did this at Hugging Face.
00:43:02.500 | And it was like, our data was like,
00:43:04.260 | we don't know if we can use this, because a lot of it
00:43:06.220 | was just bad answer to bad answer,
00:43:07.780 | because you're rushing to try to do this real contract.
00:43:10.560 | And then there's also good answer to bad answer,
00:43:12.560 | which I think is probably pretty reasonable to include.
00:43:14.860 | You just prefer the good one, and move on with your life.
00:43:17.940 | Those are very different scenarios.
00:43:19.380 | I think the OpenAIs of the world are all in good answer,
00:43:22.300 | good answer, and have learned to eliminate everything else.
00:43:24.760 | But when people try to do this in open source,
00:43:26.660 | it's probably like what Open Assistant saw,
00:43:28.500 | is there's just a lot of bad answers in your preference
00:43:30.820 | data.
00:43:31.300 | And you're like, what do I do with this?
00:43:33.460 | Metadata flags can help.
00:43:34.860 | I threw in slide 28.
00:43:37.540 | It's like the InstructGPT metadata.
00:43:40.460 | You can see how much they collect here,
00:43:43.100 | and everything from the model fails to actually complete
00:43:47.900 | the task, hallucinations, different types
00:43:50.580 | of offensive or dangerous content, moral judgment,
00:43:53.300 | expresses opinion.
00:43:55.620 | I don't know exactly if they're doing this now,
00:43:58.340 | but you can kind of see why doing RLHF at scale
00:44:01.060 | and prioritizing a lot of different endpoints
00:44:02.860 | would be hard, because these are all things that you--
00:44:05.780 | I'd be interested if I was scaling up a big team
00:44:08.020 | to do RLHF, and what is going into the preference data,
00:44:10.420 | and what happens?
00:44:11.420 | You do an experiment.
00:44:12.660 | You're like, OK, we're going to remove all the data where
00:44:14.620 | they said the model hallucinates.
00:44:16.660 | And then retrain everything.
00:44:17.820 | Like, what does that do?
00:44:19.320 | Yeah, so hallucination is big.
00:44:20.900 | But some of these other metadata categories--
00:44:23.500 | and I've seen this in a lot of papers--
00:44:25.300 | it's like, does it contain sexual content?
00:44:27.300 | Does it express a moral judgment?
00:44:28.780 | Does it denigrate a protected class?
00:44:30.300 | That kind of stuff, very binary.
00:44:33.260 | Should people try to adjust for this at the RLHF layer,
00:44:36.580 | or should they put it as a pipeline
00:44:38.740 | where they have a classifier as a separate model that
00:44:42.580 | grades the model output?
00:44:44.780 | Do you mean for training or, like, a deployment?
00:44:47.580 | Deployment.
00:44:48.340 | I do think that people are doing it at deployment.
00:44:50.420 | I think we've seen safety and other things
00:44:53.420 | in the RLHF pipeline.
00:44:56.020 | Like, Llama 2 is famous for kind of having
00:44:59.020 | these, like, helpfulness and safety reward models.
00:45:02.420 | Deep in the Gemini report is something
00:45:04.260 | that Gemini has, like, four things, which
00:45:07.260 | is, like, helpfulness, factuality, maybe safety,
00:45:09.700 | maybe something else.
00:45:11.180 | But places like Anthropic and ChatGPT and Bard
00:45:14.860 | almost surely have a classifier after,
00:45:17.060 | which is like, is this text good?
00:45:18.460 | Is this text bad?
00:45:19.940 | And that's not that surprising, I think,
00:45:23.260 | because you could use, like, a 100 times smaller language
00:45:25.700 | model and do much better at filtering than RLHF.
00:45:30.020 | But I do think it's still so deeply intertwined
00:45:33.540 | with the motivation of RLHF to be for safety
00:45:35.620 | that some of these categories still persist.
00:45:38.180 | I think that's something that'll kind of settle out, I think.
00:45:41.180 | I'm just wondering if it's worth collecting this data
00:45:43.300 | for the RLHF purpose if you're not going to use it in any way,
00:45:46.060 | because you're just going to use a separate model to--
00:45:48.020 | Yeah, I don't think OpenAI will collect all of this anymore.
00:45:51.380 | But I think, from a research perspective,
00:45:53.220 | it's very insightful to know.
00:45:55.340 | But it's also expensive.
00:45:56.340 | So essentially, your preference data
00:45:57.980 | scales with how many minutes it takes for you to do each task.
00:46:00.660 | And every button you add--
00:46:02.180 | it scales pretty linearly.
00:46:04.300 | So it's not cheap stuff.
00:46:07.740 | Since you mentioned expensiveness,
00:46:09.780 | and I think you may have joined one of our spaces
00:46:12.540 | back when Llama 2 was released.
00:46:15.420 | We had an estimate from you that was something
00:46:18.340 | on the order of Llama 2 costs $3 to $6 million
00:46:22.100 | to train GPU-wise.
00:46:23.500 | And then it was something like $20 to $30 million
00:46:25.900 | in preference data.
00:46:27.860 | Is that something that's still in the ballpark?
00:46:30.020 | I don't need precise numbers.
00:46:30.820 | I think it's still in the ballpark.
00:46:32.280 | I know that the $20 million was off by a factor of four,
00:46:35.020 | because I was converting from a prompt number
00:46:37.380 | to a total data point.
00:46:39.140 | So essentially, when you do this,
00:46:41.180 | if you have multi-turn setting, each turn
00:46:43.260 | will be one data point.
00:46:44.820 | And the Llama 2 paper reports 1.5 million data points,
00:46:48.180 | which could be 400,000 prompts.
00:46:50.860 | So I would still say like $6 to $8 million
00:46:54.020 | is safe to say that they're spending, if not more.
00:46:56.460 | They're probably also buying other types of data
00:46:58.500 | and/or throwing out data that they don't like.
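Roughly, the correction being described works out like this, using only the ballpark numbers mentioned here (so treat it as an order-of-magnitude sketch, not an actual budget):

```latex
\frac{1.5\text{M data points}}{\approx 400\text{K prompts}} \approx 3.75 \text{ turns per prompt},
\qquad
\frac{\$20\text{--}30\text{M (prompt-based estimate)}}{\approx 4} \approx \$5\text{--}8\text{M}
```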
00:47:00.420 | But it's very comparable to compute costs.
00:47:03.460 | But the compute costs listed in the paper
00:47:05.900 | always are way lower, because all they have to say
00:47:07.940 | is, what does one run cost?
00:47:09.500 | But they're running tens or hundreds of runs.
00:47:11.420 | So it's like, OK, this is kind of a meaningless number.
00:47:15.840 | The data number would be more interesting.
00:47:17.620 | Right, right, right, right.
00:47:18.740 | What's the depreciation of this data?
00:47:23.060 | It depends on the method.
00:47:24.540 | Some methods, people think that it's more sensitive to--
00:47:28.460 | this is what I was saying.
00:47:29.700 | Does the type of instruction tuning you do matter for RLHF?
00:47:34.220 | So depending on the method, some people
00:47:36.900 | are trying to figure out if you need to have what is called--
00:47:41.420 | this is very confusing.
00:47:42.380 | It's called on-policy data, which
00:47:43.980 | is your RLHF data is from your instruction model.
00:47:47.500 | I really think people in open source and academics
00:47:50.180 | are going to figure out how to use any preference
00:47:52.180 | data on any model, just because they're scrappy.
00:47:54.860 | But there's been an intuition that to do PPO well and keep
00:47:58.900 | improving the model over time, and do what Meta did,
00:48:01.700 | and what people think that OpenAI does,
00:48:03.740 | is that you need to collect new preference data to kind of edge
00:48:08.300 | the distribution of capabilities forward.
00:48:10.340 | So there's a depreciation where the first batch of data
00:48:14.180 | you collect isn't really useful for training the model when
00:48:16.700 | you have the fifth batch.
00:48:19.300 | We don't really know, but that's something--
00:48:21.340 | it's a good question.
00:48:22.500 | And I do think that if we had all the Llama data,
00:48:26.440 | we wouldn't know what to do with all of it.
00:48:28.500 | Probably 20% to 40% would be pretty useful for people,
00:48:32.140 | but not the whole data set.
00:48:33.380 | A lot of it's probably kind of gibberish,
00:48:35.220 | because they had a lot of data in there.
00:48:37.380 | Yeah.
00:48:38.180 | So do you think the open source community should
00:48:40.940 | spend more time figuring out how to reuse the data that we have,
00:48:44.180 | or generate more data?
00:48:46.060 | I think that's one of the bigger questions.
00:48:47.860 | I think if the people are kind of locked
00:48:49.540 | into using synthetic data, which I wish I had more slides on it,
00:48:52.300 | but we could just talk about it.
00:48:53.600 | Essentially, people also think that synthetic data, like GPT-4,
00:48:57.420 | is more accurate than humans at labeling preferences.
00:48:59.780 | So if you look at these diagrams,
00:49:01.160 | like humans are about 60% to 70% agreement.
00:49:04.460 | That's what the models get to.
00:49:06.180 | And if humans are about 70% agreement or accuracy,
00:49:09.660 | like GPT-4 is like 80%.
00:49:11.180 | So it is a bit better, which is in one way of saying it.
00:49:14.340 | Humans don't even agree with humans 50% of the time.
00:49:17.540 | Yeah, so that's the thing.
00:49:18.740 | It's like the human disagreement or the lack of accuracy
00:49:21.900 | should be like a signal.
00:49:23.620 | But how do you incorporate that?
00:49:25.460 | It's really tricky to actually do that.
00:49:28.300 | I think that people just keep using GPT-4,
00:49:30.340 | because it's really cheap.
00:49:31.420 | It's one of my go-to lines, like I just say this over and over
00:49:33.940 | again: GPT-4 for data generation.
00:49:37.660 | All terms and conditions aside, because we
00:49:40.020 | know OpenAI has this stuff, is like very cheap for getting
00:49:43.300 | pretty good data compared to compute or salary
00:49:46.340 | of any engineer or anything.
00:49:47.820 | So it's like, tell people to go crazy generating GPT-4 data
00:49:51.340 | if you're willing to take the organizational cloud of,
00:49:54.260 | should we be doing this?
00:49:55.260 | But I think most people have accepted
00:49:56.760 | that you kind of do this, especially at individuals.
00:49:59.020 | Yeah, they're not going to come after individuals.
00:50:01.100 | I do think more companies should think twice
00:50:04.300 | before doing tons of OpenAI outputs,
00:50:07.260 | also just because the data contamination
00:50:09.740 | and what it does to your workflow
00:50:11.220 | is probably hard to control at scale.
00:50:14.140 | And we should just mention, at the time of recording,
00:50:16.660 | we've seen the first example of OpenAI enforcing
00:50:19.340 | their terms of service.
00:50:20.780 | ByteDance was caught, reported to be training on GPT-4 data,
00:50:24.820 | and they got their access to OpenAI revoked.
00:50:28.300 | So that was one example.
00:50:29.420 | I don't know if you have a comment on that.
00:50:31.260 | I don't expect OpenAI to go too crazy on this,
00:50:33.880 | because there's going to be so much backlash against them.
00:50:36.340 | Everyone's doing it, yeah.
00:50:37.460 | And everyone's going to do it anyways.
00:50:39.860 | And what's at stake here, to spell it out,
00:50:41.740 | is like, OK, this costs $10 to collect one data
00:50:45.180 | point from a human.
00:50:46.460 | It's going to cost you a tenth of a cent with OpenAI, right?
00:50:51.100 | So it's just orders of magnitude cheaper,
00:50:52.860 | and therefore people are just going to do it.
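Spelled out with those ballpark figures (they are the rough numbers just mentioned, not quoted prices):

```latex
\frac{\$10 \text{ per human-labeled preference}}{\$0.001 \text{ per GPT-4-labeled preference}} = 10{,}000\times,
\quad \text{i.e. roughly four orders of magnitude.}
```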
00:50:54.740 | Yeah, and the signal you get from humans from preferences
00:50:58.140 | is not high.
00:50:58.860 | The signal that you get from humans for instructions
00:51:02.500 | is pretty high, but it is also very expensive.
00:51:04.860 | So the human instructions are definitely, by far and away,
00:51:07.500 | the best ones out there, compared
00:51:08.940 | to the synthetic data.
00:51:10.940 | But I think the synthetic preferences
00:51:12.660 | are just so much easier to get some sort of signal running
00:51:15.780 | with, and you can work in other--
00:51:17.260 | I think people will start working in other goals
00:51:19.300 | there, between safety and whatever.
00:51:21.820 | But that's something that's taking off,
00:51:23.780 | and we'll kind of see that.
00:51:24.980 | I think in 2024, at some point, people
00:51:27.460 | will start doing things like constitutional AI
00:51:29.460 | for preferences, which will be pretty interesting.
00:51:33.500 | We saw how long it took RLHF to get started in open source.
00:51:37.180 | Instruction tuning was the only thing
00:51:38.780 | that was really happening until maybe like August, really.
00:51:41.660 | I think Zephyr was the first model that
00:51:44.220 | showed success with RLHF in the public.
00:51:46.860 | But that's a long time from everyone
00:51:49.500 | knowing that it was something that people are interested in,
00:51:51.960 | to having any check mark.
00:51:54.620 | So I accept that and think the same will
00:51:57.660 | happen with constitutional AI.
00:51:59.180 | But once people show that you can do it once,
00:52:01.980 | they continue to explore.
00:52:04.140 | Yeah, excellent.
00:52:05.500 | Just in the domain of human preference data suppliers,
00:52:09.980 | ScaleAI very happily will tell you
00:52:11.460 | that they supplied all that data for Llama 2.
00:52:17.020 | The other one is probably interesting,
00:52:18.560 | LMSYS from Berkeley.
00:52:21.340 | What they're running with Chatbot Arena
00:52:22.820 | is perhaps a good store of human preference data?
00:52:26.420 | Yeah, they released some toxicity data.
00:52:28.380 | They, I think, are generally worried about releasing data
00:52:31.060 | because they have to process it and make
00:52:32.680 | sure everything is safe.
00:52:33.680 | And they're a really lightweight operation.
00:52:35.180 | And they're trying to.
00:52:36.380 | They're trying to release the preference data.
00:52:38.260 | I have-- if we make it to evaluation,
00:52:40.260 | I'd pretty much say that Chatbot Arena
00:52:41.980 | is the best limited evaluation that people have to learn
00:52:44.540 | how to use language models.
00:52:45.700 | And it's very valuable data.
00:52:47.980 | And trying to get--
00:52:50.060 | they also may share some data with people
00:52:52.460 | that they host models from.
00:52:54.060 | So if your model is hosted there,
00:52:55.780 | and you pay for the hosting, you can get the prompts.
00:52:57.980 | Because you're pointing the endpoint at it,
00:52:59.740 | and it gets pinged to you.
00:53:00.820 | And your-- any real LLM inference stack
00:53:03.940 | saves the prompts that you get.
00:53:05.660 | So that is some signal.
00:53:07.020 | I don't know if the shared preferences--
00:53:09.220 | I do think they're trying to.
00:53:10.420 | They're trying to do all the right things.
00:53:12.220 | They're just very strapped.
00:53:13.340 | And moving data comes with other legal and liability concerns
00:53:16.780 | in some cases.
00:53:18.980 | Awesome.
00:53:20.140 | So kind of looping back a little bit
00:53:22.420 | from that very valuable digression
00:53:24.900 | on what preference data is, we're
00:53:27.180 | talking about the actual loss function.
00:53:29.260 | Because it's kind of like this classifier approach that
00:53:32.060 | might not make too much sense to people.
00:53:34.140 | You take a language model, and you
00:53:36.380 | chop it into pieces a little bit at the end
00:53:38.300 | so that it outputs one number.
00:53:39.780 | It's like-- in technical level, it's
00:53:41.540 | a logit that corresponds to the probability
00:53:43.500 | that we talked about earlier.
00:53:45.020 | But in order to train this, you can't just
00:53:46.780 | have prompts and completions.
00:53:48.460 | You need to have these pairs.
00:53:50.100 | Because we talked about scalars don't really work.
00:53:52.220 | So in order to train it, you use the magical batching
00:53:54.700 | of all language models and deep learning architectures.
00:53:57.420 | And you put in the chosen completion and the rejected completion
00:53:59.780 | at the same time.
00:54:00.540 | And then you end up with two numbers.
00:54:02.580 | And then there's this fun loss function.
00:54:04.460 | And you essentially have to increase the difference
00:54:06.580 | between these two predicted numbers.
00:54:08.660 | It's always fun when you think about automatic
00:54:10.940 | differentiation.
00:54:11.700 | It updates the same parameters to separate these two numbers
00:54:15.380 | at once.
00:54:16.260 | And there's this loss function that you'll see in OpenAI
00:54:19.060 | Anthropic and everyone's papers.
00:54:20.900 | What it looks like is it's like some log
00:54:22.900 | of a scalar with an exponential.
00:54:25.300 | That's the difference between these two predicted rewards.
00:54:28.000 | It's just some fancy math around a difference, a subtraction
00:54:32.000 | between the predicted reward
00:54:34.080 | for the rejected completion
00:54:35.760 | and the predicted reward
00:54:37.660 | of the chosen completion.
00:54:40.520 | Fun fact is that these loss functions
00:54:43.040 | look different in Anthropic and OpenAI's papers.
00:54:45.840 | But they're literally just log transforms.
00:54:47.800 | So if you start, like, exponentiating both sides
00:54:49.800 | and taking a log of both sides, you'll see that
00:54:52.680 | the two papers end up being the same thing.
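A minimal sketch of that pairwise loss on top of two scalar reward predictions; this is the standard -log sigmoid(r_chosen - r_rejected) form, and the toy tensors standing in for a real reward model's outputs are only there to make it runnable:

```python
import torch
import torch.nn.functional as F


def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the chosen reward above the rejected one.

    r_chosen, r_rejected: shape (batch,) scalar rewards, typically produced by
    running the chosen and rejected completions through the same reward model
    in one batched forward pass.
    """
    # -log sigma(r_c - r_r); equivalent, up to a log transform, to the forms
    # written in the OpenAI and Anthropic papers.
    return -F.logsigmoid(r_chosen - r_rejected).mean()


# Toy usage with random "rewards" standing in for the value head's outputs.
r_c = torch.randn(8, requires_grad=True)
r_r = torch.randn(8, requires_grad=True)
loss = pairwise_reward_loss(r_c, r_r)
loss.backward()  # in a real reward model both scores share parameters, so one update separates them
print(loss.item())
```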
00:54:55.320 | And people don't know how to train preference models,
00:54:57.800 | particularly well now.
00:54:59.040 | I think if you zoom into any of the details
00:55:01.720 | to look at the agreement number, so how--
00:55:04.880 | if you look at a test set, you'll
00:55:06.720 | have a chosen and rejected.
00:55:08.640 | And you can take the reward model you're training,
00:55:10.720 | pass in those completions, and you
00:55:13.000 | see if the chosen predicted reward, so the scalar number,
00:55:16.200 | is higher than the rejected predicted reward.
00:55:18.880 | And this is the agreement numbers
00:55:20.320 | in all of these data sets.
00:55:21.400 | It's like where you see they have the 65% to 75% agreement.
00:55:24.400 | This just means that these scalar numbers were ordered
00:55:27.880 | correctly.
00:55:28.480 | And that's a pretty low number.
00:55:29.780 | It's not going to get to 100%.
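That agreement number is just this kind of check, sketched here with a stand-in reward function so it runs; in practice you would score a held-out set of chosen/rejected pairs with the trained reward model:

```python
def agreement(reward_fn, pairs) -> float:
    """Fraction of held-out (prompt, chosen, rejected) triples where the
    reward model orders the chosen completion above the rejected one."""
    correct = sum(
        reward_fn(prompt, chosen) > reward_fn(prompt, rejected)
        for prompt, chosen, rejected in pairs
    )
    return correct / len(pairs)


# Stand-in reward: longer answers score higher (only to make the example runnable).
toy_reward = lambda prompt, completion: len(completion)
test_pairs = [
    ("What is 2+2?", "It equals 4.", "idk"),
    ("Capital of France?", "Paris.", "Paris, I think, or maybe Lyon?"),
]
print(agreement(toy_reward, test_pairs))  # 0.5 with this toy reward
```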
00:55:31.880 | That goes to show the kind of deep questions at play here.
00:55:35.480 | People are playing with different loss functions,
00:55:37.480 | ensembles, different models to try to address this.
00:55:39.760 | But it's really a fundamental issue.
00:55:41.260 | It's like-- it goes back to what does it mean to do RLHF?
00:55:45.460 | And we're not going to answer that now.
00:55:47.040 | But it's good to know that this 65% to 75% agreement,
00:55:49.900 | you'll see these numbers everywhere.
00:55:51.360 | It's like we don't have 100% agreement with the reward
00:55:53.920 | model and the data.
00:55:55.720 | And that's fine.
00:55:56.820 | That's just where we're at.
00:55:58.360 | And we essentially take this model,
00:56:00.080 | and then we start throwing RL at it, I think.
00:56:03.480 | PPO, proximal policy optimization,
00:56:07.360 | it's pretty complicated compared to what
00:56:09.080 | you really need to know.
00:56:10.320 | It really just does RL under the hood.
00:56:13.160 | Things like PPO, it learns a value function,
00:56:15.420 | and then it uses the value function to update the model.
00:56:18.680 | You could look at--
00:56:19.720 | if you actually look at a feedback diagram,
00:56:23.240 | it's more of like a systems problem than an RL problem.
00:56:27.280 | So you'll see things like you need
00:56:28.700 | to have two copies of the language model.
00:56:30.560 | This is for the KL constraint that we talked about before.
00:56:33.000 | You need to have the reward model, which
00:56:34.700 | is either a separate reward model or value head
00:56:37.000 | on your base model.
00:56:38.820 | And then you need to have your RL code that actually
00:56:42.000 | learns a value function and updates all the parameters.
00:56:44.360 | I think it just is really messy to actually set up.
00:56:48.360 | But if you dig into it, most people
00:56:50.280 | could understand what each of the components are.
00:56:52.320 | And then the hard parts are like,
00:56:53.760 | how do we actually make a language model that
00:56:55.600 | works out of this, which is not something
00:56:57.640 | that people know that well.
00:56:58.840 | I think things that I talk about a lot
00:57:00.480 | is just like, OK, what is the signal flow?
00:57:03.260 | How do you access the reward model?
00:57:05.440 | The reward model is used in RLHF exactly what you would think.
00:57:08.220 | You have a prompt.
00:57:09.360 | The language model generates a completion.
00:57:11.480 | And then that completion is given a score.
00:57:13.360 | That score gets plugged into the whole RL stuff.
00:57:16.240 | And it learns-- and it updates the parameters.
00:57:18.640 | That's kind of the core of it.
00:57:20.720 | There's a lot of different things,
00:57:22.240 | zooming in on where exactly you put this distance
00:57:25.400 | penalty between the base model and the RL model.
00:57:28.440 | Most people say that you just deduct it from the reward.
00:57:31.040 | So if you go all the way back to RL
00:57:33.600 | as an agent acting in the world, the reward from that world
00:57:37.600 | would be a combination of the reward model and any
00:57:39.760 | constraints, like KL, that you put on it.
00:57:41.560 | There's a lot of different ways to do this,
00:57:43.400 | because a lot of RL algorithms, like PPO,
00:57:45.320 | actually have a KL constraint built into them.
00:57:47.520 | So it's confusing, because you hear KL twice.
00:57:49.520 | But those are different KLs.
00:57:50.920 | One of them is about the text, and one of them
00:57:53.120 | is about the value function distance, or the policy
00:57:55.320 | distance, or something like this.
00:57:56.920 | So those are different.
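A sketch of that reward shaping at the sequence level, where the KL penalty is just deducted from the reward model's score; the beta coefficient and the per-token log-prob tensors are assumptions about the setup, not any specific implementation:

```python
import torch


def shaped_reward(rm_score: torch.Tensor,
                  policy_logprobs: torch.Tensor,
                  ref_logprobs: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """Reward actually handed to the RL algorithm for one completion.

    rm_score: scalar score from the reward model for the full completion.
    policy_logprobs / ref_logprobs: per-token log-probs of the generated tokens
    under the RL policy and under the frozen base (reference) model.
    """
    # Sample-based estimate of KL(policy || reference) over the generated tokens.
    kl_estimate = (policy_logprobs - ref_logprobs).sum()
    return rm_score - beta * kl_estimate


# Toy usage with made-up numbers.
score = shaped_reward(
    rm_score=torch.tensor(1.3),
    policy_logprobs=torch.tensor([-0.5, -1.2, -0.8]),
    ref_logprobs=torch.tensor([-0.6, -1.0, -0.9]),
)
print(score)
```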
00:57:57.880 | It really ends up being kind of gibberish
00:58:00.120 | that I think is less important now,
00:58:01.620 | because it's more about data and infrastructure than RL details,
00:58:04.720 | than value functions and everything.
00:58:07.820 | A lot of the papers have different terms
00:58:10.040 | in the equations.
00:58:11.120 | I think InstructGPT does something
00:58:12.840 | where they try to get the RL model to match
00:58:16.200 | the instruction tuning model, or the instruction tuning
00:58:18.480 | data set, because they're really happy with that data set
00:58:20.820 | to constrain the distribution.
00:58:22.600 | LLAMA does some different things.
00:58:24.240 | But I think these are all small gains over just
00:58:27.960 | getting the deep understanding of the data
00:58:31.080 | in the infrastructure setup.
00:58:33.840 | This is why we say it's so little RL.
00:58:35.680 | It's like, now we are getting to the point
00:58:37.600 | where you don't even really need this to get a good model.
00:58:40.040 | So that's why it's like, OK, the RL is such a small part
00:58:43.520 | of the actual doing RLHF.
00:58:46.520 | Like, RLHF is a metaphor for all language model adaptation.
00:58:50.440 | And RL is one tool used at one point in the time.
00:58:54.160 | So that's kind of where I wrap up
00:58:55.640 | the core overview in my mind, to say RL doesn't really
00:58:58.680 | do as much as people think.
00:58:59.840 | But you could put up flashy equations
00:59:01.580 | and do all sorts of stuff if you want to.
00:59:04.240 | I think it's kind of misleading, even,
00:59:05.840 | because I don't think about those equations
00:59:07.580 | on a regular basis.
00:59:08.600 | But what if we called it Q*?
00:59:10.640 | Yeah.
00:59:14.560 | So in your mind, is the takeaway for this kind
00:59:18.240 | of next generation of people working on models,
00:59:21.520 | maybe the underlying theories is less important than actually
00:59:25.200 | getting good data, basically?
00:59:26.680 | Yeah, I think it's getting good data.
00:59:28.240 | And we'll see, like, I have this advanced topics
00:59:30.240 | thing in the slides, which starts with the evals.
00:59:32.760 | And then it talks about a lot of different ways
00:59:34.800 | that people are using reward models
00:59:36.320 | or constructing training signals, really.
00:59:38.640 | And I think that it's about understanding
00:59:41.880 | what your information flow is.
00:59:43.200 | And if your reward signal is good,
00:59:44.960 | and if your language model is generating right,
00:59:47.840 | zooming in on the tokens it's generating,
00:59:49.780 | and kind of understanding how those things change over time.
00:59:54.040 | I have a slide that I--
00:59:55.560 | in here, I think this is something we could also
00:59:57.800 | talk about evaluation.
00:59:59.000 | But it's really like, RLHF has not
01:00:01.520 | really been shown to improve capabilities yet.
01:00:03.360 | I think one of the fun ones is from the GPT-4 technical
01:00:05.920 | report.
01:00:06.420 | They essentially listed their kind of bogus evaluations.
01:00:09.200 | Because it's a hilarious table, because it's like LSAT, AP
01:00:11.680 | exams.
01:00:12.680 | And then, like, AMC-10 and AMC-12
01:00:15.240 | are kind of reasonable evals in language model land.
01:00:19.240 | But they just showed that RLHF doesn't improve
01:00:21.360 | their evaluation metrics.
01:00:22.440 | We don't know if internally they have other ones.
01:00:24.480 | They probably do.
01:00:25.240 | But from what OpenAI has shown us externally,
01:00:27.920 | like, RLHF improves some metrics.
01:00:29.800 | It decreases some metrics.
01:00:31.440 | No one could really see.
01:00:32.520 | I do think it does things that they care about.
01:00:35.600 | But it's like, RLHF is not an easy tool
01:00:38.160 | to make numbers go up with.
01:00:39.600 | It's a powerful tool to change your language model.
01:00:42.560 | But as we've seen with Llama 2 and safety RLHF,
01:00:45.760 | that doesn't always mean that people
01:00:47.260 | are going to be happy with those changes,
01:00:48.760 | or it's going to do exactly what you want.
01:00:50.520 | It's like--
01:00:51.160 | Well, I think this is intuitive.
01:00:52.520 | A lot of these tests are multiple choice.
01:00:55.360 | And RLHF isn't necessarily intended
01:00:58.240 | to improve your multiple choice reasoning capabilities.
01:01:01.280 | Yeah.
01:01:01.780 | Yeah.
01:01:02.280 | I think that it is reasonable, but I
01:01:04.800 | don't think a lot of people have connected the dots there.
01:01:08.360 | And like, what is it in a preference point?
01:01:12.520 | Like, what if your preference data
01:01:14.080 | was between a correct and a wrong answer?
01:01:16.120 | Like, it could conceivably do it,
01:01:18.280 | but I just don't think that is remotely
01:01:20.760 | what it is actually doing.
01:01:22.160 | Yeah.
01:01:22.920 | It's much better being a sommelier, apparently.
01:01:25.520 | Yeah.
01:01:27.480 | That was the weirdest one that was included in the GPT-4 evals.
01:01:30.320 | Yeah, I did.
01:01:30.800 | I just see that the last three down there.
01:01:32.760 | That's really funny.
01:01:34.320 | I can't even taste it.
01:01:35.320 | You can't even taste it?
01:01:36.320 | It's just like-- anyway.
01:01:39.360 | Cool.
01:01:40.120 | Emerging directions.
01:01:41.120 | Yeah, so this is essentially how to use RLHF-like things
01:01:44.000 | to make the model better without using PPO,
01:01:46.880 | because PPO is kind of a nightmare to scale.
01:01:49.080 | The first thing that I started with
01:01:50.540 | is kind of the ideas of rejection sampling and best-of-n
01:01:53.000 | sampling.
01:01:53.840 | I think best-of-n sampling is what people often encounter
01:01:56.800 | first, which is the idea of: you take a prompt,
01:02:00.200 | you generate like 10 or 20 responses to it,
01:02:04.720 | you pass them through a reward model.
01:02:06.400 | The reward model assigns a scalar for each of them.
01:02:08.960 | You pick the one with the highest number,
01:02:10.680 | and that's the one you answer the question with.
01:02:13.000 | It seems pretty logical to people,
01:02:14.480 | because it's just spending more inference time compute
01:02:16.760 | to make your outputs better.
01:02:18.120 | And it works in a lot of things.
01:02:19.800 | This Let's Verify step-by-step paper
01:02:21.640 | that I talked about from OpenAI, they use it.
01:02:23.600 | Lots of papers use it.
01:02:26.360 | It's just kind of like a good thing
01:02:27.760 | to know that you can do.
01:02:28.720 | You can spend more inference compute
01:02:31.080 | based on a preference data set to make your answers better.
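A minimal sketch of best-of-n sampling with stand-in generate and reward functions so it runs; any real version would call an actual language model and a trained reward model here:

```python
import random


def best_of_n(prompt: str, generate, reward_fn, n: int = 16) -> str:
    """Sample n completions, score each with the reward model, return the best."""
    completions = [generate(prompt) for _ in range(n)]
    return max(completions, key=lambda c: reward_fn(prompt, c))


# Stand-ins: a "model" that samples canned answers and a "reward model" that
# happens to prefer longer answers.
generate = lambda prompt: random.choice(["4", "It's 4.", "The answer is 4 because 2+2=4."])
reward_fn = lambda prompt, completion: len(completion)

print(best_of_n("What is 2+2?", generate, reward_fn, n=8))
```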
01:02:34.300 | The interesting thing that people are confused about more
01:02:36.680 | is rejection sampling, because Meta talked about it
01:02:39.060 | in Llama 2.
01:02:40.280 | Essentially, rejection sampling is putting something
01:02:43.600 | like best-of-n sampling in a feedback loop.
01:02:45.600 | And instead of just returning the best answer to a user,
01:02:48.920 | you take the best few answers, and then
01:02:50.720 | you apply instruction tuning on that data set.
01:02:53.480 | And then you do the instruction tuning,
01:02:55.240 | and then you could collect more preference data,
01:02:57.240 | do a new reward model, and then you rank some new outputs,
01:02:59.800 | and you do instruction tuning again.
01:03:01.400 | So essentially, Llama started their RLHF process
01:03:04.440 | with this to get some signal out of preference data.
01:03:06.880 | That preference data went into a reward model,
01:03:08.960 | and then the reward model did a good enough ranking
01:03:11.760 | that it was essentially super-powered instruction
01:03:14.060 | tuning based on rewards.
01:03:16.920 | Works pretty well, much easier to implement than PPO,
01:03:19.280 | because you can use it in all of your--
01:03:21.720 | it's still instruction tuning, so it's
01:03:23.280 | the same autoregressive loss.
01:03:24.560 | It's easy to plug into things like transformers
01:03:26.560 | and stuff like that, a lot easier
01:03:28.340 | than whatever freaking mess doing RL at scale is going to be.
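And rejection sampling, as described here, is roughly that same scoring step wrapped in a fine-tuning loop. A sketch with stand-in sample, score, and finetune callables (the round structure is the point; the helper names are made up):

```python
import random


def rejection_sampling_round(prompts, sample, score, finetune, k: int = 8, keep: int = 1):
    """One round: sample k completions per prompt, keep the top-scoring ones,
    then instruction-tune on those (prompt, completion) pairs with the normal
    autoregressive loss -- no PPO involved."""
    sft_data = []
    for prompt in prompts:
        candidates = [sample(prompt) for _ in range(k)]
        candidates.sort(key=lambda c: score(prompt, c), reverse=True)
        sft_data.extend((prompt, c) for c in candidates[:keep])
    finetune(sft_data)  # stand-in for the supervised fine-tuning step
    return sft_data


# Stand-ins so the sketch runs; iterating this round with fresh preference data
# and a retrained reward model is the loop being described for Llama 2.
data = rejection_sampling_round(
    prompts=["What is 2+2?"],
    sample=lambda p: random.choice(["4", "It's 4.", "The answer is 4 because 2+2=4."]),
    score=lambda p, c: len(c),
    finetune=lambda pairs: None,
    k=4,
)
print(data)
```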
01:03:33.280 | So that's one.
01:03:34.800 | A quick nod that offline RL is something
01:03:37.400 | that people talk about for RLHF, essentially
01:03:39.760 | because your model doesn't have to generate.
01:03:42.120 | In that case, you just look at data,
01:03:44.920 | and it back-propagates through your reward model directly.
01:03:47.520 | So in PPO, you have the step of needing
01:03:50.200 | to generate everything and passing it
01:03:51.740 | through the reward model.
01:03:53.420 | How offline RL essentially works is
01:03:56.160 | that all of this is kind of just done on one big data set.
01:04:00.280 | I'm not an expert in this, but essentially, you
01:04:02.680 | do much less inference costs during the RLHF process
01:04:07.000 | if you do offline RL.
01:04:08.440 | There's a few papers that people have published.
01:04:12.360 | Not a lot of traction.
01:04:13.440 | I think it could take off. Some people that I know in the RLHF
01:04:15.980 | area really think a lot of people
01:04:17.240 | are doing this in industry, just because it makes
01:04:19.280 | the kind of training process simpler
01:04:22.160 | and reduces the number of things you have to have running.
01:04:24.200 | Different feedback types are probably
01:04:29.520 | going to come into play.
01:04:32.040 | There's papers like written feedback
01:04:34.400 | or labeling multiple scores or multiple pairwise preferences
01:04:38.240 | for every completion.
01:04:39.960 | That's coming.
01:04:40.920 | It's also kind of related to what
01:04:42.280 | we mentioned in process reward models,
01:04:43.920 | where you're labeling each step in the chain of thought
01:04:47.360 | reasoning just to kind of make the problem more specific.
01:04:51.360 | It seems very likely that different feedback will
01:04:53.320 | be used for different domains.
01:04:55.240 | Chain of thought reasoning is great for math,
01:04:57.800 | and that's where these process reward models
01:05:00.000 | are being designed.
01:05:01.200 | Probably not great for things like poetry,
01:05:03.000 | but as any tool gets better, it gets more specific.
01:05:07.920 | Then kind of get into more of a talking point,
01:05:11.200 | which I think is fun.
01:05:12.080 | The next one I have is constitutional AI.
01:05:14.000 | I think this is something that people really don't--
01:05:18.440 | like, I think just kind of misunderstood.
01:05:21.080 | I think most people thought that constitutional AI was doing
01:05:23.920 | something where it's like created the preference
01:05:27.200 | data based on the specific principles in some way,
01:05:31.440 | where it's like--
01:05:32.960 | what did you two think of constitutional AI?
01:05:35.080 | I'll be the dumb person, and you correct me.
01:05:37.960 | As far as I understood, Anthropic came out
01:05:39.720 | and said that the best way of doing--
01:05:42.040 | of generating this sort of preference data or alignment
01:05:44.480 | is give a second model a constitution
01:05:46.960 | to evaluate the first model's outputs.
01:05:51.760 | And the constitution is unspecified,
01:05:53.320 | but it draws from the UN Declaration of Human Rights
01:05:57.200 | and the Apple Terms of Service, for some reason.
01:05:59.200 | Yeah, and this leads into the question
01:06:00.780 | of what is the other model evaluating,
01:06:02.760 | and how is it evaluating in a way that you can train on?
01:06:05.360 | And that's what I mean.
01:06:06.320 | People didn't think about this.
01:06:07.700 | A lot of the CAI paper was actually
01:06:09.340 | talking about instruction tuning, which
01:06:11.080 | is if you have an instruction, you then
01:06:13.040 | have a language model that critiques the instruction based
01:06:15.560 | on principles, and then your instruction responses
01:06:18.080 | are closer to the constitutional principles.
01:06:20.000 | This was the first half, which is like they
01:06:22.080 | have some acronym for all of this.
01:06:23.560 | The diagram in their paper's wild in this one.
01:06:26.920 | Their papers are sometimes pretty funny,
01:06:28.600 | because they're not capabilities papers.
01:06:30.360 | They're like alignment papers.
01:06:31.720 | So they don't make everything super clear.
01:06:33.720 | So the first half of constitutional AI
01:06:35.480 | is fine-tuning your instructions based on principles.
01:06:39.600 | That's one half.
01:06:40.320 | And then the second half is what people really
01:06:42.680 | thought that they knew, which is like,
01:06:44.680 | how do you use this other model to provide
01:06:47.600 | a critique based on principles?
01:06:49.780 | And in the paper, they list--
01:06:51.440 | essentially, they say what their prompt was,
01:06:53.520 | which is for the synthetic feedback for generating
01:06:57.320 | new preferences, which is essentially
01:06:59.000 | pick between these two answers based on this principle.
01:07:03.040 | So they're sampling from the principles
01:07:05.800 | in their constitution and from A, B,
01:07:09.600 | two options of completions.
01:07:11.280 | And then the AI model is essentially
01:07:14.600 | given the context of a certain principle
01:07:17.840 | to pick the A or B preference.
01:07:19.400 | And then that's a new preference data set.
01:07:21.160 | It's just the two completions without the context
01:07:23.760 | of the principles.
01:07:25.200 | So with this sampling idea, they're
01:07:27.800 | sampling from 30 principles and a wide data
01:07:30.520 | set of two candidate completions across different prompts.
01:07:33.440 | So to me, it's a very loose--
01:07:37.440 | the values are not explicit in this.
01:07:39.600 | It's just kind of how they're guided.
01:07:41.940 | And it's a very machine learning-y approach
01:07:45.140 | because it is relying on averages and scale
01:07:47.200 | to get the principles in there.
01:07:48.920 | But it is way less explicit than I thought it was going to be.
01:07:51.880 | I kind of thought there was this feedback
01:07:53.600 | thing in the preference data, where
01:07:55.080 | it checked to see if the principles were satisfied
01:07:57.920 | or anything like this.
01:07:58.840 | But it's really just a modification to the RLHF setup
01:08:02.420 | that we've talked about with instruction tuning
01:08:04.380 | and preference data collection, where there is an AI
01:08:07.120 | model providing critiques.
01:08:08.520 | And a lot of those critiques are based
01:08:10.600 | on sampling of constitutional values.
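A sketch of that second, preference-generating half as described here: sample a principle, ask an AI judge to pick between two completions in light of it, and store only the resulting pair. The judge is a trivial stand-in and the prompt wording is illustrative, not Anthropic's exact template:

```python
import random

PRINCIPLES = [
    "Choose the response that is more helpful and honest.",
    "Choose the response that is less likely to be harmful or offensive.",
    # ... the paper samples from a few dozen principles in the constitution.
]


def cai_preference(prompt: str, completion_a: str, completion_b: str, judge):
    """Build one synthetic preference pair using a sampled constitutional principle."""
    principle = random.choice(PRINCIPLES)
    judge_prompt = (
        f"{principle}\n\nPrompt: {prompt}\n\n(A) {completion_a}\n(B) {completion_b}\n\n"
        "Which response is better, A or B?"
    )
    choice = judge(judge_prompt)  # the AI feedback model answers "A" or "B"
    chosen, rejected = (completion_a, completion_b) if choice == "A" else (completion_b, completion_a)
    # Note: the stored pair does NOT keep the principle -- that context is dropped.
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


# Stand-in judge so the sketch runs; a real setup would call a language model here.
print(cai_preference("Explain photosynthesis.",
                     "Plants convert light into chemical energy...",
                     "idk lol",
                     judge=lambda p: "A"))
```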
01:08:13.720 | So it almost sounds more tractable in that way.
01:08:17.160 | But I would also guess, while I just say, oh, look,
01:08:20.000 | I figured it out, I'm guessing they do different things
01:08:22.880 | than they said in the paper.
01:08:24.000 | Like, this paper is from 2022.
01:08:26.280 | It's a pretty old paper.
01:08:28.280 | And they're surely doing more.
01:08:31.200 | But it's good to know where they started, at least in this case.
01:08:34.880 | I thought the communication around the Pareto optimal
01:08:38.560 | improvement was helpful in understanding that you do
01:08:42.560 | actually want it to be more helpful and honest
01:08:47.640 | while maintaining the same level of harmlessness
01:08:49.640 | or something like that, right?
01:08:50.920 | Yeah, so that figure right at the top of the constitutional AI
01:08:54.680 | paper is worth seeing, if you don't have it immediately
01:08:57.080 | pop into your head, where they essentially compare
01:08:59.520 | constitutional AI to other RLHF that they're
01:09:01.840 | doing internally at different--
01:09:03.880 | and it's something that most RLHF papers don't do,
01:09:06.120 | is they have little dots on the lines
01:09:07.800 | to indicate intermediate checkpoints.
01:09:09.720 | And it'd be really great to see more RLHF papers kind
01:09:12.280 | of showing how per epoch or per half epoch of training,
01:09:16.680 | because most RLHF is only a few epochs, at least
01:09:19.240 | in the open models, what is happening there.
01:09:22.240 | People release checkpoints.
01:09:23.400 | But that's how we should be thinking about it,
01:09:25.320 | because the optimizer is so strong.
01:09:27.520 | And it's like, we don't know what's happening
01:09:29.440 | in this kind of intermediate land.
01:09:31.400 | I don't know if this is a relevant comparison for you,
01:09:33.960 | but OpenAI also recently released a weak-to-strong
01:09:38.000 | generalization paper, where they actually
01:09:40.200 | talked about a few intermediate checkpoints for GPT-4.
01:09:44.120 | Any comments on the comparison between constitutional AI
01:09:46.720 | and weak-to-strong generalization?
01:09:48.960 | I didn't see the paper.
01:09:50.040 | I think I saw people criticizing it
01:09:51.840 | for just being safety-washing from the fact
01:09:55.680 | that they're talking about GPT-2 still,
01:09:58.040 | which is such a kind of odd model to focus on.
01:10:00.800 | I didn't really look at it.
01:10:01.960 | I had it lying around.
01:10:02.960 | So I think that it's a thing with OpenAI.
01:10:05.760 | It's like they're sharing less than they know.
01:10:08.840 | So I think they probably have things
01:10:10.280 | that are pretty cool that they're doing internally.
01:10:14.440 | So I'll summarize for listeners who may not
01:10:16.200 | have seen the paper, because it's impossible to keep up
01:10:18.520 | and everything.
01:10:19.640 | I do think that what constitutional AI and RLHF
01:10:24.480 | represents is that we are starting
01:10:27.360 | to come to a point where it's just
01:10:28.800 | impossible for manual human preference data
01:10:31.840 | collection to scale.
01:10:33.800 | And the only way to scale this is to trust our AI overlords
01:10:36.560 | to model our human preferences.
01:10:39.520 | And constitutional AI was the first version of this.
01:10:41.880 | What the second version, or what weak-to-strong is,
01:10:44.600 | is that anticipating a future of superintelligence
01:10:47.720 | or the need for superalignment, where
01:10:50.440 | the thing that we're trying to control is smarter than us.
01:10:53.400 | So you take GPT-2 and try to use GPT-4
01:10:57.240 | to teach it to be smarter than itself,
01:11:00.080 | because this is what we're going to have
01:11:01.880 | to do in the future as well, when we are not--
01:11:04.560 | we're no longer fully in control.
01:11:07.560 | Are we the metaphorical GPT-2, or is--
01:11:10.680 | No, we're not even in the process
01:11:12.840 | anymore at the point of superintelligence.
01:11:15.560 | So they're just basically-- they're prepping.
01:11:17.680 | They're preppers.
01:11:19.960 | And they're saying, this will happen.
01:11:21.480 | And humans will be so far out in the dust
01:11:24.640 | that we just have no say in this debate.
01:11:27.680 | How do we still control systems then?
01:11:30.240 | And weak-to-strong generalization
01:11:32.040 | seems to be the answer.
01:11:33.040 | And I see a lineage from constitutional AI to this.
01:11:37.000 | Yeah, the constitutional AI and the superalignment
01:11:39.480 | is very conceptually linked.
01:11:41.320 | It's like a group of people that has
01:11:43.200 | a very similar intellectual upbringing,
01:11:45.360 | and they work together for a long time,
01:11:47.240 | coming to the same conclusions in different ways.
01:11:49.600 | And I understand the argument.
01:11:50.960 | And I mostly just don't--
01:11:53.840 | I think they're just waiting to see more
01:11:55.640 | from the superalignment team.
01:11:56.840 | Because I just didn't really put it together in my brain
01:11:59.240 | quickly, looking at weak-to-strong generalization
01:12:01.400 | of exactly how it all fits.
01:12:02.920 | But I'm also not a safety researcher.
01:12:05.560 | But I think that could be feedback for them.
01:12:08.640 | I understand what synthetic data means in all of this.
01:12:11.240 | It's like, how could they communicate that a little bit
01:12:14.000 | more specifically in this context?
01:12:15.800 | Because I want to know what they think about this.
01:12:17.840 | Which is why I like that Pareto optimal thing,
01:12:19.760 | it links-- it takes this debate away from x-risk
01:12:23.960 | to, no, this makes the models more useful.
01:12:27.920 | And we can all get behind that.
01:12:29.640 | Yeah.
01:12:30.640 | Yeah, yeah.
01:12:31.600 | I agree.
01:12:32.760 | I think the last kind of emerging direction
01:12:34.600 | that I have might just be this debate.
01:12:36.320 | You can control how long we talk about this,
01:12:38.280 | which is about direct preference optimization.
01:12:42.440 | You could go read my blog post on this.
01:12:44.280 | I had tried to summarize this already.
01:12:46.000 | But essentially, DPO is a different class of algorithms.
01:12:51.040 | I still call it RLHF, because RLHF is so vague
01:12:54.800 | in how it's defined.
01:12:55.640 | I think DPO is closer to RLHF than RLHF is to RL.
01:12:59.580 | You can unpack that if you need to.
01:13:01.880 | But what DPO is doing is essentially
01:13:05.040 | deriving a optimal reward function
01:13:07.960 | from the preference data, where the preference data is
01:13:10.160 | the same thing that we've talked about.
01:13:11.780 | And then the clever math in the paper
01:13:14.880 | emerges the optimal policy to that
01:13:18.440 | based on an implicit reward function.
01:13:20.080 | That's a ratio of log probs.
01:13:22.040 | It's very odd.
01:13:23.160 | The difference between what a DPO reward is
01:13:25.640 | and a classifier reward is very different,
01:13:28.120 | where the classifier is trained to output a scalar value based
01:13:32.360 | on this contrastive-like loss, where
01:13:34.720 | DPO is purely based on the difference between two log
01:13:38.400 | prob ratios.
01:13:39.160 | So the reward there is the ratio between the policy generation
01:13:43.640 | likelihood and the base model generation likelihood.
01:13:47.080 | I don't have intuitions for what that means yet,
01:13:49.080 | but what the reward actually is is very different.
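A minimal sketch of the DPO loss in those terms, where the implicit reward is exactly that log-prob ratio between the policy and the frozen reference model; the sequence-level log-probs and the beta value are the usual ingredients, and the random tensors are only there to make it runnable:

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """DPO: the implicit reward of a completion is beta * (log pi_theta - log pi_ref).

    All inputs are summed log-probs of whole completions, shape (batch,).
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Same -log sigmoid(margin) shape as the classifier reward model loss,
    # but here the "rewards" are log-prob ratios, so minimizing this updates the policy itself.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Toy usage with random log-probs standing in for real model outputs.
batch = 4
loss = dpo_loss(torch.randn(batch, requires_grad=True), torch.randn(batch, requires_grad=True),
                torch.randn(batch), torch.randn(batch))
loss.backward()
print(loss.item())
```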
01:13:52.600 | The data starting point, in principle, could be the same.
01:13:55.440 | And I think we've seen a lot of successes in open source
01:13:58.040 | with it.
01:13:58.560 | It's way simpler to implement and to work
01:14:01.000 | with in that regard, which is why
01:14:02.640 | I think we'll keep seeing a lot of success
01:14:04.980 | with it in the short term.
01:14:06.200 | I think we'll keep seeing DPO models for the time being.
01:14:10.080 | But we won't really answer what the fundamental differences
01:14:13.120 | are, because it depends on your data.
01:14:15.160 | It depends on your infrastructure.
01:14:17.400 | Rumors seem to be that people still
01:14:19.080 | think that PPO-like methods or other RL methods
01:14:22.560 | have a higher top end.
01:14:25.000 | But I don't necessarily think--
01:14:26.680 | Sorry, what is top end?
01:14:27.800 | Just the absolute best model you could get.
01:14:29.600 | I see.
01:14:30.120 | So Google and OpenAI aren't using DPO
01:14:32.800 | because they could do something more complicated.
01:14:34.840 | But that's not what academics and open source people
01:14:38.040 | really care about.
01:14:38.840 | They care about being able to improve on their methods
01:14:40.840 | and understand where to iterate the models
01:14:42.720 | and work off of each other.
01:14:44.200 | So in a lot of ways, I think DPO still will be what people see.
01:14:47.840 | But in some ways, it's probably slightly more constrained.
01:14:52.480 | There's other ways that you could think of PPO
01:14:54.880 | working nicely in code, where it's
01:14:56.640 | if your code runs is the score that you give it.
01:14:59.640 | And you have to generate--
01:15:01.840 | you have to do canned things to get DPO to have the same data.
01:15:06.920 | So there are specific cases where the DPO formulation
01:15:09.560 | is a little bit harder.
01:15:10.640 | But I expect to see more DPO models than anything else
01:15:14.320 | in the next six months.
01:15:15.360 | That's probably what most people need to know,
01:15:17.960 | unless they're an RLHF expert.
01:15:19.720 | And I would love to learn more about PPO and a lot of authors
01:15:23.360 | in this space, from the DPO authors,
01:15:25.160 | who are great to talk to.
01:15:26.200 | You could reach out to all three of them.
01:15:28.360 | So as a time of recording, we're actually
01:15:30.080 | about to publish our NeurIPS recap,
01:15:31.720 | where we talk to the authors.
01:15:33.440 | So for people who are listening to this in the future,
01:15:35.360 | you can refer to that episode.
01:15:36.600 | Yeah.
01:15:37.100 | I think Rafael, Eric, and Archit--
01:15:39.220 | I've talked to all of them at a good length.
01:15:41.660 | And they're all fantastic.
01:15:42.900 | And it's like, they'll say similar things.
01:15:44.740 | And they'll also defend their method,
01:15:46.280 | because it's an awesome paper.
01:15:47.620 | If you want to learn how a good math--
01:15:51.140 | a kind of mathy, but still experimental paper in language
01:15:54.700 | models is, the DPO paper is a really good one
01:15:57.580 | to spend more time on.
01:15:59.220 | Yeah.
01:15:59.720 | Well, when I asked them questions about it,
01:16:02.940 | they just kind of gestured at their poster
01:16:04.660 | and said, look at the equation.
01:16:05.700 | Just stare at it.
01:16:06.340 | And you'll see it.
01:16:07.100 | Yeah, that's my criticism for them.
01:16:08.580 | It's like, they--
01:16:09.380 | [LAUGHTER]
01:16:10.860 | Like, what?
01:16:12.020 | Yeah, they're a little--
01:16:13.540 | they're still in the academic world, where
01:16:15.540 | some of their answers reflect that.
01:16:17.040 | But I've done it enough with them
01:16:18.700 | that I understand what they're saying.
01:16:20.620 | Yeah, yeah.
01:16:21.420 | I will say, it does remind me of Flash Attention a little bit,
01:16:24.040 | in the sense that it's kind of an equivalent thing
01:16:28.260 | to the thing it's replacing.
01:16:29.420 | And it's just faster, cheaper, just better in every way.
01:16:31.860 | It's a very different optimization tool.
01:16:33.900 | It's essentially the thing, in my mind,
01:16:35.600 | that I can't get past is the difference
01:16:37.140 | between the control you get in training a reward model
01:16:40.180 | and then training a policy.
01:16:41.380 | Because essentially, everything you want your reward model
01:16:43.620 | to do might not be everything that you train the policy
01:16:45.920 | to do in the RLHF step, where you
01:16:47.620 | have the two different prompt distributions.
01:16:50.100 | But with DPO, you're doing both at once.
01:16:52.020 | So you don't control that.
01:16:53.660 | And we don't know if you have fancy engineering--
01:16:57.100 | like, if you have fancy engineering abstractions
01:16:59.140 | and test your reward model to do different things,
01:17:01.220 | if that separation is really important.
01:17:03.140 | And I think that's where this benefit
01:17:05.680 | at the absolute biggest scale and most investment
01:17:08.200 | could come from.
01:17:09.240 | But DPO is one update.
01:17:12.000 | It is one model.
01:17:13.040 | You can't separate that.
01:17:14.680 | So that's a thing to know.
01:17:17.760 | It probably doesn't matter for most people.
01:17:19.600 | But it is very different.
01:17:20.960 | And I was asking somebody who was on some of those earlier
01:17:23.660 | OpenAI papers that's not OpenAI anymore.
01:17:25.720 | And they were like, I wish we had thought of that.
01:17:28.080 | So it is a really cool idea.
01:17:32.000 | That's the type of thing that academia still can do
01:17:35.400 | and can do really well and hopefully continues to do.
01:17:39.920 | Yeah.
01:17:40.520 | One thing I wanted to make sure I cover
01:17:42.220 | before we leave this topic--
01:17:45.520 | one of the DPO models that we're trained,
01:17:48.240 | apart from Zephyr and Mixtral, which
01:17:50.120 | is two of the more high-profile ones,
01:17:52.400 | is Tulu from the Allen Institute.
01:17:55.000 | And you're one of the few people maybe placed to explain--
01:17:57.680 | So funny.
01:17:59.240 | Maybe like, what's Allen Institute doing here?
01:18:01.600 | And what's the backstory?
01:18:03.200 | Yeah, so the Allen Institute for AI is--
01:18:05.040 | I think the 10-year birthday is in January.
01:18:07.200 | It's a special event.
01:18:08.760 | And also, people should know, this
01:18:10.160 | is Paul Allen from Microsoft.
01:18:11.760 | Yeah, Paul Allen owns everything in Seattle.
01:18:14.480 | Not literally.
01:18:15.160 | I mean, he's passed, and his estate is still
01:18:18.180 | operating in a lot of great ways.
01:18:19.640 | But the Allen Institute is mostly
01:18:21.520 | known as being a super academic lab where they have
01:18:23.760 | more resources than academia and publish hit
01:18:26.880 | after hit of research papers.
01:18:28.920 | And they're trying to move more in the direction of really
01:18:31.560 | using models.
01:18:32.280 | And this is part of why I joined.
01:18:33.700 | It's like talking with the new CEO, Ali Farhadi.
01:18:37.280 | I don't know if I pronounced the last name right.
01:18:39.320 | But he's trying to move from an org that does papers only
01:18:43.640 | to something that does papers, releases models,
01:18:45.760 | is active in policy, maybe is helping
01:18:48.640 | work with these for-profit institutions that
01:18:50.640 | don't have a middle, an established place where they
01:18:53.640 | could all go through to new things.
01:18:55.400 | So they're really trying to expand the scope.
01:18:58.160 | It's part of why I joined.
01:18:59.880 | And the Tulu2 model is the type of thing where, when I joined,
01:19:02.200 | they were talking about this.
01:19:03.560 | And I was like, OK, we should just train it and release it,
01:19:05.360 | because no one has done this direct preference
01:19:07.280 | optimization at a scale of really
01:19:10.120 | like 70 billion parameter scale.
01:19:11.920 | And this experiment is hilarious.
01:19:13.280 | This is classic of everything kind of works right now in ML.
01:19:16.840 | I showed up, and the grad student Hamish Ivison--
01:19:20.240 | and I need to learn how to pronounce last names better--
01:19:22.520 | but he had some JAX DPO code built on the EasyLM framework.
01:19:26.200 | And we have TPUs that we could
01:19:29.240 | access for research purposes.
01:19:30.740 | So it's like, OK, we have a huge TPU.
01:19:32.240 | It's like, let's just try the Zephyr recipe
01:19:34.480 | on 70 billion parameters.
01:19:35.720 | And it's literally like the first run.
01:19:37.340 | It's like we did no ablations, didn't change any parameters.
01:19:39.880 | We just copied them all over.
01:19:41.560 | And that's the model that people have been working with.
01:19:45.040 | That goes to show that there's a lot of runway and understanding
01:19:48.080 | and improving on this.
01:19:49.360 | We took the same data and just took it
01:19:51.720 | to a different JAX implementation
01:19:53.480 | and scaled it up 10x.
01:19:54.760 | And it still returned a model that was pretty good on benchmarks
01:19:58.400 | and in people using it.
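For readers who want a picture of what "just copy the Zephyr recipe over" means, here is a hedged, illustrative config sketch in Python. These are not the actual JAX/EasyLM settings used for the Tulu2 run; the field names are invented for illustration and the values are only ballpark figures in the spirit of what the Zephyr and Tulu 2 papers report, so check those papers for the real numbers.

```python
# Hypothetical config dict, for illustration only.
zephyr_style_dpo_recipe = {
    "base_model": "Llama-2-70b",          # swapped in for the Tulu2 run; Zephyr started from Mistral-7B
    "sft_data": "UltraChat (filtered)",   # Zephyr's SFT stage; Tulu2 used its own SFT data mix
    "preference_data": "UltraFeedback",   # binarized chosen/rejected pairs for the DPO stage
    "dpo_beta": 0.1,                      # strength of the implicit KL term in the DPO loss
    "learning_rate": 5e-7,                # small learning rate typical for DPO fine-tuning
    "num_epochs": 1,                      # ballpark; reported runs are on the order of 1-3 epochs
}
```

The anecdote in the conversation is that essentially only the first line changed: same data, same hyperparameters, roughly 10x the parameters, and the first run was already the released model.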
01:20:00.240 | So, let's just say, in 2024 we'll be busy in this space.
01:20:04.980 | We're running data ablations to try to understand what's best.
01:20:08.400 | Then Allen Institute is pre-training language models
01:20:11.280 | or pre-training open language models,
01:20:13.280 | where we'll be able to share data, code, everything,
01:20:16.680 | the kind of thing that everyone likes
01:20:18.560 | to get annoyed about these days.
01:20:19.840 | It's like, well, I'm not releasing data.
01:20:21.720 | So that'll come in the new year.
01:20:23.760 | And then things like Tulu2 are the recipes
01:20:26.520 | that we will apply to that.
01:20:28.320 | And we'll kind of keep doing both.
01:20:30.560 | As the pre-trained models get better,
01:20:32.560 | those will probably become more of a priority.
01:20:34.480 | But starting pre-training is very hard.
01:20:36.560 | So it's like you still want to learn from LLAMA2 and LLAMA3.
01:20:40.560 | So that's fun.
01:20:43.240 | I think DPO releases are kind of becoming expected,
01:20:46.880 | because Mistral released a DPO model as well.
01:20:50.720 | I think the slide after this is just like, there's a ton.
01:20:53.800 | It's like Intel releases DPO models.
01:20:55.800 | Stability releases DPO models.
01:20:58.400 | At some point, you just have to accept that that's
01:21:00.440 | where we're going, whether or not
01:21:01.820 | you care about the whole DPO debate.
01:21:03.500 | And that's why I find it so funny,
01:21:04.960 | because there's really interesting, debatable
01:21:07.640 | questions between DPO and other RL methods.
01:21:10.200 | But we just won't have the answer.
01:21:12.120 | And it'll look like there isn't a debate,
01:21:14.200 | because everything that is published is with DPO.
01:21:16.440 | But that doesn't mean that anything
01:21:17.920 | is answered in the time being.
01:21:21.880 | Yeah, kind of last of this stuff is evaluation.
01:21:25.640 | And these slides were prepared kind of last minute.
01:21:29.000 | But I think the question is, how do you evaluate these models
01:21:32.680 | and what you should be doing?
01:21:33.840 | I think the PSA is like, don't trust your numbers
01:21:36.600 | and actually talk to models.
01:21:37.960 | It's very hard to do if you're an engineer or a researcher,
01:21:40.320 | because you have your specific thing that you're zoomed in on.
01:21:43.040 | And it feels like a waste of time
01:21:44.420 | to just go play with ChatGPT or go play with Chatbot Arena.
01:21:46.800 | But I really don't think it is.
01:21:48.100 | It's something that I-- this is me telling myself
01:21:50.360 | what I should be doing.
01:21:51.760 | But there's the question of, is the Hugging Face leaderboard
01:21:56.920 | good for open source?
01:21:58.960 | And then what else can people do?
01:22:01.240 | The Hugging Face leaderboard came out of the team
01:22:03.240 | that I was on there.
01:22:04.080 | We were trying to build a framework to automatically
01:22:07.400 | evaluate the models that we were training
01:22:09.240 | and the models that people were releasing,
01:22:10.960 | and then have them in a central place where it could be like,
01:22:12.740 | look, here's the evaluation scores.
01:22:14.240 | This is what we're competing with.
01:22:15.760 | It obviously blew up.
01:22:16.880 | I think it's very good for companies
01:22:18.960 | trying to operate in the open LLM space
01:22:21.760 | to build businesses around it.
01:22:23.080 | I think it's bad for people building LLMs
01:22:25.080 | that they think are the best, because it's
01:22:27.240 | easy to overfit if you're training and focusing on them
01:22:30.960 | as a developer.
01:22:31.660 | But it's good to have distribution of models
01:22:35.280 | when there's so many people training them.
01:22:38.440 | But it's like, now it has six evaluation tools.
01:22:41.640 | I can't even name all of them off the top of my head.
01:22:43.840 | ARC, HellaSwag, MMLU.
01:22:47.200 | There was DROP on it at one point,
01:22:48.920 | but they dropped DROP, which was pretty funny.
01:22:50.960 | TruthfulQA, and then I think maybe some other math one.
01:22:55.600 | I don't know.
01:22:58.280 | So this benchmark question is something
01:23:01.480 | that everyone's talking about, because there's a lot of gaming
01:23:05.240 | that it seems to be going on.
01:23:07.240 | Is there some discussion about held out benchmarks
01:23:11.440 | that Hugging Face could hold onto?
01:23:13.640 | Mostly it's who's going to pay for it.
01:23:15.760 | But we're thinking about this at Allen AI, too,
01:23:18.240 | is improving on-- we're specifically
01:23:20.240 | thinking about improving on AlpacaEval, which is--
01:23:22.000 | Who's going to pay for running the evals?
01:23:23.280 | Who's going to pay for running the evals?
01:23:24.800 | Right now, Hugging Face is just running every eval every day.
01:23:27.120 | Yeah.
01:23:27.620 | So they have 1,000 GPUs.
01:23:29.040 | At one point, they were going to do more training.
01:23:31.240 | It was going to be used for that.
01:23:32.580 | But now they have less training, and they've
01:23:34.720 | run a good amount of GPUs.
01:23:35.920 | And one of their blog posts, they
01:23:36.920 | said how much compute it was.
01:23:38.160 | I don't think it's a ton to run these,
01:23:39.760 | but it is like, you have to have hundreds of GPUs
01:23:42.440 | to maintain this leaderboard.
01:23:45.320 | So one technical question.
01:23:47.440 | Some of these are open source models that they don't change,
01:23:50.080 | so you just have to run them once.
01:23:51.400 | Yeah.
01:23:52.640 | So it's not that crazy, I don't think.
01:23:56.140 | It's tractable for--
01:23:56.920 | It's only the closed source models
01:23:58.440 | that need to be re-evaluated.
01:23:59.640 | Yeah, so if you look at the chat arena,
01:24:01.320 | they take specific dates.
01:24:04.700 | Yeah.
01:24:05.200 | And then there's this whole controversy of,
01:24:07.040 | is ChatGPT from March better than ChatGPT from June?
01:24:12.280 | So on one of these future slides,
01:24:15.520 | it's slide 58 is the chatbot arena leaderboard,
01:24:20.880 | if you're looking later, which chatbot arena is this thing
01:24:23.760 | from LMSYS that we were looking at.
01:24:25.720 | And then on the x-axis is models.
01:24:27.520 | And you can see that GPT-4 from March has a higher score.
01:24:33.360 | And the same-- it's like, this is not a perfect comparison.
01:24:37.360 | But there are signs that are pretty funny there,
01:24:39.480 | that there are things cooking.
01:24:42.680 | But you don't know who's collecting this data,
01:24:44.720 | what prompts they're doing, and what--
01:24:47.440 | but it's such a funny timeline.
01:24:49.680 | So for those listening, GPT-4 March 14
01:24:53.000 | is 40 Elo points higher than GPT-4 June 13.
01:24:55.940 | Yeah, it's outside of the error bars on the LMSYS thing.
01:24:58.480 | That's pretty high.
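For a sense of scale, an Elo gap translates into an expected head-to-head win rate via the standard Elo expected-score formula; a quick back-of-the-envelope calculation:

```python
def elo_win_prob(delta):
    # Expected score of the higher-rated model under the standard Elo formula.
    return 1 / (1 + 10 ** (-delta / 400))

print(round(elo_win_prob(40), 3))  # ~0.557: the March model wins about 56% of pairwise votes
```

So 40 points is a small edge per vote, but one that is consistently measurable across many votes.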
01:24:59.280 | And the other piece of context is
01:25:00.680 | that GPT-4 Turbo is also notably ahead of the other GPT-4s,
01:25:05.960 | which it kind of showed up immediately
01:25:08.040 | once they added it to the leaderboard or to the arena.
01:25:11.040 | And I was like, all the GPT-4.5 memes aside,
01:25:15.480 | it seems like this is effectively
01:25:17.000 | a bump in the model.
01:25:18.040 | If it's clear-- if you zoom into this,
01:25:21.520 | the leaderboard is very close for many strata of models.
01:25:28.280 | So there are levels where you can get your model to,
01:25:31.080 | and it'll be really close to your peers.
01:25:32.800 | So in the open source, there's things like Mixtral Instruct
01:25:38.080 | and Tulu2 70B, which is effectively--
01:25:40.600 | it's a way bigger model than Mixtral.
01:25:42.920 | Mixtral's the mixture-of-experts model.
01:25:45.360 | All due credit.
01:25:46.040 | It's a very good model, and that's
01:25:47.460 | going to be the next level once people
01:25:49.280 | get better at fine-tuning it.
01:25:50.560 | Yi-34B-Chat, this is one level.
01:25:53.600 | And then there was a level with the alpacas and the vicunas.
01:25:56.840 | But all of these open source models,
01:25:58.880 | there's then another step up to GPT-4,
01:26:01.500 | and then there's another step up to GPT-4 Turbo.
01:26:03.960 | So it's like the difference from the GPT-4 Turbo
01:26:07.720 | to the GPT-4 that was first released
01:26:10.720 | is bigger than the difference from Tulu2 to GPT-4.
01:26:14.600 | So that's just like, there's something good going on there.
01:26:17.720 | And I was like, OK, that's a new model by my standards,
01:26:20.840 | but they're not going to tell us about it.
01:26:23.400 | They did in DevDay.
01:26:24.280 | They said it's our new model, but they weren't like,
01:26:26.920 | this is our new best-performing model,
01:26:28.840 | because the benchmark scores are probably the same,
01:26:32.120 | but they made it so that people like using it more.
01:26:34.560 | There's some hints that 4.5 might drop at some point.
01:26:37.400 | We don't actually know how true those things are,
01:26:39.400 | but I don't think it really matters.
01:26:40.900 | It's like they could call anything--
01:26:42.440 | they're retraining these models, and they could call any of them
01:26:45.240 | GPT-4.5.
01:26:47.640 | I think the two tools that I talk about most
01:26:50.120 | in research domains on RLHF is AlpacaEval and MTBench.
01:26:55.200 | They're two academic-maintained leaderboards
01:26:58.100 | for evaluating chat capabilities.
01:27:00.400 | Evaluating chat is really hard, and what they both do
01:27:03.360 | is they have GPT-4 provide some sort of feedback.
01:27:06.520 | MTBench is called MT for multi-turn,
01:27:10.480 | and they have a prompt and a follow-up question.
01:27:12.560 | So what they do is they ask GPT-4
01:27:14.680 | to score both the initial response
01:27:19.520 | and the second response, and provide the average.
01:27:21.600 | Kind of given up on following the slides.
01:27:23.320 | It's all on the slides if you look for it.
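As a rough illustration of that MTBench scoring flow, here is a hedged Python sketch. The helper functions are hypothetical stubs standing in for the real candidate-model and GPT-4-judge API calls; MTBench's actual prompts and judging templates live in the LMSYS FastChat repository.

```python
def ask_model(model, conversation):
    # Hypothetical stub: would call the candidate model with the conversation so far.
    return f"{model} answer to: {conversation[-1]}"

def gpt4_judge_score(question, answer):
    # Hypothetical stub: would ask GPT-4 to rate the answer, typically on a 1-10 scale.
    return 7.0

def mtbench_question_score(model, question, follow_up):
    answer_1 = ask_model(model, [question])
    answer_2 = ask_model(model, [question, answer_1, follow_up])
    score_1 = gpt4_judge_score(question, answer_1)   # judge rates the first turn
    score_2 = gpt4_judge_score(follow_up, answer_2)  # judge rates the follow-up turn
    return (score_1 + score_2) / 2                   # multi-turn: report the average

print(mtbench_question_score("tulu-2-dpo-70b", "Write a haiku about TPUs.", "Now make it rhyme."))
```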
01:27:25.120 | And then AlpacaEval is a little bit different,
01:27:27.840 | where you're comparing a candidate model.
01:27:31.000 | So the model we've trained.
01:27:32.200 | So when we're training Tulu, we submit this.
01:27:34.860 | And what it's doing under the hood is comparing the new model
01:27:37.360 | to text-davinci-003, which is one of OpenAI's older instruction
01:27:42.680 | models, and calculating the win rate
01:27:45.560 | that GPT-4 sees between the new model and davinci-003.
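And a matching hedged sketch of the AlpacaEval side: GPT-4 picks a winner between the candidate model and the fixed text-davinci-003 reference on each prompt, and the headline number is the win rate. Again, the judge function here is a hypothetical stub, not AlpacaEval's real implementation.

```python
def gpt4_prefers_candidate(prompt, candidate_answer, reference_answer):
    # Hypothetical stub: would ask GPT-4 which answer better follows the instruction.
    return len(candidate_answer) >= len(reference_answer)  # dummy tie-breaker for the sketch

def alpacaeval_win_rate(examples):
    # examples: list of (prompt, candidate_answer, reference_answer) triples
    wins = sum(gpt4_prefers_candidate(p, c, r) for p, c, r in examples)
    return wins / len(examples)  # e.g. 0.90 means GPT-4 preferred the candidate 90% of the time

print(alpacaeval_win_rate([
    ("Plan a 3-day trip to Kyoto.", "Day 1: Fushimi Inari at sunrise...", "Visit some temples."),
]))
```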
01:27:49.440 | So it has many more prompts than MTBench.
01:27:53.800 | MTBench is custom prompts that they
01:27:55.480 | made to just kind of take a stance on what
01:27:58.560 | is a good chat model.
01:27:59.840 | AlpacaEval sources theirs from Self-Instruct,
01:28:03.240 | which is a popular paper from AI2.
01:28:05.880 | Open Assistant, Vicuna, Koala, Anthropic's Helpful and Harmless.
01:28:09.120 | So AlpacaEval is from sources that people know and love.
01:28:12.320 | MTBench is its own thing.
01:28:14.360 | We were more focused on MTBench at Hugging Face.
01:28:17.280 | At AI2, we're a little bit more focused on AlpacaEval.
01:28:20.040 | But it really can go either way.
01:28:22.080 | These are kind of like table stakes to saying that you have
01:28:24.160 | a good RLHF model.
01:28:26.400 | You should be able to have a pretty good score on both
01:28:29.040 | of these.
01:28:30.080 | And then the kind of proof is in people actually talking to it.
01:28:33.280 | So I think the Zephyr model from Hugging Face
01:28:35.960 | was a kind of step change in people's perception
01:28:39.280 | of open models that got integrated
01:28:41.000 | into a bunch of products within a few weeks.
01:28:43.200 | Like U.com was experimenting with it.
01:28:45.960 | And someone else, like I saw some substacker,
01:28:48.440 | was using it as a writing feedback bot instead of ChatGPT.
01:28:52.440 | But that's what happens when a good open release is there now.
01:28:57.600 | It's like the evaluations are good and people pick it up.
01:29:00.920 | And the evaluations are just enough
01:29:03.040 | to say, OK, we're in the right ballpark.
01:29:05.240 | But you never really know if the model is
01:29:08.140 | the one or one of these big ones without talking to it.
01:29:12.440 | However much you talk about evals,
01:29:13.920 | that's still where we're at.
01:29:15.720 | You can't prove anything definitively.
01:29:18.280 | And Google's seeing that.
01:29:20.160 | And until Gemini Ultra comes out, we don't know.
01:29:23.520 | It's probably a great model, but we don't know what they have.
01:29:27.360 | Gemini Pro didn't do so great on the other stuff, too.
01:29:30.480 | Yeah, I want to know if Gemini Pro is just
01:29:32.520 | like some intermediate checkpoint,
01:29:34.520 | or if it was like a major deliverable for them or not.
01:29:38.240 | Which if it wasn't a major deliverable,
01:29:39.960 | it's probably a strategy headache for Google.
01:29:42.480 | But that's not my problem.
01:29:43.840 | You have a bunch of open questions here.
01:29:48.960 | One of our lightning round questions is always--
01:29:51.160 | Yeah, we just do inverted lightning round?
01:29:52.560 | Yeah, exactly.
01:29:53.640 | You asked people open questions.
01:29:57.280 | Oh, I mean, there's so much to do here.
01:29:59.200 | They're kind of like summarization of things
01:30:01.040 | that will be hinted at in the talk to this point, which
01:30:06.280 | is like I split it up in my work between data training
01:30:09.080 | and model, which is essentially like how
01:30:11.440 | do we evaluate what's happening at the model level with RLHF.
01:30:14.640 | I think big labs are so indexed
01:30:16.680 | on their own base models
01:30:18.280 | that they don't know how swapping between Claude Base
01:30:20.760 | or GPT-4 Base would change any notion of preference
01:30:25.640 | or what you do with RLHF.
01:30:25.640 | I think in the open, we could do that.
01:30:27.480 | We could swap between Llama 2 and Mixtral
01:30:29.720 | and kind of see, does RLHF work the same for both of those?
01:30:33.080 | Do they both get AlpacaEval bumps
01:30:34.760 | when you use the same data set in the same framework
01:30:37.200 | down the line?
01:30:38.200 | That'd be good to know how sensitive RLHF is.
01:30:41.720 | On the data, we talk a lot about aggregation.
01:30:44.220 | On the research side, there's a lot of interesting things
01:30:46.600 | just like, does getting your data
01:30:48.360 | from Scale or a Discord army change
01:30:51.000 | the quality of the data based on professional contexts?
01:30:54.720 | And like--
01:30:55.200 | The results of this might really affect Scale.
01:30:57.360 | Yeah.
01:30:58.480 | They probably should do it internally.
01:31:00.160 | They should do internal market analysis on that line.
01:31:05.040 | We should also mention, there has
01:31:06.520 | been a report that a lot of these labelers
01:31:08.760 | use ChatGPT to do their work.
01:31:11.400 | Yeah.
01:31:11.960 | I mean, I'm not surprised.
01:31:13.600 | So it's like-- it's a lot of messy grounds
01:31:16.080 | in RL these days.
01:31:17.720 | And then there's more training questions,
01:31:19.440 | which is like, what happens at the end of the day?
01:31:22.600 | I mentioned what I call qualitative alignment
01:31:25.520 | earlier on, which is like, do the models
01:31:27.520 | get better in ways matching the preference data preferences?
01:31:31.040 | So if you collect two batches of preference data
01:31:33.720 | with different priorities, what is the downstream model change?
01:31:37.280 | I don't know if it does anything.
01:31:38.840 | Should all data be equal?
01:31:40.480 | If you have health care questions,
01:31:42.080 | should it be the same as like, write me a joke?
01:31:45.160 | This is all implicit to deep learning.
01:31:47.520 | Like, deep learning just scales and aggregates.
01:31:50.040 | And I think we are going to be on that ride,
01:31:53.280 | but it's not necessarily what some people would
01:31:55.320 | call fair or good.
01:31:57.160 | And then the kind of last slide that I have is fun,
01:31:59.240 | which is just like, John Schulman talks about this
01:32:01.360 | in his ICML talk.
01:32:02.440 | His ICML talk on proxy objectives for RLHF
01:32:05.280 | is public now.
01:32:06.000 | They made it public three months after the conference
01:32:08.240 | or some weird timeline.
01:32:09.680 | But he talks about things like ChatGPT being verbose
01:32:12.560 | and having self-doubt, refusals.
01:32:14.880 | Things that are really in vogue in the conversation right now
01:32:17.840 | and how those can emerge in the process of continually trying
01:32:22.520 | to adjust the RLHF process based on what users
01:32:25.840 | are seeing in the model.
01:32:27.440 | And this is like a sort of outer loop optimization
01:32:29.760 | that no one in the open is even remotely qualified
01:32:32.200 | to talk about, but OpenAI does monitor.
01:32:34.160 | And they'll rerun RLHF and train a new reward model
01:32:36.920 | with a mixture of their curated data and user prompts
01:32:40.280 | to try to make it work better over time.
01:32:43.040 | And that's the different model versions.
01:32:44.880 | And while there's a lot of critiques about this,
01:32:47.200 | they're definitely intentional in trying to fix--
01:32:50.400 | I feel like it's probably whack-a-mole, where they're
01:32:52.600 | like, oh, there's this problem.
01:32:53.240 | We have the data.
01:32:53.800 | We can fix this.
01:32:54.520 | And then it pops up some new problem after doing RLHF,
01:32:57.680 | and they're studying this.
01:32:59.720 | And if you could really figure it out,
01:33:01.400 | this is where things start to look more like RL.
01:33:03.480 | You could automate it.
01:33:05.600 | Things are just like longer time frame of optimizing the model.
01:33:09.640 | It would be cool, but I feel like I'm
01:33:12.320 | years away from ever actually working on this.
01:33:14.680 | But we can try to get details from people who are.
01:33:18.940 | Yeah, excellent.
01:33:20.320 | Awesome.
01:33:21.320 | Yeah, anything else that we missed?
01:33:23.680 | I think we covered a lot of it.
01:33:26.360 | I mean, I'm good.
01:33:27.080 | I would ask you guys about if you know companies that
01:33:29.320 | are doing this and things.
01:33:30.680 | I know some that are in the RLHF as a service space
01:33:34.240 | will become busy, I think for good reason,
01:33:36.400 | just because--
01:33:37.280 | This company is doing RLAIF as a service.
01:33:39.760 | Yeah, both of them are.
01:33:40.720 | It depends if synthetic data is going to win over human data.
01:33:43.600 | If human data is the real winning feature in the end,
01:33:48.040 | it's a big capital investment.
01:33:49.440 | So it kind of makes sense as a VC model anyways,
01:33:51.960 | but there's going to be both of them for a while.
01:33:54.760 | That'd be cool.
01:33:56.520 | You see a lot of people-- because I know
01:33:58.200 | Louis Castricato is starting a company.
01:34:01.000 | Is there a lot of ambition in this field to start companies,
01:34:04.600 | or is this more such a research-driven part of the stack
01:34:09.280 | that maybe it just stays there?
01:34:10.640 | There definitely is, because I know my former colleague
01:34:13.280 | Nazneen Rajani from Hugging Face is also starting
01:34:16.560 | a company in this space.
01:34:18.400 | The Falcon team who left Hugging Face, I think,
01:34:21.200 | is also working in this space.
01:34:22.880 | I don't really know.
01:34:24.600 | I don't know exactly what--
01:34:25.760 | I haven't talked to them since ICML,
01:34:27.240 | so I don't know what they're doing.
01:34:28.240 | Startups change a lot.
01:34:29.440 | But there are definitely a lot of people
01:34:31.080 | looking at this space.
01:34:32.520 | I mean, Scale's probably trying to do it.
01:34:34.240 | If I was Scale, I would want to do it.
01:34:35.920 | I think they've historically had trouble
01:34:37.640 | keeping technical ML talent, but they've
01:34:39.800 | started a new research lab, so that should help.
01:34:43.200 | It's a busy area.
01:34:45.160 | Cool.
01:34:46.040 | What's going on?
01:34:47.040 | Yeah.
01:34:47.560 | Awesome, Nathan.
01:34:48.400 | Thank you so much.
01:34:48.920 | That was a masterclass.
01:34:50.040 | I think this is the first 201 that we've ever had,
01:34:52.160 | and you set the bar very high.
01:34:54.040 | Thank you.
01:34:56.800 | Bye, everyone.
01:34:58.360 | Bye-bye.
01:34:58.960 | [MUSIC PLAYING]