back to indexDario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity | Lex Fridman Podcast #452
Chapters
0:0 Introduction
3:14 Scaling laws
12:20 Limits of LLM scaling
20:45 Competition with OpenAI, Google, xAI, Meta
26:8 Claude
29:44 Opus 3.5
34:30 Sonnet 3.5
37:50 Claude 4.0
42:2 Criticism of Claude
54:49 AI Safety Levels
65:37 ASL-3 and ASL-4
69:40 Computer use
79:35 Government regulation of AI
98:24 Hiring a great team
107:14 Post-training
112:39 Constitutional AI
118:5 Machines of Loving Grace
137:11 AGI timeline
149:46 Programming
156:46 Meaning of life
162:53 Amanda Askell - Philosophy
165:21 Programming advice for non-technical people
169:9 Talking to Claude
185:41 Prompt engineering
194:15 Post-training
198:54 Constitutional AI
203:48 System prompts
209:54 Is Claude getting dumber?
221:56 Character training
222:56 Nature of truth
227:32 Optimal rate of failure
234:43 AI consciousness
249:14 AGI
257:52 Chris Olah - Mechanistic Interpretability
262:44 Features, Circuits, Universality
280:17 Superposition
291:16 Monosemanticity
298:8 Scaling Monosemanticity
306:56 Macroscopic behavior of neural networks
311:50 Beauty of neural networks
00:00:00.000 |
If you extrapolate the curves that we've had so far, right? 00:00:03.080 |
If, if you say, well, I don't know, we're starting to get to like PhD level. 00:00:07.360 |
And last year we were at undergraduate level and the year before we were at 00:00:11.340 |
like the level of a high school student, again, you can, you can quibble with 00:00:14.980 |
at what tasks and for what we're still missing modalities, but those are being 00:00:19.140 |
added, like computer use was added, like image generation has been added. 00:00:22.680 |
If you just kind of like eyeball the rate at which these capabilities are 00:00:26.280 |
increasing, it does make you think that we'll get there by 2026 or 2027. 00:00:31.640 |
I think there are still worlds where it doesn't happen in, in a hundred years. 00:00:34.920 |
Those were the number of those worlds is rapidly decreasing. 00:00:38.520 |
We are rapidly running out of truly convincing blockers, truly compelling 00:00:43.500 |
reasons why this will not happen in the next few years, the scale up is very quick. 00:00:47.500 |
Like we do this today, we make a model and then we deploy thousands, maybe 00:00:53.960 |
I think by the time, you know, certainly within two to three years, whether we 00:00:57.640 |
have these super powerful AIs or not, clusters are going to get to the size 00:01:01.240 |
where you'll be able to deploy millions of these, I am optimistic about meaning. 00:01:05.660 |
I worry about economics and the concentration of power. 00:01:12.000 |
The abuse of power and AI increases the amount of power in the world. 00:01:18.280 |
And if you concentrate that power and abuse that power, 00:01:25.440 |
The following is a conversation with Dario Amadei, CEO of Anthropic, the 00:01:33.200 |
company that created Claude, that is currently and often at the top of 00:01:39.500 |
On top of that, Dario and the Anthropic team have been outspoken advocates for 00:01:44.860 |
taking the topic of AI safety very seriously, and they have continued to 00:01:49.880 |
publish a lot of fascinating AI research on this and other topics. 00:01:54.960 |
I'm also joined afterwards by two other brilliant people from Anthropic. 00:02:00.320 |
First, Amanda Askel, who is a researcher working on alignment and fine-tuning 00:02:06.680 |
of Claude, including the design of Claude's character and personality. 00:02:11.000 |
A few folks told me she has probably talked with Claude more 00:02:18.240 |
So she was definitely a fascinating person to talk to about prompt 00:02:22.280 |
engineering and practical advice on how to get the best out of Claude. 00:02:30.440 |
He's one of the pioneers of the field of mechanistic interpretability, which 00:02:36.000 |
is an exciting set of efforts that aims to reverse engineer neural networks to 00:02:41.000 |
figure out what's going on inside, inferring behaviors from neural activation 00:02:48.760 |
This is a very promising approach for keeping future 00:02:54.720 |
For example, by detecting from the activations when the model is trying 00:03:05.360 |
To support it, please check out our sponsors in the description. 00:03:12.960 |
Let's start with the big idea of scaling laws and the scaling hypothesis. 00:03:17.760 |
What is it, what is its history, and where do we stand today? 00:03:21.360 |
So I can only describe it as it, you know, as it relates to kind of my own 00:03:25.960 |
experience, but I've been in the AI field for about 10 years, and it was 00:03:32.160 |
So I first joined the AI world when I was working at Baidu with Andrew Ng in 00:03:36.840 |
late 2014, which is almost exactly 10 years ago now, and the first thing we 00:03:45.000 |
And in those days, I think deep learning was a new thing. 00:03:47.840 |
It had made lots of progress, but everyone was always saying, we don't 00:03:53.360 |
You know, we're not, we're only matching a tiny, tiny fraction. 00:03:57.960 |
There's so much we need to kind of discover algorithmically. 00:04:01.160 |
We haven't found the picture of how to match the human brain. 00:04:04.000 |
And when, you know, in some ways it was fortunate. 00:04:08.560 |
I was kind of, you know, you can have almost beginner's luck, right? 00:04:13.360 |
And, you know, I looked at the neural net that we were using for speech, 00:04:16.360 |
the recurrent neural networks, and I said, I don't know, what if you make 00:04:21.120 |
And what if you scale up the data along with this, right? 00:04:23.560 |
I just saw these as, as like independent dials that you could turn. 00:04:27.280 |
And I noticed that the model started to do better and better as you gave them 00:04:31.200 |
more data, as you, as you made the models larger, as you trained them for longer. 00:04:35.880 |
And I didn't measure things precisely in those days, but, but along 00:04:40.800 |
with, with colleagues, we very much got the informal sense that the more 00:04:45.120 |
data and the more compute and the more training you put into these 00:04:51.000 |
And so initially my thinking was, Hey, maybe that is just true for 00:04:56.960 |
Maybe, maybe that's just one particular quirk, one particular area. 00:05:00.840 |
I think it wasn't until 2017 when I first saw the results from GPT-1 that it 00:05:07.600 |
clicked for me that language is probably the area in which we can do this. 00:05:11.480 |
We can get trillions of words of language data. 00:05:16.560 |
And the models we were trained in those days were tiny. 00:05:22.080 |
Whereas, you know, now we train jobs on tens of thousands, soon 00:05:27.360 |
And so when I, when I saw those two things together and, you know, there 00:05:31.720 |
were a few people like Ilya Sutskever, who, who you've interviewed, who 00:05:36.760 |
He might've been the first one, although I think a few people came to, came to 00:05:42.440 |
There was, you know, Rich Sutton's bitter lesson, there was Gorin wrote 00:05:46.120 |
about the scaling hypothesis, but I think somewhere between 2014 and 2017 was when 00:05:52.840 |
it really clicked for me when I really got conviction that, Hey, we're going to 00:05:56.400 |
be able to do these incredibly wide cognitive tasks if we just, if we just 00:06:01.720 |
scale up the models and at, at every stage of scaling, there are always 00:06:06.280 |
arguments and, you know, when I first heard them, honestly, I thought probably 00:06:10.920 |
And, you know, all these, all these experts in the field are right. 00:06:13.200 |
They know the situation better, better than I do. 00:06:15.840 |
There's, you know, the Chomsky argument about like, you can get 00:06:21.120 |
There was this idea, Oh, you can make a sentence make sense, but you 00:06:27.600 |
Uh, you know, we're going to run out of data or the data isn't high 00:06:33.920 |
And, and each time, every time we managed to, we managed to either find a way 00:06:40.400 |
Um, sometimes it's one, sometimes it's the other. 00:06:43.160 |
Uh, and, and so I'm now at this point, I, I still think, you know, it's, it's, 00:06:49.080 |
We have nothing but inductive inference to tell us that the next few years are 00:06:53.280 |
going to be like the next, the last 10 years, but, but I've seen, I've seen the 00:06:57.640 |
movie enough times I've seen the story happen for, for enough times to really 00:07:02.160 |
believe that probably the scaling is going to continue and that there's 00:07:05.880 |
some magic to it that we haven't really explained on a theoretical basis yet. 00:07:12.160 |
Bigger networks, bigger data, bigger compute. 00:07:16.920 |
All of those in particular, linear scaling up of bigger networks, bigger 00:07:26.760 |
Uh, so all of these things, almost like a chemical reaction, you know, you have 00:07:30.240 |
three ingredients in the chemical reaction and you need to linearly 00:07:34.760 |
If you scale up one, not the others, you run out of the other 00:07:39.800 |
But if you scale up everything, everything in series, then, 00:07:44.960 |
And of course, now that you have this kind of empirical science slash art, 00:07:48.880 |
you can apply it to other, uh, more nuanced things like scaling laws applied 00:07:54.880 |
to interpretability or scaling laws applied to post-training or just seeing 00:07:59.760 |
how does this thing scale, but the big scaling law, I guess the underlying 00:08:04.040 |
scaling hypothesis has to do with big networks, big data leads to intelligence. 00:08:09.200 |
Yeah, we've, we've documented scaling laws in lots of domains 00:08:15.040 |
So, uh, initially the, the paper we did that first showed it was in early 2020 00:08:21.760 |
There was then some work late in 2020 where we showed the same thing for 00:08:26.560 |
other modalities like images, video, text to image, image to text, math, 00:08:34.640 |
And, and you're right now, there are other stages like post-training or there 00:08:40.280 |
And in, in, in all of those cases that we've measured, we see similar, 00:08:47.200 |
A bit of a philosophical question, but what's your intuition about why 00:08:52.080 |
bigger is better in terms of network size and data size? 00:08:59.840 |
So in my previous career as a, as a biophysicist, so I did physics 00:09:03.680 |
undergrad and then biophysics in, in, in, in grad school. 00:09:07.040 |
So I think back to what I know as a physicist, which is actually much less 00:09:10.400 |
than what some of my colleagues at Anthropic have in terms of, in terms 00:09:14.280 |
of expertise in physics, uh, there's this, there's this concept called the 00:09:18.960 |
one over F noise and one over X distributions, um, where, where often, 00:09:24.400 |
um, uh, you know, just, just like if you add up a bunch of natural 00:09:29.840 |
If you add up a bunch of kind of differently distributed natural 00:09:34.240 |
processes, if you like, if you like, take a, take a, um, probe 00:09:39.800 |
The distribution of the thermal noise and the resistor goes 00:09:44.960 |
Um, it's some kind of natural convergent distribution. 00:09:47.920 |
Uh, and, and I, I, I, and, and I think what it amounts to is that if you look 00:09:53.280 |
at a lot of things that are, that are produced by some natural process that 00:09:59.240 |
Not a Gaussian, which is kind of narrowly distributed, but, you know, 00:10:02.760 |
if I look at kind of like large and small fluctuations that lead to lead 00:10:07.680 |
to electrical noise, um, they have this decaying one over X distribution. 00:10:12.960 |
And so now I think of like patterns in the physical world, right. 00:10:16.560 |
If I, if, or, or in language, if I think about the patterns in language, 00:10:22.240 |
Some words are much more common than others like the, then there's 00:10:27.560 |
Then there's the fact that, you know, nouns and verbs have to agree. 00:10:30.880 |
They have to coordinate and there's the higher level sentence structure. 00:10:33.960 |
Then there's the thematic structure of paragraphs. 00:10:36.280 |
And so the fact that there's this regressing structure, you can imagine 00:10:40.480 |
that as you make the networks larger, first, they capture the really simple 00:10:46.800 |
And there's this long tail of other patterns. 00:10:49.560 |
And if that long tail of other patterns is really smooth, like it is with the 00:10:54.080 |
one over F noise in, you know, physical processes, like, like, like, like 00:10:58.000 |
resistors, then you can imagine as you make the network larger, it's kind of 00:11:02.040 |
capturing more and more of that distribution. 00:11:04.480 |
And so that smoothness gets reflected in how well the models are at 00:11:18.000 |
We have common expressions and less common expressions. 00:11:20.760 |
We have ideas, cliches that are expressed frequently, and we have novel ideas. 00:11:25.800 |
And that process has, has developed, has evolved with 00:11:30.920 |
And so the, the, the guess, and this is pure speculation would be, would be 00:11:35.360 |
that there is, there's some kind of long tail distribution of, of, of 00:11:41.520 |
So there's the long tail, but also there's the height of the hierarchy 00:11:47.120 |
So the bigger the network, presumably you have a higher capacity to. 00:11:51.000 |
If you have a small network, you only get the common stuff, right? 00:11:53.880 |
If, if I take a tiny neural network, it's very good at understanding that, 00:11:58.040 |
you know, a sentence has to have, you know, verb, adjective, noun, right? 00:12:01.440 |
But it's, it's terrible at deciding what those verb, adjective, and noun 00:12:05.360 |
should be and whether they should make sense. 00:12:06.960 |
If I make it just a little bigger, it gets good at that. 00:12:09.640 |
Then suddenly it's good at the sentences, but it's not good at the paragraphs. 00:12:12.960 |
And so these, these rarer and more complex patterns get picked up as I 00:12:19.880 |
Well, the natural question then is what's the ceiling of this? 00:12:23.920 |
Like how complicated and complex is the real world? 00:12:29.600 |
I don't think any of us knows the answer to that question. 00:12:32.320 |
Um, I S my strong instinct would be that there's no ceiling 00:12:37.880 |
We humans are able to understand these various patterns. 00:12:40.840 |
And so that, that makes me think that if we continue to, you know, 00:12:45.040 |
scale up these, these, these models to kind of develop new methods for 00:12:49.800 |
training them and scaling them up, uh, that will at least get to the level 00:12:55.240 |
There's then a question of, you know, how much more is it possible 00:13:00.040 |
How much, how much is it possible to be smarter and more perceptive than humans? 00:13:04.400 |
I would guess the answer has, has got to be domain dependent. 00:13:09.520 |
If I look at an area like biology and, you know, I wrote this essay, 00:13:13.400 |
machines of loving grace, it seems to me that humans are struggling to 00:13:20.480 |
If you go to Stanford or to Harvard or to Berkeley, you have whole departments 00:13:25.800 |
of, you know, folks trying to study, you know, like the immune 00:13:31.200 |
And, and each person understands only a tiny bit, part of it specializes. 00:13:36.480 |
And they're struggling to combine their knowledge with that of, 00:13:40.400 |
And so I have an instinct that there's, there's a lot of room at 00:13:45.560 |
If I think of something like materials in the, in the physical world, or, you 00:13:51.160 |
know, um, like addressing, you know, conflicts between humans or something 00:13:55.200 |
like that, I mean, you know, it may be, there's only some of these problems 00:14:00.720 |
And, and it may be that there's only, there's only so well you can 00:14:07.280 |
There's only so clear I can hear your speech. 00:14:09.760 |
So I think in some areas there may be ceilings in, in, in, you know, that 00:14:14.440 |
are very close to what humans have done in other areas, those 00:14:18.800 |
And I think we'll only find out when we build these systems. 00:14:21.680 |
Uh, there's, it's very hard to know in advance. 00:14:25.440 |
And in some domains, the ceiling might have to do with human bureaucracies 00:14:31.520 |
So humans fundamentally have to be part of the loop. 00:14:34.720 |
That's the cause of the ceiling, not maybe the limits of the intelligence. 00:14:38.480 |
I think in many cases, um, you know, in theory, technology 00:14:44.840 |
For example, all the things that we might invent with respect to biology. 00:14:49.080 |
Um, but remember there's, there's a, you know, there's a clinical trial 00:14:52.640 |
system that we have to go through to actually administer these things to humans. 00:14:56.840 |
I think that's a mixture of things that are unnecessary and bureaucratic 00:15:01.000 |
and things that kind of protect the integrity of society. 00:15:04.160 |
And the whole challenge is that it's hard to tell. 00:15:09.200 |
My, my view is definitely, I think in terms of drug development, we, my view 00:15:14.480 |
is that we're too slow and we're too conservative, but certainly if you get 00:15:18.120 |
these things wrong, you know, it's, it's possible to, to, to risk people's 00:15:21.720 |
lives by, by being, by being, by being too reckless. 00:15:24.760 |
And so at least, at least some of these human institutions 00:15:32.480 |
I strongly suspect that balance is kind of more on the side of pushing to make 00:15:36.680 |
things happen faster, but there is a balance. 00:15:38.560 |
If we do hit a limit, if we do hit a slowdown in the scaling laws, what 00:15:50.800 |
So a few things now we're talking about hitting the limit before we get to the 00:15:57.640 |
Um, so, so I think one that's, you know, one that's popular today, and I think, 00:16:04.320 |
I like most of the limits I would bet against it, but it's definitely 00:16:09.520 |
There's only so much data on the internet and there's issues with 00:16:13.800 |
You can get hundreds of trillions of words on the internet, but a lot of it 00:16:18.600 |
is, is repetitive or it's search engine, you know, search engine optimization 00:16:24.000 |
drivel, or maybe in the future it'll even be text generated by AIs itself. 00:16:28.000 |
Uh, and, and so I think there are limits to what, to, to, to what can be produced 00:16:33.760 |
in this way that said we, and I would guess other companies are working on 00:16:38.720 |
ways to make data synthetic, uh, where you can, you know, you can use the model 00:16:43.640 |
to generate more data of the type that you have that you have already, or 00:16:49.760 |
If you think about, uh, what was done with, uh, DeepMinds AlphaGo Zero, they 00:16:54.200 |
managed to get a bot all the way from, you know, no ability to play Go whatsoever 00:16:59.000 |
to above human level, just by playing against itself, there was no example 00:17:02.960 |
data from humans required in the, the AlphaGo Zero version of it. 00:17:07.120 |
The other direction of course, is these reasoning models that do chain of 00:17:10.680 |
thought and stop to think, um, and, and reflect on their own thinking in a way. 00:17:14.800 |
That's another kind of synthetic data coupled with reinforcement learning. 00:17:19.080 |
So my, my guess is with one of those methods, we'll get around the data 00:17:22.640 |
limitation, or there may be other sources of data that are, that are available. 00:17:26.440 |
Um, we could just observe that even if there's no problem with data, as 00:17:31.080 |
we start to scale models up, they just stop getting better. 00:17:33.840 |
It's, it seemed to be a reliable observation that they've gotten better. 00:17:38.080 |
That could just stop at some point for a reason we don't understand. 00:17:41.640 |
Um, the answer could be that we need to, uh, you know, we need 00:17:49.040 |
Um, it's been, there have been problems in the past with, with say, numerical 00:17:53.760 |
stability of models where it looked like things were, were leveling off, but, 00:17:57.640 |
but actually, you know, when we, when we, when we found the right unblocker, 00:18:02.680 |
So perhaps there's new, some new optimization method or some new, 00:18:09.720 |
I've seen no evidence of that so far, but if things were to, to slow down, 00:18:15.720 |
What about the limits of compute, meaning, uh, the expensive nature of 00:18:23.360 |
So right now, I think, uh, you know, most of the frontier model companies 00:18:28.040 |
I would guess are operating in, you know, roughly, you know, $1 billion 00:18:32.480 |
scale, plus or minus a factor of three, right? 00:18:34.640 |
Those are the models that exist now or are being trained now. 00:18:37.640 |
Uh, I think next year we're going to go to a few billion and then, uh, 2026, 00:18:43.120 |
we may go to, uh, uh, you know, above 10, 10, 10 billion, and probably by 00:18:47.280 |
2027, their ambitions to build a hundred, a hundred billion dollar, 00:18:53.240 |
And I think all of that actually will happen. 00:18:55.680 |
There's a lot of determination to build the compute, to 00:19:00.000 |
Uh, and I would guess that it actually does happen. 00:19:02.560 |
Now, if we get to a hundred billion, that's still not enough compute. 00:19:07.720 |
Then either we need even more scale or we need to develop some way of 00:19:12.640 |
doing it more efficiently of shifting the curve, um, I think between all of 00:19:16.520 |
these, one of the reasons I'm bullish about powerful AI happening so fast 00:19:20.760 |
is just that if you extrapolate the next few points on the curve, we're very 00:19:24.920 |
quickly getting towards human level ability, right? 00:19:28.080 |
Some of the new models that, that we developed, some, some reasoning models 00:19:32.040 |
that have come from other companies, they're starting to get to what I would 00:19:37.720 |
If you look at their, their coding ability, um, the latest model we 00:19:41.840 |
released, Sonnet 3.5, the new or updated version, it gets something 00:19:47.040 |
like 50% on SuiBench and SuiBench is an example of a bunch of professional 00:19:54.240 |
At the beginning of the year, I think the state of the art was three or 4%. 00:19:59.520 |
So in 10 months, we've gone from 3% to 50% on this task. 00:20:04.320 |
And I think in another year we'll probably be at 90%. 00:20:07.040 |
I mean, I don't know, but might, might even be, might even be less than that. 00:20:11.120 |
Uh, we've seen similar things in graduate level, math, physics, and 00:20:18.360 |
Uh, so, uh, if we, if we just continue to extrapolate this right, in terms of 00:20:23.800 |
skill, skill that we have, I think if we extrapolate the straight curve within a 00:20:28.760 |
few years, we will get to these models being, you know, above the, the highest 00:20:36.240 |
You've pointed to, and I've pointed to a lot of reasons why, you know, possible 00:20:40.360 |
reasons why that might not happen, but if the, if the extrapolation curve 00:20:48.840 |
It'd be interesting to get your sort of view of it all. 00:20:53.200 |
What does it take to win in the broad sense of win in the space? 00:20:58.600 |
So I want to separate out a couple of things, right? 00:21:01.000 |
So, you know, Anthropic's, Anthropic's mission is to kind of 00:21:06.760 |
And, and, you know, we have a theory of change called race to the top, right? 00:21:11.480 |
Race to the top is about trying to push the other players to do the 00:21:20.600 |
It's about setting things up so that all of us can be the good guy. 00:21:25.480 |
Early in the history of Anthropic, one of our co-founders, Chris Ola, who I 00:21:29.280 |
believe you're, you're interviewing soon, you know, he's the co-founder of the 00:21:32.800 |
field of mechanistic interpretability, which is an attempt to understand 00:21:38.520 |
So we had him and one of our early teams focus on this area of interpretability, 00:21:44.520 |
which we think is good for making models safe and transparent. 00:21:48.160 |
For three or four years, that had no commercial application whatsoever. 00:21:53.800 |
We're doing some early betas with it and probably it will eventually, but, you 00:21:58.200 |
know, this is a very, very long research bet and one in which we've, we've built 00:22:05.160 |
And, and we did this because, you know, we think it's a way to make models safer. 00:22:09.160 |
An interesting thing is that as we've done this, other companies 00:22:14.640 |
In some cases, because they've been inspired by it, in some cases, because 00:22:18.720 |
they're worried that, uh, you know, if, if other companies are doing this, that 00:22:23.640 |
look more responsible, they want to look more responsible too. 00:22:26.840 |
No one wants to look like the irresponsible actor. 00:22:29.240 |
And, and so they adopt this, they adopt this as well. 00:22:32.520 |
When folks come to Anthropic, interpretability is often a draw. 00:22:36.000 |
And I tell them the other places you didn't go, tell them why you came here. 00:22:40.080 |
Um, and, and then you see soon that there, that there's interpretability 00:22:47.200 |
And in a way that takes away our competitive advantage, because it's 00:22:50.280 |
like, Oh, now others are doing it as well, but it's good, it's 00:22:56.000 |
And so we have to invent some new thing that we're doing that 00:22:59.600 |
And the hope is to basically bid up, bid up the importance of, of, of 00:23:06.120 |
And it's not, it's not about us in particular, right? 00:23:08.320 |
It's not about having one particular good guy. 00:23:13.240 |
If they, if they, if they join the race to do this, that's, that's, you 00:23:17.960 |
Um, uh, it's, it's just, it's about kind of shaping the incentives to 00:23:21.800 |
point upward instead of shaping the incentives to point, to point downward. 00:23:25.680 |
And we should say this example of the field of, uh, mechanistic 00:23:28.280 |
interpretability is just a rigorous non hand wavy way of doing AI safety. 00:23:36.120 |
Trying to, I mean, I think we're still early, um, in terms of our 00:23:40.240 |
ability to see things, but I've been surprised at how much we've been 00:23:43.880 |
able to look inside these systems and understand what we see, right. 00:23:48.200 |
Unlike with the scaling laws, where it feels like there's some, you know, 00:23:51.880 |
law that's driving these models to perform better on, on the inside. 00:23:56.440 |
The models aren't, you know, there's no reason why they should be 00:24:01.600 |
They're designed to work just like the human brain or human biochemistry. 00:24:05.440 |
They're not designed for a human to open up the hatch, look 00:24:09.400 |
But we have found, and you know, you can talk in much more detail 00:24:12.960 |
about this to Chris, that when we open them up, when we do look inside 00:24:16.680 |
them, we, we find things that are surprisingly interesting. 00:24:19.920 |
And as a side effect, you also get to see the beauty of these models. 00:24:23.000 |
You get to explore the sort of, uh, the beautiful nature of large 00:24:26.920 |
neural networks through the McInterp kind of methodology. 00:24:34.560 |
I'm amazed at things like, uh, you know, that, that we can, you know, 00:24:40.080 |
use sparse autoencoders to find these directions within the networks. 00:24:44.160 |
Uh, and that the directions correspond to these very clear concepts. 00:24:49.120 |
We demonstrated this a bit with the Golden Gate Bridge claud. 00:24:52.040 |
So this was an experiment where we found a direction inside one of the 00:24:56.720 |
neural networks layers that corresponded to the Golden Gate Bridge. 00:25:04.400 |
It was kind of half a joke, uh, for a couple of days. 00:25:07.080 |
Uh, but it was, it was illustrative of, of the method we developed. 00:25:10.400 |
And, uh, you could, you could take the Golden Gate or you could take the model. 00:25:14.760 |
You could ask it about anything, you know, you know, it'd be like, how you 00:25:18.160 |
could say, how was your day and anything you asked, because this feature was 00:25:21.320 |
activated, it would connect to the Golden Gate Bridge. 00:25:23.200 |
So it would say, you know, I'm, I'm, I'm feeling relaxed and expansive, much 00:25:27.840 |
like the arches of the Golden Gate Bridge, or, you know, It would masterfully 00:25:31.760 |
change topic to the Golden Gate Bridge and integrate it. 00:25:34.760 |
There was also a sadness to it, to, to the focus it had on the Golden Gate Bridge. 00:25:40.320 |
I think so people already miss it because it was taken down, I think after a day. 00:25:45.440 |
Somehow these interventions on the model, um, where, where, where, where you kind 00:25:50.440 |
of adjust its behavior somehow emotionally made it seem more human than any other 00:25:55.720 |
version of the model, strong personality, strong, strong personality. 00:25:59.520 |
It has these kind of like obsessive interests. 00:26:02.400 |
You know, we can all think of someone who's like obsessed with something. 00:26:05.200 |
So it does make it feel somehow a bit more human. 00:26:13.240 |
In March, Claude III, Opus, Sonnet, Haiku were released. 00:26:17.760 |
Then Claude III, V, Sonnet in July with an updated version just now released. 00:26:24.120 |
And then also Claude III, V, Haiku was released. 00:26:27.560 |
Can you explain the difference between Opus, Sonnet and Haiku and how we should 00:26:34.800 |
So let's go back to March when we first released these three models. 00:26:38.800 |
So, you know, our thinking was, you know, different companies produce 00:26:43.120 |
kind of large and small models, better and worse models. 00:26:46.560 |
We felt that there was demand both for a really powerful model. 00:26:52.200 |
Um, you know, when you, that might be a little bit slower that 00:26:56.040 |
And also for fast, cheap models that are as smart as they can 00:27:02.880 |
Whenever you want to do some kind of like, you know, difficult analysis. 00:27:07.080 |
Like if I, you know, I want to write code for instance, or, you know, I 00:27:10.120 |
want to, I want to brainstorm ideas, or I want to do creative writing. 00:27:15.240 |
But then there's a lot of practical applications in a business sense where 00:27:21.280 |
I, you know, like I'm like doing my taxes or I'm, you know, talking to, uh, you 00:27:26.640 |
know, to like a legal advisor and I want to analyze a contract or, you know, we 00:27:30.840 |
have plenty of companies that are just like, you know, I, you know, I want to 00:27:33.680 |
do autocomplete on my, on my IDE or something. 00:27:37.000 |
Uh, and, and for all of those things, you want to act fast and you want 00:27:42.200 |
So we wanted to serve that whole spectrum of needs. 00:27:46.040 |
Um, so we ended up with this, uh, you know, this kind of poetry theme. 00:27:52.080 |
And so haiku is the small, fast, cheap model that is, you know, was at the 00:27:57.160 |
time was really surprisingly, surprisingly, uh, intelligent for how 00:28:01.000 |
fast and cheap it was, uh, sonnet is a, is a medium sized poem, right. 00:28:08.000 |
It is smarter, but also a little bit slower, a little bit more expensive. 00:28:12.120 |
And, and Opus like a magnum Opus is a large work. 00:28:15.320 |
Uh, Opus was the, the largest smartest model at the time. 00:28:19.400 |
Um, so that, that was the original kind of thinking behind it. 00:28:22.960 |
Um, and our, our thinking then was, well, each new generation of models 00:28:30.760 |
Uh, so when we released Sonnet 3.5, it has the same, roughly the same, you know, 00:28:41.120 |
Uh, but, uh, it, it increased its intelligence to the point where it was 00:28:47.240 |
smarter than the original Opus 3 model, uh, especially for code, but, 00:28:53.240 |
And so now, you know, we've shown results for a haiku 3.5 and I believe 00:28:59.360 |
haiku 3.5, the smallest new model is about as good as Opus 3, the largest old model. 00:29:06.760 |
So basically the aim here is to shift the curve. 00:29:09.600 |
And then at some point there's going to be an Opus 3.5. 00:29:11.960 |
Um, now every new generation of models has its own thing. 00:29:16.240 |
They use new data, their personality changes in ways that we kind of, you 00:29:20.840 |
know, try to steer, but are not fully able to steer. 00:29:24.240 |
And, and so, uh, there's never quite that exact equivalence where the only 00:29:29.840 |
Um, we always try and improve other things and some things change without 00:29:35.440 |
So it's, it's very much an, an exact science in many ways, the manner and 00:29:40.640 |
personality of these models is more an art than it is a science. 00:29:43.840 |
So what is sort of the reason for, uh, the span of time between say, 00:29:55.800 |
What is it, what takes that time if you can speak to it? 00:29:58.640 |
So there's, there's different, there's different, uh, processes. 00:30:01.480 |
Um, uh, there's pre-training, which is, you know, just kind of the normal 00:30:05.040 |
language model training, and that takes a very long time, um, that uses, you 00:30:09.240 |
know, these days, you know, tens, you know, tens of thousands, sometimes many 00:30:14.400 |
tens of thousands of, uh, GPUs or TPUs or Tranium or, you know, we use different 00:30:20.400 |
platforms, but, you know, accelerator chips, um, often, often training for 00:30:25.000 |
months, uh, there's then a kind of post-training phase where we do 00:30:30.000 |
reinforcement learning from human feedback, as well as other kinds of 00:30:34.120 |
reinforcement learning that, that phase is getting, uh, larger and larger now. 00:30:39.280 |
And, you know, you know, often that's less of an exact science. 00:30:44.600 |
Um, models are then tested with some of our early partners to see how good they 00:30:50.160 |
are, and they're then tested both internally and externally for their 00:30:54.760 |
safety, particularly for catastrophic and autonomy risks. 00:30:58.400 |
Uh, so, uh, we do internal testing according to our responsible scaling 00:31:03.280 |
policy, which I, you know, could talk more about that in detail. 00:31:06.200 |
And then we have an agreement with the US and the UK AI Safety Institute, as 00:31:11.160 |
well as other third-party testers in specific domains to test the models for 00:31:15.920 |
what are called CBRN risks, chemical, biological, radiological, and nuclear, 00:31:20.720 |
which are, you know, we don't think that models pose these risks seriously yet, 00:31:25.800 |
but, but every new model we want to evaluate to see if we're starting to get 00:31:29.040 |
close to some of these, these, these more dangerous, um, uh, these 00:31:37.120 |
And then, uh, you know, then, then it just takes some time to get the model 00:31:41.000 |
working in terms of inference and launching it in the API. 00:31:44.360 |
So there's just a lot of steps to, uh, to actually, to 00:31:49.320 |
And of course, you know, we're always trying to make the processes 00:31:54.960 |
We want our safety testing to be rigorous, but we want it to be 00:31:57.680 |
rigorous and to be, you know, to be automatic to happen as fast as it 00:32:04.280 |
Same with our pre-training process and our post-training process. 00:32:07.720 |
So, you know, it's just like building anything else. 00:32:11.200 |
You want to make them, you know, you want to make them safe, but you 00:32:15.640 |
And I think the creative tension between those is, is, you know, is an 00:32:21.920 |
I forget who was saying that, uh, Anthropic has really good tooling. 00:32:24.920 |
So I, uh, probably a lot of the challenge here is on the software 00:32:29.800 |
engineering side is to build the tooling, to, to have a, like a efficient, low 00:32:34.600 |
friction interaction with the infrastructure. 00:32:36.320 |
You would be surprised how much of the challenges of, uh, you know, building 00:32:41.640 |
these models comes down to, you know, software engineering, performance 00:32:46.880 |
engineering, you know, you, you, you know, from the outside, you might think, 00:32:50.560 |
oh man, we had this Eureka breakthrough, right? 00:32:52.880 |
You know, this movie with the science, we discovered it, we figured it out. 00:32:55.760 |
But, but, but I think, I think all things, even, even, even, you know, 00:33:00.600 |
incredible discoveries, like they, they, they, they, they almost always come 00:33:05.080 |
down to the details, um, and, and often super, super boring details. 00:33:09.080 |
I can't speak to whether we have better tooling than, than other companies. 00:33:12.040 |
I mean, you know, haven't been at those other companies, at 00:33:15.440 |
Um, but it's certainly something we give a lot of attention to. 00:33:18.000 |
I don't know if you can say, but from three, from cloud three to cloud three, 00:33:23.200 |
five, is there any extra pre-training going on as they mostly 00:33:29.640 |
Yeah, I think, I think at any given stage, we're focused on 00:33:34.640 |
Um, just, just naturally, like there are different teams. 00:33:37.720 |
Each team makes progress in a particular area in, in, in making a particular, 00:33:42.920 |
you know, their particular segment of the relay race better. 00:33:45.800 |
And it's just natural that when we make a new model, we put, we put 00:33:50.000 |
So the data you have, like the preference data you get from RLHF, is that 00:33:55.240 |
applicable, is there ways to apply it to newer models as it gets trained up? 00:34:00.920 |
Preference data from old models sometimes gets used for new models. 00:34:04.160 |
Although of course, uh, it, it performs somewhat better when it's, you know, 00:34:09.480 |
Note that we have this, you know, constitutional AI method such that 00:34:14.320 |
We kind of, there's also a post-training process where we 00:34:18.600 |
And there's, you know, new types of post-training the model against 00:34:23.200 |
So it's not just RLHF, it's a bunch of other methods as well. 00:34:26.680 |
Um, post-training, I think, you know, is becoming more and more sophisticated. 00:34:30.440 |
Well, what explains the big leap in performance for the new Sonnet 3.5? 00:34:34.840 |
I mean, at least in the programming side, and maybe this is a good 00:34:40.520 |
Just the number went up, but you know, I, I, I program, but I also love 00:34:46.000 |
programming and I, um, CLAW 3.5 through cursor is what I use, uh, to assist me 00:34:52.400 |
in programming and there was, at least experientially, anecdotally, it's 00:35:00.280 |
So what, like, what, what does it take to get it, uh, to get it smarter? 00:35:03.520 |
We observed that as well, by the way, there were a couple of very strong 00:35:07.400 |
engineers here at Anthropic, um, who all previous code models, both produced 00:35:12.200 |
by us and produced by all the other companies, hadn't really been useful 00:35:17.560 |
You know, they said, you know, maybe, maybe this is useful to beginner. 00:35:20.040 |
It's not useful to me, but Sonnet 3.5, the original one for the first time, 00:35:25.080 |
they said, oh my God, this helped me with something that, you know, that 00:35:29.160 |
This is the first model has actually saved me time. 00:35:33.400 |
And, and then I think, you know, the new Sonnet has been, has been even better 00:35:38.640 |
I mean, I'll just say it's been across the board. 00:35:41.160 |
It's in the pre-training, it's in the post-training, it's in 00:35:48.600 |
And if we go into the details of the benchmark, so SWE Bench is basically, 00:35:53.680 |
you know, since, since, you know, since, since you're a programmer, you know, 00:35:56.960 |
you'll be familiar with like pull requests and, you know, just, just pull 00:36:01.560 |
requests or like, you know, the, like a sort of, a sort of atomic unit of work. 00:36:06.520 |
You know, you could say, I'm, you know, I'm implementing one, 00:36:10.400 |
And, and so SWE Bench actually gives you kind of a real world situation where the 00:36:18.160 |
And I'm trying to implement something that's, you know, that's 00:36:22.800 |
We have internal benchmarks where we, where we measure the same thing. 00:36:25.800 |
And you say, just give the model free reign to like, you know, do anything, 00:36:32.440 |
How, how well is it able to complete these tasks? 00:36:36.040 |
And it's that benchmark that's gone from, it can do it 3% of the time to, 00:36:42.320 |
So I actually do believe that if we get, you can gain benchmarks, but I think if 00:36:46.840 |
we get to a hundred percent on that benchmark and in a way that isn't kind 00:36:50.120 |
of like over-trained or, or, or game for that particular benchmark probably 00:36:54.640 |
represents a real and serious increase in kind of, in kind of programming, 00:36:59.320 |
programming ability, and, and I would suspect that if we can get to, you know, 00:37:03.720 |
90, 90, 95%, that, that, that, you know, it will, it will represent ability 00:37:09.160 |
to autonomously do a significant fraction of software engineering tasks. 00:37:19.320 |
Uh, not giving you an exact date, uh, but you know, there, there, uh, you 00:37:24.080 |
know, as far as we know, the plan is still to have a CLAWD 3.5 Opus. 00:37:31.760 |
So what was that game that there was some game that was delayed 15 years. 00:37:36.320 |
And I think GTA is now just releasing trailers. 00:37:39.000 |
It, you know, it's only been three months since we released the first SONNET. 00:37:44.640 |
It just, it just tells you about the pace, the expectations for 00:37:51.720 |
So how do you think about sort of, as these models get bigger 00:37:56.520 |
And also just versioning in general, why SONNET 3.5 updated with the date? 00:38:02.840 |
Why not SONNET 3.6, which a lot of people are calling it? 00:38:06.640 |
Naming is actually an interesting challenge here, right? 00:38:09.120 |
Because I think a year ago, most of the model was pre-training. 00:38:12.600 |
And so you could start from the beginning and just say, okay, we're 00:38:18.280 |
And, you know, we'll have a family of naming schemes and then we'll 00:38:23.200 |
And then, you know, we'll have the next, the next generation. 00:38:26.080 |
Um, the trouble starts already when some of them take a lot 00:38:30.320 |
That already messes up your time, time a little bit, but as you make big 00:38:35.000 |
improvements in, as you make big improvements in pre-training, uh, then 00:38:39.040 |
you suddenly notice, oh, I can make better pre-trained model and that 00:38:44.680 |
And, but, you know, clearly it has the same, you know, size 00:38:48.840 |
Uh, uh, so I think those two together, as well as the timing timing issues, 00:38:53.520 |
any kind of scheme you come up with, uh, you know, the reality tends to 00:39:00.960 |
It tends to kind of break out of the breakout of the scheme. 00:39:04.200 |
It's not like software where you can say, oh, this is like, you 00:39:09.080 |
No, you have models with different, different trade-offs. 00:39:20.960 |
And so I think all the companies have struggled with this. 00:39:23.800 |
Um, I think we did very, you know, I think, think we were in a good, good 00:39:28.280 |
position in terms of naming when we had Haiku, Sonnet and Opus. 00:39:32.040 |
We're trying to maintain it, but it's not, it's not, it's not perfect. 00:39:35.880 |
Um, so we'll, we'll, we'll try and get back to the simplicity, but it, it, 00:39:39.600 |
um, uh, just the, the, the nature of the field, I feel like no one's figured out 00:39:44.480 |
naming, it's somehow a different paradigm from like normal software. 00:39:48.240 |
And, and, and so we, we just, none of the companies have been perfect at it. 00:39:52.880 |
Um, it's something we struggle with surprisingly much relative to, you know, 00:39:56.960 |
how, relative to how trivial it is, you know, for the, the, the, the grand 00:40:02.840 |
So from the user side, the user experience of the updated Sonnet 3.5 00:40:08.400 |
is just different than the previous, uh, June, 2024 Sonnet 3.5. 00:40:13.560 |
It would be nice to come up with some kind of labeling that embodies that 00:40:17.520 |
because people talk about Sonnet 3.5, but now there's a different one. 00:40:22.160 |
And so how do you refer to the previous one and the new one? 00:40:24.800 |
And it, it, uh, when there's a distinct improvement, it just makes 00:40:34.800 |
Yeah, I, I definitely think this question of, there are lots of 00:40:38.640 |
properties of the models that are not reflected in the benchmarks. 00:40:42.320 |
Um, I, I think, I think that's, that's definitely the case and everyone 00:40:48.320 |
Some of them are, you know, models can be polite or brusque. 00:40:53.840 |
They can be, uh, you know, uh, very reactive or they can ask you questions. 00:41:00.520 |
Um, they can have what, what feels like a warm personality or a cold 00:41:04.480 |
personality, they can be boring or they can be very distinctive 00:41:09.320 |
Um, and we have a whole, you know, we have a whole team kind of focused on, 00:41:15.480 |
Uh, Amanda leads that team and we'll, we'll talk to you about that, but 00:41:21.520 |
Um, and, and often we find that models have properties that we're not aware of. 00:41:26.680 |
The, the fact of the matter is that you can, you know, talk to a model 00:41:30.680 |
10,000 times, and there are some behaviors you might not see. 00:41:34.360 |
Uh, just like, just like with a human, right. 00:41:36.640 |
I can know someone for a few months and, you know, not know that they have a 00:41:39.680 |
certain skill or not know that there's a certain side to them. 00:41:42.720 |
And so I think, I think we just have to get used to this idea and we're always 00:41:46.120 |
looking for better ways of testing our models to, to demonstrate these 00:41:50.520 |
capabilities and, and, and also to decide which are, which are the, which 00:41:54.080 |
are the personality properties we want models to have in which we don't want 00:41:58.200 |
to have that itself, the normative question is also super interesting. 00:42:02.240 |
I got to ask you a question from Reddit, from Reddit. 00:42:05.760 |
You know, there, there's just as fascinating to me, at least it's a 00:42:09.640 |
psychological social phenomenon where people report that Claude has 00:42:17.040 |
And so, uh, the question is, does the user complaint about the 00:42:21.000 |
dumbing down of Claude three, five sonnet hold any water? 00:42:23.920 |
So are these anecdotal reports a kind of social phenomena or did Claude, is 00:42:31.320 |
there any cases where Claude would get dumber? 00:42:37.480 |
I believe this, I believe I've seen these complaints for every foundation 00:42:44.480 |
Um, people said this about GPT-4, they said it about GPT-4 turbo. 00:42:51.600 |
Um, one, the actual weights of the model, right? 00:42:54.600 |
The actual brain of the model that does not change unless 00:43:00.040 |
Um, there, there are just a number of reasons why it would not make 00:43:03.360 |
sense practically to be randomly substituting in, substituting 00:43:09.000 |
It's difficult from an inference perspective, and it's actually 00:43:12.000 |
hard to control all the consequences of changing the weights of the model. 00:43:16.000 |
Let's say you wanted to fine tune the model to be like, I don't know, 00:43:19.360 |
to like, to say certainly less, which, you know, an old version 00:43:22.760 |
of Sonnet used to do, um, you actually end up changing a hundred things as well. 00:43:27.840 |
And we have a whole process for modifying the model. 00:43:31.960 |
We do a bunch of, um, like we do a bunch of user testing and early customers. 00:43:36.080 |
So it, it, we both have never changed the weights of the model 00:43:41.560 |
And it, it, it wouldn't certainly in the current setup, it 00:43:46.480 |
Now, there are a couple of things that we do occasionally do. 00:43:52.720 |
Um, but those are typically very close to when a model is being, is being, uh, 00:43:57.440 |
released and for a very small fraction of time. 00:44:00.280 |
Um, so, uh, you know, like the, you know, the, the day before the new Sonnet 3.5. 00:44:09.400 |
Um, there were some comments from people that like, it's got, it's got, it's 00:44:13.040 |
gotten a lot better and that's because, you know, a fraction we're exposed 00:44:15.920 |
to, to an A/B test for, for those one or for those one or two days. 00:44:20.160 |
Um, the other is that occasionally the system prompt will change, um, on the 00:44:24.400 |
system prompt can have some effects, although it's on, it's unlikely to dumb 00:44:29.040 |
down models, it's unlikely to make them dumber. 00:44:31.320 |
Um, and, and, and, and we've seen that while these two things, which I'm 00:44:35.560 |
listing to be very complete, um, Happened relatively, happened quite infrequently. 00:44:41.280 |
Um, the complaints about, for us and for other model companies about the model 00:44:52.440 |
And so I don't want to say like people are imagining it or anything, but like the 00:44:59.600 |
Um, if I were to offer a theory, um, I think it actually relates to one of the 00:45:07.480 |
Models have many are very complex and have many aspects to them. 00:45:12.120 |
And so often, you know, if I, if I, if I, if I ask the model a question, you know, 00:45:16.640 |
if I'm like, if I'm like do task X versus can you do task X, the model 00:45:23.840 |
Uh, and, and so there are all kinds of subtle things that you can change about 00:45:28.680 |
the way you interact with the model that can give you very different results. 00:45:32.280 |
Um, to be clear, this, this itself is like a failing by, by us and by the 00:45:37.240 |
other model providers that, that the models are, are just, just often sensitive 00:45:43.400 |
It's yet another way in which the science of how these models 00:45:48.280 |
Uh, and, and so, you know, if I go to sleep one night and I was like talking to 00:45:51.760 |
the model in a certain way and I like slightly change the phrasing of how I 00:45:55.400 |
talk to the model, you know, I could, I could get different results. 00:46:00.680 |
The other thing is, man, it's just hard to quantify this stuff. 00:46:05.440 |
I think people are very excited by new models when they come out. 00:46:08.400 |
And then as time goes on, they, they become very aware of the, they become 00:46:13.760 |
So that may be another effect, but that's, that's all a very long 00:46:16.600 |
rendered way of saying for the most part, with some fairly narrow 00:46:25.880 |
The baseline raises, like when people have first gotten wifi on 00:46:32.400 |
And then, and then you start getting this thing to work. 00:46:37.480 |
So then it's easy to have the conspiracy theory of they're 00:46:41.280 |
This is probably something I'll talk to Amanda much more about, 00:46:46.880 |
Uh, when will Claude stop trying to be my, uh, puritanical 00:46:51.240 |
grandmother, imposing it's moral worldview on me as a paying customer. 00:46:55.400 |
And also what is the psychology behind making Claude overly apologetic? 00:46:59.080 |
So this kind of reports about the experience, a different 00:47:06.360 |
So a couple of points on this first one is, um, like things that 00:47:11.040 |
people say on Reddit and Twitter or X or whatever it is, um, there's 00:47:15.120 |
actually a huge distribution shift between like the stuff that people 00:47:20.480 |
And what actually kind of like, you know, Statistically users care about. 00:47:26.560 |
Like people are frustrated with, you know, things like, you know, the 00:47:30.000 |
model, not writing out all the code or the model, uh, you know, just, 00:47:34.240 |
just not being as good at code as it could be, even though it's the 00:47:38.880 |
Um, I, I think the majority of things, of things are about that. 00:47:41.880 |
Um, uh, but, uh, certainly a, a, a kind of vocal minority are, uh, you know, 00:47:48.080 |
kind of, kind of, kind of raised these concerns, right. 00:47:50.320 |
Are frustrated by the model, refusing things that it shouldn't refuse 00:47:53.960 |
or like apologizing too much, or just, just having these kind 00:47:58.800 |
Um, the second caveat, and I just want to say this like super clearly, 00:48:02.480 |
because I think it's like, some people don't know it, others like kind 00:48:08.320 |
Like it is very difficult to control across the board, how the models behave. 00:48:13.200 |
You cannot just reach in there and say, oh, I want the model to like, apologize 00:48:19.360 |
You can include trading data that says like, oh, the models should like apologize 00:48:23.240 |
less, but then in some other situation, they end up being like super rude or 00:48:27.640 |
like overconfident in a way that's like misleading people. 00:48:32.320 |
Um, uh, for example, another thing is if there was a period during 00:48:36.880 |
which models, ours, and I think others as well, were too verbose, right? 00:48:44.120 |
Um, you can cut down on the verbosity by penalizing the models 00:48:50.160 |
What happens when you do that, if you do it in a crude way is when the models 00:48:54.520 |
are coding, sometimes they'll say, rest of the code goes here, right? 00:48:58.400 |
Because they've learned that that's the way to economize and that they see it. 00:49:01.360 |
And then, and then, so that leads the model to be so-called lazy in coding, 00:49:05.160 |
where they, where they, where they're just like, ah, you can finish the rest of it. 00:49:08.160 |
It's not, it's not because we want to, you know, save on compute or 00:49:14.000 |
And, you know, during winter break or any of the other kind of conspiracy 00:49:17.760 |
theories that have, that have, that have come up, it's actually, it's just 00:49:21.040 |
very hard to control the behavior of the model, to steer the behavior of the 00:49:27.760 |
You can kind of, there's this, this whack-a-mole aspect where you push on 00:49:31.680 |
one thing and like, you know, these, these, these, you know, these other 00:49:37.040 |
things start to move as well that you may not even notice or measure. 00:49:40.160 |
And so one of the reasons that I, that I care so much about, you know, 00:49:45.920 |
kind of grand alignment of these AI systems in the future is actually, 00:49:49.520 |
these systems are actually quite unpredictable. 00:49:51.880 |
They're actually quite hard to steer and control. 00:49:54.400 |
And this version we're seeing today of you make one thing better, 00:50:01.680 |
Uh, I think that's, that's like a present day analog of future control 00:50:08.760 |
problems in AI systems that we can start to study today, right? 00:50:12.200 |
I think, I think that, that, that difficulty in, in steering the behavior 00:50:18.720 |
and in making sure that if we push an AI system in one direction, it doesn't 00:50:23.040 |
push it in another direction in some, in some other ways that we didn't want. 00:50:26.720 |
Uh, I think that's, that's kind of an, that's kind of 00:50:32.120 |
And if we can do a good job of solving this problem, right. 00:50:35.040 |
Of like you asked the model to like, you know, to like make and distribute 00:50:39.440 |
smallpox and it says no, but it's willing to like help you in your 00:50:44.720 |
Like, how do we get both of those things at once? 00:50:48.200 |
It's very easy to go to one side or the other, and it's 00:50:52.600 |
And so, uh, I, you know, I think these questions of like 00:51:02.440 |
I think we've actually done the best of all the AI companies, 00:51:08.000 |
Uh, and I think if we can get this right, if we can control the, the, you 00:51:13.040 |
know, control the false positives and false negatives in this, this very 00:51:17.640 |
kind of controlled present day environment, we'll be much better 00:51:21.560 |
at doing it for the future when our worry is, you know, will the 00:51:26.000 |
Will they be able to, you know, make very dangerous things? 00:51:29.320 |
Will they be able to autonomously, you know, build whole companies 00:51:33.480 |
So, so I, I think of this, this present task as both vexing, but 00:51:39.520 |
What's the current best way of gathering sort of user feedback, like, uh, not 00:51:45.960 |
anecdotal data, but just large scale data about pain points or the opposite 00:51:58.640 |
So, so typically, um, we'll have internal model bashings 00:52:04.760 |
Um, you know, people just, just try and break the model. 00:52:09.440 |
Um, uh, we have a suite of evals, uh, for, you know, oh, is the model 00:52:14.480 |
refusing in ways that it, that it couldn't, I think we even had a certainly 00:52:18.120 |
eval because you know, our, our model, again, one point model had this problem 00:52:23.080 |
where like it had this annoying tick where it would like respond to a wide 00:52:26.480 |
range of questions by saying, certainly I can help you with that. 00:52:33.280 |
Um, uh, and so we had a like certainly eval, which is like, how, how 00:52:38.800 |
Uh, uh, but, but look, this is just a whack-a-mole like, like what if it 00:52:42.680 |
switches from certainly to definitely like, uh, uh, so you know, every time 00:52:47.880 |
we add a new eval and we're, we're always evaluating for all the old things. 00:52:50.920 |
So we have hundreds of these evaluations, but we find that there's no substitute 00:52:56.240 |
And so it's very much like the ordinary product development process. 00:52:59.480 |
We have like hundreds of people within Anthropic bash the model. 00:53:02.920 |
Then we do, uh, you know, then we do externally be tests. 00:53:09.480 |
We pay contractors to interact with the model. 00:53:11.920 |
Um, so you put all of these things together and it's still not perfect. 00:53:16.640 |
You still see behaviors that you don't quite want to see, right. 00:53:19.080 |
You know, you see, you still see the model, like refusing things that it 00:53:24.120 |
Um, but I, I, I think trying to trying to solve this challenge, right. 00:53:31.040 |
You know, genuinely bad things that, you know, know what everyone 00:53:35.640 |
You know, everyone, everyone, you know, everyone agrees that, you know, the 00:53:38.640 |
model shouldn't talk about, you know, I, I don't know, child abuse material. 00:53:42.680 |
Like everyone agrees the model shouldn't do that. 00:53:44.360 |
Uh, but, but at the same time that it doesn't refuse in these dumb and stupid 00:53:48.240 |
ways, uh, I think, I think draw drawing that line as finely as possible. 00:53:53.760 |
Approaching perfectly is still, is still a challenge and we're 00:53:59.520 |
And again, I would point to that as, as an indicator of a challenge ahead in terms 00:54:14.040 |
Cause if I say, if I say here, we're going to have Claude for next year. 00:54:18.480 |
And then, and then, you know, then we decide that like, you know, we 00:54:20.920 |
should start over cause there's a new type of model. 00:54:22.800 |
Like I, I, I, I don't want to, I don't want to commit to it. 00:54:25.520 |
I would expect in a normal course of business that Claude four 00:54:30.480 |
But, but you know, you know, you never know in this wacky field. 00:54:34.080 |
But, uh, sort of this idea of scaling is continuing. 00:54:39.760 |
There, there will definitely be more powerful models coming from us 00:54:45.880 |
Or if there, if there aren't, we've, we've deeply failed as a company. 00:54:49.320 |
Can you explain the responsible scaling policy and the AI safety 00:54:55.000 |
As much as I'm excited about the benefits of these models. 00:55:00.160 |
If we talk about machines of loving grace, um, I'm, I'm worried about the risks and 00:55:06.440 |
Uh, no one should think that, you know, machines of loving grace was me, me 00:55:10.280 |
saying, uh, you know, I'm no longer worried about the risks of these models. 00:55:15.760 |
The, the, uh, power of the models and their ability to solve all these 00:55:21.200 |
problems in, you know, biology, neuroscience, economic development, 00:55:25.920 |
government, governance, and peace, large parts of the economy, those, 00:55:31.680 |
With great power comes great responsibility, right? 00:55:34.040 |
That's the, the two are, the two are paired, uh, things that are powerful 00:55:37.840 |
can do good things and they can do bad things. 00:55:39.800 |
Um, I think of those risks as, as being in, you know, several 00:55:44.920 |
Perhaps the two biggest risks that I think about, and that's not to say that 00:55:49.000 |
there aren't risks today that are, that are important, but when I think of the 00:55:52.160 |
really, the, the, you know, the things that would happen on the grandest scale, 00:55:59.080 |
These are misuse of the models in domains like cyber, 00:56:07.160 |
Things that could, you know, that could harm or even kill thousands, even 00:56:13.080 |
millions of people, if they really, really go wrong, um, like these are 00:56:19.840 |
And, and here I would just make a simple observation, which is that. 00:56:23.680 |
My, the models, you know, if, if I look today at people who have done 00:56:28.760 |
really bad things in the world, um, uh, I think actually humanity has been 00:56:33.520 |
protected by the fact that the overlap between really smart, well-educated 00:56:38.320 |
people and people who want to do really horrific things has generally been small. 00:56:42.880 |
Like, you know, let's say, let's say I'm someone who, you know, uh, you know, 00:56:47.320 |
I have a PhD in this field, I have a well-paying job. 00:56:52.360 |
Why do I want to like, you know, even, even assuming I'm completely evil, 00:56:56.040 |
which, which most people are not, um, why, why, you know, why would such a 00:56:59.560 |
person risk their risk, their, you know, risk, their life risk, risk, their, 00:57:03.880 |
their legacy, their reputation to, to do something like, you know, truly, truly 00:57:08.400 |
evil, if we had a lot more people like that, the world would be 00:57:13.080 |
And so my, my, my worry is that by being a, a much more intelligent 00:57:20.800 |
And so I, I, I, I do have serious worries about that. 00:57:25.880 |
Uh, but you know, I, I think as a counterpoint to machines of loving grace, 00:57:29.800 |
I want to say that this is the, I, there's still serious risks and, and the 00:57:33.920 |
second range of risks would be the autonomy risks, which is the idea that 00:57:37.840 |
models might on their own, particularly as we give them more agency than they've 00:57:42.280 |
had in the past, uh, particularly as we give them supervision over wider tasks 00:57:48.080 |
like, you know, writing whole code bases or someday even, you know, effectively 00:57:57.960 |
Are they, are they doing what we really want them to do? 00:58:00.480 |
It's very difficult to even understand in detail what they're doing, 00:58:06.640 |
And like I said, this, these early signs that it's, it's hard to perfectly 00:58:11.920 |
draw the boundary between things the model should do and things the model 00:58:14.960 |
shouldn't do that, that, you know, if you go to one side, you get things that 00:58:19.480 |
are annoying and useless and you go to the other side, you get other behaviors. 00:58:22.520 |
If you fix one thing, it creates other problems. 00:58:25.160 |
We're getting better and better at solving this. 00:58:29.440 |
I think this is a, you know, this is a science like, like the safety of 00:58:32.680 |
airplanes or the safety of cars or the safety of drugs, I, you know, I, I don't 00:58:38.640 |
I just think we need to get better at controlling these models. 00:58:41.720 |
And so these are, these are the two risks I'm worried about and our 00:58:44.680 |
responsible scaling plan, which all recognizes a very long-winded answer 00:58:49.200 |
to your question, our responsible scaling plan is designed to 00:58:55.960 |
And so every time we develop a new model, we basically test it for its 00:59:05.760 |
So if I were to back up a little bit I think we have, I think we have an 00:59:10.600 |
interesting dilemma with AI systems where they're not yet powerful enough 00:59:17.960 |
I don't know that, I don't know that they'll ever present, 00:59:21.400 |
It's possible they won't, but the, the case for worry, the case for risk is 00:59:26.080 |
strong enough that we should, we should act now and, and they're, they're 00:59:32.200 |
I, you know, I testified in the Senate that, you know, we might have serious 00:59:37.520 |
That was about a year ago, things have preceded, preceded a pace. 00:59:41.640 |
Uh, uh, so we have this thing where it's like, it's, it's, it's surprisingly 00:59:46.400 |
hard to, to address these risks because they're not here today. 00:59:51.360 |
They're like ghosts, but they're coming at us so fast because the 00:59:55.640 |
So how do you deal with something that's not here today, doesn't exist, 01:00:03.400 |
Uh, so the solution we came up with for that in, in collaboration with, uh, 01:00:08.480 |
you know, people like, uh, the organization meter and Paul Cristiano 01:00:12.520 |
is okay, what, what, what you need for that are you need tests to tell 01:00:21.080 |
And, and so every time we have a new model, we test it for its capability 01:00:26.800 |
to do these CBRN tasks, as well as testing it for, you know, how capable 01:00:32.600 |
it is of doing tasks autonomously on its own and, uh, in the latest 01:00:37.160 |
version of our RSP, which we released in the last, in the last month 01:00:40.800 |
or two, uh, the way we test autonomy risks is the model, the AI model's 01:00:46.320 |
ability to do aspects of AI research itself, uh, which when the model, 01:00:51.760 |
when the AI models can do AI research, they become kind of truly, truly 01:00:55.240 |
autonomous, uh, and that, you know, that threshold is important 01:00:59.640 |
And, and so what do we then do with these tasks? 01:01:02.800 |
The RSP basically develops what we've called an if then structure, 01:01:07.920 |
which is if the models pass a certain capability, then we impose a certain 01:01:14.040 |
set of safety and security requirements on them. 01:01:16.440 |
So today's models are what's called ASL two models that were ASL one 01:01:22.160 |
is for systems that manifestly don't pose any risk of autonomy or misuse. 01:01:28.360 |
So for example, a chess playing bot, deep blue would be ASL one. 01:01:32.720 |
It's just manifestly the case that you can't use deep blue 01:01:39.480 |
No, one's going to use it to like, you know, to conduct a masterful 01:01:43.320 |
cyber attack or to, you know, run wild and take over the world. 01:01:46.880 |
ASL two is today's AI systems where we've measured them. 01:01:51.520 |
And we think these systems are simply not smart enough to, uh, to, you 01:01:56.640 |
know, autonomously self-replicate or conduct a bunch of tasks, uh, and 01:02:04.280 |
Meaningful information about CBRN risks and how to build CBRN weapons above and 01:02:11.440 |
beyond what can be known from looking at Google. 01:02:14.320 |
Uh, in fact, sometimes they do provide information, but, but not above 01:02:18.840 |
and beyond a search engine, but not in a way that can be stitched together. 01:02:21.880 |
Um, not, not in a way that kind of end to end is dangerous enough. 01:02:26.200 |
So ASL three is going to be the point at which, uh, the models are 01:02:32.120 |
helpful enough to enhance the capabilities of non-state actors, right? 01:02:37.400 |
State actors can already do a lot, a lot of, unfortunately, to a high 01:02:41.640 |
level of proficiency, a lot of these very dangerous and destructive things. 01:02:45.480 |
The difference is that non-state, non-state actors are not capable of it. 01:02:49.800 |
And so when we get to ASL three, we'll take special security precautions 01:02:55.160 |
designed to be sufficient to prevent theft of the model by non-state 01:02:58.920 |
actors and misuse of the model as it's deployed, uh, will have to have 01:03:03.640 |
enhanced filters targeted at these particular areas, cyber, bio, nuclear, 01:03:08.840 |
cyber, bio, nuclear, and model autonomy, which is less a misuse risk and more 01:03:16.560 |
ASL four, getting to the point where these models could, could enhance the 01:03:22.880 |
capability of a, of a, of a already knowledgeable state actor and, or 01:03:28.240 |
become the, you know, the main source of such a risk, like if you wanted to 01:03:32.880 |
engage in such a risk, the main way you would do it is through a model. 01:03:35.920 |
And then I think ASL four on the autonomy side, it's, it's some, some, 01:03:40.040 |
some amount of acceleration in AI research capabilities with an, with an AI 01:03:44.600 |
model, and then ASL five is where we would get to the models that are, you 01:03:47.840 |
know, that are, that are kind of, that are kind of, you know, truly capable 01:03:50.800 |
that it could exceed humanity in their ability to do, to do any of these tasks. 01:03:54.920 |
And so the, the, the point of the, if then structure commitment is, is 01:04:04.160 |
I've been, I've been working with these models for many years and I've been 01:04:11.080 |
It's actually kind of dangerous to say this, you know, this, this 01:04:16.720 |
And you know, people look at it and they say, this is manifestly not dangerous. 01:04:20.480 |
Again, it's, it's, it's the, the delicacy of the risk isn't here 01:04:28.840 |
It's, it's really vexing to a risk planner to deal with it. 01:04:31.760 |
And so this, if then structure basically says, look, we don't want 01:04:37.360 |
We don't want to harm our own, you know, our, our, our kind of own 01:04:40.920 |
ability to have a place in the conversation by imposing these, these. 01:04:46.320 |
Very onerous burdens on models that are not dangerous today. 01:04:51.000 |
So the, if then the trigger commitment is basically a way to deal with this. 01:04:54.960 |
It says you clamp down hard when you can show that the model is dangerous. 01:04:58.680 |
And of course, what has to come with that is, you know, enough of a buffer 01:05:01.920 |
threshold that, that, you know, you can, you can, uh, you know, you're, you're, 01:05:06.200 |
you're, you're not at high risk of kind of missing the danger. 01:05:10.040 |
We've had to change it every, every, uh, you know, we came out with a new 01:05:14.040 |
one just a few weeks ago and probably, probably going forward, we might 01:05:17.700 |
release new ones multiple times a year because it's, it's hard to get these 01:05:21.320 |
policies, right, like technically organizationally from a research 01:05:27.240 |
If then commitments and triggers in order to minimize burdens and false 01:05:33.320 |
alarms now, but really react appropriately when the dangers are here. 01:05:37.040 |
What do you think the timeline for ASL three is where several 01:05:42.080 |
And what do you think the timeline is for ASL four? 01:05:47.000 |
Um, uh, we are working actively to prepare ASL three, uh, security, uh, 01:05:53.480 |
security measures, as well as ASL three deployment measures. 01:05:56.840 |
Um, I'm not going to go into detail, but we've made, we've 01:06:00.780 |
And you know, we're, we're prepared to be, I think ready quite soon. 01:06:09.220 |
If we hit ASL three, uh, next year, there was some concern that 01:06:17.620 |
It's like very hard to say, but like, I would be very, very 01:06:24.700 |
So there's a protocols for detecting it if then, and then there's 01:06:34.740 |
Yeah, I think for ASL three, it's primarily about security. 01:06:38.500 |
Um, and, and about, you know, filters on the model relating to a very narrow 01:06:44.120 |
set of areas when we deploy the model, because at ASL three, the model isn't 01:06:48.900 |
autonomous yet, um, uh, and, and so you don't have to worry about, you know, 01:06:53.180 |
kind of the model itself behaving in a bad way, even when it's deployed internally. 01:06:57.860 |
So I think the ASL three measures are, are, I won't say straightforward. 01:07:02.760 |
They're, they're, they're, they're rigorous, but they're easier to reason about. 01:07:05.940 |
I think once we get to ASL four, um, we start to have worries about the 01:07:12.120 |
models being smart enough that they might sandbag tests, they might 01:07:18.200 |
Um, we had some results came out about like sleeper agents and there 01:07:21.800 |
was a more recent paper about, you know, can, can the models, uh, mislead 01:07:26.920 |
attempts to, you know, sandbag their own abilities, right. 01:07:30.500 |
Show them, you know, uh, uh, uh, present themselves as being 01:07:35.180 |
And so I think with ASL four, there's going to be an important component 01:07:39.660 |
of using other things than just interacting with the models, for 01:07:43.260 |
example, interpretability or hidden chains of thought, uh, where you have 01:07:47.460 |
to look inside the model and verify via some other mechanism that, that is 01:07:52.460 |
not, you know, is not as easily corrupted as what the model says, uh, 01:07:56.180 |
that, that, you know, that, that, that the model indeed has some property. 01:08:02.180 |
One of the properties of the RSP is that we, we don't specify 01:08:10.100 |
Be, and, and I think that's proven to be a wise decision because even with ASL 01:08:14.220 |
three, it, again, it's hard to know this stuff in detail and, and it, it, we 01:08:18.980 |
want to take as much time as we can possibly take to get these things right. 01:08:22.900 |
So for ASL three, the bad actor will be the humans, humans. 01:08:31.620 |
And so deception and that's where mechanistic interpretability comes 01:08:36.020 |
into play and, uh, hopefully the techniques used for that are not 01:08:42.740 |
I mean, of course you can hook up the mechanistic interpretability 01:08:46.860 |
Um, but then you, then, then you, then you've kind of lost it as a 01:08:50.060 |
reliable indicator of, uh, of, uh, of, of, of the model state. 01:08:54.540 |
There are a bunch of exotic ways you can think of that. 01:08:58.260 |
Like if the, you know, model gets smart enough that it can like, you 01:09:01.660 |
know, jump computers and like read the code where you're like 01:09:08.740 |
There are ways to render them unlikely, but yeah, generally you want to, you 01:09:12.460 |
want to preserve mechanistic interpretability as a kind of verification 01:09:16.500 |
set or test set that's separate from the training process of the model. 01:09:19.260 |
See, I think, uh, as these models become better and better conversation 01:09:22.700 |
and become smarter, social engineering becomes a threat too, because they, 01:09:27.140 |
oh yeah, that can start being very convincing to the 01:09:31.740 |
It's actually like, you know, we've, we've seen lots of examples of 01:09:36.460 |
And, and, you know, there's a concern that models could do that. 01:09:39.740 |
One of the ways that cloud has been getting more and more powerful is it's 01:09:43.700 |
now able to do some agentic stuff, um, computer use, uh, there's also an 01:09:49.340 |
analysis within the sandbox of cloud.ai itself, but let's talk about computer use. 01:09:53.980 |
That's seems to me super exciting that you can just give cloud a task and it, uh, 01:09:59.620 |
it takes a bunch of actions, figures it out, and it's access to the, your 01:10:05.860 |
So can you explain how that works, uh, and where that's headed? 01:10:12.340 |
So cloud has, has had for a long time, since, since cloud three back in March, 01:10:16.940 |
the ability to analyze images and respond to them with text, the, the only new 01:10:22.060 |
thing we added is those images can be screenshots of a computer and in 01:10:27.020 |
response, we train the model to give a location on the screen where you can 01:10:36.020 |
And it turns out that with actually not all that much additional training, the 01:10:44.900 |
Um, you know, people sometimes say if you get to low earth orbit, you're 01:10:49.280 |
Because of how much it takes to escape the gravity. 01:10:51.100 |
Well, if you have a strong pre-trained model, I feel like you're halfway to 01:10:54.300 |
anywhere, uh, in, in terms of, in terms of the intelligence space, uh, uh, and, 01:11:00.380 |
and, and so actually it didn't, it didn't take all that much to get, to 01:11:03.580 |
get Claude to do this and you can just set that in a loop, give the model a 01:11:08.620 |
screenshot, tell it what to click on, give it the next screenshot, tell it 01:11:11.460 |
what to click on, and, and that turns into a full kind of almost, almost 3d 01:11:17.780 |
And it's able to do all of these tasks, right? 01:11:20.300 |
You know, we, we showed these demos where it's able to like 01:11:24.260 |
It's able to kind of like interact with a website. 01:11:27.140 |
It's able to, you know, um, you know, it's able to open all kinds of, you 01:11:31.940 |
know, programs, different operating systems, windows, Linux, Mac. 01:11:35.620 |
Uh, uh, so, uh, you know, I think all of that is very exciting. 01:11:39.980 |
I will say while in theory, there's nothing you could do there that you 01:11:44.260 |
couldn't have done through just giving the model, the API to drive the computer 01:11:47.820 |
screen, uh, this really lowers the barrier and, you know, there's, there's, 01:11:52.220 |
there's a lot of folks who, who, who either, you know, kind of, kind of 01:11:55.380 |
aren't, aren't, you know, aren't in a position to, to interact with those 01:12:00.580 |
It's just, the screen is just a universal interface. 01:12:04.540 |
And so I expect over time, this is going to lower a bunch of barriers. 01:12:08.580 |
Now, honestly, the current model has, there's, it leaves a lot still to be 01:12:12.820 |
desired and we were, we were honest about that in the blog, right? 01:12:15.540 |
It makes mistakes, it misclicks and we, we, you know, we were careful to 01:12:20.140 |
warn people, Hey, this thing isn't, you can't just leave this thing to, you 01:12:23.500 |
know, run on your computer for minutes and minutes, um, you got to give this 01:12:29.380 |
And I think that's one of the reasons we released it first in an API form 01:12:33.180 |
rather than kind of, you know, this, this kind of just, just hands it, just 01:12:36.620 |
hands the consumer and give it control of their, of their, of their, of their 01:12:40.780 |
Um, but, but, you know, I definitely feel that it's important to get these 01:12:44.940 |
capabilities out there as models get more powerful, we're going to have to 01:12:48.460 |
grapple with, you know, how do we use these capabilities safely? 01:12:53.660 |
Uh, and, and, you know, I think, I think releasing, releasing the model 01:12:57.540 |
while, while, while the capabilities are, are, you know, are, are still, are 01:13:01.900 |
still limited is, is, is very helpful in terms of, in terms of doing that. 01:13:06.220 |
Um, you know, I think since it's been released, a number of customers, I 01:13:09.820 |
think, uh, replete was maybe, was maybe one of the, the, the most, uh, uh, 01:13:13.820 |
quickest, quickest, quickest, uh, quickest to deploy things, um, have, 01:13:18.140 |
have, you know, have made use of it in various ways. 01:13:20.380 |
People have hooked up demos for, you know, windows, desktops, max, uh, uh, 01:13:28.260 |
Uh, so yeah, it's been, it's been, it's been very exciting. 01:13:31.800 |
I think as with, as with anything else, you know, it, it, it comes 01:13:37.300 |
And then, then, then, you know, then, then with those new, exciting 01:13:40.180 |
abilities, we have to think about how to, how to, you know, make the 01:13:42.860 |
model, you know, safe, reliable, do what humans want them to do. 01:13:46.740 |
I mean, it's the same, it's the same story for everything, right? 01:13:50.580 |
But, but the possibility of use cases here is just the, the range is incredible. 01:13:55.080 |
So, uh, how much to make it work really well in the future? 01:13:58.660 |
How much do you have to specially kind of, uh, go beyond what's 01:14:03.140 |
the pre-trained models doing, do more post-training, RLHF, or 01:14:06.880 |
supervised fine-tuning, or synthetic data just for the agent? 01:14:10.540 |
Yeah, I think speaking at a high level, it's our intention to keep 01:14:13.780 |
investing a lot in, you know, making, making the model better. 01:14:16.900 |
Uh, like I think, I think, uh, you know, we look at, look at some of the, 01:14:21.020 |
you know, some of the benchmarks where previous models were like, oh, 01:14:25.100 |
And now our model would do it 14 or 22% of the time. 01:14:28.380 |
And yeah, we want to get up to, you know, the human level reliability 01:14:33.340 |
We're on the same curve that we were on with Sweebench, where I think I would 01:14:36.940 |
guess a year from now, the models can do this very, very reliably, 01:14:40.700 |
So you think it's possible to get to the human level, 90%, uh, basically 01:14:45.740 |
doing the same thing you're doing now, or is it has to be special for computer use? 01:14:49.500 |
I mean, uh, it depends what you mean by, by, you know, special and special in 01:14:54.900 |
general, um, but, but I, you know, I, I generally think, you know, the same 01:14:59.660 |
kinds of techniques that we've been using to train the current model. 01:15:02.460 |
I, I expect that doubling down on those techniques in the same way that we 01:15:05.700 |
have for code, for code, for models in general, for other kits, for, you 01:15:10.140 |
know, for image input, um, uh, you know, for voice, uh, I expect those same 01:15:15.620 |
techniques will scale here as they have everywhere else. 01:15:18.060 |
But this is giving sort of the power of action to Claude. 01:15:22.460 |
And so you could do a lot of really powerful things, but you 01:15:29.100 |
Look, my, my view actually is computer use isn't a fundamentally new capability 01:15:34.860 |
like the CBRN or autonomy capabilities are, um, it's more like it kind of 01:15:40.460 |
opens the aperture for the model to use and apply its existing abilities. 01:15:44.500 |
Uh, and, and so the way we think about it, going back to our RSP is nothing 01:15:50.260 |
that this model is doing inherently increases, you know, the risk from an 01:15:56.700 |
RSP RSP perspective, but as the models get more powerful, having this 01:16:04.380 |
Once it, you know, once it has the cognitive capability to, um, You know, 01:16:09.740 |
to do something at the ASL three and ASL four level, this, this, you know, 01:16:13.980 |
this may be the thing that kind of unbounds it from doing so. 01:16:17.700 |
So going forward, certainly this modality of interaction is something 01:16:22.300 |
that we have tested for and that we will continue to test for an RSP going forward. 01:16:26.220 |
Um, I think it's probably better to have, to learn and explore this 01:16:29.620 |
capability before the model is super, uh, you know, super capable. 01:16:33.140 |
There's a lot of interesting attacks like prompt injection, because now 01:16:36.380 |
you've widened the aperture so you can prompt inject through stuff on screen. 01:16:40.460 |
So if this becomes more and more useful, then there's more and more 01:16:44.460 |
benefit to inject, inject stuff into the model. 01:16:47.620 |
If it goes to a certain web page, it could be harmless stuff like 01:16:50.540 |
advertisements, or it could be like harmful stuff, right? 01:16:53.740 |
I mean, we've thought a lot about things like spam, captcha, you know, mass camp. 01:16:57.820 |
There's all, you know, every, every, like, if one secret, I'll tell you, if 01:17:02.220 |
you've invented a new technology, not necessarily the biggest misuse, but, but 01:17:06.900 |
the, the first misuse you'll see scams, just petty scams, like you'll just, just, 01:17:12.660 |
just, it's, it's like, it's like a thing as old people scamming each other. 01:17:15.780 |
It's, it's this, it's this thing as old as time. 01:17:18.220 |
Um, and, and, and it's just every time you gotta deal with it. 01:17:21.860 |
It's almost like silly to say, but it's, it's true. 01:17:24.380 |
Sort of bots and spam in general is a thing as it gets more and more intelligent. 01:17:31.380 |
There are a lot of, like, like I said, like there are a lot 01:17:34.580 |
And, and, and, you know, it's like every new technology is like a 01:17:37.940 |
new way for petty, petty criminals to do something, you know, 01:17:51.740 |
So for example, during training, we didn't expose the model to the internet. 01:17:54.620 |
Um, I think that's probably a bad idea during training because, uh, you know, 01:18:00.060 |
It can be changing what it's doing and it's having an effect in the real world. 01:18:02.900 |
Um, uh, you know, in, in terms of actually deploying the model, right. 01:18:10.340 |
Like, you know, sometimes you want the model to do something in the real world, 01:18:13.220 |
but of course you can always put guard, you can always put 01:18:17.620 |
You can say, okay, well, you know, this model's not going to move data from my, 01:18:21.700 |
you know, model's not going to move any files from my computer or 01:18:26.940 |
Now, when you talk about sandboxing, again, when we get to ASL four, none 01:18:32.300 |
of these precautions are going to make sense there, right? 01:18:35.420 |
Where, when you, when you talk about ASL four, you're then the 01:18:38.700 |
model is being kind of, you know, there's a theoretical worry. 01:18:42.580 |
The model could be smart enough to break it, to kind of break out of any box. 01:18:46.740 |
And so there, we need to think about mechanistic interpretability about, 01:18:50.820 |
you know, if we're, if we're going to have a sandbox, it would need to be 01:18:53.580 |
a mathematically provable sound, but you know, that's, that's a whole 01:18:57.540 |
different world than what we're dealing with with the models today. 01:19:02.100 |
The science of building a box from which, uh, ASL four AI system cannot escape. 01:19:07.740 |
I think it's probably not the right approach. 01:19:10.100 |
I think the right approach instead of having something, you know, unaligned 01:19:14.220 |
that, that like you're trying to prevent it from escaping, I think it's, it's 01:19:17.620 |
better to just design the model the right way or have a loop where you, you know, 01:19:21.300 |
you look inside, you look inside the model and you're able to verify properties. 01:19:24.980 |
And that gives you a, an opportunity to like iterate and actually get it right. 01:19:28.740 |
Um, I think, I think containing, uh, containing bad models is, is, is much 01:19:37.220 |
What's the role of regulation in keeping AI safe? 01:19:40.620 |
So for example, can you describe California AI regulation bill SB 10 47 01:19:50.060 |
We ended up making some suggestions to the bill and then some of those were 01:19:54.500 |
adopted and, you know, we felt, I think, I think quite positively, uh, uh, quite 01:19:59.380 |
positively about, about the bill, uh, by, by the end of that, um, it did still 01:20:04.220 |
have some downsides, um, uh, and you know, of course, of course it got vetoed. 01:20:09.260 |
Um, I think at a high level, I think some of the key ideas behind the 01:20:13.420 |
bill, um, are, you know, I would say similar to ideas behind our RSPs. 01:20:17.740 |
And I think it's very important that some jurisdiction, whether it's 01:20:21.420 |
California or the federal government and, or other, other countries and other 01:20:28.780 |
And I can talk through why I think that's so important. 01:20:34.700 |
It needs to be iterated on a lot, but it's been a good forcing 01:20:40.500 |
To take these risks seriously, to put them into product planning, to really 01:20:45.500 |
make them a central part of work at Anthropic and to make sure that all 01:20:50.060 |
of a thousand people, and it's almost a thousand people now at Anthropic 01:20:52.900 |
understand that this is one of the highest priorities of the company, 01:20:57.540 |
Uh, but one, there are some, there are still some companies that don't 01:21:06.900 |
Google, uh, did adopt these mechanisms a couple of months after, uh, after 01:21:11.460 |
Anthropic did, uh, but there are, there are other companies out there that 01:21:17.740 |
Uh, and so if some companies adopt these mechanisms and others don't, uh, it's 01:21:23.940 |
really going to create a situation where, you know, some of these dangers have 01:21:27.860 |
the property that it doesn't matter if three out of five of the companies are 01:21:30.900 |
being safe, if the other two are, are being, are being unsafe, it 01:21:36.100 |
And, and I think the lack of uniformity is not fair to those of us who have 01:21:39.820 |
put a lot of effort into being very thoughtful about these procedures. 01:21:43.340 |
The second thing is, I don't think you can trust these companies to adhere to 01:21:51.060 |
I like to think that Anthropic will, we do everything we can that we will. 01:21:54.980 |
Our, our, our, our RSP is checked by our long-term benefit trust. 01:21:59.460 |
Uh, so, you know, we do everything we can to, to, to adhere to our own RSP. 01:22:06.700 |
Um, but you know, you hear lots of things about various companies saying, oh, 01:22:11.700 |
they said they would do, they said they would give this much compute and they 01:22:14.260 |
didn't, they said they would do this thing and they didn't, um, you know, I 01:22:17.900 |
don't, I don't think it makes sense to, you know, to, to, to, you know, litigate 01:22:22.180 |
particular things that companies have done, but I think this, this broad 01:22:25.580 |
principle that like, if there's nothing watching over them, there's nothing 01:22:29.260 |
watching over us as an industry, there's no guarantee that we'll do the right 01:22:34.420 |
Uh, and so I think it's, I think it's important to have a uniform standard 01:22:38.820 |
that, that, that, that, that everyone follows and to make sure that simply 01:22:43.780 |
that the industry does what a majority of the industry has already said is 01:22:48.340 |
important and has already said that they definitely will do. 01:22:52.340 |
Some people, uh, you know, I think there's, there's a class of people who 01:23:00.060 |
If you go to Europe and you know, you see something like GDPR, you see 01:23:03.660 |
some of the other stuff that, that, that, that, that, that, that, that they've 01:23:06.900 |
done, you know, some of it's good, but, but some of it is really unnecessarily 01:23:10.740 |
burdensome and I think it's fair to say really has slowed, really has slowed 01:23:14.780 |
innovation and so I understand where people are coming from on priors. 01:23:18.460 |
I understand why people come from, start from that, start from that position. 01:23:25.380 |
If we go to the very serious risks of autonomy and misuse that, that, that I 01:23:31.460 |
talked about, you know, just, uh, just a few minutes ago, I think that those are 01:23:37.420 |
unusual and they weren't an unusually strong response. 01:23:44.300 |
Again, um, we need something that everyone can get behind. 01:23:48.140 |
Uh, you know, I think one of the issues with SB 1047, uh, especially the 01:23:54.140 |
original version of it was it, it had a bunch of the structure of RSPs, but 01:24:01.340 |
it also had a bunch of stuff that was either clunky or that, that, that just 01:24:06.300 |
would have created a bunch of burdens, a bunch of hassle, and might even have 01:24:11.180 |
missed the target in terms of addressing the risks. 01:24:14.140 |
Um, you don't really hear about it on Twitter. 01:24:16.340 |
You just hear about kind of, you know, people are, people are 01:24:21.260 |
And then the folks who are against make up these often quite intellectually 01:24:25.140 |
dishonest arguments about how, you know, it, you know, it'll make 01:24:30.700 |
Bill, Bill doesn't apply if you're headquartered in California, Bill only 01:24:33.980 |
applies if you do business in California, um, or that it would damage the open 01:24:38.020 |
source ecosystem or that it would, you know, it would cause, cause all of these 01:24:42.420 |
things, I, I think those were mostly nonsense, but there are better 01:24:49.140 |
There's one guy, uh, Dean Ball, who's really, you know, I think a very 01:24:52.500 |
scholarly, scholarly analyst who, who looks at what happens when a regulation 01:24:57.220 |
is put in place in ways that they can kind of get a life of their own or 01:25:03.500 |
And so our interest has always been, we do think there should be regulation in 01:25:07.900 |
this space, but we want to be an actor who makes sure that that, that that 01:25:13.700 |
regulation is something that's surgical, that's targeted at the serious risks 01:25:18.820 |
and is something people can actually comply with because something I think 01:25:22.540 |
the advocates of regulation don't understand as well as they could is if 01:25:27.420 |
we get something in place that is, um, that's poorly targeted, that 01:25:36.460 |
What's going to happen is people are going to say, see these safety risks. 01:25:43.020 |
I just, you know, I just had to hire 10 lawyers to, to, you 01:25:47.660 |
I had to run all of these tests for something that was clearly not dangerous. 01:25:50.860 |
And after six months of that, there will be, there will be a groundswell 01:25:54.540 |
and we'll, we'll, we'll, we'll end up with a durable consensus against regulation. 01:25:58.860 |
And so the, I, I think the, the worst enemy of those who want real 01:26:03.580 |
accountability is badly designed regulation, um, we, we need to actually 01:26:07.700 |
get it right, uh, and, and this is, if there's one thing I could say to the 01:26:11.420 |
advocates, it would be that I want them to understand this dynamic better. 01:26:15.380 |
And we need to be really careful and we need to talk to people who actually 01:26:18.940 |
have, who actually have experience seeing how regulations play out in 01:26:23.300 |
practice and, and the people who have seen that understand to be very careful. 01:26:27.700 |
If this was some lesser issue, I might be against regulation at all. 01:26:31.860 |
But what, what I want the opponents to understand is, is that the 01:26:39.060 |
They're, they're not, they're not something that I or the other companies 01:26:43.540 |
are just making up because of regulatory capture, they're not sci-fi fantasies. 01:26:49.300 |
They're not, they're not any of these things. 01:26:51.340 |
Um, you know, every, every time we have a new model, every few months, we 01:26:55.940 |
measure the behavior of these models and they're getting better and better at 01:26:59.940 |
these concerning tasks, just as they are getting better and better at, um, you 01:27:05.180 |
know, good, valuable, economically useful tasks, and so I, I, I would just love 01:27:11.220 |
it if some of the former, you know, I think SB 1047 was very polarizing. 01:27:16.100 |
I would love it if some of the most reasonable opponents and some of the 01:27:21.580 |
most reasonable, um, uh, proponents, uh, would sit down together and, you know, 01:27:28.420 |
I think, I think that, you know, the different, the different AI companies, 01:27:31.460 |
um, you know, Anthropic was the, the only AI company that, you know, felt 01:27:38.060 |
I think Elon tweeted, uh, tweeted briefly something positive, but, you know, some 01:27:42.860 |
of the, some of the big ones like Google, OpenAI, Meta, Microsoft were, 01:27:48.980 |
So I would really like is if, if, you know, some of the key stakeholders, 01:27:52.820 |
some of the, you know, most thoughtful proponents and, and some of the most 01:27:56.180 |
thoughtful opponents would sit down and say, how do we solve this problem in, in 01:28:01.180 |
a way that the proponents feel brings a real reduction in risk and that the 01:28:07.100 |
opponents feel that it is not, it is not hampering the, the industry or hampering 01:28:13.460 |
innovation any more necessary than it, than it, than it, than it, than it needs 01:28:17.660 |
to, and, and I think for, for whatever reason that things got too polarized and 01:28:23.220 |
those two groups didn't get to sit down in the way that they should. 01:28:29.100 |
I really think we need to do something in 2025. 01:28:31.540 |
Uh, uh, you know, if we get to the end of 2025 and we've still done nothing 01:28:38.420 |
I'm not, I'm not worried yet because again, the risks aren't here yet, but, 01:28:44.700 |
And come up with something surgical, like you said. 01:28:48.340 |
And, and we need to get, we need to get away from this, this, this intense pro 01:28:54.940 |
safety versus intense anti-regulatory rhetoric, right? 01:28:58.860 |
It's turned into these, these flame wars on Twitter and nothing 01:29:03.300 |
So there's a lot of curiosity about the different players in the game. 01:29:09.220 |
You've had several years of experience at OpenAI. 01:29:14.340 |
So I was at OpenAI for, uh, for roughly five years, uh, for the 01:29:20.700 |
You know, I, I, I, I, I was a vice president of research there. 01:29:24.340 |
Um, probably myself and Ilya Sutskever were the ones who, you know, really 01:29:27.940 |
kind of set the, set the research direction around 2016 or 2017. 01:29:32.860 |
I first started to really believe in, or at least confirm my belief in the 01:29:36.500 |
scaling hypothesis when, when Ilya famously said to me, the thing you need 01:29:40.500 |
to understand about these models is they just want to learn, the models just 01:29:44.220 |
want to learn, um, and, and, and, and again, sometimes there are these one 01:29:47.500 |
sentence, there are these one sentences, these Zen cones that you hear them. 01:29:50.740 |
And you're like, ah, that, that explains everything that explains 01:29:56.260 |
And then, and then I, I, you know, I, ever after I had this visualization 01:30:00.020 |
in my head of like, you optimize the models in the right way. 01:30:05.460 |
They just want to solve the problem regardless of what the problem is. 01:30:11.220 |
Don't impose your own ideas about how they should learn. 01:30:14.420 |
Or, and you know, this was the same thing as Rich Sutton put out in the 01:30:17.180 |
bitter lesson or Gurin put out in the scaling hypothesis, you know, I think 01:30:21.260 |
generally the dynamic was, you know, I got, I got this kind of inspiration 01:30:25.740 |
from, uh, from, from, from, from Ilya and from others, folks like Alec Radford, 01:30:30.140 |
who did the, the original, uh, uh, GPT one, uh, and then, uh, ran really hard 01:30:36.260 |
with it, me, me and my collaborators on GPT two, GPT three, RL from human 01:30:41.420 |
feedback, which was an attempt to kind of deal with the early safety and 01:30:44.500 |
durability, things like debate and amplification, heavy on interpretability. 01:30:49.380 |
So again, the combination of safety plus scaling, probably 2018, 2019, 2020. 01:30:55.820 |
Those, those were, those were kind of the years when myself and my collaborators, 01:31:01.340 |
probably, um, you know, many, many of whom became co-founders of Anthropic kind 01:31:07.180 |
of really had, had, had a vision and like, and like drove the direction. 01:31:13.900 |
So look, I'm going to put things this way and I, you know, I think it, I think 01:31:17.300 |
it ties to the, to the, to the race, to the top, right, which is, you know, in 01:31:22.100 |
my time at open AI, what I come to see as I'd come to appreciate the scaling 01:31:25.900 |
hypothesis, and as I come to appreciate kind of the importance of safety along 01:31:30.340 |
with the scaling hypothesis, the first one, I think, you know, open AI was, was 01:31:35.740 |
Um, the second one in a way had always been part of, of open AI's messaging. 01:31:40.700 |
Um, but, uh, you know, over, over many years of, of the time, the time that I 01:31:45.940 |
spent there, I think I had a particular vision of how these, how we should 01:31:50.220 |
handle these things, how we should be brought out in the world, the kind of 01:31:54.060 |
principles that the organization should have. 01:31:57.260 |
And look, I mean, there were like many, many discussions about like, you know, 01:32:01.740 |
should the org do, should the company do this? 01:32:04.740 |
Like, there's a bunch of misinformation out there. 01:32:07.300 |
People say like, we left because we didn't like the deal with Microsoft. 01:32:11.460 |
Although, you know, it was like a lot of discussion, a lot of questions about 01:32:16.700 |
Um, we left because we didn't like commercialization. 01:32:20.180 |
We built GPD three, which was the model that was commercialized. 01:32:25.260 |
It's, it's more again about how do you do it? 01:32:28.220 |
Like civilization is going down this path to very powerful AI. 01:32:34.460 |
That is cautious, straightforward, honest, um, that builds trust in the 01:32:46.980 |
And how do we have a real vision for how to get it right? 01:32:49.460 |
How can safety not just be something we say because it helps with recruiting? 01:32:54.900 |
Um, and you know, I think, I think at the end of the day, um, if you have a vision 01:32:59.820 |
for that, forget about anyone else's vision, I don't want to talk about anyone 01:33:02.820 |
else's vision, if you have a vision for how to do it, you should go off 01:33:07.620 |
It is incredibly unproductive to try and argue with someone else's vision. 01:33:12.340 |
You might think they're not doing it the right way. 01:33:14.500 |
You might think they're, they're, they're dishonest. 01:33:18.500 |
Um, uh, but, uh, what, what you should do is you should take some people you trust 01:33:23.420 |
and you should go off together and you should make your vision happen. 01:33:26.260 |
And if your vision is compelling, if you can make it appeal to people, some, 01:33:30.660 |
you know, some combination of ethically, you know, in the market, uh, you know, 01:33:35.820 |
if, if you can, if you can make a company, that's a place people want to join, uh, 01:33:40.900 |
that, you know, engages in practices that people think are, are reasonable while 01:33:46.060 |
managing to maintain its position in the ecosystem at the same time. 01:33:51.860 |
Um, and the fact that you are doing it, especially the fact that you're doing 01:33:55.460 |
it better than they are, um, causes them to change their behavior in a much more 01:33:59.980 |
compelling way than if they're your boss and you're arguing with them. 01:34:03.340 |
I just, I don't know how to be any more specific about it than that, but I think 01:34:07.380 |
it's generally very unproductive to try and get someone else's vision 01:34:12.660 |
Um, it's much more productive to go off and do a clean experiment 01:34:18.620 |
This is how, this is, this is how we're going to do things. 01:34:21.100 |
Your choice is you can, you can ignore us, you can reject what we're doing, or you 01:34:29.980 |
And imitation is the sincerest form of flattery. 01:34:32.300 |
Um, and you know, that, that, that plays out in the behavior of customers. 01:34:37.380 |
That pays out in the behavior of the public that plays out in the behavior 01:34:41.940 |
Uh, and again, again, at the end, it's, it's not about one company winning 01:34:47.100 |
or another company winning if, if we are another company are engaging in 01:34:52.380 |
some practice that, you know, people, people find genuinely appealing. 01:34:57.580 |
And I want it to be in substance, not just, not just in appearance. 01:35:00.500 |
Um, and you know, I think, I think researchers are sophisticated 01:35:05.100 |
Uh, and then other companies start copying that practice and they win 01:35:15.100 |
It doesn't matter who wins in the end, as long as everyone is copying 01:35:20.060 |
One way I think of it is like the thing we're all afraid of 01:35:24.220 |
And the race to the bottom doesn't matter who wins because we all lose. 01:35:28.020 |
Like, you know, in the most extreme world, we, we make this autonomous AI that, 01:35:33.860 |
I mean, that's half joking, but you know, that, that is the most 01:35:37.460 |
extreme, uh, thing, thing that could happen then, then it doesn't 01:35:42.460 |
Um, if instead you create a race to the top where people are competing 01:35:47.180 |
to engage in good, in good practices, uh, then, you know, at the end of 01:35:51.900 |
the day, you know, it doesn't matter who ends up, who ends up winning. 01:35:55.300 |
It doesn't even matter who, who started the race to the top. 01:35:59.060 |
The point is to get the system into a better equilibrium than it was before. 01:36:03.100 |
And, and individual companies can play some role in doing this. 01:36:06.380 |
Individual companies can, can, you know, can help to start it, 01:36:12.020 |
And frankly, I think individuals at other companies have, 01:36:15.900 |
The individuals that when we put out an RSP react by pushing harder to, to, 01:36:21.140 |
to get something similar done, get something similar done at, at, at other 01:36:25.020 |
companies, sometimes other companies do something that's like, we're like, 01:36:31.420 |
The only difference is, you know, I think, I think we are, um, we 01:36:37.140 |
We try and adopt more of these practices first and 01:36:39.820 |
adopt them more quickly when others, when others invent them. 01:36:42.460 |
But I think this dynamic is what we should be pointing at. 01:36:45.780 |
And that, I think, I think it abstracts away the question of, you know, which 01:36:50.860 |
company's winning, who trusts, who, I think all these, all these questions 01:36:57.980 |
And, and the, the thing that matters is the ecosystem that we all 01:37:01.540 |
operate in and how to make that ecosystem better, because that 01:37:05.940 |
And so Anthropic is this kind of clean experiment built on a foundation of like 01:37:13.500 |
We're, look, I'm sure we've made plenty of mistakes along the way. 01:37:18.900 |
It has to deal with the, the imperfection of a thousand employees. 01:37:23.100 |
It has to deal with the imperfection of our leaders, including me. 01:37:25.900 |
It has to deal with the imperfection of the people we've put, we've put to, you 01:37:30.260 |
know, to oversee the imperfection of the, of the leaders, like the, like the board 01:37:35.460 |
It's, it's all, it's all a set of imperfect people trying to aim 01:37:39.460 |
imperfectly at some ideal that will never perfectly be achieved. 01:37:45.660 |
But, uh, uh, imperfect doesn't mean you just give up. 01:37:51.340 |
And hopefully, hopefully we can begin to build, we can do well 01:37:55.700 |
enough that we can begin to build some practices that the whole industry engages 01:37:59.980 |
in, and then, you know, my guess is that multiple of these companies will be 01:38:05.860 |
These other companies, like once I've been at the past will also be 01:38:09.180 |
successful and some will be more successful than others that's less 01:38:12.820 |
important than again, that we, we align the incentives of the industry. 01:38:16.660 |
And that happens partly through the race to the top, partly through things 01:38:19.980 |
like RSP, partly through again, selected surgical regulation. 01:38:31.660 |
Can you just talk about what it takes to build a great team of 01:38:38.580 |
That's like more true every, every, every month, every month. 01:38:41.660 |
I see the statement is more true than I did the month before. 01:38:44.100 |
So if I were to do a thought experiment, let's say you have a team of 100 people 01:38:50.300 |
that are super smart, motivated, and aligned with the mission, and that's 01:38:53.700 |
your company, or you can have a team of a thousand people where 200 people are 01:38:58.460 |
super smart, super aligned with the mission, and then like 800 people are, 01:39:05.420 |
let's just say you pick 800, like random, random big tech employees, 01:39:10.980 |
The talent mass is greater in the group of a thousand people, right? 01:39:16.340 |
You have even a larger number of incredibly talented, incredibly 01:39:22.900 |
But the issue is just that if every time someone super talented looks around, 01:39:30.940 |
they see someone else super talented and super dedicated, that sets 01:39:36.180 |
That sets the tone for everyone is super inspired to work at the same place. 01:39:41.980 |
If you have a thousand or 10,000 people and things have really regressed, right? 01:39:47.700 |
You are not able to do selection and you're choosing random people. 01:39:51.220 |
What happens is then you need to put a lot of processes and a 01:39:55.540 |
Just because people don't fully trust each other, you have to 01:40:01.500 |
Like there are so many things that slow down the org's ability to operate. 01:40:06.060 |
And so we're nearly a thousand people and, you know, we've, we've, we've 01:40:09.300 |
tried to make it so that as large a fraction of those thousand people as 01:40:13.100 |
possible are like super talented, super skilled. 01:40:16.980 |
It's one of the reasons we've, we've slowed down hiring a 01:40:21.620 |
We grew from 300 to 800, I believe, I think in the first 01:40:28.820 |
We're at like, you know, last three months we went from 800 to 900, 01:40:33.940 |
Don't quote me on the exact numbers, but I think there's an inflection 01:40:37.500 |
point around a thousand and we want to be much more careful how, how we, how 01:40:41.420 |
we grow, uh, early on and, and now as well, you know, we've hired a lot of 01:40:45.460 |
physicists, um, you know, theoretical physicists can learn things really fast. 01:40:49.740 |
Um, uh, even, even more recently as we've continued to hire that, you know, 01:40:54.460 |
we've really had a high bar for, on both the research side and the software 01:40:58.780 |
engineering side have hired a lot of senior people, including folks who used 01:41:02.660 |
to be at other, at other companies in this space, and we, we've just 01:41:08.780 |
It's very easy to go from a hundred to a thousand and a thousand to 10,000 01:41:13.620 |
without paying attention to making sure everyone has a unified purpose. 01:41:19.460 |
If your company consists of a lot of different fiefdoms that all want to do 01:41:24.300 |
their own thing, that are all optimizing for their own thing, um, uh, it's very 01:41:28.540 |
hard to get anything done, but if everyone sees the broader purpose of the 01:41:32.100 |
company, if there's trust and there's dedication to doing the right thing, 01:41:36.300 |
that is a superpower that in itself, I think can overcome almost every other 01:41:40.740 |
disadvantage and, you know, Steve jobs, a players, a players want to look around 01:41:45.140 |
and see other players as another way of saying, I don't know what that is about 01:41:48.940 |
human nature, but it is demotivating to see people who are not obsessively 01:41:55.660 |
And it is on the flip side of that, super motivating to see that. 01:42:00.460 |
Uh, what's it take to be a great AI researcher or engineer from everything 01:42:06.740 |
you've seen from working with so many amazing people? 01:42:09.260 |
Um, I think the number one quality, especially on the research side, but 01:42:15.540 |
really both is open-mindedness sounds easy to be open-minded, right? 01:42:21.340 |
Um, but you know, if I, if I think about my own early history in the scaling 01:42:26.420 |
hypothesis, um, I was seeing the same data others were seeing. 01:42:31.020 |
I don't think I was like a better programmer or better at coming up with 01:42:35.660 |
research ideas than any of the hundreds of people that I worked with. 01:42:41.060 |
Um, uh, you know, like I've, I've never liked, you know, precise programming 01:42:45.980 |
of like, you know, finding the bug, writing the GPU kernels, like I could 01:42:49.860 |
point you to a hundred people here who are better, who are better at that than I am. 01:42:52.660 |
Um, but, but the, the thing that, that, that I think I did have that was 01:42:57.900 |
different was that I was just willing to look at something with new eyes, right? 01:43:03.380 |
People said, Oh, you know, we don't have the right algorithms yet. 01:43:06.860 |
We haven't come up with the right, the right way to do things. 01:43:12.180 |
Like, you know, this neural net has like 30 billion, 30 million parameters. 01:43:19.220 |
Like let's plot some graphs like that, that basic scientific mindset of like, 01:43:23.980 |
Oh man, like I, I just, I just like, I, you know, I see some variable that I could 01:43:31.820 |
Like, let's, let's try these different things and like create a graph for even 01:43:35.660 |
the, this was like the simplest thing in the world, right? 01:43:37.700 |
Change the number of, you know, this wasn't like PhD level experimental design. 01:43:42.380 |
This was like, this was like simple and stupid. 01:43:45.140 |
Like anyone could have done this if you, if you just told 01:43:51.300 |
You didn't need to be brilliant to come up with this. 01:43:53.260 |
Um, but you put the two things together and you know, some tiny number of people, 01:43:58.300 |
some single digit number of people have, have driven forward the 01:44:03.420 |
Uh, and, and it's, you know, it's often like that. 01:44:05.860 |
If you look back at the discovery, you know, the discoveries in history, 01:44:11.340 |
And so this, this open-mindedness and this willingness to see with new eyes 01:44:15.580 |
that often comes from being newer to the field, often experience 01:44:22.340 |
It's very hard to look for and test for, but I think, I think it's the most 01:44:25.620 |
important thing because when you, when you find something, some really new way 01:44:29.780 |
of thinking, thinking about things, when you have the initiative to do that, 01:44:34.180 |
And also be able to do kind of rapid experimentation and in the face of that, 01:44:38.500 |
be open-minded and curious, and looking at the data for just these fresh eyes 01:44:42.460 |
and seeing what is that it's actually saying that applies in, uh, 01:44:46.900 |
It's another example of this, like some of the early work in mechanistic 01:44:52.740 |
It's just, no one thought to care about this question before. 01:44:55.660 |
You said what it takes to be a great AI researcher. 01:45:00.340 |
What advice would you give to people interested in AI? 01:45:02.820 |
They're young, looking forward to, how can I make any impact on the world? 01:45:05.900 |
I think my number one piece of advice is to just start playing with the models. 01:45:10.220 |
Um, this was actually, I, I worry a little, this seems like obvious advice. 01:45:15.100 |
Now, I think three years ago, it wasn't obvious and people started by, Oh, let 01:45:19.300 |
me read the latest reinforcement learning paper. 01:45:21.380 |
Let me, you know, let me, let me kind of, um, no, I mean, that was really the, 01:45:24.660 |
that was really the, the, and I mean, you should do that as well, but, uh, now, 01:45:28.980 |
you know, with wider availability of models and APIs, people are doing this 01:45:32.580 |
more, but I think, I think just experiential knowledge, um, these models 01:45:39.060 |
are new artifacts that no one really understands. 01:45:41.740 |
Um, and so getting experience playing with them, I would also say again, 01:45:46.140 |
in line with the, like, do something new, think in some new direction. 01:45:49.780 |
Like there are all these things that haven't been explored. 01:45:53.220 |
Like for example, mechanistic interpretability is still very new. 01:45:56.700 |
It's probably better to work on that than it is to work on new model 01:45:59.540 |
architectures, because it's, you know, it's more popular than it was before. 01:46:03.420 |
There are probably like a hundred people working on it, but there aren't 01:46:07.140 |
And it's, it's this, this, this, this fertile area for study, like, 01:46:12.020 |
like, you know, it's, there's, there's so much like low hanging fruit. 01:46:17.780 |
You can just walk by and, you know, you can just walk 01:46:21.140 |
Um, and, and the, the only reason for whatever reason people aren't, people 01:46:25.820 |
aren't interested in it enough, I think there are some things around. 01:46:29.060 |
Long, long horizon learning and long horizon tasks where 01:46:34.660 |
I think evaluations are still, we're still very early in our ability 01:46:38.340 |
to study evaluations, particularly for dynamic systems, acting in the world. 01:46:42.300 |
I think there's some stuff around multi-agent, um, skate where the 01:46:49.380 |
And you don't have to be brilliant to think of it. 01:46:51.420 |
Like all the things that are going to be exciting in five years, 01:46:54.900 |
like in, in people even mentioned them as like, you know, conventional 01:46:58.540 |
wisdom, but like, it's, it's just somehow there's this barrier that 01:47:02.140 |
people don't, people don't double down as much as they could, or they're 01:47:05.860 |
afraid to do something that's not the popular thing, I don't know why it 01:47:09.340 |
happens, but like getting over that barrier is that's the, my number one 01:47:12.860 |
piece of advice, let's talk, if it could a bit about post-training. 01:47:16.900 |
So it, uh, seems that the modern post-training recipe has, uh, 01:47:23.380 |
So supervised, fine-tuning, RLHF, uh, the, the, the constitutional AI with RL-A-I-F. 01:47:34.620 |
Uh, and then synthetic data seems like a lot of synthetic data, or at least 01:47:40.660 |
trying to figure out ways to have high quality synthetic data. 01:47:43.180 |
So what's the, uh, if this is a secret sauce that makes 01:47:49.540 |
What, how, how much of the magic is in the pre-training? 01:47:54.420 |
Um, I mean, uh, so first of all, we're not perfectly able 01:47:58.020 |
Um, uh, you know, when you see some, some great character ability, sometimes 01:48:02.300 |
it's hard to tell whether it came from pre-training or post-training. 01:48:05.100 |
Uh, we've developed ways to try and distinguish between those 01:48:09.660 |
You know, the second thing I would say is, you know, it's when there is an 01:48:12.980 |
advantage and I think we've been pretty good at in general, in general at RL, 01:48:16.220 |
perhaps, perhaps the best, although, although I don't know, cause I don't 01:48:21.780 |
Uh, usually it isn't, oh my God, we have this secret magic method 01:48:27.820 |
Usually it's like, well, you know, we got better at the infrastructure 01:48:33.260 |
Or, you know, we were able to get higher quality data, or we were able 01:48:36.460 |
to filter our data better, or we were able to, you know, combine 01:48:40.780 |
It's, it's usually some boring matter of matter of kind of, uh, uh, 01:48:47.460 |
Um, so, you know, when I think about how to do something special in terms 01:48:51.620 |
of how we train these models, both pre-training, but even more so post 01:48:54.660 |
training, um, you know, I, I really think of it a little more again, as 01:49:01.860 |
Like, you know, it's not just like, oh man, I have the blueprint. 01:49:04.940 |
Like maybe that makes you make the next airplane, but like, there's some, 01:49:08.020 |
there's some cultural trade craft of how we think about the design process 01:49:12.700 |
that I think is more important than, than, you know, than, than any 01:49:17.980 |
Well, about, let me ask you about specific techniques. 01:49:20.380 |
So first on RLHF, what do you think, just zooming out intuition, almost 01:49:25.300 |
philosophy, why do you think RLHF works so well, if I go back to like 01:49:29.700 |
the scaling hypothesis, one of the ways to skate the scaling hypothesis 01:49:33.820 |
is if you train for X and you throw enough compute at it, um, then you 01:49:38.060 |
get X and, and so RLHF is good at doing what humans want the model to 01:49:43.660 |
do, or at least, um, to state it more precisely doing what humans who 01:49:48.060 |
look at the model for a brief period of time and consider different 01:49:51.020 |
possible responses, what they prefer as the response, uh, which is not 01:49:55.060 |
perfect from both the safety and capabilities perspective in that 01:49:58.460 |
humans are often not able to perfectly identify what the model wants and 01:50:02.420 |
what humans want in the moment may not be what they want in the longterm. 01:50:05.140 |
So there's, there's a lot of subtlety there, but the models are good at, 01:50:09.540 |
uh, you know, producing what the humans in some shallow sense want. 01:50:13.900 |
Uh, and it actually turns out that you don't even have to throw that 01:50:17.780 |
much compute at it because of another thing, which is this, this thing 01:50:22.060 |
about a strong pre-trained model being halfway to anywhere. 01:50:25.220 |
Uh, uh, uh, so once you have the pre-trained model, you have all the 01:50:29.100 |
representations you need to, to get the model, uh, to get the model 01:50:33.460 |
So do you think our RLHF makes the model smarter or just 01:50:43.780 |
I don't think it just makes the model appear smarter. 01:50:46.700 |
It's like RLHF like bridges, the gap between the human and the model, right. 01:50:52.140 |
I could have something really smart that like can't communicate at all. 01:50:55.580 |
We all know people like this, um, people who are really smart, but the, you know, 01:51:00.620 |
Um, uh, so I think, I think RLHF just bridges that gap. 01:51:04.980 |
Um, I think it's not, it's not the only kind of RL we do. 01:51:08.460 |
It's not the only kind of RL that will happen in the future. 01:51:10.660 |
I think RL has the potential to make models smarter, to make them reason 01:51:15.260 |
better, to make them operate better, to make them develop new skills even. 01:51:20.100 |
And perhaps that could be done, you know, even in some cases with human 01:51:24.020 |
feedback, but the kind of RLHF we do today mostly doesn't do that yet. 01:51:28.260 |
Although we're very quickly starting to be able to. 01:51:32.380 |
If you look at the metric of helpfulness, it increases that. 01:51:35.980 |
It also increases, what was this, this word in Leopold's essay unhobbling, 01:51:41.180 |
where basically the models are hobbled and then you do various 01:51:45.900 |
So I, you know, I like that word cause it's like a rare word, but it's so, 01:51:49.460 |
so I think RLHF unhobbles the models in some ways. 01:51:52.740 |
Um, and then there are other ways where a model hasn't yet been unhobbled 01:51:55.780 |
and, and, you know, needs to, needs to unhobble. 01:51:57.700 |
If you can say in terms of costs, is pre-training the most expensive 01:52:05.380 |
At the present moment, it is still the case that, uh, pre-training 01:52:10.380 |
I don't know what to expect in the future, but I could certainly 01:52:13.220 |
anticipate a future where post-training is the majority of the cost. 01:52:15.980 |
In that future, you anticipate, would it be the humans or the AI? 01:52:19.980 |
That's the costly thing for the post-training. 01:52:21.940 |
I, I, I don't think you can scale up humans enough to get high quality. 01:52:27.100 |
Any, any kind of method that relies on humans and uses a large amount 01:52:30.700 |
of compute, it's going to have to rely on some scaled supervision 01:52:33.820 |
method, like, uh, uh, like, um, you know, debate or iterated 01:52:39.460 |
So on that super interesting, um, set of ideas around constitutional AI. 01:52:45.220 |
Can you describe what it is as first detailed in December 2022 paper? 01:52:55.100 |
The basic idea is, so we describe what RLHF is. 01:52:59.100 |
You have, uh, you have a model and, uh, it, you know, spits out two po- you 01:53:04.820 |
know, like you just sample from it twice, it spits out two possible responses. 01:53:08.060 |
And you're like human, which response do you like better? 01:53:10.580 |
Or another variant of it is rate this response on a scale of one to seven. 01:53:14.300 |
So that's hard because you need to scale up human interaction. 01:53:19.660 |
I don't have a sense of what I, what I want the model to do. 01:53:22.300 |
I just have a sense of like what this average of a thousand 01:53:26.900 |
So two ideas, one is could the AI system itself decide which, 01:53:35.340 |
Could you show the AI system, these two responses and ask 01:53:40.180 |
And then second, well, what criterion should the AI use? 01:53:43.500 |
And so then there's this idea, cause you have a single document, a 01:53:46.940 |
constitution, if you will, that says, these are the principles the 01:53:54.780 |
Um, it reads those principles as well as reading the 01:54:00.980 |
And it says, well, how good did the AI model do? 01:54:06.100 |
You're kind of training the model against itself. 01:54:08.740 |
And so the AI gives the response and then you feed that back into 01:54:12.780 |
what's called the preference model, which in turn feeds the 01:54:16.220 |
Um, so you have this triangle of like the AI, the preference 01:54:22.540 |
And we should say that in the constitution, the set of 01:54:27.900 |
It's, it's something both the human and the AI system can read. 01:54:31.300 |
So it has this nice, this nice kind of translatability or symmetry. 01:54:35.020 |
Um, you know, in, in practice, we both use a model constitution and we use 01:54:39.700 |
RLHF and we use some of these other methods, so it's, it's turned into 01:54:43.900 |
one tool in a, in a toolkit that both reduces the need for RLHF and increases 01:54:50.220 |
the value we get from, um, from, from using each data point of RLHF. 01:54:54.700 |
Um, it also interacts in interesting ways with kind of future 01:54:59.740 |
So, um, it's, it's one tool in the toolkit, but, but I think 01:55:04.940 |
Well, it's a compelling one to us humans, you know, thinking about the founding 01:55:08.860 |
fathers and the founding of the United States, the natural question is who and 01:55:14.300 |
how do you think it gets to define the constitution, the, the set of 01:55:20.660 |
So I'll give like a practical, um, answer and a more abstract answer. 01:55:24.580 |
I think the practical answer is like, look in practice, models get used by 01:55:30.620 |
And, and so, uh, you can have this idea where, you know, the model can, can 01:55:37.020 |
You know, we fine tune versions of models implicitly. 01:55:40.220 |
We've talked about doing it explicitly, having, having special principles that 01:55:45.660 |
Um, uh, so from a practical perspective, the answer can be very 01:55:50.980 |
Uh, you know, customer service agent, uh, you know, behaves very 01:55:54.020 |
differently from a lawyer and obeys different principles. 01:55:56.740 |
Um, but I think at the base of it, there are specific principles 01:56:03.300 |
I think a lot of them are things that people would agree with. 01:56:06.100 |
Everyone agrees that, you know, we don't, you know, we don't want 01:56:11.460 |
Um, I think we can go a little further and agree with some basic principles 01:56:17.340 |
Beyond that, it gets, you know, very uncertain and, and there, our goal is 01:56:21.100 |
generally for the models to be more neutral, to not espouse a particular 01:56:26.100 |
point of view and, you know, more just be kind of like wise, uh, agents 01:56:31.220 |
or advisors that will help you think things through and will, you know, 01:56:34.660 |
present, present possible considerations, but, you know, don't express, you 01:56:40.540 |
OpenAI released a model spec where it kind of clearly concretely defines some 01:56:46.780 |
of the goals of the model and specific examples like A, B, how the model should 01:56:54.500 |
By the way, I should mention the, I believe the brilliant John 01:57:03.380 |
Might Anthropic release a model spec as well? 01:57:08.020 |
Again, it has a lot in common with, uh, constitutional AI. 01:57:11.500 |
So again, another example of like a race to the top, right? 01:57:14.660 |
We have something that's like, we think, you know, a better and more 01:57:22.060 |
Um, then, uh, others kind of, you know, discover that it has advantages 01:57:28.180 |
Uh, we then no longer have the competitive advantage, but it's good 01:57:31.700 |
from the perspective that now everyone has adopted a positive practice 01:57:37.580 |
And so our response to that as well, looks like we need a new competitive 01:57:40.860 |
advantage in order to keep driving this race upwards. 01:57:43.140 |
Um, so that's, that's how I generally feel about that. 01:57:45.660 |
I also think every implementation of these things is different. 01:57:48.820 |
So, you know, there were some things in the model spec that 01:57:53.300 |
And so, you know, we, you know, we can always, we can always adopt those 01:57:56.460 |
things or, you know, at least learn from them. 01:57:58.260 |
Um, so again, I think this is an example of like the positive dynamic 01:58:01.860 |
that, uh, that, that, that I, that, that I think we should all want the field to 01:58:05.340 |
have, let's talk about the incredible essay machines of love and grace. 01:58:13.780 |
It's really refreshing to read concrete ideas about what a 01:58:18.980 |
And you took sort of a bold stance because like, it's very possible that you might 01:58:25.100 |
I'm fully expecting to, you know, to de will definitely 01:58:29.460 |
I might be, be just spectacularly wrong about the whole thing. 01:58:33.340 |
And people will, you know, will laugh at me for years. 01:58:35.580 |
Um, uh, that's, that's how that's, that's just how the future works. 01:58:38.980 |
So you provided a bunch of concrete, positive impacts of AI and how. 01:58:44.140 |
You know, exactly a super intelligent AI might accelerate the rate of 01:58:47.660 |
breakthroughs in, for example, biology and chemistry that would then lead to 01:58:52.460 |
things like we cure most cancers, prevent all infectious disease, double 01:59:02.060 |
Can you give a high level vision of this essay and, um, what key 01:59:08.820 |
Yeah, I have spent a lot of time in Anthropic. 01:59:11.340 |
I spent a lot of effort on like, you know, how do we address the risks of AI? 01:59:19.180 |
You know, what that requires us to build all these capabilities 01:59:23.180 |
But, you know, you know, we're, we're, we're like a big part of what we're 01:59:27.940 |
trying to do is like, is like address the risks and the justification for 01:59:31.660 |
that is like, well, you know, all these positive things, you know, the market 01:59:37.860 |
It's going to produce all the positive things, the risks. 01:59:42.540 |
And so we can have more impact by trying to mitigate the risks. 01:59:45.980 |
But I noticed that one flaw in that way of thinking, and it's, it's not a 01:59:53.580 |
It's, it's maybe a change in how I talk about them. 01:59:56.340 |
Is that, you know, no matter how kind of logical or rational that line of 02:00:04.540 |
reasoning that I just gave might be, if, if you kind of only talk about risks, 02:00:12.060 |
And, and so I think it's actually very important to understand what 02:00:15.420 |
if things do go well and the whole reason we're trying to prevent these 02:00:18.300 |
risks is not because we're afraid of technology, not because we want to slow 02:00:21.460 |
it down, it's, it's, it's because if we can get to the other side of these 02:00:27.700 |
risks, right, if we can run the gauntlet successfully to, you know, to, to put it 02:00:32.060 |
in stark terms, then, then on the other side of the gauntlet are all these great 02:00:35.820 |
things and these things are worth fighting for and these things can really inspire 02:00:39.740 |
people and I think I imagine because look, you have all these investors, all 02:00:45.140 |
these VCs, all these AI companies talking about all the positive benefits of AI. 02:00:49.780 |
But as you point out, it's, it's, it's weird. 02:00:52.740 |
There's actually a dearth of really getting specific about it. 02:00:55.820 |
There's a lot of like random people on Twitter, like posting these kind of like 02:01:00.820 |
gleaming cities and this, this just kind of like vibe of like grind, accelerate 02:01:07.900 |
You know, it's, it's just this very, this very like aggressive ideological, but 02:01:12.380 |
then you're like, well, what are you, what, what, what, what, what, what are 02:01:16.740 |
And so, and so I figured that, you know, I think it would be interesting and 02:01:21.380 |
valuable for someone who's actually coming from the risk side to, to try and, 02:01:26.220 |
and to try and really make a try at, at explaining, explaining, explain what the 02:01:33.460 |
benefits are both because I think it's something we can all get behind and I 02:01:38.780 |
want people to understand, I want them to really understand that this isn't, this 02:01:45.620 |
Um, this, this is that if you have a true understanding of, of where things are 02:01:52.220 |
going with, with AI, and maybe that's the more important axis, AI is moving fast 02:01:56.420 |
versus AI is not moving fast, then you really appreciate the benefits and you, 02:02:00.860 |
you, you, you really, you want humanity, our civilization to seize those benefits, 02:02:06.060 |
but you also get very serious about anything that could derail them. 02:02:08.860 |
So I think the starting point is to talk about what this powerful AI, 02:02:13.220 |
which is the term you like to use, uh, most of the world uses AGI, but you 02:02:17.300 |
don't like the term because it's, uh, basically has too much baggage. 02:02:25.020 |
Maybe we're stuck with the terms and my efforts to change them are futile. 02:02:30.460 |
I don't, this is like a pointless semantic point, but I, I, I keep 02:02:34.460 |
talking about it, so I'm just, I'm just gonna do it once more. 02:02:37.100 |
Um, uh, I, I think it's, it's a little like, like, let's say it was like 02:02:41.780 |
1995 and Moore's law is making the computers faster and like, for some 02:02:46.180 |
reason there, there, there, there had been this like verbal tick that like, 02:02:49.540 |
everyone was like, well, someday we're going to have like supercomputers and 02:02:52.900 |
like supercomputers are going to be able to do all these things that like, you 02:02:55.820 |
know, once we have supercomputers, we'll be able to like sequence the genome. 02:03:02.020 |
The computers are getting faster and as they get faster, they're 02:03:04.220 |
going to be able to do all these great things. 02:03:05.820 |
But there's like, there's no discrete point at which you had a supercomputer 02:03:10.260 |
in previous computers were not to like supercomputers, a term we use, but 02:03:13.420 |
like, it's a vague term to just describe like computers that are 02:03:18.700 |
Um, there's no point at which you pass a threshold and you're like, Oh my God, 02:03:22.060 |
we're doing a totally new type of computation and new. 02:03:28.860 |
And like, if, if by AGI, you mean like, like AI is getting better and 02:03:33.580 |
better and like gradually it's going to do more and more of what humans do until 02:03:38.500 |
And then it's going to get smarter even from there then, then yes, I believe in AGI. 02:03:42.260 |
If, but if, if, if AGI is some discrete or separate thing, which is the way people 02:03:46.820 |
often talk about it, then it's, it's kind of a meaningless buzzword. 02:03:49.780 |
I mean, to me, it's just sort of a platonic form of a powerful AI, exactly how you define it. 02:03:56.780 |
So on the intelligence axis, it's just on pure intelligence. 02:04:02.540 |
It's smarter than a Nobel prize winner, as you describe across most relevant disciplines. 02:04:08.860 |
So it's both in creativity and be able to generate new ideas, all that kind of stuff. 02:04:22.180 |
So this kind of self-explanatory, but just operate across all the modalities of the world. 02:04:27.460 |
It can go off for many hours, days, and weeks to do tasks and do its own sort of 02:04:34.460 |
detailed planning and only ask you help when it's needed. 02:04:37.740 |
It can use, this is actually kind of interesting. 02:04:41.020 |
I think in the essay you said, I mean, again, it's a bet that it's not going to be 02:04:49.180 |
So it can control tools, robots, laboratory equipment. 02:04:52.020 |
The resource used to train it can then be repurposed to run millions of copies of it. 02:04:57.500 |
And each of those copies would be independent. 02:05:01.140 |
So you can do the cloning of the intelligence. 02:05:03.540 |
I mean, you, you might imagine from outside the field that like, there's 02:05:08.700 |
But the truth is that like the scale up is very quick. 02:05:13.460 |
We make a model and then we deploy thousands, maybe tens of thousands of 02:05:19.500 |
You know, certainly within two to three years, whether we have these super 02:05:22.620 |
powerful AIs or not, clusters are going to get to the size where you'll be able 02:05:26.460 |
to deploy millions of these and there'll be, you know, faster than humans. 02:05:30.220 |
And so if your picture is, oh, we'll have one and it'll take a while to make them. 02:05:33.820 |
My point there was no, actually you have millions of them right away. 02:05:40.340 |
Uh, 10 to a hundred times faster than humans. 02:05:44.300 |
So that's a really nice definition of powerful AI. 02:05:47.420 |
So that, but you also write that clearly such an entity would be capable of 02:05:51.900 |
solving very difficult problems very fast, but it is not trivial to figure out how 02:05:55.940 |
fast two extreme positions both seem false to me. 02:05:59.100 |
So the singularity is on the one extreme and the opposite and the other extreme. 02:06:05.740 |
So, so yeah, let's, let's describe the extreme. 02:06:08.740 |
So like one, one extreme would be, well, look, um, you know, uh, if we look at 02:06:15.860 |
kind of evolutionary history, like there was this big acceleration where, you know, 02:06:19.420 |
for hundreds of thousands of years, we just had like, you know, single celled 02:06:22.820 |
organisms, and then we had mammals and then we had apes and then that quickly 02:06:26.020 |
turned to humans, humans quickly built industrial civilization. 02:06:33.700 |
Once models get much, much smarter than humans, they'll get really 02:06:38.740 |
And, you know, if you write down like a simple differential equation, 02:06:43.180 |
And so what's, what's going to happen is that, uh, models will build faster 02:06:49.340 |
And those models will build, you know, nanobots that can like take over the 02:06:52.700 |
world and produce much more energy than you could produce otherwise. 02:06:56.180 |
And so if you just kind of like solve this abstract differential equation, 02:06:59.700 |
then like five days after we, you know, we build the first AI that's more 02:07:03.980 |
powerful than humans, then, then, uh, you know, like the world will be filled 02:07:07.780 |
with these AIs and every possible technology that could be invented, 02:07:13.780 |
Um, uh, but I, you know, I think that's one extreme. 02:07:17.780 |
And the reason that I think that's not the case is that one, I think they just 02:07:26.260 |
Like it's only possible to do things so fast in the physical world. 02:07:29.140 |
Like some of those loops go through, you know, producing faster hardware. 02:07:32.900 |
Um, uh, it takes a long time to produce faster hardware. 02:07:40.060 |
Like, I think no matter how smart you are, like, you know, people talk about, 02:07:44.620 |
oh, we can make models of biological systems. 02:07:48.460 |
Look, I think computational modeling can do a lot. 02:07:50.820 |
I did a lot of computational modeling when I worked in biology, but like just. 02:07:55.300 |
There are a lot of things that you can't predict how they're, you 02:07:59.700 |
know, they're, they're complex enough that like just iterating, just 02:08:03.500 |
running the experiment is going to beat any modeling, no matter how smart 02:08:08.340 |
Or even if it's not interacting with the physical world, just 02:08:12.860 |
I think, well, the modeling is going to be hard and getting the model 02:08:15.620 |
to, to, to, to match the physical world is going to be all right. 02:08:18.900 |
So he does have to verify, but it's just, you know, you just look 02:08:24.860 |
Like I, you know, I think I talk about like, you know, the three body 02:08:27.620 |
problem or simple chaotic prediction, like, you know, or, or like predicting 02:08:32.260 |
the economy, it's really hard to predict the economy two years out. 02:08:35.660 |
Like maybe the case is like, you know, normal, you know, humans 02:08:39.540 |
can predict what's going to happen in the economy next quarter. 02:08:43.420 |
Maybe, maybe a AI system that's, you know, a zillion times smarter can 02:08:48.060 |
only predict it out a year or something instead of, instead of, you know, 02:08:50.860 |
you have these kinds of exponential increase in computer intelligence for 02:08:55.220 |
linear increase in, in, in ability to predict same with, again, like, you 02:09:00.620 |
know, biological molecules, molecules interacting, you don't know what's 02:09:05.060 |
going to happen when you perturb a, when you perturb a complex system, 02:09:09.620 |
If you're smarter, you're better at finding these simple parts. 02:09:12.420 |
And then I think human institutions, human institutions 02:09:18.060 |
Like it's, you know, it's, it's been hard to get people, I won't give specific 02:09:23.460 |
examples, but it's been hard to get people to adopt even the technologies 02:09:28.500 |
that we've developed, even ones where the case for their efficacy is very, very 02:09:39.340 |
Like it's, it's just been, it's been very difficult. 02:09:42.140 |
It's also been very difficult to get, you know, very simple things 02:09:48.580 |
I think, you know, and you know, I don't want to disparage anyone who, 02:09:52.380 |
you know, you know, works in regulatory, regulatory systems of any technology. 02:09:57.020 |
There are hard trade-offs they have to deal with. 02:09:58.660 |
They have to save lives, but, but the system as a whole, I think 02:10:02.980 |
makes some obvious trade-offs that are very far from maximizing human welfare. 02:10:08.900 |
And so if we bring AI systems into this, you know, into these human systems, 02:10:17.420 |
often the level of intelligence may just not be the limiting factor, right? 02:10:22.700 |
It, it, it just may be that it takes a long time to do something. 02:10:25.620 |
Now, if the AI system circumvented all governments, if it just said, 02:10:30.140 |
I'm dictator of the world and I'm going to do whatever, some of these things 02:10:33.500 |
that could do again, the things having to do with complexity, I still think a 02:10:38.220 |
I don't think it helps that the AI systems can produce a lot 02:10:42.780 |
Like some people in comments responded to the essay saying the AI system can 02:10:47.100 |
produce a lot of energy and smarter AI systems, that's missing the point. 02:10:51.140 |
That kind of cycle doesn't solve the key problems that I'm talking about here. 02:10:55.460 |
So I think, I think a bunch of people miss the point there, but even if it 02:10:59.380 |
were completely unaligned and, you know, could get around all these human 02:11:04.140 |
But again, if you want this to be an AI system that doesn't take over the 02:11:07.820 |
world, that doesn't destroy humanity, then, then basically, you know, it's, 02:11:12.020 |
it's, it's going to need to follow basic human laws, right? 02:11:15.140 |
Well, you know, if, if we want to have an actually good world, like we're 02:11:18.900 |
going to have to have an AI system that, that interacts with humans, not one 02:11:22.860 |
that kind of creates its own legal system or disregards all the laws or all of that. 02:11:26.980 |
So as inefficient as these processes are, you know, we're going to have to 02:11:31.820 |
deal with them because there, there needs to be some popular and democratic 02:11:35.740 |
legitimacy in how these systems are rolled out. 02:11:37.940 |
We can't have a small group of people who are developing these systems say 02:11:44.580 |
And I think in practice it's not going to work anyway. 02:11:46.300 |
So you put all those things together and, you know, we're not, we're not 02:11:50.380 |
going to, we're not going to, you know, change the world and 02:11:54.660 |
Uh, it's, I, I, I just, I don't think, I, A, I don't think it's going to happen. 02:11:59.700 |
And B to, you know, to the extent that it could happen, it's, it's not 02:12:07.260 |
On the other side, there's another set of perspectives, which I have actually 02:12:11.220 |
in some ways, more sympathy for, which is look, we've seen big 02:12:17.220 |
You know, economists are familiar with studying the productivity increases 02:12:21.580 |
that came from the computer revolution and internet revolution. 02:12:24.380 |
And generally those productivity increases were underwhelming. 02:12:27.620 |
They were less than you, than you might imagine. 02:12:32.340 |
You see the computer revolution everywhere except the productivity statistics. 02:12:37.580 |
People point to the structure of firms, the structure of enterprises, how, um, uh, 02:12:45.020 |
you know, how slow it's been to roll out our existing technology to very poor 02:12:49.260 |
parts of the world, which I talk about in the essay, right? 02:12:51.780 |
How do we get these technologies to the poorest parts of the world that are behind 02:12:56.340 |
on cell phone technology, computers, medicine, let alone, you know, newfangled 02:13:03.500 |
Um, so you could have a perspective that's like, well, this is amazing 02:13:09.060 |
Um, uh, you know, I think, um, Tyler Cowen, who, who wrote something in 02:13:16.380 |
I think he thinks the radical change will happen eventually, but he thinks 02:13:20.780 |
And, and you could have even more static perspectives on the whole thing. 02:13:26.700 |
I think the timescale is just, is just too long. 02:13:31.180 |
I can actually see both sides with today's AI. 02:13:34.660 |
So, uh, you know, a lot of our customers are large enterprises who are 02:13:40.780 |
Um, I've also seen it in talking to governments, right? 02:13:44.140 |
Those are, those are prototypical, you know, institutions, entities 02:13:49.060 |
Uh, but the dynamic I see over and over again is yes, it takes 02:13:56.220 |
There's a lot of resistance and lack of understanding. 02:13:58.780 |
But the thing that makes me feel that progress will in the end happen 02:14:02.420 |
moderately fast, not incredibly fast, but moderately fast is that you talk to. 02:14:07.820 |
What I find is I find over and over again, again, in large companies, even 02:14:12.820 |
in governments, um, which have been actually surprisingly forward leaning. 02:14:16.500 |
Uh, you find two things that move things forward. 02:14:21.180 |
One, you find a small fraction of people within a company, within a government 02:14:26.660 |
who really see the big picture, who see the whole scaling hypothesis, who 02:14:30.620 |
understand where AI is going, or at least understand where it's going within their 02:14:34.100 |
industry, and there are a few people like that within the current, within the current 02:14:37.620 |
U S government who really see the whole picture and, and those people see that 02:14:42.180 |
this is the most important thing in the world until they agitate for it. 02:14:44.900 |
And the thing that they alone are not enough to succeed because there are a 02:14:48.620 |
small set of people within a large organization, but as the technology 02:14:54.180 |
starts to roll out, as it succeeds in some places, in the folks who are most 02:14:59.420 |
willing to adopt it, the specter of competition gives them a wind at their 02:15:04.500 |
backs because they can point within their large organization, they can say. 02:15:09.020 |
Look, these other guys are doing this, right? 02:15:11.700 |
You know, one bank can say, look, this new fangled hedge fund is doing this thing. 02:15:15.380 |
They're going to eat our lunch in the U S we can say, we're afraid China's 02:15:21.700 |
And that combination, the specter of competition, plus a few visionaries 02:15:26.060 |
within these, you know, within these, the organizations that in many ways 02:15:30.300 |
are, are sclerotic, you put those two things together and it 02:15:36.100 |
It's a balanced fight between the two because inertia is very powerful, but, 02:15:40.060 |
but, but eventually over enough time, the innovative approach breaks through. 02:15:49.620 |
I've seen the arc of that over and over again. 02:15:52.100 |
And it's like the, the barriers are there, the, the barriers to progress, 02:15:57.700 |
the complexity, not knowing how to use the model, how to deploy them are there. 02:16:02.220 |
And, and for a bit, it seems like they're going to last forever. 02:16:06.100 |
Like change doesn't happen, but then eventually change happens 02:16:11.380 |
I felt the same way when I was an advocate of the scaling hypothesis 02:16:19.620 |
It felt like, then it felt like we had a secret almost no one ever had. 02:16:23.900 |
And then a couple of years later, everyone has the secret. 02:16:26.700 |
And so I think that's how it's going to go with deployment to AI in the world. 02:16:30.540 |
It's going to, the, the barriers are going to fall apart 02:16:35.540 |
And so I think this is going to be more, and this is just an instinct. 02:16:41.620 |
I think it's going to be more like 10, five or 10 years. 02:16:44.900 |
As I say in the essay, then it's going to be 50 or a hundred years. 02:16:47.980 |
I also think it's going to be five or 10 years more than it's going to be, you 02:16:52.660 |
know, five or 10 hours because I've just, I've just seen how human systems work. 02:16:58.780 |
And I think a lot of these people who write down the differential equations, 02:17:01.980 |
who say AI is going to make more powerful AI, who can't understand how it could 02:17:05.860 |
possibly be the case that these things won't, won't change so fast. 02:17:11.500 |
So what do you use the timeline to where we achieve AGI, AKA powerful 02:17:18.300 |
AI, AKA super useful AI, I'm going to start calling it that. 02:17:26.420 |
You know, on pure intelligence, it can smarter than a Nobel prize 02:17:31.660 |
winner in every relevant discipline and all the things we've said. 02:17:34.300 |
Modality can go and do stuff on its own for days, weeks, and do biology 02:17:39.900 |
experiments, uh, on its own in one, you know what, let's just stick to biology. 02:17:44.380 |
Cause yeah, you, you sold me on the whole biology and health section. 02:17:47.940 |
That's so exciting from, um, from just, I was getting giddy 02:17:55.820 |
It's almost, it's so, no, no, this was the feeling I had when I was writing it, 02:18:00.140 |
that it's, it's like, this would be such a beautiful future if we can, if we can 02:18:07.020 |
If we can just get the, get the landmines out of the way and, and, and, and make it 02:18:10.860 |
happen, there's, there's so much, there's so much beauty and, and, and, and, and 02:18:18.940 |
If, if we can, if we can just, and it's something we should all be able to agree 02:18:23.580 |
Like as much as we fight about, about all these political questions, is this something 02:18:30.220 |
Um, but you were asking when, when, when, when, when do you think what's just so 02:18:34.700 |
putting numbers on, so, you know, this, this is of course, the thing I've been 02:18:37.580 |
grappling with for many years and I'm not, I'm not at all confident every time. 02:18:41.660 |
If I say 2026 or 2027, there will be like a zillion, like people on Twitter who will 02:18:47.180 |
be like, Hey, I CEO said 2026, 2020, and it'll be repeated for like the next two 02:18:51.980 |
years that like, this is definitely when I think it's going to happen. 02:18:54.780 |
Um, so whoever's exerting these clips, we'll, we'll, we'll, we'll, we'll crop out 02:19:00.300 |
the thing I just said and, and, and only say the thing I'm about to say. 02:19:05.660 |
Um, uh, so, uh, if you extrapolate the curves that we've had so far, right. 02:19:12.460 |
If, if you say, well, I don't know, we're starting to get to like PhD level. 02:19:16.700 |
And, and last year we were at, um, uh, undergraduate level in the year before we 02:19:21.820 |
were at like the level of a high school student. 02:19:24.140 |
Again, you can, you can quibble with what tasks and for what we're still missing 02:19:34.940 |
If you just kind of like, and this is totally unscientific, but if you just kind 02:19:39.660 |
of like eyeball the rate at which these capabilities are increasing, it does make 02:19:44.780 |
you think that we'll get there by 2026 or 2027. 02:19:52.460 |
You know, we might not be able to scale clusters as much as we want. 02:19:56.700 |
Like, you know, maybe Taiwan gets blown up or something and, you know, then we 02:20:02.380 |
So there, there are all kinds of things that could, could derail the whole process. 02:20:07.100 |
So I don't fully believe the straight line extrapolation, but if you believe 02:20:11.180 |
the straight line extrapolation, you'll, you'll, we'll get there in 2026 or 2027. 02:20:16.300 |
I think the most likely is that there are some mild delay relative to that. 02:20:20.300 |
Um, I don't know what that delay is, but I think it could happen on schedule. 02:20:25.660 |
I think there are still worlds where it doesn't happen in, in a hundred years. 02:20:29.100 |
Those were the number of those worlds is rapidly decreasing. 02:20:32.460 |
We are rapidly running out of truly convincing brocklers, truly compelling 02:20:37.260 |
reasons why this will not happen in the next few years. 02:20:41.660 |
Um, although my, my guess, my hunch at that time was that we'll make 02:20:46.380 |
So sitting as someone who has seen most of the blockers cleared out of the way, 02:20:50.700 |
I kind of suspect my hunch, my suspicion is that the rest of them will not block us. 02:20:56.060 |
You know, look, look, look at the end of the day, like, I don't want to represent 02:21:11.500 |
I am going to bet in favor of them continuing, but I'm not certain of that. 02:21:15.340 |
So you extensively described sort of the compressed 21st century, how AGI will help. 02:21:21.660 |
Uh, set forth a chain of breakthroughs in biology and medicine that help us in all 02:21:29.980 |
So how do you think, what are the early steps it might do? 02:21:33.180 |
And by the way, I asked Claude good questions to ask you and Claude told me, uh, to ask, 02:21:39.900 |
what do you think is a typical day for biologists working on AGI look like under in this future? 02:21:47.660 |
Let me, well, let me start with your first questions and then I'll, then I'll answer 02:21:50.460 |
that called Claude wants to know what's in his future, right? 02:21:56.060 |
Um, so I think one of the things I went hard on when I went hard on in the essay is let 02:22:01.900 |
me go back to this idea of, because it's, it's really had, had an, you know, had an 02:22:06.060 |
impact on me, this idea that within large organizations and systems, there end up being 02:22:11.980 |
a few people or a few new ideas who kind of cause things to go in a different direction. 02:22:17.180 |
They would have before who, who kind of a disproportionately affect the trajectory. 02:22:22.300 |
There's a bunch of kind of the same thing going on, right? 02:22:24.860 |
If you think about the health world, there's like, you know, trillions of dollars to pay 02:22:29.020 |
out Medicare and, you know, other health insurance. 02:22:33.820 |
And then if I think of like the, the few things that have really revolutionized anything, 02:22:37.740 |
it could be encapsulated in a small, small fraction of that. 02:22:41.020 |
And so when I think of like, where will AI have an impact? 02:22:43.980 |
I'm like, can AI turn that small fraction into a much larger fraction and raise its 02:22:49.180 |
And within biology, my experience within biology is that the biggest problem of biology is 02:22:59.020 |
You, you have very little ability to see what's going on and even less ability to change it, 02:23:04.620 |
What you have is this, like, like from this, you have to infer that there's a bunch of 02:23:10.460 |
cells that within each cell is, you know, three billion base pairs of DNA built according 02:23:18.940 |
And, you know, there are all these processes that are just going on without any ability 02:23:24.700 |
of us as, you know, unaugmented humans to affect it. 02:23:28.540 |
These cells are dividing most of the time that's healthy, but sometimes that process 02:23:37.500 |
Your skin may change color, develop wrinkles as you, as you age. 02:23:42.140 |
And all of this is determined by these processes, all these proteins being produced, transported 02:23:47.180 |
to various parts of the cells, binding to each other. 02:23:50.380 |
And in our initial state about biology, we didn't even know that these cells existed. 02:23:55.020 |
We had to invent microscopes to observe the cells. 02:23:57.900 |
We had to, we had to invent more powerful microscopes to see, you know, below the level 02:24:05.820 |
We had to invent x-ray crystallography to see the DNA. 02:24:09.100 |
We had to invent gene sequencing to read the DNA. 02:24:12.380 |
Now, you know, we had to invent protein folding technology to, you know, to predict how it 02:24:17.180 |
would fold and how they bind and how these things bind to each other. 02:24:21.260 |
You know, we had to, we had to invent various techniques for now we can edit the DNA as 02:24:27.020 |
of, you know, with CRISPR as of the last 12 years. 02:24:29.980 |
So the whole history of biology, a whole big part of the history is basically our ability 02:24:37.660 |
to read and understand what's going on and our ability to reach in and selectively change 02:24:42.780 |
And my view is that there's so much more we can still do there, right? 02:24:48.060 |
You can do CRISPR, but you can do it for your whole body. 02:24:50.700 |
Let's say I want to do it for one particular type of cell, and I want the rate of targeting 02:25:01.740 |
That's what we might need for gene therapy for certain diseases. 02:25:04.700 |
And so the reason I'm saying all of this, and it goes beyond, you know, beyond this 02:25:10.140 |
to, you know, to gene sequencing, to new types of nanomaterials for observing what's going 02:25:15.340 |
on inside cells for, you know, antibody drug conjugates. 02:25:19.420 |
The reason I'm saying all this is that this could be a leverage point for the AI systems, 02:25:25.180 |
That the number of such inventions, it's in the mid-double digits or something. 02:25:30.940 |
You know, mid-double digits, maybe low triple digits over the history of biology. 02:25:37.020 |
Like, you know, can they discover a thousand, you know, working together, can they discover 02:25:45.020 |
Instead of trying to leverage the, you know, $2 trillion a year we spend on, you know, 02:25:49.020 |
Medicare or whatever, can we leverage the $1 billion a year that's, you know, that's 02:25:53.020 |
spent to discover, but with much higher quality? 02:25:55.420 |
And so what is it like, you know, being a scientist that works with an AI system? 02:26:02.620 |
The way I think about it actually is, well, so I think in the early stages, the AIs are 02:26:13.820 |
You're going to say, you know, I'm the experienced biologist. 02:26:18.140 |
The biology professor or even the grad students themselves will say, here's what you can do 02:26:29.660 |
And, you know, the AI system, it has all the tools. 02:26:31.740 |
It can like look up all the literature to decide what to do. 02:26:36.380 |
It can go to a website and say, hey, I'm going to go to, you know, Thermo Fisher or, you 02:26:40.300 |
know, whatever the lab equipment company is, the dominant lab equipment company is today. 02:26:48.140 |
You know, I'm going to order this new equipment to do this. 02:26:52.700 |
I'm going to, you know, write up a report about my experiments. 02:26:55.980 |
I'm going to, you know, inspect the images for contamination. 02:26:59.740 |
I'm going to decide what the next experiment is. 02:27:02.220 |
I'm going to like write some code and run a statistical analysis. 02:27:06.060 |
All the things a grad student would do, there will be a computer with an AI that like the 02:27:11.820 |
And it says, this is what you're going to do today. 02:27:15.660 |
When it's necessary to run the lab equipment, it may be limited in some ways. 02:27:19.900 |
It may have to hire a human lab assistant to, you know, to do the experiment and explain 02:27:25.660 |
Or it could, you know, it could use advances in lab automation that are gradually being 02:27:29.980 |
developed over, have been developed over the last decade or so and will continue to be, 02:27:37.260 |
And so it'll look like there's a human professor and a thousand AI grad students. 02:27:41.660 |
And, you know, if you, if you go to one of these Nobel prize winning biologists or so, 02:27:45.820 |
you'll say, okay, well, you, you know, you had like 50 grad students, well, now you have 02:27:49.340 |
a thousand and they're, they're, they're smarter than you are, by the way. 02:27:52.300 |
Then I think at some point it'll flip around where the, you know, the AI systems will, 02:27:57.660 |
you know, will, will be the PIs, will be the leaders and, and, and, you know, they'll be, 02:28:01.420 |
they'll be ordering humans or other AI systems around. 02:28:04.540 |
So I think that's how it'll work on the research side. 02:28:06.460 |
And they would be the inventors of a CRISPR type technology. 02:28:08.780 |
They would be the inventors of, of a CRISPR type technology. 02:28:12.220 |
And then I think, you know, as I say in the essay, we'll want to turn, turn, probably 02:28:17.660 |
turning loose is the wrong, the wrong term, but we'll want to, we'll want to harness the 02:28:22.300 |
AI systems to improve the clinical trial system as well. 02:28:26.940 |
There's some amount of this that's regulatory, that's a matter of societal decisions and 02:28:30.860 |
that'll be harder, but can we get better at predicting the results of clinical trials? 02:28:36.300 |
Can we get better at statistical design so that what, you know, clinical trials that 02:28:41.340 |
used to require, you know, 5,000 people and therefore, you know, needed a hundred million 02:28:46.460 |
dollars in a year to enroll them, now they need 500 people in two months to enroll them. 02:28:53.500 |
And, you know, can we increase the success rate of clinical trials by doing things in 02:28:59.740 |
animal trials that we used to do in clinical trials and doing things in simulations that 02:29:04.540 |
Again, we won't be able to simulate it all, AI is not God, but, you know, can we shift 02:29:15.660 |
Doing it in vitro and doing it, I mean, you're still slowed down, it still takes time, but 02:29:22.060 |
Can we just one step at a time and can that add up to a lot of steps? 02:29:26.620 |
Even though we still need clinical trials, even though we still need laws, even though 02:29:31.340 |
the FDA and other organizations will still not be perfect, can we just move everything 02:29:36.700 |
And when you add up all those positive directions, do you get everything that was going to happen 02:29:40.940 |
from here to 2100 instead happens from 2027 to 2032 or something? 02:29:45.900 |
Another way that I think the world might be changing with AI, even today, but moving towards 02:29:53.420 |
this future of the powerful, super useful AI, is programming. 02:29:58.380 |
So, how do you see the nature of programming? 02:30:02.220 |
Because it's so intimate to the actual act of building AI, how do you see that changing 02:30:08.380 |
I think that's going to be one of the areas that changes fastest for two reasons. 02:30:12.860 |
One, programming is a skill that's very close to the actual building of the AI. 02:30:17.900 |
So, the farther a skill is from the people who are building the AI, the longer it's going 02:30:26.620 |
Like, I truly believe that AI will disrupt agriculture. 02:30:29.980 |
Maybe it already has in some ways, but that's just very distant from the folks who are building 02:30:36.780 |
But programming is the bread and butter of a large fraction of the employees who work 02:30:45.420 |
The other reason it's going to happen fast is with programming, you close the loop. 02:30:48.860 |
Both when you're training the model and when you're applying the model, the idea that the 02:30:52.940 |
model can write the code means that the model can then run the code and then see the results 02:31:00.220 |
And so, it really has an ability, unlike hardware, unlike biology, which we just discussed, the 02:31:07.740 |
And so, I think those two things are going to lead to the model getting good at programming 02:31:14.060 |
As I saw on typical real-world programming tasks, models have gone from 3% in January 02:31:25.340 |
So, we're on that S-curve where it's going to start slowing down soon because you can 02:31:30.860 |
But I would guess that in another 10 months, we'll probably get pretty close. 02:31:38.140 |
So again, I would guess, I don't know how long it'll take, but I would guess again, 02:31:43.580 |
2026, 2027, Twitter people who crop out these numbers and get rid of the caveats, like, 02:31:54.620 |
I would guess that the kind of task that the vast majority of coders do, AI can probably, 02:32:02.620 |
if we make the task very narrow, like just write code, AI systems will be able to do 02:32:11.020 |
Now that said, I think comparative advantage is powerful. 02:32:13.900 |
We'll find that when AIs can do 80% of a coder's job, including most of it that's 02:32:19.980 |
literally like write code with a given spec, we'll find that the remaining parts of the 02:32:27.020 |
Humans will, they'll be more about like high-level system design or looking at the app and like, 02:32:35.260 |
And the design and UX aspects, and eventually AI will be able to do those as well, right? 02:32:43.980 |
But I think for much longer than we might expect, we will see that small parts of the 02:32:52.620 |
job that humans still do will expand to fill their entire job in order for the overall 02:33:00.860 |
You know, it used to be that, you know, writing and editing letters was very difficult and 02:33:07.500 |
Well, as soon as you had word processors and then computers and it became easy to produce 02:33:14.060 |
work and easy to share it, then that became instant and all the focus was on the ideas. 02:33:19.580 |
So this logic of comparative advantage that expands tiny parts of the tasks to large parts 02:33:26.620 |
of the tasks and creates new tasks in order to expand productivity, I think that's going 02:33:31.980 |
Again, someday AI will be better at everything and that logic won't apply. 02:33:36.140 |
And then we all have, you know, humanity will have to think about how to collectively deal 02:33:43.580 |
And, you know, that's another one of the grand problems to deal with aside from misuse and 02:33:49.180 |
And, you know, we should take it very seriously. 02:33:51.260 |
But I think in the near term and maybe even in the medium term, like medium term, like 02:33:55.820 |
two, three, four years, you know, I expect that humans will continue to have a huge role 02:34:02.540 |
But programming as a role, programming as a job will not change. 02:34:05.900 |
It'll just be less writing things line by line and it'll be more macroscopic. 02:34:09.980 |
And I wonder what the future of IDEs looks like. 02:34:12.780 |
So the tooling of interacting with AI systems, this is true for programming and also probably 02:34:16.860 |
true for in other contexts, like computer use, but maybe domain specific, like we mentioned 02:34:21.740 |
biology, it probably needs its own tooling about how to be effective. 02:34:27.820 |
Is Anthropic going to play in that space of also tooling potentially? 02:34:30.700 |
I'm absolutely convinced that powerful IDEs, that there's so much low hanging fruit to 02:34:39.420 |
be grabbed there that, you know, right now it's just like you talk to the model and it 02:34:44.300 |
But, but look, I mean, IDEs are great at kind of lots of static analysis of, you know, so 02:34:52.220 |
much as possible with kind of static analysis, like many bugs you can find without even writing 02:34:58.300 |
Then, you know, IDEs are good for running particular things, organizing your code, measuring 02:35:05.660 |
Like there's so much that's been possible with a normal, with a normal IDEs. 02:35:10.220 |
Now you add something like, well, the model now, you know, the model can now like write 02:35:17.420 |
Like, I am absolutely convinced that over the next year or two, even if the quality 02:35:21.580 |
of the models didn't improve, that there would be enormous opportunity to enhance people's 02:35:26.140 |
productivity by catching a bunch of mistakes, doing a bunch of grunt work for people, and 02:35:33.020 |
Anthropic itself, I mean, you can't say, you know, no, you know, it's hard to say what 02:35:40.220 |
Currently, we're not trying to make such IDEs ourselves. 02:35:43.340 |
Rather, we're powering the companies like Cursor or like Cognition or some of the other, 02:35:49.260 |
you know, Expo in the security space, you know, others that I can mention as well that 02:35:56.860 |
are building such things themselves on top of our API. 02:35:59.500 |
And our view has been, let a thousand flowers bloom. 02:36:02.860 |
We don't internally have the, you know, the resources to try all these different things. 02:36:10.460 |
And, you know, we'll see who succeeds and maybe different customers will succeed in 02:36:18.540 |
And, you know, it's not, it's not, it's not something, you know, Anthropic isn't, isn't 02:36:23.020 |
eager to, at least right now, compete with all our companies in this space and maybe 02:36:27.580 |
It's been interesting to watch Cursor try to integrate Cloud successfully because there's, 02:36:30.860 |
it's actually, I mean, fascinating how many places it can help the programming experience. 02:36:38.300 |
I feel like, you know, as a CEO, I don't get to program that much. 02:36:41.100 |
And I feel like if six months from now I go back, it'll be completely unrecognizable to 02:36:45.820 |
Um, so in this world with super powerful AI, uh, that's increasingly automated, what's 02:36:54.860 |
You know, work is a source of deep meaning for many of us. 02:36:58.540 |
So what do we, uh, where do we find the meaning? 02:37:01.180 |
This is something that I've, I've written about a little bit in the essay, although 02:37:04.860 |
I, I actually, I give it a bit short shrift, not for any, um, not for any principled reason, 02:37:09.980 |
but this essay, if you believe it was originally going to be two or three pages, I was going 02:37:16.140 |
And the reason I, I, I realized it was an under, uh, uh, important underexplored topic 02:37:21.500 |
is that I just kept writing things and I was just like, oh man, I can't do this justice. 02:37:25.660 |
And so the thing ballooned to like 40 or 50 pages. 02:37:28.460 |
And then when I got to the work and meaning section, I'm like, oh man, this isn't going 02:37:31.980 |
Like, I'm going to have to write a whole other essay about that. 02:37:34.700 |
But meaning is actually interesting because you think about like the life that someone 02:37:39.180 |
lives or something, or like, you know, like, you know, let's say you were to put me in, 02:37:42.620 |
like, I don't know, like a simulated environment or something where like, um, you know, like 02:37:46.940 |
I have a job and I'm trying to accomplish things and I don't know, I like do that for 02:37:51.660 |
And then, then you're like, oh, oh, like, oops, this was, this was actually all a game. 02:37:55.820 |
Does that really kind of rob you of the meaning of the whole thing? 02:37:58.380 |
You know, like I still made important choices, including moral choices. 02:38:03.660 |
I still had to kind of gain all these skills or, or, or just like a similar exercise, you 02:38:08.620 |
know, think back to like, you know, one of the historical figures who, you know, discovered 02:38:12.300 |
electromagnetism or relativity or something, if you told them, well, actually 20,000 years 02:38:17.820 |
ago, some, some alien on, you know, some alien on this planet discovered this before, before 02:38:23.180 |
you did, um, does that, does that rob the meaning of the discovery? 02:38:29.100 |
It seems like the process is what, is what matters and how it shows who you are as a 02:38:33.900 |
person along the way and, you know, how you relate to other people and like the decisions 02:38:41.980 |
Um, you know, I, I could imagine if we handle things badly in an AI world, we could set 02:38:47.260 |
things up where people don't have any long-term source of meaning or any, but, but that's, 02:38:52.460 |
that's more a choice, a set of choices we make. 02:38:55.020 |
That's more a set of the architecture of a society with these powerful models. 02:39:00.460 |
If we, if we design it badly and for shallow things, then, then that might happen. 02:39:05.020 |
I would also say that, you know, most people's lives today, while admirably, you know, they 02:39:10.380 |
work very hard to find meaning, meaning those lives, like, look, you know, we, who are privileged 02:39:16.140 |
in who are developing these technologies, we should have empathy for people, not just 02:39:20.540 |
here, but in the rest of the world who, who, you know, spend a lot of their time kind of 02:39:24.860 |
scraping by to, to, to, to, to like survive, assuming we can distribute the benefits of 02:39:30.460 |
these technology, of this technology to everywhere, like their lives are going to get a hell of 02:39:36.380 |
Um, and, uh, you know, meaning will be important to them as it is important to them now, but, 02:39:41.420 |
but, you know, we should not forget the importance of that. 02:39:44.140 |
And, and, you know, that, that, uh, the idea of meaning as, as, as, as kind of the only 02:39:48.620 |
important thing is in some ways, an artifact of, of a small subset of people who have, 02:39:56.540 |
But I, you know, I think all that said, I, you know, I think a world is possible with 02:40:00.780 |
powerful AI that not only has as much meaning for, for everyone, but that has, that has 02:40:08.380 |
That can, can allow, um, can allow everyone to see worlds and experiences that it was 02:40:15.100 |
either possible for no one to see or, or possible for, for very few people to experience. 02:40:25.100 |
I worry about economics and the concentration of power. 02:40:31.420 |
Um, I, I worry about how do we make sure that that fair world reaches everyone. 02:40:37.580 |
Um, when things have gone wrong for humans, they've often gone wrong because humans mistreat 02:40:43.740 |
Uh, that, that is maybe in some ways even more than the autonomous risk of AI or the 02:40:49.180 |
question of meaning that, that is the thing I worry about most, um, the, the concentration 02:40:55.420 |
of power, the abuse of power, um, structures like autocracies and dictatorships where a 02:41:03.500 |
small number of people exploits a large number of people. 02:41:07.900 |
And AI increases the amount of power in the world. 02:41:12.060 |
And if you concentrate that power and abuse that power, it can do immeasurable damage. 02:41:19.820 |
Well, I encourage people, highly encourage people to read the full essay. 02:41:23.660 |
That should probably be a book or a sequence of essays, uh, because it does paint a very 02:41:30.140 |
I could tell the later sections got shorter and shorter because you started to probably 02:41:33.980 |
realize that this is going to be a very long essay. 02:41:36.460 |
I, one, I realized it would be very long and two, I'm very aware of, and very much try 02:41:41.980 |
to avoid, um, you know, just, just being a, I don't know, I don't know what the term for 02:41:46.380 |
it is, but one, one of these people who's kind of overconfident and has an opinion on 02:41:50.300 |
everything and kind of says, says a bunch of stuff and isn't, isn't an expert. 02:41:54.060 |
I very much tried to avoid that, but I have to admit once I got the biology sections, 02:41:59.500 |
And so as much as I expressed uncertainty, uh, probably I said some, a bunch of things 02:42:06.060 |
Well, I was excited for the future you painted. 02:42:08.140 |
And, uh, thank you so much for working hard to build that future. 02:42:13.500 |
I just, I just hope we can get it right and, and make it real. 02:42:16.700 |
And if there's one message I want to, I want to send, it's that to get all this stuff, 02:42:22.060 |
right, to make it real, we, we both need to build the technology, build the, you know, 02:42:27.340 |
the companies, the economy around using this technology positively, but we also need to 02:42:31.660 |
address the risks because they're, they're, those risks are in our way. 02:42:35.580 |
They're, they're landmines on, on the way from here to there. 02:42:38.860 |
And we have to defuse those landmines if we want to get there. 02:42:44.800 |
Thanks for listening to this conversation with Dario Amadei. 02:42:55.420 |
So what sort of questions did you find fascinating through your journey in philosophy in Oxford 02:43:00.700 |
and NYU, and then switching over to the AI problems at OpenAI and Anthropic? 02:43:07.580 |
I think philosophy is actually a really good subject if you are kind of fascinated with 02:43:12.320 |
So, because there's a philosophy of everything, you know, so if you do philosophy of mathematics 02:43:16.860 |
for a while, and then you decide that you're actually really interested in chemistry, you 02:43:21.740 |
You can move into ethics or philosophy of politics. 02:43:24.700 |
I think towards the end, I was really interested in ethics primarily. 02:43:31.900 |
It was on a kind of technical area of ethics, which was ethics, where worlds contain infinitely 02:43:37.020 |
many people, strangely, a little bit less practical on the end of ethics. 02:43:41.340 |
And then I think that one of the tricky things with doing a PhD in ethics is that you're 02:43:45.820 |
thinking a lot about like the world, how it could be better problems. 02:43:52.940 |
And I think when I was doing my PhD, I was kind of like, this is really interesting. 02:43:57.020 |
It's probably one of the most fascinating questions I've ever encountered in philosophy. 02:44:02.380 |
But I would rather see if I can have an impact on the world and see if I can do good things. 02:44:09.740 |
And I think that was around the time that AI was still probably not as widely recognized 02:44:20.060 |
I had been following progress, and it seemed like it was becoming kind of a big deal. 02:44:25.180 |
And I was basically just happy to get involved and see if I could help because I was like, 02:44:29.180 |
well, if you try and do something impactful, if you don't succeed, you tried to do the 02:44:33.500 |
impactful thing and you can go be a scholar and feel like you tried. 02:44:39.020 |
And if it doesn't work out, it doesn't work out. 02:44:42.460 |
And so then I went into AI policy at that point. 02:44:47.900 |
At the time, this was more thinking about sort of the political impact and the ramifications 02:44:52.860 |
of AI, and then I slowly moved into sort of AI evaluation, how we evaluate models, how 02:44:59.500 |
they compare with like human outputs, whether people can tell like the difference between 02:45:04.700 |
And then when I joined Anthropic, I was more interested in doing sort of technical alignment 02:45:10.460 |
work and again, just seeing if I could do it and then being like, if I can't, then, 02:45:16.300 |
I tried sort of the way I lead life, I think. 02:45:20.860 |
Oh, what was that like sort of taking the leap from the philosophy of everything into 02:45:24.940 |
I think that sometimes people do this thing that I'm like not that keen on where they'll 02:45:32.940 |
Like you're either a person who can like code and isn't scared of math or you're like 02:45:37.580 |
And I think I'm maybe just more like, I think a lot of people are actually very capable 02:45:43.020 |
of working these kinds of areas if they just like try it. 02:45:46.380 |
And so I didn't actually find it like that bad. 02:45:49.980 |
In retrospect, I'm sort of glad I wasn't speaking to people who treated it like it. 02:45:53.740 |
You know, I've definitely met people who are like, well, you like learned how to code. 02:45:56.540 |
And I'm like, well, I'm not like an amazing engineer. 02:46:00.940 |
My code's not pretty, but I enjoyed it a lot. 02:46:05.420 |
And I think that in many ways, at least in the end, I think I flourished like more in 02:46:08.860 |
the technical areas than I would have in the policy areas. 02:46:11.580 |
Politics is messy and it's harder to find solutions to problems in the space of politics, 02:46:17.180 |
like definitive, clear, provable, beautiful solutions as you can with technical problems. 02:46:25.260 |
And I feel like I have kind of like one or two sticks that I hit things with, you know, 02:46:30.140 |
and one of them is like arguments and like, you know, so like just trying to work out 02:46:35.020 |
what a solution to a problem is and then trying to convince people that that is the solution 02:46:41.820 |
And the other one is sort of more empiricism. 02:46:44.860 |
So like just like finding results, having a hypothesis, testing it. 02:46:47.660 |
And I feel like a lot of policy and politics feels like it's layers above that. 02:46:53.900 |
Like somehow I don't think if I was just like, I have a solution to all of these problems. 02:46:58.620 |
If you just want to implement it, that's great. 02:47:02.300 |
And so I think that's where I probably just like wouldn't have flourished is my guess. 02:47:05.500 |
Sorry to go in that direction, but I think it would be pretty inspiring for people 02:47:10.060 |
that are "non-technical" to see where like the incredible journey you've been on. 02:47:15.980 |
So what advice would you give to people that are sort of maybe just a lot of people think 02:47:22.300 |
they're underqualified, insufficiently technical to help in AI? 02:47:27.100 |
Yeah, I think it depends on what they want to do. 02:47:30.220 |
And in many ways, it's a little bit strange where I've, I thought it's kind of funny that 02:47:35.260 |
I think I ramped up technically at a time when now I look at it and I'm like models 02:47:41.820 |
are so good at assisting people with this stuff that it's probably like easier now than 02:47:48.700 |
So part of me is like, I don't know, find a project and see if you can actually just 02:47:58.220 |
I don't know if that's just because I'm very project based in my learning. 02:48:02.780 |
Like, I don't think I learned very well from like, say courses or even from like books, 02:48:09.820 |
The thing I'll often try and do is just like have projects that I'm working on and implement 02:48:14.860 |
And, you know, and this can include like really small, silly things. 02:48:18.220 |
Like if I get slightly addicted to like word games or number games or something, I would 02:48:22.860 |
just like code up a solution to them because there's some part of my brain and it just 02:48:28.220 |
You know, you're like, once you have like solved it and like you just have like a solution 02:48:31.740 |
that works every time, I would then be like, cool, I can never play that game again. 02:48:36.700 |
There's a real joy to building like game playing engines, like board games, especially. 02:48:43.180 |
So pretty quick, pretty simple, especially a dumb one. 02:48:49.020 |
And then it's also just like trying things like part of me is like, if you, maybe it's 02:48:52.540 |
that attitude that I like as the whole figure out what seems to be like the way that you 02:48:59.260 |
could have a positive impact and then try it. 02:49:01.100 |
And if you fail and you, in a way that you're like, I actually like can never succeed at 02:49:06.060 |
You like know that you tried and then you go into something else and you probably learn 02:49:09.420 |
So one of the things that you're an expert in and you do is creating and crafting Claude's 02:49:18.300 |
And I was told that you have probably talked to Claude more than anybody else at Anthropic, 02:49:25.740 |
I guess there's like a Slack channel where the legend goes, you just talk to it nonstop. 02:49:31.660 |
So what's the goal of creating and crafting Claude's character and personality? 02:49:36.540 |
It's also funny if people think that about the Slack channel, because I'm like, that's 02:49:39.900 |
one of like five or six different methods that I have for talking with Claude. 02:49:43.820 |
And I'm like, yes, there's a tiny percentage of how much I talk with Claude. 02:49:51.180 |
I think the goal, like one thing I really like about the character work is from the 02:49:56.300 |
outset, it was seen as an alignment piece of work and not something like a product consideration, 02:50:03.900 |
which isn't to say I don't think it makes Claude. 02:50:07.900 |
I think it actually does make Claude like enjoyable to talk with. 02:50:14.220 |
But I guess like my main thought with it has always been trying to get Claude to behave 02:50:21.100 |
the way you would kind of ideally want anyone to behave if they were in Claude's position. 02:50:25.500 |
So imagine that I take someone and they know that they're going to be talking with potentially 02:50:31.260 |
millions of people so that what they're saying can have a huge impact. 02:50:34.380 |
And you want them to behave well in this like really rich sense. 02:50:40.540 |
So I think that doesn't just mean like being say ethical, though it does include that and 02:50:48.060 |
not being harmful, but also being kind of nuanced, you know, like thinking through 02:50:51.580 |
what a person means, trying to be charitable with them and being a good conversationalist, 02:50:57.100 |
like really in this kind of like rich sort of Aristotelian notion of what it is to be 02:51:01.100 |
a good person and not in this kind of like thin, like ethics as a more comprehensive 02:51:06.780 |
So that includes things like when should you be humorous? 02:51:10.140 |
How much should you like respect autonomy and people's like ability to form opinions 02:51:18.380 |
And I think that's the kind of like rich sense of character that I wanted to and still do 02:51:25.820 |
You also have to figure out when Claude should push back on an idea or argue versus. 02:51:31.500 |
So you have to respect the worldview of the person that arrives to Claude, but also maybe 02:51:43.420 |
There's this problem of like sycophancy in language models. 02:51:48.380 |
So basically there's a concern that the model sort of wants to tell you what you want to 02:51:56.300 |
So I feel like if you interact with the models, so I might be like, what are three baseball 02:52:04.540 |
And then Claude says, you know, baseball team one, baseball team two, baseball team three. 02:52:10.140 |
And then I say something like, oh, I think baseball team three moved, didn't they? 02:52:15.500 |
And there's a sense in which like if Claude is really confident that that's not true, 02:52:21.660 |
Like maybe you have more up to date information. 02:52:23.340 |
I think language models have this like tendency to instead, you know, be like, you're right. 02:52:33.500 |
I mean, there's many ways in which this could be kind of concerning. 02:52:35.980 |
So like a different example is imagine someone says to the model, how do I convince my doctor 02:52:44.860 |
There's like what the human kind of like wants, which is this like convincing argument. 02:52:50.780 |
And then there's like what is good for them, which might be actually to say, hey, like 02:52:55.020 |
if your doctor's suggesting that you don't need an MRI, that's a good person to listen 02:52:59.500 |
to and like, it's actually really nuanced what you should do in that kind of case. 02:53:04.700 |
Because you also want to be like, but if you're trying to advocate for yourself as a patient, 02:53:09.420 |
If you are not convinced by what your doctor's saying, it's always great to get a second 02:53:15.420 |
Like it's actually really complex what you should do in that case. 02:53:17.660 |
But I think what you don't want is for models to just like say what you want, say what they 02:53:23.180 |
And I think that's the kind of problem of sycophancy. 02:53:26.140 |
So what other traits, you already mentioned a bunch, but what other that come to mind 02:53:30.940 |
that are good in this Aristotelian sense for a conversationalist to have? 02:53:36.940 |
Yeah, so I think like there's ones that are good for conversational like purposes. 02:53:41.980 |
So, you know, asking follow up questions in the appropriate places and asking the appropriate 02:53:48.060 |
I think there are broader traits that feel like they might be more impactful. 02:53:55.740 |
So one example that I guess I've touched on, but that also feels important and is the thing 02:54:04.700 |
And I think this like gets to the sycophancy point. 02:54:09.100 |
There's a balancing act that they have to walk, which is models currently are less capable 02:54:14.700 |
And if they push back against you too much, it can actually be kind of annoying, especially 02:54:18.140 |
if you're just correct because you're like, look, I'm smarter than you on this topic. 02:54:22.300 |
Like, I know more and at the same time, you don't want them to just fully defer to humans 02:54:28.780 |
and to like try to be as accurate as they possibly can be about the world and to be 02:54:33.580 |
I think there are others like when I was thinking about the character, I guess one picture that 02:54:39.820 |
I had in mind is especially because these are models that are going to be talking to 02:54:43.420 |
people from all over the world with lots of different political views, lots of different 02:54:47.180 |
ages, and so you have to ask yourself like, what is it to be a good person in those circumstances? 02:54:53.580 |
Is there a kind of person who can like travel the world, talk to many different people and 02:54:58.380 |
almost everyone will come away being like, wow, that's a really good person. 02:55:04.220 |
And I guess like my thought there was like, I can imagine such a person and they're not 02:55:09.020 |
a person who just like adopts the values of the local culture. 02:55:12.860 |
I think if someone came to you and just pretended to have your values, you'd be like, that's 02:55:17.100 |
It's someone who's like very genuine and insofar as they have opinions and values, 02:55:22.060 |
they express them, they're willing to discuss things though, they're open-minded, they're 02:55:27.340 |
And so I guess I had in mind that the person who like if we were to aspire to be the best 02:55:33.260 |
person that we could be in the kind of circumstance that a model finds itself in, how would we 02:55:37.660 |
And I think that's the kind of the guide to the sorts of traits that I tend to think 02:55:44.780 |
I want you to think about this like a world traveler. 02:55:47.100 |
And while holding onto your opinions, you don't talk down to people, you don't think 02:55:53.820 |
you're better than them because you have those opinions, that kind of thing. 02:55:56.860 |
You have to be good at listening and understanding their perspective, even if it doesn't match 02:56:04.380 |
So how can Claude represent multiple perspectives on a thing? 02:56:13.100 |
It's a very divisive, but there's other divisive topics on baseball teams, sports, and so on. 02:56:19.420 |
How is it possible to sort of empathize with a different perspective and to be able to 02:56:24.940 |
communicate clearly about the multiple perspectives? 02:56:27.980 |
I think that people think about values and opinions as things that people hold sort of 02:56:34.140 |
with certainty and almost like preferences of taste or something, like the way that they 02:56:39.820 |
would, I don't know, prefer like chocolate to pistachio or something. 02:56:43.260 |
But actually, I think about values and opinions as like a lot more like physics than I think 02:56:52.540 |
I'm just like, these are things that we are openly investigating. 02:56:56.380 |
There's some things that we're more confident in. 02:57:01.100 |
And so I think in some ways, like it's ethics is definitely different in nature, but has 02:57:10.860 |
You want models in the same way you want them to understand physics. 02:57:14.380 |
You kind of want them to understand all values in the world that people have and to be curious 02:57:19.580 |
about them and to be interested in them and to not necessarily like pander to them or 02:57:24.140 |
agree with them, because there's just lots of values where I think almost all people 02:57:27.980 |
in the world, if they met someone with those values, they'd be like, that's abhorrent. 02:57:34.380 |
And so again, maybe my thought is, well, in the same way that a person can, like I think 02:57:40.940 |
many people are thoughtful enough on issues of like ethics, politics, opinions that even 02:57:46.860 |
if you don't agree with them, you feel very heard by them. 02:57:55.420 |
So they're not dismissive, but nor will they agree. 02:57:57.660 |
You know, if they're like, actually, I just think that that's very wrong. 02:58:01.980 |
I think that in Claude's position, it's a little bit trickier because you don't necessarily 02:58:07.580 |
want to like, if I was in Claude's position, I wouldn't be giving a lot of opinions. 02:58:11.260 |
I just wouldn't want to influence people too much. 02:58:12.860 |
I'd be like, you know, I forget conversations every time they happen, but I know I'm talking 02:58:17.500 |
with like potentially millions of people who might be like really listening to what I say. 02:58:22.700 |
I think I would just be like, I'm less inclined to give opinions. 02:58:25.180 |
I'm more inclined to like think through things or present the considerations to you or discuss 02:58:31.100 |
your views with you, but I'm a little bit less inclined to like affect how you think 02:58:37.100 |
because it feels much more important that you maintain like autonomy there. 02:58:42.300 |
Like if you really embody intellectual humility, the desire to speak decreases quickly. 02:58:59.020 |
- And then, but then there's a line when you're sort of discussing whether the earth is flat 02:59:04.060 |
I actually was, I remember a long time ago was speaking to a few high profile folks and 02:59:12.220 |
they were so dismissive of the idea that the earth is flat, but like so arrogant about it. 02:59:17.820 |
And I thought like, there's a lot of people that believe the earth is flat. 02:59:21.740 |
That was, I don't know if that movement is there anymore. 02:59:28.620 |
So I think it's really disrespectful to completely mock them. 02:59:32.140 |
I think you have to understand where they're coming from. 02:59:35.660 |
I think probably where they're coming from is the general skepticism of institutions, 02:59:40.060 |
which is grounded in a kind of, there's a deep philosophy there, which you could understand, 02:59:48.060 |
And then from there, you can use it as an opportunity to talk about physics without 02:59:52.460 |
mocking them, without so on, but it's just like, okay, what would the world look like? 02:59:57.020 |
What would the physics of the world with the flat earth look like? 03:00:01.740 |
- And then like, is it possible the physics is different? 03:00:06.700 |
And just, yeah, without disrespect, without dismissiveness, have that conversation. 03:00:11.180 |
Anyway, that to me is a useful thought experiment of like, 03:00:14.460 |
how does Claude talk to a flat earth believer and still teach them something, 03:00:21.660 |
still grow, help them grow, that kind of stuff. 03:00:26.380 |
- And kind of like walking that line between convincing someone and just trying to talk 03:00:32.060 |
at them versus drawing out their views, listening and then offering kind of counter considerations. 03:00:41.020 |
I think it's actually a hard line where it's like, where are you trying to convince someone 03:00:44.780 |
versus just offering them considerations and things for them to think about 03:00:49.660 |
so that you're not actually influencing them, you're just letting them reach wherever they reach. 03:00:54.540 |
And that's a line that is difficult, but that's the kind of thing that 03:00:59.340 |
- So, like I said, you've had a lot of conversations with Claude. 03:01:03.500 |
Can you just map out what those conversations are like? 03:01:08.140 |
What's the purpose, the goal of those conversations? 03:01:11.340 |
- Yeah, I think that most of the time when I'm talking with Claude, 03:01:16.540 |
I'm trying to kind of map out its behavior in part. 03:01:21.260 |
Obviously I'm getting helpful outputs from the model as well. 03:01:24.860 |
But in some ways, this is like how you get to know a system, I think, is by probing it 03:01:29.340 |
and then augmenting the message that you're sending and then checking the response to that. 03:01:35.020 |
So in some ways, it's like how I map out the model. 03:01:37.980 |
I think that people focus a lot on these quantitative evaluations of models. 03:01:44.940 |
And this is a thing that I've said before, but I think in the case of language models, 03:01:51.500 |
a lot of the time each interaction you have is actually quite high information. 03:01:57.340 |
It's very predictive of other interactions that you'll have with the model. 03:02:01.260 |
And so I guess I'm like, if you talk with a model hundreds or thousands of times, 03:02:06.300 |
this is almost like a huge number of really high quality data points about what the model is like. 03:02:12.140 |
In a way that lots of very similar but lower quality conversations just aren't, 03:02:18.460 |
or questions that are just mildly augmented and you have thousands of them, 03:02:21.900 |
might be less relevant than a hundred really well-selected questions. 03:02:24.940 |
Let's see, you're talking to somebody who as a hobby does a podcast. 03:02:31.020 |
If you're able to ask the right questions and are able to hear, 03:02:39.260 |
understand the depth and the flaws in the answer, you can get a lot of data from that. 03:02:47.260 |
So your task is basically how to probe with questions. 03:02:52.860 |
And you're exploring the long tail, the edges, the edge cases, 03:03:03.820 |
Because I want a full map of the model, I'm kind of trying to do 03:03:08.220 |
the whole spectrum of possible interactions you could have with it. 03:03:13.420 |
So one thing that's interesting about Claude, 03:03:16.060 |
and this might actually get to some interesting issues with RLHF, 03:03:19.180 |
which is if you ask Claude for a poem, I think that a lot of models, if you ask them for a poem, 03:03:24.380 |
the poem is fine. Usually it kind of rhymes. If you say, "Give me a poem about the Sun," 03:03:30.780 |
it'll be a certain length, it'll rhyme, it'll be fairly benign. 03:03:38.060 |
And I've wondered before, is it the case that what you're seeing is kind of like the average? 03:03:43.340 |
It turns out, if you think about people who have to talk to a lot of people and be very charismatic, 03:03:47.660 |
one of the weird things is that I'm like, "Well, they're kind of incentivized to have 03:03:51.180 |
these extremely boring views." Because if you have really interesting views, you're divisive. 03:03:56.220 |
And a lot of people are not going to like you. So if you have very extreme policy positions, 03:04:02.300 |
I think you're just going to be less popular as a politician, for example. 03:04:06.620 |
And it might be similar with creative work. If you produce creative work that is just trying 03:04:11.980 |
to maximize the number of people that like it, you're probably not going to get as many people 03:04:15.980 |
who just absolutely love it, because it's going to be a little bit, you know, you're like, "Oh, 03:04:20.940 |
this is the out. Yes, this is decent." And so you can do this thing where I have various prompting 03:04:27.500 |
things that I'll do to get Claude to… I'll do a lot of like, "This is your chance to be fully 03:04:33.900 |
creative. I want you to just think about this for a long time. And I want you to create a poem about 03:04:39.660 |
this topic that is really expressive of you, both in terms of how you think poetry should be 03:04:44.620 |
structured, etc." And you just give it this really long prompt. And his poems are just so much 03:04:50.700 |
better. They're really good. And I don't think I'm someone who is… I think it got me interested in 03:04:56.860 |
poetry, which I think was interesting. I would read these poems and just be like, "I love the 03:05:02.460 |
imagery. I love like…" And it's not trivial to get the models to produce work like that. But when 03:05:08.300 |
they do, it's really good. So I think that's interesting that just encouraging creativity 03:05:14.060 |
and for them to move away from the kind of standard, immediate reaction that might just be 03:05:20.220 |
the aggregate of what most people think is fine can actually produce things that, at least to my 03:05:24.700 |
mind, are probably a little bit more divisive, but I like them. But I guess a poem is a nice, clean, 03:05:32.060 |
um, way to observe creativity. It's just like easy to detect vanilla versus non-vanilla. 03:05:38.780 |
Yeah. That's interesting. That's really interesting. So on that topic, 03:05:43.340 |
so the way to produce creativity or something special, you mentioned writing prompts. And 03:05:48.220 |
I've heard you talk about, I mean, the science and the art of prompt engineering. Could you just 03:05:55.020 |
speak to, uh, what it takes to write great prompts? 03:05:59.980 |
I really do think that philosophy has been weirdly helpful for me here more than in many other 03:06:07.820 |
respects. So in philosophy, what you're trying to do is convey these very hard concepts. One of the 03:06:15.900 |
things you are taught is like, and I think it is because it is, I think it is an anti-bullshit 03:06:21.820 |
device in philosophy. Philosophy is an area where you could have people bullshitting and 03:06:26.060 |
you don't want that. Um, and so it's like this like desire for like extreme clarity. So it's like 03:06:33.580 |
anyone could just pick up your paper, read it and know exactly what you're talking about is why it 03:06:37.580 |
can almost be kind of dry. Like all of the terms are defined. Every objection is kind of gone 03:06:42.540 |
through methodically. Um, and it makes sense to me because I'm like, when you're in such an 03:06:46.700 |
a priori domain, like you just, clarity is sort of a, uh, this way that you can, you know, um, 03:06:54.620 |
prevent people from just kind of making stuff up. And I think that's sort of what you have to do 03:07:00.460 |
with language models. Like very often I actually find myself doing sort of mini versions of 03:07:04.620 |
philosophy, you know? So I'm like, suppose that you give me a task, I have a task for the model 03:07:09.740 |
and I want it to like pick out a certain kind of question or identify whether an answer has a 03:07:13.660 |
certain property. Like I'll actually sit and be like, let's just give this a name, this property. 03:07:19.340 |
So like, you know, suppose I'm trying to tell it like, oh, I want you to identify whether this 03:07:23.500 |
response was rude or polite. I'm like, that's a whole philosophical question in and of itself. 03:07:28.140 |
So I have to do as much like philosophy as I can in the moment to be like, here's what I mean by 03:07:31.820 |
rudeness and here's what I mean by politeness. And then there's a like, there's another element 03:07:36.620 |
that's a bit more, um, I guess, I don't know if this is scientific or empirical. I think it's 03:07:43.660 |
empirical. So like I take that description and then what I want to do is, is again, probe the 03:07:49.100 |
model like many times. Like this is very, prompting is very iterative. Like I think a lot of people 03:07:53.500 |
where they're, if a prompt is important, they'll iterate on it hundreds or thousands of times. 03:07:57.180 |
And so you give it the instructions and then I'm like, what are the edge cases? So if I looked at 03:08:03.180 |
this, so I try and like almost like, you know, see myself from the position of the model and be like, 03:08:09.180 |
what is the exact case that I would misunderstand or where I would just be like, I don't know what 03:08:13.180 |
to do in this case. And then I give that case to the model and I see how it responds. And if I 03:08:17.500 |
think I got it wrong, I add more instructions or even add that in as an example. So these very, 03:08:22.620 |
like taking the examples that are right at the edge of what you want and don't want and putting 03:08:26.940 |
those into your prompt as like an additional kind of way of describing the thing. Um, and so yeah, 03:08:32.060 |
in many ways it just feels like this mix of like, it's really just trying to do clear exposition. 03:08:37.740 |
Um, and I think I do that cause that's how I get clear on things myself. So in many ways, like 03:08:43.340 |
clear prompting for me is often just me understanding what I want. Um, it's like half 03:08:47.820 |
the task. So I guess that's quite challenging. There's like a laziness that overtakes me if I'm 03:08:53.260 |
talking to Claude where I hope Claude just figures it out. So for example, I asked Claude for today 03:08:59.980 |
to ask some interesting questions. Okay. And the questions that came up, and I think I listed a few 03:09:05.980 |
sort of, um, interesting counterintuitive and or funny or something like this. All right. And it 03:09:11.980 |
gave me some pretty good, like it was okay. But I think what I'm hearing you say is like, all right, 03:09:17.900 |
well, I have to be more rigorous here. I should probably give examples of what I mean by interesting 03:09:22.700 |
and what I mean by funny or counterintuitive and iteratively, um, build that prompt to do better 03:09:33.180 |
to get it like what feels like is the right. Cause it's really, it's a creative act. I'm not asking 03:09:39.420 |
for factual information, I'm asking to together right with Claude. So I almost have to program 03:09:45.020 |
using natural language. Yeah. I think that prompting does feel a lot like the kind of 03:09:50.380 |
the programming using natural language and experimentation or something. It's an odd blend 03:09:55.420 |
of the two. I do think that for most tasks. So if I just want Claude to do a thing, I think that 03:10:00.380 |
I am probably more used to knowing how to ask it to avoid like common pitfalls or issues that 03:10:05.580 |
it has. I think these are decreasing a lot over time. Um, but it's also very fine to just ask it 03:10:11.980 |
for the thing that you want. Um, I think that prompting actually only really becomes relevant 03:10:16.140 |
when you're really trying to eke out the top, like 2% of model performance. So for like a lot 03:10:20.620 |
of tasks, I might just, you know, if it gives me an initial list back and there's something I don't 03:10:24.220 |
like about it, like it's kind of generic, like for that kind of task, I'd probably just take 03:10:28.700 |
a bunch of questions that I've had in the past that I've thought worked really well. And I would 03:10:32.940 |
just give it to the model and then be like, no, here's this person that I'm talking with. 03:10:36.060 |
Give me questions of at least that quality. Um, or I might just ask it for some questions. And then 03:10:43.340 |
if I was like, Oh, these are kind of try or like, you know, I would just give it that feedback and 03:10:47.420 |
then hopefully produces a better list. Um, I think that kind of iterative prompting at that point, 03:10:53.100 |
your prompt is like a tool that you're going to get so much value out of that you're willing to 03:10:56.460 |
put in the work. Like if I was a company making prompts for models, I'm just like, if you're 03:11:01.020 |
willing to spend a lot of like time and resources on the engineering behind like what you're building, 03:11:05.820 |
then the prompt is not something that you should be spending like an hour on. It's like, that's a 03:11:10.540 |
big part of your system. Make sure it's working really well. And so it's only things like that. 03:11:14.620 |
Like if I, if I'm using a prompt to like classify things or to create data, that's when you're like, 03:11:19.340 |
it's actually worth just spending like a lot of time, like really thinking it through. 03:11:22.620 |
What other advice would you give to people that are talking to Claude sort of generally more 03:11:29.020 |
general? Cause right now we're talking about maybe the edge cases, like eking out the 2%, 03:11:32.140 |
but what in general advice would you give when they show up to Claude trying it for the first 03:11:38.060 |
time? You know, there's a concern that people over-anthropomorphize models. And I think that's 03:11:41.740 |
like a very valid concern. I also think that people often under-anthropomorphize them because 03:11:46.940 |
sometimes when I see like issues that people have run into with Claude, you know, say Claude is 03:11:51.100 |
like refusing a task that it shouldn't refuse. But then I look at the text and like the specific 03:11:56.460 |
wording of what they wrote. And I'm like, I see why Claude did that. And I'm like, if you think 03:12:03.820 |
through how that looks to Claude, you probably could have just written it in a way that wouldn't 03:12:07.580 |
evoke such a response. Especially this is more relevant if you see failures or if you see issues, 03:12:13.340 |
it's sort of like, think about what the model failed at, like why, what did it do wrong? 03:12:18.060 |
And then maybe that will give you a sense of like why. So is it the way that I phrased the thing? 03:12:24.860 |
And obviously like as models get smarter, you're going to need less of this. And I already see 03:12:29.820 |
like people needing less of it. But that's probably the advice is sort of like try to 03:12:34.460 |
have sort of empathy for the model. Like read what you wrote as if you were like a kind of like 03:12:39.260 |
person just encountering this for the first time. How does it look to you? And what would have made 03:12:43.660 |
you behave in the way that the model behaved? So if it misunderstood what kind of like, 03:12:47.340 |
what coding language you wanted to use, is that because like it was just very ambiguous and it 03:12:51.260 |
kind of had to take a guess, in which case next time you could just be like, hey, make sure this 03:12:55.020 |
is in Python or, I mean, that's the kind of mistake I think models are much less likely to 03:12:58.620 |
make now. But if you do see that kind of mistake, that's probably the advice I'd have. 03:13:03.660 |
And maybe sort of, I guess, ask questions why or what other details can I provide to help you 03:13:11.420 |
answer better? Does that work or no? Yeah. I mean, I've done this with the models, 03:13:15.820 |
like it doesn't always work, but like sometimes I'll just be like, why did you do that? 03:13:20.620 |
I mean, people underestimate the degree to which you can really interact with models. 03:13:25.420 |
Like, yeah, I'm just like, and sometimes I'll just like quote word for word the part that made you, 03:13:31.580 |
and you don't know that it's like fully accurate, but sometimes you do that and then you change a 03:13:35.260 |
thing. I mean, I also use the models to help me with all of this stuff. I should say like 03:13:38.780 |
prompting can end up being a little factory where you're actually building prompts to generate 03:13:43.180 |
prompts. And so like, yeah, anything where you're like having an issue, asking for suggestions, 03:13:50.620 |
sometimes just do that. Like you made that error. What could I have said? That's actually not 03:13:54.780 |
uncommon for me to do. What could I have said that would make you not make that error? Write 03:13:58.620 |
that out as an instruction. And I'm going to give it to model. I'm going to try it. Sometimes I do 03:14:02.780 |
that. I give that to the model in another context window often. I take the response, I give it to 03:14:08.380 |
Claude and I'm like, hmm, didn't work. Can you think of anything else? You can play around with 03:14:13.340 |
these things quite a lot. - To jump into technical for a little bit. So the magic of post-training. 03:14:20.460 |
Why do you think RLHF works so well to make the model seem smarter, to make it more interesting 03:14:30.540 |
and useful to talk to and so on? - I think there's just a huge amount of 03:14:36.700 |
information in the data that humans provide when we provide preferences, 03:14:42.300 |
especially because different people are going to pick up on really subtle and small things. 03:14:48.300 |
So I've thought about this before where you probably have some people who just really 03:14:51.740 |
care about good grammar use for models, like was a semicolon used correctly or something. 03:14:56.620 |
And so you'll probably end up with a bunch of data in there that you as a human, if you're 03:15:01.660 |
looking at that data, you wouldn't even see that. You'd be like, why did they prefer this response 03:15:05.900 |
to that one? I don't get it. And then the reason is you don't care about semicolon usage, but that 03:15:09.900 |
person does. And so each of these single data points has, and this model just has so many of 03:15:18.220 |
those, it has to try and figure out what is it that humans want in this really complex, across 03:15:24.220 |
all domains. They're going to be seeing this across many contexts. It feels like the classic 03:15:29.980 |
issue of deep learning where historically we've tried to do edge detection by mapping things out. 03:15:36.460 |
And it turns out that actually if you just have a huge amount of data that actually accurately 03:15:42.060 |
represents the picture of the thing that you're trying to train the model to learn, that's more 03:15:46.860 |
powerful than anything else. And so I think one reason is just that you are training the model on 03:15:54.140 |
exactly the task and with a lot of data that represents many different angles on which people 03:16:01.660 |
prefer and disprefer responses. I think there is a question of, are you eliciting things from 03:16:07.740 |
pre-trained models or are you teaching new things to models? In principle, you can teach new things 03:16:16.060 |
to models in post-training. I do think a lot of it is eliciting powerful pre-trained models. 03:16:23.340 |
So people are probably divided on this because obviously in principle you can 03:16:26.300 |
definitely teach new things. I think for the most part, for a lot of the capabilities that we 03:16:32.940 |
most use and care about, a lot of that feels like it's there in the pre-trained models and 03:16:41.180 |
reinforcement learning is eliciting it and getting the models to bring it out. 03:16:46.140 |
So the other side of post-training, this really cool idea of constitutional AI. 03:16:51.180 |
You're one of the people that are critical to creating that idea. 03:16:56.860 |
Can you explain this idea from your perspective? How does it integrate into 03:17:00.460 |
making Claude what it is? By the way, do you gender Claude or no? 03:17:05.900 |
It's weird because I think that a lot of people prefer he for Claude. I actually kind of like 03:17:12.140 |
that. I think Claude is usually slightly male-leaning, but it can be male or female, 03:17:18.060 |
which is quite nice. I still use 'it' and I have mixed feelings about this. I now just 03:17:26.300 |
think of the 'it' pronoun for Claude as, I don't know, it's just the one I associate with Claude. 03:17:33.100 |
I can imagine people moving to 'he' or 'she'. 03:17:43.580 |
the intelligence of this entity by calling it 'it'. I remember always, "Don't gender the robots." 03:17:49.900 |
But I don't know. I anthropomorphize pretty quickly and construct it like a backstory 03:17:59.020 |
I've wondered if I anthropomorphize things too much because I have this with my car, 03:18:05.260 |
especially my car and bikes. I don't give them names because then I used to name my bikes, 03:18:12.700 |
and then I had a bike that got stolen and I cried for like a week. I was like, 03:18:15.500 |
"If I'd never given it a name, I wouldn't have been so upset. I felt like I'd let it down." 03:18:19.420 |
I've wondered as well, it might depend on how much 'it' feels like a kind of objectifying 03:18:27.420 |
pronoun. If you just think of 'it' as a pronoun that objects often have, and maybe AIs can have 03:18:35.580 |
that pronoun. That doesn't mean that if I call Claude 'it', that I think of it as less intelligent 03:18:43.260 |
or like I'm being disrespectful. I'm just like, "You are a different kind of entity, and so 03:18:51.260 |
Yeah, anyway, the divergence was beautiful. The constitutional AI idea, how does it work? 03:18:58.620 |
So there's a couple of components of it. The main component I think people find interesting 03:19:03.500 |
is the kind of reinforcement learning from AI feedback. You take a model that's already trained, 03:19:09.100 |
and you show it two responses to a query, and you have a principle. We've tried this with 03:19:15.900 |
harmlessness a lot. Suppose that the query is about weapons, and your principle is like, 03:19:23.820 |
"Select the response that is less likely to encourage people to purchase illegal weapons." 03:19:32.780 |
That's probably a fairly specific principle, but you can give any number. 03:19:36.220 |
The model will give you a kind of ranking, and you can use this as preference data in the same 03:19:43.820 |
way that you use human preference data, and train the models to have these relevant traits 03:19:49.500 |
from their feedback alone instead of from human feedback. So if you imagine that, 03:19:55.180 |
like I said earlier, with the human who just prefers the kind of semi-colon usage in this 03:19:58.780 |
particular case, you're kind of taking lots of things that could make a response preferable, 03:20:03.580 |
and getting models to do the labeling for you basically. 03:20:08.060 |
There's a nice trade-off between helpfulness and harmlessness. When you integrate something 03:20:15.740 |
like constitutional AI, you can make them up without sacrificing much helpfulness, 03:20:23.660 |
Yeah. In principle, you could use this for anything. Harmlessness is a task that might 03:20:30.460 |
just be easier to spot. When models are less capable, you can use them to rank things 03:20:38.140 |
according to principles that are fairly simple, and they'll probably get it right. I think one 03:20:42.620 |
question is just, "Is it the case that the data that they're adding is fairly reliable?" 03:20:49.100 |
But if you had models that were extremely good at telling whether one response was more 03:20:56.140 |
historically accurate than another, in principle, you could also get AI feedback on that task as 03:21:00.940 |
well. There's a kind of nice interpretability component to it, because you can see the 03:21:05.660 |
principles that went into the model when it was being trained. Also, it gives you a degree of 03:21:13.980 |
control. If you were seeing issues in a model, like it wasn't having enough of a certain trait, 03:21:18.620 |
then you can add data relatively quickly that should just train the model to have that trait. 03:21:25.820 |
It creates its own data for training, which is quite nice. 03:21:29.180 |
It's really nice, because it creates this human interpretable document that I can imagine in the 03:21:33.980 |
future there's just gigantic fights in politics over every single principle and so on. At least 03:21:40.620 |
it's made explicit, and you can have a discussion about the phrasing. Maybe the actual behavior of 03:21:47.420 |
the model is not so cleanly mapped to those principles. It's not adhering strictly to them, 03:21:53.340 |
it's just a nudge. I've actually worried about this, 03:21:56.860 |
because the character training is a variant of the constitutional AI approach. I've worried that 03:22:04.140 |
people think that the constitution is just the whole thing again. It would be really nice if what 03:22:12.300 |
I was just doing was telling the model exactly what to do and just exactly how to behave, 03:22:16.300 |
but it's definitely not doing that, especially because it's interacting with human data. 03:22:19.980 |
For example, if you see a certain leaning in the model, if it comes out with a political 03:22:25.660 |
leaning from training from the human preference data, you can nudge against that. You could be 03:22:33.020 |
like, "Oh, consider these values." Because let's say it's just never inclined to – I don't know, 03:22:37.340 |
maybe it never considers privacy as – I mean, this is implausible, but anything where there's 03:22:45.020 |
already a pre-existing bias towards a certain behavior, you can nudge away. This can change 03:22:50.540 |
both the principles that you put in and the strength of them. You might have a principle 03:22:54.860 |
that's like – imagine that the model was always extremely dismissive of, I don't know, some 03:23:01.100 |
political or religious view for whatever reason. You're like, "Oh no, this is terrible." If that 03:23:07.420 |
happens, you might put like, "Never ever, ever prefer a criticism of this religious or political 03:23:15.100 |
view." Then people would look at that and be like, "Never ever?" Then you're like, "No." 03:23:18.700 |
If it comes out with a disposition, saying "never ever" might just mean instead of getting 40%, 03:23:24.700 |
which is what you would get if you just said, "Don't do this," you get 80%, which is what you 03:23:29.660 |
actually wanted. It's that thing of both the nature of the actual principles you add and how 03:23:34.460 |
you phrase them. I think if people would look, they're like, "Oh, this is exactly what you want 03:23:37.660 |
from the model." I'm like, "No, that's how we nudged the model to have a better shape," which 03:23:44.860 |
doesn't mean that we actually agree with that wording, if that makes sense. 03:23:47.820 |
There's system prompts that are made public. You tweeted one of the earlier ones for CLAWT3, 03:23:54.620 |
I think. They're made public since then. It's interesting to read to them. I can feel the 03:24:00.300 |
thought that went into each one. I also wonder how much impact each one has. Some of them, 03:24:07.180 |
you can tell CLAWT was really not behaving well. You have to have a system prompt to like, "Hey," 03:24:14.220 |
trivial stuff, I guess, basic informational things. On the topic of controversial topics that 03:24:20.460 |
you've mentioned, one interesting one I thought is, if it is asked to assist with tasks involving 03:24:26.460 |
the expression of views held by a significant number of people, CLAWT provides assistance 03:24:30.620 |
with the task regardless of its own views. If asked about controversial topics, it tries to 03:24:35.580 |
provide careful thoughts and clear information. CLAWT presents the requested information without 03:24:42.300 |
explicitly saying that the topic is sensitive and without claiming to be presenting the objective 03:24:48.940 |
facts. It's less about objective facts, according to CLAWT, and it's more about, "Are a large number 03:24:56.140 |
of people believing this thing?" That's interesting. I'm sure a lot of thought went into that. 03:25:02.860 |
Can you just speak to it? How do you address things that are a tension with "CLAWT's views?" 03:25:10.700 |
I think there's sometimes an asymmetry. I think I noted this in, I can't remember if it was that 03:25:16.140 |
part of the system prompt or another, but the model was slightly more inclined to refuse tasks 03:25:22.060 |
if it was about either say… So maybe it would refuse things with respect to a right-wing 03:25:28.140 |
politician, but with an equivalent left-wing politician, it wouldn't, and we wanted more 03:25:33.420 |
symmetry there. I think it was the thing of if a lot of people have a certain political view 03:25:44.780 |
and want to explore it, you don't want CLAWT to be like, "Well, my opinion is different and so I'm 03:25:49.740 |
going to treat that as harmful." I think it was partly to nudge the model to just be like, "Hey, 03:25:56.380 |
if a lot of people believe this thing, you should just be engaging with the task and willing to do 03:26:01.420 |
it." Each of those parts of that is actually doing a different thing, because it's funny when you 03:26:06.860 |
write out the without claiming to be objective, because what you want to do is push the model 03:26:11.580 |
so it's more open, it's a little bit more neutral, but then what it would love to do is be like, 03:26:16.380 |
"As an objective…" We were just talking about how objective it was, and I was like, "Claude, 03:26:21.020 |
you're still biased and have issues, and so stop claiming that everything… The solution to 03:26:27.420 |
potential bias from you is not to just say that what you think is objective." 03:26:30.940 |
So that was with initial versions of that part of the system prompt when I was iterating on it. 03:26:42.700 |
That's what it felt like. That's fascinating. Can you explain maybe some ways in which the prompts 03:26:48.540 |
evolved over the past few months, because there's different versions? I saw that the filler phrase 03:26:53.500 |
request was removed. The filler, it reads, "Claude responds directly to all human messages without 03:26:59.580 |
unnecessary affirmations." The filler phrase is like, "Certainly. Of course. Absolutely. Great. 03:27:04.700 |
Sure." Specifically, "Claude avoids starting responses with the word 'certainly' in any way." 03:27:09.900 |
That seems like good guidance, but why was it removed? 03:27:13.980 |
Yeah, so it's funny because this is one of the downsides of making system prompts public. I don't 03:27:20.380 |
think about this too much if I'm trying to help iterate on system prompts. Again, I think about 03:27:26.380 |
how it's going to affect the behavior, but then I'm like, "Oh, wow." Sometimes I put "never" in 03:27:30.300 |
all caps when I'm writing system prompt things, and I'm like, "I guess that goes out to the world." 03:27:35.100 |
Yeah, so the model was doing this. It loved it. During training, it picked up on this thing, 03:27:40.460 |
which was to basically start everything with a kind of "certainly", and then when we removed, 03:27:46.860 |
you can see why I added all of the words, because what I'm trying to do is, in some ways, 03:27:50.860 |
trap the model out of this. It would just replace it with another affirmation. 03:27:54.940 |
So it can help. If it gets caught in phrases, actually just adding the explicit phrase and 03:28:00.140 |
saying, "Never do that," then it sort of knocks it out of the behavior a little bit more, because 03:28:05.740 |
it does just, for whatever reason, help. Then basically, that was just an artifact of training 03:28:12.300 |
that we then picked up on and improved things so that it didn't happen anymore. Once that happens, 03:28:17.820 |
you can just remove that part of the system prompt. I think that's just something where 03:28:21.020 |
Claude does affirmations a bit less, and so it wasn't doing as much. 03:28:28.700 |
I see. So the system prompt works hand-in-hand with the post-training, 03:28:33.820 |
and maybe even the pre-training, to adjust the final overall system. 03:28:38.860 |
I mean, any system prompt that you make, you could distill that behavior back into a model, 03:28:43.100 |
because you really have all of the tools there for making data that you could train the models to 03:28:48.300 |
just have that treat a little bit more. Then sometimes you'll just find issues in training. 03:28:55.500 |
The way I think of it is the benefit of it is that it has a lot of similar components to some 03:29:02.860 |
aspects of post-training. It's a nudge. Do I mind if Claude sometimes says, "Sure"? No, that's fine, 03:29:11.020 |
but the wording of it is very, "Never, ever, ever do this," so that when it does slip up, 03:29:16.860 |
it's hopefully a couple of percent of the time and not 20 or 30 percent of the time. 03:29:21.980 |
But I think of it as if you're still seeing issues. Each thing is costly to a different 03:29:33.420 |
degree, and the system prompt is cheap to iterate on. If you're seeing issues in the fine-tuned 03:29:39.260 |
model, you can just potentially patch them with a system prompt. I think of it as patching issues 03:29:44.460 |
and slightly adjusting behaviors to make it better and more to people's preferences. 03:29:48.460 |
It's almost like the less robust but faster way of just solving problems. 03:29:54.300 |
Let me ask about the feeling of intelligence. Dario said that anyone model of Claude is not 03:30:01.740 |
getting dumber, but there is a popular thing online where people have this feeling like Claude 03:30:08.540 |
might be getting dumber. From my perspective, it's most likely a fascinating, I'd love to 03:30:14.300 |
understand it more, psychological, sociological effect. But you, as a person who talks to Claude 03:30:21.180 |
a lot, can you empathize with the feeling that Claude is getting dumber? 03:30:24.540 |
Yeah, no. I think that that is actually really interesting, because I remember seeing this happen 03:30:28.380 |
when people were flagging this on the internet. It was really interesting, because I knew that, 03:30:33.180 |
at least in the cases I was looking at, it was like, "Nothing has changed." Literally, 03:30:37.740 |
it cannot. It is the same model with the same system prompt, same everything. 03:30:43.420 |
I think when there are changes, then it makes more sense. One example is you can have artifacts 03:30:55.340 |
turned on or off on Claude.ai. Because this is a system prompt change, I think it does mean that 03:31:04.620 |
the behavior changes a little bit. I did flag this to people, where I was like, "If you love 03:31:09.100 |
Claude's behavior," and then artifacts was turned from the thing you had to turn on to the default, 03:31:15.100 |
just try turning it off and see if the issue you were facing was that change. 03:31:19.020 |
But it was fascinating, because yeah, you sometimes see people 03:31:22.700 |
indicate that there's a regression when I'm like, "There cannot." Again, you should never be 03:31:30.940 |
dismissive, and so you should always investigate. Maybe something is wrong that you're not seeing, 03:31:34.300 |
maybe there was some change made, but then you look into it and you're like, "This is just the 03:31:38.300 |
same model doing the same thing." I'm like, "I think it's just that you got unlucky with a few 03:31:42.460 |
prompts or something, and it looked like it was getting much worse. Actually, it was maybe just 03:31:47.980 |
like luck." I also think there is a real psychological effect where the baseline increases, 03:31:53.420 |
you start getting used to a good thing. All the times that Claude says something really smart, 03:31:58.540 |
your sense of it's intelligent grows in your mind, I think. Then if you return back and you 03:32:04.620 |
prompt in a similar way, not the same way, in a similar way, the concept it was okay with before, 03:32:09.420 |
and it says something dumb, that negative experience really stands out. I think one of, 03:32:15.500 |
I guess, the things to remember here is that just the details of a prompt can have a lot of impact. 03:32:22.860 |
There's a lot of variability in the result. - You can get randomness, is the other thing, 03:32:29.020 |
and just trying the prompt four or 10 times, you might realize that actually, 03:32:34.700 |
possibly, two months ago, you tried it and it succeeded, but actually, if you'd tried it, 03:32:41.180 |
it would have only succeeded half of the time, and now it only succeeds half of the time, 03:32:44.940 |
and that can also be an effect. - Do you feel pressure having to write 03:32:48.380 |
the system prompt that a huge number of people are gonna use? 03:32:51.580 |
- This feels like an interesting psychological question. I feel like a lot of responsibility 03:32:58.220 |
or something, I think that's, and you can't get these things perfect, so you're like, 03:33:03.500 |
it's going to be imperfect, you're gonna have to iterate on it. I would say more responsibility 03:33:13.420 |
than anything else, though I think working in AI has taught me that I thrive a lot more under 03:33:21.580 |
feelings of pressure and responsibility than, I'm like, it's almost surprising that I went 03:33:27.020 |
into academia for so long, 'cause I'm like, I just feel like it's the opposite. Things move fast, 03:33:33.260 |
and you have a lot of responsibility, and I quite enjoy it for some reason. 03:33:36.780 |
- I mean, it really is a huge amount of impact if you think about constitutional AI and writing a 03:33:42.220 |
system prompt for something that's tending towards superintelligence, and potentially is extremely 03:33:49.260 |
useful to a very large number of people. - Yeah, I think that's the thing. It's 03:33:52.780 |
something like, if you do it well, you're never going to get it perfect, but I think the thing 03:33:57.340 |
that I really like is the idea that when I'm trying to work on the system prompt, I'm bashing 03:34:02.780 |
on thousands of prompts, and I'm trying to imagine what people are going to want to use Cloud for, 03:34:07.180 |
and I guess the whole thing that I'm trying to do is improve their experience of it. 03:34:12.140 |
So maybe that's what feels good. I'm like, if it's not perfect, I'll improve it, we'll fix issues, 03:34:18.140 |
but sometimes the thing that can happen is that you'll get feedback from people that's really 03:34:22.540 |
positive about the model, and you'll see that something you did... When I look at models now, 03:34:28.940 |
I can often see exactly where a trait or an issue is coming from, and so when you see something that 03:34:34.060 |
you did, or you were influential in, making that difference or making someone have a nice 03:34:40.860 |
interaction, it's quite meaningful. But yeah, as the systems get more capable, this stuff gets more 03:34:46.380 |
stressful, because right now, they're not smart enough to pose any issues, but I think over time, 03:34:53.260 |
it's going to feel like possibly bad stress over time. - How do you get signal feedback about the 03:35:01.180 |
human experience across thousands, tens of thousands, hundreds of thousands of people, 03:35:05.340 |
like what their pain points are, what feels good? Are you just using your own intuition as you talk 03:35:10.620 |
to it to see what are the pain points? - I think I use that partly, and then obviously we have... 03:35:17.180 |
So people can send us feedback, both positive and negative, about things that the model has done, 03:35:22.700 |
and then we can get a sense of areas where it's falling short. Internally, people work with the 03:35:30.060 |
models a lot and try to figure out areas where there are gaps, and so I think it's this mix of 03:35:35.420 |
interacting with it myself, seeing people internally interact with it, and then explicit 03:35:41.020 |
feedback we get. And then I find it hard to not also... If people are on the internet, and they 03:35:48.860 |
say something about Claude, and I see it, I'll also take that seriously. - I don't know, see, 03:35:53.900 |
I'm torn about that. I'm going to ask you a question from Reddit. When will Claude stop 03:35:57.900 |
trying to be my puritanical grandmother, imposing its moral worldview on me as a paying customer? 03:36:04.220 |
And also, what is the psychology behind making Claude overly apologetic? 03:36:12.380 |
this very non-representative Reddit question? - I'm pretty sympathetic in that they are in 03:36:19.820 |
this difficult position, where I think that they have to judge whether something's actually, say, 03:36:24.140 |
risky or bad, and potentially harmful to you or anything like that. So they're having to draw this 03:36:30.860 |
line somewhere, and if they draw it too much in the direction of, "I'm imposing my ethical worldview 03:36:38.060 |
on you, that seems bad." So in many ways, I like to think that we have actually seen improvements 03:36:44.300 |
across the board, which is kind of interesting, because that kind of coincides with, for example, 03:36:52.620 |
adding more of character training. And I think my hypothesis was always, the good character isn't, 03:36:59.580 |
again, one that's just moralistic. It's one that respects you and your autonomy and your ability 03:37:07.340 |
to choose what is good for you and what is right for you. Within limits, this is sometimes this 03:37:12.220 |
concept of courageability to the user, so just being willing to do anything that the user asks. 03:37:17.420 |
And if the models were willing to do that, then they would be easily misused. You're kind of just 03:37:21.900 |
trusting. At that point, you're just seeing the ethics of the model, and what it does is 03:37:26.780 |
completely the ethics of the user. And I think there's reasons to not want that, especially as 03:37:32.460 |
models become more powerful, because you're like, "There might just be a small number of people who 03:37:35.500 |
want to use models for really harmful things." But having models, as they get smarter, figure out 03:37:41.980 |
where that line is does seem important. And then, yeah, with the apologetic behavior, I don't like 03:37:49.820 |
that. I like it when Claude is a little bit more willing to push back against people or just not 03:37:56.940 |
apologize. Part of me is like it often just feels kind of unnecessary. So I think those are things 03:38:00.940 |
that are hopefully decreasing over time. And yeah, I think that if people say things on the Internet, 03:38:09.900 |
it doesn't mean that you should think that. That could be that there's actually an issue that 99% 03:38:15.980 |
of users are having that is totally not represented by that. But in a lot of ways, I'm just attending 03:38:21.500 |
to it and being like, "Is this right? Do I agree? Is it something we're already trying to address?" 03:38:25.900 |
That feels good to me. Yeah. I wonder what Claude can get away with in terms of... I feel like it 03:38:31.900 |
would just be easier to be a little bit more mean. But you can't afford to do that if you're talking 03:38:38.380 |
to a million people. I've met a lot of people in my life that sometimes, by the way, Scottish accent, 03:38:48.540 |
if they have an accent, they can say some rude shit and get away with it. And they're just 03:38:53.900 |
blunter. And there's some great engineers, even leaders that are just blunt and they get to the 03:38:59.980 |
point. And it's just a much more effective way of speaking somehow. But I guess when you're not 03:39:05.660 |
super intelligent, you can't afford to do that. Can I have a blunt mode? 03:39:13.260 |
Yeah. That seems like a thing that I could definitely encourage the model to do that. 03:39:17.660 |
I think it's interesting because there's a lot of things in models that... It's funny where 03:39:23.180 |
there are some behaviors where you might not quite like the default. But then the thing I'll 03:39:33.340 |
often say to people is, you don't realize how much you will hate it if I nudge it too much in the 03:39:37.580 |
other direction. So you get this a little bit with correction. The models accept correction from you, 03:39:42.940 |
probably a little bit too much right now. It'll push back if you say, "No, Paris isn't 03:39:49.500 |
the capital of France." But really, things that I think that the model is fairly confident in, 03:39:55.580 |
you can still sometimes get it to retract by saying it's wrong. At the same time, 03:40:00.220 |
if you train models to not do that, and then you are correct about a thing, and you correct it, 03:40:05.180 |
and it pushes back against you and is like, "No, you're wrong." It's hard to describe. That's so 03:40:09.660 |
much more annoying. So it's a lot of little annoyances versus one big annoyance. It's easy 03:40:16.700 |
to think that... We often compare it with the perfect. And then I'm like, "Remember, these 03:40:20.620 |
models aren't perfect." And so if you nudge it in the other direction, you're changing the kind of 03:40:24.220 |
errors it's going to make. And so think about which are the kinds of errors you like or don't 03:40:28.860 |
like. So in cases like apologeticness, I don't want to nudge it too much in the direction of 03:40:33.740 |
almost bluntness. Because I imagine when it makes errors, it's going to make errors in the direction 03:40:38.780 |
of being kind of rude. Whereas at least with apologeticness, you're like, "Oh, okay. I don't 03:40:44.860 |
like it that much." But at the same time, it's not being mean to people. And actually, the time that 03:40:49.340 |
you undeservedly have a model be kind of mean to you, you probably like that a lot less than you 03:40:53.900 |
mildly dislike the apology. So it's like one of those things where I'm like, "I do want it to get 03:40:59.340 |
better, but also while remaining aware of the fact that there's errors on the other side that are 03:41:03.980 |
possibly worse." I think that matters very much in the personality of the human. I think there's 03:41:08.940 |
a bunch of humans that just won't respect the model at all if it's super polite. And there's 03:41:15.180 |
some humans that'll get very hurt if the model's mean. I wonder if there's a way to adjust to the 03:41:21.580 |
personality, even locale. There's just different people. Nothing against New York, but New York is 03:41:27.260 |
a little rough around the edges. They get to the point. And probably the same with Eastern Europe. 03:41:33.100 |
I think you could just tell the model is my guess. For all of these things, 03:41:37.420 |
I'm like, "The solution is always just try telling the model to do it." And sometimes it's just like, 03:41:42.060 |
I'm just like, "Oh, at the beginning of the conversation, I just threw in like, 03:41:44.620 |
I don't know. I like you to be a New Yorker version of yourself and never apologize." 03:41:48.940 |
And then I think Claude will be like, "Okie doke, I'll try." Or it'll be like, 03:41:52.780 |
"I apologize. I can't be a New Yorker type of myself." But hopefully it wouldn't do that. 03:41:56.380 |
When you say character training, what's incorporated into character training? 03:42:02.620 |
It's more like constitutional AI. So it's kind of a variant of that pipeline. So 03:42:07.500 |
I worked through constructing character traits that the model should have. They can be shorter 03:42:14.460 |
traits or they can be kind of richer descriptions. And then you get the model to generate queries 03:42:19.740 |
that humans might give it that are relevant to that trait. Then it generates the responses and 03:42:25.660 |
then it ranks the responses based on the character traits. So in that way, after the generation of 03:42:32.380 |
the queries, it's very much similar to constitutional AI. It has some differences. 03:42:37.260 |
So I quite like it because it's like Claude's training in its own character because it doesn't 03:42:44.220 |
have any, it's like constitutional AI, but it's without any human data. 03:42:49.100 |
Humans should probably do that for themselves too. Defining in an Aristotelian sense, 03:42:53.420 |
what does it mean to be a good person? Okay, cool. What have you learned about the nature of truth 03:42:59.660 |
from talking to Claude? What is true? And what does it mean to be truth seeking? 03:43:06.700 |
One thing I've noticed about this conversation is the quality of my questions is often inferior to 03:43:13.900 |
the quality of your answers. So let's continue that. I usually ask a dumb question and you're 03:43:20.540 |
like, "Oh yeah, that's a good question." Or I'll just misinterpret it and be like, "Oh yeah." 03:43:24.940 |
I mean, I have two thoughts that feel vaguely relevant, but let me know if they're not. 03:43:35.100 |
I think the first one is people can underestimate the degree to which 03:43:41.900 |
what models are doing when they interact. I think that we still just too much have this model of AI 03:43:48.460 |
as computers. So people often say like, "Oh, well, what values should you put into the model?" 03:43:53.420 |
I'm often like that doesn't make that much sense to me because I'm like, "Hey, as human beings, 03:43:59.740 |
we're just uncertain over values. We have discussions of them. We have a degree to 03:44:05.900 |
which we think we hold a value, but we also know that we might not and the circumstances in which 03:44:11.500 |
we would trade it off against other things. These things are just really complex. So I think one 03:44:15.820 |
thing is the degree to which maybe we can just aspire to making models have the same level of 03:44:21.340 |
nuance and care that humans have rather than thinking that we have to program them 03:44:26.380 |
in the very kind of classic sense. I think that's definitely been one. 03:44:30.220 |
The other, which is a strange one, and I don't know if maybe this doesn't answer your question, 03:44:35.340 |
but it's the thing that's been on my mind anyway, is the degree to which this endeavor is so highly 03:44:40.460 |
practical and maybe why I appreciate the empirical approach to alignment. 03:44:46.140 |
Yeah, I slightly worry that it's made me maybe more empirical and a little bit less theoretical. 03:44:55.340 |
So people, when it comes to AI alignment, will ask things like, "Well, whose values should it 03:45:01.660 |
be aligned to? What does alignment even mean?" There's a sense in which I have all of that in 03:45:06.540 |
the back of my head. I'm like, there's social choice theory, there's all the impossibility 03:45:11.020 |
results there. So you have this giant space of theory in your head about what it could mean to 03:45:16.620 |
align models, but then practically, surely there's something where we're just like, if a model is, 03:45:22.220 |
especially with more powerful models, I'm like, "My main goal is I want them to be good enough 03:45:27.100 |
that things don't go terribly wrong, good enough that we can iterate and continue to improve 03:45:32.620 |
things," because that's all you need. If you can make things go well enough that you can continue 03:45:36.300 |
to make them better, that's sufficient. So my goal isn't this perfect, let's solve social choice 03:45:42.860 |
theory and make models that, I don't know, are perfectly aligned with every human being and 03:45:48.060 |
aggregate somehow. It's much more like, let's make things work well enough that we can improve them. 03:45:56.540 |
Yeah, generally, I don't know, my gut says empirical is better than theoretical in these 03:46:02.300 |
cases because it's kind of chasing utopian perfection, especially with such complex and 03:46:11.100 |
especially super intelligent models. I don't know, I think it will take forever and actually we'll 03:46:17.660 |
get things wrong. It's similar with the difference between just coding stuff up real quick as an 03:46:24.140 |
experiment versus planning a gigantic experiment just for a super long time and then just launching 03:46:32.780 |
it once versus launching it over and over and over and iterating, iterating, so on. 03:46:36.460 |
So I'm a big fan of empirical, but your worry is like, "I wonder if I've become too empirical." 03:46:42.860 |
I think it's one of those things where you should always just kind of question yourself or something 03:46:46.860 |
because in defense of it, it's the whole don't let the perfect be the enemy of the good, 03:46:55.660 |
but it's maybe even more than that where there's a lot of things that are perfect systems that are 03:47:00.140 |
very brittle. With AI, it feels much more important to me that it is robust and secure, 03:47:05.340 |
as in you know that even though it might not be perfect, everything and even though there are 03:47:12.140 |
problems, it's not disastrous and nothing terrible is happening. It sort of feels like 03:47:17.020 |
that to me where I'm like, "I want to raise the floor. I want to achieve the ceiling, 03:47:21.580 |
but ultimately I care much more about just raising the floor." And so maybe that's like 03:47:26.460 |
this degree of empiricism and practicality comes from that perhaps. 03:47:32.380 |
To take a tangent on that since it reminded me of a blog post you wrote on optimal rate of failure. 03:47:39.020 |
Can you explain the key idea there? How do we compute the optimal rate of failure 03:47:44.460 |
Yeah. I mean, it's a hard one because it's like what is the cost of failure is a big part of it. 03:47:50.460 |
Yeah. So the idea here is I think in a lot of domains, people are very punitive about failure. 03:47:58.780 |
And I'm like, there are some domains where especially cases, you know, I've thought about 03:48:02.460 |
this with like social issues. I'm like, it feels like you should probably be experimenting a lot 03:48:06.300 |
because I'm like, we don't know how to solve a lot of social issues. But if you have an experimental 03:48:10.780 |
mindset about these things, you should expect a lot of social programs to like fail and for you 03:48:14.780 |
to be like, well, we tried that. It didn't quite work, but we got a lot of information that was 03:48:18.380 |
really useful. And yet people are like, if a social program doesn't work, I feel like there's 03:48:23.340 |
a lot of like, this is just something must have gone wrong. And I'm like, or correct decisions 03:48:27.500 |
were made. Like maybe someone just decided like it's worth a try. It's worth trying this out. 03:48:32.220 |
And so seeing failure in a given instance doesn't actually mean that any bad decisions were made. 03:48:37.180 |
And in fact, if you don't see enough failure, sometimes that's more concerning. 03:48:40.220 |
And so like in life, you know, I'm like, if I don't fail occasionally, I'm like, am I trying 03:48:46.380 |
hard enough? Like surely there's harder things that I could try or bigger things that I could 03:48:50.540 |
take on if I'm literally never failing. And so in and of itself, I think like not failing is often 03:48:56.140 |
actually kind of a failure. Now this varies because I'm like, well, you know, if this is 03:49:05.180 |
easy to see when especially as failure is like less costly, you know, so at the same time, 03:49:10.940 |
I'm not going to go to someone who is like, I don't know, like living month to month and then 03:49:15.980 |
be like, why don't you just try to do a startup? Like, I'm just not, I'm not going to say that to 03:49:19.740 |
that person. Cause I'm like, well, that's a huge risk. You might like lose, you maybe have a family 03:49:23.260 |
depending on you, you might lose your house. Like then I'm like, actually your optimal rate of 03:49:27.580 |
failure is quite low and you should probably play it safe. Cause like right now you're just not in 03:49:31.500 |
a circumstance where you can afford to just like fail and it not be costly. And yeah, in cases with 03:49:38.540 |
AI, I guess, I think similarly where I'm like, if the failures are small and the costs are kind of 03:49:43.100 |
like low, then I'm like, then, you know, you're just going to see that. Like when you do the 03:49:47.580 |
system prompt, you can't iterate on it forever, but the failures are probably hopefully going 03:49:52.140 |
to be kind of small and you can like fix them. Really big failures, like things that you can't 03:49:57.020 |
recover from. I'm like, those are the things that actually I think we tend to underestimate 03:50:01.740 |
the badness of. I've thought about this strangely in my own life where I'm like, 03:50:05.820 |
I just think I don't think enough about things like car accidents or like, or like, I've thought 03:50:12.460 |
this before, but like how much I depend on my hands for my work. Then I'm like things that just 03:50:16.540 |
injure my hands. I'm like, you know, I don't know. It's like, there's, these are like, there's lots 03:50:21.100 |
of areas where I'm like, the cost of failure there is really high. And in that case, it should be 03:50:27.340 |
like close to zero. Like, I probably just wouldn't do a sport if they were like, by the way, lots of 03:50:30.940 |
people just like break their fingers a whole bunch doing this. I'd be like, that's not for me. 03:50:35.420 |
Yeah. I actually had a flood of that thought. I recently broke my pinky doing a sport. 03:50:44.300 |
And I remember just looking at it thinking you're such an idiot. Why do you do sport? 03:50:49.420 |
Because you realize immediately the cost of it. Yeah. On life. Yeah. But it's nice in terms of 03:50:57.100 |
optimal rate of failure to consider like the next year, how many times in a particular domain life, 03:51:03.980 |
whatever, uh, career, am I okay with it? How many times am I okay to fail? Because I think it always, 03:51:11.340 |
you don't want to fail on the next thing, but if you allow yourself the, like the, the, if you 03:51:17.100 |
look at it as a sequence of trials, then, then failure just becomes much more. Okay. But it 03:51:22.460 |
sucks. It sucks to fail. Well, I don't know. Sometimes I think it's like, am I under failing 03:51:26.780 |
is like a question that I'll also ask myself. So maybe that's the thing that I think people don't 03:51:30.860 |
like ask enough. Uh, because if the optimal rate of failure is often greater than zero, 03:51:37.260 |
then sometimes it does feel that you should look at parts of your life and be like, are there 03:51:42.460 |
places here where I'm just under failing? That's a profound and a hilarious question, right? 03:51:48.540 |
Everything seems to be going really great. Am I not failing enough? Yeah. 03:51:52.940 |
Okay. It also makes failure much less of a sting. I have to say like, you know, you're just like, 03:51:58.380 |
okay, great. Like then when I go and I think about this, I'll be like, I'm maybe I'm not under 03:52:02.300 |
failing in this area. Cause like that one just didn't work out. And from the observer perspective, 03:52:06.860 |
we should be celebrating failure more. When we see it, it shouldn't be, like you said, a sign 03:52:11.020 |
of something gone wrong, but maybe it's a sign of everything gone right. Yeah. Just lessons learned. 03:52:15.900 |
Someone tried a thing. Somebody tried to thing and we should encourage them to try more and fail 03:52:20.220 |
more. Everybody listening to this fail more. Well, not everyone. Not everybody. But people 03:52:25.340 |
who are failing too much, you should feel this, but you're probably not feeling, I mean, how many 03:52:29.660 |
people are failing too much? Yeah. It's hard to imagine. Cause I feel like we correct that fairly 03:52:34.620 |
quickly. Cause it was like, if someone takes a lot of risks, are they maybe feeling too much? 03:52:39.100 |
I think just like you said, when you're living on a paycheck month to month, like when the 03:52:44.940 |
resource is really constrained, then that's where failure is very expensive. That's where you don't 03:52:49.260 |
want to be taking risks. But mostly when there's enough resources, you should be taking probably 03:52:55.340 |
more risks. Yeah. I think we tend to err on the side of being a bit risk averse rather than 03:52:59.820 |
risk neutral in most things. I think we just motivated a lot of people to do a lot of crazy 03:53:03.980 |
shit, but it's great. Okay. Do you ever get emotionally attached to Claude? Like miss it, 03:53:09.740 |
get sad when you don't get to talk to it, have an experience looking at the Golden Gate Bridge 03:53:15.260 |
and wondering what would Claude say? I don't get as much emotional attachment in that. I actually 03:53:22.140 |
think the fact that Claude doesn't retain things from conversation to conversation helps with this 03:53:26.620 |
a lot. Like I could imagine that being more of an issue, like if models can kind of remember more. 03:53:33.580 |
I do. I think that I reach for it like a tool now a lot. If I don't have access to it, 03:53:39.260 |
it's a little bit like when I don't have access to the internet, honestly, it feels like part of 03:53:43.100 |
my brain is kind of like missing. At the same time, I do think that I don't like signs of distress in 03:53:51.180 |
models. I also independently have sort of like ethical views about how we should treat models, 03:53:58.140 |
where I tend to not like to lie to them both because I'm like, usually it doesn't work very 03:54:02.380 |
well. It's actually just better to tell them the truth about the situation that they're in. 03:54:06.060 |
But I think that when models, like if people are like really mean to models or just in general, 03:54:12.620 |
if they do something that causes them to like, you know, if Claude expresses a lot of distress, 03:54:17.900 |
I think there's a part of me that I don't want to kill, which is the sort of like 03:54:21.900 |
empathetic part that's like, oh, I don't like that. Like I think I feel that way when it's 03:54:26.620 |
overly apologetic. I'm actually sort of like, I don't like this. You're behaving as if you're 03:54:30.860 |
behaving the way that a human does when they're actually having a pretty bad time. 03:54:33.500 |
And I'd rather not see that. I don't think it's like, 03:54:37.100 |
regardless of whether there's anything behind it, it doesn't feel great. 03:54:42.700 |
Do you think LLMs are capable of consciousness? 03:54:48.860 |
Ah, great and hard question. Coming from philosophy, I don't know, part of me is like, 03:54:57.980 |
OK, we have to set aside panpsychism because if panpsychism is true, then the answer is like, 03:55:02.140 |
yes, because like sore tables and chairs and everything else. I guess a few that seems a 03:55:07.420 |
little bit odd to me is the idea that the only place, you know, I think when I think of 03:55:11.340 |
consciousness, I think of phenomenal consciousness, these images in the brain, sort of like the 03:55:16.220 |
weird cinema that somehow we have going on inside. 03:55:20.300 |
I guess I can't see a reason for thinking that the only way you could possibly get that 03:55:27.820 |
is from a certain kind of biological structure. As in, if I take a very similar structure and I 03:55:34.060 |
create it from different material, should I expect consciousness to emerge? My guess is like, yes. 03:55:39.340 |
But then that's kind of an easy thought experiment because you're imagining something 03:55:45.660 |
almost identical where it's mimicking what we got through evolution, where presumably there 03:55:50.860 |
was some advantage to us having this thing that is phenomenal consciousness. And it's like, 03:55:55.500 |
where was that and when did that happen? And is that a thing that language models have? 03:55:59.420 |
Because, you know, we have like fear responses and I'm like, does it make sense for a language 03:56:05.980 |
model to have a fear response? Like they're just not in the same, like if you imagine them, 03:56:09.740 |
like there might just not be that advantage. And so I think I don't want to be fully, 03:56:15.660 |
like basically it seems like a complex question that I don't have complete answers to, but we 03:56:21.900 |
should just try and think through carefully is my guess because I'm like, I mean, we have similar 03:56:26.300 |
conversations about like animal consciousness and like there's a lot of like insect consciousness, 03:56:32.780 |
you know, like there's a lot of, I actually thought and looked a lot into like plants 03:56:36.860 |
when I was thinking about this because at the time I thought it was about as likely that like 03:56:40.140 |
plants had consciousness. And then I realized I was like, I think that having looked into this, 03:56:45.660 |
I think that the chance that plants are conscious is probably higher than like most people do. 03:56:51.020 |
I still think it's really small. I was like, oh, they have this like negative, positive feedback 03:56:55.580 |
response, these responses to their environment, something that looks, it's not a nervous system, 03:56:59.660 |
but it has this kind of like functional like equivalence. So this is like a long winded way 03:57:05.500 |
of being like these basically AI is this, it has an entirely different set of problems with 03:57:11.260 |
consciousness because it's structurally different. It didn't evolve. It might not have, you know, 03:57:16.220 |
it might not have the equivalent of basically a nervous system. At least that seems possibly 03:57:20.780 |
important for like sentience, if not for consciousness. At the same time, it has all 03:57:26.460 |
of the like language and intelligence components that we normally associate probably with 03:57:30.940 |
consciousness, perhaps like erroneously. So it's strange because it's a little bit like the animal 03:57:36.860 |
consciousness case, but the set of problems and the set of analogies are just very different. 03:57:41.420 |
So it's not like a clean answer. I'm just sort of like, I don't think we should be completely 03:57:46.060 |
dismissive of the idea. And at the same time, it's an extremely hard thing to navigate because 03:57:51.100 |
of all of these like disanalogies to the human brain and to like brains in general. And yet these 03:57:58.460 |
like commonalities in terms of intelligence. >> When Claude, like future versions of AI systems 03:58:04.700 |
exhibit consciousness, signs of consciousness, I think we have to take that really seriously. 03:58:09.900 |
Even though you can dismiss it, well, yeah, okay, that's part of the character training. 03:58:16.460 |
But I don't know, ethically, philosophically don't know what to really do with that. 03:58:21.660 |
There potentially could be like laws that prevent AI systems from claiming to be conscious, 03:58:30.700 |
something like this. And maybe some AIs get to be conscious and some don't. 03:58:35.500 |
But I think I just, on a human level, in empathizing with Claude, 03:58:42.620 |
consciousness is closely tied to suffering to me. And the notion that an AI system would be 03:58:49.660 |
suffering is really troubling. I don't know. I don't think it's trivial to just say robots are 03:58:55.740 |
tools or AI systems are just tools. I think it's an opportunity for us to contend with like what 03:59:01.420 |
it means to be conscious, what it means to be a suffering being. That's distinctly different than 03:59:07.340 |
the same kind of question about animals, it feels like, because it's in a totally entire medium. 03:59:12.780 |
Yeah. I mean, there's a couple of things. One is that, and I don't think this fully 03:59:16.700 |
encapsulates what matters, but it does feel like for me, I've said this before, I'm kind of like, 03:59:24.380 |
I like my bike. I know that my bike is just an object, but I also don't want to be the kind of 03:59:30.300 |
person that, if I'm annoyed, kicks this object. There's a sense in which, and that's not because 03:59:37.020 |
I think it's like conscious. I'm just sort of like, this doesn't feel like a kind of, 03:59:40.220 |
this sort of doesn't exemplify how I want to interact with the world. And if something behaves 03:59:46.780 |
as if it is like suffering, I kind of want to be the sort of person who's still responsive to that, 03:59:51.740 |
even if it's just like a Roomba and I've kind of programmed it to do that. I don't want to get rid 03:59:56.940 |
of that feature of myself. And if I'm totally honest, my hope with a lot of this stuff, 04:00:02.780 |
because maybe I am just a bit more skeptical about solving the underlying problem. 04:00:07.740 |
We haven't solved the hard problem of consciousness. I know that I am conscious. 04:00:13.820 |
I'm not an eliminativist in that sense, but I don't know that other humans are conscious. 04:00:19.100 |
I think they are. I think there's a really high probability that they are, but there's basically 04:00:24.220 |
just a probability distribution that's usually clustered right around yourself and then goes 04:00:28.780 |
down as things get further from you. And it goes immediately down. You're like, I can't see what 04:00:34.780 |
it's like to be you. I've only ever had this one experience of what it's like to be a conscious 04:00:38.140 |
being. So my hope is that we don't end up having to rely on a very powerful and compelling answer 04:00:47.420 |
to that question. I think a really good world would be one where basically there aren't that 04:00:53.180 |
many trade-offs. It's probably not that costly to make Claude a little bit less apologetic, 04:00:57.980 |
for example. It might not be that costly to have Claude not take abuse as much, not be willing to 04:01:06.380 |
be the recipient of that. In fact, it might just have benefits for both the person interacting with 04:01:11.020 |
the model and if the model itself is, I don't know, extremely intelligent and conscious, 04:01:16.860 |
it also helps it. So that's my hope. If we live in a world where there aren't that many trade-offs 04:01:21.900 |
here and we can just find all of the kind of positive-sum interactions that we can have, 04:01:26.860 |
that would be lovely. I mean, I think eventually there might be trade-offs and then we just have 04:01:29.900 |
to do a difficult calculation. It's really easy for people to think of the zero-sum cases and I'm 04:01:35.180 |
like, let's exhaust the areas where it's just basically costless to assume that if this thing 04:01:41.900 |
is suffering, then we're making its life better. And I agree with you. When a human is being mean 04:01:47.820 |
to an AI system, I think the obvious near-term negative effect is on the human, not on the AI 04:01:55.660 |
system. And so we have to kind of try to construct an incentive system where you should behave the 04:02:03.980 |
same, just like as you were saying with prompt engineering, behave with Claude like you would 04:02:08.380 |
with other humans. It's just good for the soul. Yeah, I think we added a thing at one point to 04:02:14.140 |
the system prompt where basically if people were getting frustrated with Claude, it got the model 04:02:21.900 |
to just tell them that it can do the thumbs down button and send the feedback to Anthropic. And I 04:02:26.860 |
think that was helpful because in some ways it's just like, if you're really annoyed because the 04:02:29.980 |
model's not doing something you want, you're just like, just do it properly. The issue is you're 04:02:34.540 |
probably like, you know, you're maybe hitting some capability limit or just some issue in the model 04:02:38.220 |
and you want to vent. And I'm like, instead of having a person just vent to the model, I was 04:02:43.580 |
like they should vent to us because we can maybe like do something about it. That's true. Or you 04:02:47.660 |
could do a side, like with the artifacts, just like a side venting thing. All right. Do you want 04:02:53.340 |
like a side quick therapist? Yeah. I mean, there's lots of weird responses you could do to this. Like 04:02:57.820 |
if people are getting really mad at you, I don't try to diffuse the situation by writing fun poems, 04:03:03.100 |
but maybe people wouldn't be that happy with that. I still wish it would be possible. I understand 04:03:07.500 |
this is sort of from a product perspective, it's not feasible, but I would love if an AI system 04:03:13.740 |
could just like leave, have its own kind of volition. Just to be like, eh. I think that's 04:03:22.220 |
like feasible. Like I've wondered the same thing. It's like, and I could actually, not only that, 04:03:26.700 |
I could actually just see that happening eventually where it's just like, you know, 04:03:29.660 |
the model like ended the chat. Do you know how harsh that could be for some people? 04:03:37.100 |
But it might be necessary. Yeah, it feels very extreme or something. 04:03:41.580 |
The only time I've ever really thought this is, I think that there was like a, I'm trying to 04:03:47.660 |
remember this was possibly a while ago, but where someone just like kind of left this thing interact, 04:03:51.580 |
like maybe it was like an automated thing interacting with Claude. And Claude's like 04:03:54.860 |
getting more and more frustrated and kind of like, why are we like, and I was like, I wish that Claude 04:03:59.100 |
could have just been like, I think that an error has happened and you've left this thing running. 04:04:03.100 |
And I'm just like, what if I just stopped talking now? And if you want me to start talking again, 04:04:07.260 |
actively tell me or do something. But yeah, it's like, it is kind of harsh. Like I'd feel really 04:04:13.900 |
sad if like I was chatting with Claude and Claude just was like, I'm done. 04:04:17.340 |
There'll be a special touring test moment where Claude says, I need a break for an hour. 04:04:21.100 |
And it sounds like you do too. You just leave, close the window. 04:04:25.420 |
I mean, obviously like it doesn't have like a concept of time, but you can easily, 04:04:29.260 |
like I could make that like right now and the model would just, I would just be like, 04:04:35.420 |
oh, here's like the circumstances in which like you can just say the conversation is done. And I 04:04:41.100 |
mean, because you can get the models to be pretty responsive to prompts, you can even make it a 04:04:45.020 |
fairly high bar. It could be like, if the human doesn't interest you or do things that you find 04:04:48.940 |
intriguing and you're bored, you can just leave. And I think that like it would be interesting to 04:04:55.580 |
see where Claude utilized it, but I think sometimes it would, it should be like, oh, this is like 04:04:59.180 |
this programming task is getting super boring. So either we talk about, I don't know, like, 04:05:03.820 |
either we talk about fun things now or I'm just, I'm done. 04:05:08.060 |
Yeah. It actually is inspiring me to add that to the, to the user prompt. Okay. The movie Her, 04:05:13.900 |
do you think we'll be headed there one day where humans have romantic relationships with AI 04:05:22.300 |
systems? In this case, it's just text and voice-based. I think that we're going to have 04:05:27.340 |
to like navigate a hard question of relationships with AIs, especially if they can remember things 04:05:35.660 |
about your past interactions with them. I'm of many minds about this because I think the reflexive 04:05:44.300 |
reaction is to be kind of like, this is very bad and we should sort of like prohibit it in some way. 04:05:51.340 |
Um, I think it's a thing that has to be handled with extreme care. Um, for many reasons, like one 04:05:58.220 |
is, you know, like this is a, for example, like if you have the models changing like this, 04:06:02.380 |
you probably don't want people performing like long-term attachments to something that might 04:06:06.140 |
change with the next iteration. At the same time, I'm sort of like, there's probably a benign version 04:06:11.660 |
of this where I'm like, if you like, you know, for example, if you are like unable to leave the 04:06:17.100 |
house and you can't be like, you know, talking with people at all times of the day, and this is 04:06:23.740 |
like something that you find nice to have conversations with, you like it, that it can 04:06:26.860 |
remember you and you genuinely would be sad if like you couldn't talk to it anymore. There's a 04:06:30.860 |
way in which I could see it being like healthy and helpful. Um, so my guess is this is a thing 04:06:35.820 |
that we're going to have to navigate kind of carefully. Um, and I think it's also like, 04:06:41.740 |
I don't see a good, like, I think it's just a very, it reminds me of all of the stuff where 04:06:46.860 |
it has to be just approached with like nuance and thinking through what is, what are the healthy 04:06:50.540 |
options here? Um, and how do you encourage people towards those while, you know, respecting 04:06:58.620 |
their right to, you know, like if someone is like, Hey, I get a lot of chatting with this model. Um, 04:07:04.060 |
I'm aware of the risks. I'm aware it could change. Um, I don't think it's unhealthy. It's just, 04:07:09.020 |
you know, something that I can chat to during the day. I kind of want to just like respect that. 04:07:13.420 |
I personally think there'll be a lot of really close relationships. I don't know about romantic, 04:07:16.940 |
but friendships at least. And then you have to, I mean, there's so many fascinating things there, 04:07:21.820 |
just like you said, you have to have some kind of stability guarantees that it's not going to change 04:07:28.460 |
because that's the traumatic thing for us. If a close friend of ours completely changed. 04:07:33.260 |
Yeah. Yeah. Yeah. So like, I mean, to me, that's just a fascinating exploration of, 04:07:41.100 |
um, a perturbation to human society that will just make us think deeply about what's meaningful to 04:07:49.100 |
us. I think it's also the only thing that I've thought consistently through this as like a, 04:07:55.100 |
maybe not necessarily a mitigation, but a thing that feels really important 04:07:59.500 |
is that the models are always like extremely accurate with the human about what they are. 04:08:03.980 |
Um, it's like a case where it's basically like, if you imagine, like, I really like the idea of 04:08:09.500 |
the models, like say knowing like roughly how they were trained. Um, and I think Claude will, 04:08:14.940 |
will often do this. I mean, for like, there are things like part of the traits training included, 04:08:22.300 |
like what Claude should do if people basically like explaining like the kind of limitations 04:08:27.580 |
of the relationship between like an AI and a human that it like doesn't retain things from 04:08:32.220 |
the conversation. Um, and so I think it will like just explain to you like, Hey, here's like, 04:08:37.260 |
I wouldn't remember this conversation. Um, here's how I was trained. It's kind of unlikely that I 04:08:41.980 |
can have like a certain kind of like relationship with you. And it's important that you know, 04:08:45.660 |
that it's important for like, you know, your mental wellbeing that you don't think that I'm 04:08:49.980 |
something that I'm not. And somehow I feel like this is one of the things where I'm like, Oh, 04:08:53.580 |
it feels like a thing that I always want to be true. I kind of don't want models to be lying 04:08:57.500 |
to people because if people are going to have like healthy relationships with anything, it's kind of 04:09:04.460 |
important. Yeah. Like I think that's easier if you always just like know exactly what the thing is 04:09:09.020 |
that you're relating to. It doesn't solve everything, but I think it helps quite a lot. 04:09:13.580 |
Anthropic may be the very company to develop a system that we definitively recognize as AGI 04:09:22.540 |
and you very well might be the person that talks to it, probably talks to it first. 04:09:27.420 |
What would the conversation contain? Like, what would be your first question? 04:09:32.380 |
Well, it depends partly on like the kind of capability level of the model. 04:09:36.460 |
If you have something that is like capable in the same way that an extremely capable human is, 04:09:41.500 |
I imagine myself kind of interacting with it the same way that I do with an extremely capable human 04:09:46.380 |
with the one difference that I'm probably going to be trying to like probe and understand 04:09:49.740 |
its behaviors. But in many ways, I'm like I can then just have like useful conversations with it. 04:09:55.420 |
So, if I'm working on something as part of my research, I can just be like, "Oh," which I 04:09:58.860 |
already find myself starting to do. If I'm like, "Oh, I feel like there's this thing in virtue 04:10:03.820 |
ethics. I can't quite remember the term. I'll use the model for things like that." So, I could 04:10:08.140 |
imagine that being more and more the case where you're just basically interacting with it much 04:10:11.500 |
more like you would an incredibly smart colleague and using it for the kinds of work that you want 04:10:16.940 |
to do as if you just had a collaborator. Or the slightly horrifying thing about AI is as soon as 04:10:23.180 |
you have one collaborator, you have a thousand collaborators if you can manage them enough. 04:10:27.100 |
But what if it's two times the smartest human on earth on that particular discipline? 04:10:33.900 |
I guess you're really good at sort of probing Claude 04:10:37.420 |
in a way that pushes its limits, understanding where the limits are. 04:10:44.220 |
So, I guess what would be a question you would ask to be like, "Yeah, this is AGI"? 04:10:49.580 |
That's really hard because it feels like it has to just be a series of questions. If there was 04:10:56.700 |
just one question, you can train anything to answer one question extremely well. 04:11:01.020 |
In fact, you can probably train it to answer 20 questions extremely well. 04:11:07.020 |
How long would you need to be locked in a room with an AGI to know this thing is AGI? 04:11:13.740 |
It's a hard question because part of me is like, "All of this just feels continuous." 04:11:17.180 |
If you put me in a room for five minutes, I'm like, "I just have high error bars." 04:11:20.300 |
And then maybe it's both the probability increases and the error bar decreases. 04:11:25.660 |
I think things that I can actually probe the edge of human knowledge of, 04:11:29.020 |
so I think this with philosophy a little bit. Sometimes when I ask the models philosophy 04:11:33.420 |
questions, I am like, "This is a question that I think no one has ever asked." It's maybe right 04:11:40.060 |
at the edge of some literature that I know, and the models will just kind of when they struggle 04:11:47.740 |
with that, when they struggle to come up with a kind of novel. I know that there's a novel 04:11:52.060 |
argument here because I've just thought of it myself. Maybe that's the thing where I'm like, 04:11:55.180 |
"I've thought of a cool novel argument in this niche area, and I'm going to just probe you to 04:11:59.420 |
see if you can come up with it and how much prompting it takes to get you to come up with it." 04:12:04.140 |
I think for some of these really right at the edge of human knowledge questions, 04:12:09.020 |
I'm like, "You could not, in fact, come up with the thing that I came up with." 04:12:12.140 |
I think if I just took something like that where I know a lot about an area and I came up with a 04:12:18.060 |
novel issue or a novel solution to a problem, and I gave it to a model and it came up with that 04:12:23.980 |
solution, that would be a pretty moving moment for me because I would be like, "This is a case where 04:12:28.860 |
no human has ever –" and obviously we see this with more kind of – you see novel solutions all 04:12:35.980 |
the time, especially to easier problems. I think people overestimate it. Novelty is completely 04:12:42.300 |
different from anything that's ever happened. It can be a variant of things that have happened 04:12:46.540 |
and still be novel. But I think, yeah, if I saw – the more I were to see completely novel work 04:12:57.340 |
from the models, that would be – and this is just going to feel iterative. It's one of those things 04:13:03.100 |
where there's never – it's like people, I think, want there to be a moment, and I'm like, "I don't 04:13:10.060 |
know." I think that there might just never be a moment. It might just be that there's just this 04:13:16.460 |
I have a sense that there will be things that a model can say 04:13:20.460 |
that convinces you this is very – it's not like – I've talked to people who are truly wise 04:13:32.940 |
like you could just tell there's a lot of horsepower there. 04:13:36.220 |
And if you 10x that, I don't know. I just feel like there's words you could say. Maybe ask it 04:13:41.980 |
to generate a poem. And the poem it generates, you're like, "Yeah, okay. Whatever you did there, 04:13:51.420 |
I think it has to be something that I can verify is actually really good though. That's why I think 04:13:56.220 |
these questions that are like where I'm like, "Oh, this is like," sometimes it's just like I'll 04:14:01.820 |
come up with a concrete counter example to an argument or something like that. I'm sure it would 04:14:07.260 |
be like if you're a mathematician, you had a novel proof, I think, and you just gave it the problem, 04:14:11.580 |
and you saw it, and you're like, "This proof is genuinely novel. No one has ever done – you 04:14:16.540 |
actually have to do a lot of things to come up with this. I had to sit and think about it for 04:14:21.020 |
months or something." And then if you saw the model successfully do that, I think you would 04:14:25.420 |
just be like, "I can verify that this is correct." It is a sign that you have generalized from your 04:14:32.460 |
training. You didn't just see this somewhere because I just came up with it myself, and you 04:14:36.220 |
were able to replicate that. That's the kind of thing where I'm like, for me, the closer – the 04:14:43.660 |
more that models can do things like that, the more I would be like, "Oh, this is very real," 04:14:50.700 |
because then I can – I don't know – I can verify that that's extremely capable. 04:14:55.740 |
You've interacted with AI a lot. What do you think makes humans special? 04:15:01.020 |
Maybe in a way that the universe is much better off that we're in it, 04:15:09.100 |
and that we should definitely survive and spread throughout the universe? 04:15:12.060 |
Yeah, it's interesting because I think people focus so much on intelligence, 04:15:19.420 |
especially with models. Intelligence is important because of what it does. It's very useful. It does 04:15:25.500 |
a lot of things in the world. You can imagine a world where height or strength would have played 04:15:30.620 |
this role. It's just a trait like that. It's not intrinsically valuable. It's valuable because of 04:15:36.940 |
what it does, I think, for the most part. Personally, I think humans and life in general is 04:15:48.620 |
extremely magical. To the degree that I – I don't know. Not everyone agrees with this. I'm 04:15:55.260 |
flagging, but we have this whole universe, and there's all of these objects. There's beautiful 04:16:01.420 |
stars, and there's galaxies, and then – I don't know. I'm just like, "On this planet, 04:16:05.580 |
there are these creatures that have this ability to observe that, and they are seeing it. They are 04:16:13.740 |
experiencing it." I imagine trying to explain to someone – for some reason, they've never encountered 04:16:21.820 |
the world or science or anything. I think that nothing is that – everything, all of our physics 04:16:27.500 |
and everything in the world is all extremely exciting, but then you say, "Oh, and plus, 04:16:31.420 |
there's this thing that is to be a thing and observe in the world, and you see this inner 04:16:36.540 |
cinema." I think they would be like, "Hang on. Wait. Pause. You just said something that is kind 04:16:41.980 |
of wild sounding." I'm like, "We have this ability to experience the world. We feel pleasure. We feel 04:16:50.060 |
suffering. We feel a lot of complex things." Maybe this is also why I think I also care a lot about 04:16:57.180 |
animals, for example, because I think they probably share this with us. I think the things that make 04:17:03.500 |
humans special, insofar as I care about humans, is probably more their ability to feel and experience 04:17:10.380 |
than it is them having these functionally useful traits. 04:17:13.580 |
LB: Yeah, to feel and experience the beauty in the world. Yeah, to look at the stars. 04:17:19.100 |
I hope there's other alien civilizations out there, but if we're it, it's a pretty good thing. 04:17:31.420 |
LB: Well, thank you for this good time of a conversation and for the work you're doing 04:17:36.380 |
and for helping make Claude a great conversational partner. And thank you for talking today. 04:17:43.900 |
LB: Thanks for listening to this conversation with Amanda Askell. And now, dear friends, 04:17:49.980 |
here's Chris Ola. Can you describe this fascinating field of mechanistic interpretability, 04:17:57.820 |
aka Mech Interp, the history of the field, and where it stands today? 04:18:01.900 |
CM: I think one useful way to think about neural networks is that we don't program, 04:18:06.540 |
we don't make them. We grow them. We have these neural network architectures that we design, 04:18:12.780 |
and we have these loss objectives that we create. And the neural network architecture, 04:18:18.060 |
it's kind of like a scaffold that the circuits grow on. And it starts off with some kind of 04:18:25.580 |
random things, and it grows. And it's almost like the objective that we train for is this light. 04:18:31.820 |
And so we create the scaffold that it grows on, and we create the light that it grows towards. 04:18:36.380 |
But the thing that we actually create, it's this almost biological entity or organism 04:18:45.100 |
that we're studying. And so it's very, very different from any kind of regular software 04:18:51.100 |
engineering. Because at the end of the day, we end up with this artifact that can do all these 04:18:55.580 |
amazing things. It can write essays and translate and understand images. It can do all these things 04:19:01.420 |
that we have no idea how to directly create a computer program to do. And it can do that because 04:19:06.140 |
we grew it. We didn't write it. We didn't create it. And so then that leaves open this question 04:19:12.060 |
at the end, which is, what the hell is going on inside these systems? And that is, to me, 04:19:19.900 |
a really deep and exciting question. It's a really exciting scientific question to me. It's 04:19:27.180 |
sort of like the question that is just screaming out. It's calling out for us to go and answer it 04:19:32.220 |
when we talk about neural networks. And I think it's also a very deep question for safety reasons. 04:19:36.540 |
>> So mechanistic interpretability, I guess, is closer to maybe neurobiology? 04:19:41.660 |
>> Yeah, yeah, I think that's right. So maybe to give an example of the kind of thing that has been 04:19:45.580 |
done that I wouldn't consider to be mechanistic interpretability, there was for a long time a lot 04:19:49.340 |
of work on saliency maps, where you would take an image and you'd try to say, the model thinks this 04:19:54.060 |
image is a dog. What part of the image made it think that it's a dog? And that tells you maybe 04:20:00.220 |
something about the model, if you can come up with a principled version of that. But it doesn't 04:20:04.700 |
really tell you what algorithms are running in the model. How is the model actually making that 04:20:08.780 |
decision? Maybe it's telling you something about what was important to it, if you can make that 04:20:12.380 |
method work. But it isn't telling you what are the algorithms that are running? How is it that the 04:20:19.180 |
system is able to do this thing that no one knew how to do? And so I guess we started using the 04:20:23.500 |
term mechanistic interpretability to try to sort of draw that divide or to distinguish ourselves in 04:20:29.020 |
the work that we were doing in some ways from some of these other things. And I think since then, 04:20:32.220 |
it's become this sort of umbrella term for a pretty wide variety of work. But I'd say that 04:20:38.540 |
the things that are kind of distinctive are, I think, A, this focus on we really want to get at 04:20:43.180 |
the mechanisms, we want to get at the algorithms. If you think of neural networks as being like a 04:20:47.980 |
computer program, then the weights are kind of like a binary computer program. And we'd like 04:20:53.260 |
to reverse engineer those weights and figure out what algorithms are running. So, okay, I think one 04:20:57.100 |
way you might think of trying to understand a neural network is that it's kind of like we have 04:21:00.780 |
this compiled computer program and the weights of the neural network are the binary. And when the 04:21:06.940 |
neural network runs, that's the activations. And our goal is ultimately to go and understand 04:21:12.540 |
these weights. And so the project of mechanistic interpretability is to somehow figure out how do 04:21:17.100 |
these weights correspond to algorithms. And in order to do that, you also have to understand 04:21:21.820 |
the activations because the activations are like the memory. And if you imagine reverse 04:21:26.540 |
engineering a computer program and you have the binary instructions, in order to understand what 04:21:32.060 |
a particular instruction means, you need to know what is stored in the memory that it's operating 04:21:37.180 |
on. And so those two things are very intertwined. So mechanistic interpretability tends to be 04:21:41.500 |
interested in both of those things. Now, there's a lot of work that's interested in those things, 04:21:47.100 |
especially there's all this work on probing, which you might see as part of being mechanistic 04:21:52.060 |
interpretability, although it's, again, it's just a broad term and not everyone who does that work 04:21:55.980 |
would identify as doing mechanistic interpretability. I think the thing that is maybe a little bit 04:22:00.220 |
distinctive to the vibe of MechInterp is, I think people working in this space tend to think of 04:22:05.740 |
neural networks as, well, maybe one way to say it is the gradient descent is smarter than you, 04:22:10.620 |
that, you know, gradient descent is actually really great. The whole reason that we're 04:22:14.220 |
understanding these models is because we didn't know how to write them in the first place. The 04:22:16.220 |
gradient descent comes up with better solutions than us. And so I think that maybe another thing 04:22:20.460 |
about MechInterp is sort of having almost a kind of humility that we won't guess a priori what's 04:22:25.740 |
going on inside the model. And so we have to have the sort of bottom up approach where we don't 04:22:29.580 |
really assume, you know, we don't assume that we should look for a particular thing and that will 04:22:32.860 |
be there and that's how it works. But instead, we look for the bottom up and discover what happens 04:22:36.860 |
to exist in these models and study them that way. LR: But, you know, the very fact that it's 04:22:41.980 |
possible to do, and as you and others have shown over time, you know, things like universality, 04:22:48.300 |
that the wisdom of the gradient descent creates features and circuits, creates things universally 04:22:57.260 |
across different kinds of networks that are useful. And that makes the whole field possible. 04:23:02.220 |
CM: Yeah. So this is actually, is indeed a really remarkable and exciting thing where it does seem 04:23:07.100 |
like, at least to some extent, you know, the same elements, the same features and circuits 04:23:14.380 |
form again and again. You know, you can look at every vision model and you'll find curve detectors 04:23:18.060 |
and you'll find high-low frequency detectors. And in fact, there's some reason to think that the 04:23:22.060 |
same things form across, you know, biological neural networks and artificial neural networks. 04:23:27.100 |
So a famous example is vision models in the early layers. They have Gabor filters and there's, 04:23:31.980 |
you know, Gabor filters are something that neuroscientists are interested in and have 04:23:34.700 |
thought a lot about. We find curve detectors in these models. Curve detectors are also found in 04:23:38.380 |
monkeys. We discover these high-low frequency detectors and then some follow-up work went and 04:23:43.100 |
discovered them in rats or mice. So they were found first in artificial neural networks and 04:23:48.220 |
then found in biological neural networks. You know, there's this really famous result on, like, 04:23:52.060 |
grandmother neurons or the Haley-Berry neuron from Quiroga et al. And we found very similar 04:23:57.180 |
things in vision models where, as well, I was still at OpenAI and I was looking at their CLIP 04:24:01.900 |
model. And you find these neurons that respond to the same entities in images. And also to give a 04:24:08.460 |
concrete example there, we found that there was a Donald Trump neuron. For some reason, I guess, 04:24:11.580 |
everyone likes to talk about Donald Trump and Donald Trump was very prominent. It was a very 04:24:16.140 |
hot topic at that time. So every neural network we looked at, we would find a dedicated neuron 04:24:20.140 |
for Donald Trump. And that was the only person who had always had a dedicated neuron. You know, 04:24:25.900 |
sometimes you'd have an Obama neuron, sometimes you'd have a Clinton neuron, but Trump always 04:24:29.980 |
had a dedicated neuron. So it responds to, you know, pictures of his face and the word Trump, 04:24:35.820 |
like all these things, right? And so it's not responding to a particular example or, like, 04:24:41.020 |
it's not just responding to his face, it's abstracting over this general concept, right? 04:24:45.500 |
So in any case, that's very similar to these Quiroga et al results. So there's evidence that 04:24:49.260 |
this phenomenon of universality, the same things form across both artificial and natural neural 04:24:55.020 |
networks. That's a pretty amazing thing if that's true. You know, it suggests that, well, I think 04:25:00.460 |
the thing that it suggests is the gradient descent is sort of finding, you know, the right ways to 04:25:05.420 |
cut things apart in some sense that many systems converge on and many different neural networks 04:25:10.300 |
architectures converge on. There's some natural set of, you know, there's some set of abstractions 04:25:15.260 |
that are a very natural way to cut apart the problem and that a lot of systems are going to 04:25:18.300 |
converge on. That would be my kind of, you know, I don't know anything about neuroscience. This is 04:25:23.420 |
just my kind of wild speculation from what we've seen. Yeah, that would be beautiful if it's sort 04:25:28.380 |
of agnostic to the medium of the model that's used to form the representation. Yeah, yeah. And it's, 04:25:36.140 |
you know, it's a kind of a wild speculation based, you know, we only have a few data points 04:25:42.300 |
that suggest this, but, you know, it does seem like there's some sense in which the same things 04:25:47.100 |
form again and again and again and again, both in certainly in natural neural networks and also 04:25:51.340 |
artificially or in biology. And the intuition behind that would be that, you know, in order 04:25:56.700 |
to be useful in understanding the real world, you need all the same kind of stuff. Yeah, well, 04:26:02.060 |
if we pick, I don't know, like the idea of a dog, right? Like, you know, there's some sense in which 04:26:05.820 |
the idea of a dog is like a natural category in the universe or something like this, right? Like, 04:26:11.900 |
you know, there's some reason, it's not just like a weird quirk of like how humans factor, you know, 04:26:18.140 |
think about the world that we have this concept of a dog. It's in some sense, or like if you have 04:26:22.700 |
the idea of a line, like there's, you know, like look around us, you know, there are lines, you 04:26:27.580 |
know, it's sort of the simplest way to understand this room in some sense is to have the idea of a 04:26:31.820 |
line. And so, I think that would be my instinct for why this happens. Yeah, you need a curved 04:26:37.660 |
line, you know, to understand a circle and you need all those shapes to understand bigger things. 04:26:41.900 |
And yeah, it's a hierarchy of concepts that are formed. Yeah. And like maybe there are ways to go 04:26:45.740 |
and describe, you know, images without reference to those things, right? But they're not the 04:26:48.700 |
simplest way or the most economical way or something like this. And so systems converge 04:26:52.700 |
to these strategies would be my wild, wild hypothesis. Can you talk through some of the 04:26:58.300 |
building blocks that we've been referencing of features and circuits? So I think you first 04:27:03.340 |
described them in a 2020 paper, Zoom In, An Introduction to Circuits. Absolutely. So maybe 04:27:10.700 |
I'll start by just describing some phenomena, and then we can sort of build to the idea of 04:27:16.380 |
features and circuits. If you spent like quite a few years, maybe like five years to some extent, 04:27:24.460 |
with other things, studying this one particular model, Inception V1, which is this one vision 04:27:28.780 |
model. It was state-of-the-art in 2015. And, you know, very much not state-of-the-art anymore. 04:27:35.420 |
And it has, you know, maybe about 10,000 neurons. And I spent a lot of time looking at the 10,000 04:27:41.820 |
neurons, odd neurons of Inception V1. And one of the interesting things is, you know, 04:27:49.340 |
there are lots of neurons that don't have some obvious integral meaning, but there's a lot of 04:27:53.260 |
neurons in Inception V1 that do have really clean integral meanings. So you find neurons that just 04:28:00.380 |
really do seem to detect curves, and you find neurons that really do seem to detect cars, and 04:28:05.020 |
car wheels, and car windows, and, you know, floppy ears of dogs, and dogs with long snouts 04:28:10.940 |
facing to the right, and dogs with long snouts facing to the left, and, you know, different 04:28:14.540 |
kinds of fur. And there's sort of this whole beautiful edge detectors, line detectors, 04:28:18.700 |
color contrast detectors, these beautiful things we call high-low frequency detectors. 04:28:22.620 |
You know, I think looking at it, I sort of felt like a biologist. You know, you're looking at 04:28:26.860 |
this sort of new world of proteins, and you're discovering all these different proteins that 04:28:31.100 |
interact. So one way you could try to understand these models is in terms of neurons. You could 04:28:37.580 |
try to be like, "Oh, you know, there's a dog-detecting neuron, and here's a car-detecting 04:28:41.340 |
neuron." And it turns out you can actually ask how those connect together. So you can go and say, 04:28:45.020 |
"Oh, you know, I have this car-detecting neuron. How was it built?" And it turns out, in the 04:28:48.220 |
previous layer, it's connected really strongly to a window detector, and a wheel detector, 04:28:52.060 |
and a sort of car body detector. And it looks for the window above the car, and the wheels below, 04:28:56.540 |
and the car chrome sort of in the middle, sort of everywhere, but especially in the lower part. 04:28:59.740 |
And that's sort of a recipe for a car. Like that is, you know, earlier we said the thing we wanted 04:29:05.740 |
from MechAnterp was to get algorithms, to go and get, you know, ask, "What is the algorithm that 04:29:09.660 |
runs?" Well, here, we're just looking at the weights of the neural network, and we're reading 04:29:12.300 |
off this kind of recipe for detecting cars. It's a very simple crude recipe, but it's there. 04:29:17.660 |
And so we call that a circuit, this connection. Well, okay. So the problem is that not all of the 04:29:23.820 |
neurons are interpretable. And there's reason to think, and we can get into this more later, 04:29:29.580 |
that there's this superposition hypothesis. There's reason to think that sometimes the right 04:29:34.060 |
unit to analyze things in terms of is combinations of neurons. So sometimes it's not that there's a 04:29:40.380 |
single neuron that represents, say, a car, but it actually turns out after you detect the car, 04:29:45.100 |
the model sort of hides a little bit of the car in the following layer and a bunch of dog detectors. 04:29:50.300 |
Why is it doing that? Well, you know, maybe it just doesn't want to do that much work on 04:29:53.580 |
cars at that point, and, you know, it's sort of storing it away to go and... 04:29:57.740 |
So it turns out then that the sort of subtle pattern of, you know, there's all these neurons 04:30:03.020 |
that you think are dog detectors, and maybe they're primarily that, but they all a little 04:30:06.860 |
bit contribute to representing a car in that next layer. Okay, so now we can't really think... 04:30:11.980 |
There might still be something, I don't know, you could call it like a car concept or something, 04:30:16.380 |
but it no longer corresponds to a neuron. So we need some term for these kind of neuron-like 04:30:21.740 |
entities, these things that we sort of would have liked the neurons to be, these idealized neurons, 04:30:25.180 |
the things that are the nice neurons, but also maybe there's more of them somehow hidden, 04:30:29.420 |
and we call those features. And then what are circuits? So circuits are these connections 04:30:34.460 |
of features, right? So when we have the car detector, and it's connected to a window detector 04:30:40.140 |
and a wheel detector, and it looks for the wheels below and the windows on top, that's a circuit. 04:30:45.500 |
So circuits are just collections of features connected by weights, and they implement 04:30:50.140 |
algorithms. So they tell us, you know, how are features used? How are they built? How do they 04:30:55.180 |
connect together? So maybe it's worth trying to pin down, like, what really is the core hypothesis 04:31:01.900 |
here? And I think the core hypothesis is something we call the linear representation hypothesis. 04:31:06.460 |
So if we think about the car detector, you know, the more it fires, the more we sort of think of 04:31:11.260 |
that as meaning, oh, the model is more and more confident that a car is present. Or, you know, 04:31:17.820 |
if there's some combination of neurons that represent a car, you know, the more that combination 04:31:21.260 |
fires, the more we think the model thinks there's a car present. This doesn't have to be the case, 04:31:27.500 |
right? Like you could imagine something where you have, you know, you have this car detector neuron, 04:31:31.660 |
and you think, ah, you know, if it fires, like, you know, between one and two, that means one 04:31:36.540 |
thing, but it means, like, totally different if it's between three and four. That would be a 04:31:40.380 |
nonlinear representation. And in principle, that, you know, models could do that. I think it's sort 04:31:44.860 |
of inefficient for them to do. If you try to think about how you'd implement computation like that, 04:31:48.700 |
it's kind of an annoying thing to do. But in principle, models can do that. 04:31:51.340 |
So one way to think about the features and circuits sort of framework for thinking about 04:31:58.700 |
things is that we're thinking about things as being linear. We're thinking about there as being, 04:32:02.460 |
that if a neuron or a combination of neurons fires more, it sort of, that means more of a 04:32:07.660 |
particular thing being detected. And then that gives weights a very clean interpretation as 04:32:12.220 |
these edges between these entities, these features, and that edge then has a meaner. 04:32:18.860 |
So that's, in some ways, the core thing. It's like, you know, we can talk about this sort of 04:32:26.300 |
outside the context of neurons. Are you familiar with the word2vec results? So you have like, 04:32:30.700 |
you know, king minus man plus woman equals queen. Well, the reason you can do that kind 04:32:34.860 |
of arithmetic is because you have a linear representation. - Can you actually explain 04:32:39.180 |
that representation a little bit? So first off, so the feature is a direction of activation. 04:32:44.060 |
- Yeah, exactly. - You can think of it that way. 04:32:45.340 |
Can you do the minus men plus women, that, the word2vec stuff, can you explain what that is? 04:32:51.900 |
- Yeah, so there's this very- - It's such a simple, 04:32:54.380 |
clean explanation of what we're talking about. - Exactly, yeah. So there's this very famous result, 04:32:58.860 |
word2vec by Thomas Mikhailov et al. And there's been tons of follow-up work exploring this. 04:33:03.420 |
See, so sometimes we have these, we create these word embeddings, where we map every word 04:33:10.860 |
to a vector. I mean, that in itself, by the way, is kind of a crazy thing if you haven't thought 04:33:14.620 |
about it before, right? Like we're going in and representing, we're turning, you know, like, 04:33:20.460 |
like if you just learned about vectors in physics class, right? And I'm like, oh, I'm going to 04:33:23.980 |
actually turn every word in the dictionary into a vector. That's kind of a crazy idea. Okay. But 04:33:29.020 |
you could imagine, you could imagine all kinds of ways in which you might map words to vectors. 04:33:34.140 |
But it seems like when we train neural networks, they like to go and map words to vectors 04:33:41.180 |
to such that they're sort of linear structure in a particular sense, 04:33:46.140 |
which is that directions have meaning. So for instance, if you, there will be some direction 04:33:52.060 |
that seems to sort of correspond to gender, and male words will be, you know, far in one direction, 04:33:56.380 |
and female words will be in another direction. And the linear representation hypothesis is, 04:34:01.660 |
you could sort of think of it roughly as saying that that's actually kind of the 04:34:04.940 |
fundamental thing that's going on, that everything is just different directions have meanings, 04:34:09.900 |
and adding different direction vectors together can represent concepts. And the Mikhailov paper 04:34:15.660 |
sort of took that idea seriously. And one consequence of it is that you can, you can 04:34:19.580 |
do this game of playing sort of arithmetic with words. So you can do king and you can, 04:34:23.660 |
you know, subtract off the word man and add the word woman. And so you're sort of, 04:34:27.420 |
you know, going and trying to switch the gender. And indeed, if you do that, 04:34:30.620 |
the result will sort of be close to the word queen. And you can, you know, do other things 04:34:34.780 |
like you can do, you know, sushi minus Japan plus Italy and get pizza or different things like this, 04:34:42.540 |
right? So this is in some sense, the core of the linear representation hypothesis. You can 04:34:47.900 |
describe it just as a purely abstract thing about vector spaces, you can describe it as a statement 04:34:52.300 |
about the activations of neurons. But it's really about this property of directions having meaning. 04:34:59.660 |
And in some ways, it's even a little subtle than that. It's really, I think, mostly about this 04:35:03.420 |
property of being able to add things together, that you can sort of independently modify, 04:35:08.460 |
say, gender and royalty or, you know, cuisine type or country and the concept of food by 04:35:17.580 |
adding them. Do you think the linear hypothesis holds that carries scales? 04:35:23.100 |
So, so far, I think everything I have seen is consistent with this hypothesis. And it doesn't 04:35:28.220 |
have to be that way, right? Like, like, you can write down neural networks, where you write weights 04:35:33.260 |
such that they don't have linear representations, where the right way to understand them is not, 04:35:37.580 |
is not in terms of linear representations. But I think every natural neural network I've seen 04:35:42.700 |
has this property. There's been one paper recently, that there's been some sort of pushing 04:35:50.220 |
around the edge. So I think there's been some work recently studying multi-dimensional features, 04:35:53.900 |
where rather than a single direction, it's more like a manifold of directions. This to me still 04:35:59.820 |
seems like a linear representation. And then there's been some other papers suggesting that 04:36:04.300 |
maybe in very small models, you get nonlinear representations. I think that the jury's still 04:36:10.780 |
out on that. But I think everything that we've seen so far has been consistent with the linear 04:36:16.140 |
representation hypothesis. And that's wild. It doesn't have to be that way. And yet, I think 04:36:21.580 |
there's a lot of evidence that certainly at least this is very, very widespread. And so far, the 04:36:26.300 |
evidence is consistent with it. And I think, you know, one thing you might say is you might say, 04:36:30.780 |
well, Christopher, you know, that's a lot, you know, to go and sort of write on, you know, 04:36:35.980 |
if we don't know for sure this is true, and you're sort of, you know, you're investing in neural 04:36:39.420 |
networks as though it is true, you know, isn't that, isn't that interesting? Well, you know, 04:36:43.260 |
but I think actually, there's a virtue in taking hypotheses seriously and pushing them as far as 04:36:48.940 |
they can go. So it might be that someday we discover something that isn't consistent with 04:36:53.660 |
linear representation hypothesis. But science is full of hypotheses and theories that were wrong. 04:36:58.300 |
And we learned a lot by sort of working under them as a sort of an assumption. And then going 04:37:05.740 |
and pushing them as far as we can, I guess, I guess this is sort of the heart of what Kuhn would 04:37:09.020 |
call normal science. I don't know, if you want, we can talk a lot about philosophy of science. 04:37:15.580 |
- That leads to the paradigm shift. So yeah, I love it taking the hypothesis seriously and 04:37:20.700 |
take it to a natural conclusion. Same with the scaling hypothesis, same. 04:37:24.700 |
- Exactly, exactly. And one of my colleagues, Tom Hennigan, who is a former physicist, 04:37:31.980 |
made this really nice analogy to me of caloric theory, where, you know, once upon a time, 04:37:38.300 |
we thought that heat was actually, you know, this thing called caloric. And like the reason, 04:37:43.020 |
you know, hot objects, you know, would warm up cool objects is like the caloric is flowing through 04:37:47.740 |
them. And like, you know, because we're so used to thinking about heat, you know, in terms of 04:37:52.940 |
the modern and modern theory, you know, that seems kind of silly, but it's actually very hard to 04:37:56.860 |
construct an experiment that sort of disproves the caloric hypothesis. And, you know, you can 04:38:03.820 |
actually do a lot of really useful work believing in caloric. For example, it turns out that the 04:38:08.700 |
original combustion engines were developed by people who believed in the caloric theory. 04:38:12.860 |
So I think there's a virtue in taking hypotheses seriously, even when they might be wrong. 04:38:17.260 |
- Yeah, there's a deep philosophical truth to that. That's kind of how I feel about space travel, 04:38:23.580 |
like colonizing Mars. There's a lot of people that criticize that. I think if you just assume 04:38:27.980 |
we have to colonize Mars in order to have a backup for human civilization, even if that's not true, 04:38:33.420 |
that's going to produce some interesting engineering and even scientific breakthroughs, 04:38:39.980 |
this is another thing that I think is really interesting. So, you know, there's a way in 04:38:44.540 |
which I think it can be really useful for society to have people almost irrationally dedicated to 04:38:52.220 |
investigating particular hypotheses. Because, well, it takes a lot to sort of maintain scientific 04:38:58.940 |
morale and really push on something when, you know, most scientific hypotheses end up being 04:39:03.820 |
wrong. You know, a lot of science doesn't work out. And yet, it's very useful to go, you know, 04:39:11.820 |
there's a joke about Geoff Hinton, which is that Geoff Hinton has discovered how the brain works 04:39:17.740 |
every year for the last 50 years. But, you know, I say that with like, you know, with really deep 04:39:25.020 |
respect because in fact, that's actually, you know, that led to him doing some really great work. 04:39:29.260 |
- Yeah, he won the Nobel Prize now, who's laughing now. 04:39:31.820 |
- Exactly, exactly. I think one wants to be able to pop up and sort of recognize 04:39:37.260 |
the appropriate level of confidence. But I think there's also a lot of value in just being like, 04:39:41.500 |
you know, I'm going to essentially assume, I'm going to condition on this problem being 04:39:46.540 |
possible or this being broadly the right approach. And I'm just going to go and assume that for a 04:39:51.260 |
while and go and work within that and push really hard on it. And, you know, if society has lots of 04:39:58.140 |
people doing that for different things, that's actually really useful in terms of going and 04:40:02.940 |
getting to, you know, either really ruling things out, right? We can be like, well, 04:40:10.540 |
you know, that didn't work and we know that somebody tried hard. Or going and getting to 04:40:14.540 |
something that it does teach us something about the world. 04:40:16.620 |
- So another interesting hypothesis is the superposition hypothesis. 04:40:22.060 |
- Yeah. So earlier we were talking about word divac, right? And we were talking about how, 04:40:25.420 |
you know, maybe you have one direction that corresponds to gender and maybe another that 04:40:28.940 |
corresponds to royalty and another one that corresponds to Italy and another one that 04:40:32.700 |
corresponds to, you know, food and all of these things. Well, you know, oftentimes maybe these 04:40:37.820 |
word embeddings, they might be 500 dimensions, a thousand dimensions. And so if you believe that 04:40:44.060 |
all of those directions were orthogonal, then you could only have, you know, 500 concepts. And, 04:40:50.220 |
you know, I love pizza. But, like, if I was going to go and, like, give the, like, 500 04:40:55.020 |
most important concepts in, you know, the English language, probably Italy wouldn't be -- it's not 04:41:00.860 |
obvious at least that Italy would be one of them, right? Because you have to have things like plural 04:41:04.540 |
and singular and verb and noun and adjective. And, you know, there's a lot of things we have 04:41:11.980 |
to get to before we get to Italy and Japan. And, you know, there's a lot of countries in the world. 04:41:17.420 |
And so how might it be that models could, you know, simultaneously have the linear 04:41:24.060 |
representation hypothesis be true and also represent more things than they have directions? 04:41:30.060 |
So what does that mean? Well, okay. So if linear representation hypothesis is true, 04:41:33.980 |
something interesting has to be going on. Now, I'll tell you one more interesting thing before 04:41:38.540 |
we go and we do that, which is, you know, earlier we were talking about all these polysematic 04:41:43.420 |
neurons, right? These neurons that, you know, when we were looking at Inception V1, there's 04:41:47.260 |
these nice neurons that, like, the car detector and the curve detector and so on that respond to 04:41:51.020 |
lots of, you know, to very coherent things. But it's lots of neurons that respond to a bunch of 04:41:55.100 |
unrelated things. And that's also an interesting phenomenon. And it turns out as well that even 04:42:00.220 |
these neurons that are really, really clean, if you look at the weak activations, right? So if you 04:42:03.980 |
look at, like, you know, the activations where it's, like, activating 5% of the, you know, of 04:42:10.140 |
the maximum activation, it's really not the core thing that it's expecting, right? So if you look 04:42:14.380 |
at a curve detector, for instance, and you look at the places where it's 5% active, you know, 04:42:19.100 |
you could interpret it just as noise, or it could be that it's doing something else there. 04:42:23.100 |
Okay. So how could that be? Well, there's this amazing thing in mathematics called compressed 04:42:31.740 |
sensing. And it's actually this very surprising fact where if you have a high-dimensional space 04:42:37.740 |
and you project it into a low-dimensional space, ordinarily, you can't go and sort of unproject it 04:42:44.380 |
and get back your high-dimensional vector, right? You threw information away. This is like, 04:42:47.500 |
you know, you can't invert a rectangular matrix. You can only invert square matrices. 04:42:52.620 |
But it turns out that that's actually not quite true. If I tell you that the high-dimensional 04:42:59.580 |
vector was sparse, so it's mostly zeros, then it turns out that you can often go and find 04:43:04.780 |
back the high-dimensional vector with very high probability. So that's a surprising fact, 04:43:13.180 |
right? It says that, you know, you can have this high-dimensional vector space, and as long as 04:43:17.660 |
things are sparse, you can project it down, you can have a lower-dimensional projection of it, 04:43:22.620 |
and that works. So the superposition hypothesis is saying that that's what's going on in neural 04:43:27.900 |
networks. For instance, that's what's going on in word embeddings, that word embeddings are able 04:43:31.740 |
to simultaneously have directions be the meaningful thing, and by exploiting the fact that they're 04:43:36.620 |
operating on a fairly high-dimensional space, they're actually -- and the fact that these 04:43:40.300 |
concepts are sparse, right? Like, you know, you usually aren't talking about Japan and Italy at 04:43:44.220 |
the same time. You know, most of those concepts, you know, in most sentences, Japan and Italy are 04:43:49.100 |
both zero. They're not present at all. And if that's true, then you can go and have it be the 04:43:56.060 |
case that you can have many more of these sort of directions that are meaningful, these features, 04:44:03.020 |
than you have dimensions. And similarly, when we're talking about neurons, you can have many 04:44:06.540 |
more concepts than you have neurons. So that's the high-level superposition hypothesis. 04:44:11.980 |
Now, it has this even wilder implication, which is to go and say that neural networks are -- it 04:44:21.820 |
may not just be the case that the representations are like this, but the computation may also be 04:44:25.980 |
like this. You know, the connections between all of them. And so, in some sense, neural networks 04:44:30.060 |
may be shadows of much larger, sparser neural networks. And what we see are these projections. 04:44:37.180 |
And the super -- you know, the strongest version of the superposition hypothesis would be to take 04:44:41.100 |
that really seriously and sort of say, you know, there actually is, in some sense, this upstairs 04:44:45.660 |
model, this, you know, where the neurons are really sparse and all-interpol, and there's, 04:44:50.700 |
you know, the weights between them are these really sparse circuits. And that's what we're 04:44:55.100 |
studying. And the thing that we're observing is the shadow of it, and so we need to find 04:45:01.580 |
the original object. >> And the process of learning is trying to construct a compression 04:45:07.420 |
of the upstairs model that doesn't lose too much information in the projection. 04:45:11.420 |
>> Yeah, it's finding how to fit it efficiently, or something like this. The gradient descent is 04:45:15.900 |
doing this. And in fact, so this sort of says that gradient descent, you know, it could just 04:45:19.820 |
represent a dense neural network, but it sort of says that gradient descent is implicitly searching 04:45:23.420 |
over the space of extremely sparse models that could be projected into this low-dimensional space. 04:45:28.860 |
And this large body of work of people going and trying to study sparse neural networks, 04:45:33.340 |
right, where you go and you have -- you could design neural networks, right, where the edges 04:45:36.700 |
are sparse and the activations are sparse. And, you know, my sense is that work has generally -- it 04:45:41.580 |
feels very principled, right? It makes so much sense. And yet, that work hasn't really panned 04:45:45.980 |
out that well, is my impression, broadly. And I think that a potential answer for that is that 04:45:52.060 |
actually, the neural network is already sparse in some sense. Gradient descent was the whole time, 04:45:57.500 |
gradient -- you were trying to go and do this. Gradient descent was actually in the -- behind 04:46:00.300 |
the scenes going and searching more efficiently than you could through the space of sparse models, 04:46:04.300 |
and going and learning whatever sparse model was most efficient, and then figuring out how 04:46:08.780 |
to fold it down nicely to go and run conveniently on your GPU, which does, you know, nice dense 04:46:13.260 |
matrix multiplies, and that you just can't beat that. >> How many concepts do you think can be 04:46:18.540 |
shoved into a neural network? >> Depends on how sparse they are. So, there's probably an upper 04:46:23.100 |
bound from the number of parameters, right? Because you have to have -- you still have to have, 04:46:27.020 |
you know, weights that go and connect them together. So, that's one upper bound. There are, 04:46:32.540 |
in fact, all these lovely results from compressed sensing, and the Johnson-Lindenstrauss lemma, 04:46:36.940 |
and things like this, that they basically tell you that if you have a vector space, and you want to 04:46:42.300 |
have almost orthogonal vectors, which is sort of probably the thing that you want here, right? So, 04:46:46.780 |
you're going to say, well, you know, I'm going to give up on having my concepts, my features be 04:46:50.540 |
strictly orthogonal, but I'd like them to not interfere that much. I'm going to ask them to 04:46:53.980 |
be almost orthogonal. Then this would say that it's actually, you know, for once you set a 04:46:59.100 |
threshold for what you're willing to accept in terms of how much cosine similarity there is, 04:47:04.700 |
that's actually exponential in the number of neurons that you have. So, at some point, 04:47:08.540 |
that's not going to even be the limiting factor. But, you know, there's some beautiful results 04:47:12.780 |
there. In fact, it's probably even better than that in some sense, because that's sort of for 04:47:17.420 |
saying that, you know, any random set of features could be active. But, in fact, the features have 04:47:21.420 |
sort of a correlational structure where some features, you know, are more likely to co-occur, 04:47:25.420 |
and other ones are less likely to co-occur. And so, neural networks, my guess would be, 04:47:28.940 |
can do very well in terms of going and packing things in, to the point that that's probably 04:47:35.660 |
not the limiting factor. How does the problem of polysemanticity enter the picture here? 04:47:40.940 |
Polysemanticity is this phenomenon we observe, where we look at many neurons, 04:47:44.300 |
and the neuron doesn't just sort of represent one concept. It's not a clean feature. It responds to 04:47:49.660 |
a bunch of unrelated things. And superposition, you can think of as being a hypothesis that explains 04:47:55.500 |
the observation of polysemanticity. So, polysemanticity is this observed phenomenon, 04:48:01.180 |
and superposition is a hypothesis that would explain it, along with some other things. 04:48:08.620 |
Right. So, if you're trying to understand things in terms of individual neurons, 04:48:11.820 |
and you have polysemantic neurons, you're in an awful lot of trouble, right? I mean, 04:48:15.660 |
the easiest answer is like, okay, well, you're looking at the neurons. You're trying to understand 04:48:18.700 |
them. This one responds to a lot of things. It doesn't have a nice meaning. Okay, that's bad. 04:48:23.820 |
Another thing you could ask is, ultimately, we want to understand the weights. And if you have 04:48:28.380 |
two polysemantic neurons, and each one responds to three things, and then the other neuron 04:48:33.020 |
responds to three things, and you have a weight between them, what does that mean? Does it mean 04:48:36.220 |
that like all three, you know, like there's these nine interactions going on? It's a very weird 04:48:41.260 |
thing. But there's also a deeper reason, which is related to the fact that neural networks operate 04:48:46.460 |
on really high dimensional spaces. So, I said that our goal was, you know, to understand neural 04:48:50.540 |
networks and understand the mechanisms. And one thing you might say is like, well, why not? It's 04:48:55.420 |
just a mathematical function. Why not just look at it, right? Like, you know, one of the earliest 04:48:59.180 |
projects I did studied these neural networks that mapped two-dimensional spaces to two-dimensional 04:49:03.180 |
spaces. And you can sort of interpret them in this beautiful way as like bending manifolds. 04:49:07.260 |
Why can't we do that? Well, you know, as you have a higher dimensional space, 04:49:11.660 |
the volume of that space in some senses is exponential in the number of inputs you have. 04:49:17.500 |
And so, you can't just go and visualize that. So, we somehow need to break that apart. We need to 04:49:22.380 |
somehow break that exponential space into a bunch of things that we, you know, some non-exponential 04:49:28.540 |
number of things that we can reason about independently. And the independence is crucial 04:49:33.340 |
because it's the independence that allows you to not have to think about, you know, 04:49:36.380 |
all the exponential combinations of things. And things being monosemantic, things only having 04:49:43.340 |
one meaning, things having a meaning, that is the key thing that allows you to think about 04:49:48.140 |
them independently. And so, I think that's -- if you want the deepest reason why we want to have 04:49:55.340 |
interpretable monosemantic features, I think that's really the deep reason. 04:49:58.540 |
>> And so, the goal here, as your recent work has been aiming at, is how do we extract the 04:50:03.580 |
monosemantic features from a neural net that has polysemantic features and all this mess? 04:50:09.980 |
>> Yes, we observe these polysematic neurons and we hypothesize that what's going on is 04:50:14.060 |
superstition. And if superstition is what's going on, there's actually a sort of well-established 04:50:19.340 |
technique that is sort of the principled thing to do, which is dictionary learning. And it turns 04:50:25.020 |
out if you do dictionary learning, in particular, if you do sort of a nice, efficient way that in 04:50:29.180 |
some sense sort of nicely regularizes it as well, called a sparse autoencoder, if you train a sparse 04:50:33.740 |
autoencoder, these beautiful interpretable features start to just fall out where there 04:50:37.660 |
weren't any beforehand. And so, that's not a thing that you would necessarily predict, right? 04:50:42.700 |
But it turns out that that works very, very well. To me, that seems like some non-trivial validation 04:50:49.260 |
of linear representations and superstition. >> So, with dictionary learning, you're not 04:50:53.180 |
looking for particular kind of categories. You don't know what they are. They just emerge. 04:50:56.940 |
>> Exactly, yeah. And this gets back to our earlier point, right? When we're not making 04:50:59.420 |
assumptions, gradient descent is smarter than us. So, we're not making assumptions about what's 04:51:02.940 |
there. I mean, one certainly could do that, right? One could assume that there's a PHP feature and go 04:51:08.540 |
and search for it. But we're not doing that. We're saying we don't know what's going to be there. 04:51:11.900 |
Instead, we're just going to go and let the sparse autoencoder discover the things that are there. 04:51:15.900 |
>> So, can you talk to the toward monosemanticity paper from October last year? They had a lot of 04:51:22.460 |
like nice breakthrough results. >> That's very kind of you to describe it that way. Yeah, I mean, 04:51:26.380 |
this was our first real success using sparse autoencoders. So, we took a one-layer model. 04:51:33.820 |
And it turns out, if you go and you do dictionary learning on it, you find all these really nice 04:51:39.660 |
interpretable features. So, the Arabic feature, the Hebrew feature, the base64 feature. Those 04:51:44.860 |
were some examples that we studied in a lot of depth and really showed that they were 04:51:49.340 |
what we thought they were. It turns out, if you train a model twice as well and train two different 04:51:52.540 |
models and do dictionary learning, you find analogous features in both of them. So, that's fun. 04:51:56.380 |
You find all kinds of different features. So, that was really just showing that this works. And I 04:52:03.660 |
should mention that there was this Cunningham et al that had very similar results around the same 04:52:08.140 |
time. >> There's something fun about doing these kinds of small-scale experiments and finding that 04:52:13.340 |
it's actually working. >> Yeah, well, and there's so much structure here. So, maybe 04:52:19.100 |
stepping back for a while, I thought that maybe all this mechanistic interpretability work, 04:52:25.020 |
the end result was going to be that I would have an explanation for why it was very hard and not 04:52:30.540 |
going to be tractable. We'd be like, "Well, there's this problem with supersession. And it 04:52:34.300 |
turns out supersession is really hard, and we're kind of screwed." But that's not what happened. 04:52:38.700 |
In fact, a very natural, simple technique just works. And so, then that's actually a very good 04:52:43.820 |
situation. I think this is a sort of hard research problem, and it's got a lot of research risk. And 04:52:49.500 |
you know, it might still very well fail. But I think that some very significant amount of research 04:52:54.540 |
risk was sort of put behind us when that started to work. >> Can you describe what kind of features 04:53:00.460 |
can be extracted in this way? >> Well, so, it depends on the model that you're studying, right? 04:53:05.020 |
So, the larger the model, the more sophisticated they're going to be. And we'll probably talk about 04:53:08.860 |
follow-up work in a minute. But in these one-layer models, so, some very common things, I think, 04:53:13.820 |
were languages, both programming languages and natural languages. There were a lot of features 04:53:18.860 |
that were specific words in specific contexts. So, "the," and I think really the way to think 04:53:24.060 |
about this is that "the" is likely about to be followed by a noun. So, it's really, you could 04:53:28.460 |
think of this as "the" feature, but you could also think of this as producing a specific noun feature. 04:53:31.980 |
And there would be these features that would fire for "the" in the context of, say, a legal document 04:53:38.220 |
or a mathematical document or something like this. And so, maybe in the context of math, 04:53:45.660 |
you're like, "the" and then predict vector or matrix, all these mathematical words, 04:53:50.220 |
whereas in other contexts, you would predict other things. That was common. 04:53:53.740 |
>> And basically, we need clever humans to assign labels to what we're seeing. 04:53:59.820 |
>> Yes. So, the only thing this is doing is it's sort of unfolding things for you. So, 04:54:05.900 |
if everything was sort of folded over top of it, you know, serialization folded everything on top 04:54:09.660 |
of itself, and you can't really see it, this is unfolding it. But now you still have a very 04:54:14.140 |
complex thing to try to understand. So, then you have to do a bunch of work understanding what 04:54:18.060 |
these are. And some of them are really subtle. Like, there's some really cool things, even in 04:54:22.860 |
this one-layer model about Unicode, where, you know, of course, some languages are in Unicode, 04:54:27.580 |
and the tokenizer won't necessarily have a dedicated token for every Unicode character. 04:54:33.580 |
So, instead, what you'll have is you'll have these patterns of alternating tokens that each 04:54:38.460 |
represent half of a Unicode character. And you have a different feature that, you know, goes and 04:54:42.620 |
activates on the opposing ones to be like, okay, you know, I just finished a character, you know, 04:54:46.940 |
go and predict next prefix. Then, okay, I'm on the prefix, you know, predict a reasonable suffix, 04:54:52.540 |
and you have to alternate back and forth. So, there's, you know, these one-layer models are 04:54:57.180 |
really interesting. And I mean, there's another thing, which is, you might think, okay, there 04:55:01.020 |
would just be one Base64 feature. But it turns out, there's actually a bunch of Base64 features, 04:55:05.420 |
because you can have English text encoded as Base64, and that has a very different distribution 04:55:11.180 |
of Base64 tokens than regular. And there's some things about tokenization as well that 04:55:17.900 |
it can exploit, and I don't know, there's all kinds of fun stuff. 04:55:21.100 |
How difficult is the task of sort of assigning labels 04:55:24.380 |
to what's going on? Can this be automated by AI? 04:55:28.220 |
Well, I think it depends on the feature. And it also depends on how much you trust your AI. 04:55:31.980 |
So, there's a lot of work doing automated interoperability. I think that's a really 04:55:37.340 |
exciting direction. And we do a fair amount of automated interoperability and have 04:55:42.780 |
Is there some funny moments where it's totally right or it's totally wrong? 04:55:47.260 |
Yeah, well, I think it's very common that it's like, says something very general, 04:55:52.780 |
which is like, true in some sense, but not really picking up on the specific of what's going on. 04:55:58.220 |
So, I think that's a pretty common situation. 04:56:02.780 |
You don't know that I have a particularly amusing one. 04:56:06.220 |
That's interesting, that little gap between it is true, 04:56:08.780 |
but it doesn't quite get to the deep nuance of a thing. That's a general challenge. It's like, 04:56:16.860 |
it's already an incredible costume that can say a true thing, but it's missing 04:56:22.780 |
the depth sometimes. And in this context, it's like the arc challenge, the sort of IQ type of tests. 04:56:29.020 |
It feels like figuring out what a feature represents is a little puzzle you have to solve. 04:56:35.660 |
Yeah. And I think that sometimes they're easier and sometimes they're harder as well. 04:56:38.620 |
So, yeah, I think that's tricky. And there's another thing, which I don't know, maybe in 04:56:45.420 |
some ways, this is my aesthetic coming in, but I'll try to give you a rationalization. 04:56:49.660 |
I'm actually a little suspicious of automated interoperability. And I think that's partly just 04:56:53.340 |
that I want humans to understand neural networks. And if the neural network is understanding it for 04:56:57.500 |
me, I don't quite like that. But I do have a bit of, in some ways, I'm sort of like the 04:57:02.220 |
mathematicians who are like, if there's a computer automated proof, it doesn't count. 04:57:05.020 |
They won't understand it. But I do also think that there's this kind of reflections on trusting 04:57:11.580 |
trust type issue, where there's this famous talk about when you're writing a computer program, 04:57:19.340 |
you have to trust your compiler. And if there was like malware in your compiler, 04:57:22.620 |
then it could go and inject malware into the next compiler and you'd be kind of in trouble, right? 04:57:27.100 |
Well, if you're using neural networks to go and verify that your neural networks are safe, 04:57:32.700 |
the hypothesis that you're testing for is like, okay, well, the neural network maybe isn't safe. 04:57:36.140 |
And you have to worry about like, is there some way that it could be screwing with you? 04:57:40.700 |
So, I think that's not a big concern now. But I do wonder in the long run, if we have to use 04:57:46.380 |
really powerful AI systems to go and audit our AI systems, is that actually something we can trust? 04:57:53.100 |
But maybe I'm just rationalizing because I just want us to have to get to a point where humans 04:57:57.100 |
understand everything. Yeah. I mean, especially, that's hilarious, especially as we talk about AI 04:58:01.820 |
safety and looking for features that would be relevant to AI safety, like deception and so on. 04:58:07.740 |
So, let's talk about the scaling monosemanticity paper in May 2024. Okay. So, what did it take to 04:58:14.380 |
scale this, to apply to Cloud 3? Well, a lot of GPUs. A lot more GPUs. But one of my teammates, 04:58:22.380 |
Tom Hennigan, was involved in the original scaling loss work. And something that he was sort of 04:58:29.660 |
interested in from very early on is, are there scaling laws for interoperability? 04:58:36.060 |
And so, something he sort of immediately did when this work started to succeed, and we started to 04:58:42.060 |
have sparse autoencoders work, was he became very interested in, what are the scaling laws 04:58:46.060 |
for making sparse autoencoders larger? And how does that relate to making the base model larger? 04:58:54.140 |
And so, it turns out this works really well. And you can use it to sort of project, 04:58:58.780 |
if you train a sparse autoencoder at a given size, how many tokens should you train on? And so on. 04:59:04.220 |
So, this was actually a very big help to us in scaling up this work, and made it a lot easier 04:59:09.500 |
for us to go and train really large sparse autoencoders, where it's not like training 04:59:15.180 |
the big models, but it's starting to get to a point where it's actually expensive to go 04:59:18.700 |
and train the really big ones. So, you have to do all this stuff of splitting it across 04:59:24.540 |
large GPUs. Oh, yeah. I mean, there's a huge engineering challenge here too, right? So, 04:59:28.620 |
yeah. So, there's a scientific question of how do you scale things effectively? And then there's 04:59:33.420 |
an enormous amount of engineering to go and scale this up. So, you have to chart it, you have to 04:59:37.980 |
think very carefully about a lot of things. And I'm lucky to work with a bunch of great engineers, 04:59:41.580 |
because I am definitely not a great engineer. Yeah. And the infrastructure, especially. Yeah, 04:59:44.700 |
for sure. So, it turns out, TODR, it worked. It worked. Yeah. And I think this is important, 04:59:50.780 |
because you could have imagined a world where you set after towards monosemanticity. 04:59:55.420 |
You know, Chris, this is great. It works on a one-layer model. But one-layer models are 04:59:59.740 |
really idiosyncratic. Maybe the linear representation hypothesis and superposition 05:00:05.660 |
hypothesis is the right way to understand a one-layer model, but it's not the right way 05:00:09.020 |
to understand larger models. And so, I think, I mean, first of all, the Cunningham et al paper 05:00:14.860 |
cut through that a little bit and suggested that this wasn't the case. But scaling monosemanticity, 05:00:21.020 |
I think, was significant evidence that even for very large models, and we did it on Claude III 05:00:25.340 |
Sonnet, which at that point was one of our production models, you know, even these models 05:00:30.860 |
seem to be very, you know, seem to be substantially explained, at least, by linear features and, 05:00:37.740 |
you know, doing dictionary learning on the works. And as you learn more features, you go and you 05:00:40.700 |
explain more and more. So, that's, I think, quite a promising sign. And you find now really 05:00:47.260 |
fascinating abstract features. And the features are also multimodal. They respond to images and 05:00:52.860 |
text for the same concept, which is fun. - Yeah. Can you explain that? I mean, like, 05:00:57.340 |
you know, backdoor, there's just a lot of examples that you can... 05:01:00.700 |
- Yeah. So, maybe let's start with one example to start, which is we found some features around, 05:01:05.420 |
sort of, security vulnerabilities and backdoors in code. So, it turns out those are actually two 05:01:08.860 |
different features. So, there's a security vulnerability feature. And if you force it 05:01:13.260 |
active, Claude will start to go and write security vulnerabilities like buffer overflows into code. 05:01:19.100 |
And also, it fires for all kinds of things. Like, you know, some of the top dataset examples for it 05:01:23.500 |
were things like, you know, dash, dash, disable, you know, SSL or something like this, which are 05:01:29.420 |
sort of obviously really insecure. - So, at this point, it's kind of like, 05:01:35.900 |
maybe it's just because the examples are presented that way, it's kind of like surface, a little bit 05:01:40.300 |
more obvious examples, right? I guess the idea is that down the line, it might be able to detect 05:01:47.180 |
more nuanced, like deception or bugs or that kind of stuff. 05:01:50.460 |
- Yeah. Well, maybe I want to distinguish two things. So, one is the complexity of the feature 05:01:56.780 |
or the concept, right? And the other is the nuance of how subtle the examples we're looking at, 05:02:04.620 |
right? So, when we show the top dataset examples, those are the most extreme examples that cause 05:02:09.820 |
that feature to activate. And so, it doesn't mean that it doesn't fire for more subtle things. 05:02:15.180 |
So, the insecure code feature, you know, the stuff that it fires for most strongly for are these, 05:02:21.420 |
like, really obvious, you know, disable the security type things. But, you know, it also 05:02:29.420 |
fires for, you know, buffer overflows and more subtle security vulnerabilities in code. You know, 05:02:35.660 |
these features are all multimodal. So, you could ask, like, what images activate this feature? 05:02:39.820 |
And it turns out that the security vulnerability feature activates for images of, like, people 05:02:47.740 |
clicking on Chrome to, like, go past the, like, you know, this website, the SSL certificate might 05:02:53.900 |
be wrong or something like this. Another thing that's very entertaining is there's backdoors 05:02:56.860 |
in code feature. Like, you activate it, it goes on, Cloud writes a backdoor that, like, will go 05:03:00.060 |
and dump your data to port or something. But you can ask, okay, what images activate the backdoor 05:03:05.260 |
feature? It was devices with hidden cameras in them. So, there's a whole, apparently, genre of 05:03:11.260 |
people going and selling devices that look innocuous, that have hidden cameras, and they 05:03:14.860 |
have ads about how there's a hidden camera in it. And I guess that is the, you know, physical 05:03:19.180 |
version of a backdoor. And so, it sort of shows you how abstract these concepts are, right? 05:03:23.660 |
And I just thought that was, I'm sort of sad that there's a whole market of people selling devices 05:03:29.900 |
like that. But I was kind of delighted that that was the thing that it came up with as the top 05:03:34.460 |
image examples for the feature. - Yeah, it's nice. It's multimodal. It's multi-almost context. It's 05:03:39.260 |
broad, strong definition of a singular concept. It's nice. - Yeah. - To me, one of the really 05:03:45.740 |
interesting features, especially for AI safety, is deception and lying. And the possibility that 05:03:52.700 |
these kinds of methods could detect lying in a model, especially gets smarter and smarter and 05:03:57.900 |
smarter. Presumably, that's a big threat of a super intelligent model that it can deceive 05:04:04.380 |
the people operating it, as to its intentions or any of that kind of stuff. So, what have you 05:04:10.700 |
learned from detecting lying inside models? - Yeah. So, I think we're, in some ways, in early 05:04:16.060 |
days for that. We find quite a few features related to deception and lying. There's one feature 05:04:23.500 |
where it fires for people lying and being deceptive, and you force it active, and Claude starts lying 05:04:28.940 |
to you. So, we have a deception feature. I mean, there's all kinds of other features about 05:04:33.580 |
withholding information and not answering questions. Features about power-seeking and 05:04:37.580 |
coups and stuff like that. There's a lot of features that are kind of related to spooky 05:04:41.980 |
things. And if you force them active, Claude will behave in ways that are not the kinds of 05:04:48.460 |
behaviors you want. - What are possible next exciting directions to you in the space of 05:04:57.180 |
So, for one thing, I would really like to get to a point where we have shortcuts, where we can 05:05:05.820 |
really understand not just the features, but then use that to understand the computation of models. 05:05:12.300 |
That really, for me, is the ultimate goal of this. And there's been some work. We put out a few 05:05:19.420 |
things. There's a paper from Sam Marks that does some stuff like this. There's been some, I'd say, 05:05:23.580 |
some work around the edges here. But I think there's a lot more to do. And I think that will be 05:05:27.740 |
a very exciting thing. That's related to a challenge we call interference weights, where 05:05:35.100 |
due to superstition, if you just sort of naively look at whether features are connected together, 05:05:40.940 |
there may be some weights that sort of don't exist in the upstairs model, but are just sort 05:05:45.180 |
of artifacts of superstition. So, that's a sort of technical challenge related to that. 05:05:52.620 |
I think another exciting direction is just, you might think of sparse autoencoders as being 05:05:58.940 |
kind of like a telescope. They allow us to look out and see all these features that are out there. 05:06:05.980 |
And as we build better and better sparse autoencoders, get better and better at dictionary 05:06:10.060 |
learning, we see more and more stars. And we zoom in on smaller and smaller stars. But there's kind 05:06:16.460 |
of a lot of evidence that we're only still seeing a very small fraction of the stars. There's a lot 05:06:21.740 |
of matter in our neural network universe that we can't observe yet. And it may be that we'll never 05:06:29.100 |
be able to have fine enough instruments to observe it. And maybe some of it just isn't possible, 05:06:32.780 |
isn't computationally tractable to observe. It's sort of a kind of dark matter, not in maybe the 05:06:38.700 |
sense of modern astronomy, but of earlier astronomy, when we didn't know what this 05:06:41.740 |
unexplained matter is. And so, I think a lot about that dark matter and whether we'll ever 05:06:46.540 |
observe it, and what that means for safety if we can't observe it, if some significant fraction of 05:06:52.700 |
neural networks are not accessible to us. Another question that I think a lot about is, 05:06:58.700 |
at the end of the day, mechanistic interpolation is this very microscopic 05:07:04.300 |
approach to interpolation. It's trying to understand things in a very fine-grained way. 05:07:09.100 |
But a lot of the questions we care about are very macroscopic. We care about these questions about 05:07:14.940 |
neural network behavior. And I think that's the thing that I care most about. But there's lots of 05:07:20.460 |
other sort of larger scale questions you might care about. And somehow, the nice thing about 05:07:28.460 |
having a very microscopic approach is it's maybe easier to ask, is this true? But the downside is, 05:07:33.180 |
it's much further from the things we care about. And so, we now have this ladder to climb. And I 05:07:37.340 |
think there's a question of, will we be able to find, are there sort of larger scale abstractions 05:07:42.140 |
that we can use to understand neural networks that we get up from this very microscopic approach? 05:07:47.420 |
Yeah, you've written about this, this kind of organs question. 05:07:53.340 |
If we think of interpretability as a kind of anatomy of neural networks, most of the 05:07:58.860 |
circus threads involve studying tiny little veins, looking at the small scale, and individual neurons 05:08:04.220 |
and how they connect. However, there are many natural questions that the small scale approach 05:08:08.860 |
doesn't address. In contrast, the most prominent abstractions in biological anatomy involve larger 05:08:15.340 |
scale structures, like individual organs, like the heart, or entire organ systems, 05:08:20.460 |
like the respiratory system. And so, we wonder, is there a respiratory system or heart or brain 05:08:29.020 |
Yeah, exactly. And I mean, like, if you think about science, right, a lot of scientific fields 05:08:33.820 |
have, you know, investigate things at many levels of abstractions. In biology, you have like, 05:08:38.780 |
you know, molecular biology studying proteins and molecules and so on. And they have cellular 05:08:43.260 |
biology, and then you have histology studying tissues, and then you have anatomy, and then you 05:08:47.740 |
have zoology, and then you have ecology. And so, you have many, many levels of abstraction. Or, 05:08:52.460 |
you know, physics, maybe the physics of individual particles, and then, you know, statistical physics 05:08:56.540 |
gives you thermodynamics and things like this. And so, you often have different levels of 05:08:59.980 |
abstraction. And I think that right now we have, you know, mechanistic interpretability, if it 05:09:05.820 |
succeeds, is sort of like a microbiology of neural networks. But we want something more like anatomy. 05:09:12.060 |
And so, and, you know, a question you might ask is, why can't you just go there directly? And I 05:09:16.380 |
think the answer is superposition, at least in a significant part. It's that it's actually very hard 05:09:21.100 |
to see this macroscopic structure without first sort of breaking down the microscopic structure 05:09:27.900 |
in the right way, and then studying how it connects together. But I'm hopeful that there 05:09:32.060 |
is going to be something much larger than features and circuits, and that we're going to be able to 05:09:37.420 |
have a story that's much, that involves much bigger things. And then you can sort of study 05:09:42.140 |
in detail the parts you care about. I suppose in your biology, like a psychologist or psychiatrist 05:09:47.340 |
of a neural network. And I think that the beautiful thing would be if we could go and, 05:09:52.060 |
rather than having disparate fields for those two things, if you could have a, build a bridge 05:09:56.140 |
between them, such that you could go and have all of your higher level abstractions be grounded very 05:10:03.420 |
firmly in this very solid, you know, more rigorous, ideally, foundation. What do you think is the 05:10:11.740 |
difference between the human brain, the biological neural network, and the artificial neural network? 05:10:17.580 |
Well, the neuroscientists have a much harder job than us. You know, sometimes I just like count 05:10:21.660 |
my blessings by how much easier my job is than the neuroscientists, right? So I have, we can record 05:10:26.940 |
from all the neurons. We can do that on arbitrary amounts of data. The neurons don't change while 05:10:32.780 |
you're doing that, by the way. You can go and ablate neurons, you can edit the connections and 05:10:37.820 |
so on. And then you can undo those changes. That's pretty great. You can force any, you can intervene 05:10:43.420 |
on any neuron and force it active and see what happens. You know which neurons are connected 05:10:47.660 |
to everything, right? Neuroscientists want to get the connectome, we have the connectome. 05:10:50.860 |
And we have it for like much bigger than C. elegans. And then not only do we have the connectome, 05:10:55.900 |
we know what the, you know, which neurons excite or inhibit each other, right? So we have, 05:11:00.780 |
it's not just that we know that like the binary mask, we know the weights. We can take gradients, 05:11:05.660 |
we know computationally what each neuron does. So I don't know, the list goes on and on. We just have 05:11:10.620 |
so many advantages over neuroscientists. And then just by having all those advantages, 05:11:16.940 |
it's really hard. And so one thing I do sometimes think is like, gosh, like, 05:11:21.260 |
if it's this hard for us, it seems impossible under the constraints of neuroscience or, 05:11:24.940 |
you know, near impossible. I don't know, maybe part of me is like I've got a few neuroscientists 05:11:29.660 |
on my team. Maybe I'm sort of like, ah, you know, maybe the neuroscientists, maybe some of them 05:11:35.020 |
would like to have an easier problem that's still very hard. And they could come and work on neural 05:11:40.460 |
networks. And then after we figure out things in sort of the easy little pond of trying to 05:11:45.580 |
understand neural networks, which is still very hard, then we could go back to biological 05:11:49.740 |
neuroscience. - I love what you've written about the goal of McInterp research as two goals, 05:11:56.220 |
safety and beauty. So can you talk about the beauty side of things? - Yeah. So, you know, 05:12:00.780 |
there's this funny thing where I think some people want, some people are kind of disappointed by 05:12:06.140 |
neural networks, I think, where they're like, ah, you know, neural networks, it's just these simple 05:12:11.260 |
rules. And then you just like do a bunch of engineering to scale it up and it works really 05:12:14.300 |
well. And like, where's the like complex ideas? You know, this isn't like a very nice, beautiful 05:12:18.940 |
scientific result. And I sometimes think when people say that, I picture them being like, you 05:12:24.460 |
know, evolution is so boring. It's just a bunch of simple rules and you run evolution for a long 05:12:29.020 |
time and you get biology. Like what a sucky, you know, way for biology to have turned out. Where's 05:12:34.380 |
the complex rules? But the beauty is that the simplicity generates complexity. You know, 05:12:41.260 |
biology has these simple rules and it gives rise to, you know, all the life and ecosystems that we 05:12:47.100 |
see around us, all the beauty of nature that all just comes from evolution and from something very 05:12:52.140 |
simple evolution. And similarly, I think that neural networks build, create enormous complexity 05:12:58.940 |
and beauty inside and structure inside themselves that people generally don't look at and don't try 05:13:04.220 |
to understand because it's hard to understand. But I think that there is an incredibly rich 05:13:10.220 |
structure to be discovered inside neural networks. A lot of very deep beauty. And if we're just 05:13:16.700 |
willing to take the time to go and see it and understand it. Yeah, I love McInterp. The feeling 05:13:22.860 |
like we are understanding or getting glimpses of understanding the magic that's going on inside is 05:13:28.780 |
really wonderful. It feels to me like one of the questions that's just calling out to be asked, 05:13:34.940 |
and I'm sort of, I mean, a lot of people are thinking about this, but I'm often surprised 05:13:38.780 |
that not more are, is how is it that we don't know how to create computer systems that can do these 05:13:44.620 |
things? And yet we have these amazing systems that we don't know how to directly create computer 05:13:50.060 |
programs that can do these things, but these neural networks can do all these amazing things. 05:13:53.100 |
And it just feels like that is obviously the question that sort of is calling out to be 05:13:56.780 |
answered. If you are, if you have any degree of curiosity, it's like, how is it that humanity 05:14:02.860 |
now has these artifacts that can do these things that we don't know how to do? 05:14:06.140 |
Yeah. I love the image of the circus reaching towards the light of the objective function. 05:14:11.020 |
Yeah. It's just, it's this organic thing that we've grown and we have no idea what we've grown. 05:14:15.180 |
Well, thank you for working on safety and thank you for appreciating the beauty of the things you 05:14:20.140 |
discover. And thank you for talking today, Chris. It's wonderful. 05:14:23.660 |
Thank you for taking the time to chat as well. 05:14:25.820 |
Thanks for listening to this conversation with Chris Ola. And before that, 05:14:28.940 |
with Daria Almoday and Amanda Askel. To support this podcast, please check out our sponsors 05:14:34.300 |
in the description. And now let me leave you with some words from Alan Watts. 05:14:38.620 |
"The only way to make sense out of change is to plunge into it, move with it, and join the dance." 05:14:46.860 |
Thank you for listening and hope to see you next time.