
Personal benchmarks vs HumanEval - with Nicholas Carlini of DeepMind


Chapters

0:00 Introductions
1:14 Why Nicholas writes
2:09 The Game of Life
5:07 "How I Use AI" blog post origin story
8:24 Do we need software engineering agents?
11:03 Using AI to kickstart a project
14:08 Ephemeral software
17:37 Using AI to accelerate research
21:34 Experts vs non-expert users as beneficiaries of AI
24:02 Research on generating less secure code with LLMs
27:22 Learning and explaining code with AI
30:12 AGI speculations?
32:50 Distributing content without social media
35:39 How much data do you think you can put on a single piece of paper?
37:37 Building personal AI benchmarks
43:04 Evolution of prompt engineering and its relevance
46:06 Model vs task benchmarking
52:14 Poisoning LAION 400M through expired domains
55:38 Stealing OpenAI models from their API
61:29 Data stealing and recovering training data from models
63:30 Finding motivation in your work

Whisper Transcript

00:00:00.000 | (upbeat music)
00:00:02.580 | - Hey, everyone.
00:00:05.000 | Welcome to the Latent Space Podcast.
00:00:06.680 | This is Alessio, partner and CTO
00:00:08.440 | in Residence at Decibel Partners,
00:00:09.940 | and I'm joined by my co-host, Swyx, founder of Smol AI.
00:00:13.000 | - Hey, and today we're in the in-person studio,
00:00:15.520 | which Alessio has gorgeously set up for us,
00:00:19.080 | with Nicholas Carlini.
00:00:20.160 | Welcome.
00:00:20.980 | - Thank you. (laughing)
00:00:21.920 | - You're a research scientist at DeepMind.
00:00:23.960 | You work at the intersection
00:00:25.040 | of machine learning and computer security.
00:00:26.960 | You got your PhD from Berkeley in 2018,
00:00:29.640 | and also your BA from Berkeley as well.
00:00:32.800 | And mostly we're here to talk about your blogs,
00:00:35.780 | because you are so generous in just writing up what you know.
00:00:39.440 | Well, actually, why do you write?
00:00:40.880 | - Because I like,
00:00:41.880 | I feel like it's fun to share what you've done.
00:00:44.600 | I don't like writing.
00:00:46.120 | I sufficiently didn't like writing
00:00:47.160 | that I almost didn't do a PhD,
00:00:48.360 | because I knew how much writing
00:00:49.600 | was involved in writing papers.
00:00:51.600 | I was terrible at writing when I was younger.
00:00:54.860 | I did, like, the remedial writing classes
00:00:57.040 | when I was in university,
00:00:57.920 | 'cause I was really bad at it.
00:00:59.680 | So I don't actually enjoy,
00:01:00.520 | I still don't enjoy the act of writing,
00:01:02.220 | but I feel like it is useful to share what you're doing,
00:01:05.480 | and I like being able to talk about the things
00:01:07.520 | that I'm doing that I think are fun.
00:01:08.960 | And so I write because I think
00:01:11.240 | I want to have something to say,
00:01:12.300 | not because I enjoy the act of writing, but yeah.
00:01:14.600 | - It's a tool for thought, as they often say.
00:01:17.360 | Is there any sort of backgrounds
00:01:19.160 | or thing that people should know about you as a person,
00:01:22.000 | like, you know, just--
00:01:22.920 | - Yeah, so I tend to focus on, like you said,
00:01:26.160 | I do security work.
00:01:27.600 | I try to, like, attack things,
00:01:29.440 | and I want to do, like, high-quality security research,
00:01:32.680 | and that's mostly what I spend my actual time
00:01:36.560 | trying to be a productive member of society doing.
00:01:38.640 | But then I get distracted by things,
00:01:40.960 | and I just like, you know,
00:01:42.040 | working on random, fun projects.
00:01:43.680 | And so--
00:01:44.520 | - Like a Doom clone in JavaScript.
00:01:45.680 | - Yes, like that, or, you know,
00:01:48.480 | I've done a number of, yeah,
00:01:49.760 | sort of things that have absolutely no utility,
00:01:51.800 | but are fun things to have done.
00:01:54.560 | And so it's interesting to say, like,
00:01:56.520 | you should work on fun things that just are interesting,
00:01:59.720 | even if they're not useful in any real way.
00:02:01.680 | And so that's what I tend to put up there,
00:02:03.600 | is after I have completed something I think is fun,
00:02:06.440 | or if I think it's sufficiently interesting,
00:02:07.800 | write something down there.
00:02:09.480 | - Before we go into, like, AI, LLMs, and whatnot,
00:02:11.880 | why are you obsessed with the game of life?
00:02:14.240 | So you built multiplexing circuits in the game of life,
00:02:18.600 | which is mind-boggling.
00:02:20.800 | So where did that come from?
00:02:22.160 | And then how do you go from just clicking boxes
00:02:25.160 | on the UI web version to, like,
00:02:27.680 | building multiplexing circuits?
00:02:29.640 | - I like Turing completeness.
00:02:31.640 | The definition of Turing completeness is
00:02:33.880 | a computer that can run anything, essentially.
00:02:36.240 | And the game of life, Conway's game of life,
00:02:38.440 | is a very simple 2D cellular automaton
00:02:41.240 | where you have cells that are either on or off,
00:02:43.360 | and a cell becomes on if, in the previous generation,
00:02:45.680 | some configuration holds true and off otherwise.
00:02:48.960 | And it turns out there is a proof
00:02:51.560 | that the game of life is Turing complete,
00:02:53.240 | that you can run any program in principle
00:02:55.400 | using Conway's game of life.
00:02:57.040 | I don't know.
00:02:57.880 | And so you can, therefore someone should.
00:02:59.920 | And so I wanted to do it.
00:03:01.240 | And some other people have done some similar things,
00:03:03.920 | but I got obsessed into, like,
00:03:05.840 | if you're gonna try and make it work,
00:03:07.360 | like, we already know it's possible in theory.
00:03:08.960 | I want to try and, like, actually make something
00:03:10.280 | I can run on my computer, like, a real computer I can run.
00:03:13.400 | And so, yeah, I've been going down this rabbit hole
00:03:15.520 | of trying to make a CPU that I can run
00:03:18.280 | semi-real-time on the game of life,
00:03:20.120 | and I have been making some reasonable progress there.
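(For reference, a minimal sketch in Python of the update rule being described: a cell is alive in the next generation if it has exactly three live neighbors, or two if it was already alive. The code is illustrative, not something from the conversation.)

```python
from collections import Counter

def step(live):
    """One Game of Life generation. `live` is a set of (row, col) cells."""
    # Count live neighbors for every cell adjacent to a live cell.
    counts = Counter(
        (r + dr, c + dc)
        for (r, c) in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Birth on exactly 3 neighbors; survival on 2 or 3.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}

# A glider: a small pattern that translates itself diagonally forever.
glider = {(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)}
for _ in range(4):
    glider = step(glider)
```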
00:03:22.720 | And yeah, but, you know, Turing completeness
00:03:25.120 | is just, like, a very fun trap you can go down.
00:03:28.480 | A while ago, as part of a research paper,
00:03:31.320 | I was able to show that in C,
00:03:33.560 | if you call into printf, it's Turing complete.
00:03:36.840 | Like, printf, you know, like, which, like, you know,
00:03:38.280 | you can print numbers or whatever, right?
00:03:39.920 | - Yeah, but there should be, you know,
00:03:41.120 | like, control flow stuff in there.
00:03:41.960 | - There is, because printf has a %n specifier
00:03:45.840 | that lets you write an arbitrary amount of data
00:03:48.160 | to an arbitrary location.
00:03:49.720 | And the printf format specifier has an index
00:03:53.080 | into where it is in the loop that is in memory.
00:03:56.040 | So you can overwrite the location
00:03:58.920 | of where printf is currently indexing, using %n.
00:04:02.760 | So you can get loops, you can get conditionals,
00:04:04.920 | and you can get arbitrary data writes again.
00:04:06.640 | So we sort of have another Turing complete language
00:04:08.680 | using printf, which, again, like,
00:04:10.960 | this has essentially zero practical utility,
00:04:13.160 | but, like, it's just, I feel like a lot of people
00:04:17.040 | get into programming because they enjoy
00:04:18.920 | the art of doing these things.
00:04:21.040 | And then they go work on developing
00:04:23.360 | some software application and lose all joy.
00:04:25.240 | - Need a little SaaS with the boys, as they say.
00:04:27.240 | - Yeah, and I want to still have joy in doing these things.
00:04:30.720 | And so on occasion, I try to stop doing
00:04:33.560 | productive, meaningful things and just, like,
00:04:36.040 | what's a fun thing that we can do
00:04:37.880 | and try and make that happen?
00:04:39.480 | - Awesome, and you've been kind of like a pioneer
00:04:42.160 | in the AI security space.
00:04:43.800 | You've done a lot of talks starting back in 2018.
00:04:46.720 | We'll kind of leave that to the end.
00:04:48.360 | - Sure.
00:04:49.200 | - Because I know the security part is,
00:04:51.000 | there's maybe a smaller audience,
00:04:52.640 | but it's a very intense audience.
00:04:54.480 | So I think that'll be fun.
00:04:55.400 | But everybody in our Discord started posting
00:04:58.080 | your "How I Use AI" blog post.
00:05:00.400 | And we were like, "We should get Carlini on the podcast."
00:05:02.480 | And then--
00:05:03.760 | - And you were so nice to just--
00:05:04.840 | - Yeah, and then I sent you an email and you're like,
00:05:06.400 | "Okay, I'll come."
00:05:07.240 | And I was like, "Oh, I thought that would be harder."
00:05:09.720 | So I think there's, as you said in the blog post,
00:05:12.520 | a lot of misunderstanding about what LLMs
00:05:15.480 | can actually be used for, what are they useful at,
00:05:18.160 | what are they not good at,
00:05:19.280 | and whether or not it's even worth arguing
00:05:21.000 | what they're not good at, because they're obviously not.
00:05:23.800 | So if you cannot count the Rs in a word,
00:05:26.480 | they're like, "It's just not what it does."
00:05:28.520 | So how painful was it to write such a long post,
00:05:31.160 | given that you just said that you don't like to write?
00:05:33.680 | And then we can kind of run through the things,
00:05:36.080 | but maybe just talk about the motivation,
00:05:37.600 | why you thought it was important to do it.
00:05:39.240 | - Yeah.
00:05:40.080 | So I wanted to do this because I feel like most people
00:05:42.120 | who write about language models being good or bad,
00:05:45.360 | some underlying message of like,
00:05:46.800 | they have their camp, and their camp is like,
00:05:48.760 | "AI is bad," or "AI is good," or whatever.
00:05:50.920 | And they spin whatever they're gonna say
00:05:53.440 | according to their ideology.
00:05:55.120 | And they don't actually just look
00:05:56.400 | at what is true in the world.
00:05:58.800 | So I've read a lot of things where people say
00:06:01.560 | how amazing they are and how all programmers
00:06:03.640 | are gonna be obsolete by 2024.
00:06:05.480 | And I've read a lot of things where people who say
00:06:07.080 | they can't do anything useful at all,
00:06:08.880 | and it's just like it's only the people
00:06:11.280 | who've come off of blockchain, crypto stuff,
00:06:14.680 | and are here to make another quick buck
00:06:16.680 | and move on.
00:06:17.520 | And I don't really agree with either of these.
00:06:19.760 | And I'm not someone who cares really one way or the other
00:06:23.560 | how these things go.
00:06:25.240 | And so I wanted to write something that just says like,
00:06:27.640 | look, let's sort of ground reality
00:06:29.960 | and what we can actually do with these things.
00:06:32.280 | Because my actual research is in like security
00:06:35.680 | and showing that these models have lots of problems.
00:06:38.800 | Like this is like, my day-to-day job is saying like,
00:06:40.760 | we probably shouldn't be using these in lots of cases.
00:06:43.080 | I thought I could have a little bit of credibility
00:06:45.320 | in saying: it is true.
00:06:48.000 | They have lots of problems.
00:06:49.400 | We maybe shouldn't be deploying them in lots of situations.
00:06:51.800 | And still they are also useful.
00:06:54.560 | And that is the like, the bit that I wanted to get across
00:06:58.400 | is to say, I'm not here to try and sell you on anything.
00:07:01.680 | I just think that they're useful
00:07:03.720 | for the kinds of work that I do.
00:07:05.600 | And hopefully some people would listen.
00:07:08.160 | And it turned out that a lot more people liked it
00:07:10.000 | than I thought.
00:07:11.600 | But yeah, that was the motivation
00:07:13.520 | behind why I wanted to write this.
00:07:15.680 | - So you had about a dozen sections
00:07:18.960 | of like how you actually use AI.
00:07:20.880 | Maybe we can just kind of run through them all.
00:07:22.720 | And then maybe the ones where you have extra commentary
00:07:25.680 | to add if we can.
00:07:26.520 | - Sure, yeah, yeah, yeah.
00:07:28.040 | I didn't put as much thought into this
00:07:29.480 | as maybe was deserved because, yeah.
00:07:32.400 | I probably spent, I don't know,
00:07:35.160 | definitely less than 10 hours putting this together.
00:07:38.560 | - Wow, it took me close to that to do a podcast episode.
00:07:41.440 | So that's pretty impressive.
00:07:43.120 | - Yeah, I wrote it in one pass.
00:07:45.320 | I've gotten a number of emails of like,
00:07:46.960 | you got this editing thing wrong.
00:07:48.440 | You got this sort of other thing wrong.
00:07:49.640 | And it's like, I haven't looked at it.
00:07:51.800 | I tend to try it.
00:07:52.920 | I feel like, I still don't like writing.
00:07:55.240 | And so because of this, the way I tend to treat this
00:07:57.320 | is like, I will put it together
00:07:58.680 | into the best format that I can at a time
00:08:00.560 | and then put it on the internet and then never change it.
00:08:03.000 | And I guess this is an aspect of the research side of me
00:08:05.360 | is like, once a paper is published,
00:08:07.320 | it is done, it is an artifact, it exists in the world.
00:08:09.640 | I could forever edit the very first thing I ever put
00:08:12.480 | to make it the most perfect version of what it is.
00:08:14.880 | And I would do nothing else.
00:08:16.360 | And so I feel like, I find it useful to be like,
00:08:18.040 | this is the artifact.
00:08:18.880 | I will spend some certain amount of hours on it,
00:08:20.720 | which is what I think it is worth.
00:08:21.640 | And then I will just--
00:08:22.480 | - Yeah, timeboxing. - Yeah, stop.
00:08:24.120 | - So the first one was to make applications.
00:08:26.080 | We just recorded an episode with the founder of Cosine,
00:08:28.600 | which is like a AI software engineer colleague.
00:08:31.680 | You said it took you 30,000 words to get GPT-4
00:08:35.520 | to build you the "can GPT-4 solve this" kind of app.
00:08:39.360 | Where are we in the spectrum where ChatGPT is all you need
00:08:42.480 | to actually build something versus I need a full-on agent
00:08:45.320 | that does everything for me?
00:08:46.560 | - Yeah, okay, so this was an,
00:08:47.800 | so I built a web app last year sometime
00:08:50.600 | that was just like a fun demo where you can guess
00:08:53.480 | if you can predict whether or not GPT-4 at the time
00:08:55.840 | could solve a given task.
00:08:58.000 | This is, as far as web apps go, very straightforward.
00:09:02.200 | Like you need basic HTML, CSS.
00:09:04.520 | You have a little slider that moves.
00:09:06.640 | You have a button, sort of animate the text
00:09:08.520 | coming to the screen.
00:09:09.680 | The reason people are going here
00:09:11.720 | is not because they want to see my wonderful HTML, right?
00:09:14.960 | Like, you know, I used to know how to do like modern HTML,
00:09:18.480 | like in 2007, 2008, like, you know,
00:09:22.840 | when I was very good at fighting with IE6
00:09:24.920 | and these kinds of things.
00:09:25.760 | Like I knew how to do that.
00:09:26.720 | I have no longer had to build any web app stuff
00:09:28.680 | like in the meantime, which means that like,
00:09:31.120 | I know how everything works,
00:09:32.640 | but I don't know any of the new, Flexbox is new to me.
00:09:35.720 | Flexbox is like 10 years old at this point.
00:09:38.200 | But like, it's just amazing having,
00:09:39.680 | being able to go to the model and just say like,
00:09:41.880 | write me this thing.
00:09:42.800 | It will give me all of the boilerplate
00:09:44.240 | that I need to get going.
00:09:45.880 | And of course it's imperfect.
00:09:47.560 | It's not going to get you the right answer.
00:09:49.600 | And it doesn't do anything that's complicated right now,
00:09:53.920 | but it gets you to the point where the only remaining work
00:09:57.400 | that needs to be done is the interesting hard part for me,
00:10:00.220 | that like is the actual novel part.
00:10:02.640 | And even the current models, I think,
00:10:04.420 | are entirely good enough at doing this kind of thing,
00:10:07.240 | that they're very useful.
00:10:08.320 | It may be the case that if you had something like,
00:10:10.040 | you were saying, smarter agent,
00:10:11.840 | that could debug problems by itself.
00:10:15.060 | That might be even more useful.
00:10:16.900 | Currently though, you can make a model into an agent
00:10:18.800 | by just copying and pasting error messages
00:10:20.360 | for the most part.
00:10:21.200 | And that's what I do is, you know, you run it
00:10:24.120 | and it gives you some code that doesn't work
00:10:25.560 | and either I'll fix the code or it will give me buggy code
00:10:28.240 | and I won't know how to fix it.
00:10:29.080 | And I'll just copy and paste the error message and say,
00:10:31.160 | it tells me this, what do I do?
00:10:33.680 | And it will just tell me how to fix it.
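(For reference, a rough sketch of the copy-paste-the-error-message loop being described, automated in Python. `ask_model` here is a placeholder for whichever chat API you use, not a real library call, and the retry count and filenames are illustrative.)

```python
import subprocess
import sys

def ask_model(prompt: str) -> str:
    """Placeholder: swap in your preferred LLM client here."""
    raise NotImplementedError("plug in an actual chat API")

code = ask_model("Write a Python script that does X.")
for _ in range(3):
    with open("generated.py", "w") as f:
        f.write(code)
    result = subprocess.run([sys.executable, "generated.py"],
                            capture_output=True, text=True)
    if result.returncode == 0:
        break  # it ran; now go verify the output yourself
    # Paste the error back, exactly as you would by hand in a chat window.
    code = ask_model(
        f"This code:\n\n{code}\n\nfailed with:\n\n{result.stderr}\n"
        "It tells me this, what do I do? Return a corrected script."
    )
```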
00:10:35.460 | You can't trust these things blindly,
00:10:37.920 | but I feel like most people on the internet
00:10:40.400 | already understand that things on the internet,
00:10:42.400 | you can't trust blindly.
00:10:43.840 | And so there's not like, this is not like a big mental shift
00:10:47.080 | you have to go through to understand that it is possible
00:10:50.300 | to read something and find it useful,
00:10:52.400 | even if it is not completely perfect in its output.
00:10:54.920 | - It's very human-like in that sense.
00:10:56.480 | It's the same ring of trust, you know,
00:10:57.840 | I kind of think about it that way,
00:10:59.760 | if you had trust levels.
00:11:02.920 | And there's maybe a couple that tie together.
00:11:05.040 | So there was like to make applications
00:11:06.960 | and then there's to get started,
00:11:08.360 | which is a similar, you know, kickstart,
00:11:10.600 | maybe like a project that, you know, the LLM cannot solve.
00:11:14.080 | It's kind of how you think about it.
00:11:15.120 | - Yeah, so like for getting started on things
00:11:17.560 | is one of the cases where I think it's really great
00:11:19.320 | for some of these things
00:11:20.160 | where I sort of use it as a personalized,
00:11:24.400 | help me use this technology I've never used before.
00:11:27.120 | So for example, I had never used Docker before January.
00:11:30.360 | I know what Docker is.
00:11:31.200 | - Lucky you.
00:11:32.020 | - Yeah, like I'm a computer security person.
00:11:34.160 | Like I sort of, I have read lots of papers on, you know,
00:11:37.880 | on all the technology behind how these things work.
00:11:40.360 | You know, I know all the exploits on them.
00:11:41.720 | I've done some of these things,
00:11:43.440 | but I had never actually used Docker.
00:11:45.560 | But I wanted it to be able,
00:11:46.960 | so that I could run the outputs of language model stuff
00:11:49.960 | in some controlled, contained environment,
00:11:51.280 | which I know is the right application.
00:11:52.800 | So I just ask it, like,
00:11:54.000 | I want to use Docker to do this thing.
00:11:55.680 | Like, tell me how to run a Python program
00:11:58.440 | in a Docker container.
00:11:59.640 | And it like gives me a thing.
00:12:00.480 | And I'm like, step back.
00:12:01.880 | You said Docker Compose.
00:12:02.960 | I do not know what this word Docker Compose is.
00:12:04.820 | Is this Docker?
00:12:05.660 | Is this not Docker?
00:12:06.480 | Help me.
00:12:07.320 | And like, it'll sort of tell me all of these things.
00:12:08.720 | And I'm sure there's this knowledge
00:12:09.960 | that's out there on the internet.
00:12:11.000 | Like, this is not some groundbreaking thing that I'm doing,
00:12:14.680 | but I just wanted it as a small piece
00:12:17.000 | of one thing I was working on.
00:12:19.120 | And I didn't want to learn Docker from first principles.
00:12:22.640 | Like, at some point, if I need it, I can do that.
00:12:25.000 | Like, I have the background that I can make that happen.
00:12:27.880 | But what I wanted to do was thing one.
00:12:30.680 | And it's very easy to get bogged down in the details
00:12:32.640 | of this other thing that helps you accomplish your end goal.
00:12:34.920 | And I just wanted, like, tell me enough about Docker
00:12:37.040 | so I can do this particular thing.
00:12:38.360 | And I can check that it's doing the safe thing.
00:12:40.760 | I sort of know enough about that from my other background.
00:12:44.440 | And so I can just have the model help teach me
00:12:46.920 | exactly the one thing I want to know and nothing more.
00:12:49.120 | I don't need to worry about other things
00:12:50.680 | that the writer of this thinks is important
00:12:52.560 | that actually isn't.
00:12:53.400 | Like, I can just like stop the conversation and say,
00:12:55.040 | no, boring to me.
00:12:56.320 | Explain this detail I don't understand.
00:12:57.520 | I think that was very useful for me.
00:12:59.760 | It would have taken me, you know, several hours
00:13:01.640 | to figure out some things that take 10 minutes
00:13:03.560 | if you could just ask exactly the question
00:13:05.000 | you want the answer to.
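(For reference, a minimal sketch of the kind of answer being asked for here: run a model-generated Python script in a throwaway Docker container with no network access. The image tag, mount paths, and filename are illustrative, not from the conversation.)

```python
import os
import subprocess

subprocess.run(
    [
        "docker", "run", "--rm",          # delete the container when it exits
        "--network", "none",              # no network: the point is containment
        "-v", f"{os.getcwd()}:/work",     # mount the current directory
        "-w", "/work",                    # and run from inside that mount
        "python:3.11-slim",               # a stock Python image
        "python", "untrusted_output.py",  # the model-generated script
    ],
    check=True,
)
```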
00:13:06.120 | - Have you had any issues with like newer tools?
00:13:08.640 | Have you felt any meaningful kind of like a cutoff day
00:13:11.600 | where like there's not enough data on the internet or?
00:13:14.080 | - I'm sure that the answer to this is yes.
00:13:16.160 | But I tend to just not use most of these things.
00:13:19.600 | Like, I feel like this is like the significant way
00:13:22.120 | in which I use machine learning models
00:13:23.800 | is probably very different than most people
00:13:25.960 | is that I'm a researcher
00:13:27.560 | and I get to pick what tools that I use.
00:13:29.080 | And most of the things that I work on
00:13:30.160 | are fairly small projects.
00:13:31.640 | And so I can entirely see how someone
00:13:34.040 | who is in a big giant company
00:13:35.720 | where they have their own proprietary legacy code base
00:13:38.000 | of a hundred million lines of code or whatever.
00:13:39.680 | And like, you just might not be able to use things
00:13:41.840 | the same way that I do.
00:13:42.960 | I still think there are lots of use cases there
00:13:44.720 | that are entirely reasonable
00:13:46.120 | that are not the same ones that I've put down.
00:13:48.280 | But I wanted to talk about what I have personal experience
00:13:50.800 | in being able to say is useful.
00:13:52.360 | And I would like it very much
00:13:53.800 | if someone who is in one of these environments
00:13:56.720 | would be able to describe the ways
00:13:58.120 | in which they find current models useful to them
00:14:00.640 | and not philosophize on what someone else
00:14:02.760 | might be able to find useful.
00:14:03.760 | But actually say like,
00:14:04.600 | "Here are real things that I have done
00:14:06.720 | that I found useful for me."
00:14:08.560 | - Yeah, this is what I often do
00:14:10.200 | to encourage people to write more,
00:14:12.080 | to share their experiences,
00:14:13.040 | because they often fear being attacked on the internet.
00:14:16.480 | But you are the ultimate authority on how you use things
00:14:19.160 | and it's objectively true.
00:14:21.360 | So they cannot be debated.
00:14:23.920 | One thing that people are very excited about
00:14:25.760 | is the concept of ephemeral software,
00:14:27.480 | like personal software.
00:14:30.120 | This use case in particular
00:14:31.840 | basically lowers the activation energy
00:14:34.040 | for creating software,
00:14:36.200 | which I like as a vision.
00:14:37.920 | I don't think I have taken as much advantage of it
00:14:42.240 | as I could.
00:14:43.280 | I feel guilty about that,
00:14:44.560 | but also we're trending towards there.
00:14:47.840 | - Yeah, no, I do think that this is a direction
00:14:50.080 | that is exciting to me.
00:14:52.360 | Yeah, one of the things I wrote
00:14:53.440 | that was like a lot of the ways that I use these models
00:14:55.200 | are for one-off things that I just need to happen
00:14:58.120 | that I'm gonna throw away in five minutes.
00:15:00.000 | - Yeah, and you can.
00:15:01.680 | - Yeah, exactly, right.
00:15:02.520 | It's like the kind of thing where
00:15:03.880 | it would not have been worth it
00:15:05.040 | for me to have spent 45 minutes writing this
00:15:07.960 | because I don't need the answer that badly.
00:15:10.480 | But if it will only take me five minutes,
00:15:12.480 | then I'll just figure it out,
00:15:14.640 | run the program, and then get it right.
00:15:16.760 | And if it turns out that you ask the thing
00:15:18.440 | and it doesn't give you the right answer,
00:15:19.360 | well, I didn't actually need the answer that badly
00:15:20.720 | in the first place.
00:15:21.680 | Like either I can decide to dedicate the 45 minutes
00:15:23.640 | or I cannot, but the cost of doing it is fairly low.
00:15:26.320 | You see what the model can do,
00:15:27.720 | and if it can't, then okay.
00:15:30.040 | When you're using these models,
00:15:31.160 | if you're getting the answer you want always,
00:15:32.880 | it means you're not asking them hard enough questions.
00:15:34.960 | - Ooh, say more.
00:15:36.920 | - Lots of people only use them
00:15:38.120 | for very small particular use cases,
00:15:40.240 | and it always does the thing that they want.
00:15:42.440 | - Yeah, they use it like a search engine.
00:15:44.160 | - Yeah, or like one particular case.
00:15:45.520 | And if you're finding that when you're using these,
00:15:47.440 | it's always giving you the answer that you want,
00:15:49.600 | then probably it has more capabilities
00:15:51.680 | than you're actually using.
00:15:52.880 | And so I oftentimes try when I have something
00:15:55.160 | that I'm curious about to just feed into the model
00:15:57.680 | and be like, well, maybe it can solve my problem for me.
00:15:59.640 | You know, most of the time it doesn't,
00:16:01.160 | but like on occasion, it's like,
00:16:02.800 | it's done things that would have taken me,
00:16:05.160 | you know, a couple hours that it's been great
00:16:07.600 | and just like solved everything immediately.
00:16:09.080 | And if it doesn't, then it's usually easier
00:16:11.200 | to verify whether or not the answer is correct
00:16:13.080 | than to have written it in the first place.
00:16:14.760 | And so you check, you're like,
00:16:15.760 | well, that's just, you're entirely misguided.
00:16:17.320 | Nothing here is right.
00:16:18.160 | It's just like, I'm not going to do this.
00:16:19.520 | I'm gonna go write it myself or whatever.
00:16:21.360 | - Even for non-tech, I had to fix my irrigation system.
00:16:24.720 | I had an old irrigation system.
00:16:26.000 | I didn't know how it worked to program it.
00:16:27.480 | I took a photo, I sent it to Claude.
00:16:28.920 | And it's like, oh yeah, that's like the RT900.
00:16:31.120 | This is exactly, I was like, oh wow.
00:16:32.920 | You know, you know a lot of stuff.
00:16:34.280 | - Was it right?
00:16:35.120 | - Yeah, it was right.
00:16:35.960 | It worked.
00:16:36.800 | - Did you compare with OpenAI?
00:16:38.160 | - No, I canceled my OpenAI subscription,
00:16:40.240 | so I'm a Claude boy.
00:16:42.600 | Do you have a way to think about
00:16:43.800 | these like one-offs, softer thing?
00:16:46.120 | One way I talk to people about it is like,
00:16:47.920 | LLMs are kind of converging
00:16:49.320 | into like semantic serverless functions.
00:16:51.560 | You know, like you can say something
00:16:53.400 | and like it can run the function in a way
00:16:55.040 | and then that's it.
00:16:56.040 | It just kind of dies there.
00:16:57.360 | Do you have a mental model to just think about
00:16:59.280 | how long it should live for and like anything like that?
00:17:02.920 | - I don't think I have anything interesting to say here, no.
00:17:05.440 | I will take whatever tools are available in front of me
00:17:08.520 | and try and see if I can use them in meaningful ways.
00:17:10.440 | And if they're helpful, then great.
00:17:11.840 | If they're not, then fine.
00:17:13.000 | And like, you know, there are lots of people
00:17:14.880 | that I'm very excited about seeing all of these people
00:17:16.800 | who are trying to make better applications
00:17:18.320 | that use these or all these kinds of things.
00:17:20.400 | And I think that's amazing.
00:17:22.640 | I would like to see more of it,
00:17:23.640 | but I do not spend my time thinking
00:17:25.480 | about how to make this any better.
00:17:26.920 | - What's the most underrated thing in the list?
00:17:29.280 | I know there's like simplified code,
00:17:31.480 | solving boring tasks,
00:17:32.760 | or maybe is there something that you forgot to add
00:17:35.080 | that you wanna throw in there?
00:17:37.360 | - I mean, okay, so in the list,
00:17:39.400 | I only put things that people could look at
00:17:42.840 | and go, I understand how this solved my problem.
00:17:48.200 | I didn't want to put things
00:17:49.880 | where the model was very useful to me,
00:17:52.200 | but it would not be clear to someone else
00:17:54.640 | that it was actually useful.
00:17:56.160 | So for example, one of the things that I use it a lot for
00:17:59.080 | is debugging errors.
00:18:01.080 | But the errors that I have
00:18:03.840 | are very much not the errors
00:18:05.000 | that anyone else in the world will have.
00:18:06.120 | And in order to understand
00:18:07.200 | whether or not the solution was right,
00:18:08.920 | you just have to trust me on it.
00:18:09.800 | Because, you know, like I got my machine in a state
00:18:12.520 | that like CUDA was not talking to whatever,
00:18:15.920 | some other thing, the versions were mismatched.
00:18:18.040 | Something, something, something,
00:18:19.040 | and everything was broken.
00:18:20.160 | And like, I could figure it out
00:18:21.160 | when I interacted with the model,
00:18:22.120 | and it told me the steps I needed to take.
00:18:24.400 | But at the end of the day,
00:18:25.320 | when you look at the conversation,
00:18:26.400 | you just have to trust me that it worked.
00:18:28.640 | And I didn't want to write things online
00:18:33.120 | that were this like,
00:18:33.960 | you have to trust me in what I'm saying.
00:18:35.760 | I want everything that I said to like have evidence
00:18:38.200 | that like, here's the conversation,
00:18:39.440 | you can go and check
00:18:40.960 | whether or not this actually solved the task
00:18:43.280 | as I just said that the model does.
00:18:45.000 | Because a lot of people I feel like say,
00:18:47.080 | I used a model to solve this very complicated task.
00:18:50.200 | And what they mean is,
00:18:51.480 | the model did 10% and I did the other 90%.
00:18:53.720 | So I wanted everything to be verifiable.
00:18:55.760 | And so one of the biggest use cases for me,
00:18:57.880 | I didn't describe even at all,
00:18:59.480 | because it's not the kind of thing
00:19:00.720 | that other people could have verified by themselves.
00:19:02.680 | So that maybe is like one of the things
00:19:04.600 | that I wish I maybe had said a little bit more about,
00:19:07.760 | and just stated that the way that this is done.
00:19:11.680 | Because I feel like
00:19:12.520 | that this didn't come across quite as well.
00:19:13.880 | But yeah, of the things that I talked about,
00:19:16.760 | the thing that I think is most underrated
00:19:19.240 | is the ability of it to solve
00:19:21.440 | the uninteresting parts of problems for me right now,
00:19:24.560 | where people always say,
00:19:26.560 | this is one of the biggest arguments
00:19:27.760 | that I don't understand why people say,
00:19:29.440 | is the model can only do things
00:19:32.080 | that people have done before.
00:19:33.640 | Therefore, the model is not going to be helpful
00:19:35.720 | in doing new research or like discovering new things.
00:19:39.040 | And as someone whose day job is to do new things,
00:19:42.720 | like what is research?
00:19:44.000 | Research is doing something
00:19:45.040 | literally no one else in the world has ever done before.
00:19:47.480 | So like, this is what I do like every single day.
00:19:50.240 | 90% of this is not doing something new.
00:19:53.360 | Like 90% of this is like doing things
00:19:55.840 | a million people have done before,
00:19:57.360 | and then a little bit of something that was new.
00:19:59.640 | There's a reason why we say
00:20:00.480 | we stand on the shoulders of giants.
00:20:02.040 | It's true.
00:20:02.880 | Almost everything that I do
00:20:03.720 | is something that's been done many, many times before.
00:20:06.080 | And that is the piece that can be automated.
00:20:08.600 | Even if the thing that I'm doing as a whole is new,
00:20:12.400 | it is almost certainly the case
00:20:13.560 | that the small pieces that build up to it are not.
00:20:17.520 | And a number of people who use these models,
00:20:20.640 | I feel like expect that they can either solve
00:20:22.280 | the entire task or none of the task.
00:20:24.160 | But now I find myself very often,
00:20:27.040 | even when doing something very new and very hard,
00:20:29.760 | having models write the easy parts for me.
00:20:33.440 | And the reason I think this is so valuable,
00:20:35.560 | everyone who programs understands this,
00:20:37.120 | like you're currently trying to solve some problem
00:20:39.320 | and you get distracted.
00:20:40.880 | And you know, whatever the case may be,
00:20:42.720 | someone comes and talks to you.
00:20:43.720 | You have to go look up something online, whatever it is.
00:20:46.920 | You lose a lot of time to that.
00:20:49.640 | And one of the ways we currently don't think
00:20:51.320 | about being distracted is you're solving some hard problem
00:20:53.720 | and you realize you need a helper function that does X.
00:20:57.120 | Where X is like, it's a known algorithm.
00:20:59.200 | Any person in the world, you say like,
00:21:00.720 | give me the algorithm that, you know,
00:21:03.040 | have a dense graph or a sparse graph.
00:21:04.600 | I need to make it dense.
00:21:05.560 | You can do this by, you know,
00:21:06.880 | doing some matrix multiplies.
00:21:08.640 | It's like, this is a solved problem.
00:21:09.760 | I knew how to do this 15 years ago.
00:21:11.560 | But it distracts me from the problem
00:21:13.640 | I'm thinking about in my mind.
00:21:14.560 | I needed this done.
00:21:15.960 | And so instead of using my mental capacity
00:21:19.240 | and solving that problem,
00:21:20.240 | and then coming back to the problem
00:21:21.240 | I was originally trying to solve,
00:21:22.800 | you can just ask model, please solve this problem for me.
00:21:25.000 | It gives you the answer.
00:21:25.840 | You run it.
00:21:26.680 | You can check that it works very, very quickly.
00:21:28.240 | And now you go back to solving the problem
00:21:29.440 | without having lost all the mental state.
00:21:30.960 | And I feel like this is one of the things
00:21:32.000 | that's been very useful for me.
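(For reference, a minimal sketch of the kind of throwaway helper being described: turning a sparse graph, given as an edge list, into a dense adjacency matrix. The function name and the undirected-graph assumption are illustrative, not from the conversation.)

```python
import numpy as np

def to_dense_adjacency(edges, num_nodes):
    """edges: iterable of (u, v) pairs; returns an (n, n) 0/1 matrix."""
    adj = np.zeros((num_nodes, num_nodes), dtype=np.int8)
    for u, v in edges:
        adj[u, v] = 1
        adj[v, u] = 1  # assuming an undirected graph
    return adj

dense = to_dense_adjacency([(0, 1), (1, 2), (3, 0)], num_nodes=4)
```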
00:21:34.400 | - And in terms of this concept of expert users
00:21:37.320 | versus non-expert users, floors versus ceilings,
00:21:41.200 | you had some strong opinion here
00:21:42.320 | that basically it actually is more beneficial
00:21:45.040 | for non-experts.
00:21:46.160 | - Yeah, I don't know.
00:21:47.120 | I think it could go either way.
00:21:48.520 | Let me give you the argument for both of these.
00:21:50.840 | - Yes.
00:21:51.680 | - So I can only speak on the expert user behalf
00:21:53.000 | because I've been doing computers for a long time.
00:21:54.960 | And so, yeah, the cases where it's useful for me
00:21:56.440 | are exactly these cases where I can check the output.
00:21:59.360 | I know, and anything the model could do,
00:22:01.200 | I could have done.
00:22:02.040 | I could have done better.
00:22:02.920 | I can check every single thing that the model's doing
00:22:05.000 | and make sure it's correct in every way.
00:22:06.800 | And so I can only speak and say,
00:22:09.160 | definitely it's been useful for me.
00:22:10.680 | But I also see a world in which this could be very useful
00:22:13.360 | for the kinds of people who do not have this knowledge
00:22:16.760 | with caveats, because I'm not one of the people
00:22:18.600 | I don't have this direct experience.
00:22:20.240 | But one of these big ways that I can see this
00:22:22.960 | is for things that you can check fairly easily,
00:22:27.120 | someone who could never have asked
00:22:29.320 | or have written a program themselves to do a certain task
00:22:32.080 | could just ask for the program that does the thing.
00:22:34.480 | And you know, some of the times it won't get it right,
00:22:37.520 | but some of the times it will,
00:22:39.080 | and they'll be able to have the thing in front of them
00:22:41.920 | that they just couldn't have done before.
00:22:44.080 | And we see a lot of people trying to do applications
00:22:47.080 | for this, like integrating language models
00:22:49.080 | into spreadsheets.
00:22:50.360 | Spreadsheets run the world.
00:22:53.000 | And there are some people who know how to do
00:22:54.440 | all the complicated spreadsheet equations
00:22:56.040 | and various things, and other people who don't,
00:22:58.520 | who just use the spreadsheet program,
00:23:00.700 | but just manually do all of the things
00:23:02.640 | one by one by one by one.
00:23:04.480 | And this is a case where you could have a model
00:23:07.640 | that could try and give you a solution.
00:23:11.000 | And as long as the person is rigorous in testing
00:23:13.080 | that the solution does actually the correct thing,
00:23:14.760 | this is the part that I'm worried about most.
00:23:16.200 | You know, I think depending on these systems
00:23:18.160 | in ways that we shouldn't,
00:23:19.440 | like this is what my research says.
00:23:20.760 | My research says is entirely on this,
00:23:22.200 | like you probably shouldn't trust these models
00:23:24.360 | to do the things in adversarial situations.
00:23:26.040 | Like I understand this very deeply.
00:23:28.280 | And so I think that it's possible for people
00:23:31.520 | who don't have this knowledge
00:23:32.720 | to make use of these tools in ways,
00:23:34.720 | but I'm worried that it might end up in a world
00:23:37.680 | where people just blindly trust them,
00:23:39.680 | deploy them in situations that aren't,
00:23:41.480 | they probably shouldn't.
00:23:42.840 | And then someone like me gets to come along
00:23:44.720 | and just break everything because everything is terrible.
00:23:47.680 | And so like, I am very, very worried
00:23:49.440 | about that being the case,
00:23:50.920 | but I think if done carefully,
00:23:52.360 | it is possible that these could be very useful.
00:23:54.920 | - Yeah, there is some research out there
00:23:57.200 | that shows that when people use LLMs to generate code,
00:24:00.800 | they do generate less secure code.
00:24:02.440 | - Yeah, Dan Boneh has a nice paper on this.
00:24:03.920 | There are a bunch of papers that touch on exactly this.
00:24:08.040 | - My slight issue is, is there an agenda here?
00:24:10.800 | - I mean, okay, yeah.
00:24:12.400 | Dan Boneh, at least the one they have,
00:24:14.280 | I fully trust everything that sort of, yeah.
00:24:16.080 | - Sorry, I don't know who Dan is.
00:24:17.880 | - Professor at Stanford.
00:24:19.040 | Yeah, he and some students have some things on this.
00:24:20.960 | And yeah, there's like a number of,
00:24:22.880 | I agree that a lot of the stuff
00:24:24.560 | feel like people have an agenda behind it.
00:24:26.560 | There are some that don't,
00:24:27.480 | and I sort of trust them to have done the right thing.
00:24:31.440 | I also think, even on this though, we have to be careful
00:24:34.240 | because the argument,
00:24:35.960 | whenever someone says X is true about language models,
00:24:38.200 | you should always append the suffix for current models
00:24:41.400 | because I'll be the first to admit
00:24:43.320 | I was one of the people who was very much on the opinion
00:24:45.800 | that these language models are fun toys
00:24:47.480 | and are gonna have absolutely no practical utility.
00:24:49.480 | And if you had asked me this, let's say in 2020,
00:24:53.320 | I still would have said the same thing.
00:24:54.160 | It was like after I had seen GPT-2,
00:24:56.480 | I had written a couple of papers
00:24:58.320 | studying GPT-2 very carefully.
00:25:00.200 | I still would have told you these things are toys.
00:25:03.000 | And when I first read the RLHF paper
00:25:06.000 | and the instruction tuning paper,
00:25:08.600 | I was like, nope, this is like this thing
00:25:10.920 | that these weird AI people are doing.
00:25:12.960 | It's like they're trying to make some analogies
00:25:15.480 | to people that it makes no sense.
00:25:17.160 | It's just like, I don't even care to read it.
00:25:19.000 | I saw what it was about and just didn't even look at it.
00:25:22.040 | I was obviously wrong.
00:25:23.560 | These things can be useful.
00:25:25.320 | And I feel like a lot of people
00:25:28.760 | had the same mentality that I did
00:25:30.760 | and decided not to change their mind.
00:25:32.840 | And I feel like this is the thing
00:25:34.160 | that I want people to be careful about.
00:25:36.760 | I want them to at least know what is true about the world
00:25:39.080 | so that they can then see that maybe they should reconsider
00:25:42.720 | some of the opinions that they had
00:25:43.880 | from four or five years ago
00:25:44.960 | that may just not be true about today's models.
00:25:47.440 | - Specifically, because you brought up spreadsheets,
00:25:49.240 | I want to share my personal experience
00:25:51.240 | because I think Google's done a really good job
00:25:53.040 | that people don't know about,
00:25:54.160 | which is if you use Google Sheets,
00:25:56.160 | it's Gemini's integrated inside of Google Sheets
00:25:58.560 | and it helps you write formulas.
00:26:00.120 | - Great, that's news to me.
00:26:01.440 | - Right?
00:26:02.280 | They don't maybe do a good job.
00:26:04.920 | Unless you watch Google I/O,
00:26:06.400 | there was no other opportunity to learn
00:26:07.880 | that Gemini is now in your Google Sheets.
00:26:09.880 | And so I just don't write formulas manually anymore.
00:26:12.720 | It just prompts Gemini to do it for me and it does it.
00:26:15.600 | - Yeah, one of the problems that these machine learning
00:26:17.480 | models have is a discoverability problem.
00:26:18.920 | I think this will be figured out.
00:26:20.840 | I mean, it's the same problem that you have
00:26:22.600 | with any assistant.
00:26:25.160 | You're given a blank box and you're like,
00:26:26.920 | "What do I do with it?"
00:26:28.400 | No, I think this is great.
00:26:29.800 | More of these things, it would be good for them to exist.
00:26:32.760 | I want them to exist in ways that we can actually make sure
00:26:36.760 | that they're done correctly.
00:26:38.240 | I don't want to just have them be pushed
00:26:42.040 | into more and more things just blindly.
00:26:43.880 | I feel like lots of people, there are far too many.
00:26:47.440 | X plus AI, where X is like arbitrary thing in the world
00:26:51.600 | that has nothing to do with it
00:26:52.840 | and could not be benefited at all.
00:26:53.960 | And they're just doing it because they want to use the word.
00:26:56.520 | And I don't want that to happen.
00:26:58.480 | - You don't want an AI fridge?
00:26:59.760 | (both laughing)
00:27:00.800 | - No.
00:27:02.000 | Yes, I do not want my fridge on the internet.
00:27:03.560 | I do not want like, yeah.
00:27:05.360 | Okay, anyway, let's not go down that rabbit hole.
00:27:07.000 | I understand why some of that happens
00:27:08.520 | because people want to sell things and whatever.
00:27:10.440 | But I feel like a lot of people see that
00:27:12.600 | and then they write off everything as a result of it.
00:27:14.560 | And I just want to say, there are allowed to be people
00:27:17.720 | who are trying to do things that don't make any sense.
00:27:20.040 | Just ignore them.
00:27:21.120 | Do the things that make sense.
00:27:22.560 | - Another chunk of use cases was learning.
00:27:25.480 | So both explaining code, being a API reference,
00:27:29.600 | all of these different things.
00:27:31.080 | Any suggestions on like how to go at it?
00:27:34.080 | I feel like, you know, one thing is like generate code
00:27:37.080 | and then explain to me.
00:27:38.160 | One way is like, just tell me about this technology.
00:27:40.880 | Another thing is like, hey, I read this online.
00:27:42.920 | Kind of help me understand it.
00:27:44.560 | Any best practices on getting the most out of it or?
00:27:47.680 | - Yeah, I don't know if I have best practices.
00:27:49.640 | I have how I use them.
00:27:50.800 | - Yeah.
00:27:51.640 | - Yeah, I find it very useful for cases
00:27:55.280 | where I understand the underlying ideas,
00:27:58.440 | but I have never used them in this way before.
00:28:00.640 | I know what I'm looking for,
00:28:01.720 | but I just don't know how to get there.
00:28:03.600 | And so yeah, as an API reference is a great example.
00:28:06.240 | You know, the tool everyone always picks on is like FFmpeg.
00:28:09.960 | No one in the world knows the command line arguments
00:28:13.040 | to do what they want.
00:28:13.880 | They like make the thing faster.
00:28:16.360 | You know, I want lower bit rate, like dash V, you know,
00:28:20.280 | but like once you tell me what the answer is,
00:28:21.400 | like I can check.
00:28:22.400 | Like this is one of the things where it's great
00:28:24.040 | for these kinds of things.
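(For reference, the kind of answer you would then be checking: re-encode a video at a lower video bitrate with ffmpeg, wrapped in Python here for consistency. The filenames and the 1M target are illustrative; `-b:v` is ffmpeg's video-bitrate option and `-c:a copy` leaves the audio stream untouched.)

```python
import subprocess

subprocess.run(
    ["ffmpeg", "-i", "input.mp4",
     "-b:v", "1M",      # target roughly 1 Mbit/s for the video stream
     "-c:a", "copy",    # copy the audio stream as-is
     "output.mp4"],
    check=True,
)
```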
00:28:25.640 | Or, you know, in other cases,
00:28:27.640 | things where I don't really care
00:28:29.760 | that the answer was 100% correct.
00:28:31.320 | So for example, I do a lot of security work.
00:28:33.560 | Most of security work is reading some code
00:28:36.880 | you've never seen before and finding out
00:28:38.760 | which pieces of the code are actually important.
00:28:41.200 | Because, you know, most of the program
00:28:43.680 | doesn't actually have anything to do with security.
00:28:46.280 | It has, you know, the display piece
00:28:48.280 | or the other piece or whatever.
00:28:49.200 | And like, you just want to ignore all of that.
00:28:51.320 | So one very fun use of models is to like,
00:28:54.120 | just have it describe all of the functions
00:28:56.000 | and just skim it and be like, wait,
00:28:57.560 | which ones look like approximately
00:28:58.920 | the right things to look at?
00:29:00.280 | Because otherwise, what are you gonna do?
00:29:02.120 | You're gonna have to read them all manually.
00:29:03.320 | And when you're reading them manually,
00:29:04.360 | you're gonna skim the function anyway
00:29:06.080 | and not just figure out what's going on perfectly.
00:29:08.200 | Like you already know that when you're gonna read
00:29:10.840 | these things, what you're going to try and do
00:29:12.680 | is figure out roughly what's going on.
00:29:15.080 | And then you'll delve into the details.
00:29:16.600 | This is a great way of just doing that, but faster,
00:29:19.040 | because it will get most of it right.
00:29:21.840 | It's gonna be wrong some of the time, I don't care.
00:29:23.240 | I would have been wrong too.
00:29:24.640 | And as long as you treat it this way,
00:29:26.160 | I think it's great.
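(For reference, a rough sketch of the skim-first workflow being described, for a Python target: pull out every function and ask a model for a one-line summary before deciding what to read carefully. `ask_model` is again a placeholder, not a real library call, and the prompt and filename are illustrative.)

```python
import ast

def ask_model(prompt: str) -> str:
    """Placeholder: swap in your preferred LLM client here."""
    raise NotImplementedError("plug in an actual chat API")

source = open("target.py").read()
tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
        snippet = ast.get_source_segment(source, node)
        summary = ask_model(
            "In one line, what does this function do, and does it handle "
            f"untrusted input?\n\n{snippet}"
        )
        print(f"{node.name}: {summary}")
```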
00:29:27.000 | And so like one of the particular use cases I have
00:29:28.960 | in the thing is decompiling binaries,
00:29:32.080 | where, you know, oftentimes people will release a binary,
00:29:34.600 | they won't give you the source code
00:29:35.880 | and you wanna figure out how to attack it.
00:29:38.920 | And so one thing you could do is you could try
00:29:40.720 | and run some kind of decompiler.
00:29:42.440 | It turns out for the thing that I wanted, none existed.
00:29:44.680 | And so like I spent too many hours doing it by hand
00:29:48.120 | before I first thought, you know, like, why am I doing this?
00:29:50.320 | I should just check if the model can do it for me.
00:29:52.200 | And it turns out that it can,
00:29:53.160 | and it can turn the compiled code,
00:29:54.980 | which is impossible for any human to understand,
00:29:56.880 | into Python code that is entirely reasonable
00:29:58.760 | to understand.
00:29:59.600 | And, you know, it doesn't run, it has a bunch of problems,
00:30:02.000 | but like, it's so much nicer
00:30:03.320 | that it's immediately a win for me.
00:30:04.760 | I can just figure out approximately
00:30:06.320 | where I should be looking and then spend all of my time
00:30:08.460 | doing that sort of by hand.
00:30:10.080 | And again, like you get a big win there.
00:30:12.160 | - So, I mean, I fully agree with, you know,
00:30:14.680 | all those use cases.
00:30:15.840 | And especially for you as a security researcher
00:30:18.400 | and having to dive into multiple things,
00:30:21.840 | I imagine that's super helpful.
00:30:23.440 | I do think we want to sort of move to your other blog posts,
00:30:26.320 | but, you know, I wanted to,
00:30:27.960 | you ended your post with a little bit of a teaser
00:30:29.920 | about your next post and your speculations.
00:30:33.120 | What are you thinking about?
00:30:34.040 | - Okay, so I want to write something,
00:30:35.480 | and I will do that at some point when I have time,
00:30:37.840 | maybe after I'm done writing my current papers
00:30:39.560 | for ICLR or something,
00:30:40.720 | where I want to talk about some thoughts I have
00:30:44.000 | for where language models are going in the near-term future.
00:30:46.680 | The reason why I want to talk about this
00:30:48.040 | is because, again, I feel like the discussion
00:30:50.280 | tends to be people who are either very much AGI by 2027,
00:30:55.240 | or-- - Always five years away.
00:30:57.560 | - Yes, or are going to make statements of the form,
00:31:01.080 | you know, LLMs are the wrong path,
00:31:03.240 | and, you know, we should be abandoning this,
00:31:05.060 | and we should be doing something else instead.
00:31:06.760 | And again, I feel like people tend to look at this
00:31:09.420 | and see these two polarizing options and go,
00:31:12.160 | well, those obviously are both very far extremes.
00:31:14.360 | Like, how do I actually, like,
00:31:16.200 | what's the more nuance to take here?
00:31:18.280 | And so I have some opinions about this
00:31:20.960 | that I want to put down.
00:31:22.860 | Just saying, you know, I have wide margins of error.
00:31:25.840 | I think you should too.
00:31:27.160 | If you would say there's a 0% chance that something,
00:31:30.000 | you know, the models will get very, very good
00:31:31.680 | in the next five years, you're probably wrong.
00:31:33.440 | If you're gonna say there's a 100% chance
00:31:35.040 | that in the next five years, then you're probably wrong.
00:31:37.740 | And like, to be fair, most of the people,
00:31:39.240 | if you read behind the headlines,
00:31:41.360 | actually say something like this.
00:31:43.280 | But it's very hard to get clicks on the internet
00:31:45.360 | of like, some things may be good in the future.
00:31:48.440 | Like, everyone wants like, you know, a very like,
00:31:51.360 | nothing is gonna be good.
00:31:52.560 | This is entirely wrong.
00:31:53.640 | It's gonna be amazing.
00:31:54.600 | You know, like, they want to see this.
00:31:56.360 | I want things who have,
00:31:57.760 | people who have negative reactions
00:31:58.960 | to these kinds of extreme views
00:32:00.680 | to be able to at least say like,
00:32:02.360 | to tell them there is something real here.
00:32:05.400 | It may not solve all of our problems,
00:32:07.080 | but it's probably going to get better.
00:32:08.640 | I don't know by how much.
00:32:10.040 | And that's basically what I want to say.
00:32:11.840 | And then at some point I'll talk about
00:32:13.760 | the safety and security things as a result of this.
00:32:16.640 | Because the way in which security intersects
00:32:19.000 | with these things depends a lot
00:32:20.680 | in exactly how people use these tools.
00:32:23.880 | You know, if it turns out to be the case
00:32:25.360 | that these models get to be truly amazing
00:32:28.040 | and can solve, you know, tasks completely autonomously,
00:32:31.840 | that's a very different security world to be living in
00:32:33.900 | than if there's always a human in the loop.
00:32:35.540 | And the types of security questions I would want to ask
00:32:37.600 | would be very different.
00:32:38.840 | And so I think, you know, in some very large parts,
00:32:42.320 | understanding what the future will look like
00:32:44.280 | a couple of years ahead of time
00:32:45.400 | is helpful for figuring out which problems
00:32:47.360 | as a security person, I want to solve now.
00:32:49.360 | - You mentioned getting clicks on the internet,
00:32:50.960 | but you don't even have like an X account or anything.
00:32:53.200 | How do you get people to read your stuff?
00:32:54.800 | What's the, what's your distribution strategy?
00:32:56.920 | Because this post was popping up everywhere.
00:32:59.320 | And then people on Twitter were like,
00:33:00.960 | Nicholas Carlini wrote this, like what's his handle?
00:33:03.840 | And it's like, he doesn't have it.
00:33:04.880 | It's like, how did you find it?
00:33:06.040 | What's the story?
00:33:07.560 | - So I have an RSS feed and an email list, and that's it.
00:33:12.240 | I don't like most social media things.
00:33:14.840 | I feel like, on principle, I feel like they have some harms.
00:33:18.000 | As a person, I have a problem when people say things
00:33:20.760 | that are wrong on the internet,
00:33:22.280 | and I would get nothing done if I were to have a Twitter.
00:33:25.080 | I would spend all of my time correcting people
00:33:27.800 | and getting into fights.
00:33:28.920 | And so I feel like it was just useful for me
00:33:31.080 | for this not to be an option.
00:33:32.720 | I tend to just post things online.
00:33:35.160 | Yeah, it's a very good question.
00:33:36.000 | I don't know how people find it.
00:33:37.040 | I feel like, for some things that I write,
00:33:39.320 | other people think it resonates with them,
00:33:41.560 | and then they put it on Twitter.
00:33:43.640 | - Hacker News as well.
00:33:44.680 | - Sure, yeah, yeah.
00:33:45.520 | I am, because my day job is doing research,
00:33:50.520 | I get no value for having this be picked up.
00:33:54.120 | There's no whatever.
00:33:55.240 | I don't need to be someone who has to have this other thing
00:33:57.720 | to give talks.
00:33:59.200 | And so I feel like I can just say what I want to say,
00:34:02.000 | and if people find it useful, then they'll share it widely.
00:34:04.200 | This one went pretty wide.
00:34:05.920 | I wrote a thing, whatever, sometime late last year
00:34:09.360 | about how to recover data off of an Apple ProFile drive
00:34:14.360 | from 1980.
00:34:17.080 | This probably got, I think, 1,000x less views than this,
00:34:21.100 | but I don't care.
00:34:22.000 | That's not why I'm doing this.
00:34:22.840 | This is the benefit of having a thing
00:34:24.840 | that I actually care about, which is my research.
00:34:26.760 | I would care much more if that didn't get seen.
00:34:29.160 | This is a thing that I write
00:34:30.600 | because I have some thoughts that I just want to put down.
00:34:32.960 | - I think it's the long-form thoughtfulness
00:34:35.600 | and authenticity that is sadly lacking sometimes
00:34:38.880 | in modern discourse that makes it attractive.
00:34:42.120 | And I think now you have a little bit of a brand
00:34:44.160 | of you are an independent thinker, writer, person
00:34:47.640 | that people are tuned in to pay attention
00:34:50.980 | to whatever is next coming.
00:34:52.400 | - Yeah, this kind of worries me a little bit.
00:34:54.760 | Whenever I have a popular thing that,
00:34:56.440 | and then I write another thing
00:34:57.360 | which is entirely unrelated, I don't want people-
00:35:00.560 | - You should actually just throw people off right now.
00:35:02.080 | - Exactly, I'm trying to figure out,
00:35:04.120 | I need to put something else online.
00:35:05.960 | So the last two or three things I've done in a row
00:35:07.720 | have been actually things that people should care about.
00:35:10.600 | So I have a couple of things I'm trying to figure out.
00:35:12.200 | Which one do I put online to just cull the list
00:35:14.880 | of people who have subscribed to my email?
00:35:16.280 | And so tell them, no, what you're here for
00:35:17.880 | is not informed, well-thought-through takes.
00:35:20.160 | What you're here for is whatever I want to talk about.
00:35:21.920 | And if you're not up for that, then go away.
00:35:24.160 | This is not what I want out of my personal website.
00:35:27.480 | - So here's top 10 enemies or something like that.
00:35:30.600 | What's the next project you're going to work on
00:35:32.360 | that is completely unrelated to research or LLMs?
00:35:35.640 | Or what games do you want to port into the browser next?
00:35:39.120 | - Okay, yeah, so maybe, okay, here's a fun question.
00:35:43.320 | How much data do you think you can put
00:35:45.240 | on a single piece of paper?
00:35:47.320 | - I mean, you can think about bits and atoms.
00:35:49.160 | - Yeah, no, like a normal printer.
00:35:51.320 | Like I gave you an office printer.
00:35:53.120 | How much data can you put on a piece of paper?
00:35:54.440 | - Can you re-encode it?
00:35:56.240 | So like, you know, Base64 or whatever.
00:35:58.880 | - Yeah, whatever you want.
00:36:00.080 | You get normal off-the-shelf printer,
00:36:01.400 | off-the-shelf scanner.
00:36:02.420 | How much data?
00:36:03.260 | - I'll just throw out there, like 10 megabytes.
00:36:07.040 | - Oh, that's enormous.
00:36:07.880 | - I know.
00:36:08.700 | (laughing)
00:36:09.540 | - That's a lot.
00:36:10.380 | - Really small fonts.
00:36:11.440 | - That's my question.
00:36:12.680 | So I have a thing that does about a megabyte.
00:36:14.600 | - Yeah, okay, there you go.
00:36:15.440 | That's the same order of magnitude.
00:36:16.760 | - Yeah, okay, so in particular,
00:36:18.320 | it's about 1.44 megabytes.
00:36:20.760 | - Floppy disk.
00:36:21.600 | - Yeah, exactly.
00:36:22.420 | This is supposed to be the title at some point,
00:36:23.800 | is the floppy disk.
00:36:24.620 | - A paper is a floppy disk?
00:36:25.460 | - Yeah, so this is a little hard because, you know,
00:36:27.280 | so you can do the math and you get 8 1/2 by 11.
00:36:30.480 | You can print at 300 by 300 DPI,
00:36:33.000 | and this gives you two megabytes.
00:36:35.240 | And so you need to be able,
00:36:36.080 | like so if you, every single pixel,
00:36:38.240 | you need to be able to recover up to like 90 plus percent,
00:36:41.480 | like 95%, like 99 point something percent accuracy
00:36:44.840 | in order to be able to actually decode this off the paper.
00:36:47.420 | This is one of the things that I'm considering.
00:36:50.360 | I need to like get a couple more things working for this
00:36:52.840 | where, you know, again, I'm running to some random problems,
00:36:55.960 | but this is probably, this will be one thing
00:36:57.560 | that I'm going to talk about.
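A back-of-the-envelope sketch of the arithmetic being described here, assuming a letter-size page, a 300 DPI printer, and one recoverable bit per printed dot; the real encoding and error-correction overhead will differ:

    width_in, height_in = 8.5, 11          # letter-size paper
    dpi = 300                              # off-the-shelf printer resolution

    dots = (width_in * dpi) * (height_in * dpi)   # total printable dots
    raw_bytes = dots / 8                          # assume 1 recoverable bit per dot

    print(f"{dots:,.0f} dots -> about {raw_bytes / 1e6:.2f} MB of raw capacity")
    # ~8.4 million dots, so on the order of a megabyte or two of raw capacity
    # depending on how many bits you squeeze out of each dot; error correction
    # for the print/scan round trip pulls the usable payload toward ~1.44 MB.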
00:36:59.360 | There's this contest called
00:37:00.200 | the International Obfuscated C Code Contest,
00:37:01.800 | which is amazing.
00:37:02.800 | People try and write the most obfuscated C code that they can,
00:37:05.640 | which is great.
00:37:06.560 | And I have a submission for that
00:37:08.200 | whenever they open up the next one for it,
00:37:10.760 | and I'll write about that submission.
00:37:12.120 | I have a very fun gate level emulation of an old CPU
00:37:15.920 | that runs like fully precisely,
00:37:18.560 | and it's a fun kind of thing.
00:37:20.240 | - Interesting.
00:37:21.080 | Your comment about the piece of paper
00:37:22.440 | reminds me of when I was in college
00:37:24.040 | and you would have like one cheat sheet
00:37:26.040 | that you could write, right?
00:37:26.960 | So you have a formula, a theoretical limit
00:37:29.080 | for bits per inch.
00:37:31.160 | And, you know, that's how much
00:37:33.080 | I would squeeze in, really, really small writing,
00:37:35.240 | to fill one of those sheets.
00:37:36.080 | - Definitely, yeah.
00:37:36.920 | - Okay, we are also going to talk about your benchmarking
00:37:39.480 | because you released your own benchmark
00:37:41.480 | that got some attention
00:37:43.080 | thanks to some friends on the internet.
00:37:45.240 | What's the story behind your own benchmark?
00:37:47.080 | Do you not trust the open source benchmarks?
00:37:50.800 | What's going on there?
00:37:51.640 | - Okay, benchmarks tell you how well the model solves
00:37:55.400 | the task the benchmark is designed to solve.
00:37:57.360 | For a long time, models were not useful.
00:37:59.720 | And so the benchmark that you tracked
00:38:01.320 | was just something someone came up with
00:38:03.120 | because you need to track something.
00:38:05.480 | All of deep learning exists
00:38:07.920 | because people tried to make models classify digits
00:38:12.240 | and classify images into a thousand classes.
00:38:14.720 | There is no one in the world
00:38:16.440 | who cares specifically about the problem
00:38:18.800 | of distinguishing between 300 breeds of dog
00:38:22.040 | for an image that's 224 by 224 pixels.
00:38:24.680 | And yet, like, this is what drove a lot of progress.
00:38:26.960 | And people did this,
00:38:27.920 | not because they cared about this problem,
00:38:29.520 | because they want to just measure progress in some way.
00:38:32.000 | And a lot of benchmarks are of this flavor.
00:38:34.240 | You want to construct a task that is hard,
00:38:36.240 | and we will measure progress on this benchmark,
00:38:38.280 | not because we care about the problem per se,
00:38:39.880 | but because we know that progress on this
00:38:41.520 | is in some way correlated with making better models.
00:38:44.160 | And this is fine when you don't want to actually use
00:38:46.600 | the models that you have.
00:38:48.000 | But when you want to actually make use of them,
00:38:50.160 | it's important to find benchmarks that track
00:38:52.760 | with whether or not they're useful to you.
00:38:54.400 | And the thing that I was finding
00:38:56.360 | is that there would be model after model after model
00:38:58.720 | that was being released that would find some benchmark
00:39:01.800 | that they could claim state-of-the-art on
00:39:03.600 | and then say, "Therefore, ours is the best."
00:39:05.800 | And that wouldn't be helpful to me
00:39:07.960 | to know whether or not I should then switch to it.
00:39:10.280 | So the argument that I tried to lay out in this post
00:39:13.160 | is that more people should make benchmarks
00:39:16.440 | that are tailored to them.
00:39:17.840 | And so what I did is I wrote a domain-specific language
00:39:21.120 | that anyone can write for,
00:39:22.480 | and say, you can take tasks
00:39:25.040 | that you have wanted models to solve for you,
00:39:27.480 | and you can put them into your benchmark
00:39:30.160 | that's the thing that you care about.
00:39:31.440 | And then when a new model comes out,
00:39:32.600 | you benchmark the model on the things that you care about,
00:39:35.840 | and you know that you care about them
00:39:36.760 | because you've actually asked for those answers before.
00:39:39.000 | And if the model scores well,
00:39:40.160 | then you know that for the kinds of things
00:39:41.640 | that you have asked models for in the past,
00:39:43.000 | it can solve these things well for you.
00:39:45.040 | This has been useful for me
00:39:46.320 | because when another model comes out,
00:39:47.440 | I just, I can run it, I can see is this,
00:39:49.280 | does this solve the kinds of things that I care about?
00:39:51.040 | And sometimes the answer is yes,
00:39:52.120 | and sometimes the answer is no.
00:39:53.680 | And then I can decide whether or not
00:39:55.200 | I want to use that model or not.
00:39:56.880 | I don't want to say that existing benchmarks are not useful.
00:40:00.080 | They're very good at measuring the thing
00:40:01.720 | that they're designed to measure.
00:40:03.760 | But in many cases, what that's designed to measure
00:40:06.760 | is not actually the thing that I want to use it for.
00:40:08.640 | And I would expect that the way that I want to use it
00:40:10.600 | is different than the way that you want to use it.
00:40:12.080 | And I would just like more people
00:40:13.920 | to have these things out there in the world.
00:40:15.800 | And the final reason for this is,
00:40:17.920 | it is very easy if you want to make a model
00:40:20.840 | good at some benchmark to make it good at that benchmark.
00:40:23.520 | You can sort of like find the distribution of data
00:40:25.600 | that you need and train the model to be good
00:40:27.480 | on the distribution of data.
00:40:28.440 | And then you have your model
00:40:29.820 | that can solve this benchmark well.
00:40:31.440 | And by having a benchmark that is not very popular,
00:40:35.540 | you can be relatively certain
00:40:37.320 | that no one has tried to optimize their model
00:40:39.180 | for your benchmark.
00:40:40.280 | And I would like to be--
00:40:41.120 | - So publishing your benchmark is a little bit--
00:40:42.720 | (laughing)
00:40:43.560 | - Okay, sure.
00:40:44.380 | Yeah, okay, so my hope in doing this
00:40:47.440 | was not that people would use mine as theirs.
00:40:50.680 | My hope in doing this was that people would say--
00:40:52.160 | - You should make yours.
00:40:53.000 | - Yes, you should make your benchmark.
00:40:54.040 | And if, for example, there were even
00:40:57.760 | a very small fraction of people, 0.1% of people
00:41:00.120 | who made a benchmark that was useful for them,
00:41:01.980 | this would still be hundreds of new benchmarks
00:41:04.320 | out there. And I might not want to make one myself,
00:41:06.040 | but I might know that the kinds of work that I do
00:41:09.400 | are a little bit like this person's,
00:41:10.600 | a little bit like that person's.
00:41:11.440 | I'll go check how it does on their benchmarks
00:41:13.160 | and I'll roughly get a good sense
00:41:16.220 | of what's going on because the alternative
00:41:18.220 | is people just do this vibes-based evaluation thing
00:41:21.620 | where you interact with the model five times
00:41:23.580 | and you see if it worked on the kinds of things
00:41:24.980 | you just asked, like your toy questions.
00:41:26.660 | But five questions gives you a very low-bit signal
00:41:29.420 | of whether or not it works for this thing.
00:41:31.060 | And if you could just automate
00:41:32.300 | running 100 questions for you,
00:41:33.620 | it's a much better evaluation.
00:41:35.300 | So that's why I did this.
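A minimal sketch of the kind of personal benchmark being described, not the actual DSL from the post; call_model and the two test cases are illustrative stand-ins:

    # Each test is a prompt you have genuinely asked a model before,
    # paired with a programmatic check of the answer.

    def call_model(prompt: str) -> str:
        raise NotImplementedError("plug in whatever model API you actually use")

    TESTS = [
        ("Write a Python one-liner that reverses a string s.",
         lambda out: "s[::-1]" in out),
        ("What is 2**20?",
         lambda out: "1048576" in out),
    ]

    def run_benchmark() -> None:
        passed = sum(bool(check(call_model(prompt))) for prompt, check in TESTS)
        print(f"{passed}/{len(TESTS)} personal tasks solved")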
00:41:37.220 | - Yeah, I like the idea of going through your chat history
00:41:39.700 | and actually pulling out real-life examples.
00:41:42.420 | I regret to say that I don't think my chat history
00:41:44.840 | is used as much these days because I'm using Cursor,
00:41:47.560 | like the sort of native AI IDE.
00:41:50.480 | So your examples are all coding related.
00:41:52.800 | And the immediate question is like,
00:41:54.680 | now that you've written the "How I Use AI" post,
00:41:57.600 | which is a little bit broader,
00:41:59.440 | are you able to translate all these things to evals?
00:42:01.600 | Are some things unevaluable?
00:42:03.720 | - Right, a number of things that I do
00:42:05.360 | are harder to evaluate.
00:42:06.440 | So this is the problem with a benchmark
00:42:08.400 | is you need some way to check
00:42:10.160 | whether or not the output was correct.
00:42:12.200 | And so all of the kinds of things
00:42:13.540 | that I can put into the benchmark
00:42:14.620 | are the kinds of things that you can check.
00:42:16.620 | You can check more things
00:42:17.920 | than you might have thought would be possible
00:42:19.820 | if you do a little bit of work on the backend.
00:42:22.220 | So for example, all of the code that I have the model write,
00:42:24.960 | it runs the code and sees whether the answer
00:42:26.660 | is the correct answer.
00:42:28.180 | Or in some cases, it runs the code,
00:42:30.060 | feeds the output to another language model,
00:42:31.780 | and the language model judges whether the output was correct.
00:42:34.260 | And again, is using a language model
00:42:36.060 | to judge here perfect?
00:42:36.940 | No, but like, what's the alternative?
00:42:39.100 | The alternative is to not do it.
00:42:41.220 | And what I care about is just,
00:42:43.460 | is this thing broadly useful
00:42:45.240 | for the kinds of questions that I have?
00:42:46.540 | And so as long as the accuracy
00:42:47.660 | is better than roughly random,
00:42:49.460 | like I'm okay with this.
00:42:51.660 | I've inspected the outputs of these
00:42:52.860 | and like, they're almost always correct.
00:42:54.460 | If you sort of, if you ask the model
00:42:55.940 | to judge these things in the right way,
00:42:57.420 | they're very good at being able to tell this.
00:42:59.680 | And so yeah, I probably think
00:43:02.220 | this is a useful thing for people to do.
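A sketch of the two checking strategies described here: executing the model's code and checking its output, and falling back to a second model as a judge; the judge prompt and the call_model argument are illustrative:

    import subprocess
    import sys
    import tempfile

    def check_by_running(code: str, expected_substring: str) -> bool:
        # Write the model-generated code to a temp file, run it, and pass if the
        # expected output shows up. (In practice you would sandbox this, e.g. in
        # a container, rather than running it directly.)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True, timeout=30)
        return expected_substring in result.stdout

    def check_by_llm_judge(question: str, output: str, call_model) -> bool:
        # Ask a second model to grade the output; imperfect, but usable as long
        # as it is right far more often than chance.
        verdict = call_model(
            f"Question: {question}\nAnswer: {output}\n"
            "Reply with exactly YES if the answer is correct, otherwise NO.")
        return verdict.strip().upper().startswith("YES")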
00:43:04.140 | - You complained about prompting and being lazy
00:43:07.660 | and how you do not want to tip your model
00:43:09.780 | and you do not want to murder a kitten
00:43:12.540 | just to get the right answer.
00:43:14.060 | How do you see the evolution of like prompt engineering?
00:43:16.660 | Even like 18 months ago,
00:43:17.980 | maybe, you know, it was kind of like really hot
00:43:19.900 | and people wanted to like build companies around it.
00:43:21.660 | Today, it's like the models are getting good.
00:43:23.180 | Do you think it's going to be
00:43:24.020 | less and less relevant going forward
00:43:25.820 | or what's the minimum valuable prompt?
00:43:28.660 | - Yeah, I don't know.
00:43:29.580 | I feel like a big part of making an agent
00:43:31.700 | is just like a fancy prompt
00:43:33.380 | that like, you know, calls back to the model again.
00:43:36.260 | I have no opinion.
00:43:37.140 | It seems like maybe it turns out
00:43:38.860 | that this is really important.
00:43:39.980 | Maybe it turns out that this isn't.
00:43:41.420 | I guess the only comment I was making here
00:43:43.040 | is just to say, oftentimes when I use a model
00:43:47.140 | and I find it's not useful,
00:43:48.220 | I talk to people who help make it.
00:43:50.260 | The answer they usually give me is like,
00:43:51.660 | you're using it wrong.
00:43:53.140 | Which like reminds me very much of like
00:43:54.260 | that you're holding it wrong
00:43:55.220 | from like the iPhone kind of thing, right?
00:43:56.580 | Like, you know, like,
00:43:57.820 | I don't care that I'm holding it wrong.
00:43:58.980 | I'm holding it that way.
00:43:59.860 | If the thing is not working with me,
00:44:01.380 | then like it's not useful for me.
00:44:02.500 | Like it may be the case that there exists
00:44:05.140 | a way to ask the model
00:44:06.260 | such that it gives me the answer that's correct.
00:44:08.180 | But that's not the way I'm doing it.
00:44:10.660 | If I have to spend so much time thinking
00:44:12.820 | about how I want to frame the question,
00:44:14.820 | that it would have been faster for me
00:44:15.780 | just to get the answer.
00:44:17.060 | It didn't save me any time.
00:44:18.380 | And so oftentimes, you know what I do is like,
00:44:20.140 | I just dump in whatever current thought that I have
00:44:22.260 | in whatever ill-formed way it is.
00:44:24.220 | And I expect the answer to be correct.
00:44:26.500 | And if the answer is not correct,
00:44:27.420 | like in some sense,
00:44:28.500 | maybe the model was right to give me the wrong answer.
00:44:30.380 | Like I may have asked the wrong question,
00:44:33.100 | but I want the right answer still.
00:44:34.420 | And so like, I just want to sort of get this as a thing.
00:44:38.300 | And maybe the way to fix this is
00:44:40.580 | you have some default prompt
00:44:41.660 | that always goes into all the models or something.
00:44:43.620 | Or you do something like clever like this.
00:44:45.780 | It would be great if someone had a way
00:44:47.020 | to package this up and make a thing.
00:44:48.340 | I think that's entirely reasonable.
00:44:49.940 | Maybe it turns out that as models get better,
00:44:51.420 | you don't need to prompt them as much in this way.
00:44:53.180 | I don't know.
00:44:54.020 | I just want to use the things that are in front of me.
00:44:55.540 | - Do you think that's like a limitation
00:44:57.660 | of just how models work?
00:44:59.180 | Like, you know, at the end of the day,
00:45:00.420 | you're using the prompt to kind of like steer it
00:45:02.500 | in the latent space.
00:45:03.340 | Like, do you think there's a way
00:45:04.420 | to actually not make the prompt really relevant
00:45:07.060 | and have the model figure it out?
00:45:08.060 | Or like, what's the--
00:45:08.900 | - I mean, you could fine tune it into the model,
00:45:11.140 | for example, that like it's supposed to.
00:45:12.860 | I mean, it seems like some models have done this,
00:45:14.300 | for example.
00:45:15.140 | Like some recent model, many recent models,
00:45:16.540 | if you ask them a question,
00:45:17.820 | computing an integral of this thing,
00:45:19.580 | they'll say, "Let's think through this step-by-step."
00:45:21.540 | And then they'll go through the step-by-step answer.
00:45:22.900 | I didn't tell it.
00:45:23.820 | Two years ago, I would have had to prompt it.
00:45:25.980 | Think step-by-step on solving the following thing.
00:45:27.900 | Now you ask them the question and the model says,
00:45:30.300 | "Here's how I'm going to do it.
00:45:31.140 | "I'm going to take the following approach."
00:45:32.380 | And then like sort of self-prompt itself.
00:45:34.220 | Is this the right way?
00:45:35.620 | Seems reasonable.
00:45:36.700 | Maybe you don't have to do it.
00:45:37.620 | I don't know.
00:45:38.460 | This is for the people whose job
00:45:39.860 | is to make these things better.
00:45:40.780 | And yeah, I just want to use these things.
00:45:43.340 | - For listeners, that would be Orca and Agent Instruct,
00:45:46.420 | which is the SOTA on this stuff.
00:45:48.340 | - Great. - Yeah.
00:45:49.260 | - Is few-shot included in the lazy prompting?
00:45:52.500 | Like, do you do few-shot prompting?
00:45:54.380 | Like, do you collect some examples
00:45:55.780 | when you want to put them in, or?
00:45:57.140 | - I don't because usually when I want the answer,
00:46:00.260 | I just, I want to get the answer.
00:46:02.140 | (laughing)
00:46:02.980 | - Brutal, this is hard mode.
00:46:04.180 | - Yeah, exactly.
00:46:05.180 | This is fine.
00:46:06.260 | I want to be clear.
00:46:07.100 | There's a difference between
00:46:08.260 | testing the ultimate capability level of the model
00:46:10.740 | and testing the thing that I'm doing with it.
00:46:12.740 | What I'm doing is I'm not exercising
00:46:14.340 | its full capability level.
00:46:15.620 | Because there are almost certainly better ways
00:46:17.180 | to ask the questions
00:46:18.020 | and sort of really see how good the model is.
00:46:19.940 | And if you're evaluating a model
00:46:22.180 | for being state-of-the-art,
00:46:23.460 | this is ultimately what you care about.
00:46:24.780 | And so I'm entirely fine with people doing fancy prompting
00:46:27.380 | to show me what the true capability level could be.
00:46:29.860 | Because it's really useful to know
00:46:31.100 | what the ultimate level of the model could be.
00:46:32.740 | But I think it's also important
00:46:33.900 | just to have available to you
00:46:35.820 | how good the model is if you don't do fancy things.
00:46:39.220 | - Yeah, I will say that here's a divergence
00:46:40.980 | between how models are marketed these days
00:46:43.780 | versus how people use it,
00:46:45.860 | which is when they test MMLU,
00:46:47.420 | they'll do like five shots, 25 shots, 50 shots.
00:46:50.620 | And no one's providing 50 examples.
00:46:53.020 | - I completely agree.
00:46:54.900 | You know, for these numbers,
00:46:56.460 | the problem is everyone wants to get state-of-the-art
00:46:58.020 | on the benchmark.
00:46:58.860 | And so you find the way
00:47:00.220 | that you can ask the model the questions
00:47:01.940 | so that you get state-of-the-art on the benchmark.
00:47:04.300 | And it's legitimately good to know.
00:47:06.700 | Like it's good to know the model can do this thing
00:47:08.860 | if only you try hard enough.
00:47:10.660 | Because it means that if I have some tasks
00:47:12.980 | that I want to be solved,
00:47:14.180 | I know what the capability level is.
00:47:16.460 | And I could get there if I was willing to work hard enough.
00:47:18.500 | And the question then is,
00:47:19.540 | should I work harder
00:47:20.380 | and figure out how to ask the model the question?
00:47:21.860 | Or do I just do the thing myself?
00:47:23.020 | And for me, I have programmed for many, many, many years.
00:47:26.260 | It's often just faster for me just to do the thing
00:47:28.460 | than to like figure out the incantation to ask the model.
00:47:31.300 | But I can imagine someone who has never programmed before
00:47:34.380 | might be fine writing five paragraphs in English,
00:47:37.340 | describing exactly the thing that they want
00:47:39.100 | and have the model build it for them
00:47:41.060 | if the alternative is not.
00:47:43.180 | But again, this goes to all these questions
00:47:44.900 | of how are they going to validate?
00:47:46.740 | Should they be trusting the output?
00:47:47.860 | These kinds of things, but yeah.
00:47:49.740 | - One problem with your eval paradigm
00:47:53.340 | and most eval paradigms, I'm not picking on you,
00:47:55.940 | is that we're actually training these things for chat,
00:47:58.580 | for interactive back and forth.
00:47:59.940 | And you actually obviously reveal much more information
00:48:02.260 | in the same way that asking 20 questions
00:48:04.220 | reveals more information
00:48:05.180 | in sort of like a tree search branching sort of way.
00:48:08.300 | Then this is also by the way,
00:48:09.540 | the problem with LMSYS's arena, right?
00:48:10.980 | Where the vast majority of prompts are single question,
00:48:13.580 | single answer, eval, done.
00:48:15.300 | But actually the way that we use chat things,
00:48:18.380 | in the way, even in the stuff that you posted
00:48:20.060 | in your "How I Use AI" post,
00:48:21.220 | you have maybe 20 turns of back and forth.
00:48:24.460 | How do you eval that?
00:48:25.420 | - Yeah, okay, very good question.
00:48:26.980 | This is the thing that I think many people
00:48:28.740 | should be doing more of.
00:48:30.100 | I would like more multi-turn evals.
00:48:31.780 | I might be writing a paper on this at some point
00:48:33.940 | if I get around to it.
00:48:35.340 | A couple of the evals in the benchmark thing I have
00:48:38.140 | are already multi-turn.
00:48:39.740 | I mentioned 20 questions.
00:48:40.580 | I have a 20 question eval there, just for fun.
00:48:43.540 | But I have a couple others that are like,
00:48:46.300 | I just tell the model, here's my git thing,
00:48:48.700 | figure out how to cherry-pick off this other branch
00:48:50.700 | and move it over there.
00:48:51.860 | And so what I do is I just,
00:48:53.340 | I basically build a tiny little agent-y thing.
00:48:55.620 | I just ask the model how I do it.
00:48:57.660 | I run the thing on Linux.
00:49:00.300 | I'd spin up a Docker.
00:49:01.220 | This is what I want a Docker for.
00:49:02.340 | I spin up a Docker container.
00:49:03.740 | I run whatever the model told me the output to do is.
00:49:06.860 | I feed the output back into the model.
00:49:08.100 | I repeat this many rounds.
00:49:09.300 | And then I check at the very end,
00:49:11.100 | does the git commit history show
00:49:12.860 | that it is correctly cherry picked in this way?
00:49:15.100 | And so I have a couple of these.
00:49:16.740 | I agree that I have many fewer
00:49:17.780 | than what I actually use them for.
00:49:19.420 | And I think the reason why
00:49:20.260 | is just that it's hard to evaluate this.
00:49:22.060 | Like it's more challenging to do this kind of evaluation.
00:49:24.540 | Yeah, I would like to see a lot more
00:49:26.740 | of these kinds of things to exist
00:49:28.580 | so that people could come up with these evals
00:49:31.580 | that more closely measure what they're actually doing.
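A stripped-down sketch of the multi-turn loop described above: a disposable Docker container, the model proposing one shell command per turn, the output fed back in, and a final check of the git history; the container name, prompts, and success string are illustrative:

    import subprocess

    def run_in_container(container: str, cmd: str) -> str:
        # Execute one shell command inside an already-running container
        # (started beforehand with something like `docker run -d --name eval ...`).
        out = subprocess.run(["docker", "exec", container, "bash", "-lc", cmd],
                             capture_output=True, text=True)
        return out.stdout + out.stderr

    def multi_turn_eval(call_model, container: str, task: str, rounds: int = 10) -> bool:
        transcript = f"Task: {task}\nReply with exactly one shell command per turn."
        for _ in range(rounds):
            cmd = call_model(transcript).strip()
            result = run_in_container(container, cmd)
            transcript += f"\n$ {cmd}\n{result}"
        # Final check: did the commit actually get cherry-picked onto the branch?
        log = run_in_container(container, "git -C /repo log --oneline main")
        return "expected commit message" in log   # placeholder success criterion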
00:49:34.820 | - Just before we wrap on this,
00:49:36.620 | there was one example about UUencoding.
00:49:39.340 | And you mentioned how like nobody uses this thing anymore.
00:49:42.780 | When you run into something like this
00:49:44.540 | and you know that no more data
00:49:46.220 | is gonna get produced on this thing,
00:49:48.540 | do you figure out how to like fine tune the model?
00:49:52.060 | Like if it really mattered to you,
00:49:53.940 | put together some examples or would you just say,
00:49:55.820 | hey, the model just doesn't do it, whatever, move on?
00:49:57.980 | - Yeah, yeah, okay.
00:49:59.220 | This was an example of a thing
00:50:01.540 | where I was looking at some data that was,
00:50:04.500 | there was a file that was produced
00:50:07.420 | in like the mid '90s, early '90s or something,
00:50:11.220 | when UU encoding was actually a thing that people would do.
00:50:13.700 | And I wanted the model to be able
00:50:15.860 | to automatically determine the type of file
00:50:17.340 | to decompress in something.
00:50:18.740 | And like it was doing it correctly for like 99% of cases.
00:50:21.860 | And like I found a few UU encoded things
00:50:23.460 | where like it couldn't figure out
00:50:24.380 | this was UU encoding, not base 64.
00:50:26.020 | Okay, this is not important.
00:50:27.620 | I just was curious if it could do it.
00:50:28.860 | And so I put this as a thing.
00:50:30.900 | I think probably this is the thing
00:50:33.020 | that if you really cared about this task being solved well,
00:50:35.300 | you would train a model for.
00:50:37.020 | But again, this is one of these kinds of tasks
00:50:39.100 | that this was some dumb project
00:50:41.100 | that like no one's gonna care about.
00:50:42.220 | I just wanted to see if I could do it.
00:50:43.940 | If the model was good enough
00:50:44.900 | that it gets me 90% of the way there, good, like done.
00:50:46.740 | Like I figured it out.
00:50:47.660 | Like I can sort of have fun for a couple hours
00:50:49.220 | and then move on.
00:50:50.060 | And that's all I want.
00:50:50.900 | I was not like, if I ever had to train a thing for this,
00:50:53.180 | I was not gonna do it.
00:50:54.020 | And so it did well enough for me that I could move on.
00:50:57.500 | - It does give me an idea for adversarial examples
00:51:00.740 | inside of a benchmark that are basically canaries
00:51:03.300 | for overtraining on the benchmark.
00:51:05.100 | Typically right now, benchmarks have canary strings,
00:51:07.340 | or if you ask it to repeat back the string and it does,
00:51:09.420 | then it's trained on it.
00:51:10.620 | But you know, it's easy to filter out those things.
00:51:12.380 | But the benchmarks, you put in some things,
00:51:14.860 | some questions that are intentionally wrong,
00:51:16.700 | and if it gives you the intentionally wrong answer,
00:51:18.740 | then you know it's.
00:51:19.900 | - Yeah, there are actually a couple of papers
00:51:22.340 | that don't do exactly this,
00:51:24.460 | but that are doing dataset inference.
00:51:26.860 | So the field of work called membership inference,
00:51:29.540 | this is one of the things I do research on,
00:51:31.420 | that tries to figure out,
00:51:32.260 | did you train on this example or not?
00:51:33.940 | There's a field called like dataset inference.
00:51:35.740 | Did you train on this dataset or not?
00:51:37.180 | And there's like a specific subfield of this
00:51:39.540 | that looks specifically at,
00:51:42.060 | like did you train on your test set
00:51:43.700 | or you train on your training set?
00:51:45.260 | And they basically look at exactly this.
00:51:47.460 | Like for example, one,
00:51:48.380 | there's this paper by Tatsu out of Stanford,
00:51:50.940 | where they check if the order
00:51:54.300 | that the specific questions happen to be in matters.
00:51:57.380 | And if the answer is yes,
00:51:58.300 | then you probably trained on it
00:51:59.260 | because the order of the questions is arbitrary
00:52:00.660 | and shouldn't matter.
00:52:01.500 | There are a number of papers that follow up on this
00:52:02.780 | and do some similar things.
00:52:03.700 | I think this is a great way of doing this now.
00:52:06.260 | It might be even better if some people
00:52:07.540 | included some canary questions in their benchmarks,
00:52:09.820 | but even if they don't,
00:52:10.780 | you can already sort of start getting at this now.
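A rough sketch in the spirit of the question-ordering test mentioned here; sequence_logprob stands in for however you score a string under the model:

    import random

    def order_sensitivity_test(questions, sequence_logprob, trials: int = 100) -> float:
        # The canonical ordering of a benchmark carries no information on its own,
        # so if the model likes it much more than random reorderings of the same
        # questions, that is evidence the test set was in the training data.
        canonical = sequence_logprob("\n".join(questions))
        at_least_as_good = 0
        for _ in range(trials):
            shuffled = list(questions)
            random.shuffle(shuffled)
            if sequence_logprob("\n".join(shuffled)) >= canonical:
                at_least_as_good += 1
        # Fraction of random orderings scoring at least as well; near zero is suspicious.
        return at_least_as_good / trials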
00:52:13.260 | - Yeah.
00:52:14.100 | - Yeah, let's go into some of your research.
00:52:15.700 | I always love security work.
00:52:17.660 | I was at Black Hat last week.
00:52:19.140 | I had to miss DEF CON.
00:52:20.540 | Let's start from the LAION-400M,
00:52:24.260 | kind of like the data poisoning.
00:52:27.300 | So basically the idea is,
00:52:29.260 | LAION-400M is one of the biggest image datasets
00:52:32.020 | for image models.
00:52:33.300 | And a lot of the image gets pulled from live domains.
00:52:36.340 | So it's not all, yeah.
00:52:38.340 | - Every image gets pulled from a live domain, yes.
00:52:39.900 | - So it's not all stored
00:52:41.060 | and a bunch of the domains expired.
00:52:42.900 | So then you went on and you bought the domains
00:52:44.700 | and you get to put literally anything on it.
00:52:47.020 | And you get to poison every single model
00:52:49.340 | that was training in the dataset.
00:52:51.180 | - Yep, it was a lot of fun.
00:52:52.540 | - Maybe just talk about some of the things
00:52:54.540 | that people don't think about
00:52:55.900 | when it comes to like the datasets.
00:52:57.100 | We talked before about low background tokens.
00:52:59.220 | So before maybe 2020,
00:53:01.420 | you can imagine most things you get from the internet,
00:53:04.100 | a human wrote, or like, you know.
00:53:06.540 | After 2021, you can imagine most things written
00:53:09.140 | are like somewhat AI generated.
00:53:11.620 | Any other fun stories
00:53:12.780 | or like maybe give more of the LAION background.
00:53:14.460 | How did you figure out,
00:53:15.420 | do you just like check all the domains in it
00:53:17.980 | and see what expired?
00:53:18.900 | Why did they not do it to prevent this?
00:53:21.700 | - Okay, so why did this paper happen?
00:53:23.780 | The adversarial machine learning literature
00:53:25.900 | for a very long time was focused on
00:53:29.060 | what could I do in the worst case?
00:53:32.700 | Because no one was using these tools.
00:53:34.500 | And no one's using them,
00:53:35.340 | it doesn't make sense to really ask like,
00:53:37.140 | how do I attack this actual system?
00:53:38.860 | And so people would write papers,
00:53:40.500 | I mean, me included, I have lots of these
00:53:41.980 | that like assume an adversary could do the following
00:53:45.340 | and then list 10 unrealistic things.
00:53:47.500 | Then very bad harm could happen.
00:53:49.460 | And in some sense, like you have to do this.
00:53:51.540 | If you have no real system in front of you,
00:53:53.180 | like what are you gonna do as a security researcher?
00:53:54.940 | One thing you could do is just nothing.
00:53:56.020 | You could just wait.
00:53:56.860 | Like this is a bad option
00:53:58.060 | because eventually someone's gonna use these things
00:53:59.540 | and you would rather have a headstart.
00:54:00.900 | So how do you get a headstart?
00:54:01.900 | You make a guess.
00:54:03.020 | You say maybe future systems will do X.
00:54:05.300 | And then you write a paper that sort of looks at this.
00:54:07.860 | And then maybe it turns out that some of these
00:54:09.380 | are directionally correct, some are not.
00:54:10.860 | And so, okay.
00:54:11.700 | So this has happened for quite some long time.
00:54:13.220 | And then machine learning started to work.
00:54:14.820 | And the thing that bothered me is it seems like
00:54:17.220 | the adversarial machine learning community
00:54:18.460 | didn't then try and adapt
00:54:19.620 | and try and actually start studying real problems.
00:54:21.860 | So we very deliberately started looking like,
00:54:24.740 | what are the problems that actually arise in real systems
00:54:28.140 | as they exist now?
00:54:29.140 | Like, what is the kind of paper
00:54:30.940 | that I could imagine writing that would be at Black Hat?
00:54:33.860 | Like a real security person would want to see,
00:54:37.420 | not because here's a fun thing
00:54:39.220 | that you can make this machine learning model do,
00:54:40.940 | but because legitimately the easiest way
00:54:42.660 | to make the bad thing happen
00:54:43.820 | is to go after the machine learning model.
00:54:45.260 | So the way we decided to do this is like,
00:54:47.380 | every time you see some new thing,
00:54:51.060 | you say, well, here are the bad things that could happen.
00:54:52.940 | I could try and do an evasion attack at test time.
00:54:54.660 | I could try and do a poisoning attack
00:54:55.820 | that made the model train on bad data.
00:54:57.020 | I could try and steal the model.
00:54:58.060 | I could try and steal the data.
00:54:59.140 | You have a list of like 10 bad things
00:55:00.540 | that you could try and make happen.
00:55:01.700 | And every time you see some new thing,
00:55:02.900 | you ask, okay, here's my list of 10 problems.
00:55:05.420 | Which of them are most important and relevant to this?
00:55:07.980 | And you just do this for every single one in the list.
00:55:10.300 | And most of the time, the answer is nothing,
00:55:12.780 | and then you get nothing out of it.
00:55:14.300 | But on occasion, you sort of figure out,
00:55:15.660 | okay, here's this new dataset.
00:55:17.140 | It is being distributed in such a way
00:55:19.020 | that anyone in the world can buy domains
00:55:21.820 | that let them inject arbitrary images into the dataset.
00:55:24.500 | There's the attack.
00:55:25.340 | And this is, I think, the way that we came to doing this
00:55:29.380 | from this motivation of let's try
00:55:30.700 | and look at some real security stuff.
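A sketch of the core observation, assuming you already have the dataset's image URLs; is_domain_available is a placeholder, since real availability checks go through a registrar or WHOIS:

    from urllib.parse import urlparse

    def find_purchasable_domains(image_urls, is_domain_available):
        # Collect the distinct domains the dataset points at, then flag the ones
        # that could simply be registered today. Anyone who buys such a domain can
        # serve arbitrary images to everyone who downloads the dataset later.
        domains = set()
        for url in image_urls:
            host = urlparse(url).hostname
            if host:
                domains.add(host)
        return sorted(d for d in domains if is_domain_available(d))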
00:55:32.700 | - I think when people think of AI security,
00:55:34.820 | they either think of jailbreaks,
00:55:37.660 | which is kind of very limited,
00:55:39.020 | or they're gonna go the broader,
00:55:40.700 | oh, is AI gonna kill us all?
00:55:42.420 | I think you've done a lot of awesome papers
00:55:44.540 | on the in-between.
00:55:45.940 | So one thing is the jailbreak.
00:55:47.540 | You've also had a paper on stealing part of a production LLM.
00:55:51.340 | You extracted the Babbage and Ada embedding projection layers
00:55:56.260 | from the OpenAI API.
00:55:58.180 | So there's even things that as a user,
00:56:00.460 | you're worried about the jailbreaks,
00:56:01.740 | but as a model provider,
00:56:03.420 | you're actually worried about the-
00:56:04.780 | - Yeah, exactly.
00:56:05.620 | This paper was, again, with the exact same motivation.
00:56:08.380 | So as some history,
00:56:09.220 | there's this field of research called model stealing.
00:56:11.460 | What it's interested in is you have your model
00:56:13.580 | that you have trained, it was very expensive.
00:56:15.140 | I want to query your model and steal a copy of the model
00:56:17.580 | so that I have your model
00:56:18.740 | without paying for the training costs.
00:56:20.980 | And we have some very nice work
00:56:22.820 | that shows that this is possible.
00:56:24.380 | Like I can steal your exact model.
00:56:26.060 | As long as your model has, let's say,
00:56:28.260 | a couple thousand neurons evaluated in float64
00:56:31.300 | with ReLU activations, fully connected networks.
00:56:34.140 | I see the full logit outputs
00:56:35.900 | and I can feed in arbitrary 64-bit floating point
00:56:37.700 | numbers as inputs.
00:56:39.220 | Each of these assumptions I just said is false in practice.
00:56:41.540 | Like none of these things are things you can really do.
00:56:43.820 | I think it's fun research.
00:56:44.980 | I mean, there's a reason the paper is at Crypto.
00:56:47.220 | Like the reason it's at Crypto
00:56:48.220 | and not like at like an actual security conference
00:56:50.260 | because like it's a very theoretical kind of thing.
00:56:52.580 | And I think it's like an important direction
00:56:54.020 | for people to think about
00:56:54.860 | because maybe you can extend these to make it be possible.
00:56:57.340 | But I also think it's worth thinking about the problem
00:56:59.060 | from the other direction.
00:56:59.980 | Like let's look at what the real models
00:57:01.260 | we have in front of us are.
00:57:02.340 | Let's see how we can make those models
00:57:04.260 | be vulnerable to stealing attacks.
00:57:05.860 | And then we can push from the other direction.
00:57:07.820 | Like let's take the most practical attacks
00:57:09.380 | and make them more powerful.
00:57:10.740 | And that's again, like what we're trying to do here.
00:57:12.180 | We sort of looked at what APIs do actually people expose
00:57:15.900 | in the biggest models.
00:57:17.100 | How can we use some of that
00:57:18.260 | to do as much stealing as we possibly can?
00:57:20.580 | Yeah, and for this, we ran the attack
00:57:22.720 | that let us steal several of OpenAI's models
00:57:25.660 | with their permission.
00:57:27.460 | You know, it's sort of, it's a fun email
00:57:29.100 | to send, you know, hello, Mr. Lawyer.
00:57:31.600 | So I'm at Google.
00:57:32.700 | You know, I first have to email the Google lawyer.
00:57:35.720 | I would like to steal OpenAI's models.
00:57:37.940 | And they say like, you know, under no circumstances.
00:57:40.380 | And you say, okay, but what if they agree to it?
00:57:42.220 | And you're like, if they agree to it, fine.
00:57:43.780 | And you said, then you say, I know some people there.
00:57:45.660 | I emailed them like, can I steal your model?
00:57:47.540 | And they're like, as long as you delete it afterwards, okay.
00:57:50.220 | And I'm like, can you get your general counsel
00:57:52.140 | to put that in writing?
00:57:52.980 | And they're like, sure.
00:57:53.980 | So like, we had all of the lawyers talk to each other.
00:57:57.660 | Everyone agreed that like, you know,
00:57:59.340 | it's important to do this.
00:58:00.160 | Like, you know, you don't want to actually, you know,
00:58:03.180 | sort of cause harm when doing security work.
00:58:05.100 | And so we got all of the, like,
00:58:07.220 | the agreements out of the way.
00:58:08.420 | And then we went and ran the attack.
00:58:10.260 | And yeah, and it worked great.
00:58:11.660 | And then we can write the paper.
00:58:13.140 | Before we put the paper online,
00:58:14.980 | we notified everyone who was vulnerable to this attack.
00:58:17.700 | Some Google models were vulnerable.
00:58:19.020 | Some OpenAI models were vulnerable.
00:58:20.880 | There were one or two other people who were vulnerable
00:58:22.860 | that we didn't name in the paper.
00:58:24.140 | We notified them all, gave them 90 days to fix it,
00:58:26.100 | which is like a standard disclosure period in security.
00:58:28.660 | They was all patched.
00:58:29.900 | You know, OpenAI got rid of some APIs.
00:58:31.940 | And then we put the paper online.
00:58:33.060 | - The fix was just don't show logits.
00:58:35.300 | - Yeah, so the fix in particular was don't show log probs
00:58:39.580 | when you supply a logit bias.
00:58:42.340 | And what you don't show is the logit bias plus the log prob,
00:58:44.460 | which is like a very narrow thing.
00:58:45.620 | They sort of did the narrow thing to prevent this.
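A sketch of the core idea behind the attack being discussed, under the assumption that the logit-bias-plus-logprob trick lets you reconstruct full logit vectors (get_full_logits stands in for that assumed primitive); because the logits are a linear function of the hidden state, the stack of logit vectors has rank roughly equal to the hidden dimension:

    import numpy as np

    def estimate_hidden_dim(get_full_logits, prompts, tol: float = 1e-3) -> int:
        # Each row is the reconstructed logit vector for one prompt.
        Q = np.stack([get_full_logits(p) for p in prompts])   # shape (n_prompts, vocab)
        # logits = W @ hidden_state, so Q has low rank: the number of
        # non-negligible singular values reveals the model's hidden size.
        singular_values = np.linalg.svd(Q, compute_uv=False)
        return int((singular_values > tol * singular_values[0]).sum())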
00:58:48.060 | Some people were unhappy, but like, this is, you know,
00:58:50.620 | this is the nature of making,
00:58:51.700 | you can have a more useful system
00:58:53.900 | or a more secure system in many ways.
00:58:55.460 | I really like this example because for a very long time,
00:58:58.940 | nothing about GPT-4 would be at all different
00:59:01.760 | if the field, like the entire field
00:59:03.420 | of adversarial machine learning disappeared.
00:59:04.860 | Like everything to do with adversarial examples,
00:59:06.500 | like all of, like for the most part,
00:59:08.260 | like GPT-4 would exist identically.
00:59:10.540 | This is not true in other fields in, you know,
00:59:12.980 | in system security.
00:59:13.900 | Like the way we design our processors today
00:59:16.100 | is fundamentally different
00:59:17.180 | because of the security attacks that we've had in the past.
00:59:19.620 | You know, the way we design databases,
00:59:20.940 | the way we design the internet is fundamentally different
00:59:22.860 | because of the way the attacks that we have.
00:59:24.740 | And what that means is it means that the attacks
00:59:26.220 | that we had were so compelling to the non-security people
00:59:29.340 | that they were willing to change
00:59:30.620 | and make their systems less useful
00:59:33.180 | in order to make the security better.
00:59:34.620 | In adversarial machine learning, we didn't have this.
00:59:36.060 | We didn't have attacks that were useful enough
00:59:37.540 | that you could show it to someone
00:59:38.980 | who actually designed a real system.
00:59:41.100 | And they'd be willing to say,
00:59:42.100 | I am going to make my system less useful
00:59:43.580 | because the attack that you've presented to me
00:59:44.920 | is so compelling that I will break
00:59:46.640 | the functionality of my system.
00:59:47.860 | And this is one of the first cases I think
00:59:49.220 | that we were able to show this is someone,
00:59:51.300 | we had an attack that someone said,
00:59:52.420 | I agree, this attack is sufficiently bad
00:59:54.320 | that I will break utility in order to prevent this attack.
00:59:56.660 | And I would like to see more of these kinds of attacks,
00:59:59.620 | not because I want things to be worse,
01:00:01.740 | but because I want to be sure
01:00:03.300 | that we have exhausted the space of possible attacks
01:00:05.980 | so that it's not going to be the case
01:00:07.700 | that someone else comes up with a very bad thing
01:00:10.160 | that like they're not going to disclose,
01:00:12.360 | sit on for, you know, a couple months
01:00:14.420 | and then go and bang on everything
01:00:16.540 | and see what they can hit.
01:00:17.580 | And this is the hope of doing this research direction.
01:00:20.220 | - I want to spell it out for people
01:00:21.280 | who are maybe not so specialized in this.
01:00:23.260 | Your attack could potentially steal
01:00:25.620 | the entire projection matrix.
01:00:27.140 | - Yeah, so a model has many layers.
01:00:29.620 | We pick one of the layers
01:00:30.720 | and we show how to steal that layer.
01:00:32.740 | - And then just scaling it up, you can steal the others.
01:00:35.700 | - For this attack, I do not know.
01:00:37.120 | - Yeah, okay.
01:00:37.960 | - So this is the important detail.
01:00:39.860 | We only steal one in the attack that as we present it,
01:00:42.860 | we only know how to steal one layer.
01:00:44.300 | For the other research we have done in the past,
01:00:47.120 | we have shown how after stealing one layer,
01:00:49.060 | you can then extend it to the second layer
01:00:50.820 | and then the second to the third and third to the fourth.
01:00:52.420 | And you can do this like arbitrarily deep.
01:00:54.220 | And we have done this in the past,
01:00:56.400 | but that made ridiculous assumptions.
01:00:58.100 | And what we're trying to do now is similar kind of thing,
01:01:00.900 | but let's make less ridiculous assumptions.
01:01:02.980 | - Yeah, it's kind of like in security,
01:01:04.520 | how you have like privilege escalation.
01:01:06.140 | Once you're in the system, you can escalate.
01:01:08.500 | - Yeah, that's the hope.
01:01:09.340 | And so like the reason why we want to write
01:01:11.100 | these kinds of papers is to say,
01:01:13.880 | let's always know what the best attack is.
01:01:15.420 | Let's have the best attack be public
01:01:17.340 | so that people can at least prevent
01:01:18.740 | what the best is that is known right now.
01:01:21.300 | And if someone else were to discover a stronger variant,
01:01:23.860 | I would hope that they would take a similar approach,
01:01:25.740 | let everyone know how to patch it,
01:01:27.300 | patch the thing, release it to everyone and go from there.
01:01:29.280 | - We do also serve people building on top of models.
01:01:31.900 | And one thing that I think people are interested in
01:01:33.680 | is prompt injections, prompt security, that kind of stuff.
01:01:37.500 | I feel like the relevant version of your thing
01:01:40.380 | is can I steal the RAG corpus
01:01:42.560 | that might be proprietary to a company?
01:01:45.500 | I don't know if you've heard.
01:01:46.420 | - No, this is a very good question.
01:01:48.580 | Yeah, so there's two kinds of stealing.
01:01:50.740 | There's model stealing and there's data stealing.
01:01:52.500 | Data stealing is exactly this kind of question.
01:01:55.260 | And I think this is a very good question.
01:01:57.340 | In many ways, the answer is yes.
01:01:59.880 | Even without rag, you can often steal data
01:02:02.060 | that the model was trained on.
01:02:03.360 | So we've done some work where we have trained a model,
01:02:06.540 | or we have shown that for production models,
01:02:08.220 | okay, in this case, in the most extreme variant,
01:02:10.620 | we showed a way to recover training data from GPT-3.5 Turbo.
01:02:15.380 | Yeah, one of my co-authors, Milad,
01:02:17.580 | was working on some other random experiments
01:02:19.260 | and he figured out that if you prompt ChatGPT
01:02:23.520 | to repeat a word forever,
01:02:25.500 | then it will repeat the word many, many, many times in a row
01:02:28.220 | and then explode and just start doing random stuff.
01:02:31.780 | And when it was doing random stuff,
01:02:33.260 | maybe a small percent of the time,
01:02:34.540 | maybe 2% of the time,
01:02:35.700 | it would just repeat training data back to you,
01:02:37.540 | which is very confusing.
01:02:39.060 | But this is a thing that happened
01:02:41.940 | and was an exciting kind of thing.
01:02:43.620 | And we've seen this in the past, yeah.
01:02:45.220 | - Do we know, is it exactly the training data
01:02:47.940 | or is it something that looks like the training data?
01:02:49.300 | - Identical to the training data.
01:02:50.860 | - Because it cannot memorize.
01:02:52.340 | It doesn't have the weights to memorize all the training.
01:02:54.580 | - No, no, it can't memorize all the training data.
01:02:55.780 | No, definitely.
01:02:56.620 | But it can memorize some of it.
01:02:58.260 | How am I so certain?
01:02:59.540 | We found text that was on the internet,
01:03:01.460 | 10 terabytes of data.
01:03:02.700 | And what I can say is that the output of the model
01:03:05.240 | was a verbatim, at least 50-word-in-a-row match
01:03:09.960 | to some other document
01:03:11.280 | that appeared on the internet previously.
01:03:13.000 | So there's two possible explanations for this.
01:03:14.780 | One is the model happened to come up
01:03:17.040 | with the same 50-word-in-a-row sequence
01:03:19.200 | that existed on the internet previously.
01:03:21.600 | In principle, this is possible, or it memorized it.
01:03:24.800 | And for some of them, we have several hundred words
01:03:26.860 | in a row where the probability is astronomically low.
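A naive, in-memory sketch of the verbatim-match check being described; the real analysis used index structures over terabytes of web text rather than a Python set:

    def shingles(text: str, n: int = 50):
        words = text.split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def verbatim_overlaps(model_output: str, corpus_docs, n: int = 50):
        # Flag any n-word-in-a-row window of the model's output that also appears
        # verbatim in some document from the reference corpus.
        corpus_index = set()
        for doc in corpus_docs:
            corpus_index |= shingles(doc, n)
        return sorted(shingles(model_output, n) & corpus_index)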
01:03:30.400 | - So you also have a blog post about why I attack.
01:03:33.560 | Last week, we did a man versus machine event
01:03:35.920 | at Black Hat with our friend, H.D. Moore.
01:03:38.720 | It was basically like an AI CTF.
01:03:40.640 | And then Vijay, who was the CISO of DeepMind,
01:03:42.840 | he also came to the award ceremony.
01:03:44.700 | And I was talking to him.
01:03:45.640 | I told him, "We're gonna interview you."
01:03:47.400 | And he was like, "You should ask Carlini
01:03:49.200 | "why he does not want to build the fences."
01:03:51.800 | And so he told me to ask you that.
01:03:54.000 | So I'll just open the floor to you now to answer.
01:03:56.320 | - You asked his boss for a question.
01:03:57.720 | (both laughing)
01:04:00.680 | - Yeah, okay, no.
01:04:01.520 | So, okay, this is a good question.
01:04:03.260 | There are a couple of reasons.
01:04:05.360 | The most basic level,
01:04:06.720 | I attack things because I think it's fun.
01:04:08.500 | I feel like people should do things
01:04:09.720 | that they find are interesting in the world.
01:04:11.760 | I also think that it's important to attack things
01:04:14.680 | because you don't know what's secure
01:04:15.960 | unless you know what the best attacks are.
01:04:17.520 | And so it's worth having what the best attacks are
01:04:19.440 | in order to be able to discover what is secure.
01:04:21.840 | People then say, both of these things are true,
01:04:23.840 | and yet you should still build the defenses.
01:04:25.400 | You know, I have gotten this a lot through my career.
01:04:28.900 | And it is possible that I would be able
01:04:30.280 | to construct the fences.
01:04:31.640 | On rare occasions, I have helped write papers
01:04:33.640 | that have defenses.
01:04:34.880 | I just don't find it very fun.
01:04:36.360 | I have a hard time motivating myself to work on it.
01:04:39.040 | And I think this is very important
01:04:41.480 | because let's suppose that you decide,
01:04:43.920 | okay, I am going to be a person
01:04:45.440 | who is going to try and do maximal good in the world.
01:04:48.000 | Presumably, there are jobs you could take
01:04:50.500 | that would like save more lives
01:04:52.520 | than what you're doing right now.
01:04:53.840 | But if you would wake up every day hating your life,
01:04:57.840 | it is very unlikely you would do an actually good job.
01:05:00.560 | You know, like I could sort of switch now to be a doctor
01:05:03.380 | or, you know, to do elderly care or something like this.
01:05:06.560 | But someone who actually went into it
01:05:08.040 | for the right motivations is going to do so much better
01:05:11.260 | than if I just decided, like, I am going to be a robot,
01:05:14.200 | I'm going to ignore what I actually enjoy,
01:05:15.680 | and I'm going to do the things that are,
01:05:18.880 | someone else has described objectively
01:05:20.920 | as like better for the world.
01:05:22.560 | I don't actually think that you would do that good
01:05:25.640 | because you're not gonna wake up every morning being like,
01:05:28.080 | I'm excited to solve this problem.
01:05:30.040 | You'll do your job from nine to five,
01:05:31.900 | and you'll go home and work on what you actually find fun.
01:05:33.960 | And a big part of doing high quality work
01:05:37.200 | is actually being willing to think
01:05:39.720 | about these kinds of problems all the time.
01:05:42.560 | And whenever like a new thing comes up,
01:05:43.960 | like you want to do the thing,
01:05:46.040 | you want to like be like, I have to go to sleep now,
01:05:48.840 | even though I want to be working on this problem.
01:05:49.960 | Like you will do better work in the grand scheme of things
01:05:52.720 | if you sort of look at the product of, you know,
01:05:56.040 | how valuable the thing is multiplied
01:05:57.480 | by how much you're gonna actually be able to do for it.
01:05:59.240 | And there are some, lots of things
01:06:00.320 | that are very high impact that like,
01:06:02.560 | you are just not the right person to solve.
01:06:04.400 | And I feel like that's the case for me for defenses
01:06:06.560 | is I really just don't care.
01:06:08.520 | Like, I just like, it's not interesting to me.
01:06:10.720 | I don't know why.
01:06:11.800 | I've tried in order to graduate,
01:06:13.840 | my thesis had to have a piece of it, which was a defense.
01:06:16.480 | And so it's there, but like that last, you know,
01:06:19.680 | a little while, I was just, I was not having a good time.
01:06:21.880 | Like I, it's there, like it didn't become a paper.
01:06:24.760 | It's like a chapter in my thesis until I had my PhD.
01:06:26.800 | But like, it's not like a thing that like actually
01:06:29.240 | motivated me to like be excited by the thing.
01:06:31.960 | And so I think maybe some people can get motivated
01:06:35.960 | on the work on things that like are really important
01:06:38.360 | and then they should do that.
01:06:40.480 | But I feel like if there are things in the world
01:06:42.640 | that in principle you could do more good,
01:06:45.680 | but like you're just not the right person for them,
01:06:48.240 | you will likely end up doing less good
01:06:50.520 | because you will not actually be able to do as much
01:06:53.280 | as you really could have if you had tried to do better.
01:06:55.800 | - Awesome, anything else we missed?
01:06:57.760 | Any underrated work that you really want people
01:07:00.800 | to check out, anything?
01:07:03.080 | - I mean, no, I mean like, yeah,
01:07:04.640 | I tend to do a fairly broad set of things.
01:07:06.760 | So anything you have missed, almost certainly yes.
01:07:08.880 | Anything that's particularly important
01:07:10.040 | that you have missed, probably not.
01:07:11.480 | I feel like, you know, just it's,
01:07:13.200 | I think people should work on more fun things.
01:07:15.000 | - Thank you so much for coming on.
01:07:16.240 | - Yeah, thank you.
01:07:17.240 | (upbeat music)
01:07:19.840 | (upbeat music)
01:07:22.440 | (upbeat music)
01:07:25.000 | (upbeat music)