Personal benchmarks vs HumanEval - with Nicholas Carlini of DeepMind
Chapters
0:00 Introductions
1:14 Why Nicholas writes
2:09 The Game of Life
5:07 "How I Use AI" blog post origin story
8:24 Do we need software engineering agents?
11:03 Using AI to kickstart a project
14:08 Ephemeral software
17:37 Using AI to accelerate research
21:34 Experts vs non-expert users as beneficiaries of AI
24:02 Research on generating less secure code with LLMs
27:22 Learning and explaining code with AI
30:12 AGI speculations?
32:50 Distributing content without social media
35:39 How much data do you think you can put on a single piece of paper?
37:37 Building personal AI benchmarks
43:04 Evolution of prompt engineering and its relevance
46:06 Model vs task benchmarking
52:14 Poisoning LAION-400M through expired domains
55:38 Stealing OpenAI models from their API
1:01:29 Data stealing and recovering training data from models
1:03:30 Finding motivation in your work
and I'm joined by my co-host, Swyx, founder of Smol AI. 00:00:13.000 |
- Hey, and today we're in the in-person studio, 00:00:32.800 |
And mostly we're here to talk about your blogs, 00:00:35.780 |
because you are so generous in just writing up what you know. 00:00:41.880 |
I feel like it's fun to share what you've done. 00:00:51.600 |
I was terrible at writing when I was younger. 00:01:02.220 |
but I feel like it is useful to share what you're doing, 00:01:05.480 |
and I like being able to talk about the things 00:01:12.300 |
not because I enjoy the act of writing, but yeah. 00:01:14.600 |
- It's a tool for thought, as they often say. 00:01:19.160 |
or thing that people should know about you as a person, 00:01:22.920 |
- Yeah, so I tend to focus on, like you said, 00:01:29.440 |
and I want to do, like, high-quality security research, 00:01:32.680 |
and that's mostly what I spend my actual time 00:01:36.560 |
trying to be productive members of society doing that. 00:01:49.760 |
sort of things that have absolutely no utility, 00:01:56.520 |
you should work on fun things that just are interesting, 00:02:03.600 |
is after I have completed something I think is fun, 00:02:09.480 |
- Before we go into, like, AI, LLMs, and whatnot, 00:02:14.240 |
So you built multiplexing circuits in the game of life, 00:02:22.160 |
And then how do you go from just clicking boxes 00:02:33.880 |
a computer that can run anything, essentially. 00:02:41.240 |
where you have cells that are either on or off, 00:02:43.360 |
and a cell becomes on if, in the previous generation, 00:02:45.680 |
some configuration holds true and off otherwise. 00:03:01.240 |
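For reference, the update rule being described is Conway's Game of Life. A minimal sketch of one generation in Python, assuming the standard B3/S23 birth/survival counts (which the conversation doesn't spell out):

```python
from collections import Counter

# One generation of Conway's Game of Life, assuming the usual B3/S23 rule.
def step(live_cells):
    """Advance one generation. live_cells is a set of (x, y) cells that are on."""
    # Count live neighbors for every cell adjacent to a live cell.
    neighbor_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live_cells
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # A cell is on next generation if it has 3 live neighbors,
    # or 2 live neighbors and was already on.
    return {
        cell for cell, n in neighbor_counts.items()
        if n == 3 or (n == 2 and cell in live_cells)
    }

# A glider: after 4 steps it has moved one cell diagonally.
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
for _ in range(4):
    glider = step(glider)
```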
And some other people have done some similar things, 00:03:07.360 |
like, we already know it's possible in theory. 00:03:08.960 |
I want to try and, like, actually make something 00:03:10.280 |
I can run on my computer, like, a real computer I can run. 00:03:13.400 |
And so, yeah, I've been going down this rabbit hole 00:03:20.120 |
and I have been making some reasonable progress there. 00:03:25.120 |
is just, like, a very fun trap you can go down. 00:03:33.560 |
if you call into printf, it's Turing complete. 00:03:36.840 |
Like, printf, you know, like, which, like, you know, 00:03:41.960 |
- There is, because printf has a %n specifier 00:03:45.840 |
that lets you write an arbitrary amount of data 00:03:53.080 |
into an arbitrary location in memory, including the loop counter that printf keeps in memory. 00:03:58.920 |
So you can change the index of where printf is currently indexing, using %n. 00:04:02.760 |
So you can get loops, you can get conditionals, 00:04:06.640 |
So we sort of have another Turing complete language 00:04:13.160 |
but, like, it's just, I feel like a lot of people 00:04:25.240 |
- Need a little sass with the boys, as they say. 00:04:27.240 |
- Yeah, and I want to still have joy in doing these things. 00:04:33.560 |
productive, meaningful things and just, like, 00:04:39.480 |
- Awesome, and you've been kind of like a pioneer 00:04:43.800 |
You've done a lot of talks starting back in 2018. 00:05:00.400 |
And we were like, "We should get Carlini on the podcast." 00:05:04.840 |
- Yeah, and then I sent you an email and you're like, 00:05:07.240 |
And I was like, "Oh, I thought that would be harder." 00:05:09.720 |
So I think there's, as you said in the blog post, 00:05:15.480 |
can actually be used for, what are they useful at, 00:05:21.000 |
what they're not good at, because they're obviously not. 00:05:28.520 |
So how painful was it to write such a long post, 00:05:31.160 |
given that you just said that you don't like to write? 00:05:33.680 |
And then we can kind of run through the things, 00:05:40.080 |
So I wanted to do this because I feel like most people 00:05:42.120 |
who write about language models being good or bad, 00:05:46.800 |
they have their camp, and their camp is like, 00:05:58.800 |
So I've read a lot of things where people say 00:06:05.480 |
And I've read a lot of things where people who say 00:06:17.520 |
And I don't really agree with either of these. 00:06:19.760 |
And I'm not someone who cares really one way or the other 00:06:25.240 |
And so I wanted to write something that just says like, 00:06:29.960 |
and what we can actually do with these things. 00:06:32.280 |
Because my actual research is in like security 00:06:35.680 |
and showing that these models have lots of problems. 00:06:38.800 |
Like this is like, my day-to-day job is saying like, 00:06:40.760 |
we probably shouldn't be using these in lots of cases. 00:06:43.080 |
I thought I could have a little bit of credibility 00:06:49.400 |
We maybe shouldn't be deploying them in lots of situations. 00:06:54.560 |
And that is the like, the bit that I wanted to get across 00:06:58.400 |
is to say, I'm not here to try and sell you on anything. 00:07:08.160 |
And it turned out that a lot more people liked it 00:07:20.880 |
Maybe we can just kind of run through them all. 00:07:22.720 |
And then maybe the ones where you have extra commentary 00:07:35.160 |
definitely less than 10 hours putting this together. 00:07:38.560 |
- Wow, it took me close to that to do a podcast episode. 00:07:55.240 |
And so because of this, the way I tend to treat this 00:08:00.560 |
and then put it on the internet and then never change it. 00:08:03.000 |
And I guess this is an aspect of the research side of me 00:08:07.320 |
it is done, it is an artifact, it exists in the world. 00:08:09.640 |
I could forever edit the very first thing I ever put 00:08:12.480 |
to make it the most perfect version of what it is. 00:08:16.360 |
And so I feel like, I find it useful to be like, 00:08:18.880 |
I will spend some certain amount of hours on it, 00:08:26.080 |
We just recorded an episode with the founder of Cosine, 00:08:28.600 |
which is like an AI software engineer colleague. 00:08:31.680 |
You said it took you 30,000 words to get GPT-4 00:08:35.520 |
to build you the "Can GPT-4 solve this?" kind of app. 00:08:39.360 |
Where are we in the spectrum where ChatGPT is all you need 00:08:42.480 |
to actually build something versus I need a full-on agent 00:08:50.600 |
that was just like a fun demo where you can guess 00:08:53.480 |
if you can predict whether or not GPT-4 at the time 00:08:58.000 |
This is, as far as web apps go, very straightforward. 00:09:11.720 |
is not because they want to see my wonderful HTML, right? 00:09:14.960 |
Like, you know, I used to know how to do like modern HTML, 00:09:26.720 |
I have no longer had to build any web app stuff 00:09:32.640 |
but I don't know any of the new, Flexbox is new to me. 00:09:39.680 |
being able to go to the model and just say like, 00:09:49.600 |
And it doesn't do anything that's complicated right now, 00:09:53.920 |
but it gets you to the point where the only remaining work 00:09:57.400 |
that needs to be done is the interesting hard part for me, 00:10:04.420 |
are entirely good enough at doing this kind of thing, 00:10:08.320 |
It may be the case that if you had something like, 00:10:21.200 |
And that's what I do is, you know, you run it 00:10:25.560 |
and either I'll fix the code or it will give me buggy code 00:10:29.080 |
And I'll just copy and paste the error message and say, 00:10:40.400 |
already understand that things on the internet, 00:10:43.840 |
And so there's not like, this is not like a big mental shift 00:10:47.080 |
you have to go through to understand that it is possible 00:10:52.400 |
even if it is not completely perfect in its output. 00:11:02.920 |
And there's maybe a couple that tie together. 00:11:10.600 |
maybe like a project that, you know, the LLM cannot solve. 00:11:15.120 |
- Yeah, so like for getting started on things 00:11:17.560 |
is one of the cases where I think it's really great 00:11:24.400 |
help me use this technology I've never used before. 00:11:27.120 |
So for example, I had never used Docker before January. 00:11:34.160 |
Like I sort of, I have read lots of papers on, you know, 00:11:37.880 |
on all the technology behind how these things work. 00:11:46.960 |
so that I could run the outputs of language model stuff 00:12:02.960 |
I do not know what this word Docker Compose is. 00:12:07.320 |
And like, it'll sort of tell me all of these things. 00:12:11.000 |
Like, this is not some groundbreaking thing that I'm doing, 00:12:19.120 |
And I didn't want to learn Docker from first principles. 00:12:22.640 |
Like, at some point, if I need it, I can do that. 00:12:25.000 |
Like, I have the background that I can make that happen. 00:12:30.680 |
And it's very easy to get bogged down in the details 00:12:32.640 |
of this other thing that helps you accomplish your end goal. 00:12:34.920 |
And I just wanted, like, tell me enough about Docker 00:12:38.360 |
And I can check that it's doing the safe thing. 00:12:40.760 |
I sort of know enough about that from my other background. 00:12:44.440 |
And so I can just have the model help teach me 00:12:46.920 |
exactly the one thing I want to know and nothing more. 00:12:53.400 |
Like, I can just like stop the conversation and say, 00:12:59.760 |
It would have taken me, you know, several hours 00:13:01.640 |
to figure out some things that take 10 minutes 00:13:06.120 |
- Have you had any issues with like newer tools? 00:13:08.640 |
Have you felt any meaningful kind of like a cutoff day 00:13:11.600 |
where like there's not enough data on the internet or? 00:13:16.160 |
But I tend to just not use most of these things. 00:13:19.600 |
Like, I feel like this is like the significant way 00:13:35.720 |
where they have their own proprietary legacy code base 00:13:38.000 |
of a hundred million lines of code or whatever. 00:13:39.680 |
And like, you just might not be able to use things 00:13:42.960 |
I still think there are lots of use cases there 00:13:46.120 |
that are not the same ones that I've put down. 00:13:48.280 |
But I wanted to talk about what I have personal experience 00:13:53.800 |
if someone who is in one of these environments 00:13:58.120 |
in which they find current models useful to them 00:14:13.040 |
because they often fear being attacked on the internet. 00:14:16.480 |
But you are the ultimate authority on how you use things 00:14:37.920 |
I don't think I have taken as much advantage of it 00:14:47.840 |
- Yeah, no, I do think that this is a direction 00:14:53.440 |
that was like a lot of the ways that I use these models 00:14:55.200 |
are for one-off things that I just need to happen 00:15:19.360 |
well, I didn't actually need the answer that badly 00:15:21.680 |
Like either I can decide to dedicate the 45 minutes 00:15:23.640 |
or I cannot, but the cost of doing it is fairly low. 00:15:31.160 |
if you're getting the answer you want always, 00:15:32.880 |
it means you're not asking them hard enough questions. 00:15:45.520 |
And if you're finding that when you're using these, 00:15:47.440 |
it's always giving you the answer that you want, 00:15:52.880 |
And so I oftentimes try when I have something 00:15:55.160 |
that I'm curious about to just feed into the model 00:15:57.680 |
and be like, well, maybe it's to solve my problem for me. 00:16:05.160 |
you know, a couple hours that it's been great 00:16:11.200 |
to verify whether or not the answer is correct 00:16:15.760 |
well, that's just, you're entirely misguided. 00:16:21.360 |
- Even for non-tech, I had to fix my irrigation system. 00:16:28.920 |
And it's like, oh yeah, that's like the RT900. 00:16:57.360 |
Do you have a mental model to just think about 00:16:59.280 |
how long it should live for and like anything like that? 00:17:02.920 |
- I don't think I have anything interesting to say here, no. 00:17:05.440 |
I will take whatever tools are available in front of me 00:17:08.520 |
and try and see if I can use them in meaningful ways. 00:17:14.880 |
that I'm very excited about seeing all of these people 00:17:26.920 |
- What's the most underrated thing in the list? 00:17:32.760 |
or maybe is there something that you forgot to add 00:17:42.840 |
and go, I understand how this solved my problem. 00:17:56.160 |
So for example, one of the things that I use it a lot for 00:18:09.800 |
Because, you know, like I got my machine in a state 00:18:15.920 |
some other thing, the versions were mismatched. 00:18:35.760 |
I want everything that I said to like have evidence 00:18:47.080 |
I used a model to solve this very complicated task. 00:19:00.720 |
that other people could have verified by themselves. 00:19:04.600 |
that I wish I maybe had said a little bit more about, 00:19:07.760 |
and just stated that the way that this is done. 00:19:21.440 |
the uninteresting parts of problems for me right now, 00:19:33.640 |
Therefore, the model is not going to be helpful 00:19:35.720 |
in doing new research or like discovering new things. 00:19:39.040 |
And as someone whose day job is to do new things, 00:19:45.040 |
literally no one else in the world has ever done before. 00:19:47.480 |
So like, this is what I do like every single day. 00:19:57.360 |
and then a little bit of something that was new. 00:20:03.720 |
is something that's been done many, many times before. 00:20:08.600 |
Even if the thing that I'm doing as a whole is new, 00:20:13.560 |
that the small pieces that build up to it are not. 00:20:20.640 |
I feel like expect that they can either solve 00:20:27.040 |
even when doing something very new and very hard, 00:20:37.120 |
like you're currently trying to solve some problem 00:20:43.720 |
You have to go look up something online, whatever it is. 00:20:51.320 |
about being distracted is you're solving some hard problem 00:20:53.720 |
and you realize you need a helper function that does X. 00:21:22.800 |
you can just ask the model, please solve this problem for me. 00:21:26.680 |
You can check that it works very, very quickly. 00:21:34.400 |
- And in terms of this concept of expert users 00:21:37.320 |
versus non-expert users, floors versus ceilings, 00:21:42.320 |
that basically it actually is more beneficial 00:21:48.520 |
Let me give you the argument for both of these. 00:21:51.680 |
- So I can only speak on the expert user behalf 00:21:53.000 |
because I've been doing computers for a long time. 00:21:54.960 |
And so, yeah, the cases where it's useful for me 00:21:56.440 |
are exactly these cases where I can check the output. 00:22:02.920 |
I can check every single thing that the model's doing 00:22:10.680 |
But I also see a world in which this could be very useful 00:22:13.360 |
for the kinds of people who do not have this knowledge 00:22:16.760 |
with caveats, because I'm not one of the people 00:22:20.240 |
But one of these big ways that I can see this 00:22:22.960 |
is for things that you can check fairly easily, 00:22:29.320 |
or have written a program themselves to do a certain task 00:22:32.080 |
could just ask for the program that does the thing. 00:22:34.480 |
And you know, some of the times it won't get it right, 00:22:39.080 |
and they'll be able to have the thing in front of them 00:22:44.080 |
And we see a lot of people trying to do applications 00:22:56.040 |
and various things, and other people who don't, 00:23:04.480 |
And this is a case where you could have a model 00:23:11.000 |
And as long as the person is rigorous in testing 00:23:13.080 |
that the solution does actually the correct thing, 00:23:14.760 |
this is the part that I'm worried about most. 00:23:22.200 |
like you probably shouldn't trust these models 00:23:34.720 |
but I'm worried that it might end up in a world 00:23:44.720 |
and just break everything because everything is terrible. 00:23:52.360 |
it is possible that these could be very useful. 00:23:57.200 |
that shows that when people use LLMs to generate code, 00:24:03.920 |
There are a bunch of papers that touch on exactly this. 00:24:08.040 |
- My slight issue is, is there an agenda here? 00:24:19.040 |
Yeah, he and some students have some things on this. 00:24:27.480 |
and I sort of trust them to have done the right thing. 00:24:31.440 |
I also think, even on this though, we have to be careful 00:24:35.960 |
whenever someone says X is true about language models, 00:24:38.200 |
you should always append the suffix for current models 00:24:43.320 |
I was one of the people who was very much on the opinion 00:24:47.480 |
and are gonna have absolutely no practical utility. 00:24:49.480 |
And if you had asked me this, let's say in 2020, 00:25:00.200 |
I still would have told you these things are toys. 00:25:12.960 |
It's like they're trying to make some analogies 00:25:17.160 |
It's just like, I don't even care to read it. 00:25:19.000 |
I saw what it was about and just didn't even look at it. 00:25:36.760 |
I want them to at least know what is true about the world 00:25:39.080 |
so that they can then see that maybe they should reconsider 00:25:44.960 |
that may just not be true about today's models. 00:25:47.440 |
- Specifically, because you brought up spreadsheets, 00:25:51.240 |
because I think Google's done a really good job 00:25:56.160 |
Gemini is integrated inside of Google Sheets 00:26:09.880 |
And so I just don't write formulas manually anymore. 00:26:12.720 |
I just prompt Gemini to do it for me and it does it. 00:26:15.600 |
- Yeah, one of the problems that these machine learning 00:26:29.800 |
More of these things, it would be good for them to exist. 00:26:32.760 |
I want them to exist in ways that we can actually make sure 00:26:43.880 |
I feel like lots of people, there are far too many. 00:26:47.440 |
X plus AI, where X is like arbitrary thing in the world 00:26:53.960 |
And they're just doing it because they want to use the word. 00:27:02.000 |
Yes, I do not want my fridge on the internet. 00:27:05.360 |
Okay, anyway, let's not go down that rabbit hole. 00:27:08.520 |
because people want to sell things and whatever. 00:27:12.600 |
and then they write off everything as a result of it. 00:27:14.560 |
And I just want to say, there are allowed to be people 00:27:17.720 |
who are trying to do things that don't make any sense. 00:27:25.480 |
So both explaining code, being a API reference, 00:27:34.080 |
I feel like, you know, one thing is like generate code 00:27:38.160 |
One way is like, just tell me about this technology. 00:27:40.880 |
Another thing is like, hey, I read this online. 00:27:44.560 |
Any best practices on getting the most out of it or? 00:27:47.680 |
- Yeah, I don't know if I have best practices. 00:27:58.440 |
but I have never used them in this way before. 00:28:03.600 |
And so yeah, as an API reference is a great example. 00:28:06.240 |
You know, the tool everyone always picks on is like FFmpeg. 00:28:09.960 |
No one in the world knows the command line arguments 00:28:16.360 |
You know, I want lower bit rate, like dash V, you know, 00:28:20.280 |
but like once you tell me what the answer is, 00:28:22.400 |
Like this is one of the things where it's great 00:28:38.760 |
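As a concrete example of the kind of answer being asked for here, lowering a video's bitrate with FFmpeg might look like the following sketch; the `-b:v` flag, the 1 Mbit/s target, and the filenames are illustrative assumptions rather than anything stated in the conversation:

```python
import subprocess

# Re-encode a video at a lower video bitrate. The -b:v flag sets the target
# video bitrate; the 1M value and the filenames are assumptions for illustration.
subprocess.run(
    ["ffmpeg", "-i", "input.mp4", "-b:v", "1M", "output.mp4"],
    check=True,
)
```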
which pieces of the code are actually important. 00:28:43.680 |
isn't actually anything to do with security. 00:28:43.680 |
And like, you just want to ignore all of that. 00:28:49.200 |
and not just figure out what's going on perfectly. 00:29:08.200 |
Like you already know that when you're gonna read 00:29:10.840 |
these things, what you're going to try and do 00:29:16.600 |
This is a great way of just doing that, but faster, 00:29:19.040 |
because it will abstract most of what is right. 00:29:21.840 |
It's gonna be wrong some of the time, I don't care. 00:29:27.000 |
And so like one of the particular use cases I have 00:29:32.080 |
where, you know, oftentimes people will release a binary, 00:29:38.920 |
And so one thing you could do is you could try 00:29:42.440 |
It turns out for the thing that I wanted, none existed. 00:29:44.680 |
And so like I spent too many hours doing it by hand 00:29:48.120 |
before I first thought, you know, like, why am I doing this? 00:29:50.320 |
I should just check if the model can do it for me. 00:29:54.980 |
which is impossible for any human to understand, 00:29:56.880 |
into the Python code that is entirely reasonable 00:29:59.600 |
And, you know, it doesn't run, it has a bunch of problems, 00:30:06.320 |
where I should be looking and then spend all of my time 00:30:15.840 |
And especially for you as a security researcher 00:30:23.440 |
I do think we want to sort of move to your other blog posts, 00:30:27.960 |
you ended your post with a little bit of a teaser 00:30:35.480 |
and I will do that at some point when I have time, 00:30:37.840 |
maybe after I'm done writing my current papers 00:30:40.720 |
where I want to talk about some thoughts I have 00:30:44.000 |
for where language models are going in the near-term future. 00:30:48.040 |
is because, again, I feel like the discussion 00:30:50.280 |
tends to be people who are either very much AGI by 2027, 00:30:57.560 |
- Yes, or are going to make statements of the form, 00:31:05.060 |
and we should be doing something else instead. 00:31:06.760 |
And again, I feel like people tend to look at this 00:31:12.160 |
well, those obviously are both very far extremes. 00:31:22.860 |
Just saying, you know, I have wide margins of error. 00:31:27.160 |
If you would say there's a 0% chance that something, 00:31:30.000 |
you know, the models will get very, very good 00:31:31.680 |
in the next five years, you're probably wrong. 00:31:35.040 |
that in the next five years, then you're probably wrong. 00:31:43.280 |
But it's very hard to get clicks on the internet 00:31:45.360 |
of like, some things may be good in the future. 00:31:48.440 |
Like, everyone wants like, you know, a very like, 00:32:13.760 |
the safety and security things as a result of this. 00:32:28.040 |
and can solve, you know, tasks completely autonomously, 00:32:31.840 |
that's a very different security world to be living in 00:32:35.540 |
And the types of security questions I would want to ask 00:32:38.840 |
And so I think, you know, in some very large parts, 00:32:49.360 |
- You mentioned getting clicks on the internet, 00:32:50.960 |
but you don't even have like an X account or anything. 00:32:54.800 |
What's the, what's your distribution strategy? 00:33:00.960 |
Nicholas Carlini brought this, like what's his handle? 00:33:07.560 |
- So I have an RSS feed and an email list, and that's it. 00:33:14.840 |
I feel like, on principle, I feel like they have some harms. 00:33:18.000 |
As a person, I have a problem when people say things 00:33:22.280 |
and I would get nothing done if I were to have a Twitter. 00:33:25.080 |
I would spend all of my time correcting people 00:33:55.240 |
I don't need to be someone who has to have this other thing 00:33:59.200 |
And so I feel like I can just say what I want to say, 00:34:02.000 |
and if people find it useful, then they'll share it widely. 00:34:05.920 |
I wrote a thing, whatever, sometime late last year 00:34:09.360 |
about how to recover data off of an Apple Profile Drive 00:34:17.080 |
This probably got, I think, 1,000x less views than this, 00:34:24.840 |
that I actually care about, which is my research. 00:34:26.760 |
I would care much more if that didn't get seen. 00:34:30.600 |
because I have some thoughts that I just want to put down. 00:34:35.600 |
and authenticity that is sadly lacking sometimes 00:34:38.880 |
in modern discourse that makes it attractive. 00:34:42.120 |
And I think now you have a little bit of a brand 00:34:44.160 |
of you are an independent thinker, writer, person 00:34:52.400 |
- Yeah, this kind of worries me a little bit. 00:34:57.360 |
which is entirely unrelated, I don't want people- 00:35:00.560 |
- You should actually just throw people off right now. 00:35:05.960 |
So the last two or three things I've done in a row 00:35:07.720 |
have been actually things that people should care about. 00:35:10.600 |
So I have a couple of things I'm trying to figure out. 00:35:12.200 |
Which one do I put online to just cull the list 00:35:20.160 |
What you're here for is whatever I want to talk about. 00:35:24.160 |
This is not what I want out of my personal website. 00:35:27.480 |
- So here's top 10 enemies or something like that. 00:35:30.600 |
What's the next project you're going to work on 00:35:32.360 |
that is completely unrelated to research LLMs? 00:35:35.640 |
Or what games do you want to port into the browser next? 00:35:39.120 |
- Okay, yeah, so maybe, okay, here's a fun question. 00:35:47.320 |
- I mean, you can think about bits and atoms. 00:35:53.120 |
How much data can you put on a piece of paper? 00:36:03.260 |
- I'll just throw out there, like 10 megabytes. 00:36:12.680 |
So I have a thing that does about a megabyte. 00:36:22.420 |
This is supposed to be the title at some point, 00:36:25.460 |
- Yeah, so this is a little hard because, you know, 00:36:27.280 |
so you can do the math and you get 8 1/2 by 11. 00:36:38.240 |
you need to be able to recover up to like 90 plus percent, 00:36:41.480 |
like 95%, like 99 point something percent accuracy 00:36:44.840 |
in order to be able to actually decode this off the paper. 00:36:47.420 |
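The back-of-the-envelope version of that math, under assumed numbers (a 300 DPI printer, one bit per dot, and roughly 20% of capacity given up to error correction; none of these figures come from the conversation), looks like this:

```python
# Rough capacity estimate for data printed on a US Letter (8.5" x 11") page.
dpi = 300                        # assumed print/scan resolution
width_in, height_in = 8.5, 11.0  # page size mentioned above
ecc_overhead = 0.20              # assumed fraction spent on error correction

raw_bits = width_in * dpi * height_in * dpi  # one bit per printable dot
usable_bytes = raw_bits * (1 - ecc_overhead) / 8

print(f"raw: {raw_bits / 8 / 1e6:.2f} MB, usable: {usable_bytes / 1e6:.2f} MB")
# -> raw: 1.05 MB, usable: 0.84 MB, consistent with the "about a megabyte"
#    figure mentioned above.
```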
This is one of the things that I'm considering. 00:36:50.360 |
I need to like get a couple more things working for this 00:36:52.840 |
where, you know, again, I'm running to some random problems, 00:37:02.800 |
People try and write the most obfuscated C code that they can, 00:37:12.120 |
I have a very fun gate level emulation of an old CPU 00:37:33.080 |
I would squeeze in really, really small pieces 00:37:36.920 |
- Okay, we are also going to talk about your benchmarking 00:37:51.640 |
- Okay, benchmarks tell you how well the model solves 00:38:07.920 |
because people tried to make models classify digits 00:38:24.680 |
And yet, like, this is what drove a lot of progress. 00:38:29.520 |
because they want to just measure progress in some way. 00:38:36.240 |
and we will measure progress on this benchmark, 00:38:38.280 |
not because we care about the problem per se, 00:38:41.520 |
is in some way correlated with making better models. 00:38:44.160 |
And this is fine when you don't want to actually use 00:38:48.000 |
But when you want to actually make use of them, 00:38:56.360 |
is that there would be model after model after model 00:38:58.720 |
that was being released that would find some benchmark 00:39:07.960 |
to know whether or not I should then switch to it. 00:39:10.280 |
So the argument that I tried to lay out in this post 00:39:17.840 |
And so what I did is I wrote a domain-specific language 00:39:25.040 |
that you have wanted models to solve for you, 00:39:32.600 |
you benchmark the model on the things that you care about, 00:39:36.760 |
because you've actually asked for those answers before. 00:39:49.280 |
does this solve the kinds of things that I care about? 00:39:56.880 |
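The shape of such a personal benchmark can be very small. A sketch of the idea (the `query_model` function and the two example tests are hypothetical stand-ins, not the actual DSL being described):

```python
# A tiny personal benchmark: questions you have actually asked a model before,
# each paired with a cheap programmatic check of the answer.

def query_model(prompt: str) -> str:
    raise NotImplementedError("call whatever model/API you actually use here")

PERSONAL_TESTS = [
    # (prompt, checker) pairs pulled from your own chat history.
    ("Give me a git command to undo the last commit but keep the changes.",
     lambda out: "git reset" in out),
    ("Write a Python one-liner that reverses the words in a string s.",
     lambda out: "split" in out and "[::-1]" in out),
]

def run_personal_benchmark():
    passed = 0
    for prompt, check in PERSONAL_TESTS:
        try:
            if check(query_model(prompt)):
                passed += 1
        except Exception:
            pass  # any crash counts as a failure
    print(f"{passed}/{len(PERSONAL_TESTS)} personal tests passed")
```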
I don't want to say that existing benchmarks are not useful. 00:40:03.760 |
But in many cases, what that's designed to measure 00:40:06.760 |
is not actually the thing that I want to use it for. 00:40:08.640 |
And I would expect that the way that I want to use it 00:40:10.600 |
is different than the way that you want to use it. 00:40:20.840 |
good at some benchmark to make it good at that benchmark. 00:40:23.520 |
You can sort of like find the distribution of data 00:40:31.440 |
And by having a benchmark that is not very popular, 00:40:37.320 |
that no one has tried to optimize their model 00:40:41.120 |
- So publishing your benchmark is a little bit-- 00:40:47.440 |
was not that people would use mine as theirs. 00:40:50.680 |
My hope in doing this was that people would say-- 00:40:57.760 |
a very small fraction of people, 0.1% of people 00:41:00.120 |
who made a benchmark that was useful for them, 00:41:01.980 |
this would still be hundreds of new benchmarks 00:41:18.220 |
is people just do this vibes-based evaluation thing 00:41:23.580 |
and you see if it worked on the kinds of things 00:41:37.220 |
- Yeah, I like the idea of going through your chat history 00:41:42.420 |
I regret to say that I don't think my chat history 00:41:44.840 |
is used as much these days because I'm using cursor, 00:41:54.680 |
now that you've written the "How I Use AI" post, 00:41:59.440 |
are you able to translate all these things to evals? 00:42:17.920 |
than you might have thought would be possible 00:42:19.820 |
if you do a little bit of work on the backend. 00:42:22.220 |
So for example, all of the code that I have the model write, 00:42:31.780 |
and the language model judges whether the output was correct. 00:42:57.420 |
they're very good at being able to tell this. 00:43:04.140 |
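A sketch of that evaluation loop: run the model-written program, then let a second model grade its output. Here the container is simplified to a bare subprocess with a timeout, and `judge_model` is a hypothetical callable that returns the judge's text:

```python
import subprocess, sys, tempfile

def judge_says_correct(judge_model, task: str, program_output: str) -> bool:
    """Ask a judge model whether the program output solves the task."""
    verdict = judge_model(
        f"Task: {task}\n\nProgram output:\n{program_output}\n\n"
        "Reply YES if the output correctly solves the task, otherwise NO."
    )
    return verdict.strip().upper().startswith("YES")

def eval_generated_code(task: str, code: str, judge_model) -> bool:
    # NOTE: the setup described above runs model-written code inside a Docker
    # container; a plain subprocess with a timeout is only a simplification.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, timeout=30
    )
    return judge_says_correct(judge_model, task, result.stdout)
```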
- You complained about prompting and being lazy 00:43:14.060 |
How do you see the evolution of like prompt engineering? 00:43:17.980 |
maybe, you know, it was kind of like really hot 00:43:19.900 |
and people wanted to like build companies around it. 00:43:21.660 |
Today, it's like the models are getting good. 00:43:33.380 |
that like, you know, calls back to the model again. 00:43:43.040 |
is just to say, oftentimes when I use a model 00:44:06.260 |
such that it gives me the answer that's correct. 00:44:18.380 |
And so oftentimes, you know what I do is like, 00:44:20.140 |
I just dump in whatever current thought that I have 00:44:28.500 |
maybe the model was right to give me the wrong answer. 00:44:34.420 |
And so like, I just want to sort of get this as a thing. 00:44:41.660 |
that always goes into all the models or something. 00:44:49.940 |
Maybe it turns out that as models get better, 00:44:51.420 |
you don't need to prompt them as much in this way. 00:44:54.020 |
I just want to use the things that are in front of me. 00:45:00.420 |
you're using the prompt to kind of like steer it 00:45:04.420 |
to actually not make the prompt really relevant 00:45:08.900 |
- I mean, you could fine tune it into the model, 00:45:12.860 |
I mean, it seems like some models have done this, 00:45:19.580 |
they'll say, "Let's think through this step-by-step." 00:45:21.540 |
And then they'll go through the step-by-step answer. 00:45:23.820 |
Two years ago, I would have had to prompt it. 00:45:25.980 |
Think step-by-step on solving the following thing. 00:45:27.900 |
Now you ask them the question and the model says, 00:45:43.340 |
- For listeners, that would be Orca and Agent Instruct, 00:45:49.260 |
- Does few-shot, is it included in the lazy prompting? 00:45:57.140 |
- I don't because usually when I want the answer, 00:46:08.260 |
testing the ultimate capability level of the model 00:46:10.740 |
and testing the thing that I'm doing with it. 00:46:15.620 |
Because there are almost certainly better ways 00:46:18.020 |
and sort of really see how good the model is. 00:46:24.780 |
And so I'm entirely fine with people doing fancy prompting 00:46:27.380 |
to show me what the true capability level could be. 00:46:31.100 |
what the ultimate level of the model could be. 00:46:35.820 |
how good the model is if you don't do fancy things. 00:46:47.420 |
they'll do like five shots, 25 shots, 50 shots. 00:46:56.460 |
the problem is everyone wants to get state-of-the-art 00:47:01.940 |
so that you get state-of-the-art on the benchmark. 00:47:06.700 |
Like it's good to know the model can do this thing 00:47:16.460 |
And I could get there if I was willing to work hard enough. 00:47:20.380 |
and figure out how to ask the model the question? 00:47:23.020 |
And for me, I have programmed for many, many, many years. 00:47:26.260 |
It's often just faster for me just to do the thing 00:47:28.460 |
than to like figure out the incantation to ask the model. 00:47:31.300 |
But I can imagine someone who has never programmed before 00:47:34.380 |
might be fine writing five paragraphs in English, 00:47:53.340 |
and most eval paradigms, I'm not picking on you, 00:47:55.940 |
is that we're actually training these things for chat, 00:47:59.940 |
And you actually obviously reveal much more information 00:48:05.180 |
in sort of like a tree search branching sort of way. 00:48:10.980 |
Where the vast majority of prompts are single question, 00:48:15.300 |
But actually the way that we use chat things, 00:48:18.380 |
in the way, even in the stuff that you posted 00:48:31.780 |
I might be writing a paper on this at some point 00:48:35.340 |
A couple of the evals in the benchmark thing I have 00:48:40.580 |
I have a 20 question eval there, just for fun. 00:48:48.700 |
figure out how to cherry pick off this other branch 00:48:53.340 |
I basically build a tiny little agency thing. 00:49:03.740 |
I run whatever the model told me the output to do is. 00:49:12.860 |
that it is correctly cherry picked in this way? 00:49:22.060 |
Like it's more challenging to do this kind of evaluation. 00:49:28.580 |
so that people could come up with these evals 00:49:31.580 |
that more closely measure what they're actually doing. 00:49:39.340 |
And you mentioned how like nobody uses this thing anymore. 00:49:48.540 |
do you figure out how to like fine tune the model? 00:49:53.940 |
put together some examples or would you just say, 00:49:55.820 |
hey, the model just doesn't do it, whatever, move on? 00:50:07.420 |
in like the mid '90s, early '90s or something, 00:50:11.220 |
when UU encoding was actually a thing that people would do. 00:50:18.740 |
And like it was doing it correctly for like 99% of cases. 00:50:33.020 |
that if you really cared about this task being solved well, 00:50:37.020 |
But again, this is one of these kinds of tasks 00:50:44.900 |
that it gets me 90% of the way there, good, like done. 00:50:47.660 |
Like I can sort of have fun for a couple hours 00:50:50.900 |
I was not like, if I ever had to train a thing for this, 00:50:54.020 |
And so it did well enough for me that I could move on. 00:50:57.500 |
- It does give me an idea for adversarial examples 00:51:00.740 |
inside of a benchmark that are basically canaries 00:51:05.100 |
Typically right now, benchmarks have canary strings, 00:51:07.340 |
or if you ask it to repeat back the string and it does, 00:51:10.620 |
But you know, it's easy to filter out those things. 00:51:16.700 |
and if it gives you the intentionally wrong answer, 00:51:19.900 |
- Yeah, there are actually a couple of papers 00:51:26.860 |
So the field of work called membership inference, 00:51:33.940 |
There's a field called like dataset inference. 00:51:54.300 |
that the specific questions happen to be in matters. 00:51:59.260 |
because the order of the questions is arbitrary 00:52:01.500 |
There are a number of papers that follow up on this 00:52:03.700 |
I think this is a great way of doing this now. 00:52:07.540 |
included some canary questions in their benchmarks, 00:52:10.780 |
you can already sort of start getting at this now. 00:52:29.260 |
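One simple version of that ordering test can be sketched as follows; `sequence_logprob`, a function returning the model's total log-probability for a piece of text, is a hypothetical stand-in for whatever scoring access you have:

```python
import random

def benchmark_order_test(questions, sequence_logprob, trials=100, seed=0):
    """Dataset-inference-style check: the published order of a benchmark's
    questions is arbitrary, so a model that strongly prefers it over random
    shuffles has probably seen the benchmark during training."""
    rng = random.Random(seed)
    published_score = sequence_logprob("\n".join(questions))
    wins = 0
    for _ in range(trials):
        shuffled = list(questions)
        rng.shuffle(shuffled)
        if published_score > sequence_logprob("\n".join(shuffled)):
            wins += 1
    # Fraction of shuffles the published order beats; near 1.0 is suspicious.
    return wins / trials
```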
LAION-400M is one of the biggest image datasets 00:52:33.300 |
And a lot of the image gets pulled from live domains. 00:52:38.340 |
- Every image gets pulled from a live domain, yes. 00:52:42.900 |
So then you went on and you bought the domains 00:52:57.100 |
We talked before about low background tokens. 00:53:01.420 |
you can imagine most things you get from the internet, 00:53:06.540 |
After 2021, you can imagine most things written 00:53:12.780 |
or like maybe give more of the LAION background. 00:53:41.980 |
that like assume an adversary could do the following 00:53:53.180 |
like what are you gonna do as a security researcher? 00:53:58.060 |
because eventually someone's gonna use these things 00:54:05.300 |
And then you write a paper that sort of looks at this. 00:54:07.860 |
And then maybe it turns out that some of these 00:54:11.700 |
So this has happened for quite some long time. 00:54:14.820 |
And the thing that bothered me is it seems like 00:54:19.620 |
and try and actually start studying real problems. 00:54:21.860 |
So we very deliberately started looking like, 00:54:24.740 |
what are the problems that actually arise in real systems 00:54:30.940 |
that I could imagine writing that would be at Black Hat? 00:54:33.860 |
Like a real security person would want to see, 00:54:39.220 |
that you can make this machine learning model do, 00:54:51.060 |
you say, well, here are the bad things that could happen. 00:54:52.940 |
I could try and do an evasion attack at test time. 00:55:02.900 |
you ask, okay, here's my list of 10 problems. 00:55:05.420 |
Which of them are most important and relevant to this? 00:55:07.980 |
And you just do this for every single one in the list. 00:55:21.820 |
that let them inject arbitrary images into the dataset. 00:55:25.340 |
And this is, I think, the way that we came to doing this 00:55:47.540 |
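The measurement behind that attack is conceptually simple: group the dataset's image URLs by domain and see how many images an attacker would control by registering any domain that has since expired. A sketch (the example URLs are placeholders, and the check for whether a domain is actually purchasable is left out because it depends on a registrar):

```python
from collections import Counter
from urllib.parse import urlparse

def images_per_domain(image_urls):
    """Count how many dataset images are served from each domain. Any of these
    domains that has expired could be bought, letting the buyer decide what
    those URLs return the next time someone downloads the dataset."""
    return Counter(urlparse(url).netloc for url in image_urls).most_common()

# Placeholder URLs for illustration (not real LAION-400M entries).
example_urls = [
    "http://img.defunct-store.example/products/001.jpg",
    "http://img.defunct-store.example/products/002.jpg",
    "http://photos.old-blog.example/cat.png",
]
print(images_per_domain(example_urls))
# [('img.defunct-store.example', 2), ('photos.old-blog.example', 1)]
```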
You've also had a paper on stealing part of a production LLM. 00:55:51.340 |
You extracted the Babbage and Ada embedding projection layers 00:56:05.620 |
This paper was, again, with the exact same motivation. 00:56:09.220 |
there's this field of research called model stealing. 00:56:11.460 |
What it's interested in is you have your model 00:56:13.580 |
that you have trained, it was very expensive. 00:56:15.140 |
I want to query your model and steal a copy of the model 00:56:28.260 |
a couple thousand neurons evaluated in Float64 00:56:31.300 |
with ReLU-only activations, fully connected networks. 00:56:39.220 |
Each of these assumptions I just said is false in practice. 00:56:41.540 |
Like none of these things are things you can really do. 00:56:44.980 |
I mean, there's a reason the paper is at Crypto. 00:56:48.220 |
and not like at like an actual security conference 00:56:50.260 |
because like it's a very theoretical kind of thing. 00:56:54.860 |
because maybe you can extend these to make it be possible. 00:56:57.340 |
But I also think it's worth thinking about the problem 00:57:05.860 |
And then we can push from the other direction. 00:57:10.740 |
And that's again, like what we're trying to do here. 00:57:12.180 |
We sort of looked at what APIs do actually people expose 00:57:32.700 |
You know, I first have to email the Google lawyer. 00:57:37.940 |
And they say like, you know, under no circumstances. 00:57:40.380 |
And you say, okay, but what if they agree to it? 00:57:43.780 |
And you said, then you say, I know some people there. 00:57:47.540 |
And they're like, as long as you delete it afterwards, okay. 00:57:50.220 |
And I'm like, can you get your general counsel 00:57:53.980 |
So like, we had all of the lawyers talk to each other. 00:58:00.160 |
Like, you know, you don't want to actually, you know, 00:58:14.980 |
we notified everyone who was vulnerable to this attack. 00:58:20.880 |
There were one or two other people who were vulnerable 00:58:24.140 |
We notified them all, gave them 90 days to fix it, 00:58:26.100 |
which is like a standard disclosure period in security. 00:58:35.300 |
- Yeah, so the fix in particular was don't show log probs 00:58:42.340 |
And what you don't show is the logit bias plus the log prob, 00:58:45.620 |
They sort of did the narrow thing to prevent this. 00:58:48.060 |
Some people were unhappy, but like, this is, you know, 00:58:55.460 |
I really like this example because for a very long time, 00:58:58.940 |
nothing about GPT-4 would be at all different 00:59:10.540 |
This is not true in other fields in, you know, 00:59:17.180 |
because of the security attacks that we've had in the past. 00:59:20.940 |
the way we design the internet is fundamentally different 00:59:24.740 |
And what that means is it means that the attacks 00:59:26.220 |
that we had were so compelling to the non-security people 00:59:34.620 |
In adversarial machine learning, we didn't have this. 00:59:36.060 |
We didn't have attacks that were useful enough 00:59:43.580 |
because the attack that you've presented to me 00:59:54.320 |
that I will break utility in order to prevent this attack. 00:59:56.660 |
And I would like to see more of these kinds of attacks, 01:00:03.300 |
that we have exhausted the space of possible attacks 01:00:07.700 |
that someone else comes up with a very bad thing 01:00:17.580 |
And this is the hope of doing this research direction. 01:00:32.740 |
- And then just scaling it up, you can steal the others. 01:00:39.860 |
We only steal one in the attack that as we present it, 01:00:44.300 |
For the other research we have done in the past, 01:00:50.820 |
and then the second to the third and third to the fourth. 01:00:58.100 |
And what we're trying to do now is similar kind of thing, 01:01:21.300 |
And if someone else were to discover a stronger variant, 01:01:23.860 |
I would hope that they would take a similar approach, 01:01:27.300 |
patch the thing, release it to everyone and go from there. 01:01:29.280 |
- We do also serve people building on top of models. 01:01:31.900 |
And one thing that I think people are interested in 01:01:33.680 |
is prompt injections, prompt security, that kind of stuff. 01:01:37.500 |
I feel like the relevant version of your thing 01:01:50.740 |
There's model stealing and there's data stealing. 01:01:52.500 |
Data stealing is exactly this kind of question. 01:02:03.360 |
So we've done some work where we have trained a model, 01:02:08.220 |
okay, in this case, in the most extreme variant, 01:02:10.620 |
we showed a way to recover training data from GPT 3.5 Turbo. 01:02:19.260 |
and he figured out that if you prompt ChatGPT 01:02:25.500 |
then it will repeat the word many, many, many times in a row 01:02:28.220 |
and then explode and just start doing random stuff. 01:02:35.700 |
it would just repeat training data back to you, 01:02:45.220 |
- Do we know, is it exactly the training data 01:02:47.940 |
or is it something that looks like the training data? 01:02:52.340 |
It doesn't have the weights to memorize all the training. 01:02:54.580 |
- No, no, it can't memorize all the training data. 01:03:02.700 |
And what I can say is that the output of the model 01:03:05.240 |
was a verbatim, at least 50-word-in-a-row match 01:03:13.000 |
So there's two possible explanations for this. 01:03:21.600 |
In principle, this is possible, or it memorized it. 01:03:24.800 |
And for some of them, we have several hundred words 01:03:26.860 |
in a row where the probability is astronomically low. 01:03:30.400 |
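The verbatim-match check itself is conceptually simple; a sketch (building the index of 50-word shingles over a web-scale training corpus is the expensive part and is elided here):

```python
def word_shingles(text, n=50):
    """All n-word windows of a text, joined back into strings."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(0, len(words) - n + 1))}

def has_verbatim_match(model_output, corpus_shingles, n=50):
    """True if any n-word window of the model's output appears verbatim in the
    corpus index (a set of n-word shingles built offline from web-scale text)."""
    return any(s in corpus_shingles for s in word_shingles(model_output, n))
```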
- So you also have a blog post about why I attack. 01:03:40.640 |
And then Vijay, who was the CISO of DeepMind, 01:03:54.000 |
So I'll just open the floor to you now to answer. 01:04:11.760 |
I also think that it's important to attack things 01:04:17.520 |
And so it's worth having what the best attacks are 01:04:19.440 |
in order to be able to discover what is secure. 01:04:21.840 |
People then say, both of these things are true, 01:04:25.400 |
You know, I have gotten this a lot through my career. 01:04:31.640 |
On rare occasions, I have helped write papers 01:04:36.360 |
I have a hard time motivating myself to work on it. 01:04:45.440 |
who is going to try and do maximal good in the world. 01:04:53.840 |
But if you would wake up every day hating your life, 01:04:57.840 |
it is very unlikely you would do an actually good job. 01:05:00.560 |
You know, like I could sort of switch now to be a doctor 01:05:03.380 |
or, you know, to do elderly care or something like this. 01:05:08.040 |
for the right motivations is going to do so much better 01:05:11.260 |
than if I just decided, like, I am going to be a robot, 01:05:22.560 |
I don't actually think that you would do that good 01:05:25.640 |
because you're not gonna wake up every morning being like, 01:05:31.900 |
and you'll go home and work on what you actually find fun. 01:05:46.040 |
you want to like be like, I have to go to sleep now, 01:05:48.840 |
even though I want to be working on this problem. 01:05:49.960 |
Like you will do better work in the grand scheme of things 01:05:52.720 |
if you sort of look at the product of, you know, 01:05:57.480 |
by how much you're gonna actually be able to do for it. 01:06:04.400 |
And I feel like that's the case for me for defenses 01:06:08.520 |
Like, I just like, it's not interesting to me. 01:06:13.840 |
my thesis had to have a piece of it, which was a defense. 01:06:16.480 |
And so it's there, but like that last, you know, 01:06:19.680 |
a little while, I was just, I was not having a good time. 01:06:21.880 |
Like I, it's there, like it didn't become a paper. 01:06:24.760 |
It's like a chapter in my thesis until I had my PhD. 01:06:26.800 |
But like, it's not like a thing that like actually 01:06:29.240 |
motivated me to like be excited by the thing. 01:06:31.960 |
And so I think maybe some people can get motivated 01:06:35.960 |
on the work on things that like are really important 01:06:40.480 |
But I feel like if there are things in the world 01:06:45.680 |
but like you're just not the right person for them, 01:06:50.520 |
because you will not actually be able to do as much 01:06:53.280 |
as you really could have if you had tried to do better. 01:06:57.760 |
Any underrated work that you really want people 01:07:06.760 |
So anything you have missed, almost certainly yes. 01:07:13.200 |
I think people should work on more fun things.