Personal benchmarks vs HumanEval - with Nicholas Carlini of DeepMind
Chapters
0:00 Introductions
1:14 Why Nicholas writes
2:09 The Game of Life
5:07 "How I Use AI" blog post origin story
8:24 Do we need software engineering agents?
11:03 Using AI to kickstart a project
14:08 Ephemeral software
17:37 Using AI to accelerate research
21:34 Experts vs non-expert users as beneficiaries of AI
24:02 Research on generating less secure code with LLMs
27:22 Learning and explaining code with AI
30:12 AGI speculations?
32:50 Distributing content without social media
35:39 How much data do you think you can put on a single piece of paper?
37:37 Building personal AI benchmarks
43:04 Evolution of prompt engineering and its relevance
46:06 Model vs task benchmarking
52:14 Poisoning LAION-400M through expired domains
55:38 Stealing OpenAI models from their API
1:01:29 Data stealing and recovering training data from models
1:03:30 Finding motivation in your work
and I'm joined by my co-host, Swyx, founder of Smol AI. 00:00:13.000 |
- Hey, and today we're in the in-person studio, 00:00:32.800 |
And mostly we're here to talk about your blogs, 00:00:35.780 |
because you are so generous in just writing up what you know. 00:00:41.880 |
I feel like it's fun to share what you've done. 00:00:51.600 |
I was terrible at writing when I was younger. 00:01:02.220 |
but I feel like it is useful to share what you're doing, 00:01:05.480 |
and I like being able to talk about the things 00:01:12.300 |
not because I enjoy the act of writing, but yeah. 00:01:14.600 |
- It's a tool for thought, as they often say. 00:01:19.160 |
or thing that people should know about you as a person, 00:01:22.920 |
- Yeah, so I tend to focus on, like you said, 00:01:29.440 |
and I want to do, like, high-quality security research, 00:01:32.680 |
and that's mostly what I spend my actual time 00:01:36.560 |
trying to be productive members of society doing that. 00:01:49.760 |
sort of things that have absolutely no utility, 00:01:56.520 |
you should work on fun things that just are interesting, 00:02:03.600 |
is after I have completed something I think is fun, 00:02:09.480 |
- Before we go into, like, AI, LLMs, and whatnot, 00:02:14.240 |
So you built multiplexing circuits in the game of life, 00:02:22.160 |
And then how do you go from just clicking boxes 00:02:33.880 |
a computer that can run anything, essentially. 00:02:41.240 |
where you have cells that are either on or off, 00:02:43.360 |
and a cell becomes on if, in the previous generation, 00:02:45.680 |
some configuration holds true and off otherwise. 00:03:01.240 |
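For reference, the update rule being described is Conway's Game of Life. A minimal sketch of one generation in Python, assuming the standard B3/S23 birth/survival counts (which the conversation doesn't spell out):

```python
from collections import Counter

# One generation of Conway's Game of Life, assuming the usual B3/S23 rule.
def step(live_cells):
    """Advance one generation. live_cells is a set of (x, y) cells that are on."""
    # Count live neighbors for every cell adjacent to a live cell.
    neighbor_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live_cells
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # A cell is on next generation if it has 3 live neighbors,
    # or 2 live neighbors and was already on.
    return {
        cell for cell, n in neighbor_counts.items()
        if n == 3 or (n == 2 and cell in live_cells)
    }

# A glider: after 4 steps it has moved one cell diagonally.
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
for _ in range(4):
    glider = step(glider)
```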
And some other people have done some similar things, 00:03:07.360 |
like, we already know it's possible in theory. 00:03:08.960 |
I want to try and, like, actually make something 00:03:10.280 |
I can run on my computer, like, a real computer I can run. 00:03:13.400 |
And so, yeah, I've been going down this rabbit hole 00:03:20.120 |
and I have been making some reasonable progress there. 00:03:25.120 |
is just, like, a very fun trap you can go down. 00:03:33.560 |
if you call into printf, it's Turing complete. 00:03:36.840 |
Like, printf, you know, like, which, like, you know, 00:03:41.960 |
- There is, because printf has a %n specifier 00:03:45.840 |
that lets you write an arbitrary amount of data 00:03:53.080 |
into an arbitrary location in memory, including the loop counter that printf keeps in memory. 00:03:58.920 |
So you can change the index of where printf is currently indexing, using %n. 00:04:02.760 |
So you can get loops, you can get conditionals, 00:04:06.640 |
So we sort of have another Turing complete language 00:04:13.160 |
but, like, it's just, I feel like a lot of people 00:04:25.240 |
- Need a little sass with the boys, as they say. 00:04:27.240 |
- Yeah, and I want to still have joy in doing these things. 00:04:33.560 |
productive, meaningful things and just, like, 00:04:39.480 |
- Awesome, and you've been kind of like a pioneer 00:04:43.800 |
You've done a lot of talks starting back in 2018. 00:05:00.400 |
And we were like, "We should get Carlini on the podcast." 00:05:04.840 |
- Yeah, and then I sent you an email and you're like, 00:05:07.240 |
And I was like, "Oh, I thought that would be harder." 00:05:09.720 |
So I think there's, as you said in the blog post, 00:05:15.480 |
can actually be used for, what are they useful at, 00:05:21.000 |
what they're not good at, because they're obviously not. 00:05:28.520 |
So how painful was it to write such a long post, 00:05:31.160 |
given that you just said that you don't like to write? 00:05:33.680 |
And then we can kind of run through the things, 00:05:40.080 |
So I wanted to do this because I feel like most people 00:05:42.120 |
who write about language models being good or bad, 00:05:46.800 |
they have their camp, and their camp is like, 00:05:58.800 |
So I've read a lot of things where people say 00:06:05.480 |
And I've read a lot of things where people who say 00:06:17.520 |
And I don't really agree with either of these. 00:06:19.760 |
And I'm not someone who cares really one way or the other 00:06:25.240 |
And so I wanted to write something that just says like, 00:06:29.960 |
and what we can actually do with these things. 00:06:32.280 |
Because my actual research is in like security 00:06:35.680 |
and showing that these models have lots of problems. 00:06:38.800 |
Like this is like, my day-to-day job is saying like, 00:06:40.760 |
we probably shouldn't be using these in lots of cases. 00:06:43.080 |
I thought I could have a little bit of credibility 00:06:49.400 |
We maybe shouldn't be deploying them in lots of situations. 00:06:54.560 |
And that is the like, the bit that I wanted to get across 00:06:58.400 |
is to say, I'm not here to try and sell you on anything. 00:07:08.160 |
And it turned out that a lot more people liked it 00:07:20.880 |
Maybe we can just kind of run through them all. 00:07:22.720 |
And then maybe the ones where you have extra commentary 00:07:35.160 |
definitely less than 10 hours putting this together. 00:07:38.560 |
- Wow, it took me close to that to do a podcast episode. 00:07:55.240 |
And so because of this, the way I tend to treat this 00:08:00.560 |
and then put it on the internet and then never change it. 00:08:03.000 |
And I guess this is an aspect of the research side of me 00:08:07.320 |
it is done, it is an artifact, it exists in the world. 00:08:09.640 |
I could forever edit the very first thing I ever put 00:08:12.480 |
to make it the most perfect version of what it is. 00:08:16.360 |
And so I feel like, I find it useful to be like, 00:08:18.880 |
I will spend some certain amount of hours on it, 00:08:26.080 |
We just recorded an episode with the founder of Cosine, 00:08:28.600 |
which is like an AI software engineer colleague. 00:08:31.680 |
You said it took you 30,000 words to get GPT-4 00:08:35.520 |
to build you the "Can GPT-4 solve this?" kind of app. 00:08:39.360 |
Where are we in the spectrum where ChatGPT is all you need 00:08:42.480 |
to actually build something versus I need a full-on agent 00:08:50.600 |
that was just like a fun demo where you can guess 00:08:53.480 |
if you can predict whether or not GPT-4 at the time 00:08:58.000 |
This is, as far as web apps go, very straightforward. 00:09:11.720 |
is not because they want to see my wonderful HTML, right? 00:09:14.960 |
Like, you know, I used to know how to do like modern HTML, 00:09:26.720 |
I have no longer had to build any web app stuff 00:09:32.640 |
but I don't know any of the new, Flexbox is new to me. 00:09:39.680 |
being able to go to the model and just say like, 00:09:49.600 |
And it doesn't do anything that's complicated right now, 00:09:53.920 |
but it gets you to the point where the only remaining work 00:09:57.400 |
that needs to be done is the interesting hard part for me, 00:10:04.420 |
are entirely good enough at doing this kind of thing, 00:10:08.320 |
It may be the case that if you had something like, 00:10:21.200 |
And that's what I do is, you know, you run it 00:10:25.560 |
and either I'll fix the code or it will give me buggy code 00:10:29.080 |
And I'll just copy and paste the error message and say, 00:10:40.400 |
already understand that things on the internet, 00:10:43.840 |
And so there's not like, this is not like a big mental shift 00:10:47.080 |
you have to go through to understand that it is possible 00:10:52.400 |
even if it is not completely perfect in its output. 00:11:02.920 |
And there's maybe a couple that tie together. 00:11:10.600 |
maybe like a project that, you know, the LLM cannot solve. 00:11:15.120 |
- Yeah, so like for getting started on things 00:11:17.560 |
is one of the cases where I think it's really great 00:11:24.400 |
help me use this technology I've never used before. 00:11:27.120 |
So for example, I had never used Docker before January. 00:11:34.160 |
Like I sort of, I have read lots of papers on, you know, 00:11:37.880 |
on all the technology behind how these things work. 00:11:46.960 |
so that I could run the outputs of language model stuff 00:12:02.960 |
I do not know what this word Docker Compose is. 00:12:07.320 |
And like, it'll sort of tell me all of these things. 00:12:11.000 |
Like, this is not some groundbreaking thing that I'm doing, 00:12:19.120 |
And I didn't want to learn Docker from first principles. 00:12:22.640 |
Like, at some point, if I need it, I can do that. 00:12:25.000 |
Like, I have the background that I can make that happen. 00:12:30.680 |
And it's very easy to get bogged down in the details 00:12:32.640 |
of this other thing that helps you accomplish your end goal. 00:12:34.920 |
And I just wanted, like, tell me enough about Docker 00:12:38.360 |
And I can check that it's doing the safe thing. 00:12:40.760 |
I sort of know enough about that from my other background. 00:12:44.440 |
And so I can just have the model help teach me 00:12:46.920 |
exactly the one thing I want to know and nothing more. 00:12:53.400 |
Like, I can just like stop the conversation and say, 00:12:59.760 |
It would have taken me, you know, several hours 00:13:01.640 |
to figure out some things that take 10 minutes 00:13:06.120 |
- Have you had any issues with like newer tools? 00:13:08.640 |
Have you felt any meaningful kind of like a cutoff day 00:13:11.600 |
where like there's not enough data on the internet or? 00:13:16.160 |
But I tend to just not use most of these things. 00:13:19.600 |
Like, I feel like this is like the significant way 00:13:35.720 |
where they have their own proprietary legacy code base 00:13:38.000 |
of a hundred million lines of code or whatever. 00:13:39.680 |
And like, you just might not be able to use things 00:13:42.960 |
I still think there are lots of use cases there 00:13:46.120 |
that are not the same ones that I've put down. 00:13:48.280 |
But I wanted to talk about what I have personal experience 00:13:53.800 |
if someone who is in one of these environments 00:13:58.120 |
in which they find current models useful to them 00:14:13.040 |
because they often fear being attacked on the internet. 00:14:16.480 |
But you are the ultimate authority on how you use things 00:14:37.920 |
I don't think I have taken as much advantage of it 00:14:47.840 |
- Yeah, no, I do think that this is a direction 00:14:53.440 |
that was like a lot of the ways that I use these models 00:14:55.200 |
are for one-off things that I just need to happen 00:15:19.360 |
well, I didn't actually need the answer that badly 00:15:21.680 |
Like either I can decide to dedicate the 45 minutes 00:15:23.640 |
or I cannot, but the cost of doing it is fairly low. 00:15:31.160 |
if you're getting the answer you want always, 00:15:32.880 |
it means you're not asking them hard enough questions. 00:15:45.520 |
And if you're finding that when you're using these, 00:15:47.440 |
it's always giving you the answer that you want, 00:15:52.880 |
And so I oftentimes try when I have something 00:15:55.160 |
that I'm curious about to just feed into the model 00:15:57.680 |
and be like, well, maybe it's to solve my problem for me. 00:16:05.160 |
you know, a couple hours that it's been great 00:16:11.200 |
to verify whether or not the answer is correct 00:16:15.760 |
well, that's just, you're entirely misguided. 00:16:21.360 |
- Even for non-tech, I had to fix my irrigation system. 00:16:28.920 |
And it's like, oh yeah, that's like the RT900. 00:16:57.360 |
Do you have a mental model to just think about 00:16:59.280 |
how long it should live for and like anything like that? 00:17:02.920 |
- I don't think I have anything interesting to say here, no. 00:17:05.440 |
I will take whatever tools are available in front of me 00:17:08.520 |
and try and see if I can use them in meaningful ways. 00:17:14.880 |
that I'm very excited about seeing all of these people 00:17:26.920 |
- What's the most underrated thing in the list? 00:17:32.760 |
or maybe is there something that you forgot to add 00:17:42.840 |
and go, I understand how this solved my problem. 00:17:56.160 |
So for example, one of the things that I use it a lot for 00:18:09.800 |
Because, you know, like I got my machine in a state 00:18:15.920 |
some other thing, the versions were mismatched. 00:18:35.760 |
I want everything that I said to like have evidence 00:18:47.080 |
I used a model to solve this very complicated task. 00:19:00.720 |
that other people could have verified by themselves. 00:19:04.600 |
that I wish I maybe had said a little bit more about, 00:19:07.760 |
and just stated that the way that this is done. 00:19:21.440 |
the uninteresting parts of problems for me right now, 00:19:33.640 |
Therefore, the model is not going to be helpful 00:19:35.720 |
in doing new research or like discovering new things. 00:19:39.040 |
And as someone whose day job is to do new things, 00:19:45.040 |
literally no one else in the world has ever done before. 00:19:47.480 |
So like, this is what I do like every single day. 00:19:57.360 |
and then a little bit of something that was new. 00:20:03.720 |
is something that's been done many, many times before. 00:20:08.600 |
Even if the thing that I'm doing as a whole is new, 00:20:13.560 |
that the small pieces that build up to it are not. 00:20:20.640 |
I feel like expect that they can either solve 00:20:27.040 |
even when doing something very new and very hard, 00:20:37.120 |
like you're currently trying to solve some problem 00:20:43.720 |
You have to go look up something online, whatever it is. 00:20:51.320 |
about being distracted is you're solving some hard problem 00:20:53.720 |
and you realize you need a helper function that does X. 00:21:22.800 |
you can just ask the model, please solve this problem for me. 00:21:26.680 |
You can check that it works very, very quickly. 00:21:34.400 |
- And in terms of this concept of expert users 00:21:37.320 |
versus non-expert users, floors versus ceilings, 00:21:42.320 |
that basically it actually is more beneficial 00:21:48.520 |
Let me give you the argument for both of these. 00:21:51.680 |
- So I can only speak on the expert user behalf 00:21:53.000 |
because I've been doing computers for a long time. 00:21:54.960 |
And so, yeah, the cases where it's useful for me 00:21:56.440 |
are exactly these cases where I can check the output. 00:22:02.920 |
I can check every single thing that the model's doing 00:22:10.680 |
But I also see a world in which this could be very useful 00:22:13.360 |
for the kinds of people who do not have this knowledge 00:22:16.760 |
with caveats, because I'm not one of the people 00:22:20.240 |
But one of these big ways that I can see this 00:22:22.960 |
is for things that you can check fairly easily, 00:22:29.320 |
or have written a program themselves to do a certain task 00:22:32.080 |
could just ask for the program that does the thing. 00:22:34.480 |
And you know, some of the times it won't get it right, 00:22:39.080 |
and they'll be able to have the thing in front of them 00:22:44.080 |
And we see a lot of people trying to do applications 00:22:56.040 |
and various things, and other people who don't, 00:23:04.480 |
And this is a case where you could have a model 00:23:11.000 |
And as long as the person is rigorous in testing 00:23:13.080 |
that the solution does actually the correct thing, 00:23:14.760 |
this is the part that I'm worried about most. 00:23:22.200 |
like you probably shouldn't trust these models 00:23:34.720 |
but I'm worried that it might end up in a world 00:23:44.720 |
and just break everything because everything is terrible. 00:23:52.360 |
it is possible that these could be very useful. 00:23:57.200 |
that shows that when people use LLMs to generate code, 00:24:03.920 |
There are a bunch of papers that touch on exactly this. 00:24:08.040 |
- My slight issue is, is there an agenda here? 00:24:19.040 |
Yeah, he and some students have some things on this. 00:24:27.480 |
and I sort of trust them to have done the right thing. 00:24:31.440 |
I also think, even on this though, we have to be careful 00:24:35.960 |
whenever someone says X is true about language models, 00:24:38.200 |
you should always append the suffix for current models 00:24:43.320 |
I was one of the people who was very much on the opinion 00:24:47.480 |
and are gonna have absolutely no practical utility. 00:24:49.480 |
And if you had asked me this, let's say in 2020, 00:25:00.200 |
I still would have told you these things are toys. 00:25:12.960 |
It's like they're trying to make some analogies 00:25:17.160 |
It's just like, I don't even care to read it. 00:25:19.000 |
I saw what it was about and just didn't even look at it. 00:25:36.760 |
I want them to at least know what is true about the world 00:25:39.080 |
so that they can then see that maybe they should reconsider 00:25:44.960 |
that may just not be true about today's models. 00:25:47.440 |
- Specifically, because you brought up spreadsheets, 00:25:51.240 |
because I think Google's done a really good job 00:25:56.160 |
Gemini is integrated inside of Google Sheets 00:26:09.880 |
And so I just don't write formulas manually anymore. 00:26:12.720 |
I just prompt Gemini to do it for me and it does it. 00:26:15.600 |
- Yeah, one of the problems that these machine learning 00:26:29.800 |
More of these things, it would be good for them to exist. 00:26:32.760 |
I want them to exist in ways that we can actually make sure 00:26:43.880 |
I feel like lots of people, there are far too many. 00:26:47.440 |
X plus AI, where X is like arbitrary thing in the world 00:26:53.960 |
And they're just doing it because they want to use the word. 00:27:02.000 |
Yes, I do not want my fridge on the internet. 00:27:05.360 |
Okay, anyway, let's not go down that rabbit hole. 00:27:08.520 |
because people want to sell things and whatever. 00:27:12.600 |
and then they write off everything as a result of it. 00:27:14.560 |
And I just want to say, there are allowed to be people 00:27:17.720 |
who are trying to do things that don't make any sense. 00:27:25.480 |
So both explaining code, being a API reference, 00:27:34.080 |
I feel like, you know, one thing is like generate code 00:27:38.160 |
One way is like, just tell me about this technology. 00:27:40.880 |
Another thing is like, hey, I read this online. 00:27:44.560 |
Any best practices on getting the most out of it or? 00:27:47.680 |
- Yeah, I don't know if I have best practices. 00:27:58.440 |
but I have never used them in this way before. 00:28:03.600 |
And so yeah, as an API reference is a great example. 00:28:06.240 |
You know, the tool everyone always picks on is like FFmpeg. 00:28:09.960 |
No one in the world knows the command line arguments 00:28:16.360 |
You know, I want lower bit rate, like dash V, you know, 00:28:20.280 |
but like once you tell me what the answer is, 00:28:22.400 |
Like this is one of the things where it's great 00:28:38.760 |
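As a concrete example of the kind of answer being asked for here, lowering a video's bitrate with FFmpeg might look like the following sketch; the `-b:v` flag, the 1 Mbit/s target, and the filenames are illustrative assumptions rather than anything stated in the conversation:

```python
import subprocess

# Re-encode a video at a lower video bitrate. The -b:v flag sets the target
# video bitrate; the 1M value and the filenames are assumptions for illustration.
subprocess.run(
    ["ffmpeg", "-i", "input.mp4", "-b:v", "1M", "output.mp4"],
    check=True,
)
```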
which pieces of the code are actually important. 00:28:43.680 |
isn't actually anything to do with security. 00:28:43.680 |
And like, you just want to ignore all of that. 00:28:49.200 |
and not just figure out what's going on perfectly. 00:29:08.200 |
Like you already know that when you're gonna read 00:29:10.840 |
these things, what you're going to try and do 00:29:16.600 |
This is a great way of just doing that, but faster, 00:29:19.040 |
because it will abstract most of what is right. 00:29:21.840 |
It's gonna be wrong some of the time, I don't care. 00:29:27.000 |
And so like one of the particular use cases I have 00:29:32.080 |
where, you know, oftentimes people will release a binary, 00:29:38.920 |
And so one thing you could do is you could try 00:29:42.440 |
It turns out for the thing that I wanted, none existed. 00:29:44.680 |
And so like I spent too many hours doing it by hand 00:29:48.120 |
before I first thought, you know, like, why am I doing this? 00:29:50.320 |
I should just check if the model can do it for me. 00:29:54.980 |
which is impossible for any human to understand, 00:29:56.880 |
into the Python code that is entirely reasonable 00:29:59.600 |
And, you know, it doesn't run, it has a bunch of problems, 00:30:06.320 |
where I should be looking and then spend all of my time 00:30:15.840 |
And especially for you as a security researcher 00:30:23.440 |
I do think we want to sort of move to your other blog posts, 00:30:27.960 |
you ended your post with a little bit of a teaser 00:30:35.480 |
and I will do that at some point when I have time, 00:30:37.840 |
maybe after I'm done writing my current papers 00:30:40.720 |
where I want to talk about some thoughts I have 00:30:44.000 |
for where language models are going in the near-term future. 00:30:48.040 |
is because, again, I feel like the discussion 00:30:50.280 |
tends to be people who are either very much AGI by 2027, 00:30:57.560 |
- Yes, or are going to make statements of the form, 00:31:05.060 |
and we should be doing something else instead. 00:31:06.760 |
And again, I feel like people tend to look at this 00:31:12.160 |
well, those obviously are both very far extremes. 00:31:22.860 |
Just saying, you know, I have wide margins of error. 00:31:27.160 |
If you would say there's a 0% chance that something, 00:31:30.000 |
you know, the models will get very, very good 00:31:31.680 |
in the next five years, you're probably wrong. 00:31:35.040 |
that in the next five years, then you're probably wrong. 00:31:43.280 |
But it's very hard to get clicks on the internet 00:31:45.360 |
of like, some things may be good in the future. 00:31:48.440 |
Like, everyone wants like, you know, a very like, 00:32:13.760 |
the safety and security things as a result of this. 00:32:28.040 |
and can solve, you know, tasks completely autonomously, 00:32:31.840 |
that's a very different security world to be living in 00:32:35.540 |
And the types of security questions I would want to ask 00:32:38.840 |
And so I think, you know, in some very large parts, 00:32:49.360 |
- You mentioned getting clicks on the internet, 00:32:50.960 |
but you don't even have like an X account or anything. 00:32:54.800 |
What's the, what's your distribution strategy? 00:33:00.960 |
Nicholas Carlini brought this, like what's his handle? 00:33:07.560 |
- So I have an RSS feed and an email list, and that's it. 00:33:14.840 |
I feel like, on principle, I feel like they have some harms. 00:33:18.000 |
As a person, I have a problem when people say things 00:33:22.280 |
and I would get nothing done if I were to have a Twitter. 00:33:25.080 |
I would spend all of my time correcting people 00:33:55.240 |
I don't need to be someone who has to have this other thing 00:33:59.200 |
And so I feel like I can just say what I want to say, 00:34:02.000 |
and if people find it useful, then they'll share it widely. 00:34:05.920 |
I wrote a thing, whatever, sometime late last year 00:34:09.360 |
about how to recover data off of an Apple Profile Drive 00:34:17.080 |
This probably got, I think, 1,000x less views than this, 00:34:24.840 |
that I actually care about, which is my research. 00:34:26.760 |
I would care much more if that didn't get seen. 00:34:30.600 |
because I have some thoughts that I just want to put down. 00:34:35.600 |
and authenticity that is sadly lacking sometimes 00:34:38.880 |
in modern discourse that makes it attractive. 00:34:42.120 |
And I think now you have a little bit of a brand 00:34:44.160 |
of you are an independent thinker, writer, person 00:34:52.400 |
- Yeah, this kind of worries me a little bit. 00:34:57.360 |
which is entirely unrelated, I don't want people- 00:35:00.560 |
- You should actually just throw people off right now. 00:35:05.960 |
So the last two or three things I've done in a row 00:35:07.720 |
have been actually things that people should care about. 00:35:10.600 |
So I have a couple of things I'm trying to figure out. 00:35:12.200 |
Which one do I put online to just cull the list 00:35:20.160 |
What you're here for is whatever I want to talk about. 00:35:24.160 |
This is not what I want out of my personal website. 00:35:27.480 |
- So here's top 10 enemies or something like that. 00:35:30.600 |
What's the next project you're going to work on 00:35:32.360 |
that is completely unrelated to research LLMs? 00:35:35.640 |
Or what games do you want to port into the browser next? 00:35:39.120 |
- Okay, yeah, so maybe, okay, here's a fun question. 00:35:47.320 |
- I mean, you can think about bits and atoms. 00:35:53.120 |
How much data can you put on a piece of paper? 00:36:03.260 |
- I'll just throw out there, like 10 megabytes. 00:36:12.680 |
So I have a thing that does about a megabyte. 00:36:22.420 |
This is supposed to be the title at some point, 00:36:25.460 |
- Yeah, so this is a little hard because, you know, 00:36:27.280 |
so you can do the math and you get 8 1/2 by 11. 00:36:38.240 |
you need to be able to recover up to like 90 plus percent, 00:36:41.480 |
like 95%, like 99 point something percent accuracy 00:36:44.840 |
in order to be able to actually decode this off the paper. 00:36:47.420 |
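The back-of-the-envelope version of that math, under assumed numbers (a 300 DPI printer, one bit per dot, and roughly 20% of capacity given up to error correction; none of these figures come from the conversation), looks like this:

```python
# Rough capacity estimate for data printed on a US Letter (8.5" x 11") page.
dpi = 300                        # assumed print/scan resolution
width_in, height_in = 8.5, 11.0  # page size mentioned above
ecc_overhead = 0.20              # assumed fraction spent on error correction

raw_bits = width_in * dpi * height_in * dpi  # one bit per printable dot
usable_bytes = raw_bits * (1 - ecc_overhead) / 8

print(f"raw: {raw_bits / 8 / 1e6:.2f} MB, usable: {usable_bytes / 1e6:.2f} MB")
# -> raw: 1.05 MB, usable: 0.84 MB, consistent with the "about a megabyte"
#    figure mentioned above.
```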
This is one of the things that I'm considering. 00:36:50.360 |
I need to like get a couple more things working for this 00:36:52.840 |
where, you know, again, I'm running to some random problems, 00:37:02.800 |
People try and write the most obfuscated C code that they can, 00:37:12.120 |
I have a very fun gate level emulation of an old CPU 00:37:33.080 |
I would squeeze in really, really small pieces 00:37:36.920 |
- Okay, we are also going to talk about your benchmarking 00:37:51.640 |
- Okay, benchmarks tell you how well the model solves 00:38:07.920 |
because people tried to make models classify digits 00:38:24.680 |
And yet, like, this is what drove a lot of progress. 00:38:29.520 |
because they want to just measure progress in some way. 00:38:36.240 |
and we will measure progress on this benchmark, 00:38:38.280 |
not because we care about the problem per se, 00:38:41.520 |
is in some way correlated with making better models. 00:38:44.160 |
And this is fine when you don't want to actually use 00:38:48.000 |
But when you want to actually make use of them, 00:38:56.360 |
is that there would be model after model after model 00:38:58.720 |
that was being released that would find some benchmark 00:39:07.960 |
to know whether or not I should then switch to it. 00:39:10.280 |
So the argument that I tried to lay out in this post 00:39:17.840 |
And so what I did is I wrote a domain-specific language 00:39:25.040 |
that you have wanted models to solve for you, 00:39:32.600 |
you benchmark the model on the things that you care about, 00:39:36.760 |
because you've actually asked for those answers before. 00:39:49.280 |
does this solve the kinds of things that I care about? 00:39:56.880 |
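The shape of such a personal benchmark can be very small. A sketch of the idea (the `query_model` function and the two example tests are hypothetical stand-ins, not the actual DSL being described):

```python
# A tiny personal benchmark: questions you have actually asked a model before,
# each paired with a cheap programmatic check of the answer.

def query_model(prompt: str) -> str:
    raise NotImplementedError("call whatever model/API you actually use here")

PERSONAL_TESTS = [
    # (prompt, checker) pairs pulled from your own chat history.
    ("Give me a git command to undo the last commit but keep the changes.",
     lambda out: "git reset" in out),
    ("Write a Python one-liner that reverses the words in a string s.",
     lambda out: "split" in out and "[::-1]" in out),
]

def run_personal_benchmark():
    passed = 0
    for prompt, check in PERSONAL_TESTS:
        try:
            if check(query_model(prompt)):
                passed += 1
        except Exception:
            pass  # any crash counts as a failure
    print(f"{passed}/{len(PERSONAL_TESTS)} personal tests passed")
```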
I don't want to say that existing benchmarks are not useful. 00:40:03.760 |
But in many cases, what that's designed to measure 00:40:06.760 |
is not actually the thing that I want to use it for. 00:40:08.640 |
And I would expect that the way that I want to use it 00:40:10.600 |
is different than the way that you want to use it. 00:40:20.840 |
good at some benchmark to make it good at that benchmark. 00:40:23.520 |
You can sort of like find the distribution of data 00:40:31.440 |
And by having a benchmark that is not very popular, 00:40:37.320 |
that no one has tried to optimize their model 00:40:41.120 |
- So publishing your benchmark is a little bit-- 00:40:47.440 |
was not that people would use mine as theirs. 00:40:50.680 |
My hope in doing this was that people would say-- 00:40:57.760 |
a very small fraction of people, 0.1% of people 00:41:00.120 |
who made a benchmark that was useful for them, 00:41:01.980 |
this would still be hundreds of new benchmarks 00:41:18.220 |
is people just do this vibes-based evaluation thing 00:41:23.580 |
and you see if it worked on the kinds of things 00:41:37.220 |
- Yeah, I like the idea of going through your chat history 00:41:42.420 |
I regret to say that I don't think my chat history 00:41:44.840 |
is used as much these days because I'm using cursor, 00:41:54.680 |
now that you've written the "How I Use AI" post, 00:41:59.440 |
are you able to translate all these things to evals? 00:42:17.920 |
than you might have thought would be possible 00:42:19.820 |
if you do a little bit of work on the backend. 00:42:22.220 |
So for example, all of the code that I have the model write, 00:42:31.780 |
and the language model judges whether the output was correct. 00:42:57.420 |
they're very good at being able to tell this. 00:43:04.140 |
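A sketch of that evaluation loop: run the model-written program, then let a second model grade its output. Here the container is simplified to a bare subprocess with a timeout, and `judge_model` is a hypothetical callable that returns the judge's text:

```python
import subprocess, sys, tempfile

def judge_says_correct(judge_model, task: str, program_output: str) -> bool:
    """Ask a judge model whether the program output solves the task."""
    verdict = judge_model(
        f"Task: {task}\n\nProgram output:\n{program_output}\n\n"
        "Reply YES if the output correctly solves the task, otherwise NO."
    )
    return verdict.strip().upper().startswith("YES")

def eval_generated_code(task: str, code: str, judge_model) -> bool:
    # NOTE: the setup described above runs model-written code inside a Docker
    # container; a plain subprocess with a timeout is only a simplification.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, timeout=30
    )
    return judge_says_correct(judge_model, task, result.stdout)
```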
- You complained about prompting and being lazy 00:43:14.060 |
How do you see the evolution of like prompt engineering? 00:43:17.980 |
maybe, you know, it was kind of like really hot 00:43:19.900 |
and people wanted to like build companies around it. 00:43:21.660 |
Today, it's like the models are getting good. 00:43:33.380 |
that like, you know, calls back to the model again. 00:43:43.040 |
is just to say, oftentimes when I use a model 00:44:06.260 |
such that it gives me the answer that's correct. 00:44:18.380 |
And so oftentimes, you know what I do is like, 00:44:20.140 |
I just dump in whatever current thought that I have 00:44:28.500 |
maybe the model was right to give me the wrong answer. 00:44:34.420 |
And so like, I just want to sort of get this as a thing. 00:44:41.660 |
that always goes into all the models or something. 00:44:49.940 |
Maybe it turns out that as models get better, 00:44:51.420 |
you don't need to prompt them as much in this way. 00:44:54.020 |
I just want to use the things that are in front of me. 00:45:00.420 |
you're using the prompt to kind of like steer it 00:45:04.420 |
to actually not make the prompt really relevant 00:45:08.900 |
- I mean, you could fine tune it into the model, 00:45:12.860 |
I mean, it seems like some models have done this, 00:45:19.580 |
they'll say, "Let's think through this step-by-step." 00:45:21.540 |
And then they'll go through the step-by-step answer. 00:45:23.820 |
Two years ago, I would have had to prompt it. 00:45:25.980 |
Think step-by-step on solving the following thing. 00:45:27.900 |
Now you ask them the question and the model says, 00:45:43.340 |
- For listeners, that would be Orca and Agent Instruct, 00:45:49.260 |
- Does few-shot, is it included in the lazy prompting? 00:45:57.140 |
- I don't because usually when I want the answer, 00:46:08.260 |
testing the ultimate capability level of the model 00:46:10.740 |
and testing the thing that I'm doing with it. 00:46:15.620 |
Because there are almost certainly better ways 00:46:18.020 |
and sort of really see how good the model is. 00:46:24.780 |
And so I'm entirely fine with people doing fancy prompting 00:46:27.380 |
to show me what the true capability level could be. 00:46:31.100 |
what the ultimate level of the model could be. 00:46:35.820 |
how good the model is if you don't do fancy things. 00:46:47.420 |
they'll do like five shots, 25 shots, 50 shots. 00:46:56.460 |
the problem is everyone wants to get state-of-the-art 00:47:01.940 |
so that you get state-of-the-art on the benchmark. 00:47:06.700 |
Like it's good to know the model can do this thing 00:47:16.460 |
And I could get there if I was willing to work hard enough. 00:47:20.380 |
and figure out how to ask the model the question? 00:47:23.020 |
And for me, I have programmed for many, many, many years. 00:47:26.260 |
It's often just faster for me just to do the thing 00:47:28.460 |
than to like figure out the incantation to ask the model. 00:47:31.300 |
But I can imagine someone who has never programmed before 00:47:34.380 |
might be fine writing five paragraphs in English, 00:47:53.340 |
and most eval paradigms, I'm not picking on you, 00:47:55.940 |
is that we're actually training these things for chat, 00:47:59.940 |
And you actually obviously reveal much more information 00:48:05.180 |
in sort of like a tree search branching sort of way. 00:48:10.980 |
Where the vast majority of prompts are single question, 00:48:15.300 |
But actually the way that we use chat things, 00:48:18.380 |
in the way, even in the stuff that you posted 00:48:31.780 |
I might be writing a paper on this at some point 00:48:35.340 |
A couple of the evals in the benchmark thing I have 00:48:40.580 |
I have a 20 question eval there, just for fun. 00:48:48.700 |
figure out how to cherry pick off this other branch 00:48:53.340 |
I basically build a tiny little agency thing. 00:49:03.740 |
I run whatever the model told me the output to do is. 00:49:12.860 |
that it is correctly cherry picked in this way? 00:49:22.060 |
Like it's more challenging to do this kind of evaluation. 00:49:28.580 |
so that people could come up with these evals 00:49:31.580 |
that more closely measure what they're actually doing. 00:49:39.340 |
And you mentioned how like nobody uses this thing anymore. 00:49:48.540 |
do you figure out how to like fine tune the model? 00:49:53.940 |
put together some examples or would you just say, 00:49:55.820 |
hey, the model just doesn't do it, whatever, move on? 00:50:07.420 |
in like the mid '90s, early '90s or something, 00:50:11.220 |
when UU encoding was actually a thing that people would do. 00:50:18.740 |
And like it was doing it correctly for like 99% of cases. 00:50:33.020 |
that if you really cared about this task being solved well, 00:50:37.020 |
But again, this is one of these kinds of tasks 00:50:44.900 |
that it gets me 90% of the way there, good, like done. 00:50:47.660 |
Like I can sort of have fun for a couple hours 00:50:50.900 |
I was not like, if I ever had to train a thing for this, 00:50:54.020 |
And so it did well enough for me that I could move on. 00:50:57.500 |
- It does give me an idea for adversarial examples 00:51:00.740 |
inside of a benchmark that are basically canaries 00:51:05.100 |
Typically right now, benchmarks have canary strings, 00:51:07.340 |
or if you ask it to repeat back the string and it does, 00:51:10.620 |
But you know, it's easy to filter out those things. 00:51:16.700 |
and if it gives you the intentionally wrong answer, 00:51:19.900 |
- Yeah, there are actually a couple of papers 00:51:26.860 |
So the field of work called membership inference, 00:51:33.940 |
There's a field called like dataset inference. 00:51:54.300 |
that the specific questions happen to be in matters. 00:51:59.260 |
because the order of the questions is arbitrary 00:52:01.500 |
There are a number of papers that follow up on this 00:52:03.700 |
I think this is a great way of doing this now. 00:52:07.540 |
included some canary questions in their benchmarks, 00:52:10.780 |
you can already sort of start getting at this now. 00:52:29.260 |
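One simple version of that ordering test can be sketched as follows; `sequence_logprob`, a function returning the model's total log-probability for a piece of text, is a hypothetical stand-in for whatever scoring access you have:

```python
import random

def benchmark_order_test(questions, sequence_logprob, trials=100, seed=0):
    """Dataset-inference-style check: the published order of a benchmark's
    questions is arbitrary, so a model that strongly prefers it over random
    shuffles has probably seen the benchmark during training."""
    rng = random.Random(seed)
    published_score = sequence_logprob("\n".join(questions))
    wins = 0
    for _ in range(trials):
        shuffled = list(questions)
        rng.shuffle(shuffled)
        if published_score > sequence_logprob("\n".join(shuffled)):
            wins += 1
    # Fraction of shuffles the published order beats; near 1.0 is suspicious.
    return wins / trials
```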
LAION-400M is one of the biggest image datasets 00:52:33.300 |
And a lot of the image gets pulled from live domains. 00:52:38.340 |
- Every image gets pulled from a live domain, yes. 00:52:42.900 |
So then you went on and you bought the domains 00:52:57.100 |
We talked before about low background tokens. 00:53:01.420 |
you can imagine most things you get from the internet, 00:53:06.540 |
After 2021, you can imagine most things written 00:53:12.780 |
or like maybe give more of the LAION background. 00:53:41.980 |
that like assume an adversary could do the following 00:53:53.180 |
like what are you gonna do as a security researcher? 00:53:58.060 |
because eventually someone's gonna use these things 00:54:05.300 |
And then you write a paper that sort of looks at this. 00:54:07.860 |
And then maybe it turns out that some of these 00:54:11.700 |
So this has happened for quite some long time. 00:54:14.820 |
And the thing that bothered me is it seems like 00:54:19.620 |
and try and actually start studying real problems. 00:54:21.860 |
So we very deliberately started looking like, 00:54:24.740 |
what are the problems that actually arise in real systems 00:54:30.940 |
that I could imagine writing that would be at Black Hat? 00:54:33.860 |
Like a real security person would want to see, 00:54:39.220 |
that you can make this machine learning model do, 00:54:51.060 |
you say, well, here are the bad things that could happen. 00:54:52.940 |
I could try and do an evasion attack at test time. 00:55:02.900 |
you ask, okay, here's my list of 10 problems. 00:55:05.420 |
Which of them are most important and relevant to this? 00:55:07.980 |
And you just do this for every single one in the list. 00:55:21.820 |
that let them inject arbitrary images into the dataset. 00:55:25.340 |
And this is, I think, the way that we came to doing this 00:55:47.540 |
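The measurement behind that attack is conceptually simple: group the dataset's image URLs by domain and see how many images an attacker would control by registering any domain that has since expired. A sketch (the example URLs are placeholders, and the check for whether a domain is actually purchasable is left out because it depends on a registrar):

```python
from collections import Counter
from urllib.parse import urlparse

def images_per_domain(image_urls):
    """Count how many dataset images are served from each domain. Any of these
    domains that has expired could be bought, letting the buyer decide what
    those URLs return the next time someone downloads the dataset."""
    return Counter(urlparse(url).netloc for url in image_urls).most_common()

# Placeholder URLs for illustration (not real LAION-400M entries).
example_urls = [
    "http://img.defunct-store.example/products/001.jpg",
    "http://img.defunct-store.example/products/002.jpg",
    "http://photos.old-blog.example/cat.png",
]
print(images_per_domain(example_urls))
# [('img.defunct-store.example', 2), ('photos.old-blog.example', 1)]
```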
You've also had a paper on stealing part of a production LLM. 00:55:51.340 |
You extracted the Babbage and Ada embedding projection layers 00:56:05.620 |
This paper was, again, with the exact same motivation. 00:56:09.220 |
there's this field of research called model stealing. 00:56:11.460 |
What it's interested in is you have your model 00:56:13.580 |
that you have trained, it was very expensive. 00:56:15.140 |
I want to query your model and steal a copy of the model 00:56:28.260 |
a couple thousand neurons evaluated in Float64 00:56:31.300 |
with ReLU-only activations, fully connected networks. 00:56:39.220 |
Each of these assumptions I just said is false in practice. 00:56:41.540 |
Like none of these things are things you can really do. 00:56:44.980 |
I mean, there's a reason the paper is at Crypto. 00:56:48.220 |
and not like at like an actual security conference 00:56:50.260 |
because like it's a very theoretical kind of thing. 00:56:54.860 |
because maybe you can extend these to make it be possible. 00:56:57.340 |
But I also think it's worth thinking about the problem 00:57:05.860 |
And then we can push from the other direction. 00:57:10.740 |
And that's again, like what we're trying to do here. 00:57:12.180 |
We sort of looked at what APIs do actually people expose 00:57:32.700 |
You know, I first have to email the Google lawyer. 00:57:37.940 |
And they say like, you know, under no circumstances. 00:57:40.380 |
And you say, okay, but what if they agree to it? 00:57:43.780 |
And you said, then you say, I know some people there. 00:57:47.540 |
And they're like, as long as you delete it afterwards, okay. 00:57:50.220 |
And I'm like, can you get your general counsel 00:57:53.980 |
So like, we had all of the lawyers talk to each other. 00:58:00.160 |
Like, you know, you don't want to actually, you know, 00:58:14.980 |
we notified everyone who was vulnerable to this attack. 00:58:20.880 |
There were one or two other people who were vulnerable 00:58:24.140 |
We notified them all, gave them 90 days to fix it, 00:58:26.100 |
which is like a standard disclosure period in security. 00:58:35.300 |
- Yeah, so the fix in particular was don't show log probs 00:58:42.340 |
And what you don't show is the logit bias plus the log prob, 00:58:45.620 |
They sort of did the narrow thing to prevent this. 00:58:48.060 |
Some people were unhappy, but like, this is, you know, 00:58:55.460 |
I really like this example because for a very long time, 00:58:58.940 |
nothing about GPT-4 would be at all different 00:59:10.540 |
This is not true in other fields in, you know, 00:59:17.180 |
because of the security attacks that we've had in the past. 00:59:20.940 |
the way we design the internet is fundamentally different 00:59:24.740 |
And what that means is it means that the attacks 00:59:26.220 |
that we had were so compelling to the non-security people 00:59:34.620 |
In adversarial machine learning, we didn't have this. 00:59:36.060 |
We didn't have attacks that were useful enough 00:59:43.580 |
because the attack that you've presented to me 00:59:54.320 |
that I will break utility in order to prevent this attack. 00:59:56.660 |
And I would like to see more of these kinds of attacks, 01:00:03.300 |
that we have exhausted the space of possible attacks 01:00:07.700 |
that someone else comes up with a very bad thing 01:00:17.580 |
And this is the hope of doing this research direction. 01:00:32.740 |
- And then just scaling it up, you can steal the others. 01:00:39.860 |
We only steal one in the attack that as we present it, 01:00:44.300 |
For the other research we have done in the past, 01:00:50.820 |
and then the second to the third and third to the fourth. 01:00:58.100 |
And what we're trying to do now is similar kind of thing, 01:01:21.300 |
And if someone else were to discover a stronger variant, 01:01:23.860 |
I would hope that they would take a similar approach, 01:01:27.300 |
patch the thing, release it to everyone and go from there. 01:01:29.280 |
- We do also serve people building on top of models. 01:01:31.900 |
And one thing that I think people are interested in 01:01:33.680 |
is prompt injections, prompt security, that kind of stuff. 01:01:37.500 |
I feel like the relevant version of your thing 01:01:50.740 |
There's model stealing and there's data stealing. 01:01:52.500 |
Data stealing is exactly this kind of question. 01:02:03.360 |
So we've done some work where we have trained a model, 01:02:08.220 |
okay, in this case, in the most extreme variant, 01:02:10.620 |
we showed a way to recover training data from GPT 3.5 Turbo. 01:02:19.260 |
and he figured out that if you prompt ChatGPT 01:02:25.500 |
then it will repeat the word many, many, many times in a row 01:02:28.220 |
and then explode and just start doing random stuff. 01:02:35.700 |
it would just repeat training data back to you, 01:02:45.220 |
- Do we know, is it exactly the training data 01:02:47.940 |
or is it something that looks like the training data? 01:02:52.340 |
It doesn't have the weights to memorize all the training. 01:02:54.580 |
- No, no, it can't memorize all the training data. 01:03:02.700 |
And what I can say is that the output of the model 01:03:05.240 |
was a verbatim, at least 50-word-in-a-row match 01:03:13.000 |
So there's two possible explanations for this. 01:03:21.600 |
In principle, this is possible, or it memorized it. 01:03:24.800 |
And for some of them, we have several hundred words 01:03:26.860 |
in a row where the probability is astronomically low. 01:03:30.400 |
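The verbatim-match check itself is conceptually simple; a sketch (building the index of 50-word shingles over a web-scale training corpus is the expensive part and is elided here):

```python
def word_shingles(text, n=50):
    """All n-word windows of a text, joined back into strings."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(0, len(words) - n + 1))}

def has_verbatim_match(model_output, corpus_shingles, n=50):
    """True if any n-word window of the model's output appears verbatim in the
    corpus index (a set of n-word shingles built offline from web-scale text)."""
    return any(s in corpus_shingles for s in word_shingles(model_output, n))
```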
- So you also have a blog post about why I attack. 01:03:40.640 |
And then Vijay, who was the CISO of DeepMind, 01:03:54.000 |
So I'll just open the floor to you now to answer. 01:04:11.760 |
I also think that it's important to attack things 01:04:17.520 |
And so it's worth having what the best attacks are 01:04:19.440 |
in order to be able to discover what is secure. 01:04:21.840 |
People then say, both of these things are true, 01:04:25.400 |
You know, I have gotten this a lot through my career. 01:04:31.640 |
On rare occasions, I have helped write papers 01:04:36.360 |
I have a hard time motivating myself to work on it. 01:04:45.440 |
who is going to try and do maximal good in the world. 01:04:53.840 |
But if you would wake up every day hating your life, 01:04:57.840 |
it is very unlikely you would do an actually good job. 01:05:00.560 |
You know, like I could sort of switch now to be a doctor 01:05:03.380 |
or, you know, to do elderly care or something like this. 01:05:08.040 |
for the right motivations is going to do so much better 01:05:11.260 |
than if I just decided, like, I am going to be a robot, 01:05:22.560 |
I don't actually think that you would do that good 01:05:25.640 |
because you're not gonna wake up every morning being like, 01:05:31.900 |
and you'll go home and work on what you actually find fun. 01:05:46.040 |
you want to like be like, I have to go to sleep now, 01:05:48.840 |
even though I want to be working on this problem. 01:05:49.960 |
Like you will do better work in the grand scheme of things 01:05:52.720 |
if you sort of look at the product of, you know, 01:05:57.480 |
by how much you're gonna actually be able to do for it. 01:06:04.400 |
And I feel like that's the case for me for defenses 01:06:08.520 |
Like, I just like, it's not interesting to me. 01:06:13.840 |
my thesis had to have a piece of it, which was a defense. 01:06:16.480 |
And so it's there, but like that last, you know, 01:06:19.680 |
a little while, I was just, I was not having a good time. 01:06:21.880 |
Like I, it's there, like it didn't become a paper. 01:06:24.760 |
It's like a chapter in my thesis until I had my PhD. 01:06:26.800 |
But like, it's not like a thing that like actually 01:06:29.240 |
motivated me to like be excited by the thing. 01:06:31.960 |
And so I think maybe some people can get motivated 01:06:35.960 |
on the work on things that like are really important 01:06:40.480 |
But I feel like if there are things in the world 01:06:45.680 |
but like you're just not the right person for them, 01:06:50.520 |
because you will not actually be able to do as much 01:06:53.280 |
as you really could have if you had tried to do better. 01:06:57.760 |
Any underrated work that you really want people 01:07:06.760 |
So anything you have missed, almost certainly yes. 01:07:13.200 |
I think people should work on more fun things.