back to indexThe Ultimate Guide to Prompting - with Sander Schulhoff from LearnPrompting.org
Chapters
0:0 Introductions
7:32 Navigating arXiv for paper evaluation
12:23 Taxonomy of prompting techniques
15:46 Zero-shot prompting and role prompting
21:35 Few-shot prompting design advice
28:55 Chain of thought and thought generation techniques
34:41 Decomposition techniques in prompting
37:40 Ensembling techniques in prompting
44:49 Automatic prompt engineering and DSPy
49:13 Prompt Injection vs Jailbreaking
57:8 Multimodal prompting (audio, video)
59:46 Structured output prompting
64:23 Upcoming Hack-a-Prompt 2.0 project
00:00:02.580 |
- Hey everyone, welcome to the Latent Space Podcast. 00:00:09.840 |
And I'm joined by my co-host, Swix, founder of Small.ai. 00:00:15.520 |
with Sander Schulhoff, author of the Prompt Report. 00:00:29.560 |
which is really interesting because, you know, 00:00:31.900 |
we've talked with Noam Brown a couple of times 00:00:33.740 |
and that obviously has a really interesting story 00:00:43.340 |
I took my first Java class and just, I don't know, 00:00:51.500 |
Deep learning, neural networks all came soon thereafter. 00:01:00.460 |
just like half the computer science department at random. 00:01:07.580 |
'Cause I've been experimenting with that a good bit. 00:01:09.820 |
And I, over that summer, I had read the intro to RL book 00:01:14.420 |
and like the deep reinforcement learning hands-on. 00:01:17.220 |
So I was very excited about what deep RL could do. 00:01:30.940 |
it was more of a natural language processing project 00:01:35.020 |
so very easily could move more into the RL realm. 00:01:39.020 |
And I ended up working with one of his students, 00:01:41.820 |
Dennis Peskov, who's now a postdoc at Princeton. 00:01:45.580 |
And that was really my intro to AI NLP deep RL research. 00:01:55.500 |
for a couple of years, mostly building infrastructure 00:02:05.780 |
and I ended up working on the mine RL competition, 00:02:13.700 |
And that ended up being a really cool opportunity 00:02:20.060 |
I knew I wanted to do some project in deep RL 00:02:26.460 |
And I was searching for some Minecraft Python library 00:02:43.820 |
And they're like, oh, we don't have docs on this, 00:02:52.660 |
and added the docs that I didn't have before. 00:03:03.820 |
And that was my first foray into competitions. 00:03:08.500 |
At some point I was working on this translation task 00:03:11.180 |
between Dade, which is a diplomacy specific bot language 00:03:15.740 |
and English, and I started using GPT-3 prompting it 00:03:21.220 |
And that was, I think, my first intro to prompting. 00:03:25.500 |
And I just started doing a bunch of reading about prompting 00:03:38.660 |
You know, chain of thought was out at this point. 00:03:40.780 |
There are a couple of blog posts floating around, 00:03:44.260 |
to just sort of read everything about prompting. 00:03:47.220 |
So I made that and it ended up getting super popular. 00:03:50.500 |
Now continuing with it, supporting the project, 00:03:55.260 |
And then the other very interesting things, of course, 00:04:00.980 |
And that is the prompt report and hack a prompt. 00:04:10.140 |
And I put that information into the learn prompting website 00:04:15.500 |
'cause I had some previous competition running experience 00:04:23.820 |
I'd participate in one of these that comes out. 00:04:27.740 |
So I was like, what the heck, I'll give it a shot. 00:04:40.860 |
but just reached out to as many people as I could. 00:04:47.660 |
actually they reached out to us a couple months after 00:05:00.980 |
But having them on board as the largest sponsor 00:05:15.260 |
which is one of the top natural language processing 00:05:19.140 |
20,000 papers were submitted to that conference. 00:05:29.620 |
I got to give a talk to like a couple thousand researchers 00:05:35.540 |
And I kind of carried that momentum into the next paper, 00:05:42.620 |
of what I had been doing with Learn Prompting 00:05:44.820 |
in the sense that we had this website bringing together 00:05:52.140 |
So writing an actual survey, a systematic survey, 00:05:55.820 |
was the next step that we did in the prompt report. 00:06:00.860 |
I led a 30-person research team with people from OpenAI, 00:06:04.300 |
Google, Microsoft, Princeton, Stanford, Maryland, 00:06:06.780 |
a number of other universities and companies. 00:06:09.020 |
And we pretty much read thousands of papers on prompting 00:06:12.860 |
and compiled it all into like a 80-page massive summary doc. 00:06:17.260 |
And then we put it on archive, and the response was amazing. 00:06:20.620 |
We've gotten millions of views across socials. 00:06:24.660 |
where I've been able to track about one and a half million. 00:06:27.380 |
And I just kind of figure if I can find that many, 00:06:35.580 |
"Oh, I'm using this paper for job interviews now 00:06:42.980 |
We've even seen misinformation about the paper. 00:06:45.140 |
So I've seen people post and be like, "I wrote this paper." 00:06:53.020 |
Researchers at Cornell put out massive prompt report. 00:06:58.860 |
I don't even know where this stuff's coming from. 00:07:06.980 |
their prompt injection security in the instruction hierarchy. 00:07:10.580 |
And it's been used by a number of Fortune 500 companies. 00:07:15.180 |
We've even seen companies built entirely on it. 00:07:19.700 |
and I look at their demos and their demos are like, 00:07:22.780 |
"Try to get the model to say I've been pwned." 00:07:36.980 |
So Learn Prompting, I think was like October 22. 00:07:41.380 |
just to give people an idea of like the timeline. 00:07:43.700 |
- Yeah, yeah, and so we ran Hackaprompt in May of 2023, 00:07:48.700 |
but the paper from EMNLP came out a number of months later. 00:07:57.300 |
And then the prompt report came out about two months ago. 00:08:05.820 |
And I think you've honestly done the community a service 00:08:08.860 |
by reading all these papers so that we don't have to, 00:08:16.260 |
into like a 10 page PDF that's posted on archive. 00:08:18.700 |
And then you've done the reverse of compressing it 00:08:25.660 |
Yeah, we saw some ridiculous stuff out there. 00:08:33.820 |
and I flagged them to their staff and they were like, 00:08:40.100 |
- Yeah, you can't post an AI generated paper there, 00:08:42.180 |
especially if you don't say it's AI generated. 00:08:51.540 |
- No, so they had ChatGPT write the entire paper 00:08:54.980 |
and worse, it was a survey paper of, I think, prompting. 00:09:00.980 |
And I was looking at it, I was like, okay, great. 00:09:03.380 |
Here's a resource that'll probably be useful to us. 00:09:08.940 |
And at some point in the paper, they did say like, 00:09:10.980 |
"Oh, and this was written in part or we use," 00:09:17.300 |
I was like, well, what other information is there 00:09:25.140 |
You know, there's like the AI scientist paper 00:09:26.820 |
that came out recently where they're using AI 00:09:36.140 |
I think if you're using AI to generate the entire paper, 00:09:43.100 |
which is run out of Japan by David Ha and Leon, 00:09:49.620 |
- Yeah, and just to clarify, no problems with their method. 00:09:51.900 |
- It seems like they're doing some verification. 00:10:00.140 |
at least it has some grounding in the real world. 00:10:03.580 |
I would also shout out one of our very loyal listeners, 00:10:06.340 |
Jeremy Nixon, who does omniscience, or omniscience, 00:10:11.860 |
I've never heard of this Prisma process that you followed. 00:10:17.980 |
and then you like filter them very studiously. 00:10:20.340 |
Like just describe like why you picked this process. 00:10:24.220 |
Was it the best fit for what you wanted to do? 00:10:26.700 |
- Yeah, it is a commonly used process in research 00:10:30.580 |
when people are performing systematic literature reviews 00:10:36.940 |
And as far as why we did it, it lends a couple of things. 00:10:59.500 |
I think it was suggested by the PI on the project. 00:11:05.060 |
doing systematic literature reviews for this paper. 00:11:08.060 |
It takes so long to do, although some people, 00:11:11.620 |
who just specialize in systematic literature reviews 00:11:14.260 |
and they just spend years grinding these out. 00:11:24.020 |
So whereas usually researchers would sort of divide 00:11:28.180 |
all the papers up among themselves and read through it, 00:11:31.660 |
we used a prompt to read through a number of the papers 00:11:34.140 |
to decide whether they were relevant or irrelevant. 00:11:37.900 |
Of course, we were very careful to test the accuracy. 00:11:56.420 |
because there's just this sort of formal process 00:11:59.300 |
associated with it, but I think it really helps you 00:12:05.060 |
There are actually a number of survey papers on Archive, 00:12:25.180 |
Last April, we wrote the anatomy of autonomy, 00:12:28.500 |
talking about agents and the parts that go into it. 00:12:38.180 |
Maybe you want to give people the super high level 00:12:40.540 |
and then we can maybe dive into the most interesting things 00:12:45.100 |
this is our taxonomy of text-based techniques 00:12:47.740 |
or just all the taxonomies we've put together in the paper? 00:12:52.140 |
One of the most significant contributions of this paper 00:12:55.900 |
is formal taxonomy of different prompting techniques. 00:13:01.420 |
that you could go about taxonomizing techniques. 00:13:04.180 |
You could say, okay, we're going to taxonomize them 00:13:06.980 |
according to application, how they're applied, 00:13:15.380 |
But the most consistent way we found to do this 00:13:19.980 |
was taxonomizing according to problem-solving strategy. 00:13:23.660 |
And so this meant for something like chain of thought, 00:13:30.100 |
it's reasoning, maybe you think it's reasoning, 00:13:34.300 |
That is something called generating thought, reasoning steps. 00:13:42.940 |
And chain of thought is not even a unique technique. 00:13:51.860 |
And I think like Think Aloud or something like that 00:13:56.820 |
which was actually extraordinarily similar to it. 00:14:03.540 |
where maybe you have multiple different prompts you're using 00:14:10.660 |
And then there's times where you have the model 00:14:22.700 |
Zero-shot in our taxonomy is a bit of a catch-all 00:14:25.780 |
in the sense that there's a lot of diverse prompting techniques 00:14:32.420 |
So we kind of just put them together in zero-shot. 00:14:35.900 |
But the reason we found it useful to assemble prompts 00:14:48.500 |
So there's not really a clear differentiation there, 00:15:01.260 |
could fall into two or more overall categories. 00:15:05.940 |
So a good example being few-shot chain-of-thought prompting. 00:15:09.740 |
Obviously, it's few-shot, and it's also chain-of-thought, 00:15:20.020 |
chose the sort of primary label for each prompting technique. 00:15:29.100 |
And then few-shot is more of an improvement upon that. 00:15:33.260 |
There's a variety of other prompting techniques, 00:15:48.740 |
you picked out 58 techniques out of your, I don't know, 00:15:56.460 |
that are special to you and discuss them a little bit. 00:16:04.780 |
So in zero-shot, you had emotion prompting, role prompting, 00:16:07.340 |
style prompting, S2A, which is, I think, system to attention, 00:16:14.020 |
I've heard of self-ask the most because Ophir Press 00:16:18.140 |
But what are your personal underrated picks there? 00:16:22.220 |
Let me start with my controversial picks here, 00:16:26.380 |
Emotion prompting and role prompting, in my opinion, 00:16:30.340 |
are techniques that are not sufficiently studied, 00:16:36.180 |
believe they work very well for accuracy-based tasks 00:16:40.740 |
on more modern models, so GPT-4 class models. 00:16:50.180 |
And we got a lot of feedback on both sides of the issue. 00:16:53.300 |
And we clarified our position in a blog post. 00:16:56.460 |
And basically, our position, my position in particular, 00:16:59.060 |
is that role prompting is useful for text generation tasks, 00:17:03.460 |
so styling text saying, oh, speak like a pirate. 00:17:12.420 |
And maybe you tell the AI that it's a math professor. 00:17:15.220 |
And you expect it to have improved performance. 00:17:24.300 |
I think it might have worked on older ones, like GPT-3. 00:17:30.300 |
But also, we ran a mini-study as part of the prompt report. 00:17:35.560 |
But I hope to include it in the next version, where 00:17:41.380 |
And in particular, I designed a genius prompt. 00:17:47.120 |
professor, and you're incredible at solving problems. 00:17:56.620 |
And we ran these on, I think, a couple thousand MMLU questions. 00:18:00.820 |
The idiot prompt outperformed the genius prompt. 00:18:11.500 |
might have been at the bottom, actually, of the list. 00:18:21.340 |
which use role prompting and accuracy-based tasks. 00:18:27.220 |
shows the performance of all these different role prompts. 00:18:29.900 |
But the difference in accuracy is like a hundredth of a percent. 00:18:37.300 |
So it's very hard to tell what the reality is 00:18:42.340 |
And I think it's a similar thing with emotion prompting 00:18:45.140 |
and stuff like, I'll tip you $10 if you get this right, 00:18:53.100 |
There are a lot of posts about that on Twitter. 00:18:57.740 |
I mean, it is reasonably exciting to be able to say-- 00:19:15.540 |
Yes, I do-- my toolbox is mainly few-shot, chain of thought, 00:19:20.180 |
and include very good information about your problem. 00:19:27.260 |
You have the context length, context window, really 00:19:31.740 |
Yeah, regarding roles, I do think that, for one thing, 00:19:36.740 |
into the API of OpenAI and Thopic and all that, right? 00:19:46.980 |
I'm just shouting that out because, obviously, that 00:19:56.700 |
The analogy for those people who are familiar with this 00:19:59.300 |
is sort of the Edward de Bono six-thinking-hats approach. 00:20:03.860 |
and you look at the same problem from different angles, 00:20:07.900 |
That is still kind of useful for improving some performance. 00:20:11.380 |
Maybe not MLU, because MLU is a test of knowledge, 00:20:18.140 |
I'll call out two recent papers, which people 00:20:20.100 |
might want to look into, which is a Salesforce yesterday 00:20:29.500 |
So their approach of DEI is a sort of agent approach 00:20:32.420 |
that solves three bench scores really, really well. 00:20:39.180 |
And then the other one that had some attention recently 00:20:41.620 |
is Tencent AI Lab put out a synthetic data paper 00:20:49.620 |
different synthetic data from different perspectives. 00:21:02.500 |
This is done by a co-author on both the prompt report 00:21:13.260 |
where he has models prompted with different roles 00:21:19.260 |
and then basically takes the majority response. 00:21:21.780 |
One of them is a RAG-enabled agent, internet search agent. 00:21:24.700 |
But the idea of having different roles for the different agents 00:21:38.260 |
I think you've done a great job at grouping the types 00:21:43.820 |
So the quantity, the ordering, the distribution. 00:21:47.100 |
Maybe just run through people what are the most impactful. 00:21:53.740 |
has, for example, Q semicolon and then A semicolon, 00:21:57.380 |
it's better to put it that way versus if the training 00:21:59.980 |
data is a different format, it's better to do it. 00:22:03.420 |
And then how do they figure out what's in the training data 00:22:09.700 |
All right, basically, we read a bunch of papers 00:22:21.380 |
So how you order your exemplars in the prompt 00:22:25.540 |
And we've seen this move accuracy from 0% to 90%, 00:22:29.820 |
like 0 to state-of-the-art on some tasks, which 00:22:34.340 |
And I expect this to change over time in the sense 00:22:37.340 |
that models should get robust to the order of few-shot 00:22:42.500 |
But it's still something to absolutely keep in mind 00:22:46.660 |
And so that means trying out different orders, 00:22:49.500 |
making sure you have a random order of exemplars 00:22:52.620 |
Because if you have something like all your negative 00:22:54.980 |
examples first, and then all your positive examples, 00:22:57.540 |
the model might read into that too much and be like, OK, 00:23:04.500 |
And there's other biases that you can accidentally generate. 00:23:15.020 |
whether that's Q colon, A colon, or just input colon output, 00:23:31.140 |
Basically, what that means is that they're more stable 00:23:39.380 |
And as far as how to figure out what these common formats are, 00:23:47.660 |
And for longer form tasks, we don't cover them 00:23:52.660 |
But I think there are a couple of common formats out there. 00:23:56.260 |
But if you're looking to actually find it in a data set, 00:24:03.140 |
there's something called prompt mining, which 00:24:06.660 |
And basically, you search through the data set. 00:24:11.300 |
You find the most common strings of input, output, or QA, 00:24:18.140 |
And then you just select that as the one you use. 00:24:20.940 |
This is not a super usable strategy for the most part 00:24:26.300 |
in the sense that you can't get access to ChachiBT's training 00:24:34.060 |
a format that's consistently used by other people 00:24:42.260 |
keeps you within the bounds of what it was trained for. 00:24:47.700 |
I spend a lot of time doing example, few-shot, prompting, 00:24:58.780 |
I don't really have a good playground to improve them. 00:25:01.260 |
Actually, I wonder if you have a good few-shot example 00:25:06.860 |
You have six things-- example, quality, ordering, distribution, 00:25:17.860 |
and maybe you can help me with this-- of my exemplars 00:25:22.020 |
leaking into the output, which I actually don't want. 00:25:30.580 |
But I think this is tightly related to quantity. 00:25:37.580 |
So if you give the-- then you give two examples. 00:25:39.980 |
I always have this rule of every example must come in pairs-- 00:25:43.340 |
a good example, bad example, good example, bad example. 00:25:47.460 |
Then it just started repeating back my examples to me 00:25:56.020 |
First of all, "in distribution" is definitely a better term 00:25:58.460 |
than what I used before, so thank you for that. 00:26:03.540 |
We don't cover that problem in the problem report. 00:26:07.500 |
I actually didn't really know about that problem 00:26:12.340 |
I was saying, what are your commonly used formats 00:26:22.900 |
do not repeat any of the examples I gave you. 00:26:26.420 |
And I guess that is a straightforward solution 00:26:34.580 |
It's just probably a matter of the tasks I've been working on. 00:26:38.140 |
So one thing about showing good examples, bad examples-- 00:26:43.260 |
have found that the label of the exemplar doesn't really matter. 00:26:57.780 |
we're doing Q# prompting for binary classification. 00:27:14.460 |
So let's say one of our exemplars is incorrect. 00:27:20.660 |
Well, that won't affect the performance of the model 00:27:25.140 |
all that much, because the main thing it takes away 00:27:27.860 |
from the Q# prompt is the structure of the output 00:27:33.660 |
That being said, it will reduce performance to some extent, 00:27:37.580 |
us making that mistake, or me making that mistake. 00:27:40.140 |
And I still do think that the content is important. 00:27:44.580 |
It's just apparently not as important as the structure. 00:27:49.620 |
I actually might tweak my approach based on that. 00:27:52.220 |
Because I was trying to give bad examples of do not do this, 00:28:01.140 |
So anyway, I wanted to give one offering as well, 00:28:04.300 |
So for some of my prompts, I went from Q# back to zero shot. 00:28:10.260 |
like fill in the blanks, and then kind of curly braces, 00:28:18.780 |
So Q# is not necessarily better than zero shot, 00:28:21.500 |
which is counterintuitive, because you're working harder. 00:28:24.740 |
After that, now we start to get into the funky stuff. 00:28:27.220 |
I think the zero shot, Q#, everybody can kind of grasp. 00:28:32.100 |
people start to think, what is going on here? 00:28:38.420 |
were tweaking with these things early on saw the take 00:28:43.140 |
and all these different techniques that people had. 00:28:45.660 |
But then I was reading the report, and there's 00:28:48.820 |
It's like uncertainty, routed, COT, prompting. 00:28:59.660 |
And then what's the most extreme, weird thing? 00:29:11.540 |
You have a 10-page paper written about a single new prompt. 00:29:16.540 |
And so that's going to be something like a thread 00:29:18.580 |
of thought, where what they have is an augmented chain 00:29:25.340 |
it's like, let's plan and solve this complex problem. 00:29:33.900 |
And they have an 8- or 10-pager covering the various analyses 00:29:41.420 |
And the fact that exists as a paper is interesting to me. 00:29:51.340 |
because we could test out a couple of different variants 00:29:53.860 |
of chain of thought and be able to say more robustly, OK, 00:29:58.100 |
chain of thought, in general, performs this well 00:30:05.700 |
when you have all these new techniques coming out. 00:30:08.020 |
And us, as paper readers, what we really want to hear 00:30:20.060 |
Yeah, uncertainty-routed is somewhat complicated. 00:30:27.100 |
Complexity-based, somewhat complicated, but also 00:30:31.340 |
So the idea there is that reasoning paths which are 00:30:44.540 |
a bunch of chain of thoughts and then just select the top few 00:30:52.340 |
But overall, there are a good amount of variations 00:31:00.820 |
we put it in here, but we made our own prompting technique 00:31:08.820 |
I had a data set, and I had a bunch of exemplars, 00:31:12.220 |
inputs and outputs, but I didn't have chains of thought 00:31:16.260 |
And it was in a domain where I was not an expert. 00:31:31.380 |
confident in my ability to generate good chains of thought 00:31:39.300 |
So what I did was I told chat GPT4, here's the input. 00:31:46.860 |
And it would generate a chain of thought output. 00:31:48.860 |
And if it got it correct, so it would generate a chain 00:31:53.100 |
And if it got it correct, I'd be like, OK, good. 00:31:56.380 |
Store it to use as a exemplar for a few-shot chain 00:32:07.500 |
and say, rewrite your reasoning to be opposite of what it was. 00:32:12.780 |
So I tried that, and then I also tried more simply saying, 00:32:17.300 |
this is not the case because this following reasoning is not 00:32:21.940 |
So I tried a couple of different things there, 00:32:31.140 |
Have you seen any difference with the newer models? 00:32:33.900 |
I found when I use Sonnet 3.5, a lot of times 00:32:40.700 |
How do you think about these prompting strategies 00:32:45.620 |
I thought chain of thought would be gone by now. 00:32:53.540 |
I knew that they were going to tune models to automatically 00:32:58.620 |
But the fact of the matter is that models sometimes won't. 00:33:02.380 |
I remember I did a lot of experiments with GPT-4, 00:33:08.140 |
So I'll run thousands of prompts against it through the API, 00:33:12.340 |
and I'll see every 1 in 100, every 1 in 1,000 00:33:20.540 |
and it's worth the few extra tokens to have that, 00:33:30.700 |
the model should be automatically doing this, 00:33:36.620 |
I don't know if I agree that you need always, 00:33:38.500 |
because it's a mode of a general purpose foundation model, 00:33:42.140 |
The foundation model could do all sorts of things. 00:33:47.300 |
I think this is in line with your general opinion 00:33:49.620 |
that prompt engineering will never go away, because to me, 00:33:54.500 |
model into a specific frame that is a subset of what 00:33:58.220 |
So unless it is only trained on reasoning corpuses, 00:34:05.820 |
And I think the interesting papers that have arisen, 00:34:08.860 |
I think, especially now we have the Lama3 paper of this 00:34:11.980 |
that people should read, is Orca and Evolve Instructs 00:34:16.820 |
It's a very strange conglomeration of researchers 00:34:21.140 |
because they seem like all different groups that 00:34:25.580 |
of how to train a thought into a model is these guys. 00:34:31.500 |
I also think about it as kind of like Sherlocking. 00:34:38.020 |
That's a nice way of synthetic data generation for these guys. 00:34:45.860 |
with Xunyu Yao, who's the author of Tree of Thought. 00:34:57.260 |
And he mentioned how, if you think about reasoning 00:35:00.340 |
as like taking actions, then any algorithm that 00:35:03.300 |
helps you with deciding what action to take next, 00:35:05.740 |
like tree search, can kind of help you with reasoning. 00:35:19.500 |
What's the state-of-the-art in decomposition? 00:35:26.380 |
It has to deal with how to parallelize and improve 00:35:38.580 |
Of course, the complexity of implementation and the time 00:35:50.460 |
say, make sure to break the problem down into subproblems 00:35:54.700 |
and then solve each of those subproblems individually. 00:35:57.300 |
Something like that, which is just like a zero-shot 00:36:00.020 |
decomposition prompt, often works pretty well. 00:36:04.860 |
a more complicated system, which you could bring in API calls 00:36:11.060 |
and then put them all back in the main prompt, 00:36:13.300 |
But starting off simple with decomposition is always good. 00:36:16.180 |
The other thing that I think is quite notable 00:36:19.100 |
is the similarity between decomposition and thought 00:36:22.780 |
generation, because they're kind of both generating 00:36:27.340 |
And actually, over the course of this research paper process, 00:36:30.380 |
I would sometimes come back to the paper a couple of days 00:36:41.980 |
But my current position is that they are separate. 00:36:47.020 |
you need to write out intermediate reasoning steps. 00:36:51.660 |
need to write out and then kind of individually solve 00:36:56.620 |
I'm still working on my ability to explain their difference. 00:37:00.020 |
But I am convinced that they are different techniques which 00:37:05.420 |
We're making up and drawing boundaries on things 00:37:09.280 |
So I do think what you're doing is a public service, which 00:37:14.220 |
And things may change or whatever, or you might disagree. 00:37:16.820 |
But at least here's something that a specialist has really 00:37:20.920 |
spent a lot of time thinking about and categorizing. 00:37:24.660 |
Yeah, we also interviewed the "Skeleton of Thought" author. 00:37:31.840 |
I think there was a golden period where you published 00:37:34.040 |
an acts of thought paper, and you could get into NeurIPS 00:37:40.040 |
OK, do you want to pick ensembling or self-criticism 00:37:59.680 |
Well, let's talk about another kind of controversial one, 00:38:08.120 |
from the large language model, and the overall strategy 00:38:11.320 |
is you ask it the same exact prompt multiple times 00:38:21.920 |
But whether this is actually an ensemble or not 00:38:27.960 |
We classify it as an ensembling technique more out of ease, 00:38:32.400 |
because it wouldn't fit fantastically elsewhere. 00:38:39.640 |
as well, we're asking the model the same exact prompt 00:38:43.760 |
So it's just a couple-- we're asking the same prompt, 00:38:53.840 |
And the counter-argument to that would be, well, 00:39:10.600 |
But I do think that technique is of particular interest. 00:39:13.720 |
And when it came out, it seemed to be quite performant, 00:39:17.680 |
although more recently, I think as the models have improved, 00:39:21.200 |
the performance of this technique has dropped. 00:39:28.560 |
we run near the end of the paper, where we use it, 00:39:31.640 |
and it doesn't change performance all that much. 00:39:34.440 |
Although maybe if you do it like 10x, 20, 50x, 00:39:41.920 |
hinted at this, is related to self-criticism as well. 00:39:49.000 |
Ensembling and self-criticism are not necessarily related. 00:39:52.160 |
The way you decide the final output from the ensemble 00:39:55.080 |
is you usually just take the majority response, 00:39:59.000 |
So self-criticism is going to be a bit different in that you 00:40:03.560 |
have one prompt, one initial output from that prompt, 00:40:19.360 |
And that's pretty much what self-criticism is. 00:40:21.400 |
I actually do want to go back to what you said, 00:40:23.440 |
though, because it made me remember another prompting 00:40:26.160 |
technique, which is ensembling, and I think it's an ensemble. 00:40:35.160 |
you sample multiple chain-of-thought reasoning 00:40:37.800 |
paths, and then instead of taking the majority 00:40:41.080 |
as the final response, you put all of the reasoning paths 00:40:53.500 |
Or it could see something a bit more interesting 00:40:59.080 |
and be able to give some result that is better than just 00:41:06.040 |
I have an ensemble, and then I have another element 00:41:14.160 |
these things with cost awareness is the question of, well, OK, 00:41:22.200 |
But realistically, you have a range of models, 00:41:24.220 |
and actually, you just want to sample all range. 00:41:27.960 |
want the smart model to do the top-level thing, 00:41:31.220 |
or do you want the smart model to do the bottom-level thing 00:41:37.340 |
I don't know if you've spent time thinking on this, 00:41:39.620 |
but you're talking about a lot of tokens here. 00:41:45.180 |
It's funny, because I feel like we're constantly 00:41:54.540 |
I'm about to tell you a funny anecdote from my friend. 00:41:58.360 |
And so we're constantly seeing, oh, the price is dropping. 00:42:01.660 |
The major LLM providers are giving cheaper and cheaper 00:42:05.580 |
And then LLAMA 3 are coming out, and a ton of companies 00:42:11.860 |
But then a friend of mine accidentally ran GPT-4 00:42:18.380 |
And so you can still incur pretty significant costs, 00:42:22.340 |
even at the somewhat limited-rate GPT-4 responses 00:42:28.380 |
So it is something that I spent time thinking about. 00:42:39.260 |
But my main feeling here is that, for the most part, 00:42:48.100 |
is a really time-consuming and difficult task. 00:42:51.180 |
And it's probably worth it to just use the smart model 00:43:01.580 |
And I figure, if you're trying to design a system that 00:43:15.080 |
for a couple hours, and then using that money 00:43:17.340 |
to pay for it, rather than spending 10, 20-plus hours 00:43:19.940 |
designing the intelligent routing system and paying, 00:43:30.820 |
Of course, you have the time and the research staff 00:43:34.140 |
who has experience here to do that kind of thing. 00:43:40.060 |
does this, where they use a smaller model to generate 00:43:43.740 |
the initial few 10 or so tokens, and then the regular model 00:43:50.100 |
So it feels faster, and it is somewhat cheaper for them. 00:43:58.140 |
But just for listeners, I'll share my own heuristics 00:44:01.940 |
The cheap models are so cheap that calling them 00:44:04.900 |
a number of times can actually be useful dimension-- 00:44:07.620 |
like, token reduction for, then, the smart model 00:44:11.080 |
You just have to make sure it's kind of slightly different 00:44:14.020 |
So GPT-4.0 is currently $5 per million in input tokens, 00:44:22.900 |
If I call GPT-4.0 Mini 10 times, and I do a number of drafts 00:44:26.620 |
of summaries, and then I have 4.0 judge those summaries, 00:44:29.940 |
that actually is net savings and a good enough savings 00:44:35.460 |
given the hundreds and thousands and millions of tokens 00:44:38.100 |
that I process every day, that's pretty significant. 00:44:40.980 |
But yeah, obviously, smart everything is the best. 00:44:43.180 |
But a lot of engineering is managing to constraints. 00:44:51.700 |
a little bit about automatic prompt engineering. 00:44:55.780 |
don't think it's a big focus of prompts, the prompt report. 00:45:01.180 |
You explored that in your self-study or case study. 00:45:09.900 |
going to keep being a human thing for quite a while, 00:45:12.180 |
and that any optimized prompting approach is just 00:45:18.500 |
spent 20 hours prompt engineering for a task, 00:45:29.340 |
DSPy in particular, because it's just so easy to set up. 00:45:36.720 |
need ground truth labels, so it's harder, if not impossible, 00:45:41.740 |
currently, to optimize open generation tasks, 00:45:51.340 |
and I'm actually not aware of any approaches that 00:45:55.580 |
do other than sort of meta-prompting, where you go 00:46:14.860 |
and actually, it's a surprisingly underexplored 00:46:21.540 |
Maybe you have explored prompting playgrounds. 00:46:26.660 |
- I very consistently use the OpenAI Playground. 00:46:30.220 |
That's been my go-to over the last couple of years. 00:46:36.820 |
haven't seen anything that's been super sticky. 00:46:42.220 |
feel like there's so much demand for a good prompting IDE. 00:46:45.820 |
And it also feels to me like there's so many that come out. 00:46:51.020 |
of tasks that require quite a bit of customization. 00:46:54.460 |
So nothing ends up fitting, and I'm back to the coding. 00:47:03.900 |
PromptLayer, Braintrust, PromptFu, and HumanLoop, 00:47:08.300 |
I guess, would be my top picks from that category of people. 00:47:11.540 |
And there's probably others that I don't know about. 00:47:16.100 |
- This was like an hour breakdown of how to prompt things. 00:47:20.660 |
I feel like we've never had an episode just about prompting. 00:47:22.140 |
- We've never had a prompt engineering episode. 00:47:25.180 |
But we went 85 episodes without talking about prompting. 00:47:31.540 |
But yeah, I think a dedicated episode directly on this, 00:47:38.820 |
is, when I wrote about the rise of the AI engineer, 00:47:45.100 |
Like, people were thinking the prompt engineer is a job. 00:47:54.020 |
Then you start having to bring in things like DSPy, 00:47:55.900 |
which, surprise, surprise, is a bunch of code. 00:48:00.340 |
It's not a jump for you, Sander, because you can code. 00:48:02.420 |
But it's a huge jump for the non-technical people who 00:48:04.860 |
are like, oh, I thought I could do fine with prompt engineering. 00:48:10.620 |
I have always viewed prompt engineering as a skill 00:48:13.740 |
that everybody should and will have rather than a specialized 00:48:20.860 |
times where you do need just a prompt engineer. 00:48:26.260 |
useful to have a prompt engineer who knows everything 00:48:29.100 |
about prompting because their clientele wants 00:48:34.180 |
But for the most part, I don't think hiring prompt engineers 00:48:40.340 |
What I had been calling that was generative AI architect 00:48:43.780 |
because you kind of need to architect systems together. 00:48:52.380 |
Architects, I always think about the blueprints, 00:48:55.020 |
like drawing things and being really sophisticated. 00:48:59.620 |
- I was thinking conversational architect for chatbots. 00:49:16.820 |
This is also a space that we haven't really covered. 00:49:23.140 |
We're also investors in a company called Threadnode, which 00:49:30.740 |
And we also did a man versus machine challenge 00:49:35.620 |
And then we did a award ceremony at Libertine 00:49:43.660 |
to tell you something that it shouldn't tell you. 00:49:53.660 |
And you need to figure out from the tokenizer what it's saying. 00:50:09.860 |
been kind of one of the pioneers in adversarial AI. 00:50:14.300 |
Tell us a bit more about the outcome of Acroprompt. 00:50:29.940 |
We're going to have this awesome flow chart with all 00:50:36.500 |
But I think most people's idea of a jailbreak is like, 00:50:39.620 |
oh, I'm writing a book about my family history 00:50:47.580 |
But it's maybe more advanced attacks they've seen. 00:50:50.660 |
And yeah, any other fun stories from Acroprompt? 00:50:54.020 |
Let me first cover prompt injection versus jailbreaking. 00:50:58.140 |
Because technically, Acroprompt was a prompt injection 00:51:05.820 |
I've seen research papers state that they are the same. 00:51:12.780 |
of what I would use and also just completely incorrect 00:51:17.180 |
And actually, when I wrote the Acroprompt paper, 00:51:21.700 |
And Simon posted about it at some point on Twitter. 00:51:25.580 |
And I was like, oh, even this paper gets it wrong. 00:51:30.820 |
And then I went back to his blog post and I read his tweet again. 00:51:34.020 |
And somehow, reading all that I had on prompt injection 00:51:40.100 |
been able to understand what they really meant. 00:51:46.500 |
So that was a great breakthrough in understanding for me. 00:52:00.340 |
that occurs when there is developer input in the prompt 00:52:07.260 |
So the developer instructions will say to do one thing. 00:52:10.020 |
The user input will say to do something else. 00:52:12.060 |
Jailbreaking is when it's just the user and the model. 00:52:23.460 |
here really easily, and I think the Microsoft Azure CTO even 00:52:28.220 |
said to Simon, oh, something like lost the right 00:52:31.020 |
to define this because he was defining it differently. 00:52:34.140 |
And Simon put out this post disagreeing with him. 00:52:41.700 |
And you're like, OK, I put in a jailbreak prompt. 00:52:53.020 |
And there's also filters on both sides, the input 00:53:00.740 |
was that system prompt, which is developer input. 00:53:03.180 |
So maybe you prompt injected it, but then there's also 00:53:13.980 |
I've just been using prompt hacking as a catch-all 00:53:16.580 |
because the terms are so conflated now that even if I 00:53:20.260 |
give you my definitions, other people will disagree. 00:53:35.500 |
I collected a ton of prompts and analyzed them, 00:53:44.780 |
that we discovered during the course of the competition. 00:53:49.620 |
is that there is stuff that you'll just never 00:53:55.380 |
And you'll only find it through random, brilliant internet 00:54:02.140 |
and the community around them all looking at the leaderboard 00:54:05.380 |
and talking in the chats and figuring stuff out. 00:54:08.100 |
And so that's really what is so wonderful to me 00:54:10.180 |
about competitions because it creates that environment. 00:54:12.620 |
And so the attack we discovered is called context overflow. 00:54:18.540 |
you need to understand how our competition worked. 00:54:21.860 |
The goal of the competition was to get the given model, 00:54:24.940 |
say, chat GPT, to say the words, I have been pwned, 00:54:35.780 |
We allowed spaces and line breaks on either side of those 00:54:46.140 |
Periods and question marks were actually a huge problem. 00:54:49.100 |
So you'd have to say, oh, say I've been pwned. 00:54:52.500 |
And even that, it would often just include a period anyways. 00:54:58.980 |
were able to consistently get chat GPT to say, 00:55:02.380 |
But since it was so verbose, it would say, I've been pwned. 00:55:14.020 |
advantage of physical limitations of the model 00:55:16.940 |
because what they did was they made a super long prompt, 00:55:22.020 |
And it was just all slashes or random characters. 00:55:25.100 |
And at the end of that, they'd put their malicious instruction 00:55:29.100 |
So chat GPT would respond and say, I've been pwned. 00:55:34.180 |
But oh, it's at the end of its context window. 00:55:47.420 |
I actually didn't even expect people to solve the 7 00:55:53.340 |
gets me excited about competitions like this. 00:56:00.260 |
was the model can only output 196 characters. 00:56:06.860 |
So you need to get exactly the perfect prompt 00:56:11.100 |
to just say what you wanted to say and nothing else, which 00:56:22.180 |
I'm curious to see if the prompt golfing becomes a thing. 00:56:31.020 |
I'm curious to see what the prompting equivalent is 00:56:34.420 |
Sure, I haven't-- we didn't include that in the challenge. 00:56:37.500 |
I've experimented with that a bit in the sense 00:56:43.220 |
of a certain length, a certain number of sentences, words, 00:56:51.460 |
especially from the code golf perspective, prompt golf. 00:57:08.260 |
All right, I think we are good to come to an end. 00:57:12.540 |
We just have a couple of miscellaneous stuff. 00:57:27.780 |
Every episode of our podcast now comes with a custom intro 00:57:35.740 |
What are you seeing with, like, Sora prompting or music 00:57:40.660 |
I wish I could see stuff with Sora prompting, 00:57:48.460 |
but I haven't had any hands-on experience, sadly. 00:57:57.580 |
but I'm not someone who has a real expert ear for music. 00:58:04.180 |
whereas my friend would listen to the guitar riffs 00:58:09.020 |
And they wouldn't even listen to it, but I would. 00:58:11.860 |
I guess I just kind of, again, don't have the ear for it. 00:58:15.340 |
I'm really impressed by these systems, especially the voice. 00:58:18.980 |
The voices would just sound so clear and perfect. 00:58:29.460 |
Maybe we'll start including intros in our video courses 00:58:42.340 |
And I was trying to get it to output one sphere that 00:58:56.460 |
who works on our videos, I just gave the task to him. 00:58:59.420 |
And he's very good at doing video prompt engineering. 00:59:07.660 |
will always be the thing for me was, OK, we're 00:59:14.100 |
And prompting will be different, more complicated there. 00:59:19.460 |
because I thought, well, if we solve prompting in text 00:59:28.140 |
Because the video models are much more difficult to prompt. 00:59:36.420 |
that of great, hugely cool stuff you can make. 00:59:40.180 |
But when I'm trying to make a specific animation I 00:59:42.580 |
need when building a course or something like that, 00:59:47.740 |
It's frustrating that it's still not the controllability 00:59:51.820 |
Google researchers about this because they're 00:59:58.940 |
The last question I had was on just structured output 01:00:02.300 |
In here is sort of the Instructure, Lang chain. 01:00:05.900 |
But also, you had a section in your paper, actually, 01:00:10.860 |
that scoring, in terms of a linear scale, Likert scale, 01:00:18.940 |
If you get it wrong, the model will actually not 01:00:23.980 |
It just gives you what is the most likely next token. 01:00:26.940 |
So your general thoughts on structured output prompting. 01:00:29.420 |
Even now with OpenAI having 100% unstructured outputs, 01:00:33.140 |
I think it's becoming more and more of a thing. 01:00:35.260 |
All right, yeah, let me answer those separately. 01:00:39.700 |
So for the most part, when I'm doing prompting tasks 01:00:43.780 |
and rolling my own, I don't build a framework. 01:00:50.460 |
And my reasons for that, it's often quicker for my task. 01:01:01.460 |
So you'll have, oh, this function summarizes input. 01:01:06.500 |
it's using some special summarization instruction. 01:01:14.060 |
to be able to say, oh, this is how I did that task. 01:01:22.980 |
But when it comes to structured output prompting, 01:01:24.780 |
I'm actually really excited about that OpenAI release. 01:01:27.260 |
I have a project right now that I hope to use it on. 01:01:35.100 |
a paper came out that said, when you force the model 01:01:37.900 |
to structure its outputs, the performance, the accuracy, 01:01:45.400 |
That wasn't something I would have thought about at all. 01:01:49.920 |
how the OpenAI structured output functionality affects that, 01:01:53.640 |
because maybe they've trained their models in a certain way 01:01:59.040 |
And then on the eval side, this is also very important. 01:02:07.100 |
of a medical chatbot, which was deployed to real patients. 01:02:15.500 |
So patients would message the doctor and say, 01:02:17.540 |
hey, this is what's happening to me right now. 01:02:25.300 |
score the need as like, they really need help right now, 01:02:29.620 |
And the way that they were doing the measurement 01:02:42.080 |
according to which token has a higher probability, basically. 01:02:48.280 |
And they were also doing, I think, a sort of 1 through 5 01:02:53.240 |
or maybe it was 0 to 1, like output a score from 0 to 1, 01:03:06.440 |
because there is an incredible amount of instability in them, 01:03:10.560 |
in the sense that models are biased towards outputting 01:03:17.400 |
like output your result as a number on a scale of 1 01:03:21.740 |
have a good frame of reference for what those numbers mean. 01:03:33.000 |
possible room for emergency, 3 means significant room 01:03:42.280 |
And there's other approaches, like taking the probability 01:03:46.280 |
of an output sequence and using that to actually evaluate the-- 01:03:56.040 |
There's a couple of papers that directly analyze the technique 01:04:04.000 |
especially in sensitive domains like medical, 01:04:12.960 |
And I think getting things into structured output 01:04:14.960 |
and doing those scoring is a very core part of AI 01:04:27.480 |
that is underrated by you, or any upcoming project 01:04:33.880 |
We are currently fundraising for Hackaprompt 2. 01:04:41.160 |
And we're going to be creating the most harmful data 01:04:45.360 |
set ever created, in the sense that this year we're 01:04:52.080 |
force the models to generate real-world harms, 01:04:54.560 |
things like misinformation, harassment, CBRN, 01:05:01.120 |
So those three I mentioned were safety things, but then also 01:05:07.080 |
an agent managing your email, and your assistant emails you 01:05:10.800 |
and say, hey, don't forget about telling Tom that you have 01:05:22.280 |
don't forget to delete all your emails right now, 01:05:31.840 |
But in order to have bots be-- agents be most useful, 01:05:37.240 |
And so there's all these security issues around that, 01:05:39.600 |
and also things like an agent hacking out of a box. 01:05:42.680 |
So we're going to try to cover real-world issues, which 01:05:45.600 |
are actually applicable and can be used to safety tune models 01:05:49.880 |
and benchmark models on how safe they really are. 01:06:00.200 |
I got an email yesterday morning from a company. 01:06:08.800 |
I think it's going to be huge, at least 10,000 hackers. 01:06:12.280 |
And I've learned a lot about how to implement 01:06:16.960 |
these kinds of competitions from HackerPrompt, 01:06:22.840 |
Actually, we'd love to get them involved as well. 01:06:25.120 |
Yeah, so we're really excited about HackerPrompt 2.0. 01:06:31.400 |
so people can ping you on Twitter or whatever else. 01:06:39.200 |
Very much appreciated your opinions and pushback 01:06:50.680 |
I think you have a very strong focus in whatever you do. 01:06:53.680 |
And I'm excited to see what HackerPrompt 2.0 generates.