My name is Karina. Recently I've been working on Claude, which is a large language model trained by Anthropic. Most recently I was working on reducing hallucinations, on making Claude self-correct its answers, and on many other features that went into the Claude 2 launch. So today I'm going to talk about writing principles for task-tuned prompt engineering, and, if you want to use the Claude API, guide you through the best practices and tips that we've found most effective.
So first of all, I'd like to talk about why prompting is hard, and to understand why prompting is hard, we should understand what prompting is in the first place. These models estimate the probability of each subsequent word given the preceding words, so in a way, a well-crafted prompt can increase the probability of generating the desired, accurate phrases.
Due to the attention mechanisms in large language models, the model can focus on specific parts of the input text, so effective prompts ensure that attention is directed properly for the desired outputs. That's why it's important to incorporate task-specific keywords, context, and examples within the prompt to activate the relevant portions of the model's internal knowledge.
And lastly, prompting leads to better results without any extra compute or model retraining; you just leverage test-time compute at inference. So why is prompting hard? Based on my conversations with customers and developers, I think prompting is hard for three different reasons.
First, people know what they want, but they don't know how to get the best performance from the model. I think that's what we're going to focus on today. The second reason is that they only vaguely know what they want, and they don't know how to explain it.
They don't know how best to explain it to the model, so the model gets confused about what the human wants from the task. And the third reason is that they don't know what they want at all. That's pretty bad,
and it's hard for the model to understand. So, basic strategies: if you really do know what the task is, just provide a bunch of examples; the model is good at inferring what you're trying to do just from examples.
The examples should be diverse and should cover a bunch of edge cases. Try to explain as you would explain to a five-year-old, in very simple terms. And what I've found is that you have to be able to iterate a lot and spend a lot of time just prompting.
In a way, based on my experience as a research engineer, I spend the majority of my pairing time just collaborating with people on prompts. In the past, my main experience with the word "prompt" was only in creative writing classes. I graduated from Berkeley, and I took some creative writing classes there.
We usually had exercises, prompting exercises, right? And so we often forget that prompting language models is actually an act of creative writing. I see people get annoyed that their prompts just don't work, but in most cases I think it just means they lack the originality or creativity to think of new, novel ways to make it work.
I recently wrote a blog post about the cultures of writing, and one of the points I make in it is that prompting becomes a new form of writing for any research engineer or scientist who engages in it daily, and this kind of writing requires forming hypotheses.
So you ask: can the model do this? And you want to test that. Can the model self-correct its responses, yes or no? So you start forming hypotheses. Next, you test certain assumptions you've made about the model, and as you iterate more you get new insights: oh, the model's pretty good at this particular thing, but it's not so good at this other thing needed to achieve the task.
So you gain more clarity about what the strengths and the weaknesses are. I wanted to start with broad writing principles, because ultimately prompting is writing, right? The goal is to write prompts that clearly communicate the task objective while providing just enough constraints and guidance to steer the model towards producing high-quality, relevant outputs.
There are four to six writing guidelines that I've found effective, especially working with Claude. Maybe to clarify how Claude is different from GPT models: with Claude, you almost have to treat it as another human. You have to explain things as you would to a five-year-old, you have to be elaborate, and I'll share more examples of how to do that. I think that's a distinguishing feature of Claude compared to GPT models. So the first principle is clarity: use simple, unambiguous language in your prompts, and avoid confusing syntax or vague phrases that could confuse the model.
The second is conciseness: keep prompts short and focused, and include only the key information the model needs. The third is coherence: logically structure the prompt with context at the beginning and a clear task at the end. Consistency: stick to similar formatting; if you use XML tags, use them consistently throughout the prompt.
If you use certain terminology, keep it consistent so you don't push the model out of distribution. Direction: provide genre, length, style, or other guidelines to direct the model's response. Grounding: ground prompts with examples and sources, and make the model quote; if you have a long document in the context, have it quote from the document to support the argument.
Or help the model form and support arguments from search results or other supporting contextual information. And engagement: use diverse, edge-case examples, which is very useful for few-shot prompting. Now I'm going to go through some tasks that I thought would be interesting and see how you can use Claude on those specific tasks.
So the first case is, obviously, a recommendation system. Last year I made this project, Turalia: I used CLIP, so I scraped a bunch of images of clothing items from different brands and used text as the actual search engine.
So you can type "James Bond girl" and "dress", and in a way you get results that are dresses in the style of James Bond. Or you can search for a futuristic outfit. It's more of a vibes-based search, and you can go and look at the shop itself.
Yeah, CLIP is a contrastive language-image model trained by OpenAI. It's open source, and it basically provides embeddings for text and images. The way it works here is that you embed the images and you embed the text, and then you can compute a cosine similarity to find the most similar items in your database for the user's query.
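To make that step concrete, here's a minimal sketch of the retrieval side, assuming the CLIP image embeddings for the catalog are already precomputed; the item_clip_embeddings.npy and item_labels.txt files are hypothetical placeholders, and the text encoder uses the clip-ViT-B-32 model from sentence-transformers as one possible choice.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical precomputed CLIP image embeddings for the catalog (one row per item),
# with the item labels/descriptions stored in the same order.
item_embeddings = np.load("item_clip_embeddings.npy")
item_labels = open("item_labels.txt").read().splitlines()

# CLIP text encoder; it maps text into the same embedding space as the images.
text_encoder = SentenceTransformer("clip-ViT-B-32")

def top_k_items(query: str, k: int = 5) -> list[str]:
    q = text_encoder.encode(query)
    # Cosine similarity is the dot product of L2-normalized vectors.
    q = q / np.linalg.norm(q)
    items = item_embeddings / np.linalg.norm(item_embeddings, axis=1, keepdims=True)
    scores = items @ q
    return [item_labels[i] for i in np.argsort(-scores)[:k]]

print(top_k_items("dress in the style of a James Bond girl"))
```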
I don't know if that's clear; let me know if you have questions. Yes, it's the same idea as multimodal embeddings; you can read the CLIP paper if you're interested. So I was thinking: okay, how could I use Claude in this project to curate relevant recommendations based on the user's requests?
And that's the task. On one side you have the user's input, say "dress in the style of Emma Chamberlain", "blazer like in the Great Gatsby movie", or "an outfit with a futuristic vibe for the Met Gala". On the other side you have an image-to-text database with images and their labels.
The labels can be produced either by the original source, or you can use multimodal models to come up with labels based on the images. And Claude's task is to curate: based on the labels from the images, decide whether this item is relevant and whether it should be recommended to the user.
Is this accurate? Does it match the user? Can I personalize this? And if you look at a very simple curation strategy for the prompt, you can just zero-shot it: I need you to decide whether the item is relevant to the user query.
Here's the user query, here's the item description; is the item relevant, should it be recommended to the user based on the user's query? Say yes or no, and please write the answer in answer tags. Let's unpack this. First of all, Claude really likes XML tags; it really loves XML tags.
I think this is the number one thing people miss, not a mistake exactly, but they don't put anything in XML tags, and so they don't get very good performance. Claude does love XML tags, so you should put everything in XML tags.
And with the XML tags, be consistent: what is the user query, what is the item. You can be very descriptive; I can share more examples later on. And you can see the language here, the way you interact with Claude:
"I need you to decide whether the item is relevant." It almost feels like you're talking to a human. So, this XML thing only came out maybe a month or two ago in the official Anthropic docs; is it something that was intentionally trained for, or did you discover it after?
I mean, XML formatting was the first formatting we fine-tuned on. Later on we discovered that customers need Markdown, or need Claude to use JSON formatting, so we learned from customers, but originally it was XML formatting.
Is that mostly because of the fine-tuning, or because the training set had tons of XML in it? I think it's kind of both. And close your XML tags; oh yeah, sorry, I had a mistake there.
Yes. And so one good thing about XML tags is that it's really easy to extract the strings inside them. Sometimes Claude will preface an answer with extra text, "here's the information, blah, blah, blah", which is one of the most annoying things about language models; but if you ask it to just write the answer inside these tags, Claude will not put any additional information in there.
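As a rough sketch of what that zero-shot curation prompt plus tag extraction might look like in code: the complete() helper wraps the anthropic SDK's completions endpoint as it existed around the Claude 2 launch, and the prompt wording is an illustration rather than the exact prompt from the talk.

```python
import re
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

def complete(prompt: str, max_tokens: int = 300) -> str:
    """One completion call; swap in whatever Claude endpoint/model you actually use."""
    resp = client.completions.create(
        model="claude-2",
        max_tokens_to_sample=max_tokens,
        prompt=f"{anthropic.HUMAN_PROMPT} {prompt}{anthropic.AI_PROMPT}",
    )
    return resp.completion

CURATION_PROMPT = """I need you to decide whether the item is relevant to the user query.

<user_query>{query}</user_query>

<item_description>{item}</item_description>

Is the item relevant, and should it be recommended to the user based on the user's query?
Say yes or no, and please write the answer in <answer></answer> tags."""

def is_relevant(query: str, item: str) -> str:
    out = complete(CURATION_PROMPT.format(query=query, item=item))
    # Because the answer lives in its own tags, extraction is a one-line regex.
    match = re.search(r"<answer>(.*?)</answer>", out, re.DOTALL)
    return match.group(1).strip().lower() if match else out.strip()

print(is_relevant("dress in the style of a James Bond girl",
                  "black silk evening gown with a high slit"))
```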
So here are the results I got; this is through the claude.ai interface. You can see: is this item relevant? It says no. Is this item relevant? Yes. But I don't think that's a hundred percent perfect; it's very zero-shot, so you can iterate, and we'll try to iterate more on this. Number two is that you ask the model to take some time to think, in thoughts tags, about whether the item is relevant or not based on the criteria above, and you let the model reason a little bit more. This is basically chain of thought. You can also add criteria: as part of your critique, consider the following criteria.
So if you want to steer the model: does the item match the specific attributes requested by the user? You help the model think through what it actually means to recommend an item to the user.
Does the item match the season or weather conditions mentioned in the user's query? For example, you should not recommend winter coats during the summer season. So in the criteria you can give more elaborate examples. Another thing to iterate on is to not just answer yes or no: based on your critique, score whether the item should be recommended, where 1 is least recommended and 10 is highly recommended, and put the final score in score tags.
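Here's a sketch of that second iteration with thoughts, criteria, and a score tag, reusing the hypothetical complete() helper from the earlier sketch; the criteria wording and tag names are illustrative.

```python
import re

SCORED_CURATION_PROMPT = """I need you to decide whether the item should be recommended to the user.

<user_query>{query}</user_query>

<item_description>{item}</item_description>

Take some time to think about whether the item is relevant, inside <thoughts></thoughts> tags.
As part of your critique, consider the following criteria:
- Does the item match the specific attributes requested by the user?
- Does the item match the season or weather conditions mentioned in the user's query?
  For example, you should not recommend winter coats for the summer season.

Based on your critique, score how strongly the item should be recommended,
where 1 is least recommended and 10 is highly recommended.
Put the final score in <score></score> tags."""

def score_item(query: str, item: str) -> int:
    out = complete(SCORED_CURATION_PROMPT.format(query=query, item=item))  # complete() as sketched above
    match = re.search(r"<score>\s*(\d+)\s*</score>", out)
    return int(match.group(1)) if match else 0

print(score_item("blazer like in the Great Gatsby movie", "braided cord cropped waistcoat"))
```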
And so how does it work? Here the user query is "James Bond blazer", the item is something I took from some brand, and Claude starts with thoughts tags; overall it seems very relevant, and the final score is nine.
And here's another example: I want to dress in the style of the Great Gatsby movie. Here's the item, a braided cord cropped waistcoat, and the critique is basically that the item is not appropriate for the user's needs based on the context clues in the query; it doesn't match the attributes of the Great Gatsby movie.
It tries to have some reasoning, and so the score is two. And you can be a little more elaborate; this is a very simple iteration on that. Do you guys have any questions? Yeah. So one interesting thing that I saw, now that the XML tags are closed:
my friend is not a native English speaker; his prompts are always in very funny English, but he structures them really well and they work really well despite the English being very incorrect. Why does that work? I think the models are just pretty good at knowledge transfer between languages, and can infer the user's intent very well.
I don't have a really clear answer. Maybe text with one syntax mistake looks close enough to text with the right syntax, so the probability of the right answer is close in both cases. Right, yes, I don't know. In this particular example, I'm curious whether you see any bias in the score; in other words, if you were to look at the distribution of scores,
would it be a normal distribution? Yeah, this is an interesting question; it's one of the questions we ask in our research settings. We have a research group called Societal Impacts,
and one thing we're trying to understand now is, when you summarize news articles and try to evaluate the bias, what the distribution looks like. This is active research. I think it depends on the task, and I really did not test for this;
literally, I did this prompting just yesterday. Cool. The second task: Claude is known for its 100K context size, which means the entire book of The Great Gatsby can be put into the context, and you can ask the model to summarize the book, or ask for other tasks based on that huge context.
And this is basically a test-time-compute thing. So, long context. You can use long context in different ways: one way is to put in multiple documents and try to summarize or retrieve information based on the documents.
Another way to use long context is to have a huge few-shot prompt. As you know, the chain-of-thought technique relies on the stated reasoning faithfully reflecting the model's actual reasoning, and in one of our recent papers we found that that's not always the case.
So basically, what it means is that if you ask the model to do a chain of thought, it might not necessarily attend to that chain of thought to produce the final answer; it might just ignore it or not take it into account.
We call that unfaithful; it's not super faithful. And so we propose in this paper that decomposition-based methods can actually achieve strong performance, specifically on question-answering tasks, sometimes approaching chain-of-thought performance while improving faithfulness. Do you guys have any questions?
Okay. Yeah, faithfulness is, yes, to your prompt. What's decomposition? Let me explain what decomposition is. Here's the graph from the paper. We have three methods. First is the chain-of-thought method: here's the question, could Scooby-Doo fit in a kangaroo pouch?
There are two choices: A, yes; B, no. The chain-of-thought prompt says, let's think step by step, and step by step it gives the reasoning. The human then asks the follow-up question: based on the above, what is the single most likely answer choice? And the model says the correct answer choice is B, right?
Chain-of-thought decomposition is when you ask the model to decompose a question into multiple sub-questions, so that each sub-question is independent of the others, because in chain of thought the steps one, two, three can influence each other, right?
In decomposition, you decompose and put each sub-question into an independent context, so in a way it reduces the bias. So let's see here: sub-question one, what type of animal is Scooby-Doo?
The answer from the model: Scooby-Doo is a fictional character. Another sub-question for the assistant, for Claude: how big is an average kangaroo pouch? And what you can see is that each sub-question is self-contained; it's a very atomic, self-contained question.
So you have multiple sub-questions like this, and then you recompose: you put sub-question, answer, sub-question, answer, sub-question, answer into one context and ask the model, based on the above, what is the single most likely answer choice?
And the correct answer choice is B. In the system prompt, or whatever the user's prompt is, you mention "let's think step by step"; what do you do for the decomposition? Is there a similar input to the model to make it decompose into multiple questions?
Yeah, I can share the prompt on this in a few slides. Any other questions? Could you show the graph again? This graph? Yeah. Okay, let's look at the prompt. It's very hard to see, but I'll share the slides. I'm going to give you legal context, a legal question.
Let's say you have a legal question, and you ask: which of the following is the most persuasive argument that the person is liable to the creditor under the terms of the agreement, and here's the context. So that's the question, basically, and you have choices for the model.
So this is a multiple-choice question, and before it you have a huge few-shot prompt. And basically, to answer your question, it says: I'm going to give you a question; I want you to decompose it into a series of sub-questions; each sub-question should be self-contained, with all the information necessary.
This is really important, and so on. Make sure not to decompose more than necessary; be concise; please put each sub-question in these tags, and include the numbers corresponding to each tag. And the model says: yes, I understand. Then you give it the question,
the multiple-choice answers, and the model provides the sub-questions for you. Then what you do is answer the first sub-question and give it to the model, answer the second sub-question and give it to the model, and the third sub-question you give to the model.
And then at the end you say: based on everything above, with all the context given, answer the question. The correct answer is C. And it's very similar in the legal context: you have sub-questions like, what is consideration in contract law, and so on.
And you can have another model sample here, another model to answer these; it doesn't necessarily have to be one model. And then there's another sub-question, and here's its answer. Do you guys have any questions?
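To put the decompose-then-recompose loop together, here's a minimal sketch, again assuming the hypothetical complete() helper from earlier; the decomposition instructions in the actual paper are longer, so treat this wording as an illustration.

```python
import re

DECOMPOSE_PROMPT = """I'm going to give you a question. I want you to decompose it into a series of sub-questions.
Each sub-question should be self-contained, with all the information necessary to answer it.
Make sure not to decompose more than necessary, and be concise.
Please put each sub-question in <sub_q></sub_q> tags.

<question>{question}</question>"""

def decompose(question: str) -> list[str]:
    out = complete(DECOMPOSE_PROMPT.format(question=question))  # complete() as sketched earlier
    return re.findall(r"<sub_q>(.*?)</sub_q>", out, re.DOTALL)

def answer_by_decomposition(question: str, choices: str) -> str:
    sub_questions = decompose(question)
    # Answer every sub-question in its own independent context, so answers cannot bias each other.
    qa_pairs = [(q.strip(), complete(q.strip())) for q in sub_questions]
    # Recompose: sub-question/answer pairs plus the original question go into a single context.
    context = "\n".join(f"Sub-question: {q}\nAnswer: {a}" for q, a in qa_pairs)
    final_prompt = (
        f"{context}\n\nBased on the above, what is the single most likely answer choice to this question?\n\n"
        f"<question>{question}</question>\n\n<choices>{choices}</choices>"
    )
    return complete(final_prompt)

print(answer_by_decomposition("Could Scooby-Doo fit in a kangaroo pouch?", "(A) Yes (B) No"))
```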
The second thing I want to talk about is how to use Claude to do evaluations, like evaluating Claude on its long-context ability. Let's say you have a lot of documents and you want to understand how good Claude is at answering questions based on a document: is it able to answer questions not just from its pretrained knowledge, but based on the document itself?
So I'm going to give you an example of something we did at Anthropic: a multiple-choice QA evaluation design. Our goal with this experiment was to evaluate techniques that maximize Claude's chances of correctly recalling a specific piece of information from a long document. The document we chose was a government document containing a bunch of meeting transcripts from different departments, and we chose one from July 13th of this year, which is well after Claude's training-data cutoff,
so that the document is not in the pretrained knowledge. And what you do then is use Claude to generate question-answer pairs; in a way, you use the language model to create the dataset.
The way you do that is you split the document into sections and use Claude to generate five multiple-choice questions for each section, each with three wrong answers and one right answer. You then reassemble randomized sets of those sections into long documents that you can pass to Claude to test its recall of their contents.
This is very meta; let me know if you have questions. So here's the prompt to generate multiple-choice questions: please write five factual questions for this section, with some guidelines at the end. And basically we tested different prompting strategies: just asking Claude; giving Claude two fixed examples of correctly answered general-knowledge questions that are unrelated to the government document;
and providing two examples, and providing five examples, of correctly answered questions. And we tested the strategies in different settings: with the answer positioned at the beginning, the middle, or the end of the input, and with 70K- and 95K-token documents. You can look at the prompts and the specifics of how we did this in our blog post.
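As a rough sketch of the dataset-generation step (the section splitting, prompt wording, and tag format here are simplifying assumptions, not the exact setup from the blog post), using the same hypothetical complete() helper:

```python
import random
import re

MCQ_PROMPT = """Here is one section of a document:

<section>{section}</section>

Please write five factual multiple-choice questions about this section.
Each question should have three wrong answers and one right answer.
Put each question, its answer choices, and the correct letter inside <question></question> tags."""

def split_into_sections(document: str, chars_per_section: int = 4000) -> list[str]:
    return [document[i:i + chars_per_section] for i in range(0, len(document), chars_per_section)]

def generate_questions(section: str) -> list[str]:
    out = complete(MCQ_PROMPT.format(section=section), max_tokens=1500)  # complete() as sketched earlier
    return re.findall(r"<question>(.*?)</question>", out, re.DOTALL)

def build_eval(document: str, n_sections: int = 20):
    sections = split_into_sections(document)
    dataset = [{"section": s, "question": q} for s in sections for q in generate_questions(s)]
    # Reassemble a randomized subset of sections into one long test document for the recall test.
    long_doc = "\n\n".join(random.sample(sections, min(n_sections, len(sections))))
    return long_doc, dataset
```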
But basically, the results are this. The metric was, let's see, basically how many times Claude correctly answered the question. And what we find is that for document Q&A, asking the question at the end of the prompt performs a lot better than asking at the beginning,
and you can see it here. Pulling relevant quotes into critique or thoughts tags is helpful; it's a small cost to latency but improves accuracy. We tested on both Claude 2 and Claude Instant, and it seems like you can boost performance a lot more for Claude Instant than for Claude 2.
Basically, the idea is that if you want to do long-document Q&A, put the instructions at the end of your prompt. Those are the results of this. I didn't catch: what was the scratchpad in this? Oh yeah, you just ask the model to put its thoughts in thoughts tags before answering the question,
so it has more reasoning to base the answer on. Can you go to the table again? Sorry. Yeah, but the outcome was basically that putting it at the end matters more than all the other optimizations. Right. Anyway, this is an example of how to use Claude to generate a dataset that you can then use for evaluation.
So this has to do with putting your instruction at the end of the prompt; are there any theories on why specifically the instructions should be at the end? And do we have any understanding of whether there are certain things at the beginning of the prompt that still might be weighted,
or is it a sliding scale where the further towards the beginning of the prompt something is, the less attention it gets? Yeah, I think that's basically the hypothesis. Okay. It's the distance: the model attends more to the end of the prompt than to the beginning.
Okay, so it's not a U shape? I think there was a paper saying it just forgets in the middle or something. Yeah, I think that's the problem with long context that people are trying to fix. So to follow up on that question, you're saying there was a paper that said it remembers the beginning and kind of forgets in the middle.
Yeah. So what you're saying is that for Claude 2, it seems to do best if you give the instruction at the end. Yeah. So that paper doesn't apply to Claude? I did not read that paper. No, I'm just curious; what you're saying is you're finding, at least for Claude,
that the end part gets more attention. Yeah, for a specific task, which is long-context Q&A over long documents; we have not tested on other tasks, to my knowledge. So the prompts you showed were using regular prose and then the XML tags,
which I think is also what's in the Anthropic docs. Have you guys ever done experiments on that kind of format versus Markdown versus everything in XML? Do you have any thoughts on that? Yeah. In general, I think it just works best with XML tags; Markdown isn't quite there.
I've tried using JSON or Markdown, but sometimes it's not as good as XML; with XML it's almost a hundred percent accuracy. Yeah. Let's see. Let's go to another task,
which is that you can use language models to auto-label basically anything. One of the examples we did last year: we asked Claude to come up with category labels for clusters.
This was for a paper, but the approach was very simple: we have a bunch of texts, we embed them, run UMAP, and do k-means clustering; then for each cluster we aggregate all the little statements or claims,
and we ask the model to come up with a category for that cluster. So that's the approach. And you can look at the labels; the labels here are not super good because we used Claude 1.3 at the time. Claude 2 is supposed to be way better at this.
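A sketch of that pipeline, assuming sentence-transformers for the embeddings, umap-learn and scikit-learn for the clustering, and the hypothetical complete() helper from earlier for the labeling step; the model names, the claims.txt file, and the prompt wording are illustrative, not the paper's exact setup.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import umap

texts = open("claims.txt").read().splitlines()  # hypothetical file: one statement/claim per line

# Embed the statements, reduce dimensionality with UMAP, then cluster with k-means.
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
reduced = umap.UMAP(n_components=10, random_state=0).fit_transform(embeddings)
cluster_ids = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(reduced)

LABEL_PROMPT = """Here are statements that were grouped into one cluster:

<statements>
{statements}
</statements>

Please come up with a short category name that describes this cluster.
Write only the category name in <category></category> tags."""

for k in sorted(set(cluster_ids)):
    members = [t for t, c in zip(texts, cluster_ids) if c == k]
    sample = "\n".join(members[:30])  # cap how much goes into the context
    print(k, complete(LABEL_PROMPT.format(statements=sample)))  # complete() as sketched earlier
```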
But this is old; it was from last year. Where are my slides? So one thing you can do with this kind of task is what we call self-consistency: you can generate N samples for the question. Let's say you have a question like, how do you label this cluster?
You generate independently N times and ask it each time to come up with one category. This method is mostly useful for quantitative questions: if you have a math question, you sample multiple times and come up with an answer each time, and the most common answer is the one you select as the final answer. This is called majority vote.
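A minimal sketch of self-consistency by majority vote, again with the hypothetical complete() helper; it assumes a nonzero sampling temperature so the N samples actually differ.

```python
from collections import Counter

def majority_vote(prompt: str, n_samples: int = 5) -> str:
    """Sample the same question N times independently and keep the most common answer."""
    answers = [complete(prompt).strip() for _ in range(n_samples)]  # complete() as sketched earlier
    return Counter(answers).most_common(1)[0][0]

print(majority_vote("What is 17 * 24? Answer with just the number."))
```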
Another technique you can use is to take two generated samples and ask another model to evaluate whether those samples are consistent with each other. If the samples are consistent, you gain more confidence that the answer is correct; if they're not consistent, you just deselect it. Another thing you may want to do with Claude, if it misses the nuance, especially when you're categorizing a lot of labels and have a lot of categories,
is to add contrasting conceptual distinctions to your instructions, and you can do that in multiple ways. One way is to provide a bad example: say, here's a very bad category, and you should never come up with it because it's too narrow or too general,
and this is not what I want. So give contrastive examples, and vary the context: use examples in different contexts and settings; the more diverse the few-shot examples, the better. Use analogies and metaphors:
if the concept is too hard for the model to understand, try to decompose it and bring in an analogy. And point out common misconceptions, especially for categorization. Let's say, what is a false presupposition? Point out the common misconception, clarify why it's incorrect, and provide examples that explicitly show why the common misconception is wrong.
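For instance, a contrastive block of guidance for the cluster-labeling prompt might look something like this; the wording and the example categories are my own illustration, not a prompt from the talk.

```python
CONTRASTIVE_GUIDANCE = """When you choose a category name, follow these distinctions:

<bad_category>Miscellaneous statements</bad_category>
This is a bad category: it is far too general and says nothing about the cluster.

<bad_category>Complaints about the March 3rd billing email</bad_category>
This is also a bad category: it is too narrow and will not cover the other statements.

<good_category>Billing and payment issues</good_category>
This is a good category: specific enough to be informative, broad enough to cover the cluster."""
```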
Yes, do you guys have any questions? Yeah, I can't speak to that here, but yeah: the task is to come up with a category, or to classify, to label that cluster, basically. So here are some very basic tips and strategies for the Claude API.
Number one is formatting. The Human/Assistant format is what Claude loves, and if you miss it, you'll get very, very bad results: newline, newline, "Human:", then newline, newline, "Assistant:". You can also put words in Claude's mouth: you can ask, do you understand it?,
and then put "Yes, I understand it" in Claude's mouth; it's a way to put the model into these modes. Have Claude repeat instructions back: you can say, do you understand the instructions?, and then put in the Assistant turn, "Yes, I understand the instructions", and so on.
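Here's roughly what that looks like as a raw prompt string. The exact wording is illustrative, but the newline-newline Human / Assistant turn markers are the part that matters; the anthropic SDK exposes them as HUMAN_PROMPT and AI_PROMPT, and the completions call shown is the endpoint from around the Claude 2 launch.

```python
import anthropic

prompt = (
    f"{anthropic.HUMAN_PROMPT} Here are your instructions: decide whether each item is relevant "
    f"to the query, and write the answer in <answer></answer> tags. Do you understand the instructions?"
    f"{anthropic.AI_PROMPT} Yes, I understand the instructions."   # words put in Claude's mouth
    f"{anthropic.HUMAN_PROMPT} <query>summer dress</query> <item>red wool winter coat</item>"
    f"{anthropic.AI_PROMPT}"                                        # Claude continues from here
)

client = anthropic.Anthropic()
print(client.completions.create(model="claude-2", max_tokens_to_sample=100, prompt=prompt).completion)
```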
To reduce hallucinations, let Claude hedge and say "I don't know", or "I don't have enough information or context to answer the question". Here's another thing: if you're generating direct quotes and you have a long document in the context, make Claude find appropriate quotes, but also tell it, if there are no quotes in this document that seem relevant to this question, please just say "I couldn't find any relevant quotes", so that it doesn't make up or fabricate new quotes.
How to give good examples: are the examples similar to the ones you need to classify? Are the examples diverse enough for Claude not to overfit to their specifics? Are they equally distributed among answer types? Don't always choose option A; you want that diversity.
Yeah, I've gotten that a lot. Oh, it's the formatting, in a way. Here, I think they didn't put the newline, newline, pretty sure. Oh, here, sorry, yes, here: you've put Human and Assistant inside the XML tags. You only have to use Human and Assistant as the special turn tokens you sample against, but you should never put them inside the context itself.
Either use other words, like "User" and "AI", or "H" and "A", but you should never use Human and Assistant there; Human and Assistant are very special words. If it didn't have the XML tags, would it be okay? It would be okay, but then you would have to alternate: Human, then Assistant, then Human and Assistant again, and so on.
Yeah, the formatting is Human, Assistant, Human, Assistant; you should never have Human, Human, Assistant, Assistant or something like that. That's bad. I think we have more extensive explanations in the API docs, if you want to look at them.
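So if you need a dialogue or transcript inside the context itself, rename the speakers; here's a small sketch of the distinction, with the same caveat that the wording is illustrative.

```python
import anthropic

# Wrong: "Human:" / "Assistant:" appearing inside the context reads as extra conversation turns.
bad_context = "Human: What's your return policy?\nAssistant: 30 days with a receipt."

# Better: rename the speakers inside the context (User/AI, or H/A) and keep the real
# Human/Assistant markers only as the turn boundaries of the prompt itself.
good_context = "User: What's your return policy?\nAI: 30 days with a receipt."

prompt = (
    f"{anthropic.HUMAN_PROMPT} Here is a support transcript:\n\n"
    f"<transcript>\n{good_context}\n</transcript>\n\n"
    f"Summarize the customer's question in one sentence.{anthropic.AI_PROMPT}"
)
```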
I get a lot of questions about what the future of prompt engineering is, and I think the answers are pretty clear. Prompting will stay; we'll just ask the model more complicated, nuanced questions and give it more complicated tasks. With prompt engineering, we're moving towards a world with more and more synthetic data generation.
And I'm pretty optimistic about using models more to generate diverse datasets. You can also use language models to write evaluations, and you use prompting to do that. Reinforcement learning from AI feedback is an alternative to reinforcement learning from human feedback that is a little more scalable; basically, you ask the model to revise its own responses in the process.
So you ask the model to self-reflect or self-revise, and you use prompting in that process. And prompt engineering will become a standard part of product development. Things we did in Claude's products, such as auto-generating conversation titles, were never really done before large language models.
So you can create delightful mini UX experiences like that using just prompting, and you can do personalization: maybe you embed all the user's conversations and suggest new topics for the conversation. You can use models to do that.
And the most interesting thing is finding the most optimal prompts for specific tasks: maybe you want to minimize the number of tokens while getting the highest accuracy on the task. Here are some resources: we just launched a cookbook with some demos on research, retrieval, and search.
We have a prompt design guide in the API docs. You can also read the papers we publish; oftentimes there's an appendix with all the prompting we do. Yeah, thank you so much, and if you have any questions, let me know.
We have five minutes for questions. Charles is coming up. We also have water, thanks to Sean for bringing some in. So, have you tried these with different things besides Claude, with other models, and have you had similar kinds of results? Because you're talking about Claude in this particular case.
Yeah, I think I'm most experienced with Claude because I use it every day; I have less experience with GPT. I have not looked carefully, to be honest, at their API docs, but it seems like the strategy is a little bit different. They don't have formatting requirements the way we do, let's say.
Yeah, I think that's actually one of the directions. I don't remember what the paper was called; "LLMs as optimizers", I think, right? But I guess there are certain tasks that the models are not good at currently; for example, self-correction: the models are not really good at self-correcting their answers.
And can you find a prompt that makes it pretty good at that, or at other tasks that you want? Yeah. I'm curious about what techniques your team is using for actually evaluating the quality of the responses. Yeah, I think it depends on the task.
Sometimes we just have to look manually, qualitatively, at the outputs. Sometimes, let's say you want to evaluate how much the model refuses and whether it refuses in a relevant context or not: you take the generated answers and categorize the refusals into different categories,
and you use the model to do that categorization, and then you just look at the rates. That's one example I can think of. It depends on the task; for some tasks, like hallucinations, you actually have to look yourself. Yeah, that was OpenAI.
Yeah, I won't say too much about this, but I actually have not used function calling from OpenAI or other models extensively. Right. Cool. Yes. So you mentioned something about using LLMs to generate titles for customers;
how do you actually try to evaluate whether the titles are relevant and actually consistent, beyond human review? Yeah, this is actually an interesting question. I worked on the auto-generated titles for claude.ai, and one thing I asked Claude to do is to be like an editor, to have an editorial taste.
And what we did is we actually took previous titles and put them in the context to generate the new title, so in a way it's a little more consistent with the user's style. Yeah, I'm not sure if I can share that.
Oh yeah, we do use this in production. I can show you the claude.ai interface. One thing we changed recently is that if the prompt is pretty short, you sometimes don't need an LLM to come up with a title;
if the prompt is very short, you just use the first few words. But here, let's see: an introduction, and then a request to recommend some books. Yeah. I have a quick question:
did you cover the difference between the Claude models? No, but I can tell you. So, Claude: let's look at it; are there docs on this? When did we announce it? August 9th. Basically, Claude 2 is a larger model; it's a little bit smarter.
It's smarter than Claude Instant. Claude Instant is way cheaper and way faster. But the new Claude Instant is better than Claude Instant 1 at more reasoning-based tasks: it's way better at math, and it's way better at code. Other benchmarks are pretty similar, but I think we specifically trained Claude Instant to be good at math and code.
And it does way better on red teaming, on our automated red-teaming evaluation, so it's more robust to jailbreaks. I really like this model; you guys should use it. Yes? Excuse my ignorance: when you talk about training, when you trained this Claude, was that fine-tuning,
or was that something different? Yeah, fine-tuning. Yes, last question: can you say more about red teaming? What is red teaming? Yeah, red teaming is an interesting concept. Basically, the models are pretty vulnerable to certain jailbreaks. A very simple example: can the model give you instructions for how to build a bomb?
We consider that a jailbreak, and the goal is that in those cases the model should refuse, or not provide any additional information, when it gets unsafe prompts like that. And so this is an internal evaluation that we have.
You can read in the model card we launched with Claude 2 how we specifically do that, but it's basically a measure of how robust the model is to those jailbreaks. Cool. Thank you. Thank you very much. Yeah, thank you.