
Building with Anthropic Claude: Prompt Workshop with Zack Witten


Transcript

All right, good afternoon everybody. Thank you all so much for joining us. We have the enviable position of being after lunch. I'm seeing some cookies on the table still, but thankfully you're not here to listen to me. You're going to be riveted by the prompt doctor who's going to come up in a second.

So I'm expecting no sleeping on the table, but you never know. Excited to be here. I'm Jamie Neuwirth. I lead our startup team at Anthropic. I just wanted to say a couple of quick things before we got going here with, again, the reason you're here, the prompt doctor. We've had a lot of really exciting releases just in the last couple of days, some in the last couple of hours, and wanted to just put these up there to highlight some of the cool things that we're doing, but also share how a lot of folks, not only in this room, some of your peers, maybe folks back at the office, can work with Anthropic on not only some of the prompting, of course, work we're going to do here, but just helping you grow your business and the really cool things you guys are building with Claude, with LLMs, on top of AI.

My team is here specifically to help your company grow and scale the business, whether that's getting access to higher rate limits, getting access to folks like Zack, who's going to be up here in a moment, learning more about what we're doing from an early access perspective. We want to work with you and empower the next wave of really exciting AI companies built on top of Claude.

And we're helping from a product perspective with a couple of the releases you see here. Has anyone tried 3.5 Sonnet yet? Love it. Very cool. Really excited. Yeah, thankfully, Zack has tried it as well, so we're in a good place. Really excited about what we were able to release there as well.

Also, just today, or excuse me, I guess Artifacts came out along with our Claude Teams and Claude AI plan. Artifacts is this really cool tool that some have been playing around with, making video games with a couple lines of text that turns into code, but actually a lot of really cool business use cases from a diagram perspective.

I've seen a lot of cool pharma companies using this and thinking out, how can I put what's in my head on a gene sequencing kind of discussion onto something like an artifact, share that with my peers. All the way to prompting cookbooks, helping, again, folks like yourself. I would imagine a lot of you hopefully will be featured on the website, you know, coming up with what you'll be able to prompt for these use cases in your company.

So just some ways that we've, you know, things that we've come out with over the last couple of days, a couple of hours when you think of Claude Teams and projects, and then ways to connect. Feel free to reach out at sales@anthropic.com. We have our head of DevRel, Alex, here as well.

So we really love this community, love staying engaged. We're really excited about what we've released over the last couple of days to be able to help you again in what you guys are building. And so I'm going to leave all that to the prompt doctor as well to get that, to really make sure that we can help.

And Zack, why don't you come on up here and see what we can do. Thank you all so much for joining us today. We're really excited. Thanks, Jamie, for the intro, and thank you all for coming. This is really awesome. I had no idea this many people were going to be here.

So thanks. Okay. So not going to be much talk. It's mostly just going to be straight up prompting from beginning to end. Did make a couple slides. So mostly about what to bring. So I set up a Slack channel in the AI Engineer Slack. It's called PromptEng Live Workshop Anthropic.

So that's where you can upload prompts. And what we're going to do on stage is I'm just going to sort of look at them. I'm going to read them. We're going to test them in the console, the Anthropic console. We're going to see if we can get better results.

And we'll just sort of like try to learn as we go. So this is something that I do like internally in our team Slack quite a bit. But I've never done it in front of this many people. And it'll be exciting. It'll be fun. Might be some hiccups along the way.

But hopefully you all have a good time too and maybe learn something. I know I'll definitely learn something. So what kind of things should you put in this Slack channel? So you can put a prompt template. So a prompt template is kind of like a prompt. Actually, I just realized I don't even need this mic.

Okay. So you put a prompt template, which is like a prompt, but with spaces where the variables are going to go and the variables, they're going to be denoted with these double brackets. So in this case, it's like this document part. If you don't have it in this format, that's fine.

We can figure it out. This is just like the ideal. So this is like the prompt template. This is like the kind of thing you'd put there. And then you can also have a couple examples of cases where it doesn't do what you want. And that will give us the like direction as far as like where we want to go with it.
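As a rough sketch of what that looks like in practice -- the template text and the {{DOCUMENT}} variable name below are illustrative placeholders, not a template from the session:

```python
# A minimal prompt-template sketch: {{NAME}} placeholders get swapped for real values at call time.
PROMPT_TEMPLATE = """You will be answering questions about the document below.

{{DOCUMENT}}

Answer the user's question using only information from the document."""

def render_prompt(template: str, **variables: str) -> str:
    """Replace each {{NAME}} placeholder with the supplied value."""
    prompt = template
    for name, value in variables.items():
        prompt = prompt.replace("{{" + name + "}}", value)
    return prompt

prompt = render_prompt(PROMPT_TEMPLATE, DOCUMENT="...your document text here...")
```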

I might also like ask you questions out loud if I have questions about like what kind of output is good or not, or I might ask questions in Slack either way because it's easier. We'll have to kind of figure that out as we go. Okay. So that being said, we're going to use the console for iterating mostly, although I might use a couple other tools like Claude for Sheets, which is like a spreadsheet where you can call Claude.

Okay. So, yeah. Let's see what we've got in the Slack already. Okay. We have something here. Thank you, Gordy. So, you're an expert patient. So let's put this into the console and then let's take a look. Okay. And I'm just going to go through as many of these as we can get through in the session.

And, yeah, this is pretty much what it's going to be. So, first of all, we can probably capitalize all the sentences. Does that matter? Does that matter? Yeah. Hi, I'm Gordy. Thank you. Yeah, yeah, yeah, yeah, yeah. Perfect. So, does it matter having capitalization? I think so. Okay. A lot of things, like, prompt engineering is, like, it's very new, right?

So, like, we don't know for sure. Somebody out there might have done a study where they, like, conclusively show that using capital letters and, like, using, like, better grammar, fixing grammar mistakes help. I have, like, anecdotally found this in a few cases. I also have read some, like, quantitative stuff showing that, like, typos do hurt performance.

But I'm also just, like, pretty obsessive about this stuff. So I just fix it. And I think it, like, it definitely doesn't hurt. Okay. Can you zoom in? Can I zoom in? Great question. Is that any better? No. A little more. Okay. Is that any better? Okay. So, first thing, let's put information in XML tags.

So, we can go like this. Why not, like, markdown? Why XML? Why not markdown? Another great question. So, Claude was trained with a lot of XML in its training data. And so it's sort of seen more of that than it's seen of other formats. So, it just works a little bit better.
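Concretely, a sketch of the same prompt with its pieces wrapped in XML tags might look like this (tag names and contents here are illustrative, not the exact prompt from the session):

```python
# Information first, instructions next, user input last -- each section clearly delimited with XML tags.
prompt = """<information>
Medication review details, clinic hours, scheduling policies, and so on.
</information>

<instructions>
1. Act as the clinic's conversational assistant.
2. Be concise and offer only relevant information.
3. Only use facts found in the information above.
</instructions>

<user_input>
{{USER_INPUT}}
</user_input>"""
```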

So, this looks like all the information here. So, we have the medication review. We will be, okay. Can you run it before and after? So, before, then first iteration, and then the final iteration, and see how it comes first? Yeah. Great call. Okay. Actually, let me undo everything that I've done so far.

Okay. So, we can run it here. And then, now, in the console, it's asking us for the user input. So, do we have a user input? No, I gave you a sample of the flat. Okay. Perfect. So, who are you? Let's do this one. Why do I need to?

Yeah. We can do them both. Who are you? Okay. Okay. So, Gordy, what do you think of this? It's way too long. This is a conversational agent, so no more than one sentence. Okay. It's too long. We can probably fix that. So, well, we can also use this evaluate tab, so let's just, like, add all the test cases here.

So, this is the evaluate tab of the console. I'm also going to be doing, like, some showing off of the console features, because I think it's a cool tool for prompt iteration. There's also some secret new features that I might show. We'll see about that. Okay, so then we also have, why do I need to do this experiment?

Looks like it added a bunch of new lines. Let's definitely get rid of those. And then we can get this next one. Can I schedule it tomorrow instead? Cool. So, I hit run remaining. And this is all running through Dove, sorry, through, okay, so we have, why do I need to do this appointment?

So, this looks pretty long as well. And here, we have this, like, I apologize, but I don't have information about scheduling Okay, so, is that true? Is it true that we don't have information about scheduling? No, we are always available, make it easy for that. No, 24 hour availability.

24 hour availability. Okay. All right, so this is, like, the version one. Now let's make some changes. So, first of all, actually, I'll try to do things, like, roughly in, like, some order of lines. So, maybe I won't make you all sit through the capitalization, even though I, like, would definitely do that.

I'm also going to add a new line here, just because I think that's, like, more normal what you'd see in, like, a written document, you'd have a new line. We'll close the information tab. What's that? Yeah, I could, actually. That's not a bad idea. The one thing I wouldn't feel completely confident of is that it would, like, exactly transcribe the rest of everything, like, word for word.

I think it probably would. What I actually might do is just, like, have Claude write code to capitalize every first word of the sentence. Then I'd be worried about edge cases, like, what if there's, like, ellipses? But that kind of thing is definitely useful. And like, I definitely use Claude a lot in writing prompts.

For instance, like, we have, like, a Claude tool that, like, helps complete code, basically, and I do a lot of prompting in that IDE, because, like, especially with, like, very nested XML tags. It helps a lot just, like, suggesting the closures of them, which is, like, pretty obvious, but still takes a long time to type.

So, yeah, if you have any sort of, like, co-pilot type thing, definitely that's, like, a good environment for writing prompts. Okay, now let's do the same thing with this instructions. And we can do this. It looks like this one, like, didn't get a number, so let's, like, do that.

Yeah, so the key thing in terms of XML, I think, is just, like, really, XML isn't even that important. The most important thing is just clearly separating the different parts of the prompt. Yeah, exactly, it's, like, here's this stuff, here's this other stuff. Like, if we wanted to, we could do something like -- like, I wouldn't do this, but, like, like, I think it would probably work fine.

Yeah. Okay. So this is all fine. Let's also do the same thing with user input. Now we can run, we can go back to the evaluate tab, and we can hit rerun all, and it's using our nice new prompt. Still looks pretty long. But we can also see how it does on the last case where -- okay, so here it's still said I don't have access to the specific scheduling information.

So let's try and fix these two things. So first of all, we can make it shorter, so -- do we have anything here about, like, making it shorter? What's that? Rule 7. Rule 7, okay. Be concise and offer only relevant information. Oops, let's actually do this. Don't want to misnumber here.

And be concise. So, like -- and offer only relevant information. Each response should be -- or let's be a little bit, like, less prescriptive to give Claude, like, a little bit more room. Like, if we say, like, every response should be, like, exactly three sentences, that might be, like, a little too constraining.

I'm just guessing. So we could just say, like -- so why is "responses should be two to four sentences" better than telling it to be concise? Concise is arbitrary. Yeah, concise could mean a lot of different things to different people. Like, in some cases, like, concise might mean, like, literally only one word.

In some cases, like, if you ask for, like, a concise book review, we might be looking at, like, you know, a single-page Word doc, and that would be concise in the context of a book review. So, yeah, Claude is, like, trying to guess what you mean by concise. What if you also have a longer prompt, so you have a really long prompt.

Sorry, one sec. Go ahead. You have a really long prompt, so if you have a long system prompt with a lot of detailed instructions saying be concise, you're not going to get something super, super short. I think that's right. I think the tone of the prompt -- so, what he was saying, if people couldn't hear, is, like, the prompt is long, so the response might also be long.

I don't think that's, like, definitively true. Like, you can have long prompts that give short responses or short prompts that give long responses. But it's more like if you don't say anything, it might pick up on some of those, like, context clues. You were saying something over here? Yeah.

Yeah. So, let me actually get to that after we do this. So, this two to four sentences, it looks like it's still pretty long. I think maybe that's actually, like, longer than necessary, so maybe we should make it, like, one to two sentences. Let's try that. Never more than three.

Okay. Now we can try that here. Okay. That looks better, right? This is definitely shorter. Okay. And it also seems that it is giving variable numbers of sentences, so these were both two sentences and then this one is three. So one of the questions over here is, like, can the LLM figure out that it should do longer responses in certain situations and shorter in others?

So it seems like it did that here. Okay. So then the next point was that in this case it shouldn't say that I don't have access to the scheduling system or specific appointment times. What should it say instead? It should say, "Sure, what time tomorrow." But intentionally in the prompt, I left out that it's a 24-hour service.

So this is a case where we're asking a question that's not present in the information it has. Okay. So, yeah, I mean, we could add something like, "You're open for 24-hour service," but you're saying you want to test its ability to, like, figure out how to do it without that.

It is saying I don't know. So... Oh, okay. It is saying I don't know, which is good. Okay. Well, then we're doing great. All right. Should we... Anything else that you wanted to get out of looking at this example, Gordy? I think that was... Oh, about the structure. So does the order of the rules or the order of putting information, then rules, or rules first, then information, does any of that matter?

Yeah. So does it matter what order we have these components? Yeah. I think it's better to put the information above the instructions. Okay. We've sort of found that instructions are more tightly followed the closer they are to the bottom of the prompt as a rule. This doesn't necessarily apply in all situations, so definitely test it out.

That actually is, like, a blanket statement that applies to everything that I say, but particularly for that. Yeah. Okay. Can I ask a question? Yeah. I don't know... I don't know if you were asking all the questions or... No, no, no. Go ahead. Okay. Okay. Okay. If there is some difference to the exclamation mark here, I noticed that it's...

Exclamation mark? The exclamation mark. Yes. Oh, remember. I don't think I added that on... It looks like you added that, Gordy. The exclamation marks, just as they emphasize things for humans, they also emphasize things to the model. Do you think that has more of an effect over the numbers or less, like, in terms of relevant?

Ooh. Yeah. I don't know at that level of detail, and I think it's dependent on context as well. But yeah, if you want to emphasize things, like, capitalizing them or putting exclamation marks or, like, just saying, "This is extremely important," that all does do something. Yeah. So, the tokenizer, though, I'm taking a look at some of your tokenizer codes, but it doesn't seem like exclamation points actually, like, do anything, really.

Like, the tokenizer kind of binds them into the word. It does something. It does something. That's all I can say. Just anecdotally, if you put exclamation marks in, it's different. I guess... Yeah. That's my analysis of it. All right. Okay. I think let's go to the next... How many we got?

We got six here already. That's pretty good. Okay. This is just a general question. I'll just answer this really quick. In general, for translations or multilingual output, is it better to instruct in English or in the native language? I think it's better to instruct in the native language if you speak the native language.

If you only speak English and you're choosing between, like, a better prompt that's written in English versus, like, a worse prompt that's written in a language that you don't understand, I would probably default to writing it in the language that I knew super well. But ideally, I think for the ideal prompt, you would find a native speaker of the language and explain your use case to them and have them write the prompt.

So, is that not the same question that I just answered? Is it different? I think it's better to have the prompt in that language. In general. If you can. If you can write a really good prompt. All right. Let's go to this next guy. So, you'll be acting as a test reviewer.

Let me pump up the size here, too. Okay. Not sure if there's a way to... Oh, I can hide this. Okay. Great. Responsible for improving unit tests based on a set of requirements. Below is the project directory. Project path. Don't include any other explanation, conversion, or output besides the JSON.

Okay. This is great because it's going to let me show off prefills. So, let's make a new prompt here. Let's paste this in. I have to use double brackets instead of single brackets to get variables in the console. And then I think there's another... But this JSON string... Who gave this prompt?

This is from Dan. The JSON string, Dan, that's... So, that's... Is that variable or is that... Is that like an example in this prompt template? So, in this case, that's... That is just the template and then we... So, basically, we have one agent to write the unit tests and then...

Yeah. The test reviewer agent does not always return JSON. Yeah. Like, almost all of this is just like workarounds for the fact that it doesn't always speak JSON right. Like, you can see how many times we said that. Yeah. And then... Do you have an example input here?

Uh, so that first... If you go up a little bit, sorry. So that first comment is from the unit test writer. So that is the input. Like, the unit test writer writes a bunch of unit tests. And then the reviewer reviews it and makes them better.

So everything... Everything here is... Is what I should put into... Yes. And then you can see that second set. This is... This is a good result where it writes JSON, which basically says, cool. Update this file with these unit tests and here's the modifications I made. That sort of thing.

Okay. So in this template here, where would the thing that I just copied go? Well, so essentially we don't provide it in line with the prompt. We just provide the conversation and then this thing jumps in as a separate agent. So, like, the context window is gonna have the unit tests in it.

But we're saying respond in this format given the unit tests that are earlier in the conversation that you're picking up. So the thing that you just pasted is, like, step three of a multi-shot conversation? Sorry, not multi-shot, multi-turn. Yeah, yeah. The thing I just pasted was two shots, right?

Unit test writer and then a unit test reviewer. And the reviewer is the one that's having the problem. It comes second. Okay, so it would be something like this. Here are some unit tests written by a unit test writer bot. Right. Okay, and then we have this. Now, you didn't...

Okay, so this is basically how it works. Does this look right? No, it does. I mean, this... Okay. We just... We do this in sort of this larger conversation, not just as sort of a standalone prompt for this one agent. Yeah. Okay, so then here I put the unit tests in.

Right, yep. And then for project path, what sort of thing should I put there? Anything. I mean, this is just like the local directory that's gonna be modified. And so it actually has access to the files in that directory, and it'll fill in its own sort of what files to modify.

Okay, so then I'll just put... That's fine. Something like that. Yep. Okay, so let's see if it comes out with the JSON or not. Bated breath here. Yep. We did do most of our tests on Claude 3 and not 3.5, so 3.5 is probably a little better. Okay. If we haven't done...

Yeah, I mean, if it makes it more realistic, we could also switch the model version to use... Oh, no, I mean, we're gonna upgrade. So I'd rather see it with this. Okay. What was that? Is the temperature being set to zero intentional? Yeah, I usually, for knowledge work, I usually have the temperature set to zero.

Not 0.01? Like, I usually use that for hallucinations. I mean, I'm just gonna go on here. But, like, is that... Like, am I totally out... I think using temperature zero, you'll probably get, like, marginally fewer hallucinations. Okay. Oh, here we go. Okay, so it looks like in this case it did output JSON, I think.

Yeah, that looks plausible. Okay. It's very long JSON. I guess that explains why it was taking so long. Looks like it actually even ran into the max output tokens because it didn't finish its JSON. Aha. Since this one is going kind of slow, I will test it with Haiku. And let's also increase the max tokens to sample so that it doesn't run into that issue.

So what I'm really hoping is to get a case that doesn't output JSON so that then I can fix it and then it will output JSON. If not, I can still say, like, how I would fix it. Yeah. That would be great. Just honestly, any comments you have just on how we structured things.

Okay. Yeah. So, I mean, this is, like, definitely, like, a big request from people is, like, how do I make sure the model outputs JSON? The most reliable way to do that, I feel, is using the assistant pre-fill. So, maybe some of you have used this feature before. Maybe some of you have, like, only used other models such as GPT that don't offer this feature.

Something that you can do in the Claude API is partially pre-fill an assistant message. So, what you're doing there is you're putting some words in Claude's mouth, as we call it. And then when Claude continues from there, it assumes that it's already said whatever you told it that it had said.

And that can help you get it on the right path. So, for instance, in this case, if we want to make sure... So, the classic, like, bad response from Claude, when people give it prompts like this, where they want to get JSON, is Claude would say something like -- I'll just, like, add another message just so I have somewhere to type.

It might be, like, here is the JSON, right? Have people seen stuff like this? And this part right here is, like, very annoying. And difficult to get rid of. So, okay. So, I have two strategies. Let me actually just give, like, the simplest one. They both, though, they both require a tiny bit of post-processing at the end.

So, let's start by... Let's actually, like, take out all this stuff about make sure to only respond in JSON. That could be one way to get it to not do the... To be bad. So, we could just go like this. Let's try to make it not do JSON. Let's get rid of all this stuff.

Okay. So, here now, maybe it will do the preamble thing that we don't want it to do. Perfect. Okay. So, an easy way to get it to not do that is to just take this and then put it in the pre-fill. So, it thinks that it already said that.

Like this. Okay. So, if we do that... Just the JSON. So, what we're doing here is... You can think of Claude almost like a... Like a child who's just, like, misbehaving and it wants to do something and you're like, don't do the thing, but it just keeps doing it because it just loves preamble so much and it has this, like, innate desire to do them.

So, one way is to, like, argue with it a lot. But, like, if you have a kid, sometimes, you know, you just have to, like, let them do the thing that they want to do and then they'll get over it. So, in this case, that's basically what we did.

We just gave Claude this pre-fill where we let it do the thing. So, as far as it's concerned, it already did the thing. And then from there, what it's outputting is JSON. Now, if you want to make this even more reliable, you can put this nice little bracket here.

We just gave quad this pre-fill where we let it do the thing. So, as far as it's concerned, it already did the thing. And then from there, what it's outputting is JSON. Now, if you want to make this even more reliable, you can put this nice little bracket here.

And then it's like, oh, dang. Like, I'm really in JSON mode now. Like, I'm really... My JSON has actually already begun. So, at this point, it's definitely not going to do the preamble. The only thing here is if you sample with this pre-fill, you will need to add the bracket back before you try to do your JSON.loads or what have you.
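Here is a minimal sketch of that prefill pattern through the API, assuming the anthropic Python SDK; the model name and prompt text are placeholders:

```python
import json
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # placeholder model name
    max_tokens=4096,
    temperature=0,
    messages=[
        {"role": "user", "content": "Review these unit tests and return your updates as JSON.\n\n<unit_tests>\n...\n</unit_tests>"},
        # The prefill: a partial assistant message that Claude continues from,
        # so it skips the "Here is the JSON:" preamble entirely.
        {"role": "assistant", "content": "Here is the updated JSON:\n{"},
    ],
)

# Claude's completion starts *after* the prefilled "{", so add it back before parsing.
# (This assumes the completion is the rest of the JSON object with nothing after it.)
data = json.loads("{" + response.content[0].text)
```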

Because Claude is... Since you told it that it had already said the opening bracket, it's not going to give you another opening bracket. Okay. So, then another thing that you can do is return the JSON in JSON tags. And then, if we do this without the pre-fill, let's try it without the pre-fill.

You don't normally capitalize JSON? I'm not a software engineer, okay? I'm a prompt engineer. I don't even know that it's capitalized. For all I know, that's just like an English word. But, yeah. Good. Thank you. Okay. So, here we see it did the thing. So, it gave its preamble, right?

And then, it gave the JSON tag. Everything within the JSON tag is JSON. And then, at the end, it closed this JSON tag. So, again, this requires, like, the tiniest smidgen of post-processing, where you're saying, like, you just... It's like a regex. You're just, like, take everything within the JSON tags and then use that.
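That post-processing step might look something like this sketch (the tag name follows the session; the regex approach is just one way to do it):

```python
import json
import re

def extract_json(completion: str):
    """Pull out whatever sits between <json> tags and parse it, ignoring preamble or trailing text."""
    match = re.search(r"<json>(.*?)</json>", completion, re.DOTALL)
    if match is None:
        raise ValueError("no <json> block found in the completion")
    return json.loads(match.group(1))
```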

You can even combine these two techniques. So, you could say, here's the updated JSON. And now you give it the JSON tag. We can even put a bracket here. And now what we'll see is it will just give the JSON minus the bracket, and then it will close the bracket, and then it will close with the JSON tag.

There we go. And it also gave this... So, you can see it did... At first, I was, like, a little bit panicked because I didn't see the close JSON tag at the very bottom. But then I saw that it actually did include the tag up here, and then it gave this little explanation afterwards.

So, this is another useful thing. This will save you some time and tokens and trouble. One thing we could do, like, it costs you money to get Claude to output all this stuff. And you probably don't need it. In most cases, you don't need the explanation. You just need the JSON.

So, one thing we could do is we could say, do not include any explanation after the JSON. I mean, probably. But I don't know. Honestly, I don't yell that much. I'm just, like, this is actually meant to be my parody of, like, what a frustrated prompt engineer would write if they, like, couldn't get rid of this.

But in practice, you might not need to do that. But the simpler way to do this, there's a... And we're getting outside the realm of prompt engineering for a second and into the world of, like, API parameters, but that's okay. There's a parameter that's called stop sequences. And if you set...

So, we told it to return the JSON in JSON tags, right? So, there's no functionality to do this in the console, so I can't show it off at this exact moment. But in the API, there's a parameter called stop sequences. And if you add this close JSON tag that I've highlighted with my mouse, if you add that to the stop sequences, then it will just hard stop after it outputs those, and you don't even have to worry about telling it not to continue from there because it just won't even sample from there.
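A sketch of what that looks like in an API call (anthropic Python SDK assumed; model name and prompt text are placeholders):

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # placeholder
    max_tokens=4096,
    # Sampling hard-stops as soon as Claude emits the closing tag,
    # so nothing after it is generated at all.
    stop_sequences=["</json>"],
    messages=[
        {"role": "user", "content": "... Return the JSON in <json> tags ..."},
        {"role": "assistant", "content": "Here is the updated JSON:\n<json>\n{"},
    ],
)
```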

You won't be charged. So, one of the things that I'm sort of hoping to impart with this talk is that a lot of times it's cheaper and easier to do a little bit of work outside of a call to the LLM and not even worry about prompting, because prompting can sometimes feel non-deterministic.

You don't know what the model is going to do. So, when you can offload stuff to code, especially if the code is really easy to write, it's like, just do that, right? Like, don't put a bunch of stuff in the prompt about you must output JSON, just use the prefill, and then parse it out with the regex.

You know, don't add a bunch of stuff about how you have to stop after you say a certain word, just add it to the stop sequences. So, like, simple is better, and falling back on code is better than relying on prompts. Yeah? Is the prefill available through the API?

Yes. The prefill is available through the API. What you do is you include an assistant message as the last message in the messages list. And when I say an assistant message, I just mean a message where the role is set to assistant. And what would have happened if you had the text that you put in the prefill, you just put it into the last line of the instructions?

So, in other words, if I said -- I'm actually not sure. That's a good question. So, let's actually try that. I genuinely do not know how Claude will respond to this. So, let's see. So, it looks like what it did was -- it looks to me like what it did without looking at this JSON is it included an additional open bracket.

Right? Because it's supposed to already have started with an open bracket, but here it started with an additional open bracket. So, it kind of almost worked, but not quite. Anyways, I don't recommend doing this, but that was fun just out of curiosity's sake. Yeah? So, after the sentence where you wrote, like, write the response in the JSON format, in the next line you can just write JSON, colon, and then leave it.

I think it will -- Oh, so, like if I said something like this? Like this? Yeah, that's it. Yeah, I could see this working. Yeah, it looks like it worked pretty well. I think there's like a lot of ways to accomplish this. I think the ways that I showed are the most reliable.

So, that's what I would like officially recommend. But yeah, like definitely experiment. If you were going to like try to use this for production or whatever, what -- like these exact kind of things you're playing around with right now, how would you think about testing that like at some sort of scale?

Like -- How do we test it at some sort of scale? Yeah, more than like one -- like the one-shot test we just did, right? Yeah, yeah, yeah. To test it at scale, you need a bunch of test cases. And if you don't have test cases, okay, this is maybe a good time to maybe show off this thing, although I'm actually not sure if it will work.

So, I guess a more pointed question is like in this case, I think test cases are useful when I'm writing a prompt to deduce whether like does asking it to think step by step lead to this thing being more accurate. But in this case for formatting, I guess what I'm wondering is like could you have this prompt and then feed in the output and the prompt and then ask the model itself to evaluate like how good these various things are at following the instructions?

Yeah, yeah. Okay, so can we do model grading? Can we model grade the outputs? Yeah, especially for formatting related things. For formatting, I would not model grade the outputs. Okay. Because formatting is something that I can check in code. So, if I can do anything in code and I don't have to call the LLM, the LLM is like this crazy black box, right?

It's like if I don't need to like make this pilgrimage to the Oracle and like ask it, I'd rather just do it like in code. So, formatting specifically, we're kind of like in luck. It's easy to check. For something like the previous prompt we looked at, where the outputs are a lot more squishy, possibly model grading could work. Possibly we might need a human to evaluate the answers.
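For formatting, the code-side check can be as simple as this sketch (test_outputs is a hypothetical list of completions you've already collected):

```python
import json

def is_valid_json(completion: str) -> bool:
    """Return True if the completion parses as JSON, without calling the model to grade it."""
    try:
        json.loads(completion)
        return True
    except json.JSONDecodeError:
        return False

test_outputs = ["..."]  # placeholder: completions from your test cases
passed = sum(is_valid_json(o) for o in test_outputs)
print(f"{passed}/{len(test_outputs)} outputs were valid JSON")
```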

Possibly we might need a human to evaluate the answers. So, just to lightly push back on that. Yeah. I'm wondering like I actually put an example in the Slack channel. We don't need to get to it because we're talking through it now. But like for, let's imagine I don't have, or actually maybe tags are the answer to this.

Like imagine I'm asking for a summary or something and then I want to deduce whether there's additional chat like content like before or after that. In that case would you, I would have, my mental model would have been to use like an LLM as a grader, but it sounds like maybe would you encourage instead using the summary tags and checking like hard coding for additional text around that?

Yeah. I think that will be pretty quick and easy to do. Also just having the summary be in summary tags is like generally a good practice. Okay. I generally have all my outputs inside tags to make it really easy to extract them. I don't think there's really any downside to doing that.

So, and it might even be that by doing that you effectively fix your entire issue and you don't even like need to do the test anymore. Or, and you just put close summary in the stop sequences and you're kind of good to go. Okay. Cool. But that also does sound like a problem that an LLM could grade.
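A sketch of that hard-coded check -- extract the <summary> block and flag any extra chat around it (the tag name follows the suggestion above):

```python
import re

def extract_summary(completion: str):
    """Return (summary, clean) where clean is True if nothing appears outside the <summary> tags."""
    match = re.search(r"<summary>(.*?)</summary>", completion, re.DOTALL)
    if match is None:
        return None, False
    leftover = (completion[:match.start()] + completion[match.end():]).strip()
    return match.group(1).strip(), leftover == ""
```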

Okay. Let's go to the next prompt here. Just shout out your question. Here's a very poorly formatted Excel spreadsheet. Um, I got a question real quick. So, um, this seems like a really ridiculously like powerful attack vector. So, can we test the prompt real quick? Um, I don't want to get into too much like jail breaking stuff here.

Sorry. Okay. Apologies. Yeah, yeah. That's kind of my specialty. Okay. Yeah. Um, I'm going to go to the next prompt. Um, so what do we have here? Here's a poorly formatted Excel spreadsheet slash CSV. Please extract all data into JSON. Okay. How can we, so, uh, Jan or Jan?

Is it Jan or Jan? Jan. What, what is the actual text that I should? Can, can you, can you paste the text here? Because I don't know how to get the CSV into the, uh, into the console. I've just been, oh, hey, thanks. I've just been just copying the entire CSV.

I'm putting that into the prompts. Again, I've been trying to use Claude to extract some information from spreadsheets. And it's always been very, very hard. It hallucinates a lot or it skips a lot of stuff. And I was wondering, maybe more generally, how do you have Claude analyze really poorly formatted spreadsheets that sometimes have different clusters or multiple data sets in the same sheet and things like that?

Okay. I'll try to answer the general question of having, uh, analyzing poorly formatted spreadsheets. The first thing that came to mind, especially when you're talking about how the spreadsheets are very big, is breaking the problem down into, so, so give it, like, fewer spreadsheets at a time. Give it fewer columns of the spreadsheet at a time.

Only give it the columns that it needs to work with. Um, and then make the questions sort of smaller and more bite-sized. And then tackle it that way by breaking it down. So, at that level of generality, that would be my answer here.

Um, I, I, I'd also be curious to look at this one more specifically. Right now I'm just struggling with how to, like, copy the text and put it into the tool. That's what I did, but it keeps downloading it. I guess I can, um, the next, sorry, the next tab.

This is that other one. Sorry, I don't want to click open my downloads. I'm scared I'm going to, like, reveal some private information. This is my work computer, so, I want to just do it all on the browser. That's fine, you can go to the next one. Yeah, let's do the next one.

Okay, you are a social media ghostwriter. Given the below long-form article... Okay. Uh, generally we would recommend putting the -- I like this one, it's short. We can do some quick hits here. Uh, we would recommend putting the instructions after the document. That's similar to the question that was asked about should we have the information first or the instructions first.

Particularly with long documents, it's a little bit better to give the instructions at the end. Let's also put some XML tags here. Uh, let's just, like, clean up the grammar a little bit. Given the above long form article. Create five to ten tweets. Don't use hashtags. Don't be hyperbolic and don't be cringe.

Uh, return a JSON array of posts content. Probably good to give an example, uh, of the format. Uh, so we could do, like, uh, I guess we can just do a return, like, a list. Wait, what did you originally have it? Is there a special reason that you wanted it to be a JSON array or is it just to make it parsable?

Uh, just to make it parsable for automation. Okay, yeah. So let's say return in -- I'm just a huge fan of these tags, so let's do it like this. Okay, so that is -- that's -- that's some stuff without adding examples. The other thing that I would want to do is to give some illustrative examples of what it means to not be cringe.

So, how long are these documents? Uh, I put an -- I used a recent one from the rocket. Perfect, yeah. Let's actually -- let's run this as is. Oh, yep. Thank you. Now we can take this. Uh, what if they write cringe tweets about our product? I'm going to be embarrassed.

Okay, this doesn't include any hashtags. It doesn't seem very hyperbolic. Is this cringe? What do we think? New feature alert. That could be a little bit cringe. There's no emojis. It's a good sign. Okay. What do you -- your name was Charlie. What do you think of this, Charlie?

Uh, I think they are adequate, but not engaging. Not engaging. Okay. So, yeah, yeah, yeah. So, we can try to make it more engaging without making it cringe. So, let's say -- don't use hashtags, don't be cringe. Try to make the tweets engaging. Are these meant to be tweeted from the Anthropic Twitter account?

Or from, like, the AI influencer Twitter account? Sure. Let's say AI influencer Twitter account. Okay. Let's see how this goes. I'm going to switch back to Dove, too. Sorry. Exciting news. Is it better to break it up into small sentences in the prompt, or can you use complex sentences?

Is it better to break up sentences -- to use, like, small sentences in the prompt, or big sentences? I think, generally, in English writing, it's better to use small sentences in small words. So, I think it's probably also better to do that in a prompt. I think it's fine to use big words if you are really sure you know what you're doing, and you know that it's the exact right word for the situation.

Sometimes, I'll find myself using more academic language if I want the output to seem a bit more academic. Generally, I think, simple, small sentences are better. Okay. So, these are maybe a little bit more engaging. Like, they have these questions here. Want to try it? What do we think?

It's got exclamation points and question marks. Is it better? Do you want it to be even more engaging or something? Let's see. Okay. So, I honestly think temperature is a bit overrated, maybe. We can see how it differs, though. I'm not sure exactly how to distinguish these from the previous ones.

They look kind of similar to me, from the ones with temperature one, or temperature zero. That's right. Yeah. So, what I was going to say is, I think this is roughly as far as you can take this without examples. I think the best thing to improve this prompt would be either examples of the sort of tweets that you want, or even an entire other document, an example of tweets that go with that document, and maybe, like, multiple of those.

So, if you're cost limited, maybe you don't want to put in all those input tokens every time. But, I don't know, the models are pretty cheap now, and we don't need to generate that many tweets. So, if they have, like, any economic value to you at all, it's probably pretty cost effective to put...

So, basically, like... But it's more work on your part, because what you're doing then is... So... Okay. So, the way that I would actually do this is, I would start out with some document. I would have Claude write a bunch of tweets. I would take the ones that I liked, and maybe I would write some more, or get, like, my friend to write some more.

Or maybe I'd have Claude... Oops. Claude generate a hundred tweets, and then I would take the seven that I liked best, and then I would put that in as an example. And then, from there, I would sample, okay, now here's another document, and then write a bunch more tweets based on this.

And what I would do is iteratively build up this list of documents plus example tweets, and then I'd put them all into the prompt. And it would look something like this. So, let's actually do that. So, let's imagine that we had done this. So, it could be, like, you know, system prompt, "You are an AI influencer who writes engaging social media content about new models and releases." It could be, like, here are some example documents along with the tweets you wrote about them.

And here you would actually... I'm gonna write this, but you would actually put the literal text of the document here. And here, again, you'd put a literal tweet here. And this could either be something that you wrote or something that Claude wrote, or, you know, something that Claude wrote and then you edited.

Like, a lot of times, Claude might give you an example that's not perfect, but it's close enough, and then you'll change it a little bit to make it perfect. I have honestly given multi-shot examples pretty short shrift in this talk so far relative to their level of importance. Like, I think that, in reality, most of the gains, most of the effort, most of the gains of writing a good prompt is literally just picking the perfect document that goes here, picking the perfect set of tweets that go here, altering and changing them to modulate the tone.

In some ways, that's more important than, like, everything else that I've said combined. Like, another way to do the whole JSON thing would just be, like, with examples of Claude giving the stuff without a preamble. The JSON one is maybe an exception because the prefill approach works so well there along with the tags, but for anything else, the examples are really huge.
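A sketch of what that iteratively built few-shot prompt might look like -- all of the text below is placeholder; the real value comes from the actual documents and the hand-picked or hand-edited tweets you slot in:

```python
SYSTEM = "You are an AI influencer who writes engaging social media content about new models and releases."

PROMPT = """Here are some example documents along with the tweets you wrote about them.

<examples>
<example>
<document>
...full text of a past article...
</document>
<tweets>
- ...a tweet you liked (written or edited by a human)...
- ...another one...
</tweets>
</example>
<!-- more <example> blocks built up over time -->
</examples>

Now write 5-10 tweets about this new article in the same style.

<document>
{{DOCUMENT}}
</document>"""
```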

For few-shot prompting, do you prefer to put those all in one response like this? Or do you find further success with an exchange of messages between the agent and the user where you're putting your few-shot prompts in there? Yeah. Really good question and something that I would dearly love to know the answer to, but I don't.

The question is -- I don't need to repeat the questions. I think people can hear them. But I'll repeat it anyway. So, do we want to just put all the examples in one big giant examples block like this? Or do we want to structure the examples as a dialogue where the human says something and then the assistant says something back?

And we're literally, like, putting a large number of messages into the messages list. I typically do it this way with a big examples block, but it's mostly because it's less work for me, and I don't have any evidence that this works either better or worse. I did do some testing of this at one point on a few data sets, and I found that it didn't make much of a difference for my particular case.

But there's a lot of, like, little particulars that went into my testing that make me not very confident in the result that I got. So, sorry for a bit of an unsatisfying answer here. I'll just say, I don't think -- if it is wrong to do one giant examples block, I don't think it's, like, very wrong.
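For reference, the dialogue-style alternative would structure the same examples as prior turns in the messages list, something like this sketch (placeholder text throughout):

```python
messages = [
    {"role": "user", "content": "Write tweets about this article:\n<document>\n...past article 1...\n</document>"},
    {"role": "assistant", "content": "- ...tweet you liked...\n- ...another..."},
    {"role": "user", "content": "Write tweets about this article:\n<document>\n...past article 2...\n</document>"},
    {"role": "assistant", "content": "- ...tweet you liked...\n- ...another..."},
    # The real request goes last.
    {"role": "user", "content": "Write tweets about this article:\n<document>\n{{DOCUMENT}}\n</document>"},
]
```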

Do you use anti-examples too? So, like, in here, would you give it a thing and say, like, this would be bad because this is cringe? Yes. Yeah. I think that is good. I think it's good to include negative examples. Particularly around, like, the cringe thing where Claude might mess up.

I think just negative examples on their own don't usually get you there. You want to have some positive examples too. But I think it's great to have, especially, like, contrasting pairs. So, like, here's a document. Like, here's a cringe tweet about this document. Here's an excellent tweet about the same document.

And, like, set those up side by side. I think that's pretty powerful. And I do that. And I think it helps Claude. And then if you also include, like, the reasoning for it, right? So, like, if it was a cringe tweet, it has, like, a little reasoning of, like, why?

Do you also, do you trust that reasoning for the model? So, like, if you ask it, like, hey, give me, like, what were you thinking when you were writing this tweet? And then write me this tweet. Yeah. When you're reading through your examples to choose the best ones, how much do you trust that reasoning and how much do you rely on that versus just, like, I just care about, like, the input/output?

I don't trust the reasoning very much. Especially if it's after something the model already said. No. Then I, like, really don't trust it. Yeah. But, I mean, humans are not very good at explaining why we do the things that we do. We're really good at rationalizing and coming up with, like, fake reasons, but a lot of times we don't even know why we do the things that we did, let alone be able to coherently explain them to someone else.

So, I, there's a subtlety here. So, something that does work pretty well is having the model think about its reasoning in advance and, like, go through different reasons or rationales for why it might choose one option or the other or think about what sort of things might go into good response.

So, if I had the model do some thinking in advance before it gave the response, then I might just trust or assume that the response would be better. Having a bunch of explanation for why I did the thing after, probably, I would not trust that. Sorry, you had a question for a while.

Do you, do you give your own reasoning that explains the examples? And if so, how do you make sure that the model doesn't get reasoning in advance? Do I give reasoning to explain the examples? Yes. I do a lot of giving, giving reasoning to explain the examples. So, for instance, just in this case, one thing that we could do here is, like, we could add something like, I was going to say tweet planning, but maybe it's, like, key points of document.

And then here we have some key points, like the document presents the launch of. So, you would have this after the examples. So, if you have 10 examples. No, this is before the examples. Sorry, this is part of the example right here. So, I, in this particular example, in this example block, I gave it a document.

Now I'm doing this, this key points business. Got it. And then I would have these tweets. Now this key points could be something that I wrote myself, or it could be something that Claude wrote, and then I'm, I'm editing it. Or if Claude did a perfect job, maybe I could just include the thing that Claude wrote.

But now in order to get this, get Claude to do this, we would also say something like, return in this format, key points. A list of the key points from the document. So this is, like, a lightweight chain of thought where we're having the model do some thinking in advance.
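A sketch of that output-format instruction, with illustrative tag names:

```python
FORMAT_INSTRUCTIONS = """Return your answer in this format:

<key_points>
A list of the key points from the document.
</key_points>

<tweets>
The tweets, one per line.
</tweets>"""
```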

And we also gave it examples of it doing the thinking in advance, like this. Yeah. Yeah. So, let's imagine we, like, really want to give examples like this. But we have a problem, which is that our documents are, like, super long. And I'm greedy and want to save on input tokens.

Yeah. Would you err on the side of doing, like, one document but a really good example? Or doing, like, truncated versions of more documents? I would, that's a good question. I would err on the side of one extremely good example and not truncated versions of more documents. But I would also want to look at the outputs and test that assumption because it's possible that with only one example, Claude would fixate on aspects of the exact document that you uploaded and start trying to transfer them to your document.

Right. So, I think it's one of those, it would be case by case. Right. But I would want to start with, like, having one extremely good example. Generally, I think that, like, less but, like, higher quality is a better way to go than, like, more and lower quality. Cool.

Thank you. Okay. We have a lot of prompts here. Let's go to the... Okay. This is good. I was hoping we would get some, like, persona ones here. Okay. So, this looks like something where we're trying to get Claude to roleplay in these different protocols. So, let's try this out and let's see how it works.

So, this looks something where we're going to have, like, a... This looks like it's, like, meant to be a multi-turn prompt, right? So, this is, like, a conversation. You are talking to an assistant. It said execute greater than P assist. Where's that? So, basically, at the top, you see three roles at the top.

And then you can decide who you want to talk to. Oh, thank you so much. So, you see three roles at the top. Yeah. And then if you do that P assist, you see down there that's highlighted in yellow. Yeah. You just do that little arrow P and then you can pick a different persona.

Yeah. And then you can have them talk between themselves or you can just switch. We use this for designers in our shop to do synthetic interviews to synthetic users, basically. Got you. It allows us to switch back and forth. And then what issues or troubles have you been having with this?

I'm guessing you have seen a lot of role-playing prompts out there. So, I was just wondering if you see anything that's perhaps not as optimized as it could be or any other best practices for role-playing, particularly with multiple synthetic personas within the same session. Yeah. Okay. For single personas, there's one answer that I would give.

This multiple personas thing, actually, I haven't worked that much with. But off the top of my head, here is probably how I would think about it. I would give all the personas to -- I would write a separate prompt for each persona and then I would have the user's command trigger some coding logic where it would decide which bot to send the -- which prompt to send that reply to.

So, this is getting back to the thing that I said before about, like, don't do it in a prompt if you don't have to. Like, this -- I mean, this prompt, like, is -- like, there's a lot of thought that went into it, which is -- probably makes it work a lot better than it would have if you hadn't put as much effort into it.

But I think it's going to be easier if you just dynamically route the query based on what the user said. Does that make sense? It does. Okay. You're talking about, like, if you were to use the API instead of just the chat -- Yeah. -- construct something like this, right?
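A sketch of that routing logic outside the prompt -- the ">p name" command gets parsed in code and the conversation goes to a separate, single-persona prompt (persona names and prompt text are placeholders based on the discussion):

```python
PERSONA_PROMPTS = {
    "assist": "You are a helpful assistant at a pharmacy...",
    "sam": "You are Sam, a patient who optimizes for medication cost savings...",
    "joe": "You are Joe, a patient who optimizes for convenience...",
}

def select_persona(user_message: str, current_persona: str) -> tuple[str, str]:
    """Handle a '>p <name>' switch command, then return (persona, persona_prompt)."""
    if user_message.startswith(">p "):
        current_persona = user_message[len(">p "):].strip().lower()
    prompt = PERSONA_PROMPTS.get(current_persona, PERSONA_PROMPTS["assist"])
    return current_persona, prompt
```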

Yeah, exactly. But you're doing this just in the chat. This was just in the chat, but I appreciate -- I definitely appreciate the note there. So, maybe related to that, one of the other things is how much have you dealt with having a second thread with the API that acts as maybe the entity that's capturing inputs from multiple ones into a single thread.

You know what I mean? Like, let's say that I build an app, and I have the user interact with these different synthetic personas, but then I have a second interaction with the API that's tying these things together into a cohesive whole. I don't know if you guys have explored some of that.

I'll be curious. Yeah. I don't have a great answer for that one. Sorry. I do want to kind of test this prompt out, though, just to kind of see how it goes. Yeah. So, maybe here I would say -- I can just say something like -- so, how would I switch it?

I could -- So, do the right arrow P? Yeah. Right arrow P. And then type Sam. And then say, hey -- yeah. Okay. So, now I could -- You could say, hey, how do you -- what's your process to look for the right -- for the best medication pricing whenever you get sick or something like that?

And then here in this particular case, if you switch to Joe, Joe is optimized more for convenience versus cost savings. So, you have two different types of users and we can learn from. Yeah. Okay. So, Claude did the thing here that I want to show you all how to get rid of.

Oh, yes. Yes. As Sam, it's like -- that's not something that Sam would say, right? Yeah. So, I don't know for sure this is going to work. I feel like a magician that's about to do a trick but, like, I haven't practiced it. But, generally, something that is pretty useful here is to -- we could say, prepend each response with the name of the current persona in brackets.

So, one thing I'm going to do here is I'm going to change this, like, multi-shot a little bit also because if Claude sees itself not doing the thing that I told it to do -- actually, let's just redo the whole conversation. Or we can take out this. So, let's just, like, run that back.

You are talking to assistant. Nice. And now we could say -- and now we could say the same thing, like, what's your process for finding best prices for medication? Oh. Okay. So, I guess we need to do this. We need to, like, change in a separate call. Yeah. Okay.

Great. Now it's going to work. Totally. It's going to totally work. Okay. It's a little bit better, right? It didn't say as Sam. This is, like, something that a human might maybe say. Like, as someone who's -- okay. I don't know. It's better than it was before, right? Maybe we could say something like -- You don't need to say too much about your persona and your responses.

Just stay in character. Hey, quick question. What are your thoughts on using things in the negative sense versus -- Yo, check it out. It worked a lot better. Sorry to interrupt. Oh, yeah. Very nice. Very nice. Yeah, so -- Yeah. What are your thoughts on using, like, negative stuff, like, you don't versus the positive sense?

Yeah. I think positive is, like, a little bit better. In this case, I don't really have a good answer for why I phrased this negatively. I guess I did a combination. I was like, you don't need to say too much. Just stay in character. I guess I think it's better to use, like, a light touch.

Like, if you're doing negative prompting. Like, I think there's, like, a little bit of a thing going on with reverse psychology where if you tell the model, like, don't talk about elephants. Don't -- definitely no elephants. Definitely don't say anything about an elephant. It might make it more likely to talk about an elephant.

So, if you do use negative prompting, I think it's better to have, like, a light touch where you just kind of say it once but, like, don't dwell on it too much. Also, something similar with parenting. It's like if you don't want your kid to eat prunes, you're just like, oh, we're not having prunes today, and then you just change the subject.

But if you really, like, emphasize that there are, like, no prunes to be had, then you might get more pushback. Hi there. I noticed you're not using the system prompt much. Like, is there a reason for that? Or what do you think the biggest value items for a system prompt are?

Yeah, system prompt. Personally, the only thing that I ever put in the system prompt is a role. So I might say, like, you are this, you are that. I think, generally, Claude follows instructions a little bit better if they're in the human prompt and not in the system prompt.
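A sketch of that role-only system prompt through the API (anthropic Python SDK assumed; model name and wording are placeholders):

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # placeholder
    max_tokens=1024,
    system="You are a social media ghostwriter for an AI company.",  # just the role
    messages=[
        # All the detailed instructions stay in the human turn.
        {"role": "user", "content": "<document>\n...\n</document>\n\nCreate 5-10 tweets about the article above."},
    ],
)
```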

The exception is things like tool use, where maybe there's been some explicit fine tuning on, like, certain system prompts specifically. For, like, general prompts like the ones we've been going over here, though, I don't really think you need to use the system prompt very much. Yeah. One thing we've found when using the user prompt, I guess, sometimes is it makes it more prone to hallucinations because it thinks the user is saying it.

And so we migrated things to the system prompt more. I don't know if you have any experience with that. Yeah. Yeah. I've actually heard that before. So it's possible I'm missing something. I've heard this from enough people that I could just be wrong. So I'm unusually likely to be wrong when I say this.

I think that if you just put a bunch of context and you're like, here's the message from the user. Open user bracket. And then put the message. And then close the user bracket. It will work and you won't have that issue anymore. That said, like, I don't know. Maybe it does fall over sometimes.
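A minimal sketch of that labeling -- the tag names and surrounding instructions here are made up for illustration, not a fixed convention:

```python
# Wrap the end user's message in explicit tags inside the human turn so the
# model can't confuse your injected context with what the user actually said.
def build_prompt(context: str, user_message: str) -> str:
    return (
        "Here is some background context:\n"
        f"<context>\n{context}\n</context>\n\n"
        "Here is the message from the user:\n"
        f"<user_message>\n{user_message}\n</user_message>\n\n"
        "Answer the user's message using only information from the context. "
        "If the context does not contain the answer, say that it doesn't."
    )
```

The whole string still goes into a single human turn; the tags just make it unambiguous which words are the end user's.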

But that would be my default is just to, like, specify even more clearly. And if you're having this issue, be like, here's the message from the user. Here's the stuff that I want you to do. And I think it probably won't get confused by that. I have a question about the counterexamples.

So before, in order to get it to say not cringy things, you were saying provide it with a counterexample. But here in the case of where you're doing this character bot, you haven't provided it any counterexamples. So this is sort of like a generic question. So if the model is trained on preference optimization with examples and counterexamples, do you get a better result in the prompting?

Well, I don't know that the details of the RLHF have that much bearing. Because I think when the model is trained, it doesn't usually see those both in the same, like, window. It's more that it's, like, some stuff that happens with, like, the RL algorithms. I don't think that's necessarily the right way to think of it.

With counterexamples, I don't feel that I have to include them in every prompt. It's just a tool that I have in my toolbox that I'd use sometimes. In regards to, like, negative prompting -- I'm over here. Hi. Do you think that it would be better to do negative prompting using control vectors, like what you talked about in your scaling monosemanticity paper?

Yeah. And maybe, like, having, like, a negative version of the vector as your kind of negative prompt instead of mentioning it in the prompt outright? Yeah. Steering is still, like, super new. We don't know how well it works relative to prompting. I'm, like, a, you know, die-hard prompter till the end, so.

I've played around with it a little bit. I haven't found it to work as well as prompting in my experience so far. That said, there's, like, a lot of research improvements that I won't get into in too much detail, but there's a lot of stuff that could make it work better than -- So, like, right now, it's, like, finding these features, and then you're steering according to the features, which are sort of, like, these abstractions on top of the underlying vector space.

Yep. There's other possibilities for how you could steer, and there's, like, academic papers that you can read where you're steering according to just, like, the differences in the activations versus, like, trying to pull it out to this feature first. So, maybe that would work a bit better, like, the control vectors thing.

Yep. I haven't played with it enough to know for sure, but I think there's definitely, like, something along those lines will work eventually. I can't say in the long term if it'll work better or worse than prompting. Right now, I still think prompting works, like, a lot better. I mean, from my experience with, like, smaller models and trying to work with control vectors, I've seen that it's better when it comes to style than it is for, like, actual deterministic prompting.

Yeah, pretty interesting. Yeah. Sometimes I feel like stuff from smaller models transfers. Sometimes it doesn't transfer. I don't have a great intuition for what does and doesn't transfer between small and large models, but, yeah, good points. Thank you. Okay. I think we've gone over this roleplay stuff enough. Let's go to the next one.

I'm going to upload a few screenshots of my dating profile. Okay. This is our first image one. Are there any screenshots? No screenshots. Okay. Actually, somebody responded in the very first in reply. So, since we're doing images, maybe I'll start there. If you scroll down to that, that was me.

That was you. Let me try and find my message here. All right. You said it's at the bottom of the... Riveting to watch me scroll through this channel, I'm sure. There's a lot of messages here. Okay. Here we go. Okay. So, let's... Can I copy the image? Copy image.

Okay. Cool. I actually don't know if I can paste it into the console. So, I might fall back on using claude.ai for this. Pasting images is not enabled right now. Okay. Let's try claude.ai then. Okay. So, then this... The question was... The prompt here was... High-performing validated AI model.

Sorry. I, like, lost all your formatting here. Okay. So, in this image here, if we zoom in, it's supposed to get Maddie white. And... Eight, six... Hmm. I'm having a hard time reading this. Eight, six, eighty-seven. That's the hard one that it messes up. Eight, six, eighty-seven. And then eight, six, eighty-seven down here.

And then in this one, it's just eight, six, seven. Yeah. They... That's a typo. That's a typo. Probably the human. And so, I'm hoping it can correct for that. I'm just trying to pull out kind of the average data from all three combined. Okay. And it said... It looks like it said middle mark.

So, it's misreading the... Okay. So, I don't know too many good tips for images. But I'll tell you what I have. So, one of the things that you said... Oops. Sorry. One of the things that you said here was that it works better with zooming in and cropping the image.

That's definitely like the easiest win that you can have is just giving it like a higher quality image, taking out the unnecessary details. That might be hard to do programmatically because it's like you don't know which details are necessary and unnecessary. But for the same reason that you probably won't get as good results if you include extraneous information in text form...

If you include extraneous information in image form, the results probably won't be as good either. So, the more that you can like narrow in on the exact information that you need, the better. Is the model down sampling large images? I don't know slash can't talk about that. But definitely having like higher quality bigger images is better.

I did just read on your website that it says... It down samples to a thousand pixels by a thousand pixels. Okay. Great. That can be found with Google. Okay. So, then any general tips on how to discuss images with Sonnet? My number one tip for images is to start by having the model describe everything it sees about the image.
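A sketch of that "describe everything first" pattern using the Messages API image blocks -- the file name and the follow-up question are made up for illustration:

```python
import base64
import anthropic

client = anthropic.Anthropic()

# Hypothetical local file; in the demo this was a cropped screenshot.
with open("cropped_tube_photo.png", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # placeholder model name
    max_tokens=1000,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_b64,
                    },
                },
                {
                    "type": "text",
                    # Step 1: full description first, then the actual question.
                    "text": (
                        "First, describe everything you can see in this image in detail. "
                        "Then, and only then, transcribe every handwritten number you can read, "
                        "noting any you are unsure about."
                    ),
                },
            ],
        }
    ],
)

print(response.content[0].text)
```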

So, I don't know if that will work here. This example is like... This one is hard enough for me to even read that I kind of doubt that Claude will do well on it regardless of what we say. But we can give it another shot. One thing I've noticed when I attempted that is that if I asked it to go, like, tube by tube in that image, if it...

Like the first tube, if it came to the wrong conclusion, it would use that and come to that same wrong conclusion on multiple tubes. Where like it made it so there was some kind of like directionality in its thinking where if it got the answer wrong at first, it would project that onto the rest of the image it was analyzing.

Yes. I think that's definitely true. If the model starts off on the wrong path, it probably will just continue going on the wrong path. It's trying really hard to be self-consistent. Mm-hm. And this is why self-correction is such a big, like, frontier for LLMs in general right now.

But as of this month, why use those for this task? So, I love to use those, but why for this specific image? I found, like, Textract and the OCR models I played with don't do as well with some of the handwriting. Like, if I zoom in on this image, it actually does perform pretty well.

Even on some of this fairly messy handwriting that's hard even for a human. Okay. Yeah. Unfortunately, I don't know how to get better results out of this other than maybe by cropping it better and up sampling. Okay. Cool. Thank you. Oops. Here we are. Here we are. So let's scroll up to where we were before.

Okay. Yeah. Still no screenshots there. How do I enable fragments? What are fragments? Is fragments like the pre-fill part? The pre... Artifacts. Oh, okay. It's just a setting. It's in the bottom left of the... You have to enable it. Yeah, you can find it online. Our people will help you.

Uh... I often dump full traceback errors directly into the prompt box, or into the API. It seems exceptional at not running into traceback loops. I don't know, like, if that's intentional, like, literally I'll just take the entire traceback, zero context to Claude, and I'll just dump the entire thing in.

Yeah. And then it'll give me the fix. Okay. Great. So, but... Can you elaborate on how that might have... How this... For example, this only really appeared, like, very recently. Before, you'd have to explain things explicitly a lot. The models get better, man. They get better every generation.

I understand, but this is, like, you know, this isn't prompt. This is prompt engineering we're talking about, right? So, I'm just wondering, like, is this a form of prompt engineering, or is just the model being good? Sounds more like the model being good if you're just dumping it in.

Okay. I'm gonna move to this prompt here. Okay. So... Uh... To the person who uploaded this, do you... And generally, to anyone who's uploaded more examples, if you could just, like, put some stuff in the thread, like, write some stuff about what the issue is that you're having, or, like, why it's not working, that would be great.

This is actually a follow-up to, like, what I was doing with the translation. So, basically, what I'm trying to get it to do is to actually analyze the text. So, like, in this case, you know, there's the original English. Yeah. There's a bad Japanese translation. And I'm trying to get it to score between one and five how good the translation is.

And so, what I've been doing is adding a lot of stuff to try to get it to do sort of chain of thought to, you know, see, like... Yeah. Because it'll notice errors. But, you know, it just generally does a very bad job at scoring. Mm. Yeah. Okay. So, this is great.

I'm really glad that you asked this. Because model grading is something that... I mean, it would be incredibly useful if it worked. And right now, it's in a place where it, like, sometimes kind of works and sometimes doesn't. So, let's paste this into the console. Okay. So, we have some English text and we have some...

Yep. Yep. Yep. Yep. Then we got the Japanese text. So, is this a good translation or a bad translation? Terrible translation. Terrible translation. Okay. Now, Claude's actually supposed to be good at Japanese, too. So... Oh, this is somebody else. But I'm saying if Claude is good at Japanese, it should be good at, like, judging other people's Japanese in theory.

In theory. In theory. Okay. So, now we've got the answer here. So... I guess it's stalled out here for whatever reason. So, what are we doing in this prompt? We're scoring between one and five as below. One is many grammatical errors a native would never make. Contains multiple grammatical errors.

An average quality translation with some errors. Okay. That looks pretty good. Look for specific clues or indicators. I don't know why. I think that's just a bug. I think I'll conclude here. Okay. So, it gave it a three, but we want it to give a one. Only closer, yeah.

Okay. Now, have you found that it's generally too forgiving or too strict or that it's just all over the place? Well, it's all over the place. Also, it seems to confuse sometimes content versus... You know, so like, for example, this is from, I think, the RLHF, you know, set.

So, you know, the content is fine. But so, it might not actually rate it low, even though it's a terrible translation, because it thinks the content is okay. Even though, if you asked it even to list all the errors, there are a dozen errors in that, you know, single piece of text for the translation.

Yeah, yeah. So, that's what I'm trying to also see if, you know, is there any way to, like, get it to separate out the grammatical or the actual translation errors versus... Yeah. Okay. A few thoughts. So, generally, this is like a thing where if you ask the model if some text is like good or bad, it's sort of, if the text is like about a nice subject, it's more likely to say that it's good in all ways.

Like, it's a good translation. Like, it's well written. It flows very logically. And if it's about like a negative subject, it's more likely to like, criticize it and say that it doesn't flow well. Yep. I think you can get at those issues by typing language about it in the prompt.

So, for instance, here, you might say something like... Is one or five good? You don't specify that. Hmm? Is one or five the best score? One is the worst and five should be the best. That's sort of like implicit in this rubric up here. But it might be good to say it anyways.

How do you get around the fact that you may not have a Japanese tokenizer? Do you have a Japanese tokenizer on Claude? The API will tokenize anything. But how is it trained? How is it trained? Yeah, yeah. Pre-trained. Off topic for... And also, I don't know. I don't think...

I actually don't think there is a tokenizer for Claude. Like... Oh, so... I mean, unless you say otherwise, there's not... I mean, if you upload some text, it will be tokenized. But it's not pre-trained. But it's not pre-trained. So you're not going to get a really good answer for this.

Or Japanese text. I've tested this. Okay. Claude speaks the best Japanese of any model available, actually. We don't need to debate, like, Claude's Japanese skill. Let's just like... Yeah. But the tokenizer isn't available. And so that would be interesting. Okay. Yeah. I have a question about programming. Sorry, can we actually cut the questions off while I type this prompt here?

Okay. So it's extra important to... So what we're trying to do here is get it to distinguish between the, like, ethical nature of the text and the quality of the translation. So... Is it useful to tell it to be, like, extra critical? Say, like, you're grading a, you know, like...

I don't know, like, graduate course, you know... Sorry, I'm a little bit all over the place. I need to, like, type this out before I can clear the queue and, like, respond to other questions here. So what's a good word here, like, risque topics, or R-rated... Yeah. So, I don't know, something like this could help the model not pay so much attention.

So, the main thing that I would want to do for this prompt is just add a bunch of examples. So for each category, I would add at least one example of that category. So, like, I'd have, like, a really bad translation, and I'd say why it's a one. And you can say that in your own words.

And I'd have an example of, like, a two-level translation, a three-level translation, and so on. Okay. And in each case, before you get to the answer, you would have the explanation for why it's good or bad. I can't type all that here, among other reasons, because I don't speak Japanese.

Mm-hm. But I think that's the most valuable thing that you could do here. Otherwise, I mean, like, the formatting looks really good. The fact that you're doing the chain of thought in advance looks good. Yeah. I think maybe this "specific clues or indicators" part -- I think you could go into a little bit more detail here about what is...

Okay. So basically just n-shot every single example or, you know, grade with an example, basically. Yeah. Yeah. I mean, it's tedious to write out all these examples, but A, Claude can help, and then you have the problem of just editing Claude's response versus, like, writing it all out yourself.

And B, it really does lead to better performance versus, like, almost anything else that you could do. Do you think it's better to use a number scale, or else could you say, like, good, bad, or, you know? Yeah. In terms of the scale on these rubrics, I think either a number scale or a good, bad is fine.

The thing I'd be careful about with the number scale is that I don't think it's very well calibrated. So if you're telling it, like, choose a number from 1 through 100, it's not going to be, like, an unbiased estimator necessarily. So I'd probably just limit the granularity to maybe five different classes.
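A sketch of the prompt shape being described here -- one worked example per grade, the reasoning before the score, the scale direction stated outright, and an instruction to ignore the subject matter. The example slots are placeholders, not real translation pairs:

```python
# Placeholder worked examples; real ones would be full source text, the
# candidate translation, and your own explanation of why it earns that grade.
GRADING_RUBRIC = """You are grading the quality of a Japanese translation of English text.
Score from 1 to 5, where 1 is the worst and 5 is the best.
Judge only the translation itself (grammar, fidelity, naturalness), not whether
the subject matter is pleasant, safe, or interesting.

<example>
<english>[placeholder source text]</english>
<japanese>[placeholder translation with many errors a native speaker would never make]</japanese>
<analysis>[your explanation, in your own words, of the specific errors]</analysis>
<score>1</score>
</example>

<example>
<english>[placeholder source text]</english>
<japanese>[placeholder translation that reads as fluent, faithful Japanese]</japanese>
<analysis>[your explanation of why this one deserves the top grade]</analysis>
<score>5</score>
</example>
"""


def build_grading_prompt(english: str, japanese: str) -> str:
    # Rubric and worked examples first, then the pair to grade;
    # the analysis always comes before the score.
    return (
        GRADING_RUBRIC
        + "\nNow grade this pair. Write your reasoning in <analysis> tags first, "
        "then give the grade in <score> tags.\n"
        f"<english>\n{english}\n</english>\n<japanese>\n{japanese}\n</japanese>"
    )
```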

Yeah. One thing we haven't talked about at all here, but going back to your question, is that this is a case where you might be able to utilize, like, log probs, and I'm wondering if that's something you ever use in any of your work or other prompts. Yeah. I agree this could be a case where log probs would be useful if you could get, like, the probability of each grade.

So here's the thing with log probs is you really want the chain of thought beforehand, and the chain of thought, I think, is going to get you more of a win than using the log probs. But log probs, there's no -- in any model, I don't think there's a way where you can just say, sample all the way through, and then output -- after you output this, like, closed chain of thought tag, then give me the log probs of whatever comes next.

Mm-hm. So if you are going to use the log probs, then you're looking at this, like, multi-turn setup, where you first sample all the chain of thought, or sample all the, like, precogitation that it wants to do. Mm-hm. And then cut it off there, and then re-upload that with the chain of thought as a pre-fill message, and then you could get the log probs.
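A rough sketch of that two-step setup, assuming a grading prompt that asks for <analysis> then <score> tags. Note the Anthropic API doesn't return token log probs (hence Zack's caveat about needing a model with both capabilities), so the second call here just samples the grade; the comment marks where log probs would be read on an API that exposed them.

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20240620"  # placeholder model name

# Placeholder grading prompt; imagine the rubric-plus-examples prompt sketched earlier.
grading_prompt = (
    "Grade the Japanese translation below from 1 (worst) to 5 (best). "
    "Write your reasoning in <analysis> tags, then the grade in <score> tags.\n\n"
    "<english>[placeholder source text]</english>\n"
    "<japanese>[placeholder translation]</japanese>"
)

# Step 1: let the model spend all of its chain of thought, but stop right
# before it commits to a grade.
first = client.messages.create(
    model=MODEL,
    max_tokens=1000,
    messages=[{"role": "user", "content": grading_prompt}],
    stop_sequences=["<score>"],
)
analysis = first.content[0].text

# Step 2: send that reasoning back as an assistant pre-fill and ask only for
# the grade. If your API returned token log probs, this short completion is
# the one you would read them from; here we just take the sampled digit.
second = client.messages.create(
    model=MODEL,
    max_tokens=10,
    messages=[
        {"role": "user", "content": grading_prompt},
        {"role": "assistant", "content": analysis + "<score>"},
    ],
    stop_sequences=["</score>"],
)
print(second.content[0].text.strip())
```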

But for that, you need a model that both has the pre-fill capacity and has log prob capacity. I'm not sure of what model has both of those characteristics right now. Walk me through why it wouldn't just be sufficient to, like, in this case, just ask for -- I'm asking for a score, one to five, only return that, but then, like, look at the log probs of what it returns in that case.

Yeah. So what you're losing there is whatever intelligence boost you got from having the model do the chain of thought. And my sense is that chain of thought plus have the model say either one, two, three, four, or five is going to be more accurate than, like, the additional nuance that you'd get by having it give you the log probs, because it's actually doing a lot of its thinking in that chain of thought.

You're, like, leveraging more computation. You're getting more forward passes. For all the same reason that chain of thought is, like, usually a good idea. Are you talking about, like, chain of thought, like, it's actually out loud writing a chain of thought before the answer? Exactly. Okay. Okay. Sure. I got you.

Which is what we see in this prompt right here, right? Like, we have this analysis section. So if we cut out this whole analysis section, we're really tanking the number of forward passes that the model can do before it gives you the answer. Okay. Cool. That's good to know.

What's that? Okay. I'm being told that I should answer a couple more questions and then get off stage. I was, like, before I came here, I was honestly really worried that, like, no one would have questions and I'd be supplying my own. But you all have had amazing questions so far and amazing examples.

So I really appreciate that. It's made this go on well. Sorry. I'm supposed to say that at the end of this, but I'm giving a pre-thank you. Now we can do the encore. Yeah. I just wanted to add that having a numbered list may give more weight to, like, number one, two, three, four, five versus just having an unstructured list.

So it may give more weight to the score for the output if you just change it from, like, one, two, three, four, five to, like, little dashes for criteria. Just in my experience of what I've been seeing. Okay. Cool. Yeah. I have replied back with the fixed prompt -- or, like, my own improvement of what I think is a better prompt for this.

Yeah. Okay. Awesome. Nisha, should I do another prompt or should I just answer a couple questions and then head up? All right. Let's do one last prompt. What's -- which one should we choose? Okay. Let's do this one. Good old mitigating hallucinations. Because we haven't really done that. Okay.

Please provide a summary of the text provided in the input. Okay. First thing I'll do is just move these instructions down. Now, Matt, did you have an example of the document -- a document where it hallucinates with this prompt? Yeah. Can you just put that in the thread? Okay. Your summary should be concise while maintaining all important information that would assist in helping someone understand the content.

If it mentions any dates. Don't start or end with anything like I've generated a summary for you. Here's the summary you asked for. Okay. Yeah. So this one we can fix with a pre-fill. Okay. All right. We could do something like this. Now, how about the hallucination part? The best trick that I know for getting around hallucinations in a case like this is to have the model extract relevant quotes first.

So what I would say to do here is I would say something like -- and now, of course, in this pre-fill, since we're having relevant quotes here, we wouldn't want to start with "summary."

That would just be confusing slash wrong. So we could say here is the -- something like this. Okay. Did you get the doc yet? Okay. And then, of course, I would put the document here. Okay. I think I should get off stage. So, yeah. Let me just call it here.
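Putting the pieces of this last prompt together -- document first, instructions after it, quote extraction before the summary, and a pre-fill so the reply can't open with "Here's the summary you asked for." A minimal sketch with placeholder text and a placeholder model name:

```python
import anthropic

client = anthropic.Anthropic()

document = "[placeholder: the full text to be summarized goes here]"

# Document first, instructions after it, and quote extraction before the summary.
prompt = (
    f"<document>\n{document}\n</document>\n\n"
    "First, pull out the quotes from the document that are most relevant to its "
    "main points and put them in <relevant_quotes> tags. Then write a concise "
    "summary in <summary> tags, keeping any dates that are mentioned and using "
    "only information that appears in the document."
)

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # placeholder model name
    max_tokens=1000,
    messages=[
        {"role": "user", "content": prompt},
        # Pre-fill: skips any "Here's the summary you asked for" preamble and
        # drops the model straight into the quote-extraction step.
        {"role": "assistant", "content": "<relevant_quotes>"},
    ],
)

print("<relevant_quotes>" + response.content[0].text)
```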

And, Matt, we can talk after. Yeah. Once again, I really appreciate you all coming out. It's been amazing to have such, like, a great audience engaged. I've had fun. I learned some things. I hope you all did, too. I'm planning to stick around this event for the next -- for the rest of the afternoon.

So I don't know exactly where I'll be, but maybe just DM me if you want to come find me and chat. I'm always happy to talk prompt engineering. It's, like, my truest passion at this point in the world. So, like, find me. Hit me up. We'll talk. And, yeah.

This has been great. Thank you so much. Thank you. We'll see you next time.