Building with Anthropic Claude: Prompt Workshop with Zack Witten

00:00:00.000 |
All right, good afternoon everybody. Thank you all so much for joining us. We have the enviable position of being after lunch. I'm seeing some cookies on the table still, but thankfully you're not here to listen to me. You're going to be riveted by the prompt doctor who's going to come up in a second. 00:00:29.980 |
So I'm expecting no sleeping on the table, but you never know. Excited to be here. I'm Jamie Neuwirth. I lead our startup team at Anthropic. 00:00:38.420 |
I just wanted to say a couple of quick things before we got going here with, again, the reason you're here, the prompt doctor. 00:00:44.000 |
We've had a lot of really exciting releases just in the last couple of days, some in the last couple of hours, and wanted to just put these up there to highlight some of the cool things that we're doing, 00:00:55.380 |
but also share how a lot of folks, not only in this room, some of your peers, maybe folks back at the office, can work with Anthropic on not only some of the prompting, of course, work we're going to do here, 00:01:06.680 |
but just helping you grow your business and the really cool things you guys are building with Claude, with LLMs, on top of AI. 00:01:15.540 |
My team is here specifically to help your company grow and scale the business, whether that's getting access to higher rate limits, getting access to folks like Zach, who's going to be up here in a moment, 00:01:26.180 |
learning more about what we're doing from an early access perspective. 00:01:29.280 |
We want to work with you and empower the next wave of really exciting AI companies built on top of Claude. 00:01:35.900 |
And we're helping from a product perspective with a couple of the releases you see here. 00:01:44.780 |
Yeah, thankfully, Zach has tried it as well, so we're in a good place. 00:01:47.420 |
Really excited about what we were able to release there as well. 00:01:51.080 |
Also, just today, or excuse me, I guess Artifacts came out along with our Claude Teams and Claude.ai plans. 00:01:57.800 |
Artifacts is this really cool tool that some have been playing around with, making video games with a couple lines of text that turns into code, 00:02:04.900 |
but actually a lot of really cool business use cases from a diagram perspective. 00:02:08.600 |
I've seen a lot of cool pharma companies using this and thinking about, 00:02:12.080 |
how can I put what's in my head on a gene sequencing kind of discussion onto something like an artifact, and share that with my peers. 00:02:19.320 |
All the way to prompting cookbooks, helping, again, folks like yourself. 00:02:23.100 |
I would imagine a lot of you hopefully will be featured on the website, you know, coming up with what you'll be able to prompt for these use cases in your company. 00:02:29.780 |
So just some ways that we've, you know, things that we've come out with over the last couple of days, 00:02:35.520 |
a couple of hours when you think of Claude Teams and projects, and then ways to connect. 00:02:39.260 |
Feel free to reach out at sales, sales at anthropic.com. 00:02:42.960 |
We have our head of DevRel, Alex, here as well. 00:02:45.520 |
So we really love this community, love staying engaged. 00:02:48.440 |
We're really excited about what we've released over the last couple of days to be able to help you again in what you guys are building. 00:02:54.060 |
And so I'm going to leave all that to the prompt doctor as well to get that, to really make sure that we can help. 00:03:00.480 |
And Zach, why don't you come on up here and see what we can do. 00:03:13.320 |
Thanks, Jamie, for the intro, and thank you all for coming. 00:03:16.980 |
I had no idea this many people were going to be here. 00:03:23.080 |
It's mostly just going to be straight up prompting from beginning to end. 00:03:29.460 |
So I set up a Slack channel in this, the AI engineer Slack. 00:03:36.460 |
It's called PromptEng Live Workshop Anthropic. 00:03:43.060 |
And what we're going to do on stage is I'm just going to sort of look at the prompts you post there. 00:03:49.000 |
We're going to test them in the console, the Anthropic console. 00:03:52.120 |
We're going to see if we can get better results. 00:03:54.500 |
And we'll just sort of like try to learn as we go. 00:03:57.920 |
So this is something that I do like internally in our team Slack quite a bit. 00:04:02.740 |
But I've never done it in front of this many people. 00:04:10.000 |
But hopefully you all have a good time too and maybe learn something. 00:04:13.660 |
So what kind of things should you put in this Slack channel? 00:04:20.900 |
So a prompt template is kind of like a prompt. 00:04:23.880 |
Actually, I just realized I don't even need this mic. 00:04:28.260 |
So you put a prompt template, which is like a prompt, but with spaces where the variables are going 00:04:34.200 |
to go and the variables, they're going to be denoted with these double brackets. 00:04:39.160 |
So in this case, it's like this document part. 00:04:41.620 |
If you don't have it in this format, that's fine. 00:04:49.840 |
This is like the kind of thing you'd put there. 00:04:51.800 |
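To make that concrete, here is a minimal sketch of a prompt template with double-bracket variables, plus a tiny helper that fills them in. The clinic scenario and the variable names are made up for illustration; they are not from anyone's actual Slack submission.

```python
# Hypothetical template: fixed prompt text with {{DOUBLE_BRACKET}} variables,
# the same convention the Anthropic console uses for its variables.
PROMPT_TEMPLATE = """Here is a document about our clinic:

{{DOCUMENT}}

Answer the user's question using only information from the document.

User question: {{USER_QUESTION}}"""

def fill_template(template: str, **variables: str) -> str:
    """Substitute each {{NAME}} placeholder with the supplied value."""
    for name, value in variables.items():
        template = template.replace("{{" + name + "}}", value)
    return template

prompt = fill_template(
    PROMPT_TEMPLATE,
    DOCUMENT="(paste the clinic info here)",
    USER_QUESTION="What are your opening hours?",
)
```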
And then you can also have a couple examples of cases where it doesn't do what you want. 00:04:55.980 |
And that will give us the like direction as far as like where we want to go with it. 00:05:01.740 |
I might also like ask you questions out loud if I have questions about like what kind of output 00:05:09.560 |
is good or not, or I might ask questions in Slack either way because it's easier. 00:05:13.980 |
We'll have to kind of figure that out as we go. 00:05:17.900 |
So that being said, we're going to use the console for iterating mostly, 00:05:22.320 |
although I might use a couple other tools like Claude for Sheets, 00:05:26.260 |
which is like a spreadsheet where you can call Claude. 00:05:32.620 |
Let's see what we've got in the Slack already. 00:05:40.880 |
So let's put this into the console and then let's take a look. 00:05:52.980 |
And I'm just going to go through as many of these as we can get through in the session. 00:05:57.100 |
And, yeah, this is pretty much what it's going to be. 00:06:00.260 |
So, first of all, we can probably capitalize all the sentences. 00:06:22.700 |
A lot of things, like, prompt engineering is, like, it's very new, right? 00:06:33.500 |
Somebody out there might have done a study where they, like, conclusively show that using capital 00:06:38.160 |
letters and, like, using, like, better grammar, fixing grammar mistakes help. 00:06:41.580 |
I have, like, anecdotally found this in a few cases. 00:06:44.660 |
I also have read some, like, quantitative stuff showing that, like, typos do hurt performance. 00:06:50.060 |
But I'm also just, like, pretty obsessive about this stuff. 00:06:54.200 |
And I think it, like, it definitely doesn't hurt. 00:07:11.260 |
So, first thing, let's put information in XML tags. 00:07:24.420 |
So, Claude was trained with a lot of XML in its training data. 00:07:28.220 |
And so it's sort of seen more of that than it's seen of other formats. 00:07:35.940 |
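As a rough sketch of what that separation can look like (the tag names and clinic details here are illustrative, not the exact prompt on screen):

```python
# Information first, instructions after, each part wrapped in its own XML-style tags.
prompt = """<information>
{{DOCUMENT}}
</information>

<instructions>
You are a conversational assistant for the clinic described above.
Respond in one to two sentences, using only information from the information block.
</instructions>

<user_input>
{{USER_INPUT}}
</user_input>"""
```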
So, this looks like all the information here. 00:07:44.480 |
So we'll look at the before, then the first iteration, and then the final iteration, and see how it compares. 00:07:50.600 |
Actually, let me undo everything that I've done so far. 00:07:58.760 |
And then, now, in the console, it's asking us for the user input. 00:08:28.360 |
This is a conversational agent, so no more than one sentence. 00:08:37.360 |
So, well, we can also use this evaluate tab, so let's just, like, add all the test cases 00:08:46.960 |
I'm also going to be doing, like, some showing off of the console features, because I think 00:08:55.680 |
There's also some secret new features that I might show. 00:09:01.960 |
Okay, so then we also have, why do I need to do this experiment? 00:09:32.160 |
And this is all running through Dove, sorry, through, okay, so we have, why do I need to 00:09:43.160 |
And here, we have this, like, I apologize, but I don't have information about scheduling 00:09:54.280 |
Is it true that we don't have information about scheduling? 00:09:56.540 |
No, we are always available, make it easy for that. 00:10:06.980 |
All right, so this is, like, the version one. 00:10:12.920 |
So, first of all, actually, I'll try to do things, like, roughly in, like, some order of 00:10:18.980 |
So, maybe I won't make you all sit through the capitalization, even though I, like, would 00:10:26.360 |
I'm also going to add a new line here, just because I think that's, like, more normal what 00:10:30.820 |
you'd see in, like, a written document, you'd have a new line. 00:10:46.740 |
The one thing I wouldn't feel completely confident of is that it would, like, exactly 00:10:50.500 |
transcribe the rest of everything, like, word for word. 00:10:58.780 |
What I actually might do is just, like, have Claude write code to capitalize every first word 00:11:05.620 |
Then I'd be worried about edge cases, like, what if there's, like, ellipses? 00:11:12.800 |
And like, I definitely use Claude a lot in writing prompts. 00:11:15.500 |
For instance, like, we have, like, a Claude tool that, like, helps complete code, basically, 00:11:20.260 |
and I do a lot of prompting in that IDE, because, like, especially with, like, very nested XML 00:11:26.500 |
It helps a lot just, like, suggesting the closures of them, which is, like, pretty obvious, but 00:11:32.260 |
So, yeah, if you have any sort of, like, co-pilot type thing, definitely that's, like, a good 00:11:40.020 |
Okay, now let's do the same thing with this instructions. 00:11:51.380 |
It looks like this one, like, didn't get a number, so let's, like, do that. 00:11:56.780 |
Yeah, so the key thing in terms of XML, I think, is just, like, really, XML isn't even that important. 00:12:14.540 |
The most important thing is just clearly separating the different parts of the prompt. 00:12:19.900 |
Yeah, exactly, it's, like, here's this stuff, here's this other stuff. 00:12:24.500 |
Like, if we wanted to, we could do something like -- like, I wouldn't do this, but, like, 00:12:42.920 |
Let's also do the same thing with user input. 00:12:51.800 |
Now we can run, we can go back to the evaluate tab, and we can hit rerun all, and it's using 00:13:01.280 |
But we can also see how it does on the last case where -- okay, so here it's still said 00:13:08.660 |
I don't have access to the specific scheduling information. 00:13:13.100 |
So first of all, we can make it shorter, so -- do we have anything here about, like, making 00:13:25.640 |
Be concise and offer only relevant information. 00:13:39.340 |
So, like -- and offer only relevant information. 00:13:43.340 |
Each response should be -- or let's be a little bit, like, less prescriptive to give Claude, 00:13:52.940 |
Like, if we say, like, every response should be, like, exactly three sentences, that might 00:13:58.980 |
So we could just say, like -- so why is response should be two to four sentences better than 00:14:16.060 |
Yeah, concise could mean a lot of different things to different people. 00:14:20.060 |
Like, in some cases, like, concise might mean, like, literally only one word. 00:14:26.140 |
In some cases, like, if you ask for, like, a concise book review, we might be looking at, 00:14:29.740 |
like, you know, a single-page Word doc, and that would be concise in the context of a book 00:14:36.140 |
So, yeah, Claude is, like, trying to guess what you mean by concise. 00:14:41.820 |
What if you also have a longer prompt, so you have a really long prompt. 00:14:49.820 |
You have a really long prompt, so if you have a long system prompt with a lot of detailed 00:14:53.040 |
instructions saying be concise, you're not going to get something super, super short. 00:14:59.040 |
I think the tone of the prompt -- so, what he was saying, if people couldn't hear, is, like, 00:15:02.820 |
the prompt is long, so the response might also be long. 00:15:06.320 |
I don't think that's, like, definitively true. 00:15:09.220 |
Like, you can have long prompts that give short responses or short prompts that give long responses. 00:15:13.400 |
But it's more like if you don't say anything, it might pick up on some of those, like, context 00:15:21.260 |
So, let me actually get to that after we do this. 00:15:40.860 |
So, this two to four sentences, it looks like it's still pretty long. 00:15:43.260 |
I think maybe that's actually, like, longer than necessary, so maybe we should make it, 00:16:07.000 |
And it also seems that it is giving variable numbers of sentences, so these were both two 00:16:16.940 |
So one of the questions over here is, like, can the LLM figure out that it should do longer 00:16:20.380 |
responses in certain situations and shorter in others? 00:16:25.780 |
So then the next point was that in this case it shouldn't say that I don't have access to 00:16:31.000 |
the scheduling system or specific appointment times. 00:16:39.600 |
But intentionally in the prompt, I left out that it's a 24-hour service. 00:16:43.840 |
So this is a case where we're asking a question that's not present in the information it has. 00:16:53.480 |
So, yeah, I mean, we could add something like, "You're open for 24-hour service," but you're 00:17:00.060 |
saying you want to test its ability to, like, figure out how to do it without that. 00:17:16.840 |
Anything else that you wanted to get out of looking at this example, Gordy? 00:17:23.240 |
So does the order of the rules or the order of putting information, then rules, or rules 00:17:28.300 |
first, then information, does any of that matter? 00:17:32.300 |
So does it matter what order we have these components? 00:17:37.240 |
I think it's better to put the information above the instructions. 00:17:41.800 |
We've sort of found that instructions are more tightly followed the closer they are to the end of the prompt. 00:17:49.760 |
This doesn't necessarily apply in all situations, so definitely test it out. 00:17:54.200 |
That actually is, like, a blanket statement that applies to everything that I say, but particularly 00:18:05.680 |
I don't know if you were asking all the questions or... 00:18:12.680 |
If there is some difference to the exclamation mark here, I noticed that it's... 00:18:27.680 |
The exclamation marks, just as they emphasize things for humans, they also emphasize things 00:18:33.680 |
Do you think that has more of an effect over the numbers or less, like, in terms of relevant? 00:18:42.680 |
I don't know at that level of detail, and I think it's dependent on context as well. 00:18:45.680 |
But yeah, if you want to emphasize things, like, capitalizing them or putting exclamation 00:18:49.680 |
marks or, like, just saying, "This is extremely important," that all does do something. 00:18:56.680 |
So, the tokenizer, though, I'm taking a look at some of your tokenizer codes, but it doesn't 00:19:08.680 |
seem like exclamation points actually, like, do anything, really. 00:19:12.680 |
Like, the tokenizer kind of binds them into the word. 00:19:19.680 |
Just anecdotally, if you put exclamation marks in, it's different. 00:19:38.680 |
In general, for translations or multilingual output, is it better to instruct in English or in the target language? 00:19:43.680 |
I think it's better to instruct in the native language if you speak the native language. 00:19:48.680 |
If you only speak English and you're choosing between, like, a better prompt that's written 00:19:53.680 |
in English versus, like, a worse prompt that's written in a language that you don't understand, 00:19:58.680 |
I would probably default to writing it in the language that I knew super well. 00:20:02.680 |
But ideally, I think for the ideal prompt, you would find a native speaker of the language 00:20:07.680 |
and explain your use case to them and have them write the prompt. 00:20:11.680 |
So, is that not the same question that I just answered? 00:20:34.680 |
I think it's better to have the prompt in that language. 00:20:58.680 |
Responsible for improving unit tests based on a set of requirements. 00:21:04.680 |
Don't include any other explanation, conversion, or output besides the JSON. 00:21:10.680 |
This is great because it's going to let me show off prefills. 00:21:28.680 |
I have to use double brackets instead of single brackets to get variables in the console. 00:21:46.680 |
Is that like an example in this problem template? 00:21:51.680 |
So, basically, we have one of you to write the unit tests and then... 00:22:01.680 |
Like, almost all of this is just like workarounds for the fact that it doesn't always speak JSON 00:22:03.680 |
Like, you can see how many times we said that. 00:22:25.680 |
And then, uh, do you have an example input here? 00:22:38.680 |
So that first comment is from the unit test writer. 00:22:42.680 |
Like, the unit test writer writes a bunch of unit tests. 00:22:45.680 |
And then the reviewer reviews it and makes them better. 00:22:56.680 |
This is a good result where it writes JSON, which basically says, cool. 00:23:00.680 |
Update this file with these unit tests and here's the modifications I made. 00:23:06.680 |
So in this template here, where would the thing that I just copied go? 00:23:12.680 |
Well, so essentially we don't provide it in line with the prompt. 00:23:18.680 |
We just provide the conversation and then this thing jumps in as a separate agent. 00:23:25.680 |
So, like, the context window is gonna have the unit tests in it. 00:23:29.680 |
But we're saying respond in this format given the unit tests that are earlier in the conversation that you're picking up. 00:23:35.680 |
So the thing that you just pasted is, like, step three of a multi-shot conversation? 00:23:43.680 |
The thing I just pasted was two shots, right? 00:23:45.680 |
Unit test writer and then a unit test reviewer. 00:23:47.680 |
And the reviewer is the one that's having the problem. 00:23:57.680 |
Here are some unit tests written by a unit test writer bot. 00:24:30.680 |
We do this in sort of this larger conversation, not just as sort of a standalone prompt for this one agent. 00:24:37.680 |
And then for project path, what sort of thing should I put there? 00:24:41.680 |
I mean, this is just like the local directory that's gonna be modified. 00:24:46.680 |
And so it actually has access to the files in that directory, and it'll fill in its own sort of what files to modify. 00:24:57.680 |
Okay, so let's see if it comes out with the JSON or not. 00:25:12.680 |
We did do most of our tests on Claude 3 and not 3.5, so 3.5 is probably a little better. 00:25:18.680 |
Yeah, I mean, if it makes it more realistic, we could also switch the model version to use... 00:25:27.680 |
Is the temperature being set to zero intentional? 00:25:29.680 |
Yeah, I usually, for knowledge work, I usually have the temperature set to zero. 00:25:38.680 |
If you're using temperature zero, you'll probably get, like, marginally fewer hallucinations. 00:25:58.680 |
So it looks like in this case it did output JSON. 00:26:07.680 |
I guess that explains why it was taking so long to... 00:26:11.680 |
Looks like actually it even ran into the max output tokens because it didn't finish its JSON. 00:26:18.680 |
Since this one is kind of going kind of slow, I will test it with Haiku. 00:26:23.680 |
And let's also increase the max tokens to sample so that it doesn't run into that issue. 00:26:32.680 |
So what I'm really hoping is to get a case that doesn't output JSON so that then I can fix 00:26:41.680 |
If not, I can still say, like, how I would fix it. 00:26:45.680 |
Just honestly, any comments you have just on how we structured things. 00:26:49.680 |
So, I mean, this is, like, definitely, like, a big request from people is, like, how do 00:26:56.680 |
The most reliable way to do that, I feel, is using the assistant pre-fill. 00:27:02.680 |
So, maybe some of you have used this feature before. 00:27:06.680 |
Maybe some of you have, like, only used other models such as GPT that don't offer this feature. 00:27:11.680 |
Something that you can do in the Claude API is partially pre-fill an assistant message. 00:27:17.680 |
So, what you're doing there is you're putting some words in Claude's mouth, as we call it. 00:27:23.680 |
And then when Claude continues from there, it assumes that it's already said whatever you 00:27:30.680 |
And that can help you get it on the right path. 00:27:34.680 |
So, for instance, in this case, if we want to make sure... 00:27:37.680 |
So, the classic, like, bad response from Claude, when people give it prompts like this, where 00:27:42.680 |
they want to get JSON, is Claude would say something like, I'll just, like, add another message, just 00:27:56.680 |
And this part right here is, like, very annoying. 00:28:07.680 |
Let me actually just give, like, the simplest one. 00:28:10.680 |
They both, though, they both require a tiny bit of post-processing at the end. 00:28:18.680 |
Let's actually, like, take out all this stuff about make sure to only respond in JSON. 00:28:23.680 |
That could be one way to get it to not do the... 00:28:39.680 |
So, here now, maybe it will do the preamble thing that we don't want it to do. 00:28:46.680 |
So, an easy way to get it to not do that is to just take this and then put it in the pre-fill. 00:29:14.680 |
Like a child who's just, like, misbehaving and it wants to do something and you're like, 00:29:19.680 |
don't do the thing, but it just keeps doing it because it just loves preamble so much and 00:29:26.680 |
So, one way is to, like, argue with it a lot. 00:29:29.680 |
But, like, if you have a kid, sometimes, you know, you just have to, like, let them do the 00:29:32.680 |
thing that they want to do and then they'll get over it. 00:29:34.680 |
So, in this case, that's basically what we did. 00:29:37.680 |
We just gave Claude this pre-fill where we let it do the thing. 00:29:41.680 |
So, as far as it's concerned, it already did the thing. 00:29:44.680 |
And then from there, what it's outputting is JSON. 00:29:48.680 |
Now, if you want to make this even more reliable, you can put this nice little bracket here. 00:30:01.680 |
So, at this point, it's definitely not going to do the preamble. 00:30:04.680 |
The only thing here is if you sample with this pre-fill, you will need to add the bracket 00:30:13.680 |
back before you try to do your JSON.loads or what have you. 00:30:18.680 |
Since you told it that it had already said the opening bracket, it's not going to give you that bracket back in its output. 00:30:25.680 |
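Here is roughly what that looks like against the Messages API, assuming the Python SDK; the model name, max_tokens, and the placeholder reviewer prompt are all stand-ins rather than the exact values used on stage.

```python
import json
import anthropic

client = anthropic.Anthropic()

reviewer_prompt = "..."  # the unit-test reviewer prompt would go here

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # assumption; use whichever model you target
    max_tokens=2048,
    temperature=0,  # per the earlier aside: temperature 0 for knowledge work
    messages=[
        {"role": "user", "content": reviewer_prompt},
        # The prefill: words put in Claude's mouth. Claude continues from here,
        # so no preamble is possible and the output starts mid-JSON.
        {"role": "assistant", "content": "{"},
    ],
)

# Claude's completion starts *after* the prefill, so add the brace back before parsing.
result = json.loads("{" + response.content[0].text)
```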
So, then another thing that you can do is return the JSON in JSON tags. 00:30:33.680 |
And then, if we do this without the pre-fill, let's try it without the pre-fill. 00:30:53.680 |
For all I know, that's just like an English word. 00:31:09.680 |
And then, at the end, it closed this JSON tag. 00:31:13.680 |
So, again, this requires, like, the tiniest smidgen of post-processing, where you're saying, like, 00:31:23.680 |
You're just, like, take everything within the JSON tags and then use that. 00:31:37.680 |
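That post-processing step might look something like this sketch; it assumes exactly one <json>...</json> block in the output.

```python
import json

def extract_json_block(raw: str) -> str:
    """Return whatever sits between the <json> tags, ignoring preamble or trailing text."""
    start = raw.index("<json>") + len("<json>")
    end = raw.index("</json>", start)
    return raw[start:end].strip()

# data = json.loads(extract_json_block(response.content[0].text))
```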
And now what we'll see is it will just give the JSON minus the bracket, and then it will 00:31:41.680 |
close the bracket, and then it will close with the JSON tag. 00:31:48.680 |
At first, I was, like, a little bit panicked because I didn't see the close JSON tag at the 00:32:01.680 |
But then I saw that it actually did include the tag up here, and then it gave this little explanation 00:32:09.680 |
This will save you some time and tokens and trouble. 00:32:14.680 |
One thing we could do, like, it costs you money to get Claude to output all this stuff. 00:32:24.680 |
In most cases, you don't need the explanation. 00:32:27.680 |
So, one thing we could do is we could say, do not include any explanation after the JSON, 00:32:41.680 |
I'm just, like, this is actually meant to be my parody of, like, what a frustrated prompt 00:32:46.680 |
engineer would write if they were, like, couldn't get rid of this. 00:32:49.680 |
But in practice, you might not need to do that. 00:32:55.680 |
And we're getting outside the realm of prompt engineering for a second and into the world 00:33:01.680 |
There's a parameter that's called stop sequences. 00:33:05.680 |
So, we told it to return the JSON and JSON tags, right? 00:33:09.680 |
So, there's no functionality to do this in the console, so I can't show it off at this 00:33:14.680 |
But in the API, there's a parameter called stop sequences. 00:33:17.680 |
And if you add this close JSON tag that I've highlighted with my mouse, if you add that to the stop 00:33:26.680 |
sequences, then it will just hard stop after it outputs those, and you don't even have to 00:33:30.680 |
worry about telling it not to continue from there because it just won't even sample from there. 00:33:36.680 |
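In the API, that might look like the following sketch; the model name is an assumption, and the prefill line is optional but pairs nicely with the stop sequence.

```python
import anthropic

client = anthropic.Anthropic()

prompt = "..."  # a prompt that asks for the answer inside <json> tags

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # assumption
    max_tokens=2048,
    # Sampling hard-stops the moment Claude emits this string,
    # so nothing can follow the JSON.
    stop_sequences=["</json>"],
    messages=[
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": "<json>"},  # optional prefill: output starts right at the JSON
    ],
)

raw_json = response.content[0].text  # everything produced before the stop sequence
```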
So, one of the things that I'm sort of hoping to impart with this talk is that a lot of times 00:33:43.680 |
it's cheaper and easier to do a little bit of work outside of a call to the LLM and not even worry 00:33:51.680 |
about prompting because prompting can feel sometimes like non-deterministic. 00:33:56.680 |
You don't know what the model is going to do. 00:33:58.680 |
So, when you can offload stuff to code, especially if the code is really easy to write, it's like, 00:34:03.680 |
Like, don't put a bunch of stuff in the prompt about you must output JSON, just use the prefill, 00:34:11.680 |
You know, don't add a bunch of stuff about how you have to stop after you say a certain 00:34:17.680 |
So, like, simple is better, and falling back on code is better than relying on prompts. 00:34:31.680 |
What you do is you include an assistant message as the last message in the messages list. 00:34:37.680 |
And when I say an assistant message, I just mean a message where the role is set to assistant. 00:34:43.680 |
And what would have happened if you had the text that you put in the prefill, 00:34:48.680 |
you just put it into the last line of the instructions? 00:34:51.680 |
So, in other words, if I said -- I'm actually not sure. 00:34:58.680 |
I genuinely do not know how Claude will respond to this. 00:35:07.680 |
So, it looks like what it did was -- it looks to me like what it did without looking at this 00:35:17.680 |
JSON is it included an additional open bracket. 00:35:22.680 |
Because it's supposed to already have started with an open bracket, but here it started with 00:35:29.680 |
Anyways, I don't recommend doing this, but that was fun just out of curiosity's sake. 00:35:35.680 |
So, after the sentence, we can just write it like -- you wrote, write -- written the JSON, 00:35:50.680 |
written the response in the JSON format, and then in the next line, you can just write JSON, 00:36:11.680 |
I think there's like a lot of ways to accomplish this. 00:36:15.680 |
I think the ways that I showed are the most reliable. 00:36:17.680 |
So, that's what I would like officially recommend. 00:36:23.680 |
If you were going to like try to use this for production or whatever, what -- like these 00:36:31.680 |
exact kind of things you're playing around with right now, how would you think about testing 00:36:39.680 |
Yeah, more than like one -- like the one-shot test we just did, right? 00:36:44.680 |
To test it at scale, you need a bunch of test cases. 00:36:49.680 |
And if you don't have test cases, okay, this is maybe a good time to maybe show off this 00:36:58.680 |
thing, although I'm actually not sure if it will work. 00:37:00.680 |
So, I guess a more pointed question is like in this case, I think test cases are useful 00:37:06.680 |
when I'm writing a prompt to deduce whether, like, does asking it to think step 00:37:11.680 |
by step lead to this thing being more accurate. 00:37:13.680 |
But in this case for formatting, I guess what I'm wondering is like could you have this prompt 00:37:19.680 |
and then feed in the output and the prompt and then ask the model itself to evaluate like 00:37:25.680 |
how good these various things are at following the instructions? 00:37:31.680 |
Yeah, especially for formatting related things. 00:37:34.680 |
For formatting, I would not model grade the outputs. 00:37:38.680 |
Because formatting is something that I can check in code. 00:37:40.680 |
So, if I can do anything in code and I don't have to call the LLM, the LLM is like this 00:37:45.680 |
It's like if I don't need to like make this pilgrimage to the Oracle and like ask it, I'd rather 00:37:51.680 |
So, formatting specifically, we're kind of like in luck. 00:37:57.680 |
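A sketch of what "check it in code" can mean here: a deterministic pass/fail you can run over every test case without another model call. The <json> tag convention follows the earlier example.

```python
import json

def format_is_valid(raw: str) -> bool:
    """True if the output contains a <json> block that parses as JSON."""
    if "<json>" not in raw or "</json>" not in raw:
        return False
    body = raw.split("<json>", 1)[1].split("</json>", 1)[0]
    try:
        json.loads(body)
    except json.JSONDecodeError:
        return False
    return True

# pass_rate = sum(format_is_valid(o) for o in outputs) / len(outputs)
```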
For something like the other, the previous prompt we looked at where the outputs are a lot more 00:38:01.680 |
squishy, possibly model grading could work. 00:38:05.680 |
Possibly we might need a human to evaluate the answers. 00:38:11.680 |
I'm wondering like I actually put an example in the Slack channel. 00:38:12.680 |
We don't need to get to it because we're talking through it now. 00:38:14.680 |
But like for, let's imagine I don't have, or actually maybe tags are the answer to this. 00:38:20.680 |
Like imagine I'm asking for a summary or something and then I want to deduce whether there's 00:38:25.680 |
additional chat like content like before or after that. 00:38:28.680 |
In that case would you, I would have, my mental model would have been to use like an LLM as 00:38:34.680 |
a grader, but it sounds like maybe would you encourage instead using the summary tags and 00:38:38.680 |
checking like hard coding for additional text around that? 00:38:43.680 |
I think that will be pretty quick and easy to do. 00:38:48.680 |
Also just having the summary be in summary tags is like generally a good practice. 00:38:53.680 |
I generally have all my outputs inside tags to make it really easy to extract them. 00:38:56.680 |
I don't think there's really any downside to doing that. 00:38:59.680 |
So, and it might even be that by doing that you effectively fix your entire issue and you 00:39:07.680 |
Or, and you just put close summary in the stop sequences and you're kind of good to go. 00:39:13.680 |
But that also does sound like a problem that an LLM could grade. 00:39:26.680 |
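For the summary case, the hard-coded check could be as small as this sketch; the tag name is taken from the suggestion above.

```python
def split_summary(raw: str) -> tuple[str, bool]:
    """Return the text inside <summary> tags and whether any chat appeared outside them."""
    before, _, rest = raw.partition("<summary>")
    summary, _, after = rest.partition("</summary>")
    has_extra_text = bool(before.strip()) or bool(after.strip())
    return summary.strip(), has_extra_text
```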
Here's a very poorly formatted Excel spreadsheet. 00:39:32.680 |
So, um, this seems like a really ridiculously like powerful attack vector. 00:39:41.680 |
Um, I don't want to get into too much like jail breaking stuff here. 00:39:59.680 |
Here's a poorly formatted Excel spreadsheet slash CSV. 00:40:20.680 |
Because I don't know how to get the CSV into the, uh, into the console. 00:40:32.680 |
Again, I've been trying to use Claude to extract some information from spreadsheets. 00:40:40.680 |
It hallucinates a lot or it skips a lot of stuff. 00:40:43.680 |
And I was wondering if you, maybe more generally, how do you have Claude analyze really poorly 00:40:49.680 |
formatted spreadsheets to sometimes the different clusters or multiple data sets in the same sheet 00:40:57.680 |
I'll try to answer the general question of having, uh, analyzing poorly formatted spreadsheets. 00:41:04.680 |
The first thing that came to mind, especially when you're talking about how the spreadsheets are very big, 00:41:09.680 |
is breaking the problem down into, so, so give it, like, fewer spreadsheets at a time. 00:41:15.680 |
Give it fewer columns of the spreadsheet at a time. 00:41:17.680 |
Only give it the, the columns that it needs to, to work with. 00:41:21.680 |
Um, and then make the questions sort of smaller and more bite-sized. 00:41:26.680 |
And then tackle it that way by, by breaking it down. 00:41:29.680 |
So, at that level of generality, that would be my answer here. 00:41:36.680 |
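One way to do that breakdown outside the prompt, sketched with pandas; the file name, column names, and chunk size are all hypothetical.

```python
import pandas as pd

df = pd.read_csv("messy_export.csv")

# Keep only the columns the question actually needs, and drop fully empty rows.
relevant = df[["sample_id", "result"]].dropna(how="all")

# Send the sheet to Claude in small, self-contained chunks with one narrow question each.
chunk_size = 50
for start in range(0, len(relevant), chunk_size):
    chunk = relevant.iloc[start : start + chunk_size]
    question = (
        "Here is part of a spreadsheet:\n\n"
        + chunk.to_csv(index=False)
        + "\nList every sample_id whose result indicates a failure."
    )
    # ...send `question` as its own small request and collect the answers in code...
```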
Um, I, I, I'd also be curious to look at this one more specifically. 00:41:41.680 |
Right now I'm just struggling with how to, like, copy the text and put it into the tool. 00:41:45.680 |
That's what I did, but it keeps downloading it. 00:41:50.680 |
I guess I can, um, the next, sorry, the next tab. 00:42:00.680 |
Sorry, I don't want to click open my downloads. 00:42:03.680 |
I'm scared I'm going to, like, reveal some private information. 00:42:05.680 |
This is my work computer, so, I want to just do it all on the browser. 00:42:28.680 |
Uh, generally we would recommend putting the, I like this one, it's short. 00:42:34.680 |
Uh, we would recommend putting the instructions after the document. 00:42:39.680 |
That's similar to the question that was asked about should we have the information first 00:42:46.680 |
Particularly with long documents, it's a little bit better to give the instructions at the end. 00:42:56.680 |
Uh, let's just, like, clean up the grammar a little bit. 00:43:09.680 |
Probably good to give an example, uh, of the format. 00:43:17.680 |
Uh, so we could do, like, uh, I guess we can just do a return, like, a list. 00:43:31.680 |
Is there a special reason that you wanted it to be a JSON array or is it just to make it 00:43:41.680 |
So let's say return in -- I'm just a huge fan of these tags, so let's do it like this. 00:43:53.680 |
Okay, so that is -- that's -- that's some stuff without adding examples. 00:44:22.680 |
The other thing that I would want to do is to give some illustrative examples of what 00:44:36.680 |
Uh, I put an -- I used a recent one from the rocket. 00:45:01.680 |
Uh, what if they write cringe tweets about our product? 00:45:31.680 |
Uh, I think they are adequate, but not engaging. 00:45:35.680 |
So, we can try to make it more engaging without making it cringe. 00:45:52.680 |
So, let's say -- don't give you hashtags, but cringe. 00:46:01.680 |
Are these meant to be tweeted from the Anthropic Twitter account? 00:46:05.680 |
Or from, like, the AI influencer Twitter account? 00:46:33.680 |
Is it better to break it up into small sentences in the prompt, or can you use complex sentences? 00:46:42.680 |
Is it better to break up sentences -- to use, like, small sentences in the prompt, or big 00:46:49.680 |
I think, generally, in English writing, it's better to use small sentences and small words. 00:46:56.680 |
So, I think it's probably also better to do that in a prompt. 00:47:00.680 |
I think it's fine to use big words if you are really sure you know what you're doing, and 00:47:04.680 |
you know that it's the exact right word for the situation. 00:47:08.680 |
Sometimes, I'll find myself using more academic language if I want the output to seem a bit 00:47:15.680 |
Generally, I think, simple, small sentences is better. 00:47:22.680 |
So, these are maybe a little bit more engaging. 00:47:39.680 |
It's got exclamation points and question marks. 00:47:42.680 |
Do you want it to be even more engaging or something? 00:47:47.680 |
So, I honestly think temperature is a bit overrated, maybe. 00:48:12.680 |
I'm not sure exactly how to distinguish these from the previous ones. 00:48:17.680 |
They look kind of similar to me, from the ones with temperature one, or temperature zero. 00:48:33.680 |
So, what I was going to say is, I think this is roughly as far as you can take this without 00:48:38.680 |
I think the best thing to improve this prompt would be either examples of the sort of tweets 00:48:44.680 |
that you want, or even an entire other document, an example of tweets that go with that document, 00:48:55.680 |
So, if you're cost limited, maybe you don't want to put in all those input tokens every time. 00:49:03.680 |
But, I don't know, the models are pretty cheap now, and we don't need to generate that many 00:49:09.680 |
So, if they have, like, any economic value to you at all, it's probably pretty cost effective 00:49:18.680 |
But it's more work on your part, because what you're doing then is... 00:49:23.680 |
So, the way that I would actually do this is, I would start out with some document. 00:49:31.680 |
I would take the ones that I liked, and maybe I would write some more, or get, like, my 00:49:39.680 |
Claude generate a hundred tweets, and then I would take the seven that I liked best, and 00:49:47.680 |
And then, from there, I would sample, okay, now here's another document, and then write a 00:49:54.680 |
And what I would do is iteratively build up this list of documents plus example tweets, 00:50:09.680 |
So, it could be, like, you know, system prompt, "You are an AI influencer who writes engaging 00:50:20.680 |
social media content about new models and releases." 00:50:28.680 |
It could be, like, here are some example documents along with the tweets you wrote about them. 00:50:49.680 |
I'm gonna write this, but you would actually put the literal text of the document here. 00:51:03.680 |
And here, again, you'd put a literal tweet here. 00:51:04.680 |
And this could either be something that you wrote or something that Claude wrote, or, you 00:51:08.680 |
know, something that Claude wrote and then you edited. 00:51:10.680 |
Like, a lot of times, Claude might give you an example that's not perfect, but it's close 00:51:14.680 |
enough, and then you'll change it a little bit to make it perfect. 00:51:19.680 |
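Assembled as a single prompt, that structure might look like this sketch; DOCUMENT_1, TWEET_1, and so on are placeholders for the literal documents and curated tweets just described, and the system prompt line is quoted from above.

```python
system_prompt = (
    "You are an AI influencer who writes engaging social media content "
    "about new models and releases."
)

examples_block = """<examples>
<example>
<document>
{{DOCUMENT_1}}
</document>
<tweet>
{{TWEET_1}}
</tweet>
</example>
<example>
<document>
{{DOCUMENT_2}}
</document>
<tweet>
{{TWEET_2}}
</tweet>
</example>
</examples>"""

user_prompt = (
    "Here are some example documents along with the tweets you wrote about them.\n\n"
    + examples_block
    + "\n\nHere is a new document:\n\n<document>\n{{NEW_DOCUMENT}}\n</document>\n\n"
    + "Write a tweet about this document in the same style as the examples."
)
```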
I have honestly given multi-shot examples pretty short shrift in this talk so far relative to 00:51:29.680 |
Like, I think that, in reality, most of the gains, most of the effort, most of the gains 00:51:36.680 |
of writing a good prompt is literally just picking the perfect document that goes here, 00:51:42.680 |
picking the perfect set of tweets that go here, altering and changing them to modulate the 00:51:48.680 |
In some ways, that's more important than, like, everything else that I've said combined. 00:51:54.680 |
Like, another way to do the whole JSON thing would just be, like, with examples of Claude 00:51:59.680 |
The JSON one is maybe an exception because the prefill approach works so well there along with 00:52:05.680 |
the tags, but for anything else, the examples are really huge. 00:52:09.680 |
For few-shot prompting, do you prefer to put those all in one response like this? 00:52:21.680 |
Or do you find further success with an exchange of messages between the agent and the user where 00:52:27.680 |
you're putting your few-shot prompts in there? 00:52:30.680 |
Really good question and something that I would dearly love to know the answer to, but I don't. 00:52:36.680 |
The question is -- I don't need to repeat the questions. 00:52:42.680 |
So, do we want to just put all the examples in one big giant examples block like this? 00:52:46.680 |
Or do we want to structure the examples as a dialogue where the human says something and 00:52:53.680 |
And we're literally, like, putting a large number of messages into the messages list. 00:52:57.680 |
I typically do it this way with a big examples block, but it's mostly because it's less work 00:53:04.680 |
for me, and I don't have any evidence that this works either better or worse. 00:53:08.680 |
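For reference, a sketch of the other structure from the question: each example becomes a user/assistant exchange in the messages list, with the real document as the final user turn. As noted, there is no strong evidence either way; the placeholders are hypothetical.

```python
messages = [
    {"role": "user", "content": "<document>\n{{DOCUMENT_1}}\n</document>\nWrite a tweet about this document."},
    {"role": "assistant", "content": "{{TWEET_1}}"},
    {"role": "user", "content": "<document>\n{{DOCUMENT_2}}\n</document>\nWrite a tweet about this document."},
    {"role": "assistant", "content": "{{TWEET_2}}"},
    {"role": "user", "content": "<document>\n{{NEW_DOCUMENT}}\n</document>\nWrite a tweet about this document."},
]
# response = client.messages.create(model=..., max_tokens=..., system=system_prompt, messages=messages)
```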
I did do some testing of this at one point on a few data sets, and I found that it didn't 00:53:12.680 |
make much of a difference for my particular case. 00:53:15.680 |
But there's a lot of, like, little particulars that went into my testing that make me not 00:53:22.680 |
So, sorry for a bit of an unsatisfying answer here. 00:53:25.680 |
I'll just say, I don't think -- if it is wrong to do one giant examples block, I don't think 00:53:34.680 |
So, like, in here, would you give it a thing and say, like, this would be bad because this 00:53:42.680 |
I think it's good to include negative examples. 00:53:44.680 |
Particularly around, like, the cringe thing where Claude might mess up. 00:53:49.680 |
I think just negative examples on their own don't usually get you there. 00:53:58.680 |
But I think it's great to have, especially, like, contrasting pairs. 00:54:03.680 |
Like, here's a cringe tweet about this document. 00:54:06.680 |
Here's an excellent tweet about the same document. 00:54:15.680 |
And then if you also include, like, the reasoning for it, right? 00:54:18.680 |
So, like, if it was a cringe tweet, it has, like, a little reasoning of, like, why? 00:54:23.680 |
Do you also, do you trust that reasoning for the model? 00:54:28.680 |
So, like, if you ask it, like, hey, give me, like, what were you thinking when you were writing 00:54:35.680 |
When you're reading through your examples to choose the best ones, how much do you trust 00:54:40.680 |
that reasoning and how much do you rely on that versus just, like, I just care about, like, 00:54:47.680 |
Especially if it's after something the model already said. 00:54:52.680 |
But, I mean, humans are not very good at explaining why we do the things that we do. 00:54:56.680 |
We're really good at rationalizing and coming up with, like, fake reasons, but a lot of times 00:55:01.680 |
we don't even know why we do the things that we did, let alone be able to coherently explain 00:55:12.680 |
So, something that does work pretty well is having the model think about its reasoning 00:55:16.680 |
in advance and, like, go through different reasons or rationales for why it might choose 00:55:20.680 |
one option or the other or think about what sort of things might go into good response. 00:55:25.680 |
So, if I had the model do some thinking in advance before it gave the response, then I 00:55:30.680 |
might just trust or assume that the response would be better. 00:55:35.680 |
Having a bunch of explanation for why I did the thing after, probably, I would not trust 00:55:43.680 |
Do you, do you give your own reasoning that explains the examples? 00:55:55.680 |
And if so, how do you make sure that the model doesn't get reasoning in advance? 00:55:59.680 |
I do a lot of giving, giving reasoning to explain the examples. 00:56:02.680 |
So, for instance, just in this case, one thing that we could do here is, like, we could 00:56:10.680 |
add something like, I was going to say tweet planning, but maybe it's, like, key points of 00:56:19.680 |
And then here we have some key points, like the document presents the launch of. 00:56:41.680 |
Sorry, this is part of the example right here. 00:56:44.680 |
So, I, in this particular example, in this example block, I gave it a document. 00:56:49.680 |
Now I'm doing this, this key points business. 00:56:57.680 |
Now this key points could be something that I wrote myself, or it could be something that 00:57:06.680 |
Or if Claude did a perfect job, maybe I could just include the thing that Claude wrote. 00:57:09.680 |
But now in order to get this, get Claude to do this, we would also say something like, 00:57:22.680 |
So this is, like, a lightweight chain of thought where we're having the model do some thinking 00:57:30.680 |
And we also gave it examples of it doing the thinking in advance, like this. 00:57:45.680 |
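Put together, one of those examples plus the matching instruction might look like this sketch; the key-points wording and placeholder names are illustrative.

```python
example = """<example>
<document>
{{DOCUMENT_1}}
</document>
<key_points>
- The document announces the launch of a new model.
- The headline improvement is speed at a lower price.
</key_points>
<tweet>
{{TWEET_1}}
</tweet>
</example>"""

instructions = (
    "First, list the key points of the document inside <key_points> tags. "
    "Then write the tweet inside <tweet> tags."
)
```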
So, let's imagine we, like, really want to give examples like this. 00:57:48.680 |
But we have a problem, which is that our documents are, like, super long. 00:57:51.680 |
And I'm greedy and want to save on input tokens. 00:57:54.680 |
Would you err on the side of doing, like, one document but a really good example? 00:57:59.680 |
Or doing, like, truncated versions of more documents? 00:58:05.680 |
I would err on the side of one extremely good example and not truncated versions of more documents. 00:58:11.680 |
But I would also want to look at the outputs and test that assumption because it's possible 00:58:17.680 |
that with only one example, Claude would fixate on aspects of the exact document that it uploaded 00:58:22.680 |
and start trying to transfer them to your document. 00:58:27.680 |
So, I think it's one of those, it would be case by case. 00:58:30.680 |
But I would want to start with, like, having one extremely good example. 00:58:33.680 |
Generally, I think that, like, less but, like, higher quality is a better way to go than, 00:58:47.680 |
I was hoping we would get some, like, persona ones here. 00:58:55.680 |
So, this looks like something where we're trying to get Claude to roleplay in these different 00:59:08.680 |
So, let's try this out and let's see how it works. 00:59:12.680 |
So, this looks something where we're going to have, like, a... 00:59:15.680 |
This looks like it's, like, meant to be a multi-turn prompt, right? 00:59:32.680 |
So, basically, at the top, you see three roles at the top. 00:59:36.680 |
And then you can decide who you want to talk to. 00:59:44.680 |
And then if you do that P assist, you see down there that's highlighted in yellow. 00:59:49.680 |
You just do that little arrow P and then you can pick a different persona. 00:59:56.680 |
And then you can have them talk between themselves or you can just switch. 00:59:58.680 |
We use this for designers in our shop to do synthetic interviews with synthetic users, basically. 01:00:08.680 |
And then what issues or troubles have you been having with this? 01:00:13.680 |
I'm guessing you have seen a lot of role-playing prompts out there. 01:00:17.680 |
So, I was just wondering if you see anything that's perhaps not as optimized as it could be or any other best practices for role-playing, particularly with multiple synthetic personas within the same session. 01:00:31.680 |
For single personas, there's one answer that I would give. 01:00:36.680 |
This multiple personas thing, actually, I haven't worked that much with. 01:00:39.680 |
But off the top of my head, here is probably how I would think about it. 01:00:44.680 |
I would give all the personas to -- I would write a separate prompt for each persona and then I would have the user's command trigger some coding logic where it would decide which bot to send the -- which prompt to send that reply to. 01:01:00.680 |
So, this is getting back to the thing that I said before about, like, don't do it in a prompt if you don't have to. 01:01:04.680 |
Like, this -- I mean, this prompt, like, is -- like, there's a lot of thought that went into it, which is -- probably makes it work a lot better than it would have if you hadn't put as much effort into it. 01:01:13.680 |
But I think it's going to be easier if you just dynamically route the query based on what the user said. 01:01:23.680 |
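A rough sketch of what that routing could look like in code; the arrow-P switch command and persona names are loosely borrowed from the prompt being discussed, the per-persona system prompts are placeholders, and the model name is an assumption.

```python
import anthropic

client = anthropic.Anthropic()

PERSONA_PROMPTS = {
    "sam": "You are Sam, a patient who optimizes for cost savings when buying medication...",
    "joe": "You are Joe, a patient who optimizes for convenience over cost savings...",
}

state = {"persona": "sam", "history": []}

def handle_turn(user_message: str):
    # The persona switch is handled in code, never inside the prompt.
    if user_message.startswith("->P"):
        state["persona"] = user_message.removeprefix("->P").strip().lower()
        return None
    state["history"].append({"role": "user", "content": user_message})
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",       # assumption
        max_tokens=300,
        system=PERSONA_PROMPTS[state["persona"]],  # only the active persona's prompt is sent
        messages=state["history"],
    )
    reply = response.content[0].text
    state["history"].append({"role": "assistant", "content": reply})
    return reply
```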
You're talking about, like, if you were to use the API instead of just the chat -- 01:01:32.680 |
This was just in the chat, but I appreciate -- I definitely appreciate the note there. 01:01:37.680 |
So, maybe related to that, one of the other things is how much have you dealt with having a second thread with the API that acts as maybe the entity that's capturing inputs from multiple ones 01:01:54.680 |
Like, let's say that I build an app, and I have the user interact with these different synthetic personas, but then I have a second interaction with the API that's tying these things together into a cohesive whole. 01:02:08.680 |
I don't know if you guys have explored some of that. 01:02:18.680 |
I do want to kind of test this prompt out, though, just to kind of see how it goes. 01:02:22.680 |
So, maybe here I would say -- I can just say something like -- so, how would I switch it? 01:02:37.680 |
You could say, hey, how do you -- what's your process to look for the right -- for the best medication pricing whenever you get sick or something like that? 01:02:51.680 |
And then here in this particular case, if you switch to Joe, Joe is optimized more for convenience versus cost savings. 01:02:59.680 |
So, you have two different types of users and we can learn from. 01:03:04.680 |
So, Claude did the thing here that I want to show you all how to get rid of. 01:03:10.680 |
As Sam, it's like -- that's not something that Sam would say, right? 01:03:16.680 |
So, I don't know for sure this is going to work. 01:03:20.680 |
I feel like a magician that's about to do a trick but, like, I haven't practiced it. 01:03:24.680 |
But, generally, something that is pretty useful here is to -- we could say, prepend each response with the name of the current persona in brackets. 01:03:39.680 |
So, one thing I'm going to do here is I'm going to change this, like, multi-shot a little bit also because if Claude sees itself not doing the thing that I told it to do -- actually, let's just redo the whole conversation. 01:04:11.680 |
And now we could say -- and now we could say the same thing, like, what's your process for finding best prices for medication? 01:04:51.680 |
This is, like, something that a human might maybe say. 01:05:06.680 |
You don't need to say too much about your persona and your responses. 01:05:22.680 |
What are your thoughts on using things in the negative sense versus -- 01:05:36.680 |
What are your thoughts on using, like, negative stuff, like, you don't versus the positive sense? 01:05:41.680 |
I think positive is, like, a little bit better. 01:05:43.680 |
In this case, I don't really have a good answer for why I phrased this negatively. 01:05:53.680 |
I guess I think it's better to use, like, a light touch. 01:05:58.680 |
Like, I think there's, like, a little bit of a thing going on with reverse psychology where 01:06:01.680 |
if you tell the model, like, don't talk about elephants. 01:06:06.680 |
Definitely don't say anything about an elephant. 01:06:07.680 |
It might make it more likely to talk about an elephant. 01:06:09.680 |
So, if you do use negative prompting, I think it's better to have, like, a light touch 01:06:13.680 |
where you just kind of say it once but, like, don't dwell on it too much. 01:06:20.680 |
It's like if you don't want your kid to eat prunes, you're just like, oh, we're not having 01:06:22.680 |
prunes today, and then you just change the subject. 01:06:24.680 |
But if you really, like, emphasize that there are, like, no prunes to be had, then you might 01:06:32.680 |
I noticed you're not using the system prompt much. 01:06:36.680 |
Or what do you think the biggest value items for a system prompt are? 01:06:41.680 |
Personally, the only thing that I ever put in the system prompt is a role. 01:06:45.680 |
So I might say, like, you are this, you are that. 01:06:48.680 |
I think, generally, Claude follows instructions a little bit better if they're in the human 01:06:55.680 |
The exception is things like tool use, where maybe there's been some explicit fine tuning 01:07:00.680 |
on, like, certain system prompts specifically. 01:07:03.680 |
For, like, general prompts like the ones we've been going over here, though, I don't really 01:07:07.680 |
think you need to use the system prompt very much. 01:07:12.680 |
One thing we've found when using the user prompt, I guess, sometimes is it makes it more prone 01:07:22.680 |
to hallucinations because it thinks the user is saying it. 01:07:25.680 |
And so we migrated things to the system prompt more. 01:07:28.680 |
I don't know if you have any experience with that. 01:07:35.680 |
I've heard this from enough people that I could just be wrong. 01:07:38.680 |
So I'm unusually likely to be wrong when I say this. 01:07:41.680 |
I think that if you just put a bunch of context and you're like, here's the message from the 01:07:49.680 |
It will work and you won't have that issue anymore. 01:07:59.680 |
But that would be my default is just to, like, specify even more clearly. 01:08:03.680 |
And if you're having this issue, be like, here's the message from the user. 01:08:08.680 |
And I think it probably won't get confused by that. 01:08:15.680 |
So before, in order to get it to say not cringy things, you were saying provide it with a counterexample. 01:08:23.680 |
But here in the case of where you're doing this character bot, you haven't provided it any counterexamples. 01:08:33.680 |
So if the model is trained on preference optimization with examples and counterexamples, do you get a better result in the prompting? 01:08:44.680 |
Well, I don't know that the details of the RLHF have that much bearing. 01:08:50.680 |
Because I think when the model is trained, it doesn't usually see those both in the same, like, window. 01:08:57.680 |
It's more that it's, like, some stuff that happens with, like, the RL algorithms. 01:09:02.680 |
I don't think that's necessarily the right way to think of it. 01:09:05.680 |
With counterexamples, I don't feel that I have to include them in every prompt. 01:09:08.680 |
It's just a tool that I have in my toolbox that I'd use sometimes. 01:09:12.680 |
In regards to, like, negative prompting -- I'm over here. Hi. 01:09:17.680 |
Do you think that it would be better to do negative prompting using control vectors, like what you talked about in your scaling monosemanticity paper? 01:09:28.680 |
And maybe, like, having, like, a negative version of the vector as your kind of negative prompt instead of mentioning it in the prompt outright? 01:09:38.680 |
We don't know how well it works relative to prompting. 01:09:41.680 |
I'm, like, a, you know, die-hard prompter till the end, so. 01:09:48.680 |
I haven't found it to work as well as prompting in my experience so far. 01:09:53.680 |
That said, there's, like, a lot of research improvements that I won't get into in too much detail, but there's a lot of stuff that could make it work better than -- 01:10:00.680 |
So, like, right now, it's, like, finding these features, and then you're steering according to the features, which are sort of, like, these abstractions on top of the underlying vector space. 01:10:10.680 |
There's other possibilities for how you could steer, and there's, like, academic papers that you can read where you're steering according to just, like, the differences in the activations versus, like, trying to pull it out to this feature first. 01:10:23.680 |
So, maybe that would work a bit better, like, the control vectors thing. 01:10:27.680 |
I haven't played with it enough to know for sure, but I think there's definitely, like, something along those lines will work eventually. 01:10:34.680 |
I can't say in the long term if it'll work better or worse than prompting. 01:10:37.680 |
Right now, I still think prompting works, like, a lot better. 01:10:40.680 |
I mean, from my experience with, like, smaller models and trying to work with control vectors, I've seen that it's better when it comes to style than it is for, like, actual deterministic prompting. 01:10:52.680 |
Sometimes I feel like stuff from smaller models transfers. 01:10:55.680 |
I don't have a great intuition for what does and doesn't transfer between small and large models, but, yeah, good points. 01:11:03.680 |
I think we've gone over this roleplay stuff enough. 01:11:08.680 |
I'm going to upload a few screenshots of my dating profile. 01:11:21.680 |
Actually, somebody responded in the very first in reply. 01:11:24.680 |
So, since we're doing images, maybe I'll start there. 01:11:37.680 |
Riveting to watch me scroll through this channel, I'm sure. 01:12:15.680 |
I actually don't know if I can paste it into the console. 01:12:18.680 |
So, I might fall back on using claude.ai for this. 01:12:58.680 |
So, in this image here, if we zoom in, it's supposed to get Maddie white. 01:13:15.680 |
And then in this one, it's just eight, six, seven. 01:13:23.680 |
I'm just trying to pull out kind of the average data from all three combined. 01:13:35.680 |
So, I don't know too many good tips for images. 01:13:46.680 |
One of the things that you said here was that it works better with zooming in and cropping 01:13:55.680 |
That's definitely like the easiest win that you can have is just giving it like a higher 01:14:00.680 |
quality image, taking out the unnecessary details. 01:14:03.680 |
That might be hard to do programmatically because it's like you don't know what the... 01:14:09.680 |
But for the same reason that including extraneous information in text form, you probably won't 01:14:15.680 |
If you include extraneous information in image form, the results probably won't be as good 01:14:21.680 |
So, the more that you can like narrow in on the exact information that you need, the better. 01:14:32.680 |
But definitely having like higher quality bigger images is better. 01:14:36.680 |
I did just read on your website that it says... 01:14:39.680 |
It down samples to a thousand pixels by a thousand pixels. 01:14:46.680 |
So, then any general tips on how to discuss images with Claude? 01:14:51.680 |
My number one tip for images is to start by having the model describe everything it sees in the image. 01:15:01.680 |
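A sketch of that tip with the Messages API: send the (ideally cropped) image as a base64 block and ask for a full description before the actual question. The file name and model string are assumptions.

```python
import base64
import anthropic

client = anthropic.Anthropic()

with open("tubes_cropped.png", "rb") as f:  # crop/zoom before encoding if you can
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # assumption
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {"type": "base64", "media_type": "image/png", "data": image_b64},
            },
            {
                "type": "text",
                "text": "First, describe everything you see in this image in detail. "
                        "Then answer: what value is written on each tube?",
            },
        ],
    }],
)
print(response.content[0].text)
```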
This one is hard enough for me to even read that I kind of doubt that Claude will do well 01:15:09.680 |
One thing I've noticed when I attempted that: if I asked it to go, like, tube by tube, and it came to the wrong conclusion on the first tube, it would use that and come to that same wrong conclusion on multiple tubes. 01:15:24.680 |
It made it so there was some kind of directionality in its thinking, where if it got the answer wrong at first, it would project that onto the rest of the image it was analyzing. 01:15:35.680 |
If the model starts off on the wrong path, it probably will just continue going on the wrong path. 01:15:39.680 |
It's trying really hard to be self-consistent. 01:15:45.680 |
And this is why self-correction is such a big, like, frontier for LLMs in general right now. 01:15:52.680 |
But, as a side question, why not use OCR models for this task? 01:15:56.680 |
So, I love to use those, but why not for this specific image? 01:16:01.680 |
I found that Textract and the OCR models I've played with don't do as well with some of the handwriting. 01:16:07.680 |
Like, if I zoom in on this image, Claude actually does perform pretty well on some of this fairly messy handwriting that's even hard for a human. 01:16:18.680 |
Unfortunately, I don't know how to get better results out of this other than maybe by cropping. 01:17:11.680 |
I often dump full traceback errors directly into the prompt box, or via the API. 01:17:15.680 |
It seems exceptional at not running into traceback loops. 01:17:20.680 |
I don't know, like, if that's intentional, but literally I'll just take the entire traceback, with zero context, and dump the entire thing in to Claude. 01:17:55.680 |
For example, this behavior only really appeared very recently. 01:18:00.680 |
Before, you'd have to explain things explicitly a lot. 01:18:05.680 |
I understand, but this isn't prompting; this is prompt engineering we're talking about, right? 01:18:10.680 |
So, I'm just wondering, is this a form of prompt engineering, or is it just the model? 01:18:16.680 |
Sounds more like the model being good if you're just dumping it in. 01:18:31.680 |
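As a sketch of what that workflow could look like via the API, assuming the Anthropic Python SDK; the failing function and model name are hypothetical placeholders.

```python
# Sketch: dump a full traceback into Claude with zero extra context.
import traceback
import anthropic

client = anthropic.Anthropic()

def run_my_pipeline():
    # Hypothetical stand-in for whatever code is failing.
    raise ValueError("something went wrong deep in the stack")

try:
    run_my_pipeline()
except Exception:
    tb = traceback.format_exc()  # the entire traceback, as-is
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        messages=[{"role": "user", "content": tb}],  # no framing, just the traceback
    )
    print(response.content[0].text)
```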
And generally, to anyone who's uploaded more examples, if you could just, like, put some 01:18:36.680 |
stuff in the thread, like, write some stuff about what the issue is that you're having, or, 01:18:40.680 |
like, why it's not working, that would be great. 01:18:43.680 |
This is actually a follow-up to, like, what I was doing with the translation. 01:18:48.680 |
So, basically, what I'm trying to get it to do is to actually analyze the text. 01:18:53.680 |
So, like, in this case, you know, there's the original English. 01:18:59.680 |
And I'm trying to get it to score, between one and five, how good the translation is. 01:19:03.680 |
And so, what I've been doing is adding a lot of stuff to try to get it to do sort of chain of thought. 01:19:10.680 |
But, you know, it just generally does a very bad job at scoring. 01:19:20.680 |
I mean, it would be incredibly useful if it worked. 01:19:22.680 |
And right now, it's in a place where it, like, sometimes kind of works and sometimes doesn't. 01:19:33.680 |
So, we have some English text and we have the Japanese translation. 01:19:53.680 |
So, is this a good translation or a bad translation? 01:19:59.680 |
Now, Claude's actually supposed to be good at Japanese, too. 01:20:05.680 |
But I'm saying, if Claude is good at Japanese, it should be good at, like, judging other people's translations. 01:20:16.680 |
I guess it's stalled out here for whatever reason. 01:20:27.680 |
One is "many grammatical errors a native speaker would never make." 01:20:32.680 |
An average quality translation with some errors. 01:20:42.680 |
So, it gave it a three, but we want it to give a one. 01:20:59.680 |
Now, have you found that it's generally too forgiving, or too strict, or that it's just all over the place? 01:21:10.680 |
Also, it seems to sometimes confuse the content versus the quality of the translation. 01:21:15.680 |
You know, so, like, for example, this is from, I think, the RLHF set. 01:21:22.680 |
But so, it might not actually rate it low, even though it's a terrible translation, because of the content. 01:21:28.680 |
Even though, if you asked it to list all the errors, there are a dozen errors in that, you know, single piece of text. 01:21:37.680 |
So, that's what I'm also trying to see: you know, is there any way to get it to separate out the grammatical, or the actual translation, errors versus the content? 01:21:48.680 |
So, generally, this is a thing where, if you ask the model whether some text is good or bad, then if the text is about a nice subject, it's more likely to say it's good. 01:22:05.680 |
And if it's about a negative subject, it's more likely to criticize it and say it's bad. 01:22:11.680 |
I think you can get at those issues by putting language about it in the prompt. 01:22:18.680 |
So, for instance, here, you might say something like... 01:22:31.680 |
One is the worst and five should be the best. 01:22:34.680 |
That's sort of like implicit in this rubric up here. 01:22:43.680 |
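One possible way to phrase that language in the grading prompt, as a sketch rather than the attendee's actual prompt, might be:

```
Rate the quality of the Japanese translation of the English text below on a
scale from 1 to 5, where 1 is the worst and 5 is the best.

Judge only the quality of the translation itself: grammatical correctness,
accuracy, and naturalness. The subject matter of the text, including mature,
unpleasant, or controversial topics, must not raise or lower the score.

First, list every translation error you find inside <analysis> tags.
Then give the score inside <score> tags.
```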
How do you get around the fact that you may not have a Japanese tokenizer? 01:23:10.680 |
I actually don't think there is a tokenizer for Claude. 01:23:16.680 |
I mean, unless you say otherwise, there's not... 01:23:18.680 |
I mean, if you upload some text, it will be tokenized. 01:23:23.680 |
So you're not going to get a really good answer for this. 01:23:29.680 |
Claude speaks the best Japanese of any model available, actually. 01:23:33.680 |
We don't need to debate, like, Claude's Japanese skill. 01:23:46.680 |
Sorry, can we actually cut the questions off while I type this prompt here? 01:24:01.680 |
So what we're trying to do here is get it to distinguish between the, like, ethical nature 01:24:06.920 |
of the text and the quality of the translation. 01:24:11.920 |
Is it useful to tell it to be, like, extra critical? 01:24:13.920 |
Say, like, you're grading for a graduate course, or something like that. 01:24:20.920 |
I need to, like, type this out before I can clear the queue and, like, respond to other questions 01:24:26.920 |
So what's a good word here, like, risque topics, or R-rated... 01:24:52.920 |
So, I don't know, something like this could help the model not pay so much attention to the subject matter. 01:25:18.920 |
So, the main thing that I would want to do for this prompt is just add a bunch of examples. 01:25:25.920 |
So for each category, I would add at least one example of that category. 01:25:30.920 |
So, like, I'd have a really bad translation, and I'd say why it's a one. 01:25:36.920 |
And I'd have an example of, like, a two-level translation, a three-level translation, and so on. 01:25:43.920 |
And in each case, before you get to the answer, you would have the explanation for why it's that score. 01:25:49.920 |
I can't type all that here, among other reasons because I don't speak Japanese. 01:25:54.920 |
But I think that's the most valuable thing that you could do here. 01:25:58.920 |
Otherwise, I mean, like, the formatting looks really good. 01:26:00.920 |
The fact that you're doing the chain of thought in advance looks good. 01:26:05.920 |
I think maybe this "specific clues or indicators" part: I think you could go into a little bit more detail here about what that is. 01:26:19.920 |
So, basically, just n-shot it: every single grade with an example, basically. 01:26:27.920 |
I mean, it's tedious to write out all these examples, but A, Claude can help, and then you have the easier problem of just editing Claude's response versus, like, writing it all out yourself. 01:26:35.920 |
And B, it really does lead to better performance versus, like, almost anything else that you could do. 01:26:41.920 |
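Structurally, the few-shot version of the prompt might look something like the sketch below, with the actual English/Japanese pairs and explanations filled in for each score; the tag names are illustrative, not from the original prompt.

```
<examples>
  <example>
    <english>...original English text...</english>
    <japanese>...its Japanese translation...</japanese>
    <analysis>Explanation of the specific errors (or lack of them) and why they
    correspond to this score.</analysis>
    <score>1</score>
  </example>
  <!-- one example like this for each score, 1 through 5 -->
</examples>

Now grade the following translation in the same way, giving your <analysis>
before your <score>.
```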
Do you think it's better to use a number scale, or could you just say, like, good, bad, and so on? 01:26:48.920 |
In terms of the scale on these rubrics, I think either a number scale or a good/bad scale is fine. 01:26:55.920 |
The thing I'd be careful about with the number scale is that I don't think it's very well calibrated. 01:27:00.920 |
So if you're telling it, like, choose a number from 1 through 100, it's not going to be, like, that precise. 01:27:06.920 |
So I'd probably just limit the granularity to maybe five different classes. 01:27:12.920 |
One thing we haven't talked about at all here, but going back to your question, is that 01:27:19.920 |
this is a case where you might be able to utilize, like, log probs, and I'm wondering if 01:27:24.920 |
that's something you ever use in any of your work or other prompts. 01:27:30.920 |
I agree, this could be a case where log probs would be useful, if you could get, like, the log probs at the right point. 01:27:37.920 |
So here's the thing with log probs is you really want the chain of thought beforehand, and the 01:27:44.920 |
chain of thought, I think, is going to get you more of a win than using the log probs. 01:27:49.920 |
But log probs, there's no -- in any model, I don't think there's a way where you can just 01:27:55.920 |
say, sample all the way through, and then output -- after you output this, like, closed chain 01:28:00.920 |
of thought tag, then give me the log probs of whatever comes next. 01:28:04.920 |
So if you are going to use the log probs, then you're looking at this, like, multi-turn 01:28:08.920 |
setup, where you first sample all the chain of thought, or sample all the, like, precogitation 01:28:15.920 |
And then cut it off there, and then re-upload that with the chain of thought as a pre-fill 01:28:21.920 |
message, and then you could get the log probs. 01:28:23.920 |
But for that, you need a model that both has the pre-fill capacity and has log prob capacity. 01:28:30.920 |
I'm not sure of what model has both of those characteristics right now. 01:28:34.920 |
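A minimal sketch of that two-step setup, assuming the Anthropic Python SDK: the first call stops before the score, the second re-sends the chain of thought as an assistant prefill so the score is the very next thing sampled. Reading log probs at that point is the hypothetical part, since it needs a model and API that expose them; the model name and tag names are placeholders.

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20240620"  # placeholder model name

# The rubric plus the translation, ending with instructions to reason inside
# <analysis> tags and then give a <score> tag.
grading_prompt = "..."

# Step 1: generate the chain of thought, stopping before the score is emitted.
first = client.messages.create(
    model=MODEL,
    max_tokens=1024,
    stop_sequences=["<score>"],
    messages=[{"role": "user", "content": grading_prompt}],
)
chain_of_thought = first.content[0].text

# Step 2: prefill the assistant turn with that chain of thought plus "<score>",
# so the next sampled token is the score itself.
second = client.messages.create(
    model=MODEL,
    max_tokens=5,
    messages=[
        {"role": "user", "content": grading_prompt},
        {"role": "assistant", "content": chain_of_thought + "<score>"},
    ],
)
print(second.content[0].text)  # with log probs available, you'd read them here instead
```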
Walk me through why it wouldn't just be sufficient to, like, in this case, just ask for -- I'm asking for a score, one to five, only return that, but then, like, look at the log probs of what it returns. 01:28:46.920 |
So what you're losing there is whatever intelligence boost you got from having the model do the chain of thought. 01:28:53.920 |
And my sense is that chain of thought, plus having the model say either one, two, three, four, or five, is going to be more accurate than, like, the additional nuance that you'd get by having it give you the log probs, because it's actually doing a lot of its thinking in that chain of thought. 01:29:10.920 |
For all the same reasons that chain of thought is, like, usually a good idea. 01:29:13.920 |
Are you talking about, like, chain of thought, like, it's actually out loud writing a chain of thought? 01:29:23.920 |
Which is what we see in this prompt right here, right? 01:29:25.920 |
So if we cut out this whole analysis section, we're really tanking the number of forward passes that the model can do before it gives you the answer. 01:29:36.920 |
I'm being told that I should answer a couple more questions and then get off stage. 01:29:44.920 |
I was, like, before I came here, I was honestly really worried that, like, no one would have any questions. 01:29:52.920 |
But you all have had amazing questions so far and amazing examples. 01:29:59.920 |
I'm supposed to say that at the end of this, but I'm giving a pre-thank you. 01:30:03.920 |
I just wanted to add that having a numbered list may give more weight to, like, number one, two, three, four, five, versus just having an unstructured list. 01:30:13.920 |
So it may change how much weight goes into the score in the output if you just change it from, like, one, two, three, four, five to, like, little dashes for the criteria. 01:30:23.920 |
Just in my experience of what I've been seeing. 01:30:28.920 |
I have replied back with the prompt that was fixed, or, like, with my own improvement. 01:30:38.920 |
Nisha, should I do another prompt, or should I just answer a couple questions and then head off? 01:31:00.920 |
"Please provide a summary of the text provided in the input." 01:31:05.920 |
First thing I'll do is just move these instructions down. 01:31:10.920 |
Now, Matt, did you have an example of the document -- a document where it hallucinates with this prompt? 01:31:30.920 |
Your summary should be concise while maintaining all important information that would assist 01:31:50.920 |
"Don't start or end with anything like 'I've generated a summary for you.'" 01:32:33.920 |
The best trick that I know for getting around hallucinations in a case like this is to have the model pull out relevant quotes first. 01:32:41.920 |
So what I would say do here is, I would say something like this. 01:32:44.920 |
And now, of course, in this pre-fill, since we're having relevant quotes here, we wouldn't want to start with the summary. 01:33:14.920 |
So we could say here is the -- something like this. 01:33:29.920 |
And then, of course, I would put the document here. 01:34:02.920 |
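Putting the pieces together, a quotes-first prompt with an assistant prefill might look roughly like this; a sketch assuming the Anthropic Python SDK, where the document variable, tag names, and model name are placeholders rather than the exact prompt built on stage.

```python
import anthropic

client = anthropic.Anthropic()

document = "..."  # the full text to summarize goes here

prompt = f"""<document>
{document}
</document>

First, extract the quotes from the document that are most relevant to its key
points, inside <relevant_quotes> tags. Then write a concise summary inside
<summary> tags, using only information supported by those quotes. Do not start
or end with anything like "I've generated a summary for you."
"""

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": prompt},
        # Prefill the assistant turn so it begins with the quotes, not the summary.
        {"role": "assistant", "content": "<relevant_quotes>"},
    ],
)
print("<relevant_quotes>" + response.content[0].text)
```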
Once again, I really appreciate you all coming out. 01:34:04.920 |
It's been amazing to have such a great, engaged audience. 01:34:12.920 |
I'm planning to stick around this event for the rest of the afternoon. 01:34:17.920 |
So I don't know exactly where I'll be, but maybe just DM me if you want to come find me and chat. 01:34:25.920 |
It's, like, my truest passion at this point in the world.