[Paper Club] Who Validates the Validators? Aligning LLM-Judges with Humans (w/ Eugene Yan)
00:00:00.000 |
I mean the question was do we record the whole thing or just the
just the presentation but then the presentation sometimes often extends to the whole thing so 00:00:10.960 |
uh if you want private question and answers you can write them in the chat they will not be 00:00:16.400 |
reflected in the youtube recording or we will do it afterwards when we stop recording okay yeah 00:00:21.680 |
and people have brought up can we save the chat transcripts so I think like for a writer episode 00:00:26.480 |
I saved the chat transcripts there's a lot of good questions in there now if people are interested 00:00:30.400 |
if we post them that's good to know, because I've been saving them sometimes but they haven't been shared. yeah, it could be. and just a general thing for everyone that joins, there's always great questions
that come up in chat so yeah always always open yeah I think the chat is the main one that's still 00:00:50.080 |
that's what was nice about discord hosting it because all the chat just stays in discord 00:00:55.120 |
in this space I have it pulled up to the side but if I miss any of them um please just pick you know 00:01:00.560 |
just raise your hand and just ask me to stop okay cool so we're going to talk about this paper who 00:01:07.280 |
validates the validators. um, I don't have any slides, as per normal, uh but I think that there are a few
key interesting ideas in this paper that I want to quickly go through in um hopefully in just uh 00:01:24.320 |
give me a second let me try to see if I can actually share my entire desktop I want to quickly go through 00:01:29.680 |
this uh paper in maybe about 30 45 minutes then I also want to share about um an app that I built 00:01:37.680 |
that I try to show all of this with evaluation mode um labeling mode as well as optimization mode 00:01:46.800 |
um I had given Shreya a sneak preview of this the last time, um so we'll see. okay, now on to who
validates the validators. so what is this paper about? essentially this paper is about: okay, everyone's using LLM evaluators, but the question is, how are we actually going to evaluate them? obviously, I think the way to do this is to label some data and then check how the evaluator matches against that labeled data, and you will see that that's how they try to do it as
well um just to confirm you can actually see my screen right you can actually see the yellow and blue 00:02:26.080 |
highlights okay yeah so I'm just because I'm showing the whole desktop so we'll go through a couple of 00:02:32.480 |
key ideas here but the one key idea is this idea that's very interesting uh and these are scattered 00:02:38.480 |
throughout in blue. so firstly, the observation that Shreya and her co-authors made is that to grade outputs we need to have some evaluation criteria, but the process of grading outputs helps us define that evaluation criteria, so it's a chicken and egg problem. across many tasks, in terms of evaluation, I've seen that trying to create criteria without actually having seen your data leads to criteria that's very divorced from reality. like, people will say, oh, I don't want any toxicity, I don't want any spelling errors,
etc etc but the thing is well you know those are actually not problems um or they could say that 00:03:20.000 |
oh, you know, I want it to speak like it's speaking to a friend. well, the thing is, if you actually try it and look at the data, you realize that you actually cannot create that kind of text. it can't really produce slang, for example things like romantasy, which is a mixture of romance and fantasy, that kind of genre, or grumpy sunshine, or reverse harem. things like this are not really part of its vocabulary, and it's very hard to prompt it to get it to write that kind of data. so that
leads to undue expectations right these are expectations that just cannot be met so therefore 00:03:58.880 |
what she's proposing, what they're saying, is that it is impossible to completely determine good evaluation criteria without actually looking at all the outputs. so essentially the point is you have to look at the data before you can define your evaluation criteria. unfortunately this is not the main
part of the paper but it was a very key thing and I think this was actually one of the biggest 00:04:21.360 |
uh aspects of the paper so essentially what what she's proposing is that okay here's the typical 00:04:29.040 |
evaluation pipeline um that's right on top okay we start with uh prompting we get outputs then we 00:04:35.680 |
just use some kind of evaluator LLM without really checking it, we just... oh, Vibhu, you're not muted.
thanks thanks i took care of it yeah yeah okay so what we're seeing is that okay it's uh on the top 00:04:58.160 |
that's how we're doing it right now, but we're not actually checking the LLM evaluator. so what they are proposing is that we put the outputs through an LLM evaluator, we have some kind of candidate criteria, and the human also needs to grade the outputs in terms of yes or no, and then we check how well the candidate criteria aligns with the human grader's output. this is an iteration of the process: only once it's aligned enough (and that's why they have an alignment report card) do we consider the LLM evaluator as aligned and usable, both in development and in production. so essentially users iterate through the process of refining criteria and grading, so that you align the LLM evaluator
to your criteria. i think there are three main problems when trying to use an LLM evaluator, and two of them are essentially solved. one is at the end, which is: given the ideal prompt, can an LLM evaluator achieve a sufficient level of accuracy? i believe the answer to this is yes. the second one is: how do we get humans to label a hundred to four hundred samples of data with sufficiently high quality? i think the answer to that is also solved; essentially you just motivate them, incentivize them the right way. but what's in the middle is: now that you have these hundred to four hundred golden samples, how do you come up with some kind of prompt, instruction, or few-shot samples that can actually achieve that ideal, the ideal prompt that can align well? this part in the middle i think is somewhat of an
unsolved problem, at least for me, and we'll see how they try to solve this. so, EvalGen design: how do we assist developers in creating evaluators to grade the LLM outputs? essentially, instead of having them use the vibe check, how can we assist them to initialize the LLM evaluator once and then move on? so they have a workflow here; i think it's useful for
me to just talk through this i have had the chance to actually play with this uh live and it's very nice 00:07:07.920 |
and a lot of it is what inspired my own prototype. so essentially at the start you can see there's a prompt node where you're just writing a prompt; essentially these are tweets, and you'll be doing named entity recognition on tweets, so this is the generation prompt. now over here, still in box a, you can see the multi-evaluator; this is the evaluation prompt. so in order to generate an evaluation prompt, it provides you three ways: you can infer the criteria based on the prompt, you can write criteria manually, or you can grade some responses first, so
that's part two. so after you've done this, it will generate the criteria itself; essentially you're just getting GPT-4 to generate some kind of criteria, and this criteria can be in the form of code or in the form of an LLM prompt. so there are two kinds of criteria here: one is code and one is an LLM prompt. you can imagine things like "no hashtags", which is really just a regex search for hashtags, whereas "no made-up entities" is probably something that you need to rely on an LLM for.
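To make the two kinds of criteria concrete, here is a minimal sketch of my own (not EvalGen's actual code) of a code-based assertion and an LLM-prompt-based assertion for the tweet NER example; the function names, prompt wording, and model name are illustrative assumptions:

```python
# Sketch (not EvalGen's implementation): the two kinds of candidate assertions,
# a code-based check and an LLM-prompt-based check.
import re

def no_hashtags(output: str) -> bool:
    """Code assertion: pass if the output contains no hashtags."""
    return re.search(r"#\w+", output) is None

# An LLM-based assertion is just a grading prompt plus a model call.
NO_MADE_UP_ENTITIES_PROMPT = """You are checking the output of a named entity extraction task.
Tweet: {tweet}
Output: {output}
Does the output mention only entities that actually appear in the tweet?
Answer with a single word: PASS or FAIL."""

def no_made_up_entities(tweet: str, output: str) -> bool:
    import anthropic  # assumes ANTHROPIC_API_KEY is set in the environment
    client = anthropic.Anthropic()
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=5,
        messages=[{"role": "user",
                   "content": NO_MADE_UP_ENTITIES_PROMPT.format(tweet=tweet, output=output)}],
    )
    return "PASS" in msg.content[0].text.upper()
```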
so that's box c over here, creating criteria. and then as you're creating criteria and running them through the LLM, you have the option to grade some of the sample data, where you can say, okay, is this good or is this bad. unfortunately this is very broad, just "is it good or is it bad"; it's not clear enough, i think, and the authors actually do mention that it would be helpful to be more specific as one of their future lessons. and then
finally, in box e over here, you can see how well the LLM is aligned. i think this box e is very useful. they use the terms coverage and false failure rate. to simplify things: coverage really means recall, that is, how many of the bad responses can we actually catch; and false failure rate is actually one minus precision. it took me a long time to understand this, and i had several chats with Shreya to clarify what she was trying to indicate. false failure rate means that we are failing outputs that should not have been failed, so essentially how much is being wasted, how many innocents are we excluding. so it's essentially one minus precision.
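Since these terms took some unpacking, here is a small sketch of how coverage and false failure rate could be computed from binary human labels and evaluator verdicts, following the definitions above (this is my own illustration, not code from the paper):

```python
# Sketch: computing coverage and false failure rate as described above.
# human[i] is True if the human graded output i as BAD (a defect);
# judge[i] is True if the LLM evaluator failed output i.
from typing import Sequence

def alignment_metrics(human: Sequence[bool], judge: Sequence[bool]) -> dict:
    tp = sum(h and j for h, j in zip(human, judge))        # bad, and caught
    fp = sum((not h) and j for h, j in zip(human, judge))  # good, but failed anyway
    fn = sum(h and (not j) for h, j in zip(human, judge))  # bad, but missed
    coverage = tp / (tp + fn) if (tp + fn) else 0.0        # recall over bad outputs
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    ffr = 1.0 - precision                                  # false failure rate
    return {"coverage": coverage, "precision": precision, "false_failure_rate": ffr}

# Example: 10 outputs, 4 truly bad; the judge catches 3 of them and wrongly
# fails 1 good output -> coverage 0.75, false failure rate 0.25.
human = [True, True, True, True, False, False, False, False, False, False]
judge = [True, True, True, False, True, False, False, False, False, False]
print(alignment_metrics(human, judge))
```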
okay, so essentially this is the entire workflow, the entire UX: the LLM will help you create criteria in box b, you will grade some responses in box d, and then eventually we check how well they align in box e. and if it's aligned enough, that's it, we have our LLM evaluator. so, well, i don't have to go through all of this; essentially that was all indicated in the diagram, and
of course the main thing is that the users can edit everything and the users are also asked to create 00:10:06.080 |
their things um i think i will pause here to check if there are any questions 00:10:15.520 |
the chat has not much in chat okay sure thing no no questions then uh i will proceed 00:10:25.280 |
okay, so now we know what the EvalGen workflow looks like; let's see how well it works and what developers say, because they also had a small study on developers. so essentially, how is it done? criteria suggestion: GPT-4 just proposes various criteria. candidate assertions: essentially, as they create these criteria, they implement them either as code or as a prompt to assert against the outputs, and then return the scores. and after that, they will sample some of the outputs to get the users to give binary feedback. i think this is really interesting, whereby you try to sample the rows that make the biggest difference; i couldn't quite figure out how they were actually doing this in a smart way that made a difference, other than just going through the entire list. maybe Shreya, if she's here, can actually answer that. so they also have selectivity of candidate assertions: essentially what this means is that they will select the assertions that have a higher rate of passing, and then they also have
confidence scores, etc. so here's how it compared to a previous paper by Shreya as well, called SPADE. the idea of SPADE, in a nutshell, is that you version control the prompts, and the assumption is that every time there's a prompt change, it's usually to add an extra instruction to prevent a failure case. so essentially by version controlling the prompts, by checking all the different prompt changes, you can create test cases out of it and you can create an evaluator out of it. so that's what they were comparing against. they have two data sets, one is a
medical pipeline, essentially transcriptions of doctor-patient calls, where the goal is to extract specific information without revealing any PII. the other one is an e-commerce data set, essentially a hundred Amazon products, and the goal is to write SEO-optimized product descriptions without mentioning the negative reviews. so medical and e-commerce. what i like about these data sets is that they're fairly balanced in terms of defect rate: essentially 70% and 50% of the outputs were good, so if you flip it, the defect rate is actually 30% and 50%, which is what we really want to catch. so here are our two data sets, medical and product, with 30% and 50% defect rates. so
essentially let's just focus on the results. what we can see over here is that for the medical pipeline, recall was essentially 33%. it's a bit on the lower side; i think they didn't really do a lot of work to try to optimize it and improve recall, but right off the bat recall is 33%, which is equivalent to SPADE, and the precision is 90% (remember false failure rate is one minus precision, so precision is 90%), and that's pretty good precision. so you might say, hey, what's the difference between EvalGen and SPADE? well, the difference here is that EvalGen only needed three assertions, be they code assertions or LLM prompts, so essentially it needed fewer than SPADE, which needed five, to get comparable results. now when we look at the product pipeline, we see that EvalGen only requires four assertions, which is half of what SPADE needed: essentially four different criteria, four different LLM prompts or code assertions, up to four, and it was able to achieve pretty good recall, close to 75%. in production ideally you want to try to get to 80%; 80% is really hard to achieve in production, but 80 to 90% is pretty good. but what's not so good is the precision: precision over here is 61%, because again false failure rate is one minus precision. eventually, long story short, EvalGen does outperform SPADE in terms of achieving similar or better recall and precision with fewer assertions, but there's still a few ways to go before we can use it in production. that said, this idea is actually pretty good in the sense that
again, remember what's happening here is that the LLM is just seeing the prompt and either inferring some criteria, or the user is manually creating some criteria. the user is not writing the code, the user is not writing the evaluator prompt. but just by getting the user to label (i don't know how many the users actually labeled, i think maybe 16 to 20 in those two data sets), we were actually able to come up with all of these evaluator prompts, just like this, without the user having to write them. imagine a user who is a product manager, or an SDE or software engineer, who doesn't know how to write prompts, doesn't know about few-shot prompting, doesn't care about chain of thought. and they shouldn't, right; all they should do is provide: hey, what's my criteria, and then can you get it aligned.
so i think that's a pretty good result so any questions here before i go into user study 00:15:37.440 |
okay okay i guess not uh how do i get the chat um when when you're talking about these 00:15:48.800 |
these prompts are they iterative prompts like when you say getting it aligned you're talking about 00:15:54.560 |
exchanges with the model this isn't fine-tuning the model right no no it's not fine-tuning model what i 00:16:00.080 |
mean by getting it aligned is this: okay, so let's look at box d over here. the user is just saying whether the output is good or bad. now there could be many, many reasons why it's good or bad: it could be that there are too many hashtags, it could be that it's not using bullet points, it could be that it's making up entities. the model just doesn't know, and i guess in this study they just simplified it as good or bad. so there could be many different criteria, many different evaluation prompts. so what happens is that, again, in box c you see the LLM is coming up with many possible criteria; it's trying to guess what the user needs, based either on the prompt or on what the user has typed, maybe just some free-text string. so it's coming up with these candidate evaluators, and eventually what we're going to do is run these evaluators and try to see how well they align with the data the user has provided, which is thumbs up or down. and i'll have an example demo of this in a bit where we can see
how our alignment is. and so what is going to happen is that there's an algorithm on the back end that essentially computes which of these evaluators we should use and combine, and which of these should be dropped, to have the highest correlation with the human-provided good or bad labels. does that make sense? and of course, if the correlation is bad, you could try to optimize the prompt, rewrite the prompt in a different way, to find a better correlation. so that's what they mean by alignment: aligning to the user-provided labels of good and bad. (got it, thank you.) yeah, so essentially this is what they really care about: coverage. they want to have high coverage of the bad outputs.
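To give a feel for what that back-end step could look like, here is a deliberately simplified stand-in (not the paper's actual selection algorithm): brute-force over subsets of candidate assertions, where an output fails if any selected assertion fails it, and keep the subset with the best coverage under a false-failure-rate budget.

```python
# Simplified sketch of assertion selection (not the paper's exact algorithm):
# pick the subset of candidate assertions whose combined verdicts best align
# with the human "bad" labels, subject to a false-failure-rate budget.
from itertools import combinations
from typing import Dict, List, Sequence

def select_assertions(
    assertion_fails: Dict[str, Sequence[bool]],  # assertion name -> per-output fail verdicts
    human_bad: Sequence[bool],                    # per-output human "bad" labels
    max_ffr: float = 0.25,
) -> List[str]:
    names = list(assertion_fails)
    best, best_cov = [], -1.0
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            # an output fails if ANY assertion in the subset fails it
            fails = [any(assertion_fails[n][i] for n in subset)
                     for i in range(len(human_bad))]
            tp = sum(h and f for h, f in zip(human_bad, fails))
            fp = sum((not h) and f for h, f in zip(human_bad, fails))
            fn = sum(h and (not f) for h, f in zip(human_bad, fails))
            cov = tp / (tp + fn) if (tp + fn) else 0.0
            ffr = fp / (tp + fp) if (tp + fp) else 0.0
            if ffr <= max_ffr and cov > best_cov:
                best, best_cov = list(subset), cov
    return best
```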
okay, so what's really interesting is that they also had a user study with nine industry practitioners. i was one of them, so i had a chance to play with this, and i was actually fairly impressed. so you can see the prompt: you'll be doing named entity extraction, which is the example there. extract up to three well-known entities, write a single-sentence description of each entity, and return the answer as a bulleted markdown list. so you can see there's quite a lot of instructions here: extract entities, write a short single-sentence description, return it as a markdown list, and don't extract any hashtags. you can imagine this being split into four different criteria, and some of these can be solved by code, such as "no hashtags" or "bullet points", while some of these, like "do not make up any entities", probably can only be checked by an LLM. so then what we
found is that participants found EvalGen was a great starting point: a great starting point that makes it fun for people to label data and get immediate feedback on how their labeled data is helping to align the LLM evaluator. and this feedback is on the order of minutes, or at most an hour, instead of days or weeks. sometimes if you're not labeling the data yourself, if you're using something like Scale AI, the data might not come back so quickly, and that iteration loop really kills your product launches or your feature development, especially when you're trying to build an LLM-powered customer experience. so here's what participants usually did: there were 100 LLM outputs, so they sort of just quickly scanned through them, made sure they looked right, and then clicked to start. there were six participants that just used the auto-generated criteria, one participant wrote the criteria, and then two other participants actually wanted to look at the data first; amazing, i think that's great. so after that, they get some criteria suggestions from EvalGen, and participants actually really wanted to play with the criteria and push the boundaries, so they removed some suggestions and added some criteria of their own to see how far EvalGen could go. so, EvalGen as a starting
point, what did the participants say? firstly, it's really hard to come up with criteria for what is a good summary, or a good translation, or a good extraction. so by prompting an LLM (and you can probably do this with the ChatGPT UI or Anthropic's Claude UI), it was helpful that it was able to get the initial criteria down, and, you know, this is nothing outstanding. then, participants who actually started grading before creating criteria said that the grading process was helpful, and we all know this: by looking at the data (and this was the key point that i highlighted in blue), by looking at the data first, you understand the data, you get a feel for the data a little bit better, and that helps you write better criteria. so that's one. i think this was me: a participant actually said you should enforce that we look at at least 20 samples first, and i will actually show this enforcement in the demo later. so, because while the LLM is doing the grading, especially if there's chain of thought, it takes some time; maybe you want to go get a coffee, but a great use of that time is for the participants to actually provide labels themselves. so this was what participants were sharing, that they were happy to grade while waiting. and here's how the output looks: you can see here's
the full text of the tweet and here's the response. okay, let's look at the second one: the tweet says something like "great fire department and tour NYC", so it extracted the fire department, extracted NYC, extracted FDNY; this is interesting, i was not aware of this. and then you can see we have four evaluators here: whether it's a bulleted list, whether there are no made-up entities, whether there are no hashtags, and whether it's a single sentence. and of course, on the single-sentence criterion two of these actually fail, because they're not made of a single sentence. so based on this you can actually see the grades, and of course all this data is computed on the back end to create the alignment scorecard. participants liked looking at the coverage and false failure rate; looking at this made it very easy to see which criteria were actually working. and here's what i think is
the meat of it, which is that alignment is iterative. the main problem, i think (and i think this is right), is that most people don't have very strong confidence in an LLM evaluator because even they themselves are not confident of the criteria. and the thing is, this is not explicitly
vocalized but most of the time they're not confident of the criteria because they don't know if their 00:23:06.240 |
criteria is actually helping, and the reason for that is because they haven't even actually looked at the data. sometimes the people that provide criteria are PMs or business leaders who may not have a chance to look at the data; maybe the data is too complex, it's just not in an easily consumable format, and therefore they don't know whether it actually works or not. but i have found that by
getting people to look at the data even just 10 or 20 samples right of course you have to pick 00:23:34.720 |
representative samples with both defects and non-defects it helps them understand a little bit 00:23:39.120 |
better right so if you think of data in general there's quantitative data where you aggregate everything 00:23:46.000 |
into one or two metrics that gives you very broad scope and then there's anecdotes right where you 00:23:51.040 |
actually look at specific samples and jeff bezos has this quote saying that you know when the data and 00:23:55.280 |
the anecdotes disagree i tend to trust the anecdotes that's what we're that's what uh this is suggesting 00:24:01.200 |
here look at some of the data samples look at some of the anecdotes um yeah so again here's the chicken 00:24:08.000 |
and egg problem we need to think through criteria you need to write it down in order to grade outputs 00:24:13.920 |
but when we grade outputs we also figure out new data uh new criteria i mean it's the same thing like 00:24:22.880 |
it's the same thing as writing uh when you start writing you probably only know half of what you 00:24:27.440 |
actually eventually finish writing as you start writing you start digging a rabbit hole deeper and 00:24:31.040 |
deeper and deeper and you find more things so this is what this proposes which is that feedback loop 00:24:35.920 |
looking at data creating criteria creating criteria looking at data and creating more new criteria 00:24:40.640 |
that's very helpful um yeah so but one interesting thing here is that one participant mentioned 00:24:49.280 |
that they gave it a bad grade not because the output is bad but they wanted to be consistent with 00:24:53.200 |
their previous grades. previously they thought something was bad and gave it a bad score, but afterwards there could have been other things that were more egregious, and in hindsight the earlier issue was really just a minor paper cut, not as critical; yet they continued to be consistent with their previous grade instead of updating it. i personally think that this is a very difficult problem to solve, in the sense that you want to be internally consistent with what you have done in the past, but what that means is that this perpetuates the bad data and the bad labels. so i think there's a point here, which i wrote down to remind people: my design decision is that if you find that your criteria has drifted, instead of trying to maintain the same criteria and aligning to the previous grades, what we should do is revisit those previous grades and fix them, because it's an iterative process.
what constitutes alignment? i think there's a question there about that; alignment is very subjective. i think that when it's really just binary, yes or no, good or bad, it's a little bit easier to measure what alignment is; in a sense it's really just recall and precision, or Cohen's kappa in terms of correlation metrics. but sometimes it's just very difficult. and there was this slight disconnect: the participants provided grades, they provided labels, but those had little impact on their chosen assertions. i think it's just not clear how the grading actually drives the optimization, and during that demo they actually didn't do any optimization or creation of new evaluation prompts, so that may be
why there was some disconnect there the other thing is that users really want to have control over both 00:26:54.400 |
the evaluation type as well as the prompts themselves. and we have seen this: a lot of times we will come up with a new prompt that actually works well on the eval, but sometimes, for better or for worse, for whatever reason, people just want to include this one sentence to make sure that a particular defect doesn't happen, even when that defect doesn't actually happen, and when you try that, sometimes the evals actually perform worse. unfortunately, sometimes these are just policy decisions that are a little bit disconnected from the data. finally, they found that, you
know, as developers iterate over the criteria and engage with the data, essentially looking at the LLM outputs, it helps them refine their criteria better. so the point here is to tighten the feedback loop while creating the LLM evaluator, and once the LLM evaluator is well aligned, we can just use it in production. that said, some participants were actually quite skeptical of how LLM-based assertions, the LLM-based evaluations, might be helpful for monitoring LLM outputs in a production pipeline. a year ago i would have been in this camp as well; now i'm partially in this camp and partially not. i'm beginning to see that it can be viable if the throughput and latency requirements are not too demanding. so participant 8 actually said: i cannot begin to think about how LLMs as validators in production can work, i'm very
skeptical um so yeah i think that's that's all i wanted to focus on uh essentially it's the workflow 00:28:42.960 |
here that i think is very valuable uh and they also have future work um like one one thing here they 00:28:50.400 |
actually didn't uh implement here which is users want their prompts to automatically improve based on 00:28:55.680 |
evaluation results so this would be something like dspy etc etc that you can think about which we saw in 00:29:02.400 |
some demos to automatically improve our evaluation prompts um but so yeah that's it that's all i had 00:29:09.520 |
to share any questions just gonna go back to the screen here oh there's a few questions 00:29:22.960 |
um okay i am gonna start from the one in the bottom um is there feedback on the information utility of 00:29:33.840 |
this is not clear to me um i don't know and i don't know if shreya has joined yet she may not be able to 00:29:45.760 |
join; she may not have joined yet. she has not. i think this is a question that maybe i will wait for Shreya on. let me just check Twitter to see if she has responded; i did share the Zoom link.
eugene you said you made a prototype based on uh like the i think there was like a diagram earlier 00:30:13.600 |
like can you talk more about the prototype if no one else has questions for the paper 00:30:17.920 |
yeah sure i can share more about the prototype um 00:30:30.320 |
share my screen. so here's a very small, simple prototype. essentially the question that i have for myself is: how can i help users label data, evaluate data, and improve their evaluation prompts more easily? so here's a small toy that i made; it's called Label, LLM assistance for better evaluation labels. to get started we have to upload a CSV file, and i think these are fairly reasonable constraints: id, input, output, and a label, and the label can be empty.
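For illustration, an input CSV along those lines might look like this (the rows below are made up; only the id, input, output, label columns come from the constraint above, and the empty label on the last row shows an unlabeled example):

```csv
id,input,output,label
1,"Article: The company reported Q3 revenue of 2.1B ...","Summary: Revenue fell to 1.2B in Q3 ...",0
2,"Article: The city council approved the new bike lanes ...","Summary: The council approved new bike lanes ...",1
3,"Article: Researchers released a new open-source model ...","Summary: A new open-source model was released ...",
```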
so what i'm going to do is upload a CSV file. this is the factual inconsistency data set. let me just start from scratch and upload. so here's how the CSV file looks after we've uploaded it: that's the input, that's the output, and the goal is to make it easy for users to read, and we can enter labels. so currently we are in labeling mode; to unlock evaluation mode you have to label at least this many rows, and i try to provide feedback on how many rows you still have to label. essentially i'm trying to gamify this to help
users get to the next level so but i have a data set that uh already has some rows uploaded 00:32:00.560 |
so this this has uh this is evaluation mode so in uh in order to achieve evaluation mode 00:32:13.520 |
uh wait a minute have i already labeled so many so much data 00:32:17.680 |
let me just okay maybe my data set wasn't prepared correctly 00:32:23.200 |
uh by the way this is this is now evaluation mode after we've labeled 20 out of 20 it will show that 00:32:28.720 |
evaluation mode has been unlocked so you see evaluation mode unlock scroll top to add your prompt 00:32:33.600 |
so once evaluation mode is unlocked you can enter the prompt and it'll allow you to use the prompt to 00:32:39.440 |
try to do that so evaluation mode you can just run evaluate it will evaluate all the data uh evaluate 00:32:47.520 |
on the data. so you can see we start getting results as they come in, piece by piece. essentially what's happening is it's just running on the back end (big thanks to swyx for teaching me the right way to do this): it's running in the background, it's all stored in the database, but at the same time we're also polling it once a second to show the results. and what we see in the top left corner here are metrics: recall, precision, Cohen's kappa, and we also have true positive, true negative, false positive, and false negative counts.
so this is uh going to take some time and it's going to evaluate the data and along the way we can 00:33:24.080 |
see how well it aligns. so we can actually see that when alignment is there, the prediction matches your label, that's great; and when alignment is not there, the prediction doesn't match the label, it'll be highlighted in red and you can look through it. you can try to update the chain of thought, you can try to understand why it's giving this prediction based on the explanation, and you can see the explanation here. and as we scroll down, oh wow, this is pretty good, you can see most of it is largely aligned. essentially what this is doing is just chain of thought followed by the final label, and it's going to go through all 50 of them. so currently we're in evaluation mode, where the LLM is evaluating. to unlock optimization mode you have to label at least 50 rows and run evaluation mode once; essentially we need a baseline for how well this basic evaluation prompt works. and you can see the prompt is actually very short: just evaluate if the output summary is consistent with the input article; if it's consistent return 1, if it's inconsistent return 0.
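As a rough sketch of what that judge call could look like (my own illustration; the demo's actual prompt and model aren't shown in the transcript), assuming the Anthropic API that the app is said to call directly:

```python
# Sketch of the factual-consistency judge described above: chain of thought,
# then a final 1/0 label. Prompt wording and model name are assumptions.
import re
import anthropic  # assumes ANTHROPIC_API_KEY is set in the environment

JUDGE_PROMPT = """Evaluate if the output summary is consistent with the input article.
Think step by step, then on the last line write "label: 1" if it is consistent
or "label: 0" if it is inconsistent.

Article:
{article}

Summary:
{summary}"""

def judge(article: str, summary: str) -> tuple[int, str]:
    client = anthropic.Anthropic()
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=500,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(article=article, summary=summary)}],
    )
    text = msg.content[0].text
    m = re.search(r"label:\s*([01])", text)
    label = int(m.group(1)) if m else 0  # default to 0 (inconsistent) if unparsable
    return label, text  # the label plus the chain-of-thought explanation shown in the UI
```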
so now we are at uh row number 38 oh seems to have broken nope it's still running it's just taking a 00:34:35.840 |
bit of time uh row number 41 so once we have at least 50 rows i actually think the right number should be 00:34:41.920 |
maybe 100 to 400 if you want to be optimizing based on this uh and after we've run evaluation mode once uh we 00:34:49.440 |
can actually run optimization mode so optimization mode is going to come up in maybe a couple of 00:34:55.440 |
seconds so you can see right here it's just i just want to try to figure out a way to give users fast 00:35:00.000 |
feedback all right so now that we've labeled 50 uh and evaluated all of them it shows that optimization 00:35:06.160 |
mode is unlocked so we can scroll to the top and click on optimize evaluation prompt so optimization 00:35:10.240 |
what it does is that on the back end we're just going to try to figure out how to optimize the evaluation 00:35:15.920 |
prompt uh unfortunately clicking on this number clicking on this button doesn't do anything 00:35:20.320 |
i just implemented this yesterday, but with Next.js it's very easy to make it something fun and funky; of course i'm sure swyx will do something way better. but essentially this is what the next step is: clicking on this button will run optimization on the back end, and we'll see a different page which has maybe 10 rows and all the different optimization scores. it's actually just hyperparameter tuning; you can think of it as hyper prompt tuning. so yeah, that's my short demo. any questions or feedback would be nice.
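To make the "hyper prompt tuning" idea concrete, here is a minimal sketch of what that optimize step could do: try several candidate evaluation prompts against the labeled rows and keep the one that aligns best. The candidate list, the judge callable, and the use of plain accuracy are assumptions, not the app's actual back end (you could just as well score by Cohen's kappa or coverage/FFR):

```python
# Sketch of "hyper prompt tuning": evaluate each candidate evaluation prompt
# against the human labels and keep the best-aligned one. Illustrative only.
from typing import Callable, List, Sequence, Tuple

def optimize_prompt(
    candidates: List[str],                   # candidate evaluation prompts to try
    rows: Sequence[Tuple[str, str, int]],    # (input, output, human_label 0/1)
    judge: Callable[[str, str, str], int],   # judge(prompt, input, output) -> 0/1
) -> Tuple[str, float]:
    best_prompt, best_acc = candidates[0], -1.0
    for prompt in candidates:
        preds = [judge(prompt, inp, out) for inp, out, _ in rows]
        acc = sum(p == lbl for p, (_, _, lbl) in zip(preds, rows)) / len(rows)
        if acc > best_acc:
            best_prompt, best_acc = prompt, acc
    return best_prompt, best_acc
```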
yeah, that's fire. so of all the tested frameworks, you ended up going with Next.js and not, like, um... oh, how do you know this is Next.js? how do you know this is Next.js?
yes, you mentioned it a second ago. oh okay, yeah, this is Next.js. it was really easy to build this with Cursor and v0.dev, and i went with it. the back end is going to be FastAPI; i believe the back end will be FastAPI, or i could try to do it all in TypeScript and Next.js, i don't know, but i think the prompt optimization back end will just be FastAPI, and then the Next.js app will just call FastAPI. the labeled data set is all in a DynamoDB, or rather it's just in a SQLite DB, and then the FastAPI backend will also have access to that, and it just optimizes and computes scores.
so this is oh sorry go ahead no i was just gonna say that was really cool i'm now trying 00:37:05.440 |
to tie it back into like uh the validator paper that we just read and understand like the implications 00:37:09.520 |
that you're talking about for that diagram at the beginning exactly exactly that's great thank you flo um 00:37:15.840 |
so what i was trying to do was this green section over here. i'm trying to force people (you can see that it's very specific): i really think that you should be evaluating at least 10 to 20 samples. i'm not trying to create candidate criteria, i'm not trying to get the LLM to create initial criteria; the thing is, i need a human to just provide the criteria, and it could be bullet points. you can assume someone who doesn't know anything about prompting is just going to put bullet points; i'm just going to take whatever SOP they have, they're just going to paste it in there, and we'll just get the LLM to run, and then we can check the alignment report card, which is this
thing over here on top. it will probably have a separate screen of its own, i don't know, but for now it's just a floating panel. i think it's fun to see, it gives instantaneous feedback that hey, this thing is doing some work, your work is valuable, you doing the labeling is valuable, and therefore i want you to label more data. and of course, every step of the way i try to give you milestones. (i'm going to delete all of this, or rather maybe not.) okay, so if you want to unlock evaluation mode you have to label at least 20 rows, so try to get to that first milestone of labeling 20 rows of data. after you've labeled 20 rows of data you can run evaluation mode, and if you want to unlock optimization mode you have to label another 30 rows, so give them the next milestone, and after they've done that, okay, now we can do optimization mode. i think for this demo the numbers were actually low; 20 and 50 is a bit too low, i think maybe it should be 25 and 100, but i'm actually still playing with that. but yeah, this green box here is the loop i'm trying to improve on and make tighter. okay, no more questions i guess. yeah, it's a fairly easy and
straightforward paper. oh, actually there's a lot of stuff in the chat. is the code open? the code is not open yet because it's very ugly. this code is not open, but the previous one on the evaluation frameworks is open, so you can look at it. this one is not open; i still have quite a bit of work to do to make it a little bit cleaner.
"you should experiment with the Writer framework to see how it does; it handles front end and back end for you." yeah, oh, kind of like FastHTML, wow. yeah, i know you pinged me that, Sam, i haven't had a chance to look at it. yeah, this app: when i was trying to build it, i was like, okay, maybe let's try to dig into Next.js again, or TypeScript frameworks again. so in this (oh shoot, i switched over to a different screen), in this framework over here, long story short, i went through FastHTML, Next.js, SvelteKit, and even had a Frankenstein of SvelteKit/Svelte with FastAPI in the backend. this was not good; i really had to hack a lot of things and Frankenstein it, i really don't like how it works. but i think this is probably the best of both worlds. yeah, the AI SDK might be fun to try; i think i did use the Vercel AI SDK for this, but i can't confirm, let me take a look right now. well, so here's this ugly code, and the evaluate route: oh, i'm just calling Anthropic directly. yeah, the AI SDK: i think i initially used the AI SDK from Vercel, but what was useful there was streaming, and i didn't have streaming, so i just simplified it and deleted stuff. yeah, the law paper is leadership
principles oh so um eugene if i may yeah yeah um okay so in your experience were all or at least most 00:42:04.160 |
tasks distillable down to a binary prediction or is that just the ideal goal i think most tasks can be 00:42:12.160 |
i think if you think about it: for summarization it's really relevance, factuality, comprehensiveness. or maybe you can give me an example of something that you don't think can be
distilled down um well off the top of my head because i don't have a specific use case in mind but 00:42:31.280 |
it may be you know multiple class classification but i guess that can be distilled down to pairwise 00:42:41.440 |
when you think about multiple class are you thinking about does this output 00:42:48.880 |
meet not meet policies for toxicity bias multiple choice questions where multiple answers uh are required for 00:42:58.720 |
some questions yeah so that is not the task of an lm evaluator right that's the task of an lm classifier 00:43:06.160 |
oh, i see, okay. yeah, okay, so you're targeting specifically the ability of an LLM to evaluate whether something is good or bad. got it, okay, makes more sense. so you're right, for multi-class, you know, or maybe like,
um let's say given some kind of product description and title can you say is it uh 00:43:28.720 |
is it fmcg is it clothes is it furniture yeah that that that is completely different that that would be 00:43:36.800 |
that would be a separate thing but right now i'm just really focusing on evaluating it 00:43:42.640 |
is this is this classification is this classification correct or wrong simple that's binary but the goal 00:43:48.960 |
here is to have a uh let's say a solid justification for using an lm as an evaluator 00:43:57.920 |
uh no not really um i know that we have to use an lm as an evaluator there's no way around it if we 00:44:04.800 |
want to scale i think that's the only way the question is how can i tighten the alignment how can i 00:44:09.760 |
tighten the feedback loop, to make it easier to use the LLM as an evaluator. i'm taking a lot of inspiration from... (tightening the loop in order to be able to scale to more instances faster?) exactly. so you can imagine that maybe an ai engineer might be able to look at the data,
understand the criteria talk to pms understand criteria look at the user experience and then write 00:44:39.840 |
the prompts and then they will do all the evaluation themselves right but now let's imagine that we don't 00:44:45.520 |
have that ai engineer or that ai engineer is actually bottlenecked and we want to use ai everywhere 00:44:50.240 |
right we're going to have lms doing classification summarization translations extraction everywhere 00:44:56.560 |
um, now at some point in time there will be a certain level of risk tolerance, whereby for simple things like extraction and classification, if it's on the back end, we probably don't need an evaluator; but then there may be other things, like if you're summarizing a financial document, or summarizing or extracting from a medical document, where there may be certain requirements. essentially the key thing is i'm trying to force people to look at the data. oh my god, i think this is, like... you should just post this
on twitter and tag hamel and me um and who knows what i mean like this is like um brilliant yeah and also 00:45:39.120 |
it might not have been your main intent but i think it makes for a good framework uh to uh 00:45:45.520 |
i don't know maybe to answer skeptics about llm evaluators uh it could show that there's a systematic 00:45:53.760 |
way of uh ensuring at least some some robustness or at least some reliability in lm evaluation before 00:46:00.880 |
actually deploying it in production. exactly right, i think that's what we're trying to do with the metrics here. we can say, hey, you give me (again, maybe not 50, give me 100 to 400 rows of labeled data) and i am able to do well on that. and that's no different from machine learning: instead of using the data to train a model, what we're doing now is using the data to fine-tune a prompt. that prompt, coupled with the LLM that you're using, is now the artifact that you can then optimize against the labeled data. essentially it always starts with having labeled data, so i'm trying to force people to label data instead of just using vibe checks. when you are doing vibe checks, you can also just take the extra step to label the data, and then with that labeled data i don't need to bother you anymore to do the vibe checks; you give me a little bit and i try to optimize it.
it completely agree thank you you're welcome um so nicolay uh have you played around using re-rankers 00:47:01.360 |
what do you mean nicolay um yeah no i talked to the guy from mixed bread today and he mentioned they're 00:47:10.480 |
using the re-rankers a lot in evaluations but also in test time for example to evaluate something like 00:47:17.600 |
factuality and just training the re-rankers with a small data set um sorry i i confess i was trying 00:47:30.080 |
to ping shreya and i can't multitask so you're using re-rankers to try to evaluate what does that mean yeah 00:47:36.000 |
basically as opposed to training a classifier to predict like the label you are using re-rankers 00:47:43.280 |
and um so for example for factuality is like an an obvious one and using the scores plus a threshold 00:47:52.480 |
which apparently seems to work better than classifiers in their cases so okay just understand 00:47:59.360 |
understand just to make sure i understand you well you're using the re-ranker to rank poorer things down 00:48:06.640 |
no, in the end the re-ranker is trained on pairs. so as you have in factuality checks, when you have NLI models: in an NLI model you basically have a hypothesis and you have the ground truth, and you try to determine whether the hypothesis and the ground truth align. you could use that, basically, with the label to train the re-ranker, and then set a threshold: above the re-ranker's score it's determined basically as the class, like
one it's passing zero it's not passing i see i see so what you're saying is that okay we have some 00:48:50.240 |
references we have some golden samples and then we compare we train the re-ranker to rank the golden 00:48:55.440 |
samples higher yeah that could work i mean if you have a threshold that could work yeah i think what 00:49:03.120 |
is, um, what is hard for me to understand sometimes is that if you're using a re-ranker and you pass in two pairwise preferences, your re-ranker is actually returning a score; it's not really a pairwise preference, so that was tripping me up. if your re-ranker is returning a score, you can actually also think of it as a classifier.
yeah the the thing previously when i tried to do this several folks including from um 00:49:30.400 |
some of the labs they suggested doing pairwise preferences so the reason why pairwise preferences 00:49:37.440 |
cannot work is that if you give two things that are both factual um or if you give two things that 00:49:44.880 |
are both non-factual you would say that one is better than the other but it still doesn't meet the bar 00:49:49.440 |
of being factual enough but in your case i mean the case of re-ranker that you mentioned there's 00:49:54.160 |
a score involved so you use the score to cut the threshold and that would work 00:50:02.240 |
played around using re-rankers, as opposed to a classifier.
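Here is a rough sketch of the threshold-on-score idea described above, using a cross-encoder from the sentence-transformers library as a stand-in for the re-ranker; the model name and the threshold value are assumptions (an NLI-tuned re-ranker would be the closer fit for factuality), and the threshold should be calibrated on your labeled data:

```python
# Sketch of using a re-ranker score plus a threshold as a pass/fail evaluator.
# The cross-encoder model and the threshold below are illustrative assumptions.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # placeholder re-ranker

def passes_factuality(source: str, summary: str, threshold: float = 0.0) -> bool:
    # Higher score = the summary is more strongly supported by the source text.
    # Raw scores are not probabilities; sweep the threshold on labeled data.
    score = reranker.predict([(source, summary)])[0]
    return score >= threshold
```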
okay i think shreya might not be able to join uh her seminar has been running late 00:50:19.280 |
or any other ideas on how we can make this more fun 00:50:28.960 |
for people sorry i just had a follow-up to what you just said regarding the pairwise cannot be used 00:50:37.280 |
for optimized evaluator so um what if you take like a batch of input outputs and then you did the pairwise 00:50:44.080 |
preferences computed elo scores for each one of the input outputs right within the particular batch 00:50:50.880 |
and then i guess you could have some kind of a threshold on okay anything which crossed this elo rating is 00:50:58.640 |
probably good enough like couldn't you use that method because like yeah you can then turn that 00:51:04.560 |
pairwise thing into like a per data point threshold that's i think that's a great point i think the 00:51:11.360 |
the way to answer this is... (um, man, Zoom just gives me no feedback on whether i'm sharing my screen or not.) the question is: is your task objective or subjective? if your task is factual, something more objective like toxicity or factuality, where it's a clear one or zero, then comparing two zeros and saying that one is better doesn't make sense, and neither does comparing two ones and saying that one is worse. but if your task is more subjective, like which of these is more persuasive, which has a better tone, which is better clickbait, which is more relevant, then yes, a pairwise comparison could work. so that's how i think about the task. a lot of times it's like: okay, we have control (control is our current one), then we have something new, say new or better recommendations, a new or better prompt, a new or better output, a new or better way of creating the summary, and in that case pairwise comparisons can work. and i know AI News does a lot of pairwise comparisons; almost every time there's a new model that comes out, it does pairwise comparisons, and it actually involves users in the pairwise comparisons as well, and shares that, saying these are the different models. that can work. on the other hand, for certain things that are just more objective, i think pairwise comparisons don't really make sense.
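For the Elo idea raised in the question above, here is a minimal sketch (my own illustration, not something from the paper) of turning pairwise preference judgments within a batch into per-output Elo ratings that you could then threshold:

```python
# Sketch: per-output Elo ratings from pairwise preference judgments within a batch.
# judgments is a list of (winner_id, loser_id) pairs from an LLM comparator.
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def elo_ratings(judgments: Iterable[Tuple[str, str]], k: float = 32.0) -> Dict[str, float]:
    ratings: Dict[str, float] = defaultdict(lambda: 1000.0)
    for winner, loser in judgments:
        # expected win probability for the winner under the standard Elo model
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        ratings[winner] += k * (1.0 - expected_win)
        ratings[loser] -= k * (1.0 - expected_win)
    return dict(ratings)

# Outputs whose rating crosses a chosen cutoff are treated as "good enough".
ratings = elo_ratings([("out_a", "out_b"), ("out_a", "out_c"), ("out_c", "out_b")])
good = [o for o, r in ratings.items() if r >= 1000.0]  # the cutoff is an assumption
```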
I think just one minor pushback: I think the differentiation on objectivity versus 00:52:45.520 |
subjectivity is not as clean in practice. For example, here you have instruction following as 00:52:51.200 |
an objective metric. I think to an extent it's objective, right, if you have something like don't 00:52:56.560 |
exceed this many words and don't use this kind of language. But even just between 00:53:03.440 |
two domain experts, let's say it's a medical task, I feel like there is some level 00:53:09.440 |
of subjectivity. It's like taste, that's what I think it is, where each domain expert is like, well, I 00:53:16.080 |
think a better way of following the instructions would have been this way. I think that's 00:53:22.000 |
where I think in practice this kind of breaks down. What I've seen is, 00:53:27.520 |
even with factuality, there's that gray area of how you interpret the 00:53:34.000 |
data. So maybe that's an issue on the margins, maybe it does not end up being as effective. 00:53:39.920 |
But yeah, that's a good point, there are definitely gray areas, right. For example, should the show My 00:53:46.480 |
Little Pony be tagged as horse, or should it not be? I mean, people might 00:53:51.360 |
search for horses and they're looking for equestrian, like Olympic equestrian, definitely not My Little 00:53:56.000 |
Pony, but, well, My Little Pony is a horse. So that's a great point, there's definitely a huge chunk of 00:54:02.640 |
gray area in the middle that's just hard to get precise on. Yeah, because 00:54:08.960 |
I just feel like if AI is going to get very intelligent, it's going to do more things like that. Yeah, and I think 00:54:15.600 |
even for that we can just make a call to say that, okay, in this case we actually want it to be one, and 00:54:20.320 |
we can write an instruction for it and simplify that task to binary, as opposed to 00:54:25.440 |
is this a better summary than this, which one has more relevance, which one is more comprehensive, 00:54:30.880 |
etc. For me that is really hard. You can try to break it down into binary, you can say relevance is 00:54:37.120 |
composed of these criteria, comprehensiveness is composed of these criteria, but I think we are 00:54:42.960 |
usually better off with a pairwise comparison if you're trying to compare control versus treatment, which is better. 00:54:47.680 |
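To make the two framings concrete: a binary, criterion-based check versus a control-vs-treatment pairwise comparison could be sketched roughly as below. Both prompts and the injected `call_llm` callable are hypothetical stand-ins, not anything from the paper or from the app discussed earlier.

```python
from typing import Callable

def binary_check(call_llm: Callable[[str], str], criterion: str, output: str) -> bool:
    """Objective framing: grade a single output against a yes/no criterion."""
    prompt = (
        f"Criterion: {criterion}\n\nOutput:\n{output}\n\n"
        "Does the output satisfy the criterion? Answer YES or NO."
    )
    return call_llm(prompt).strip().upper().startswith("YES")

def pairwise_check(call_llm: Callable[[str], str], task: str, control: str, treatment: str) -> str:
    """Subjective framing: which of two outputs (control vs treatment) is better?"""
    prompt = (
        f"Task: {task}\n\nResponse A (control):\n{control}\n\n"
        f"Response B (treatment):\n{treatment}\n\n"
        "Which response is better overall? Answer A or B."
    )
    return "treatment" if call_llm(prompt).strip().upper().startswith("B") else "control"
```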
Yeah, thank you. You're welcome. Okay, I guess that's it. I guess Shreya isn't 00:54:58.160 |
going to have a chance to join us. It's okay, I think she's giving a seminar right now 00:55:05.440 |
and you just cannot join. Well, no, no, I'm here, I've been here for five minutes. Sorry, you took your 00:55:12.960 |
time. Sure, yeah, sorry, I was at a seminar that ended at 12:50, I'm so sorry. It was not clickbait, 00:55:20.160 |
I did legitimately invite Shreya; I was hopeful she would be joining. So now that 00:55:25.760 |
Shreya, the author of the paper, is here, does anyone have any questions for her about the paper? 00:55:31.360 |
I think we all just asked you the questions earlier and you answered. Yeah, actually, there's one 00:55:40.960 |
question that I see from Venu: Shreya, is there feedback on the information utility of specific 00:55:48.240 |
labels for improving alignment? We do not have it in the paper's prototype, but the v2 that will 00:55:58.240 |
hopefully be pushed in the next couple of weeks, it has to be before the conference, that does consider 00:56:03.760 |
natural language feedback. Nice, so essentially, is it like chain-of-thought feedback, and then 00:56:09.520 |
the LLM will use it to update the prompt or update the assertions? If the user gives feedback, maybe 00:56:16.800 |
on why something is bad, just a few words, then that is included in the prompt for that validator, 00:56:24.160 |
for that task, or sorry, for that criterion. Oh, essentially it's few-shot examples? Yes. 00:56:32.000 |
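A rough sketch of what "feedback becomes few-shot examples" could look like; the prompt format here is made up for illustration and is not the actual v2 implementation.

```python
from typing import List, Tuple

def validator_prompt(criterion: str,
                     output: str,
                     feedback: List[Tuple[str, str, str]]) -> str:
    """Build a judge prompt for one criterion, folding past (output, grade,
    user feedback) triples into the prompt as few-shot examples."""
    shots = "\n\n".join(
        f"Output:\n{past}\nGrade: {grade}\nReason (human feedback): {why}"
        for past, grade, why in feedback
    )
    return (
        f"You are grading outputs on this criterion: {criterion}\n\n"
        f"Previously graded examples:\n\n{shots}\n\n"
        f"Now grade this output:\n{output}\n"
        "Answer with PASS or FAIL and a one-sentence reason."
    )
```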
Oh, nice, that's great. Any other questions? Shreya, I have a quick question: are you planning to publish 00:56:42.880 |
this framework in some way as an open source project? I didn't see if you had already done that or what's 00:56:48.160 |
going on. Not yet, it's not out yet, but we will; the conference is in three weeks, we have to have it 00:56:53.520 |
out by then. It'll be implemented in ChainForge (chainforge.ai), which is an open source LLM pipeline building tool. 00:57:01.120 |
Right, okay, awesome. Great, thanks. Shreya, if I may, you had a tweet recently about good task 00:57:12.880 |
decomposition and how it can 1000x the accuracy of an LLM pipeline. How do you think that 00:57:19.600 |
interacts with an evaluation framework? Would you then have to evaluate each component 00:57:26.560 |
separately, or does it all tie in in some end-to-end kind of way, or some hybrid of both? 00:57:33.440 |
Great question. We're exploring this as a research topic, but the idea is to have each node in your 00:57:40.800 |
graph do a standalone thing that you can have standalone assertions for. And if you 00:57:48.880 |
think about, you know, infinitely many inputs flowing through your pipeline, there's going to be some 00:57:53.040 |
fraction of inputs that fail at each step. Then you can track error throughout the pipeline and do some 00:57:59.760 |
form of causal analysis to see which node is the biggest reason for why the most 00:58:06.240 |
downstream node failed. That's something we're working on now, but the idea is you should do 00:58:11.200 |
decomposition in a way that you can also validate the intermediates; they should be somewhat 00:58:17.440 |
of a unit task. So basically it would involve some bottleneck analysis of which nodes, like you said, 00:58:24.320 |
are most responsible for the overall performance, and focusing on those. Okay, got it. Yeah, thank you. 00:58:29.920 |
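A minimal sketch of the decomposition idea just described: each node does one unit task with its own assertions, and you count where inputs first fail to find the bottleneck node. This is only an illustration under those assumptions, not the research implementation being described.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class Node:
    """One step in a decomposed pipeline, with its own standalone assertions."""
    name: str
    run: Callable[[Any], Any]
    assertions: List[Callable[[Any], bool]] = field(default_factory=list)

def failure_counts(nodes: List[Node], inputs: List[Any]) -> Dict[str, int]:
    """Push inputs through the pipeline and count, per node, how many inputs
    first fail that node's assertions. The node with the largest count is the
    bottleneck to focus on."""
    counts = {node.name: 0 for node in nodes}
    for x in inputs:
        value = x
        for node in nodes:
            value = node.run(value)
            if not all(check(value) for check in node.assertions):
                counts[node.name] += 1
                break  # attribute this input's failure to the first failing node
    return counts
```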
All right, I think that's the hour. We're slightly distracted in the Discord chat with 00:58:40.400 |
Mira Murati getting... well, she says she resigned, but we don't know. 00:58:45.280 |
Not much speculation we can do. Yeah, I think that's it, thanks. Wait, what's happening next week? 00:58:54.640 |
Yeah, we have people signing up. I think Sam signed up to do a paper; he didn't set a date. 00:59:01.840 |
Yay, Sam. We'd really like to dive into function calling and the Berkeley team. 00:59:07.440 |
I guess we're just mainlining all the Berkeley PhDs. The Berkeley team put out v3, if, Sam, you're interested. 00:59:14.400 |
Yeah, I can try. You got this, you got this. It'll be my first one. 00:59:24.400 |
So you want me to do the Gorilla paper and then v3 of the BFCL, or just the v3? Yeah, I actually think 00:59:35.760 |
the Gorilla paper is not that important anymore, but they just don't have papers for v2 and v3 of 00:59:41.280 |
the Function Calling Leaderboard, which they put out in the last two months, and this is like the 00:59:46.240 |
industry standard now for function calling, which I've been meaning to dive into. But if you want to 00:59:52.480 |
take it up next week, it's yours. Yeah, all right. Just as a hint, if you drop by either 00:59:58.800 |
the Gorilla server or the MemGPT server, both Charlie and Shishir are generally 01:00:04.400 |
available in there if you have questions. Both are Berkeley servers; Berkeley should just have one, because there are 01:00:09.520 |
too many things, they should just consolidate it into one. But we are guests. We're actually 01:00:14.720 |
talking to LMSYS next week as well, so we're trying to make a Berkeley day, to have 01:00:21.680 |
like vLLM, like LMSYS, all these people. Hell yeah. I don't know if we'll actually do it, but whatever. 01:00:32.960 |
I guess that's it. Thank you, Eugene. Thank you, everyone. Yeah, and 01:00:42.240 |
I'm glad that we have a paper next week. Yeah, okay, thanks. Thanks to six century for all the feedback 01:00:48.240 |
on the initial prototypes. See everyone, see ya.