
[Paper Club] Who Validates the Validators? Aligning LLM-Judges with Humans (w/ Eugene Yan)


Whisper Transcript | Transcript Only Page

00:00:00.000 | I mean, the question was do we record the whole thing or just the
00:00:06.480 | just the presentation but then the presentation sometimes often extends to the whole thing so
00:00:10.960 | uh if you want private question and answers you can write them in the chat they will not be
00:00:16.400 | reflected in the youtube recording or we will do it afterwards when we stop recording okay yeah
00:00:21.680 | and people have brought up can we save the chat transcripts so I think like for a writer episode
00:00:26.480 | I saved the chat transcripts there's a lot of good questions in there now if people are interested
00:00:30.400 | if we post them that's good to know because I've been saving them sometimes but they haven't been
00:00:35.680 | shared yeah it could be and just a general thing for everyone that joins there's always great questions
00:00:43.840 | that come up in chat so yeah always always open yeah I think the chat is the main one that's still
00:00:50.080 | that's what was nice about discord hosting it because all the chat just stays in discord
00:00:55.120 | in this space I have it pulled up to the side but if I miss any of them um please just pick you know
00:01:00.560 | just raise your hand and just ask me to stop okay cool so we're going to talk about this paper who
00:01:07.280 | validates the validators um I don't have any slides as per normal uh but I think that there are a few
00:01:13.440 | key interesting ideas in this paper that I want to quickly go through in um hopefully in just uh
00:01:24.320 | give me a second let me try to see if I can actually share my entire desktop I want to quickly go through
00:01:29.680 | this uh paper in maybe about 30 45 minutes then I also want to share about um an app that I built
00:01:37.680 | that I try to show all of this with evaluation mode um labeling mode as well as optimization mode
00:01:46.800 | um I had given Shreya a sneak preview of this the last time um so we'll see okay now on to who
00:01:55.600 | validates the validators um so what what is this paper about so essentially this paper is about okay
00:02:01.120 | everyone's using LLM evaluators uh but the question is how are we actually going to evaluate them
00:02:07.120 | so obviously I think that the way to do this is to label some data and then check
00:02:14.720 | how well it matches against that labeled data and you will see that that's how they try to do it as
00:02:19.920 | well um just to confirm you can actually see my screen right you can actually see the yellow and blue
00:02:26.080 | highlights okay yeah so I'm just because I'm showing the whole desktop so we'll go through a couple of
00:02:32.480 | key ideas here but the one key idea is this idea that's very interesting uh and these are scattered
00:02:38.480 | throughout in blue so firstly the observation that Shreya and her co-authors made is that to grade outputs
00:02:47.520 | we need to have some evaluation criteria but the process of grading outputs helps us define that
00:02:53.920 | evaluation criteria so it's a chicken-and-egg problem um across many tasks in terms of evaluation I've seen
00:03:00.720 | that trying to create criteria without actually having seen your data
00:03:06.720 | actually leads to criteria that's very divorced
00:03:09.760 | from reality like people will say oh I don't want any toxicity I don't want any um spelling errors
00:03:15.360 | etc etc but the thing is well you know those are actually not problems um or they could say that
00:03:20.000 | oh you know I want to tell it to speak like it's speaking to a friend well the thing is
00:03:23.760 | if you actually try it and you look at the data you realize that you actually cannot
00:03:27.360 | get it to create that kind of text um it can't really return
00:03:34.960 | slang like for example things like romantasy which is a mixture of romance and fantasy like this
00:03:40.400 | kind of genre or like grumpy sunshine um or reverse harem so things like this are not really quite
00:03:47.760 | part of its vocabulary it's very hard to prompt it to generate that kind of data so that
00:03:53.360 | leads to undue expectations right these are expectations that just cannot be met so therefore
00:03:58.880 | what they're saying is that it is impossible to completely determine good
00:04:05.040 | evaluation criteria without actually looking at all the outputs so essentially the point is you have
00:04:10.800 | to look at the data before you can come up with evaluation criteria um unfortunately this is not the main
00:04:16.480 | part of the paper but it was a very key thing and I think this was actually one of the biggest
00:04:21.360 | uh aspects of the paper so essentially what what she's proposing is that okay here's the typical
00:04:29.040 | evaluation pipeline um that's right on top okay we start with uh prompting we get outputs then we
00:04:35.680 | just use some kind of evaluator LLM without really checking it we just -- oh Vibhu you're not muted
00:04:51.120 | thanks thanks i took care of it yeah yeah okay so what we're seeing is that okay it's uh on the top
00:04:58.160 | that's how we're doing it right now but we're not actually um checking the LLM evaluator so what
00:05:04.080 | they are proposing is that okay we put it through an LLM evaluator we have some kind of candidate
00:05:09.360 | criteria and the human also does need to grade the output in terms of yes or no and then we check
00:05:15.280 | how well the candidate criteria aligns with the human grader output and you know this is an iteration
00:05:20.160 | of the process um only once it's aligned enough and that's why they have an alignment report card
00:05:25.280 | we consider the lm evaluator as aligned and usable both in development and in production so essentially
00:05:32.880 | users iterate through the process of refining criteria and grading so that you align the lm together with
00:05:38.000 | to your criteria i think there are three main problems uh when trying to use an
00:05:44.160 | LLM evaluator i think two of them are essentially solved one is at the end which is given
00:05:49.040 | the ideal prompt can an LLM evaluator achieve a sufficient level of accuracy i believe the answer to this is
00:05:54.320 | yes second one is how do we get humans to label a hundred to four hundred samples of data with
00:06:00.800 | sufficient high quality i think the answer to that is also solved uh essentially you just motivate them
00:06:06.960 | incentivize them the right way but the one that's in the middle is that now that you have this
00:06:12.160 | hundred to four hundred golden samples how do you come up with some kind of prompt or some kind of
00:06:17.520 | input or output uh some kind of prompt or instruction or few-shot samples that can actually achieve that
00:06:23.680 | ideal prompt that can align well so this part that is in the middle i think is somewhat of an
00:06:30.080 | unsolved problem at least for me uh and we'll see how they try to solve this um so EvalGen's design
00:06:37.840 | essentially how can we assist developers in creating evaluators
00:06:41.520 | so EvalGen's design how do we assist developers uh in creating evaluators to grade the LLM outputs right
00:06:52.320 | so essentially instead of having them use vibe checks how can we assist them to initialize the
00:06:56.400 | LLM evaluator once and then we can move on so they have a workflow here i think it's useful for
00:07:02.560 | me to just talk through this i have had the chance to actually play with this uh live and it's very nice
00:07:07.920 | and a lot of it is what inspired my own uh prototype so essentially at the start you can see here there's
00:07:14.720 | a prompt node uh where you're just writing a prompt essentially these are tweets you'll be doing named
00:07:18.960 | entity recognition on tweets so this is the generation prompt now over here this uh still in box a you
00:07:25.680 | can see that here over here the multi evaluator this is the evaluation prompt so in order to generate
00:07:30.400 | an evaluation prompt they provide you three ways which is you can infer the criteria based on
00:07:34.480 | the prompt uh you can write criteria manually or you can grade some responses first so
00:07:42.080 | that's part two so after you've done this it will generate the criteria itself so essentially you're just
00:07:48.480 | getting GPT-4 to generate some kind of criteria and this kind of criteria can be in the form of code or it can
00:07:54.960 | be in the form of an LLM prompt so there's two kinds of criteria here one is code and one's an
00:07:58.800 | LLM prompt so you can imagine things like no hashtags it's really just a regex search for hashtags uh
00:08:04.560 | whereas no made-up entities uh this is probably something that you need to rely on an LLM for so that's
00:08:10.720 | box c over here creating criteria and then as you're creating criteria as you run
00:08:16.960 | these criteria through the LLM you have the option to grade
00:08:23.440 | some of the sample data whereby you can say that okay is this good or is this bad
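To make the two kinds of criteria concrete, here is a minimal sketch of a code-based assertion versus an LLM-based assertion for the tweet example; this is illustrative only, not the paper's implementation, and the function names, judge prompt, and `call_llm` hook are assumptions.

```python
import re

def assert_no_hashtags(response: str) -> bool:
    """Code-based assertion: a plain regex check, no LLM needed."""
    return re.search(r"#\w+", response) is None

def assert_bulleted_markdown(response: str) -> bool:
    """Code-based assertion: every non-empty line should start with a markdown bullet."""
    lines = [line for line in response.splitlines() if line.strip()]
    return all(line.lstrip().startswith(("-", "*")) for line in lines)

# LLM-based assertion: a criterion like "no made-up entities" can't be checked with a regex,
# so we ask an LLM judge for a binary pass/fail. `call_llm` is a placeholder for whatever
# client you use (OpenAI, Anthropic, etc.) that takes a prompt and returns the model's text.
NO_MADE_UP_ENTITIES_PROMPT = """You are checking a named-entity extraction from a tweet.
Tweet: {tweet}
Extraction: {response}
Does the extraction contain any entity that does not appear in the tweet?
Answer with exactly one word: PASS if there are no made-up entities, FAIL otherwise."""

def assert_no_made_up_entities(tweet: str, response: str, call_llm) -> bool:
    answer = call_llm(NO_MADE_UP_ENTITIES_PROMPT.format(tweet=tweet, response=response))
    return answer.strip().upper().startswith("PASS")
```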
00:08:29.840 | unfortunately this is very broad is it good or is it bad it's not clear enough i think and the
00:08:35.200 | authors actually do mention that it would be helpful to be more specific in their lessons for future work and then
00:08:40.400 | finally in box e over here you can see uh how well the LLM is aligned so i think this
00:08:49.120 | box e is very useful so they use the terms coverage and false failure rate
00:08:58.160 | so to simplify things coverage what it really means is recall how many of the bad responses can
00:09:05.280 | we actually catch and false failure rate is actually one minus precision so it took me a long
00:09:12.160 | time to try to understand this and i had several chats with Shreya to really clarify what she was
00:09:17.120 | trying to indicate false failure rate means that we are actually failing outputs that should
00:09:22.800 | not have been failed so essentially how much of it has been wasted how many innocents are we just
00:09:27.840 | excluding so it's essentially one minus precision um okay so essentially this is the entire workflow
00:09:35.520 | the entire ux that they came up with the LLM will help you
00:09:41.360 | create criteria in box b you will grade some responses in box d and then eventually we check
00:09:47.920 | how well they align in box e and if it's aligned enough that's it we have our LLM evaluator so um well i
00:09:57.280 | don't have to go through all this essentially that was all uh indicated in the diagram and
00:10:02.000 | of course the main thing is that the users can edit everything and the users are also asked to create
00:10:06.080 | their things um i think i will pause here to check if there are any questions
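Since coverage and false failure rate come up throughout, here is a minimal sketch of that math, assuming binary labels where True means the human marked the output as bad and the assertion flags the output as failing; this is just the recall / one-minus-precision arithmetic described above, not code from the paper.

```python
def alignment_metrics(human_bad: list[bool], assertion_failed: list[bool]) -> dict:
    """Coverage = recall on bad outputs; false failure rate = 1 - precision of the failures."""
    tp = sum(h and a for h, a in zip(human_bad, assertion_failed))        # bad and flagged
    fp = sum((not h) and a for h, a in zip(human_bad, assertion_failed))  # good but flagged
    fn = sum(h and (not a) for h, a in zip(human_bad, assertion_failed))  # bad but missed
    coverage = tp / (tp + fn) if (tp + fn) else 0.0      # recall: share of bad outputs caught
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    false_failure_rate = 1.0 - precision                 # share of flagged outputs that were actually fine
    return {"coverage": coverage, "false_failure_rate": false_failure_rate}

# e.g. alignment_metrics([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
# -> coverage 0.67, false failure rate 0.33
```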
00:10:15.520 | the chat has not much in chat okay sure thing no no questions then uh i will proceed
00:10:25.280 | okay so now we know what the EvalGen workflow looks like let's see how well it works and what
00:10:33.600 | developers say because they also had a small study on developers so essentially um how is it done criteria
00:10:39.280 | suggestion GPT-4 just proposes various criteria um candidate assertions essentially as they create
00:10:46.320 | these criteria they will either use code or a prompt to assert them and
00:10:51.680 | then return the scores and after that they will sample some of the responses to get the users to
00:10:57.440 | give binary feedback i think this is really interesting whereby you try to sample the rows that make the
00:11:02.880 | biggest difference i couldn't quite figure out how they were actually doing this in a smart way that
00:11:08.000 | made a difference other than just going through the entire list maybe when Shreya is here we can actually
00:11:11.920 | answer that so they had a lot of uh selectivity of candidate assertions essentially what this means is
00:11:17.680 | that they will go select the assertions that have a higher rate of passing and then they also have
00:11:24.000 | confidence scores etc um so here's how it compared to a previous paper by Shreya as well essentially it's
00:11:32.080 | called SPADE essentially the idea of SPADE in a nutshell is that you version control the prompts and
00:11:39.040 | the assumption there is that every time there's a prompt change
00:11:44.400 | it's usually to add an extra instruction to prevent a failure case so essentially by version controlling the
00:11:51.360 | prompts by checking all the different prompt changes you can create test cases out of it and you can
00:11:55.520 | create an evaluator out of it so that's what they were comparing against they have two data sets one is a
00:12:01.840 | medical pipeline essentially uh transcription doctor patient calls so the goal is to extract specific
00:12:07.520 | information without revealing any pii data the other one is a it's an e-commerce data set essentially
00:12:14.800 | a hundred amazon products and the goal is to write seo optimized product descriptions
00:12:20.720 | without mentioning the negative reviews so without mentioning the negative reviews so medical and
00:12:26.640 | product uh and e-commerce uh so what i like what i like about these data sets is that it's fairly
00:12:31.520 | balanced you know um the defect rate okay so essentially 70 and 50 percent of it was good uh essentially
00:12:38.800 | if you flip it the defect rate is actually 30 and 50 percent which is what we actually really want to catch
00:12:45.600 | so here's our two data sets uh medical and product with 30 and 50 percent defect rates um so
00:12:51.520 | essentially here's let's just focus on the results what we can see over here is that for the medical
00:12:59.360 | pipeline we were able to identify uh recall essentially was 33 percent it's a bit on the lower side
00:13:08.240 | that said i think they didn't really do a lot of work to try to optimize it and try to improve recall
00:13:12.240 | but essentially right off the bat recall is 33 percent so this is equivalent to SPADE and the
00:13:19.440 | precision is also 90 percent remember false failure rate is one minus precision so precision is 90 uh and this
00:13:27.520 | is pretty good precision uh that said you might say hey you know what's the difference
00:13:31.120 | between EvalGen and SPADE well the difference here is that EvalGen only needed three assertions be
00:13:37.120 | they code assertions or LLM prompts so essentially it needed fewer than SPADE which needed five
00:13:43.360 | to get comparable results now when we look at the product pipeline we see that EvalGen only requires
00:13:51.120 | four assertions which is half of what SPADE needed um essentially four different uh criteria four different
00:13:58.400 | LLM prompts or code assertions essentially up to four and it was able to
00:14:04.400 | achieve pretty good recall which is close to 75 percent um in production ideally you want to try to get to
00:14:10.000 | 80 or 90 -- 90 is really hard to achieve in production but 80 to 90 is pretty good uh but what's not so good is
00:14:16.240 | the precision precision over here is 61 percent right because again false failure rate is one minus precision
00:14:25.440 | uh eventually long story short EvalGen does outperform SPADE
00:14:32.240 | uh in terms of achieving uh similar or better recall and precision but still a ways to go
00:14:38.240 | uh before we can use it in production uh that said this idea was actually pretty good in the sense that
00:14:48.000 | again remember what's happening here is that the lm is just seeing the prompt either inferring some
00:14:54.640 | criteria or the user is just manually creating some criteria the user is not writing the code
00:14:58.880 | the user is not writing the evaluator prompt um but just by getting the user to label i don't know how
00:15:05.280 | many the user actually labeled i think maybe 16 to 20 in those two data sets we were actually able to just
00:15:11.280 | come up with the prompts just like this without the user having to -- imagine a user
00:15:16.640 | who is a product manager or an sde or software engineer who doesn't know how to write prompts
00:15:21.360 | doesn't know about few-shot prompting doesn't care about chain of thought um and they shouldn't right
00:15:26.080 | all they should do is to provide you hey what's my criteria and then can you get it aligned
00:15:30.160 | so i think that's a pretty good result so any questions here before i go into user study
00:15:37.440 | okay okay i guess not uh how do i get the chat um when when you're talking about these
00:15:48.800 | these prompts are they iterative prompts like when you say getting it aligned you're talking about
00:15:54.560 | exchanges with the model this isn't fine-tuning the model right no no it's not fine-tuning the model what i
00:16:00.080 | mean by getting it aligned is that um so imagine -- okay so let's look at box d over here right
00:16:09.120 | so the the user is just saying is it good or bad so now there could be many many reasons why it's good
00:16:13.840 | or bad it could be that because there's too many hashtags it could be that it's not using bullet points
00:16:18.240 | it could be that um it's making up entities again the model just doesn't know and i guess in
00:16:24.400 | this study they just simplified it as good or bad so there could be many different uh
00:16:29.760 | criteria many different evaluation prompts so what happens is that again in box c you see
00:16:36.080 | here the LLM is coming up with many uh possible criteria it's trying to guess what the user
00:16:43.120 | needs right um either based on the prompt or based on what the user has typed maybe just some
00:16:48.080 | free text string so it's coming up with these evaluators so eventually what we're going to do is we're going to run
00:16:53.200 | these evaluators and try to see how well they align with the data that the user has
00:16:59.760 | provided which is thumbs up or down and i'll have an example demo of this uh in a bit where we can see
00:17:06.880 | how our alignment is and so what is going to happen is that there's an algorithm on the back
00:17:13.120 | end uh that essentially just computes which of these evaluators we should use or combine and which of these should
00:17:21.200 | be dropped to have the highest correlation with the human input good or bad labels
00:17:26.720 | does that make sense and of course you know if the correlation is bad you could try to optimize the
00:17:33.760 | prompt rewrite the prompt in a different way to find a better correlation so that's what they mean by
00:17:39.440 | alignment which is aligning to the user provided labels of good and bad got it thank you yeah so
00:17:47.440 | essentially like this this is what they really care about coverage they want to have high coverage of
00:17:51.920 | bad responses and low force failure rate
00:17:57.120 | okay so what's really interesting is that they also had this um they also had a user study with nine
00:18:05.680 | uh nine industry practitioners uh i was one of them so i had a chance to play this and i was actually
00:18:13.760 | fairly uh impressed so you can see that okay the prompt is you'll be doing named entity
00:18:19.280 | extraction which is the example there extract up to three well-known entities
00:18:22.880 | and then basically they have to write a sentence describing each entity
00:18:28.000 | and answer as a bulleted markdown list so you can see that there's quite a lot of
00:18:33.280 | instructions here right extract entities um write a short single-sentence description and then return
00:18:40.000 | it as a markdown list and don't extract any uh hashtags so you can imagine this
00:18:46.560 | being split into four different criteria and some of these can be uh solved by code
00:18:52.880 | such as no hashtags uh and bullet points like this some of these probably have to be like uh do not
00:18:59.680 | make up uh any entities some of these can probably only be checked by an LLM so then what we
00:19:08.880 | found is that participants found that uh evalgen was a great starting point which is that it's a great
00:19:13.840 | starting point to make it fun for people to label data and get immediate feedback on how their labeled data
00:19:20.560 | is helping to align the LLM evaluator right and this level of feedback is in
00:19:25.200 | the order of minutes or at most an hour instead of um days or weeks right sometimes if you're not
00:19:30.720 | labeling the data yourself if you're using something like scale ai um the the data might not come back
00:19:35.600 | so quickly so the iteration loop is it really kills your product launches or your feature development
00:19:40.640 | especially when you're using uh trying to build an lm powered customer experience so here's what
00:19:48.480 | participants usually did um there's 100 lm outputs so they sort of viewed it like just quickly scan
00:19:53.920 | through it make sure it looks right and then participants uh click to start on it so there were
00:19:58.320 | six participants that just used auto-generated criteria one participant wrote the criteria and then
00:20:05.840 | two other participants actually wanted to look at the data first amazing i think that's great um so
00:20:11.280 | after that they get some criteria suggestions from EvalGen uh participants actually
00:20:16.400 | really wanted to play with the criteria and try to uh push the boundaries of it so they removed some
00:20:21.920 | suggestions added some criteria of their own to see how far EvalGen could go um so EvalGen as a starting
00:20:29.040 | point what did the participants say right uh firstly it's really hard uh to come up with criteria on what
00:20:37.360 | is a good summary or what's a good translation or what's a good extraction right so by prompting an lm and
00:20:43.200 | you can probably do this with cherry bt ui or entropic cloud ui um it was helpful that it was able to get
00:20:49.600 | the criteria one and you know this is nothing nothing outstanding um then participants who actually
00:20:57.280 | started grading before actually creating criteria said that the grading process was helpful right and we
00:21:03.520 | all know this by looking at the data and this was the the key point the key points that i highlighted in blue
00:21:08.800 | which is by looking at the data first you understand the data you feel for the data a little bit better
00:21:13.360 | and that helps you write better criteria right um so that's one i think there was one participant who
00:21:20.960 | actually said you should enforce that we look at at least 20 samples first and that's what i will
00:21:25.760 | actually show this enforcement in the demo i'm going to show later so so because you know while lm is
00:21:33.440 | doing the grading especially if there's chain of thought uh it takes some time so but a great use
00:21:38.400 | of that time is -- okay maybe you want to go get a coffee -- but a great use of that time is for the
00:21:42.080 | participants to actually provide labels themselves so this was what participants were sharing that they
00:21:47.280 | were happy to grade while waiting and here's how the output looks like right so you can see here's
00:21:53.280 | the full text of the tweet and here's the response okay let's maybe let's look at the second one
00:21:58.240 | uh great fire department and tour nyc so it extracted great fire department extract the nyc extracted fdny
00:22:05.680 | this is interesting i was not aware of this and then so you can see we have four evaluators here
00:22:10.800 | whether it's a bulleted list whether there's no made up entities whether there's no hashtags and
00:22:15.120 | whether it's in a single sentence and of course the single sentence two of these actually fail
00:22:18.800 | uh because they they're not made of a single sentence so based on this you can actually see this and
00:22:24.800 | of course all this data is just computed in the back end to create the alignment scorecard
00:22:28.000 | um participants liked looking at the coverage and false failure rate looking at this made it very
00:22:34.400 | easy to see which one actually looked right and here's what i think is
00:22:40.960 | the meat of it which is alignment is iterative right i think the main problem
00:22:48.080 | is that most people don't have very strong confidence in an LLM evaluator because
00:22:55.680 | even they themselves are not confident of the criteria and the thing is this is not explicitly
00:23:01.920 | vocalized but most of the time they're not confident of the criteria because they don't know if their
00:23:06.240 | criteria is actually helping and the reason for that is because they haven't even actually looked at
00:23:11.280 | the data sometimes the people that provide criteria are like uh pms or like uh business leaders who may
00:23:19.840 | not have a chance to look at the data uh maybe the data is too complex it's just not in an easily
00:23:25.120 | consumable format and therefore they don't know whether it actually works or not but i have found that by
00:23:30.400 | getting people to look at the data even just 10 or 20 samples right of course you have to pick
00:23:34.720 | representative samples with both defects and non-defects it helps them understand a little bit
00:23:39.120 | better right so if you think of data in general there's quantitative data where you aggregate everything
00:23:46.000 | into one or two metrics that gives you very broad scope and then there's anecdotes right where you
00:23:51.040 | actually look at specific samples and jeff bezos has this quote saying that you know when the data and
00:23:55.280 | the anecdotes disagree i tend to trust the anecdotes that's what we're that's what uh this is suggesting
00:24:01.200 | here look at some of the data samples look at some of the anecdotes um yeah so again here's the chicken
00:24:08.000 | and egg problem we need to think through criteria you need to write it down in order to grade outputs
00:24:13.920 | but when we grade outputs we also figure out new data uh new criteria i mean it's the same thing like
00:24:22.880 | it's the same thing as writing uh when you start writing you probably only know half of what you
00:24:27.440 | actually eventually finish writing as you start writing you start digging a rabbit hole deeper and
00:24:31.040 | deeper and deeper and you find more things so this is what this proposes which is that feedback loop
00:24:35.920 | looking at data creating criteria creating criteria looking at data and creating more new criteria
00:24:40.640 | that's very helpful um yeah so but one interesting thing here is that one participant mentioned
00:24:49.280 | that they gave it a bad grade not because the output is bad but they wanted to be consistent with
00:24:53.200 | their previous grades right previously they thought something was bad uh and then they give it a bad score
00:24:59.360 | but after that there could have been other things that were more egregious and you know in hindsight this is
00:25:03.200 | really just a minor paper cut it's not as critical but they continued to be consistent with their previous
00:25:09.040 | grade instead of updating their previous grade so i personally think that this is a very
00:25:16.560 | difficult problem to solve in the sense that you want to be internally consistent with what you have
00:25:20.880 | done in the past but what that means is that this perpetuates the bad data and the bad labels
00:25:26.480 | uh so i think there's a point here uh which is i i wrote to remind people that my design decision
00:25:32.800 | here is that if you find that your criteria has drifted instead of trying to maintain the same criteria
00:25:39.600 | and aligning to the to the previous grades instead what we should do is we should revisit those previous
00:25:44.880 | grades and fix it because it's an iterative process
00:25:51.200 | what constitutes alignment i think there's a question there about that and what alignment is is very subjective
00:26:04.000 | right uh i think that when it's really just binary yes or no good or bad i think that's a little bit
00:26:11.120 | easier to measure what alignment is in a sense it's really just recall and precision or Cohen's kappa
00:26:16.960 | in terms of correlation metrics uh but sometimes it's just very difficult and sometimes
00:26:22.400 | uh there was this slight disconnect when the participants provided grades right they provided
00:26:30.000 | labels but had little impact on their chosen assertions uh i think it's just not
00:26:35.280 | clear how the actual grading actually tried to optimize this and actually during that
00:26:40.560 | demo they actually didn't do any optimization or creating new evaluation prompts so that that may be
00:26:45.200 | why there was some disconnect there the other thing is that users really want to have control over both
00:26:54.400 | the evaluation type as well as the prompts themselves right and we have seen this a lot of times
00:27:01.600 | we will come up with a new prompt that actually works well on the eval but sometimes for better or for
00:27:08.800 | worse for whatever reason people just want to include just this one sentence to make sure that
00:27:13.040 | that defect doesn't happen when that defect actually doesn't happen and when you try that you know
00:27:17.120 | sometimes the evals actually perform worse uh but unfortunately you know sometimes these are just
00:27:22.560 | policy decisions that are a little bit disconnected from the data um finally this they found that you
00:27:32.640 | know as developers iterate over the criteria and as they engage with and look at
00:27:39.680 | the data essentially look at the LLM outputs it helps them refine their criteria better so essentially the
00:27:44.800 | point here is to tighten the feedback loop when creating the LLM evaluator and once the LLM
00:27:50.560 | evaluator is well aligned we can just use it in production that said uh some participants
00:28:00.000 | were actually quite skeptical of how LLM-based assertions or LLM-based evaluations might be helpful for
00:28:06.960 | monitoring LLM outputs in the production pipeline uh a year ago i would have been in this camp
00:28:12.320 | as well i think now i'm partially in this camp and partially not i'm beginning to
00:28:17.520 | see that i think it can be viable if the throughput and latency requirements are not too demanding so
00:28:23.680 | participant 8 actually said i cannot begin to think about how lm as validators in production can work i'm very
00:28:30.080 | skeptical um so yeah i think that's that's all i wanted to focus on uh essentially it's the workflow
00:28:42.960 | here that i think is very valuable uh and they also have future work um like one one thing here they
00:28:50.400 | actually didn't uh implement here which is users want their prompts to automatically improve based on
00:28:55.680 | evaluation results so this would be something like dspy etc etc that you can think about which we saw in
00:29:02.400 | some demos to automatically improve our evaluation prompts um but so yeah that's it that's all i had
00:29:09.520 | to share any questions just gonna go back to the screen here oh there's a few questions
00:29:22.960 | um okay i am gonna start from the one in the bottom um is there feedback on the information utility of
00:29:30.480 | specific labels for improving alignment
00:29:33.840 | this is not clear to me um i don't know and i don't know if shreya has joined yet she may not be able to
00:29:45.760 | join uh she may not have joined yet she has not i think this is a question that maybe i will wait
00:29:52.480 | for Shreya let me just check Twitter if she has responded i did share the Zoom link
00:29:57.280 | eugene you said you made a prototype based on uh like the i think there was like a diagram earlier
00:30:13.600 | like can you talk more about the prototype if no one else has questions for the paper
00:30:17.920 | yeah sure i can share more about the prototype um
00:30:21.760 | where is it okay here we go
00:30:30.320 | share so here's a here's a very small simple prototype essentially the question the question that i have for
00:30:41.280 | myself is how can i help users um label data and evaluate data and improve their evaluation prompts
00:30:51.440 | easier so here's a small toy that i made uh it's called Label -- LLM assistance for better evaluation
00:30:56.800 | labels um so to get started we have to upload a csv file right so you can see the csv file essentially
00:31:02.560 | i think these are fairly reasonable constraints id input output and a label and the label can be empty
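For reference, a toy version of what that upload could look like; the exact header names are an assumption based on the description above.

```python
import csv, io

# a toy example of the expected upload format: id, input, output, label (label may be blank)
sample = io.StringIO(
    "id,input,output,label\n"
    "1,<source article text>,<model-written summary>,1\n"
    "2,<source article text>,<model-written summary>,0\n"
    "3,<source article text>,<model-written summary>,\n"   # this row still needs a human label
)
rows = list(csv.DictReader(sample))
print(rows[2]["label"])  # "" -> unlabeled
```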
00:31:08.320 | so what am i going to do is i'm going to upload a csv file so this is a csv file it's the
00:31:15.040 | factual inconsistency data set um so what i'm going to do
00:31:21.040 | is let me just start from scratch um upload so here's how the csv file looks like after we've
00:31:26.400 | uploaded that's the input that's output the goal is to try to make it easy for the users to read
00:31:31.840 | right and we can enter labels so the thing is currently right now we are in
00:31:37.520 | labeling mode so to unlock evaluation mode you have to label at least this many rows and i try to provide
00:31:44.240 | feedback on you know how many rows you still have to label essentially i'm trying to gamify this to help
00:31:49.280 | users get to the next level so but i have a data set that uh already has some rows uploaded
00:32:00.560 | so this this has uh this is evaluation mode so in uh in order to achieve evaluation mode
00:32:13.520 | uh wait a minute have i already labeled so many so much data
00:32:17.680 | let me just okay maybe my data set wasn't prepared correctly
00:32:23.200 | uh by the way this is this is now evaluation mode after we've labeled 20 out of 20 it will show that
00:32:28.720 | evaluation mode has been unlocked so you see evaluation mode unlock scroll top to add your prompt
00:32:33.600 | so once evaluation mode is unlocked you can enter the prompt and it'll allow you to use the prompt to
00:32:39.440 | try to do that so evaluation mode you can just run evaluate it will evaluate all the data uh evaluate
00:32:47.520 | on the data so you can see we start getting results as it comes in piece by piece essentially what this
00:32:52.240 | is happening is that it's just running on the back end uh big thanks to swyx for teaching me how to do this the
00:32:56.640 | right way to do this so it's running in the background it's all
00:33:00.400 | stored in the database but at the same time we're also polling it once a second to show the results
00:33:04.560 | and what we see in the top left corner here these are metrics right these are metrics for recall
00:33:10.400 | precision cohen's kappa and we also have true positive true negative uh false positive and false negative
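A minimal sketch of those alignment metrics computed from the human labels and the judge's predictions; using scikit-learn here is my own choice, the prototype may compute them differently.

```python
from sklearn.metrics import cohen_kappa_score, confusion_matrix, precision_score, recall_score

def scorecard(human_labels: list[int], judge_predictions: list[int]) -> dict:
    """human_labels / judge_predictions are 1 (consistent) or 0 (inconsistent)."""
    tn, fp, fn, tp = confusion_matrix(human_labels, judge_predictions, labels=[0, 1]).ravel()
    return {
        "recall": recall_score(human_labels, judge_predictions),
        "precision": precision_score(human_labels, judge_predictions),
        "cohens_kappa": cohen_kappa_score(human_labels, judge_predictions),
        "tp": int(tp), "tn": int(tn), "fp": int(fp), "fn": int(fn),
    }
```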
00:33:16.400 | so this is uh going to take some time and it's going to evaluate the data and along the way we can
00:33:24.080 | see how well it aligns so we can actually see that okay when alignment is there you see essentially
00:33:28.880 | the label the prediction matches your label that's great and then when alignment is not there the
00:33:33.760 | prediction doesn't match the label it's uh it'll be highlighted in red and you can look through it
00:33:38.080 | you can look through it you can look at the chain of thought you can try to understand why
00:33:41.920 | it's giving this prediction based on the explanation and you can see the explanation here
00:33:46.080 | and as we scroll down oh wow this is pretty good you can see a lot most of it is largely aligned
00:33:52.000 | essentially what this is doing this is just a chain of thought followed by the final label
00:33:56.080 | and it's just going to go through it and once it goes through all 50 of it um again this is uh so
00:34:03.120 | currently we're in evaluation mode where the lm is evaluating it so to unlock optimization mode you have
00:34:08.480 | to label at least 50 rows and you have to run evaluation mode once essentially we sort of need an
00:34:13.520 | existing metric -- we need a baseline -- for how well this basic evaluation prompt works and you
00:34:18.880 | can see the prompt is actually very short just evaluate if the output summary is actually consistent with the
00:34:22.880 | input article if it's consistent return one if it's inconsistent return zero
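A minimal sketch of what that judge call might look like; the Anthropic SDK is mentioned later in the demo, but the model name, prompt wording, and output parsing here are assumptions.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_PROMPT = """Evaluate if the output summary is consistent with the input article.
Think step by step inside <thinking> tags, then answer with <label>1</label> if it is
consistent and <label>0</label> if it is inconsistent.

<article>{article}</article>
<summary>{summary}</summary>"""

def judge(article: str, summary: str) -> int:
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumed model name
        max_tokens=512,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(article=article, summary=summary)}],
    )
    text = message.content[0].text
    return 1 if "<label>1</label>" in text else 0
```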
00:34:26.240 | so now we are at uh row number 38 oh seems to have broken nope it's still running it's just taking a
00:34:35.840 | bit of time uh row number 41 so once we have at least 50 rows i actually think the right number should be
00:34:41.920 | maybe 100 to 400 if you want to be optimizing based on this uh and after we've run evaluation mode once uh we
00:34:49.440 | can actually run optimization mode so optimization mode is going to come up in maybe a couple of
00:34:55.440 | seconds so you can see right here it's just i just want to try to figure out a way to give users fast
00:35:00.000 | feedback all right so now that we've labeled 50 uh and evaluated all of them it shows that optimization
00:35:06.160 | mode is unlocked so we can scroll to the top and click on optimize evaluation prompt so optimization
00:35:10.240 | what it does is that on the back end we're just going to try to figure out how to optimize the evaluation
00:35:15.920 | prompt uh unfortunately clicking on this number clicking on this button doesn't do anything
00:35:20.320 | i just implemented this yesterday um but i tried to um well with next js it's very easy to try to make
00:35:29.440 | it something fun and funky of course i'm sure swyx uh will do something way better but essentially this is
00:35:36.240 | this is what the next step is clicking on this button will optimize in the back end we'll see a different
00:35:40.800 | we'll see a different page which has like maybe 10 rows and all the different optimization scores
00:35:45.840 | it's actually just hyperparameter tuning you can think of it as hyper-prompt tuning um so yeah that's my
00:35:51.680 | short demo any questions or feedback would be nice yeah that's fire so so of all the tested frameworks
00:36:00.560 | you ended up going with next js and not like um oh how do you know this is next how do you know this is next
00:36:05.440 | yes you mentioned it a couple minutes a second ago oh okay yeah this is next js i it was really easy to
00:36:12.240 | build this with cursor and v0.dev um and i went with it the back end uh is going to be fast api
00:36:23.040 | i believe the back end will be fast api uh or i could try to do it uh all in typescript and next js i
00:36:30.560 | don't know uh but i think the prompt optimization back end will just be fast api and then you
00:36:35.120 | know next js this app will just call fast api uh you know here's the here's the label data set or rather
00:36:43.200 | the label data set is all in a dynamo db or rather it's just in a sqlite db and then the the fast api
00:36:50.880 | backend will also have access to that and just optimizing it and computing scores
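A rough sketch of what that optimization backend might do, treating the judge prompt as the thing being tuned and scoring candidate prompts against the labeled rows; this is a sketch of the "hyper-prompt tuning" idea as described, not the actual implementation, and `judge_fn` is a placeholder.

```python
from sklearn.metrics import cohen_kappa_score

def evaluate_prompt(prompt_template: str, rows: list[dict], judge_fn) -> float:
    """Score one candidate judge prompt by its agreement (Cohen's kappa) with human labels.
    `judge_fn(prompt_template, row)` is assumed to call the LLM and return 0 or 1."""
    preds = [judge_fn(prompt_template, row) for row in rows]
    labels = [int(row["label"]) for row in rows]
    return cohen_kappa_score(labels, preds)

def optimize_prompt(candidates: list[str], rows: list[dict], judge_fn) -> tuple[str, float]:
    """Brute-force search over hand-written or LLM-proposed prompt variants."""
    scored = [(p, evaluate_prompt(p, rows, judge_fn)) for p in candidates]
    return max(scored, key=lambda x: x[1])
```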
00:36:54.400 | so this is oh sorry go ahead no i was just gonna say that was really cool i'm now trying
00:37:05.440 | to tie it back into like uh the validator paper that we just read and understand like the implications
00:37:09.520 | that you're talking about for that diagram at the beginning exactly exactly that's great thank you flo um
00:37:15.840 | so what i was trying to do was this this green section over here right i'm trying to force people
00:37:24.800 | so you can see that there's a it's very specific right i'm trying to force people i really think
00:37:29.520 | that you should be evaluating at least 20 samples evaluate at least 10 20 samples um i'm not trying
00:37:35.360 | to create candidate criteria um i'm not trying to get lm to create initial criteria the thing is i need the
00:37:41.280 | human i need a human to just provide the criteria and it could be bullet points so you can assume
00:37:46.320 | someone who doesn't know anything about prompting just gonna put bullet points i'm just gonna take
00:37:49.760 | whatever sop they have they're just gonna paste it in there and we just get s we'll just get the lm to
00:37:53.760 | run right and then we're gonna check and then we can check the alignment report card which is this
00:38:00.640 | thing over on the here on top probably will have a separate screen on its own um i don't know uh but this
00:38:07.680 | thing is just a floating screen it's just i think it's fun to see i think it gives instantaneous
00:38:12.000 | feedback that hey no this thing is doing some work your work is um your work is valuable you doing the
00:38:19.280 | labeling is valuable and therefore i want you to label more data and of course every step of the way
00:38:24.800 | i try to give you milestones after you've labeled um i'm gonna i'm gonna delete all of this or rather maybe
00:38:32.160 | not i'm gonna give okay so if you if you want to unlock evaluation mode you have to label at least 20
00:38:37.280 | rows okay so try to get into to the first milestone of labeling as 20 rows of data after you've labeled
00:38:43.440 | 20 rows of data you can run evaluation mode and if you want to unlock optimization mode you have to label
00:38:47.760 | another 30 rows um so give them the next milestone and after they've done that okay now we can do
00:38:53.760 | optimization mode i think i think for this demo the numbers were actually low i think 20 and 50 is a bit
00:38:58.560 | too low i think maybe it should be 25 and 100 but i'm actually trying to play with that uh but yeah
00:39:05.600 | that's that's the that's the this is the loop this green box here is the loop i'm trying to
00:39:12.800 | uh improve on make it make tighter okay no more questions i guess yeah it's it's a fairly easy and
00:39:29.280 | straightforward paper oh actually there's a lot of stuff in the chat is the code open uh is the code open
00:39:39.120 | it's not open yet because it's very ugly um i will
00:39:42.800 | this code is not open but the previous one on the evaluation frameworks it is open um so you can look
00:39:51.120 | at it this this one is not open i still have quite a bit of work to do to make it a little bit cleaner
00:39:56.400 | um you should experiment with the writer framework to see how it does and handle front and back for you
00:40:03.040 | yeah oh kind of like fast html wow yeah i i i know you ping me that sam uh i haven't had a chance to
00:40:10.000 | look at it um the am on is type the couple yeah um um yeah the back like this app i was really
00:40:22.720 | again when i was trying to build it before i was building i was like okay maybe let's try to uh dig into
00:40:28.640 | next js again or typescript frameworks again so in this over here uh shoot
00:40:34.480 | i switched over to a different screen um in this framework over here so long story short i went through
00:40:43.840 | fast html next js sveltekit and even had a frankenstein of sveltekit
00:40:51.360 | svelte with fast api in the backend uh this was not good um yeah i really had to hack a lot of
00:40:59.920 | things and frankenstein it i really don't like how it works but i think this is probably the best of both
00:41:04.800 | worlds um yeah the sdk might be fun to try i think i did use the vercel ai sdk for this um but i can't
00:41:16.240 | confirm let me let me let me take a look right now uh well so here's this ugly code um and the evaluate
00:41:26.720 | route oh i'm just calling anthropic directly yeah yeah the ai sdk i think i initially
00:41:35.360 | used the ai sdk from vercel but i think what was useful there was streaming and i didn't have streaming so i
00:41:41.280 | i just simplified it and just deleted um stuff to yeah to simplify it yeah the law paper is leadership
00:41:50.400 | principles oh so um eugene if i may yeah yeah um okay so in your experience were all or at least most
00:42:04.160 | tasks distillable down to a binary prediction or is that just the ideal goal i think most tasks can be
00:42:12.160 | i think if you think about it for summarization it's really relevance
00:42:17.600 | factuality comprehensiveness or maybe you can give me an example of something that you don't think can be
00:42:24.160 | distilled down um well off the top of my head because i don't have a specific use case in mind but
00:42:31.280 | it may be you know multiple class classification but i guess that can be distilled down to pairwise
00:42:37.520 | one v one uh sort of framework so
00:42:41.440 | when you think about multiple class are you thinking about does this output
00:42:48.880 | meet not meet policies for toxicity bias multiple choice questions where multiple answers uh are required for
00:42:58.720 | some questions yeah so that is not the task of an lm evaluator right that's the task of an lm classifier
00:43:06.160 | oh i see okay yeah okay okay targeting specifically the uh ability of an llm to evaluate uh whether
00:43:14.640 | something is good or bad got it okay makes more sense so you're right for multi-class like you know or maybe like
00:43:21.760 | um let's say given some kind of product description and title can you say is it uh
00:43:28.720 | is it fmcg is it clothes is it furniture yeah that that that is completely different that that would be
00:43:36.800 | that would be a separate thing but right now i'm just really focusing on evaluating it
00:43:42.640 | is this is this classification is this classification correct or wrong simple that's binary but the goal
00:43:48.960 | here is to have a uh let's say a solid justification for using an lm as an evaluator
00:43:57.920 | uh no not really um i know that we have to use an lm as an evaluator there's no way around it if we
00:44:04.800 | want to scale i think that's the only way the question is how can i tighten the alignment how can i
00:44:09.760 | tighten the feedback loop to make it easier to use LLMs as evaluators i'm taking a lot of um inspiration from
00:44:19.920 | tightening the loop in order to be able to scale to uh more instances uh faster
00:44:27.520 | exactly so you can imagine that uh maybe an ai engineer might be able to look at the data
00:44:34.480 | understand the criteria talk to pms understand criteria look at the user experience and then write
00:44:39.840 | the prompts and then they will do all the evaluation themselves right but now let's imagine that we don't
00:44:45.520 | have that ai engineer or that ai engineer is actually bottlenecked and we want to use ai everywhere
00:44:50.240 | right we're going to have lms doing classification summarization translations extraction everywhere
00:44:56.560 | um now at some point in time
00:45:02.480 | there'll be certain level of risk tolerance whereby simple things like extraction and classification
00:45:08.480 | if it's on the back end we probably don't need an evaluator but then there may be other things
00:45:13.200 | like if you're summarizing a financial document or if you're summarizing a medical document if you're
00:45:19.840 | extracting a medical document there may be certain things yeah essentially the key thing is i'm trying
00:45:25.360 | to force people to look at the data oh my god i think this is like a um you should just post this
00:45:31.120 | on twitter and tag hamel and me um and who knows what i mean like this is like um brilliant yeah and also
00:45:39.120 | it might not have been your main intent but i think it makes for a good framework uh to uh
00:45:45.520 | i don't know maybe to answer skeptics about llm evaluators uh it could show that there's a systematic
00:45:53.760 | way of uh ensuring at least some some robustness or at least some reliability in lm evaluation before
00:46:00.880 | actually deploying it in production exactly right i think the metrics here that's what we're trying to
00:46:06.480 | do with the metrics here hey we can say that hey again don't use 50 you give me
00:46:12.080 | 100 to 400 rows of data i am able to do well on that and that's no different from machine learning
00:46:17.920 | right instead of using the data to train a model what we're doing now is we're using the data
00:46:23.040 | to fine-tune a prompt which is now the model which is now the artifact that prompt married to
00:46:30.080 | or coupled with the LLM that you're using is now the artifact that you can then use to optimize
00:46:35.680 | against the labeled data essentially it always starts with having labeled data um so i'm trying to force
00:46:40.960 | people to label data uh instead of just using vibe checks when you are doing vibe checks you
00:46:45.520 | can also just take the extra step to just label the data and then with that labeled data
00:46:49.600 | i don't need to bother you anymore to do the vibe checks you give me a little bit and i try to optimize
00:46:54.080 | it completely agree thank you you're welcome um so nicolay uh have you played around using re-rankers
00:47:01.360 | what do you mean nicolay um yeah no i talked to the guy from mixed bread today and he mentioned they're
00:47:10.480 | using the re-rankers a lot in evaluations but also in test time for example to evaluate something like
00:47:17.600 | factuality and just training the re-rankers with a small data set um sorry i i confess i was trying
00:47:30.080 | to ping shreya and i can't multitask so you're using re-rankers to try to evaluate what does that mean yeah
00:47:36.000 | basically as opposed to training a classifier to predict like the label you are using re-rankers
00:47:43.280 | and um so for example for factuality is like an an obvious one and using the scores plus a threshold
00:47:52.480 | which apparently seems to work better than classifiers in their cases so okay just understand
00:47:59.360 | understand just to make sure i understand you well you're using the re-ranker to rank poorer things down
00:48:06.640 | um no in the end like the the re-ranker is trained on like you trained on peers so as you have in
00:48:18.480 | factuality checks when you have nli models so in an nli model you basically you have a hypothesis and you
00:48:25.280 | have like the ground truth and you basically try to determine where the hypothesis and the ground truth
00:48:30.560 | the lines and you could use that basically with the label to train the re-ranker on it and then
00:48:37.280 | basically set a threshold above the score of the re-ranker it's determined basically as the class like
00:48:44.960 | one it's passing zero it's not passing i see i see so what you're saying is that okay we have some
00:48:50.240 | references we have some golden samples and then we compare we train the re-ranker to rank the golden
00:48:55.440 | samples higher yeah that could work i mean if you have a threshold that could work yeah i think what
00:49:03.120 | is um what is hard for me to understand sometimes is that if you're using a re-ranker and you pass in
00:49:09.600 | two pairwise preferences what it returns -- actually your re-ranker is actually returning a
00:49:14.400 | score it's not really a pairwise preference so that was tripping me up so if
00:49:18.640 | your re-ranker is returning a score that's actually also you can think of as a classifier
00:49:24.800 | yeah the the thing previously when i tried to do this several folks including from um
00:49:30.400 | some of the labs they suggested doing pairwise preferences so the reason why pairwise preferences
00:49:37.440 | cannot work is that if you give two things that are both factual um or if you give two things that
00:49:44.880 | are both non-factual you would say that one is better than the other but it still doesn't meet the bar
00:49:49.440 | of being factual enough but in your case i mean the case of re-ranker that you mentioned there's
00:49:54.160 | a score involved so you use the score to cut the threshold and that would work
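A minimal sketch of the score-plus-threshold idea being described, using an off-the-shelf NLI cross-encoder from sentence-transformers; the specific model, label ordering, and threshold are illustrative assumptions, not what Nicolay's team actually uses.

```python
import numpy as np
from sentence_transformers import CrossEncoder

# An NLI-style cross-encoder scores (premise, hypothesis) pairs: here the source article is
# the premise and the generated summary is the hypothesis.
model = CrossEncoder("cross-encoder/nli-deberta-v3-base")  # assumed model choice

def is_factual(article: str, summary: str, threshold: float = 0.5) -> bool:
    logits = model.predict([(article, summary)])[0]
    probs = np.exp(logits) / np.exp(logits).sum()
    # index order follows the model card's (contradiction, entailment, neutral) mapping --
    # verify this for whichever cross-encoder you actually use
    entailment_prob = float(probs[1])
    return entailment_prob >= threshold  # tune the threshold on a small labeled set
```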
00:49:57.520 | yeah i i i i so to answer that i haven't
00:50:02.240 | played around using re-rankers as as opposed to a classifier
00:50:06.320 | okay i think shreya might not be able to join uh her seminar has been running late
00:50:16.800 | too bad but any other questions
00:50:19.280 | or any other ideas on how we can make this more fun
00:50:28.960 | for people sorry i just had a follow-up to what you just said regarding the pairwise cannot be used
00:50:37.280 | for optimized evaluator so um what if you take like a batch of input outputs and then you did the pairwise
00:50:44.080 | preferences computed elo scores for each one of the input outputs right within the particular batch
00:50:50.880 | and then i guess you could have some kind of a threshold on okay anything which crossed this elo rating is
00:50:58.640 | probably good enough like couldn't you use that method because like yeah you can then turn that
00:51:04.560 | pairwise thing into like a per data point threshold that's i think that's a great point i think the
00:51:11.360 | the way to answer this question the answer to this is that um let's look at man select there's just i mean
00:51:19.120 | zoom just gives me no feedback when i'm sharing my screen or not um so the way to answer this is the
00:51:25.680 | question is is your task objective or subjective so if your task is factual like something more objective
00:51:31.440 | like toxicity or factuality like is it is it's a clear one or zero then if you're comparing two zeros
00:51:37.520 | and it's saying that one is better um that you you that doesn't make sense or if you're comparing two
00:51:42.160 | ones and saying that one is worse well but if your task is more subjective like which of this is more
00:51:47.520 | persuasive which of this has a better tone which of this is better clickbait which of this is more
00:51:51.280 | relevant yes uh a pairwise comparison could work so that's how i think about the task so a lot of times
00:51:59.840 | it's like okay we have control control is our current one then we have something new let's say we have
00:52:05.280 | new or better new or better recommendations it's the same thing is that we have a new or better prompt
00:52:09.600 | we have new or better output we have a new or better way of creating the summary
00:52:13.440 | and then we just do a you can just in that case pairwise comparisons can work and i know uh ai news
00:52:19.840 | does a lot of pairwise comparisons almost every time there's a new model that comes out it does
00:52:24.560 | pairwise comparisons and it actually involves users in the pairwise comparisons as well and shares that
00:52:28.560 | saying that you know these are different models that can work on the other hand there's certain
00:52:32.880 | things that are just more objective i think pairwise comparisons doesn't really make sense
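For the Elo idea raised in the question, a minimal sketch of turning pairwise judgments within a batch into per-item ratings that can then be thresholded; the K-factor, base rating, and threshold are arbitrary illustrative choices.

```python
from itertools import combinations

def elo_scores(items: list[str], pairwise_winner, k: float = 32.0, base: float = 1000.0) -> dict[str, float]:
    """pairwise_winner(a, b) is assumed to return whichever of a or b the judge prefers."""
    ratings = {item: base for item in items}
    for a, b in combinations(items, 2):
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        score_a = 1.0 if pairwise_winner(a, b) == a else 0.0
        ratings[a] += k * (score_a - expected_a)
        ratings[b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return ratings

# items whose rating clears a threshold tuned on labeled data would count as "good enough"
```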
00:52:39.280 | um i think just one minor pushback i think the differentiation on the objectivity versus
00:52:45.520 | subjectivity is not as clean in practice like for example here you have instruction following as
00:52:51.200 | like an objective metric i think to an extent it's objective right if you have something like don't
00:52:56.560 | exceed these many words and um don't use this kind of language but even just between
00:53:03.440 | two domain experts right let's say it's like a medical task um i feel like there is some level
00:53:09.440 | of subjectivity it's like taste that's what i think it is where each domain expert is like well i
00:53:16.080 | think a better way of following the instructions would have been this way um i don't know i think that's
00:53:22.000 | where i think in practice this kind of breaks down uh what i've seen is
00:53:27.520 | that even with factuality um yeah there's that gray area of how you interpret the
00:53:34.000 | data um so yeah i think maybe that's an issue on the margins maybe it does not end up being as effective
00:53:39.920 | but yeah that's a good point there's definitely gray areas right uh for example should the show my
00:53:46.480 | little pony be tagged as horse or should it not be i mean people might search
00:53:51.360 | for horses and they're looking for equestrian or like olympic equestrian definitely not my little
00:53:56.000 | pony but well my little pony is a horse so that's a great point so there's definitely a huge chunk of
00:54:02.640 | gray area in the middle yeah um that's just hard to be precise on yeah because
00:54:08.960 | i just feel like if ai is gonna get very intelligent it's gonna do more things like that yeah and i think
00:54:15.600 | even for that we can just make a call to say that okay in this case we actually want it to be one and
00:54:20.320 | we can write an instruction for it and we can simplify that task to binary as opposed to
00:54:25.440 | is this a better summary than that or which one has more relevance which one is more comprehensive
00:54:30.880 | etc for me that is really hard you can try to break it down into binary you can say relevance is
00:54:37.120 | composed of these criteria um comprehensiveness is composed of these criteria but i think we are
00:54:42.960 | usually better off with a pairwise comparison if you're trying to compare control versus treatment which
00:54:47.680 | is better yeah thank you you're welcome
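To make the binary-versus-pairwise distinction concrete, here is a rough sketch of the two judge prompt shapes being contrasted; the wording is illustrative and not taken from the paper:

```python
# Two judge shapes discussed above (prompt text is illustrative, not from the paper).

# Objective criterion -> binary judge: a clear 1/0 per output.
BINARY_FACTUALITY_PROMPT = """\
Given the source document and the summary, answer with exactly one word,
"yes" or "no": is every claim in the summary supported by the document?

Document:
{document}

Summary:
{summary}
"""

# Subjective criterion -> pairwise judge: control vs. treatment, which is better.
PAIRWISE_PREFERENCE_PROMPT = """\
You will see two summaries of the same document. Answer "A" or "B":
which summary is more relevant and easier to read?

Document:
{document}

Summary A:
{summary_a}

Summary B:
{summary_b}
"""
```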
00:54:58.160 | okay i guess that's it i guess shreya isn't going to be able to have a chance to join us it's okay i think she's giving a seminar right now
00:55:05.440 | and just cannot join well no no i'm here i was here for five minutes sorry you took your time you
00:55:12.960 | took your time sure yeah sorry no i was at a seminar that ended at 12 50 i'm so sorry it was not click
00:55:20.160 | bait i did legitimately invite shreya uh i was hopeful she would be joining um i don't know so now that
00:55:25.760 | shreya the author of the paper is here anyone have any questions for her about the paper
00:55:31.360 | i think we all just asked you the questions earlier and you answered yeah actually there's one
00:55:40.960 | question uh that i see from venu ah shreya is there feedback on the information utility of specific
00:55:48.240 | labels for improving alignment we do not have it in the paper's prototype um but the v2 that will
00:55:58.240 | hopefully be pushed in the next couple of weeks it has to be before the conference um that does consider
00:56:03.760 | natural language feedback nice so essentially is it like chain of thought feedback and then
00:56:09.520 | the llm will use it to update the prompt or update the assertions if the user gives the feedback maybe
00:56:16.800 | on why something is bad just a few words and then that is included in the prompt for that validator
00:56:24.160 | for that task or sorry for that criterion oh essentially it's a few-shot example yes oh nice
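A minimal sketch of the mechanism Shreya describes, where a few words of user feedback on why an output was bad get folded back into the validator prompt for that criterion as a few-shot example; the structure here is an assumption for illustration, not the actual v2 implementation:

```python
from dataclasses import dataclass, field

@dataclass
class CriterionValidator:
    """Sketch: one LLM validator per criterion, with user feedback kept as few-shot examples."""
    criterion: str
    # (bad output, user's short note on why it fails the criterion)
    feedback_examples: list[tuple[str, str]] = field(default_factory=list)

    def add_feedback(self, output: str, note: str) -> None:
        # A few words from the user on why this output is bad for this criterion.
        self.feedback_examples.append((output, note))

    def build_prompt(self, candidate: str) -> str:
        # Fold the accumulated feedback back into the validator prompt as few-shot examples.
        shots = "\n\n".join(
            f"Output: {o}\nVerdict: fail\nReason: {note}"
            for o, note in self.feedback_examples
        )
        return (
            f"You are grading outputs on the criterion: {self.criterion}.\n\n"
            f"{shots}\n\n"
            f"Output: {candidate}\nVerdict:"
        )
```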
00:56:32.000 | that's great yeah yeah um any other questions shreya i have a quick question are you planning to publish
00:56:42.880 | this framework in some way as an open source project i didn't see if you had already done that or what's
00:56:48.160 | going on not yet it's not out yet but we will so the conference is in three weeks we have to have it
00:56:53.520 | out by then but it'll be implemented in chainforge.ai which is an open source llm pipeline building tool
00:57:01.120 | right okay awesome yep great thanks um shreya if i may um you had a tweet recently about good task
00:57:12.880 | decomposition and how it can 1000x uh the accuracy of an llm pipeline uh how do you think
00:57:19.600 | that interacts with uh an evaluation framework would you then have to evaluate each component
00:57:26.560 | separately uh or does it all tie in in some end-to-end kind of way or some hybrid of both
00:57:33.440 | great question we're exploring this as a research topic but the idea is to have each node in your
00:57:40.800 | graph kind of do a standalone thing that you can have standalone assertions for and if you
00:57:48.880 | think about you know infinitely many inputs flowing through your pipeline there's going to be some
00:57:53.040 | fraction of inputs that fail at each step then you can track error throughout the pipeline and do some
00:57:59.760 | form of causal analysis to see which node is the biggest reason for why the most
00:58:06.240 | downstream node failed so that's something that we're working on now but the idea is you should do
00:58:11.200 | decomposition in a way that like you can also validate the intermediates um they should be somewhat
00:58:17.440 | of a unit task so basically it would involve some bottleneck analysis of which nodes like you said
00:58:24.320 | are most responsible for the overall performance and focus on those okay got it yeah thank you
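A rough sketch of the decomposition idea just described: each node in the pipeline carries its own assertion, you record what fraction of inputs fail at each node, and then look at which upstream failures co-occur with the final node failing. The simple co-occurrence counting here is a placeholder for the causal analysis the authors describe, and all names are hypothetical:

```python
from collections import defaultdict

def run_pipeline(nodes, inputs):
    """nodes: list of (name, transform_fn, assertion_fn) executed in order.

    Records, per node, how often its assertion fails, and how often its
    failure co-occurs with the final node failing (a crude bottleneck
    signal, not the causal analysis described above).
    """
    failures = defaultdict(int)
    co_fail_with_last = defaultdict(int)
    last = nodes[-1][0]

    for x in inputs:
        failed_here = set()
        for name, transform, check in nodes:
            x = transform(x)
            if not check(x):
                failures[name] += 1
                failed_here.add(name)
        if last in failed_here:
            for name in failed_here - {last}:
                co_fail_with_last[name] += 1

    n = max(len(inputs), 1)
    return {
        name: {
            "fail_rate": failures[name] / n,
            "co_fails_with_final": co_fail_with_last[name],
        }
        for name, _, _ in nodes
    }
```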
00:58:29.920 | all right i think that's the hour uh we're a little bit uh slightly distracted in the discord chat with uh
00:58:45.280 | mira murati getting well she says she resigned but we don't know
00:58:45.280 | um not much speculation we can do uh yeah i think that's it thanks wait what's happening next week
00:58:54.640 | uh yeah we have uh people signing up to do papers i think sam signed up to do a paper he didn't set a date
00:59:01.840 | yay sam we'd really like to dive into function calling and the berkeley team
00:59:07.440 | i guess we're just mainlining all the berkeley phds uh the berkeley team put out v3 if you're interested sam
00:59:14.400 | yeah i can try you got this you got this it'll be my first one
00:59:20.800 | the eval v3 what uh i missed it
00:59:24.400 | so you want me to do the gorilla paper and then the v3 of the bfcl or just the v3 yeah i actually think
00:59:35.760 | the gorilla paper is not that important anymore uh but they just don't have papers for v2 and v3 of
00:59:41.280 | the function calling leaderboard which they put out in the last two months and this is like the
00:59:46.240 | industry standard now for function calling uh which i i've been meaning to dive into but if you want to
00:59:52.480 | take it up next week it's yours yeah all right yeah just uh i guess as a hint if you drop by either
00:59:58.800 | the gorilla server or the mem gpt server both uh charlie and shishir are generally like usually
01:00:04.400 | available in there if you have questions berkeley should just have one server because there's
01:00:09.520 | too many things they should right like just consolidate it into one anyway we're actually
01:00:14.720 | talking to lmsys uh next week as well um so we're trying to make a berkeley day um to have
01:00:21.680 | like vllm like lmsys all these people hell yeah i don't know if we'll actually do it but whatever
01:00:28.880 | okay thank you that was amazing
01:00:32.960 | i guess thank you eugene uh thank you everyone yeah and
01:00:42.240 | i'm glad that we have a new paper next week yeah okay thanks thanks to six century for all the feedback
01:00:48.240 | on the initial prototypes see everyone see ya