
[Paper Club] Who Validates the Validators? Aligning LLM-Judges with Humans (w/ Eugene Yan)


Transcript

The question was whether we record the whole thing or just the presentation, but the presentation often extends into the whole thing. If you want private questions and answers, you can write them in the chat; they won't show up in the YouTube recording, or we can take them after we stop recording. People have also asked whether we can save the chat transcripts. For the Writer episode I saved the chat transcript, and there are a lot of good questions in there, so if people are interested in us posting them, that's good to know. I've been saving them sometimes, but they haven't been shared. And a general note for everyone who joins: there are always great questions in chat, so it's always open. That's what was nice about hosting this on Discord: all the chat just stays in Discord. I have it pulled up to the side, but if I miss anything, please just raise your hand and ask me to stop.

Okay, cool. So we're going to talk about this paper, "Who Validates the Validators?" I don't have any slides, as per normal, but there are a few key ideas in this paper that I want to go through quickly, hopefully in about 30 to 45 minutes. Give me a second to see if I can share my entire desktop. Then I also want to share an app that I built that tries to show all of this, with an evaluation mode, a labeling mode, and an optimization mode. I had given Shreya a sneak preview of it last time, so we'll see.

Okay, on to "Who Validates the Validators." What is this paper about? Essentially, everyone is using LLM evaluators, but the question is how we are actually going to evaluate the evaluators themselves. I think the way to do this is to label some data and then measure how well the evaluator matches that labeled data, and you'll see that that's how they approach it as well. Just to confirm, you can see my screen, right? You can see the yellow and blue highlights? Okay. I'm showing the whole desktop, so we'll go through a couple of key ideas here, and they're scattered throughout in blue.

The first observation that Shreya and her co-authors make is that to grade outputs we need some evaluation criteria, but the process of grading outputs is what helps us define those criteria. It's a chicken-and-egg problem. Across many tasks I've seen that trying to create criteria without actually having seen your data leads to criteria that are very divorced from reality. People will say, "I don't want any toxicity, I don't want any spelling errors," and so on, but those may not actually be the problems. Or they'll say, "I want it to speak the way you'd speak to a friend," but when you actually try it and look at the data, you realize the model cannot really produce that kind of text. It can't reproduce certain slang, for example genre terms like "romantasy," which is a mix of romance and fantasy, or "grumpy sunshine," or "reverse harem." Those aren't really part of its vocabulary, and it's very hard to prompt it into writing that kind of data.
So that leads to undue expectations, expectations that simply cannot be met. What they're saying is that it is impossible to completely determine good evaluation criteria without actually looking at the outputs. Essentially, the point is that you have to look at the data before you can come up with evaluation criteria. This isn't the main part of the paper, but it was a very key point, and I think it's actually one of the biggest aspects of the paper.

So here's the typical evaluation pipeline, the one right on top: we start with prompting, we get outputs, and then we run them through some kind of LLM evaluator without really checking it. (Oh, Vibhu, you're not muted. Thanks, I took care of it.) Okay, so the top row is how we do it right now, without actually checking the LLM evaluator. What they propose instead is that we put the outputs through an LLM evaluator with some candidate criteria, and a human also grades the outputs as yes or no. Then we check how well the candidate criteria align with the human-graded outputs, and we iterate on that process. Only once it's aligned enough (that's why they have an alignment report card) do we consider the LLM evaluator aligned and usable, both in development and in production. So users iterate through the process of refining criteria and grading, so that the LLM evaluator gets aligned with their criteria.

I think there are three main problems when trying to use an LLM evaluator. The first, which comes at the end: given the ideal prompt, can an LLM evaluator achieve a sufficient level of accuracy? I believe the answer is yes. The second: how do we get humans to label a hundred to four hundred samples of data with sufficiently high quality? I think that's also solved; essentially you motivate and incentivize them the right way. But the one in the middle is this: now that you have these hundred to four hundred golden samples, how do you come up with a prompt, or instructions, or few-shot samples, that can actually achieve that ideal prompt that aligns well? That part in the middle is, at least for me, somewhat of an unsolved problem, and we'll see how they try to solve it.

So, EvalGen's design: how can we assist developers in creating evaluators to grade the LLM outputs? Instead of having them rely on vibe checks, how can we help them initialize the LLM evaluator once and then move on? They have a workflow diagram here, and I think it's useful to talk through it. I had the chance to play with this live, and it's very nice; a lot of it is what inspired my own prototype. At the start, in Box A, there's a prompt node where you're writing a prompt. These are tweets, and you'll be doing named entity recognition on them, so this is the generation prompt.
Still in Box A, the multi-evaluator node holds the evaluation prompt. To generate an evaluation prompt, they give you three ways: you can infer the criteria from your prompt, you can write criteria manually, or you can grade some responses first. After you've done that, it generates the criteria itself; essentially you're getting GPT-4 to propose some criteria, and each criterion can take the form of code or of an LLM prompt. So there are two kinds of criteria here: code and LLM prompts. You can imagine something like "no hashtags" being just a regex search for hashtags, whereas "no made-up entities" is probably something you need to rely on an LLM for. That's Box C, creating criteria.

Then, as you're creating criteria and running them through the LLM, you have the option to grade some of the sample data, saying whether an output is good or bad. Unfortunately "good or bad" is very broad; it's not clear enough, I think, and the authors do mention in their lessons learned that it helps to be more specific.

Finally, in Box E, you can see how well the LLM is aligned. I think Box E is very useful. They use the terms coverage and false failure rate. To simplify: coverage really means recall, that is, how many of the bad responses we actually catch. And false failure rate is essentially one minus precision. It took me a long time to understand this, and I had several chats with Shreya to clarify what it was indicating: a false failure means we failed an output that should not have been failed. Essentially, how much is being wasted, how many innocents are we excluding? So it's one minus precision.
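To make those two metrics concrete, here is a minimal sketch of how coverage and false failure rate could be computed from human labels and evaluator verdicts, following the recall / one-minus-precision framing above. The variable names are illustrative, not from the paper.

```python
def coverage_and_ffr(human_is_bad: list[bool], evaluator_flagged: list[bool]):
    """Coverage = share of truly bad outputs the evaluator flags (recall on 'bad').
    False failure rate = share of flagged outputs that were actually fine (1 - precision)."""
    flagged_and_bad = sum(h and e for h, e in zip(human_is_bad, evaluator_flagged))
    flagged_but_good = sum((not h) and e for h, e in zip(human_is_bad, evaluator_flagged))
    total_bad = sum(human_is_bad)
    total_flagged = flagged_and_bad + flagged_but_good

    coverage = flagged_and_bad / total_bad if total_bad else 0.0
    false_failure_rate = flagged_but_good / total_flagged if total_flagged else 0.0
    return coverage, false_failure_rate

# Toy example: 3 bad outputs; the evaluator catches 2 of them and wrongly flags 1 good one.
human_is_bad      = [True, True, True, False, False, False]
evaluator_flagged = [True, True, False, True, False, False]
print(coverage_and_ffr(human_is_bad, evaluator_flagged))  # (0.667, 0.333)
```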
So that's the entire workflow, the entire UX they came up with: the LLM helps you create criteria in Boxes B and C, you grade some responses in Box D, and eventually we check how well they align in Box E. If it's aligned enough, that's it, we have our LLM evaluator. I don't need to go through all the text; it's all indicated in the diagram. Of course, the main thing is that users can edit everything and are also asked to create their own criteria. I'll pause here to check if there are any questions. Not much in chat? Okay, no questions, then I'll proceed.

So now we know how the EvalGen workflow looks. Let's see how well it works and what developers say, because they also ran a small study with developers. How is it done? Criteria suggestion: GPT-4 proposes various criteria. Candidate assertions: as the criteria are created, they're implemented either as code or as an LLM prompt that asserts the criterion and returns scores. After that, they sample some of the outputs to get users to give binary feedback. I think this is really interesting: you try to sample the rows that make the biggest difference. I couldn't quite figure out how they do this in a smart way, beyond just going through the entire list; maybe Shreya can answer that if she joins. They also talk about selectivity of candidate assertions, which essentially means they select the assertions with a higher rate of passing, and they have confidence scores and so on.

Here's how it compares to a previous paper of Shreya's called SPADE. The idea of SPADE, in a nutshell, is that you version-control your prompts, and the assumption is that every time there's a prompt change, it's usually to add an extra instruction to prevent a failure case. So by version-controlling the prompts and looking at the prompt changes, you can create test cases, and out of those you can create an evaluator. That's what they compare against.

They have two datasets. One is a medical pipeline: transcriptions of doctor-patient calls, where the goal is to extract specific information without revealing any PII. The other is an e-commerce dataset of a hundred Amazon products, where the goal is to write SEO-optimized product descriptions without mentioning the negative reviews. So, medical and e-commerce. What I like about these datasets is that they're fairly balanced: essentially 70 and 50 percent of the outputs were good, so if you flip it, the defect rate is 30 and 50 percent, and the defects are what we really want to catch.

Let's focus on the results. For the medical pipeline, recall was 33 percent. That's a bit on the lower side; I don't think they did a lot of work to optimize it and improve recall, but right off the bat recall is 33 percent, which is equivalent to SPADE, and precision is 90 percent (remember, false failure rate is one minus precision), which is pretty good. So you might ask, what's the difference between EvalGen and SPADE? The difference is that EvalGen only needed three assertions, whether code assertions or LLM prompts, versus the five SPADE needed, to get comparable results.

For the product pipeline, EvalGen only required four assertions, half of what SPADE needed: four different criteria, each either an LLM prompt or a code check. It achieved pretty good recall, close to 75 percent. In production you ideally want to get to 80; 80 to 90 percent is really hard to achieve in production, but this is pretty good. What's not so good is the precision: precision here is 61 percent (again, false failure rate is one minus precision). Long story short, EvalGen achieves similar or better recall and precision than SPADE with fewer assertions, but there's still a ways to go before you'd use it in production as-is.

Still, this result is pretty good, because remember what's happening here:
the LLM is only seeing the prompt, either inferring criteria from it or taking criteria the user typed in manually. The user is not writing the code, and the user is not writing the evaluator prompt. Just by getting the user to label some data, maybe 16 to 20 examples in those two datasets, we were able to come up with the evaluator prompts, just like that. Imagine a user who is a product manager or an SDE or software engineer who doesn't know how to write prompts, doesn't know about few-shot prompting, doesn't care about chain of thought, and they shouldn't have to. All they should do is provide their criteria and ask, can you get it aligned? I think that's a pretty good result. Any questions before I go into the user study?

Question: when you talk about these prompts, are they iterative prompts? When you say getting it aligned, you mean exchanges with the model; this isn't fine-tuning the model, right?

No, it's not fine-tuning the model. What I mean by getting it aligned is this: look at Box D. The user is just saying whether an output is good or bad. There could be many reasons why it's good or bad: too many hashtags, not using bullet points, made-up entities. The model doesn't know which, and in this study they simplified it to just good or bad. So there could be many different criteria, many different evaluation prompts. In Box C, the LLM comes up with many possible criteria, trying to guess what the user needs, based either on the prompt or on some free text the user typed, and turns them into evaluators. Then we run those evaluators and see how well they align with the thumbs-up or thumbs-down data the user has provided. (I'll have a demo of this in a bit, where we can see the alignment.) On the back end there's an algorithm that computes which of these evaluators should be used or combined, and which should be dropped, to get the highest correlation with the human-provided good/bad labels. Does that make sense? And of course, if the correlation is bad, you could rewrite the prompt in a different way to find a better correlation. That's what they mean by alignment: aligning to the user-provided labels of good and bad. (Got it, thank you.)

So essentially, this is what they really care about: high coverage of bad responses and a low false failure rate.
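I don't know the exact selection algorithm they run on the back end, but conceptually it is something like this: run every candidate assertion over the graded samples, then pick the subset whose combined verdict best aligns with the human thumbs-up/down. A rough sketch of one way to do that, brute-forcing small subsets and maximizing coverage while capping false failures. This is my own illustration, not the paper's algorithm, and all names are made up.

```python
from itertools import combinations

def align_score(human_bad, flagged, max_ffr=0.25):
    bad_caught = sum(h and f for h, f in zip(human_bad, flagged))
    good_flagged = sum((not h) and f for h, f in zip(human_bad, flagged))
    total_bad, total_flagged = sum(human_bad), bad_caught + good_flagged
    coverage = bad_caught / total_bad if total_bad else 0.0
    ffr = good_flagged / total_flagged if total_flagged else 0.0
    return coverage if ffr <= max_ffr else -1.0  # reject subsets that fail too many good outputs

def select_assertions(human_bad, assertion_results, max_subset=3):
    """assertion_results: {name: [True if that assertion flags row i as failing]}.
    Try small subsets and keep the one whose OR-ed verdict aligns best with human labels."""
    names, best = list(assertion_results), (None, -1.0)
    for k in range(1, max_subset + 1):
        for subset in combinations(names, k):
            flagged = [any(assertion_results[n][i] for n in subset)
                       for i in range(len(human_bad))]
            score = align_score(human_bad, flagged)
            if score > best[1]:
                best = (subset, score)
    return best

# Toy example with three candidate assertions over five graded rows.
human_bad = [True, True, False, False, True]
results = {
    "no_hashtags":      [True, False, False, False, True],
    "single_sentence":  [False, True, True, False, False],
    "no_made_up_names": [True, True, False, False, True],
}
print(select_assertions(human_bad, results))  # (('no_made_up_names',), 1.0)
```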
What's really interesting is that they also ran a user study with nine industry practitioners. I was one of them, so I had a chance to play with this, and I was fairly impressed. The prompt is named entity extraction: extract up to three well-known entities from a tweet, write a single short sentence describing each entity, and return the answer as a bulleted markdown list. So there are quite a lot of instructions: extract entities, write a one-sentence description for each, return it as a markdown list, and don't extract hashtags. You can imagine this being split into four different criteria, and some of them can be handled by code, such as "no hashtags" or "is a bulleted list," while some of them, like "do not make up any entities," probably need an LLM.

What we found is that participants considered EvalGen a great starting point: it makes it fun to label data and get immediate feedback on how your labeled data is helping to align the LLM evaluator. And that feedback comes in minutes, or at most an hour, instead of days or weeks. If you're not labeling the data yourself and you're using something like Scale AI, the data might not come back quickly, and that iteration loop really kills your product launches or feature development, especially when you're trying to build an LLM-powered customer experience.

Here's what participants usually did. There were a hundred LLM outputs; they quickly scanned through to make sure things looked right and then clicked start. Six participants just auto-generated criteria, one participant wrote the criteria themselves, and two participants actually wanted to look at the data first, which I think is great. After getting criteria suggestions from EvalGen, participants really wanted to play with the criteria and push the boundaries: they removed some suggestions and added criteria of their own to see how far EvalGen could go.

So, EvalGen as a starting point: what did participants say? First, it's really hard to come up with criteria for what makes a good summary, a good translation, or a good extraction. Prompting an LLM to suggest criteria (you could probably do this in the ChatGPT or Claude UI as well) was helpful for getting that first draft; nothing outstanding, but useful. Then, participants who started grading before creating criteria said the grading process itself was helpful, and we all know this: by looking at the data first, you understand it, you get a feel for it, and that helps you write better criteria. That's the key point I highlighted in blue. One participant even said you should enforce looking at at least 20 samples first, and that enforcement is something I'll show in my demo later. Also, while the LLM is doing the grading, especially with chain of thought, it takes some time; you could go get a coffee, but a great use of that time is for participants to provide labels themselves. Participants said they were happy to grade while waiting.

Here's how the output looks: you can see the full text of the tweet and the response. Take the second one, about the fire department and a tour of NYC: it extracted the fire department, NYC, and FDNY. Interesting, I wasn't aware of that one. And you can see we have four evaluators here: whether it's a bulleted list, whether there are no made-up entities, whether there are no hashtags, and whether each description is a single sentence. Two of the outputs actually fail the single-sentence criterion because they're not a single sentence. All of this data is computed on the back end to create the alignment scorecard. Participants liked looking at the coverage and false failure rate; it made it easy to see which criteria were actually working.
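To make the code-versus-LLM distinction concrete: the first three criteria in that example can be checked with plain Python, while "no made-up entities" needs a judge call. A minimal sketch, with the LLM call stubbed out; the prompt wording and function names are mine, not EvalGen's.

```python
import re

def no_hashtags(response: str) -> bool:
    # Code-based assertion: fail if any hashtag appears anywhere in the response.
    return re.search(r"#\w+", response) is None

def is_bulleted_list(response: str) -> bool:
    # Code-based assertion: every non-empty line should start with a markdown bullet.
    lines = [l for l in response.splitlines() if l.strip()]
    return bool(lines) and all(l.lstrip().startswith(("-", "*")) for l in lines)

def single_sentence_per_bullet(response: str) -> bool:
    # Code-based assertion: crude check that each bullet holds a single sentence.
    bullets = [l.strip().lstrip("-* ") for l in response.splitlines() if l.strip()]
    return all(b.count(".") <= 1 for b in bullets)

def no_made_up_entities(tweet: str, response: str, call_llm) -> bool:
    # LLM-based assertion: only a judge can tell whether an entity was invented.
    prompt = (
        "Here is a tweet and a list of entities extracted from it.\n"
        f"Tweet: {tweet}\nExtraction: {response}\n"
        "Does the extraction mention any entity not supported by the tweet? Answer YES or NO."
    )
    return call_llm(prompt).strip().upper().startswith("NO")

# Quick check with a stubbed judge so the sketch runs end to end.
print(no_hashtags("- FDNY: the New York City Fire Department."))           # True
print(no_made_up_entities("Great tour with the FDNY today!",
                          "- FDNY: the New York City Fire Department.",
                          call_llm=lambda p: "NO"))                        # True
```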
And here's what I think is the meat of it: alignment is iterative. The main problem, and I think this is right, is that most people don't have strong confidence in an LLM evaluator because they themselves are not confident in the criteria. This isn't explicitly vocalized, but most of the time they're not confident in the criteria because they don't know whether the criteria are actually helping, and the reason for that is that they haven't actually looked at the data. Sometimes the people providing criteria are PMs or business leaders who may not have had a chance to look at the data; maybe the data is too complex, or it's just not in an easily consumable format, so they don't know whether the criteria work or not. But I've found that getting people to look at the data, even just 10 or 20 samples, picked to be representative with both defects and non-defects, helps them understand it a lot better.

If you think of data in general, there's quantitative data, where you aggregate everything into one or two metrics that give you a very broad view, and then there are anecdotes, where you actually look at specific samples. Jeff Bezos has a quote along the lines of: when the data and the anecdotes disagree, I tend to trust the anecdotes. That's what this is suggesting: look at some of the data samples, look at some of the anecdotes.

So again, here's the chicken-and-egg problem: we need to think through criteria and write them down in order to grade outputs, but when we grade outputs we also figure out new criteria. It's the same as writing: when you start writing, you probably only know half of what you'll eventually finish with, and as you go you dig the rabbit hole deeper and deeper and find more things. That's what this paper proposes: a feedback loop of looking at data, creating criteria, looking at more data, and creating more new criteria. That's very helpful.

One interesting thing, though: one participant mentioned that they gave an output a bad grade not because the output was bad, but because they wanted to be consistent with their previous grades. Previously they had marked something similar as bad, and even though in hindsight that earlier issue was really just a minor paper cut and not as critical as other, more egregious problems, they continued to be consistent with their previous grades instead of updating them. I think this is a very difficult problem to solve, in the sense that you want to be internally consistent with what you have done in the past.
But what that means is that this perpetuates the bad data and the bad labels. So there's a point here I wrote down to remind people of: my design decision is that if you find your criteria have drifted, instead of trying to maintain the same criteria and align to your previous grades, you should revisit those previous grades and fix them, because it's an iterative process.

What constitutes alignment? There was a question about that, and it is quite subjective. When it's really just binary, yes or no, good or bad, it's a bit easier to measure: it's really just recall and precision, or Cohen's kappa as a correlation metric. But sometimes it's just very difficult. There was also a slight disconnect: participants provided grades, but the grades seemed to have little impact on their chosen assertions. I think it's just not clear how the grading actually feeds into the optimization, and during that demo there wasn't any optimization or creation of new evaluation prompts, which may be why the disconnect was there.

The other thing is that users really want control over both the evaluation type and the prompts themselves. We've seen this too: a lot of the time you come up with a new prompt that works well on the eval, but for better or worse people just want to include that one extra sentence to make sure a particular defect doesn't happen, even when that defect doesn't actually occur, and when you add it, sometimes the evals perform worse. Unfortunately, these are sometimes just policy decisions that are a little disconnected from the data.

Finally, they found that as developers iterate over the criteria and engage with the LLM outputs, it helps them refine their criteria. So the point is to tighten the feedback loop for creating the LLM evaluator, and once the evaluator is well aligned, we can just use it in production. That said, some participants were quite skeptical of how LLM-based assertions might be helpful for monitoring LLM outputs in a production pipeline. A year ago I would have been in that camp too; now I'm partially in it and partially not. I'm beginning to see that it can be viable if the throughput and latency requirements are not too demanding. Participant 8 said, "I cannot begin to think about how LLMs as validators in production can work; I'm very skeptical."

So that's all I wanted to focus on; it's the workflow here that I think is very valuable. They also list future work. One thing they didn't implement here is that users want their prompts to automatically improve based on evaluation results. That would be something like DSPy, which we saw in some demos, to automatically improve our evaluation prompts. So yeah, that's all I had to share. Any questions?
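On that future-work point: the simplest version of "automatically improve the prompt based on evaluation results" is just a search over candidate evaluator prompts, each scored against the labeled data. A toy sketch of that general idea follows; it assumes you already have some `judge(prompt, row)` function returning 0/1, and it is my own illustration of the DSPy-style loop, not anything from the paper or from DSPy itself.

```python
def score_prompt(prompt, rows, labels, judge):
    """Fraction of labeled rows where the evaluator's verdict matches the human label."""
    preds = [judge(prompt, row) for row in rows]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def pick_best_prompt(candidate_prompts, rows, labels, judge):
    # "Hyper-prompt tuning": evaluate every candidate against the golden set, keep the winner.
    scored = [(score_prompt(p, rows, labels, judge), p) for p in candidate_prompts]
    return max(scored)  # (best accuracy, best prompt)

# Hypothetical usage with a stubbed judge so the sketch runs end to end.
rows = [{"input": "article...", "output": "summary..."}] * 4
labels = [1, 1, 0, 1]
def fake_judge(prompt, row):  # stand-in for a real LLM call
    return 1 if "consistent" in prompt else 0
prompts = ["Is the summary consistent with the article? Return 1 or 0.",
           "Grade the summary from 1-5."]
print(pick_best_prompt(prompts, rows, labels, fake_judge))
```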
Let me go back to the screen. Oh, there are a few questions. I'll start from the one at the bottom: is there feedback on the information utility of specific labels for improving alignment? That's not clear to me; I don't know, and I don't know if Shreya has joined yet. She may not have been able to. I think this is a question I'll hold for Shreya. Let me check Twitter to see if she responded; I did share the Zoom link.

"Eugene, you said you made a prototype based on that diagram earlier. Can you talk more about the prototype, if no one else has questions on the paper?"

Sure, I can share more about the prototype. Here we go, screen share. Here's a very small, simple prototype. The question I had for myself was: how can I help users label data, evaluate data, and improve their evaluation prompts more easily? So here's a small toy I made, called Label: LLM assistance for better evaluation labels. To get started, you upload a CSV file with what I think are fairly reasonable constraints: id, input, output, and a label, where the label can be empty. So I'm going to upload a CSV file; this is the factual-inconsistency dataset. Let me just start from scratch and upload.

Here's how it looks after uploading: there's the input, there's the output. The goal is to make it easy for users to read, and we can enter labels. Right now we're in labeling mode; to unlock evaluation mode you have to label at least this many rows, and I try to provide feedback on that. Essentially I'm trying to gamify this to help users get to the next level. But I have a dataset that already has some rows labeled. Wait a minute, have I already labeled that much data? Maybe my dataset wasn't prepared correctly. Anyway, this is now evaluation mode: after we've labeled 20 out of 20, it shows that evaluation mode has been unlocked and says "scroll to the top to add your prompt." Once evaluation mode is unlocked, you can enter the prompt and use it to evaluate.

In evaluation mode you can just hit evaluate and it will evaluate all the data. You can see results coming in piece by piece. What's happening is that it's running in the back end (big thanks to swyx for teaching me the right way to do this), storing everything in the database, while we also poll it once a second to show the results. In the top-left corner there are metrics: recall, precision, and Cohen's kappa, plus true positives, true negatives, false positives, and false negatives. This will take some time as it evaluates the data, and along the way we can see how well it aligns. When it's aligned, the prediction matches your label, which is great; when it's not aligned, the prediction doesn't match the label and the row is highlighted in red, and you can look through those rows.
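Recall and precision were covered above (they're just coverage and false failure rate seen from the other side); the extra metric on this panel is Cohen's kappa, which corrects raw agreement for the agreement you would expect by chance. A small sketch of how the panel's numbers could be derived from the confusion counts; this is my own code, not the app's.

```python
def scorecard(labels, preds):
    """labels/preds are 0/1; treat 1 as the class you care about catching."""
    tp = sum(y == 1 and p == 1 for y, p in zip(labels, preds))
    tn = sum(y == 0 and p == 0 for y, p in zip(labels, preds))
    fp = sum(y == 0 and p == 1 for y, p in zip(labels, preds))
    fn = sum(y == 1 and p == 0 for y, p in zip(labels, preds))
    n = tp + tn + fp + fn

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    # Cohen's kappa: observed agreement versus agreement expected by chance.
    p_obs = (tp + tn) / n
    p_exp = ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n)
    kappa = (p_obs - p_exp) / (1 - p_exp) if p_exp < 1 else 1.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "recall": recall, "precision": precision, "kappa": kappa}

print(scorecard(labels=[1, 1, 0, 0, 1, 0], preds=[1, 0, 0, 1, 1, 0]))
```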
You can look at the chain of thought, try to understand why it's giving this prediction based on the explanation, and you can see the explanation right here. And as we scroll down — oh wow, this is pretty good — you can see most of it is largely aligned. What the evaluator is doing is just a chain of thought followed by the final label, and it goes through all 50 rows.

Again, right now we're in evaluation mode, where the LLM is doing the evaluating. To unlock optimization mode you have to label at least 50 rows and you have to run evaluation mode once; essentially we need a baseline for how well this basic evaluation prompt works. And you can see the prompt is actually very short: just evaluate whether the output summary is consistent with the input article; if it's consistent return 1, if it's inconsistent return 0. Now we're at row 38... oh, it seems to have broken. No, it's still running, just taking a bit of time... row 41. Once we have at least 50 rows (I actually think the right number should be more like 100 to 400 if you want to optimize against it), and after we've run evaluation mode once, we can run optimization mode.

There we go: now that we've labeled 50 and evaluated all of them, it shows that optimization mode is unlocked, so we can scroll to the top and click "optimize evaluation prompt." What optimization mode does is, on the back end, try to figure out how to optimize the evaluation prompt. Unfortunately, clicking this button doesn't do anything yet; I only implemented it yesterday. But with Next.js it's easy to make it something fun and funky, and I'm sure swyx would do something way better. The next step is that clicking this button will run the optimization in the back end, and we'll see a different page with maybe ten rows and all the different optimization scores. It's really just hyperparameter tuning; you can think of it as hyper-prompt tuning. So yeah, that's my short demo. Any questions or feedback would be nice.
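For reference, the evaluation call in a setup like this can be as simple as the short consistency prompt just described, plus a chain of thought, with the final 0/1 parsed off the last line. A minimal sketch with the model call stubbed out; the prompt wording follows the demo's description, and everything else is my own scaffolding rather than Eugene's actual code.

```python
JUDGE_PROMPT = """Evaluate whether the output summary is factually consistent with the input article.
Think step by step, then on the last line answer with a single digit: 1 if consistent, 0 if inconsistent.

Article:
{input}

Summary:
{output}"""

def call_llm(prompt: str) -> str:
    # Placeholder for a real API call (the demo calls Anthropic directly).
    return "The summary introduces a claim the article never makes.\n0"

def judge_row(row: dict) -> tuple[int, str]:
    raw = call_llm(JUDGE_PROMPT.format(input=row["input"], output=row["output"]))
    explanation, _, last = raw.rpartition("\n")
    label = 1 if last.strip() == "1" else 0   # anything unexpected defaults to "inconsistent"
    return label, explanation

print(judge_row({"input": "An article...", "output": "A summary..."}))
```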
"That's fire. So of all the tested frameworks, you ended up going with Next.js?"

How do you know this is Next.js? Oh, you mentioned it a second ago. Yeah, this is Next.js. It was really easy to build with Cursor and v0.dev, so I went with it. The back end is going to be FastAPI — I believe it will be FastAPI, or I could try to do it all in TypeScript and Next.js, I don't know — but I think the prompt-optimization back end will just be FastAPI, and this Next.js app will call it. The labeled dataset is all in a DynamoDB — or rather, it's just in a SQLite DB — and the FastAPI back end will also have access to that, do the optimization, and compute the scores.

"Sorry, go ahead." "No, I was just going to say that was really cool. I'm now trying to tie it back into the validators paper we just read and understand the implications for that diagram at the beginning."

Exactly, exactly. That's great, thank you, Flo. What I was trying to do is this green section over here. I'm trying to force people — you can see it's very specific — I really think you should be labeling at least 10 or 20 samples. I'm not trying to create candidate criteria, and I'm not trying to get the LLM to create initial criteria; I need a human to provide the criteria, and it could just be bullet points. You can assume someone who doesn't know anything about prompting is going to paste in bullet points, or whatever SOP they have, and we'll just get the LLM to run. Then we check the alignment report card, which is the thing at the top. It will probably get a separate screen of its own eventually, but for now it's a floating panel. I think it's fun to see; it gives instantaneous feedback that, hey, this thing is doing some work, your labeling is valuable, and therefore I want you to label more data.

And every step of the way I try to give you milestones. If you want to unlock evaluation mode, you have to label at least 20 rows, so try to hit that first milestone. After you've labeled 20 rows you can run evaluation mode, and if you want to unlock optimization mode you have to label another 30 rows — that's the next milestone — and after that you can run optimization mode. For this demo the numbers were low; I think 20 and 50 is a bit too low and maybe it should be 25 and 100, but I'm still playing with that. But yeah, that's it: this green box is the loop I'm trying to improve and make tighter.

No more questions, I guess. It's a fairly easy and straightforward paper. Oh, actually there's a lot of stuff in the chat. "Is the code open?" It's not open yet, because it's very ugly. This code isn't open, but the previous one on evaluation frameworks is, so you can look at that. I still have quite a bit of work to do to make this one a little cleaner. "You should experiment with the Writer framework to see how it handles front end and back end for you." Yeah, kind of like FastHTML. I know you pinged me about that, Sam; I haven't had a chance to look at it. Before building this, I thought, okay, maybe let's dig into Next.js or TypeScript frameworks again. Long story short, I went through FastHTML, Next.js, SvelteKit, and even a Frankenstein of SvelteKit and Svelte with FastAPI on the back end. That was not good; I really had to hack a lot of things together and I don't like how it works, but I think this is probably the best of both worlds. "The AI SDK might be fun to try." I think I did use the Vercel AI SDK for this, but I can't confirm; let me take a look right now. Well, here's the ugly code.
In the evaluate route I'm just calling Anthropic directly. I think I initially used the AI SDK from Vercel, but what was useful there was streaming, and I didn't have streaming, so I simplified it and just deleted stuff. (Reading chat: "the law paper is leadership principles.")

"Eugene, if I may: in your experience, were all or at least most tasks distillable down to a binary prediction, or is that just the ideal goal?"

I think most tasks can be. If you think about summarization, it's really relevance, factuality, comprehensiveness. Or maybe you can give me an example of something you don't think can be distilled down.

"Off the top of my head — I don't have a specific use case in mind — maybe multi-class classification? Though I guess that can be distilled down to a pairwise, one-versus-one framework."

When you think about multi-class, are you thinking about whether an output meets or doesn't meet policies for toxicity and bias, or multiple-choice questions where multiple answers are required?

"Yeah. But I guess that's not the task of an LLM evaluator; that's the task of an LLM classifier."

Right. Here I'm targeting specifically the ability of an LLM to evaluate whether something is good or bad. You're right that multi-class tasks — say, given a product description and title, is it FMCG, is it clothing, is it furniture — are completely different; that would be a separate thing. Right now I'm really just focusing on evaluating: is this classification correct or wrong? Simple, binary.

"But the goal here is to have a solid justification for using an LLM as an evaluator?"

No, not really. I know that we have to use an LLM as an evaluator; there's no way around it if we want to scale. I think that's the only way. The question is how I can tighten the alignment, tighten the feedback loop, to make it easier to use LLMs as evaluators. I'm taking a lot of inspiration from tightening that loop in order to scale to more use cases faster. You can imagine that an AI engineer might look at the data, understand the criteria, talk to PMs, look at the user experience, then write the prompts and do all the evaluation themselves. But now imagine we don't have that AI engineer, or that AI engineer is bottlenecked, and we want to use AI everywhere: LLMs doing classification, summarization, translation, extraction, everywhere. At some point there will be a certain level of risk tolerance. Simple things like extraction and classification on the back end probably don't need an evaluator, but other things, like summarizing a financial document or extracting from a medical document, probably do. Essentially, the key thing is that I'm trying to force people to look at the data.

"Oh my god, you should just post this on Twitter and tag Hamel and me, and who knows — I mean, this is brilliant."
"Also, it might not have been your main intent, but I think it makes for a good framework to answer skeptics about LLM evaluators. It could show that there's a systematic way of ensuring at least some robustness, or at least some reliability, in LLM evaluation before actually deploying it in production."

Exactly. That's what we're trying to do with the metrics here. We can say: you gave me — again, I wouldn't use just 50 — you gave me 100 to 400 rows of data, and I'm able to do well on that. And that's no different from machine learning: instead of using the data to train a model, we're using the data to fine-tune a prompt, and that prompt, coupled with the LLM you're using, is now the artifact that you optimize against the labeled data. It always starts with having labeled data. So I'm trying to force people to label data instead of just doing vibe checks. When you're doing vibe checks, you can also take the extra step of labeling the data, and with that labeled data I don't need to bother you for vibe checks anymore; you give me a little bit and I try to optimize against it. "Completely agree, thank you." You're welcome.

Nicolay asks: have you played around with using re-rankers? What do you mean, Nicolay?

"I talked to the folks from Mixedbread today, and they mentioned they use re-rankers a lot in evaluations, and also at test time, for example to evaluate something like factuality, training the re-ranker with a small dataset."

Sorry, I confess I was trying to ping Shreya and I can't multitask. You're using re-rankers to evaluate — what does that mean?

"Basically, as opposed to training a classifier to predict the label, you use a re-ranker. Factuality is an obvious example: you use the scores plus a threshold, which apparently works better than classifiers in their cases."

Okay, just to make sure I understand: you're using the re-ranker to rank poorer things lower?

"No. The re-ranker is trained on pairs, like in factuality checks where you have NLI models: you have a hypothesis and you have the ground truth, and you try to determine whether the hypothesis and the ground truth align. You can use that, with the label, to train the re-ranker, and then set a threshold: above the re-ranker's score it's class one, passing; below, zero, not passing."

I see. So what you're saying is we have some references, some golden samples, and we train the re-ranker to rank the golden samples higher. That could work; if you have a threshold, that could work. What's sometimes hard for me to understand is that if you're using a re-ranker and you pass in a pairwise preference, it's actually returning a score; it's not really a pairwise preference, and that was tripping me up. If your re-ranker is returning a score, you can also think of it as a classifier. Previously, when I tried to do this, several folks, including some from the labs, suggested doing pairwise preferences.
The reason pairwise preferences don't work here is that if you give it two outputs that are both factual, or two that are both non-factual, it will still say that one is better than the other, even though the better one may not meet the bar of being factual enough. But in your case, the re-ranker case you mentioned, there's a score involved, so you use the score to cut a threshold, and that would work. To directly answer the question, though: I haven't played around with using re-rankers as opposed to a classifier.

Okay, I think Shreya might not be able to join; her seminar has been running late. Too bad. Any other questions, or any other ideas on how we can make this more fun for people?

"I just had a follow-up on what you said about pairwise not being usable for an evaluator. What if you take a batch of input/outputs, do the pairwise preferences, and compute Elo scores for each input/output within that batch? Then you could have a threshold: anything that crosses this Elo rating is probably good enough. Couldn't you use that to turn the pairwise approach into a per-data-point threshold?"

I think that's a great point. The way I'd answer it — let me find the right tab; Zoom gives me no feedback on whether I'm sharing my screen — is that the question is whether your task is objective or subjective. If your task is objective, like toxicity or factuality, where it's a clear one or zero, then comparing two zeros and saying one is better, or comparing two ones and saying one is worse, doesn't make sense. But if your task is more subjective — which of these is more persuasive, which has a better tone, which is better clickbait, which is more relevant — then yes, a pairwise comparison could work. That's how I think about the task. A lot of the time it's: we have control, which is our current system, and then we have something new — new and hopefully better recommendations, a new and better prompt, a new output, a new way of creating the summary — and in that case pairwise comparisons can work. I know AI News does a lot of pairwise comparisons; almost every time a new model comes out, it runs pairwise comparisons, actually involves users in them, and shares the results across models. That can work. On the other hand, for things that are more objective, I think pairwise comparisons don't really make sense.
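For the batch-of-pairwise-preferences-plus-Elo idea raised above, here's roughly what that bookkeeping looks like; the threshold at the end is exactly the judgment call being debated. This is illustrative code using the standard Elo update, with made-up data, not something from the discussion itself.

```python
def elo_ratings(items, pairwise_wins, k=32, start=1000.0):
    """pairwise_wins: list of (winner, loser) pairs from an LLM preference judge."""
    rating = {item: start for item in items}
    for winner, loser in pairwise_wins:
        expected_win = 1 / (1 + 10 ** ((rating[loser] - rating[winner]) / 400))
        rating[winner] += k * (1 - expected_win)
        rating[loser] -= k * (1 - expected_win)
    return rating

outputs = ["a", "b", "c"]
prefs = [("a", "b"), ("a", "c"), ("b", "c")]   # judge preferred a over b, a over c, b over c
ratings = elo_ratings(outputs, prefs)
threshold = 1000.0                              # arbitrary cut-off; choosing this is the hard part
passing = [o for o, r in ratings.items() if r >= threshold]
print(ratings, passing)
```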
"One minor pushback: I think the split between objectivity and subjectivity is not as clean in practice. For example, instruction following is treated here as an objective metric, and to an extent it is, if it's something like 'don't exceed this many words' or 'don't use this kind of language.' But even between two domain experts — say it's a medical task — I feel like there's some level of subjectivity, taste really, where each expert thinks a better way of following the instructions would have been their way. That's where I think this breaks down in practice. Even with factuality there's that gray area of how you interpret the data. Maybe it's only an issue at the margins, but still."

That's a good point; there are definitely gray areas. For example, should the show My Little Pony be tagged as "horse" or not? People searching for horses are probably looking for equestrian content, maybe Olympic equestrian, and definitely not My Little Pony — but My Little Pony is a horse. So there's definitely a big chunk of gray area in the middle that's just hard to be precise on.

"Right, and if AI is going to get very intelligent, it's going to do more things like that."

Yeah. And even for cases like that, we can just make a call and say that in this case we want it to be a one, write an instruction for it, and keep the task simplified to binary. That's as opposed to questions like "is this a better summary than that one," "which is more relevant," "which is more comprehensive" — for me those are really hard. You can try to break them down into binary: you can say relevance is composed of these criteria, comprehensiveness is composed of those. But I think we're usually better off with a pairwise comparison when we're comparing control versus treatment and asking which is better. "Thank you." You're welcome.

Okay, I guess that's it. I guess Shreya isn't going to be able to join us; she's giving a seminar right now.

"No, no, I'm here! I've been here for five minutes."

Oh, you took your time! "Sorry, I was at a seminar that ended at 12:50. I'm so sorry." See, it wasn't clickbait; I did legitimately invite Shreya, and I was hopeful she would be joining. So now that Shreya, the author of the paper, is here: does anyone have questions for her? I think we already asked Eugene all the questions and he answered them, but actually there's one from Venu: Shreya, is there feedback on the information utility of specific labels for improving alignment?

"We don't have it in the paper's prototype, but the v2 that will hopefully be pushed in the next couple of weeks — it has to be before the conference — does consider natural-language feedback."

Nice. So essentially it's like chain-of-thought feedback, and the LLM uses it to update the prompt or the assertions? If the user gives feedback on why something is bad, even just a few words, is that included in the prompt for that validator — sorry, for that criterion?

"Yes, essentially it becomes a few-shot example."

Oh nice, that's great.

"Shreya, I have a quick question: are you planning to publish this framework as an open-source project? I didn't see whether you had already done that."

"Not yet, it's not out yet, but we will. The conference is in three weeks and we have to have it out by then. It'll be implemented in ChainForge (chainforge.ai), which is an open-source LLM pipeline-building tool."

Awesome, great, thanks.

"Shreya, if I may: you had a tweet recently about good task decomposition and how it can 1000x the accuracy of an LLM pipeline. How do you think that interacts with an evaluation framework? Would you then have to evaluate each component separately, or does it all tie in in some end-to-end way, or some hybrid of both?"
"Great question. We're exploring this as a research topic, but the idea is to have each node in your graph be a standalone unit that does a standalone thing you can write standalone assertions for. If you think about infinitely many inputs flowing through your pipeline, some fraction of inputs will fail at each step, so you can track error throughout the pipeline and do some form of causal analysis to see which node is the biggest reason the most downstream node failed. That's something we're working on now, but the idea is that you should decompose in a way that lets you also validate the intermediates; each node should be somewhat of a unit task."

"So basically it would involve some bottleneck analysis of which nodes, like you said, are most responsible for the overall performance, and focusing on those. Got it, thank you."

All right, I think that's the hour. We're slightly distracted in the Discord chat by Mira Murati — well, she says she resigned, but we don't know, and there's not much speculation we can do. I think that's it. Thanks!

Wait, what's happening next week? We have people signing up to do papers; I think Sam signed up to do one but didn't set a date. Yay, Sam. We'd really like to dive into function calling and the Berkeley team — I guess we're just mainlining all the Berkeley PhDs. The Berkeley team put out v3, if you're interested, Sam.

"Yeah, I can try." You've got this. "It'll be my first one. The eval v3 — I missed it. Do you want me to do the Gorilla paper and then v3 of the BFCL, or just the v3?"

I actually think the Gorilla paper is not that important anymore, but they just don't have papers for v2 and v3 of the Berkeley Function Calling Leaderboard, which they put out in the last two months, and it's now sort of the industry standard for function calling. I've been meaning to dive into it, but if you want to take it up next week, it's yours. "All right." As a hint, if you drop by either the Gorilla server or the MemGPT server, both Charlie and Shishir are usually available in there if you have questions. Both are Berkeley; they should really just consolidate into one server, there are too many of them. We're actually talking to LMSYS next week as well, so we're trying to make a Berkeley day, with vLLM, LMSYS, all these people. I don't know if we'll actually do it, but whatever.

Okay, thank you, that was amazing — thank you, Eugene, and thank you, everyone. I'm glad we have a paper for next week. Thanks to swyx for all the feedback on the initial prototypes. See everyone, see ya.