
Go ahead, yeah. So let me share my screen. Hey, I'm presenting today. I did my PhD in robotics and continuous learning, and currently I'm working at Flagship in Boston on biotech, machine learning for drug development. I can talk more about what I do a bit later, or on Discord. We also have a lot of open positions, so reach out to me if you're a machine learning engineer or AI engineer, I would love to have you.

Today I'll be presenting a hallucinations paper. This was a very interesting paper on a topic that has been brewing for a long time. People have been trying to ground models, Gemini tried to ground models through search, and we always do citations and backtrack information through them to ensure no hallucinations surface to users. But this is an interesting approach where the main question is: can you detect hallucination while the model is generating? Say you're streaming responses, and as soon as you detect that the model is not confident, can you just stop generating, or say "I don't know"? They use some interesting approaches to do that, so I'll do a deep dive on what they changed in the model architecture, plus some interesting results.

The TL;DR: the motivation is to identify hallucinations while streaming. The limitation of current approaches is that one of the main state-of-the-art methods, semantic entropy, just samples the model multiple times, checks how often it generates the same answer, groups the samples into clusters, and assumes that if the clustering is tight enough it's probably not a hallucination. It's not a foolproof method, of course, it fails often, and other hallucination-detection methods are very time consuming and require a lot of resources. The approach here uses what they call spans, collections of tokens, and uses web search to decide whether a span is true or not (I'll go deeper into how they do this). They then attach a linear probe to the model, at the final layer or whichever layer you want, and that probe essentially gives you the probability that a given token is hallucinated. They train this in a next-token-prediction setup, with some loss manipulation so the probe probabilities come out right, and they achieve state-of-the-art results for hallucination detection. For context, I also tried training this model: it works pretty well and generalizes pretty well. You can think of it as instruction fine-tuning for hallucination: even though you don't use that many samples, afterwards you can use that probe to check hallucinations on any topic the model was pre-trained on, which is kind of awesome.

That's the TL;DR, so let's dive into the paper. The main thing here is that they don't identify hallucination for whole sentences; they identify hallucination at the token level, which is super powerful and, like I mentioned, super helpful for streaming.
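To make the probe idea concrete, here is a minimal sketch in PyTorch. This is my own reconstruction, not the authors' released code; the class name and sizes are made up. A single linear layer maps each token's hidden state from a chosen decoder layer to a hallucination probability.

```python
# Minimal sketch (my own reconstruction, not the paper's code): a linear probe that maps
# a transformer's hidden states to a per-token hallucination probability.
import torch
import torch.nn as nn

class HallucinationProbe(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        # single linear layer: hidden state -> scalar logit per token
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from a chosen decoder layer
        logits = self.linear(hidden_states).squeeze(-1)   # (batch, seq_len)
        return torch.sigmoid(logits)                      # per-token P(hallucinated)

# Example: fake hidden states for a 10-token sequence of a 4096-dim model
probe = HallucinationProbe(hidden_size=4096)
h = torch.randn(1, 10, 4096)
print(probe(h).shape)  # torch.Size([1, 10])
```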
Another key contribution they mention is that they use web search to generate the hallucination dataset in the first place. If you go back and think about how instruction fine-tuning was done, you take some text and pass it through a language model to create question-answer pairs. Here it's the same idea: they ask an LLM to generate some text, then use web search to ground it and mark which spans or tokens are wrong. That's the main contribution they highlight, but I have some takes on this; I'll explain later why I think we can make this even better without web search. The other thing I found really interesting is the linear probing approach: attaching a small linear probe at the final layer to estimate the probability that a token is correct or wrong. It's super powerful and can be used for other things as well; they show a few different applications in the paper, which I'll cover. And like I mentioned, they achieve state-of-the-art results compared to baseline methods like perplexity and semantic entropy.

Here is a very high-level view of what their data and labeling look like. You have a response generated by Llama 3.1, some text, and the labeling essentially says that the "29" in "29-year-old" is wrong. It identifies the key entities mentioned and marks which of them are right and which are wrong. Importantly, they are not predicting on every token. For example, in "Riley v. California, 573 U.S. ...", the words "Riley v. California" are not labeled as zero; they are treated as unimportant tokens, and the citation numbers that follow are treated as the important tokens. So you can think of it as a two-level classification in the same training run: first, is this token important, and second, is it hallucinated. That's an interesting approach; I don't think anybody has done that. I'm going to skip over the introduction; it just says that nobody has done this and that current methods are all citation-checking approaches and so on.

So here's the core of it. They use a frontier model with web search to extract entities, which can be either true or false. They use Claude Sonnet 4 with web search: they give it a sentence, Sonnet does the search and comes back with JSON saying which spans are wrong, and they use that as the training dataset. My take here, and I haven't tried this yet but I will soon, is that we don't need web search. We could just take a dataset, say Wikipedia, and inject hallucinations by changing names, dates, and so on, and use that for training. I don't think web search adds anything; it's just API cost. That's my take.
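As a rough illustration of that alternative (purely my speculation, not something the paper does), you could corrupt known-good text and record exactly which span you corrupted, so the labels come for free:

```python
# Sketch of the "inject hallucinations" idea (my speculation, not what the paper does):
# corrupt known-good text by swapping an entity, and keep the character span that was
# corrupted so it can be mapped to token-level labels later.
import random

def inject_hallucination(text: str, entity: str, replacements: list[str]):
    """Replace a true entity with a wrong one; return (new_text, corrupted_char_span)."""
    start = text.find(entity)
    if start == -1:
        return text, None
    fake = random.choice([r for r in replacements if r != entity])
    new_text = text[:start] + fake + text[start + len(entity):]
    return new_text, (start, start + len(fake))  # span of the injected error

text = "Albert Einstein was born on March 14, 1879 in Ulm."
corrupted, span = inject_hallucination(text, "Ulm", ["Berlin", "Munich", "Vienna"])
print(corrupted, span)  # e.g. "... in Berlin." plus the character span of "Berlin"
```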
Also, feel free to jump in and ask me questions. I see a lot of questions in chat already.

"But the ground truth is generated": can you explain what you mean by that?

Yeah, sure. My understanding of why they did it this way is that they took the text and then generated labels for whether or not the entities in it were based in fact, because the text itself was a large generated corpus. I assume they generated roughly 10x because they said they expanded the long-form dataset by 10x, and I presume they did that to get a larger training set.

Hmm, yeah, that makes sense. But don't you think the training set would be even bigger if you just inject hallucinations rather than doing verification?

Yeah, that's valid. You would still have to... I guess, sure, you could just go and inject hallucinations. Actually, yeah, I agree with you.

And, maybe I'm jumping too far ahead here, but in the results they mention that they used data generated by, say, Gemma and then trained on Llama and still got the same performance. They even used Llama 3 8B data with a 70B model and got better performance. Which means the underlying text generator doesn't really matter, which is awesome.

There's another question saying the web search was for verifying the truth or falsehood of the generation; you might still want to generate hallucinations. True, true. And another saying injected hallucinations might be too much out-of-distribution sampling. That's a valid point: if you inject hallucinations, they may be too far out of distribution. I hadn't thought about that.

Let me jump to the next part of the paper, which is related work. I think the motivation here comes from the Tegmark line of work, where they used linear probes to identify representations, truth directions, and used that for hallucination detection, which was really orthogonal thinking. The other related work is uncertainty-based detection, which is essentially perplexity: as the model generates tokens, you look at how surprised it is by each token. That's another motivation they cite for this work. And the last one is external verification, which is what most of the industry uses now: generate citations for your text, and for each sentence verify it against the cited source.

This diagram relates to the dataset they generated. I'll explain it more in the next section, but the idea is that they started with the LongFact dataset and added more: biography questions, legal prompts, more topic-focused queries on biology, chemistry, and so on. They had a language model, Llama, generate responses to those queries, and then passed the output to a state-of-the-art LLM, Claude 4 with search, to verify at the token level whether the tokens are correct. That is the dataset-generation process.
They also released the dataset, which is super nice, and the code is released as well, by the way, so you can train this super easily. Like I mentioned, they took the LongFact dataset that was already published, expanded on it, and created a roughly 25,000-question dataset. For the token-level annotation there are three labels they let Claude generate: supported, not supported, and insufficient information. Supported is labeled as, say, zero, and not supported and insufficient information are both labeled as one. They also did some human verification of the labeling quality. I don't think the small n they checked, relative to 25,000, really makes a difference here, but they wanted to sanity-check the quality of the labels generated by the state-of-the-art model.

This is an example of how the labeling looks. Say this sentence is generated by Llama 3 8B; you pass it through the web-search labeler and it identifies that "Albert Einstein" is important, so label that as a span; "was born on" is unimportant text; the date is correct; and "in Berlin" is false. Then they do the token-level labeling. You can think of it as two-level labeling: for the first span, "Albert Einstein" is important tokens, so y_i equals one and y_s equals zero because those are true tokens; the second span is the same, important but true; and the third is important but false, a hallucination. That's how the labeling works.

The next part is how they attach the probes. Here's a very basic, rudimentary explanation. They took a Llama model, and all the dashed boxes here are frozen; then they added an optional LoRA. That's a hyperparameter, and they experimented with and without LoRA. I think LoRA helped a little, but they also show that just adding a probe on the final layer is good enough; you don't need the LoRA, though adding it does improve things. So for one decoding step: you send the tokens through the tokenizer and embedding, go through N layers, and at the Nth layer you take the hidden states and pass them through the probe, which outputs a probability, close to zero if the token is not a hallucination and one if it is. That loop continues, so every token produced gets a hallucination check.

Yes, so Ted is mentioning something, and okay, there's a question from swyx. I think the confusion here is between entities and tokens. For the web search they extract entities, then they take those entities and label the tokens. In the example here, "Albert Einstein" is an entity, "March 14, 1879" is an entity, "Berlin, Germany" is an entity, and Claude will return Albert Einstein as zero, March 14, 1879 as zero, and Berlin, Germany as one. The spans are labeled this way.
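To make the two-level, token-level labeling concrete, here is a tiny sketch of how character-span labels could be mapped onto per-token labels. This is my own illustration with a fake whitespace tokenizer, not the authors' code; in practice you would use the real tokenizer's offset mapping.

```python
# Sketch (my illustration, not the paper's code) of turning character-level span labels
# into per-token labels. y_imp = token belongs to a labeled entity span,
# y_hal = that span was marked "not supported" / "insufficient information".
text = "Albert Einstein was born on March 14, 1879 in Berlin."
# (start, end, hallucinated?) character spans, as a web-search labeler might return them
spans = [(0, 15, 0), (28, 42, 0), (46, 52, 1)]   # Einstein / date / Berlin

# Pretend tokenizer: whitespace tokens with their character offsets.
tokens, offsets, pos = [], [], 0
for word in text.split():
    start = text.index(word, pos)
    tokens.append(word)
    offsets.append((start, start + len(word)))
    pos = start + len(word)

y_imp, y_hal = [], []
for tok_start, tok_end in offsets:
    # a token inherits a span's labels if it overlaps that span
    hit = [lab for (s, e, lab) in spans if tok_start < e and tok_end > s]
    y_imp.append(1 if hit else 0)
    y_hal.append(1 if any(hit) else 0)

for tok, i, h in zip(tokens, y_imp, y_hal):
    print(f"{tok:10s} important={i} hallucinated={h}")
```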
They take those entities, aka spans, and label the tokens based on that. They use "entities" and "spans" interchangeably in the paper, which is a bit confusing.

Eugene is saying it's pretty cool that the probes sit right next to the tokens. Yes, that is actually pretty cool: you're not just generating tokens, you're also generating a probability for each token, so it's essentially multi-headed, which is kind of awesome.

So is the intent here that this model with the probe is dual-use: it both answers the question and returns a hallucination probability?

Yes.

I see. So suppose I'm using an LLM API that I cannot fine-tune to do this. Can I still take the streamed output, pass it through this model, and probe every token?

Yes, I tried that and it works.

Oh, that's really nice. So essentially, if you purely want it as a hallucination classifier, you don't need the decoder head; you just use the probe head?

Yes.

Very cool that you tried that, because I think in the paper they said that's a limitation. They have a thing in the appendix where they show it's not as effective if you don't have access to the logits, or not the logits exactly, but the residual stream or whatever.

So here's what I did. I entered some text, and during the prefill stage, the initial stage where the model prefills, I was able to calculate the probabilities. I wasn't able to stream it, of course, but once the text is generated by, say, GPT-5, I can take it and, during the prefill stage, identify which tokens are hallucinated. I don't need to decode, I don't need to generate the next tokens after that; I just need to see which tokens are hallucinated.

I see. How do you do that? Essentially you just pretend you start from the first token and scan through it? Very cool.

Right, and the reason they say it's a limitation is that you're not able to identify it during streaming that way. Yeah, that's the limitation. The strong assumption here is also that the model you're using for hallucination detection has approximately the same data distribution as the text, which, with enough training on human text, should be approximately similar.

True, yeah. For example, they would need to be at the same level: Llama 8B would need to be at the same level as GPT-5, which is not true, so I would use this with caution. Another approach I was thinking of is to take your own training data, train this, and then use it as a final label check before the next step of your agent pipeline, so no hallucination gets through there.

Sorry, you lost me for a second. I don't know if this is what Eugene said or what you said, but I thought I heard something about the probe actually predicting the next token. The probe's not predicting the next token, right?
No, that's right, the probe is not predicting the next token, but the probe sits next to the next-token prediction.

My understanding is the probe actually sits a little earlier; it's not immediately before next-token prediction.

I mean, you have two outputs. Exactly, that's what I meant by "sits next to": when you're predicting the next token you're also predicting the probe, so there are two outputs. As for where you place the probe: in the code they've made it modular, so you can say n equals, let's say, 31 or 29, and it will sit at the 29th decoder layer and use that hidden state for the probe.

Right, because that's the whole point of doing this in real time: as the tokens come out, you can interrupt generation if you're concerned about hallucination.

Yeah, that's exactly right.

Adam, you had a question and I saw you come on camera. Did you want to ask it?

I'll put some stuff in chat. Sorry if I called you out. I'm happy for it to wait until we're further down the paper.

Okay, let's wait then, thank you.

I see a comment that the LoRA here isn't changing the base model. Correct me if I'm wrong, Adam: are you asking whether the LoRA changes the way the model predicts tokens? Yes, it does, and I'll talk in the next part of the paper about how the model quality changes when you train this hallucination check, because the LoRA technically changes how the next token is predicted as well.

My thought was to clarify, with the LoRA question, that the LoRAs don't seem to be part of the probe, the truthiness evaluation. My guess is the LoRAs are purely there, once you have the ability to evaluate, to make the results better for later generations. Hopefully I'm understanding that right.

I haven't done the ablation study of removing and adding LoRA, so I'm just going by what they report. When I trained it, I did train with LoRA, so I don't know what happens if you train just the probe, or whether the quality changes.

So let me jump to the next part: what does the loss look like? You have two things now, a probe and regular generation. They have a regularization term, which they set to 0.1, so the majority of the loss is contributed by the probe and some of it by the regularization, which is a regular language-model loss. Actually, for the regularizer they have two types: one is a language-model loss, which is essentially next-token cross-entropy, and the second is a KL divergence between what the model without LoRA would predict and what the model with LoRA predicts, essentially controlling the drift in distribution between the changed LoRA model and the original model. Those are the two loss functions they experiment with, and they show the comparison in the results section.
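Here is a rough sketch of that combined objective as I understand it; this is my reconstruction with made-up function and variable names, and the exact formulation in the paper may differ.

```python
# Rough sketch of the training objective as I understand it (not the authors' exact code):
# total loss = probe loss + lambda * regularizer, where the regularizer is either plain
# next-token cross-entropy or a KL between the base model's and the LoRA model's logits.
import torch
import torch.nn.functional as F

def total_loss(probe_loss, lora_logits, base_logits, labels, lam=0.1, reg="kl"):
    if reg == "lm":
        # standard next-token cross-entropy on the LoRA model's predictions
        # (label shifting omitted for brevity)
        reg_loss = F.cross_entropy(
            lora_logits.view(-1, lora_logits.size(-1)), labels.view(-1)
        )
    else:
        # KL(base || lora): keep the LoRA model's token distribution close to the base model
        reg_loss = F.kl_div(
            F.log_softmax(lora_logits, dim=-1),
            F.log_softmax(base_logits, dim=-1),
            log_target=True,
            reduction="batchmean",
        )
    return probe_loss + lam * reg_loss

# Toy example with random tensors: batch of 2, sequence of 5, vocab of 11
lora_logits = torch.randn(2, 5, 11)
base_logits = torch.randn(2, 5, 11)
labels = torch.randint(0, 11, (2, 5))
print(total_loss(torch.tensor(0.7), lora_logits, base_logits, labels))
```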
The probe loss is super interesting, because you have a two-level classification: first you need to identify which tokens are important, which tokens are entities, and then, among the identified important tokens, which are true or not. What they do is annealing. They start out by identifying the spans, the important tokens, and as training continues they shift the weight toward identifying the hallucination. The first part of the loss is the span detection and the second part is the hallucination detection. And notice that a span has multiple tokens. Going back to my example, say "Berlin, Germany" is four tokens: not all four tokens need to be false, maybe only "Berlin" is wrong. So you can't just apply a cross-entropy loss to every token in the span. They do a soft cross-entropy by putting a max term in: the max of the probabilities predicted across the span's tokens should be one, not every token. This surprisingly works; I was a bit surprised by how well this soft cross-entropy works. I think the reason the approach works is this loss function, how they switch between identifying which tokens are important and then identifying whether those tokens are true or false. Any questions on the loss definitions?

What did p represent, exactly?

The probability for that token, so the probe probability.

Okay, got it.

Someone is asking whether it can work with an API. It should, I don't see why not, but the assumption here is that you're training your own model; only then can it work on API output.

There's another question: if you pre-train the hallucination classifier on an existing hallucination training dataset, does it transfer? That was super interesting to me as well. The question is whether, using the existing data, it generalizes to newly generated text. Yes, it works. When I was experimenting, I gave it all the data they had generated and mixed in some chemistry questions for drug development, and it was able to identify that a particular protein structure was wrong, which is pretty impressive. It's using its internal knowledge as well during this hallucination prediction.
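Going back to the probe loss for a second, here is a rough sketch of how I read the annealed, max-over-span soft cross-entropy. Again, this is my reconstruction of what was just described, not the paper's exact loss; the function name and annealing schedule are assumptions.

```python
# Rough sketch (my reconstruction) of a span-level "soft" probe loss with annealing:
# for a span we only require the *max* probe probability over its tokens to match the
# label, since maybe only one token inside the span is actually wrong.
import torch
import torch.nn.functional as F

def probe_loss(p, spans, alpha):
    """
    p:     (seq_len,) per-token probabilities from the probe
    spans: list of (token_index_tensor, is_hallucinated) for the labeled entity spans
    alpha: annealing weight in [0, 1]; small early in training (learn span detection),
           large later (learn hallucination detection)
    """
    in_span = torch.zeros_like(p, dtype=torch.bool)
    span_terms, hal_terms = [], []
    for idx, is_hal in spans:
        in_span[idx] = True
        p_max = p[idx].max()  # only the max over the span has to match the label
        span_terms.append(F.binary_cross_entropy(p_max, torch.tensor(1.0)))           # "this is a labeled span"
        hal_terms.append(F.binary_cross_entropy(p_max, torch.tensor(float(is_hal))))  # "...and is it hallucinated"
    # tokens outside any labeled span are treated as unimportant and pushed toward 0
    bg = F.binary_cross_entropy(p[~in_span], torch.zeros(int((~in_span).sum())))
    return (1 - alpha) * torch.stack(span_terms).mean() + alpha * torch.stack(hal_terms).mean() + bg

p = torch.sigmoid(torch.randn(12))
spans = [(torch.tensor([0, 1]), 0), (torch.tensor([5, 6, 7, 8]), 1)]
print(probe_loss(p, spans, alpha=0.3))
```

The max is what makes it "soft": only one token inside "Berlin, Germany" needs to fire for the span to count as caught.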
That's why I'm having a hard time understanding this part. Ted is saying that we need the internals of the model, so I'm thinking: is that the hidden states, the residuals, or what else?

Right, for the probes you need the hidden states.

So we don't exactly need the actual hidden state from the model that was used to generate the text, right? We can simulate it via this model?

Exactly, yes. That is what I meant: you can take the output from GPT-5, pass it to this model, and identify which tokens are hallucinated. Does that make sense, Ted, Adam?

So my TL;DR here is: when you do mechanistic-interpretability research, people find that the model actually knows when it's not confident in what it's saying. If you simply use the methods they talk about in the background section, where you look at the entropy or other properties of the logits, you get a certain accuracy, 60 or 70 percent or whatever. What this paper is saying is that the model has internal knowledge of its uncertainty beyond what shows up in the logits, and that's why the linear probe can outperform logit-based hallucination detection. The linear probe is basically looking for a direction (if you don't know the mechanistic-interpretability stuff you might not follow), a direction on the residual stream that says, "I'm thinking about chemistry, blah blah blah, but I also have this thought that I'm not really sure about what I'm saying." That's only available from the actual model that did the generation, and it's hard to just bolt it on. And the reason you want roughly the 95th-percent-depth layer is that the last few layers are about refining next-token prediction, not about thought understanding or creation, so in the last five percent of layers you tend to lose thought information; it gets stripped from the residual stream so the stream carries just next-token information. So you'd get worse linear-probe information after the last layer than at roughly the 95th-percent layer.

That's true, that makes sense.

But I think it bears repeating: the experiment you did used a different model, or potentially a different model, right? You took GPT-5's output, ran it through Llama, and used Llama's notion of whether this is a hallucination, and that was also effective?

You mean Llama, right? Yeah, it seemed to work. But the key is that the information has to be general, something both GPT-5 and Llama know about. If Llama doesn't know it, it doesn't work.

Interesting. Did you compute any loss curves on that, or not loss curves, an ROC curve or anything?

No, it was just a simple test with maybe 10 or 15 examples; I didn't do more than that. I just wanted to see if it actually works.

Sorry guys, just to clarify: you took GPT-5's output and calculated the probability of hallucination based on what this open-source model was trained on, so it's basically guessing from its own training. If there's overlap between the datasets it probably works, otherwise it doesn't, right?

Yes, yes.

It's not exactly reverse engineering, it's guesswork.

Yes, it's guesswork. But where it gets interesting is with domain-specific information: you can make a small Llama 8B model your domain expert and pass the text through that, which would be very powerful.
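Here is a sketch of that "offline" scoring idea; this is my own code, not from the repo. You run any text, even text another model produced, through the probe model once, grab the hidden states of a chosen layer, and apply the trained probe. The model name, layer index, and the untrained placeholder probe are all assumptions; you would need a probe actually trained for this model.

```python
# Sketch (my own code, not the repo's): score arbitrary text in a single prefill pass.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"            # placeholder: whichever model your probe was trained on
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).eval()

layer_idx = int(0.95 * model.config.num_hidden_layers)   # roughly the "95th percent" layer mentioned above

probe = torch.nn.Sequential(                      # untrained placeholder: load your trained probe weights here
    torch.nn.Linear(model.config.hidden_size, 1), torch.nn.Sigmoid()
)

text = "Albert Einstein was born on March 14, 1879 in Berlin."   # e.g. pasted output from another model
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)     # one prefill pass, no decoding needed
    h = out.hidden_states[layer_idx]                     # (1, seq_len, hidden_size)
    probs = probe(h.float()).squeeze(-1)                 # (1, seq_len) per-token P(hallucinated)

for tok_id, p in zip(inputs["input_ids"][0].tolist(), probs[0].tolist()):
    print(f"{tokenizer.decode([tok_id]):>12s}  p(hallucinated)={p:.2f}")
```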
Right, and let's say you have some proprietary or confidential information of your own, and GPT is generating content about it. You can identify "oh, this is a hallucination according to my knowledge," even if GPT would argue otherwise. So you can mix and match.

So it's using a smaller model to detect very specific errors. I won't even call them hallucinations anymore, just factual errors, and that's something you can detect because you have a smaller model trained on it. It could even be faster. A similar principle to speculative decoding in a sense, not using probabilities, but...

I'd say spec-dec is a bit different, but I totally agree there's a similarity between the two.

And I think you could go either way: your hallucination-detection model could theoretically be larger, but of course then it'd be expensive, so you're probably not going to do it that way. The authors of the paper would probably hope that if this technique were refined, made better, and actually useful, then GPT-5 would just support it and you wouldn't have to run your own standalone Llama side by side. But it's an interesting thing: basically, when you pass any text into this Llama model, whether it comes from GPT-5 or you wrote it yourself, what Llama is answering is, "if I had generated this text, would I have been confident, or would I have been uncertain and maybe hallucinating?" And to the extent that Llama is right, you can use it as hallucination detection for your five-year-old kid, or for GPT-5, or for whatever you want.

Yep, that's true. Cool. So, the comparisons they did: let me jump to the next part of the paper. Entropy and perplexity are the model's internal signals of how surprised it is by the generated token. The second baseline is semantic entropy, which I think Ted already explained a bit: you sample multiple times, group the generations by meaning, and if they all cluster together it's probably not a hallucination, while if they spread out it might be. That was actually the previous state of the art. And then black-box self-evaluation, which is essentially an LLM-as-a-judge setup.

For the experimental setup they use Llama 8B and 70B models. The repository also has the ability to use any Hugging Face model, which is kind of cool, and they also tested Gemma models, which works. You can see in this row that just using perplexity, not doing anything fancy, basing things on perplexity like Ted mentioned, works reasonably well, with an area under the curve of about 0.76. But if you use the probes, even the plain linear probe or the LoRA probe (to clarify, "linear probe" means there is no LoRA, just the probe; the LoRA is optional), and especially when you add LoRA, it becomes very good at identifying hallucinations: both recall and the area under the curve are pretty high. So this is a summary of the results.
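For reference, here is what the simplest baseline in that comparison looks like: score each token by the model's own surprise (negative log-probability) and by the entropy of its predictive distribution. This is my illustration with GPT-2 as a stand-in model, not the paper's code.

```python
# Sketch of the perplexity/entropy baseline: per-token surprise and predictive entropy.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = "Albert Einstein was born on March 14, 1879 in Berlin."
ids = tok(text, return_tensors="pt")["input_ids"]

with torch.no_grad():
    logits = lm(ids).logits                        # (1, seq_len, vocab)

log_probs = F.log_softmax(logits[:, :-1], dim=-1)  # position t predicts token t+1
nll = -log_probs.gather(-1, ids[:, 1:, None]).squeeze(-1)   # per-token negative log-prob (surprise)
entropy = -(log_probs.exp() * log_probs).sum(-1)            # per-token predictive entropy

for t, s, h in zip(ids[0, 1:].tolist(), nll[0].tolist(), entropy[0].tolist()):
    print(f"{tok.decode([t]):>10s}  nll={s:.2f}  entropy={h:.2f}")
```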
They released the dataset as well, which has about 25,000 total samples. I'll skip through the evaluations. This is the out-of-distribution performance, which is what I was mentioning earlier: what if the concept is not present in the training data, in your fine-tuning data, but is present in the pre-training data? It still performs well, which means it's using its internal hidden states very well.

And another one. This is actually pretty interesting; I still don't know why it happens, and I was a bit surprised by it too. If you train on samples that are only short-form text, it doesn't perform well on long-form, and if you train on long-form text with hallucinations labeled, it doesn't perform as well on short-form. I genuinely couldn't understand this part of the paper, because theoretically it shouldn't matter, but apparently it does. If anybody has insight here, I'm happy to hear it. For me this part was very surprising, because how does the length of the text matter for hallucination detection of a single token? Apparently it does.

I thought that was maybe the most interesting plot in the paper. My explanation was that it's actually looking at the context around the token and saying, "oh, this is a likely thing to be hallucinated."

If that were true, long context should always beat short context, right? Does it?

Well, maybe. I'd say that supports exactly what we see here, because on the long-form test the model may rely more on wider context, whereas the short-form one relies more on the fact itself, or something like that.

Why is this result unintuitive to you?

Think of it this way: a short-form sample is, say, a single question and answer, maybe a 100- or 200-token answer, where they mark some tokens as hallucinations and train on that. Long-form is essentially an essay of several hundred to several thousand tokens, I think they mention around 8,000, with hallucinations marked. They take two models, trained on short and on long, and apply each to the other dataset, and they don't perform the same way. So for me the question was: how does the length of the text matter? That's why it surprised me.

Ted here. I think my intuition around this result is that it's not that the model is any different; it's about whether or not we trained a good linear probe. What they're saying is that because the short-form data is short, the signal isn't super accurate, so the probe just doesn't learn very well. I don't think it's about hallucination looking different, which is why, if you get a really accurate signal by training on long-form, it works pretty well on short-form, because it's the same hallucination signal. It's really just a matter of what kind of training data you need to get this probe well trained. That's the way I read section five.

Okay, yeah, that makes sense.
So essentially I would relabel that plot as performance on good-quality data versus performance on bad-quality data, or something like that. Yeah, that makes sense.

Here is the cross-model generalization, which was super interesting. You can think of it as text generated by one model but probed by another model. Say they generate text with Qwen 2.5 7B and probe it with the 70B model, and it works. That's an indication of why one model's text works with another model, and why GPT-5's output worked when I tested it with Llama. Another surprising thing: if you have a weaker generator, say Qwen 2.5 7B, and a bigger probe model, the 70B, it actually performs better. Which is interesting, and it also intuitively makes sense, because the 70B model is more capable and can identify hallucinations well from its internal "thoughts." Any questions on the cross-model generalization?

Okay, I'll continue; I have 12 more minutes. Here is the LoRA part; I think I missed something here earlier. They hypothesize: if we add a LoRA, are the tokens produced different from the model without LoRA? It has to be different, because the LoRA contributes to token decoding as well as to the probe. So what they do is play with the regularization term. Just for context, let me jump back to the loss: there's the probe loss plus the regularization, and they vary the lambda term, increase and decrease it, and see how the model performs on MMLU, essentially the model's general strength. What they see is that it does have an effect, but there is a sweet spot, with the KL-divergence weight at 0.5, where it doesn't deviate too much from the original model's decoding but is still able to perform well on hallucination detection. Very interesting result.

I have the last 10 minutes, so let me jump to the next one, which is pretty interesting: this plot here. In streaming, say you're generating tokens, first token, second, third, fourth, and at the fifth token you realize it's a hallucination. What if you stop generating and say "I don't know"? You can do that, because you have a hallucination probability for every token while you're generating with this model. That's what they did, and they show that with a threshold of, say, 0.5 you can essentially increase model performance, because you can refuse to answer what you actually don't know. That's a skill current LLMs lack: even when they know something is wrong they keep generating. But here you can introduce a skill where, as soon as the model realizes it doesn't know, it stops. So that's one extra skill you can add to the model.
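A rough sketch of what that abstention loop could look like follows; this is my illustration, where `model`, `tokenizer`, `probe`, and `layer_idx` are assumed to be the probe model and trained probe from the earlier sketches, and real code would also use KV caching instead of re-running the full forward pass.

```python
# Sketch (my own illustration) of streaming generation with probe-based abstention.
import torch

@torch.no_grad()
def generate_with_abstention(model, tokenizer, probe, prompt, layer_idx,
                             threshold=0.5, max_new_tokens=100):
    ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    for _ in range(max_new_tokens):
        out = model(ids, output_hidden_states=True)                   # no KV cache, for brevity
        p_hal = probe(out.hidden_states[layer_idx][:, -1].float())    # score the newest position
        if p_hal.item() > threshold:
            # probe thinks we're hallucinating: refuse instead of continuing
            return tokenizer.decode(ids[0]) + " ... I don't know."
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)      # greedy decoding for simplicity
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0])
```

With threshold set to 1.0 the check never fires and you recover normal generation, which matches the "no monitoring" end of the plot discussed next.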
In addition to hallucination detection, being able to say "I don't know" is super useful. Any thoughts on this? I thought this was actually a really good result.

Could you explain it? I don't think I understand this graph properly. I don't understand why you could have an attempt rate of less than one if you have no monitoring, that rightmost point.

So say you have super aggressive monitoring, t equals 0.1: token number one has probability 0.01, token number two has probability 0.1, and you stop at token number two; you don't generate more tokens.

Understood, but at t equals one, the rightmost point, there's no monitoring, so why is the attempt rate 0.8?

Because there are still some answers it just doesn't respond to. It will still say "I don't know" to some things simply because it doesn't know and it recognizes that. At least that's my understanding of how they did this.

Got it, okay. That makes sense, thank you.

It would be super impressive if Claude or someone imitated this, saying "I don't know," at least for our agent pipelines.

Other than that, they list limitations. They say you need a lot of text, and the entity-level hallucination labels are problematic because they don't carry the context of the whole thought process. But I think that's okay; it's the best we have. The limitations here aren't very concrete or detailed, and the discussion section doesn't really add much to the paper, I would say. In the appendix they explain what the dataset looks like, give the prompts for generating the text, explain the baselines, give the prompts for all the evaluations like the LLM judge, and include some additional results, which is actually good. They also have some results comparing against just fine-tuning models and against black-box evaluation.

Cool, I think that is about it. Any questions, or anything you guys want to bring up?

Maybe a general question: this is really token-level. If a model gets asked something where the answer really needs quite a lot of tokens, ten-plus tokens, can this somehow be extended to that case, or do you then need a different hallucination-detection method?

You mean a group of, say, ten tokens where you want to identify whether that whole thing is wrong? That won't work directly, because this is done at the token-prediction level, one token at a time, so it won't work for a group of ten tokens out of the box. That's actually a good limitation of this paper.

The section-six limitations were kind of interesting, though: a lot of this relies on the backbone of verifying the facts with Sonnet 4 and web search, and even with that there's only about 80% recall and a 16% false-positive rate.
So even with generated text, having Sonnet 4 with web search try to verify the facts still leaves quite a few false positives. It's a bit of proof-of-concept work; they didn't really do that much in terms of stricter guardrails for ensuring the facts are right. So just a word of caution for people who do try to use something like this: you can probably get those numbers better with better guardrails, more rounds of filtering. Off the shelf it's a bit worrying that the labeling lets that many false positives through, but you can always do better if you try.

Yeah, that's true, and that's the reason I was saying injecting hallucinations is better than letting the model generate and then doing a web search to check whether it hallucinated: when you inject hallucinations, you control how much hallucination was injected, whereas here you don't really know.

And I think that's especially true given the results they showed. The cross-model generalization, which they even mention, kind of suggests that the information is intrinsically in the text, so maybe the concern about being out of distribution is not such a big concern. But I would have a lot of concern, and thank you for bringing this up, Vibhu: if you only have roughly 80% recall, or especially those false positives, in your training set and your validation set, how do you even know how you perform in the real world? All you really know is that you can more or less get to where Sonnet was; you don't know how you perform in the real world.

Yeah, that's true.

While we're at it, I might as well highlight the third limitation, since it draws on that: this paper is all about entity hallucination, facts, names, stuff like that. If you go deeper, it's not trying to check for hallucination in reasoning, or in contextual stuff. Someone here mentioned using this for RAG (Ankit brought it up, using it for RAG context), but this is only trained on entities. It checks things like "is this date correct," but it's not necessarily going to work on reasoning, or on whether this is the correct context. That's another limitation. And when it comes to training it, you can generate data with RAG, but then this verification step doesn't really cover that; you have to do more work there.

Yeah, that's true. I'd be interested in whether you even need to ground the thinking. Maybe don't ground the thinking and only ground the generation part, or something like that.

I think there are aspects to both. In the ed-tech space, say, you have a copilot that's teaching you something: there's value in the reasoning steps. You want valid reasoning, you don't want hallucinations there, and if you're trying to learn you often care more about the process than the end output. So it's always useful, and it's not like it's trivial to verify this stuff; it's just a different objective than what this is trained on.
You can check whether something is a proper reasoning path; it's just not what this repo does. So it's just something to think about. And like Ted mentioned, this is, I think, part of the MATS work, an internship in the interpretability and safety space. I think this was an internship project; only one of the authors is from... I don't know where, actually.

Neel Nanda is from Google, yeah. He's the big MATS advisor, and a lot of the MATS papers have just started coming out. John Schulman was advising one, so technically that's another Thinking Machines paper in a sense. But it's all MATS internship work, and this is some of the cooler stuff.

Yeah, definitely. Awesome. Adam, you have a hand up.

Yeah, if I can follow up on the sort of thing Vibhu was saying at the start (sorry if I mangled your name). I put a question in chat way earlier: it seems like when they do the entity checking, they're mostly checking "does this date, this name, this whatever exist," maybe with some similar context. It jumped out at me that it's good to check that the name is real, but it seems like the model could score well by using "John Smith" for every name, you know what I mean? It suggests there's a way for it to be orthogonal to truth. I imagine there's something in the paper that addresses that; I didn't read it before arriving, but I just wanted to question something in that space.

I think the paper actually goes over it. They use Claude 4 with web search, break the text down into minimal text spans, search the web, and give each span one of three labels: supported, not supported, or insufficient information. They have a whole section on it, and then they turn that into a binary classification. If you go to Appendix C.3, this is the prompt Sonnet got to verify each fact. So it's not just "does Berlin exist," it's actually verifying whether Einstein was born in Berlin, and it does web search, so it's not even just based on Sonnet; it's somewhat more grounded there.

Right, yeah, it is a bit more grounded there; it doesn't just rely on parametric information. But the problem comes in when you deviate from this dataset. That's where I don't know how well it works. It works, kind of, but yeah.

And in terms of basic improvements from internship project to production, even stuff like this: the guardrails that aren't there, or say you want to make this verify reasoning or other stuff. This is still doing three things in one call, right? One call that does the expanding, the search, and the labeling. The simplest solution is to break those into three. You're giving three very distinct tasks to one LLM call. Sure, it can probably do it, but you could probably do better.

Yeah, exactly. So first call: verify whether it's true. Second call: which part of it is not true. Third call: where in the text that part actually is, or something like that. Break it into three calls and I think your false positives and your precision will improve drastically if you do that. Yeah, I do agree.
And if you have real data instead of synthetic stuff, you can do more of this in parallel, you can bring back data, you can augment it. There's a lot you can do, but I think it works as a good proof of concept.

Yes, and it's actually super cool that they opened up the repo. You can experiment with it a lot; it's a very good-quality repo.

Splitting it into three calls like that seems like sort of what Peter was asking about earlier; you could attach that to whatever you wanted.

True, but Peter wants to predict whether a set of, say, ten tokens is true or not, and you cannot really do that here; the label is per token. Does that make sense?

It wouldn't stop you from running the ten-token passage through either a classifier or the Claude prompt here, or whatever else, for further evaluation. It's not the same, but...

Yeah. I mean, in the paper they basically say that if you take the max of the token-level predictions, that's a reasonable first proxy for the ten-token prediction.

Yep. Awesome. If there are no other questions, let's stop sharing. So, guys, I think I'll end the call. Thank you.