
Real-Time Hallucination Detection!! [Paper Club]


Chapters

0:00 Speaker Introduction and Context
0:50 Paper Thesis: Real-Time Hallucination Detection During Streaming
1:50 Limitations of Current Methods (Semantic Entropy, External Verification)
2:30 Proposed Method: Linear Probe and Token/Span Level Check
3:58 Key Components: Token-Level Analysis, Web Search Data Generation, Linear Probe
4:50 Critique on Labeling: Injecting Hallucinations vs. Costly Web Search Labeling
7:20 Detailed Data Generation Process and Entity Extraction
11:10 Related Work: Mechanistic Interpretability, Uncertainty, and External Verification
13:25 Token-Level Annotation and False Span Examples (Albert Einstein)
15:15 Probe Architecture: Linear Probe on Nth Decoder Layer (Frozen LLM + Optional LoRA)
17:50 Dual-Use and Post-Hoc Hallucination Detection Capability
24:28 Loss Function: Combining Probe Loss with LM/KL Regularization
25:38 Probe Loss Details: Soft Cross-Entropy for Multi-Token Spans
30:04 Mechanistic Interpretability Rationale for Using Hidden States (Residual Stream)
35:55 Experimental Results: Outperforming Baselines (Perplexity, Entropy)
38:35 Long-Form vs. Short-Form Training Data Discrepancy
42:35 Cross-Model Generalization: Detecting Text from One LLM Using a Different LLM's Probe
45:25 Refusal to Answer: Using Probe Threshold to Say "I Don't Know"
48:35 Core Limitations: Entity-Focus and Noisy Ground-Truth Labeling
53:18 Discussion on Entity vs. Reasoning Hallucination Verification

Whisper Transcript

00:00:00.000 | go ahead yeah um so let me share my screen first i think yeah so hey i'm um i did my phd work on robotics
00:00:15.520 | and continuous learning um currently i am working
00:00:21.840 | at flagship in boston on biotech uh machine learning for drug development
00:00:29.840 | um yeah so i can talk more about what i do a bit later or on discord um and also we have a lot of
00:00:38.880 | open positions reach out to me if you're an ml slash machine learning engineer or ai engineer i would
00:00:44.320 | love to have you um so today i'll be presenting a hallucinations paper uh this was a very
00:00:53.120 | interesting paper um it's a topic which has been brewing for a long
00:01:03.120 | time um you know people have been trying to ground models like gemini tried to ground models through
00:01:09.840 | search um you know we always do citations and backtracking information
00:01:16.480 | through citations and ensuring that no hallucinations are surfaced to the users uh but this is an
00:01:21.920 | interesting approach uh where the main focus is can you detect hallucination while the model is
00:01:28.000 | generating um let's say you're streaming responses and as soon as you detect like oh the model is not
00:01:35.600 | confident um can you just not generate it or say okay i don't know um so they use some interesting
00:01:42.320 | approaches to do that um so i'll do a deep dive on what they changed what model architecture they
00:01:48.320 | changed and like some interesting results as well um so the tldr is that the motivation is that can you
00:01:55.680 | identify hallucinations while you're streaming um the the limitation of the current approaches is that
00:02:02.240 | uh you know one of the main state-of-the-art methods is called semantic entropy um the idea here is
00:02:08.080 | that you just sample it multiple times and you see how many times it generated the same result and
00:02:13.200 | you group them and then make sure that like you know okay the cluster is small enough then okay maybe
00:02:19.280 | it's not a hallucination it's not a foolproof way of course um it fails all the time um other
00:02:26.080 | hallucination methods are very time consuming and requires a lot of resources um yeah so the approach
00:02:33.840 | they did is um they have something called spans uh they say it's a collection of tokens um and
00:02:41.680 | they use web search to identify okay if the span is true or not um i will go a bit deeper and like how
00:02:49.040 | they do this etc um and they have a probe uh a linear probe that they attach to the model at the final
00:02:56.320 | layer or whatever layer you want and then that essentially gives you the probability that this token is
00:03:01.040 | hallucinated or not uh and they train this model in the usual next token prediction way and
00:03:10.320 | they do some loss manipulation to ensure that the probe probability comes out correct
00:03:17.120 | and uh yeah they achieved state-of-the-art results for hallucination detection um and the context here is
00:03:23.600 | that i also tried to train this model and it works pretty well um and also it generalizes pretty well
00:03:29.760 | um so you can think of it as like instruction fine tuning for hallucination um so even though you
00:03:36.160 | don't use that many samples uh afterwards you can use that probe to do a hallucination check
00:03:43.120 | for any topic that the model is pre-trained on um which is kind of awesome
00:03:50.000 | yeah so let's uh that is a tldr so let's dive into the paper um here um the main the main thing uh here is
00:04:00.240 | that they don't identify hallucination for the whole sentences uh they identify hallucination on the token
00:04:06.400 | level uh which is super powerful uh like i mentioned super helpful for streaming um and uh yeah so and
00:04:14.000 | another key contribution they've mentioned is that uh they they use web search to first generate a data
00:04:21.200 | set for the hallucination um you know if you if you go back and think of how instruction fine tuning was done
00:04:26.720 | uh you just get some text and you pass it through a language model to create like you know
00:04:31.600 | some question answer pairs uh here in the same way they ask an llm to generate some text and they use
00:04:38.320 | a web search to ground that and make sure like okay this text this token is
00:04:45.440 | wrong etc so that is the main contribution they mentioned um but i have some takes on this uh let me
00:04:52.640 | explain like why we can make this even better we don't need web search um so yeah i'll explain that
00:04:58.640 | a bit later um and then the other one is super i found it interesting is the linear probing approach
00:05:04.720 | uh it's like attaching a small linear probe at the final layer to
00:05:12.480 | identify the probability that this token is correct or wrong it's super powerful and can be used for other
00:05:17.360 | things as well um they have some different applications in the paper which i will cover
00:05:21.920 | um yeah like i mentioned state-of-the-art results um compared to basic methods like
00:05:28.080 | perplexity semantic entropy etc uh they were able to beat all of them
00:05:33.680 | yeah so here is how they here is a very high level view of how their data looks like
00:05:41.440 | um how the labeling will look like um here you have a response generated by llama 3.1
00:05:47.920 | um you know we have some text here and uh it essentially says that okay this uh 29 year old
00:05:54.480 | is 29 is wrong um and then it's able to identify what are the key components uh entities they mentioned
00:06:01.440 | uh and also identify which of these components are wrong which of these components are right etc
00:06:06.160 | um very high level um it's trained in a very interesting way um
00:06:14.480 | yeah so and not so they're not predicting on every token which is also super important
00:06:22.560 | um so every token doesn't have a label like you know for example uh riley v california uh here riley
00:06:28.800 | v california is not labeled as zero uh they essentially treat the first few tokens of riley v california here
00:06:35.440 | as unimportant tokens and then they have 573 u.s. 373
00:06:42.880 | uh the following tokens as important tokens so you can think of it like they're doing two level
00:06:48.480 | classification uh one is like hey the token is important and the second one is is it hallucinated
00:06:54.960 | in the same training run right um which is an interesting approach i don't think anybody
00:07:00.320 | has done that um yeah i'm gonna skip over the introduction part of it it's just uh they just
00:07:08.080 | you know uh mentioned that okay nobody has done this the current models are uh all like citation checking
00:07:13.920 | methods uh etc okay so yeah the main thing here is uh yeah so they mentioned that they use a frontier model
00:07:27.840 | to do web search and extract entities um which can either be true or false they use
00:07:36.480 | sonnet here sonnet 4 with web search uh to label to essentially extract uh they give it a sentence and
00:07:44.880 | then sonnet does a web search and comes back with a json saying that okay these spans are wrong
00:07:50.560 | and they use that as a training data set so my take here is that i've not tried this yet but i want
00:07:58.080 | to um i will do that very soon we don't need web search my take here is that okay we can just take a
00:08:04.400 | data set and then we can inject hallucinations uh let's take a wiki data set and change the names and dates
00:08:10.640 | etc and use that for training as well i don't think um you know doing web search you know adds anything
00:08:19.120 | um it's just you know api cost um so that is my take um
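To make the presenter's alternative concrete, here is a minimal sketch (not from the paper) of injecting hallucinations into known-good text by swapping entities, so the corrupted spans come with labels for free. The record format, entity types, and replacement pools below are assumptions for illustration only.

```python
import random

def inject_hallucinations(record, entity_pool, p_corrupt=0.3):
    """Swap some gold entities for wrong ones and keep the labels for free.

    record: {"text": str, "entities": [{"span": str, "type": str}]}  (hypothetical schema)
    entity_pool: {"date": [...], "place": [...]} of plausible but wrong replacements
    """
    text, labels = record["text"], []
    for ent in record["entities"]:
        if random.random() < p_corrupt:
            wrong = random.choice(entity_pool[ent["type"]])
            text = text.replace(ent["span"], wrong, 1)          # corrupt this span
            labels.append({"span": wrong, "hallucinated": 1})
        else:
            labels.append({"span": ent["span"], "hallucinated": 0})
    return {"text": text, "labels": labels}

# Example: corrupt a wiki-style sentence so a known-false place or date appears.
record = {"text": "Albert Einstein was born on March 14, 1879 in Ulm.",
          "entities": [{"span": "March 14, 1879", "type": "date"},
                       {"span": "Ulm", "type": "place"}]}
pool = {"date": ["March 4, 1874"], "place": ["Berlin"]}
print(inject_hallucinations(record, pool))
```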
00:08:23.120 | yeah so and also feel free to uh jump on and ask me any questions oh i see a lot of questions in chat already
00:08:36.480 | but the ground truth is generated uh so can you explain uh what do you mean by that
00:08:42.480 | yeah sure i so my understanding of why they did it this way was that they're they took the they they
00:08:50.240 | took text and they generated whether or not the entities were based in fact or not because the text itself
00:08:59.200 | was a large generated corpus so so that um you can get i assume they generated 10x because they said
00:09:09.680 | they like expanded the longfact or whatever it was called data set by 10x and i presume they
00:09:17.360 | did that to get a larger training set hmm yeah that makes sense that makes sense um don't you think the
00:09:26.000 | train set will be even more if you just do uh inject injection of hallucination rather than like you
00:09:31.520 | know verification uh yeah i mean that's valid yeah that's it i mean it um you would still have to
00:09:42.560 | yeah i guess sure so you could just go and like um in in inject hallucinations yeah no i guess you know i agree
00:09:50.560 | with you actually yeah um and because in the uh maybe i'm jumping too ahead here but in the results
00:09:58.160 | they mentioned that uh they use data generated by let's say gemma and then they used it on llama and they
00:10:04.800 | still got the same amount of performance uh so even when they used llama 3 8b data with a 70b model
00:10:13.920 | they were able to get actually better performance um so which means that you know
00:10:18.320 | what generator produced the underlying text doesn't really matter uh which is awesome um
00:10:24.000 | yeah so "the web search was for verification of the generation, for the false ones you might want..."
00:10:33.520 | there is another question saying that uh i think the web search was for verifying the truth or falsehood
00:10:40.560 | uh of the generation and you might want to generate hallucinations so
00:10:43.360 | true true uh there's a question saying that it might be too much out of
00:10:54.960 | distribution sampling uh that is a valid point if you inject hallucinations it might be too much out of distribution
00:11:00.880 | um that's a very valid point and i didn't think about that um yeah so let me jump into the next
00:11:09.920 | part of the paper which is related works uh so i think the motivation here is from the
00:11:15.840 | tegmark paper uh where they use linear probes to identify what is the representation or like the truth
00:11:24.080 | directions uh they use that to do hallucination detection which was super uh like just orthogonal
00:11:32.000 | thinking um and then the other related work is an uncertainty like based detection this is essentially
00:11:38.880 | like perplexity right so when the model is generating tokens identifying like hey what is the perplexity
00:11:43.920 | of this token uh that is another they say it's a motivation for this work um and um yeah so and the
00:11:52.960 | last one is external verification uh this is what most of the industry uses now uh like essentially
00:11:58.400 | hey generate citations for your text and then okay for this sentence the citation points to this cited text um
00:12:05.120 | verify that um so this diagram is related to the data set they generated um so i will give a more
00:12:19.840 | explanation on this in the next section um but the idea here is that they started with the longfact data
00:12:25.680 | set uh and then they added some more uh like you know biography questions legal prompts uh more topic
00:12:31.840 | focused queries uh these topics were biology chemistry etc and uh they asked a language model like llama
00:12:38.720 | once the language model generated answers to the queries they passed them to like i mentioned a state-of-the-art
00:12:45.360 | llm like claude 4 with search to verify on a token level are these tokens correct or not
00:12:53.920 | um yeah so that is the data set generation process and they also released the data set which is super
00:13:02.080 | nice and the code is also released by the way you can train this super easily um
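For concreteness, one labeled example from the web-search annotation step described above might look roughly like the record below. The field names are guesses; the talk and paper only describe the general shape (entity spans plus a supported / not supported / insufficient-information verdict mapped to a binary label), and the released dataset may use different keys.

```python
# Hypothetical shape of one labeled example from the web-search annotation step.
labeled_example = {
    "prompt": "Write a short biography of Albert Einstein.",
    "response": "Albert Einstein was born on March 14, 1879 in Berlin, Germany...",
    "spans": [
        {"text": "Albert Einstein", "char_start": 0,  "char_end": 15, "verdict": "supported"},
        {"text": "March 14, 1879",  "char_start": 28, "char_end": 42, "verdict": "supported"},
        {"text": "Berlin, Germany", "char_start": 46, "char_end": 61, "verdict": "not_supported"},
    ],
}

# Binary mapping described in the talk: supported -> 0,
# not supported / insufficient information -> 1.
LABEL_MAP = {"supported": 0, "not_supported": 1, "insufficient_information": 1}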
00:13:06.640 | yeah so like i mentioned for the data set they took the longfact data set which was already published and
00:13:14.160 | they expanded on that and then created this roughly 25,000 question data set
00:13:21.600 | yeah uh and for the token level uh like yeah the token level
00:13:30.640 | annotation there are three annotations that they let claude generate supported not
00:13:36.560 | supported and insufficient information um so supported is labeled as let's say zero and
00:13:44.480 | not supported and insufficient information are labeled as one um so that is what they did um
00:13:51.200 | and they also did some human verification you could say so to check whether the quality of the
00:14:00.080 | labeling is good or not i don't think the exact n here out of the 25,000 really makes a difference uh but
00:14:06.400 | they just wanted to verify the quality of the labels which are generated by the
00:14:11.840 | state-of-the-art model so this is an example of how the model labeling looks like this is
00:14:19.920 | the model let's say this is the sentence generated by llama 3b and you pass it to a web search
00:14:26.000 | and then it identifies that okay albert einstein is important let's label that as a span uh "was born on"
00:14:31.280 | is unimportant text uh they put a date here which is correct and then "in berlin" okay berlin is
00:14:37.840 | false and then um they do a token level labeling um so we can think of it as a two-level labeling
00:14:45.200 | first level is like okay albert einstein is an important token span so its span label is
00:14:51.520 | one uh and its hallucination label is zero because those are true tokens uh the second one the date is
00:14:58.160 | the same thing uh it's important tokens and it's true and the third one berlin is
00:15:06.080 | an important token but it's false it's a hallucination um yeah so that's how the labeling works
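A minimal sketch of the two-level labeling just described, using the Einstein example. The variable names and the word-level split are illustrative only; the paper's notation and a real BPE tokenization would differ.

```python
# Two-level, per-token labels for the Einstein example (illustrative tokenization).
tokens = ["Albert", "Einstein", "was", "born", "on", "March", "14,", "1879", "in", "Berlin"]
y_span = [1,        1,          0,     0,      0,    1,       1,     1,      0,    1]  # inside an entity span?
y_hall = [0,        0,          0,     0,      0,    0,       0,     0,      0,    1]  # span unsupported (hallucinated)?

# The hallucination target only matters where y_span == 1; filler tokens like
# "was born on" are treated as unimportant and carry no hallucination label.
```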
00:15:12.880 | and the next part is how they attach probes um so here is a very basic rudimentary explanation of how
00:15:20.800 | they attach probes uh they took a llama model so all the dashed boxes here are frozen and
00:15:29.040 | they added an optional lora so this is a hyperparameter they experimented with and without lora
00:15:36.240 | um i think lora helped them a little bit but they also show that just adding a probe on the final
00:15:43.040 | layer uh is good enough you don't need the lora but if you add lora it actually enhances your
00:15:49.120 | generation um so here this is let's say one decoding step you send
00:15:57.360 | the tokenized input through the embedding go through n layers and then at the nth layer um you take the hidden states you
00:16:04.560 | pass them through a probe and then uh this probe outputs a probability uh that probability is zero if
00:16:11.440 | it's not a hallucination and one if it's a hallucination for that token um so that loop continues so every token
00:16:18.960 | produced uh will have a hallucination check
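To make the probe attachment concrete, here is a minimal Python sketch (my own, not the paper's released code) of a linear probe reading the hidden states of the Nth decoder layer of a frozen Hugging Face causal LM. The model name and layer index are placeholders, the probe weights are untrained here (they would be fit with the loss discussed later), and the optional LoRA adapter is omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder; any causal LM works
PROBE_LAYER = 29                                  # roughly 90-95% of depth, per the later discussion

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.requires_grad_(False)                        # backbone stays frozen

# One linear layer mapping hidden states to a per-token hallucination logit.
probe = torch.nn.Linear(model.config.hidden_size, 1)

def hallucination_probs(text: str) -> torch.Tensor:
    """Per-token hallucination probabilities for `text` (shape: 1 x seq_len)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    hidden = out.hidden_states[PROBE_LAYER]        # (1, seq_len, hidden_size)
    return torch.sigmoid(probe(hidden.float())).squeeze(-1)
```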
00:16:24.960 | yes uh so ted is mentioning um okay there's a question from swyx saying i think
00:16:40.720 | yeah so uh the tokens so the question here is that there's a confusion between entities and
00:16:47.040 | tokens um so for the web search they extract entities and they take those entities and
00:16:54.560 | they label the tokens so let's say in your example here albert einstein is an entity march
00:17:05.120 | 14th 1879 is an entity berlin germany is an entity uh and claude will return albert einstein as
00:17:12.640 | zero uh march 14th 1879 as zero and berlin germany as one um and so these
00:17:22.320 | are labeled this way and they take those entities aka spans and then they label the tokens based
00:17:31.120 | on that um and they use entities and spans interchangeably in the paper which is a bit confusing
00:17:46.240 | yeah so eugene is saying that it's pretty cool that uh tokens come out right next to the probe outputs
00:17:52.000 | yes that is actually pretty cool so now you're not just generating tokens but you're also
00:17:56.240 | generating the probabilities for the tokens essentially it's multi-level output uh which is kind of awesome um
00:18:02.160 | so is the intent here that this model uh with the probe will be dual use for both generally answering
00:18:09.840 | the question uh responding as well as being able to return a hallucination probability as well
00:18:16.320 | yes uh i see so assuming that let's say i'm using an llm api which i
00:18:26.400 | cannot fine tune to do this can i still take the output that is streaming and pass it through
00:18:33.440 | and still probe every token yes i tried that and it works oh okay so that's really nice so essentially
00:18:41.520 | you don't really need a decoder if you if you just purely want it as a hallucination classifier you
00:18:46.400 | don't need a decoder head you just use the probe head yes thank you very cool you tried this because
00:18:52.080 | i think in the paper they said that that's a limitation that they can't they know they have a
00:18:58.320 | they have a thing in the in the appendix where they show that it's not as effective to do it if you
00:19:04.160 | don't have access to the logits or at least i mean not the logits but the residual stream
00:19:11.360 | or whatever so what i did is um so um okay so here is what i did so i entered some text and uh during
00:19:21.920 | the pre-fill stage um so during the initial stage where it pre-fills i identified okay this
00:19:28.080 | uh and i was able to calculate the probabilities um so essentially i was not able to stream
00:19:36.560 | it of course but what i was able to do is like okay once the text is generated from let's say gpt-5
00:19:42.320 | uh i was able to take that and during the pre-fill stage identify which tokens are
00:19:49.760 | hallucinations uh right so i don't need to decode it like i don't need to generate the next tokens
00:19:55.760 | after that i just need to see okay which tokens are hallucinated um why do they see a limitation how do you do that
00:20:01.840 | is it essentially you just pretend that you start from the first token and you go through and you just
00:20:07.440 | scan through it and get it yeah very cool yeah so the reason why they say it's a limitation
00:20:15.120 | is because uh you're not streaming like you're not able to identify during streaming right uh yeah yeah
00:20:22.560 | so that's a limitation so the strong assumption here also is that this model the model they're using for
00:20:29.920 | hallucination detection the data distribution of the text is approximately similar which i guess in human
00:20:36.960 | text with enough training data should be approximately similar thank you yeah true yeah yeah so for
00:20:42.400 | example if you're saying okay it needs to be at the same level right so um it needs to be at the same
00:20:50.240 | level like llama 8b should be at the same level as gpt-5 which is not true uh so i would use this with
00:20:56.800 | caution or another approach i was thinking um is like take your training data train this and then use this uh
00:21:06.480 | as your uh like a final label check or something uh before you release it to the next step of your
00:21:12.320 | agent pipeline uh so that there's no hallucination on that um yeah so i'm sorry you you lost me for a
00:21:19.920 | second i i don't know if this is what eugene said or what you said but i thought i heard you say
00:21:24.240 | something about the probe is actually predicting the next token and the probes not predicting the next token
00:21:29.840 | no so that is true the probe is not predicting the next token but the probe sits next to the next token
00:21:36.240 | prediction so uh my understanding is the probe actually sits a little earlier it's not immediately
00:21:46.000 | before next token prediction i mean i mean you have two outputs yeah exactly that is what i meant by sits next to that so the uh
00:21:53.840 | uh uh like what i meant is like you're when you're predicting the next token you're also predicting the probe
00:21:58.640 | so there are two outputs uh at the place where you place the probe
00:22:04.320 | uh in the code they have made it modular so you can say n is equal to let's say 31 or 29 or something
00:22:10.640 | and then it'll sit at the 29th decoder layer and use that hidden state to predict the probe output
00:22:15.760 | right okay right yeah because that's the whole point of this real time is that as the tokens coming
00:22:20.320 | out you can actually um interrupt generation if you are concerned about hallucination yeah that's exactly right
00:22:28.240 | um yeah
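A sketch of the post-hoc use the presenter describes trying: text produced elsewhere (for example another model's API output) is run through the probe model's prefill pass and scored token by token, with no decoding. It reuses the hypothetical `hallucination_probs` and `tokenizer` from the earlier sketch; the BOS-offset handling below is an assumption about the tokenizer's behavior.

```python
# Post-hoc check on externally generated text. No new tokens are sampled; a single
# prefill forward pass gives hidden states, and the probe scores every pasted token.
external_answer = "Albert Einstein was born on March 14, 1879 in Berlin, Germany."

probs = hallucination_probs(external_answer)     # from the earlier sketch
token_strs = tokenizer.tokenize(external_answer)

# probs[0, 1:] skips a possible BOS token so positions roughly line up with token_strs.
flagged = [(tok, float(p)) for tok, p in zip(token_strs, probs[0, 1:]) if p > 0.5]
print("suspicious tokens:", flagged)
```

As noted in the discussion, this only works to the extent the probe model's internal knowledge overlaps with whatever produced the text.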
00:22:29.440 | yeah so yeah that's actually pretty cool adam you had a question and i saw you come on camera did you want to ask it
00:22:43.040 | um go ahead uh i'll put some stuff in chat um uh oh i'm sorry if i call you out like i'm i'm happy
00:22:52.960 | for it to to wait until we're further down the paper okay let's let's wait then thank you thank you
00:22:58.880 | yeah yeah uh i see the lora question here like whether it's not changing the base model yeah so
00:23:06.800 | correct me adam if i'm wrong are you asking like hey does lora change
00:23:12.400 | the way the model predicts a token yes it does and i will talk in the next
00:23:19.840 | part of the paper about how they handle that uh like how the model quality changes when you
00:23:27.120 | are training this hallucination check because lora should technically change the way the next token is
00:23:32.720 | predicted as well i mean my thought was to clarify uh with the lora question um
00:23:42.320 | like the loras don't seem like they're part of the probe the checking the truthiness
00:23:51.600 | evaluation there my guess would be that the loras are purely there once you have the ability to evaluate
00:23:58.400 | to make the results better for later generations and hopefully i'm understanding that right yeah i have
00:24:05.760 | not done the ablation study removing and adding lora so i'm just believing what they
00:24:11.360 | mentioned uh when i trained i did train it using lora so i don't know what will happen if you
00:24:18.240 | just train the probe i don't know whether the quality will change or not
00:24:21.200 | yeah so let me just jump to the next part of it like hey what does the loss look like
00:24:31.600 | uh because you have two things now you have a probe and you have regular generation
00:24:36.000 | um so what they have is a regularization term uh here they have this term they set it to 0.1
00:24:42.640 | um so the majority of the loss is contributed by the probe and some of it comes through the
00:24:48.480 | regularization um the regular loss um so yeah that is the loss so let me
00:25:01.600 | jump deeper into the probe loss because that is interesting okay so for the regularization by the
00:25:09.200 | way they have two types of regularizers one is a language model loss which is
00:25:14.720 | essentially next token cross entropy and the second one is a kl divergence between what the model
00:25:20.320 | without lora would predict and what the model with lora predicts um so essentially
00:25:26.000 | controlling the distribution between the changed model the lora model and the original model
00:25:31.600 | um so these are the two regularization losses they experiment with and they show the comparison
00:25:37.200 | in the results section uh now the probe loss is super interesting what they did
00:25:42.720 | uh because you have a two level classification so you need to first identify which tokens are important
00:25:48.320 | which tokens are entities um then the second level is
00:25:57.680 | for the identified important tokens whether they are true or not um so what they do is annealing
00:26:03.680 | um so first they start out with identifying the spans the super important
00:26:09.920 | tokens uh and then as the training continues they change the weight to identify the
00:26:15.520 | hallucination the first part of the loss is your span detection and then the second part is the
00:26:21.680 | hallucination detection and you can observe here uh because the whole span has multiple tokens
00:26:29.200 | right uh let me go back to my example here uh let's say berlin germany let's say it's
00:26:35.440 | four tokens um again this is just an example uh it's four tokens um so all four tokens need not be
00:26:42.960 | false technically only the berlin part is wrong or false um so only those tokens are false right so you
00:26:49.360 | cannot have a cross entropy loss over everything so they do a soft cross entropy by putting a
00:26:56.160 | max term here right the idea here is that the max of the probabilities predicted across those tokens
00:27:03.760 | should be one not every token in the span needs to be one uh this surprisingly
00:27:10.800 | works um i was actually a bit surprised how well this soft cross entropy works um
00:27:18.400 | yeah so that is the one super important thing about this probability i think the reason why this worked
00:27:25.360 | is because of this loss function how they how they were able to switch between which identifying which
00:27:30.160 | probe like which tokens are important and then switching to identifying okay those tokens are true or
00:27:35.680 | false um yeah so any any questions on the loss uh specific definitions so what what did p exactly
00:27:45.920 | represent the probability of that token so the probe probability the probe probability okay got it yeah so
00:27:53.840 | cool uh so
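A minimal sketch of how the combined objective just discussed could look: a probe loss using the max-over-span soft target for multi-token spans, plus a regularization term weighted by a small lambda. The paper's annealing schedule is not reproduced, the function signatures are mine, and the paper compares the LM-loss and KL regularizers rather than necessarily summing both as done here.

```python
import torch
import torch.nn.functional as F

def probe_loss(probe_probs, spans):
    """Soft cross-entropy over entity spans.

    probe_probs: (seq_len,) per-token hallucination probabilities in [0, 1]
    spans: list of (token_indices, label) with label 1 = hallucinated, 0 = supported
    Only the max probability inside a span has to match the span label, so a single
    wrong token (e.g. "Berlin") can carry the whole span.
    """
    losses = []
    for idx, label in spans:
        p_span = probe_probs[idx].max()                       # max over the span's tokens
        target = torch.tensor(float(label))
        losses.append(F.binary_cross_entropy(p_span, target))
    return torch.stack(losses).mean()

def total_loss(probe_probs, spans, lm_logits, lm_logits_frozen, labels, lam=0.1):
    # Regularization to stay close to the base model: next-token cross-entropy and a
    # KL term between the LoRA-adapted and frozen logits (the paper compares the two;
    # both are shown here purely for illustration).
    lm_ce = F.cross_entropy(lm_logits[:-1], labels[1:])
    kl = F.kl_div(F.log_softmax(lm_logits, dim=-1),
                  F.log_softmax(lm_logits_frozen, dim=-1),
                  log_target=True, reduction="batchmean")
    return probe_loss(probe_probs, spans) + lam * (lm_ce + kl)
```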
00:27:59.120 | uh so someone is asking can it work with an api uh so it should be able to i don't
00:28:13.600 | see why not um again the assumption here is that you are training your own model only then can it
00:28:20.320 | work with an api yeah yeah i think that should answer that yeah
00:28:23.920 | uh yeah there is another question if you pre-trained the hallucination classifier on
00:28:33.840 | existing hallucination training data would it transfer that is a very commonly asked
00:28:40.640 | question so yes um that was super interesting to me as well so the question
00:28:47.920 | was that if we use the existing data can it generalize when we put in newly generated text um
00:28:55.120 | yes it works because when i was experimenting i gave it all the data which they have
00:29:01.360 | generated um and i mixed that with like you know some chemistry
00:29:08.880 | questions uh for drug development let's say uh and it's able to identify like okay this protein
00:29:13.840 | structure is wrong right which is pretty impressive uh it's using its internal knowledge
00:29:18.800 | as well during this hallucination prediction um that's why i'm having a hard time understanding this here
00:29:25.200 | uh so ted is saying that we need the internals of the models so i'm thinking is that like
00:29:30.240 | the hidden states or the residuals or what else so you need the hidden states right so for
00:29:36.480 | the probes you need the hidden states so we don't exactly need the actual hidden state from the model
00:29:42.720 | that was used to generate it right we can simulate it via this model exactly yes that is what i meant by
00:29:49.120 | like uh you can take the output from gpt5 then pass it to this model and identify which tokens are
00:29:54.000 | hallucinated so would that make sense ted adam so um so so my tldr here okay is that uh when you when
00:30:07.600 | you do this mechanistic interpretability research people are saying the model actually knows that it's not
00:30:13.760 | confident in what it's saying and if you just simply do the the the methods they talk about in the
00:30:21.040 | background section where you look at the entropy or or other things about the logits you get a certain
00:30:29.600 | accuracy you know you get 60 70 percent or whatever um but what the what this paper is saying is that the
00:30:36.400 | model has internal knowledge of its uncertainty beyond what shows up in the logits and that's why the linear
00:30:43.440 | probe can outperform just logit based hallucination detection so um and so the linear probe is
00:30:52.160 | actually looking for basically a i'm going to say direction if you if you don't know the the the the
00:31:00.320 | the mechanistic stuff you might not know what i'm saying but it's looking for a direction
00:31:06.800 | on the residual stream that says i'm thinking about whatever i'm thinking about chemistry blah blah blah but i
00:31:12.800 | also have this thought i'm not really sure what i'm saying uh yeah so that's that's only available
00:31:19.680 | from the actual model that did the generation and it's hard to just fit it in and the reason why you
00:31:24.720 | want the around the 95th percent layer is because the last few layers are about refining next token
00:31:32.080 | prediction they're not actually about thought understanding creation and so so around the last five percent you
00:31:41.520 | actually tend to lose thought information because you're stripping it from the residual stream so
00:31:47.520 | that the residual stream is just next token information and nothing else so you would get worse linear probe
00:31:54.640 | information if you did it after the last layer instead of at the 95th percent ish layer that is true
00:32:02.240 | that is true that makes sense yeah but i i think it bears repeating it the experiment that you did was that
00:32:09.840 | you actually used a different model or potentially a different model right so you like take gpt-5's
00:32:18.880 | output run it through llama and then use its um concept of is this a hallucination or not and that was also
00:32:28.240 | effective you mean llama right yeah and that was also effective right yeah yeah it seemed to work
00:32:36.000 | but like a key is that uh that that that information was general you know for that both gpt5 and llama
00:32:42.480 | knew about it right so if if llama doesn't know that it doesn't work okay yeah it would be interesting
00:32:51.200 | so did you did you calculate any loss curves or anything on that i mean not last course uh like a
00:32:56.640 | roc curve or anything no no no no it was just a simple test which i did like you know with like
00:33:03.120 | let's say 10 or 15 examples i didn't do more than that yeah okay yeah i just wanted to see if it actually
00:33:08.240 | works before i you know interesting that's not scary sorry guys just to clarify you took gpt-5's
00:33:16.800 | output and you calculated the probability of hallucination based on what it was trained on
00:33:22.880 | so it's basically guessing based on what it was trained on from an open source model and if there
00:33:29.200 | is an overlap between the data sets then it probably works otherwise it doesn't right yes yes
00:33:36.000 | it's not exactly reverse engineering it's just a guesswork i mean yes it's a guesswork okay
00:33:41.440 | it's a guesswork uh but where it changes is if you have let's say domain specific information
00:33:48.720 | uh you can make the super like you know lama 8b model as your domain expert and then pass it that
00:33:55.440 | through that will be very powerful uh right and so let's say you have some proprietary information or
00:34:02.960 | some confidential information for your own stuff and now we have gpt generating this stuff and you
00:34:08.640 | can identify oh this is a hallucination according to my knowledge um gpt might fight or something
00:34:14.560 | um so yeah so we can mix and match it's using a smaller model to detect some very specific
00:34:22.560 | uh i won't call them hallucinations anymore it's just that factual errors and that's something you can detect
00:34:29.760 | uh because you have a smaller model trained on it it could be even faster
00:34:32.960 | yeah similar principle to speculative decoding in a sense not using probabilities but yeah yeah
00:34:40.640 | i would say spec dec is a bit different um but like you know i totally agree
00:34:46.800 | there is a there is a similarity between these two and i think you could go either way your your
00:34:53.920 | your hallucination detection model could theoretically be larger but of course then it'd be expensive so
00:34:59.200 | you're probably not going to do it that way um the the authors of the paper would probably hope that
00:35:05.520 | if this technique were refined and made better and actually useful that then gpt-5 would just support
00:35:10.560 | it and you wouldn't have to have your own standalone lama that you that you run you know side by side but
00:35:15.600 | it's an interesting thing basically when you pass any text into this lama model whether it comes from
00:35:24.720 | gpt-5 or if you just write it yourself what lama is answering is if i had generated this text would i
00:35:31.920 | have been confident or would i have been uncertain and maybe hallucinating and so to the extent that llama
00:35:40.160 | is right then you can use it as a hallucination detection for your five-year-old kid or you can
00:35:45.360 | use it for gpt-5 or for whatever you want yep that's true cool uh yeah so that's a very good yeah so
00:35:55.600 | the comparisons what they did let me just jump to the next part of the paper the comparisons they did
00:36:01.040 | entropy and perplexity uh these two are like you know the model's internal representation
00:36:07.360 | of how surprised it is by the generated token uh the second one is semantic entropy i think ted
00:36:13.120 | already explained this a bit uh where you like you know you just um sample multiple times and then
00:36:19.680 | like essentially decode the same token multiple times and then group it and see okay what is the direction
00:36:25.200 | of this group and like if all the direction is the same that means that okay probably not a hallucination
00:36:30.640 | if it is like direction is like larger so and that means that your aggregated direction might not be
00:36:38.400 | same for all the generated tokens that means it might be a hallucination that is a comparison that
00:36:44.000 | was a state of the art actually and the black box self-evaluation um so this is essentially a judge um
00:36:50.800 | hell of a message um yeah so the experimental setup here they have a llama 8b and 70b model uh they also
00:37:00.880 | the repository also has ability to use any hogging paste model which is kind of cool uh so but they
00:37:07.520 | did also test it out for like jama models as well which works um you can see here uh this row so just your
00:37:14.000 | perplexity like not doing anything and just basing things on a complexity like ted mentioned works well
00:37:19.600 | like 76 area under the curve is like 76 0.76 but if you use uh like this probes even a linear or a llama
00:37:27.760 | probe like sorry laura probe um uh let me clarify what is a linear probe is the linear probe is there is
00:37:33.760 | no laura right so it's just a proof that's it laura is an optional so but when you add laura it just
00:37:39.680 | this essentially is very good in identifying this both recall and the area under the curve is actually
00:37:44.480 | pretty pretty pretty high um so this is a summary of the result um and they um they go like when they
00:37:53.520 | release the the data set as well which has about 25 000 total samples um yeah so so evaluations i'll skip
00:38:04.320 | all this and uh so this is the out of distribution performance um so this is the like what i was
00:38:12.720 | mentioning like you know what if the concept is not present in
00:38:18.480 | the training data like not in your fine-tuning data but it is present in the pre-training data it still
00:38:24.800 | performs well uh which means that it's using its internal hidden states very well um yeah
00:38:34.240 | yeah another one okay this is actually pretty um i still don't know why this happens i was i was
00:38:41.840 | i was a bit uh uh surprised on this as well so if you train on samples which have only short form text
00:38:48.480 | it doesn't perform well on the long form um so and long-form text if you have if you have
00:38:56.880 | hallucinations identify on long form text it doesn't really perform well on the short form text
00:39:01.040 | um i i don't know why i i don't know why it is different because uh i i genuinely don't know like
00:39:09.360 | i couldn't understand like this part of the paper because it it theoretically it should not matter but
00:39:15.760 | apparently it does so so if anybody has uh any insight on that i'm happy to hear and like um ask questions
00:39:25.760 | but uh for me this part was very surprising because i would assume how does the the length of the text
00:39:31.680 | matter for your hallucination detection of the one token apparently it does right so i was not sure yeah
00:39:38.240 | i i thought that was really uh maybe even the most interesting plot in the paper because i mean my
00:39:46.240 | explanation for that was that it's actually looking at the context around it and saying oh this is a
00:39:53.520 | likely thing to be hallucinated right yeah that is if that is true then long context should always beat
00:40:01.920 | the short context right uh does it well maybe i would say that that supports
00:40:12.240 | exactly what we see here because in the long form case it may be that the model um
00:40:20.320 | relies more on wider context whereas the short form one relies more on the fact itself or something
00:40:28.160 | like that why is this result unintuitive to you okay uh yeah so think of it right so
00:40:37.360 | they they say a short form text is let's say a one answer one question right so they use like you know
00:40:44.240 | like let's say 100 token or 200 token answer and they mark uh some tokens as hallucination and they train
00:40:50.240 | on that that's it the long form is essentially like an essay of uh of 600 or like let's say 10 000 and
00:40:57.440 | here they mention 8 000 tokens or something like that um and they mark hallucinations and uh they take two
00:41:04.880 | models which are trained short and long and they apply it on others other data sets uh which is uh and it
00:41:12.560 | doesn't perform the same way uh so for me how does the length of the text matter uh so yeah that is why
00:41:21.680 | it was ted here i have a question you know and so so i think my my intuition around this result is that
00:41:31.440 | it's not about that the model is any different it's about whether or not we trained a good linear probe and so
00:41:38.400 | what they're saying is that the signal in short is because it's short it's not getting a super accurate
00:41:47.440 | signal and so then the probe just doesn't learn very well i don't think it's about whether or not
00:41:55.120 | hallucination looks different which is why if you get a really accurate signal by training on long it works
00:42:01.520 | pretty well on short because it's the same hallucination signal so it's really just a
00:42:06.080 | matter of what kind of training data you need to get this probe well trained is the way i read section
00:42:11.680 | five okay yeah that makes sense um so essentially yeah i would change the
00:42:21.040 | plot to like you know performance on good quality data versus performance on bad quality data or something
00:42:26.080 | like that yeah okay that makes sense um yeah so here is the cross model generalization
00:42:37.600 | which was super um interesting so here you can think of it as a text generated by one model
00:42:43.360 | but the probe is another model um right so let's say they generate the qwen 2.5 7b text and
00:42:51.040 | they validate it or probe it on a 70b model uh and it works right so here is an indication of why
00:42:59.760 | one model's text works on another model and why gpt-5 worked when i tested it out with llama right so and
00:43:08.800 | another surprising thing is if you have a weaker model which is let's say the 2.5 7b which is maybe a
00:43:18.080 | weaker model and then you have a probe model which is bigger a 70b model it performs actually better
00:43:23.920 | um so which is uh which is uh which is actually kind of interesting and it intuitively makes sense also
00:43:30.720 | um because you have the 70b model is a very intelligent model and can identify hallucinations well
00:43:37.440 | because it's an internal thought well yeah so so any any questions on this aspect in the cross model generalization
00:43:49.280 | okay so i'll continue
00:43:56.400 | i have 12 more minutes so yeah so here is the lora part uh i think i missed a thing here
00:44:05.040 | yeah so these are explanations and the next part is um they hypothesize that like hey if
00:44:12.160 | we add a lora are the tokens produced different than if we don't add lora i mean it
00:44:19.840 | has to be different right because the lora contributes to token decoding as well as the probe
00:44:24.880 | so it has to be different and what they do is that they play around with the the regularization term
00:44:31.520 | so um in the loss regularization term um just to give context let
00:44:38.160 | me jump back to the loss regularization here um so here there's a probe loss and then the regularization
00:44:45.360 | and they play around with the lambda term here they increase and decrease it
00:44:50.640 | and see uh how the model performs on mmlu and essentially the model strength and what they see
00:44:58.480 | here is that it does have an effect but there is a sweet spot here uh with the kl divergence weight at 0.5 where it
00:45:07.920 | doesn't deviate too much from the original model's decoding but still is able to
00:45:15.920 | perform well in the uh hallucination detection so uh very interesting uh results
00:45:27.520 | uh so i have last 10 minutes so let me just jump through the next next one um the next one is uh
00:45:36.480 | pretty interesting so this plot this result here so in in streaming let's say you know you are generating
00:45:43.920 | tokens right say first token second token third token fourth token and then fifth token you realize
00:45:48.960 | that oh it's a hallucination what if you stop generating the token and say i don't know
00:45:54.640 | right so you can you can do that because you are going through like you have a probability of
00:46:02.480 | hallucination for every token and you're generating when using this model to generate tokens so that's
00:46:08.320 | what they did and then they were able to say that if you do that if you have a threshold of let's say
00:46:14.240 | 0.5 and you can essentially increase the model performance um because you can say i don't know
00:46:21.200 | right you can refuse to answer what you actually don't know um which is a skill
00:46:27.680 | which is currently lacking in current llms even though they know it's wrong they keep
00:46:34.240 | generating but you can introduce the skill where as soon as the model realizes it doesn't know
00:46:41.120 | it can stop uh so this is one skill addition you can do to the models in addition to hallucination
00:46:47.520 | detection to say i don't know um super useful yeah so any thoughts on this um
00:46:56.480 | i thought this was actually a super good result
00:47:05.680 | if if could you explain i don't think i understand this graph properly because yeah i don't understand
00:47:12.640 | why you could have an attempt rate at less than one if you have no monitoring right so that t like that
00:47:19.440 | right most point i don't understand that yeah so let's say you have a super aggressive monitoring
00:47:24.000 | right so let's say t is equal to 0.1 right so you have a prop like the token number one the probability is
00:47:30.480 | 0.01 the token number two the probability is 0.1 you stop at token number two you don't generate more
00:47:36.800 | tokens right so that understood yeah but i mean so they're like t equals one like the right most point
00:47:45.840 | that's no monitoring so you never are yeah but then why is the attempt rate then 0.8 uh but because it's
00:47:55.760 | still there are some uh answers it doesn't respond like it it is oh it'll still say i don't know to
00:48:01.680 | some things just because it doesn't know and it recognizes that got it got it okay yeah
00:48:05.680 | yeah at least that is my understanding of how they did this so yeah that makes sense thank you
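A sketch of the refusal behavior being discussed: decode greedily, score each step with the probe, and abort with "I don't know" once the probability crosses the threshold t. Here `tokenizer`, `model`, `probe`, and `PROBE_LAYER` refer to the earlier hypothetical sketch; KV caching and sampling are omitted, and scoring the last position at each step is a simplification of per-token checking.

```python
import torch

@torch.no_grad()
def generate_with_refusal(prompt, threshold=0.5, max_new_tokens=128):
    """Greedy decoding that aborts with "I don't know" once the probe flags a token."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        out = model(ids, output_hidden_states=True)
        next_id = out.logits[0, -1].argmax()
        # Simplification: score the hidden state at the current last position; a stricter
        # version would score each newly appended token on the following step.
        h = out.hidden_states[PROBE_LAYER][0, -1]
        p_hall = torch.sigmoid(probe(h.float())).item()
        if p_hall > threshold:
            return tokenizer.decode(ids[0], skip_special_tokens=True) + " ... I don't know."
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```

Lowering `threshold` makes the monitoring more aggressive, which lowers the attempt rate in the plot being discussed.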
00:48:13.440 | yeah um yeah okay yeah it would be really good if a model like you know claude or something
00:48:21.520 | imitated this like saying i don't know it'd be super impressive uh at least for our agent pipelines
00:48:27.200 | uh yeah so the other than that they say limitations um
00:48:33.280 | the limitations of this paper um they say okay you need a lot of text and um
00:48:41.280 | the entity level hallucination labels uh are problematic because they don't have the context of the whole
00:48:48.960 | thought process etc but i think that is okay it's the best we have so the limitations here
00:48:56.720 | are not as concrete or as detailed um yeah so i think that is it they have some discussions here
00:49:07.520 | but those don't really add anything to the paper i would say
00:49:15.920 | yeah so and the other the appendix is they just explain some uh what the data set looks like they
00:49:23.680 | also give prompts and how they generate the text uh they go on to explain what the baselines
00:49:29.840 | look like they give prompts for all the evaluations like llm judge and everything and they have some
00:49:35.840 | additional results as well um which is actually kind of good um yeah so that is all the things here and
00:49:44.880 | they have some results here which compare across um um they just what if you just fine-tune models and
00:49:51.840 | black box evaluation um cool uh so i think that is about it
00:50:00.240 | any any any any any questions or uh anything that you guys want to bring up
00:50:05.040 | maybe a general question like this was really a sort of token level so if you had a model which
00:50:13.680 | just gets asked something that's where the sort of answer is really quite a lot of tokens like you
00:50:19.120 | need 10 plus tokens uh can this somehow be expanded to this kind of case or do you then
00:50:24.800 | really have to think of different hallucination detection methods so you mean like
00:50:29.360 | let's say a group of like 10 tokens you want to identify if that is wrong right yeah that's yeah
00:50:34.560 | that's not like yeah that won't work because this is done at the token prediction level so it's like one
00:50:40.640 | token at a time so it won't work for a group of 10 tokens um that's actually a very good limitation
00:50:46.480 | of this paper but the section six limitation was kind of interesting where a lot of this relies on
00:50:55.600 | the backbone of uh verifying the facts with sonnet 4 web search right and even with that there's
00:51:03.600 | only like 80 percent recall and 16 percent false positives so even with generated text and then having sonnet 4 with
00:51:11.920 | web search try to verify facts you still have quite a bit of false positives so like it's a bit of proof of concept
00:51:19.520 | work right like they didn't really do that much in terms of like more strict guardrails on ensuring
00:51:26.240 | facts are there so just a word of caution for people that do actually try to use something like this
00:51:31.520 | you can probably get those numbers to be better with better guardrails right just have more rounds
00:51:37.200 | of filtration so like off the shelf it's a bit worrying that the pipeline lets in around 20 percent false labels
00:51:44.320 | but you know you can always do better if you try yeah that's actually true and that is the reason
00:51:49.120 | why i was mentioning like hey injecting hallucination is better than you know let the model generate and
00:51:55.200 | then we'll do web search and check if that it's hallucinated or not uh because you know injecting
00:52:00.880 | hallucination you can control how much hallucination was injected um whereas here you cannot really
00:52:06.800 | you don't know you don't know and i think that's especially true given the results that they were
00:52:13.200 | showing that like the cross model generalization that sort of that they even mentioned this kind
00:52:19.760 | of suggests that it's because like intrinsically there's information in the text so that you should
00:52:26.160 | maybe the concern about being out of distribution is not such a big concern and is maybe uh but but i would
00:52:34.640 | have a lot of concern and this is thank you for bringing that up vibhu because i did have concern about
00:52:39.440 | this um that you know if you only have 80 percent recall or especially those false positives
00:52:48.880 | either way in your training set and in your
00:52:54.800 | validation set then how do you even know that you're um like all you know is that i'm sort of
00:53:02.640 | able to kind of get to where i was with you know the sonnet labels and uh you don't
00:53:12.960 | really know how you perform in the real world yeah that's true i think um while we're at it you might
00:53:19.120 | as well highlight the third limitation since it draws on that all this paper is about entity hallucination
00:53:25.120 | right so like facts names uh stuff like that but you know if you're going deeper it's not trying to
00:53:32.640 | check for hallucination and reasoning right or like contextual stuff stuff like that so like uh someone
00:53:38.720 | in here mentioned using this for rag um let me unkit brought it up using it for rag context uh this is
00:53:48.640 | all only trained on entities right so it's checking stuff like is this date correct is this correct but
00:53:55.440 | it's it's not necessarily going to work on um you know stuff like reasoning stuff like is this the correct
00:54:02.320 | context um that's that's another limitation and then when it comes to how you train it you can generate
00:54:09.840 | stuff with rag but then this this verification step doesn't really you know you have to do more work
00:54:15.120 | there yeah that is true so i would be interested in like you know do
00:54:23.280 | you even need to ground thinking um i think maybe not maybe don't ground thinking and only
00:54:30.080 | ground the generation part of it or something like that because yeah i think there's aspects to
00:54:36.400 | both right like if you think in the ed tech space per se like you have a co-pilot that's teaching you
00:54:41.440 | something there's value in the reasoning steps right you you want you want valid reasoning you don't you
00:54:48.640 | don't want hallucinations there if you're trying to learn you often care more about process than end
00:54:53.440 | output so yeah it's always useful and it's it's not like it's like non-trivial to verify this stuff
00:55:01.200 | it's just a different objective that this isn't trained on like you can you can check if stuff is
00:55:06.640 | a proper reasoning path it's just not what this repo does so it's just something to think about
00:55:11.680 | and yeah uh like ted mentioned this is i think part of the mats work right yeah it's like an
00:55:17.520 | internship that's in the um yeah interp safety space yes i think yeah so i think it was an
00:55:26.400 | internship project because like only one author is from [unclear] um yeah i don't know where though
00:55:34.160 | uh because neel nanda is from google yeah yeah he's the big mats advisor
00:55:40.720 | um a lot of the mats papers just started coming out uh john schulman was advising one so
00:55:46.720 | technically another sort of thinking machines paper but you know it's all mats internship work but this
00:55:52.320 | is this is this is some of the cooler stuff yeah definitely awesome
00:55:58.080 | adam you have a hand up yeah if i can kind of follow up on the sort of thing that uh
00:56:08.480 | vibhu was saying at the start sorry if i got your name wrong um uh yeah like i put a question in like way earlier
00:56:17.520 | on that was uh you know it seems like they're sort of when they do the entity checking they're checking
00:56:27.120 | does this this date this name this whatever exist mostly or like it maybe exist with some like similar
00:56:41.440 | context but it's it's uh it jumped out to me that like you know that you know it's good to check that
00:56:49.520 | the name is is real or whatnot but like it seems like you know you could like the model could score
00:56:56.880 | well using like john smith for every name you know what i mean like it it's suggests that there's uh
00:57:06.720 | a way for it to be orthogonal to truth you know uh i imagine there's something more in the paper that
00:57:15.840 | kind of talks about that uh i i didn't read this prior to to arriving but like i just wanted to
00:57:22.320 | question something in that space i suppose i think the paper goes over it actually they talk about
00:57:27.600 | claude 4 web search so breaking it down to token like minimal text spans and then they search the
00:57:34.560 | web and give it three labels supported not supported or insufficient information like they had a whole
00:57:40.640 | section on it and then they give it a binary classification so like if you go to
00:57:46.320 | appendix c3 yes this is the prompt that sonnet got to say verify this fact so it's not just does berlin
00:57:53.520 | exist it's actually verifying like was einstein born in berlin yes that is and it does web search
00:58:01.440 | so it's not even just based on sonnet it's somewhat more grounded there right yeah it is it's a bit
00:58:07.680 | more grounded there it doesn't you know just use its own internal information
00:58:14.560 | but the problem comes in when you are deviating from this data set right um so that is where i
00:58:21.520 | don't know how well it works it kind of works but yeah and in terms of like basic improvements
00:58:28.640 | from internship project to production like even stuff like this right so like the guardrails that aren't
00:58:34.720 | caught and let's say you want to make this verified for reasoning or other stuff this is still doing
00:58:40.160 | three things in one call right like this is one call that's doing expanding and search and labeling
00:58:46.880 | like the simplest solution break those into three right um you're giving three very distinct tasks to one
00:58:53.520 | llm call yeah sure it can probably do it but you know yeah you could probably do better yeah exactly so
00:58:59.680 | like you know first call make it and verify it's true and then second call like hey which part of
00:59:04.640 | it's not uh not true and then third part like which part in the text is actually not true or something
00:59:09.440 | like that we can break it into three calls and have a better i think your false positives and like your
00:59:15.840 | precision will also improve like drastically if you do that uh yeah i do agree yeah and if you have real data
00:59:22.880 | instead of synthetic stuff you know you you can do more of this in parallel you can bring back data you
00:59:28.240 | can augment it so it's all you can do but i think it works as a good perfect concept yes and yeah it's
00:59:36.480 | actually super cool that they opened the repo you can you can experiment with that a lot like it's a very
00:59:43.280 | um it's a very good quality repo um it's like splitting into three thing like this seems like
00:59:53.440 | sort of what peter was was asking about earlier like you could you could attach that to whatever you
01:00:00.160 | wanted to yes that is true but peter wants to predict if this set of
01:00:08.640 | like you know 10 tokens is true or not but you cannot really do that yeah you cannot really do
01:00:14.080 | that um here the label is per token does that make sense it wouldn't stop you from running
01:00:24.080 | the 10 token passage through either the classifier or the claude prompt here or whatever else for further
01:00:31.200 | evaluation like you it's not the same but like yeah i mean in the paper they basically just say if you
01:00:41.840 | take the max of the token level predictions that's a that's a reasonable first whatever proxy for the
01:00:50.800 | 10 token prediction yep awesome i don't know no other questions let's stop sharing from there
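As a footnote to Ted's point above, a span-level score for a multi-token passage can be proxied by the max of the per-token probe probabilities. A tiny helper, reusing the hypothetical `hallucination_probs` from the earlier sketch (the token indices are placeholders):

```python
def span_hallucination_score(text: str, start_idx: int, end_idx: int) -> float:
    """Max token-level probe probability over a token range, as a span-level proxy."""
    probs = hallucination_probs(text)            # (1, seq_len), from the earlier sketch
    return probs[0, start_idx:end_idx].max().item()
```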
01:01:06.160 | so guys
01:01:14.160 | i think i'll end the call
01:01:27.520 | Thank you.