Real-Time Hallucination Detection!! [Paper Club]

Chapters
0:00 Speaker Introduction and Context
0:50 Paper Thesis: Real-Time Hallucination Detection During Streaming
1:50 Limitations of Current Methods (Semantic Entropy, External Verification)
2:30 Proposed Method: Linear Probe and Token/Span Level Check
3:58 Key Components: Token-Level Analysis, Web Search Data Generation, Linear Probe
4:50 Critique on Labeling: Injecting Hallucinations vs. Costly Web Search Labeling
7:20 Detailed Data Generation Process and Entity Extraction
11:10 Related Work: Mechanistic Interpretability, Uncertainty, and External Verification
13:25 Token-Level Annotation and False Span Examples (Albert Einstein)
15:15 Probe Architecture: Linear Probe on Nth Decoder Layer (Frozen LLM + Optional LoRA)
17:50 Dual-Use and Post-Hoc Hallucination Detection Capability
24:28 Loss Function: Combining Probe Loss with LM/KL Regularization
25:38 Probe Loss Details: Soft Cross-Entropy for Multi-Token Spans
30:04 Mechanistic Interpretability Rationale for Using Hidden States (Residual Stream)
35:55 Experimental Results: Outperforming Baselines (Perplexity, Entropy)
38:35 Long-Form vs. Short-Form Training Data Discrepancy
42:35 Cross-Model Generalization: Detecting text from one LLM using a different LLM's probe
45:25 Refusal to Answer: Using Probe Threshold to Say "I Don't Know"
48:35 Core Limitations: Entity-Focus and Noisy Ground-Truth Labeling
53:18 Discussion on Entity vs. Reasoning Hallucination Verification
Go ahead. Yeah, let me share my screen first. Hey, I did my PhD work in robotics and continual learning. Currently I'm working at Flagship in Boston on biotech, machine learning for drug development. I can talk more about what I do a bit later, or on Discord. We also have a lot of open positions, so reach out to me if you're a machine learning engineer or an AI engineer, I'd love to have you.
So today I'll be presenting a hallucinations paper. This was a very interesting paper on a topic that has been brewing for a long time. People have been trying to ground models, Gemini, for example, tries to ground models through search, and we always do citations and backtracking information through citations to ensure no hallucinations surface to the users. But this is an interesting approach where the main question is: can you detect hallucination while the model is generating? Say you're streaming responses, and as soon as you detect the model is not confident, can you just not generate it, or say "okay, I don't know"? They use some interesting approaches to do that, so I'll do a deep dive on what they changed in the model architecture, plus some interesting results.

The TLDR: the motivation is, can you identify hallucinations while you're streaming? The limitation of current approaches is that one of the main state-of-the-art methods, semantic entropy, works by sampling multiple times, checking how often the model generates the same result, and grouping the samples; if the cluster is tight enough, then maybe it's not a hallucination. That's not a foolproof method, of course, and it fails often. Other hallucination detection methods are very time consuming and require a lot of resources.
The approach they take: they have something called spans, which are collections of tokens, and they use web search to identify whether a span is true or not (I'll go deeper into how they do this). They attach a linear probe to the model at the final layer, or whatever layer you want, and that essentially gives you the probability that a given token is hallucinated. They train this in the usual next-token-prediction way, with some loss manipulation to ensure the probe probability is correct. They achieve state-of-the-art results for hallucination detection. For context, I also tried to train this model, and it works pretty well, and it generalizes pretty well. You can think of it as instruction fine-tuning for hallucination: even though you don't use that many samples, you can then use the probe to do a hallucination check for any topic the model is pre-trained on, which is kind of awesome.
That's the TLDR, so let's dive into the paper. The main thing here is that they don't identify hallucination for whole sentences; they identify hallucination at the token level, which is super powerful and, like I mentioned, super helpful for streaming. Another key contribution is that they use web search to first generate a dataset for hallucination. If you think back to how instruction fine-tuning was done, you take some text and pass it through a language model to create question-answer pairs. Here, in the same way, they ask an LLM to generate text and use web search to ground it and mark which tokens are wrong. That is the main contribution they mention, though I have some takes on how we could make this even better without web search, which I'll explain later. The other thing I found interesting is the linear probing approach: attaching a small linear probe at the final layer to output the probability that a token is correct or wrong. It's super powerful and can be used for other things as well; they show different applications in the paper, which I'll cover. And, like I mentioned, they report state-of-the-art results compared to baseline methods like perplexity and semantic entropy.
Here is a very high-level view of what their data and labeling look like. You have a response generated by Llama 3.1; there's some text, and the labeling essentially says that the "29" in "29-year-old" is wrong. It identifies the key entities mentioned, and which of those entities are wrong and which are right. It's trained in a very interesting way. Note that they're not predicting on every token, which is also important. For example, in "Riley v. California, 573 U.S. 373", the first tokens ("Riley v. California") are treated as unimportant tokens, and the following tokens ("573 U.S. 373") are the important ones. So you can think of it as two-level classification in the same training round: first, is this token important, and second, is it hallucinated? Which is an interesting approach; I don't think anybody has done that.
I'm going to skip over the introduction; it just says nobody has done this yet and that current approaches are citation-checking methods and so on. The main thing here is that they use a frontier model to do web search and extract entities, which can be either true or false. They use Sonnet 4 with web search for labeling: they give it a sentence, Sonnet does the web search, and it comes back with a JSON saying which spans are wrong, and that becomes the training dataset. My take here, and I haven't tried this yet but I want to soon, is that we don't need web search: we could just take a dataset and inject hallucinations, for example take a Wikipedia dataset and change names and dates, and use that for training. I don't think web search adds anything; it's just API cost. That's my take.
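To make that alternative concrete, here is a minimal sketch of the injection idea I'm describing, not what the paper does: take factual text, swap out entities such as dates and places, and record the swapped character spans as hallucination labels. The spaCy pipeline and the replacement pools are my own assumptions.

```python
# Sketch: build "injected hallucination" training pairs without web search.
# Assumes spaCy and its small English model are installed; replacement pools are toy examples.
import random
import spacy

nlp = spacy.load("en_core_web_sm")

REPLACEMENTS = {
    "DATE": ["March 4, 1871", "July 22, 1990", "1856"],
    "GPE": ["Berlin", "Lyon", "Osaka"],
    "PERSON": ["John Smith", "Maria Lopez"],
}

def inject_hallucinations(text, p_swap=0.3):
    """Return corrupted text plus (start, end, label) character spans marking injected errors."""
    doc = nlp(text)
    out, cursor, spans = [], 0, []
    for ent in doc.ents:
        out.append(text[cursor:ent.start_char])
        if ent.label_ in REPLACEMENTS and random.random() < p_swap:
            fake = random.choice(REPLACEMENTS[ent.label_])
            start = sum(len(s) for s in out)          # offset of the fake entity in the new string
            out.append(fake)
            spans.append((start, start + len(fake), "hallucinated"))
        else:
            out.append(ent.text)
        cursor = ent.end_char
    out.append(text[cursor:])
    return "".join(out), spans

corrupted, labels = inject_hallucinations("Albert Einstein was born on March 14, 1879 in Ulm.")
print(corrupted, labels)
```

As the chat points out below, whether such injected errors stay in-distribution for the generator is an open question.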
Also, feel free to jump in and ask me any questions. Oh, I see a lot of questions in chat already.

"But the ground truth is generated", can you explain what you mean by that?

Yeah, sure. My understanding of why they did it this way is that they took text and generated labels for whether or not the entities were based in fact, because the text itself was a large generated corpus. I assume they generated a lot of it because they said they expanded the long-form dataset by 10x, and I presume they did that to get a larger training set.

Hmm, that makes sense. But don't you think the training set would be even bigger if you just inject hallucinations rather than doing verification?

Yeah, I mean, that's valid. You would still have to... yeah, I guess you could just go and inject hallucinations. Actually, I agree with you.

And, maybe I'm jumping ahead here, but in the results they mention that they used data generated by, say, Gemma and then applied it to Llama and still got the same performance. They even used Llama 3 8B data with the 70B model and got better performance, which means the underlying generator of the text doesn't really matter, which is awesome.

There's a question in chat: I think the web search was for verifying the truth or falsehood of the generation, whereas you might still want to generate hallucinations. True, true. And another comment: injecting hallucinations might give you too much out-of-distribution sampling. That's a valid point if you inject hallucinations; I hadn't thought about that.
Let me jump to the next part of the paper, the related work. I think the motivation here comes from the Tegmark line of work, where they use linear probes to identify internal representations, "truth directions", and use those for hallucination detection, which was orthogonal thinking. The other related work is uncertainty-based detection, which is essentially perplexity: while the model is generating tokens, look at the perplexity of each token. They cite that as another motivation for this work. The last one is external verification, which is what most of the industry uses now: generate citations for your text, and for each sentence verify it against the cited source.
This diagram relates to the dataset they generated; I'll explain more in the next section, but the idea is that they started with the LongFact dataset and added more on top: biography questions, legal prompts, and more topic-focused queries on biology, chemistry, and so on. They had a language model, Llama, generate responses to the queries, then passed the responses to a state-of-the-art LLM, Claude Sonnet 4 with search, to verify at the token level whether each token is correct. That's the data generation process. They also released the dataset, which is super nice, and the code is released as well, so you can train this very easily. Like I mentioned, they took the already-published LongFact dataset, expanded on it, and created a roughly 25,000-question dataset.
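To make the output of that labeling pipeline concrete, here is roughly what one annotated record could look like; the exact field names are my guess for illustration, not the released schema.

```python
# Hypothetical shape of one labeled example (field names are illustrative, not the released schema).
example = {
    "prompt": "Write a short biography of Albert Einstein.",
    "response": "Albert Einstein was born on March 14, 1879 in Berlin, Germany.",
    "spans": [
        {"text": "Albert Einstein", "start": 0,  "end": 15, "label": "supported"},
        {"text": "March 14, 1879",  "start": 28, "end": 42, "label": "supported"},
        {"text": "Berlin, Germany", "start": 46, "end": 61, "label": "not_supported"},
    ],
}
# Binary target used for training: supported -> 0, not_supported / insufficient_information -> 1.
```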
For the token-level annotation there are three labels they let Claude generate: supported, not supported, and insufficient information. Supported is labeled as, say, zero, and both not supported and insufficient information are labeled as one. They also did some human verification of the labeling quality; I don't think n = 50 out of 25,000 really makes a difference here, but they just wanted to check the quality of the labels generated by the state-of-the-art model.

Here is an example of what the labeling looks like. Say this sentence was generated by Llama and you pass it through web search. It identifies that "Albert Einstein" is important, so that becomes a span; "was born on" is unimportant text; then there's a date, which is correct; and "in Berlin", where Berlin is false. On top of that they do token-level labeling, so you can think of it as two-level labeling: for the first span, "Albert Einstein", the importance label is 1 and the hallucination label is 0 because it's true; the date span is the same, important but true; and the third span is important but false, so it's a hallucination. That's how the labeling works.
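Here's a rough sketch of how span labels like the record above can be turned into the per-token targets the probe is trained on, using a tokenizer's character offsets. This is my reconstruction for illustration, not the repo's exact code, and it assumes a Hugging Face fast tokenizer is available.

```python
# Sketch: convert character-level span labels into per-token targets.
# y_span marks "important" tokens, y_halluc marks tokens inside not-supported spans.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def token_targets(response, spans):
    enc = tokenizer(response, return_offsets_mapping=True, add_special_tokens=False)
    y_span = [0] * len(enc["input_ids"])     # 1 if the token lies inside any labeled entity span
    y_halluc = [0] * len(enc["input_ids"])   # 1 if that span was labeled not_supported / insufficient
    for i, (tok_start, tok_end) in enumerate(enc["offset_mapping"]):
        for s in spans:
            if tok_start < s["end"] and tok_end > s["start"]:   # token overlaps the span
                y_span[i] = 1
                if s["label"] != "supported":
                    y_halluc[i] = 1
    return enc["input_ids"], y_span, y_halluc
```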
The next part is how they attach the probes. Here is a very basic, rudimentary explanation. They took a Llama model, so all the dashed boxes here are frozen, and they added an optional LoRA; that's a hyperparameter, and they experimented both with and without it. I think LoRA helped them a little, but they also show that just adding a probe is good enough; you don't need the LoRA, although adding it does improve things. So this is one decoding step: tokens go through the tokenizer and embedding, then through N layers, and at the N-th layer you take the hidden states and pass them through a probe, which outputs a probability: close to zero if the token is not a hallucination, close to one if it is. That loop continues for every token.
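Here's a minimal sketch of that forward pass: the base model stays frozen, you grab the hidden state at decoder layer N, and a small linear head maps it to a per-token hallucination probability. The layer index, model name, and head shape are illustrative assumptions, and the optional LoRA is omitted.

```python
# Sketch: frozen LLM + linear probe on the hidden states of decoder layer N.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"    # any causal LM works in principle
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
lm.requires_grad_(False)                            # base model is frozen (LoRA would be an optional extra)

PROBE_LAYER = 29                                    # "N-th decoder layer" is a hyperparameter
probe = torch.nn.Linear(lm.config.hidden_size, 1)   # the only trainable part in the pure linear-probe setup

def hallucination_probs(text):
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = lm(**inputs)
    h = out.hidden_states[PROBE_LAYER]              # (1, seq_len, hidden_size)
    return torch.sigmoid(probe(h)).squeeze(-1)      # per-token P(hallucinated)

print(hallucination_probs("Albert Einstein was born in Berlin."))
```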
Yes, so there's a question from swyx, I think, about the confusion between entities and tokens. For the web search they extract entities, and then they use those entities to label the tokens. In this example, "Albert Einstein" is an entity, "March 14, 1879" is an entity, and "Berlin, Germany" is an entity; Claude will return Albert Einstein as zero, March 14, 1879 as zero, and Berlin, Germany as one. They take those entities, a.k.a. spans, and label the tokens based on that. They use "entities" and "spans" interchangeably in the paper, which is a bit confusing.

Eugene is saying it's pretty cool that the probe output sits right next to the token output. Yes, that is pretty cool: you're not just generating tokens, you're also generating a probability for each token, essentially multi-task, which is kind of awesome.

So is the intent here that this model with the probe is dual-use: it both answers the question as usual and returns a hallucination probability?
Yes. I see. So, assuming I'm using an LLM API that I cannot fine-tune to do this, can I still take the streamed output, pass it through this model, and probe every token? Yes, I tried that and it works.

Oh, okay, that's really nice. So essentially, if you purely want a hallucination classifier, you don't need the decoder head at all, you just use the probe head? Yes.

Very cool that you tried this, because I think in the paper they call that a limitation; there's a section in the appendix showing it's not as effective if you don't have access to the logits, or rather the residual stream, of the generating model.

So here is what I did: I fed in some text, and during the prefill stage, the initial stage where the model processes the context, I was able to calculate the probabilities. I wasn't able to stream it, of course, but once the text was generated by, say, GPT-5, I took it and, during prefill, identified which tokens were hallucinations. I don't need to decode anything, I don't need to generate the next tokens; I just need to see which tokens get flagged.

How do you do that? Essentially you pretend you start from the first token, scan through the text, and read off the probabilities? Yeah. Very cool.
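That experiment looks roughly like the sketch below: take text some other model produced, prefill it through the probe-equipped model in one pass, and read per-token probabilities off the probe, with no decoding. It reuses the hypothetical `hallucination_probs` and `tok` from the probe sketch above.

```python
# Sketch: post-hoc check of text generated elsewhere (an API model, a human, etc.).
# One prefill pass gives a probability per token; no new tokens are decoded.
external_answer = "Albert Einstein was born on March 14, 1879 in Berlin, Germany."

probs = hallucination_probs(external_answer)                         # from the probe sketch above
tokens = tok.convert_ids_to_tokens(tok(external_answer)["input_ids"])

flagged = [(t, float(p)) for t, p in zip(tokens, probs[0]) if p > 0.5]
print("suspicious tokens:", flagged)
```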
The reason they call it a limitation is that in that setup you're not able to identify hallucinations during streaming.

Right. There's also a strong assumption that the data distribution of the text is approximately similar to what the detection model itself would produce, which, for human-like text with enough training data, should be roughly the case. Thank you.

True. It effectively assumes the models are at a similar level, that Llama 8B is at the same level as GPT-5, which is not true, so I would use this with caution. Another approach I was thinking about: take your own training data, train this, and then use it as a final label check before you release the output to the next step of your agent pipeline, so no hallucination gets through there.
Sorry, you lost me for a second. I don't know if this was Eugene or you, but I thought I heard something about the probe predicting the next token.

No, the probe is not predicting the next token; the probe sits next to the next-token prediction. My understanding is the probe actually sits a little earlier, not immediately before next-token prediction. I mean, you have two outputs. Exactly, that's what I meant by "sits next to": when you're predicting the next token you're also computing the probe output, so there are two outputs at the place where you attach the probe. In the code they've made it modular, so you can set N to, say, 31 or 29, and the probe sits at that decoder layer and uses that hidden state.

Right, okay. And that's the whole point of "real time": as the tokens come out, you can actually interrupt generation if you're concerned about hallucination. Yeah, that's exactly right.
Adam, you had a question, and I saw you come on camera; did you want to ask it?

I'll put some stuff in chat. Sorry if I called you out; I'm happy for it to wait until we're further down the paper. Okay, let's wait then. Thank you.

I see the question about LoRA here. Correct me if I'm wrong, Adam, but are you asking whether LoRA changes the way the model predicts tokens? Yes, it does, and I'll talk later about how the model quality changes when you train this hallucination check, because LoRA does technically change next-token prediction.

My thought was just to clarify the LoRA question: the LoRAs don't seem to be part of the probe, the truthiness check; my guess is the LoRAs are purely there, once you have the ability to evaluate, to make the generations better. Hopefully I'm understanding that right.

I haven't done the ablation of removing and adding LoRA, so I'm just going by what they report. When I trained it, I did use LoRA, so I don't know what happens if you train just the probe, whether the quality changes or not.
Let me jump to the next part: what does the loss look like? You have two things now, a probe and regular generation. They have a regularization term, which they set to 0.1, so the majority of the loss comes from the probe and some of it from the regularizer. For the regularizer they actually have two options: one is a language-model loss, which is just next-token cross-entropy, and the other is a KL divergence between what the model without LoRA would predict and what the model with LoRA predicts, essentially keeping the LoRA model's distribution close to the original model's. Those are the two regularizers they experiment with, and they show the comparison in the results section.
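A sketch of what that combined objective looks like, with lambda playing the role of the regularization weight they set to 0.1; which regularizer to use (LM cross-entropy or KL to the frozen base model) is the design choice they compare. Function names and shapes here are illustrative.

```python
# Sketch: total loss = probe loss + lambda * regularizer (LM loss or KL to the frozen base model).
import torch
import torch.nn.functional as F

def total_loss(probe_loss, logits_lora, logits_base, labels, lam=0.1, use_kl=True):
    if use_kl:
        # KL between the frozen base model's next-token distribution and the LoRA model's,
        # so training the probe/LoRA doesn't drift the generations too far.
        reg = F.kl_div(
            F.log_softmax(logits_lora, dim=-1),
            F.softmax(logits_base, dim=-1),
            reduction="batchmean",
        )
    else:
        # Plain next-token cross-entropy on the same data.
        reg = F.cross_entropy(logits_lora.view(-1, logits_lora.size(-1)), labels.view(-1))
    return probe_loss + lam * reg
```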
The probe loss itself is super interesting, because you have a two-level classification: first you need to identify which tokens are important, which tokens are entities, and then, for those important tokens, whether they're true or not. What they do is annealing: they start out focused on identifying the spans, the important tokens, and as training continues they shift the weight toward identifying the hallucinations. So the first part of the loss is span detection and the second part is hallucination detection.

And notice that a span has multiple tokens. Going back to my example: say "Berlin, Germany" is four tokens. Not all four tokens need to be false; maybe only the "Berlin" tokens are wrong. So you can't just apply a cross-entropy loss over every token in the span; instead they do a soft cross-entropy by putting a max term here. The idea is that the max of the probe probabilities across the span's tokens should be one, not that every token needs to be one. Surprisingly, this works; I was a bit surprised how well this soft cross-entropy works. I actually think the reason the whole approach works is this loss function, how it switches between identifying which tokens are important and then identifying whether those tokens are true or false.

Any questions on the loss definitions? What exactly does P represent? The probability for that token, the probe probability. Okay, got it.
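Here's a rough rendering of that span-level "soft" term: instead of forcing every token in a labeled span toward 1, only the maximum probe probability within the span is pushed toward the span's label. Shapes are illustrative, and the annealing between span detection and hallucination detection is omitted.

```python
# Sketch: span-level soft cross-entropy -- supervise the max over the span rather than every token.
import torch
import torch.nn.functional as F

def span_probe_loss(token_probs, spans, span_labels):
    """token_probs: (seq_len,) probe outputs in [0, 1];
    spans: list of (start, end) token-index ranges;
    span_labels: 1 = hallucinated span, 0 = supported span."""
    losses = []
    for (start, end), y in zip(spans, span_labels):
        p_span = token_probs[start:end].max()        # only the "worst" token in the span must fire
        target = torch.tensor(float(y))
        losses.append(F.binary_cross_entropy(p_span.unsqueeze(0), target.unsqueeze(0)))
    return torch.stack(losses).mean()
```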
Someone is asking whether it can work with an API. It should be able to, I don't see why not; the assumption is just that you're training your own probe model, and then you can apply it to API outputs.

Another question: if you pre-train the hallucination classifier on an existing hallucination training dataset, does it transfer? That's a commonly asked question, and it was super interesting to me as well: can it generalize to newly generated text? Yes, it works. When I was experimenting, I took all the data they generated and mixed it with some chemistry questions for drug development, and it was able to identify that a given protein structure was wrong, which is pretty impressive; it's using its internal knowledge during hallucination prediction.

Ted is saying we need the internals of the model, and I'm wondering: is that the hidden states, the residual stream, or what exactly? For the probes you need the hidden states. But we don't need the actual hidden states from the model that generated the text, right? We can simulate them with this model. Exactly, yes, that's what I meant: you can take the output from GPT-5, pass it through this model, and identify which tokens are hallucinated.
Does that make sense, Ted? Adam?

So my TLDR here is: in this mechanistic interpretability research, people are saying the model actually knows when it's not confident in what it's saying. If you simply use the methods they discuss in the background section, where you look at the entropy or other properties of the logits, you get a certain accuracy, maybe 60 to 70 percent or whatever. What this paper is saying is that the model has internal knowledge of its uncertainty beyond what shows up in the logits, and that's why the linear probe can outperform logit-based hallucination detection. The linear probe is basically looking for, I'm going to say a direction (if you don't know the mechanistic interpretability literature you might not know what I mean), a direction in the residual stream that says "I'm thinking about chemistry or whatever, but I also have this sense that I'm not really sure what I'm saying." That's only available from the actual model doing the generation, and it's hard to just bolt on. And the reason you want roughly the 95th-percent layer is that the last few layers are about refining next-token prediction, not about representing the underlying thought; in the last five percent or so you tend to lose that thought information because it gets stripped from the residual stream, leaving just next-token information. So you'd get a worse linear probe after the last layer than at around the 95th-percent layer.

That's true, that makes sense.
But I think it bears repeating: the experiment you did used a different model, or potentially a different model, right? You take GPT-5's output, run it through Llama, and use Llama's notion of whether this is a hallucination, and that was also effective?

Yeah, it seemed to work. But a key point is that the information has to be general, something both GPT-5 and Llama know about; if Llama doesn't know it, it doesn't work.

Interesting. Did you compute any curves on that, an ROC curve or anything? No, no, it was just a simple test with maybe 10 or 15 examples, nothing more; I just wanted to see whether it actually works.

Sorry, just to clarify: you took GPT-5's output and calculated the probability of hallucination based on what the open-source model was trained on, so it's basically guessing from that model's knowledge; if there's overlap between the datasets it probably works, otherwise it doesn't, right? Yes. It's not exactly reverse engineering, it's guesswork. Yes, it's guesswork. But where it gets interesting is domain-specific information: you can make a Llama 8B model your domain expert and pass things through it, and that would be very powerful. Say you have proprietary or confidential information of your own, and GPT generates something; you can identify "this is a hallucination according to my knowledge," even if GPT disagrees. So you can mix and match.

It's using a smaller model to detect very specific, I won't even call them hallucinations, factual errors, and you can detect those because the smaller model is trained on them; it could even be faster. Similar in principle to speculative decoding, in a sense, though not using probabilities.
I'd say speculative decoding is a bit different, but I totally agree there's a similarity between the two.

And I think you could go either way: your hallucination detection model could theoretically be larger, but then it'd be expensive, so you probably wouldn't do it that way. The authors would presumably hope that, if this technique were refined and made genuinely useful, GPT-5 would just support it natively and you wouldn't have to run your own standalone Llama side by side. But it's an interesting thing: basically, when you pass any text into this Llama model, whether it comes from GPT-5 or you wrote it yourself, what Llama is answering is "if I had generated this text, would I have been confident, or would I have been uncertain and maybe hallucinating?" And to the extent that Llama is right, you can use it as a hallucination detector for your five-year-old kid, for GPT-5, or for whatever you want.

Yep, that's true. Cool.
So the comparisons they did, let me jump to the next part of the paper. First, entropy and perplexity, which are the model's internal signals, how surprised it is by the generated token. Second, semantic entropy, which I think Ted already explained a bit: you sample multiple times, group the answers by meaning, and check how spread out the groups are; if they all agree, it's probably not a hallucination, and if the answers scatter across clusters, it might be. That was actually the previous state of the art. And third, black-box self-evaluation, which is essentially an LLM-as-a-judge.
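For reference, a bare-bones sketch of that semantic-entropy-style baseline: sample the answer several times, cluster semantically equivalent answers, and use the entropy over clusters as the uncertainty score. The `generate_answer` and `same_meaning` functions are placeholders you would plug in yourself (e.g. an API call and an NLI check); this is not the paper's implementation.

```python
# Sketch of a semantic-entropy-style baseline: many samples -> meaning clusters -> entropy.
import math
from collections import Counter

def semantic_entropy(question, generate_answer, same_meaning, n_samples=10):
    """generate_answer(question) -> str (a sampled answer);
    same_meaning(a, b) -> bool (placeholder semantic-equivalence check)."""
    answers = [generate_answer(question) for _ in range(n_samples)]
    clusters = []        # representative answer per meaning cluster
    assignments = []
    for a in answers:
        for idx, rep in enumerate(clusters):
            if same_meaning(a, rep):
                assignments.append(idx)
                break
        else:
            clusters.append(a)
            assignments.append(len(clusters) - 1)
    counts = Counter(assignments)
    probs = [c / n_samples for c in counts.values()]
    return -sum(p * math.log(p) for p in probs)   # high entropy -> answers disagree -> likely hallucination
```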
For the experimental setup they use Llama 8B and 70B models; the repository also lets you use any Hugging Face model, which is cool, and they tested Gemma models as well, which work. You can see in this row that just using perplexity, not doing anything else, works reasonably well, around 0.76 area under the curve. But if you use the probes, either the linear probe or the LoRA probe (to clarify: the linear probe means there is no LoRA, just the probe; LoRA is optional, but adding it helps), detection gets very good; both recall and the area under the curve are pretty high. That's the summary of the results, and they release the dataset as well, about 25,000 total samples.

I'll skip the evaluation details. This is the out-of-distribution performance, which is what I was mentioning: what if the concept is not present in your fine-tuning data but is present in the pre-training data? It still performs well, which means it's making good use of its internal hidden states.
Another result, and I still don't know why this happens, I was a bit surprised by it: if you train only on short-form text, it doesn't perform well on long-form, and if you train on long-form text it doesn't perform as well on short-form. I genuinely couldn't understand this part of the paper, because theoretically it shouldn't matter, but apparently it does. If anybody has insight, I'm happy to hear it; for me this was surprising, because why would the length of the text matter for hallucination detection on a single token?

I thought that was maybe the most interesting plot in the paper. My explanation was that the probe is actually looking at the context around a token and saying "this is a likely place for a hallucination."

If that were true, then long context should always beat short context, right? Does it?

Well, maybe that actually supports what we see here: in the long-form case the model may rely more on wider context, whereas the short-form one relies more on the fact itself, or something like that.

Why is this result unintuitive to you? Okay, think of it this way: a short-form example is a single question-answer pair, maybe a 100- or 200-token answer with some tokens marked as hallucinations, and you train on that. Long-form is essentially an essay, they mention something like 8,000 tokens, with hallucinations marked. They take two probes trained on short and long respectively and apply each to the other dataset, and they don't perform the same way. For me the question is: how does the length of the text matter?

Ted here. My intuition about this result is that it's not that hallucination looks any different, it's about whether or not we trained a good linear probe. What they're saying is that the signal in the short-form data, precisely because it's short, isn't a super accurate signal, so the probe just doesn't learn very well. If you get a really accurate signal by training on long-form, it works pretty well on short-form, because it's the same hallucination signal. It's really just a matter of what training data you need to get the probe well trained; that's how I read Section 5.

Okay, that makes sense. So I would relabel the plot as performance with good-quality training data versus performance with bad-quality training data, or something like that.
Yeah, that makes sense. So here is the cross-model generalization, which was super interesting. You can think of it as text generated by one model but probed with another: say they generate text with Qwen 2.5 7B and probe it with a 70B model, and it works. This is an indication of why one model's text works with another model's probe, and why GPT-5 output worked when I tested it with Llama. Another surprising thing: if the generating model is weaker, say the 7B, and the probe model is bigger, the 70B, it actually performs better, which is interesting and also intuitively makes sense: the 70B model is more capable and can identify hallucinations well from its internal representations. Any questions on the cross-model generalization?
I have 12 more minutes, so here is the LoRA part; I think I skipped over something here earlier. They hypothesize: if we add a LoRA, are the tokens produced different from the model without LoRA? It has to be different, because the LoRA contributes to token decoding as well as to the probe. So what they do is play with the regularization term; for context, jumping back to the loss, there's the probe loss and then the regularizer, and they vary the lambda term up and down and look at how the model performs on MMLU, essentially its general strength. What they see is that it does have an effect, but there's a sweet spot, with the KL divergence weight around 0.5, where the model doesn't deviate too much from the original model's decoding but still performs well on hallucination detection. Very interesting results.
I have about 10 minutes left, so let me jump to the next result, which is pretty interesting. In streaming, say you're generating tokens: first, second, third, fourth, and at the fifth token you realize it's a hallucination. What if you stop generating and say "I don't know"? You can do that, because you have a hallucination probability for every token as you generate with this model. That's what they did, and they show that with a threshold of, say, 0.5 you can essentially increase the model's effective performance, because you can refuse to answer what you actually don't know. That's a skill current LLMs lack: even when they know something is wrong, they keep generating. But you can introduce this skill, where as soon as the model realizes it doesn't know, it stops. So in addition to hallucination detection, this is one capability you can add: saying "I don't know." Super useful. Any thoughts on this? I thought it was a really good result.

Could you explain the graph? I don't think I understand it properly; I don't understand why you could have an attempt rate of less than one when there's no monitoring, the rightmost point.

So, say you have super aggressive monitoring, t = 0.1: token one has probability 0.01, token two has probability 0.1, so you stop at token two and don't generate more. Understood, but at t = 1, the rightmost point, there's no monitoring, so why is the attempt rate 0.8? Because there are still some questions the model doesn't attempt on its own: it will still say "I don't know" to some things simply because it doesn't know and recognizes that. Got it, okay. At least that's my understanding of how they did this. Yeah, that makes sense, thank you.
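A minimal sketch of that refusal behavior during streaming, assuming you can read a probe probability for each token as it is decoded; the `step` function is a placeholder for one decoding step of a probe-equipped model, not an existing API. If any token crosses the threshold t, generation stops and the model answers "I don't know."

```python
# Sketch: abort generation and refuse once the probe crosses a threshold t.
def generate_with_refusal(prompt, step, threshold=0.5, max_tokens=256):
    """step(prompt, generated) -> (next_token_text, probe_prob): placeholder for one
    decoding step of a probe-equipped model."""
    generated = []
    for _ in range(max_tokens):
        token, p_halluc = step(prompt, generated)
        if p_halluc > threshold:
            return "I don't know."                 # refuse instead of streaming a likely hallucination
        generated.append(token)
        if token == "<eos>":
            break
    return "".join(generated)
```

Sweeping the threshold is what produces the attempt-rate curve discussed above: lower t refuses more aggressively, t = 1 disables monitoring entirely.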
It would be super impressive if a model, Claude or something, imitated this and said "I don't know," at least for our agent pipelines.

Other than that, they discuss limitations. They say you need a lot of text, and that entity-level hallucination labels are problematic because they lack the context of the whole thought process, but I think that's okay; it's the best we have. The limitations section isn't very concrete or detailed, and the discussion sections don't really add much to the paper, I would say.

In the appendix they show what the dataset looks like, give the prompts used to generate text, explain the baselines, give the prompts for all the evaluations like the LLM judge, and include some additional results, which is actually good. They also have results comparing against simply fine-tuning models and against black-box evaluation. So I think that's about it.
Any questions, or anything you guys want to bring up?

Maybe a general question: this is really token-level, so if a model is asked something where the answer really spans a lot of tokens, say 10-plus tokens, can this somehow be extended to that case, or do you then need a different hallucination detection method?

You mean a group of, say, 10 tokens, and you want to identify whether that whole group is wrong? That won't work directly, because this is done at the token-prediction level, one token at a time, so out of the box it won't work for a group of 10 tokens. That's actually a good limitation of this paper.
But the Section 6 limitations were kind of interesting: a lot of this relies on the backbone of verifying facts with Sonnet 4 and web search, and even with that there's only about 80% recall and 16% false positives. So even with generated text and Sonnet 4 plus web search trying to verify the facts, there are still quite a few false positives. It's a bit of a proof-of-concept work; they didn't do much in terms of stricter guardrails to ensure the facts are right. So just a word of caution for people who actually try to use something like this: you can probably get those numbers better with better guardrails, more rounds of filtration. Off the shelf, it's a bit worrying that the pipeline lets in that many false labels, but you can always do better if you try.

That's true, and that's the reason I was saying injecting hallucinations is better than letting the model generate and then doing web search to check whether it hallucinated: with injection you control exactly how much hallucination was injected, whereas here you don't really know.

And I think that's especially true given the results they showed: the cross-model generalization suggests there's intrinsically information in the text, so maybe the out-of-distribution concern isn't such a big deal. But I would have a lot of concern, and thank you for bringing this up, Vibhu, because I had the same concern: if you only have 80% recall, or especially those false positives, in your training set and your validation set, how do you even know how well you're doing? All you really know is that you can roughly reproduce the Sonnet web-search labels; you don't really know how you perform in the real world.
Yeah, that's true. While we're at it, you might as well highlight the third limitation, since it builds on that: this whole paper is about entity hallucination, facts, names, things like that. It's not trying to check for hallucination in reasoning, or in contextual stuff. Someone here mentioned using this for RAG, Ankit brought it up, for RAG context; this is only trained on entities, so it checks things like "is this date correct," but it's not necessarily going to work on reasoning, or on "is this the correct context." That's another limitation. And when it comes to training it, you can generate data with RAG, but then this verification step doesn't really cover it; you'd have to do more work there.

That's true. I'd be interested in whether you even need to ground the thinking; maybe you don't ground the thinking and only ground the generation part, or something like that.

I think there are aspects to both. In the ed-tech space, say, if you have a copilot that's teaching you something, there's value in the reasoning steps: you want valid reasoning, you don't want hallucinations there, because when you're trying to learn you often care more about the process than the end output. So it's always useful, and it's not that verifying reasoning is intractable, it's just a different objective that this isn't trained on. You can check whether something is a proper reasoning path; it's just not what this repo does. Something to think about.
And like Ted mentioned, I think this is part of the MATS work, right? Yeah, it's an internship project in the interpretability and safety space. I think it was an internship project because only one of the authors is from that organization; I don't know which one exactly, though, because Neel Nanda is at Google. Yeah, he's a big MATS advisor. A lot of the MATS papers have just started coming out; John Schulman was advising one, so technically there's another Thinking Machines paper too, but it's all MATS internship work, and this is some of the cooler stuff. Yeah, definitely. Awesome.
Adam, you have a hand up.

Yeah, if I can follow up on what Vibhu was saying at the start, sorry if I mangled your name. I put a question in much earlier: it seems like when they do the entity checking, they're mostly checking "does this date, this name, this whatever exist," or maybe exist in some similar context. It jumped out at me that it's good to check that a name is real, but it seems like the model could score well by using "John Smith" for every name, you know what I mean? It suggests there's a way for it to be orthogonal to truth. I imagine there's something in the paper that addresses that; I didn't read it before arriving, but I wanted to raise something in that space.

I think the paper does go over it: they use Claude 4 with web search, break the text down into minimal spans, search the web, and give each span one of three labels, supported, not supported, or insufficient information; they have a whole section on it, and then it becomes a binary classification. If you go to Appendix C.3, that's the prompt Sonnet gets to verify the fact, so it's not just "does Berlin exist," it's actually verifying "was Einstein born in Berlin," and it does web search, so it's not based only on Sonnet's own knowledge; it's somewhat more grounded.

Yeah, it is a bit more grounded; it doesn't just rely on the model's own information. But the problem comes when you deviate from this dataset; that's where I don't know how well it works.
It works, kind of. And in terms of basic improvements from internship project to production, there are things like this: the guardrails that aren't there. Say you want to extend this to verify reasoning or other stuff; right now the labeling is doing three things in one call, one call that does the expanding, the search, and the labeling. The simplest fix is to break those into three separate calls; you're giving three very distinct tasks to one LLM call, and sure, it can probably do it, but you could probably do better.

Yeah, exactly. First call: verify whether it's true. Second call: which parts are not true. Third call: where exactly in the text those parts are, something like that. Break it into three calls and I think your false positives and your precision will improve drastically. And if you have real data instead of synthetic stuff, you can do more of this in parallel, bring data back, augment it. But I think it works as a good proof of concept. Yes, and it's super cool that they opened the repo; you can experiment with it a lot, and it's a very good-quality repo.
Splitting it into three like that seems like sort of what Peter was asking about earlier: you could attach that to whatever you wanted.

That's true, but Peter wants to predict whether a set of, say, 10 tokens is true or not, and you can't really do that here; the label is per token. Does that make sense?

It wouldn't stop you from running the 10-token passage through either a classifier or the Claude prompt here, or whatever else, for further evaluation. It's not the same, but in the paper they basically say that if you take the max of the token-level predictions, that's a reasonable first proxy for the 10-token prediction.

Yep. Awesome. No other questions? Let's stop sharing there.