Weakly Supervised Detection of Hallucinations in LLM Activations
Identification
DOI: 10.48550/arxiv.2312.02798
Publication Date: 2023-01-01
AUTHORS (5)
ABSTRACT
We propose an auditing method to identify whether a large language model (LLM) encodes patterns such as hallucinations in its internal states, which may propagate to downstream tasks. We introduce a weakly supervised auditing technique using a subset scanning approach to detect anomalous patterns in LLM activations from pre-trained models. Importantly, our method does not need knowledge of the type of patterns a-priori. Instead, it relies on a reference dataset devoid of anomalies during testing. Further, our approach enables the identification of pivotal nodes responsible for encoding these patterns, which may offer crucial insights for fine-tuning specific sub-networks for bias mitigation. We introduce two new scanning methods to handle sentences that deviate from the expected distribution in either direction. Our results confirm prior findings of BERT's limited internal capacity for encoding hallucinations, while OPT appears capable of encoding hallucination information internally. Our scanning approach, without prior exposure to false statements, performs comparably to a fully supervised out-of-distribution classifier.
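The core idea the abstract describes is non-parametric subset scanning: activations of test sentences are converted into per-node empirical p-values against a clean reference set, and the scan searches for the subset of nodes whose p-values are jointly more significant than expected under the reference distribution. The sketch below is a minimal, hypothetical illustration of that idea using a Berk-Jones scan statistic and a linear-time prefix search over nodes; the function names, the fixed alpha grid, and the greedy search are assumptions for illustration, not the authors' released implementation.

```python
import numpy as np

def empirical_p_values(reference, test):
    """One-sided empirical p-values per node: fraction of reference
    activations >= each test activation (add-one smoothing avoids p = 0).
    reference: (n_ref, n_nodes), test: (n_test, n_nodes)."""
    n_ref = reference.shape[0]
    counts = (reference[None, :, :] >= test[:, None, :]).sum(axis=1)
    return (counts + 1) / (n_ref + 1)

def berk_jones(n_alpha, n, alpha):
    """Berk-Jones scan statistic: divergence between the observed fraction
    of significant p-values (<= alpha) and the expected fraction alpha."""
    if n == 0 or n_alpha == 0:
        return 0.0
    q = n_alpha / n
    if q <= alpha:          # not more significant than expected
        return 0.0
    score = n_alpha * np.log(q / alpha)
    if q < 1.0:
        score += (n - n_alpha) * np.log((1 - q) / (1 - alpha))
    return score

def scan_nodes(p_values, alphas=(0.01, 0.05, 0.1, 0.25, 0.5)):
    """Search over significance thresholds and node subsets for the
    highest-scoring (most anomalous) group of activations.
    p_values: (n_test, n_nodes). Returns (score, alpha, node indices)."""
    best = (0.0, None, None)
    n_test = p_values.shape[0]
    for alpha in alphas:
        # per node: how many test sentences are significant at this alpha
        sig_counts = (p_values <= alpha).sum(axis=0)
        # order nodes by significance; the best subset at this alpha is a
        # prefix of this ordering, so a linear sweep over prefixes suffices
        order = np.argsort(-sig_counts)
        cum = np.cumsum(sig_counts[order])
        for k in range(1, len(order) + 1):
            score = berk_jones(cum[k - 1], k * n_test, alpha)
            if score > best[0]:
                best = (score, alpha, order[:k])
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(500, 64))         # clean-set activations (assumed)
    test = rng.normal(size=(50, 64))
    test[:, :8] += 1.5                             # inject an anomalous node subset
    score, alpha, nodes = scan_nodes(empirical_p_values(reference, test))
    print(f"score={score:.1f}, alpha={alpha}, top nodes={sorted(nodes[:8])}")
```

In this sketch, the nodes in the returned subset play the role of the "pivotal nodes" mentioned in the abstract, and a subset score well above the scores obtained on held-out clean sentences signals anomalously encoded structure in the activations.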