Weakly Supervised Detection of Hallucinations in LLM Activations
Identification
DOI: 10.48550/arxiv.2312.02798
Publication Date: 2023-01-01
AUTHORS (5)
ABSTRACT
We propose an auditing method to identify whether a large language model (LLM) encodes patterns such as hallucinations in its internal states, which may propagate to downstream tasks. We introduce a weakly supervised auditing technique using a subset scanning approach to detect anomalous patterns in LLM activations from pre-trained models. Importantly, our method does not need knowledge of the type of patterns a-priori. Instead, it relies on a reference dataset devoid of anomalies during testing. Further, our approach enables the identification of pivotal nodes responsible for encoding these patterns, which may offer crucial insights for fine-tuning specific sub-networks for bias mitigation. We introduce two new scanning methods to handle sentences that deviate from the expected distribution in either direction. Our results confirm prior findings of BERT's limited internal capacity for encoding hallucinations, while OPT appears capable of encoding hallucination information internally. Our scanning approach, without prior exposure to false statements, performs comparably to a fully supervised out-of-distribution classifier.
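The core idea the abstract describes is non-parametric subset scanning: activations of test sentences are converted into per-node empirical p-values against a clean reference set, and the scan searches for the subset of nodes whose p-values are jointly more significant than expected under the reference distribution. The sketch below is a minimal, hypothetical illustration of that idea using a Berk-Jones scan statistic and a linear-time prefix search over nodes; the function names, the fixed alpha grid, and the greedy search are assumptions for illustration, not the authors' released implementation.

```python
import numpy as np

def empirical_p_values(reference, test):
    """One-sided empirical p-values per node: fraction of reference
    activations >= each test activation (add-one smoothing avoids p = 0).
    reference: (n_ref, n_nodes), test: (n_test, n_nodes)."""
    n_ref = reference.shape[0]
    counts = (reference[None, :, :] >= test[:, None, :]).sum(axis=1)
    return (counts + 1) / (n_ref + 1)

def berk_jones(n_alpha, n, alpha):
    """Berk-Jones scan statistic: divergence between the observed fraction
    of significant p-values (<= alpha) and the expected fraction alpha."""
    if n == 0 or n_alpha == 0:
        return 0.0
    q = n_alpha / n
    if q <= alpha:          # not more significant than expected
        return 0.0
    score = n_alpha * np.log(q / alpha)
    if q < 1.0:
        score += (n - n_alpha) * np.log((1 - q) / (1 - alpha))
    return score

def scan_nodes(p_values, alphas=(0.01, 0.05, 0.1, 0.25, 0.5)):
    """Search over significance thresholds and node subsets for the
    highest-scoring (most anomalous) group of activations.
    p_values: (n_test, n_nodes). Returns (score, alpha, node indices)."""
    best = (0.0, None, None)
    n_test = p_values.shape[0]
    for alpha in alphas:
        # per node: how many test sentences are significant at this alpha
        sig_counts = (p_values <= alpha).sum(axis=0)
        # order nodes by significance; the best subset at this alpha is a
        # prefix of this ordering, so a linear sweep over prefixes suffices
        order = np.argsort(-sig_counts)
        cum = np.cumsum(sig_counts[order])
        for k in range(1, len(order) + 1):
            score = berk_jones(cum[k - 1], k * n_test, alpha)
            if score > best[0]:
                best = (score, alpha, order[:k])
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(500, 64))         # clean-set activations (assumed)
    test = rng.normal(size=(50, 64))
    test[:, :8] += 1.5                             # inject an anomalous node subset
    score, alpha, nodes = scan_nodes(empirical_p_values(reference, test))
    print(f"score={score:.1f}, alpha={alpha}, top nodes={sorted(nodes[:8])}")
```

In this sketch, the nodes in the returned subset play the role of the "pivotal nodes" mentioned in the abstract, and a subset score well above the scores obtained on held-out clean sentences signals anomalously encoded structure in the activations.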