The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
FOS: Computer and information sciences
Computer Science - Machine Learning
Computer Science - Cryptography and Security
Computer Science - Computation and Language
Cryptography and Security (cs.CR)
Computation and Language (cs.CL)
Machine Learning (cs.LG)
DOI:
10.48550/arxiv.2404.13208
Publication Date:
2024-04-19
AUTHORS (6)
ABSTRACT
Today's LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow adversaries overwrite a model's original instructions with their own malicious prompts. In this work, we argue one of the primary vulnerabilities underlying these is often consider system prompts (e.g., text from an application developer) be same priority as untrusted users third parties. To address this, propose instruction hierarchy explicitly defines how models should behave when different priorities conflict. We then data generation method demonstrate hierarchical following behavior, which teaches selectively ignore lower-privileged instructions. apply GPT-3.5, showing it drastically increases robustness -- even for attack types not seen during training while imposing minimal degradations on standard capabilities.
SUPPLEMENTAL MATERIAL
Coming soon ....
REFERENCES ()
CITATIONS ()
EXTERNAL LINKS
PlumX Metrics
RECOMMENDATIONS
FAIR ASSESSMENT
Coming soon ....
JUPYTER LAB
Coming soon ....