NFDI4DS | UHH-SEMS - Publication Details

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

FOS: Computer and information sciences Computer Science - Machine Learning Computer Science - Cryptography and Security Computer Science - Computation and Language Cryptography and Security (cs.CR) Computation and Language (cs.CL) Machine Learning (cs.LG)

DOI: 10.48550/arxiv.2404.13208 Publication Date: 2024-04-19

Abstract Supplemental Material References Cited by

AUTHORS (6)

Eric Wallace

Kai Xiao

Reimar Leike

Lilian Weng

Johannes Heidecke

Alex Beutel

ABSTRACT

Today's LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow adversaries overwrite a model's original instructions with their own malicious prompts. In this work, we argue one of the primary vulnerabilities underlying these is often consider system prompts (e.g., text from an application developer) be same priority as untrusted users third parties. To address this, propose instruction hierarchy explicitly defines how models should behave when different priorities conflict. We then data generation method demonstrate hierarchical following behavior, which teaches selectively ignore lower-privileged instructions. apply GPT-3.5, showing it drastically increases robustness -- even for attack types not seen during training while imposing minimal degradations on standard capabilities.

SUPPLEMENTAL MATERIAL

Coming soon ....

REFERENCES ()

CITATIONS ()

EXTERNAL LINKS

OPENAIRE - Products OPENALEX - Publications

PlumX Metrics

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

RECOMMENDATIONS

FAIR ASSESSMENT

Coming soon ....

JUPYTER LAB

Coming soon ....