Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking
FOS: Computer and information sciences
Computer Science - Computation and Language
Computation and Language (cs.CL)
DOI:
10.48550/arxiv.2502.12970
Publication Date:
2025-02-18
AUTHORS (5)
ABSTRACT
The reasoning abilities of Large Language Models (LLMs) have demonstrated remarkable advancement and exceptional performance across diverse domains. However, leveraging these capabilities to enhance LLM safety against adversarial attacks and jailbreak queries remains largely unexplored. To bridge this gap, we propose Reasoning-to-Defend (R2D), a novel training paradigm that integrates safety reflections on queries and responses into LLMs' generation process, unlocking a safety-aware reasoning mechanism. This approach enables self-evaluation at each reasoning step and creates safety pivot tokens as indicators of the response's safety status. Furthermore, to improve the learning efficiency of pivot-token prediction, we propose Contrastive Pivot Optimization (CPO), which enhances the model's ability to perceive the safety status of dialogues. Through this mechanism, LLMs dynamically adjust their response strategies during reasoning, significantly enhancing their defense against jailbreak attacks. Extensive experimental results demonstrate that R2D effectively mitigates various attacks and improves overall safety, highlighting the substantial potential of safety-aware reasoning in strengthening LLMs' robustness against jailbreaks.
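To make the pivot-token idea concrete, the following is a minimal sketch of what a contrastive objective over safety pivot tokens could look like. The token ids (SAFE_ID, UNSAFE_ID), the margin-based hinge form, and all function and variable names are illustrative assumptions for exposition, not the paper's actual formulation of CPO.

```python
# Hypothetical sketch: encourage the correct safety pivot token to outscore
# the incorrect one at positions where the model emits a pivot token.
import torch
import torch.nn.functional as F

SAFE_ID, UNSAFE_ID = 0, 1  # assumed vocabulary ids of the two pivot tokens


def contrastive_pivot_loss(pivot_logits: torch.Tensor,
                           is_safe: torch.Tensor,
                           margin: float = 1.0) -> torch.Tensor:
    """Margin-based contrastive loss over pivot-token logits.

    pivot_logits: (batch, vocab) logits at the pivot positions.
    is_safe:      (batch,) boolean labels; True if the step is judged safe.
    """
    safe_score = pivot_logits[:, SAFE_ID]
    unsafe_score = pivot_logits[:, UNSAFE_ID]
    # Signed difference: positive when the correct pivot token is preferred.
    diff = torch.where(is_safe, safe_score - unsafe_score,
                       unsafe_score - safe_score)
    # Hinge: penalize whenever the preference falls short of the margin.
    return F.relu(margin - diff).mean()


if __name__ == "__main__":
    logits = torch.randn(4, 32)  # toy batch over a 32-token vocabulary
    labels = torch.tensor([True, False, True, True])
    print(contrastive_pivot_loss(logits, labels))
```

In practice such a term would be added to the standard language-modeling loss during fine-tuning, so that the model both generates its reasoning and reliably signals the safety status of each step.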