Enhancing Jailbreak Attacks via Compliance-Refusal-Based Initialization
DOI:
10.48550/arxiv.2502.09755
Publication Date:
2025-02-13
AUTHORS (5)
ABSTRACT
Jailbreak attacks aim to exploit large language models (LLMs) and pose a significant threat to their proper conduct; they seek to bypass models' safeguards and often provoke transgressive behaviors. However, existing automatic jailbreak attacks require extensive computational resources and are prone to converge on suboptimal solutions. In this work, we propose \textbf{C}ompliance-\textbf{R}efusal-based \textbf{I}nitialization (CRI), a novel, attack-agnostic framework that efficiently initializes the optimization in the proximity of the compliance subspace of harmful prompts. By narrowing the initial gap to the adversarial objective, CRI substantially improves attack success rates (ASR) and drastically reduces computational overhead -- in some cases requiring just a single optimization step. We evaluate CRI over the widely-used AdvBench dataset and the standard jailbreak attacks GCG and AutoDAN. Results show that CRI boosts ASR and decreases the median number of steps by up to \textbf{\(\times 60\)}. The project page, along with the reference implementation, is publicly available at \texttt{https://amit1221levi.github.io/CRI-Jailbreak-Init-LLMs-evaluation/}.
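The core intuition -- that starting an attack optimization near the compliance subspace shrinks the remaining gap to the adversarial objective and thus the step count -- can be sketched with a toy numerical model. Everything below (the quadratic loss, the vector stand-in for an adversarial suffix, the sample counts) is an illustrative assumption, not the paper's actual token-level method:

```python
import numpy as np

# Toy sketch of compliance-based initialization (illustrative only).
# The adversarial "suffix" is modeled as a real vector; the attack
# objective is squared distance to a hypothetical compliance direction.

rng = np.random.default_rng(0)
dim = 32
target = rng.normal(size=dim)  # stand-in for the compliance direction


def loss(x):
    return float(np.sum((x - target) ** 2))


def optimize(x, lr=0.1, tol=1e-3, max_steps=10_000):
    """Gradient descent on the toy objective; returns steps to converge."""
    for step in range(max_steps):
        if loss(x) < tol:
            return step
        x = x - lr * 2.0 * (x - target)  # gradient of squared distance
    return max_steps


# Default initialization: far from the compliance subspace
# (analogous to a generic "! ! ! !" suffix in GCG-style attacks).
default_init = np.zeros(dim)

# CRI-style initialization: average of noisy "compliance" samples,
# i.e. suffixes that already elicit compliance on a few training prompts.
compliance_samples = target + 0.1 * rng.normal(size=(8, dim))
cri_init = compliance_samples.mean(axis=0)

steps_default = optimize(default_init.copy())
steps_cri = optimize(cri_init.copy())
print(steps_default, steps_cri)
```

Because the CRI-style starting point is already close to the target, the same optimizer converges in far fewer steps, mirroring the reported reduction in median attack steps.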