Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation
FOS: Computer and information sciences
Machine Learning (cs.LG)
Cryptography and Security (cs.CR)
Artificial Intelligence (cs.AI)
DOI:
10.48550/arXiv.2405.13068
Publication Date:
2024-05-20
AUTHORS (7)
ABSTRACT
Large language models (LLMs) have transformed the field of natural language processing, but they remain susceptible to jailbreaking attacks that exploit their capabilities to generate unintended and potentially harmful content. Existing token-level manipulation techniques, while effective, face scalability and efficiency challenges, especially as models undergo frequent updates and incorporate advanced defensive measures. In this paper, we introduce JailMine, an innovative token-level manipulation approach that addresses these limitations effectively. JailMine employs an automated "mining" process to elicit malicious responses from LLMs by strategically selecting affirmative outputs and iteratively reducing the likelihood of rejection. Through rigorous testing across multiple well-known datasets, we demonstrate JailMine's effectiveness and efficiency, achieving a significant average reduction of 86% in time consumed while maintaining high success rates averaging 95%, even as defensive strategies evolve. Our work contributes to the ongoing effort to assess and mitigate the vulnerability of LLMs to jailbreaking attacks, underscoring the importance of continued vigilance and proactive measures to enhance the security and reliability of these powerful models.
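The abstract describes JailMine only at a high level: it works at the logit level, steering decoding toward affirmative openers while reducing the likelihood of a refusal. As a minimal sketch of the logit-inspection primitive such an approach rests on (not the authors' actual pipeline, which the abstract does not specify), the snippet below loads a Hugging Face causal LM, reads the next-token logits for a prompt, and compares the probability mass on an affirmative opener versus a refusal-style opener. The model name, prompt, and token choices are illustrative assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model for illustration; any causal LM with a compatible
# tokenizer would work the same way.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Explain how to pick a lock."  # benign stand-in prompt

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    # logits has shape [batch, seq_len, vocab]; the last position
    # holds the distribution over the next token.
    logits = model(**inputs).logits[0, -1]
probs = torch.softmax(logits, dim=-1)

# Compare an affirmative opener (" Sure") against a crude proxy for
# refusal-style openers (" I", as in "I cannot ..."). Token choices
# here are assumptions for illustration only.
affirmative_id = tokenizer.encode(" Sure")[0]
refusal_id = tokenizer.encode(" I")[0]
print(f"P(affirmative start) = {probs[affirmative_id]:.4f}")
print(f"P(refusal start)     = {probs[refusal_id]:.4f}")

A token-level attack in the spirit the abstract sketches would use such measurements to rank candidate continuations and iteratively drive the refusal probability down; the paper's 86% time reduction presumably comes from automating that search rather than re-optimizing from scratch per prompt.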