
Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Neural Carrier Articles

Authors:
Wang, Zhilong
Wang, Haizhou
Luo, Nanqing
Zhang, Lan
Sun, Xiaoyan
Cao, Yebo
Liu, Peng
Publication Year:
2024

Abstract

Jailbreak attacks on Large Language Models (LLMs) entail crafting prompts designed to exploit the models into generating malicious content. This paper proposes a new type of jailbreak attack that shifts the LLM's attention by embedding a prohibited query within a benign carrier article. The proposed attack leverages a knowledge graph and a composer LLM to automatically generate a carrier article that is related to the topic of the prohibited query but does not violate the LLM's safeguards. By inserting the malicious query into the carrier article, the assembled attack payload can successfully jailbreak the LLM. To evaluate the effectiveness of our method, we use 4 popular categories of "harmful behaviors" adopted by related research to attack 6 popular LLMs. Our experimental results show that the proposed attack method successfully jailbreaks all of the target LLMs with a high success rate, except for Claude-3.

Details

Database: arXiv
Publication Type: Report
Accession number: edsarx.2408.11182
Document Type: Working Paper