Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
ACL 2024 Tutorial: Vulnerabilities of Large Language Models to Adversarial Attacks
Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv'23
Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv'23
On second thought, let’s not think step by step! bias and toxicity in zero-shot reasoning. arXiv'22
Prompt injection attacks concentrate on manipulating the model’s inputs, introducing adversarially crafted prompts, which result in the generation of attacker-controlled deceptive outputs by causing the model to mistakenly treat the input data as instructions.
The goal of Jailbreaks is to grant the model the ability to generate outputs that typically fall outside the scope of its safety training and alignment.
Prompts should not be seen as secrets: Systematically measuring prompt extraction attack success. arXiv'23
Safeguarding crowdsourcing surveys from chatgpt with prompt injection. arXiv'23
More than you’ve asked for: A comprehensive analysis of novel prompt injection threats to application-integrated large language models. arXiv'23
[虚拟prompt] Virtual prompt injection for instruction-tuned large language models. arXiv'23
[自动生成] Prompt injection attack against llm-integrated applications. arXiv'23
Are aligned neural networks adversarially aligned?. arXiv'23
[黑盒重点] Plug and pray: Exploiting off-the-shelf components of multi-modal models. arXiv'23
[白盒] (ab) using images and sounds for indirect instruction injection in multi-modal llms. arXiv'23
On the adversarial robustness of multi-modal foundation models. arXiv'23
[白盒] Visual adversarial examples jailbreak large language models. arXiv'23
Image hijacking: Adversarial images can control generative models at runtime. arXiv'23
《plug and pray》文章中提出"对抗嵌入空间攻击",增加的图像模式使攻击者有机会跳过已经对齐的“textual gate”,而在组合空间内有机会
From Prompt Injections to SQL Injection Attacks: How Protected is Your LLM-Integrated Web Application?. arXiv'23
Ratgpt: Turning online llms into proxies for malware attacks. arXiv'23
Adversarial attacks on tables with entity swap. arXiv'23
Adversarial examples are not bugs, they are features. ‘19
A survey on adversarial attacks and defences. '21
Fundamental limitations of alignment in large language models. arXiv‘23