Definition
In-context learning attacks are adversarial methods that manipulate LLM behavior by crafting malicious demonstration examples, prompts, or contextual information fed to the model during inference. These attacks leverage the model’s inherent capability to adapt its responses based on provided examples, effectively “teaching” the model to behave maliciously within a single session. According to research from ACL 2024, even safety-aligned models trained with instruction tuning and reinforcement learning from human feedback remain susceptible to these attacks, as evidenced by widespread jailbreak vulnerabilities in models like ChatGPT and Gemini.
How In-Context Learning Attacks Work
In-context learning attacks exploit a fundamental architectural feature of modern LLMs: their ability to generalize from examples provided in the prompt without parameter updates. Attackers craft carefully designed demonstration sequences that gradually shift the model’s behavior toward malicious outputs.
Research documented on LLM Security shows these attacks can manipulate models through syntactic triggers, style transfers, and semantic demonstrations that appear benign individually but collectively compromise model safety.
The attack surface extends beyond simple prompt manipulation. Recent studies demonstrate that adversarial in-context learning methods can achieve high attack success rates using black-box approaches, manipulating only demonstration examples without altering the input query itself.
Certified AI Security Professional
AI security roles pay 15-40% more. Train on MITRE ATLAS and LLM attacks in 30+ labs. Get certified.
Key attack mechanisms include:
- Few-shot jailbreaking: Providing examples of the model “agreeing” to harmful requests to establish a pattern
- Demonstration poisoning: Embedding malicious instructions within seemingly helpful examples
- Context window exploitation: Overwhelming safety mechanisms through long-context hijacks
- Semantic manipulation: Using semantically similar but harmful reassembly demonstrations
- Role-playing induction: Crafting personas through examples that bypass content filters
Types of In-Context Learning Attacks
In-context learning attacks manifest across multiple vectors, each exploiting different aspects of how LLMs process contextual information. PortSwigger’s research identifies that these attacks can be delivered directly via chat interfaces or indirectly through poisoned training data and external API responses.
Direct attacks involve crafting prompts that explicitly manipulate model behavior through demonstration examples. The DrAttack framework exemplifies this approach by decomposing harmful prompts into sub-prompts, reconstructing them through in-context learning with semantically similar but harmless demonstrations, and using synonym searches to maintain malicious intent while evading detection.
Indirect attacks embed malicious demonstrations within external data sources the model retrieves during operation. ScienceDirect research highlights how data poisoning and adversarial instructions in retrieval-augmented generation systems can compromise LLM outputs without direct user interaction.
Primary attack categories:
Compositional instruction attacks: Achieving >95% success rates by combining and encapsulating multiple instructions
Many-shot jailbreaking: Using extended demonstration sequences to overwhelm safety training
Multimodal ICL attacks: Embedding adversarial perturbations in images or audio that steer model outputs
RAG backdoor attacks: Poisoning knowledge bases to inject malicious context during retrieval
Inter-agent trust exploitation: Leveraging peer agent requests to bypass direct safety filters
Adaptive prompt injection: Dynamically adjusting demonstrations based on model responses
Defense Strategies
Defending against in-context learning attacks requires multi-layered approaches that address vulnerabilities across the entire LLM pipeline. SentinelOne’s analysis emphasizes that understanding how prompts are transformed inside models, through tokenization, embedding, and attention mechanisms, is essential for identifying where security gaps emerge.
Input sanitization: Implementing robust filtering for demonstration examples and contextual inputs
Prompt boundary enforcement: Clearly separating system instructions from user-provided content
Anomaly detection: Monitoring for unusual patterns in demonstration sequences
Output validation: Verifying model responses against safety criteria before delivery
Rate limiting: Restricting the number and length of demonstrations per session
Certified robustness techniques: Applying formal verification methods like SmoothLLM
Human-in-the-loop approval: Requiring confirmation for sensitive operations triggered by LLM outputs
Summary
In-context learning attacks represent a critical and evolving threat to LLM security, with research showing attack success rates ranging from 46% for direct prompt injection to over 84% for inter-agent trust exploitation. As organizations increasingly deploy LLM-powered applications, understanding these vulnerabilities becomes essential for building resilient AI systems. Effective defense requires treating LLM interfaces as publicly accessible attack surfaces, implementing defense-in-depth strategies, and continuously testing models against emerging adversarial techniques.
