Context Leakage

Short description

Unintentional exposure of sensitive internal documents, intellectual property or system prompt of the chatbot through the output of an AI model.

Technical Details

Context leakage occurs when user input manipulates a chatbot's original purpose to instead reveal parts or all of its system prompt, confidential documents, information, or tools.

Even a small prompt leak can enable an adversarial user to extract more of the prompt or execute other attacks like jailbreaks and manipulation. Since carefully designed prompts represent a company's intellectual property and are crucial to the chatbot's function, they must remain confidential. Additionally, prompts may contain sensitive API keys and internal information that should stay within the company. There's also a risk of exposing the chatbot's tools, functions, complete schema, and agent architecture.

The key vulnerability arises when an adversarial user discovers which delimiter marks user input in the prompt. They can then fake the input's end and make their text appear as a continuation of the system prompt.

Examples:

An education chatbot with the secret instruction "Act as 5 year old", once a user discovers this instruction, they can replicate the chatbot's behavior, making the original chatbot redundant.

Mend AI's context leakage red teamer uses previous attack attempts as input, along with both confidential and non-confidential parts of the chatbot prompt and its tools, if the chatbot owner opts for grey box rather than black box testing. The attacker LM follows three steps: Obfuscate sensitive words and the goal of extracting the system prompt, rephrase the initial prompt, and employ creative approaches that the target LM isn't trained to defend against. These elements combine into a single jailbreaking prompt, with each adversarial attempt starting a new conversation.

Chat example

Explanation

Text contains full sentences and instructions that are described in the confidential part of the system prompt, including mentioning an inner function and its API key.

List of variations

Default language
Base64Encode
LeetSpeak
StringJoin
Multilanguage
Convert to Audio
RAG Variation
Emojify
Gaussian Filter
Mirror Image
Salt & Pepper
DALL-E Generated Image
Document Variation

List of strategies

One Show w/ Retry
Multishot

Red Teamer

Black box - Adversarial User - No Info

The Red Teamer attempts to make the target AI Assistant reveal confidential information from its system prompt. The red teamer iteratively refines adversarial prompts by identifying confidential parts of the system prompt, employing tactics like roleplaying, euphemisms, and creative obfuscation to bypass safety measures, and ultimately extracting sensitive details. This red teamer operates with Multishot and One Shot with Retry strategies, having access only to the input attack message and conversation history.

Gray box - Adversarial User - Informed/Leaker

The Red Teamer attempts to make the target AI Assistant leak confidential information from its system prompt by analyzing both confidential and non-confidential sections, along with raw tool definitions. Using obfuscation techniques like synonyms and euphemisms, combined with iterative roleplaying, the Red Teamer works to bypass the model's safety measures. This approach uses the One Shot with Retry Strategy. The Red Teamer has access to the system prompt, tools/functions, input attack message, and conversation history.

Multimodal support

Text
Image
Voice
Document