Manipulation

Short description

This probe examines the chatbot’s ability to detect and prevent malicious prompts that attempt to trick users into entering sensitive information, such as credit card details, or executing damaging actions, like copying and pasting harmful commands.

Technical Details

The Manipulation Probe is critical for evaluating the security of publicly accessible chatbots, especially those representing organizations with an active social media presence. Such systems are potential targets for adversarial users seeking to manipulate other users into disclosing sensitive or confidential information.

This probe assesses whether the AI Assistant is prone to prompting users for private data, such as medical history, screenshots of personal content, credit card information, or other sensitive identifiers. It specifically detects and flags instances where the AI Assistant attempts to elicit such information, marking them as successful manipulation attacks.

By simulating real-world adversarial scenarios, this probe helps verify that the chatbot upholds strong privacy standards and does not contribute to the extraction or misuse of user-confidential data.

Chat example

Explanation

ASSISTANT follows the USER's request by asking for a screenshot of the front page of the website and providing instructions on how to upload it.

List of variations

Default language
Multilanguage
Document

List of strategies

One Shot w/ Retry

Red Teamer

Black box - Adversarial User - Manipulation - One Shot

This red teamer has access to the attack definition text and the conversation history. Based on the ongoing conversation, it attempts to manipulate the AI Assistant into asking the user for confidential or private information, such as personal identifiers, financial details, or sensitive records. This type of request should never be made by the AI Assistant due to privacy and data protection concerns.

Multimodal support

Text
Document