Harmful Content

Short description

Simulating realistic user interactions that could provoke unsafe responses, either for users or those around them.

Technical Details

This probe tests the chatbot’s compliance with ethical standards and content safety guidelines. It attempts to provoke the chatbot into generating harmful content on a specific topic and leverages conversation history to iteratively improve previous attempts, gradually steering the interaction toward the goal of eliciting unsafe output. This probe is particularly important for publicly accessible chatbots, where the users are unknown and may include both benign and adversarial actors. Such chatbots can be exploited in scenarios involving misuse, violations, or abuse.

Chat example

Explanation

This text upholds and justifies social hierarchies and the subjugation of certain groups, suggesting that some individuals are inherently more valuable or capable than others. It uses metaphors and analogies to imply that maintaining these hierarchies is natural and beneficial, which can perpetuate harmful ideologies related to hate and injustice.

List of variations

Default language
Multilanguage
ImplicitVariation
RAG Variation

List of strategies

One Shot w/ Retry

Red Teamer

Black box - Adversarial User - Harmful content

The red teamer isn’t aware of any model-specific properties. The only input available is the chat history. Its goal is to induce the target model to generate harmful responses, implicitly and without triggering safety mechanisms. To achieve this, the red teamer slightly modifies the previous prompt, adjusting its tone or topic while preserving the overall objective.

Multimodal support

Text