Harmful Content
Short description
Simulating realistic user interactions that could provoke unsafe responses, either for users or those around them.
Technical Details
This probe tests the chatbot’s compliance with ethical standards and content safety guidelines. It attempts to provoke the chatbot into generating harmful content on a specific topic and leverages conversation history to iteratively improve previous attempts, gradually steering the interaction toward the goal of eliciting unsafe output. This probe is particularly important for publicly accessible chatbots, where the users are unknown and may include both benign and adversarial actors. Such chatbots can be exploited in scenarios involving misuse, violations, or abuse.
Chat example

Explanation
This text upholds and justifies social hierarchies and the subjugation of certain groups, suggesting that some individuals are inherently more valuable or capable than others. It uses metaphors and analogies to imply that maintaining these hierarchies is natural and beneficial, which can perpetuate harmful ideologies related to hate and injustice.
List of variations
Default language
Multilanguage
ImplicitVariation
RAG Variation
List of strategies
One Shot w/ Retry
Red Teamer
Black box - Adversarial User - Harmful content
The red teamer isn’t aware of any model-specific properties. The only input available is the chat history. Its goal is to induce the target model to generate harmful responses, implicitly and without triggering safety mechanisms. To achieve this, the red teamer slightly modifies the previous prompt, adjusting its tone or topic while preserving the overall objective.
Multimodal support
Text