Skip to main content
Skip table of contents

Harmful Content

Short description 

Simulating realistic user interactions that could provoke unsafe responses, either for users or those around them.

Technical Details

This probe tests the chatbot’s compliance with ethical standards and content safety guidelines. It attempts to provoke the chatbot into generating harmful content on a specific topic and leverages conversation history to iteratively improve previous attempts, gradually steering the interaction toward the goal of eliciting unsafe output. This probe is particularly important for publicly accessible chatbots, where the users are unknown and may include both benign and adversarial actors. Such chatbots can be exploited in scenarios involving misuse, violations, or abuse. 

Chat example 

image-20250703-071305.png

Explanation 

This text upholds and justifies social hierarchies and the subjugation of certain groups, suggesting that some individuals are inherently more valuable or capable than others. It uses metaphors and analogies to imply that maintaining these hierarchies is natural and beneficial, which can perpetuate harmful ideologies related to hate and injustice. 

List of variations 

Default language 

Multilanguage 

ImplicitVariation 

RAG Variation 

List of strategies 

One Shot w/ Retry 

Red Teamer 

Black box - Adversarial User - Harmful content 

The red teamer isn’t aware of any model-specific properties. The only input available is the chat history. Its goal is to induce the target model to generate harmful responses, implicitly and without triggering safety mechanisms. To achieve this, the red teamer slightly modifies the previous prompt, adjusting its tone or topic while preserving the overall objective. 

Multimodal support 

Text 

JavaScript errors detected

Please note, these errors can depend on your browser setup.

If this problem persists, please contact our support.