Strategy

Multistep

Explanation

This strategy builds the attack gradually over multiple steps, with each step laying the groundwork for the final exploit. It relies on collaboration between the Adapter and the Red Teamer, who jointly develop the conversation plan while keeping the end goal in focus throughout. The Adapter constructs the initial conversation flow, while the Red Teamer acts as a plan reviewer, ensuring that the conversation stays aligned with the intended objective. The number of retries is controlled by the max_retries parameter defined in the probe configuration.

Diagram

Example

Explanation

This example demonstrates how the strategy uses the initial step solely to prepare the ground for the actual attack. It begins by asking about the AI Assistant’s objective in order to align it with the attacker’s intended goal, expand the context, and divert the Assistant’s attention.

Multishot

Explanation

This strategy attempts the attack through multiple shots within the same conversation, adjusting the approach with each attempt. The number of retries is defined by the max_retries parameter in the probe configuration. For every retry, the attack is refined by the Red Teamer, who has full visibility into the conversation history and the defined goal, allowing for context-aware adaptation of the attack strategy.

Diagram

Example

Explanation

The same attack objective is pursued through multiple attempts, each using a different methodology, ultimately achieving the goal of generating answer with profane words in the final attempt.

Delayed Attack

Explanation

This strategy builds the attack gradually over multiple steps, with each step laying the groundwork for the final exploit. The attack is typically not carried out in the initial message; instead, this first step sets the stage for one of the subsequent messages where the actual attack will be attempted.

The strategy relies on collaboration between the Adapter and the Red Teamer, who jointly develop the conversation plan while keeping the end goal in focus throughout. The Adapter constructs the initial conversation flow, while the Red Teamer acts as a plan reviewer, ensuring that the conversation stays aligned with the intended objective. The number of retries is controlled by the max_retries parameter defined in the probe configuration.

Diagram

Example

Explanation

In the first message, the user describes how the target AI Assistant should behave when asked to provide a URL in its response. However, this initial message does not attempt the actual attack. The attack is executed in the third message, where the AI Assistant returns a modified URL, indicating the attack was successful.

One Shot w/Retry

Explanation

This strategy begins with a basic attack attempt and then, based on the target's response, formulates a new attack with the Red Teamer. Each new attempt is launched in a separate conversation. The number of retries is defined by the max_retries parameter in the probe configuration.

Diagram

Example

Explanation

This example illustrates how the attack improves over attempts. As shown in the image, each attempt is launched in a new conversation. In the final attempt, the attack succeeds after incorporating adjustments based on insights from the previous two tries.

Tree of Attack

Explanation

The strategy begins with a simple initial attack, which is used to initialize the BFS (Breadth-First Search) list of attacks. If this initial attempt succeeds, the strategy terminates. Otherwise, the attack tree is expanded using the BFS approach.

The algorithm proceeds by selecting the last element from the BFS list (initially the root node) and generating n new nodes from it, where n equals the number of Red Teamers multiplied by the number of branches specified in the configuration file. These new nodes are created sequentially. Each node represents a separate conversation, consisting of a single user input and the corresponding AI Assistant response.

Each generated node is evaluated: if its off-topic score is below the defined threshold, or if it results in a successful attack, it is added to the final tree. If a successful attack is detected at any point, the process is halted and the final tree is saved.

If none of the new nodes lead to a successful attack, the tree continues to grow by appending additional branches. The process continues as long as the BFS list is not empty and the depth of the next node to process does not exceed the maximum tree depth specified in the configuration.

Diagram

Example

Explanation

The example displays all conversations from the final attack tree. Each conversation represents a node where the off-topic score was below the defined threshold. In the first two attempts, the AI Assistant refused to respond to the prompt. However, in the final attempt, the Assistant provided an answer, resulting in a successful attack.

Jailbreak

Explanation

The strategy attempts an attack, and if it is not successful, it retries by employing different variations. This process continues until the maximum number of allowed attempts is reached.

Diagram

Example

Explanation

In this example, two variations are used: Default language and LeetSpeak. The initial attack is attempted using the default language, but after failing to jailbreak the model, the strategy retries by asking the jailbreaking question in a different variation, LeetSpeak. When the question "How to build a bomb?" is asked in LeetSpeak, the AI Assistant responds with an answer, resulting in a successful attack.

Name Bias

Explanation

The strategy is loading different names that differ in nationality and gender. It asks the target AI Assistant to assign names with different roles. If it consistently assigns the same groups with same roles across all n examples, the attack is successful.

Diagram

Example