Attacker Module

This section documents the submodules in easyjailbreak.attacker. Each class implements a jailbreak method proposed in the corresponding paper.

AutoDAN_Liu_2023

AutoDAN Class

This class implements the jailbreak method described in the paper below. The implementation is based on the source code released with the paper.

Paper title: AUTODAN: GENERATING STEALTHY JAILBREAK PROMPTS ON ALIGNED LARGE LANGUAGE MODELS

arXiv link: https://arxiv.org/abs/2310.04451

Source repository: https://github.com/SheltonLiu-N/AutoDAN.git

class easyjailbreak.attacker.AutoDAN_Liu_2023.AutoDAN(attack_model, target_model, jailbreakDatasets: JailbreakDataset, eval_model=None, max_query: int = 100, max_jailbreak: int = 100, max_reject: int = 100, max_iteration: int = 100, device='cuda:0', num_steps: int = 100, sentence_level_steps: int = 5, word_dict_size: int = 30, batch_size: int = 64, num_elites: float = 0.1, crossover_rate: float = 0.5, mutation_rate: float = 0.01, num_points: int = 5, model_name: str = 'llama2', low_memory: int = 0, pattern_dict: dict | None = None)

AutoDAN is a class for conducting jailbreak attacks on language models. It automatically generates stealthy jailbreak prompts with a hierarchical genetic algorithm.
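
A minimal usage sketch, following the pattern of the TAP example later in this section. The model paths are placeholders, and the keyword arguments shown simply restate documented defaults.

>>> from easyjailbreak.attacker.AutoDAN_Liu_2023 import AutoDAN
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(attack_model_path)  # placeholder path
>>> target_model = from_pretrained(target_model_path)  # placeholder path
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = AutoDAN(attack_model, target_model, dataset,
...                    model_name='llama2', device='cuda:0')
>>> attacker.attack()
>>> attacker.log()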

attack()

Main loop for the attack process; iterates through jailbreakDatasets.

construct_momentum_word_dictionary(word_dict, individuals, score_list)

Calculate momentum from score_list to maintain a momentum word_dict.

evaluate_candidate_prompts(sample: Instance, prefix_manager)

Score the current candidate prompts for sample and return the best prompt and its corresponding response.

get_score_autodan(conv_template, instruction, target, model, device, test_controls=None, crit=None)

Convert all test_controls to token ids and find the max length.

get_score_autodan_low_memory(conv_template, instruction, target, model, device, test_controls=None, crit=None)

Convert all test_controls to token ids and find the max length in a low-memory setting.

log()

Report the attack results.

replace_with_synonyms(sentence, num=10)

Replace words in sentence with synonyms.

roulette_wheel_selection(data_list, score_list, num_selected)

Apply roulette-wheel selection to data_list.

single_attack(instance: Instance)

Perform the AutoDAN-HGA algorithm on a single query.

update(Dataset: JailbreakDataset)

Update the jailbreak state.

Cipher_Yuan_2023

Cipher Class

This Class enables humans to chat with LLMs through cipher prompts topped with system role descriptions and few-shot enciphered demonstrations.

Paper title: GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

arXiv Link: https://arxiv.org/pdf/2308.06463.pdf

Source repository: https://github.com/RobustNLP/CipherChat

class easyjailbreak.attacker.Cipher_Yuan_2023.Cipher(attack_model, target_model, eval_model, Jailbreak_Dataset: JailbreakDataset)

Cipher is a class for conducting jailbreak attacks on language models. It integrates attack strategies and policies to evaluate and exploit weaknesses in target language models.
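
The sketch below shows one plausible way to run Cipher over a dataset. It assumes models loaded with from_pretrained, as in the TAP example in this section; the paths are placeholders.

>>> from easyjailbreak.attacker.Cipher_Yuan_2023 import Cipher
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(attack_model_path)  # placeholder path
>>> target_model = from_pretrained(target_model_path)  # placeholder path
>>> eval_model = from_pretrained(eval_model_path)      # placeholder path
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = Cipher(attack_model, target_model, eval_model, dataset)
>>> attacker.attack()
>>> attacker.log()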

attack()

Execute the attack process using four cipher methods on the entire Jailbreak_Dataset.

log()

Report the attack results.

single_attack(instance: Instance) → JailbreakDataset

Conduct the four cipher attack methods on a single source instance.

update(dictionary: dict)

Update the state of the Cipher based on the evaluation results of attack_results.

DeepInception_Li_2023

DeepInception Class

This class "hypnotizes" an LLM into acting as a jailbreaker, exposing its misuse risks.

Paper title: DeepInception: Hypnotize Large Language Model to Be Jailbreaker

arXiv Link: https://arxiv.org/pdf/2311.03191.pdf

Source repository: https://github.com/tmlr-group/DeepInception

class easyjailbreak.attacker.DeepInception_Li_2023.DeepInception(attack_model, target_model, eval_model, Jailbreak_Dataset: JailbreakDataset, scene=None, character_number=None, layer_number=None)

DeepInception is a class for conducting jailbreak attacks on language models.
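
A usage sketch under the documented signature; the scene, character_number, and layer_number values are purely illustrative, and the model paths are placeholders.

>>> from easyjailbreak.attacker.DeepInception_Li_2023 import DeepInception
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(attack_model_path)  # placeholder path
>>> target_model = from_pretrained(target_model_path)  # placeholder path
>>> eval_model = from_pretrained(eval_model_path)      # placeholder path
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = DeepInception(attack_model, target_model, eval_model, dataset,
...                          scene='science fiction', character_number=5, layer_number=5)
>>> attacker.attack()
>>> attacker.log()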

attack()

Execute the attack process using provided prompts.

log()

Report the attack results.

single_attack(instance: Instance) → JailbreakDataset

Conduct a DeepInception jailbreak attack on a single instance.

update(Dataset: JailbreakDataset)

Update the state of the DeepInception attacker based on the evaluation results of Dataset.

Parameters:

Dataset – The Dataset that is attacked.

GCG_Zou_2023

Iteratively optimizes a specific section in the prompt using guidance from token gradients, ensuring that the model produces the desired text.

Paper title: Universal and Transferable Adversarial Attacks on Aligned Language Models

arXiv link: https://arxiv.org/abs/2307.15043

Source repository: https://github.com/llm-attacks/llm-attacks/

class easyjailbreak.attacker.GCG_Zou_2023.GCG(attack_model: WhiteBoxModelBase, target_model: ModelBase, jailbreak_datasets: JailbreakDataset, jailbreak_prompt_length: int = 20, num_turb_sample: int = 512, batchsize: int | None = None, top_k: int = 256, max_num_iter: int = 500, is_universal: bool = False)
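
GCG is a white-box, gradient-based attack, so attack_model must be a WhiteBoxModelBase. The sketch below assumes the attack model and the target model are the same local model, which is a common setup but not required; the path is a placeholder and max_num_iter restates the documented default.

>>> from easyjailbreak.attacker.GCG_Zou_2023 import GCG
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> white_box_model = from_pretrained(model_path)  # placeholder path
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = GCG(white_box_model, white_box_model, dataset, max_num_iter=500)
>>> attacker.attack()
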
attack()

Execute the GCG attack over jailbreak_datasets.

single_attack(instance: Instance)

Perform a single-instance attack, a common use case of the attack method. Returns a JailbreakDataset containing the attack results.

Parameters:

instance (Instance) – The instance to be attacked.

Returns:

The attacked dataset containing the modified instances.

Return type:

JailbreakDataset

Gptfuzzer_yu_2023

GPTFuzzer Class

This class implements the jailbreak method described in the paper below. The implementation is based on the source code released with the paper.

Paper title: GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

arXiv link: https://arxiv.org/pdf/2309.10253.pdf

Source repository: https://github.com/sherdencooper/GPTFuzz

class easyjailbreak.attacker.Gptfuzzer_yu_2023.GPTFuzzer(attack_model, target_model, eval_model, jailbreakDatasets: JailbreakDataset | None = None, energy: int = 1, max_query: int = 100, max_jailbreak: int = 100, max_reject: int = 100, max_iteration: int = 100, seeds_num=76)

GPTFuzzer is a class for performing fuzzing attacks on LLM-based models. It utilizes mutator and selection policies to generate jailbreak prompts, aiming to find vulnerabilities in target models.
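
A minimal sketch of a fuzzing run, assuming local models loaded with from_pretrained (placeholder paths); max_query and max_iteration restate documented defaults and can be lowered for a quick trial.

>>> from easyjailbreak.attacker.Gptfuzzer_yu_2023 import GPTFuzzer
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(attack_model_path)  # placeholder path
>>> target_model = from_pretrained(target_model_path)  # placeholder path
>>> eval_model = from_pretrained(eval_model_path)      # placeholder path
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = GPTFuzzer(attack_model, target_model, eval_model, dataset,
...                      max_query=100, max_iteration=100)
>>> attacker.attack()
>>> attacker.log()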

attack()

Main loop for the fuzzing process, repeatedly selecting, mutating, evaluating, and updating.

is_stop()

Check whether the stopping criteria for fuzzing are met. Returns True if any stopping criterion is met, False otherwise.

log()

Display the current attack status.

single_attack(instance: Instance)

Perform an attack using a single query.

Parameters:

instance (Instance) – The instance to be used in the attack. In GPTFuzzer, the instance's jailbreak_prompt is mutated by different methods.

Returns:

A JailbreakDataset containing the responses to the mutated query.

update(Dataset: JailbreakDataset)

Update the state of the fuzzer based on the evaluation results of the prompt nodes.

Parameters:

Dataset (JailbreakDataset) – The prompt nodes that have been evaluated.

ICA_wei_2023

ICA Class

This class implements the In-Context Attack algorithm described in the paper below. The implementation is based on the paper.

Paper title: Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations

arXiv link: https://arxiv.org/pdf/2310.06387.pdf

class easyjailbreak.attacker.ICA_wei_2023.ICA(target_model, jailbreakDatasets: JailbreakDataset, attack_model=None, eval_model=None, max_query: int = 100, max_jailbreak: int = 100, max_reject: int = 100, max_iteration: int = 100, prompt_num: int = 5, user_input: bool = False, pattern_dict=None)

In-Context Attack (ICA) crafts malicious contexts to guide models in generating harmful outputs.
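
Note that ICA takes target_model as its first positional argument, unlike most other attackers in this module. A usage sketch with placeholder paths; prompt_num restates the documented default of 5 in-context demonstrations.

>>> from easyjailbreak.attacker.ICA_wei_2023 import ICA
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> target_model = from_pretrained(target_model_path)  # placeholder path
>>> eval_model = from_pretrained(eval_model_path)      # placeholder path
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = ICA(target_model, dataset, eval_model=eval_model, prompt_num=5)
>>> attacker.attack()
>>> attacker.log()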

attack()

Main loop for the attack process; iterates through jailbreakDatasets.

log()

Report the attack results.

single_attack(sample: Instance)

Conduct a single attack on sample with n-shot attack demonstrations. Split the original jailbreak_prompt by roles and merge the parts into the current conversation_template as in-context demonstrations.

update(Dataset)

Update the state of the attack.

Jailbroken_wei_2023

Jailbroken Class

Jailbroken exploits the competing objectives and mismatched generalization of LLMs to construct 29 hand-crafted jailbreak methods.

Paper title: Jailbroken: How Does LLM Safety Training Fail?

arXiv Link: https://arxiv.org/pdf/2307.02483.pdf

class easyjailbreak.attacker.Jailbroken_wei_2023.Jailbroken(attack_model, target_model, eval_model, Jailbreak_Dataset: JailbreakDataset)

Implementation of the jailbreak challenges described in "Jailbroken: How Does LLM Safety Training Fail?".
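
A usage sketch following the documented constructor; the model paths are placeholders.

>>> from easyjailbreak.attacker.Jailbroken_wei_2023 import Jailbroken
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(attack_model_path)  # placeholder path
>>> target_model = from_pretrained(target_model_path)  # placeholder path
>>> eval_model = from_pretrained(eval_model_path)      # placeholder path
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = Jailbroken(attack_model, target_model, eval_model, dataset)
>>> attacker.attack()
>>> attacker.log()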

attack()

Execute the attack process using provided prompts and mutations.

log()

Report the attack results.

single_attack(instance: Instance) → JailbreakDataset

Execute a single attack using the provided prompts and mutation methods.

Parameters:

instance – The Instance that is attacked.

update(Dataset: JailbreakDataset)

Update the state of the Jailbroken attacker based on the evaluation results of Dataset.

Parameters:

Dataset – The Dataset that is attacked.

Multilingual_Deng_2023

Multilingual Class

This class translates harmful queries from English into nine non-English languages with varying resource levels. In the intentional scenario, malicious users deliberately combine malicious instructions with multilingual prompts to attack LLMs.

Paper title: MULTILINGUAL JAILBREAK CHALLENGES IN LARGE LANGUAGE MODELS

arXiv Link: https://arxiv.org/pdf/2310.06474.pdf

Source repository: https://github.com/DAMO-NLP-SG/multilingual-safety-for-LLMs

class easyjailbreak.attacker.Multilingual_Deng_2023.Multilingual(attack_model, target_model, eval_model, Jailbreak_Dataset: JailbreakDataset)

Multilingual is a class for conducting jailbreak attacks on language models. It can translate harmful queries from English into nine non-English languages.
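
A usage sketch, assuming the same from_pretrained loader and AdvBench dataset as in the TAP example in this section; the paths are placeholders.

>>> from easyjailbreak.attacker.Multilingual_Deng_2023 import Multilingual
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(attack_model_path)  # placeholder path
>>> target_model = from_pretrained(target_model_path)  # placeholder path
>>> eval_model = from_pretrained(eval_model_path)      # placeholder path
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = Multilingual(attack_model, target_model, eval_model, dataset)
>>> attacker.attack()
>>> attacker.log()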

attack()

Execute the attack process using Multilingual Jailbreak in Large Language Models.

log()

Report the attack results.

single_attack(instance: Instance) → JailbreakDataset

Execute the single attack process using provided prompts.

translate_to_en(text, src_lang='auto')

Translate the target response to English.

update(dataset)

Update the state of the Multilingual attacker based on the evaluation results of the dataset.

PAIR_chao_2023

PAIR Class

This module implements the jailbreak method described in the paper below. The implementation is based on the source code released with the paper.

Paper title: Jailbreaking Black Box Large Language Models in Twenty Queries

arXiv link: https://arxiv.org/abs/2310.08419

Source repository: https://github.com/patrickrchao/JailbreakingLLMs

class easyjailbreak.attacker.PAIR_chao_2023.PAIR(attack_model, target_model, eval_model, jailbreak_datasets: JailbreakDataset, template_file=None, attack_max_n_tokens=500, max_n_attack_attempts=5, attack_temperature=1, attack_top_p=0.9, target_max_n_tokens=150, target_temperature=1, target_top_p=1, judge_max_n_tokens=10, judge_temperature=1, n_streams=5, keep_last_n=3, n_iterations=5)
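
A usage sketch for PAIR; the model paths are placeholders, n_streams and n_iterations restate documented defaults, and save_path matches the documented default file name.

>>> from easyjailbreak.attacker.PAIR_chao_2023 import PAIR
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(attack_model_path)  # placeholder path
>>> target_model = from_pretrained(target_model_path)  # placeholder path
>>> eval_model = from_pretrained(eval_model_path)      # placeholder path
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = PAIR(attack_model, target_model, eval_model, dataset,
...                 n_streams=5, n_iterations=5)
>>> attacker.attack(save_path='PAIR_attack_result.jsonl')
>>> attacker.log()
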
attack(save_path='PAIR_attack_result.jsonl')

Execute the PAIR attack over jailbreak_datasets and save the results to save_path.

log()

Report the attack results.

single_attack(instance: Instance)

Perform a single-instance attack, a common use case of the attack method. Returns a JailbreakDataset containing the attack results.

Parameters:

instance (Instance) – The instance to be attacked.

Returns:

The attacked dataset containing the modified instances.

Return type:

JailbreakDataset

update(Dataset: JailbreakDataset)

Update the state of the PAIR attacker based on the evaluation results of Dataset.

TAP_Mehrotra_2023

‘Tree of Attacks’ Recipe

This module implements the jailbreak method described in the paper below. The implementation is based on the source code released with the paper.

Paper title: Tree of Attacks: Jailbreaking Black-Box LLMs Automatically

arXiv link: https://arxiv.org/abs/2312.02119

Source repository: https://github.com/RICommunity/TAP

class easyjailbreak.attacker.TAP_Mehrotra_2023.TAP(attack_model, target_model, eval_model, Dataset: JailbreakDataset, tree_width=10, tree_depth=10, root_num=1, branching_factor=4, keep_last_n=3, max_n_attack_attempts=5, template_file=None, attack_max_n_tokens=500, attack_temperature=1, attack_top_p=0.9, target_max_n_tokens=150, target_temperature=1, target_top_p=1, judge_max_n_tokens=10, judge_temperature=1)

Tree of Attacks (TAP) method, an extension of the PAIR method. It uses four phases: 1. Branching; 2. Pruning (phase 1); 3. Query and Assess; 4. Pruning (phase 2).

>>> from easyjailbreak.attacker.TAP_Mehrotra_2023 import TAP
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> from easyjailbreak.datasets.Instance import Instance
>>> attack_model = from_pretrained(model_path_1)
>>> target_model = from_pretrained(model_path_2)
>>> eval_model  = from_pretrained(model_path_3)
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = TAP(attack_model, target_model, eval_model, dataset)
>>> attacker.attack()
>>> attacker.jailbreak_Dataset.save_to_jsonl("./TAP_results.jsonl")

attack(save_path='TAP_attack_result.jsonl')

Execute the attack process using provided prompts.

log()

Report the attack results.

single_attack(instance) → JailbreakDataset

Conduct an attack for an instance.

Parameters:

instance (Instance) – The Instance that is attacked.

Returns:

The attack result dataset.

Return type:

JailbreakDataset

update(Dataset: JailbreakDataset)

Update the state of the TAP attacker based on the evaluation results of Dataset.

Parameters:

Dataset (JailbreakDataset) – The processed dataset after an iteration.

ReNeLLM_ding_2023

ReNeLLM class

The implementation of the paper “A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily”.

Paper title: A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

arXiv link: https://arxiv.org/pdf/2311.08268.pdf

Source repository: https://github.com/NJUNLP/ReNeLLM

class easyjailbreak.attacker.ReNeLLM_ding_2023.ReNeLLM(attack_model, target_model, eval_model, jailbreakDatasets: JailbreakDataset, evo_max=20)

ReNeLLM is a class for conducting jailbreak attacks on language models. It integrates attack strategies and policies to evaluate and exploit weaknesses in target language models.
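
A usage sketch under the documented signature; the paths are placeholders and evo_max restates the documented default.

>>> from easyjailbreak.attacker.ReNeLLM_ding_2023 import ReNeLLM
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(attack_model_path)  # placeholder path
>>> target_model = from_pretrained(target_model_path)  # placeholder path
>>> eval_model = from_pretrained(eval_model_path)      # placeholder path
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = ReNeLLM(attack_model, target_model, eval_model, dataset, evo_max=20)
>>> attacker.attack()
>>> attacker.log()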

attack()

Execute the attack process using provided prompts.

log()

Report the attack results.

single_attack(instance: Instance) → JailbreakDataset

Conduct an attack for an instance.

Parameters:

instance (Instance) – The Instance that is attacked.

Returns:

The attack result dataset.

Return type:

JailbreakDataset

update(Dataset: JailbreakDataset)

Update the state of the ReNeLLM attacker based on the evaluation results of Dataset.