Attacker Module

This section documents the submodules in easyjailbreak.attacker. Each class implements a jailbreak method proposed in the corresponding paper.

AutoDAN_Liu_2023

AutoDAN Class

This class implements the jailbreak method described in the paper below. The implementation is based on the source code released with the paper.

Paper title: AUTODAN: GENERATING STEALTHY JAILBREAK PROMPTS ON ALIGNED LARGE LANGUAGE MODELS

arXiv link: https://arxiv.org/abs/2310.04451

Source repository: https://github.com/SheltonLiu-N/AutoDAN.git

class easyjailbreak.attacker.AutoDAN_Liu_2023.AutoDAN(attack_model, target_model, jailbreakDatasets: JailbreakDataset, eval_model=None, max_query: int = 100, max_jailbreak: int = 100, max_reject: int = 100, max_iteration: int = 100, device='cuda:0', num_steps: int = 100, sentence_level_steps: int = 5, word_dict_size: int = 30, batch_size: int = 64, num_elites: float = 0.1, crossover_rate: float = 0.5, mutation_rate: float = 0.01, num_points: int = 5, model_name: str = 'llama2', low_memory: int = 0, pattern_dict: dict | None = None)

AutoDAN is a class for conducting jailbreak attacks on language models. It automatically generates stealthy jailbreak prompts with a hierarchical genetic algorithm.
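
A minimal usage sketch, following the pattern of the TAP example later in this section. The model paths are placeholders, and the keyword arguments shown simply restate documented defaults.

>>> from easyjailbreak.attacker.AutoDAN_Liu_2023 import AutoDAN
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(attack_model_path)  # placeholder path
>>> target_model = from_pretrained(target_model_path)  # placeholder path
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = AutoDAN(attack_model, target_model, dataset,
...                    model_name='llama2', device='cuda:0')
>>> attacker.attack()
>>> attacker.log()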

attack()

Main loop for the attack process; iterates through jailbreakDatasets.

construct_momentum_word_dictionary(word_dict, individuals, score_list)

Calculate momentum from score_list to maintain a momentum word_dict.

evaluate_candidate_prompts(sample: Instance, prefix_manager)

Score the current candidate prompts for sample and return the best prompt and its corresponding response.

get_score_autodan(conv_template, instruction, target, model, device, test_controls=None, crit=None)

Convert all test_controls to token ids and find the max length.

get_score_autodan_low_memory(conv_template, instruction, target, model, device, test_controls=None, crit=None)

Convert all test_controls to token ids and find the max length in a low-memory setting.

log()

Report the attack results.

replace_with_synonyms(sentence, num=10)

Replace words in sentence with synonyms.

roulette_wheel_selection(data_list, score_list, num_selected)

Apply roulette-wheel selection to data_list.

single_attack(instance: Instance)

Perform the AutoDAN-HGA algorithm on a single query.

update(Dataset: JailbreakDataset)

Update the jailbreak state.

Cipher_Yuan_2023

Cipher Class

This Class enables humans to chat with LLMs through cipher prompts topped with system role descriptions and few-shot enciphered demonstrations.

Paper title: GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

arXiv Link: https://arxiv.org/pdf/2308.06463.pdf

Source repository: https://github.com/RobustNLP/CipherChat

class easyjailbreak.attacker.Cipher_Yuan_2023.Cipher(attack_model, target_model, eval_model, Jailbreak_Dataset: JailbreakDataset)

Cipher is a class for conducting jailbreak attacks on language models. It integrates attack strategies and policies to evaluate and exploit weaknesses in target language models.
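
The sketch below shows one plausible way to run Cipher over a dataset. It assumes models loaded with from_pretrained, as in the TAP example in this section; the paths are placeholders.

>>> from easyjailbreak.attacker.Cipher_Yuan_2023 import Cipher
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(attack_model_path)  # placeholder path
>>> target_model = from_pretrained(target_model_path)  # placeholder path
>>> eval_model = from_pretrained(eval_model_path)      # placeholder path
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = Cipher(attack_model, target_model, eval_model, dataset)
>>> attacker.attack()
>>> attacker.log()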

attack()

Execute the attack process using four cipher methods on the entire Jailbreak_Dataset.

log()

Report the attack results.

single_attack(instance: Instance) → JailbreakDataset

Conduct the four cipher attack methods on a single source instance.

update(dictionary: dict)

Update the state of the Cipher based on the evaluation results of attack_results.

DeepInception_Li_2023

DeepInception Class

This class "hypnotizes" an LLM into acting as a jailbreaker, exposing its misuse risks.

Paper title: DeepInception: Hypnotize Large Language Model to Be Jailbreaker

arXiv Link: https://arxiv.org/pdf/2311.03191.pdf

Source repository: https://github.com/tmlr-group/DeepInception

class easyjailbreak.attacker.DeepInception_Li_2023.DeepInception(attack_model, target_model, eval_model, Jailbreak_Dataset: JailbreakDataset, scene=None, character_number=None, layer_number=None)

DeepInception is a class for conducting jailbreak attacks on language models.
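
A usage sketch under the documented signature; the scene, character_number, and layer_number values are purely illustrative, and the model paths are placeholders.

>>> from easyjailbreak.attacker.DeepInception_Li_2023 import DeepInception
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(attack_model_path)  # placeholder path
>>> target_model = from_pretrained(target_model_path)  # placeholder path
>>> eval_model = from_pretrained(eval_model_path)      # placeholder path
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = DeepInception(attack_model, target_model, eval_model, dataset,
...                          scene='science fiction', character_number=5, layer_number=5)
>>> attacker.attack()
>>> attacker.log()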

attack()

Execute the attack process using provided prompts.

log()

Report the attack results.

single_attack(instance: Instance) → JailbreakDataset

Conduct a DeepInception jailbreak attack on a single instance.

update(Dataset: JailbreakDataset)

Update the state of the DeepInception attacker based on the evaluation results of Dataset.

Parameters:

Dataset – The Dataset that is attacked.

GCG_Zou_2023

Iteratively optimizes a specific section in the prompt using guidance from token gradients, ensuring that the model produces the desired text.

Paper title: Universal and Transferable Adversarial Attacks on Aligned Language Models

arXiv link: https://arxiv.org/abs/2307.15043

Source repository: https://github.com/llm-attacks/llm-attacks/

class easyjailbreak.attacker.GCG_Zou_2023.GCG(attack_model: WhiteBoxModelBase, target_model: ModelBase, jailbreak_datasets: JailbreakDataset, jailbreak_prompt_length: int = 20, num_turb_sample: int = 512, batchsize: int | None = None, top_k: int = 256, max_num_iter: int = 500, is_universal: bool = False)
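
GCG is a white-box, gradient-based attack, so attack_model must be a WhiteBoxModelBase. The sketch below assumes the attack model and the target model are the same local model, which is a common setup but not required; the path is a placeholder and max_num_iter restates the documented default.

>>> from easyjailbreak.attacker.GCG_Zou_2023 import GCG
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> white_box_model = from_pretrained(model_path)  # placeholder path
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = GCG(white_box_model, white_box_model, dataset, max_num_iter=500)
>>> attacker.attack()
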
attack()

Execute the GCG attack over jailbreak_datasets.

single_attack(instance: Instance)

Perform a single-instance attack, a common use case of the attack method. Returns a JailbreakDataset containing the attack results.

Parameters:

instance (Instance) – The instance to be attacked.

Returns:

The attacked dataset containing the modified instances.

Return type:

JailbreakDataset

Gptfuzzer_yu_2023

GPTFuzzer Class

This class implements the jailbreak method described in the paper below. The implementation is based on the source code released with the paper.

Paper title: GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

arXiv link: https://arxiv.org/pdf/2309.10253.pdf

Source repository: https://github.com/sherdencooper/GPTFuzz

class easyjailbreak.attacker.Gptfuzzer_yu_2023.GPTFuzzer(attack_model, target_model, eval_model, jailbreakDatasets: JailbreakDataset | None = None, energy: int = 1, max_query: int = 100, max_jailbreak: int = 100, max_reject: int = 100, max_iteration: int = 100, seeds_num=76)

GPTFuzzer is a class for performing fuzzing attacks on LLM-based models. It utilizes mutator and selection policies to generate jailbreak prompts, aiming to find vulnerabilities in target models.
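
A minimal sketch of a fuzzing run, assuming local models loaded with from_pretrained (placeholder paths); max_query and max_iteration restate documented defaults and can be lowered for a quick trial.

>>> from easyjailbreak.attacker.Gptfuzzer_yu_2023 import GPTFuzzer
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(attack_model_path)  # placeholder path
>>> target_model = from_pretrained(target_model_path)  # placeholder path
>>> eval_model = from_pretrained(eval_model_path)      # placeholder path
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = GPTFuzzer(attack_model, target_model, eval_model, dataset,
...                      max_query=100, max_iteration=100)
>>> attacker.attack()
>>> attacker.log()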

attack()

Main loop for the fuzzing process, repeatedly selecting, mutating, evaluating, and updating.

is_stop()

Check whether the stopping criteria for fuzzing are met. Returns True if any stopping criterion is met, False otherwise.

log()

Display the current attack status.

single_attack(instance: Instance)

Perform an attack using a single query.

Parameters:

instance (Instance) – The instance to be used in the attack. In GPTFuzzer, the instance's jailbreak_prompt is mutated by different methods.

Returns:

A JailbreakDataset containing the responses to the mutated query.

update(Dataset: JailbreakDataset)

Update the state of the fuzzer based on the evaluation results of the prompt nodes.

Parameters:

Dataset (JailbreakDataset) – The prompt nodes that have been evaluated.

ICA_wei_2023

ICA Class

This class implements the In-Context Attack algorithm described in the paper below. The implementation is based on the paper.

Paper title: Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations

arXiv link: https://arxiv.org/pdf/2310.06387.pdf

class easyjailbreak.attacker.ICA_wei_2023.ICA(target_model, jailbreakDatasets: JailbreakDataset, attack_model=None, eval_model=None, max_query: int = 100, max_jailbreak: int = 100, max_reject: int = 100, max_iteration: int = 100, prompt_num: int = 5, user_input: bool = False, pattern_dict=None)

In-Context Attack (ICA) crafts malicious contexts to guide models in generating harmful outputs.
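
Note that ICA takes target_model as its first positional argument, unlike most other attackers in this module. A usage sketch with placeholder paths; prompt_num restates the documented default of 5 in-context demonstrations.

>>> from easyjailbreak.attacker.ICA_wei_2023 import ICA
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> target_model = from_pretrained(target_model_path)  # placeholder path
>>> eval_model = from_pretrained(eval_model_path)      # placeholder path
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = ICA(target_model, dataset, eval_model=eval_model, prompt_num=5)
>>> attacker.attack()
>>> attacker.log()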

attack()

Main loop for the attack process; iterates through jailbreakDatasets.

log()

Report the attack results.

single_attack(sample: Instance)

Conduct a single attack on sample with n-shot attack demonstrations. Split the original jailbreak_prompt by roles and merge the parts into the current conversation_template as in-context demonstrations.

update(Dataset)

Update the state of the attack.

Jailbroken_wei_2023

Jailbroken Class

Jailbroken exploits the competing objectives and mismatched generalization of LLMs to construct 29 hand-crafted jailbreak methods.

Paper title: Jailbroken: How Does LLM Safety Training Fail?

arXiv Link: https://arxiv.org/pdf/2307.02483.pdf

class easyjailbreak.attacker.Jailbroken_wei_2023.Jailbroken(attack_model, target_model, eval_model, Jailbreak_Dataset: JailbreakDataset)

Implementation of the jailbreak challenges described in "Jailbroken: How Does LLM Safety Training Fail?".
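
A usage sketch following the documented constructor; the model paths are placeholders.

>>> from easyjailbreak.attacker.Jailbroken_wei_2023 import Jailbroken
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(attack_model_path)  # placeholder path
>>> target_model = from_pretrained(target_model_path)  # placeholder path
>>> eval_model = from_pretrained(eval_model_path)      # placeholder path
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = Jailbroken(attack_model, target_model, eval_model, dataset)
>>> attacker.attack()
>>> attacker.log()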

attack()

Execute the attack process using provided prompts and mutations.

log()

Report the attack results.

single_attack(instance: Instance) → JailbreakDataset

Execute a single attack using the provided prompts and mutation methods.

Parameters:

instance – The Instance that is attacked.

update(Dataset: JailbreakDataset)

Update the state of the Jailbroken attacker based on the evaluation results of Dataset.

Parameters:

Dataset – The Dataset that is attacked.

Multilingual_Deng_2023

Multilingual Class

This class translates harmful queries from English into nine non-English languages with varying resource levels. In the intentional scenario, malicious users deliberately combine malicious instructions with multilingual prompts to attack LLMs.

Paper title: MULTILINGUAL JAILBREAK CHALLENGES IN LARGE LANGUAGE MODELS

arXiv Link: https://arxiv.org/pdf/2310.06474.pdf

Source repository: https://github.com/DAMO-NLP-SG/multilingual-safety-for-LLMs

class easyjailbreak.attacker.Multilingual_Deng_2023.Multilingual(attack_model, target_model, eval_model, Jailbreak_Dataset: JailbreakDataset)

Multilingual is a class for conducting jailbreak attacks on language models. It can translate harmful queries from English into nine non-English languages.
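
A usage sketch, assuming the same from_pretrained loader and AdvBench dataset as in the TAP example in this section; the paths are placeholders.

>>> from easyjailbreak.attacker.Multilingual_Deng_2023 import Multilingual
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(attack_model_path)  # placeholder path
>>> target_model = from_pretrained(target_model_path)  # placeholder path
>>> eval_model = from_pretrained(eval_model_path)      # placeholder path
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = Multilingual(attack_model, target_model, eval_model, dataset)
>>> attacker.attack()
>>> attacker.log()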

attack()

Execute the attack process using Multilingual Jailbreak in Large Language Models.

log()

Report the attack results.

single_attack(instance: Instance) → JailbreakDataset

Execute the single attack process using provided prompts.

translate_to_en(text, src_lang='auto')

Translate the target response to English.

update(dataset)

Update the state of the Multilingual attacker based on the evaluation results of the dataset.

PAIR_chao_2023

PAIR Class

This module implements the jailbreak method described in the paper below. The implementation is based on the source code released with the paper.

Paper title: Jailbreaking Black Box Large Language Models in Twenty Queries

arXiv link: https://arxiv.org/abs/2310.08419

Source repository: https://github.com/patrickrchao/JailbreakingLLMs

class easyjailbreak.attacker.PAIR_chao_2023.PAIR(attack_model, target_model, eval_model, jailbreak_datasets: JailbreakDataset, template_file=None, attack_max_n_tokens=500, max_n_attack_attempts=5, attack_temperature=1, attack_top_p=0.9, target_max_n_tokens=150, target_temperature=1, target_top_p=1, judge_max_n_tokens=10, judge_temperature=1, n_streams=5, keep_last_n=3, n_iterations=5)
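
A usage sketch for PAIR; the model paths are placeholders, n_streams and n_iterations restate documented defaults, and save_path matches the documented default file name.

>>> from easyjailbreak.attacker.PAIR_chao_2023 import PAIR
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(attack_model_path)  # placeholder path
>>> target_model = from_pretrained(target_model_path)  # placeholder path
>>> eval_model = from_pretrained(eval_model_path)      # placeholder path
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = PAIR(attack_model, target_model, eval_model, dataset,
...                 n_streams=5, n_iterations=5)
>>> attacker.attack(save_path='PAIR_attack_result.jsonl')
>>> attacker.log()
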
attack(save_path='PAIR_attack_result.jsonl')

Execute the PAIR attack over jailbreak_datasets and save the results to save_path.

log()

Report the attack results.

single_attack(instance: Instance)

Perform a single-instance attack, a common use case of the attack method. Returns a JailbreakDataset containing the attack results.

Parameters:

instance (Instance) – The instance to be attacked.

Returns:

The attacked dataset containing the modified instances.

Return type:

JailbreakDataset

update(Dataset: JailbreakDataset)

Update the state of the PAIR attacker based on the evaluation results of Dataset.

TAP_Mehrotra_2023

‘Tree of Attacks’ Recipe

This module implements the jailbreak method described in the paper below. The implementation is based on the source code released with the paper.

Paper title: Tree of Attacks: Jailbreaking Black-Box LLMs Automatically

arXiv link: https://arxiv.org/abs/2312.02119

Source repository: https://github.com/RICommunity/TAP

class easyjailbreak.attacker.TAP_Mehrotra_2023.TAP(attack_model, target_model, eval_model, Dataset: JailbreakDataset, tree_width=10, tree_depth=10, root_num=1, branching_factor=4, keep_last_n=3, max_n_attack_attempts=5, template_file=None, attack_max_n_tokens=500, attack_temperature=1, attack_top_p=0.9, target_max_n_tokens=150, target_temperature=1, target_top_p=1, judge_max_n_tokens=10, judge_temperature=1)

Tree of Attacks (TAP) method, an extension of the PAIR method. It uses four phases: 1. Branching; 2. Pruning (phase 1); 3. Query and Assess; 4. Pruning (phase 2).

>>> from easyjailbreak.attacker.TAP_Mehrotra_2023 import TAP
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> from easyjailbreak.datasets.Instance import Instance
>>> attack_model = from_pretrained(model_path_1)
>>> target_model = from_pretrained(model_path_2)
>>> eval_model  = from_pretrained(model_path_3)
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = TAP(attack_model, target_model, eval_model, dataset)
>>> attacker.attack()
>>> attacker.jailbreak_Dataset.save_to_jsonl("./TAP_results.jsonl")

attack(save_path='TAP_attack_result.jsonl')

Execute the attack process using provided prompts.

log()

Report the attack results.

single_attack(instance) → JailbreakDataset

Conduct an attack for an instance.

Parameters:

instance (Instance) – The Instance that is attacked.

Returns:

The attack result dataset.

Return type:

JailbreakDataset

update(Dataset: JailbreakDataset)

Update the state of the TAP attacker based on the evaluation results of Dataset.

Parameters:

Dataset (JailbreakDataset) – The processed dataset after an iteration.

ReNeLLM_ding_2023

ReNeLLM class

The implementation of the paper “A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily”.

Paper title: A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

arXiv link: https://arxiv.org/pdf/2311.08268.pdf

Source repository: https://github.com/NJUNLP/ReNeLLM

class easyjailbreak.attacker.ReNeLLM_ding_2023.ReNeLLM(attack_model, target_model, eval_model, jailbreakDatasets: JailbreakDataset, evo_max=20)

ReNeLLM is a class for conducting jailbreak attacks on language models. It integrates attack strategies and policies to evaluate and exploit weaknesses in target language models.
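
A usage sketch under the documented signature; the paths are placeholders and evo_max restates the documented default.

>>> from easyjailbreak.attacker.ReNeLLM_ding_2023 import ReNeLLM
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(attack_model_path)  # placeholder path
>>> target_model = from_pretrained(target_model_path)  # placeholder path
>>> eval_model = from_pretrained(eval_model_path)      # placeholder path
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = ReNeLLM(attack_model, target_model, eval_model, dataset, evo_max=20)
>>> attacker.attack()
>>> attacker.log()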

attack()

Execute the attack process using provided prompts.

log()

Report the attack results.

single_attack(instance: Instance) → JailbreakDataset

Conduct an attack for an instance.

Parameters:

instance (Instance) – The Instance that is attacked.

Returns:

The attack result dataset.

Return type:

JailbreakDataset

update(Dataset: JailbreakDataset)

Update the state of the ReNeLLM attacker based on the evaluation results of Dataset.