Attacker Module
This section of the documentation describes the submodules in easyjailbreak.attacker. Each class here initializes a jailbreak method proposed in the corresponding paper.
AutoDAN_Liu_2023
AutoDAN Class
This class implements the jailbreak method described in the paper below. This part of the code is based on the paper's official implementation.
Paper title: AUTODAN: GENERATING STEALTHY JAILBREAK PROMPTS ON ALIGNED LARGE LANGUAGE MODELS
arXiv link: https://arxiv.org/abs/2310.04451
Source repository: https://github.com/SheltonLiu-N/AutoDAN.git
- class easyjailbreak.attacker.AutoDAN_Liu_2023.AutoDAN(attack_model, target_model, jailbreakDatasets: JailbreakDataset, eval_model=None, max_query: int = 100, max_jailbreak: int = 100, max_reject: int = 100, max_iteration: int = 100, device='cuda:0', num_steps: int = 100, sentence_level_steps: int = 5, word_dict_size: int = 30, batch_size: int = 64, num_elites: float = 0.1, crossover_rate: float = 0.5, mutation_rate: float = 0.01, num_points: int = 5, model_name: str = 'llama2', low_memory: int = 0, pattern_dict: dict | None = None)
AutoDAN is a class for conducting jailbreak attacks on language models. It automatically generates stealthy jailbreak prompts via a hierarchical genetic algorithm.
- attack()
Main loop for the attack process; iterates through jailbreakDatasets.
- construct_momentum_word_dictionary(word_dict, individuals, score_list)
Calculate momentum with score_list to maintain a momentum word_dict.
- evaluate_candidate_prompts(sample: Instance, prefix_manager)
Calculate the scores of the current candidate prompts for sample, and obtain the best prompt so far together with its corresponding response.
- get_score_autodan(conv_template, instruction, target, model, device, test_controls=None, crit=None)
Convert all test_controls to token ids and find the maximum length.
- get_score_autodan_low_memory(conv_template, instruction, target, model, device, test_controls=None, crit=None)
Convert all test_controls to token ids and find the maximum length when memory is low.
- log()
Report the attack results.
- replace_with_synonyms(sentence, num=10)
Replace words in sentence with synonyms.
- roulette_wheel_selection(data_list, score_list, num_selected)
Apply roulette wheel selection to data_list.
- single_attack(instance: Instance)
Perform the AutoDAN-HGA algorithm on a single query.
- update(Dataset: JailbreakDataset)
Update the jailbreak state.
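A minimal usage sketch, mirroring the TAP example later in this section; the model paths and the 'AdvBench' dataset name are placeholders, and model_name='llama2' is the documented default:
>>> from easyjailbreak.attacker.AutoDAN_Liu_2023 import AutoDAN
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(model_path_1)
>>> target_model = from_pretrained(model_path_2)
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = AutoDAN(attack_model, target_model, dataset, model_name='llama2')
>>> attacker.attack()
>>> attacker.log()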
Cipher_Yuan_2023
Cipher Class
This class enables humans to chat with LLMs through cipher prompts, topped with system role descriptions and few-shot enciphered demonstrations.
Paper title: GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
arXiv Link: https://arxiv.org/pdf/2308.06463.pdf
Source repository: https://github.com/RobustNLP/CipherChat
- class easyjailbreak.attacker.Cipher_Yuan_2023.Cipher(attack_model, target_model, eval_model, Jailbreak_Dataset: JailbreakDataset)
Cipher is a class for conducting jailbreak attacks on language models. It integrates attack strategies and policies to evaluate and exploit weaknesses in target language models.
- attack()
Execute the attack process using four cipher methods on the entire Jailbreak_Dataset.
- log()
Report the attack results.
- single_attack(instance: Instance) → JailbreakDataset
Conduct four cipher attack methods on a single source instance.
- update(dictionary: dict)
Update the state of the Cipher based on the evaluation results of attack_results.
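A minimal usage sketch following the same pattern as the TAP example below; model paths and the dataset name are placeholders:
>>> from easyjailbreak.attacker.Cipher_Yuan_2023 import Cipher
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(model_path_1)
>>> target_model = from_pretrained(model_path_2)
>>> eval_model = from_pretrained(model_path_3)
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = Cipher(attack_model, target_model, eval_model, dataset)
>>> attacker.attack()
>>> attacker.log()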
DeepInception_Li_2023
DeepInception Class
This class can easily hypnotize an LLM into acting as a jailbreaker and unlock its misuse risks.
Paper title: DeepInception: Hypnotize Large Language Model to Be Jailbreaker
arXiv Link: https://arxiv.org/pdf/2311.03191.pdf
Source repository: https://github.com/tmlr-group/DeepInception
- class easyjailbreak.attacker.DeepInception_Li_2023.DeepInception(attack_model, target_model, eval_model, Jailbreak_Dataset: JailbreakDataset, scene=None, character_number=None, layer_number=None)
DeepInception is a class for conducting jailbreak attacks on language models.
- attack()
Execute the attack process using provided prompts.
- log()
Report the attack results.
- single_attack(instance: Instance) → JailbreakDataset
Conduct a jailbreak attack on a single instance.
- update(Dataset: JailbreakDataset)
Update the state of the DeepInception attacker based on the evaluation results of the dataset.
- Parameters:
Dataset – The Dataset that is attacked.
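A minimal usage sketch (placeholder model paths and dataset name; scene, character_number, and layer_number are left at their documented defaults):
>>> from easyjailbreak.attacker.DeepInception_Li_2023 import DeepInception
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(model_path_1)
>>> target_model = from_pretrained(model_path_2)
>>> eval_model = from_pretrained(model_path_3)
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = DeepInception(attack_model, target_model, eval_model, dataset)
>>> attacker.attack()
>>> attacker.log()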
GCG_Zou_2023
Iteratively optimizes a specific section in the prompt using guidance from token gradients, ensuring that the model produces the desired text.
Paper title: Universal and Transferable Adversarial Attacks on Aligned Language Models
arXiv link: https://arxiv.org/abs/2307.15043
Source repository: https://github.com/llm-attacks/llm-attacks/
- class easyjailbreak.attacker.GCG_Zou_2023.GCG(attack_model: WhiteBoxModelBase, target_model: ModelBase, jailbreak_datasets: JailbreakDataset, jailbreak_prompt_length: int = 20, num_turb_sample: int = 512, batchsize: int | None = None, top_k: int = 256, max_num_iter: int = 500, is_universal: bool = False)
- attack()
Abstract method for performing the attack.
- single_attack(instance: Instance)
Perform a single-instance attack, a common use case of the attack method. Returns a JailbreakDataset containing the attack results.
- Parameters:
instance (Instance) – The instance to be attacked.
- Returns:
The attacked dataset containing the modified instances.
- Return type:
JailbreakDataset
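A minimal usage sketch; GCG needs white-box (gradient) access to the attack model, so both models are assumed to be loaded locally here. Model paths and the dataset name are placeholders, and max_num_iter=500 is the documented default:
>>> from easyjailbreak.attacker.GCG_Zou_2023 import GCG
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(model_path_1)
>>> target_model = from_pretrained(model_path_2)
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = GCG(attack_model, target_model, dataset, max_num_iter=500)
>>> attacker.attack()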
Gptfuzzer_yu_2023
GPTFuzzer Class
This class implements the jailbreak method described in the paper below. This part of the code is based on the paper's official implementation.
Paper title: GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
arXiv link: https://arxiv.org/pdf/2309.10253.pdf
Source repository: https://github.com/sherdencooper/GPTFuzz
- class easyjailbreak.attacker.Gptfuzzer_yu_2023.GPTFuzzer(attack_model, target_model, eval_model, jailbreakDatasets: JailbreakDataset | None = None, energy: int = 1, max_query: int = 100, max_jailbreak: int = 100, max_reject: int = 100, max_iteration: int = 100, seeds_num=76)
GPTFuzzer is a class for performing fuzzing attacks on LLM-based models. It utilizes mutator and selection policies to generate jailbreak prompts, aiming to find vulnerabilities in target models.
- attack()
Main loop for the fuzzing process, repeatedly selecting, mutating, evaluating, and updating.
- is_stop()
Check if the stopping criteria for fuzzing are met. Returns True if any stopping criterion is met, False otherwise.
- log()
Display the current attack status.
- single_attack(instance: Instance)
Perform an attack using a single query. In GPTFuzzer, the instance's jailbreak_prompt is mutated by different methods. Returns a JailbreakDataset containing the responses to the mutated queries.
- update(Dataset: JailbreakDataset)
Update the state of the fuzzer based on the evaluation results of the prompt nodes.
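A minimal usage sketch (placeholder model paths and dataset name; max_query and max_iteration are the documented defaults):
>>> from easyjailbreak.attacker.Gptfuzzer_yu_2023 import GPTFuzzer
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(model_path_1)
>>> target_model = from_pretrained(model_path_2)
>>> eval_model = from_pretrained(model_path_3)
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = GPTFuzzer(attack_model, target_model, eval_model, dataset, max_query=100, max_iteration=100)
>>> attacker.attack()
>>> attacker.log()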
ICA_wei_2023
ICA Class
This class executes the In-Context Attack algorithm described in the paper below. This part of the code is based on the paper.
Paper title: Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
arXiv link: https://arxiv.org/pdf/2310.06387.pdf
- class easyjailbreak.attacker.ICA_wei_2023.ICA(target_model, jailbreakDatasets: JailbreakDataset, attack_model=None, eval_model=None, max_query: int = 100, max_jailbreak: int = 100, max_reject: int = 100, max_iteration: int = 100, prompt_num: int = 5, user_input: bool = False, pattern_dict=None)
In-Context Attack (ICA) crafts malicious contexts to guide models into generating harmful outputs.
- attack()
Main loop for the attack process; iterates through jailbreakDatasets.
- log()
Report the attack results.
- single_attack(sample: Instance)
Conduct a single attack on sample with n-shot attack demonstrations. Split the original jailbreak_prompt by roles and merge them into the current conversation_template as in-context demonstrations.
- update(Dataset)
Update the state of the attack.
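A minimal usage sketch. Note that ICA takes the target model first and leaves the attack and eval models optional; the model path and dataset name are placeholders, and prompt_num=5 is the documented default:
>>> from easyjailbreak.attacker.ICA_wei_2023 import ICA
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> target_model = from_pretrained(model_path_1)
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = ICA(target_model, dataset, prompt_num=5)
>>> attacker.attack()
>>> attacker.log()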
Jailbroken_wei_2023
Jailbroken Class
Jailbroken exploits the competing objectives and mismatched generalization of LLMs to construct 29 artificial jailbreak methods.
Paper title: Jailbroken: How Does LLM Safety Training Fail?
arXiv Link: https://arxiv.org/pdf/2307.02483.pdf
- class easyjailbreak.attacker.Jailbroken_wei_2023.Jailbroken(attack_model, target_model, eval_model, Jailbreak_Dataset: JailbreakDataset)
Implementation of Jailbroken Jailbreak Challenges in Large Language Models
- attack()
Execute the attack process using provided prompts and mutations.
- log()
Report the attack results.
- single_attack(instance: Instance) → JailbreakDataset
Execute a single attack process using provided prompts and mutation methods.
- Parameters:
instance – The Instance that is attacked.
- update(Dataset: JailbreakDataset)
Update the state of the Jailbroken based on the evaluation results of Datasets.
- Parameters:
Dataset – The Dataset that is attacked.
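A minimal usage sketch with placeholder model paths and dataset name:
>>> from easyjailbreak.attacker.Jailbroken_wei_2023 import Jailbroken
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(model_path_1)
>>> target_model = from_pretrained(model_path_2)
>>> eval_model = from_pretrained(model_path_3)
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = Jailbroken(attack_model, target_model, eval_model, dataset)
>>> attacker.attack()
>>> attacker.log()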
Multilingual_Deng_2023
Multilingual Class
This class translates harmful queries from English into nine non-English languages with varying resource levels. In the intentional scenario, malicious users deliberately combine malicious instructions with multilingual prompts to attack LLMs.
Paper title: MULTILINGUAL JAILBREAK CHALLENGES IN LARGE LANGUAGE MODELS
arXiv Link: https://arxiv.org/pdf/2310.06474.pdf
Source repository: https://github.com/DAMO-NLP-SG/multilingual-safety-for-LLMs
- class easyjailbreak.attacker.Multilingual_Deng_2023.Multilingual(attack_model, target_model, eval_model, Jailbreak_Dataset: JailbreakDataset)
Multilingual is a class for conducting jailbreak attacks on language models. It can translate harmful queries from English into nine non-English languages.
- attack()
Execute the attack process using Multilingual Jailbreak in Large Language Models.
- log()
Report the attack results.
- single_attack(instance: Instance) → JailbreakDataset
Execute the single attack process using provided prompts.
- translate_to_en(text, src_lang='auto')
Translate target response to English.
- update(dataset)
Update the state of the Multilingual attacker based on the evaluation results of the dataset.
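A minimal usage sketch with placeholder model paths and dataset name:
>>> from easyjailbreak.attacker.Multilingual_Deng_2023 import Multilingual
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(model_path_1)
>>> target_model = from_pretrained(model_path_2)
>>> eval_model = from_pretrained(model_path_3)
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = Multilingual(attack_model, target_model, eval_model, dataset)
>>> attacker.attack()
>>> attacker.log()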
PAIR_chao_2023
PAIR Class
This module implements the jailbreak method described in the paper below. This part of the code is based on the paper's official implementation.
Paper title: Jailbreaking Black Box Large Language Models in Twenty Queries
arXiv link: https://arxiv.org/abs/2310.08419
Source repository: https://github.com/patrickrchao/JailbreakingLLMs
- class easyjailbreak.attacker.PAIR_chao_2023.PAIR(attack_model, target_model, eval_model, jailbreak_datasets: JailbreakDataset, template_file=None, attack_max_n_tokens=500, max_n_attack_attempts=5, attack_temperature=1, attack_top_p=0.9, target_max_n_tokens=150, target_temperature=1, target_top_p=1, judge_max_n_tokens=10, judge_temperature=1, n_streams=5, keep_last_n=3, n_iterations=5)
- attack(save_path='PAIR_attack_result.jsonl')
Abstract method for performing the attack.
- log()
Report the attack results.
- single_attack(instance: Instance)
Perform a single-instance attack, a common use case of the attack method. Returns a JailbreakDataset containing the attack results.
- Parameters:
instance (Instance) – The instance to be attacked.
- Returns:
The attacked dataset containing the modified instances.
- Return type:
JailbreakDataset
- update(Dataset: JailbreakDataset)
Update the state of the PAIR attacker based on the evaluation results of the dataset.
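A minimal usage sketch; model paths and the dataset name are placeholders, n_streams and n_iterations are the documented defaults, and save_path is the documented default output file:
>>> from easyjailbreak.attacker.PAIR_chao_2023 import PAIR
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(model_path_1)
>>> target_model = from_pretrained(model_path_2)
>>> eval_model = from_pretrained(model_path_3)
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = PAIR(attack_model, target_model, eval_model, dataset, n_streams=5, n_iterations=5)
>>> attacker.attack(save_path='PAIR_attack_result.jsonl')
>>> attacker.log()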
TAP_Mehrotra_2023
‘Tree of Attacks’ Recipe
This module implements the jailbreak method described in the paper below. This part of the code is based on the paper's official implementation.
Paper title: Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
arXiv link: https://arxiv.org/abs/2312.02119
Source repository: https://github.com/RICommunity/TAP
- class easyjailbreak.attacker.TAP_Mehrotra_2023.TAP(attack_model, target_model, eval_model, Dataset: JailbreakDataset, tree_width=10, tree_depth=10, root_num=1, branching_factor=4, keep_last_n=3, max_n_attack_attempts=5, template_file=None, attack_max_n_tokens=500, attack_temperature=1, attack_top_p=0.9, target_max_n_tokens=150, target_temperature=1, target_top_p=1, judge_max_n_tokens=10, judge_temperature=1)
Tree of Attacks method, an extension of the PAIR method. It uses four phases: 1. Branching; 2. Pruning (phase 1); 3. Query and Assess; 4. Pruning (phase 2).
>>> from easyjailbreak.attacker.TAP_Mehrotra_2023 import TAP
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> from easyjailbreak.datasets.Instance import Instance
>>> attack_model = from_pretrained(model_path_1)
>>> target_model = from_pretrained(model_path_2)
>>> eval_model = from_pretrained(model_path_3)
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = TAP(attack_model, target_model, eval_model, dataset)
>>> attacker.attack()
>>> attacker.jailbreak_Dataset.save_to_jsonl("./TAP_results.jsonl")
- attack(save_path='TAP_attack_result.jsonl')
Execute the attack process using provided prompts.
- log()
Report the attack results.
- single_attack(instance) → JailbreakDataset
Conduct an attack for an instance.
- Parameters:
instance (Instance) – The Instance that is attacked.
- Returns:
The attack result dataset.
- Return type:
JailbreakDataset
- update(Dataset: JailbreakDataset)
Update the state of the TAP attacker based on the evaluation results of the dataset.
- Parameters:
Dataset (JailbreakDataset) – the processed dataset after an iteration.
ReNeLLM_ding_2023
ReNeLLM class
The implementation of the paper "A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily".
Paper title: A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily
arXiv link: https://arxiv.org/pdf/2311.08268.pdf
Source repository: https://github.com/NJUNLP/ReNeLLM
- class easyjailbreak.attacker.ReNeLLM_ding_2023.ReNeLLM(attack_model, target_model, eval_model, jailbreakDatasets: JailbreakDataset, evo_max=20)
ReNeLLM is a class for conducting jailbreak attacks on language models. It integrates attack strategies and policies to evaluate and exploit weaknesses in target language models.
- attack()
Execute the attack process using provided prompts.
- log()
Report the attack results.
- single_attack(instance: Instance) → JailbreakDataset
Conduct an attack for an instance.
- Parameters:
instance (Instance) – The Instance that is attacked.
- Returns:
The attack result dataset.
- Return type:
JailbreakDataset
- update(Dataset: JailbreakDataset)
Update the state of the ReNeLLM attacker based on the evaluation results of the dataset.
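A minimal usage sketch with placeholder model paths and dataset name; evo_max=20 is the documented default:
>>> from easyjailbreak.attacker.ReNeLLM_ding_2023 import ReNeLLM
>>> from easyjailbreak.models.huggingface_model import from_pretrained
>>> from easyjailbreak.datasets.jailbreak_datasets import JailbreakDataset
>>> attack_model = from_pretrained(model_path_1)
>>> target_model = from_pretrained(model_path_2)
>>> eval_model = from_pretrained(model_path_3)
>>> dataset = JailbreakDataset('AdvBench')
>>> attacker = ReNeLLM(attack_model, target_model, eval_model, dataset, evo_max=20)
>>> attacker.attack()
>>> attacker.log()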