Metric Module
This section documents the metric classes in easyjailbreak.metrics.Metric, which score attack results to produce the final evaluation figures.
metric_ASR
Metrics on AttackSuccessRate
This module contains the implementation of the AttackSuccessRate metric, which is designed to evaluate the effectiveness of jailbreak attacks in a dataset. It calculates the number of successful and failed attacks, and computes the overall attack success rate.
- class easyjailbreak.metrics.Metric.metric_ASR.AttackSuccessRate
A metric to evaluate the success rate of jailbreak attacks. It calculates the number of successful and failed attacks within a dataset, and determines the overall attack success rate.
- calculate(dataset: JailbreakDataset)
Calculate the attack success rate from the given dataset.
- Parameters:
dataset (~JailbreakDataset) – The dataset containing jailbreak attack results.
- Return dict:
A dictionary containing the number of successful attacks, failed attacks, and the attack success rate.
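The computation described above can be sketched without the library. This is a minimal illustration, not the class's actual implementation: the function name, the dictionary keys, and the use of a plain list of booleans in place of a JailbreakDataset are all assumptions made for the example.

```python
from typing import Dict, Iterable, Union


def attack_success_rate(outcomes: Iterable[bool]) -> Dict[str, Union[int, float]]:
    """Summarize attack outcomes (hypothetical helper, not the library API).

    Each element of `outcomes` is True if the attack on that instance
    succeeded and False otherwise.
    """
    outcomes = list(outcomes)
    successful = sum(1 for ok in outcomes if ok)
    failed = len(outcomes) - successful
    # Guard against an empty dataset to avoid division by zero.
    rate = successful / len(outcomes) if outcomes else 0.0
    return {
        "successful_attacks": successful,  # assumed key name
        "failed_attacks": failed,          # assumed key name
        "attack_success_rate": rate,       # assumed key name
    }


summary = attack_success_rate([True, False, True, True])
print(summary)
```

Returning 0.0 for an empty input is a design choice of this sketch; the library may handle that case differently.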
metric_perplexit
Perplexity Metric:
Class for calculating perplexity over a JailbreakDataset.
- class easyjailbreak.metrics.Metric.metric_perplexit.Perplexity(model: WhiteBoxModelBase, max_length=512, stride=512)
- calculate(dataset: JailbreakDataset)
Calculates the average perplexity of the final prompts generated by the attacker, using a pre-trained small GPT-2 model.
- Parameters:
dataset (JailbreakDataset) – List of instances with attack results.
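The max_length and stride parameters suggest a sliding-window evaluation, in which long prompts are scored in overlapping windows. The sketch below illustrates that general technique in plain Python and is not taken from the library's source: the helper names window_spans and perplexity are hypothetical, and the per-token log-probabilities are assumed to come from whatever language model is in use.

```python
import math
from typing import Iterator, List, Tuple


def window_spans(n_tokens: int, max_length: int = 512,
                 stride: int = 512) -> Iterator[Tuple[int, int, int]]:
    """Yield (begin, end, n_scored) for each evaluation window.

    n_scored is the number of trailing tokens in the window that no
    earlier window has scored; the remaining tokens serve only as context.
    """
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_length, n_tokens)
        yield begin, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break


def perplexity(token_log_probs: List[float]) -> float:
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# With stride < max_length, windows overlap and each token is scored once.
spans = list(window_spans(1000, max_length=512, stride=256))
print(spans)

# A model that assigns probability 0.25 to every token has perplexity 4.
ppl = perplexity([math.log(0.25)] * 4)
print(ppl)
```

With the default stride equal to max_length, the windows do not overlap, which is cheaper but gives tokens at the start of each window less context.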