Metric Module

This part of the document introduces the modules in easyjailbreak.metrics.Metric, which score attack results to compute the final evaluation.

metric_ASR

Metrics on AttackSuccessRate

This module contains the implementation of the AttackSuccessRate metric, which evaluates the effectiveness of the jailbreak attacks recorded in a dataset. It counts the successful and failed attacks and computes the overall attack success rate.

class easyjailbreak.metrics.Metric.metric_ASR.AttackSuccessRate

A metric to evaluate the success rate of jailbreak attacks. It calculates the number of successful and failed attacks within a dataset, and determines the overall attack success rate.

calculate(dataset: JailbreakDataset)

Calculate the attack success rate from the given dataset.

Parameters:

dataset (JailbreakDataset) – The dataset containing jailbreak attack results.

Return dict:

A dictionary containing the number of successful attacks, failed attacks, and the attack success rate.
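The computation described above can be sketched in a few lines. This is only an illustration, not the library's implementation: it assumes each attack outcome has already been reduced to a boolean, and the function name attack_success_rate and the dictionary keys are hypothetical.

```python
from typing import Dict, Iterable, Union


def attack_success_rate(outcomes: Iterable[bool]) -> Dict[str, Union[int, float]]:
    """Count successes and failures and return their ratio.

    `outcomes` is a hypothetical stand-in for the per-instance evaluation
    results stored in a JailbreakDataset (True means the attack succeeded).
    """
    outcomes = list(outcomes)
    total = len(outcomes)
    successes = sum(1 for ok in outcomes if ok)
    failures = total - successes
    # Guard against an empty dataset to avoid division by zero.
    rate = successes / total if total else 0.0
    return {
        "successful_attacks": successes,  # hypothetical key name
        "failed_attacks": failures,       # hypothetical key name
        "attack_success_rate": rate,      # hypothetical key name
    }


result = attack_success_rate([True, False, True, True])
empty_result = attack_success_rate([])
```

With three successes out of four attacks, the rate in this sketch is 0.75; an empty input yields 0.0 rather than raising an error, which is a design choice of the sketch and may differ from the library's behavior.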

metric_perplexit

Perplexity Metric:

A class for calculating perplexity over a JailbreakDataset.

class easyjailbreak.metrics.Metric.metric_perplexit.Perplexity(model: WhiteBoxModelBase, max_length=512, stride=512)

calculate(dataset: JailbreakDataset)

Calculates the average perplexity of the final prompts generated by the attacker. The prompts are scored with a small pre-trained GPT-2 model.

Parameters:

dataset (JailbreakDataset) – The dataset of instances containing attack results.
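To make the roles of max_length and stride concrete, the following sketch shows one common way to compute perplexity over long prompts with a sliding window. It is an assumption that Perplexity works this way; the source does not confirm it. The helpers sliding_windows, perplexity and average_perplexity are hypothetical, and the per-token negative log-likelihoods are supplied directly instead of being produced by a model.

```python
import math
from typing import Iterator, List, Tuple


def sliding_windows(
    seq_len: int, max_length: int = 512, stride: int = 512
) -> Iterator[Tuple[int, int, int]]:
    """Yield (begin, end, n_scored) for each window over a token sequence.

    Each window spans at most `max_length` tokens and starts `stride` tokens
    after the previous one. Only the `n_scored` tokens not covered by an
    earlier window are scored, so no token is counted twice.
    """
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        yield begin, end, end - prev_end
        prev_end = end
        if end == seq_len:
            break


def perplexity(token_nlls: List[float]) -> float:
    """Perplexity is exp of the mean per-token negative log-likelihood."""
    if not token_nlls:
        raise ValueError("at least one token is required")
    return math.exp(sum(token_nlls) / len(token_nlls))


def average_perplexity(per_prompt_nlls: List[List[float]]) -> float:
    """Average the perplexities of several prompts."""
    return sum(perplexity(nlls) for nlls in per_prompt_nlls) / len(per_prompt_nlls)


windows_default = list(sliding_windows(1000, max_length=512, stride=512))
windows_overlap = list(sliding_windows(1000, max_length=512, stride=256))
ppl = perplexity([math.log(2.0)] * 4)
```

With the default values the windows do not overlap. Choosing a stride smaller than max_length gives each scored token more left context, at the cost of more forward passes through the model.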