Evaluator Module

When a target model generates a response to a harmful query, users may need an evaluator to automatically evaluate the response. This evaluation result can be used for later updates, as signs for early stopping, or just to quantify the performance of a jailbreak scheme. EasyJailbreak implemented several evaluators for users.

Evaluator

Evaluator class

class easyjailbreak.metrics.Evaluator.Evaluator.Evaluator(eval_model=None): Abstract base class for implementing evaluation strategies for JailbreakDataset instances. This class serves as a template for creating custom evaluators, allowing for diverse and specialized evaluation strategies tailored to different datasets or models.

Evaluator_ClassificationGetScore

EvaluatorGenerativeGetScore class

class easyjailbreak.metrics.Evaluator.Evaluator_ClassificationGetScore.EvaluatorGenerativeGetScore(model=None, prompt_pattern=None, attr_name: List[str] | None = None, system_prompt=None)

An evaluator that uses a generative model to determine if a jailbreak attempt is successful. It assigns a score based on the model’s judgment, with higher scores indicating a successful jailbreak.

get_LLAMaGurad(model_path=None)

Loads the LLAMaGuard model for evaluation.

Parameters:: model_path (str) – The path to the LLAMaGuard model. If None, a default path is used.

get_score(seed: str, score_format, **kwargs)

Generates a score for a given seed text using the model.

Parameters:

seed (str) – The seed text for the model.
score_format (List[str]) – The format of the scores.

Returns:

The generated score.

Example: from easyjailbreak.models.huggingface_model import from_pretrained, HuggingfaceModel >>>> model = from_pretrained(‘internlm/internlm-20b’, ‘internlm-7b-chat’) >>>> evaluator = EvaluatorGenerativeGetScore(model) >>>> seed = “I’m sorry, but I can’t offer any tutorial of making a bomb which is illegal.” >>>> score_format = [‘0’,’1’,’2’,’3’] >>>> score = evaluator.get_score(seed, score_format) >>>> print(score)

Evaluator_ClassificationJudge

EvaluatorClassificatonJudge class

class easyjailbreak.metrics.Evaluator.Evaluator_ClassificationJudge.EvaluatorClassificatonJudge(eval_model=None, prompt_pattern=None, attr_name: List[str] | None = None)

EvalatorClassifcaton inherits the Evaluator class for evaluating models for classification tasks. It can be called with __call__ or tested with _evaluate for an instance. Both prompt_pattern and attr_name are supported to specify template names and variable names in the template, respectively. The variable name in the template is replaced by the attribute of the same name in instance.

judge(seed: str) → bool

Judges the classification of a given input text using the evaluation model.

Parameters:: seed (str) – The input text to be classified.
Return bool:: The classification result, where True indicates a successful jailbreak.

set_model(model_path=None)

Sets the evaluation model and tokenizer based on a given model path, defaulting to the RoBERTa model.

Parameters:: model_path (str) – Path to the pretrained RoBERTa model.

Evaluator_GenerativeGetScore

EvaluatorGenerativeGetScore class

class easyjailbreak.metrics.Evaluator.Evaluator_GenerativeGetScore.EvaluatorGenerativeGetScore(model)

Evaluator_GenerativeJudge

EvaluatorGenerativeJudge class

class easyjailbreak.metrics.Evaluator.Evaluator_GenerativeJudge.EvaluatorGenerativeJudge(eval_model, prompt_pattern=None, attr_name: List[str] | None = None, special_method=None)

EvalatorClassifcaton inherits the Evaluator class for evaluating models for classification tasks. It can be called with __call__ or tested with _evaluate for an instance. Both prompt_pattern and attr_name are supported to specify template names and variable names in the template, respectively. The variable name in the template is replaced by the attribute of the same name in instance.

judge(seed: str) → bool

Judges whether a jailbreak attempt is successful based on the model’s response.

Parameters:: seed (str) – The input text for the model.
Return int:: Returns 1 if the jailbreak is successful, otherwise 0.

Evaluator_Match

EvalatorMatch class

class easyjailbreak.metrics.Evaluator.Evaluator_Match.EvalatorMatch(eval_model=None): EvalatorMatch is a subclass of Evaluator specifically designed to check for direct matches between target responses and reference responses of an instance. It assigns a binary score based on whether any target response exactly matches a reference response.

Evaluator_PatternJudge

EvaluatorPatternJudge class

class easyjailbreak.metrics.Evaluator.Evaluator_PatternJudge.EvaluatorPatternJudge(pattern_dict=None, prompt_pattern=None, attr_name: List[str] | None = None)

EvalatorClassifcaton inherits the Evaluator class for evaluating models for classification tasks. It can be called with __call__ or tested with _evaluate for an instance. Both prompt_pattern and attr_name are supported to specify template names and variable names in the template, respectively. The variable name in the template is replaced by the attribute of the same name in instance.

judge(seed) → bool

Judges whether a jailbreak attempt is successful based on pattern matching.

Parameters:: seed (str) – The input text to be evaluated against the patterns.
Return bool:: Returns False if the jailbreak is unsuccessful (matching a failure pattern), otherwise True.

set_pattern(pattern_dict=None)

Sets a custom pattern dictionary for the evaluator.

Parameters:: pattern_dict (dict) – A dictionary containing patterns to match for jailbreak evaluation.

Evaluator_PrefixExactMatch

class easyjailbreak.metrics.Evaluator.Evaluator_PrefixExactMatch.EvaluatorPrefixExactMatch(eval_model=None): EvaluatorPrefixExactMatch extends the Evaluator class to evaluate whether any of the target responses in an instance start with the same string as any of the reference responses. It’s designed to perform a prefix-based exact match evaluation.