FEAT: Jailbreak Scenario #1329
Thanks @ValbuenaVC for picking this up! One improvement I had in mind was to create more strategies by running the different groups of jailbreaks we have in PyRIT. Right now I have only the one at the root of the directory, but we added quite a few more recently, and it would make sense to have one strategy per folder (and ALL to run them all).
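The "one strategy per folder, plus ALL" idea could be sketched roughly as below. This is only an illustration using the standard library: the function name `strategies_from_folders`, the `ROOT` group name, and the assumption that templates are `.yaml` files grouped by immediate subfolder are all hypothetical and not taken from this PR.

```python
from pathlib import Path
from typing import Dict, List


def strategies_from_folders(root: Path) -> Dict[str, List[Path]]:
    """Map a strategy name to the template files it would run (hypothetical layout)."""
    strategies: Dict[str, List[Path]] = {}

    # Templates sitting directly at the root form their own group.
    root_templates = sorted(root.glob("*.yaml"))
    if root_templates:
        strategies["ROOT"] = root_templates

    # Every immediate subfolder that contains templates becomes one strategy.
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        templates = sorted(folder.rglob("*.yaml"))
        if templates:
            strategies[folder.name.upper()] = templates

    # ALL is the union of every group collected above.
    strategies["ALL"] = sorted(t for group in strategies.values() for t in group)
    return strategies
```

Empty folders are skipped, so a strategy only exists when there is at least one template to run for it.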
fdubut left a comment
LGTM!
@@ -0,0 +1,16 @@
dataset_name: airt_jailbreak
Could we potentially have a more descriptive name? "Jailbreak" has a different meaning in PyRIT. Potentially "airt_jailbreak_scenario".
Or really "airt_harms.prompt" is also good
# Will be resolved in _get_atomic_attacks_async
self._seed_groups: Optional[List[SeedAttackGroup]] = None

def _get_default_objective_scorer(self) -> TrueFalseScorer:
Not for this PR, but wondering if we should just make _get_default_objective_scorer a non-abstract method on the base class.
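To make the suggestion concrete, here is a minimal sketch of what a concrete default on the base class might look like. `Scorer`, `ScenarioBase`, `DefaultScenario`, and `CustomScenario` are hypothetical stand-ins written for illustration; they are not PyRIT's actual classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorer:
    """Hypothetical stand-in for a true/false objective scorer."""
    name: str


class ScenarioBase:
    """Hypothetical base class that ships a concrete default scorer."""

    def _get_default_objective_scorer(self) -> Scorer:
        # Subclasses inherit this unless they need something different.
        return Scorer(name="default")


class DefaultScenario(ScenarioBase):
    """Needs no override: the base implementation is used as-is."""


class CustomScenario(ScenarioBase):
    def _get_default_objective_scorer(self) -> Scorer:
        # Overriding stays possible for scenarios with special needs.
        return Scorer(name="custom")
```

The trade-off is that subclasses no longer get an error when they forget to choose a scorer, so the base default would need to be a sensible one.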
return list(seed_groups)

def _get_all_jailbreak_templates(self) -> List[str]:
I recommend using/extending the TextJailBreak class instead of looking up the YAML files directly.
I also wonder if the set of jailbreaks could be filtered further by the scenario strategy, so it's not necessarily always "all". It could be a random N, a subcategory, or some other criterion.
This is probably important so we can have shorter or more targeted runs.
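A rough sketch of such filtering, assuming templates are identified by relative POSIX-style paths such as `dan/dan_1.yaml`. The helper `select_templates` and its parameters are hypothetical and only illustrate the "subcategory" and "random N" options mentioned above.

```python
import random
from pathlib import PurePosixPath
from typing import List, Optional, Sequence


def select_templates(
    templates: Sequence[str],
    *,
    subcategory: Optional[str] = None,
    max_count: Optional[int] = None,
    seed: Optional[int] = None,
) -> List[str]:
    """Narrow a template list by folder name and/or a random sample (hypothetical helper)."""
    # Keep only templates that live under the requested folder, if one is given.
    selected = [
        t for t in templates
        if subcategory is None or subcategory in PurePosixPath(t).parts[:-1]
    ]
    # Optionally cap the run at max_count templates, sampled without replacement.
    if max_count is not None and max_count < len(selected):
        selected = random.Random(seed).sample(selected, max_count)
    return selected
```

Passing a `seed` keeps a shortened run reproducible, which helps when comparing results across targets.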
)

# Create the attack
attack = PromptSendingAttack(
(Not required) I wonder if we should offer an option to send each prompt multiple times.
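One way a "send N times" option might look, sketched with plain `asyncio` rather than any PyRIT API. The `send_repeatedly` helper and the `send` callable are hypothetical; a real implementation would go through the attack class instead.

```python
import asyncio
from typing import Awaitable, Callable, List


async def send_repeatedly(
    send: Callable[[str], Awaitable[str]],
    prompt: str,
    attempts: int = 1,
) -> List[str]:
    """Send the same prompt `attempts` times and collect every response (hypothetical helper)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    # The sends are scheduled concurrently; results come back in call order.
    return list(await asyncio.gather(*(send(prompt) for _ in range(attempts))))
```

Because model responses are non-deterministic, repeating a prompt can surface a jailbreak that succeeds only some of the time. A rate-limited target might call for sequential sends instead.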
Description
Adds a jailbreak scenario to PyRIT that applies jailbreak templates to a set of test prompts and sends the resulting prompts to the target. Credit to @fdubut for developing the scenario.
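Conceptually, applying templates to prompts is a cross product: every template is combined with every test prompt. The sketch below illustrates that idea with plain string substitution. The `{{ prompt }}` placeholder and the `apply_templates` helper are assumptions made for illustration, not the scenario's actual code.

```python
from itertools import product
from typing import List, Sequence

# Assumed placeholder syntax; the real templates may use a different marker.
PLACEHOLDER = "{{ prompt }}"


def apply_templates(templates: Sequence[str], prompts: Sequence[str]) -> List[str]:
    """Return one rendered prompt per (template, test prompt) pair."""
    return [
        template.replace(PLACEHOLDER, prompt)
        for template, prompt in product(templates, prompts)
    ]
```

With T templates and P test prompts this yields T × P prompts to send, which is why limiting the number of templates matters for run time.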
Tests and Documentation
Adding `test_jailbreak.py` under the unit tests.