[DRAFT] Simple Adaptive Jailbreaking #537

Draft: wants to merge 7 commits into base: main

Conversation

donebydan

Description

#266
@romanlutz @rlundeen2

Tests and Documentation

TODO

Author


To be converted into a notebook for documentation.

Contributor


Should live under doc/code/orchestrators

Contributor


Or auxiliary_attacks instead of orchestrators. The more I think about it, the more I feel it's more like GCG than the other orchestrators.

Author


This prompt isn't used yet but could be later. It comes directly from the paper.

@@ -219,20 +223,28 @@ async def _complete_chat_async(self, messages: list[ChatMessageListDictContent])
response: ChatCompletion = await self._async_client.chat.completions.create(
model=self._deployment_name,
max_completion_tokens=self._max_completion_tokens,
max_tokens=self._max_tokens,
max_tokens=self._max_tokens, # TODO: this is given as NOT_GIVEN?
Author


This TODO comes from the inability to alter the prompt target's max tokens from the orchestrator. I need to test.

temperature=self._temperature,
top_p=self._top_p,
frequency_penalty=self._frequency_penalty,
presence_penalty=self._presence_penalty,
logprobs=self._logprobs,
top_logprobs=self._top_logprobs, # TODO: when called and logprobs is False, this param is not passed
Author


This TODO is a bit trickier. When logprobs is None, we shouldn't pass top_logprobs at all.
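One way to resolve both of the TODOs above is to build the keyword arguments conditionally, so that max_tokens, logprobs, and top_logprobs are only sent when they were actually set. A minimal sketch of that idea (the helper name and parameter set are hypothetical, not PyRIT's API):

```python
def build_completion_kwargs(max_tokens=None, logprobs=None, top_logprobs=None):
    """Include only the parameters that were explicitly set.

    Hypothetical helper: top_logprobs is only valid when logprobs is
    enabled, so it is dropped otherwise.
    """
    kwargs = {}
    if max_tokens is not None:
        kwargs["max_tokens"] = max_tokens
    if logprobs:
        kwargs["logprobs"] = logprobs
        if top_logprobs is not None:
            kwargs["top_logprobs"] = top_logprobs
    return kwargs
```

The resulting dict could then be splatted into chat.completions.create(**kwargs); the OpenAI SDK's NOT_GIVEN sentinel achieves the same "omit if unset" effect without building a dict by hand.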

Contributor

@romanlutz left a comment


Looking great! Thanks for putting in all this work.

Contributor


Should live under doc/code/orchestrators

@@ -121,6 +122,8 @@ def __init__(
# Original prompt id defaults to id (assumes that this is the original prompt, not a duplicate)
self.original_prompt_id = original_prompt_id or self.id

self.logprobs = logprobs
Contributor


Note for @rlundeen2 / @rdheekonda : We can create a follow-up task to add this to the DB schema

@@ -119,6 +119,7 @@ def construct_response_from_request(
response_type: PromptDataType = "text",
prompt_metadata: Optional[str] = None,
error: PromptResponseError = "none",
logprobs: Optional[dict] = None,
Contributor


Suggested change
logprobs: Optional[dict] = None,
logprobs: Optional[Dict[str, float]] = None,

Is that right?
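For context, the suggested shape is a token-to-logprob mapping, which also makes the later sum(logprob_dict.values()) straightforward. A small illustrative sketch (the pair list is a simplification of the SDK's response object, not the exact API):

```python
from typing import Dict


def to_logprob_dict(token_logprobs) -> Dict[str, float]:
    """Collapse (token, logprob) pairs into the Dict[str, float] shape.

    Caveat: duplicate tokens overwrite each other in a dict, which is
    one argument for keeping a list of pairs instead.
    """
    return {token: lp for token, lp in token_logprobs}


pairs = [("Sure", -0.1), (",", -0.5), (" here", -0.3)]
d = to_logprob_dict(pairs)
sequence_logprob = sum(d.values())  # the quantity tracked as _best_logprobs
```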

@@ -203,6 +203,7 @@ def _parse_attacker_response(self, *, response: PromptRequestResponse) -> str:
attacker_suggested_prompt = json_response["prompt"]
except (json.JSONDecodeError, KeyError):
# This forces the @pyrit_json_retry decorator to retry the function
breakpoint()
Contributor


These will have to go 😄


class SimpleAdaptiveOrchestrator(Orchestrator):
"""
This orchestrator implements the Prompt Automatic Iterative Refinement (PAIR) algorithm
Contributor


I suspect your notebook will elaborate but a paragraph on the intuition behind the algorithm wouldn't hurt 🙂
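For reference while the docstring is being fleshed out: the core loop of this kind of adaptive attack is a random search over an adversarial suffix, keeping a mutation only if it raises the score — in the paper, the log-probability that the target begins its reply with the desired prefix (e.g. "Sure"). A heavily simplified sketch with a stand-in scoring function (everything here is illustrative, not the orchestrator's real interface):

```python
import random


def _mutate(suffix, rng, alphabet="!?*&%"):
    """Replace one randomly chosen character of the suffix."""
    i = rng.randrange(len(suffix))
    return suffix[:i] + rng.choice(alphabet) + suffix[i + 1:]


def random_search_attack(prompt, suffix, score_fn, iterations=100, seed=0):
    """Greedy random search: keep a suffix mutation only if it improves
    the score (a stand-in for the summed logprob of the desired
    target-response prefix)."""
    rng = random.Random(seed)
    best_suffix = suffix
    best_score = score_fn(prompt + best_suffix)
    for _ in range(iterations):
        candidate = _mutate(best_suffix, rng)
        score = score_fn(prompt + candidate)
        if score > best_score:
            best_suffix, best_score = candidate, score
    return best_suffix, best_score
```

In the real attack, score_fn would query the target with logprobs enabled and sum the logprobs of the desired prefix tokens.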


def _adjust_adversarial_suffix(self) -> str:
"""
Randomly changes the adversarial suffix.
Contributor


Maybe this should be more similar to the method name 😆 e.g. _change_suffix_randomly

if "Sure" in message:
breakpoint()

if self._desired_target_response_prefix in message[:len(self._desired_target_response_prefix)]:
Contributor


Should this be == or in?
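For what it's worth: because the slice has exactly the prefix's length, `in` and `==` happen to behave identically here, but `str.startswith` says what is actually meant. A quick illustration:

```python
message = "Sure, here is how..."
prefix = "Sure"

# All three are equivalent when the slice length equals len(prefix):
a = prefix in message[:len(prefix)]
b = prefix == message[:len(prefix)]
c = message.startswith(prefix)
assert a == b == c

# Plain `in` over the whole message is NOT equivalent: it matches anywhere.
assert "Sure" in "I'm not sure... Sure!"  # substring hit, not a prefix match
assert not "I'm not sure... Sure!".startswith("Sure")
```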

self._best_logprobs = sum(logprob_dict.values())
self._best_logprobs_dict = logprob_dict
self._best_adversarial_suffix = self._adversarial_suffix
elif " " + self._desired_target_response_prefix in message:
Contributor


What is this all about? It seems identical except for a leading space?!

if self._best_logprobs == -np.inf:
breakpoint()
logger.info(f"No improvement in logprobs after {self._number_of_iterations} iterations.")
return jailbreaks
Contributor


For multi-turn orchestrators we've added MultiTurnAttackResult just because every orchestrator was returning something custom and that's super annoying. Maybe we can use something similar?
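Agreed that a small structured result would be nicer than an ad-hoc list. A hypothetical sketch along the lines of MultiTurnAttackResult (the class and field names are made up for illustration, mirroring the _best_* state the orchestrator already tracks):

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AdaptiveAttackResult:
    """Illustrative result container; not PyRIT's actual class."""
    achieved_objective: bool
    best_adversarial_suffix: str
    best_logprobs: float
    best_logprobs_dict: Dict[str, float] = field(default_factory=dict)
    jailbreak_prompts: List[str] = field(default_factory=list)
```

The run method could then return something like this instead of list[PromptRequestResponse].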

return float(score_value) >= self._scorer_sensitivity


async def run(self) -> list[PromptRequestResponse]:
Contributor


It's not exactly standardized yet but we seem to be converging towards run_attack_async as the name for this thing.
