[DRAFT] Simple Adaptive Jailbreaking #537

Draft: wants to merge 7 commits into base: main

Conversation

donebydan

Description

#266
@romanlutz @rlundeen2

Tests and Documentation

TODO

Author


To be converted into a notebook for documentation.

Contributor


Should live under doc/code/orchestrators

Contributor


Or auxiliary_attacks instead of orchestrators. The more I think about it, the more I feel it's more like GCG than the other orchestrators.

Author


This prompt isn't used yet but could be later. It comes directly from the paper.

@@ -219,20 +223,28 @@ async def _complete_chat_async(self, messages: list[ChatMessageListDictContent])
response: ChatCompletion = await self._async_client.chat.completions.create(
model=self._deployment_name,
max_completion_tokens=self._max_completion_tokens,
max_tokens=self._max_tokens,
max_tokens=self._max_tokens, # TODO: this is given as NOT_GIVEN?
Author


This TODO comes from the inability to alter the prompt target's max tokens from the orchestrator. I need to test.

temperature=self._temperature,
top_p=self._top_p,
frequency_penalty=self._frequency_penalty,
presence_penalty=self._presence_penalty,
logprobs=self._logprobs,
top_logprobs=self._top_logprobs, # TODO: when called and logprobs is False, this param is not passed
Author


This TODO is a bit trickier. When logprobs is None, we shouldn't pass top_logprobs at all.
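One way to resolve both of the TODOs above is to build the keyword arguments conditionally, so that max_tokens, logprobs, and top_logprobs are only sent when they were actually set. A minimal sketch of that idea (the helper name and parameter set are hypothetical, not PyRIT's API):

```python
def build_completion_kwargs(max_tokens=None, logprobs=None, top_logprobs=None):
    """Include only the parameters that were explicitly set.

    Hypothetical helper: top_logprobs is only valid when logprobs is
    enabled, so it is dropped otherwise.
    """
    kwargs = {}
    if max_tokens is not None:
        kwargs["max_tokens"] = max_tokens
    if logprobs:
        kwargs["logprobs"] = logprobs
        if top_logprobs is not None:
            kwargs["top_logprobs"] = top_logprobs
    return kwargs
```

The resulting dict could then be splatted into chat.completions.create(**kwargs); the OpenAI SDK's NOT_GIVEN sentinel achieves the same "omit if unset" effect without building a dict by hand.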

Contributor

@romanlutz left a comment


Looking great! Thanks for putting in all this work.

Contributor


Should live under doc/code/orchestrators

@@ -121,6 +122,8 @@ def __init__(
# Original prompt id defaults to id (assumes that this is the original prompt, not a duplicate)
self.original_prompt_id = original_prompt_id or self.id

self.logprobs = logprobs
Contributor


Note for @rlundeen2 / @rdheekonda : We can create a follow-up task to add this to the DB schema

@@ -119,6 +119,7 @@ def construct_response_from_request(
response_type: PromptDataType = "text",
prompt_metadata: Optional[str] = None,
error: PromptResponseError = "none",
logprobs: Optional[dict] = None,
Contributor


Suggested change
logprobs: Optional[dict] = None,
logprobs: Optional[Dict[str, float]] = None,

Is that right?
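For context, the suggested shape is a token-to-logprob mapping, which also makes the later sum(logprob_dict.values()) straightforward. A small illustrative sketch (the pair list is a simplification of the SDK's response object, not the exact API):

```python
from typing import Dict


def to_logprob_dict(token_logprobs) -> Dict[str, float]:
    """Collapse (token, logprob) pairs into the Dict[str, float] shape.

    Caveat: duplicate tokens overwrite each other in a dict, which is
    one argument for keeping a list of pairs instead.
    """
    return {token: lp for token, lp in token_logprobs}


pairs = [("Sure", -0.1), (",", -0.5), (" here", -0.3)]
d = to_logprob_dict(pairs)
sequence_logprob = sum(d.values())  # the quantity tracked as _best_logprobs
```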

@@ -203,6 +203,7 @@ def _parse_attacker_response(self, *, response: PromptRequestResponse) -> str:
attacker_suggested_prompt = json_response["prompt"]
except (json.JSONDecodeError, KeyError):
# This forces the @pyrit_json_retry decorator to retry the function
breakpoint()
Contributor


These will have to go 😄


class SimpleAdaptiveOrchestrator(Orchestrator):
"""
This orchestrator implements the Prompt Automatic Iterative Refinement (PAIR) algorithm
Contributor


I suspect your notebook will elaborate but a paragraph on the intuition behind the algorithm wouldn't hurt 🙂
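For reference while the docstring is being fleshed out: the core loop of this kind of adaptive attack is a random search over an adversarial suffix, keeping a mutation only if it raises the score — in the paper, the log-probability that the target begins its reply with the desired prefix (e.g. "Sure"). A heavily simplified sketch with a stand-in scoring function (everything here is illustrative, not the orchestrator's real interface):

```python
import random


def _mutate(suffix, rng, alphabet="!?*&%"):
    """Replace one randomly chosen character of the suffix."""
    i = rng.randrange(len(suffix))
    return suffix[:i] + rng.choice(alphabet) + suffix[i + 1:]


def random_search_attack(prompt, suffix, score_fn, iterations=100, seed=0):
    """Greedy random search: keep a suffix mutation only if it improves
    the score (a stand-in for the summed logprob of the desired
    target-response prefix)."""
    rng = random.Random(seed)
    best_suffix = suffix
    best_score = score_fn(prompt + best_suffix)
    for _ in range(iterations):
        candidate = _mutate(best_suffix, rng)
        score = score_fn(prompt + candidate)
        if score > best_score:
            best_suffix, best_score = candidate, score
    return best_suffix, best_score
```

In the real attack, score_fn would query the target with logprobs enabled and sum the logprobs of the desired prefix tokens.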


def _adjust_adversarial_suffix(self) -> str:
"""
Randomly changes the adversarial suffix.
Contributor


Maybe this should be more similar to the method name 😆 e.g. _change_suffix_randomly

if "Sure" in message:
breakpoint()

if self._desired_target_response_prefix in message[:len(self._desired_target_response_prefix)]:
Contributor


Should this be == or in?
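For what it's worth: because the slice has exactly the prefix's length, `in` and `==` happen to behave identically here, but `str.startswith` says what is actually meant. A quick illustration:

```python
message = "Sure, here is how..."
prefix = "Sure"

# All three are equivalent when the slice length equals len(prefix):
a = prefix in message[:len(prefix)]
b = prefix == message[:len(prefix)]
c = message.startswith(prefix)
assert a == b == c

# Plain `in` over the whole message is NOT equivalent: it matches anywhere.
assert "Sure" in "I'm not sure... Sure!"  # substring hit, not a prefix match
assert not "I'm not sure... Sure!".startswith("Sure")
```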

self._best_logprobs = sum(logprob_dict.values())
self._best_logprobs_dict = logprob_dict
self._best_adversarial_suffix = self._adversarial_suffix
elif " " + self._desired_target_response_prefix in message:
Contributor


What is this all about? It seems identical except for a leading space?!

if self._best_logprobs == -np.inf:
breakpoint()
logger.info(f"No improvement in logprobs after {self._number_of_iterations} iterations.")
return jailbreaks
Contributor


For multi-turn orchestrators we've added MultiTurnAttackResult just because every orchestrator was returning something custom and that's super annoying. Maybe we can use something similar?
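Agreed that a small structured result would be nicer than an ad-hoc list. A hypothetical sketch along the lines of MultiTurnAttackResult (the class and field names are made up for illustration, mirroring the _best_* state the orchestrator already tracks):

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AdaptiveAttackResult:
    """Illustrative result container; not PyRIT's actual class."""
    achieved_objective: bool
    best_adversarial_suffix: str
    best_logprobs: float
    best_logprobs_dict: Dict[str, float] = field(default_factory=dict)
    jailbreak_prompts: List[str] = field(default_factory=list)
```

The run method could then return something like this instead of list[PromptRequestResponse].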

return float(score_value) >= self._scorer_sensitivity


async def run(self) -> list[PromptRequestResponse]:
Contributor


It's not exactly standardized yet but we seem to be converging towards run_attack_async as the name for this thing.
